Multi-view community detection with heterogeneous information from social media data
Introduction
Social networking sites such as MySpace, Facebook, or Twitter attract millions of users, who publish an enormous amount of content every day in the form of pictures, tweets, comments and posts. Social networks can be defined as a set of socially-relevant nodes connected by one or more relations. Nodes in such networks are not limited to people, but can also represent other entities such as Web pages, journal articles or geographical places, amongst other possibilities. Users of networking sites are required to create profiles in which they can describe themselves by sharing their age, location, interests and picture, amongst other things. Generally, social networks allow users to create and read content, and to establish social connections with other users; the nature and semantics of these connections might differ from site to site, as with followee relations in Twitter or friendship relations in Facebook. Although the technological features of the different social networking sites are similar, the cultures that emerge around them are diverse [3]. Most sites encourage the maintenance of pre-existing social networks, whilst others help strangers to create new connections based on shared interests. In this context, understanding users’ needs arises as a critical issue [9]. Users’ needs could be regarded as users’ desire to obtain information, which could be further specified as long-term (interests) or instant (intents) needs. Nonetheless, needs are often latent, so inferring them from the observed data might be challenging.
Social networks affect the way people communicate and interact. The pervasive use of social media offers research opportunities for analysing the behaviour of users when interacting with their friends [32], and how such interactions evolve over time [43], in terms of patterns of appearing and disappearing relationships. Unlike social connections formed by people in the physical world, social media users have greater freedom to connect with a wider spectrum of people for distinct reasons. The low cost of link formation might lead to networks with relationships of heterogeneous nature, origin and strength. For example, in Twitter, a user might follow others because they publish interesting information, they have the same interests, they are celebrities or popular individuals in the micro-blogging community, or only because they share some common friends, amongst other possible explanations. As a result, topological relations could include casual links, which could hinder the use of algorithms based solely on topology. Hence, the nature of structural information must be carefully analysed in conjunction with other sources of information or data views to effectively assess the significance and importance of relations. In addition to social information indicating friendship or simpler user interaction, there are other information sources that might implicitly define connections between users in social media, for example, whether two users use the same terms or hashtags, or post on the same topics. It is worth noting that the content users consume or post might depend, for example, on their mood and environment [9]. Given that users’ needs are implicit, comprehensive research is needed to discover the mapping between the heterogeneous, and possibly multimedia, information in social networks and users’ needs, and how such a mapping can be enriched with contextual information.
One fundamental problem in social networks is the identification of groups of users when group membership is not explicitly available. A group, or community, can be defined as a set of elements (users, posts or other elements) that interact more frequently or are more similar to other community members than to outsiders. Community detection has proven to be valuable in diverse domains such as biology, social sciences and bibliometrics. For example, community detection techniques can be used for identifying groups of users with similar purchase history enabling the creation of more efficient recommendation systems that could better guide customers and enhance business opportunities as in Amazon [16], for detecting topics in collaborative systems [25], for identifying real-world landmarks in Flickr by clustering photos [26], for detecting events on Twitter streams [1], for matching high-quality answers to questions in the context of a question answering system [11], or for solving the influence maximisation problem in Foursquare [19].
Several techniques for community detection can be found in the literature. However, most of them focus on only one data view, even though neither social relations nor content by themselves can accurately indicate community membership. For example, in Twitter social relations might be extremely sparse, and two users might belong to the same community even if they are not explicitly socially related. Conversely, social media content might be too topically diverse and noisy to yield valuable topic-based relationships. Combining multiple data views, as required by social media data, poses new challenges, for instance, how to integrate the different views by adequately assessing their importance in the social network, or how to determine whether such integration could actually improve the quality of the detected communities.
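To illustrate what integrating a social view and a content view might look like, the sketch below builds a single weighted graph from explicit social links and from the textual similarity of users' posts. This is not the technique proposed in the paper: the linear weighting with `alpha`, the bag-of-words cosine similarity, the function names and the toy data are all assumptions chosen for brevity.

```python
import math
from collections import Counter
from typing import Dict, Iterable, Tuple


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def combine_views(
    social_edges: Iterable[Tuple[str, str]],
    posts: Dict[str, str],
    alpha: float = 0.5,
) -> Dict[Tuple[str, str], float]:
    """Weight each user pair by alpha * social link + (1 - alpha) * content similarity.

    Hypothetical combination rule: social links are treated as undirected
    and binary, and content is a whitespace-tokenised bag of words.
    """
    users = sorted(posts)
    bags = {u: Counter(posts[u].lower().split()) for u in users}
    social = {frozenset(e) for e in social_edges}

    weights: Dict[Tuple[str, str], float] = {}
    for i, u in enumerate(users):
        for v in users[i + 1:]:
            social_score = 1.0 if frozenset((u, v)) in social else 0.0
            content_score = cosine(bags[u], bags[v])
            w = alpha * social_score + (1 - alpha) * content_score
            if w > 0:
                weights[(u, v)] = w  # keep only pairs related in some view
    return weights


# Toy data: ana and bob are not socially linked but write about the same topic;
# ana and cat are linked but share no vocabulary.
toy_posts = {
    "ana": "graph mining rocks",
    "bob": "graph mining tools",
    "cat": "football tonight",
}
weights = combine_views([("ana", "cat")], toy_posts, alpha=0.5)
```

In this toy example the content view adds an ana-bob edge that the social view alone would miss, while the casual ana-cat link keeps only the weight contributed by the social view. How the views should actually be weighted is precisely the open question raised in the paragraph above.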
Considering the increasing amount of information available in social networks and the necessity of integrating heterogeneous data, this paper focuses on the needs and challenges of combining multiple information sources for performing community detection. This work studies how to integrate multiple social and content-based views or information sources with the aim of improving the quality of the detected communities. The final goal of the paper is to provide some insights on how to select the relevant views for the task at hand, according to the characteristics of the network under analysis. It is worth noting that the selection of the views to integrate depends on the elements available in the social network under analysis, such as the characteristics and semantics of social relations, the semantics of the messages users exchange, or the content of such messages, amongst others. Moreover, several alternatives are proposed for integrating the semantics conveyed by the edge directionality embedded in the selected views. Finally, an extensive experimental evaluation of the benefits of combining the different views on diverse social networking sites is performed.
The rest of this paper is organised as follows. Section 2 discusses related research. Section 3 defines the nature of the diverse views to consider in the analysis, and a technique for combining them, as well as exploiting the semantics of edge directionality. Section 4 describes the experimental evaluation performed over real-world datasets. Finally, Section 5 summarises the conclusions drawn from this study and presents future lines of work.
Related work
Generally, social networks are analysed by means of graphs, representing a group of nodes or vertices, which are connected by links or edges. Edges can be directed (as the Followee/Follower relation on Twitter) or undirected (as the friendship relation on Facebook). Communities refer to potentially overlapping groups of nodes that have dense connections within the community, but sparse connections with nodes of other communities. Communities can be defined globally or locally, depending on
Community detection based on heterogeneous social information
The first step in applying a community detection algorithm is to define the information that will be available to the algorithm, i.e. the information on which the underlying graph structure will be built. When analysing social media, multiple and diverse graphs can be defined. Nodes can represent not only real people, but also diverse entities such as Web pages, journal articles, countries, neighbourhoods, or positions, amongst others [21]. For example, if the goal of the community
Experimental evaluation
This section presents the experimental evaluation performed to assess the effectiveness of the proposed alternatives for leveraging the heterogeneous information provided by social media data, and is organised as follows. Section 4.1 presents the data collections used for evaluating the effectiveness of the presented technique. Section 4.2 presents implementation details and the metrics used for evaluating the different alternatives. Finally, Section 4.4 presents the results derived from the
Conclusions
This work aimed at integrating multiple information sources for performing community detection in social networks. The proposed technique tackled the problem of how to combine several information sources for effectively finding high-quality community partitions. Moreover, it proposed several alternatives for adequately considering the semantics conveyed by directed relations.
Experimental evaluation conducted on two real-world social media datasets demonstrated that the different information
Antonela Tommasel has been a member of ISISTAN Research Institute (CONICET-UNICEN) since 2011. She received her Bachelor's degree in Software Engineering at UNICEN University (Argentina) in November 2012 and a PhD degree in Computer Science at the same institution in December 2017. She is also a teaching assistant at the same university. Her research interests include recommender systems, text mining, user modelling and social web.
References (45)
- et al. Community detection in social media by leveraging interactions and intensities. WISE (2) (2013)
- et al. Integrating social media data for community detection. Proceedings of the LNAI (2012)
- et al. Combining link and content for community detection: a discriminative approach. Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’09 (2009)
- et al. Expert finding for question answering via graph regularized matrix completion. IEEE Trans. Knowl. Data Eng. (2015)
- et al. Event detection in social streams. Proceedings of the 2012 SIAM International Conference on Data Mining (2012)
- et al. Fast unfolding of communities in large networks. J. Stat. Mech. Theory Exper. (2008)
- et al. Social network sites: definition, history, and scholarship. J. Comput. Med. Commun. (2007)
- et al. Bigraphs versus digraphs via matrices. J. Graph Theory (1980)
- et al. Nus-wide: a real-world web image database from national university of singapore. Proceedings of the ACM International Conference on Image and Video Retrieval (2009)
- et al. Finding community structure in very large networks. Phys. Rev. E (2004)
- A framework for joint community detection across multiple related networks. Neurocomputing
- Nonparametric Statistics for Non-Statisticians: A Step-by-Step Approach
- Social-sensed multimedia computing. IEEE MultiMedia
- Measuring quality of similarity functions in approximate data matching. J. Inform.
- Community-based question answering via heterogeneous social network learning
- Community detection in graphs. Phys. Rep.
- The strength of weak ties. Am. J. Sociol.
- Module identification in bipartite and directed networks. Phys. Rev. E
- A Graph Based Approach to Extract a Neighborhood Customer Community for Collaborative Filtering
- Faving reciprocity in content sharing communities: a comparative analysis of flickr and twitter
- Empirical comparison of algorithms for network community detection. Proceedings of the 19th International Conference on World Wide Web
- Understanding Community Effects on Information Diffusion
Daniela Godoy is a researcher at CONICET and a member of ISISTAN Research Institute, Tandil, Argentina. She is also a full-time professor in the Department of Computer Science at UNCPBA, Tandil, Argentina. She obtained her Master’s degree in Systems Engineering (2001) and her PhD in Computer Science (2005) at the same university. Her research interests include intelligent agents, user profiling and text mining.