Community detection using hierarchical clustering based on edge-weighted similarity in cloud environment

https://doi.org/10.1016/j.ipm.2018.10.004

Abstract

Social networks have recently attracted growing attention. Accurate community detection in social networks can support better product design, more accurate information recommendation, and better public services. This paper therefore proposes a community detection (CD) algorithm based on network topology and user interests. The paper consists of two parts. In the first part, a focused crawler algorithm is used to acquire personal tags from the tags posted by other users. Tags are then selected from the tag set based on the TFIDF weighting scheme, the semantic extension of tags, and the user semantic model. In addition, the tag vector of user interests is derived, with each tag weight calculated by the improved PageRank algorithm. In the second part, to detect communities, an initial social network consisting of directed, unweighted edges and vertices with interest vectors is constructed from the following/follower relationships. The initial social network is then converted into a new social network with undirected, weighted edges. The edge weights are calculated from the edge directions and the interest vectors in the initial social network, and the similarity between edges is calculated from the edge weights. Communities are detected by a hierarchical clustering algorithm based on this edge-weighted similarity. Finally, the number of detected communities is determined by the partition density. Extensive experiments show that the proposed user interest detection (PUID) algorithm performs better than the CF and TFIDF algorithms with respect to F-measure, Precision, and Recall. Moreover, the Precision of the proposed community detection (PCD) algorithm is improved, on average, by up to 8.21% compared with the Newman algorithm and by up to 41.17% compared with the CPM algorithm.
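The partition density used to choose the number of communities can be illustrated with a short sketch. It assumes the standard unweighted definition from Ahn et al. (2010), D = (2/M) Σ_c m_c (m_c − (n_c − 1)) / ((n_c − 2)(n_c − 1)), where M is the total number of edges and community c has m_c edges touching n_c vertices. Whether the paper uses exactly this form or a weighted variant is not stated in the abstract, and the function name `partition_density` is hypothetical.

```python
from typing import List, Tuple

Edge = Tuple[str, str]


def partition_density(communities: List[List[Edge]]) -> float:
    """Partition density of an edge partition (Ahn et al., 2010 definition).

    Each community is a list of undirected edges. This is an illustrative
    sketch, not the paper's implementation.
    """
    total_edges = sum(len(edges) for edges in communities)
    if total_edges == 0:
        return 0.0

    score = 0.0
    for edges in communities:
        m_c = len(edges)                              # edges in the community
        n_c = len({v for edge in edges for v in edge})  # vertices they touch
        if n_c <= 2:
            continue  # a single edge contributes zero density by convention
        score += m_c * (m_c - (n_c - 1)) / ((n_c - 2) * (n_c - 1))
    return 2.0 * score / total_edges
```

In a hierarchical clustering of edges, this value would typically be evaluated at each level of the dendrogram, and the cut with the highest density kept. A community that is a clique contributes its full edge share, while a tree-like community contributes nothing.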

Introduction

In recent years, social networks have helped users not only to expand their interpersonal circles but also to obtain the latest information. A community consists of users with common interests or common characteristics. These common interests and characteristics play an important role in public opinion investigation, product recommendation, public sentiment analysis, information flow control, and other fields. However, as the number of users grows, the “information overload” problem becomes more pronounced. There is an urgent need to identify communities in social networks in order to provide better product designs, accurate information recommendation, and public services (Buccafurri et al., 2014, Buccafurri et al., 2015). How to detect communities in social networks has therefore become a challenging problem.

Some researchers have addressed this problem by considering user interests. User content can effectively reflect user opinions and interests. However, because the content is sparse and lacks contextual information, user interest detection (UID) faces new challenges. Some researchers detected user interests based on the text features and the semantic information of the content (Liu et al., 2010, Tang et al., 2012), but they need an external database to enrich the semantic information. Others tried to extract the most representative words from a short text to realize UID (Sun, 2012). In addition, some researchers focused on hierarchical clustering algorithms for social networks (Kim and Shim, 2013, Zhang et al., 2015). However, few researchers have discussed community detection (CD) that considers both the interests of users and the interaction behaviors between them.

To address the problems mentioned above, a UID algorithm based on semantic information and an improved PageRank algorithm is discussed. Moreover, a CD algorithm that considers network topology and user interests is proposed. The main contributions of this paper are summarized as follows:

  • A UID algorithm based on semantic information and an improved PageRank algorithm is presented. The PageRank algorithm is improved to calculate the tag weights for target users.

  • A new social network with the undirected and weighted edges is built, whose edge weights are calculated by considering both the user interests and the direction of edges in the initial social network for identifying the overlapping communities.

  • The performance of the PCD algorithm and of several typical existing CD algorithms is evaluated via extensive experiments. The results indicate that the PCD algorithm can effectively detect communities in social networks and can also identify overlapping communities.
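The second contribution, converting the directed follower network into an undirected weighted one, can be sketched as follows. The paper's actual weight formula is not given in this text, so the rule below is purely an assumption for illustration: the weight is the cosine similarity of the two users' interest vectors, scaled by 1.0 for a mutual follow and by a hypothetical `one_way_factor` of 0.5 for a one-way follow. The function names are also hypothetical.

```python
import math
from typing import Dict, Set, Tuple

Interest = Dict[str, float]  # tag -> weight


def cosine(a: Interest, b: Interest) -> float:
    """Cosine similarity between two sparse tag-weight vectors."""
    dot = sum(w * b.get(tag, 0.0) for tag, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def to_undirected_weighted(
    follows: Set[Tuple[str, str]],
    interests: Dict[str, Interest],
    one_way_factor: float = 0.5,  # assumed penalty, not taken from the paper
) -> Dict[Tuple[str, str], float]:
    """Collapse directed follow edges (u follows v) into weighted undirected edges."""
    weights: Dict[Tuple[str, str], float] = {}
    for u, v in follows:
        if u == v:
            continue  # ignore self-follows
        key = (u, v) if u < v else (v, u)  # canonical undirected key
        if key in weights:
            continue  # the reverse direction was already processed
        mutual = (v, u) in follows
        factor = 1.0 if mutual else one_way_factor
        weights[key] = factor * cosine(interests.get(u, {}), interests.get(v, {}))
    return weights
```

Under this assumed rule, two users who follow each other and share the same tags receive the maximum weight of 1.0, while a one-way follow between users with partly overlapping interests receives a smaller weight. The resulting edge weights are what an edge-similarity measure for hierarchical clustering would then consume.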

The rest of this paper is organized as follows: In Section 2, several related works are presented. In Section 3, our methodology for CD is described systematically. In Section 4, the datasets, evaluation methodology and experimental procedure are introduced, respectively. The experimental results and discussion are presented in Section 5. Lastly, the conclusion is discussed in Section 6.

Section snippets

Related works

With the development of social networks, much work has recently been conducted on UID and CD problems. This section briefly describes that work.

Methodology

In this section, a method to detect user interests based on the semantic information and the improved PageRank algorithm is proposed. Furthermore, a method to detect communities based on the network topology and the user interests is presented.
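As background for the tag-weighting step, the following is a minimal sketch of the standard PageRank power iteration. It is not the improved variant proposed in the paper, whose modifications are not described in this text. Applying it to a graph whose vertices are tags, as well as the damping value 0.85 and the function name `pagerank`, are assumptions made for illustration.

```python
from typing import Dict, List


def pagerank(
    links: Dict[str, List[str]],
    damping: float = 0.85,
    max_iter: int = 100,
    tol: float = 1e-10,
) -> Dict[str, float]:
    """Standard PageRank by power iteration on a directed graph.

    `links` maps each node to the nodes it points to. Nodes without
    outgoing links spread their rank uniformly over all nodes.
    """
    nodes = set(links) | {t for targets in links.values() for t in targets}
    n = len(nodes)
    if n == 0:
        return {}

    rank = {v: 1.0 / n for v in nodes}
    for _ in range(max_iter):
        # Rank held by dangling nodes is redistributed evenly.
        dangling = sum(rank[v] for v in nodes if not links.get(v))
        new_rank = {v: (1.0 - damping) / n + damping * dangling / n for v in nodes}
        for v, targets in links.items():
            if targets:
                share = damping * rank[v] / len(targets)
                for t in targets:
                    new_rank[t] += share
        delta = sum(abs(new_rank[v] - rank[v]) for v in nodes)
        rank = new_rank
        if delta < tol:
            break
    return rank
```

The returned scores sum to one, so they can be read directly as relative weights; a node that many others point to ends up with a larger share than a node with no incoming links.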

Datasets

In the experiments, all data are derived from Sina microblog (https://weibo.com/). A focused crawler algorithm is developed to collect user tags. The experimental data set is text data covering user IDs, tags, followers, critics, and so on for 100,000 users. An experiment on 200 users selected randomly from the data set is carried out to demonstrate the effectiveness of the proposed algorithm. The size of the candidate tag set is 60. The user semantic model is constructed by selecting the top 500 topics

Methodology and metrics

The experiments in this section mainly include two parts. In the first part, the quality of users is discussed through verification experiments. Statistics on the number of users in the data sets are obtained by the statistical methodology (Li et al., 2018, Romeo et al., 2017, Serban et al., 2018) to illustrate the sparsity of keywords and personal tags. In the second part, to illustrate the feasibility of the Hadoop platform, the results of PCD on the Hadoop platform are presented. Furthermore, the execution

Conclusion and future work

In this paper, a CD algorithm based on network topology and user interests is presented. The paper consists of two parts. In the first part, a focused crawler algorithm is used to acquire personal tags from the tags posted by other users. Tags are then selected from the tag set based on the TFIDF weighting scheme, the semantic extension of tags, and the user semantic model. In addition, the tag vector of user interests is derived with the respective tag weight calculated by the

Acknowledgments

The work was supported by the National Natural Science Foundation (NSF) under grants (No. 61672397, No. 61873341, No. 61472294), Application Foundation Frontier Project of WuHan (No. 2018010401011290), Open Foundation of State Key Laboratory of Smart Manufacturing for Special Vehicles and Transmission System (GZ2018KF002), and Open Foundation of Key Laboratory of Industrial Wireless Network and Networked Control of Ministry of Education. Any opinions, findings, and conclusions are those of the

References (43)

  • B. Tian et al. Community detection method based on mixed-norm sparse subspace clustering. Neurocomputing (2018)
  • T. Wang et al. A community detection method based on local similarity and degree clustering information. Physica A: Statistical Mechanics and its Applications (2018)
  • J. Xiang et al. Phase transition of surprise optimization in community detection. Physica A: Statistical Mechanics and its Applications (2018)
  • Z. Yu et al. Friend recommendation with content spread enhancement in social networks. Information Sciences (2015)
  • X. Zhou et al. Real-time recommendation for microblogs. Information Sciences (2014)
  • Y. Ahn et al. Link communities reveal multiscale complexity in networks. Nature (2010)
  • J. Beel et al. Research-paper recommender systems: A literature survey. International Journal on Digital Libraries (2015)
  • Buccafurri, F., Lax, G., Nicolazzo, S., & Nocera, A. (2014). A model to support multi-social-network applications....
  • K. Chen et al. Collaborative personalized tweet recommendation
  • S. Fortunato et al. Resolution limit in community detection. Proceedings of the National Academy of Sciences of the United States of America (2007)
  • M. Girvan et al. Community structure in social and biological networks. Proceedings of the National Academy of Sciences of the United States of America (2002)
