Elsevier

Information Sciences

Volume 184, Issue 1, 1 February 2012, Pages 215-229

Subject-based extraction of a latent blog community

https://doi.org/10.1016/j.ins.2011.08.004

Abstract

In the blogosphere, there exist posts relevant to a particular subject and blogs that show interest in the subject. In this paper, we define a set of such posts and blogs as a blog community and propose a method for extracting the blog community associated with a particular subject. The proposed method is based on the idea that the blogs that have performed actions (e.g., read, comment, trackback, scrap) on the posts of a particular subject are the ones with interest in the subject, and that the posts that have received actions from such blogs are the ones that contain the subject. The proposed method starts with a small number of manually selected seed posts containing the subject. Then, the method selects the blogs whose number of actions on the seed posts exceeds some threshold and the posts whose number of received actions exceeds some threshold. Repeating these two steps gradually expands the blog community. This paper presents various techniques to improve the accuracy of the proposed method. The experimental results show that the proposed method exhibits a higher level of accuracy than the methods proposed in prior research. This paper also discusses business applications of the extracted community, such as target marketing, market monitoring, improving search results, finding power bloggers, and revitalization of the blogosphere.

Introduction

According to Wellman, communities are networks of interpersonal ties that provide sociability, support, information, a sense of belonging, and social identity [25]. Traditionally, communities were equated with neighborhoods bounded by people living near each other, but with the advance of transportation, communication technology, and global online connectivity in the late 20th century, communities are becoming defined more socially and less spatially. That is, communities are no longer bound by neighborhoods. One may find many such communities on the Internet [25]. Online cafes, blogs, chat rooms, portal-operated bulletin boards, support groups, Internet forums, instant messaging buddy lists, MMORPGs, and corporate websites that foster customer participation are just a sample of online communities [13], [23].

As most online communities operate under membership, the boundary of a community is explicit [17]. But the definition of a community can be broadened to a set of interconnected people who share a common interest in a particular subject even when they have not explicitly disclosed their intent of participation or affiliation. In this paper, such an implicit community is defined as a latent community, and we look for latent communities in a blogosphere.

The blogosphere is a primary example of online social networks where latent communities can be found. In a blogosphere, a blogger is often provided with a functionality that enables him/her to keep track of the blogs of interest to him/her, which makes it easy and convenient to visit those blogs. Such functionality is called bookmark or neighbor. A blogger performs actions on a post in someone else’s blog, such as read, comment, trackback, and scrap. Read and comment are reading and putting comments on someone else’s post, respectively. Trackback is writing a new post related to someone else’s post while putting a link to the original post in one’s own blog. Unique to blog sites in Korea, scrap is an action of copying someone else’s post to one’s own blog. The difference is that trackback creates a new post with a link to the original post, whereas scrap copies someone else’s post into the blogger’s own blog. Trackback and scrap actions can post a copy of the original post. The reproduced post may, in turn, trigger others to read, comment, trackback, and scrap.

In a blogosphere, bloggers express their interests through actions, such as trackbacks and comments. Therefore, a set of posts with content related to a particular subject, together with a set of bloggers who have shown interest in that subject through their actions on those posts, may be considered a latent community. In this paper, we define the latent community in a blogosphere as a blog community, and the bloggers and posts that belong to the blog community as community members and community posts, respectively.

Since latent communities do not have formal membership constraints, it is difficult to determine the boundary of a latent community or to find the members who belong to it [15]. Most prior research on extracting latent online communities associated with a particular subject has targeted web communities [3], [4], [19], [20]. A web community consists of a set of interconnected web pages, and the topology formed by hyperlinks between web pages is exploited to find a latent web community. Compared to the web, which has a single type of object called web pages, a blogosphere includes two types of objects, bloggers and posts, which makes it difficult to apply the existing methods of extracting web communities directly to extract blog communities. This paper proposes a new method to extract a latent blog community associated with a particular subject.

The proposed method extracts a subject-based blog community by gradually expanding the set of community members and community posts. At first, it selects a small number of seed posts that contain the subject. We provide guidelines for selecting quality seed posts, such as subject relevance, popularity, and quality of information content. The method then selects as community members the bloggers whose number of actions on the seed posts exceeds some threshold. Then, the posts whose number of actions received from the community members exceeds some threshold are selected as community posts. Repeating these two steps, it expands the blog community. At each step, the number of actions the currently selected community members have performed and the number of actions the currently selected community posts have received are used as the threshold for selecting community members and community posts, respectively. The iterative extraction process is terminated when the number of community posts approaches the estimated total number of subject-related posts in the blogosphere.
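
As a rough illustration of the two alternating steps, the following Python sketch expands a community from a set of seed posts. It is a minimal sketch under stated assumptions, not the algorithm of Section 3: the function name expand_community, the representation of actions as (blogger, post) pairs, and the fixed thresholds are all hypothetical. The method described above instead derives its thresholds at each step from the action counts of the currently selected members and posts, and target_size merely stands in for the estimated total number of subject-related posts.

```python
from collections import Counter


def expand_community(actions, seed_posts, member_threshold, post_threshold,
                     target_size, max_rounds=50):
    """Grow a community from seed posts (illustrative sketch only).

    actions: list of (blogger, post) pairs, one pair per action.
    The fixed thresholds and target_size are assumptions of this sketch.
    """
    posts = set(seed_posts)
    members = set()
    for _ in range(max_rounds):
        before = (len(members), len(posts))
        # Step 1: bloggers whose actions on community posts exceed the threshold.
        per_blogger = Counter(b for b, p in actions if p in posts)
        members |= {b for b, n in per_blogger.items() if n > member_threshold}
        # Step 2: posts whose actions received from members exceed the threshold.
        per_post = Counter(p for b, p in actions if b in members)
        posts |= {p for p, n in per_post.items() if n > post_threshold}
        # Stop when nothing changed or the community reached the target size.
        if (len(members), len(posts)) == before or len(posts) >= target_size:
            break
    return members, posts
```

In this sketch, members and posts only ever grow, which matches the idea of gradual expansion; whether previously selected items can later be dropped is not stated in this section.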

We propose three techniques to improve the accuracy of the original method. First, we propose to extract a community based on posts and folders, not based on posts and bloggers. Since bloggers tend to categorize their posts into folders based on subjects, extraction using folders (instead of bloggers) improves the accuracy of the extracted community. Second, the original method uses the selection threshold based on the number of actions, which may result in finding posts or bloggers that are relevant to multiple subjects. To improve accuracy, we propose to use not only the number of actions, but also the purity of actions. Action purity is defined in detail in Section 3. Third, popular posts tend to receive many actions from bloggers who are not interested in the particular subject, which may, in turn, cause posts and bloggers that are not related to the particular subject to be included in the extracted community. Therefore, we propose that the posts that have received as many actions as popular posts be temporarily excluded during the community extraction process.
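
The second and third techniques can be pictured with the Python sketch below. Action purity is defined in Section 3, which is not reproduced here, so the ratio used in the sketch (the share of a blogger's actions that fall on community posts) is only an assumed stand-in for the actual definition. The names action_purity, split_popular, and popularity_cutoff are likewise hypothetical.

```python
from collections import Counter


def action_purity(actions, community_posts):
    """Assumed stand-in for purity: the fraction of each blogger's actions
    that land on community posts. This is not the Section 3 definition."""
    total = Counter(b for b, _ in actions)
    inside = Counter(b for b, p in actions if p in community_posts)
    return {b: inside[b] / total[b] for b in total}


def split_popular(actions, popularity_cutoff):
    """Set aside posts that received at least popularity_cutoff actions.

    Returns the actions on the remaining posts and the set of posts
    that were temporarily excluded.
    """
    received = Counter(p for _, p in actions)
    popular = {p for p, n in received.items() if n >= popularity_cutoff}
    kept = [(b, p) for b, p in actions if p not in popular]
    return kept, popular
```

A blogger with many actions but low purity under this assumed ratio spreads attention over many subjects, which is the case the second technique is meant to filter out.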

In this paper, we use real-world blog data in experiments. First, we compared the proposed method with the existing extraction methods adopted for blog-community extraction. The proposed method shows 85% accuracy on average, a higher level of accuracy than existing methods. Second, we compared the proposed method with manual extraction by domain experts. The accuracy of the community extracted by the proposed method is only 5% lower than that of manual extraction. This implies that the proposed method can achieve accuracy similar to that of manual extraction with much less effort. In addition, a set of experiments was conducted to analyze the proposed method in detail.

The latent blog community of a particular subject, once found, can be used in many business applications. First, a blogger who has an interest in the subject may discover the latest information from community posts that he/she was not aware of before [15]. Second, a business may use community members in target marketing. For example, an automobile company may advertise its brand new car to the members of an “automobile” blog community, which results in effective advertising at minimal cost. Third, the extraction of latent communities and recognition of their existence may encourage more participation of bloggers and thus contribute to the revitalization of the entire blogosphere.

The rest of the paper is organized as follows. Section 2 describes the related work. Section 3 describes the proposed method, the detailed algorithm, and the ways to improve the accuracy of the proposed method. Section 4 analyzes the accuracy of the proposed method through extensive experiments. Section 5 discusses the business applications of the proposed method. Section 6 concludes the paper and suggests further research directions.

Related work

Most prior research on finding a latent online community has focused on exploiting the topology of hyperlinks between web pages. As blogs have become more popular, some new approaches have been proposed for extracting blog communities. Section 2.1 describes the research on extracting web communities, and Section 2.2 discusses the research on extracting blog communities.

Proposed method

In this section, we explain the method for extracting a blog community with a given subject. Section 3.1 describes the proposed method, and Section 3.2 discusses the techniques for improving the accuracy of the proposed method.

Experimental setup

For experimental analysis, we used anonymized data collected from blog.naver.com, one of the largest blogospheres in Korea, for 8 months starting from April 2006. Since the posts with few actions were not likely to exceed the threshold value during the extraction process and thus not likely to be included in the final community, the posts that had received less than or equal to 3 actions

Business applications

The proposed method finds a hidden community composed of the bloggers and posts that are relevant to a particular subject. It can also rank the bloggers and posts based on the number of actions, the degree of action purity, or some combination of the two. In this section, we explore the business applications that utilize the extracted community or the ranking of bloggers and posts.

Conclusions

In the blogosphere, there exist a set of posts relevant to a particular subject and a set of bloggers that show interest in the subject. In this paper, we have defined such a latent community as a blog community and have discussed a method to extract the blog community associated with a particular subject from a blogosphere.

We performed a set of experiments to verify the effectiveness of the proposed method. The experimental results show that the proposed method exhibits a higher level of

Acknowledgements

This work was supported by NHN Corp. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect those of the sponsor. This work was also supported by the Mid-Career Researcher Program through the NRF (National Research Foundation) Grant funded by the MEST (Ministry of Education, Science, and Technology) (Grant No. 2008-0061006) and the MKE (The Ministry of Knowledge Economy), Korea, under the ‘National HRD Support Program

References (26)

  • S. Brin, L. Page, The anatomy of a large-scale hypertextual web search engine, in: Proceedings of the 7th International...
  • A. Chin, M. Chignell, A social hypertext model for finding community in blogs, in: Proceedings of the 7th Conference on...
  • P.A. Chirita, D. Olmedilla, W. Nejdl, Finding related pages using the link structure of the WWW, in: Proceedings of the...
  • J. Dean, M.R. Henzinger, Finding related pages in the World Wide Web, in: Proceedings of the 8th International...
  • G.W. Flake, S. Lawrence, C.L. Giles, Efficient identification of web communities, in: Proceedings of the 6th ACM SIGKDD...
  • G.W. Flake et al., Self-organization of the web and identification of communities, IEEE Computer (2002)
  • D. Gibson, J.M. Kleinberg, P. Raghavan, Inferring web communities from link topology, in: Proceedings of the 9th ACM...
  • M. Girvan, M. Newman, Community structure in social and biological networks, in: Proceedings of the National Academy of...
  • G. Greco et al., Web communities: models and algorithms, World Wide Web: Internet and Web Information Systems (2004)
  • T. Haveliwala, Topic-sensitive pagerank: a context-sensitive ranking algorithm for web search, IEEE Transactions on Knowledge and Data Engineering (2003)
  • N. Imafuji, M. Kitsuregawa, Effects of maximum flow algorithm on identifying web community, in: Proceedings of the 4th...
  • K. Ishida, Extracting latent weblog communities – A partitioning algorithm for bipartite graphs, in: Proceedings of the...
  • C. James et al., The socialtrust framework for trusted social information management: architecture and algorithms, Information Sciences (2010)