1 Introduction

In recent years there has been an ever-growing usage of social media: web-based platforms that allow the easy creation and exchange of user-generated content. Social media can be centered around video (e.g., YouTube, Vimeo), audio (e.g., Last.fm, MySpace), pictures (e.g., Flickr, Picasa), other media types (bookmarks, books, etc.), and people (e.g., Facebook, Friendster). However, one of the most popular media types is still text. Textual social media come in various forms, each with its own characteristics and users: examples are (micro)blogs, forums, mailing lists, reviews, and comments. In this paper we look at one particular type of social media, blogs. Blogging platforms allow people (bloggers) to write diary entries (blog posts) about topics of their choice.

The growing amount of social media content available online creates new challenges for the information retrieval (IR) community, in terms of search and analysis tasks for this type of content (Weerkamp 2011). One of the main challenges lies in the fact that creators of social media content, the bloggers, are given a large degree of freedom: operating without top-down editorial rules and editors, they produce blog posts of hugely varying quality. Some posts are edited and news-article-like, whereas others are of very low quality. The quality of a blog post may have an impact on its suitability for being returned in response to a query.

Although some approaches to blog post retrieval (finding blog posts that are relevant to a given topic) use indirect quality measures like elaborate spam filtering (Java et al. 2007) or counting inlinks (Mishne 2007b), few systems turn the credibility (Metzger 2007) of blog posts into an aspect that can benefit the retrieval process. Our hypothesis is that we can use credibility-inspired indicators to improve topical blog post retrieval. In this paper we explore the impact of these credibility-inspired indicators on the task of blog post retrieval.

To make matters concrete, consider Fig. 1: both (blog) posts are relevant to the query “tennis,” but based on obvious surface level features of the posts we quickly determine Post 2 to be more credible than Post 1. The most obvious features are spelling errors, the lack of leading capitals, the large number of exclamation marks and personal pronouns, and the fact that the language usage in the second post is more easily associated with credible information about tennis than the language usage in the first post.

Fig. 1 Two blog posts relevant to the query “tennis”

Another case in which credibility plays an important role is so-called online reputation management (Klewes and Wreschniok 2009): companies monitor online activities, for example on blogs and social networking sites, to find mentions of themselves or of their products and services. The goal here is to identify potentially harmful messages and to respond quickly and adequately to them. While monitoring a company’s reputation, one can come across posts like the ones in Fig. 2: the first post is an extensive and well-written description of someone’s encounter with company X’s help desk. The second is a short, apparently angry shout by a frustrated customer. Company X might decide to act fast after spotting the first post, given that this post sounds reliable and other people reading it might believe it. The second post is useful for overall statistics on reputation, but, as an individual post, it is less important.

Fig. 2 Two blog posts about “Company X”

Similarly, when looking for information on company X, searchers might be more interested in reading the first post than the second. The first will give them insight into what particular service of this company is not as it should be; the second post does not contain much information besides conveying an opinion.

The idea of considering credibility in the blogosphere is not new: Rubin and Liddy (2006) define a framework for assessing blog credibility, consisting of four main categories: blogger’s expertise and offline identity disclosure; blogger’s trustworthiness and value system; information quality; and appeals and triggers of a personal nature. Under these four categories the authors list a large number of indicators, some of which can be determined from textual sources (e.g., literary appeal), and some of which typically need non-textual evidence (e.g., curiosity trigger). We discuss the indicators in Sect. 3.

Although the Rubin and Liddy (2006) framework is not the only available credibility framework, it is the only framework specifically designed for the blogosphere. Other credibility assessments in social media, like Weimer et al. (2007)’s assessment of forum posts and Agichtein et al. (2008)’s quality detection in community question answering (cQA), have the advantage that measurable indicators have already been identified and their performance tested, but these “frameworks” are specifically designed for other social media platforms. This results in a large group of indicators that do not necessarily apply to our (blog) setting, like content ratings (“thumbs up”), user ratings, and the inclusion of HTML code, signatures, and quotes in posts. The indicators proposed by Rubin and Liddy (2006) are not (yet) instantiated and give us the freedom to find appropriate ways of measuring them.

In this paper, we instantiate Rubin and Liddy (2006)’s indicators in a concrete manner and test their impact on blog post retrieval effectiveness. Specifically, we only consider indicators that are textual in nature and, to ensure reproducibility of our results, we only consider indicators that can be derived from the collection at hand (see Sect. 5) and that do not need additional resources, such as bloggers’ profiles, which may be hard to obtain for technical or legal reasons. We identify two groups of indicators: (1) blog-level and (2) post-level indicators. The former group refers to the blog as a whole, that is, to the blogger, and the latter group deals only with characteristics of the post at hand. Blog post retrieval is a precision-oriented task, similar to web search (Manning et al. 2008, Chapter 19). Taking credibility-inspired indicators into account in the retrieval process aims at enhancing precision; there is no obvious reason why these indicators should or could improve recall.

Note that we do not try to explicitly measure the credibility of posts. Although this would be a very interesting and challenging task, we currently have no ways of evaluating the performance on such a task. Rather, we take ideas from the credibility framework and propose a set of credibility-inspired indicators that we put to use on the task of blog post retrieval.

We ask the following research questions:

  1. Given the credibility framework developed by Rubin and Liddy (2006), which indicators can we measure from the text of blog posts?

  2. Can we incorporate credibility-inspired indicators in the retrieval process, keeping in mind the precision-oriented nature of the task? We try two methods: (i) “Credibility-inspired reranking” based on credibility-inspired scores and (ii) “Combined reranking” based on both credibility-inspired scores and retrieval scores.

  3. Can individual credibility-inspired indicators improve precision over a strong baseline?

  4. Can we improve performance (further) by combining indicators in blog-level and post-level groups? And by combining them all?

In our extensive analysis we discuss six issues that were raised during the experiments:

  1. What is the performance of our (simple) spam classification system?

  2. Given the reranking approaches we take, how do these actually change the rankings of blog posts?

  3. Which specific posts are helped or hurt by the credibility-inspired indicators?

  4. What is the impact on performance of the number of results we use in reranking?

  5. Do we observe differences between topics with regard to the performance of credibility-inspired indicators?

  6. Which of the credibility-inspired indicators have most influence on retrieval performance, and why?

Our main findings are that reranking the top results based on credibility-inspired scores is beneficial for precision. Post-level indicators in particular contribute to this improvement. We can opt for a more radical reranking approach, leading to high gains and losses, or for a smoothed version, leading to more stable results.

In Sect. 2 we discuss related work. We follow in Sect. 3 with the introduction of the credibility framework. We define our credibility-inspired indicators in Sect. 4, and describe the experimental setup for testing their impact on retrieval effectiveness in Sect. 5. We discuss the results of our two methods for incorporating credibility-inspired indicators in Sect. 6 and analyze them in detail in Sect. 7. Finally, we draw conclusions in Sect. 8.

2 Related work

Related work comes in two kinds. First, we briefly introduce work related to credibility assessment in web settings; next, we zoom in on credibility in social media. The next section then introduces the credibility framework by Rubin and Liddy (2006) that we use as the basis for our work.

2.1 Credibility on the web

In a web setting, credibility is often couched in terms of authoritativeness and estimated by exploiting the hyperlink structure. Two well-known examples are the PageRank and HITS algorithms (Liu 2007), which use the link structure in a topic-independent or topic-dependent way, respectively. The idea behind these algorithms is that the more pages link to a certain document, the more authoritative that document is. In calculating the authoritativeness of a page, the authoritativeness of the pages linking to it is taken into account.

The idea of using link structure to improve blog post retrieval has been studied, but results do not show improvements; e.g., Mishne (2007b) finds that retrieval performance decreased, probably because linking in blogs indicates a social relation rather than a vote of authoritativeness. This confirms lessons from the TREC web tracks, where participants found no conclusive benefit from the use of link information for ad hoc retrieval tasks (Hawking and Craswell 2002). And although some work suggests that social links can be useful in quality prediction (Lu et al. 2010), this mostly works in (dense) social networks; the blog data at hand contains too little social linkage to show this.

Mandl (2006) tries to determine the quality of web pages using a machine learning approach and uses this automatic assessment in a web search engine; features are mainly extracted from the HTML code and DOM tree.

2.2 Credibility in social media

Credibility-related work in social media comes in various forms, and is applied to different platforms. Weimer et al. (2007) discuss the automatic assessment of forum post quality; they use surface, lexical, syntactic and forum-specific features to classify forum posts as bad posts or good posts. The use of forum-specific features (such as whether or not the post contains HTML, and the fraction of characters that are inside quotes of other posts), gives the highest benefits to the classification.

Working in the community question answering domain, Agichtein et al. (2008) use content features, as well as non-content information, such as links between items and explicit quality ratings from members of the community, to identify high-quality content. In the same domain, Su et al. (2010) try to detect text trustworthiness by incorporating evidentiality (e.g., “I’m certain of this”) in their feature set.

To allow for better presentation of online reviews to users, O’Mahony and Smyth (2009) try to determine the helpfulness of reviews. Their features are divided into reputation features, content features, social features, and sentiment features. Follow-up work also includes readability features (O’Mahony and Smyth 2010).

For blogs, most work related to credibility is aimed at identifying blogs worth following. Sriphaew et al. (2008) try to identify “cool blogs,” i.e., blogs that are worth exploring. Their approach combines credibility-like features with topic consistency, as used in blog feed search (Macdonald et al. 2008b). Similar work is done by Chen and Ohta (2010), who try to filter blog posts using topic concentration and topic variety. One of our indicators (see Sect. 4) is post length, which was further explored by Hearst and Dumais (2009), who found a correlation between the length of posts in a blog and the popularity of that blog. Mishne and de Rijke (2006)’s observation that bloggers often report on news events is the basis for the credibility assessment in Juffinger et al. (2009). The authors compare blog posts to news articles about the same topic, and assign a credibility level based on the similarity between the two. We use a similar technique, but acknowledge that not all blog posts are about news events, hence the need for other indicators. Spam identification may be part of estimating credibility, not only for blogs (or blog posts), but also for other (web) documents. Spam identification has been successfully applied in the blogosphere to improve retrieval effectiveness, for example by Java et al. (2007) and Mishne (2007a).

Recently, credibility-inspired indicators have been successfully applied to post finding in a specific type of blog environment: microblogs (Massoudi et al. 2011). Besides translating indicators to the new environment, the authors also introduced platform-specific indicators like followers, retweets, and recency. For the task of exploring trending topics on Twitter, Castillo et al. (2011) use a similar set of indicators to assess credibility of tweets, and use human assessments to test their approach.

Research into credibility of content is not restricted to textual content. Tsagkias et al. (2010) try to establish the credibility of a particular type of audio: podcasts. They show that, besides podcast-wide metadata (e.g., podcast logo, description length), episode data also plays an important role in determining credibility. We use a similar notion by combining blog level and post level indicators in our work. Finally, Diakopoulos and Essa (2010) explore credibility in video, mainly through the use of smart interfaces and knowledge sharing.

3 Credibility framework

In our choice of credibility indicators we use Rubin and Liddy (2006)’s work as a reference point. We recall the main points of their framework and relate our indicators to it. Rubin and Liddy (2006) proposed a four factor analytical framework for blog readers’ credibility assessment of blog sites, based in part on evidentiality theory (Chafe 1986), website credibility assessment surveys (Stanford et al. 2002), and Van House (2004)’s observations on blog credibility. The four factors—plus indicators for each of them—are listed below.

  1. Blogger’s expertise and offline identity disclosure:

     a. name and geographic location
     b. credentials
     c. affiliations
     d. hyperlinks to others
     e. stated competencies
     f. mode of knowing

  2. Blogger’s trustworthiness and value system:

     a. biases
     b. beliefs
     c. opinions
     d. honesty
     e. preferences
     f. habits
     g. slogans

  3. Information quality:

     a. completeness
     b. accuracy
     c. appropriateness
     d. timeliness
     e. organization (by categories or chronology)
     f. match to prior expectations
     g. match to information need

  4. Appeals and triggers of a personal nature:

     a. aesthetic appeal
     b. literary appeal (i.e., writing style)
     c. curiosity trigger
     d. memory trigger
     e. personal connection

In deciding which indicators to include in our experiments, we took the following steps. For each step, we indicate which of the credibility indicators from Rubin and Liddy (2006)’s framework are excluded.

  A. We do not use credibility indicators that make use of the searcher’s or blogger’s identity (excluding 1a, 1c, 1e, 2e);

  B. We include only indicators that can be estimated automatically from the available test collections alone, so as to facilitate repeatability of our experiments (excluding 3e, 4a, 4c, 4d, 4e);

  C. We only select indicators that can be reliably estimated with state-of-the-art language technology (excluding 2b, 2c, 2d, 2g);

  D. Finally, given the findings by Mishne (2007b), we ignore the “hyperlinks to others” indicator (1d).

Of the 11 indicators that we do consider—1b, 1f, 2a, 2f, 3a, 3b, 3c, 3d, 3f, 3g, 4b—one is already covered by the baseline retrieval system (3g, match to information need) and does not require a separate indicator. The others are organized into two groups, depending on the information source that we use to estimate them: post level and blog(ger) level. The former depends solely on information contained in an individual blog post, and ignores the blog it belongs to. The latter aggregates or averages information from posts to the blog level; these indicator values are therefore equal for all posts in the same blog.

In the next section we explore the 10 selected indicators from Rubin and Liddy (2006)’s credibility framework and introduce ways of estimating these indicators so that they can be applied to the task at hand: blog post retrieval.

4 Credibility-inspired indicators

In this section we introduce our credibility-inspired indicators, explain how they are related to the work by Rubin and Liddy (2006) that was described in the previous section, and offer ways of estimating the indicators. Table 1 summarizes this section, and lists our credibility-inspired indicators and their originating counterpart.

Table 1 Our credibility-inspired indicators and their origins in Rubin and Liddy (2006)

Next, we specify how each of the credibility-inspired indicators is estimated, and briefly discuss why and how these indicators address the issue of credibility. We start with the eight post-level indicators (Sect. 4.1) and conclude with the six blog-level indicators (Sect. 4.2).

4.1 Post-level indicators

As mentioned previously, post-level indicators make use of information contained within individual posts. We go through the indicators capitalization, emoticons, shouting, spelling, punctuation, post length, timeliness, and semantics, and conclude with a combined text quality indicator.

4.1.1 Capitalization

We estimate the capitalization score as follows:

$$ S_{capitalization}(post) = {\frac{n(caps,s_{post})}{|s_{post}|}}, $$
(1)

where \(n(caps, s_{post})\) is the number of sentences in post \(post\) starting with a capital and \(|s_{post}|\) is the number of sentences in the post; we only consider sentences with five or more words. We consider the use of capitalization to be an indicator of good writing style, which in turn contributes to a sense of credibility.
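To make the estimation concrete, here is a minimal sketch of this computation in Python (our own illustration, not the authors' implementation; the sentence splitter is a naive heuristic):

```python
import re

def capitalization_score(post_text):
    """Fraction of sentences with five or more words that start with a capital (Eq. 1)."""
    # Naive sentence splitting on ., ! and ?; a production system would use a real tokenizer.
    sentences = [s.strip() for s in re.split(r"[.!?]+", post_text) if s.strip()]
    sentences = [s for s in sentences if len(s.split()) >= 5]
    if not sentences:
        return 0.0
    capitalized = sum(1 for s in sentences if s[0].isupper())
    return capitalized / len(sentences)

print(capitalization_score("this one starts without a capital letter. This one starts with a capital letter."))
# -> 0.5
```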

4.1.2 Emoticons

The emoticons score is estimated as

$$ S_{emoticons}(post) = 1 - \left({\frac{n(emo,post)}{|post|}}\right), $$
(2)

where \(n(emo, post)\) is the number of emoticons in the post and \(|post|\) is the length of the post in words. We identify Western style emoticons (e.g., :-) and :-D) in blog posts, and assume that excessive use indicates a less credible blog post.

4.1.3 Shouting

We use the following equation to estimate the shouting score:

$$ S_{shouting}(post) = 1 - \left({\frac{n(shout,post)}{|post|}}\right), $$
(3)

where \(n(shout, post)\) is the number of all-caps words in blog post \(post\) and \(|post|\) is the post length in words. Words written in all caps are considered shouting in a web environment; we consider shouting to be indicative of non-credible posts. Note that nowadays the use of repeated characters could also be considered shouting, but we did not try to detect this notion of shouting.

4.1.4 Spelling

The spelling score is estimated as

$$ S_{spelling}(post) = 1 - \left({\frac{n(error,post)}{|post|}}\right), $$
(4)

where \(n(error, post)\) is the number of misspelled or unknown words (with more than four characters) in post \(post\) and \(|post|\) is the post length in words. A credible author should be able to write without (a lot of) spelling errors; the more spelling errors occur in a blog post, the less credible we consider it to be.

4.1.5 Punctuation

The punctuation score is calculated as follows:

$$ S_{punctuation}(post) = 1 - \left({\frac{n(punc,post)}{|post|}}\right), $$
(5)

where \(n(punc, post)\) is the number of repetitive occurrences of dots, question marks, or exclamation marks (e.g., “look at this!!!”, “well…”, or “can you believe it??”) and \(|post|\) is the post length in words. If \(n(punc,post)\cdot |post|^{-1}\) is larger than 1, we set \(S_{punctuation}(post) = 0\). We assume that excessive use of repeated punctuation marks is an indication of non-credible posts.
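A minimal sketch of this computation (an illustration under our own assumptions; the regular expression is our guess at what counts as a repetitive occurrence):

```python
import re

def punctuation_score(post_text):
    """1 minus the ratio of repeated-punctuation runs to post length in words, clipped at 0 (Eq. 5)."""
    words = post_text.split()
    if not words:
        return 0.0
    # Runs of two or more dots, question marks, or exclamation marks, e.g. "!!!", "..", "??".
    repeats = len(re.findall(r"\.{2,}|\?{2,}|!{2,}", post_text))
    return max(0.0, 1.0 - repeats / len(words))

print(punctuation_score("look at this!!! can you believe it??"))  # two repeated runs, 7 words
```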

4.1.6 Post length

The post length score is estimated using |post|, the post length in words:

$$ S_{length}(post) = \log(|post|). $$
(6)

We assume that credible texts have a reasonable length; the text should supply enough information to convince the reader of the author’s credibility, and it is an indication of “completeness.”

4.1.7 Timeliness

Assuming that much of what goes on in the blogosphere is inspired by events in the news (Mishne and de Rijke 2006), we believe that, for news-related topics, a blog post is more credible if it is published around the time of the triggering news event: it is timely. Bloggers that take (much) longer to respond to news events are considered less timely. To estimate timeliness, we first identify peaks for a topic in a collection of news articles, by summing the retrieval scores for each date in the top 500 results, and taking dates with a value higher than twice the standard deviation to be “peak dates”. Two example topics and their peaks are given in Fig. 3.

Fig. 3 Peaks in news articles for (Left) topic 853, State of the Union, which was held on January 31, 2006, and (Right) topic 882, Seahawks, an American football team that won the NFC on January 22, 2006, and played the Super Bowl on February 5, 2006

Having identified peaks for certain topics, we take the timeliness to be the difference in days between the peak date and the day of the post. More formally:

$$ S_{timeliness}(post,Q) = \left\{ \begin{array}{ll} e^{-(|\tau_{post}-\tau_{peak_Q}|)} & \hbox{if}\,\tau_{post} - \tau_{peak_Q} > -2\\ 0 & \hbox{otherwise.}\\ \end{array}\right. $$
(7)

Here, \(\tau_{peak_Q}\) is the date of the peak (in case the peak spans several days, it is the date closest to the post date), and \(\tau_{post}\) is the post date. The difference between the dates is calculated in days.
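The two steps (peak detection over the news collection and the decay of Eq. (7)) could be sketched as follows; the per-date summed retrieval scores are assumed to be given, and the example dates are purely illustrative:

```python
import math
from datetime import date
from statistics import pstdev

def peak_dates(score_per_date):
    """Dates whose summed retrieval score exceeds twice the standard deviation."""
    threshold = 2 * pstdev(list(score_per_date.values()))
    return [d for d, s in score_per_date.items() if s > threshold]

def timeliness_score(post_date, peak_date):
    """Exponential decay in the number of days between post and (nearest) peak (Eq. 7)."""
    delta = (post_date - peak_date).days
    if delta <= -2:              # posts published well before the event score zero
        return 0.0
    return math.exp(-abs(delta))

print(timeliness_score(date(2006, 2, 1), date(2006, 1, 31)))  # one day after the peak
```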

4.1.8 Semantics

For news-related topics, we are looking for posts that “mimic” the semantics of credible sources, like actual news articles. For this, we use a query expansion approach, based on previous work (Diaz and Metzler 2006; Weerkamp et al. 2009). We query the same news collection as before for the topics, and select the top 10 retrieved articles. From these articles we select the 10 most important terms, using Lavrenko and Croft (2001)’s relevance model 2. The selected terms, \(\theta_{semantic,Q}\), represent credible semantics for the given topic, and we use these terms as a query to score blog posts on the semantics indicator. Table 2 shows the extracted credible terms for three example topics.

Table 2 Terms indicating credible semantics for three topics: Macbook Pro deals with laptops by Apple; Cheney hunting discusses a hunting accident involving vice-president Cheney and his friend Whittington; David Irving is an Austrian historian on trial for denying the Holocaust
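As a rough sketch of the term-selection step above (we substitute a simple tf-idf weighting for Lavrenko and Croft's relevance model 2, so this is an approximation of the idea rather than the method used here):

```python
import math
from collections import Counter

def credible_terms(top_news_articles, k=10, stopwords=frozenset()):
    """Select k terms from the top retrieved news articles to represent 'credible semantics'.
    Simplification: tf-idf over the article set instead of relevance model 2."""
    docs = [[w for w in article.lower().split() if w not in stopwords]
            for article in top_news_articles]
    tf = Counter(w for doc in docs for w in doc)           # term frequency over all articles
    df = Counter(w for doc in docs for w in set(doc))      # document frequency
    n_docs = len(docs)
    weights = {w: tf[w] * math.log(1 + n_docs / df[w]) for w in tf}
    return sorted(weights, key=weights.get, reverse=True)[:k]
```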

4.1.9 Text quality

To limit the number of experiments to run, we combine the following indicators into one text quality indicator: spelling, emoticons, capitalization, shouting, and punctuation. To combine these indicators, we first normalize each individual indicator using min-max normalization (Lee 1995). Then, we take the average value over all these indicators to be the text quality indicator.
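A sketch of this combination step, assuming each indicator has already produced a raw score per post:

```python
def min_max_normalize(scores):
    """Min-max normalize a {post_id: score} mapping to [0, 1]."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {post: 0.0 for post in scores}
    return {post: (s - lo) / (hi - lo) for post, s in scores.items()}

def text_quality(indicator_scores):
    """Average the normalized spelling, emoticons, capitalization, shouting, and punctuation scores."""
    normalized = [min_max_normalize(scores) for scores in indicator_scores]
    return {post: sum(n[post] for n in normalized) / len(normalized)
            for post in normalized[0]}
```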

4.2 Blog-level indicators

Blog-level indicators say something about the blog as a whole, or about the blogger who wrote the posts. Most indicators aggregate information from individual posts to the blog level, and they all lead to posts from the same blog having equal scores. Here, we go through the indicators spamminess, comments, regularity, consistency, pronouns, and expertise.

4.2.1 Spamminess

To estimate the spamminess of a blog, we take a simple approach. First, we observe that blogs are either completely spam (“splogs”) or not (i.e., there are no blogs with half of the posts spam and half of them non-spam), and this is why we consider this indicator on the blog level. We train an SVM classifier on a labeled splog blog dataset (Kolari et al. 2006) using the top 1,500 words for both spam and non-spam blogs as features. We then apply the trained classifier to our set of blog posts, and assign a spam or no-spam label to each post. We calculate the ratio of spam posts in each blog, and use this ratio as indication of spamminess for the full blog.

$$ S_{spam}(post) = {\frac{n(post_{spam},blog)}{|blog|}}, $$
(8)

where \(n(post_{spam}, blog)\) is the number of spam posts in the blog, and \(|blog|\) is the size of the blog in number of posts. Splogs are not considered credible and we want to demote them in or filter them from the search results. Although the list of splogs for our test collection (see Sect. 5.1) is available, we do not use it in any way in our retrieval experiments, ensuring that our results remain comparable to previously published results. We come back to the performance of our spam classification in Sect. 7.1.
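A minimal sketch of such a classifier using scikit-learn (our choice of toolkit; the paper does not name an implementation, and the feature selection below only approximates the top-1,500-words-per-class setup described above):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

def train_splog_classifier(train_posts, labels):
    """Train an SVM on unigram counts; labels are 1 for posts from splogs, 0 otherwise."""
    vectorizer = CountVectorizer(max_features=3000)  # stand-in for top-1,500 words per class
    features = vectorizer.fit_transform(train_posts)
    classifier = LinearSVC().fit(features, labels)
    return vectorizer, classifier

def blog_spamminess(blog_posts, vectorizer, classifier):
    """Ratio of posts in a blog that the classifier labels as spam (Eq. 8)."""
    predictions = classifier.predict(vectorizer.transform(blog_posts))
    return sum(predictions) / len(blog_posts)
```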

4.2.2 Comments

We estimate the comment score as

$$ S_{comment}(post) = \log\left({\frac{\sum_{post\in blog}n(comment,post)}{|blog|}} + 1\right), $$
(9)

where \(n(comment, post)\) is the number of comments on post \(post\), and \(|blog|\) is the size of the blog in number of posts. Comments are a notable blog feature: readers of a blog post often have the possibility of leaving a comment for other readers or the author. When people comment on a blog post they apparently find the post worth putting effort in, which can be seen as an indicator of credibility (Mishne and Glance 2006).

4.2.3 Regularity

To estimate the regularity score we use

$$ S_{regularity}(post) = \log(\sigma_{interval,blog}), $$
(10)

where \(\sigma_{interval,blog}\) is the standard deviation of the temporal intervals between two successive posts in a blog. Blogs consist of multiple posts in (reverse) chronological order. The temporal aspect of blogs may indicate credibility: we assume that bloggers with irregular posting behavior are less credible than bloggers who post regularly.

4.2.4 Topical consistency

We take into consideration the topical fluctuation of a blogger’s posts. When looking for credible information we would like to retrieve posts from bloggers that have a certain level of (topical) consistency: not the fluctuating behavior of a (personal) blogger, but a solid interest. The coherence score indicator (He et al. 2009) is a relatively cheap, topic-independent way of estimating this. Given a set of blog posts from a blog, \(blog = \{post_i\}_{i=1}^{M}\), drawn from a background collection \(C\) (i.e., the blogosphere), so that \(blog \subseteq C\), the coherence score is defined as the proportion of “coherent” pairs of blog posts with respect to the total number of post pairs within \(blog\). The criterion for a pair being “coherent” is that the similarity between the two posts in the pair meets or exceeds a given threshold. Formally, the coherence (Co) of a blog \(blog\) is defined as

$$ Co(blog)={\frac{\sum_{i \neq j\in \{1,\ldots,M\} }{\delta(post_i, post_j)}}{{\frac{1}{2}}M(M-1)}}, $$
(11)

where \(\delta(post_i, post_j)\) is 1 if the two posts are similar and 0 otherwise. We use cosine similarity to determine the similarity between two blog posts. More details on the coherence score can be found in (He et al. 2008, 2009).
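A sketch of the coherence computation (tf-idf vectors and cosine similarity via scikit-learn; the similarity threshold below is a placeholder, since its value is not specified here):

```python
from itertools import combinations
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def coherence(blog_posts, threshold=0.3):
    """Proportion of post pairs whose cosine similarity meets the threshold (Eq. 11)."""
    if len(blog_posts) < 2:
        return 0.0
    vectors = TfidfVectorizer().fit_transform(blog_posts)
    similarities = cosine_similarity(vectors)
    pairs = list(combinations(range(len(blog_posts)), 2))
    coherent = sum(1 for i, j in pairs if similarities[i, j] >= threshold)
    return coherent / len(pairs)
```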

4.2.5 Pronouns

We estimate the pronouns score as follows

$$ S_{pronouns}(post) = 1 - \left( {\frac{\sum_{post\in blog} {\frac{n(pron,post)}{|post|}}}{|blog|}}\right), $$
(12)

where \(n(pron, post)\) is the number of first person pronouns (I, me, mine, we, us, \(\ldots\)) in post \(post\), \(|post|\) is the size of the post in words, and \(|blog|\) is the size of the blog in number of posts. First person pronouns express a bias towards one’s own interpretation, and we feel this could harm the credibility of a blog (post). Note that we use simple string matching for this indicator, which might lead to an overestimation for some pronouns (e.g., “mine” can also be used as a noun or a verb). We believe, however, that this is only a marginal issue and should not influence the results of this indicator.

4.2.6 Expertise

To estimate a blogger’s expertise for a given topic, we use the approach described in Weerkamp et al. (2011). We look at the posts written by a blogger, and try to estimate to what extent the given topic is central to the blog. Blogs that are most likely to be relevant to this query are retrieved, and we assign posts in those blogs a higher score on the expertise indicator. As an example, consider topic 856, macbook pro: the top retrieved blogs are (1) MacBook Garage, (2) Enterprise Mac, and (3) tech ronin. The first two are very Apple/Mac oriented, and the third result is more general technology-oriented, but with an interest in Macs. We consider posts from these blogs, blogs with a recurring interest in the topic, to be more credible than posts from blogs mentioning the topic only occasionally.

$$ S_{expertise}(post,Q) = P(blog|Q), $$
(13)

where \(P(blog|Q)\) is the retrieval score for blog \(blog\) on query \(Q\), as given by the Blogger model from Weerkamp et al. (2011).

On top of the individual credibility indicators, we report on the performance of combinations of indicators. We combine indicators into our two levels (post and blog level) and into a full combination, using these steps: (1) normalize indicator scores using min-max normalization (Lee 1995) and (2) average over the indicators belonging to the combination at hand (post level, blog level, or all).

We already introduced the difference between post-level and blog-level indicators, but there is one more dimension on which we can separate indicators: whether or not the indicator depends on the topic. Most of the indicators get their score independently of the topic (e.g., spelling errors, capitalization); however, three indicators do depend on the topic: semantics, timeliness, and expertise. To summarize this section, Table 3 shows all our indicators and their characteristics.

Table 3 Our credibility-inspired indicators and their characteristics

5 Experimental setup

This section describes the task and collection we use to test our credibility-inspired indicators (Sect. 5.1), the general retrieval framework (Sect. 5.2), and the evaluation metrics (Sect. 5.3).

5.1 Task and collection

We apply our credibility-inspired indicators to the task of retrieving topically relevant blog posts. This task ran at the Text REtrieval Conference (TREC), as part of the blog track, in 2006–2008 (Macdonald et al. 2008a; Ounis et al. 2006, 2009). Given a set of blog posts and a query, we are asked to return relevant blog posts for that query. We apply our model and indicators to the TREC Blog06 corpus (Macdonald and Ounis 2006). This corpus has been constructed by monitoring around 100,000 blog feeds for a period of 11 weeks in early 2006, downloading all posts created in this period. For each permalink (HTML page containing one blog post) the feed id is registered; we can use this id to aggregate post-level features to the blog level. In our experiments we use only the HTML documents, and ignore syndicated (RSS) data. We perform two preprocessing steps: (1) keep long sentences (Hofmann and Weerkamp 2008), and (2) apply language identification using TextCat to select English posts. The collection statistics are displayed in Table 4.

Table 4 Collection statistics before and after preprocessing

The TREC 2006, 2007, and 2008 Blog tracks each provide 50 topics and assessments, giving us 150 topics in total. For topical relevancy, assessment was done using a standard two-level scale: the content of the post was judged to be topically relevant or not. For all our retrieval tasks we only use the title field (T) of the topic statement as query; this boils down to the use of keyword queries. Table 5 lists several statistics of the queries in our test collection. We see that for 2006 more posts were assessed than for 2007 and 2008, which leads to more relevant posts per query. As to the number of terms per query, we see that 2008 queries are, on average, quite a bit longer than the 2006 and 2007 queries.

Table 5 Query statistics for 2006, 2007, and 2008

To estimate the semantics and timeliness credibility indicators, we need a collection of news articles. Here, we use AQUAINT-2, a set of about 907,000 newswire articles (AQUAINT-2 Guidelines 2007) from six different news sources. Of these articles, 135,763 are contemporaneous with the TREC Blog06 collection, and we use only this subset in our experiments. All news articles are written in English.

5.2 Retrieval framework

Our retrieval framework uses a language modeling for IR approach (Croft and Lafferty 2003), in which we estimate the probability of a document generating the query. We select this framework because it is theoretically sound and has shown good and robust performance on a broad range of retrieval tasks. We use the implementation provided by Indri.
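For reference, a toy version of query-likelihood scoring with Dirichlet smoothing (Indri's default smoothing method; the value of μ below is a common default and not a parameter reported in this paper):

```python
import math
from collections import Counter

def query_likelihood(query_terms, doc_terms, collection_tf, collection_length, mu=2500.0):
    """Log P(query | document) under a Dirichlet-smoothed unigram language model."""
    doc_tf = Counter(doc_terms)
    score = 0.0
    for term in query_terms:
        p_collection = collection_tf.get(term, 0) / collection_length
        if p_collection == 0:
            continue  # term unseen in the collection; skipped in this toy version
        p_term = (doc_tf[term] + mu * p_collection) / (len(doc_terms) + mu)
        score += math.log(p_term)
    return score
```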

5.3 Evaluation and significance

As explained before, we consider blog post retrieval to be a precision-oriented task, and therefore focus on standard precision-oriented IR metrics: mean reciprocal rank (MRR) and precision at ranks 5 and 10 (P5 and P10) (Baeza-Yates and Ribeiro-Neto 2010). For the sake of completeness we also report the commonly used mean average precision (MAP) metric. In each table the best-performing run per metric is bold-faced.

We test for statistically significant differences using a two-tailed paired t-test. Significant improvements over the baseline are marked with \(^{\vartriangle}\) (α = 0.05) or \(^{\blacktriangle}\) (α = 0.01); significant drops in performance are marked with \(^{\triangledown}\) (α = 0.05) or \(^{\blacktriangledown}\) (α = 0.01).
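Concretely, the test compares per-topic scores of a run against the baseline; a sketch using SciPy (shown only to make the procedure explicit):

```python
from scipy.stats import ttest_rel

def compare_runs(baseline_scores, run_scores, alpha=0.05):
    """Two-tailed paired t-test over per-topic scores (e.g., per-topic P5) of two runs."""
    statistic, p_value = ttest_rel(run_scores, baseline_scores)
    significant_gain = statistic > 0 and p_value < alpha
    significant_drop = statistic < 0 and p_value < alpha
    return statistic, p_value, significant_gain, significant_drop
```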

6 Results

We present our results in three sections. First, we show the performance of our baseline, see how it compares to previous approaches at TREC, and show the influence of spam filtering (Sect. 6.1). We continue by applying our credibility-inspired indicators on top of our (spam-filtered) baseline. Since we aim at improving precision using credibility, we focus on reranking the initially retrieved results, assuming that the baseline has sufficiently strong recall. We start by reranking the top n of the initial run based solely on the credibility-inspired scores (Credibility-inspired reranking) in Sect. 6.2. We then take a step back, combine retrieval scores and credibility-inspired scores in our Combined reranking approach in Sect. 6.3, and explore reranking the top n results using this combined score.

Both our reranking approaches are applied to the top n of the baseline ranking after spam filtering. We need to decide on a value of n, and to make results from the two approaches comparable, we choose the same n for both of them. For the results section we take n = 20, as this value allows measuring changes in early precision (at ranks 5 and 10) without ignoring the initial ranking too much. In Sect. 7.4 we come back to this issue, and explore the influence of n on the performance of our approaches.

6.1 Baseline and spam filtering

We start by establishing our baseline: Table 6 shows the results on the three topic sets. Note that the baseline is strong: its performance is better than or close to the best performing runs at TREC for all three years (our runs would have been at rank 1/15, 4/20, and 8/20). This is impressive given that the participating systems incorporate additional techniques like (external) query expansion, especially in 2007 and 2008.

Table 6 Preliminary baseline scores for all three topic sets and their combination (150 topics)

We detailed our spam classification approach in Sect. 4.2, where we assigned a score to each blog based on the ratio of spam posts in that blog. To turn this score into a filter, we need a threshold for this ratio: every blog that has a higher ratio than this threshold is considered a splog and is removed from the results. Given the orientation towards precision we consider blogs that have >25% of their posts classified as spam posts to be splogs. This threshold leads to the removal of 6,412 splogs (198,065 posts).

Table 7 shows the results after filtering out spam. Results show similar performance on the precision metrics and a slight, though significant, drop in terms of MAP. We revisit the results of our spam classifier in Sect. 7.1.

Table 7 Results before and after filtering spam. Significance tested against the baseline

In the remainder of the paper we have two notions of a “baseline.” First, when it comes to comparing performance of our approaches, we do so against the baseline (row one in Table 7). Second, the ranking that is produced after filtering splogs (spam-filtered baseline; row two in Table 7) serves as the starting point on top of which we apply our two reranking approaches: Credibility-inspired reranking and Combined reranking. Put differently, in our discussions below reranking always includes spam filtering.

6.2 Credibility-inspired reranking

The first method of reranking we explore is Credibility-inspired reranking. As the name indicates, this approach takes only the credibility-inspired scores into account when reranking the top 20 results of our baseline ranking. That is, we take the ranking produced after filtering spam, ignore the retrieval scores for the top 20 results, assign to each of the top 20 posts the score given by each credibility-inspired indicator (viz. Sect. 4), and construct the new ranking based on these scores. The posts ranked lower than position 20 keep their original retrieval score and ranking.
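A sketch of this reranking step (post identifiers, retrieval scores, and credibility-inspired scores are assumed to be given):

```python
def credibility_rerank(ranking, credibility_scores, n=20):
    """Rerank the top-n posts purely by credibility-inspired score; the tail keeps its order.

    ranking: list of (post_id, retrieval_score) pairs, best first.
    credibility_scores: {post_id: credibility-inspired score}.
    """
    head = [post for post, _ in ranking[:n]]
    head.sort(key=lambda post: credibility_scores.get(post, 0.0), reverse=True)
    return head + [post for post, _ in ranking[n:]]
```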

We present the results of Credibility-inspired reranking in Table 8. The results are divided into four groups: (1) the baseline and the manual upper bound (which reranks the posts based on their relevance assessments), (2) the individual post-level indicators, (3) the individual blog-level indicators, and (4) the combined indicators on post level, blog level, and both. We first focus on the individual indicators.

Table 8 Results for Credibility-inspired reranking on the top 20 results based on each of the credibility-inspired indicator scores for all 150 topics. Significance tested against the baseline

The individual indicators show a wide range in performance. All indicators show a drop in MAP compared to the baseline, but this was expected. We focus on the precision metrics and here we observe that almost all post-level indicators seem to improve over the baseline, although only the improvement on MRR by timeliness is significant. Looking at the blog-level indicators, we find that only the comments indicator improves over the baseline, with MRR showing a significant increase. The other blog-level indicators perform worse than or similar to the baseline. The highest scores on the precision metrics, when looking at the individual indicators, are achieved by three different indicators: comments on MRR, timeliness on P5, and semantics on P10.

Next, we shift our attention to combinations of indicators (the bottom part of Table 8). From these results we observe two things. First, the combined blog-level indicators do not improve over the baseline run on any metric, which is disappointing, but expected given the scores of individual indicators on this level. Second, the combined post-level indicators have the highest scores on the precision metrics, but improvements are not significant.

As an aside, given the strong performance of the comments indicator, it is natural to wonder what would happen if this blog-level indicator were included with the post-level indicators. That is, we take all post-level indicators and combine these with the comments indicator only. Using this combination we achieve the following scores: MRR 0.8280; P5 0.7280; P10 0.7167; and MAP 0.3744. Here, we find that performance on all metrics is still slightly below that of the post-level indicators only.

Summarizing, we see that the Credibility-inspired reranking approach works well for post-level indicators, although it is hard to obtain significant improvements. The blog-level indicators, with the exception of comments, perform rather disappointingly. Given the fact that we completely ignore the retrieval score once we start the reranking process, the results obtained by post-level indicators are quite remarkable and show the potential of taking ideas from the credibility framework on board as precision enhancement.

6.3 Combined reranking

Completely ignoring the initial retrieval score sounds like a “bad” idea: there is a reason why certain posts are assigned a higher retrieval score than others, and we should probably be using these differences in scores. In this section we take another approach to incorporating ideas from the credibility framework in ranking blog posts: we combine the original retrieval score and the credibility-inspired score of posts to rerank the baseline ranking. We, again, look only at the top 20 results of the original ranking and multiply the retrieval score of each document by its (normalized) score on each credibility-inspired indicator. We present the results as in the previous section: (1) the baseline and upper bound, (2) the individual post-level indicators, (3) the individual blog-level indicators, and (4) the combinations of indicators. The results are listed in Table 9.
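The combined variant differs only in the score used to reorder the top n; a sketch (the credibility-inspired scores are assumed to be min-max normalized, as described in Sect. 4):

```python
def combined_rerank(ranking, credibility_scores, n=20):
    """Rerank the top-n posts by retrieval score times (normalized) credibility-inspired score."""
    head = [(post, retrieval * credibility_scores.get(post, 0.0))
            for post, retrieval in ranking[:n]]
    head.sort(key=lambda pair: pair[1], reverse=True)
    return [post for post, _ in head] + [post for post, _ in ranking[n:]]
```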

Table 9 Results for Combined reranking using a combination of retrieval and credibility scores, and reranking the top 20 results based on this score for all 150 topics. Significance tested against the baseline

Results show that most post-level indicators are able to improve over the baseline on precision metrics. Especially scores on MRR improve significantly, and both the timeliness and semantics indicators show large improvements on MRR and P5 compared to the baseline. Compared to the Credibility-inspired reranking approach in the previous section, we observe better performance on the precision at 5 and 10 metrics, as well as more significant (stable) improvements. Looking at the individual blog-level indicators we see a similar pattern as before: the comments indicator works well on MRR, but coherence, regularity, and expertise cannot improve over the baseline on any metric. An interesting difference with the previous approach is that both the pronouns and regularity indicators, which dropped significantly in performance compared to the baseline in Sect. 6.2, are now comparable to the baseline.

When combining the credibility-inspired indicators on our two levels we notice that scores for the post-level combination are, in an absolute sense, slightly below the results of Credibility-inspired reranking, but they do show significant improvements over the baseline on precision metrics, indicating a more stable improvement.

Given the below-baseline performance of some of the blog-level indicators, we experiment by excluding them from the final (all) combination. Table 10 shows the results of using only comments and using both comments and pronouns in this final combination. Results here show that we can indeed improve over the combined post-level indicators when adding comments and pronouns to the combination. The final two runs show a (strong) significant improvement over the baseline on MRR and precision at 5.

Table 10 Results for combining post-level indicators and one or two blog-level indicators. Significance tested against the baseline

Summarizing, we find that Combined reranking resembles a “smoothed” version of Credibility-inspired reranking: It takes away the outliers, leading to slightly lower absolute scores than for Credibility-inspired reranking, but the improvements over the baseline are more often significant. Again, post-level indicators are the better performing ones, although this time we find that combining these with two blog-level indicators (comments and pronouns) leads to even better performance. Combined reranking is a powerful way of incorporating ideas from the credibility framework, resulting in stable improvements.

In the analysis section (Sect. 7), we often look at the two best performing runs from both approaches. For Credibility-inspired reranking this is the post-level combination run, and for Combined reranking it is the post-level + comments + pronouns run.

7 Analysis and discussion

We presented the overall results of our two credibility-inspired reranking approaches in the previous section. These results, however, hide a lot of detail, which could be important to understanding what exactly is happening. In this section we perform extensive analyses of our results from five perspectives. First, in Sect. 7.1, we look at the performance of our spam classifier. In Sect. 7.2 we acknowledge the fact that we are looking at reranking strategies and give more details on how our approaches really affect ranking, by looking at swaps, the positions of relevant posts, and specific (relevant) posts that move significantly up or down the ranking. Sect. 7.3 is devoted to per-topic analyses of our indicators and reranking approaches: it compares various runs on a per-topic basis and explores which specific topics show improvements or drops in performance. We discuss the setting of n, the number of results we rerank, in Sect. 7.4, and finally, we explore the interplay between credibility-inspired ranking and relevance in Sect. 7.5.

7.1 Spam classification

The official collection was purposefully injected with spam by gathering blog posts from known splogs. In total, 17,958 splogs were followed during the 11-week period of crawling. As mentioned before, we use a relatively simple approach to splog detection, based on a rather small training set and a limited set of features (unigrams). Of the 6,412 blogs classified as splogs, 4,148 really are splogs (precision 65%). The recall of our classifier is rather low, with 4,148 out of 17,958 splogs identified (recall 23%).

7.2 Changes in ranking

Our two approaches for incorporating credibility-inspired indicators are based on reranking an initial ranking of posts. Besides looking at the scores produced by each of the (re)rankings, we can also look at the rankings themselves and explore how they differ between runs. First, we look at the number of swaps in the top 20 after reranking. The higher this number, the more changes in positions between the baseline and the reranked result lists. We compare the various indicators, as well as the two reranking approaches, in Table 11. Note that for most analyses in this section the numbers for the timeliness indicator might seem out of the ordinary; this is because this indicator only affects 50 of the 150 topics, which influences the averages quite a bit.

Table 11 Average number of swaps (changes in ranking) per topic between each run and the (spam-filtered) baseline

We observe that in the Credibility-inspired reranking approach more swaps are generated than in the Combined reranking approach, although in some cases (e.g., timeliness) the difference is only marginal. The reason for the difference between the two approaches is that in the Combined reranking approach the initial retrieval score acts as a kind of “smoothing,” making the changes less radical. In general we see that most of the results in the top 20 get a different position after applying our reranking techniques.

To examine how successful the swaps are, we combine the swaps with relevance information; Tables 12 (Credibility-inspired reranking) and 13 (Combined reranking) show the average number of relevant posts per topic that go up or down in the ranking after reranking has been applied and the average number of positions each of these posts gains or loses. We should note that relevant posts going down in the ranking is not necessarily a problem, as long as the posts crossing them are relevant too.

Table 12 Credibility-inspired reranking: average number of relevant posts per topic that go up or down the ranking after reranking, and the average number of positions these posts go up or down. Also: the ratio of rising vs. dropping relevant posts per indicator
Table 13 Combined reranking: average number of relevant posts per topic that go up or down the ranking after reranking, and the average number of positions these posts go up or down. Also: the ratio of rising vs. dropping relevant posts per indicator

Comparing the two approaches on these numbers, we observe that all the numbers (except the ratios) are higher for Credibility-inspired reranking than for Combined reranking: more relevant posts go up, more relevant posts go down and for both the average number of positions is higher. The only numbers that are consistently higher for Combined reranking are the ratios of number of relevant posts going up vs. relevant posts going down. Here, we see that for most indicators this ratio is above 1 for Combined reranking, whereas it is above 1 for only two indicators for Credibility-inspired reranking.

Looking at the individual indicators for Combined reranking, we notice some interesting differences. The quality indicator has by far the highest ratio of relevant posts up vs. down, but the average number of positions is almost the lowest over all indicators. The comments indicator on the other hand has a mediocre up vs. down ratio, but the average number of positions relevant posts move (either up or down) is much higher than most other indicators.

7.2.1 Per-post analysis

Next, we drill down to the level of individual posts and look at example posts that show “interesting” behavior. First we look at posts that move up or down most when comparing our approaches to the baseline. Table 14 shows the average of these maxima per topic for two selected indicators and the best performing run per approach. We observe that Credibility-inspired reranking leads to posts moving up and down a lot, whereas Combined reranking is more modest in both cases.

Table 14 Average maximum number of positions per topic a relevant post goes up or down the top 20 of the ranking for two individual indicators and the best run per approach

We zoom in and look at the posts themselves. Table 15 shows four examples of posts that are relevant to a topic and that show the largest “bump” for that topic after using Combined reranking (with post-level + comments + pronouns). For each example post we give the topic to which it is relevant, the change in positions, the ID, a part of the post’s text, and the reasons why this post went up in the ranking.

Table 15 Examples of relevant posts helped by credibility after reranking using Combined reranking (post-level + comments + pronouns)

The example posts show that we are able to push more credible posts up the ranking. As to the indicators that matter most in these examples, we observe that most have a high (text) quality (few spelling mistakes, correct use of punctuation and capitalization), have many comments, are timely (i.e., published on the day of the related event), and share semantics with related news articles.

We perform a similar analysis for relevant posts that drop in the ranking after using Combined reranking. Table 16 shows four of these posts, again with a snippet from the post and the reasons why the system believes these posts should drop.

Table 16 Examples of relevant posts hurt by credibility after reranking using Combined reranking (post-level + comments + pronouns)

Looking at these posts, we feel that, although relevant, they are less credible than the posts in Table 15. The first post is a collection of links to other sources and in itself contains little information, which is reflected by its short length and lack of comments. The second post sounds more credible, but is quite biased (i.e., a high number of pronouns) and again has only few comments. The third post is a fake “conversation” between Oprah and George Bush and is considered less credible because of improper semantics and low text quality. Finally, the fourth post is characterized by punctuation “abuse” (\(\ldots\), ??), short length, and very few comments.

In general we see that Credibility-inspired reranking is a more radical reranking approach, leading to many changes in the ranking and many (relevant) posts moving up and down. This is risky; it can lead to high gains, but also to large drops in performance. Combined reranking is a more careful, “smoothed” approach, which shows (slightly) fewer changes and moves in the ranking, but is more stable in its improvements (i.e., the ratio of posts going up and down), leading to significant improvements.

Looking at examples of relevant posts that are helped or hurt by credibility-inspired indicators, we find that posts that are pushed up the ranking are indeed more credible, whereas the posts that are pushed down seem to be less credible (although still relevant). There is not one indicator that leads to these changes, but it is always a combination of indicators (like comments, timeliness, semantics, and quality). We revisit the influence of individual indicators and the interplay between credibility-inspired ranking and relevance in Sect. 7.5.

7.3 Per topic analysis

Performance numbers averaged over 150 topics hide a lot of detail. In this section we analyze the performance of our approaches on a per-topic basis and see how their behavior differs for various topics. We start by looking at the results of our best performing Credibility-inspired reranking and Combined reranking runs as compared to the baseline. The plots in Fig. 4 show the increase or decrease on precision metrics for each topic when comparing the two approaches to the baseline.

Fig. 4 Comparing the baseline against (Left) Credibility-inspired reranking (post-level indicators) and (Right) Combined reranking (post-level + comments + pronouns). A positive bar indicates the topic improves over the baseline, a negative bar indicates a drop compared to the baseline

The plots show some interesting differences between the two reranking approaches. First, both approaches have topics on which they improve over the baseline, as well as topics for which the baseline performs better. In general, we observe that Credibility-inspired reranking has more topics that improve over the baseline than Combined reranking, but also more topics that drop in performance. Both gains and losses are higher for Credibility-inspired reranking than for Combined reranking. The actual numbers of topics going up or down for both approaches compared to the baseline are listed in Table 17.

Table 17 Number of topics that increase or decrease as compared to the baseline for both approaches on precision metrics

We move on to the analysis of a selection of individual indicators. Figure 5 shows similar plots for four individual indicators; we only show precision at 5 to keep the number of plots limited.

Fig. 5 Comparing the baseline against (Left) Credibility-inspired reranking and (Right) Combined reranking on precision at 5 for four individual indicators: (1) quality, (2) timeliness, (3) comments, and (4) expertise. A positive bar indicates the topic improves over the baseline, a negative bar indicates a drop compared to the baseline

The quality indicator shows similar behavior as the combinations of indicators: numbers for Credibility-inspired reranking are higher across the board. This pattern is, however, not so strong for timeliness and comments, where both approaches show similar behavior (i.e., equal number of topics increasing and decreasing compared to the baseline). We included the expertise indicator to show that, although overall performance of this indicator was below the baseline, we can improve over the baseline for a number of topics (32 topics for Credibility-inspired reranking and 30 for Combined reranking).

Finally, we compare the two reranking approaches in the same way: per topic. Figure 6 shows the number of topics that prefer either Credibility-inspired reranking (“negative” bars) or Combined reranking (“positive” bars) on the precision metrics.

Fig. 6

Comparing Credibility-inspired reranking (post-level indicators), as baseline, to Combined reranking (post-level + comments + pronouns) on (Top left) RR, (Top right) P5, and (Bottom) P10. A positive bar indicates that Combined reranking makes the topic improve over Credibility-inspired reranking; a negative bar indicates the opposite

The plots show that both reranking approaches have topics on which they clearly outperform the other, although in general the Credibility-inspired reranking is preferred for slightly more topics. To be precise, Credibility-inspired reranking is preferred for 30 (RR), 34 (P5), and 40 (P10) topics, whereas Combined reranking is preferred for 26 (RR), 27 (P5), and 34 (P10) topics.

7.3.1 Very early precision

We shift our focus to MRR, which captures the ability to rank the first relevant post as high as possible. We see that our Combined reranking approach is capable of moving the first relevant post from position 2 to position 1 for 13 topics, while another 16 topics show an increase in RR as well. On the other hand, only 9 topics show a decrease in RR. Table 18 shows on the left-hand side the topics that improve the most after reranking and on the right-hand side the topics that drop the most.

Table 18 Topics that increase or decrease most on RR using Combined reranking (post-level indicators + comments + pronouns), compared to the baseline
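As a brief reminder of the metric involved, reciprocal rank (RR) is the inverse of the rank of the first relevant post and MRR averages this over topics; the sketch below (toy data, hypothetical post ids) makes this concrete and shows why moving the first relevant post from rank 2 to rank 1 doubles RR.

# Sketch: reciprocal rank (RR) and mean reciprocal rank (MRR) on toy data.

def reciprocal_rank(ranking, relevant):
    """RR = 1 / rank of the first relevant post; 0 if none is retrieved."""
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(rankings, qrels):
    """Average RR over all topics; rankings and qrels are keyed by topic id."""
    return sum(reciprocal_rank(rankings[t], qrels[t]) for t in rankings) / len(rankings)

# Moving the first relevant post from rank 2 to rank 1 raises RR from 0.5 to 1.0.
print(reciprocal_rank(["p1", "p2", "p3"], {"p2"}))  # 0.5
print(reciprocal_rank(["p2", "p1", "p3"], {"p2"}))  # 1.0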

We perform the same comparison between Credibility-inspired reranking using post-level indicators and the baseline. Table 19 shows the topics that show the largest difference on RR between the two runs. In total, 42 topics go up in RR, and 24 go down.

Table 19 Topics that increase or decrease most on RR using Credibility-inspired reranking (post-level indicators) compared to the baseline

The tables allow for some interesting observations. For example, for topic 921 (“christianity today”) both approaches find it hard to keep a relevant post at the first position, and the same goes for topic 943 (“censure”). Credibility-inspired reranking manages to push the first relevant result up considerably for topics 893 (“zyrtec”) and 1012 (“ed norton”), whereas these topics drop for Combined reranking. All other topics that increase or decrease differ between the two approaches, which again supports the notion that certain topics are helped by Credibility-inspired reranking and others by Combined reranking.

7.4 Impact of parameters on precision

So far, we have only reranked the top 20 of the initial ranking. What happens if we change the value of n and rerank not the top 20, but the first 15 or the first 500 results of the ranking? We first explore the impact of different values of n on the precision metrics for Credibility-inspired reranking, and then look at Combined reranking.
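To make the role of n concrete, the following sketch (hypothetical function and variable names; the toy scores and judgments do not come from our experiments) reranks only the first n results of a baseline ranking by a secondary score and reports precision at 5 and 10 for several values of n.

# Sketch: reranking only the top n of a baseline ranking and measuring precision.
# All ids, scores, and judgments below are fabricated for illustration; in our
# setting the secondary score would be the credibility-inspired score of a post.

def rerank_top_n(ranking, secondary_score, n):
    """Reorder the first n results by secondary_score; keep the tail as-is."""
    head = sorted(ranking[:n], key=lambda d: secondary_score.get(d, 0.0), reverse=True)
    return head + ranking[n:]

def precision_at_k(ranking, relevant, k):
    return sum(1 for d in ranking[:k] if d in relevant) / k

# Toy topic: a baseline ranking of 30 posts, a few credibility-like scores,
# and a set of relevant posts.
baseline = [f"d{i}" for i in range(1, 31)]
cred = {"d7": 0.9, "d2": 0.8, "d15": 0.7, "d1": 0.3}
relevant = {"d1", "d2", "d7", "d15"}

for n in (15, 20, 25, 30):
    reranked = rerank_top_n(baseline, cred, n)
    print(n, precision_at_k(reranked, relevant, 5), precision_at_k(reranked, relevant, 10))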

The plot in Fig. 7 shows the change in performance for Credibility-inspired reranking on the precision metrics for increasing values of n. We start at n = 15, so that we can still measure a difference in P10 after reranking. On all metrics, performance drops quite rapidly as n increases, and it keeps dropping all the way up to n = 1,000.

Fig. 7

Influence of reranking top n (x-axis) on precision at 5 (P5) and 10 (P10) and MRR for Credibility-inspired reranking using post-level indicators

The best performance for Credibility-inspired reranking is achieved using either n = 15 (for P5 and MRR) or n = 25 (for P10). Results of these two runs and the baseline are reported in Table 20. The results for MRR using n = 15 are higher than before and show a significant increase over the baseline. For P5 and P10 the results are slightly higher, but are still not significantly better.

Table 20 Results for the best values of n (15 and 25), our baseline, and the run presented before (n = 20) for Credibility-inspired reranking (using post-level indicators). Significance tested against the baseline

Looking at Combined reranking, we find very stable performance on all metrics over all values of n. Smoothing the credibility scores with the initial retrieval score leads to improvements, but beyond rank 15–20 the ranking no longer changes. The best performance is already achieved using n = 20, so we do not present further results here.

7.5 Credibility-inspired ranking vs. relevance ranking

We have seen that the effect of using credibility-inspired indicators on blog post retrieval is positive, but why is this the case? One issue we should raise is that assessors in the blog post retrieval task are asked to judge whether a blog post is topically relevant for a given topic. This relevance is assessed regardless of other factors that could otherwise influence judgements (e.g., readability, opinionatedness, quality). Following this line of reasoning, we might wonder why credibility-inspired indicators have any effect on performance at all. To gain a better understanding of this matter, we explore the topics that show the biggest increase or decrease in terms of precision at 10 and identify reasons for the change in performance. Below we list the factors that are most influential in these performance changes.

Spam filtering:

We already discussed the issue of spam classification in Sect. 7.1. In this analysis we find that spam filtering is one of the main contributors to both improvements and drops in performance. By removing spam blogs, proper blog posts are promoted to higher ranks, leading to better results. Conversely, when spam classification fails and non-spam blogs are filtered out, non-relevant blog posts might take their place in the ranking, leading to a drop in performance.

Timeliness:

For topics that are time sensitive, the timeliness indicator is very influential. It often leads to relevant blog posts being pushed up in the ranking, while non-relevant blog posts are pushed down. Since this indicator is topic-dependent it does not influence all topics.

Semantics:

Another topic-dependent indicator, semantics, shows a large degree of influence on performance. As with the other indicators, semantics can make relevant posts move up the ranking and non-relevant posts down, but also the other way around.

Comments:

We observe that the number of comments a post receives is among the more influential indicators. One reason why this indicator has so much influence could be that the text of the comments is considered part of the blog post and is therefore taken into account when determining relevance. A larger number of comments leads to extra text associated with the post and possibly to a better match between blog post and topic.

Post length:

The influence of document length on retrieval performance has attracted a lot of interest over the years and is well studied (see, e.g., Losada and Azzopardi 2008; Singhal et al. 1996). In this paper we also find that post length is one of the indicators with the most influence on performance.

We observe that the credibility-inspired indicators each have their own reasons for improving (topical) blog post retrieval performance. The credibility framework, however, offers us a principled way of combining these indicators and leaves room to include other indicators as well. Moreover, although we do not have the test collections to prove it, anecdotal evidence suggests that the credibility-inspired indicators do indeed push more credible posts up the ranking.

8 Conclusions

In this paper we explore the use of ideas from a credibility framework in blog post retrieval. Based on a previously introduced credibility framework for blogs, we define several credibility-inspired indicators, divided into post-level and blog-level indicators. Post-level indicators include spelling mistakes, correct capitalization, use of emoticons, punctuation abuse, document length, timeliness (with respect to related news events), and the degree to which a post's semantics match those of formal (news) text. On the blog level we introduce the following indicators: average number of comments, average number of pronouns, regularity of posting, coherence of the blog, and the expertise of the blogger.

Since the task at hand is precision-oriented and we expect credibility to help precision, we draw inspiration from the credibility framework in a reranking approach and introduce two ways of incorporating the credibility-inspired indicators into our blog post retrieval process. The first approach, Credibility-inspired reranking, simply reranks the top n of a baseline run based on the credibility-inspired score. The second approach, Combined reranking, multiplies the credibility-inspired score of each of the top n results by its retrieval score and reranks based on this combined score.
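A minimal sketch of these two strategies is given below, under the assumption that a retrieval score and a credibility-inspired score are available per post; all names and values are hypothetical and not taken from our actual implementation.

# Sketch: the two reranking strategies on toy data. The scores are assumed to
# be precomputed per post; the values are invented for illustration.

def credibility_reranking(ranking, credibility_score, n=20):
    """Rerank the top n purely by the credibility-inspired score."""
    head = sorted(ranking[:n], key=lambda d: credibility_score[d], reverse=True)
    return head + ranking[n:]

def combined_reranking(ranking, retrieval_score, credibility_score, n=20):
    """Rerank the top n by the product of retrieval and credibility scores."""
    head = sorted(ranking[:n],
                  key=lambda d: retrieval_score[d] * credibility_score[d],
                  reverse=True)
    return head + ranking[n:]

# Toy example with three posts in the baseline top n.
ranking = ["p1", "p2", "p3"]
retrieval = {"p1": 0.9, "p2": 0.8, "p3": 0.7}
credibility = {"p1": 0.50, "p2": 0.55, "p3": 0.90}

print(credibility_reranking(ranking, credibility, n=3))          # ['p3', 'p2', 'p1']
print(combined_reranking(ranking, retrieval, credibility, n=3))  # ['p3', 'p1', 'p2']

Note how, in this toy example, multiplying by the retrieval score keeps the highest-scoring baseline post ahead of a marginally more “credible” one, which mirrors the smoothing behavior of Combined reranking discussed above.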

Results show that Credibility-inspired reranking leads to larger improvements over the baseline than Combined reranking, but both approaches are capable of improving over an already strong baseline. For Credibility-inspired reranking the best performance is achieved using a combination of all post-level indicators. Combined reranking works best using the post-level indicators combined with comments and pronouns. The blog-level indicators expertise, regularity, and coherence do not contribute positively to the performance, although analysis shows that they can be useful for certain topics.

Analyses revealed that reranking on credibility-inspired scores alone (Credibility-inspired reranking) leads to higher gains and larger drops: its absolute scores are higher than those of Combined reranking, but less stable. Combined reranking improves significantly over the baseline on MRR and P5, whereas Credibility-inspired reranking only does so after optimizing n to 15. Examples of posts that are affected by the reranking approaches indicate that we get the desired effect of moving credible posts up the ranking, but this is not always reflected in retrieval performance, as our test collection does not allow for direct measurement of credibility. We identified the most influential indicators and explained why these indicators lead to improvements in retrieval performance.

Concluding, in this paper we have shown that we can translate certain credibility indicators into measurable indicators derived from blog posts and their blogs. Applying two reranking approaches shows that the (early) precision of blog post retrieval can benefit from incorporating credibility-inspired indicators. Interestingly, ignoring the original retrieval score when reranking leads to the highest scores, although combining the two scores leads to improvements in precision that are more often statistically significant. The credibility framework offers us a principled way of adding indicators to a retrieval model, although the real effect on credibility ranking needs to be examined when an appropriate collection becomes available. Future work focuses on the blog-level indicators, which have proven harder to estimate than post-level indicators. We believe that blog-level indicators are important, but that we need other ways of estimating coherence, regularity, and expertise. An important future direction is the direct measurement of credibility using our indicators; for this, we need new collections or assessments.