1 Introduction and motivations

In text information retrieval (IR), query expansion (QE) refers to the techniques and algorithms that reformulate the original query by adding new terms to it, in order to improve retrieval effectiveness. Many query expansion techniques have been developed over the past decades; an interesting survey on QE is given in Carpineto and Romano (2012).

In the literature, query expansion approaches are mainly classified as global (Xu et al. 1996), local (Buckley et al. 1994) and external approaches. Roughly speaking, global approaches expand the query by adding new query terms that are statistically related to the initial query terms. Local approaches, on the other hand, use the documents retrieved in response to the initial query, mostly through pseudo-relevance feedback (PRF) (Xu et al. 1996). External approaches rely on external resources such as encyclopedic knowledge extracted from wikipedia (Li et al. 2007), or conceptual knowledge derived from ontologies (Bhogal et al. 2007). In addition, hybrid QE approaches, which combine two (or more) QE methods, are also possible; examples include Han and Chen (2009) and Ko et al. (2008). These approaches are described in the next section.

In this paper, while considering textual and microblog collections, our contributions address the tweet search, ad-hoc IR and tweet contextualization tracks. These tracks result from the explosive growth of textual resources on the web, especially in microblogs. Indeed, microblog retrieval has drawn tremendous attention in recent years, which led TREC to introduce a track for ad-hoc microblog retrieval in 2011 (Ounis et al. 2011), releasing large tweet collections and annotations for various queries. Different approaches were investigated for microblog retrieval to cope with the special nature of microblog messages, i.e., their short, noisy and time-sensitive character. One of the main challenges in microblog retrieval is term mismatch due to short queries. In the recent literature, the term mismatch problem in microblog posts is tackled through various techniques (Meij et al. 2012; Jabeur et al. 2012). Among them, we are interested in those using query expansion (Lau et al. 2011; Massoudi et al. 2011; Bandyopadhyay et al. 2012; Shekarpour et al. 2013; Lv et al. 2015). Our proposed model for expanding queries is divided into two main phases, namely term generation and term selection. Our work addresses both text retrieval and microblog search by using, on the one hand, the implicit knowledge provided by advanced text mining methods, especially association rules (Agrawal and Skirant 1994), and, on the other hand, knowledge extracted from external resources such as wikipedia and dbpedia (Aggarwal and Buitelaar 2012). Moreover, we tackle the issue of expansion term selection with respect to the semantic relatedness between the original query terms and the candidate terms (Luo et al. 2012; Klyuev and Haralambous 2011; Bouchoucha et al. 2014).

Thus, in order to enhance expansion term generation, we rely on association rules (Agrawal and Skirant 1994) between terms, which consist in extracting relations between terms based on a global analysis of a document collection. These rules convey statistical relations between terms that are used in an automatic query expansion process. It is also interesting to note that QE approaches based on association rules do not require a priori knowledge or a complicated linguistic process: they are fully automatic, without human intervention or external knowledge resources (thesaurus, ontology, etc.). Previous studies have highlighted the efficiency of association rules in the IR field, e.g., Martín-Bautista et al. (2004), Song et al. (2007), Latiri et al. (2012), Wei et al. (2000), Liu et al. (2013), Belalem et al. (2016). The extraction of association rules between terms is performed in two steps. The first step consists in extracting termsets, i.e., sets of terms: a document collection can be seen as a family of termsets drawn from a global set of index terms. The second step consists in generating the association rules. An association rule is a relation \(T_1 \Rightarrow T_2\), where \(T_1\) and \(T_2\) are two termsets. The advantage of the insight gained through association rules is the contextual nature of the discovered inter-term correlations: the confidence of an association rule approximates the probability of having the terms of the conclusion in a document, given that those of the premise are already there.

In this paper, we investigate how external resources and PRF can be combined with association rules mining to enhance expansion term generation and selection. This leads to a hybrid query expansion approach, denoted HQE in the remainder of the paper, that proposes an efficient synergy between local, global and external QE methods. We propose three approaches for incorporating additional knowledge when generating expansion terms, namely: (i) a statistical approach which relies on association rules mining to discover strong correlations between terms, handled as a local method (PRF) combined with a global method: if the original query terms are included in the premises of mined rules, the query is expanded with the terms contained in the conclusion parts of the selected rules; (ii) a semantic approach which consists in exploring wikipedia articles, especially the definition parts of the articles, and extracting information from them to expand the original query; and (iii) a conceptual approach based on the dbpedia ontology, which consists in accessing the dbpedia data set on the Web and extracting information related to a given query. The last two approaches, i.e., the semantic and conceptual ones, belong to external QE methods. Furthermore, the proposed HQE model can be applied in different configurations, starting from the statistical method based on association rules and combining it with semantic and conceptual knowledge. The driving idea behind combining these methods is to obtain performance results better than the best individual ones. This is achieved by combining several independent query expansion results and choosing the ones that outperform the baseline.

In addition to this term generation, we also propose to enhance the selection of good expansion terms by introducing a new semantic relatedness measure, named ESAC.

This measure combines the wikipedia-based Explicit Semantic Analysis (ESA) measure (Gabrilovich and Markovitch 2007) and the confidence metric of association rules (Agrawal and Skirant 1994). It estimates a semantic relatedness score between the query and the relevant terms extracted using association rules. Note that ESAC considers both encyclopedic and correlation knowledge about terms; this is a key factor in finding precise terms for automatic query expansion.

We validate the proposed HQE model through two kinds of evaluations. The first experiments are devoted to IR tasks involving potential mismatches between queries and documents, namely the TREC 2011 microblog search and TREC Robust 2004 ad-hoc tracks. We thoroughly evaluate to what degree our proposals aid retrieval effectiveness. The second experimental validation is dedicated to the tweet contextualization task at INEX 2013 and 2014, which aims at providing, for a given tweet, a context built from wikipedia articles that makes the tweet clear for a reader. In this case, our HQE model is able to properly extend the original tweet queries so as to retrieve relevant and diverse wikipedia documents, leading to a higher context quality.

The remainder of the paper is organized as follows: Sect. 2 discusses related work on query expansion for information retrieval. Section 3 introduces some basic definitions related to our work. Then, in Sect. 4, a detailed description of our HQE model for query expansion is presented. Section 5 is devoted to experimental validation within two ad-hoc IR tasks on text and microblog collections. In Sect. 6, we describe the embedding of the proposed query expansion model in the INEX tweet contextualization task. The “Conclusion” section wraps up the article and outlines future work.

2 Related work

In this section, we discuss query expansion approaches and position our HQE model, which relies on statistical, semantic and conceptual methods for generating candidate expansion terms. Candidate terms can either be extracted from external resources such as wikipedia, dbpedia, etc., known as external-resource-based QE (Al-Shboul and Myaeng 2014), or from the documents themselves, based on their links with the initial query terms, known as document-based QE. In the literature, document-based QE approaches fall into two major classes: global approaches and local approaches (Carpineto and Romano 2012). Here, we review the efforts on both of them.

Local QE methods use the documents retrieved in response to the initial query. They mainly refer to relevance feedback and pseudo-relevance feedback (Buckley et al. 1994) approaches to reformulate the query, using the top-ranked documents retrieved by the original query. However, the top retrieved documents may not always provide good terms for expansion, particularly for difficult or short queries with few relevant documents in the collection that share relevant terms. In such cases, these methods lead to topic drift and negatively impact the results (Macdonald and Ounis 2007).

Cao et al. (2008) re-examined the PRF assumption that the most frequent terms in the pseudo-feedback documents are useful for retrieval, and showed that it does not hold in reality. In Chen and Lu (2010), the authors showed that relevant expansion terms cannot be distinguished from bad ones merely by their distributions in the feedback documents and in the whole collection; they proposed to integrate a term classification process to predict the usefulness of expansion terms. More recently, Colace et al. (2015) demonstrated the effectiveness of a new expansion method that extracts weighted word pairs from relevant or pseudo-relevant documents. They also applied learning-to-rank methods to select useful terms from a set of candidate expansion terms within a PRF framework, and their results showed that the QE method based on their new structure outperforms the baseline (Colace et al. 2015). Moreover, to take advantage of word embedding representations, Almasri et al. (2016) explored the use of relationships extracted from deep learning vectors for QE. They showed that word embeddings are a promising source for query expansion by comparing them with PRF and with an expansion method based on mutual information.

Global QE methods Unlike local methods, global QE methods draw candidate terms from the entire document collection rather than just the (pseudo-)relevant documents. In Xu et al. (1996), the authors showed that global analysis techniques produce results that are both more effective and more predictable than simple local feedback. Such QE approaches generally extract relationships between terms across the whole document collection based on their co-occurrences, where the co-occurrence window is the document.

In Järvelin et al. (2001), the authors developed a deductive data model for concept-based query expansion. It is based on three abstraction levels: the conceptual, linguistic and string levels. Concepts and the relationships among them are represented at the conceptual level; the linguistic level gives natural language expressions for concepts; and each expression has one or more matching patterns at the string level. In Gong et al. (2006), the authors used WordNet and a Term Semantic Network (TSN) to develop word co-occurrence-based thesauri, the TSN acting as a filter and a supplement for WordNet. However, the thesaurus construction strategy was noticed to be complex and tedious.

In addition to global approaches based on thesaurus construction, we focus on association rules mining, which aims at discovering correlated patterns (Agrawal and Skirant 1994) in the document collection. An association rule binds two sets of terms, namely a premise and a conclusion, meaning that the conclusion occurs whenever the premise is observed in the set of documents. Each association rule is assigned a confidence value that measures the likelihood of the association. It has been shown in the literature that the use of such dependencies for QE can significantly increase retrieval effectiveness (Wei et al. 2000). Indeed, association rules reflect implicit and strong correlations between terms; using these correlations to expand queries enriches the query representation with a set of related terms and consequently improves retrieval performance by matching additional documents. The authors in Tangpong and Rungsawang (2000) obtained a small improvement when using the apriori algorithm (Agrawal and Skirant 1994) with a high confidence threshold (more than \(50\%\)), which generated a small number of association rules; with a lower confidence threshold (\(10\%\)), they obtained better results (Tangpong and Rungsawang 2000). In Haddad et al. (2000), the authors proposed a similar approach, achieving improvements when using the apriori algorithm to extract association rules, with the best improvements obtained for low confidence values. The approach in Martín-Bautista et al. (2004) refines the query based on association rules: given an initial set of documents retrieved from the web, text transactions are constructed and association rules are mined; these rules are then used by the user to add terms to the query and improve retrieval precision. A mining algorithm better adapted to text, which avoids redundancy in the mined association rules, is proposed in Latiri et al. (2012): non-redundant association rules between terms are used to expand the user query with all the terms appearing in the conclusions of rules whose premise is contained in the original query. Experimental evaluation of this approach shows an improvement of the IR task. Closer to our work, the authors in Song et al. (2007) proposed a novel semantic query expansion technique that combines association rules with ontologies and Natural Language Processing techniques. This technique uses the explicit semantics as well as other linguistic properties of an unstructured text corpus: it incorporates the contextual properties of important terms discovered by association rules, and ontology entries are added to the query after word sense disambiguation.

External QE methods These approaches generate expansion terms from resources other than the target corpus. Many approaches use the wikipedia corpus for query expansion, as it is the largest encyclopedia freely available on the web. Although manually developed, its contents are well structured and grow rapidly over a wide variety of topics, which makes it a good knowledge source for query expansion. In Li et al. (2007), the authors used the wikipedia corpus for query expansion by exploiting the category assignments of the articles: the initial query was run against the wikipedia collection, each category was assigned a weight proportional to the number of top-ranked articles assigned to it, and the articles were re-ranked based on the sum of the weights of the categories to which they belonged. For short query expansion, the authors in Almasri et al. (2013) proposed a semantic approach that expands short queries with semantically related terms extracted from wikipedia. More recently, Gan and Hong (2015) proposed a new approach that extracts additional term relationships from a Markov network for query expansion, where term relationships extracted from the wikipedia corpus are superimposed on a basic Markov network pre-built from a single local corpus.

Nevertheless, the aforementioned methods rely on counting word co-occurrences in the documents to select expansion terms, although co-occurrences are not always a good indicator of relevance, since some words are merely background words of the whole collection. In order to select good expansion terms, Explicit Semantic Analysis (ESA) has been adopted in some contributions, such as Luo et al. (2012), where the authors used ESA to estimate two kinds of relevance weights: the relevance weight between the query and the relevant words extracted from the top-ranked documents in the initial retrieval results, and the relevance weight between each query word and the relevant words extracted from the snapshot of the Google search result obtained when that query word is used as a search keyword. The estimated relevance weights are used to select good expansion words for a second retrieval. Klyuev and Haralambous (2011) investigated the efficiency of their EWC semantic relatedness measure in an ad-hoc retrieval task. This measure combines the wikipedia-based Explicit Semantic Analysis measure, the WordNet path measure and a mixed collocation index. The conducted experiments demonstrated promising results.

Furthermore, hybrid approaches have achieved promising results in tackling query expansion issues. The authors in Ko et al. (2008) use a statistical query expansion technique together with pseudo-relevance feedback and query summarization techniques, trying to generate an effective snippet upfront. The authors in Selvaretnam et al. (2013) use linguistic and statistical techniques for query structure classification applied to query expansion. The authors in Han and Chen (2009) propose a hybrid method for query expansion (HQE) that combines ontology-based collaborative filtering and neural networks: the ontology-based collaborative filtering is used to analyze semantic relations, while radial basis function networks are used to retrieve the most relevant web documents. Their method enhances precision and allows the user to provide less query information upfront compared to other traditional methods.

In addition, with the proliferation of microblogging platforms, dealing with microblogs has become increasingly important; and since microblog messages are short and, to some extent, ambiguous, QE has been widely used in microblog retrieval, such as tweet retrieval (Lv et al. 2015). Bandyopadhyay et al. (2012) used external corpora as a source of query expansion terms: they relied on the Google Search API (GSA) to retrieve pages from the Web and expanded the queries using their titles. Lau et al. (2011) proposed a twitter retrieval framework that focuses on topical features, combined with query expansion using PRF, to improve microblog retrieval results. Massoudi et al. (2011) developed a language modeling approach tailored to microblogging characteristics, where redundancy-based IR approaches cannot be used in a straightforward manner. They enhanced this model with two groups of quality indicators, textual and microblog specific, and additionally proposed a dynamic query expansion model for microblog post retrieval.

QE is also used to expand microblog posts. A popular task is the INEX Tweet Contextualization task (Bellot et al. 2016), which addresses the problem of enriching a tweet in order to generate its context and make it more understandable. In Morchid et al. (2013), the authors used Latent Dirichlet Allocation (LDA) to obtain a representation of the tweet in a thematic space. This representation allows finding a set of latent topics covered by the tweet, and gives good results for the tweet contextualization task. In Zingla et al. (2014), association rules between terms are used to extend the tweet: the authors projected the terms of the tweet onto the rules' premises and added the conclusions to the original tweets. The obtained results showed an interesting improvement within the tweet contextualization task.

In this paper, we revisit these QE approaches by proposing a hybrid QE model in which the generation of expansion terms relies on local, global and external QE methods. The driving idea behind our proposal is to enhance QE results by combining statistical, semantic and conceptual methods to generate new terms related to a given query. These methods use, respectively, correlation knowledge represented by association rules between terms, semantic knowledge from wikipedia, and conceptual knowledge extracted from the dbpedia ontology. Furthermore, using the new relatedness measure ESAC, the proposed HQE model leads to different configurations that we validate on two ad-hoc IR tasks and one tweet contextualization task.

It is worth noting that this paper is a substantial extension of Zingla et al. (2016), as it provides a more detailed formalization of the HQE model and more developed experiments on different IR tasks.

In the next section, we introduce the basic definitions related to our proposed query expansion model.

3 Basic definitions

As mentioned above, query expansion is a technique used in information retrieval to solve the word mismatch between queries and documents. Expansion words are usually selected by counting word co-occurrences in the documents. However, word co-occurrences are not always a good indicator of relevance, since some words are merely background words of the whole collection. To alleviate this issue, we introduce a QE model with twofold improvements with respect to the candidate expansion terms: (1) we rely on association rules between terms to derive efficient candidate terms (Latiri et al. 2012); and (2) in order to select good expansion terms, Explicit Semantic Analysis (ESA) is adopted in our model to estimate a semantic relatedness score between the query and the relevant terms extracted from association rules (Luo et al. 2012). In this respect, after introducing some notations, we state the formal definitions of the concepts related to association rules and Explicit Semantic Analysis used in the remainder of the paper. Table 1 provides an overview of the notations used in this and the later sections.

Table 1 Summary of notations

In this paper, we represent a query q as a set (bag) of terms, as follows:

$$\begin{aligned} q=\left\{ t_{q1},\dots ,t_{qn} \right\} \end{aligned}$$
(1)

where \(t_{qi}\) is a term of the query q and \(i \in \{1,\dots ,n\}\).

3.1 Overview of extracting association rules from texts

Definition

By analogy with the transactions used in data mining, where each transaction is represented by \(\langle\)Id-transaction, list-items\(\rangle\), we define a transaction in the text mining framework as follows: \(\langle\)Id-document, list of terms contained in the document\(\rangle\).

Basic formalism

Consider a set of n terms \(V=\{t_1, t_2, \dots ,t_n\}\) and a corpus of m documents \(C = \{d_1, d_2, \dots , d_m\}\). Each document \(d_i \in C\) contains a subset of terms \(T \subseteq V\), called a termset.

An association rule (R) binds two termsets, which respectively constitute its premise (\(T_{1}\)) and conclusion (\(T_{2}\)) parts (Agrawal and Skirant 1994). More than a simple assessment of pair-wise term occurrences, a rule approximates the probability of having the terms of the conclusion (\(T_{2}\)) in a document, given that those of the premise (\(T_{1}\)) are already there. The advantage of the insight gained through association rules lies in the contextual nature of the discovered inter-term correlations.

Given a termset T, the support of T is the number of documents in the document collection \({\mathcal {C}}\) containing all the terms of T. The absolute support of T is formally defined as follows (Han et al. 2000):

$$\begin{aligned} {\textit{Supp}}(T) = |\lbrace d|d \in {\mathcal {C}} \wedge \forall t \in T : (d, t) \in I \rbrace | \end{aligned}$$
(2)

The relative support of T is equal to:

$$\begin{aligned} \frac{{\textit{Supp}}(T)}{|{\mathcal {C}}|} \end{aligned}$$
(3)

where

  • \(I \subseteq {\mathcal {C}} \times V\) is a binary (incidence) relation. Each couple \((d, t) \in I\) indicates that the document d contains the term t.

A termset T is said to be frequent if its support value, i.e., \({\textit{Supp}}(T)\), is greater than or equal to a user-defined threshold denoted minsupp.

A termset is said to be closed if none of its immediate supersets has the same support as the termset itself. Note that in the remainder of the paper, we use the absolute support, i.e., Eq. (2).

Given a rule R: \(T_1\) \(\Rightarrow\) \(T_2\), the support of R is computed as follows:

$$\begin{aligned} {\textit{Supp}}(R) = \textit{Supp}(T_1 \cup T_2). \end{aligned}$$
(4)

An association rule R is said to be frequent if its support value, i.e., \({\textit{Supp}}(R)\), is greater than or equal to a user-defined threshold denoted minsupp. The confidence of R is computed as follows:

$$\begin{aligned} {\textit{Conf}}(R) = \frac{\displaystyle {\textit{Supp}}(T_1 \cup T_2)}{\displaystyle {\textit{Supp}}(T_1)}. \end{aligned}$$
(5)

An association rule R is said to be valid if its confidence value, i.e., \({\textit{Conf}}(R)\), is greater than or equal to a user-defined threshold denoted minconf. This confidence threshold is used to discard invalid rules.
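To make these definitions concrete, the following minimal Python sketch computes the absolute support of Eq. (2) and the confidence of Eq. (5) on a toy collection; the documents and terms are invented for the example, and each document is represented simply by its set of terms (i.e., by the incidence relation I restricted to that document).

# Toy collection C: each document is a set of terms (illustrative data).
C = [
    {"jaguar", "car", "speed"},      # d1
    {"jaguar", "animal", "jungle"},  # d2
    {"car", "speed", "engine"},      # d3
    {"jaguar", "car", "engine"},     # d4
]

def supp(termset):
    # Absolute support (Eq. 2): number of documents containing all terms of T.
    return sum(1 for d in C if termset <= d)

def conf(premise, conclusion):
    # Confidence (Eq. 5): Supp(T1 u T2) / Supp(T1).
    return supp(premise | conclusion) / supp(premise)

# Rule R: {jaguar} => {car}
print(supp({"jaguar", "car"}))    # Supp(R) = 2 (d1 and d4)
print(conf({"jaguar"}, {"car"}))  # Conf(R) = 2/3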

3.1.1 Association rules extraction process

Given a set of n terms \(V =\{t_1, t_2, \dots ,t_n\}\) and a collection of m documents \(C =\{d_1, d_2, \dots ,d_m\}\), the extraction of association rules between terms satisfying predefined support and confidence thresholds, minsupp and minconf, is performed in two steps:

  • A minimum support threshold is applied to find all frequent termsets in the document collection. This phase consists in generating all termsets with a support greater than or equal to minsupp, called frequent termsets. It is the most complex phase of the extraction process, since it involves searching all possible termsets (term combinations), as in the context of exploring transactional databases (Agrawal et al. 1993). Several works in the literature are devoted to so-called bibliometric or information laws, formulated from empirical observations on textual corpora; among them, Zipf's law (Li 1992), which we use in our work, describes regularities in the frequency of term appearance in textual corpora.

  • A minimum confidence constraint is applied to these frequent termsets in order to form rules. Once the frequent termsets are derived, the generation of association rules is a fairly simple step, and algorithms for generating association rules from frequent itemsets can be adapted. In this work, we have adapted the charm algorithm (Zaki and Hsiao 2002) to extract association rules. A brute-force sketch of this two-step process is given below.
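The following brute-force Python sketch enumerates frequent termsets and then derives the valid rules on a toy collection. It is a didactic stand-in only: charm, described next, explores the search space far more efficiently.

from itertools import combinations

C = [{"jaguar", "car"}, {"jaguar", "animal"}, {"jaguar", "car", "speed"}]
V = set().union(*C)
minsupp, minconf = 2, 0.6

def supp(T):
    return sum(1 for d in C if T <= d)

# Step 1: generate all frequent termsets (Supp >= minsupp).
frequent = [frozenset(T) for k in range(1, len(V) + 1)
            for T in combinations(V, k) if supp(frozenset(T)) >= minsupp]

# Step 2: split each frequent termset into premise/conclusion and keep
# the valid rules (Conf >= minconf).
rules = [(set(p), set(T - p))
         for T in frequent for k in range(1, len(T))
         for p in map(frozenset, combinations(T, k))
         if supp(T) / supp(p) >= minconf]
print(rules)  # includes ({'jaguar'}, {'car'}) with confidence 2/3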

3.1.2 charm algorithm

The charm algorithm was proposed by Zaki and Hsiao (2002). Its originality lies in the fact that it favors a depth-first exploration of the search space. Another strength of charm is that it uses a vertical representation, called diffset, to accelerate the computation of the supports.

The task of mining association rules consists of two main steps, as mentioned before: finding the set of all frequent termsets, and then generating all high-confidence rules among termsets. With charm, it is not necessary to mine all frequent termsets in the first step; it suffices to mine the set of closed frequent termsets, which is much smaller than the set of all frequent termsets. Likewise, it is not necessary to mine the set of all possible rules.

In our case, given as input the set of text documents, a minimal support threshold minsupp and a minimal confidence threshold minconf, the algorithm derives all the association rules satisfying the confidence threshold.

An example of association rules is given in Table 2.

Table 2 Examples of association rules generated with charm (Zaki and Hsiao 2002) from wikipedia

We define the set of association rules mined from a collection \({\mathcal {C}}\) as follows:

$$\begin{aligned} {\mathcal {R}}_C= \lbrace R|\,{\textit{Conf}}({\textit{R}})\ge {\textit{minconf}} \,\, {\textit{and}} \,\,{\textit{Supp}}(\textit{R})\ge {\textit{minsupp}}\rbrace \end{aligned}$$
(6)

3.2 Explicit semantic analysis (ESA)

Explicit Semantic Analysis (ESA) is a semantic relatedness measure proposed by Gabrilovich and Markovitch (2007). It is a technique somewhat reminiscent of the vector space model widely used in information retrieval: documents are represented not by their occurring terms but by their similarity to concepts derived from wikipedia articles. Each wikipedia concept is represented as an attribute vector of the words occurring in the corresponding article, whose entries are assigned \(tf \times idf\) weights quantifying the strength of association between words and concepts. Thus, by comparing a document to all the articles of a wikipedia corpus that has been preprocessed by tokenization, stemming, stop word removal and term weighting, one obtains a vector containing a similarity value to each of the articles. A major advantage of ESA is that semantic relatedness can be computed for terms and documents alike, providing good and stable results for both.

Formally, the document collection is represented as an \(n \times m\) matrix M, called semantic interpreter, where n is the number of articles and m the number of occurring terms in the corpus. M contains (normalized) \(tf \times idf\) document vectors of the articles.

To evaluate the similarity between two texts, two terms, or a text and a term, the cosine similarity measure is employed. In this paper, ESA is used to compute the similarity between a query q and a candidate term t as follows (an illustrative sketch is given at the end of this subsection):

$$\begin{aligned} {\textit{ESA}}({q,t}) =\frac{\overrightarrow{q}\cdot \overrightarrow{t}}{\left\| \overrightarrow{q}\right\| \, \times \,\left\| \overrightarrow{t} \right\| }. \end{aligned}$$
(7)

where

  • \(\overrightarrow{q}\), \(\overrightarrow{t}\) are the vectors generated by ESA that represent, respectively, the query q and the term t.

  • \(\left\| \overrightarrow{q}\right\|\), \(\left\| \overrightarrow{t}\right\|\) are, respectively, the norms of the vectors \(\overrightarrow{q}\) and \(\overrightarrow{t}\).
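The following Python sketch (using scikit-learn) builds a tiny semantic interpreter M and evaluates Eq. (7); the three "articles" standing in for the wikipedia corpus are invented for the example.

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Semantic interpreter M: tf-idf matrix of the concept (article) corpus.
articles = [
    "The jaguar is a large cat native to the Americas.",
    "Jaguar Cars is a British manufacturer of luxury cars.",
    "The cheetah is the fastest land animal.",
]
vectorizer = TfidfVectorizer(stop_words="english")
M = vectorizer.fit_transform(articles)  # n_articles x n_terms

def esa_vector(text):
    # ESA vector: similarity of the text to every concept (article).
    return (M @ vectorizer.transform([text]).T).toarray().ravel()

def esa(query, term):
    # Eq. (7): cosine between the ESA vectors of the query and the term.
    q, t = esa_vector(query), esa_vector(term)
    denom = np.linalg.norm(q) * np.linalg.norm(t)
    return float(q @ t / denom) if denom else 0.0

print(esa("jaguar speed", "cat"))  # > 0: both relate to the first article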

In the next section, we introduce our HQE model which incorporates different external resources and combines local, global and external methods.

4 A hybrid query expansion model

In this section, we address the two basic issues of the QE process as detailed by Carpineto and Romano (2012), namely term generation and term selection. In this respect, we propose to use multiple knowledge resources in addition to the document collections, such as wikipedia and dbpedia, to diversify the expansion terms. More specifically, our model first automatically generates a list of candidate terms from each resource, and then selects terms (using a semantic relatedness measure) to build the expanded queries. Our goal is to demonstrate the effectiveness of QE incorporating different resources coupled with a semantic selection.

Fig. 1 Candidate terms generation

In HQE, we propose three candidate term generation methods (cf. Fig. 1):

  • The first one is based on a combination of local and global methods. Our hypothesis is that useful terms occur more frequently in relevant documents than in non-relevant ones. Hence, we use a local method to retrieve, from a corpus, a set of documents relevant to the query; a global method is then used to generate the candidate terms by mining association rules between terms from the retrieved documents;

  • The two others use different external knowledge sources to generate the candidate terms.

Fig. 2 The proposed hybrid model for query expansion

As depicted in Fig. 2, the HQE process is split into two phases, namely:

  • Candidate terms generation Phase 1 consists in generating the \({\textit{Candidate}}\_{\textit{Set}}(q)\) of a given query q;

  • Candidate terms selection Phase 2 consists in (i) defining the \({\textit{relatedness}}(q,t)\) function, (ii) ranking the terms of the \({\textit{Candidate}}\_{\textit{Set}}(q)\) set according to their relatedness score to the query, as returned by \({\textit{relatedness}}(q,t)\); and (iii) selecting the best ones to be added to the query.

These phases are detailed in the following.

4.1 Phase 1: candidate terms generation

Formally, given an original query \(q=\left\{ t_{q1},\dots ,t_{qn} \right\}\), the set of candidate terms for its expansion is called \({\textit{Candidate}}\_{\textit{Set}}(q)\):

$$\begin{aligned} {\textit{Candidate}}\_{\textit{Set}}(q)=\left\{ t_{1},\dots , t_{p} \right\} \end{aligned}$$
(8)

where \(t_i\) is a candidate term.

We propose three different methods for this phase. The first one, a combination of local and global methods, is called the statistical query expansion method; its objective is to generate, for a given query, statistically correlated terms, without taking their semantics into consideration, using the association rules mining technique. The second method is an external method called the semantic query expansion method; it aims at generating new terms that are semantically related to the query, extracted from the definitions of the query terms, thus taking the semantic aspect of the terms into consideration. The third one is also an external method, called the conceptual query expansion method; its goal is to generate new terms from an ontology by matching the query terms with the ontology concepts. We detail these three methods of the proposed model in the following.

4.1.1 Statistical expansion (STE)

The first method, denoted STE in the following, starts by retrieving a collection of documents C in response to a given query, using an IRS. It consists of a local method followed by a global one: a PRF step, since it uses the set of documents C retrieved from \({\mathcal {C}}\) in response to the original query, followed by association rules mining, which is applied to select candidate terms.

The collection of documents C is used to extract a set of association rules (\({\mathcal {R}}_C\)) that capture strong correlations between terms. The set of candidate terms generated by STE (\({\textit{Candidate}}\_{\textit{Set}}_{{\textit{STE}}}(q)\)) is built from all the association rules of \({\mathcal {R}}_C\) whose premise is included in q, as:

$$\begin{aligned} {\textit{Candidate}}\_{\textit{Set}}_{{\textit{STE}}}(q)=\bigcup \limits _{(T_1 \Rightarrow T_2) \,\in \,{\mathcal {R}}_C \,\text { such that }\, T_1 \subseteq q} T_2 \end{aligned}$$
(9)

The process of generating the candidate terms set \({\textit{Candidate}}\_{\textit{Set}}_{{\textit{STE}}}(q)\), for a given query q, is performed in the following steps (a sketch is given after the list):

  • Selecting a collection of documents C similar to the query, using an IRS. We use texts extracted from wikipedia articles as the document collection and take, using the Terrier system, the top-k answers to the query to build C;

  • Collection annotation: in order to extract the most representative terms, a linguistic preprocessing is performed on C using the part-of-speech tagger treetagger. In this work, we keep the common nouns and the proper nouns, since they are the most informative grammatical categories and are the most likely to represent the content of documents (Barker and Cornacchia 2000). A stoplist is used to discard very common functional English terms;

  • Mining the association rules from C expressing the correlations between terms, using the charm algorithm of Zaki and Hsiao (2002);

  • Generating the candidate terms set \({\textit{Candidate}}\_{\textit{Set}}_{{\textit{STE}}}(q)\) for q, using the mined association rules.
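A minimal Python sketch of Eq. (9) follows, with invented rules: the candidate set is the union of the conclusions of all mined rules whose premise is a subset of the query.

# Illustrative mined rules: (premise, conclusion) pairs of termsets.
rules = [
    ({"jaguar"}, {"car", "engine"}),
    ({"jaguar", "speed"}, {"race"}),
    ({"animal"}, {"jungle"}),
]

def candidate_set_ste(query, rules):
    # T1 in 2^q is equivalent to T1 being a subset of the query terms.
    return set().union(*(t2 for t1, t2 in rules if t1 <= query))

print(candidate_set_ste({"jaguar", "speed"}, rules))  # {'car', 'engine', 'race'}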

4.1.2 Semantic expansion (SE)

This method, denoted SE in the following, consists in extending the queries with semantically related terms that come from an external structured semantic knowledge source called RS. We assume that the documents in this resource are structured: they are textual documents dedicated to describing concepts represented by terms, and their structure contains a part dedicated to the definition of the concept. We define the \({\textit{Def}}_{{\textit{Semantic}}}(t, RS)\) function that returns, for a given term t, a semantic definition extracted from the documents of RS, as follows:

$$\begin{aligned} {\textit{Def}}_{{\textit{Semantic}}}(t, RS) = \lbrace t_1,\dots ,t_k\rbrace \end{aligned}$$
(10)

The set of candidate terms generated by SE (\({\textit{Candidate}}\_{\textit{Set}}_{\textit{SE}}(q)\)) is defined as follows:

$$\begin{aligned} {\textit{Candidate}}\_{\textit{Set}}_{{\textit{SE}}}(q)= \bigcup _{t\in q} {\textit{Def}}_{{\textit{Semantic}}}(t,RS) \end{aligned}$$
(11)

To achieve this, we use some heuristics:

  • First, given a query, we search all documents from RS that correspond to the query’s terms.

  • We extract, from these documents, the corresponding definitions;

  • We annotate these definitions by applying the same annotating process as described in Sect. 4.1.1, then, we extract a set of specific terms (i.e., nouns), that are the candidate terms to expand the original query.

For this QE method, we opted for wikipedia as the semantic knowledge source RS. The latter has the following significant features: wide knowledge coverage, rich semantic knowledge, a highly structured organization and a high speed of information update. It is therefore an ideal data resource to improve a QE process (Gan and Hong 2015). In our case, since wikipedia articles follow a predictable layout, the definition of an article is taken to be its first sentence and paragraph; a sketch of this extraction is given below.
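As an illustration of \({\textit{Def}}_{{\textit{Semantic}}}\) with wikipedia as RS, the Python sketch below fetches the lead section of an article, assuming the wikipedia REST summary endpoint; the crude token filter at the end is only a stand-in for the treetagger-based noun selection used in this work.

import requests

def def_semantic(term, lang="en"):
    # Lead section (definition part) of the wikipedia article for `term`.
    url = f"https://{lang}.wikipedia.org/api/rest_v1/page/summary/{term}"
    resp = requests.get(url, timeout=10)
    if resp.status_code != 200:
        return set()
    definition = resp.json().get("extract", "")
    # Stand-in for POS-based noun filtering: keep longer lowercased tokens.
    tokens = definition.replace(",", " ").replace(".", " ").split()
    return {t.lower() for t in tokens if len(t) > 3}

print(def_semantic("Jaguar"))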

4.1.3 Conceptual expansion (CE)

This method, denoted CE, relies on an external ontology O to extract related concepts to a given term t in the original query. We define the function Concept(tO) that returns the related concepts to t from the ontology O as follows:

$$\begin{aligned} {\textit{Concept}}(t, O) = \lbrace t_1,\dots ,t_p\rbrace \end{aligned}$$
(12)

The set of candidate terms generated by CE (\({\textit{Candidate}}\_{\textit{Set}}_{{\textit{CE}}}(q)\)) is defined as follows:

$$\begin{aligned} {\textit{Candidate}}\_{\textit{Set}}_{{\textit{CE}}}(q)=\bigcup _{t\in q} {\textit{Concept}}(t, O) \end{aligned}$$
(13)

For the CE method, we used dbpedia as the ontology O. It is an ontology extracted from wikipedia that represents the wikipedia content as Resource Description Framework (RDF) triples.

The process of generating the candidate terms set \({\textit{Candidate}}\_{\textit{Set}}_{\textit{CE}}(q)\), for a given query q, is performed in the following steps (a sketch is given after the list):

  • First, we project the query terms onto the ontology concepts. This matching is done using SPARQL;

  • We leverage the descriptions (rdf:type) of the concepts, as each description of a concept may contain related words, synonyms, or alternative terms that refer to the concept;

  • We use these descriptions to extend the original query.
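The sketch below, using the SPARQLWrapper library against the public dbpedia endpoint, illustrates \({\textit{Concept}}(t, O)\); the query shown (matching a resource by its English label and collecting the labels of its rdf:type descriptions) is one possible instantiation for illustration, not necessarily the exact query used in our experiments.

from SPARQLWrapper import SPARQLWrapper, JSON

def concept(term):
    sparql = SPARQLWrapper("https://dbpedia.org/sparql")
    sparql.setReturnFormat(JSON)
    sparql.setQuery(f"""
        PREFIX rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
        PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
        SELECT DISTINCT ?label WHERE {{
          ?s rdfs:label "{term}"@en ;
             rdf:type ?type .
          ?type rdfs:label ?label .
          FILTER (lang(?label) = "en")
        }} LIMIT 20
    """)
    results = sparql.query().convert()
    return {b["label"]["value"] for b in results["results"]["bindings"]}

print(concept("Jaguar"))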

4.2 Phase 2: candidate term selection

Roughly speaking, the relatedness function returns the relatedness score between a query q and a candidate term \(t \in \,{\textit{Candidate}}\_{\textit{Set}}(q)\):

$$\begin{aligned} {\textit{relatedness}}(q,t)= {\textit{score}} \in {\mathbb {R}} \end{aligned}$$
(14)

The term t is considered relevant with respect to the query q if and only if \({\textit{score}} \ge \mu\), where \(\mu\) is the minimal semantic relatedness threshold.

Consequently, the extended query is provided by selecting the most related terms among the \({\textit{Candidate}}\_{\textit{Set}}(q)\) as sketched in Eq. (15):

$$\begin{aligned} E_{q} = q \cup \lbrace t \in {\textit{Candidate}}\_{\textit{Set}}(q) \mid {\textit{relatedness}}(q, t) = {\textit{score}} \ge \mu \rbrace \end{aligned}$$
(15)

To define the relatedness function, we propose a new measure, denoted ESAC, that combines, through linear interpolation, the wikipedia-based Explicit Semantic Analysis (ESA) measure (cf. Sect. 3.2) and the confidence of association rules. When a term has a nonzero ESA value with the query but no association rule is available between this term and the query, we keep the ESA score only, so as to avoid over-penalizing the term. ESAC is defined as follows:

$$\begin{aligned} {\textit{relatedness}}(q,t) = {\textit{ESAC}}(q,t) = {\left\{ \begin{array}{ll} \alpha \times {\textit{ESA}}(q,t) + (1 - \alpha ) \times {\textit{Conf}}_{max}(R,q,t) &{} \text {if } {\textit{Conf}}_{max}(R,q,t) \ne 0;\\ {\textit{ESA}}(q,t) &{} \text {otherwise.} \end{array}\right. } \end{aligned}$$
(16)

where

$$\begin{aligned} {\textit{Conf}}_{ max}(R,q,t)= \max \limits _{t_q \in q, \,R \in {\mathcal {R}}_C} {\textit{Conf}}(R(t_q,t)) \end{aligned}$$
(17)

is the maximum confidence over all association rules \(R \in {\mathcal {R}}_C\) linking a term \(t_q\) of the query q to the candidate term t.
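A minimal Python sketch of Eqs. (16) and (17) follows; the ESA score and the mined rules (given here with their confidences) are assumed to come from the components described above, and all values are illustrative.

def esac(query, t, esa_score, rules, alpha=0.5):
    # Conf_max (Eq. 17): best confidence among rules linking a query term to t.
    confs = [c for (t1, t2, c) in rules if t1 <= query and t in t2]
    conf_max = max(confs, default=0.0)
    if conf_max != 0:
        # Linear interpolation between ESA and rule confidence (Eq. 16).
        return alpha * esa_score + (1 - alpha) * conf_max
    return esa_score  # no rule links q and t: keep the ESA score only

rules = [({"jaguar"}, {"car"}, 0.8), ({"speed"}, {"race"}, 0.6)]
print(esac({"jaguar", "speed"}, "car", esa_score=0.3, rules=rules))  # 0.55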

4.3 Configurations for our HQE model

Since we propose three different methods for the candidate term generation phase, namely STE, SE and CE, which generate different sets of candidate terms for each query q, and two alternatives for the selection phase based on the ESAC measure, we can derive the different configurations of expansion term sets depicted in Table 3.

Table 3 The different configurations of our HQE model

From the configurations identified in Table 3, we get (a sketch of the selection step follows the list):

  • \({\mathbf SE} _{ NoSelection}\) the set of the expansion terms stems from applying the semantic expansion (SE) method, aforementioned, without considering the selection phase: \(E_{q} = q \cup {\textit{Candidate}}\_{\textit{Set}}_{\textit{SE}}(q)\);

  • \({\mathbf SE} _{ Selection}\) the set of the expansion terms is derived by applying the semantic expansion (SE) method, aforementioned, involving the selection phase: \(E_{q} = q \cup \lbrace t \in {\textit{Candidate}}\_{\textit{Set}}_{\textit{SE}}(q) \mid {\textit{relatedness}}(q, t) = {\textit{score}} \ge \mu \rbrace\);

  • \({\mathbf STE} _{ NoSelection}\) the set of the expansion terms is generated by applying the statistical expansion (STE) method, aforementioned, without any selection phase: \(E_{q} = q \cup {\textit{Candidate}}\_{\textit{Set}}_{\textit{STE}}(q)\);

  • \({\mathbf {STE}} _{ Selection}\) the set of the expansion terms are generated by applying the statistical expansion (STE) method, aforementioned, and the selection phase: \(E_{q} = q \cup \lbrace t \in {\textit{Candidate}}\_{\textit{Set}}_{\textit{STE}}(q) \mid {\textit{relatedness}}(q, t) = {\textit{score}} \ge \mu \rbrace\);

  • \({\mathbf CE} _{ NoSelection}\) the set of the expansion terms stems from applying the conceptual expansion (CE) method, aforementioned, without any selection phase: \(E_{q} = q \cup {\textit{Candidate}}\_{\textit{Set}}_{\textit{CE}}(q)\);

  • \({\mathbf CE} _{\textit{Selection}}\) the set of the expansion terms stems from applying the conceptual expansion (CE) method, aforementioned, and the proposed selection phase: \(E_{q} = q \cup \lbrace t \in {\textit{Candidate}}\_{\textit{Set}}_{\textit{CE}}(q) \mid {\textit{relatedness}}(q, t) = {\textit{score}} \ge \mu \rbrace\);

  • \({\mathbf ALL} _{ Selection}\) the set of the expansion terms is drawn from the union of the candidate sets of SE, STE and CE: \(E_{q} = q \cup \lbrace t \in {\textit{Candidate}}\_{\textit{Set}}_{\textit{SE}}(q) \cup\) \({\textit{Candidate}}\_{\textit{Set}}_{\textit{STE}}(q) \cup {\textit{Candidate}}\_{\textit{Set}}_{\textit{CE}}(q) \mid {\textit{relatedness}}(q, t) = {\textit{score}} \ge \mu \rbrace\);

  • \({\mathbf ALL} _{ NoSelection}\) same as the ALL\(_{ Selection}\) without any selection phase.
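To make the configurations concrete, the following sketch assembles \(E_q\) per Eq. (15) for the ALL\(_{ Selection}\) configuration; the relatedness function stands for ESAC, and the candidate sets, threshold and scores are illustrative.

def expand(q, candidate_sets, relatedness, mu=0.4):
    # Eq. (15): original terms plus every candidate whose relatedness
    # to the query reaches the threshold mu.
    candidates = set().union(*candidate_sets)
    return set(q) | {t for t in candidates if relatedness(q, t) >= mu}

# Dummy relatedness: every candidate except 'cat' is deemed related enough.
sets = [{"car", "engine"}, {"cat", "americas"}, {"automobile"}]
print(expand({"jaguar"}, sets, lambda q, t: 0.1 if t == "cat" else 0.6))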

The remainder of this paper is devoted to the experimental validation of the HQE model in two different tasks, namely ad-hoc IR and tweet contextualization.

5 Experimental validation in IR

In order to prove the effectiveness of our HQE model, we focus on two cases in which the potential mismatch between the query expression and the documents has to be tackled, namely: (i) a social IR task dealing with tweet search on the TREC Microblog 2011 test collection, in which both the queries and the tweets are short; and (ii) an ad-hoc IR task on the TREC Robust 2004 collection, in which the queries are known to be difficult, mainly due to term mismatch. We study the performances of the different configurations of our HQE model and compare them with the classical PRF approach.

5.1 Test collections description

The following are the details of the two considered collections:

  • The TREC 2011 Microblog Track is a social text collection that addresses a real-time search task, where the user looks for the most recent information (tweets) relevant to the query. The TREC Microblog 2011 test collection contains 16 million tweets collected over a period of 2 weeks (Ounis et al. 2011). The task proposes 50 topics, with no narrative or description tags provided, so this collection is a good test for our proposal. Contrary to the task, our concern here is not to rank results according to time but to relevance; this is why we compare our configurations against each other but do not put our results in perspective with the official ones of the TREC 2011 Microblog Track. However, to keep our runs consistent with the assessments of this evaluation, we remove from our results the tweets with timestamps later than the query.

  • The TREC 2004 Robust retrieval track (Voorhees 2004) evaluates ad-hoc retrieval. This test collection includes four document collections composed of administrative documents and English newspaper articles, totaling more than half a million documents. As stated before, our concern is to study our proposal in the case of difficult queries; this is why we chose to focus only on the subset of queries called hard topics. These 50 topics, known to be difficult for current automatic systems, were drawn by the track organizers from topics 301–450 used in the TREC 2003 Robust track. We chose to focus on this task because Voorhees (2004) states that all of the top-performing runs [like Kwok et al. (2004)] in the TREC 2004 Robust retrieval track used the web to expand queries. Since we also propose a framework that relies on external Web data, such an experiment makes sense.

5.2 Experimental setup

All the experiments reported here are performed on the Terrier 4.0 IR platform (Ounis et al. 2005). We test our query expansion configurations with classical IR models, namely BM25 and language models with Dirichlet or Hiemstra (i.e., Jelinek-Mercer) smoothing, and we also consider the classical PRF approach. The IR model parameters are optimized using a fivefold cross-validation on sets of queries separate from the task queries (topics) described above: for the TREC Microblog task, the optimization is done on the INEX collection (Bellot et al. 2016) (see Sect. 6), as we assume similar behavior on both of these corpora; for the hard topics of the Robust track, it is done on the 249 non-hard topics of the TREC 2004 Robust Track. Table 4 summarizes the IR models, the set of parameters and the value ranges used for optimization.

Table 4 Cross validation parameter values per model
Table 5 Description of the IR test collections considered

In order to build association rules between terms for our TREC Microblog 2011 experiments, the tweets themselves are not a good source, so we used a set of 50,000 wikipedia articles overall. For the experiments on the TREC Robust 2004 task, we build the rules using the whole corpus. The rules are mined using the charm algorithm of Zaki and Hsiao (2002). The minimal support threshold minsupp is set experimentally as follows: we varied the minimum and maximum support thresholds, i.e., minsupp and maxsupp, w.r.t. the document collection size and term distributions. Considering the Zipf distribution of each collection, the maximum support threshold is set experimentally so as to discard trivial terms which occur in most documents and are therefore related to too many terms. The minimal threshold, on the other hand, eliminates marginal terms which occur in few documents and are thus not statistically important when occurring in a rule. In the same way, the minimal confidence threshold minconf of the mined association rules is defined by varying it between 0.4 and 1.0 with a step of 0.1. The best values of these thresholds, i.e., the absolute minsupp and minconf for the different collections, are given in Table 5.

For the selection of the terms most related to the original query, the parameters \(\alpha\) and \(\mu\) of the semantic relatedness function ESAC [cf. Eqs. (15, 16)] are optimized using fivefold cross-validation, according to the MAP evaluation measure with the BM25 IR model, on the same sets of queries as the other optimizations. Table 6 lists the set of parameters and the values used in the experiments.

Table 6 ESAC parameters ranges for optimization

Based on the different experiments conducted, the optimal values for \(\alpha\) and \(\mu\) are stable; we therefore set these parameters to \(\alpha = 0.5\) and \(\mu = 0.4\) in all of our experiments.

5.3 Experiments and results for tweets search (TREC 2011)

The TREC 2011 Microblog Track results show that tweet retrieval is far from being a solved problem; we expect our proposals to improve the quality of the results for this task. Once again, we underline the fact that our goal does not fit the official time-based evaluation of results, so we do not compare our results to the official ones of this evaluation campaign. The obtained results are depicted in Table 7, where the results are ranked by MAP for each of the three IR models studied. We report the precision at 5, 10 and 30 documents, the Mean Average Precision (MAP), as well as the percentage of MAP increase over the respective baseline of each IR model.

From Table 7 we see that:

  • For each model, all our query expansion configurations (filtered or not) outperform the non-expanded baselines, as well as the classical Pseudo-Relevance Feedback. PRF on very short documents does not seem to be a good solution; thus, our framework relying on external information is clearly beneficial.

  • Consistently, the configurations with term selection outperform their unfiltered counterparts. This shows that our filtering proposal based on the ESAC measure is beneficial for such a task;

  • For the three models, the best configuration is the one that relies on the statistical expansion using association rules, STE\(_{ Selection}\). The integration of all extensions also gives good results (third best) for the BM25 model;

  • The fifth best result for BM25 uses all extensions without any filtering: it thus seems that BM25 is better at managing the noisier queries.

We performed a bilateral paired Student t-test (Smucker et al. 2007) on the MAP results, comparing on one hand the baseline with our runs, and on the other hand the latter with the PRF run; the results show that the differences are almost always significant with \(p<0.05\) (cf. Table 7).
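For reference, this paired test can be reproduced per query with scipy; the per-query average precision values below are purely illustrative.

from scipy.stats import ttest_rel

# Bilateral paired Student t-test on per-query average precision values.
baseline_ap = [0.21, 0.35, 0.10, 0.42, 0.18]
expanded_ap = [0.29, 0.41, 0.12, 0.47, 0.25]
t_stat, p_value = ttest_rel(expanded_ap, baseline_ap)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")  # significant if p < 0.05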

Table 7 Comparative evaluation of retrieval effectiveness for TREC Microblog 2011

For further analysis of the effectiveness, we present in Table 8 a gain and failure analysis of the HQE model. It presents statistics on the queries for which STE\(_{ Selection}\) performs better (subset \(R^+\)), worse (subset \(R^-\)), or equally (subset \(R^=\)) compared to the baseline and to PRF, in terms of MAP. From Table 8, we see that our best configuration is clearly better: it outperforms PRF on more than \(85\%\) of the queries and the unexpanded baseline on almost \(94\%\) of them.

Table 8 Percentage of queries \(R^+\), \(R^-\) and \(R^=\) for which STE\(_{Selection}\) performs better than, worse than, or equal to the BM25 baseline and PRF BM25, in terms of MAP

We conclude from these experiments on TREC Microblog, i.e., from Tables 7 and 8, that our query expansion improves retrieval effectiveness in the case of short queries and short documents.

5.4 Experiments and results for TREC 2004 robust retrieval track

We conducted experiments using one or more elements of the topics, namely the title, description and narrative fields. Using all of these fields is expected to boost the quality of the results. When conducting the experiments, we found that: (i) the language models were outperformed by BM25 (due to the fact that, as noted above, BM25 seems more apt to lower the impact of unrelated terms), which is why Table 9 presents only BM25 results; and (ii) our best results were obtained with Title+Description+Narrative runs, so only these runs are reported here. Again, in this table, the results of our configurations are ranked according to the MAP values. At the bottom of Table 9, we present the best official result [run id pircRB04td2, by the City University of New York (Kwok et al. 2004)] on the TREC 2004 Robust track hard topic set: we selected the run with the highest MAP among the official runs. This run considers title+description topics. Unofficial results at TREC 2004 Robust track from the National Laboratory of Pattern Recognition (Beijing, China) used title+description+narrative, but they were all outperformed by the pircRB04td2 run.

Table 9 Comparative evaluation of retrieval effectiveness for Robust TREC 2004 (Hard topics) under BM25 Model

Considering the more classical Title-only runs, PRF is outperformed by our best Title-only run, \({\textit{STE}}_{ NoSelection}\), for P@5 and P@10, whereas it outperforms \({\textit{STE}}_{ NoSelection}\) for the P@30 and MAP evaluation measures. This shows that our proposal is more precise, as it clearly generates better expansion terms for the top 5 and top 10 results. We chose to present our best results according to the available topic data, so we focus on Title+Description+Narrative runs.

We see in Table 9 that:

  • The top-2 configurations (according to MAP) for the hard topics are STE\(_{ NoSelection}\) and ALL\(_{ NoSelection}\), i.e., the configurations that use large expansions: the statistical expansion, or the union of all the expansion terms, without filtering. This shows again that our statistical expansion proposal, used alone or with other expansions, is effective. Another element supporting the STE configurations is that STE\(_{ NoSelection}\) obtains the third best result in P@30, with a value of 0.2780. In the case of difficult queries, it seems that integrating association rules is beneficial when retrieving the top documents;

  • Most of the time, except for the STE and ALL expansions, the filtering of expansion terms leads to better results. Our explanation is that: (i) the terms obtained by STE are already adequate, so the filtering removes useful terms; and (ii) because "ALL" is a union of expansions, the decrease in quality of STE also degrades the overall union of expansion terms;

  • PRF is very effective in terms of MAP because it is applied on large documents. However, for the precision at 30 documents, our three best configurations obtain better results;

  • Even if our results according to MAP do not reach the best official results, our best precision value at 30 documents, 0.2780, obtained by our statistical expansion, is close to the official best of TREC 2004 (0.2867).

  • Although the PRF method gives some very good results on hard queries, some of our runs obtain results close to those of PRF. Furthermore, coupling our ‘STE\(_{ NoSelection}\)’ run with PRF gave much better results than both our previous runs and PRF alone. Thus, with ‘STE\(_{ NoSelection}\) + PRF’, we obtained a more effective method for these hard queries.

  • All of our results are outperformed by the best TREC 2004 Robust track run, pircRB04td2, on hard queries. We must however mention that this run relies on fusing multiple (up to 4) retrieval lists from several queries, whereas our framework is limited to a single expanded query.

An additional experiment (reported in the penultimate line of Table 9) applied PRF on the extended STE\(_{ NoSelection}\) expansion. Such an integration of complementary statistical query expansion and pseudo-relevance feedback improves the recall beyond the top 30 documents, leading to a higher MAP value.
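To make this coupling concrete, the following minimal sketch (in Python) illustrates how an already-expanded query can be further enriched by pseudo-relevance feedback. The helpers retrieve and top_terms, as well as the parameter values, are hypothetical placeholders for the search engine and the feedback-term extraction, not the exact implementation used in our framework.

def expand_with_prf(original_terms, ste_terms, retrieve, top_terms,
                    fb_docs=10, fb_terms=20):
    # Step 1: build the statistically expanded query (STE, no filtering),
    # removing duplicates while preserving term order.
    expanded = list(dict.fromkeys(original_terms + ste_terms))
    # Step 2: retrieve with the expanded query, then apply PRF on its top docs.
    feedback_docs = retrieve(" ".join(expanded), k=fb_docs)
    prf_terms = top_terms(feedback_docs, n=fb_terms)
    # Step 3: the final query unites original, statistical and feedback terms.
    return list(dict.fromkeys(expanded + prf_terms))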

The results presented in Table 9 show that our approach leads to good results in terms of precision at 30 documents on hard topics. For this test collection, the filtering of terms plays a positive role in the semantic and conceptual expansions, as it filters out inadequate terms, but for STE the impact of the filtering is negative. This may be explained by the fact that many documents of the TREC Robust track are journal articles, and with such clean documents the association rules are already very precise, without the need for subsequent filtering. The good results obtained at 30 documents lead us to believe that when such effectiveness is needed, our proposals improve results; this idea is confirmed in the following experiments on tweet contextualization.

6 Embedding hybrid query expansion model for tweet contextualization

Microblogging platforms such as Twitter are increasingly used for online client and market analysis. This motivated the proposal of a new Tweet Contextualization track at the CLEF INEX lab in 2013. The objective of this task is to help a user understand a tweet by providing a short explanatory summary. This summary should be built automatically using resources like wikipedia, by extracting relevant passages and aggregating them into a coherent summary (Bellot et al. 2016).

The general process of the tweet contextualization task classically involves three steps (a sketch of this pipeline is given after the list):

  • Tweet analysis, to determine what the tweet is about;

  • Document retrieval, to gather additional information that will serve as a basis for the contextualization;

  • Summary generation, to produce an overview that describes the tweet. For INEX, the summary must not exceed 500 words.
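As a rough illustration, these steps can be chained as follows; all function names here are hypothetical placeholders, since participants plug their own components (in our case, the query analysis and expansion step) into the baseline system described below.

MAX_WORDS = 500  # INEX constraint on the summary length

def contextualize(tweet, analyze, retrieve, summarize):
    # Hypothetical three-step tweet contextualization pipeline.
    query = analyze(tweet)             # 1. determine what the tweet is about
    passages = retrieve(query, k=50)   # 2. gather supporting passages
    summary = summarize(passages)      # 3. aggregate a coherent overview
    assert len(summary.split()) <= MAX_WORDS
    return summary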

INEX organizers provide the task participants with a baseline system composed of an IRS (based on the Indri search engine) and an automatic summarization system, which takes as input a text composed of the top 50 results obtained by the IRS (summarization algorithm created by TermWatch; Ibekwe-Sanjuan and SanJuan 2004). This allows participants to focus on the best tweet formulations for the information retrieval system.

Since the quality of the queries derived from the tweets has a direct impact on the quality of the contexts, we propose to use our HQE model to enhance these queries for the baseline system. As described above, our query expansion configurations lead to high-quality results for tweet retrieval. We also showed that, in the case of difficult queries, our proposals obtain good results, especially when considering the top 30 or more documents. Our goal here is to study the effectiveness of our query expansion model and to evaluate the results using the INEX baseline system on the INEX 2014 and 2013 test collections. We focus on the INEX 2014 campaign first, because we participated officially in it and obtained the top result with our STE\(_{ NoSelection}\) configuration. Then, we describe post-INEX 2014 experiments, as well as results on the INEX 2013 topics. In the experiments reported below, we focus only on the textual part of the tweets, without integrating specific elements like hashtags, urls or mentions.

6.1 INEX test collections

6.1.1 Data

The INEX 2014 and INEX 2013 collections are described in Table 10. The document corpus is the same (3 million English wikipedia articles, with notes and bibliographic references removed), while the topics (tweets) to be contextualized are different.

Table 10 Description of INEX 2014 and 2013 test collections

These two collections differ only in their topics:

  • The INEX 2014 topic tweets are extracted from CLEF RepLab 2013, which is dedicated to reputation monitoring. Each of these 240 tweets explicitly mentions a company (e.g., Fiat, Goldman Sachs, etc.) or an institution (e.g., Bank of America, New York University, etc.). So, using several ways to access sources of information that describe such entities may lead to good results. Moreover, as many topic tweets are related to the same entities but differ in the details discussed, filtering the query expansion terms is also expected to provide good results;

  • Compared to the 2014 topics, the INEX 2013 topic tweets are much more obviously related to wikipedia pages (for instance “Bulgaria’s prime minister says he and his whole government is resigning from office following nationwide protests—@Reuters”). Moreover, many tweets are strongly time-related (the tweet above relates to an event that took place in February 2013): it is difficult to get much relevant information from a wikipedia dump of 2012. Other tweets are quite obscure, like “But from each crime are born bullets that will one day seek out in you where the heart lies.—Pablo Neruda”: what context is relevant to such a tweet is not clear, even for a human being. We will see in Table 16 that the best results are much lower compared to INEX 2014. That is why we expect our proposals to perform less well here too.

6.1.2 Evaluation metric

The official evaluation metric is Informativeness (Bellot et al. 2016), which is not a classical evaluation metric for IR. Its goal is to measure how well the summary helps a user understand the tweet's content.

The informativeness of a given summary is the dissimilarity between the summary and a reference summary.

There are different distributions for the reference summaries, namely:

  • Unigrams made of single lemmas (after removing stop-words).

  • Bigrams made of pairs of consecutive lemmas (in the same sentence).

  • Bigrams with 2-gaps, also made of pairs of lemmas in the same sentence, but allowing the insertion of a maximum of two lemmas between them (also referred to as the skip distribution).

These distributions are more or less strict: Unigrams only consider simple term overlap, whereas Bigrams take the succession of words into account.

It is important to remember, when analysing the results, that this evaluation metric measures a dissimilarity: lower values are better.
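To convey the intuition behind this metric, the sketch below computes a simple normalized overlap-based dissimilarity between the n-gram distributions of a candidate summary and a reference, for the three distributions listed above. This is only an illustration under that simplifying assumption; the official INEX implementation (Bellot et al. 2016) differs in its details.

from collections import Counter

def ngrams(lemmas, n, max_gap=0):
    # Unigrams (n=1), bigrams (n=2, max_gap=0) or 2-gap bigrams (n=2, max_gap=2).
    grams = Counter()
    for i, a in enumerate(lemmas):
        if n == 1:
            grams[(a,)] += 1
        else:
            for j in range(i + 1, min(i + 2 + max_gap, len(lemmas))):
                grams[(a, lemmas[j])] += 1
    return grams

def dissimilarity(candidate, reference, n=1, max_gap=0):
    c, r = ngrams(candidate, n, max_gap), ngrams(reference, n, max_gap)
    total = sum(r.values())
    if total == 0:
        return 1.0
    overlap = sum(min(c[g], f) for g, f in r.items())
    return 1.0 - overlap / total  # 0 = identical distributions, 1 = disjoint

# Dummy lemma sequences, for illustration only:
ref = "bulgaria prime minister resign government nationwide protest".split()
cand = "bulgaria government resign after nationwide protest".split()
for n, gap in ((1, 0), (2, 0), (2, 2)):
    print(n, gap, round(dissimilarity(cand, ref, n, gap), 3))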

6.2 Results on INEX 2014 test collection

6.2.1 Official INEX 2014 results

Before describing the official results we obtained at INEX 2014, we present in the second column of Table 11 the size of the wikipedia corpus used to build the association rules; the other columns give the minsupp and minconf parameters used, as well as the overall number of rules generated for the STE configurations.

Table 11 Description of wikipedia corpus extracted for 2014 Tweet Contextualization

We note that we used the same values for \(\alpha\) and \(\mu\) as in the previous experiments.
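To illustrate how the minsupp and minconf thresholds control the rule extraction for STE, the following sketch mines term association rules from a document collection. It is a deliberately simplified illustration, restricted to rules with a single term on each side, and the threshold values are placeholders, not those of Table 11.

from itertools import combinations
from collections import Counter

def mine_term_rules(docs, minsupp=0.01, minconf=0.6):
    # docs: list of sets of index terms (one set per document).
    n = len(docs)
    term_freq, pair_freq = Counter(), Counter()
    for terms in docs:
        term_freq.update(terms)
        pair_freq.update(combinations(sorted(terms), 2))
    rules = []
    for (t1, t2), f in pair_freq.items():
        if f / n < minsupp:
            continue  # the termset {t1, t2} is not frequent enough
        for premise, conclusion in ((t1, t2), (t2, t1)):
            # confidence approximates P(conclusion in doc | premise in doc)
            conf = f / term_freq[premise]
            if conf >= minconf:
                rules.append((premise, conclusion, f / n, conf))
    return rules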

Table 12 presents our official run submitted to the INEX 2014 tweet contextualization track. It is worth noting that our run achieved the best result (Bellot et al. 2014). This run corresponds to the configuration STE\(_{ NoSelection}\) described in our previous experiments, using the statistical expansion by association rules with the parameters presented in Table 11. This shows again that our statistical expansion is successful: the expansion is able to retrieve wikipedia pages that lead to higher quality contexts. Based on such results, we describe in the following section additional experiments to study the other configurations proposed in Sect. 4.3.

Table 12 INEX Tweet Contextualization 2014 official informativeness results (Bellot et al. 2016)

6.3 Post INEX 2014 experiments

6.3.1 Configurations results

All our proposed configurations, applied to the INEX 2014 test collection, lead to the results presented in Table 13, ranked by Informativeness on Bigrams with 2-gaps.

Table 13 The obtained informativeness results on INEX 2014 collection

The analysis of Table 13 shows that:

  • The configuration that integrates the three resources, namely ALL\(_{ Selection}\), achieves the best informativeness results for the three evaluation measures. This shows that the terms coming from the three generation methods are complementary;

  • Consistently with the experiments on TREC Microblog 2011, the filtered configurations SE\(_{ Selection}\), STE\(_{ Selection}\) and CE\(_{ Selection}\) outperform their respective unfiltered counterparts SE\(_{ NoSelection}\), STE\(_{ NoSelection}\) and CE\(_{ NoSelection}\). This shows that the filtering we propose is also valuable for tweet contextualization.

Overall, the proposed framework, which integrates several sets of terms from several sources combined with an adequate filtering, enhances the quality of the generated contexts of the tweets.

6.3.2 Significance and effectiveness evaluation

To provide an in-depth understanding of the improvement of the ALL\(_{ Selection}\) configuration over STE\(_{ NoSelection}\), we present in Table 14 the percentage of queries \(R^+\) (resp. \(R^-\)) for which the ALL\(_{ Selection}\) configuration performs better (resp. worse) than STE\(_{ NoSelection}\), in terms of the informativeness on Bigrams with 2-gaps. For 27 topics, ALL\(_{ Selection}\) outperforms STE\(_{ NoSelection}\), and for 27 topics it underperforms it: it is thus difficult to choose between these configurations. This observation is confirmed by a bilateral paired Student t-test on the average differences between these two runs: the differences, respectively for Unigrams, Bigrams and Bigrams with 2-gaps, are not significant at a significance level (p value) of 0.05.
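For reference, this significance test can be sketched as below, assuming parallel per-topic informativeness scores for the two runs; the score values shown are dummy placeholders, not our actual measurements.

from scipy import stats

# Per-topic informativeness scores (dummy placeholder values):
all_selection = [0.81, 0.77, 0.84, 0.79, 0.82]    # run ALL_Selection
ste_noselection = [0.83, 0.76, 0.85, 0.80, 0.84]  # run STE_NoSelection

# Bilateral (two-sided) paired Student t-test on the per-topic differences.
t_stat, p_value = stats.ttest_rel(all_selection, ste_noselection)
print(f"t = {t_stat:.3f}, p = {p_value:.3f}, significant: {p_value < 0.05}")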

Table 14 Percentage of queries \(R^+\), \(R^-\) and \(R^=\) for which the run “ALL\(_{ Selection}\)” performs better than (resp. worse than, equal to) the run “STE\(_{ NoSelection}\)” in terms of the informativeness metric on Bigrams with 2-gaps

Comparing the two runs ALL\(_{ Selection}\) (our best run) and STE\(_{ NoSelection}\) (the best run at INEX 2014), we find that for three topics (with tweet ids 257798105473380352, 262290292173045762 and 276815901897146368), the run ALL\(_{ Selection}\) largely underperforms the run STE\(_{ NoSelection}\). This is mainly due to the fact that some terms used for these expansions are too general and generate noise in the results.

Then, we compare the runs STE\(_{ NoSelection}\) and ALL\(_{ Selection}\) on the topics for which ALL\(_{ Selection}\) does not fail, as presented in Table 15. The differences between these runs, for the three evaluation measures, on these 44 topics out of the 47 official ones are all statistically significant (noted \(\dagger\) in Table 15) according to bilateral paired Student t-tests with a significance p value of 0.05. This result shows that there is still room for improvement of the term filtering.

Table 15 Informativeness results on the 44 selected runs

6.3.3 Results on INEX 2013 collection

These additional tests are based on the parameter values set for INEX 2014. We want to find out whether our proposals are robust on this other set of topics, knowing that, as explained before, the topic tweets of INEX 2013 are very different from, and more difficult to tackle than, those of 2014.

Table 16 presents our results, ranked according to the Informativeness for Bigrams with 2-gaps; the lowest scores represent the best runs.

Table 16 Post-INEX Tweet Contextualization 2013 results, with official informativeness results (Bellot et al. 2016)

Table 16 shows that:

  • Our best result, obtained with the configuration STE\(_{ Selection}\) that filters the statistical expansion terms, is ranked above the median run. When considering the valid runs of INEX 2013, this run would be ranked 9th out of 21. This result is far from the best run of 2013, but we do not consider hashtags (unlike at least the top four official runs);

  • In this case of difficult topics, our statistical expansion outperforms our other proposals. These findings are similar to what we obtained on the TREC 2011 microblog search collection;

  • Here again, the proposed selection plays a positive role during query expansion for all configurations.

7 Conclusion

In this paper, we propose a hybrid query expansion model (HQE) that investigates how external resources can be combined with association rule mining to enhance expansion term generation and selection. The HQE model expands queries through two phases, namely the candidate term generation phase and the selection phase. For the first phase, we used local, global and external methods to generate new terms related to a query. For the second phase, we proposed a measure that computes the relatedness between a query and the set of candidate terms, based on Explicit Semantic Analysis and the Confidence metric.

The HQE model combines local, global and external QE methods: it generates different candidate term sets for a given query and filters these sets, keeping only the terms most related to the query. Across the large set of experiments on ad-hoc retrieval (TREC Microblog 2011, and the hard topics of the TREC Robust track 2004) and tweet contextualization (CLEF INEX 2013 and 2014), we found that the proposed filtering is able to enhance results whenever the linking with external data is of good quality. For retrieval, the statistical approaches using association rules obtained the best results, while for tweet contextualization, the integration of several expansions is better.

In future work, we propose to weight the query terms to give more importance to the original query terms, in order to avoid query drift. Such weights may be the relatedness scores defined in formula (16), but we may also consider the redundancy between several expansions: for instance, if the same expansion term is proposed by both the statistical expansion STE and the semantic expansion SE, then we are more confident in this expansion term. Furthermore, we will investigate how to enhance the proposed QE model using embedding vectors, as in Almasri et al. (2016): in this case, the control of the expansion needs to carefully filter out inadequate terms, since embeddings rely on similar contexts of word usage.