1 Introduction

Information retrieval is an interactive and iterative process: only in approximately half of the cases is an information need satisfied with a single query (Rieh and Xie 2006; Spink et al. 2002). In the other half, the user has to reformulate her initial query because it was over- or under-specified, did not use terminology matching relevant documents, or simply contained errors or typos. The picture is made even more complex by the fact that, although queries are typically short (Belkin 2000), they usually form chains of topically-related activities (Radlinski and Joachims 2005). Users increasingly rely on search to accomplish complex objectives, such as planning a holiday (from travel to lodging, to sightseeing, dining and nightlife) entirely online. Additional complexity is brought on by search tasks that are so difficult and important for the user (e.g., deciding which car to buy, finding a new job, moving to another city) that she may return to the same search mission again and again over a long period (Donato et al. 2010).

In order to assist users in locating information more effectively, most large-scale Web search engines have started offering various supporting tools. As an example, query recommendations are a mechanism to help users reformulate their queries: these recommendations are typically queries similar to the original one, and they are obtained by analyzing query logs, for instance by clustering queries (Wen et al. 2001) or by identifying frequent re-phrasings (Baeza-Yates et al. 2004).

Query logs are in fact the main source of information for building search assisting systems. Web query logs contain a wealth of information about how users interact with the search engine. Extracting behavioral patterns from this abundance of information is a key step towards improving the service provided by search engines and towards developing innovative web-search paradigms. In particular, and this is the focus of this paper, by mining query logs we can understand the dynamics underlying the query reformulation process, and use this knowledge in applications aimed at improving the users’ web-search experience. In this context we identify two main tasks:

  1. Identifying search mission borders, by distinguishing query transitions that are reformulations, i.e., queries with a similar information need (Jones and Klinkner 2008; Radlinski and Joachims 2005), from query transitions that represent a mission change. Search missions are also known as chains (Radlinski and Joachims 2005), and in the rest of the paper we use the two terms as synonyms.

  2. After identifying the search missions, the query reformulations inside each chain must be classified into query reformulation types (abbreviated QRTs). In this paper we focus on four query reformulation types: generalization, specialization, error correction, and parallel move.

We tackled the first problem in our previous work (Boldi et al. 2008), where we built a machine learning model for predicting the probability that two subsequent queries are part of the same search mission. This model was then used to annotate the arcs of the query-flow graph, an aggregated representation of the latent querying behavior contained in a query log. We then used the query-flow graph in applications such as query recommendation and segmentation of user sessions.

In this article, instead, we focus on the tasks of modeling query reformulation types and characterizing query reformulation patterns, approaching them as data mining problems.

Our main contributions are:

  • Model. We show that accurate automatic classification of QRTs is possible. Learning automatically from a human-labeled query log sample, we build a model for automatic classification of QRTs (Sect. 4). Our model exhibits a very high accuracy of ≈92% when discriminating among four different reformulation types. The classifier is able to predict correctly even very difficult cases. We describe in detail the process followed to build the model, and then inspect its behavior. To the best of our knowledge, this is the first work that learns a model for the automatic classification of QRTs by mining a query log.

  • Patterns of reformulation strategies. Thanks to our automatic classifier, we are able to label very large query logs and to analyze them (Sect. 5). We divide user sessions into search missions, then label each mission with our model, transforming it into a string of QRTs. The query log thus becomes a bag of strings from which we can compute frequent sequential patterns that represent high-level search strategies. We analyze approximately 17 million QRTs, and we compare our findings with those in the literature, obtained mostly over small, manually-assessed collections. Moreover, we investigate the connections between the reformulation type and the topical categories of the queries in the reformulation.

  • Reformulation graphs. Using our model we can annotate the arcs of a query-flow graph (Boldi et al. 2008) with QRTs, obtaining what we call a query transition graph (Sect. 6). We present a study of the properties of this annotated graph, including the relationships between the various query reformulation types.

  • Recommendations. We propose a family of methods for query recommendation based on short random walks performed on different slices of the query-flow graph (Sect. 7). Our experiments show that these methods can match, and often improve upon, the precision of query-click based recommendations without using clicks. Moreover, our methods provide more diversity in the result sets. Our experiments also show that transition probabilities from one query to the next are not enough: to obtain good recommendations, it is important to filter out queries that are not part of the same search mission and to add QRT labels to edges.

Section 2 describes related work, whereas in Sect. 3 we discuss the taxonomy of QRTs adopted in this paper. Based on this taxonomy, in Sect. 4 we build a classifier, which we characterize in depth. Then in Sect. 5 we apply the classifier to label the query transitions in two query logs; from the labeled query logs we extract query reformulation patterns that we analyze from different perspectives. Using the same classifier, in Sect. 6 we label a query-flow graph with QRTs and characterize the resulting graph in depth; this graph is then used for query recommendation based on short random walks in Sect. 7. The last section presents our conclusions and outlines future work.

A preliminary version of this paper was presented in Boldi et al. (2009).

2 Related work

2.1 Determining reformulation types

In one of the oldest papers on the subject, Lau and Horvitz (1999) study a hand-tagged log from the Excite search engine, and propose a classification of QRTs. Their aim is to build a Bayesian model of user behavior exploiting also temporal information. Rieh and Xie (2006) consider reformulation patterns in more detail: also in their case, there is no automatic classification model; instead, they manually label and analyze 313 search missions. While defining the classes of query reformulation types (Sect. 3) we basically follow their taxonomy. The difference is that we adopt a coarser granularity for those reformulations that, while being part of the same search mission, are not simple direct reformulations of the previous query. Recently, Jansen et al. (2007a, b) analyzed a larger query log (1.5 million query reformulations) using an automatic classifier. Their classifier is manually built following the concepts presented in the paper by He et al. (2002). By “manually built” we mean that the rule that identifies a class of QRT and the definition of the class itself coincide perfectly, i.e., there is no automatic learning involved. The classification is built on six features, all based on term differences between the two queries.

Besides the size of the dataset we use, the fundamental difference between our work and the studies mentioned above is that we learn a model by mining a large query log, using 27 features. As an example, the classifier adopted by Jansen et al. (2007a, b) defines a specialization as a query with additional terms; by learning the model, instead, we find that specialization (as judged by humans) is characterized by a combination of factors, including query length and cosine similarity of n-grams. Similarly, in their classifier a generalization is just a query composed of a subset of the original terms. Our classifier, instead, besides finding these expected rules, is also able to discover unexpected facts, e.g., that the reformulation from “dango” to “japanese cakes” is actually a generalization, even though the two queries have no terms in common and the second one is longer than the first.

In the context of image search, Goodrum et al. (2003) analyzed manually assessed reformulations from a group of users. In that case, of course, a number of reformulations involve interactions between text and images.

2.2 Query log analysis and applications

The importance of mining query logs to extract useful information about user behavior has been clear since the seminal works of Jansen et al. (1998) and Silverstein et al. (1998); such analysis has found fruitful application in many different contexts, such as query recommendation (Baeza-Yates et al. 2004; Zhang and Nasraoui 2006) and document ranking (Craswell et al. 2008).

Most of the work on query recommendation has focused on measures of query similarity (Fonseca et al. 2003; Zhang and Nasraoui 2006) that can be used for query expansion (Baeza-Yates et al. 2004) or query clustering (Baeza-Yates et al. 2004; Wen et al. 2001). A first attempt to model sequential search behavior is presented by Zhang and Nasraoui (2006): the arcs between consecutive queries in the same session are weighted by a damping factor d, whereas the similarity values for non-consecutive queries are calculated by multiplying the values of the arcs that join them. Fonseca et al. (2003), instead, discover related queries with a method based on association rules: each session in the query log is seen as a transaction in which a single user submits a sequence of related queries in a time interval.

Baeza-Yates et al. (2004) study the problem of suggesting related queries issued by other users, together with query expansion methods to construct artificial queries. Their clustering is based on a term-weight vector representation of queries, obtained by aggregating the term-weight vectors of the URLs clicked after the query: the objective is to recommend queries that are related to the input query but may target different aspects of it. Wen et al. (2001) also present a clustering method for query recommendation, centered around four notions of query distance: keywords of the query; string matching of keywords; common clicked URLs; and distance of the clicked documents in some pre-defined hierarchy. Jones et al. (2006) introduced the notion of query substitution: similar queries can be obtained by replacing the query as a whole, or by substituting constituent phrases. Both similar queries and phrases are derived from user query sessions, and the authors propose models for query re-ranking based on the similarity between the new query and the original one.

Particularly relevant for this paper is the application of query log analysis to the segmentation of sessions into user missions, a.k.a. chains (Radlinski and Joachims 2005): successful examples of such an application were presented in Boldi et al. (2008) and Jones and Klinkner (2008). Even though most research on query logs has focused on single sessions, recent work (Richardson 2008) suggests its usefulness also for determining long-term interests of users.

Donato et al. (2010) present a machine learning module, based on query log analysis, which is at the core of Search Pad, a novel Yahoo! application launched in June 2009. Search Pad helps users keep track of the queries they have issued and the results they have consulted. These automatically collected notes can be edited by the user, who can add comments and additional information, move or delete notes, and save the note pad for later reuse. The novelty of Search Pad is that, unlike previous note-taking products, it is triggered automatically, only when the system decides, with a fair level of confidence, that the user is undertaking a “complex research mission” and is thus in the right context for gathering notes. A complex research mission is a search task that requires the user to go back to the search engine again and again over a period of time with related questions. Examples of such tasks are organizing a holiday, deciding which digital camera to buy, finding a job, or gathering information on a health issue. Once Search Pad receives the triggering signal and is aware that the user is engaged in a research session, it prompts the user with a message asking whether she wants to take notes related to this search.

The information extracted from query logs can be summarized and suitably represented through query graphs (Baeza-Yates and Tiberi 2007), whose specific definition is geared to the application at hand. Some examples can be found in Boldi et al. (2008), Craswell and Szummer (2007), and Glance (2001). A recent application of query graphs to query-recommendation clustering is presented in Sadikov et al. (2010), where a graph extracted from query logs is clustered to enhance the diversity of the set of query-refinement suggestions. The authors also model “off-topic drift”, which corresponds to mission change in our nomenclature, and to the terminal state in our query-flow graph.

2.3 Recommendations based on random walks

Craswell and Szummer (2007) describe a method based on random walks on the query-click graph (Beeferman and Berger 2000), that can be used to provide query recommendations as follows: given the input query, it computes the personalized PageRank (Jeh and Widom 2003) (with restart to the original query) of all the other queries, and then picks the top ones as recommendations. There are more details about this method in Sect. 7.2. Fuxman et al. (2008) experiment with a similar approach in the context of finding related keywords for advertising.

Mei et al. (2008) instead use a computation of hitting time. Assume that \(Q_0\) is the input query: they start by setting \(h(Q_i, 0) = 0\) for all queries \(Q_i\) except the original query \(Q_0\), which has \(h(Q_0, t)=1\,\forall t \ge 0\), and then iterate the following for a fixed number m of iterations:

$$ h(Q_i,t) = \sum_{j \ne i} p_{ji} h(Q_j,t-1), $$

where \(p_{ji}\) is the probability of transition from \(Q_j\) to \(Q_i\). For a query \(Q_i\), what their process computes in \(h(Q_i, t)\) is the probability that a random walk arrives at node \(Q_i\) within t steps or less.
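
For concreteness, the following is a minimal sketch of this iteration; the dictionary-of-probabilities layout and all names are our own illustration, not taken from Mei et al. (2008):

```python
def hitting_probabilities(p, q0, queries, m):
    """Iterate h(Q_i, t) = sum_{j != i} p_ji * h(Q_j, t - 1) for m steps.

    p       -- dict mapping (j, i) to the transition probability p_ji
    q0      -- the input query Q_0, for which h(Q_0, t) = 1 for all t
    queries -- all query identifiers
    m       -- fixed number of iterations
    """
    h = {q: 0.0 for q in queries}
    h[q0] = 1.0
    for _ in range(m):
        h_new = {}
        for i in queries:
            if i == q0:
                h_new[i] = 1.0  # the input query keeps h = 1 at every step
            else:
                h_new[i] = sum(p.get((j, i), 0.0) * h[j]
                               for j in queries if j != i)
        h = h_new
    return h  # h[Q_i] estimates the probability of reaching Q_i within m steps
```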

Query recommendation systems can also be personalized by taking into account the user’s history. Zhang and Nasraoui (2006) bias recommendations by exploiting the user’s history and introducing a “forgetting factor” which discounts older queries in favor of more recent ones. A similar approach is used in Boldi et al. (2008), where a random walk with restart to the queries in the history of the user is performed, preferring recent queries over older ones. As a general observation, recent works have shown that not only the previous query, but also the long-term interests of the user, are important for understanding her information need (Luxenburger et al. 2008; Richardson 2008).

3 Query transition types

In this article we adopt a taxonomy of query transitions which is largely inspired by the classification in Rieh and Xie (2006), with some differences that we summarize below. As depicted in Fig. 1, there are basically two axes: a generalization-specialization axis, and a dissimilarity axis.

Fig. 1 Graphical depiction of transition types for pairs of consecutive queries. Transitions on the left of “Mission Change” are reformulations

Along the dissimilarity axis (horizontal in Fig. 1) we find a continuous variety of query transition types: as we move along the axis (from left to right in the picture), the syntactic and semantic gap between the two queries, in terms of user’s intent, gets larger and larger. We start with zero dissimilarity (Same query), followed very closely by Error correction: the user is supposedly correcting an error (e.g., a typo) in her previous query, or trying a different spelling/capitalization of a query. Further along the dissimilarity axis we find Equivalent rephrasing: the user is re-phrasing, changing the wording of the query, but with exactly the same goal (in the sense of Jones and Klinkner 2008) as before: she just decided that the new formulation was more likely to return the desired results. Then we find Parallel move: according to Rieh and Xie (2006), this occurs when the “user modifies the queries from one aspect of an entity to another or from one thing to another, both of which share common characteristics”; the user is moving her focus to something related, but not equivalent, probably as a result of visiting some of the pages in the result set. Finally, we have Mission change: the user is completely changing topic and looking for something else (Jones and Klinkner 2008; Radlinski and Joachims 2005).

Along the vertical axis, instead, we have Generalization and Specialization. Generalization occurs when the new query \(q^{\prime}\) is more general than q (i.e., it should be satisfied by a superset of the results that are relevant for q); in many cases (but not always) a generalization can be automatically identified because \(q^{\prime}\) is a conjunction of a proper subset of the terms of q. There are other, more difficult cases: for example, a user querying for the name of a specific soccer team and then querying to find a sports web site. In a specialization, instead, the new query \(q^{\prime}\) is more specific than q (i.e., it should be satisfied by a subset of the results that are relevant for q); probably, the previous query returned too many results, few of them being of interest to the user. In a sense, generalization reflects the user’s desire to increase recall, whereas specialization reflects the need to improve precision.

In our previous work (Boldi et al. 2008), we developed a model for breaking sessions into chains or, in other terms, a model to detect mission changes. This model is represented in Fig. 1 by the hyperplane separating Mission change from the rest. In this work we keep using that model for detecting mission changes, while we develop a new model for distinguishing QRTs. In particular, on the dissimilarity axis we decided to cut between simple syntactical dissimilarity (we call this class C, for Correction) and more substantial query reformulations that nevertheless remain within the same search mission (we call this class P, for Parallel move). On the other axis we simply distinguish between class G (generalizations) and class S (specializations). Some real examples of each kind of reformulation are shown in Table 1.

Table 1 Examples of reformulations

Our classification of reformulations departs from the one proposed in Rieh and Xie (2006) in the granularity used along the dissimilarity axis. Essentially they have the same three classes G, S and C, but instead of P they use a more fine-grained taxonomy, distinguishing among parallel move, replacement with synonym, term variation, operator usage, type of resource, and domain suffix.

The work in Jansen et al. (2007a, b) also presents a similar taxonomy, but they additionally consider Content change (the current query is identical to the previous one but executed on another content collection, e.g., web to images) and Assistance (the current query was generated by the user’s selection of a query reformulation suggested by the search engine). Both scenarios are outside the scope of the present paper.

4 Automatic classification

In this section we describe the process we followed in order to build a model for query-reformulation type classification.

4.1 The dataset construction

We started from a set of query pairs \(\left(q,q^{\prime}\right)\), extracted from a query log of the Yahoo! UK search engine in early 2008. These query pairs were part of the training set that we used to build a model (Boldi et al. 2008) for segmenting users sessions into chains, that is, topically coherent sequences of queries by one user. Every query pair \(\left(q,q^{\prime}\right)\) has the following two properties: (i) q and \(q^{\prime}\) appeared in this order and consecutively at least once in the query log; (ii) q and \(q^{\prime}\) belong to the same chain according to the labeling we did manually for the work in Boldi et al. (2008).

In order to create a training set for our QRT classification problem, a group of editors manually labelled the set of query pairs \(\left(q,q^{\prime}\right)\) with their QRT. It is worth noting that the same query reformulation \(\left(q,q^{\prime}\right)\) may be labelled by more than one editor: in cases of disagreement among two or more editors on the type of query reformulation, the query pair was removed from the training set. This left us with a set of 1 375 examples, of which we used 2/3 for training and 1/3 for testing.

4.2 The features used

We used 27 features to build our model for QRT classification. The set of features is a superset of those used in our previous work (Boldi et al. 2008), and some of them were shown to be effective for query segmentation also in other investigations (He and Göker 2000; He et al. 2002; Jones and Klinkner 2008). The features are presented in Table 2, and can be divided into three groups:

  • Textual features. We compute the textual similarity of queries q and \(q^{\prime}\) using various similarity measures, including cosine similarity, Jaccard coefficient, and size of intersection. These measures are computed on sets of stemmed terms and on character-level 3-grams. We also compute the Levenshtein (edit) distance. (These measures are sketched in code right after this list.)

  • Session features. We compute the number of sessions in which the pair \(\left(q,q^{\prime}\right)\) appears. We also compute other statistics of those sessions, such as average session length, average position of the queries in the sessions, etc.

  • Time-related features. We compute average time difference between q and \(q^{\prime}\) in the sessions in which \(\left(q,q^{\prime}\right)\) appears, and the sum of reciprocals of time difference over all appearances of the pair \(\left(q,q^{\prime}\right)\).
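
To make the textual features concrete, here is a small sketch of how they could be computed for a pair \(\left(q,q^{\prime}\right)\); the helper names are ours, and a real implementation would stem the terms first:

```python
def char_ngrams(s, n=3):
    """Set of character-level n-grams of a string."""
    return {s[i:i + n] for i in range(max(len(s) - n + 1, 1))}

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

def cosine(a, b):
    """Cosine similarity between two sets, seen as binary vectors."""
    return len(a & b) / ((len(a) * len(b)) ** 0.5) if a and b else 0.0

def levenshtein(s, t):
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[j - 1] + 1,         # insertion
                           prev[j - 1] + (cs != ct)))  # substitution
        prev = cur
    return prev[-1]

def textual_features(q, q2):
    terms_q, terms_q2 = set(q.split()), set(q2.split())
    grams_q, grams_q2 = char_ngrams(q), char_ngrams(q2)
    return {
        "term_jaccard": jaccard(terms_q, terms_q2),
        "term_cosine": cosine(terms_q, terms_q2),
        "term_intersection": len(terms_q & terms_q2),
        "ngram_jaccard": jaccard(grams_q, grams_q2),
        "ngram_cosine": cosine(grams_q, grams_q2),
        "edit_distance": levenshtein(q, q2),
    }
```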

Table 2 Description of the features extracted for each query reformulation \(\left(q,q^{\prime}\right)\)

Intuitively, the purpose of the session features and the time-related features is to capture the relatedness of pairs of queries that appear frequently as reformulations in the query log. For instance, query pairs that appear with high frequency and with short time intervals between them are expected to be more related. On the other hand, textual features are absolutely necessary for query pairs that appear only once (the majority), and useful in general for query pairs that appear more than once, for instance to capture syntactic generalizations and specializations (which tend to shorten and lengthen the query strings, respectively).

All features passed a feature selection phase in which we evaluated each feature’s relevance w.r.t. our target variable (i.e., the query reformulation type).

4.2.1 Discussion

There are at least two classes of features we are aware of that could have been useful to improve classification accuracy.

First, we refrained from using features that require access to extra information, such as the result URLs or page snippets. For instance, we could have taken into account keywords in the documents returned by the search engine for each query, or computed the intersection between the sets of URLs returned for each query. Such features could in principle be helpful, as generalization/specialization relationships should be reflected there as partial set inclusions. Although these data might be very powerful, and even decisive, for determining the query reformulation type, for efficiency reasons we wanted to limit ourselves to features that can be computed quickly without access to extra information. In the particular case of an application such as query recommendation, search engines employ several techniques to reduce page loading time, including parallelizing operations; for practical reasons, we did not want to build a classifier that needs to wait for search results to be retrieved before being able to classify an item.

Second, we used only features about the current query pair, and did not consider features computed, for instance, from the previous query pair. More generally, we used a learning framework that classifies one pair of queries at a time, while in future work this could be modeled as a structured learning problem, in which inputs and outputs are sequences of transitions. Learning schemes involving Hidden Markov Models or Conditional Random Fields could be promising for this task.

4.3 Building the model

We tried many different classifier induction methods for our classification problem. Standard methods such as boosted decision trees showed an accuracy of approximately 85% in predicting query reformulation types. The model that we finally selected exhibits an accuracy of 92% on a test set of unseen cases. In the following we describe how we obtained this model.

Instead of directly tackling the 4-class problem, we defined four distinct binary classification problems, where in each problem the target variable is being or not being a certain QRT (e.g., \(is\_G?,\, is\_S?\), etc.). We then attacked all four problems concurrently and built four different classifiers, one per problem. Among the four classifiers built, we chose the best performing one as our first classifier. In particular, the selection was based not on accuracy, but on precision (i.e., the number of true positives divided by the total number of elements labeled as belonging to the class). The rationale is that at this stage we do not care much about false negatives: we just want to make some decisions with very high confidence and put those cases aside. False negatives are not a problem: they are not definitive errors, as they still have the chance to be classified correctly later. The process then continues greedily this way:

  1. Select the classifier (and the associated classification problem) that exhibits the highest precision;

  2. Remove from the training set the examples classified as positive by the selected classifier;

  3. On the remaining examples, train new models for the remaining classification problems and go to step 1;

  4. When we have a model for each QRT, train the final 4-class model on the remainder of the data.

Therefore, false negative errors made by the first four classifiers can be recovered by the last classifier. The whole process is depicted in Fig. 2a. In our case, the first classifier is the one for the target variable is_G?, then is_S?, then is_C?, and finally is_P?. This order also reflects how easy it is to distinguish each class of QRT from the others: class P is the hardest to detect.
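
The following sketch summarizes the greedy construction; `train_binary` and `train_multiclass` are placeholders standing in for the C5.0 induction runs (we do not reproduce the actual rule learner here), and each example is assumed to carry a mutable feature dictionary:

```python
def train_cascade(examples, classes, train_binary, train_multiclass):
    """Greedy, precision-first construction of the QRT cascade.

    examples         -- labeled examples, each with a dict `features`
    classes          -- e.g. ["G", "S", "C", "P"]
    train_binary     -- callable training a binary "is_X?" classifier;
                        the returned object exposes .precision and
                        .predict(features) -> (label, confidence)
    train_multiclass -- callable training the final boosted 4-class model
    """
    cascade, remaining, pending = [], list(examples), set(classes)
    while pending:
        # Train one binary classifier per remaining class and keep the
        # most precise one (precision, not accuracy, drives the choice).
        candidates = [(train_binary(remaining, c), c) for c in pending]
        best_clf, best_class = max(candidates, key=lambda t: t[0].precision)
        cascade.append((best_class, best_clf))
        pending.discard(best_class)
        # Positives are decided here; negatives flow on, enriched with
        # the confidence of the negative prediction as a new feature.
        next_round = []
        for x in remaining:
            label, confidence = best_clf.predict(x.features)
            if label == "N":
                x.features["conf_not_" + best_class] = confidence
                next_round.append(x)
        remaining = next_round
    final_model = train_multiclass(remaining)  # sees 27 + 4 features
    return cascade, final_model
```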

Fig. 2 a High-level depiction of our QRT classification model. b The first (most representative) rule from each of the binary classifiers

Another important point is that, as examples pass through a classifier, the training set is not only reduced in number of examples but also enriched in features: to each example predicted as negative, we attach the confidence with which the classifier made that prediction. So the fifth classifier actually receives 31 features as input: the 27 described in Table 2, plus the confidences with which the four previous classifiers predicted the example to be negative.

Each of the five models is a rule-based classifier built with C5.0, the successor of the well-known C4.5 decision tree induction algorithm (Quinlan 1993). While building the first four classifiers, in order to boost precision (i.e., to achieve a very low number of false positives, at the price of recall), we exploited the possibility of defining different misclassification costs for different kinds of errors, e.g., telling the classifier induction algorithm to weigh a false positive twice as much as a false negative. Finally, for the fifth model (the 4-class one) we used boosting with 15 decision trees.

Reducing a multi-class scenario to binary classification is most often handled with the so-called one-against-all technique, where many binary classifiers are used in parallel and the positive answer with the highest confidence is selected. This technique, however, works well when the underlying binary classification problems all have the same level of difficulty, which is far from being our case; our solution has the advantage of exploiting the lack of symmetry inherent in our problem to get rid of the easiest cases as soon as possible, so as to obtain better accuracy on the more difficult query reformulations.

4.4 Further insight in the model

In Fig. 2b we report the most representative rule (i.e., the one with highest precision) for each of the first four classifiers.

We can observe that the rule for generalization (G) asks for a high similarity of terms and, as expected, also requires the second query to be shorter than the first one, as forced by the negative value of \(LENGTH\_DIFF\_RATIO\). The most representative rule for specialization (S), instead, requires high similarity of n-grams and that the second query has at least one term more than the first: thus the second query must be longer than the first, as intuitively expected, and the opposite of what happens with G.

The rule of the third model, for error correction (C), requires a small edit distance and that the query reformulation occurs close to the beginning of the session. Finally, the most representative rule for parallel move (P) requires the reformulation to appear late in the session and to have small similarity.

The fifth model is more complex to inspect, as it contains 15 classifiers, each made up of several rules. It is worth highlighting that this model makes large use of the four additional features, i.e., the confidences with which the four previous classifiers predicted an example to be negative.

For instance, the following is a rule that we can find in one of the 15 classifiers of the boosting model:

$$ \begin{aligned} &{\bf if} \, confidence(is\_C? = N) > 0.99\\ &{\bf and} \, confidence(is\_G? = N) \leq 0.94\\ &{\bf and} \, PROBABILITY\_FORWARD > 0.5\\ &{\bf and} \, PROBABILITY\_REVERSE \leq 0.5\\ &{\bf then}\, is\_G? = Y\\ \end{aligned} $$

The rule says that, for a given example \(\left(q,q^{\prime}\right)\): if the confidence with which the first model decided that it is not a generalization is not that high, while the confidence (from the third model) of it not being an error correction is very high, and if more than half of the times q appears in the query log it is followed by \(q^{\prime}\), while less than half of the times \(q^{\prime}\) appears in the query log it is preceded by q, then \(\left(q,q^{\prime}\right)\) is a generalization. This example also shows how false negative errors of the first four classifiers may be “corrected” by the fifth classifier.

Our model is able to achieve high accuracy also thanks to some very difficult predictions that it makes correctly. In Table 3 we report some of these difficult cases.

Table 3 Some examples of difficult cases predicted correctly by our classifier

Consider the example in the first row: our classifier is able to correctly determine that the reformulation from query “dango” to query “japanese cakes” is a generalization. Another nice example can be found in the last row, where the query “sport” is specialized into “PSV Eindhoven v Tottenham”: also in this case the guess was not straightforward, due to the lack of textual similarity.

5 Query reformulation patterns

Using our model we can automatically label query transitions in very large query logs to analyze typical patterns. In this section we report some results of this analysis.

5.1 Datasets

We used two datasets from Yahoo!’s in-house query logs. The first corresponds to the UK dataset from which the training data described above were extracted. The second comes from searches on the Yahoo! US search engine in early 2008. Single-query sessions are not considered in these data.

We first segmented all user activity into chains through the model we developed in Boldi et al. (2008). Then, we extracted from each query log the features listed in Table 2 and labeled each query reformulation in each chain with the model we described in the previous section.

The classification of query reformulations transforms each chain into a string of QRTs, and the query log into a bag of strings. Each string starts and ends with a special character X, representing the border of a search mission. Thus our query log looks like: { XPSX, XPSPX, XCCCPPX, XSSPX, XPPSPPPPPX, XPSGPSSSX, XPPPPSX, XSGSX, XSX, XSPPCX, \(\ldots\)} (since single-query sessions are not present, the string XX does not occur in the data).
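
The transformation itself is straightforward; a minimal sketch, where `classify` is a hypothetical stand-in for the QRT model of Sect. 4, returning one of "G", "S", "C", "P" for a query pair:

```python
def chain_to_string(chain, classify):
    """Turn a search mission (list of consecutive queries) into a QRT string.

    chain    -- list of queries in one mission
    classify -- function (q, q2) -> one of "G", "S", "C", "P"
    The special character X marks the mission borders.
    """
    labels = [classify(q, q2) for q, q2 in zip(chain, chain[1:])]
    return "X" + "".join(labels) + "X"

# The whole labeled query log then becomes a bag of strings:
# bag = [chain_to_string(c, classify) for c in chains]
```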

The UK dataset contains 3 376 775 chains, for a total of 6 578 275 QRTs, not counting mission changes. The US dataset contains 4 087 898 chains, for a total of 10 496 317 QRTs. We remark that the datasets we analyze are much larger than those reported in the literature for this problem: for instance, Rieh and Xie (2006) analyze 313 chains, all containing at least 6 queries (i.e., 5 query reformulations), for a total of 2 109 QRTs, while Jansen et al. (2007a, b) analyze approximately 1.5 million query reformulations. Even if we focused only on chains of length at least 5, we would still have 222 727 chains for a total of 1 529 539 QRTs in the UK dataset, and 527 420 chains containing 4 316 676 QRTs in the US dataset. In the following we denote by “\({\bf UK} \geq 5\)” and “\({\bf US} \geq 5\)” the two datasets restricted to such long chains.

5.2 Query reformulation distribution

In Fig. 3a we report the distribution of chain lengths in the two datasets (without counting the special symbol X), while in Fig. 3b we report the distribution of reformulation types. In Fig. 3c we show the distribution of reformulation types from the work of Rieh and Xie (2006) (merging into class P the different categories that they consider: parallel move, replacement with synonym, term variation, etc.), and we compare it with the US dataset limited to chains of length 5 or more (to mimic what Rieh and Xie do on their own data). The reader can appreciate the substantial agreement between our findings and those in Rieh and Xie (2006).

Fig. 3 a Distribution of chain length in the two datasets, without counting the special symbol X. b and c QRT distributions

As reported by Rieh and Xie (2006), class P is by far the most populated (47–58%). It is worth noting that this figure is slightly overestimated, as it is partially due to some false negative errors of the model used to segment sessions into chains (Boldi et al. 2008). In fact, we have observed that mission changes that are not detected as such by that first model are typically recognized as P by the QRT classification model. This is quite natural if we consider that parallel move is the class semantically closest to mission change, as depicted in Fig. 1.

The widespread presence of P would call for a more fine-grained categorization of this kind of reformulation, like the one adopted by Rieh and Xie (2006); to distinguish between “real” parallel moves (in the sense of Rieh and Xie) and other kinds of reformulations, it would probably help to know whether the user clicked on at least one result before reformulating the query. This would be a departure from our decision of considering only information that can be directly deduced from the queries themselves (either from their textual content, or from their temporal position in the user’s query flow); therefore we decided not to pursue this path any further, leaving this kind of fine-grained analysis for future work.

On the generalization-specialization axis, as expected, specializations (30–38%) are much more frequent than generalizations (4–10%). This difference is however largely reduced when focusing on chains of length 5 or more, as reported in Fig. 3c.

5.3 Conditional reformulation probability

For deeper inspection in Table 4 we report conditional probabilities depending on the previous QRT, that is,

$$ P({\rm Current}=a|{\rm Prev}=b). $$

From this table we can make some important observations: (i) the probability of a generalization is boosted after a specialization; (ii) specializations are very likely to occur at the beginning of a chain, or after a generalization; (iii) error corrections are common at the beginning or end of a chain, or after another error correction. Interestingly, all the above observations are confirmed on both datasets.
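
These conditional probabilities can be computed directly from the bag of QRT strings; the following is our own illustration of the computation behind Table 4, not code from the original study:

```python
from collections import Counter

def qrt_conditional_ratios(bag):
    """Ratios P(Current = a | Prev = b) / P(Current = a) from a bag of
    QRT strings such as "XPSX", where X marks the mission borders."""
    bigrams = Counter()
    priors = Counter()
    for s in bag:
        priors.update(s[1:-1])         # QRT symbols only, borders excluded
        bigrams.update(zip(s, s[1:]))  # (Prev, Current) pairs, X included
    total = sum(priors.values())
    prev_totals = Counter()
    for (b, a), n in bigrams.items():
        prev_totals[b] += n
    ratios = {}
    for (b, a), n in bigrams.items():
        if a == "X":
            continue  # "Current" must be an actual reformulation type
        p_cond = n / prev_totals[b]
        p_prior = priors[a] / total
        ratios[(b, a)] = p_cond / p_prior
    return ratios
```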

Table 4 Ratio of the conditional probability P(Current = a|Prev = b) with respect to the prior probability P(Current = a)

5.4 Interesting frequent reformulation pattern

We also counted the frequency of patterns (i.e., substrings of any length) in the datasets. The frequency of a pattern is defined not as its total number of occurrences, but as the number of strings in the database containing the given pattern. We selected some patterns by means of an interestingness measure, defined as the ratio between the actual frequency and the expected frequency computed assuming independence of QRTs. Table 5 lists a few of the interesting patterns we found; they confirm and complement the findings in Table 4: error corrections are more frequent at the beginning of a chain (XC), they also tend to appear contiguously (CC, CCC, \(\ldots\)), and sequences of alternating specialization-generalization are more frequent than expected (SG, GS, \(\ldots\)).
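
One simple way to compute such an interestingness score is sketched below; the exact expectation model is not spelled out above, so the first-order independence approximation used here is our own assumption:

```python
from collections import Counter

def interestingness(bag, pattern):
    """Ratio between observed and expected frequency of a QRT pattern.

    Frequency = number of strings in the bag containing the pattern at
    least once; the expectation assumes symbols occur independently with
    their empirical probabilities.
    """
    observed = sum(pattern in s for s in bag)
    symbol_counts = Counter(c for s in bag for c in s)
    total = sum(symbol_counts.values())
    p_pattern = 1.0
    for c in pattern:
        p_pattern *= symbol_counts[c] / total
    # First-order approximation of the expected number of strings
    # containing the pattern: (available slots) * P(match at one slot).
    expected = sum(max(len(s) - len(pattern) + 1, 0) * p_pattern
                   for s in bag)
    return observed / expected if expected > 0 else float("inf")
```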

Table 5 Interesting patterns

5.5 Topic patterns

In this section we report a preliminary experiment that we conducted in order to check how query reformulations and mission changes relate to query topics. In principle, belonging to the same mission is not the same as belonging to the same topic. For instance, a person looking for information about a country may start by looking at governmental sites, then look for information about art and culture, then check economic indicators, etc. Queries in the same mission may belong to different topics. Also queries in the same broad topic may be part of different missions.

5.5.1 Query topical classification

There are many approaches to topical query classification, e.g., Li et al. (2005). In this experiment we issued each query to the Yahoo! search engine, obtained the top 20 documents, and used an in-house automatic document classifier to obtain the most likely Yahoo! directory (dir.yahoo.com) topic for each document returned. Next, we performed majority voting among the topics of the documents associated to the query to determine the query topic. To increase precision at the expense of coverage, if the main topic was not at least twice as prevalent as the second topic, we considered the query topic “unknown”. This is a slow yet very simple query classification method that is nevertheless quite precise. We used it to classify by topic 100K queries from the UK data and 100K queries from the US data.
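
A sketch of this voting scheme follows; `search` and `classify_doc` are hypothetical stand-ins for the search engine call and the in-house document classifier mentioned above:

```python
from collections import Counter

def query_topic(query, search, classify_doc, k=20):
    """Majority-vote topical classification of a query.

    search       -- function (query, k) -> top-k result documents
    classify_doc -- function mapping a document to its most likely
                    directory topic
    """
    topics = Counter(classify_doc(d) for d in search(query, k))
    ranked = topics.most_common(2)
    if not ranked:
        return "unknown"
    if len(ranked) > 1 and ranked[0][1] < 2 * ranked[1][1]:
        return "unknown"  # main topic not twice as prevalent: low confidence
    return ranked[0][0]
```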

5.5.2 Results

For each query transition, we compared the top-level topic of the two queries involved in the transition: this is usually something very broad such as “science → health”, etc. If the two topics coincide, we count this as a top-level topic match in Table 6. As before we denote mission changes with the transition type X .

Table 6 Fraction of transitions where the top-level topic remains the same, and example salient topic pairs, on both datasets

From a user’s perspective, we can see that whenever our classifier detects a mission change, the user is more likely to change the broad topic than to stay within it. The opposite occurs in the case of generalizations, specializations, and error corrections, where the user is more likely to stay in the same broad topic. As expected, parallel moves are more ambiguous from the perspective of broad topics.

Next, we verified whether some broad topics are more likely than others to motivate certain transition types. Table 6 shows some top-level topic pairs with the highest ratio between their probability conditioned on each transition type and their prior probability.

For generalization (G) and specialization (S), it is frequent to observe pairs of queries that are both reference searches (dictionary/encyclopedia) or both about government-related topics. In the case of parallel moves (P), switches to and from reference search are common. As for mission change (X), we observe an interesting fact: there are frequent changes from and to recreation/entertainment topics, which may signal alternation between work/study-related activities and leisure.

6 Transition graph

The query-flow graph, which we introduced in Boldi et al. (2008), is an aggregated representation of the knowledge about latent querying behavior contained in a query log. It is a directed graph where nodes are queries, and there is an edge from q to \(q^{\prime}\) if the two queries appear consecutively in at least one session of the query log. Moreover, edges may hold application-dependent information of various types.

In this section we report the results of an investigation of the query-flow graph in which the edges have been annotated with transition types (obtained with the classification model presented in Sect. 4) and counts (the number of times the query pair was observed in the log). Figure 4 shows a small sub-graph of the query-flow graph, with edges labeled with QRTs, from the UK dataset. In the following we refer to the query-flow graph with QRT-labeled edges simply as the transition graph.

Fig. 4 Example of some reformulations around the query “barcelona hotels”, extracted from the UK dataset. The feature PROBABILITY_FORWARD is also included in the figure

The investigation reported in this section has a twofold aim: on one hand, we would like to have at least some indirect evidence that our classification does not contain major inconsistencies, and in doing so we will also understand which parts of the process are more error-prone. On the other hand, through this inspection we will gain deeper insight into the QRT classification task itself.

We used entire sessions to build the graph, not only missions, so mission changes are also included as transitions. For the UK dataset, we used all transitions to construct the graph, whereas for the US dataset, we discarded all hapax transitions (those that only appear once). The resulting transition graphs have the following sizes:

  • UK dataset: 21 247 414 nodes, 21 216 958 arcs (0.99 arcs/node);

  • US dataset: 58 312 610 nodes, 53 960 925 arcs (0.93 arcs/node).

Most properties were studied by filtering the transition graph according to the transition type; this way, each transition graph gave rise to five “slices” of the graph, one for each transition type.

6.1 Overall properties

Table 7 presents some data about the overall structure of the transition graphs; notice that the majority of transitions are either parallel moves (P) or mission changes (X): this is a consequence of the fact that the majority of chains are very short.

Table 7 Basic properties of the transition graphs

Most of the remaining transitions are specializations (concrete evidence that users of search engines reformulate their queries mostly seeking to improve precision, whereas recall is usually not an issue), immediately followed by parallel moves. Generalizations are rare, and so are error corrections; the latter datum, though, largely depends on the fact that the engine itself performs some error correction, so the user rarely needs to correct the query manually.

Table 7 also presents an analysis of strongly connected components, showing that all graphs are extremely sparse and essentially acyclic. If we delete from the graph all isolated nodes and isolated arcs (an arc \(\left(q,q^{\prime}\right)\) is isolated iff q has outdegree 1 and \(q^{\prime}\) has outdegree 0), the number of remaining nodes (called “nontrivial” in Table 7) is extremely small.

6.2 Anti-symmetry and correlations

Some of the transition types should exhibit some natural properties; for example, both G and S are conceptually partial orders, so they should be transitive and anti-symmetric. Of course, we cannot expect these properties to hold deterministically, both because of the presence of noise and because we should take into account the frequency of each observed transition.

We measure symmetry using a weighted reciprocity. This metric takes a value close to 0 if an arc in one direction has a much smaller or larger weight than the arc in the opposite direction, and a value close to 1 if both arcs have similar weights. We define the weighted reciprocity as follows: let \(c\left(q,q^{\prime},t\right)\) be the count associated to arc \(\left(q,q^{\prime}\right)\) in a given graph t (in our setting this corresponds to a graph containing only transitions of type t), or zero if \(\left(q,q^{\prime}\right)\) is not an arc in t, and define

$$ \rho\left(q,q^{\prime},t\right)=\min\left(c\left(q,q^{\prime},t\right),c\left(q^{\prime},q,t\right)\right)/ \max\left(c\left(q,q^{\prime},t\right),c\left(q^{\prime},q,t\right)\right). $$

In the ideal case, if t defines a perfectly anti-symmetric relation this quantity should be 0 for all arcs in t, whereas it should be 1 for perfectly symmetric relations.
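
Computing the average reciprocity of one slice of the transition graph is a single pass over its arcs; a minimal sketch, with the slice given as a dictionary of arc counts (our own representation):

```python
def average_reciprocity(counts):
    """Average rho(q, q', t) over the arcs of one transition-type slice.

    counts -- dict mapping (q, q2) to c(q, q2, t) for a fixed type t;
              missing arcs count as zero, as in the definition above
    """
    rhos = []
    for (q, q2), c_fwd in counts.items():
        c_bwd = counts.get((q2, q), 0)
        # c_fwd >= 1 because (q, q2) is an arc, so max(...) is never zero.
        rhos.append(min(c_fwd, c_bwd) / max(c_fwd, c_bwd))
    return sum(rhos) / len(rhos) if rhos else 0.0
```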

The average \(\rho\left(q,q^{\prime},-\right)\) for all arcs \(\left(q,q^{\prime}\right)\) is shown in Table 7: notice that the values are all very small, due to the sparsity of all graphs, but they are significantly closer to zero (or even exactly zero) for G and S , whereas they are significantly larger for the other transition types.

Another measure of symmetry can be obtained by disregarding the counts and simply measuring the Jaccard coefficient between the set of arcs of each transition graph and its transpose (i.e., the graph obtained by transposing every arc): again, in the absence of noise this measure should ideally be 0 for asymmetric relations and 1 for symmetric ones. Although less fine-grained than the previous measure, because it does not take frequency into account, it can also be used to compare different graphs. Table 8 reports the results for every transition graph and every transpose (for readability, we highlight the largest entry in every row/column): as before, all values are small, but the reader can verify that the largest values are found on the diagonal for C, P, and X (witnessing that these relations are somewhat symmetric), whereas for G and S the largest values are found when each is compared with the transpose of the other.

Table 8 Jaccard coefficients (per mille) between the set of arcs of each graph and the transpose of each graph

Indeed, in the absence of classification errors, S and G should converge to being mutually transposed as the number of observations grows: every specialization performed by one user can be performed, in the opposite direction, as a generalization, and vice versa.

6.3 Entropy of query reformulations

The purpose of this experiment is to measure to what extent the reformulation type is determined by the query. We defined the reformulation-type entropy of a query as the entropy of the distribution with probabilities

$$ p_q(t) = \sum_{q^{\prime}} c\left( q, q^{\prime}, t\right) / \sum_{q^{\prime},t} c\left( q, q^{\prime}, t \right), $$

where, as before, \(c\left(q,q^{\prime},t\right)\) is the count of reformulations from q to \(q^{\prime}\) having reformulation type t. Here we ignore the transition type X. To consider only queries for which we have enough information, we averaged the entropy over all queries q having \(\sum_{q^{\prime},t} c\left(q,q^{\prime},t\right) \ge 100\).

An average value close to 0 would mean that the query almost completely determines the reformulation type (for instance, that certain queries are almost always followed by a correction, while others are almost always followed by a parallel move, and so on). An average value close to 2 (there are four categories here: G, S, C, P) would mean that any reformulation type is equally possible. The measured value is close to 1, as shown in Table 9, meaning that when reformulating a query the user decides, on average, mostly between two reformulation types.

Table 9 Entropy measures

Next we measured to what extent a certain reformulation type is more predictable than another. For instance, if a given query is followed by an error correction, we would expect the particular error correction chosen to be more determined by the query than if the user were doing a “parallel move” reformulation, where there is a broader range of choices.

To measure this we examined the next-query entropy for a query q and a reformulation type t, this is the entropy of the distribution with probabilities

$$ p_{t,q}\left(q^{\prime}\right) = c\left( q, q^{\prime}, t\right) / \sum_i c( q, i, t ). $$

We averaged this over the same queries as for the reformulation-type entropy. The results are shown in Table 9. The next-query entropy is small for generalizations and error corrections, but closer to 1 than to 0, meaning that there is still some variability when the user decides to use these types of reformulation. The next-query entropy for specializations and parallel moves is substantially higher, from 3 to 6 bits, meaning that users pick among several choices on average (the entropy is probably lower in our US graph due to the removal of pairs with count equal to one).
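
Both entropy measures can be computed in one pass over the labeled transition counts; the sketch below is our own rendering of the two definitions (counts keyed by (q, q′, t), with X transitions already removed):

```python
import math
from collections import defaultdict

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

def qrt_entropies(counts, min_total=100):
    """Average reformulation-type entropy and per-type next-query entropy.

    counts -- dict mapping (q, q2, t) to c(q, q2, t), with t in "GSCP"
    """
    by_type = defaultdict(lambda: defaultdict(int))  # q -> {t: count}
    by_next = defaultdict(lambda: defaultdict(int))  # (q, t) -> {q2: count}
    totals = defaultdict(int)
    for (q, q2, t), c in counts.items():
        by_type[q][t] += c
        by_next[(q, t)][q2] += c
        totals[q] += c
    eligible = [q for q, n in totals.items() if n >= min_total]
    if not eligible:
        return 0.0, {}
    type_entropy = sum(
        entropy([n / totals[q] for n in by_type[q].values()])
        for q in eligible) / len(eligible)
    next_entropy = {}
    for t in "GSCP":
        per_query = []
        for q in eligible:
            dist = by_next.get((q, t))
            if dist:
                tot = sum(dist.values())
                per_query.append(entropy([n / tot for n in dist.values()]))
        if per_query:
            next_entropy[t] = sum(per_query) / len(per_query)
    return type_entropy, next_entropy
```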

7 Query recommendation

In this section we demonstrate that the automatic QRT classifier can be applied to a key task for search engines: the generation of query suggestions. This section extends results presented in Boldi et al. (2009).

7.1 Experimental framework

Our experiments on query recommendation are based on the “Spring 2006 Data Asset” distributed by Microsoft Research. The data consist of a query log excerpt with 15 million queries, most of them in English, sampled over one month; each record includes a query and query-id, an anonymous session-id, a timestamp, and the results (for each result, the position on the result page and a timestamp are also provided). A portion of adult queries was extracted and provided separately; we did not use them in our experiments.

We encoded the data using the WebGraph framework (Boldi and Vigna 2004), which was originally built to represent web graphs but turns out to be useful for representing large graphs succinctly in general, and the high-performance hashing classes from the Sux4J project (Belazzougui et al. 2009).

For creating the query-flow graph, we used the model trained on a different dataset: a set of query pairs \(\left(q,q^{\prime}\right)\) extracted from a query log of the Yahoo! UK search engine in early 2008. These query pairs were first used to build a model (Boldi et al. 2008) for segmenting user sessions into chains, that is, topically coherent sequences of queries by one user.

The query recommendation methods are based on the probability of being at a certain node after performing a random walk over a query graph. This random walk starts at the node corresponding to the input query. At each step, the random walker either remains in the same node, with probability 0.9, or follows one of the out-links, with probability 0.1; in the latter case, the links are followed with probability proportional to w(i, j). The weights w(i, j) can be arbitrary and are used to bias the random walk towards highly relevant items; we describe several concrete weighting schemes below. For the random walk, we either do a single step, or repeat the process for 5 or 10 iterations.

We compare two different scoring methods. In the first, the queries to present to the user are chosen based on the personalized PageRank values obtained by the random walk described above: this is the “absolute” scoring method in Tables 13 and 14. An alternative scoring method ranks the results by the ratio between the values obtained in the previous case and the PageRank values obtained using no personalization (i.e., restarting at a random node), setting the random jump probability to 0.15 and letting the algorithm run until convergence: this is the “relative” scoring method in the same tables.
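
A compact sketch of the walk and of the two scoring methods, assuming the graph slice fits in memory as adjacency and weight dictionaries (our own simplification; the actual implementation used the WebGraph framework):

```python
def recommend(graph, w, start, steps=10, stay=0.9, baseline=None, top=5):
    """Short random walk with self-loop probability 0.9, as described above.

    graph    -- dict: query -> list of out-neighbors
    w        -- dict: (q, q2) -> arc weight w(q, q2)
    start    -- the input query
    baseline -- optional non-personalized PageRank values; if given,
                queries are ranked by p[q] / baseline[q] ("relative"
                scoring), otherwise by p[q] alone ("absolute" scoring)
    """
    p = {start: 1.0}
    for _ in range(steps):
        p_new = {}
        for q, mass in p.items():
            out = graph.get(q, [])
            if not out:                       # dangling node: keep the mass
                p_new[q] = p_new.get(q, 0.0) + mass
                continue
            p_new[q] = p_new.get(q, 0.0) + stay * mass
            total_w = sum(w[(q, q2)] for q2 in out)
            for q2 in out:                    # move with probability 0.1
                share = (1 - stay) * mass * w[(q, q2)] / total_w
                p_new[q2] = p_new.get(q2, 0.0) + share
        p = p_new
    score = (lambda q: p[q] / baseline[q]) if baseline else (lambda q: p[q])
    return sorted((q for q in p if q != start), key=score, reverse=True)[:top]
```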

7.2 Baseline for query recommendation

For comparison, we also implemented a query-recommendation system based on the method by Craswell and Szummer (2007), which uses a bipartite query-document graph. This graph is defined as \(G^{\prime}=\left(Q \cup D,E^{\prime}\right)\), \(E^{\prime} \subseteq Q \times D\), with Q the set of queries and D the set of documents. The edges are symmetric: \((i,j) \in E^{\prime} \Rightarrow (j,i) \in E^{\prime}\). Let \({c^{\prime}: E^{\prime} \rightarrow \mathbb{N}}\) be the click counts, with \(c^{\prime}(i,j) = c^{\prime}(j,i)\) being the number of clicks obtained by document j when shown as a result of query i.

Although there are several alternatives for the transition probabilities, we used the two weighting schemes described in Craswell and Szummer (2007). The “forward” weighting scheme follows edges proportionally to the number of clicks associated to them, using weights

$$ w_f(i,j) = \frac{c^{\prime}(i,j)}{\sum_{k: (i,k) \in E^{\prime}} c^{\prime}(i,k)}. $$

The “backwards” weighting scheme uses different weights

$$ w_b(i,j) = \frac{w_f(j,i)}{\sum_{k: (j,k) \in E^{\prime}} w_f(j,k)}. $$
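
Both schemes can be computed in two passes over a click-count dictionary; the data layout below is our own illustration. Note that, since each row of \(w_f\) sums to one, the denominator of \(w_b\) is one as well, so the code keeps it only to mirror the formula:

```python
from collections import defaultdict

def click_weights(clicks):
    """Forward and backward weights on the bipartite query-click graph.

    clicks -- dict mapping (i, j) to c'(i, j); edges are symmetric, so
              both (i, j) and (j, i) are present with the same count
    """
    out_total = defaultdict(int)
    for (i, _), c in clicks.items():
        out_total[i] += c
    w_f = {(i, j): c / out_total[i] for (i, j), c in clicks.items()}
    # Denominator of w_b: sum of forward weights leaving j (always 1
    # after normalization, so effectively w_b(i, j) = w_f(j, i)).
    out_wf = defaultdict(float)
    for (j, k), weight in w_f.items():
        out_wf[j] += weight
    w_b = {(i, j): w_f[(j, i)] / out_wf[j] for (i, j) in clicks}
    return w_f, w_b
```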

In the paper introducing these weights, the authors observe that the “backwards” weighting scheme provides better results than the “forward” one for their task of finding relevant images for an input query. In our experimental results we observe the same, with an even greater advantage for the “backwards” weighting scheme, as presented below.

For generating the recommendations we proceed as above, except that we use 6 or 12 iterations, so as to perform an even number of steps and end the random walk at a query rather than at a document.

7.3 Assessment method

The evaluation of the recommendations produced by the different systems was done as follows. A set of 114 input queries with frequencies between 700 and 15 000 was selected at random; we used these frequency limits to avoid very frequent queries (which are often navigational, and for which query recommendations are not useful) and very infrequent queries (for which this dataset yields no recommendations). The queries were very varied in nature, e.g., “grey’s anatomy”, “juno”, “Maggie Gyllenhaal”, “cnn news”, and “guitar tabs”. We discarded all queries containing a domain name.

Next, we generated the top 5 recommendations for each query using each system, and pooled the results together; this yielded on average 53.4 different recommendations per query. Then a group of 5 assessors used a simple assessment interface in which each assessor was presented with a random query and then, in sequence, all the different recommendations for that query in random order, without knowing which system(s) produced each recommendation.

The assessor was also able to see the search engine results for both the original query and the recommended query being evaluated. The assessor was asked whether the recommendation was useful, somewhat useful, or not useful with respect to the original query. A very broad instruction was given: a useful recommendation is a query that, if submitted to the search engine, provides new results that were not available using the original query, and that agree with the inferred user intent of the original query. Of course there is a great deal of subjectivity in this assessment, as the original intent is not known for sure by the assessor.

Table 10 shows a sample assessment for the input query “cnn news”. In practice, recommendations judged useful are typically either specializations or parallel moves in the sense of Rieh and Xie (2006), while recommendations judged not useful tend to be either trivial variants of the original query or completely unrelated queries.

Table 10 Example assessments for query “cnn news”

In total, we received 6,093 assessments, distributed as shown in Table 11.

Table 11 Distribution of assessments, n = 6,093

The assessment task was described as difficult by the assessors. We measured inter-assessor agreement on 560 overlapping query-recommendation pairs that were judged by two different assessors. We considered three scenarios: (A) each label is a separate category; (B) “somewhat useful” and “not useful” form one category; (C) “useful” and “somewhat useful” form one category. We then measured the observed agreement \(P_a\) and Cohen’s Kappa statistic, which compares the agreement expected by chance \(P_c\) with the observed agreement using the formula \(\kappa = \frac{P_a-P_c}{1-P_c}\).
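For reference, a minimal sketch of this computation (function and variable names are illustrative):

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa = (P_a - P_c) / (1 - P_c) for two assessors'
    labels over the same items."""
    n = len(labels_a)
    # Observed agreement P_a: fraction of items labeled identically.
    p_a = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement P_c: sum over labels of the product of the two
    # assessors' marginal label probabilities.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_c = sum(freq_a[l] * freq_b[l] for l in freq_a) / n ** 2
    return (p_a - p_c) / (1 - p_c)

# Scenario C merges "useful" and "somewhat useful" before computing kappa:
merge_c = lambda l: "useful+" if l in ("useful", "somewhat useful") else l
# kappa_c = cohens_kappa([merge_c(l) for l in a], [merge_c(l) for l in b])
```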

As shown in Table 12, scenario C is the best of the three and shows a moderate level of agreement between the assessors (κ = 0.59). This relatively modest agreement is in line with other similarly subjective web evaluation tasks, e.g., κ = 0.85 for web page type classification (Haas and Grams 1998), κ = 0.72 for query type classification (White et al. 2007), κ = 0.61 for link type classification (Haas and Grams 1998), and κ = 0.63 for web spam classification (Castillo et al. 2006).

Table 12 Inter-assessor agreement as a probability \(P_a\) and in terms of Cohen’s Kappa \(\kappa, \, n=560\)

7.4 Results

7.4.1 Usefulness score

The Uscore column in Table 13 is the probability that a recommendation issued by a system is labeled “useful” or “somewhat useful”, in accordance with scenario C, which maximizes inter-assessor agreement as explained above. The significance column (p-value, omitted when above 0.1) contains the probability of observing a score of Uscore or less by chance, assuming that all systems have the same accuracy as the top one.

Table 13 Usefulness score for each system: probability that a recommendation issued by the system is useful or somewhat useful

Small differences in p-value between systems having the same Uscore arise because significance is computed using the number of valid assessments each system received among the 114 evaluated queries, excluding the “Can not assess” label in Table 11. Lines are drawn in the table at p = 0.1, 0.05, 0.01. Note that we are testing our systems against a very strong null hypothesis: since only the top 5 recommendations are considered, and many of them are correct, the probability of guessing well among them is high even at random.
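The exact form of this test is not spelled out above; one plausible instantiation, under the simplifying assumption that each valid assessment is an independent draw, is a one-sided binomial test:

```python
from scipy.stats import binom

def uscore_p_value(useful_count, n_valid, top_uscore):
    """P(X <= useful_count) for X ~ Binomial(n_valid, top_uscore):
    the chance of a Uscore this low if the system were as accurate
    as the top system."""
    return binom.cdf(useful_count, n_valid, top_uscore)
```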

In the recommendations generated using the query-flow graph, the score decreases as more transition types are introduced: specialization transitions seem to produce the most useful recommendations (Queryflow-S), whereas adding parallel moves (Queryflow-SP), corrections (Queryflow-SPC), and finally generalizations (Queryflow-GSPC, different at p = 0.06) yields progressively less useful recommendations.

The “absolute” scoring method works better than the “relative” scoring method for the queryflow-based recommendations at a significance of p = 0.04, and doing multiple iterations instead of only one (which corresponds to taking the maximum) is better at p = 0.06.

Table 13 also includes a system named simply “Queryflow”, without any slice name: in this system the weights are computed over all transitions, regardless of whether they were part of the same mission. It performs worse than the system that selects only specializations and counts only transitions within the same mission, at p = 0.01.

The recommendations based on the baseline (query-document graph) have either the same performance as those using Queryflow-S, or a lower performance at a significance of p = 0.07. For this baseline, the “backwards” weighting scheme performs much better than the “forward” weighting scheme, at p < 0.01. This was already noticed by Craswell and Szummer (2007); the gap, in our case, is even larger. Finally, Fig. 5 charts the best-performing variant of each system.

Fig. 5 Usefulness scores, best variant per system

7.4.2 Diversity score

Next we computed a measure of diversity in the result sets. For each sampled query, we took each recommendation labeled useful or somewhat useful, issued it to a search engine, and collected its top-5 results. Given that we take the top-5 recommendations per system, this generates a maximum of 25 URLs. The average Dscore in Table 14 is the number of distinct URLs in this multiset that were not present in the result set of the original query, averaged across the 114 evaluated queries.

Table 14 Diversity score of recommended queries: distinct documents among the top-5 results for the top-5 useful or somewhat useful recommendations
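A minimal sketch of the per-query Dscore computation; here `search` is a hypothetical helper returning ranked result URLs, not part of our system:

```python
def dscore(original_results, useful_recommendations, search, k=5):
    """Per-query diversity score: distinct URLs among the top-k results
    of up to k useful/somewhat-useful recommendations that are absent
    from the original query's result set."""
    seen = set(original_results)
    new_urls = set()
    for rec in useful_recommendations[:k]:  # top-k recommendations
        new_urls.update(u for u in search(rec)[:k] if u not in seen)
    return len(new_urls)                    # between 0 and k * k = 25
```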

Significance is computed using the individual score (0-25) obtained by each system for each of the 114 assessed queries; we assume the scores are normally distributed and compute the probability of observing scores as low as or lower than those we get, under the null hypothesis that all systems have the same performance as the top system (a one-sided t-test). Lines are drawn in the table at p = 0.1, 0.05, 0.01. We observe a change in the relative positions of the systems in the top half of the table with respect to Table 13, indicating that this measure captures something different from the measure based purely on the labels assigned to the recommended queries.
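Read this way, the test can be sketched as follows; pairing the scores per query is our assumption, as the exact test setup is not detailed above:

```python
from scipy.stats import ttest_rel

def dscore_p_value(top_scores, system_scores):
    """One-sided paired t-test over the per-query Dscores: the p-value
    for observing scores as low as system_scores under the null that
    the system performs like the top system."""
    _, p = ttest_rel(top_scores, system_scores, alternative="greater")
    return p
```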

8 Conclusion

This section synthesizes our main findings and outlines directions for future work.

8.1 Main findings

During the course of this research, we have found that it is possible to automatically determine the type of a query reformulation if the appropriate features are used. We achieved 92% accuracy in distinguishing among four broad classes of query reformulations, and observed that the choice of learning scheme matters, as it can exploit the fact that some class boundaries are fuzzier than others.

We applied the classifier to a large query log and studied reformulation paths, i.e., the sequences of reformulations that a user performs in the course of a search mission. This allowed us to study query reformulation patterns, matching some results of previous studies done over much smaller data sets using manual assessments, and extracting new patterns that become discoverable because our automatic classifier can process far more data than manual annotation allows. From some of the patterns we extracted, we can see, for instance, that generalizations and specializations frequently appear together in alternating order, and that error corrections are more frequent either at the beginning of a search mission or after another error correction. When mapping query transitions to topical categories, we see that reference search is a typical context for generalizations and specializations, and that many mission changes are associated with switches from or to entertainment/recreation sites.

We annotated a large query-flow graph with transition types, and observed the anti-symmetry of generalization and specialization there. We also observed that, given a query, the distributions of its possible generalizations and error corrections tend to be more concentrated than the distributions of its specializations or parallel moves.

8.2 Follow-up work

Since our initial formulation in Boldi et al. (2008) and follow-up papers (Boldi et al. 2009a, b), other aspects of query-flow graphs have been studied.

Baraglia et al. (2009, 2010) show that the transition probabilities in the query-flow graph change over time. The changes in the graph may reduce the quality of the recommendations if an old query-flow graph is used.

Anagnostopoulos et al. (2010) propose a method for generating query recommendations based on optimizing the expected path a user will take on the query-flow graph. This can lead to a better user experience in terms of issuing several interesting queries in sequence, while keeping the relevance of query recommendations high.

Bordino et al. (2010) embed the query-flow graph (or sub-graphs of it) into a low-dimensional space. The authors show that this projection preserves semantic distances between queries while allowing a fast computation of query similarity.

8.3 Future work

In this paper we focus mainly on characterization and pattern mining, but the next natural step is to use these results as building blocks for several applications. In particular, the query transition graph can be used to build new query recommendation systems, or to improve existing ones.

One important feature of recommendations is diversity: we may achieve diverse recommendations by exploring the transition graph to find an appropriate combination of specializations, generalizations, and parallel moves. Another issue is taking user context and the history of previous queries into consideration (i.e., recommendation with history (Boldi et al. 2008)): we may provide recommendations that depend not only on the last query but on the last 3–4 queries, and that belong to the QRT class most likely to occur next. Using the frequency of query reformulation patterns mined from large query logs, as reported in Sect. 5, we can define a stochastic process that tells us which QRT is most probable next, and then use this information to decide which paths to follow from the current node in the query graph (i.e., the last user query); a minimal sketch of such a process is given below. Another possible application is lookahead recommendation: based on the observation that query recommendations are mostly useful when they are specializations, we can visit the specialization transition graph and recommend queries that are specializations of specializations of the current query. This may provide unexpected yet interesting recommendations, and in some cases anticipate the user in her own search mission.
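To make the stochastic-process idea concrete, here is a minimal sketch; note that the transition probabilities below are placeholder values for illustration only, not the frequencies mined in Sect. 5:

```python
# Placeholder transition probabilities between QRT classes
# (G = generalization, S = specialization, C = error correction,
# P = parallel move); the real model would use the pattern
# frequencies mined from the query log.
QRT_TRANSITIONS = {
    "S": {"S": 0.40, "G": 0.30, "P": 0.20, "C": 0.10},
    "G": {"S": 0.50, "G": 0.10, "P": 0.30, "C": 0.10},
    "P": {"S": 0.30, "G": 0.20, "P": 0.40, "C": 0.10},
    "C": {"S": 0.30, "G": 0.10, "P": 0.30, "C": 0.30},
}

def most_probable_next_qrt(history):
    """First-order Markov prediction of the next reformulation type;
    a higher-order model could condition on the last 3-4 QRTs."""
    dist = QRT_TRANSITIONS[history[-1]]
    return max(dist, key=dist.get)

# e.g. most_probable_next_qrt(["P", "G"]) -> "S": after a
# generalization, follow specialization edges from the current query.
```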

Composed query reformulation graphs could also be a fruitful source of query recommendations. To illustrate the evidence we have found for this, we composed the G and S graphs to obtain a graph in which each edge indicates a two-step reformulation (specializing and then generalizing the original query, or vice versa). We weighted the edges in these graphs by multiplying the probabilities of following each link (these probabilities are the feature count_norm1 in Table 2); a matrix-multiplication sketch of this composition follows Table 15. The top SG and GS paths from a set of example queries in the UK dataset are shown in Table 15, along with the top parallel moves (P) by count from each example query. In the examples we reviewed, the SG and GS paths yield interesting recommendations. Comparing them with other types of path (including, e.g., \(SS^T\) and \(S^TS\)) is one of our projects for extending the current work.

Table 15 Example showing the possibilities of composing query reformulation graphs
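The composition itself reduces to multiplying row-stochastic adjacency matrices; a minimal sketch, assuming the reformulation graphs are stored as SciPy sparse matrices over a shared query vocabulary (with the count_norm1 probabilities as weights):

```python
import numpy as np
from scipy.sparse import csr_matrix

def compose(A, B):
    """Compose two reformulation graphs given as row-stochastic sparse
    matrices: entry [i, j] of A @ B is the probability of reaching
    query j from query i in two steps (e.g. S @ G = specialize, then
    generalize: the SG graph)."""
    return csr_matrix(A) @ csr_matrix(B)

# Top-5 two-step recommendations for query index q from the SG graph:
# sg = compose(S, G)
# top5 = np.argsort(sg[q].toarray().ravel())[::-1][:5]
```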

Finally, simultaneously learning both the query reformulation types and the segmentation of a session into chains (the two tasks that we identified and separated in the Introduction) might yield a non-trivial improvement in accuracy. This would mean formulating our task in terms similar to, for instance, joint part-of-speech tagging and bracketing in Natural Language Processing. The insights obtained from the graph analysis can also be fed back into the learning process, for example by imposing an asymmetry constraint between specialization and generalization.