A machine learning approach for result caching in web search engines

https://doi.org/10.1016/j.ipm.2017.02.006

Highlights

  • To the best of our knowledge, our work is the first in the literature to apply machine learning techniques to the result caching problem in search engines, covering static, dynamic, and state-of-the-art static-dynamic cache organizations.

  • We evaluate a large set of features and illustrate that they can be exploited to increase the hit rate of result caches.

  • We evaluate various oracle caching strategies to illustrate the potential room for improvement in the result caching problem.

  • We show that the proposed machine learning framework can improve the hit rate of result caches, potentially reducing the energy consumption in search engines.

Abstract

A commonly used technique for improving search engine performance is result caching. In result caching, precomputed results (e.g., URLs and snippets of best matching pages) of certain queries are stored in a fast-access storage. The future occurrences of a query whose results are already stored in the cache can be directly served by the result cache, eliminating the need to process the query using costly computing resources. Although other performance metrics are possible, the main performance metric for evaluating the success of a result cache is hit rate. In this work, we present a machine learning approach to improve the hit rate of a result cache by leveraging a large number of features extracted from search engine query logs. We then apply the proposed machine learning approach to static, dynamic, and static-dynamic caching. Compared to the previous methods in the literature, the proposed approach improves the hit rate of the result cache by up to 0.66%, which corresponds to 9.60% of the potential room for improvement.

Introduction

Scalability and efficiency are two crucial aspects of performance in search engines (Cambazoglu & Baeza-Yates, 2015). A commonly used technique for improving search engine performance is result caching (Baeza-Yates et al., 2007a). In result caching, precomputed results (e.g., URLs and snippets of best matching pages) of certain queries are stored in a fast-access storage. The future occurrences of a query whose results are already cached can be directly served by the result cache, eliminating the need to process the query using costly computing resources. Result caching has two immediate benefits to a search engine. First, by reducing the computational load on the server side, it enables higher query processing throughput. Second, it reduces the average query response time perceived by the users.

In the result caching problem, the goal is to maintain a set of previously computed query results in a limited-capacity cache such that some performance metric is optimized over the time as new queries are received. Although other performance metrics are possible (Altingovde, Ozcan, Ulusoy, 2009, Gan, Suel, 2009), the main performance metric for evaluating the success of a result cache is hit rate, i.e., the fraction of queries that are answered by the cache. In practice, increasing the hit rate requires careful selection of queries whose results are to be cached.

In the literature, there are two types of result caches: static and dynamic. In static caching, the cache is filled in an offline manner with the results of queries selected from the past query logs. The underlying assumption in static caching is the steady behavior in the query stream, i.e., queries which were popular in the past remain popular in the future. Therefore, popular queries are preferred over the rest when making the caching decisions. In dynamic caching, the caching decisions are online and the goal is to identify queries that are least likely to result in a cache hit and evict such queries to make room for other, more recent queries. The underlying assumption in dynamic caching is the bursty behavior in the query stream, i.e., queries which are submitted more recently have higher probability of reoccurring in the future. Fagni, Perego, Silvestri, and Orlando (2006) show that a hybrid caching approach combining these two types of caches performs better than using them in isolation.
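The hybrid organization above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the static segment is filled offline with the most frequent past queries, and the dynamic segment is assumed to use LRU eviction (a common choice for dynamic result caches).

```python
from collections import Counter, OrderedDict

class StaticDynamicCache:
    """Hybrid result cache: a static segment filled offline with the
    most frequent past queries, plus an LRU-managed dynamic segment."""

    def __init__(self, past_queries, static_size, dynamic_size):
        # Static segment: top queries by past frequency (filled offline).
        top = Counter(past_queries).most_common(static_size)
        self.static = {q for q, _ in top}
        self.dynamic = OrderedDict()        # keys ordered by recency
        self.dynamic_size = dynamic_size
        self.hits = self.lookups = 0

    def lookup(self, query):
        """Return True on a cache hit; on a miss, admit the query
        into the dynamic segment, evicting the LRU entry if full."""
        self.lookups += 1
        if query in self.static:
            self.hits += 1
            return True
        if query in self.dynamic:
            self.dynamic.move_to_end(query)  # refresh recency
            self.hits += 1
            return True
        if len(self.dynamic) >= self.dynamic_size:
            self.dynamic.popitem(last=False)  # evict least recently used
        self.dynamic[query] = True
        return False

    def hit_rate(self):
        return self.hits / self.lookups if self.lookups else 0.0
```

On a stream exhibiting both steady and bursty behavior, the static segment captures the perennially popular queries while the dynamic segment captures recent bursts, which is why the hybrid outperforms either segment alone.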

Past research has tried to improve the performance of result caching by relying on a single or limited number of features. The primary objective of this work is to devise a unifying caching framework which exploits an extensive set of features. To this end, we propose a machine learning approach that combines a large variety of features extracted from search engine query logs and evaluate the impact of this framework on the hit rate of a search engine result cache. We devise different learning models for static and dynamic result caching. For the former type of caches, we focus on the offline cache allocation problem. For the latter type of caches, we focus on the online eviction problem. As a secondary objective, we aim to identify the potential room for improvement in result cache hit rate. To this end, we propose and evaluate various oracle algorithms to understand the potential room for improvement in hit rate. Although machine learning has been used in other caching tasks (e.g., time-to-live prediction (Alici, Altingovde, Ozcan, Cambazoglu, & Ulusoy, 2012), refreshing (Jonassen & Bratsberg, 2012), and admission (Ozcan, Altingovde, Cambazoglu, Junqueira, & Ulusoy, 2012)), our approach is the first that employs machine learning in result caching and eviction.
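The online eviction problem mentioned above can be illustrated with a small sketch. Here, a dynamic cache evicts the entry whose predicted probability of reoccurring is lowest; `score_fn` is a hypothetical stand-in for a trained model (the paper's actual models and features differ).

```python
# Sketch of learned eviction: on a miss with a full cache, evict the
# cached query whose model score (predicted reoccurrence probability)
# is lowest. `score_fn` stands in for a trained model's predict().

class LearnedEvictionCache:
    def __init__(self, capacity, score_fn):
        self.capacity = capacity
        self.score_fn = score_fn   # hypothetical: query -> score
        self.entries = {}          # query -> cached result (placeholder)

    def lookup(self, query):
        """Return True on a hit; on a miss, evict the lowest-scored
        entry if the cache is full, then admit the new query."""
        if query in self.entries:
            return True
        if len(self.entries) >= self.capacity:
            victim = min(self.entries, key=self.score_fn)
            del self.entries[victim]
        self.entries[query] = object()  # stand-in for the result page
        return False
```

Classical policies such as LRU or LFU are special cases of this scheme in which the "model" scores by recency or frequency alone; the learned variant can combine many signals into one score.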

Our experiments are conducted using a large, real-life query log obtained from Yahoo Web Search. We evaluate our models within the state-of-the-art static-dynamic caching framework proposed by Fagni et al. (2006). Compared to this state-of-the-art framework, the proposed approach improves the hit rate by 0.47%, which corresponds to 7.8% of the possible improvement. Although the improvement in the hit rate is quite modest, it presents a large potential financial benefit for commercial search engines.

The rest of this paper is organized as follows. In Section 2, the previous work on result caching is surveyed. In Section 3, we present the features used in our machine learning models. Section 4 presents the proposed techniques. The details of our query log and experimental setup are explained in Section 5. In Section 6, we present the results of our experiments. We present an extended discussion on the result caching problem and the results of our experiments in Section 7. Finally, we conclude the paper in Section 8.

Section snippets

Related work

The query result caching problem was first investigated in the literature by Markatos (2001). The author evaluates four eviction policies for dynamic result caching, revealing that policies that take into account both frequency and recency perform better than those that rely only on recency. The author also proposes a static caching policy.

Fagni et al. (2006) describe a new caching architecture referred to as static-dynamic caching. In this architecture, the result cache is split

Features

The machine learning models that we build in our work rely on a large number of features, which we will briefly explain in this section. For clarity of presentation, we classify these features into five categories: query, session, index, term frequency, and query frequency. Table 1 provides a list of the extracted features.

Query features. These features are derived from the query itself; therefore, their values do not change across recurrent submissions of the query. The QUERY_LENGTH and TERM_COUNT
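The two query features named above can be sketched as follows. The definitions here are assumptions for illustration: QUERY_LENGTH is taken as the character length of the query string and TERM_COUNT as its number of whitespace-separated terms; the paper's exact definitions may differ.

```python
# Sketch of extracting the two query features named above. Both depend
# only on the query string, so they are identical for every recurrent
# submission of the same query.

def extract_query_features(query):
    terms = query.split()
    return {
        "QUERY_LENGTH": len(query),  # character length of the query
        "TERM_COUNT": len(terms),    # number of terms in the query
    }
```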

Techniques

In this section, we present several caching strategies for the query result caching problem with the objective of maximizing the hit rate. To this end, we first consider two extreme result cache organizations: fully static and fully dynamic result caching. Then, we provide techniques for the state-of-the-art static-dynamic caching approach.

For each of the mentioned cache organizations, we present the evaluated techniques under three headings: baseline, oracle, and proposed techniques. The
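Of the oracle techniques mentioned above, one classical example in the spirit of Belady (1966) can be sketched: with the full future query stream known, a dynamic cache evicts the entry whose next occurrence is farthest in the future. This is an illustrative upper bound on dynamic-cache hit rate, not necessarily one of the paper's exact oracle strategies.

```python
# Clairvoyant (Belady-style) eviction: assumes the full future query
# stream is known, so the evicted entry is the one reused farthest in
# the future (or never again). Yields an upper bound on the hit rate
# achievable by any eviction policy at the given capacity.

def oracle_hit_rate(stream, capacity):
    cache, hits = set(), 0
    for i, q in enumerate(stream):
        if q in cache:
            hits += 1
            continue
        if len(cache) >= capacity:
            def next_use(entry):
                try:
                    return stream.index(entry, i + 1)
                except ValueError:
                    return len(stream)  # never used again
            cache.remove(max(cache, key=next_use))
        cache.add(q)
    return hits / len(stream)
```

Comparing a practical policy's hit rate against this bound quantifies the "potential room for improvement" that the paper's evaluation refers to.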

Data and setup

We conduct extensive experiments on a realistic dataset to evaluate the caching techniques presented in Section 4. In this section, we first introduce the query log and discuss the experimental setup. Then, we present the machine learning setup used in our experiments.

Results

In this section, we present a comprehensive experimental evaluation and comparison of the discussed caching techniques. The proposed experiments are categorized into three groups: static, dynamic, and static-dynamic caching. We perform our experiments with varying cache capacities, selecting the cache capacity as a function of the number of distinct queries in the test set. During our evaluations, we use 1%, 2%, 4%, 8%, and 16% of the number of test queries as the cache capacity. As the evaluation
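The capacity selection described above amounts to a simple computation; a sketch, assuming capacities are derived from the distinct queries in the test stream:

```python
# Derive cache capacities from the test stream: each capacity is a
# percentage of the number of distinct test queries, as in the setup
# described above.

def cache_capacities(test_queries, fractions=(0.01, 0.02, 0.04, 0.08, 0.16)):
    n_distinct = len(set(test_queries))
    return [max(1, round(f * n_distinct)) for f in fractions]
```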

Discussion

In the literature, data sets having heavy-tailed data distributions can be divided into three parts (Brynjolfsson, Hu, & Smith, 2003) according to the frequency distribution of data items: head, torso, and tail. The queries in the search result caching problem also exhibit a heavy-tailed distribution. The head queries are the most frequent queries. They form a large fraction of the overall search engine query traffic relative to their small number. The queries in the torso appear less frequently,

Conclusion

In this paper, we presented machine learning models for result caching in web search engines. These models were trained using a large number of features extracted from the query string, query logs, search index, and search session, and were exploited to make caching decisions, aiming to increase the hit rate of the cache. We evaluated different learning models for both static and dynamic result caching. In the case of static caching, the learning model is used to select queries that will be

References (46)

  • J.H. Friedman

    Stochastic gradient boosting

    Computational Statistics & Data Analysis

    (2002)
  • E.P. Markatos

    On caching search engine query results

    Computer Communications

    (2001)
  • R. Ozcan et al.

    A five-level static cache architecture for web search engines

    Information Processing & Management

    (2012)
  • S. Alici et al.

    Timestamp-based result cache invalidation for web search engines

    Proceedings of the 34th international ACM SIGIR conference on research and development in information retrieval

    (2011)
  • S. Alici et al.

    Adaptive time-to-live strategies for query result caching in web search engines

    Proceedings of the 34th european conference on advances in information retrieval. ECIR’12

    (2012)
  • I. Altingovde et al.

    A cost-aware strategy for query result caching in web search engines

  • I.S. Altingovde et al.

    Second chance: A hybrid approach for dynamic result caching in search engines

    Proceedings of the 33rd european conference on advances in information retrieval

    (2011)
  • R. Baeza-Yates et al.

    The impact of caching on search engines

    Proceedings of the 30th annual international ACM SIGIR conference on research and development in information retrieval

    (2007)
  • R. Baeza-Yates et al.

    Admission policies for caches of search engine results

  • R. Baeza-Yates et al.

    A three level search engine index based in query log distribution

  • X. Bai et al.

    Online result cache invalidation for real-time web search

    Proceedings of the 35th international ACM SIGIR conference on research and development in information retrieval

    (2012)
  • L.A. Belady

    A study of replacement algorithms for a virtual-storage computer

    IBM Systems Journal

    (1966)
  • R. Blanco et al.

    Caching search engine results over incremental indices

    Proceedings of the 33rd international ACM SIGIR conference on research and development in information retrieval

    (2010)
  • E. Bortnikov et al.

    Caching for realtime search

    Proceedings of the 33rd european conference on advances in information retrieval

    (2011)
  • E. Brynjolfsson et al.

    Consumer surplus in the digital economy: Estimating the value of increased product variety at online booksellers

    Management Science

    (2003)
  • B.B. Cambazoglu et al.

    Cache-based query processing for search engines

    ACM Transactions on the Web

    (2012)
  • B.B. Cambazoglu et al.

    Scalability challenges in web search engines. Synthesis Lectures on Information Concepts, Retrieval, and Services

    (2015)
  • B.B. Cambazoglu et al.

    A refreshing perspective of search engine caching

    Proceedings of the 19th international conference on world wide web

    (2010)
  • T. Curk et al.

    Microarray data mining with visual programming

    Bioinformatics

    (2005)
  • T. Fagni et al.

    Boosting the performance of web search engines: Caching and prefetching query results by exploiting historical usage data

    ACM Transactions on Information Systems

    (2006)
  • R.-E. Fan et al.

    LIBLINEAR: A library for large linear classification

    Journal of Machine Learning Research

    (2008)
  • G. Frances et al.

    Improving the efficiency of multi-site web search engines

    Proceedings of the 7th ACM international conference on web search and data mining

    (2014)
  • Gabrilovich, E., Ringgaard, M., & Subramanya, A. (2013). FACC1: Freebase annotation of ClueWeb corpora version 1....

    1. The work was performed while the author was affiliated with Yahoo Labs, Barcelona, Spain.
