research-article

Exploring web scale language models for search query processing

Authors:
Jian Huang, Pennsylvania State University, University Park, PA, USA
Jianfeng Gao, Microsoft Research, Redmond, WA, USA
Jiangbo Miao, Facebook, Palo Alto, CA, USA
Xiaolong Li, Microsoft Research, Redmond, WA, USA
Kuansan Wang, Microsoft Research, Redmond, WA, USA
Fritz Behr, Microsoft Corporation, Redmond, WA, USA
C. Lee Giles, Information Sciences and Technology, Pennsylvania State University, PA, USA

WWW '10: Proceedings of the 19th international conference on World wide web, April 2010, Pages 451–460
https://doi.org/10.1145/1772690.1772737

Published: 26 April 2010

ABSTRACT

It has been widely observed that search queries are composed in a very different style from that of the body or the title of a document. Many techniques that explicitly account for this language style discrepancy have shown promising results for information retrieval, yet a large-scale analysis of the extent of the language differences has been lacking. In this paper, we present an extensive study of this issue by examining the language model properties of search queries and the three text streams associated with each web document: the body, the title, and the anchor text. Our information-theoretic analysis shows that queries seem to be composed in a way most similar to how authors summarize documents in anchor texts or titles, offering a quantitative explanation for the observations in past work.
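As an illustration of the kind of cross-stream comparison described above, the following minimal sketch trains one bigram model per text stream and measures the perplexity of a query sample under each; a lower value suggests a closer language style. The add-k smoothing, the toy streams, and the helper names (`tokenize`, `train_bigram`, `perplexity`) are assumptions made for brevity and are not taken from the paper.

```python
import math
from collections import Counter

def tokenize(text):
    # Lowercase, split on whitespace, and add boundary markers.
    return ["<s>"] + text.lower().split() + ["</s>"]

def train_bigram(texts):
    # Count each context word and each (context, next word) pair.
    context, bigrams = Counter(), Counter()
    for text in texts:
        tokens = tokenize(text)
        for prev, cur in zip(tokens, tokens[1:]):
            context[prev] += 1
            bigrams[(prev, cur)] += 1
    return context, bigrams

def perplexity(model, texts, vocab_size, k=0.1):
    # Add-k smoothed bigram perplexity (assumed smoothing, chosen for simplicity).
    context, bigrams = model
    log_prob, n_events = 0.0, 0
    for text in texts:
        tokens = tokenize(text)
        for prev, cur in zip(tokens, tokens[1:]):
            p = (bigrams[(prev, cur)] + k) / (context[prev] + k * vocab_size)
            log_prob += math.log2(p)
            n_events += 1
    return 2 ** (-log_prob / n_events)

# Toy, made-up streams; a real study would use web-scale data.
streams = {
    "title": ["cheap flights new york", "new york hotels", "new york weather"],
    "body": ["we offer the cheapest flights to new york and other cities",
             "the weather in new york is mild in the spring"],
}
queries = ["new york flights", "new york weather"]

# Shared vocabulary so the models are compared on equal footing.
vocab = {tok for texts in list(streams.values()) + [queries]
         for text in texts for tok in tokenize(text)}

for name, texts in streams.items():
    ppl = perplexity(train_bigram(texts), queries, len(vocab))
    print(f"query perplexity under {name} model: {ppl:.2f}")
```

The shared vocabulary keeps the smoothed probabilities of the two models comparable; with separate vocabularies the perplexities would not be measured on the same event space.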

We apply these web-scale n-gram language models to three search query processing (SQP) tasks: query spelling correction, query bracketing, and long query segmentation. By controlling the size and the order of different language models, we find the perplexity metric to be a good accuracy indicator for these query processing tasks. We show that using smoothed language models yields significant accuracy gains, for instance in query bracketing, compared to using web counts as in the literature. We also demonstrate that applying web-scale language models can have a marked accuracy advantage over smaller ones.
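To make the bracketing task concrete, here is a minimal sketch of one plausible way to bracket a three-word query with smoothed bigram probabilities rather than raw counts. The adjacency-style decision rule, the add-k smoothing, the toy corpus, and the function names are all assumptions for illustration; the abstract does not specify the paper's actual formulation.

```python
from collections import Counter

def train_counts(texts):
    # Collect unigram and bigram counts from whitespace-tokenized text.
    unigrams, bigrams = Counter(), Counter()
    for text in texts:
        tokens = text.lower().split()
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams

def smoothed_cond_prob(w_prev, w_next, unigrams, bigrams, k=0.1):
    # Add-k estimate of P(w_next | w_prev); the unigram count of w_prev
    # stands in for its context count (an approximation).
    vocab_size = len(unigrams)
    return (bigrams[(w_prev, w_next)] + k) / (unigrams[w_prev] + k * vocab_size)

def bracket(query, unigrams, bigrams):
    # Group the adjacent word pair with the stronger smoothed association.
    w1, w2, w3 = query.lower().split()
    left = smoothed_cond_prob(w1, w2, unigrams, bigrams)
    right = smoothed_cond_prob(w2, w3, unigrams, bigrams)
    if left >= right:
        return f"[[{w1} {w2}] {w3}]"
    return f"[{w1} [{w2} {w3}]]"

# Toy, made-up corpus standing in for a web-scale n-gram collection.
corpus = ["new york hotels", "new york weather", "new york flights", "cheap hotels"]
unigrams, bigrams = train_counts(corpus)
print(bracket("new york hotels", unigrams, bigrams))
print(bracket("cheap new york", unigrams, bigrams))
```

Because every pair receives a nonzero smoothed probability, the comparison stays well defined even when one or both word pairs never occur in the corpus, which is where raw counts give no usable signal.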

Index Terms

Exploring web scale language models for search query processing
1. Information systems
  1. Information retrieval

Published in
WWW '10: Proceedings of the 19th international conference on World wide web
April 2010
1407 pages
ISBN: 9781605587998
DOI: 10.1145/1772690
General Chairs: Michael Rappa (North Carolina State University, USA), Paul Jones (University of North Carolina at Chapel Hill, USA)
Program Chairs: Juliana Freire (University of Utah, USA), Soumen Chakrabarti (Indian Institute of Technology, India)
Copyright © 2010 International World Wide Web Conference Committee (IW3C2)
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
language models
search query processing
very large-scale experiments
web n-gram
Qualifiers
- research-article
Acceptance Rates
Overall Acceptance Rate: 1,899 of 8,196 submissions, 23%
Article Metrics
- Total Citations: 37
- Total Downloads: 692
- Downloads (Last 12 months): 25
- Downloads (Last 6 weeks): 2