Semi-supervised document retrieval

https://doi.org/10.1016/j.ipm.2008.11.002

Abstract

This paper proposes a new machine learning method for constructing ranking models in document retrieval. The method, which is referred to as SSRank, aims to combine the advantages of both the traditional Information Retrieval (IR) methods and the supervised learning methods for IR proposed recently: the use of a limited amount of labeled data and rich model representation. To do so, the method adopts a semi-supervised learning framework for ranking model construction. Specifically, given a small number of labeled documents with respect to some queries, the method effectively labels the unlabeled documents for the queries. It then uses all the labeled data to train a machine learning model (in our case, a Neural Network). In the data labeling, the method also makes use of a traditional IR model (in our case, BM25). A stopping criterion based on machine learning theory is given for the data labeling process. Experimental results on three benchmark datasets and one web search dataset indicate that, given the same amount of labeled data, SSRank consistently and almost always significantly outperforms the baseline methods (unsupervised and supervised learning methods). This is because SSRank can effectively leverage unlabeled data in learning.

Introduction

Recently, supervised machine learning methods have been applied to ranking function construction in document retrieval (Burges et al., 2005, Cao et al., 2006, Cao et al., 2007, de Almeida et al., 2007, Gao et al., 2005, Joachims, 2002, Xu and Li, 2007, Yue et al., 2007). This approach offers many advantages, because it employs a rich model for document ranking. For instance, it is easy to add new ‘features’ into the ranking model. In fact, recent investigations have demonstrated that the supervised learning approach works better than the conventional IR methods for relevance ranking. On the other hand, the machine learning approach also suffers from a drawback that traditional IR approaches such as BM25 and Language Modeling do not: it needs a large amount of labeled data for training, and labeling data is usually expensive. (In that sense, the traditional IR methods are ‘unsupervised learning methods’.)

One question arises here: can we leverage the merits of the two approaches and develop a method that combines the uses of the two? This is exactly the issue we address in this paper. Specifically, we propose a method on the basis of semi-supervised learning. To the best of our knowledge, there has been no previous work focusing on this problem.

A ranking function based on unsupervised learning can always be created without data labeling. Such a function can work reasonably well. Thus, the problem in this paper can be recast as that of how to enhance the ranking accuracy of a traditional IR model by using a supervised learning method and a small amount of labeled data. On the other hand, supervised learning for ranking usually requires the use of a large amount of labeled data to accurately train a model, which is very expensive. The addressed problem can also be viewed as that of how to train a supervised learning model for ranking by using a small amount of labeled data and by leveraging a traditional IR model.

The key issue for our current research, therefore, is to design a method that can effectively use a small amount of labeled data and a large amount of unlabeled data, and can effectively combine supervised learning (e.g. RankNet) and unsupervised learning (e.g. BM25) methods for ranking model construction.

Our method, referred to as SSRank (Semi-Supervised RANK), naturally utilizes the machinery of semi-supervised learning to achieve our goal. In training, given a certain number of queries and the associated labeled documents, SSRank ranks all the documents for the queries using a supervised learning model trained with the labeled data, as well as using an unsupervised learning model. As a result, for each query, two ranking results of the documents with respect to the query are obtained. SSRank then calculates the relevance score of each unlabeled document for each query, specifically, the probability of being relevant or being in a high rank of relevance. It labels the unlabeled documents whose relevance scores are high enough. With the enlarged labeled data, a new supervised learning model can be constructed. SSRank repeats the process until a stopping criterion is met. In this paper, we propose a stopping criterion on the basis of machine learning theory.
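The iterative labeling procedure described above can be sketched as a self-training loop. The code below is a minimal illustration, not the authors' implementation: the placeholder functions `train` and `score_ir`, the equal-weight score combination, the confidence `threshold`, and the stand-in stopping rule (stop when no new documents can be labeled) are all hypothetical simplifications; the paper's actual stopping criterion is derived from machine learning theory.

```python
def ssrank(labeled, unlabeled, score_ir, train, threshold=0.8, max_rounds=10):
    """Hypothetical SSRank-style self-training sketch.

    labeled  -- list of (document features, relevance label) pairs
    unlabeled -- list of document features
    score_ir -- unsupervised IR model (e.g. normalized BM25), returns [0, 1]
    train    -- trains a supervised scorer on labeled pairs, returns a callable
    """
    data = list(labeled)
    pool = list(unlabeled)
    model = train(data)
    for _ in range(max_rounds):
        newly_labeled = []
        for x in pool:
            # Combine confidence from the learned ranker and the IR model
            conf = 0.5 * model(x) + 0.5 * score_ir(x)
            if conf >= threshold:
                newly_labeled.append((x, 1))   # label as relevant
            elif conf <= 1 - threshold:
                newly_labeled.append((x, 0))   # label as non-relevant
        if not newly_labeled:                  # stand-in stopping criterion
            break
        data.extend(newly_labeled)
        pool = [x for x in pool
                if all(x is not y for y, _ in newly_labeled)]
        model = train(data)                    # retrain on the enlarged data
    return model
```

In each round, only documents on which the supervised ranker and the IR model jointly express high (or low) confidence are labeled, which is the sense in which the two views cooperate.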

Experimental results on three benchmark datasets and one web search dataset show that the proposed method can significantly outperform baseline methods (either a supervised method using the same amount of labeled data or an unsupervised method).

The setting of SSRank is somewhat similar to that of relevance feedback (or pseudo-relevance feedback). There are also some clear differences between SSRank and relevance feedback (or pseudo-relevance feedback), however, as will be explained in Section 2.

The rest of the paper is organized as follows. Section 2 introduces related work. Section 3 explains the semi-supervised learning method: SSRank. Section 4 gives the experimental results. Section 5 provides our conclusion and discusses future work.

Section snippets

Learning for document retrieval

In Information Retrieval, ranking models are traditionally constructed in an unsupervised fashion; for example, BM25 (Robertson & Hull, 2000) and the Language Model (e.g. Lafferty & Zhai, 2001) are functions based on the degree of matching between query and document. There is no need for data labeling, which is no doubt an advantage. Many experimental results show that these models are very effective and represent state-of-the-art methods for document retrieval.
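For concreteness, the standard Okapi BM25 function mentioned above can be sketched as follows. This is a generic textbook formulation, not code from the paper; `k1` and `b` are set to conventional default values, and tokenization and corpus statistics are assumed to be computed elsewhere.

```python
import math
from collections import Counter

def bm25_score(query_terms, doc_terms, doc_freqs, num_docs, avg_doc_len,
               k1=1.2, b=0.75):
    """Score one document against a query with Okapi BM25.

    doc_freqs   -- term -> number of documents containing the term
    avg_doc_len -- average document length (in terms) over the collection
    """
    tf = Counter(doc_terms)
    doc_len = len(doc_terms)
    score = 0.0
    for term in query_terms:
        n_t = doc_freqs.get(term, 0)
        if n_t == 0:
            continue
        # Inverse document frequency of the term
        idf = math.log((num_docs - n_t + 0.5) / (n_t + 0.5) + 1.0)
        f = tf[term]
        # Term frequency component, saturated by k1 and length-normalized by b
        score += idf * f * (k1 + 1) / (f + k1 * (1 - b + b * doc_len / avg_doc_len))
    return score
```

A document sharing no terms with the query scores zero; documents with rarer matching terms score higher, with diminishing returns for repeated occurrences.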

In machine learning, the problem of

General framework

Suppose that there is a document collection. In retrieval, the documents retrieved with a given query are sorted using a ranking model such that the documents relevant to the query are on the top, while the ranking model is created using machine learning. In learning, a number of queries are given, and for each query a number of documents are retrieved and the corresponding labels are attached. The labels associated with the documents for a query represent the relevance degrees of the documents
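The learning setup described above, queries with retrieved documents and attached relevance labels, might be represented as follows. The feature values and the three-grade label scheme here are purely illustrative assumptions, not the datasets or features used in the paper.

```python
# Illustrative training data: each query maps to its retrieved documents,
# each carrying a feature vector and a relevance label (2 = highly relevant,
# 1 = partially relevant, 0 = not relevant). All values are made up.
training_data = {
    "q1": [
        {"features": [0.82, 0.10], "label": 2},
        {"features": [0.35, 0.40], "label": 1},
        {"features": [0.05, 0.90], "label": 0},
    ],
    "q2": [
        {"features": [0.61, 0.22], "label": 1},
        {"features": [0.12, 0.75], "label": 0},
    ],
}
```

A learning-to-rank method fits a scoring function to such data so that, per query, documents with higher labels are ranked above those with lower ones.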

Benchmark datasets

We used three benchmark datasets on document retrieval in our experiments.

The first two datasets are from the TREC ad hoc retrieval track. The document collections are from The Wall Street Journal (WSJ) and Associated Press (AP), which can be found in TREC Data Disks 2 and 3. WSJ contains 74,521 articles from 1990 to 1992, and AP contains 158,241 articles from 1988 to 1990. The queries are from the description fields of 200 TREC topics (No. 101–No. 300). Each query has a number of documents

Conclusion and future work

This paper addresses the issue of ranking model construction in document retrieval, particularly when only a small amount of labeled data is available. The paper proposes a semi-supervised learning method, SSRank, for performing the task. It leverages both labeled and unlabeled data, utilizes views from both traditional IR and supervised learning to conduct data labeling, and relies on a criterion to control the data labeling process. Several conclusions can be drawn

Acknowledgement

We want to thank the anonymous reviewers for their helpful comments and suggestions. Part of the research was supported by the National Science Foundation of China (60505013, 60635030, 60721002), the Jiangsu Science Foundation (BK2008018), the Foundation for the Author of National Excellent Doctoral Dissertation of China (200343), the Microsoft Professorship Award and the Microsoft Research Asia Internet Services Program.

References (58)

  • W. Fan et al. (2004). A generic ranking function discovery framework by genetic programming for information retrieval. Information Processing and Management.
  • Amini, M.-R., Truong, T.-V., & Goutte, C. (2008). A boosting algorithm for learning bipartite ranking functions with...
  • D. Angluin et al. (1988). Learning from noisy examples. Machine Learning.
  • R. Attar et al. (1977). Local feedback in full-text retrieval systems. Journal of the ACM.
  • R. Baeza-Yates et al. (1999). Modern information retrieval.
  • Balcan, M.-F., Blum, A., & Yang, K. (2005). Co-training and expansion: Towards bridging theory and practice. In...
  • M. Belkin et al. (2004). Semi-supervised learning on Riemannian manifolds. Machine Learning.
  • Blum, A., & Chawla, S. (2001). Learning from labeled and unlabeled data using graph mincuts. In Proceedings of the 18th...
  • Blum, A., & Mitchell, T. (1998). Combining labeled and unlabeled data with co-training. In Proceedings of the 11th...
  • Brefeld, U., Gartner, T., Scheffer, T., & Wrobel, S. (2006). Efficient co-regularised least squares regression. In...
  • Burges, C. J. C., Shaked, T., Renshaw, E., Lazier, A., Deeds, M., Hamilton, N. et al. (2005). Learning to rank using...
  • Cao, Y., Xu, J., Li, H., Huang, Y., & Hon, H.-W. (2006). Adapting ranking SVM to document retrieval. In Proceedings of...
  • Cao, Z., Qin, T., Liu, T.-Y., Tsai, M., & Li, H. (2007). Learning to rank: From pairwise approach to listwise approach....
  • O. Chapelle et al. (2006). Semi-supervised learning.
  • Chu, W., & Ghahramani, Z. (2005). Extension of Gaussian process for ranking: Semi-supervised and active learning. In...
  • Cummins, R., & O’Riordan, C. (2006). Term-weighting in information retrieval using genetic programming: A three stage...
  • de Almeida, H. M., Gonçalves, M. A., Cristo, M., & Calado, P. (2007). A combined component approach for finding...
  • A.P. Dempster et al. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society.
  • Duh, K., & Kirchhoff, K. (2008). Learning to rank with partially-labeled data. In Proceedings of the 31st annual...
  • Y. Freund et al. (2003). An efficient boosting algorithm for combining preferences. Journal of Machine Learning Research.
  • Gao, J., Qi, H., Xia, X., & Nie, J.-Y. (2005). Discriminant model for information retrieval. In Proceedings of the 28th...
  • Goldman, S., & Zhou, Y. (2000). Enhancing supervised learning with unlabeled data. In Proceedings of the 17th...
  • Harman, D. (1992). Relevance feedback revisited. In Proceedings of the 15th annual international ACM SIGIR conference on...
  • Herbrich, R., Graepel, T., & Obermayer, K. (2000). Large margin rank boundaries for ordinal regression. In A.J. Smola,...
  • Hersh, W. R., Buckley, C., Leone, T., & Hickam, D. H. (1994). OHSUMED: An interactive retrieval evaluation and new large...
  • Huang, X., Huang, Y. R., Wen, M., An, A., Liu, Y., & Poon, J. (2006). Applying data mining to pseudo-relevance feedback...
  • Jarvelin, K., & Kekalainen, J. (2000). IR evaluation methods for retrieving highly relevant documents. In Proceedings...
  • Joachims, T. (2002). Optimizing search engines using clickthrough data. In Proceedings of the eighth ACM SIGKDD...
  • Lafferty, J., & Zhai, C. (2001). Document language models, query models, and risk minimization for information...