Semi-supervised document retrieval
Introduction
Recently, supervised machine learning methods have been applied to ranking function construction in document retrieval (Burges et al., 2005; Cao et al., 2006; Cao et al., 2007; de Almeida et al., 2007; Gao et al., 2005; Joachims, 2002; Xu and Li, 2007; Yue et al., 2007). This approach offers many advantages because it employs a rich model for document ranking; for instance, it is easy to add new ‘features’ to the ranking model. In fact, recent investigations have demonstrated that the supervised learning approach works better than conventional IR methods for relevance ranking. On the other hand, the machine learning approach also suffers from a drawback that traditional IR approaches such as BM25 and Language Modeling do not: it needs a large amount of labeled data for training, and the labeling of data is usually expensive. (In that sense, the traditional IR methods are ‘unsupervised learning methods’.)
One question arises here: can we leverage the merits of the two approaches and develop a method that combines them? This is exactly the issue we address in this paper. Specifically, we propose a method based on semi-supervised learning. To the best of our knowledge, no previous work has focused on this problem.
A ranking function based on unsupervised learning can always be created without data labeling, and such a function can work reasonably well. The problem in this paper can thus be recast as that of enhancing the ranking accuracy of a traditional IR model by using a supervised learning method and a small amount of labeled data. Conversely, supervised learning for ranking usually requires a large amount of labeled data to train a model accurately, which is very expensive. The problem can therefore also be viewed as that of training a supervised ranking model with only a small amount of labeled data by leveraging a traditional IR model.
The key issue for our current research, therefore, is to design a method that can effectively use a small amount of labeled data and a large amount of unlabeled data, and can effectively combine supervised learning (e.g. RankNet) and unsupervised learning (e.g. BM25) methods for ranking model construction.
Our method, referred to as SSRank (Semi-Supervised RANK), naturally utilizes the machinery of semi-supervised learning to achieve our goal. In training, given a certain number of queries and the associated labeled documents, SSRank ranks all the documents for the queries using a supervised learning model trained with the labeled data, as well as using an unsupervised learning model. As a result, for each query, two ranking results of the documents with respect to the query are obtained. SSRank then calculates the relevance score of each unlabeled document for each query, specifically, the probability of being relevant or being in a high rank of relevance. It labels the unlabeled documents if their relevance scores are high enough. With the labeled data, a new supervised learning model can be constructed. SSRank repeats the process, until a stopping criterion is met. In this paper, we propose a stopping criterion on the basis of machine learning theory.
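The iterative procedure above can be sketched as follows. This is our own minimal reading of the loop, not the paper's implementation: the `train` and `unsup_score` functions, the equal-weight combination of the two views, and the stop-when-no-confident-labels rule are all illustrative placeholders (in particular, the paper's actual stopping criterion is derived from learning theory, not from exhaustion of confident labels).

```python
def ssrank_sketch(labeled, unlabeled, train, unsup_score,
                  max_iter=10, threshold=0.8):
    """Iteratively label confident unlabeled documents using two views:
    a supervised model retrained each round and a fixed unsupervised score."""
    labeled, unlabeled = list(labeled), list(unlabeled)
    model = train(labeled)
    for _ in range(max_iter):
        newly = []
        for doc in unlabeled:
            # combine the two views into a single confidence of relevance
            p = 0.5 * model(doc) + 0.5 * unsup_score(doc)
            if p >= threshold:
                newly.append((doc, 1))      # confidently relevant
            elif p <= 1.0 - threshold:
                newly.append((doc, 0))      # confidently non-relevant
        if not newly:                        # placeholder stopping rule:
            break                            # no confident labels remain
        labeled += newly
        taken = {doc for doc, _ in newly}
        unlabeled = [d for d in unlabeled if d not in taken]
        model = train(labeled)               # retrain on the enlarged set
    return model, labeled

# Toy instantiation: a "document" is a single score-like feature in [0, 1].
def train(labeled):
    pos = [x for x, y in labeled if y == 1]
    mu = sum(pos) / len(pos)                 # centre of the relevant examples
    return lambda x: max(0.0, 1.0 - abs(x - mu))

model, enlarged = ssrank_sketch(
    labeled=[(0.9, 1), (0.1, 0)],
    unlabeled=[0.85, 0.05, 0.5],
    train=train,
    unsup_score=lambda x: x,                 # stand-in for a normalized IR score
)
```

In this toy run, the two confident documents (0.85 and 0.05) are labeled and added to the training set, while the ambiguous one (0.5) is left unlabeled and the loop stops.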
Experimental results on three benchmark datasets and one web search dataset show that the proposed method can significantly outperform baseline methods (either a supervised method using the same amount of labeled data or an unsupervised method).
The setting of SSRank is somewhat similar to that of relevance feedback (or pseudo-relevance feedback). There are also some clear differences between SSRank and relevance feedback (or pseudo-relevance feedback), however, as will be explained in Section 2.
The rest of the paper is organized as follows. Section 2 introduces related work. Section 3 explains the semi-supervised learning method: SSRank. Section 4 gives the experimental results. Section 5 provides our conclusion and discusses future work.
Learning for document retrieval
In Information Retrieval, ranking models are traditionally constructed in an unsupervised fashion; for example, BM25 (Robertson & Hull, 2000) and the Language Model (e.g. Lafferty & Zhai, 2001) are functions based on the degree of matching between query and document. There is no need for data labeling, which is no doubt an advantage. Many experimental results show that these models are very effective, and they represent state-of-the-art methods for document retrieval.
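As a concrete illustration of such an unsupervised matching function, here is a minimal sketch of the Okapi BM25 scoring formula; the parameter values k1 = 1.2 and b = 0.75 are common defaults, and the +1 inside the logarithm is one conventional way to keep the IDF term non-negative:

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score each document (a list of terms) against the query with Okapi BM25."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N          # average document length
    # document frequency of each query term over the collection
    df = {t: sum(1 for d in docs if t in d) for t in query_terms}
    scores = []
    for d in docs:
        tf = Counter(d)
        s = 0.0
        for t in query_terms:
            if df[t] == 0:
                continue
            idf = math.log((N - df[t] + 0.5) / (df[t] + 0.5) + 1)
            # term-frequency saturation, normalized by document length
            denom = tf[t] + k1 * (1 - b + b * len(d) / avgdl)
            s += idf * tf[t] * (k1 + 1) / denom
        scores.append(s)
    return scores

scores = bm25_scores(
    ["apple"],
    [["apple", "banana"], ["apple", "apple", "cherry"], ["banana"]],
)
```

A document matching the query term more often scores higher, and a document containing no query term scores zero; no relevance labels are involved anywhere.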
In machine learning, the problem of …
General framework
Suppose that there is a document collection. In retrieval, the documents retrieved for a given query are sorted by a ranking model, created using machine learning, such that the documents relevant to the query appear at the top. In learning, a number of queries are given; for each query a number of documents are retrieved, and the corresponding labels are attached. The labels associated with the documents for a query represent the relevance degrees of the documents …
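The labeled training data described above can be pictured as follows. The shape and all names are our own illustration: graded relevance labels per query, expanded into the ordered preference pairs consumed by a pairwise learner such as RankNet:

```python
# Hypothetical shape of the training data: for each query, a list of
# retrieved documents with graded relevance labels.
training_data = {
    "q1": [("d11", 2), ("d12", 0), ("d13", 1)],   # 2 = highly relevant, 0 = irrelevant
    "q2": [("d21", 1), ("d22", 0)],
}

def to_pairs(judged):
    """Expand graded labels into ordered pairs (preferred, non-preferred),
    the form consumed by pairwise rankers such as RankNet."""
    return [(a, b) for a, ga in judged for b, gb in judged if ga > gb]

pairs = to_pairs(training_data["q1"])
```

For query "q1" this yields three preference pairs, one for each pair of documents with strictly different grades.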
Benchmark datasets
We used three benchmark datasets on document retrieval in our experiments.
The first two datasets are from the TREC ad hoc retrieval track. The document collections are from The Wall Street Journal (WSJ) and the Associated Press (AP), and can be found on TREC Disks 2 and 3. WSJ contains 74,521 articles from 1990 to 1992, and AP contains 158,241 articles from 1988 to 1990. The queries are taken from the description fields of 200 TREC topics (No. 101–No. 300). Each query has a number of documents …
Conclusion and future work
This paper addresses the issue of ranking model construction in document retrieval, particularly when only a small amount of labeled data is available. The paper proposes a semi-supervised learning method, SSRank, for performing the task. It leverages both labeled and unlabeled data, utilizes views from both traditional IR and supervised learning to conduct data labeling, and relies on a criterion to control the process of data labeling. Several conclusions can be drawn …
Acknowledgement
We want to thank the anonymous reviewers for their helpful comments and suggestions. Part of the research was supported by the National Science Foundation of China (60505013, 60635030, 60721002), the Jiangsu Science Foundation (BK2008018), the Foundation for the Author of National Excellent Doctoral Dissertation of China (200343), the Microsoft Professorship Award and the Microsoft Research Asia Internet Services Program.
References (58)
- Fan, W., Gordon, M. D., & Pathak, P. (2004). A generic ranking function discovery framework by genetic programming for information retrieval. Information Processing and Management.
- Amini, M.-R., Truong, T.-V., & Goutte, C. (2008). A boosting algorithm for learning bipartite ranking functions with…
- Angluin, D., & Laird, P. (1988). Learning from noisy examples. Machine Learning.
- Attar, R., & Fraenkel, A. S. (1977). Local feedback in full-text retrieval systems. Journal of the ACM.
- Baeza-Yates, R., & Ribeiro-Neto, B. (1999). Modern information retrieval.
- Balcan, M.-F., Blum, A., & Yang, K. (2005). Co-training and expansion: Towards bridging theory and practice. In…
- Belkin, M., & Niyogi, P. (2004). Semi-supervised learning on Riemannian manifolds. Machine Learning.
- Blum, A., & Chawla, S. (2001). Learning from labeled and unlabeled data using graph mincuts. In Proceedings of the 18th…
- Blum, A., & Mitchell, T. (1998). Combining labeled and unlabeled data with co-training. In Proceedings of the 11th…
- Brefeld, U., Gärtner, T., Scheffer, T., & Wrobel, S. (2006). Efficient co-regularised least squares regression. In…
- Chapelle, O., Schölkopf, B., & Zien, A. (Eds.). Semi-supervised learning.
- Dempster, A. P., Laird, N. M., & Rubin, D. B. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society.
- Freund, Y., Iyer, R., Schapire, R. E., & Singer, Y. An efficient boosting algorithm for combining preferences. Journal of Machine Learning Research.