Semi-supervised document retrieval

https://doi.org/10.1016/j.ipm.2008.11.002

Abstract

This paper proposes a new machine learning method for constructing ranking models in document retrieval. The method, which is referred to as SSRank, aims to combine the advantages of both the traditional Information Retrieval (IR) methods and the supervised learning methods for IR proposed recently: the use of a limited amount of labeled data and rich model representation. To do so, the method adopts a semi-supervised learning framework for ranking model construction. Specifically, given a small number of labeled documents with respect to some queries, the method effectively labels the unlabeled documents for the queries. It then uses all the labeled data to train a machine learning model (in our case, a Neural Network). In the data labeling, the method also makes use of a traditional IR model (in our case, BM25). A stopping criterion based on machine learning theory is given for the data labeling process. Experimental results on three benchmark datasets and one web search dataset indicate that, given the same amount of labeled data, SSRank consistently and almost always significantly outperforms the baseline methods (unsupervised and supervised learning methods). This is because SSRank can effectively leverage unlabeled data in learning.

Introduction

Recently, supervised machine learning methods have been applied to ranking function construction in document retrieval (Burges et al., 2005, Cao et al., 2006, Cao et al., 2007, de Almeida et al., 2007, Gao et al., 2005, Joachims, 2002, Xu and Li, 2007, Yue et al., 2007). This approach offers many advantages, because it employs a rich model for document ranking. For instance, it is easy to add new ‘features’ into the ranking model. In fact, recent investigations have demonstrated that the supervised learning approach works better than the conventional IR methods for relevance ranking. On the other hand, the machine learning approach also suffers from a drawback that traditional IR approaches such as BM25 and Language Modeling do not: it needs a large amount of labeled data for training, and labeling data is usually expensive. (In that sense, the traditional IR methods are ‘unsupervised learning methods’.)

One question arises here: can we leverage the merits of the two approaches and develop a method that combines the uses of the two? This is exactly the issue we address in this paper. Specifically, we propose a method on the basis of semi-supervised learning. To the best of our knowledge, there has been no previous work focusing on this problem.

A ranking function based on unsupervised learning can always be created without data labeling. Such a function can work reasonably well. Thus, the problem in this paper can be recast as that of how to enhance the ranking accuracy of a traditional IR model by using a supervised learning method and a small amount of labeled data. On the other hand, supervised learning for ranking usually requires the use of a large amount of labeled data to accurately train a model, which is very expensive. The addressed problem can also be viewed as that of how to train a supervised learning model for ranking by using a small amount of labeled data and by leveraging a traditional IR model.

The key issue for our current research, therefore, is to design a method that can effectively use a small amount of labeled data and a large amount of unlabeled data, and can effectively combine supervised learning (e.g. RankNet) and unsupervised learning (e.g. BM25) methods for ranking model construction.

Our method, referred to as SSRank (Semi-Supervised RANK), naturally utilizes the machinery of semi-supervised learning to achieve our goal. In training, given a certain number of queries and the associated labeled documents, SSRank ranks all the documents for the queries using a supervised learning model trained with the labeled data, as well as using an unsupervised learning model. As a result, for each query, two ranking results of the documents with respect to the query are obtained. SSRank then calculates the relevance score of each unlabeled document for each query, specifically, the probability of being relevant or being in a high rank of relevance. It labels the unlabeled documents whose relevance scores are high enough. With the enlarged labeled data, a new supervised learning model can be constructed. SSRank repeats the process until a stopping criterion is met. In this paper, we propose a stopping criterion on the basis of machine learning theory.
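The iterative labeling procedure described above can be sketched as a self-training loop. The code below is a minimal illustration, not the authors' implementation: the placeholder functions `train` and `score_ir`, the equal-weight score combination, the confidence `threshold`, and the stand-in stopping rule (stop when no new documents can be labeled) are all hypothetical simplifications; the paper's actual stopping criterion is derived from machine learning theory.

```python
def ssrank(labeled, unlabeled, score_ir, train, threshold=0.8, max_rounds=10):
    """Hypothetical SSRank-style self-training sketch.

    labeled  -- list of (document features, relevance label) pairs
    unlabeled -- list of document features
    score_ir -- unsupervised IR model (e.g. normalized BM25), returns [0, 1]
    train    -- trains a supervised scorer on labeled pairs, returns a callable
    """
    data = list(labeled)
    pool = list(unlabeled)
    model = train(data)
    for _ in range(max_rounds):
        newly_labeled = []
        for x in pool:
            # Combine confidence from the learned ranker and the IR model
            conf = 0.5 * model(x) + 0.5 * score_ir(x)
            if conf >= threshold:
                newly_labeled.append((x, 1))   # label as relevant
            elif conf <= 1 - threshold:
                newly_labeled.append((x, 0))   # label as non-relevant
        if not newly_labeled:                  # stand-in stopping criterion
            break
        data.extend(newly_labeled)
        pool = [x for x in pool
                if all(x is not y for y, _ in newly_labeled)]
        model = train(data)                    # retrain on the enlarged data
    return model
```

In each round, only documents on which the supervised ranker and the IR model jointly express high (or low) confidence are labeled, which is the sense in which the two views cooperate.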

Experimental results on three benchmark datasets and one web search dataset show that the proposed method can significantly outperform baseline methods (either a supervised method using the same amount of labeled data or an unsupervised method).

The setting of SSRank is somewhat similar to that of relevance feedback (or pseudo-relevance feedback). There are also some clear differences between SSRank and relevance feedback (or pseudo-relevance feedback), however, as will be explained in Section 2.

The rest of the paper is organized as follows. Section 2 introduces related work. Section 3 explains the semi-supervised learning method: SSRank. Section 4 gives the experimental results. Section 5 provides our conclusion and discusses future work.

Section snippets

Learning for document retrieval

In Information Retrieval, ranking models are traditionally constructed in an unsupervised fashion; for example, BM25 (Robertson & Hull, 2000) and the Language Model (e.g. Lafferty & Zhai, 2001) are functions based on the degree of matching between query and document. There is no need for data labeling, which is no doubt an advantage. Many experimental results show that these models are very effective and represent state-of-the-art methods for document retrieval.
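For concreteness, the standard Okapi BM25 function mentioned above can be sketched as follows. This is a generic textbook formulation, not code from the paper; `k1` and `b` are set to conventional default values, and tokenization and corpus statistics are assumed to be computed elsewhere.

```python
import math
from collections import Counter

def bm25_score(query_terms, doc_terms, doc_freqs, num_docs, avg_doc_len,
               k1=1.2, b=0.75):
    """Score one document against a query with Okapi BM25.

    doc_freqs   -- term -> number of documents containing the term
    avg_doc_len -- average document length (in terms) over the collection
    """
    tf = Counter(doc_terms)
    doc_len = len(doc_terms)
    score = 0.0
    for term in query_terms:
        n_t = doc_freqs.get(term, 0)
        if n_t == 0:
            continue
        # Inverse document frequency of the term
        idf = math.log((num_docs - n_t + 0.5) / (n_t + 0.5) + 1.0)
        f = tf[term]
        # Term frequency component, saturated by k1 and length-normalized by b
        score += idf * f * (k1 + 1) / (f + k1 * (1 - b + b * doc_len / avg_doc_len))
    return score
```

A document sharing no terms with the query scores zero; documents with rarer matching terms score higher, with diminishing returns for repeated occurrences.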

In machine learning, the problem of

General framework

Suppose that there is a document collection. In retrieval, the documents retrieved with a given query are sorted using a ranking model such that the documents relevant to the query are on the top, while the ranking model is created using machine learning. In learning, a number of queries are given, and for each query a number of documents are retrieved and the corresponding labels are attached. The labels associated with the documents for a query represent the relevance degrees of the documents
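The learning setup described above, queries with retrieved documents and attached relevance labels, might be represented as follows. The feature values and the three-grade label scheme here are purely illustrative assumptions, not the datasets or features used in the paper.

```python
# Illustrative training data: each query maps to its retrieved documents,
# each carrying a feature vector and a relevance label (2 = highly relevant,
# 1 = partially relevant, 0 = not relevant). All values are made up.
training_data = {
    "q1": [
        {"features": [0.82, 0.10], "label": 2},
        {"features": [0.35, 0.40], "label": 1},
        {"features": [0.05, 0.90], "label": 0},
    ],
    "q2": [
        {"features": [0.61, 0.22], "label": 1},
        {"features": [0.12, 0.75], "label": 0},
    ],
}
```

A learning-to-rank method fits a scoring function to such data so that, per query, documents with higher labels are ranked above those with lower ones.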

Benchmark datasets

We used three benchmark datasets on document retrieval in our experiments.

The first two datasets are from the TREC ad hoc retrieval track. The document collections are from The Wall Street Journal (WSJ) and Associated Press (AP), which can be found in TREC Data Disks 2 and 3. WSJ contains 74,521 articles from 1990 to 1992, and AP contains 158,241 articles from 1988 to 1990. The queries are from the description fields of 200 TREC topics (No. 101–No. 300). Each query has a number of documents

Conclusion and future work

This paper addresses the issue of ranking model construction in document retrieval, particularly when only a small amount of labeled data is available. The paper proposes a semi-supervised learning method, SSRank, for performing the task. It leverages both labeled and unlabeled data, utilizes views from both traditional IR and supervised learning to conduct data labeling, and relies on a criterion to control the data labeling process. Several conclusions can be drawn

Acknowledgement

We want to thank the anonymous reviewers for their helpful comments and suggestions. Part of the research was supported by the National Science Foundation of China (60505013, 60635030, 60721002), the Jiangsu Science Foundation (BK2008018), the Foundation for the Author of National Excellent Doctoral Dissertation of China (200343), the Microsoft Professorship Award and the Microsoft Research Asia Internet Services Program.

References (58)

  • W. Fan et al. (2004). A generic ranking function discovery framework by genetic programming for information retrieval. Information Processing and Management.
  • Amini, M.-R., Truong, T.-V., & Goutte, C. (2008). A boosting algorithm for learning bipartite ranking functions with...
  • D. Angluin et al. (1988). Learning from noisy examples. Machine Learning.
  • R. Attar et al. (1977). Local feedback in full-text retrieval systems. Journal of the ACM.
  • R. Baeza-Yates et al. (1999). Modern information retrieval.
  • Balcan, M.-F., Blum, A., & Yang, K. (2005). Co-training and expansion: Towards bridging theory and practice. In...
  • M. Belkin et al. (2004). Semi-supervised learning on Riemannian manifolds. Machine Learning.
  • Blum, A., & Chawla, S. (2001). Learning from labeled and unlabeled data using graph mincuts. In Proceedings of the 18th...
  • Blum, A., & Mitchell, T. (1998). Combining labeled and unlabeled data with co-training. In Proceedings of the 11th...
  • Brefeld, U., Gartner, T., Scheffer, T., & Wrobel, S. (2006). Efficient co-regularised least squares regression. In...
  • Burges, C. J. C., Shaked, T., Renshaw, E., Lazier, A., Deeds, M., Hamilton, N. et al. (2005). Learning to rank using...
  • Cao, Y., Xu, J., Li, H., Huang, Y., & Hon, H.-W. (2006). Adapting ranking SVM to document retrieval. In Proceedings of...
  • Cao, Z., Qin, T., Liu, T.-Y., Tsai, M., & Li, H. (2007). Learning to rank: From pairwise approach to listwise approach....
  • O. Chapelle et al. (2006). Semi-supervised learning.
  • Chu, W., & Ghahramani, Z. (2005). Extension of Gaussian process for ranking: Semi-supervised and active learning. In...
  • Cummins, R., & O’Riordan, C. (2006). Term-weighting in information retrieval using genetic programming: A three stage...
  • de Almeida, H. M., Gonçalves, M. A., Cristo, M., & Calado, P. (2007). A combined component approach for finding...
  • A.P. Dempster et al. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society.
  • Duh, K., & Kirchhoff, K. (2008). Learning to rank with partially-labeled data. In Proceedings of the 31st annual...
  • Y. Freund et al. (2003). An efficient boosting algorithm for combining preferences. Journal of Machine Learning Research.
  • Gao, J., Qi, H., Xia, X., & Nie, J.-Y. (2005). Discriminant model for information retrieval. In Proceedings of the 28th...
  • Goldman, S., & Zhou, Y. (2000). Enhancing supervised learning with unlabeled data. In Proceedings of the 17th...
  • Harman, D. (1992). Relevance feedback revisited. In Proceedings of the 15th annual international ACM SIGIR conference on...
  • Herbrich, R., Graepel, T., & Obermayer, K. (2000). Large margin rank boundaries for ordinal regression. In A.J. Smola,...
  • Hersh, W. R., Buckley, C., Leone, T., & Hickam, D. H. (1994). OHSUMED: An interactive retrieval evaluation and new large...
  • Huang, X., Huang, Y. R., Wen, M., An, A., Liu, Y., & Poon, J. (2006). Applying data mining to pseudo-relevance feedback...
  • Jarvelin, K., & Kekalainen, J. (2000). IR evaluation methods for retrieving highly relevant documents. In Proceedings...
  • Joachims, T. (2002). Optimizing search engines using clickthrough data. In Proceedings of the eighth ACM SIGKDD...
  • Lafferty, J., & Zhai, C. (2001). Document language models, query models, and risk minimization for information...