Strengthening weak supervision for information retrieval

Date

2019-05-08

Authors

Haddad, Dany M.

Abstract

The limited availability of ground-truth relevance labels has been a major impediment to the application of supervised machine learning techniques to ad-hoc document retrieval and ranking. As a result, unsupervised scoring methods, such as BM25 and TF-IDF, remain strong competitors to deep learning approaches, which have brought about dramatic improvements in other domains such as computer vision and natural language processing. However, recent work has shown that it is possible to exploit the rankings produced by unsupervised methods to generate the training data needed for learning-to-rank models. Surprisingly, machine learning models trained on this generated data can outperform the original unsupervised method. The key limitation of this line of work is the size of the training set required to surpass the performance of the original unsupervised method, which can be as large as 10¹³ training examples. Building on these insights, this work proposes two methods to reduce the amount of training data required. The first takes inspiration from crowdsourcing and leverages multiple unsupervised rankers to generate soft, or noise-aware, training labels. The second identifies harmful, or mislabeled, training examples and removes them from the training set. We show that our methods surpass the performance of the unsupervised baseline with far fewer training examples than previous work.
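
To make the first idea concrete, the following is a minimal Python sketch of how soft, noise-aware labels could be built from several unsupervised rankers: each ranker's scores for a set of query-document pairs are min-max normalized and averaged into a single soft relevance target. The ranker names, the toy scores, and the simple averaging scheme are illustrative assumptions, not the thesis's exact aggregation procedure.

    import numpy as np

    def normalize(scores):
        """Min-max normalize one ranker's scores to [0, 1]."""
        scores = np.asarray(scores, dtype=float)
        lo, hi = scores.min(), scores.max()
        return np.zeros_like(scores) if hi == lo else (scores - lo) / (hi - lo)

    def soft_labels(score_lists):
        """Average normalized scores from several rankers into soft labels."""
        return np.mean([normalize(s) for s in score_lists], axis=0)

    # Hypothetical scores for the same five query-document pairs from
    # three unsupervised rankers (e.g. BM25, TF-IDF, query likelihood).
    bm25  = [12.1, 7.4, 0.9, 4.2, 9.8]
    tfidf = [0.83, 0.40, 0.05, 0.31, 0.77]
    ql    = [-4.1, -6.0, -9.5, -7.2, -4.8]

    # Soft relevance targets for training a learning-to-rank model.
    print(soft_labels([bm25, tfidf, ql]))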
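The second idea, removing harmful or mislabeled training examples, can be sketched in a similarly hedged way. Here "harmful" is approximated as "highest loss under a simple model fit to the weak labels"; this stand-in criterion, the helper prune_noisy_examples, and the synthetic data are illustrative assumptions, not the thesis's actual selection rule.

    import numpy as np
    from sklearn.linear_model import LinearRegression

    def prune_noisy_examples(X, y_weak, drop_frac=0.1):
        """Fit a simple model on the weak labels, then drop the
        drop_frac fraction of examples with the largest squared error."""
        model = LinearRegression().fit(X, y_weak)
        errors = (model.predict(X) - y_weak) ** 2
        keep = errors.argsort()[: int(len(y_weak) * (1 - drop_frac))]
        return X[keep], y_weak[keep]

    # Toy features and weak labels for 1000 query-document pairs,
    # with a block of deliberately corrupted ("mislabeled") examples.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 8))
    y = X @ rng.normal(size=8) + rng.normal(scale=0.1, size=1000)
    y[:50] += 5.0

    X_clean, y_clean = prune_noisy_examples(X, y, drop_frac=0.05)
    print(len(y_clean), "examples kept out of", len(y))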
