1 Introduction

Ranking is a core problem in information retrieval, especially in web search engines. Recently, learning to rank, a rapidly developing branch of machine learning, has shown strong performance on ranking tasks. When applied to document retrieval, learning to rank first learns a ranking model from training data sets that consist of queries and associated documents labeled with relevance grades. For a new query and a set of associated documents, the trained ranker (model) assigns each document a score and then sorts the documents by their scores. To evaluate the ranker’s accuracy, evaluation measures such as Mean Average Precision (MAP) (Baeza-Yates et al. 1999) and Normalized Discounted Cumulative Gain (NDCG) (Järvelin and Kekäläinen 2000) can be utilized.

Recent works on learning to rank can be categorized into three main paradigms: pointwise, pairwise, and listwise. The pointwise approach, such as Pranking (Crammer and Singer 2001) and McRank (Li et al. 2007), formulates ranking as an ordinal regression or classification problem. Unfortunately, this approach fails to capture pairwise preferences and the relative order of retrieved documents. The pairwise approach, such as Ranking SVM (Smola 2000; Joachims 2002), RankBoost (Freund et al. 2003) and RankNet (Burges et al. 2005), treats document pairs as training instances and trains models by minimizing related risks, such as the number of misordered pairs or the cross-entropy loss (Burges et al. 2005). However, this approach ignores the ground truth labels with respect to partial orders of documents. The listwise approach, such as ListNet (Cao et al. 2007), AdaRank (Xu and Li 2007), ListMLE (Xia et al. 2008) and RankCosine (Qin et al. 2008), views lists of documents as a whole during training and focuses on recovering the true permutation of the retrieved documents.

Previous works (Cao et al. 2007; Lan et al. 2009; Liu 2011; Niu et al. 2012) demonstrated that the listwise approach captures sufficient information and can therefore address the ranking problem in a very natural way. However, the majority of existing algorithms depend heavily on human-elicited relevance grades and ignore the intrinsic relationships between documents and ranking features (Moon et al. 2010).

The aim of our work is to fill this gap and to show a different way of constructing weak learners that improves ranking performance. We employ the Data Envelopment Analysis (DEA) technique (Charnes et al. 1978; Seiford and Thrall 1990) to recover the information that conventional listwise approaches lose. Since the 1970s, DEA has been widely used for evaluating the relative efficiencies of a set of homogeneous Decision Making Units (DMUs) with multiple inputs and outputs. Without an explicit specification of the functional relation between the inputs and outputs, DEA assigns each unit an efficiency score. A unit with a higher score is considered more efficient, and a score of one indicates an efficient unit. The efficient DMUs form a piecewise linear frontier against which the degree of efficiency of the remaining units is measured. DEA has a natural ability for dichotomized classification and ranking, and it is straightforward to group DMUs into two sets: efficient and inefficient. It has been applied to tasks such as classification (Troutt et al. 1996; Seiford and Zhu 1998; Emel et al. 2003; Xu and Wang 2009; Pendharkar 2011; Yan and Wei 2011), clustering (Po et al. 2009) and association rule mining (Chen 2007; Toloo et al. 2009). To assist decision making, a variety of methods have also been proposed for ranking DMUs in the DEA literature; comprehensive surveys can be found in Adler et al. (2002) and Jahanshahloo et al. (2008). All DMU-ranking approaches in the DEA context, such as the cross-efficiency method (Sexton et al. 1986) and the super-efficiency method (Andersen and Petersen 1993), directly use the computed efficiency scores to rank the given DMUs. However, they cannot make predictions for unseen DMUs.

In this paper, two DEA models, namely CCR-I and CCR-O, are derived to evaluate the relative efficiency of documents. Both models rest on the assumption that each document is a DMU, but they differ in the role played by the documents’ feature vectors. In CCR-I, documents are represented as units with multiple outputs (the features) and the same single input (a query). In CCR-O, each document is represented as a unit with multiple inputs (the features) and one output (a human-elicited relevance score).

On the basis of these two DEA models, we propose a novel weak ranker construction method to build a pool of weak ranker candidates. Each candidate in the pool is an optimal weight vector solved from CCR-I or CCR-O. Inspired by the works of Freund et al. (2003), Xu and Li (2007) and Qin et al. (2008), we incorporate the Boosting technique (Schapire 1990) into rank learning. The combination of the objective evaluation of DEA and the weak learnability of Boosting leads to an improved rank learning method called DEARank.

The rest of this paper is organized as follows. In Sect. 2, we give a brief review of the DEA technique, the Boosting technique and the AdaRank algorithm, and introduce some notation for the reader’s convenience. The DEARank method is presented in Sect. 3, and the experimental results are reported in Sect. 4. In Sect. 5, we discuss the effect of error diversity on ranking performance, present an alternative approach to weighting the weak rankers, and then conclude the work.

2 Background

2.1 Data envelopment analysis

Data envelopment analysis (DEA) is a typical non-parametric linear programming (LP) approach for evaluating a DMU’s performance by a relative efficiency score, defined as the ratio of the sum of weighted outputs to the sum of weighted inputs. It allows each DMU to search for its most advantageous weight vectors when computing relative efficiency (Kao and Hung 2005). When clear from context, we use the terms “DMU”, “unit”, “document”, and “document unit” interchangeably.

Suppose there are \(N\) DMUs to be evaluated, and the \(n\)th DMU consumes \(M\) inputs \(X_n\in R^M\) and produces \(S\) outputs \(Y_n\in R^S\). We also assume that the inputs and the outputs are nonnegative, i.e., \(X_{n}\ge 0\) and \(Y_{n}\ge 0\). Let \(w_n\in R^M\) denote the weight vector of the inputs \(X_n\), and \(u_n\in R^S\) the weight vector of the outputs \(Y_n\).

The first classical DEA model, CCR, was pioneered by Charnes et al. (1978). The model maximizes the relative efficiency of the \(n\)th unit, subject to the constraint that, under the weights \(w_n\) and \(u_n\), the relative efficiency of every DMU does not exceed one. The model can be formulated as follows

$$\begin{aligned} \begin{array}{l@{\quad }l@{\quad }l} \max \limits _{w_n, u_n} &{} {u_n^TY_n}/{w_n^TX_n}&{}\\ \textit{s.t.} &{} {u_n^TY_i}/{w_n^TX_i}\le 1, &{}i=1,2,\dots ,N\\ &{} w_n\ge 0, u_n\ge 0.&{} \end{array} \end{aligned}$$
(1)

With the Charnes-Cooper transformation

$$\begin{aligned} \left\{ \begin{array}{l} t = 1/w_n^T X_n, \\ \nu = t w_n, \\ \mu = t u_n, \end{array} \right. \end{aligned}$$
(2)

we have \(\nu ^T X_n=t w_n^T X_n = 1\), and

$$\begin{aligned} u_n^T Y_n/w_n^T X_n= \mu ^T Y_n~. \end{aligned}$$
(3)

Therefore, Eq. (1) can be solved via an equivalent LP model

$$\begin{aligned} \begin{array}{l@{\quad }l@{\quad }l} \max \limits _{\mu ,\nu } &{} \mu ^TY_n&{}\\ \textit{s.t.} &{} \mu ^TY_i -\nu ^TX_i\le 0, &{}i=1,2,\dots ,N\\ &{} \nu ^TX_n = 1&{} \\ &{} \nu \ge 0, \mu \ge 0.&{} \end{array} \end{aligned}$$
(4)

The optimal solution \((\mu ^*, \nu ^*)\) to Eq. (4) represents the \(n\)th DMU’s most advantageous preference weight pair, which may vary across DMUs. It is this diversity of preferences that yields very competitive performance compared with the single-feature weak rankers used in Freund et al. (2003) and Xu and Li (2007).
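As a concrete illustration of how the LP in Eq. (4) can be solved in practice, the following sketch sets it up with SciPy’s `linprog`. The function name `ccr_lp` and the data layout (rows of `X` and `Y` correspond to DMUs) are our own illustrative choices, not part of the original formulation.

```python
# A minimal sketch of solving the LP form of the CCR model (Eq. 4).
# X is an (N, M) matrix of inputs, Y an (N, S) matrix of outputs; n indexes
# the evaluated DMU. The decision vector stacks mu (length S) and nu (length M).
import numpy as np
from scipy.optimize import linprog

def ccr_lp(X, Y, n):
    N, M = X.shape
    _, S = Y.shape
    # linprog minimizes, so negate the objective mu^T Y_n.
    c = np.concatenate([-Y[n], np.zeros(M)])
    # Constraints mu^T Y_i - nu^T X_i <= 0 for every DMU i.
    A_ub = np.hstack([Y, -X])
    b_ub = np.zeros(N)
    # Normalization nu^T X_n = 1.
    A_eq = np.concatenate([np.zeros(S), X[n]])[None, :]
    b_eq = np.array([1.0])
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=(0, None), method="highs")
    mu, nu = res.x[:S], res.x[S:]
    efficiency = mu @ Y[n]          # equals the optimal objective value
    return mu, nu, efficiency
```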

2.2 Efficiency reference set

Definition 1

(Efficiency Reference Set) Let (\(u_n^*, w_n^*\)) be an optimal weight vector pair of unit \(n\). The Efficiency Reference Set (ERS) of unit \(n\) is the set defined as

$$\begin{aligned} \text {ERS}_n = \bigg \{~i~\Big |~ \frac{(u_n^*)^T Y_i}{(w_n^*)^T X_i} = 1, \forall i = 1,\ldots , N\bigg \}, \end{aligned}$$

that is, the set of units whose efficiency equals one under the common weight vector pair (\(u_n^*, w_n^*\)).

Theorem 1

The common optimal weight vector pair (\(u_n^*, w_n^*\)) of \(\text {ERS}_n\) is also an optimal solution of the problem CCR\(_i\) associated with every unit \(i\in \text {ERS}_n\).

Proof

According to the CCR model, the pair (\(u_n^*, w_n^*\)) satisfies the constraints

$$\begin{aligned} \frac{(u_n^*)^T Y_j}{(w_n^*)^T X_j} \le 1, j = 1, \dots , N. \end{aligned}$$

This implies that the pair is a feasible solution to the program \(\hbox {CCR}_i\). Since unit \(i\) is a member of \(\text {ERS}_n\), we have

$$\begin{aligned} \frac{(u_n^*)^T Y_i}{(w_n^*)^T X_i} = 1~, \end{aligned}$$

i.e., the pair (\(u_n^*, w_n^*\)) attains the maximum possible objective value of \(\hbox {CCR}_i\), which is bounded above by one by its constraints. Hence the pair is also an optimal solution of \(\hbox {CCR}_i\), and the \(i\)th unit is efficient. \(\square \)

Each CCR model is an LP problem and corresponds to one DMU. With \(N\) units, examining the relative efficiency of every unit requires solving \(N\) LPs. Although LPs can be solved in polynomial time, repeatedly solving them tends to be computationally intensive and time consuming, especially on large data sets. According to Theorem 1, the units in the same Efficiency Reference Set share a common optimal weight pair. This property helps to alleviate the computational burden. Other approaches (Ali 1993; Barr and Durchholz 1997; Zhu 2003; Emrouznejad and Shale 2009), a detailed discussion of which is beyond the scope of this work, can also address this issue.
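The following hedged sketch shows one way Theorem 1 could be exploited: after solving the LP for one unit, every unit that attains efficiency one under the returned weights reuses those weights, so its own LP need not be solved. It assumes the `ccr_lp` helper from the previous sketch, and the tolerance `eps` is an illustrative choice.

```python
# Sketch: skip LPs for units covered by an earlier Efficiency Reference Set.
import numpy as np

def ccr_all_units(X, Y, eps=1e-9):
    N = X.shape[0]
    weights = [None] * N
    for n in range(N):
        if weights[n] is not None:        # already covered by an earlier ERS
            continue
        mu, nu, _ = ccr_lp(X, Y, n)
        # By the CCR constraints, nu^T X_i - mu^T Y_i >= 0; a (near-)zero slack
        # means the ratio equals one, i.e., unit i belongs to ERS_n.
        slack = X @ nu - Y @ mu
        for i in np.where(slack < eps)[0]:
            weights[i] = (mu, nu)
        if weights[n] is None:            # the evaluated unit keeps its own solution
            weights[n] = (mu, nu)
    return weights
```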

2.3 Boosting technique

Boosting is a general method based on the idea of creating a highly accurate model by combining many relatively weak rules. It has its roots in the PAC model. Schapire (1990) was the first to show that weak learning algorithms performing only slightly better than a coin flip can be combined into an arbitrarily accurate strong learning algorithm. The proof gave birth to the first polynomial-time Boosting procedure. Later, Freund (1990) developed a boost-by-majority version. Both algorithms share a common limitation: they require that the losses of the weak learners be bounded. Fortunately, AdaBoost (Freund and Schapire 1995) removes this impractical constraint and has been successfully applied to many different learning tasks.

Due to its simplicity and ease of implementation, Boosting has been extensively used to improve the performance of different learning algorithms. Several Boosting-based methods, such as RankBoost (Freund et al. 2003), AdaRank (Xu and Li 2007), RankCosine (Qin et al. 2008), PermuRank (Xu et al. 2008) and the cascade ranking model (Wang et al. 2011), have been developed for rank learning. Among them, AdaRank is the most suitable for combining independently generated weak rankers, and it can directly optimize evaluation metrics for ranking.

2.4 AdaRank algorithm

AdaRank is one of the state-of-the-art listwise learning to rank algorithms; it extends the well-known AdaBoost (Freund and Schapire 1995) philosophy from binary classification to rank learning. It takes queries (with their associated document lists) as instances, and directly optimizes an exponential loss that bounds a specific IR measure (e.g., MAP or NDCG). AdaRank maintains a weight distribution over the training queries and minimizes the loss function by repeatedly re-weighting this distribution. At each round, it selects the weak ranker that performs best in terms of the chosen measure, assigns it a weight according to its ranking accuracy, and combines it with the currently learned model. Simultaneously, the algorithm increases the weight of each query that has not been ranked well by the current model and decreases the weight of each query on which the selected weak ranker performs well, so that the training procedure in the next round can focus on the hard-to-learn queries. AdaRank thus fits a forward stage-wise additive model (the final ranking function) in the form of a weighted combination of the selected weak rankers.

2.5 Notations

Suppose the training data set \(S\) contains \(|Q|\) queries \(Q=\{q_1,\ldots ,q_{|Q|}\}\). For any query \(q_i \in Q\), \(S\) provides a set of associated documents \(D_i=\{d_{i1},\ldots ,d_{in_i}\}\), where \(n_i\) denotes the number of documents in \(D_i\). A query-document pair \((q_i, d_{ij})\) can be represented by one \(m\)-dimensional feature vector \(x_{ij}\), and each query corresponds to one vector set \(x_i=\{x_{i1},\ldots ,x_{in_{i}}\}\), where \(m\) denotes the dimension of features. The relevance grades are denoted by \(y_i=\{y_{i1},\ldots ,y_{in_{i}}\}, i=1,\ldots ,|Q|\).

The learned ranking function \(F\) assigns scores to documents, e.g., \(s_{ij}=F(x_{ij})\), and its ranking performance can be evaluated with a generic measure denoted by \(E(x_i,y_i, F)\). Given a well-defined loss function \(L\), the ranker can also be evaluated by loss \(L(x_i,y_i,F)\). The notation rules are summarized in Table 1.

Table 1 Summary of notations

3 DEARank method

In this section, we first formulate two modified DEA models, CCR-I and CCR-O, and then introduce the DEARank algorithm. To our knowledge, this is the first time the DEA technique has been employed in document rank learning.

3.1 Modified DEA models

The selection of input and output variables is crucial in DEA, since it can severely affect the efficiency scores of the DMUs. However, DEA provides no general rule for specifying the inputs and outputs; the choice is left to the analyst’s judgment. In rank learning, the candidate variables are the document features. We propose two variable-selection schemes, according to the relationships among queries, features, relevance grades and, in particular, documents.

Given a query \(q\), suppose there are \(n\) associated documents \(D = \{d_1,\ldots , d_n\}\). We take all documents as units with a constant input (e.g., 1) and multiple outputs. The document features are all benefit-type variables and are used as the outputs. For two documents \(d_i\) and \(d_j\), if one feature value of \(d_i\) is higher than that of \(d_j\) (e.g., its PageRank score (Brin and Page 1998)), while all other features are the same, then \(d_i\) should be preferred and deserves a higher relevance score. In DEA, a larger relevance score corresponds to a higher relative efficiency level. Under this assumption, we formulate a degenerate DEA model with one input and multiple outputs (Lovell and Pastor 1999; Liu et al. 2011; Kostrzewa et al. 2011), named CCR-I:

$$\begin{aligned} \begin{array}{l@{\quad }l@{\quad }l} \max \limits _{\mu } &{} \mu ^Tx_k&{} \\ \textit{s.t.} &{} \mu ^Tx_i \le 1,&{} i=1,\ldots ,n \\ &{} \mu \ge 0.&{} \end{array} \end{aligned}$$
(5)

In the document ranking problem, the relevance grades of the training documents are assumed to be given. Here, we consider the human-judgement procedure: how the relevance grades of documents with respect to their associated queries are generated. We assume that, in this procedure, the input features characterizing a document are the major factors driving the assessors’ decisions, so the units (i.e., the documents) invest multiple input features and harvest relevance scores from the assessors. According to the CCR-I model, a document with uniformly higher feature values deserves a higher score. Alternatively, we may assume that the assessor gives such a document a lower score. Although counterintuitive at first sight, this assumption can be motivated: a document stuffed with duplicate keywords has inflated feature values (e.g., term frequency), is sometimes suspected of being a black-hat spam page, and may therefore be penalized by search engines (Ntoulas et al. 2006). To cope with this case, we propose another model, named CCR-O:

$$\begin{aligned} \begin{array}{l@{\quad }l@{\quad }l} \min \limits _{\nu } &{} \nu ^Tx_k&{}\\ \textit{s.t.} &{} \nu ^Tx_i\ge \varphi (y_i),&{} i=1,2,\dots ,n\\ &{} \nu \ge 0,&{}\\ \end{array} \end{aligned}$$
(6)

where \(\varphi \) is a continuous real-valued, order-preserving function.

The documents formulated via the CCR-O model are viewed as units with multiple inputs (the features) and one output (the relevance score). Under the CCR-I model, in contrast, each document is viewed as a unit with one input (a constant, e.g., 1) and multiple outputs (the features).

Both models are standard LP problems and can be solved efficiently by the well-known simplex or interior-point methods. To attain a high relative efficiency score, each DMU searches for its most favorable weighting of the features. Each optimal weight vector \(\mu ^*\) in (5) or \(\nu ^*\) in (6) corresponding to unit \(k\) represents the underlying importance the unit places on the features. In economic terms, the weight vector is a price list, and the sum of weighted inputs/outputs indicates an intrinsic value of the resources. These optimal weights can therefore be used as weak rankers for documents and constitute the building blocks of our method.
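The sketch below sets up the per-document LPs CCR-I (Eq. 5) and CCR-O (Eq. 6), whose optimal weights serve as weak ranker candidates. The function names are illustrative; `phi` follows the choice \(\varphi(x) = \ln(1+x)\) reported in Sect. 4.3.

```python
# Sketch: one CCR-I and one CCR-O LP for document k of a single query.
# x is the (n, m) feature matrix of the query's documents, y its relevance grades.
import numpy as np
from scipy.optimize import linprog

def ccr_i(x, k):
    n, m = x.shape
    # max mu^T x_k  s.t.  mu^T x_i <= 1, mu >= 0   (negate for minimization)
    res = linprog(-x[k], A_ub=x, b_ub=np.ones(n),
                  bounds=(0, None), method="highs")
    return res.x                      # mu*: a weak ranker candidate

def ccr_o(x, y, k):
    n, m = x.shape
    phi = np.log1p(y)                 # phi(y) = ln(1 + y)
    # min nu^T x_k  s.t.  nu^T x_i >= phi(y_i)  <=>  -x_i^T nu <= -phi(y_i)
    res = linprog(x[k], A_ub=-x, b_ub=-phi,
                  bounds=(0, None), method="highs")
    return res.x                      # nu*: a weak ranker candidate
```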

3.2 Feature subset weak construction: DEARank

AdaRank directly uses the raw features as weak rankers. Instead of using the raw features, we apply the DEA technique to create weak ranker candidates. Specifically, given a query and its associated documents, we formulate a CCR-I or CCR-O model for each document and then solve it. Within the query, each document has at least one optimal solution to the corresponding DEA model. This solution, which represents the weighting strategy adopted by the unit under consideration, is used as a weak ranker candidate. With a pool of weak ranker candidates, we start the training procedure described in Algorithm 1. Note that MAP or NDCG (defined in Sect. 4.2) is used as the evaluation measure throughout this paper.

Algorithm 1 The DEARank algorithm

At each round, DEARank examines all weak ranker candidates in the pool (step 1), and the one that performs best on the training queries under the current weight distribution is selected. The selected weak ranker is then assigned a weight obtained by Eq. (9) (step 2) and combined with the current model so that the empirical loss decreases (step 3). To focus on the hard-to-learn queries, DEARank decreases the weights of queries on which the current model performs well and increases the weights of queries that are not ranked well. The weight distribution over queries is updated accordingly (step 4).

Another modification addresses the fact that AdaRank may be dominated by a weak ranker that performs well for most training queries, in which case the weak learning procedure can no longer improve the ranking model (Cartright et al. 2009). We adopt the strategy given by Cartright et al. (2009) to deal with this domination problem in DEARank.

In the algorithm, we minimize an exponential loss of the combined ranker \(F_t\) on the training set, which upper-bounds the measure-based ranking loss and thereby maximizes the ranking accuracy with respect to the generic IR measure \(E\):

$$\begin{aligned} \min \limits _{h\in \varPhi , \beta \in R} \sum \limits _{i=1}^{|Q|} \exp \big \{-E(x_i, y_i, F_{t-1} + \beta h)\big \}~. \end{aligned}$$
(7)

To minimize the surrogate loss function, we select a weak ranker

$$\begin{aligned} h_t = \arg \max \limits _{h\in \varPhi } \sum _{i=1}^{|Q|} P_t(i) E(x_i,y_i,h), \end{aligned}$$
(8)

and assign it a weight

$$\begin{aligned} \beta _t = \frac{1}{2} \ln {\frac{1+\sum _{i=1}^{|Q|} P_t(i) E(x_i,y_i,h)}{1-\sum _{i=1}^{|Q|} P_t(i) E(x_i,y_i,h)}}, \end{aligned}$$
(9)

according to its performance \(\sum _{i=1}^{|Q|} P_t(i) E(x_i,y_i,h_t)\) on the training set under the query weight distribution \(P_t\) at round \(t\), as indicated in Xu and Li (2007).
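The following sketch summarizes one DEARank boosting round according to Eqs. (8)–(9). It assumes a hypothetical evaluation helper `E(q, score_fn)` that returns the chosen IR measure (e.g., MAP or NDCG@5) when the documents of query `q` are sorted by `score_fn`; the query re-weighting mirrors the standard AdaRank update and is a hedged reconstruction of step 4 of Algorithm 1.

```python
# Sketch of one boosting round: weak ranker selection (Eq. 8), weighting (Eq. 9)
# and query re-weighting. Candidates are DEA weight vectors scoring x as w^T x.
import numpy as np

def dearank_round(queries, candidates, P, F, E):
    # Step 1: pick the candidate maximizing the weighted measure (Eq. 8).
    perf = np.array([sum(P[i] * E(q, lambda x, w=w: w @ x)
                         for i, q in enumerate(queries))
                     for w in candidates])
    t = int(np.argmax(perf))
    h_t, r_t = candidates[t], perf[t]
    # Step 2: weight the selected weak ranker (Eq. 9).
    beta_t = 0.5 * np.log((1.0 + r_t) / (1.0 - r_t))
    # Step 3: append it to the additive model F_t = F_{t-1} + beta_t * h_t.
    F.append((beta_t, h_t))
    combined = lambda x: sum(b * (w @ x) for b, w in F)
    # Step 4: re-weight queries so poorly ranked ones receive more attention
    # (standard AdaRank-style update, reconstructed here).
    losses = np.exp(-np.array([E(q, combined) for q in queries]))
    return F, losses / losses.sum()
```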

In summary, DEARank first constructs the weak ranker candidates from the optimal weights of documents obtained by DEA. Each optimal weight vector represents the most advantageous preference over the inputs/outputs for one document. Using the boosting technique, DEARank then trains the ranking model by repeatedly selecting weak rankers, which may come from different queries. The variety of the weak rankers, representing the collective intelligence of the units, helps to improve ranking performance over the whole set of documents.

4 Experimental evaluation

4.1 Data sets

To examine the performance of DEARank, we conduct experiments on LETOR (Qin et al. 2010), a collection of benchmark data sets for research on learning to rank. LETOR provides standard features, relevance judgments, and the results of a dozen state-of-the-art learning to rank algorithms. The versions LETOR 3.0 and LETOR 4.0 are used in this work. LETOR 3.0 contains seven data sets: HP2003, HP2004, NP2003, NP2004, TD2003, TD2004 and OHSUMED. LETOR 4.0 contains two data sets: MQ2007 and MQ2008.

LETOR 3.0 is composed of two document corpora: the Gov corpus and the OHSUMED corpus. The OHSUMED corpus (Hersh et al. 1994) is a set of medical publications containing hundreds of thousands of records from 270 medical journals; 106 queries are used to retrieve records in OHSUMED. The Gov web page collection is searched using three query sets: topic distillation (TD), home page finding (HP) and named page finding (NP). While TD resembles a traditional informational retrieval task, HP and NP are navigational tasks in which only one document is the right answer to a query. LETOR 4.0 is built on the Gov2 web page collection and two query sets from the Million Query tracks of TREC 2007 and TREC 2008, respectively. There are about 1,700 queries in MQ2007 and about 800 queries in MQ2008 with labeled documents.

Each query-document pair in LETOR is represented by standard features, including traditional information retrieval features such as tf \(\times \) idf and document title length, as well as link-based features such as PageRank (Brin and Page 1998) and HostRank (Xue et al. 2005). Each data set is partitioned into five folds for cross validation, with about the same number of queries per fold. In each fold, there are three subsets: a training set (3/5), a validation set (1/5, for model selection) and a test set (1/5). All reported experimental results are averaged over the five trials.

4.2 Evaluation measures

Two popular information retrieval measures: MAP (Baeza-Yates et al. 1999) and NDCG (Järvelin and Kekäläinen 2000) are used for evaluation in our experiments.

4.2.1 Mean average precision

MAP is a measure of the precision of ranking results. It assumes two relevance levels for each item: relevant and irrelevant. Given a ranking list \(\pi \), precision at \(k\) measures the accuracy of the top \(k\) results according to the ground truth labels

$$\begin{aligned} P@k = \frac{1}{k} \sum _{i=1}^k y_{\pi ^{-1}(i)}~, \end{aligned}$$
(10)

where \(\pi ^{-1}(i)\) represents the item ranked at position \(i\) in the list \(\pi \), and \(y_{\pi ^{-1}(i)}\) denotes that item’s ground truth label. The average precision (AP) of the ranking list \(\pi \) is calculated from the precision at \(k\)

$$\begin{aligned} AP = \frac{1}{m_r} \sum _{k=1}^{n}y_{\pi ^{-1}(k)}P@k~, \end{aligned}$$
(11)

where \(n\) is the total number of items in the ranking list and \(m_r\) denotes the number of relevant items. MAP is defined as the mean of AP over all test queries. The relevance degrees of documents with respect to the queries in MQ2007, MQ2008 and OHSUMED are judged by humans on three levels: definitely relevant, partially relevant, or irrelevant. We treat both ‘definitely relevant’ and ‘partially relevant’ as relevant when calculating MAP.
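As a reference, the following sketch computes AP and MAP directly from Eqs. (10)–(11), binarizing the graded labels as described above (‘definitely relevant’ and ‘partially relevant’ count as relevant). The function names are our own.

```python
# Sketch of P@k, AP (Eqs. 10-11) and MAP, given labels listed in ranked order.
import numpy as np

def average_precision(labels_in_ranked_order):
    y = (np.asarray(labels_in_ranked_order) > 0).astype(float)   # binarize
    if y.sum() == 0:
        return 0.0
    precision_at_k = np.cumsum(y) / np.arange(1, len(y) + 1)     # P@k, Eq. (10)
    return float((y * precision_at_k).sum() / y.sum())           # AP,  Eq. (11)

def mean_average_precision(ranked_label_lists):
    # MAP: mean of AP over all test queries.
    return float(np.mean([average_precision(l) for l in ranked_label_lists]))
```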

4.2.2 Normalized discounted cumulative gain

NDCG is designed to evaluate ranking quality in applications with multiple relevance grades (Busa-Fekete et al. 2012). The higher the NDCG, the better the ranking quality. Given a ranking list \(\pi \), the DCG at position \(k\) can be computed via

$$\begin{aligned} DCG@k = \sum _{i=1}^{k}{\frac{2^{y_{\pi ^{-1}(i)}}-1}{\log (i+1)}}~, \end{aligned}$$
(12)

where the numerator is a gain and the denominator serves as a positional discount. To balance the influence of individual queries, the normalized measure NDCG divides the DCG by the ideal discounted cumulative gain (IDCG), i.e., the maximum possible DCG up to position \(k\):

$$\begin{aligned} NDCG@k = \frac{DCG@k}{IDCG@k}. \end{aligned}$$
(13)
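The sketch below implements Eqs. (12)–(13); a base-2 logarithm is assumed for the position discount, which is the usual convention on LETOR.

```python
# Sketch of DCG@k and NDCG@k (Eqs. 12-13) for labels listed in ranked order.
import numpy as np

def dcg_at_k(labels_in_ranked_order, k):
    y = np.asarray(labels_in_ranked_order, dtype=float)[:k]
    gains = 2.0 ** y - 1.0                          # numerator of Eq. (12)
    discounts = np.log2(np.arange(2, y.size + 2))   # log(i + 1), i = 1..k
    return float((gains / discounts).sum())

def ndcg_at_k(labels_in_ranked_order, k):
    # IDCG@k: DCG of the ideal ordering (labels sorted in decreasing order).
    ideal = dcg_at_k(sorted(labels_in_ranked_order, reverse=True), k)
    return dcg_at_k(labels_in_ranked_order, k) / ideal if ideal > 0 else 0.0
```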

4.3 Experimental results

Depending on which DEA variant (CCR-I or CCR-O) is used to construct the weak rankers, we denote DEARank as DEARank-I or DEARank-O. Furthermore, during the weak learning process, either MAP or NDCG can be used as the evaluation measure for selecting weak rankers; accordingly, DEARank-I or DEARank-O is given the suffix MAP or NDCG. For convenience, we abbreviate DEARank-I-MAP, DEARank-I-NDCG, DEARank-O-MAP and DEARank-O-NDCG as DIM, DIN, DOM and DON, respectively. Both DIN and DON use NDCG@5 for weak learning. The number of iterations in DEARank is fixed at 200, and we select models on the validation set based on the mean of MAP and NDCG@1. The implementation of DEARank has two components: the computation of the DEA models, and the boosting procedure that combines weak models into a stronger one. The transformation function \(\varphi (x) = \ln (1+x)\) is used for the CCR-O model. The source code for the DEARank algorithm is available at the Google Code repository (https://code.google.com/p/l2r/).

We compare the ranking performance of DEARank with twelve learning to rank baselines, including two pointwise algorithms (Linear Regression and Ridge Regression), five pairwise algorithms (RankSVM, RankSVM-Primal (Chapelle and Keerthi 2010), RankSVM-Struct (Joachims 2006), RankBoost and FRank (Tsai et al. 2007)), and five listwise algorithms (SVM-MAP (Yue et al. 2007), AdaRank-MAP, AdaRank-NDCG, ListNet and SmoothRank (Chapelle and Wu 2010)). For ease of reference, they are denoted in order as LR, RR, RS, RSP, RSS, RB, FR, SM, ARM, ARN, LN and SR, respectively. Except for FRank and RankBoost, all methods learn linear ranking functions. At the LETOR website, only five baseline results (RB, RSS, LN, ARM and ARN) are reported for LETOR 4.0.

We conduct experiments on all data sets in the LETOR collections and report the comparison results in Tables 2, 3, 4 and 5. Several observations can be made from the results. (1) Different algorithms perform differently on different data sets, and no algorithm always gives the best performance on all data sets with respect to all measures. For instance, RankBoost outperforms all the other methods on TD2004, but it is nearly the worst on NP2004. (2) It is generally believed that the listwise learning to rank approach is superior to the pairwise and pointwise approaches. However, the results in Table 3 do not seem to support this view: the simple pointwise Ridge Regression can also yield impressive results on NP2003. (3) DEARank learns the boosted ranking model from weak ranker candidates created from one of the two DEA variants, CCR-I or CCR-O. Compared with CCR-O, CCR-I utilizes less information about the training set, since it does not take the human-elicited document labels into account. Nevertheless, on average, DEARank-I performs better than DEARank-O, as shown in Fig. 1. We evaluate every raw-feature ranker in terms of NDCG@10 on each data set and compute the standard deviation of their performances. We find that DEARank-I outperforms DEARank-O by 2–10 % on HP2003, HP2004, NP2003 and NP2004, whose standard deviations are about three times greater than those of TD2003, TD2004 and MQ2007, whereas DEARank-O beats DEARank-I by only 0.5–4 % on the latter three data sets. These facts indicate that CCR-O is vulnerable to noisy features (Gomes et al. 2013) and duplicate features (Geng et al. 2007); moreover, when the training set contains much noise, CCR-I is more suitable for modeling the relationships between features and documents. (4) It is very important for a ranking model to assign higher scores to the most relevant items; DEARank performs best in terms of NDCG@1 and MAP on two thirds of all the data sets.

Fig. 1

Average performances of DEARank-I and DEARank-O. Each bar represents the average performance of two DEARank methods (DIM and DIN for the bars in blue, DOM and DON for the bars in red) with respect to the mean value of NDCG@1–NDCG@10 and MAP (Color figure online)

Table 2 Ranking performances (data bolded and italicized indicate the best results, and data bolded indicate the second best results w.r.t a particular measure) on HP2003 and HP2004
Table 3 Ranking performances (data bolded and italicized indicate the best results, and data bolded indicate the second best results) on NP2003 and NP2004
Table 4 Ranking performances (data bolded and italicized indicate the best results, and data bolded indicate the second best results) on TD2003 and TD2004
Table 5 Ranking performances (data bolded and italicized indicate the best results, and data bolded indicate the second best results) on OHSUMED, MQ2007 and MQ2008

To give an overall ranking of all algorithms in the experiments, we use a pairwise comparison method (Liu 2011) to count their winning numbers (the number of times one algorithm beats another) with respect to all measures over all data sets:

$$\begin{aligned} W_A = \sum \limits _B \sum \limits _E \sum \limits _S I\big \{perf(A, E, S) > perf(B, E, S)\big \}, \end{aligned}$$
(14)

where \(A\) and \(B\) denote algorithm indices, \(E\) denotes the index of the measure (NDCG@1, \(\ldots \), NDCG@10 or MAP), and \(S\) is the index of the data set. \(perf(A, E, S)\) represents the ranking performance of algorithm \(A\) with respect to measure \(E\) on data set \(S\). Because only five baselines are available for comparison on LETOR 4.0, we present the winning numbers separately, as shown in Tables 6 and 7. From the two tables, we observe that DIN is the best method in both collections, and the second best is SmoothRank on LETOR 3.0 and ListNet on LETOR 4.0.
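For clarity, the winning-number statistic of Eq. (14) can be computed as in the sketch below, where `perf[a][s][e]` is assumed to hold the score of algorithm `a` on data set `s` under measure `e` (an illustrative data layout, not part of the original setup).

```python
# Sketch of the winning-number count in Eq. (14).
def winning_numbers(perf, algorithms, datasets, measures):
    wins = {a: 0 for a in algorithms}
    for a in algorithms:
        for b in algorithms:
            if b == a:                      # self-comparisons contribute nothing
                continue
            for s in datasets:
                for e in measures:
                    if perf[a][s][e] > perf[b][s][e]:
                        wins[a] += 1
    return wins
```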

Table 6 Winning numbers on LETOR 3.0 (including HP2003, HP2004, NP2003, NP2004, TD2003, TD2004 and OHSUMED) (data bolded indicate the best result)
Table 7 Winning numbers on LETOR 4.0 (including MQ2007 and MQ2008) (data bolded indicate the best result)

4.4 Experiments with a reduced pool

In DEARank, at each round all weak ranker candidates in the pool must be examined, and the one that performs best on the training set under the weight distribution over queries is selected. This weak learning step accounts for most of the training time. Clearly, as the iterations proceed, many candidates perform so badly that they have little chance of being selected. Therefore, we keep only the top \(K\) (\(\ll N\), the total number of documents in the training set) candidates that perform best with respect to an IR measure (e.g., MAP or NDCG) on the whole training set.
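A minimal sketch of this filtering step is given below; `E(q, score_fn)` is the same hypothetical evaluation helper assumed in the earlier boosting sketch, and the default `K` mirrors the value used in the experiments.

```python
# Sketch: keep only the top-K candidates by average performance on the training set.
import numpy as np

def reduce_pool(candidates, queries, E, K=100):
    avg_perf = np.array([np.mean([E(q, lambda x, w=w: w @ x) for q in queries])
                         for w in candidates])
    top = np.argsort(-avg_perf)[:K]          # indices of the K best candidates
    return [candidates[i] for i in top]
```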

We conduct experiments on five representative data sets, HP2003, NP2003, TD2003, OHSUMED and MQ2008, with a reduced pool of candidates. Here, we fix \(K=100\) and use the training measure (MAP for DIM and DOM, NDCG@5 for DIN and DON) to filter the candidates before training. The experimental results on the reduced pool are reported in Table 8.

Table 8 Performance of DEARank training with a reduced candidates’ pool

It can be seen from the results that, on average, the performance change of DEARank ranges from a 4.82 % loss on NP2003 with respect to NDCG@1 (the worst case) to a 2.78 % gain on OHSUMED in terms of NDCG@3 (the best case). When compared with the baselines, however, the rank of our method in terms of winning number does not suffer. Taking HP2003 for instance, DIM dominates the performance in terms of NDCG@1, NDCG@5 through NDCG@10, and MAP on the complete pool. With the reduced pool of weak ranker candidates, DIN obtains the highest scores on all measures except NDCG@4, on which it is ranked second best. After filtering the candidates, DEARank is therefore more efficient without much loss of performance.

5 Discussions and conclusions

The proposed DEARank proves to be a useful rank learning method. In this section, we discuss the impact on ranking performance of a particular local property of the weak candidates and of an alternative strategy for weighting the weak rankers.

5.1 Local overfitting and error diversity

To learn a ranking model, DEARank constructs a large number of weak candidates from the training set. Each weak candidate is strongly tied to one query and its associated documents, and may perform better on its associated query than on other queries. We call this phenomenon local overfitting, and refer to the corresponding query and its associated documents as the home-set.

At the weak learning stage, the weak rankers are selected based on their average performances. If local overfitting occurs, the selection will be biased and may deteriorate the overall ranking accuracy.

To determine whether local overfitting occurs on the data sets used, all constructed candidates are examined. First, every candidate is used as a linear ranker to sort the associated documents of each query, and its performance is measured by MAP. Second, its performance averaged over all queries is compared with its performance on the home-set. Finally, we count the number of rankers that perform better on their home-sets than on average, and compute the corresponding percentages, as shown in Fig. 2.
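A hedged sketch of this counting procedure follows. It assumes each candidate is stored together with its home query, and uses a hypothetical helper `map_of(q, w)` that returns the MAP obtained when the documents of query `q` are sorted by the linear score \(w^T x\).

```python
# Sketch: fraction of candidates whose MAP on the home query exceeds their
# average MAP over all queries (the "home-set biased" rankers).
import numpy as np

def home_set_bias_rate(candidates_with_home, queries, map_of):
    biased = 0
    for w, home_q in candidates_with_home:
        avg = np.mean([map_of(q, w) for q in queries])
        if map_of(home_q, w) > avg:
            biased += 1
    return biased / len(candidates_with_home)
```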

Fig. 2

Local diversity with local overfitting

The two groups, one based on CCR-I and the other on CCR-O, report the percentages of home-set-biased rankers on all data sets. For example, about 64.3 % of the candidates built upon CCR-O produce ranking accuracy biased towards their home-sets on MQ2007, on average. On most data sets (seven out of nine), the CCR-O-based candidates show the higher proportion of biased rankers. We note that the home-set-biased candidates produce higher ranking accuracy on their home-set queries. If they do not generalize well to other queries, local overfitting may deteriorate ranking performance, as shown by the results on HP2003, HP2004, NP2003, NP2004, TD2003 and MQ2008 (see Fig. 1). On the other hand, according to previous works (Cunningham and Carney 2000; Tsymbal et al. 2005), base learners whose predictions are diverse in their errors can effectively improve the ensemble’s accuracy. If the home-set-biased candidates can also make accurate predictions for other queries, local overfitting may increase the error diversity of the ensemble model and result in improved overall performance, as indicated by the results on TD2004, OHSUMED and MQ2007 in Fig. 1.

To further improve the combined ranker’s overall performance, we plan to carry out a series of thorough experiments on the local overfitting phenomenon and to investigate its effect on the error diversity of the ensemble model.

5.2 Feature subset weighting strategy

In this work, we adopt the strategy (see Eqs. (8) and (9)) given in AdaRank to minimize the empirical risk on the training set. In this scheme, all the features selected by a weak ranker are treated equally, and the contribution of each individual feature is ignored. In fact, each weak ranker can be decomposed into multiple single-feature rankers, so each selected feature can be assigned a weight proportional to its contribution to the overall ranking performance

$$\begin{aligned} \beta _{tk} \propto \sum \limits _{i=1}^{|Q|} E(x_i, y_i, f_k), ~~k = 1,\ldots , m, \end{aligned}$$
(15)

where \(f_k\) represents the \(k\)th feature selected by the weak ranker \(h_t\). This kind of weighting scheme deserves further critical appraisal.
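One possible reading of Eq. (15) is sketched below: the features "selected" by \(h_t\) are interpreted as those with nonzero weight, and each receives a share proportional to the performance of the corresponding single-feature ranker. `E(q, score_fn)` is again the hypothetical evaluation helper used in the earlier sketches; this is an illustration of the idea, not a definitive implementation.

```python
# Sketch of the feature-contribution weighting suggested by Eq. (15).
import numpy as np

def feature_contribution_weights(h_t, queries, E):
    m = h_t.size
    beta = np.zeros(m)
    for k in np.nonzero(h_t)[0]:                        # features used by h_t
        single = lambda x, k=k: x[k]                    # single-feature ranker f_k
        beta[k] = sum(E(q, single) for q in queries)    # Eq. (15), up to a constant
    return beta / beta.sum() if beta.sum() > 0 else beta
```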

5.3 Conclusions

The key contribution of this work is to introduce data envelopment analysis (DEA) into the field of learning to rank and to propose the DEARank algorithm. Making use of DEA’s potential for capturing the intrinsic characteristics of documents, we construct the weak ranker candidates from the units’ optimal feature weights. These optimal weights are obtained by solving one of the DEA variants, CCR-I or CCR-O, and the experimental results demonstrate that DEARank provides a promising alternative for rank learning. The incorporation of DEA into rank learning also opens up many challenges and possibilities, e.g., the computational complexity, the generalization ability, the performance of other kinds of DEA models (e.g., the BCC model (Banker et al. 1984) and the additive model (Charnes et al. 1985)), and the relationship between the local overfitting phenomenon and error diversity. We plan to explore these issues in further detail in future work.