
1 Introduction

The goal of medical systematic review literature search is to retrieve all research publications that are relevant to a highly focused research question and satisfy a set of inclusion criteria. This is so that all literature relevant to the review’s research question can be synthesised in the systematic review [23]. Search takes place using a Boolean query, formulated by highly trained information specialists using their intuition and domain knowledge, in order to capture the information need of the systematic review [8]. Afterwards, every study retrieved by the Boolean query is screened (assessed) for inclusion using the titles and abstracts of studies (abstract level assessment). Identified relevant abstracts are further processed by acquiring the full-text for additional assessment, information extraction and synthesis [23].

The process of creating a medical systematic review typically involves large monetary and temporal costs; the average Cochrane review costs $350K to create [35] and it takes up to two years to publish – thus often rendering the result of the systematic review already out-of-date at the time of publication. The process that incurs the most cost when creating a systematic review is the screening of studies retrieved by the Boolean query; often a large set of studies is retrieved, but only a handful are relevant.

A number of solutions have arisen to address the amount of time spent screening documents, including: screening prioritisation (which seeks to re-rank the set of retrieved documents to show more relevant documents first, thus starting the full-text screening earlier), and stopping estimation (which seeks to predict at what point continuing to screen will no longer contribute gain) [25, 26, 40]. In this paper, we propose and evaluate a Boolean query ranking function aimed at tackling these two tasks. The proposed method incorporates intuitions from both coordination level matching of Boolean queries and search engine rank fusion.

This paper proposes an extension to coordination level matching (CLM) that exploits the query-document relationship with rank fusion. CLM is a ranking function originally proposed for Boolean queries that scores a document by the number of query clauses that retrieve it. The proposed extension, coordination level fusion (CLF), has many advantages over CLM: it can use multiple weighting schemes (rankers) and different fusion methods depending on the Boolean clauses. We use CLF to rank studies in the screening prioritisation task of systematic reviews. We further study the use of a cut-off threshold, tuned on training data, to control when the screening of studies should be stopped based on the CLF retrieval score. The empirical results obtained on the CLEF Technology Assisted Review datasets [25, 26] show that CLF significantly outperforms existing state-of-the-art methods that consider similar settings, including the ranking method currently used in PubMed (a popular database for searching literature for systematic reviews).

2 Related Work

Systematic reviews are costly and often out-of-date by the time they are published due to the amount of time involved in their creation. A wide range of systematic review creation processes have been considered for automation or improvement using semi-automatic techniques [40], including: query formulation [27, 46, 48], screening prioritisation [1,2,3, 7, 28, 29, 37, 47, 54, 56], stopping prediction [7, 16, 24], assessment of bias [33, 43], among others. This paper proposes a technique for screening prioritisation, thus the remainder of this section focuses on this specific task.

Active learning has been explored extensively for screening prioritisation and automatic assessment [1, 10, 37, 56]. However, the main drawbacks of active learning are that a poor initial ranking will slow down the rate of learning, and that explicit human effort is required to update the ranking. While current practice prescribes that all documents must be screened (so explicit assessments could be used for active learning), an initially poor ranking would require many assessments before the system is able to identify relevant documents, thus delaying the analysis of the full-text of eligible documents. Automatic assessment has been suggested as a replacement for a second researcher performing screening [40]. Fully automatic methods of screening prioritisation allow other processes of systematic review creation to begin earlier and do not require human effort, saving further time and cost. In this paper, we do not consider screening prioritisation methods based on active learning. However, we note that CLF could be used as the first pass ranking in the context of an active learning method; active learning could then augment CLF to perform re-ranking in the presence of continuous, iterative relevance feedback. We leave the study of CLF in an active learning setting for future work.

The CLEF Technology Assisted Reviews (TAR) track [25, 26] considers both screening prioritisation and stopping prediction tasks. The screening prioritisation task has gained substantial interest from CLEF participants, with submitted methods including active learning [12, 13], relevance feedback [4, 18, 21, 36, 38, 39, 52, 55], automatic supervised [9, 17, 30, 47, 51], and automatic unsupervised methods (which do not rely on any relevance feedback or human intervention) [2, 3, 7, 54]. Meanwhile, the stopping prediction task has seen little participation, with submitted methods limited to naïve techniques such as static score-based cut-offs [24] and techniques based on continuous relevance feedback [16]. Many participants also do not use the Boolean queries directly, instead resorting only to the title of the review (a single sentence), which is contrived and unrealistic in the context of systematic review literature search. This work overcomes these shortcomings by using only the Boolean query to rank documents, with no additional effort required by the information specialist.

Fig. 1. Types of clauses in a Boolean query. Dashed lines surround Boolean clauses, dotted lines surround atomic clauses.

Several approaches to ranking documents retrieved by Boolean queries were proposed in the ‘80s and ‘90s outside of the context of systematic review creation. Most of these approaches rely on users explicitly weighting terms in the query [42], on probabilistic retrieval using fuzzy set theory [6, 41], or on term dependencies [15]. A drawback of these methods is their heavy reliance on users to impose a ranking over retrieved documents (e.g., the requirement that users must specify individual term weightings). Users are often unable to provide such weights, or doing so adds a hindrance to using the retrieval system.

A ranking function for Boolean queries which relies solely on the structure of the Boolean query, without further user intervention, is Coordination Level Matching (CLM) [31]. The intuition behind CLM is that nested sub-clauses of a Boolean query can be considered as separate but related queries, and therefore documents that appear in multiple clauses should be ranked higher. For example, a very common way information specialists formulate Boolean queries for systematic review literature search is to break a search down into three or four categories based on the Population, Intervention, Controls, Outcomes (PICO) framework [8]. Query terms from each category become a clause in the Boolean query, grouped together by a single AND operator [8]. Formally, in CLM the score of a document d is the number of clauses of the query Q that are satisfied by it. A clause can be either a single atomic keyword (atomic clause) or the grouping of several keywords or other nested groupings under a single Boolean operator (Boolean clause). Figure 1 visualises the differences between atomic clauses and Boolean clauses.
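To make the CLM scoring rule concrete, the following is a minimal sketch of CLM over a nested Boolean query, assuming a simplified query representation (nested dicts and terms) and hypothetical per-term document sets; it illustrates the scoring rule only, not the implementation used in this paper.

```python
# A minimal sketch of Coordination Level Matching (CLM), assuming a query is
# represented as a nested tree of operators ("AND"/"OR") and atomic terms, and
# that each atomic term maps to the set of documents it retrieves.
from collections import Counter

def clauses(query):
    """Yield every clause (atomic and Boolean) in the query tree."""
    yield query
    if isinstance(query, dict):  # a Boolean clause: {"op": ..., "children": [...]}
        for child in query["children"]:
            yield from clauses(child)

def retrieved(clause, postings):
    """Set of documents retrieved by a clause (standard Boolean evaluation)."""
    if isinstance(clause, str):          # atomic clause (a term)
        return postings.get(clause, set())
    sets = [retrieved(c, postings) for c in clause["children"]]
    return set.intersection(*sets) if clause["op"] == "AND" else set.union(*sets)

def clm_scores(query, postings):
    """CLM score of a document = number of query clauses that retrieve it."""
    scores = Counter()
    for clause in clauses(query):
        for doc in retrieved(clause, postings):
            scores[doc] += 1
    return scores

# Hypothetical example: a PICO-style query with two clauses joined by AND.
postings = {"diabetes": {1, 2, 3}, "metformin": {2, 3, 4}, "insulin": {3, 5}}
query = {"op": "AND", "children": ["diabetes",
                                   {"op": "OR", "children": ["metformin", "insulin"]}]}
print(clm_scores(query, postings).most_common())  # documents retrieved by more clauses rank higher
```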

Rankings produced by CLM typically perform poorly (as supported by our empirical findings in Sect. 5.1), because the amount of information about the query exploited to produce a document ranking is low. CLM has been noted to be more effective when occurrences of documents are weighted by, for example, IDF or TF-IDF [14]. It is unclear, however, which weighting scheme should be used, as different weighting schemes can produce different document orderings. Moreover, when computing scores, CLM does not account for the different Boolean operators present in the query, i.e., scores are summed in the same manner irrespective of the operator used, e.g., AND, OR.

The CLF method proposed in this paper exploits rank fusion [49], i.e., the combination of multiple document rankings, typically returned by different systems or weighting schemes for the same query (although recent work has applied fusion to different query variations [5]). There are many methods for fusion of rankings, and they can be classified into two main categories [22]: score-based [49] and rank-based [32]. Score-based methods fuse rankings using the original scores of documents in different rankings to infer the new fused ranking. As systems and weighting schemes will typically assign wildly different scores to documents, scores are often normalised before fusion (e.g., using min-max normalisation). Rank-based methods fuse rankings using only the rank positions of documents (similarly to electoral vote fusion [32]).
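As an illustration of the two fusion families discussed above, the following toy sketch contrasts a score-based method (CombSUM over min-max normalised scores) with a rank-based method (Borda counts); the runs, document identifiers, and scores are made up for illustration.

```python
# A toy illustration (not the paper's implementation) of the two fusion families:
# score-based fusion with min-max normalisation (CombSUM), and rank-based fusion
# using Borda counts, which ignores the raw scores.

def min_max(run):
    lo, hi = min(run.values()), max(run.values())
    return {d: (s - lo) / (hi - lo) if hi > lo else 1.0 for d, s in run.items()}

def comb_sum(runs):
    """Score-based: sum the normalised scores of each document across runs."""
    fused = {}
    for run in map(min_max, runs):
        for d, s in run.items():
            fused[d] = fused.get(d, 0.0) + s
    return fused

def borda(runs):
    """Rank-based: each run votes points by rank position, ignoring raw scores."""
    fused = {}
    for run in runs:
        ranking = sorted(run, key=run.get, reverse=True)
        for pos, d in enumerate(ranking):
            fused[d] = fused.get(d, 0) + (len(ranking) - pos)
    return fused

bm25 = {"d1": 12.3, "d2": 9.1, "d3": 2.4}     # hypothetical BM25 run
tfidf = {"d2": 0.81, "d3": 0.77, "d4": 0.10}  # hypothetical TF-IDF run
print(sorted(comb_sum([bm25, tfidf]).items(), key=lambda x: -x[1]))
print(sorted(borda([bm25, tfidf]).items(), key=lambda x: -x[1]))
```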

The novelty of our contribution is that, by combining insights from decades-old research on ranking documents directly from Boolean queries with more recent research on the fusion of ranked lists, significant gains in effectiveness can be obtained.

Fig. 2. Bottom-up visualisation of the fusion of ranked lists using the CLF method. First, one or more ranked lists of an atomic clause are fused; then the results of each Boolean clause are fused. Each clause to which fusion is applied is encapsulated in a dashed box, with the nested clauses it encapsulates included inside it. Each applicable fusion method is labelled within its respective box. Note that all atomic clauses use the same range of weighting schemes; in this figure only one is shown for space reasons.

3 Coordination Level Fusion

In this paper, we propose Coordination Level Fusion (CLF), a novel method that extends traditional Coordination Level Matching (CLM) [31] by integrating rank fusion into the Boolean retrieval model, exploiting both the semantic and the syntactic aspects of the Boolean query.

CLM’s intuition is that documents retrieved by many clauses should be considered more likely to be relevant. We note that this intuition is supported by axioms put forward in axiomatic analyses of ranking functions [19], and, more importantly for our work, it is similar to the intuition of rank fusion, namely, the chorus effect: the fact that “several retrieval approaches suggest that an item is relevant to a query” [53]. CLF leverages this intuition to further boost relevant documents higher up the ranking, using the agreement from multiple weighting schemes (rankers) and the agreement afforded by the structure of Boolean queries. Next, we describe the CLF method for ranking documents.

3.1 Producing a Ranking

We assume that a set R of rankings \(r_1, r_2, \ldots , r_k\) is available for each atomic Boolean clause (i.e., a term in the Boolean query, see Fig. 1). These rankings could be produced by any available weighting scheme, e.g., IDF, BM25, etc. A ranking is an ordered list of documents, \(r_j = \langle d_1, d_2, \ldots , d_n \rangle \), with \(s(d_i,r_j)\) representing the score of document \(d_i\) within ranking \(r_j\). In CLF, these rankings are recursively fused, first at the atomic clause level, then at the level of (often nested) Boolean operators, until the highest level of the Boolean query is reached (typically an AND operator): at this level, rankings are fused once more to produce a single, final ranking. This is achieved by applying the CLF fusion function to each document d as:

\(\mathrm {CLF}(d, R, T) = \begin{cases} \mathrm {CombSUM}(d, R) = \sum _{r \in R} s(d, r) & \text {if } T = \text {AND} \\ \mathrm {CombMNZ}(d, R) = |\{r \in R : d \in r\}| \cdot \sum _{r \in R} s(d, r) & \text {if } T = \text {OR or atomic} \end{cases}\)   (1)

where R is the set of rankings associated with the clauses of the Boolean query considered at the current level, and T is the type of Boolean operator applied. In this work, we consider T as either identifying an atomic clause, or the AND and OR operators. The queries we consider do not have NOT clauses (therefore we do not define a fusion method for this operator). According to Eq. 1, CLF performs CombSUM fusion [49] if the Boolean clause is AND (\(T=\) AND), while CombMNZ fusion [49] is used when dealing with atomic clauses or the OR operator. Figure 2 visualises how fusion is performed for different Boolean clauses. When scoring exploded MeSH terms, the score provided by a weighting scheme is the summed score of each child in the subsumption (similarly for phrases). Both CombSUM and CombMNZ boost documents which multiple rankers estimate to be highly relevant (i.e., the chorus effect); however, CombMNZ is used at the OR and atomic levels to combat less accurate estimates of relevance (i.e., the dark horse effect): documents that only a single ranker estimates as highly relevant are not boosted.
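To illustrate how Eq. 1 is applied recursively over the structure of a Boolean query, the following is a minimal sketch of CLF, assuming (for illustration only) that each atomic clause already holds one score dictionary per weighting scheme and that rankings are min-max normalised before each fusion step, as described in Sect. 4; it is a simplified sketch, not our implementation.

```python
# A minimal sketch of CLF (Eq. 1): CombSUM is used for AND clauses, CombMNZ for
# OR and atomic clauses. Queries are represented as trees of atomic clauses
# (each with one score dictionary per weighting scheme) and Boolean clauses.

def min_max(run):
    lo, hi = min(run.values()), max(run.values())
    return {d: (s - lo) / (hi - lo) if hi > lo else 1.0 for d, s in run.items()}

def comb_sum(runs):
    fused = {}
    for run in runs:
        for d, s in run.items():
            fused[d] = fused.get(d, 0.0) + s
    return fused

def comb_mnz(runs):
    fused, hits = comb_sum(runs), {}
    for run in runs:
        for d in run:
            hits[d] = hits.get(d, 0) + 1
    return {d: s * hits[d] for d, s in fused.items()}

def clf(clause):
    """Recursively fuse rankings bottom-up, following the structure of the query."""
    if "rankings" in clause:                              # atomic clause
        return comb_mnz([min_max(r) for r in clause["rankings"]])
    child_runs = [min_max(clf(c)) for c in clause["children"]]
    return comb_sum(child_runs) if clause["op"] == "AND" else comb_mnz(child_runs)

# Hypothetical query: (term A OR term B) AND term C, two weighting schemes per term.
query = {"op": "AND", "children": [
    {"op": "OR", "children": [
        {"rankings": [{"d1": 3.0, "d2": 1.0}, {"d1": 0.9, "d3": 0.4}]},          # term A
        {"rankings": [{"d2": 2.0, "d3": 1.5}, {"d2": 0.7}]},                     # term B
    ]},
    {"rankings": [{"d1": 1.2, "d2": 1.1, "d3": 0.2}, {"d2": 5.0, "d3": 4.0}]},   # term C
]}
final = clf(query)
print(sorted(final, key=final.get, reverse=True))  # fused ranking, best first
```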

Fig. 3. Example query formatted to be issued to PubMed for re-ranking. Constructing the query like above ensures only the documents specified (e.g., document number 23593613) are retrieved, and therefore re-ranked.

3.2 Stopping Prediction

The task of stopping prediction in systematic review literature search is: given a ranking of the set of documents retrieved by the Boolean query, at what position should screening stop? We model this task with an equivalent description: given a set of documents retrieved by a Boolean query, what is the subset of documents that does not need to be screened? In this work, stopping prediction is performed by exploiting the scores of documents after fusion. Rather than setting a fixed cut-off on scores, as done by participants in the CLEF TAR task [24], a gain-based approach is used. Our approach is as follows. Researchers screen documents starting from the top of the ranked list and proceed downwards, accumulating gain from documents (equal to the document score) as they go; once enough gain has been accumulated, they can stop screening. To model this, we use a \(\kappa \) parameter that controls what percentage of the total gain a researcher can accumulate before stopping. The stopping point is therefore the position in the ranked list at which the cumulative gain exceeds the allowed fraction of the total gain. When \(\kappa \) is set to 1, no documents are discarded; in the task of screening prioritisation, where all documents are assessed, \(\kappa \) is set to 1.
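A minimal sketch of this gain-based stopping rule is given below, assuming the fused CLF scores of the retrieved documents are available in ranked order; the scores used in the example are hypothetical.

```python
# A minimal sketch of the gain-based stopping rule, assuming `scores` are the
# fused CLF scores of the retrieved documents in ranked order. kappa is the
# fraction of the total gain a researcher accumulates before stopping.

def stopping_rank(scores, kappa):
    """Return the number of documents to screen (rank at which to stop)."""
    if kappa >= 1.0:          # kappa = 1 keeps every document (screening prioritisation)
        return len(scores)
    budget = kappa * sum(scores)
    gain = 0.0
    for rank, score in enumerate(scores, start=1):
        gain += score
        if gain >= budget:
            return rank
    return len(scores)

# Hypothetical ranked list of fused scores: most of the gain sits near the top.
scores = [0.9, 0.8, 0.5, 0.2, 0.1, 0.05, 0.05]
print(stopping_rank(scores, kappa=0.4))   # screen only the top of the ranking
print(stopping_rank(scores, kappa=1.0))   # screen everything
```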

4 Experimental Setup

Empirical evaluation is conducted on the CLEF TAR 2017 and 2018 collections [25, 26]. For the 2018 collection, evaluation is performed on topics from Task 2. Experiments are compared with respect to two baselines: a ranking obtained by submitting queries directly to PubMed (explained in detail below), and a ranking obtained by using CLM. The results of the CLF rankings are also compared to the rankings produced by the participants of the CLEF TAR task. Note that many of these participants do not rank directly according to the terms and structure of the Boolean query (while we do), often considering the query as a bag-of-words and incorporating terms from the title for re-ranking. Also note that many participants used feedback from the relevance assessments and created active learning solutions. Comparisons with participants only consider runs reported not to use relevance assessments or human intervention to rank (i.e., fully automatic, thus excluding active learning settings). In other words, we experiment considering the first round of retrieval.

All experiments are run using the QueryLab domain-specific Information Retrieval framework [45]. To obtain statistics for ranking documents, the documents retrieved by each query are fetched from PubMed and indexed by QueryLab. No stopwording or stemming is applied. The queries in this collection contain terms which are explicitly stemmed; therefore, we use the PubMed Entrez API [44] to identify the original terms in documents from the explicitly stemmed term (this backward approach to stemming allows information specialists fine-grained control over their search). The title, abstract, MeSH headings, and publication date of each PubMed document are stored in four separate fields. When a title was not available for a document, the book title field was used instead; if no book title was available, the field was left empty (this replicates how searching on the title field works in PubMed). The code to reproduce our experiments is made available at https://github.com/ielab/clf.

The following weighting schemes are used in our experiments to produce document rankings for an atomic clause: IDF, TF-IDF, BM25, InL2 of Divergence from Randomness, PubMed, term position, text score, publication date, and document length. The PubMed weighting scheme uses the state-of-the-art learning to rank system of PubMed [20]. PubMed's best match ranking uses a three-stage pipeline: first, documents are retrieved using the Boolean query; then, documents are ranked using BM25; finally, top-ranked documents are re-ranked using LambdaMART trained on click data, using document features such as document length, publication date, and past usage. Note that the PubMed best match ranker can only rank documents given a term or phrase, not a Boolean query. After the first stage, the Boolean query is translated into a bag-of-words type of query, similar to those seen in web search (it is often the case that this query translation results in fewer documents being retrieved). Therefore, by embedding the PubMed ranker into CLF, the query translation step can be skipped entirely. The term position weighting scheme is defined as the relative position of a term in a document (0 if the term does not appear in the document). Publication date scores newer documents higher (accounting for recency, linearly). Document length scores longer documents higher. Text score weights documents by the fields a term appears in: for example, a document is scored higher if a term appears in the title and the body than if the term appears only in the body. A sketch of some of these simpler weighting schemes is given after the configuration list below.

When queries are submitted to PubMed, they are modified to restrict retrieval to only the PMIDs reported in the CLEF topic file (in order to account for minor discrepancies in retrieval across different time periods; see Fig. 3 for an example), and the retrieval mode in PubMed is set to ‘relevance’ in order to obtain a list of documents ranked by relevance (instead of the default ranking by publication date). Prior to fusion for any clause, ranked lists are normalised using min-max normalisation. Z-score and softmax normalisation were also considered; however, in early empirical testing, min-max normalisation consistently provided higher effectiveness. When there are ties in the ranking, the document with the more recent publication date is ranked higher. The different configurations of CLF used in this paper are described below:

  • CLM – The basic form of coordination level matching using the approach described in Sect. 1.

  • CLF+PubMed – CLF, using the PubMed ranker via the PubMed Entrez API.

  • CLF+weighting – CLF, using the weighting schemes described above (excluding the PubMed weighting scheme).

  • CLF+weighting+PubMed – CLF, using all of the weighting schemes from CLF+weighting in addition to the PubMed ranker from CLF+PubMed.

  • CLF+weighting+qe – CLF, using all the weighting schemes from CLF+weighting, but with a naïve query expansion method using terms from the topic titles and terms specific to DTA systematic reviews (obtained from an information specialist). Here, two additional Boolean OR clauses are constructed, containing terms from the title and DTA-specific terms, respectively. Terms from the title have stopwording and Porter stemming applied.

  • CLF+weighting+PubMed+qe – CLF, using all of the weighting schemes from CLF+weighting, in addition to the PubMed ranker from CLF+PubMed, and the approach to query expansion from CLF+weighting+qe.
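As a concrete illustration of some of the simpler weighting schemes listed in this section (term position, publication date, document length, and text score), the sketch below computes them over a simplified document representation; the exact formulas and field handling in our implementation may differ, and the document shown is hypothetical.

```python
# A sketch of some of the simpler weighting schemes described above, under a
# simplified document representation (a dict with tokenised text fields and a
# publication year). These are illustrative approximations only.
from datetime import date

def term_position(doc, term):
    """Relative position of the first occurrence of a term (0 if absent).
    Assumption for this sketch: earlier occurrences score higher."""
    tokens = doc["title"] + doc["abstract"]
    if term not in tokens:
        return 0.0
    return 1.0 - tokens.index(term) / len(tokens)

def recency(doc, oldest_year=1950, newest_year=None):
    """Linearly higher scores for newer documents (publication date scheme)."""
    newest_year = newest_year or date.today().year
    span = max(newest_year - oldest_year, 1)
    return (doc["year"] - oldest_year) / span

def doc_length(doc):
    """Longer documents score higher (could be normalised by the longest document)."""
    return float(len(doc["title"]) + len(doc["abstract"]))

def text_score(doc, term, field_weights={"title": 2.0, "abstract": 1.0}):
    """Weight a term by the fields it appears in (title counts more than abstract)."""
    return sum(w for field, w in field_weights.items() if term in doc[field])

doc = {"title": ["metformin", "for", "type", "2", "diabetes"],
       "abstract": ["we", "study", "metformin", "and", "insulin"],
       "year": 2015}
for fn in (term_position, text_score):
    print(fn.__name__, fn(doc, "metformin"))
print("recency", recency(doc), "doc_length", doc_length(doc))
```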

4.1 Evaluation

Evaluation is performed differently depending on the task. For the screening prioritisation task, rank-based measures are used. For comparison with the CLEF TAR participants (whose runs we acquired), the MAP measure is included. The nDCG measure is included as a more realistic model of user behaviour. Reciprocal rank (RR) is used to demonstrate the effectiveness of systems in an active learning scenario (i.e., to show how soon the first relevant document would be shown and an update to the ranking potentially triggered). Precision after R documents (Rprec) is used to show the theoretical best possible precision obtainable in the stopping task, along with last relevant (Last Rel), which reports the rank position at which the last relevant document was shown. Participant runs are chosen for comparison if they are fully automatic, unsupervised methods which do not use the training data or explicit relevance feedback, and do not set a threshold (as categorised in the TAR overview papers [25, 26]). Note that the tables in the CLEF TAR overview papers contain errors regarding these aspects; instead, each participant's paper was examined individually to determine which runs to compare our methods with directly. For the stopping prediction task, several standard set-based measures are used: precision, recall, \(F_{\beta =\{0.5,1,3\}}\), total cost, and reliability [11]. Reliability is a loss measure (i.e., smaller values are better) specifically designed for the TAR task. It has two components: \(loss_r=1-(\text {recall})^2\) and \(loss_e=(n/(R+100)\cdot 100/N)^2\), where n is the number of documents retrieved, N is the size of the collection, and R is the total number of relevant documents. Therefore, \(\text {Reliability}=loss_r+loss_e\). Participant runs are chosen if they are fully automatic, supervised or unsupervised (thus we consider approaches that used training data), do not use explicit relevance feedback, and do set a threshold. Runs are evaluated using trec_eval or the evaluation scripts provided by the CLEF TAR organisers, where applicable.
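The Reliability measure, as defined above, can be computed with a few lines; the sketch below follows the \(loss_r\) and \(loss_e\) formulas given in this section, with illustrative (hypothetical) numbers.

```python
# Computing the Reliability loss as defined above: loss_r = 1 - recall^2 and
# loss_e = (n / (R + 100) * 100 / N)^2, where n is the number of documents
# retrieved (screened), N the collection size, and R the number of relevant
# documents. Smaller values are better.

def reliability(num_relevant_found, n, R, N):
    recall = num_relevant_found / R if R > 0 else 1.0
    loss_r = 1 - recall ** 2
    loss_e = (n / (R + 100) * 100 / N) ** 2
    return loss_r + loss_e

# Hypothetical topic: 5000 retrieved documents, 40 relevant, screening stops
# after 1200 documents which contain 36 of the relevant ones.
print(reliability(num_relevant_found=36, n=1200, R=40, N=5000))
```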

When used for predicting when to stop screening, \(\kappa \) is tuned on training queries using a grid search to determine the best value. The parameter space searched in these experiments is \(\{0.05, 0.075, 0.1, 0.2, 0.3, 0.4, 0.5, 0.75, 0.9, 0.95\}\). Note that \(\kappa \) can be set at a clause-level, therefore it is possible for it to be adaptive based on the clause. We leave learning an adaptive \(\kappa \) for future work, and here we fix \(\kappa \) to a set value across all clauses.
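The tuning procedure can be sketched as a simple grid search over training topics that combines the stopping rule of Sect. 3.2 with the Reliability loss defined above; the topic data in the example is hypothetical, whereas in our experiments tuning is performed on the CLEF training portions.

```python
# A sketch of tuning kappa by grid search on training topics: for each candidate
# value, apply the gain-based stopping rule to every topic, compute the mean
# Reliability loss, and keep the kappa with the lowest mean loss.

KAPPA_GRID = [0.05, 0.075, 0.1, 0.2, 0.3, 0.4, 0.5, 0.75, 0.9, 0.95]

def stopping_rank(scores, kappa):
    budget, gain = kappa * sum(scores), 0.0
    for rank, score in enumerate(scores, start=1):
        gain += score
        if gain >= budget:
            return rank
    return len(scores)

def reliability(found, n, R, N):
    return (1 - (found / R) ** 2) + (n / (R + 100) * 100 / N) ** 2

def tune_kappa(topics):
    """topics: list of (ranked fused scores, relevance labels in ranked order, N)."""
    best = None
    for kappa in KAPPA_GRID:
        losses = []
        for scores, labels, N in topics:
            n = stopping_rank(scores, kappa)
            losses.append(reliability(sum(labels[:n]), n, sum(labels), N))
        mean_loss = sum(losses) / len(losses)
        if best is None or mean_loss < best[1]:
            best = (kappa, mean_loss)
    return best

# Two hypothetical training topics (scores, per-rank relevance labels, collection size).
topics = [([0.9, 0.7, 0.3, 0.1, 0.05], [1, 0, 1, 0, 0], 5000),
          ([0.8, 0.6, 0.5, 0.2], [1, 1, 0, 0], 5000)]
print(tune_kappa(topics))  # (best kappa, mean Reliability loss)
```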

5 Results

5.1 Screening Prioritisation

Tables 1 and 2 present the results of the screening prioritisation task for the 2017 and 2018 CLEF TAR collections. Comparing CLM to CLF (without query expansion), CLF is statistically significantly better than CLM on all of the evaluation measures presented in both the 2017 and 2018 tables (using a two-tailed t-test with \(p<0.05\)). Comparing the CLM and CLF methods to the state-of-the-art PubMed ranking, CLM is often statistically significantly worse than the PubMed ranker, whereas some CLF-based methods are able to perform statistically significantly better than the PubMed method. Next, the best performing CLF method (CLF+weighting+PubMed+qe) and the best performing CLEF participant method for each year are compared. For 2017 topics, the best performing methods are Sheffield-run-2 (documents ranked with a TF-IDF vector space model using terms from the topic title and terms extracted from the Boolean query) and Sheffield-run-4 (the same as Sheffield-run-2 except a PubMed stopword list is used) [3]. The CLF method does not perform statistically significantly better than these two methods on any evaluation measure considered (however, in all measures apart from MAP and last relevant, CLF is better). For 2018 topics, the best performing method is Sheffield-general-terms (the same as Sheffield-run-4 from 2017, however terms specifically designed to identify systematic reviews are added to the query) [2]. Comparing this method to CLF, the CLF method performs statistically significantly better in RR (and has gains in all evaluation measures apart from last relevant). Overall, CLF obtains the highest MAP for 2018 topics, and the highest nDCG, RR, and Rprec for both 2017 and 2018 topics, performing statistically significantly better than the state-of-the-art PubMed ranker.

Table 1. Results for CLEF TAR 2017. The first row of results is obtained by issuing queries to PubMed, the next set of rows contains the results of the various configurations of CLF, and the last set of rows contains the relevant runs from participants for that year. A two-tailed t-test between the PubMed ranker and the other methods with \(p<0.05\) is indicated by \(*\) and \(p<0.01\) by \(\dag \).
Table 2. Results for CLEF TAR 2018. Presentation of results and statistical significance is indicated in the same way as in Table 1.
Table 3. Results of CLF for stopping prediction for CLEF TAR 2017. The first row contains the results from the original queries; the second row contains the results of CLF with \(\kappa =0.4\). A two-tailed t-test between the original results and the other methods with \(p<0.05\) is indicated by \(*\) and \(p<0.01\) by \(\dag \).
Table 4. Results of CLF for stopping prediction for CLEF TAR 2018. The first row contains the results from the original queries; the second row contains the results of CLF with \(\kappa =0.4\). Significance is indicated in the same way as in Table 3.

5.2 Stopping Prediction

Tables 3 and 4 present the results of the stopping prediction task using the cut-off parameter \(\kappa \). Through parameter tuning on training data, a \(\kappa \) value of 0.4 was found to provide the lowest loss in Reliability, and was therefore chosen for the test queries for both 2017 and 2018. Results of the parameter tuning process on the training portion of the CLEF 2017 and 2018 topics are presented in Fig. 4. The CLF method used in this task was CLF+weighting+PubMed+qe, as it obtained the highest performance on the screening prioritisation task.

Examining first Table 3, CLF obtains the highest precision, \(F_1\), \(F_{0.5}\), \(F_3\), and the lowest loss in Reliability. CLF also obtains the second-lowest total cost, and maintains both a low total cost and a low Reliability for this set of queries. Losses in recall are within a tolerable threshold [11]. Table 4 reveals similar results to the 2017 topics: significant improvements over the original queries can be observed in terms of precision, \(F_1\), \(F_{0.5}\), \(F_3\), and total cost, with a tolerable reduction in recall. However, the Reliability on this set of queries is higher (thus worse). Given that the total cost is low, this indicates that for these topics the \(loss_r\) component of Reliability does not decrease at the same rate as \(loss_e\) increases. No participants contributed a comparable run to the 2018 TAR task, therefore no comparisons to other systems can be made for this collection.

While there is a drop in recall, there are real monetary savings associated with the increase in precision. Across the 2017 and 2018 topics, the CLF method provides savings between approximately USD$5000 and USD$12,000, according to estimates reported by McGowan et al. [35] when considering double screening.

Fig. 4. Tuning the \(\kappa \) parameter on the training portions of the 2017 (left) and 2018 (right) CLEF TAR topics. The lowest loss in both plots is at \(\kappa =0.4\).

6 Conclusion and Future Work

In this paper, a novel approach to ranking documents for systematic review literature search using rank fusion applied to coordination level matching was presented. The method, dubbed Coordination Level Fusion (CLF), outperformed the current state of the art for two different tasks. For the screening prioritisation task, CLF significantly outperformed the existing PubMed ranking system, as well as participants that submitted comparable runs to the CLEF TAR tasks. The results of the screening prioritisation task demonstrate the applicability of CLF to systematic review literature search when prioritisation is considered, and suggest it may also be applied to obtain an effective early ranking in settings that consider active learning. For the stopping prediction task, CLF could significantly reduce the cost of screening with tolerable losses in recall. The results of the stopping prediction task demonstrate the applicability of CLF to specific systematic reviews where total recall is not essential, such as in rapid reviews [34].

There are many aspects of CLF that require further investigation. First, we propose to study the effectiveness of CLF within an active learning setting. In this context, CLF can be used as the first ranker, before relevance feedback is collected; feedback could then be woven into CLF by devising and integrating weighting schemes that account for it. We also plan to investigate the use of CLF as a method for query performance prediction (e.g., as a post-retrieval predictor using reference lists [50], or as a candidate selection function in query transformation chain frameworks [48]). In terms of extending CLF, the weighting schemes themselves can be weighted (i.e., one weighting scheme may have more importance than others), e.g., using the linear combination fusion method [53], which assigns weights to each ranker being fused. The problem then becomes learning the weight to assign to each weighting scheme (ranker) used for rank fusion. Rather than using fixed fusion methods like CombMNZ, a different combination of weights could then be used for each Boolean clause considered.