Reply

Reply to Comment on 'Quantifying the consensus on anthropogenic global warming in the scientific literature'

John Cook and Kevin Cowtan

Published 19 March 2015 © 2015 IOP Publishing Ltd
Citation: John Cook and Kevin Cowtan 2015 Environ. Res. Lett. 10 039002, DOI 10.1088/1748-9326/10/3/039002

Content from this work may be used under the terms of the Creative Commons Attribution 3.0 licence. Any further distribution of this work must maintain attribution to the author(s) and the title of the work, journal citation and DOI.

Inter-rater reliability statistics may be trivially calculated from the released data available at www.skepticalscience.com/docs/tcp_allratings.txt. We have placed R-code for this calculation on the project website at www.skepticalscience.com/docs/interrating.r. The unweighted Cohen kappa is 0.35 using the seven fine-grained categories used in the initial rating process. However, the consensus statistics are based on only three categories: 'endorse', 'reject' or 'no position'; for these categories, kappa rises to 0.46. Subdividing rating categories is known to depress kappa values. The more appropriate Fleiss kappa gives the same results. In our view, the categories should be considered as nominal (Cook et al 2014). However, if they are treated as ordered, the kappa value for the fine-grained categories approaches the value for the consensus categories. Kappa values are also depressed when category counts are very uneven (Sim and Wright 2005). Our data are an extreme case, with two orders of magnitude difference between the most and least populous categories.
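
For readers who prefer not to download the released script, the following is a minimal sketch of the calculation in R. It assumes a data frame `ratings` with columns `abstract_id` and `rating` (1-7); these column names are illustrative and do not necessarily match the released file, which remains the authoritative reference.

# Minimal sketch: unweighted Cohen kappa for paired abstract ratings.
# `ratings` with columns abstract_id and rating (1-7) is assumed;
# the released script on the project website is the authoritative version.

# Collapse the seven fine-grained categories into the three consensus
# categories: 1-3 = endorse, 4 = no position, 5-7 = reject.
collapse <- function(r) cut(r, breaks = c(0, 3, 4, 7),
                            labels = c("endorse", "no position", "reject"))

# Unweighted Cohen kappa from two parallel rating vectors.
cohen_kappa <- function(r1, r2) {
  lv  <- union(levels(factor(r1)), levels(factor(r2)))
  tab <- table(factor(r1, levels = lv), factor(r2, levels = lv))
  p   <- tab / sum(tab)
  po  <- sum(diag(p))                   # observed agreement
  pe  <- sum(rowSums(p) * colSums(p))   # agreement expected by chance
  (po - pe) / (1 - pe)
}

# Pair the first two ratings of each abstract.
pairs <- t(sapply(split(ratings$rating, ratings$abstract_id),
                  function(r) r[1:2]))

kappa_fine      <- cohen_kappa(pairs[, 1], pairs[, 2])
kappa_consensus <- cohen_kappa(collapse(pairs[, 1]), collapse(pairs[, 2]))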

The interpretation of these statistics is problematic. Landis and Koch (1977) propose an ad hoc scale on which the agreement on the fine-grained categories would be called 'fair' and that on the consensus categories 'moderate'. However, there is no theoretical basis for these labels. Dean cites Kottner et al (2011), who discuss kappa values for a rather different application (medical diagnosis), in which the accuracy of individual ratings has consequences for patient health. If, however, the physician were simply conducting a survey of the prevalence of a condition, agreement rates would be less critical as long as the ratings were not biased. Similarly, in our case the agreement rate affects the uncertainty in the result, but only a bias would lead to an incorrect value for the consensus.

Because the consensus ratio is determined by two of the three categories, differences in the allocation of papers to the 'no position' category have minimal impact on the conclusions. The proportions of ratings in the three categories (endorse, no position, reject) for the 12 raters who each contributed at least 500 ratings were decomposed, by a change of variables, into a consensus-invariant and a consensus-altering term. The inter-rater variability in the consensus-invariant variable was more than twenty times larger than in the consensus-altering variable. Thus the primary source of inter-rater variability is differing interpretations of the no-position criteria; at the same time, the raters applied their individual criteria consistently to both the endorse and reject categories. This suggests that inter-rater variability could be substantially reduced by clarification and training on the no-position criteria, but that doing so would not affect the final consensus percentages.
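
As an illustration, the decomposition can be sketched in R as follows. The change of variables shown here (the per-rater share of abstracts taking a position, and the endorse fraction within that share) is one plausible parameterisation and may differ in detail from the one used in the analysis; the data frame `ratings` with columns `rater_id` and `category` is likewise illustrative.

# Minimal sketch of the decomposition, assuming `ratings` has columns
# rater_id and category ("endorse", "no position", "reject").
prolific <- names(which(table(ratings$rater_id) >= 500))
sub      <- ratings[ratings$rater_id %in% prolific, ]

per_rater <- do.call(rbind, lapply(split(sub, sub$rater_id, drop = TRUE),
  function(d) {
    e <- mean(d$category == "endorse")
    r <- mean(d$category == "reject")
    c(share     = e + r,        # consensus-invariant: fraction taking a position
      consensus = e / (e + r))  # consensus-altering: endorse share of positions
  }))

# Compare inter-rater variability of the two terms
# (here the coefficient of variation; the paper's exact measure may differ).
apply(per_rater, 2, function(x) sd(x) / mean(x))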

The final consensus percentages calculated for the 12 most prolific raters give an estimate of the uncertainty in the results. The extreme values were 95.7% and 98.2%, with an interquartile range of 96.2% to 97.6%.
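
Continuing the sketch above, the spread of the per-rater consensus values can be summarised directly from the `per_rater` matrix constructed in the previous block.

# Spread of the per-rater consensus percentages from the sketch above,
# as a rough indication of the uncertainty in the headline figure.
quantile(100 * per_rater[, "consensus"], probs = c(0, 0.25, 0.75, 1))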

Potential bias among the raters was tested in a second way, using the author self-ratings (bearing in mind that the authors had access to the whole paper rather than only the abstract). The author ratings were assumed to be correct and were then used to calculate a correction to the abstract ratings. This correction was then applied across all the abstracts to estimate the consensus score that would have been obtained had the authors rated all of the papers. The results are virtually unchanged (97.2% versus 97.1%). Thus this second method of bias evaluation also suggests that bias was not a significant problem. Nonetheless, we encourage third parties to independently examine the abstracts as a further audit of our results. Tools have been made available to facilitate this task at www.skepticalscience.com/tcp.php.
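
A minimal sketch of one way such a correction could be implemented is given below. It assumes data frames `abstracts` (columns `paper_id`, `abs_cat`) and `authors` (columns `paper_id`, `auth_cat`) holding the three-category ratings; the names are illustrative, and the published analysis may differ in detail, for instance in how the correction is estimated and applied.

# Sketch of a self-rating correction, assuming illustrative data frames
# `abstracts` (paper_id, abs_cat) and `authors` (paper_id, auth_cat).
both <- merge(abstracts, authors, by = "paper_id")

# Transition probabilities from abstract category to author self-rating,
# estimated on the papers that received both kinds of rating.
trans <- prop.table(table(both$abs_cat, both$auth_cat), margin = 1)

# Category distribution of all abstract ratings, then the distribution
# implied if every paper had instead been rated by its authors.
abs_dist  <- prop.table(table(abstracts$abs_cat))
corrected <- colSums(as.numeric(abs_dist) * trans[names(abs_dist), ])

# Consensus among papers taking a position, before and after correction.
consensus <- function(p) unname(p["endorse"] / (p["endorse"] + p["reject"]))
c(raw = consensus(abs_dist), corrected = consensus(corrected))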
