INTRODUCTION

In pharmaceutical industry, it is important to evaluate drug safety during drug development and after approval.1 Adverse events caused by side effects of new drugs are often rare; thus, it is hard to accurately estimate their rates in a single study. Meta-analysis can integrate quantitative information from multiple clinical trials to provide a more reliable and powerful assessment, which can be used to underpin guidelines, aid patients’ decision-making, and validate healthcare products.2,3 However, there are potential methodological challenges for meta-analyses of adverse events due to sparse outcome data.4

Previous studies have demonstrated that conventional meta-analysis methods are not suitable when applied to the rare events studies, because they are mostly based on large-sample theories to make inferences.5,6,7,8 Another concern of the conventional methods is that the log odds ratio and log risk ratio are either infinite or undefined if no event occurred in one or both treatment arms.9 In most available software packages, a continuity correction (usually 0.5) is often added to zero cells in single-zero-event studies by default while double-zero-events studies (DZS) are excluded from the analysis.9,10 However, when a meta-analysis contains a relatively high proportion of DZS, excluding these studies may lead to fairly different conclusions and waste useful resources.11,12

To overcome the problems of conventional methods caused by rare events, some methods including DZS without continuity correction have been proposed. Focusing on the risk difference, Tian et al.5 suggested an exact method of combining confidence intervals (CIs). Liu et al.7 developed an exact method by combing the so-called p value functions using several different effect sizes. Yang et al.8 pointed out that these two exact methods can be well combined under the framework of combining confidence distributions, and an R package “gmeta” has been developed for performing these methods. Besides the well-discussed frequentist methods, a variety of “exact” Bayesian methods based on the fixed and random effects models for the odds ratio or relative risk have also been proposed.6,13,14

The three most commonly used effect sizes include the odds ratio, relative risk, and risk difference. Another effect measure, arcsine difference, can handle zero events naturally but is seldom used possibly because of its indirect interpretation.15 Because the odds ratio and relative risk have similar values when the event is rare and the odds ratio has some advantages over the relative risk (e.g., consistency to events labeling and valid inferences under different sampling methods), we focus on statistical methods which use the odds ratios as the primary measure in this article.16,17

Different from previous studies which compared the performance of various statistical methods in meta-analysis with rare events through simulation studies,9,10,11 this article evaluates the performance empirically using a large collection of published meta-analyses in the Cochrane Database of Systematic Reviews (CDSR). Specifically, Cohen’s κ coefficient is used to estimate chance-adjusted pairwise agreement among these statistical methods.18 In addition, we evaluate the impact of DZS on the agreement. A meta-analysis of 13 comparative trials of haloperidol versus chlorpromazine for schizophrenia is used as a worked example of comparing the selected methods.19

METHODS

Data Source

To empirically evaluate the performance of the various meta-analysis methods for rare events, we downloaded all systematic reviews published on all issues in the CDSR available by January 2016 using the Cochrane scraper software,20 and we selected meta-analyses with binary outcomes and at least 5 studies from all reviews. Because the meta-analyses from the same systematic review likely contained some common studies, we selected the meta-analysis with the largest number of studies from each systematic review to avoid possible association between meta-analyses. The event in a meta-analysis was considered rare if the overall crude prevalence (the ratio of the cumulative number of events over the total sample size across studies) was less than 5%. Note that rare events might be defined differently (e.g., less than 1%) by different researchers21; we used the relatively large cutoff 5% because it could yield a relatively large number of meta-analyses for our empirical study. Finally, we excluded meta-analyses without any DZS or in which all studies were DZS.

Continuity Corrections

The log odds ratio is infinite or undefined if the study contains no events in one or both treatment arms. To avoid this problem, the commonly used methods add an artificial continuity correction to the studies with zero cells. Sweeting et al.9 have comprehensively compared the performance of a variety of continuity corrections, and three corrections, including the constant 0.5 correction, the treatment arm continuity correction (TACC), and the empirical correction (EMP), have been recommended. Table B.2 in the Supplementary Material presents detailed illustrations for TACC and EMP corrections.

Conventional Methods

The inverse-variance method, Mantel–Haenszel method, and Peto method were the three most frequently used methods for combining effect sizes in a fixed effect meta-analysis.22,23 For the inverse-variance and Mantel–Haenszel methods, we applied the above three continuity corrections to studies with zero event in one treatment arm (single-zero-event studies, SZS) or both (double-zero-event studies, DZS). Table B.3 in the Supplementary Material summarizes software packages for handling zero events in a meta-analysis. The Peto method, in contrast, can combine single-zero-event studies without continuity correction.

The Mantel–Haenszel method is the default fixed effects method in RevMan 5 (which is the software specially designed to prepare and maintain Cochrane reviews)24 and the “metabin” function in the R package “meta”.25 Also, it is an option of function “metan” in STATA.26 The Peto method can provide the least bias and most powerful results with rare events and balanced studies. However, it may not perform well when applied to studies with unbalanced designs or when the log odds ratio significantly differs from zero.27 Besides conducting the fixed effect meta-analyses via the above three methods, the DerSimonian–Laird method was used to implement the random effects meta-analyses.28

Exact Method

Liu et al.7 proposed an exact method which can use the information from all studies, including those having no event in a single or both arms, without any continuity correction. To implement the exact method, the so-called p value functions for the odds ratio are first obtained for individual studies from the mid-p adaption of Fisher’s exact test. Then, the confidence distribution combination method is used to integrate the individual p value functions to obtain the combined p value function. Finally, the point estimate, CI, and p value are obtained through the combined p value function. We used the R package “gmeta” to perform the exact method for the Cochrane meta-analyses.29

Tian et al.5 proposed another exact method, but they focused on the risk difference while we primarily considered the odds ratio; therefore, their method was not used in this article.

Bayesian Method

In addition to the exact method by Liu et al.7 under the frequentist framework, the Bayesian approach can alternatively handle studies with no events in a meta-analysis without any artificial continuity correction. We implemented the Bayesian method for each Cochrane meta-analysis under both the fixed and random effects settings via WinBUGS with a burn-in period of 1000 iterations followed by 100,000 further iterations for posterior inference.30 The posterior median and 95% equal-tail posterior credible interval (CrI) were summarized to estimate the odds ratio. Appendix A in the Supplementary Material provides more details of the methods used in our empirical study.

Agreement Among Different Methods

We used Cohen’s κ coefficient to assess the agreement between the results produced by different methods for rare events. This coefficient is a popular measure of the agreement between two categorical variables.18 By convention, κ value smaller than 0.2 suggests slight agreement; a κ value between 0.2 and 0.4 suggests fair agreement; 0.4 to 0.6 suggests moderate agreement, between 0.61 and 0.80 suggests substantial agreement, and from 0.81 to 1.00 suggests almost perfect agreement.31 Specifically, the p value of the overall effect estimate was obtained for each Cochrane meta-analysis using each method discussed above, and the significance level was set to 0.05. Based on the dichotomous results (significant or not) from the p values, Cohen’s κ coefficient was calculated for each pair of the methods, and the impact of DZS on the agreement was also examined.

RESULTS

Performance of Meta-analysis Methods for Rare Events in the CDSR

We collected a total of 5677 systematic reviews from the CDSR, which contained more than 180,000 meta-analyses. After selecting the largest meta-analyses from the reviews with binary outcomes and at least 5 studies, 2393 meta-analyses were retained, among which 583 had rare events. After excluding 215 meta-analyses without any DZS or any event, our final analysis data set consisted of 368 meta-analyses. The flow chart of our data collection is presented in Figure 1.

Figure 1
figure 1

Flow chart of the selection criteria for the eligible meta-analyses from the CDSR. AE, adverse events; CDSR, Cochrane Database of Systematic Reviews; CONT, continuous; DICH, dichotomous; MA, meta-analysis.

Figures 2 and 3 show the scatter plots for the agreement of log odds ratios and its associated CI/CrI length between different methods including DZS, respectively. Both figures illustrate that the best agreement was achieved between the inverse-variance method and Mantel–Haenszel method using the continuity correction of 0.5 and TACC. Their agreement with the exact method was intermediate and even weak with the methods using the EMP correction for the log odds ratio. For example, when the log odds ratio was far away from zero, the exact method and the methods using the EMP correction tended to produce larger estimates than other methods. Considerable disagreement patterns were seen between the associated CI/CrI lengths. For example, the CI/CrI length was generally wider using the exact method and the methods using the EMP correction, compared with the 0.5 correction and TACC. The agreement between the Bayesian and other methods was weak for both the log odds ratio and its CI/CrI length, which was due to some considerably large estimates for some meta-analyses with a high proportion of DZS and no events in one arm for all studies. For such meta-analyses, the exact method failed to produce a summary estimate. A similar agreement pattern was observed between the methods adding different continuity corrections only to studies with zero event in a single arm. Note that the agreements between the Peto method and the methods with different continuity corrections for the log odds ratio were moderately strong, particularly when the log odds ratio was not heavily departed from zero, and the Peto method is likely to result in narrower CI length comparing with methods with EMP (Figure C. 1 and Figure C. 2 in the Supplementary Material).

Figure 2
figure 2

Scatter plot matrix of the agreement between the estimated overall log odds ratios produced by the fixed effect methods that include double-zero-event studies. I5, inverse-variance method with 0.5 continuity correction; ITACC, inverse-variance method with treatment arm continuity correction; IEMP, inverse-variance method with empirical continuity correction; MH5, Mantel–Haenszel method with 0.5 continuity correction; MHTACC, Mantel–Haenszel method with treatment arm continuity correction; MHEMP, Mantel–Haenszel method with empirical continuity correction.

Figure 3
figure 3

Scatter plot matrix of the agreement between the lengths of 95% confidence intervals produced by the fixed effect methods that include double-zero-event studies. I5, inverse-variance method with 0.5 continuity correction; ITACC, inverse-variance method with treatment arm continuity correction; IEMP, inverse-variance method with empirical continuity correction; MH5, Mantel–Haenszel method with 0.5 continuity correction; MHTACC, Mantel–Haenszel method with treatment arm continuity correction; MHEMP, Mantel–Haenszel method with empirical continuity correction.

Table 1 shows Cohen’s κ coefficients for assessing the agreement among the different methods. The top sub-table shows the agreement among the methods excluding DZS while the bottom one is for the agreement including DZS. The lower and upper shaded triangular results are based on the outcomes from the fixed and random effects models, respectively. We focused on interpreting the results from the fixed effect models; the conclusions about the results from the random effects models can be similarly drawn.

Table 1 Cohen’s κ Coefficients of the Agreement Among the Various Methods Based on Cochrane Meta-analyses with Rare Events

Generally, the discrepancy between the methods with different continuity corrections was more significant when including DZS for analysis compared with excluding DZS. In particular, the agreement between the methods using the 0.5 correction (or TACC) and that with the EMP correction was considerably decreased when including DZS in the analyses. For example, κ was 0.97 for the agreement between the Mantel–Haenszel methods with the TACC and EMP corrections, and it decreased to 0.78 after including DZS. This implies that the inclusion and exclusion of DZS might lead to inconsistent conclusions from these methods for some meta-analyses. However, the decrease of the agreement between the methods with the continuity correction of 0.5 and TACC was ignorable. For example, κ was 1 for the Mantel–Haenszel method using the continuity correction of 0.5 and TACC when excluding DZS, and it decreased to 0.95 when including DZS. The slight impact of DZS was probably due to the roughly equal allocation sample size between the two treatment arms.

In addition, the exact method, Bayesian method, and Peto method achieved better agreement with the Mantel–Haenszel method than the inverse-variance method under the fixed effect framework. The Bayesian method agreed better with the methods with the EMP correction than the TACC or 0.5 correction, while the exact method had better agreement with the methods using the TACC or 0.5 correction than the EMP correction. Note that there was no obvious difference between the methods with different continuity corrections and the Peto method. For example, the agreement was fairly strong between the Peto method and the Mantel–Haenszel method with different corrections (κ > 0.9).

Table 2 shows the impact of DZS on the agreement among the selected fixed effect methods including DZS. The upper shaded and lower triangular results were based on the outcomes from meta-analyses with the proportion of DZS less or equal to 50% and greater than 50%, respectively. The coefficients in the upper triangular results were generally larger than those in the lower triangular, suggesting that the selected methods agreed with each other better when the proportions of DZS were less than 50%. For example, κ = 0.81 between the exact and Mantel–Haenszel methods with the TACC decreased to zero when the proportion of DZS was higher than 50%. Note that the proportion of DZS had ignorable impact on the agreement between some of the selected methods. For example, the agreement between the inverse-variance method and the Mantel–Haenszel method with the TACC or the EMP correction was almost perfect, even when the proportion of DZS was higher than 50%. Another example is that κ was 0.78 between the Bayesian and Mantel–Haenszel methods with the EMP correction, and it decreased slightly to 0.71.

Table 2 Cohen’s κ Coefficients for the Agreement Among the Various Fixed Effect Methods Including DZS

Table 3 shows Cohen’s κ coefficients for the agreement between methods excluding and including DZS. A better agreement was achieved when the proportion of DZS was less than 50%. For example, the almost perfect agreement between the Peto and Bayesian methods decreased considerably (κ changed from 1 to 0.77) when the proportion of DZS was higher than 50%. Note that the agreement between the Peto method and the exact method was relatively robust to the proportion of DZS; only a slight decrease was observed (from 0.89 to 0.80) when the proportion of DZS was higher than 50%.

Table 3 Cohen’s κ Coefficients for the Agreement Among the Various Methods Excluding and Including DZS

The impact of the proportion of DZS on the log odds ratio and its associated CI/CrI length was also investigated; the results are in Figures C.3 and C.4 in the Supplementary Material. Generally, as the proportion of DZS increased, the variability of the estimates by the selected methods also increased, which was consistent with the impact of DZS on the agreement between the methods.

Case Study

A meta-analysis of 13 comparative trials compared two drugs, haloperidol and chlorpromazine, which are commonly used to treat schizophrenia.19 Patients leaving studies because of adverse events were recorded as outcomes. In this meta-analysis, 8 studies (62%) contained no events in any treatment arm, and events only occurred in one treatment arm in the other 5 (38%) studies. The sample sizes from both treatment arms were comparable for most of the studies, and only one study contained almost three times sample sizes in the haloperidol group than that in the chlorpromazine group. Table B.4 in the Supplementary Material presents the data.

Table 4 shows the summary odds ratio, 95% CI/CrI, and p values from the various methods. p values less than 0.05 are shown in bold. Because it had been suggested that the Peto method was unbiased when the group size was balanced, the estimates by this method were used as the benchmark for the other methods. Its point estimate was 0.18 with 95% CI (0.04, 0.75), which suggested a significant difference between the two treatment arms. It is of interest to note the impact of inclusion of DZS on the point and confidence interval estimates of all these fixed effects methods. Overall, the methods with the EMP correction tended to pull the results away from the null, producing narrower confidence intervals and more significant p values, while the methods with the 0.5 or TACC tended to pull the results toward the null. For example, without DZS, the point estimate and 95% CI from the inverse-variance method with the EMP correction were 0.17 (0.02, 1.23); with DZS, they were 0.18 (0.04, 0.71). For the Mantel–Haenszel method with a continuity correction of 0.5, the estimates were 0.26 (0.07, 0.99) and 0.48 (0.19, 1.21) when excluding and including DZS, respectively. The exact method yielded in a non-significant conservative result with the widest CI due to the inclusion of DZS, while the Bayesian method resulted in the smallest point estimate 0.06 with a considerable narrow CI (0.00, 0.53). In summary, the estimate was inconsistent across the various methods because of differences in dealing with DZS.

Table 4 Results Produced by Different Fixed Effect Methods in the Schizophrenia Comparative Study

DISCUSSION

Main Findings and Recommendations

Using 368 meta-analyses with DZS selected from the CDSR, this study illustrated that with a large proportion of DZS (i.e., when the adverse event was typically rare21), the agreement among the meta-analysis methods was low, which suggested that researchers should not rely on only a particular statistical method to make inference on the impact of treatments on adverse events.

No method is suitable for all sparse data scenarios. The Peto method which requires excluding DZS may not be a good choice, because it cannot fully utilize the information from all studies. On the other hand, to account for all available data including DZS,32 the exact and Bayesian methods may be preferable methods. However, if there is no event in one treatment arm across all studies, the exact method cannot be implemented, and the Bayesian method usually yields unreasonably large estimates.

The methods with different continuity corrections can be conveniently implemented through some existing statistical software. For example, the “metabin” function in the R package “meta” can include DZS or SZS and the metan function in STATA can include SZS by adding the 0.5 or TACC to all studies; however, the impact of constant 0.5 correction is hard to estimate, especially when the ratios of group size are unbalanced across studies and the TACC is prone to pull the estimates toward the null. Alternatively, the EMP correction borrows information about the odds ratio from other non-zero event studies under the frequentist framework. It has been suggested as a counterpart of the Bayesian method and may help researchers make reliable inferences. However, as publication bias often exists in meta-analyses,33,34,35 applying the EMP correction may introduce bias that leads the results away from the null when there is no treatment effect on adverse events.

In summary, when adverse events are particularly rare (i.e., a large proportion of DZS), to give a more reliable conclusion, we recommend to apply the exact or Bayesian method for utilizing the information from all the studies. In the situation of no event in one arm for all studies, if the sample sizes between two groups are roughly balanced and the odds ratio is not heavily deviated from 1, the Peto method is recommended as a reliable reference approach because it generally does not incur large bias. We recommend not using continuity corrections when the proportion of DZS is high, because even a small continuity correction may lead to distorted inferences. However, the Mantel–Haenszel method with the EMP correction may be considered if external evidence indicates a substantial treatment effect. Generally, we should avoid applying continuity correction with 0.5 or TACC because it may result in considerable bias when zero events are common in one or both arms, especially when sample sizes vary between studies. Therefore, it might be informative to apply several methods with low agreement as a sensitivity analysis, and the results should be interpreted cautiously.

Limitations

In this study, we did not comprehensively discuss the difference between the fixed and random effects models, because we were interested in the impact of DZS on the estimated effect sizes rather than comparing the effects of the fixed and random effects models. Also, we evaluated only the agreement among some commonly used methods for synthesizing the odds ratios; the agreement of other methods for other types of effect sizes may be an interesting topic for future research. Meanwhile, the agreement was based on the meta-analyses from the CDSR which were related primarily to healthcare, so our conclusions may not be representative for all published meta-analyses with rare adverse events. It has been suggested that well-conducted observational studies can be combined with randomized clinical trials appropriately without incurring bias, and the benefits of including observational studies in systematic reviews of adverse events may overcome the disadvantages of wasting their information.36,37 Therefore, evaluating the agreement among a more comprehensive pool of statistical methods in the context of rare events with a larger database (including observational studies) may help healthcare researchers estimate treatment effects on adverse events in a more reliable way.