Key Points

We extended our analysis of a previously published scoping review to compare the timing and characteristics of signals of designated medical events with those of all other events.

Regardless of the type of event, signals supported by well-documented reports tended to be communicated earlier than signals supported by less complete reports.

We found that signals of designated medical events were supported by significantly fewer reports and had significantly higher completeness scores. However, the effect sizes were small, suggesting that the list of designated medical events may not be having its intended effect.

1 Introduction

According to the International Council for Harmonisation of Technical Requirements for Pharmaceuticals for Human Use (ICH), a serious adverse drug reaction (ADR) is “any untoward medical occurrence [sc. attributed to a medication] that at any dose: results in death, is life-threatening, requires inpatient hospitalization or results in prolongation of existing hospitalization, results in persistent or significant disability/incapacity, is a congenital anomaly or birth defect, [or] is a medically important event or reaction” [1]. In many countries, any serious ADR must be reported within 15 days to the relevant regulatory agency. Two lists of medical events, currently in use internationally, include items that should be regarded as serious and requiring intervention: important medical events (IMEs) [2] and designated medical events (DMEs) [3]. The inclusion criteria for the list of IMEs are based on the ICH definition of a serious ADR. As of January 2023, the IME list constitutes a supplement of 7525 Medical Dictionary for Regulatory Activities preferred terms (MedDRA PTs) that were deemed useful for analysing aggregated data and for classifying and assessing cases in routine pharmacovigilance activities. A total of 62 of the PTs therein make up the list of DMEs, that is, a collection of PTs regarded as ‘inherently serious’ and considered to be ‘often medicine-related’. Crucially, the purpose of the list of DMEs is to prioritize adverse events in signal detection, and it is described by the EMA as a ‘safety net that ensures signals are not missed’ [3].

Systematic reviews of withdrawals of marketing authorisations because of fatal ADRs have shown that the interval between the first report of a death and withdrawal of marketing authorisation in any country did not substantially change over the years between 1950 and 2013. The authors suggested that the delays may have been explained in part by the need for subsequent studies after the first indications of drug-attributed deaths [4].

In a previously published scoping review of signals of ADRs and signals of disproportionate reporting (SDRs), we identified over 10,000 signals/SDRs. The median time interval between the first report in VigiBase, the WHO’s global database of reports of suspected ADRs, and the year in which a signal was communicated was 9 years [5].

We are unaware of prior research into the characteristics and timing of signals of DMEs compared with signals of other ADRs. Having not explored these aspects in our previous study, we therefore sought to fill this knowledge gap by characterizing types of evidence and timeliness of communication (i.e. the written transmission or exchange of information pertaining to signals, such as minutes of committee meetings), comparing and contrasting these two types of signals.

2 Methods

2.1 Data Sources

We used the dataset from our published scoping review of the evidence underpinning signals [5], which included studies of signals/SDRs communicated by stakeholders in pharmacovigilance between 1986 and 2020. Each study in the dataset had had a level of evidence attributed according to the Oxford Centre for Evidence-Based Medicine (OCEBM) classification tool; study designs were thus ranked from 1 (available evidence of the highest quality for decision-making) to 4 (lowest quality), following the row “What are the rare harms?”. We retained the postulated subtypes of evidence from the scoping review to ensure granularity. The highest level of evidence applied when multiple studies supported a signal (e.g. if a signal was supported by a meta-analysis of randomised controlled trials and by reports of ADRs, the signal was categorised as OCEBM 1). To avoid undue biases in analysing possible delays in communicating signals/SDRs and other variables (see Sect. 2.2), we excluded signals/SDRs detected from studies whose aims were to develop, validate, or evaluate novel methods for signal detection, as these studies do not urge prompt regulatory or verificatory action. For the full list of excluded studies, see Supplementary Materials 1. We mapped the signals/SDRs from the scoping review to the MedDRA dictionary and anatomical therapeutic classification (ATC); for details see Supplementary Materials 1.

We obtained the list of DMEs from the European Medicines Agency (EMA), as published on 15 June 2020, to categorize DME signals that concerned at least one of the MedDRA (v. 23.1) PTs, whether the event of interest was composite or not. Conversely, we classified signals as ‘non-DME’ if all the events mapped to terms outside the EMA’s list.
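The classification rule described above can be sketched in Python as follows. This is an illustration only: the function name is hypothetical, and the set of terms holds placeholder labels rather than the 62 preferred terms of the EMA's list.

```python
def classify_signal(event_pts, dme_pts):
    """Label a signal 'DME' if at least one of its preferred terms (PTs)
    is on the DME list, and 'non-DME' if all of its PTs fall outside it."""
    if not event_pts:
        raise ValueError("a signal needs at least one mapped preferred term")
    return "DME" if any(pt in dme_pts for pt in event_pts) else "non-DME"


# Placeholder terms for illustration, NOT the real EMA list.
dme_list = {"PT_A", "PT_B"}

# A composite event counts as a DME signal as soon as one term is listed.
print(classify_signal(["PT_A", "PT_X"], dme_list))
print(classify_signal(["PT_X", "PT_Y"], dme_list))
```

A set is used for the list so that each membership check is a constant-time lookup, which matters little for 62 terms but keeps the rule easy to read.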

2.2 Variables

We extracted data from VigiBase using Structured Query Language together with Python. We queried a deduplicated [6] and frozen version of VigiBase (lock point: 30 August 2020), to extract a range of characteristics of reports containing the medicinal products (standardized to WHODrug, B3/C3 format, 01/09/2020) and adverse events involved in DME and non-DME signals, setting involvement of the medicinal products as either ‘suspected’ or ‘interacting’. Thus, any characteristics of DME signals we retrieved referred only to reports that included PTs belonging to the list of DMEs, and the same applied to non-DMEs. For each DME or non-DME signal/SDR, we obtained the following information:

(1) The first year in which a report was entered into the database (E2b fields: FirstDateDatabase or ReceiveDate, whichever was earlier) and the year in which three such reports became available in the database.

(2) The first year in which the reporting of a medicinal product-event combination became disproportionate, using the information component (IC [7]) for medicinal product-event combinations and the omega interaction measure (Ω [8], for drug–drug-event combinations), defined as a positive IC025 or Ω025. The omega interaction measure is based on a model with additive risk for the occurrence of adverse events during concomitant use of non-interacting drugs.

(3) The number of reports up to the year before that of communication.

(4) A breakdown by type of report, namely: spontaneous, from a study, from prescription event monitoring or special monitoring, unknown type, or ‘other’ (i.e. literature reports whose type, whether spontaneous or from a study, could not be ascertained at submission to a database or from follow-ups).

(5) The average vigiGrade completeness score of the case series [9]; a vigiGrade completeness score of 1 is assigned to a report with detailed information on time to onset, patient age and sex, indication for treatment, dosage, outcome, type of report, type of notifier and country of origin, plus some free text description; the score falls by a pre-specified multiplicative factor for each piece of information that is missing.

(6) The number of positive dechallenges and/or rechallenges at the medicinal product-event level.
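To make the disproportionality measure in item (2) more concrete, the sketch below computes the point estimate of the information component for a single medicinal product-event combination. It is not taken from the study: it assumes the commonly published shrinkage form, in which 0.5 is added to the observed and expected counts, and it assumes that the expected count is derived from the marginal totals of the database. The function and parameter names are hypothetical. The study thresholds IC025, the lower bound of the 95% credibility interval, which is not computed here, and the omega measure is not covered.

```python
import math


def information_component(n_observed, n_drug, n_event, n_total):
    """Point estimate of the IC: a shrunk log2 observed-to-expected ratio.

    n_observed: reports with both the product and the event
    n_drug:     reports with the product
    n_event:    reports with the event
    n_total:    all reports in the database
    """
    if n_total <= 0:
        raise ValueError("n_total must be positive")
    # Expected count if product and event were reported independently (assumption).
    expected = n_drug * n_event / n_total
    # Adding 0.5 to both counts shrinks the ratio towards 1 when data are sparse.
    return math.log2((n_observed + 0.5) / (expected + 0.5))


# Hypothetical counts, for illustration only.
ic = information_component(n_observed=12, n_drug=400, n_event=1500, n_total=1_000_000)
print(f"IC point estimate: {ic:.2f}")
```

A positive point estimate alone would not meet the definition used in the study, because the criterion applies to the lower credibility bound rather than to the estimate itself.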

We established the earliest known launch year for each medicinal product, cross-referencing the websites of 27 regulatory agencies and hand searching textbooks (for a complete list, see Supplementary Materials 1). From the dataset of the scoping review, we obtained the earliest year in which a stakeholder, irrespective of country of origin, first communicated a signal/SDR.

2.3 Data Analysis

We summarised data using proportions, medians and interquartile ranges (IQR) and plotted the data using boxplots or stacked bar charts, with years of communication or of launch as independent variables and the characteristics presented above as dependent variables. For boxplots involving completeness scores, we categorized as ‘well-documented’ values strictly above 0.80 (as per [9]); as ‘below average or average’ those from 0.00 to 0.46 inclusive, based on the average completeness in the whole of VigiBase as of August 2020, and as ‘above average’ values between the two categories.
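The three completeness categories can be expressed as a small function. The thresholds are those stated above; the function name and the validation of the input range are assumptions made for this sketch.

```python
def completeness_category(avg_score):
    """Map the average vigiGrade completeness of a case series to a category."""
    if not 0.0 <= avg_score <= 1.0:
        raise ValueError("completeness scores lie between 0 and 1")
    if avg_score > 0.80:          # strictly above 0.80
        return "well-documented"
    if avg_score <= 0.46:         # 0.00 to 0.46 inclusive
        return "below average or average"
    return "above average"        # everything between the two categories


for score in (0.30, 0.46, 0.52, 0.80, 0.85):
    print(score, completeness_category(score))
```

Note that a score of exactly 0.80 falls in the ‘above average’ category, because only values strictly above 0.80 count as ‘well-documented’.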

To calculate the delays in communication (time to communication, TTC), we subtracted the year of the first report, irrespective of country of origin, from the year of communication. When we had sufficient data, we computed the difference between (1) the year of communication and the year in which at least three reports had accumulated, (2) the year of communication and that in which a signal became disproportionate and (3) the year of communication and the launch year.
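The interval calculations can be sketched as follows, assuming that each report is represented only by the year in which it entered the database. The function name, the dictionary keys and the example years are hypothetical; intervals that cannot be computed for lack of data are returned as `None`.

```python
def communication_intervals(report_years, communication_year,
                            disproportionate_year=None, launch_year=None):
    """Intervals in years between communication and four reference years."""
    years = sorted(report_years)
    return {
        # Time to communication (TTC): from the first report.
        "ttc": communication_year - years[0] if years else None,
        # From the year in which the third report had accumulated.
        "from_third_report": (communication_year - years[2]
                              if len(years) >= 3 else None),
        # From the year in which reporting became disproportionate.
        "from_disproportionality": (communication_year - disproportionate_year
                                    if disproportionate_year is not None else None),
        # From the earliest known launch year.
        "from_launch": (communication_year - launch_year
                        if launch_year is not None else None),
    }


# Hypothetical signal communicated in 2010.
print(communication_intervals([2001, 1999, 2003, 2003], 2010,
                              disproportionate_year=2004, launch_year=1995))
```

Because only years are compared, all intervals are whole numbers of years, which matches the yearly resolution used in the study.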

Pilot analyses showed that the data were not normally distributed and were heteroskedastic, violating core assumptions of some statistical tests (e.g. the Wilcoxon rank-sum test). We therefore required a non-parametric test that made no distributional assumptions and chose the Brunner–Munzel test [10, 11]. This test determines whether there is stochastic equality between two groups, by comparing the entire empirical distribution functions of a variable in the two groups and accounting for their means, variances and other distributional properties. Essentially, it checks whether the probability that a randomly chosen observation from one group is higher (or lower) than a randomly chosen observation from the other group is equal to the probability of the reverse, across the whole range of values of the variable. The quantity it estimates is P(X < Y) + 0.5 × P(X = Y), which equals 0.50 under stochastic equality. We ran two-tailed Brunner–Munzel tests across time periods or levels of evidence, comparing DME versus non-DME signals or (non-)DME versus (non-)DME signals. We computed P values and 95% confidence intervals (CI) [10]. In summary, when applied to this study, sample estimates exceeding 0.50 (or 50%) suggest that a random observation in the DME group tends to be lower than a random observation in the non-DME group. Sample estimates below 0.50 suggest that it tends to be higher. For a result to be statistically significant, the sample estimates had to fall within the 95% CI and have a P value < 0.05. For example, a hypothetical Brunner–Munzel comparison of a variable X across two groups, 1 and 2, that yields a statistically significant sample estimate of 0.62 suggests an estimated probability of 0.62 of observing lower values of X in group 1 than in group 2.
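The sample estimate itself can be illustrated with a few lines of Python. The sketch below computes only the point estimate of P(X < Y) + 0.5 × P(X = Y) from all pairwise comparisons; it does not reproduce the test statistic, the P value or the confidence interval, and it is not the R code used in the study. The function name and the example values are hypothetical.

```python
def relative_effect(x, y):
    """Estimate P(X < Y) + 0.5 * P(X = Y) from two samples.

    Every observation in x is compared with every observation in y;
    ties contribute one half.
    """
    if not x or not y:
        raise ValueError("both samples must be non-empty")
    less = sum(1 for a in x for b in y if a < b)
    ties = sum(1 for a in x for b in y if a == b)
    return (less + 0.5 * ties) / (len(x) * len(y))


# Hypothetical numbers of reports for two small groups of signals.
group_1 = [6, 15, 15, 38]
group_2 = [6, 20, 40, 84]
print(relative_effect(group_1, group_2))
```

An estimate of 0.50 corresponds to stochastic equality, while values near 1 indicate that observations in the first sample are almost always lower than those in the second. The pairwise loop is quadratic in the sample sizes, which is acceptable for an illustration but slower than a rank-based computation on large datasets.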

Data for calculations of statistical measures were managed in Microsoft Excel, whereas calculations and figures were made in R Statistical Software (v. 4.2.0) [12].

3 Results

We obtained 10,861 signals/SDRs from the dataset included in the primary analysis of the scoping review. Of the 4520 signals/SDRs remaining after exclusion of methods studies, 919 (20%) concerned DMEs and 3601 (80%) non-DMEs. A total of 195 (4.3%) signals/SDRs concerned drug–drug interactions, 37 (19%) of which were DMEs and 158 (81%) non-DMEs. A total of 3937 (87%) had at least one report in VigiBase in the year before communication, 3639 (80%) had at least three reports in VigiBase and 2448 (54%) were disproportionate as of the year of communication. The full results of our analyses are available in Supplementary Materials 1. Supplementary Materials 2 includes the dates of launch of the medicinal products.

3.1 Descriptive Analyses

The median numbers of reports supporting each signal differed between DME and non-DME signals in the dataset: DME signals were supported by a median of 15 reports (IQR 6–38 reports) and non-DME signals by a median of 20 (IQR 6–84 reports). There were also differences in average completeness scores: DME signals had a median of 0.52 (IQR 0.43–0.62) and non-DME signals a median of 0.46 (IQR 0.38–0.56). The median numbers of dechallenges, rechallenges, countries of origin and report types were equal across the two categories.

3.1.1 Timing of Communications

DMEs and non-DMEs had the same medians across all three prespecified measures of timing, i.e. the time from first report to communication, or TTC (9 years), from the year in which there were three reports to that of communication (7 years) and from the year in which a signal became disproportionate in VigiBase to that of communication (6 years).

There was an apparent increase in the median TTC over time (Fig. 1). In fact, for DME signals the TTC nearly doubled from a median of 5 years (IQR 2–14 years) during 1986–2005 to a median of 9 years (IQR 5–14 years) during 2006–2020. For non-DME signals, the medians were 4 years (IQR 2–9 years) for the first period, and 10 years (IQR 5–17 years) for the second. The same held when using the year in which there were at least 3 reports in VigiBase and that in which signals/SDRs became disproportionate as independent variables (Supplementary Materials 1).

Fig. 1

Box plots of 3937 signals/signals of disproportionate reporting, categorized as DME (designated medical events, red) or non-DME (blue), with at least one report in VigiBase before communication and a positive, non-zero, time to communication. On the x-axis, the years of communication in 5-year periods, on the y-axis the delay in years in communicating signals, with interquartile ranges (whiskers); median values are indicated by horizontal lines within the boxes

DME signals supported by ‘well-documented’ reports (completeness score > 0.8) had a median TTC of 2 years (IQR 2–4; Fig. 2). This was about five times shorter than the median TTC for DME signals supported by reports of ‘above average’ completeness (9 years, IQR 5–14 years, vigiGrade score 0.47–0.80) or ‘below average or average’ completeness (10 years, IQR 4–15 years, vigiGrade score 0.00–0.46). Similar median TTC values applied to non-DME signals (medians of 3, 8 and 10 years for the respective classifications of completeness).

Fig. 2

Time to communication versus average completeness for 3937 signals/signals of disproportionate reporting with at least one report in VigiBase in the years leading up to that of communication. ‘Well documented’ was defined as an average completeness of strictly above 0.80; ‘below average or average’ refers to an average completeness score between 0.00 and 0.46 based on the average completeness in VigiBase as of August 2020; median values are indicated by horizontal lines within the boxes

3.1.2 Interval from Launch to Communication

The median interval from launch to communication was 15 years for DMEs (IQR 6–30 years) and 14 years for non-DMEs (IQR 6–28 years). There was an apparent increase in the proportion of DME or non-DME signals concerning medicinal products that were launched 10 or more years before communication; for the period 1986–2005 this proportion was 49%, while in 2006–2020 it rose to 64%.

3.1.3 Levels of Evidence

The 4520 signals/SDRs fell predominantly under OCEBM level 4 (3983, 88%) and its subtypes; 2203 signals (49% of 4520) were supported by clinical assessments of reports of ADRs, of which 479 were classified as DME signals (22% of 2203) and 1724 (78%) as non-DME signals. A further 1443 OCEBM 4 signals (32% of 4520) were supported by analyses of disproportionality, with 307 (21% of 1443) categorised as DME signals and 1136 (79%) as non-DME signals.

In Fig. 3, we show the time intervals to communication, from first report in VigiBase and from launch year. The results suggest higher medians for both intervals among signals/SDRs belonging to OCEBM 4, with minor differences between signals of DMEs and non-DMEs.

Fig. 3

(a) Box plots of time to communication for 3858/3937 signals/signals of disproportionate reporting (SDRs) by Oxford Centre for Evidence-Based Medicine (OCEBM) level, excluding studies with unclear design. (b) Box plots of the intervals in years between launch and communication of 4426/4513 signals/SDRs by OCEBM level, excluding studies with unclear design. OCEBM levels 1 through 3 were aggregated. The y-axis in (b) was truncated at 60 years (max, 119 years). Median values are indicated by horizontal lines within the boxes

3.2 Statistical Analysis

The full results of the statistical analysis are available in Supplementary Materials 1. We report the main findings from the comparison of DME and non-DME signals using the Brunner–Munzel test in Table 1. There were statistically significant differences in average completeness, numbers of reports, and numbers of dechallenges and rechallenges. The only time interval for which the difference was statistically significant was that between the year in which a signal became disproportionate in VigiBase and the year of communication.

Table 1 Results of the Brunner–Munzel test comparisons between designated medical events (DME) and non-DME signals, with respect to the characteristics of the case reports in VigiBase and the time intervals considered in the study

In Table 2, we report additional comparisons across DME signals, relevant to the descriptive analyses, and in further support of the apparent trends shown in the figures above. In comparisons across OCEBM level and average completeness of the case series, we noted large effect sizes in TTC, numbers of reports and differences in intervals between launch and communication. We obtained similar findings for non-DME signals, all of which are reported in Supplementary Materials 1.

Table 2 Brunner–Munzel test results for the comparisons over completeness score, level of evidence and communication year for signals of designated medical events

3.3 ATC and MedDRA System Organ Classes of the Signals

The results of ATC and MedDRA System Organ Class coding are in Supplementary Materials 1. There were no unexpected imbalances in the proportions of DME or non-DME signals across either of these standardized terminologies.

4 Discussion

4.1 Summary of Key Results

This analysis highlights statistically significant differences in the characteristics of the case series of DME and non-DME signals as they appeared in VigiBase up to the year of communication. These were the numbers of reports, their average completeness and the counts of positive dechallenges and rechallenges. Furthermore, except for a statistically significant difference in the interval in years between the first indication of disproportionality and the year of communication, the timing of communication did not differ between the two groups of signals. Finally, we found statistically significant patterns shared between DME and non-DME signals, such as the strong association between TTC and both completeness score and level of evidence, and the increase in the interval between launch of medicinal products and communication over the last 15 years of the study period.

4.2 Statistical Significance and Practical Relevance

While the comparisons of DME and non-DME signals were statistically significant, they were accompanied by small effect sizes in Brunner–Munzel estimates for numbers of reports, dechallenges, rechallenges and average completeness scores. This was especially surprising in relation to the difference in numbers of reports, as one might expect a DME signal in some cases to be based on as few as one report (i.e. ‘between-the-eyes’ adverse reactions [13]). However, DMEs tended to be supported by a number of reports exceeding by several times the (canonical) minimum of three required for signal detection [14]. Less stringent criteria for signal detection when fatal events are involved have been previously advocated [15], and the same may be extended to DMEs. The EMA states that member states use the categorisation of a range of adverse events as DMEs to focus on reports of suspected adverse reactions that deserve special attention. However, we could not find evidence of such an effect in our analysis, and prioritization of such signals may require further attention. An important consideration is that the size of case series in VigiBase may have been larger than those on which the communicated signals were based. In view of this, it may be helpful for pharmacovigilance stakeholders to consult global databases of case reports when a signal of DME is detected to ensure more data are available for its clinical assessment.

4.3 Relationship Between Strength of Evidence and TTC

Irrespective of categorization into DME or non-DME signals, we found not only statistical significance but also larger effect sizes in the association between the TTC and the strength of evidence. Whether in the form of higher quality evidence (i.e. OCEBM 1–3) or high average completeness of the information in an underlying case series (i.e. ‘well documented’), the strength of evidence appeared to be linked to an up to fivefold shorter TTC (Fig. 2 and Table 2).

4.3.1 Relationship Between OCEBM Level and TTC

A possible contributor to the observed relationship between OCEBM level and TTC may be that evidence of higher quality (OCEBM 1–3) tends to be collected and appraised in pre-approval stages, as evidenced, in part, by the negative intervals between launch and communication (Fig. 3b). Conversely, evidence of lower quality (OCEBM 4) begins to accrue later, during post-marketing; in this phase, signals are detected mainly through reports of ADRs and are continuously prioritized as per good pharmacovigilance practices through analyses of patient exposure and estimates of frequencies of ADRs [16]. Limitations inherent to the systems for collecting reports of ADRs, such as under-reporting or low completeness of the reports, may have further contributed to the relationships observed in Table 2. Nevertheless, the types and frequencies of ADRs detected in the pre- and post-marketing phases are different, the latter phase being primarily concerned with rare ADRs.

4.3.2 Association Between Completeness of Information in a Case Series and TTC

Well-documented reports have been associated with ‘certain’, ‘probable’ or ‘possible’ outcomes of causality assessments or with reports flagged as serious by international standards [17, 18]. We should stress that completeness of reports is accounted for in methods for signal detection, such as vigiRank [19], which increase the rate of detected signals compared with disproportionality analysis [20], but that completeness does not necessarily have a bearing on the timeliness of signal detection from disproportionality analyses [21]. Rather, completeness of reports has been regarded as useful in performing clinical reviews [14], which also constitute the main type of evidence underpinning signals [5]. We reiterate that although the difference in median completeness score between DME and non-DME signals was statistically significant, the effect size was small and we did not record a difference in TTC for DME and non-DME signals supported by well-documented reports. These results may call for improved international collaboration between regulators and reporters, with the aim of increasing the completeness of information in reports of suspected ADRs, as a means of facilitating clinical reviews and shortening the TTCs of both DME and non-DME signals. The matter of completeness becomes more relevant when one appreciates that the volume of reports in databases has increased substantially over the past 30 years [22] and may continue to do so as developing countries progress towards more mature pharmacovigilance systems [23, 24]. Since a high degree of completeness is not always achievable [25,26,27] and completeness may vary across settings or attitudes of health carers and patients towards reporting [28], any intervention geared towards increasing the completeness of reports would probably be a complex one [29].

4.4 Increase in TTC Over Time

Our observations support prior research showing a growing proportion of signals concerning medicinal products launched over 10 years before communication. Early reviews of signals of the Pharmacovigilance Risk Assessment Committee suggested that the concerned medicinal products had been on the market for a median of 12 years and 42% of them for less than 10 years [30]. In 2010, the median time on market of medicinal products involved in regulatory actions in the USA was 11 years [31]. Thus, on the one hand, the increase in TTC may reflect evolving pharmacovigilance systems, able to manage signals concerning medicinal products that have been on the market for several decades, as noted in [30], namely: improved monitoring, completion of long-term observational studies to evaluate suspected harms or changes in patterns of use of medicinal products. On the other hand, it is worth considering that some adverse effects may only be detected after sufficiently long exposure; indeed, medicinal products that require longer durations of exposure have been found to be associated with larger numbers of amendments to product information [32]. In addition, the amount of post-approval exposure data (rather than pre-approval) predicts changes to the sections of untoward effects, and warnings and precautions, in European summaries of product characteristics (SmPCs) [33]. Taking these insights together, the increase in TTC may be conditional on the time needed to accrue sufficient data in the postmarketing phase, a time that may have been longer for some classes of medicinal products.

4.5 Strengths and Limitations

We compared large sets of DME and non-DME signals, relying on systematically collected data covering roughly 30 years. We used a heteroskedasticity-robust statistical method to compare the two groups of signals, ensuring intuitive interpretability of the results, and excluded signals that may have distorted calculation of the intervals we had postulated. We are not aware of similar published work. These findings may provide a way forward for regulators and researchers in prioritising and communicating signals of rare events that are typically associated with medicines.

As this study concerned any reported signals/SDRs, irrespective of regulatory requirements for action or verificatory studies, our findings are relevant to the communication of signals alone. In other words, any differences or lack thereof we have identified may not necessarily solely concern signals that have significant effects on public health [16].

We used the list of DMEs rather than the list of important medical events (IMEs), since the latter includes far more events that may cause a report to be marked as ‘serious’ by international standards [34]. Both lists, however, presuppose seriousness; we chose to use DMEs, as they are regarded as often drug related. Relatedly, we did not quantify proportions of serious reports in either group of signals, so we cannot conclude whether DME signals were supported mostly (or not) by reports marked as non-serious.

Findings about TTC should be considered carefully. Although we have manually verified dates of receipt of reports (in VigiBase or at the national centres level) that were discrepant with launch years, we could not control for potential data entry errors in VigiBase. In addition, data retrieval was based on the definition of the events in the original publications; in the case of composite events, we retrieved all relevant MedDRA preferred terms. Consequently, frequently reported events may have biased the TTC of some of the most recently communicated signals/SDRs.

In our search for launch years, we have encountered minor mistakes in the available data sources (and have reported them to the data holders). For medicinal products launched in countries that no longer exist (e.g. Eastern Germany since 1978), some dates may have been replaced by default values by regulatory agencies, but we did not encounter enough examples (four in all) to warrant concern. Furthermore, we did not systematically evaluate any discrepancies between the sources we used to obtain launch years and the published literature, so there may be instances in which some medicinal products may have been launched earlier than recorded. More important is the effect of censoring, which may not have allowed sufficient time for an ADR to be recognized by reporters for medicinal products launched in recent years.

The method used to compute the completeness of a case series measures technical completeness but not clinical utility. In other words, formally complete case reports may still not necessarily contain sufficient information to produce a clinically sound judgment on a possible causal relationship between a medicinal product and an adverse event. It may well be that signals that were communicated rapidly contained a higher degree of clinically relevant information, which we could not measure.

5 Conclusions

We found that DME and non-DME signals differed in number of reports, completeness score and counts of positive dechallenges and rechallenges. The effect sizes were small, although the differences were statistically significant. The median number of reports supporting DME signals by far exceeded the minimum number of reports required for detecting any signals. As such, the list of DMEs may not be attaining its intended purpose of prioritizing signals that concern suspected ADRs deemed to be often medication-related. In both groups, the stronger the evidence, either as completeness of the case reports or in the form of higher quality of evidence, the shorter the TTC. Because clinical reviews of reports of ADRs are the main type of evidence that supports signals, we suggest that improved quality of reports may come with better prioritisation of communication of DME and non-DME signals alike.