1 Introduction

Health systems are designed to meet the daily health care demand of a designated region. In an overwhelming-demand scenario, however, such as the SARS-CoV-2 pandemic, pressure on health systems may exceed their predicted capacity to deal with such extreme situations (Emanuel 2020). One of the critical points in the context of a pandemic of such magnitude is the potential collapse of the health care system, leading to suboptimal treatment due to a lack of resources and a subsequent increase in mortality. Therefore, in order to successfully face a health emergency, scientific evidence and validated models are needed to provide real-time information that could be used by any health center (Pencina et al. 2020; Wynants 2020a).

Solid organ transplant (SOT) recipients are considered a high-risk population for any infection (Kumar and Ison 2019; Ison and Hirsch 2019; Silva et al. 2020). Preliminary data on SARS-CoV-2 infection show a mortality as high as 28% (Fernández-Ruiz et al. 2020; Akalin et al. 2020), compared to 7.2% among the general population (Onder et al. 2020). It must be highlighted that fatalities may not be attributable solely to chronic immunosuppression and its associated risks, since these patients are also older and present more comorbidities (Fernández-Ruiz et al. 2020; Akalin et al. 2020; Alberici et al. 2020a; Pereira et al. 2020; The 2019; Guan 2020).

Transplant centers have adapted their policies and developed specific circuits for SOT recipients in the current pandemic scenario (Martino et al. 2020; Boyarsky 2020; Angelico 2020). In particular, strategies for the management of immunosuppression were designed based on pathophysiology and reports of case series, rather than solid studies (Boyarsky 2020; Alberici et al. 2020b).

Predictive models are particularly relevant in situations in which solid evidence is scarce or absent, since they may help identify those patients at high risk of unfavorable outcomes. In the COVID-19 pandemic, they may be of particular benefit for high-risk but low-volume populations, such as SOT recipients. Although some models have already been developed in kidney transplantation scenarios (Loupy 2019; Aubert 2019), and even for SARS-CoV-2 infection (Wynants 2020a; Giordano 2020; Weitz et al. 2020), they generally require large cohorts of patients.

The hybrid technique proposed exploits the capacity of Artificial Neural Networks (ANNs) to extrapolate the clinical course of patients using the results of the analyses performed at hospital admission while learning from the categories defined through Data Envelopment Analysis (DEA). This latter method is generally applied in industrial engineering to measure the efficiency of production processes (Misiunas et al. 2016; Ahmadvand and Pishvaee 2018). Its applications in medicine to evaluate the course of patients have overlooked its qualities as a categorization mechanism, a particularly useful feature when dealing with relatively small datasets (Misiunas et al. 2016; Ahmadvand and Pishvaee 2018). The importance of the suggested technique, and its main contribution, lies in the substantial improvement in the accuracy of the ANN relative to a direct application of the ANN to categories defined directly from the output variables.

Therefore, our objective is to develop a predictive model based on the implementation of DEA-ANN to the data retrieved from our cohort of SOT patients with COVID-19 at hospital admission and extrapolate the clinical course of the disease while identifying patients at risk of progressing towards severe disease.

The paper presents the main results obtained from the implementation of our hybrid model to the data retrieved—and analyzed—by the team of doctors. Additional methodological explanations regarding the design of the study, together with the formal definition of the hybrid model and the alternative configurations considered, are provided in the “Appendix” sections.

1.1 Literature review

Physicians tend to distrust the results derived from artificial intelligence models (Bae 2020; Wynants 2020b; Editorial 2021). This behavior can be intuitively verified through the results obtained in the current paper, where, as we will observe, the machine and deep learning techniques presented do not perform particularly well on their own. Despite these drawbacks, artificial intelligence techniques have been consistently applied to medical settings within the current pandemic scenario. Among the main applications considered, we must highlight their use as early detection mechanisms and monitoring devices (Arora et al. 2020; Rasheed 2020; Vaishya et al. 2020). That is, the use of artificial intelligence techniques in medical environments has mainly focused on disease identification and propagation models (Rashid and Wang 2021). The predictive capacity of these techniques is quite limited, particularly when dealing with small amounts of data and when physicians must consider multiple outputs to extrapolate the potential evolution of patients, as is the case in the current paper.

A classification alternative to artificial intelligence models is given by the implementation of multiple criteria decision making techniques. These techniques are usually applied to rank COVID-19 symptoms in terms of their importance and imply a considerable degree of subjectivity in the selection of the weights assigned to the criteria (Alzubaidi et al. 2021). Their extrapolation capacity is also limited. More precisely, their rankings are conditioned by the relative importance assigned to the different decision criteria while lacking an explicit formalization of the actual interactions taking place between the input and output variables (Albahri et al. 2021).

Kidney transplant studies apply machine learning techniques such as random forest (Massie 2020), and artificial intelligence models such as Bayesian networks (Siga 2020) to estimate the relative importance that different factors have on the potential evolution of transplant patients. These techniques remain constrained in their capacity to select several potential output variables simultaneously, imposing a restriction on their ability to extrapolate the actual behavior of patients.

Outside the medical domain, hybrid models defining evaluation indexes combined with machine learning techniques such as random forest have been applied to analyze, for instance, the credit risk of borrowers (Rao et al. 2020a, b). Similarly, DEA has been consistently combined with ANN to generate classification models exploiting the main features of both techniques (Toloo et al. 2015). However, hybrid models combining DEA and ANN are generally based on the efficiency indicators provided by DEA, namely, the \(\theta_{o}^{*}\) variables derived from Eq. (1) in the “Appendix” section, a consistent feature that remains mainly unmodified nowadays (Tsolas et al. 2020).

The main problem with this latter approach when applied to a medical context is the fact that physicians do not have an intuitive interpretation for the notion of efficiency. Moreover, imposing the efficiency value as the main selection mechanism implies a substantial loss of information, ignoring the importance of the slack variables introduced in Eq. (2) within the “Appendix” section. Both these drawbacks motivate the design of our slack-based performance index, whose contribution to the identification and classification capacities of different machine learning techniques is summarized in the next section.
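To fix intuitions about the efficiency scores and slacks discussed above, the sketch below implements a standard input-oriented CCR envelopment model in Python with SciPy. This is an illustrative assumption: the paper's Eqs. (1) and (2) are given in the Appendix and may differ in orientation or returns-to-scale. Each decision making unit (here, a patient) receives an efficiency score \(\theta_{o}^{*}\) obtained by linear programming:

```python
import numpy as np
from scipy.optimize import linprog

def dea_efficiency(X, Y, o):
    """Input-oriented CCR efficiency theta* of DMU o (envelopment form).

    X: (n, m) array of inputs, Y: (n, s) array of outputs; rows are DMUs.
    Decision variables: [theta, lambda_1, ..., lambda_n].
    """
    n, m = X.shape
    s = Y.shape[1]
    c = np.zeros(1 + n)
    c[0] = 1.0                                    # minimize theta
    # Input constraints: sum_j lambda_j * x_ij <= theta * x_io
    A_in = np.hstack([-X[o].reshape(m, 1), X.T])
    # Output constraints: sum_j lambda_j * y_rj >= y_ro (flipped to <=)
    A_out = np.hstack([np.zeros((s, 1)), -Y.T])
    res = linprog(c,
                  A_ub=np.vstack([A_in, A_out]),
                  b_ub=np.concatenate([np.zeros(m), -Y[o]]),
                  bounds=[(0, None)] * (1 + n))
    return res.fun

# Two toy DMUs with one input and one output each
X = np.array([[1.0], [2.0]])
Y = np.array([[1.0], [1.0]])
print(dea_efficiency(X, Y, 0))  # ~1.0: efficient
print(dea_efficiency(X, Y, 1))  # ~0.5: twice the input for the same output
```

A second optimization phase, maximizing the slacks at the optimal \(\theta_{o}^{*}\), yields the slack variables that the performance index introduced in this paper is built upon.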

2 Contribution

The paper balances three very different research areas. First, it provides a precise description of the main analyses performed by the medical team, illustrating their evaluation processes, and defining a benchmark that can be easily recognized by other physicians. Second, it extends the standard analysis of industrial engineers on DEA to generate an index from the slack variables provided by the corresponding optimization model. That is, the use of the slack variables to create a relative performance index that summarizes the potential evolution of patients constitutes a contribution in itself. This index extends the potential applications of DEA beyond standard efficiency considerations, which, in situations such as the one analyzed in the paper, may not be of particular use to physicians. Third, the artificial intelligence techniques employed—and enhanced through the DEA index—provide a general picture of the results to be expected by physicians when dealing with relatively small data sets combining both quantitative discrete and categorical variables. We demonstrate how our hybrid technique performs when data are scarce, requiring only basic behavioral guidelines from the simulations so that the model can be understood and applied by physicians under regular circumstances.

The balance among areas has been kept by describing the main medical variables and standard statistical analyses within the main body of the paper while relegating the optimization techniques and other detailed medical exams to the “Appendixes”. In this regard, the results obtained from the hybrid DEA-ANN model and the alternative machine learning techniques and output configurations implemented constitute the main focus of the paper.

The main objective of the manuscript is to illustrate the enhanced identification capacity that results from the design of a hybrid model based on the collaboration of physicians, engineers and computer scientists. As already noted, models combining DEA with artificial intelligence techniques already exist, but the corresponding literature tends to focus on the efficiency scores of the decision making units (Toloo et al. 2015).

From a medical viewpoint, efficiency remains an elusive and complex variable to interpret. Physicians, especially in emergency situations, require instruments based on techniques that—while constrained by a relatively low amount of data and a substantial number of potential output variables—provide them with a set of guidelines describing the potential consequences from an initial evaluation assessment. Artificial intelligence and deep learning models such as neural networks constitute the required techniques. However, their identification and classification capacities are substantially constrained when dealing with few observations and multiple categorical variables. The incorporation of the performance index derived from the implementation of DEA allows us to overcome these drawbacks, leading to a hybrid model whose identification and classification capacities improve substantially upon those of the artificial intelligence techniques applied unilaterally.

3 Results

3.1 Patient characteristics

During the study period, 1006 recipients were seen for regular follow-up or medical issues. Thirty-eight consecutive adult transplant recipients developed confirmed COVID-19 disease requiring hospital admission (Fig. 1), with a mean age of 59 years (range, 33 to 87). Twenty-nine recipients (76.3%) had hypertension, 12 (31.6%) had diabetes mellitus, and 3 (7.9%) had lung comorbidities. Nine patients (23.7%) had been transplanted before, and 16 (42.1%) were under treatment with angiotensin-converting-enzyme inhibitors (ACEI) or angiotensin-receptor blockers (ARB) at hospital admission.

Fig. 1

Transplant recipients controlled or attended at the transplant unit. Patients were followed up from March 3rd to April 24th, 2020, as either regular follow-ups, or contacted due to COVID-19 suspicion or other reasons. All patients were asked to report suspicious symptoms. Only patients with strong evidence of SARS-CoV-2 infection requiring hospital admission were included in the analysis

Most recipients consulted with fever (94.7%), cough (60.5%), dyspnea (39.5%) and diarrhea (28.9%), and were admitted after a mean of 7.32 ± 5.87 days from the beginning of symptoms. Thirty-one patients (81.6%) developed pneumonia, sixteen of them (42.1% of the total population) with probable or confirmed bacterial super-infection. Patient characteristics are summarized in Table 1 and Extended Data Table S1.

Table 1 Demographic Characteristics at Hospital Admission, infectious and immunosuppressive therapy management, and clinical outcomes in Adult Transplant Recipients Hospitalized with COVID-19

3.2 Infectious and transplant management

Following our center protocol, lopinavir/ritonavir (73.7%), hydroxychloroquine (94.7%), and azithromycin (92.1%) were given as antiviral therapy. Eighteen recipients (47.4%) required tocilizumab and 17 (44.7%) methylprednisolone pulses. During the follow-up, five patients (13.2%) died with a functioning graft, and 22 (57.9%) developed an acute kidney injury. Thirty patients (78.9%) were discharged from the hospital after a mean of 12.02 ± 6.74 days from admission. Clinical and laboratory data are summarized in Table 1 and Extended Data Table S2.

3.3 Comparative analysis of the poor clinical course of COVID-19

In order to evaluate the primary composite outcomes, we analyzed the clinical worsening profile of recipients in Table 2. Twenty-four recipients (63.2%) fulfilled the criteria for clinical worsening. Cough as a presenting symptom (P < 0.001), pneumonia (P = 0.011), and high levels of lactate dehydrogenase (LDH) (P = 0.031) were admission factors associated with complicated clinical evolutions. Lopinavir/ritonavir and azithromycin were used more frequently in patients displaying a poor evolution. Death-censored graft loss was higher in the composite group (P = 0.049), with no differences in the management of immunosuppressive therapy. Although a non-significant trend was evidenced in patient and graft survival, no patient died due to COVID-19 in the non-composite-event group. In the multivariate analyses, cough was the differential risk factor at admission.

Table 2 Comparative analysis of composite events

Nine patients (23.7%) were admitted to the ICU (refer to Extended Data Table S3). Interestingly, cough, pneumonia (P = 0.039), and high levels of LDH (P = 0.030) prevailed as the differential admission factors.

3.4 Prediction models for COVID-19 severity

Figure 2 provides a flowchart describing the implementation and evaluation stages of the hybrid model and the alternative configurations described through the current section. The index generated by implementing DEA before being incorporated into the ANN allows us to identify the relative performance of each patient across the variables composing the input set. The intuition validating this result can be inferred from the clustering qualities of the index, already observable in Fig. 3. As a result, we can create average profiles of the patients composing each performance category. Figure 4 provides a concise description of the main variables where patients underperform, with higher values representing relatively worse performances. For instance, note that patients composing the worst quartile exhibit highly suboptimal performances for Cough as presenting symptom, Pneumonia, and Days from starting symptoms to hospital admission. Patients located in the third quartile perform considerably better, with suboptimality arising for Hypertension, Pneumonia, Acute kidney injury and Cough as presenting symptom, describing a similar profile to that of the patients composing the best quartile.
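The average profiles per performance category described above can be sketched with a simple binning-and-averaging step. The fragment below (Python with pandas; the variable names and data are hypothetical, not the study's) bins patients into quartiles of a normalized index and averages each input variable within each bin:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
# Hypothetical normalized per-variable performance values for 38 patients
# (higher = worse), plus a normalized DEA index in [0, 1]
df = pd.DataFrame(rng.random((38, 4)),
                  columns=["Cough", "Pneumonia", "Hypertension", "DaysToAdmission"])
index = rng.random(38)

# Quartile categories of the index, from best (lowest) to worst (highest)
df["quartile"] = pd.qcut(index, 4, labels=["best", "Q2", "Q3", "worst"])

# Average profile of the patients composing each performance category
profiles = df.groupby("quartile", observed=True).mean()
print(profiles)
```

The resulting table plays the role of Fig. 4: each row is a quartile category, each column an input variable, and larger entries flag the variables where that category underperforms.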

Fig. 2

Implementation and evaluation stages of the selection process: hybrid model versus alternative configurations

Fig. 3

Indexing effect across input variables. The red circles represent the value of the triples described by the data. The blue circles correspond to the index values generated after implementing DEA

Fig. 4

Patient performance across input variables for the different index categories generated using DEA. The index has been normalized within the interval [0, 1]. Higher values represent relatively worse performances

The battery of techniques described in Table 3 has been presented to illustrate the relative performance of the different configurations and their stability. We want to avoid selecting a biased evaluation technique that favors our model while ignoring others where the competing configurations outperform the suggested hybrid. The hybrid DEA model delivers the highest accuracy (96.3%) relative to any of the alternative configurations when considering the ANN that constitutes the main object of analysis. It also outperforms the alternative configurations when considering two well-known techniques, logistic regression (77.8%) and random forest (codified as bagged trees in the MATLAB interface; 55.6%).

Table 3 Accuracy across models and methods

Finally, Table 3 illustrates how the suggested hybrid model tends to outperform the alternative configurations when considering the remaining set of techniques. We have emphasized the highest accuracy score achieved by each configuration through the whole battery of remaining machine learning tests, with the hybrid model performing considerably better than any of the alternatives.

4 Discussion

The SARS-CoV-2 infection has become a devastating worldwide pandemic that has pushed all health systems (regardless of their geography) to extreme limits, causing an unprecedented consumption of resources. Due to the potential severity of COVID-19, ICUs have been among the most strained health care resources. In this sense, it is key to determine which patients are candidates for intensive care, exhaustively evaluating both the benefit and the harm that these intensive measures can cause to patients with extremely high morbidity and mortality.

SOT recipients are more prone to opportunistic infections and worse outcomes from community-acquired infections. In patients with COVID-19, an ICU admission rate as high as 34% (Pereira et al. 2020) and a mortality of 28% have been reported (Fernández-Ruiz et al. 2020; Akalin et al. 2020). These reports provide valuable data regarding patient outcomes, but their small cohort sizes and the heterogeneity of therapeutic approaches preclude the extrapolation of data into practical decision making.

In the general population, advanced age is associated with a higher risk of ICU admission and mortality (Zhou 2020; Huang 2020). In our composite model, cough, pneumonia, and biochemical parameters at hospital admission, but not age, are correlated with a worse outcome. A confounding factor may emerge when dealing with maintenance immunosuppression. Ageing of the immune system is a well-known process characterized by defective immune responses and increased systemic inflammation (Sato and Yanagita 2019). In patients with chronic kidney disease, these phenomena are accelerated, appearing at younger ages and persisting over time (Sato and Yanagita 2019). The association of inflammageing (as it is termed) with chronic immunosuppression is a probable explanation for the dissipation of the age effect in our predictive model. Induction immunosuppression, particularly T-cell depleting agents such as thymoglobulin, may increase the risk of disease severity. In an analysis of peripheral lymphocyte populations, Akalin et al. (2020) found kidney transplant recipients admitted due to COVID-19 to have lower CD3, CD4, and CD8 counts than those reported in the general population (Gautret 2020).

Most SOT recipients receive prednisone as their maintenance immunosuppression therapy. Steroids have been described to reduce pathogen clearance and increase the viral shedding of SARS-CoV-2 if administered early in the course of the disease (Cao 2020). Hence, most organizations advise against their routine administration in patients with COVID-19 (except if indicated for another reason, such as asthma or acute respiratory distress syndrome) (Luo 2020; Cortegiani et al. 2020; Fontana et al. 2020). Interestingly, steroids, either as monotherapy or in combination with CNIs, were maintained in most reported series (Akalin et al. 2020; Alberici et al. 2020a; Pereira et al. 2020) and management protocols (Alberici et al. 2020b).

Following the policy of our center, we use tocilizumab and steroids as rescue therapy in patients with poor clinical outcomes despite an initial treatment with lopinavir/ritonavir, hydroxychloroquine, and azithromycin. However, up to now, no treatment has proven able to modify the natural history of SARS-CoV-2 infection in a randomized controlled trial, the evidence supporting the potential benefit of these drugs being preliminary and, at times, controversial, even more so in SOT recipients (Gautret 2020; Cao 2020; Luo 2020; Cortegiani et al. 2020; Fontana et al. 2020). There are several reports of tocilizumab being used among SOT recipients with COVID-19 (Akalin et al. 2020; Alberici et al. 2020a; Fontana et al. 2020), but its benefits are hard to extrapolate, in part due to its use as a second-line treatment, hence only in patients with moderate-severe disease. On the other hand, the use of tocilizumab serves as a good marker of disease progression and severity. In this study, we were able to identify non-traditional risk factors for disease severity using tocilizumab in a composite model of worse outcomes.

In previous studies, cough has been found with relatively greater frequency in patients with poor outcomes, although, to date, this symptom has not been included in predictive models to identify those patients with worse prognosis and a high likelihood of being admitted to the ICU (Wynants 2020a; Alberici et al. 2020a; Guan 2020; Zhou 2020). Herein, we have shown that the presence of cough at COVID-19 diagnosis constitutes a factor significantly related to the composite outcome of clinical progression and need for intensive care. Analytically, the usefulness of LDH and C-reactive protein (CRP) as markers of more severe forms of COVID-19 has been previously described (Giordano 2020; Zhou 2020). Accordingly, some predictive models have used these parameters to identify forms of COVID-19 with a poor prognosis (Wynants 2020a). Similarly, in this study, LDH values at hospital admission have been directly associated with the composite outcome of poor prognosis and a higher probability of ICU admittance.

4.1 Enhancing the accuracy of machine learning techniques

The purpose of implementing a combination of DEA with an ANN is threefold. First, the optimization process on which DEA is built allows us to identify the performance of each patient regarding every analysis-related variable considered. Second, the corresponding slack values are used to design an index indicating the relative performance of each patient when considering several outputs simultaneously. Standard artificial intelligence models are generally constrained by one output variable distributed across different categories, though slightly more complex combinatorial environments may be defined. Finally, the index obtained can be used to orientate the ANN, which would otherwise omit the efficiency interactions existing across variables, leading to a relatively poorer performance.

The performance index has been generated by normalizing the slack values retrieved from DEA relative to the initial ones, whether quantitative discrete or categorical, and adding up the results across all the input variables. Quantitative discrete variables have been normalized by dividing the slack value obtained from DEA by the original value of the variable. The normalization of categorical variables requires dividing the sum of the initial value and the slack obtained from DEA by the sum of the initial value and the maximum value of the category. The final slack-based efficiency sum per patient has been normalized relative to the highest value obtained, implying that higher values of the index correspond to relatively worse performances.
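A minimal sketch of this construction is given below (in Python; the data layout is an assumption for illustration, since the paper formalizes the index in the Appendix):

```python
import numpy as np

def slack_index(values, slacks, is_categorical, cat_max):
    """Slack-based performance index; higher values = worse performance.

    values: (n, k) original input values per patient (nonzero for the
            quantitative variables, to avoid division by zero)
    slacks: (n, k) DEA slack values
    is_categorical: length-k boolean mask of the categorical variables
    cat_max: length-k maximum category values (used for categorical only)
    """
    norm = np.where(
        is_categorical,
        (values + slacks) / (values + cat_max),  # categorical normalization
        slacks / values,                         # quantitative normalization
    )
    total = norm.sum(axis=1)                     # add up across input variables
    return total / total.max()                   # normalize to [0, 1]

# Two patients, one quantitative variable and one categorical variable
values = np.array([[2.0, 1.0], [4.0, 2.0]])
slacks = np.array([[1.0, 0.0], [0.0, 1.0]])
idx = slack_index(values, slacks,
                  is_categorical=np.array([False, True]),
                  cat_max=np.array([0.0, 3.0]))
print(idx)  # [1.0, 0.8]: the first patient performs relatively worse
```

In the toy example, the first patient's total of 0.5 + 0.25 = 0.75 is the highest, so it normalizes to 1, the worst performance.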

One of the most remarkable features of the hybrid DEA-ANN model is its significant prediction accuracy in environments characterized by relatively low numbers of observations. In addition, it provides a reliable framework of analysis for data structures where multiple input and output variables must be evaluated simultaneously.
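The resulting workflow (DEA index first, quartile labels next, ANN last) can be sketched as follows, using scikit-learn's MLPClassifier as a stand-in for the paper's ANN and synthetic data in place of the study's cohort:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)
X = rng.normal(size=(38, 8))          # admission variables (synthetic)
dea_index = rng.random(38)            # normalized DEA performance index (synthetic)

# Quartile labels derived from the index (0 = best, 3 = worst)
y = np.digitize(dea_index, np.quantile(dea_index, [0.25, 0.5, 0.75]))

# Small network: the dataset is tiny, so one modest hidden layer suffices
clf = make_pipeline(StandardScaler(),
                    MLPClassifier(hidden_layer_sizes=(16,),
                                  max_iter=2000, random_state=0))
clf.fit(X, y)
print(clf.predict(X[:5]))             # predicted quartile categories
```

The key design choice is that the network learns the DEA-derived categories rather than categories built directly from the raw output variables.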

When considering machine learning techniques, logistic regression (Bagley et al. 2001) and random forest (Cafri et al. 2018) are among the most commonly used ones in the medical literature. Despite their general applicability, their accuracy tends to decrease when dealing with small data samples, as is the case in the current analysis. We have illustrated how the suggested model (96.3% accuracy) improves substantially upon the best performance of both techniques under any alternative configuration (65.5% for logistic regression and 44.8% for random forest). In addition, the accuracy of both techniques has also been improved when the index generated through DEA is used to define the corresponding output categories (logistic regression achieves an accuracy of 85.2% with the second DEA configuration defined in Table S6, while random forest reaches 55.6%).
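As a reference, both baselines can be evaluated on a small sample with cross-validation. The sketch below uses scikit-learn on synthetic data; the accuracies it produces are illustrative, not the study's figures:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(38, 8))          # 38 patients, 8 admission variables (synthetic)
y = np.tile(np.arange(4), 10)[:38]    # four output categories, roughly balanced

results = {}
for name, clf in [("logistic regression", LogisticRegression(max_iter=1000)),
                  ("random forest", RandomForestClassifier(n_estimators=200,
                                                           random_state=0))]:
    # 5-fold cross-validated accuracy; small samples make these estimates noisy
    results[name] = cross_val_score(clf, X, y, cv=5).mean()
    print(f"{name}: {results[name]:.3f}")
```

On a cohort of this size, a single train/test split can be misleading, which is why cross-validation is the more honest comparison.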

4.2 Comparing ANN and DEA-ANN

We start by noting that a common shortcoming of the machine and deep learning techniques implemented throughout the paper is given by their limited capacity to incorporate multiple output categories into the analysis, with the logistic regression representing a limit case. This constitutes a considerable constraint whenever physicians need to consider multiple output variables and their potential interactions. The capacity of these models to identify patterns when directly implementing the categories defined by the different output combinations is also quite limited. We present below the confusion matrices obtained when applying the ANN and the DEA-ANN ranking categorization to the different output categories described in the study. A basic interpretation of the confusion matrices described through this section is provided in Table 4.

Table 4 Description of the entries composing a confusion matrix
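For readers less familiar with the entries described in Table 4, a confusion matrix can be computed as in the following sketch (scikit-learn, with hypothetical quartile labels ranging from 0 = best to 3 = worst):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted quartile categories for nine patients
y_true = np.array([0, 0, 1, 1, 2, 2, 3, 3, 3])
y_pred = np.array([0, 1, 1, 1, 2, 3, 3, 3, 2])

cm = confusion_matrix(y_true, y_pred)
print(cm)                              # rows: true class; columns: predicted class
accuracy = np.trace(cm) / cm.sum()     # diagonal entries are correct predictions
print(f"accuracy = {accuracy:.3f}")    # 6 of 9 patients correctly classified
```

Off-diagonal entries in the bottom rows are the clinically costly errors: patients in the worst categories predicted as milder cases.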

As can be observed in Figs. 5 and 6a, b, the direct implementation of the ANN to the different categorization scenarios described, which encompass all the potential combinations of the composite primary outcomes defined by the physicians, does not produce significant results. Note, in particular, the difficulties faced by the ANN to identify those patients located in the worst categories, namely, those displaying relatively poorer output values. The enhanced identification and classification capacities of the DEA-ANN hybrid model are illustrated in Fig. 6c. A similar intuition based on the same type of results follows from the alternative evaluation scenario described in the “Appendix” section, where death and days spent in intensive care are taken as outputs.

Fig. 5

Quartile setting. CNF3 configuration

Fig. 6

Tercile setting.

5 Conclusions

Prediction models are needed to address conflicting data and the lack of information in specific populations during the current SARS-CoV-2 pandemic. The prediction of a worse clinical course in high-risk populations facing a health emergency becomes a feasible option with the hybrid DEA-ANN model introduced in the paper. As shown in the study with kidney transplant recipients, cough, pneumonia, and LDH at admission constitute key risk factors for a more severe course, a profile validated and extended through the slacks obtained via DEA. The prediction accuracy of the DEA-ANN model peaks at 96.3%, while the implementation of the ANN to configurations directly based on the value of the output variables achieves a maximum of 69%. A consistent improvement in the prediction accuracy of different machine learning techniques has also been observed when the output categorization process is determined by DEA.

The capacity of DEA to synthesize the behavior of complex processes through its slack variables enhances the ability of the ANN to identify the main relationships existing among the characteristics defining the alternatives. In this regard, one of the main contributions of our hybrid model is the improvement in the predictive power of different machine learning techniques when dealing with few data, a particularly important problem in developing countries, whose hospitals are generally constrained to work with small datasets. Being able to extrapolate the potential evolution of patients should improve the efficiency of patient management and resource allocation processes across hospitals, a particularly important feature in emergency situations.

As can be inferred from the empirical results presented, a drawback of the proposed hybrid model stems from the actual definition of the quartiles, particularly in settings where few observations may result in an empty quartile. Clearly, the model is still able to perform properly and enhance the identification capacity of the ANN, but physicians may have to consider adjusting the distribution of patients across quartiles or terciles, which may decrease their confidence in the validity of the results obtained. On the other hand, the model provides a complete profile of the patients composing each quartile category, allowing for a direct evaluation of their relative performances and the potential behavior of the corresponding output variables.

Future research should aim at incorporating dynamic DEA structures in the generation of the index so as to account for complex interactions across input and output categories within a simple framework. Moreover, incorporating the subjective beliefs of physicians into the design of the model would modify its performance but allow for direct interactions with the users, improving its applicability as a decision support system. In this regard, different weights could be assigned to the input and output variables, integrating techniques such as the best–worst ranking method within the current hybrid DEA-ANN framework.