Introduction

Medical artificial intelligence (AI) modules have been developed for use in various fields1,2,3,4. The role of AI in medicine is diverse: AI is involved in a wide range of medical care processes, from diagnosis to prognosis prediction5,6,7,8,9. AI approaches can be categorized into several types based on their functions as clinical practice tools: (1) analysis of complex medical data to derive medical insights, (2) image data analysis and interpretation, and (3) monitoring of continuous medical data to plan appropriate treatment strategies and follow-up2.

Many clinical decisions are required during routine medical care; accurate and rapid clinical decisions are directly linked to better patient outcomes10,11,12,13. Well-designed interventional studies have shown that AI assistance improves the accuracy and timeliness of diagnosis and treatment6,14,15,16,17,18. However, most studies describe the mechanisms of AI impact only in mathematical and computational terms19; few focus on the clinical decision-making process, for example, by explaining how a machine learning (ML)-based clinical decision support program achieves its effect.

Immediate clinical decision-making is particularly important in the emergency department (ED)20,21,22, where decisions such as whether to obtain a head CT for a pediatric patient are among the most challenging23,24,25. Ordering a head CT is a simple yet critical decision: head CT is the diagnostic tool of choice for intracranial hemorrhage (ICH)26, but it is also associated with an increased risk of malignancy27,28.

Conventional head CT clinical decision rules, such as the Pediatric Emergency Care Applied Research Network (PECARN) rules, have been introduced to guide clinical decisions; however, these rules require a detailed history and are difficult to apply in clinical practice29,30,31. AI prediction models have been developed to overcome such limitations. Although models that predict the need for head CT with good performance and accuracy have been developed25,32,33, their impact on clinical decision-making has not been explored.

This study aims to measure the effectiveness of the deep-learning model for predicting traumatic ICH (DEEPTICH) on emergency physicians' decisions to order head CTs, and to evaluate the factors associated with the effectiveness of DEEPTICH.

Results

Prediction model performance

DEEPTICH predicted ICHs such as cerebral contusion, subdural hemorrhage, epidural hemorrhage, subarachnoid hemorrhage, intraventricular hemorrhage, intracerebral hemorrhage, and cerebellar hemorrhage, but not microhemorrhage. DEEPTICH achieved an area under the receiver operating characteristic curve (AUROC) of 0.927 (95% CI 0.924–0.930) on internal validation with 80,508 cases from the national database, and 0.886 (95% CI 0.878–0.895) on external validation with the local hospital database (Fig. 1). In the external validation set, the overall sensitivity, specificity, positive predictive value, and negative predictive value were 0.95, 0.67, 0.02, and 0.99, respectively. Specificity was lower for patients under 3 years of age. Detailed clinical performance values by age group are shown in Supplementary Table S1.

Figure 1

Receiver operating characteristic (ROC) curve for the internal validation outcome on the time-split validation set.

Demographics for decision simulation study

A total of 22 emergency physicians completed the 24 simulation test cases and the survey, yielding 528 simulation case responses. The participants comprised eight junior residents in postgraduate years (PGY) 2 and 3, nine senior residents in PGY 4 and 5, and five specialists. The most common age group was 30 to 40 years (12 participants, 54.5%), with a mean age of 31.5 years. Ten (45.5%) participants had more than five years of experience (Table 1).

Table 1 Characteristics of the physicians.

The influence of recommendation directions

Figure 2 shows the flow of participants' responses regarding the binary head CT decision before and after the DEEPTICH recommendation. Responses in which the initial decision was the same as the DEEPTICH recommendation (n = 327, 61.9%) were excluded. We analyzed the responses in which the initial decision differed from the DEEPTICH recommendation (n = 201, 38.1%).

Figure 2

Binary head CT ordering decision results for the simulation cases.

Of the 201 responses, 56 had an initial decision not to order a head CT; when DEEPTICH recommended a head CT, 40 of these 56 (71.4%) decisions were changed, i.e., the respondents decided to order a head CT (p < 0.001). When DEEPTICH recommended not ordering a head CT, only 54 (36.6%) of the 145 initial decisions to order a CT were changed (p < 0.001).

We analyzed all responses on the five-point head CT ordering willingness scale before and after the DEEPTICH recommendation (n = 528). We found considerable score changes and decision augmentation in the direction of the DEEPTICH recommendation. In cases where DEEPTICH advised a head CT, the mean willingness score changed from 3.46 to 3.97 (Δwillingness, 0.51). In cases where DEEPTICH advised against a head CT, the mean score changed from 2.69 to 2.27 (Δwillingness, −0.42). Detailed results are presented in Supplementary Table S2.

The influence of physician factors

When DEEPTICH recommended not ordering a head CT, the decision effect differed by physician age and experience. Relatively inexperienced physicians were more likely to accept the recommendation than experienced physicians (−43 [29%] vs. −11 [7.3%], p < 0.001). Physicians older than 40 years did not change their decisions, whereas physicians aged 30–40 years did (0 [0.0%] vs. −36 [20.0%], p = 0.021) (Table 2).

Table 2 Changes in the decision to order a head CT based on the physician’s characteristics.

Factors associated with AI acceptance

We conducted univariate and multivariate logistic regression analyses to identify the factors associated with the effectiveness of DEEPTICH. Participants were more likely to accept AI recommendations when the PECARN risk was high (odds ratio [OR], 15.02; 95% CI 1.60–473.38) and when the initial decision was not to order a head CT (OR, 2.68; 95% CI 1.08–6.88) (Table 3). We also conducted logistic regression on the survey responses; no survey item was associated with the effectiveness of DEEPTICH.

Table 3 Logistic regression analysis of factors affecting the effectiveness of the machine learning model in assisting head CT ordering.

DEEPTICH effectiveness by PECARN risk

We analyzed head CT decisions before and after the DEEPTICH recommendation according to PECARN risk (Table 4). For patients younger than 2 years, using DEEPTICH could have avoided 15 (13.6%) unnecessary CTs per 110 low-risk head injury patients (p < 0.001), and 15 (34.1%) necessary CTs were properly ordered per 44 high-risk head injury patients (p < 0.001). For patients 2 years and older, using DEEPTICH could have avoided 18 (16.4%) unnecessary CTs per 110 low-risk head injury patients (p < 0.001). DEEPTICH did not induce significant decision changes in the intermediate-risk group for any age.

Table 4 Head CT ordering decision results according to case risk.

Table 5 presents the overall clinical outcomes of using DEEPTICH. With DEEPTICH, 46 (11.6%) unnecessary CTs were avoided (p < 0.001) and 10 (11.4%) of all traumatic ICHs that would otherwise have been missed were identified (p = 0.039).

Table 5 Clinical impact of the model in the study.

Survey outcome

The survey outcomes are presented in Table 6. Regarding general medical AI, only seven (31.8%) participants had prior experience with it, and five (22.7%) had technical knowledge of it. In addition, 16 participants (72.7%) did not have sufficient knowledge of data-driven AI; nevertheless, most intended to learn about it and believed in its positive impact (intention, 20 [90.9%]; optimism, 21 [95.5%]).

Table 6 Participants' responses to the survey (n = 22).

Among the participants, 19 (86.4%) responded that they understood the mechanisms of DEEPTICH. Agreement was lowest regarding clinical safety: only 15 (68.2%) participants agreed that the recommendations were clinically safe, and five (22.7%) disagreed with the quality of information obtained from DEEPTICH.

Discussion

To the best of our knowledge, this is the first study to develop a decision simulation study design and to investigate physicians' acceptance of AI in clinical decisions. Most AI-based clinical decision support system (AI-CDSS) studies have reported improved diagnostic accuracy or efficiency based on doctors' agreement with the AI34,35. However, before attributing improvements in accuracy to an AI approach, it is necessary to understand in detail how it functions within the clinical decision-making process.

In this study, we developed DEEPTICH, a deep-learning model for predicting traumatic ICH. DEEPTICH had a higher AUROC than previously reported pediatric head CT rules36. Because the rate of traumatic brain injury (TBI) is higher in younger age groups than in older age groups, this difference was considered when setting the model threshold, which resulted in lower specificity for younger age groups37,38.

Subsequently, we found that the effect of AI on physicians' decision making is influenced by various factors, one of which is the recommendation direction (positive vs. negative). When the recommendation was positive, emergency physicians were more likely to accept it; when the recommendation was negative, the decision change differed by the physician's years of work and age, suggesting that inexperienced clinicians are significantly more likely to be influenced by AI tools than experienced clinicians.

We also demonstrated that DEEPTICH is effective even when the AI-CDSS recommendation and the physician's initial decision are the same. When physicians learned that the AI-CDSS concurred with their initial decision, their confidence increased significantly. This is important because clinical decisions are often challenged by non-clinical factors, both social and psychological.

Because DEEPTICH predicts only ICHs and excludes microhemorrhages, clinicians may be reluctant to adopt its recommendations, as microhemorrhage is a clinically important sign of significant diffuse TBI. Although ICH is the most common pediatric TBI requiring neurosurgical intervention39,40, the prediction of other abnormalities should be considered for a more effective and reliable model.

We believe that DEEPTICH can improve clinical outcomes. Overall, DEEPTICH is helpful in reducing unnecessary head CTs and missed ICH cases. Although the model's decision effect was not significant in the intermediate-risk group, approximately 70% of children with head trauma (those in the low- or high-risk groups) can benefit from DEEPTICH through increased head CT ordering in high-risk groups and decreased ordering in low-risk groups.

The survey outcomes indicated that physicians were concerned about the clinical safety and information quality of DEEPTICH. Therefore, we propose that for the clinical use of medical AI, development information such as data processing and modeling should be described to physicians in greater detail to alleviate their concerns.

Consequently, considering the results and the sensitivity of DEEPTICH, we suggest using DEEPTICH together with conventional head CT rules to optimize the prevention of adverse outcomes while avoiding unnecessary head CTs. The model can supplement standard head CT rules even when the case history is incomplete or more than 24 h have passed since the last visit, especially for less experienced doctors.

This study has two limitations. First, the simulation cases are not representative of real-world pediatric TBI populations; low-risk TBI patients form a greater proportion in the real world, so the decision effect and clinical performance of the model may differ. Second, the simulation cases were non-randomly selected: only cases in which the DEEPTICH prediction matched the actual patient outcome were included. Therefore, this decision simulation study did not evaluate the accuracy or superiority of DEEPTICH. When implementing DEEPTICH in a real-world clinical setting, the rate of AI acceptance by physicians might differ.

We found that AI acceptance was affected by multiple factors, such as physician characteristics, case risk, and the direction of the DEEPTICH recommendation, making it difficult to predict the effect of the model in the real world. Therefore, when implementing an AI-CDSS in clinical practice, we suggest considering model performance together with physician acceptance. Randomized clinical trials in real-world settings are required to assess improvements in clinical outcomes.

Conclusions

The decision simulation study demonstrated that DEEPTICH affects emergency physicians' decisions to order head CTs. The effectiveness of the model was greater when it recommended ordering a head CT.

Methods

This study was approved by the Institutional Review Board (IRB) of Samsung Medical Center (IRB Nos. 2020-07-072 and 2020-09-218). We conducted the decision simulation study from April 26, 2021 to June 5, 2021. Informed consent was obtained from all participants. All experiments were performed in accordance with the relevant guidelines and regulations.

Deep-learning model for predicting traumatic intracranial hemorrhages (DEEPTICH) development in clinical decision support system (CDSS)

Dataset for deep learning

Two data sources were used in this study: the ED-based injury in-depth surveillance (EDIIS) database, and the trauma registry database of the Samsung Medical Center (SMC). The EDIIS dataset was used for model training and internal validation; the SMC dataset was used for external validation and to investigate the effectiveness of DEEPTICH. Detailed data selection criteria are provided in Supplementary Fig. S1.

The EDIIS database was established based on the International Classification of External Causes of Injuries of the World Health Organization. The database includes prehospital records, clinical findings, diagnoses, treatments, ED dispositions, inpatient information, demographics, and injury-related factors. Information on 1.8 million patients from 25 EDs is included in this surveillance database. Each participating hospital assigned coordinators for data collection and management, and the Korea Centers for Disease Control regularly checked the quality of all data from the 25 EDs. In this study, the records of 750,000 patients with head injuries from January 1, 2011 to December 31, 2017 were used for derivation and time-split validation.

The SMC database contains medical records from a tertiary academic hospital in South Korea with approximately 2000 beds and an average of 200 ED patient visits per day. This database includes records of procedures and clinical notes, as well as the same information collected in EDIIS. The SMC dataset covered January 1, 2012 to December 31, 2019 and comprised 67,578 patient records. These data were used for external validation in this study.

Model training for deep learning

We used patient demographics, vital signs, mental status, injury-related factors, date- and time-related information regarding injury onset and visit, and symptoms as predictors. Demographics included age and sex. Vital signs included respiratory rate, body temperature, systolic and diastolic blood pressure, and pulse rate. Mental status data included the “alert, verbal, pain, unresponsive” scale and Glasgow coma scale (GCS) scores. Injury-related factors included the injury mechanism, activity during the injury, alcohol-related factors, intention, place of injury, material causing the injury, and the time from injury onset to the visit. Time-related predictors included date and time information on the injury onset and visit, such as the day of the week and the hour of injury onset.
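As a rough illustration of how such a mixed predictor set could be assembled, the sketch below scales the continuous variables and one-hot encodes the categorical ones; the column names and preprocessing choices are assumptions for illustration, not the authors' pipeline.

```python
# Illustrative predictor assembly (a sketch only); column names are hypothetical
# stand-ins for the demographic, vital sign, mental status, injury-related,
# and time-related predictors described above.
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric_cols = ["age", "respiratory_rate", "body_temperature", "systolic_bp",
                "diastolic_bp", "pulse_rate", "gcs_total", "hours_from_injury_to_visit"]
categorical_cols = ["sex", "avpu_scale", "injury_mechanism", "activity_during_injury",
                    "alcohol_related", "intention", "place_of_injury",
                    "injury_material", "day_of_week", "hour_of_onset"]

preprocess = ColumnTransformer([
    ("num", StandardScaler(), numeric_cols),                            # scale continuous predictors
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),  # encode categorical predictors
])

# Hypothetical usage with a pandas DataFrame `df` holding the raw records:
# X = preprocess.fit_transform(df[numeric_cols + categorical_cols])
```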

Multiple outcomes were used to train the model. The primary outcome was ICH, including cerebral contusion, subdural hemorrhage, epidural hemorrhage, subarachnoid hemorrhage, intraventricular hemorrhage, intracerebral hemorrhage, and cerebellar hemorrhage. Other outcomes, such as TBIs other than ICH, visit dispositions, and operations related to head injuries, were considered secondary outcomes. The purpose of the secondary outcomes was to improve the prediction performance of the model through multi-task learning.

Machine learning algorithm

Multi-task learning was used as the ML approach to classify ICH and the secondary outcomes. Multi-task deep learning trains multiple learning tasks simultaneously; its advantage is that it can exploit useful information from the commonalities and differences across tasks during training. In our case, such commonalities and differences existed among hemorrhages and other TBIs, visit dispositions (patients with ICHs and serious TBIs are more likely to be admitted to the hospital), and head-injury-related operations (some ICHs require acute interventions).
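As an illustration of this idea, the following is a minimal multi-task sketch in PyTorch: a shared trunk with one binary head per task. It is not the authors' implementation; the layer sizes, task names, and loss weights are assumptions for illustration.

```python
# Minimal multi-task learning sketch: a shared trunk feeds one binary head per task,
# and the training loss is a weighted sum of per-task binary cross-entropy losses.
import torch
import torch.nn as nn

class MultiTaskNet(nn.Module):
    def __init__(self, n_features: int, task_names):
        super().__init__()
        self.trunk = nn.Sequential(               # shared representation across tasks
            nn.Linear(n_features, 128), nn.ReLU(),
            nn.Linear(128, 64), nn.ReLU(),
        )
        # one logit head per task: "ich" (primary) plus secondary tasks
        self.heads = nn.ModuleDict({t: nn.Linear(64, 1) for t in task_names})

    def forward(self, x):
        h = self.trunk(x)
        return {t: head(h).squeeze(-1) for t, head in self.heads.items()}

tasks = ["ich", "other_tbi", "admission", "head_injury_operation"]  # hypothetical task names
model = MultiTaskNet(n_features=64, task_names=tasks)
bce = nn.BCEWithLogitsLoss()
task_weights = {"ich": 1.0, "other_tbi": 0.3, "admission": 0.3, "head_injury_operation": 0.3}

def multitask_loss(logits: dict, labels: dict) -> torch.Tensor:
    # weighted sum of per-task binary cross-entropy losses
    return sum(task_weights[t] * bce(logits[t], labels[t]) for t in logits)

# Example forward/backward pass with random data
x = torch.randn(8, 64)
labels = {t: torch.randint(0, 2, (8,)).float() for t in tasks}
loss = multitask_loss(model(x), labels)
loss.backward()
```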

Algorithm threshold selection

There were numerous options for selecting an appropriate binary prediction threshold for DEEPTICH, including the Youden index, the threshold yielding the best F1 score, thresholds corresponding to 0.97/0.95/0.90 sensitivity, and the mean threshold among groups of patients whose outcome was positive. We generated case examples for each option, and each option was reviewed by a clinician. Finally, the threshold was selected so that the negative predictive value in each age group was 0.99, because this best reflected clinical decision-making in real clinical circumstances.
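The following sketch illustrates one way such an NPV-targeted threshold could be searched per age group; it is an assumed procedure for illustration, not the authors' code.

```python
# Illustrative threshold search: for each age group, pick the largest threshold whose
# negative predictive value (NPV) is at least the target (0.99 in the study).
import numpy as np

def threshold_for_npv(y_true: np.ndarray, y_prob: np.ndarray, target_npv: float = 0.99) -> float:
    best = 0.0
    for thr in np.unique(y_prob):               # candidate thresholds from observed probabilities
        neg = y_prob < thr                      # cases predicted negative at this threshold
        if neg.sum() == 0:
            continue
        npv = (y_true[neg] == 0).mean()         # fraction of predicted negatives that are true negatives
        if npv >= target_npv:
            best = max(best, thr)               # larger threshold => higher specificity, same NPV guarantee
    return best

# Hypothetical usage per age group, given arrays of labels y[g] and probabilities p[g]:
# thresholds = {g: threshold_for_npv(y[g], p[g]) for g in age_groups}
```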

Participants for decision simulation study

The participants were residents and specialists in the ED of the SMC, a single tertiary academic hospital in South Korea. We defined DEEPTICH effectiveness as a change in the head CT decision following the DEEPTICH recommendation when the participant's initial decision differed from that recommendation. The mean and SD of DEEPTICH effectiveness calculated for five emergency physicians were 51.0% and 10.6%, respectively. From these values, we derived the required number of participants as 20 for a 95% confidence interval half-width of ±5%. Allowing for possible study failures, we enrolled 22 emergency physicians.
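For reference, a back-of-the-envelope sample-size sketch under a normal approximation is shown below; the exact formula used in the study is not specified, so this is only an illustration.

```python
# Normal-approximation sample-size sketch; an illustrative assumption, not the
# authors' documented calculation.
import math

sd = 10.6          # SD of DEEPTICH effectiveness (%) observed in five physicians
half_width = 5.0   # desired 95% CI half-width (%)
z = 1.96           # two-sided 95% normal quantile

n = math.ceil((z * sd / half_width) ** 2)
print(n)  # about 18 under this approximation; the study set the target at 20 and enrolled 22
```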

Simulation case selection

We conducted a simulation study using pediatric cases because the decision to order a head CT is more challenging in the pediatric population than in adult patients. From the SMC validation dataset, we selected 24 pediatric cases who had visited the ED with a fall mechanism of injury. Cases were stratified by PECARN risk and patient age (Supplementary Table S3).

This study focused on the effectiveness of AI on physicians' decisions, not on its accuracy; therefore, we selected only cases in which the DEEPTICH prediction matched the actual patient outcome. We performed a trend test (Cochran-Armitage test) and Spearman correlation analysis to verify the validity of the simulated cases. For high-risk cases, we confirmed that the extent of head CT ordering decisions was significant and that the head CT ordering willingness scores on the five-point scale were high.
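A sketch of how this validity check could be run is shown below, using a hand-coded chi-square test for trend in proportions (Cochran-Armitage style) and Spearman correlation; the counts and scores are hypothetical.

```python
# Illustrative validity check: trend in head CT ordering proportions across ordered
# risk strata, plus Spearman correlation with the five-point willingness score.
import numpy as np
from scipy import stats

def trend_test(events: np.ndarray, totals: np.ndarray, scores: np.ndarray):
    """Chi-square test for trend in proportions across ordered groups (1 df)."""
    events, totals, scores = map(np.asarray, (events, totals, scores))
    n, r = totals.sum(), events.sum()
    p_bar = r / n
    u = (scores * events).sum() - p_bar * (scores * totals).sum()
    var_u = p_bar * (1 - p_bar) * ((scores ** 2 * totals).sum() - (scores * totals).sum() ** 2 / n)
    chi2 = u ** 2 / var_u
    return chi2, stats.chi2.sf(chi2, df=1)

# Hypothetical counts of head CT orders per PECARN stratum (low, intermediate, high)
chi2, p_trend = trend_test(events=np.array([5, 30, 60]),
                           totals=np.array([110, 80, 70]),
                           scores=np.array([1, 2, 3]))

# Spearman correlation between case risk and willingness score (hypothetical data)
risk = np.repeat([1, 2, 3], 8)                                   # 24 cases over three strata
willingness = np.clip(risk + np.random.default_rng(0).normal(0, 1, 24), 1, 5)
rho, p_rho = stats.spearmanr(risk, willingness)
```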

Four sub-questions were asked for each case. Sub-questions 1 and 2 were asked before the DEEPTICH recommendation, and sub-questions 3 and 4 after it. Sub-questions 1 and 3 were identical: the binary head CT order decision. Sub-questions 2 and 4 were also identical: the willingness to order a head CT on a five-point scale. The participants answered all four sub-questions. DEEPTICH presented three pieces of information: (1) the ICH probability of the case; (2) the top percentage rank of that probability within the same age group; and (3) the head CT order recommendation from DEEPTICH. The process of the simulation scenario is shown in Textbox 1.
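For clarity, the structure of a single simulation case interaction can be summarized with simple data structures, as in the hypothetical sketch below; the field names are illustrative and not taken from the study materials.

```python
# Hypothetical data structures for one simulation case card; field names are
# illustrative placeholders for the three DEEPTICH outputs and four sub-questions.
from dataclasses import dataclass

@dataclass
class DeepTichOutput:
    ich_probability: float        # (1) predicted ICH probability for the case
    age_group_top_percent: float  # (2) where this probability ranks within the same age group
    recommend_head_ct: bool       # (3) binary head CT order recommendation

@dataclass
class CaseResponse:
    ct_decision_before: bool      # sub-question 1: binary order decision before the recommendation
    willingness_before: int       # sub-question 2: five-point willingness score before
    ct_decision_after: bool       # sub-question 3: binary order decision after the recommendation
    willingness_after: int        # sub-question 4: five-point willingness score after

example = CaseResponse(ct_decision_before=False, willingness_before=2,
                       ct_decision_after=True, willingness_after=4)
```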

Textbox 1

Process of the simulation scenario.

Survey development

We performed a survey to investigate the factors affecting DEEPTICH effectiveness. The survey consisted of five questions regarding general medical AI and seven questions regarding the AI used in this study (i.e., DEEPTICH).

Study process

Consent was obtained from all participants. The participants were provided with the model information and the PECARN rules. Use of the PECARN rules was left to the discretion of the physician, and its frequency was not measured. We explained the characteristics and development of DEEPTICH and its clinical performance, i.e., its sensitivity, specificity, negative predictive value, and positive predictive value. Details of model development are provided in the Supplementary Method. The participants answered four sub-questions for each simulation case: two questions before the DEEPTICH recommendation and the same two questions after it (Textbox 1). The simulation cases were presented in a Q-card format. After completing the 24 simulation cases, the participants responded to the survey.

Outcomes

The primary outcome was the change in the binary head CT order decision when the initial decision differed from the DEEPTICH recommendation. The secondary outcomes were the change in the five-point willingness scale score and the factors affecting decision changes. We considered DEEPTICH effective when physicians changed their binary decisions based on the DEEPTICH recommendation.

Statistical method

We used the McNemar test for changes in proportions in paired data and the paired t-test for continuous values to compare responses before and after the DEEPTICH recommendation. We conducted univariable and multivariable logistic regression analyses to evaluate the factors associated with DEEPTICH effectiveness. p < 0.05 was considered statistically significant for all tests. R software (version 4.0.2) was used for the statistical analysis.
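The analyses were performed in R (version 4.0.2); purely as an illustration, the Python sketch below shows how each of these tests could be applied to hypothetical data of the same shape.

```python
# Illustrative analysis sketch (not the study's R code); all data below are
# hypothetical and only show how each test could be applied.
import numpy as np
from scipy import stats
import statsmodels.api as sm
from statsmodels.stats.contingency_tables import mcnemar

# McNemar test on paired binary decisions (rows: before, columns: after the recommendation)
table = np.array([[30, 12],
                  [4, 20]])
print(mcnemar(table, exact=True).pvalue)

# Paired t-test on five-point willingness scores before vs. after the recommendation
before = np.array([3, 2, 4, 3, 5, 2, 3, 4])
after = np.array([4, 2, 5, 4, 5, 3, 3, 5])
print(stats.ttest_rel(before, after).pvalue)

# Logistic regression: acceptance of the recommendation (1/0) on physician/case covariates
rng = np.random.default_rng(0)
X = sm.add_constant(rng.normal(size=(50, 2)))   # e.g., years of experience, case risk (hypothetical)
y = rng.integers(0, 2, size=50)
print(sm.Logit(y, X).fit(disp=0).params)
```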