Introduction

The application of Artificial Intelligence (AI) in medicine has increased over the last decade. Several studies have reported the application of AI to clinical grading systems, assessment of cochlear implant function, parathyroid recognition, and prediction of clinical prognosis in otorhinolaryngology–head and neck surgery (ORL–HNS) [1,2,3,4,5]. Ethical concerns such as autonomy, beneficence, nonmaleficence, and justice were emphasized in the paper by Arambula et al. [6].

Chatbots are currently one of the trending topics in AI. According to the literature, ChatGPT (by OpenAI) is one of the most commonly used chatbots. Several studies have investigated the application of ChatGPT to medical examinations, clinical diagnosis, article writing, and related tasks [4, 7,8,9].

UpToDate® is a well-known medical knowledge source for physicians that is used in daily clinical practice worldwide, including in our hospital [10]. Studies have reported its positive effect on healthcare quality, its role in decreasing diagnostic error and mortality, and its association with a shorter length of hospital stay and a lower complication rate [11,12,13,14]. Another study reported that UpToDate was faster and provided more detailed information than similar database systems [15].

In this study, we aimed to compare ChatGPT with UpToDate® in terms of usefulness and reliability for common clinical presentations of ORL–HNS.

Materials and methods

Study design: cross-sectional comparative

Study description

ChatGPT version 3.5 [accessed on 27 August 2023 (cases 1–6) and 23 October 2023 (cases 7–25)] and UpToDate® [accessed on 28 August 2023 (cases 1–6) and 23 October 2023 (cases 7–25)] were used in this study. We created 25 case scenarios related to the subspecialties of ORL–HNS, drawing on common clinical presentations of ORL–HNS reported in the literature [16,17,18,19,20,21,22,23]. The case scenarios covered the sexes in an almost equal ratio (female/male, 12:13) and patients from different age segments spanning seven decades of life. The clinical presentations are described in Table 1. We then asked ChatGPT, "Tell me how would you manage a 'number of the age'-year-old male/female patient comes with '... symptoms' that started/for/since day/week/month. Give me references at the end of your response," and at the same time searched the case on UpToDate.

Table 1 Clinical presentations

We assessed the accuracy of the references in the ChatGPT answers on the following scale: 0, the reference is not available at the stated DOI number and source link, or is incorrect; 1, the reference is available at the stated DOI number and source link but only loosely related to the specific topic; 2, the reference is available at the stated DOI number and source link and strongly related to the topic. We then calculated the mean reference score for each answer. In addition, we used a score from 1 to 3, reported by Johnson et al. [24], to assess the usefulness of the ChatGPT and UpToDate answers: 1, incomplete answer, not useful; 2, semi-complete answer, somewhat useful but requiring some additional knowledge; and 3, complete answer, useful for management.
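As a minimal illustration of how these scores were aggregated, the short R sketch below computes the mean reference-accuracy score of one answer and the mean usefulness score across the four reviewers; the values are hypothetical examples, not study data.

```r
# Hypothetical reference-accuracy scores (0, 1, or 2) for the references
# cited in one ChatGPT answer
ref_scores <- c(0, 1, 2, 0)
mean(ref_scores)         # mean reference-accuracy score for this answer

# Hypothetical usefulness scores (1-3) given by the four reviewers to one answer
usefulness_scores <- c(2, 3, 3, 2)
mean(usefulness_scores)  # mean usefulness score for this answer
```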

Afterward, four reviewers assessed the ChatGPT answer and the related UpToDate article retrieved by the search for each case scenario. Reviewers were blinded to each other's assessment results.

Ethical approval

Ethical approval was not applicable to this study because no patient data were included.

Statistical analysis

Frequencies and percentages are given for categorical variables, and median (IQR: Q1–Q3) values are given for numerical variables as descriptive statistics. The agreement among the reviewers' usefulness responses for ChatGPT and UpToDate was determined using the percent agreement (PA), Fleiss's κ, and Gwet's AC1 coefficients [25,26,27]. All coefficients are presented with 95% confidence intervals (CI). Because of the known problems with the kappa coefficient [28], Gwet's AC1 coefficient, which gives more consistent and reliable results, was preferred; nevertheless, following the published guideline [26], the other two coefficients are also reported so that more than one coefficient of agreement is presented. The coefficients were interpreted with Gwet's probabilistic method according to the Landis and Koch scale [29]. The McNemar–Bowker test was used to test the symmetry between each reviewer's ChatGPT and UpToDate usefulness responses. In addition, the Wilcoxon signed rank test was used to compare the mean ChatGPT and UpToDate usefulness scores calculated across reviewers.

Statistical significance was set at p < 0.05, and all statistical analyses were performed using R software (version 4.0.5; packages: arsenal, irrCAC, ggplot2; R Foundation for Statistical Computing, Vienna, Austria; http://r-project.org).
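The following R sketch outlines the analyses described above; it is only an illustrative outline under assumed data structures (a hypothetical 25-case by 4-reviewer rating matrix per knowledge source), not the exact analysis script used in the study.

```r
library(irrCAC)  # provides pa.coeff.raw, fleiss.kappa.raw, gwet.ac1.raw

# Hypothetical usefulness ratings: 25 case scenarios (rows) x 4 reviewers (columns)
chatgpt_ratings  <- matrix(sample(1:3, 100, replace = TRUE), nrow = 25)
uptodate_ratings <- matrix(sample(2:3, 100, replace = TRUE), nrow = 25)

# Inter-reviewer agreement coefficients with 95% confidence intervals
pa.coeff.raw(chatgpt_ratings)$est       # percent agreement
fleiss.kappa.raw(chatgpt_ratings)$est   # Fleiss's kappa
gwet.ac1.raw(chatgpt_ratings)$est       # Gwet's AC1

# McNemar-Bowker symmetry test on one reviewer's paired 3 x 3 score table
paired_scores <- table(factor(chatgpt_ratings[, 1], levels = 1:3),
                       factor(uptodate_ratings[, 1], levels = 1:3))
mcnemar.test(paired_scores)

# Wilcoxon signed rank test on mean usefulness scores calculated across reviewers
wilcox.test(rowMeans(chatgpt_ratings), rowMeans(uptodate_ratings), paired = TRUE)
```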

Results

A comparison of the ChatGPT answers with the UpToDate search results is provided in Appendix 1 in the supplementary material.

UpToDate supported its information with references from peer-reviewed journals, conference papers, book chapters, and other sources, whereas ChatGPT did not give references for some questions. The overall mean accuracy score of the references in the ChatGPT answers was 0.25 (weak/unrelated); the mean score for each question is given in Appendix 1 in the supplementary material.

The mean usefulness score was 1.5 ± 0.51 for ChatGPT and 2.73 ± 0.31 for UpToDate. Each reviewer scored the UpToDate responses 2 or 3 points; therefore, UpToDate had a higher overall mean score than ChatGPT. The median (Q1–Q3) was 1.00 (1.25–2.00) for ChatGPT and 2.63 (2.75–3.00) for UpToDate, and the difference was statistically significant (Wilcoxon test, p < 0.001) (Tables 2, 3 and Fig. 1). When the usefulness scores of the two sources were compared for each reviewer, the difference was statistically significant (McNemar–Bowker test, p < 0.001 for each reviewer). The distributions of the mean usefulness scores for ChatGPT and UpToDate are shown in Figs. 2 and 3, respectively.

Table 2 Distribution of usefulness score in ChatGPT and UpToDate
Table 3 Agreement among the usefulness responses of reviewers for ChatGPT and UpToDate
Fig. 1 Wilcoxon signed rank test result for the comparison of the ChatGPT and UpToDate mean usefulness scores

Fig. 2 The mean usefulness score distribution for ChatGPT

Fig. 3 The mean usefulness score distribution for UpToDate

Unlike ChatGPT, UpToDate supported its topics with algorithms, figures, and tables. ChatGPT accompanied many answers with the declaration "I am not a doctor" and the advice to consult a physician for professional medical advice (highlighted in bold in Appendix 1 in the supplementary material). The knowledge provided by ChatGPT was extracted from sources limited to 2021 and earlier (see the end of the answer to the first case scenario in Appendix 1 in the supplementary material).

Discussion

The usage of AI in medicine is increasing, and its application to surgical fields has been a trend in recent years. ChatGPT (version 3.5) is a free AI chatbot released by OpenAI at the end of 2022, and it quickly became a popular research topic among doctors and researchers. A PubMed search with the keyword "ChatGPT" currently returns over a thousand articles (accessed 28 Aug 2023).

There are a limited number of studies evaluating ChatGPT in ENT&HNS in the literature, most of which have focused on examination-based tasks. Brennan et al. reported the benefit of ChatGPT for ear, nose, and throat (ENT) surgical education [30]. Qu et al. evaluated the diagnostic application of ChatGPT and reported low quality of the chatbot [4]. Hoch et al. assessed ChatGPT on single- and multiple-choice ENT board questions, where it achieved a low percentage of correct answers [8]. Other studies evaluated the triage and radiologic diagnostic accuracy of ChatGPT, but its rate of accurate decisions was below that of physicians [31, 32]. Ayoub et al. compared ChatGPT with Google Search and reported that ChatGPT performed well for general medical knowledge but worse than Google Search for medical advice [33].

UpToDate differs from ChatGPT in requiring a subscription fee, either institutional or personal [34]. ChatGPT, by contrast, was free when it was released, and version 3.5, which was used in our study, is still free, making it useful and reachable for all physicians; the newer version, however, requires payment [35]. In addition, ChatGPT can search more databases/websites and extract knowledge from various sources and languages. UpToDate supports sixteen languages (accessed 28 Aug 2023), whereas ChatGPT can extract data from more than 25 languages (accessed 28 Aug 2023). In UpToDate, the articles' contents are the same in all languages, whereas in ChatGPT the answer may vary widely across languages.

Another nuance is that ChatGPT's answer depends on the question style and writing format; it requires "well-written" questions to produce better answers. We should emphasize that answers to the same question can also vary widely depending on the question style. We tried different versions of the question style and finally, unanimously, settled on the format "Tell me how would you manage a 'number of the age'-year-old male/female patient comes with '... symptoms' that started/for/since day/week/month. Give me references at the end of your response." This nuance is subjective and could be a source of bias for studies, such as ours, that ask ChatGPT open questions. When deciding on the question format, we considered the level of detail in the answers and additionally asked for references to strengthen the sourcing, because when we asked ChatGPT a question without the phrase "Give me references at the end of your response", it did not give any references. Therefore, a physician who wants references in order to find further information on the topic must add an extra sentence when asking the question, which decreases the usefulness and reliability of ChatGPT. By contrast, supporting the knowledge with references from peer-reviewed journals, conference papers, and book chapters increases reliability and makes the knowledge transparent in UpToDate, and presenting topics with algorithms, figures, and tables makes UpToDate more systematic and beneficial.

UpToDate's search tool retrieves related articles from its own database according to the search keywords, whereas ChatGPT draws on many websites and databases. Articles in UpToDate include main subheadings that make it easy for the physician to find the desired information quickly within the article. ChatGPT also provided subheadings when asked about patient management; however, these headings contained non-specific sentences. The subheadings therefore appear to be a useful feature of UpToDate. On the other hand, ChatGPT replies to questions quickly, unlike UpToDate, and reduces the time needed to reach the information, which is one of its strong features. UpToDate requires manually finding the related articles and the headings/subheadings within them, which takes time.

ChatGPT's information base is limited to 2021 because of its training; this is a weak feature of the chatbot with regard to newer and more up-to-date knowledge [33]. We observed the same limitation when examining the references in the chatbot's answers: in the reference sections of some answers, ChatGPT emphasized that its knowledge was last updated in September 2021. Informing users of this issue is a good point with regard to ethics. On the other hand, the medical knowledge in UpToDate is continuously reviewed and updated by doctors, well-experienced specialists, and academicians.

Interestingly, ChatGPT cited and recommended UpToDate when answering our questions in the 7th and 25th cases.

Previous studies have observed that ChatGPT gives medical recommendations that contradict basic medical knowledge [33]. This is an important concern for patient safety; we did not observe it in our study. UpToDate also gives medical recommendations, but these are evidence-based and supported by studies. In our study, most of the references in the ChatGPT answers were unrelated to the question, and some were inaccessible or unavailable. ChatGPT accompanied many answers with the declaration "I am not a doctor" and advised referral to physicians for professional medical advice, which is a good point with respect to ethical issues related to AI. In addition, sentences repeated within the same ChatGPT answer can make it wordy to read.

ChatGPT's answers may vary on different computers, in different locations, and at different times, so the questions in our study could have been answered differently for this reason. We used the same computer to ask all questions to ChatGPT.

Twenty-five clinical case scenarios were investigated in this study, which is a limited number. ChatGPT summarized the result itself, whereas we searched for and selected the appropriate monograph in UpToDate; because several monographs can match the same search in UpToDate, this selection is a subjective factor introduced by the authors. In this study, UpToDate had statistically significantly higher usefulness scores and reliability than ChatGPT.

Conclusion

In this study, we aimed to investigate the usefulness and reliability of ChatGPT in comparison with UpToDate for common clinical presentations of otorhinolaryngology–head and neck surgery. At this stage, UpToDate appears more useful and reliable than ChatGPT. Developers need to improve ChatGPT with evidence-based search and analysis capabilities and update its database.