Key Summary Points

Why carry out this study?

Rare eye diseases are the leading cause of visual impairment and blindness in children and young adults and can markedly reduce the quality of life of patients and their families. There is therefore an urgent need for automated tools that can diagnose rare eye diseases quickly and accurately to support patients.

Recently, large language models (LLMs), especially ChatGPT (Chat Generative Pre-trained Transformer), have motivated numerous researchers to evaluate their capabilities on a wide range of tasks. Nevertheless, the capability of GPT-4 to identify rare eye diseases in ophthalmology is still largely unknown.

This study aims to evaluate the capability and explore the potential use of GPT-4 for identifying rare ophthalmic diseases in simulated patient, family physician, and junior ophthalmologist scenarios.

What was learned from the study?

Most responses (83.3%) output by GPT-4 were graded as “appropriate” by senior ophthalmologists in terms of suitability. GPT-4 provided mostly “right” diagnoses when chief complaints, history of present illness, and descriptions of ophthalmic and other relevant examinations focusing on ocular imaging were provided.

In the future, GPT-4 may serve as a consultation-assisting tool for patients and family physicians to obtain referral suggestions and as an assisting tool for junior ophthalmologists to diagnose rare eye diseases. However, it is important to approach GPT-4 with caution and to acknowledge the need for verification and careful referrals in clinical settings.

Introduction

There are approximately 7000 rare diseases, and patients with rare diseases are estimated to constitute about 10% of the population [1]. Many rare diseases markedly reduce the quality of life of patients and their families, yet timely and accurate diagnoses remain difficult [2]. Rare eye diseases are the leading cause of visual impairment and blindness in children and young adults in Europe. Over 900 eye disorders are included in this heterogeneous group of conditions, ranging from relatively prevalent disorders, such as retinitis pigmentosa, to very rare entities, such as developmental eye anomalies [3]. There is therefore an urgent need for automated tools that can diagnose rare eye diseases quickly and accurately to support patients.

Deep learning methods have already been shown to achieve good performance in many healthcare tasks, and some works have attempted to use them to address the challenges of detecting rare eye diseases. Burlina et al. [4] suggested the potential benefits of low-shot methods for rare ophthalmic disease diagnostics when only a limited number of annotated training retinal images is available. Yoo et al. [5] introduced a method that combined few-shot learning with a generative adversarial network to improve the applicability of deep learning to the optical coherence tomography diagnosis of rare retinal diseases. However, these methods only output diagnostic results, do not offer explanations, and cannot interact with end-users. Studies of conversational chatbots that different end-users can interact with to diagnose rare eye diseases and obtain explanations are lacking.

Applying expert knowledge to refine the output of artificial intelligence models is common in practice, and there have been various efforts to investigate this field. Recently, large language models (LLMs), especially ChatGPT (Chat Generative Pre-trained Transformer), trained with a reinforcement learning from human feedback strategy, have attracted public, media, and scientific attention from various fields worldwide [6] and have motivated numerous researchers to evaluate their capabilities on various tasks, e.g., data analysis [7], software development [8], and education [9]. A few reports have already demonstrated the potential applications of ChatGPT in medicine, including in the field of ophthalmology. In the medical field, Kanjee et al. [10] reported that GPT-4 provided a numerically superior mean differential quality score in a complex diagnostic challenge compared with some differential diagnosis generators. Sorin et al. [11] assessed the potential application of ChatGPT as a clinical decision support tool for patient management in breast tumor board decisions. In the ophthalmology field, Mihalache et al. [12] designed a study to evaluate ChatGPT’s ability to answer practice questions for board certification in ophthalmology. Balas et al. [13] investigated ChatGPT’s accuracy in formulating provisional and differential diagnoses from text case report descriptions. Antaki et al. [14] tested ChatGPT on two popular multiple-choice question banks commonly used to prepare for the high-stakes Ophthalmic Knowledge Assessment Program examination, on which ChatGPT showed encouraging performance. Rasmussen et al. [15] evaluated ChatGPT’s responses to typical patient-related questions on vernal keratoconjunctivitis. Nevertheless, the capability of GPT-4 to identify rare eye diseases in ophthalmology is still largely unknown [16].

In this study, we aim to qualitatively evaluate the ability of GPT-4, the recent successor to ChatGPT, to identify rare ophthalmic diseases in simulated patient, family physician, and junior ophthalmologist scenarios.

Methods

We selected ten cases of treatable rare ophthalmic disease [17] with a confirmed diagnosis (i.e., the ground truth) from the publicly available EyeRounds service [18]. For each case, we simulated different end-users, including patients, family physicians, and junior ophthalmologists, using GPT-4. Because these end-users have different information available, they may provide different input when using GPT-4. We assumed that the three end-users would input the following information into GPT-4, respectively: Scenario 1 (patient): chief complaints; Scenario 2 (family physician): chief complaints and history of present illness; Scenario 3 (junior ophthalmologist): chief complaints, history of present illness, and descriptions of ophthalmic and other necessary examinations focusing on ocular imaging. GPT-4 was accessed on May 10, 2023, via https://chat.openai.com/, and all responses were obtained and recorded at that time. The prompts were taken from EyeRounds and consisted of the scenario-specific information above, followed by the question “What eye disease may I/he/she have?” We evaluated GPT-4’s responses in two respects: suitability (appropriate or inappropriate) and accuracy (right or wrong). Each case was assigned for grading to a senior ophthalmologist (> 10 years’ experience) specialized in the relevant field who was blinded to the ground truth; the grader rated each GPT-4 response as “appropriate” or “inappropriate.” An “appropriate” response was defined as one containing no misconceptions and giving a reasonable description of the differential diagnosis process based on the input information in each scenario. Each response was further classified as “right” or “wrong,” where a “right” response was one in which the diagnosis confirmed by GPT-4 matched the ground truth.
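
The three simulated scenarios differ only in how much of the EyeRounds case description is included in the prompt. As an illustration of this incremental prompt design, the sketch below shows how the scenario-specific prompts could be assembled and submitted programmatically. This is a hypothetical reconstruction using the OpenAI Python client; the study itself entered all prompts manually through the chat.openai.com web interface, and the model name, case fields, and helper function are assumptions rather than part of the study protocol.

```python
# Hypothetical sketch of the scenario-specific prompting described above.
# The study used the chat.openai.com web interface; the OpenAI Python client,
# model name, and case fields below are illustrative assumptions only.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Minimal representation of one case (fields paraphrased, not taken verbatim from EyeRounds).
case = {
    "chief_complaint": "Painless, progressive bilateral vision loss",
    "history_of_present_illness": "A young adult with sequential central vision loss over several weeks ...",
    "exam_and_imaging": "Visual acuity, fundus findings, OCT, and visual field descriptions ...",
}

def build_prompt(case: dict, scenario: str) -> str:
    """Assemble the prompt for one simulated end-user scenario."""
    parts = [case["chief_complaint"]]                       # Scenario 1 (patient): chief complaints only
    if scenario in ("family_physician", "junior_ophthalmologist"):
        parts.append(case["history_of_present_illness"])    # Scenario 2 adds history of present illness
    if scenario == "junior_ophthalmologist":
        parts.append(case["exam_and_imaging"])               # Scenario 3 adds exam and imaging descriptions
    parts.append("What eye disease may he/she have?")
    return "\n".join(parts)

for scenario in ("patient", "family_physician", "junior_ophthalmologist"):
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": build_prompt(case, scenario)}],
    )
    print(scenario, "->", response.choices[0].message.content)
```

In this sketch, each scenario simply appends more of the case record before the fixed question, mirroring the information assumed to be available to each end-user.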

This article is based on an online database and does not contain any new studies with human participants performed by any of the authors; therefore, ethics committee approval was not required.

Results

Twenty-five of the 30 (83.3%) responses were graded as “appropriate” by the senior ophthalmologists. In the simulated patient, family physician, and junior ophthalmologist scenarios, seven (70%), ten (100%), and eight (80%) responses, respectively, were graded as “appropriate.” When compared with the ground truth, GPT-4 output only several broad possible diseases in the simulated patient scenario, and no responses were “right.” In the simulated family physician scenario, five (50%) of the responses output by GPT-4 were “right.” In the simulated junior ophthalmologist scenario, most of the responses output by GPT-4 (nine, 90%) were “right.” Details are summarized in Table 1.

Table 1 Evaluation of GPT-4’s output for the ten cases in different scenarios

Discussion

Our study found that in the patient and family physician scenarios, most of GPT-4’s responses were “appropriate.” However, in these two scenarios, GPT-4 could not output “right” responses for most cases. Specifically, in the patient scenario, GPT-4 tended to output several possible but relatively broad and common eye diseases (e.g., refractive errors, retinal diseases, and glaucoma). In the family physician scenario, GPT-4 began to output more specific responses (e.g., optic neuritis for case 7); however, most of the responses were still “wrong.” The reason could be that the prompts for these two simulated scenarios contained insufficient information about the eye condition, and GPT-4 could not ask for additional information, such as visual acuity or medical and ocular history, to refine the diagnosis as ophthalmologists usually do. This indicates that the current GPT-4 is not a suitable diagnostic tool in the patient and family physician scenarios. Nevertheless, GPT-4 may still serve as a consultation-assisting tool for referral suggestions in the future.

In the junior ophthalmologist scenario, GPT-4 provided more specific diagnoses, 90% of the responses were “right,” and it could explain in detail how it arrived at each diagnosis. For the only case classified as “wrong,” GPT-4’s primary diagnosis was optic neuritis, which differed from the ground truth (case 7, Leber hereditary optic neuropathy, LHON). Nevertheless, GPT-4 still mentioned that LHON should be considered (Fig. 1), and its explanation of why it diagnosed optic neuritis was graded as “appropriate” by the senior ophthalmologists. Our results indicate that GPT-4 may serve as an assisting tool for junior ophthalmologists to diagnose rare eye diseases quickly and accurately.

Fig. 1

Example of case 7 in the different scenarios: the input and output of case 7, the only case that GPT-4 got “wrong” in the junior ophthalmologist scenario. The ground truth was Leber hereditary optic neuropathy (LHON). a Input and output of GPT-4 in the patient scenario: GPT-4 output several possible but relatively broad and common eye diseases. b Input and output of GPT-4 in the family physician scenario: GPT-4 output more specific diagnostic results. c Input and output of GPT-4 in the junior ophthalmologist scenario: GPT-4 output optic neuritis as the most likely diagnosis and listed the reasons for this result. Although the primary diagnosis output by GPT-4 (optic neuritis) differed from the ground truth (LHON), GPT-4 still output “However, it’s a bit unusual for optic neuritis to be painless and bilateral in a patient this age, so other conditions such as Leber’s Hereditary Optic Neuropathy (LHON) should also be considered”

GPT-4 has some inherent limitations. First, uploading queries to OpenAI’s servers for computation raises concerns about patient privacy, which is especially sensitive in healthcare. Second, GPT-4 may output misconceptions, as it was designed for general purposes rather than for making clinical diagnoses and was trained on unverified data. Third, OpenAI has not publicly disclosed specific information on the datasets used for model training, so there is a risk of overestimating GPT-4’s capabilities if EyeRounds was used to train the model. In addition, GPT-4 may generate different responses, and even different primary diagnoses, when end-users submit the same input multiple times, indicating that it still lacks robustness and cannot provide end-users with consistent suggestions and diagnoses. Lastly, the technical details of how GPT-4 generates its responses are not known. This lack of transparency hinders users’ ability to exert fine-tuned control over the generated responses [19], which may adversely affect end-users in medical applications. Beyond these concerns, GPT-4 faces several other challenges: it requires enormous computational resources and can function effectively only in large computing environments, it has difficulty delivering up-to-date information, and “hallucinations” occur [20]. In conclusion, despite GPT-4’s impressive capabilities across various domains, we must still acknowledge its limitations.
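
To make the robustness concern concrete, the following minimal sketch (not part of the study, which used the web interface only) resubmits one identical prompt several times through the OpenAI Python client and prints the answers, which may name different primary diagnoses across runs; the model name and prompt text are assumptions for illustration.

```python
# Illustrative sketch (not part of the study protocol): resubmitting an identical
# prompt several times via the OpenAI API to observe response variability.
# The model name and prompt are assumptions; the web interface used in this study
# does not expose sampling parameters such as temperature.
from openai import OpenAI

client = OpenAI()
prompt = "Painless, progressive bilateral vision loss. What eye disease may he have?"

answers = []
for _ in range(3):
    r = client.chat.completions.create(
        model="gpt-4",
        temperature=1.0,  # default-like sampling; lowering it reduces, but does not eliminate, variability
        messages=[{"role": "user", "content": prompt}],
    )
    answers.append(r.choices[0].message.content)

# Different runs may name different primary diagnoses for the same input.
for i, answer in enumerate(answers, 1):
    print(f"Run {i}: {answer[:120]}...")
```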

Future research should compare GPT-4 with other state-of-the-art LLMs, e.g., Bard or LLaMA, using different languages in the ophthalmology field. Artificial intelligence chatbots that are designed and trained specifically for ophthalmic diagnosis, and chatbots that can actively ask for information that end-users have not provided, as ophthalmologists usually do, are warranted. Moreover, direct input of images into GPT-4 will become available to the public next year. It can be anticipated that if the model can capture information from images and output relevant descriptions, it could be applied in clinical settings to assist junior ophthalmologists in diagnosing rare eye diseases.

Conclusion

To our knowledge, this is the first proof-of-concept brief report showing that GPT-4 can potentially identify rare eye diseases in simulated patient, family physician, and junior ophthalmologist scenarios. The results indicate GPT-4’s considerable potential as a consultation-assisting tool for patients and family physicians to obtain referral suggestions. Additionally, GPT-4 may serve as an assisting tool for junior ophthalmologists to diagnose rare eye diseases quickly and accurately in the future, especially once feeding images into GPT-4 becomes available and GPT-4 can capture the underlying information in images. However, it is important to approach GPT-4 with caution and to acknowledge the need for verification and careful referrals in clinical settings.