Introduction

Accurate fracture classification substantially improves clinical decision-making, whereas inaccurate classification can lead to inappropriate treatment strategies. The Arbeitsgemeinschaft Osteosynthese (AO) classification is an established system for fracture evaluation; adherence to it streamlines workflows, improves diagnostic accuracy and treatment planning, and ultimately supports better patient outcomes (https://www.aofoundation.org/).

While a fracture is frequently described correctly in the report text, translating the imaging information into an accurate fracture classification is more challenging. Appropriate use of classification systems is therefore highly dependent on experience1 and is hampered by the coexistence of multiple classification systems2.

Artificial intelligence (AI)-based algorithms employing large language models (LLMs) can address this by comprehending and interpreting human language. OpenAI introduced ChatGPT to a wide audience in November 2022. This chatbot, specifically trained for conversation, is based on a generative pre-trained transformer (GPT); at the time of this study, it relied on GPT 3.5-Turbo. ChatGPT, and especially its GPT 4-based variant (released in March 2023, with access still limited at present), has been shown to provide substantial medical knowledge, sufficient to pass the United States Medical Licensing Examination (USMLE)3, 4.

ChatGPT enables rapid processing of complex information, and its potential applications in clinical radiology routines have been extensively explored and published in preprints. These comprise the preparation of radiological reports5, the simplification of radiological reports into plain language6, 7, and clinical decision support on differential diagnoses, diagnostic procedures, final diagnosis, and treatment8.

However, ChatGPT is limited to its training data (GPT 3.5-Turbo and GPT 4 were trained on data up to September 2021) and thus may lack the latest and most specialized knowledge, or may be biased by the large number of heterogeneous training sources. ChatGPT could therefore provide incorrect or incomplete information.

Recently, the feasibility of enhancing ChatGPT’s performance in producing recommendations according to imaging appropriateness guidelines was demonstrated by providing specialized knowledge to the chatbot9. For this, the data were structured via the LlamaIndex Python library, which allows such specific data to be connected to a general LLM in a modular fashion.

We sought to create a context-aware chatbot with knowledge of the AO classification criteria (FraCChat) to provide accurate fracture classification based on the report text. To explore this approach, we benchmarked radiologists and publicly available generic chatbots based on GPT 3.5-Turbo and GPT 4 against FraCChat variants built upon GPT 3.5-Turbo and GPT 4 and enhanced with knowledge of the AO classification criteria via a vectorized knowledge index.

Methods

A schematic of the study workflow is depicted in Fig. 1.

Figure 1

Schematic of the workflow of the case file creation, indexing of the context-aware chatbot and performance analysis. AO/OTA—Arbeitsgemeinschaft Osteosynthese/Orthopaedic Trauma Association.

Technical implementation of the chatbots

Data preparation and indexing

To develop and evaluate the proposed FraCChat chatbots, we relied on the AO/OTA Fracture and Dislocation Classification Compendium—2018 as a foundational knowledge base.

We employed LlamaIndex as an interface between the external data and GPT 3.5-Turbo or GPT 4 (https://github.com/jerryjliu/llama_index, Version 0.5.0). Text information was extracted from the classification compendium, and an index was created using the GPTVectorIndex function of LlamaIndex. For this, the document text was divided into smaller chunks of up to 512 tokens (the basic units of text processed by the model) and converted into data nodes. These nodes were then encoded using an embedding model (text-embedding-ada-002-v2 by OpenAI), and the result was stored in a dictionary-like structure.
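As an illustration, a minimal sketch of this indexing step is given below, assuming the compendium text has already been extracted to a local folder; class names, arguments, and paths are illustrative and vary between LlamaIndex versions.

```python
# Minimal sketch of the indexing step using an older (~v0.4/0.5) LlamaIndex
# interface; class names, arguments, and file paths are illustrative and
# differ between library versions.
from llama_index import SimpleDirectoryReader, GPTSimpleVectorIndex

# Load the text previously extracted from the AO/OTA compendium
# (the folder name is hypothetical).
documents = SimpleDirectoryReader("ao_compendium_text/").load_data()

# Chunk the text into nodes of up to 512 tokens and embed each node
# (text-embedding-ada-002 is the library's default embedding model).
index = GPTSimpleVectorIndex(documents, chunk_size_limit=512)

# Persist the dictionary-like index structure for reuse at query time.
index.save_to_disk("ao_index.json")
```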

Prompting strategy and answer synthesis

To customize the output of all chatbots for the case-based scenarios, the question posed to the system in each case was: "Examine the given fracture description and identify the AO Classification up to Subgroups. Begin your response always with 'Full AO Code:'".
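For the generic chatbots, this prompt together with the case description was sent directly to the model. A minimal sketch using the legacy (pre-1.0) openai Python interface is shown below; the exact client code and request parameters used in the study are not specified, so the details here are illustrative.

```python
# Sketch of querying a generic chatbot (legacy openai<1.0 interface);
# the request structure is an assumption, not the study's exact settings.
import openai

PROMPT = ('Examine the given fracture description and identify the AO '
          'Classification up to Subgroups. Begin your response always '
          'with "Full AO Code:"')

def classify_generic(case_text: str, model: str = "gpt-3.5-turbo") -> str:
    response = openai.ChatCompletion.create(
        model=model,  # "gpt-3.5-turbo" or "gpt-4"
        messages=[{"role": "user", "content": f"{PROMPT}\n\n{case_text}"}],
    )
    # The direct model output is captured as the response.
    return response["choices"][0]["message"]["content"]
```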

The direct output of GPT 3.5-Turbo and GPT 4 was captured as the response. For our context-aware chatbots, a multistep answer creation and refinement method was used. In a first step, the case description was transformed into an embedding using the same embedding model (text-embedding-ada-002-v2 by OpenAI), and the five best-matching data nodes were retrieved from the index. These nodes were then used in the answer creation step with either GPT 3.5-Turbo or GPT 4: the task defined by the given prompt, the case, and the retrieved references were presented to the language model as new input, together with an overarching prompt instructing it to answer the given task based on the presented information. The final output was then captured.
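A corresponding sketch of the context-aware (FraCChat) query, again using an older LlamaIndex interface with illustrative names and reusing the prompt string from the previous sketch, could look as follows.

```python
# Sketch of the FraCChat query step (older LlamaIndex API; names illustrative).
# The case description is embedded, the five best-matching compendium nodes
# are retrieved, and GPT 3.5-Turbo / GPT 4 synthesizes the answer from the
# prompt, the case, and the retrieved references.
from llama_index import GPTSimpleVectorIndex

index = GPTSimpleVectorIndex.load_from_disk("ao_index.json")

def classify_context_aware(case_text: str) -> str:
    query = f"{PROMPT}\n\nFracture description:\n{case_text}"
    # similarity_top_k=5 retrieves the five best-matching data nodes.
    response = index.query(query, similarity_top_k=5)
    return str(response)
```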

Preparation of case files

The chatbots’ accuracy in comparison to human performance was tested in a scenario resembling clinical routine. For this, 100 radiological reports describing fractures were created. From the more than 500 options of the AO classification (neglecting qualifications and universal modifiers), we randomly chose 100. This resulted in 14 cases for the humerus, 1 for the scapula, 4 for the clavicle, 9 for the radius, 4 for the ulna, 9 for the femur, 3 for the patella, 14 for the tibia, 7 for the tibia/fibula, 3 for the fibula, 12 for the pelvis, 12 for the hand/carpus, and 8 for the foot. Ground-truth AO codes for the case files comprised information on bone, segment, and fracture type in all cases, while information on group was available in 73% and on subgroup in 56%.

The reports were then written by an experienced resident, who described each fracture in a manner sufficient for classification.

Assessment of human and chatbot performance

The 100 case files were presented to two radiologists, who classified them independently according to the AO classification. Consultation of the guidelines was allowed during this assessment.

We utilized a script-based approach to present the 100 case files to all four chatbots in five repeated runs. From these, the median performance across the five runs was calculated.
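A simplified sketch of such a repetition script is shown below; the classification function, case dictionary, ground-truth mapping, and the scoring of a correct answer are placeholders.

```python
# Sketch of fivefold repetition testing with a median summary
# (classify_fn, cases, and ground_truth are placeholders; scoring a
# "correct" answer is simplified to a substring match).
import statistics

def run_benchmark(cases, classify_fn, ground_truth, n_runs=5):
    accuracies = []
    for _ in range(n_runs):
        correct = sum(
            1
            for case_id, report_text in cases.items()
            if ground_truth[case_id] in classify_fn(report_text)
        )
        accuracies.append(correct / len(cases))
    # Median performance across the five runs.
    return statistics.median(accuracies)
```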

The performance of humans and chatbots was assessed through qualitative analysis of the different levels of the AO classification extracted from the full AO code: (I) full AO code, (II) bone location, (III) part of the bone, (IV) type of fracture, (V) group, and (VI) subgroup. To compare the respective outputs of humans and chatbots, we employed a Generalized Linear Mixed Model (GLMM) with a binomial family (logit link) to investigate the relationship between the dependent variable "rating" and the independent variable "method". The rating method (human, GPT 3.5-Turbo, GPT 4, FraCChat 3.5, and FraCChat 4) was included as a fixed factor, while allowing for random intercepts for "case" and "rater".
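The statistical software used for the GLMM is not detailed here; as an illustration only, a comparable binomial mixed model with random intercepts for case and rater can be fitted in Python with statsmodels' variational Bayes mixed GLM, which approximates rather than reproduces the exact model.

```python
# Illustrative binomial mixed model (logit link) with random intercepts for
# case and rater; statsmodels' variational Bayes GLMM is used here as an
# approximation and is not necessarily the software used in the study.
import pandas as pd
from statsmodels.genmod.bayes_mixed_glm import BinomialBayesMixedGLM

# Long-format table with one row per rating (hypothetical file and columns):
# rating (1 = correct, 0 = incorrect), method, case, rater.
df = pd.read_csv("ratings_long.csv")

model = BinomialBayesMixedGLM.from_formula(
    "rating ~ C(method)",                               # fixed factor: method
    {"case": "0 + C(case)", "rater": "0 + C(rater)"},   # random intercepts
    df,
)
result = model.fit_vb()
print(result.summary())
```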

The consistency of the chatbots was evaluated by comparing the proportion of correct answers that were provided consistently across all five runs versus those provided inconsistently. For this, we calculated the ratios and employed a one-sided chi-squared test to assess whether more consistently correct answers were provided than inconsistent ones.
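As an illustration with placeholder counts, such a comparison can be expressed as a chi-squared goodness-of-fit test of the two counts against an equal split, reading the p-value one-sided when the observed direction matches the hypothesis.

```python
# Sketch of the consistency comparison (counts are hypothetical placeholders).
from scipy.stats import chisquare

n_consistent, n_inconsistent = 48, 9  # correct answers: consistent vs. not

# Goodness-of-fit of the observed counts against an equal split.
stat, p_two_sided = chisquare([n_consistent, n_inconsistent])

# One-sided reading: halve the p-value when consistent answers predominate.
p_one_sided = p_two_sided / 2 if n_consistent > n_inconsistent else 1 - p_two_sided / 2
print(stat, p_one_sided)
```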

Assessment of time-effectiveness of radiologists and chatbots

To investigate time-effectiveness, we measured the time to decision for each human reader and chatbot and compared these durations via Student’s t-test.
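For instance, with per-case decision times collected for a reader and a chatbot, the comparison could be carried out with SciPy as follows (all values are hypothetical placeholders).

```python
# Sketch of the decision-time comparison via Student's t-test
# (per-case times in minutes; all values are hypothetical placeholders).
from scipy.stats import ttest_ind

human_times = [0.9, 0.8, 1.1, 0.7, 0.85]
chatbot_times = [0.03, 0.04, 0.03, 0.05, 0.03]

t_stat, p_value = ttest_ind(chatbot_times, human_times)
print(t_stat, p_value)
```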

Results

Human and chatbot performance

Both radiologists (with access to the AO classification documents) classified the reports according to the AO system with high accuracy: both provided the correct full AO code in 95% of the cases, and even higher proportions of correct classifications were noted for location, fracture type, and group.

In contrast, the chatbots generally performed worse, as shown in Tables 1 and 2. The generic chatbots GPT 3.5-Turbo and GPT 4 provided correct full AO codes in only 5% and 7% of the cases, respectively. Better performance was noted for the FraCChat based on GPT 3.5-Turbo (57% correct full AO codes), while the FraCChat relying on GPT 4 reached 83% correct full AO codes. This was corroborated by the performance for localisation, type, and subtype, where we noted a similar gradient, with mostly significantly better performance of FraCChat 4 over FraCChat 3.5 over GPT 4 over GPT 3.5-Turbo (see Table 2 for details). Of note, the performance of FraCChat 4 was not inferior to human rating regarding location, part of bone, fracture type, and group, as presented in Table 2.

Table 1 Performance of human experts and chatbots in AO classification tasks and respective time consumption.
Table 2 Detailed comparison of the performance of humans and chatbots across the different levels of the AO classification system.

Analysis of consistency revealed a higher proportion of consistently correct answers for the context-aware chatbots across all investigated levels of the AO code. Furthermore, the context-aware chatbots provided significantly more consistently correct answers for the levels type and subgroup, whereas the generic chatbots did not. Details are provided in Table 3.

Table 3 Number of correctly provided AO codes and proportion of consistently correct answers.

Time effectiveness of radiologists and chatbots

The time to provide answers on the 100 case files was substantially lower for the generic chatbots than for the human readers (generic chatbot mean 3.3 min vs. human mean 83.2 min; p < 0.001). In addition, the generic chatbots were significantly faster than the context-aware chatbots (generic chatbot mean 3.3 min vs. context-aware chatbot mean 7.2 min; p < 0.001), and the GPT 3.5-based chatbots were each faster than their respective GPT 4-based counterparts (both p < 0.001).

Discussion

In this study, we assessed the feasibility of generic and context-aware chatbots providing AO fracture classification codes from radiological reports. Though the time needed to provide the AO classification was significantly lower for chatbots than for humans, their overall performance was inferior to that of human radiologists. However, we found that providing specific knowledge substantially enhances a chatbot’s performance and consistency, rendering it non-inferior to human rating in terms of location, part of bone, fracture type, and group.

Various AI approaches have been proposed for image-based fracture detection, extending to whole-body approaches available as commercial tools10. AI-based methods are also available for image-based fracture classification, but these are mostly limited to a particular body region. For example, for the ankle11, the distal radius12, the hip13, and the knee14, convincing results were obtained with deep learning, with the respective AO classifications derived directly from the image data.

Our approach follows a different track: human radiologists can readily describe a fracture adequately in terms of its morphology and complexity, and in this proof-of-concept work the translation of this description into the respective AO code was performed by an AI approach. Although we matched the performance of a human classifier only for location, part of bone, fracture type, and group rather than on all levels of the AO system, we were able to show that LLM-based chatbots are well suited for this purpose and might improve workflows, as they are more time-efficient than human radiologists. In addition, the provision of specialized knowledge showed a clear superiority over generic chatbots; not only the performance but also the consistency was substantially improved.

Previous studies employed natural language processing (NLP) to extract data from radiology reports. Good performance was found for NLP algorithms that were specifically trained to extract fracture presence15 or fracture site16. Deep learning-based NLP methods using convolutional neural networks have shown improved performance compared with classical NLP approaches. However, these approaches require rather large datasets, as performance scales with the amount of training data, as shown for the classification of hip fractures or smoking status17. In contrast to such specifically trained models, providing specific knowledge via a vector index to a pretrained generic LLM comes with several advantages. The LlamaIndex interface offers a flexible data connector between an existing data source and the LLM, structuring and optimizing the data. Through this, the specific knowledge base can be exchanged swiftly and independently of the original LLM, removing the need for a task-specifically trained or fine-tuned model. Furthermore, the generated index is editable, and individual references can be added or deleted (for example, if new guidelines are published or amended). Conversely, an existing index can be provided to a new LLM as context, avoiding the need for resource-consuming re-training of the AI algorithm itself.

The significantly higher performance of chatbots based on GPT version 4 compared with GPT 3.5-Turbo is corroborated by recent studies on the simplification of radiological reports7 and on passing United States Medical Licensing Examination questions4. Further improvement in performance is expected with newer GPT models and better vector indices, as we noted a significant increase in performance and consistency for the GPT 4-based generic and context-aware chatbots over the GPT 3.5-Turbo-based variants. This might also address the limitation we encountered with all chatbots in the case of the complex hindfoot fracture (AO 89A): the chatbots mostly classified the individual fractures of the talus and the calcaneus correctly but did not manage the translation to AO 89A. An interesting pattern was observed in the case file for AO 2U3A3: despite the description of "multiple bone fragments", the chatbots incorrectly classified the case as 2U2, whereas the human raters assigned the correct AO code.

Potential applications of the presented approach go beyond supporting AO classification in clinical routine and extend to retrospective analyses of existing reports and their comparison with the patients’ subsequent procedures for quality control. In addition, extension to other classification systems is conceivable, as is providing information on evidence-based therapies together with the respective evidence in clinical routine. Nevertheless, this approach requires validation with real-world case files to test its generalizability.

In conclusion, we found that a context-aware chatbot can derive structured information on fracture classification from radiological reports. Though it did not reach human performance, its time consumption was significantly lower.

We nevertheless provide evidence that refining ChatGPT and providing it with specific context will be the next essential step in harnessing its power.