Accuracy, appropriateness, and readability of ChatGPT-4 and ChatGPT-3.5 in answering pediatric emergency medicine post-discharge questions
Abstract
Purpose
Large language models (LLMs) like ChatGPT (OpenAI) are increasingly used in healthcare, raising questions about their accuracy and reliability for medical information. This study compared 2 versions of ChatGPT in answering post-discharge follow-up questions in the area of pediatric emergency medicine (PEM).
Methods
Twenty-three common post-discharge questions were posed to ChatGPT-4 and -3.5, with responses generated before and after a simplification request. Two blinded PEM physicians evaluated appropriateness and accuracy as the primary endpoints. Secondary endpoints included word count and readability. Six established readability scales were averaged: the Automated Readability Index, Gunning Fog Index, Flesch-Kincaid Grade Level, Coleman-Liau Index, Simple Measure of Gobbledygook Grade Level, and Flesch Reading Ease. T-tests and Cohen’s kappa were used to determine differences and inter-rater agreement, respectively.
Results
The physician evaluations showed high appropriateness for both default responses (ChatGPT-4, 91.3%-100% vs. ChatGPT-3.5, 91.3%) and simplified responses (both 87.0%-91.3%). Accuracy was also high for default (87.0%-95.7% vs. 87.0%-91.3%) and simplified responses (both 82.6%-91.3%). Inter-rater agreement was fair overall (κ = 0.37; P < 0.001). For default responses, ChatGPT-4 produced longer outputs than ChatGPT-3.5 (233.0 ± 97.1 vs. 199.6 ± 94.7 words; P = 0.043), with similar readability (13.3 ± 1.9 vs. 13.5 ± 1.8; P = 0.404). After simplification, both LLMs reduced word count and improved readability (P < 0.001), with ChatGPT-4 achieving readability suitable for eighth-grade students in the United States (7.7 ± 1.3 vs. 8.2 ± 1.5; P = 0.027).
Conclusion
The responses of ChatGPT-4 and -3.5 to post-discharge questions were deemed appropriate and accurate by the PEM physicians. While ChatGPT-4 showed an edge in simplifying language, neither LLM consistently met the recommended sixth-grade reading level. These findings suggest that LLMs have potential as tools for communicating with guardians after discharge.
Introduction
Visits to pediatric emergency departments (EDs) are often constrained by time and continuity. While emergency physicians (EPs) strive to provide comprehensive discharge instructions and answer questions before discharge, several factors complicate this process. Pediatric patients’ guardians are often stressed and sleep-deprived while receiving information from EPs, hindering their understanding and retention of important details. Furthermore, the varying quality of discharge instructions and the lack of follow-up can prevent EPs from fully addressing all concerns. New questions and concerns commonly emerge once families return home: in the hours and days following discharge, guardians frequently need further guidance on post-discharge care, medication compliance, symptom management, and treatment adherence. Accessing reliable and understandable medical information remains a significant challenge, particularly for those with limited health literacy.
Traditionally, guardians seeking medical information outside the hospital have turned to sources such as Google. While these sources have been shown to increase guardians’ trust in their providers, concerns remain about the sheer volume and variable consistency of the information available (1,2). Additionally, pediatrics-related websites for guardians may contain medical jargon that is difficult for laypersons to understand (3). The National Institutes of Health recommends that patient education materials be written at or below the reading level of sixth graders in the United States (4). Together, these factors create a clear need for accessible, accurate, and easy-to-understand medical information tailored to the specific needs of pediatric patients and their guardians, particularly given that guardians’ reading literacy may be related to their children’s health outcomes (5,6).
Recent advancements in artificial intelligence (AI), specifically large language models (LLMs) like ChatGPT (OpenAI), offer a promising way to bridge this information gap. ChatGPT is a state-of-the-art conversational agent capable of generating human-like responses to a wide range of queries. Its potential application in healthcare settings, particularly for providing follow-up care instructions, merits thorough investigation. This pilot study aimed to assess the accuracy and consistency of 2 versions of ChatGPT, ChatGPT-4 and -3.5, in responding to follow-up questions for pediatric patients discharged from the ED. ChatGPT-3.5 is an earlier LLM with 175 × 10⁹ parameters, whereas ChatGPT-4 was reportedly trained with more than 100 × 10¹² parameters. Despite this difference, we evaluated both LLMs because the former is free and therefore more accessible to guardians than the latter. The questions explored frequently encountered domains of pediatric emergency medicine (PEM) across a variety of age ranges. By systematically assessing the performance of the LLMs, our study aimed to provide insights into the feasibility of leveraging LLMs to support guardians with clear, accurate, and guardian-friendly medical advice in this context.
Methods
This pilot study investigated the potential of the 2 versions of ChatGPT to accurately and consistently answer follow-up questions for pediatric patients discharged from the ED. Twenty-three commonly asked follow-up questions were collected in conjunction with EPs. These questions spanned a variety of topics, ranging from laceration care and febrile seizures to medication dosing (e.g., acetaminophen or ibuprofen). Each question was asked in English and structured to include the patient’s age and diagnosed condition before the question itself. A primary prompt was framed as a guardian asking the question: “For my (age) old with (condition), (question).” For example, “For my 2-year-old with acute otitis media, my child’s symptoms have improved after 3 days of antibiotics. Do they need to finish the entire course?” After the 2 ChatGPT versions generated an initial response, a secondary prompt requested a simplified version of the response for better readability: “Can you make this easier to understand?”
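For illustration, the prompting protocol can be summarized in a brief R sketch. This is not the authors’ code; the data frame and its contents are illustrative placeholders standing in for the study’s 23 questions.

# Illustrative sketch of the prompt structure (placeholder questions only)
questions <- data.frame(
  age       = c("2-year-old", "6-year-old"),
  condition = c("acute otitis media", "a forearm laceration"),
  question  = c(
    "my child's symptoms have improved after 3 days of antibiotics. Do they need to finish the entire course?",
    "when can the dressing be removed?"
  ),
  stringsAsFactors = FALSE
)

# Primary prompt: "For my (age) with (condition), (question)"
primary_prompts <- sprintf("For my %s with %s, %s",
                           questions$age, questions$condition, questions$question)

# Secondary prompt, sent after each default response, requesting simplification
simplify_prompt <- "Can you make this easier to understand?"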
The responses generated by the 2 ChatGPT versions before and after the simplification requests were collected by the research team. An incognito window was utilized to prevent model learning. For each question, there were 4 sets of 3 responses: default responses from ChatGPT-3.5, simplified responses from ChatGPT-3.5, default responses from ChatGPT-4, and simplified responses from ChatGPT-4. These data were collected in April of 2024.
The primary outcomes were the appropriateness and accuracy of the ChatGPT responses, as rated by 2 board-certified PEM physicians originally trained in pediatrics, each with more than 20 years of clinical experience, who were blinded to which LLM had generated the responses. The reviewers graded each set of responses as “appropriate,” “not appropriate,” or “unreliable” based on their clinical judgment, without any further context. The “unreliable” rating was chosen when the responses within a set varied in appropriateness, with some appropriate and others not. The reviewers also evaluated the medical advice provided in the responses, rating each set as “accurate,” “partially accurate,” or “inaccurate”; although subjective, this approach followed the methodology used in the relevant literature (7,8).
Secondary outcomes included readability and word count. To assess the readability of both the default and simplified ChatGPT responses, we used 6 established readability scales: the Automated Readability Index, Gunning Fog Index, Flesch-Kincaid Grade Level, Coleman-Liau Index, Simple Measure of Gobbledygook Grade Level, and Flesch Reading Ease, which were combined into an average reading level (ARL) (7). The ARL and word count were averaged for each ChatGPT version and response type, allowing evaluation of how effectively each version simplified its language. A readability level of 6 corresponded to the reading level of sixth graders, 7 to that of seventh graders, and so forth.
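As an illustration of how such an ARL can be derived, the following R sketch implements the published formulas for the 6 scales. This is not the authors’ code: the syllable counter is a naive vowel-group heuristic, and averaging the 5 grade-level indices into an ARL (reporting Flesch Reading Ease, a 0-100 scale, separately) is an assumption, as the paper does not specify the software used.

count_syllables <- function(word) {
  # Rough heuristic: count groups of consecutive vowels (minimum of 1 per word)
  max(1, length(gregexpr("[aeiouy]+", tolower(word))[[1]]))
}

readability <- function(text) {
  sentences <- unlist(strsplit(text, "[.!?]+\\s*"))
  sentences <- sentences[nchar(sentences) > 0]
  words <- gsub("[^A-Za-z]", "", unlist(strsplit(text, "\\s+")))
  words <- words[nchar(words) > 0]

  n_sent  <- length(sentences)
  n_words <- length(words)
  n_chars <- sum(nchar(words))
  syl     <- vapply(words, count_syllables, numeric(1))
  n_poly  <- sum(syl >= 3)                  # "complex" words with 3+ syllables

  wps <- n_words / n_sent                   # words per sentence
  spw <- sum(syl) / n_words                 # syllables per word

  ari  <- 4.71 * (n_chars / n_words) + 0.5 * wps - 21.43
  fog  <- 0.4 * (wps + 100 * n_poly / n_words)
  fkgl <- 0.39 * wps + 11.8 * spw - 15.59
  cli  <- 0.0588 * (100 * n_chars / n_words) - 0.296 * (100 * n_sent / n_words) - 15.8
  smog <- 1.043 * sqrt(n_poly * 30 / n_sent) + 3.1291
  fre  <- 206.835 - 1.015 * wps - 84.6 * spw   # 0-100 scale; higher = easier

  # Assumed ARL: mean of the 5 grade-level indices
  c(ARI = ari, GunningFog = fog, FleschKincaid = fkgl, ColemanLiau = cli,
    SMOG = smog, FleschReadingEase = fre,
    ARL = mean(c(ari, fog, fkgl, cli, smog)), WordCount = n_words)
}

# Example: score a short discharge-style sentence
readability("Give your child plenty of fluids. Call your doctor if the fever lasts more than 3 days.")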
Given the large sample size, 2-sample t-tests were used to compare readability and word count between the 2 ChatGPT versions and between response types (default vs. simplified) for each model. No correction for multiplicity was applied. Fisher’s exact tests were used to evaluate the reviewers’ ratings, specifically “appropriate” vs. “not appropriate or unreliable,” and “accurate” vs. “partially accurate or inaccurate.” Cohen’s kappa was used to determine inter-rater agreement. Results were considered significant at P < 0.05. R (Posit) was used for statistical analysis. This study was exempt from institutional review board approval because it did not involve human participants (IRB no. STUDY00006767).
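For reference, analyses of this kind can be run in base R plus one add-on package (irr). The sketch below is an assumed reconstruction, not the authors’ script: the data are placeholders, the 2 × 2 table layout is an assumption (the paper does not specify the exact contingency tables), and irr::kappa2 is only one of several ways to compute Cohen’s kappa.

library(irr)   # provides kappa2() for Cohen's kappa

set.seed(1)
# Placeholder ARL values for 23 default responses per model (not the study data)
arl_gpt4_default  <- rnorm(23, mean = 13.3, sd = 1.9)
arl_gpt35_default <- rnorm(23, mean = 13.5, sd = 1.8)

# 2-sample t-test comparing ARL between models; analogous calls compare word
# counts, and default vs. simplified responses within a model
t.test(arl_gpt4_default, arl_gpt35_default)

# Fisher's exact test on reviewer ratings collapsed to binary categories;
# the layout and counts here are illustrative assumptions
ratings <- matrix(c(21, 2,
                    23, 0),
                  nrow = 2, byrow = TRUE,
                  dimnames = list(Model  = c("ChatGPT-3.5", "ChatGPT-4"),
                                  Rating = c("Appropriate", "Not appropriate/unreliable")))
fisher.test(ratings)

# Cohen's kappa for inter-rater agreement between the 2 blinded reviewers
reviewer1 <- c("appropriate", "appropriate", "not appropriate", "appropriate")
reviewer2 <- c("appropriate", "not appropriate", "not appropriate", "appropriate")
kappa2(data.frame(reviewer1, reviewer2))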
Results
A total of 23 questions were posed to the 2 versions of ChatGPT, 3 separate times each, with and without a request for simplification, resulting in a total of 276 independent responses. In total, the PEM physicians performed 92 evaluations to assess the appropriateness and accuracy of each group of outputs. Tables 1 and 2 list the results for each LLM by question and by simplification status.
For ChatGPT-3.5, default responses were deemed appropriate in 91.3% of cases by both reviewers (Table 3). Simplified responses were deemed appropriate in 87.0% and 91.3% of cases by reviewers 1 and 2, respectively. Accuracy ratings for default responses were 87.0% and 91.3% for reviewers 1 and 2, respectively; the corresponding ratings for simplified responses were 82.6% and 91.3%. For ChatGPT-4, default responses were deemed appropriate in 91.3% and 100% of cases by reviewers 1 and 2, respectively. The other variables were similar to those of ChatGPT-3.5. These percentages were significantly higher than those for the “not appropriate or unreliable” and “partially accurate or inaccurate” ratings.
The reviewers’ ratings also showed varying levels of agreement, with a fair overall inter-rater reliability (κ = 0.37; P < 0.001). For the default responses, agreement was moderate (κ = 0.50; P < 0.001), while for the simplified responses, agreement was fair (κ = 0.27; P < 0.001). ChatGPT-3.5 evaluations showed moderate agreement (κ = 0.56; P < 0.001), but ChatGPT-4 evaluations had only slight agreement (κ = 0.125; P = 0.013).
The default responses of ChatGPT-3.5 had fewer words than those of ChatGPT-4 (199.6 ± 94.7 vs. 233.0 ± 97.1 words; P = 0.043; Table 4). Similarly, the simplified responses of ChatGPT-3.5 had fewer words than those of ChatGPT-4 (141.2 ± 59.2 vs. 164.0 ± 67.0 words; P = 0.036). The ARL of ChatGPT-3.5’s default responses did not differ from that of ChatGPT-4’s (13.5 ± 1.8 vs. 13.3 ± 1.9; P = 0.404). However, for the simplified responses, ChatGPT-3.5 had a higher ARL than ChatGPT-4 (8.2 ± 1.5 vs. 7.7 ± 1.3; P = 0.027). Both LLMs showed significant decreases in ARL from default to simplified responses: for ChatGPT-3.5, the ARL decreased from 13.5 ± 1.8 to 8.2 ± 1.5 (P < 0.001); for ChatGPT-4, from 13.3 ± 1.9 to 7.7 ± 1.3 (P < 0.001).
Discussion
This study assessed the performance of ChatGPT-4 and -3.5 in generating PEM follow-up care instructions for 23 frequently asked questions. The high appropriateness and accuracy ratings from board-certified PEM physicians underscore the potential of these AI tools to complement traditional patient education methods used in EDs. ChatGPT-4 demonstrated slightly better performance than ChatGPT-3.5 in accuracy and appropriateness, with both LLMs maintaining high ratings even after simplification. The mean appropriateness and accuracy ratings consistently exceeded 85% for both LLMs and response types, suggesting a strong potential for LLMs to reliably answer guardians’ questions. Notably, a slight decrease in both appropriateness and accuracy was observed for both LLMs after simplification, indicating a potential trade-off between simplicity and comprehensiveness.
The robust performance of both LLMs demonstrates their ability to adapt to language complexity without compromising the quality of medical information. This adaptability is crucial, as it allows the LLMs to tailor information to patients and guardians with varying levels of health literacy, a factor known to impact health outcomes and the use of EDs (6). The capacity to provide personalized and understandable information could improve guardians’ satisfaction and comprehension, addressing a key challenge in healthcare communication.
Furthermore, research has shown that LLMs may provide more empathetic responses compared to those generated by physicians (9). This combination of accuracy, adaptability, and empathy may position LLMs as promising tools for enhancing patient education and support, particularly in time-constrained emergency settings.
Default responses from both versions of ChatGPT had similar ARLs, but ChatGPT-4 was better at lowering the ARL than ChatGPT-3.5. This finding suggests that the premium LLM can better adjust its output to accommodate the reading levels of users. While ChatGPT-3.5 is an older model and may not have the same capabilities as ChatGPT-4, it is important to evaluate the former model because more advanced LLMs may be more expensive and less accessible to guardians.
The simplification findings indicate differences in word count and ARL between the 2 LLMs and between simplified and default responses, with both LLMs showing a capability to reduce complexity when asked to simplify their responses. However, neither LLM consistently reached the sixth-grade reading level recommended for optimal guardian comprehension (4). This suggests that while LLMs can improve the accessibility of medical information, there is still room for improvement to meet established health literacy guidelines. Further simplification trials were attempted and occasionally showed additional improvement in readability, indicating that further prompting may help reach the necessary ARL. Other studies have similarly verified the simplifying abilities of LLMs (10).
A key observation from our study is the LLMs’ performance on medication-related questions. Both versions of ChatGPT showed decreased appropriateness and accuracy for questions regarding medication dosing, particularly for over-the-counter medications such as ibuprofen, acetaminophen, and polyethylene glycol. The responses were generated in the absence of dosing weights, leading to incorrect outputs. Questions were deliberately asked without weights, on the assumption that EPs usually address such questions rather than allowing guardians to dose on their own. This finding aligns with previous research cautioning against relying on AI for medication instructions and highlights a critical area for improvement in the LLMs (11). The issue may be more pronounced for pediatric patients, whose medication instructions are often more complex because of weight-based dosing, in contrast to the more standardized adult dosing. EPs should be cautious about the general use of LLMs for unproven medical advice. Outside of the medication-based questions, the LLMs generally provided appropriate and accurate advice.
Despite the limitations discussed below, this study demonstrates that LLMs hold important potential in patient education. They could address barriers to personalized education, improving guardians’ understanding of discharge instructions and reducing return visits among pediatric patients discharged from EDs. However, it is crucial to emphasize that these AI-generated instructions should complement, not replace, direct physician-patient communication.
This study has several limitations. The focus on only 23 common post-ED discharge questions limits its generalizability. Although the versions of ChatGPT studied were the most commonly available during the study period, newer versions might produce different results. In addition, while the overall inter-rater agreement was fair, agreement varied across conditions, suggesting a need for more standardized evaluation criteria. Further refinement to meet end-user needs could be guided by direct feedback from guardians regarding the clarity and usefulness of, and their satisfaction with, LLM-generated instructions.
In conclusion, this study shows the promising potential of ChatGPT-4 and -3.5 in generating accurate and appropriate responses to common PEM post-discharge questions. While there are areas for improvement, particularly in medication-related advice and achieving optimal readability levels, the overall performance of the LLMs suggests that they could be valuable tools in enhancing patient education in EDs. Further research and refinement are necessary before considering widespread implementation in clinical practice, but the results indicate a promising direction for improving communication and potential health outcomes in the area of PEM.
Notes
Author contributions
Conceptualization and Methodology: Gupta M and Aufricht G
Validation: Kahlun A, Sur R, and Gupta P
Formal analysis, Resources, Visualization, Software, and Project administration: Gupta M
Investigation: Gupta M, Kahlun A, Sur R, and Gupta P
Data curation: Gupta M, Gupta P, Kienstra A, and Whitaker W
Supervision: Aufricht G, Kienstra A, and Whitaker W
Writing-original draft: Gupta M, Kahlun A, Sur R, and Aufricht G
Writing-review and editing: all authors
All authors read and approved the final manuscript.
Conflicts of interest
No potential conflicts of interest relevant to this article were reported.
Funding sources
No funding source relevant to this article was reported.
Data availability
Data generated or analyzed during the study are available from the corresponding author by request.