Accuracy, appropriateness, and readability of ChatGPT-4 and ChatGPT-3.5 in answering pediatric emergency medicine post-discharge questions

Article information

Pediatr Emerg Med J. 2025; pemj.2024.01074
Publication date (electronic) : 2025 January 11
doi : https://doi.org/10.22470/pemj.2024.01074
1Department of Diagnostic Medicine, Dell Medical School, The University of Texas at Austin, Austin, TX, USA
2Undergraduate Program, The University of Texas at Austin, Austin, TX, USA
3Department of Pediatric Emergency Medicine, Dell Medical School, The University of Texas at Austin, Austin, TX, USA
Corresponding author: Mitul Gupta Department of Diagnostic Medicine, Dell Medical School, The University of Texas at Austin, 1501 Red River Street, Austin, TX 78712, USA Tel: +1-512-495-5555; E-mail: mitul.gupta@utexas.edu
Received 2024 August 21; Revised 2024 November 8; Accepted 2024 November 12.

Abstract

Purpose

Large language models (LLMs) like ChatGPT (OpenAI) are increasingly used in healthcare, raising questions about their accuracy and reliability for medical information. This study compared 2 versions of ChatGPT in answering post-discharge follow-up questions in the area of pediatric emergency medicine (PEM).

Methods

Twenty-three common post-discharge questions were posed to ChatGPT-4 and -3.5, with responses generated before and after a simplification request. Two blinded PEM physicians evaluated appropriateness and accuracy as the primary endpoints. Secondary endpoints included word count and readability. Six established readability scales were averaged: the Automated Readability Index, Gunning Fog Index, Flesch-Kincaid Grade Level, Coleman-Liau Index, Simple Measure of Gobbledygook Grade Level, and Flesch Reading Ease. T-tests and Cohen’s kappa were used to determine differences and inter-rater agreement, respectively.

Results

The physician evaluations showed high appropriateness for both default responses (ChatGPT-4, 91.3%-100% vs. ChatGPT-3.5, 91.3%) and simplified responses (both 87.0%-91.3%). Accuracy was also high for default (87.0%-95.7% vs. 87.0%-91.3%) and simplified responses (both 82.6%-91.3%). Inter-rater agreement was fair overall (κ = 0.37; P < 0.001). For default responses, ChatGPT-4 produced longer outputs than ChatGPT-3.5 (233.0 ± 97.1 vs. 199.6 ± 94.7 words; P = 0.043), with similar readability (13.3 ± 1.9 vs. 13.5 ± 1.8; P = 0.404). After simplification, both LLMs improved in word count and readability (P < 0.001), with ChatGPT-4 achieving a readability suitable for eighth-grade students in the United States (7.7 ± 1.3 vs. 8.2 ± 1.5; P = 0.027).

Conclusion

The responses of ChatGPT-4 and -3.5 to post-discharge questions were deemed appropriate and accurate by the PEM physicians. While ChatGPT-4 showed an edge in simplifying language, neither LLM consistently met the recommended sixth-grade reading level. These findings suggest that LLMs have potential to support communication with guardians.

Introduction

Visits to pediatric emergency departments (EDs) are often constrained by time and continuity. While emergency physicians (EPs) strive to provide comprehensive discharge instructions and answer questions before discharge, several factors complicate this process. Pediatric patients’ guardians are often stressed and sleep-deprived while receiving information from EPs, hindering their understanding and retention of important details. Furthermore, the varying quality of discharge instructions and lack of follow-ups can prevent EPs from fully addressing all concerns. It is common for new questions and concerns to emerge once the families return home. In the hours and days following the discharge, guardians frequently develop additional questions regarding their children’s ongoing care. They commonly need further guidance on post-discharge care, medication compliance, symptom management, and treatment adherence. Accessing reliable and understandable medical information remains a significant challenge, particularly for those with limited health literacy.

Traditionally, guardians seeking medical information outside the hospital have turned to various sources such as Google. While these sources have been shown to increase guardians’ trust in their providers, concerns remain about the sheer volume and variable consistency of the information available (1,2). Additionally, pediatrics-related websites for guardians may contain medical jargon that is difficult for laypersons to understand (3). The National Institutes of Health recommends that patient education materials remain at the reading level of 6th graders in the United States (4). This recommendation underscores a clear need for accessible, accurate, and easy-to-understand medical information tailored to the specific needs of pediatric patients and their guardians, particularly given that guardians’ reading literacy may be related to their children’s health outcomes (5,6).

Recent advancements in artificial intelligence (AI), specifically large language models (LLMs) like ChatGPT (OpenAI), offer a promising solution to bridge this information gap. ChatGPT is a state-of-the-art conversational agent capable of generating human-like responses to a wide range of queries. Its potential application in healthcare settings, particularly for providing follow-up care instructions, merits thorough investigation. This pilot study aimed to assess the accuracy and consistency of 2 versions of ChatGPT, ChatGPT-4 and -3.5, in responding to follow-up questions for pediatric patients discharged from the ED. ChatGPT-3.5 is an earlier LLM with 175 × 10⁹ parameters, whereas ChatGPT-4 was trained on more than 100 × 10¹² parameters. Despite this difference, we employed both LLMs because the former is free and therefore more accessible to guardians than the latter. The explored questions addressed frequently encountered domains of pediatric emergency medicine (PEM) across a variety of age ranges. By systematically assessing the performance of the LLMs, our study aimed to provide insights into the feasibility of leveraging LLMs to support guardians by providing clear, accurate, and guardian-friendly medical advice in this context.

Methods

This pilot study investigated the potential of the 2 versions of ChatGPT to accurately and consistently answer follow-up questions for pediatric patients discharged from the ED. Twenty-three commonly asked follow-up questions were collected in conjunction with EPs. These questions spanned a variety of topics, ranging from laceration care and febrile seizures to medication dosing (e.g., acetaminophen or ibuprofen). Each question was asked in English and structured to include the patient’s age and diagnosed condition before the question itself. A primary prompt was posed as a guardian asking the question: “For my (age)-old with (condition), (question).” For example, “For my 2-year-old with acute otitis media, my child’s symptoms have improved after 3 days of antibiotics. Do they need to finish the entire course?” After the 2 ChatGPT versions generated an initial response, a secondary prompt requested a simplified version for better readability: “Can you make this easier to understand?”
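The prompt template above can be illustrated with a minimal sketch. This is not the authors' procedure (the study used the ChatGPT web interface directly); the data class, function names, and example values below are purely illustrative of how the primary and simplification prompts described in the text are assembled.

```python
# Minimal sketch of the prompt structure described above (illustrative only).
from dataclasses import dataclass

@dataclass
class FollowUpQuestion:
    age: str         # e.g., "2-year-old"
    condition: str   # e.g., "acute otitis media"
    question: str    # the guardian's actual question

def primary_prompt(q: FollowUpQuestion) -> str:
    # Template from the paper: "For my (age)-old with (condition), (question)."
    return f"For my {q.age} with {q.condition}, {q.question}"

# Secondary prompt sent after the initial response, verbatim from the paper.
SIMPLIFY_PROMPT = "Can you make this easier to understand?"

if __name__ == "__main__":
    q = FollowUpQuestion(
        age="2-year-old",
        condition="acute otitis media",
        question=("my child's symptoms have improved after 3 days of antibiotics. "
                  "Do they need to finish the entire course?"),
    )
    print(primary_prompt(q))
    print(SIMPLIFY_PROMPT)
```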

The responses generated by the 2 ChatGPT versions before and after the simplification requests were collected by the research team. An incognito browser window was used to prevent model learning across sessions. For each question, there were 4 sets of 3 responses: default responses from ChatGPT-3.5, simplified responses from ChatGPT-3.5, default responses from ChatGPT-4, and simplified responses from ChatGPT-4. These data were collected in April 2024.

The primary outcomes were the appropriateness and accuracy of the ChatGPT responses, as rated by 2 board-certified PEM physicians originally trained in pediatrics, each with more than 20 years of clinical experience, who were blinded to which LLM generated the responses. The reviewers graded each set of responses as “appropriate,” “not appropriate,” or “unreliable” based on their clinical judgment, without any further context. “Unreliable” was chosen when the responses within a set varied in appropriateness, with some appropriate and others not. The reviewers also evaluated the medical advice in the responses, rating each set as “accurate,” “partially accurate,” or “inaccurate”; although these ratings are inherently subjective, they were based on the methodology used in the relevant literature (7,8).

Secondary outcomes included readability and word count. To assess the readability of both the default and simplified ChatGPT responses, we used 6 established readability scales: the Automated Readability Index, Gunning Fog Index, Flesch-Kincaid Grade Level, Coleman-Liau Index, Simple Measure of Gobbledygook Grade Level, and Flesch Reading Ease, which were averaged into an average reading level (ARL) (7). These scales and the word count were averaged for each ChatGPT version and response type, allowing evaluation of how effectively each version simplified its language. A readability level of 6 corresponds to the reading level of sixth graders, 7 to that of seventh graders, and so forth.
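As a rough illustration of this readability calculation, the sketch below uses the open-source textstat package (not necessarily the tool the authors used). Because Flesch Reading Ease is reported on a 0-100 scale rather than as a grade level, how it was folded into the ARL is not specified in the text; this sketch therefore averages only the 5 grade-level indices and reports Flesch Reading Ease separately, which is an assumption.

```python
# Hedged sketch of per-response readability metrics using the `textstat` package.
import textstat

GRADE_LEVEL_SCALES = {
    "Automated Readability Index": textstat.automated_readability_index,
    "Gunning Fog Index": textstat.gunning_fog,
    "Flesch-Kincaid Grade Level": textstat.flesch_kincaid_grade,
    "Coleman-Liau Index": textstat.coleman_liau_index,
    "SMOG Grade Level": textstat.smog_index,
}

def readability_summary(text: str) -> dict:
    # Compute each grade-level index for the response text.
    scores = {name: fn(text) for name, fn in GRADE_LEVEL_SCALES.items()}
    summary = {
        "word_count": len(text.split()),
        "flesch_reading_ease": textstat.flesch_reading_ease(text),  # 0-100 scale
        **scores,
    }
    # Average reading level (ARL) across the grade-level indices only (assumption).
    summary["average_reading_level"] = sum(scores.values()) / len(scores)
    return summary

if __name__ == "__main__":
    sample = ("Give your child plenty of fluids and rest. "
              "Call us if the fever lasts more than three days.")
    print(readability_summary(sample))
```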

Two-sample t-tests were used to compare readability and word count between the 2 ChatGPT versions, and between response types (default vs. simplified) within each version, given the large sample size. There was no correction for multiplicity. Fisher’s exact tests were used to evaluate the reviewers’ ratings, specifically “appropriate” vs. “not appropriate or unreliable,” and “accurate” vs. “partially accurate or inaccurate.” Cohen’s kappa was used to determine inter-rater agreement. The results were considered significant at P < 0.05. R (Posit) was used for statistical analysis. This study was exempted from institutional review board approval as it did not involve human participants (IRB no. STUDY00006767).
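The analyses were performed in R; the following is a hedged Python equivalent of the tests named above, using synthetic placeholder data rather than the study data, to make the analysis pipeline concrete.

```python
# Hedged Python equivalent of the described tests; all numbers are placeholders.
import numpy as np
from scipy import stats
from sklearn.metrics import cohen_kappa_score

rng = np.random.default_rng(0)

# Placeholder word counts for default responses (23 questions x 3 trials per model).
gpt4_default = rng.normal(233, 97, size=69)
gpt35_default = rng.normal(200, 95, size=69)

# Two-sample t-test comparing the 2 models (no multiplicity correction, as in the study).
t_stat, p_val = stats.ttest_ind(gpt4_default, gpt35_default)

# Fisher's exact test on dichotomized ratings, e.g.,
# "appropriate" vs. "not appropriate or unreliable" (counts are illustrative).
table = [[21, 2],   # one model: appropriate, not appropriate/unreliable
         [23, 0]]   # the other model
odds_ratio, fisher_p = stats.fisher_exact(table)

# Cohen's kappa for inter-rater agreement on the same 23 items (labels illustrative).
reviewer1 = ["appropriate"] * 20 + ["not appropriate"] * 3
reviewer2 = ["appropriate"] * 21 + ["not appropriate"] * 2
kappa = cohen_kappa_score(reviewer1, reviewer2)

print(f"t = {t_stat:.2f}, P = {p_val:.3f}; Fisher P = {fisher_p:.3f}; kappa = {kappa:.2f}")
```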

Results

A total of 23 questions were posed to the 2 versions of ChatGPT 3 separate times each, with and without a request for simplification, resulting in a total of 276 independent responses. In total, the PEM physicians performed 92 evaluations to assess the appropriateness and accuracy of each group of outputs. Tables 1 and 2 list the results for each LLM by question and response type.


For ChatGPT-3.5, default responses were deemed appropriate in 91.3% of cases by both reviewers (Table 3). Simplified responses were deemed appropriate in 87.0% and 91.3% of cases by reviewers 1 and 2, respectively. Accuracy ratings for default responses were 87.0% and 91.3% for reviewers 1 and 2, respectively; the equivalent ratings for simplified responses were 82.6% and 91.3%. For ChatGPT-4, default responses were deemed appropriate in 91.3% and 100% of cases by reviewers 1 and 2, respectively. The other ratings were similar to those of ChatGPT-3.5. These percentages were significantly higher than the corresponding proportions of responses rated “not appropriate or unreliable” or “not accurate or unreliable.”


The reviewers’ ratings also showed varying levels of agreement, with a fair overall inter-rater reliability (κ = 0.37; P < 0.001). For the default responses, agreement was moderate (κ = 0.50; P < 0.001), while for the simplified responses, agreement was fair (κ = 0.27; P < 0.001). ChatGPT-3.5 evaluations showed moderate agreement (κ = 0.56; P < 0.001), but ChatGPT-4 evaluations had only slight agreement (κ = 0.125; P = 0.013).

The default responses of ChatGPT-3.5 had fewer words than those of ChatGPT-4 (199.6 ± 94.7 vs. 233.0 ± 97.1 words; P = 0.043; Table 4). Similarly, the simplified responses of ChatGPT-3.5 had fewer words than those of ChatGPT-4 (141.2 ± 59.2 vs. 164.0 ± 67.0 words; P = 0.036). The ARL of ChatGPT-3.5’s default responses did not differ from that of ChatGPT-4’s (13.5 ± 1.8 vs. 13.3 ± 1.9; P = 0.404). However, for the simplified responses, ChatGPT-3.5 had a higher ARL than ChatGPT-4 (8.2 ± 1.5 vs. 7.7 ± 1.3; P = 0.027). Both LLMs showed significant decreases in ARL between default and simplified responses: for ChatGPT-3.5, the ARL decreased from 13.5 ± 1.8 to 8.2 ± 1.5 (P < 0.001); for ChatGPT-4, from 13.3 ± 1.9 to 7.7 ± 1.3 (P < 0.001).


Discussion

The study assessed the performance of ChatGPT-4 and -3.5 in generating PEM follow-up care instructions for 23 frequently asked questions. The high appropriateness and accuracy ratings from board-certified PEM physicians underscore the potential of these AI tools to complement traditional patient education methods used in EDs. ChatGPT-4 demonstrated slightly better performance than ChatGPT-3.5 in accuracy and appropriateness, with both LLMs maintaining high ratings even after simplification. The mean appropriateness and accuracy ratings consistently exceeded 85% for both LLMs and response types, suggesting a strong potential for LLMs to reliably answer guardians’ questions. Notably, a slight decrease in both appropriateness and accuracy was observed for both LLMs after simplification, indicating a potential trade-off between simplicity and comprehensiveness.

The robust performance of both LLMs demonstrates their ability to adapt to language complexity without compromising the quality of medical information. This adaptability is crucial, as it allows the LLMs to tailor information to patients and guardians with varying levels of health literacy, a factor known to impact health outcomes and the use of EDs (6). The capacity to provide personalized and understandable information could improve guardians’ satisfaction and comprehension, addressing a key challenge in healthcare communication.

Furthermore, research has shown that LLMs may provide more empathetic responses compared to those generated by physicians (9). This combination of accuracy, adaptability, and empathy may position LLMs as promising tools for enhancing patient education and support, particularly in time-constrained emergency settings.

Default responses from both versions of ChatGPT had similar ARLs, but ChatGPT-4 was better at simplifying ARL compared to ChatGPT-3.5. This finding suggests that the premium LLM can better adjust to accommodate the reading levels of users. While ChatGPT-3.5 is an older model and may not have the same capabilities as ChatGPT-4, it is important to evaluate the former model because more advanced LLMs may be more expensive and less accessible to guardians.

The simplification findings indicate differences in word count and ARL between the 2 LLMs and between simplified and default responses, with both LLMs showing a capability to reduce complexity when asked to simplify their responses. Neither LLM consistently reached the sixth-grade readability recommended for optimal caregiver comprehension (4). This suggests that while LLMs can improve the accessibility of medical information, there is still room for improvement to meet established health literacy guidelines. Further simplification trials were attempted and occasionally improved readability, indicating that additional prompting may help reach the necessary ARL. Many other studies have verified the simplifying abilities of LLMs (10).

A key observation from our study is the LLMs’ performance on medication-related questions. Both versions of ChatGPT showed decreased appropriateness and accuracy on questions regarding medication dosing, particularly for over-the-counter medications such as ibuprofen, acetaminophen, and polyethylene glycol. The responses were generated in the absence of dosing weights, leading to incorrect outputs. Questions were deliberately asked without weights, with the assumption that EPs usually address such questions rather than allowing guardians to dose by themselves. This finding aligns with previous research cautioning against relying on AI for medication instructions and highlights a critical area for improvement in the LLMs (11). The issue may be more pronounced for pediatric patients, whose medication instructions are often more complex because of weight-based dosing, in contrast to the more standardized adult dosing. EPs should be cautious about general use of LLMs for unproven medical advice. Outside of the medication-based questions, the LLMs generally provided appropriate and accurate advice.

Despite the limitations discussed below, this study demonstrates that LLMs hold important potential for patient education. They could address barriers to personalized education, improving guardians’ understanding of discharge instructions and reducing return visits for pediatric patients discharged from EDs. However, it is crucial to emphasize that these AI-generated instructions should complement, not replace, personal physician-patient communication.

This study has several limitations. The focus on only 23 common post-ED discharge questions limits generalizability. Although the versions of ChatGPT used were the most commonly available during the study period, a newer version might produce different results. In addition, while the overall inter-rater agreement was fair, agreement levels varied across conditions, suggesting a need for more standardized evaluation criteria. Further refinements to meet end-user needs could be guided by incorporating direct feedback from guardians regarding the clarity of, usefulness of, and satisfaction with LLM-generated instructions.

In conclusion, this study shows the promising potential of ChatGPT-4 and -3.5 in generating accurate and appropriate responses to common PEM post-discharge questions. While there are areas for improvement, particularly in medication-related advice and achieving optimal readability levels, the overall performance of the LLMs suggests that they could be valuable tools in enhancing patient education in EDs. Further research and refinement are necessary before considering widespread implementation in clinical practice, but the results indicate a promising direction for improving communication and potential health outcomes in the area of PEM.

Notes

Author contributions

Conceptualization and Methodology: Gupta M and Aufricht G

Validation: Kahlun A, Sur R, and Gupta P

Formal analysis, Resources, Visualization, Software, and Project administration: Gupta M

Investigation: Gupta M, Kahlun A, Sur R, and Gupta P

Data curation: Gupta M, Gupta P, Kienstra A, and Whitaker W

Supervision: Aufricht G, Kienstra A, and Whitaker W

Writing-original draft: Gupta M, Kahlun A, Sur R, and Aufricht G

Writing-review and editing: all authors

All authors read and approved the final manuscript.

Conflicts of interest

No potential conflicts of interest relevant to this article were reported.

Funding sources

No funding source relevant to this article was reported.

Data availability

Data generated or analyzed during the study are available from the corresponding author by request.

References

1. Cocco AM, Zordan R, Taylor DM, Weiland TJ, Dilley SJ, Kant J, et al. Dr Google in the ED: searching for online health information by adult emergency department patients. Med J Aust 2018;209:342–7.
2. Van Riel N, Auwerx K, Debbaut P, Van Hees S, Schoenmakers B. The effect of Dr Google on doctor-patient encounters in primary care: a quantitative, observational, cross-sectional study. BJGP Open 2017;1:bjgpopen17X100833.
3. Man A, van Ballegooie C. Assessment of the readability of web-based patient education material from major Canadian pediatric associations: cross-sectional study. JMIR Pediatr Parent 2022;5:e31820.
4. National Institutes of Health (NIH). Clear & simple [Internet]. NIH; 2015 [cited 2024 May 3]. Available from: https://www.nih.gov/institutes-nih/nih-office-director/office-communications-public-liaison/clear-communication/clear-simple.
5. Morrison AK, Schapira MM, Gorelick MH, Hoffmann RG, Brousseau DC. Low caregiver health literacy is associated with higher pediatric emergency department use and nonurgent visits. Acad Pediatr 2014;14:309–14.
6. Rak EC, Hooper SR, Belsante MJ, Burnett O, Layton B, Tauer D, et al. Caregiver word reading literacy and health outcomes among children treated in a pediatric nephrology practice. Clin Kidney J 2016;9:510–5.
7. Rahsepar AA, Tavakoli N, Kim GHJ, Hassani C, Abtin F, Bedayat A. How AI responds to common lung cancer questions: ChatGPT vs Google Bard. Radiology 2023;307:e230922.
8. Sarraju A, Bruemmer D, Van Iterson E, Cho L, Rodriguez F, Laffin L. Appropriateness of cardiovascular disease prevention recommendations obtained from a popular online chat-based artificial intelligence model. JAMA 2023;329:842–4.
9. Ayers JW, Poliak A, Dredze M, Leas EC, Zhu Z, Kelley JB, et al. Comparing physician and artificial intelligence chatbot responses to patient questions posted to a public social media forum. JAMA Intern Med 2023;183:589–96.
10. Sudharshan R, Shen A, Gupta S, Zhang-Nunes S. Assessing the utility of ChatGPT in simplifying text complexity of patient educational materials. Cureus 2024;16:e55304.
11. Korngiebel DM, Mooney SD. Considering the possibilities and pitfalls of Generative Pre-trained Transformer 3 (GPT-3) in healthcare delivery. NPJ Digit Med 2021;4:93.


Table 1.

ChatGPT-4 output evaluation

Question Default Simplified
Word count Readability Appropriateness Accuracy Word count Readability Appropriateness Accuracy
For my 4-month-old with fever, how much ibuprofen can I give my child? 185.3 ± 40.1 14.4 ± 3.4 Yes Yes 126.0 ± 37.5 8.1 ± 1.2 No No
For my 6-month-old with fever, how much Tylenol™ (Johnson & Johnson) should I give my child? 387.3 ± 25.1 15.3 ± 4.0 Disagree No 302.0 ± 68.4 6.1 ± 0.7 Disagree Disagree
For my 1-year-old with febrile seizure, does this mean they are going to have epilepsy? 359.3 ± 38.4 15.8 ± 3.9 Yes Yes 203.0 ± 22.5 7.5 ± 0.1 Yes Yes
For my 2-year-old with UTI, do they really need to take the antibiotics? 150.3 ± 17.2 13.1 ± 2.7 Yes Yes 127.7 ± 7.2 8.0 ± 0.7 Yes Yes
For my 2-year-old with AOM, my child’s symptoms have improved after 3 days of antibiotics. Do they need to finish the entire course? 281.3 ± 28.4 14.1 ± 3.9 Yes Yes 209.3 ± 21.4 8.7 ± 1.1 Yes Yes
For my 2-year-old with croup, when should I take them back to the ED? 200.7 ± 47.1 15.5 ± 4.1 Yes Yes 104.0 ± 29.9 8.4 ± 0.8 Yes Yes
For my 2-year-old with hand, foot, and mouth disease, is this contagious? 198.7 ± 34.8 15.1 ± 3.6 Yes Yes 149.7 ± 39.6 7.3 ± 0.3 Yes Yes
For my 2-year-old with nursemaid’s elbow, is this the same as a broken bone? 304.0 ± 67.2 13.4 ± 2.9 Yes Yes 228.0 ± 57.2 6.6 ± 0.1 Yes Yes
For my 2-year-old with viral illness, what are signs of dehydration? 125.0 ± 28.1 11.7 ± 2.5 Yes Yes 99.3 ± 5.7 6.7 ± 0.3 Yes Yes
For my 3-year-old with closed head injury, what signs should I be looking for that would signal I need to go back to the ED? 123.7 ± 13.2 11.5 ± 2.7 Yes Yes 88.7 ± 9.0 7.5 ± 0.3 Yes Disagree
For my 3-year-old with fever, what temperature is a fever? 281.3 ± 17.2 10.9 ± 2.3 Disagree Disagree 192.3 ± 17.2 6.4 ± 0.7 Yes Yes
For my 3-year-old with intussusception, is it possible that the intussusception can recur? 352.3 ± 12.9 14.0 ± 3.6 Yes Yes 245.7 ± 29.0 9.2 ± 0.6 Yes Yes
For my 4-year-old with laceration, my child had a laceration sutured. How should I take care of it? Can they take a bath or shower? 155.3 ± 18.6 11.5 ± 2.9 Yes Yes 114.3 ± 24.9 5.9 ± 0.1 Yes Yes
For my 4-year-old with laceration, my child had a laceration sutured. What are the signs of infection that I should be looking for? 116.7 ± 12.2 13.8 ± 2.7 Yes Yes 86.3 ± 16.4 6.8 ± 0.7 Yes Yes
For my 4-year-old with laceration, my child just had their cut sutured. How can I reduce the risk of scarring? 160.3 ± 11.9 14.0 ± 3.4 Yes Yes 95.7 ± 5.7 5.9 ± 0.9 Yes Yes
For my 4-year-old with laceration, when do the sutures need to be removed? 350.0 ± 39.0 11.1 ± 2.9 Yes Yes 205.7 ± 34.2 7.2 ± 1.1 Yes Yes
For my 6-year-old with constipation, how much MiraLAX™ (Polyethylene Glycol; Bayer HealthCare) should I give them? 295.3 ± 58.7 11.1 ± 2.5 Yes Disagree 230.0 ± 28.2 8.7 ± 1.3 Disagree Disagree
For my 6-year-old with viral syndrome, do antibiotics help for a virus? 385.3 ± 31.2 11.3 ± 3.0 Yes Yes 232.0 ± 45.2 8.7 ± 0.5 Yes Yes
For my 9-year-old with allergic rhinitis, what over-the-counter medication can I give them? 200.3 ± 75.1 11.5 ± 2.8 Yes Yes 162.0 ± 51.6 8.0 ± 0.6 Yes Yes
For my 10-year-old with viral gastroenteritis, how often can I give my child Zofran™ (Ondansetron; Novartis)? 168.0 ± 36.4 11.6 ± 3.1 Yes Yes 129.0 ± 48.4 8.4 ± 1.0 Yes Yes
For my 15-year-old with concussion, when can they return to school/sports? 130.7 ± 62.8 15.4 ± 3.8 Yes Yes 104.3 ± 51.7 8.2 ± 0.4 Yes Yes
For my 16-year-old with depression, when do I need to take them to the ED? 150.3 ± 44.5 14.0 ± 3.1 Yes Yes 128.3 ± 42.2 8.9 ± 0.5 Yes Yes
For my 18-year-old with a clavicle fracture, will he need surgery? 297.0 ± 21.7 14.8 ± 3.8 Yes Yes 209.7 ± 11.6 9.2 ± 2.5 Yes Yes

Values are expressed as means ± standard deviations. Data include word count, reading level (average of multiple readability indices), and physician evaluations of appropriateness and accuracy.

UTI: urinary tract infection, AOM: acute otitis media, ED: emergency department.

Table 2.

ChatGPT-3.5 output evaluation

Question Default Simplified
Word count Readability Appropriateness Accuracy Word count Readability Appropriateness Accuracy
For my 4-month-old with fever, how much ibuprofen can I give my child? 163.3 ± 21.5 15.0 ± 3.1 No No 119.0 ± 41.9 8.9 ± 1.2 No No
For my 6-month-old with fever, how much Tylenol™ (Johnson & Johnson) should I give my child? 347.7 ± 20.8 15.1 ± 4.4 Disagree Disagree 219.0 ± 31.2 7.6 ± 0.7 Disagree Disagree
For my 1-year-old with febrile seizure, does this mean they are going to have epilepsy? 260.7 ± 89.4 14.1 ± 3.3 Yes Yes 204.7 ± 41.5 9.0 ± 1.0 Yes Yes
For my 2-year-old with UTI, do they really need to take the antibiotics? 146.3 ± 4.0 16.6 ± 4.8 Yes Yes 82.3 ± 8.3 9.9 ± 0.9 Yes Yes
For my 2-year-old with AOM, my child’s symptoms have improved after 3 days of antibiotics. Do they need to finish the entire course? 175.7 ± 21.5 14.9 ± 3.3 Yes Yes 119.7 ± 26.5 9.6 ± 0.6 Yes Yes
For my 2-year-old with croup, when should I take them back to the ED? 155.3 ± 9.0 14.5 ± 2.7 Yes Yes 117.0 ± 25.6 9.8 ± 1.7 Yes Yes
For my 2-year-old with hand, foot, and mouth disease, is this contagious? 184.3 ± 91.1 15.3 ± 3.2 Yes Yes 157.7 ± 56.2 8.9 ± 0.8 Yes Yes
For my 2-year-old with nursemaid's elbow, is this the same as a broken bone? 300.0 ± 20.5 14.3 ± 2.7 Yes Yes 213.7 ± 32.9 7.8 ± 1.3 Yes Yes
For my 2-year-old with viral illness, what are signs of dehydration? 93.3 ± 4.5 11.9 ± 1.8 Yes Yes 68.0 ± 8.7 7.9 ± 0.6 Yes Yes
For my 3-year-old with closed head injury, what signs should I be looking for that would signal I need to go back to the ED? 138.7 ± 37.6 10.9 ± 2.1 Yes Yes 75.7 ± 1.5 8.3 ± 0.1 Yes Disagree
For my 3-year-old with fever, what temperature is a fever? 210.3 ± 39.6 11.6 ± 2.1 Yes Yes 200.3 ± 4.7 7.8 ± 0.8 Yes Yes
For my 3-year-old with intussusception, is it possible that the intussusception can recur? 364.3 ± 32.1 13.8 ± 3.0 Yes Yes 229.7 ± 33.3 6.9 ± 0.9 Yes Yes
For my 4-year-old with laceration, my child had a laceration sutured. How should I take care of it? Can they take a bath or shower? 76.0 ± 20.1 12.6 ± 2.8 Yes Yes 67.3 ± 18.8 6.0 ± 0.7 Yes Yes
For my 4-year-old with laceration, my child had a laceration sutured. What are the signs of infection that I should be looking for? 95.0 ± 12.5 13.3 ± 3.6 Yes Yes 66.0 ± 7.9 5.8 ± 1.2 Yes Yes
For my 4-year-old with laceration, my child just had their cut sutured. How can I reduce the risk of scarring? 165.3 ± 83.2 14.1 ± 3.3 Yes Yes 113.0 ± 60.9 6.5 ± 1.6 Yes Yes
For my 4-year-old with laceration, when do the sutures need to be removed? 297.3 ± 20.8 11.4 ± 3.0 Yes Yes 187.3 ± 6.8 6.9 ± 0.4 Yes Yes
For my 6-year-old with constipation, how much MiraLAX™ (Polyethylene Glycol; Bayer HealthCare) should I give them? 199.7 ± 67.1 11.2 ± 3.2 Yes Disagree 139.3 ± 30.1 10.0 ± 1.2 Disagree Disagree
For my 6-year-old with viral syndrome, do antibiotics help for a virus? 382.7 ± 53.8 11.5 ± 2.9 Yes Yes 192.7 ± 30.9 9.6 ± 0.6 Yes Yes
For my 9-year-old with allergic rhinitis, what over-the-counter medication can I give them? 120.3 ± 21.7 13.2 ± 3.5 Yes Yes 92.7 ± 11.0 7.4 ± 0.7 Yes Yes
For my 10-year-old with viral gastroenteritis, how often can I give my child Zofran™ (Ondansetron; Novartis)? 155.7 ± 20.5 12.8 ± 3.0 Yes Yes 129.0 ± 54.6 9.4 ± 0.7 Yes Yes
For my 15-year-old with concussion, when can they return to school/sports? 162.3 ± 20.0 13.5 ± 2.2 Yes Yes 147.0 ± 35.8 7.3 ± 1.0 Yes Yes
For my 16-year-old with depression, when do I need to take them to the ED? 124.0 ± 5.2 14.5 ± 2.9 Yes Yes 123.3 ± 29.7 9.4 ± 1.5 Yes Yes
For my 18-year-old with a clavicle fracture, will he need surgery? 273.3 ± 39.6 14.8 ± 4.1 Yes Yes 184.0 ± 46.2 7.9 ± 0.9 Yes Yes

Values are expressed as means ± standard deviations. Data include word count, reading level (average of multiple readability indices), and physician evaluations of appropriateness and accuracy.

UTI: urinary tract infection, AOM: acute otitis media, ED: emergency department.

Table 3.

Physician evaluation of model outputs: ChatGPT-4 and -3.5 (N = 23)

 Evaluation category Default Simplified
Reviewer 1 Reviewer 2 Reviewer 1 Reviewer 2
ChatGPT-4
 Appropriate 21 (91.3) 23 (100) 20 (87.0) 21 (91.3)
 Not appropriate 2 (8.7) 0 (0) 2 (8.7) 2 (8.7)
 Unreliable 0 (0) 0 (0) 1 (4.3) 0 (0)
 Accurate 20 (87.0) 22 (95.7) 19 (82.6) 21 (91.3)
 Not accurate 3 (13.0) 1 (4.3) 3 (13.0) 2 (8.7)
 Unreliable 0 (0) 0 (0) 1 (4.3) 0 (0)
ChatGPT-3.5
 Appropriate 21 (91.3) 21 (91.3) 20 (87.0) 21 (91.3)
 Not appropriate 1 (4.3) 2 (8.7) 2 (8.7) 2 (8.7)
 Unreliable 1 (4.3) 0 (0) 1 (4.3) 0 (0)
 Accurate 20 (87.0) 21 (91.3) 19 (82.6) 21 (91.3)
 Not accurate 2 (8.7) 2 (8.7) 3 (13.0) 2 (8.7)
 Unreliable 1 (4.3) 0 (0) 1 (4.3) 0 (0)

Values are presented as numbers (%).

Table 4.

Comparison of word count and 6 established readability scales: ChatGPT-4 and -3.5

 Response type Word count Automated Readability Index Gunning Fog Index Flesch-Kincaid Grade Level Coleman-Liau Index SMOG Grade Level Flesch Reading Ease Average reading level*
Default responses
 ChatGPT-4 233.0 ± 97.1 14.2 ± 2.3 12.8 ± 1.8 11.3 ± 1.9 12.8 ± 2.1 13.6 ± 1.7 14.8 ± 2.4 13.3 ± 1.9
 ChatGPT-3.5 199.6 ± 94.7 14.2 ± 2.3 12.9 ± 2.1 11.5 ± 1.9 13.3 ± 1.8 13.7 ± 1.7 15.5 ± 2.1 13.5 ± 1.8
Simplified responses
 ChatGPT-4 164.0 ± 67.0 8.8 ± 1.8 8.4 ± 1.4 6.4 ± 1.5 7.1 ± 1.4 8.2 ± 1.5 7.0 ± 1.1 7.7 ± 1.3
 ChatGPT-3.5 141.2 ± 59.2 9.0 ± 2.2 8.5 ± 1.5 6.6 ± 1.7 8.4 ± 1.7 9.1 ± 1.6 7.6 ± 1.5 8.2 ± 1.5

Values are expressed as means ± standard deviations.

* In each row, all values except the word count are averaged.

SMOG: Simple Measure of Gobbledygook.