Generative artificial intelligence (AI) chatbots, such as ChatGPT, have become more capable and more widely used, making their potential role in patient education increasingly salient. A study aimed to compare the educational utility of six AI chatbots by quantifying the readability and quality of their answers to common patient questions about clavicle fracture management.

**Methods:** ChatGPT 4, ChatGPT 4o, Gemini 1.0, Gemini 1.5 Pro, Microsoft Copilot, and Perplexity were used with no prior training. Ten representative patient questions about clavicle fractures were posed to each model. The readability of the AI responses was measured with the Flesch-Kincaid Reading Grade Level, Gunning Fog, and Simple Measure of Gobbledygook (SMOG) indices. Six orthopedists blindly graded the quality of each model's responses using the DISCERN criteria. Both metrics were analyzed with the Kruskal-Wallis test.

**Results:** No statistically significant difference in readability was found among the six models. Microsoft Copilot (70.33±7.74) and Perplexity (71.83±7.57) achieved significantly higher DISCERN scores than ChatGPT 4 (56.67±7.15) and Gemini 1.5 Pro (51.00±8.94), with a similar difference between Gemini 1.0 (68.00±6.42) and Gemini 1.5 Pro. The mean overall quality rating (DISCERN question 16) of each model was at or above average (range, 3-4.4).

**Conclusion:** The findings suggest that generative AI models can serve as supplementary patient education materials. With comparable readability and high overall quality, Microsoft Copilot and Perplexity may be identified as the chatbots with the most educational utility regarding surgical intervention for clavicle fractures 1).
## Critical Review of the Study on Generative AI Chatbots in Patient Education for Clavicle Fractures
### Introduction

Generative AI chatbots are increasingly being considered for patient education, offering accessible and scalable information. This study sought to compare six AI chatbots (ChatGPT 4, ChatGPT 4o, Gemini 1.0, Gemini 1.5 Pro, Microsoft Copilot, and Perplexity) by evaluating the readability and quality of their responses to common patient questions about clavicle fracture management. The study employed objective readability metrics (Flesch-Kincaid, Gunning Fog, and SMOG) and the DISCERN criteria, rated by orthopedic surgeons, to assess the quality of responses.
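As a point of reference for how such readability scores can be reproduced, the sketch below computes the three indices named above for a sample chatbot-style answer. It is a minimal sketch assuming the third-party Python package `textstat`; the sample text is invented for illustration and is not a response from the study.

```python
import textstat  # third-party package: pip install textstat

# Illustrative chatbot-style answer (not a response from the study).
response = (
    "A broken collarbone usually heals on its own with a sling, ice, and "
    "pain medicine. Surgery is considered when the bone ends are badly "
    "displaced or the skin is at risk. Most people return to normal "
    "activity within a few months."
)

# The three indices used in the study; textstat implements the standard
# published formulas. Note that SMOG was designed for texts of 30+ sentences,
# so scores on short passages are only rough estimates.
print("Flesch-Kincaid grade:", textstat.flesch_kincaid_grade(response))
print("Gunning Fog index:   ", textstat.gunning_fog(response))
print("SMOG index:          ", textstat.smog_index(response))
```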
While the study provides valuable insights, certain methodological limitations and broader implications must be critically examined.
### Strengths of the Study

1. Objective Comparison Across Multiple AI Models
2. Blinded Evaluation by Orthopedists
3. Use of Statistical Analysis (a worked Kruskal-Wallis sketch follows this list)
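To make the statistical approach concrete, the following sketch applies the Kruskal-Wallis H-test to hypothetical DISCERN totals from six raters. The numbers are invented for illustration and are not the study's data; `scipy` is assumed.

```python
from scipy.stats import kruskal

# Hypothetical DISCERN totals from six blinded raters, one list per chatbot
# (illustrative values only, loosely matching the reported group means).
discern = {
    "ChatGPT 4":      [57, 55, 58, 54, 59, 57],
    "ChatGPT 4o":     [60, 62, 59, 61, 63, 58],
    "Gemini 1.0":     [68, 66, 70, 67, 69, 66],
    "Gemini 1.5 Pro": [51, 49, 53, 50, 52, 51],
    "Copilot":        [70, 69, 72, 71, 68, 72],
    "Perplexity":     [72, 73, 70, 74, 71, 71],
}

# Kruskal-Wallis H-test: a non-parametric comparison across the six groups,
# appropriate for small samples of ordinal rating-scale scores.
h_stat, p_value = kruskal(*discern.values())
print(f"H = {h_stat:.2f}, p = {p_value:.4f}")
```

A significant omnibus result would still require post-hoc pairwise comparisons (for example, Dunn's test with a multiplicity correction) to identify which chatbots differ, as the study reports for Copilot and Perplexity versus ChatGPT 4 and Gemini 1.5 Pro.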
### Limitations and Criticisms

1. Limited Scope of Medical Topic (Clavicle Fractures Only)
2. Lack of Contextual and Conversational Analysis
3. Potential Bias in DISCERN Evaluation (an inter-rater agreement check is sketched after this list)
4. Exclusion of Prior Training or Optimization
5. Clinical Safety and Misinformation Risks Not Addressed
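One way to probe the rater-bias concern above is to quantify how consistently the six orthopedists ranked the chatbots. The sketch below computes Kendall's coefficient of concordance (W) from hypothetical rating data; this analysis is not part of the study, and the scores are invented for illustration.

```python
import numpy as np
from scipy.stats import rankdata

# Hypothetical DISCERN totals: rows = 6 raters, columns = 6 chatbots
# (illustrative values only, not the study's data).
scores = np.array([
    [57, 60, 68, 51, 70, 72],
    [55, 62, 66, 49, 69, 73],
    [58, 59, 70, 53, 72, 70],
    [54, 61, 67, 50, 71, 74],
    [59, 63, 69, 52, 68, 71],
    [57, 58, 66, 51, 72, 71],
])

# Rank each rater's scores across the six chatbots (ties get average ranks).
ranks = np.apply_along_axis(rankdata, 1, scores)

m, n = ranks.shape                       # m raters, n chatbots
rank_sums = ranks.sum(axis=0)            # rank total per chatbot
s = ((rank_sums - m * (n + 1) / 2) ** 2).sum()
w = 12 * s / (m ** 2 * (n ** 3 - n))     # Kendall's W (no tie correction)
print(f"Kendall's W = {w:.3f}")          # 1.0 = perfect agreement, 0 = none
```

A low W would suggest that the DISCERN scores reflect individual rater preferences as much as true quality differences between chatbots.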
### Implications and Future Research Directions

1. Broader Medical Applications
2. Patient-Centered Evaluation
3. Conversational Adaptability and Emotional Intelligence
4. Longitudinal Studies on AI Integration in Clinical Settings
### Conclusion

The study provides a useful comparative analysis of AI chatbots for patient education on clavicle fractures, demonstrating that Microsoft Copilot and Perplexity performed best in response quality. However, its narrow focus, lack of conversational analysis, and exclusion of clinical safety considerations highlight the need for further research. Future studies should explore AI chatbots' broader medical applications, misinformation risks, and their integration into clinical practice.