Background
Simulation-based learning (SBL) has become standard practice in educating health care professionals to apply their knowledge and skills in patient care. Although SBL has demonstrated its value in education, many educators find developing new, unique scenarios time-intensive, which limits the variety of clinical situations students can encounter in educational settings. Generative artificial intelligence (AI) platforms, such as ChatGPT (OpenAI), have emerged as potential tools for developing simulation case studies more efficiently, though little is known about how well AI performs in generating high-quality case studies for interprofessional education.
Objective
This study aimed to generate geriatric case scenarios using 5 generative AI platforms and to have a transdisciplinary team systematically evaluate the scenarios for quality, accuracy, and bias.
Methods
Ten geriatric case studies were generated from each of 5 generative AI platforms using the same prompt (N=50 cases): ChatGPT, Claude (Anthropic), Copilot (Microsoft), Gemini (Google), and Grok (xAI). An evaluation tool was developed to assess the content and quality of each case, the sociodemographic characteristics of the featured patient, the appropriateness of each case for interprofessional education, and potential bias. Case quality was evaluated using the Simulation Scenario Evaluation Tool (SSET). Each case was evaluated by 3 team members with experience in SBL education. Assessment scores were averaged, and qualitative responses were extracted to triangulate patterns found in the quantitative data.
Results
While each AI platform was able to generate 10 unique case studies, quality varied both within and across platforms. Evaluators generally judged the case content to be accurate, though some cases lacked realism. Some patient populations and conditions common among older adults were underrepresented or absent across the cases. All cases were set in traditional health care settings (eg, hospitals and routine medical visits); none featured home-based care. Based on average SSET scores, reviewers rated ChatGPT the highest overall performer (mean 3.27, SD 0.45, 95% CI 2.95-3.59), while Grok received the lowest scores (mean 1.61, SD 1.26, 95% CI 0.71-2.51). Platforms performed best at generating learning objectives (mean 3.35, SD 1.08, 95% CI 3.04-3.65) and worst at describing the supplies and materials that would be available in the hypothetical scenarios (mean 1.27, SD 0.84, 95% CI 1.03-1.51).
Conclusions
This study is, to our knowledge, the first to systematically evaluate and compare multiple generative AI platforms for case study generation using a validated assessment tool (the SSET), and it provides evidence-based guidance on selecting and using AI tools effectively. The findings offer practical direction for educators navigating available generative AI tools to enhance training for health care professionals, including specific prompt-engineering strategies that can improve the quality of SBL resources in interprofessional education. These insights enable educators to leverage AI capabilities while maintaining pedagogical rigor.