1 Department of Obstetrics and Gynecology, Akdeniz University Faculty of Medicine, 07070 Antalya, Türkiye
Abstract
Large language models (LLMs), including ChatGPT, Gemini, Microsoft Copilot, and DeepSeek, are increasingly used to answer complex clinical questions. Their accessibility and rapid development suggest potential value in medical education, residency training, and point-of-care decision support. However, direct comparisons with obstetrics and gynecology trainees remain limited. This study aimed to compare the accuracy of multiple free and subscription-based LLMs with that of obstetrics and gynecology resident physicians across specialty-specific domains.
A validated 40-item single-best-answer multiple-choice set covering endocrinology, gynecology, obstetrics, and oncology was developed by subspecialists and demonstrated good internal consistency (Cronbach’s alpha = 0.82). Twenty-five residents completed the assessment. The same questions were posed to five LLMs: ChatGPT (GPT-4o Mini; free), ChatGPT (GPT-4o; subscription-based), Microsoft Copilot (GPT-4 based; free), DeepSeek (free), and Gemini (free).
Residents achieved a mean score of 20.6 out of 40 (51.5% correct). Overall accuracy was highest for ChatGPT (GPT-4o; subscription-based) at 90.0%, followed by DeepSeek and Gemini, each at 87.5%. ChatGPT (GPT-4o Mini; free) scored 47.5%, similar to residents, while Microsoft Copilot (free) scored 37.5%. In paired item-level comparisons using a two-sided exact McNemar test based on resident-majority item correctness (≥13/25 residents correct), ChatGPT (GPT-4o; subscription-based), DeepSeek, and Gemini significantly outperformed residents (all p < 0.001), whereas ChatGPT (GPT-4o Mini; free) and Microsoft Copilot did not differ significantly (p = 1.000 and p = 0.607, respectively). Domain-specific analyses showed that ChatGPT (GPT-4o; subscription-based), DeepSeek, and Gemini generally outperformed residents in gynecology, obstetrics, and oncology, with smaller and less consistent differences observed in endocrinology. Qualitative review of incorrect outputs noted occasional internally inconsistent or ambiguous responses.
Across a validated, curriculum-mapped 40-item question set, advanced LLMs achieved higher accuracy than residents, and in item-level paired comparisons using the resident-majority indicator, several models performed significantly better. Performance variability across models, including gaps between subscription-based and free versions and lower accuracy observed for some free tools, indicates that LLMs should complement, rather than replace, human expertise.
Keywords
- artificial intelligence (AI)
- large language models
- medical education
- clinical decision support
- obstetrics and gynecology
- resident physicians
Artificial intelligence (AI), particularly large language models (LLMs), has advanced rapidly in recent years, enabling novel applications across diverse areas of medicine [1, 2, 3]. These models can generate contextually relevant, human-like responses to complex clinical queries, and their accessibility, along with their ability to synthesize medical knowledge, highlights potential roles in medical education, residency training, and clinical decision-making [4, 5].
In obstetrics and gynecology, residency training demands mastery of a broad knowledge base spanning endocrinology, gynecology, obstetrics, and oncology. Traditional learning methods, such as textbooks, lectures, and supervised clinical experience, remain essential but are often limited by time constraints and variability in clinical exposure [6]. AI-driven tools may supplement training by providing on-demand access to information, thereby supporting both didactic learning and clinical reasoning [7]. Despite this promise, the reliability and accuracy of LLMs in specialized medical fields remain uncertain. Prior studies evaluating LLM performance in general medicine and other specialties have reported mixed findings compared with physicians [8]. However, evidence in obstetrics and gynecology remains limited [9]. Moreover, potential differences between freely accessible and subscription-based AI models have not been systematically examined.
This study aimed to address these gaps by systematically evaluating and comparing the accuracy of several widely used LLMs with that of obstetrics and gynecology resident physicians across specialty-specific domains [10]. In doing so, we sought to clarify the strengths and limitations of current general-purpose LLMs in supporting medical education and to assess whether these models could serve as reliable adjuncts in residency training and clinical decision-making [11].
This cross-sectional knowledge assessment compared the performance of multiple LLMs with that of obstetrics and gynecology resident physicians on a specialty-specific multiple-choice examination.
Twenty-five obstetrics and gynecology residents from a single tertiary university hospital participated in this study. Residents from the Akdeniz University Department of Obstetrics and Gynecology, across all postgraduate years, were eligible, and participants represented multiple training levels. Responses were anonymized and analyzed in aggregate. Subgroup analyses by resident characteristics (e.g., sex or postgraduate year) were not prespecified in the study protocol and were therefore neither performed nor reported.
Five LLMs were evaluated: ChatGPT (GPT-4o Mini; free) and ChatGPT (GPT-4o; subscription-based) from OpenAI; Gemini (free; Google); DeepSeek (free); and Microsoft Copilot (free; GPT-4-based).
The resident examination had been administered previously as part of a routine educational assessment. Following Institutional Review Board (IRB) approval, the examination items were extracted, de-identified, and then entered into each LLM via its publicly available web-based user interface between 26 September 2025 and 30 September 2025. All systems were accessed via their standard publicly available web interfaces using default (free) configurations, except for ChatGPT, which was evaluated in both free and subscription versions.
For each item, the question stem and all five answer options were copied verbatim into a single prompt, and the model was instructed to select the single best answer. Both the resident assessment and all LLM prompts were administered in Turkish, with question stems and answer options entered exactly as written, without translation. Each item was run once per model using default web-interface settings (temperature not user-configurable). When models returned free-text outputs, the final selected option was coded as A, B, C, D, or E for analysis. Web browsing and external plugins were not used on any platform to standardize testing conditions and evaluate baseline performance.
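As an illustration of this response-coding step, the Python sketch below shows one way free-text outputs could be mapped to option letters. The function name, regex patterns, and example output are hypothetical and do not represent the study’s actual procedure; in practice, outputs without an unambiguous letter would be coded manually.

```python
import re

def code_selected_option(model_output: str) -> str | None:
    """Map a free-text model response to a single option letter (A-E).

    Returns None when no unambiguous letter is found, in which case
    the response would be reviewed and coded by hand.
    """
    # Prefer explicit patterns such as "Answer: C" (or Turkish "cevap: C",
    # since prompts and responses were in Turkish).
    match = re.search(r"(?:answer|cevap)\s*[:\-]?\s*\(?([A-E])\)?",
                      model_output, re.IGNORECASE)
    if match:
        return match.group(1).upper()
    # Fall back to a lone option letter at the start of the response, e.g. "C) ...".
    match = re.match(r"\s*\(?([A-E])\)?[).:\s]", model_output)
    if match:
        return match.group(1).upper()
    return None

print(code_selected_option("Doğru cevap: C) Metformin"))  # -> C
```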
The initial item pool consisted of 50 single-best-answer questions constructed by subspecialty-trained obstetrics and gynecology faculty members to reflect core learning objectives from the national residency curriculum in endocrinology, gynecology, obstetrics, and gynecologic oncology. All items were drawn from our institution’s internal examination bank, which is not publicly available online or in commercial question resources; this reduced, although it did not eliminate, the likelihood that the exact questions had been encountered during LLM training. Each item included one correct answer and four distractors.
Before finalization, faculty members reviewed the stems, answer options, and keyed correct answers to ensure clarity and key accuracy. Items were excluded if they contained keying errors, ambiguous or internally inconsistent wording, multiple defensible answers, an unclear clinical context, or unacceptable psychometric properties on item analysis (for example, extreme floor or ceiling effects or negative discrimination). A biostatistician performed item analysis using residents’ responses to the initial 50-item pool, with a focus on item difficulty and discrimination, and items flagged as problematic (e.g., ambiguity, extreme difficulty, or negative discrimination) were removed during refinement. Ten items were excluded in this process. The final examination comprised 40 items and demonstrated good internal consistency (Cronbach’s alpha = 0.82). All analyses in this study, including all LLM evaluations, are based on this finalized 40-item set, and the same items were administered to both residents and LLMs.
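For readers interested in reproducing this psychometric step, the sketch below shows one conventional way to compute item difficulty, corrected point-biserial discrimination, and Cronbach’s alpha from a binary response matrix. It is illustrative only and may differ from the biostatistician’s exact procedure.

```python
import numpy as np

def item_analysis(responses: np.ndarray):
    """responses: binary matrix (n_residents x n_items), 1 = correct.

    Returns per-item difficulty (proportion correct), corrected
    point-biserial discrimination (item vs. rest-of-test score),
    and Cronbach's alpha for the full set.
    """
    n_residents, n_items = responses.shape
    difficulty = responses.mean(axis=0)
    total = responses.sum(axis=1)
    discrimination = np.empty(n_items)
    for j in range(n_items):
        rest = total - responses[:, j]  # exclude the item itself
        # Undefined (nan) if an item has zero variance, e.g., all correct.
        discrimination[j] = np.corrcoef(responses[:, j], rest)[0, 1]
    # Cronbach's alpha: k/(k-1) * (1 - sum of item variances / total variance).
    item_var = responses.var(axis=0, ddof=1)
    total_var = total.var(ddof=1)
    alpha = (n_items / (n_items - 1)) * (1 - item_var.sum() / total_var)
    return difficulty, discrimination, alpha
```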
Accuracy was defined as the proportion of correct responses within each group. For residents, overall accuracy was summarized as the mean number of correctly answered items across the 40 questions and expressed as a percentage. For each LLM, accuracy was defined as the percentage of correctly answered items out of 40.
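The 95% confidence intervals reported in the tables use the Wilson score method (see the Table 1 footnote). As a minimal sketch, the interval for a model answering 36 of 40 items correctly can be reproduced with statsmodels:

```python
from statsmodels.stats.proportion import proportion_confint

# Wilson 95% CI for ChatGPT (GPT-4o): 36 of 40 items correct.
low, high = proportion_confint(count=36, nobs=40, alpha=0.05, method="wilson")
print(f"{36 / 40:.1%} [{low:.2%}, {high:.2%}]")  # 90.0% [76.95%, 96.04%]

# For residents, the interval pools responses across the cohort
# (25 residents x 40 items = 1000 responses); a pooled count of
# roughly 515-516 correct answers reproduces the interval in Table 1.
low, high = proportion_confint(count=516, nobs=1000, alpha=0.05, method="wilson")
```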
For item-level paired comparisons between residents and each LLM, resident performance was summarized at the item level using a majority vote: an item was coded as correct for the resident group if at least 13 of 25 residents selected the correct answer. For each model, paired 2 × 2 tables of item-level correctness (model vs. resident-majority) were constructed, and differences were assessed with a two-sided exact McNemar test on the discordant items.
Twenty-five obstetrics and gynecology residents completed the 40-item multiple-choice examination spanning gynecology, obstetrics, endocrinology, and oncology. Internal consistency was high (Cronbach’s alpha = 0.82). Five LLMs were evaluated on the same item set. Overall accuracy varied across groups (Table 1). Residents achieved a mean score of 20.6 out of 40 (51.5% correct). Among the AI models, ChatGPT (GPT-4o; subscription-based) achieved the highest overall accuracy (36/40; 90.0%), followed by DeepSeek (35/40; 87.5%) and Gemini (35/40; 87.5%). ChatGPT (GPT-4o Mini; free) performed similarly to residents (19/40; 47.5%), whereas Microsoft Copilot (free) had the lowest overall accuracy (15/40; 37.5%). Corresponding 95% confidence intervals (CIs) are shown in Table 1.
| Group | Number of questions | Mean correct (of 40) | Accuracy (%) | 95% CI (%) |
| --- | --- | --- | --- | --- |
| Residents (n = 25) | 40 | 20.6 | 51.5 | [48.50–54.69] |
| ChatGPT (GPT-4o; subscription-based) | 40 | 36.0 | 90.0 | [76.95–96.04] |
| DeepSeek (free) | 40 | 35.0 | 87.5 | [73.89–94.54] |
| Gemini (free) | 40 | 35.0 | 87.5 | [73.89–94.54] |
| ChatGPT (GPT-4o Mini; free) | 40 | 19.0 | 47.5 | [32.94–62.50] |
| Microsoft Copilot (free) | 40 | 15.0 | 37.5 | [24.22–52.97] |
n = 40 items per group; 95% CIs are expressed as percentages and were calculated using the Wilson score method. For residents, the CI was computed from pooled responses across all residents (25 residents × 40 items = 1000 responses).
Fig. 1 summarizes overall accuracy (percent correct) across residents and model configurations. Overall performance fell into three tiers. ChatGPT (GPT-4o; subscription-based) achieved the highest accuracy numerically (36/40; 90.0%), followed closely by DeepSeek and Gemini (35/40; 87.5% each). Residents and ChatGPT (GPT-4o Mini; free) showed intermediate, similar accuracy, whereas Microsoft Copilot (free) performed worst across the 40-item examination.
Fig. 1. Overall accuracy across residents and LLM configurations. Bars show the percentage of correctly answered items out of 40. Residents’ accuracy reflects pooled performance across 25 residents (25 × 40 = 1000 item responses).
At the item level, the resident-majority was correct on 18/40 items (45.0%). In paired item-level comparisons using the exact McNemar test (Table 2), ChatGPT (GPT-4o; subscription-based) outperformed the resident-majority (36/40 vs. 18/40; b = 21, c = 3; p < 0.001), as did DeepSeek and Gemini (each 35/40 vs. 18/40; b = 19, c = 2; p < 0.001). ChatGPT (GPT-4o Mini; free) (19/40 vs. 18/40; b = 9, c = 8; p = 1.000) and Microsoft Copilot (free) (15/40 vs. 18/40; b = 6, c = 9; p = 0.607) did not differ significantly from the resident-majority.
| Model | AI correct n (%) | Resident-majority correct n (%) | b (AI+/Res−) | c (AI−/Res+) | McNemar exact p (2-sided) |
| --- | --- | --- | --- | --- | --- |
| ChatGPT (GPT-4o; subscription-based) | 36 (90.0) | 18 (45.0) | 21 | 3 | <0.001 |
| ChatGPT (GPT-4o Mini; free) | 19 (47.5) | 18 (45.0) | 9 | 8 | 1.000 |
| DeepSeek (free) | 35 (87.5) | 18 (45.0) | 19 | 2 | <0.001 |
| Gemini (free) | 35 (87.5) | 18 (45.0) | 19 | 2 | <0.001 |
| Microsoft Copilot (free) | 15 (37.5) | 18 (45.0) | 6 | 9 | 0.607 |
Resident-majority correct was defined as ≥13 of 25 residents selecting the correct answer. b = number of items answered correctly by the model but not by the resident-majority; c = number of items answered correctly by the resident-majority but not by the model.
Domain-specific analyses are summarized in Table 3. Given the small number of items in some subdomains (particularly endocrinology, n = 3), these domain-specific results are descriptive and provided for reference only; they should not be interpreted as statistically precise estimates. In endocrinology (3 items), residents attained 32.0% accuracy, whereas ChatGPT (GPT-4o; subscription-based), DeepSeek, and Gemini answered all items correctly (100.0% each). Both ChatGPT (GPT-4o Mini; free) and Microsoft Copilot (free) showed accuracies comparable to residents (33.3%). In gynecology (18 items), residents scored 51.1% on average, whereas ChatGPT (GPT-4o; 83.3%), DeepSeek (100.0%), and Gemini (94.4%) achieved substantially higher accuracy. ChatGPT (GPT-4o Mini) performed similarly to residents (55.6%), while Microsoft Copilot remained lower (38.9%). In obstetrics (9 items), residents achieved 64.0% accuracy, lower than ChatGPT (GPT-4o; 100.0%), DeepSeek (77.8%), and Gemini (88.9%). ChatGPT (GPT-4o Mini; 66.7%) and Microsoft Copilot (55.6%) demonstrated intermediate performance. In oncology (10 items), residents scored 47.2%, whereas ChatGPT (GPT-4o), DeepSeek, and Gemini reached 90.0%, 70.0%, and 70.0% accuracy, respectively. ChatGPT (GPT-4o Mini) and Microsoft Copilot performed worst in this domain (both 20.0%). CIs were wide in domains with few items, particularly endocrinology, reflecting limited precision for these estimates.
| Category | Model | Number of items | Mean correct | Accuracy (%) | 95% CI (%) |
| --- | --- | --- | --- | --- | --- |
| Endocrinology | ChatGPT (GPT-4o Mini; free) | 3 | 1.0 | 33.3 | [6.15–79.23] |
| | ChatGPT (GPT-4o; subscription-based) | 3 | 3.0 | 100.0 | [43.85–100.00] |
| | DeepSeek (free) | 3 | 3.0 | 100.0 | [43.85–100.00] |
| | Gemini (free) | 3 | 3.0 | 100.0 | [43.85–100.00] |
| | Microsoft Copilot (free) | 3 | 1.0 | 33.3 | [6.15–79.23] |
| | Residents | 3 | 1.0 | 32.0 | [22.54–43.21] |
| Gynecology | ChatGPT (GPT-4o Mini; free) | 18 | 10.0 | 55.6 | [33.72–75.44] |
| | ChatGPT (GPT-4o; subscription-based) | 18 | 15.0 | 83.3 | [60.78–94.16] |
| | DeepSeek (free) | 18 | 18.0 | 100.0 | [82.41–100.00] |
| | Gemini (free) | 18 | 17.0 | 94.4 | [74.24–99.01] |
| | Microsoft Copilot (free) | 18 | 7.0 | 38.9 | [20.30–61.38] |
| | Residents | 18 | 9.2 | 51.1 | [46.50–55.70] |
| Obstetrics | ChatGPT (GPT-4o Mini; free) | 9 | 6.0 | 66.7 | [35.42–87.94] |
| | ChatGPT (GPT-4o; subscription-based) | 9 | 9.0 | 100.0 | [70.08–100.00] |
| | DeepSeek (free) | 9 | 7.0 | 77.8 | [45.26–93.68] |
| | Gemini (free) | 9 | 8.0 | 88.9 | [56.50–98.01] |
| | Microsoft Copilot (free) | 9 | 5.0 | 55.6 | [26.66–81.12] |
| | Residents | 9 | 5.8 | 64.0 | [57.54–69.99] |
| Oncology | ChatGPT (GPT-4o Mini; free) | 10 | 2.0 | 20.0 | [5.67–50.98] |
| | ChatGPT (GPT-4o; subscription-based) | 10 | 9.0 | 90.0 | [59.58–98.21] |
| | DeepSeek (free) | 10 | 7.0 | 70.0 | [39.68–89.22] |
| | Gemini (free) | 10 | 7.0 | 70.0 | [39.68–89.22] |
| | Microsoft Copilot (free) | 10 | 2.0 | 20.0 | [5.67–50.98] |
| | Residents | 10 | 4.7 | 47.2 | [41.10–53.38] |
Endocrinology: n = 3 items; gynecology: n = 18; obstetrics: n = 9; oncology: n = 10; 95% CIs are percentages. Subdomain estimates, particularly for endocrinology (n = 3 items), are descriptive and provided for reference only due to limited statistical precision.
Qualitative review (data not shown) noted that some incorrect model outputs were internally inconsistent or ambiguous, particularly for items with nuanced clinical framing. Among the reviewed resident errors (sampled), responses more often reflected incomplete or partially correct reasoning rather than consistent selection of a single incorrect option.
This study directly compared several LLMs with obstetrics and gynecology residents using a specialty-specific multiple-choice examination, providing domain-focused insights in this field. The primary finding was that an advanced subscription-based model, ChatGPT (GPT-4o; subscription-based), along with two free models, DeepSeek and Gemini, achieved substantially higher overall accuracy than residents. In paired item-level comparisons using an exact McNemar framework based on resident-majority item correctness, ChatGPT (GPT-4o; subscription-based), DeepSeek, and Gemini also demonstrated significantly higher item-level accuracy than the resident group (all p < 0.001).
Our findings are consistent with reports from other medical specialties in which state-of-the-art LLMs have achieved near-expert or superior performance on formal knowledge assessments [12]. The present study extends the literature by focusing on obstetrics and gynecology, a field that integrates cognitive, procedural, and guideline-driven decision-making and often relies on case-based reasoning. The high accuracy of ChatGPT (GPT-4o; subscription-based), DeepSeek, and Gemini across the 40-item examination, particularly in gynecology and oncology, suggests potential value as educational and decision-support adjuncts in women’s health. At the same time, the similar performance of ChatGPT (GPT-4o Mini; free) to residents, along with the lower overall accuracy observed for Microsoft Copilot, illustrates that performance is not uniform across commonly available LLM interfaces.
Our findings complement prior obstetrics and gynecology-focused LLM studies, including work using reflective writing tasks and vignette-based scenarios. Those studies primarily evaluated performance on narrative responses or simulated clinical cases under specific constraints, whereas we employed a standardized, curriculum-mapped multiple-choice examination mapped to the national residency curriculum. By directly comparing several free and subscription-based models with residents on the same specialty-specific examination under standardized inputs, our study provides a multi-model, exam-based perspective that aligns with formal assessment settings while remaining distinct from vignette- and portfolio-based evaluations.
From a clinical and educational perspective, these results suggest several possible applications. High-performing models may serve as adjunctive study aids by providing immediate feedback on practice questions, generating explanations, and supporting guideline review. However, the observed variability among models, including differences between subscription-based and free systems, underscores that recommendations for LLM use should be model-specific and supported by ongoing validation [13, 14, 15, 16]. Careful model selection will therefore be essential if training programs consider integrating LLMs into curricula or clinical decision-support workflows. Our findings should be interpreted as supporting adjunct use rather than replacement. LLMs may complement trainees by providing explanations and highlighting guideline-based concepts, but clinically actionable outputs must be verified against authoritative references and interpreted under faculty or clinician supervision. In clinical settings, LLM use should adhere to local governance policies, with verification of high-stakes recommendations documented and accountability clearly retained by the supervising clinician.
When an LLM output references a study, guideline, or specific evidence, learners should treat the response as a preliminary aid and verify key claims in the primary source. If the model’s recommendation conflicts with established references or supervisor instruction, the appropriate course of action is to prioritize authoritative guidance, cross-check the cited source, and consult faculty or senior clinicians. This verification step is particularly important for clinically actionable content, as unverified outputs should not guide patient care. In practice, training programs should implement clear guardrails for LLM use, including reference standards and escalation pathways when outputs are uncertain or conflict with established guidance. Ongoing validation should be coordinated at the institutional level by residency program leadership and clinical governance structures, with guidance informed by specialty societies as models evolve.
The use of an exact McNemar test reflects the paired structure of the data at the item level. Each examination item was answered by both the resident cohort and by each model, enabling item-level comparisons that partially control for item difficulty. Nevertheless, this approach summarizes resident performance using a group-level majority-vote indicator and does not fully account for within-resident and within-item correlation structures. More complex approaches, such as mixed-effects models, could be considered in future studies, particularly in larger, multi-institutional cohorts.
A descriptive review of incorrect model outputs suggested that some responses were internally inconsistent or ambiguous, particularly for items with nuanced clinical framing. Because this review was not a formal safety assessment and did not quantify “clinically unsafe” outputs using predefined criteria and independent raters, conclusions regarding comparative safety across models cannot be drawn from this study. Future studies should classify incorrect outputs using predefined, safety-focused error categories and independent clinician raters. More broadly, concerns have been raised that LLMs may provide incorrect answers with high confidence [14, 17]; such observations reinforce the need for critical appraisal and clinician oversight, particularly when outputs are applied outside educational contexts [17].
This study has several strengths, including a specialty-specific examination developed by domain experts, high internal consistency, direct comparisons across multiple LLMs and residents, and transparent item-level paired analyses.
This study provides empirical data on the performance of general-purpose LLMs compared with obstetrics and gynecology residents on standardized clinical questions. By characterizing accuracy and the nature of incorrect outputs, it informs educators and health system leaders about the realistic capabilities and limitations of current models. These findings can guide the safe integration of AI tools into postgraduate training and clinical decision-support workflows, while emphasizing the continued need for clinician oversight.
Nevertheless, several limitations should be acknowledged. First, the sample was limited to 25 residents from a single institution, which may restrict generalizability and might not capture heterogeneity across postgraduate years. Accordingly, external validity may be limited, and multi-institutional studies with larger, more diverse resident cohorts are needed to replicate these findings. Second, the assessment focused on multiple-choice questions rather than on open-ended reasoning, communication skills, or real-time patient management. Third, some domains, particularly endocrinology, were represented by few items, resulting in wide CIs and reduced precision in domain-specific estimates. Fourth, each model was evaluated using a single prompt format and a single run per item, and LLMs are updated frequently; performance may therefore differ with alternative prompts, settings, or future model versions. Although web-browsing or search features were not enabled or intentionally used, we cannot fully exclude the possibility of proprietary, platform-level retrieval in some interfaces; this should be considered when interpreting results. Finally, we did not evaluate usability, explanation quality, or patient-level outcomes. Taken together, these limitations indicate that our findings should be interpreted as reflecting performance on a structured knowledge test rather than as evidence of real-world clinical effectiveness.
Advanced LLM configurations, including ChatGPT (GPT-4o; subscription-based), DeepSeek, and Gemini, achieved higher accuracy than residents on this specialty-specific multiple-choice examination and outperformed the resident-majority indicator in paired item-level comparisons. These findings support the potential role of LLMs as supplementary educational tools and adjuncts for structured information support in women’s health. Importantly, performance was not uniform across models. ChatGPT (GPT-4o Mini; free) did not differ significantly from residents, and Microsoft Copilot showed lower overall accuracy but did not differ significantly from residents in paired item-level comparisons. Taken together, these results support the use of LLMs as adjuncts rather than replacements for human expertise and emphasize the importance of careful model selection, ongoing validation, and responsible integration with clinician oversight [17].
The datasets generated and analyzed during this study are available from the corresponding author upon reasonable request. Due to institutional privacy policies, de-identified datasets are not publicly available.
ÖFÖ and CD conceptualized the study and designed the question set. EŞ, SD, and NUD contributed to data collection and analysis. MÖ and İM made substantial contributions to the interpretation of data, academic supervision, and critical revision of the manuscript for important intellectual content. All authors contributed to editorial changes in the manuscript. All authors read and approved the final manuscript. All authors have participated sufficiently in the work and agreed to be accountable for all aspects of the work.
The study was conducted in accordance with the Declaration of Helsinki and approved by the Akdeniz University Faculty of Medicine Clinical Research Ethics Committee (approval number: TBAEK-863). Written informed consent was obtained from all participants prior to inclusion in the study.
The authors thank all resident physicians from Akdeniz University Faculty of Medicine, Department of Obstetrics and Gynecology, for their participation in this study. We also appreciate the contributions of the department’s academic staff in reviewing the question bank.
This research received no external funding.
The authors declare that they have no conflicts of interest in the research. All figures and tables are original and were created by the authors for this manuscript; no permissions were required.
During the preparation of this work the authors used ChatGPT for language editing (spelling/grammar) and clarity improvement. After using this tool, the authors reviewed and edited the content as needed and take full responsibility for the content of the publication.
References
Publisher’s Note: IMR Press stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.