Abstract
BACKGROUND: The rapid integration of Large Language Models (LLMs) into healthcare necessitates a rigorous evaluation of their performance in specialized medical fields. In metabolic bariatric surgery (MBS), LLMs have the potential to revolutionize education and clinical support, yet their accuracy and reliability are not well-established. This study provides a critical assessment of the capabilities of current LLMs in the context of MBS.
METHODS: This cross-sectional validation study assessed the performance of six LLMs (ChatGPT-3.5, ChatGPT-4o, Gemini, Copilot, GROK, and DeepSeek) on 100 evidence-based binary and multiple-choice questions related to MBS. Questions were constructed from international guidelines and categorized into six thematic domains. Expert consensus answers served as the reference standard, with inter-rater reliability measured using Fleiss’ κ. Model outputs were scored for accuracy. Overall differences in accuracy across the six related sets of model responses were assessed with the Friedman test, followed by post-hoc pairwise comparisons between LLMs to localize specific performance differences.
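For readers who wish to reproduce this kind of analysis, the following is a minimal sketch (not the authors' code) of the pipeline described above in Python. The data arrays are random placeholders rather than the study dataset, the number of experts and answer options is illustrative, and the use of Bonferroni-adjusted Wilcoxon signed-rank tests for the pairwise step is an assumption, since the abstract does not name the post-hoc test.

```python
# Sketch of the reported statistics; all data are placeholders, not study data.
import numpy as np
from itertools import combinations
from scipy.stats import friedmanchisquare, wilcoxon
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

rng = np.random.default_rng(0)

# Inter-expert reliability: rows = 100 questions, columns = hypothetical
# experts, cells = chosen answer option; aggregate_raters converts these
# ratings into the per-category count table that fleiss_kappa expects.
expert_answers = rng.integers(0, 4, size=(100, 3))
counts, _ = aggregate_raters(expert_answers)
print(f"Fleiss' kappa = {fleiss_kappa(counts):.3f}")

# Per-question correctness for the six LLMs (1 = matches expert consensus).
scores = rng.integers(0, 2, size=(100, 6))
models = ["ChatGPT-3.5", "ChatGPT-4o", "Gemini", "Copilot", "GROK", "DeepSeek"]

# Overall test for differences between the six related samples.
stat, p = friedmanchisquare(*(scores[:, j] for j in range(6)))
print(f"Friedman: chi2 = {stat:.3f}, p = {p:.3f}")

# Post-hoc pairwise comparisons (assumed Wilcoxon, Bonferroni-adjusted).
pairs = list(combinations(range(6), 2))
for i, j in pairs:
    if np.any(scores[:, i] != scores[:, j]):  # wilcoxon needs nonzero diffs
        _, pw = wilcoxon(scores[:, i], scores[:, j])
        print(f"{models[i]} vs {models[j]}: adj. p = {min(pw * len(pairs), 1.0):.3f}")
```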
RESULTS: Across the dataset, the mean number of correct LLM responses per question was 3.9 of a possible 6 (SD = 1.8). ChatGPT-4o achieved the highest accuracy (66.0%) and DeepSeek the lowest (60.0%). Accuracy varied by domain: it was highest for indications/contraindications (78.7%) and complications/management (68.0%), and lowest for preoperative preparation (52.0%) and postoperative care (58.4%). Binary questions yielded higher accuracy (69.1%) than multiple-choice questions (62.0%). Inter-expert reliability was substantial (κ = 0.742, 95% CI 0.71–0.77). Agreement between individual LLMs and the expert consensus ranged from fair (DeepSeek, κ = 0.349) to moderate (ChatGPT-4o, κ = 0.446). No significant accuracy differences were detected across models (Friedman test, p = 0.662).
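The qualitative labels "fair" and "moderate" above follow the Landis–Koch convention for interpreting κ. As a brief sketch, with placeholder answer vectors rather than the study data, per-model agreement with the expert reference standard could be computed as follows:

```python
# Per-model agreement with the expert consensus; vectors are placeholders.
import numpy as np
from sklearn.metrics import cohen_kappa_score

rng = np.random.default_rng(1)
expert_consensus = rng.integers(0, 4, size=100)  # reference answers
model_answers = rng.integers(0, 4, size=100)     # one LLM's answers

kappa = cohen_kappa_score(expert_consensus, model_answers)
accuracy = np.mean(expert_consensus == model_answers)

# Landis-Koch bands, the conventional interpretation scale for kappa.
bands = [(0.00, "poor"), (0.20, "slight"), (0.40, "fair"), (0.60, "moderate"),
         (0.80, "substantial"), (1.00, "almost perfect")]
label = next(name for upper, name in bands if kappa <= upper)
print(f"accuracy = {accuracy:.1%}, kappa = {kappa:.3f} ({label})")
```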
CONCLUSION: LLMs represent a promising, yet imperfect, adjunct in MBS education. Their utility is currently limited by inconsistencies in accuracy, particularly in areas requiring nuanced clinical judgment. While these models can supplement traditional learning resources, they are not yet a substitute for expert clinical guidance. This study underscores the need for continued refinement and validation of LLMs to ensure their safe and effective integration into clinical practice.
SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at 10.1007/s11695-025-08418-y.