
Advances in Neurology and Neuroscience (AN)

ISSN: 2690-909X | DOI: 10.33140/AN

Impact Factor: 1.12

Evaluation of Advanced Artificial Intelligence in Minimally Invasive Surgery Training: A Preliminary Study of the Large Language Models DeepSeek-R1 and Claude 3.5 Sonnet

Brandon L. Staple*, Elijah M. Staple, Cynthia Wallace and Bevan D. Staple

Abstract

Background/Objective: Minimally invasive surgery (MIS) reduces tissue trauma, pain, and recovery times but demands advanced technical skill acquisition. Current surgical training remains time-intensive and mentor-dependent. While robotics, simulation, and AI promise transformative improvements for surgical education, early large language models (LLMs) such as ChatGPT raised concerns due to factual inaccuracies ("hallucinations") and limited explainability. It remains unclear whether modern LLMs, such as Claude 3.5 Sonnet and the reasoning-focused DeepSeek-R1, adequately overcome these limitations while ensuring the interpretability and reliability essential for medical applications. Moreover, their alignment with MIS-specific knowledge is understudied. This work preliminarily evaluates both models' accuracy, reasoning capabilities, and error patterns in MIS knowledge assessment to determine their viability for enhancing surgical training.

Methods: We performed a comparative analysis using 30 multiple-choice questions (MCQs) derived from the Atlas of Minimally Invasive Surgical Operations. Model performance was statistically compared using one-way ANOVA with Bonferroni correction. Inter-rater reliability was assessed with Cohen's Kappa, and effect size was measured using odds ratios. Qualitative analysis focused on reasoning patterns, error classification, and pedagogical applications specific to MIS education.
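The core of the comparison described above can be illustrated with a minimal sketch. The per-question data below are hypothetical stand-ins (the item-level results are not reported here), and the helper names (`accuracy`, `cohens_kappa`, `odds_ratio`) are illustrative, not the authors' actual analysis code; the kappa and odds-ratio formulas themselves are standard.

```python
# Hypothetical per-question scores for two models on 30 MCQs:
# 1 = correct, 0 = incorrect. These vectors are illustrative only.
model_a = [1] * 29 + [0]            # e.g. 29/30 correct
model_b = [1] * 27 + [0, 0, 1]      # e.g. 28/30 correct

def accuracy(scores):
    """Fraction of questions answered correctly."""
    return sum(scores) / len(scores)

def cohens_kappa(x, y):
    """Agreement between two binary raters, corrected for chance."""
    n = len(x)
    po = sum(a == b for a, b in zip(x, y)) / n   # observed agreement
    p1, p2 = sum(x) / n, sum(y) / n
    pe = p1 * p2 + (1 - p1) * (1 - p2)           # chance agreement
    return (po - pe) / (1 - pe)

def odds_ratio(x, y):
    """Odds ratio of answering correctly (model A vs. model B),
    with a Haldane-Anscombe 0.5 correction to avoid zero cells."""
    a, b = sum(x), len(x) - sum(x)   # model A: correct / incorrect
    c, d = sum(y), len(y) - sum(y)   # model B: correct / incorrect
    if 0 in (a, b, c, d):
        a, b, c, d = a + 0.5, b + 0.5, c + 0.5, d + 0.5
    return (a * d) / (b * c)
```

With such small samples and near-ceiling accuracy, a single question flipping from correct to incorrect moves both kappa and the odds ratio substantially, which is one reason the abstract pairs the point estimates with a significance test.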

Results: Claude 3.5 Sonnet achieved 96.7% accuracy (29/30 correct), while DeepSeek-R1 achieved 93.3% accuracy (28/30 correct). Statistical analysis (ANOVA, p = 0.742) revealed no significant difference in overall performance between models. Inter-rater reliability showed moderate agreement (κ = 0.52) with a strong effect size (OR = 23.0). Qualitative analysis identified distinct reasoning styles: DeepSeek-R1 demonstrated comprehensive, systematic (step-by-step) analysis, whereas Claude 3.5 Sonnet exhibited more focused, efficient reasoning. Error analysis revealed Formula Confusion Errors and Context Value Errors, with both models converging on clinically plausible but educationally incorrect answers in complex MIS scenarios.

Conclusions: Both cutting-edge LLMs demonstrated exceptional accuracy in MIS knowledge assessment, surpassing older models, and possess unique reasoning approaches valuable for surgical education. Claude 3.5 Sonnet showed marginal superiority in accuracy with efficient reasoning, while DeepSeek-R1 offered advantages in reasoning transparency and open-source cost-effectiveness. The high accuracy rates, combined with detailed reasoning analysis, suggest meaningful potential for MIS-specific educational applications; however, careful implementation and human oversight remain essential.
