Comparison of Domain-Specific and Ensemble Large Language Models in Surgical Education: A Preliminary Performance Evaluation
Brandon D. Staple, Elijah M. Staple, Cynthia Wallace and Bevan L. Staple*
Abstract
Standard Large Language Models (sLLMs) achieve high accuracy on multiple-choice questions from the Self-Assessment Neurosurgery Exam (SANS). However, their tendency to 'hallucinate', or fabricate information, poses challenges for neurosurgical applications that demand a high degree of precision. AtlasGPT, a domain-specific Large Language Model (dLLM), has achieved a lower Hallucination Rate (HR) through targeted fine-tuning and retrieval-augmented generation from specialized databases. Nevertheless, proprietary limitations hinder customization and broader research into hallucination mitigation, prompting exploration of model-agnostic Ensemble Methods (EMs) that combine several sLLMs to enhance performance. This study assessed hallucination mitigation by comparing an EM composed of three sLLMs (Gemini, Claude 3.5 Sonnet, and Mistral) with AtlasGPT. Hallucination Rates were evaluated using sampling and maximum voting on 150 SANS multiple-choice questions. The EM achieved the highest accuracy (97.33%; HR: 2.67%), slightly surpassing AtlasGPT (96.67%; HR: 3.33%) and outperforming every individual sLLM: Claude 3.5 Sonnet (94.67%; HR: 5.33%), Gemini (92.00%; HR: 8.00%), and Mistral (88.67%; HR: 11.33%). The EM's success lies in its integration of multiple sLLMs: because the models are trained on diverse data, they tend to make different errors, and aggregating their outputs reduces overall mistakes, improving generalization and robustness. Moreover, EMs can combine both open-source and proprietary models, and simple sampling with maximum voting proved effective, suggesting that more complex aggregation methods may be unnecessary for specific applications. In summary, specialized models such as AtlasGPT and EMs demonstrate the value of domain-specific training and multi-model approaches. However, they require further development and human oversight for safe clinical deployment.
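The maximum-voting aggregation described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the model callables (`gemini`, `claude`, `mistral`) are hypothetical stand-ins, and answers are assumed to be single choice letters.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most common answer among the model outputs.

    Ties are broken by the order in which answers were received
    (Counter preserves first-encountered order for equal counts).
    """
    return Counter(answers).most_common(1)[0][0]


def ensemble_answer(models, question):
    """Query each model once and take the plurality answer.

    `models` is a list of callables mapping a question string to a
    single choice letter, e.g. "A"-"E". A real pipeline would sample
    each model and parse its response; these details are omitted here.
    """
    answers = [model(question) for model in models]
    return majority_vote(answers)


# Stand-in models that disagree on one question: the two "B" votes
# outvote the single "C", so the ensemble returns "B".
gemini = lambda q: "B"
claude = lambda q: "B"
mistral = lambda q: "C"

print(ensemble_answer([gemini, claude, mistral], "Sample SANS item"))  # B
```

Even this naive plurality rule captures the diversity principle: a hallucination by one model is outvoted whenever the other two agree on the correct choice.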
