SurgiQ is a new 13k-question surgical benchmark showing general-purpose LLMs reach 68.1% accuracy while most biomedical models lag and smaller models stay near random baseline.
arXiv preprint arXiv:2404.00376 , year=
2 Pith papers cite this work. Polarity classification is still indexing.
2
Pith papers citing it
years
2026 2representative citing papers
citing papers explorer
-
SurgiQ: A Large-Scale Multi-Domain Benchmark for Evaluating Surgical Understanding in Large Language Models
SurgiQ is a new 13k-question surgical benchmark showing general-purpose LLMs reach 68.1% accuracy while most biomedical models lag and smaller models stay near random baseline.
- The Reliability Gap in Benchmark Auditing: Distribution Shift and Scale as Failure Modes of Contamination Detection