{"paper":{"title":"A Comparative Study in Surgical AI: Potential and Limitations of Data, Compute, and Scaling","license":"http://creativecommons.org/licenses/by-nc-sa/4.0/","headline":"Even large Vision Language Models fail at basic surgical tool detection in neurosurgery, with scaling producing only diminishing gains.","cross_cats":["cs.CV","cs.LG"],"primary_cat":"cs.AI","authors_text":"Daniel A. Donoho, Eric Fithian, Jack Cook, John Zhu, Kirill Skobelev, Margaux Masson-Forsythe, Neeraj Mainkar, Sandeep Angara, Shauna Otto, X.Y. Han, Yegor Baranovski, Zhuang-Fang Yi","submitted_at":"2026-03-28T17:18:40Z","abstract_excerpt":"Recent Artificial Intelligence (AI) models have matched or exceeded human experts in several benchmarks of biomedical task performance, but surgical benchmarks in particular are often missing from prominent medical benchmark suites. Since surgery requires integrating disparate tasks, generally-capable AI models could be particularly attractive as a collaborative tool if performance could be improved. On the one hand, the canonical approach of scaling architecture size and training data is attractive, especially since there are millions of hours of surgical video data generated per year. On the"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"even with multi-billion parameter models and extensive training, current Vision Language Models fall short in the seemingly simple task of tool detection in neurosurgery. \nscaling experiments indicating that increasing model size and training time only leads to diminishing improvements in relevant performance metrics.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"The chosen tool-detection task and neurosurgery datasets are representative of broader surgical AI challenges, and that the tested models represent the best possible application of 2026-era methods without unstated domain adaptations.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"Current vision-language models underperform on surgical tool detection in neurosurgery, with scaling model size and training time producing only diminishing returns.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"Even large Vision Language Models fail at basic surgical tool detection in neurosurgery, with scaling producing only diminishing gains.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"c2535aba5c8823222765d6769af4a3d3bcf366df8ba7931428345d641ed973a5"},"source":{"id":"2603.27341","kind":"arxiv","version":3},"verdict":{"id":"2f96b7e9-d906-40a4-8967-16e9a4a9841a","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-14T22:02:27.369927Z","strongest_claim":"even with multi-billion parameter models and extensive training, current Vision Language Models fall short in the seemingly simple task of tool detection in neurosurgery. \nscaling experiments indicating that increasing model size and training time only leads to diminishing improvements in relevant performance metrics.","one_line_summary":"Current vision-language models underperform on surgical tool detection in neurosurgery, with scaling model size and training time producing only diminishing returns.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"The chosen tool-detection task and neurosurgery datasets are representative of broader surgical AI challenges, and that the tested models represent the best possible application of 2026-era methods without unstated domain adaptations.","pith_extraction_headline":"Even large Vision Language Models fail at basic surgical tool detection in neurosurgery, with scaling producing only diminishing gains."},"integrity":{"clean":true,"summary":{"advisory":0,"critical":0,"by_detector":{},"informational":0},"endpoint":"/pith/2603.27341/integrity.json","findings":[],"available":true,"detectors_run":[],"snapshot_sha256":"c28c3603d3b5d939e8dc4c7e95fa8dfce3d595e45f758748cecf8e644a296938"},"references":{"count":0,"sample":[],"resolved_work":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57","internal_anchors":0},"formal_canon":{"evidence_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"}