{"paper":{"title":"Lost in Volume: The CT-SpatialVQA Benchmark for Evaluating Semantic-Spatial Understanding of 3D Medical Vision-Language Models","license":"http://creativecommons.org/licenses/by/4.0/","headline":"3D medical vision-language models struggle with semantic-spatial reasoning in CT volumes, averaging just 34% accuracy on a new benchmark.","cross_cats":[],"primary_cat":"cs.CV","authors_text":"Asif Hanif, Mashrafi Monon, Mohammad Yaqub, Numan Saeed, Umaima Rahman","submitted_at":"2026-05-09T08:16:00Z","abstract_excerpt":"Recent advances in 3D medical vision-language models have enabled joint reasoning over volumetric images and text, showing strong performance in medical visual question-answering (VQA) and report generation. Despite this progress, it remains unclear whether these models learn spatially grounded anatomy from 3D volumes or rely primarily on learned priors and language correlations. This uncertainty stems from the lack of systematic evaluation of semantic-spatial reasoning in volumetric medical VLMs for clinically reliable decision support. To address this gap, we introduce CT-SpatialVQA, a bench"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"We introduce CT-SpatialVQA, a benchmark designed to evaluate semantic-spatial reasoning in 3D CT data. [...] finding severe degradation on semantic-spatial reasoning tasks, averaging 34% accuracy and often below random.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"The constructed QA pairs require and test explicit 3D volumetric spatial reasoning rather than being solvable through 2D projections, language correlations, or learned priors alone.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"CT-SpatialVQA benchmark shows 3D medical VLMs achieve only 34% average accuracy on semantic-spatial reasoning tasks in CT volumes, often below random chance.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"3D medical vision-language models struggle with semantic-spatial reasoning in CT volumes, averaging just 34% accuracy on a new benchmark.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"4050758ae258854015a70d3fec1d02502beff132baa8574c0df81931efa7f716"},"source":{"id":"2605.08787","kind":"arxiv","version":2},"verdict":{"id":"e02e8c7b-4d3f-46bb-9725-7c3727c75aa8","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-12T01:26:43.430194Z","strongest_claim":"We introduce CT-SpatialVQA, a benchmark designed to evaluate semantic-spatial reasoning in 3D CT data. [...] finding severe degradation on semantic-spatial reasoning tasks, averaging 34% accuracy and often below random.","one_line_summary":"CT-SpatialVQA benchmark shows 3D medical VLMs achieve only 34% average accuracy on semantic-spatial reasoning tasks in CT volumes, often below random chance.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"The constructed QA pairs require and test explicit 3D volumetric spatial reasoning rather than being solvable through 2D projections, language correlations, or learned priors alone.","pith_extraction_headline":"3D medical vision-language models struggle with semantic-spatial reasoning in CT volumes, averaging just 34% accuracy on a new benchmark."},"integrity":{"clean":true,"summary":{"advisory":0,"critical":0,"by_detector":{},"informational":0},"endpoint":"/pith/2605.08787/integrity.json","findings":[],"available":true,"detectors_run":[{"name":"claim_evidence","ran_at":"2026-05-20T09:02:01.895099Z","status":"completed","version":"1.0.0","findings_count":0},{"name":"ai_meta_artifact","ran_at":"2026-05-19T22:34:33.686847Z","status":"completed","version":"1.0.0","findings_count":0},{"name":"doi_title_agreement","ran_at":"2026-05-19T14:01:22.078084Z","status":"completed","version":"1.0.0","findings_count":0},{"name":"doi_compliance","ran_at":"2026-05-19T10:49:11.014658Z","status":"completed","version":"1.0.0","findings_count":0}],"snapshot_sha256":"ba05c6706d5006dfdf5a712f9800c469461ca39226b2429cc6476277f3038590"},"references":{"count":0,"sample":[],"resolved_work":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57","internal_anchors":0},"formal_canon":{"evidence_count":1,"snapshot_sha256":"86e8a497b5ec500ed77e4674c303a2ee52bc002df1cc52f66d0c9bd34e258518"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"}