Lost in Volume: The CT-SpatialVQA Benchmark for Evaluating Semantic-Spatial Understanding of 3D Medical Vision-Language Models
Recognition: 1 Lean theorem link
Pith reviewed 2026-05-12 01:26 UTC · model grok-4.3
The pith
3D medical vision-language models struggle with semantic-spatial reasoning in CT volumes, averaging just 34% accuracy on a new benchmark.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We introduce CT-SpatialVQA, a benchmark designed to evaluate semantic-spatial reasoning in 3D CT data. The benchmark comprises 9077 clinically grounded question-answer pairs derived directly from 1601 radiology reports and CT volumes. Our dataset requires explicit anatomical localization, laterality awareness, structural comparison, and 3D inter-structure relational reasoning. We benchmark eight 3D medical VLMs and find severe degradation on semantic-spatial reasoning tasks, with accuracy averaging 34% and often falling below random chance.
What carries the argument
CT-SpatialVQA benchmark of 9077 QA pairs that specifically test semantic-spatial reasoning capabilities in 3D CT volumes through anatomical localization, laterality awareness, structural comparison, and 3D inter-structure relational reasoning.
If this is right
- Models require deeper integration of volumetric evidence beyond current methods to support trustworthy clinical decision making.
- The benchmark provides a standardized protocol to measure progress in semantic-spatial capabilities for 3D medical VLMs.
- Poor performance indicates that existing training approaches leave models vulnerable to errors in anatomy localization and relational queries.
- Advancements in handling 3D spatial data are necessary before these models can be reliably used in clinical reporting or diagnosis.
Where Pith is reading between the lines
- New architectures with explicit 3D geometric components may be needed to overcome the spatial reasoning deficits.
- Similar evaluation gaps likely exist for other 3D imaging modalities such as MRI.
- Targeted fine-tuning on spatial QA pairs could close the accuracy gap and improve downstream clinical utility.
- The findings suggest current objectives in VLM training insufficiently emphasize direct 3D spatial cues from volumes.
Load-bearing premise
The benchmark's question-answer pairs require and test explicit 3D volumetric spatial reasoning rather than being solvable through 2D projections, language correlations, or learned priors alone.
What would settle it
A model achieving high accuracy specifically on questions about 3D inter-structure relations only resolvable by considering the full volume from multiple viewpoints, while failing on non-spatial controls, would support the claim; persistent low performance even after volume-specific training would falsify it.
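A minimal sketch of how such a settling comparison could be scored, assuming hypothetical per-question records with `question_type`, `input_condition`, and `correct` fields (the names and record format are illustrative, not from the paper): accuracy is aggregated per question type under full-volume, slice-only, and text-only inputs, so the gain attributable to volumetric evidence can be read off per category.

```python
from collections import defaultdict

# Hypothetical record format (illustrative, not from the paper):
#   {"question_type": "3d_relation", "input_condition": "full_volume", "correct": True}
# question_type   : e.g. "laterality", "3d_relation", "non_spatial_control"
# input_condition : "full_volume", "2d_slices", or "text_only"

def accuracy_by_condition(records):
    """Aggregate accuracy for each (question_type, input_condition) cell."""
    totals = defaultdict(lambda: [0, 0])  # cell -> [n_correct, n_total]
    for r in records:
        cell = (r["question_type"], r["input_condition"])
        totals[cell][0] += int(r["correct"])
        totals[cell][1] += 1
    return {cell: correct / total for cell, (correct, total) in totals.items()}

def volumetric_gain(acc, question_type):
    """Accuracy gained by seeing the full volume rather than text alone.

    A large gain on 3D relational questions combined with a small gain on
    non-spatial controls would support the claim; a near-zero gain even
    after volume-specific training would count against it.
    """
    return (acc.get((question_type, "full_volume"), 0.0)
            - acc.get((question_type, "text_only"), 0.0))
```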
Original abstract
Recent advances in 3D medical vision-language models have enabled joint reasoning over volumetric images and text, showing strong performance in medical visual question-answering (VQA) and report generation. Despite this progress, it remains unclear whether these models learn spatially grounded anatomy from 3D volumes or rely primarily on learned priors and language correlations. This uncertainty stems from the lack of systematic evaluation of semantic-spatial reasoning in volumetric medical VLMs for clinically reliable decision support. To address this gap, we introduce CT-SpatialVQA, a benchmark designed to evaluate semantic-spatial reasoning in 3D CT data. The benchmark comprises 9077 clinically grounded question-answer (QA) pairs derived directly from 1601 radiology reports and CT volumes, which are validated via a robust LLM-assisted pipeline with a 95% human consensus agreement rate. Our dataset requires explicit anatomical localization, laterality awareness, structural comparison, and 3D inter-structure relational reasoning. We also introduce a standardized evaluation protocol and benchmark eight 3D medical VLMs, finding severe degradation on semantic-spatial reasoning tasks, averaging 34% accuracy and often below random, highlighting the need for deeper integration of volumetric evidence for trustworthy clinical use.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces CT-SpatialVQA, a benchmark of 9077 QA pairs derived from 1601 CT volumes and radiology reports, aimed at evaluating semantic-spatial reasoning (anatomical localization, laterality, structural comparison, and 3D inter-structure relations) in 3D medical VLMs. An LLM-assisted generation and validation pipeline achieves 95% human consensus. The authors apply a standardized protocol to benchmark eight 3D VLMs and report severe performance degradation, with average accuracy of 34% and frequent results below random guessing.
Significance. If the QA pairs are shown to require genuine 3D volumetric reasoning rather than textual leakage or 2D cues, the benchmark would be a valuable contribution for exposing limitations in current 3D medical VLMs and motivating more robust spatial integration. The scale of the dataset, direct derivation from clinical reports, and high-consensus validation pipeline are clear strengths that could support reproducible follow-up work.
major comments (3)
- [§3] §3 (Benchmark Construction) and §3.3 (Validation): The 95% human consensus rate is reported without detailing question-generation rules, exclusion criteria for ambiguous cases, or statistical measures of inter-annotator agreement beyond the aggregate figure. This leaves open whether the pairs were constructed to exclude language-only solutions.
- [§4.2] §4.2 (Evaluation Protocol): No control experiments are described in which models (or humans) are tested on the same QA pairs with the CT volume withheld or replaced by single 2D slices/axial projections. Because questions are derived directly from reports that already encode laterality and relations, the absence of these baselines makes it impossible to attribute the 34% accuracy drop specifically to failure of 3D reasoning.
- [§4.3] §4.3 (Results) and Table 3: The headline claim of 'often below random' performance is presented without per-task random baselines, statistical significance tests for the degradation, or text-only model runs. This weakens the interpretation that the observed scores demonstrate a lack of semantic-spatial understanding rather than other factors.
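To make the point about chance levels concrete, here is a minimal sketch, assuming the number of answer options per question is known (an assumption about the dataset format, not something stated in this review): chance accuracy is 1/k for a k-option question, and a one-sided exact binomial test checks whether an observed score is credibly below that level.

```python
from scipy.stats import binomtest

def chance_level(n_options: int) -> float:
    """Random-guess accuracy for an n_options-way question
    (0.5 for binary laterality, 0.25 for a 4-way relational choice)."""
    return 1.0 / n_options

def below_random_pvalue(n_correct: int, n_questions: int, n_options: int) -> float:
    """Exact one-sided binomial test of whether the observed score is
    credibly below the per-task chance level."""
    result = binomtest(n_correct, n_questions, chance_level(n_options),
                       alternative="less")
    return result.pvalue

# Illustrative numbers only, not taken from the paper's tables:
# 400 correct answers out of 1000 binary laterality questions.
print(below_random_pvalue(400, 1000, n_options=2))
```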
minor comments (2)
- [Abstract] Abstract: The phrase 'often below random' should be qualified with the exact random baseline for each question category (e.g., binary laterality vs. multi-choice relational).
- [Figure 1] Figure 1 or dataset examples: Provide at least one concrete QA pair together with the corresponding CT slices and report excerpt to illustrate why 3D volume access is required.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed comments, which highlight important aspects of our benchmark construction and evaluation that require clarification. We address each major comment below and commit to revisions that strengthen the manuscript without altering its core claims.
Point-by-point responses
Referee: [§3] §3 (Benchmark Construction) and §3.3 (Validation): The 95% human consensus rate is reported without detailing question-generation rules, exclusion criteria for ambiguous cases, or statistical measures of inter-annotator agreement beyond the aggregate figure. This leaves open whether the pairs were constructed to exclude language-only solutions.
Authors: We agree that additional details are needed. In the revised manuscript, we will expand §3 to explicitly describe the question-generation rules (including templates for localization, laterality, comparison, and 3D relations), the exclusion criteria for ambiguous or text-only solvable cases, and inter-annotator agreement statistics such as Cohen's kappa and Fleiss' kappa computed on the human validation subset. The pipeline was designed to prioritize questions requiring volumetric evidence, but we will add concrete examples demonstrating that language-only solutions are insufficient for the majority of items. revision: yes
Referee: [§4.2] §4.2 (Evaluation Protocol): No control experiments are described in which models (or humans) are tested on the same QA pairs with the CT volume withheld or replaced by single 2D slices/axial projections. Because questions are derived directly from reports that already encode laterality and relations, the absence of these baselines makes it impossible to attribute the 34% accuracy drop specifically to failure of 3D reasoning.
Authors: This is a valid concern. We will add control experiments in the revised §4.2, including (1) text-only runs of all eight models on the identical QA pairs with volumes withheld, and (2) 2D slice-based evaluations using axial projections or representative slices. These baselines will allow direct comparison to the full 3D results and help quantify the contribution of volumetric reasoning. While the questions were manually reviewed to emphasize 3D inter-structure relations not explicitly stated in reports, the added controls will make this attribution rigorous. revision: yes
Referee: [§4.3] §4.3 (Results) and Table 3: The headline claim of 'often below random' performance is presented without per-task random baselines, statistical significance tests for the degradation, or text-only model runs. This weakens the interpretation that the observed scores demonstrate a lack of semantic-spatial understanding rather than other factors.
Authors: We will revise §4.3 and Table 3 to include per-task random-chance baselines (computed from the number of answer options per question type), statistical significance tests (e.g., binomial tests or McNemar's test against random and against text-only performance), and the text-only model results from the new controls. These additions will provide a clearer statistical foundation for interpreting the 34% average accuracy and the 'below random' observations as evidence of limited 3D semantic-spatial understanding. revision: yes
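A minimal sketch of the statistics these responses commit to, using scikit-learn and SciPy with made-up illustrative arrays (none of the numbers or variable names come from the paper): Cohen's kappa measures agreement between two human validators on the QA validation subset, and an exact McNemar test, implemented here as a binomial test on discordant pairs, compares the same model's per-question correctness with the full volume versus text only.

```python
import numpy as np
from scipy.stats import binomtest
from sklearn.metrics import cohen_kappa_score

# --- Inter-annotator agreement on the human validation subset -------------
# Hypothetical verdicts from two validators over the same QA pairs
# (1 = "QA pair is valid", 0 = "invalid or ambiguous").
annotator_a = np.array([1, 1, 0, 1, 1, 0, 1, 1])
annotator_b = np.array([1, 1, 0, 1, 0, 0, 1, 1])
kappa = cohen_kappa_score(annotator_a, annotator_b)

# --- Exact McNemar test: full-volume vs text-only on identical questions --
# Hypothetical per-question correctness for one model under two conditions.
correct_volume = np.array([1, 0, 1, 1, 0, 1, 0, 1], dtype=bool)
correct_text   = np.array([1, 0, 0, 1, 0, 0, 1, 1], dtype=bool)

b = int(np.sum(correct_volume & ~correct_text))  # volume right, text wrong
c = int(np.sum(~correct_volume & correct_text))  # text right, volume wrong
# Under H0 (both conditions equally good), discordant pairs split 50/50.
p_mcnemar = binomtest(b, b + c, 0.5, alternative="two-sided").pvalue

print(f"Cohen's kappa = {kappa:.2f}, McNemar exact p = {p_mcnemar:.3f}")
```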
Circularity Check
No circularity: new benchmark evaluated on external models
Full rationale
The paper constructs CT-SpatialVQA from 1601 external radiology reports and CT volumes via an LLM-assisted pipeline, then evaluates eight existing 3D VLMs on the resulting 9077 QA pairs. No equations, fitted parameters, or predictions appear; the central result (34% average accuracy) is a direct measurement against independent models and data. No self-citations are load-bearing, no ansatzes are smuggled, and no quantity is defined in terms of itself or renamed as a novel derivation. The work is self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Radiology reports paired with CT volumes contain sufficient information to generate questions that require explicit 3D anatomical localization and relational reasoning.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/AlexanderDuality.lean: alexander_duality_circle_linking (tagged unclear)
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Paper passage: "Our dataset requires explicit anatomical localization, laterality awareness, structural comparison, and 3D inter-structure relational reasoning."
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] Alibaba Cloud Model Studio: Qwen-plus (Qwen3 series) model listing. https://www.alibabacloud.com/help/en/model-studio/models, accessed 2026-02-24
- [2] Bai, F., Du, Y., Huang, T., Meng, M.Q.H., Zhao, B.: M3D: Advancing 3D medical image analysis with multi-modal large language models. arXiv preprint arXiv:2404.00578 (2024)
- [3] Banerjee, S., Lavie, A.: METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In: Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization. pp. 65–72 (2005)
- [4] Blankemeier, L., Cohen, J.P., Kumar, A., Van Veen, D., Gardezi, S.J.S., Paschali, M., Chen, Z., Delbrouck, J.B., Reis, E., Truyts, C., et al.: Merlin: A vision language foundation model for 3D computed tomography. Research Square pp. rs–3 (2024)
- [5] Chen, Y., Xiao, W., Bassi, P.R., Zhou, X., Er, S., Hamamci, I.E., Zhou, Z., Yuille, A.: Are vision language models ready for clinical diagnosis? A 3D medical benchmark for tumor-centric visual question answering. arXiv preprint arXiv:2505.18915 (2025)
- [6] Google AI for Developers: Gemini 2.5 Flash (model code: gemini-2.5-flash). https://ai.google.dev/gemini-api/docs/models/gemini-2.5-flash, accessed 2026-02-24
- [7] Google Research: MedGemma 1.5 model card (2026). https://huggingface.co/google/medgemma-1.5-4b-it, accessed 2026-02-22
- [8] Hamamci, I.E., Er, S., Wang, C., Almas, F., Simsek, A.G., Esirgun, S.N., Dogan, I., Durugol, O.F., Hou, B., Shit, S., et al.: Generalist foundation models from a multimodal dataset for 3D computed tomography. Nature Biomedical Engineering pp. 1–19 (2026)
- [9] Lai, H., Jiang, Z., Yao, Q., Wang, R., He, Z., Tao, X., Wei, W., Lv, W., Zhou, S.K.: E3D-GPT: Enhanced 3D visual foundation for medical vision-language model. arXiv preprint arXiv:2410.14200 (2024)
- [10] Lee, C., Park, S., Shin, C.I., Choi, W.H., Park, H.J., Lee, J.E., Ye, J.C.: Read like a radiologist: Efficient vision-language model for 3D medical imaging interpretation. arXiv preprint arXiv:2412.13558 (2024)
- [11] Lin, C.Y.: ROUGE: A package for automatic evaluation of summaries. In: Text Summarization Branches Out. pp. 74–81 (2004)
- [12] Nath, V., Li, W., Yang, D., Myronenko, A., Zheng, M., Lu, Y., Liu, Z., Yin, H., Law, Y.M., Tang, Y., et al.: VILA-M3: Enhancing vision-language models with medical expert knowledge. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 14788–14798 (2025)
- [13] OpenAI: GPT-4o model documentation. https://platform.openai.com/docs/models/gpt-4o, accessed 2026-02-24
- [14] Papineni, K., Roukos, S., Ward, T., Zhu, W.J.: BLEU: A method for automatic evaluation of machine translation. In: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics. pp. 311–318 (2002)
- [15] Reimers, N., Gurevych, I.: Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). pp. 3982–3992 (2019)
- [16] Shui, Z., Zhang, J., Cao, W., Wang, S., Guo, R., Lu, L., Yang, L., Ye, X., Liang, T., Zhang, Q., et al.: Large-scale and fine-grained vision-language pre-training for enhanced CT image understanding. arXiv preprint arXiv:2501.14548 (2025)
- [17] Wang, Y., Dai, Y., Jones, C., Sair, H., Shen, J., Loizou, N., Hsu, W.C., Imami, M., Jiao, Z., Zhang, P., et al.: Enhancing vision-language models for medical imaging: Bridging the 3D gap with innovative slice selection. Advances in Neural Information Processing Systems 37, 99947–99964 (2024)
- [18]
- [19] Wu, C., Zhang, X., Zhang, Y., Hui, H., Wang, Y., Xie, W.: Towards generalist foundation model for radiology by leveraging web-scale 2D&3D medical data. Nature Communications 16(1), 7866 (2025)
- [20] Xin, Y., Ates, G.C., Gong, K., Shao, W.: Med3DVLM: An efficient vision-language model for 3D medical image analysis. IEEE Journal of Biomedical and Health Informatics (2025)
- [21] Xu, W., Chan, H.P., Li, L., Aljunied, M., Yuan, R., Wang, J., Xiao, C., Chen, G., Liu, C., Li, Z., et al.: Lingshu: A generalist foundation model for unified multimodal medical understanding and reasoning. arXiv preprint arXiv:2506.07044 (2025)