Lost in Volume: The CT-SpatialVQA Benchmark for Evaluating Semantic-Spatial Understanding of 3D Medical Vision-Language Models
Recognition: 1 Lean theorem link
Pith reviewed 2026-05-12 01:26 UTC · model grok-4.3
The pith
3D medical vision-language models struggle with semantic-spatial reasoning in CT volumes, averaging just 34% accuracy on a new benchmark.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We introduce CT-SpatialVQA, a benchmark designed to evaluate semantic-spatial reasoning in 3D CT data. The benchmark comprises 9077 clinically grounded question-answer pairs derived directly from 1601 radiology reports and CT volumes. Our dataset requires explicit anatomical localization, laterality awareness, structural comparison, and 3D inter-structure relational reasoning. We benchmark eight 3D medical VLMs and find severe degradation on semantic-spatial reasoning tasks, with accuracy averaging 34% and often falling below random chance.
What carries the argument
CT-SpatialVQA benchmark of 9077 QA pairs that specifically test semantic-spatial reasoning capabilities in 3D CT volumes through anatomical localization, laterality awareness, structural comparison, and 3D inter-structure relational reasoning.
If this is right
- Models require deeper integration of volumetric evidence beyond current methods to support trustworthy clinical decision making.
- The benchmark provides a standardized protocol to measure progress in semantic-spatial capabilities for 3D medical VLMs.
- Poor performance indicates that existing training approaches leave models vulnerable to errors in anatomy localization and relational queries.
- Advancements in handling 3D spatial data are necessary before these models can be reliably used in clinical reporting or diagnosis.
Where Pith is reading between the lines
- New architectures with explicit 3D geometric components may be needed to overcome the spatial reasoning deficits.
- Similar evaluation gaps likely exist for other 3D imaging modalities such as MRI.
- Targeted fine-tuning on spatial QA pairs could close the accuracy gap and improve downstream clinical utility.
- The findings suggest current objectives in VLM training insufficiently emphasize direct 3D spatial cues from volumes.
Load-bearing premise
The benchmark's question-answer pairs require and test explicit 3D volumetric spatial reasoning rather than being solvable through 2D projections, language correlations, or learned priors alone.
What would settle it
A model achieving high accuracy specifically on questions about 3D inter-structure relations only resolvable by considering the full volume from multiple viewpoints, while failing on non-spatial controls, would support the claim; persistent low performance even after volume-specific training would falsify it.
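A minimal sketch of how such a settling comparison could be scored, assuming hypothetical per-question records with `question_type`, `input_condition`, and `correct` fields (the names and record format are illustrative, not from the paper): accuracy is aggregated per question type under full-volume, slice-only, and text-only inputs, so the gain attributable to volumetric evidence can be read off per category.

```python
from collections import defaultdict

# Hypothetical record format (illustrative, not from the paper):
#   {"question_type": "3d_relation", "input_condition": "full_volume", "correct": True}
# question_type   : e.g. "laterality", "3d_relation", "non_spatial_control"
# input_condition : "full_volume", "2d_slices", or "text_only"

def accuracy_by_condition(records):
    """Aggregate accuracy for each (question_type, input_condition) cell."""
    totals = defaultdict(lambda: [0, 0])  # cell -> [n_correct, n_total]
    for r in records:
        cell = (r["question_type"], r["input_condition"])
        totals[cell][0] += int(r["correct"])
        totals[cell][1] += 1
    return {cell: correct / total for cell, (correct, total) in totals.items()}

def volumetric_gain(acc, question_type):
    """Accuracy gained by seeing the full volume rather than text alone.

    A large gain on 3D relational questions combined with a small gain on
    non-spatial controls would support the claim; a near-zero gain even
    after volume-specific training would count against it.
    """
    return (acc.get((question_type, "full_volume"), 0.0)
            - acc.get((question_type, "text_only"), 0.0))
```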
Original abstract
Recent advances in 3D medical vision-language models have enabled joint reasoning over volumetric images and text, showing strong performance in medical visual question-answering (VQA) and report generation. Despite this progress, it remains unclear whether these models learn spatially grounded anatomy from 3D volumes or rely primarily on learned priors and language correlations. This uncertainty stems from the lack of systematic evaluation of semantic-spatial reasoning in volumetric medical VLMs for clinically reliable decision support. To address this gap, we introduce CT-SpatialVQA, a benchmark designed to evaluate semantic-spatial reasoning in 3D CT data. The benchmark comprises 9077 clinically grounded question-answer (QA) pairs derived directly from 1601 radiology reports and CT volumes, which are validated via a robust LLM-assisted pipeline with a 95% human consensus agreement rate. Our dataset requires explicit anatomical localization, laterality awareness, structural comparison, and 3D inter-structure relational reasoning. We also introduce a standardized evaluation protocol and benchmark eight 3D medical VLMs, finding severe degradation on semantic-spatial reasoning tasks, averaging 34% accuracy and often below random, highlighting the need for deeper integration of volumetric evidence for trustworthy clinical use.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces CT-SpatialVQA, a benchmark of 9077 QA pairs derived from 1601 CT volumes and radiology reports, aimed at evaluating semantic-spatial reasoning (anatomical localization, laterality, structural comparison, and 3D inter-structure relations) in 3D medical VLMs. An LLM-assisted generation and validation pipeline achieves 95% human consensus. The authors apply a standardized protocol to benchmark eight 3D VLMs and report severe performance degradation, with average accuracy of 34% and frequent results below random guessing.
Significance. If the QA pairs are shown to require genuine 3D volumetric reasoning rather than textual leakage or 2D cues, the benchmark would be a valuable contribution for exposing limitations in current 3D medical VLMs and motivating more robust spatial integration. The scale of the dataset, direct derivation from clinical reports, and high-consensus validation pipeline are clear strengths that could support reproducible follow-up work.
major comments (3)
- [§3] §3 (Benchmark Construction) and §3.3 (Validation): The 95% human consensus rate is reported without detailing question-generation rules, exclusion criteria for ambiguous cases, or statistical measures of inter-annotator agreement beyond the aggregate figure. This leaves open whether the pairs were constructed to exclude language-only solutions.
- [§4.2] §4.2 (Evaluation Protocol): No control experiments are described in which models (or humans) are tested on the same QA pairs with the CT volume withheld or replaced by single 2D slices/axial projections. Because questions are derived directly from reports that already encode laterality and relations, the absence of these baselines makes it impossible to attribute the 34% accuracy drop specifically to failure of 3D reasoning.
- [§4.3] §4.3 (Results) and Table 3: The headline claim of 'often below random' performance is presented without per-task random baselines, statistical significance tests for the degradation, or text-only model runs. This weakens the interpretation that the observed scores demonstrate a lack of semantic-spatial understanding rather than other factors.
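To make the point about chance levels concrete, here is a minimal sketch, assuming the number of answer options per question is known (an assumption about the dataset format, not something stated in this review): chance accuracy is 1/k for a k-option question, and a one-sided exact binomial test checks whether an observed score is credibly below that level.

```python
from scipy.stats import binomtest

def chance_level(n_options: int) -> float:
    """Random-guess accuracy for an n_options-way question
    (0.5 for binary laterality, 0.25 for a 4-way relational choice)."""
    return 1.0 / n_options

def below_random_pvalue(n_correct: int, n_questions: int, n_options: int) -> float:
    """Exact one-sided binomial test of whether the observed score is
    credibly below the per-task chance level."""
    result = binomtest(n_correct, n_questions, chance_level(n_options),
                       alternative="less")
    return result.pvalue

# Illustrative numbers only, not taken from the paper's tables:
# 400 correct answers out of 1000 binary laterality questions.
print(below_random_pvalue(400, 1000, n_options=2))
```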
minor comments (2)
- [Abstract] Abstract: The phrase 'often below random' should be qualified with the exact random baseline for each question category (e.g., binary laterality vs. multi-choice relational).
- [Figure 1] Figure 1 or dataset examples: Provide at least one concrete QA pair together with the corresponding CT slices and report excerpt to illustrate why 3D volume access is required.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed comments, which highlight important aspects of our benchmark construction and evaluation that require clarification. We address each major comment below and commit to revisions that strengthen the manuscript without altering its core claims.
Point-by-point responses
Referee: [§3] §3 (Benchmark Construction) and §3.3 (Validation): The 95% human consensus rate is reported without detailing question-generation rules, exclusion criteria for ambiguous cases, or statistical measures of inter-annotator agreement beyond the aggregate figure. This leaves open whether the pairs were constructed to exclude language-only solutions.
Authors: We agree that additional details are needed. In the revised manuscript, we will expand §3 to explicitly describe the question-generation rules (including templates for localization, laterality, comparison, and 3D relations), the exclusion criteria for ambiguous or text-only solvable cases, and inter-annotator agreement statistics such as Cohen's kappa and Fleiss' kappa computed on the human validation subset. The pipeline was designed to prioritize questions requiring volumetric evidence, but we will add concrete examples demonstrating that language-only solutions are insufficient for the majority of items. revision: yes
Referee: [§4.2] §4.2 (Evaluation Protocol): No control experiments are described in which models (or humans) are tested on the same QA pairs with the CT volume withheld or replaced by single 2D slices/axial projections. Because questions are derived directly from reports that already encode laterality and relations, the absence of these baselines makes it impossible to attribute the 34% accuracy drop specifically to failure of 3D reasoning.
Authors: This is a valid concern. We will add control experiments in the revised §4.2, including (1) text-only runs of all eight models on the identical QA pairs with volumes withheld, and (2) 2D slice-based evaluations using axial projections or representative slices. These baselines will allow direct comparison to the full 3D results and help quantify the contribution of volumetric reasoning. While the questions were manually reviewed to emphasize 3D inter-structure relations not explicitly stated in reports, the added controls will make this attribution rigorous. revision: yes
Referee: [§4.3] §4.3 (Results) and Table 3: The headline claim of 'often below random' performance is presented without per-task random baselines, statistical significance tests for the degradation, or text-only model runs. This weakens the interpretation that the observed scores demonstrate a lack of semantic-spatial understanding rather than other factors.
Authors: We will revise §4.3 and Table 3 to include per-task random-chance baselines (computed from the number of answer options per question type), statistical significance tests (e.g., binomial tests or McNemar's test against random and against text-only performance), and the text-only model results from the new controls. These additions will provide a clearer statistical foundation for interpreting the 34% average accuracy and the 'below random' observations as evidence of limited 3D semantic-spatial understanding. revision: yes
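A minimal sketch of the statistics these responses commit to, using scikit-learn and SciPy with made-up illustrative arrays (none of the numbers or variable names come from the paper): Cohen's kappa measures agreement between two human validators on the QA validation subset, and an exact McNemar test, implemented here as a binomial test on discordant pairs, compares the same model's per-question correctness with the full volume versus text only.

```python
import numpy as np
from scipy.stats import binomtest
from sklearn.metrics import cohen_kappa_score

# --- Inter-annotator agreement on the human validation subset -------------
# Hypothetical verdicts from two validators over the same QA pairs
# (1 = "QA pair is valid", 0 = "invalid or ambiguous").
annotator_a = np.array([1, 1, 0, 1, 1, 0, 1, 1])
annotator_b = np.array([1, 1, 0, 1, 0, 0, 1, 1])
kappa = cohen_kappa_score(annotator_a, annotator_b)

# --- Exact McNemar test: full-volume vs text-only on identical questions --
# Hypothetical per-question correctness for one model under two conditions.
correct_volume = np.array([1, 0, 1, 1, 0, 1, 0, 1], dtype=bool)
correct_text   = np.array([1, 0, 0, 1, 0, 0, 1, 1], dtype=bool)

b = int(np.sum(correct_volume & ~correct_text))  # volume right, text wrong
c = int(np.sum(~correct_volume & correct_text))  # text right, volume wrong
# Under H0 (both conditions equally good), discordant pairs split 50/50.
p_mcnemar = binomtest(b, b + c, 0.5, alternative="two-sided").pvalue

print(f"Cohen's kappa = {kappa:.2f}, McNemar exact p = {p_mcnemar:.3f}")
```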
Circularity Check
No circularity: new benchmark evaluated on external models
Full rationale
The paper constructs CT-SpatialVQA from 1601 external radiology reports and CT volumes via an LLM-assisted pipeline, then evaluates eight existing 3D VLMs on the resulting 9077 QA pairs. No equations, fitted parameters, or predictions appear; the central result (34% average accuracy) is a direct measurement against independent models and data. No self-citations are load-bearing, no ansatzes are smuggled, and no quantity is defined in terms of itself or renamed as a novel derivation. The work is self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Radiology reports paired with CT volumes contain sufficient information to generate questions that require explicit 3D anatomical localization and relational reasoning.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/AlexanderDuality.lean: alexander_duality_circle_linking (tagged unclear)
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Paper passage: "Our dataset requires explicit anatomical localization, laterality awareness, structural comparison, and 3D inter-structure relational reasoning."
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] Alibaba Cloud Model Studio: Qwen-plus (Qwen3 series) model listing. https://www.alibabacloud.com/help/en/model-studio/models, accessed 2026-02-24
- [2] Bai, F., Du, Y., Huang, T., Meng, M.Q.H., Zhao, B.: M3D: Advancing 3D medical image analysis with multi-modal large language models. arXiv preprint arXiv:2404.00578 (2024)
- [3] Banerjee, S., Lavie, A.: METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In: Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization. pp. 65–72 (2005)
- [4] Blankemeier, L., Cohen, J.P., Kumar, A., Van Veen, D., Gardezi, S.J.S., Paschali, M., Chen, Z., Delbrouck, J.B., Reis, E., Truyts, C., et al.: Merlin: A vision language foundation model for 3D computed tomography. Research Square pp. rs–3 (2024)
- [5] Chen, Y., Xiao, W., Bassi, P.R., Zhou, X., Er, S., Hamamci, I.E., Zhou, Z., Yuille, A.: Are vision language models ready for clinical diagnosis? A 3D medical benchmark for tumor-centric visual question answering. arXiv preprint arXiv:2505.18915 (2025)
- [6] Google AI for Developers: Gemini 2.5 Flash (model code: gemini-2.5-flash). https://ai.google.dev/gemini-api/docs/models/gemini-2.5-flash, accessed 2026-02-24
- [7] Google Research: MedGemma 1.5 model card (2026). https://huggingface.co/google/medgemma-1.5-4b-it, accessed 2026-02-22
- [8] Hamamci, I.E., Er, S., Wang, C., Almas, F., Simsek, A.G., Esirgun, S.N., Dogan, I., Durugol, O.F., Hou, B., Shit, S., et al.: Generalist foundation models from a multimodal dataset for 3D computed tomography. Nature Biomedical Engineering pp. 1–19 (2026)
- [9] Lai, H., Jiang, Z., Yao, Q., Wang, R., He, Z., Tao, X., Wei, W., Lv, W., Zhou, S.K.: E3D-GPT: Enhanced 3D visual foundation for medical vision-language model. arXiv preprint arXiv:2410.14200 (2024)
- [10] Lee, C., Park, S., Shin, C.I., Choi, W.H., Park, H.J., Lee, J.E., Ye, J.C.: Read like a radiologist: Efficient vision-language model for 3D medical imaging interpretation. arXiv preprint arXiv:2412.13558 (2024)
- [11] Lin, C.Y.: ROUGE: A package for automatic evaluation of summaries. In: Text Summarization Branches Out. pp. 74–81 (2004)
- [12] Nath, V., Li, W., Yang, D., Myronenko, A., Zheng, M., Lu, Y., Liu, Z., Yin, H., Law, Y.M., Tang, Y., et al.: VILA-M3: Enhancing vision-language models with medical expert knowledge. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 14788–14798 (2025)
- [13] OpenAI: GPT-4o model documentation. https://platform.openai.com/docs/models/gpt-4o, accessed 2026-02-24
- [14] Papineni, K., Roukos, S., Ward, T., Zhu, W.J.: BLEU: A method for automatic evaluation of machine translation. In: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics. pp. 311–318 (2002)
- [15] Reimers, N., Gurevych, I.: Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). pp. 3982–3992 (2019)
- [16] Shui, Z., Zhang, J., Cao, W., Wang, S., Guo, R., Lu, L., Yang, L., Ye, X., Liang, T., Zhang, Q., et al.: Large-scale and fine-grained vision-language pre-training for enhanced CT image understanding. arXiv preprint arXiv:2501.14548 (2025)
- [17] Wang, Y., Dai, Y., Jones, C., Sair, H., Shen, J., Loizou, N., Hsu, W.C., Imami, M., Jiao, Z., Zhang, P., et al.: Enhancing vision-language models for medical imaging: Bridging the 3D gap with innovative slice selection. Advances in Neural Information Processing Systems 37, 99947–99964 (2024)
- [18]
- [19] Wu, C., Zhang, X., Zhang, Y., Hui, H., Wang, Y., Xie, W.: Towards generalist foundation model for radiology by leveraging web-scale 2D&3D medical data. Nature Communications 16(1), 7866 (2025)
- [20] Xin, Y., Ates, G.C., Gong, K., Shao, W.: Med3DVLM: An efficient vision-language model for 3D medical image analysis. IEEE Journal of Biomedical and Health Informatics (2025)
- [21] Xu, W., Chan, H.P., Li, L., Aljunied, M., Yuan, R., Wang, J., Xiao, C., Chen, G., Liu, C., Li, Z., et al.: Lingshu: A generalist foundation model for unified multimodal medical understanding and reasoning. arXiv preprint arXiv:2506.07044 (2025)