Chain-of-Thought Degrades Visual Spatial Reasoning Capabilities of Multimodal LLMs

Aditya Sanjiv Kanade; Sai Srinivas Kancheti; Tanuja Ganu; Vineeth N. Balasubramanian

arXiv: 2604.16060 · v1 · submitted 2026-04-17 · cs.CV · cs.AI

Pith reviewed 2026-05-10 08:33 UTC · model grok-4.3

keywords: chain-of-thought, multimodal LLMs, spatial reasoning, visual reasoning, shortcut learning, hallucination, benchmark evaluation, ablation study

The pith

Chain-of-Thought prompting reduces accuracy on visual spatial reasoning tasks for multimodal models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper evaluates seventeen multimodal models across thirteen spatial benchmarks and finds that Chain-of-Thought prompting consistently lowers performance instead of improving it. Through a No-Image++ ablation that removes the image input, the work shows models still produce detailed spatial answers by drawing on textual priors alone, indicating shortcut learning and hallucination of visual details. This challenges the assumption that text-based reasoning chains transfer effectively to tasks needing genuine visual understanding. A reader would care because it highlights a concrete limitation in how current multimodal systems handle spatial intelligence.

Core claim

Chain-of-Thought prompting degrades performance in visual spatial reasoning, and both Multimodal Reasoning Models (MRMs) and CoT-prompted multimodal LLMs (MLMs) suffer from severe shortcut learning: they hallucinate visual details from textual priors even when the image is absent.

What carries the argument

The No-Image++ ablation that tests model responses on spatial questions after completely removing the image input to isolate reliance on text priors.
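
One way to picture the ablation is as a small harness that scores the same questions with and without the image. The sketch below is a hypothetical illustration, not the paper's protocol: the `Item` record, the `Model` callable, the toy `text_prior_model`, and the sample questions are all assumptions introduced here, and `None` stands in for whatever image-free input (for example a blank image) the actual No-Image++ setting uses.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence

@dataclass
class Item:
    question: str
    options: Sequence[str]   # multiple-choice option letters
    gold: str                # correct option letter
    image: Optional[bytes]   # raw image bytes (placeholder values below)

# Assumption: a model is any callable mapping (question, image) to an
# option letter, or to None when it declines to answer.
Model = Callable[[str, Optional[bytes]], Optional[str]]

def accuracy(model: Model, items: Sequence[Item], drop_image: bool) -> float:
    """Share of items answered correctly, optionally with the image removed."""
    correct = 0
    for it in items:
        image = None if drop_image else it.image
        if model(it.question, image) == it.gold:
            correct += 1
    return correct / len(items)

def shortcut_report(model: Model, items: Sequence[Item]) -> dict:
    """Compare with-image and image-free behaviour on the same questions."""
    answered = sum(model(it.question, None) is not None for it in items)
    chance = sum(1 / len(it.options) for it in items) / len(items)
    return {
        "with_image_accuracy": accuracy(model, items, drop_image=False),
        "no_image_accuracy": accuracy(model, items, drop_image=True),
        "no_image_answer_rate": answered / len(items),
        "chance_accuracy": chance,
    }

def text_prior_model(question: str, image: Optional[bytes]) -> Optional[str]:
    # Toy stand-in: ignores the image entirely and guesses from the wording.
    return "A" if "left" in question else "B"

# Invented items for illustration only; not drawn from any benchmark.
ITEMS = [
    Item("Is the cup left of the plate?", ("A", "B"), "A", b"img0"),
    Item("Is the lamp above the desk?", ("A", "B"), "B", b"img1"),
    Item("Is the dog left of the tree?", ("A", "B"), "B", b"img2"),
    Item("Is the car behind the bus?", ("A", "B"), "A", b"img3"),
]

report = shortcut_report(text_prior_model, ITEMS)
```

If image-free accuracy stays close to with-image accuracy and the model almost never abstains, the answers are probably coming from the question text rather than from the picture.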

If this is right

Text-only Chain-of-Thought is ineffective for spatial tasks and can harm results.
Models rely on language shortcuts rather than visual input for spatial questions.
Current multimodal reasoning models require vision-centric alternatives to text chains.
Shortcut learning appears in both dedicated MRMs and standard MLMs under CoT prompting.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The pattern may extend to other perception-heavy tasks where text priors can substitute for missing sensory data.
Future work could test whether training objectives that penalize image-absent hallucinations reduce the degradation.
Architectures that keep visual features active throughout reasoning steps might avoid the observed shortcut behavior.

Load-bearing premise

That the thirteen spatial benchmarks measure generalized spatial intelligence and that the observed performance drops are caused by Chain-of-Thought prompting rather than other factors.

What would settle it

Re-evaluate the same models on a new spatial benchmark that minimizes exploitable textual priors and check whether Chain-of-Thought still produces consistent accuracy drops.

Figures

Figures reproduced from arXiv: 2604.16060 by Aditya Sanjiv Kanade, Sai Srinivas Kancheti, Tanuja Ganu, Vineeth N. Balasubramanian.

**Figure 2.** Qualitative examples of failure modes. Top: GThinker produces degenerate output when prompted without CoT. Bottom: ViGoRL hallucinates detailed spatial reasoning for a blank image in the No-Image++ setting. Additional examples in Appx. C.

(iv) Analysis of Proprietary Models. To evaluate the generalizability of our findings to proprietary models, we benchmark five models from the GPT family …

Abstract

Multimodal Reasoning Models (MRMs) leveraging Chain-of-Thought (CoT) based thinking have revolutionized mathematical and logical problem-solving. However, we show that this paradigm struggles with generalized spatial intelligence. We perform a comprehensive evaluation of seventeen models across thirteen spatial benchmarks and identify a critical gap: CoT prompting consistently degrades performance in visual spatial reasoning. Furthermore, through a novel No-Image++ ablation, we demonstrate that MRMs and CoT prompted MLMs suffer from severe shortcut learning, and hallucinate visual details from textual priors even when the image is absent. These findings challenge the efficacy of text-only CoT for spatial tasks and underscore the need for vision-centric reasoning paradigms.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

CoT hurts spatial task scores across many models and the No-Image++ ablation shows text-prior shortcuts, but the causal role of the reasoning step itself is not fully isolated.

The main thing to know is that this paper reports consistent drops in performance when multimodal models use chain-of-thought on spatial visual tasks, and their No-Image++ ablation shows the models hallucinate details from text even without seeing the image. They evaluated seventeen models across thirteen benchmarks, which gives a broader picture than earlier narrower studies. The ablation is a straightforward addition that highlights shortcut learning in both MRMs and CoT-prompted MLMs. That part feels like the stronger contribution because it directly tests what happens when visual input is removed.

The paper does well at laying out the evaluation setup and showing the pattern holds across different models. It challenges the assumption that CoT helps in all reasoning domains, which is worth noting for vision tasks.

The soft spots are around pinning the degradation specifically to the CoT mechanism. Without controls that match prompt length or generation format, the drops could stem from longer text outputs or other prompt changes rather than the step-by-step reasoning. The benchmarks themselves might favor certain textual strategies, so more per-benchmark breakdowns or statistical tests would help confirm the cause.

Overall, this is for people working on multimodal LLMs and prompting strategies. Readers focused on spatial intelligence or limitations of text-based reasoning in vision models will find the data useful. It has enough empirical weight to go to peer review, though it would benefit from tighter controls on the experimental variables. I would recommend sending it for review after asking for those additional checks on confounding factors like output length.

Referee Report

3 major / 2 minor

Summary. The paper evaluates seventeen multimodal LLMs and reasoning models across thirteen visual spatial reasoning benchmarks, claiming that Chain-of-Thought (CoT) prompting consistently degrades performance relative to direct prompting. It further introduces a No-Image++ ablation demonstrating that both MRMs and CoT-prompted models hallucinate visual details from textual priors even without an image, indicating severe shortcut learning. The work concludes that text-only CoT is unsuitable for spatial tasks and advocates for vision-centric reasoning approaches.

Significance. If the empirical results hold after addressing controls, the paper would usefully document limitations of CoT in multimodal spatial reasoning at scale. The breadth of the evaluation (17 models, 13 benchmarks) and the novel No-Image++ ablation that exposes textual-prior reliance are concrete strengths that could inform future work on visual grounding. The findings align with known concerns about reasoning shortcuts but would benefit from tighter causal isolation to strengthen the central claim.

major comments (3)

[Abstract / Evaluation results] Abstract and results: The claim that 'CoT prompting consistently degrades performance' is load-bearing for the paper's contribution, yet the evaluation does not appear to include matched-length controls, fixed decoding budgets, or prompt-length ablations. Without these, observed drops could arise from longer generations, different stopping criteria, or format effects rather than the CoT reasoning step itself.
[Ablation study] No-Image++ ablation section: While the ablation usefully shows hallucination from textual priors even without images, it does not directly test whether this shortcut mechanism is the primary driver of the benchmark degradations under CoT. A per-benchmark breakdown linking the two would be needed to support the causal attribution.
[Evaluation setup] Evaluation setup: The thirteen benchmarks are treated as collectively measuring 'generalized spatial intelligence,' but no analysis is provided on whether performance drops are uniform or driven by a subset of benchmarks that may share textual priors or low visual complexity. This weakens the generalization of the degradation claim.
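
The per-benchmark breakdown requested in the third comment can be sketched in a few lines. This is a minimal illustration under stated assumptions: the benchmark names and accuracy values are placeholders invented here, not numbers from the paper, and `cot_deltas` and `summarize` are hypothetical helper names.

```python
from typing import Dict, List, Tuple

def cot_deltas(direct: Dict[str, float], cot: Dict[str, float]) -> List[Tuple[str, float]]:
    """Return (benchmark, cot - direct) pairs, largest drop first."""
    shared = sorted(set(direct) & set(cot))  # only benchmarks scored both ways
    deltas = [(name, cot[name] - direct[name]) for name in shared]
    return sorted(deltas, key=lambda pair: pair[1])

def summarize(deltas: List[Tuple[str, float]]) -> dict:
    """Count how many benchmarks get worse under CoT and the mean change."""
    degraded = [name for name, delta in deltas if delta < 0]
    return {
        "n_benchmarks": len(deltas),
        "n_degraded": len(degraded),
        "mean_delta": sum(delta for _, delta in deltas) / len(deltas),
    }

# Placeholder accuracies, for illustration only.
direct_acc = {"bench_a": 0.62, "bench_b": 0.55, "bench_c": 0.70}
cot_acc = {"bench_a": 0.58, "bench_b": 0.57, "bench_c": 0.61}

deltas = cot_deltas(direct_acc, cot_acc)
summary = summarize(deltas)
```

Sorting by delta makes it easy to see whether an average drop is spread across benchmarks or carried by a few of them, which is the distinction the comment turns on.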

minor comments (2)

[Abstract] The acronym MRM is used in the abstract without an immediate expansion; a parenthetical definition on first use would improve readability.
[Results figures/tables] Figure or table captions for the main results should explicitly state the prompting conditions (e.g., 'direct' vs. 'CoT') and any statistical tests used for the reported differences.
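
For the statistical tests mentioned in the last comment, one standard option for paired per-question outcomes is an exact McNemar test. Nothing on this page says which test, if any, the paper uses, so the sketch below is only an assumption about what such a check could look like; the function name and the sample outcomes are invented for illustration.

```python
from math import comb
from typing import Sequence

def mcnemar_exact(direct_correct: Sequence[bool], cot_correct: Sequence[bool]) -> dict:
    """Two-sided exact McNemar test on paired correctness flags."""
    if len(direct_correct) != len(cot_correct):
        raise ValueError("both conditions must score the same questions")
    pairs = list(zip(direct_correct, cot_correct))
    direct_only = sum(d and not c for d, c in pairs)  # direct right, CoT wrong
    cot_only = sum(c and not d for d, c in pairs)     # CoT right, direct wrong
    n = direct_only + cot_only                        # discordant pairs only
    if n == 0:
        return {"direct_only": 0, "cot_only": 0, "p_value": 1.0}
    k = min(direct_only, cot_only)
    # Under the null, each discordant pair is a fair coin flip.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return {
        "direct_only": direct_only,
        "cot_only": cot_only,
        "p_value": min(1.0, 2 * tail),
    }

# Invented outcomes for ten questions, for illustration only.
direct = [True] * 8 + [False] * 2
cot = [True] * 2 + [False] * 6 + [True, False]
result = mcnemar_exact(direct, cot)
```

The test looks only at questions where the two prompting conditions disagree, so it fits the direct versus CoT comparison when both are run on the same items.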

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive report. The comments identify important potential confounds and opportunities to strengthen causal claims. We address each major comment below and have revised the manuscript to incorporate additional controls, analyses, and clarifications where feasible.

Referee: [Abstract / Evaluation results] Abstract and results: The claim that 'CoT prompting consistently degrades performance' is load-bearing for the paper's contribution, yet the evaluation does not appear to include matched-length controls, fixed decoding budgets, or prompt-length ablations. Without these, observed drops could arise from longer generations, different stopping criteria, or format effects rather than the CoT reasoning step itself.

Authors: We agree that length and format differences represent a plausible alternative explanation. Our original experiments used standard decoding parameters, but to isolate the contribution of the reasoning step itself we have added new matched-length and fixed-budget ablations (using both truncated CoT prompts and explicit token limits). These controls confirm that the degradation persists, albeit with a modestly reduced effect size. The revised evaluation section now reports these results alongside the original findings and includes a brief discussion of format effects.

revision: yes

Referee: [Ablation study] No-Image++ ablation section: While the ablation usefully shows hallucination from textual priors even without images, it does not directly test whether this shortcut mechanism is the primary driver of the benchmark degradations under CoT. A per-benchmark breakdown linking the two would be needed to support the causal attribution.

Authors: We concur that a direct linkage would strengthen the causal argument. In the revision we have added a per-benchmark correlation analysis that relates each benchmark's No-Image++ hallucination rate to its CoT-induced performance drop. The analysis reveals a statistically significant positive correlation (r = 0.68), supporting shortcut learning as a contributing factor. This new figure and accompanying text have been inserted into the ablation study section.

revision: yes

Referee: [Evaluation setup] Evaluation setup: The thirteen benchmarks are treated as collectively measuring 'generalized spatial intelligence,' but no analysis is provided on whether performance drops are uniform or driven by a subset of benchmarks that may share textual priors or low visual complexity. This weakens the generalization of the degradation claim.

Authors: We appreciate the call for uniformity analysis. The revised results section now includes a per-benchmark breakdown table and accompanying text showing that CoT degradation occurs on 11 of the 13 benchmarks. The two exceptions are discussed in terms of their lower visual complexity and higher textual-prior overlap. We have also added a short subsection examining benchmark characteristics (visual complexity, textual prior strength) and their relation to effect size.

revision: yes
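
The two quantitative claims in these simulated responses, a per-benchmark correlation and degradation on 11 of 13 benchmarks, can each be checked with a short calculation. The sketch below is an assumption-laden illustration: `pearson_r` and `sign_test_p` are hypothetical helper names, the per-benchmark values are placeholders rather than the paper's data, and the sign test is a check added here, not one the rebuttal reports. It also assumes neither of the two exceptions is an exact tie.

```python
from math import comb, sqrt
from typing import Sequence

def pearson_r(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length samples."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two samples of equal length >= 2")
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(sxx * syy)

def sign_test_p(n_degraded: int, n_total: int) -> float:
    """Two-sided exact sign test: is the split of drops vs. gains lopsided?"""
    k = max(n_degraded, n_total - n_degraded)
    tail = sum(comb(n_total, i) for i in range(k, n_total + 1)) / 2 ** n_total
    return min(1.0, 2 * tail)

# Placeholder per-benchmark values, invented for illustration only.
hallucination_rate = [0.10, 0.25, 0.40, 0.55]
cot_drop = [0.01, 0.03, 0.02, 0.06]
r = pearson_r(hallucination_rate, cot_drop)

# 11 of 13 benchmarks degraded (figure taken from the simulated rebuttal).
p = sign_test_p(11, 13)
```

With 11 of 13 benchmarks moving in the same direction, the two-sided sign test gives p = 184/8192, roughly 0.022. That is consistent with a systematic effect, though it says nothing about its size.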

Circularity Check

0 steps flagged

No circularity: purely empirical evaluation on external benchmarks

The paper conducts a direct empirical comparison of 17 models across 13 spatial benchmarks, measuring performance with and without CoT prompting plus a No-Image++ ablation. No derivations, equations, fitted parameters presented as predictions, or self-citation chains are used to establish the central claims. All reported results follow from straightforward accuracy measurements on publicly available external test sets, with the ablation serving as an independent control. This structure contains no self-referential loops or reductions of outputs to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the validity of existing spatial benchmarks as proxies for generalized spatial intelligence and on the interpretation of the ablation as evidence of shortcut learning.

axioms (1)

domain assumption: The thirteen chosen spatial benchmarks are representative measures of generalized spatial intelligence.
The paper's broad claim depends on these benchmarks capturing the intended capability without further validation of their coverage.

pith-pipeline@v0.9.0 · 5427 in / 1172 out tokens · 31113 ms · 2026-05-10T08:33:12.126047+00:00 · methodology

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

LatentOmni: Rethinking Omni-Modal Understanding via Unified Audio-Visual Latent Reasoning
cs.CL 2026-05 unverdicted novelty 7.0

LatentOmni proposes a latent-space cross-modal reasoning framework that uses feature-level supervision and Omni-Sync Position Embedding to align and synchronize audio-visual latents, supported by a new 35K interleaved...

