arxiv: 2606.26535 · v1 · pith:SXX2U4L2new · submitted 2026-06-25 · 💻 cs.CV · cs.AI

From Hallucination to Grounding: Diagnosing Visual Spatial Intelligence via CRISP

Pith reviewed 2026-06-26 05:28 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords CRISP · visual spatial intelligence · perception-reasoning disconnect · 3D scene graphs · vision-language models · metric estimation · multi-hop reasoning · oracle intervention

The pith

CRISP reveals that proprietary VLMs have strong latent reasoning but fail at metric estimation and using internal structures, while open-source models lack multi-hop reasoning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces CRISP to test visual spatial intelligence by checking whether a model's implicit perception aligns with its explicit reasoning steps. It applies metric 3D scene graphs together with an oracle intervention protocol that supplies perfect structural information at test time. This setup shows proprietary models can reason once given accurate data yet still produce wrong distances and ignore the structures they already hold. Open-source models instead fail at chaining multiple reasoning steps even when perception is helped. The work therefore treats simple answer accuracy as insufficient and pushes evaluation toward verifiable perception-reasoning consistency.

Core claim

CRISP demonstrates a systematic perception-reasoning disconnect: proprietary models possess robust latent reasoning engines but suffer from inaccurate metric estimation and a critical failure to leverage their implicit structural representations, while open-source models remain fundamentally bottlenecked by their lack of multi-hop compositional reasoning.

What carries the argument

CRISP, the structural-diagnostic evaluation paradigm that measures consistency between implicit perception and explicit reasoning using metric 3D Scene Graphs and an oracle intervention protocol to isolate reasoning from perception.
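
As a rough illustration of what a consistency check of this kind could look like, the sketch below derives an answer from a model-emitted scene graph and compares it with the model's direct answer. It is a minimal sketch under stated assumptions, not the paper's implementation: the `Node` type, the "which is closer" question template, and the `closer_object` and `consistency_cell` helpers are hypothetical names, and object positions are assumed to be 3D centers in metres.

```python
from dataclasses import dataclass
from math import dist


@dataclass(frozen=True)
class Node:
    """One object in a (hypothetical) metric 3D scene graph."""
    name: str
    center: tuple[float, float, float]  # assumed: metres, shared world frame


def closer_object(graph: dict[str, Node], anchor: str, a: str, b: str) -> str:
    """Derive the answer to 'which of a or b is closer to anchor?' from a graph."""
    d_a = dist(graph[anchor].center, graph[a].center)
    d_b = dist(graph[anchor].center, graph[b].center)
    return a if d_a <= d_b else b


def consistency_cell(base_answer: str, derived_answer: str, truth: str) -> tuple[bool, bool]:
    """Return (base QA correct, graph-derived QA correct) for one question."""
    return (base_answer == truth, derived_answer == truth)


# Ground-truth layout versus a graph the model emitted (illustrative values only).
gt = {n.name: n for n in [
    Node("sofa", (0.0, 0.0, 0.0)), Node("lamp", (1.0, 0.0, 0.0)), Node("tv", (3.0, 0.0, 0.0)),
]}
pred = {n.name: n for n in [
    Node("sofa", (0.0, 0.0, 0.0)), Node("lamp", (4.0, 0.0, 0.0)), Node("tv", (3.0, 0.0, 0.0)),
]}

truth = closer_object(gt, "sofa", "lamp", "tv")      # answer implied by the ground truth
derived = closer_object(pred, "sofa", "lamp", "tv")  # answer implied by the model's own graph
cell = consistency_cell("lamp", derived, truth)      # "lamp" stands in for the model's direct answer
```

In this toy case the direct answer is correct while the answer derived from the emitted graph is not, which is the kind of mismatch a consistency score is meant to surface.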

If this is right

  • Evaluation of vision-language models should shift from final-answer accuracy to explicit checks of perception-reasoning consistency.
  • Proprietary models require targeted fixes for metric estimation and mechanisms that force use of already-present structural knowledge.
  • Open-source models need architectural or training advances that enable multi-hop compositional reasoning.
  • Progress in multimodal alignment will require methods that go beyond end-to-end post-training to enforce grounding.
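
The paper's case-study caption reports metric estimation as "MRA" without expanding it on this page. Assuming it denotes Mean Relative Accuracy as used in the VSI-Bench paper the authors cite ("Thinking in Space"), a single prediction is scored by the fraction of confidence thresholds at which its relative error is tolerated. The sketch below follows that assumed definition; the function name is hypothetical and the threshold grid (0.50 to 0.95 in steps of 0.05) is taken from that earlier benchmark, not from this paper.

```python
def mean_relative_accuracy(pred: float, truth: float) -> float:
    """Fraction of confidence thresholds at which the relative error is tolerated."""
    if truth == 0:
        raise ValueError("relative error is undefined for a zero ground truth")
    rel_err = abs(pred - truth) / abs(truth)
    # Assumed threshold grid: 0.50, 0.55, ..., 0.95 (ten values).
    thresholds = [(50 + 5 * k) / 100 for k in range(10)]
    # A threshold theta is passed when the relative error is below 1 - theta.
    hits = sum(rel_err < 1 - theta for theta in thresholds)
    return hits / len(thresholds)
```

Under this definition an exact estimate scores 1.0 and an estimate that is off by 100% scores 0.0, so per-question scores are multiples of 0.1.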

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • CRISP-style diagnostics could be extended to other reasoning domains such as temporal or causal understanding to locate similar disconnects.
  • Training objectives that reward explicit verification against internal representations might reduce the proprietary-model gap identified here.
  • If the same disconnect appears on tasks outside spatial reasoning, current model scaling alone is unlikely to close it.

Load-bearing premise

The oracle intervention protocol and metric 3D Scene Graphs can separate latent reasoning from perceptual bottlenecks without adding their own biases or artifacts.

What would settle it

An experiment in which models given the oracle 3D graphs still produce the same error patterns as without them, or in which consistency scores remain low even after the intervention removes all perceptual error.
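
One way to make this test concrete is to compare per-question correctness with and without the oracle graph and count how many errors the intervention actually removes. The following is a minimal sketch, assuming correctness is available as two aligned boolean lists; the function name and the four category labels are hypothetical and are not the paper's terminology.

```python
from collections import Counter


def intervention_summary(base_correct, oracle_correct):
    """Count how each question's outcome changes when the oracle graph is supplied."""
    if len(base_correct) != len(oracle_correct):
        raise ValueError("both runs must cover the same questions")
    labels = {
        (False, True): "corrected",    # wrong without the oracle, right with it
        (False, False): "persisting",  # wrong in both settings
        (True, False): "regressed",    # right without the oracle, wrong with it
        (True, True): "stable",        # right in both settings
    }
    counts = Counter(
        labels[(bool(b), bool(o))] for b, o in zip(base_correct, oracle_correct)
    )
    return {name: counts.get(name, 0) for name in labels.values()}


# Illustrative values only, not results from the paper.
summary = intervention_summary(
    [False, False, True, True, False],
    [True, False, True, False, True],
)
```

A large "persisting" share under oracle input would point at reasoning rather than perception, while a large "corrected" share would support the paper's reading.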

Figures

Figures reproduced from arXiv: 2606.26535 by Yinan Yu, Zhixing Li.

Figure 1: The illusion of spatial intelligence. While VLMs correctly answer queries, our structural-diagnostic paradigm reveals internal hallucinations. This exposes a critical inconsistency between correct linguistic outputs and flawed implicit 3D modeling.
    … structural spatial perception. While current models excel at the former: identifying “what is where” via semantic labels and 2D bounding boxes; they exhibit a …
Figure 2: Overview of the CRISP benchmark. We construct a physically grounded 3D Scene Graph to instantiate paired diagnostic tasks: Spatial QA and SGC. A novel consistency protocol then evaluates the alignment between explicit linguistic reasoning and the externalization of implicit structural modeling.
    … bottleneck to rigorously verify implicit 3D cognition. Furthermore, while efforts like SIGBench [40] and THEORY OF S…
Figure 3: Question distribution of the Spatial QA task.
    Spatial QA Task. To probe explicit reasoning, we instantiate a Spatial QA suite comprising seven competencies vital for embodied agents: distance estimation (reaching), size estimation (clearance), directional perception (egocentric navigation), counting (inventory), spatial ranking (prioritization), view transformation (pose prediction), and logical dedu…
Figure 4: Fine-grained spatial capabilities breakdown across 9 dimensions.
    4 Experiments. Evaluation Setup. We benchmark 13 state-of-the-art VLMs, comprising proprietary giants like Gemini 2.5 Flash/Pro [10], Gemini 3 Flash [11], GPT-5-Mini [30] and GPT-5.2 [30]; and leading open-source models like Qwen2.5/3-VL [5,6], InternVL-3.5 [35], LLaVA-OneVision-1.5 [2]. We also include VG LLM [56] and Cambrian-S [45] as sp…
Figure 5: Visualizing the dependency of reasoning on grounding.
Figure 6: Decoupling Perception from Reasoning. Diagnosing shortcut learning and perception-reasoning disconnect.
    … significantly with visual input (∆QA = 16.31), its SGC gain is very limited (∆SGC = 3.24). This dissociation implies a critical Semantic-Geometric Gap: the vision encoder successfully identifies what the objects are, allowing the logical engine to answer QA queries based on restored semantic contexts, b…
Figure 7: Unlocking Reasoning Potential. We compare QA accuracy across four input settings (from left to right): Base, Pred SG, Multimodal GT SG, and Text-Only GT SG. The dramatic performance surge in the two rightmost GT bars confirms the presence of robust latent reasoning engines, while the stagnation in Pred SG visually quantifies the severity of modality conflict.
    … attention map would correctly highlight the dra…

Figure 1: Confusion matrices comparing Base QA and Derived QA predictions.
    To empirically ground the diagnostic archetypes identified in Sec. 4.3, we visualize the consistency evaluation protocol via instance-level confusion matrices (Base QA vs. Derived QA) for three representative models. As shown in …
Figure 2: A qualitative example of Qwen3-VL-8B demonstrating the consistent hallucination
Figure 3: A qualitative example of Qwen3-VL-8B demonstrating the imperfect alignment
Figure 4: A qualitative example of LLaVA-OneVision1.5-8B demonstrating the solver failed
Figure 5: Case Study on GPT-5-Mini. Under GT 3D SG intervention, the model corrects 1,819 previously failed queries and achieves near-perfect metric estimation (Median MRA 1.0). Under Pred 3D SG intervention, the metric distribution stagnates (Median MRA 0.4), and the model exhibits high consistent hallucination, confirming that internal perceptual bottlenecks prevent the reasoning engine from activating
Figure 6: Confusion matrices of generated spatial relations across three orthog
Figure 7: The CRISP Diagnostic Roadmap. A modular framework for suggesting potential architectural intervention. Panel A leverages baseline metrics to diagnose primary phenotypic failure modes. Panel B provides independent structural probes to mechanistically decouple perceptual bottlenecks from reasoning deficits and evaluate robustness against modality conflict.
    … ignoring geometry entirely (diagnosed as a Semantic …

Original abstract

Current VLM evaluations often conflate language priors with genuine spatial reasoning. To address this, we introduce CRISP, a novel structural-diagnostic evaluation paradigm that assesses visual spatial intelligence through consistency, the alignment between implicit perception and explicit reasoning. Unlike traditional black-box QA, CRISP utilizes metric 3D Scene Graphs and an oracle intervention protocol to decouple latent reasoning capabilities from perceptual bottlenecks. This granular diagnosis uncovers a systematic perception-reasoning disconnect. Crucially, we reveal that while proprietary models possess robust latent reasoning engines, they suffer from inaccurate metric estimation and a critical failure to leverage their implicit structural representations. Conversely, open-source models remain fundamentally bottlenecked by their lack of multi-hop compositional reasoning. By shifting the focus from merely “guessing correctly” via language priors to genuinely “perceiving, verifying, and reasoning,” CRISP offers a rigorous roadmap for multimodal alignment beyond end-to-end post-training. The code and dataset are available at https://github.com/iiyamayuki/CRISP-Bench.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces CRISP, a structural-diagnostic evaluation paradigm for assessing visual spatial intelligence in vision-language models (VLMs). Unlike standard black-box QA, CRISP employs metric 3D Scene Graphs as oracles and an intervention protocol to isolate latent reasoning from perceptual errors. The central findings are that proprietary models exhibit robust latent reasoning engines yet fail at accurate metric estimation and fail to exploit implicit structural representations, while open-source models are primarily limited by insufficient multi-hop compositional reasoning. Code and dataset are released.

Significance. If the oracle intervention protocol validly decouples perception from reasoning without artifacts, the work supplies a more granular diagnostic than existing VLM benchmarks and identifies concrete bottlenecks (metric estimation in proprietary models; compositional chaining in open-source models). The public release of the benchmark strengthens reproducibility and enables follow-up alignment research.

major comments (3)
  1. [Oracle intervention protocol (methodology section describing CRISP)] The central claim that proprietary models possess 'robust latent reasoning engines' while open-source models lack 'multi-hop compositional reasoning' depends on the oracle intervention protocol successfully isolating these capacities. The manuscript must demonstrate, via explicit ablations or controls, that injecting metric or structural information does not itself create new reasoning pathways or alter internal processing (see stress-test concern). Without such evidence the proprietary/open-source distinction risks being an artifact of the diagnostic setup.
  2. [Metric 3D Scene Graphs construction and validation] The assertion that 3D Scene Graphs provide an 'unbiased, complete structural representation' of the visual input actually received by the model requires verification that the graphs match the model's perceptual input distribution and do not introduce their own biases. The paper should report quantitative checks (e.g., agreement between graph-derived metrics and human annotations on the same images) and sensitivity analyses when graph completeness is varied.
  3. [Results and experimental setup] The reported perception-reasoning disconnect and model-type differences are presented as systematic; however, the abstract supplies no experimental details, sample sizes, statistical tests, or variance across runs. The results section must include these to establish that the observed gaps are not driven by prompt sensitivity or small test sets.
minor comments (2)
  1. [CRISP definition] Clarify the precise definition of 'consistency' used to measure alignment between implicit perception and explicit reasoning; the current description is high-level.
  2. [Reproducibility statement] The GitHub link is provided; ensure the released code exactly reproduces the reported oracle interventions and graph construction steps.
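
Major comment 3 asks for variance and significance reporting. One common way to provide it for a gap between two models evaluated on the same questions is a paired percentile bootstrap. The sketch below is an illustration under assumptions, not something the paper or the referee specifies: the function name, the resample count, and the 95% level are arbitrary choices, and per-question correctness is assumed to be available as aligned boolean lists.

```python
import random


def paired_bootstrap_ci(correct_a, correct_b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for accuracy(A) - accuracy(B) on the same questions."""
    if not correct_a or len(correct_a) != len(correct_b):
        raise ValueError("need two non-empty, equally long correctness lists")
    rng = random.Random(seed)
    n = len(correct_a)
    # Per-question difference: +1 if only A is right, -1 if only B is right, else 0.
    deltas = [int(a) - int(b) for a, b in zip(correct_a, correct_b)]
    diffs = []
    for _ in range(n_resamples):
        # Resample questions with replacement, keeping the A/B pairing intact.
        sample = [deltas[rng.randrange(n)] for _ in range(n)]
        diffs.append(sum(sample) / n)
    diffs.sort()
    lo = diffs[int((alpha / 2) * n_resamples)]
    hi = diffs[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lo, hi
```

If the resulting interval excludes zero, the gap is unlikely to be an artifact of which questions happened to be sampled; prompt sensitivity would still need separate runs with varied prompts.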

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point by point below, providing our response and indicating planned revisions to strengthen the manuscript.

Point-by-point responses
  1. Referee: [Oracle intervention protocol (methodology section describing CRISP)] The central claim that proprietary models possess 'robust latent reasoning engines' while open-source models lack 'multi-hop compositional reasoning' depends on the oracle intervention protocol successfully isolating these capacities. The manuscript must demonstrate, via explicit ablations or controls, that injecting metric or structural information does not itself create new reasoning pathways or alter internal processing (see stress-test concern). Without such evidence the proprietary/open-source distinction risks being an artifact of the diagnostic setup.

    Authors: We agree that explicit validation of the oracle intervention protocol is essential to support the claims. The manuscript describes the protocol and includes preliminary controls showing that open-source models continue to fail on compositional tasks even with full oracle input, while proprietary models improve on reasoning but not metric estimation. To further address potential artifacts, we will add dedicated ablations and stress-tests in the revision, including systematic variation of injected information and measurement of any changes in reasoning behavior.
    revision: yes

  2. Referee: [Metric 3D Scene Graphs construction and validation] The assertion that 3D Scene Graphs provide an 'unbiased, complete structural representation' of the visual input actually received by the model requires verification that the graphs match the model's perceptual input distribution and do not introduce their own biases. The paper should report quantitative checks (e.g., agreement between graph-derived metrics and human annotations on the same images) and sensitivity analyses when graph completeness is varied.

    Authors: We accept this point and will strengthen the validation of the metric 3D Scene Graphs. The revised manuscript will include quantitative agreement metrics between graph-derived values and human annotations on sampled images, along with sensitivity analyses that vary graph completeness and report resulting changes in model performance. These additions will confirm alignment with perceptual input and quantify any potential biases.
    revision: yes

  3. Referee: [Results and experimental setup] The reported perception-reasoning disconnect and model-type differences are presented as systematic; however, the abstract supplies no experimental details, sample sizes, statistical tests, or variance across runs. The results section must include these to establish that the observed gaps are not driven by prompt sensitivity or small test sets.

    Authors: We will expand the results section to report full experimental details, including test set sizes, statistical tests for significance, and variance across multiple runs and prompt variations. This will demonstrate that the observed perception-reasoning disconnect and model-type differences are robust. Note that abstracts conventionally omit such granular details, but the main text will be updated accordingly.
    revision: yes

Circularity Check

0 steps flagged

No circularity: empirical diagnosis relies on external oracles without self-referential reduction

The paper's central claims rest on an introduced evaluation protocol (CRISP) that applies metric 3D Scene Graphs and oracle interventions to observed model outputs. These are presented as external diagnostic tools rather than quantities derived from or fitted to the target conclusions. No equations, predictions, or uniqueness theorems reduce by construction to the paper's own inputs or prior self-citations. The distinction between proprietary and open-source models is framed as an empirical finding from the protocol, not a definitional or fitted tautology. This matches the default case of a self-contained empirical study.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach rests on the assumption that metric 3D scene graphs serve as unbiased oracles and that the intervention protocol isolates perception from reasoning without circularity.

axioms (1)
  • domain assumption: Metric 3D Scene Graphs provide accurate ground-truth spatial relations independent of model perception
    Invoked as the basis for the oracle intervention protocol in the abstract.

Reference graph

Works this paper leans on

16 extracted references · 11 canonical work pages · 8 internal anchors

  1. [1]

    LLaVA-OneVision-1.5: Fully Open Framework for Democratized Multimodal Training

    An, X., Xie, Y., Yang, K., Zhang, W., Zhao, X., Cheng, Z., Wang, Y., Xu, S., Chen, C., Zhu, D., et al.: LLaVA-OneVision-1.5: Fully open framework for democratized multimodal training. arXiv preprint arXiv:2509.23661 (2025)

  2. [2]

    Bai, S., Cai, Y., Chen, R., Chen, K., Chen, X., Cheng, Z., Deng, L., Ding, W., Gao, C., Ge, C., et al.: Qwen3-VL technical report (2025), https://arxiv.org/abs/2511.21631

  3. [3]

    Qwen2.5-VL Technical Report

    Bai, S., Chen, K., Liu, X., Wang, J., Ge, W., Song, S., Dang, K., Wang, P., Wang, S., Tang, J., et al.: Qwen2.5-VL technical report. arXiv preprint arXiv:2502.13923 (2025)

  4. [4]

    nuScenes: A Multimodal Dataset for Autonomous Driving

    Caesar, H., Bankiti, V., Lang, A.H., Vora, S., Liong, V.E., Xu, Q., Krishnan, A., Pan, Y., Baldan, G., Beijbom, O.: nuScenes: A multimodal dataset for autonomous driving. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 11621–11631 (2020)

  5. [5]

    Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

    Comanici, G., Bieber, E., Schaekermann, M., Pasupat, I., Sachdeva, N., Dhillon, I., Blistein, M., Ram, O., Zhang, D., Rosen, E., et al.: Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261 (2025)

  6. [6]

    Google: A new era of intelligence with Gemini 3 (2025), https://blog.google/products/gemini/gemini-3, accessed: 2026-02-11

  7. [7]

    Can MLLMs Reason in Multimodality? EMMA: An Enhanced Multimodal Reasoning Benchmark

    Hao, Y., Gu, J., Wang, H.W., Li, L., Yang, Z., Wang, L., Cheng, Y.: Can MLLMs reason in multimodality? EMMA: An enhanced multimodal reasoning benchmark. arXiv preprint arXiv:2501.05444 (2025)

  8. [8]

    OmniSpatial: Towards Comprehensive Spatial Reasoning Benchmark for Vision Language Models

    Jia, M., Qi, Z., Zhang, S., Zhang, W., Yu, X., He, J., Wang, H., Yi, L.: OmniSpatial: Towards comprehensive spatial reasoning benchmark for vision language models. arXiv preprint arXiv:2506.03135 (2025)

  9. [9]

    OpenAI GPT-5 System Card

    Singh, A., Fry, A., Perelman, A., Tart, A., Ganesh, A., El-Kishky, A., McLaughlin, A., Low, A., Ostrow, A., Ananthram, A., et al.: OpenAI GPT-5 system card. arXiv preprint arXiv:2601.03267 (2025)

  10. [10]

    InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

    Wang, W., Gao, Z., Gu, L., Pu, H., Cui, L., Wei, X., Liu, Z., Jing, L., Ye, S., Shao, J., et al.: InternVL3.5: Advancing open-source multimodal models in versatility, reasoning, and efficiency. arXiv preprint arXiv:2508.18265 (2025)

  11. [11]

    Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

    Wei, J., Wang, X., Schuurmans, D., Bosma, M., Xia, F., Chi, E., Le, Q.V., Zhou, D., et al.: Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems 35, 24824–24837 (2022)

  12. [12]

    Thinking in Space: How Multimodal Large Language Models See, Remember, and Recall Spaces

    Yang, J., Yang, S., Gupta, A.W., Han, R., Fei-Fei, L., Xie, S.: Thinking in space: How multimodal large language models see, remember, and recall spaces. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 10632–10643 (2025)

  13. [13]

    Cambrian-S: Towards Spatial Supersensing in Video

    Yang, S., Yang, J., Huang, P., Brown, E., Yang, Z., Yu, Y., Tong, S., Zheng, Z., Xu, Y., Wang, M., et al.: Cambrian-S: Towards spatial supersensing in video. arXiv preprint arXiv:2511.04670 (2025)

  14. [14]

    MMSI-Bench: A Benchmark for Multi-Image Spatial Intelligence

    Yang, S., Xu, R., Xie, Y., Yang, S., Li, M., Lin, J., Zhu, C., Chen, X., Duan, H., Yue, X., et al.: MMSI-Bench: A benchmark for multi-image spatial intelligence. arXiv preprint arXiv:2505.23764 (2025)

  15. [15]

    ScanNet++: A High-Fidelity Dataset of 3D Indoor Scenes

    Yeshwanth, C., Liu, Y.C., Nießner, M., Dai, A.: ScanNet++: A high-fidelity dataset of 3D indoor scenes. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 12–22 (2023)

  16. [16]

    Learning from Videos for 3D World: Enhancing MLLMs with 3D Vision Geometry Priors

    Zheng, D., Huang, S., Li, Y., Wang, L.: Learning from videos for 3D world: Enhancing MLLMs with 3D vision geometry priors. arXiv preprint arXiv:2505.24625 (2025)