
arxiv: 2606.31257 · v1 · pith:TNI7RFIRnew · submitted 2026-06-30 · 💻 cs.CV

Decodable Is Not Grounded: A Vision-Ablation Arbiter for VLM Spatial Reasoning

Pith reviewed 2026-07-01 06:20 UTC · model grok-4.3

classification 💻 cs.CV
keywords vision-language models · spatial reasoning · grounding · linear probes · blank image · steering · axis inversion · vision ablation

The pith

A blank-image replacement shows that decodable spatial knowledge in VLMs is not necessarily grounded in vision.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper demonstrates that linear probes and steering vectors can make it appear that vision-language models ground spatial reasoning in images when they do not. By swapping the input image for a uniform gray blank, the authors isolate whether performance depends on visual input. This reveals three regimes: axes where the model truly uses the image, axes where it falls back to a learned prior regardless of the image, and axes where the model has learned an inverted mapping that produces systematically wrong answers. The distinction matters because it means many claims about what VLMs know from images may actually reflect language priors or training artifacts. Across fourteen models the pattern is consistent, with horizontal often grounded, vertical a prior, and depth inverted.

Core claim

The central claim is that the standard combination of a linear probe and a training-free steering recovery can systematically overstate visual grounding in VLMs. The one-line causal control of replacing the image with a gray blank refutes apparent grounding on some axes and exposes an inversion regime where the model deploys the decoded direction with the wrong sign, scoring below chance. This taxonomy of grounded, prior, and inverted regimes holds across models from multiple families, with the inversion appearing at larger scales within families.

What carries the argument

The blank-image arbiter, a causal control that replaces the visual input with a uniform gray field to test dependence on actual image content.
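A minimal sketch of the arbiter, assuming a hypothetical `predict(image, question)` callable that wraps the VLM and returns an answer string. The gray value 128 is an assumption; this page does not state the RGB value used. Vision dependence is computed as real-image accuracy minus gray-image accuracy, the quantity plotted on the y-axis of Figure 1.

```python
import numpy as np

def gray_blank(height, width, value=128):
    """Uniform gray RGB image; the value 128 is an assumption, not from the paper."""
    return np.full((height, width, 3), value, dtype=np.uint8)

def accuracy(predict, examples, blank=False):
    """Fraction of correct answers; `predict` is a hypothetical model wrapper."""
    correct = 0
    for image, question, answer in examples:
        if blank:
            # Swap the real image for a gray field of the same size.
            image = gray_blank(image.shape[0], image.shape[1])
        correct += predict(image, question) == answer
    return correct / len(examples)

def vision_dependence(predict, examples):
    """Real-image accuracy minus gray-image accuracy."""
    return accuracy(predict, examples) - accuracy(predict, examples, blank=True)
```

A model that ignores the image scores the same in both conditions, so its vision dependence is zero regardless of how accurate it is.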

If this is right

  • Horizontal spatial axes are grounded in vision for the models tested.
  • Vertical axes behave as image-independent priors.
  • Depth axes are inverted, with decodable information deployed in the opposite direction.
  • The complexity of correcting the inversion varies across models, from simple rotations to low-rank edits.
  • The blank-image test cleanly separates the three regimes and should serve as a standard control.
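The three regimes can be read off the sign of the real-minus-gray accuracy gap, following the Figure 1 description (grounded high, prior zero, inverted below zero). The sketch below is illustrative only: the function name and the tolerance `tol` are assumptions, since this page gives no numeric threshold.

```python
def classify_regime(acc_real, acc_gray, tol=0.05):
    """Label one axis from real-image and gray-image accuracy (fractions in [0, 1]).

    `tol` is a hypothetical tolerance for "no change"; the source gives no threshold.
    """
    delta = acc_real - acc_gray  # vision dependence
    if delta > tol:
        return "grounded"  # behavior improves when the image is present
    if delta < -tol:
        return "inverted"  # the image pushes behavior the wrong way
    return "prior"         # behavior does not depend on the image
```

In practice the tolerance would need to reflect sampling noise for the evaluation set size, for example via a confidence interval on each accuracy.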

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Applying the same blank-image control to other reasoning tasks could reveal similar hidden priors or inversions.
  • Training data biases may systematically produce inverted mappings on certain geometric dimensions.
  • Future steering methods might need to include sign checks derived from ablation tests.
  • The per-model spectrum of correction complexity suggests that some VLMs have more distributed representations of spatial inversions.

Load-bearing premise

Replacing the image with a gray blank cleanly isolates whether the model's spatial decisions depend on visual content rather than introducing new processing artifacts from uniform inputs.

What would settle it

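Under the taxonomy, replacing the image with a gray blank should lower accuracy on a grounded axis, leave it roughly unchanged on a prior axis, and raise it on an inverted axis. An axis whose real-versus-blank pattern contradicts its assigned regime, for example a supposedly grounded axis whose accuracy does not drop, would falsify the separation into three regimes.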

Figures

Figures reproduced from arXiv: 2606.31257 by Chih-Ting Liao, Fei Shen, Tat-Seng Chua, Xin Cao.

Figure 1: Decodability does not predict whether behavior depends on the image. Each point is a (model, axis) pair across 8 VLMs: x is probe decoding accuracy, y is vision-dependence (real-image − gray-image behavior). Probes decode every axis well (x>70%), but behavior splits into three regimes probing cannot see: horizontal grounded (high), vertical a prior (zero), depth inverted (below zero). The arbiter recovers …
Figure 2: The recovery is prior amplification, not visual deployment. Training-free projection along the localized vertical direction lifts vertical from 58.5 to 79 at γ = 2 with horizontal intact. But the same projection with a gray-blank image (dashed) gains just as much (+19.1 vs. +20.5): the lever amplifies a non-visual prior. An off-band projection (L2–6, dotted) does nothing. Percentages. …
Figure 3: The grounding taxonomy as a cross-model card. Real-image accuracy on the three ViewSpatial axes for 14 VLMs (six LM families, 2B–27B), each cell colored by its arbiter regime. At a glance: horizontal grounded, vertical a prior, depth inverted in every capable model, with depth inversion scale-emergent within families (InternVL3 38 → 31 → 28; Gemma non-monotone 53 → 43 → 46); SmolVLM-2.2B is a capability floor. …
Figure 4: The decode≠deploy inversion is a population phenomenon (left); which minimal edit re-deploys it reads out each model’s geometry (right). Left: signed steering slope of decodability (a_decode, blue) and behaviour (a_deploy, vermillion) along depth; decodability rises everywhere while behaviour moves the opposite way in 7/8 models across five LM families (exception: the not-yet-inverted Qwen2.5-VL-3B). Right: …
Figure 5: The inverted depth channel is causal and localized across the population, with a one-sided directional signature. Seven inverted models. Left: the probe-decode-peak (blue) and causal-controllability-peak (vermillion) layers both sit mid-late, never early (shaded). Right: causal controllability is strongly asymmetric, the ctrl(−) steer (bar) recovers correct depth while ctrl(+) (tick) does almost nothing, i…
Figure 6: The decodability–behavior dissociation that invites a wrong conclusion. Peak probe-decode accuracy (blue) vs. actual binary behavior (teal). Horizontal is decoded and deployed (gap +4). Vertical is the most decodable axis yet sits near chance behaviorally (gap +36). Depth is decodable but behavior is below chance (gap +47, anti-correlated). Read alone, this looks like “knowledge present, not deployed” and …
Figure 7: The axis signal is assembled LM-side, not in raw image tokens (Qwen2.5-VL-7B, five-fold CV probe). Per-layer decode accuracy of the within-axis pole from three loci: decision token, mean-pooled text tokens, and mean-pooled image tokens (grey). On all three axes the image-pool trace stays flat near its shuffle floor (dashed) while the decision/text traces rise sharply at L16–24: the direction is composed in…
Figure 8: All three axes are linearly decodable; vertical most of all. Five-fold cross-validated logistic-probe accuracy for the two within-axis poles, read from the decision-token residual stream at each layer (PCA-50 features). Decoding rises sharply at L16–24 for every axis. Vertical peaks highest (94 at L23), above horizontal (85); depth reaches 77. The dotted band is the per-axis shuffled-label floor (55–63). …
Figure 9: The deployment locus is a distributed L20–24 path, and vertical necessity is localized there (S4). Left: corrective-injection gain peaks at L21–22 for all axes, matching the decode onset ( …
Figure 10: Layer-wise decode and causal controllability (Qwen3-VL-8B depth). Probe-decoding accuracy (blue) climbs from chance in the early layers to a peak at L26, while one-sided causal controllability (vermillion, β=4) is flat and near-zero early, then spikes to 98.7% at L24 before falling back to ∼35–39% in the later layers. The narrow controllability spike at the decode-onset region, strongly asymmetric, recover…
Figure 11: Causal control is clean and monotone on all three axes (S3). Probability of the positive pole as the per-axis diff-of-means direction is added at the decision token (full n=1047, Qwen2.5-VL-7B). Horizontal and vertical increase with α (auto-detected readout sign +1); depth decreases (sign −1), the steering signature of an inverted readout. Each axis is causally steerable, which, like decodability ( …
Figure 12: The depth inversion is a clean low-dimensional rotation in Qwen3-VL-8B. Schematic of the depth readout plane. Left: the model’s readout places true-“closer” items on the “farther” side of the decision boundary (and vice-versa), a systematic sign inversion of an otherwise well-separated representation (accuracy 21.4, well below chance). Right: a single training-free, norm-preserving rotation by θ≈π re-align…
Figure 13: The three-regime taxonomy is architecture-general (10 of 14 models shown; full set in …
Figure 14: Depth inversion emerges with LM scale within families, but the size–severity relation is family-specific and not strictly monotone. InternVL3 (2B 38.1 → 8B 30.9 → 14B 28.2) and Qwen2.5-VL (3B 47.9 → 7B 31.6) deepen monotonically with size, whereas Gemma-3 (4B 52.6 → 12B 43 → 27B 45.6) emerges then partially recovers at 27B, a clean within-family non-monotonicity. Onset differs by family (InternVL3 already…
Figure 15: Grounding is task-type-specific: the depth inversion is ViewSpatial-camera-egocentric, not a general depth failure (Qwen2.5-VL-7B; depth on What’sUp-B front/behind and 3DSRBench closer-to-camera, vertical on What’sUp-A on/under and 3DSRBench height). The same model that inverts depth on ViewSpatial’s camera-relative front/back (30.7) is grounded-correct on near-field front/behind (99.5) and metric closer-…
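The Figure 11 caption describes adding a per-axis diff-of-means direction at the decision token, scaled by α. A minimal NumPy sketch of that operation follows. The function names are hypothetical, the inputs stand in for hidden states that would be extracted from the model, and normalizing the direction to unit length is an assumption made here for convenience.

```python
import numpy as np

def diff_of_means_direction(h_pos, h_neg):
    """Unit vector from the negative-pole mean to the positive-pole mean.

    h_pos, h_neg: arrays of shape (n_examples, d) holding decision-token
    hidden states for the two poles of one axis (hypothetical inputs).
    Unit normalization is an assumption of this sketch.
    """
    direction = h_pos.mean(axis=0) - h_neg.mean(axis=0)
    return direction / np.linalg.norm(direction)

def steer(h, direction, alpha):
    """Add alpha times the direction to one hidden state of shape (d,)."""
    return h + alpha * direction
```

With a readout sign of −1, as the caption reports for depth, increasing α along this direction lowers the probability of the positive pole instead of raising it.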
Original abstract

The standard way to read latent knowledge out of a model, a linear probe confirmed by a steering recovery, can systematically overstate what a vision-language model (VLM) actually grounds in the image. We show this on spatial reasoning, where the error is invisible to both probing and steering yet exposed by a one-line causal control: replacing the image with a gray blank. Probes decode the within-axis answer at 73–97% across axes, and a training-free projection lifts a near-chance axis from 59% to 79%, exactly the signature of unlocking latent knowledge. The blank-image arbiter refutes it, revealing three grounding regimes that probing conflates: an axis can be grounded (vision-dependent, correct), a prior (vision-independent, with its decode and its apparent recovery a directional default rather than perception), or, surprisingly, inverted: decodable, causally controllable, but deployed with the wrong sign, so the model scores below chance and the error requires looking. The taxonomy holds across the studied VLMs: in fourteen models spanning six language-model families and 2B–27B, horizontal is grounded, vertical is a prior, and depth is inverted, with the inversion emerging at scale within families. The decode-versus-deploy inversion replicates on seven of eight models across five families, and the minimal edit that re-deploys it varies with geometry: a training-free rotation matches a trained edit on the cleanest model, while distributed inversions need a trained low-rank edit, tracing a per-model correction-complexity spectrum. The cheap, self-calibrating arbiter cleanly separates grounded perception, inverted perception, and prior substitution; we argue it should be a default control for latent-knowledge and steering claims in VLMs.
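The training-free rotation mentioned in the abstract is described in the Figure 12 caption as a norm-preserving rotation by θ≈π in the depth readout plane. The sketch below shows one way such a rotation could be written. How the plane is chosen is not stated on this page, so the orthonormal basis vectors `u` and `v` are hypothetical inputs.

```python
import numpy as np

def rotate_in_plane(h, u, v, theta):
    """Rotate h by theta inside span{u, v}; u and v must be orthonormal.

    The component of h orthogonal to the plane is left untouched, so the
    norm of h is preserved.
    """
    a, b = h @ u, h @ v          # coordinates of h inside the plane
    rest = h - a * u - b * v     # component outside the plane
    a_rot = a * np.cos(theta) - b * np.sin(theta)
    b_rot = a * np.sin(theta) + b * np.cos(theta)
    return rest + a_rot * u + b_rot * v
```

At θ = π both in-plane coordinates change sign, which is the geometric counterpart of flipping an inverted readout while leaving the rest of the representation alone.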

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that standard linear probing and steering recovery in VLMs overstate visual grounding for spatial reasoning tasks. Using a blank-image (gray) ablation as a causal control, it identifies three regimes—grounded (vision-dependent and correct), prior (vision-independent directional default), and inverted (decodable and steerable but with wrong sign, yielding below-chance performance)—that probing conflates. Horizontal axes are grounded, vertical are priors, and depth is inverted (emerging at scale); the pattern replicates across 14 models in 6 families (2B–27B). A training-free rotation or low-rank edit can correct inversions, with complexity varying by model. The blank-image arbiter is proposed as a default control for grounding claims.

Significance. If the blank-image control validly isolates vision dependence, the result supplies a cheap, self-calibrating diagnostic that separates true perceptual grounding from language priors and sign inversions, directly challenging reliance on probe accuracy or steering recovery alone. The broad evaluation across model families and scales, plus the geometry-specific correction spectrum, strengthens the practical takeaway. The work credits the external control for exposing patterns invisible to internal methods and offers falsifiable per-axis predictions.

major comments (2)
  1. [Methods (blank-image control)] Methods / blank-image arbiter description: The central taxonomy (grounded vs. prior vs. inverted) is defined solely by comparing real-image vs. gray-blank performance. No validation is reported that uniform gray inputs do not themselves trigger model-specific attention shifts, logit biases, or out-of-distribution defaults; if they do, the 'prior' and 'inverted' labels become artifacts of the control rather than evidence about grounding on actual images. This directly undermines the claim that the arbiter cleanly separates the three regimes.
  2. [Results (depth inversion)] Results (cross-model patterns, e.g., depth inversion at scale): The inversion claim for depth (and its emergence within families) rests on the gray-blank baseline being neutral. Without auxiliary controls (e.g., Gaussian noise, black images, or scrambled patches) to confirm the gray field does not systematically invert or default the depth axis, the scale-dependent pattern cannot be attributed to grounding failure rather than control artifact.
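The auxiliary controls named in comment 2 could be generated along the following lines. This is an illustrative sketch, not the authors' procedure: the gray level, the noise mean and standard deviation, and the patch size are all assumed values, and the function name is hypothetical.

```python
import numpy as np

def control_images(image, patch=2, seed=0):
    """Return control inputs with the same shape as `image` (H, W, 3, uint8).

    Gray level 128, noise mean/std, and patch size are illustrative choices.
    H and W must be divisible by `patch` for the scrambled control.
    """
    rng = np.random.default_rng(seed)
    h, w, c = image.shape
    gray = np.full_like(image, 128)
    black = np.zeros_like(image)
    noise = np.clip(rng.normal(128, 40, image.shape), 0, 255).astype(np.uint8)
    # Cut the image into patch x patch tiles, shuffle the tiles, and reassemble.
    tiles = image.reshape(h // patch, patch, w // patch, patch, c).swapaxes(1, 2)
    flat = tiles.reshape(-1, patch, patch, c)
    flat = flat[rng.permutation(len(flat))]
    scrambled = (flat.reshape(h // patch, w // patch, patch, patch, c)
                     .swapaxes(1, 2).reshape(h, w, c))
    return {"gray": gray, "black": black, "noise": noise, "scrambled": scrambled}
```

Unlike the other three, the scrambled control keeps the original pixel statistics and destroys only the spatial layout, so it tests a different question from the content-free controls.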
minor comments (2)
  1. [Abstract / Results] Abstract and §4: The reported accuracy ranges (73–97%, 59% to 79%) would benefit from per-model, per-axis tables with confidence intervals or statistical tests to support the 'exactly the signature' claim for steering recovery.
  2. [Figures] Figure captions: Clarify whether the gray blank is a constant RGB value or sampled; this affects reproducibility of the control.
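For the confidence intervals requested in minor comment 1, one standard option is the Wilson score interval for a binomial proportion. The sketch below uses only the standard library; it is offered as one reasonable choice, not as the statistic the authors use.

```python
import math

def wilson_interval(correct, total, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    p = correct / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return center - half, center + half
```

On a binary task, an axis could then be called below chance only when the upper bound of its interval falls under 0.5.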

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and for identifying the need for stronger validation of the blank-image control. We agree that additional auxiliary controls would increase confidence in the taxonomy and will incorporate them in the revision.

  1. Referee: Methods / blank-image arbiter description: The central taxonomy (grounded vs. prior vs. inverted) is defined solely by comparing real-image vs. gray-blank performance. No validation is reported that uniform gray inputs do not themselves trigger model-specific attention shifts, logit biases, or out-of-distribution defaults; if they do, the 'prior' and 'inverted' labels become artifacts of the control rather than evidence about grounding on actual images. This directly undermines the claim that the arbiter cleanly separates the three regimes.

    Authors: We acknowledge that the manuscript reports no explicit comparison of the gray blank against other neutral inputs. In the revised manuscript we will add experiments replacing the image with Gaussian noise and with solid black fields, reporting probe accuracies and steering recovery under each. If the grounded/prior/inverted classification remains stable across these controls, this will confirm that the taxonomy reflects model behavior on real images rather than a gray-specific artifact. revision: yes

  2. Referee: Results (cross-model patterns, e.g., depth inversion at scale): The inversion claim for depth (and its emergence within families) rests on the gray-blank baseline being neutral. Without auxiliary controls (e.g., Gaussian noise, black images, or scrambled patches) to confirm the gray field does not systematically invert or default the depth axis, the scale-dependent pattern cannot be attributed to grounding failure rather than control artifact.

    Authors: We agree that the depth-inversion result would be more robust with the suggested auxiliary controls. The revision will include per-axis probe and steering results under Gaussian noise and black-image baselines, with particular attention to whether the below-chance depth performance and its scale dependence persist. This will allow us to attribute the inversion to the models rather than to the choice of gray field. revision: yes

Circularity Check

0 steps flagged

No circularity; empirical control defines taxonomy independently


The paper's central result is an empirical taxonomy (grounded / prior / inverted) obtained by direct performance comparison between real images and gray-blank ablations across 14 models. This is an external causal intervention, not a quantity fitted to data and then renamed as a prediction, nor any self-definitional mapping, nor a load-bearing self-citation. No equations, ansatzes, or uniqueness theorems are invoked; the three regimes are observed outcomes of the control experiment itself. The derivation chain therefore contains no reduction to its own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No mathematical derivations, free parameters, or new postulated entities appear in the abstract; the work rests on empirical ablation results.



Reference graph

Works this paper leans on

45 extracted references · 31 canonical work pages · 14 internal anchors

  [1] Aishwarya Agrawal, Dhruv Batra, Devi Parikh, and Aniruddha Kembhavi. Don’t just assume; look and answer: Overcoming priors for visual question answering. In CVPR, 2018.
  [2] Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, et al. Qwen3-VL technical report. arXiv preprint arXiv:2511.21631, 2025.
  [3] Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Junyang Lin. Qwen2.5-VL technical report....
  [4] Sanjay Basu et al. Interpretability without actionability: Mechanistic methods cannot correct language model errors despite near-perfect internal representations. arXiv preprint arXiv:2603.18353, 2026.
  [5] Yonatan Belinkov. Probing classifiers: Promises, shortcomings, and advances. Computational Linguistics, 48(1):207–219, 2022.
  [6] Wenxiao Cai et al. Spatialbot: Precise spatial understanding with vision language models. arXiv preprint arXiv:2406.13642, 2024.
  [7] Boyuan Chen et al. Spatialvlm: Endowing vision-language models with spatial reasoning capabilities. In CVPR, 2024.
  [8] Erik Daxberger et al. Mm-spatial: Exploring 3d spatial understanding in multimodal llms. In ICCV, 2025. arXiv:2503.13111.
  [9] Mengfei Du et al. Embspatial-bench: Benchmarking spatial understanding for embodied tasks with large vision-language models. arXiv preprint arXiv:2406.05756, 2024.
  [10] Wei Gao et al. Uncovering and shaping the latent representation of 3d scene topology in vision-language models. arXiv preprint arXiv:2605.07148, 2026.
  [11] Gemma Team, Aishwarya Kamath, Johan Ferret, Shreya Pathak, Nino Vieillard, et al. Gemma 3 technical report. arXiv preprint arXiv:2503.19786, 2025.
  [12] Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the V in VQA matter: Elevating the role of image understanding in visual question answering. In CVPR, 2017.
  [13] Wes Gurnee and Max Tegmark. Language models represent space and time. In ICLR, 2024.
  [14] John Hewitt and Percy Liang. Designing and interpreting probes with control tasks. In EMNLP, 2019.
  [15] Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. In International Conference on Learning Representations (ICLR), 2022. arXiv:2106.09685.
  [16] Mengdi Jia, Zekun Qi, Shaochen Zhang, Wenyao Zhang, Xinqiang Yu, Jiawei He, He Wang, and Li Yi. Omnispatial: Towards comprehensive spatial reasoning benchmark for vision language models. In ICLR, 2026. arXiv:2506.03135.
  [17] Amita Kamath, Jack Hessel, and Kai-Wei Chang. What’s “up” with vision-language models? investigating their struggle with spatial reasoning. In EMNLP, 2023. arXiv:2310.19785; benchmark commonly cited as “What’sUp”.
  [18] Raphi Kang, Hongqiao Chen, Georgia Gkioxari, and Pietro Perona. Linear mechanisms for spatiotemporal reasoning in vision language models. arXiv preprint arXiv:2601.12626, 2026.
  [19] Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Yanwei Li, Ziwei Liu, and Chunyuan Li. Llava-onevision: Easy visual task transfer. arXiv preprint arXiv:2408.03326, 2024.
  [20] Dingming Li, Hongxing Li, Zixuan Wang, Yuchen Yan, Hang Zhang, Siqi Chen, Guiyang Hou, Shengpei Jiang, Wenqi Zhang, Yongliang Shen, Weiming Lu, and Yueting Zhuang. Viewspatial-bench: Evaluating multi-perspective spatial localization in vision-language models. arXiv preprint arXiv:2505.21500, 2025.
  [21] Kenneth Li, Oam Patel, Fernanda Viégas, Hanspeter Pfister, and Martin Wattenberg. Inference-time intervention: Eliciting truthful answers from a language model. In Advances in Neural Information Processing Systems (NeurIPS), 2023. arXiv:2306.03341.
  [22] Fangyu Liu, Guy Emerson, and Nigel Collier. Visual spatial reasoning. TACL, 2023.
  [23] Yuhong Liu, Beichen Zhang, Yuhang Zang, Yuhang Cao, Long Xing, Xiaoyi Dong, Haodong Duan, Dahua Lin, and Jiaqi Wang. Spatial-ssrl: Enhancing spatial understanding via self-supervised reinforcement learning. arXiv preprint arXiv:2510.27606, 2025. CVPR 2026 accepted.
  [24] Zhining Liu, Ziyi Chen, Hui Liu, Chen Luo, Xianfeng Tang, Suhang Wang, Joy Zeng, Zhenwei Dai, Zhan Shi, Tianxin Wei, Benoit Dumoulin, and Hanghang Tong. Seeing but not believing: Probing the disconnect between visual attention and answer correctness in vlms. arXiv preprint arXiv:2510.17771, 2025.
  [25] Wufei Ma, Haoyu Chen, Guofeng Zhang, Celso M. de Melo, Alan Yuille, and Jieneng Chen. 3DSRBench: A comprehensive 3D spatial reasoning benchmark. arXiv preprint arXiv:2412.07825, 2024.
  [26] Andrés Marafioti et al. Smolvlm: Redefining small and efficient multimodal models. arXiv preprint arXiv:2504.05299, 2025.
  [27] Kevin Meng, David Bau, Alex Andonian, and Yonatan Belinkov. Locating and editing factual associations in gpt. In NeurIPS, 2022.
  [28] Cheolhong Min, Jaeyun Jung, Daeun Lee, Hyeonseong Jeon, Yu Su, Jonathan Tremblay, Chan Hee Song, and Jaesik Park. Why far looks up: Probing spatial representation in vision-language models. arXiv preprint arXiv:2605.30161, 2026.
  [29] Mistral AI. Pixtral 12b. arXiv preprint arXiv:2410.07073, 2024.
  [30] Mateusz Pach, Shyamgopal Karthik, Quentin Bouniot, Serge Belongie, and Zeynep Akata. Sparse autoencoders learn monosemantic features in vision-language models. In CVPR, 2025. arXiv:2504.02821.
  [31] Nina Panickssery, Nick Gabrieli, Julian Schulz, Meg Tong, Evan Hubinger, and Alexander Matt Turner. Steering Llama 2 via contrastive activation addition. In ACL, 2024. arXiv:2312.06681.
  [32] Nina Panickssery, Nick Gabrieli, Julian Schulz, Meg Tong, Evan Hubinger, and Alexander Matt Turner. Steering llama 2 via contrastive activation addition. In ACL, 2024. Alias for panickssery2024; preserved for citations.
  [33] Jianing Qi, Jiawei Liu, Hao Tang, and Zhigang Zhu. Beyond semantics: Rediscovering spatial awareness in vision-language models. arXiv preprint arXiv:2503.17349, 2025.
  [34] Yifu Qiu, Zheng Zhao, Yftah Ziser, Anna Korhonen, Edoardo M. Ponti, and Shay B. Cohen. Spectral editing of activations for large language model alignment. In Advances in Neural Information Processing Systems (NeurIPS), 2024. arXiv:2405.09719.
  [35] RemyxAI. SpaceQwen2.5-VL and SpaceOm: Spatial-reasoning fine-tunes of Qwen2.5-VL. Hugging Face model cards, https://huggingface.co/remyxai, 2025. VQASynth-distilled spatial-reasoning VLMs, building on SpatialVLM [7].
  [36] Daniele Savietto, Declan Campbell, André Panisson, Marco Nurisso, Giovanni Petri, Jonathan D. Cohen, and Alan Perotti. The geometry of representational failures in vision language models. arXiv preprint arXiv:2602.07025, 2026.
  [37] Matthieu Tehenan, Christian Bolivar Moya, Tenghai Long, and Guang Lin. Linear spatial world models emerge in large language models. arXiv preprint arXiv:2506.02996, 2025.
  [38] Vu and Nguyen. Angular steering: Behavior control via rotation in activation space. In Advances in Neural Information Processing Systems (NeurIPS), 2025. arXiv:2510.26243.
  [39] Kevin Wang et al. Interpretability in the wild: a circuit for indirect object identification in gpt-2 small. In ICLR, 2023.
  [40] Weixuan Wang, Jingyuan Yang, and Wei Peng. Semantics-adaptive activation intervention for LLMs via dynamic steering vectors. In International Conference on Learning Representations (ICLR), 2025. arXiv:2410.12299.
  [41] Zhengxuan Wu, Aryaman Arora, Zheng Wang, Atticus Geiger, Dan Jurafsky, Christopher D. Manning, and Christopher Potts. ReFT: Representation finetuning for language models. In NeurIPS, 2024.
  [42] Jihan Yang, Shusheng Yang, Anjali W. Gupta, Rilyn Han, Li Fei-Fei, and Saining Xie. Thinking in space: How multimodal large language models see, remember and recall spaces. In CVPR, 2025. VSI-Bench.
  [43] Sihan Yang, Runsen Xu, Yiman Xie, Sizhe Yang, Mo Li, Jingli Lin, Chenming Zhu, Xiaochen Chen, Haodong Duan, Xiangyu Yue, Dahua Lin, Tai Wang, and Jiangmiao Pang. Mmsi-bench: A benchmark for multi-image spatial intelligence. In ICLR, 2026. arXiv:2505.23764.
  [44] Jinguo Zhu, Weiyun Wang, Zhe Chen, Zhaoyang Liu, Shenglong Ye, Lixin Gu, Yuchen Duan, Hao Tian, Weijie Su, Jie Shao, et al. Internvl3: Exploring advanced training and test-time recipes for open-source multimodal models. arXiv preprint arXiv:2504.10479, 2025.
  [45] Andy Zou et al. Representation engineering: A top-down approach to ai transparency. arXiv preprint, 2023.

A IMPLEMENTATION DETAILS

Model and decoding. Qwen2.5-VL-7B-Instruct [3], loaded in 4-bit nf4 on a single RTX 4090, attn_implementation="sdpa" (flash-attention not installed), greedy decoding, max_new_tokens=8. The language tower has 28 decoder ...