pith. machine review for the scientific record.

arxiv: 2604.09364 · v2 · submitted 2026-04-10 · 💻 cs.CV · cs.CL

Recognition: unknown

Arbitration Failure, Not Perceptual Blindness: How Vision-Language Models Resolve Visual-Linguistic Conflicts


Pith reviewed 2026-05-10 17:09 UTC · model grok-4.3

classification 💻 cs.CV cs.CL
keywords vision-language models · visual-linguistic conflicts · activation patching · logit lens · multimodal arbitration · visual grounding · activation steering

The pith

Vision-language models encode visual evidence correctly even when they give wrong answers based on language priors.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that the core problem in VLMs is not failing to perceive visual details but failing to let those details override conflicting prior knowledge during answer generation. Layer-by-layer probing shows visual attributes remain linearly decodable at high accuracy from early layers whether the final output is correct or incorrect. Full-sequence activation patching then demonstrates that swapping activations from image tokens causally shifts outputs in most cases, while text tokens have little effect. This dissociation implies that improving how models arbitrate between modalities can raise grounding accuracy without retraining vision components.

Core claim

Models that answer incorrectly on visual-linguistic conflicts still encode the visual attribute as strongly as models that answer correctly, as measured by linear decodability from early layers and confirmed by full-sequence activation patching that alters 60 to 84 percent of outputs when image tokens are replaced. The decisive signal is the gap in final-layer logits between visual and prior options rather than encoding strength. Training-free steering of early-layer activations can raise visual grounding accuracy by up to 3.8 percent.

What carries the argument

Encoding-Grounding Dissociation tracked via Multimodal Arbitration Crossover analysis: visual evidence is extracted and linearly readable early regardless of final answer, while the arbitration step at later layers determines whether the evidence is used.
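
The crossover idea can be illustrated with a small Python sketch. It follows the description in the paper's Figure 2 caption (take the best-matching token form per concept at each layer, then find the first layer where the visual logit exceeds the prior logit and still does so at the next layer). The function names `max_variant` and `mac_layer` are hypothetical, and treating the final layer as ineligible (it has no next layer to confirm persistence) is an assumption of this sketch, not something the paper states.

```python
from typing import Dict, Optional, Sequence


def max_variant(logits_by_token: Dict[str, float],
                variants: Sequence[str]) -> float:
    """Max-variant rule at one layer: best logit among a concept's token forms."""
    return max(logits_by_token[v] for v in variants)


def mac_layer(visual_logits: Sequence[float],
              prior_logits: Sequence[float]) -> Optional[int]:
    """Return the first layer where visual > prior and the lead persists
    at the next layer; return None if no such layer exists.

    Assumption: the last layer cannot qualify because persistence
    cannot be checked there.
    """
    if len(visual_logits) != len(prior_logits):
        raise ValueError("need one visual and one prior logit per layer")
    for layer in range(len(visual_logits) - 1):
        leads_here = visual_logits[layer] > prior_logits[layer]
        leads_next = visual_logits[layer + 1] > prior_logits[layer + 1]
        if leads_here and leads_next:
            return layer
    return None
```

In this sketch a one-layer spike where the visual logit briefly leads does not count as a crossover, which is the role the persistence check plays in the caption's definition.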

If this is right

  • Visual attributes remain linearly decodable from early layers at AUC above 0.86 for both successful and failed samples.
  • The gap between visual and prior logits at the final layer predicts grounding success with correlation 0.847.
  • Replacing the full token sequence at MAC-identified layers changes 60 to 84 percent of outputs, with nearly all causal impact from image tokens.
  • Early-layer activation steering, linear or sparse-autoencoder-guided, raises visual grounding accuracy without degrading other capabilities in many setups.
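
The patching comparison in the bullets above can be sketched in plain Python. This is a toy illustration of position-wise patching, not the paper's implementation: hidden states are nested lists with one vector per token, and the names `patch_positions` and `flip_rate` are hypothetical. Which positions count as image tokens depends on the model's input layout and is left to the caller.

```python
from typing import List, Sequence

Hidden = List[List[float]]  # one activation vector per token position


def patch_positions(target: Hidden, donor: Hidden,
                    positions: Sequence[int]) -> Hidden:
    """Return a copy of `target` with donor vectors written at `positions`.

    Full-sequence patching passes every position; last-token patching
    passes only the final index; partial-token decomposition passes
    only the image-token (or only the text-token) indices.
    """
    if len(target) != len(donor):
        raise ValueError("both runs must have the same sequence length")
    patched = [list(vec) for vec in target]
    for pos in positions:
        patched[pos] = list(donor[pos])
    return patched


def flip_rate(baseline_outputs: Sequence[str],
              patched_outputs: Sequence[str]) -> float:
    """Fraction of samples whose output changed after patching."""
    if not baseline_outputs or len(baseline_outputs) != len(patched_outputs):
        raise ValueError("need two non-empty output lists of equal length")
    changed = sum(b != p for b, p in zip(baseline_outputs, patched_outputs))
    return changed / len(baseline_outputs)
```

Under this framing, the reported 60 to 84 percent figure corresponds to `flip_rate` computed with all positions patched, and the partial-token result corresponds to repeating the measurement with only image or only text positions.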

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same dissociation may appear in other multimodal models, suggesting arbitration-focused diagnostics could replace perception-only benchmarks.
  • Because image tokens alone carry the causal signal, interventions can target specific token positions rather than entire layers.
  • Scaling model size may reduce but not eliminate the arbitration gap, so targeted steering remains useful even for larger VLMs.

Load-bearing premise

Linear decodability of visual attributes from early layers fully captures that the model has perceived the attribute without missing nonlinear or context-dependent interactions.

What would settle it

Finding a set of visual-linguistic conflict examples where early-layer linear probes decode the visual attribute poorly on failure cases, or where full-sequence image-token patching leaves the output unchanged.
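
One way to run the first of these checks is to compute probe ROC-AUC separately on samples the model answered correctly and on samples it got wrong. The sketch below assumes probe scores are already available; the labels (1 if the altered attribute is present, 0 otherwise), the `grounded` flags, and the function names are hypothetical choices for illustration and may not match the paper's probing setup.

```python
from typing import Dict, Sequence


def roc_auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Probability that a random positive outscores a random negative
    (ties count one half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def auc_by_outcome(scores: Sequence[float], labels: Sequence[int],
                   grounded: Sequence[bool]) -> Dict[str, float]:
    """Probe AUC on grounded ("success") vs. ungrounded ("failure") samples."""
    result = {}
    for name, flag in (("success", True), ("failure", False)):
        idx = [i for i, g in enumerate(grounded) if g == flag]
        result[name] = roc_auc([scores[i] for i in idx],
                               [labels[i] for i in idx])
    return result
```

A failure-subset AUC well below the success-subset AUC would count against the dissociation claim; similar values on both subsets would be consistent with it.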

Figures

Figures reproduced from arXiv: 2604.09364 by Farhad Nooralahzadeh, Jonathan Fürst, Kurt Stockinger, Omid Rohanian, Yi Zhang.

Figure 1: When shown a blue banana, VLMs correctly …
Figure 2: Left: concrete banana example with the max-variant rule: at each layer, we take the best-matching blue and yellow token forms, then define MAC as the first stable layer where visual (v) exceeds prior (p), verified by persistence at the next layer. Right: MAC logit trajectories (color, 493 samples). Each subplot shows per-layer visual (blue) and prior (orange) logits with sample traces and mean ±1 std shadi…
Figure 3: Full-sequence activation patching. (a) Run the VLM on a counterfactual image (baseline: “blue”), then inject probe-layer states $h^{(\ell^*)}_{\text{std}}$ from a standard-image run. If the output flips to “yellow,” the probe layer causally mediates grounding. (b) Last-token patching fails; full-sequence patching succeeds, reflecting that VLMs spread visual information across all tokens. What predicts success? Correlating…
Figure 4: MAC logit trajectories for all models (color, 493 samples). Each subplot shows …
Figure 5: Distribution of per-sample MAC layers across seven primary models (Color, 493 …
Figure 6: Fine-grained visual encoding trajectory: L2 distance (counterfactual vs. standard) …
Figure 7: Visual encoding heatmap: mean L2 distance across all depth points and seven …
Figure 8: Cross-stage correlation analysis (n = 7 primary models). (a) Encoding strength (L2 at 75% depth) shows no correlation with success (ρ = 0.198). (b) Logit gap is strongly predictive (ρ = 0.847, p = 0.016). (c) Visual rank is also predictive (ρ = −0.893). (d) Encoding does not predict logit gap (ρ = 0.464). (e) Sample-level ROC-AUC near chance (0.528). (f) Arbitration-phase metrics predict success; encoding…
Figure 9: Encoding strength in success vs. failure cases (L2 distance at 75% MAC depth, …
Figure 10: Activation patching case studies. Each row shows a counterfactual image (altered …
Figure 11: Behavioral shift: baseline (BL) vs. patched (PT) output distributions. Blue: follows …
original abstract

When a Vision-Language Model (VLM) sees a blue banana and answers "yellow", is the problem of perception or arbitration? We explore the question in ten VLMs with various sizes and reveal an Encoding-Grounding Dissociation: models that fail to report what they see (and thus provide a wrong answer) still encode the visual evidence as strongly as models that provide the correct answer. Using Multimodal Arbitration Crossover (MAC) analysis with layer-by-layer Logit Lens probing, we track the competition between visual and prior signals across every layer of each model. We show that visual attributes can be linearly decodable from early layers (AUC > 0.86). The accuracy remains nearly identical for both successful and failed samples. However, the gap in the final-layer logit - not the strength of encoding - better predicts grounding outcomes with a correlation of $\rho=$ 0.847. After having studied when VLMs base their answers on image clues rather than prior knowledge, we want to understand the causal relationships. We establish causality through full-sequence activation patching. The standard last-token interventions in LLM interpretability do not affect VLMs. In contrast, replacing the full token sequence at layers identified by MAC alters 60 to 84% of outputs. Partial-token decomposition shows that image tokens carry almost all of the causal impact, while text tokens have none. Scaling addresses the remaining architectural differences to achieve perfect retention. Moving from diagnosis to intervention, we show that training-free activation steering - both linear and sparse autoencoder-guided - in early layers can improve visual grounding by up to +3.8% with degrading performance in some setups. Overall, these findings lead to a clear conclusion: VLMs already see well, but the challenge is acting on what they see. Targeted interventions can help to bridge this gap.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that VLM failures on visual-linguistic conflicts (e.g., describing a blue banana as yellow) reflect arbitration failures between visual and prior signals rather than perceptual blindness. Across ten VLMs, it reports an Encoding-Grounding Dissociation: visual attributes remain linearly decodable from early layers (AUC > 0.86) with nearly identical accuracy on success versus failure samples; the final-layer logit gap predicts outcomes (ρ = 0.847); full-sequence activation patching alters 60-84% of outputs with image tokens carrying the causal effect; and training-free linear or SAE-guided steering in early layers yields up to +3.8% grounding improvement.

Significance. If the empirical results hold, the work would usefully redirect VLM research from perception-centric fixes toward arbitration mechanisms, supported by causal interventions and practical steering. Strengths include evaluation on ten models, layer-wise MAC/Logit Lens tracking, quantitative correlations, and reproducible patching/steering protocols that demonstrate falsifiable effects.

major comments (2)
  1. [Encoding-Grounding Dissociation and layer-by-layer Logit Lens probing] The central claim that VLMs 'already see well' rests on linear decodability (AUC > 0.86) being equivalent for success and failure samples in the MAC analysis. However, linear probes do not establish that nonlinear or higher-order combinations of visual attributes are integrated equivalently; systematic differences in later-layer processing could mean the signal is present but not arbitrated correctly, so patching may be overriding rather than restoring an intact representation.
  2. [Causal analysis via activation patching] Full-sequence patching is shown to be necessary because last-token interventions fail, yet the manuscript does not quantify whether the patched activations preserve the original perceptual trajectory or introduce a new signal that bypasses whatever nonlinear suppression occurred in failure cases.
minor comments (2)
  1. [Abstract and experimental setup] The abstract and methods sections should explicitly list all controls for sample selection, layer identification in MAC, and post-hoc choices to address potential concerns about unstated decisions affecting the reported AUC and ρ values.
  2. [Model and intervention details] Clarify the exact architectures and sizes of the ten VLMs and provide the precise definition of 'full token sequence' used in patching to facilitate replication.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback, which highlights important nuances in interpreting our results on visual-linguistic arbitration in VLMs. We address each major comment below and have revised the manuscript accordingly to strengthen the claims and clarify limitations.

point-by-point responses
  1. Referee: [Encoding-Grounding Dissociation and layer-by-layer Logit Lens probing] The central claim that VLMs 'already see well' rests on linear decodability (AUC > 0.86) being equivalent for success and failure samples in the MAC analysis. However, linear probes do not establish that nonlinear or higher-order combinations of visual attributes are integrated equivalently; systematic differences in later-layer processing could mean the signal is present but not arbitrated correctly, so patching may be overriding rather than restoring an intact representation.

    Authors: We agree that linear probes capture only a subset of representational structure and do not rule out differences in nonlinear integration across layers. Our central evidence for the Encoding-Grounding Dissociation, however, is not limited to the AUC equivalence: the final-layer logit gap predicts grounding outcomes with ρ = 0.847 across models, and full-sequence activation patching from successful trajectories alters 60–84% of failure outputs, with image tokens carrying nearly all causal effect. These convergent lines of evidence (decodability, logit-gap correlation, and causal intervention) collectively indicate that visual attributes remain available but are not properly arbitrated. We have added an explicit limitations paragraph acknowledging that higher-order or nonlinear interactions are not exhaustively tested and that future work could employ nonlinear probes or circuit-level analyses. revision: partial

  2. Referee: [Causal analysis via activation patching] Full-sequence patching is shown to be necessary because last-token interventions fail, yet the manuscript does not quantify whether the patched activations preserve the original perceptual trajectory or introduce a new signal that bypasses whatever nonlinear suppression occurred in failure cases.

    Authors: We acknowledge that the original manuscript did not include direct quantification of trajectory preservation. The necessity of full-sequence rather than last-token patching, together with the partial-token decomposition showing that image tokens (not text tokens) drive the effect, is consistent with restoring visual-signal propagation rather than fabricating an unrelated bypass. In the revised manuscript we have added (i) cosine-similarity and norm comparisons between original failure activations and patched activations at the intervention layers, and (ii) layer-wise logit-evolution plots contrasting failure, success, and patched trajectories. These new analyses show that patched activations move closer to successful trajectories without introducing large deviations in non-visual dimensions. revision: yes

Circularity Check

0 steps flagged

No circularity: claims rest on direct empirical measurements and interventions

full rationale

The paper presents an empirical study using linear probing (AUC values), logit-lens tracking, correlation analysis (ρ=0.847), full-sequence activation patching (60-84% output changes), and activation steering interventions. These are measured outcomes from experiments on ten VLMs, not derivations that reduce to fitted parameters, self-definitions, or self-citation chains. No equations or steps are shown that equate a 'prediction' to its own inputs by construction, and no load-bearing uniqueness theorems or ansatzes are imported via self-citation. The central claim follows from the experimental results without circular reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The analysis relies on standard mechanistic interpretability assumptions without introducing new free parameters or invented entities in the reported work.

axioms (2)
  • domain assumption: Logit lens probing and linear decodability reflect the presence of visual attribute information in model activations.
    Invoked when claiming visual attributes are encoded as strongly in failed samples based on AUC > 0.86.
  • domain assumption: Activation patching interventions isolate causal contributions without major side effects on unrelated computations.
    Used to establish that image tokens carry the causal impact for grounding outcomes.

pith-pipeline@v0.9.0 · 5650 in / 1433 out tokens · 70767 ms · 2026-05-10T17:09:28.452516+00:00 · methodology


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Senses Wide Shut: A Representation-Action Gap in Omnimodal LLMs

    cs.AI 2026-05 unverdicted novelty 7.0

    Omnimodal LLMs encode premise-perception mismatches in hidden states yet almost never reject false textual claims, exposing a representation-action gap that is modality-asymmetric and prompt-resistant.

Reference graph

Works this paper leans on

6 extracted references · 1 canonical work pages · cited by 1 Pith paper

  1. [1]

    blue”), (2) capitalized (“Blue

    Curran Associates, Inc., 2021. URL https://proceedings.neurips.cc/paper_files/paper/2021/file/4f5c422f4d49a5a807eda27434231040-Paper.pdf. Michal Golovanevsky, William Rudman, Michael A. Lepori, Amir Bar, Ritambhara Singh, and Carsten Eickhoff. Pixels versus priors: Controlling knowledge priors in vision-language models through...

  2. [2]

    Mean-pool the hidden state across token positions: $\bar{h} = \frac{1}{T} \sum_{t} h_t$

  3. [3]

    Encode through the SAE: $z = \mathrm{ReLU}(W_{\text{enc}} \cdot \bar{h} + b_{\text{enc}})$

  4. [4]

    Construct a steering vector in feature space:

    $$z'_j = \begin{cases} z_j + \alpha_v & \text{if } j \in \mathcal{F}_{\text{visual}} \\ z_j - \alpha_p & \text{if } j \in \mathcal{F}_{\text{prior}} \\ z_j & \text{otherwise} \end{cases} \tag{7}$$

    where $\alpha_v, \alpha_p \geq 0$ are the visual and prior steering strengths, swept over {0, 1, 2, 3, 5}. All values are clamped to $\geq 0$ after modification.

  5. [5]

    Decode both the original and modified feature vectors: $\hat{h} = D(z), \quad \hat{h}' = D(z')$ (8)

  6. [6]

    Compute the delta and add it to the original hidden state at every token position: $h'_t = h_t + (\hat{h}' - \hat{h}), \quad \forall t \in \{1, \dots, T\}$ (9). Why residual application? The standard approach (Templeton et al., 2024) replaces the hidden state with the modified SAE reconstruction: $h'_t = \hat{h}'$. This discards all information the SAE did not learn to rec...
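
The extracted steps above (mean-pool, encode, steer in feature space, decode, apply the delta residually) can be put together in a short Python sketch. It is a toy illustration under stated assumptions, not the authors' code: vectors and matrices are plain nested lists, the decoder $D$ is assumed to be linear ($W_{\text{dec}} z + b_{\text{dec}}$) because the extracted text does not specify it, and every function and parameter name is hypothetical.

```python
from typing import List, Sequence, Set

Vector = List[float]
Matrix = List[List[float]]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Plain matrix-vector product."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def mean_pool(hidden: Sequence[Vector]) -> Vector:
    """Average the hidden state over token positions."""
    return [sum(col) / len(hidden) for col in zip(*hidden)]


def encode(w_enc: Matrix, b_enc: Vector, h_bar: Vector) -> Vector:
    """z = ReLU(W_enc . h_bar + b_enc)."""
    return [max(0.0, a + b) for a, b in zip(matvec(w_enc, h_bar), b_enc)]


def decode(w_dec: Matrix, b_dec: Vector, z: Vector) -> Vector:
    """Assumed linear decoder: D(z) = W_dec . z + b_dec."""
    return [a + b for a, b in zip(matvec(w_dec, z), b_dec)]


def steer_features(z: Vector, visual: Set[int], prior: Set[int],
                   alpha_v: float, alpha_p: float) -> Vector:
    """Eq. (7): raise visual features, lower prior features, clamp at 0."""
    steered = []
    for j, zj in enumerate(z):
        if j in visual:
            zj = zj + alpha_v
        elif j in prior:
            zj = zj - alpha_p
        steered.append(max(0.0, zj))
    return steered


def residual_steer(hidden: Sequence[Vector], w_enc: Matrix, b_enc: Vector,
                   w_dec: Matrix, b_dec: Vector, visual: Set[int],
                   prior: Set[int], alpha_v: float,
                   alpha_p: float) -> List[Vector]:
    """Eq. (9): add (D(z') - D(z)) to the original state at every token."""
    z = encode(w_enc, b_enc, mean_pool(hidden))
    z_mod = steer_features(z, visual, prior, alpha_v, alpha_p)
    h_hat = decode(w_dec, b_dec, z)
    h_hat_mod = decode(w_dec, b_dec, z_mod)
    delta = [a - b for a, b in zip(h_hat_mod, h_hat)]
    return [[x + d for x, d in zip(h_t, delta)] for h_t in hidden]
```

In this sketch, setting both strengths to zero gives a zero delta and returns the hidden states unchanged, whereas replacing each $h_t$ with the reconstruction would alter them even without any steering.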