pith. sign in

arxiv: 2606.05753 · v1 · pith:EHO3VINHnew · submitted 2026-06-04 · 💻 cs.CV

Cosine Misleads: Auxiliary Losses Reshape Vision Language Models, Not Their Latents

Pith reviewed 2026-06-28 03:04 UTC · model grok-4.3

classification 💻 cs.CV
keywords vision-language modelslatent visual reasoningcosine alignmentauxiliary lossesinformation bottlenecksupervised latentsPRISM diagnostics
0
0 comments X

The pith

Auxiliary cosine losses improve vision-language models by reshaping shared parameters rather than aligning their supervised latents.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Vision-language models that insert supervised latent tokens for visual reasoning rely on cosine similarity or MSE as both training loss and quality metric, under the assumption that tighter alignment produces better final answers. Testing five variants shows the assumption inverted, with cosine alignment negatively correlated to accuracy at r=-0.94. New diagnostics called PRISM use a linear probe to locate where answers are decodable and a corruption test to check whether the latent is required; both show the latents are bypassed during inference, with corruption shifting accuracy by at most four points. The answer becomes decodable only after the latent stage, and the size of this decodability gap tracks how much each variant depends on its latent. The auxiliary objective therefore changes the language model through parameters it shares with the main task, consistent with an Information Bottleneck account of the loss.

Core claim

In latent visual reasoning for VLMs the assumption that cosine alignment between supervised latents and visual targets improves answer accuracy is inverted (r=-0.94 across five variants). PRISM diagnostics show the supervised latents are largely bypassed: answers are decodable downstream of the latent but not at it, and corrupting the latents shifts accuracy by at most four points. The auxiliary objective therefore reshapes the language model via shared parameters rather than via the latent variable it nominally optimizes.

What carries the argument

PRISM, a pair of inference-time diagnostics (linear probe for decodability location and corruption test for load-bearing status) that measure whether the supervised latent carries the answer.

If this is right

  • Cosine similarity and MSE are unreliable training signals for latent visual reasoning because they do not track downstream utility.
  • Performance gains from auxiliary losses arise from changes to shared language-model parameters, not from the latents themselves.
  • The decodability gap between the latent and later layers predicts how much a given variant will rely on its latent when perturbed.
  • Information-bottleneck dynamics explain why an auxiliary objective can improve accuracy while leaving its nominal latent variable unused.
  • The same bypassing pattern holds across multiple LVR architectural variants.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Designers of other auxiliary-loss regimes in multimodal models should test whether gains come from parameter sharing rather than from the explicit intermediate representation.
  • Directly optimizing for the downstream reshaping effect could replace alignment objectives in future LVR training.
  • The result raises the question of whether similar bypassing occurs in other chain-of-thought or latent-reasoning setups that use auxiliary supervision.

Load-bearing premise

The linear probe and corruption test accurately measure whether the answer is decodable at the latent and whether the latent is load-bearing for the final answer.

What would settle it

An experiment that forces the latent to be the sole information path (for example by zeroing all other connections after the latent) and then checks whether cosine alignment then becomes positively correlated with accuracy.

Figures

Figures reproduced from arXiv: 2606.05753 by Junfeng Fang, XiuYu Zhang, Zhenkai Liang.

Figure 1
Figure 1. Figure 1: Cosine misleads. Cosine alignment between LVR-position hidden states and their teacher-forced vi￾sual targets is negatively correlated with V∗Bench accu￾racy across all five trained variants (Pearson r=−0.94). Progressive variants (P-LVR-2, P-LVR-3) reach the highest cosine but the lowest accuracy. implicit assumption behind this practice is that a better cosine means a more faithful latent, which implies … view at source ↗
Figure 2
Figure 2. Figure 2: PRISM overview. Two inference-time diagnostics for the LVR family. Axis 1 trains linear probes (Alain and Bengio, 2018) at two positions in the model: (a) the answer-decoding state the LM head reads, and (b) the feedback variable the autoregressive loop re-injects. We report the decodability gap G = accprobe(a) − accprobe(b), which summarizes how much more answer-decodable the post-latent state is than the… view at source ↗
Figure 3
Figure 3. Figure 3: The matrix of five LVR variants, grouped by their relationship to the IB objective. Top row (reconstruction-only supervision): LVR baseline, P-LVR (progressive 2/3-stage scaffolding), D-LVR (reconstruction loss removed at mid-training). All three rely on an MSE-to-low-rate target as an indirect compression mechanism, with no direct bound on I(X;Z). Bottom (IB-consistent regularizer): N-LVR adds zero-mean G… view at source ↗
Figure 4
Figure 4. Figure 4: Faithfulness corruption profiles on V ∗Bench. Change in V∗Bench accuracy (∆ acc, pp) under each corruption applied to the LVR latents at la￾tent reasoning steps, against a clean pass. Negative ∆ = own latent helps; positive = own latent hurts. design trades training simplicity for cleaner recon￾struction at the teaching stage. Cosine rewards that trade and V∗Bench punishes it. The second disso￾ciation is a… view at source ↗
Figure 5
Figure 5. Figure 5: The probe contrast predicts latent reliance. Decoding gap G against corruption response under small-noise (σ = 0.1, left) and truncation (right), across the five variants. 5.5 The probe contrast predicts latent reliance Axis 1 measures where the answer is decodable. Axis 2 assesses whether the model causally uses the latents. These are different questions, but on this matrix they are empirically tied ( [P… view at source ↗
read the original abstract

Latent visual reasoning (LVR) inserts supervised latent tokens between perception and answer generation in vision-language models (VLMs). The field uses alignment between these latents and their visual targets, i.e., cosine similarity or mean squared error (MSE), as both the training loss and the quality metric, assuming that better alignment yields a better answer. We test this with a designed matrix of five LVR variants and find the assumption inverted: cosine alignment is negatively correlated with accuracy across all five (r=-0.94). To explain this, we introduce PRISM, a pair of inference-time diagnostics: a linear probe that asks where the answer is decodable, and a corruption test that asks whether the latent is load-bearing. The supervised latents are largely bypassed. Corrupting them shifts accuracy by at most four points. The answer is decodable downstream of the latent but not at it, and the size of this decodability gap predicts how much each variant relies on its latent under perturbation. Consistent with an Information Bottleneck reading of the loss, the auxiliary objective reshapes the language model via shared parameters rather than via the latent variable it nominally optimizes.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The manuscript examines the effectiveness of cosine similarity as both a training loss and quality metric for supervised latent tokens in latent visual reasoning (LVR) setups within vision-language models. Across a designed matrix of five LVR variants, it reports a strong negative correlation (r=-0.94) between cosine alignment and downstream accuracy, inverting the standard assumption. The authors introduce PRISM (a linear probe for decodability and a corruption test for load-bearing status) to show that the supervised latents are largely bypassed: the answer is decodable downstream of the latent but not at it, corruption shifts accuracy by at most four points, and the decodability gap predicts reliance on the latent. The auxiliary objective is argued to reshape the language model via shared parameters rather than the nominal latent, consistent with an Information Bottleneck view.

Significance. If the central empirical findings hold after validation of the diagnostics, the result would be significant for VLM training practices, as it challenges reliance on alignment metrics like cosine similarity or MSE for latent quality and suggests auxiliary losses primarily act through parameter sharing. The direct measurements across variants and the introduction of inference-time tests are strengths of the empirical approach.

major comments (2)
  1. [Abstract] Abstract: The PRISM diagnostics (linear probe for decodability location and corruption test for load-bearing status) are introduced as the key evidence for the bypass claim without any reported controls, ablations, sanity checks, or validation (e.g., probe capacity, whether corruption perturbs attention/shared weights, or tests on architectures where the latent is known to be critical). This is load-bearing for the central claim that latents are bypassed and the auxiliary loss acts via shared parameters.
  2. [Abstract] Abstract and experimental matrix: The r=-0.94 correlation is reported across only five variants with no details on statistical controls, variance, or sensitivity to variant selection; this small n makes the inversion claim sensitive and requires additional robustness checks to support the conclusion that cosine alignment is anti-correlated with accuracy.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive report. The two major comments identify areas where additional validation and robustness analysis would strengthen the manuscript. We address each point below and commit to revisions that directly respond to the concerns raised.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The PRISM diagnostics (linear probe for decodability location and corruption test for load-bearing status) are introduced as the key evidence for the bypass claim without any reported controls, ablations, sanity checks, or validation (e.g., probe capacity, whether corruption perturbs attention/shared weights, or tests on architectures where the latent is known to be critical). This is load-bearing for the central claim that latents are bypassed and the auxiliary loss acts via shared parameters.

    Authors: We agree that the abstract does not enumerate explicit controls and that further validation would make the PRISM diagnostics more robust. The full manuscript already varies probe capacity implicitly through layer-wise testing and applies corruption across all five variants while monitoring downstream effects. To directly address the referee's examples, the revision will add: (i) an ablation on probe hidden dimension and regularization, (ii) attention-map comparisons before/after corruption to confirm shared weights are not inadvertently perturbed, and (iii) a discussion of why direct tests on architectures with provably critical latents are currently limited by the LVR literature. These additions will appear in a new subsection on diagnostic validation. revision: yes

  2. Referee: [Abstract] Abstract and experimental matrix: The r=-0.94 correlation is reported across only five variants with no details on statistical controls, variance, or sensitivity to variant selection; this small n makes the inversion claim sensitive and requires additional robustness checks to support the conclusion that cosine alignment is anti-correlated with accuracy.

    Authors: The five variants were deliberately constructed to span distinct supervision, alignment, and architectural choices rather than random sampling; the negative correlation is uniform across them. We acknowledge that n=5 warrants explicit sensitivity analysis. The revision will include: bootstrap confidence intervals for the correlation, per-variant accuracy variance across random seeds, and a leave-one-variant-out sensitivity check. These results will be reported in the experimental matrix section and will qualify the strength of the inversion claim. revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical measurements with direct observations

full rationale

The paper reports experimental results across five LVR variants, including a measured correlation (r=-0.94) between cosine alignment and accuracy, plus outcomes from linear probes and corruption tests. No equations, derivations, or first-principles claims are present that could reduce to inputs by construction. PRISM diagnostics are introduced as new inference-time tests whose validity is assessed via the reported empirical outcomes themselves; no self-definitional loops, fitted inputs renamed as predictions, or load-bearing self-citations appear in the provided text. The analysis is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The paper introduces PRISM as a new diagnostic pair but relies on standard machine-learning evaluation practices; no free parameters are fitted to produce the central claim, and no new physical or mathematical entities are postulated.

axioms (1)
  • domain assumption Standard assumptions about linear probes and corruption tests as valid measures of decodability and causal importance in neural networks
    The validity of PRISM rests on these common but unproven-in-this-work assumptions about what the probe and corruption reveal.
invented entities (1)
  • PRISM (linear probe + corruption test pair) no independent evidence
    purpose: To diagnose whether answers are decodable at the latent and whether the latent is load-bearing
    New inference-time diagnostics introduced to test the load-bearing status of supervised latents.

pith-pipeline@v0.9.1-grok · 5741 in / 1434 out tokens · 53412 ms · 2026-06-28T03:04:28.891421+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

21 extracted references · 5 canonical work pages · 1 internal anchor

  1. [1]

    2024 , eprint=

    An Introduction to Vision-Language Modeling , author=. 2024 , eprint=

  2. [2]

    The Fourteenth International Conference on Learning Representations , year=

    Latent Visual Reasoning , author=. The Fourteenth International Conference on Learning Representations , year=

  3. [3]

    2025 , eprint=

    Machine Mental Imagery: Empower Multimodal Reasoning with Latent Visual Tokens , author=. 2025 , eprint=

  4. [4]

    2026 , eprint=

    Interleaved Latent Visual Reasoning with Selective Perceptual Modeling , author=. 2026 , eprint=

  5. [5]

    2025 , eprint=

    Monet: Reasoning in Latent Visual Space Beyond Images and Language , author=. 2025 , eprint=

  6. [6]

    2026 , eprint=

    Vision-aligned Latent Reasoning for Multi-modal Large Language Model , author=. 2026 , eprint=

  7. [7]

    V*: Guided Visual Search as a Core Mechanism in Multimodal LLMs , year=

    Wu, Penghao and Xie, Saining , booktitle=. V*: Guided Visual Search as a Core Mechanism in Multimodal LLMs , year=

  8. [8]

    and Ma, Wei-Chiu and Krishna, Ranjay , title =

    Fu, Xingyu and Hu, Yushi and Li, Bangzheng and Feng, Yu and Wang, Haoyu and Lin, Xudong and Roth, Dan and Smith, Noah A. and Ma, Wei-Chiu and Krishna, Ranjay , title =. 2024 , isbn =. doi:10.1007/978-3-031-73337-6_9 , booktitle =

  9. [9]

    In: 2024 IEEE/CVF Conference on Com- puter Vision and Pattern Recognition (CVPR)

    Tong, Shengbang and Liu, Zhuang and Zhai, Yuexiang and Ma, Yi and LeCun, Yann and Xie, Saining , booktitle =. 2024 , volume =. doi:10.1109/CVPR52733.2024.00914 , url =

  10. [10]

    Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) , month =

    Leng, Xingjian and Singh, Jaskirat and Hou, Yunzhong and Xing, Zhenchang and Xie, Saining and Zheng, Liang , title =. Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) , month =. 2025 , pages =

  11. [11]

    2018 , eprint=

    Understanding intermediate layers using linear classifier probes , author=. 2018 , eprint=

  12. [12]

    Designing and Interpreting Probes with Control Tasks

    Hewitt, John and Liang, Percy. Designing and Interpreting Probes with Control Tasks. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). 2019. doi:10.18653/v1/D19-1275

  13. [13]

    Computational Linguistics , volume =

    Belinkov, Yonatan. Probing Classifiers: Promises, Shortcomings, and Advances. Computational Linguistics. 2022. doi:10.1162/coli_a_00422

  14. [14]

    Thirty-seventh Conference on Neural Information Processing Systems , year=

    Language Models Don't Always Say What They Think: Unfaithful Explanations in Chain-of-Thought Prompting , author=. Thirty-seventh Conference on Neural Information Processing Systems , year=

  15. [15]

    Measuring chain of thought faithfulness by unlearning reasoning steps

    Tutek, Martin and Hashemi Chaleshtori, Fateme and Marasovic, Ana and Belinkov, Yonatan. Measuring Chain of Thought Faithfulness by Unlearning Reasoning Steps. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing. 2025. doi:10.18653/v1/2025.emnlp-main.504

  16. [16]

    2000 , eprint=

    The information bottleneck method , author=. 2000 , eprint=

  17. [17]

    International Conference on Learning Representations , year=

    Deep Variational Information Bottleneck , author=. International Conference on Learning Representations , year=

  18. [18]

    Information Maximization in Noisy Channels : A Variational Approach , url =

    Barber, David and Agakov, Felix , booktitle =. Information Maximization in Noisy Channels : A Variational Approach , url =

  19. [19]

    , journal=

    Bishop, Chris M. , journal=. Training with Noise is Equivalent to Tikhonov Regularization , year=

  20. [20]

    2025 , eprint=

    Qwen2.5-VL Technical Report , author=. 2025 , eprint=

  21. [21]

    The Thirty-eighth Conference on Neural Information Processing Systems Datasets and Benchmarks Track , year=

    Visual CoT: Advancing Multi-Modal Language Models with a Comprehensive Dataset and Benchmark for Chain-of-Thought Reasoning , author=. The Thirty-eighth Conference on Neural Information Processing Systems Datasets and Benchmarks Track , year=