System-Mediated Attention Imbalances Make Vision-Language Models Say Yes

Anisha Saha; Michael Hahn; Tsan Tsai Chan; Varsha Suresh; Vera Demberg

arxiv: 2601.12430 · v2 · submitted 2026-01-18 · 💻 cs.CL

Pith reviewed 2026-05-16 13:03 UTC · model grok-4.3

keywords: vision-language models, yes-bias, hallucination, attention allocation, system prompt, multimodal imbalance, attention redistribution

The pith

Redistributing attention from the system prompt to image and text inputs suppresses vision-language models' indiscriminate yes responses.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that vision-language models often answer yes to questions regardless of the actual content because too much attention is diverted to the system prompt at the expense of image and text inputs. This imbalance is traced to redundant system weights that function as a default pull on attention. When attention is causally shifted away from the system modality toward the inputs, the yes-bias drops substantially and often beats prior mitigation techniques. The same shift also reduces reliance on coarse representations that work for easy tasks but fail on precise ones. This frames hallucination as a modality-balance problem rather than solely a visual-interpretation issue.

Core claim

Vision-language models display a yes-bias because functionally redundant system weights create attention imbalances that reduce focus on image and textual inputs. Causally redistributing attention from the system modality to those inputs substantially suppresses the bias, frequently outperforming existing approaches. The imbalances further promote default use of coarse input representations that suffice for some tasks but prove ill-suited for others.
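The redistribution idea can be illustrated with a small sketch. This page does not give the paper's actual procedure, so everything here is an assumption: the function name `redistribute_from_system`, the `keep` parameter, and the choice to operate on one already-normalised attention row and hand the freed mass to non-system tokens in proportion to their existing weights.

```python
from typing import List, Sequence


def redistribute_from_system(
    weights: Sequence[float],
    is_system: Sequence[bool],
    keep: float = 0.5,
) -> List[float]:
    """Shrink system-token attention and give the freed mass to other tokens.

    weights   -- one attention row (assumed to sum to 1)
    is_system -- True where the key token belongs to the system prompt
    keep      -- fraction of system attention to retain (hypothetical knob)
    """
    if len(weights) != len(is_system):
        raise ValueError("weights and is_system must have the same length")
    if not 0.0 <= keep <= 1.0:
        raise ValueError("keep must lie in [0, 1]")

    system_mass = sum(w for w, s in zip(weights, is_system) if s)
    other_mass = sum(w for w, s in zip(weights, is_system) if not s)
    if other_mass == 0.0:
        # Nothing to receive the freed mass; leave the row unchanged.
        return list(weights)

    freed = system_mass * (1.0 - keep)
    # Scaling every non-system weight by the same factor preserves their
    # relative proportions and keeps the row summing to its original total.
    boost = (other_mass + freed) / other_mass
    return [w * keep if s else w * boost for w, s in zip(weights, is_system)]


row = [0.6, 0.1, 0.2, 0.1]          # first token is the system prompt
mask = [True, False, False, False]
print(redistribute_from_system(row, mask, keep=0.5))
```

Because the non-system tokens are scaled uniformly, this variant leaves the image-to-text ratio untouched, which is one way to separate the system effect from a change in that ratio.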

What carries the argument

System-mediated attention imbalances, produced by functionally redundant system weights that divert focus from image and text inputs.

Load-bearing premise

The yes-bias stems mainly from redundant system weights diverting attention rather than from training data or core model architecture.

What would settle it

A controlled test in which attention redistribution is applied but the yes-bias remains unchanged in models whose system weights show no redundancy would disprove the causal account.
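Deciding whether the yes-bias "remains unchanged" needs a number to compare before and after the intervention. The sketch below is one simple candidate, not the paper's metric: it takes the share of 'yes' answers the model gives and subtracts the share of questions whose correct answer is 'yes'. The names `is_yes` and `yes_bias` and the prefix-matching rule are assumptions.

```python
from typing import Sequence


def is_yes(answer: str) -> bool:
    """Treat any answer that starts with 'yes' (case-insensitive) as a yes."""
    return answer.strip().lower().startswith("yes")


def yes_bias(predictions: Sequence[str], labels: Sequence[str]) -> float:
    """Predicted yes-rate minus ground-truth yes-rate.

    0 means the model says yes as often as it should; positive values mean
    it says yes too often.
    """
    if len(predictions) != len(labels) or not predictions:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(predictions)
    predicted_rate = sum(is_yes(p) for p in predictions) / n
    gold_rate = sum(is_yes(g) for g in labels) / n
    return predicted_rate - gold_rate


# Toy answers for illustration only, not data from the paper.
print(yes_bias(["yes", "Yes.", "no", "yes"], ["yes", "no", "no", "no"]))
```

A rate difference of this kind ignores which individual questions were answered wrongly, so it is best read alongside accuracy rather than instead of it.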

Original abstract

Vision-language model (VLM) hallucination is commonly linked to imbalanced allocation of attention across input modalities: system, image and text. However, existing mitigation strategies tend towards an image-centric interpretation of these imbalances, often prioritising increased image attention while giving less consideration to the roles of the other modalities. In this study, we evaluate a more holistic, system-mediated account, which attributes these imbalances to functionally redundant system weights that reduce attention to image and textual inputs. We show that this framework offers a useful empirical perspective on the yes-bias, a common form of hallucination in which VLMs indiscriminately respond 'yes'. Causally redistributing attention from the system modality to image and textual inputs substantially suppresses this bias, often outperforming existing approaches. We further present evidence suggesting that system-mediated attention imbalances contribute to the yes-bias by encouraging a default reliance on coarse input representations, which are effective for some tasks but ill-suited to others. Taken together, these findings firmly establish system attention as a key factor in VLM hallucination and highlight its potential as a lever for mitigation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper gives a workable system-mediated account of yes-bias and shows attention redistribution can suppress it, but the intervention lacks the controls needed to rule out simpler rebalancing explanations.

The main takeaway is that yes-bias in VLMs can be reduced by moving attention away from system tokens and toward image and text inputs. The authors treat the three modalities symmetrically and argue that system weights are often redundant, pushing the model toward coarse features that produce indiscriminate yes answers. Their redistribution experiments show clear suppression of the bias and sometimes beat prior image-centric fixes, which is the useful empirical piece here. The link they draw between system attention and reliance on coarse representations is a reasonable mechanism if the supporting measurements hold up in the full results. That framing moves the discussion beyond just boosting image attention, which is a modest but concrete shift.

The soft spot is the causal claim. Redistribution via scaling or masking system attention scores could easily change relative image-text ratios or overall input emphasis at the same time, so the suppression might not be specific to system weights. The paper would be stronger with ablations that apply comparable rebalancing to non-system tokens and check whether the yes-bias reduction is unique. Without those, alternative accounts like general prompt dilution or improved feature use remain plausible. The data and methods sections need to make the intervention details and statistical controls explicit.

This is the kind of targeted reliability work that deserves referee time. People building or auditing VLMs for visual QA and agents would get practical value from the redistribution idea once the specificity is tightened. I would send it for review rather than desk reject.

Referee Report

2 major / 2 minor

Summary. The paper claims that yes-bias hallucinations in vision-language models stem from system-mediated attention imbalances driven by functionally redundant system weights that reduce attention to image and text inputs. It presents a causal intervention that redistributes attention away from the system modality, which substantially suppresses the bias and often outperforms prior methods, while also linking the imbalance to over-reliance on coarse input representations.

Significance. If the causal specificity of the attention redistribution holds after controls, the work would usefully shift emphasis from purely image-centric accounts of VLM hallucination toward system attention as a modifiable factor, potentially offering a practical and generalizable mitigation lever with implications for model robustness across tasks.

major comments (2)

[Methods (attention redistribution procedure)] The central causal claim (that suppression results specifically from reducing functionally redundant system attention) requires explicit controls showing that equivalent yes-bias reduction does not occur under non-system-targeted attention rebalancing; without these, alternative explanations such as altered image-text attention ratios remain viable.
[Experiments and Results] The outperformance claim over existing approaches lacks reported statistical details, ablation tables, or variance estimates in the results; this undermines evaluation of whether the redistribution isolates the proposed mechanism or simply improves coarse-feature utilization.
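The control requested in the first major comment can be pictured as an intervention that changes the image-to-text balance while leaving system attention exactly as it was. The following is a hypothetical sketch of such a control on a single normalised attention row; `rebalance_image_text`, the modality labels, and the `image_share` parameter are illustrative names, not anything specified in the paper.

```python
from typing import List, Sequence


def rebalance_image_text(
    weights: Sequence[float],
    modality: Sequence[str],
    image_share: float,
) -> List[float]:
    """Set the image fraction of the non-system mass; never touch system tokens.

    modality    -- per-token label: "system", "image" or "text"
    image_share -- target fraction of (image + text) mass given to image tokens
    """
    if len(weights) != len(modality):
        raise ValueError("weights and modality must have the same length")
    if not 0.0 <= image_share <= 1.0:
        raise ValueError("image_share must lie in [0, 1]")

    image_mass = sum(w for w, m in zip(weights, modality) if m == "image")
    text_mass = sum(w for w, m in zip(weights, modality) if m == "text")
    if image_mass == 0.0 or text_mass == 0.0:
        # One side has no mass to scale; return the row unchanged.
        return list(weights)

    total = image_mass + text_mass
    image_scale = (total * image_share) / image_mass
    text_scale = (total * (1.0 - image_share)) / text_mass

    scale = {"system": 1.0, "image": image_scale, "text": text_scale}
    return [w * scale[m] for w, m in zip(weights, modality)]


row = [0.6, 0.1, 0.3]
labels = ["system", "image", "text"]
print(rebalance_image_text(row, labels, image_share=0.5))
```

If yes-bias drops under this control as much as under a system-targeted intervention, the system-specific account is weakened; if it does not, the account gains support.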

minor comments (2)

[Methods] Notation for attention scores and redistribution parameters should be defined more explicitly in the methods to avoid ambiguity when comparing to prior image-centric baselines.
[Figures] Figure captions for attention visualizations could include quantitative metrics (e.g., mean attention mass per modality) to make the imbalance claims easier to verify at a glance.
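The "mean attention mass per modality" suggested for figure captions could be computed roughly as follows. This is a minimal sketch under assumptions: attention is given as a list of normalised rows (one per query position) over a shared set of key tokens, and `mean_mass_per_modality` is a made-up helper name.

```python
from collections import defaultdict
from typing import Dict, Sequence


def mean_mass_per_modality(
    rows: Sequence[Sequence[float]],
    modality: Sequence[str],
) -> Dict[str, float]:
    """Average, over attention rows, of the total weight landing on each modality."""
    if not rows:
        raise ValueError("need at least one attention row")
    totals: Dict[str, float] = defaultdict(float)
    for row in rows:
        if len(row) != len(modality):
            raise ValueError("each row must have one weight per key token")
        for weight, label in zip(row, modality):
            totals[label] += weight
    return {label: mass / len(rows) for label, mass in totals.items()}


# Toy rows for illustration only.
attention = [
    [0.6, 0.1, 0.3],
    [0.4, 0.3, 0.3],
]
print(mean_mass_per_modality(attention, ["system", "image", "text"]))
```

In a real model the same summary would typically be taken per layer and per head, since an average over all of them can hide where the imbalance sits.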

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which help clarify the causal specificity of our findings and improve the rigor of our experimental reporting. We address each major comment below and will incorporate revisions to strengthen the manuscript.

Referee: [Methods (attention redistribution procedure)] The central causal claim (that suppression results specifically from reducing functionally redundant system attention) requires explicit controls showing that equivalent yes-bias reduction does not occur under non-system-targeted attention rebalancing; without these, alternative explanations such as altered image-text attention ratios remain viable.

Authors: We agree that additional controls are necessary to isolate the role of system attention reduction. In the revised manuscript, we will include new experiments that rebalance attention exclusively between image and text modalities (without decreasing system attention) and demonstrate that these do not produce equivalent yes-bias suppression. This will rule out simple image-text ratio changes as the primary driver.

Revision: yes

Referee: [Experiments and Results] The outperformance claim over existing approaches lacks reported statistical details, ablation tables, or variance estimates in the results; this undermines evaluation of whether the redistribution isolates the proposed mechanism or simply improves coarse-feature utilization.

Authors: We acknowledge this limitation in the current reporting. The revised version will add ablation tables, per-run variance estimates (standard deviations), and statistical significance tests (e.g., paired t-tests with p-values) comparing our redistribution method against baselines. These additions will allow readers to assess whether gains are attributable to the system-attention mechanism.

Revision: yes
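For readers unfamiliar with the statistics promised in the rebuttal, the sketch below shows how a per-method standard deviation and a paired t statistic are obtained from matched runs. It uses only the Python standard library and stops at the t statistic; turning it into a p-value requires the t distribution with n - 1 degrees of freedom (for example via `scipy.stats.ttest_rel`). The numbers are placeholders chosen for easy arithmetic, not results from the paper, and `paired_t_statistic` is a hypothetical helper.

```python
import math
import statistics
from typing import Sequence


def paired_t_statistic(baseline: Sequence[float], method: Sequence[float]) -> float:
    """t statistic for paired samples: mean difference over its standard error."""
    if len(baseline) != len(method) or len(baseline) < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    diffs = [b - m for b, m in zip(baseline, method)]
    spread = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    if spread == 0.0:
        raise ValueError("all paired differences are identical; t is undefined")
    return statistics.mean(diffs) / (spread / math.sqrt(len(diffs)))


# Placeholder yes-bias scores per run (same seeds for both methods).
baseline_runs = [2.0, 4.0, 6.0]
method_runs = [1.0, 2.0, 3.0]

print("baseline std:", statistics.stdev(baseline_runs))
print("method std:  ", statistics.stdev(method_runs))
print("paired t:    ", paired_t_statistic(baseline_runs, method_runs))
```

Pairing by run matters here: it removes seed-to-seed variation shared by both methods, which an unpaired comparison would count as noise.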

Circularity Check

0 steps flagged

No significant circularity; empirical intervention is self-contained

The paper advances a system-mediated account of yes-bias in VLMs through experimental attention redistribution, with the central claim resting on observed suppression of bias after causal intervention rather than any definitional reduction or fitted quantity. No equations, parameter fits renamed as predictions, self-citation load-bearing uniqueness theorems, or ansatz smuggling appear in the abstract or described chain. The account is empirically driven, with the intervention's effects validated against existing approaches, leaving the derivation independent of its own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Abstract supplies no explicit free parameters, invented entities, or non-standard axioms; the account rests on the standard transformer attention mechanism plus the domain assumption that system weights can be functionally redundant.

axioms (1)

Domain assumption: attention weights in VLMs reflect functional redundancy of the system modality. Invoked to explain why system attention produces coarse representations and yes-bias.
