
arxiv: 2511.10292 · v2 · pith:C56DMTFGnew · submitted 2025-11-13 · 💻 cs.CV · cs.AI

Adaptive Residual-Update Steering for Low-Overhead Hallucination Mitigation in Large Vision Language Models

Pith reviewed 2026-05-21 18:43 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords hallucination mitigation · vision-language models · residual updates · decoding regulation · adaptive gating · object hallucination · low-overhead inference · visual dilution

The pith

RUDDER extracts a visual evidence direction from residual updates and injects it adaptively during decoding to reduce hallucinations in large vision-language models by about a quarter while keeping throughput above 96 percent.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents RUDDER as a way to stop large vision-language models from losing track of image details and falling back on language priors that produce wrong objects. It pulls a persistent visual signal called CARD straight from the residual updates computed when the model first processes the image and prompt, then uses a Beta Gate to add this signal back into later decoding steps only when needed. This avoids the extra passes or logit comparisons that earlier fixes require. If the approach holds, models could produce more faithful image descriptions in real time without custom hardware or major speed penalties.

Core claim

RUDDER counters visual dilution in LVLMs by creating a persistent visual anchor: it extracts a robust evidence direction (CARD) directly from the model's prefill residual updates and modulates its injection into the autoregressive decoding process with an adaptive Beta Gate that acts as a trust mechanism, applying the reminder selectively to reduce over-reliance on language priors.
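The extraction half of this claim can be sketched in a few lines. The Figure 2 caption says only that per-token residual updates are aggregated "using pooling and normalization", so the sketch below assumes mean pooling followed by L2 normalization. The function name `extract_direction` and the toy inputs are hypothetical and are not the authors' code.

```python
import math


def extract_direction(residual_updates):
    """Mean-pool per-token residual updates, then L2-normalize the result.

    residual_updates: list of equal-length float lists, one per prefill token.
    Mean pooling is an assumption; the source only says "pooling and normalization".
    """
    if not residual_updates:
        raise ValueError("need at least one residual update")
    dim = len(residual_updates[0])
    count = len(residual_updates)
    # Average each coordinate over all prefill tokens.
    pooled = [sum(u[d] for u in residual_updates) / count for d in range(dim)]
    norm = math.sqrt(sum(x * x for x in pooled))
    if norm == 0.0:
        # A zero vector has no direction; return it unchanged.
        return pooled
    return [x / norm for x in pooled]


# Toy example: two 2-dimensional "residual updates" (made-up numbers).
direction = extract_direction([[3.0, 0.0], [3.0, 8.0]])
```

In a real model the updates would be read from the target layer during the single prefill pass and the resulting vector cached, which is why no extra forward pass is needed.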

What carries the argument

The robust evidence direction (CARD) extracted from prefill residual updates, selectively injected via the Beta Gate to serve as a visual reminder during text generation.
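A minimal sketch of gated injection, under stated assumptions, may make the mechanism concrete. The parameterization of the Beta Gate is not given in this summary, so the gate below is a hypothetical stand-in: a logistic function of the cosine similarity between the current hidden state and the direction, scaled by a maximum strength. The names `alpha_max` and `k` echo the "steering strength" and "gate sensitivity" mentioned in the Figure 3 caption; everything else (`cosine`, `gate_strength`, `steer`, and the choice to open the gate on high alignment) is an illustrative assumption.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0.0 or nb == 0.0:
        return 0.0
    return dot / (na * nb)


def gate_strength(hidden, direction, alpha_max, k):
    """Hypothetical stand-in gate: alpha_max * sigmoid(k * cos(hidden, direction)).

    This is NOT the paper's Beta Gate; it only illustrates a bounded,
    input-dependent injection strength in (0, alpha_max).
    """
    return alpha_max / (1.0 + math.exp(-k * cosine(hidden, direction)))


def steer(hidden, direction, alpha_max=1.0, k=4.0):
    """Add the gated direction to the hidden state (a residual-style update)."""
    g = gate_strength(hidden, direction, alpha_max, k)
    return [h + g * v for h, v in zip(hidden, direction)]


# Orthogonal case: cosine is 0, so the gate sits at alpha_max / 2.
steered = steer([1.0, 0.0], [0.0, 1.0], alpha_max=2.0, k=4.0)
```

The point of the design is that steering costs one vector addition per decoding step, with the strength varying per step rather than being fixed.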

If this is right

  • Under greedy decoding, RUDDER reduces CHAIR_S by 24.4 percent and CHAIR_i by 23.6 percent on average across the tested models (relative reductions).
  • The method maintains over 96.0 percent of original throughput on LLaVA-1.5, Idefics2, InstructBLIP, and Qwen2.5-VL.
  • It scales across multiple LVLM architectures without requiring architecture-specific changes.
  • It achieves lower latency than logit-contrast or iterative-refinement baselines for the same hallucination reduction.
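To make the two headline quantities concrete, the following sketch shows how a relative reduction and a throughput-retention ratio are computed. The input numbers are hypothetical and chosen only for easy arithmetic; they are not results from the paper.

```python
def relative_reduction(baseline, treated):
    """Fractional drop of a lower-is-better metric, relative to the baseline."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - treated) / baseline


def throughput_retained(baseline_tps, treated_tps):
    """Fraction of baseline throughput (e.g. tokens per second) that is kept."""
    if baseline_tps <= 0:
        raise ValueError("baseline throughput must be positive")
    return treated_tps / baseline_tps


# Hypothetical example values, NOT taken from the paper.
drop = relative_reduction(40.0, 30.0)        # a score falling from 40 to 30
retained = throughput_retained(100.0, 97.0)  # 97 tokens/s against 100 tokens/s
```

A relative reduction of 24.4 percent therefore means the score fell by about a quarter of its baseline value, not by 24.4 points.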

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The residual-update steering could be tested on other dilution effects such as long-context forgetting in pure language models.
  • If the Beta Gate proves reliable, similar adaptive mechanisms might reduce modality imbalance in audio-visual or video-language models.
  • Extending the CARD extraction to intermediate layers rather than only prefill could reveal whether later visual signals add further gains.

Load-bearing premise

The extracted CARD from prefill residuals reliably encodes usable visual information that the Beta Gate can apply without creating new errors or shifting the output distribution.

What would settle it

Running RUDDER on a set of images containing objects that conflict with common language priors and observing whether hallucination rates rise instead of fall, or whether the added gating reduces throughput below 90 percent of baseline.
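The criterion above can be written as a small check. This is only a sketch: the function name `settles_against` and its arguments are hypothetical, and the 0.90 floor is the "90 percent of baseline" figure from the sentence above.

```python
def settles_against(baseline_rate, steered_rate, baseline_tps, steered_tps,
                    min_retained=0.90):
    """Return True if either failure condition from the text is observed.

    baseline_rate / steered_rate: hallucination rates without and with steering
    baseline_tps / steered_tps: throughput without and with steering
    """
    hallucination_rose = steered_rate > baseline_rate
    too_slow = (steered_tps / baseline_tps) < min_retained
    return hallucination_rose or too_slow
```

Either condition alone counts against the method, so the two are combined with a logical OR.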

Figures

Figures reproduced from arXiv: 2511.10292 by Bin Li, Jiarui Guan, Pekka Marttinen, Ya Gao, Zhengtao Zou.

Figure 1
Figure 1: (Left) An example where the vanilla LLaVA-1.5–7B (Liu et al., 2024a) hallucinates objects. Erroneous text is marked in red, while RUDDER's corrected, factual output is in blue. (Right) A comparison showing that unlike existing non-steering and steering-based methods, RUDDER provides adaptive, low-overhead control without requiring extra forward passes.
Figure 2
Figure 2: The overall workflow of RUDDER. Our method operates in two stages. (1) Prefill Stage (Yellow Arrows): We extract the CARD vector $v_{\mathrm{CARD}}$ by first collecting attention-induced residual updates $\Delta^{l}_{i}$ from a target layer $l$ for each token $i$ in the prefill span. These updates are then aggregated using pooling and normalization. The final CARD vectors are cached for each (image, prompt) pair. (2) Decoding Stage (Orange…
Figure 3
Figure 3: Ablation study of RUDDER's hyperparameters on the Idefics2 (Laurençon et al., 2024) model. (a) The bar plot shows the impact of the intervention layer $L$. (b-d) The heatmaps analyze the trade-off between steering strength $\alpha_{\max}$ and gate sensitivity $k$, showing their effect on CHAIR scores and recall.
Figure 4
Figure 4: Structure in steering space (b) and its sample-wise projection to …
Figure 5
Figure 5: Directional evidence with reflowed layout. (a) Consistent …
Figure 6
Figure 6: Analysis of the internal dynamics of LLaVA-1.5 (Liu et al., 2024a). (a, c) Both the absolute …
Figure 7
Figure 7: Case study. Hallucinated contents generated by the vanilla LLaVA-1.5 are marked in …
Figure 8
Figure 8: Case study. Hallucinated contents generated by the vanilla LLaVA-1.5 are marked in …
Figure 9
Figure 9: Case study. Hallucinated contents generated by the vanilla Idefics2 are marked in …
Figure 10
Figure 10: Case study. Hallucinated contents generated by the vanilla InstructBLIP are marked in …
Figure 11
Figure 11: Ablation matrices for RUDDER on LLaVA-1.5 (Liu et al., 2024a)
Figure 12
Figure 12: Ablation matrices for RUDDER on InstructBLIP (Dai et al., 2023)
Original abstract

Large Vision-Language Models (LVLMs) typically process visual inputs as a prefix to the language decoder. As the model autoregressively generates text, this initial visual information inevitably undergoes "dilution", leading the model to over-rely on language priors and hallucinate objects. Existing interventions attempt to correct this by contrasting logits or iteratively refining outputs, but they incur prohibitive latency costs. We propose Residual-Update Directed DEcoding Regulation (RUDDER), a framework that counters visual dilution by creating a persistent visual anchor. We extract a robust evidence direction (CARD) directly from the model's prefill residual updates, and inject it into the decoding process. This injection is modulated by an adaptive gate, the Beta Gate, which acts as a trust mechanism and ensures the visual reminder is applied only when necessary. Experiments on LLaVA-1.5 (7B/13B), Idefics2, InstructBLIP, and Qwen2.5-VL demonstrate that RUDDER consistently mitigates hallucination (with greedy decoding, RUDDER reduces CHAIR_S by an average of 24.4% and CHAIR_i by 23.6% relative) and scales effectively across architectures, all while maintaining >96.0% throughput.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes Residual-Update Directed DEcoding Regulation (RUDDER) to mitigate object hallucinations in LVLMs caused by visual information dilution during autoregressive decoding. It extracts a 'robust evidence direction' (CARD) from residual updates in the prefill stage (which processes both image tokens and initial text) and injects this direction into subsequent decoding steps, with the injection strength controlled by an adaptive Beta Gate that acts as a trust mechanism. Experiments across LLaVA-1.5 (7B/13B), Idefics2, InstructBLIP, and Qwen2.5-VL report average relative CHAIR_S reductions of 24.4% and CHAIR_i reductions of 23.6% under greedy decoding, with throughput maintained above 96%.

Significance. If the central mechanism is shown to isolate and reinforce visual evidence without introducing new distribution shifts, the result would be significant: it offers a low-latency alternative to logit-contrasting or iterative-refinement methods for hallucination control, with demonstrated scaling across four distinct LVLM families and negligible throughput cost. This could improve reliability in downstream tasks such as captioning and VQA while remaining practical for deployment.

major comments (2)
  1. [§3] §3 (Method), CARD extraction paragraph: the claim that CARD 'counters language-prior dilution' rests on the assumption that prefill residuals predominantly encode separable visual evidence. Because the prefill stage mixes image tokens with the initial prompt, the manuscript must demonstrate (via equation or ablation) that CARD is not a linear combination of visual and prompt-induced language signals; without this, the injection step risks reinforcing rather than mitigating hallucinations.
  2. [Experiments] Experiments section, CHAIR results paragraph: the reported 24.4% / 23.6% relative reductions are presented without ablations that isolate the contribution of CARD versus the Beta Gate, nor any text-masked prefill control. This leaves open whether the gains are caused by the proposed residual-update steering or by incidental changes in decoding dynamics.
minor comments (2)
  1. [Abstract] The abstract and §4 would benefit from a one-sentence statement of the exact parameterization of the Beta Gate (threshold or scaling factor) so readers can assess reproducibility.
  2. [§3] Notation for residual updates and the injection operation should be introduced with a compact equation early in §3 to avoid ambiguity when the Beta Gate modulates the direction.
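
As an illustration of what minor comment 2 asks for, one possible compact form is sketched below. It is not taken from the manuscript. The hidden state $h^{l}_{t}$, the gate value $g_t$, the pooled update $\bar{\Delta}^{l}$, and the prefill index set $\mathcal{P}$ are notation assumed here; the CARD vector, the per-token residual update at layer $l$, and the maximum steering strength echo the figure captions.

```latex
% Hypothetical notation, for illustration only.
% Pooling over the prefill span P, then normalization:
\bar{\Delta}^{l} = \operatorname{pool}_{i \in \mathcal{P}} \Delta^{l}_{i},
\qquad
v_{\mathrm{CARD}} = \frac{\bar{\Delta}^{l}}{\lVert \bar{\Delta}^{l} \rVert_{2}}

% Gated injection at decoding step t, with a bounded gate value:
\tilde{h}^{l}_{t} = h^{l}_{t} + g_{t}\, v_{\mathrm{CARD}},
\qquad
0 \le g_{t} \le \alpha_{\max}
```

Written this way, the direction and the strength are visibly separate: the gate changes only the scalar $g_t$, never the vector $v_{\mathrm{CARD}}$.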

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The points raised highlight important aspects that will strengthen the clarity and rigor of our claims. We address each major comment below and commit to incorporating the necessary revisions.

Point-by-point responses
  1. Referee: [§3] §3 (Method), CARD extraction paragraph: the claim that CARD 'counters language-prior dilution' rests on the assumption that prefill residuals predominantly encode separable visual evidence. Because the prefill stage mixes image tokens with the initial prompt, the manuscript must demonstrate (via equation or ablation) that CARD is not a linear combination of visual and prompt-induced language signals; without this, the injection step risks reinforcing rather than mitigating hallucinations.

    Authors: We agree that an explicit demonstration of separability is required to substantiate the claim. In the revised manuscript we will add both a short derivation in §3 showing how the residual update is dominated by attention over image tokens (due to the prefix structure and cross-attention patterns) and a new ablation that extracts CARD from a text-only prefill (image tokens masked) versus the standard image+text prefill. The text-only variant will be injected under identical conditions; we expect substantially weaker hallucination mitigation, confirming that the visual component is the primary driver. revision: yes

  2. Referee: Experiments section, CHAIR results paragraph: the reported 24.4% / 23.6% relative reductions are presented without ablations that isolate the contribution of CARD versus the Beta Gate, nor any text-masked prefill control. This leaves open whether the gains are caused by the proposed residual-update steering or by incidental changes in decoding dynamics.

    Authors: We acknowledge the need for component-isolation experiments. The revised version will include three targeted ablations: (1) CARD injection with the Beta Gate disabled (fixed strength), (2) full RUDDER versus a no-CARD baseline, and (3) a text-masked prefill control that generates a language-only direction for injection. These will be reported alongside the main CHAIR results to show that performance gains track the presence of visual residuals rather than generic decoding modifications. revision: yes
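
The ablations promised in the two responses can be laid out as a small configuration grid. This is a sketch for bookkeeping only: the `Variant` fields and the variant names are hypothetical labels, not identifiers from the paper or its code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    name: str
    use_direction: bool  # inject a steering direction at all
    adaptive_gate: bool  # adaptive gate on (True) or fixed strength (False)
    prefill: str         # which prefill the direction is extracted from


# Hypothetical labels covering the three proposed ablations plus the full method.
ABLATIONS = [
    Variant("full RUDDER", True, True, "image+text"),
    Variant("CARD with fixed strength", True, False, "image+text"),
    Variant("no-CARD baseline", False, False, "image+text"),
    Variant("language-only direction control", True, True, "image tokens masked"),
]

# Variants that isolate the gate: same direction source, gate toggled.
gate_pair = [v for v in ABLATIONS
             if v.use_direction and v.prefill == "image+text"]
```

Comparing the two entries in `gate_pair` isolates the gate, while comparing the full method against the masked-prefill control isolates the visual content of the direction.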

Circularity Check

0 steps flagged

No significant circularity: RUDDER introduces an operational extraction-injection procedure validated empirically rather than reducing to fitted inputs or self-referential definitions.

Full rationale

The paper describes RUDDER as a framework that extracts CARD directly from prefill residual updates and modulates injection via the Beta Gate to counter visual dilution. This is presented as a practical intervention on model internals, with performance measured through experiments on CHAIR metrics across LLaVA, Idefics2, InstructBLIP, and Qwen2.5-VL. No derivation chain exists that equates a claimed prediction or first-principles result to its own inputs by construction. The method does not rely on fitting parameters to a data subset and relabeling the outcome as a prediction, nor does it import uniqueness theorems or ansatzes via self-citation in a load-bearing manner. The central claims rest on external experimental benchmarks rather than tautological re-expression of the technique's own definitions.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 2 invented entities

The method rests on the untested premise that prefill residuals contain a stable visual signal and on two newly introduced constructs (CARD and Beta Gate) whose definitions and tuning are not detailed.

free parameters (1)
  • Beta Gate threshold or scaling factor
    The adaptive gate must contain at least one tunable parameter that decides when the visual reminder is applied; its value is not reported.
axioms (1)
  • domain assumption: Prefill residual updates encode visual evidence that subsequently dilutes during autoregressive decoding.
    This premise is invoked to justify extracting CARD from the prefill stage.
invented entities (2)
  • CARD (robust evidence direction) · no independent evidence
    purpose: Persistent visual anchor extracted from prefill residuals.
    Newly defined construct whose extraction rule is not specified.
  • Beta Gate · no independent evidence
    purpose: Adaptive trust mechanism that modulates injection of the visual anchor.
    New component introduced to avoid unnecessary steering.




Reference graph

Works this paper leans on

25 extracted references · 25 canonical work pages · 12 internal anchors

  1. [1]

    Flamingo: a Visual Language Model for Few-Shot Learning

    Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katie Millican, Malcolm Reynolds, Roman Ring, Eliza Rutherford, Serkan Cabi, Tengda Han, Zhitao Gong, Sina Samangooei, Marianne Monteiro, Jacob Menick, Sebastian Borgeaud, Andrew Brock, Aida Nematzadeh, Sahand Sharifzadeh, Mikolaj Binkowski,...

  2. [2]

    DoLa: Decoding by Contrasting Layers Improves Factuality in Large Language Models

    Yung-Sung Chuang, Yujia Xie, Hongyin Luo, Yoon Kim, James Glass, and Pengcheng He. Dola: Decoding by contrasting layers improves factuality in large language models. arXiv preprint arXiv:2309.03883, 2023

  3. [3]

    InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning

    Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven Hoi. InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning, June 2023. URL http://arxiv.org/abs/2305.06500. arXiv:2305.06500 [cs]

  4. [4]

    MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models

    Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, Xu Lin, Jinrui Yang, Xiawu Zheng, Ke Li, Xing Sun, Yunsheng Wu, and Rongrong Ji. MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models, March 2024. URL http://arxiv.org/abs/2306.13394. arXiv:2306.13394 [cs]

  5. [5]

    Lexically Constrained Decoding for Sequence Generation Using Grid Beam Search

    Chris Hokamp and Qun Liu. Lexically constrained decoding for sequence generation using grid beam search, 2017. URL https://arxiv.org/abs/1704.07138

  6. [6]

    Survey of Hallucination in Natural Language Generation

    Ziwei Ji, Nayeon Lee, Rita Frieske, Tiezheng Yu, Dan Su, Yan Xu, Etsuko Ishii, Yejin Bang, Delong Chen, Wenliang Dai, Ho Shu Chan, Andrea Madotto, and Pascale Fung. Survey of Hallucination in Natural Language Generation. ACM Comput. Surv., 55(12):1–38, December 2023. ISSN 0360-0300, 1557-7341. doi:10.1145/3571730. URL http://arxiv.org/abs/2202.036...

  7. [7]

    What matters when building vision-language models?

    Hugo Laurençon, Léo Tronchon, Matthieu Cord, and Victor Sanh. What matters when building vision-language models? Advances in Neural Information Processing Systems, 37:87874–87907, 2024

  8. [8]

    Mitigating Object Hallucinations in Large Vision-Language Models through Visual Contrastive Decoding

    Sicong Leng, Hang Zhang, Guanzheng Chen, Xin Li, Shijian Lu, Chunyan Miao, and Lidong Bing. Mitigating Object Hallucinations in Large Vision-Language Models through Visual Contrastive Decoding, November 2023. URL http://arxiv.org/abs/2311.16922. arXiv:2311.16922 [cs]

  9. [9]

    BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models

    Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models, 2023a. URL https://arxiv.org/abs/2301.12597

  10. [10]

    Evaluating Object Hallucination in Large Vision-Language Models

    Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Wayne Xin Zhao, and Ji-Rong Wen. Evaluating object hallucination in large vision-language models. arXiv preprint arXiv:2305.10355, 2023b

  11. [11]

    Zhuowei Li, Haizhou Shi, Yunhe Gao, Di Liu, Zhenting Wang, Yuxiao Chen, Ting Liu, Long Zhao, Hao Wang, and Dimitris N. Metaxas. The Hidden Life of Tokens: Reducing Hallucination of Large Vision-Language Models via Visual Information Steering, July 2025. URL http://arxiv.org/abs/2502.03628. arXiv:2502.03628 [cs]

  12. [12]

    Microsoft COCO: Common Objects in Context

    Tsung-Yi Lin, Michael Maire, Serge Belongie, Lubomir Bourdev, Ross Girshick, James Hays, Pietro Perona, Deva Ramanan, C. Lawrence Zitnick, and Piotr Dollár. Microsoft coco: Common objects in context, 2015. URL https://arxiv.org/abs/1405.0312

  13. [13]

    Improved Baselines with Visual Instruction Tuning

    Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved Baselines with Visual Instruction Tuning, May 2024a. URL http://arxiv.org/abs/2310.03744. arXiv:2310.03744 [cs]

  14. [14]

    Reducing hallucinations in large vision-language models via latent space steering

    Sheng Liu, Haotian Ye, and James Zou. Reducing hallucinations in large vision-language models via latent space steering. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=LBl7Hez0fF

  15. [15]

    Paying More Attention to Image: A Training-Free Method for Alleviating Hallucination in LVLMs

    Shi Liu, Kecheng Zheng, and Wei Chen. Paying More Attention to Image: A Training-Free Method for Alleviating Hallucination in LVLMs, July 2024b. URL http://arxiv.org/abs/2407.21771. arXiv:2407.21771 [cs]

  16. [16]

    Potsawee Manakul, Adian Liusie, and Mark J. F. Gales. Selfcheckgpt: Zero-resource black-box hallucination detection for generative large language models, 2023. URL https://arxiv.org/abs/2303.08896

  17. [17]

    Object Hallucination in Image Captioning

    Anna Rohrbach, Lisa Anne Hendricks, Kaylee Burns, Trevor Darrell, and Kate Saenko. Object hallucination in image captioning, 2019. URL https://arxiv.org/abs/1809.02156

  18. [18]

    Evidential Deep Learning to Quantify Classification Uncertainty

    Murat Sensoy, Lance Kaplan, and Melih Kandemir. Evidential Deep Learning to Quantify Classification Uncertainty, October 2018. URL http://arxiv.org/abs/1806.01768. arXiv:1806.01768 [cs]

  19. [19]

    Bivariate Beta-LSTM

    Kyungwoo Song, JoonHo Jang, Seung jae Shin, and Il-Chul Moon. Bivariate Beta-LSTM, November 2019. URL http://arxiv.org/abs/1905.10521. arXiv:1905.10521 [cs]

  20. [21]

    Feature selection using stochastic gates

    Yutaro Yamada, Ofir Lindenbaum, Sahand Negahban, and Yuval Kluger. Feature selection using stochastic gates. In International conference on machine learning, pp. 10648–10659. PMLR, 2020

  21. [22]

    Mitigating object hallucination in large vision-language models via image-grounded guidance, 2025

    Linxi Zhao, Yihe Deng, Weitong Zhang, and Quanquan Gu. Mitigating object hallucination in large vision-language models via image-grounded guidance, 2025. URL https://openreview.net/forum?id=eFoj2egr7G
