Adaptive Residual-Update Steering for Low-Overhead Hallucination Mitigation in Large Vision Language Models

Bin Li; Jiarui Guan; Pekka Marttinen; Ya Gao; Zhengtao Zou

arxiv: 2511.10292 · v2 · pith:C56DMTFGnew · submitted 2025-11-13 · 💻 cs.CV · cs.AI

Zhengtao Zou, Ya Gao, Jiarui Guan, Bin Li, Pekka Marttinen

Pith reviewed 2026-05-21 18:43 UTC · model grok-4.3

keywords: hallucination mitigation · vision-language models · residual updates · decoding regulation · adaptive gating · object hallucination · low-overhead inference · visual dilution

The pith

RUDDER extracts a visual evidence direction from residual updates and injects it adaptively during decoding to reduce hallucinations in large vision-language models by about a quarter while keeping throughput above 96 percent.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents RUDDER as a way to stop large vision-language models from losing track of image details and falling back on language priors that produce wrong objects. It pulls a persistent visual signal called CARD straight from the residual updates computed when the model first processes the image and prompt, then uses a Beta Gate to add this signal back into later decoding steps only when needed. This avoids the extra passes or logit comparisons that earlier fixes require. If the approach holds, models could produce more faithful image descriptions in real time without custom hardware or major speed penalties.

Core claim

RUDDER counters visual dilution in LVLMs by creating a persistent visual anchor: it extracts a robust evidence direction (CARD) directly from the model's prefill residual updates and modulates its injection into the autoregressive decoding process with an adaptive Beta Gate that acts as a trust mechanism, applying the reminder selectively to reduce over-reliance on language priors.

What carries the argument

The robust evidence direction (CARD) extracted from prefill residual updates, selectively injected via the Beta Gate to serve as a visual reminder during text generation.
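
The extract-then-inject loop described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the source only says that residual updates are pooled and normalized, that a score parameterizes a Beta distribution whose mean is the gate value, and that a steering strength and gate sensitivity exist. Mean pooling, the cosine-similarity score, the way `k` turns that score into Beta pseudo-counts, and the function names `extract_card`, `beta_gate`, and `steer` are all hypothetical choices made here for illustration.

```python
import numpy as np


def extract_card(residual_updates: np.ndarray) -> np.ndarray:
    """Pool per-token residual updates (tokens x hidden) into one unit vector.

    Assumption: mean pooling followed by L2 normalization.
    """
    pooled = residual_updates.mean(axis=0)
    norm = np.linalg.norm(pooled)
    return pooled / norm if norm > 0 else pooled


def beta_gate(hidden: np.ndarray, card: np.ndarray, k: float) -> float:
    """Hypothetical gate: mean of a Beta(a, b) distribution.

    The cosine similarity s between the hidden state and the CARD vector
    adds k * max(s, 0) pseudo-counts to a and k * max(-s, 0) to b,
    starting from a uniform Beta(1, 1) prior.
    """
    denom = np.linalg.norm(hidden) * np.linalg.norm(card)
    s = float(hidden @ card / denom) if denom > 0 else 0.0
    a = 1.0 + k * max(s, 0.0)
    b = 1.0 + k * max(-s, 0.0)
    return a / (a + b)


def steer(hidden: np.ndarray, card: np.ndarray, alpha_max: float, k: float) -> np.ndarray:
    """Add the gated CARD direction to one decoding-step hidden state."""
    g = beta_gate(hidden, card, k)
    return hidden + alpha_max * g * card


# Toy data standing in for real prefill residual updates and a decoding state.
rng = np.random.default_rng(0)
updates = rng.normal(size=(16, 8))   # 16 prefill tokens, hidden size 8
card = extract_card(updates)         # computed once, then cached
h_t = rng.normal(size=8)             # hidden state at one decoding step
h_steered = steer(h_t, card, alpha_max=2.0, k=5.0)
```

In this sketch the gate opens wider when the hidden state already aligns with the CARD direction; the opposite sign convention is equally plausible from the summary alone. The only property taken from the source is that the gate is a Beta mean lying strictly between 0 and 1.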

If this is right

Under greedy decoding, RUDDER reduces CHAIR_S by 24.4 percent and CHAIR_i by 23.6 percent on average across the tested models, both as relative reductions.
The method maintains over 96.0 percent of original throughput on LLaVA-1.5, Idefics2, InstructBLIP, and Qwen2.5-VL.
It scales across multiple LVLM architectures without requiring architecture-specific changes.
It achieves lower latency than logit-contrast or iterative-refinement baselines for the same hallucination reduction.
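
To make the headline numbers concrete, the sketch below shows how CHAIR-style scores, a relative reduction, and throughput retention are typically computed. The captions and throughput figures are invented toy values, not results from the paper, and the helper names are hypothetical. As a simplification, each caption contributes a set of unique objects rather than every individual mention.

```python
def chair_scores(captions):
    """Return (CHAIR_S, CHAIR_i) for a list of (mentioned, ground_truth) set pairs.

    CHAIR_S: fraction of captions containing at least one hallucinated object.
    CHAIR_i: fraction of mentioned objects that are hallucinated.
    """
    total_mentions = 0
    hallucinated_mentions = 0
    captions_with_hallucination = 0
    for mentioned, truth in captions:
        wrong = mentioned - truth          # objects named but not in the image
        total_mentions += len(mentioned)
        hallucinated_mentions += len(wrong)
        captions_with_hallucination += bool(wrong)
    chair_s = captions_with_hallucination / len(captions) if captions else 0.0
    chair_i = hallucinated_mentions / total_mentions if total_mentions else 0.0
    return chair_s, chair_i


def relative_reduction(baseline: float, treated: float) -> float:
    """Drop in a score expressed as a fraction of the baseline score."""
    return (baseline - treated) / baseline


def throughput_retention(baseline_tps: float, treated_tps: float) -> float:
    """Treated tokens-per-second as a fraction of baseline tokens-per-second."""
    return treated_tps / baseline_tps


# Toy captions (hypothetical): what the model mentioned vs. what is in the image.
baseline = [
    ({"dog", "frisbee", "bench"}, {"dog", "frisbee"}),
    ({"cat", "sofa"}, {"cat", "sofa"}),
    ({"car", "person", "umbrella"}, {"car"}),
    ({"pizza", "fork"}, {"pizza"}),
]
steered = [
    ({"dog", "frisbee"}, {"dog", "frisbee"}),
    ({"cat", "sofa"}, {"cat", "sofa"}),
    ({"car", "person"}, {"car"}),
    ({"pizza"}, {"pizza"}),
]

base_s, base_i = chair_scores(baseline)
new_s, new_i = chair_scores(steered)
print(f"CHAIR_S relative reduction: {relative_reduction(base_s, new_s):.1%}")
print(f"CHAIR_i relative reduction: {relative_reduction(base_i, new_i):.1%}")
print(f"throughput retained: {throughput_retention(100.0, 96.5):.1%}")
```

A relative reduction of 24.4 percent therefore means the steered score is about three quarters of the baseline score, not that the score fell by 24.4 percentage points.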

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The residual-update steering could be tested on other dilution effects such as long-context forgetting in pure language models.
If the Beta Gate proves reliable, similar adaptive mechanisms might reduce modality imbalance in audio-visual or video-language models.
Extending the CARD extraction to intermediate layers rather than only prefill could reveal whether later visual signals add further gains.

Load-bearing premise

The extracted CARD from prefill residuals reliably encodes usable visual information that the Beta Gate can apply without creating new errors or shifting the output distribution.

What would settle it

Run RUDDER on a set of images containing objects that conflict with common language priors. The claim fails if hallucination rates rise instead of fall, or if the added gating reduces throughput below 90 percent of baseline.

Figures

Figures reproduced from arXiv: 2511.10292 by Bin Li, Jiarui Guan, Pekka Marttinen, Ya Gao, Zhengtao Zou.

**Figure 1.** (Left) An example where the vanilla LLaVA-1.5-7B (Liu et al., 2024a) hallucinates objects. Erroneous text is marked in red, while RUDDER's corrected, factual output is in blue. (Right) A comparison showing that, unlike existing non-steering and steering-based methods, RUDDER provides adaptive, low-overhead control without requiring extra forward passes.

**Figure 2.** The overall workflow of RUDDER. Our method operates in two stages. (1) Prefill Stage (Yellow Arrows): We extract the CARD vector v_CARD by first collecting attention-induced residual updates Δ_i^l from a target layer l for each token i in the prefill span. These updates are then aggregated using pooling and normalization. The final CARD vectors are cached for each (image, prompt) pair. (2) Decoding Stage (Orange…

**Figure 3.** Ablation study of RUDDER's hyperparameters on the Idefics2 (Laurençon et al., 2024) model. (a) The bar plot shows the impact of the intervention layer L. (b-d) The heatmaps analyze the trade-off between steering strength α_max and gate sensitivity k, showing their effect on CHAIR scores and recall.

**Figure 4.** Structure in steering space (b) and its sample-wise projection to…

**Figure 5.** Directional evidence with reflowed layout. (a) Consistent…

**Figure 6.** Analysis of the internal dynamics of LLaVA-1.5 (Liu et al., 2024a). (a, c) Both the absolute…

**Figure 7.** Case study. Hallucinated contents generated by the vanilla LLaVA-1.5 are marked in…

**Figure 8.** Case study. Hallucinated contents generated by the vanilla LLaVA-1.5 are marked in…

**Figure 9.** Case study. Hallucinated contents generated by the vanilla Idefics2 are marked in…

**Figure 10.** Case study. Hallucinated contents generated by the vanilla InstructBLIP are marked in…

**Figure 11.** Ablation matrices for RUDDER on LLaVA-1.5 (Liu et al., 2024a)

**Figure 12.** Ablation matrices for RUDDER on InstructBLIP (Dai et al., 2023)

Original abstract

Large Vision-Language Models (LVLMs) typically process visual inputs as a prefix to the language decoder. As the model autoregressively generates text, this initial visual information inevitably undergoes "dilution" leading the model to over-rely on language priors and hallucinate objects. Existing interventions attempt to correct this by contrasting logits or iteratively refining outputs, but they incur prohibitive latency costs. We propose Residual-Update Directed DEcoding Regulation (RUDDER), a framework that counters visual dilution by creating a persistent visual anchor. We extract a robust evidence direction (CARD) directly from the model's prefill residual updates, and inject it into the decoding process. This injection is modulated by an adaptive gate, the Beta Gate, which acts as a trust mechanism and ensures the visual reminder is applied only when necessary. Experiments on LLaVA-1.5 (7B/13B), Idefics2, InstructBLIP, and Qwen2.5-VL demonstrate that RUDDER consistently mitigates hallucination (with greedy decoding, RUDDER reduces CHAIR_S by an average of 24.4% and CHAIR_i by 23.6% relative) and scales effectively across architectures, all while maintaining >96.0% throughput.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

RUDDER pulls a steering signal from prefill residuals to cut hallucinations with low overhead, but the abstract gives almost no mechanics so the gains are hard to trust yet.

The core idea is to extract a persistent visual direction called CARD from the residual updates during the prefill phase and feed it back into decoding through an adaptive Beta Gate. This is meant to fight the dilution of image information as the model generates text, all without the latency hit of logit contrast or repeated refinement loops. They test it on LLaVA-1.5, Idefics2, InstructBLIP, and Qwen2.5-VL and report roughly 24% relative drops on CHAIR_S and CHAIR_i under greedy decoding while keeping throughput above 96% of baseline. That combination of cross-model results and speed preservation is the part that could matter for real systems. The approach is new in how it tries to reuse internal prefill residuals as the anchor instead of adding external signals.

The soft spots are clear from the abstract alone. There are no equations for how CARD is actually computed from the residuals, no description of the Beta Gate parameterization, and no ablations that would show the effect comes from the proposed mechanism rather than something incidental. The concern that prefill residuals mix visual tokens with prompt-driven language signals is reasonable and unaddressed so far; without a text-masked control or similar check, it is difficult to know whether the injection is reinforcing the right thing. The reported numbers are consistent with the claim but do not yet isolate the cause.

This is the kind of paper that would interest people building production vision-language pipelines who need something lighter than current mitigation tricks. The idea is worth a closer look even if the current write-up is thin on the internals. I would send it to peer review so the authors can supply the missing implementation details, ablations, and any checks on signal entanglement.

Referee Report

2 major / 2 minor

Summary. The paper proposes Residual-Update Directed DEcoding Regulation (RUDDER) to mitigate object hallucinations in LVLMs caused by visual information dilution during autoregressive decoding. It extracts a 'robust evidence direction' (CARD) from residual updates in the prefill stage (which processes both image tokens and initial text) and injects this direction into subsequent decoding steps, with the injection strength controlled by an adaptive Beta Gate that acts as a trust mechanism. Experiments across LLaVA-1.5 (7B/13B), Idefics2, InstructBLIP, and Qwen2.5-VL report average relative CHAIR_S reductions of 24.4% and CHAIR_i reductions of 23.6% under greedy decoding, with throughput maintained above 96%.

Significance. If the central mechanism is shown to isolate and reinforce visual evidence without introducing new distribution shifts, the result would be significant: it offers a low-latency alternative to logit-contrasting or iterative-refinement methods for hallucination control, with demonstrated scaling across four distinct LVLM families and negligible throughput cost. This could improve reliability in downstream tasks such as captioning and VQA while remaining practical for deployment.

major comments (2)

[§3] §3 (Method), CARD extraction paragraph: the claim that CARD 'counters language-prior dilution' rests on the assumption that prefill residuals predominantly encode separable visual evidence. Because the prefill stage mixes image tokens with the initial prompt, the manuscript must demonstrate (via equation or ablation) that CARD is not a linear combination of visual and prompt-induced language signals; without this, the injection step risks reinforcing rather than mitigating hallucinations.
[Experiments] Experiments section, CHAIR results paragraph: the reported 24.4% / 23.6% relative reductions are presented without ablations that isolate the contribution of CARD versus the Beta Gate, nor any text-masked prefill control. This leaves open whether the gains are caused by the proposed residual-update steering or by incidental changes in decoding dynamics.

minor comments (2)

[Abstract] The abstract and §4 would benefit from a one-sentence statement of the exact parameterization of the Beta Gate (threshold or scaling factor) so readers can assess reproducibility.
[§3] Notation for residual updates and the injection operation should be introduced with a compact equation early in §3 to avoid ambiguity when the Beta Gate modulates the direction.
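
One compact form such notation could take is sketched below. It is assembled from the Figure 2 caption and the Beta Gate description quoted elsewhere on this page, and it is not the paper's own set of equations: the pooling operator Pool, the prefill span P, the functions a(·) and b(·) that map the score s_t and sensitivity k to Beta parameters, and the additive form of the update are all assumptions.

```latex
\begin{align}
  v_{\mathrm{CARD}} &= \frac{\operatorname{Pool}_{i \in \mathcal{P}} \Delta_i^{l}}
                            {\bigl\lVert \operatorname{Pool}_{i \in \mathcal{P}} \Delta_i^{l} \bigr\rVert_2}
  && \text{(pooled, normalized prefill residual updates)} \\
  g_t &= \mathbb{E}\bigl[\mathrm{Beta}(a_t, b_t)\bigr] = \frac{a_t}{a_t + b_t},
  \quad a_t = a(s_t; k),\; b_t = b(s_t; k)
  && \text{(gate as a Beta mean)} \\
  \tilde{h}_t^{l} &= h_t^{l} + \alpha_{\max}\, g_t\, v_{\mathrm{CARD}}
  && \text{(gated injection at layer } l\text{)}
\end{align}
```

The middle line uses only the standard fact that a Beta(a, b) distribution has mean a / (a + b), so g_t always lies strictly between 0 and 1 and the injected magnitude never exceeds α_max.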

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The points raised highlight important aspects that will strengthen the clarity and rigor of our claims. We address each major comment below and commit to incorporating the necessary revisions.

Referee: [§3] §3 (Method), CARD extraction paragraph: the claim that CARD 'counters language-prior dilution' rests on the assumption that prefill residuals predominantly encode separable visual evidence. Because the prefill stage mixes image tokens with the initial prompt, the manuscript must demonstrate (via equation or ablation) that CARD is not a linear combination of visual and prompt-induced language signals; without this, the injection step risks reinforcing rather than mitigating hallucinations.

Authors: We agree that an explicit demonstration of separability is required to substantiate the claim. In the revised manuscript we will add both a short derivation in §3 showing how the residual update is dominated by attention over image tokens (due to the prefix structure and cross-attention patterns) and a new ablation that extracts CARD from a text-only prefill (image tokens masked) versus the standard image+text prefill. The text-only variant will be injected under identical conditions; we expect substantially weaker hallucination mitigation, confirming that the visual component is the primary driver.

Revision: yes

Referee: Experiments section, CHAIR results paragraph: the reported 24.4% / 23.6% relative reductions are presented without ablations that isolate the contribution of CARD versus the Beta Gate, nor any text-masked prefill control. This leaves open whether the gains are caused by the proposed residual-update steering or by incidental changes in decoding dynamics.

Authors: We acknowledge the need for component-isolation experiments. The revised version will include three targeted ablations: (1) CARD injection with the Beta Gate disabled (fixed strength), (2) full RUDDER versus a no-CARD baseline, and (3) a text-masked prefill control that generates a language-only direction for injection. These will be reported alongside the main CHAIR results to show that performance gains track the presence of visual residuals rather than generic decoding modifications.

Revision: yes
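
The text-masked control discussed in this exchange could start with a simple geometric check before any CHAIR evaluation is run. The sketch below is hypothetical and uses synthetic vectors in place of real residual updates: it builds one direction from an image+text prefill and one from a text-only prefill, measures their cosine overlap, and removes the shared language component by projection. Mean pooling and the helper names are assumptions, and a low overlap on its own would not prove that the remaining direction is visual.

```python
import numpy as np


def unit_direction(updates: np.ndarray) -> np.ndarray:
    """Mean-pool residual updates (tokens x hidden) and L2-normalize."""
    pooled = updates.mean(axis=0)
    norm = np.linalg.norm(pooled)
    return pooled / norm if norm > 0 else pooled


def cosine(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine similarity, defined as 0.0 when either vector is zero."""
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom > 0 else 0.0


def remove_component(full_dir: np.ndarray, text_dir: np.ndarray) -> np.ndarray:
    """Project text_dir out of full_dir (both unit vectors) and renormalize."""
    residual = full_dir - (full_dir @ text_dir) * text_dir
    norm = np.linalg.norm(residual)
    return residual / norm if norm > 0 else residual


# Synthetic stand-ins: a "visual" and a "language" signal plus per-token noise.
rng = np.random.default_rng(1)
visual_signal = rng.normal(size=32)
language_signal = rng.normal(size=32)
full_updates = visual_signal + 0.5 * language_signal + 0.1 * rng.normal(size=(20, 32))
text_updates = language_signal + 0.1 * rng.normal(size=(20, 32))

full_dir = unit_direction(full_updates)   # image+text prefill direction
text_dir = unit_direction(text_updates)   # text-only (image-masked) direction

overlap = cosine(full_dir, text_dir)      # how much language signal is shared
purified = remove_component(full_dir, text_dir)
print(f"overlap with text-only direction: {overlap:.3f}")
print(f"overlap after projection: {cosine(purified, text_dir):.2e}")
```

If the overlap were close to 1 on real models, the referee's entanglement concern would carry more weight, and injecting the projected direction would be a natural extra ablation arm.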

Circularity Check

0 steps flagged

No significant circularity: RUDDER introduces an operational extraction-injection procedure validated empirically rather than reducing to fitted inputs or self-referential definitions.

The paper describes RUDDER as a framework that extracts CARD directly from prefill residual updates and modulates injection via the Beta Gate to counter visual dilution. This is presented as a practical intervention on model internals, with performance measured through experiments on CHAIR metrics across LLaVA, Idefics2, InstructBLIP, and Qwen2.5-VL. No derivation chain exists that equates a claimed prediction or first-principles result to its own inputs by construction. The method does not rely on fitting parameters to a data subset and relabeling the outcome as a prediction, nor does it import uniqueness theorems or ansatzes via self-citation in a load-bearing manner. The central claims rest on external experimental benchmarks rather than tautological re-expression of the technique's own definitions.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 2 invented entities

The method rests on the untested premise that prefill residuals contain a stable visual signal and on two newly introduced constructs (CARD and Beta Gate) whose definitions and tuning are not detailed.

free parameters (1)

Beta Gate threshold or scaling factor. The adaptive gate must contain at least one tunable parameter that decides when the visual reminder is applied; its value is not reported.

axioms (1)

Domain assumption: prefill residual updates encode visual evidence that subsequently dilutes during autoregressive decoding. This premise is invoked to justify extracting CARD from the prefill stage.

invented entities (2)

CARD (robust evidence direction), no independent evidence. Purpose: persistent visual anchor extracted from prefill residuals. Newly defined construct whose extraction rule is not specified.

Beta Gate, no independent evidence. Purpose: adaptive trust mechanism that modulates injection of the visual anchor. New component introduced to avoid unnecessary steering.

pith-pipeline@v0.9.0 · 5771 in / 1405 out tokens · 74396 ms · 2026-05-21T18:43:12.471656+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean (J-cost uniqueness) · washburn_uniqueness_aczel · tag: unclear
Paper passage: "We extract a robust evidence direction (CARD) directly from the model's prefill residual updates, and inject it into the decoding process... modulated by an adaptive gate, the Beta Gate"

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · tag: unclear
Paper passage: "the Beta Gate... uses s_t to parameterize a Beta distribution, and the gate value g_t is taken as its posterior mean"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

25 extracted references · 25 canonical work pages · 12 internal anchors

[1] Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katie Millican, Malcolm Reynolds, Roman Ring, Eliza Rutherford, Serkan Cabi, Tengda Han, Zhitao Gong, Sina Samangooei, Marianne Monteiro, Jacob Menick, Sebastian Borgeaud, Andrew Brock, Aida Nematzadeh, Sahand Sharifzadeh, Mikolaj Binkowski,... Flamingo: a Visual Language Model for Few-Shot Learning, 2022

[2] Yung-Sung Chuang, Yujia Xie, Hongyin Luo, Yoon Kim, James Glass, and Pengcheng He. DoLa: Decoding by Contrasting Layers Improves Factuality in Large Language Models. arXiv preprint arXiv:2309.03883, 2023

[3] Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven Hoi. InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning, June 2023. URL http://arxiv.org/abs/2305.06500. arXiv:2305.06500 [cs]

[4] Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, Xu Lin, Jinrui Yang, Xiawu Zheng, Ke Li, Xing Sun, Yunsheng Wu, and Rongrong Ji. MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models, March 2024. URL http://arxiv.org/abs/2306.13394. arXiv:2306.13394 [cs]

[5] Chris Hokamp and Qun Liu. Lexically Constrained Decoding for Sequence Generation Using Grid Beam Search, 2017. URL https://arxiv.org/abs/1704.07138

[6] Ziwei Ji, Nayeon Lee, Rita Frieske, Tiezheng Yu, Dan Su, Yan Xu, Etsuko Ishii, Yejin Bang, Delong Chen, Wenliang Dai, Ho Shu Chan, Andrea Madotto, and Pascale Fung. Survey of Hallucination in Natural Language Generation. ACM Comput. Surv., 55(12):1-38, December 2023. ISSN 0360-0300, 1557-7341. doi:10.1145/3571730. URL http://arxiv.org/abs/2202.036...

[7] Hugo Laurençon, Léo Tronchon, Matthieu Cord, and Victor Sanh. What matters when building vision-language models? Advances in Neural Information Processing Systems, 37:87874-87907, 2024

[8] Sicong Leng, Hang Zhang, Guanzheng Chen, Xin Li, Shijian Lu, Chunyan Miao, and Lidong Bing. Mitigating Object Hallucinations in Large Vision-Language Models through Visual Contrastive Decoding, November 2023. URL http://arxiv.org/abs/2311.16922. arXiv:2311.16922 [cs]

[9] Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models, 2023a. URL https://arxiv.org/abs/2301.12597

[10] Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Wayne Xin Zhao, and Ji-Rong Wen. Evaluating Object Hallucination in Large Vision-Language Models. arXiv preprint arXiv:2305.10355, 2023b

[11] Zhuowei Li, Haizhou Shi, Yunhe Gao, Di Liu, Zhenting Wang, Yuxiao Chen, Ting Liu, Long Zhao, Hao Wang, and Dimitris N. Metaxas. The Hidden Life of Tokens: Reducing Hallucination of Large Vision-Language Models via Visual Information Steering, July 2025. URL http://arxiv.org/abs/2502.03628. arXiv:2502.03628 [cs]

[12] Tsung-Yi Lin, Michael Maire, Serge Belongie, Lubomir Bourdev, Ross Girshick, James Hays, Pietro Perona, Deva Ramanan, C. Lawrence Zitnick, and Piotr Dollár. Microsoft COCO: Common Objects in Context, 2015. URL https://arxiv.org/abs/1405.0312

[13] Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved Baselines with Visual Instruction Tuning, May 2024a. URL http://arxiv.org/abs/2310.03744. arXiv:2310.03744 [cs]

[14] Sheng Liu, Haotian Ye, and James Zou. Reducing hallucinations in large vision-language models via latent space steering. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=LBl7Hez0fF

[15] Shi Liu, Kecheng Zheng, and Wei Chen. Paying More Attention to Image: A Training-Free Method for Alleviating Hallucination in LVLMs, July 2024b. URL http://arxiv.org/abs/2407.21771. arXiv:2407.21771 [cs]

[16] Potsawee Manakul, Adian Liusie, and Mark J. F. Gales. SelfCheckGPT: Zero-resource black-box hallucination detection for generative large language models, 2023. URL https://arxiv.org/abs/2303.08896

[17] Anna Rohrbach, Lisa Anne Hendricks, Kaylee Burns, Trevor Darrell, and Kate Saenko. Object Hallucination in Image Captioning, 2019. URL https://arxiv.org/abs/1809.02156

[18] Murat Sensoy, Lance Kaplan, and Melih Kandemir. Evidential Deep Learning to Quantify Classification Uncertainty, October 2018. URL http://arxiv.org/abs/1806.01768. arXiv:1806.01768 [cs]

[19] Kyungwoo Song, JoonHo Jang, Seung jae Shin, and Il-Chul Moon. Bivariate Beta-LSTM, November 2019. URL http://arxiv.org/abs/1905.10521. arXiv:1905.10521 [cs]

[21] Yutaro Yamada, Ofir Lindenbaum, Sahand Negahban, and Yuval Kluger. Feature selection using stochastic gates. In International Conference on Machine Learning, pp. 10648-10659. PMLR, 2020

[22] Linxi Zhao, Yihe Deng, Weitong Zhang, and Quanquan Gu. Mitigating object hallucination in large vision-language models via image-grounded guidance, 2025. URL https://openreview.net/forum?id=eFoj2egr7G

[1] [1]

Flamingo: a Visual Language Model for Few-Shot Learning

Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katie Millican, Malcolm Reynolds, Roman Ring, Eliza Rutherford, Serkan Cabi, Tengda Han, Zhitao Gong, Sina Samangooei, Marianne Monteiro, Jacob Menick, Sebastian Borgeaud, Andrew Brock, Aida Nematzadeh, Sahand Sharifzadeh, Mikolaj Binkowski,...

work page internal anchor Pith review Pith/arXiv arXiv 2022

[2] [2]

DoLa: Decoding by Contrasting Layers Improves Factuality in Large Language Models

Yung-Sung Chuang, Yujia Xie, Hongyin Luo, Yoon Kim, James Glass, and Pengcheng He. Dola: Decoding by contrasting layers improves factuality in large language models. arXiv preprint arXiv:2309.03883, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[3] [3]

InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning

Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven Hoi. InstructBLIP : Towards General -purpose Vision - Language Models with Instruction Tuning , June 2023. URL http://arxiv.org/abs/2305.06500. arXiv:2305.06500 [cs]

work page internal anchor Pith review Pith/arXiv arXiv 2023

[4] [4]

MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models

Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, Xu Lin, Jinrui Yang, Xiawu Zheng, Ke Li, Xing Sun, Yunsheng Wu, and Rongrong Ji. MME : A Comprehensive Evaluation Benchmark for Multimodal Large Language Models , March 2024. URL http://arxiv.org/abs/2306.13394. arXiv:2306.13394 [cs]

work page internal anchor Pith review Pith/arXiv arXiv 2024

[5] [5]

Lexically Constrained Decoding for Sequence Generation Using Grid Beam Search

Chris Hokamp and Qun Liu. Lexically constrained decoding for sequence generation using grid beam search, 2017. URL https://arxiv.org/abs/1704.07138

work page internal anchor Pith review Pith/arXiv arXiv 2017

[6] [6]

Jiet al., Survey of hallucination in natural language generation, ACM Computing Surveys 10.1145/3571730 (2022)

Ziwei Ji, Nayeon Lee, Rita Frieske, Tiezheng Yu, Dan Su, Yan Xu, Etsuko Ishii, Yejin Bang, Delong Chen, Wenliang Dai, Ho Shu Chan, Andrea Madotto, and Pascale Fung. Survey of Hallucination in Natural Language Generation . ACM Comput. Surv., 55 0 (12): 0 1--38, December 2023. ISSN 0360-0300, 1557-7341. doi:10.1145/3571730. URL http://arxiv.org/abs/2202.036...

work page doi:10.1145/3571730 2023

[7] [7]

What matters when building vision-language models? Advances in Neural Information Processing Systems, 37: 0 87874--87907, 2024

Hugo Lauren c on, L \'e o Tronchon, Matthieu Cord, and Victor Sanh. What matters when building vision-language models? Advances in Neural Information Processing Systems, 37: 0 87874--87907, 2024

work page 2024

[8] [8]

Mitigating Object Hallucinations in Large Vision - Language Models through Visual Contrastive Decoding , November 2023

Sicong Leng, Hang Zhang, Guanzheng Chen, Xin Li, Shijian Lu, Chunyan Miao, and Lidong Bing. Mitigating Object Hallucinations in Large Vision - Language Models through Visual Contrastive Decoding , November 2023. URL http://arxiv.org/abs/2311.16922. arXiv:2311.16922 [cs]

work page arXiv 2023

[9] [9]

BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models

Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models, 2023 a . URL https://arxiv.org/abs/2301.12597

work page internal anchor Pith review Pith/arXiv arXiv 2023

[10] [10]

Evaluating Object Hallucination in Large Vision-Language Models

Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Wayne Xin Zhao, and Ji-Rong Wen. Evaluating object hallucination in large vision-language models. arXiv preprint arXiv:2305.10355, 2023 b

work page internal anchor Pith review Pith/arXiv arXiv 2023

[11] [11]

Zhuowei Li, Haizhou Shi, Yunhe Gao, Di Liu, Zhenting Wang, Yuxiao Chen, Ting Liu, Long Zhao, Hao Wang, and Dimitris N. Metaxas. The Hidden Life of Tokens : Reducing Hallucination of Large Vision - Language Models via Visual Information Steering , July 2025. URL http://arxiv.org/abs/2502.03628. arXiv:2502.03628 [cs]

work page arXiv 2025

[12] [12]

Microsoft COCO: Common Objects in Context

Tsung-Yi Lin, Michael Maire, Serge Belongie, Lubomir Bourdev, Ross Girshick, James Hays, Pietro Perona, Deva Ramanan, C. Lawrence Zitnick, and Piotr Dollár. Microsoft coco: Common objects in context, 2015. URL https://arxiv.org/abs/1405.0312

work page internal anchor Pith review Pith/arXiv arXiv 2015

[13] [13]

Improved Baselines with Visual Instruction Tuning

Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved Baselines with Visual Instruction Tuning, May 2024a. URL http://arxiv.org/abs/2310.03744. arXiv:2310.03744 [cs]

work page internal anchor Pith review Pith/arXiv arXiv 2024

[14] [14]

Reducing hallucinations in large vision-language models via latent space steering

Sheng Liu, Haotian Ye, and James Zou. Reducing hallucinations in large vision-language models via latent space steering. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=LBl7Hez0fF

work page 2025

[15] [15]

Paying more attention to image: A training-free method for alleviating hallucination in LVLMs

Shi Liu, Kecheng Zheng, and Wei Chen. Paying More Attention to Image: A Training-Free Method for Alleviating Hallucination in LVLMs, July 2024b. URL http://arxiv.org/abs/2407.21771. arXiv:2407.21771 [cs]

work page arXiv 2024

[16] [16]

Potsawee Manakul, Adian Liusie, and Mark J. F. Gales. SelfCheckGPT: Zero-resource black-box hallucination detection for generative large language models, 2023. URL https://arxiv.org/abs/2303.08896

work page internal anchor Pith review Pith/arXiv arXiv 2023

[17] [17]

Object Hallucination in Image Captioning

Anna Rohrbach, Lisa Anne Hendricks, Kaylee Burns, Trevor Darrell, and Kate Saenko. Object hallucination in image captioning, 2019. URL https://arxiv.org/abs/1809.02156

work page internal anchor Pith review Pith/arXiv arXiv 2019

[18] [18]

Evidential Deep Learning to Quantify Classification Uncertainty

Murat Sensoy, Lance Kaplan, and Melih Kandemir. Evidential Deep Learning to Quantify Classification Uncertainty, October 2018. URL http://arxiv.org/abs/1806.01768. arXiv:1806.01768 [cs]

work page internal anchor Pith review Pith/arXiv arXiv 2018

[19] [19]

Bivariate Beta-LSTM

Kyungwoo Song, JoonHo Jang, Seung jae Shin, and Il-Chul Moon. Bivariate Beta-LSTM, November 2019. URL http://arxiv.org/abs/1905.10521. arXiv:1905.10521 [cs]

work page arXiv 2019

[20] [21]

Feature selection using stochastic gates

Yutaro Yamada, Ofir Lindenbaum, Sahand Negahban, and Yuval Kluger. Feature selection using stochastic gates. In International Conference on Machine Learning, pp. 10648--10659. PMLR, 2020

work page 2020

[21] [22]

Mitigating object hallucination in large vision-language models via image-grounded guidance

Linxi Zhao, Yihe Deng, Weitong Zhang, and Quanquan Gu. Mitigating object hallucination in large vision-language models via image-grounded guidance, 2025. URL https://openreview.net/forum?id=eFoj2egr7G

work page 2025
