Mechanisms of Prompt-Induced Hallucination in Vision-Language Models

Carsten Eickhoff; Dana Arad; Kyle Mahowald; Michal Golovanevsky; Ritambhara Singh; William Rudman; Yonatan Belinkov

arxiv: 2601.05201 · v2 · submitted 2026-01-08 · 💻 cs.CV · cs.AI · cs.CL

William Rudman, Michal Golovanevsky, Dana Arad, Yonatan Belinkov, Ritambhara Singh, Carsten Eickhoff, Kyle Mahowald

Pith reviewed 2026-05-16 16:04 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.CL

keywords: prompt-induced hallucinations, vision-language models, attention heads, mechanistic interpretability, object counting, hallucination mitigation, multimodal models

The pith

Ablating a small set of attention heads cuts prompt-induced hallucinations in vision-language models by at least 40 percent.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Vision-language models tend to favor text prompts over actual image content when the prompt overstates facts, such as describing more objects than are present. This bias grows stronger as the number of objects increases, leading models to conform to the prompt rather than correct it. Mechanistic analysis across three models locates a small collection of attention heads that drive this prompt-copying behavior. Ablating those heads makes the models rely more on visual evidence, and the reduction in hallucinations holds without any retraining. The heads operate differently across models but share the role of mediating prompt influence.

Core claim

Through mechanistic analysis of three VLMs, we identify a small set of attention heads whose ablation substantially reduces prompt-induced hallucinations (PIH) by at least 40% without additional training. Across models, PIH-heads mediate prompt copying in model-specific ways. We characterize these differences and show that PIH ablation increases correction toward visual evidence.

What carries the argument

PIH-heads, the small set of attention heads that mediate prompt copying and override visual evidence in favor of textual overstatements.
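
To make "ablation" concrete: in a standard transformer layer the attention output is the sum of per-head contributions, so removing a head amounts to dropping its term from that sum. The toy sketch below assumes zero-ablation (the abstract does not say whether heads are zeroed or replaced by a mean activation), and every name and number in it is hypothetical rather than taken from the paper.

```python
# Toy sketch of zero-ablating attention heads in a single layer.
# All names and values are illustrative assumptions, not the paper's setup.

def combine_heads(head_outputs, ablated=frozenset()):
    """Sum per-head contributions, skipping any ablated heads.

    head_outputs: dict mapping head index -> list[float], that head's
                  contribution after its slice of the output projection.
    ablated:      set of head indices whose contribution is treated as zero.
    """
    dim = len(next(iter(head_outputs.values())))
    total = [0.0] * dim
    for head, vec in head_outputs.items():
        if head in ablated:
            continue  # zero-ablation: this head adds nothing
        for i, value in enumerate(vec):
            total[i] += value
    return total


# Hypothetical 2-d readout:
#   dim 0 ~ evidence for the count stated in the prompt
#   dim 1 ~ evidence for the count visible in the image
heads = {
    0: [0.25, 1.0],    # mostly visual evidence
    1: [1.5, 0.125],   # hypothetical "PIH-head" pushing toward the prompt
    2: [0.125, 0.5],
}

baseline = combine_heads(heads)
ablated = combine_heads(heads, ablated={1})

print("baseline:", baseline)  # prompt evidence dominates
print("ablated: ", ablated)   # image evidence dominates
```

In this toy case the prompt dimension is larger before ablation and the image dimension is larger after it, which is the qualitative shift the core claim describes. A real intervention would be applied inside the model's forward pass rather than on hand-written vectors.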

If this is right

Ablating the heads increases the model's tendency to correct overstated prompts toward the actual image content.
The same heads implement prompt copying differently in each of the three tested models.
The reduction in hallucinations occurs without any model retraining or fine-tuning.
The bias toward prompts strengthens reliably as object counts rise in the test images.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar heads could be targeted to reduce other forms of text-over-image bias in captioning or visual question answering.
The approach suggests a lightweight way to improve VLM reliability on tasks where prompts might conflict with evidence.
Model-specific differences in how the heads function may require per-model identification rather than a universal fix.

Load-bearing premise

The controlled object-counting setup with overstated prompts captures the core mechanism of prompt-induced hallucinations that appears in broader use cases.

What would settle it

Measuring whether ablating the identified heads in a fourth VLM or a non-counting task produces a comparable drop in prompt conformity would confirm or refute the mechanism.

Original abstract

Large vision-language models (VLMs) are highly capable, yet often hallucinate by favoring textual prompts over visual evidence. We study this failure mode in a controlled object-counting setting, where the prompt overstates the number of objects in the image (e.g., asking a model to describe four waterlilies when only three are present). At low object counts, models often correct the overestimation, but as the number of objects increases, they increasingly conform to the prompt regardless of the discrepancy. Through mechanistic analysis of three VLMs, we identify a small set of attention heads whose ablation substantially reduces prompt-induced hallucinations (PIH) by at least 40% without additional training. Across models, PIH-heads mediate prompt copying in model-specific ways. We characterize these differences and show that PIH ablation increases correction toward visual evidence. Our findings offer insights into the internal mechanisms driving prompt-induced hallucinations, revealing model-specific differences in how these behaviors are implemented.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

Ablating a small set of attention heads cuts prompt-induced hallucinations by at least 40% in an object-counting task across three VLMs, but the result stays tied to that narrow setup.

The paper's main result is that a small number of attention heads in three different VLMs drive the models to copy overstated object counts from the prompt. Ablating those heads reduces the hallucination rate by 40 percent or more and makes the outputs align better with the actual image, all without any retraining. The authors also note that the heads operate in model-specific ways. This is the concrete new piece: a targeted intervention that works across models and points to internal mechanisms rather than just surface behavior. The ablation experiments are straightforward and produce a consistent directional change toward visual evidence, which is the part that holds up cleanly from the reported results. The cross-model comparison adds some useful detail on implementation differences.

The main limitation is scope. All the data comes from prompts that overstate object counts in images, such as asking for four waterlilies when three are present. It is not shown whether the same heads or the same benefit appear for other prompt-induced failures like inventing attributes, misstating spatial relations, or open-ended description errors. If these heads mainly handle numerical conformity or simple token repetition, the 40 percent reduction and the mechanistic claim would not extend far beyond counting. The abstract also skips the exact head-selection procedure and any checks that other capabilities stayed intact, though the full paper may address that.

This is useful reading for people working on mechanistic fixes for VLM reliability in captioning or decision tasks. A reader focused on interpretability would get value from the ablation method and the model comparisons. It is coherent enough on its own terms to deserve peer review, with the main request being broader tests on other hallucination types.

Referee Report

3 major / 2 minor

Summary. The paper studies prompt-induced hallucinations (PIH) in vision-language models using a controlled object-counting task in which prompts overstate the number of objects present in an image. Through ablation experiments on three VLMs, the authors identify a small set of attention heads whose removal reduces PIH by at least 40% without further training, characterize model-specific differences in how these heads implement prompt copying, and report that ablation increases correction toward visual evidence.

Significance. If the identified heads prove to implement a general prompt-over-visual routing mechanism rather than task-specific numerical conformity, the result would offer a concrete, training-free intervention for a common VLM failure mode and would advance mechanistic interpretability of multimodal models. The current evidence, however, is confined to a single narrow task, so the broader significance remains conditional on transfer experiments that are not yet provided.

major comments (3)

[Results (object-counting experiments)] The central claim that the ablated heads mediate general PIH is load-bearing yet rests on a single object-counting overstatement setup. No results are shown for attribute invention, spatial-relation errors, or open-ended VQA hallucinations; if the heads primarily implement token-copying or count conformity, the reported 40% reduction would be an artifact of the chosen task rather than a mechanistic insight into PIH.
[Methods (mechanistic analysis)] Head-selection criteria, statistical controls, and preservation of performance on unrelated tasks are not described. The abstract states that ablation reduces PIH by at least 40% across three models, but without an explicit selection procedure or controls for multiple comparisons it is impossible to verify that the reported heads are not post-hoc selections that capitalize on noise.
[Discussion (model-specific characterization)] The paper reports model-specific differences in how PIH-heads mediate prompt copying, yet provides no quantitative comparison of the heads' attention patterns or activation statistics across the three VLMs that would allow readers to assess whether the differences are substantive or merely quantitative variations on the same mechanism.
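
One quantity the third major comment asks for, the share of a head's attention that lands on prompt tokens versus image tokens, can be computed directly from an attention row. The sketch below is a minimal illustration under assumed inputs: the token labels, the two example heads, and their weights are hypothetical, and the paper's own analysis may use a different statistic.

```python
# Sketch: split one head's attention row into mass per token type.
# Token labels and weights are made-up illustrations.

def attention_mass(weights, token_types):
    """Sum attention probability per token type for one query position."""
    mass = {}
    for weight, token_type in zip(weights, token_types):
        mass[token_type] = mass.get(token_type, 0.0) + weight
    return mass


def prompt_share(weights, token_types):
    """Fraction of total attention mass placed on prompt tokens."""
    mass = attention_mass(weights, token_types)
    return mass.get("prompt", 0.0) / sum(mass.values())


token_types = ["image", "image", "image", "prompt", "prompt"]

# Two hypothetical heads attending from the same query position.
head_a = [0.125, 0.125, 0.25, 0.25, 0.25]
head_b = [0.0625, 0.0625, 0.125, 0.25, 0.5]

share_a = prompt_share(head_a, token_types)
share_b = prompt_share(head_b, token_types)
print("head_a prompt share:", share_a)
print("head_b prompt share:", share_b)
```

Averaging this share over many examples, per head and per model, would give one row of the kind of cross-model table the comment requests.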

minor comments (2)

[Abstract] The abstract claims 'substantially reduces prompt-induced hallucinations (PIH) by at least 40%' but does not define the exact metric (e.g., accuracy delta, hallucination rate) or report confidence intervals; this should be clarified in the results section.
[Figures] Figure legends and axis labels for the ablation plots should explicitly state the number of runs and the precise definition of 'correction toward visual evidence' used in the bar charts.
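
As an illustration of what an explicit metric definition could look like, the sketch below treats the PIH rate as the fraction of responses that repeat the prompt's overstated count, and the "at least 40%" figure as a relative reduction in that rate. Both definitions are assumptions made here for concreteness; the abstract does not state which metric the authors use, and the records are invented toy data built around the waterlily example.

```python
# Sketch of one possible PIH metric. Definitions and data are assumptions.

def pih_rate(records):
    """Fraction of responses that repeat the prompt's overstated count."""
    conforming = sum(1 for r in records if r["answer"] == r["prompt_count"])
    return conforming / len(records)


def correction_rate(records):
    """Fraction of responses that report the count actually in the image."""
    corrected = sum(1 for r in records if r["answer"] == r["true_count"])
    return corrected / len(records)


def relative_reduction(base_rate, ablated_rate):
    """Relative drop in PIH rate after ablation (0.4 means 40%)."""
    return (base_rate - ablated_rate) / base_rate


def make_records(answers, true_count=3, prompt_count=4):
    """Toy records: image shows true_count objects, prompt claims prompt_count."""
    return [
        {"true_count": true_count, "prompt_count": prompt_count, "answer": a}
        for a in answers
    ]


baseline_records = make_records([4, 4, 3, 3])  # hypothetical unablated outputs
ablated_records = make_records([4, 3, 3, 3])   # hypothetical outputs after ablation

base = pih_rate(baseline_records)
abl = pih_rate(ablated_records)
reduction = relative_reduction(base, abl)

print("PIH rate:", base, "->", abl)
print("relative reduction:", reduction)
print("correction rate:", correction_rate(baseline_records), "->",
      correction_rate(ablated_records))
```

Under these definitions an absolute drop and a relative drop can differ substantially, which is why the comment's request to name the metric matters.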

Simulated Authors' Rebuttal

3 responses · 1 unresolved

We thank the referee for the detailed and constructive report. We address each major comment below, clarifying the scope of our claims and indicating where the manuscript will be revised.

Referee: [Results (object-counting experiments)] The central claim that the ablated heads mediate general PIH is load-bearing yet rests on a single object-counting overstatement setup. No results are shown for attribute invention, spatial-relation errors, or open-ended VQA hallucinations; if the heads primarily implement token-copying or count conformity, the reported 40% reduction would be an artifact of the chosen task rather than a mechanistic insight into PIH.

Authors: We agree that the experiments are limited to a controlled object-counting task chosen to isolate prompt conformity versus visual correction through numerical discrepancy. The manuscript does not claim that the identified heads mediate every form of PIH; rather, it demonstrates a prompt-copying mechanism in this setting that reduces hallucinations by at least 40% upon ablation. We will revise the abstract, introduction, and discussion to more precisely qualify the claims as applying to prompt-induced numerical overstatement while noting the potential relevance to other textual-override failures. Comprehensive experiments on attribute invention, spatial relations, and open-ended VQA are not available and would require new data collection. revision: partial

Referee: [Methods (mechanistic analysis)] Head-selection criteria, statistical controls, and preservation of performance on unrelated tasks are not described. The abstract states that ablation reduces PIH by at least 40% across three models, but without an explicit selection procedure or controls for multiple comparisons it is impossible to verify that the reported heads are not post-hoc selections that capitalize on noise.

Authors: We appreciate this observation. Head selection was performed by ranking heads according to the magnitude of PIH reduction upon ablation while requiring that overall accuracy on the base counting task remain within 5% of the unablated model. We will expand the Methods section to document the exact ranking procedure, the number of heads evaluated per model, the statistical threshold applied, and controls for multiple comparisons. We will also add results confirming that ablation of the selected heads leaves performance on standard (non-overstated) VQA tasks statistically unchanged. revision: yes

Referee: [Discussion (model-specific characterization)] The paper reports model-specific differences in how PIH-heads mediate prompt copying, yet provides no quantitative comparison of the heads' attention patterns or activation statistics across the three VLMs that would allow readers to assess whether the differences are substantive or merely quantitative variations on the same mechanism.

Authors: We concur that quantitative cross-model comparisons would strengthen the discussion. In the revised manuscript we will add a new table and accompanying figure that report, for each model, the average attention weight on prompt tokens versus image tokens, the correlation between head activations and prompt count tokens, and the change in these statistics after ablation. These metrics will allow readers to evaluate whether the observed differences reflect distinct mechanisms or graded variations of a shared prompt-copying circuit. revision: yes
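
The selection rule described in the second simulated response (rank heads by PIH reduction, keep only those whose ablation leaves counting accuracy close to the unablated model) can be sketched as follows. This is a reading of the simulated rebuttal, not a procedure confirmed by the paper: the 5% tolerance is interpreted here as absolute accuracy points, `top_k` is an invented parameter, and the candidate measurements are hypothetical.

```python
# Sketch of filter-then-rank head selection. All values are hypothetical.

def select_heads(candidates, base_accuracy, tolerance=0.05, top_k=2):
    """Return the top_k heads by PIH reduction that preserve accuracy.

    candidates:    list of dicts with keys "head" (layer, index),
                   "pih_reduction" and "accuracy", each measured with
                   that single head ablated.
    base_accuracy: counting accuracy of the unablated model.
    tolerance:     maximum allowed absolute change in accuracy (assumption).
    """
    kept = [
        c for c in candidates
        if abs(base_accuracy - c["accuracy"]) <= tolerance
    ]
    kept.sort(key=lambda c: c["pih_reduction"], reverse=True)
    return [c["head"] for c in kept[:top_k]]


candidates = [
    {"head": (12, 3), "pih_reduction": 0.45, "accuracy": 0.88},
    {"head": (14, 7), "pih_reduction": 0.60, "accuracy": 0.70},  # hurts accuracy
    {"head": (9, 1), "pih_reduction": 0.30, "accuracy": 0.89},
    {"head": (20, 5), "pih_reduction": 0.10, "accuracy": 0.91},
]

selected = select_heads(candidates, base_accuracy=0.90)
print(selected)
```

Note that head (14, 7) has the largest PIH reduction but is excluded by the accuracy filter. Because the ranking is computed on the same data used to report the effect, a held-out split would be needed to address the referee's concern about post-hoc selection.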

standing simulated objections not resolved

Transfer experiments demonstrating that the same heads reduce hallucinations on attribute invention, spatial-relation errors, or open-ended VQA are not present in the current work.

Circularity Check

0 steps flagged

Empirical ablation study with no self-referential derivations or fitted predictions

The paper conducts a mechanistic analysis via attention-head ablations on three VLMs in a controlled object-counting task. No equations, derivations, or parameter-fitting steps are present that reduce reported effects to self-definitions, fitted inputs renamed as predictions, or self-citation chains. The central finding (PIH reduction after ablation) is measured directly from behavioral experiments and does not rely on any load-bearing self-citation or ansatz imported from prior work by the same authors. The study is self-contained against its own experimental benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that attention heads are meaningful causal units for prompt copying behavior and that the controlled counting task isolates the relevant mechanism without introducing task-specific artifacts.

axioms (1)

domain assumption: Attention heads can be causally linked to specific behavioral outputs via ablation.
Invoked when the paper treats head ablation as a direct test of mediation.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation · washburn_uniqueness_aczel · relation: unclear
Paper passage: "identify a small set of attention heads whose ablation substantially reduces prompt-induced hallucinations (PIH) by at least 40%"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

From Plausibility to Verifiability: Risk-Controlled Generative OCR with Vision-Language Models
cs.CV · 2026-03 · unverdicted · novelty 7.0
A model-agnostic Geometric Risk Controller reduces extreme errors in VLM-based OCR by requiring cross-view consensus before accepting outputs.

Dual-Pathway Circuits of Object Hallucination in Vision-Language Models
cs.CV · 2026-05 · unverdicted · novelty 6.0
Vision-language models contain identifiable grounding and hallucination pathways; suppressing the latter reduces object hallucinations by up to 76% while preserving accuracy.