How Many Counterfactuals Does It Take? Probing VLM Hallucinations Through Circuits and Causal Effects

Abhivansh Gupta; Advika Sinha; Akshat Tomar; Shreyansh Modi; Simardeep Singh

arxiv: 2606.08777 · v1 · pith:7HAXDK4Znew · submitted 2026-06-07 · 💻 cs.LG · cs.AI

Pith reviewed 2026-06-27 18:43 UTC · model grok-4.3

classification: cs.LG, cs.AI

keywords: visual language models, hallucinations, counterfactuals, causal influence, circuit discovery, sample complexity, activation patching

The pith

A causal influence metric yields bounds on the minimum number of counterfactual samples needed to detect instability in VLM hallucinations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper derives bounds on the smallest number of counterfactual samples required to reliably identify unstable hallucinated predictions in visual language models. It defines a causal influence metric from differences in log probabilities across factual, counterfactual, and activation-patched model runs. Circuit discovery identifies the relevant components, and concentration inequalities applied to the metric's variance produce the sample complexity bound. A reader would care because the bound gives a concrete, data-driven way to test whether a VLM's hallucination is robust or fragile under small input changes.

Core claim

The paper claims that by measuring a causal influence metric on log-probability differences and estimating its variance across samples, concentration inequalities can be used to bound the minimum number m of counterfactual samples needed to detect instability in a VLM's hallucinated output with high probability.
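
One standard way to turn a variance into a sample count is Bernstein's inequality, which the simulated rebuttal further down names as the inequality applied in Section 4. The statement below is the textbook form for i.i.d. bounded variables, not the paper's own derivation. The symbols $\Delta_k$ (per-sample causal influence), $\mu$, $\sigma^2$, $b$, $\varepsilon$ and $\delta$ are assumptions introduced here for illustration.

```latex
% Assumed setting: \Delta_1,\dots,\Delta_m i.i.d. with mean \mu, variance \sigma^2,
% and |\Delta_k - \mu| \le b almost surely.
\Pr\!\left(\left|\frac{1}{m}\sum_{k=1}^{m}\Delta_k-\mu\right|\ge\varepsilon\right)
\;\le\; 2\exp\!\left(-\frac{m\,\varepsilon^{2}}{2\sigma^{2}+\tfrac{2}{3}\,b\,\varepsilon}\right)
\quad\Longrightarrow\quad
m\;\ge\;\frac{\left(2\sigma^{2}+\tfrac{2}{3}\,b\,\varepsilon\right)\ln(2/\delta)}{\varepsilon^{2}}
```

The implication holds for the true $\sigma^2$ and $b$. Once $\sigma^2$ is replaced by an empirical estimate, the guarantee no longer follows directly from this statement.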

What carries the argument

The causal influence metric computed from log-probability differences between factual, counterfactual, and activation-patched runs, applied to components found by circuit discovery (CD-T).
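
A minimal sketch of how such a metric could be computed once the three log-probabilities of the target token are available. The abstract does not give the paper's exact formula, so the per-sample delta (patched minus counterfactual) and the normalised recovery fraction are assumptions, as are the function names `causal_deltas` and `recovery_fraction`.

```python
def causal_deltas(logp_cf, logp_patched):
    """Per-sample change in target log-probability caused by patching.

    Assumed definition: delta_k = logp_patched[k] - logp_cf[k].
    """
    if len(logp_cf) != len(logp_patched):
        raise ValueError("inputs must have the same length")
    return [p - c for c, p in zip(logp_cf, logp_patched)]


def recovery_fraction(logp_f, logp_cf, logp_patched):
    """Share of the factual-vs-counterfactual gap recovered by patching.

    Assumed normalisation; returns 0.0 when the two runs already agree.
    """
    gap = logp_f - logp_cf
    if gap == 0:
        return 0.0
    return (logp_patched - logp_cf) / gap
```

A recovery fraction near 1 would mean that patching the circuit restores most of the factual confidence; a value near 0 would mean the circuit carries little of the effect under this assumed definition.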

Load-bearing premise

The defined causal influence metric based on log-probability differences and the observed variance of this metric are sufficient to apply concentration inequalities and obtain valid bounds on the required number of samples m.

What would settle it

Running the method on a VLM with a known hallucination and finding that the computed m does not detect the instability when using that many samples, or that the empirical variance leads to bounds that are violated by actual instability rates.
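
A check of this kind can be rehearsed on synthetic data before touching a model. The sketch below assumes a hypothetical sampler that stands in for the per-sample causal influence and a known true mean; it counts how often the average of `m` draws misses that mean by more than `eps`. The function name `violation_rate` and all parameters are illustrative, not taken from the paper.

```python
import random


def violation_rate(sample, true_mean, m, eps, trials=200, seed=0):
    """Fraction of trials in which the m-sample mean misses true_mean by more than eps.

    `sample` is any callable taking a random.Random and returning one draw.
    """
    rng = random.Random(seed)
    bad = 0
    for _ in range(trials):
        estimate = sum(sample(rng) for _ in range(m)) / m
        if abs(estimate - true_mean) > eps:
            bad += 1
    return bad / trials


def uniform_draw(rng):
    # Stand-in distribution (assumption): bounded, mean 0, variance 1/3.
    return rng.uniform(-1.0, 1.0)


rate = violation_rate(uniform_draw, true_mean=0.0, m=74, eps=0.2)
```

If a bound promises failure probability at most `delta` for a given `m`, a measured rate well above `delta` on data that meets the bound's assumptions would count as evidence against it.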

Figures

Figures reproduced from arXiv: 2606.08777 by Abhivansh Gupta, Advika Sinha, Akshat Tomar, Shreyansh Modi, Simardeep Singh.

**Figure 1.** Overall framework.

2.3. Causal Effect Estimation. For each counterfactual sample, targeted activation patching is applied over the discovered circuit $S^\star$. Let $A^{\mathrm{cf},k}_v$ denote the activation of node $v$ during the counterfactual run and let $A^{f}_v$ denote the corresponding factual activation. The counterfactual activation is replaced with its factual counterpart, and the resulting change in model confidence is m…

**Figure 3.** Evolution of the estimated circuit-level sample complexity $\hat{m}$ as the number of retained counterfactual interventions increases. Shaded regions denote the empirical variance envelope computed from node-wise causal deltas within the discovered CD-T circuit. Models exhibiting lower asymptotic $\hat{m}$ require fewer counterfactual samples for stable causal estimation, indicating more robust and grounded internal re…

**Figure 4.** Visualization of the uncertainty associated with the estimated circuit-level sample complexity $\bar{m}$. The spread of each distribution reflects the variability induced by counterfactual interventions and node-wise causal estimation. Narrower distributions indicate more stable circuit behavior, while broader distributions suggest higher causal uncertainty and increased hallucination susceptibility.

**Figure 5.** Counterfactual generation pipeline where latent visual embeddings are perturbed using Gaussian interventions and transformed into alternative visual tokens for LLM reasoning.

The counterfactual generator $G_\phi$ maps a factual multimodal input to a family of latent counterfactual representations. Rather than performing explicit pixel-level editing, the generator operates in the embedding space of the vision enco…

**Figure 6.** Circuit-performance sensitivity under varying outlier budgets during backward-search circuit discovery. Increasing the number of retained outliers expands the discovered circuit and changes the trade-off between causal recovery and reverse faithfulness. Intermediate budgets achieve the strongest recovery with relatively compact circuits, while larger budgets induce oversized circuits with poor recovery, su…

**Figure 7.** Circuit-budget sensitivity analysis across four search strategies: GLOBAL, LAYER NORM, ITER DECAY, and RANDOM NOISE. The left panel reports faithfulness recovery as the number of retained nodes increases, the middle panel tracks the remaining hallucination signal after ablation, and the right panel shows post-ablation accuracy recovery. The global strategy consistently yields the most faithful and stable c…

**Figure 8.** Minimal schematic of the circuit tracing procedure used to identify sparse causal pathways underlying the target prediction.

… selected nodes reach Layer 0 or no valid promoter candidates remain, yielding a sparse, causally-grounded subgraph of the full attention head network.

A.6. Experimental Ablations and Trade-offs. Convergence of Sample Complexity Bounds (K): We evaluate the sensitivity of our circuit-le…
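
The patching step quoted alongside Figure 1 (replace a circuit node's counterfactual activation with its factual one, then measure the change in confidence) can be illustrated on a toy model. Everything below is an assumption made for illustration: the model is a single layer of tanh units feeding a sigmoid, not a VLM, and `forward` and `patched_delta` are hypothetical names.

```python
import math


def forward(x, weights, patch=None):
    """Toy model: hidden[v] = tanh(weights[v] * x); output = log sigmoid(sum(hidden)).

    `patch` maps a node index to an activation that overrides the computed one.
    """
    hidden = [math.tanh(w * x) for w in weights]
    if patch:
        for v, activation in patch.items():
            hidden[v] = activation
    logit = sum(hidden)
    logp = -math.log1p(math.exp(-logit))  # log-probability of the target
    return logp, hidden


def patched_delta(x_factual, x_counterfactual, weights, circuit):
    """Change in target log-probability when circuit nodes in the
    counterfactual run receive their factual activations."""
    _, hidden_f = forward(x_factual, weights)
    logp_cf, _ = forward(x_counterfactual, weights)
    patch = {v: hidden_f[v] for v in circuit}
    logp_patched, _ = forward(x_counterfactual, weights, patch=patch)
    return logp_patched - logp_cf
```

In this toy, patching every node reproduces the factual output exactly, and patching none leaves the counterfactual output unchanged. A sparse circuit would fall somewhere between the two.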

Original abstract

Visual Language Models (VLMs) are known to produce hallucinated predictions that are not grounded in visual evidence, yet existing approaches lack a principled understanding of how robust such predictions are under counterfactual perturbations. In this work, we study the sample complexity of counterfactual robustness for hallucinated outputs in VLMs. We define a causal influence metric based on log-probability differences between factual, counterfactual, and activation-patched runs, and use it to characterize the stability of hallucinated predictions. By leveraging circuit discovery techniques (CD-T), we identify model components responsible for these predictions and track their activation differences across counterfactual samples. We then derive empirical bounds on the minimum number of counterfactual samples m required to reliably detect instability in hallucinated outputs, using concentration inequalities and variance estimates of the causal influence distribution.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper defines a new causal influence metric on log-prob differences for VLM hallucinations, combines it with circuit discovery, and claims empirical bounds on the number of counterfactual samples needed, but the bounds rest on variance estimates whose tail properties are not checked.

The main contribution is a causal influence metric that looks at log-probability shifts between factual inputs, counterfactuals, and activation-patched runs, then applies CD-T circuit discovery to track which components drive hallucinated outputs. From there they estimate variance of this metric and plug it into concentration inequalities to bound the minimum m needed to detect instability.

This is a reasonable extension of existing circuit work to the hallucination setting. The choice to focus on sample complexity for detection is practical, and treating the problem through explicit causal interventions rather than post-hoc correlation is cleaner than many current VLM robustness papers.

The soft spot is exactly the one flagged in the stress test. The bounds use empirical variance from the same counterfactual runs without any separate argument that the metric is sub-Gaussian, has bounded moments, or that the variance estimate itself is stable enough for the inequality to hold. The abstract gives no derivation, no simulation check on the tails, and no discussion of post-hoc selection effects. That makes the central claim about reliable detection of instability hard to evaluate and potentially optimistic.
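
A rough diagnostic for the missing tail check could look like the sketch below. It is a heuristic screen, not a proof of sub-Gaussianity: it reports the excess kurtosis of the observed scores and compares empirical tail frequencies with the reference bound 2·exp(-t²/2), using the sample standard deviation as the scale. The function names and the choice of reference bound are assumptions.

```python
import math
from statistics import mean, pstdev


def excess_kurtosis(xs):
    """Fourth standardised moment minus 3; large positive values suggest heavy tails."""
    m, s = mean(xs), pstdev(xs)
    if s == 0:
        raise ValueError("all values are identical")
    return sum(((x - m) / s) ** 4 for x in xs) / len(xs) - 3.0


def tail_check(xs, thresholds=(2.0, 3.0, 4.0)):
    """For each t, compare the share of |x - mean| > t * std with 2 * exp(-t^2 / 2).

    Returns a list of (t, empirical_share, reference_bound, exceeded) tuples.
    """
    m, s = mean(xs), pstdev(xs)
    rows = []
    for t in thresholds:
        share = sum(1 for x in xs if abs(x - m) > t * s) / len(xs)
        bound = min(1.0, 2.0 * math.exp(-t * t / 2.0))
        rows.append((t, share, bound, share > bound))
    return rows
```

Any threshold flagged as exceeded, or a large positive kurtosis, would be a reason to distrust a Hoeffding- or Bernstein-style bound computed from the same scores.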

The work is aimed at people already working on mechanistic interpretability and causal analysis of VLMs. A reader who wants concrete tools for measuring hallucination robustness will find an idea worth trying, but anyone planning to rely on the reported m values will need the full statistical justification first.

I would send it to peer review. The framing and the metric are worth referee attention even if the concentration step needs substantial additional validation or a different approach.

Referee Report

2 major / 1 minor

Summary. The paper defines a causal influence metric based on log-probability differences between factual, counterfactual, and activation-patched runs in VLMs to characterize stability of hallucinated predictions. It applies circuit discovery (CD-T) to identify responsible model components and derives empirical bounds on the minimum number m of counterfactual samples needed to reliably detect instability, using concentration inequalities together with variance estimates of the causal influence distribution.

Significance. If the bounds hold, the work supplies a sample-complexity analysis for counterfactual robustness testing of VLM hallucinations, which could inform more reliable evaluation protocols and component-level interventions in multimodal models.

major comments (2)

[Abstract] The central claim that concentration inequalities applied to observed variance of the causal influence metric yield valid, non-vacuous bounds on m is presented without derivation details, validation experiments, error analysis, or checks against post-hoc selection of runs; this prevents verification that the metric satisfies the tail conditions (sub-Gaussianity or bounded moments) required by Hoeffding/Bernstein-type inequalities.
[Derivation of bounds on m] Variance estimates are obtained from the same counterfactual runs used to claim the bound, raising a circularity concern; without an independent argument or separate validation set that the metric's distribution meets the inequality preconditions, the resulting m may be optimistically small.

minor comments (1)

Clarify whether the causal influence metric is computed on the same samples used for variance estimation or on held-out data.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments. We address the concerns about the abstract presentation and the derivation of bounds on m below, and we will make revisions to improve clarity and rigor.

Referee: [Abstract] The central claim that concentration inequalities applied to observed variance of the causal influence metric yield valid, non-vacuous bounds on m is presented without derivation details, validation experiments, error analysis, or checks against post-hoc selection of runs; this prevents verification that the metric satisfies the tail conditions (sub-Gaussianity or bounded moments) required by Hoeffding/Bernstein-type inequalities.

Authors: We agree the abstract is high-level and omits supporting details. The full derivation appears in Section 4, applying Bernstein's inequality after empirical variance estimation from the causal influence scores. We did not include explicit tail-condition validation or post-hoc selection analysis in the original submission. We will add a new subsection with empirical CDF plots against sub-Gaussian references, error bounds, and discussion of run selection to allow verification.
Revision: yes

Referee: [Derivation of bounds on m] Variance estimates are obtained from the same counterfactual runs used to claim the bound, raising a circularity concern; without an independent argument or separate validation set that the metric's distribution meets the inequality preconditions, the resulting m may be optimistically small.

Authors: The circularity concern is valid: variance is computed from the same counterfactual samples. While the bounds are presented as empirical rather than strict theoretical guarantees, we will revise Section 4 to use a two-stage approach with a held-out pilot set for variance estimation (or bootstrap resampling) before applying the concentration inequality on the primary set. This provides a clearer separation and reduces optimism in the reported m.
Revision: yes
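
The two-stage idea in this response can be sketched as follows, under stated assumptions. The pilot split, the use of the pilot's largest deviation as the range term `b`, and the textbook Bernstein form are all illustrative choices rather than the authors' procedure. `bernstein_m` and `two_stage_m` are hypothetical names.

```python
import math
from statistics import mean, variance


def bernstein_m(var, b, eps, delta):
    """Smallest integer m with (2*var + 2*b*eps/3) * ln(2/delta) / eps^2 <= m."""
    if eps <= 0 or not 0 < delta < 1:
        raise ValueError("need eps > 0 and 0 < delta < 1")
    return math.ceil((2.0 * var + 2.0 * b * eps / 3.0) * math.log(2.0 / delta) / eps ** 2)


def two_stage_m(deltas, pilot_frac, eps, delta):
    """Estimate variance on a held-out pilot prefix, then size the primary set.

    Returns (required_m, primary_set_is_large_enough).
    """
    n_pilot = int(len(deltas) * pilot_frac)
    if n_pilot < 2:
        raise ValueError("pilot set needs at least two samples")
    pilot, primary = deltas[:n_pilot], deltas[n_pilot:]
    mu = mean(pilot)
    var = variance(pilot)                 # sample variance from the pilot only
    b = max(abs(x - mu) for x in pilot)   # assumption: pilot range stands in for a true bound
    m = bernstein_m(var, b, eps, delta)
    return m, len(primary) >= m
```

Keeping the pilot samples out of the primary estimate separates variance estimation from detection, though the plug-in range `b` remains an empirical quantity and can still understate the true tail.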

Circularity Check

0 steps flagged

No significant circularity in derivation chain

The paper claims to derive empirical bounds on m via concentration inequalities applied to variance estimates of a causal influence metric defined from log-probability differences. The abstract and context provide no equations, self-citations, or explicit reductions showing that the variance estimates are taken from the identical runs in a manner that forces the bound by construction (e.g., no fitted parameter renamed as prediction or self-definitional loop). No load-bearing self-citation chains, uniqueness theorems, or ansatz smuggling appear. The method is presented as an empirical estimation procedure whose validity rests on standard concentration results rather than internal redefinition, making the derivation self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The work introduces one new metric and relies on standard concentration inequalities; no explicit free parameters or invented physical entities are stated in the abstract.

axioms (2)

domain assumption: The causal influence metric based on log-probability differences between factual, counterfactual, and activation-patched runs accurately reflects stability of hallucinated predictions.
This metric is defined as the foundation for all subsequent analysis and bounds.
domain assumption: Variance estimates of the causal influence distribution permit direct application of concentration inequalities without additional tail or dependence assumptions.
Invoked when deriving the empirical bounds on m.

invented entities (1)

Causal influence metric (no independent evidence)
purpose: Quantify stability of hallucinated VLM outputs under counterfactual and activation-patched interventions.
Newly defined quantity central to the sample-complexity claim.



Reference graph

Works this paper leans on

17 extracted references · 1 linked inside Pith

[1] Have Faith in Faithfulness: Going Beyond Circuit Overlap When Finding Model Mechanisms. 2024.
[2] Efficient Automated Circuit Discovery in Transformers using Contextual Decomposition. 2025.
[3] Towards Automated Circuit Discovery for Mechanistic Interpretability. 2023.
[4] Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small. 2022.
[5] Evaluating Object Hallucination in Large Vision-Language Models. 2023.
[6] Abbasnejad, Ehsan and Teney, Damien and Parvaneh, Amin and Shi, Javen and Hengel, Anton. Counterfactual Vision and Language Learning. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[7] Treble Counterfactual VLMs: A Causal Approach to Hallucination. 2025.
[8] Counterfactual Explanations and Algorithmic Recourses for Machine Learning: A Review. 2022.
[9] A Survey on Hallucination in Large Vision-Language Models. 2024.
[10] THRONE: An Object-based Hallucination Benchmark for the Free-form Generations of Large Vision-Language Models. 2025.
[11] Reducing Hallucinations in Vision-Language Models via Latent Space Steering. 2024.
[12] Anastasios N. Angelopoulos and Stephen Bates. CoRR, 2021. arXiv:2107.07511.
[13] Geng, Jiahui and Cai, Fengyu and Wang, Yuxia and Koeppl, Heinz and Nakov, Preslav and Gurevych, Iryna. A Survey of Confidence Estimation and Calibration in Large Language Models.
[14] Guan, Tianrui and Liu, Fuxiao and Wu, Xiyang and Xian, Ruiqi and Li, Zongxia and Liu, Xiaoyu and Wang, Xijun and Chen, Lichang and Huang, Furong and Yacoob, Yaser and Manocha, Dinesh and Zhou, Tianyi. Hallusionbench: An Advanced Diagnostic Suite for Entangled Language Hallucination and Visual Illusion in Large Vision-Language Models.
[15] Microsoft COCO: Common Objects in Context. Computer Vision – ECCV 2014. 2014.
[16] Clip Body and Tail Separately: High Probability Guarantees for DPSGD with Heavy Tails. 2024.
[17] Understanding and mitigating hallucination in large vision-language models via modular attribution and intervention. The Thirteenth International Conference on Learning Representations.
