arxiv: 2605.19250 · v1 · pith:P7BNJBU7new · submitted 2026-05-19 · 💻 cs.AI

Causal Evidence for Attention Head Imbalance in Modality Conflict Hallucination

Pith reviewed 2026-05-20 06:21 UTC · model grok-4.3

classification 💻 cs.AI
keywords modality conflict hallucination · attention heads · causal intervention · multimodal large language models · path patching · hallucination reduction · MACI

The pith

Attention head imbalance in multimodal models favors erroneous text over visual evidence during generation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper asks why multimodal large language models sometimes follow contradictory text instead of visual input and traces the cause to internal attention mechanisms. Using path patching on five open-source models, it separates attention heads into those that causally promote hallucinations and those that resist them. The driving heads turn out to be more numerous and collectively stronger, while resisting heads are fewer but individually potent. This creates a structural tilt toward the wrong premise. The authors then build a conditional intervention that suppresses only the driving heads when conflict appears, producing the largest drop in hallucinations among tested methods.

Core claim

Across five open-source MLLMs, hallucination-driving attention heads are more broadly distributed and carry greater aggregate causal weight than hallucination-resisting heads, forming an imbalanced routing structure that biases generation toward erroneous textual premises; conditional suppression of the driving heads via MACI yields the largest hallucination reduction on the MMMC benchmark among compared baselines.

What carries the argument

Path-patching identification of hallucination-driving versus hallucination-resisting attention heads and the resulting distributed-versus-localized imbalance in their causal effects on token prediction.
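
The head-level patching step can be sketched on a toy model. This is a minimal illustration, not the authors' implementation: each head's activation is collapsed to a scalar, the readout `hallucination_advantage` is a plain sum, and the sign convention (a positive score marks a driving head) is an assumption. All names and numbers are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Head = Tuple[int, int]    # (layer, head index)
Acts = Dict[Head, float]  # toy scalar "activation" per head (assumption)


def hallucination_advantage(acts: Acts) -> float:
    """Toy readout: logit(erroneous answer) minus logit(grounded answer)."""
    return sum(acts.values())


def patch_scores(conflict: Acts, clean: Acts,
                 readout: Callable[[Acts], float]) -> Dict[Head, float]:
    """Patch one head at a time with its clean-run value.

    score > 0: patching lowers the advantage, so the head was driving.
    score < 0: patching raises the advantage, so the head was resisting.
    """
    base = readout(conflict)
    scores: Dict[Head, float] = {}
    for head in conflict:
        patched = dict(conflict)     # copy the conflict run
        patched[head] = clean[head]  # swap in the clean-run activation
        scores[head] = base - readout(patched)
    return scores


def split_heads(scores: Dict[Head, float]) -> Tuple[List[Head], List[Head]]:
    """Partition heads by the sign of their patching score."""
    driving = sorted(h for h, s in scores.items() if s > 0)
    resisting = sorted(h for h, s in scores.items() if s < 0)
    return driving, resisting


# Made-up activations for a 2-layer, 2-head toy model.
conflict_run: Acts = {(0, 0): 0.5, (0, 1): 0.25, (1, 0): 0.25, (1, 1): -1.5}
clean_run: Acts = {(0, 0): 0.0, (0, 1): 0.0, (1, 0): 0.0, (1, 1): -0.5}

scores = patch_scores(conflict_run, clean_run, hallucination_advantage)
driving, resisting = split_heads(scores)
```

In a real model the readout would be a forward pass with the patched head output written back into the residual stream, so scores need not add up across heads.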

If this is right

  • The imbalance appears consistently across the five tested open-source MLLMs.
  • Conditional suppression of driving heads improves the hallucination-accuracy trade-off compared with unconditional or random interventions.
  • The same intervention transfers zero-shot to the SCI-SemanticConflict test.
  • Ablation experiments confirm that driving and resisting heads exert opposing effects on generation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Training objectives could be adjusted to strengthen the aggregate weight of resisting heads relative to driving heads.
  • Similar head-level imbalances may exist for other hallucination types such as object or attribute errors.
  • The routing structure identified here could be monitored at inference time as an early-warning signal for modality conflicts.

Load-bearing premise

Path patching isolates the true causal contribution of each individual attention head to the final output without substantial interference from other heads or from the chosen patching values.

What would settle it

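A control on the MMMC benchmark that applies the same suppression to randomly selected heads instead of the causally identified driving heads. No measurable reduction in modality-conflict hallucinations under the random control would support the head identification; a reduction comparable to the targeted intervention would undercut it.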
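
A random-head control of this kind can be sketched as follows. The sketch assumes an additive toy model in which each head has a fixed effect on the hallucination rate when suppressed; the effect values, head names, and the `reduction` helper are all hypothetical and are not taken from the paper.

```python
import random
from typing import Dict, List


def reduction(effects: Dict[str, float], suppressed: List[str]) -> float:
    """Toy additive model: total hallucination reduction from suppressing heads."""
    return sum(effects[h] for h in suppressed)


# Made-up per-head effects: positive means suppressing the head lowers
# hallucination (driving), negative means it raises hallucination (resisting).
effects: Dict[str, float] = {
    "h0": 0.5, "h1": 0.25, "h2": 0.25, "h3": -1.0,
    "h4": 0.0, "h5": 0.0, "h6": 0.125, "h7": -0.25,
}

k = 3
# Targeted intervention: the k heads with the largest positive effect.
targeted_heads = sorted(effects, key=effects.get, reverse=True)[:k]
targeted = reduction(effects, targeted_heads)

# Random control: same number of heads, drawn uniformly without replacement.
rng = random.Random(0)
names = sorted(effects)
random_reductions = [reduction(effects, rng.sample(names, k)) for _ in range(1000)]

# Share of random draws that match or beat the targeted set.
share_as_good = sum(r >= targeted for r in random_reductions) / len(random_reductions)
```

A small `share_as_good` indicates that the targeted set is doing work a same-sized random set does not; a large one would point the other way.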

Figures

Figures reproduced from arXiv: 2605.19250 by Jinrui Jiang, Xinyu Dai, Zhangtai Wu, Zhen Wu.

Figure 1. Head-level path patching. Top (Conflict run): the model is biased toward the erroneous textual premise. Middle (Clean run): the model identifies the visual evidence given an unbiased query. Bottom (Patching): substituting head (l, i)'s activation with its clean-run counterpart and measuring the change in hallucination advantage indicates whether the head drives or resists hallucination.
Figure 2. Hallucination-driving (H+, red) and hallucination-resisting (H−, blue) heads in Qwen2.5-VL-7B. Top: layer-wise importance and per-layer sums. Bottom: ranked heads and cumulative importance. Results for all models are in Appendix B.
[Bar chart: Hallucination Rate (%), 0 to 100, for Qwen2.5-VL-7B, Qwen3-VL-8B, InternVL3-8B, LLaVA-NeXT-7B, and LLaVA-7B under Base, Prune-Random, Prune-D, Prune-Both, and Prune-R.]
Figure 3. Causal validation by head ablation. Hallucination rate (%) under Base, …
Figure 4. Trade-off between hallucination-rate reduction and non-conflict accuracy
Figure 5. Head importance distributions for the remaining four models.

Original abstract

Modality-conflict hallucination occurs when multimodal large language models (MLLMs) prioritize erroneous textual premises over contradictory visual evidence. To understand why visual evidence fails to prevail during generation, we take a mechanistic perspective and examine which internal components drive or resist this failure. We perform head-level causal analysis using path patching across five open-source MLLMs and identify two groups of attention heads with opposing causal roles: hallucination-driving heads and hallucination-resisting heads. We find a consistent asymmetry: driving effects are more broadly distributed and carry greater aggregate weight, whereas resisting effects concentrate in a small number of high-importance heads. Ablation experiments further confirm that these groups exert opposing effects during generation: distributed driving influence and localized resistance together form an imbalanced routing structure that biases generation toward the erroneous premise. Motivated by this finding, we propose MACI (Modality-conflict-Aware Causal Intervention), a conditional intervention that suppresses causally identified hallucination-driving heads only when conflict is detected. Across five MLLMs, MACI achieves the largest hallucination reduction among compared inference-time baselines on the MMMC benchmark with a favorable hallucination-accuracy trade-off, and transfers zero-shot to the SCI-SemanticConflict test.
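
The conditional nature of the intervention can be illustrated with a short sketch. The source does not specify the conflict detector or the exact suppression rule, so everything here is an assumption: `conflict_score` stands in for an unspecified detector output, `threshold` and `scale` are hypothetical parameters, and head outputs are collapsed to scalars.

```python
from typing import Dict, Iterable, Tuple

Head = Tuple[int, int]  # (layer, head index)


def conditional_suppress(head_outputs: Dict[Head, float],
                         driving_heads: Iterable[Head],
                         conflict_score: float,
                         threshold: float = 0.5,
                         scale: float = 0.0) -> Dict[Head, float]:
    """Scale down driving heads only when a conflict is flagged.

    conflict_score: output of a hypothetical conflict detector.
    scale: 0.0 removes a driving head's output, 1.0 leaves it unchanged.
    """
    if conflict_score < threshold:
        # No conflict detected: leave the forward pass untouched.
        return dict(head_outputs)
    driving = set(driving_heads)
    return {h: (v * scale if h in driving else v)
            for h, v in head_outputs.items()}


outputs = {(0, 0): 0.5, (1, 1): -1.5}
no_conflict = conditional_suppress(outputs, [(0, 0)], conflict_score=0.1)
with_conflict = conditional_suppress(outputs, [(0, 0)], conflict_score=0.9)
```

Gating on the detector is what separates this from unconditional pruning: inputs without conflict pass through unchanged, which is one plausible reason for the reported hallucination-accuracy trade-off.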

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 3 minor

Summary. The manuscript claims that modality-conflict hallucination in MLLMs arises from an imbalanced attention-head routing structure: across five open-source models, hallucination-driving heads identified via path patching are more broadly distributed and carry greater aggregate causal weight than hallucination-resisting heads. Ablations confirm opposing effects, and the authors propose MACI, a conditional intervention that suppresses driving heads only on detected conflict, yielding the largest hallucination reduction on MMMC among baselines while preserving accuracy and transferring zero-shot to SCI-SemanticConflict.

Significance. If the path-patching results prove robust, the work supplies concrete causal evidence for why visual evidence is overridden by textual premises and demonstrates a practical, targeted mitigation strategy. Consistency across five models and the ablation confirmation are strengths; the favorable hallucination-accuracy trade-off of MACI would be a useful contribution to inference-time hallucination control if the underlying head classifications are stable.

major comments (1)
  1. [path-patching protocol and definition of driving versus resisting heads] The central claim of an imbalanced routing structure (broader distribution and higher aggregate causal weight for driving heads) depends on the assumption that single-head path patching isolates each head's causal contribution to final-token prediction. Because heads interact through the residual stream, the effect of patching one head's clean-run activation into the conflict run can be compensated by the remaining heads or altered by the specific clean activation chosen as the patch value. Without reported controls such as joint patching of candidate head sets or comparisons across multiple patch sources, the reported asymmetry risks being an artifact of the intervention rather than a stable computational property.
minor comments (3)
  1. The abstract and methods description provide no quantitative effect sizes, error bars, or statistical tests for the reported differences in distribution and aggregate weight between driving and resisting heads.
  2. Details on the conflict detector used inside MACI (how conflict is detected at inference time and the precise suppression rule) are not specified, making it difficult to assess reproducibility or failure modes.
  3. The manuscript would benefit from explicit comparison of the path-patching results against alternative causal methods (e.g., activation patching with multiple source runs or attribution patching) to strengthen the claim that the identified imbalance is method-independent.
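
The compensation concern in the major comment can be made concrete with a toy example. The readout below is deliberately non-additive and is an assumption made for illustration: heads "a" and "b" are redundant, so patching either one alone changes nothing, while patching both together reveals their joint effect. All names and numbers are hypothetical.

```python
from typing import Dict, Iterable

Acts = Dict[str, float]


def readout(acts: Acts) -> float:
    """Toy hallucination advantage with two redundant heads, "a" and "b"."""
    return max(acts["a"], acts["b"]) + acts["c"]


conflict: Acts = {"a": 1.0, "b": 1.0, "c": 0.5}
clean: Acts = {"a": 0.0, "b": 0.0, "c": 0.0}


def patch_effect(heads: Iterable[str]) -> float:
    """Drop in advantage after patching the given heads with clean values."""
    patched = dict(conflict)
    for h in heads:
        patched[h] = clean[h]
    return readout(conflict) - readout(patched)


single = {h: patch_effect([h]) for h in conflict}  # one head at a time
joint_ab = patch_effect(["a", "b"])                # both redundant heads
```

Here single-head patching assigns zero importance to "a" and "b" even though they jointly carry most of the effect, which is the kind of artifact that joint patching of candidate sets would expose.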

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on our path-patching analysis. We address the methodological concern in detail below and describe the revisions we will undertake.

Point-by-point responses
  1. Referee: [path-patching protocol and definition of driving versus resisting heads] The central claim of an imbalanced routing structure (broader distribution and higher aggregate causal weight for driving heads) depends on the assumption that single-head path patching isolates each head's causal contribution to final-token prediction. Because heads interact through the residual stream, the effect of patching one head's clean-run activation into the conflict run can be compensated by the remaining heads or altered by the specific clean activation chosen as the patch value. Without reported controls such as joint patching of candidate head sets or comparisons across multiple patch sources, the reported asymmetry risks being an artifact of the intervention rather than a stable computational property.

    Authors: We appreciate the referee's observation regarding residual-stream interactions and the assumptions underlying single-head path patching. Our protocol follows the standard single-head intervention used in mechanistic interpretability to attribute effects to individual components. While compensation by other heads is possible in principle, we validated the opposing roles through group-level ablation experiments that intervene simultaneously on the full sets of driving and resisting heads; these collective interventions confirm the distributed driving influence and localized resistance, thereby providing evidence that the asymmetry is not solely an artifact of isolated patching. The same imbalance pattern is reproduced across five architecturally distinct MLLMs, further supporting that the finding reflects a stable property rather than a patching-source artifact. In the revised manuscript we will expand the Methods section to explicitly discuss the residual-stream interaction concern, clarify why single-head patching was chosen for head identification, and explain how the group ablations serve as a control for collective effects. We will also add a dedicated limitations paragraph addressing the assumptions of the protocol.

    revision: partial

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper applies the established external technique of path-patching to quantify the causal effect of each attention head on hallucination rates under modality conflict. Heads are then partitioned into driving and resisting groups according to the sign of those measured effects; the reported broader distribution and higher aggregate causal weight of the driving group is an empirical summary statistic computed directly from the same set of intervention results across the five models. This constitutes an observation about the measured distribution rather than a quantity defined in terms of itself or a parameter fitted and then relabeled as a prediction. The subsequent ablation checks and the design of MACI follow from these measurements without any self-referential reduction or load-bearing self-citation that would make the central claim equivalent to its inputs by construction. The derivation therefore remains self-contained against external benchmarks.
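
The summary statistics described above (head counts, aggregate weight, and per-head strength for each group) can be computed directly from signed patching scores. The sketch below uses made-up scores and a hypothetical helper `imbalance_summary`; it assumes a positive score marks a driving head and a negative score a resisting head.

```python
from typing import Dict, Hashable


def imbalance_summary(scores: Dict[Hashable, float]) -> Dict[str, float]:
    """Summarize driving (score > 0) versus resisting (score < 0) heads."""
    driving = [s for s in scores.values() if s > 0]
    resisting = [-s for s in scores.values() if s < 0]  # use magnitudes
    return {
        "n_driving": len(driving),
        "n_resisting": len(resisting),
        "sum_driving": sum(driving),
        "sum_resisting": sum(resisting),
        "mean_driving": sum(driving) / len(driving) if driving else 0.0,
        "mean_resisting": sum(resisting) / len(resisting) if resisting else 0.0,
    }


# Made-up scores: many moderate driving heads, one strong resisting head.
toy_scores = {"h0": 0.5, "h1": 0.25, "h2": 0.25, "h3": 0.5,
              "h4": 0.5, "h5": -1.0, "h6": 0.0}
summary = imbalance_summary(toy_scores)
```

In this toy case the driving group wins on count and aggregate weight while the resisting head is stronger individually, which mirrors the distributed-versus-localized pattern the paper reports without reproducing any of its numbers.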

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claims rest on the validity of path patching as a causal probe and on the assumption that the MMMC benchmark isolates modality conflict without confounding factors. No explicit free parameters or invented entities are described in the abstract; the conflict detector inside MACI is an unstated modeling choice.

axioms (1)
  • domain assumption: Path patching on attention heads isolates their causal contribution to the final generation without substantial side effects from other components.
    Invoked when the authors label heads as driving or resisting on the basis of patching outcomes.

pith-pipeline@v0.9.0 · 5751 in / 1500 out tokens · 46001 ms · 2026-05-20T06:21:28.387167+00:00 · methodology
