Causal Evidence for Attention Head Imbalance in Modality Conflict Hallucination
Pith reviewed 2026-05-20 06:21 UTC · model grok-4.3
The pith
Attention head imbalance in multimodal models favors erroneous text over visual evidence during generation.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Across five open-source MLLMs, hallucination-driving attention heads are more broadly distributed and carry greater aggregate causal weight than hallucination-resisting heads, forming an imbalanced routing structure that biases generation toward erroneous textual premises; conditional suppression of the driving heads via MACI yields the largest hallucination reduction on the MMMC benchmark among compared baselines.
What carries the argument
Path-patching identification of hallucination-driving versus hallucination-resisting attention heads and the resulting distributed-versus-localized imbalance in their causal effects on token prediction.
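A minimal sketch of how a signed per-head effect could be scored and used to split heads into the two groups, assuming a toy model in which every head adds a scalar directly to the logit of the erroneous token. The names `hallucination_logit`, `head_effects`, and the activation values are hypothetical, and plain activation patching stands in for path patching; the two coincide only in this one-layer additive toy, not in a real MLLM.

```python
from typing import Dict, Tuple

Head = Tuple[int, int]  # (layer, head index)


def hallucination_logit(head_outputs: Dict[Head, float]) -> float:
    """Toy readout: the erroneous-token logit is the sum of head contributions."""
    return sum(head_outputs.values())


def head_effects(
    conflict_run: Dict[Head, float], clean_run: Dict[Head, float]
) -> Dict[Head, float]:
    """Effect of a head = baseline logit minus the logit after that head's
    output is replaced by its value from the clean run."""
    base = hallucination_logit(conflict_run)
    effects = {}
    for head in conflict_run:
        patched = dict(conflict_run)
        patched[head] = clean_run[head]  # patch exactly one head
        effects[head] = base - hallucination_logit(patched)
    return effects


# Hypothetical activations, chosen only for illustration.
conflict_run = {(0, 0): 0.9, (0, 1): 0.4, (1, 0): -1.2, (1, 1): 0.3}
clean_run = {(0, 0): 0.1, (0, 1): 0.1, (1, 0): -0.2, (1, 1): 0.1}

effects = head_effects(conflict_run, clean_run)
# Assumed sign convention: a positive effect pushes toward the erroneous token.
driving = sorted(h for h, e in effects.items() if e > 0)
resisting = sorted(h for h, e in effects.items() if e < 0)
```

In this toy, three heads come out as driving and one as resisting, but the sign convention and the use of a logit rather than a loss are assumptions of the sketch, not details taken from the paper.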
If this is right
- The imbalance appears consistently across the five tested open-source MLLMs.
- Conditional suppression of driving heads improves the hallucination-accuracy trade-off compared with unconditional or random interventions.
- The same intervention transfers zero-shot to the SCI-SemanticConflict test.
- Ablation experiments confirm that driving and resisting heads exert opposing effects on generation.
Where Pith is reading between the lines
- Training objectives could be adjusted to strengthen the aggregate weight of resisting heads relative to driving heads.
- Similar head-level imbalances may exist for other hallucination types such as object or attribute errors.
- The routing structure identified here could be monitored at inference time as an early-warning signal for modality conflicts.
Load-bearing premise
Path patching isolates the true causal contribution of each individual attention head to the final output without substantial interference from other heads or from the chosen patching values.
What would settle it
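Applying the same suppression to randomly selected heads instead of the causally identified driving heads on the MMMC benchmark. No measurable reduction in modality-conflict hallucinations under this random control would support the claim; a reduction comparable to that of the identified heads would undermine it.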
Original abstract
Modality-conflict hallucination occurs when multimodal large language models (MLLMs) prioritize erroneous textual premises over contradictory visual evidence. To understand why visual evidence fails to prevail during generation, we take a mechanistic perspective and examine which internal components drive or resist this failure. We perform head-level causal analysis using path patching across five open-source MLLMs and identify two groups of attention heads with opposing causal roles: hallucination-driving heads and hallucination-resisting heads. We find a consistent asymmetry: driving effects are more broadly distributed and carry greater aggregate weight, whereas resisting effects concentrate in a small number of high-importance heads. Ablation experiments further confirm that these groups exert opposing effects during generation: distributed driving influence and localized resistance together form an imbalanced routing structure that biases generation toward the erroneous premise. Motivated by this finding, we propose MACI (Modality-conflict-Aware Causal Intervention), a conditional intervention that suppresses causally identified hallucination-driving heads only when conflict is detected. Across five MLLMs, MACI achieves the largest hallucination reduction among compared inference-time baselines on the MMMC benchmark with a favorable hallucination-accuracy trade-off, and transfers zero-shot to the SCI-SemanticConflict test.
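The conditional character of the intervention can be illustrated with a small sketch. The abstract does not state how conflict is detected or how heads are suppressed, so everything here is an assumption: a scalar `conflict_score`, a `threshold`, a multiplicative `scale`, and scalar head outputs are hypothetical stand-ins, not MACI's actual rule.

```python
from typing import Dict, Set, Tuple

Head = Tuple[int, int]  # (layer, head index)


def conditional_suppress(
    head_outputs: Dict[Head, float],
    driving_heads: Set[Head],
    conflict_score: float,
    threshold: float = 0.5,  # hypothetical detector threshold
    scale: float = 0.0,      # hypothetical suppression factor (0.0 = full ablation)
) -> Dict[Head, float]:
    """Scale driving-head outputs only when the conflict score reaches the threshold."""
    if conflict_score < threshold:
        return dict(head_outputs)  # no conflict detected: leave every head untouched
    return {
        head: value * scale if head in driving_heads else value
        for head, value in head_outputs.items()
    }


# Hypothetical head outputs and driving set, for illustration only.
outputs = {(0, 0): 0.9, (0, 1): 0.4, (1, 0): -1.2}
driving = {(0, 0), (0, 1)}

no_conflict = conditional_suppress(outputs, driving, conflict_score=0.1)
with_conflict = conditional_suppress(outputs, driving, conflict_score=0.8)
```

Gating on the detector is what separates a conditional intervention from an unconditional one: inputs without a detected conflict pass through unchanged, which is one plausible reason such a design could preserve accuracy.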
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript claims that modality-conflict hallucination in MLLMs arises from an imbalanced attention-head routing structure: across five open-source models, hallucination-driving heads identified via path patching are more broadly distributed and carry greater aggregate causal weight than hallucination-resisting heads. Ablations confirm opposing effects, and the authors propose MACI, a conditional intervention that suppresses driving heads only on detected conflict, yielding the largest hallucination reduction on MMMC among baselines while preserving accuracy and transferring zero-shot to SCI-SemanticConflict.
Significance. If the path-patching results prove robust, the work supplies concrete causal evidence for why visual evidence is overridden by textual premises and demonstrates a practical, targeted mitigation strategy. Consistency across five models and the ablation confirmation are strengths; the favorable hallucination-accuracy trade-off of MACI would be a useful contribution to inference-time hallucination control if the underlying head classifications are stable.
major comments (1)
- Path-patching protocol and definition of driving versus resisting heads: the central claim of an imbalanced routing structure (broader distribution and higher aggregate causal weight for driving heads) depends on the assumption that single-head path patching isolates each head's causal contribution to final-token prediction. Because heads interact through the residual stream, patching one head's output from a corrupted run into a clean run can be compensated by remaining heads or altered by the specific corrupted activation chosen as the patch value. Without reported controls such as joint patching of candidate head sets or comparisons across multiple patch sources, the reported asymmetry risks being an artifact of the intervention rather than a stable computational property.
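The compensation concern can be made concrete with a deliberately extreme toy. This sketch assumes two fully redundant heads whose outputs are combined with `max`; the circuit, the names `readout` and `patched_output`, and the activation values are all hypothetical and are not drawn from the manuscript.

```python
from typing import Dict, Iterable


def readout(h_a: float, h_b: float) -> float:
    """Toy redundant circuit: either head alone is enough to carry the signal."""
    return max(h_a, h_b)


clean: Dict[str, float] = {"a": 1.0, "b": 1.0}
corrupted: Dict[str, float] = {"a": 0.0, "b": 0.0}


def patched_output(heads_to_patch: Iterable[str]) -> float:
    """Run the clean activations with the listed heads replaced by corrupted values."""
    acts = dict(clean)
    for name in heads_to_patch:
        acts[name] = corrupted[name]
    return readout(acts["a"], acts["b"])


baseline = readout(clean["a"], clean["b"])

# Single-head patching: the untouched head compensates, so each effect is zero.
effect_a = baseline - patched_output(["a"])
effect_b = baseline - patched_output(["b"])

# Joint patching: both heads are corrupted together, so the full effect appears.
effect_joint = baseline - patched_output(["a", "b"])
```

Here the single-head effects sum to zero while the joint effect is the full signal, which is the kind of gap a joint-patching control would reveal. Real heads are unlikely to be this cleanly redundant, so the toy shows only that the failure mode is possible.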
minor comments (3)
- The abstract and methods description provide no quantitative effect sizes, error bars, or statistical tests for the reported differences in distribution and aggregate weight between driving and resisting heads.
- Details on the conflict detector used inside MACI (how conflict is detected at inference time and the precise suppression rule) are not specified, making it difficult to assess reproducibility or failure modes.
- The manuscript would benefit from explicit comparison of the path-patching results against alternative causal methods (e.g., activation patching with multiple source runs or attribution patching) to strengthen the claim that the identified imbalance is method-independent.
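One way the request for error bars on the driving-versus-resisting difference could be met is a paired percentile bootstrap. This is a sketch under stated assumptions: the per-example aggregate weights below are invented for illustration, `paired_bootstrap_ci` is a hypothetical helper, and the manuscript does not say which statistic or resampling unit the authors would use.

```python
import random
from statistics import mean
from typing import List, Tuple


def paired_bootstrap_ci(
    driving: List[float],
    resisting: List[float],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Return (observed mean difference, CI lower bound, CI upper bound)."""
    diffs = [d - r for d, r in zip(driving, resisting)]
    rng = random.Random(seed)  # seeded so the interval is reproducible
    n = len(diffs)
    # Resample paired differences with replacement and record each resample mean.
    means = sorted(mean(rng.choices(diffs, k=n)) for _ in range(n_resamples))
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return mean(diffs), lower, upper


# Invented per-example aggregate weights (driving vs. resisting heads).
driving_weight = [1.8, 2.1, 1.6, 2.4, 1.9, 2.2, 1.7, 2.0]
resisting_weight = [1.1, 1.3, 1.2, 1.0, 1.4, 1.2, 0.9, 1.3]

observed, ci_low, ci_high = paired_bootstrap_ci(driving_weight, resisting_weight)
```

An interval that excludes zero would indicate that the asymmetry is not explained by example-level sampling noise alone; it would not address the patching-source concern raised in the major comment.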
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our path-patching analysis. We address the methodological concern in detail below and describe the revisions we will undertake.
Point-by-point responses
Referee: Path-patching protocol and definition of driving versus resisting heads: the central claim of an imbalanced routing structure (broader distribution and higher aggregate causal weight for driving heads) depends on the assumption that single-head path patching isolates each head's causal contribution to final-token prediction. Because heads interact through the residual stream, patching one head's output from a corrupted run into a clean run can be compensated by remaining heads or altered by the specific corrupted activation chosen as the patch value. Without reported controls such as joint patching of candidate head sets or comparisons across multiple patch sources, the reported asymmetry risks being an artifact of the intervention rather than a stable computational property.
Authors: We appreciate the referee's observation regarding residual-stream interactions and the assumptions underlying single-head path patching. Our protocol follows the standard single-head intervention used in mechanistic interpretability to attribute effects to individual components. While compensation by other heads is possible in principle, we validated the opposing roles through group-level ablation experiments that intervene simultaneously on the full sets of driving and resisting heads; these collective interventions confirm the distributed driving influence and localized resistance, thereby providing evidence that the asymmetry is not solely an artifact of isolated patching. The same imbalance pattern is reproduced across five architecturally distinct MLLMs, further supporting that the finding reflects a stable property rather than a patching-source artifact. In the revised manuscript we will expand the Methods section to explicitly discuss the residual-stream interaction concern, clarify why single-head patching was chosen for head identification, and explain how the group ablations serve as a control for collective effects. We will also add a dedicated limitations paragraph addressing the assumptions of the protocol.
Revision: partial
Circularity Check
No significant circularity detected
The paper applies the established external technique of path-patching to quantify the causal effect of each attention head on hallucination rates under modality conflict. Heads are then partitioned into driving and resisting groups according to the sign of those measured effects; the reported broader distribution and higher aggregate causal weight of the driving group is an empirical summary statistic computed directly from the same set of intervention results across the five models. This constitutes an observation about the measured distribution rather than a quantity defined in terms of itself or a parameter fitted and then relabeled as a prediction. The subsequent ablation checks and the design of MACI follow from these measurements without any self-referential reduction or load-bearing self-citation that would make the central claim equivalent to its inputs by construction. The derivation therefore remains self-contained against external benchmarks.
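The summary statistics described in this rationale can be sketched as follows, assuming signed per-head scores are already available. The function `summarize`, the `top_k_share` concentration measure, and the score values are hypothetical; the paper's exact definitions of "distribution" and "aggregate weight" are not given here.

```python
from typing import Dict, List, Tuple

Head = Tuple[int, int]  # (layer, head index)


def summarize(scores: Dict[Head, float], top_k: int = 1) -> Dict[str, Dict[str, float]]:
    """Split heads by score sign, then report count, total magnitude,
    and the share of that total held by the top_k strongest heads."""
    driving = sorted((s for s in scores.values() if s > 0), reverse=True)
    resisting = sorted((-s for s in scores.values() if s < 0), reverse=True)

    def stats(magnitudes: List[float]) -> Dict[str, float]:
        total = sum(magnitudes)
        share = sum(magnitudes[:top_k]) / total if total else 0.0
        return {"count": len(magnitudes), "aggregate": total, "top_k_share": share}

    return {"driving": stats(driving), "resisting": stats(resisting)}


# Invented scores: many moderate positive heads, one dominant negative head.
scores = {
    (0, 0): 0.2, (0, 1): 0.3, (1, 0): 0.25, (1, 1): 0.25,
    (2, 0): -0.6, (2, 1): -0.1,
}

summary = summarize(scores, top_k=1)
```

With these invented numbers the driving group has more heads and a larger total, while the resisting group has a higher top-1 share, which mirrors the distributed-versus-localized pattern the rationale describes. Because both statistics are computed from the same intervention results, they inherit any bias in those results.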
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: path patching on attention heads isolates their causal contribution to the final generation without substantial side effects from other components.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, theorem reality_from_one_distinction (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "We perform head-level causal analysis using path patching across five open-source MLLMs and identify two groups of attention heads with opposing causal roles: hallucination-driving heads and hallucination-resisting heads."
- IndisputableMonolith/Cost/FunctionalEquation.lean, theorem washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "Importance Score. ... $\bar{I}_{l,i} = \dots\, L(x_{\mathrm{cf}}) - L\bigl(x_{\mathrm{cf}}^{(l,i)\leftarrow \mathrm{cl}}\bigr)$"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.