Recognition: no theorem link
Separate First, Fuse Later: Mitigating Cross-Modal Interference in Audio-Visual LLMs Reasoning with Modality-Specific Chain-of-Thought
Pith reviewed 2026-05-12 04:27 UTC · model grok-4.3
The pith
Enforcing separate audio and visual chain-of-thought reasoning before evidence fusion mitigates cross-modal interference in audio-visual LLMs.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that cross-modal interference stems from uncontrolled mixing of audio and visual information during intermediate reasoning steps, and that requiring modality-specific chain-of-thought traces to be produced separately and fused only afterwards, supported by reinforcement learning on modality-preference labels, reduces hallucinations while retaining complementary information from both modalities.
What carries the argument
Modality-specific chain-of-thought reasoning that keeps audio and visual processing isolated in the reasoning stage before allowing full cross-modal access only at the final evidence fusion stage, guided by an auxiliary RL reward.
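One way to picture this staging is as three model calls: an audio-only reasoning pass, a visual-only reasoning pass, and a fusion pass that sees both traces. The sketch below is a minimal illustration under that reading; the model.generate interface, the argument names, and the prompts are assumptions for exposition, not the paper's implementation.

```python
# Minimal sketch of a "separate first, fuse later" inference loop.
# The model.generate interface and the prompt wording are hypothetical.

def separate_first_fuse_later(model, video_frames, audio_clip, question):
    """Produce isolated per-modality reasoning traces, then fuse them."""
    # Stage 1a: audio-only chain of thought; the visual stream is withheld
    # so visual content cannot steer the audio interpretation.
    audio_trace = model.generate(
        audio=audio_clip,
        prompt=f"Question: {question}\nReason step by step using only the audio.",
    )

    # Stage 1b: visual-only chain of thought, with the audio stream withheld.
    visual_trace = model.generate(
        video=video_frames,
        prompt=f"Question: {question}\nReason step by step using only the video.",
    )

    # Stage 2: evidence fusion; only here does the model see both traces
    # (and both raw inputs) and commit to an answer.
    answer = model.generate(
        video=video_frames,
        audio=audio_clip,
        prompt=(
            f"Question: {question}\n"
            f"Audio reasoning: {audio_trace}\n"
            f"Visual reasoning: {visual_trace}\n"
            "Combine the two lines of evidence and answer."
        ),
    )
    return audio_trace, visual_trace, answer
```

The load-bearing design choice is that the first two calls never see the other modality, so any interference has to enter at the fusion stage, where it can be weighed against both traces explicitly.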
Load-bearing premise
Enforcing separate modality-specific reasoning traces will reduce interference without causing the loss of useful information that only appears when modalities interact early.
What would settle it
An experiment that applies the full SFFL pipeline to the cross-modal hallucination benchmark and finds no reduction, or an increase, in hallucination rate compared to a standard fused-reasoning baseline.
Original abstract
Audio and vision provide complementary evidence for audio-visual question answering, yet current audio-visual large language models may suffer from cross-modal interference: information from one modality misguides the interpretation of another, thereby inducing hallucinations. We attribute this issue to uncontrolled cross-modal interactions during intermediate reasoning. To mitigate this, we propose Separate First, Fuse Later (SFFL), an audio-visual reasoning framework designed to reduce cross-modal interference. SFFL enforces modality-specific chain-of-thought reasoning, producing separate audio and visual reasoning traces and integrating evidence for answering. We construct modality-preference labels via a data pipeline under different modality input settings. We use these labels as an auxiliary reward in reinforcement learning to encourage an instance-dependent preference for modality cues when answering. We further introduce a modality-specific reasoning mechanism that preserves modality isolation during the separated reasoning stage while enabling full access to cross-modal information at the evidence fusion stage. Experiments demonstrate consistent improvements in both accuracy and robustness, yielding an average relative gain of 5.16% on general AVQA benchmarks and 11.17% on a cross-modal hallucination benchmark.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that audio-visual LLMs suffer from cross-modal interference during intermediate reasoning, which induces hallucinations by allowing one modality to misguide another. It proposes the Separate First, Fuse Later (SFFL) framework that enforces modality-specific chain-of-thought reasoning to produce separate audio and visual reasoning traces before fusing evidence for the final answer. Modality-preference labels are constructed via a data pipeline based on controlled input ablations under different modality settings and used as an auxiliary reward signal in reinforcement learning to encourage instance-dependent modality cue preference. A modality-specific reasoning mechanism maintains isolation during the separated reasoning stage while permitting full cross-modal access at the evidence fusion stage. Experiments report consistent improvements, with an average relative gain of 5.16% on general AVQA benchmarks and 11.17% on a cross-modal hallucination benchmark.
Significance. If the reported gains prove robust, SFFL would offer a practical and targeted approach to controlling cross-modal interactions in multimodal LLMs, addressing a recognized source of hallucinations while retaining complementary information across modalities. The combination of ablation-derived preference labels as an RL auxiliary reward and staged isolation/fusion provides a concrete training and inference recipe that could generalize to other multimodal settings. This would be a useful contribution to the literature on reliable audio-visual reasoning.
major comments (2)
- [Abstract] The central empirical claims of 5.16% relative gain on AVQA benchmarks and 11.17% on the hallucination benchmark are stated without any mention of the number of runs, error bars, statistical significance tests, or baseline implementation details. These omissions make it impossible to determine whether the improvements exceed experimental variance, yet the reported gains are load-bearing for the paper's main result.
- [Data pipeline] The construction of modality-preference labels from controlled input ablations is described at a high level, but the manuscript provides no validation procedure (e.g., human agreement, consistency checks across ablations, or sensitivity analysis) to confirm that the labels accurately reflect genuine modality preferences rather than artifacts of the ablation protocol. This directly affects the reliability of the RL auxiliary reward and therefore the soundness of the training procedure.
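To make the object of this concern concrete, the snippet below sketches what an ablation-based labeling rule could look like; the answer_with helper, the model.generate interface, and the three-way decision rule are all assumptions for illustration rather than the paper's pipeline, and it is precisely the outputs of a rule like this that the requested validation procedures would audit.

```python
# Hypothetical sketch of ablation-based modality-preference labeling.
# Names and the decision rule are assumed; they are not the paper's pipeline.

def answer_with(model, sample, use_audio, use_video):
    """Query the model under one controlled modality-input setting."""
    return model.generate(
        audio=sample.audio if use_audio else None,
        video=sample.video if use_video else None,
        prompt=sample.question,
    )

def modality_preference_label(model, sample):
    """Label a sample 'audio', 'visual', or 'both' from single-modality ablations."""
    audio_only_correct = answer_with(model, sample, use_audio=True, use_video=False) == sample.answer
    video_only_correct = answer_with(model, sample, use_audio=False, use_video=True) == sample.answer

    if audio_only_correct and not video_only_correct:
        return "audio"   # the audio cue alone suffices for this instance
    if video_only_correct and not audio_only_correct:
        return "visual"  # the visual cue alone suffices
    return "both"        # neither modality dominates on its own
```

Any decision rule in such a step is effectively a free parameter, which is why consistency and sensitivity checks matter for the auxiliary reward built on top of it.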
minor comments (1)
- [Abstract] The abstract would be clearer if it named the specific AVQA datasets and the size of the hallucination benchmark used to obtain the reported relative gains.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback, which helps clarify the presentation of our empirical results and strengthens the validation of our data construction process. We address each major comment below and will revise the manuscript accordingly.
Point-by-point responses
- Referee: [Abstract] The central empirical claims of 5.16% relative gain on AVQA benchmarks and 11.17% on the hallucination benchmark are stated without any mention of the number of runs, error bars, statistical significance tests, or baseline implementation details. These omissions make it impossible to determine whether the improvements exceed experimental variance, yet the reported gains are load-bearing for the paper's main result.
Authors: We agree that the abstract would benefit from explicit mention of experimental rigor to support the reported gains. Due to length constraints, we will revise the abstract to include a brief qualifier (e.g., 'averaged over 3 runs with full details, error bars, and significance tests in Section 4'). The Experiments section already reports results over multiple seeds with standard deviations; we will add explicit statements on the number of runs, baseline re-implementation details, and statistical tests (paired t-tests) to make this information immediately accessible and confirm the gains exceed variance. revision: partial
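For concreteness, the significance check promised here can be as small as a paired t-test over per-seed scores; the snippet below is a toy version with invented accuracy numbers, not results from the paper.

```python
# Toy paired t-test over per-seed accuracies (numbers are invented).
from scipy import stats

baseline_acc = [61.2, 60.8, 61.5]  # standard fused-reasoning baseline, one value per seed
sffl_acc = [64.1, 63.7, 64.4]      # SFFL, same seeds

result = stats.ttest_rel(sffl_acc, baseline_acc)
print(f"paired t-test: t = {result.statistic:.2f}, p = {result.pvalue:.4f}")
```

With only a handful of seeds such a test is underpowered, so reporting the per-seed spread alongside the p-value is the safer presentation.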
- Referee: [Data pipeline] The construction of modality-preference labels from controlled input ablations is described at a high level, but the manuscript provides no validation procedure (e.g., human agreement, consistency checks across ablations, or sensitivity analysis) to confirm that the labels accurately reflect genuine modality preferences rather than artifacts of the ablation protocol. This directly affects the reliability of the RL auxiliary reward and therefore the soundness of the training procedure.
Authors: We acknowledge that the current high-level description lacks explicit validation, which is a valid concern for the reliability of the auxiliary reward. In the revised manuscript, we will add a new subsection under Data Pipeline that includes: consistency checks by re-running ablations with modality swaps, sensitivity analysis on ablation thresholds, and human agreement evaluation on a sampled subset of labels (reporting Cohen's kappa). These additions will demonstrate that the labels capture genuine preferences and are not artifacts, thereby supporting the soundness of the RL training procedure. revision: yes
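A minimal version of the promised human-agreement check could look like the following; the label values are invented for illustration, and cohen_kappa_score is the standard scikit-learn implementation of Cohen's kappa.

```python
# Toy agreement check between pipeline labels and human annotations
# on an audited subset (label values are invented for illustration).
from sklearn.metrics import cohen_kappa_score

pipeline_labels = ["audio", "visual", "both", "audio", "visual", "both", "audio"]
human_labels = ["audio", "visual", "both", "visual", "visual", "both", "audio"]

kappa = cohen_kappa_score(pipeline_labels, human_labels)
print(f"Cohen's kappa on the audited subset: {kappa:.2f}")
```

The same audited subset could also serve the consistency check: re-running the ablations with modality swaps and measuring how often the assigned label flips.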
Circularity Check
No significant circularity; purely empirical framework
Full rationale
The paper describes an empirical pipeline (modality-preference labels from input ablations, RL auxiliary reward, modality-isolation during CoT followed by fusion) whose performance is reported via measured accuracy gains on AVQA and hallucination benchmarks. No equations, uniqueness theorems, or derivations appear in the provided text; the reported 5.16% and 11.17% relative improvements are experimental outcomes rather than quantities forced by construction from fitted inputs or self-citations. The central claim therefore remains externally falsifiable and does not reduce to its own inputs.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Modality-specific chain-of-thought reduces cross-modal interference while preserving complementary evidence.