arxiv: 2605.27741 · v1 · pith:FA6SYYQ6new · submitted 2026-05-26 · 💻 cs.CL

Escape the Language Prior: Mitigating Late-Stage Modality Collapse in Audio Reasoning via Modality-Aware Policy Optimization

Pith reviewed 2026-06-29 17:42 UTC · model grok-4.3

classification 💻 cs.CL
keywords audio reasoning · modality collapse · reinforcement learning · policy optimization · multimodal models · attention loss · chain of thought

The pith

Modality-Aware Policy Optimization (MAPO) uses a cross-modal entropy difference to concentrate RL gradients on audio-dependent tokens, blocking language-prior takeover in long reasoning traces.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Standard RL post-training methods such as GRPO apply uniform policy gradients across all tokens, regardless of how much each token depends on the audio. As the chain of thought lengthens, the model progressively drops the audio signal and defaults to compressed text priors, producing confident but ungrounded outputs. MAPO counters this in two ways. First, it builds a modality relevance mask from the entropy difference between the full multimodal policy and an audio-ablated reference, and concentrates policy updates on the high-relevance tokens. Second, it adds a branch that applies a temporally scaled penalty to internal attention distributions to keep cross-modal grounding active deep into the trace. The paper reports higher fidelity on complex audio reasoning benchmarks and new state-of-the-art numbers on several of them among open-weight models.
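As a rough illustration of the masking step described above, the sketch below computes a per-token entropy gap between an audio-ablated pass and a full multimodal pass over the same sampled tokens, then keeps the tokens with the largest gap. It is a minimal sketch and not the paper's implementation: the function names, the sign convention (ablated minus full), and the top-fraction selection rule with `keep_ratio` are assumptions made for illustration, since this page does not give the exact definition.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one token's vocabulary logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def relevance_mask(logits_with_audio, logits_without_audio, keep_ratio=0.5):
    """Return (delta_h, mask) for one sampled response.

    Both arguments hold one list of vocabulary logits per generated token,
    scored on the same token sequence with and without the audio input.
    ASSUMPTION: delta_h is ablated entropy minus full entropy, and the mask
    keeps the top `keep_ratio` fraction of tokens; both are hypothetical.
    """
    delta_h = [
        entropy(softmax(ablated)) - entropy(softmax(full))
        for full, ablated in zip(logits_with_audio, logits_without_audio)
    ]
    if not delta_h:
        return [], []
    keep = max(1, math.ceil(keep_ratio * len(delta_h)))
    ranked = sorted(range(len(delta_h)), key=lambda t: delta_h[t], reverse=True)
    mask = [0.0] * len(delta_h)
    for t in ranked[:keep]:
        mask[t] = 1.0
    return delta_h, mask
```

On a two-token toy response where only the first token's distribution sharpens when audio is present, calling `relevance_mask` with `keep_ratio=0.5` selects only that first token. A thresholded or soft-weighted mask would be an equally plausible design; the top-fraction rule is used here only because it needs no tuned threshold.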

Core claim

MAPO is a dual-branch reinforcement learning method that derives a modality relevance mask from cross-modal differential entropy between an audio-ablated reference and the multimodal policy to concentrate policy gradients on modality-critical tokens, while an auxiliary attention loss branch applies a targeted, temporally scaled penalty to sustain cross-modal attention and thereby reduce late-stage modality collapse.
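One way to picture a "temporally scaled penalty" on attention is sketched below. It assumes the per-token audio attention mass (the share of attention a generated token places on audio positions, already averaged over heads and layers) has been extracted elsewhere. The hinge on a `floor` value and the power-law time weight are hypothetical stand-ins: this page does not reproduce the paper's token-level definition, so the code shows the general shape of such a loss, not the authors' formula.

```python
def attention_penalty(audio_attention_mass, floor=0.2, power=1.0):
    """Time-weighted hinge penalty on low audio attention.

    audio_attention_mass[t] is assumed to lie in [0, 1] for generated token t.
    ASSUMPTION: `floor` (target minimum mass) and `power` (how fast the
    weight grows along the trace) are illustrative hyperparameters.
    """
    n = len(audio_attention_mass)
    if n == 0:
        return 0.0
    total = 0.0
    weight_sum = 0.0
    for t, mass in enumerate(audio_attention_mass):
        weight = ((t + 1) / n) ** power      # later tokens weigh more
        shortfall = max(0.0, floor - mass)   # only penalize mass below the floor
        total += weight * shortfall
        weight_sum += weight
    return total / weight_sum
```

Under this weighting, a trace that loses audio attention at its last token is penalized more than one that loses it at its first token, which matches the stated goal of sustaining grounding late in the reasoning chain.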

What carries the argument

Modality relevance mask computed from cross-modal differential entropy between audio-ablated reference and multimodal policy, which selects tokens for concentrated policy gradients.
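To make "concentrated policy gradients" concrete, the sketch below combines GRPO-style group-normalized advantages with a per-token mask, so that masked-out tokens contribute nothing to the loss. It is a deliberately simplified surrogate: it omits the clipped importance ratio and KL term used in GRPO, and the function names and data layout are assumptions for illustration rather than the paper's code.

```python
import statistics


def group_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of responses to the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def masked_policy_loss(token_logps, masks, rewards):
    """Masked token-level policy-gradient surrogate for one group.

    token_logps[i][t]: log-probability of token t in response i.
    masks[i][t]: 1.0 if token t is treated as modality-critical, else 0.0.
    ASSUMPTION: plain REINFORCE-style objective, averaged over selected
    tokens; no ratio clipping or KL regularization.
    """
    advantages = group_advantages(rewards)
    weighted = 0.0
    selected = 0.0
    for logps, mask, adv in zip(token_logps, masks, advantages):
        for logp, m in zip(logps, mask):
            weighted += -adv * m * logp
            selected += m
    return weighted / max(selected, 1.0)
```

Setting every mask entry to 1.0 recovers a uniform token-level objective, which is the baseline behaviour the summary attributes to standard methods.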

If this is right

  • Long-horizon audio reasoning fidelity improves because gradients no longer reinforce text-only shortcuts.
  • Multimodal instruction following strengthens as attention is actively kept on the non-text modality.
  • State-of-the-art results appear on several key benchmarks among open-weight models.
  • Confident but ungrounded hallucinations decrease in extended chain-of-thought traces.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same entropy-difference mask could be tested on vision-language models to check whether modality collapse is prevented there as well.
  • If the mask remains stable across different base models, it offers a domain-agnostic way to handle unequal modality dependence in any multimodal RL setup.
  • Combining the mask with existing length penalties might further control when grounding is enforced.

Load-bearing premise

The entropy-derived mask correctly flags tokens whose outputs depend on the audio input without injecting bias or training instability.

What would settle it

Ablation runs on the same audio reasoning benchmarks that remove either the relevance mask or the attention loss branch and measure whether the reported gains in long-horizon fidelity and reduction in hallucinations disappear.

Figures

Figures reproduced from arXiv: 2605.27741 by Chenxing Li, Cihan Xiao, Liefeng Bo, Sanjeev Khudanpur, Steve Yves, Xiang He, Yiwen Shao, Zhenwen Liang.

Figure 1: Overview of the MAPO framework. (a) Late-stage modality collapse, where attention shifts from the source audio to the text prior during CoT reasoning. (b) MAPO mitigates this via a dual-branch architecture. A modality relevance mask uses cross-modal differential entropy (Δh_t) to focus the policy gradient on audio-critical tokens. Simultaneously, an attention loss branch applies a temporally scaled penalty …

Figure 2: Late-stage modality collapse and key token extraction. Plots display the temporal decay of …

Figure 3: Attention loss trajectory. While the modality relevance mask corrects the policy gradient budget, it fundamentally relies on the model maintaining some latent awareness of the cross-modal signal. To directly combat late-stage modality collapse, MAPO incorporates an auxiliary attention loss branch (L_attn) that penalizes the neglect of the source signal deep within the reasoning chain. We define a token-le…

Figure 4: Training dynamics over 800 steps, comparing standard GRPO with MAPO across various …

Figure 5: Mean audio attention mass over 800 training steps for different attention loss weights (…

Figure 6: Impact of the attention loss branch on internal attention dynamics. Plots display cross-modal …

Figure 7: POS entropy summary across cross-modal reasoning trajectories. The top panel displays the …

Figure 8: Training dynamics of Qwen3-Omni-Instruct. (a) Both GRPO and MAPO show deceptively …
Original abstract

Audio and omni-modal large language models exhibit impressive cross-modal reasoning capabilities. However, applying standard reinforcement learning post-training algorithms to these models exposes a critical structural vulnerability: methods like GRPO apply uniform policy gradients across all tokens, ignoring their unequal dependence on the non-text source modality. This exacerbates late-stage modality collapse during extended chain-of-thought generation, where models progressively abandon the primary source signal in favor of compressed textual priors, leading to confident but ungrounded hallucinations. To address this, we introduce Modality-Aware Policy Optimization (MAPO), a novel dual-branch reinforcement learning framework. First, MAPO dynamically concentrates the policy gradient on modality-critical tokens using a modality relevance mask, which is derived from the cross-modal differential entropy between an audio-ablated reference and the multimodal policy. Second, it integrates an auxiliary attention loss branch that applies a targeted, temporally scaled penalty to the model's internal attention distributions. This ensures the model actively sustains cross-modal grounding deep into the reasoning trace. Evaluations on complex audio reasoning benchmarks demonstrate that MAPO substantially improves long-horizon reasoning fidelity and multimodal instruction following, achieving highly competitive performance and setting new state-of-the-art results on several key benchmarks among open-weight models. By relying strictly on native statistical signals rather than domain-specific inductive biases, MAPO offers a promising foundation for mitigating epistemic collapse across diverse multimodal systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper proposes Modality-Aware Policy Optimization (MAPO), a dual-branch RL post-training method for audio/omni-modal LLMs. It derives a modality relevance mask from cross-modal differential entropy between an audio-ablated reference policy and the multimodal policy to concentrate policy gradients on modality-critical tokens, and adds an auxiliary attention loss with temporal scaling to prevent late-stage modality collapse during long CoT. The abstract claims this yields substantial gains in long-horizon reasoning fidelity and sets new SOTA results on complex audio reasoning benchmarks among open-weight models, relying only on native statistical signals.

Significance. If the central mechanism is shown to work without introducing bias from reference-model distribution shift, the approach would be significant for multimodal RL by providing a targeted, bias-light way to sustain cross-modal grounding. The emphasis on native signals rather than hand-crafted inductive biases is a conceptual strength that could generalize beyond audio.

major comments (1)
  1. [Abstract / method (modality relevance mask derivation)] The modality relevance mask is defined via cross-modal differential entropy between the audio-ablated reference and the multimodal policy. This construction assumes the ablated reference differs from the multimodal policy only in its dependence on the audio signal. However, removing the audio encoder input typically induces large, non-local changes to hidden states and next-token distributions (especially over long CoT traces), so the entropy difference conflates modality dependence with general distribution mismatch. This directly undermines the claim that the gains come from concentrating gradients on modality-critical tokens rather than from incidental regularization.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the detailed comment on the modality relevance mask. We respond point-by-point below.

  1. Referee: The modality relevance mask is defined via cross-modal differential entropy between the audio-ablated reference and the multimodal policy. This construction assumes the ablated reference differs from the multimodal policy only in its dependence on the audio signal. However, removing the audio encoder input typically induces large, non-local changes to hidden states and next-token distributions (especially over long CoT traces), so the entropy difference conflates modality dependence with general distribution mismatch. This directly undermines the claim that the gains come from concentrating gradients on modality-critical tokens rather than from incidental regularization.

    Authors: We acknowledge that ablating the audio encoder input induces distribution shifts beyond isolated modality dependence, as hidden-state and token-distribution changes are non-local. The differential entropy is nevertheless computed between two policies that share identical parameters and training history, differing solely in audio-input availability; this supplies a native statistical proxy for modality impact rather than an exact isolation. Empirical gains on long-horizon audio-reasoning benchmarks indicate the resulting mask still concentrates gradients usefully on tokens whose probabilities are most affected by audio presence. We agree the assumption warrants explicit discussion and will revise the method section to state the approximation, note the potential conflation with general mismatch, and report any additional controls feasible in the revision.

    revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation uses independent native signals

full rationale

The provided abstract and description define MAPO via a modality relevance mask computed from cross-modal differential entropy (audio-ablated reference vs. multimodal policy) plus an auxiliary attention loss; these are direct constructions from model outputs rather than parameters fitted to target metrics or results that reduce to inputs by construction. No equations, self-citations, or uniqueness theorems are shown that would make any prediction equivalent to its inputs. The approach is presented as relying on native statistical signals, rendering the chain self-contained against external benchmarks with no load-bearing circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract provides no information on free parameters, axioms, or invented entities; the MAPO framework is presented as the main contribution, but its details are absent.

