Escape the Language Prior: Mitigating Late-Stage Modality Collapse in Audio Reasoning via Modality-Aware Policy Optimization

Chenxing Li; Cihan Xiao; Liefeng Bo; Sanjeev Khudanpur; Steve Yves; Xiang He; Yiwen Shao; Zhenwen Liang

arxiv: 2605.27741 · v1 · pith:FA6SYYQ6new · submitted 2026-05-26 · 💻 cs.CL

Pith reviewed 2026-06-29 17:42 UTC · model grok-4.3

classification 💻 cs.CL

keywords audio reasoning · modality collapse · reinforcement learning · policy optimization · multimodal models · attention loss · chain of thought

The pith

Modality-Aware Policy Optimization concentrates RL gradients on audio-dependent tokens using differential entropy to block language-prior takeover in long reasoning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Standard RL post-training methods such as GRPO apply uniform policy gradients across all tokens in audio models. As the chain of thought lengthens, the model drops the audio signal and defaults to compressed text priors, producing confident but ungrounded outputs. MAPO counters this by first building a modality relevance mask from the entropy difference between the full multimodal policy and an audio-ablated reference, then concentrating policy updates on the high-relevance tokens. A second branch applies a temporally scaled penalty to internal attention distributions to keep cross-modal grounding active deep into the trace. The paper reports higher long-horizon reasoning fidelity on complex audio reasoning benchmarks and new state-of-the-art results on several of them among open-weight models.
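The mask-building step can be illustrated with a minimal sketch. It is not the authors' implementation: the sign convention of the entropy difference, the hard top-k selection, and the `keep_ratio` parameter are all assumptions made for illustration, and the paper may well use a soft weighting instead.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def modality_relevance_mask(policy_dists, ablated_dists, keep_ratio=0.5):
    """Return (delta_h, mask) for one sampled response.

    policy_dists[t]  : next-token distribution with audio present
    ablated_dists[t] : next-token distribution for the same prefix, audio removed
    keep_ratio       : hypothetical hyperparameter, fraction of tokens to keep
    """
    # Assumed sign convention: positive when removing audio raises uncertainty.
    delta_h = [entropy(a) - entropy(p) for p, a in zip(policy_dists, ablated_dists)]
    k = max(1, round(keep_ratio * len(delta_h)))
    # Indices of the k largest entropy gaps.
    order = sorted(range(len(delta_h)), key=lambda t: delta_h[t], reverse=True)
    top = set(order[:k])
    mask = [1 if t in top else 0 for t in range(len(delta_h))]
    return delta_h, mask

# Toy response of four tokens over a three-word vocabulary.
policy = [[0.9, 0.05, 0.05], [0.4, 0.3, 0.3], [0.98, 0.01, 0.01], [0.7, 0.2, 0.1]]
ablated = [[0.4, 0.3, 0.3], [0.4, 0.3, 0.3], [0.97, 0.02, 0.01], [0.34, 0.33, 0.33]]
delta_h, mask = modality_relevance_mask(policy, ablated, keep_ratio=0.5)
print(mask)
```

In this toy case the first and last tokens become much less certain once the audio is removed, so they are the ones selected; the second token has identical distributions in both passes and gets an entropy gap of zero.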

Core claim

MAPO is a dual-branch reinforcement learning method that derives a modality relevance mask from cross-modal differential entropy between an audio-ablated reference and the multimodal policy to concentrate policy gradients on modality-critical tokens, while an auxiliary attention loss branch applies a targeted, temporally scaled penalty to sustain cross-modal attention and thereby reduce late-stage modality collapse.
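To make "concentrating policy gradients" concrete, the sketch below contrasts a uniform token-level surrogate with a masked one. It is a simplified illustration under stated assumptions: advantages are standardized within a group in the GRPO style using the population standard deviation, and the clipping, importance ratios, and KL terms that a full objective would normally include are omitted. The function names and the mask values are hypothetical.

```python
import statistics

def group_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: standardize rewards within one sampled group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

def masked_pg_loss(token_logprobs, masks, advantages):
    """Negative masked surrogate, averaged over the masked tokens only.

    token_logprobs[i][t] : log-prob of token t in response i under the policy
    masks[i][t]          : 1 for modality-critical tokens, 0 otherwise
    advantages[i]        : one scalar advantage per response
    """
    total, count = 0.0, 0
    for logps, mask, adv in zip(token_logprobs, masks, advantages):
        for lp, m in zip(logps, mask):
            total += m * adv * lp
            count += m
    return -total / max(count, 1)

# Two sampled responses of three tokens each; the first one was rewarded.
rewards = [1.0, 0.0]
logprobs = [[-0.1, -2.0, -0.3], [-0.5, -1.0, -0.2]]
adv = group_advantages(rewards)

uniform = masked_pg_loss(logprobs, [[1, 1, 1], [1, 1, 1]], adv)
masked = masked_pg_loss(logprobs, [[0, 1, 0], [1, 0, 0]], adv)
print(round(uniform, 4), round(masked, 4))
```

With an all-ones mask every token shares the update equally, which is the uniform behaviour the paper attributes to GRPO. With a sparse mask the same advantages act only on the selected tokens, so the unselected ones contribute no gradient.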

What carries the argument

Modality relevance mask computed from cross-modal differential entropy between audio-ablated reference and multimodal policy, which selects tokens for concentrated policy gradients.

If this is right

Long-horizon audio reasoning fidelity improves because gradients no longer reinforce text-only shortcuts.
Multimodal instruction following strengthens as attention is actively kept on the non-text modality.
State-of-the-art results appear on several key benchmarks among open-weight models.
Confident but ungrounded hallucinations decrease in extended chain-of-thought traces.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same entropy-difference mask could be tested on vision-language models to check whether modality collapse is prevented there as well.
If the mask remains stable across different base models, it offers a domain-agnostic way to handle unequal modality dependence in any multimodal RL setup.
Combining the mask with existing length penalties might further control when grounding is enforced.

Load-bearing premise

The entropy-derived mask correctly flags tokens whose outputs depend on the audio input without injecting bias or training instability.

What would settle it

Ablation runs on the same audio reasoning benchmarks that remove either the relevance mask or the attention loss branch and measure whether the reported gains in long-horizon fidelity and reduction in hallucinations disappear.
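The ablation described above amounts to a two-by-two grid over the branches. A small sketch of how such a grid might be enumerated follows; the run names and dictionary keys are hypothetical and are not taken from the paper.

```python
from itertools import product

def ablation_grid():
    """Enumerate the four on/off combinations of the two MAPO branches."""
    runs = []
    for use_mask, use_attn_loss in product([True, False], repeat=2):
        if use_mask and use_attn_loss:
            name = "full MAPO"
        elif use_mask:
            name = "mask only"
        elif use_attn_loss:
            name = "attention loss only"
        else:
            name = "neither branch (uniform-gradient baseline)"
        runs.append({
            "name": name,
            "relevance_mask": use_mask,
            "attention_loss": use_attn_loss,
        })
    return runs

runs = ablation_grid()
for run in runs:
    print(run)
```

If the reported gains persist in the "mask only" or "attention loss only" runs, that would indicate which branch carries them; if they persist with neither branch, the mechanism is not the source.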

Figures

Figures reproduced from arXiv: 2605.27741 by Chenxing Li, Cihan Xiao, Liefeng Bo, Sanjeev Khudanpur, Steve Yves, Xiang He, Yiwen Shao, Zhenwen Liang.

**Figure 1.** Overview of the MAPO framework. (a) Late-stage modality collapse, where attention shifts from the source audio to the text prior during CoT reasoning. (b) MAPO mitigates this via a dual-branch architecture. A modality relevance mask uses cross-modal differential entropy (Δh_t) to focus the policy gradient on audio-critical tokens. Simultaneously, an attention loss branch applies a temporally scaled penalty …

**Figure 2.** Late-stage modality collapse and key token extraction. Plots display the temporal decay of …

**Figure 3.** Attention loss trajectory. While the modality relevance mask corrects the policy gradient budget, it fundamentally relies on the model maintaining some latent awareness of the cross-modal signal. To directly combat late-stage modality collapse, MAPO incorporates an auxiliary attention loss branch (L_attn) that penalizes the neglect of the source signal deep within the reasoning chain. We define a token-le…

**Figure 4.** Training dynamics over 800 steps, comparing standard GRPO with MAPO across various …

**Figure 5.** Mean audio attention mass over 800 training steps for different attention loss weights (…

**Figure 6.** Impact of the attention loss branch on internal attention dynamics. Plots display cross-modal …

**Figure 7.** POS entropy summary across cross-modal reasoning trajectories. The top panel displays the …

**Figure 8.** Training dynamics of Qwen3-Omni-Instruct. (a) Both GRPO and MAPO show deceptively …

Original abstract

Audio and omni-modal large language models exhibit impressive cross-modal reasoning capabilities. However, applying standard reinforcement learning post-training algorithms to these models exposes a critical structural vulnerability: methods like GRPO apply uniform policy gradients across all tokens, ignoring their unequal dependence on the non-text source modality. This exacerbates late-stage modality collapse during extended chain-of-thought generation, where models progressively abandon the primary source signal in favor of compressed textual priors, leading to confident but ungrounded hallucinations. To address this, we introduce Modality-Aware Policy Optimization (MAPO), a novel dual-branch reinforcement learning framework. First, MAPO dynamically concentrates the policy gradient on modality-critical tokens using a modality relevance mask, which is derived from the cross-modal differential entropy between an audio-ablated reference and the multimodal policy. Second, it integrates an auxiliary attention loss branch that applies a targeted, temporally scaled penalty to the model's internal attention distributions. This ensures the model actively sustains cross-modal grounding deep into the reasoning trace. Evaluations on complex audio reasoning benchmarks demonstrate that MAPO substantially improves long-horizon reasoning fidelity and multimodal instruction following, achieving highly competitive performance and setting new state-of-the-art results on several key benchmarks among open-weight models. By relying strictly on native statistical signals rather than domain-specific inductive biases, MAPO offers a promising foundation for mitigating epistemic collapse across diverse multimodal systems.
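The second branch, a temporally scaled penalty on attention, can be sketched as follows. The abstract does not give the exact form of the loss, so everything specific here is an assumption for illustration: the hinge against a `floor` on audio attention mass, the linear ramp over the trace, and the function names.

```python
def audio_attention_mass(attn_row, audio_positions):
    """Share of one generated token's attention that lands on audio positions."""
    return sum(attn_row[j] for j in audio_positions)

def temporally_scaled_attention_penalty(attn_rows, audio_positions, floor=0.2):
    """Average hinge penalty on low audio attention, weighted more heavily late.

    attn_rows[t] : attention weights of generated token t over the context
    floor        : hypothetical minimum audio attention mass to maintain
    """
    n_tokens = len(attn_rows)
    total = 0.0
    for t, row in enumerate(attn_rows):
        weight = (t + 1) / n_tokens  # assumed linear ramp over the trace
        shortfall = max(0.0, floor - audio_attention_mass(row, audio_positions))
        total += weight * shortfall
    return total / n_tokens

audio_positions = [0, 1]  # toy context: two audio slots, then two text slots
grounded = [[0.3, 0.3, 0.2, 0.2]] * 3
late_collapse = [[0.3, 0.3, 0.2, 0.2], [0.1, 0.1, 0.4, 0.4], [0.0, 0.05, 0.45, 0.5]]
early_collapse = list(reversed(late_collapse))

p_grounded = temporally_scaled_attention_penalty(grounded, audio_positions)
p_late = temporally_scaled_attention_penalty(late_collapse, audio_positions)
p_early = temporally_scaled_attention_penalty(early_collapse, audio_positions)
print(p_grounded, p_late, p_early)
```

The ramp is what makes the penalty "temporally scaled": the same shortfall in audio attention costs more at the end of the trace than at the start, which targets the late-stage collapse the abstract describes.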

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

MAPO uses an entropy-difference mask from an audio-ablated reference plus an attention penalty to focus RL updates on modality-dependent tokens, but the ablation step likely mixes in distribution shift that weakens the mask's reliability.

The main point is that this paper introduces Modality-Aware Policy Optimization, a dual-branch RL method that builds a modality relevance mask from cross-modal differential entropy between the full model and an audio-ablated reference, then adds a temporally scaled attention penalty to keep the model grounded during long chain-of-thought traces.

It does a solid job naming the concrete failure mode: standard methods like GRPO spread gradients uniformly, so models drop the audio signal late in reasoning and fall back on text priors. Framing the fix around the model's own statistics rather than extra inductive biases is a reasonable choice and could transfer to other multimodal setups.

The soft spot sits in the mask itself. Ablating the audio encoder input will shift hidden states and next-token distributions in ways that go beyond simple loss of audio information, especially across extended traces. The resulting entropy difference therefore mixes modality dependence with general mismatch, which undercuts the claim that gradients are cleanly concentrated on the right tokens. The abstract reports benchmark gains and new SOTA numbers among open models, but without ablations on mask stability, reference choice, or training dynamics it is difficult to tell how much of the improvement traces to the proposed mechanism versus incidental regularization.

This is for people working on RL post-training for audio or omni-modal models who already see collapse in long reasoning. A reader in that niche would find the problem statement and the dual-branch structure useful to discuss. The work shows clear engagement with the literature on the issue and deserves a serious referee to check the experiments and the mask derivation in detail.

Referee Report

1 major / 0 minor

Summary. The paper proposes Modality-Aware Policy Optimization (MAPO), a dual-branch RL post-training method for audio/omni-modal LLMs. It derives a modality relevance mask from cross-modal differential entropy between an audio-ablated reference policy and the multimodal policy to concentrate policy gradients on modality-critical tokens, and adds an auxiliary attention loss with temporal scaling to prevent late-stage modality collapse during long CoT. The abstract claims this yields substantial gains in long-horizon reasoning fidelity and sets new SOTA results on complex audio reasoning benchmarks among open-weight models, relying only on native statistical signals.

Significance. If the central mechanism is shown to work without introducing bias from reference-model distribution shift, the approach would be significant for multimodal RL by providing a targeted, bias-light way to sustain cross-modal grounding. The emphasis on native signals rather than hand-crafted inductive biases is a conceptual strength that could generalize beyond audio.

major comments (1)

[Abstract / method (modality relevance mask derivation)] Abstract and method description: The modality relevance mask is defined via cross-modal differential entropy between the audio-ablated reference and the multimodal policy. This construction assumes the ablated reference differs from the multimodal policy only in its dependence on the audio signal. However, removing the audio encoder input typically induces large, non-local changes to hidden states and next-token distributions (especially over long CoT traces), so the entropy difference conflates modality dependence with general distribution mismatch. This directly undermines the claim that the gains come from concentrating gradients on modality-critical tokens rather than from incidental regularization.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the detailed comment on the modality relevance mask. We respond point-by-point below.

Referee: Abstract and method description: The modality relevance mask is defined via cross-modal differential entropy between the audio-ablated reference and the multimodal policy. This construction assumes the ablated reference differs from the multimodal policy only in its dependence on the audio signal. However, removing the audio encoder input typically induces large, non-local changes to hidden states and next-token distributions (especially over long CoT traces), so the entropy difference conflates modality dependence with general distribution mismatch. This directly undermines the claim that the gains come from concentrating gradients on modality-critical tokens rather than from incidental regularization.

Authors: We acknowledge that ablating the audio encoder input induces distribution shifts beyond isolated modality dependence, as hidden-state and token-distribution changes are non-local. The differential entropy is nevertheless computed between two policies that share identical parameters and training history, differing solely in audio-input availability; this supplies a native statistical proxy for modality impact rather than an exact isolation. Empirical gains on long-horizon audio-reasoning benchmarks indicate the resulting mask still concentrates gradients usefully on tokens whose probabilities are most affected by audio presence. We agree the assumption warrants explicit discussion and will revise the method section to state the approximation, note the potential conflation with general mismatch, and report any additional controls feasible in the revision.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation uses independent native signals

The provided abstract and description define MAPO via a modality relevance mask computed from cross-modal differential entropy (audio-ablated reference vs. multimodal policy) plus an auxiliary attention loss; these are direct constructions from model outputs rather than parameters fitted to target metrics or results that reduce to inputs by construction. No equations, self-citations, or uniqueness theorems are shown that would make any prediction equivalent to its inputs. The approach is presented as relying on native statistical signals, rendering the chain self-contained against external benchmarks with no load-bearing circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no information on free parameters, axioms, or invented entities; the MAPO framework is presented as the main contribution but details are absent.


[1] [1]

MusicLM: Generating Music From Text

Andrea Agostinelli, Timo I. Denk, Zalán Borsos, Jesse Engel, Charlie Nash, Antoine Caillon, Cheng- Zhi Anna Huang, Aren Jansen, Adam Roberts, Marco Tagliasacchi, and et al. Musiclm: Generating music from text.arXiv preprint arXiv:2301.11325, 2023. doi: 10.48550/arXiv.2301.11325

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2301.11325 2023

[2] [2]

Chang, Sungbok Lee, and Shrikanth S

Carlos Busso, Murtaza Bulut, Chi-Chun Lee, Abe Kazemzadeh, Emily Mower, Samuel Kim, Jeannette N. Chang, Sungbok Lee, and Shrikanth S. Narayanan. Iemocap: interactive emotional dyadic motion capture database.Language Resources and Evaluation, 42(4):335–359, Dec 2008. ISSN 1574-0218. doi: 10.1007/s10579-008-9076-6. URLhttps://doi.org/10.1007/s10579-008-9076-6

work page doi:10.1007/s10579-008-9076-6 2008

[3] [3]

Gemmeke, Daniel P

Jort F. Gemmeke, Daniel P. W. Ellis, Dylan Freedman, Aren Jansen, Wade Lawrence, R. Channing Moore, Manoj Plakal, and Marvin Ritter. Audio set: An ontology and human-labeled dataset for audio events. In2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 776–780, 2017. doi: 10.1109/icassp.2017.7952261

work page doi:10.1109/icassp.2017.7952261 2017

[4] [4]

Sakshi, Jaehyeon Kim, Wei Ping, Rafael Valle, Dinesh Manocha, and Bryan Catanzaro

Sreyan Ghosh, Zhifeng Kong, Sonal Kumar, S. Sakshi, Jaehyeon Kim, Wei Ping, Rafael Valle, Dinesh Manocha, and Bryan Catanzaro. Audio flamingo 2: An audio-language model with long-audio understand- ing and expert reasoning abilities.arXiv preprint arXiv:2503.03983, 2025. doi: 10.48550/arxiv.2503.03983

work page doi:10.48550/arxiv.2503.03983 2025

[5] [5]

Audio Flamingo 3: Advancing Audio Intelligence with Fully Open Large Audio Language Models

Arushi Goel, Sreyan Ghosh, Jaehyeon Kim, Sonal Kumar, Zhifeng Kong, Sang gil Lee, Chao-Han Huck Yang, Ramani Duraiswami, Dinesh Manocha, Rafael Valle, and Bryan Catanzaro. Audio flamingo 3: Advancing audio intelligence with fully open large audio language models, 2025. URL https: //arxiv.org/abs/2507.08128

work page internal anchor Pith review Pith/arXiv arXiv 2025

[6] [6]

Cochlscene: A multi-device dataset for acoustic scene classification.arXiv preprint arXiv:2111.08245, 2021

Inseok Han, Jiyoung Park, and Kyogu Lee. Cochlscene: A multi-device dataset for acoustic scene classification.arXiv preprint arXiv:2111.08245, 2021. doi: 10.48550/arXiv.2111.08245

work page doi:10.48550/arxiv.2111.08245 2021

[7] [7]

Audio-DeepThinker: Progressive Reasoning-Aware Reinforcement Learning for High-Quality Chain-of-Thought Emergence in Audio Language Models

Xiang He, Chenxing Li, Jinting Wang, Yan Rong, Tianxin Xie, Wenfu Wang, Li Liu, and Dong Yu. Audio-deepthinker: Progressive reasoning-aware reinforcement learning for high-quality chain-of-thought emergence in audio language models, 2026. URLhttps://arxiv.org/abs/2604.18187

work page internal anchor Pith review Pith/arXiv arXiv 2026

[8] [8]

Spotlight on token perception for multimodal reinforcement learning.arXiv preprint arXiv:2510.09285, 2025

Siyuan Huang, Xiaoye Qu, Yafu Li, Yun Luo, Zefeng He, Daizong Liu, and Yu Cheng. Spotlight on token perception for multimodal reinforcement learning.arXiv preprint arXiv:2510.09285, 2025

work page arXiv 2025

[9] [9]

Amazon nova 2: Multimodal reasoning and generation models.Amazon Technical Reports, 2025

Amazon Artificial General Intelligence. Amazon nova 2: Multimodal reasoning and generation models.Amazon Technical Reports, 2025. URL https://www.amazon.science/publications/ amazon-nova-2-multimodal-reasoning-and-generation-models

2025

[10] [10]

Libritts-r: A restored multi-speaker text-to-speech corpus.arXiv preprint arXiv:2305.18802, 2023

Yuma Koizumi, Heiga Zen, Shigeki Karita, Yifan Ding, Kohei Yatabe, Nobuyuki Morioka, Michiel Bacchiani, Yu Zhang, Wei Han, and Ankur Bapna. Libritts-r: A restored multi-speaker text-to-speech corpus, 2023. URLhttps://arxiv.org/abs/2305.18802

work page arXiv 2023

[11] [11]

Mmau-pro: A challeng- ing and comprehensive benchmark for holistic evaluation of audio general intelligence, 2025

Sonal Kumar, Šimon Sedláˇcek, Vaibhavi Lokegaonkar, Fernando López, Wenyi Yu, Nishit Anand, Hyeong- gon Ryu, Lichang Chen, Maxim Pliˇcka, Miroslav Hlaváˇcek, William Fineas Ellingwood, Sathvik Udupa, Siyuan Hou, Allison Ferner, Sara Barahona, Cecilia Bolaños, Satish Rahi, Laura Herrera-Alarcón, Satvik Dixit, Siddhi Patil, Soham Deshmukh, Lasha Koroshinadz...

work page arXiv 2025

[12] Mmar: A challenging benchmark for deep reasoning in speech, audio, music, and their mix, 2025

Ziyang Ma, Yinghao Ma, Yanqiao Zhu, Chen Yang, Yi-Wen Chao, Ruiyang Xu, Wenxi Chen, Yuanzhe Chen, Zhuo Chen, Jian Cong, Kai Li, Keliang Li, Siyou Li, Xinfeng Li, Xiquan Li, Zheng Lian, Yuzhe Liang, Minghao Liu, Zhikang Niu, Tianrui Wang, Yuping Wang, Yuxuan Wang, Yihao Wu, Guanrou Yang, Jianwei Yu, Ruibin Yuan, Zhisheng Zheng, Ziya Zhou, Haina Zhu, Wei Xu...

[13] OpenAI. GPT-4o system card, 2024. URL https://arxiv.org/abs/2410.21276

[14] Paul Primus, Florian Schmid, and Gerhard Widmer. Tacos: Temporally-aligned audio captions for language-audio pretraining, 2025. URL https://arxiv.org/abs/2505.07609

[15] Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model, 2024. URL https://arxiv.org/abs/2305.18290

[16] Andrew Rouditchenko, Saurabhchand Bhati, Edson Araujo, Samuel Thomas, Hilde Kuehne, Rogerio Feris, and James Glass. Omni-R1: Do you really need audio to fine-tune your audio LLM?, 2025. URL https://arxiv.org/abs/2505.09439

[17] S Sakshi, Utkarsh Tyagi, Sonal Kumar, Ashish Seth, Ramaneswaran Selvakumar, Oriol Nieto, Ramani Duraiswami, Sreyan Ghosh, and Dinesh Manocha. MMAU: A massive multi-task audio understanding and reasoning benchmark, 2024. URL https://arxiv.org/abs/2410.19168

[18] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms, 2017. URL https://arxiv.org/abs/1707.06347

[19] Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. Deepseekmath: Pushing the limits of mathematical reasoning in open language models, 2024. URL https://arxiv.org/abs/2402.03300

[20] Gemini Team. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. URL https://arxiv.org/abs/2403.05530

[22] Gemini Team. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities, 2025. URL https://arxiv.org/abs/2507.06261

[23] MiMo Core Team. Mimo-audio: Audio language models are few-shot learners, 2025. URL https://arxiv.org/abs/2512.23808

[24] StepAudio Team. Step-Audio 2 technical report, 2025. URL https://arxiv.org/abs/2507.16632

[25] Dingdong Wang, Junan Li, Jincenzi Wu, Dongchao Yang, Xueyuan Chen, Tianhua Zhang, and Helen Meng. MMSU: A massive multi-task spoken language understanding and reasoning benchmark, 2026. URL https://arxiv.org/abs/2506.04779

[26] Wenfu Wang, Chenxing Li, Liqiang Zhang, Yiyang Zhao, Yuxiang Zou, Hanzhao Li, Mingyu Cui, Hao Zhang, Kun Wei, Le Xu, Zikang Huang, Jiajun Xu, Jiliang Hu, Xiang He, Zeyu Xie, Jiawen Kang, Youjun Chen, Meng Yu, Dong Yu, Rilin Chen, Linlin Di, Shulin Feng, Na Hu, Yang Liu, Bang Wang, and Shan Yang. Covo-audio technical report, 2026. URL https://arxiv.org/abs/...

[27] Zengbin Wang, Feng Xiong, Liang Lin, Xuecai Hu, Yong Wang, Yanlin Wang, Man Zhang, and Xiangxiang Chu. Visually-guided policy optimization for multimodal reasoning. arXiv preprint arXiv:2604.09349, 2026

[28] Zhenhailong Wang, Xuehang Guo, Sofia Stoica, Haiyang Xu, Hongru Wang, Hyeonjeong Ha, Xiusi Chen, Yangyi Chen, Ming Yan, Fei Huang, and Heng Ji. Perception-aware policy optimization for multimodal reasoning, 2026. URL https://arxiv.org/abs/2507.06448

[29] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models, 2023. URL https://arxiv.org/abs/2201.11903

[30] Shu Wu, Chenxing Li, Wenfu Wang, Hao Zhang, Hualei Wang, Meng Yu, and Dong Yu. Audio-Thinker: Guiding audio language model when and how to think via reinforcement learning, 2025. URL https://arxiv.org/abs/2508.08039

[31] Jin Xu, Zhifang Guo, Jinzheng He, Hangrui Hu, Ting He, Shuai Bai, Keqin Chen, Jialin Wang, Yang Fan, Kai Dang, Bin Zhang, Xiong Wang, Yunfei Chu, and Junyang Lin. Qwen2.5-Omni technical report, 2025. URL https://arxiv.org/abs/2503.20215

[32] Qwen3-Omni Technical Report

Jin Xu, Zhifang Guo, Hangrui Hu, Yunfei Chu, Xiong Wang, Jinzheng He, Yuxuan Wang, Xian Shi, Ting He, Xinfa Zhu, Yuanjun Lv, Yongqi Wang, Dake Guo, He Wang, Linhan Ma, Pei Zhang, Xinyu Zhang, Hongkun Hao, Zishan Guo, Baosong Yang, Bin Zhang, Ziyang Ma, Xipin Wei, Shuai Bai, Keqin Chen, Xuejing Liu, Peng Wang, Mingkun Yang, Dayiheng Liu, Xingzhang Ren, Bo ...

[33] Pinci Yang, Xin Wang, Xuguang Duan, Hong Chen, Runze Hou, Cong Jin, and Wenwu Zhu. Avqa: A dataset for audio-visual question answering on videos. In Proceedings of the 30th ACM International Conference on Multimedia, pages 3480–3491, 2022

[34] Xu Zhongxing, Wang Zhonghua, Qian Zhe, Shi Dachuan, Tang Feilong, Hu Ming, Su Shiyan, Zou Xiaocheng, Feng Wei, Mahapatra Dwarikanath, Peng Yifan, Lin Mingquan, and Ge Zongyuan. Thinking in uncertainty: Mitigating hallucinations in MLRMs with latent entropy-aware decoding. In Proceedings of the Computer Vision and Pattern Recognition Conference, 2026.

[35] [35]

The conclusion reached in the thinking process does not agree with or contradicts the final answer

[36] [36]

both speakers show sadness

The thinking process is visibly incomplete or cut off prematurely. If neither of these failure modes is present, respond “YES”. Only output YES or NO.

C Ablation analysis

C.1 General training dynamics

Figure 4 (panels: (a) Training Accuracy, (b) Completion Length): Training dynamics over 800 steps, comparing standard GRPO with MAPO across various attention loss weigh...