EchoDistill:Alignment Noisy-to-Clean Self-Distillation for Robust Audio LLMs

Cai Yuchen; Chunxi Luo; Gongli Xi; Jie Zhang; Jin Wang; Junhao Dong; Kaiwen Luo; Kun Wang; Liang Lin; Qiankun Li

arxiv: 2605.23954 · v1 · pith:3ICCMY4Anew · submitted 2026-05-11 · 💻 cs.CL · cs.AI· cs.SD

EchoDistill:Alignment Noisy-to-Clean Self-Distillation for Robust Audio LLMs

Liang Lin , Chunxi Luo , Kaiwen Luo , Jie Zhang , Jin Wang , Yuanhe Zhang , Cai Yuchen , Qiankun Li

show 4 more authors

Gongli Xi Zhenhong Zhou Kun Wang Junhao Dong

This is my paper

Pith reviewed 2026-06-30 22:50 UTC · model grok-4.3

classification 💻 cs.CL cs.AIcs.SD

keywords Audio LLMsnoise robustnessself-distillationGRPOsemantic alignmenthallucination reductioninference-time optimization

0 comments

The pith

EchoDistill aligns a noisy Audio LLM student to a frozen clean teacher via GRPO token-consistency rewards to cut semantic drift without raising inference cost.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces EchoDistill to reduce hallucinations and semantic drift in Audio Large Language Models when inputs contain real-world noise. A frozen clean-audio teacher supplies semantic references while the noisy student samples multiple response trajectories at inference time. These trajectories are refined with group-relative policy optimization that treats token-level agreement with the teacher as a reward signal, plus audio-aware shaping. The result is higher task accuracy and grounded reasoning on noisy audio, achieved without any change to the deployed model speed.

Core claim

EchoDistill is an alignment-based noisy-to-clean self-distillation framework in which a frozen clean-audio teacher provides semantic references for an inference-time noisy-audio student; the student samples candidate responses under noise, and these trajectories are optimized via group-relative policy optimization where token-level consistency with the teacher acts as a reward bonus, yielding improved semantic reliability and task performance under complex noise with no added inference cost.

What carries the argument

Group-relative policy optimization (GRPO) on noisy student trajectories that uses token-level consistency with the clean teacher as the reward signal.

If this is right

Average GSR rises 4.18 percent under strong noise relative to the strongest prior baseline.
On Qwen-Omni, the full method adds 3.02 percent Acc, 3.89 percent Noisy, and 4.53 percent GSR over GRPO alone.
Reasoning paths become both more correct and more directly tied to the actual acoustic input.
The same model can be made robust after training by applying the alignment procedure only at test time.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The separation of a clean reference model from the deployed noisy model suggests a general pattern for test-time robustness in other multimodal settings where clean exemplars can be obtained offline.
Because the optimization occurs only on sampled trajectories, the method could be combined with existing acoustic enhancement pipelines without retraining the core LLM.
If the clean teacher itself contains domain biases, those biases would be transferred into the student's selected paths, an effect not measured in the current experiments.

Load-bearing premise

The frozen clean-audio teacher supplies semantic references that are reliable enough to serve as token-level consistency rewards for the noisy student.

What would settle it

Running the same noisy-audio evaluation suite and finding that EchoDistill produces no measurable gain in GSR or accuracy over the GRPO-only baseline would falsify the value of the teacher alignment step.

Figures

Figures reproduced from arXiv: 2605.23954 by Cai Yuchen, Chunxi Luo, Gongli Xi, Jie Zhang, Jin Wang, Junhao Dong, Kaiwen Luo, Kun Wang, Liang Lin, Qiankun Li, Yuanhe Zhang, Zhenhong Zhou.

**Figure 2.** Figure 2: Sparse and uneven audio grounding in correct trajectories. (a) Distribution of token/window-level audio dependency di,w, measured by the decision-margin drop after ablating a local audio window. (b) Distribution of trajectory-level audio anchor gi , measuring the gain of using audio over the no-audio condition [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗

**Figure 3.** Figure 3: Overview of EchoDistill. EchoDistill uses a noisy-audio student and a frozen clean-audio teacher during training, combining task reward, noisy-to-clean evidence alignment, and preference optimization while retaining single-student inference. The distillation loss L (i) distill is the masked teacher-to-student KL between these distributions over response tokens. The mask mi,t excludes prompt tokens, audio p… view at source ↗

**Figure 4.** Figure 4: EchoDistill consistently achieves higher Composite Robustness Score than the strongest [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗

**Figure 5.** Figure 5: Qwen2.5-Omni results under −10 dB noise on Music, Sound, and Speech [PITH_FULL_IMAGE:figures/full_fig_p009_5.png] view at source ↗

**Figure 6.** Figure 6: Training dynamics of noisy-to-clean consistency. MiniCPM improves rapidly in the early stage, while StepAudio exhibits a steadier upward trend, indicating that EchoDistill progressively aligns noisy-input generation with cleanaudio semantic behavior. Obs. 7: Noisy-to-clean alignment progressively suppresses semantic drift. As shown in [PITH_FULL_IMAGE:figures/full_fig_p009_6.png] view at source ↗

read the original abstract

Audio Large Language Models (ALLMs) are highly vulnerable to real-world noise, which often induces severe semantic drift and hallucinations. Existing robustness methods primarily rely on waveform-level acoustic enhancement, answer-level supervision, or the internal suppression of noise representations. To address these issues, we propose echodistill, an alignment-based noisy-to-clean self-distillation framework. Echodistill leverages a frozen clean-audio teacher to provide semantic references for an inference-time noisy-audio student. Specifically, the student samples candidate responses under noisy conditions to expose its test-time behavior. These trajectories are then optimized via group-relative policy optimization (GRPO), where the token-level consistency with the teacher acts as a reward bonus. By aligning the noisy student's candidate responses with clean semantic evidence, and applying audio-aware reward shaping, our method encourages reasoning trajectories that are both correct and genuinely acoustically grounded. Echodistill significantly improves the semantic reliability and task performance of Audio LLMs under complex noise, without introducing any additional inference costs. Extensive experiments show that: (I) Compared with the strongest baseline, echodistill achieves average improvements of 4.18\%$\uparrow$ in GSR under strong noise. (II) Ablation results on Qwen-Omni further show that echodistill improves over the GRPO-only variant by 3.02\%$\uparrow$ in Acc, 3.89\%$\uparrow$ in Noisy, and 4.53\%$\uparrow$ in GSR on average. Our codes are available at https://anonymous.4open.science/r/echodistill-10DE.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

EchoDistill applies GRPO to align noisy audio LLM trajectories with a frozen clean teacher via token consistency rewards, but the abstract gives no evidence that this reward tracks actual correctness.

read the letter

The main takeaway is that EchoDistill runs a noisy student against a frozen clean-audio teacher, samples responses at inference time, and uses GRPO to boost token-level agreement with the teacher while adding audio-aware reward shaping. It reports a 4.18% GSR lift under strong noise over the best baseline and further gains over plain GRPO in ablations on Qwen-Omni, all without extra inference cost.

The combination of noisy-to-clean self-distillation with GRPO is the clearest new piece. The setup directly targets test-time behavior rather than training-time enhancement or internal noise suppression, which is a reasonable angle for real-world audio use.

The soft spot sits right on the reward signal. The method assumes the clean teacher supplies reliable semantic references and that exact token overlap rewards acoustically grounded reasoning. The abstract contains no check showing that teacher-student agreement correlates with ground-truth accuracy or that it avoids rewarding superficial matches. Without that link, the reported gains could come from other factors in the optimization.

The experimental claims are stated at summary level only, with no visible details on noise types, dataset splits, variance, or how baselines were implemented. Code is linked, which helps, but the current text does not let a reader verify the numbers.

This is for people already working on audio LLMs and robustness methods who want to try GRPO-style alignment. A reader in that niche can extract the method and the ablation pattern.

It is worth sending to peer review so the reward validity and full experimental controls can be examined properly.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes EchoDistill, a noisy-to-clean self-distillation framework for Audio Large Language Models. A frozen clean-audio teacher supplies semantic references that are used to optimize an inference-time noisy student's sampled response trajectories via Group Relative Policy Optimization (GRPO), with token-level consistency serving as a reward bonus and audio-aware shaping applied. The method claims to improve semantic reliability and task performance under complex noise without added inference cost, reporting an average 4.18% GSR gain under strong noise versus the strongest baseline and further gains (3.02% Acc, 3.89% Noisy, 4.53% GSR) over a GRPO-only ablation on Qwen-Omni.

Significance. If the central results hold, the work would offer a practical route to noise-robust ALLMs by aligning noisy inference trajectories to clean semantic evidence at no extra runtime cost. The use of GRPO on token-consistency rewards and the explicit separation of teacher and student audio conditions constitute a distinct technical contribution relative to waveform enhancement or answer-level supervision baselines.

major comments (2)

The reported performance gains rest on the assumption that token-level consistency between the frozen clean teacher and the noisy student's trajectories constitutes a valid reward signal under GRPO. The manuscript provides no quantitative validation (e.g., correlation between teacher-student agreement rates and ground-truth correctness) that this signal rewards acoustically grounded reasoning rather than superficial lexical overlap; this check is load-bearing for the 4.18% GSR claim under strong noise.
Ablation results on Qwen-Omni claim consistent improvements over the GRPO-only variant, yet the description does not specify how the teacher-derived reward is ablated (e.g., whether the teacher is replaced by a noisy reference or removed entirely) or report variance across noise types and seeds. Without these controls it is unclear whether the additional 3-4.5% gains are attributable to the clean-teacher alignment.

minor comments (2)

The abstract states that codes are available at an anonymous repository; the final version should replace this with a permanent link and include the exact experimental configurations (noise levels, datasets, and GSR definition) used for the reported percentages.
Notation for the reward bonus and audio-aware shaping is introduced only descriptively; an explicit equation or pseudocode block would clarify how the token-consistency term is combined with any base reward in the GRPO objective.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below and will incorporate revisions to strengthen the manuscript.

read point-by-point responses

Referee: The reported performance gains rest on the assumption that token-level consistency between the frozen clean teacher and the noisy student's trajectories constitutes a valid reward signal under GRPO. The manuscript provides no quantitative validation (e.g., correlation between teacher-student agreement rates and ground-truth correctness) that this signal rewards acoustically grounded reasoning rather than superficial lexical overlap; this check is load-bearing for the 4.18% GSR claim under strong noise.

Authors: We agree that the manuscript lacks an explicit quantitative validation such as correlation between teacher-student agreement rates and ground-truth correctness. In the revised version we will add a dedicated analysis subsection that computes and reports Pearson correlations (and associated p-values) between per-trajectory agreement rates and correctness labels across all noise conditions and tasks. This will directly test whether the reward signal aligns with acoustically grounded reasoning. revision: yes
Referee: Ablation results on Qwen-Omni claim consistent improvements over the GRPO-only variant, yet the description does not specify how the teacher-derived reward is ablated (e.g., whether the teacher is replaced by a noisy reference or removed entirely) or report variance across noise types and seeds. Without these controls it is unclear whether the additional 3-4.5% gains are attributable to the clean-teacher alignment.

Authors: We acknowledge the ablation description is underspecified. In revision we will explicitly state that the GRPO-only baseline removes the teacher-derived token-consistency reward entirely (retaining only the task-performance component of the reward) and will report results with standard deviations across three random seeds, broken down by individual noise type (babble, music, environmental, etc.). These additions will clarify attribution to the clean-teacher signal. revision: yes

Circularity Check

0 steps flagged

No circularity: derivation relies on external teacher assumption and empirical results

full rationale

The paper presents EchoDistill as an empirical alignment framework in which a separately frozen clean-audio teacher supplies token-level consistency rewards to optimize a noisy student's GRPO trajectories. No equations, fitted parameters renamed as predictions, or self-citations are shown that would reduce the claimed performance gains (e.g., 4.18% GSR) to a definitional identity or tautological input. The method's load-bearing step is the external assumption that the clean teacher yields reliable semantic references, which is stated as an assumption rather than derived from the student's own outputs. All reported improvements are positioned as experimental outcomes from ablations and comparisons, leaving the derivation chain self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no explicit free parameters, axioms, or invented entities; the method relies on standard self-distillation and RL concepts from prior literature.

pith-pipeline@v0.9.1-grok · 5870 in / 1079 out tokens · 20712 ms · 2026-06-30T22:50:29.236338+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

63 extracted references · 33 canonical work pages · 20 internal anchors

[1]

De-noising by soft-thresholding.IEEE transactions on information theory, 41(3):613–627, 1995

1995
[2]

On-policy distillation of language models: Learning from self-generated mistakes

Rishabh Agarwal, Nino Vieillard, Yongchao Zhou, Piotr Stanczyk, Sabela Ramos Garea, Matthieu Geist, and Olivier Bachem. On-policy distillation of language models: Learning from self-generated mistakes. InThe twelfth international conference on learning representations, 2024

2024
[3]

Univg-r1: Reasoning guided universal visual grounding with reinforcement learning.arXiv preprint arXiv:2505.14231, 2025

Sule Bai, Mingxing Li, Yong Liu, Jing Tang, Haoji Zhang, Lei Sun, Xiangxiang Chu, and Yansong Tang. Univg-r1: Reasoning guided universal visual grounding with reinforcement learning.arXiv preprint arXiv:2505.14231, 2025

work page arXiv 2025
[4]

Pearson correlation coefficient

Jacob Benesty, Jingdong Chen, Yiteng Huang, and Israel Cohen. Pearson correlation coefficient. InNoise reduction in speech processing, pages 1–4. Springer, 2009

2009
[5]

Advancing multimodal reasoning: From optimized cold start to staged reinforcement learning.arXiv preprint arXiv:2506.04207, 2025

Shuang Chen, Yue Guo, Zhaochen Su, Yafu Li, Yulun Wu, Jiacheng Chen, Jiayu Chen, Weijie Wang, Xiaoye Qu, and Yu Cheng. Advancing multimodal reasoning: From optimized cold start to staged reinforcement learning.arXiv preprint arXiv:2506.04207, 2025

work page arXiv 2025
[6]

Noise-robust sound event detection and counting via language-queried sound separation.IEEE Signal Processing Letters, 2025

Yuanjian Chen, Yang Xiao, Han Yin, Yadong Guan, and Xubo Liu. Noise-robust sound event detection and counting via language-queried sound separation.IEEE Signal Processing Letters, 2025

2025
[7]

Omnichat: Enhancing spoken dialogue systems with scalable synthetic data for diverse scenarios.arXiv preprint arXiv:2501.01384, 2025

Xize Cheng, Dongjie Fu, Xiaoda Yang, Minghui Fang, Ruofan Hu, Jingyu Lu, Bai Jionghao, Zehan Wang, Shengpeng Ji, Rongjie Huang, et al. Omnichat: Enhancing spoken dialogue systems with scalable synthetic data for diverse scenarios.arXiv preprint arXiv:2501.01384, 2025

work page arXiv 2025
[8]

Omnisep: Unified omni-modality sound separation with query-mixup.arXiv preprint arXiv:2410.21269, 2024

Xize Cheng, Siqi Zheng, Zehan Wang, Minghui Fang, Ziang Zhang, Rongjie Huang, Ziyang Ma, Shengpeng Ji, Jialong Zuo, Tao Jin, et al. Omnisep: Unified omni-modality sound separation with query-mixup.arXiv preprint arXiv:2410.21269, 2024

work page arXiv 2024
[9]

Qwen2-Audio Technical Report

Yunfei Chu, Jin Xu, Qian Yang, Haojie Wei, Xipin Wei, Zhifang Guo, Yichong Leng, Yuan- jun Lv, Jinzheng He, Junyang Lin, et al. Qwen2-audio technical report.arXiv preprint arXiv:2407.10759, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[10]

Adverse effects of environmental noise on acoustic voice quality measurements.Journal of Voice, 19(1):15–28, 2005

Dimitar D Deliyski, Heather S Shaw, and Maegan K Evans. Adverse effects of environmental noise on acoustic voice quality measurements.Journal of Voice, 19(1):15–28, 2005

2005
[11]

Speech enhancement using a minimum-mean square error short-time spectral amplitude estimator.IEEE Transactions on acoustics, speech, and signal processing, 32(6):1109–1121, 2003

Yariv Ephraim and David Malah. Speech enhancement using a minimum-mean square error short-time spectral amplitude estimator.IEEE Transactions on acoustics, speech, and signal processing, 32(6):1109–1121, 2003

2003
[12]

Alphaedit: Null-space constrained knowledge editing for language models

Junfeng Fang, Houcheng Jiang, Kun Wang, Yunshan Ma, Shi Jie, Xiang Wang, Xiangnan He, and Tat-Seng Chua. Alphaedit: Null-space constrained knowledge editing for language models. arXiv preprint arXiv:2410.02355, 2024

work page arXiv 2024
[13]

Born again neural networks

Tommaso Furlanello, Zachary Lipton, Michael Tschannen, Laurent Itti, and Anima Anandkumar. Born again neural networks. InInternational conference on machine learning, pages 1607–1616. PMLR, 2018

2018
[14]

Double the trouble: handling noise and reverberation in far-field automatic speech recognition

David Gelbart and Nelson Morgan. Double the trouble: handling noise and reverberation in far-field automatic speech recognition. InInterspeech, pages 2185–2188, 2002

2002
[15]

Explainable disentanglement on discrete speech representations for noise-robust asr

Shreyas Gopal, Ashutosh Anshul, Haoyang Li, Yue Heng Yeo, Hexin Liu, and Eng Siong Chng. Explainable disentanglement on discrete speech representations for noise-robust asr. In2025 Asia Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), pages 2535–2540. IEEE, 2025. 10

2025
[16]

Measuring Massive Multitask Language Understanding

Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding.arXiv preprint arXiv:2009.03300, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2009
[17]

Distilling the Knowledge in a Neural Network

Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015

work page internal anchor Pith review Pith/arXiv arXiv 2015
[18]

Evaluating robustness of large audio language models to audio injection: An empirical study

Guanyu Hou, Jiaming He, Yinhang Zhou, Ji Guo, Yitong Qiao, Rui Zhang, and Wenbo Jiang. Evaluating robustness of large audio language models to audio injection: An empirical study. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 25671–25687, 2025

2025
[19]

A tandem algorithm for pitch estimation and voiced speech segregation.IEEE Transactions on Audio, Speech, and Language Processing, 18(8):2067–2079, 2010

Guoning Hu and DeLiang Wang. A tandem algorithm for pitch estimation and voiced speech segregation.IEEE Transactions on Audio, Speech, and Language Processing, 18(8):2067–2079, 2010

2067
[20]

Audiogpt: Understanding and generating speech, music, sound, and talking head

Rongjie Huang, Mingze Li, Dongchao Yang, Jiatong Shi, Xuankai Chang, Zhenhui Ye, Yuning Wu, Zhiqing Hong, Jiawei Huang, Jinglin Liu, et al. Audiogpt: Understanding and generating speech, music, sound, and talking head. InProceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 23802–23804, 2024

2024
[21]

Spotlight on token perception for multimodal reinforcement learning.arXiv preprint arXiv:2510.09285, 2025

Siyuan Huang, Xiaoye Qu, Yafu Li, Yun Luo, Zefeng He, Daizong Liu, and Yu Cheng. Spotlight on token perception for multimodal reinforcement learning.arXiv preprint arXiv:2510.09285, 2025

work page arXiv 2025
[22]

Reinforcement Learning via Self-Distillation

Jonas Hübotter, Frederike Lübeck, Lejs Behric, Anton Baumann, Marco Bagatella, Daniel Marta, Ido Hakimi, Idan Shenfeld, Thomas Kleine Buening, Carlos Guestrin, et al. Reinforcement learning via self-distillation.arXiv preprint arXiv:2601.20802, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[23]

Scaling reasoning efficiently via relaxed on-policy distillation.arXiv preprint arXiv:2603.11137,

Jongwoo Ko, Sara Abdali, Young Jin Kim, Tianyi Chen, and Pashmina Cameron. Scaling reasoning efficiently via relaxed on-policy distillation.arXiv preprint arXiv:2603.11137, 2026

work page arXiv 2026
[24]

Audio augmentation for speech recognition

Tom Ko, Vijayaditya Peddinti, Daniel Povey, and Sanjeev Khudanpur. Audio augmentation for speech recognition. InInterspeech, volume 2015, page 3586, 2015

2015
[25]

Robustness in Large Language Models: A Survey of Mitigation Strategies and Evaluation Metrics,

Pankaj Kumar and Subhankar Mishra. Robustness in large language models: A survey of mitigation strategies and evaluation metrics.arXiv preprint arXiv:2505.18658, 2025

work page arXiv 2025
[26]

Mmau- pro: A challenging and comprehensive benchmark for holistic evaluation of audio general intelligence

Sonal Kumar, Šimon Sedlá ˇcek, Vaibhavi Lokegaonkar, Fernando López, Wenyi Yu, Nishit Anand, Hyeonggon Ryu, Lichang Chen, Maxim Pli ˇcka, Miroslav Hlavá ˇcek, et al. Mmau- pro: A challenging and comprehensive benchmark for holistic evaluation of audio general intelligence. InProceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 2...

2026
[27]

Isa-bench: Benchmarking instruction sensitivity for large audio language models.arXiv preprint arXiv:2510.23558, 2025

Bohan Li, Wenbin Huang, Yuhang Qiu, Yiwei Guo, Hankun Wang, Zhihan Li, Jing Peng, Ziyang Ma, Xie Chen, and Kai Yu. Isa-bench: Benchmarking instruction sensitivity for large audio language models.arXiv preprint arXiv:2510.23558, 2025

work page arXiv 2025
[28]

Applications of large language models and multimodal large models in autonomous driving: A comprehensive review.Drones, 9(4):238, 2025

Jing Li, Jingyuan Li, Guo Yang, Lie Yang, Haozhuang Chi, and Lichao Yang. Applications of large language models and multimodal large models in autonomous driving: A comprehensive review.Drones, 9(4):238, 2025

2025
[29]

Robust automatic speech recogni- tion: a bridge to practical applications

Jinyu Li, Li Deng, Reinhold Haeb-Umbach, and Yifan Gong. Robust automatic speech recogni- tion: a bridge to practical applications. 2015

2015
[30]

Hidden in the noise: Unveiling backdoors in audio llms alignment through latent acoustic pattern triggers

Liang Lin, Miao Yu, Kaiwen Luo, Yibo Zhang, Lilan Peng, Dexian Wang, Xuehai Tang, Yuanhe Zhang, Xikang Yang, Zhenhong Zhou, et al. Hidden in the noise: Unveiling backdoors in audio llms alignment through latent acoustic pattern triggers. InProceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 32015–32023, 2026

2026
[31]

ChronosAudio: A Comprehensive Long-Audio Benchmark for Evaluating Audio-Large Language Models

Kaiwen Luo, Liang Lin, Yibo Zhang, Moayad Aloqaily, Dexian Wang, Zhenhong Zhou, Junwei Zhang, Kun Wang, Li Sun, and Qingsong Wen. Chronosaudio: A comprehensive long-audio benchmark for evaluating audio-large language models.arXiv preprint arXiv:2601.04876, 2026. 11

work page internal anchor Pith review Pith/arXiv arXiv 2026
[32]

Mmar: A challenging benchmark for deep reasoning in speech, audio, music, and their mix.arXiv preprint arXiv:2505.13032, 2025

Ziyang Ma, Yinghao Ma, Yanqiao Zhu, Chen Yang, Yi-Wen Chao, Ruiyang Xu, Wenxi Chen, Yuanzhe Chen, Zhuo Chen, Jian Cong, et al. Mmar: A challenging benchmark for deep reasoning in speech, audio, music, and their mix.arXiv preprint arXiv:2505.13032, 2025

work page arXiv 2025
[33]

Specaugment: A simple data augmentation method for automatic speech recognition.arXiv preprint arXiv:1904.08779, 2019

Daniel S Park, William Chan, Yu Zhang, Chung-Cheng Chiu, Barret Zoph, Ekin D Cubuk, and Quoc V Le. Specaugment: A simple data augmentation method for automatic speech recognition.arXiv preprint arXiv:1904.08779, 2019

work page arXiv 1904
[34]

SEGAN: Speech Enhancement Generative Adversarial Network

Santiago Pascual, Antonio Bonafonte, and Joan Serra. Segan: Speech enhancement generative adversarial network.arXiv preprint arXiv:1703.09452, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[35]

Deep learning for audio signal processing.IEEE Journal of Selected Topics in Signal Processing, 13(2):206–219, 2019

Hendrik Purwins, Bo Li, Tuomas Virtanen, Jan Schlüter, Shuo-Yiin Chang, and Tara Sainath. Deep learning for audio signal processing.IEEE Journal of Selected Topics in Signal Processing, 13(2):206–219, 2019

2019
[36]

Robust speech recognition via large-scale weak supervision

Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. Robust speech recognition via large-scale weak supervision. InInternational conference on machine learning, pages 28492–28518. PMLR, 2023

2023
[37]

A reduction of imitation learning and structured prediction to no-regret online learning

Stéphane Ross, Geoffrey Gordon, and Drew Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. InProceedings of the fourteenth interna- tional conference on artificial intelligence and statistics, pages 627–635. JMLR Workshop and Conference Proceedings, 2011

2011
[38]

AudioPaLM: A Large Language Model That Can Speak and Listen

Paul K Rubenstein, Chulayuth Asawaroengchai, Duc Dung Nguyen, Ankur Bapna, Zalán Borsos, Félix de Chaumont Quitry, Peter Chen, Dalia El Badawy, Wei Han, Eugene Kharitonov, et al. Audiopalm: A large language model that can speak and listen.arXiv preprint arXiv:2306.12925, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[39]

On-policy self-distillation for reasoning compression.arXiv e-prints, pages arXiv–2603, 2026

Hejian Sang, Yuanda Xu, Zhengze Zhou, Ran He, Zhipeng Wang, and Jiachen Sun. On-policy self-distillation for reasoning compression.arXiv e-prints, pages arXiv–2603, 2026

2026
[40]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[41]

Taskbench: Benchmarking large language models for task automation.Advances in Neural Information Processing Systems, 37:4540–4574, 2024

Yongliang Shen, Kaitao Song, Xu Tan, Wenqi Zhang, Kan Ren, Siyu Yuan, Weiming Lu, Dongsheng Li, and Yueting Zhuang. Taskbench: Benchmarking large language models for task automation.Advances in Neural Information Processing Systems, 37:4540–4574, 2024

2024
[42]

Self-Distillation Enables Continual Learning

Idan Shenfeld, Mehul Damani, Jonas Hübotter, and Pulkit Agrawal. Self-distillation enables continual learning.arXiv preprint arXiv:2601.19897, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[43]

Expanding the capabilities of reinforcement learning via text feedback.arXiv preprint arXiv:2602.02482, 2026

Yuda Song, Lili Chen, Fahim Tajwar, Remi Munos, Deepak Pathak, J Andrew Bagnell, Aarti Singh, and Andrea Zanette. Expanding the capabilities of reinforcement learning via text feedback.arXiv preprint arXiv:2602.02482, 2026

work page arXiv 2026
[44]

Beyond Compromise: Pareto-Lenient Consensus for Efficient Multi-Preference LLM Alignment

Renxuan Tan, Rongpeng Li, Zhifeng Zhao, and Honggang Zhang. Beyond compromise: Pareto- lenient consensus for efficient multi-preference llm alignment.arXiv preprint arXiv:2604.05965, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[45]

SALMONN: Towards Generic Hearing Abilities for Large Language Models

Changli Tang, Wenyi Yu, Guangzhi Sun, Xianzhao Chen, Tian Tan, Wei Li, Lu Lu, Zejun Ma, and Chao Zhang. Salmonn: Towards generic hearing abilities for large language models.arXiv preprint arXiv:2310.13289, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[46]

A hybrid dsp/deep learning approach to real-time full-band speech enhance- ment

Jean-Marc Valin. A hybrid dsp/deep learning approach to real-time full-band speech enhance- ment. In2018 IEEE 20th international workshop on multimedia signal processing (MMSP), pages 1–5. IEEE, 2018

2018
[47]

Audiobench: A universal benchmark for audio large language models

Bin Wang, Xunlong Zou, Geyu Lin, Shuo Sun, Zhuohan Liu, Wenyu Zhang, Zhengyuan Liu, AiTi Aw, and Nancy Chen. Audiobench: A universal benchmark for audio large language models. InProceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pa...

2025
[48]

Supervised speech separation based on deep learning: An overview.IEEE/ACM transactions on audio, speech, and language processing, 26(10):1702– 1726, 2018

DeLiang Wang and Jitong Chen. Supervised speech separation based on deep learning: An overview.IEEE/ACM transactions on audio, speech, and language processing, 26(10):1702– 1726, 2018

2018
[49]

ESPnet: End-to-End Speech Processing Toolkit

Shinji Watanabe, Takaaki Hori, Shigeki Karita, Tomoki Hayashi, Jiro Nishitoba, Yuya Unno, Nelson Enrique Yalta Soplin, Jahn Heymann, Matthew Wiesner, Nanxin Chen, et al. Espnet: End-to-end speech processing toolkit.arXiv preprint arXiv:1804.00015, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[50]

Step-Audio 2 Technical Report

Boyong Wu, Chao Yan, Chen Hu, Cheng Yi, Chengli Feng, Fei Tian, Feiyu Shen, Gang Yu, Haoyang Zhang, Jingbei Li, et al. Step-audio 2 technical report.arXiv preprint arXiv:2507.16632, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[51]

Xlsr-mamba: A dual-column bidirectional state space model for spoofing attack detection.IEEE Signal Processing Letters, 2025

Yang Xiao and Rohan Kumar Das. Xlsr-mamba: A dual-column bidirectional state space model for spoofing attack detection.IEEE Signal Processing Letters, 2025

2025
[52]

Adakws: Towards robust keyword spotting with test-time adaptation

Yang Xiao, Tianyi Peng, Yanghao Zhou, and Rohan Kumar Das. Adakws: Towards robust keyword spotting with test-time adaptation. InINTERSPEECH 2025, 2025

2025
[53]

Self-training with noisy student improves imagenet classification

Qizhe Xie, Minh-Thang Luong, Eduard Hovy, and Quoc V Le. Self-training with noisy student improves imagenet classification. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10687–10698, 2020

2020
[54]

Qwen3-Omni Technical Report

Jin Xu, Zhifang Guo, Hangrui Hu, Yunfei Chu, Xiong Wang, Jinzheng He, Yuxuan Wang, Xian Shi, Ting He, Xinfa Zhu, et al. Qwen3-omni technical report.arXiv preprint arXiv:2509.17765, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[55]

Air-bench: Benchmarking large audio-language models via generative comprehension

Qian Yang, Jin Xu, Wenrui Liu, Yunfei Chu, Ziyue Jiang, Xiaohuan Zhou, Yichong Leng, Yuanjun Lv, Zhou Zhao, Chang Zhou, et al. Air-bench: Benchmarking large audio-language models via generative comprehension. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1979–1998, 2024

1979
[56]

Learning beyond Teacher: Generalized On-Policy Distillation with Reward Extrapolation

Wenkai Yang, Weijie Liu, Ruobing Xie, Kai Yang, Saiyong Yang, and Yankai Lin. Learning beyond teacher: Generalized on-policy distillation with reward extrapolation.arXiv preprint arXiv:2602.12125, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[57]

MiniCPM-V: A GPT-4V Level MLLM on Your Phone

Yuan Yao, Tianyu Yu, Ao Zhang, Chongyi Wang, Junbo Cui, Hongji Zhu, Tianchi Cai, Haoyu Li, Weilin Zhao, Zhihui He, et al. Minicpm-v: A gpt-4v level mllm on your phone.arXiv preprint arXiv:2408.01800, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[58]

On-Policy Context Distillation for Language Models

Tianzhu Ye, Li Dong, Xun Wu, Shaohan Huang, and Furu Wei. On-policy context distillation for language models.arXiv preprint arXiv:2602.12275, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[59]

Leveraging llm and text-queried separation for noise-robust sound event detection

Han Yin, Yang Xiao, Jisheng Bai, and Rohan Kumar Das. Leveraging llm and text-queried separation for noise-robust sound event detection. InICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) Satellite Workshop on Speech and Audio Language Models (SALMA), 2024

2025
[60]

Focus Then Listen: An Empirical Study of Plug-and-Play Audio Enhancer for Noise-Robust Large Audio Language Models

Han Yin, Yang Xiao, Younghoo Kwon, Ting Dang, and Jung-Woo Choi. Focus then listen: Exploring plug-and-play audio enhancer for noise-robust large audio language models.arXiv preprint arXiv:2603.04862, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[61]

DAPO: An Open-Source LLM Reinforcement Learning System at Scale

Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, et al. Dapo: An open-source llm reinforcement learning system at scale.arXiv preprint arXiv:2503.14476, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[62]

See: Signal embedding energy for quantifying noise interference in large audio language models.arXiv preprint arXiv:2601.07331, 2026

Yuanhe Zhang, Jiayu Tian, Yibo Zhang, Shilinlu Yan, Liang Lin, Zhenhong Zhou, Li Sun, and Sen Su. See: Signal embedding energy for quantifying noise interference in large audio language models.arXiv preprint arXiv:2601.07331, 2026

work page arXiv 2026
[63]

Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models

Siyan Zhao, Zhihui Xie, Mengchen Liu, Jing Huang, Guan Pang, Feiyu Chen, and Aditya Grover. Self-distilled reasoner: On-policy self-distillation for large language models.arXiv preprint arXiv:2601.18734, 2026. 13 A Related Work Knowledge Distillation and Frontier Self-Distillation.Knowledge distillation minimizes the functional or distributional discrepan...

work page internal anchor Pith review Pith/arXiv arXiv 2026

[1] [1]

De-noising by soft-thresholding.IEEE transactions on information theory, 41(3):613–627, 1995

1995

[2] [2]

On-policy distillation of language models: Learning from self-generated mistakes

Rishabh Agarwal, Nino Vieillard, Yongchao Zhou, Piotr Stanczyk, Sabela Ramos Garea, Matthieu Geist, and Olivier Bachem. On-policy distillation of language models: Learning from self-generated mistakes. InThe twelfth international conference on learning representations, 2024

2024

[3] [3]

Univg-r1: Reasoning guided universal visual grounding with reinforcement learning.arXiv preprint arXiv:2505.14231, 2025

Sule Bai, Mingxing Li, Yong Liu, Jing Tang, Haoji Zhang, Lei Sun, Xiangxiang Chu, and Yansong Tang. Univg-r1: Reasoning guided universal visual grounding with reinforcement learning.arXiv preprint arXiv:2505.14231, 2025

work page arXiv 2025

[4] [4]

Pearson correlation coefficient

Jacob Benesty, Jingdong Chen, Yiteng Huang, and Israel Cohen. Pearson correlation coefficient. InNoise reduction in speech processing, pages 1–4. Springer, 2009

2009

[5] [5]

Advancing multimodal reasoning: From optimized cold start to staged reinforcement learning.arXiv preprint arXiv:2506.04207, 2025

Shuang Chen, Yue Guo, Zhaochen Su, Yafu Li, Yulun Wu, Jiacheng Chen, Jiayu Chen, Weijie Wang, Xiaoye Qu, and Yu Cheng. Advancing multimodal reasoning: From optimized cold start to staged reinforcement learning.arXiv preprint arXiv:2506.04207, 2025

work page arXiv 2025

[6] [6]

Noise-robust sound event detection and counting via language-queried sound separation.IEEE Signal Processing Letters, 2025

Yuanjian Chen, Yang Xiao, Han Yin, Yadong Guan, and Xubo Liu. Noise-robust sound event detection and counting via language-queried sound separation.IEEE Signal Processing Letters, 2025

2025

[7] [7]

Omnichat: Enhancing spoken dialogue systems with scalable synthetic data for diverse scenarios.arXiv preprint arXiv:2501.01384, 2025

Xize Cheng, Dongjie Fu, Xiaoda Yang, Minghui Fang, Ruofan Hu, Jingyu Lu, Bai Jionghao, Zehan Wang, Shengpeng Ji, Rongjie Huang, et al. Omnichat: Enhancing spoken dialogue systems with scalable synthetic data for diverse scenarios.arXiv preprint arXiv:2501.01384, 2025

work page arXiv 2025

[8] [8]

Omnisep: Unified omni-modality sound separation with query-mixup.arXiv preprint arXiv:2410.21269, 2024

Xize Cheng, Siqi Zheng, Zehan Wang, Minghui Fang, Ziang Zhang, Rongjie Huang, Ziyang Ma, Shengpeng Ji, Jialong Zuo, Tao Jin, et al. Omnisep: Unified omni-modality sound separation with query-mixup.arXiv preprint arXiv:2410.21269, 2024

work page arXiv 2024

[9] [9]

Qwen2-Audio Technical Report

Yunfei Chu, Jin Xu, Qian Yang, Haojie Wei, Xipin Wei, Zhifang Guo, Yichong Leng, Yuan- jun Lv, Jinzheng He, Junyang Lin, et al. Qwen2-audio technical report.arXiv preprint arXiv:2407.10759, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[10] [10]

Adverse effects of environmental noise on acoustic voice quality measurements.Journal of Voice, 19(1):15–28, 2005

Dimitar D Deliyski, Heather S Shaw, and Maegan K Evans. Adverse effects of environmental noise on acoustic voice quality measurements.Journal of Voice, 19(1):15–28, 2005

2005

[11] [11]

Speech enhancement using a minimum-mean square error short-time spectral amplitude estimator.IEEE Transactions on acoustics, speech, and signal processing, 32(6):1109–1121, 2003

Yariv Ephraim and David Malah. Speech enhancement using a minimum-mean square error short-time spectral amplitude estimator.IEEE Transactions on acoustics, speech, and signal processing, 32(6):1109–1121, 2003

2003

[12] [12]

Alphaedit: Null-space constrained knowledge editing for language models

Junfeng Fang, Houcheng Jiang, Kun Wang, Yunshan Ma, Shi Jie, Xiang Wang, Xiangnan He, and Tat-Seng Chua. Alphaedit: Null-space constrained knowledge editing for language models. arXiv preprint arXiv:2410.02355, 2024

work page arXiv 2024

[13] [13]

Born again neural networks

Tommaso Furlanello, Zachary Lipton, Michael Tschannen, Laurent Itti, and Anima Anandkumar. Born again neural networks. InInternational conference on machine learning, pages 1607–1616. PMLR, 2018

2018

[14] [14]

Double the trouble: handling noise and reverberation in far-field automatic speech recognition

David Gelbart and Nelson Morgan. Double the trouble: handling noise and reverberation in far-field automatic speech recognition. InInterspeech, pages 2185–2188, 2002

2002

[15] [15]

Explainable disentanglement on discrete speech representations for noise-robust asr

Shreyas Gopal, Ashutosh Anshul, Haoyang Li, Yue Heng Yeo, Hexin Liu, and Eng Siong Chng. Explainable disentanglement on discrete speech representations for noise-robust asr. In2025 Asia Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), pages 2535–2540. IEEE, 2025. 10

2025

[16] [16]

Measuring Massive Multitask Language Understanding

Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding.arXiv preprint arXiv:2009.03300, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2009

[17] [17]

Distilling the Knowledge in a Neural Network

Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015

work page internal anchor Pith review Pith/arXiv arXiv 2015

[18] [18]

Evaluating robustness of large audio language models to audio injection: An empirical study

Guanyu Hou, Jiaming He, Yinhang Zhou, Ji Guo, Yitong Qiao, Rui Zhang, and Wenbo Jiang. Evaluating robustness of large audio language models to audio injection: An empirical study. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 25671–25687, 2025

2025

[19] [19]

A tandem algorithm for pitch estimation and voiced speech segregation.IEEE Transactions on Audio, Speech, and Language Processing, 18(8):2067–2079, 2010

Guoning Hu and DeLiang Wang. A tandem algorithm for pitch estimation and voiced speech segregation.IEEE Transactions on Audio, Speech, and Language Processing, 18(8):2067–2079, 2010

2067

[20] [20]

Audiogpt: Understanding and generating speech, music, sound, and talking head

Rongjie Huang, Mingze Li, Dongchao Yang, Jiatong Shi, Xuankai Chang, Zhenhui Ye, Yuning Wu, Zhiqing Hong, Jiawei Huang, Jinglin Liu, et al. Audiogpt: Understanding and generating speech, music, sound, and talking head. InProceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 23802–23804, 2024

2024

[21] [21]

Spotlight on token perception for multimodal reinforcement learning.arXiv preprint arXiv:2510.09285, 2025

Siyuan Huang, Xiaoye Qu, Yafu Li, Yun Luo, Zefeng He, Daizong Liu, and Yu Cheng. Spotlight on token perception for multimodal reinforcement learning.arXiv preprint arXiv:2510.09285, 2025

work page arXiv 2025

[22] [22]

Reinforcement Learning via Self-Distillation

Jonas Hübotter, Frederike Lübeck, Lejs Behric, Anton Baumann, Marco Bagatella, Daniel Marta, Ido Hakimi, Idan Shenfeld, Thomas Kleine Buening, Carlos Guestrin, et al. Reinforcement learning via self-distillation.arXiv preprint arXiv:2601.20802, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[23] [23]

Scaling reasoning efficiently via relaxed on-policy distillation.arXiv preprint arXiv:2603.11137,

Jongwoo Ko, Sara Abdali, Young Jin Kim, Tianyi Chen, and Pashmina Cameron. Scaling reasoning efficiently via relaxed on-policy distillation.arXiv preprint arXiv:2603.11137, 2026

work page arXiv 2026

[24] [24]

Audio augmentation for speech recognition

Tom Ko, Vijayaditya Peddinti, Daniel Povey, and Sanjeev Khudanpur. Audio augmentation for speech recognition. InInterspeech, volume 2015, page 3586, 2015

2015

[25] [25]

Robustness in Large Language Models: A Survey of Mitigation Strategies and Evaluation Metrics,

Pankaj Kumar and Subhankar Mishra. Robustness in large language models: A survey of mitigation strategies and evaluation metrics.arXiv preprint arXiv:2505.18658, 2025

work page arXiv 2025

[26] [26]

Mmau- pro: A challenging and comprehensive benchmark for holistic evaluation of audio general intelligence

Sonal Kumar, Šimon Sedlá ˇcek, Vaibhavi Lokegaonkar, Fernando López, Wenyi Yu, Nishit Anand, Hyeonggon Ryu, Lichang Chen, Maxim Pli ˇcka, Miroslav Hlavá ˇcek, et al. Mmau- pro: A challenging and comprehensive benchmark for holistic evaluation of audio general intelligence. InProceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 2...

2026

[27] [27]

Isa-bench: Benchmarking instruction sensitivity for large audio language models.arXiv preprint arXiv:2510.23558, 2025

Bohan Li, Wenbin Huang, Yuhang Qiu, Yiwei Guo, Hankun Wang, Zhihan Li, Jing Peng, Ziyang Ma, Xie Chen, and Kai Yu. Isa-bench: Benchmarking instruction sensitivity for large audio language models.arXiv preprint arXiv:2510.23558, 2025

work page arXiv 2025

[28] [28]

Applications of large language models and multimodal large models in autonomous driving: A comprehensive review.Drones, 9(4):238, 2025

Jing Li, Jingyuan Li, Guo Yang, Lie Yang, Haozhuang Chi, and Lichao Yang. Applications of large language models and multimodal large models in autonomous driving: A comprehensive review.Drones, 9(4):238, 2025

2025

[29] [29]

Robust automatic speech recogni- tion: a bridge to practical applications

Jinyu Li, Li Deng, Reinhold Haeb-Umbach, and Yifan Gong. Robust automatic speech recogni- tion: a bridge to practical applications. 2015

2015

[30] [30]

Hidden in the noise: Unveiling backdoors in audio llms alignment through latent acoustic pattern triggers

Liang Lin, Miao Yu, Kaiwen Luo, Yibo Zhang, Lilan Peng, Dexian Wang, Xuehai Tang, Yuanhe Zhang, Xikang Yang, Zhenhong Zhou, et al. Hidden in the noise: Unveiling backdoors in audio llms alignment through latent acoustic pattern triggers. InProceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 32015–32023, 2026

2026

[31] [31]

ChronosAudio: A Comprehensive Long-Audio Benchmark for Evaluating Audio-Large Language Models

Kaiwen Luo, Liang Lin, Yibo Zhang, Moayad Aloqaily, Dexian Wang, Zhenhong Zhou, Junwei Zhang, Kun Wang, Li Sun, and Qingsong Wen. Chronosaudio: A comprehensive long-audio benchmark for evaluating audio-large language models.arXiv preprint arXiv:2601.04876, 2026. 11

work page internal anchor Pith review Pith/arXiv arXiv 2026

[32] [32]

Mmar: A challenging benchmark for deep reasoning in speech, audio, music, and their mix.arXiv preprint arXiv:2505.13032, 2025

Ziyang Ma, Yinghao Ma, Yanqiao Zhu, Chen Yang, Yi-Wen Chao, Ruiyang Xu, Wenxi Chen, Yuanzhe Chen, Zhuo Chen, Jian Cong, et al. Mmar: A challenging benchmark for deep reasoning in speech, audio, music, and their mix.arXiv preprint arXiv:2505.13032, 2025

work page arXiv 2025

[33] [33]

Specaugment: A simple data augmentation method for automatic speech recognition.arXiv preprint arXiv:1904.08779, 2019

Daniel S Park, William Chan, Yu Zhang, Chung-Cheng Chiu, Barret Zoph, Ekin D Cubuk, and Quoc V Le. Specaugment: A simple data augmentation method for automatic speech recognition.arXiv preprint arXiv:1904.08779, 2019

work page arXiv 1904

[34] [34]

SEGAN: Speech Enhancement Generative Adversarial Network

Santiago Pascual, Antonio Bonafonte, and Joan Serra. Segan: Speech enhancement generative adversarial network.arXiv preprint arXiv:1703.09452, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017

[35] [35]

Deep learning for audio signal processing.IEEE Journal of Selected Topics in Signal Processing, 13(2):206–219, 2019

Hendrik Purwins, Bo Li, Tuomas Virtanen, Jan Schlüter, Shuo-Yiin Chang, and Tara Sainath. Deep learning for audio signal processing.IEEE Journal of Selected Topics in Signal Processing, 13(2):206–219, 2019

2019

[36] [36]

Robust speech recognition via large-scale weak supervision

Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. Robust speech recognition via large-scale weak supervision. InInternational conference on machine learning, pages 28492–28518. PMLR, 2023

2023

[37] [37]

A reduction of imitation learning and structured prediction to no-regret online learning

Stéphane Ross, Geoffrey Gordon, and Drew Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. InProceedings of the fourteenth interna- tional conference on artificial intelligence and statistics, pages 627–635. JMLR Workshop and Conference Proceedings, 2011

2011

[38] [38]

AudioPaLM: A Large Language Model That Can Speak and Listen

Paul K Rubenstein, Chulayuth Asawaroengchai, Duc Dung Nguyen, Ankur Bapna, Zalán Borsos, Félix de Chaumont Quitry, Peter Chen, Dalia El Badawy, Wei Han, Eugene Kharitonov, et al. Audiopalm: A large language model that can speak and listen.arXiv preprint arXiv:2306.12925, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[39] [39]

On-policy self-distillation for reasoning compression.arXiv e-prints, pages arXiv–2603, 2026

Hejian Sang, Yuanda Xu, Zhengze Zhou, Ran He, Zhipeng Wang, and Jiachen Sun. On-policy self-distillation for reasoning compression.arXiv e-prints, pages arXiv–2603, 2026

2026

[40] [40]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[41] [41]

Taskbench: Benchmarking large language models for task automation.Advances in Neural Information Processing Systems, 37:4540–4574, 2024

Yongliang Shen, Kaitao Song, Xu Tan, Wenqi Zhang, Kan Ren, Siyu Yuan, Weiming Lu, Dongsheng Li, and Yueting Zhuang. Taskbench: Benchmarking large language models for task automation.Advances in Neural Information Processing Systems, 37:4540–4574, 2024

2024

[42] [42]

Self-Distillation Enables Continual Learning

Idan Shenfeld, Mehul Damani, Jonas Hübotter, and Pulkit Agrawal. Self-distillation enables continual learning.arXiv preprint arXiv:2601.19897, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[43] [43]

Expanding the capabilities of reinforcement learning via text feedback.arXiv preprint arXiv:2602.02482, 2026

Yuda Song, Lili Chen, Fahim Tajwar, Remi Munos, Deepak Pathak, J Andrew Bagnell, Aarti Singh, and Andrea Zanette. Expanding the capabilities of reinforcement learning via text feedback.arXiv preprint arXiv:2602.02482, 2026

work page arXiv 2026

[44] [44]

Beyond Compromise: Pareto-Lenient Consensus for Efficient Multi-Preference LLM Alignment

Renxuan Tan, Rongpeng Li, Zhifeng Zhao, and Honggang Zhang. Beyond compromise: Pareto- lenient consensus for efficient multi-preference llm alignment.arXiv preprint arXiv:2604.05965, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[45] [45]

SALMONN: Towards Generic Hearing Abilities for Large Language Models

Changli Tang, Wenyi Yu, Guangzhi Sun, Xianzhao Chen, Tian Tan, Wei Li, Lu Lu, Zejun Ma, and Chao Zhang. Salmonn: Towards generic hearing abilities for large language models.arXiv preprint arXiv:2310.13289, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[46] [46]

A hybrid dsp/deep learning approach to real-time full-band speech enhance- ment

Jean-Marc Valin. A hybrid dsp/deep learning approach to real-time full-band speech enhance- ment. In2018 IEEE 20th international workshop on multimedia signal processing (MMSP), pages 1–5. IEEE, 2018

2018

[47] [47]

Audiobench: A universal benchmark for audio large language models

Bin Wang, Xunlong Zou, Geyu Lin, Shuo Sun, Zhuohan Liu, Wenyu Zhang, Zhengyuan Liu, AiTi Aw, and Nancy Chen. Audiobench: A universal benchmark for audio large language models. InProceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pa...

2025

[48] [48]

Supervised speech separation based on deep learning: An overview.IEEE/ACM transactions on audio, speech, and language processing, 26(10):1702– 1726, 2018

DeLiang Wang and Jitong Chen. Supervised speech separation based on deep learning: An overview.IEEE/ACM transactions on audio, speech, and language processing, 26(10):1702– 1726, 2018

2018

[49] [49]

ESPnet: End-to-End Speech Processing Toolkit

Shinji Watanabe, Takaaki Hori, Shigeki Karita, Tomoki Hayashi, Jiro Nishitoba, Yuya Unno, Nelson Enrique Yalta Soplin, Jahn Heymann, Matthew Wiesner, Nanxin Chen, et al. Espnet: End-to-end speech processing toolkit.arXiv preprint arXiv:1804.00015, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018

[50] [50]

Step-Audio 2 Technical Report

Boyong Wu, Chao Yan, Chen Hu, Cheng Yi, Chengli Feng, Fei Tian, Feiyu Shen, Gang Yu, Haoyang Zhang, Jingbei Li, et al. Step-audio 2 technical report.arXiv preprint arXiv:2507.16632, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[51] [51]

Xlsr-mamba: A dual-column bidirectional state space model for spoofing attack detection.IEEE Signal Processing Letters, 2025

Yang Xiao and Rohan Kumar Das. Xlsr-mamba: A dual-column bidirectional state space model for spoofing attack detection.IEEE Signal Processing Letters, 2025

2025

[52] [52]

Adakws: Towards robust keyword spotting with test-time adaptation

Yang Xiao, Tianyi Peng, Yanghao Zhou, and Rohan Kumar Das. Adakws: Towards robust keyword spotting with test-time adaptation. InINTERSPEECH 2025, 2025

2025

[53] [53]

Self-training with noisy student improves imagenet classification

Qizhe Xie, Minh-Thang Luong, Eduard Hovy, and Quoc V Le. Self-training with noisy student improves imagenet classification. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10687–10698, 2020

2020

[54] [54]

Qwen3-Omni Technical Report

Jin Xu, Zhifang Guo, Hangrui Hu, Yunfei Chu, Xiong Wang, Jinzheng He, Yuxuan Wang, Xian Shi, Ting He, Xinfa Zhu, et al. Qwen3-omni technical report.arXiv preprint arXiv:2509.17765, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[55] [55]

Air-bench: Benchmarking large audio-language models via generative comprehension

Qian Yang, Jin Xu, Wenrui Liu, Yunfei Chu, Ziyue Jiang, Xiaohuan Zhou, Yichong Leng, Yuanjun Lv, Zhou Zhao, Chang Zhou, et al. Air-bench: Benchmarking large audio-language models via generative comprehension. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1979–1998, 2024

1979

[56] [56]

Learning beyond Teacher: Generalized On-Policy Distillation with Reward Extrapolation

Wenkai Yang, Weijie Liu, Ruobing Xie, Kai Yang, Saiyong Yang, and Yankai Lin. Learning beyond teacher: Generalized on-policy distillation with reward extrapolation.arXiv preprint arXiv:2602.12125, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[57] [57]

MiniCPM-V: A GPT-4V Level MLLM on Your Phone

Yuan Yao, Tianyu Yu, Ao Zhang, Chongyi Wang, Junbo Cui, Hongji Zhu, Tianchi Cai, Haoyu Li, Weilin Zhao, Zhihui He, et al. Minicpm-v: A gpt-4v level mllm on your phone.arXiv preprint arXiv:2408.01800, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[58] [58]

On-Policy Context Distillation for Language Models

Tianzhu Ye, Li Dong, Xun Wu, Shaohan Huang, and Furu Wei. On-policy context distillation for language models.arXiv preprint arXiv:2602.12275, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[59] [59]

Leveraging llm and text-queried separation for noise-robust sound event detection

Han Yin, Yang Xiao, Jisheng Bai, and Rohan Kumar Das. Leveraging llm and text-queried separation for noise-robust sound event detection. InICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) Satellite Workshop on Speech and Audio Language Models (SALMA), 2024

2025

[60] [60]

Focus Then Listen: An Empirical Study of Plug-and-Play Audio Enhancer for Noise-Robust Large Audio Language Models

Han Yin, Yang Xiao, Younghoo Kwon, Ting Dang, and Jung-Woo Choi. Focus then listen: Exploring plug-and-play audio enhancer for noise-robust large audio language models.arXiv preprint arXiv:2603.04862, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[61] [61]

DAPO: An Open-Source LLM Reinforcement Learning System at Scale

Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, et al. Dapo: An open-source llm reinforcement learning system at scale.arXiv preprint arXiv:2503.14476, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[62] [62]

See: Signal embedding energy for quantifying noise interference in large audio language models.arXiv preprint arXiv:2601.07331, 2026

Yuanhe Zhang, Jiayu Tian, Yibo Zhang, Shilinlu Yan, Liang Lin, Zhenhong Zhou, Li Sun, and Sen Su. See: Signal embedding energy for quantifying noise interference in large audio language models.arXiv preprint arXiv:2601.07331, 2026

work page arXiv 2026

[63] [63]

Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models

Siyan Zhao, Zhihui Xie, Mengchen Liu, Jing Huang, Guan Pang, Feiyu Chen, and Aditya Grover. Self-distilled reasoner: On-policy self-distillation for large language models.arXiv preprint arXiv:2601.18734, 2026. 13 A Related Work Knowledge Distillation and Frontier Self-Distillation.Knowledge distillation minimizes the functional or distributional discrepan...

work page internal anchor Pith review Pith/arXiv arXiv 2026