
arXiv: 2605.23954 · v1 · pith:3ICCMY4Anew · submitted 2026-05-11 · 💻 cs.CL · cs.AI · cs.SD

EchoDistill: Alignment Noisy-to-Clean Self-Distillation for Robust Audio LLMs

Pith reviewed 2026-06-30 22:50 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.SD
keywords: Audio LLMs, noise robustness, self-distillation, GRPO, semantic alignment, hallucination reduction, inference-time optimization

The pith

EchoDistill aligns a noisy Audio LLM student to a frozen clean teacher via GRPO token-consistency rewards to cut semantic drift without raising inference cost.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces EchoDistill to reduce hallucinations and semantic drift in Audio Large Language Models when inputs contain real-world noise. A frozen clean-audio teacher supplies semantic references while the noisy student samples multiple response trajectories under the noisy conditions it will face at inference. These trajectories are refined with group-relative policy optimization that treats token-level agreement with the teacher as a reward bonus, plus audio-aware reward shaping. The result is higher task accuracy and better-grounded reasoning on noisy audio, with no added inference cost for the deployed model.

Core claim

EchoDistill is an alignment-based noisy-to-clean self-distillation framework in which a frozen clean-audio teacher provides semantic references for an inference-time noisy-audio student; the student samples candidate responses under noise, and these trajectories are optimized via group-relative policy optimization where token-level consistency with the teacher acts as a reward bonus, yielding improved semantic reliability and task performance under complex noise with no added inference cost.

What carries the argument

Group-relative policy optimization (GRPO) on noisy student trajectories that uses token-level consistency with the clean teacher as the reward signal.
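
To make the mechanism concrete, here is a minimal sketch of a group-relative advantage computation with a teacher-consistency bonus. It is not the authors' implementation: the additive form, the weight `beta`, the position-wise `token_agreement` proxy, and the toy token sequences are all assumptions introduced for illustration, and audio-aware shaping is left out.

```python
from statistics import mean, pstdev


def token_agreement(student_tokens, teacher_tokens):
    """Crude consistency proxy: share of positions where the tokens match.

    ASSUMPTION: the paper's token-level consistency score is not given on
    this page; position-wise equality is only a stand-in.
    """
    longest = max(len(student_tokens), len(teacher_tokens))
    if longest == 0:
        return 0.0
    matches = sum(s == t for s, t in zip(student_tokens, teacher_tokens))
    return matches / longest


def group_relative_advantages(task_rewards, consistency, beta=0.5, eps=1e-6):
    """Standardize (task reward + beta * consistency) within one sampled group.

    ASSUMPTION: the bonus is additive with a hypothetical weight `beta`.
    """
    totals = [r + beta * c for r, c in zip(task_rewards, consistency)]
    mu = mean(totals)
    sigma = pstdev(totals)  # population std over the group
    return [(t - mu) / (sigma + eps) for t in totals]


# Toy group: one clean-teacher reference and four noisy-student samples.
teacher = ["a", "dog", "barks", "twice"]
samples = [
    ["a", "dog", "barks", "twice"],
    ["a", "dog", "barks", "once"],
    ["a", "cat", "meows", "twice"],
    ["the", "car", "honks"],
]
task_rewards = [1.0, 0.0, 1.0, 0.0]  # e.g. 1.0 when the final answer is correct

consistency = [token_agreement(s, teacher) for s in samples]
advantages = group_relative_advantages(task_rewards, consistency)
print(advantages)
```

In this toy group, two responses earn the same task reward, yet the one that agrees more with the teacher receives the larger advantage. That is the sense in which the consistency term acts as a bonus rather than a replacement for the task reward.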

If this is right

  • Average GSR rises 4.18 percent under strong noise relative to the strongest prior baseline.
  • On Qwen-Omni, the full method adds 3.02 percent Acc, 3.89 percent Noisy, and 4.53 percent GSR over GRPO alone.
  • Reasoning paths become both more correct and more directly tied to the actual acoustic input.
  • Because the teacher is needed only during training (Figure 3), the deployed student runs alone at inference and pays no extra test-time cost.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The separation of a clean reference model from the deployed noisy model suggests a general pattern for test-time robustness in other multimodal settings where clean exemplars can be obtained offline.
  • Because the optimization occurs only on sampled trajectories, the method could be combined with existing acoustic enhancement pipelines without retraining the core LLM.
  • If the clean teacher itself contains domain biases, those biases would be transferred into the student's selected paths, an effect not measured in the current experiments.

Load-bearing premise

The frozen clean-audio teacher supplies semantic references that are reliable enough to serve as token-level consistency rewards for the noisy student.

What would settle it

Running the same noisy-audio evaluation suite and finding that EchoDistill produces no measurable gain in GSR or accuracy over the GRPO-only baseline would falsify the value of the teacher alignment step.

Figures

Figures reproduced from arXiv: 2605.23954 by Cai Yuchen, Chunxi Luo, Gongli Xi, Jie Zhang, Jin Wang, Junhao Dong, Kaiwen Luo, Kun Wang, Liang Lin, Qiankun Li, Yuanhe Zhang, Zhenhong Zhou.

Figure 1: Task-wise comparison of noisy accuracy and GSR across Music, Sound, and Speech. Over…
Figure 2: Sparse and uneven audio grounding in correct trajectories. (a) Distribution of token/window-level audio dependency $d_{i,w}$, measured by the decision-margin drop after ablating a local audio window. (b) Distribution of trajectory-level audio anchor $g_i$, measuring the gain of using audio over the no-audio condition.
Figure 3: Overview of EchoDistill. EchoDistill uses a noisy-audio student and a frozen clean-audio teacher during training, combining task reward, noisy-to-clean evidence alignment, and preference optimization while retaining single-student inference. The distillation loss $\mathcal{L}^{(i)}_{\text{distill}}$ is the masked teacher-to-student KL between these distributions over response tokens. The mask $m_{i,t}$ excludes prompt tokens, audio p…
Figure 4: EchoDistill consistently achieves higher Composite Robustness Score than the strongest…
Figure 5: Qwen2.5-Omni results under −10 dB noise on Music, Sound, and Speech.
Figure 6: Training dynamics of noisy-to-clean consistency. MiniCPM improves rapidly in the early stage, while StepAudio exhibits a steadier upward trend, indicating that EchoDistill progressively aligns noisy-input generation with clean-audio semantic behavior. Obs. 7: Noisy-to-clean alignment progressively suppresses semantic drift. As shown in…
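
The Figure 3 caption describes the distillation loss as a masked teacher-to-student KL over response tokens. The sketch below shows one way such a loss could be computed. The direction KL(teacher ‖ student), the averaging over unmasked positions, and the function name `masked_kl` are assumptions, since the caption is truncated and the paper's equation is not reproduced here.

```python
import math


def masked_kl(teacher_probs, student_probs, mask, eps=1e-12):
    """Mean KL(teacher || student) over positions where mask is 1.

    teacher_probs, student_probs: one probability vector per token position.
    mask: 1 for response tokens, 0 for positions to ignore (e.g. prompt tokens).
    ASSUMPTION: KL direction and mean normalization are illustrative choices.
    """
    total, count = 0.0, 0
    for p_teacher, p_student, m in zip(teacher_probs, student_probs, mask):
        if not m:
            continue
        total += sum(
            pt * math.log((pt + eps) / (ps + eps))
            for pt, ps in zip(p_teacher, p_student)
            if pt > 0.0
        )
        count += 1
    return total / count if count else 0.0


# Three positions over a two-word vocabulary; position 0 is a prompt token.
teacher_probs = [[0.9, 0.1], [0.7, 0.3], [0.5, 0.5]]
student_probs = [[0.1, 0.9], [0.7, 0.3], [0.9, 0.1]]
mask = [0, 1, 1]

loss = masked_kl(teacher_probs, student_probs, mask)
print(f"masked KL: {loss:.4f}")
```

The large disagreement at position 0 does not affect the loss because the mask excludes it. Only the two response positions contribute.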
Original abstract

Audio Large Language Models (ALLMs) are highly vulnerable to real-world noise, which often induces severe semantic drift and hallucinations. Existing robustness methods primarily rely on waveform-level acoustic enhancement, answer-level supervision, or the internal suppression of noise representations. To address these issues, we propose EchoDistill, an alignment-based noisy-to-clean self-distillation framework. EchoDistill leverages a frozen clean-audio teacher to provide semantic references for an inference-time noisy-audio student. Specifically, the student samples candidate responses under noisy conditions to expose its test-time behavior. These trajectories are then optimized via group-relative policy optimization (GRPO), where the token-level consistency with the teacher acts as a reward bonus. By aligning the noisy student's candidate responses with clean semantic evidence, and applying audio-aware reward shaping, our method encourages reasoning trajectories that are both correct and genuinely acoustically grounded. EchoDistill significantly improves the semantic reliability and task performance of Audio LLMs under complex noise, without introducing any additional inference costs. Extensive experiments show that: (I) Compared with the strongest baseline, EchoDistill achieves average improvements of 4.18%↑ in GSR under strong noise. (II) Ablation results on Qwen-Omni further show that EchoDistill improves over the GRPO-only variant by 3.02%↑ in Acc, 3.89%↑ in Noisy, and 4.53%↑ in GSR on average. Our codes are available at https://anonymous.4open.science/r/echodistill-10DE.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes EchoDistill, a noisy-to-clean self-distillation framework for Audio Large Language Models. A frozen clean-audio teacher supplies semantic references that are used to optimize an inference-time noisy student's sampled response trajectories via Group Relative Policy Optimization (GRPO), with token-level consistency serving as a reward bonus and audio-aware shaping applied. The method claims to improve semantic reliability and task performance under complex noise without added inference cost, reporting an average 4.18% GSR gain under strong noise versus the strongest baseline and further gains (3.02% Acc, 3.89% Noisy, 4.53% GSR) over a GRPO-only ablation on Qwen-Omni.

Significance. If the central results hold, the work would offer a practical route to noise-robust ALLMs by aligning noisy inference trajectories to clean semantic evidence at no extra runtime cost. The use of GRPO on token-consistency rewards and the explicit separation of teacher and student audio conditions constitute a distinct technical contribution relative to waveform enhancement or answer-level supervision baselines.

major comments (2)
  1. The reported performance gains rest on the assumption that token-level consistency between the frozen clean teacher and the noisy student's trajectories constitutes a valid reward signal under GRPO. The manuscript provides no quantitative validation (e.g., correlation between teacher-student agreement rates and ground-truth correctness) that this signal rewards acoustically grounded reasoning rather than superficial lexical overlap; this check is load-bearing for the 4.18% GSR claim under strong noise.
  2. Ablation results on Qwen-Omni claim consistent improvements over the GRPO-only variant, yet the description does not specify how the teacher-derived reward is ablated (e.g., whether the teacher is replaced by a noisy reference or removed entirely) or report variance across noise types and seeds. Without these controls it is unclear whether the additional 3-4.5% gains are attributable to the clean-teacher alignment.
minor comments (2)
  1. The abstract states that codes are available at an anonymous repository; the final version should replace this with a permanent link and include the exact experimental configurations (noise levels, datasets, and GSR definition) used for the reported percentages.
  2. Notation for the reward bonus and audio-aware shaping is introduced only descriptively; an explicit equation or pseudocode block would clarify how the token-consistency term is combined with any base reward in the GRPO objective.
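
The variance reporting requested in major comment 2 could take roughly the following shape: group a metric by noise type and report the mean and sample standard deviation across seeds. The noise-type labels, the `(noise_type, seed, value)` record layout, and every number below are placeholders for illustration only, not results from the paper.

```python
from collections import defaultdict
from statistics import mean, stdev


def summarize_by_noise(runs):
    """Map noise type -> (mean, sample std) of a metric across seeds."""
    grouped = defaultdict(list)
    for noise_type, _seed, value in runs:
        grouped[noise_type].append(value)
    return {
        noise: (mean(vals), stdev(vals) if len(vals) > 1 else 0.0)
        for noise, vals in grouped.items()
    }


# PLACEHOLDER values for illustration only; not results from the paper.
runs = [
    ("babble", 0, 60.0), ("babble", 1, 62.0), ("babble", 2, 61.0),
    ("music", 0, 50.0), ("music", 1, 54.0), ("music", 2, 52.0),
]

summary = summarize_by_noise(runs)
for noise, (avg, sd) in summary.items():
    print(f"{noise}: {avg:.2f} +/- {sd:.2f}")
```

Reporting the spread next to each mean lets a reader judge whether a 3 to 4.5 point gain is larger than seed-to-seed variation.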

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below and will incorporate revisions to strengthen the manuscript.

point-by-point responses
  1. Referee: The reported performance gains rest on the assumption that token-level consistency between the frozen clean teacher and the noisy student's trajectories constitutes a valid reward signal under GRPO. The manuscript provides no quantitative validation (e.g., correlation between teacher-student agreement rates and ground-truth correctness) that this signal rewards acoustically grounded reasoning rather than superficial lexical overlap; this check is load-bearing for the 4.18% GSR claim under strong noise.

    Authors: We agree that the manuscript lacks an explicit quantitative validation such as correlation between teacher-student agreement rates and ground-truth correctness. In the revised version we will add a dedicated analysis subsection that computes and reports Pearson correlations (and associated p-values) between per-trajectory agreement rates and correctness labels across all noise conditions and tasks. This will directly test whether the reward signal aligns with acoustically grounded reasoning. revision: yes

  2. Referee: Ablation results on Qwen-Omni claim consistent improvements over the GRPO-only variant, yet the description does not specify how the teacher-derived reward is ablated (e.g., whether the teacher is replaced by a noisy reference or removed entirely) or report variance across noise types and seeds. Without these controls it is unclear whether the additional 3-4.5% gains are attributable to the clean-teacher alignment.

    Authors: We acknowledge the ablation description is underspecified. In revision we will explicitly state that the GRPO-only baseline removes the teacher-derived token-consistency reward entirely (retaining only the task-performance component of the reward) and will report results with standard deviations across three random seeds, broken down by individual noise type (babble, music, environmental, etc.). These additions will clarify attribution to the clean-teacher signal. revision: yes
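
The correlation analysis promised in the first response can be sketched as follows: a Pearson coefficient between per-trajectory agreement rates and binary correctness labels, with a p-value. The rebuttal does not say how the p-value would be obtained, so the permutation test here is an assumption. The function names and toy data are hypothetical as well.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(vx * vy)


def permutation_p_value(xs, ys, n_perm=2000, seed=0):
    """Two-sided permutation p-value for the observed |r|.

    ASSUMPTION: a permutation test stands in for whatever test the authors use.
    """
    rng = random.Random(seed)
    observed = abs(pearson_r(xs, ys))
    shuffled = list(ys)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if abs(pearson_r(xs, shuffled)) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)


# Toy data: teacher-student agreement rate and correctness per trajectory.
agreement = [0.9, 0.8, 0.7, 0.3, 0.2, 0.1]
correct = [1, 1, 1, 0, 0, 0]

r = pearson_r(agreement, correct)
p = permutation_p_value(agreement, correct)
print(f"r = {r:.3f}, permutation p = {p:.3f}")
```

A strong positive correlation would support the referee's requested check only in part. High agreement could still reflect lexical overlap, so the analysis would be more convincing if repeated on trajectories whose wording differs from the teacher's.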

Circularity Check

0 steps flagged

No circularity: derivation relies on external teacher assumption and empirical results

full rationale

The paper presents EchoDistill as an empirical alignment framework in which a separately frozen clean-audio teacher supplies token-level consistency rewards to optimize a noisy student's GRPO trajectories. No equations, fitted parameters renamed as predictions, or self-citations are shown that would reduce the claimed performance gains (e.g., 4.18% GSR) to a definitional identity or tautological input. The method's load-bearing step is the external assumption that the clean teacher yields reliable semantic references, which is stated as an assumption rather than derived from the student's own outputs. All reported improvements are positioned as experimental outcomes from ablations and comparisons, leaving the derivation chain self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no explicit free parameters, axioms, or invented entities; the method relies on standard self-distillation and RL concepts from prior literature.

pith-pipeline@v0.9.1-grok · 5870 in / 1079 out tokens · 20712 ms · 2026-06-30T22:50:29.236338+00:00 · methodology

