pith. sign in

arxiv: 2606.18273 · v1 · pith:EMEXYGSXnew · submitted 2026-06-05 · 💻 cs.CL · cs.AI· cs.SD· eess.AS

Continuous Audio Thinking for Large Audio Language Models

Pith reviewed 2026-06-27 21:51 UTC · model grok-4.3

classification 💻 cs.CL cs.AIcs.SDeess.AS
keywords continuous audio thinkinglarge audio language modelslatent workspaceexpert distillationaudio reasoningspeech transcriptionmusic classificationacoustic information
0
0 comments X

The pith

Large audio language models preserve acoustic details like prosody and pitch through a continuous latent workspace before generating text responses.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large audio language models lose acoustic information such as phonetic detail, prosody, sound events, affect, and pitch because their hidden states shift toward text generation. Continuous Audio Thinking adds a continuous latent workspace where the model organizes this information, drawing from distillation by audio experts. The workspace processes in one prefill step and adds no autoregressive decoding cost. Gains appear across benchmarks for audio reasoning, understanding, music classification, speech emotion, and transcription when tested on multiple base models. Analysis shows the distilled supervision reaches the final textual outputs.

Core claim

The paper introduces Continuous Audio Thinking (CoAT), a framework that equips large audio language models with a continuous latent workspace for organizing acoustic information prior to response generation. This workspace is grounded by distillation from audio experts, allowing the model to utilize rich acoustic details when generating responses. The continuous thinking block can be processed in a single prefill, incurring no additional autoregressive decoding cost over the baseline. Experiments on three LALMs demonstrate gains across a suite of benchmarks, and analysis shows the supervision propagates to textual responses.

What carries the argument

The continuous latent workspace that organizes acoustic information from expert distillation before text generation.

If this is right

  • Performance improves on audio reasoning, audio understanding, music classification, speech emotion, and speech transcription tasks.
  • The method works across different large audio language models.
  • The continuous thinking block adds no extra autoregressive decoding cost compared with the baseline.
  • Auxiliary supervision from the distillation reaches the model's final textual responses.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The single-prefill design may support extensions to streaming or low-latency audio applications.
  • Similar latent workspaces could be tested for retaining non-text features in other modalities such as video.
  • The approach might allow smaller models to achieve results previously requiring larger capacity by better retaining input details.

Load-bearing premise

Distillation from audio experts supplies usable acoustic information into the continuous thinking positions without domain mismatch that negates the performance gains.

What would settle it

A controlled run that keeps the continuous workspace but removes the expert distillation and checks whether benchmark gains vanish would test the central claim.

Figures

Figures reproduced from arXiv: 2606.18273 by Changho Choi, Dong-Jae Lee, Gyojin Han, Jongsuk Kim, Junmo Kim.

Figure 1
Figure 1. Figure 1: Thinking paradigms in audio language models. (a) Vanilla audio LMs decode the response directly from audio and instruction tokens. (b) Discrete thinking generates textual thinking tokens autoregressively before the answer. (c) Continuous Audio Thinking (ours) prepends a fixed￾length block of continuous thinking tokens that is consumed in a single prefill, letting the model think in an audio-aligned latent … view at source ↗
Figure 2
Figure 2. Figure 2: CoAT architecture. A continuous audio thinking block is supervised by five audio experts via per-task projection heads, covering audio feature reconstruction, speech representation, sound event detection, paralinguistic features, and pitch. The projection heads decode the shared hidden states into expert-aligned predictions, used only during training. on triplets of audio, instruction, and target response,… view at source ↗
Figure 3
Figure 3. Figure 3: Linear probe accuracy at the audio￾think hidden across training checkpoints, on 4- class IEMOCAP emotion and 12-class MuchoMu￾sic dominant pitch. We probe whether CoAT’s auxiliary supervision injects task-relevant information at thinking po￾sitions by training linear probes on the LM hid￾den state at two positions. The first one is the audio-think hidden, taken as the mean over the thinking block τp where … view at source ↗
Figure 4
Figure 4. Figure 4: Example reconstructions from CoAT’s per-task heads. Each pair shows the expert target (right) and the corresponding student prediction (left) at the audio-think positions. beyond reconstruction. This indicates that the auxiliary supervision injects task-relevant information into the supervised position [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗
read the original abstract

Large audio language models (LALMs) have shown impressive capabilities on diverse audio understanding tasks, ranging from speech transcription to music analysis. However, because LALMs are typically trained to produce text-aligned responses, their hidden states are progressively shaped for text generation rather than for preserving acoustic information. As a result, the diverse acoustic content that audio carries, such as phonetic detail, prosody, sound events, affect, and pitch, is lost along the way and difficult to leverage in the response. We introduce Continuous Audio Thinking (CoAT), a framework that equips audio language models with a continuous latent workspace for organizing acoustic information prior to response generation, grounded by distillation from audio experts. Within the thinking space, the model can utilize the rich acoustic information provided by expert distillation when generating its response. Furthermore, the proposed continuous thinking block can be processed in a single prefill, so CoAT does not require additional autoregressive decoding cost over the baseline. Across three LALMs, Qwen2-Audio, Qwen2.5-Omni-7B, and Audio Flamingo~3, performance gains on a broad benchmark suite spanning audio reasoning, audio understanding, music classification, speech emotion, and speech transcription demonstrate the effectiveness of CoAT. Further analysis confirms that the auxiliary supervision propagates from the thinking positions to the model's textual responses.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces Continuous Audio Thinking (CoAT) for large audio language models (LALMs). It adds a continuous latent workspace (thinking space) that organizes acoustic information (phonetic detail, prosody, affect, etc.) prior to text response generation; this workspace is populated via distillation from audio expert models and processed in a single prefill pass. The design is claimed to incur no extra autoregressive decoding cost over the baseline LALM. Experiments across Qwen2-Audio, Qwen2.5-Omni-7B, and Audio Flamingo 3 report gains on a suite of benchmarks covering audio reasoning, audio understanding, music classification, speech emotion recognition, and speech transcription; further analysis is said to confirm that the auxiliary supervision propagates from the thinking positions into the final textual outputs.

Significance. If the distillation successfully transfers usable acoustic structure into the latent workspace without domain mismatch or post-hoc tuning that inflates the gains, CoAT would offer an efficient mechanism for preserving non-text-aligned acoustic information inside LALMs. The single-prefill constraint is a concrete efficiency advantage. The breadth of the benchmark suite and the explicit propagation analysis are positive features that would make the result falsifiable and reproducible if the experimental details are complete.

major comments (2)
  1. [Abstract (final sentence) and the 'further analysis' paragraph] The central claim that expert distillation populates the continuous thinking positions with usable acoustic information (phonetic, prosody, affect) that then propagates to improve textual responses rests on the assumption of clean transfer without domain mismatch. The abstract asserts this propagation but provides no quantitative evidence (e.g., alignment metrics between expert and LALM latent spaces or ablation removing distillation while retaining the thinking block) that would rule out gains arising from added capacity or training rather than preserved acoustics.
  2. [Method description of the thinking block and prefill procedure] The single-prefill design is presented as incurring no additional autoregressive cost, yet the integration of the continuous thinking block with the existing LALM forward pass is not shown to preserve the original text-generation path exactly. Without an explicit statement of how the thinking positions are masked or bypassed during response generation, it is unclear whether the reported efficiency holds or whether hidden extra computation is introduced.
minor comments (2)
  1. [Abstract] The abstract lists three specific LALMs but does not state their parameter counts or base training regimes; adding this information would help readers assess the generality of the gains.
  2. [Method] Notation for the continuous latent workspace (e.g., how the thinking positions are indexed or concatenated with audio tokens) should be introduced with a small diagram or equation in the method section to avoid ambiguity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major comment below and will revise the manuscript accordingly to strengthen the presentation of evidence and clarify the implementation details.

read point-by-point responses
  1. Referee: [Abstract (final sentence) and the 'further analysis' paragraph] The central claim that expert distillation populates the continuous thinking positions with usable acoustic information (phonetic, prosody, affect) that then propagates to improve textual responses rests on the assumption of clean transfer without domain mismatch. The abstract asserts this propagation but provides no quantitative evidence (e.g., alignment metrics between expert and LALM latent spaces or ablation removing distillation while retaining the thinking block) that would rule out gains arising from added capacity or training rather than preserved acoustics.

    Authors: We agree that the manuscript would benefit from more direct quantitative support for the propagation claim. The existing further analysis links performance gains to the thinking positions but does not include the suggested alignment metrics or the specific ablation that isolates distillation from the added block. We will add both: (1) an ablation that retains the thinking block architecture while removing the expert distillation objective, and (2) alignment metrics (e.g., cosine similarity or CCA) between expert representations and the LALM thinking-space activations. These additions will be reported in a revised analysis section. revision: yes

  2. Referee: [Method description of the thinking block and prefill procedure] The single-prefill design is presented as incurring no additional autoregressive cost, yet the integration of the continuous thinking block with the existing LALM forward pass is not shown to preserve the original text-generation path exactly. Without an explicit statement of how the thinking positions are masked or bypassed during response generation, it is unclear whether the reported efficiency holds or whether hidden extra computation is introduced.

    Authors: We accept that an explicit description of the masking and integration is required. The current method text states that the thinking block is processed in a single prefill but does not detail how its outputs are isolated from the autoregressive loop. We will revise the method section to include a precise account of the forward pass, specifying that thinking positions receive a single prefill computation whose hidden states serve as fixed additional context, with causal masking that excludes them from subsequent token prediction. We will also add pseudocode and a diagram illustrating that the autoregressive decoding path remains unchanged from the baseline. revision: yes

Circularity Check

0 steps flagged

No circularity detected; CoAT is an independent architectural addition

full rationale

The paper presents CoAT as a new framework that adds a continuous latent workspace to LALMs, populated via distillation from external audio experts, with single-prefill processing and empirical gains on benchmarks. No equations, self-definitional reductions, fitted-input predictions, or load-bearing self-citations appear in the abstract or described method. The central claim rests on the independent effectiveness of the added thinking block and distillation transfer, not on any derivation that reduces to its own inputs by construction. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entities

Abstract-only; ledger populated from stated claims only. The central claim rests on the unverified premise that expert distillation supplies usable acoustic signals and that the thinking block preserves them for downstream text generation.

axioms (2)
  • domain assumption Distillation from audio experts can ground a continuous latent workspace without domain shift or loss of acoustic fidelity.
    Invoked in the sentence describing the thinking space as 'grounded by distillation from audio experts'.
  • domain assumption Processing the continuous thinking block in a single prefill preserves all benefits without additional autoregressive cost.
    Stated directly in the abstract as a property of the proposed block.
invented entities (1)
  • continuous latent workspace (thinking space) no independent evidence
    purpose: Organize acoustic information prior to response generation
    New postulated component introduced to retain phonetic, prosodic, and event-level audio details that are otherwise lost.

pith-pipeline@v0.9.1-grok · 5787 in / 1380 out tokens · 15268 ms · 2026-06-27T21:51:08.881822+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

63 extracted references · 13 linked inside Pith

  1. [1]

    Tyers, and Gregor Weber

    Rosana Ardila, Megan Branson, Kelly Davis, Michael Henretty, Michael Kohler, Josh Meyer, Reuben Morais, Lindsay Saunders, Francis M. Tyers, and Gregor Weber. Common voice: A massively-multilingual speech corpus, 2020

  2. [2]

    wav2vec 2.0: A framework for self-supervised learning of speech representations.Advances in neural information processing systems, 33:12449–12460, 2020

    Alexei Baevski, Yuhao Zhou, Abdelrahman Mohamed, and Michael Auli. wav2vec 2.0: A framework for self-supervised learning of speech representations.Advances in neural information processing systems, 33:12449–12460, 2020

  3. [3]

    Bittner, Juan José Bosch, David Rubinstein, Gabriel Meseguer-Brocal, and Sebastian Ewert

    Rachel M. Bittner, Juan José Bosch, David Rubinstein, Gabriel Meseguer-Brocal, and Sebastian Ewert. A lightweight instrument-agnostic model for polyphonic note transcription and multipitch estimation. InProceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Singapore, 2022

  4. [4]

    Iemocap: Interactive emotional dyadic motion capture database.Language resources and evaluation, 42(4):335–359, 2008

    Carlos Busso, Murtaza Bulut, Chi-Chun Lee, Abe Kazemzadeh, Emily Mower, Samuel Kim, Jeannette N Chang, Sungbok Lee, and Shrikanth S Narayanan. Iemocap: Interactive emotional dyadic motion capture database.Language resources and evaluation, 42(4):335–359, 2008

  5. [5]

    Gigaspeech: An evolving, multi-domain asr corpus with 10,000 hours of transcribed audio.arXiv preprint arXiv:2106.06909, 2021

    Guoguo Chen, Shuzhou Chai, Guanbo Wang, Jiayu Du, Wei-Qiang Zhang, Chao Weng, Dan Su, Daniel Povey, Jan Trmal, Junbo Zhang, et al. Gigaspeech: An evolving, multi-domain asr corpus with 10,000 hours of transcribed audio.arXiv preprint arXiv:2106.06909, 2021

  6. [6]

    Wavlm: Large-scale self-supervised pre- training for full stack speech processing.IEEE Journal of Selected Topics in Signal Processing, 16(6):1505–1518, 2022

    Sanyuan Chen, Chengyi Wang, Zhengyang Chen, Yu Wu, Shujie Liu, Zhuo Chen, Jinyu Li, Naoyuki Kanda, Takuya Yoshioka, Xiong Xiao, et al. Wavlm: Large-scale self-supervised pre- training for full stack speech processing.IEEE Journal of Selected Topics in Signal Processing, 16(6):1505–1518, 2022

  7. [7]

    Beats: Audio pre-training with acoustic tokenizers

    Sanyuan Chen, Yu Wu, Chengyi Wang, Shujie Liu, Daniel Tompkins, Zhuo Chen, Wanxiang Che, Xiangzhan Yu, and Furu Wei. Beats: Audio pre-training with acoustic tokenizers. In International Conference on Machine Learning, pages 5178–5193. PMLR, 2023

  8. [8]

    Eat: self-supervised pre-training with efficient audio transformer

    Wenxi Chen, Yuzhe Liang, Ziyang Ma, Zhisheng Zheng, and Xie Chen. Eat: self-supervised pre-training with efficient audio transformer. InProceedings of the Thirty-Third International Joint Conference on Artificial Intelligence, pages 3807–3815, 2024

  9. [9]

    Qwen2-audio technical report.arXiv preprint arXiv:2407.10759, 2024

    Yunfei Chu, Jin Xu, Qian Yang, Haojie Wei, Xipin Wei, Zhifang Guo, Yichong Leng, Yuan- jun Lv, Jinzheng He, Junyang Lin, et al. Qwen2-audio technical report.arXiv preprint arXiv:2407.10759, 2024

  10. [10]

    Qwen-audio: Advancing universal audio understanding via unified large-scale audio-language models.arXiv preprint arXiv:2311.07919, 2023

    Yunfei Chu, Jin Xu, Xiaohuan Zhou, Qian Yang, Shiliang Zhang, Zhijie Yan, Chang Zhou, and Jingren Zhou. Qwen-audio: Advancing universal audio understanding via unified large-scale audio-language models.arXiv preprint arXiv:2311.07919, 2023

  11. [11]

    High fidelity neural audio compression.arXiv preprint arXiv:2210.13438, 2022

    Alexandre Défossez, Jade Copet, Gabriel Synnaeve, and Yossi Adi. High fidelity neural audio compression.arXiv preprint arXiv:2210.13438, 2022

  12. [12]

    From explicit cot to implicit cot: Learning to internalize cot step by step.arXiv preprint arXiv:2405.14838, 2024

    Yuntian Deng, Yejin Choi, and Stuart Shieber. From explicit cot to implicit cot: Learning to internalize cot step by step.arXiv preprint arXiv:2405.14838, 2024

  13. [13]

    Pengi: An audio language model for audio tasks.Advances in Neural Information Processing Systems, 36:18090– 18108, 2023

    Soham Deshmukh, Benjamin Elizalde, Rita Singh, and Huaming Wang. Pengi: An audio language model for audio tasks.Advances in Neural Information Processing Systems, 36:18090– 18108, 2023

  14. [14]

    Clotho: An audio captioning dataset

    Konstantinos Drossos, Samuel Lipping, and Tuomas Virtanen. Clotho: An audio captioning dataset. InICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 736–740. IEEE, 2020

  15. [15]

    Audio flamingo 2: An audio-language model with long-audio understanding and expert reasoning abilities.arXiv preprint arXiv:2503.03983, 2025

    Sreyan Ghosh, Zhifeng Kong, Sonal Kumar, S Sakshi, Jaehyeon Kim, Wei Ping, Rafael Valle, Dinesh Manocha, and Bryan Catanzaro. Audio flamingo 2: An audio-language model with long-audio understanding and expert reasoning abilities.arXiv preprint arXiv:2503.03983, 2025. 10

  16. [16]

    Gama: A large audio-language model with advanced audio understanding and complex reasoning abilities

    Sreyan Ghosh, Sonal Kumar, Ashish Seth, Chandra Kiran Reddy Evuru, Utkarsh Tyagi, S Sakshi, Oriol Nieto, Ramani Duraiswami, and Dinesh Manocha. Gama: A large audio-language model with advanced audio understanding and complex reasoning abilities. InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 6288–6313, 2024

  17. [17]

    Looped transformers as programmable computers

    Angeliki Giannou, Shashank Rajput, Jy-yong Sohn, Kangwook Lee, Jason D Lee, and Dimitris Papailiopoulos. Looped transformers as programmable computers. InInternational Conference on Machine Learning, pages 11398–11442. PMLR, 2023

  18. [18]

    Switchboard: Telephone speech corpus for research and development

    John J Godfrey, Edward C Holliman, and Jane McDaniel. Switchboard: Telephone speech corpus for research and development. In[Proceedings] ICASSP-92: 1992 IEEE International Conference on Acoustics, Speech, and Signal Processing, volume 1, pages 517–520. IEEE, 1992

  19. [19]

    Audio flamingo 3: Advancing audio intelligence with fully open large audio language models.arXiv preprint arXiv:2507.08128, 2025

    Arushi Goel, Sreyan Ghosh, Jaehyeon Kim, Sonal Kumar, Zhifeng Kong, Sang-gil Lee, Chao- Han Huck Yang, Ramani Duraiswami, Dinesh Manocha, Rafael Valle, et al. Audio flamingo 3: Advancing audio intelligence with fully open large audio language models.arXiv preprint arXiv:2507.08128, 2025

  20. [20]

    Listen, think, and understand.arXiv preprint arXiv:2305.10790, 2023

    Yuan Gong, Hongyin Luo, Alexander H Liu, Leonid Karlinsky, and James Glass. Listen, think, and understand.arXiv preprint arXiv:2305.10790, 2023

  21. [21]

    V ocalsound: A dataset for improving human vocal sounds recognition

    Yuan Gong, Jin Yu, and James Glass. V ocalsound: A dataset for improving human vocal sounds recognition. InICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 151–155. IEEE, 2022

  22. [22]

    Onellm: One framework to align all modalities with language

    Jiaming Han, Kaixiong Gong, Yiyuan Zhang, Jiaqi Wang, Kaipeng Zhang, Dahua Lin, Yu Qiao, Peng Gao, and Xiangyu Yue. Onellm: One framework to align all modalities with language. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 26584–26595, 2024

  23. [23]

    Training large language models to reason in a continuous latent space.arXiv preprint arXiv:2412.06769, 2024

    Shibo Hao, Sainbayar Sukhbaatar, DiJia Su, Xian Li, Zhiting Hu, Jason Weston, and Yuandong Tian. Training large language models to reason in a continuous latent space.arXiv preprint arXiv:2412.06769, 2024

  24. [24]

    Hubert: Self-supervised speech representation learning by masked prediction of hidden units.IEEE/ACM transactions on audio, speech, and language processing, 29:3451–3460, 2021

    Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, and Abdelrahman Mohamed. Hubert: Self-supervised speech representation learning by masked prediction of hidden units.IEEE/ACM transactions on audio, speech, and language processing, 29:3451–3460, 2021

  25. [25]

    LoRA: Low-rank adaptation of large language models

    Edward J Hu, yelong shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. In International Conference on Learning Representations, 2022

  26. [26]

    Masked autoencoders that listen.Advances in neural information processing systems, 35:28708–28720, 2022

    Po-Yao Huang, Hu Xu, Juncheng Li, Alexei Baevski, Michael Auli, Wojciech Galuba, Florian Metze, and Christoph Feichtenhofer. Masked autoencoders that listen.Advances in neural information processing systems, 35:28708–28720, 2022

  27. [27]

    Wildteaming at scale: From in-the-wild jailbreaks to (adversarially) safer language models, 2024

    Liwei Jiang, Kavel Rao, Seungju Han, Allyson Ettinger, Faeze Brahman, Sachin Kumar, Niloofar Mireshghallah, Ximing Lu, Maarten Sap, Yejin Choi, and Nouha Dziri. Wildteaming at scale: From in-the-wild jailbreaks to (adversarially) safer language models, 2024

  28. [28]

    Audiocaps: Generat- ing captions for audios in the wild

    Chris Dongjoo Kim, Byeongchang Kim, Hyunmin Lee, and Gunhee Kim. Audiocaps: Generat- ing captions for audios in the wild. InProceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 119–132, 2019

  29. [29]

    Large language models are zero-shot reasoners.Advances in neural information processing systems, 35:22199–22213, 2022

    Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners.Advances in neural information processing systems, 35:22199–22213, 2022

  30. [30]

    Panns: Large-scale pretrained audio neural networks for audio pattern recognition.IEEE/ACM Transactions on Audio, Speech, and Language Processing, 28:2880–2894, 2020

    Qiuqiang Kong, Yin Cao, Turab Iqbal, Yuxuan Wang, Wenwu Wang, and Mark D Plumbley. Panns: Large-scale pretrained audio neural networks for audio pattern recognition.IEEE/ACM Transactions on Audio, Speech, and Language Processing, 28:2880–2894, 2020. 11

  31. [31]

    Audio flamingo: A novel audio language model with few-shot learning and dialogue abilities

    Zhifeng Kong, Arushi Goel, Rohan Badlani, Wei Ping, Rafael Valle, and Bryan Catanzaro. Audio flamingo: A novel audio language model with few-shot learning and dialogue abilities. arXiv preprint arXiv:2402.01831, 2024

  32. [32]

    Clotho-aqa: A crowdsourced dataset for audio question answering

    Samuel Lipping, Parthasaarathy Sudarsanam, Konstantinos Drossos, and Tuomas Virtanen. Clotho-aqa: A crowdsourced dataset for audio question answering. In2022 30th European Signal Processing Conference (EUSIPCO), pages 1140–1144. IEEE, 2022

  33. [33]

    Music understand- ing llama: Advancing text-to-music generation with question answering and captioning

    Shansong Liu, Atin Sakkeer Hussain, Chenshuo Sun, and Ying Shan. Music understand- ing llama: Advancing text-to-music generation with question answering and captioning. In ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 286–290. IEEE, 2024

  34. [34]

    Mmar: A challenging benchmark for deep reasoning in speech, audio, music, and their mix.arXiv preprint arXiv:2505.13032, 2025

    Ziyang Ma, Yinghao Ma, Yanqiao Zhu, Chen Yang, Yi-Wen Chao, Ruiyang Xu, et al. Mmar: A challenging benchmark for deep reasoning in speech, audio, music, and their mix.arXiv preprint arXiv:2505.13032, 2025

  35. [35]

    emotion2vec: Self-supervised pre-training for speech emotion representation

    Ziyang Ma, Zhisheng Zheng, Jiaxin Ye, Jinchao Li, Zhifu Gao, Shiliang Zhang, and Xie Chen. emotion2vec: Self-supervised pre-training for speech emotion representation. InFindings of the Association for Computational Linguistics: ACL 2024, pages 15747–15760, 2024

  36. [36]

    Tango 2: Aligning diffusion-based text-to-audio generations through direct preference optimization, 2024

    Navonil Majumder, Chia-Yu Hung, Deepanway Ghosal, Wei-Ning Hsu, Rada Mihalcea, and Soujanya Poria. Tango 2: Aligning diffusion-based text-to-audio generations through direct preference optimization, 2024

  37. [37]

    Xinhao Mei, Chutong Meng, Haohe Liu, Qiuqiang Kong, Tom Ko, Chengqi Zhao, Mark D Plumbley, Yuexian Zou, and Wenwu Wang. Wavcaps: A chatgpt-assisted weakly-labelled audio captioning dataset for audio-language multimodal research.IEEE/ACM Transactions on Audio, Speech, and Language Processing, 32:3339–3354, 2024

  38. [38]

    Mustango: Toward controllable text-to-music generation

    Jan Melechovsky, Zixun Guo, Deepanway Ghosal, Navonil Majumder, Dorien Herremans, and Soujanya Poria. Mustango: Toward controllable text-to-music generation. InProceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 8286–8309, 2024

  39. [39]

    O’Neill, Vitaly Lavrukhin, Somshubra Majumdar, Vahid Noroozi, Yuekai Zhang, Oleksii Kuchaiev, Jagadeesh Balam, Yuliya Dovzhenko, Keenan Freyberg, Michael D

    Patrick K. O’Neill, Vitaly Lavrukhin, Somshubra Majumdar, Vahid Noroozi, Yuekai Zhang, Oleksii Kuchaiev, Jagadeesh Balam, Yuliya Dovzhenko, Keenan Freyberg, Michael D. Shulman, Boris Ginsburg, Shinji Watanabe, and Georg Kucsko. Spgispeech: 5,000 hours of transcribed financial audio for fully formatted end-to-end speech recognition, 2021

  40. [40]

    Librispeech: an asr corpus based on public domain audio books

    Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur. Librispeech: an asr corpus based on public domain audio books. In2015 IEEE international conference on acoustics, speech and signal processing (ICASSP), pages 5206–5210. IEEE, 2015

  41. [41]

    Spidr: Learning fast and stable linguistic units for spoken language models without supervision

    Maxime Poli, Mahi Luthra, Youssef Benchekroun, Yosuke Higuchi, Martin Gleize, Jiayi Shen, Robin Algayres, Yu-An Chung, Mido Assran, Juan Pino, and Emmanuel Dupoux. Spidr: Learning fast and stable linguistic units for spoken language models without supervision. Transactions on Machine Learning Research, 2025

  42. [42]

    Meld: A multimodal multi-party dataset for emotion recognition in conversa- tions

    Soujanya Poria, Devamanyu Hazarika, Navonil Majumder, Gautam Naik, Erik Cambria, and Rada Mihalcea. Meld: A multimodal multi-party dataset for emotion recognition in conversa- tions. InProceedings of the 57th annual meeting of the association for computational linguistics, pages 527–536, 2019

  43. [43]

    Chain-of-visual-thought: Teaching vlms to see and think better with continuous visual tokens.arXiv preprint arXiv:2511.19418, 2025

    Yiming Qin, Bomin Wei, Jiaxin Ge, Konstantinos Kallidromitis, Stephanie Fu, Trevor Darrell, and XuDong Wang. Chain-of-visual-thought: Teaching vlms to see and think better with continuous visual tokens.arXiv preprint arXiv:2511.19418, 2025

  44. [44]

    Robust speech recognition via large-scale weak supervision

    Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine Mcleavey, and Ilya Sutskever. Robust speech recognition via large-scale weak supervision. In Andreas Krause, Emma Brun- skill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett, editors,Pro- ceedings of the 40th International Conference on Machine Learning, volume 202 ofPro...

  45. [45]

    Mmau: A massive multi-task audio understanding and reasoning benchmark, 2024

    S Sakshi, Utkarsh Tyagi, Sonal Kumar, Ashish Seth, Ramaneswaran Selvakumar, Oriol Nieto, Ramani Duraiswami, Sreyan Ghosh, and Dinesh Manocha. Mmau: A massive multi-task audio understanding and reasoning benchmark, 2024

  46. [46]

    Gsqa: An end-to-end model for generative spoken question answering.arXiv preprint arXiv:2312.09781, 2023

    Min-Han Shih, Ho-Lam Chung, Yu-Chi Pai, Ming-Hao Hsu, Guan-Ting Lin, Shang-Wen Li, and Hung-yi Lee. Gsqa: An end-to-end model for generative spoken question answering.arXiv preprint arXiv:2312.09781, 2023

  47. [47]

    Llasm: Large language and speech model.arXiv preprint arXiv:2308.15930, 2023

    Yu Shu, Siwei Dong, Guangyao Chen, Wenhao Huang, Ruihua Zhang, Daochen Shi, Qiqi Xiang, and Yemin Shi. Llasm: Large language and speech model.arXiv preprint arXiv:2308.15930, 2023

  48. [48]

    The gtzan dataset: Its contents, its faults, their effects on evaluation, and its future use.arXiv preprint arXiv:1306.1461, 2013

    Bob L Sturm. The gtzan dataset: Its contents, its faults, their effects on evaluation, and its future use.arXiv preprint arXiv:1306.1461, 2013

  49. [49]

    Salmonn: Towards generic hearing abilities for large language models.arXiv preprint arXiv:2310.13289, 2023

    Changli Tang, Wenyi Yu, Guangzhi Sun, Xianzhao Chen, Tian Tan, Wei Li, Lu Lu, Zejun Ma, and Chao Zhang. Salmonn: Towards generic hearing abilities for large language models.arXiv preprint arXiv:2310.13289, 2023

  50. [50]

    V oxpopuli: A large-scale multilingual speech corpus for representation learning, semi-supervised learning and interpretation, 2021

    Changhan Wang, Morgane Rivière, Ann Lee, Anne Wu, Chaitanya Talnikar, Daniel Haziza, Mary Williamson, Juan Pino, and Emmanuel Dupoux. V oxpopuli: A large-scale multilingual speech corpus for representation learning, semi-supervised learning and interpretation, 2021

  51. [51]

    Mmsu: A massive multi-task spoken language understanding and reasoning benchmark.arXiv preprint arXiv:2506.04779, 2025

    Dingdong Wang, Jincenzi Wu, Junan Li, Dongchao Yang, Xueyuan Chen, Tianhua Zhang, and Helen Meng. Mmsu: A massive multi-task spoken language understanding and reasoning benchmark.arXiv preprint arXiv:2506.04779, 2025

  52. [52]

    Self-consistency improves chain of thought reasoning in language models.arXiv preprint arXiv:2203.11171, 2022

    Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models.arXiv preprint arXiv:2203.11171, 2022

  53. [53]

    Muchomusic: Evaluating music understanding in multimodal audio-language models

    Benno Weck, Ilaria Manco, Emmanouil Benetos, Elio Quinton, György Fazekas, and Dmitry Bogdanov. Muchomusic: Evaluating music understanding in multimodal audio-language models. InProceedings of the 25th International Society for Music Information Retrieval Conference (ISMIR), 2024

  54. [54]

    Chain-of-thought prompting elicits reasoning in large language models

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824–24837, 2022

  55. [55]

    Qwen2.5-omni technical report.arXiv preprint arXiv:2503.20215, 2025

    Jin Xu, Zhifang Guo, Jinzheng He, Hangrui Hu, Ting He, Shuai Bai, Keqin Chen, Jialin Wang, Yang Fan, Kai Dang, Bin Zhang, Xiong Wang, Yunfei Chu, and Junyang Lin. Qwen2.5-omni technical report.arXiv preprint arXiv:2503.20215, 2025

  56. [56]

    Air-bench: Benchmarking large audio-language models via generative comprehension

    Qian Yang, Jin Xu, Wenrui Liu, Yunfei Chu, Ziyue Jiang, Xiaohuan Zhou, Yichong Leng, Yuanjun Lv, Zhou Zhao, Chang Zhou, et al. Air-bench: Benchmarking large audio-language models via generative comprehension. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1979–1998, 2024

  57. [57]

    Mert: Acoustic music understanding model with large-scale self-supervised training

    LI Yizhi, Ruibin Yuan, Ge Zhang, Yinghao Ma, Xingran Chen, Hanzhi Yin, Chenghao Xiao, Chenghua Lin, Anton Ragni, Emmanouil Benetos, et al. Mert: Acoustic music understanding model with large-scale self-supervised training. InThe Twelfth International Conference on Learning Representations, 2024

  58. [58]

    Soundstream: An end-to-end neural audio codec.IEEE/ACM Transactions on Audio, Speech, and Language Processing, 30:495–507, 2021

    Neil Zeghidour, Alejandro Luebs, Ahmed Omran, Jan Skoglund, and Marco Tagliasacchi. Soundstream: An end-to-end neural audio codec.IEEE/ACM Transactions on Audio, Speech, and Language Processing, 30:495–507, 2021

  59. [59]

    Quiet-star: Language models can teach themselves to think before speaking.arXiv preprint arXiv:2403.09629, 2024

    Eric Zelikman, Georges Harik, Yijia Shao, Varuna Jayasiri, Nick Haber, and Noah D Goodman. Quiet-star: Language models can teach themselves to think before speaking.arXiv preprint arXiv:2403.09629, 2024. 13

  60. [60]

    Anygpt: Unified multimodal llm with discrete sequence modeling

    Jun Zhan, Junqi Dai, Jiasheng Ye, Yunhua Zhou, Dong Zhang, Zhigeng Liu, Xin Zhang, Ruibin Yuan, Ge Zhang, Linyang Li, et al. Anygpt: Unified multimodal llm with discrete sequence modeling. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 9637–9662, 2024

  61. [61]

    Lmms-eval: Reality check on the evaluation of large multimodal models.arXiv preprint arXiv:2407.12772, 2024

    Kaichen Zhang, Bo Li, Peiyuan Zhang, Fanyi Pu, Joshua Adrian Cahyono, Kairui Hu, Shuai Liu, Yuanhan Zhang, Jingkang Yang, Chunyuan Li, and Ziwei Liu. Lmms-eval: Reality check on the evaluation of large multimodal models.arXiv preprint arXiv:2407.12772, 2024

  62. [62]

    Speaking clearly: A simplified whisper-based codec for low-bitrate speech coding

    Xin Zhang, Lin Li, Xiangni Lu, Jianquan Liu, and Kong Aik Lee. Speaking clearly: A simplified whisper-based codec for low-bitrate speech coding. InICASSP 2026-2026 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 17037–17041. IEEE, 2026

  63. [63]

    The reference answer is[XXX], while the model’s answer is[YYY]. I think

    Zihan Zhao, Yiyang Jiang, Heyang Liu, Yu Wang, and Yanfeng Wang. Librisqa: A novel dataset and framework for spoken question answering with large language models.IEEE Transactions on Artificial Intelligence, 2024. 14 Table A: Audio expert encoders used in CoAT.ek is the expert embedding dimension and rk is the frame rate at which the expert emits features...