Continuous Audio Thinking for Large Audio Language Models

Changho Choi; Dong-Jae Lee; Gyojin Han; Jongsuk Kim; Junmo Kim

arXiv: 2606.18273 · v1 · pith:EMEXYGSXnew · submitted 2026-06-05 · 💻 cs.CL · cs.AI · cs.SD · eess.AS


Pith reviewed 2026-06-27 21:51 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.SD · eess.AS

keywords continuous audio thinking · large audio language models · latent workspace · expert distillation · audio reasoning · speech transcription · music classification · acoustic information


The pith

A continuous latent workspace lets large audio language models preserve acoustic details such as prosody and pitch before they generate text responses.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large audio language models lose acoustic information such as phonetic detail, prosody, sound events, affect, and pitch because their hidden states shift toward text generation. Continuous Audio Thinking adds a continuous latent workspace, grounded by distillation from audio experts, where the model organizes this information. The workspace is processed in one prefill step and adds no autoregressive decoding cost. Gains appear on benchmarks for audio reasoning, audio understanding, music classification, speech emotion, and speech transcription across three base models. Analysis shows the distilled supervision reaches the final textual outputs.

Core claim

The paper introduces Continuous Audio Thinking (CoAT), a framework that equips large audio language models with a continuous latent workspace for organizing acoustic information prior to response generation. This workspace is grounded by distillation from audio experts, allowing the model to utilize rich acoustic details when generating responses. The continuous thinking block can be processed in a single prefill, incurring no additional autoregressive decoding cost over the baseline. Experiments on three LALMs demonstrate gains across a suite of benchmarks, and analysis shows the supervision propagates to textual responses.

What carries the argument

The continuous latent workspace that organizes acoustic information from expert distillation before text generation.

If this is right

Performance improves on audio reasoning, audio understanding, music classification, speech emotion, and speech transcription tasks.
The method works across different large audio language models.
The continuous thinking block adds no extra autoregressive decoding cost compared with the baseline.
Auxiliary supervision from the distillation reaches the model's final textual responses.
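
The decoding-cost point can be illustrated with a simple count of sequential forward passes. The sketch below is not from the paper: the function names and the token counts in the usage note are hypothetical, and it assumes the simplification that every prompt-side position, including the thinking block, is handled in one prefill pass while each generated token costs one decode pass.

```python
def prefill_positions(n_audio: int, n_instruction: int, n_think: int = 0) -> int:
    """Positions processed together in the single prefill pass."""
    return n_audio + n_instruction + n_think


def decode_passes_discrete(n_thinking_tokens: int, n_answer_tokens: int) -> int:
    """Discrete thinking: every thinking token is generated one step at a time."""
    return n_thinking_tokens + n_answer_tokens


def decode_passes_continuous(n_think: int, n_answer_tokens: int) -> int:
    """Continuous thinking: the block sits in the prefill, so n_think adds no decode pass."""
    # n_think is accepted only to make the comparison explicit; it is deliberately unused.
    return n_answer_tokens
```

With a hypothetical 32-slot block and a 20-token answer, `decode_passes_continuous(32, 20)` matches a baseline that decodes only the answer, while `prefill_positions` grows by the block length. Under these assumptions the extra work lands in the prefill rather than in the autoregressive loop, which is one way to read the claim.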

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The single-prefill design may support extensions to streaming or low-latency audio applications.
Similar latent workspaces could be tested for retaining non-text features in other modalities such as video.
The approach might allow smaller models to achieve results previously requiring larger capacity by better retaining input details.

Load-bearing premise

Distillation from audio experts delivers usable acoustic information to the continuous thinking positions, with no domain mismatch large enough to negate the performance gains.

What would settle it

A controlled run that keeps the continuous workspace but removes the expert distillation and checks whether benchmark gains vanish would test the central claim.
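
One way to set up that controlled comparison is sketched here. Everything in it is an assumption made for illustration: the additive form of the objective, the `distill_weight` parameter, and the three run names are hypothetical and are not taken from the paper, whose exact loss formulation is not visible from the abstract.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunConfig:
    name: str
    use_thinking_block: bool
    distill_weight: float  # hypothetical weight on the expert-distillation term


def total_loss(cfg: RunConfig, lm_loss: float, distill_loss: float) -> float:
    """Assumed objective: response LM loss plus a weighted distillation term."""
    if not cfg.use_thinking_block:
        # Without thinking positions there is nothing to supervise.
        return lm_loss
    return lm_loss + cfg.distill_weight * distill_loss


# Three arms: the middle one keeps the workspace but drops expert supervision.
RUNS = [
    RunConfig("baseline", use_thinking_block=False, distill_weight=0.0),
    RunConfig("block_only", use_thinking_block=True, distill_weight=0.0),
    RunConfig("block_plus_distillation", use_thinking_block=True, distill_weight=1.0),
]
```

If the `block_only` arm recovered most of the gain over `baseline`, added capacity or training would be the simpler explanation; if the gain appeared only in the third arm, the distillation premise would be supported.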

Figures

Figures reproduced from arXiv: 2606.18273 by Changho Choi, Dong-Jae Lee, Gyojin Han, Jongsuk Kim, Junmo Kim.

**Figure 1.** Thinking paradigms in audio language models. (a) Vanilla audio LMs decode the response directly from audio and instruction tokens. (b) Discrete thinking generates textual thinking tokens autoregressively before the answer. (c) Continuous Audio Thinking (ours) prepends a fixed-length block of continuous thinking tokens that is consumed in a single prefill, letting the model think in an audio-aligned latent …

**Figure 2.** CoAT architecture. A continuous audio thinking block is supervised by five audio experts via per-task projection heads, covering audio feature reconstruction, speech representation, sound event detection, paralinguistic features, and pitch. The projection heads decode the shared hidden states into expert-aligned predictions, used only during training. … on triplets of audio, instruction, and target response, …

**Figure 3.** Linear probe accuracy at the audio-think hidden across training checkpoints, on 4-class IEMOCAP emotion and 12-class MuchoMusic dominant pitch. We probe whether CoAT’s auxiliary supervision injects task-relevant information at thinking positions by training linear probes on the LM hidden state at two positions. The first one is the audio-think hidden, taken as the mean over the thinking block τp where …

**Figure 4.** Example reconstructions from CoAT’s per-task heads. Each pair shows the expert target (right) and the corresponding student prediction (left) at the audio-think positions. … beyond reconstruction. This indicates that the auxiliary supervision injects task-relevant information into the supervised position

Original abstract

Large audio language models (LALMs) have shown impressive capabilities on diverse audio understanding tasks, ranging from speech transcription to music analysis. However, because LALMs are typically trained to produce text-aligned responses, their hidden states are progressively shaped for text generation rather than for preserving acoustic information. As a result, the diverse acoustic content that audio carries, such as phonetic detail, prosody, sound events, affect, and pitch, is lost along the way and difficult to leverage in the response. We introduce Continuous Audio Thinking (CoAT), a framework that equips audio language models with a continuous latent workspace for organizing acoustic information prior to response generation, grounded by distillation from audio experts. Within the thinking space, the model can utilize the rich acoustic information provided by expert distillation when generating its response. Furthermore, the proposed continuous thinking block can be processed in a single prefill, so CoAT does not require additional autoregressive decoding cost over the baseline. Across three LALMs, Qwen2-Audio, Qwen2.5-Omni-7B, and Audio Flamingo 3, performance gains on a broad benchmark suite spanning audio reasoning, audio understanding, music classification, speech emotion, and speech transcription demonstrate the effectiveness of CoAT. Further analysis confirms that the auxiliary supervision propagates from the thinking positions to the model's textual responses.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

CoAT adds a continuous latent thinking block with expert distillation to LALMs and claims benchmark gains at no extra inference cost, but the acoustic transfer mechanism is the part that needs checking.


The paper's core move is inserting a continuous latent workspace into existing LALMs so the model can hold onto phonetic, prosodic, and other acoustic details before generating text. The authors ground that space with distillation from audio experts and process the whole block in a single prefill, which adds no autoregressive decoding steps. They apply it to Qwen2-Audio, Qwen2.5-Omni-7B, and Audio Flamingo 3 and report gains across reasoning, understanding, music classification, emotion, and transcription tasks. Further checks suggest the auxiliary signal reaches the final text outputs.

What stands out is the single-prefill design and the explicit continuous workspace; those are concrete differences from standard text-aligned LALM training. Testing the same addition on three separate models is also useful.

The soft spot is the distillation step itself. The claim rests on expert signals populating the thinking positions with usable acoustic information that then improves responses, without domain mismatch or the gains coming from extra capacity or training alone. If the expert models sit on different distributions or the distillation loss does not align cleanly with the LALM latent space, the improvements could be explained by other factors. The abstract and stress-test note flag this, and without seeing the exact loss formulation, architecture diagram, or ablations that isolate the thinking block, it is difficult to judge how cleanly the transfer works.

This is aimed at researchers building or fine-tuning audio-language models who care about preserving acoustic detail. It is worth sending to peer review because the idea is specific, the multi-model evaluation is straightforward, and the single-prefill constraint is practical. A referee can check whether the experiments actually separate the proposed mechanism from added training effects.

Referee Report

2 major / 2 minor

Summary. The paper introduces Continuous Audio Thinking (CoAT) for large audio language models (LALMs). It adds a continuous latent workspace (thinking space) that organizes acoustic information (phonetic detail, prosody, affect, etc.) prior to text response generation; this workspace is populated via distillation from audio expert models and processed in a single prefill pass. The design is claimed to incur no extra autoregressive decoding cost over the baseline LALM. Experiments across Qwen2-Audio, Qwen2.5-Omni-7B, and Audio Flamingo 3 report gains on a suite of benchmarks covering audio reasoning, audio understanding, music classification, speech emotion recognition, and speech transcription; further analysis is said to confirm that the auxiliary supervision propagates from the thinking positions into the final textual outputs.

Significance. If the distillation successfully transfers usable acoustic structure into the latent workspace without domain mismatch or post-hoc tuning that inflates the gains, CoAT would offer an efficient mechanism for preserving non-text-aligned acoustic information inside LALMs. The single-prefill constraint is a concrete efficiency advantage. The breadth of the benchmark suite and the explicit propagation analysis are positive features that would make the result falsifiable and reproducible if the experimental details are complete.

major comments (2)

[Abstract (final sentence) and the 'further analysis' paragraph] The central claim that expert distillation populates the continuous thinking positions with usable acoustic information (phonetic, prosody, affect) that then propagates to improve textual responses rests on the assumption of clean transfer without domain mismatch. The abstract asserts this propagation but provides no quantitative evidence (e.g., alignment metrics between expert and LALM latent spaces or ablation removing distillation while retaining the thinking block) that would rule out gains arising from added capacity or training rather than preserved acoustics.
[Method description of the thinking block and prefill procedure] The single-prefill design is presented as incurring no additional autoregressive cost, yet the integration of the continuous thinking block with the existing LALM forward pass is not shown to preserve the original text-generation path exactly. Without an explicit statement of how the thinking positions are masked or bypassed during response generation, it is unclear whether the reported efficiency holds or whether hidden extra computation is introduced.
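
The alignment metric requested in the first major comment could take a form like the following. This is a minimal sketch under stated assumptions: the function names are hypothetical, vectors are plain Python lists, and the student vectors are assumed to have already been mapped into the expert's feature space (for example by the per-task projection heads mentioned in the Figure 2 caption), so that both sides share a dimension.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def mean_alignment(
    expert_frames: Sequence[Sequence[float]],
    student_frames: Sequence[Sequence[float]],
) -> float:
    """Average per-position cosine similarity between expert targets and student outputs."""
    if not expert_frames or len(expert_frames) != len(student_frames):
        raise ValueError("need the same non-zero number of frames on both sides")
    scores = [cosine_similarity(e, s) for e, s in zip(expert_frames, student_frames)]
    return sum(scores) / len(scores)
```

A score near 1 at the thinking positions would indicate that the student reproduces the direction of the expert features; it would not by itself show that the response uses them, which is why the ablation in the same comment is still needed.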

minor comments (2)

[Abstract] The abstract lists three specific LALMs but does not state their parameter counts or base training regimes; adding this information would help readers assess the generality of the gains.
[Method] Notation for the continuous latent workspace (e.g., how the thinking positions are indexed or concatenated with audio tokens) should be introduced with a small diagram or equation in the method section to avoid ambiguity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major comment below and will revise the manuscript accordingly to strengthen the presentation of evidence and clarify the implementation details.


Referee: [Abstract (final sentence) and the 'further analysis' paragraph] The central claim that expert distillation populates the continuous thinking positions with usable acoustic information (phonetic, prosody, affect) that then propagates to improve textual responses rests on the assumption of clean transfer without domain mismatch. The abstract asserts this propagation but provides no quantitative evidence (e.g., alignment metrics between expert and LALM latent spaces or ablation removing distillation while retaining the thinking block) that would rule out gains arising from added capacity or training rather than preserved acoustics.

Authors: We agree that the manuscript would benefit from more direct quantitative support for the propagation claim. The existing further analysis links performance gains to the thinking positions but does not include the suggested alignment metrics or the specific ablation that isolates distillation from the added block. We will add both: (1) an ablation that retains the thinking block architecture while removing the expert distillation objective, and (2) alignment metrics (e.g., cosine similarity or CCA) between expert representations and the LALM thinking-space activations. These additions will be reported in a revised analysis section.
Revision: yes

Referee: [Method description of the thinking block and prefill procedure] The single-prefill design is presented as incurring no additional autoregressive cost, yet the integration of the continuous thinking block with the existing LALM forward pass is not shown to preserve the original text-generation path exactly. Without an explicit statement of how the thinking positions are masked or bypassed during response generation, it is unclear whether the reported efficiency holds or whether hidden extra computation is introduced.

Authors: We accept that an explicit description of the masking and integration is required. The current method text states that the thinking block is processed in a single prefill but does not detail how its outputs are isolated from the autoregressive loop. We will revise the method section to include a precise account of the forward pass, specifying that thinking positions receive a single prefill computation whose hidden states serve as fixed additional context, with causal masking that excludes them from subsequent token prediction. We will also add pseudocode and a diagram illustrating that the autoregressive decoding path remains unchanged from the baseline.
Revision: yes
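
The forward pass described in the second response can be read in more than one way. The sketch below shows one possible reading, not the authors' implementation: the segment ordering (audio, instruction, thinking block, response), the function names, and the use of an ordinary causal mask are all assumptions made for illustration.

```python
from typing import List

# Assumed ordering of the sequence; the paper's actual layout may differ.
SEGMENTS = ("audio", "instruction", "think", "response")


def build_layout(n_audio: int, n_instruction: int, n_think: int, n_response: int) -> List[str]:
    """Label every sequence position with the segment it belongs to."""
    counts = (n_audio, n_instruction, n_think, n_response)
    return [name for name, n in zip(SEGMENTS, counts) for _ in range(n)]


def causal_mask(length: int) -> List[List[bool]]:
    """mask[i][j] is True when position i may attend to position j."""
    return [[j <= i for j in range(length)] for i in range(length)]


def visible_think_positions(layout: List[str], mask: List[List[bool]], query: int) -> List[int]:
    """Thinking positions that the given query position can attend to."""
    return [j for j, seg in enumerate(layout) if seg == "think" and mask[query][j]]
```

In this reading, response positions attend to the thinking block as cached context from the prefill, earlier instruction positions never see it, and no token is sampled at a thinking position, so the decode loop itself is the same as the baseline's.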

Circularity Check

0 steps flagged

No circularity detected; CoAT is an independent architectural addition


The paper presents CoAT as a new framework that adds a continuous latent workspace to LALMs, populated via distillation from external audio experts, with single-prefill processing and empirical gains on benchmarks. No equations, self-definitional reductions, fitted-input predictions, or load-bearing self-citations appear in the abstract or described method. The central claim rests on the independent effectiveness of the added thinking block and distillation transfer, not on any derivation that reduces to its own inputs by construction. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

Abstract-only; ledger populated from stated claims only. The central claim rests on the unverified premise that expert distillation supplies usable acoustic signals and that the thinking block preserves them for downstream text generation.

axioms (2)

domain assumption: Distillation from audio experts can ground a continuous latent workspace without domain shift or loss of acoustic fidelity.
Invoked in the sentence describing the thinking space as 'grounded by distillation from audio experts'.

domain assumption: Processing the continuous thinking block in a single prefill preserves all benefits without additional autoregressive cost.
Stated directly in the abstract as a property of the proposed block.

invented entities (1)

continuous latent workspace (thinking space) · no independent evidence
purpose: Organize acoustic information prior to response generation
New postulated component introduced to retain phonetic, prosodic, and event-level audio details that are otherwise lost.

pith-pipeline@v0.9.1-grok · 5787 in / 1380 out tokens · 15268 ms · 2026-06-27T21:51:08.881822+00:00 · methodology


Reference graph

Works this paper leans on

63 extracted references · 13 linked inside Pith

[1]

Tyers, and Gregor Weber

Rosana Ardila, Megan Branson, Kelly Davis, Michael Henretty, Michael Kohler, Josh Meyer, Reuben Morais, Lindsay Saunders, Francis M. Tyers, and Gregor Weber. Common voice: A massively-multilingual speech corpus, 2020

2020
[2]

wav2vec 2.0: A framework for self-supervised learning of speech representations.Advances in neural information processing systems, 33:12449–12460, 2020

Alexei Baevski, Yuhao Zhou, Abdelrahman Mohamed, and Michael Auli. wav2vec 2.0: A framework for self-supervised learning of speech representations.Advances in neural information processing systems, 33:12449–12460, 2020

2020
[3]

Bittner, Juan José Bosch, David Rubinstein, Gabriel Meseguer-Brocal, and Sebastian Ewert

Rachel M. Bittner, Juan José Bosch, David Rubinstein, Gabriel Meseguer-Brocal, and Sebastian Ewert. A lightweight instrument-agnostic model for polyphonic note transcription and multipitch estimation. InProceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Singapore, 2022

2022
[4]

Iemocap: Interactive emotional dyadic motion capture database.Language resources and evaluation, 42(4):335–359, 2008

Carlos Busso, Murtaza Bulut, Chi-Chun Lee, Abe Kazemzadeh, Emily Mower, Samuel Kim, Jeannette N Chang, Sungbok Lee, and Shrikanth S Narayanan. Iemocap: Interactive emotional dyadic motion capture database.Language resources and evaluation, 42(4):335–359, 2008

2008
[5]

Gigaspeech: An evolving, multi-domain asr corpus with 10,000 hours of transcribed audio.arXiv preprint arXiv:2106.06909, 2021

Guoguo Chen, Shuzhou Chai, Guanbo Wang, Jiayu Du, Wei-Qiang Zhang, Chao Weng, Dan Su, Daniel Povey, Jan Trmal, Junbo Zhang, et al. Gigaspeech: An evolving, multi-domain asr corpus with 10,000 hours of transcribed audio.arXiv preprint arXiv:2106.06909, 2021

arXiv 2021
[6]

Wavlm: Large-scale self-supervised pre- training for full stack speech processing.IEEE Journal of Selected Topics in Signal Processing, 16(6):1505–1518, 2022

Sanyuan Chen, Chengyi Wang, Zhengyang Chen, Yu Wu, Shujie Liu, Zhuo Chen, Jinyu Li, Naoyuki Kanda, Takuya Yoshioka, Xiong Xiao, et al. Wavlm: Large-scale self-supervised pre- training for full stack speech processing.IEEE Journal of Selected Topics in Signal Processing, 16(6):1505–1518, 2022

2022
[7]

Beats: Audio pre-training with acoustic tokenizers

Sanyuan Chen, Yu Wu, Chengyi Wang, Shujie Liu, Daniel Tompkins, Zhuo Chen, Wanxiang Che, Xiangzhan Yu, and Furu Wei. Beats: Audio pre-training with acoustic tokenizers. In International Conference on Machine Learning, pages 5178–5193. PMLR, 2023

2023
[8]

Eat: self-supervised pre-training with efficient audio transformer

Wenxi Chen, Yuzhe Liang, Ziyang Ma, Zhisheng Zheng, and Xie Chen. Eat: self-supervised pre-training with efficient audio transformer. InProceedings of the Thirty-Third International Joint Conference on Artificial Intelligence, pages 3807–3815, 2024

2024
[9]

Qwen2-audio technical report.arXiv preprint arXiv:2407.10759, 2024

Yunfei Chu, Jin Xu, Qian Yang, Haojie Wei, Xipin Wei, Zhifang Guo, Yichong Leng, Yuan- jun Lv, Jinzheng He, Junyang Lin, et al. Qwen2-audio technical report.arXiv preprint arXiv:2407.10759, 2024

Pith/arXiv arXiv 2024
[10]

Qwen-audio: Advancing universal audio understanding via unified large-scale audio-language models.arXiv preprint arXiv:2311.07919, 2023

Yunfei Chu, Jin Xu, Xiaohuan Zhou, Qian Yang, Shiliang Zhang, Zhijie Yan, Chang Zhou, and Jingren Zhou. Qwen-audio: Advancing universal audio understanding via unified large-scale audio-language models.arXiv preprint arXiv:2311.07919, 2023

Pith/arXiv arXiv 2023
[11]

High fidelity neural audio compression.arXiv preprint arXiv:2210.13438, 2022

Alexandre Défossez, Jade Copet, Gabriel Synnaeve, and Yossi Adi. High fidelity neural audio compression.arXiv preprint arXiv:2210.13438, 2022

Pith/arXiv arXiv 2022
[12]

From explicit cot to implicit cot: Learning to internalize cot step by step.arXiv preprint arXiv:2405.14838, 2024

Yuntian Deng, Yejin Choi, and Stuart Shieber. From explicit cot to implicit cot: Learning to internalize cot step by step.arXiv preprint arXiv:2405.14838, 2024

Pith/arXiv arXiv 2024
[13]

Pengi: An audio language model for audio tasks.Advances in Neural Information Processing Systems, 36:18090– 18108, 2023

Soham Deshmukh, Benjamin Elizalde, Rita Singh, and Huaming Wang. Pengi: An audio language model for audio tasks.Advances in Neural Information Processing Systems, 36:18090– 18108, 2023

2023
[14]

Clotho: An audio captioning dataset

Konstantinos Drossos, Samuel Lipping, and Tuomas Virtanen. Clotho: An audio captioning dataset. InICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 736–740. IEEE, 2020

2020
[15]

Audio flamingo 2: An audio-language model with long-audio understanding and expert reasoning abilities.arXiv preprint arXiv:2503.03983, 2025

Sreyan Ghosh, Zhifeng Kong, Sonal Kumar, S Sakshi, Jaehyeon Kim, Wei Ping, Rafael Valle, Dinesh Manocha, and Bryan Catanzaro. Audio flamingo 2: An audio-language model with long-audio understanding and expert reasoning abilities.arXiv preprint arXiv:2503.03983, 2025. 10

arXiv 2025
[16]

Gama: A large audio-language model with advanced audio understanding and complex reasoning abilities

Sreyan Ghosh, Sonal Kumar, Ashish Seth, Chandra Kiran Reddy Evuru, Utkarsh Tyagi, S Sakshi, Oriol Nieto, Ramani Duraiswami, and Dinesh Manocha. Gama: A large audio-language model with advanced audio understanding and complex reasoning abilities. InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 6288–6313, 2024

2024
[17]

Looped transformers as programmable computers

Angeliki Giannou, Shashank Rajput, Jy-yong Sohn, Kangwook Lee, Jason D Lee, and Dimitris Papailiopoulos. Looped transformers as programmable computers. InInternational Conference on Machine Learning, pages 11398–11442. PMLR, 2023

2023
[18]

Switchboard: Telephone speech corpus for research and development

John J Godfrey, Edward C Holliman, and Jane McDaniel. Switchboard: Telephone speech corpus for research and development. In[Proceedings] ICASSP-92: 1992 IEEE International Conference on Acoustics, Speech, and Signal Processing, volume 1, pages 517–520. IEEE, 1992

1992
[19]

Audio flamingo 3: Advancing audio intelligence with fully open large audio language models.arXiv preprint arXiv:2507.08128, 2025

Arushi Goel, Sreyan Ghosh, Jaehyeon Kim, Sonal Kumar, Zhifeng Kong, Sang-gil Lee, Chao- Han Huck Yang, Ramani Duraiswami, Dinesh Manocha, Rafael Valle, et al. Audio flamingo 3: Advancing audio intelligence with fully open large audio language models.arXiv preprint arXiv:2507.08128, 2025

Pith/arXiv arXiv 2025
[20]

Listen, think, and understand.arXiv preprint arXiv:2305.10790, 2023

Yuan Gong, Hongyin Luo, Alexander H Liu, Leonid Karlinsky, and James Glass. Listen, think, and understand.arXiv preprint arXiv:2305.10790, 2023

arXiv 2023
[21]

V ocalsound: A dataset for improving human vocal sounds recognition

Yuan Gong, Jin Yu, and James Glass. V ocalsound: A dataset for improving human vocal sounds recognition. InICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 151–155. IEEE, 2022

2022
[22]

Onellm: One framework to align all modalities with language

Jiaming Han, Kaixiong Gong, Yiyuan Zhang, Jiaqi Wang, Kaipeng Zhang, Dahua Lin, Yu Qiao, Peng Gao, and Xiangyu Yue. Onellm: One framework to align all modalities with language. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 26584–26595, 2024

2024
[23]

Training large language models to reason in a continuous latent space.arXiv preprint arXiv:2412.06769, 2024

Shibo Hao, Sainbayar Sukhbaatar, DiJia Su, Xian Li, Zhiting Hu, Jason Weston, and Yuandong Tian. Training large language models to reason in a continuous latent space.arXiv preprint arXiv:2412.06769, 2024

Pith/arXiv arXiv 2024
[24]

Hubert: Self-supervised speech representation learning by masked prediction of hidden units.IEEE/ACM transactions on audio, speech, and language processing, 29:3451–3460, 2021

Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, and Abdelrahman Mohamed. Hubert: Self-supervised speech representation learning by masked prediction of hidden units.IEEE/ACM transactions on audio, speech, and language processing, 29:3451–3460, 2021

2021
[25]

LoRA: Low-rank adaptation of large language models

Edward J Hu, yelong shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. In International Conference on Learning Representations, 2022

2022
[26]

Masked autoencoders that listen.Advances in neural information processing systems, 35:28708–28720, 2022

Po-Yao Huang, Hu Xu, Juncheng Li, Alexei Baevski, Michael Auli, Wojciech Galuba, Florian Metze, and Christoph Feichtenhofer. Masked autoencoders that listen.Advances in neural information processing systems, 35:28708–28720, 2022

2022
[27]

Wildteaming at scale: From in-the-wild jailbreaks to (adversarially) safer language models, 2024

Liwei Jiang, Kavel Rao, Seungju Han, Allyson Ettinger, Faeze Brahman, Sachin Kumar, Niloofar Mireshghallah, Ximing Lu, Maarten Sap, Yejin Choi, and Nouha Dziri. Wildteaming at scale: From in-the-wild jailbreaks to (adversarially) safer language models, 2024

2024
[28]

Audiocaps: Generat- ing captions for audios in the wild

Chris Dongjoo Kim, Byeongchang Kim, Hyunmin Lee, and Gunhee Kim. Audiocaps: Generat- ing captions for audios in the wild. InProceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 119–132, 2019

2019
[29]

Large language models are zero-shot reasoners.Advances in neural information processing systems, 35:22199–22213, 2022

Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners.Advances in neural information processing systems, 35:22199–22213, 2022

2022
[30]

Panns: Large-scale pretrained audio neural networks for audio pattern recognition.IEEE/ACM Transactions on Audio, Speech, and Language Processing, 28:2880–2894, 2020

Qiuqiang Kong, Yin Cao, Turab Iqbal, Yuxuan Wang, Wenwu Wang, and Mark D Plumbley. Panns: Large-scale pretrained audio neural networks for audio pattern recognition.IEEE/ACM Transactions on Audio, Speech, and Language Processing, 28:2880–2894, 2020. 11

2020
[31]

Audio flamingo: A novel audio language model with few-shot learning and dialogue abilities

Zhifeng Kong, Arushi Goel, Rohan Badlani, Wei Ping, Rafael Valle, and Bryan Catanzaro. Audio flamingo: A novel audio language model with few-shot learning and dialogue abilities. arXiv preprint arXiv:2402.01831, 2024

arXiv 2024
[32]

Clotho-aqa: A crowdsourced dataset for audio question answering

Samuel Lipping, Parthasaarathy Sudarsanam, Konstantinos Drossos, and Tuomas Virtanen. Clotho-aqa: A crowdsourced dataset for audio question answering. In2022 30th European Signal Processing Conference (EUSIPCO), pages 1140–1144. IEEE, 2022

2022
[33]

Music understand- ing llama: Advancing text-to-music generation with question answering and captioning

Shansong Liu, Atin Sakkeer Hussain, Chenshuo Sun, and Ying Shan. Music understand- ing llama: Advancing text-to-music generation with question answering and captioning. In ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 286–290. IEEE, 2024

2024
[34]

Mmar: A challenging benchmark for deep reasoning in speech, audio, music, and their mix.arXiv preprint arXiv:2505.13032, 2025

Ziyang Ma, Yinghao Ma, Yanqiao Zhu, Chen Yang, Yi-Wen Chao, Ruiyang Xu, et al. Mmar: A challenging benchmark for deep reasoning in speech, audio, music, and their mix.arXiv preprint arXiv:2505.13032, 2025

arXiv 2025
[35]

emotion2vec: Self-supervised pre-training for speech emotion representation

Ziyang Ma, Zhisheng Zheng, Jiaxin Ye, Jinchao Li, Zhifu Gao, Shiliang Zhang, and Xie Chen. emotion2vec: Self-supervised pre-training for speech emotion representation. InFindings of the Association for Computational Linguistics: ACL 2024, pages 15747–15760, 2024

2024
[36]

Tango 2: Aligning diffusion-based text-to-audio generations through direct preference optimization, 2024

Navonil Majumder, Chia-Yu Hung, Deepanway Ghosal, Wei-Ning Hsu, Rada Mihalcea, and Soujanya Poria. Tango 2: Aligning diffusion-based text-to-audio generations through direct preference optimization, 2024

2024
[37]

Xinhao Mei, Chutong Meng, Haohe Liu, Qiuqiang Kong, Tom Ko, Chengqi Zhao, Mark D Plumbley, Yuexian Zou, and Wenwu Wang. Wavcaps: A chatgpt-assisted weakly-labelled audio captioning dataset for audio-language multimodal research.IEEE/ACM Transactions on Audio, Speech, and Language Processing, 32:3339–3354, 2024

2024
[38]

Mustango: Toward controllable text-to-music generation

Jan Melechovsky, Zixun Guo, Deepanway Ghosal, Navonil Majumder, Dorien Herremans, and Soujanya Poria. Mustango: Toward controllable text-to-music generation. InProceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 8286–8309, 2024

2024
[39]

O’Neill, Vitaly Lavrukhin, Somshubra Majumdar, Vahid Noroozi, Yuekai Zhang, Oleksii Kuchaiev, Jagadeesh Balam, Yuliya Dovzhenko, Keenan Freyberg, Michael D

Patrick K. O’Neill, Vitaly Lavrukhin, Somshubra Majumdar, Vahid Noroozi, Yuekai Zhang, Oleksii Kuchaiev, Jagadeesh Balam, Yuliya Dovzhenko, Keenan Freyberg, Michael D. Shulman, Boris Ginsburg, Shinji Watanabe, and Georg Kucsko. Spgispeech: 5,000 hours of transcribed financial audio for fully formatted end-to-end speech recognition, 2021

2021
[40]

Librispeech: an asr corpus based on public domain audio books

Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur. Librispeech: an asr corpus based on public domain audio books. In2015 IEEE international conference on acoustics, speech and signal processing (ICASSP), pages 5206–5210. IEEE, 2015

2015
[41]

Spidr: Learning fast and stable linguistic units for spoken language models without supervision

Maxime Poli, Mahi Luthra, Youssef Benchekroun, Yosuke Higuchi, Martin Gleize, Jiayi Shen, Robin Algayres, Yu-An Chung, Mido Assran, Juan Pino, and Emmanuel Dupoux. Spidr: Learning fast and stable linguistic units for spoken language models without supervision. Transactions on Machine Learning Research, 2025

[42] Soujanya Poria, Devamanyu Hazarika, Navonil Majumder, Gautam Naik, Erik Cambria, and Rada Mihalcea. Meld: A multimodal multi-party dataset for emotion recognition in conversations. In Proceedings of the 57th annual meeting of the association for computational linguistics, pages 527–536, 2019

[43] Yiming Qin, Bomin Wei, Jiaxin Ge, Konstantinos Kallidromitis, Stephanie Fu, Trevor Darrell, and XuDong Wang. Chain-of-visual-thought: Teaching vlms to see and think better with continuous visual tokens. arXiv preprint arXiv:2511.19418, 2025

[44] Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine Mcleavey, and Ilya Sutskever. Robust speech recognition via large-scale weak supervision. In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett, editors, Proceedings of the 40th International Conference on Machine Learning, volume 202 of Pro... (2023)

[45] S Sakshi, Utkarsh Tyagi, Sonal Kumar, Ashish Seth, Ramaneswaran Selvakumar, Oriol Nieto, Ramani Duraiswami, Sreyan Ghosh, and Dinesh Manocha. Mmau: A massive multi-task audio understanding and reasoning benchmark, 2024

[46] Min-Han Shih, Ho-Lam Chung, Yu-Chi Pai, Ming-Hao Hsu, Guan-Ting Lin, Shang-Wen Li, and Hung-yi Lee. Gsqa: An end-to-end model for generative spoken question answering. arXiv preprint arXiv:2312.09781, 2023

[47] Yu Shu, Siwei Dong, Guangyao Chen, Wenhao Huang, Ruihua Zhang, Daochen Shi, Qiqi Xiang, and Yemin Shi. Llasm: Large language and speech model. arXiv preprint arXiv:2308.15930, 2023

[48] Bob L Sturm. The gtzan dataset: Its contents, its faults, their effects on evaluation, and its future use. arXiv preprint arXiv:1306.1461, 2013

[49] Changli Tang, Wenyi Yu, Guangzhi Sun, Xianzhao Chen, Tian Tan, Wei Li, Lu Lu, Zejun Ma, and Chao Zhang. Salmonn: Towards generic hearing abilities for large language models. arXiv preprint arXiv:2310.13289, 2023

[50] Changhan Wang, Morgane Rivière, Ann Lee, Anne Wu, Chaitanya Talnikar, Daniel Haziza, Mary Williamson, Juan Pino, and Emmanuel Dupoux. Voxpopuli: A large-scale multilingual speech corpus for representation learning, semi-supervised learning and interpretation, 2021

[51] Dingdong Wang, Jincenzi Wu, Junan Li, Dongchao Yang, Xueyuan Chen, Tianhua Zhang, and Helen Meng. Mmsu: A massive multi-task spoken language understanding and reasoning benchmark. arXiv preprint arXiv:2506.04779, 2025

[52] Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171, 2022

[53] Benno Weck, Ilaria Manco, Emmanouil Benetos, Elio Quinton, György Fazekas, and Dmitry Bogdanov. Muchomusic: Evaluating music understanding in multimodal audio-language models. In Proceedings of the 25th International Society for Music Information Retrieval Conference (ISMIR), 2024

[54] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824–24837, 2022

[55] Jin Xu, Zhifang Guo, Jinzheng He, Hangrui Hu, Ting He, Shuai Bai, Keqin Chen, Jialin Wang, Yang Fan, Kai Dang, Bin Zhang, Xiong Wang, Yunfei Chu, and Junyang Lin. Qwen2.5-omni technical report. arXiv preprint arXiv:2503.20215, 2025

[56] Qian Yang, Jin Xu, Wenrui Liu, Yunfei Chu, Ziyue Jiang, Xiaohuan Zhou, Yichong Leng, Yuanjun Lv, Zhou Zhao, Chang Zhou, et al. Air-bench: Benchmarking large audio-language models via generative comprehension. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1979–1998, 2024

[57] LI Yizhi, Ruibin Yuan, Ge Zhang, Yinghao Ma, Xingran Chen, Hanzhi Yin, Chenghao Xiao, Chenghua Lin, Anton Ragni, Emmanouil Benetos, et al. Mert: Acoustic music understanding model with large-scale self-supervised training. In The Twelfth International Conference on Learning Representations, 2024

[58] Neil Zeghidour, Alejandro Luebs, Ahmed Omran, Jan Skoglund, and Marco Tagliasacchi. Soundstream: An end-to-end neural audio codec. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 30:495–507, 2021

[59] Eric Zelikman, Georges Harik, Yijia Shao, Varuna Jayasiri, Nick Haber, and Noah D Goodman. Quiet-star: Language models can teach themselves to think before speaking. arXiv preprint arXiv:2403.09629, 2024

[60] Jun Zhan, Junqi Dai, Jiasheng Ye, Yunhua Zhou, Dong Zhang, Zhigeng Liu, Xin Zhang, Ruibin Yuan, Ge Zhang, Linyang Li, et al. Anygpt: Unified multimodal llm with discrete sequence modeling. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 9637–9662, 2024

[61] Kaichen Zhang, Bo Li, Peiyuan Zhang, Fanyi Pu, Joshua Adrian Cahyono, Kairui Hu, Shuai Liu, Yuanhan Zhang, Jingkang Yang, Chunyuan Li, and Ziwei Liu. Lmms-eval: Reality check on the evaluation of large multimodal models. arXiv preprint arXiv:2407.12772, 2024

[62] Xin Zhang, Lin Li, Xiangni Lu, Jianquan Liu, and Kong Aik Lee. Speaking clearly: A simplified whisper-based codec for low-bitrate speech coding. In ICASSP 2026-2026 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 17037–17041. IEEE, 2026

[63] Zihan Zhao, Yiyang Jiang, Heyang Liu, Yu Wang, and Yanfeng Wang. Librisqa: A novel dataset and framework for spoken question answering with large language models. IEEE Transactions on Artificial Intelligence, 2024

Table A: Audio expert encoders used in CoAT. e_k is the expert embedding dimension and r_k is the frame rate at which the expert emits features...

