arxiv: 2605.21008 · v1 · pith:S7FX2JXGnew · submitted 2026-05-20 · 📡 eess.AS

A Survey of Audio Reasoning in Multimodal Foundation Models

Pith reviewed 2026-05-21 02:04 UTC · model grok-4.3

classification 📡 eess.AS
keywords audio reasoning · multimodal foundation models · reasoning-augmented generation · audio-to-text · audio-visual reasoning · chain-of-thought · reinforcement learning · spoken interaction

The pith

Audio reasoning in multimodal foundation models requires a dedicated survey and a unified formulation because audio is continuous and carries information at multiple time scales.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper seeks to establish a coherent roadmap for audio reasoning by offering the first survey dedicated to the field. It distinguishes direct predictive modeling from reasoning-augmented generation to better organize how models align audio signals with language semantics. This matters to readers if it leads to more reliable systems that can draw inferences from speech, environmental sounds, and combined audio-visual inputs without losing fine detail. The work reviews foundations, organizes advances into four categories, and covers methods such as prompting and training techniques. It also identifies obstacles such as data scarcity and the need to balance reasoning depth against latency.

Core claim

The authors present the first dedicated survey of audio reasoning in multimodal foundation models. They introduce a unified formulation to separate direct predictive modeling from reasoning-augmented generation, review the architectural and training foundations, and systematically organize recent advances across Audio-to-Text, Audio-to-Speech, Audio-Visual Reasoning, and Agentic Audio Reasoning. The survey further examines emerging paradigms including Chain-of-Thought prompting, supervised fine-tuning, reinforcement learning, and latency-aware spoken interaction, along with evaluation practices and open challenges.

What carries the argument

A unified formulation that distinguishes direct predictive modeling from reasoning-augmented generation to handle the alignment of continuous acoustic signals with discrete language model semantics while preserving fine-grained information.
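As a hedged illustration only, since the paper's own notation is not reproduced on this page, the distinction can be sketched probabilistically. The symbols are assumptions introduced here: $a$ is the audio input, $q$ an optional text query, $y$ the output sequence, and $r$ an intermediate reasoning trace.

```latex
% Direct predictive modeling: the output is decoded straight from the input.
p_\theta(y \mid a, q) = \prod_{t=1}^{T} p_\theta(y_t \mid y_{<t}, a, q)

% Reasoning-augmented generation: an explicit trace r is generated first,
% and the output is conditioned on both the trace and the original input.
p_\theta(y \mid a, q) = \sum_{r} p_\theta(r \mid a, q)\, p_\theta(y \mid r, a, q)
```

In practice the sum over traces is usually approximated by sampling one or a few traces rather than marginalizing exactly.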

If this is right

  • Advances in Audio-to-Text and Audio-to-Speech can be more systematically compared and improved.
  • Agentic Audio Reasoning can support interactive spoken agents that perform step-by-step inference.
  • Methods like reinforcement learning can help overcome shortcut learning in audio tasks.
  • Latency-aware designs enable practical real-time audio reasoning applications.
  • Evaluation practices can evolve to better test for modality hallucination and grounding.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This categorization could help in designing experiments that test reasoning depth versus prediction accuracy in audio models.
  • Connections to visual reasoning suggest potential for unified multi-modal reasoning frameworks beyond audio alone.
  • Addressing the listed obstacles might lead to foundation models that handle real-world audio interactions more robustly.
  • One could test the formulation by applying it to emerging audio datasets to see if it reveals new patterns in progress.

Load-bearing premise

The challenges in audio reasoning are fundamentally distinct from those in text and vision, necessitating a separate survey and a new unified formulation.

What would settle it

An experiment showing that general multimodal reasoning techniques without audio-specific adaptations achieve equivalent performance on audio tasks would challenge the premise for a dedicated survey.
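One way such an equivalence check could be run is sketched below: a paired bootstrap over per-item correctness for two systems evaluated on the same audio benchmark. This is a minimal sketch under stated assumptions, not a procedure from the paper; the function names, the resample count, and the equivalence margin are all hypothetical choices.

```python
import random


def paired_bootstrap_diff(correct_a, correct_b, n_resamples=2000, alpha=0.05, seed=0):
    """Return (observed_diff, lo, hi) for accuracy(A) - accuracy(B) on paired items.

    correct_a / correct_b are per-item 0/1 (or bool) correctness flags for the
    same benchmark items, in the same order. Both names are hypothetical.
    """
    if len(correct_a) != len(correct_b) or not correct_a:
        raise ValueError("need two non-empty, equal-length lists")
    n = len(correct_a)
    rng = random.Random(seed)
    # Per-item difference in correctness: +1, 0, or -1.
    deltas = [int(a) - int(b) for a, b in zip(correct_a, correct_b)]
    observed = sum(deltas) / n
    means = []
    for _ in range(n_resamples):
        # Resample items with replacement, keeping each A/B pair together.
        sample = [deltas[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    # Percentile interval on the resampled mean differences.
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return observed, lo, hi


def is_equivalent(lo, hi, margin):
    """Treat the systems as equivalent if the whole interval is within +/- margin."""
    return -margin <= lo and hi <= margin
```

Here system A would be the audio-adapted model and system B the general multimodal baseline. The margin (for example, two accuracy points) has to be fixed before looking at the results, otherwise the equivalence claim is not meaningful.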

Figures

Figures reproduced from arXiv: 2605.21008 by Daxin Tan, Dingdong Wang, Guan-Ting Lin, Han Shi, Irwin King, Jiaya Jia, Jing Xiong, Jingyao Li, Qiyong Zheng, Wenqian Cui, Zhihan Guo.

Figure 1: Timeline of representative audio reasoning models. Models are organized chronologically and grouped by major paradigms, including Audio-to-Text, Audio-to-Speech, Audio-Visual, and agentic audio reasoning.
Figure 2: A compact taxonomy of audio reasoning. We organize the literature into four paradigms: Audio-to-Text reasoning, Audio-to-Speech reasoning, Audio-Visual reasoning, and Agentic Audio Reasoning. Representative methods and design patterns are discussed in the corresponding sections.
Figure 3: Overview of major audio reasoning paradigms. The figure summarizes four paradigms covered in this survey: Audio-to-Text, Audio-to-Speech, Audio-Visual, and Agentic Audio Reasoning. It contrasts text-output reasoning, cross-modal audio-visual grounding, sequential and real-time speech-output reasoning, and agentic workflows based on predefined pipelines or dynamic tool calling.
Original abstract

Reasoning has become a defining capability of modern foundation models, yet its development in the audio modality remains limited. Audio poses challenges that are distinct from those of text and vision. It is continuous, temporally dense, and contains linguistic, paralinguistic, and environmental information at multiple time scales. As a result, audio reasoning models must align acoustic signals with the discrete semantic space of large language models, while still preserving fine-grained information needed for reliable inference. Progress is also limited by three major obstacles: the scarcity of genuinely audio-grounded reasoning data, shortcut learning and modality hallucination, and the tension between reasoning depth and real-time latency in spoken interaction. In this paper, we present the first dedicated survey of audio reasoning. We provide a unified formulation that distinguishes direct predictive modeling from reasoning-augmented generation, review the architectural and training foundations of audio reasoning models, and systematically organize recent advances in Audio-to-Text, Audio-to-Speech, Audio-Visual Reasoning and Agentic Audio Reasoning. We further examine emerging paradigms such as Chain-of-Thought prompting, supervised fine-tuning, reinforcement learning, and latency-aware spoken interaction, and discuss evaluation practices, open challenges, and future directions. Our goal is to offer a coherent roadmap for developing robust, efficient, and natively grounded audio reasoning systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 2 minor

Summary. The paper claims to deliver the first dedicated survey of audio reasoning in multimodal foundation models. It provides a unified formulation distinguishing direct predictive modeling from reasoning-augmented generation, reviews architectural and training foundations, and organizes advances in Audio-to-Text, Audio-to-Speech, Audio-Visual Reasoning, and Agentic Audio Reasoning. The survey also covers paradigms like Chain-of-Thought prompting, supervised fine-tuning, reinforcement learning, latency-aware interaction, evaluation practices, challenges, and future directions.

Significance. If the claims hold, this survey would be significant for the field by establishing a coherent framework and roadmap for audio reasoning, which is currently limited compared to text and vision. The explicit identification of distinct audio challenges and obstacles like data scarcity and modality hallucination provides a useful structure for future work. As a survey without new quantitative claims, its value lies in synthesis and organization of existing literature.

minor comments (2)
  1. [Abstract] The premise that audio poses fundamentally distinct challenges from text and vision is stated to motivate the scope; a brief explicit contrast with vision-language reasoning surveys would strengthen the justification for a dedicated audio survey.
  2. The unified formulation is introduced in the abstract but its concrete mathematical or conceptual details are not visible in the provided high-level description; ensuring the formulation is presented with clear notation and examples in the main text would improve accessibility.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive assessment and recommendation for minor revision. The summary accurately reflects the paper's contributions in providing a unified formulation of audio reasoning and organizing advances across Audio-to-Text, Audio-to-Speech, Audio-Visual, and Agentic paradigms, while highlighting key challenges such as data scarcity and modality hallucination.

Circularity Check

0 steps flagged

No significant circularity

full rationale

This is a survey paper whose central contribution is a review and taxonomy of existing literature on audio reasoning. It states a motivation based on modality differences and offers a unified formulation to organize prior work, but introduces no new quantitative predictions, fitted parameters, or formal derivations that could reduce to its own inputs. All load-bearing content consists of citations to external studies and internal consistency of the proposed categories, with no self-referential loops or self-citation chains that substitute for independent evidence. The paper is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

This is a literature survey with no new mathematical derivations, fitted parameters, or postulated entities; it relies on standard domain assumptions from multimodal AI research.

axioms (1)
  • domain assumption: Audio poses challenges distinct from text and vision because it is continuous, temporally dense, and contains linguistic, paralinguistic, and environmental information at multiple time scales.
    Invoked in the abstract to motivate the need for specialized audio reasoning models.

pith-pipeline@v0.9.0 · 5793 in / 1271 out tokens · 34307 ms · 2026-05-21T02:04:28.515190+00:00 · methodology


