A Survey of Audio Reasoning in Multimodal Foundation Models
Pith reviewed 2026-05-21 02:04 UTC · model grok-4.3
The pith
Audio reasoning in multimodal foundation models requires a dedicated survey and unified formulation because of its unique continuous and multi-scale characteristics.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors present the first dedicated survey of audio reasoning in multimodal foundation models. They introduce a unified formulation to separate direct predictive modeling from reasoning-augmented generation, review the architectural and training foundations, and systematically organize recent advances across Audio-to-Text, Audio-to-Speech, Audio-Visual Reasoning, and Agentic Audio Reasoning. The survey further examines emerging paradigms including Chain-of-Thought prompting, supervised fine-tuning, reinforcement learning, and latency-aware spoken interaction, along with evaluation practices and open challenges.
What carries the argument
A unified formulation that distinguishes direct predictive modeling from reasoning-augmented generation to handle the alignment of continuous acoustic signals with discrete language model semantics while preserving fine-grained information.
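The passage quoted further down this page states the factorization behind this distinction. A minimal sketch of the two regimes follows. The symbol reading is an assumption, since the page does not define the symbols: A is the audio input, X the text prompt, R the reasoning trace, and Y the final output.

```latex
% Assumed symbol meanings (not defined on this page):
% A = audio input, X = text prompt, R = reasoning trace, Y = final output.
\begin{align}
  \text{direct predictive modeling:} \quad
    & P(Y \mid A, X) \\
  \text{reasoning-augmented generation:} \quad
    & P(R, Y \mid A, X) = P(R \mid A, X)\, P(Y \mid A, X, R)
\end{align}
```

Under this reading, the second line is the chain rule applied to the joint distribution over trace and output: the model first produces a trace conditioned on the inputs, then produces the output conditioned on the inputs and the trace.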
If this is right
- Advances in Audio-to-Text and Audio-to-Speech can be more systematically compared and improved.
- Agentic Audio Reasoning can support interactive spoken agents that perform step-by-step inference.
- Methods like reinforcement learning can help overcome shortcut learning in audio tasks.
- Latency-aware designs enable practical real-time audio reasoning applications.
- Evaluation practices can evolve to better test for modality hallucination and grounding.
Where Pith is reading between the lines
- This categorization could help in designing experiments that test reasoning depth versus prediction accuracy in audio models.
- Connections to visual reasoning suggest potential for unified multi-modal reasoning frameworks beyond audio alone.
- Addressing the listed obstacles might lead to foundation models that handle real-world audio interactions more robustly.
- One could test the formulation by applying it to emerging audio datasets to see if it reveals new patterns in progress.
Load-bearing premise
The challenges in audio reasoning are fundamentally distinct from those in text and vision, necessitating a separate survey and a new unified formulation.
What would settle it
An experiment showing that general multimodal reasoning techniques without audio-specific adaptations achieve equivalent performance on audio tasks would challenge the premise for a dedicated survey.
Original abstract
Reasoning has become a defining capability of modern foundation models, yet its development in the audio modality remains limited. Audio poses challenges that are distinct from those of text and vision. It is continuous, temporally dense, and contains linguistic, paralinguistic, and environmental information at multiple time scales. As a result, audio reasoning models must align acoustic signals with the discrete semantic space of large language models, while still preserving fine-grained information needed for reliable inference. Progress is also limited by three major obstacles: the scarcity of genuinely audio-grounded reasoning data, shortcut learning and modality hallucination, and the tension between reasoning depth and real-time latency in spoken interaction. In this paper, we present the first dedicated survey of audio reasoning. We provide a unified formulation that distinguishes direct predictive modeling from reasoning-augmented generation, review the architectural and training foundations of audio reasoning models, and systematically organize recent advances in Audio-to-Text, Audio-to-Speech, Audio-Visual Reasoning and Agentic Audio Reasoning. We further examine emerging paradigms such as Chain-of-Thought prompting, supervised fine-tuning, reinforcement learning, and latency-aware spoken interaction, and discuss evaluation practices, open challenges, and future directions. Our goal is to offer a coherent roadmap for developing robust, efficient, and natively grounded audio reasoning systems.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims to deliver the first dedicated survey of audio reasoning in multimodal foundation models. It provides a unified formulation distinguishing direct predictive modeling from reasoning-augmented generation, reviews architectural and training foundations, and organizes advances in Audio-to-Text, Audio-to-Speech, Audio-Visual Reasoning, and Agentic Audio Reasoning. The survey also covers paradigms like Chain-of-Thought prompting, supervised fine-tuning, reinforcement learning, latency-aware interaction, evaluation practices, challenges, and future directions.
Significance. If the claims hold, this survey would be significant for the field by establishing a coherent framework and roadmap for audio reasoning, which is currently limited compared to text and vision. The explicit identification of distinct audio challenges and obstacles like data scarcity and modality hallucination provides a useful structure for future work. As a survey without new quantitative claims, its value lies in synthesis and organization of existing literature.
minor comments (2)
- [Abstract] The premise that audio poses fundamentally distinct challenges from text and vision is stated to motivate the scope; a brief explicit contrast with vision-language reasoning surveys would strengthen the justification for a dedicated audio survey.
- The unified formulation is introduced in the abstract but its concrete mathematical or conceptual details are not visible in the provided high-level description; ensuring the formulation is presented with clear notation and examples in the main text would improve accessibility.
Simulated Author's Rebuttal
We thank the referee for the positive assessment and recommendation for minor revision. The summary accurately reflects the paper's contributions in providing a unified formulation of audio reasoning and organizing advances across Audio-to-Text, Audio-to-Speech, Audio-Visual, and Agentic paradigms, while highlighting key challenges such as data scarcity and modality hallucination.
Circularity Check
No significant circularity
This is a survey paper whose central contribution is a review and taxonomy of existing literature on audio reasoning. It states a motivation based on modality differences and offers a unified formulation to organize prior work, but introduces no new quantitative predictions, fitted parameters, or formal derivations that could reduce to its own inputs. All load-bearing content consists of citations to external studies and internal consistency of the proposed categories, with no self-referential loops or self-citation chains that substitute for independent evidence. The paper is therefore self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Audio poses challenges distinct from text and vision because it is continuous, temporally dense, and contains linguistic, paralinguistic, and environmental information at multiple time scales.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, theorem washburn_uniqueness_aczel (tag: unclear)
  Tag note: relation between the paper passage and the cited Recognition theorem.
  Passage: "We provide a unified formulation that distinguishes direct predictive modeling from reasoning-augmented generation... P(R,Y|A,X) = P(R|A,X) P(Y|A,X,R)"
- IndisputableMonolith/Foundation/DimensionForcing.lean, theorem alexander_duality_circle_linking (tag: unclear)
  Tag note: relation between the paper passage and the cited Recognition theorem.
  Passage: "We organize the literature into four paradigms: Audio-to-Text reasoning, Audio-to-Speech reasoning, Audio-Visual reasoning, and Agentic Audio Reasoning"
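The factorization in the first quoted passage can be checked numerically on a toy case. The sketch below is illustrative only: the trace names, answer labels, and probabilities are hypothetical, and the symbol reading (R as a reasoning trace, Y as the answer, for one fixed audio and prompt pair) is an assumption rather than something the page defines. It builds the joint from the two factors and recovers the answer distribution by summing over traces.

```python
# Toy check of P(R, Y | A, X) = P(R | A, X) * P(Y | A, X, R).
# All values are hypothetical and chosen for illustration only;
# the (audio, prompt) pair is held fixed and left implicit.
from itertools import product

TRACES = ["r1", "r2"]      # hypothetical reasoning traces R
ANSWERS = ["yes", "no"]    # hypothetical final outputs Y

# P(R | A, X): distribution over reasoning traces
p_trace = {"r1": 0.7, "r2": 0.3}

# P(Y | A, X, R): answer distribution conditioned on each trace
p_answer_given_trace = {
    "r1": {"yes": 0.9, "no": 0.1},
    "r2": {"yes": 0.2, "no": 0.8},
}


def joint(trace: str, answer: str) -> float:
    """Joint probability of (trace, answer) via the chain rule."""
    return p_trace[trace] * p_answer_given_trace[trace][answer]


def marginal_answer(answer: str) -> float:
    """P(Y | A, X), obtained by summing the joint over all traces."""
    return sum(joint(trace, answer) for trace in TRACES)


# Full joint table and its total mass (should sum to one).
joint_table = {(r, y): joint(r, y) for r, y in product(TRACES, ANSWERS)}
total_mass = sum(joint_table.values())
answer_marginals = {y: marginal_answer(y) for y in ANSWERS}

if __name__ == "__main__":
    for key, value in joint_table.items():
        print(key, round(value, 4))
    print("total mass:", round(total_mass, 4))
    print("answer marginals:", {y: round(p, 4) for y, p in answer_marginals.items()})
```

Summing over traces gives an answer distribution with no explicit trace, which is one way to relate the reasoning-augmented form back to direct prediction. A real model would not enumerate traces; it would sample or decode one.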
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.