A Survey of Audio Reasoning in Multimodal Foundation Models

Daxin Tan; Dingdong Wang; Guan-Ting Lin; Han Shi; Irwin King; Jiaya Jia; Jing Xiong; Jingyao Li; Qiyong Zheng; Wenqian Cui

arxiv: 2605.21008 · v1 · pith:S7FX2JXGnew · submitted 2026-05-20 · 📡 eess.AS

Zhihan Guo, Wenqian Cui, Guan-Ting Lin, Daxin Tan, Jingyao Li, Qiyong Zheng, Dingdong Wang, Jing Xiong, Han Shi, Jiaya Jia, Irwin King

Pith reviewed 2026-05-21 02:04 UTC · model grok-4.3

keywords: audio reasoning · multimodal foundation models · reasoning-augmented generation · audio-to-text · audio-visual reasoning · chain-of-thought · reinforcement learning · spoken interaction

The pith

Audio reasoning in multimodal foundation models requires a dedicated survey and unified formulation because of its unique continuous and multi-scale characteristics.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper seeks to establish a coherent roadmap for audio reasoning and presents itself as the first survey dedicated to the field. It distinguishes direct predictive modeling from reasoning-augmented generation to better organize how models align audio signals with language semantics. The practical payoff would be more reliable systems that can infer from speech, environmental sounds, and combined audio-visual inputs without losing fine details. The work reviews foundations, organizes advances into four categories, and covers methods such as prompting and training techniques. It also points out obstacles such as data scarcity and the need to balance reasoning depth with real-time latency.

Core claim

The authors present the first dedicated survey of audio reasoning in multimodal foundation models. They introduce a unified formulation to separate direct predictive modeling from reasoning-augmented generation, review the architectural and training foundations, and systematically organize recent advances across Audio-to-Text, Audio-to-Speech, Audio-Visual Reasoning, and Agentic Audio Reasoning. The survey further examines emerging paradigms including Chain-of-Thought prompting, supervised fine-tuning, reinforcement learning, and latency-aware spoken interaction, along with evaluation practices and open challenges.

What carries the argument

A unified formulation that distinguishes direct predictive modeling from reasoning-augmented generation to handle the alignment of continuous acoustic signals with discrete language model semantics while preserving fine-grained information.
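
As a rough illustration of that distinction, the factorization quoted further down this page, P(R,Y|A,X) = P(R|A,X) P(Y|A,X,R), can be worked through with a toy example. This is a minimal sketch under stated assumptions: the page does not define the symbols, so reading A as the audio input, X as the text prompt, R as a reasoning trace, and Y as the answer is an assumption, as is treating the marginal P(Y|A,X) as the target of direct predictive modeling. Every name and number in the code is hypothetical.

```python
# Toy illustration only: all labels and probabilities are made up, not from the paper.
# One fixed audio clip A and prompt X are assumed; R is a reasoning trace, Y an answer.

# P(R | A, X): hypothetical distribution over two reasoning traces.
p_r = {"r_siren": 0.6, "r_alarm": 0.4}

# P(Y | A, X, R): hypothetical answer distribution conditioned on each trace.
p_y_given_r = {
    "r_siren": {"ambulance": 0.9, "fire_alarm": 0.1},
    "r_alarm": {"ambulance": 0.2, "fire_alarm": 0.8},
}


def joint(p_r, p_y_given_r):
    """Reasoning-augmented view: P(R, Y | A, X) = P(R | A, X) * P(Y | A, X, R)."""
    return {
        (r, y): p_r[r] * p_y
        for r, answers in p_y_given_r.items()
        for y, p_y in answers.items()
    }


def marginal_answer(p_joint):
    """Sum the joint over R to obtain P(Y | A, X), with no explicit trace kept."""
    p_y = {}
    for (_, y), p in p_joint.items():
        p_y[y] = p_y.get(y, 0.0) + p
    return p_y


p_joint = joint(p_r, p_y_given_r)
p_y = marginal_answer(p_joint)

for answer, prob in sorted(p_y.items()):
    print(f"P(Y={answer} | A, X) = {prob:.2f}")
```

In this toy setting the two views agree on the answer distribution once the reasoning trace is summed out; the difference is that the reasoning-augmented view exposes R as an explicit intermediate that can be generated, inspected, or trained on.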

If this is right

Advances in Audio-to-Text and Audio-to-Speech can be more systematically compared and improved.
Agentic Audio Reasoning can support interactive spoken agents that perform step-by-step inference.
Methods like reinforcement learning can help overcome shortcut learning in audio tasks.
Latency-aware designs enable practical real-time audio reasoning applications.
Evaluation practices can evolve to better test for modality hallucination and grounding.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This categorization could help in designing experiments that test reasoning depth versus prediction accuracy in audio models.
Connections to visual reasoning suggest potential for unified multi-modal reasoning frameworks beyond audio alone.
Addressing the listed obstacles might lead to foundation models that handle real-world audio interactions more robustly.
One could test the formulation by applying it to emerging audio datasets to see if it reveals new patterns in progress.

Load-bearing premise

The challenges in audio reasoning are fundamentally distinct from those in text and vision, necessitating a separate survey and a new unified formulation.

What would settle it

An experiment showing that general multimodal reasoning techniques without audio-specific adaptations achieve equivalent performance on audio tasks would challenge the premise for a dedicated survey.

Figures

Figures reproduced from arXiv: 2605.21008 by Daxin Tan, Dingdong Wang, Guan-Ting Lin, Han Shi, Irwin King, Jiaya Jia, Jing Xiong, Jingyao Li, Qiyong Zheng, Wenqian Cui, Zhihan Guo.

**Figure 1.** Timeline of representative audio reasoning models. Models are organized chronologically and grouped by major paradigms, including Audio-to-Text, Audio-to-Speech, Audio-Visual, and agentic audio reasoning.

**Figure 2.** A compact taxonomy of audio reasoning. We organize the literature into four paradigms: Audio-to-Text reasoning, Audio-to-Speech reasoning, Audio-Visual reasoning, and Agentic Audio Reasoning. Representative methods and design patterns are discussed in the corresponding sections.

**Figure 3.** Overview of major audio reasoning paradigms. The figure summarizes four paradigms covered in this survey: Audio-to-Text, Audio-to-Speech, Audio-Visual, and Agentic Audio Reasoning. It contrasts text-output reasoning, cross-modal audio-visual grounding, sequential and real-time speech-output reasoning, and agentic workflows based on predefined pipelines or dynamic tool calling.

Original abstract

Reasoning has become a defining capability of modern foundation models, yet its development in the audio modality remains limited. Audio poses challenges that are distinct from those of text and vision. It is continuous, temporally dense, and contains linguistic, paralinguistic, and environmental information at multiple time scales. As a result, audio reasoning models must align acoustic signals with the discrete semantic space of large language models, while still preserving fine-grained information needed for reliable inference. Progress is also limited by three major obstacles: the scarcity of genuinely audio-grounded reasoning data, shortcut learning and modality hallucination, and the tension between reasoning depth and real-time latency in spoken interaction. In this paper, we present the first dedicated survey of audio reasoning. We provide a unified formulation that distinguishes direct predictive modeling from reasoning-augmented generation, review the architectural and training foundations of audio reasoning models, and systematically organize recent advances in Audio-to-Text, Audio-to-Speech, Audio-Visual Reasoning and Agentic Audio Reasoning. We further examine emerging paradigms such as Chain-of-Thought prompting, supervised fine-tuning, reinforcement learning, and latency-aware spoken interaction, and discuss evaluation practices, open challenges, and future directions. Our goal is to offer a coherent roadmap for developing robust, efficient, and natively grounded audio reasoning systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 2 minor

Summary. The paper claims to deliver the first dedicated survey of audio reasoning in multimodal foundation models. It provides a unified formulation distinguishing direct predictive modeling from reasoning-augmented generation, reviews architectural and training foundations, and organizes advances in Audio-to-Text, Audio-to-Speech, Audio-Visual Reasoning, and Agentic Audio Reasoning. The survey also covers paradigms like Chain-of-Thought prompting, supervised fine-tuning, reinforcement learning, latency-aware interaction, evaluation practices, challenges, and future directions.

Significance. If the claims hold, this survey would be significant for the field because it establishes a coherent framework and roadmap for audio reasoning, a capability that remains less developed than reasoning over text and vision. The explicit identification of audio-specific challenges and of obstacles such as data scarcity and modality hallucination provides a useful structure for future work. As a survey without new quantitative claims, its value lies in the synthesis and organization of existing literature.

minor comments (2)

[Abstract] The premise that audio poses fundamentally distinct challenges from text and vision is stated to motivate the scope; a brief explicit contrast with vision-language reasoning surveys would strengthen the justification for a dedicated audio survey.
The unified formulation is introduced in the abstract but its concrete mathematical or conceptual details are not visible in the provided high-level description; ensuring the formulation is presented with clear notation and examples in the main text would improve accessibility.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive assessment and recommendation for minor revision. The summary accurately reflects the paper's contributions in providing a unified formulation of audio reasoning and organizing advances across Audio-to-Text, Audio-to-Speech, Audio-Visual, and Agentic paradigms, while highlighting key challenges such as data scarcity and modality hallucination.

Circularity Check

0 steps flagged

No significant circularity

This is a survey paper whose central contribution is a review and taxonomy of existing literature on audio reasoning. It states a motivation based on modality differences and offers a unified formulation to organize prior work, but introduces no new quantitative predictions, fitted parameters, or formal derivations that could reduce to its own inputs. All load-bearing content consists of citations to external studies and internal consistency of the proposed categories, with no self-referential loops or self-citation chains that substitute for independent evidence. The paper is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

This is a literature survey with no new mathematical derivations, fitted parameters, or postulated entities; it relies on standard domain assumptions from multimodal AI research.

axioms (1)

Domain assumption: Audio poses challenges distinct from text and vision because it is continuous, temporally dense, and contains linguistic, paralinguistic, and environmental information at multiple time scales. Invoked in the abstract to motivate the need for specialized audio reasoning models.

pith-pipeline@v0.9.0 · 5793 in / 1271 out tokens · 34307 ms · 2026-05-21T02:04:28.515190+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: "We provide a unified formulation that distinguishes direct predictive modeling from reasoning-augmented generation... P(R,Y|A,X) = P(R|A,X) P(Y|A,X,R)"

IndisputableMonolith/Foundation/DimensionForcing.lean · alexander_duality_circle_linking · unclear
Paper passage: "We organize the literature into four paradigms: Audio-to-Text reasoning, Audio-to-Speech reasoning, Audio-Visual reasoning, and Agentic Audio Reasoning"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

130 extracted references · 130 canonical work pages · 33 internal anchors

[1]

Chain-of-thought prompting elicits reasoning in large 17 language models,

J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V . Le, D. Zhouet al., “Chain-of-thought prompting elicits reasoning in large 17 language models,”Advances in neural information processing systems, vol. 35, pp. 24 824–24 837, 2022

work page 2022
[2]

From System 1 to System 2: A Survey of Reasoning Large Language Models

Z.-Z. e. a. Li, “From system 1 to system 2: A survey of reasoning large language models,”arXiv preprint arXiv:2502.17419, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[3]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

D. Guo, D. Yang, H. Zhang, J. Song, P. Wang, Q. Zhu, R. Xu, R. Zhang, S. Ma, X. Biet al., “Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning,”arXiv preprint arXiv:2501.12948, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[4]

Beyond Benchmarks: MathArena as an Evaluation Platform for Mathematics with LLMs

J. D. et al., “Beyond benchmarks: Matharena as an evaluation platform for mathematics with llms,” 2026. [Online]. Available: https://arxiv.org/abs/2605.00674

work page internal anchor Pith review Pith/arXiv arXiv 2026
[5]

Let’s verify step by step,

H. e. a. Lightman, “Let’s verify step by step,” inInternational Confer- ence on Learning Representations, vol. 2024, 2024, pp. 39 578–39 601

work page 2024
[6]

Scaling llm test-time compute optimally can be more effective than scaling parameters for reasoning,

C. V . Snell, J. Lee, K. Xu, and A. Kumar, “Scaling llm test-time compute optimally can be more effective than scaling parameters for reasoning,” inThe Thirteenth International Conference on Learning Representations, 2025

work page 2025
[7]

Inference scaling laws: An empirical analysis of compute-optimal inference for llm problem- solving,

Y . Wu, Z. Sun, S. Li, S. Welleck, and Y . Yang, “Inference scaling laws: An empirical analysis of compute-optimal inference for llm problem- solving,” inThe Thirteenth International Conference on Learning Representations, 2025

work page 2025
[8]

Visual cot: Advancing multi-modal language models with a comprehensive dataset and benchmark for chain-of-thought rea- soning,

H. e. a. Shao, “Visual cot: Advancing multi-modal language models with a comprehensive dataset and benchmark for chain-of-thought rea- soning,”Advances in Neural Information Processing Systems, vol. 37, pp. 8612–8642, 2024

work page 2024
[9]

Compositional chain- of-thought prompting for large multimodal models,

C. Mitra, B. Huang, T. Darrell, and R. Herzig, “Compositional chain- of-thought prompting for large multimodal models,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recogni- tion, 2024, pp. 14 420–14 431

work page 2024
[10]

On The Landscape of Spoken Language Models: A Comprehensive Survey

S. Arora, K.-W. Chang, C.-M. Chien, Y . Peng, H. Wu, Y . Adi, E. Dupoux, H.-Y . Lee, K. Livescu, and S. Watanabe, “On the landscape of spoken language models: A comprehensive survey,”arXiv preprint arXiv:2504.08528, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[11]

Mmau: A massive multi-task audio understanding and reasoning benchmark,

S. e. a. Sakshi, “Mmau: A massive multi-task audio understanding and reasoning benchmark,” inInternational Conference on Learning Representations, vol. 2025, 2025, pp. 84 929–84 964

work page 2025
[12]

Sd-eval: A benchmark dataset for spoken dialogue under- standing beyond words,

J. e. a. Ao, “Sd-eval: A benchmark dataset for spoken dialogue under- standing beyond words,”Advances in Neural Information Processing Systems, vol. 37, pp. 56 898–56 918, 2024

work page 2024
[13]

Recent advances in discrete speech tokens: A review,

Y . e. a. Guo, “Recent advances in discrete speech tokens: A review,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025

work page 2025
[14]

What enables human language? a biocultural frame- work,

I. e. a. Arnon, “What enables human language? a biocultural frame- work,”Science, vol. 390, no. 6775, p. eadq8303, 2025

work page 2025
[15]

Representation of internal speech by single neurons in human supramarginal gyrus,

S. K. e. a. Wandelt, “Representation of internal speech by single neurons in human supramarginal gyrus,”Nature human behaviour, vol. 8, no. 6, pp. 1136–1149, 2024

work page 2024
[17]

OmniFlatten: An end-to-end GPT model for seamless voice conversation,

Q. e. a. Zhang, “OmniFlatten: An end-to-end GPT model for seamless voice conversation,” inProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), W. Che, J. Nabende, E. Shutova, and M. T. Pilehvar, Eds. Vienna, Austria: Association for Computational Linguistics, Jul. 2025, pp. 14 570–14 580. [Online...

work page 2025
[18]

To cot or not to cot? chain-of-thought helps mainly on math and symbolic reasoning,

Z. R. S. et al., “To cot or not to cot? chain-of-thought helps mainly on math and symbolic reasoning,” inThe Thirteenth International Conference on Learning Representations, 2025. [Online]. Available: https://openreview.net/forum?id=w6nlcS8Kkn

work page 2025
[19]

Benchmarking open-ended audio dialogue understanding for large audio-language models,

K. e. a. Gao, “Benchmarking open-ended audio dialogue understanding for large audio-language models,” inProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), W. Che, J. Nabende, E. Shutova, and M. T. Pilehvar, Eds. Vienna, Austria: Association for Computational Linguistics, Jul. 2025, pp. 4763–478...

work page 2025
[20]

Recent advances in speech language models: A survey,

W. Cui, D. Yu, X. Jiao, Z. Meng, G. Zhang, Q. Wang, S. Y . Guo, and I. King, “Recent advances in speech language models: A survey,” inProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2025, pp. 13 943– 13 970

work page 2025
[21]

Sparks of large audio models: A survey and outlook,

S. Latif, M. Shoukat, F. Shamshad, M. Usama, Y . Ren, H. Cuayáhuitl, W. Wang, X. Zhang, R. Togneri, E. Cambriaet al., “Sparks of large au- dio models: A survey and outlook,”arXiv preprint arXiv:2308.12792, 2023

work page arXiv 2023
[22]

Audio-language models for audio-centric tasks: A survey,

Y . Su, J. Bai, Q. Xu, K. Xu, and Y . Dou, “Audio-language models for audio-centric tasks: A survey,”arXiv preprint arXiv:2501.15177, 2025

work page arXiv 2025
[23]

A survey on speech large language models for understanding,

J. Peng, Y . Wang, B. Li, Y . Guo, H. Wang, Y . Fang, Y . Xi, H. Li, X. Li, K. Zhanget al., “A survey on speech large language models for understanding,”IEEE Journal of Selected Topics in Signal Processing, 2025

work page 2025
[24]

Towards general auditory intelligence: Large multimodal models for machine listening and speaking,

S. Wang, Z. Jin, C. Tang, Q. Li, B. Li, C. Chen, Y . Hu, W. Yu, Y . Li, J. Zhuanget al., “Towards general auditory intelligence: Large multimodal models for machine listening and speaking,”arXiv preprint arXiv:2511.01299, 2025

work page arXiv 2025
[25]

Towards holistic evaluation of large audio-language models: A comprehensive survey,

C.-K. Yang, N. S. Ho, and H.-y. Lee, “Towards holistic evaluation of large audio-language models: A comprehensive survey,” inProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, 2025, pp. 10 155–10 181

work page 2025
[26]

Multimodal Chain-of-Thought Reasoning: A Comprehensive Survey

Y . Wang, S. Wu, Y . Zhang, S. Yan, Z. Liu, J. Luo, and H. Fei, “Multimodal chain-of-thought reasoning: A comprehensive survey,” arXiv preprint arXiv:2503.12605, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[27]

Robust speech recognition via large-scale weak super- vision,

A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, and I. Sutskever, “Robust speech recognition via large-scale weak super- vision,” inInternational Conference on Machine Learning (ICML). PMLR, 2023, pp. 28 492–28 518

work page 2023
[28]

Hubert: Self-supervised speech representation learning by masked prediction of hidden units,

W.-N. Hsu, B. Bolte, Y .-H. H. Tsai, K. Lakhotia, R. Salakhutdinov, and A. Mohamed, “Hubert: Self-supervised speech representation learning by masked prediction of hidden units,”IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 29, pp. 3451–3460, 2021

work page 2021
[29]

Beats: Audio pre-training with acoustic tokenizers,

S. Chen, Y . Wu, C. Wang, S. Liu, D. Tompkins, Z. Chen, W. Che, X. Yu, and F. Wei, “Beats: Audio pre-training with acoustic tokenizers,” inInternational Conference on Machine Learning (ICML). PMLR, 2023, pp. 5178–5193

work page 2023
[30]

Ast: Audio spectrogram trans- former,

Y . Gong, Y .-A. Chung, and J. Glass, “Ast: Audio spectrogram trans- former,” inProc. Interspeech 2021, 2021, pp. 571–575

work page 2021
[31]

Robust speech recognition via large-scale weak supervision,

A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, and I. Sutskever, “Robust speech recognition via large-scale weak supervision,” 2022. [Online]. Available: https://arxiv.org/abs/2212. 04356

work page 2022
[32]

Llama 2: Open Foundation and Fine-Tuned Chat Models

H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y . Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosaleet al., “Llama 2: Open foundation and fine-tuned chat models,”arXiv preprint arXiv:2307.09288, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[33]

The Llama 3 Herd of Models

A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Yang, A. Fanet al., “The llama 3 herd of models,”arXiv preprint arXiv:2407.21783, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[34]

Qwen2 Technical Report

A. Yang, B. Yang, B. Hui, B. Zheng, B. Yu, C. Zhou, C. Li, C. Li, D. Liu, F. Huanget al., “Qwen2 technical report,”arXiv preprint arXiv:2407.10671, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[35]

Vi- cuna: An open-source chatbot impressing GPT-4 with 90%* ChatGPT quality,

W.-L. Chiang, Z. Li, Z. Lin, Y . Sheng, Z. Wu, H. Zhang, L. Zheng, S. Zhuang, Y . Zhuang, J. E. Gonzalez, I. Stoica, and E. P. Xing, “Vi- cuna: An open-source chatbot impressing GPT-4 with 90%* ChatGPT quality,” https://lmsys.org/blog/2023-03-30-vicuna/, 2023, accessed: 2023-03-30

work page 2023
[36]

GLM-4-Voice: Towards Intelligent and Human-Like End-to-End Spoken Chatbot

A. Zeng, Z. Du, M. Liu, K. Wang, S. Jiang, L. Zhao, Y . Dong, and J. Tang, “Glm-4-voice: Towards intelligent and human-like end-to-end spoken chatbot,”arXiv preprint arXiv:2412.02612, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[37]

Llama- omni: Seamless speech interaction with large language models,

Q. Fang, S. Niu, R. Zhou, Z. Lin, M. Chen, and Y . Feng, “LLaMA- Omni: Seamless speech interaction with large language models,”arXiv preprint arXiv:2409.06666, 2024

work page arXiv 2024
[38]

Speech gpt: Empowering large language models with intrinsic cross- modal conversational abilities,

D. Zhang, S. Li, X. Zhang, J. Zhan, P. Wang, Y . Zhou, and X. Qiu, “Speech gpt: Empowering large language models with intrinsic cross- modal conversational abilities,” inFindings of the Association for Computational Linguistics: EMNLP 2023, 2023, pp. 15 757–15 773

work page 2023
[39]

Moshi: a speech-text foundation model for real- time dialogue,

A. Défossezet al., “Moshi: a speech-text foundation model for real- time dialogue,”arXiv preprint arXiv:2410.00080, 2024

work page arXiv 2024
[40]

Blsp: Bootstrapping language-speech pre-training via behavior alignment of continuation writing,

C. Wang, M. Liao, Z. Huang, J. Lu, J. Wu, Y . Liu, C. Zong, and J. Zhang, “Blsp: Bootstrapping language-speech pre-training via behavior alignment of continuation writing,” 2024. [Online]. Available: https://arxiv.org/abs/2309.00916

work page arXiv 2024
[41]

Salmonn: Towards generic hearing abilities for large language models,

C. Tang, W. Yu, G. Sun, X. Chen, T. Tan, W. Li, L. Lu, Z. Ma, and C. Zhang, “Salmonn: Towards generic hearing abilities for large language models,” inThe Twelfth International Conference on Learning Representations (ICLR), 2024

work page 2024
[42]

Proximal Policy Optimization Algorithms

J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal policy optimization algorithms,” 2017. [Online]. Available: https://arxiv.org/abs/1707.06347

work page internal anchor Pith review Pith/arXiv arXiv 2017
[43]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Z. Shao, P. Wang, Q. Zhu, R. Xu, J. Song, X. Bi, H. Zhang, M. Zhang, Y . K. Li, Y . Wu, and D. Guo, “Deepseekmath: Pushing 18 the limits of mathematical reasoning in open language models,” 2024. [Online]. Available: https://arxiv.org/abs/2402.03300

work page internal anchor Pith review Pith/arXiv arXiv 2024
[44]

Audio-cot: Exploring chain-of-thought reasoning in large audio language model,

Z. Ma, Z. Chen, Y . Wang, E. S. Chng, and X. Chen, “Audio-cot: Exploring chain-of-thought reasoning in large audio language model,” arXiv preprint arXiv:2501.07246, 2025

work page arXiv 2025
[45]

Sar-lm: Symbolic audio reasoning with large language models,

T. Taheri, Y . Ma, and E. Benetos, “Sar-lm: Symbolic audio reasoning with large language models,”arXiv preprint arXiv:2511.06483, 2025

work page arXiv 2025
[46]

Audio-reasoner: Improving reasoning capability in large audio language models,

Z. Xie, M. Lin, Z. Liu, P. Wu, S. Yan, and C. Miao, “Audio-reasoner: Improving reasoning capability in large audio language models,”arXiv preprint arXiv:2503.02318, 2025

work page arXiv 2025
[47]

Audio flamingo sound-cot technical report: Improving chain-of-thought reasoning in sound understanding,

Z. Kong, A. Goel, J. F. Santos, S. Ghosh, R. Valle, W. Ping, and B. Catanzaro, “Audio flamingo sound-cot technical report: Improving chain-of-thought reasoning in sound understanding,”arXiv preprint arXiv:2508.11818, 2025

work page arXiv 2025
[48]

Audio-Cogito: Towards Deep Audio Reasoning in Large Audio Language Models

L. Li, H. Chen, Z. Li, Q. Hu, J. Kang, J. Li, L. Xie, and Y . Li, “Audio- cogito: Towards deep audio reasoning in large audio language models,” arXiv preprint arXiv:2604.12527, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[49]

Reinforcement learning outperforms supervised fine-tuning: A case study on audio question answering,

G. Li, J. Liu, H. Dinkel, Y . Niu, J. Zhang, and J. Luan, “Reinforcement learning outperforms supervised fine-tuning: A case study on audio question answering,”arXiv preprint arXiv:2503.11197, 2025

work page arXiv 2025
[50]

Omni-r1: Do you really need audio to fine-tune your audio llm?

A. Rouditchenko, S. Bhati, E. Araujo, S. Thomas, H. Kuehne, R. Feris, and J. Glass, “Omni-r1: Do you really need audio to fine-tune your audio llm?”arXiv preprint arXiv:2505.09439, 2025

work page arXiv 2025
[52]

Data- balanced curriculum learning for audio question answering,

G. Wijngaard, E. Formisano, M. Esposito, and M. Dumontier, “Data- balanced curriculum learning for audio question answering,”arXiv preprint arXiv:2507.06815, 2025

work page arXiv 2025
[53]

Phi-4 Technical Report

M. Abdin, J. Aneja, H. Behl, S. Bubeck, R. Eldan, S. Gunasekar, M. Harrison, R. J. Hewett, M. Javaheripi, P. Kauffmannet al., “Phi-4 technical report,”arXiv preprint arXiv:2412.08905, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[54]

Sari: Structured audio reasoning via curriculum-guided reinforcement learning,

C. Wen, T. Guo, S. Zhao, W. Zou, and X. Li, “Sari: Structured audio reasoning via curriculum-guided reinforcement learning,”arXiv preprint arXiv:2504.15900, 2025

work page arXiv 2025
[55]

Omni- autothink: Adaptive multimodal reasoning via reinforcement learning,

D. Yang, S. Liu, D. Wang, Y . Wang, G. Wan, and H. Meng, “Omni- autothink: Adaptive multimodal reasoning via reinforcement learning,” arXiv preprint arXiv:2512.03783, 2025

work page arXiv 2025
[56]

Omni-clst: Error-aware curriculum learning with guided selec- tive chain-of-thought for audio question answering,

J. Zhao, H. Su, L. Fan, Z. Luo, H. Wang, H. Sun, and Y . Qin, “Omni-clst: Error-aware curriculum learning with guided selec- tive chain-of-thought for audio question answering,”arXiv preprint arXiv:2509.12275, 2025

work page arXiv 2025
[57]

Think smart, not hard: Difficulty adaptive reasoning for large audio language models,

Z. Sheng, S. Zhou, C. Gong, and Z. Li, “Think smart, not hard: Difficulty adaptive reasoning for large audio language models,”arXiv preprint arXiv:2509.21960, 2025

work page arXiv 2025
[58]

Aud- semthinker: Enhancing audio-language models through reasoning over semantics of sound,

G. Wijngaard, E. Formisano, M. Esposito, and M. Dumontier, “Aud- semthinker: Enhancing audio-language models through reasoning over semantics of sound,”arXiv preprint arXiv:2505.14142, 2025

work page arXiv 2025
[59]

Measuring audio’s impact on correctness: Audio-contribution-aware post-training of large audio language models,

H. He, X. Du, R. Sun, Z. Dai, Y . Xiao, M. Yang, J. Zhou, X. Li, Z. Liu, Z. Lianget al., “Measuring audio’s impact on correctness: Audio- contribution-aware post-training of large audio language models,”arXiv preprint arXiv:2509.21060, 2025

work page arXiv 2025
[60]

Step-audio-r1 technical report,

F. Tian, X. T. Zhang, Y . Zhang, H. Zhang, Y . Li, D. Liu, Y . Deng, D. Wu, J. Chen, L. Zhaoet al., “Step-audio-r1 technical report,”arXiv preprint arXiv:2511.15848, 2025

work page arXiv 2025
[61]

Step-Audio 2 Technical Report

B. Wu, C. Yan, C. Hu, C. Yi, C. Feng, F. Tian, F. Shen, G. Yu, H. Zhang, J. Liet al., “Step-audio 2 technical report,”arXiv preprint arXiv:2507.16632, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[62]

Audio-thinker: Guiding audio language model when and how to think via reinforcement learning,

S. Wu, C. Li, W. Wang, H. Zhang, H. Wang, M. Yu, and D. Yu, “Audio-thinker: Guiding audio language model when and how to think via reinforcement learning,”arXiv preprint arXiv:2508.08039, 2025

work page arXiv 2025
[63]

Audio-DeepThinker: Progressive Reasoning-Aware Reinforcement Learning for High-Quality Chain-of-Thought Emergence in Audio Language Models

X. He, C. Li, J. Wang, Y . Rong, T. Xie, W. Wang, L. Liu, and D. Yu, “Audio-deepthinker: Progressive reasoning-aware reinforcement learning for high-quality chain-of-thought emergence in audio language models,”arXiv preprint arXiv:2604.18187, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[64]

Incentivizing consistent, effective and scalable reasoning capability in audio llms via reasoning process rewards,

J. Fan, R. Ren, J. Li, R. Pandey, P. G. Shivakumar, I. Bulyko, A. Gandhe, G. Liu, and Y . Gu, “Incentivizing consistent, effective and scalable reasoning capability in audio llms via reasoning process rewards,”arXiv preprint arXiv:2510.20867, 2025

work page arXiv 2025
[65] X. Diao, C. Zhang, K. Kong, W. Wu, C. Ma, Z. Ouyang, P. Qing, S. Vosoughi, and J. Gui, “Soundmind: Rl-incentivized logic reasoning for audio-language models,” in Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, 2025, pp. 528–540.

[66] Y. Chen, X. Yue, X. Gao, C. Zhang, L. F. D’Haro, R. T. Tan, and H. Li, “Beyond single-audio: Advancing multi-audio processing in audio large language models,” in Findings of the Association for Computational Linguistics: EMNLP 2024, 2024, pp. 10 917–10 930.

[67] S. Kumar, S. Ghosh, Y. Lin, Y. Chen, R. Duraiswami, and D. Manocha, “Polyaudio: Advancing multi-audio analysis & reasoning in large audio language models,” 2025.

[68] D. Wang, S. Liu, T. Zhang, Y. Chen, J. Li, and H. Meng, “Emotion-thinker: Prosody-aware reinforcement learning for explainable speech emotion reasoning,” arXiv preprint arXiv:2601.15668, 2026.

[69] J. X. et al., “Qwen2.5-omni technical report,” 2025. [Online]. Available: https://arxiv.org/abs/2503.20215

[70] D. Ding, Z. Ju, Y. Leng, S. Liu, T. Liu, Z. Shang, K. Shen, W. Song, X. Tan, H. Tang et al., “Kimi-audio technical report,” arXiv preprint arXiv:2504.18425, 2025.

[71] Z. Xie and C. Wu, “Mini-omni: Language models can hear, talk while thinking in streaming,” arXiv preprint arXiv:2408.16725, 2024.

[72] Z. Xie and C. Wu, “Mini-omni2: Towards open-source gpt-4o model with vision, speech and duplex,” arXiv preprint arXiv:2410.11190, 2024.

[73] W. Chen et al., “SLAM-omni: Timbre-controllable voice interaction system with single-stage training,” in Findings of the Association for Computational Linguistics: ACL 2025, W. Che, J. Nabende, E. Shutova, and M. T. Pilehvar, Eds. Vienna, Austria: Association for Computational Linguistics, Jul. 2025, pp. 2262–2282. [Online]. Available: https://aclanthology....

[74] W. Cui, X.-H. Li, D. Tan, Q. Zheng, and I. King, “Minimizing modality gap from the input side: Your speech llm can be a prosody-aware text llm,” arXiv preprint arXiv:2605.05927, 2026. [Online]. Available: https://arxiv.org/abs/2605.05927

[76] Q. Team, “Qwen3.5-omni technical report,” arXiv preprint arXiv:2604.15804, 2026.

[77] L.-C.-T. Xiaomi, “Mimo-audio: Audio language models are few-shot learners,” 2025. [Online]. Available: https://github.com/XiaomiMiMo/MiMo-Audio

[78] C. Wang, T. Peng, W. Yang, Y. Bai, G. Wang, J. Lin, L. Jia, L. Wu, J. Wang, C. Zong et al., “Opens2s: Advancing fully open-source end-to-end empathetic large speech language model,” in Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, 2025, pp. 906–917.

[79] C.-H. Chiang, X. Wang, L. Li, C.-C. Lin, K. Lin, S. Liu, Z. Wang, Z. Yang, H.-y. Lee, and L. Wang, “Shanks: Simultaneous hearing and thinking for spoken language models,” arXiv preprint arXiv:2510.06917, 2025.

[80] Y.-J. Shih, D. Raj, C. Wu, W. Zhou, S. Bong, Y. Gaur, J. Mahadeokar, O. Kalinli, and M. Seltzer, “Can speech LLMs think while listening?” in The Fourteenth International Conference on Learning Representations, 2026. [Online]. Available: https://openreview.net/forum?id=dFVenZdVbX

[81] D. Wu, H. Zhang, C. Chen, T. Zhang, F. Tian, X. Yang, G. Yu, H. Liu, N. Hou, Y. Hu et al., “Chronological thinking in full-duplex spoken dialogue language models,” arXiv preprint arXiv:2510.05150, 2025.

[82] D. Wu, T. Zhang, Y. Li, H. Liu, C. Chen, E. S. Chng, and Y. Bengio, “The silent thought: Modeling internal cognition in full-duplex spoken dialogue models via latent reasoning,” arXiv preprint arXiv:2603.17837, 2026.

[83] C.-H. C. et al., “STITCH: Simultaneous thinking and talking with chunked reasoning for spoken language models,” in The Fourteenth International Conference on Learning Representations, 2026. [Online]. Available: https://openreview.net/forum?id=5Z1eMhCeTb


