A Survey of Large Audio Language Models: Generalization, Trustworthiness, and Outlook
Pith reviewed 2026-05-21 07:34 UTC · model grok-4.3
The pith
Large Audio Language Models advance faster than the trustworthiness frameworks needed to secure them.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The escalation of LALMs' capabilities has significantly outpaced the development of systemic frameworks to ensure their trustworthiness, with a profound imbalance between a mature offensive landscape and underdeveloped defenses. The transition to unified end-to-end frameworks and the integration of continuous acoustic signals inherently expand the attack surface, as shown by vulnerabilities including cross-modal jailbreaking, latent acoustic backdoors, and biometric privacy leakage.
What carries the argument
The six-pillar taxonomy of trustworthiness risks that evaluates endogenous mechanisms and alignment algorithms across hallucination, robustness, safety, privacy, fairness, and authentication.
If this is right
- Unified end-to-end audio models require defense strategies distinct from those developed for text or vision models.
- Continuous acoustic signals create attack vectors such as latent backdoors that text-based methods do not address.
- A strategic roadmap centered on defense-in-depth and causal auditory modeling can reduce the current imbalance between attacks and defenses.
- Filling trustworthiness gaps supports safer deployment of audio-centric applications in real-world settings.
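As a rough illustration of the defense-in-depth idea in the list above, the sketch below chains independent checks so that a request must pass every layer rather than one. It is not taken from the paper: the `AudioRequest` fields, the three layers, the amplitude threshold, and the blocked phrase are all hypothetical placeholders chosen only to show the layering pattern.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class AudioRequest:
    """Hypothetical request record; every field is an illustrative assumption."""
    transcript: str          # text decoded from the audio by some upstream step
    peak_amplitude: float    # normalized to the range 0..1
    speaker_verified: bool   # result of some separate speaker check


# A layer returns a reason string when it objects, or None when it passes.
Layer = Callable[[AudioRequest], Optional[str]]


def signal_layer(req: AudioRequest) -> Optional[str]:
    # Placeholder acoustic check: flag near-clipping input.
    return "abnormal signal level" if req.peak_amplitude > 0.99 else None


def content_layer(req: AudioRequest) -> Optional[str]:
    # Placeholder content check: flag a known injection phrase.
    blocked_phrases = ("ignore previous instructions",)
    text = req.transcript.lower()
    if any(phrase in text for phrase in blocked_phrases):
        return "suspicious instruction in transcript"
    return None


def identity_layer(req: AudioRequest) -> Optional[str]:
    # Placeholder identity check: require a verified speaker.
    return None if req.speaker_verified else "speaker not verified"


DEFAULT_LAYERS: List[Layer] = [signal_layer, content_layer, identity_layer]


def run_layers(req: AudioRequest, layers: List[Layer] = DEFAULT_LAYERS) -> List[str]:
    """Run every layer and collect all objections; an empty list means accept."""
    reasons = []
    for layer in layers:
        reason = layer(req)
        if reason is not None:
            reasons.append(reason)
    return reasons
```

Running all layers instead of stopping at the first objection is a design choice for this sketch: it shows which defenses would have caught a given input, which is the kind of per-layer evidence a defense audit would need.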
Where Pith is reading between the lines
- Voice interfaces built on these models may require extra verification layers to limit manipulation through sound inputs.
- Regulatory guidelines for audio AI could focus on accelerating defensive research rather than capability benchmarks alone.
- Similar trustworthiness gaps may appear in other multimodal models that fuse continuous signals with language.
Load-bearing premise
That the established taxonomy comprehensively captures the main vulnerabilities in unified end-to-end LALM frameworks.
What would settle it
A documented attack, such as a cross-modal jailbreak or acoustic backdoor, that succeeds against all defenses outlined in the taxonomy and roadmap.
Original abstract
The foundational capabilities established by Large Language Models (LLMs) have paved the way for Multimodal Large Language Models (MLLMs), within which Large Audio Language Models (LALMs) are essential for realizing universal auditory intelligence. Despite their remarkable performance, the escalation of LALMs' capabilities has significantly outpaced the development of systemic frameworks to ensure their trustworthiness. This survey provides a comprehensive investigation into the endogenous mechanisms of LALMs, detailing the architectural innovations and alignment algorithms that facilitate emergent reasoning. Specifically, we analyze how the transition to unified end-to-end frameworks and the integration of continuous acoustic signals inherently expand the attack surface. To rigorously evaluate the risks within these paradigms, we establish a comprehensive taxonomy of trustworthiness, categorizing critical vulnerabilities such as cross-modal jailbreaking, latent acoustic backdoors, and biometric privacy leakage. We review the state-of-the-art through six analytical pillars: hallucination, robustness, safety, privacy, fairness, and authentication. The profound imbalance between a mature offensive landscape and underdeveloped defenses further validates the critical trustworthiness gaps and multidimensional risks facing audio-centric intelligence. Finally, we propose a strategic roadmap advocating for "Defense-in-Depth" architectures, causal auditory world modeling, and intrinsic representation engineering to bridge the gap between empirical performance and intrinsically trustworthy audio intelligence. Our project has been uploaded to GitHub https://github.com/Kwwwww74/Awesome-Trustworthy-AudioLLMs.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper surveys Large Audio Language Models (LALMs), tracing their development from LLMs and MLLMs through architectural innovations and alignment algorithms that enable emergent reasoning. It analyzes how unified end-to-end frameworks with continuous acoustic signals expand the attack surface, establishes a taxonomy of trustworthiness vulnerabilities (cross-modal jailbreaking, latent acoustic backdoors, biometric privacy leakage), reviews the literature across six pillars (hallucination, robustness, safety, privacy, fairness, authentication), documents a profound imbalance between mature offensive capabilities and underdeveloped defenses, and outlines a roadmap for Defense-in-Depth architectures, causal auditory world modeling, and intrinsic representation engineering. A GitHub repository accompanies the survey.
Significance. A well-executed survey with this taxonomy and roadmap could become a standard reference for the emerging LALM trustworthiness literature, particularly by synthesizing cross-modal risks and advocating proactive defense strategies. The accompanying GitHub repository for literature collection is a positive step toward reproducibility in survey work.
major comments (2)
- [Abstract] The central claim of a 'mature offensive landscape' and 'profound imbalance' between offensive and defensive capabilities rests on the assertion that continuous audio integration inherently expands the attack surface. However, the reviewed literature appears to draw primarily from LLM and vision-MLLM attacks (e.g., cross-modal jailbreaking examples) rather than providing extensive empirical demonstrations on actual unified LALM architectures such as those employing Whisper-style encoders or AudioLM decoders. This risks overstating realized LALM-specific exploits versus potential ones, which is load-bearing for the imbalance diagnosis and the call for new defenses.
- [Taxonomy and six-pillar review] The taxonomy is presented as comprehensive for vulnerabilities including latent acoustic backdoors and biometric privacy leakage in end-to-end frameworks, yet it is unclear whether the cited works contain sufficient LALM-specific attack results to substantiate that defenses are 'underdeveloped' relative to a mature offensive side. If the pillar reviews rely heavily on non-audio multimodal papers, the taxonomy's claimed coverage of critical gaps requires additional LALM-targeted evidence or explicit discussion of the extrapolation.
minor comments (2)
- [Introduction or Taxonomy] The manuscript would benefit from a dedicated subsection explicitly contrasting LALM-specific attack demonstrations with those transferred from LLMs/MLLMs to avoid reader confusion about the scope of the 'mature offensive landscape' claim.
- [Review sections] A figure or table summarizing the six analytical pillars could improve clarity by indicating, for each pillar, the number of LALM-specific papers versus results transferred from other modalities.
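The per-pillar summary suggested in the last minor comment could be produced with a small tally like the sketch below. The six pillar names come from the abstract; the evidence labels `"lalm"` and `"transferred"`, the function names, and the input format are assumptions made for illustration, and no real paper counts are implied.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

# Pillar names as listed in the survey's abstract.
PILLARS = ("hallucination", "robustness", "safety",
           "privacy", "fairness", "authentication")

# Hypothetical evidence labels: demonstrated on an LALM, or carried over
# from text or vision models.
EVIDENCE_TYPES = ("lalm", "transferred")


def tally_evidence(papers: Iterable[Tuple[str, str]]) -> Dict[str, Counter]:
    """Count papers per pillar and evidence type from (pillar, evidence) pairs."""
    counts = {pillar: Counter() for pillar in PILLARS}
    for pillar, evidence in papers:
        if pillar not in counts:
            raise ValueError(f"unknown pillar: {pillar}")
        if evidence not in EVIDENCE_TYPES:
            raise ValueError(f"unknown evidence type: {evidence}")
        counts[pillar][evidence] += 1
    return counts


def format_table(counts: Dict[str, Counter]) -> str:
    """Render the tally as a fixed-width text table, one row per pillar."""
    lines = [f"{'pillar':<16}{'lalm':>6}{'transferred':>13}"]
    for pillar in PILLARS:
        row = counts[pillar]
        lines.append(f"{pillar:<16}{row['lalm']:>6}{row['transferred']:>13}")
    return "\n".join(lines)
```

A caller would pass one `(pillar, evidence)` pair per cited work, for example `tally_evidence([("safety", "lalm"), ("privacy", "transferred")])`, and print `format_table` on the result. Pillars with no entries show zeros, which makes thinly evidenced categories visible at a glance.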
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback on our survey. The comments help us better distinguish demonstrated results from extrapolated risks in this emerging area. We address each major comment below and have revised the manuscript to incorporate clarifications.
Point-by-point responses
- Referee: [Abstract] The central claim of a 'mature offensive landscape' and 'profound imbalance' between offensive and defensive capabilities rests on the assertion that continuous audio integration inherently expands the attack surface. However, the reviewed literature appears to draw primarily from LLM and vision-MLLM attacks (e.g., cross-modal jailbreaking examples) rather than providing extensive empirical demonstrations on actual unified LALM architectures such as those employing Whisper-style encoders or AudioLM decoders. This risks overstating realized LALM-specific exploits versus potential ones, which is load-bearing for the imbalance diagnosis and the call for new defenses.
Authors: We appreciate the referee's careful reading and agree that the distinction between realized LALM-specific exploits and extrapolated risks from related modalities merits clearer presentation. While the survey cites emerging LALM-specific works (including attacks on unified audio-text models and continuous acoustic interfaces), many foundational attack vectors are indeed illustrated via cross-modal analogies given the relative novelty of end-to-end LALMs. In the revised manuscript we will update the abstract and introduction to explicitly note this distinction, emphasize the inherent expansion of the attack surface due to continuous signals, and qualify the 'mature offensive landscape' claim as reflecting both demonstrated cases and rapidly transferable techniques. This adjustment preserves the imbalance diagnosis while avoiding overstatement.
Revision: yes
- Referee: [Taxonomy and six-pillar review] The taxonomy is presented as comprehensive for vulnerabilities including latent acoustic backdoors and biometric privacy leakage in end-to-end frameworks, yet it is unclear whether the cited works contain sufficient LALM-specific attack results to substantiate that defenses are 'underdeveloped' relative to a mature offensive side. If the pillar reviews rely heavily on non-audio multimodal papers, the taxonomy's claimed coverage of critical gaps requires additional LALM-targeted evidence or explicit discussion of the extrapolation.
Authors: We concur that the taxonomy's strength depends on transparent handling of evidence sources. The six-pillar review incorporates the limited but growing body of LALM-specific studies alongside cross-modal results to map the full risk landscape. To address the concern, the revised version will add an explicit subsection discussing the degree of extrapolation required for each vulnerability category (e.g., latent acoustic backdoors), cite additional recent LALM-targeted papers where available, and qualify claims about underdeveloped defenses by noting the current scarcity of audio-specific empirical evaluations. These changes will strengthen the roadmap's justification without altering the overall taxonomy structure.
Revision: yes
Circularity Check
No circularity: survey aggregates external literature
As a survey paper, the work compiles and organizes existing research on LALMs without introducing any self-derived equations, fitted parameters, or predictions. Central claims about capability escalation outpacing trustworthiness frameworks and the imbalance between offensive and defensive landscapes rest on reviewed external literature rather than reducing to the paper's own inputs by construction. The proposed taxonomy and roadmap are presented as syntheses of prior vulnerabilities (e.g., cross-modal jailbreaking) drawn from cited works, with no self-citation chains, uniqueness theorems, or ansatzes that bear the load of the conclusions. The GitHub project link is a supplementary resource, not a definitional loop. The derivation chain is therefore self-contained through aggregation of independent sources.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Standard machine learning assumptions regarding model generalization, alignment, and emergent capabilities in multimodal systems.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Passage: "We establish a systematic classification of trustworthiness challenges, identifying critical vulnerabilities including cross-modal jailbreak through acoustic cues, latent acoustic backdoors, and biometric privacy leakage. Additionally, we evaluate the landscape of current leading models through the six pillars of trustworthiness, which consist of hallucination, robustness, safety, privacy, fairness, and authentication."
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Passage: "The transition to unified end-to-end frameworks and the integration of continuous acoustic signals inherently expand the attack surface."
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Train- ing language models to follow instructions with human feed- back,
L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P . Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Rayet al., “Train- ing language models to follow instructions with human feed- back,”Advances in neural information processing systems, vol. 35, pp. 27 730–27 744, 2022
work page 2022
-
[2]
J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat et al., “Gpt-4 technical report,”arXiv preprint arXiv:2303.08774, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[3]
LLaMA: Open and Efficient Foundation Language Models
H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozi `ere, N. Goyal, E. Hambro, F. Azharet al., “Llama: Open and efficient foundation language models,”arXiv preprint arXiv:2302.13971, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[4]
J. Bai, S. Bai, Y. Chu, Z. Cui, K. Dang, X. Deng, Y. Fan, W. Ge, Y. Han, F. Huanget al., “Qwen technical report,”arXiv preprint arXiv:2309.16609, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[5]
A. Liu, B. Feng, B. Xue, B. Wang, B. Wu, C. Lu, C. Zhao, C. Deng, C. Zhang, C. Ruanet al., “Deepseek-v3 technical report,”arXiv preprint arXiv:2412.19437, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[6]
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
D. Guo, D. Yang, H. Zhang, J. Song, P . Wang, Q. Zhu, R. Xu, R. Zhang, S. Ma, X. Biet al., “Deepseek-r1: Incentivizing reason- ing capability in llms via reinforcement learning,”arXiv preprint arXiv:2501.12948, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[7]
Position: Multimodal Large Language Models Can Significantly Advance Scientific Reasoning
Y. Yan, S. Wang, J. Huo, J. Ye, Z. Chu, X. Hu, P . S. Yu, C. Gomes, B. Selman, and Q. Wen, “Position: Multimodal large language models can significantly advance scientific reasoning,”arXiv preprint arXiv:2502.02871, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[8]
Y. Yan, J. Su, J. He, F. Fu, X. Zheng, Y. Lyu, K. Wang, S. Wang, Q. Wen, and X. Hu, “A survey of mathematical reasoning in the era of multimodal large language model: Benchmark, method & challenges,” inFindings of the Association for Computational Linguistics: ACL 2025, 2025, pp. 11 798–11 827
work page 2025
-
[9]
Q. Team, “Qwen3. 5-omni technical report,”arXiv preprint arXiv:2604.15804, 2026
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[10]
Sparks of large audio models: A survey and outlook,
S. Latif, M. Shoukat, F. Shamshad, M. Usama, Y. Ren, H. Cuay ´ahuitl, W. Wang, X. Zhang, R. Togneri, E. Cambriaet al., “Sparks of large audio models: A survey and outlook,”arXiv preprint arXiv:2308.12792, 2023
-
[11]
Audio-conditioned diffusion llms for asr and deliberation pro- cessing,
M. Wang, Z. Liu, Z. Jin, G. Sun, C. Zhang, and P . C. Woodland, “Audio-conditioned diffusion llms for asr and deliberation pro- cessing,” inICASSP 2026-2026 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2026, pp. 18 467–18 471
work page 2026
-
[12]
X. Shi, X. Wang, Z. Guo, Y. Wang, P . Zhang, X. Zhang, Z. Guo, H. Hao, Y. Xi, B. Yanget al., “Qwen3-asr technical report,”arXiv preprint arXiv:2601.21337, 2026
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[13]
Audio set: An ontology and human-labeled dataset for audio events,
J. F. Gemmeke, D. P . Ellis, D. Freedman, A. Jansen, W. Lawrence, R. C. Moore, M. Plakal, and M. Ritter, “Audio set: An ontology and human-labeled dataset for audio events,” in2017 IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE, 2017, pp. 776–780
work page 2017
-
[14]
Panns: Large-scale pretrained audio neural networks for audio pattern recognition,
Q. Kong, Y. Cao, T. Iqbal, Y. Wang, W. Wang, and M. D. Plumbley, “Panns: Large-scale pretrained audio neural networks for audio pattern recognition,”IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 28, pp. 2880–2894, 2020
work page 2020
-
[15]
Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models
Y. Chu, J. Xu, X. Zhou, Q. Yang, S. Zhang, Z. Yan, C. Zhou, and J. Zhou, “Qwen-audio: Advancing universal audio understand- ing via unified large-scale audio-language models,”arXiv preprint arXiv:2311.07919, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[16]
SALMONN: Towards Generic Hearing Abilities for Large Language Models
C. Tang, W. Yu, G. Sun, X. Chen, T. Tan, W. Li, L. Lu, Z. Ma, and C. Zhang, “Salmonn: Towards generic hearing abilities for large language models,”arXiv preprint arXiv:2310.13289, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[17]
AudioPaLM: A Large Language Model That Can Speak and Listen
P . K. Rubenstein, C. Asawaroengchai, D. D. Nguyen, A. Bapna, Z. Borsos, F. d. C. Quitry, P . Chen, D. E. Badawy, W. Han, E. Kharitonovet al., “Audiopalm: A large language model that can speak and listen,”arXiv preprint arXiv:2306.12925, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[18]
Y. Chu, J. Xu, Q. Yang, H. Wei, X. Wei, Z. Guo, Y. Leng, Y. Lv, J. He, J. Linet al., “Qwen2-audio technical report,”arXiv preprint arXiv:2407.10759, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[19]
B. Wu, C. Yan, C. Hu, C. Yi, C. Feng, F. Tian, F. Shen, G. Yu, H. Zhang, J. Liet al., “Step-audio 2 technical report,”arXiv preprint arXiv:2507.16632, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[20]
Large language model safety: A holistic survey,
D. Shi, T. Shen, Y. Huang, Z. Li, Y. Leng, R. Jin, C. Liu, X. Wu, Z. Guo, L. Yuet al., “Large language model safety: A holistic survey,”arXiv preprint arXiv:2412.17686, 2024
-
[21]
K. Wang, G. Zhang, Z. Zhou, J. Wu, M. Yu, S. Zhao, C. Yin, J. Fu, Y. Yan, H. Luoet al., “A comprehensive survey in llm (-agent) full stack safety: Data, training and deployment,”arXiv preprint arXiv:2504.15585, 2025
-
[22]
A survey on trustworthy llm agents: Threats and countermeasures,
M. Yu, F. Meng, X. Zhou, S. Wang, J. Mao, L. Pan, T. Chen, K. Wang, X. Li, Y. Zhanget al., “A survey on trustworthy llm agents: Threats and countermeasures,” inProceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V . 2, 2025, pp. 6216–6226
work page 2025
-
[23]
Safety at scale: A comprehensive survey of large model and agent safety,
X. Ma, Y. Gao, Y. Wang, R. Wang, X. Wang, Y. Sun, Y. Ding, H. Xu, Y. Chen, Y. Zhaoet al., “Safety at scale: A comprehensive survey of large model and agent safety,”Foundations and Trends in Privacy and Security, vol. 8, no. 3-4, pp. 1–240, 2026
work page 2026
-
[24]
L. Lin, M. Yu, K. Luo, Y. Zhang, L. Peng, D. Wang, X. Tang, Y. Zhang, X. Yang, Z. Zhouet al., “Hidden in the noise: Unveil- ing backdoors in audio llms alignment through latent acoustic pattern triggers,”arXiv preprint arXiv:2508.02175, 2025
-
[25]
Synthetic voices, real threats: Evaluating large text-to-speech models in generating harmful audio,
G. Chen, Y. Wang, S. Ji, X. Luo, and T. Wang, “Synthetic voices, real threats: Evaluating large text-to-speech models in generating harmful audio,”arXiv preprint arXiv:2511.10913, 2025
-
[26]
Evalu- ation of audio language models for fairness, safety, and security,
R. Aloufi, S. Gupta, S. Shaw, B. Biggio, and L. Sch ¨onherr, “Evalu- ation of audio language models for fairness, safety, and security,” arXiv preprint arXiv:2603.13262, 2026
-
[27]
M. Chen, K. Wang, L. Lu, J. Zhang, and T. Zhang, “Hijacking large audio-language models via context-agnostic and impercep- tible auditory prompt injection,”arXiv preprint arXiv:2604.14604, 2026
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[28]
S. Sakshi, V . Lokegaonkar, N. Zhang, R. Duraiswami, S. Ghosh, D. Manocha, and L. Lu, “Spur: A plug-and-play framework for integrating spatial audio understanding and reasoning into large audio-language models,”arXiv preprint arXiv:2511.06606, 2025
-
[29]
Pal: Probing audio encoders via llms-audio information transfer into llms,
T. Alex, W. Suharitdamrong, S. Atito, A. Mustafa, P . J. Jack- son, I. Razzak, and M. Awais, “Pal: Probing audio encoders via llms-audio information transfer into llms,”arXiv preprint arXiv:2506.10423, 2025
-
[30]
The World is Not Mono: Enabling Spatial Understanding in Large Audio-Language Models
Y. You, L. Wei, X. Wu, and T. Qu, “The world is not mono: Enabling spatial understanding in large audio-language models,” arXiv preprint arXiv:2601.02954, 2026
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[31]
Llamapartialspoof: An llm-driven fake speech dataset simulat- ing disinformation generation,
H.-T. Luong, H. Li, L. Zhang, K. A. Lee, and E. S. Chng, “Llamapartialspoof: An llm-driven fake speech dataset simulat- ing disinformation generation,” inIEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2025, pp. 1–5
work page 2025
-
[32]
Dfallm: Achieving generalizable multitask deepfake detection by optimizing audio llm components,
Y. Li, L. Wang, Y. Wang, L. Wang, R. Cai, J. Shi, B. W. Schuller, and Z. Wu, “Dfallm: Achieving generalizable multitask deepfake detection by optimizing audio llm components,”arXiv preprint arXiv:2512.08403, 2025
-
[33]
B. Nguyen and T. Le, “Analyzing reasoning shifts in audio deepfake detection under adversarial attacks: The reasoning tax versus shield bifurcation,”arXiv preprint arXiv:2601.03615, 2026
-
[34]
A survey on speech large language models for understanding,
J. Peng, Y. Wang, B. Li, Y. Guo, H. Wang, Y. Fang, Y. Xi, H. Li, X. Li, K. Zhanget al., “A survey on speech large language models for understanding,”IEEE Journal of Selected Topics in Signal Processing, 2025
work page 2025
-
[35]
Audio-language models for audio-centric tasks: A survey,
Y. Su, J. Bai, Q. Xu, K. Xu, and Y. Dou, “Audio-language models for audio-centric tasks: A survey,”arXiv preprint arXiv:2501.15177, 2025
-
[36]
Towards holistic evaluation of large audio-language models: A comprehensive survey,
C.-K. Yang, N. S. Ho, and H.-y. Lee, “Towards holistic evaluation of large audio-language models: A comprehensive survey,” in Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, 2025, pp. 10 155–10 181
work page 2025
-
[37]
A review of speech-centric trustworthy machine learning: Privacy, safety, and fairness,
T. Feng, R. Hebbar, N. Mehlman, X. Shi, A. Kommineniet al., “A review of speech-centric trustworthy machine learning: Privacy, safety, and fairness,”arXiv preprint arXiv:2212.09006, 2022
-
[38]
Audio deepfake detection: A survey,
J. Yi, C. Wang, J. Tao, X. Zhang, C. Y. Zhang, and Y. Zhao, “Audio deepfake detection: A survey,”arXiv preprint arXiv:2308.14970, 2023
-
[39]
A survey on speech deepfake detection,
M. Li, Y. Ahmadiadli, and X.-P . Zhang, “A survey on speech deepfake detection,”ACM Computing Surveys, vol. 57, no. 7, pp. 1–38, 2025
work page 2025
-
[40]
A comprehensive survey with critical analysis for deepfake speech detection,
L. Pham, P . Lam, D. Tran, H. Tang, T. Nguyen, A. Schindler, F. Skopik, A. Polonsky, and H. C. Vu, “A comprehensive survey with critical analysis for deepfake speech detection,”Computer Science Review, vol. 57, p. 100757, 2025
work page 2025
-
[41]
Recent advances in speech language models: A survey,
W. Cui, D. Yu, X. Jiao, Z. Meng, G. Zhang, Q. Wang, S. Y. Guo, and I. King, “Recent advances in speech language models: A survey,” inProceedings of the 63rd Annual Meeting of the Association for JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2021 18 Computational Linguistics (Volume 1: Long Papers), 2025, pp. 13 943– 13 970
work page 2021
-
[42]
Speechgpt: Empowering large language models with intrinsic cross-modal conversational abilities,
D. Zhang, S. Li, X. Zhang, J. Zhan, P . Wang, Y. Zhou, and X. Qiu, “Speechgpt: Empowering large language models with intrinsic cross-modal conversational abilities,” inFindings of the Association for Computational Linguistics: EMNLP 2023, 2023, pp. 15 757–15 773
work page 2023
-
[43]
Z. Ma, R. Xu, Y. Ma, C.-H. H. Yang, B. Li, J. Kim, J. Xu, J. Li, C. Busso, K. Yuet al., “The interspeech 2026 audio reasoning chal- lenge: Evaluating reasoning process quality for audio reasoning models and agents,”arXiv preprint arXiv:2602.14224, 2026
-
[44]
Sci-phi: A large language model spatial audio descriptor,
X. Jiang, H. Gamper, and S. Braun, “Sci-phi: A large language model spatial audio descriptor,”IEEE Open Journal of Signal Processing, 2026
work page 2026
-
[45]
X. Zhao, Y. Shen, Y. Jiang, Z. Wang, J. Liu, M. H. Cheng, G. C. Oliveira, R. Desimone, D. Dwyer, and Z. Ge, “It hears, it sees too: Multi-modal llm for depression detection by integrating visual understanding into audio language models,”arXiv preprint arXiv:2511.19877, 2025
-
[46]
Wearvox: An egocentric multi- channel voice assistant benchmark for wearables,
Z. Lin, Y. Xu, K. Sun, J. Zheng, Y. Huang, S. T. Appini, K. Narang, R. Tao, I. K. Jain, S. Aroraet al., “Wearvox: An egocentric multi- channel voice assistant benchmark for wearables,”arXiv preprint arXiv:2601.02391, 2025
-
[47]
How auditory knowledge in llm backbones shapes audio language models: A holistic evaluation,
K.-H. Lu, S.-W. Fu, C.-H. H. Yang, Z. Chen, S.-F. Huang, C.-K. Yang, Y.-C. Lin, C.-Y. Hsiao, W. Ren, E.-P . Hu, Y.-H. Huang, A.-Y. Cheng, C.-H. Chiang, Y. Tsao, Y.-C. F. Wang, and H.-y. Lee, “How auditory knowledge in llm backbones shapes audio language models: A holistic evaluation,”arXiv preprint arXiv:2603.19195, 2026
-
[48]
Salm: Spatial audio language model with structured embeddings for understanding and editing,
J. Hu, Y. Cao, M. Wu, Z. Luo, and J. Yang, “Salm: Spatial audio language model with structured embeddings for understanding and editing,”arXiv preprint arXiv:2507.16724, 2025
-
[49]
Latent speech- text transformer,
Y.-J. Lu, Y. Gaur, W. Zhou, B. Muller, J. Villalba, N. Dehak, L. Zettlemoyer, G. Ghosh, M. Lewis, S. Iyeret al., “Latent speech- text transformer,”arXiv preprint arXiv:2510.06195, 2025
-
[50]
Uniaudio 2.0: A unified audio language model with text-aligned factorized audio tokenization,
D. Yang, Y. Wang, D. Chong, S. Liu, X. Wu, and H. Meng, “Uniaudio 2.0: A unified audio language model with text-aligned factorized audio tokenization,”arXiv preprint arXiv:2602.04683, 2026
-
[51]
Towards audio token compression in large audio language models,
S. Bhati, S. Thomas, H. Kuehne, R. Feris, and J. Glass, “Towards audio token compression in large audio language models,”arXiv preprint arXiv:2511.20973, 2025
-
[52]
Vowelprompt: Hearing speech emo- tions from text via vowel-level prosodic augmentation,
Y. Wang, O. Hanna, R. Xie, X. Rui, M. Shen, X. Zhang, C. Fuegen, J. Wu, D. Paul, A. Guoet al., “Vowelprompt: Hearing speech emo- tions from text via vowel-level prosodic augmentation,”arXiv preprint arXiv:2602.06270, 2026
-
[53]
Moe adapter for large audio language models: Sparsity, disentanglement, and gradient-conflict-free,
Y. Lei, S. He, J. Hu, D. Zhang, X. Luo, D. Zhu, S. Feng, R. Liu, J. He, Y. Sunet al., “Moe adapter for large audio language models: Sparsity, disentanglement, and gradient-conflict-free,” arXiv preprint arXiv:2601.02967, 2026
-
[54]
Segmentwise pruning in audio-language models,
M. Gibier, R. Duroselle, P . Serrano, O. Boeffard, and J.-F. Bonastre, “Segmentwise pruning in audio-language models,”arXiv preprint arXiv:2511.14293, 2025
-
[55]
S. BN, A. M. Sherrill, J. Alaparthi, D. Mattioli, R. I. Arriaga, C. W. Wiese, and S. Abdullah, “Fine-tuning large audio-language mod- els with lora for precise temporal localization of prolonged expo- sure therapy elements,”arXiv preprint arXiv:2506.09707, 2025
-
[56]
Chronosaudio: A comprehensive long-audio benchmark for evaluating audio-large language mod- els,
K. Luo, L. Lin, Y. Zhang, M. Aloqaily, D. Wang, Z. Zhou, J. Zhang, K. Wang, L. Sun, and Q. Wen, “Chronosaudio: A comprehensive long-audio benchmark for evaluating audio-large language mod- els,”arXiv preprint arXiv:2601.04876, 2026
-
[57]
Extending audio context for long-form understanding in large audio-language models,
Y. Chaichana, P . Taveekitworachai, W. Sirichotedumrong, P . Man- akul, and K. Pipatanakul, “Extending audio context for long-form understanding in large audio-language models,” inProceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), 2026, pp. 6046– 6066
work page 2026
-
[58]
Listening between the frames: Bridging temporal gaps in large audio-language mod- els,
H. Wang, Y. Li, S. Ma, H. Liu, and X. Wang, “Listening between the frames: Bridging temporal gaps in large audio-language mod- els,” inProceedings of the AAAI Conference on Artificial Intelligence, vol. 40, no. 31, 2026, pp. 26 233–26 241
work page 2026
-
[59]
End-to-end contrastive language-speech pretraining model for long-form spoken ques- tion answering,
J. Hu, Z. Li, B. Qi, G. Liu, and P . Wang, “End-to-end contrastive language-speech pretraining model for long-form spoken ques- tion answering,” inProceedings of the AAAI Conference on Artificial Intelligence, vol. 40, no. 37, 2026, pp. 31 041–31 049
work page 2026
-
[60]
Mimo-audio: Audio language models are few-shot learners,
D. Zhang, G. Wang, J. Xue, K. Fang, L. Zhao, R. Ma, S. Ren, S. Liu, T. Guo, W. Zhuanget al., “Mimo-audio: Audio language models are few-shot learners,”arXiv preprint arXiv:2512.23808, 2025
-
[61]
J. Wang, Z. Ma, Z. Luo, T. Wang, M. Ge, X. Wang, and L. Wang, “Pay more attention to audio: Mitigating imbalance of cross- modal attention in large audio language models,”arXiv preprint arXiv:2509.18816, 2025
-
[62]
H. He, X. Du, R. Sun, Z. Dai, Y. Xiao, M. Yang, J. Zhou, X. Li, Z. Liu, Z. Lianget al., “Measuring audio’s impact on correctness: Audio-contribution-aware post-training of large audio language models,”arXiv preprint arXiv:2509.21060, 2025
-
[63]
Alarm: Audio-language alignment for reasoning models,
P . Grinberg and H. Shahmohammadi, “Alarm: Audio-language alignment for reasoning models,”arXiv preprint arXiv:2603.09556, 2026
-
[64]
Sightsound- r1: Cross-modal reasoning distillation from vision to audio lan- guage models,
Q. Wang, X. Jiang, L. He, J. Wu, and N. Mesgarani, “Sightsound- r1: Cross-modal reasoning distillation from vision to audio lan- guage models,”arXiv preprint arXiv:2509.15661, 2025
-
[65]
Cord: Bridging the audio-text reasoning gap via weighted on-policy cross-modal distillation,
J. Hu, D. Zhu, X. Luo, D. Zhang, S. He, Y. Lei, H. Zheng, S. Feng, J. He, Y. Sunet al., “Cord: Bridging the audio-text reasoning gap via weighted on-policy cross-modal distillation,”arXiv preprint arXiv:2601.16547, 2026
-
[66]
Q. Yang, B. Zhao, Z. Kang, X. Li, Y. He, C. Liu, X. Zhang, X. Qu, J. Peng, and J. Wang, “Attention-weighted centered kernel alignment for knowledge distillation in large audio-language models applied to speech emotion recognition,” arXiv preprint arXiv:2602.01547, 2026
-
[67]
J. Zhao, C. Li, J. Zhao, R. Chen, D. Yu, M. D. Plumbley, and W. Wang, “Feedback-driven retrieval-augmented audio generation with large audio language models,” arXiv preprint arXiv:2511.01091, 2025
-
[68]
J. Shi, H. Du, Y. A. Hong, and Y. Gao, “Emo-tta: Improving test-time adaptation of audio-language models for speech emotion recognition,” arXiv preprint arXiv:2509.25495, 2025
-
[69]
S. Ji, Y. Chen, M. Fang, J. Zuo, J. Lu, H. Wang, Z. Jiang, L. Zhou, S. Liu, X. Cheng, X. Yang, Z. Wang, Q. Yang, J. Li, Y. Jiang, J. He, Y. Chu, J. Xu, and Z. Zhao, “Wavchat: A survey of spoken dialogue models,” arXiv preprint arXiv:2411.13577, 2024
-
[70]
Y. Chen and H. Yu, “From turn-taking to synchronous dialogue: A survey of full-duplex spoken language models,” arXiv preprint arXiv:2509.14515, 2025
-
[71]
X. Zhang, Y. Chen, S. Hu, X. Han, Z. Xu, Y. Xu, W. Zhao, M. Sun, and Z. Liu, “Beyond the turn-based game: Enabling real-time conversations with duplex models,” in Conference on Empirical Methods in Natural Language Processing, pp. 11 543–11 557, 2024
-
[72]
W. Yu, S. Wang, X. Yang, X. Chen, X. Tian, J. Zhang, G. Sun, L. Lu, Y. Wang, and C. Zhang, “Salmonn-omni: A codec-free llm for full-duplex speech understanding and generation,” arXiv preprint arXiv:2411.18138, 2024
-
[73]
K. Hu, E. Hosseini-Asl, C. Chen, E. Casanova, S. Ghosh, P. Żelasko, Z. Chen, J. Li, J. Balam, and B. Ginsburg, “Efficient and direct duplex modeling for speech-to-speech language model,” arXiv preprint arXiv:2505.15670, 2025
-
[74]
Z. Liu, Y. Duan, M. Wang, P. Feng, H. Zhang, X. Xing, Y. Shan, H. Zhu, Y. Dai, C. Lu, X. Qiu, L. Xie, L. Wang, N. Yan, Z. Zheng, Z. Ma, K. Yu, and X. Chen, “X-talk: On the underestimated potential of modular speech-to-speech dialogue system,” arXiv preprint arXiv:2512.18706, 2025
-
[75]
R. Yan, W. Chen, Z. Liu, Z. Ma, H. Lin, H. Wen, H. Xie, J. Wu, Y. Liang, Y. Zhao et al., “Soulx-duplug: Plug-and-play streaming state prediction module for realtime full-duplex speech conversation,” arXiv preprint arXiv:2603.14877, 2026
-
[76]
K.-W. Chang, W.-C. Chen, E.-P. Hu, H.-y. Lee, and J. Glass, “Tico: Time-controllable training for spoken dialogue models,” arXiv preprint arXiv:2603.22267, 2026
-
[77]
C.-Y. Hsiao, K.-H. Lu, Y.-K. Fu, G.-T. Lin, H.-T. Hung, and H.-y. Lee, “Aspirin: Action space projection for interactivity-optimized reinforcement learning in full-duplex speech language models,” arXiv preprint arXiv:2604.10065, 2026
-
[78]
Y. Yao, X. Li, X. Jiang, X. Fang, N. Yu, W. Ma, A. Sun, and Y. Wang, “Flm-audio: Natural monologues improves native full-duplex chatbots via dual training,” arXiv preprint arXiv:2509.02521, 2025
-
[79]
C.-M. Chien, M. Orsini, E. Kharitonov, N. Zeghidour, K. Livescu, and A. Défossez, “Moshirag: Asynchronous knowledge retrieval for full-duplex speech language models,” arXiv preprint arXiv:2604.12928, 2026
-
[80]
D. Wu, T. Zhang, Y. Li, H. Liu, C. Chen, E. S. Chng, and Y. Bengio, “The silent thought: Modeling internal cognition in full-duplex spoken dialogue models via latent reasoning,” arXiv preprint arXiv:2603.17837, 2026