arxiv: 2605.20266 · v1 · pith:Y6UP4EMJnew · submitted 2026-05-18 · 💻 cs.SD

A Survey of Large Audio Language Models: Generalization, Trustworthiness, and Outlook

Pith reviewed 2026-05-21 07:34 UTC · model grok-4.3

classification 💻 cs.SD
keywords Large Audio Language Models · Trustworthiness · Survey · Jailbreaking · Privacy · Robustness · Hallucination · Safety

The pith

Large Audio Language Models advance faster than the trustworthiness frameworks needed to secure them.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This survey traces how Large Audio Language Models extend language and multimodal systems toward auditory intelligence. It argues that unified end-to-end designs and continuous acoustic inputs enlarge the attack surface faster than protective measures can respond. The authors organize risks into a taxonomy spanning hallucination, robustness, safety, privacy, fairness, and authentication, and document a gap in which offensive techniques are mature while defenses remain underdeveloped. They end with a roadmap that calls for defense-in-depth structures, causal world modeling, and intrinsic representation engineering to close that gap.

Core claim

The escalation of LALMs' capabilities has significantly outpaced the development of systemic frameworks to ensure their trustworthiness, with a profound imbalance between a mature offensive landscape and underdeveloped defenses. The transition to unified end-to-end frameworks and the integration of continuous acoustic signals inherently expand the attack surface, as shown by vulnerabilities including cross-modal jailbreaking, latent acoustic backdoors, and biometric privacy leakage.

What carries the argument

The six-pillar taxonomy of trustworthiness risks that evaluates endogenous mechanisms and alignment algorithms across hallucination, robustness, safety, privacy, fairness, and authentication.

If this is right

  • Unified end-to-end audio models require defense strategies distinct from those developed for text or vision models.
  • Continuous acoustic signals create attack vectors such as latent backdoors that text-based methods do not address.
  • A strategic roadmap centered on defense-in-depth and causal auditory modeling can reduce the current imbalance between attacks and defenses.
  • Filling trustworthiness gaps supports safer deployment of audio-centric applications in real-world settings.
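
The defense-in-depth idea named in the roadmap can be illustrated with a minimal Python sketch. This page does not describe the survey's proposed architecture in detail, so everything below is an assumption made for illustration: the `AudioRequest` fields, the three layers, and the blocked phrase are hypothetical placeholders, not the authors' design. The only point it shows is the layering itself: a request reaches the model only if every independent check passes.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class AudioRequest:
    """Hypothetical request summary; a real system would carry waveform data."""
    speaker_verified: bool
    peak_amplitude: float  # assumed to be normalized to [0, 1]
    transcript: str


# A layer returns None to let the request through,
# or a short string explaining why it was blocked.
Layer = Callable[[AudioRequest], Optional[str]]


def authentication_layer(req: AudioRequest) -> Optional[str]:
    return None if req.speaker_verified else "speaker not verified"


def signal_layer(req: AudioRequest) -> Optional[str]:
    in_range = 0.0 <= req.peak_amplitude <= 1.0
    return None if in_range else "amplitude out of range"


def content_layer(req: AudioRequest) -> Optional[str]:
    blocked_phrases = ("ignore previous instructions",)  # illustrative only
    text = req.transcript.lower()
    if any(phrase in text for phrase in blocked_phrases):
        return "blocked phrase in transcript"
    return None


LAYERS: Sequence[Layer] = (authentication_layer, signal_layer, content_layer)


def run_layers(req: AudioRequest, layers: Sequence[Layer] = LAYERS) -> Optional[str]:
    """Return the first blocking reason, or None if every layer passes."""
    for layer in layers:
        reason = layer(req)
        if reason is not None:
            return reason
    return None
```

Because each layer is independent, an input that slips past one check (for example, a verified speaker carrying an injected instruction) can still be stopped by a later one.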

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Voice interfaces built on these models may require extra verification layers to limit manipulation through sound inputs.
  • Regulatory guidelines for audio AI could focus on accelerating defensive research rather than capability benchmarks alone.
  • Similar trustworthiness gaps may appear in other multimodal models that fuse continuous signals with language.

Load-bearing premise

That the established taxonomy comprehensively captures the main vulnerabilities in unified end-to-end LALM frameworks.

What would settle it

A documented attack, such as a cross-modal jailbreak or acoustic backdoor, that succeeds against all defenses outlined in the taxonomy and roadmap.

Figures

Figures reproduced from arXiv: 2605.20266 by Deqing Zou, Dongrui Liu, Jiaming Zhang, Jing Chen, Junhao Dong, Kai Li, Kailin Lyu, Kaiwen Luo, Kun Wang, Leo Wang, Liang Lin, Li Sun, Miao Yu, Philip S. Yu, Qiufeng Wang, Rohan Kumar Das, Sen Su, Siyuan Liang, Tianyu Shao, Ting Dang, Xia Hu, Xiaojun Jia, Xinfeng Li, Xingjun Ma, Yang Liu, Yang Xiao, Yew-Soon Ong, Yuanhe Zhang, Yu Cheng, Yueming Wu, Yu-Gang Jiang, Yuxuan Li, Zhenhong Zhou, Zhigang Zeng.

Figure 1: The Evolutionary Roadmap of LALMs from Cascaded Systems to End-to-End Causal Cognition from 2022 to 2026.
Figure 2: Architectural and Paradigmatic Evolution from Traditional Audio Models to LALMs.
Figure 3: Visualization of standard LALM with Audio-CoT. This figure pro…
Figure 4: An overview of the six key dimensions of LALM trustworthiness.
Figure 5: Cumulative Growth and Key Milestones in Trustworthy LALM Research. This chart tracks the quantitative surge in almost scholarly…
Figure 6: Conceptual taxonomy of trustworthy LALM evaluation. We…
Figure 7: The Outlook of LALM. Future research trajectories are organized along three critical dimensions: intrinsic mechanisms, multimodal safety,…

Original abstract

The foundational capabilities established by Large Language Models (LLMs) have paved the way for Multimodal Large Language Models (MLLMs), within which Large Audio Language Models (LALMs) are essential for realizing universal auditory intelligence. Despite their remarkable performance, the escalation of LALMs' capabilities has significantly outpaced the development of systemic frameworks to ensure their trustworthiness. This survey provides a comprehensive investigation into the endogenous mechanisms of LALMs, detailing the architectural innovations and alignment algorithms that facilitate emergent reasoning. Specifically, we analyze how the transition to unified end-to-end frameworks and the integration of continuous acoustic signals inherently expand the attack surface. To rigorously evaluate the risks within these paradigms, we establish a comprehensive taxonomy of trustworthiness, categorizing critical vulnerabilities such as cross-modal jailbreaking, latent acoustic backdoors, and biometric privacy leakage. We review the state-of-the-art through six analytical pillars: hallucination, robustness, safety, privacy, fairness, and authentication. The profound imbalance between a mature offensive landscape and underdeveloped defenses further validates the critical trustworthiness gaps and multidimensional risks facing audio-centric intelligence. Finally, we propose a strategic roadmap advocating for "Defense-in-Depth" architectures, causal auditory world modeling, and intrinsic representation engineering to bridge the gap between empirical performance and intrinsically trustworthy audio intelligence. Our project has been uploaded to GitHub https://github.com/Kwwwww74/Awesome-Trustworthy-AudioLLMs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper surveys Large Audio Language Models (LALMs), tracing their development from LLMs and MLLMs through architectural innovations and alignment algorithms that enable emergent reasoning. It analyzes how unified end-to-end frameworks with continuous acoustic signals expand the attack surface, establishes a taxonomy of trustworthiness vulnerabilities (cross-modal jailbreaking, latent acoustic backdoors, biometric privacy leakage), reviews the literature across six pillars (hallucination, robustness, safety, privacy, fairness, authentication), documents a profound imbalance between mature offensive capabilities and underdeveloped defenses, and outlines a roadmap for Defense-in-Depth architectures, causal auditory world modeling, and intrinsic representation engineering. A GitHub repository accompanies the survey.

Significance. A well-executed survey with this taxonomy and roadmap could become a standard reference for the emerging LALM trustworthiness literature, particularly by synthesizing cross-modal risks and advocating proactive defense strategies. The accompanying GitHub repository for literature collection is a positive step toward reproducibility in survey work.

major comments (2)
  1. [Abstract] Abstract: The central claim of a 'mature offensive landscape' and 'profound imbalance' between offensive and defensive capabilities rests on the assertion that continuous audio integration inherently expands the attack surface. However, the reviewed literature appears to draw primarily from LLM and vision-MLLM attacks (e.g., cross-modal jailbreaking examples) rather than providing extensive empirical demonstrations on actual unified LALM architectures such as those employing Whisper-style encoders or AudioLM decoders. This risks overstating realized LALM-specific exploits versus potential ones, which is load-bearing for the imbalance diagnosis and the call for new defenses.
  2. [Taxonomy and six-pillar review] Taxonomy and six-pillar review sections: The taxonomy is presented as comprehensive for vulnerabilities including latent acoustic backdoors and biometric privacy leakage in end-to-end frameworks, yet it is unclear whether the cited works contain sufficient LALM-specific attack results to substantiate that defenses are 'underdeveloped' relative to a mature offensive side. If the pillar reviews rely heavily on non-audio multimodal papers, the taxonomy's claimed coverage of critical gaps requires additional LALM-targeted evidence or explicit discussion of the extrapolation.
minor comments (2)
  1. [Introduction or Taxonomy] The manuscript would benefit from a dedicated subsection explicitly contrasting LALM-specific attack demonstrations with those transferred from LLMs/MLLMs to avoid reader confusion about the scope of the 'mature offensive landscape' claim.
  2. [Review sections] A figure or table summarizing the six analytical pillars could improve clarity by indicating, for each pillar, the number of LALM-specific papers versus transferred results.
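
The per-pillar summary requested in minor comment 2 could be produced with a small tally like the Python sketch below. It is only an illustration: the `CitedWork` record, the `lalm_specific` flag, and the example entries are hypothetical placeholders, not data extracted from the survey, and deciding whether a cited work counts as LALM-specific would still require reading each paper.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, Iterable

# The six pillars named in the survey's abstract.
PILLARS = ("hallucination", "robustness", "safety",
           "privacy", "fairness", "authentication")


@dataclass(frozen=True)
class CitedWork:
    """Hypothetical record for one cited paper."""
    label: str
    pillar: str
    lalm_specific: bool  # True if evaluated on an audio-language model


def tally_by_pillar(works: Iterable[CitedWork]) -> Dict[str, Counter]:
    """Count LALM-specific versus transferred works for each pillar."""
    counts = {pillar: Counter() for pillar in PILLARS}
    for work in works:
        if work.pillar not in counts:
            raise ValueError(f"unknown pillar: {work.pillar!r}")
        kind = "lalm_specific" if work.lalm_specific else "transferred"
        counts[work.pillar][kind] += 1
    return counts


def format_table(counts: Dict[str, Counter]) -> str:
    """Render the tally as a fixed-width plain-text table."""
    lines = [f"{'pillar':<16}{'LALM-specific':>14}{'transferred':>13}"]
    for pillar in PILLARS:
        c = counts[pillar]
        lines.append(
            f"{pillar:<16}{c['lalm_specific']:>14}{c['transferred']:>13}"
        )
    return "\n".join(lines)


# Placeholder entries only; these are not real citation counts.
EXAMPLE_WORKS = [
    CitedWork("placeholder-1", "safety", True),
    CitedWork("placeholder-2", "safety", False),
    CitedWork("placeholder-3", "privacy", False),
]

if __name__ == "__main__":
    print(format_table(tally_by_pillar(EXAMPLE_WORKS)))
```

A table of this shape would make the degree of extrapolation visible at a glance, which is the concern behind both major comments.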

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our survey. The comments help us better distinguish demonstrated results from extrapolated risks in this emerging area. We address each major comment below and have revised the manuscript to incorporate clarifications.

Point-by-point responses
  1. Referee: [Abstract] Abstract: The central claim of a 'mature offensive landscape' and 'profound imbalance' between offensive and defensive capabilities rests on the assertion that continuous audio integration inherently expands the attack surface. However, the reviewed literature appears to draw primarily from LLM and vision-MLLM attacks (e.g., cross-modal jailbreaking examples) rather than providing extensive empirical demonstrations on actual unified LALM architectures such as those employing Whisper-style encoders or AudioLM decoders. This risks overstating realized LALM-specific exploits versus potential ones, which is load-bearing for the imbalance diagnosis and the call for new defenses.

    Authors: We appreciate the referee's careful reading and agree that the distinction between realized LALM-specific exploits and extrapolated risks from related modalities merits clearer presentation. While the survey cites emerging LALM-specific works (including attacks on unified audio-text models and continuous acoustic interfaces), many foundational attack vectors are indeed illustrated via cross-modal analogies given the relative novelty of end-to-end LALMs. In the revised manuscript we will update the abstract and introduction to explicitly note this distinction, emphasize the inherent expansion of the attack surface due to continuous signals, and qualify the 'mature offensive landscape' claim as reflecting both demonstrated cases and rapidly transferable techniques. This adjustment preserves the imbalance diagnosis while avoiding overstatement.

    Revision: yes

  2. Referee: [Taxonomy and six-pillar review] Taxonomy and six-pillar review sections: The taxonomy is presented as comprehensive for vulnerabilities including latent acoustic backdoors and biometric privacy leakage in end-to-end frameworks, yet it is unclear whether the cited works contain sufficient LALM-specific attack results to substantiate that defenses are 'underdeveloped' relative to a mature offensive side. If the pillar reviews rely heavily on non-audio multimodal papers, the taxonomy's claimed coverage of critical gaps requires additional LALM-targeted evidence or explicit discussion of the extrapolation.

    Authors: We concur that the taxonomy's strength depends on transparent handling of evidence sources. The six-pillar review incorporates the limited but growing body of LALM-specific studies alongside cross-modal results to map the full risk landscape. To address the concern, the revised version will add an explicit subsection discussing the degree of extrapolation required for each vulnerability category (e.g., latent acoustic backdoors), cite additional recent LALM-targeted papers where available, and qualify claims about underdeveloped defenses by noting the current scarcity of audio-specific empirical evaluations. These changes will strengthen the roadmap's justification without altering the overall taxonomy structure.

    Revision: yes

Circularity Check

0 steps flagged

No circularity: survey aggregates external literature

As a survey paper, the work compiles and organizes existing research on LALMs without introducing any self-derived equations, fitted parameters, or predictions. Central claims about capability escalation outpacing trustworthiness frameworks and the imbalance between offensive and defensive landscapes rest on reviewed external literature rather than reducing to the paper's own inputs by construction. The proposed taxonomy and roadmap are presented as syntheses of prior vulnerabilities (e.g., cross-modal jailbreaking) drawn from cited works, with no self-citation chains, uniqueness theorems, or ansatzes that bear the load of the conclusions. The GitHub project link is a supplementary resource, not a definitional loop. The derivation chain is therefore self-contained through aggregation of independent sources.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper is a survey and relies on cited prior literature for its analysis of architectures, risks, and proposals. It does not introduce new fitted parameters or postulated entities.

axioms (1)
  • Domain assumption: Standard machine learning assumptions regarding model generalization, alignment, and emergent capabilities in multimodal systems.
    Invoked when discussing architectural innovations, alignment algorithms, and the expansion of attack surfaces in end-to-end frameworks.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel
    Relation between the paper passage and the cited Recognition theorem: unclear

    We establish a systematic classification of trustworthiness challenges, identifying critical vulnerabilities including cross-modal jailbreak through acoustic cues, latent acoustic backdoors, and biometric privacy leakage. Additionally, we evaluate the landscape of current leading models through the six pillars of trustworthiness, which consist of hallucination, robustness, safety, privacy, fairness, and authentication.

  • IndisputableMonolith/Foundation/RealityFromDistinction.lean, reality_from_one_distinction
    Relation between the paper passage and the cited Recognition theorem: unclear

    The transition to unified end-to-end frameworks and the integration of continuous acoustic signals inherently expand the attack surface.

What do these tags mean?

  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

212 extracted references · 212 canonical work pages · 36 internal anchors

  1. [1]

    Train- ing language models to follow instructions with human feed- back,

    L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P . Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Rayet al., “Train- ing language models to follow instructions with human feed- back,”Advances in neural information processing systems, vol. 35, pp. 27 730–27 744, 2022

  2. [2]

    GPT-4 Technical Report

    J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat et al., “Gpt-4 technical report,”arXiv preprint arXiv:2303.08774, 2023

  3. [3]

    LLaMA: Open and Efficient Foundation Language Models

    H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozi `ere, N. Goyal, E. Hambro, F. Azharet al., “Llama: Open and efficient foundation language models,”arXiv preprint arXiv:2302.13971, 2023

  4. [4]

    Qwen Technical Report

    J. Bai, S. Bai, Y. Chu, Z. Cui, K. Dang, X. Deng, Y. Fan, W. Ge, Y. Han, F. Huanget al., “Qwen technical report,”arXiv preprint arXiv:2309.16609, 2023

  5. [5]

    DeepSeek-V3 Technical Report

    A. Liu, B. Feng, B. Xue, B. Wang, B. Wu, C. Lu, C. Zhao, C. Deng, C. Zhang, C. Ruanet al., “Deepseek-v3 technical report,”arXiv preprint arXiv:2412.19437, 2024

  6. [6]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    D. Guo, D. Yang, H. Zhang, J. Song, P . Wang, Q. Zhu, R. Xu, R. Zhang, S. Ma, X. Biet al., “Deepseek-r1: Incentivizing reason- ing capability in llms via reinforcement learning,”arXiv preprint arXiv:2501.12948, 2025

  7. [7]

    Position: Multimodal Large Language Models Can Significantly Advance Scientific Reasoning

    Y. Yan, S. Wang, J. Huo, J. Ye, Z. Chu, X. Hu, P . S. Yu, C. Gomes, B. Selman, and Q. Wen, “Position: Multimodal large language models can significantly advance scientific reasoning,”arXiv preprint arXiv:2502.02871, 2025

  8. [8]

    A survey of mathematical reasoning in the era of multimodal large language model: Benchmark, method & challenges,

    Y. Yan, J. Su, J. He, F. Fu, X. Zheng, Y. Lyu, K. Wang, S. Wang, Q. Wen, and X. Hu, “A survey of mathematical reasoning in the era of multimodal large language model: Benchmark, method & challenges,” inFindings of the Association for Computational Linguistics: ACL 2025, 2025, pp. 11 798–11 827

  9. [9]

    Qwen3.5-Omni Technical Report

    Q. Team, “Qwen3. 5-omni technical report,”arXiv preprint arXiv:2604.15804, 2026

  10. [10]

    Sparks of large audio models: A survey and outlook,

    S. Latif, M. Shoukat, F. Shamshad, M. Usama, Y. Ren, H. Cuay ´ahuitl, W. Wang, X. Zhang, R. Togneri, E. Cambriaet al., “Sparks of large audio models: A survey and outlook,”arXiv preprint arXiv:2308.12792, 2023

  11. [11]

    Audio-conditioned diffusion llms for asr and deliberation pro- cessing,

    M. Wang, Z. Liu, Z. Jin, G. Sun, C. Zhang, and P . C. Woodland, “Audio-conditioned diffusion llms for asr and deliberation pro- cessing,” inICASSP 2026-2026 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2026, pp. 18 467–18 471

  12. [12]

    Qwen3-ASR Technical Report

    X. Shi, X. Wang, Z. Guo, Y. Wang, P . Zhang, X. Zhang, Z. Guo, H. Hao, Y. Xi, B. Yanget al., “Qwen3-asr technical report,”arXiv preprint arXiv:2601.21337, 2026

  13. [13]

    Audio set: An ontology and human-labeled dataset for audio events,

    J. F. Gemmeke, D. P . Ellis, D. Freedman, A. Jansen, W. Lawrence, R. C. Moore, M. Plakal, and M. Ritter, “Audio set: An ontology and human-labeled dataset for audio events,” in2017 IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE, 2017, pp. 776–780

  14. [14]

    Panns: Large-scale pretrained audio neural networks for audio pattern recognition,

    Q. Kong, Y. Cao, T. Iqbal, Y. Wang, W. Wang, and M. D. Plumbley, “Panns: Large-scale pretrained audio neural networks for audio pattern recognition,”IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 28, pp. 2880–2894, 2020

  15. [15]

    Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models

    Y. Chu, J. Xu, X. Zhou, Q. Yang, S. Zhang, Z. Yan, C. Zhou, and J. Zhou, “Qwen-audio: Advancing universal audio understand- ing via unified large-scale audio-language models,”arXiv preprint arXiv:2311.07919, 2023

  16. [16]

    SALMONN: Towards Generic Hearing Abilities for Large Language Models

    C. Tang, W. Yu, G. Sun, X. Chen, T. Tan, W. Li, L. Lu, Z. Ma, and C. Zhang, “Salmonn: Towards generic hearing abilities for large language models,”arXiv preprint arXiv:2310.13289, 2023

  17. [17]

    AudioPaLM: A Large Language Model That Can Speak and Listen

    P . K. Rubenstein, C. Asawaroengchai, D. D. Nguyen, A. Bapna, Z. Borsos, F. d. C. Quitry, P . Chen, D. E. Badawy, W. Han, E. Kharitonovet al., “Audiopalm: A large language model that can speak and listen,”arXiv preprint arXiv:2306.12925, 2023

  18. [18]

    Qwen2-Audio Technical Report

    Y. Chu, J. Xu, Q. Yang, H. Wei, X. Wei, Z. Guo, Y. Leng, Y. Lv, J. He, J. Linet al., “Qwen2-audio technical report,”arXiv preprint arXiv:2407.10759, 2024

  19. [19]

    Step-Audio 2 Technical Report

    B. Wu, C. Yan, C. Hu, C. Yi, C. Feng, F. Tian, F. Shen, G. Yu, H. Zhang, J. Liet al., “Step-audio 2 technical report,”arXiv preprint arXiv:2507.16632, 2025

  20. [20]

    Large language model safety: A holistic survey,

    D. Shi, T. Shen, Y. Huang, Z. Li, Y. Leng, R. Jin, C. Liu, X. Wu, Z. Guo, L. Yuet al., “Large language model safety: A holistic survey,”arXiv preprint arXiv:2412.17686, 2024

  21. [21]

    A comprehensive survey in llm (-agent) full stack safety: Data, training and deployment.arXiv preprint arXiv:2504.15585, 2025

    K. Wang, G. Zhang, Z. Zhou, J. Wu, M. Yu, S. Zhao, C. Yin, J. Fu, Y. Yan, H. Luoet al., “A comprehensive survey in llm (-agent) full stack safety: Data, training and deployment,”arXiv preprint arXiv:2504.15585, 2025

  22. [22]

    A survey on trustworthy llm agents: Threats and countermeasures,

    M. Yu, F. Meng, X. Zhou, S. Wang, J. Mao, L. Pan, T. Chen, K. Wang, X. Li, Y. Zhanget al., “A survey on trustworthy llm agents: Threats and countermeasures,” inProceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V . 2, 2025, pp. 6216–6226

  23. [23]

    Safety at scale: A comprehensive survey of large model and agent safety,

    X. Ma, Y. Gao, Y. Wang, R. Wang, X. Wang, Y. Sun, Y. Ding, H. Xu, Y. Chen, Y. Zhaoet al., “Safety at scale: A comprehensive survey of large model and agent safety,”Foundations and Trends in Privacy and Security, vol. 8, no. 3-4, pp. 1–240, 2026

  24. [24]

    Hidden in the noise: Unveil- ing backdoors in audio llms alignment through latent acoustic pattern triggers,

    L. Lin, M. Yu, K. Luo, Y. Zhang, L. Peng, D. Wang, X. Tang, Y. Zhang, X. Yang, Z. Zhouet al., “Hidden in the noise: Unveil- ing backdoors in audio llms alignment through latent acoustic pattern triggers,”arXiv preprint arXiv:2508.02175, 2025

  25. [25]

    Synthetic voices, real threats: Evaluating large text-to-speech models in generating harmful audio,

    G. Chen, Y. Wang, S. Ji, X. Luo, and T. Wang, “Synthetic voices, real threats: Evaluating large text-to-speech models in generating harmful audio,”arXiv preprint arXiv:2511.10913, 2025

  26. [26]

    Evalu- ation of audio language models for fairness, safety, and security,

    R. Aloufi, S. Gupta, S. Shaw, B. Biggio, and L. Sch ¨onherr, “Evalu- ation of audio language models for fairness, safety, and security,” arXiv preprint arXiv:2603.13262, 2026

  27. [27]

    Hijacking Large Audio-Language Models via Context-Agnostic and Imperceptible Auditory Prompt Injection

    M. Chen, K. Wang, L. Lu, J. Zhang, and T. Zhang, “Hijacking large audio-language models via context-agnostic and impercep- tible auditory prompt injection,”arXiv preprint arXiv:2604.14604, 2026

  28. [28]

    Spur: A plug-and-play framework for integrating spatial audio understanding and reasoning into large audio-language models,

    S. Sakshi, V . Lokegaonkar, N. Zhang, R. Duraiswami, S. Ghosh, D. Manocha, and L. Lu, “Spur: A plug-and-play framework for integrating spatial audio understanding and reasoning into large audio-language models,”arXiv preprint arXiv:2511.06606, 2025

  29. [29]

    Pal: Probing audio encoders via llms-audio information transfer into llms,

    T. Alex, W. Suharitdamrong, S. Atito, A. Mustafa, P . J. Jack- son, I. Razzak, and M. Awais, “Pal: Probing audio encoders via llms-audio information transfer into llms,”arXiv preprint arXiv:2506.10423, 2025

  30. [30]

    The World is Not Mono: Enabling Spatial Understanding in Large Audio-Language Models

    Y. You, L. Wei, X. Wu, and T. Qu, “The world is not mono: Enabling spatial understanding in large audio-language models,” arXiv preprint arXiv:2601.02954, 2026

  31. [31]

    Llamapartialspoof: An llm-driven fake speech dataset simulat- ing disinformation generation,

    H.-T. Luong, H. Li, L. Zhang, K. A. Lee, and E. S. Chng, “Llamapartialspoof: An llm-driven fake speech dataset simulat- ing disinformation generation,” inIEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2025, pp. 1–5

  32. [32]

    Dfallm: Achieving generalizable multitask deepfake detection by optimizing audio llm components,

    Y. Li, L. Wang, Y. Wang, L. Wang, R. Cai, J. Shi, B. W. Schuller, and Z. Wu, “Dfallm: Achieving generalizable multitask deepfake detection by optimizing audio llm components,”arXiv preprint arXiv:2512.08403, 2025

  33. [33]

    Analyzing reasoning shifts in audio deepfake detection under adversarial attacks: The reasoning tax versus shield bifurcation,

    B. Nguyen and T. Le, “Analyzing reasoning shifts in audio deepfake detection under adversarial attacks: The reasoning tax versus shield bifurcation,”arXiv preprint arXiv:2601.03615, 2026

  34. [34]

    A survey on speech large language models for understanding,

    J. Peng, Y. Wang, B. Li, Y. Guo, H. Wang, Y. Fang, Y. Xi, H. Li, X. Li, K. Zhanget al., “A survey on speech large language models for understanding,”IEEE Journal of Selected Topics in Signal Processing, 2025

  35. [35]

    Audio-language models for audio-centric tasks: A survey,

    Y. Su, J. Bai, Q. Xu, K. Xu, and Y. Dou, “Audio-language models for audio-centric tasks: A survey,”arXiv preprint arXiv:2501.15177, 2025

  36. [36]

    Towards holistic evaluation of large audio-language models: A comprehensive survey,

    C.-K. Yang, N. S. Ho, and H.-y. Lee, “Towards holistic evaluation of large audio-language models: A comprehensive survey,” in Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, 2025, pp. 10 155–10 181

  37. [37]

    A review of speech-centric trustworthy machine learning: Privacy, safety, and fairness,

    T. Feng, R. Hebbar, N. Mehlman, X. Shi, A. Kommineniet al., “A review of speech-centric trustworthy machine learning: Privacy, safety, and fairness,”arXiv preprint arXiv:2212.09006, 2022

  38. [38]

    Audio deepfake detection: A survey,

    J. Yi, C. Wang, J. Tao, X. Zhang, C. Y. Zhang, and Y. Zhao, “Audio deepfake detection: A survey,”arXiv preprint arXiv:2308.14970, 2023

  39. [39]

    A survey on speech deepfake detection,

    M. Li, Y. Ahmadiadli, and X.-P . Zhang, “A survey on speech deepfake detection,”ACM Computing Surveys, vol. 57, no. 7, pp. 1–38, 2025

  40. [40]

    A comprehensive survey with critical analysis for deepfake speech detection,

    L. Pham, P . Lam, D. Tran, H. Tang, T. Nguyen, A. Schindler, F. Skopik, A. Polonsky, and H. C. Vu, “A comprehensive survey with critical analysis for deepfake speech detection,”Computer Science Review, vol. 57, p. 100757, 2025

  41. [41]

    Recent advances in speech language models: A survey,

    W. Cui, D. Yu, X. Jiao, Z. Meng, G. Zhang, Q. Wang, S. Y. Guo, and I. King, “Recent advances in speech language models: A survey,” inProceedings of the 63rd Annual Meeting of the Association for JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2021 18 Computational Linguistics (Volume 1: Long Papers), 2025, pp. 13 943– 13 970

  42. [42]

    Speechgpt: Empowering large language models with intrinsic cross-modal conversational abilities,

    D. Zhang, S. Li, X. Zhang, J. Zhan, P . Wang, Y. Zhou, and X. Qiu, “Speechgpt: Empowering large language models with intrinsic cross-modal conversational abilities,” inFindings of the Association for Computational Linguistics: EMNLP 2023, 2023, pp. 15 757–15 773

  43. [43]

    The interspeech 2026 audio reasoning chal- lenge: Evaluating reasoning process quality for audio reasoning models and agents,

    Z. Ma, R. Xu, Y. Ma, C.-H. H. Yang, B. Li, J. Kim, J. Xu, J. Li, C. Busso, K. Yuet al., “The interspeech 2026 audio reasoning chal- lenge: Evaluating reasoning process quality for audio reasoning models and agents,”arXiv preprint arXiv:2602.14224, 2026

  44. [44]

    Sci-phi: A large language model spatial audio descriptor,

    X. Jiang, H. Gamper, and S. Braun, “Sci-phi: A large language model spatial audio descriptor,”IEEE Open Journal of Signal Processing, 2026

  45. [45]

    It hears, it sees too: Multi-modal llm for depression detection by integrating visual understanding into audio language models,

    X. Zhao, Y. Shen, Y. Jiang, Z. Wang, J. Liu, M. H. Cheng, G. C. Oliveira, R. Desimone, D. Dwyer, and Z. Ge, “It hears, it sees too: Multi-modal llm for depression detection by integrating visual understanding into audio language models,”arXiv preprint arXiv:2511.19877, 2025

  46. [46]

    Wearvox: An egocentric multi- channel voice assistant benchmark for wearables,

    Z. Lin, Y. Xu, K. Sun, J. Zheng, Y. Huang, S. T. Appini, K. Narang, R. Tao, I. K. Jain, S. Aroraet al., “Wearvox: An egocentric multi- channel voice assistant benchmark for wearables,”arXiv preprint arXiv:2601.02391, 2025

  47. [47]

    How auditory knowledge in llm backbones shapes audio language models: A holistic evaluation,

    K.-H. Lu, S.-W. Fu, C.-H. H. Yang, Z. Chen, S.-F. Huang, C.-K. Yang, Y.-C. Lin, C.-Y. Hsiao, W. Ren, E.-P . Hu, Y.-H. Huang, A.-Y. Cheng, C.-H. Chiang, Y. Tsao, Y.-C. F. Wang, and H.-y. Lee, “How auditory knowledge in llm backbones shapes audio language models: A holistic evaluation,”arXiv preprint arXiv:2603.19195, 2026

  48. [48]

    Salm: Spatial audio language model with structured embeddings for understanding and editing,

    J. Hu, Y. Cao, M. Wu, Z. Luo, and J. Yang, “Salm: Spatial audio language model with structured embeddings for understanding and editing,”arXiv preprint arXiv:2507.16724, 2025

  49. [49]

    Latent speech- text transformer,

    Y.-J. Lu, Y. Gaur, W. Zhou, B. Muller, J. Villalba, N. Dehak, L. Zettlemoyer, G. Ghosh, M. Lewis, S. Iyeret al., “Latent speech- text transformer,”arXiv preprint arXiv:2510.06195, 2025

  50. [50]

    Uniaudio 2.0: A unified audio language model with text-aligned factorized audio tokenization,

    D. Yang, Y. Wang, D. Chong, S. Liu, X. Wu, and H. Meng, “Uniaudio 2.0: A unified audio language model with text-aligned factorized audio tokenization,” arXiv preprint arXiv:2602.04683, 2026

  51. [51]

    Towards audio token compression in large audio language models,

    S. Bhati, S. Thomas, H. Kuehne, R. Feris, and J. Glass, “Towards audio token compression in large audio language models,” arXiv preprint arXiv:2511.20973, 2025

  52. [52]

    Vowelprompt: Hearing speech emotions from text via vowel-level prosodic augmentation,

    Y. Wang, O. Hanna, R. Xie, X. Rui, M. Shen, X. Zhang, C. Fuegen, J. Wu, D. Paul, A. Guo et al., “Vowelprompt: Hearing speech emotions from text via vowel-level prosodic augmentation,” arXiv preprint arXiv:2602.06270, 2026

  53. [53]

    Moe adapter for large audio language models: Sparsity, disentanglement, and gradient-conflict-free,

    Y. Lei, S. He, J. Hu, D. Zhang, X. Luo, D. Zhu, S. Feng, R. Liu, J. He, Y. Sun et al., “Moe adapter for large audio language models: Sparsity, disentanglement, and gradient-conflict-free,” arXiv preprint arXiv:2601.02967, 2026

  54. [54]

    Segmentwise pruning in audio-language models,

    M. Gibier, R. Duroselle, P. Serrano, O. Boeffard, and J.-F. Bonastre, “Segmentwise pruning in audio-language models,” arXiv preprint arXiv:2511.14293, 2025

  55. [55]

    Fine-tuning large audio-language models with lora for precise temporal localization of prolonged exposure therapy elements,

    S. BN, A. M. Sherrill, J. Alaparthi, D. Mattioli, R. I. Arriaga, C. W. Wiese, and S. Abdullah, “Fine-tuning large audio-language models with lora for precise temporal localization of prolonged exposure therapy elements,” arXiv preprint arXiv:2506.09707, 2025

  56. [56]

    Chronosaudio: A comprehensive long-audio benchmark for evaluating audio-large language models,

    K. Luo, L. Lin, Y. Zhang, M. Aloqaily, D. Wang, Z. Zhou, J. Zhang, K. Wang, L. Sun, and Q. Wen, “Chronosaudio: A comprehensive long-audio benchmark for evaluating audio-large language models,” arXiv preprint arXiv:2601.04876, 2026

  57. [57]

    Extending audio context for long-form understanding in large audio-language models,

    Y. Chaichana, P. Taveekitworachai, W. Sirichotedumrong, P. Manakul, and K. Pipatanakul, “Extending audio context for long-form understanding in large audio-language models,” in Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), 2026, pp. 6046–6066

  58. [58]

    Listening between the frames: Bridging temporal gaps in large audio-language models,

    H. Wang, Y. Li, S. Ma, H. Liu, and X. Wang, “Listening between the frames: Bridging temporal gaps in large audio-language models,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 40, no. 31, 2026, pp. 26 233–26 241

  59. [59]

    End-to-end contrastive language-speech pretraining model for long-form spoken question answering,

    J. Hu, Z. Li, B. Qi, G. Liu, and P. Wang, “End-to-end contrastive language-speech pretraining model for long-form spoken question answering,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 40, no. 37, 2026, pp. 31 041–31 049

  60. [60]

    Mimo-audio: Audio language models are few-shot learners,

    D. Zhang, G. Wang, J. Xue, K. Fang, L. Zhao, R. Ma, S. Ren, S. Liu, T. Guo, W. Zhuang et al., “Mimo-audio: Audio language models are few-shot learners,” arXiv preprint arXiv:2512.23808, 2025

  61. [61]

    Pay more attention to audio: Mitigating imbalance of cross-modal attention in large audio language models,

    J. Wang, Z. Ma, Z. Luo, T. Wang, M. Ge, X. Wang, and L. Wang, “Pay more attention to audio: Mitigating imbalance of cross-modal attention in large audio language models,” arXiv preprint arXiv:2509.18816, 2025

  62. [62]

    Measuring audio’s impact on correctness: Audio-contribution-aware post-training of large audio language models,

    H. He, X. Du, R. Sun, Z. Dai, Y. Xiao, M. Yang, J. Zhou, X. Li, Z. Liu, Z. Liang et al., “Measuring audio’s impact on correctness: Audio-contribution-aware post-training of large audio language models,” arXiv preprint arXiv:2509.21060, 2025

  63. [63]

    Alarm: Audio-language alignment for reasoning models,

    P. Grinberg and H. Shahmohammadi, “Alarm: Audio-language alignment for reasoning models,” arXiv preprint arXiv:2603.09556, 2026

  64. [64]

    Sightsound-r1: Cross-modal reasoning distillation from vision to audio language models,

    Q. Wang, X. Jiang, L. He, J. Wu, and N. Mesgarani, “Sightsound-r1: Cross-modal reasoning distillation from vision to audio language models,” arXiv preprint arXiv:2509.15661, 2025

  65. [65]

    Cord: Bridging the audio-text reasoning gap via weighted on-policy cross-modal distillation,

    J. Hu, D. Zhu, X. Luo, D. Zhang, S. He, Y. Lei, H. Zheng, S. Feng, J. He, Y. Sun et al., “Cord: Bridging the audio-text reasoning gap via weighted on-policy cross-modal distillation,” arXiv preprint arXiv:2601.16547, 2026

  66. [66]

    Attention-weighted centered kernel alignment for knowledge distillation in large audio-language models applied to speech emotion recognition,

    Q. Yang, B. Zhao, Z. Kang, X. Li, Y. He, C. Liu, X. Zhang, X. Qu, J. Peng, and J. Wang, “Attention-weighted centered kernel alignment for knowledge distillation in large audio-language models applied to speech emotion recognition,” arXiv preprint arXiv:2602.01547, 2026

  67. [67]

    Feedback-driven retrieval-augmented audio generation with large audio language models,

    J. Zhao, C. Li, J. Zhao, R. Chen, D. Yu, M. D. Plumbley, and W. Wang, “Feedback-driven retrieval-augmented audio generation with large audio language models,” arXiv preprint arXiv:2511.01091, 2025

  68. [68]

    Emo-tta: Improving test-time adaptation of audio-language models for speech emotion recognition,

    J. Shi, H. Du, Y. A. Hong, and Y. Gao, “Emo-tta: Improving test-time adaptation of audio-language models for speech emotion recognition,” arXiv preprint arXiv:2509.25495, 2025

  69. [69]

    Wavchat: A survey of spoken dialogue models,

    S. Ji, Y. Chen, M. Fang, J. Zuo, J. Lu, H. Wang, Z. Jiang, L. Zhou, S. Liu, X. Cheng, X. Yang, Z. Wang, Q. Yang, J. Li, Y. Jiang, J. He, Y. Chu, J. Xu, and Z. Zhao, “Wavchat: A survey of spoken dialogue models,” arXiv preprint arXiv:2411.13577, 2024

  70. [70]

    From turn-taking to synchronous dialogue: A survey of full-duplex spoken language models,

    Y. Chen and H. Yu, “From turn-taking to synchronous dialogue: A survey of full-duplex spoken language models,” arXiv preprint arXiv:2509.14515, 2025

  71. [71]

    Beyond the turn-based game: Enabling real-time conversations with duplex models,

    X. Zhang, Y. Chen, S. Hu, X. Han, Z. Xu, Y. Xu, W. Zhao, M. Sun, and Z. Liu, “Beyond the turn-based game: Enabling real-time conversations with duplex models,” Conference on Empirical Methods in Natural Language Processing, pp. 11 543–11 557, 2024

  72. [72]

    Salmonn-omni: A codec-free llm for full-duplex speech understanding and generation,

    W. Yu, S. Wang, X. Yang, X. Chen, X. Tian, J. Zhang, G. Sun, L. Lu, Y. Wang, and C. Zhang, “Salmonn-omni: A codec-free llm for full-duplex speech understanding and generation,” arXiv preprint arXiv:2411.18138, 2024

  73. [73]

    Efficient and direct duplex modeling for speech-to-speech language model,

    K. Hu, E. Hosseini-Asl, C. Chen, E. Casanova, S. Ghosh, P. Żelasko, Z. Chen, J. Li, J. Balam, and B. Ginsburg, “Efficient and direct duplex modeling for speech-to-speech language model,” arXiv preprint arXiv:2505.15670, 2025

  74. [74]

    X-talk: On the underestimated potential of modular speech-to-speech dialogue system,

    Z. Liu, Y. Duan, M. Wang, P. Feng, H. Zhang, X. Xing, Y. Shan, H. Zhu, Y. Dai, C. Lu, X. Qiu, L. Xie, L. Wang, N. Yan, Z. Zheng, Z. Ma, K. Yu, and X. Chen, “X-talk: On the underestimated potential of modular speech-to-speech dialogue system,” arXiv preprint arXiv:2512.18706, 2025

  75. [75]

    Soulx-duplug: Plug-and-play streaming state prediction module for realtime full-duplex speech conversation,

    R. Yan, W. Chen, Z. Liu, Z. Ma, H. Lin, H. Wen, H. Xie, J. Wu, Y. Liang, Y. Zhao et al., “Soulx-duplug: Plug-and-play streaming state prediction module for realtime full-duplex speech conversation,” arXiv preprint arXiv:2603.14877, 2026

  76. [76]

    TiCo: Time-Controllable Spoken Dialogue Model

    K.-W. Chang, W.-C. Chen, E.-P. Hu, H.-y. Lee, and J. Glass, “Tico: Time-controllable training for spoken dialogue models,” arXiv preprint arXiv:2603.22267, 2026

  77. [77]

    ASPIRin: Action Space Projection for Interactivity-Optimized Reinforcement Learning in Full-Duplex Speech Language Models

    C.-Y. Hsiao, K.-H. Lu, Y.-K. Fu, G.-T. Lin, H.-T. Hung, and H.-y. Lee, “Aspirin: Action space projection for interactivity-optimized reinforcement learning in full-duplex speech language models,” arXiv preprint arXiv:2604.10065, 2026

  78. [78]

    Flm-audio: Natural monologues improves native full-duplex chatbots via dual training,

    Y. Yao, X. Li, X. Jiang, X. Fang, N. Yu, W. Ma, A. Sun, and Y. Wang, “Flm-audio: Natural monologues improves native full-duplex chatbots via dual training,” arXiv preprint arXiv:2509.02521, 2025

  79. [79]

    MoshiRAG: Asynchronous Knowledge Retrieval for Full-Duplex Speech Language Models

    C.-M. Chien, M. Orsini, E. Kharitonov, N. Zeghidour, K. Livescu, and A. Défossez, “Moshirag: Asynchronous knowledge retrieval for full-duplex speech language models,” arXiv preprint arXiv:2604.12928, 2026

  80. [80]

    The Silent Thought: Modeling Internal Cognition in Full-Duplex Spoken Dialogue Models via Latent Reasoning

    D. Wu, T. Zhang, Y. Li, H. Liu, C. Chen, E. S. Chng, and Y. Bengio, “The silent thought: Modeling internal cognition in full-duplex spoken dialogue models via latent reasoning,” arXiv preprint arXiv:2603.17837, 2026

Showing first 80 references.