pith. machine review for the scientific record.

arXiv: 2604.25591 · v1 · submitted 2026-04-28 · 📡 eess.AS · cs.AI · cs.CL · cs.LG · cs.SD

Recognition: unknown

Walking Through Uncertainty: An Empirical Study of Uncertainty Estimation for Audio-Aware Large Language Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-07 14:24 UTC · model grok-4.3

classification 📡 eess.AS · cs.AI · cs.CL · cs.LG · cs.SD
keywords uncertainty estimation · audio-aware LLMs · semantic entropy · hallucination detection · audio reasoning · trustworthiness · multimodal models · adaptive inference

The pith

Semantic-level and verification-based uncertainty methods outperform token-level baselines for audio-aware LLMs on general reasoning tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents the first systematic empirical study of uncertainty estimation methods applied to audio-aware large language models that process both sound inputs and text reasoning. It evaluates five representative techniques across multiple models and settings that include general audio understanding, reasoning tasks, hallucination detection, and unanswerable question answering. Results indicate that semantic-level and verification-based methods deliver more reliable uncertainty signals than token-level entropy baselines in standard reasoning scenarios. On trustworthiness-focused benchmarks the advantage becomes inconsistent and depends on the particular model and benchmark. This matters because audio-language systems are deployed in settings where overconfident or fabricated outputs can lead to errors, and improved uncertainty awareness could support safer adaptive behavior.

Core claim

The authors establish through systematic experiments that semantic-level and verification-based uncertainty estimation methods consistently outperform token-level baselines on general audio reasoning benchmarks for audio-aware large language models. On trustworthiness-oriented benchmarks such as hallucination detection and unanswerable question answering, by contrast, the relative effectiveness of the methods depends more on the particular model and benchmark.

What carries the argument

Empirical benchmarking of five uncertainty estimation methods (predictive entropy, length-normalized entropy, semantic entropy, discrete semantic entropy, and P(True)) applied to multiple audio-aware LLMs across audio-conditioned generation tasks.
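For orientation, a minimal sketch of how the token-level and semantic-level scores named above are typically computed from sampled answers is given below. The function names, the toy numbers, and the use of precomputed cluster labels are illustrative assumptions, not the authors' implementation; P(True), the verification-based method, is not shown because it simply prompts the model to judge its own answer and reads off the probability assigned to the token "True".

```python
# Minimal sketch of the token-level and semantic-level uncertainty scores compared
# in the paper, applied to hypothetical sampled answers (illustrative only).
import math
from collections import defaultdict

def predictive_entropy(seq_logprobs):
    """Monte-Carlo estimate of predictive entropy from N sampled sequences.

    seq_logprobs: total log-probability of each sampled answer, log p(answer | audio, question).
    """
    return -sum(seq_logprobs) / len(seq_logprobs)

def length_normalized_entropy(seq_logprobs, lengths):
    """Same estimate, but each sequence log-probability is divided by its token length."""
    return -sum(lp / n for lp, n in zip(seq_logprobs, lengths)) / len(seq_logprobs)

def discrete_semantic_entropy(cluster_ids):
    """Entropy over semantic clusters, using empirical cluster frequencies only.

    cluster_ids: a cluster label per sampled answer (e.g., from pairwise entailment
    checks); answers that mean the same thing share a label.
    """
    counts = defaultdict(int)
    for c in cluster_ids:
        counts[c] += 1
    n = len(cluster_ids)
    return -sum((k / n) * math.log(k / n) for k in counts.values())

# Toy usage: five sampled answers to one audio question (made-up numbers).
logps = [-3.1, -3.4, -2.9, -7.8, -8.2]   # total log-probabilities
lens = [4, 4, 4, 9, 10]                  # token counts
clusters = [0, 0, 0, 1, 1]               # first three answers are paraphrases
print(predictive_entropy(logps))
print(length_normalized_entropy(logps, lens))
print(discrete_semantic_entropy(clusters))
```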

Load-bearing premise

The assumption that the chosen models, benchmarks, and audio-conditioned generation settings adequately represent the perceptual ambiguity and cross-modal grounding challenges without introducing implementation-specific biases.

What would settle it

An experiment on new audio datasets or additional ALLMs in which token-level entropy methods achieve higher calibration or better hallucination detection accuracy than semantic-level methods on general reasoning tasks would falsify the main result.

Figures

Figures reproduced from arXiv: 2604.25591 by Chun-Yi Kuan, Hung-yi Lee, Wei-Ping Huang.

Figure 1. Cost–accuracy Pareto frontier of reasoning vs. adaptive inference.

Figure 2. Reliability diagrams for Qwen2.5-Omni-7B [11] on MMAU [26], reported for both the overall benchmark and three domains: sound, speech, and music. Each point corresponds to an equal-mass bin; lower ECE and Brier score indicate better calibration.

Figure 3. Reliability diagrams for Qwen2.5-Omni-3B [11] on MMAU [26], reported for both the overall benchmark and three domains: sound, speech, and music. Each point corresponds to an equal-mass bin; lower ECE and Brier score indicate better calibration.

Figure 4. Reliability diagrams for Qwen2.5-Omni-7B [11] on MMAR [27], reported for both the overall benchmark and the two benchmark-defined task categories: Perception and Semantic. Each point corresponds to an equal-mass bin; lower ECE and Brier score indicate better calibration.

Figure 5. Reliability diagrams for Qwen2.5-Omni-3B [11] on MMAR [27], reported for both the overall benchmark and the two benchmark-defined task categories: Perception and Semantic. Each point corresponds to an equal-mass bin; lower ECE and Brier score indicate better calibration.

Figure 6. Reliability diagrams for Qwen2.5-Omni-7B [11] on MMSU [29], reported for both the overall benchmark and the two task categories: Perception and Reasoning. Each point corresponds to an equal-mass bin; lower ECE and Brier score indicate better calibration.

Figure 7. Reliability diagrams for Qwen2.5-Omni-3B [11] on MMSU [29], reported for both the overall benchmark and the two task categories: Perception and Reasoning. Each point corresponds to an equal-mass bin; lower ECE and Brier score indicate better calibration.

Figure 8. Reliability diagrams for Qwen2.5-Omni-7B [11] on SAKURA [30], including overall performance and four tracks: animal, emotion, language, and gender. Each point corresponds to an equal-mass bin; lower ECE and Brier score indicate better calibration.

Figure 9. Reliability diagrams for Qwen2.5-Omni-3B [11] on SAKURA [30], including overall performance and four tracks: animal, emotion, language, and gender. Each point corresponds to an equal-mass bin; lower ECE and Brier score indicate better calibration.
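The reliability diagrams above are summarized with expected calibration error (ECE) over equal-mass bins and the Brier score. As a reference point, a minimal sketch of how these two quantities are commonly computed is shown below; the binning routine, array names, and toy numbers are assumptions for illustration, not taken from the paper.

```python
# Hedged sketch: ECE with equal-mass (equal-count) bins and Brier score, given
# per-question confidences and 0/1 correctness labels (illustrative only).
import numpy as np

def ece_equal_mass(confidence, correct, n_bins=10):
    """Expected calibration error with equal-mass bins."""
    order = np.argsort(confidence)
    conf, corr = confidence[order], correct[order]
    bins = np.array_split(np.arange(len(conf)), n_bins)  # roughly equal-count bins
    ece = 0.0
    for idx in bins:
        if len(idx) == 0:
            continue
        gap = abs(conf[idx].mean() - corr[idx].mean())    # |avg confidence - accuracy|
        ece += (len(idx) / len(conf)) * gap               # weight by bin mass
    return ece

def brier_score(confidence, correct):
    """Mean squared difference between confidence and 0/1 correctness."""
    return float(np.mean((confidence - correct) ** 2))

# Toy usage with made-up numbers.
conf = np.array([0.9, 0.8, 0.75, 0.6, 0.55, 0.4, 0.3, 0.2])
corr = np.array([1, 1, 0, 1, 0, 0, 1, 0], dtype=float)
print(ece_equal_mass(conf, corr, n_bins=4), brier_score(conf, corr))
```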
Original abstract

Recent audio-aware large language models (ALLMs) have demonstrated strong capabilities across diverse audio understanding and reasoning tasks, but they still frequently produce hallucinated or overly confident outputs. While uncertainty estimation has been extensively studied in text-only LLMs, it remains largely unexplored for ALLMs, where audio-conditioned generation introduces additional challenges such as perceptual ambiguity and cross-modal grounding. In this work, we present the first systematic empirical study of uncertainty estimation in ALLMs. We benchmark five representative methods, including predictive entropy, length-normalized entropy, semantic entropy, discrete semantic entropy, and P(True), across multiple models and diverse evaluation settings spanning general audio understanding, reasoning, hallucination detection, and unanswerable question answering. Our results reveal two key findings. First, semantic-level and verification-based methods consistently outperform token-level baselines on general audio reasoning benchmarks. Second, on trustworthiness-oriented benchmarks, the relative effectiveness of uncertainty methods becomes notably more model- and benchmark-dependent, indicating that conclusions drawn from general reasoning settings do not straightforwardly transfer to hallucination and unanswerable-question scenarios. We further explore uncertainty-based adaptive inference as a potential downstream application. We hope this study provides a foundation for future research on reliable, uncertainty-aware audio-language systems.
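The abstract's closing mention of uncertainty-based adaptive inference refers to using the uncertainty score as a gate at test time. A minimal sketch of one common form of such a gate (keep the cheap answer when uncertainty is low, escalate otherwise) is given below; the threshold, the callables, and the routing policy are illustrative assumptions rather than the paper's procedure.

```python
# Hedged sketch of uncertainty-gated adaptive inference: answer cheaply when the
# uncertainty score is low, otherwise escalate to a costlier path.
# The callables and the threshold are placeholders, not the authors' setup.
from typing import Callable

def adaptive_answer(question: str,
                    cheap_model: Callable[[str], tuple[str, float]],
                    strong_model: Callable[[str], str],
                    threshold: float = 1.0) -> str:
    answer, uncertainty = cheap_model(question)   # e.g., discrete semantic entropy
    if uncertainty <= threshold:
        return answer                             # confident: keep the cheap answer
    return strong_model(question)                 # uncertain: pay for the stronger model

# Toy usage with stand-in models.
cheap = lambda q: ("a dog barking", 0.3)
strong = lambda q: "a dog barking followed by a doorbell"
print(adaptive_answer("What do you hear?", cheap, strong))
```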

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript presents the first systematic empirical study of uncertainty estimation for audio-aware large language models (ALLMs). It benchmarks five methods (predictive entropy, length-normalized entropy, semantic entropy, discrete semantic entropy, and P(True)) across multiple models on tasks spanning general audio understanding and reasoning, hallucination detection, and unanswerable question answering. The two main findings are that semantic-level and verification-based methods consistently outperform token-level baselines on general audio reasoning benchmarks, while relative effectiveness becomes more model- and benchmark-dependent on trustworthiness-oriented tasks; the work also explores uncertainty-based adaptive inference.

Significance. If the results hold under appropriate controls, this establishes a valuable baseline for uncertainty quantification in multimodal audio-language models and shows that text-LLM findings do not transfer directly due to perceptual ambiguity and cross-modal issues. It provides a foundation for reliable ALLM systems and adaptive inference, with credit for the broad multi-model, multi-task design.

major comments (1)
  1. [§4 Experimental Evaluation] The central claim that semantic-level and verification-based methods outperform baselines specifically due to better handling of perceptual ambiguity and cross-modal grounding requires benchmarks that systematically vary audio clarity, noise, or modality conflict while holding text prompts fixed. No such ablations or controls on audio degradation are described, so the comparison may not isolate ALLM-specific challenges from general LLM uncertainty; a sketch of such a control follows this list.
minor comments (2)
  1. [Results] The results lack reported error bars, variance across runs, or statistical significance tests, which would strengthen verification of the consistency claims.
  2. [§3 (Methods)] Implementation details for audio conditioning (e.g., feature extraction, fusion mechanism) and exact benchmark compositions could be expanded for reproducibility.
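The kind of degradation control the referee asks for could look like the following: the same audio clip and text prompt are evaluated at several signal-to-noise ratios, and the uncertainty score is tracked as the audio degrades. The noise model, SNR grid, and function names are assumptions for illustration only; `run_allm_with_uncertainty` is a placeholder for whatever model and scorer are used.

```python
# Hedged sketch of an audio-degradation control: add white noise at fixed SNRs to
# one clip, keep the text prompt fixed, and record how an uncertainty score moves.
import numpy as np

def add_noise_at_snr(audio: np.ndarray, snr_db: float) -> np.ndarray:
    """Mix white Gaussian noise into `audio` so the result has the requested SNR."""
    signal_power = np.mean(audio ** 2)
    noise_power = signal_power / (10 ** (snr_db / 10))
    noise = np.random.randn(len(audio)) * np.sqrt(noise_power)
    return audio + noise

def degradation_sweep(audio, prompt, run_allm_with_uncertainty, snrs=(30, 20, 10, 0)):
    """Return (snr, uncertainty) pairs with the prompt held fixed across conditions."""
    return [(snr, run_allm_with_uncertainty(add_noise_at_snr(audio, snr), prompt)[1])
            for snr in snrs]
```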

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive and detailed feedback. We address the major comment below and commit to revisions that better align our claims with the experimental evidence while preserving the value of the empirical benchmarking study.

read point-by-point responses
  1. Referee: [§4 Experimental Evaluation] The central claim that semantic-level and verification-based methods outperform baselines specifically due to better handling of perceptual ambiguity and cross-modal grounding requires benchmarks that systematically vary audio clarity, noise, or modality conflict while holding text prompts fixed. No such ablations or controls on audio degradation are described, so the comparison may not isolate ALLM-specific challenges from general LLM uncertainty.

    Authors: We agree that a strong causal attribution of the observed outperformance to better handling of perceptual ambiguity and cross-modal grounding would require controlled ablations that systematically vary audio clarity, noise, or modality conflicts while holding text prompts fixed. Our study is an empirical benchmarking effort on existing audio reasoning, hallucination, and unanswerable-question datasets, which do contain real-world audio with varying perceptual quality; however, we did not introduce explicit degradation controls or modality-conflict conditions. Consequently, we cannot fully isolate ALLM-specific factors from general LLM uncertainty behaviors. In the revised manuscript we will (i) moderate the language in the abstract, introduction, and conclusion to describe the results as consistent outperformance of semantic-level and verification-based methods on audio-conditioned tasks without claiming this is 'specifically due to' perceptual or cross-modal mechanisms, and (ii) add an explicit limitations paragraph noting the absence of such controls and recommending them for future work. This constitutes a partial revision focused on textual clarification rather than new experiments.

    revision: partial

Circularity Check

0 steps flagged

No circularity: empirical benchmarking without derivations or self-referential predictions

full rationale

The paper is a purely empirical study that benchmarks five uncertainty estimation methods (predictive entropy, length-normalized entropy, semantic entropy, discrete semantic entropy, P(True)) on external audio reasoning and trustworthiness benchmarks across multiple ALLMs. No mathematical derivations, first-principles results, fitted parameters presented as predictions, or self-citation chains appear in the abstract or described structure. All claims rest on direct experimental comparisons to prior text-LLM methods and standard benchmarks, with no load-bearing steps that reduce to the paper's own inputs by construction. This matches the default case of an observational evaluation whose conclusions are falsifiable against independent data.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

This is an empirical benchmarking study that applies existing uncertainty methods without introducing new theoretical constructs, fitted parameters, or postulated entities.

axioms (1)
  • domain assumption: Uncertainty estimation techniques developed for text-only LLMs remain meaningful when applied to audio-conditioned generation.
    The study directly benchmarks the listed methods on ALLMs without deriving or justifying new foundations for the audio modality.

pith-pipeline@v0.9.0 · 5536 in / 1222 out tokens · 86215 ms · 2026-05-07T14:24:14.565870+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

101 extracted references · 53 canonical work pages · 15 internal anchors

  1. [1]

    GPT-4 Technical Report

    J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkatet al., “Gpt-4 technical report,”arXiv preprint arXiv:2303.08774, 2023

  2. [2]

    Listen, think, and understand,

    Y . Gong, H. Luo, A. H. Liu, L. Karlinsky, and J. R. Glass, “Listen, think, and understand,” inThe Twelfth International Conference on Learning Representations, 2023

  3. [3]

    Joint audio and speech understanding,

    Y . Gong, A. H. Liu, H. Luo, L. Karlinsky, and J. Glass, “Joint audio and speech understanding,” in2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). IEEE, 2023, pp. 1–8

  4. [4]

    Blsp: Bootstrapping language-speech pre-training via behavior alignment of continuation writing,

    C. Wang, M. Liao, Z. Huang, J. Lu, J. Wu, Y . Liu, C. Zong, and J. Zhang, “Blsp: Bootstrapping language-speech pre-training via behavior alignment of continuation writing,”arXiv preprint arXiv:2309.00916, 2023

  5. [5]

    Audiochatllama: Towards general-purpose speech abilities for llms,

    Y . Fathullah, C. Wu, E. Lakomkin, K. Li, J. Jia, Y . Shangguan, J. Mahadeokar, O. Kalinli, C. Fuegen, and M. Seltzer, “Audiochatllama: Towards general-purpose speech abilities for llms,” inProceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (V olume 1: Long Papers)...

  6. [6]

    Speech-copilot: Leveraging large language models for speech process- ing via task decomposition, modularization, and program generation,

    C.-Y . Kuan, C.-K. Yang, W.-P. Huang, K.-H. Lu, and H.-y. Lee, “Speech-copilot: Leveraging large language models for speech process- ing via task decomposition, modularization, and program generation,” in2024 IEEE Spoken Language Technology Workshop (SLT). IEEE, 2024, pp. 1060–1067

  7. [7]

    Speechprompt: Prompting speech language models for speech processing tasks,

    K.-W. Chang, H. Wu, Y .-K. Wang, Y .-K. Wu, H. Shen, W.-C. Tseng, I.-t. Kang, S.-W. Li, and H.-y. Lee, “Speechprompt: Prompting speech language models for speech processing tasks,”IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2024

  8. [8]

    Blsp-emo: Towards empathetic large speech-language models,

    C. Wang, M. Liao, Z. Huang, J. Wu, C. Zong, and J. Zhang, “Blsp-emo: Towards empathetic large speech-language models,” inProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, 2024, pp. 19 186–19 199

  9. [9]

    Audio flamingo 3: Advancing audio intelligence with fully open large audio language models,

    S. Ghosh, A. Goel, J. Kim, S. Kumar, Z. Kong, S.-g. Lee, C.-H. H. Yang, R. Duraiswami, D. Manocha, R. Valleet al., “Audio flamingo 3: Advancing audio intelligence with fully open large audio language models,” inThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025

  10. [10]

    Voxtral

    A. H. Liu, A. Ehrenberg, A. Lo, C. Denoix, C. Barreau, G. Lample, J.-M. Delignon, K. R. Chandu, P. von Platen, P. R. Muddireddy et al., “Voxtral,” arXiv preprint arXiv:2507.13264, 2025

  11. [11]

    Qwen2.5-Omni Technical Report

    J. Xu, Z. Guo, J. He, H. Hu, T. He, S. Bai, K. Chen, J. Wang, Y. Fan, K. Dang et al., “Qwen2.5-Omni technical report,” arXiv preprint arXiv:2503.20215, 2025

  12. [12]

    Building a taiwanese mandarin spoken language model: A first attempt,

    C.-K. Yang, Y .-K. Fu, C.-A. Li, Y .-C. Lin, Y .-X. Lin, W.-C. Chen, H. L. Chung, C.-Y . Kuan, W.-P. Huang, K.-H. Luet al., “Building a taiwanese mandarin spoken language model: A first attempt,”arXiv preprint arXiv:2411.07111, 2024

  13. [13]

    Teaching audio-aware large language mod- els what does not hear: Mitigating hallucinations through synthesized negative samples,

    C.-Y . Kuan and H.-y. Lee, “Teaching audio-aware large language mod- els what does not hear: Mitigating hallucinations through synthesized negative samples,” inProc. Interspeech 2025, 2025, pp. 2073–2077

  14. [14]

    From alignment to advancement: Bootstrapping audio-language alignment with synthetic data,

    ——, “From alignment to advancement: Bootstrapping audio-language alignment with synthetic data,”IEEE Transactions on Audio, Speech and Language Processing, vol. 33, pp. 4604–4619, 2025

  15. [15]

    Recent advances in discrete speech tokens: A review,

    Y . Guo, Z. Li, H. Wang, B. Li, C. Shao, H. Zhang, C. Du, X. Chen, S. Liu, and K. Yu, “Recent advances in discrete speech tokens: A review,”IEEE Transactions on Pattern Analysis and Machine Intelli- gence, 2025

  16. [16]

    Towards audio language modeling–an overview,

    H. Wu, X. Chen, Y .-C. Lin, K.-w. Chang, H.-L. Chung, A. H. Liu, and H.-y. Lee, “Towards audio language modeling–an overview,”arXiv preprint arXiv:2402.13236, 2024

  17. [17]

    Discrete audio tokens: More than a survey!

    P. Mousavi, G. Maimon, A. Moumen, D. Petermann, J. Shi, H. Wu, H. Yang, A. Kuznetsova, A. Ploujnikov, R. Marxeret al., “Discrete audio tokens: More than a survey!”arXiv preprint arXiv:2506.10274, 2025

  18. [18]

    Recent advances in speech language models: A survey,

    W. Cui, D. Yu, X. Jiao, Z. Meng, G. Zhang, Q. Wang, S. Y . Guo, and I. King, “Recent advances in speech language models: A survey,” inProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (V olume 1: Long Papers), 2025, pp. 13 943– 13 970

  19. [19]

    A survey on speech large language models for understanding,

    J. Peng, Y . Wang, B. Li, Y . Guo, H. Wang, Y . Fang, Y . Xi, H. Li, X. Li, K. Zhanget al., “A survey on speech large language models for understanding,”IEEE Journal of Selected Topics in Signal Processing, 2025

  20. [20]

    Wavchat: A survey of spoken dialogue models

    S. Ji, Y . Chen, M. Fang, J. Zuo, J. Lu, H. Wang, Z. Jiang, L. Zhou, S. Liu, X. Chenget al., “Wavchat: A survey of spoken dialogue models,”arXiv preprint arXiv:2411.13577, 2024

  21. [21]

    Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs

    A. Abouelenin, A. Ashfaq, A. Atkinson, H. Awadalla, N. Bach, J. Bao, A. Benhaim, M. Cai, V . Chaudhary, C. Chenet al., “Phi-4-mini technical report: Compact yet powerful multimodal language models via mixture-of-loras,”arXiv preprint arXiv:2503.01743, 2025

  22. [22]

    Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

    G. Comanici, E. Bieber, M. Schaekermann, I. Pasupat, N. Sachdeva, I. Dhillon, M. Blistein, O. Ram, D. Zhang, E. Rosenet al., “Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities,”arXiv preprint arXiv:2507.06261, 2025

  23. [23]

    On the landscape of spoken language models: A comprehensive survey,

    S. Arora, K.-W. Chang, C.-M. Chien, Y . Peng, H. Wu, Y . Adi, E. Dupoux, H.-y. Lee, K. Livescu, and S. Watanabe, “On the landscape of spoken language models: A comprehensive survey,”Transactions on Machine Learning Research, 2025

  24. [24]

    Audio flamingo 2: An audio-language model with long-audio understanding and expert reasoning abilities,

    S. Ghosh, Z. Kong, S. Kumar, S. Sakshi, J. Kim, W. Ping, R. Valle, D. Manocha, and B. Catanzaro, “Audio flamingo 2: An audio-language model with long-audio understanding and expert reasoning abilities,” inF orty-second International Conference on Machine Learning, 2024

  25. [25]

    Tico: Time-controllable training for spoken dialogue models,

    K.-W. Chang, W.-C. Chen, E.-P. Hu, H.-y. Lee, and J. Glass, “Tico: Time-controllable training for spoken dialogue models,”arXiv preprint arXiv:2603.22267, 2026

  26. [26]

    MMAU: A massive multi-task audio understanding and reasoning benchmark

    S. Sakshi, U. Tyagi, S. Kumar, A. Seth, R. Selvakumar, O. Nieto, R. Duraiswami, S. Ghosh, and D. Manocha, “MMAU: A massive multi-task audio understanding and reasoning benchmark,” in The Thirteenth International Conference on Learning Representations, 2025

  27. [27]

    MMAR: A challenging benchmark for deep reasoning in speech, audio, music, and their mix

    Z. Ma, Y. Ma, Y. Zhu, C. Yang, Y.-W. Chao, R. Xu et al., “MMAR: A challenging benchmark for deep reasoning in speech, audio, music, and their mix,” arXiv preprint arXiv:2505.13032, 2025

  28. [28]

    MMAU-Pro: A challenging and comprehensive benchmark for holistic evaluation of audio general intelligence

    S. Kumar, Š. Sedláček, V. Lokegaonkar, F. López, W. Yu, N. Anand, H. Ryu, L. Chen, M. Plička, M. Hlaváček et al., “MMAU-Pro: A challenging and comprehensive benchmark for holistic evaluation of audio general intelligence,” arXiv preprint arXiv:2508.13992, 2025

  29. [29]

    MMSU: A massive multi-task spoken language understanding and reasoning benchmark

    D. Wang, J. Wu, J. Li, D. Yang, X. Chen, T. Zhang, and H. Meng, “MMSU: A massive multi-task spoken language understanding and reasoning benchmark,” arXiv preprint arXiv:2506.04779, 2025

  30. [30]

    SAKURA: On the multi-hop reasoning of large audio-language models based on speech and audio information

    C.-K. Yang et al., “SAKURA: On the multi-hop reasoning of large audio-language models based on speech and audio information,” in Interspeech 2025, 2025

  31. [31]

    Understanding sounds, missing the questions: The challenge of object hallucination in large audio-language models,

    C.-Y . Kuan, W.-P. Huang, and H.-y. Lee, “Understanding sounds, missing the questions: The challenge of object hallucination in large audio-language models,” inInterspeech 2025, 2025

  32. [32]

    Can large audio-language models truly hear? tackling hallucinations with multi-task assessment and stepwise audio reasoning,

    C.-Y . Kuan and H.-y. Lee, “Can large audio-language models truly hear? tackling hallucinations with multi-task assessment and stepwise audio reasoning,” inICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2025, pp. 1–5

  33. [33]

    Dynamic-superb: Towards a dynamic, collaborative, and compre- hensive instruction-tuning benchmark for speech,

    C.-y. Huang, K.-H. Lu, S.-H. Wang, C.-Y . Hsiao, C.-Y . Kuanet al., “Dynamic-superb: Towards a dynamic, collaborative, and compre- hensive instruction-tuning benchmark for speech,” inICASSP 2024- 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2024, pp. 12 136–12 140

  34. [34]

    Dynamic- superb phase-2: A collaboratively expanding benchmark for measuring the capabilities of spoken language models with 180 tasks,

    C.-y. Huang, W.-C. Chen, S.-w. Yang, A. T. Liuet al., “Dynamic- superb phase-2: A collaboratively expanding benchmark for measuring the capabilities of spoken language models with 180 tasks,” inThe Thir- teenth International Conference on Learning Representations, 2025

  35. [35]

    Listen and speak fairly: a study on semantic gender bias in speech integrated large language models,

    Y .-C. Lin, T.-Q. Lin, C.-K. Yang, K.-H. Lu, W.-C. Chen, C.-Y . Kuan, and H.-y. Lee, “Listen and speak fairly: a study on semantic gender bias in speech integrated large language models,” in2024 IEEE Spoken Language Technology Workshop (SLT). IEEE, 2024, pp. 439–446

  36. [36]

    Spoken stereoset: on evaluating social bias toward speaker in speech large language models,

    Y .-C. Lin, W.-C. Chen, and H.-y. Lee, “Spoken stereoset: on evaluating social bias toward speaker in speech large language models,” in2024 IEEE Spoken Language Technology Workshop (SLT). IEEE, 2024, pp. 871–878

  37. [37]

    Av-superb: A multi- task evaluation benchmark for audio-visual representation models,

    Y . Tseng, L. Berry, Y .-T. Chen, I.-H. Chiu, H.-H. Lin, M. Liu, P. Peng, Y .-J. Shih, H.-Y . Wang, H. Wuet al., “Av-superb: A multi- task evaluation benchmark for audio-visual representation models,” in ICASSP 2024-2024 IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE, 2024, pp. 6890–6894

  38. [38]

    Beyond single-audio: Advancing multi-audio processing in audio large language models,

    Y . Chen, X. Yue, X. Gao, C. Zhang, L. F. D’Haro, R. T. Tan, and H. Li, “Beyond single-audio: Advancing multi-audio processing in audio large language models,” inFindings of the Association for Computational Linguistics: EMNLP 2024, 2024, pp. 10 917–10 930

  39. [39]

    Audiomarathon: A comprehensive benchmark for long-context audio understanding and efficiency in audio llms,

    P. He, Z. Wen, Y . Wang, Y . Wang, X. Liu, J. Huang, Z. Lei, Z. Gu, X. Jin, J. Yanget al., “Audiomarathon: A comprehensive benchmark for long-context audio understanding and efficiency in audio llms,” arXiv preprint arXiv:2510.07293, 2025

  40. [40]

    Speech-ifeval: Evaluating instruction-following and quantifying catastrophic forgetting in speech- aware language models,

    K.-H. Lu, C.-Y . Kuan, and H.-y. Lee, “Speech-ifeval: Evaluating instruction-following and quantifying catastrophic forgetting in speech- aware language models,” inProc. Interspeech 2025, 2025, pp. 2078– 2082

  41. [41]

    Evaluation of llms in speech is often flawed: Test set contamination in large language models for speech recognition,

    Y . Tseng, T. Parcollet, R. van Dalen, S. Zhang, and S. Bhattacharya, “Evaluation of llms in speech is often flawed: Test set contamination in large language models for speech recognition,”arXiv preprint arXiv:2505.22251, 2025

  42. [42]

    ParaS2S: Benchmarking and aligning spoken language models for paralinguistic-aware speech-to-speech interaction

    S.-w. Yang, M. Tu, A. T. Liu, X. Qu, H.-y. Lee, L. Lu, Y . Wang, and Y . Wu, “Paras2s: Benchmarking and aligning spoken language models for paralinguistic-aware speech-to-speech interaction,”arXiv preprint arXiv:2511.08723, 2025

  43. [43]

    Mos-bias: From hidden gen- der bias to gender-aware speech quality assessment,

    W. Ren, Y .-C. Lin, W.-C. Huang, E. Cooper, R. E. Zezario, H.- M. Wang, H.-y. Lee, and Y . Tsao, “Mos-bias: From hidden gen- der bias to gender-aware speech quality assessment,”arXiv preprint arXiv:2603.10723, 2026

  44. [44]

    Aqua-bench: Beyond finding answers to knowing when there are none in audio question answering,

    C.-Y . Kuan and H.-y. Lee, “Aqua-bench: Beyond finding answers to knowing when there are none in audio question answering,” inICASSP 2026-2026 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2026, pp. 1–5

  45. [45]

    Aqascore: Evaluating seman- tic alignment in text-to-audio generation via audio question answering,

    C.-Y . Kuan, K.-W. Chang, and H.-y. Lee, “Aqascore: Evaluating seman- tic alignment in text-to-audio generation via audio question answering,” arXiv preprint arXiv:2601.14728, 2026

  46. [46]

    Causal tracing of audio-text fusion in large audio language models,

    W.-C. Chen, C.-y. Huang, and H.-y. Lee, “Causal tracing of audio-text fusion in large audio language models,”arXiv preprint arXiv:2603.13768, 2026

  47. [47]

    Membership inference attacks against large audio language models,

    J.-K. Dong, Y .-X. Lin, and H.-Y . Lee, “Membership inference attacks against large audio language models,”arXiv preprint arXiv:2603.28378, 2026

  48. [48]

    Echomind: An interrelated multi-level benchmark for evaluating empathetic speech language models,

    L. Zhou, L. Yu, Y . Lyu, Y . Lin, Z. Zhao, J. Ao, Y . Zhang, B. Wang, and H. Li, “Echomind: An interrelated multi-level benchmark for evaluating empathetic speech language models,”arXiv preprint arXiv:2510.22758, 2025

  49. [49]

    How contrastive decoding enhances large audio language models?

    T.-Q. Lin, W.-P. Huang, Y .-C. Lin, and H.-y. Lee, “How contrastive decoding enhances large audio language models?”arXiv preprint arXiv:2603.09232, 2026

  50. [50]

    Taigis- peech: A low-resource real-world speech intent dataset and prelim- inary results with scalable data mining in-the-wild,

    K.-W. Chang, Y .-C. Lin, H.-C. Chou, W. Ren, Y .-H. Huang, Y .-S. Tsai, C.-C. Chen, Y . Tsao, Y .-F. Liao, S. Narayananet al., “Taigis- peech: A low-resource real-world speech intent dataset and prelim- inary results with scalable data mining in-the-wild,”arXiv preprint arXiv:2603.21478, 2026

  51. [51]

    VIBE: Voice-Induced open-ended Bias Evaluation for Large Audio-Language Models via Real-World Speech

    Y .-C. Lin, Y . Hirota, S.-F. Huang, and H.-y. Lee, “Vibe: V oice-induced open-ended bias evaluation for large audio-language models via real- world speech,”arXiv preprint arXiv:2604.17248, 2026

  52. [52]

    Game-time: Evaluating temporal dynamics in spoken language models,

    K.-W. Chang, E.-P. Hu, C.-Y . Kuan, W. Ren, W.-C. Chen, G.-T. Lin, Y . Tsao, S.-H. Sun, H.-y. Lee, and J. Glass, “Game-time: Evaluating temporal dynamics in spoken language models,” inICASSP 2026- 2026 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2026, pp. 1–5

  53. [53]

    Aha: Aligning large audio-language models for reasoning hallucinations via counterfactual hard negatives,

    Y . Chen, W. Zhu, X. Chen, Z. Wang, X. Li, P. Qiu, H. Wang, X. Dong, Y . Xiong, A. Schneideret al., “Aha: Aligning large audio-language models for reasoning hallucinations via counterfactual hard negatives,” arXiv preprint arXiv:2512.24052, 2025

  54. [54]

    HalluAudio: A Comprehensive Benchmark for Hallucination Detection in Large Audio-Language Models

    F. Zhao, Y . Chen, W. Lu, D. Zhang, X. Yue, and J. Wei, “Halluaudio: A comprehensive benchmark for hallucination detection in large audio- language models,”arXiv preprint arXiv:2604.19300, 2026

  55. [55]

    Audiotrust: Benchmarking the multifaceted trustworthiness of audio large language models,

    K. Li, C. Shen, Y . Liu, J. Han, K. Zheng, X. Zou, L. Z. Wang, S. Zhang, X. Du, H. Luoet al., “Audiotrust: Benchmarking the multifaceted trustworthiness of audio large language models,”arXiv preprint arXiv:2505.16211, 2025

  56. [56]

    Measuring audio’s impact on correctness: Audio- contribution-aware post-training of large audio language models,

    H. He, X. Du, R. Sun, Z. Dai, Y . Xiao, M. Yang, J. Zhou, X. Li, Z. Liu, Z. Lianget al., “Measuring audio’s impact on correctness: Audio- contribution-aware post-training of large audio language models,”arXiv preprint arXiv:2509.21060, 2025

  57. [57]

    A Baseline for Detecting Misclassified and Out-of-Distribution Examples in Neural Networks

    D. Hendrycks and K. Gimpel, “A baseline for detecting misclassified and out-of-distribution examples in neural networks,”arXiv preprint arXiv:1610.02136, 2016

  58. [58]

    Can you trust your model’s uncertainty? evaluating predictive uncertainty under dataset shift,

    Y . Ovadia, E. Fertig, J. Ren, Z. Nado, D. Sculley, S. Nowozin, J. Dillon, B. Lakshminarayanan, and J. Snoek, “Can you trust your model’s uncertainty? evaluating predictive uncertainty under dataset shift,”Advances in neural information processing systems, vol. 32, 2019

  59. [59]

    Calibration of pre-trained transformers,

    S. Desai and G. Durrett, “Calibration of pre-trained transformers,” in Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020, pp. 295–302

  60. [60]

    Scaling Language Models: Methods, Analysis & Insights from Training Gopher

    J. W. Rae, S. Borgeaud, T. Cai, K. Millican, J. Hoffmann, F. Song, J. Aslanides, S. Henderson, R. Ring, S. Younget al., “Scaling language models: Methods, analysis & insights from training gopher,”arXiv preprint arXiv:2112.11446, 2021

  61. [61]

    How can we know when language models know? on the calibration of language models for ques- tion answering,

    Z. Jiang, J. Araki, H. Ding, and G. Neubig, “How can we know when language models know? on the calibration of language models for ques- tion answering,”Transactions of the Association for Computational Linguistics, vol. 9, pp. 962–977, 2021

  62. [62]

    Language Models (Mostly) Know What They Know

    S. Kadavath, T. Conerly, A. Askell, T. Henighan, D. Drain, E. Perez, N. Schiefer, Z. Hatfield-Dodds, N. DasSarma, E. Tran-Johnsonet al., “Language models (mostly) know what they know,”arXiv preprint arXiv:2207.05221, 2022

  63. [63]

    Reducing conversational agents’ overconfidence through linguistic calibration,

    S. J. Mielke, A. Szlam, E. Dinan, and Y .-L. Boureau, “Reducing conversational agents’ overconfidence through linguistic calibration,” Transactions of the Association for Computational Linguistics, vol. 10, pp. 857–872, 2022

  64. [64]

    Beyond the imitation game: Quantifying and extrapolating the capabilities of language models,

    A. Srivastava, A. Rastogi, A. Rao, A. A. M. Shoeb, A. Abid, A. Fisch, A. R. Brown, A. Santoro, A. Gupta, A. Garriga-Alonsoet al., “Beyond the imitation game: Quantifying and extrapolating the capabilities of language models,”Transactions on machine learning research, 2023

  65. [65]

    Semantic uncertainty: Linguistic invariances for uncertainty estimation in natural language generation,

    L. Kuhn, Y . Gal, and S. Farquhar, “Semantic uncertainty: Linguistic invariances for uncertainty estimation in natural language generation,” inThe Eleventh International Conference on Learning Representations, 2023

  66. [66]

    Detecting hallucinations in large language models using semantic entropy,

    S. Farquhar, J. Kossen, L. Kuhnet al., “Detecting hallucinations in large language models using semantic entropy,”Nature, vol. 630, pp. 625–630, 2024

  67. [67]

    Kernel language entropy: Fine-grained uncertainty quantification for llms from semantic similarities,

    A. Nikitin, J. Kossen, Y . Gal, and P. Marttinen, “Kernel language entropy: Fine-grained uncertainty quantification for llms from semantic similarities,”Advances in Neural Information Processing Systems, vol. 37, pp. 8901–8929, 2024

  68. [68]

    Beyond semantic entropy: Boosting llm uncertainty quantification with pairwise semantic similarity,

    D. Nguyen, A. Payani, and B. Mirzasoleiman, “Beyond semantic entropy: Boosting llm uncertainty quantification with pairwise semantic similarity,” inFindings of the Association for Computational Linguis- tics: ACL 2025, 2025, pp. 4530–4540

  69. [69]

    Self-consistency improves chain of thought reasoning in language models,

    X. Wang, J. Wei, D. Schuurmans, Q. V . Le, E. H. Chi, S. Narang, A. Chowdhery, and D. Zhou, “Self-consistency improves chain of thought reasoning in language models,” inThe Eleventh International Conference on Learning Representations, 2023

  70. [70]

    On the calibration of large language models and alignment,

    C. Zhu, B. Xu, Q. Wang, Y . Zhang, and Z. Mao, “On the calibration of large language models and alignment,” inFindings of the Association for Computational Linguistics: EMNLP 2023, 2023, pp. 9778–9795

  71. [71]

    On calibration of modern neural networks,

    C. Guo, G. Pleiss, Y . Sun, and K. Q. Weinberger, “On calibration of modern neural networks,” inInternational conference on machine learning. PMLR, 2017, pp. 1321–1330

  72. [72]

    Teaching models to express their uncertainty in words, 2022

    S. Lin, J. Hilton, and O. Evans, “Teaching models to express their uncertainty in words,”arXiv preprint arXiv:2205.14334, 2022

  73. [73]

    Do large language models know what they don’t know?

    Z. Yin, Q. Sun, Q. Guo, J. Wu, X. Qiu, and X.-J. Huang, “Do large language models know what they don’t know?” inFindings of the association for Computational Linguistics: ACL 2023, 2023, pp. 8653– 8665

  74. [74]

    Uncertainty quantification of large language models through multi- dimensional responses,

    T. Chen, X. Liu, L. Da, J. Chen, V . Papalexakis, and H. Wei, “Uncertainty quantification of large language models through multi- dimensional responses,”arXiv preprint arXiv:2502.16820, 2025

  75. [75]

    A survey of uncertainty estimation in llms: Theory meets practice,

    H.-Y . Huang, Y . Yang, Z. Zhang, S. Lee, and Y . Wu, “A survey of uncertainty estimation in llms: Theory meets practice,”arXiv preprint arXiv:2410.15326, 2024

  76. [76]

    Uncertainty in natural language processing: Sources, quantification, and applications,

    M. Hu, Z. Zhang, S. Zhao, M. Huang, and B. Wu, “Uncertainty in natural language processing: Sources, quantification, and applications,” arXiv preprint arXiv:2306.04459, 2023

  77. [77]

    A survey of uncertainty estimation methods on large language models,

    Z. Xia, J. Xu, Y . Zhang, and H. Liu, “A survey of uncertainty estimation methods on large language models,” inFindings of the Association for Computational Linguistics: ACL 2025, 2025, pp. 21 381–21 396

  78. [78]

    Rethinking Uncertainty Estimation in LLMs: A Principled Single-Sequence Measure

    L. Aichberger, K. Schweighofer, and S. Hochreiter, “Rethinking un- certainty estimation in natural language generation,”arXiv preprint arXiv:2412.15176, 2024

  79. [79]

    SEED-GRPO: Semantic entropy enhanced GRPO for uncertainty-aware policy optimization

    M. Chen, G. Chen, W. Wang, and Y . Yang, “Seed-grpo: Semantic entropy enhanced grpo for uncertainty-aware policy optimization,” arXiv preprint arXiv:2505.12346, 2025

  80. [80]

    FrugalGPT: How to Use Large Language Models While Reducing Cost and Improving Performance

    L. Chen, M. Zaharia, and J. Zou, “Frugalgpt: How to use large language models while reducing cost and improving performance,” arXiv preprint arXiv:2305.05176, 2023

Showing first 80 references.