pith. machine review for the scientific record.

arxiv: 2604.24401 · v1 · submitted 2026-04-27 · 💻 cs.SD · cs.AI · cs.CL · eess.AS

Recognition: unknown

All That Glitters Is Not Audio: Rethinking Text Priors and Audio Reliance in Audio-Language Evaluation

Authors on Pith: no claims yet

Pith reviewed 2026-05-07 17:33 UTC · model grok-4.3

classification 💻 cs.SD · cs.AI · cs.CL · eess.AS
keywords audio-language models · text priors · audio reliance · benchmark evaluation · diagnostic framework · speech understanding · evaluation reliability · multimodal assessment

The pith

Audio-language models keep 60-72 percent of their benchmark scores even with no audio input.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that current benchmarks for audio-language models may not actually test whether models perceive sound. It introduces a diagnostic approach that checks how much models can answer from text and general knowledge alone, then measures how much they truly need the acoustic signal. Tests across eight models and three benchmarks reveal that removing all audio drops performance only modestly. Among questions that need audio, most are solvable from short local segments rather than the full recording. This finding matters because it means high scores may reflect clever use of language patterns instead of auditory understanding.

Core claim

Large audio-language models retain 60-72% of their full audio scores even without any audio input. Among items that require audio, only 3.0-4.2% need the complete audio clip; the majority can be resolved using localized fragments. These results indicate that benchmarks often fail to isolate genuine auditory perception.
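
Read operationally, the headline number is a retention ratio: the no-audio score divided by the full-audio score (this is one plausible reading of the paper's text-prior rate R_TP; the figures in the example below are illustrative, not taken from the paper).

    retention R_TP ≈ Acc(no audio) / Acc(full audio)
    e.g. 0.45 / 0.65 ≈ 0.69, i.e. roughly 69% of the full-audio score survives with no audio at all.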

What carries the argument

Two-axis diagnostic framework that separates text prior (how much a question can be answered from text and knowledge alone) from audio reliance (how much the acoustic signal is actually required).
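
A minimal sketch of how such an audit could be scored, assuming per-item correctness is recorded under three conditions (full audio, no audio, short fragment); the data structure, field names, and output keys are illustrative, not the paper's exact protocol.

    from dataclasses import dataclass

    @dataclass
    class ItemResult:
        correct_full: bool      # correct with the complete audio clip
        correct_none: bool      # correct with no audio at all (text prior only)
        correct_fragment: bool  # correct from a short local segment

    def two_axis_audit(results: list[ItemResult]) -> dict:
        """Separate text prior from audio reliance over a set of benchmark items."""
        n_full = sum(r.correct_full for r in results)
        n_none = sum(r.correct_none for r in results)
        # Items solved only when audio is present are the genuinely audio-reliant ones.
        audio_reliant = [r for r in results if r.correct_full and not r.correct_none]
        # Of those, items already solvable from a fragment need no cross-segment information.
        n_local = sum(r.correct_fragment for r in audio_reliant)
        return {
            "text_prior_rate": n_none / n_full if n_full else 0.0,
            "audio_reliant_items": len(audio_reliant),
            "needs_full_clip": len(audio_reliant) - n_local,
        }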

If this is right

  • Benchmark scores likely overestimate how much models rely on actual audio processing.
  • Evaluation protocols should routinely include a no-audio baseline to verify genuine audio dependence.
  • Most audio-dependent items can be answered from short fragments, so full-clip requirements may be unnecessary in many cases.
  • Benchmark design needs to reduce the influence of text priors to better isolate auditory capabilities.
  • Guidelines for improving evaluation reliability follow directly from the measured gaps between text-only and full-audio performance.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Similar text-prior inflation could affect vision-language or other multimodal benchmarks, suggesting a general diagnostic approach across modalities.
  • Training objectives might be adjusted to penalize over-reliance on text patterns when audio is present.
  • Developers could apply the same two-axis test during model iteration to detect when improvements come from language shortcuts rather than audio features.

Load-bearing premise

Removing the audio input cleanly measures text priors without introducing new biases or artifacts in how questions are presented.

What would settle it

Re-running the benchmarks after masking or removing all text cues and observing whether scores collapse to near chance levels.
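
For multiple-choice benchmarks such as MMAU, "near chance" has a concrete value (1/N for N answer options), so the proposed check reduces to a simple comparison. A sketch, with the tolerance chosen arbitrarily for illustration:

    def collapses_to_chance(acc_after_text_masking: float,
                            num_options: int,
                            tolerance: float = 0.05) -> bool:
        """True if accuracy with text cues masked sits within tolerance of chance.

        Chance accuracy for an N-option multiple-choice set is 1 / N,
        e.g. 0.25 when every item has four options.
        """
        chance = 1.0 / num_options
        return acc_after_text_masking <= chance + tolerance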

Figures

Figures reproduced from arXiv: 2604.24401 by Chen-An Li, Chih-Kai Yang, Hung-yi Lee, Ke-Han Lu, Leonardo Haw-Yang Foo.

Figure 1
Figure 1: Overview of the proposed diagnostic framework. Text Prior measures how much of a benchmark can be solved from the textual prompt and general knowledge alone; Audio Reliance measures how much model performance actually depends on the audio signal.
Figure 2
Figure 2: Retention rate (%) across three benchmarks for eight models. Higher retention indicates greater reliance on information preserved in short audio fragments.
Figure 3
Figure 3: Model-averaged stacked distribution of item categories across the three benchmarks.
read the original abstract

Large Audio-Language Models show consistent performance gains across speech and audio benchmarks, yet high scores may not reflect true auditory perception. If a model can answer questions without processing the acoustic signal, the benchmark fails as a measure of auditory understanding. We present a diagnostic framework using two axes: text prior, which measures answerability from text and general knowledge alone, and audio reliance, which assesses actual dependency on the acoustic signal. Evaluating eight LALMs across three benchmarks, we find that models retain 60-72% of their full audio scores even without any audio input. Moreover, among items that require audio, only 3.0-4.2% need the complete audio clip; the majority can be resolved using localized fragments. These findings challenge the assumption that benchmark performance equals robust audio understanding, and we conclude with practical guidelines for improving evaluation reliability and benchmark design.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper claims to introduce a two-axis diagnostic framework for evaluating text priors and audio reliance in Large Audio-Language Models. By testing eight LALMs on three benchmarks, the authors find that these models retain 60-72% of their full audio performance even without audio input, suggesting heavy reliance on text priors. Additionally, they report that among audio-requiring items, only 3.0-4.2% necessitate the complete audio clip, with the majority being solvable using localized audio fragments. The work challenges the validity of current benchmarks as measures of auditory understanding and offers practical guidelines for improved evaluation.

Significance. If the experimental conditions are free of confounds, this paper makes a significant contribution by empirically demonstrating that text-based information can account for a large portion of LALM performance on audio benchmarks. The direct, parameter-free ablations on multiple models and benchmarks provide reproducible evidence that current evaluation practices may overestimate audio understanding capabilities. The finding that localized fragments suffice for most items further implies that benchmarks may not be testing holistic audio comprehension, and the proposed guidelines could help in designing more robust tests.

major comments (1)
  1. [Experimental Setup] The headline results (60-72% retention without audio; 3.0-4.2% needing full clip) are load-bearing for the paper's critique of benchmarks. However, the manuscript does not provide explicit details on how the no-audio condition is implemented, including prompt modifications, audio input handling (e.g., zeroed signals or absent input), and controls for question phrasing. As noted in the stress-test, if these alter model behavior, the percentages may not purely reflect text priors. This requires clarification to validate the isolation of the diagnostic axes.
minor comments (2)
  1. [Abstract] The abstract reports quantitative results but omits the names of the three benchmarks and eight models, which would aid assessment of representativeness.
  2. [Results] The results section should report data splits, statistical significance tests, or confidence intervals for the key percentages to allow readers to evaluate robustness.
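
One straightforward way to provide the requested uncertainty estimates would be a percentile bootstrap over per-item correctness; the sketch below assumes that approach and is not the authors' procedure.

    import random

    def bootstrap_ci(correct: list[bool], n_resamples: int = 10_000,
                     alpha: float = 0.05, seed: int = 0) -> tuple[float, float]:
        """Percentile bootstrap confidence interval for a benchmark accuracy."""
        rng = random.Random(seed)
        n = len(correct)
        accs = sorted(sum(rng.choices(correct, k=n)) / n for _ in range(n_resamples))
        lo = accs[int((alpha / 2) * n_resamples)]
        hi = accs[int((1 - alpha / 2) * n_resamples) - 1]
        return lo, hi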

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback and the opportunity to strengthen the manuscript. We address the major comment on experimental setup below and will revise the paper to improve clarity and reproducibility.

read point-by-point responses
  1. Referee: [Experimental Setup] The headline results (60-72% retention without audio; 3.0-4.2% needing full clip) are load-bearing for the paper's critique of benchmarks. However, the manuscript does not provide explicit details on how the no-audio condition is implemented, including prompt modifications, audio input handling (e.g., zeroed signals or absent input), and controls for question phrasing. As noted in the stress-test, if these alter model behavior, the percentages may not purely reflect text priors. This requires clarification to validate the isolation of the diagnostic axes.

    Authors: We agree that explicit implementation details are essential for validating the isolation of text priors. In the no-audio condition, we used the exact same prompts and question phrasing as the full-audio setting for all models and benchmarks, with no modifications or additional text. Audio input was handled by completely omitting the audio modality: for models accepting separate audio features, we provided no audio encoder output; for others requiring a fixed audio tensor, we supplied a zero-valued tensor of identical shape and duration. This approach was chosen to minimize any unintended behavioral shifts. We performed internal consistency checks across models confirming that response distributions aligned with text-only inference expectations. We will add a dedicated subsection 'Implementation of the No-Audio Condition' to the Methods section, specifying the exact procedure per model (including any architecture-specific handling), the zero-tensor details, and the verification steps used to confirm no prompt alterations occurred. This revision will directly address the concern and allow readers to assess whether the reported retention rates purely reflect text priors. revision: yes
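
A minimal sketch of what the zero-tensor variant of this no-audio condition could look like, assuming a dict-style input batch and a PyTorch audio-feature tensor; the interface and key name are hypothetical, since the rebuttal does not name a framework.

    import torch

    def to_no_audio_condition(batch: dict, audio_key: str = "audio_features") -> dict:
        """Keep the text prompt untouched and silence the audio modality.

        If the model requires an audio tensor, substitute zeros of identical shape;
        if it accepts a missing modality, drop the key entirely.
        """
        masked = dict(batch)
        audio = masked.get(audio_key)
        if audio is not None:
            masked[audio_key] = torch.zeros_like(audio)  # same shape, dtype, and device
        else:
            masked.pop(audio_key, None)  # model tolerates an absent audio input
        return masked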

Circularity Check

0 steps flagged

No significant circularity in empirical ablation study

full rationale

The paper defines its diagnostic framework operationally through direct performance measurements on existing benchmarks and models (full-audio vs. no-audio conditions). Claims such as 60-72% score retention and 3.0-4.2% full-clip necessity are reported as observed empirical outcomes rather than derived via equations, fitted parameters, or self-referential constructions. No load-bearing self-citations, ansatzes, or uniqueness theorems reduce the results to inputs by definition. The analysis remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper relies on the domain assumption that existing benchmarks aim to test auditory understanding, which it then challenges empirically. No free parameters or invented entities are introduced; the new axes are diagnostic tools rather than new postulates.

axioms (1)
  • domain assumption Existing audio-language benchmarks are designed to measure genuine auditory perception rather than text-based inference
    Invoked implicitly when interpreting high no-audio scores as evidence of benchmark failure.

pith-pipeline@v0.9.0 · 5473 in / 1228 out tokens · 121327 ms · 2026-05-07T17:33:56.849128+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

53 extracted references · 20 canonical work pages · 9 internal anchors

  1. [1]

    All That Glitters Is Not Audio: Rethinking Text Priors and Audio Reliance in Audio-Language Evaluation

    Introduction The rapid development of Large Audio-Language Models (LALMs) [1–15], which extend Large Language Models (LLMs) [16, 17] with auditory perception and knowledge [18, 19], has led to consistent performance gains on speech and audio benchmarks [20–32]. These improvements are often interpreted as evidence of strong auditory understanding [33, ...

  2. [2]

    Methodology We propose two diagnostic axes for auditing audio-language benchmarks. Text prior (§2.1) quantifies how much performance is achievable without audio. Audio reliance (§2.2) measures how much of the audio signal models actually use. 2.1. Text Prior A desirable audio-language benchmark should require access to the auditory signal to be solved cor...

  3. [3]

    Benchmarks We evaluate on three public audio-language benchmarks

    Experimental Setups 3.1. Benchmarks We evaluate on three public audio-language benchmarks. MMAU [20] is a 10,000-item MCQ dataset covering sound, music, and speech; we use the 1,000-item test-mini split. Table 2: Evaluated LALMs and their text backbones. LALM Text Backbone Audio-Flamingo-3 [4] Qwen2.5-7B-Instruct [41] DeSTA-2.5 [6] Llama-3.1-8B-Instruct [16...

  4. [4]

    Results 4.1. Results on Text Prior Table 3a reports the accuracies under the Full, None, and Text Backbone (TB) settings for all models and benchmarks, together with the corresponding text-prior rate R_TP. TB accuracy substantially exceeds chance for most models across benchmarks. Averaged across models, TB surpasses chance by 12.4%, 5.4%, and 3.6% on ...

  5. [5]

    Benchmark designers should measure text prior to ensure tasks cannot be solved using text-only cues and thus genuinely assess auditory understanding

    Recommended Practices We hope this study provides practical guidance for LALM research. Benchmark designers should measure text prior to ensure tasks cannot be solved using text-only cues and thus genuinely assess auditory understanding. Model developers should also compare performance with and without audio to verify that improvements arise from audi...

  6. [6]

    Across three benchmarks and eight LALMs, models retain 60–72% of their full accuracy without audio, revealing strong text priors

    Conclusion We analyze how current audio-language benchmarks depend on audio by decomposing performance into text prior and audio reliance. Across three benchmarks and eight LALMs, models retain 60–72% of their full accuracy without audio, revealing strong text priors. Under partial audio, only 3.0–4.2% of items require cross-segment information, while mos...

  7. [7]

    Acknowledgments We acknowledge the computational and storage support provided by the National Center for High-performance Computing (NCHC) of the National Applied Research Laboratories (NARLabs) in Taiwan. This work was supported by the Ministry of Education (MOE) of Taiwan under the project Taiwan Centers of Excellence in Artificial Intelligence,...

  8. [8]

    In addition, large language models were utilized as judges for the automatic evaluation

    Generative AI Use Disclosure Generative AI tools were used in this paper solely for language polishing and writing refinement. In addition, large language models were utilized as judges for the automatic evaluation

  9. [9]

    Qwen2-Audio Technical Report

    Y. Chu, J. Xu, Q. Yang, H. Wei, X. Wei, Z. Guo, Y. Leng, Y. Lv, J. He, J. Lin, C. Zhou, and J. Zhou, “Qwen2-Audio technical report,” arXiv preprint arXiv:2407.10759, 2024. [Online]. Available: https://arxiv.org/abs/2407.10759

  10. [10]

    Qwen2.5-Omni Technical Report

    J. Xu, Z. Guo, J. He, H. Hu, T. He, S. Bai, K. Chen, J. Wang, Y. Fan, K. Dang, B. Zhang, X. Wang, Y. Chu, and J. Lin, “Qwen2.5-Omni technical report,” arXiv preprint arXiv:2503.20215, 2025. [Online]. Available: https://arxiv.org/abs/2503.20215

  11. [11]

    Qwen3-Omni Technical Report

    J. Xu, Z. Guo, H. Hu, Y. Chu, X. Wang, J. He, Y. Wang, X. Shi, T. He, X. Zhu, Y. Lv, Y. Wang, D. Guo, H. Wang, L. Ma, P. Zhang, X. Zhang, H. Hao, Z. Guo, B. Yang, B. Zhang, Z. Ma, X. Wei, S. Bai, K. Chen, X. Liu, P. Wang, M. Yang, D. Liu, X. Ren, B. Zheng, R. Men, F. Zhou, B. Yu, J. Yang, L. Yu, J. Zhou, and J. Lin, “Qwen3-Omni technical report,” arXi...

  12. [12]

    Audio flamingo 3: Advancing audio intelligence with fully open large audio language models,

    S. Ghosh, A. Goel, J. Kim, S. Kumar, Z. Kong, S.-g. Lee, C.-H. H. Yang, R. Duraiswami, D. Manocha, R. Valle, and B. Catanzaro, “Audio flamingo 3: Advancing audio intelligence with fully open large audio language models,” in The Thirty-ninth Annual Conference on Neural Information Processing Systems,

  13. [13]

    Available: https://openreview.net/forum?id=FjByDpDVIO

    [Online]. Available: https://openreview.net/forum?id=FjByDpDVIO

  14. [15]

    DeSTA2.5-Audio: Toward general-purpose large audio language model with self-generated cross-modal alignment,

    K.-H. Lu, Z. Chen, S.-W. Fu, C.-H. H. Yang, S.-F. Huang, C.-K. Yang, C.-E. Yu, C.-W. Chen, W.-C. Chen, C.-y. Huang, Y.-C. Lin, Y.-X. Lin, C.-A. Fu, C.-Y. Kuan, W. Ren, X. Chen, W.-P. Huang, E.-P. Hu, T.-Q. Lin, Y.-K. Wu, K.-P. Huang, H.-Y. Huang, H.-C. Chou, K.-W. Chang, C.-H. Chiang, B. Ginsburg, Y.-C. F. Wang, and H.-y. Lee, “DeSTA2.5-Audio: Towar...

  15. [17]
  16. [18]

    Building a taiwanese mandarin spoken language model: A first attempt,

    C.-K. Yang, Y.-K. Fu, C.-A. Li, Y.-C. Lin, Y.-X. Lin, W.-C. Chen, H. L. Chung, C.-Y. Kuan, W.-P. Huang, K.-H. Lu et al., “Building a Taiwanese Mandarin spoken language model: A first attempt,” arXiv preprint arXiv:2411.07111, 2024

  17. [19]

    Speech-copilot: Leveraging large language models for speech processing via task decomposition, modularization, and program generation,

    C.-Y. Kuan et al., “Speech-copilot: Leveraging large language models for speech processing via task decomposition, modularization, and program generation,” in 2024 IEEE Spoken Language Technology Workshop (SLT). IEEE, 2024, pp. 1060–1067

  18. [20]

    DeSTA: Enhancing Speech Language Models through Descriptive Speech-Text Alignment,

    K.-H. Lu, Z. Chen, S.-W. Fu, H. Huang, B. Ginsburg, Y.-C. F. Wang, and H.-y. Lee, “DeSTA: Enhancing Speech Language Models through Descriptive Speech-Text Alignment,” in Interspeech 2024, 2024, pp. 4159–4163

  19. [21]

    Developing instruction-following speech language model without speech instruction-tuning data,

    K.-H. Lu, Z. Chen, S.-W. Fu, C.-H. H. Yang, J. Balam, B. Ginsburg, Y.-C. F. Wang, and H.-Y. Lee, “Developing instruction-following speech language model without speech instruction-tuning data,” in ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2025, pp. 1–5

  20. [22]

    A preliminary exploration with gpt-4o voice mode,

    Y.-X. Lin, C.-K. Yang, W.-C. Chen, C.-A. Li, C.-y. Huang, X. Chen, and H.-y. Lee, “A preliminary exploration with gpt-4o voice mode,” arXiv preprint arXiv:2502.09940, 2025

  21. [23]

    Audio flamingo: A novel audio language model with few-shot learning and dialogue abilities,

    Z. Kong, A. Goel, R. Badlani, W. Ping, R. Valle, and B. Catanzaro, “Audio flamingo: A novel audio language model with few-shot learning and dialogue abilities,” in Proceedings of the 41st International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, vol. 235. PMLR, 2024, pp. 25125–25148, also available as arXiv:2402.01831. ...

  22. [24]

    SALMONN: Towards generic hearing abilities for large language models,

    C. Tang, W. Yu, G. Sun, X. Chen, T. Tan, W. Li, L. Lu, Z. Ma, and C. Zhang, “SALMONN: Towards generic hearing abilities for large language models,” in The Twelfth International Conference on Learning Representations, 2024. [Online]. Available: https://openreview.net/forum?id=14rn7HpKVk

  23. [25]

    BLSP-emo: Towards empathetic large speech-language models,

    C. Wang, M. Liao, Z. Huang, J. Wu, C. Zong, and J. Zhang, “BLSP-emo: Towards empathetic large speech-language models,” in Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. Miami, Florida, USA: Association for Computational Linguistics, Nov. 2024, pp. 19186–19199

  24. [26]

    The Llama 3 Herd of Models

    A. Grattafiori et al., “The Llama 3 herd of models,” arXiv preprint arXiv:2407.21783, 2024

  25. [27]

    Qwen3 Technical Report

    A. Yang, A. Li, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Gao, C. Huang, C. Lv et al., “Qwen3 technical report,” arXiv preprint arXiv:2505.09388, 2025

  26. [28]

    Audiolens: A closer look at auditory attribute perception of large audio-language models. arXiv preprint arXiv:2506.05140, 2025

    C.-K. Yang, N. Ho, Y.-J. Lee, and H.-y. Lee, “Audiolens: A closer look at auditory attribute perception of large audio-language models,” arXiv preprint arXiv:2506.05140, 2025

  27. [29]

    Sake: Towards editing auditory attribute knowledge of large audio-language models,

    C.-K. Yang, Y.-T. Piao, T.-W. Hsu, S.-W. Fu, Z. Chen, K.-H. Lu, S.-F. Huang, C.-H. H. Yang, Y.-C. F. Wang, Y.-N. Chen et al., “Sake: Towards editing auditory attribute knowledge of large audio-language models,” arXiv preprint arXiv:2510.16917, 2025

  28. [30]

    MMAU: A massive multi-task audio understanding and reasoning benchmark,

    S. Sakshi, U. Tyagi, S. Kumar, A. Seth, R. Selvakumar, O. Nieto, R. Duraiswami, S. Ghosh, and D. Manocha, “MMAU: A massive multi-task audio understanding and reasoning benchmark,” in The Thirteenth International Conference on Learning Representations, 2025. [Online]. Available: https://openreview.net/forum?id=TeVAZXr3yv

  29. [31]

    MMAR: A challenging benchmark for deep reasoning in speech, audio, music, and their mix,

    Z. Ma, Y. Ma, Y. Zhu, C. Yang, Y.-W. Chao, R. Xu, W. Chen, Y. Chen, Z. Chen, J. Cong et al., “MMAR: A challenging benchmark for deep reasoning in speech, audio, music, and their mix,” in The Thirty-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2025. [Online]. Available: https://openreview.net/forum?id=f...

  30. [32]

    Mmau-pro: A challenging and comprehensive benchmark for holistic evaluation of audio general intelligence,

    S. Kumar, Š. Sedláček, V. Lokegaonkar, F. López, W. Yu, N. Anand, H. Ryu, L. Chen, M. Plička, M. Hlaváček et al., “Mmau-pro: A challenging and comprehensive benchmark for holistic evaluation of audio general intelligence,” arXiv preprint arXiv:2508.13992, 2025

  31. [33]

    AudioBench: A universal benchmark for audio large language models,

    B. Wang, X. Zou, G. Lin, S. Sun, Z. Liu, W. Zhang, Z. Liu, A. Aw, and N. F. Chen, “AudioBench: A universal benchmark for audio large language models,” in Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), L. Chiruzzo, A. Ritter, and...

  32. [34]

    AIR-bench: Benchmarking large audio-language models via generative comprehension,

    Q. Yang, J. Xu, W. Liu, Y. Chu, Z. Jiang, X. Zhou, Y. Leng, Y. Lv, Z. Zhao, C. Zhou, and J. Zhou, “AIR-bench: Benchmarking large audio-language models via generative comprehension,” in Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), L.-W. Ku, A. Martins, and V. Srikumar, Eds. Bangkok, Tha...

  33. [35]

    Dynamic-superb: Towards a dynamic, collaborative, and comprehensive instruction-tuning benchmark for speech,

    C.-y. Huang, K.-H. Lu, S.-H. Wang, C.-Y. Hsiao, C.-Y. Kuan, H. Wu, S. Arora, K.-W. Chang, J. Shi, Y. Peng et al., “Dynamic-superb: Towards a dynamic, collaborative, and comprehensive instruction-tuning benchmark for speech,” in ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2024, pp. 12136–12140

  34. [36]

    Dynamic-superb phase-2: A collaboratively expanding benchmark for measuring the capabilities of spoken language models with 180 tasks,

    C.-y. Huang, W.-C. Chen, S.-w. Yang, A. T. Liu, C.-A. Li, Y.-X. Lin, W.-C. Tseng, A. Diwan, Y.-J. Shih, J. Shi et al., “Dynamic-superb phase-2: A collaboratively expanding benchmark for measuring the capabilities of spoken language models with 180 tasks,” in The Thirteenth International Conference on Learning Representations, 2025. [Online]. Available: h...

  35. [37]

    SAKURA: On the Multi-hop Reasoning of Large Audio-Language Models Based on Speech and Audio Information,

    C.-K. Yang, N. Ho, Y.-T. Piao, and H.-y. Lee, “SAKURA: On the Multi-hop Reasoning of Large Audio-Language Models Based on Speech and Audio Information,” in Interspeech 2025, 2025, pp. 1788–1792

  36. [38]

    Listen and speak fairly: a study on semantic gender bias in speech integrated large language models,

    Y.-C. Lin, T.-Q. Lin, C.-K. Yang, K.-H. Lu, W.-C. Chen, C.-Y. Kuan, and H.-Y. Lee, “Listen and speak fairly: a study on semantic gender bias in speech integrated large language models,” in 2024 IEEE Spoken Language Technology Workshop (SLT), 2024, pp. 439–446

  37. [39]

    Speech-IFEval: Evaluating Instruction-Following and Quantifying Catastrophic Forgetting in Speech-Aware Language Models,

    K.-H. Lu, C.-Y. Kuan, and H.-y. Lee, “Speech-IFEval: Evaluating Instruction-Following and Quantifying Catastrophic Forgetting in Speech-Aware Language Models,” in Interspeech 2025, 2025, pp. 2078–2082

  38. [40]

    Mugen: Evaluating and improving multi-audio understanding of large audio-language models,

    C.-K. Yang, Y.-S. Tsai, Y.-K. Guo, P.-L. Tsai, Y.-T. Piao, H.-W. Chen, T.-L. Hsiao, Y.-M. Hsu, K.-H. Lu, and H.-y. Lee, “Mugen: Evaluating and improving multi-audio understanding of large audio-language models,” arXiv preprint arXiv:2603.09714, 2026

  39. [41]

    Speechr: A benchmark for speech reasoning in large audio-language models,

    W. Yang, Y. Li, Y. Wei, M. Fang, and L. Chen, “Speechr: A benchmark for speech reasoning in large audio-language models,” arXiv preprint arXiv:2508.02018, 2025

  40. [42]

    VoxEval: Benchmarking the knowledge understanding capabilities of end-to-end spoken language models,

    W. Cui, X. Jiao, Z. Meng, and I. King, “VoxEval: Benchmarking the knowledge understanding capabilities of end-to-end spoken language models,” in Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), W. Che, J. Nabende, E. Shutova, and M. T. Pilehvar, Eds. Vienna, Austria: Association for Computatio...

  41. [43]

    On the landscape of spoken language models: A comprehensive survey,

    S. Arora, K.-W. Chang, C.-M. Chien, Y. Peng, H. Wu, Y. Adi, E. Dupoux, H.-y. Lee, K. Livescu, and S. Watanabe, “On the landscape of spoken language models: A comprehensive survey,” Transactions on Machine Learning Research, 2025. [Online]. Available: https://openreview.net/forum?id=BvxaP3sVbA

  42. [44]

    Towards holistic evaluation of large audio-language models: A comprehensive survey,

    C.-K. Yang, N. S. Ho, and H.-y. Lee, “Towards holistic evaluation of large audio-language models: A comprehensive survey,” in Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, C. Christodoulopoulos, T. Chakraborty, C. Rose, and V. Peng, Eds. Suzhou, China: Association for Computational Linguistics, Nov. 2025, pp. 10 1...

  43. [45]

    Hypothesis only baselines in natural language inference,

    A. Poliak, J. Naradowsky, A. Haldar, R. Rudinger, and B. Van Durme, “Hypothesis only baselines in natural language inference,” in Proceedings of the Seventh Joint Conference on Lexical and Computational Semantics, M. Nissim, J. Berant, and A. Lenci, Eds. New Orleans, Louisiana: Association for Computational Linguistics, Jun. 2018, pp. 180–191. [Online]. Av...

  44. [46]

    Annotation artifacts in natural language inference data,

    S. Gururangan, S. Swayamdipta, O. Levy, R. Schwartz, S. Bowman, and N. A. Smith, “Annotation artifacts in natural language inference data,” in Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), M. Walker, H. Ji, and A. Stent, Eds. New Orlea...

  45. [47]

    Making the v in vqa matter: Elevating the role of image understanding in visual question answering,

    Y. Goyal, T. Khot, D. Summers-Stay, D. Batra, and D. Parikh, “Making the v in vqa matter: Elevating the role of image understanding in visual question answering,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017

  46. [48]

    Are we on the right way for evaluating large vision-language models?

    L. Chen, J. Li, X. Dong, P. Zhang, Y. Zang, Z. Chen, H. Duan, J. Wang, Y. Qiao, D. Lin, and F. Zhao, “Are we on the right way for evaluating large vision-language models?” in Advances in Neural Information Processing Systems, A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, Eds., vol. 37. Curran Associates, Inc., 2024, p...

  47. [49]

    When Silence Matters: The Impact of Irrelevant Audio on Text Reasoning in Large Audio-Language Models

    C.-A. Li, T.-H. Lin, and H.-y. Lee, “When silence matters: The impact of irrelevant audio on text reasoning in large audio-language models,” arXiv preprint arXiv:2510.00626, 2025

  48. [50]

    Measuring audio’s impact on correctness: Audio-contribution-aware post-training of large audio language models,

    H. He, X. Du, R. Sun, Z. Dai, Y. Xiao, M. Yang, J. Zhou, X. Li, Z. Liu, Z. Liang, C. Wu, Q. He, T. Lee, X. Chen, W.-L. Zheng, W. Wang, M. D. Plumbley, J. Liu, and Q. Kong, “Measuring audio’s impact on correctness: Audio-contribution-aware post-training of large audio language models,” in The Fourteenth International Conference on Learning Representations,...

  49. [51]

    Qwen2.5 technical report,

    A. Y. Qwen, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Li, D. Liu, F. Huang, H. Wei et al., “Qwen2.5 technical report,” arXiv preprint, 2024

  50. [52]

    Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs

    A. Abouelenin, A. Ashfaq, A. Atkinson, H. Awadalla, N. Bach, J. Bao, A. Benhaim, M. Cai, V. Chaudhary, C. Chen et al., “Phi-4-mini technical report: Compact yet powerful multimodal language models via mixture-of-loras,” arXiv preprint arXiv:2503.01743, 2025

  51. [53]

    Qwen Technical Report

    J. Bai, S. Bai, Y. Chu, Z. Cui, K. Dang, X. Deng, Y. Fan, W. Ge, Y. Han, F. Huang et al., “Qwen technical report,” arXiv preprint arXiv:2309.16609, 2023

  52. [54]

    Un ministral, des ministraux,

    Mistral AI, “Un ministral, des ministraux,” 2024. [Online]. Available: https://mistral.ai/news/ministraux

  53. [55]

    Claude 4.5 haiku (version 20251001),

    Anthropic, “Claude 4.5 haiku (version 20251001),” https://www.anthropic.com/news/claude-4-5-haiku, 2025, model version: claude-haiku-4-5-20251001. Accessed 2026-03-05