pith. sign in

arxiv: 2606.08194 · v1 · pith:SXKLMDAInew · submitted 2026-06-06 · 💻 cs.CL · cs.AI

GlobeAudio: A Multilingual Multicultural Benchmark for Naturalistic Evaluation of Large Audio-Language Models

Pith reviewed 2026-06-27 19:50 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords multilingual benchmarkaudio-language modelsnaturalistic evaluationaudio understandinglow-resource languagescultural interpretationperformance gaps
0
0 comments X

The pith

A new benchmark of 5637 natural audio questions shows large audio-language models have substantial performance gaps, especially open-source models and low-resource languages.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper creates GlobeAudio, a set of 5637 multiple-choice questions in six languages written by native speakers using real-world audio recordings. To answer correctly, a model needs higher-level auditory reasoning and culturally grounded interpretation instead of simple pattern matching. Tests of closed-source and open-source large audio-language models plus cascaded pipelines reveal clear drops in accuracy under these realistic conditions. The gaps are widest for open-source systems and languages with fewer resources. This matters because most prior tests rely on artificial audio that does not match everyday use.

Core claim

GlobeAudio consists of 5,637 multiple-choice questions across six typologically diverse languages, expertly crafted by native speakers grounded on naturally occurring audio. In order to do well, models must possess higher-level auditory reasoning skills and culturally grounded interpretation. Systematic evaluation of representative closed-source and open-source LALMs, as well as cascaded ASR-LLM pipelines, reveals substantial performance gaps under natural acoustic conditions, particularly for open-source models and low-resource languages.

What carries the argument

GlobeAudio benchmark, a collection of native-speaker questions on naturally occurring audio that tests higher-level auditory reasoning and cultural interpretation in audio-language models.

If this is right

  • Open-source audio-language models show larger accuracy drops than closed-source ones on natural audio.
  • Low-resource languages exhibit the widest performance shortfalls in the evaluated systems.
  • Cascaded ASR-LLM pipelines share the same limitations as unified models under realistic acoustic conditions.
  • Future audio-language systems will need training and testing that incorporate authentic acoustic variability and cultural context.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Training data for these models may need to shift toward more diverse, naturally recorded multilingual audio to close the observed gaps.
  • The benchmark could serve as a diagnostic tool to identify whether models rely on surface-level cues or deeper acoustic-cultural integration.
  • Similar naturalistic tests might be developed for other modalities to check whether current evaluation practices overstate model capabilities.

Load-bearing premise

That questions crafted by native speakers from real audio truly require higher-level auditory reasoning and cultural interpretation rather than simpler pattern matching.

What would settle it

If future models achieve near-ceiling accuracy on GlobeAudio yet still fail on everyday audio tasks outside the benchmark, the claim that these questions measure the intended skills would be undermined.

Figures

Figures reproduced from arXiv: 2606.08194 by Ryner Tan, Wenxuan Zhang.

Figure 1
Figure 1. Figure 1: Examples of GLOBEAUDIO, which require models to first unpack compound auditory tasks through multiple reasoning steps. LALMs unify audio processing and language un￾derstanding, enabling tasks like spoken QA and speech translation for diverse applications such as virtual assistants and educational tools (Su et al., 2025; Deshmukh et al., 2023). Driven by their rapid adoption, the community has introduced ev… view at source ↗
Figure 2
Figure 2. Figure 2: GLOBEAUDIO’s data preparation pipeline, from curation to rigorous annotation and iterative cross￾annotator quality checks, culminating in a high-fidelity dataset. 2 GLOBEAUDIO Dataset 2.1 Design Considerations To provide a realistic reflection of real-world per￾formance when LALMs are exposed to the acoustic and sociolinguistic conditions encountered in-the￾Wild, we design GLOBEAUDIO with three core princi… view at source ↗
Figure 3
Figure 3. Figure 3: Accuracy of Qwen3-Omni-Flash on various settings, from Audio and Transcript input, to "Blind" which denotes model without audio or transcript input. ing as our primary evaluation pipeline. • Transcript: To decouple linguistic semantics from acoustic nuances, the audio input is re￾moved and reduced to its text transcript. • Blind: To detect guessing heuristics and re￾liance on prior world knowledge, the mod… view at source ↗
Figure 4
Figure 4. Figure 4: Analysis on varying language of transcripts in [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗
Figure 8
Figure 8. Figure 8: An example of a Chinese question from GLOBEAUDIO [PITH_FULL_IMAGE:figures/full_fig_p011_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: An example of a Russian question from GLOBEAUDIO [PITH_FULL_IMAGE:figures/full_fig_p011_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: An example of a Singlish question from GLOBEAUDIO [PITH_FULL_IMAGE:figures/full_fig_p011_10.png] view at source ↗
Figure 7
Figure 7. Figure 7: An example of a Thai question from GLOBEAUDIO [PITH_FULL_IMAGE:figures/full_fig_p011_7.png] view at source ↗
Figure 5
Figure 5. Figure 5: Prompts for Open and Closed-Source Models in various languages [PITH_FULL_IMAGE:figures/full_fig_p012_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Prompts for Cascade pipelines in various languages [PITH_FULL_IMAGE:figures/full_fig_p013_6.png] view at source ↗
Figure 12
Figure 12. Figure 12: YouTube downloader query terms 13 [PITH_FULL_IMAGE:figures/full_fig_p013_12.png] view at source ↗
read the original abstract

Large Audio-Language Models (LALMs) integrate audio perception and language understanding within a unified framework, enabling a wide range of real-world applications. Despite recent advances, evaluation for LALMs remains heavily underspecified relative to real-world requirements: most lack true linguistic and cultural authenticity, while others fail to capture acoustic realism. To bridge this gap, we propose GlobeAudio, a multilingual and multicultural benchmark designed to evaluate naturalistic audio understanding. GlobeAudio consists of 5,637 multiple-choice questions across six typologically diverse languages, expertly crafted by native speakers grounded on naturally occurring audio. In order to do well, models must possess higher-level auditory reasoning skills and culturally grounded interpretation. We systematically evaluate representative closed-source and open-source LALMs, as well as cascaded ASR-LLM pipelines. Our experiments reveal substantial performance gaps under natural acoustic conditions, particularly for open-source models and low-resource languages. These findings highlight critical limitations of current LALMs and underscore the importance of naturalistic audio evaluation for future audio-language systems. GlobeAudio can be found at https://huggingface.co/datasets/iNLP-Lab/GlobeAudio .

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper introduces GlobeAudio, a multilingual multicultural benchmark with 5,637 multiple-choice questions across six typologically diverse languages. Questions are created by native speakers from naturally occurring audio, with the claim that success requires higher-level auditory reasoning and culturally grounded interpretation rather than simpler pattern matching. The authors evaluate closed-source and open-source LALMs plus cascaded ASR-LLM pipelines, reporting substantial performance gaps under natural acoustic conditions, especially for open-source models and low-resource languages. The dataset is released on Hugging Face.

Significance. A well-validated naturalistic benchmark could meaningfully advance evaluation standards for audio-language models by exposing gaps not captured in existing synthetic or text-heavy tests. The public release of the dataset supports reproducibility and follow-on work. However, the significance is currently limited by the absence of evidence that the questions actually measure the claimed higher-level skills.

major comments (3)
  1. [Benchmark construction] Benchmark construction (likely §3 or equivalent): The central claim that 'in order to do well, models must possess higher-level auditory reasoning skills and culturally grounded interpretation' is not supported by any reported validation. No experiments compare model performance on audio vs. text-only versions of the questions, no controls test whether ASR transcription alone suffices, and no checks rule out keyword matching or other low-level strategies. This directly undermines the performance-gap findings.
  2. [Evaluation and results] Evaluation and results sections: The abstract and high-level description supply no quantitative results, error analysis, per-language breakdowns, or statistical significance tests for the claimed 'substantial performance gaps.' Without these, it is impossible to assess whether the gaps are robust or driven by a subset of questions.
  3. [Question validation] Question validation: No details are provided on inter-annotator agreement, pilot testing, or exclusion criteria used by native speakers to ensure questions require cultural interpretation rather than surface-level acoustic or linguistic cues.
minor comments (1)
  1. [Abstract] The abstract states the dataset URL but the manuscript should include a brief data card or statistics table (e.g., questions per language, audio duration distribution) for immediate clarity.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive comments on our submission. We address each of the major comments point-by-point below, indicating the revisions we plan to make to strengthen the manuscript.

read point-by-point responses
  1. Referee: [Benchmark construction] Benchmark construction (likely §3 or equivalent): The central claim that 'in order to do well, models must possess higher-level auditory reasoning skills and culturally grounded interpretation' is not supported by any reported validation. No experiments compare model performance on audio vs. text-only versions of the questions, no controls test whether ASR transcription alone suffices, and no checks rule out keyword matching or other low-level strategies. This directly undermines the performance-gap findings.

    Authors: We agree that additional validation would strengthen our claims. The benchmark was constructed by native speakers selecting questions that require cultural knowledge and auditory reasoning from natural audio clips, as opposed to synthetic or text-based tests. However, we did not include audio-versus-text comparisons or explicit low-level strategy controls in the current version. We will revise Section 3 to include a more detailed justification of the question design and add an analysis of model performance on transcribed text versions of a subset of questions to address this concern. revision: yes

  2. Referee: [Evaluation and results] Evaluation and results sections: The abstract and high-level description supply no quantitative results, error analysis, per-language breakdowns, or statistical significance tests for the claimed 'substantial performance gaps.' Without these, it is impossible to assess whether the gaps are robust or driven by a subset of questions.

    Authors: The full manuscript in Section 4 presents detailed quantitative results, including per-language performance tables, error analysis, and statistical tests. The abstract summarizes the key findings at a high level following standard practices for brevity. To improve clarity, we will incorporate specific quantitative highlights and mention the availability of detailed breakdowns in the abstract during revision. revision: yes

  3. Referee: [Question validation] Question validation: No details are provided on inter-annotator agreement, pilot testing, or exclusion criteria used by native speakers to ensure questions require cultural interpretation rather than surface-level acoustic or linguistic cues.

    Authors: Section 3 describes the question creation process by native speakers, but we acknowledge that more explicit details on validation steps such as inter-annotator agreement and pilot testing would be beneficial. We will expand this section in the revised manuscript to include these details, including any agreement metrics and criteria used for question inclusion. revision: yes

Circularity Check

0 steps flagged

No circularity: benchmark paper with no derivations or self-referential predictions

full rationale

The paper introduces GlobeAudio as a benchmark consisting of 5,637 questions crafted by native speakers on naturally occurring audio. Its central claim—that models must possess higher-level auditory reasoning and cultural interpretation to perform well—is presented as definitional to the benchmark's design rather than derived from any equations, fitted parameters, or prior results. No mathematical derivations, uniqueness theorems, self-citations as load-bearing premises, or renamings of known results appear in the provided text. Empirical evaluations of LALMs are reported as direct measurements on the new dataset and are independent of the construction process. This matches the default expectation for a benchmark introduction paper, warranting a score of 0 with no circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption that native-speaker curation of questions from natural audio produces a valid test of higher-level reasoning; no free parameters or invented entities are introduced.

axioms (1)
  • domain assumption Native speaker expertise ensures cultural authenticity and higher-level reasoning demands in question design
    Stated directly in the abstract as the basis for the benchmark's validity

pith-pipeline@v0.9.1-grok · 5728 in / 1137 out tokens · 16362 ms · 2026-06-27T19:50:25.083953+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

84 extracted references · 6 canonical work pages · 1 internal anchor

  1. [1]

    Aho and Jeffrey D

    Alfred V. Aho and Jeffrey D. Ullman , title =. 1972

  2. [2]

    Publications Manual , year = "1983", publisher =

  3. [3]

    Chandra and Dexter C

    Ashok K. Chandra and Dexter C. Kozen and Larry J. Stockmeyer , year = "1981", title =. doi:10.1145/322234.322243

  4. [4]

    Scalable training of

    Andrew, Galen and Gao, Jianfeng , booktitle=. Scalable training of

  5. [5]

    Dan Gusfield , title =. 1997

  6. [6]

    Tetreault , title =

    Mohammad Sadegh Rasooli and Joel R. Tetreault , title =. Computing Research Repository , volume =. 2015 , url =

  7. [7]

    A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data , Volume =

    Ando, Rie Kubota and Zhang, Tong , Issn =. A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data , Volume =. Journal of Machine Learning Research , Month = dec, Numpages =

  8. [9]

    2024 , eprint=

    Qwen2-Audio Technical Report , author=. 2024 , eprint=

  9. [10]

    2024 , url=

    Changli Tang and Wenyi Yu and Guangzhi Sun and Xianzhao Chen and Tian Tan and Wei Li and Lu Lu and Zejun MA and Chao Zhang , booktitle=. 2024 , url=

  10. [11]

    2024 , eprint=

    DeSTA: Enhancing Speech Language Models through Descriptive Speech-Text Alignment , author=. 2024 , eprint=

  11. [12]

    2024 , eprint=

    Moshi: a speech-text foundation model for real-time dialogue , author=. 2024 , eprint=

  12. [13]

    2025 , url=

    Qingkai Fang and Shoutao Guo and Yan Zhou and Zhengrui Ma and Shaolei Zhang and Yang Feng , booktitle=. 2025 , url=

  13. [15]

    2025 , eprint=

    Canary-1B-v2 & Parakeet-TDT-0.6B-v3: Efficient and High-Performance Models for Multilingual ASR and AST , author=. 2025 , eprint=

  14. [16]

    2025 , eprint=

    Flavors of Moonshine: Tiny Specialized ASR Models for Edge Devices , author=. 2025 , eprint=

  15. [17]

    2025 , eprint=

    Samba-ASR: State-Of-The-Art Speech Recognition Leveraging Structured State-Space Models , author=. 2025 , eprint=

  16. [18]

    2025 , eprint=

    Voxtral , author=. 2025 , eprint=

  17. [19]

    2024 , eprint=

    GPT-4o System Card , author=. 2024 , eprint=

  18. [20]

    2025 , eprint=

    Audio-Language Models for Audio-Centric Tasks: A survey , author=. 2025 , eprint=

  19. [21]

    Pengi: An Audio Language Model for Audio Tasks , url =

    Deshmukh, Soham and Elizalde, Benjamin and Singh, Rita and Wang, Huaming , booktitle =. Pengi: An Audio Language Model for Audio Tasks , url =

  20. [22]

    2025 , eprint=

    AudioBench: A Universal Benchmark for Audio Large Language Models , author=. 2025 , eprint=

  21. [23]

    2024 , eprint=

    AIR-Bench: Benchmarking Large Audio-Language Models via Generative Comprehension , author=. 2024 , eprint=

  22. [24]

    2024 , eprint=

    MMAU: A Massive Multi-Task Audio Understanding and Reasoning Benchmark , author=. 2024 , eprint=

  23. [25]

    Massive Sound Embedding Benchmark (

    Georg Heigold and Ehsan Variani and Tom Bagby and Cyril Allauzen and Ji Ma and Shankar Kumar and Michael Riley , booktitle=. Massive Sound Embedding Benchmark (. 2025 , url=

  24. [26]

    2025 , eprint=

    Fleurs-SLU: A Massively Multilingual Benchmark for Spoken Language Understanding , author=. 2025 , eprint=

  25. [27]

    2025 , eprint=

    mSTEB: Massively Multilingual Evaluation of LLMs on Speech and Text Tasks , author=. 2025 , eprint=

  26. [28]

    2025 , eprint=

    AudioMarathon: A Comprehensive Benchmark for Long-Context Audio Understanding and Efficiency in Audio LLMs , author=. 2025 , eprint=

  27. [29]

    2025 , eprint=

    Can Large Audio Language Models Understand Audio Well? Speech, Scene and Events Understanding Benchmark for LALMs , author=. 2025 , eprint=

  28. [30]

    2025 , eprint=

    Global MMLU: Understanding and Addressing Cultural and Linguistic Biases in Multilingual Evaluation , author=. 2025 , eprint=

  29. [31]

    Not All Languages Are Created Equal in

    Haoyang Huang and Tianyi Tang and Dongdong Zhang and Xin Zhao and Ting Song and Yan Xia and Furu Wei , booktitle=. Not All Languages Are Created Equal in. 2023 , url=

  30. [32]

    C hat GPT Beyond E nglish: Towards a Comprehensive Evaluation of Large Language Models in Multilingual Learning

    Lai, Viet Dac and Ngo, Nghia and Pouran Ben Veyseh, Amir and Man, Hieu and Dernoncourt, Franck and Bui, Trung and Nguyen, Thien Huu. C hat GPT Beyond E nglish: Towards a Comprehensive Evaluation of Large Language Models in Multilingual Learning. Findings of the Association for Computational Linguistics: EMNLP 2023. 2023. doi:10.18653/v1/2023.findings-emnlp.878

  31. [33]

    2025 , eprint=

    EverydayMMQA: A Multilingual and Multimodal Framework for Culturally Grounded Spoken Visual QA , author=. 2025 , eprint=

  32. [35]

    2024 , eprint=

    MERaLiON-AudioLLM: Bridging Audio and Language with Large Language Models , author=. 2024 , eprint=

  33. [36]

    2025 , eprint=

    Audio Flamingo 3: Advancing Audio Intelligence with Fully Open Large Audio Language Models , author=. 2025 , eprint=

  34. [37]

    2025 , eprint=

    Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities , author=. 2025 , eprint=

  35. [38]

    2025 , eprint=

    Qwen3-Omni Technical Report , author=. 2025 , eprint=

  36. [39]

    2022 , eprint=

    Robust Speech Recognition via Large-Scale Weak Supervision , author=. 2022 , eprint=

  37. [40]

    2025 , eprint=

    Qwen3 Technical Report , author=. 2025 , eprint=

  38. [41]

    2025 , eprint=

    IFEval-Audio: Benchmarking Instruction-Following Capability in Audio-based Large Language Models , author=. 2025 , eprint=

  39. [42]

    2025 , url=

    Xize Cheng and Dongjie Fu and Chenyuhao Wen and Shannon Yu and Zehan Wang and Shengpeng Ji and Siddhant Arora and Tao Jin and Shinji Watanabe and Zhou Zhao , booktitle=. 2025 , url=

  40. [43]

    2025 , eprint=

    SpeechR: A Benchmark for Speech Reasoning in Large Audio-Language Models , author=. 2025 , eprint=

  41. [44]

    2025 , eprint=

    MMAR: A Challenging Benchmark for Deep Reasoning in Speech, Audio, Music, and Their Mix , author=. 2025 , eprint=

  42. [46]

    2025 , eprint=

    WildSpeech-Bench: Benchmarking End-to-End SpeechLLMs in the Wild , author=. 2025 , eprint=

  43. [47]

    2026 , eprint=

    Measuring Prosody Diversity in Zero-Shot TTS: A New Metric, Benchmark, and Exploration , author=. 2026 , eprint=

  44. [48]

    2024 , eprint=

    Speech-MASSIVE: A Multilingual Speech Dataset for SLU and Beyond , author=. 2024 , eprint=

  45. [49]

    2022 , eprint=

    XTREME-S: Evaluating Cross-lingual Speech Representations , author=. 2022 , eprint=

  46. [50]

    Arid Hasan, Zien Sheikh Ali, Hunzalah Hassan Bhatti, Mohamed Bayan Kmainasi, Shammur Absar Chowdhury, Basel Mousi, Fahim Dalvi, Nadir Durrani, and Natasa Milic-Frayling

    Firoj Alam, Ali Ezzat Shahroor, Md. Arid Hasan, Zien Sheikh Ali, Hunzalah Hassan Bhatti, Mohamed Bayan Kmainasi, Shammur Absar Chowdhury, Basel Mousi, Fahim Dalvi, Nadir Durrani, and Natasa Milic-Frayling. 2025. https://arxiv.org/abs/2510.06371 Everydaymmqa: A multilingual and multimodal framework for culturally grounded spoken visual qa . Preprint, arXiv...

  47. [51]

    Alabi, Fabian David Schmidt, Joyce Nakatumba-Nabende, and David Ifeoluwa Adelani

    Luel Hagos Beyene, Vivek Verma, Min Ma, Jesujoba O. Alabi, Fabian David Schmidt, Joyce Nakatumba-Nabende, and David Ifeoluwa Adelani. 2025. https://arxiv.org/abs/2506.08400 msteb: Massively multilingual evaluation of llms on speech and text tasks . Preprint, arXiv:2506.08400

  48. [52]

    Xize Cheng, Dongjie Fu, Chenyuhao Wen, Shannon Yu, Zehan Wang, Shengpeng Ji, Siddhant Arora, Tao Jin, Shinji Watanabe, and Zhou Zhao. 2025. https://openreview.net/forum?id=vCej5sO61x AH a-bench: Benchmarking audio hallucinations in large audio-language models . In The Thirty-ninth Annual Conference on Neural Information Processing Systems Datasets and Ben...

  49. [53]

    Yunfei Chu, Jin Xu, Qian Yang, Haojie Wei, Xipin Wei, Zhifang Guo, Yichong Leng, Yuanjun Lv, Jinzheng He, Junyang Lin, Chang Zhou, and Jingren Zhou. 2024. https://arxiv.org/abs/2407.10759 Qwen2-audio technical report . Preprint, arXiv:2407.10759

  50. [54]

    Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, Luke Marris, Sam Petulla, Colin Gaffney, Asaf Aharoni, Nathan Lintz, Tiago Cardal Pais, Henrik Jacobsson, Idan Szpektor, Nan-Jiang Jiang, and 3416 others. 2025. https://arxiv.org/abs/2507.06261 Gemini 2.5: Pus...

  51. [55]

    Clark, Orhan Firat, Michael Auli, Sebastian Ruder, Jason Riesa, and Melvin Johnson

    Alexis Conneau, Ankur Bapna, Yu Zhang, Min Ma, Patrick von Platen, Anton Lozhkov, Colin Cherry, Ye Jia, Clara Rivera, Mihir Kale, Daan Van Esch, Vera Axelrod, Simran Khanuja, Jonathan H. Clark, Orhan Firat, Michael Auli, Sebastian Ruder, Jason Riesa, and Melvin Johnson. 2022. https://arxiv.org/abs/2203.10752 Xtreme-s: Evaluating cross-lingual speech repre...

  52. [56]

    Soham Deshmukh, Benjamin Elizalde, Rita Singh, and Huaming Wang. 2023. https://proceedings.neurips.cc/paper_files/paper/2023/file/3a2e5889b4bbef997ddb13b55d5acf77-Paper-Conference.pdf Pengi: An audio language model for audio tasks . In Advances in Neural Information Processing Systems, volume 36, pages 18090--18108. Curran Associates, Inc

  53. [57]

    Qingkai Fang, Shoutao Guo, Yan Zhou, Zhengrui Ma, Shaolei Zhang, and Yang Feng. 2025. https://openreview.net/forum?id=PYmrUQmMEw LL a MA -omni: Seamless speech interaction with large language models . In The Thirteenth International Conference on Learning Representations

  54. [58]

    Kuofeng Gao, Shu-Tao Xia, Ke Xu, Philip Torr, and Jindong Gu. 2025 a . https://doi.org/10.18653/v1/2025.acl-long.237 Benchmarking open-ended audio dialogue understanding for large audio-language models . In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 4763--4784, Vienna, Austria. As...

  55. [59]

    Yiming Gao, Bin Wang, Chengwei Wei, Shuo Sun, and AiTi Aw. 2025 b . https://arxiv.org/abs/2505.16774 Ifeval-audio: Benchmarking instruction-following capability in audio-based large language models . Preprint, arXiv:2505.16774

  56. [60]

    Arushi Goel, Sreyan Ghosh, Jaehyeon Kim, Sonal Kumar, Zhifeng Kong, Sang gil Lee, Chao-Han Huck Yang, Ramani Duraiswami, Dinesh Manocha, Rafael Valle, and Bryan Catanzaro. 2025. https://arxiv.org/abs/2507.08128 Audio flamingo 3: Advancing audio intelligence with fully open large audio language models . Preprint, arXiv:2507.08128

  57. [61]

    Google. 2025 a . Gemini 2.0 Flash Model Card . https://modelcards.withgoogle.com/assets/documents/gemini-2-flash.pdf. [Online]

  58. [62]

    Google. 2025 b . Gemma 3n Model Overview . https://ai.google.dev/gemma/docs/gemma-3n. [Online]

  59. [63]

    Google. 2026. Gemini 3.1 Pro Model Card . https://storage.googleapis.com/deepmind-media/Model-Cards/Gemini-3-1-Pro-Model-Card.pdf. [Online]

  60. [64]

    Peize He, Zichen Wen, Yubo Wang, Yuxuan Wang, Xiaoqian Liu, Jiajie Huang, Zehui Lei, Zhuangcheng Gu, Xiangqi Jin, Jiabing Yang, Kai Li, Zhifei Liu, Weijia Li, Cunxiang Wang, Conghui He, and Linfeng Zhang. 2025. https://arxiv.org/abs/2510.07293 Audiomarathon: A comprehensive benchmark for long-context audio understanding and efficiency in audio llms . Prep...

  61. [65]

    Georg Heigold, Ehsan Variani, Tom Bagby, Cyril Allauzen, Ji Ma, Shankar Kumar, and Michael Riley. 2025. https://openreview.net/forum?id=X0juYgFVng Massive sound embedding benchmark ( MSEB ) . In The Thirty-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track

  62. [66]

    Evan King, Adam Sabra, Manjunath Kudlur, James Wang, and Pete Warden. 2025. https://arxiv.org/abs/2509.02523 Flavors of moonshine: Tiny specialized asr models for edge devices . Preprint, arXiv:2509.02523

  63. [67]

    Kushal Lakhotia, Eugene Kharitonov, Wei-Ning Hsu, Yossi Adi, Adam Polyak, Benjamin Bolte, Tu-Anh Nguyen, Jade Copet, Alexei Baevski, Abdelrahman Mohamed, and Emmanuel Dupoux. 2021. https://doi.org/10.1162/tacl_a_00430 On generative spoken language modeling from raw audio . Transactions of the Association for Computational Linguistics, 9:1336--1354

  64. [68]

    Beomseok Lee, Ioan Calapodescu, Marco Gaido, Matteo Negri, and Laurent Besacier. 2024. https://arxiv.org/abs/2408.03900 Speech-massive: A multilingual speech dataset for slu and beyond . Preprint, arXiv:2408.03900

  65. [69]

    Alexander H. Liu, Andy Ehrenberg, Andy Lo, Clément Denoix, Corentin Barreau, Guillaume Lample, Jean-Malo Delignon, Khyathi Raghavi Chandu, Patrick von Platen, Pavankumar Reddy Muddireddy, Sanchit Gandhi, Soham Ghosh, Srijan Mishra, Thomas Foubert, Abhinav Rastogi, Adam Yang, Albert Q. Jiang, Alexandre Sablayrolles, Amélie Héliou, and 87 others. 2025. http...

  66. [70]

    MERaLiON Team . 2024. https://arxiv.org/abs/2412.09818 Meralion-audiollm: Bridging audio and language with large language models . Preprint, arXiv:2412.09818

  67. [71]

    Arsha Nagrani, Joon Son Chung, Weidi Xie, and Andrew Zisserman. 2020. https://doi.org/10.1016/j.csl.2019.101027 Voxceleb: Large-scale speaker verification in the wild . Computer Speech & Language, 60:101027

  68. [72]

    OpenAI, :, Aaron Hurst, Adam Lerer, Adam P. Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, Aleksander Mądry, Alex Baker-Whitcomb, Alex Beutel, Alex Borzunov, Alex Carney, Alex Chow, Alex Kirillov, and 401 others. 2024. https://arxiv.org/abs/2410.21276 Gpt-4o system card . Preprint, arXiv:2410.21276

  69. [73]

    Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. 2022. https://doi.org/10.48550/ARXIV.2212.04356 Robust speech recognition via large-scale weak supervision . arXiv preprint

  70. [74]

    S Sakshi, Utkarsh Tyagi, Sonal Kumar, Ashish Seth, Ramaneswaran Selvakumar, Oriol Nieto, Ramani Duraiswami, Sreyan Ghosh, and Dinesh Manocha. 2024. https://arxiv.org/abs/2410.19168 Mmau: A massive multi-task audio understanding and reasoning benchmark . Preprint, arXiv:2410.19168

  71. [75]

    Fabian David Schmidt, Ivan Vulić, Goran Glavaš, and David Ifeoluwa Adelani. 2025. https://arxiv.org/abs/2501.06117 Fleurs-slu: A massively multilingual benchmark for spoken language understanding . Preprint, arXiv:2501.06117

  72. [76]

    Monica Sekoyan, Nithin Rao Koluguri, Nune Tadevosyan, Piotr Zelasko, Travis Bartley, Nikolay Karpov, Jagadeesh Balam, and Boris Ginsburg. 2025. https://arxiv.org/abs/2509.14128 Canary-1b-v2 & parakeet-tdt-0.6b-v3: Efficient and high-performance models for multilingual asr and ast . Preprint, arXiv:2509.14128

  73. [77]

    Syed Abdul Gaffar Shakhadri, Kruthika KR, and Kartik Basavaraj Angadi. 2025. https://arxiv.org/abs/2501.02832 Samba-asr: State-of-the-art speech recognition leveraging structured state-space models . Preprint, arXiv:2501.02832

  74. [78]

    Yi Su, Jisheng Bai, Qisheng Xu, Kele Xu, and Yong Dou. 2025. https://arxiv.org/abs/2501.15177 Audio-language models for audio-centric tasks: A survey . Preprint, arXiv:2501.15177

  75. [79]

    Changli Tang, Wenyi Yu, Guangzhi Sun, Xianzhao Chen, Tian Tan, Wei Li, Lu Lu, Zejun MA, and Chao Zhang. 2024. https://openreview.net/forum?id=14rn7HpKVk SALMONN : Towards generic hearing abilities for large language models . In The Twelfth International Conference on Learning Representations

  76. [80]

    Bin Wang, Xunlong Zou, Geyu Lin, Shuo Sun, Zhuohan Liu, Wenyu Zhang, Zhengyuan Liu, AiTi Aw, and Nancy F. Chen. 2025. https://arxiv.org/abs/2406.16020 Audiobench: A universal benchmark for audio large language models . Preprint, arXiv:2406.16020

  77. [81]

    Xiaomi. 2026. Xiaomi MiMo V2.5 . https://mimo.xiaomi.com/mimo-v2-5/. [Online]

  78. [82]

    Jin Xu, Zhifang Guo, Hangrui Hu, Yunfei Chu, Xiong Wang, Jinzheng He, Yuxuan Wang, Xian Shi, Ting He, Xinfa Zhu, Yuanjun Lv, Yongqi Wang, Dake Guo, He Wang, Linhan Ma, Pei Zhang, Xinyu Zhang, Hongkun Hao, Zishan Guo, and 19 others. 2025. https://arxiv.org/abs/2509.17765 Qwen3-omni technical report . Preprint, arXiv:2509.17765

  79. [83]

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, and 41 others. 2025 a . https://arxiv.org/abs/2505.09388 Qwen3 technical report . Preprint, arXiv:2505.09388

  80. [84]

    Qian Yang, Jin Xu, Wenrui Liu, Yunfei Chu, Ziyue Jiang, Xiaohuan Zhou, Yichong Leng, Yuanjun Lv, Zhou Zhao, Chang Zhou, and Jingren Zhou. 2024. https://arxiv.org/abs/2402.07729 Air-bench: Benchmarking large audio-language models via generative comprehension . Preprint, arXiv:2402.07729

Showing first 80 references.