pith. machine review for the scientific record.

arxiv: 2505.15957 · v4 · submitted 2025-05-21 · 📡 eess.AS · cs.AI · cs.CL · cs.SD

Towards Holistic Evaluation of Large Audio-Language Models: A Comprehensive Survey

classification 📡 eess.AS cs.AI cs.CL cs.SD
keywords models auditory lalms large survey advancements audio-language comprehensive

With advancements in large audio-language models (LALMs), which enhance large language models (LLMs) with auditory capabilities, these models are expected to demonstrate universal proficiency across various auditory tasks. While numerous benchmarks have emerged to assess LALMs' performance, they remain fragmented and lack a structured taxonomy. To bridge this gap, we conduct a comprehensive survey and propose a systematic taxonomy for LALM evaluations, categorizing them into four dimensions based on their objectives: (1) General Auditory Awareness and Processing, (2) Knowledge and Reasoning, (3) Dialogue-oriented Ability, and (4) Fairness, Safety, and Trustworthiness. We provide detailed overviews within each category and highlight challenges in this field, offering insights into promising future directions. To the best of our knowledge, this is the first survey specifically focused on the evaluations of LALMs, providing clear guidelines for the community. We will release the collection of the surveyed papers and actively maintain it to support ongoing advancements in the field.

This paper has not been read by Pith yet.

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Voice, Bias, and Coreference: An Interpretability Study of Gender in Speech Translation

    cs.CL 2025-11 conditional novelty 7.0

    ST models override masculine ILM biases with acoustic input, using first-person pronouns to link terms to the speaker and accessing gender cues across the full frequency spectrum rather than pitch alone.

  2. Game-Time: Evaluating Temporal Dynamics in Spoken Language Models

    eess.AS 2025-09 unverdicted novelty 7.0

    Game-Time Benchmark shows spoken language models handle basic tasks but degrade sharply under temporal constraints like tempo adherence and synchronized responses.

  3. When Silence Matters: The Impact of Irrelevant Audio on Text Reasoning in Large Audio-Language Models

    cs.SD 2025-10 unverdicted novelty 5.0

    Irrelevant audio including silence reduces accuracy and increases volatility in text reasoning for large audio-language models, with effects worsening at longer durations, higher amplitudes, and higher temperatures.