AudioBench: A universal benchmark for audio large language models

Bin Wang, Xunlong Zou, Geyu Lin, Shuo Sun, Zhuohan Liu, Wenyu Zhang, Zhengyuan Liu, AiTi Aw, Nancy F · 2025 · arXiv 2406.16020

7 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

PitchBench: Measuring Pitch Hearing in Audio-Language Models

cs.SD · 2026-05-25 · unverdicted · novelty 7.0

PitchBench shows that frontier audio-language models have highly unreliable pitch perception across instruments, durations, noise levels, and formats.

XModBench: Benchmarking Cross-Modal Capabilities and Consistency in Omni-Language Models

cs.CV · 2025-10-16 · conditional · novelty 7.0

XModBench is a tri-modal benchmark that systematically measures cross-modal consistency, modality disparities, and directional imbalances in omni-language models across five task families and all modality combinations.

VoiceBench: Benchmarking LLM-Based Voice Assistants

cs.CL · 2024-10-22 · unverdicted · novelty 7.0

VoiceBench is the first benchmark for multi-faceted evaluation of LLM voice assistants using real and synthetic spoken instructions with speaker, environmental, and content variations.

All That Glitters Is Not Audio: Rethinking Text Priors and Audio Reliance in Audio-Language Evaluation

cs.SD · 2026-04-27 · unverdicted · novelty 6.0

Audio-language models retain 60-72% of benchmark scores without audio, and most audio-dependent items can be solved from short fragments rather than full clips.

GlobeAudio: A Multilingual Multicultural Benchmark for Naturalistic Evaluation of Large Audio-Language Models

cs.CL · 2026-06-06 · unverdicted · novelty 5.0

GlobeAudio is a new multilingual multicultural benchmark for naturalistic evaluation of large audio-language models, showing performance gaps especially for open-source models and low-resource languages.

Afrispeech Semantics: Evaluating Audio Semantic Reasoning in Spoken Language Models Across Domains and Accents

cs.CL · 2026-05-11 · unverdicted · novelty 4.0

Audio language models are benchmarked on five semantic and paralinguistic reasoning tasks to reveal limitations in handling spoken audio evidence, accent variation, and domain shifts.

On The Landscape of Spoken Language Models: A Comprehensive Survey

cs.CL · 2025-04-11 · unverdicted · novelty 3.0

A literature survey that organizes spoken language models by architecture, training, and evaluation choices and identifies key challenges and future directions.

citing papers explorer

PitchBench: Measuring Pitch Hearing in Audio-Language Models

cs.SD · 2026-05-25 · unverdicted · none · ref 3

PitchBench shows that frontier audio-language models have highly unreliable pitch perception across instruments, durations, noise levels, and formats.

All That Glitters Is Not Audio: Rethinking Text Priors and Audio Reliance in Audio-Language Evaluation

cs.SD · 2026-04-27 · unverdicted · none · ref 33

Audio-language models retain 60-72% of benchmark scores without audio, and most audio-dependent items can be solved from short fragments rather than full clips.

GlobeAudio: A Multilingual Multicultural Benchmark for Naturalistic Evaluation of Large Audio-Language Models

cs.CL · 2026-06-06 · unverdicted · none · ref 80

GlobeAudio is a new multilingual multicultural benchmark for naturalistic evaluation of large audio-language models, showing performance gaps especially for open-source models and low-resource languages.

Afrispeech Semantics: Evaluating Audio Semantic Reasoning in Spoken Language Models Across Domains and Accents

cs.CL · 2026-05-11 · unverdicted · none · ref 179

Audio language models are benchmarked on five semantic and paralinguistic reasoning tasks to reveal limitations in handling spoken audio evidence, accent variation, and domain shifts.
