Evaluation of LLMs in speech is often flawed: Test set contamination in large language models for speech recognition
2 Pith papers cite this work. Polarity classification is still indexing.
Citation-role summary
dataset: 1
Citation-polarity summary
Fields: eess.AS (2)
Years: 2026 (2)
Verdicts: UNVERDICTED (2)
Roles: dataset (1)
Polarities: use dataset (1)
Representative citing papers
AQUA-Bench evaluates audio QA models on three unanswerability scenarios: missing correct answers, mismatched choice sets, and questions irrelevant to the audio.
Citing papers explorer
- Walking Through Uncertainty: An Empirical Study of Uncertainty Estimation for Audio-Aware Large Language Models
  Semantic-level and verification-based uncertainty methods outperform token-level baselines for audio reasoning in ALLMs, but their relative performance on hallucination and unanswerable-question benchmarks is model- and task-dependent.
- AQUA-Bench: Beyond Finding Answers to Knowing When There Are None in Audio Question Answering
  AQUA-Bench evaluates audio QA models on three unanswerability scenarios: missing correct answers, mismatched choice sets, and questions irrelevant to the audio.