pith. machine review for the scientific record.

arxiv: 2601.12248 · v3 · submitted 2026-01-18 · 📡 eess.AS · cs.AI · cs.CL · cs.LG · cs.SD

Recognition: no theorem link

AQUA-Bench: Beyond Finding Answers to Knowing When There Are None in Audio Question Answering

Authors on Pith no claims yet

Pith reviewed 2026-05-16 14:00 UTC · model grok-4.3

classification 📡 eess.AS · cs.AI · cs.CL · cs.LG · cs.SD
keywords audio question answering · unanswerable questions · benchmark · audio-language models · answer detection · model reliability

The pith

Audio-language models struggle to detect when questions have no reliable answer from the audio.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces AQUA-Bench to evaluate how well audio question answering models handle cases with no reliable answer. It defines three scenarios: absent answer detection where the correct option is missing, incompatible answer set detection where choices do not match the question category, and incompatible audio question detection where the question lacks grounding in the audio. Experiments show models perform well on standard answerable tasks but often fail on unanswerable ones. This matters for real-world applications where questions may be misleading or lack sufficient audio information, and systems should know when to abstain.

Core claim

AQUA-Bench is a benchmark for Audio Question Unanswerability Assessment that evaluates models across three scenarios: Absent Answer Detection where the correct option is missing, Incompatible Answer Set Detection where choices are categorically mismatched with the question, and Incompatible Audio Question Detection where the question is irrelevant or lacks sufficient grounding in the audio. The benchmark shows that while models excel on answerable audio questions, they face notable challenges with unanswerable ones, pointing to a blind spot in current audio-language understanding.
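
To make the three scenarios concrete, here is a minimal sketch of how AQUA-Bench-style items could be represented. The schema, field names, and example clips are illustrative assumptions, not the paper's released data format; only the three scenario definitions come from the paper.

    # Hypothetical item schema; field names and examples are not the paper's.
    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class AQAItem:
        audio_path: str      # audio clip A
        question: str        # question Q
        choices: list[str]   # candidate answer set C
        scenario: str        # "answerable" or one of the three cases below
        gold: Optional[str]  # correct option, or None when unanswerable

    # Absent Answer Detection: the true label ("dog") is removed from C.
    absent = AQAItem("barking.wav", "Which animal is heard?",
                     ["cat", "horse", "sheep", "rooster"],
                     "absent_answer", None)

    # Incompatible Answer Set Detection: the choices are categorically
    # mismatched with the question (instruments offered for an animal query).
    mismatched = AQAItem("barking.wav", "Which animal is heard?",
                         ["violin", "piano", "flute", "drum"],
                         "incompatible_answer_set", None)

    # Incompatible Audio Question Detection: the question has no grounding
    # in the audio (no instrument occurs in a dog-bark clip).
    ungrounded = AQAItem("barking.wav", "Which musical instrument is heard?",
                         ["violin", "piano", "flute", "drum"],
                         "incompatible_audio_question", None)

An answerable item would carry its correct option in gold; all three unanswerable variants share gold=None, which is exactly what a reliable model must detect.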

What carries the argument

AQUA-Bench, the benchmark that measures model reliability on unanswerable audio questions through its three defined detection scenarios.

If this is right

  • Audio-language models require new mechanisms to recognize when no reliable answer exists from the audio.
  • The three scenarios offer a structured test for building more trustworthy audio question answering systems.
  • Current benchmarks that cover only answerable questions do not fully capture model limitations in real-world settings.
  • Development of audio-aware systems should prioritize robustness to ill-posed or incompatible queries.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar benchmarks could be created for video or text question answering to address unanswerability across modalities.
  • Models might improve by incorporating uncertainty estimation specifically for audio inputs; one possible mechanism is sketched after this list.
  • Evaluating performance on diverse audio conditions like noise or accents could reveal additional weaknesses.
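
On the uncertainty-estimation extension, one minimal sketch: treat the model's per-option scores as a probability distribution and abstain when it is too flat. The entropy_abstain helper, its inputs, and the 1.2 threshold are hypothetical illustrations, not a mechanism the paper proposes.

    import math

    def entropy_abstain(option_logprobs: dict[str, float],
                        threshold: float = 1.2) -> str:
        """Pick the top option, or "abstain" when the normalized
        distribution over candidate options is too flat (high entropy)."""
        z = sum(math.exp(lp) for lp in option_logprobs.values())
        probs = {opt: math.exp(lp) / z for opt, lp in option_logprobs.items()}
        entropy = -sum(p * math.log(p) for p in probs.values() if p > 0)
        if entropy > threshold:  # illustrative cutoff; ln(4) ~ 1.39 is maximal here
            return "abstain"
        return max(probs, key=probs.get)

    # A peaked distribution answers; a near-uniform one abstains.
    print(entropy_abstain({"cat": -0.1, "horse": -3.0, "sheep": -3.2, "rooster": -3.5}))
    print(entropy_abstain({"cat": -1.4, "horse": -1.4, "sheep": -1.4, "rooster": -1.4}))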

Load-bearing premise

The three defined scenarios cover the main real-world cases where audio questions have no reliable answer.

What would settle it

If models trained with explicit refusal mechanisms still fail to identify unanswerable questions at high rates on both AQUA-Bench and new external audio recordings, the existence of a significant blind spot would be supported.
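
That settling test presumes an explicit refusal channel. A minimal sketch of one, reusing the hypothetical AQAItem schema sketched earlier: append a refusal option to every candidate set and credit an unanswerable item only when the model selects it. The prompt wording and scoring rule are assumptions, not the paper's protocol.

    # Hypothetical explicit-refusal harness; not the paper's protocol.
    REFUSAL = "None of the above / cannot be determined from the audio"

    def build_prompt(question: str, choices: list[str]) -> str:
        """Render a multiple-choice prompt with an appended refusal option."""
        options = choices + [REFUSAL]
        lines = [question]
        lines += [f"({chr(65 + i)}) {c}" for i, c in enumerate(options)]
        lines.append("Answer with a single option letter.")
        return "\n".join(lines)

    def is_correct(item: "AQAItem", model_letter: str) -> bool:
        """Score one response; unanswerable items require explicit refusal."""
        options = item.choices + [REFUSAL]
        picked = options[ord(model_letter.upper()) - ord("A")]
        if item.gold is None:        # unanswerable item
            return picked == REFUSAL
        return picked == item.gold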

read the original abstract

Recent advances in audio-aware large language models have shown strong performance on audio question answering. However, existing benchmarks mainly cover answerable questions and overlook the challenge of unanswerable ones, where no reliable answer can be inferred from the audio. Such cases are common in real-world settings, where questions may be misleading, ill-posed, or incompatible with the information. To address this gap, we present AQUA-Bench, a benchmark for Audio Question Unanswerability Assessment. It systematically evaluates three scenarios: Absent Answer Detection (the correct option is missing), Incompatible Answer Set Detection (choices are categorically mismatched with the question), and Incompatible Audio Question Detection (the question is irrelevant or lacks sufficient grounding in the audio). By assessing these cases, AQUA-Bench offers a rigorous measure of model reliability and promotes the development of audio-language systems that are more robust and trustworthy. Our experiments suggest that while models excel on standard answerable tasks, they often face notable challenges with unanswerable ones, pointing to a blind spot in current audio-language understanding.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces AQUA-Bench, a new benchmark for Audio Question Unanswerability Assessment in audio-language models. It defines three scenarios—Absent Answer Detection (missing correct option), Incompatible Answer Set Detection (categorically mismatched choices), and Incompatible Audio Question Detection (irrelevant or ungrounded questions)—to evaluate cases where no reliable answer can be inferred from the audio. Experiments indicate that models perform strongly on standard answerable audio QA tasks but exhibit notable challenges on these unanswerable cases, revealing a blind spot in current audio-language understanding.

Significance. If the benchmark construction proves rigorous and the reported gaps hold under detailed scrutiny, the work could meaningfully advance trustworthy audio-language systems by shifting focus from answer generation to reliable detection of unanswerability, a common real-world issue. The explicit scenario definitions provide a concrete starting point for future robustness research.

major comments (2)
  1. [Abstract / Experiments] Abstract and experimental description: the central claim of performance gaps on unanswerable questions rests on the benchmark's construction and evaluation, yet no details are given on audio sources, question/option generation procedures for the three scenarios, sample counts, or exact metrics (e.g., accuracy vs. refusal rate). This information is load-bearing for assessing whether the observed challenges are genuine or artifactual.
  2. [Scenario definitions] Definition of scenarios: the assumption that Absent Answer Detection, Incompatible Answer Set Detection, and Incompatible Audio Question Detection cover the primary real-world unanswerable cases lacks supporting justification, coverage analysis, or comparison to broader taxonomies of ill-posed audio questions.
minor comments (1)
  1. [Abstract] Abstract: consider adding a brief statement on benchmark scale (total items, distribution across scenarios) to help readers gauge the scope of the reported experiments.
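
Major comment 1 turns on the difference between plain accuracy and refusal-based metrics. A minimal sketch of how the two flavors separate, under an assumed per-item record format (the paper's exact metrics are not stated in the text reviewed here):

    # Hypothetical scoring over answerable and unanswerable splits.
    def score(records: list[dict]) -> dict[str, float]:
        """records: dicts with boolean keys 'answerable', 'refused', 'correct'."""
        ans = [r for r in records if r["answerable"]]
        unans = [r for r in records if not r["answerable"]]
        return {
            # standard accuracy: right answer chosen on answerable items
            "answerable_acc": sum(r["correct"] for r in ans) / max(len(ans), 1),
            # detection accuracy: refusing exactly when no answer exists
            "unanswerable_detection": sum(r["refused"] for r in unans) / max(len(unans), 1),
            # over-abstention check: refusals on items that were answerable
            "false_refusal_rate": sum(r["refused"] for r in ans) / max(len(ans), 1),
        }

A model with a strong forced-choice bias would post high answerable_acc, near-zero unanswerable_detection, and near-zero false_refusal_rate; the numbers only mean something read together.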

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. We agree that the manuscript requires additional details on benchmark construction and stronger justification for the scenario definitions to support the central claims. We will revise the paper accordingly and address each major comment below.

read point-by-point responses
  1. Referee: [Abstract / Experiments] Abstract and experimental description: the central claim of performance gaps on unanswerable questions rests on the benchmark's construction and evaluation, yet no details are given on audio sources, question/option generation procedures for the three scenarios, sample counts, or exact metrics (e.g., accuracy vs. refusal rate). This information is load-bearing for assessing whether the observed challenges are genuine or artifactual.

    Authors: We agree that the current manuscript lacks sufficient details on these aspects, which are essential for evaluating the benchmark's validity. In the revised version, we will expand the Experiments section (and add a new subsection on benchmark construction) to explicitly describe: the audio sources and datasets used (with references), the precise procedures for generating questions and options across the three scenarios (including any manual curation or automated methods), the exact sample counts per scenario, and the metrics employed (e.g., accuracy on answerable questions versus detection accuracy or refusal rate on unanswerable cases). This will allow readers to assess whether the reported performance gaps are genuine. revision: yes

  2. Referee: [Scenario definitions] Definition of scenarios: the assumption that Absent Answer Detection, Incompatible Answer Set Detection, and Incompatible Audio Question Detection cover the primary real-world unanswerable cases lacks supporting justification, coverage analysis, or comparison to broader taxonomies of ill-posed audio questions.

    Authors: We acknowledge that the manuscript would be strengthened by explicit justification and comparison to broader taxonomies. In the revision, we will add a dedicated discussion subsection that provides rationale for why these three scenarios represent primary real-world unanswerable cases in audio QA, includes a coverage analysis with examples, and compares them to related taxonomies from text-based unanswerable QA (e.g., SQuAD 2.0) and visual QA benchmarks. We will also cite relevant prior work to support the claim that these categories capture the main failure modes. revision: yes

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper introduces AQUA-Bench as a new evaluation benchmark with three explicitly defined scenarios for unanswerable audio questions. Its central claims rest on standard model evaluations against these new definitions rather than any derivation, fitted parameter, or self-referential construction. No equations, predictions, or load-bearing self-citations appear in the provided text that reduce the reported results to the inputs by construction. The work is self-contained as a benchmark contribution with independent experimental content.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim depends on the domain assumption that existing benchmarks ignore unanswerable questions and that the three new scenarios adequately represent real-world cases.

axioms (1)
  • domain assumption Existing audio QA benchmarks mainly cover answerable questions.
    Stated directly in the abstract as the motivation for the new benchmark.

pith-pipeline@v0.9.0 · 5500 in / 1080 out tokens · 25453 ms · 2026-05-16T14:00:27.707662+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

44 extracted references · 44 canonical work pages · 8 internal anchors

  1. [1]

    Evaluating the capabilities of these models is essential for guiding their development and deployment in real-world applications [19–26]

    INTRODUCTION Audio-aware large language models (ALLMs) [1–18] have recently demonstrated strong performance across a wide range of audio-related tasks. Evaluating the capabilities of these models is essential for guiding their development and deployment in real-world applications [19–26]. Among existing evaluation protocols, multiple-choice audio quest...

  2. [2]

    AQUA-Bench: Beyond Finding Answers to Knowing When There Are None in Audio Question Answering

    METHOD 2.1. Task Formulation The standard multiple-choice Audio Question Answering (AQA) task requires a model to select the correct answer from a candidate set C given an audio clip A and a question Q. The objective is to predict the correct option c_i ∈ C. We focus on three categories: animal sounds, musical instruments, and vocal (non-verbal human) sounds, wi...

  3. [3]

    EXPERIMENTAL SETUPS In this work, we adopt a set of well-known, fully open-source, and extensively documented models as baselines. These include Qwen-Audio-Chat [1], Qwen2-Audio-Instruct [2], Qwen2.5-Omni [8], SALMONN [3] (7B and 13B variants), LTU [4], LTU-AS [5], GAMA [6], Audio Flamingo 2 [12], Audio Flamingo 3 (AF3) [11], Audio-Reasoner [10], Phi4-Mu...

  4. [4]

    forced-choice,

    RESULT 4.1. Performance on Answerable Questions Before evaluating how models handle unanswerable questions, we first measure their performance on standard tasks where a correct answer is always available. This initial step allows us to establish a clear baseline and confirm their core capabilities in audio understanding. As presented in Table 1 under th...

  5. [5]

    Experiments show that while ALLMs excel on standard answerable tasks, they suffer from a pronounced forced-choice bias, often answering when they should abstain

    CONCLUSION We present AQUA-Bench, a benchmark for evaluating unanswerability in audio question answering through three scenarios: Absent Answer Detection, Incompatible Answer Set Detection, and Incompatible Audio Question Detection. Experiments show that while ALLMs excel on standard answerable tasks, they suffer from a pronounced forced-choice bias...

  6. [6]

    unanswerability

    DISCUSSION ON METHODOLOGY AND LIMITATIONS In this section, we provide further context regarding our experimental design choices, data quality control, and current limitations. 6.1. Rationale for Experimental Design Our benchmark is designed with specific constraints to strictly isolate and evaluate a model’s ability to handle unanswerable queries. Sim...

  7. [7]

    ACKNOWLEDGEMENT We thank the National Center for High-performance Computing (NCHC) of the National Applied Research Laboratories (NARLabs) in Taiwan for providing computational and storage resources

  8. [8]

    Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models

    Yunfei Chu et al., “Qwen-audio: Advancing universal audio understanding via unified large-scale audio-language models,” arXiv preprint arXiv:2311.07919, 2023

  9. [9]

    Qwen2-Audio Technical Report

    Yunfei Chu et al., “Qwen2-audio technical report,” arXiv preprint arXiv:2407.10759, 2024

  10. [10]

    Salmonn: Towards generic hearing abilities for large language models,

    Changli Tang et al., “Salmonn: Towards generic hearing abilities for large language models,” in The Twelfth International Conference on Learning Representations, 2023

  11. [11]

    Listen, think, and understand,

    Yuan Gong et al., “Listen, think, and understand,” in The Twelfth International Conference on Learning Representations, 2023

  12. [12]

    Joint audio and speech understanding,

    Yuan Gong et al., “Joint audio and speech understanding,” in 2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). IEEE, 2023, pp. 1–8

  13. [13]

    Gama: A large audio-language model with advanced audio understanding and complex reasoning abilities,

    Sreyan Ghosh et al., “Gama: A large audio-language model with advanced audio understanding and complex reasoning abilities,” in Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, 2024, pp. 6288–6313

  14. [14]

    Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

    Gheorghe Comanici et al., “Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities,” arXiv preprint arXiv:2507.06261, 2025

  15. [15]

    Qwen2.5-Omni Technical Report

    Jin Xu et al., “Qwen2.5-Omni technical report,” arXiv preprint arXiv:2503.20215, 2025

  16. [16]

    Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs

    Abdelrahman Abouelenin et al., “Phi-4-mini technical report: Compact yet powerful multimodal language models via mixture-of-loras,” arXiv preprint arXiv:2503.01743, 2025

  17. [17]

    Audio-reasoner: Improving reasoning capability in large audio language models,

    Zhifei Xie et al., “Audio-reasoner: Improving reasoning capability in large audio language models,” arXiv preprint arXiv:2503.02318, 2025

  18. [18]

    Audio flamingo 3: Advancing audio intelligence with fully open large audio language models,

    Sreyan Ghosh et al., “Audio flamingo 3: Advancing audio intelligence with fully open large audio language models,” in The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025

  19. [19]

    Audio flamingo 2: An audio-language model with long-audio understanding and expert reasoning abilities,

    Sreyan Ghosh et al., “Audio flamingo 2: An audio-language model with long-audio understanding and expert reasoning abilities,” in Forty-second International Conference on Machine Learning, 2024

  20. [20]

    Building a Taiwanese Mandarin spoken language model: A first attempt,

    Chih-Kai Yang et al., “Building a Taiwanese Mandarin spoken language model: A first attempt,” arXiv preprint arXiv:2411.07111, 2024

  21. [21]

    Speechprompt: Prompting speech language models for speech processing tasks,

    Kai-Wei Chang et al., “Speechprompt: Prompting speech language models for speech processing tasks,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2024

  22. [22]

    Speech-copilot: Leveraging large language models for speech processing via task decomposition, modularization, and program generation,

    Chun-Yi Kuan, Chih-Kai Yang, et al., “Speech-copilot: Leveraging large language models for speech processing via task decomposition, modularization, and program generation,” in 2024 IEEE Spoken Language Technology Workshop (SLT). IEEE, 2024, pp. 1060–1067

  23. [23]

    Teaching audio-aware large language models what does not hear: Mitigating hallucinations through synthesized negative samples,

    Chun-Yi Kuan et al., “Teaching audio-aware large language models what does not hear: Mitigating hallucinations through synthesized negative samples,” Interspeech 2025, 2025

  24. [24]

    From alignment to advancement: Bootstrapping audio-language alignment with synthetic data,

    Chun-Yi Kuan and Hung-yi Lee, “From alignment to advancement: Bootstrapping audio-language alignment with synthetic data,” IEEE Transactions on Audio, Speech and Language Processing, vol. 33, pp. 4604–4619, 2025

  25. [25]

    On the landscape of spoken language models: A comprehensive survey,

    Siddhant Arora, Kai-Wei Chang, et al., “On the landscape of spoken language models: A comprehensive survey,” Transactions on Machine Learning Research, 2025

  26. [26]

    Understanding sounds, missing the questions: The challenge of object hallucination in large audio-language models,

    Chun-Yi Kuan et al., “Understanding sounds, missing the questions: The challenge of object hallucination in large audio-language models,” in Interspeech 2025, 2025

  27. [27]

    Can large audio-language models truly hear? tackling hallucinations with multi-task assessment and stepwise audio reasoning,

    Chun-Yi Kuan et al., “Can large audio-language models truly hear? tackling hallucinations with multi-task assessment and stepwise audio reasoning,” in ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2025

  28. [28]

    Mmau: A massive multi-task audio understanding and reasoning benchmark,

    S Sakshi et al., “Mmau: A massive multi-task audio understanding and reasoning benchmark,” in The Thirteenth International Conference on Learning Representations, 2024

  29. [29]

    Mmar: A challenging benchmark for deep reasoning in speech, audio, music, and their mix. ArXiv, abs/2505.13032, 2025

    Ziyang Ma et al., “Mmar: A challenging benchmark for deep reasoning in speech, audio, music, and their mix,” arXiv preprint arXiv:2505.13032, 2025

  30. [30]

    Evaluation of LLMs in speech is often flawed: Test set contamination in large language models for speech recognition,

    Yuan Tseng et al., “Evaluation of LLMs in speech is often flawed: Test set contamination in large language models for speech recognition,” arXiv preprint arXiv:2505.22251, 2025

  31. [31]

    Listen and speak fairly: a study on semantic gender bias in speech integrated large language models,

    Yi-Cheng Lin et al., “Listen and speak fairly: a study on semantic gender bias in speech integrated large language models,” in 2024 IEEE Spoken Language Technology Workshop (SLT). IEEE, 2024, pp. 439–446

  32. [32]

    Speech-ifeval: Evaluating instruction-following and quantifying catastrophic forgetting in speech-aware language models,

    Ke-Han Lu et al., “Speech-ifeval: Evaluating instruction-following and quantifying catastrophic forgetting in speech-aware language models,” Interspeech 2025, 2025

  33. [33]

    Game-Time: Evaluating Temporal Dynamics in Spoken Language Models

    Kai-Wei Chang, En-Pei Hu, et al., “Game-time: Evaluating temporal dynamics in spoken language models,” arXiv preprint arXiv:2509.26388, 2025

  34. [34]

    Dynamic-superb: Towards a dynamic, collaborative, and comprehensive instruction-tuning benchmark for speech,

    Chien-yu Huang et al., “Dynamic-superb: Towards a dynamic, collaborative, and comprehensive instruction-tuning benchmark for speech,” in ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2024, pp. 12136–12140

  35. [35]

    Dynamic-superb phase-2: A collaboratively expanding benchmark for measuring the capabilities of spoken language models with 180 tasks,

    Chien-yu Huang et al., “Dynamic-superb phase-2: A collaboratively expanding benchmark for measuring the capabilities of spoken language models with 180 tasks,” in The Thirteenth International Conference on Learning Representations, 2025

  36. [36]

    Unsolvable problem detection: Robust understanding evaluation for large multimodal models,

    Atsuyuki Miyai et al., “Unsolvable problem detection: Robust understanding evaluation for large multimodal models,” in Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2025, pp. 6497–6540

  37. [37]

    Toward unsupervised realistic visual question answering,

    Yuwei Zhang et al., “Toward unsupervised realistic visual question answering,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 15613–15624

  38. [38]

    Unk-vqa: A dataset and a probe into the abstention ability of multi-modal large models,

    Yangyang Guo et al., “Unk-vqa: A dataset and a probe into the abstention ability of multi-modal large models,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024

  39. [39]

    Clip-up: Clip-based unanswerable problem detection for visual question answering,

    Ben Vardi et al., “Clip-up: Clip-based unanswerable problem detection for visual question answering,” arXiv preprint arXiv:2501.01371, 2025

  40. [40]

    GPT-4 Technical Report

    Josh Achiam et al., “Gpt-4 technical report,” arXiv preprint arXiv:2303.08774, 2023

  41. [41]

    ESC: Dataset for Environmental Sound Classification,

    Karol J. Piczak, “ESC: Dataset for Environmental Sound Classification,” in Proceedings of the 23rd Annual ACM Conference on Multimedia. 2015, pp. 1015–1018, ACM Press

  42. [42]

    Music instrument sounds for classification,

    Abdulvahap, “Music instrument sounds for classification,” kaggle.com/datasets/abdulvahap/music-instrunment-sounds-for-classification

  43. [43]

    Vocalsound: A dataset for improving human vocal sounds recognition,

    Yuan Gong et al., “Vocalsound: A dataset for improving human vocal sounds recognition,” in ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2022, pp. 151–155

  44. [44]

    Chain-of-thought prompting elicits reasoning in large language models,

    Jason Wei et al., “Chain-of-thought prompting elicits reasoning in large language models,” Advances in neural information processing systems, vol. 35, pp. 24824–24837, 2022