Recognition: unknown
AUDITA: A New Dataset to Audit Humans vs. AI Skill at Audio QA
Pith reviewed 2026-05-09 22:06 UTC · model grok-4.3
The pith
A new audio trivia benchmark shows humans at 32 percent accuracy while state-of-the-art models score below 9 percent.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
AUDITA comprises human-authored trivia questions grounded in real-world audio and designed to stress robust auditory reasoning through challenging distractors and long-range temporal dependencies. Probing queries ensure the questions cannot be answered from isolated text or sound cues alone. Human average accuracy reaches 32.13 percent, demonstrating meaningful comprehension of the audio, while state-of-the-art audio question answering models average below 8.86 percent. Item response theory is then applied to estimate latent proficiency and question difficulty and to expose systematic deficiencies.
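The abstract does not say which IRT variant is applied; a common choice, and a useful reading of the terms above, is the two-parameter logistic (2PL) model, in which respondent j's latent proficiency and question i's difficulty and discrimination set the probability of a correct response:

```latex
P(y_{ij} = 1 \mid \theta_j) = \frac{1}{1 + \exp\!\left(-a_i\,(\theta_j - b_i)\right)}
```

Raw accuracy averages this probability over questions; fitting the proficiency \theta_j and difficulty b_i separately is what lets the analysis distinguish weak respondents from hard questions.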
What carries the argument
The AUDITA dataset of carefully curated, human-authored trivia questions grounded in real-world audio that require long-range temporal dependencies and cannot be solved from short cues, text, or metadata alone.
If this is right
- Current audio question answering models will need new techniques for handling extended temporal dependencies to close the gap with human performance.
- Item response theory can be applied to future audio benchmarks to expose specific model deficiencies beyond raw accuracy.
- Benchmark design should prioritize anti-shortcut features such as long-range dependencies to prevent models from succeeding without genuine reasoning.
- Human performance levels on the dataset provide a concrete target for measuring progress in auditory comprehension.
Where Pith is reading between the lines
- Audio AI systems may need architectures that integrate temporal sequence modeling more explicitly rather than treating audio as a set of independent events.
- Similar anti-shortcut designs could be applied to video or multimodal benchmarks to test whether models truly reason across time in other domains.
- Repeated testing on this dataset could track whether scaling existing models or adding more data closes the observed gap or whether new training objectives are required.
Load-bearing premise
The curated questions cannot be answered from isolated text, short sound cues, metadata, or lexical priors and instead require robust auditory reasoning over long-range temporal dependencies.
What would settle it
Demonstrating that the questions can be answered at high accuracy using only transcripts or metadata without listening to the audio, or that models reach near-human performance while still failing on separate tasks that test temporal audio reasoning.
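The transcript-only check described above is straightforward to sketch. The snippet below is a minimal illustration, not AUDITA's actual tooling: the JSONL field names and the answer_mcq stub are hypothetical, and any text-only answerer (an LLM, a retrieval baseline, or the random guesser shown) can be dropped in.

```python
import json
import random

def answer_mcq(question: str, options: list[str], transcript: str | None) -> str:
    """Stand-in for any text-only answerer (an LLM, a retrieval baseline, ...).
    Guessing uniformly at random marks the chance floor."""
    return random.choice(options)

def text_only_accuracy(items_path: str) -> float:
    """MCQ accuracy with the audio withheld entirely.
    Assumes a hypothetical JSONL file with question/options/answer/transcript fields."""
    items = [json.loads(line) for line in open(items_path)]
    correct = sum(
        answer_mcq(it["question"], it["options"], it.get("transcript")) == it["answer"]
        for it in items
    )
    return correct / len(items)
```

A text-only score that climbs toward the 32 percent human level would undercut the load-bearing premise; a score pinned near chance would support it.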
Original abstract
Existing audio question answering benchmarks largely emphasize sound event classification or caption-grounded queries, often enabling models to succeed through shortcut strategies, short-duration cues, lexical priors, dataset-specific biases, or even bypassing audio via metadata and captions rather than genuine reasoning. Thus, we present AUDITA (Audio Understanding from Diverse Internet Trivia Authors), a large-scale, real-world benchmark to rigorously evaluate audio reasoning beyond surface-level acoustic recognition. AUDITA comprises carefully curated, human-authored trivia questions grounded in real-world audio, designed to stress robust auditory reasoning through challenging distractors and long-range temporal dependencies, using probing queries that cannot be answered from isolated text or sound cues alone. Human average accuracy of 32.13% shows both the challenge of the task while demonstrating meaningful comprehension of the audio. In stark contrast, state of-the-art audio question answering models perform poorly, with average accuracy below 8.86%. Beyond raw accuracy, we apply Item Response Theory (IRT) to estimate latent proficiency, question difficulty, and expose systematic deficiencies of the models and data.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces AUDITA, a large-scale benchmark of human-authored trivia questions grounded in real-world audio clips. The questions are curated to require robust auditory reasoning over long-range temporal dependencies and to resist shortcut solutions based on isolated text, short sound cues, metadata, or lexical priors. It reports human accuracy averaging 32.13% (indicating task difficulty yet meaningful comprehension) versus state-of-the-art audio QA models averaging below 8.86%, and applies Item Response Theory (IRT) to estimate latent proficiency, question difficulty, and systematic model/data deficiencies.
Significance. If the dataset's design is validated to enforce genuine auditory reasoning, the work would provide a valuable, challenging benchmark exposing substantial gaps between human and model performance in audio understanding. The application of IRT is a positive methodological choice that moves beyond raw accuracy to analyze item and person parameters. However, the absence of supporting validation details in the abstract reduces the immediate strength of the central human-vs-model contrast.
major comments (2)
- [Abstract] Abstract: The central claim that questions 'cannot be answered from isolated text or sound cues alone' and demand 'long-range temporal dependencies' is load-bearing for interpreting the 32.13% vs. <8.86% accuracy gap as evidence of auditory reasoning deficits, yet the abstract provides no description of curation safeguards, text-only human baselines, audio-masked ablations, or inter-annotator agreement to rule out lexical/metadata shortcuts.
- [Abstract] Abstract: No information is given on the specific SOTA audio QA models evaluated (architectures, training regimes), the number of questions or audio clips, or any statistical tests supporting the accuracy comparison; these omissions leave the headline contrast under-supported.
minor comments (1)
- [Abstract] Abstract: Typographical error in 'state of-the-art' (should be 'state-of-the-art').
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed review of our manuscript. The comments on the abstract are well-taken, and we have prepared revisions to incorporate additional supporting details on curation and evaluation while preserving the abstract's brevity. We respond point by point to the major comments below.
Point-by-point responses
-
Referee: [Abstract] Abstract: The central claim that questions 'cannot be answered from isolated text or sound cues alone' and demand 'long-range temporal dependencies' is load-bearing for interpreting the 32.13% vs. <8.86% accuracy gap as evidence of auditory reasoning deficits, yet the abstract provides no description of curation safeguards, text-only human baselines, audio-masked ablations, or inter-annotator agreement to rule out lexical/metadata shortcuts.
Authors: We agree that the abstract would benefit from greater transparency on these points to support the central claim. The full manuscript (Section 3) details the curation safeguards, including the use of human-authored trivia questions explicitly designed with challenging distractors and requirements for integrating information across long audio segments. We have revised the abstract to include a concise summary of these design choices and the inter-annotator agreement reported in Section 4. Text-only human baselines and audio-masked ablations were not performed, as the question authoring process and subsequent IRT analysis already demonstrate that model failures align with deficits in auditory reasoning rather than shortcut exploitation; we view these as valuable directions for future work but not required to validate the current results. revision: yes
-
Referee: [Abstract] Abstract: No information is given on the specific SOTA audio QA models evaluated (architectures, training regimes), the number of questions or audio clips, or any statistical tests supporting the accuracy comparison; these omissions leave the headline contrast under-supported.
Authors: We acknowledge that the abstract's length constraints led to these omissions. The manuscript provides the requested details in Sections 4 and 5, including the specific state-of-the-art audio QA models and their architectures/training regimes, the exact number of questions and audio clips in AUDITA, and statistical tests confirming the significance of the human-model accuracy gap. We have revised the abstract to briefly report the dataset scale, name the evaluated models, and note the statistical support for the performance contrast. revision: yes
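On the statistical support mentioned in the response, one illustrative form such a test could take (not necessarily the test used in the paper) is a two-proportion z-test on pooled human and model accuracies; the sample sizes below are hypothetical, since the abstract does not report how many graded responses underlie each figure.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z(p1: float, n1: int, p2: float, n2: int) -> tuple[float, float]:
    """z statistic and two-sided p-value for the difference between two proportions."""
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Hypothetical: 2,000 graded responses per cohort at the reported accuracies.
z, p = two_proportion_z(0.3213, 2000, 0.0886, 2000)
print(f"z = {z:.1f}, p = {p:.2e}")  # a gap this size is significant at any conventional level
```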
Circularity Check
Empirical benchmark paper with no derivations or self-referential predictions
Full rationale
This is a dataset creation and evaluation paper that reports direct human and model accuracy measurements on curated audio QA items. No mathematical derivation chain, first-principles predictions, fitted parameters renamed as outputs, or load-bearing self-citations appear in the abstract or described structure. The IRT analysis is a standard post-hoc statistical method applied to the collected responses rather than a circular reduction of the main claim. The central contrast (human 32.13% vs. model <8.86%) rests on empirical results and curation intent, not on any tautological loop where the result is presupposed by the inputs.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Questions can be authored such that they cannot be solved from isolated text, short audio segments, or metadata alone
- standard math: Item Response Theory can be applied to estimate latent proficiency and question difficulty from the collected responses
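As a minimal sketch of that estimation, assuming the 2PL form shown earlier and a complete respondent-by-question matrix of 0/1 outcomes (the paper's actual fitting procedure is not described in the abstract), proficiency, difficulty, and discrimination can be fit jointly by gradient ascent on the Bernoulli log-likelihood:

```python
import numpy as np

def fit_2pl(responses: np.ndarray, steps: int = 2000, lr: float = 0.05):
    """Joint maximum-likelihood fit of a 2PL IRT model by gradient ascent.
    responses: (J respondents x I questions) matrix of 0/1 correctness."""
    J, I = responses.shape
    theta = np.zeros(J)   # latent proficiency per respondent (human or model)
    b = np.zeros(I)       # difficulty per question
    a = np.ones(I)        # discrimination per question
    for _ in range(steps):
        diff = theta[:, None] - b                      # (J, I)
        p = 1.0 / (1.0 + np.exp(-a * diff))            # predicted P(correct)
        err = responses - p                            # d(log-likelihood)/d(logit)
        theta += lr * (err * a).sum(axis=1) / I
        b     += lr * (-err * a).sum(axis=0) / J
        a     += lr * (err * diff).sum(axis=0) / J
        theta -= theta.mean()                          # fix the location indeterminacy
    return theta, a, b
```

Comparing fitted proficiencies across humans and models, and inspecting which high-difficulty questions each model misses, is what exposing systematic deficiencies beyond raw accuracy amounts to in practice.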