Recognition: unknown
AUDITA: A New Dataset to Audit Humans vs. AI Skill at Audio QA
Pith reviewed 2026-05-09 22:06 UTC · model grok-4.3
The pith
A new audio trivia benchmark shows humans at 32 percent accuracy while state-of-the-art models score below 9 percent.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
AUDITA comprises human-authored trivia questions grounded in real-world audio and designed to stress robust auditory reasoning through challenging distractors and long-range temporal dependencies. Probing queries ensure the questions cannot be answered from isolated text or sound cues alone. Human average accuracy reaches 32.13 percent, demonstrating meaningful comprehension of the audio, while state-of-the-art audio question answering models average below 8.86 percent. Item response theory is then applied to estimate latent proficiency and question difficulty and to expose systematic deficiencies.
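The abstract does not say which IRT variant is applied; a common choice, and a useful reading of the terms above, is the two-parameter logistic (2PL) model, in which respondent j's latent proficiency and question i's difficulty and discrimination set the probability of a correct response:

```latex
P(y_{ij} = 1 \mid \theta_j) = \frac{1}{1 + \exp\!\left(-a_i\,(\theta_j - b_i)\right)}
```

Raw accuracy averages this probability over questions; fitting the proficiency \theta_j and difficulty b_i separately is what lets the analysis distinguish weak respondents from hard questions.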
What carries the argument
The AUDITA dataset of carefully curated, human-authored trivia questions grounded in real-world audio that require long-range temporal dependencies and cannot be solved from short cues, text, or metadata alone.
If this is right
- Current audio question answering models will need new techniques for handling extended temporal dependencies to close the gap with human performance.
- Item response theory can be applied to future audio benchmarks to expose specific model deficiencies beyond raw accuracy.
- Benchmark design should prioritize anti-shortcut features such as long-range dependencies to prevent models from succeeding without genuine reasoning.
- Human performance levels on the dataset provide a concrete target for measuring progress in auditory comprehension.
Where Pith is reading between the lines
- Audio AI systems may need architectures that integrate temporal sequence modeling more explicitly rather than treating audio as a set of independent events.
- Similar anti-shortcut designs could be applied to video or multimodal benchmarks to test whether models truly reason across time in other domains.
- Repeated testing on this dataset could track whether scaling existing models or adding more data closes the observed gap or whether new training objectives are required.
Load-bearing premise
The curated questions cannot be answered from isolated text, short sound cues, metadata, or lexical priors and instead require robust auditory reasoning over long-range temporal dependencies.
What would settle it
Demonstrating that the questions can be answered at high accuracy using only transcripts or metadata without listening to the audio, or that models reach near-human performance while still failing on separate tasks that test temporal audio reasoning.
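The transcript-only check described above is straightforward to sketch. The snippet below is a minimal illustration, not AUDITA's actual tooling: the JSONL field names and the answer_mcq stub are hypothetical, and any text-only answerer (an LLM, a retrieval baseline, or the random guesser shown) can be dropped in.

```python
import json
import random

def answer_mcq(question: str, options: list[str], transcript: str | None) -> str:
    """Stand-in for any text-only answerer (an LLM, a retrieval baseline, ...).
    Guessing uniformly at random marks the chance floor."""
    return random.choice(options)

def text_only_accuracy(items_path: str) -> float:
    """MCQ accuracy with the audio withheld entirely.
    Assumes a hypothetical JSONL file with question/options/answer/transcript fields."""
    items = [json.loads(line) for line in open(items_path)]
    correct = sum(
        answer_mcq(it["question"], it["options"], it.get("transcript")) == it["answer"]
        for it in items
    )
    return correct / len(items)
```

A text-only score that climbs toward the 32 percent human level would undercut the load-bearing premise; a score pinned near chance would support it.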
Original abstract
Existing audio question answering benchmarks largely emphasize sound event classification or caption-grounded queries, often enabling models to succeed through shortcut strategies, short-duration cues, lexical priors, dataset-specific biases, or even bypassing audio via metadata and captions rather than genuine reasoning. Thus, we present AUDITA (Audio Understanding from Diverse Internet Trivia Authors), a large-scale, real-world benchmark to rigorously evaluate audio reasoning beyond surface-level acoustic recognition. AUDITA comprises carefully curated, human-authored trivia questions grounded in real-world audio, designed to stress robust auditory reasoning through challenging distractors and long-range temporal dependencies, using probing queries that cannot be answered from isolated text or sound cues alone. Human average accuracy of 32.13% shows both the challenge of the task while demonstrating meaningful comprehension of the audio. In stark contrast, state of-the-art audio question answering models perform poorly, with average accuracy below 8.86%. Beyond raw accuracy, we apply Item Response Theory (IRT) to estimate latent proficiency, question difficulty, and expose systematic deficiencies of the models and data.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces AUDITA, a large-scale benchmark of human-authored trivia questions grounded in real-world audio clips. The questions are curated to require robust auditory reasoning over long-range temporal dependencies and to resist shortcut solutions based on isolated text, short sound cues, metadata, or lexical priors. It reports human accuracy averaging 32.13% (indicating task difficulty yet meaningful comprehension) versus state-of-the-art audio QA models averaging below 8.86%, and applies Item Response Theory (IRT) to estimate latent proficiency, question difficulty, and systematic model/data deficiencies.
Significance. If the dataset's design is validated to enforce genuine auditory reasoning, the work would provide a valuable, challenging benchmark exposing substantial gaps between human and model performance in audio understanding. The application of IRT is a positive methodological choice that moves beyond raw accuracy to analyze item and person parameters. However, the absence of supporting validation details in the abstract reduces the immediate strength of the central human-vs-model contrast.
major comments (2)
- [Abstract] Abstract: The central claim that questions 'cannot be answered from isolated text or sound cues alone' and demand 'long-range temporal dependencies' is load-bearing for interpreting the 32.13% vs. <8.86% accuracy gap as evidence of auditory reasoning deficits, yet the abstract provides no description of curation safeguards, text-only human baselines, audio-masked ablations, or inter-annotator agreement to rule out lexical/metadata shortcuts.
- [Abstract] Abstract: No information is given on the specific SOTA audio QA models evaluated (architectures, training regimes), the number of questions or audio clips, or any statistical tests supporting the accuracy comparison; these omissions leave the headline contrast under-supported.
minor comments (1)
- [Abstract] Abstract: Typographical error in 'state of-the-art' (should be 'state-of-the-art').
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed review of our manuscript. The comments on the abstract are well-taken, and we have prepared revisions to incorporate additional supporting details on curation and evaluation while preserving the abstract's brevity. We respond point by point to the major comments below.
Point-by-point responses
-
Referee: [Abstract] Abstract: The central claim that questions 'cannot be answered from isolated text or sound cues alone' and demand 'long-range temporal dependencies' is load-bearing for interpreting the 32.13% vs. <8.86% accuracy gap as evidence of auditory reasoning deficits, yet the abstract provides no description of curation safeguards, text-only human baselines, audio-masked ablations, or inter-annotator agreement to rule out lexical/metadata shortcuts.
Authors: We agree that the abstract would benefit from greater transparency on these points to support the central claim. The full manuscript (Section 3) details the curation safeguards, including the use of human-authored trivia questions explicitly designed with challenging distractors and requirements for integrating information across long audio segments. We have revised the abstract to include a concise summary of these design choices and the inter-annotator agreement reported in Section 4. Text-only human baselines and audio-masked ablations were not performed, as the question authoring process and subsequent IRT analysis already demonstrate that model failures align with deficits in auditory reasoning rather than shortcut exploitation; we view these as valuable directions for future work but not required to validate the current results. revision: yes
-
Referee: [Abstract] Abstract: No information is given on the specific SOTA audio QA models evaluated (architectures, training regimes), the number of questions or audio clips, or any statistical tests supporting the accuracy comparison; these omissions leave the headline contrast under-supported.
Authors: We acknowledge that the abstract's length constraints led to these omissions. The manuscript provides the requested details in Sections 4 and 5, including the specific state-of-the-art audio QA models and their architectures/training regimes, the exact number of questions and audio clips in AUDITA, and statistical tests confirming the significance of the human-model accuracy gap. We have revised the abstract to briefly report the dataset scale, name the evaluated models, and note the statistical support for the performance contrast. revision: yes
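On the statistical support mentioned in the response, one illustrative form such a test could take (not necessarily the test used in the paper) is a two-proportion z-test on pooled human and model accuracies; the sample sizes below are hypothetical, since the abstract does not report how many graded responses underlie each figure.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z(p1: float, n1: int, p2: float, n2: int) -> tuple[float, float]:
    """z statistic and two-sided p-value for the difference between two proportions."""
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Hypothetical: 2,000 graded responses per cohort at the reported accuracies.
z, p = two_proportion_z(0.3213, 2000, 0.0886, 2000)
print(f"z = {z:.1f}, p = {p:.2e}")  # a gap this size is significant at any conventional level
```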
Circularity Check
Empirical benchmark paper with no derivations or self-referential predictions
Full rationale
This is a dataset creation and evaluation paper that reports direct human and model accuracy measurements on curated audio QA items. No mathematical derivation chain, first-principles predictions, fitted parameters renamed as outputs, or load-bearing self-citations appear in the abstract or described structure. The IRT analysis is a standard post-hoc statistical method applied to the collected responses rather than a circular reduction of the main claim. The central contrast (human 32.13% vs. model <8.86%) rests on empirical results and curation intent, not on any tautological loop where the result is presupposed by the inputs.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Questions can be authored such that they cannot be solved from isolated text, short audio segments, or metadata alone
- standard math: Item Response Theory can be applied to estimate latent proficiency and question difficulty from the collected responses
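As a minimal sketch of that estimation, assuming the 2PL form shown earlier and a complete respondent-by-question matrix of 0/1 outcomes (the paper's actual fitting procedure is not described in the abstract), proficiency, difficulty, and discrimination can be fit jointly by gradient ascent on the Bernoulli log-likelihood:

```python
import numpy as np

def fit_2pl(responses: np.ndarray, steps: int = 2000, lr: float = 0.05):
    """Joint maximum-likelihood fit of a 2PL IRT model by gradient ascent.
    responses: (J respondents x I questions) matrix of 0/1 correctness."""
    J, I = responses.shape
    theta = np.zeros(J)   # latent proficiency per respondent (human or model)
    b = np.zeros(I)       # difficulty per question
    a = np.ones(I)        # discrimination per question
    for _ in range(steps):
        diff = theta[:, None] - b                      # (J, I)
        p = 1.0 / (1.0 + np.exp(-a * diff))            # predicted P(correct)
        err = responses - p                            # d(log-likelihood)/d(logit)
        theta += lr * (err * a).sum(axis=1) / I
        b     += lr * (-err * a).sum(axis=0) / J
        a     += lr * (err * diff).sum(axis=0) / J
        theta -= theta.mean()                          # fix the location indeterminacy
    return theta, a, b
```

Comparing fitted proficiencies across humans and models, and inspecting which high-difficulty questions each model misses, is what exposing systematic deficiencies beyond raw accuracy amounts to in practice.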