Clotho-AQA: A Crowdsourced Dataset for Audio Question Answering

Konstantinos Drossos; Parthasaarathy Sudarsanam; Samuel Lipping; Tuomas Virtanen

arXiv: 2204.09634 · v2 · pith:76PIDF56new · submitted 2022-04-20 · cs.SD · cs.LG · eess.AS

Samuel Lipping, Parthasaarathy Sudarsanam, Konstantinos Drossos, Tuomas Virtanen

classification cs.SD · cs.LG · eess.AS

keywords answers · audio · dataset · question · questions · classifier · accuracy · answering

Audio question answering (AQA) is a multimodal translation task where a system analyzes an audio signal and a natural language question to generate a desirable natural language answer. In this paper, we introduce Clotho-AQA, a dataset for audio question answering consisting of 1991 audio files, each between 15 and 30 seconds in duration, selected from the Clotho dataset. For each audio file, we collect six different questions and corresponding answers by crowdsourcing using Amazon Mechanical Turk. The questions and answers are produced by different annotators. Of the six questions for each audio file, two are designed to have 'yes' as the answer and two to have 'no', while the remaining two have other single-word answers. For each question, we collect answers from three different annotators. We also present two baseline experiments to describe the usage of our dataset for the AQA task: an LSTM-based multimodal binary classifier for 'yes' or 'no' type answers and an LSTM-based multimodal multi-class classifier for 828 single-word answers. The binary classifier achieved an accuracy of 62.7%, and the multi-class classifier achieved a top-1 accuracy of 54.2% and a top-5 accuracy of 93.7%. The Clotho-AQA dataset is freely available online at https://zenodo.org/record/6473207.
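
The dataset composition and the top-k metric mentioned in the abstract can be made concrete with a short sketch. The counts follow purely from the abstract's figures (1991 audio files, six questions per file, three answers per question). All function and variable names are hypothetical, and `top_k_accuracy` is a generic illustration of the metric on made-up scores, not the authors' evaluation code.

```python
# Figures taken from the abstract; everything else here is illustrative.
NUM_AUDIO_FILES = 1991
QUESTIONS_PER_AUDIO = {"yes": 2, "no": 2, "single_word": 2}
ANSWERS_PER_QUESTION = 3


def dataset_counts():
    """Derive overall question and answer counts from the per-file design."""
    per_audio = sum(QUESTIONS_PER_AUDIO.values())
    binary_per_audio = QUESTIONS_PER_AUDIO["yes"] + QUESTIONS_PER_AUDIO["no"]
    questions = NUM_AUDIO_FILES * per_audio
    return {
        "questions": questions,
        "binary_questions": NUM_AUDIO_FILES * binary_per_audio,
        "single_word_questions": NUM_AUDIO_FILES * QUESTIONS_PER_AUDIO["single_word"],
        "answers": questions * ANSWERS_PER_QUESTION,
    }


def top_k_accuracy(scores, targets, k):
    """Fraction of examples whose target class is among the k highest scores.

    scores:  list of per-class score lists, one list per example
    targets: list of correct class indices, one per example
    """
    hits = 0
    for row, target in zip(scores, targets):
        # Indices of the k classes with the largest scores for this example.
        top_k = sorted(range(len(row)), key=lambda i: row[i], reverse=True)[:k]
        hits += target in top_k
    return hits / len(targets)


if __name__ == "__main__":
    print(dataset_counts())
    # Toy scores over three classes for two examples (made-up numbers).
    toy_scores = [[0.1, 0.7, 0.2], [0.5, 0.3, 0.2]]
    toy_targets = [1, 1]
    print(top_k_accuracy(toy_scores, toy_targets, k=1))
    print(top_k_accuracy(toy_scores, toy_targets, k=2))
```

The gap between the reported top-1 and top-5 accuracies reflects exactly this relaxation: a prediction counts as correct at k=5 if the annotated answer appears anywhere among the five highest-scoring of the 828 candidate words.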

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

StressTest: Can YOUR Speech LM Handle the Stress?
cs.CL · 2025-05 · conditional · novelty 6.0

Speech language models fail at reasoning about sentence stress but improve after fine-tuning on a new 17k-example synthetic dataset that varies stress to alter meaning.

Adaptive Perturbation Selection for Contrastive Audio Decoding
cs.SD · 2026-06 · unverdicted · novelty 5.0

Adaptive selection among a library of audio perturbations in contrastive decoding produces task-dependent accuracy gains, including +4.3% on an existence task via a hidden-state selector.

AUDITA: A New Dataset to Audit Humans vs. AI Skill at Audio QA
cs.CL · 2026-04 · unverdicted · novelty 5.0

AUDITA is a challenging audio QA benchmark where humans score 32% accuracy on average while state-of-the-art models score below 9%, using IRT to reveal systematic model deficiencies.