Crowdsourcing a Dataset of Audio Captions
Pith reviewed 2026-05-24 18:07 UTC · model grok-4.3
The pith
A three-step crowdsourcing process yields audio captions with fewer typographical errors than the initially gathered captions and an average per-clip Jaccard similarity of 0.24.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors establish that their three-step crowdsourcing framework produces a dataset with reduced typographical errors compared to initial captions, and that the selected captions for each audio clip exhibit an average Jaccard similarity of 0.24, indicating they are dissimilar yet contain overlapping information content.
What carries the argument
The three-step framework consisting of initial caption gathering, grammatical correction and rephrasing, and rating to select top captions.
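The three steps can be sketched as a small pipeline. This is a minimal illustration only, not the authors' implementation: the `Caption` class, the `edit` and `rate` callables, the use of a mean rating, and the `keep` cutoff are all hypothetical stand-ins for work that crowd workers perform in the paper.

```python
from dataclasses import dataclass, field
from statistics import mean
from typing import Callable, Iterable, List


@dataclass
class Caption:
    text: str
    stage: str  # "initial" (step 1) or "edited" (step 2)
    ratings: List[float] = field(default_factory=list)


def run_pipeline(
    initial_texts: Iterable[str],
    edit: Callable[[str], str],
    rate: Callable[[str], Iterable[float]],
    keep: int = 2,
) -> List[Caption]:
    """Gather, edit, rate, and keep the top-rated captions for one audio clip."""
    # Step 1: initial captions, as written by the annotators.
    initial = [Caption(text, "initial") for text in initial_texts]
    # Step 2: one corrected and/or rephrased version per initial caption.
    edited = [Caption(edit(c.text), "edited") for c in initial]
    # Step 3: rate initial and edited captions together.
    pool = initial + edited
    for caption in pool:
        # Assumption: rate() returns at least one score per caption.
        caption.ratings = list(rate(caption.text))
    # Ranking by mean rating is an assumption made for this sketch.
    ranked = sorted(pool, key=lambda c: mean(c.ratings), reverse=True)
    return ranked[:keep]


if __name__ == "__main__":
    selected = run_pipeline(
        ["teh dog barks", "rain on a roof"],
        edit=lambda text: text.replace("teh", "the"),
        rate=lambda text: [1.0] if "teh" in text else [5.0],
    )
    for caption in selected:
        print(caption.stage, caption.text)
```

Rating the initial and edited captions in one pool mirrors the abstract's description that both versions compete for a place in the final dataset.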
If this is right
- The resulting dataset has improved quality through error reduction.
- Captions balance diversity and shared content as measured by Jaccard similarity.
- The method can scale dataset creation for audio captioning tasks.
- Practices from image captioning datasets transfer to audio.
Where Pith is reading between the lines
- Similar crowdsourcing could be applied to create datasets for other audio-related tasks like sound event detection.
- The framework might reduce costs and time for building multimodal datasets in general.
- Future work could explore automated rating instead of worker ratings to further scale the process.
Load-bearing premise
Crowd worker ratings reliably select high-quality captions without subjective bias or exclusion of valid unconventional descriptions.
What would settle it
Compare the selected captions against expert-annotated gold standard captions for the same audio clips to check if error rates are actually lower and if the similarity measure correlates with caption usefulness.
Original abstract
Audio captioning is a novel field of multi-modal translation and it is the task of creating a textual description of the content of an audio signal (e.g. "people talking in a big room"). The creation of a dataset for this task requires a considerable amount of work, rendering the crowdsourcing a very attractive option. In this paper we present a three steps based framework for crowdsourcing an audio captioning dataset, based on concepts and practises followed for the creation of widely used image captioning and machine translations datasets. During the first step initial captions are gathered. A grammatically corrected and/or rephrased version of each initial caption is obtained in second step. Finally, the initial and edited captions are rated, keeping the top ones for the produced dataset. We objectively evaluate the impact of our framework during the process of creating an audio captioning dataset, in terms of diversity and amount of typographical errors in the obtained captions. The obtained results show that the resulting dataset has less typographical errors than the initial captions, and on average each sound in the produced dataset has captions with a Jaccard similarity of 0.24, roughly equivalent to two ten-word captions having in common four words with the same root, indicating that the captions are dissimilar while they still contain some of the same information.
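The Jaccard figure in the abstract can be checked with a few lines of code. The sketch below computes the Jaccard similarity of two captions as the size of the intersection of their word-root sets divided by the size of the union. The `naive_stem` function is a hypothetical stand-in for a real stemmer, and averaging over all caption pairs of a clip is an assumption, since the abstract does not state how the per-sound average is formed.

```python
from itertools import combinations
from statistics import mean
from typing import Callable, Sequence


def naive_stem(word: str) -> str:
    """Hypothetical stand-in for a real stemmer: strips a few common suffixes."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def jaccard(caption_a: str, caption_b: str,
            stem: Callable[[str], str] = naive_stem) -> float:
    """Jaccard similarity |A & B| / |A | B| over sets of stemmed words."""
    set_a = {stem(w) for w in caption_a.lower().split()}
    set_b = {stem(w) for w in caption_b.lower().split()}
    union = set_a | set_b
    if not union:
        return 0.0  # two empty captions: define the similarity as 0
    return len(set_a & set_b) / len(union)


def mean_pairwise_jaccard(captions: Sequence[str],
                          stem: Callable[[str], str] = naive_stem) -> float:
    """Average Jaccard similarity over all caption pairs of one clip."""
    pairs = list(combinations(captions, 2))
    if not pairs:
        raise ValueError("need at least two captions")
    return mean(jaccard(a, b, stem) for a, b in pairs)


if __name__ == "__main__":
    # Two synthetic ten-word captions that share four roots (s1..s4):
    # 4 shared roots / (10 + 10 - 4) distinct roots = 4 / 16 = 0.25.
    a = "a1 a2 a3 a4 a5 a6 s1 s2 s3 s4"
    b = "b1 b2 b3 b4 b5 b6 s1 s2 s3 s4"
    print(jaccard(a, b))
```

The synthetic example reproduces the abstract's rule of thumb: four shared roots between two ten-word captions give 0.25, close to the reported 0.24.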
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes a three-step crowdsourcing framework for building an audio captioning dataset: (1) collect initial captions for audio clips, (2) obtain grammatically corrected or rephrased versions, and (3) rate the captions and retain the top ones. The authors claim that the resulting dataset exhibits fewer typographical errors than the initial captions and that captions per sound have an average Jaccard similarity of 0.24, which they interpret as evidence of diversity while retaining shared information content.
Significance. If the evaluation metrics are shown to be reliable, the framework could provide a practical, reusable method for constructing audio caption datasets by adapting established practices from image captioning and machine translation. The work is primarily methodological and empirical rather than theoretical; its value would lie in enabling higher-quality training data for audio captioning models, provided the claimed improvements in error reduction and controlled diversity are substantiated.
major comments (3)
- [Abstract] The reduction in typographical errors is presented as an objective outcome of the framework, yet no description is given of the detection procedure (e.g., automated spell-checking, manual review, or specific tools), sample sizes, or statistical significance testing. This metric is load-bearing for the quality claim and cannot be assessed without these details.
- [Abstract] The Jaccard similarity of 0.24 is interpreted as indicating that captions are dissimilar yet share information, but the paper provides neither a baseline comparison (e.g., similarity among non-selected captions or random pairs) nor an analysis showing that overlapping tokens are primarily content words describing the audio rather than stop words. Without these, the balance-of-diversity interpretation is unsupported.
- [Abstract / step-3 description] No information is supplied on the rating protocol in step 3, including number of raters per caption, inter-rater agreement statistics, correlation of ratings with any objective quality measure, or safeguards against bias (e.g., penalizing unconventional but accurate descriptions). This directly affects the validity of the 'top ones' selection that defines the final dataset.
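The baseline and stop-word checks requested in the second major comment could look roughly like the sketch below. It is an illustration under stated assumptions: the `STOP_WORDS` set is a tiny hypothetical list, captions are split on whitespace without stemming, and the baseline uses every cross-clip caption pair instead of a random sample so that the result is deterministic.

```python
from itertools import combinations, product
from statistics import mean
from typing import Dict, List, Set

# Hypothetical, deliberately small stop-word list for illustration.
STOP_WORDS = frozenset(
    {"a", "an", "the", "in", "on", "of", "and", "is", "are", "with", "to"}
)


def tokens(caption: str, drop_stop_words: bool = False) -> Set[str]:
    words = set(caption.lower().split())
    return words - STOP_WORDS if drop_stop_words else words


def jaccard(a: Set[str], b: Set[str]) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def within_clip(clips: Dict[str, List[str]], drop_stop_words: bool = False) -> float:
    """Mean Jaccard over caption pairs that describe the same clip."""
    return mean(
        jaccard(tokens(x, drop_stop_words), tokens(y, drop_stop_words))
        for captions in clips.values()
        for x, y in combinations(captions, 2)
    )


def across_clips(clips: Dict[str, List[str]], drop_stop_words: bool = False) -> float:
    """Baseline: mean Jaccard over caption pairs taken from different clips."""
    return mean(
        jaccard(tokens(x, drop_stop_words), tokens(y, drop_stop_words))
        for (_, caps_1), (_, caps_2) in combinations(clips.items(), 2)
        for x, y in product(caps_1, caps_2)
    )


if __name__ == "__main__":
    clips = {
        "dog": ["a dog barks in the yard", "the dog is barking loudly"],
        "rain": ["rain falls on a roof", "the rain is heavy"],
    }
    for drop in (False, True):
        print(drop, within_clip(clips, drop), across_clips(clips, drop))
```

If the within-clip score stays clearly above the cross-clip baseline once stop words are removed, the overlap is more plausibly driven by shared audio content than by function words.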
minor comments (2)
- [Abstract] 'practises' should be 'practices'; 'amount of typographical errors' is better phrased as 'number of typographical errors'.
- The manuscript should report basic dataset statistics (number of audio clips, total captions collected and retained, average caption length) to allow readers to contextualize the reported similarity and error figures.
Simulated Author's Rebuttal
We thank the referee for the constructive comments. We address each major comment below and will revise the manuscript accordingly to provide the requested details and analyses.
- Referee: [Abstract] The reduction in typographical errors is presented as an objective outcome of the framework, yet no description is given of the detection procedure (e.g., automated spell-checking, manual review, or specific tools), sample sizes, or statistical significance testing. This metric is load-bearing for the quality claim and cannot be assessed without these details.
  Authors: We agree that the typographical error reduction claim requires more methodological detail. In the revised manuscript we will describe the exact detection procedure (including any automated tools or manual review), report the sample sizes examined, and include statistical significance testing. Revision: yes.
- Referee: [Abstract] The Jaccard similarity of 0.24 is interpreted as indicating that captions are dissimilar yet share information, but the paper provides neither a baseline comparison (e.g., similarity among non-selected captions or random pairs) nor an analysis showing that overlapping tokens are primarily content words describing the audio rather than stop words. Without these, the balance-of-diversity interpretation is unsupported.
  Authors: We acknowledge that the diversity interpretation would be strengthened by baselines and token-level analysis. We will add comparisons against non-selected captions and random pairs, and we will analyze the overlapping tokens to determine the proportion that are content words versus stop words. Revision: yes.
- Referee: [Abstract / step-3 description] No information is supplied on the rating protocol in step 3, including number of raters per caption, inter-rater agreement statistics, correlation of ratings with any objective quality measure, or safeguards against bias (e.g., penalizing unconventional but accurate descriptions). This directly affects the validity of the 'top ones' selection that defines the final dataset.
  Authors: We agree that the rating protocol description is incomplete. In the revision we will specify the number of raters per caption, report inter-rater agreement, discuss any correlation with objective quality measures, and describe safeguards against selection bias. Revision: yes.
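One common way to report the inter-rater agreement mentioned in the last response is Cohen's kappa, which corrects the observed agreement p_o for the agreement p_e expected by chance: kappa = (p_o - p_e) / (1 - p_e). The sketch below is an illustration for exactly two raters with categorical scores; the paper does not say which statistic, how many raters, or which rating scale would be used, so all of those are assumptions here.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(rater_1: Sequence[Hashable], rater_2: Sequence[Hashable]) -> float:
    """Cohen's kappa for two raters who scored the same items."""
    if len(rater_1) != len(rater_2) or not rater_1:
        raise ValueError("both raters must score the same non-empty item list")
    n = len(rater_1)
    # Observed agreement: share of items with identical labels.
    p_o = sum(a == b for a, b in zip(rater_1, rater_2)) / n
    # Chance agreement from each rater's marginal label frequencies.
    counts_1, counts_2 = Counter(rater_1), Counter(rater_2)
    p_e = sum(counts_1[label] * counts_2[label] for label in counts_1) / (n * n)
    if p_e == 1.0:
        # Both raters used one identical label throughout; kappa is
        # undefined here, and returning 1.0 is a convention of this sketch.
        return 1.0
    return (p_o - p_e) / (1 - p_e)


if __name__ == "__main__":
    # Hypothetical binary "keep / discard" judgements on four captions.
    print(cohens_kappa([1, 1, 0, 0], [1, 1, 0, 0]))
    print(cohens_kappa([1, 1, 0, 0], [1, 0, 1, 0]))
```

With more than two raters per caption, a multi-rater statistic such as Fleiss' kappa or Krippendorff's alpha would be the usual replacement.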
Circularity Check
No circularity: purely empirical framework with direct measurements
The paper presents a three-step crowdsourcing procedure for collecting audio captions and reports objective measurements (typographical error counts and average Jaccard similarity of 0.24) on the resulting dataset. No derivations, equations, fitted parameters, or predictions appear; the central claims are direct empirical outcomes of the described process rather than reductions to self-defined quantities or self-citations. The evaluation does not rely on any load-bearing uniqueness theorems or ansatzes imported from prior author work. This is a standard non-circular empirical study.
Axiom & Free-Parameter Ledger
axioms (3)
- Domain assumption: Crowdsourcing platforms can be used to gather initial audio captions from non-expert workers.
- Domain assumption: Grammatical correction and rephrasing improve caption quality.
- Domain assumption: Rating by workers can identify the best captions for the dataset.
Reference graph
Works this paper leans on
-
[1]
Crowdsourcing a Dataset of Audio Captions
INTRODUCTION Multimodal datasets usually have a set of data in one modality and paired set of data in another modality, creating an association of two different forms of media. These datasets differ from a typical classification or regression dataset in the sense that the two modalities convey the same content, but in different form. One example is the c...
-
[2]
PROPOSED FRAMEWORK Our proposed framework consists of three serially executed steps, inspired by practices followed in the creation of image captioning and machine translation datasets [8, 7, 17], and which is implemented on an online crowdsourcing platform. We employ as potential annotators the registered users of the platform (the total amount of re...
-
[3]
EVALUATION We evaluate our framework in the process of creating a new audio captioning dataset that will be released in Autumn 2019, by objectively assessing the impact of the three steps in terms of grammatical correctness and diversity of the gathered captions. We assess the grammatical correctness through the amount of typographical errors (the l...
-
[4]
RESULTS & DISCUSSION Figure 2 illustrates the frequency of audio files with typographical errors in their captions, for both initial and edited captions. It can be seen that the edited captions are less likely to contain any typographical errors than the initial captions. This means that the second step has a positive impact on the grammatical correctnes...
-
[5]
Our framework is based on three steps of gathering, editing, and scoring the captions
CONCLUSIONS & FUTURE WORK In this paper we presented a framework for the creation of an audio captioning dataset, using a crowdsourcing platform. Our framework is based on three steps of gathering, editing, and scoring the captions. We objectively evaluated the framework during the process of creating a new dataset for audio captioning, and in terms o...
-
[6]
Deep visual-semantic alignments for generating image descriptions,
A. Karpathy and L. Fei-Fei, “Deep visual-semantic alignments for generating image descriptions,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, no. 4, pp. 664–676, Apr. 2017. [Online]. Available: https://doi.org/10.1109/TPAMI.2016.2598339
-
[7]
Microsoft COCO Captions: Data Collection and Evaluation Server
X. Chen, H. Fang, T.-Y. Lin, R. Vedantam, S. Gupta, P. Dollár, and C. L. Zitnick, “Microsoft coco captions: Data collection and evaluation server,” arXiv preprint arXiv:1504.00325, 2015
-
[8]
P. Young, A. Lai, M. Hodosh, and J. Hockenmaier, “From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions,” Transactions of the Association for Computational Linguistics, vol. 2, pp. 67–78, 2014. [Online]. Available: https://transacl.org/ojs/index.php/tacl/article/view/229
-
[9]
Automated audio captioning with recurrent neural networks,
K. Drossos, S. Adavanne, and T. Virtanen, “Automated audio captioning with recurrent neural networks,” in 2017 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), Oct 2017, pp. 374–378
-
[10]
Deep learning for image-to-text generation: A technical overview,
X. He and L. Deng, “Deep learning for image-to-text generation: A technical overview,” IEEE Signal Processing Magazine, vol. 34, no. 6, pp. 109–116, Nov 2017
-
[11]
Video captioning using deep learning: An overview of methods, datasets and metrics,
M. Amaresh and S. Chitrakala, “Video captioning using deep learning: An overview of methods, datasets and metrics,” in 2019 International Conference on Communication and Signal Processing (ICCSP), April 2019, pp. 0656–0661
-
[12]
Microsoft COCO: Common Objects in Context
T. Lin, M. Maire, S. J. Belongie, L. D. Bourdev, R. B. Girshick, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft COCO: common objects in context,” CoRR, vol. abs/1405.0312, 2014. [Online]. Available: http://arxiv.org/abs/1405.0312
-
[13]
Framing image description as a ranking task: Data, models and evaluation metrics,
M. Hodosh, P. Young, and J. Hockenmaier, “Framing image description as a ranking task: Data, models and evaluation metrics,” in J. Artif. Intell. Res., 2013
-
[14]
Beads: A dataset of binaural emotionally annotated digital sounds,
K. Drossos, A. Floros, and A. Giannakoulopoulos, “Beads: A dataset of binaural emotionally annotated digital sounds,” in IISA 2014, The 5th International Conference on Information, Intelligence, Systems and Applications, July 2014, pp. 158–163
-
[15]
Collecting image annotations using amazon’s mechanical turk,
C. Rashtchian, P. Young, M. Hodosh, and J. Hockenmaier, “Collecting image annotations using amazon’s mechanical turk,” in Proceedings of the NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon’s Mechanical Turk, ser. CSLDAMT ’10. Stroudsburg, PA, USA: Association for Computational Linguistics, 2010, pp. 139–147. [Online]. Available: ...
-
[16]
ProSoundEffects, “Master Library 2.0,” http://www.prosoundeffects.com/blog/master-library-2-0-nab/, accessed March 2017, 2015. [Online]. Available: http://www.prosoundeffects.com/blog/master-library-2-0-nab/
-
[17]
Audio caption: Listen and tell,
M. Wu, H. Dinkel, and K. Yu, “Audio caption: Listen and tell,” in ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), May 2019, pp. 830–834
-
[18]
Audiocaps: Generating captions for audios in the wild,
C. D. Kim, B. Kim, H. Lee, and G. Kim, “Audiocaps: Generating captions for audios in the wild,” in NAACL-HLT, 2019
-
[19]
Audio set: An ontology and human-labeled dataset for audio events,
J. F. Gemmeke, D. P. W. Ellis, D. Freedman, A. Jansen, W. Lawrence, R. C. Moore, M. Plakal, and M. Ritter, “Audio set: An ontology and human-labeled dataset for audio events,” in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), March 2017, pp. 776–780
-
[20]
Using the amazon mechanical turk for transcription of spoken language,
M. Marge, S. Banerjee, and A. I. Rudnicky, “Using the amazon mechanical turk for transcription of spoken language,” in 2010 IEEE International Conference on Acoustics, Speech and Signal Processing, March 2010, pp. 5270–5273
-
[21]
Utility data annotation with amazon mechanical turk,
A. Sorokin and D. Forsyth, “Utility data annotation with amazon mechanical turk,” in 2008 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, June 2008, pp. 1–8
-
[22]
Crowdsourcing translation: Professional quality from non-professionals,
O. F. Zaidan and C. Callison-Burch, “Crowdsourcing translation: Professional quality from non-professionals,” in Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1, ser. HLT ’11. Stroudsburg, PA, USA: Association for Computational Linguistics, 2011, pp. 1220–
-
[23]
Available: http://dl.acm.org/citation.cfm?id=2002472.2002626
[Online]. Available: http://dl.acm.org/citation.cfm?id=2002472.2002626
-
[24]
NLTK: The natural language toolkit,
E. Loper and S. Bird, “NLTK: The natural language toolkit,” in Proceedings of the ACL Workshop on Effective Tools and Methodologies for Teaching Natural Language Processing and Computational Linguistics, 2002