Crowdsourcing a Dataset of Audio Captions

Konstantinos Drossos; Samuel Lipping; Tuomas Virtanen

arxiv: 1907.09238 · v1 · pith:XPP56FCFnew · submitted 2019-07-22 · 💻 cs.SD · eess.AS

Pith reviewed 2026-05-24 18:07 UTC · model grok-4.3

keywords: audio captioning · crowdsourcing · dataset · Jaccard similarity · typographical errors · multi-modal

The pith

A three-step crowdsourcing process yields audio captions with fewer typographical errors and an average Jaccard similarity of 0.24.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper describes a framework for crowdsourcing an audio captioning dataset using three steps inspired by image captioning and translation datasets. Initial captions are gathered from workers, then edited for grammar and rephrasing, and finally rated to keep the best versions. Evaluation shows the final dataset has fewer typographical errors than the raw initial captions. On average, captions for each sound share a Jaccard similarity of 0.24, which the authors interpret as providing some common information while remaining dissimilar. This matters because manual creation of such datasets is time-consuming, making crowdsourcing attractive for building resources for audio-to-text models.

Core claim

The authors establish that their three-step crowdsourcing framework produces a dataset with reduced typographical errors compared to initial captions, and that the selected captions for each audio clip exhibit an average Jaccard similarity of 0.24, indicating they are dissimilar yet contain overlapping information content.
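A minimal sketch of how a per-clip average Jaccard similarity could be computed, assuming captions are compared as sets of lowercased, whitespace-separated words. The abstract counts words "with the same root", which implies stemming; that step is omitted here, and the captions below are made up for illustration rather than taken from the dataset.

```python
from itertools import combinations


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity between the word sets of two captions."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    union = set_a | set_b
    # Two empty captions are treated as identical by convention.
    return len(set_a & set_b) / len(union) if union else 1.0


def mean_pairwise_jaccard(captions: list[str]) -> float:
    """Average Jaccard similarity over all unordered caption pairs of one clip."""
    pairs = list(combinations(captions, 2))
    if not pairs:
        raise ValueError("need at least two captions")
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Hypothetical captions for a single audio clip (not from the dataset).
clip_captions = [
    "people talking in a big room",
    "a crowd talking in a hall",
    "several people chat in a large room",
]
score = mean_pairwise_jaccard(clip_captions)
print(f"mean pairwise Jaccard: {score:.3f}")
```

Averaging this per-clip score over all clips would give a dataset-level figure comparable in spirit to the reported 0.24, although the exact tokenization and stemming used by the authors are not stated on this page.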

What carries the argument

The three-step framework consisting of initial caption gathering, grammatical correction and rephrasing, and rating to select top captions.
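The three steps can be sketched as a small data flow. Everything named here (`Caption`, `build_candidates`, `select_top`, the rating scale, and the number of captions kept) is a hypothetical illustration and not the authors' implementation; in the paper each step is carried out by crowd workers, not by code.

```python
from dataclasses import dataclass, field
from statistics import mean


@dataclass
class Caption:
    text: str
    source: str  # "initial" or "edited"
    ratings: list[float] = field(default_factory=list)

    @property
    def score(self) -> float:
        # Unrated captions sort last.
        return mean(self.ratings) if self.ratings else float("-inf")


def build_candidates(initial: list[str], edit) -> list[Caption]:
    """Steps 1 and 2: pair every initial caption with its edited version."""
    candidates = []
    for text in initial:
        candidates.append(Caption(text, "initial"))
        candidates.append(Caption(edit(text), "edited"))
    return candidates


def select_top(candidates: list[Caption], keep: int) -> list[Caption]:
    """Step 3: keep the `keep` highest-rated captions."""
    return sorted(candidates, key=lambda c: c.score, reverse=True)[:keep]


# Toy run: the "editor" only fixes one known typo, and ratings are invented.
initial = ["people talkng in a room", "a dog barks"]
candidates = build_candidates(initial, lambda t: t.replace("talkng", "talking"))
for cap, rating in zip(candidates, [[2, 3], [4, 5], [4, 4], [4, 3]]):
    cap.ratings.extend(rating)
selected = select_top(candidates, keep=2)
print([c.text for c in selected])
```

Rating initial and edited captions in the same pool, as the abstract describes, means an edit is only retained when raters actually prefer it.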

If this is right

The resulting dataset has improved quality through error reduction.
Captions balance diversity and shared content as measured by Jaccard similarity.
The method can scale dataset creation for audio captioning tasks.
Practices from image captioning datasets transfer to audio.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar crowdsourcing could be applied to create datasets for other audio-related tasks like sound event detection.
The framework might reduce costs and time for building multimodal datasets in general.
Future work could explore automated rating instead of worker ratings to further scale the process.

Load-bearing premise

Crowd worker ratings reliably select high-quality captions without subjective bias or exclusion of valid unconventional descriptions.

What would settle it

Compare the selected captions against expert-annotated gold standard captions for the same audio clips to check if error rates are actually lower and if the similarity measure correlates with caption usefulness.

Figures

Figures reproduced from arXiv: 1907.09238 by Konstantinos Drossos, Samuel Lipping, Tuomas Virtanen.

**Figure 2.** The number of typographical errors in the captions by the

**Figure 3.** Jaccard similarity between initial and edited captions.

**Figure 4.** Cross-similarity of selected, non-selected, and initial cap

Original abstract

Audio captioning is a novel field of multi-modal translation and it is the task of creating a textual description of the content of an audio signal (e.g. "people talking in a big room"). The creation of a dataset for this task requires a considerable amount of work, rendering the crowdsourcing a very attractive option. In this paper we present a three steps based framework for crowdsourcing an audio captioning dataset, based on concepts and practises followed for the creation of widely used image captioning and machine translations datasets. During the first step initial captions are gathered. A grammatically corrected and/or rephrased version of each initial caption is obtained in second step. Finally, the initial and edited captions are rated, keeping the top ones for the produced dataset. We objectively evaluate the impact of our framework during the process of creating an audio captioning dataset, in terms of diversity and amount of typographical errors in the obtained captions. The obtained results show that the resulting dataset has less typographical errors than the initial captions, and on average each sound in the produced dataset has captions with a Jaccard similarity of 0.24, roughly equivalent to two ten-word captions having in common four words with the same root, indicating that the captions are dissimilar while they still contain some of the same information.
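The abstract's "two ten-word captions sharing four words" equivalence can be checked by hand. The working below assumes each caption contributes ten distinct word roots (an assumption; repeated words would shrink the sets), with A and B denoting the two sets of roots.

```latex
J(A, B) = \frac{|A \cap B|}{|A \cup B|}
        = \frac{|A \cap B|}{|A| + |B| - |A \cap B|}
        = \frac{4}{10 + 10 - 4}
        = \frac{4}{16} = 0.25 \approx 0.24
```

The shared roots are counted once in the union, which is why four common words out of ten give 0.25 and not 0.4.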

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper gives a workable three-step crowdsourcing method for audio captions that cuts typos and gets moderate diversity via Jaccard 0.24, but the rating step lacks reported checks for reliability or bias.

The main takeaway is a three-step crowdsourcing pipeline for audio caption datasets: gather raw captions, edit them for grammar and phrasing, then rate and retain the top ones. The authors report that the final set has fewer typographical errors than the initial captions and an average Jaccard similarity of 0.24 across captions for the same sound, which they interpret as a reasonable mix of diversity and shared content.

Referee Report

3 major / 2 minor

Summary. The paper proposes a three-step crowdsourcing framework for building an audio captioning dataset: (1) collect initial captions for audio clips, (2) obtain grammatically corrected or rephrased versions, and (3) rate the captions and retain the top ones. The authors claim that the resulting dataset exhibits fewer typographical errors than the initial captions and that captions per sound have an average Jaccard similarity of 0.24, which they interpret as evidence of diversity while retaining shared information content.

Significance. If the evaluation metrics are shown to be reliable, the framework could provide a practical, reusable method for constructing audio caption datasets by adapting established practices from image captioning and machine translation. The work is primarily methodological and empirical rather than theoretical; its value would lie in enabling higher-quality training data for audio captioning models, provided the claimed improvements in error reduction and controlled diversity are substantiated.

major comments (3)

[Abstract] The reduction in typographical errors is presented as an objective outcome of the framework, yet no description is given of the detection procedure (e.g., automated spell-checking, manual review, or specific tools), sample sizes, or statistical significance testing. This metric is load-bearing for the quality claim and cannot be assessed without these details.
[Abstract] The Jaccard similarity of 0.24 is interpreted as indicating that captions are dissimilar yet share information, but the paper provides neither a baseline comparison (e.g., similarity among non-selected captions or random pairs) nor an analysis showing that overlapping tokens are primarily content words describing the audio rather than stop words. Without these, the balance-of-diversity interpretation is unsupported.
[Abstract / step-3 description] No information is supplied on the rating protocol in step 3, including number of raters per caption, inter-rater agreement statistics, correlation of ratings with any objective quality measure, or safeguards against bias (e.g., penalizing unconventional but accurate descriptions). This directly affects the validity of the 'top ones' selection that defines the final dataset.
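The baseline and stop-word checks requested in the second comment could look roughly like the sketch below. It compares content-word Jaccard similarity for caption pairs of the same clip against pairs drawn from different clips. The stop-word list, clip ids, and captions are all invented for illustration, and no stemming is applied, so "barks" and "barking" count as different words.

```python
from itertools import combinations, product

# Tiny illustrative stop-word list; an assumption, not the paper's.
STOP_WORDS = {"a", "an", "the", "in", "on", "of", "and", "is", "are", "with"}


def content_words(caption: str) -> set[str]:
    """Lowercased words of a caption with stop words removed."""
    return {w for w in caption.lower().split() if w not in STOP_WORDS}


def jaccard(a: set[str], b: set[str]) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def mean_similarity(pairs) -> float:
    """Mean content-word Jaccard similarity over an iterable of caption pairs."""
    scores = [jaccard(content_words(x), content_words(y)) for x, y in pairs]
    return sum(scores) / len(scores)


# Hypothetical captions keyed by clip id.
captions = {
    "clip_1": ["people talking in a big room", "a crowd talking in a hall"],
    "clip_2": ["a dog barks in the distance", "the dog is barking outside"],
}

# Pairs of captions describing the same clip.
within = mean_similarity(
    pair for caps in captions.values() for pair in combinations(caps, 2)
)
# Baseline: pairs of captions describing different clips.
across = mean_similarity(product(captions["clip_1"], captions["clip_2"]))
print(f"within-clip: {within:.3f}  cross-clip baseline: {across:.3f}")
```

A within-clip score clearly above the cross-clip baseline, after stop words are removed, would support reading 0.24 as shared audio content and not as shared function words.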

minor comments (2)

[Abstract] Abstract: 'practises' should be 'practices'; 'amount of typographical errors' is better phrased as 'number of typographical errors'.
The manuscript should report basic dataset statistics (number of audio clips, total captions collected and retained, average caption length) to allow readers to contextualize the reported similarity and error figures.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major comment below and will revise the manuscript accordingly to provide the requested details and analyses.

Referee: [Abstract] The reduction in typographical errors is presented as an objective outcome of the framework, yet no description is given of the detection procedure (e.g., automated spell-checking, manual review, or specific tools), sample sizes, or statistical significance testing. This metric is load-bearing for the quality claim and cannot be assessed without these details.

Authors: We agree that the typographical error reduction claim requires more methodological detail. In the revised manuscript we will describe the exact detection procedure (including any automated tools or manual review), report the sample sizes examined, and include statistical significance testing.
revision: yes

Referee: [Abstract] The Jaccard similarity of 0.24 is interpreted as indicating that captions are dissimilar yet share information, but the paper provides neither a baseline comparison (e.g., similarity among non-selected captions or random pairs) nor an analysis showing that overlapping tokens are primarily content words describing the audio rather than stop words. Without these, the balance-of-diversity interpretation is unsupported.

Authors: We acknowledge that the diversity interpretation would be strengthened by baselines and token-level analysis. We will add comparisons against non-selected captions and random pairs, and we will analyze the overlapping tokens to determine the proportion that are content words versus stop words.
revision: yes

Referee: [Abstract / step-3 description] No information is supplied on the rating protocol in step 3, including number of raters per caption, inter-rater agreement statistics, correlation of ratings with any objective quality measure, or safeguards against bias (e.g., penalizing unconventional but accurate descriptions). This directly affects the validity of the 'top ones' selection that defines the final dataset.

Authors: We agree that the rating protocol description is incomplete. In the revision we will specify the number of raters per caption, report inter-rater agreement, discuss any correlation with objective quality measures, and describe safeguards against selection bias.
revision: yes
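One way the promised inter-rater agreement could be reported is Cohen's kappa for a pair of raters, sketched below with the standard library only. The rating scale, the number of raters, and the scores are hypothetical, since the page does not state them; for ordinal ratings or more than two raters, a weighted kappa or Krippendorff's alpha may be a better fit.

```python
from collections import Counter


def cohens_kappa(rater_a: list[int], rater_b: list[int]) -> float:
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty item list")
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement from each rater's marginal label frequencies.
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label
    return (observed - expected) / (1 - expected)


# Hypothetical 1-4 ratings from two workers for six captions.
worker_1 = [4, 3, 4, 2, 1, 3]
worker_2 = [4, 3, 3, 2, 1, 4]
kappa = cohens_kappa(worker_1, worker_2)
print(f"Cohen's kappa: {kappa:.3f}")
```

Here the workers agree on four of six captions, but kappa is lower than the raw agreement rate because part of that agreement is expected by chance.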

Circularity Check

0 steps flagged

No circularity: purely empirical framework with direct measurements

The paper presents a three-step crowdsourcing procedure for collecting audio captions and reports objective measurements (typographical error counts and average Jaccard similarity of 0.24) on the resulting dataset. No derivations, equations, fitted parameters, or predictions appear; the central claims are direct empirical outcomes of the described process rather than reductions to self-defined quantities or self-citations. The evaluation does not rely on any load-bearing uniqueness theorems or ansatzes imported from prior author work. This is a standard non-circular empirical study.

Axiom & Free-Parameter Ledger

0 free parameters · 3 axioms · 0 invented entities

The paper relies on standard assumptions about crowdsourcing effectiveness without introducing new free parameters or invented entities.

axioms (3)

Domain assumption: Crowdsourcing platforms can be used to gather initial audio captions from non-expert workers. Basis for the first step of the framework.
Domain assumption: Grammatical correction and rephrasing improve caption quality. Core of the second step.
Domain assumption: Rating by workers can identify the best captions for the dataset. Foundation for the third step and final dataset selection.

Reference graph

Works this paper leans on

24 extracted references · 24 canonical work pages · 3 internal anchors

[1] Crowdsourcing a Dataset of Audio Captions. INTRODUCTION: Multimodal datasets usually have a set of data in one modality and paired set of data in another modality, creating an association of two different forms of media. These datasets differ from a typical classification or regression dataset in the sense that the two modalities convey the same content, but in different form. One example is the c...
[2] PROPOSED FRAMEWORK: Our proposed framework consists of three serially executed steps, inspired by practices followed in the creation of image captioning and machine translation datasets [8, 7, 17], and which is implemented on an online crowdsourcing platform. We employ as potential annotators the registered users of the platform (the total amount of re...
[3] EVALUATION: We evaluate our framework in the process of creating a new audio captioning dataset that will be released in Autumn 2019, by objectively assessing the impact of the three steps in terms of grammatical correctness and diversity of the gathered captions. We assess the grammatical correctness through the amount of typographical errors (the l...
[4] RESULTS & DISCUSSION: Figure 2 illustrates the frequency of audio files with typographical errors in their captions, for both initial and edited captions. It can be seen that the edited captions are less likely to contain any typographical errors than the initial captions. This means that the second step has a positive impact on the grammatical correctnes...
[5] CONCLUSIONS & FUTURE WORK: In this paper we presented a framework for the creation of an audio captioning dataset, using a crowdsourcing platform. Our framework is based on three steps of gathering, editing, and scoring the captions. We objectively evaluated the framework during the process of creating a new dataset for audio captioning, and in terms o...
[6] A. Karpathy and L. Fei-Fei, “Deep visual-semantic alignments for generating image descriptions,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, no. 4, pp. 664–676, Apr. 2017. [Online]. Available: https://doi.org/10.1109/TPAMI.2016.2598339
[7] X. Chen, H. Fang, T.-Y. Lin, R. Vedantam, S. Gupta, P. Dollár, and C. L. Zitnick, “Microsoft COCO captions: Data collection and evaluation server,” arXiv preprint arXiv:1504.00325, 2015
[8] P. Young, A. Lai, M. Hodosh, and J. Hockenmaier, “From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions,” Transactions of the Association for Computational Linguistics, vol. 2, pp. 67–78, 2014. [Online]. Available: https://transacl.org/ojs/index.php/tacl/article/view/229
[9] K. Drossos, S. Adavanne, and T. Virtanen, “Automated audio captioning with recurrent neural networks,” in 2017 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), Oct 2017, pp. 374–378
[10] X. He and L. Deng, “Deep learning for image-to-text generation: A technical overview,” IEEE Signal Processing Magazine, vol. 34, no. 6, pp. 109–116, Nov 2017
[11] M. Amaresh and S. Chitrakala, “Video captioning using deep learning: An overview of methods, datasets and metrics,” in 2019 International Conference on Communication and Signal Processing (ICCSP), April 2019, pp. 0656–0661
[12] T. Lin, M. Maire, S. J. Belongie, L. D. Bourdev, R. B. Girshick, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft COCO: common objects in context,” CoRR, vol. abs/1405.0312, 2014. [Online]. Available: http://arxiv.org/abs/1405.0312
[13] M. Hodosh, P. Young, and J. Hockenmaier, “Framing image description as a ranking task: Data, models and evaluation metrics,” J. Artif. Intell. Res., 2013
[14] K. Drossos, A. Floros, and A. Giannakoulopoulos, “Beads: A dataset of binaural emotionally annotated digital sounds,” in IISA 2014, The 5th International Conference on Information, Intelligence, Systems and Applications, July 2014, pp. 158–163
[15] C. Rashtchian, P. Young, M. Hodosh, and J. Hockenmaier, “Collecting image annotations using Amazon’s Mechanical Turk,” in Proceedings of the NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon’s Mechanical Turk, ser. CSLDAMT ’10. Stroudsburg, PA, USA: Association for Computational Linguistics, 2010, pp. 139–147. [Online]. Available: ...
[16] ProSoundEffects, “Master Library 2.0,” 2015. [Online]. Available: http://www.prosoundeffects.com/blog/master-library-2-0-nab/ (accessed March 2017)
[17] M. Wu, H. Dinkel, and K. Yu, “Audio caption: Listen and tell,” in ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), May 2019, pp. 830–834
[18] C. D. Kim, B. Kim, H. Lee, and G. Kim, “Audiocaps: Generating captions for audios in the wild,” in NAACL-HLT, 2019
[19] J. F. Gemmeke, D. P. W. Ellis, D. Freedman, A. Jansen, W. Lawrence, R. C. Moore, M. Plakal, and M. Ritter, “Audio set: An ontology and human-labeled dataset for audio events,” in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), March 2017, pp. 776–780
[20] M. Marge, S. Banerjee, and A. I. Rudnicky, “Using the Amazon Mechanical Turk for transcription of spoken language,” in 2010 IEEE International Conference on Acoustics, Speech and Signal Processing, March 2010, pp. 5270–5273
[21] A. Sorokin and D. Forsyth, “Utility data annotation with Amazon Mechanical Turk,” in 2008 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, June 2008, pp. 1–8
[22] O. F. Zaidan and C. Callison-Burch, “Crowdsourcing translation: Professional quality from non-professionals,” in Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1, ser. HLT ’11. Stroudsburg, PA, USA: Association for Computational Linguistics, 2011, pp. 1220–
[23] [Online]. Available: http://dl.acm.org/citation.cfm?id=2002472.2002626
[24] E. Loper and S. Bird, “NLTK: The natural language toolkit,” in Proceedings of the ACL Workshop on Effective Tools and Methodologies for Teaching Natural Language Processing and Computational Linguistics, 2002
