Sound Search by Text Description or Vocal Imitation?

Yichi Zhang; Yiting Zhang; Zhiyao Duan

arXiv: 1907.08661 · v1 · pith:5Z6EAEV6new · submitted 2019-07-19 · 💻 cs.HC · cs.SD · eess.AS

Pith reviewed 2026-05-24 18:55 UTC · model grok-4.3

classification 💻 cs.HC · cs.SD · eess.AS

keywords: sound search, vocal imitation, text description, user study, audio retrieval, query by vocal imitation, subjective evaluation, ease of use

The pith

Vocal imitation search yields higher user satisfaction than text descriptions for sounds that are hard to put into words.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper builds two web-based sound search systems, one that takes vocal imitations as input and one that takes text labels, then runs a user study to see which works better for different kinds of sounds. Participants gave the vocal system significantly higher satisfaction scores on categories they found difficult to describe in text, and they also rated the vocal system easier to use overall on the test collection. A sympathetic reader would care because everyday sound search often fails when words cannot capture the exact audio quality, so a practical alternative that bypasses verbal description could change how people find audio clips. The work is framed as a pilot that moves vocal-imitation algorithms out of simulation and into real user interaction.

Core claim

Users reported significantly higher search satisfaction with the vocal-imitation engine than with the text-description engine for sound categories difficult to describe by text, and they gave the vocal engine a better overall ease-of-use rating on the limited sound library used in the experiments.

What carries the argument

Two web-based search engines, Vroom! accepting vocal imitations and TextSearch accepting text descriptions, evaluated through subjective satisfaction and ease-of-use ratings collected from real users.

Load-bearing premise

The subjective ratings collected from users on the limited sound library accurately reflect real-world search performance and generalize beyond the specific experimental setup and participant pool.

What would settle it

A follow-up study with a larger and more diverse sound library in which participants show equal or higher satisfaction ratings for the text engine on the same hard-to-describe categories.

Figures

Figures reproduced from arXiv: 1907.08661 by Yichi Zhang, Yiting Zhang, Zhiyao Duan.

**Figure 2.** Experimental framework hosting the proposed vocal imitation based search engine

**Figure 3.** Average user ratings of sound search by text descrip...

Original abstract

Searching sounds by text labels is often difficult, as text descriptions cannot describe the audio content in detail. Query by vocal imitation bridges such gap and provides a novel way to sound search. Several algorithms for sound search by vocal imitation have been proposed and evaluated in a simulation environment, however, they have not been deployed into a real search engine nor evaluated by real users. This pilot work conducts a subjective study to compare these two approaches to sound search, and tries to answer the question of which approach works better for what kinds of sounds. To do so, we developed two web-based search engines for sound, one by vocal imitation (Vroom!) and the other by text description (TextSearch). We also developed an experimental framework to host these engines to collect statistics of user behaviors and ratings. Results showed that Vroom! received significantly higher search satisfaction ratings than TextSearch did for sound categories that were difficult for subjects to describe by text. Results also showed a better overall ease-of-use rating for Vroom! than TextSearch on the limited sound library in our experiments. These findings suggest advantages of vocal-imitation-based search for sound in practice.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Pilot user study shows vocal imitation search can outperform text for hard-to-describe sounds, but the small library and thin methods reporting keep the practical takeaway modest.

The core takeaway is that this paper ran the first real-user test of a vocal-imitation sound search engine against a text baseline and found higher satisfaction ratings for Vroom! on sounds people struggled to describe in words, plus better overall ease-of-use scores. They actually built and hosted two web engines to collect the ratings, which moves past the simulation-only work they cite. That deployment step is the clearest advance here and gives the comparison some grounding in actual user behavior rather than just algorithm tests on held-out data. The finding that vocal imitation helps most when text descriptions fall short aligns with the practical problem they set out to address. The paper is straightforward about running on a limited sound library, which is honest.

The soft spots are the missing participant counts, library size, statistical tests, and exclusion rules in the abstract; without those, it's difficult to judge how solid the significance claims are. The stress-test point on generalization is also fair: the results are explicitly tied to their small test set, and nothing shows the advantage would survive larger, more varied collections or different user groups. No circularity or invented parameters, just a straightforward empirical comparison.

This is for HCI researchers or audio-tool builders who want an early signal on vocal-imitation interfaces. A reader could pick up the idea and the basic comparison, but the work is too preliminary for strong deployment advice. It deserves peer review so the methods can be checked and the limitations discussed more fully.

Referee Report

1 major / 2 minor

Summary. The manuscript presents a pilot subjective user study comparing two web-based sound search engines: Vroom! (vocal imitation) and TextSearch (text description). It reports that Vroom! received significantly higher search satisfaction ratings than TextSearch for sound categories difficult to describe by text, along with better overall ease-of-use ratings, on a limited sound library; the authors conclude that these findings suggest advantages for vocal-imitation-based search in practice.

Significance. If the results hold and generalize, the work supplies initial real-user evidence on the relative strengths of vocal imitation versus text for sound retrieval, which could inform HCI interface design for audio databases. The development of deployable web engines and a framework for collecting user behavior statistics is a concrete practical contribution. However, the explicitly limited scope of the library and participant pool constrains the strength of any broader claims.

major comments (1)

[Abstract] The claim that the findings 'suggest advantages of vocal-imitation-based search for sound in practice' is load-bearing for the paper's contribution yet is not supported by any experiments or arguments beyond the 'limited sound library in our experiments' explicitly noted in the same paragraph; no tests on larger libraries, varied sound distributions, or different user populations are reported.

minor comments (2)

[Abstract] Key methodological details (participant count, statistical tests and p-values, library size, exclusion criteria, bias controls) are omitted, preventing readers from assessing the reported 'significantly higher' ratings without consulting the full methods section.
[Results] Ensure all satisfaction and ease-of-use claims are accompanied by exact statistical values, degrees of freedom, and effect sizes rather than qualitative descriptions alone.
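
The values requested in the second minor comment can be computed in a few lines. What follows is a minimal Python sketch, assuming a pooled-variance (Student's) unpaired t-test; the paper excerpt mentions an unpaired t-test but does not say which variant was used. The function name `unpaired_t_report` and both rating lists are hypothetical placeholders invented for illustration, not data from the study. The p-value is left to a statistics library.

```python
from math import sqrt
from statistics import mean, variance


def unpaired_t_report(a, b):
    """Return t statistic, degrees of freedom, and Cohen's d for two groups.

    Assumes a Student's (pooled-variance) unpaired t-test.
    """
    n_a, n_b = len(a), len(b)
    df = n_a + n_b - 2
    # Pooled sample variance across both groups.
    pooled_var = ((n_a - 1) * variance(a) + (n_b - 1) * variance(b)) / df
    diff = mean(a) - mean(b)
    # t statistic: mean difference over its standard error.
    t_stat = diff / sqrt(pooled_var * (1 / n_a + 1 / n_b))
    # Cohen's d: mean difference in units of the pooled standard deviation.
    cohens_d = diff / sqrt(pooled_var)
    return {"t": t_stat, "df": df, "d": cohens_d}


# Placeholder ratings invented for illustration; NOT data from the study.
vocal_ratings = [4, 5, 4, 3, 5, 4]
text_ratings = [3, 2, 4, 3, 2, 3]

report = unpaired_t_report(vocal_ratings, text_ratings)
print(f"t({report['df']}) = {report['t']:.2f}, Cohen's d = {report['d']:.2f}")
```

Reporting the statistic as `t(df)` together with an effect size such as Cohen's d lets a reader judge both whether a difference is significant and how large it is.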

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for highlighting the need to align the abstract's claims more closely with the pilot study's limited scope. We will revise the manuscript accordingly.

Referee: [Abstract] The claim that the findings 'suggest advantages of vocal-imitation-based search for sound in practice' is load-bearing for the paper's contribution yet is not supported by any experiments or arguments beyond the 'limited sound library in our experiments' explicitly noted in the same paragraph; no tests on larger libraries, varied sound distributions, or different user populations are reported.

Authors: We agree that the abstract's final sentence overgeneralizes beyond the evidence. The study is presented as pilot work with an explicitly limited sound library, and no broader tests are reported. We will revise the abstract to state that the findings suggest advantages of vocal-imitation-based search within the conditions of the limited sound library tested, removing the unqualified 'in practice' phrasing. This revision will ensure the claim matches the reported experiments without overstating generalizability.

revision: yes

Circularity Check

0 steps flagged

Empirical user study with no derivations or fitted parameters

This paper is a subjective user study that develops two web-based sound search engines (Vroom! for vocal imitation and TextSearch for text) and collects user ratings on satisfaction and ease-of-use for a limited sound library. No equations, parameter fitting, derivations, or load-bearing self-citations appear in the abstract or described content; results are reported as direct empirical observations from participant data. The central claims rest on statistical comparisons of ratings rather than any chain that reduces to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on the representativeness of the user ratings and the limited sound library; no free parameters, axioms, or invented entities are introduced.

pith-pipeline@v0.9.0 · 5733 in / 968 out tokens · 22927 ms · 2026-05-24T18:55:12.710450+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: Results showed that Vroom! received significantly higher search satisfaction ratings than TextSearch did for sound categories that were difficult for subjects to describe by text.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Paper passage: We also developed an experimental framework to host these engines to collect statistics of user behaviors and ratings.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

40 extracted references · 40 canonical work pages · 1 internal anchor

[1] Traditional search engines for audio files use text labels as queries

INTRODUCTION Designing methods to access and manage multimedia documents such as audio recordings is an important information retrieval task. Traditional search engines for audio files use text labels as queries. However this is not always effective. First, it requires users to be familiar with the audio library taxonomy and text labels, which is unrealist...

[2] Specifically, we designed a web-based search engine called Vroom!

How does vocal-imitation-based search compare with the traditional text-based search for different kinds of sounds in terms of search effectiveness and efficiency? To answer the above questions, in this work, we conduct a subjective study to compare sound search by vocal imitation and by text description. Specifically, we designed a web-based search en-...

[3] Sound Search by Text Description or Vocal Imitation?

RELATED WORK Sound search by text description has been widely accepted in our daily life. For example, Freesound [5] is an online collaborative sound database with more than 400,000 sounds. Those sounds are tagged with text descriptions for text-based search. SoundCloud [6] is another online audio distribution platform that enables users to search sounds...

[4] The major limitation of supervised systems, however, is that they cannot retrieve sounds that do not have training imitations

and an SVM classifier. The major limitation of supervised systems, however, is that they cannot retrieve sounds that do not have training imitations. Helén and Virtanen [19] designed a query by example system for generic audio. Hand-crafted frame-level features were extracted from both query and sound samples and the query-sample pairwise similarity ...

[5] Meanwhile, the benefits of applying positive and negative imitations to update the cosine similarity between the query and sound candidate embedding was investigated in [26]

to integrate these two modules together, in which the transfer learning based TL-IMINET is our most updated model [25]. Meanwhile, the benefits of applying positive and negative imitations to update the cosine similarity between the query and sound candidate embedding was investigated in [26]. To understand what such neural networks actually learns, we...

[6] Go Search!

SEARCH ENGINES FOR COMPARISON 3.1. Search by Vocal Imitation: Vroom! We designed a web-based sound search engine by vocal imitation, called Vroom!. The frontend GUI is designed using Javascript, HTML, and CSS languages. It allows a user to record a vocal imitation of sound that he/she is looking for using the recorder.js Javascript library [28]. It also ...

[7] Go Search!

SUBJECTIVE EVALUATION 4.1. Experimental Framework To quantify search behaviors and user experiences and to make quantitative comparisons between Vroom! and TextSearch, we designed an experimental framework that wraps around each search engine. The experimental framework is another web application. It guides each subject to make 20 searches and rate th...

[8] Go Search!

ease-of-use rating evaluates a user’s overall experience of each search engine upon the completion of all 20 searches. It can be seen that Vroom! shows a statistically significantly higher ease-of-use rating than TextSearch at the significance level of 0.05 (p=0.0108, unpaired t-test). This aligns with the average satisfaction rating of all categories. How...

[9] We designed a search engine for each approach and an experimental framework for the study

CONCLUSIONS AND DISCUSSIONS This paper presented a subjective study to compare vocal-imitation-based and text-based search for sounds. We designed a search engine for each approach and an experimental framework for the study. User ratings and behavioral data collected from 20 subjects showed that vocal-imitation-based search has significant advantages...
[10] D. S. Blancas and J. Janer, “Sound retrieval from voice imitation queries in collaborative databases,” in Proc. Audio Engineering Society 53rd International Conference on Semantic Audio, 2014, pp. 1–6

[11] Y. Zhang and Z. Duan, “Retrieving sounds by vocal imitation recognition,” in Proc. Machine Learning for Signal Processing (MLSP), 2015 IEEE International Workshop on, 2015, pp. 1–6

[12] Y. Zhang and Z. Duan, “Visualization and interpretation of Siamese style convolutional neural networks for sound search by vocal imitation,” in Proc. Acoustics, Speech and Signal Processing (ICASSP), 2018 IEEE International Conference on, 2018, pp. 2406–2410

[13] M. Cartwright and B. Pardo, “Vocalsketch: Vocally imitating audio concepts,” in Proc. the 33rd Annual ACM Conference on Human Factors in Computing Systems, 2015, pp. 43–46

[14] https://freesound.org [Accessed 04/23/2019]

[15] https://soundcloud.com [Accessed 04/23/2019]

[16] M. M. Zloof, “Query-by-example: A data base language,” IBM Systems Journal, vol. 16, no. 4, pp. 324–343, 1977

[17] A. Wang, “An industrial strength audio search algorithm,” in Proc. International Society for Music Information Retrieval Conference (ISMIR), 2003, pp. 7–13

[18] Z. Rafii, B. Coover, and J. Han, “An audio fingerprinting system for live version identification using image processing techniques,” in Proc. Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on, 2014, pp. 644–648

[19] T. J. Tsai, T. Prätzlich, and M. Müller, “Known artist live song id: A hashprint approach,” in Proc. International Society for Music Information Retrieval Conference (ISMIR), 2016, pp. 427–433

[20] T. Bertin-Mahieux and D. P. Ellis, “Large-scale cover song recognition using hashed chroma landmarks,” in Proc. Applications of Signal Processing to Audio and Acoustics (WASPAA), 2011 IEEE Workshop on, 2011, pp. 117–120

[21] T. K. Chia, K. C. Sim, H. Li, and H. T. Ng, “A lattice-based approach to query-by-example spoken document retrieval,” in Proc. the 31st annual international ACM SIGIR conference on Research and development in information retrieval, 2008, pp. 363–370

[22] A. Ghias, J. Logan, D. Chamberlin, and B. C. Smith, “Query by humming: musical information retrieval in an audio database,” in Proc. the 3rd ACM International Conference on Multimedia, 1995, pp. 231–236

[23] R. B. Dannenberg, W. P. Birmingham, B. Pardo, N. Hu, C. Meek, and G. Tzanetakis, “A comparative evaluation of search techniques for query-by-humming using the musart testbed,” Journal of the Association for Information Science and Technology, vol. 8, no. 5, pp. 687–701, 2007

[24] A. Kapur, M. Benning, and G. Tzanetakis, “Query-by-beating-boxing: Music retrieval for the DJ,” in Proc. International Society for Music Information Retrieval Conference (ISMIR), 2004, pp. 170–177

[25] O. Gillet and G. Richard, “Drum loops retrieval from spoken queries,” Journal of Intelligent Information Systems, vol. 24, pp. 159–177, 2005

[26] G. Roma and X. Serra, “Querying freesound with a microphone,” in Proc. the 1st Web Audio Conference (WAC), 2015

[27] G. Peeters, B. L. Giordano, P. Susini, N. Misdariis, and S. McAdams, “The timbre toolbox: Extracting audio descriptors from musical signal,” The Journal of the Acoustical Society of America, vol. 130, no. 5, pp. 2902–2916, 2011

[28] M. Helén and T. Virtanen, “Audio query by example using similarity measures between probability density functions of features,” EURASIP Journal on Audio, Speech, and Music Processing, vol. 2010, no. 1, p. 179303, 2009

[29] Y. Zhang and Z. Duan, “IMISOUND: An unsupervised system for sound query by vocal imitation,” in Proc. Acoustics, Speech and Signal Processing (ICASSP), 2016 IEEE International Conference on, 2016, pp. 2269–2273

[30] S. Kullback and R. A. Leibler, “On information and sufficiency,” The Annals of Mathematical Statistics, vol. 22, no. 1, pp. 79–86, 1951

[31] H. Sakoe and S. Chiba, “Dynamic programming algorithm optimization for spoken word recognition,” Acoustics, Speech and Signal Processing, IEEE Transaction on, vol. 26, no. 1, pp. 43–49, 1978

[32] N. Dehak, R. Dehak, J. R. Glass, D. A. Reynolds, and P. Kenny, “Cosine similarity scoring without score normalization techniques,” in Odyssey, 2010, pp. 1–5

[33] Y. Zhang and Z. Duan, “IMINET: Convolutional semi-Siamese networks for sound search by vocal imitation,” in Proc. Applications of Signal Processing to Audio and Acoustics (WASPAA), 2017 IEEE Workshop on, 2017, pp. 304–308

[34] Y. Zhang, B. Pardo, and Z. Duan, “Siamese style convolutional neural networks for sound search by vocal imitation,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 27, no. 2, pp. 429–441, 2019

[35] B. Kim and B. Pardo, “Improving content-based audio retrieval by vocal imitation feedback,” in Proc. Acoustics, Speech and Signal Processing (ICASSP), 2019 IEEE International Conference on, 2019, pp. 4100–4104

[36] D. Erhan, A. Courville, and Y. Bengio, “Understanding representations learned in deep architectures,” Department d'Informatique et Recherche Operationnelle, University of Montreal, QC, Canada, Tech. Rep 1355, pp. 1–25, 2010

[37] https://github.com/addpipe/simple-recorderjs-demo [Accessed 04/23/2019]

[38] J. Salamon and J. P. Bello, “Deep convolutional neural networks and data augmentation for environmental sound classification,” IEEE Signal Processing Letters, vol. 24, no. 3, pp. 279–283, 2017

[39] G. Montavon, “Deep learning for spoken language identification,” in Proc. NIPS Workshop on deep learning for Speech Recognition and Related Applications, 2009, pp. 1–4

[40] G. A. Miller, “Wordnet: a lexical database for english,” Communications of the ACM, vol. 38, no. 11, pp. 39–41, 1995

[1] [1]

Traditional search engines for audio ﬁles use text labels as queries

INTRODUCTION Designing methods to access and manage multimedia documents such as audio recordings is an important information retrieval task. Traditional search engines for audio ﬁles use text labels as queries. However this is not always effective. First, it requires users to be familiar with the audio library taxonomy and text labels, which is unrealist...

work page

[2] [2]

Speciﬁcally, we designed a web-based search en- gine called Vroom!

How does vocal-imitation-based search compare with the tradi- tional text-based search for different kinds of sounds in terms of search effectiveness and efﬁciency? To answer the above questions, in this work, we conduct a sub- jective study to compare sound search by vocal imitation and by text description. Speciﬁcally, we designed a web-based search en-...

work page

[3] [3]

Sound Search by Text Description or Vocal Imitation?

RELA TED WORK Sound search by text description has been widely accepted in our daily life. For example, Freesound [5] is an online collaborative sound database with more than 400,000 sounds. Those sounds are tagged with text descriptions for text-based search. SoundCloud [6] is another online audio distribution platform that enables users to search sounds...

work page internal anchor Pith review Pith/arXiv arXiv 1907

[4] [4]

The major limitation of supervised sys- tems, however, is that they cannot retrieve sounds that do not have training imitations

and an SVM classiﬁer. The major limitation of supervised sys- tems, however, is that they cannot retrieve sounds that do not have training imitations. Hel ´en and Virtanen [19] designed a query by example system for generic audio. Hand-crafted frame-level fea- tures were extracted from both query and sound samples and the query-sample pairwise similarity ...

work page

[5] [5]

Mean- while, the beneﬁts of applying positive and negative imitations to update the cosine similarity between the query and sound candidate embedding was investigated in [26]

to integrate these two modules together, in which the transfer learning based TL-IMINET is our most updated model [25]. Mean- while, the beneﬁts of applying positive and negative imitations to update the cosine similarity between the query and sound candidate embedding was investigated in [26]. To understand what such neu- ral networks actually learns, we...

work page

[6] [6]

Go Search!

SEARCH ENGINES FOR COMPARISON 3.1. Search by V ocal Imitation: Vroom! We designed a web-based sound search engine by vocal imitation, called Vroom!. The frontend GUI is designed using Javascript, HTML, and CSS languages. It allows a user to record a vocal imitation of sound that he/she is looking for using the recorder.js Javascript library [28]. It also ...

work page

[7] [7]

Go Search!

SUBJECTIVE EV ALUA TION 4.1. Experimental Framework To quantify search behaviors and user experiences and to make quantitative comparisons between Vroom! and TextSearch, we de- signed an experimental framework that wraps around each search engine. The experimental framework is another web application. It guides each subject to make 20 searches and rate th...

work page 2015

[8] [8]

Go Search!

ease-of-use rating evaluates a user’s overall experience of each search engine upon the completion of all 20 searches. It can be seen that Vroom! shows a statistically signiﬁcantly higher ease-of-use rating than TextSearch at the signiﬁcance level of 0.05 (p=0.0108, unpairted t-test). This aligns with the average satisfaction rating of all categories. How...

work page

[9] [9]

We designed a search engine for each approach and an experimental framework for the study

CONCLUSIONS AND DISCUSSIONS This paper presented a subjective study to compare vocal-imitation- based and text-based search for sounds. We designed a search engine for each approach and an experimental framework for the study. User ratings and behavioral data collected from 20 sub- jects showed that vocal-imitation-based search has signiﬁcant ad- vantages...

work page

[10] [10]

Sound retrieval from voice imi- tation queries in collaborative databases,

D. S. Blancas and J. Janer, “Sound retrieval from voice imi- tation queries in collaborative databases,” in Proc. Audio En- gineering Society 53rd International Conference on Semantic Audio, 2014, pp. 1–6

work page 2014

[11] [11]

Retrieving sounds by vocal imitation recognition,

Y . Zhang and Z. Duan, “Retrieving sounds by vocal imitation recognition,” in Proc. Machine Learning for Signal Process- ing (MLSP), 2015 IEEE International Workshop on, 2015, pp. 1–6

work page 2015

[12] [12]

Visualization and interpretation of Siamese style con- volutional neural networks for sound search by vocal im- itation,

——, “Visualization and interpretation of Siamese style con- volutional neural networks for sound search by vocal im- itation,” in Proc. Acoustics, Speech and Signal Processing (ICASSP), 2018 IEEE International Conference on , 2018, pp. 2406–2410

work page 2018

[13] [13]

V ocalsketch: V ocally imitating audio concepts,

M. Cartwright and B. Pardo, “V ocalsketch: V ocally imitating audio concepts,” in Proc. the 33rd Annual ACM Conference on Human Factors in Computing Systems , 2015, pp. 43–46

work page 2015

[14] [14]

https://freesound.org [Accessed 04/23/2019]

work page 2019

[15] [15]

https://soundcloud.com [Accessed 04/23/2019]

work page 2019

[16] [16]

Query-by-example: A data base language,

M. M. Zloof, “Query-by-example: A data base language,” IBM Systems Journal, vol. 16, no. 4, pp. 324–343, 1977

work page 1977

[17] [17]

An industrial strength audio search algorithm,

A. Wang, “An industrial strength audio search algorithm,” in Proc. International Society for Music Information Retrieval Conference (ISMIR), 2003, pp. 7–13

work page 2003

[18] [18]

An audio ﬁngerprinting system for live version identiﬁcation using image processing techniques,

Z. Raﬁi, B. Coover, and J. Han, “An audio ﬁngerprinting system for live version identiﬁcation using image processing techniques,” in Proc. Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on , 2014, pp. 644–648

work page 2014

[19] [19]

Known artist live song id: A hashprint approach,

T. J. Tsai, T. Pr¨atzlich, and M. M¨uller, “Known artist live song id: A hashprint approach,” in Proc. International Society for Music Information Retrieval Conference (ISMIR) , 2016, pp. 427–433

work page 2016

[20] [20]

Large-scale cover song recognition using hashed chroma landmarks,

T. Bertin-Mahieux and D. P. Ellis, “Large-scale cover song recognition using hashed chroma landmarks,” in Proc. Appli- cations of Signal Processing to Audio and Acoustics (WAS- PAA), 2011 IEEE Workshop on, 2011, pp. 117–120

work page 2011

[21] [21]

A lattice-based approach to query-by-example spoken document retrieval,

T. K. Chia, K. C. Sim, H. Li, and H. T. Ng, “A lattice-based approach to query-by-example spoken document retrieval,” in Proc. the 31st annual international ACM SIGIR conference on Research and development in information retrieval , 2008, pp. 363–370

work page 2008

[22] [22]

Query by humming: musical information retrieval in an audio database,

A. Ghias, J. Logan, D. Chamberlin, and B. C. Smith, “Query by humming: musical information retrieval in an audio database,” in Proc. the 3rd ACM International Conference on Multimedia, 1995, pp. 231–236

work page 1995

[23] R. B. Dannenberg, W. P. Birmingham, B. Pardo, N. Hu, C. Meek, and G. Tzanetakis, “A comparative evaluation of search techniques for query-by-humming using the MUSART testbed,” Journal of the Association for Information Science and Technology, vol. 8, no. 5, pp. 687–701, 2007

[24] A. Kapur, M. Benning, and G. Tzanetakis, “Query-by-beat-boxing: Music retrieval for the DJ,” in Proc. International Society for Music Information Retrieval Conference (ISMIR), 2004, pp. 170–177

[25] O. Gillet and G. Richard, “Drum loops retrieval from spoken queries,” Journal of Intelligent Information Systems, vol. 24, pp. 159–177, 2005

[26] G. Roma and X. Serra, “Querying Freesound with a microphone,” in Proc. the 1st Web Audio Conference (WAC), 2015

[27] G. Peeters, B. L. Giordano, P. Susini, N. Misdariis, and S. McAdams, “The timbre toolbox: Extracting audio descriptors from musical signal,” The Journal of the Acoustical Society of America, vol. 130, no. 5, pp. 2902–2916, 2011

[28] M. Helén and T. Virtanen, “Audio query by example using similarity measures between probability density functions of features,” EURASIP Journal on Audio, Speech, and Music Processing, vol. 2010, no. 1, p. 179303, 2009

[29] Y. Zhang and Z. Duan, “IMISOUND: An unsupervised system for sound query by vocal imitation,” in Proc. Acoustics, Speech and Signal Processing (ICASSP), 2016 IEEE International Conference on, 2016, pp. 2269–2273

[30] S. Kullback and R. A. Leibler, “On information and sufficiency,” The Annals of Mathematical Statistics, vol. 22, no. 1, pp. 79–86, 1951

[31] H. Sakoe and S. Chiba, “Dynamic programming algorithm optimization for spoken word recognition,” Acoustics, Speech and Signal Processing, IEEE Transactions on, vol. 26, no. 1, pp. 43–49, 1978

[32] N. Dehak, R. Dehak, J. R. Glass, D. A. Reynolds, and P. Kenny, “Cosine similarity scoring without score normalization techniques,” in Odyssey, 2010, pp. 1–5

[33] Y. Zhang and Z. Duan, “IMINET: Convolutional semi-Siamese networks for sound search by vocal imitation,” in Proc. Applications of Signal Processing to Audio and Acoustics (WASPAA), 2017 IEEE Workshop on, 2017, pp. 304–308

[34] Y. Zhang, B. Pardo, and Z. Duan, “Siamese style convolutional neural networks for sound search by vocal imitation,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 27, no. 2, pp. 429–441, 2019

[35] B. Kim and B. Pardo, “Improving content-based audio retrieval by vocal imitation feedback,” in Proc. Acoustics, Speech and Signal Processing (ICASSP), 2019 IEEE International Conference on, 2019, pp. 4100–4104

[36] D. Erhan, A. Courville, and Y. Bengio, “Understanding representations learned in deep architectures,” Department d'Informatique et Recherche Operationnelle, University of Montreal, QC, Canada, Tech. Rep. 1355, pp. 1–25, 2010

[37] https://github.com/addpipe/simple-recorderjs-demo [Accessed 04/23/2019]

[38] J. Salamon and J. P. Bello, “Deep convolutional neural networks and data augmentation for environmental sound classification,” IEEE Signal Processing Letters, vol. 24, no. 3, pp. 279–283, 2017

[39] G. Montavon, “Deep learning for spoken language identification,” in Proc. NIPS Workshop on deep learning for Speech Recognition and Related Applications, 2009, pp. 1–4

[40] G. A. Miller, “WordNet: a lexical database for English,” Communications of the ACM, vol. 38, no. 11, pp. 39–41, 1995