Sound Search by Text Description or Vocal Imitation?
Pith reviewed 2026-05-24 18:55 UTC · model grok-4.3
The pith
Vocal imitation search yields higher user satisfaction than text descriptions for sounds that are hard to put into words.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Users reported significantly higher search satisfaction with the vocal-imitation engine than with the text-description engine for sound categories difficult to describe by text, and they gave the vocal engine a better overall ease-of-use rating on the limited sound library used in the experiments.
What carries the argument
Two web-based search engines, Vroom! accepting vocal imitations and TextSearch accepting text descriptions, evaluated through subjective satisfaction and ease-of-use ratings collected from real users.
Load-bearing premise
The subjective ratings collected from users on the limited sound library accurately reflect real-world search performance and generalize beyond the specific experimental setup and participant pool.
What would settle it
A follow-up study with a larger and more diverse sound library in which participants show equal or higher satisfaction ratings for the text engine on the same hard-to-describe categories.
Original abstract
Searching sounds by text labels is often difficult, as text descriptions cannot describe the audio content in detail. Query by vocal imitation bridges such gap and provides a novel way to sound search. Several algorithms for sound search by vocal imitation have been proposed and evaluated in a simulation environment, however, they have not been deployed into a real search engine nor evaluated by real users. This pilot work conducts a subjective study to compare these two approaches to sound search, and tries to answer the question of which approach works better for what kinds of sounds. To do so, we developed two web-based search engines for sound, one by vocal imitation (Vroom!) and the other by text description (TextSearch). We also developed an experimental framework to host these engines to collect statistics of user behaviors and ratings. Results showed that Vroom! received significantly higher search satisfaction ratings than TextSearch did for sound categories that were difficult for subjects to describe by text. Results also showed a better overall ease-of-use rating for Vroom! than TextSearch on the limited sound library in our experiments. These findings suggest advantages of vocal-imitation-based search for sound in practice.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents a pilot subjective user study comparing two web-based sound search engines: Vroom! (vocal imitation) and TextSearch (text description). It reports that Vroom! received significantly higher search satisfaction ratings than TextSearch for sound categories difficult to describe by text, along with better overall ease-of-use ratings, on a limited sound library; the authors conclude that these findings suggest advantages for vocal-imitation-based search in practice.
Significance. If the results hold and generalize, the work supplies initial real-user evidence on the relative strengths of vocal imitation versus text for sound retrieval, which could inform HCI interface design for audio databases. The development of deployable web engines and a framework for collecting user behavior statistics is a concrete practical contribution. However, the explicitly limited scope of the library and participant pool constrains the strength of any broader claims.
major comments (1)
- [Abstract] The claim that the findings 'suggest advantages of vocal-imitation-based search for sound in practice' is load-bearing for the paper's contribution, yet it is not supported by any experiments or arguments beyond the 'limited sound library in our experiments' explicitly noted in the same paragraph; no tests on larger libraries, varied sound distributions, or different user populations are reported.
minor comments (2)
- [Abstract] Key methodological details (participant count, statistical tests and p-values, library size, exclusion criteria, bias controls) are omitted, so readers cannot assess the reported 'significantly higher' ratings without consulting the full methods section.
- [Results] Accompany every satisfaction and ease-of-use claim with exact statistical values, degrees of freedom, and effect sizes rather than qualitative descriptions alone.
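As an illustration of the reporting requested in the last comment, the following minimal sketch computes the test statistic, degrees of freedom, and Cohen's d for an unpaired comparison of two rating groups. It assumes a Student's t-test with equal variances, which the paper may or may not have used. The function name `unpaired_t_report` and the rating lists are hypothetical placeholders, not data from the study.

```python
import math
from statistics import mean, variance


def unpaired_t_report(a, b):
    """Summarize an unpaired Student's t-test (equal variances assumed).

    Returns the t statistic, degrees of freedom, and Cohen's d
    computed with the pooled standard deviation.
    """
    n_a, n_b = len(a), len(b)
    df = n_a + n_b - 2
    # Pooled variance weights each sample variance by its degrees of freedom.
    pooled_var = ((n_a - 1) * variance(a) + (n_b - 1) * variance(b)) / df
    pooled_sd = math.sqrt(pooled_var)
    diff = mean(a) - mean(b)
    t_stat = diff / (pooled_sd * math.sqrt(1 / n_a + 1 / n_b))
    cohens_d = diff / pooled_sd
    return {"t": t_stat, "df": df, "cohens_d": cohens_d}


# Hypothetical 1-5 ease-of-use ratings, invented purely for illustration.
vroom_ratings = [4, 5, 4, 5, 3, 4, 5, 4, 4, 5]
text_ratings = [3, 4, 2, 3, 3, 4, 2, 3, 3, 4]

report = unpaired_t_report(vroom_ratings, text_ratings)
print(f"t({report['df']}) = {report['t']:.3f}, d = {report['cohens_d']:.3f}")
```

The sketch stops short of a p-value because the standard library has no t-distribution CDF; a statistics package such as SciPy (`scipy.stats.ttest_ind`) would supply it.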
Simulated Author's Rebuttal
We thank the referee for highlighting the need to align the abstract's claims more closely with the pilot study's limited scope. We will revise the manuscript accordingly.
Point-by-point responses
-
Referee: [Abstract] The claim that the findings 'suggest advantages of vocal-imitation-based search for sound in practice' is load-bearing for the paper's contribution, yet it is not supported by any experiments or arguments beyond the 'limited sound library in our experiments' explicitly noted in the same paragraph; no tests on larger libraries, varied sound distributions, or different user populations are reported.
Authors: We agree that the abstract's final sentence overgeneralizes beyond the evidence. The study is presented as pilot work with an explicitly limited sound library, and no broader tests are reported. We will revise the abstract to state that the findings suggest advantages of vocal-imitation-based search within the conditions of the limited sound library tested, removing the unqualified 'in practice' phrasing. This revision will ensure the claim matches the reported experiments without overstating generalizability.
Revision: yes
Circularity Check
Empirical user study with no derivations or fitted parameters
full rationale
This paper is a subjective user study that develops two web-based sound search engines (Vroom! for vocal imitation and TextSearch for text) and collects user ratings on satisfaction and ease-of-use for a limited sound library. No equations, parameter fitting, derivations, or load-bearing self-citations appear in the abstract or described content; results are reported as direct empirical observations from participant data. The central claims rest on statistical comparisons of ratings rather than any chain that reduces to its own inputs by construction.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Results showed that Vroom! received significantly higher search satisfaction ratings than TextSearch did for sound categories that were difficult for subjects to describe by text.
-
IndisputableMonolith/Foundation/RealityFromDistinction.lean, reality_from_one_distinction
Relation between the paper passage and the cited Recognition theorem: unclear
We also developed an experimental framework to host these engines to collect statistics of user behaviors and ratings.
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Traditional search engines for audio files use text labels as queries
INTRODUCTION Designing methods to access and manage multimedia documents such as audio recordings is an important information retrieval task. Traditional search engines for audio files use text labels as queries. However this is not always effective. First, it requires users to be familiar with the audio library taxonomy and text labels, which is unrealist...
-
[2]
Specifically, we designed a web-based search engine called Vroom!
How does vocal-imitation-based search compare with the traditional text-based search for different kinds of sounds in terms of search effectiveness and efficiency? To answer the above questions, in this work, we conduct a subjective study to compare sound search by vocal imitation and by text description. Specifically, we designed a web-based search en-...
-
[3]
Sound Search by Text Description or Vocal Imitation?
RELATED WORK Sound search by text description has been widely accepted in our daily life. For example, Freesound [5] is an online collaborative sound database with more than 400,000 sounds. Those sounds are tagged with text descriptions for text-based search. SoundCloud [6] is another online audio distribution platform that enables users to search sounds...
work page internal anchor Pith review Pith/arXiv arXiv 1907
-
[4]
and an SVM classifier. The major limitation of supervised systems, however, is that they cannot retrieve sounds that do not have training imitations. Helén and Virtanen [19] designed a query by example system for generic audio. Hand-crafted frame-level features were extracted from both query and sound samples and the query-sample pairwise similarity ...
-
[5]
to integrate these two modules together, in which the transfer learning based TL-IMINET is our most updated model [25]. Meanwhile, the benefits of applying positive and negative imitations to update the cosine similarity between the query and sound candidate embedding was investigated in [26]. To understand what such neural networks actually learns, we...
-
[6]
SEARCH ENGINES FOR COMPARISON 3.1. Search by Vocal Imitation: Vroom! We designed a web-based sound search engine by vocal imitation, called Vroom!. The frontend GUI is designed using Javascript, HTML, and CSS languages. It allows a user to record a vocal imitation of sound that he/she is looking for using the recorder.js Javascript library [28]. It also ...
-
[7]
SUBJECTIVE EVALUATION 4.1. Experimental Framework To quantify search behaviors and user experiences and to make quantitative comparisons between Vroom! and TextSearch, we designed an experimental framework that wraps around each search engine. The experimental framework is another web application. It guides each subject to make 20 searches and rate th...
work page 2015
-
[8]
ease-of-use rating evaluates a user’s overall experience of each search engine upon the completion of all 20 searches. It can be seen that Vroom! shows a statistically significantly higher ease-of-use rating than TextSearch at the significance level of 0.05 (p=0.0108, unpaired t-test). This aligns with the average satisfaction rating of all categories. How...
-
[9]
We designed a search engine for each approach and an experimental framework for the study
CONCLUSIONS AND DISCUSSIONS This paper presented a subjective study to compare vocal-imitation-based and text-based search for sounds. We designed a search engine for each approach and an experimental framework for the study. User ratings and behavioral data collected from 20 subjects showed that vocal-imitation-based search has significant advantages...
-
[10]
Sound retrieval from voice imitation queries in collaborative databases,
D. S. Blancas and J. Janer, “Sound retrieval from voice imitation queries in collaborative databases,” in Proc. Audio Engineering Society 53rd International Conference on Semantic Audio, 2014, pp. 1–6
work page 2014
-
[11]
Retrieving sounds by vocal imitation recognition,
Y. Zhang and Z. Duan, “Retrieving sounds by vocal imitation recognition,” in Proc. Machine Learning for Signal Processing (MLSP), 2015 IEEE International Workshop on, 2015, pp. 1–6
work page 2015
-
[12]
——, “Visualization and interpretation of Siamese style convolutional neural networks for sound search by vocal imitation,” in Proc. Acoustics, Speech and Signal Processing (ICASSP), 2018 IEEE International Conference on, 2018, pp. 2406–2410
work page 2018
-
[13]
Vocalsketch: Vocally imitating audio concepts,
M. Cartwright and B. Pardo, “Vocalsketch: Vocally imitating audio concepts,” in Proc. the 33rd Annual ACM Conference on Human Factors in Computing Systems, 2015, pp. 43–46
work page 2015
-
[14]
https://freesound.org [Accessed 04/23/2019]
work page 2019
-
[15]
https://soundcloud.com [Accessed 04/23/2019]
work page 2019
-
[16]
Query-by-example: A data base language,
M. M. Zloof, “Query-by-example: A data base language,” IBM Systems Journal, vol. 16, no. 4, pp. 324–343, 1977
work page 1977
-
[17]
An industrial strength audio search algorithm,
A. Wang, “An industrial strength audio search algorithm,” in Proc. International Society for Music Information Retrieval Conference (ISMIR), 2003, pp. 7–13
work page 2003
-
[18]
An audio fingerprinting system for live version identification using image processing techniques,
Z. Rafii, B. Coover, and J. Han, “An audio fingerprinting system for live version identification using image processing techniques,” in Proc. Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on, 2014, pp. 644–648
work page 2014
-
[19]
Known artist live song id: A hashprint approach,
T. J. Tsai, T. Prätzlich, and M. Müller, “Known artist live song id: A hashprint approach,” in Proc. International Society for Music Information Retrieval Conference (ISMIR), 2016, pp. 427–433
work page 2016
-
[20]
Large-scale cover song recognition using hashed chroma landmarks,
T. Bertin-Mahieux and D. P. Ellis, “Large-scale cover song recognition using hashed chroma landmarks,” in Proc. Applications of Signal Processing to Audio and Acoustics (WASPAA), 2011 IEEE Workshop on, 2011, pp. 117–120
work page 2011
-
[21]
A lattice-based approach to query-by-example spoken document retrieval,
T. K. Chia, K. C. Sim, H. Li, and H. T. Ng, “A lattice-based approach to query-by-example spoken document retrieval,” in Proc. the 31st annual international ACM SIGIR conference on Research and development in information retrieval , 2008, pp. 363–370
work page 2008
-
[22]
Query by humming: musical information retrieval in an audio database,
A. Ghias, J. Logan, D. Chamberlin, and B. C. Smith, “Query by humming: musical information retrieval in an audio database,” in Proc. the 3rd ACM International Conference on Multimedia, 1995, pp. 231–236
work page 1995
-
[23]
A comparative evaluation of search techniques for query-by-humming using the musart testbed,
R. B. Dannenberg, W. P. Birmingham, B. Pardo, N. Hu, C. Meek, and G. Tzanetakis, “A comparative evaluation of search techniques for query-by-humming using the musart testbed,” Journal of the Association for Information Science and Technology, vol. 8, no. 5, pp. 687–701, 2007
work page 2007
-
[24]
Query-by-beating-boxing: Music retrieval for the DJ,
A. Kapur, M. Benning, and G. Tzanetakis, “Query-by-beating-boxing: Music retrieval for the DJ,” in Proc. International Society for Music Information Retrieval Conference (ISMIR), 2004, pp. 170–177
work page 2004
-
[25]
Drum loops retrieval from spoken queries,
O. Gillet and G. Richard, “Drum loops retrieval from spoken queries,” Journal of Intelligent Information Systems, vol. 24, pp. 159–177, 2005
work page 2005
-
[26]
Querying freesound with a microphone,
G. Roma and X. Serra, “Querying freesound with a microphone,” in Proc. the 1st Web Audio Conference (WAC), 2015
work page 2015
-
[27]
The timbre toolbox: Extracting audio descriptors from musical signal,
G. Peeters, B. L. Giordano, P. Susini, N. Misdariis, and S. McAdams, “The timbre toolbox: Extracting audio descriptors from musical signal,” The Journal of the Acoustical Society of America, vol. 130, no. 5, pp. 2902–2916, 2011
work page 2011
-
[28]
Audio query by example using similarity measures between probability density functions of features,
M. Helén and T. Virtanen, “Audio query by example using similarity measures between probability density functions of features,” EURASIP Journal on Audio, Speech, and Music Processing, vol. 2010, no. 1, p. 179303, 2009
work page 2010
-
[29]
IMISOUND: An unsupervised system for sound query by vocal imitation,
Y. Zhang and Z. Duan, “IMISOUND: An unsupervised system for sound query by vocal imitation,” in Proc. Acoustics, Speech and Signal Processing (ICASSP), 2016 IEEE International Conference on, 2016, pp. 2269–2273
work page 2016
-
[30]
On information and sufficiency,
S. Kullback and R. A. Leibler, “On information and sufficiency,” The Annals of Mathematical Statistics, vol. 22, no. 1, pp. 79–86, 1951
work page 1951
-
[31]
Dynamic programming algorithm optimization for spoken word recognition,
H. Sakoe and S. Chiba, “Dynamic programming algorithm optimization for spoken word recognition,” Acoustics, Speech and Signal Processing, IEEE Transaction on, vol. 26, no. 1, pp. 43–49, 1978
work page 1978
-
[32]
Cosine similarity scoring without score normalization techniques,
N. Dehak, R. Dehak, J. R. Glass, D. A. Reynolds, and P. Kenny, “Cosine similarity scoring without score normalization techniques,” in Odyssey, 2010, pp. 1–5
work page 2010
-
[33]
IMINET: Convolutional semi-Siamese networks for sound search by vocal imitation,
Y. Zhang and Z. Duan, “IMINET: Convolutional semi-Siamese networks for sound search by vocal imitation,” in Proc. Applications of Signal Processing to Audio and Acoustics (WASPAA), 2017 IEEE Workshop on, 2017, pp. 304–308
work page 2017
-
[34]
Siamese style convolutional neural networks for sound search by vocal imitation,
Y. Zhang, B. Pardo, and Z. Duan, “Siamese style convolutional neural networks for sound search by vocal imitation,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 27, no. 2, pp. 429–441, 2019
work page 2019
-
[35]
Improving content-based audio retrieval by vocal imitation feedback,
B. Kim and B. Pardo, “Improving content-based audio retrieval by vocal imitation feedback,” in Proc. Acoustics, Speech and Signal Processing (ICASSP), 2019 IEEE International Conference on, 2019, pp. 4100–4104
work page 2019
-
[36]
Understanding representations learned in deep architectures,
D. Erhan, A. Courville, and Y. Bengio, “Understanding representations learned in deep architectures,” Department d‘Informatique et Recherche Operationnelle, University of Montreal, QC, Canada, Tech. Rep 1355, pp. 1–25, 2010
work page 2010
-
[37]
https://github.com/addpipe/simple-recorderjs-demo [Accessed 04/23/2019]
work page 2019
-
[38]
Deep convolutional neural networks and data augmentation for environmental sound classification,
J. Salamon and J. P. Bello, “Deep convolutional neural networks and data augmentation for environmental sound classification,” IEEE Signal Processing Letters, vol. 24, no. 3, pp. 279–283, 2017
work page 2017
-
[39]
Deep learning for spoken language identification,
G. Montavon, “Deep learning for spoken language identification,” in Proc. NIPS Workshop on deep learning for Speech Recognition and Related Applications, 2009, pp. 1–4
work page 2009
-
[40]
Wordnet: a lexical database for english,
G. A. Miller, “Wordnet: a lexical database for english,” Communications of the ACM, vol. 38, no. 11, pp. 39–41, 1995
work page 1995