arxiv: 1907.08661 · v1 · pith:5Z6EAEV6new · submitted 2019-07-19 · 💻 cs.HC · cs.SD · eess.AS

Sound Search by Text Description or Vocal Imitation?

Pith reviewed 2026-05-24 18:55 UTC · model grok-4.3

classification 💻 cs.HC · cs.SD · eess.AS
keywords sound search, vocal imitation, text description, user study, audio retrieval, query by vocal imitation, subjective evaluation, ease of use

The pith

Vocal imitation search yields higher user satisfaction than text descriptions for sounds that are hard to put into words.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper builds two web-based sound search systems, one that takes vocal imitations as input and one that takes text labels, then runs a user study to see which works better for different kinds of sounds. Participants gave the vocal system significantly higher satisfaction scores on categories they found difficult to describe in text, and they also rated the vocal system easier to use overall on the test collection. A sympathetic reader would care because everyday sound search often fails when words cannot capture the exact audio quality, so a practical alternative that bypasses verbal description could change how people find audio clips. The work is framed as a pilot that moves vocal-imitation algorithms out of simulation and into real user interaction.

Core claim

Users reported significantly higher search satisfaction with the vocal-imitation engine than with the text-description engine for sound categories difficult to describe by text, and they gave the vocal engine a better overall ease-of-use rating on the limited sound library used in the experiments.

What carries the argument

Two web-based search engines, Vroom! accepting vocal imitations and TextSearch accepting text descriptions, evaluated through subjective satisfaction and ease-of-use ratings collected from real users.

Load-bearing premise

The subjective ratings collected from users on the limited sound library accurately reflect real-world search performance and generalize beyond the specific experimental setup and participant pool.

What would settle it

A follow-up study with a larger and more diverse sound library in which participants show equal or higher satisfaction ratings for the text engine on the same hard-to-describe categories.

Figures

Figures reproduced from arXiv: 1907.08661 by Yichi Zhang, Yiting Zhang, Zhiyao Duan.

Figure 1: Frontend GUIs of (a) the vocal-imitation-based search...
Figure 2: Experimental framework hosting the proposed vocal imitation based search engine
Figure 3: Average user ratings of sound search by text descrip...
Original abstract

Searching sounds by text labels is often difficult, as text descriptions cannot describe the audio content in detail. Query by vocal imitation bridges such gap and provides a novel way to sound search. Several algorithms for sound search by vocal imitation have been proposed and evaluated in a simulation environment, however, they have not been deployed into a real search engine nor evaluated by real users. This pilot work conducts a subjective study to compare these two approaches to sound search, and tries to answer the question of which approach works better for what kinds of sounds. To do so, we developed two web-based search engines for sound, one by vocal imitation (Vroom!) and the other by text description (TextSearch). We also developed an experimental framework to host these engines to collect statistics of user behaviors and ratings. Results showed that Vroom! received significantly higher search satisfaction ratings than TextSearch did for sound categories that were difficult for subjects to describe by text. Results also showed a better overall ease-of-use rating for Vroom! than TextSearch on the limited sound library in our experiments. These findings suggest advantages of vocal-imitation-based search for sound in practice.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript presents a pilot subjective user study comparing two web-based sound search engines: Vroom! (vocal imitation) and TextSearch (text description). It reports that Vroom! received significantly higher search satisfaction ratings than TextSearch for sound categories difficult to describe by text, along with better overall ease-of-use ratings, on a limited sound library; the authors conclude that these findings suggest advantages for vocal-imitation-based search in practice.

Significance. If the results hold and generalize, the work supplies initial real-user evidence on the relative strengths of vocal imitation versus text for sound retrieval, which could inform HCI interface design for audio databases. The development of deployable web engines and a framework for collecting user behavior statistics is a concrete practical contribution. However, the explicitly limited scope of the library and participant pool constrains the strength of any broader claims.

major comments (1)
  1. [Abstract] Abstract: The claim that the findings 'suggest advantages of vocal-imitation-based search for sound in practice' is load-bearing for the paper's contribution yet is not supported by any experiments or arguments beyond the 'limited sound library in our experiments' explicitly noted in the same paragraph; no tests on larger libraries, varied sound distributions, or different user populations are reported.
minor comments (2)
  1. [Abstract] Abstract: Key methodological details (participant count, statistical tests and p-values, library size, exclusion criteria, bias controls) are omitted, preventing readers from assessing the reported 'significantly higher' ratings without consulting the full methods section.
  2. [Results] Results section: Ensure all satisfaction and ease-of-use claims are accompanied by exact statistical values, degrees of freedom, and effect sizes rather than qualitative descriptions alone.
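
The second minor comment asks for test statistics, degrees of freedom, and effect sizes. As a minimal sketch of what such a report could contain, the Python below computes a pooled-variance (Student's) unpaired t statistic, its degrees of freedom, and Cohen's d for two independent groups of ratings. The function name `unpaired_t_summary` and both rating lists are hypothetical and are not taken from the paper; whether the authors used the pooled-variance or Welch variant is also an assumption here.

```python
import math
from statistics import mean, variance


def unpaired_t_summary(a, b):
    """Summarize a pooled-variance unpaired t-test for two independent samples."""
    na, nb = len(a), len(b)
    mean_a, mean_b = mean(a), mean(b)
    # statistics.variance uses the n - 1 (sample) denominator.
    var_a, var_b = variance(a), variance(b)
    df = na + nb - 2
    pooled_var = ((na - 1) * var_a + (nb - 1) * var_b) / df
    t_stat = (mean_a - mean_b) / math.sqrt(pooled_var * (1 / na + 1 / nb))
    # Cohen's d: mean difference in units of the pooled standard deviation.
    cohens_d = (mean_a - mean_b) / math.sqrt(pooled_var)
    return {"t": t_stat, "df": df, "cohens_d": cohens_d}


# Hypothetical ease-of-use ratings, invented purely for illustration.
vocal_ratings = [4, 5, 4, 5, 3, 4, 5, 4]
text_ratings = [3, 3, 4, 2, 3, 4, 3, 2]

summary = unpaired_t_summary(vocal_ratings, text_ratings)
print(f"t({summary['df']}) = {summary['t']:.3f}, d = {summary['cohens_d']:.3f}")
```

The p-value is left out because it needs the t-distribution CDF, which the standard library does not provide; a statistics package such as SciPy (`scipy.stats.ttest_ind`) would supply it.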

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for highlighting the need to align the abstract's claims more closely with the pilot study's limited scope. We will revise the manuscript accordingly.

Point-by-point responses
  1. Referee: [Abstract] Abstract: The claim that the findings 'suggest advantages of vocal-imitation-based search for sound in practice' is load-bearing for the paper's contribution yet is not supported by any experiments or arguments beyond the 'limited sound library in our experiments' explicitly noted in the same paragraph; no tests on larger libraries, varied sound distributions, or different user populations are reported.

    Authors: We agree that the abstract's final sentence overgeneralizes beyond the evidence. The study is presented as pilot work with an explicitly limited sound library, and no broader tests are reported. We will revise the abstract to state that the findings suggest advantages of vocal-imitation-based search within the conditions of the limited sound library tested, removing the unqualified 'in practice' phrasing. This revision will ensure the claim matches the reported experiments without overstating generalizability.

    revision: yes

Circularity Check

0 steps flagged

Empirical user study with no derivations or fitted parameters

This paper is a subjective user study that develops two web-based sound search engines (Vroom! for vocal imitation and TextSearch for text) and collects user ratings on satisfaction and ease-of-use for a limited sound library. No equations, parameter fitting, derivations, or load-bearing self-citations appear in the abstract or described content; results are reported as direct empirical observations from participant data. The central claims rest on statistical comparisons of ratings rather than any chain that reduces to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on the representativeness of the user ratings and the limited sound library; no free parameters, axioms, or invented entities are introduced.

Reference graph

Works this paper leans on

40 extracted references · 40 canonical work pages · 1 internal anchor

  1. [1]

    Traditional search engines for audio files use text labels as queries

    INTRODUCTION Designing methods to access and manage multimedia documents such as audio recordings is an important information retrieval task. Traditional search engines for audio files use text labels as queries. However this is not always effective. First, it requires users to be familiar with the audio library taxonomy and text labels, which is unrealist...

  2. [2]

    Specifically, we designed a web-based search engine called Vroom!

    How does vocal-imitation-based search compare with the traditional text-based search for different kinds of sounds in terms of search effectiveness and efficiency? To answer the above questions, in this work, we conduct a subjective study to compare sound search by vocal imitation and by text description. Specifically, we designed a web-based search en-...

  3. [3]

    Sound Search by Text Description or Vocal Imitation?

    RELATED WORK Sound search by text description has been widely accepted in our daily life. For example, Freesound [5] is an online collaborative sound database with more than 400,000 sounds. Those sounds are tagged with text descriptions for text-based search. SoundCloud [6] is another online audio distribution platform that enables users to search sounds...

  4. [4]

    The major limitation of supervised systems, however, is that they cannot retrieve sounds that do not have training imitations

    and an SVM classifier. The major limitation of supervised systems, however, is that they cannot retrieve sounds that do not have training imitations. Helén and Virtanen [19] designed a query by example system for generic audio. Hand-crafted frame-level features were extracted from both query and sound samples and the query-sample pairwise similarity ...

  5. [5]

    Meanwhile, the benefits of applying positive and negative imitations to update the cosine similarity between the query and sound candidate embedding was investigated in [26]

    to integrate these two modules together, in which the transfer learning based TL-IMINET is our most updated model [25]. Meanwhile, the benefits of applying positive and negative imitations to update the cosine similarity between the query and sound candidate embedding was investigated in [26]. To understand what such neural networks actually learns, we...

  6. [6]

    Go Search!

    SEARCH ENGINES FOR COMPARISON 3.1. Search by Vocal Imitation: Vroom! We designed a web-based sound search engine by vocal imitation, called Vroom!. The frontend GUI is designed using Javascript, HTML, and CSS languages. It allows a user to record a vocal imitation of sound that he/she is looking for using the recorder.js Javascript library [28]. It also ...

  7. [7]

    Go Search!

    SUBJECTIVE EVALUATION 4.1. Experimental Framework To quantify search behaviors and user experiences and to make quantitative comparisons between Vroom! and TextSearch, we designed an experimental framework that wraps around each search engine. The experimental framework is another web application. It guides each subject to make 20 searches and rate th...

  8. [8]

    Go Search!

    ease-of-use rating evaluates a user’s overall experience of each search engine upon the completion of all 20 searches. It can be seen that Vroom! shows a statistically significantly higher ease-of-use rating than TextSearch at the significance level of 0.05 (p=0.0108, unpaired t-test). This aligns with the average satisfaction rating of all categories. How...

  9. [9]

    We designed a search engine for each approach and an experimental framework for the study

    CONCLUSIONS AND DISCUSSIONS This paper presented a subjective study to compare vocal-imitation-based and text-based search for sounds. We designed a search engine for each approach and an experimental framework for the study. User ratings and behavioral data collected from 20 subjects showed that vocal-imitation-based search has significant advantages...

  10. [10]

    Sound retrieval from voice imitation queries in collaborative databases,

    D. S. Blancas and J. Janer, “Sound retrieval from voice imitation queries in collaborative databases,” in Proc. Audio Engineering Society 53rd International Conference on Semantic Audio, 2014, pp. 1–6

  11. [11]

    Retrieving sounds by vocal imitation recognition,

    Y. Zhang and Z. Duan, “Retrieving sounds by vocal imitation recognition,” in Proc. Machine Learning for Signal Processing (MLSP), 2015 IEEE International Workshop on, 2015, pp. 1–6

  12. [12]

    Visualization and interpretation of Siamese style convolutional neural networks for sound search by vocal imitation,

    Y. Zhang and Z. Duan, “Visualization and interpretation of Siamese style convolutional neural networks for sound search by vocal imitation,” in Proc. Acoustics, Speech and Signal Processing (ICASSP), 2018 IEEE International Conference on, 2018, pp. 2406–2410

  13. [13]

    Vocalsketch: Vocally imitating audio concepts,

    M. Cartwright and B. Pardo, “Vocalsketch: Vocally imitating audio concepts,” in Proc. the 33rd Annual ACM Conference on Human Factors in Computing Systems, 2015, pp. 43–46

  14. [14]

    https://freesound.org [Accessed 04/23/2019]

  15. [15]

    https://soundcloud.com [Accessed 04/23/2019]

  16. [16]

    Query-by-example: A data base language,

    M. M. Zloof, “Query-by-example: A data base language,” IBM Systems Journal, vol. 16, no. 4, pp. 324–343, 1977

  17. [17]

    An industrial strength audio search algorithm,

    A. Wang, “An industrial strength audio search algorithm,” in Proc. International Society for Music Information Retrieval Conference (ISMIR), 2003, pp. 7–13

  18. [18]

    An audio fingerprinting system for live version identification using image processing techniques,

    Z. Rafii, B. Coover, and J. Han, “An audio fingerprinting system for live version identification using image processing techniques,” in Proc. Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on, 2014, pp. 644–648

  19. [19]

    Known artist live song id: A hashprint approach,

    T. J. Tsai, T. Prätzlich, and M. Müller, “Known artist live song id: A hashprint approach,” in Proc. International Society for Music Information Retrieval Conference (ISMIR), 2016, pp. 427–433

  20. [20]

    Large-scale cover song recognition using hashed chroma landmarks,

    T. Bertin-Mahieux and D. P. Ellis, “Large-scale cover song recognition using hashed chroma landmarks,” in Proc. Applications of Signal Processing to Audio and Acoustics (WASPAA), 2011 IEEE Workshop on, 2011, pp. 117–120

  21. [21]

    A lattice-based approach to query-by-example spoken document retrieval,

    T. K. Chia, K. C. Sim, H. Li, and H. T. Ng, “A lattice-based approach to query-by-example spoken document retrieval,” in Proc. the 31st annual international ACM SIGIR conference on Research and development in information retrieval, 2008, pp. 363–370

  22. [22]

    Query by humming: musical information retrieval in an audio database,

    A. Ghias, J. Logan, D. Chamberlin, and B. C. Smith, “Query by humming: musical information retrieval in an audio database,” in Proc. the 3rd ACM International Conference on Multimedia, 1995, pp. 231–236

  23. [23]

    A comparative evaluation of search techniques for query-by-humming using the musart testbed,

    R. B. Dannenberg, W. P. Birmingham, B. Pardo, N. Hu, C. Meek, and G. Tzanetakis, “A comparative evaluation of search techniques for query-by-humming using the musart testbed,” Journal of the Association for Information Science and Technology, vol. 8, no. 5, pp. 687–701, 2007

  24. [24]

    Query-by-beating-boxing: Music retrieval for the DJ,

    A. Kapur, M. Benning, and G. Tzanetakis, “Query-by-beating-boxing: Music retrieval for the DJ,” in Proc. International Society for Music Information Retrieval Conference (ISMIR), 2004, pp. 170–177

  25. [25]

    Drum loops retrieval from spoken queries,

    O. Gillet and G. Richard, “Drum loops retrieval from spoken queries,” Journal of Intelligent Information Systems, vol. 24, pp. 159–177, 2005

  26. [26]

    Querying freesound with a microphone,

    G. Roma and X. Serra, “Querying freesound with a microphone,” in Proc. the 1st Web Audio Conference (WAC), 2015

  27. [27]

    The timbre toolbox: Extracting audio descriptors from musical signal,

    G. Peeters, B. L. Giordano, P. Susini, N. Misdariis, and S. McAdams, “The timbre toolbox: Extracting audio descriptors from musical signal,” The Journal of the Acoustical Society of America, vol. 130, no. 5, pp. 2902–2916, 2011

  28. [28]

    Audio query by example using similarity measures between probability density functions of features,

    M. Helén and T. Virtanen, “Audio query by example using similarity measures between probability density functions of features,” EURASIP Journal on Audio, Speech, and Music Processing, vol. 2010, no. 1, p. 179303, 2009

  29. [29]

    IMISOUND: An unsupervised system for sound query by vocal imitation,

    Y. Zhang and Z. Duan, “IMISOUND: An unsupervised system for sound query by vocal imitation,” in Proc. Acoustics, Speech and Signal Processing (ICASSP), 2016 IEEE International Conference on, 2016, pp. 2269–2273

  30. [30]

    On information and sufficiency,

    S. Kullback and R. A. Leibler, “On information and sufficiency,” The Annals of Mathematical Statistics, vol. 22, no. 1, pp. 79–86, 1951

  31. [31]

    Dynamic programming algorithm optimization for spoken word recognition,

    H. Sakoe and S. Chiba, “Dynamic programming algorithm optimization for spoken word recognition,” Acoustics, Speech and Signal Processing, IEEE Transaction on, vol. 26, no. 1, pp. 43–49, 1978

  32. [32]

    Cosine similarity scoring without score normalization techniques,

    N. Dehak, R. Dehak, J. R. Glass, D. A. Reynolds, and P. Kenny, “Cosine similarity scoring without score normalization techniques,” in Odyssey, 2010, pp. 1–5

  33. [33]

    IMINET: Convolutional semi-Siamese networks for sound search by vocal imitation,

    Y. Zhang and Z. Duan, “IMINET: Convolutional semi-Siamese networks for sound search by vocal imitation,” in Proc. Applications of Signal Processing to Audio and Acoustics (WASPAA), 2017 IEEE Workshop on, 2017, pp. 304–308

  34. [34]

    Siamese style convolutional neural networks for sound search by vocal imitation,

    Y. Zhang, B. Pardo, and Z. Duan, “Siamese style convolutional neural networks for sound search by vocal imitation,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 27, no. 2, pp. 429–441, 2019

  35. [35]

    Improving content-based audio retrieval by vocal imitation feedback,

    B. Kim and B. Pardo, “Improving content-based audio retrieval by vocal imitation feedback,” in Proc. Acoustics, Speech and Signal Processing (ICASSP), 2019 IEEE International Conference on, 2019, pp. 4100–4104

  36. [36]

    Understanding representations learned in deep architectures,

    D. Erhan, A. Courville, and Y. Bengio, “Understanding representations learned in deep architectures,” Department d'Informatique et Recherche Operationnelle, University of Montreal, QC, Canada, Tech. Rep 1355, pp. 1–25, 2010

  37. [37]

    https://github.com/addpipe/simple-recorderjs-demo [Accessed 04/23/2019]

  38. [38]

    Deep convolutional neural networks and data augmentation for environmental sound classification,

    J. Salamon and J. P. Bello, “Deep convolutional neural networks and data augmentation for environmental sound classification,” IEEE Signal Processing Letters, vol. 24, no. 3, pp. 279–283, 2017

  39. [39]

    Deep learning for spoken language identification,

    G. Montavon, “Deep learning for spoken language identification,” in Proc. NIPS Workshop on deep learning for Speech Recognition and Related Applications, 2009, pp. 1–4

  40. [40]

    Wordnet: a lexical database for english,

    G. A. Miller, “Wordnet: a lexical database for english,” Communications of the ACM, vol. 38, no. 11, pp. 39–41, 1995