Effective User-defined Keyword Spotting with Dual-stage Matching, Multi-modal Enrollment, and Continual Adaptation
Pith reviewed 2026-05-22 02:40 UTC · model grok-4.3
The pith
DMA-KWS spots user-defined keywords more reliably by first locating candidates with streaming phoneme search and then verifying them with a separate matcher.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
DMA-KWS achieves state-of-the-art results on the LibriPhrase Hard subset with 97.85% AUC and 6.13% EER by combining a dual-stage matching pipeline (CTC streaming phoneme search followed by QbyT phoneme verification), multi-modal enrollment that merges user speech and text embeddings, and parameter-efficient continual adaptation that updates only 187k parameters using synthetic and real data. In speaker-dependent tests it outperforms text-only enrollment, showing that the added voice information and staged verification improve discrimination of confusable words while remaining suitable for on-device use.
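The two headline numbers are threshold-sweep metrics, so it may help to see how they are obtained from raw detection scores. The following is a minimal sketch, assuming each trial yields one score and a binary label; the function names are illustrative and are not taken from the paper, and the EER here is the nearest operating point rather than an interpolated value.

```python
def roc_points(scores, labels):
    """Return (FAR, FRR) pairs for every distinct threshold.

    labels: 1 = target keyword present, 0 = non-target.
    A trial is accepted when its score is >= the threshold.
    """
    pos = sum(labels)
    neg = len(labels) - pos
    points = [(0.0, 1.0)]  # threshold above every score: nothing accepted
    for t in sorted(set(scores), reverse=True):
        false_accepts = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        misses = sum(1 for s, y in zip(scores, labels) if s < t and y == 1)
        points.append((false_accepts / neg, misses / pos))
    return points


def auc(scores, labels):
    """Area under the ROC curve (FAR vs. 1 - FRR), trapezoid rule."""
    pts = roc_points(scores, labels)
    area = 0.0
    for (far0, frr0), (far1, frr1) in zip(pts, pts[1:]):
        area += (far1 - far0) * ((1 - frr0) + (1 - frr1)) / 2
    return area


def eer(scores, labels):
    """Approximate equal error rate: the point where FAR and FRR are closest."""
    far, frr = min(roc_points(scores, labels), key=lambda p: abs(p[0] - p[1]))
    return (far + frr) / 2
```

With this convention a higher AUC and a lower EER both indicate better separation between target and confusable trials.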
What carries the argument
A dual-stage matching pipeline that runs CTC decoding for streaming candidate location, followed by a QbyT (query-by-text) phoneme matcher for fine-grained verification.
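The locate-then-verify idea can be sketched in a few lines. This is a toy, offline illustration under stated assumptions, not the paper's implementation: the blank index, the exact-match search, and `peak_posterior_matcher` (a stand-in for a learned phoneme matcher) are all hypothetical.

```python
BLANK = 0  # assumed index of the CTC blank symbol


def ctc_greedy_decode(frame_posteriors):
    """Collapse per-frame argmax labels into (phoneme_id, start_frame, end_frame)."""
    decoded = []
    prev = BLANK
    for t, probs in enumerate(frame_posteriors):
        label = max(range(len(probs)), key=probs.__getitem__)
        if label != BLANK and label != prev:
            decoded.append([label, t, t])  # a new phoneme starts here
        elif label != BLANK:
            decoded[-1][2] = t             # same phoneme continues
        prev = label
    return [tuple(d) for d in decoded]


def find_candidates(decoded, keyword):
    """Stage 1: frame spans where the keyword phonemes occur contiguously."""
    ids = [d[0] for d in decoded]
    n = len(keyword)
    return [
        (decoded[i][1], decoded[i + n - 1][2])
        for i in range(len(ids) - n + 1)
        if ids[i:i + n] == list(keyword)
    ]


def peak_posterior_matcher(segment, keyword):
    """Toy verifier: mean of each keyword phoneme's peak posterior in the segment."""
    return sum(max(frame[p] for frame in segment) for p in keyword) / len(keyword)


def detect(frame_posteriors, keyword, matcher=peak_posterior_matcher, threshold=0.5):
    """Stage 2: keep only the candidates whose matcher score clears the threshold."""
    accepted = []
    for start, end in find_candidates(ctc_greedy_decode(frame_posteriors), keyword):
        score = matcher(frame_posteriors[start:end + 1], keyword)
        if score >= threshold:
            accepted.append(((start, end), score))
    return accepted
```

Passing the verifier in as a callable keeps the two stages separable, which is also what a single-stage ablation would need.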
If this is right
- Confusable words are distinguished more accurately because the second-stage phoneme matcher examines candidates the first stage only flagged.
- Registered users see higher accuracy when enrollment combines their recorded speech with text embeddings rather than text alone.
- On-device deployment remains practical because continual adaptation changes only 187k parameters.
- Performance stays consistent across speakers with varying pronunciations due to the multi-modal and adaptation components.
- The system can incorporate new real and synthetic data without full retraining.
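To give a feel for why updating a small parameter subset keeps adaptation light, here is a hedged sketch that counts trainable parameters for a low-rank adapter. Using a LoRA-style update is an assumption made for illustration; the summary above only says the adaptation is parameter-efficient, and the layer shapes and rank below are hypothetical and do not attempt to reproduce the 187k figure.

```python
def low_rank_param_count(layer_shapes, rank):
    """Trainable parameters if each (d_in, d_out) weight gets a rank-r update.

    The update is factored as A (d_in x r) times B (r x d_out),
    so each layer contributes r * (d_in + d_out) parameters.
    """
    return sum(rank * (d_in + d_out) for d_in, d_out in layer_shapes)


def full_param_count(layer_shapes):
    """Parameters updated if the same layers were fine-tuned in full."""
    return sum(d_in * d_out for d_in, d_out in layer_shapes)


# Hypothetical configuration: four 256x256 projections, rank 8.
shapes = [(256, 256)] * 4
adapter = low_rank_param_count(shapes, rank=8)
full = full_param_count(shapes)
fraction = adapter / full
```

For this made-up configuration the adapter trains 16,384 parameters against 262,144 for full fine-tuning of the same layers, or 6.25%.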
Where Pith is reading between the lines
- Similar staged verification might help other on-device audio tasks such as custom command recognition in noisy rooms.
- The low parameter count for updates suggests the approach could be combined with existing wake-word engines with little increase in memory footprint.
- If the phoneme matcher generalizes well, the framework could support open-vocabulary keyword spotting beyond a fixed set of user words.
Load-bearing premise
The CTC-based first stage will find the right candidate segments and the second-stage matcher will reliably reject confusable words even when speakers pronounce them differently.
What would settle it
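On a new test set containing many phonetically similar keywords and a diverse group of speakers, replace the dual-stage pipeline with a single-stage baseline and compare AUC and EER on the same data. If the dual-stage system shows no clear advantage there, the 97.85% AUC and 6.13% EER reported on LibriPhrase Hard cannot be credited to the dual-stage design.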
Original abstract
User-defined keyword spotting (KWS) is crucial for personalized voice interaction, yet existing methods face several challenges: (1) insufficient discriminability among confusable words, (2) performance inconsistency across speakers with varying pronunciations, and (3) high data cost to ensure reliable wake-word performance. In this paper, we introduce DMA-KWS, an efficient and robust framework for user-defined keyword spotting. First, it adopts a dual-stage matching pipeline: CTC decoding with streaming phoneme search to locate candidate segments, followed by QbyT with a phoneme matcher for fine-grained verification, enabling it to better distinguish confusable words. Next, multi-modal enrollment fuses user-specific speech with text embeddings to further improve accuracy for registered users. Finally, a parameter-efficient continual adaptation mechanism performs lightweight updates using synthetic and real data. Extensive experiments demonstrate the superior performance of DMA-KWS. On the LibriPhrase Hard subset, it achieves 97.85% AUC and 6.13% EER, reaching state-of-the-art performance. In speaker-dependent settings, DMA-KWS consistently outperforms text-only enrollment, demonstrating significant performance gains. Moreover, the proposed parameter-efficient fine-tuning mechanism adapts DMA-KWS with only 187k updated parameters, further enhancing KWS performance while ensuring suitability for on-device deployment.
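The abstract does not say how the speech and text embeddings are fused, so the following is only a sketch of one simple possibility: a weighted average of L2-normalized vectors, scored by cosine similarity. The function names and the `alpha` weight are hypothetical.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def mean_vector(vectors):
    """Element-wise mean of equally sized vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def fuse_enrollment(text_emb, speech_embs, alpha=0.5):
    """Blend a text embedding with the mean of the user's speech embeddings.

    alpha = 0 reduces to text-only enrollment; an empty speech list does too.
    """
    text = l2_normalize(text_emb)
    if not speech_embs:
        return text
    speech = l2_normalize(mean_vector([l2_normalize(s) for s in speech_embs]))
    mixed = [(1 - alpha) * t + alpha * s for t, s in zip(text, speech)]
    return l2_normalize(mixed)


def cosine(a, b):
    """Cosine similarity between two vectors."""
    return sum(x * y for x, y in zip(l2_normalize(a), l2_normalize(b)))
```

In this toy setup, a query that resembles the user's own recordings scores higher against the fused template than against the text-only one, which is the effect the speaker-dependent comparison is meant to measure.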
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents DMA-KWS, a framework for user-defined keyword spotting featuring a dual-stage matching pipeline (CTC streaming phoneme search followed by QbyT phoneme matcher), multi-modal enrollment fusing speech and text embeddings, and parameter-efficient continual adaptation using synthetic and real data. It reports state-of-the-art results on the LibriPhrase Hard subset with 97.85% AUC and 6.13% EER, and superior performance over text-only enrollment in speaker-dependent scenarios with only 187k updated parameters.
Significance. Should the claims be substantiated by comprehensive ablations and statistical analysis, the work would represent a meaningful contribution to efficient, personalized keyword spotting systems suitable for on-device deployment. The dual-stage approach and multi-modal fusion could improve robustness to confusable words and speaker variations, while the adaptation mechanism addresses data efficiency concerns in real-world applications.
major comments (2)
- §4.2 (Ablation studies): The central claim that the dual-stage pipeline (CTC streaming search + QbyT matcher) drives the reported gains in discriminability for confusable words lacks a direct ablation that replaces the full pipeline with a single-stage QbyT matcher while holding multi-modal enrollment and continual adaptation fixed; without this, attribution of the 97.85% AUC / 6.13% EER to the dual-stage design remains unverified.
- §4.1 (Main results, LibriPhrase Hard): The SOTA claim of 97.85% AUC and 6.13% EER is presented without error bars, confidence intervals, or statistical significance tests relative to baselines, which is load-bearing for confirming consistent outperformance across speaker-dependent settings and data splits.
minor comments (2)
- Figure 3: The visualization of the dual-stage matching pipeline would benefit from explicit annotation of the candidate segment boundaries produced by CTC search.
- §3.3: The parameter count for the continual adaptation (187k) is stated but the breakdown of which modules are updated (e.g., which layers in the phoneme matcher) could be clarified for reproducibility.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment below and will incorporate revisions to strengthen the presentation of our results and ablations.
- Referee: §4.2 (Ablation studies): The central claim that the dual-stage pipeline (CTC streaming search + QbyT matcher) drives the reported gains in discriminability for confusable words lacks a direct ablation that replaces the full pipeline with a single-stage QbyT matcher while holding multi-modal enrollment and continual adaptation fixed; without this, attribution of the 97.85% AUC / 6.13% EER to the dual-stage design remains unverified.
Authors: We agree that a direct ablation isolating the dual-stage pipeline (CTC + QbyT) while holding multi-modal enrollment and continual adaptation fixed would provide clearer attribution of gains in discriminability for confusable words. Our existing §4.2 ablations examine component contributions but do not include this exact controlled comparison. In the revised manuscript we will add this specific ablation experiment to verify the dual-stage design's role in the reported 97.85% AUC and 6.13% EER. revision: yes
- Referee: §4.1 (Main results, LibriPhrase Hard): The SOTA claim of 97.85% AUC and 6.13% EER is presented without error bars, confidence intervals, or statistical significance tests relative to baselines, which is load-bearing for confirming consistent outperformance across speaker-dependent settings and data splits.
Authors: We acknowledge that the absence of error bars, confidence intervals, and statistical significance tests limits the strength of the SOTA claims. The current manuscript reports point estimates on the LibriPhrase Hard subset. In the revision we will add standard deviations across multiple runs or speaker splits, along with appropriate statistical tests (e.g., paired t-tests) against baselines to confirm consistent outperformance. revision: yes
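The kind of analysis requested here can be sketched with the standard library alone. This is an illustration under assumptions, not the authors' planned procedure: it computes a paired t statistic and a percentile bootstrap interval for the mean per-split difference, and the per-split numbers are placeholders, not results from the paper.

```python
import random
import statistics


def paired_t_statistic(system, baseline):
    """t statistic for the mean of per-split differences (system - baseline)."""
    diffs = [s - b for s, b in zip(system, baseline)]
    return statistics.mean(diffs) / (statistics.stdev(diffs) / len(diffs) ** 0.5)


def bootstrap_ci(system, baseline, n_resamples=2000, level=0.95, seed=0):
    """Percentile bootstrap interval for the mean per-split difference."""
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    diffs = [s - b for s, b in zip(system, baseline)]
    means = sorted(
        statistics.mean(rng.choices(diffs, k=len(diffs)))
        for _ in range(n_resamples)
    )
    lower = means[int((1 - level) / 2 * n_resamples)]
    upper = means[int((1 + level) / 2 * n_resamples) - 1]
    return lower, upper


# Placeholder per-split AUC values, invented purely for illustration.
system_auc = [0.91, 0.93, 0.90, 0.94, 0.92]
baseline_auc = [0.89, 0.90, 0.89, 0.91, 0.90]
t_stat = paired_t_statistic(system_auc, baseline_auc)
ci_low, ci_high = bootstrap_ci(system_auc, baseline_auc)
```

An interval that excludes zero would support a claim of consistent outperformance across splits; turning the t statistic into a p-value additionally requires the t distribution with n - 1 degrees of freedom.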
Circularity Check
No significant circularity detected
The paper presents an empirical framework (DMA-KWS) with a dual-stage pipeline, multi-modal enrollment, and parameter-efficient adaptation, evaluated via standard supervised training on LibriPhrase. No equations, derivations, or self-referential definitions are described that reduce reported metrics (e.g., 97.85% AUC) to fitted inputs by construction. Performance claims rest on external benchmarks and experimental comparisons rather than self-citation chains or ansatz smuggling. This is a normal non-finding for an applied ML paper whose central results are falsifiable outside the fitted values.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Phoneme-level representations combined with CTC decoding and QbyT matching are sufficient to distinguish confusable keywords across speakers.