Improving Automatic Speech Recognition for Speakers Treated for Oral Cancer using Data Augmentation and LLM Error Correction
Pith reviewed 2026-05-19 19:26 UTC · model grok-4.3
The pith
Combining data augmentation with LLM error correction cuts relative word error rate on oral cancer speech by 40% for Whisper and 50% for MMS.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors apply various data augmentation techniques to a corpus of Dutch oral cancer speech to create synthetic data and finetune Whisper and MMS models, achieving an average 8% relative WER decrease with TTS augmentation. Employing LLMs for error correction provides an additional 21.4-26.2% relative decrease for finetuned models, resulting in overall 40% and 50% relative WER decreases for Whisper and MMS respectively.
What carries the argument
Data augmentation techniques, particularly text-to-speech synthesis, combined with large language model-based error correction to improve ASR performance on impaired speech.
If this is right
- Finetuning ASR models on OC speech data augmented with TTS-generated samples reduces WER by about 8% relative, on average.
- LLM error correction decreases WER by a further 21.4-26.2% relative for finetuned models and by 10.0% relative for non-finetuned ones.
- The combined approach achieves 40% relative improvement for Whisper and 50% for MMS on OC speech.
- This strategy is viable for recognizing speech from patients treated for oral cancer.
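All of the figures above are relative reductions, so it helps to see how such percentages are computed and how two successive ones combine. The sketch below is illustrative only: the absolute WER values are hypothetical (the abstract reports none), and the helper names `relative_reduction` and `compose` are not from the paper. How the paper's stages actually combine into the overall 40% and 50% figures is not stated in the material summarized here.

```python
def relative_reduction(baseline_wer: float, new_wer: float) -> float:
    """Fraction by which new_wer is lower than baseline_wer."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return (baseline_wer - new_wer) / baseline_wer


def compose(*reductions: float) -> float:
    """Combine successive relative reductions, each measured against the previous stage."""
    remaining = 1.0
    for r in reductions:
        remaining *= 1.0 - r  # fraction of the error left after this stage
    return 1.0 - remaining


# Hypothetical absolute WERs, chosen only to illustrate the arithmetic.
print(f"{relative_reduction(0.50, 0.30):.1%}")  # 40.0%

# An 8% reduction followed by a 21.4% reduction multiplies, it does not add.
print(f"{compose(0.08, 0.214):.1%}")  # 27.7%
```

Because relative reductions multiply rather than add, the same percentage can correspond to very different absolute gains depending on the baseline.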
Where Pith is reading between the lines
- Similar augmentation and correction methods could be tested on other speech impairments like those from stroke or Parkinson's disease.
- Integrating these techniques into real-time ASR applications might improve accessibility for medical patients in daily use.
- The success suggests that synthetic data can bridge gaps in medical speech datasets where real recordings are hard to obtain.
Load-bearing premise
The synthetic speech samples must match the real speech variations of oral cancer patients closely enough, and the language model fixes must not change any medically important information.
What would settle it
Testing the improved models on a new set of real oral cancer speech recordings and finding no reduction in word error rates compared to baseline would disprove the effectiveness of the augmentation and correction pipeline.
Original abstract
In recent years, the performance of automatic speech recognition (ASR) systems has made considerable progress. Unfortunately, for people with speech impairments, such as people treated for oral cancer (OC), ASR performance is still lagging behind. The scarcity and variability of OC speech data makes development of ASR models for this type of speech difficult. In this work, we use data augmentation and large language model (LLM) error correction to mitigate this problem. We apply various augmentation techniques on a corpus of Dutch oral cancer speech to create synthetic data, and evaluate their effect on ASR performance. We finetune Whisper and Massively Multilingual Speech (MMS) models for each augmentation technique and observe, on average, an 8% relative decrease in Word Error Rate (WER) when including data created using text-to-speech (TTS). When employing LLMs for error correction, we see a further 21.4-26.2% relative decrease in WER for finetuned ASR models and a 10.0% relative decrease for non-finetuned models. Overall, we achieve a 40% relative WER decrease for Whisper and a 50% relative WER decrease for MMS, indicating that a combination of data augmentation and LLM correction is a viable strategy for the recognition of OC speech.
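Every result in the abstract is expressed as Word Error Rate. As a reminder of what that metric measures, here is a minimal sketch of the standard definition: word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The function name and the Dutch example sentences are illustrative assumptions; the paper's own scoring tool and text normalization are not described here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the processed reference
    # prefix and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: one of six reference words is missing.
print(round(word_error_rate("de kat zit op de mat", "de kat zit de mat"), 3))  # 0.167
```

This sketch does no casing or punctuation normalization, which real evaluations usually apply before scoring.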
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript describes the use of data augmentation (including TTS synthesis on Dutch OC speech transcripts) to generate synthetic training data for fine-tuning Whisper and MMS ASR models, followed by LLM-based error correction on the ASR outputs. It reports an average 8% relative WER reduction from TTS augmentation, additional 21.4-26.2% relative reductions from LLM correction on fine-tuned models, and overall relative WER decreases of 40% for Whisper and 50% for MMS, concluding that the combination is a viable strategy for OC speech recognition.
Significance. If the reported relative gains are confirmed with absolute WER values, proper baselines, statistical tests, and evidence that synthetic data matches real OC acoustic characteristics, the work would demonstrate a practical, low-resource approach to improving ASR for a clinically important impaired-speech population. The empirical focus on a real patient corpus and the combination of augmentation with LLM post-processing are strengths that could inform follow-on studies, though the current evidence leaves the magnitude and generalizability of the gains only partially supported.
major comments (3)
- [Abstract] Abstract: the headline claims of 40% and 50% relative WER reductions are presented without any absolute baseline WER values, number of speakers or utterances in the test set, dataset sizes, or statistical significance tests, preventing assessment of whether the improvements are practically meaningful or robust.
- [Data Augmentation] Data augmentation and evaluation protocol: the central assumption that TTS-generated synthetic data sufficiently captures post-treatment articulatory distortions (altered formants, hypernasality, consonant distortions) is not supported by any acoustic analysis or direct comparison of synthetic versus real OC recordings; if the test set consists of real patient speech, observed WER drops may reflect domain mismatch rather than improved robustness.
- [LLM Error Correction] LLM error correction: the additional WER reductions from LLM post-processing are reported without any domain-specific validation or error analysis confirming that medically relevant terminology and clinical intent are preserved; this is a load-bearing assumption for the claim that the pipeline is viable for OC speech.
minor comments (1)
- [Abstract] The description of the 8% average relative decrease from TTS augmentation does not specify how the average is computed across the different augmentation techniques and the two base models.
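The first major comment asks for statistical significance tests. One common option for comparing two ASR systems on the same test set is a paired bootstrap over utterances; the sketch below shows the idea under stated assumptions. It is not the authors' procedure, the function name `bootstrap_win_rate` is hypothetical, and the per-utterance error counts in the example are made up.

```python
import random


def bootstrap_win_rate(errors_a, errors_b, ref_lengths, n_resamples=1000, seed=0):
    """Fraction of resampled test sets on which system B has strictly lower WER than A.

    errors_a / errors_b: per-utterance word error counts for each system.
    ref_lengths: per-utterance reference word counts (all positive).
    """
    if not (len(errors_a) == len(errors_b) == len(ref_lengths)):
        raise ValueError("all inputs must have one entry per utterance")
    rng = random.Random(seed)
    n = len(ref_lengths)
    wins = 0
    for _ in range(n_resamples):
        # Resample utterances with replacement, using the same indices for both systems.
        idx = [rng.randrange(n) for _ in range(n)]
        words = sum(ref_lengths[i] for i in idx)
        wer_a = sum(errors_a[i] for i in idx) / words
        wer_b = sum(errors_b[i] for i in idx) / words
        if wer_b < wer_a:
            wins += 1
    return wins / n_resamples


# Made-up counts for four utterances; B makes fewer errors on each one.
rate = bootstrap_win_rate([3, 2, 4, 1], [2, 1, 3, 0], [10, 8, 12, 6])
print(rate)  # 1.0
```

With few speakers, resampling at the speaker level rather than the utterance level may be more appropriate, since utterances from one speaker are not independent.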
Simulated Author's Rebuttal
Thank you for the opportunity to respond to the referee's comments. We have addressed each major comment in detail below.
- Referee: [Abstract] Abstract: the headline claims of 40% and 50% relative WER reductions are presented without any absolute baseline WER values, number of speakers or utterances in the test set, dataset sizes, or statistical significance tests, preventing assessment of whether the improvements are practically meaningful or robust.
  Authors: We agree that the abstract would benefit from additional context to allow readers to better evaluate the practical significance of the results. In the revised manuscript, we have updated the abstract to report the absolute baseline WER values for the models, the number of speakers and utterances in the test set, the relevant dataset sizes, and a note on the statistical significance of the observed improvements.
  Revision: yes
- Referee: [Data Augmentation] Data augmentation and evaluation protocol: the central assumption that TTS-generated synthetic data sufficiently captures post-treatment articulatory distortions (altered formants, hypernasality, consonant distortions) is not supported by any acoustic analysis or direct comparison of synthetic versus real OC recordings; if the test set consists of real patient speech, observed WER drops may reflect domain mismatch rather than improved robustness.
  Authors: We thank the referee for this observation. Our TTS augmentation is based on transcripts from the OC corpus to increase exposure to domain-specific lexical content and sentence structures rather than to synthesize the precise acoustic distortions of impaired speech. The test set consists of real patient recordings, and the reported gains are empirical. We have revised the manuscript to include an explicit discussion of this limitation of the augmentation approach and its implications for interpreting the source of the WER reductions.
  Revision: partial
- Referee: [LLM Error Correction] LLM error correction: the additional WER reductions from LLM post-processing are reported without any domain-specific validation or error analysis confirming that medically relevant terminology and clinical intent are preserved; this is a load-bearing assumption for the claim that the pipeline is viable for OC speech.
  Authors: We agree that domain-specific validation is important for this component. In the revised manuscript, we have added an error analysis of the LLM corrections that examines preservation of medically relevant terminology and clinical intent, including representative examples and a summary of the types of changes made by the LLM.
  Revision: yes
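The kind of check discussed in the last exchange, whether LLM correction preserves medically relevant terms, can be approximated with a simple before/after comparison. The sketch below is a hypothetical illustration, not the error analysis the authors describe: the function name `dropped_terms`, the protected-term list, and the Dutch example sentences are all assumptions.

```python
def dropped_terms(asr_text: str, corrected_text: str, protected_terms: set[str]) -> set[str]:
    """Protected terms that appear in the ASR output but are missing after correction."""
    before = set(asr_text.lower().split())
    after = set(corrected_text.lower().split())
    return {
        term
        for term in protected_terms
        if term.lower() in before and term.lower() not in after
    }


# Hypothetical example: the correction step replaced a clinical term.
asr = "de patiënt kreeg radiotherapie na de operatie"
corrected = "de patiënt kreeg therapie na de operatie"
print(dropped_terms(asr, corrected, {"radiotherapie", "operatie"}))  # {'radiotherapie'}
```

This only handles single-word terms and whitespace tokenization; multi-word terms, punctuation, and inflected forms would need a more careful matcher, and a flagged term still needs human review to decide whether the change was an error.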
Circularity Check
No circularity: purely empirical ASR finetuning and evaluation
The manuscript presents an experimental pipeline: data augmentation via TTS and other techniques on Dutch OC transcripts, finetuning of Whisper and MMS models, followed by LLM-based error correction, with WER measured on held-out real OC recordings. No equations, uniqueness theorems, or self-citations are invoked to derive the reported 40-50% relative WER reductions; the gains are direct empirical outcomes of the described training and inference steps. The central claims rest on standard ML evaluation protocols rather than any reduction to author-defined parameters or prior self-referential results.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Synthetic speech generated by TTS can usefully supplement scarce real recordings of oral cancer speech for model training.