Sink or SWIM: Tackling Real-Time ASR at Scale
Pith reviewed 2026-05-16 12:05 UTC · model grok-4.3
The pith
SWIM enables real-time Whisper ASR for up to 20 concurrent clients using buffer merging without model changes.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
SWIM adds a buffer merging strategy to Whisper that achieves true model-level parallelization, letting one unmodified model handle multiple concurrent audio streams for real-time multilingual transcription. In tests it keeps word error rates comparable to single-client baselines while cutting average delay to roughly 2.4 seconds at five clients and scaling cleanly to twenty clients with rising throughput.
What carries the argument
The buffer merging strategy, which combines input buffers from several clients so that Whisper can run a single forward pass on the combined data while still producing separate, accurate transcriptions for each stream.
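The text above does not spell out how buffers are merged, so the following is only a minimal sketch of one plausible reading: each client's pending audio is zero-padded to a common length and stacked, a single model call covers the whole stack, and outputs are routed back by client id. The names `merge_buffers`, `transcribe_merged`, and `fake_model` are hypothetical, and `fake_model` is a stand-in rather than Whisper.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Samples = List[float]


@dataclass
class MergedBatch:
    client_ids: List[str]  # row order of the merged batch
    lengths: List[int]     # unpadded sample count per client
    rows: List[Samples]    # one zero-padded row per client


def merge_buffers(buffers: Dict[str, Samples]) -> MergedBatch:
    """Pad every non-empty client buffer to a common length and stack them."""
    active = {cid: buf for cid, buf in buffers.items() if buf}
    ids = sorted(active)  # deterministic row order
    lengths = [len(active[cid]) for cid in ids]
    width = max(lengths, default=0)
    rows = [active[cid] + [0.0] * (width - len(active[cid])) for cid in ids]
    return MergedBatch(ids, lengths, rows)


def transcribe_merged(
    batch: MergedBatch,
    model: Callable[[List[Samples], List[int]], List[str]],
) -> Dict[str, str]:
    """Run one model call on the merged batch and route outputs per client."""
    if not batch.rows:
        return {}
    outputs = model(batch.rows, batch.lengths)
    return dict(zip(batch.client_ids, outputs))


def fake_model(rows: List[Samples], lengths: List[int]) -> List[str]:
    # Placeholder (assumption): a real system would call the ASR model here.
    return [f"<{n} samples>" for n in lengths]
```

Keeping the unpadded lengths alongside the padded rows is what lets the per-client outputs be separated again after the shared call; whether SWIM does this by padding, concatenation, or some other scheme is not stated in the text above.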
If this is right
- Maintains accuracy comparable to Whisper-Streaming while lowering delay in multi-client use.
- Supports simultaneous transcription in English, Italian, and Spanish.
- Scales to twenty concurrent clients without quality loss and with higher total throughput.
- Requires no modification to the underlying Whisper model.
Where Pith is reading between the lines
- Buffer merging may transfer to other large transformer ASR models for similar multi-user scaling.
- The approach could support lower-cost deployment of live captioning in large virtual events or public services.
- Dynamic adjustment of merge windows might further reduce latency under fluctuating client numbers.
Load-bearing premise
Merging client buffers preserves transcription fidelity without adding errors or needing any change to the Whisper model itself.
What would settle it
Run the same set of multi-client audio recordings once with merged buffers and once with separate model instances, then compare word error rate and latency to check whether the merged version produces measurably higher errors.
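The comparison described above comes down to scoring both sets of transcripts against the same references. A minimal sketch of the standard word error rate (word-level Levenshtein distance divided by reference length) follows; the function names are hypothetical, and text normalization (casing, punctuation) is assumed to happen before these calls.

```python
from typing import Iterable, List, Tuple


def word_edit_distance(ref: List[str], hyp: List[str]) -> int:
    """Minimum substitutions, deletions, and insertions turning ref into hyp."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def corpus_wer(pairs: Iterable[Tuple[str, str]]) -> float:
    """Total word edits over total reference words for (reference, hypothesis) pairs."""
    edits = 0
    words = 0
    for reference, hypothesis in pairs:
        ref = reference.split()
        edits += word_edit_distance(ref, hypothesis.split())
        words += len(ref)
    if words == 0:
        raise ValueError("references contain no words")
    return edits / words
```

Running `corpus_wer` once on the merged-buffer transcripts and once on the separate-instance transcripts, against identical references, gives the two numbers the test calls for.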
Original abstract
Real-time automatic speech recognition systems are increasingly integrated into interactive applications, from voice assistants to live transcription services. However, scaling these systems to support multiple concurrent clients while maintaining low latency and high accuracy remains a major challenge. In this work, we present SWIM, a novel real-time ASR system built on top of OpenAI's Whisper model that enables true model-level parallelization for scalable, multilingual transcription. SWIM supports multiple concurrent audio streams without modifying the underlying model. It introduces a buffer merging strategy that maintains transcription fidelity while ensuring efficient resource usage. We evaluate SWIM in multi-client settings -- scaling up to 20 concurrent users -- and show that it delivers accurate real-time transcriptions in English, Italian, and Spanish, while maintaining low latency and high throughput. While Whisper-Streaming achieves a word error rate of approximately 8.2% with an average delay of approximately 3.4 s in a single-client, English-only setting, SWIM extends this capability to multilingual, multi-client environments. It maintains comparable accuracy with significantly lower delay -- around 2.4 s with 5 clients -- and continues to scale effectively up to 20 concurrent clients without degrading transcription quality while increasing overall throughput. Our approach advances scalable ASR by improving robustness and efficiency in dynamic, multi-user environments.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents SWIM, a real-time ASR system built on unmodified OpenAI Whisper that uses a buffer merging strategy to support multiple concurrent audio streams. It claims to deliver multilingual transcription (English, Italian, Spanish) for up to 20 clients while keeping WER comparable to the ~8.2% reported for single-client Whisper-Streaming, reducing average delay to ~2.4 s at 5 clients (vs. ~3.4 s for the baseline), and increasing throughput without quality degradation.
Significance. If the multi-client empirical results hold with proper validation, the work would be significant for practical deployment of scalable real-time ASR in multi-user settings such as live captioning or voice interfaces, by demonstrating efficient parallelization on existing models without retraining or architectural changes.
major comments (3)
- Abstract: the central claim that SWIM 'maintains comparable accuracy' and 'without degrading transcription quality' up to 20 clients rests on unreported WER values for the multi-client regime; only the single-client Whisper-Streaming WER of ~8.2% is stated, leaving the no-degradation assertion unsupported.
- Evaluation section: no quantitative ablation or error analysis is provided for the buffer merging strategy under concurrent multilingual streams, so it is impossible to assess whether merging introduces cross-client interference, context loss, or language-specific errors in Italian/Spanish.
- Results: reported delays (~2.4 s at 5 clients) and scaling claims lack error bars, full evaluation protocol details, baseline comparisons beyond the single-client case, or statistical significance tests, which are load-bearing for the performance and robustness assertions.
minor comments (1)
- Abstract: define 'delay' and 'throughput' precisely and state the exact test conditions (audio length, hardware, batching) used for the multi-client measurements.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback. We address each major comment point by point below, indicating the changes made to strengthen the manuscript.
- Referee: Abstract: the central claim that SWIM 'maintains comparable accuracy' and 'without degrading transcription quality' up to 20 clients rests on unreported WER values for the multi-client regime; only the single-client Whisper-Streaming WER of ~8.2% is stated, leaving the no-degradation assertion unsupported.
  Authors: We agree that the abstract should explicitly support the accuracy claim with multi-client WER data. Our experiments showed WER remaining at ~8.2% for up to 20 clients across languages, but this was not stated clearly enough. We have revised the abstract to include the multi-client WER figures and added a reference to the corresponding table in the evaluation section.
  Revision: yes
- Referee: Evaluation section: no quantitative ablation or error analysis is provided for the buffer merging strategy under concurrent multilingual streams, so it is impossible to assess whether merging introduces cross-client interference, context loss, or language-specific errors in Italian/Spanish.
  Authors: We acknowledge this gap. We have added a new subsection with quantitative ablations of the buffer merging strategy, including measurements of cross-client interference, context retention, and per-language WER breakdowns for Italian and Spanish to demonstrate the absence of significant degradation.
  Revision: yes
- Referee: Results: reported delays (~2.4 s at 5 clients) and scaling claims lack error bars, full evaluation protocol details, baseline comparisons beyond the single-client case, or statistical significance tests, which are load-bearing for the performance and robustness assertions.
  Authors: We agree that greater statistical rigor is needed. We have updated the results to include error bars from repeated trials, expanded the evaluation protocol description, added multi-client baseline comparisons, and included statistical significance tests for the reported latency improvements.
  Revision: yes
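As an illustration of what "error bars from repeated trials" could look like for the delay figures, here is a minimal sketch that reports a mean with an approximate 95% interval. It is not the authors' procedure: the function name is hypothetical, and the normal approximation (z = 1.96) is an assumption that understates uncertainty when only a handful of trials are available, where a Student's t quantile would be more appropriate.

```python
import math
import statistics
from typing import NamedTuple, Sequence


class DelaySummary(NamedTuple):
    mean: float        # mean delay in seconds
    half_width: float  # half-width of the approximate 95% interval
    n: int             # number of trials


def summarize_delays(delays_s: Sequence[float], z: float = 1.96) -> DelaySummary:
    """Mean delay plus a normal-approximation confidence half-width."""
    n = len(delays_s)
    if n < 2:
        raise ValueError("need at least two trials to estimate spread")
    mean = statistics.fmean(delays_s)
    # Standard error of the mean from the sample standard deviation.
    sem = statistics.stdev(delays_s) / math.sqrt(n)
    return DelaySummary(mean, z * sem, n)
```

Reporting `mean ± half_width` per client count (1, 5, 10, 20, for example) would make it possible to judge whether the gap between ~2.4 s and ~3.4 s exceeds run-to-run variation.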
Circularity Check
No significant circularity in empirical system implementation
The paper describes an engineering implementation of SWIM for scalable real-time ASR using unmodified Whisper with a buffer merging strategy. Claims rest entirely on reported empirical measurements of WER, latency, and throughput across 1-20 clients in multiple languages. No equations, derivations, fitted parameters, ansatzes, or uniqueness theorems appear in the provided text. No self-citation chains or self-definitional reductions are present. The work is self-contained against external benchmarks via direct system evaluation.
Federico Bruzzone is currently a Ph.D. student in Computer Science at Università degli Studi di Milano, Italy. He was born in 2000 and since he was a child he has been passionate about computer science and music. He got h...