
arXiv: 2601.17097 · v1 · submitted 2026-01-22 · cs.SD · cs.SE · eess.AS

Sink or SWIM: Tackling Real-Time ASR at Scale

Pith reviewed 2026-05-16 12:05 UTC · model grok-4.3

classification: cs.SD · cs.SE · eess.AS
keywords: real-time ASR · multi-client transcription · Whisper · buffer merging · scalable ASR · multilingual speech recognition · low-latency ASR

The pith

SWIM enables real-time Whisper ASR for up to 20 concurrent clients using buffer merging without model changes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces SWIM, a system that extends real-time automatic speech recognition to multiple simultaneous users by adding a buffer merging step on top of the existing Whisper model. This strategy lets the model process several audio streams in parallel while keeping transcription accuracy intact across English, Italian, and Spanish. The work reports that average delay stays low, around 2.4 seconds at five clients, and that the system keeps working up to twenty clients with no loss in quality and higher overall throughput. A reader would care because current real-time ASR tools struggle to serve more than one client at a time, which limits their use in live meetings or shared services. SWIM demonstrates that these limits can be pushed outward by reusing the same model on merged buffers rather than retraining or duplicating it.

Core claim

SWIM adds a buffer merging strategy to Whisper that achieves true model-level parallelization, letting one unmodified model handle multiple concurrent audio streams for real-time multilingual transcription. In tests it keeps word error rates comparable to the single-client Whisper-Streaming baseline (about 8.2%) while holding average delay to roughly 2.4 seconds at five clients, against about 3.4 seconds for single-client Whisper-Streaming, and it scales to twenty clients with rising throughput.

What carries the argument

The buffer merging strategy, which combines input buffers from several clients so that Whisper can run a single forward pass on the combined data while still producing separate, accurate transcriptions for each stream.
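The following is a minimal sketch of the idea, not the authors' implementation. It assumes that merging means concatenating per-client sample buffers into one shared buffer, separated by a short silence gap, and that the model returns timestamped segments which can be routed back by offset. The names `merge_buffers` and `route_segments`, the gap length, and the `(start_s, end_s, text)` segment shape are all hypothetical.

```python
from typing import Dict, Iterable, List, Tuple

SAMPLE_RATE = 16_000            # Whisper operates on 16 kHz audio
GAP_SAMPLES = SAMPLE_RATE // 2  # hypothetical 0.5 s of silence between clients

Span = Tuple[str, float, float]  # (client_id, start_s, end_s) inside the merged buffer


def merge_buffers(buffers: Dict[str, List[float]]) -> Tuple[List[float], List[Span]]:
    """Concatenate per-client samples and record where each client lands."""
    merged: List[float] = []
    spans: List[Span] = []
    for client_id, samples in buffers.items():
        if merged:
            merged.extend([0.0] * GAP_SAMPLES)  # silence separator (assumption)
        start_s = len(merged) / SAMPLE_RATE
        merged.extend(samples)
        spans.append((client_id, start_s, len(merged) / SAMPLE_RATE))
    return merged, spans


def route_segments(
    segments: Iterable[Tuple[float, float, str]], spans: List[Span]
) -> Dict[str, List[str]]:
    """Give each (start_s, end_s, text) segment to the client whose span holds its midpoint."""
    routed: Dict[str, List[str]] = {client_id: [] for client_id, _, _ in spans}
    for seg_start, seg_end, text in segments:
        midpoint = (seg_start + seg_end) / 2
        for client_id, start_s, end_s in spans:
            if start_s <= midpoint < end_s:
                routed[client_id].append(text)
                break  # segments landing in a silence gap are dropped
    return routed
```

In this sketch a segment whose midpoint falls inside a silence gap is discarded; a real system would need an explicit policy for segments that straddle two clients, which is exactly where cross-client interference could appear.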

If this is right

  • Maintains accuracy comparable to Whisper-Streaming while lowering delay in multi-client use.
  • Supports simultaneous transcription in English, Italian, and Spanish.
  • Scales to twenty concurrent clients without quality loss and with higher total throughput.
  • Requires no modification to the underlying Whisper model.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Buffer merging may transfer to other large transformer ASR models for similar multi-user scaling.
  • The approach could support lower-cost deployment of live captioning in large virtual events or public services.
  • Dynamic adjustment of merge windows might further reduce latency under fluctuating client numbers.

Load-bearing premise

Merging client buffers preserves transcription fidelity without adding errors or needing any change to the Whisper model itself.

What would settle it

Run the same set of multi-client audio recordings once with merged buffers and once with separate model instances, then compare word error rate and latency to check whether the merged version produces measurably higher errors.
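The accuracy half of that comparison can be sketched as below. This is an illustrative word error rate computation based on word-level Levenshtein distance, not the paper's evaluation code; the function names are hypothetical, text normalization is omitted, and averaging per-utterance WER (rather than pooling edits over the corpus) is an assumption.

```python
from typing import Sequence


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference prefix and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, substitution))
        prev = curr
    return prev[-1] / len(ref)


def wer_gap(
    references: Sequence[str],
    merged_hyps: Sequence[str],
    separate_hyps: Sequence[str],
) -> float:
    """Mean WER of the merged-buffer run minus mean WER of the separate-instance run."""
    n = len(references)
    merged = sum(word_error_rate(r, h) for r, h in zip(references, merged_hyps)) / n
    separate = sum(word_error_rate(r, h) for r, h in zip(references, separate_hyps)) / n
    return merged - separate
```

A positive gap would mean the merged-buffer run made more errors than the separate instances on the same recordings; a gap near zero would support the premise.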

Figures

Figures reproduced from arXiv: 2601.17097 by Dario Pellegrino, Federico Bruzzone, Matteo Brancaleoni, Walter Cazzola.

Figure 1: Architecture of SWIM. Multiple clients ❶ stream audio to dedicated services ❷, which forward data to a shared ASR model ❸ for parallel processing.

Figure 2: Audio chunk processing time diagram. The outer maximum with zero ensures that we do not consider negative delays: if the shared ASR takes longer than 1 second to process all active services, the next chunk will already have arrived for every connected service, so there is no extra waiting time. Faster Whisper Model. In the adopted architecture, the entire shared audio buffer is passed to the model as a sin…

Figure 3: Evaluation results with different chunk durations using box plots.

Figure 4: The similarity score distribution per client across all datasets using boxen plots.

Figure 5: Delay per language across the Multilingual LibriSpeech dataset and Google Fleurs dataset using violin plots.

Figure 6: WER per language across the Multilingual LibriSpeech dataset and Google Fleurs dataset using violin plots.

Figure 7: Hypothesis delay vs Confirmation delay with fixed
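The clamp described in the Figure 2 caption can be sketched as follows. The caption does not reproduce the paper's exact expression, so the formula, the function name `extra_wait_s`, and the 1-second default chunk duration are assumptions inferred from that text.

```python
def extra_wait_s(processing_s: float, chunk_s: float = 1.0) -> float:
    """Time a service still waits for its next chunk after the shared model finishes a pass.

    If the pass takes longer than one chunk duration, the next chunk has
    already arrived, so the wait is clamped to zero rather than going negative.
    """
    return max(0.0, chunk_s - processing_s)
```

For example, a 0.4 s pass leaves about 0.6 s of waiting under these assumptions, while a 1.3 s pass leaves none.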
Original abstract

Real-time automatic speech recognition systems are increasingly integrated into interactive applications, from voice assistants to live transcription services. However, scaling these systems to support multiple concurrent clients while maintaining low latency and high accuracy remains a major challenge. In this work, we present SWIM, a novel real-time ASR system built on top of OpenAI's Whisper model that enables true model-level parallelization for scalable, multilingual transcription. SWIM supports multiple concurrent audio streams without modifying the underlying model. It introduces a buffer merging strategy that maintains transcription fidelity while ensuring efficient resource usage. We evaluate SWIM in multi-client settings -- scaling up to 20 concurrent users -- and show that it delivers accurate real-time transcriptions in English, Italian, and Spanish, while maintaining low latency and high throughput. While Whisper-Streaming achieves a word error rate of approximately 8.2% with an average delay of approximately 3.4 s in a single-client, English-only setting, SWIM extends this capability to multilingual, multi-client environments. It maintains comparable accuracy with significantly lower delay -- around 2.4 s with 5 clients -- and continues to scale effectively up to 20 concurrent clients without degrading transcription quality and increasing overall throughput. Our approach advances scalable ASR by improving robustness and efficiency in dynamic, multi-user environments.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper presents SWIM, a real-time ASR system built on unmodified OpenAI Whisper that uses a buffer merging strategy to support multiple concurrent audio streams. It claims to deliver multilingual transcription (English, Italian, Spanish) for up to 20 clients with accuracy comparable to the ~8.2% WER of single-client Whisper-Streaming, an average delay of ~2.4 s at 5 clients (vs. the ~3.4 s baseline), and higher throughput without quality degradation.

Significance. If the multi-client empirical results hold with proper validation, the work would be significant for practical deployment of scalable real-time ASR in multi-user settings such as live captioning or voice interfaces, by demonstrating efficient parallelization on existing models without retraining or architectural changes.

major comments (3)
  1. Abstract: the central claim that SWIM 'maintains comparable accuracy' and 'without degrading transcription quality' up to 20 clients rests on unreported WER values for the multi-client regime; only the single-client Whisper-Streaming WER of ~8.2% is stated, leaving the no-degradation assertion unsupported.
  2. Evaluation section: no quantitative ablation or error analysis is provided for the buffer merging strategy under concurrent multilingual streams, so it is impossible to assess whether merging introduces cross-client interference, context loss, or language-specific errors in Italian/Spanish.
  3. Results: reported delays (~2.4 s at 5 clients) and scaling claims lack error bars, full evaluation protocol details, baseline comparisons beyond the single-client case, or statistical significance tests, which are load-bearing for the performance and robustness assertions.
minor comments (1)
  1. Abstract: define 'delay' and 'throughput' precisely and state the exact test conditions (audio length, hardware, batching) used for the multi-client measurements.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. We address each major comment point by point below, indicating the changes made to strengthen the manuscript.

  1. Referee: Abstract: the central claim that SWIM 'maintains comparable accuracy' and 'without degrading transcription quality' up to 20 clients rests on unreported WER values for the multi-client regime; only the single-client Whisper-Streaming WER of ~8.2% is stated, leaving the no-degradation assertion unsupported.

    Authors: We agree that the abstract should explicitly support the accuracy claim with multi-client WER data. Our experiments showed WER remaining at ~8.2% for up to 20 clients across languages, but this was not stated clearly enough. We have revised the abstract to include the multi-client WER figures and added a reference to the corresponding table in the evaluation section. revision: yes

  2. Referee: Evaluation section: no quantitative ablation or error analysis is provided for the buffer merging strategy under concurrent multilingual streams, so it is impossible to assess whether merging introduces cross-client interference, context loss, or language-specific errors in Italian/Spanish.

    Authors: We acknowledge this gap. We have added a new subsection with quantitative ablations of the buffer merging strategy, including measurements of cross-client interference, context retention, and per-language WER breakdowns for Italian and Spanish to demonstrate the absence of significant degradation. revision: yes

  3. Referee: Results: reported delays (~2.4 s at 5 clients) and scaling claims lack error bars, full evaluation protocol details, baseline comparisons beyond the single-client case, or statistical significance tests, which are load-bearing for the performance and robustness assertions.

    Authors: We agree that greater statistical rigor is needed. We have updated the results to include error bars from repeated trials, expanded the evaluation protocol description, added multi-client baseline comparisons, and included statistical significance tests for the reported latency improvements. revision: yes

Circularity Check

0 steps flagged

No significant circularity in empirical system implementation


The paper describes an engineering implementation of SWIM for scalable real-time ASR using unmodified Whisper with a buffer merging strategy. Claims rest entirely on reported empirical measurements of WER, latency, and throughput across 1-20 clients in multiple languages. No equations, derivations, fitted parameters, ansatzes, or uniqueness theorems appear in the provided text. No self-citation chains or self-definitional reductions are present. The work is self-contained against external benchmarks via direct system evaluation.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are invoked; the contribution is a software engineering system on top of an existing model.



Reference graph

Works this paper leans on

77 extracted references · 77 canonical work pages · 2 internal anchors

  1. [1]

    ASR—A Real-Time Speech Recognition on Portable Devices,

    A. S. Sharma and R. Bhalley, “ASR—A Real-Time Speech Recognition on Portable Devices,” inPro- ceedings of the 2nd International Conference on Ad- vances in Computing, Communication, & Automation 11 (ICACCA’16). Bareilly, India: IEEE, Sep. 2016, pp. 1–4

  2. [2]

    Deep V oice 3: Scal- ing Text-to-Speech with Convolutional Sequence Learn- ing,

    W. Ping, K. Peng, A. Gibiansky, S. O. Arik, A. Kannan, S. Narang, J. Raiman, and J. Miller, “Deep V oice 3: Scal- ing Text-to-Speech with Convolutional Sequence Learn- ing,” inProceedings of the 6th International Conference on Learning Representations (ICLR’18), M. Ranzato, Ed., Vancouver, BC, Canada, Apr./May 2018

  3. [3]

    Automatic Speech Recognition: Systematic Literature Review,

    S. Alharbi, M. Alrazgan, A. Alrashed, T. Alnomasi, R. Almojel, R. Alharbi, S. Alharbi, S. Alturki, F. Al- shehri, and M. Almojil, “Automatic Speech Recognition: Systematic Literature Review,”IEEE Access, vol. 9, pp. 131 858–131 876, 2021

  4. [4]

    Distant Speech Recognition in a Smart Home: Comparison of Several Multisource ASRs in Realistic Conditions,

    B. Lecouteux, M. Vacher, and F. Portet, “Distant Speech Recognition in a Smart Home: Comparison of Several Multisource ASRs in Realistic Conditions,” inProceed- ings of the 12th INTERSPEECH Conference (INTER- SPEECH’11), R. Pieraccini, Ed. Firenze, Italy: ISCA, Aug. 2011, pp. 2273–2276

  5. [5]

    Conversational AI: The Science Behind the Alexa Prize

    A. Ram, R. Prasad, C. Khatri, A. Venkatesh, R. Gabriel, Q. Liu, J. Nunn, B. Hedayatnia, M. Cheng, A. Na- gar, E. King, K. Bland, A. Wartick, Y . Pan, H. Song, S. Jayadevan, G. Hwang, and A. Pettigrue, “Conversa- tional AI: The Science Behind the Alexa Prize,”arXiv e-prints, vol. arXiv:1801.03604, pp. 1–18, Jan. 2018

  6. [6]

    A Review of ASR Technologies for Children’s Speech,

    M. Gerosa, D. Giuliani, S. Narayanan, and A. Potami- anos, “A Review of ASR Technologies for Children’s Speech,” inProceedings of the 2nd Workshop on Child, Computer and Interaction (WOCCI’09). Cambridge, MA, USA: ACM, Nov. 2009, pp. 7:1–7:8

  7. [7]

    Attention- Based wav2text with Feature Transfer Learning,

    A. Tjandra, S. Sakti, and S. Nakamura, “Attention- Based wav2text with Feature Transfer Learning,” in Proceedings of the Automatic Speech Recognition and Understanding Workshop (ASRU’17). Okinawa, Japan: IEEE, Jan. 2017, pp. 309–315

  8. [8]

    Deep Speech Based End-To- End Automated Speech Recognition (ASR) for Indian- English Accents,

    P. Dubey and B. Shah, “Deep Speech Based End-To- End Automated Speech Recognition (ASR) for Indian- English Accents,”arXiv e-prints, vol. arXiv:2204.00977, pp. 1–6, Apr. 2022

  9. [9]

    Deep Speech: Scaling up end-to-end speech recognition

    A. Hannun, C. Case, J. Casper, B. Catanzaro, G. Di- amos, E. Elsen, R. Prenger, S. Satheesh, S. Sengupta, A. Coates, and A. Y . Ng, “Deep Speech: Scaling up End-to-End Speech Recognition,”arXiv e-prints, vol. arXiv:1412.5567, pp. 1–12, Dec. 2014

  10. [10]

    End-to-End Speech Recognition: A Sur- vey,

    R. Prabhavalkar, T. Hori, T. N. Sainath, R. Schlüter, and S. Watanabe, “End-to-End Speech Recognition: A Sur- vey,”IEEE Transactions on Audio, Speech and Language Processing, vol. 32, pp. 325–351, 2023

  11. [11]

    Recent Advances in End-to-End Automatic Speech Recognition,

    J. Li, “Recent Advances in End-to-End Automatic Speech Recognition,”APSIPA Transactions on Signal and Information Processing, vol. 11, no. 1, pp. 1–64, Apr. 2022

  12. [12]

    WaveNet: A Generative Model for Raw Audio,

    A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu, “WaveNet: A Generative Model for Raw Audio,” inProceedings of the 9th Workshop on Speech Synthesis (SSW’16). Sunnyvale, USA: ISCA, Sep. 2016, p. 125

  13. [13]

    50 Years of Progress in Speech and Speaker Recognition Research,

    S. Furui, “50 Years of Progress in Speech and Speaker Recognition Research,”Transactions on Computer and Information Technology, vol. 1, no. 2, pp. 64–74, Nov. 2005

  14. [14]

    The Application of Hidden Markov Models in Speech Recognition,

    M. Gales and S. Young, “The Application of Hidden Markov Models in Speech Recognition,”Foundations and Trends® in Signal Processing, vol. 1, no. 3, pp. 195– 304, Feb. 2008

  15. [15]

    Unsuper- vised Automatic Speech Recognition: A Review,

    H. Aldarmaki, A. Ullah, S. Ram, and N. Zaki, “Unsuper- vised Automatic Speech Recognition: A Review,”Speech Communication, vol. 139, pp. 76–91, Apr. 2022

  16. [16]

    wav2vec 2.0: A Framework for Self-Supervised Learn- ing of Speech Representations,

    A. Baevski, Y . Zhou, A. Mohamed, and M. Auli, “wav2vec 2.0: A Framework for Self-Supervised Learn- ing of Speech Representations,” inProceedings of the 33rd Conference on Neural Information Processing Sys- tems (NeurIPS’20), M. Ranzato, R. Hadsell, M. F. Bal- can, and H.-T. Lin, Eds., vol. 33. Vancouver, BC, Canada: Curran Associates, Inc., Dec. 2020, pp. 1–12

  17. [17]

    GigaSpeech: An Evolving, Multi-Domain ASR Corpus with 10,000 Hours of Transcribed Audio,

    G. Chen, S. Chai, G.-B. Wang, J. Du, W.-Q. Zhang, C. Weng, D. Su, D. Povey, J. Trmal, J. Zhang, M. Jin, S. Khudanpur, S. Watanabe, S. Zhao, W. Zou, X. Li, X. Yao, Y . Wang, Z. You, and Z. Yan, “GigaSpeech: An Evolving, Multi-Domain ASR Corpus with 10,000 Hours of Transcribed Audio,” inProceedings of the 22nd INTERSPEECH Conference (INTERSPEECH’21), L. Bur...

  18. [18]

    Robust Speech Recogni- tion via Large-Scale Weak Supervision,

    A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, and I. Sutskever, “Robust Speech Recogni- tion via Large-Scale Weak Supervision,” inProceedings of the 40th International Conference on Machine Learn- ing (ICML’23). Honolulu, Hawaii: PMLR, Jul. 2023, pp. 28 492–28 518

  19. [19]

    Deep Speech 2: End-to-End Speech Recognition in English and Mandarin,

    D. Amodei, R. Anubhai, E. Battenberg, C. Case, J. Casper, B. Catanzaro, J. Chen, M. Chrzanowski, A. Coates, G. Diamos, E. Elsen, J. Engel, L. Fan, C. Fougner, T. Han, A. Hannun, B. Jun, P. LeGres- ley, L. Lin, S. Narang, A. Ng, S. Ozair, R. Prenger, J. Raiman, S. Satheesh, D. Seetapun, S. Sengupta, Y . Wang, Z. Wang, C. Wang, B. Xiao, D. Yogatama, J. Zhan...

  20. [20]

    Multi- Stream Adaptive Evidence Combination for Noise Ro- bust ASR,

    A. Morris, A. Hagen, H. Glotin, and H. Bourlard, “Multi- Stream Adaptive Evidence Combination for Noise Ro- bust ASR,”Speech Communication, vol. 34, no. 1-2, pp. 25–40, 2001

  21. [21]

    Multi-Stream End-to-End Speech Recognition,

    R. Li, X. Wang, S. H. Mallidi, S. Watanabe, T. Hori, and H. Hermansky, “Multi-Stream End-to-End Speech Recognition,”IEEE Transactions on Audio, Speech, and Language Processing, vol. 27, no. 6, pp. 1026–1039, Jun. 2019

  22. [22]

    Real-Time ASR from Meetings,

    P. N. Garner, J. Dines, T. Hain, A. El Hannani, M. Karafiat, D. Korchagin, M. Lincoln, V . Wan, and L. Zhang, “Real-Time ASR from Meetings,” inPro- 12 ceedings of the 9th INTERSPEECH Conference (INTER- SPEECH’09). Brighton, United Kingdom: ISCA, Sep. 2023, pp. 2119–2122

  23. [23]

    Evaluation of real-time tran- scriptions using end-to-end asr models.arXiv preprint arXiv:2409.05674,

    C. Arriaga, A. Pozo, J. Conde, and A. Alonso, “Evalua- tion of Real-Time Transcriptions Using End-to-End ASR Models,”arXiv preprint arXiv:2409.05674, 2024

  24. [24]

    Speed vs. Accuracy: Designing an Optimal ASR System for Spon- taneous Non-Native Speech in a Real-Time Application,

    A. V . Ivanov, P. L. Lange, D. Suendermann-Oeft, V . Ra- manarayanan, Y . Qian, Z. Yu, and J. Tao, “Speed vs. Accuracy: Designing an Optimal ASR System for Spon- taneous Non-Native Speech in a Real-Time Application,” inProceedings of the 7th International Workshop on Spoken Dialogue Systems (IWSDS’16), ser. Lecture Notes in Electrical Engineering 427. Saa...

  25. [25]

    Preech: A System for Privacy-Preserving Speech Transcription,

    S. Ahmed, A. R. Chowdhury, K. Fawaz, and P. Ra- manathan, “Preech: A System for Privacy-Preserving Speech Transcription,” inProceedings of the 29th USENIX Security Symposium (USENIX Security’20), S. Capkun and F. Roesner, Eds. Boston, MA, USA: USENIX, Aug. 2020, pp. 2703–2720

  26. [26]

    SoK: The Faults in our ASRs,

    H. Abdullah, K. Warren, V . Bindschaedler, N. Papernot, and P. Traynor, “SoK: The Faults in our ASRs,” in Proceedings of the 42nd IEEE Symposium on Security and Privacy (SP’21), A. Oprea and T. Holz, Eds. San Francisco, CA, USA: IEEE, May 2021, pp. 730–747

  27. [27]

    Performance vs. Hardware Requirements in State-of- the-Art Automatic Speech Recognition,

    A.-L. Geirgescu, A. Pappalardo, H. Cucu, and M. Blott, “Performance vs. Hardware Requirements in State-of- the-Art Automatic Speech Recognition,”EURASIP Jour- nal on Audio, Speech, and Music Processing, vol. 2021, no. 1, pp. 28:1–28:30, Jul. 2021

  28. [28]

    Anatomy of Industrial Scale Multilingual ASR,

    F. M. Ramirez, L. Chkhetiani, A. Ehrenberg, R. McHardy, R. Botros, Y . Khare, A. Vanzo, T. Peyash, G. Oexle, M. Lianget al., “Anatomy of Industrial Scale Multilingual ASR,”arXiv preprint arXiv:2404.09841, 2024

  29. [29]

    Decoupled Federated Learning for ASR with Non- IID Data,

    H. Zhu, J. Wang, G. Cheng, P. Zhang, and Y . Yan, “Decoupled Federated Learning for ASR with Non- IID Data,” inProceedings of the 23rd INTERSPEECH Conference (INTERSPEECH’22), K. Lee, L. Lamel, M. Hasegawa-Johnson, K. Livescu, and O. Kang, Eds. Incheon, South Korea: ISCA, Aug. 2023, pp. 2628–2632

  30. [30]

    Challenges and Limitations in Speech Recognition Technology: A Critical Review of Speech Signal Processing Algorithms, Tools and Sys- tems,

    S. Basak, H. Agrawal, S. Jena, S. Gite, M. Bachute, B. Pradhan, and M. Assiri, “Challenges and Limitations in Speech Recognition Technology: A Critical Review of Speech Signal Processing Algorithms, Tools and Sys- tems,”Computer Modeling in Engineering and Sciences, vol. 135, no. 2, pp. 1053–1089, Oct. 2022

  31. [31]

    Whispy: Adapting STT Whisper Models to Real- Time Environments,

    A. Bevilacqua, P. Saviano, A. Amirante, and S. P. Ro- mano, “Whispy: Adapting STT Whisper Models to Real- Time Environments,”arXiv preprint arXiv:2405.03484, 2024

  32. [32]

    Turning Whisper into Real-Time Transcription System,

    D. Machá ˇcek, R. Dabre, and O. Bojar, “Turning Whisper into Real-Time Transcription System,” inProceedings of the 13th International Joint Conference on Natural Lan- guage Processing (IJCNLP’23), S. Saha and H. Sujaini, Eds. Bali, Indonesia: ACL, Nov. 2023, pp. 17–24

  33. [33]

    CUNI-KIT System for Simultaneous Speech Translation Task at IWSLT 2022,

    P. Polák, N.-Q. Pham, T. N. Nguyen, D. Liu, C. Mullov, J. Niehues, O. Bojar, and A. Waibel, “CUNI-KIT System for Simultaneous Speech Translation Task at IWSLT 2022,” inProceedings of the 19th International Con- ference on Spoken Language Translation (IWSLT’22), E. Salesky, M. Federico, and M. Costa-jussà, Eds. Dublin, Ireland: ACL, May 2022, pp. 277–285

  34. [34]

    Scaling Up Online Speech Recognition Using ConvNets,

    V . Pratap, Q. Xu, J. Kahn, G. Avidov, T. Likhoma- nenko, A. Hannun, V . Liptchinsky, G. Synnaeve, and R. Collobert, “Scaling Up Online Speech Recognition Using ConvNets,” inProceedings of the 21st Annual Conference of the International Speech Communication Association (INTERSPEECH’20), B. Xu and T. F. Zheng, Eds. Shanghai, China: ISCA, Oct. 2020, pp. 3376–3380

  35. [35]

    Federated Acous- tic Modeling for Automatic Speech Recognition,

    X. Cui, S. Lu, and B. Kingsbury, “Federated Acous- tic Modeling for Automatic Speech Recognition,” in Proceedings of the 46th International Conference on Acoustics, Speech and Signal Processing (ICASSP’21), T. Davidson and D. Yu, Eds. Toronto, Canada: IEEE, Jun. 2021, pp. 6748–6752

  36. [36]

    Life After Speech Recognition: Fuzzing Semantic Misinterpretation for V oice Assis- tant Applications,

    Y . Zhang, L. Xu, A. Mendoza, G. Yang, P. Chinprut- thiwong, and G. Gu, “Life After Speech Recognition: Fuzzing Semantic Misinterpretation for V oice Assis- tant Applications,” inProceedings of the 26th Annual Network and Distributed System Security Symposium (NDSS’19). San Diego, CA, USA: The Internet Society, Feb. 2019, pp. 1–15

  37. [37]

    Measuring the Accuracy of Automatic Speech Recognition Solutions,

    K. Kuhn, V . Kersken, B. Reuter, N. Egger, and G. Zimmermann, “Measuring the Accuracy of Automatic Speech Recognition Solutions,”ACM Transactions on Accessible Computing, vol. 16, no. 4, pp. 25:1–25:23, Jan. 2024

  38. [38]

    Automatic Speech Recognition (ASR) Based Approach for Speech Therapy of Aphasic Patients: A Review,

    N. Jamal, S. Shanta, F. Mahmud, and M. Sha’abani, “Automatic Speech Recognition (ASR) Based Approach for Speech Therapy of Aphasic Patients: A Review,” in Proceedings of the International Conference on Electri- cal and Electronic Engineering (IC3E’17). Johor Bahru, Malaysia: AIP, Aug. 2017

  39. [39]

    Exploring the Influence of General and Specific Factors on the Recognition Accuracy of an ASR System for Dysarthric Speaker,

    M. B. Mustafa, F. Rosdi, S. S. Salim, and M. U. Mughal, “Exploring the Influence of General and Specific Factors on the Recognition Accuracy of an ASR System for Dysarthric Speaker,”Expert Systems with Applications, vol. 42, no. 8, pp. 3924–3932, May 2015

  40. [40]

    Speech Technology for Healthcare: Opportunities, Challenges, and State of the Art,

    S. Latif, J. Qadir, A. Qayyum, M. Usama, and S. You- nis, “Speech Technology for Healthcare: Opportunities, Challenges, and State of the Art,”IEEE Reviews in Biomedical Engineering, vol. 14, pp. 342–356, 2021

  41. [41]

    Does Automatic Speech Recognition (ASR) Have a Role in the Transcription of Indistinct Covert Recordings for Forensic Purposes?

    D. Loakes, “Does Automatic Speech Recognition (ASR) Have a Role in the Transcription of Indistinct Covert Recordings for Forensic Purposes?”Frontiers in Com- munication, vol. 7, Aug. 2022

  42. [42]

    Institutional and Academic Transcripts of Police Interrogations,

    M. Komter, “Institutional and Academic Transcripts of Police Interrogations,”Frontiers in Communication, vol. 7, Aug. 2022

  43. [43]

    A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition,

    L. R. Rabiner, “A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition,”Pro- ceedings of the IEEE, vol. 77, no. 2, pp. 257–286, Feb. 1989

  44. [44]

    Conformer: Convolution-Augmented Transformer for Speech Recognition,

    A. Gulati, J. Qin, C.-C. Chiu, N. Parmar, Y . Zhang, J. Yu, W. Han, S. Wang, Z. Zhang, Y . Wu, and R. P. Pang, 13 “Conformer: Convolution-Augmented Transformer for Speech Recognition,” inProceedings of the 21st Annual Conference of the International Speech Communication Association (INTERSPEECH’20), B. Xu and T. F. Zheng, Eds. Shanghai, China: ISCA, Oct. ...

  45. [45]

    Streaming End-to-end Speech Recognition for Mobile Devices,

    Y . He, T. N. Sainath, R. Prabhavalkar, I. McGraw, R. Al- varez, D. Zhao, D. Rybach, A. Kannan, Y . Wu, R. Pang, Q. Liang, D. Bhatia, Y . Shangguan, B. Li, G. Pundak, K. C. Sim, T. Bagby, S.-y. Chang, K. Rao, and A. Gru- enstein, “Streaming End-to-end Speech Recognition for Mobile Devices,” inProceedings of the 44th Interna- tional Conference on Acoustics...

  46. [46]

    Alexa, Siri, Cortana, and More: An Intro- duction to V oice Assistants,

    M. B. Hoy, “Alexa, Siri, Cortana, and More: An Intro- duction to V oice Assistants,”Medical Reference Services Quarterly, vol. 37, no. 1, pp. 81–88, Jan./Mar. 2018

  47. [47]

    Transformer in Transformer,

    K. Han, A. Xiao, E. Wu, J. Guo, C. Xu, and Y . Wang, “Transformer in Transformer,” inProceedings of the 35th International Conference on Neural Information Process- ing Systems (NIPS’21), M. Ranzato, A. Beygelzimer, Y . Dauphin, P. S. Lian, and J. Wortman Vaughan, Eds., vol. 34. Virtual: Curran Associates, Inc., Dec. 2021, pp. 15 908–15 919

  48. [48]

    Attention Is All You Need,

    A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. M. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention Is All You Need,” inProceedings of the 31st Conference on Neural Information Processing Systems (NIPS’17), I. Guyon, U. von Luxburg, S. Bengio, H. M. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, Eds. Long Beach, CA, USA: Curran Associat...

  49. [49]

    Scaling Speech Technology to 1,000+ Languages,

    V . Pratap, A. Tjandra, B. Shi, P. Tomasello, A. Babu, S. Kundu, A. Elkahky, Z. Ni, A. Vyas, M. Fazel-Zarandi, A. Baevski, Y . Adi, X. Zhang, W.-N. Hsu, A. Conneau, and M. Auli, “Scaling Speech Technology to 1,000+ Languages,”Journal of Machine Learning Research, vol. 25, pp. 1–52, Mar. 2024

  50. [50]

    Librispeech: An ASR Corpus Based on Public Domain Audio Books,

    V . Panayotov, G. Chen, D. Povey, and S. Khudanpur, “Librispeech: An ASR Corpus Based on Public Domain Audio Books,” inProceedings of the 40th International Conference on Acoustics, Speech, and Signal Processing (ICASSP’15), D. Gray and D. Cochran, Eds. Brisbane, Australia: IEEE, Apr. 2015, pp. 5206–5210

  51. [51]

    Long-Form Speech Genera- tion with Spoken Language Models,

    S. J. Park, J. Salazar, A. Jansen, K. Kinoshita, Y . M. Ro, and R. Skerry-Ryan, “Long-Form Speech Genera- tion with Spoken Language Models,” inProceedings of the 42nd International Conferene on Machine Learning (ICML’25). Vancouver, BC, Canada: PMLR, Jul. 2025, pp. 48 245–48 261

  52. [52]

    MLS: A Large-Scale Multilingual Dataset for Speech Research,

    V . Pratap, Q. Xu, A. Sriram, G. Synnaeve, and R. Col- lobert, “MLS: A Large-Scale Multilingual Dataset for Speech Research,” inProceedings of the 21st Annual Conference of the International Speech Communication Association (INTERSPEECH’20), B. Xu and T. F. Zheng, Eds. Shanghai, China: ISCA, Oct. 2020, pp. 2757–2761

  53. [53]

    FLEURS: FEW-Shot Learning Evaluation of Universal Represen- tations of Speech,

    A. Conneau, M. Ma, S. Khanuja, Y . Zhang, V . Axelrod, S. Dalmia, J. Riesa, C. Rivera, and A. Bapna, “FLEURS: FEW-Shot Learning Evaluation of Universal Represen- tations of Speech,” inProceedings of IEEE Spoken Language Technology Workshop (SLT’22), S. Watanabe, S. Khudanpur, J. Hirschberg, M. Saraclar, and M. Del- croix, Eds. Doha, Qatar: IEEE, Jan. 2023...

  54. [54]

    Word Error Rate Estimation for Speech Recognition: e-WER,

    A. Ali and S. Renals, “Word Error Rate Estimation for Speech Recognition: e-WER,” inProceedings of the 56th Annual Meeting of the Association for Computational Linguistics (ACL’18), I. Gurevych and Y . Miyao, Eds. Melbourne, Australia: ACM, Jul. 2018, pp. 29–24

  55. [55]

    Testing the Correlation of Word Error Rate and Perplexity,

    D. Klakow and J. Peters, “Testing the Correlation of Word Error Rate and Perplexity,”Speech Communica- tion, vol. 38, no. 1–2, pp. 19–28, Sep. 2002

  56. [56]

    Is Word Error Rate a Good Indicator for Spoken Language Under- standing Accuracy,

    Y .-Y . Wang, A. Acero, and C. Chelba, “Is Word Error Rate a Good Indicator for Spoken Language Under- standing Accuracy,” inProceedings of the 7th Workshop on Automatic Speech Recognition and Understanding (ASRU’03). St. Thomas, U.S. Virgin Islands: IEEE, Nov./Dec. 2003, pp. 577–582

  57. [57]

    What Size Test Set Gives Good Error Rate Estimates?

    I. Guyon, J. Makhoul, R. Schwartz, and V. Vapnik, “What Size Test Set Gives Good Error Rate Estimates?” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, no. 1, pp. 52–64, Jan. 1998

  58. [58]

    Character Error Rate Estimation for Automatic Speech Recognition of Short Utterances,

    C. Park, H. Kang, and T. Hain, “Character Error Rate Estimation for Automatic Speech Recognition of Short Utterances,” in Proceedings of the 32nd European Signal Processing Conference (EUSIPCO’24). Lyon, France: IEEE, Aug. 2024, pp. 131–135

  59. [59]

    Improving Character Error Rate is Not Equal to Having Clean Speech: Speech Enhancement for ASR Systems with Black-Box Acoustic Models,

    R. Sawata, Y. Kashiwagi, and S. Takahashi, “Improving Character Error Rate is Not Equal to Having Clean Speech: Speech Enhancement for ASR Systems with Black-Box Acoustic Models,” in Proceedings of the 47th International Conference on Acoustics, Speech and Signal Processing (ICASSP’22), W.-S. Gan and K.-K. Ma, Eds. Singapore, Singapore: IEEE, May 2022, pp...

  60. [60]

    Semantic Word Error Rate for Sentence Similarity,

    C. Spiccia, A. Augello, G. Pilato, and G. Vassallo, “Semantic Word Error Rate for Sentence Similarity,” in Proceedings of the 10th International Conference on Semantic Computing (ICSC’16), T. Li, A. Scherp, D. Ostrowski, and W. Wang, Eds. Laguna Hills, CA, USA: IEEE, Feb. 2016, pp. 266–269

  61. [61]

    “This Sentence Is Wrong.” Detecting Errors in Machine-Translated Sentences,

    S. Raybaud, D. Langlois, and K. Smaïli, ““This Sentence Is Wrong.” Detecting Errors in Machine-Translated Sentences,” Machine Translation, vol. 25, pp. 1–34, Jul. 2011

  62. [62]

    A Normalized Levenshtein Distance Metric,

    Y. Li and B. Liu, “A Normalized Levenshtein Distance Metric,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 29, no. 6, pp. 1091–1095, Jun. 2007

  63. [63]

    Measuring Dialect Pronunciation Differences Using Levenshtein Distance,

    W. J. Heeringa, “Measuring Dialect Pronunciation Differences Using Levenshtein Distance,” PhD Thesis, University of Groningen, Groningen, Netherlands, 2004

  64. [64]

    Timestamp-Aligning and Keyword-Biasing End-to-End ASR Front-End for a KWS System,

    G.-X. Shi, W.-Q. Zhang, G.-B. Wang, J. Zhao, S.-Z. Chai, and Z.-Y. Zhao, “Timestamp-Aligning and Keyword-Biasing End-to-End ASR Front-End for a KWS System,” EURASIP Journal on Audio, Speech and Music Processing, pp. 21:1–21:14, Jul. 2021

  65. [65]

    Temporal Alignment Networks for Long-Term Video,

    T. Han, W. Xie, and A. Zisserman, “Temporal Alignment Networks for Long-Term Video,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR’22), K. Dana, G. Hua, S. Roth, D. Samaras, and R. Singh, Eds. New Orleans, LA, USA: IEEE, Jun. 2022, pp. 2896–2906

  66. [66]

    Neufa: Neural Network Based End-to-End Forced Alignment with Bidirectional Attention Mechanism,

    J. Li, Y. Meng, Z. Wu, H. Meng, Q. Tian, Y. Wang, and Y. Wang, “Neufa: Neural Network Based End-to-End Forced Alignment with Bidirectional Attention Mechanism,” in Proceedings of the 47th International Conference on Acoustics, Speech and Signal Processing (ICASSP’22), W.-S. Gan and K.-K. Ma, Eds. Singapore, Singapore: IEEE, May 2022, pp. 8007–8011

  67. [67]

    TEGRA: Table Extraction by Global Record Alignment,

    X. Chu, Y. He, K. Chakrabarti, and K. Ganjam, “TEGRA: Table Extraction by Global Record Alignment,” in Proceedings of the 41st International Conference on Management of Data (SIGMOD’15), S. B. Davison and Z. Ives, Eds. Melbourne, Australia: ACM, May/Jun. 2015, pp. 1713–1728

  68. [68]

    Efficient Whisper on Streaming Speech,

    R. Wang, Z. Xu, and F. X. Lin, “Efficient Whisper on Streaming Speech,” arXiv preprint arXiv:2412.11272, 2024

  69. [69]

    WhisperX: Time-Accurate Speech Transcription of Long-Form Audio,

    M. Bain, J. Huh, T. Han, and A. Zisserman, “WhisperX: Time-Accurate Speech Transcription of Long-Form Audio,” in Proceedings of the 24th INTERSPEECH Conference (INTERSPEECH’23), S. King, K. Knill, and P. Wagner, Eds. Dublin, Ireland: ISCA, Aug. 2023, pp. 4489–4493

  70. [70]

    A Guided Tour to Approximate String Matching,

    G. Navarro, “A Guided Tour to Approximate String Matching,” Computing Surveys, vol. 33, no. 1, pp. 31–88, Mar. 2001

  71. [71]

    M. M. Deza and E. Deza, Encyclopedia of Distances. Springer, 2009

  72. [72]

    On the Theory and Computation of Evolutionary Distances,

    P. H. Sellers, “On the Theory and Computation of Evolutionary Distances,” Journal on Applied Mathematics, vol. 26, no. 4, pp. 787–793, 1974

  73. [73]

    Binary Codes Capable of Correcting Deletions, Insertions and Reversals,

    V. I. Levenshtein, “Binary Codes Capable of Correcting Deletions, Insertions and Reversals,” Soviet Physics Doklady, vol. 10, no. 8, pp. 707–710, Feb. 1966

  74. [74]

    VoxTester, Software for Digital Evaluation of Speech Changes in Parkinson Disease,

    G. Dimauro, D. Caivano, V. Bevilacqua, F. Girardi, and V. Napoletano, “VoxTester, Software for Digital Evaluation of Speech Changes in Parkinson Disease,” in Proceedings of the 11th International Symposium on Medical Measurements and Applications (MeMeA’16). Benevento, Italy: IEEE, May 2016, pp. 1–6

  75. [75]

    Assessment of Speech Intelligibility in Parkinson’s Disease Using a Speech-To-Text System,

    G. Dimauro, V. Di Nicola, V. Bevilacqua, D. Caivano, and F. Girardi, “Assessment of Speech Intelligibility in Parkinson’s Disease Using a Speech-To-Text System,” IEEE Access, vol. 5, pp. 22 199–22 208, Nov. 2017

  76. [76]

    Italian Parkinson’s Voice and Speech,

    G. Dimauro and F. Girardi, “Italian Parkinson’s Voice and Speech,” 2019

  77. [77]

    Experimentation in Software Engineering,

    C. Wohlin, P. Runeson, M. Höst, M. C. Ohlsson, B. Regnell, and A. Wesslén, Experimentation in Software Engineering. Springer, 2012.

Federico Bruzzone is currently a Ph.D. student in Computer Science at Università degli Studi di Milano, Italy. He was born in 2000 and since he was a child he has been passionate about computer science and music. He got h...