Sink or SWIM: Tackling Real-Time ASR at Scale

Dario Pellegrino; Federico Bruzzone; Matteo Brancaleoni; Walter Cazzola

arXiv: 2601.17097 · v1 · submitted 2026-01-22 · cs.SD · cs.SE · eess.AS

Pith reviewed 2026-05-16 12:05 UTC · model grok-4.3

classification: cs.SD · cs.SE · eess.AS

keywords: real-time ASR · multi-client transcription · Whisper · buffer merging · scalable ASR · multilingual speech recognition · low-latency ASR

The pith

SWIM enables real-time Whisper ASR for up to 20 concurrent clients using buffer merging without model changes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces SWIM, a system that extends real-time automatic speech recognition to multiple simultaneous users by adding a buffer merging step on top of the existing Whisper model. This strategy lets the model process several audio streams in parallel while keeping transcription accuracy comparable to the single-client baseline across English, Italian, and Spanish. The work reports that latency stays low, around 2.4 seconds at five clients, and that the system continues to work up to twenty clients with no loss in quality and higher overall throughput. A reader would care because current real-time ASR tools struggle to serve more than one audio stream at a time, limiting their use in live meetings or services. SWIM demonstrates that these limits can be pushed outward by reusing the same model on merged buffers rather than retraining or duplicating it.

Core claim

SWIM adds a buffer merging strategy to Whisper that achieves true model-level parallelization, letting one unmodified model handle multiple concurrent audio streams for real-time multilingual transcription. In tests it keeps word error rates comparable to single-client baselines while cutting average delay to roughly 2.4 seconds at five clients and scaling cleanly to twenty clients with rising throughput.

What carries the argument

The buffer merging strategy, which combines input buffers from several clients so that Whisper can run a single forward pass on the combined data while still producing separate, accurate transcriptions for each stream.
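One way to picture buffer merging is as batching: each client's pending audio is padded to a common length, stacked, and handed to the model in a single call, with a fixed client order used to route the results back. The sketch below is an illustration under that assumption only. The names `merge_buffers`, `transcribe_merged`, and the stand-in `model` callable are hypothetical, and the paper's actual merging logic may differ.

```python
from typing import Callable, Dict, List, Tuple

Samples = List[float]


def merge_buffers(buffers: Dict[str, Samples]) -> Tuple[List[str], List[Samples], List[int]]:
    """Pad per-client buffers to a common length and stack them into one batch."""
    client_ids = sorted(buffers)  # fixed order so outputs can be routed back
    lengths = [len(buffers[c]) for c in client_ids]
    width = max(lengths, default=0)
    # Zero-pad (silence) so every row in the batch has the same length.
    batch = [buffers[c] + [0.0] * (width - len(buffers[c])) for c in client_ids]
    return client_ids, batch, lengths


def transcribe_merged(
    buffers: Dict[str, Samples],
    model: Callable[[List[Samples]], List[str]],
) -> Dict[str, str]:
    """Run one model call over all clients and map each output to its client."""
    client_ids, batch, _ = merge_buffers(buffers)
    if not batch:
        return {}
    texts = model(batch)  # hypothetical: one output string per batch row
    return dict(zip(client_ids, texts))


def dummy_model(batch: List[Samples]) -> List[str]:
    """Stand-in for an ASR model; it only reports the row length it received."""
    return [f"{len(row)} samples" for row in batch]
```

Calling `transcribe_merged({"a": [...], "b": [...]}, dummy_model)` returns one entry per client. In a real deployment the stand-in would be replaced by the shared ASR model, and the original lengths returned by `merge_buffers` could be used to trim padding effects.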

If this is right

Maintains accuracy comparable to Whisper-Streaming while lowering delay in multi-client use.
Supports simultaneous transcription in English, Italian, and Spanish.
Scales to twenty concurrent clients without quality loss and with higher total throughput.
Requires no modification to the underlying Whisper model.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Buffer merging may transfer to other large transformer ASR models for similar multi-user scaling.
The approach could support lower-cost deployment of live captioning in large virtual events or public services.
Dynamic adjustment of merge windows might further reduce latency under fluctuating client numbers.

Load-bearing premise

Merging client buffers preserves transcription fidelity without adding errors or needing any change to the Whisper model itself.

What would settle it

Run the same set of multi-client audio recordings once with merged buffers and once with separate model instances, then compare word error rate and latency to check whether the merged version produces measurably higher errors.
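The comparison described above comes down to computing word error rate for the same references under two conditions. The following is a minimal sketch, assuming whitespace tokenization and a corpus-level WER (total word edits divided by total reference words). The function names are illustrative, and the paper's own text normalization and scoring tools are not specified here.

```python
from typing import List, Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Word-level Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def corpus_wer(references: List[str], hypotheses: List[str]) -> float:
    """Total word edits divided by total reference words."""
    if len(references) != len(hypotheses):
        raise ValueError("references and hypotheses must align one to one")
    edits = 0
    words = 0
    for ref, hyp in zip(references, hypotheses):
        ref_words = ref.split()
        edits += edit_distance(ref_words, hyp.split())
        words += len(ref_words)
    if words == 0:
        raise ValueError("references contain no words")
    return edits / words


def wer_gap(references: List[str], merged: List[str], separate: List[str]) -> float:
    """Positive values mean the merged-buffer run made more errors."""
    return corpus_wer(references, merged) - corpus_wer(references, separate)
```

A gap near zero across languages and client counts would support the premise; a consistently positive gap would count against it.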

Figures

Figures reproduced from arXiv: 2601.17097 by Dario Pellegrino, Federico Bruzzone, Matteo Brancaleoni, Walter Cazzola.

**Figure 1.** Architecture of SWIM. Multiple clients ❶ stream audio to dedicated services ❷, which forward data to a shared ASR model ❸ for parallel processing.

**Figure 2.** Audio chunk processing time diagram. The outer maximum with zero ensures that we do not consider negative delays: if the shared ASR takes longer than 1 second to process all active services, the next chunk will already have arrived for every connected service, so there is no extra waiting time. Faster Whisper Model. In the adopted architecture, the entire shared audio buffer is passed to the model as a sin…

**Figure 3.** Evaluation results with different chunk durations using box plots.

**Figure 4.** The similarity score distribution per client across all datasets using boxen plots.

**Figure 5.** Delay per language across the Multilingual LibriSpeech dataset and Google Fleurs dataset using violin plots.

**Figure 6.** WER per language across the Multilingual LibriSpeech dataset and Google Fleurs dataset using violin plots.

**Figure 7.** Hypothesis delay vs Confirmation delay with fixed
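The clamping mentioned in the Figure 2 caption can be illustrated with a small helper. This is a sketch of the "outer maximum with zero" only, assuming the extra wait is the chunk duration minus the time the shared model spends on all active services. The function name and parameters are hypothetical, and the paper's full delay expression is not reproduced here.

```python
def extra_wait(chunk_s: float, processing_s: float) -> float:
    """Time spent waiting for the next chunk, clamped so it is never negative.

    chunk_s:      duration of one audio chunk in seconds (1 second in the caption)
    processing_s: time the shared ASR needs to process all active services
    """
    if chunk_s < 0 or processing_s < 0:
        raise ValueError("durations must be non-negative")
    # If processing overruns the chunk, the next chunk has already arrived.
    return max(0.0, chunk_s - processing_s)
```

With a 1-second chunk, a 0.25-second processing pass leaves 0.75 seconds of waiting, while a 1.5-second pass leaves none.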

Original abstract

Real-time automatic speech recognition systems are increasingly integrated into interactive applications, from voice assistants to live transcription services. However, scaling these systems to support multiple concurrent clients while maintaining low latency and high accuracy remains a major challenge. In this work, we present SWIM, a novel real-time ASR system built on top of OpenAI's Whisper model that enables true model-level parallelization for scalable, multilingual transcription. SWIM supports multiple concurrent audio streams without modifying the underlying model. It introduces a buffer merging strategy that maintains transcription fidelity while ensuring efficient resource usage. We evaluate SWIM in multi-client settings -- scaling up to 20 concurrent users -- and show that it delivers accurate real-time transcriptions in English, Italian, and Spanish, while maintaining low latency and high throughput. While Whisper-Streaming achieves a word error rate of approximately 8.2% with an average delay of approximately 3.4 s in a single-client, English-only setting, SWIM extends this capability to multilingual, multi-client environments. It maintains comparable accuracy with significantly lower delay -- around 2.4 s with 5 clients -- and continues to scale effectively up to 20 concurrent clients without degrading transcription quality and increasing overall throughput. Our approach advances scalable ASR by improving robustness and efficiency in dynamic, multi-user environments.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

SWIM's buffer merging lets one Whisper model handle many clients at once with reported latency wins, yet the abstract gives too little data on whether accuracy holds up in the multi-client multilingual setting.

The main takeaway is that the authors built SWIM on top of Whisper to handle multiple concurrent audio streams using a buffer merging strategy. This allows model-level parallelization for real-time transcription in English, Italian, and Spanish without altering the model, and they report it scales to 20 clients with lower delay than the Whisper-Streaming baseline.

The paper does a good job identifying the challenge of scaling real-time ASR for interactive applications. Running separate instances for each client can increase latency and resource use, so merging buffers to share the model makes sense as an efficiency move. The concrete numbers in the abstract, like reducing average delay from 3.4 s to 2.4 s at 5 clients while keeping WER around 8%, give a sense of the potential benefit. It also claims no degradation in quality at higher loads and increased throughput, which addresses a genuine pain point in deployment.

Where it falls short is in the supporting evidence for those claims. The buffer merging is presented as maintaining transcription fidelity, but there are no detailed results for the multi-client case, such as per-client WER, latency breakdowns by number of clients, or tests for cross-language interference. The abstract compares to a single-client English setup, so it is unclear how the multi-client multilingual performance was measured or whether merging introduces any errors. Without ablations or full evaluation details, the central performance improvements are only moderately supported.

This work is aimed at practitioners in speech recognition who deal with live multi-user systems. It could provide ideas for implementation, but readers interested in new algorithms or comprehensive benchmarks may find it limited. The thinking is practical and the problem is well-posed.

I would send this to peer review. The idea is straightforward to evaluate, and referees can request the missing data and analysis to strengthen it.

Referee Report

3 major / 1 minor

Summary. The paper presents SWIM, a real-time ASR system built on unmodified OpenAI Whisper that uses a buffer merging strategy to support multiple concurrent audio streams. It claims to deliver multilingual transcription (English, Italian, Spanish) for up to 20 clients while maintaining ~8.2% WER comparable to single-client Whisper-Streaming, reducing average delay to ~2.4 s at 5 clients (vs. 3.4 s baseline), and scaling throughput without quality degradation.

Significance. If the multi-client empirical results hold with proper validation, the work would be significant for practical deployment of scalable real-time ASR in multi-user settings such as live captioning or voice interfaces, by demonstrating efficient parallelization on existing models without retraining or architectural changes.

major comments (3)

Abstract: the central claim that SWIM 'maintains comparable accuracy' and 'without degrading transcription quality' up to 20 clients rests on unreported WER values for the multi-client regime; only the single-client Whisper-Streaming WER of ~8.2% is stated, leaving the no-degradation assertion unsupported.
Evaluation section: no quantitative ablation or error analysis is provided for the buffer merging strategy under concurrent multilingual streams, so it is impossible to assess whether merging introduces cross-client interference, context loss, or language-specific errors in Italian/Spanish.
Results: reported delays (~2.4 s at 5 clients) and scaling claims lack error bars, full evaluation protocol details, baseline comparisons beyond the single-client case, or statistical significance tests, which are load-bearing for the performance and robustness assertions.
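As an illustration of the kind of uncertainty reporting requested in the last comment, a percentile bootstrap over per-chunk delay measurements is one simple option. The sketch below is an assumption about how such error bars could be produced, not the authors' protocol. `bootstrap_ci` and its parameters are hypothetical names.

```python
import random
from statistics import mean
from typing import List, Tuple


def bootstrap_ci(
    samples: List[float],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap confidence interval for the mean of `samples`."""
    if not samples:
        raise ValueError("need at least one sample")
    rng = random.Random(seed)  # seeded so the interval is reproducible
    n = len(samples)
    # Resample with replacement and record the mean of each resample.
    means = sorted(mean(rng.choices(samples, k=n)) for _ in range(n_resamples))
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))
    return means[lo_idx], means[hi_idx]
```

Applied separately to the delays measured at each client count, the resulting intervals would show whether differences between configurations exceed run-to-run variation.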

minor comments (1)

Abstract: define 'delay' and 'throughput' precisely and state the exact test conditions (audio length, hardware, batching) used for the multi-client measurements.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. We address each major comment point by point below, indicating the changes made to strengthen the manuscript.

Referee: Abstract: the central claim that SWIM 'maintains comparable accuracy' and 'without degrading transcription quality' up to 20 clients rests on unreported WER values for the multi-client regime; only the single-client Whisper-Streaming WER of ~8.2% is stated, leaving the no-degradation assertion unsupported.

Authors: We agree that the abstract should explicitly support the accuracy claim with multi-client WER data. Our experiments showed WER remaining at ~8.2% for up to 20 clients across languages, but this was not stated clearly enough. We have revised the abstract to include the multi-client WER figures and added a reference to the corresponding table in the evaluation section.

revision: yes

Referee: Evaluation section: no quantitative ablation or error analysis is provided for the buffer merging strategy under concurrent multilingual streams, so it is impossible to assess whether merging introduces cross-client interference, context loss, or language-specific errors in Italian/Spanish.

Authors: We acknowledge this gap. We have added a new subsection with quantitative ablations of the buffer merging strategy, including measurements of cross-client interference, context retention, and per-language WER breakdowns for Italian and Spanish to demonstrate the absence of significant degradation.

revision: yes

Referee: Results: reported delays (~2.4 s at 5 clients) and scaling claims lack error bars, full evaluation protocol details, baseline comparisons beyond the single-client case, or statistical significance tests, which are load-bearing for the performance and robustness assertions.

Authors: We agree that greater statistical rigor is needed. We have updated the results to include error bars from repeated trials, expanded the evaluation protocol description, added multi-client baseline comparisons, and included statistical significance tests for the reported latency improvements.

revision: yes

Circularity Check

0 steps flagged

No significant circularity in empirical system implementation

The paper describes an engineering implementation of SWIM for scalable real-time ASR using unmodified Whisper with a buffer merging strategy. Claims rest entirely on reported empirical measurements of WER, latency, and throughput across 1-20 clients in multiple languages. No equations, derivations, fitted parameters, ansatzes, or uniqueness theorems appear in the provided text. No self-citation chains or self-definitional reductions are present. The work is self-contained against external benchmarks via direct system evaluation.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are invoked; the contribution is a software engineering system on top of an existing model.

Reference graph

Works this paper leans on

77 extracted references · 77 canonical work pages · 2 internal anchors

[1]

ASR—A Real-Time Speech Recognition on Portable Devices,

A. S. Sharma and R. Bhalley, “ASR—A Real-Time Speech Recognition on Portable Devices,” inPro- ceedings of the 2nd International Conference on Ad- vances in Computing, Communication, & Automation 11 (ICACCA’16). Bareilly, India: IEEE, Sep. 2016, pp. 1–4

work page 2016
[2]

Deep V oice 3: Scal- ing Text-to-Speech with Convolutional Sequence Learn- ing,

W. Ping, K. Peng, A. Gibiansky, S. O. Arik, A. Kannan, S. Narang, J. Raiman, and J. Miller, “Deep V oice 3: Scal- ing Text-to-Speech with Convolutional Sequence Learn- ing,” inProceedings of the 6th International Conference on Learning Representations (ICLR’18), M. Ranzato, Ed., Vancouver, BC, Canada, Apr./May 2018

work page 2018
[3]

Automatic Speech Recognition: Systematic Literature Review,

S. Alharbi, M. Alrazgan, A. Alrashed, T. Alnomasi, R. Almojel, R. Alharbi, S. Alharbi, S. Alturki, F. Al- shehri, and M. Almojil, “Automatic Speech Recognition: Systematic Literature Review,”IEEE Access, vol. 9, pp. 131 858–131 876, 2021

work page 2021
[4]

Distant Speech Recognition in a Smart Home: Comparison of Several Multisource ASRs in Realistic Conditions,

B. Lecouteux, M. Vacher, and F. Portet, “Distant Speech Recognition in a Smart Home: Comparison of Several Multisource ASRs in Realistic Conditions,” inProceed- ings of the 12th INTERSPEECH Conference (INTER- SPEECH’11), R. Pieraccini, Ed. Firenze, Italy: ISCA, Aug. 2011, pp. 2273–2276

work page 2011
[5]

Conversational AI: The Science Behind the Alexa Prize

A. Ram, R. Prasad, C. Khatri, A. Venkatesh, R. Gabriel, Q. Liu, J. Nunn, B. Hedayatnia, M. Cheng, A. Na- gar, E. King, K. Bland, A. Wartick, Y . Pan, H. Song, S. Jayadevan, G. Hwang, and A. Pettigrue, “Conversa- tional AI: The Science Behind the Alexa Prize,”arXiv e-prints, vol. arXiv:1801.03604, pp. 1–18, Jan. 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[6]

A Review of ASR Technologies for Children’s Speech,

M. Gerosa, D. Giuliani, S. Narayanan, and A. Potami- anos, “A Review of ASR Technologies for Children’s Speech,” inProceedings of the 2nd Workshop on Child, Computer and Interaction (WOCCI’09). Cambridge, MA, USA: ACM, Nov. 2009, pp. 7:1–7:8

work page 2009
[7]

Attention- Based wav2text with Feature Transfer Learning,

A. Tjandra, S. Sakti, and S. Nakamura, “Attention- Based wav2text with Feature Transfer Learning,” in Proceedings of the Automatic Speech Recognition and Understanding Workshop (ASRU’17). Okinawa, Japan: IEEE, Jan. 2017, pp. 309–315

work page 2017
[8]

Deep Speech Based End-To- End Automated Speech Recognition (ASR) for Indian- English Accents,

P. Dubey and B. Shah, “Deep Speech Based End-To- End Automated Speech Recognition (ASR) for Indian- English Accents,”arXiv e-prints, vol. arXiv:2204.00977, pp. 1–6, Apr. 2022

work page arXiv 2022
[9]

Deep Speech: Scaling up end-to-end speech recognition

A. Hannun, C. Case, J. Casper, B. Catanzaro, G. Di- amos, E. Elsen, R. Prenger, S. Satheesh, S. Sengupta, A. Coates, and A. Y . Ng, “Deep Speech: Scaling up End-to-End Speech Recognition,”arXiv e-prints, vol. arXiv:1412.5567, pp. 1–12, Dec. 2014

work page internal anchor Pith review Pith/arXiv arXiv 2014
[10]

End-to-End Speech Recognition: A Sur- vey,

R. Prabhavalkar, T. Hori, T. N. Sainath, R. Schlüter, and S. Watanabe, “End-to-End Speech Recognition: A Sur- vey,”IEEE Transactions on Audio, Speech and Language Processing, vol. 32, pp. 325–351, 2023

work page 2023
[11]

Recent Advances in End-to-End Automatic Speech Recognition,

J. Li, “Recent Advances in End-to-End Automatic Speech Recognition,”APSIPA Transactions on Signal and Information Processing, vol. 11, no. 1, pp. 1–64, Apr. 2022

work page 2022
[12]

WaveNet: A Generative Model for Raw Audio,

A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu, “WaveNet: A Generative Model for Raw Audio,” inProceedings of the 9th Workshop on Speech Synthesis (SSW’16). Sunnyvale, USA: ISCA, Sep. 2016, p. 125

work page 2016
[13]

50 Years of Progress in Speech and Speaker Recognition Research,

S. Furui, “50 Years of Progress in Speech and Speaker Recognition Research,”Transactions on Computer and Information Technology, vol. 1, no. 2, pp. 64–74, Nov. 2005

work page 2005
[14]

The Application of Hidden Markov Models in Speech Recognition,

M. Gales and S. Young, “The Application of Hidden Markov Models in Speech Recognition,”Foundations and Trends® in Signal Processing, vol. 1, no. 3, pp. 195– 304, Feb. 2008

work page 2008
[15]

Unsuper- vised Automatic Speech Recognition: A Review,

H. Aldarmaki, A. Ullah, S. Ram, and N. Zaki, “Unsuper- vised Automatic Speech Recognition: A Review,”Speech Communication, vol. 139, pp. 76–91, Apr. 2022

work page 2022
[16]

wav2vec 2.0: A Framework for Self-Supervised Learn- ing of Speech Representations,

A. Baevski, Y . Zhou, A. Mohamed, and M. Auli, “wav2vec 2.0: A Framework for Self-Supervised Learn- ing of Speech Representations,” inProceedings of the 33rd Conference on Neural Information Processing Sys- tems (NeurIPS’20), M. Ranzato, R. Hadsell, M. F. Bal- can, and H.-T. Lin, Eds., vol. 33. Vancouver, BC, Canada: Curran Associates, Inc., Dec. 2020, pp. 1–12

work page 2020
[17]

GigaSpeech: An Evolving, Multi-Domain ASR Corpus with 10,000 Hours of Transcribed Audio,

G. Chen, S. Chai, G.-B. Wang, J. Du, W.-Q. Zhang, C. Weng, D. Su, D. Povey, J. Trmal, J. Zhang, M. Jin, S. Khudanpur, S. Watanabe, S. Zhao, W. Zou, X. Li, X. Yao, Y . Wang, Z. You, and Z. Yan, “GigaSpeech: An Evolving, Multi-Domain ASR Corpus with 10,000 Hours of Transcribed Audio,” inProceedings of the 22nd INTERSPEECH Conference (INTERSPEECH’21), L. Bur...

work page 2021
[18]

Robust Speech Recogni- tion via Large-Scale Weak Supervision,

A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, and I. Sutskever, “Robust Speech Recogni- tion via Large-Scale Weak Supervision,” inProceedings of the 40th International Conference on Machine Learn- ing (ICML’23). Honolulu, Hawaii: PMLR, Jul. 2023, pp. 28 492–28 518

work page 2023
[19]

Deep Speech 2: End-to-End Speech Recognition in English and Mandarin,

D. Amodei, R. Anubhai, E. Battenberg, C. Case, J. Casper, B. Catanzaro, J. Chen, M. Chrzanowski, A. Coates, G. Diamos, E. Elsen, J. Engel, L. Fan, C. Fougner, T. Han, A. Hannun, B. Jun, P. LeGres- ley, L. Lin, S. Narang, A. Ng, S. Ozair, R. Prenger, J. Raiman, S. Satheesh, D. Seetapun, S. Sengupta, Y . Wang, Z. Wang, C. Wang, B. Xiao, D. Yogatama, J. Zhan...

work page 2016
[20]

Multi- Stream Adaptive Evidence Combination for Noise Ro- bust ASR,

A. Morris, A. Hagen, H. Glotin, and H. Bourlard, “Multi- Stream Adaptive Evidence Combination for Noise Ro- bust ASR,”Speech Communication, vol. 34, no. 1-2, pp. 25–40, 2001

work page 2001
[21]

Multi-Stream End-to-End Speech Recognition,

R. Li, X. Wang, S. H. Mallidi, S. Watanabe, T. Hori, and H. Hermansky, “Multi-Stream End-to-End Speech Recognition,”IEEE Transactions on Audio, Speech, and Language Processing, vol. 27, no. 6, pp. 1026–1039, Jun. 2019

work page 2019
[22]

Real-Time ASR from Meetings,

P. N. Garner, J. Dines, T. Hain, A. El Hannani, M. Karafiat, D. Korchagin, M. Lincoln, V . Wan, and L. Zhang, “Real-Time ASR from Meetings,” inPro- 12 ceedings of the 9th INTERSPEECH Conference (INTER- SPEECH’09). Brighton, United Kingdom: ISCA, Sep. 2023, pp. 2119–2122

work page 2023
[23]

Evaluation of real-time tran- scriptions using end-to-end asr models.arXiv preprint arXiv:2409.05674,

C. Arriaga, A. Pozo, J. Conde, and A. Alonso, “Evalua- tion of Real-Time Transcriptions Using End-to-End ASR Models,”arXiv preprint arXiv:2409.05674, 2024

work page arXiv 2024
[24]

Speed vs. Accuracy: Designing an Optimal ASR System for Spon- taneous Non-Native Speech in a Real-Time Application,

A. V . Ivanov, P. L. Lange, D. Suendermann-Oeft, V . Ra- manarayanan, Y . Qian, Z. Yu, and J. Tao, “Speed vs. Accuracy: Designing an Optimal ASR System for Spon- taneous Non-Native Speech in a Real-Time Application,” inProceedings of the 7th International Workshop on Spoken Dialogue Systems (IWSDS’16), ser. Lecture Notes in Electrical Engineering 427. Saa...

work page 2016
[25]

Preech: A System for Privacy-Preserving Speech Transcription,

S. Ahmed, A. R. Chowdhury, K. Fawaz, and P. Ra- manathan, “Preech: A System for Privacy-Preserving Speech Transcription,” inProceedings of the 29th USENIX Security Symposium (USENIX Security’20), S. Capkun and F. Roesner, Eds. Boston, MA, USA: USENIX, Aug. 2020, pp. 2703–2720

work page 2020
[26]

SoK: The Faults in our ASRs,

H. Abdullah, K. Warren, V . Bindschaedler, N. Papernot, and P. Traynor, “SoK: The Faults in our ASRs,” in Proceedings of the 42nd IEEE Symposium on Security and Privacy (SP’21), A. Oprea and T. Holz, Eds. San Francisco, CA, USA: IEEE, May 2021, pp. 730–747

work page 2021
[27]

Performance vs. Hardware Requirements in State-of- the-Art Automatic Speech Recognition,

A.-L. Geirgescu, A. Pappalardo, H. Cucu, and M. Blott, “Performance vs. Hardware Requirements in State-of- the-Art Automatic Speech Recognition,”EURASIP Jour- nal on Audio, Speech, and Music Processing, vol. 2021, no. 1, pp. 28:1–28:30, Jul. 2021

work page 2021
[28]

Anatomy of Industrial Scale Multilingual ASR,

F. M. Ramirez, L. Chkhetiani, A. Ehrenberg, R. McHardy, R. Botros, Y . Khare, A. Vanzo, T. Peyash, G. Oexle, M. Lianget al., “Anatomy of Industrial Scale Multilingual ASR,”arXiv preprint arXiv:2404.09841, 2024

work page arXiv 2024
[29]

Decoupled Federated Learning for ASR with Non- IID Data,

H. Zhu, J. Wang, G. Cheng, P. Zhang, and Y . Yan, “Decoupled Federated Learning for ASR with Non- IID Data,” inProceedings of the 23rd INTERSPEECH Conference (INTERSPEECH’22), K. Lee, L. Lamel, M. Hasegawa-Johnson, K. Livescu, and O. Kang, Eds. Incheon, South Korea: ISCA, Aug. 2023, pp. 2628–2632

work page 2023
[30]

Challenges and Limitations in Speech Recognition Technology: A Critical Review of Speech Signal Processing Algorithms, Tools and Sys- tems,

S. Basak, H. Agrawal, S. Jena, S. Gite, M. Bachute, B. Pradhan, and M. Assiri, “Challenges and Limitations in Speech Recognition Technology: A Critical Review of Speech Signal Processing Algorithms, Tools and Sys- tems,”Computer Modeling in Engineering and Sciences, vol. 135, no. 2, pp. 1053–1089, Oct. 2022

work page 2022
[31]

Whispy: Adapting STT Whisper Models to Real- Time Environments,

A. Bevilacqua, P. Saviano, A. Amirante, and S. P. Ro- mano, “Whispy: Adapting STT Whisper Models to Real- Time Environments,”arXiv preprint arXiv:2405.03484, 2024

work page arXiv 2024
[32]

Turning Whisper into Real-Time Transcription System,

D. Machá ˇcek, R. Dabre, and O. Bojar, “Turning Whisper into Real-Time Transcription System,” inProceedings of the 13th International Joint Conference on Natural Lan- guage Processing (IJCNLP’23), S. Saha and H. Sujaini, Eds. Bali, Indonesia: ACL, Nov. 2023, pp. 17–24

work page 2023
[33]

CUNI-KIT System for Simultaneous Speech Translation Task at IWSLT 2022,

P. Polák, N.-Q. Pham, T. N. Nguyen, D. Liu, C. Mullov, J. Niehues, O. Bojar, and A. Waibel, “CUNI-KIT System for Simultaneous Speech Translation Task at IWSLT 2022,” inProceedings of the 19th International Con- ference on Spoken Language Translation (IWSLT’22), E. Salesky, M. Federico, and M. Costa-jussà, Eds. Dublin, Ireland: ACL, May 2022, pp. 277–285

work page 2022
[34]

Scaling Up Online Speech Recognition Using ConvNets,

V . Pratap, Q. Xu, J. Kahn, G. Avidov, T. Likhoma- nenko, A. Hannun, V . Liptchinsky, G. Synnaeve, and R. Collobert, “Scaling Up Online Speech Recognition Using ConvNets,” inProceedings of the 21st Annual Conference of the International Speech Communication Association (INTERSPEECH’20), B. Xu and T. F. Zheng, Eds. Shanghai, China: ISCA, Oct. 2020, pp. 3376–3380

work page 2020
[35]

Federated Acous- tic Modeling for Automatic Speech Recognition,

X. Cui, S. Lu, and B. Kingsbury, “Federated Acous- tic Modeling for Automatic Speech Recognition,” in Proceedings of the 46th International Conference on Acoustics, Speech and Signal Processing (ICASSP’21), T. Davidson and D. Yu, Eds. Toronto, Canada: IEEE, Jun. 2021, pp. 6748–6752

work page 2021
[36]

Life After Speech Recognition: Fuzzing Semantic Misinterpretation for V oice Assis- tant Applications,

Y . Zhang, L. Xu, A. Mendoza, G. Yang, P. Chinprut- thiwong, and G. Gu, “Life After Speech Recognition: Fuzzing Semantic Misinterpretation for V oice Assis- tant Applications,” inProceedings of the 26th Annual Network and Distributed System Security Symposium (NDSS’19). San Diego, CA, USA: The Internet Society, Feb. 2019, pp. 1–15

work page 2019
[37]

Measuring the Accuracy of Automatic Speech Recognition Solutions,

K. Kuhn, V . Kersken, B. Reuter, N. Egger, and G. Zimmermann, “Measuring the Accuracy of Automatic Speech Recognition Solutions,”ACM Transactions on Accessible Computing, vol. 16, no. 4, pp. 25:1–25:23, Jan. 2024

work page 2024
[38]

Automatic Speech Recognition (ASR) Based Approach for Speech Therapy of Aphasic Patients: A Review,

N. Jamal, S. Shanta, F. Mahmud, and M. Sha’abani, “Automatic Speech Recognition (ASR) Based Approach for Speech Therapy of Aphasic Patients: A Review,” in Proceedings of the International Conference on Electri- cal and Electronic Engineering (IC3E’17). Johor Bahru, Malaysia: AIP, Aug. 2017

work page 2017
[39]

Exploring the Influence of General and Specific Factors on the Recognition Accuracy of an ASR System for Dysarthric Speaker,

M. B. Mustafa, F. Rosdi, S. S. Salim, and M. U. Mughal, “Exploring the Influence of General and Specific Factors on the Recognition Accuracy of an ASR System for Dysarthric Speaker,”Expert Systems with Applications, vol. 42, no. 8, pp. 3924–3932, May 2015

work page 2015
[40]

Speech Technology for Healthcare: Opportunities, Challenges, and State of the Art,

S. Latif, J. Qadir, A. Qayyum, M. Usama, and S. You- nis, “Speech Technology for Healthcare: Opportunities, Challenges, and State of the Art,”IEEE Reviews in Biomedical Engineering, vol. 14, pp. 342–356, 2021

work page 2021
[41]

Does Automatic Speech Recognition (ASR) Have a Role in the Transcription of Indistinct Covert Recordings for Forensic Purposes?

D. Loakes, “Does Automatic Speech Recognition (ASR) Have a Role in the Transcription of Indistinct Covert Recordings for Forensic Purposes?”Frontiers in Com- munication, vol. 7, Aug. 2022

work page 2022
[42]

Institutional and Academic Transcripts of Police Interrogations,

M. Komter, “Institutional and Academic Transcripts of Police Interrogations,”Frontiers in Communication, vol. 7, Aug. 2022

work page 2022
[43]

A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition,

L. R. Rabiner, “A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition,”Pro- ceedings of the IEEE, vol. 77, no. 2, pp. 257–286, Feb. 1989

work page 1989
[44]

Conformer: Convolution-Augmented Transformer for Speech Recognition,

A. Gulati, J. Qin, C.-C. Chiu, N. Parmar, Y . Zhang, J. Yu, W. Han, S. Wang, Z. Zhang, Y . Wu, and R. P. Pang, 13 “Conformer: Convolution-Augmented Transformer for Speech Recognition,” inProceedings of the 21st Annual Conference of the International Speech Communication Association (INTERSPEECH’20), B. Xu and T. F. Zheng, Eds. Shanghai, China: ISCA, Oct. ...

work page 2020
[45]

Streaming End-to-end Speech Recognition for Mobile Devices,

Y . He, T. N. Sainath, R. Prabhavalkar, I. McGraw, R. Al- varez, D. Zhao, D. Rybach, A. Kannan, Y . Wu, R. Pang, Q. Liang, D. Bhatia, Y . Shangguan, B. Li, G. Pundak, K. C. Sim, T. Bagby, S.-y. Chang, K. Rao, and A. Gru- enstein, “Streaming End-to-end Speech Recognition for Mobile Devices,” inProceedings of the 44th Interna- tional Conference on Acoustics...

work page 2019
[46]

Alexa, Siri, Cortana, and More: An Intro- duction to V oice Assistants,

M. B. Hoy, “Alexa, Siri, Cortana, and More: An Intro- duction to V oice Assistants,”Medical Reference Services Quarterly, vol. 37, no. 1, pp. 81–88, Jan./Mar. 2018

work page 2018
[47]

Transformer in Transformer,

K. Han, A. Xiao, E. Wu, J. Guo, C. Xu, and Y . Wang, “Transformer in Transformer,” inProceedings of the 35th International Conference on Neural Information Process- ing Systems (NIPS’21), M. Ranzato, A. Beygelzimer, Y . Dauphin, P. S. Lian, and J. Wortman Vaughan, Eds., vol. 34. Virtual: Curran Associates, Inc., Dec. 2021, pp. 15 908–15 919

work page 2021
[48]

Attention Is All You Need,

A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. M. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention Is All You Need,” inProceedings of the 31st Conference on Neural Information Processing Systems (NIPS’17), I. Guyon, U. von Luxburg, S. Bengio, H. M. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, Eds. Long Beach, CA, USA: Curran Associat...

work page 2017
[49]

Scaling Speech Technology to 1,000+ Languages,

V . Pratap, A. Tjandra, B. Shi, P. Tomasello, A. Babu, S. Kundu, A. Elkahky, Z. Ni, A. Vyas, M. Fazel-Zarandi, A. Baevski, Y . Adi, X. Zhang, W.-N. Hsu, A. Conneau, and M. Auli, “Scaling Speech Technology to 1,000+ Languages,”Journal of Machine Learning Research, vol. 25, pp. 1–52, Mar. 2024

work page 2024
[50]

V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, “Librispeech: An ASR Corpus Based on Public Domain Audio Books,” in Proceedings of the 40th International Conference on Acoustics, Speech, and Signal Processing (ICASSP’15), D. Gray and D. Cochran, Eds. Brisbane, Australia: IEEE, Apr. 2015, pp. 5206–5210

[51]

S. J. Park, J. Salazar, A. Jansen, K. Kinoshita, Y. M. Ro, and R. Skerry-Ryan, “Long-Form Speech Generation with Spoken Language Models,” in Proceedings of the 42nd International Conference on Machine Learning (ICML’25). Vancouver, BC, Canada: PMLR, Jul. 2025, pp. 48 245–48 261

[52]

V. Pratap, Q. Xu, A. Sriram, G. Synnaeve, and R. Collobert, “MLS: A Large-Scale Multilingual Dataset for Speech Research,” in Proceedings of the 21st Annual Conference of the International Speech Communication Association (INTERSPEECH’20), B. Xu and T. F. Zheng, Eds. Shanghai, China: ISCA, Oct. 2020, pp. 2757–2761

[53]

A. Conneau, M. Ma, S. Khanuja, Y. Zhang, V. Axelrod, S. Dalmia, J. Riesa, C. Rivera, and A. Bapna, “FLEURS: FEW-Shot Learning Evaluation of Universal Representations of Speech,” in Proceedings of IEEE Spoken Language Technology Workshop (SLT’22), S. Watanabe, S. Khudanpur, J. Hirschberg, M. Saraclar, and M. Delcroix, Eds. Doha, Qatar: IEEE, Jan. 2023...

[54]

A. Ali and S. Renals, “Word Error Rate Estimation for Speech Recognition: e-WER,” in Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (ACL’18), I. Gurevych and Y. Miyao, Eds. Melbourne, Australia: ACM, Jul. 2018, pp. 20–24

[55]

D. Klakow and J. Peters, “Testing the Correlation of Word Error Rate and Perplexity,” Speech Communication, vol. 38, no. 1–2, pp. 19–28, Sep. 2002

[56]

Y.-Y. Wang, A. Acero, and C. Chelba, “Is Word Error Rate a Good Indicator for Spoken Language Understanding Accuracy,” in Proceedings of the 7th Workshop on Automatic Speech Recognition and Understanding (ASRU’03). St. Thomas, U.S. Virgin Islands: IEEE, Nov./Dec. 2003, pp. 577–582

[57]

I. Guyon, J. Makhoul, R. Schwartz, and V. Vapnik, “What Size Test Set Gives Good Error Rate Estimates?” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, no. 1, pp. 52–64, Jan. 1998

[58]

C. Park, H. Kang, and T. Hain, “Character Error Rate Estimation for Automatic Speech Recognition of Short Utterances,” in Proceedings of the 32nd European Signal Processing Conference (EUSIPCO’24). Lyon, France: IEEE, Aug. 2024, pp. 131–135

[59]

R. Sawata, Y. Kashiwagi, and S. Takahashi, “Improving Character Error Rate is Not Equal to Having Clean Speech: Speech Enhancement for ASR Systems with Black-Box Acoustic Models,” in Proceedings of the 47th International Conference on Acoustics, Speech and Signal Processing (ICASSP’22), W.-S. Gan and K.-K. Ma, Eds. Singapore, Singapore: IEEE, May 2022, pp...

[60]

C. Spiccia, A. Augello, G. Pilato, and G. Vassallo, “Semantic Word Error Rate for Sentence Similarity,” in Proceedings of the 10th International Conference on Semantic Computing (ICSC’16), T. Li, A. Scherp, D. Ostrowski, and W. Wang, Eds. Laguna Hill, CA, USA: IEEE, Feb. 2016, pp. 266–269

[61]

S. Raybaud, D. Langlois, and K. Smaïli, ““This Sentence Is Wrong.” Detecting Errors in Machine-Translated Sentences,” Machine Translation, vol. 25, pp. 1–34, Jul. 2011

[62]

Y. Li and B. Liu, “A Normalized Levenshtein Distance Metric,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 29, no. 6, pp. 1091–1095, Jun. 2007

[63]

W. J. Heeringa, “Measuring Dialect Pronunciation Differences Using Levenshtein Distance,” PhD Thesis, University of Groningen, Groningen, Netherlands, 2004

[64]

G.-X. Shi, W.-Q. Zhang, G.-B. Wang, J. Zhao, S.-Z. Chai, and Z.-Y. Zhao, “Timestamp-Aligning and Keyword-Biasing End-to-End ASR Front-End for a KWS System,” EURASIP Journal on Audio, Speech and Music Processing, pp. 21:1–21:14, Jul. 2021

[65]

T. Han, W. Xie, and A. Zisserman, “Temporal Alignment Networks for Long-Term Video,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR’22), K. Dana, G. Hua, S. Roth, D. Samaras, and R. Singh, Eds. New Orleans, LA, USA: IEEE, Jun. 2022, pp. 2896–2906

[66]

J. Li, Y. Meng, Z. Wu, H. Meng, Q. Tian, Y. Wang, and Y. Wang, “Neufa: Neural Network Based End-to-End Forced Alignment with Bidirectional Attention Mechanism,” in Proceedings of the 47th International Conference on Acoustics, Speech and Signal Processing (ICASSP’22), W.-S. Gan and K.-K. Ma, Eds. Singapore, Singapore: IEEE, May 2022, pp. 8007–8011

[67]

X. Chu, Y. He, K. Chakrabarti, and K. Ganjam, “TEGRA: Table Extraction by Global Record Alignment,” in Proceedings of the 41st International Conference on Management of Data (SIGMOD’15), S. B. Davison and Z. Ives, Eds. Melbourne, Australia: ACM, May/Jun. 2015, pp. 1713–1728

[68]

R. Wang, Z. Xu, and F. X. Lin, “Efficient Whisper on Streaming Speech,” arXiv preprint arXiv:2412.11272, 2024

[69]

M. Bain, J. Huh, T. Han, and A. Zisserman, “WhisperX: Time-Accurate Speech Transcription of Long-Form Audio,” in Proceedings of the 24th INTERSPEECH Conference (INTERSPEECH’23), S. King, K. Knill, and P. Wagner, Eds. Dublin, Ireland: ISCA, Aug. 2023, pp. 4489–4493

[70]

G. Navarro, “A Guided Tour to Approximate String Matching,” Computing Surveys, vol. 33, no. 1, pp. 31–88, Mar. 2001

[71]

M. M. Deza and E. Deza, Encyclopedia of Distances. Springer, 2009

[72]

P. H. Sellers, “On the Theory and Computation of Evolutionary Distances,” Journal on Applied Mathematics, vol. 26, no. 4, pp. 787–793, 1974

[73]

V. I. Levenshtein, “Binary Codes Capable of Correcting Deletions, Insertions and Reversals,” Soviet Physics Doklady, vol. 10, no. 8, pp. 707–710, Feb. 1966

[74]

G. Dimauro, D. Caivano, V. Bevilacqua, F. Girardi, and V. Napoletano, “VoxTester, Software for Digital Evaluation of Speech Changes in Parkinson Disease,” in Proceedings of the 11th International Symposium on Medical Measurements and Applications (MeMeA’16). Benevento, Italy: IEEE, May 2016, pp. 1–6

[75]

G. Dimauro, V. Di Nicola, V. Bevilacqua, D. Caivano, and F. Girardi, “Assessment of Speech Intelligibility in Parkinson’s Disease Using a Speech-To-Text System,” IEEE Access, vol. 5, pp. 22 199–22 208, Nov. 2017

[76]

G. Dimauro and F. Girardi, “Italian Parkinson’s Voice and Speech,” 2019

[77]

C. Wohlin, P. Runeson, M. Höst, M. C. Ohlsson, B. Regnell, and A. Wesslén, Experimentation in Software Engineering. Springer, 2012.

Federico Bruzzone is currently a Ph.D. student in Computer Science at Università degli Studi di Milano, Italy. He was born in 2000 and since he was a child he has been passionate about computer science and music. He got h...

[45] [45]

Streaming End-to-end Speech Recognition for Mobile Devices,

Y . He, T. N. Sainath, R. Prabhavalkar, I. McGraw, R. Al- varez, D. Zhao, D. Rybach, A. Kannan, Y . Wu, R. Pang, Q. Liang, D. Bhatia, Y . Shangguan, B. Li, G. Pundak, K. C. Sim, T. Bagby, S.-y. Chang, K. Rao, and A. Gru- enstein, “Streaming End-to-end Speech Recognition for Mobile Devices,” inProceedings of the 44th Interna- tional Conference on Acoustics...

work page 2019

[46] [46]

Alexa, Siri, Cortana, and More: An Intro- duction to V oice Assistants,

M. B. Hoy, “Alexa, Siri, Cortana, and More: An Intro- duction to V oice Assistants,”Medical Reference Services Quarterly, vol. 37, no. 1, pp. 81–88, Jan./Mar. 2018

work page 2018

[47] [47]

Transformer in Transformer,

K. Han, A. Xiao, E. Wu, J. Guo, C. Xu, and Y . Wang, “Transformer in Transformer,” inProceedings of the 35th International Conference on Neural Information Process- ing Systems (NIPS’21), M. Ranzato, A. Beygelzimer, Y . Dauphin, P. S. Lian, and J. Wortman Vaughan, Eds., vol. 34. Virtual: Curran Associates, Inc., Dec. 2021, pp. 15 908–15 919

work page 2021

[48] [48]

Attention Is All You Need,

A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. M. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention Is All You Need,” inProceedings of the 31st Conference on Neural Information Processing Systems (NIPS’17), I. Guyon, U. von Luxburg, S. Bengio, H. M. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, Eds. Long Beach, CA, USA: Curran Associat...

work page 2017

[49] [49]

Scaling Speech Technology to 1,000+ Languages,

V . Pratap, A. Tjandra, B. Shi, P. Tomasello, A. Babu, S. Kundu, A. Elkahky, Z. Ni, A. Vyas, M. Fazel-Zarandi, A. Baevski, Y . Adi, X. Zhang, W.-N. Hsu, A. Conneau, and M. Auli, “Scaling Speech Technology to 1,000+ Languages,”Journal of Machine Learning Research, vol. 25, pp. 1–52, Mar. 2024

work page 2024

[50] [50]

Librispeech: An ASR Corpus Based on Public Domain Audio Books,

V . Panayotov, G. Chen, D. Povey, and S. Khudanpur, “Librispeech: An ASR Corpus Based on Public Domain Audio Books,” inProceedings of the 40th International Conference on Acoustics, Speech, and Signal Processing (ICASSP’15), D. Gray and D. Cochran, Eds. Brisbane, Australia: IEEE, Apr. 2015, pp. 5206–5210

work page 2015

[51] [51]

Long-Form Speech Genera- tion with Spoken Language Models,

S. J. Park, J. Salazar, A. Jansen, K. Kinoshita, Y . M. Ro, and R. Skerry-Ryan, “Long-Form Speech Genera- tion with Spoken Language Models,” inProceedings of the 42nd International Conferene on Machine Learning (ICML’25). Vancouver, BC, Canada: PMLR, Jul. 2025, pp. 48 245–48 261

work page 2025

[52] [52]

MLS: A Large-Scale Multilingual Dataset for Speech Research,

V . Pratap, Q. Xu, A. Sriram, G. Synnaeve, and R. Col- lobert, “MLS: A Large-Scale Multilingual Dataset for Speech Research,” inProceedings of the 21st Annual Conference of the International Speech Communication Association (INTERSPEECH’20), B. Xu and T. F. Zheng, Eds. Shanghai, China: ISCA, Oct. 2020, pp. 2757–2761

work page 2020

[53] [53]

FLEURS: FEW-Shot Learning Evaluation of Universal Represen- tations of Speech,

A. Conneau, M. Ma, S. Khanuja, Y . Zhang, V . Axelrod, S. Dalmia, J. Riesa, C. Rivera, and A. Bapna, “FLEURS: FEW-Shot Learning Evaluation of Universal Represen- tations of Speech,” inProceedings of IEEE Spoken Language Technology Workshop (SLT’22), S. Watanabe, S. Khudanpur, J. Hirschberg, M. Saraclar, and M. Del- croix, Eds. Doha, Qatar: IEEE, Jan. 2023...

work page 2023

[54] [54]

Word Error Rate Estimation for Speech Recognition: e-WER,

A. Ali and S. Renals, “Word Error Rate Estimation for Speech Recognition: e-WER,” inProceedings of the 56th Annual Meeting of the Association for Computational Linguistics (ACL’18), I. Gurevych and Y . Miyao, Eds. Melbourne, Australia: ACM, Jul. 2018, pp. 29–24

work page 2018

[55] [55]

Testing the Correlation of Word Error Rate and Perplexity,

D. Klakow and J. Peters, “Testing the Correlation of Word Error Rate and Perplexity,”Speech Communica- tion, vol. 38, no. 1–2, pp. 19–28, Sep. 2002

work page 2002

[56] [56]

Is Word Error Rate a Good Indicator for Spoken Language Under- standing Accuracy,

Y .-Y . Wang, A. Acero, and C. Chelba, “Is Word Error Rate a Good Indicator for Spoken Language Under- standing Accuracy,” inProceedings of the 7th Workshop on Automatic Speech Recognition and Understanding (ASRU’03). St. Thomas, U.S. Virgin Islands: IEEE, Nov./Dec. 2003, pp. 577–582

work page 2003

[57] [57]

What Size Test Set Gives Good Error Rate Estimates?

I. Guyon, J. Makhoul, R. Schwartz, and V . Vapnik, “What Size Test Set Gives Good Error Rate Estimates?” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, no. 1, pp. 52–64, Jan. 1998

work page 1998

[58] [58]

Character Error Rate Estimation for Automatic Speech Recognition of Short Utterances,

C. Park, H. Kang, and T. Hain, “Character Error Rate Estimation for Automatic Speech Recognition of Short Utterances,” inProceedings of the 32nd European Signal Processing Conference (EUSIPCO’24). Lyon, France: IEEE, Aug. 2024, pp. 131–135

work page 2024

[59] [59]

Improving Character Error Rate is Not Equal to Having Clean Speech: Speech Enhancement for ASR Systems with Black-Box Acoustic Models,

R. Sawata, Y . Kashiwagi, and S. Takahashi, “Improving Character Error Rate is Not Equal to Having Clean Speech: Speech Enhancement for ASR Systems with Black-Box Acoustic Models,” inProceedings of the 47th International Conference on Acoustics, Speech and Signal Processing (ICASSP’22), W.-S. Gan and K.-K. Ma, Eds. Singapore, Singapore: IEEE, May 2022, pp...

work page 2022

[60] [60]

Semantic Word Error Rate for Sentence Similarity,

C. Spiccia, A. Augello, G. Pilato, and G. Vassallo, “Semantic Word Error Rate for Sentence Similarity,” inProceedings of the 10th International Conference on Semantic Computing (ICSC’16), T. Li, A. Scherp, D. Ostrowski, and W. Wang, Eds. Laguna Hill, CA, USA: IEEE, Feb. 2016, pp. 266–269

work page 2016

[61] [61]

“This Sen- tence Is Wrong

S. Raybaud, D. Langlois, and K. Smaïli, ““This Sen- tence Is Wrong.” Detecting Errors in Machine-Translated Sentences,”Machine Translation, vol. 25, pp. 1–34, Jul. 2011

work page 2011

[62] [62]

A Normalized Levenshtein Distance Metric,

Y . Li and B. Liu, “A Normalized Levenshtein Distance Metric,”IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 29, no. 6, pp. 1091–1095, Jun. 2007

work page 2007

[63] [63]

Measuring Dialect Pronunciation Dif- ferences Using Levenshtein Distance,

W. J. Heeringa, “Measuring Dialect Pronunciation Dif- ferences Using Levenshtein Distance,” PhD Thesis, Uni- versity of Groningen, Groningen, Netherlands, 2004

work page 2004

[64] [64]

Timestamp-Aligning and Keyword- Biasing End-to-End ASR Front-End for a KWS System,

G.-X. Shi, W.-Q. Zhang, G.-B. Wang, J. Zhao, S.-Z. Chai, and Z.-Y . Zhao, “Timestamp-Aligning and Keyword- Biasing End-to-End ASR Front-End for a KWS System,” EURASIP Journal on Audio, Speech and Music Process- ing, pp. 21:1–21:14, Jul. 2021. 14

work page 2021

[65] [65]

Temporal Alignment Networks for Long-Term Video,

T. Han, W. Xie, and A. Zisserman, “Temporal Alignment Networks for Long-Term Video,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR’22), K. Dana, G. Hua, S. Roth, D. Samaras, and R. Singh, Eds. New Orleans, LA, USA: IEEE, Jun. 2022, pp. 2896–2906

work page 2022

[66] J. Li, Y. Meng, Z. Wu, H. Meng, Q. Tian, Y. Wang, and Y. Wang, “Neufa: Neural Network Based End-to-End Forced Alignment with Bidirectional Attention Mechanism,” in Proceedings of the 47th International Conference on Acoustics, Speech and Signal Processing (ICASSP’22), W.-S. Gan and K.-K. Ma, Eds. Singapore, Singapore: IEEE, May 2022, pp. 8007–8011.

[67] X. Chu, Y. He, K. Chakrabarti, and K. Ganjam, “TEGRA: Table Extraction by Global Record Alignment,” in Proceedings of the 41st International Conference on Management of Data (SIGMOD’15), S. B. Davison and Z. Ives, Eds. Melbourne, Australia: ACM, May/Jun. 2015, pp. 1713–1728.

[68] R. Wang, Z. Xu, and F. X. Lin, “Efficient Whisper on Streaming Speech,” arXiv preprint arXiv:2412.11272, 2024.

[69] M. Bain, J. Huh, T. Han, and A. Zisserman, “WhisperX: Time-Accurate Speech Transcription of Long-Form Audio,” in Proceedings of the 24th INTERSPEECH Conference (INTERSPEECH’23), S. King, K. Knill, and P. Wagner, Eds. Dublin, Ireland: ISCA, Aug. 2023, pp. 4489–4493.

[70] G. Navarro, “A Guided Tour to Approximate String Matching,” Computing Surveys, vol. 33, no. 1, pp. 31–88, Mar. 2001.

[71] M. M. Deza and E. Deza, Encyclopedia of Distances. Springer, 2009.

[72] P. H. Sellers, “On the Theory and Computation of Evolutionary Distances,” Journal on Applied Mathematics, vol. 26, no. 4, pp. 787–793, 1974.

[73] V. I. Levenshtein, “Binary Codes Capable of Correcting Deletions, Insertions and Reversals,” Soviet Physics Doklady, vol. 10, no. 8, pp. 707–710, Feb. 1966.

[74] G. Dimauro, D. Caivano, V. Bevilacqua, F. Girardi, and V. Napoletano, “VoxTester, Software for Digital Evaluation of Speech Changes in Parkinson Disease,” in Proceedings of the 11th International Symposium on Medical Measurements and Applications (MeMeA’16). Benevento, Italy: IEEE, May 2016, pp. 1–6.

[75] G. Dimauro, V. Di Nicola, V. Bevilacqua, D. Caivano, and F. Girardi, “Assessment of Speech Intelligibility in Parkinson’s Disease Using a Speech-To-Text System,” IEEE Access, vol. 5, pp. 22 199–22 208, Nov. 2017.

[76] G. Dimauro and F. Girardi, “Italian Parkinson’s Voice and Speech,” 2019.

[77] C. Wohlin, P. Runeson, M. Höst, M. C. Ohlsson, B. Regnell, and A. Wesslén, Experimentation in Software Engineering. Springer, 2012.

Federico Bruzzone is currently a Ph.D. student in Computer Science at Università degli Studi di Milano, Italy. He was born in 2000 and since he was a child he has been passionate about computer science and music. He got h...