WhisperRT -- Turning Whisper into a Causal Streaming Model

Bhiksha Raj; Joseph Keshet; Tomer Krichli

arXiv: 2508.12301 · v2 · submitted 2025-08-17 · cs.CL · cs.LG · cs.SD · eess.AS

Pith reviewed 2026-05-18 22:13 UTC · model grok-4.3

keywords: streaming ASR, causal encoder, low-latency transcription, Whisper adaptation, real-time speech recognition, encoder-decoder alignment, online ASR, fine-tuned streaming model

The pith

Whisper can be turned into a causal streaming ASR model by making its encoder process audio chunks incrementally and fine-tuning decoder alignment for token timing.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tries to adapt the Whisper model, built for complete offline transcription, so it can transcribe speech as it arrives in real time. The encoder is changed to handle audio causally without looking ahead, and the decoder is adjusted to emit tokens only when enough current audio context exists through explicit synchronization of frames and outputs. Fine-tuning corrects the resulting alignment to keep latency low. If this works, it would give accurate live transcription at lower computational cost than many current streaming systems. Readers would care because it opens high-accuracy models to immediate-use cases like live captions without needing entirely new architectures.

Core claim

The central claim is that a transformer encoder-decoder like Whisper can be converted to a low-latency streaming model: the encoder is made causal to process audio incrementally, the decoder conditions on partial encoder states to generate tokens aligned with available context, explicit synchronization between encoded frames and token emissions is enforced, and fine-tuning of the alignment mechanism is performed to offset inherent latency. An updated inference procedure then supports greedy and beam-search decoding shown to be locally optimal. Experiments on chunk sizes under 300 milliseconds indicate the fine-tuned version outperforms existing non-fine-tuned streaming methods in most cases.

What carries the argument

Causal encoder combined with decoder conditioning on partial states and explicit frame-token synchronization, refined by alignment fine-tuning.

If this is right

The fine-tuned model outperforms non-fine-tuned streaming approaches at most low-latency chunk sizes under 300 milliseconds.
The method operates at lower complexity than the compared streaming baselines.
Greedy and beam-search decoding become available and locally optimal under the updated inference.
Released training code, inference code, and fine-tuned models allow direct reuse and extension.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same causal-encoder and synchronization steps could be applied to other large offline encoder-decoder ASR models beyond Whisper.
Live applications such as real-time captioning or voice interfaces could adopt the approach to reduce end-to-end delay.
Further tests on multilingual or noisy data would clarify whether the reported gains require additional per-domain fine-tuning.

Load-bearing premise

Fine-tuning the encoder-decoder alignment will create a stable low-latency system whose gains persist across different acoustic conditions and languages without new errors that cancel the benefits.

What would settle it

A direct comparison in which the fine-tuned model shows higher word error rates or greater instability than non-fine-tuned baselines when evaluated on acoustic conditions or languages outside the fine-tuning data.
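
The comparison described above rests on word error rate. As a minimal sketch, and not the paper's own ARWER metric, the standard WER can be computed as the word-level edit distance divided by the reference length. The function name `word_error_rate` and the example sentences are illustrative assumptions, not taken from the paper.

```python
def word_error_rate(reference, hypothesis):
    """Standard WER: (substitutions + deletions + insertions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words matches an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: one substitution ("box") and one deletion ("jumps").
wer = word_error_rate("the quick brown fox jumps", "the quick brown box")
print(wer)
```

Running the same function on in-domain and out-of-domain test sets, for both the fine-tuned model and a baseline, would give the kind of side-by-side numbers the test calls for.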

Figures

Figures reproduced from arXiv: 2508.12301 by Bhiksha Raj, Joseph Keshet, Tomer Krichli.

**Figure 1.** Encoder causal mask example, τ = 15, τ0 = 30 given k = 10 chunks. With this mask, the model waits 600 msec for the first buffer before feeding the input to the encoder; after that, input is fed every 300 msec. Purple regions contain zeros while white regions contain −∞. The index (35,50) is marked with a green point.
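
The mask in Figure 1 can be sketched in a few lines. This is a hedged illustration rather than the authors' code: the function name `blockwise_causal_mask`, the row/column convention (row = query frame, column = key frame), and the total frame count are assumptions. The 20 msec frame duration follows from the caption itself (30 frames for 600 msec).

```python
NEG_INF = float("-inf")


def blockwise_causal_mask(num_frames, tau0, tau):
    """Additive attention mask: 0.0 where attention is allowed, -inf where it is blocked.

    Frames 0..tau0-1 form the first buffer; every later block holds tau frames.
    A frame may attend to any frame up to the end of its own block, never beyond.
    """
    mask = []
    for i in range(num_frames):
        if i < tau0:
            block_end = tau0
        else:
            block_end = tau0 + ((i - tau0) // tau + 1) * tau
        block_end = min(block_end, num_frames)
        mask.append([0.0 if j < block_end else NEG_INF for j in range(num_frames)])
    return mask


FRAME_MS = 20  # implied by the caption: 30 frames <-> 600 msec
tau0, tau, k = 30, 15, 10
# Assumption: the first buffer counts as chunk 1, so k chunks span tau0 + (k - 1) * tau frames.
num_frames = tau0 + (k - 1) * tau
mask = blockwise_causal_mask(num_frames, tau0, tau)

# Frame 35 sits in the block covering frames 30..44, so frame 44 is visible and frame 50 is not.
print(mask[35][44], mask[35][50])
```

Under this convention the marked index (35,50) falls in a blocked region, because frame 50 belongs to a chunk that has not arrived when frame 35 is encoded.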

**Figure 2.** The inference process, using a chunk size of…

**Figure 3.** Token distribution for the Whisper model (left) and CarelessWhisper (right) for third token over time, conditioned on…

**Figure 4.** ARWER vs. chunk size per method, on large-v2 models. The left subfigure presents the results on LibriSpeech test-clean; the right subfigure presents the results on LibriSpeech test-other. Adjacent text from the paper: "Since RWER does not capture latency or alignment, we introduce an additional metric. b) Aligned-Relative Word Error Rate (ARWER): The ARWER is defined as: ARWER(y, Ŷ) = Σ_τ I(y_τ, ŷ_τ) + D(y_τ, ŷ…"

**Figure 5.** ARWER vs. beam size on our method when using…

**Figure 7.** Fine-tuning process illustration. The above example demonstrates an encoder that uses a chunk size of…

Original abstract

Automatic Speech Recognition (ASR) has seen remarkable progress, with models like OpenAI Whisper and NVIDIA Canary achieving state-of-the-art (SOTA) performance in offline transcription. However, these models are not designed for streaming (online or real-time) transcription, due to limitations in their architecture and training methodology. We propose a method to turn the transformer encoder-decoder model into a low-latency streaming model. The encoder is made causal to process audio incrementally, while the decoder conditions on partial encoder states to generate tokens aligned with the available temporal context. This requires explicit synchronization between encoded input frames and token emissions. Since tokens are produced only after sufficient acoustic evidence is observed, an inherent latency arises, necessitating fine-tuning of the encoder-decoder alignment mechanism. We propose an updated inference mechanism that utilizes the fine-tuned causal encoder and decoder to yield greedy and beam-search decoding, and is shown to be locally optimal. Experiments on low-latency chunk sizes (less than 300 msec) show that our fine-tuned model outperforms existing non-fine-tuned streaming approaches in most cases, while using a lower complexity. We release our training and inference code, along with the fine-tuned models, to support further research and development in streaming ASR.
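
The idea that tokens are emitted only once enough audio has been observed can be illustrated with a toy schedule. This is a sketch under stated assumptions, not the paper's inference procedure: the word end times are hypothetical, the helper `words_ready` is invented for illustration, and the 600 msec first buffer with 300 msec chunks mirrors the Figure 1 example. Times are kept as integer milliseconds to avoid floating-point drift.

```python
def words_ready(alignment_ms, elapsed_ms):
    """Return the words whose (hypothetical) end time lies within the audio seen so far."""
    return [word for word, end_ms in alignment_ms if end_ms <= elapsed_ms]


# Hypothetical alignment: (word, end time in milliseconds).
alignment_ms = [("The", 250), ("quick", 510), ("brown", 900), ("fox", 1220), ("jumps", 1500)]
FIRST_BUFFER_MS = 600  # wait before the first encoder call
CHUNK_MS = 300         # interval between later encoder calls

schedule = []  # list of (elapsed audio in ms, words newly emitted at that step)
emitted = 0
elapsed_ms = FIRST_BUFFER_MS
while emitted < len(alignment_ms):
    ready = words_ready(alignment_ms, elapsed_ms)
    schedule.append((elapsed_ms, ready[emitted:]))
    emitted = len(ready)
    elapsed_ms += CHUNK_MS

for step_ms, new_words in schedule:
    print(step_ms, new_words)
```

A step can emit nothing when no word has finished inside the newest chunk, which is the inherent latency the abstract refers to.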

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This adapts Whisper for streaming via causal encoder and sync fine-tuning with reported short-chunk gains, but the experiments do not isolate whether the causal changes or the fine-tuning drive the results.

The main thing here is a practical recipe for turning Whisper into a low-latency streaming model. They make the encoder causal, have the decoder condition on partial encoder states, add explicit frame-token synchronization, and fine-tune the alignment to handle the resulting latency. The fine-tuned version is said to beat some existing streaming baselines on chunks under 300 ms while using lower complexity, and they release code plus models.

Referee Report

2 major / 2 minor

Summary. The paper proposes a method to convert the non-causal Whisper encoder-decoder ASR model into a low-latency causal streaming system. The encoder is made causal for incremental chunk processing, explicit frame-token synchronization is imposed so the decoder conditions only on available partial encoder states, and the alignment is fine-tuned to mitigate inherent latency. An updated inference procedure supporting greedy and beam-search decoding is presented and claimed to be locally optimal. The central result is that the resulting fine-tuned model outperforms existing non-fine-tuned streaming baselines on chunk sizes below 300 ms while using lower complexity; code and models are released.

Significance. If the reported gains can be attributed to the causal adaptation and synchronization mechanism rather than fine-tuning alone, the approach would provide a practical route for adapting strong offline models such as Whisper to real-time ASR with modest added latency and complexity. The public release of training/inference code and fine-tuned checkpoints is a clear strength that supports reproducibility and follow-on work in streaming ASR.

major comments (2)

[Experiments] Experiments section: The headline claim that the fine-tuned causal model outperforms existing non-fine-tuned streaming approaches at <300 ms chunks is load-bearing, yet the comparison does not report whether the baselines received equivalent fine-tuning on the same alignment data or training distribution. Without such controls or an ablation isolating the synchronization mechanism from the fine-tuning step, it remains unclear whether performance deltas arise from the proposed causal construction or simply from fine-tuning itself.
[Method / Experiments] The abstract and method description state that fine-tuning is required to handle the latency induced by synchronization, but no quantitative analysis (e.g., latency-accuracy trade-off curves or alignment error metrics before/after fine-tuning) is referenced to show that the fine-tuned alignment remains stable across acoustic conditions or languages.

minor comments (2)

[Experiments] The claim of 'lower complexity' is stated without accompanying FLOPs, parameter counts, or runtime tables comparing the proposed model to the baselines.
[Experiments] Dataset splits, exact chunk sizes tested, number of runs, and error bars or statistical tests are not mentioned in the abstract and should be added to the experimental section for verifiability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our work converting Whisper to a causal streaming model. We address the major comments below and will incorporate revisions to strengthen the experimental controls and analysis as outlined.

Referee: [Experiments] Experiments section: The headline claim that the fine-tuned causal model outperforms existing non-fine-tuned streaming approaches at <300 ms chunks is load-bearing, yet the comparison does not report whether the baselines received equivalent fine-tuning on the same alignment data or training distribution. Without such controls or an ablation isolating the synchronization mechanism from the fine-tuning step, it remains unclear whether performance deltas arise from the proposed causal construction or simply from fine-tuning itself.

Authors: We agree that the current comparison is between our fine-tuned causal model and published non-fine-tuned streaming baselines, which may not isolate the contributions fully. The baselines are existing methods without our encoder causality and decoder synchronization mechanism, and our headline result is that the full proposed pipeline (causality + synchronization + fine-tuning) outperforms them at low latency. To address the concern directly, we will add an ablation in the revised manuscript applying equivalent fine-tuning to the baseline models on the same alignment data and training distribution where feasible, allowing clearer isolation of the synchronization mechanism's effect.

Revision: yes

Referee: [Method / Experiments] The abstract and method description state that fine-tuning is required to handle the latency induced by synchronization, but no quantitative analysis (e.g., latency-accuracy trade-off curves or alignment error metrics before/after fine-tuning) is referenced to show that the fine-tuned alignment remains stable across acoustic conditions or languages.

Authors: We acknowledge the absence of explicit quantitative analysis on the fine-tuning step in the current manuscript. In the revision, we will add latency-accuracy trade-off curves for the model before and after fine-tuning across chunk sizes. We will also include alignment error metrics (e.g., average token emission latency and WER deltas) evaluated before/after fine-tuning on multiple languages and acoustic conditions from our test sets to demonstrate stability.

Revision: yes

Circularity Check

0 steps flagged

No circularity; empirical adaptation and comparison are self-contained

The paper describes a practical engineering adaptation: rendering the Whisper encoder causal, enforcing explicit frame-token synchronization, fine-tuning the resulting alignment for latency, and updating inference for greedy/beam search. The headline results consist of direct experimental comparisons on low-latency chunks against existing non-fine-tuned streaming baselines. No derivation chain, equation, or first-principles claim reduces to its own inputs by construction; there are no fitted parameters renamed as predictions, no self-citation load-bearing uniqueness theorems, and no ansatz smuggled through prior work. The method is validated against external benchmarks rather than tautologically defined by its own outputs, so the reported performance deltas stand as independent empirical evidence.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 0 invented entities

The approach rests on standard transformer assumptions that causality can be enforced by masking and that fine-tuning can realign encoder-decoder timing without destroying pretrained knowledge. No new physical entities or ad-hoc constants are introduced beyond typical training hyperparameters.

free parameters (1)

chunk size
Audio chunk duration (under 300 ms) chosen to define low-latency regime; directly affects latency-accuracy trade-off.
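
To make the latency side of this trade-off concrete, the sketch below computes how long a finished word waits for the next chunk boundary. It is an illustration under assumptions: it ignores model compute time, the helper names are invented, the word end time of 1220 msec is hypothetical, and the first buffer is taken to be twice the chunk size, mirroring τ0 = 2τ in the Figure 1 example.

```python
def next_boundary_ms(event_ms, first_buffer_ms, chunk_ms):
    """Earliest time at which the encoder has received audio up to event_ms."""
    if event_ms <= first_buffer_ms:
        return first_buffer_ms
    # Ceiling division on integers: number of extra chunks needed after the first buffer.
    steps = -(-(event_ms - first_buffer_ms) // chunk_ms)
    return first_buffer_ms + steps * chunk_ms


def waiting_delay_ms(event_ms, first_buffer_ms, chunk_ms):
    """Delay between a word ending and the chunk boundary that first covers it."""
    return next_boundary_ms(event_ms, first_buffer_ms, chunk_ms) - event_ms


for chunk_ms in (100, 200, 300):
    first_buffer_ms = 2 * chunk_ms  # assumption mirroring tau0 = 2 * tau
    print(chunk_ms, waiting_delay_ms(1220, first_buffer_ms, chunk_ms))
```

Smaller chunks shorten this wait but give the encoder less context per update, which is the accuracy side of the trade-off.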

axioms (2)

domain assumption: A transformer encoder can be made strictly causal by appropriate attention masking while retaining useful representations.
Invoked when converting the Whisper encoder to process audio incrementally.
domain assumption: Fine-tuning on aligned partial encoder states will reduce the inherent token-emission latency without catastrophic accuracy loss.
Central premise that justifies the fine-tuning step described in the abstract.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: We modify the original non-causal encoder to operate causally and fine-tune both the encoder and decoder using Low-Rank Adaptation (LoRA) on a weakly aligned dataset.

IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: Theorem 2. ... [Z̃_T]_t = [Z̃_{kτ}]_t

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

[1] [1]

wav2vec 2.0: A framework for self-supervised learning of speech representations, 2020

Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, and Michael Auli. wav2vec 2.0: A framework for self-supervised learning of speech representations, 2020

work page 2020

[2] [2]

Streaming decoder-only automatic speech recognition with discrete speech units: A pilot study, 2024

Peikun Chen, Sining Sun, Changhao Shan, Qing Yang, and Lei Xie. Streaming decoder-only automatic speech recognition with discrete speech units: A pilot study, 2024

work page 2024

[3] [3]

Developing real-time streaming transformer transducer for speech recognition on large-scale dataset

Xie Chen, Yu Wu, Zhenghao Wang, Shujie Liu, and Jinyu Li. Developing real-time streaming transformer transducer for speech recognition on large-scale dataset. In ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) , pages 5904–5908. IEEE, 2021. 13

work page 2021

[4] Kwanghee Choi, Masao Someki, Emma Strubell, and Shinji Watanabe. On-device streaming discrete speech units. arXiv preprint arXiv:2506.01845, 2025.

[5] Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. Flashattention: Fast and memory-efficient exact attention with io-awareness, 2022.

[6] John S. Garofolo, Lori F. Lamel, William M. Fisher, Jonathan G. Fiscus, David S. Pallett, and Nancy L. Dahlgren. TIMIT acoustic-phonetic continuous speech corpus, 1993. LDC93S1.

[7] Alex Graves. Sequence transduction with recurrent neural networks, 2012.

[8] Alex Graves, Santiago Fernández, Faustino Gomez, and Jürgen Schmidhuber. Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In Proceedings of the 23rd international conference on Machine learning, pages 369–376, 2006.

[9] Yanzhang He, Tara N Sainath, Rohit Prabhavalkar, Ian McGraw, Raziel Alvarez, Ding Zhao, David Rybach, Anjuli Kannan, Yonghui Wu, Ruoming Pang, et al. Streaming end-to-end speech recognition for mobile devices. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6381–6385. IEEE, 2019.

[10] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.

[11] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.

[12] Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, and Abdelrahman Mohamed. Hubert: Self-supervised speech representation learning by masked prediction of hidden units, 2021.

[13] Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models, 2021.

[14] Ke Hu, Krishna Puvvada, Elena Rastorgueva, Zhehuai Chen, He Huang, Shuoyang Ding, Kunal Dhawan, Hainan Xu, Jagadeesh Balam, and Boris Ginsburg. Word level timestamp generation for automatic speech recognition and translation, 2025.

[15] Junteng Jia, Gil Keren, Wei Zhou, Egor Lakomkin, Xiaohui Zhang, Chunyang Wu, Frank Seide, Jay Mahadeokar, and Ozlem Kalinli. Efficient streaming llm for speech recognition. arXiv preprint arXiv:2410.03752, 2024.

[16] Shashi Kumar, Srikanth Madikeri, Juan Zuluaga-Gomez, Esaú Villatoro-Tello, Iuliia Thorbecke, Petr Motlicek, Manjunath K E, and Aravind Ganapathiraju. Xlsr-transducer: Streaming asr for self-supervised pretrained models, 2024.

[17] Gakuto Kurata and George Saon. Knowledge distillation from offline to streaming rnn transducer for end-to-end speech recognition. In Interspeech, pages 2117–2121, 2020.

[18] Jinyu Li, Rui Zhao, Jui-Ting Huang, and Yifan Gong. Learning small-size dnn with output-distribution-based criteria. In Interspeech, pages 1910–1914, 2014.

[19] Shen Li, Yanli Zhao, Rohan Varma, Omkar Salpekar, Pieter Noordhuis, Teng Li, Adam Paszke, Jeff Smith, Brian Vaughan, Pritam Damania, and Soumith Chintala. Pytorch distributed: Experiences on accelerating data parallel training, 2020.

[20] Danni Liu, Gerasimos Spanakis, and Jan Niehues. Low-latency sequence-to-sequence speech recognition and translation by partial hypothesis selection. arXiv preprint arXiv:2005.11185, 2020.

[21] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization, 2019.

[22] Liang Lu, Michelle Guo, and Steve Renals. Knowledge distillation for small-footprint highway networks. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4820–4824. IEEE, 2017.

[23] Dominik Macháček, Raj Dabre, and Ondřej Bojar. Turning whisper into real-time transcription system. arXiv preprint arXiv:2307.14743, 2023.

[24] Michael McAuliffe, Michaela Socolof, Sarah Mihuc, Michael Wagner, and Morgan Sonderegger. Montreal forced aligner: Trainable text-speech alignment using kaldi. In Interspeech, volume 2017, pages 498–502, 2017.

[25] Niko Moritz, Takaaki Hori, and Jonathan Le Roux. Streaming automatic speech recognition with the transformer model. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6074–6078. IEEE, 2020.

[26] Niko Moritz, Takaaki Hori, and Jonathan Le Roux. Triggered attention for end-to-end speech recognition. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5666–5670. IEEE, 2019.

[27] Niko Moritz, Takaaki Hori, and Jonathan Le Roux. Dual causal/non-causal self-attention for streaming end-to-end speech recognition. arXiv preprint arXiv:2107.01269, 2021.

[28] Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur. Librispeech: an asr corpus based on public domain audio books. In 2015 IEEE international conference on acoustics, speech and signal processing (ICASSP), pages 5206–5210. IEEE, 2015.

[29] Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. Robust speech recognition via large-scale weak supervision, 2022.

[30] Kanishka Rao, Haşim Sak, and Rohit Prabhavalkar. Exploring architectures, data and units for streaming end-to-end speech recognition with rnn-transducer. In 2017 IEEE automatic speech recognition and understanding workshop (ASRU), pages 193–199. IEEE, 2017.

[31] Hiroaki Sakoe. Dynamic-programming approach to continuous speech recognition. In 1971 Proc. the International Congress of Acoustics, Budapest, 1971.

[32] Hiroaki Sakoe and Seibi Chiba. Dynamic programming algorithm optimization for spoken word recognition. IEEE transactions on acoustics, speech, and signal processing, 26(1):43–49, 2003.

[33] Steffen Schneider, Alexei Baevski, Ronan Collobert, and Michael Auli. wav2vec: Unsupervised pre-training for speech recognition, 2019.

[34] Mike Schuster and Kuldip K Paliwal. Bidirectional recurrent neural networks. IEEE transactions on Signal Processing, 45(11):2673–2681, 1997.

[35] Yangyang Shi, Yongqiang Wang, Chunyang Wu, Ching-Feng Yeh, Julian Chan, Frank Zhang, Duc Le, and Mike Seltzer. Emformer: Efficient memory transformer based acoustic model for low latency streaming speech recognition. In ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6783–6787. IEEE, 2021.

[36] Anshuman Tripathi, Jaeyoung Kim, Qian Zhang, Han Lu, and Hasim Sak. Transformer transducer: One model unifying streaming and non-streaming speech recognition. arXiv preprint arXiv:2010.03192, 2020.

[37] Emiru Tsunoo, Hayato Futami, Yosuke Kashiwagi, Siddhant Arora, and Shinji Watanabe. Decoder-only architecture for streaming end-to-end speech recognition, 2024.

[38] Emiru Tsunoo, Yosuke Kashiwagi, and Shinji Watanabe. Streaming transformer asr with blockwise synchronous beam search. In 2021 IEEE Spoken Language Technology Workshop (SLT), pages 22–29. IEEE, 2021.

[39] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need, 2023.

[40] Chengyi Wang, Yu Wu, Shujie Liu, Jinyu Li, Liang Lu, Guoli Ye, and Ming Zhou. Low latency end-to-end streaming speech recognition with a scout network. arXiv preprint arXiv:2003.10369, 2020.

[41] Haoyu Wang, Guoqiang Hu, Guodong Lin, Wei-Qiang Zhang, and Jian Li. Simul-whisper: Attention-guided streaming whisper with truncation detection. arXiv preprint arXiv:2406.10052, 2024.

[42] Rongxiang Wang, Zhiming Xu, and Felix Xiaozhu Lin. Efficient whisper on streaming speech. arXiv preprint arXiv:2412.11272, 2024.

[43] Chunyang Wu, Yongqiang Wang, Yangyang Shi, Ching-Feng Yeh, and Frank Zhang. Streaming transformer-based acoustic models using self-attention with augmented memory. arXiv preprint arXiv:2005.08042, 2020.

[44] Ching-Feng Yeh, Jay Mahadeokar, Kaustubh Kalgaonkar, Yongqiang Wang, Duc Le, Mahaveer Jain, Kjell Schubert, Christian Fuegen, and Michael L Seltzer. Transformer-transducer: End-to-end speech recognition with self-attention. arXiv preprint arXiv:1910.12977, 2019.

[45] Qian Zhang, Han Lu, Hasim Sak, Anshuman Tripathi, Erik McDermott, Stephen Koo, and Shankar Kumar. Transformer transducer: A streamable speech recognition model with transformer encoders and rnn-t loss. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 7829–7833. IEEE, 2020.

[46] Haoran Zhou, Xingchen Song, Brendan Fahy, Qiaochu Song, Binbin Zhang, Zhendong Peng, Anshul Wadhawan, Denglin Jiang, Apurv Verma, Vinay Ramesh, Srivas Prasad, and Michele M. Franceschini. Adapting whisper for streaming speech recognition via two-pass decoding, 2025.

APPENDIX A THEOREMS PROOFS

Theorem 1. Let $k\tau < T$, where $k$ is the frame index and $\tau$ is the...
