WhisperPipe: A Resource-Efficient Streaming Architecture for Real-Time Automatic Speech Recognition
Pith reviewed 2026-05-07 16:28 UTC · model grok-4.3
The pith
WhisperPipe streams the Whisper ASR model at 89 ms median latency with 48 percent less peak GPU memory and near-offline accuracy.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
WhisperPipe is a streaming architecture for the Whisper model that achieves bounded memory consumption through three components: a hybrid VAD pipeline that merges Silero VAD with energy-based filtering to cut false activations by 34 percent, a dynamic buffering scheme using overlapping context windows to avoid boundary information loss, and an adaptive processing strategy that trades latency against accuracy according to speech characteristics. On 2.5 hours of diverse audio the system records a median end-to-end latency of 89 ms (90th percentile 142 ms), 48 percent lower peak GPU memory, 80.9 percent lower average GPU utilization, word error rate within 2 percent of offline Whisper, and zero memory growth over 150 minutes of continuous operation.
What carries the argument
WhisperPipe's hybrid VAD plus overlapping dynamic buffers with adaptive processing, which together allow segment-by-segment transcription without unbounded context accumulation or boundary errors.
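To make that machinery concrete, below is a minimal sketch of how a hybrid VAD gate and overlapping context windows can keep buffer growth bounded during streaming. This is not the paper's implementation: the chunk length, overlap, energy floor, and VAD threshold are assumed values, and `vad_prob` and `transcribe` are hypothetical stand-ins for Silero VAD and a Whisper forward pass.

```python
# Illustrative sketch only, not the WhisperPipe implementation.
# Assumed: 16 kHz mono float32 audio, a vad_prob() stand-in for Silero VAD,
# and a transcribe() stand-in for a Whisper forward pass.
import numpy as np

SAMPLE_RATE = 16_000
CHUNK = SAMPLE_RATE           # process roughly 1 s of speech at a time (assumed)
OVERLAP = SAMPLE_RATE // 5    # carry 200 ms of left context across chunks (assumed)
ENERGY_FLOOR = 1e-4           # cheap energy gate threshold (assumed)
VAD_THRESHOLD = 0.5           # neural VAD speech-probability cutoff (assumed)

def is_speech(frame: np.ndarray, vad_prob) -> bool:
    """Hybrid gate: reject by energy first, then consult the neural VAD."""
    if float(np.mean(frame ** 2)) < ENERGY_FLOOR:
        return False
    return vad_prob(frame) >= VAD_THRESHOLD

def stream(frames, vad_prob, transcribe):
    """Consume short audio frames; yield text chunk by chunk with a bounded buffer."""
    buf = np.zeros(0, dtype=np.float32)
    for frame in frames:
        if not is_speech(frame, vad_prob):
            continue                          # silence never enters the buffer
        buf = np.concatenate([buf, frame])
        if len(buf) >= CHUNK:
            yield transcribe(buf)             # transcribe the current window
            buf = buf[-OVERLAP:]              # keep only the overlap as left context
```

The property the sketch illustrates is that only the last OVERLAP samples survive each emission, so memory stays bounded regardless of session length; whether that overlap is enough to avoid boundary errors is exactly the load-bearing question.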
If this is right
- Real-time ASR becomes practical on edge devices and resource-constrained hardware.
- Transcription accuracy remains within 2 percent of offline batch processing.
- Systems can operate continuously for hours without memory growth.
- Latency stays below 150 ms for 90 percent of utterances.
- Modular design supports deployment from mobile to cloud environments.
Where Pith is reading between the lines
- The same buffering and VAD pattern could be tested on other large transformer ASR models to check transferability.
- Lower GPU utilization would reduce operating costs for cloud transcription services.
- Live captioning in mobile or embedded applications becomes more feasible if the latency and memory gains hold across languages.
- Further experiments on accented or multi-speaker data would reveal whether the 2 percent WER margin scales.
Load-bearing premise
The hybrid VAD and overlapping buffers must avoid losing critical speech information at segment boundaries, and the adaptive processing rule must not push transcription errors beyond the reported 2 percent WER tolerance on varied speech.
What would settle it
Measure word error rate and memory usage on a held-out set of rapid speaker turns or noisy audio; if WER rises above 4 percent or memory grows after 150 minutes, the central performance claim fails.
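A minimal sketch of how that check could be scripted follows, assuming the `jiwer` package for word error rate; the 4 percent bound and 150-minute horizon come from the criterion above, while `held_out_pairs` and `memory_samples_mb` are hypothetical inputs a test harness would supply.

```python
# Sketch of the falsification test described above; inputs are placeholders.
import jiwer  # pip install jiwer -- standard word-error-rate computation

WER_LIMIT = 0.04          # claim fails if held-out WER exceeds 4 percent
SESSION_MINUTES = 150     # claim fails if memory still grows past this point

def check_claims(held_out_pairs, memory_samples_mb):
    """held_out_pairs: iterable of (reference_text, streaming_hypothesis).
    memory_samples_mb: per-minute peak GPU memory readings over a long session."""
    refs, hyps = zip(*held_out_pairs)
    wer = jiwer.wer(list(refs), list(hyps))

    # Memory is treated as stable if readings after the 150-minute mark
    # never exceed the maximum seen before it.
    before = max(memory_samples_mb[:SESSION_MINUTES], default=0.0)
    after = max(memory_samples_mb[SESSION_MINUTES:], default=0.0)
    memory_stable = after <= before

    return {"wer": wer, "wer_ok": wer <= WER_LIMIT, "memory_stable": memory_stable}
```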
Original abstract
Real-time automatic speech recognition (ASR) systems face a fundamental trade-off between transcription accuracy and computational efficiency, particularly when deploying large-scale transformer models like Whisper. Existing streaming approaches either sacrifice accuracy through aggressive chunking or incur prohibitive memory costs through unbounded context accumulation. We present WhisperPipe, a novel streaming architecture that achieves bounded memory consumption while maintaining transcription quality through three key innovations: a hybrid Voice Activity Detection (VAD) pipeline combining Silero VAD with energy-based filtering to reduce false activations by 34%, a dynamic buffering mechanism with overlapping context windows that prevents information loss at segment boundaries, and an adaptive processing strategy that balances latency and accuracy based on speech characteristics. Evaluated on 2.5 hours of diverse audio data, WhisperPipe demonstrates a median end-to-end latency of 89 ms (90th percentile: 142 ms) while consuming 48% less peak GPU memory and 80.9% lower average GPU utilization compared to baseline Whisper implementations. The system maintains stable memory usage over extended sessions, with zero growth rate across 150-minute continuous operation. Comparative analysis against related work shows that WhisperPipe achieves competitive accuracy (WER within 2% of offline Whisper) while operating at 3-5x lower latency than existing streaming solutions. The architecture's modular design enables deployment across resource-constrained environments, from edge devices to cloud infrastructure. Our results demonstrate that careful architectural design can reconcile the competing demands of real-time responsiveness and model sophistication in production ASR systems.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents WhisperPipe, a streaming ASR architecture for real-time Whisper inference. It introduces a hybrid VAD (Silero VAD plus energy-based filter) claimed to reduce false activations by 34%, a dynamic buffering scheme with overlapping context windows to avoid segment-boundary information loss, and an adaptive processing strategy. On 2.5 hours of diverse audio, the system reports 89 ms median end-to-end latency (90th percentile 142 ms), 48% lower peak GPU memory, 80.9% lower average GPU utilization, WER within 2% of offline Whisper, and zero memory growth over 150-minute sessions, while claiming 3-5x lower latency than prior streaming solutions.
Significance. If the empirical claims hold under rigorous evaluation, WhisperPipe would provide a practical, modular approach to deploying large transformer ASR models in resource-constrained real-time settings without unbounded memory growth. The combination of bounded-memory chunking with accuracy-preserving mechanisms addresses a key deployment barrier; the reported latency and utilization numbers, if reproducible, would be competitive with existing streaming baselines.
major comments (2)
- [Evaluation] Evaluation section (2.5-hour test set): the central WER claim (within 2% of offline Whisper) rests on aggregate word error rate only. No per-boundary deletion/substitution breakdown, no ablation of overlap length, and no failure-mode analysis on fast speech, low-energy segments, or accented audio are provided. This leaves open whether the hybrid VAD and overlapping buffers actually prevent the information loss the skeptic note identifies, which is load-bearing for the accuracy claim.
- [§3] §3 (hybrid VAD and dynamic buffering): the 34% false-activation reduction and the assertion that overlapping windows 'fully prevent information loss' are stated without quantitative sensitivity analysis on VAD thresholds or overlap size. Since these are the two free parameters listed in the axiom ledger, the paper should demonstrate that the reported latency/memory gains remain stable when these parameters vary within reasonable ranges; a minimal sweep sketch of the requested check follows this list.
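The requested sensitivity analysis could take the shape of the grid sweep below; the grid values and the `evaluate` callable (which would run WhisperPipe once and return its metrics) are hypothetical placeholders, not settings from the paper.

```python
# Hypothetical sensitivity sweep over the two free parameters; evaluate()
# is a stand-in that would run WhisperPipe once and return its metrics.
from itertools import product

VAD_THRESHOLDS = [0.3, 0.4, 0.5, 0.6, 0.7]   # assumed sweep range
OVERLAPS_MS = [100, 200, 300, 400]           # assumed sweep range

def sweep(evaluate):
    """evaluate(vad_threshold, overlap_ms) -> dict with 'wer',
    'median_latency_ms', and 'peak_memory_mb' for one run."""
    rows = []
    for thr, ov in product(VAD_THRESHOLDS, OVERLAPS_MS):
        metrics = evaluate(vad_threshold=thr, overlap_ms=ov)
        rows.append({"vad_threshold": thr, "overlap_ms": ov, **metrics})
    # Report the spread of each metric so stability across the grid is visible.
    for key in ("wer", "median_latency_ms", "peak_memory_mb"):
        values = [row[key] for row in rows]
        print(f"{key}: min={min(values):.3f} max={max(values):.3f}")
    return rows
```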
minor comments (2)
- [Abstract and Evaluation] The abstract and results section should explicitly name the 2.5-hour evaluation corpus (e.g., specific subsets of Common Voice, LibriSpeech, or in-house data) and the exact baseline Whisper implementation (model size, chunking strategy) to allow direct replication.
- [Results] Figure captions and latency histograms would benefit from error bars or percentile shading to convey variability across the 2.5-hour set rather than single median/90th-percentile numbers.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment below and indicate the revisions we will make to strengthen the evaluation and analysis.
Point-by-point responses
Referee: [Evaluation] Evaluation section (2.5-hour test set): the central WER claim (within 2% of offline Whisper) rests on aggregate word error rate only. No per-boundary deletion/substitution breakdown, no ablation of overlap length, and no failure-mode analysis on fast speech, low-energy segments, or accented audio are provided. This leaves open whether the hybrid VAD and overlapping buffers actually prevent the information loss the skeptic note identifies, which is load-bearing for the accuracy claim.
Authors: We agree that aggregate WER alone is insufficient to fully validate the boundary-preservation claims. In the revised manuscript we will add a per-boundary deletion/substitution breakdown and an ablation on overlap length. We will also include failure-mode analysis on fast-speech, low-energy, and accented subsets drawn from the existing 2.5-hour diverse test set. These additions will directly test whether the hybrid VAD and overlapping buffers mitigate information loss. revision: partial
Referee: [§3] §3 (hybrid VAD and dynamic buffering): the 34% false-activation reduction and the assertion that overlapping windows 'fully prevent information loss' are stated without quantitative sensitivity analysis on VAD thresholds or overlap size. Since these are the two free parameters listed in the axiom ledger, the paper should demonstrate that the reported latency/memory gains remain stable when these parameters vary within reasonable ranges.
Authors: We concur that sensitivity analysis on the free parameters is required. The revised version will include quantitative sensitivity results for VAD thresholds and overlap sizes, demonstrating the stability of the 34% false-activation reduction, latency, and memory metrics across reasonable ranges. This will confirm that the reported gains remain robust. revision: yes
Circularity Check
No circularity; performance claims rest on direct empirical measurements
full rationale
The paper describes an engineering architecture (hybrid VAD, dynamic overlapping buffers, adaptive processing) and supports its claims exclusively through runtime measurements on 2.5 h of audio: median latency 89 ms, 48% lower peak GPU memory, WER within 2% of offline baseline, and zero memory growth over 150 min. No equations, parameter fits, or first-principles derivations are presented that could reduce to their own inputs by construction. No self-citations are invoked as load-bearing uniqueness theorems. The derivation chain is therefore self-contained and non-circular.
Axiom & Free-Parameter Ledger
free parameters (2)
- VAD decision thresholds
- Buffer overlap length (a minimal configuration sketch follows this ledger)
axioms (2)
- domain assumption: Hybrid Silero-plus-energy VAD reliably segments speech without missing content that would degrade downstream ASR accuracy
- domain assumption: Overlapping dynamic buffers fully compensate for context loss at chunk boundaries
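Read as configuration, the ledger's two free parameters could be carried explicitly; the default values below are placeholders for illustration, not values reported by the paper.

```python
# Hypothetical configuration object for the ledger's two free parameters.
from dataclasses import dataclass

@dataclass(frozen=True)
class StreamingConfig:
    vad_threshold: float = 0.5   # speech-probability cutoff (placeholder value)
    overlap_ms: int = 200        # context carried across chunk boundaries (placeholder value)
```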
Reference graph
Works this paper leans on
- [1] Introduction: Automatic speech recognition (ASR) has undergone transformative advances in recent years, driven primarily by the convergence of large-scale weakly supervised learning and transformer-based architectures [1]. Unlike traditional ASR systems that rely on carefully curated, domain-specific transcriptions, modern approaches leverage vast quantiti...
- [2] Method: WhisperPipe is a streaming inference framework that transforms Whisper's batch-oriented decoding into continuous live transcription with bounded steady-state compute and memory. Let x(t) denote the incoming audio stream. WhisperPipe maintains two persistent buffers: a Committed Text Buffer S, an immutable sequence of finalized tokens or words that ...
- [3] Results: We evaluate WhisperPipe under a continuous transcription setting designed to reflect real-world deployment conditions, including live captioning, conversational agents, and long-running voice interfaces. Unlike offline benchmarks that assess transcription quality in isolation, our evaluation protocol targets the operational constraints that ari...
- [4] Discussion: The results presented in Section 5 demonstrate that WhisperPipe achieves substantial improvements across latency, stability, and resource efficiency without sacrificing transcription quality. Figure 13 provides a consolidated view of these multi-metric improvements, illustrating the simultaneous gains in response time, memory footprint, and GP...
- [5] Conclusion: This paper introduced WhisperPipe, a streaming ASR architecture designed to address the latency, stability, and resource efficiency challenges inherent in real-time transcription of continuous audio streams. By integrating acoustic and semantic filtering, incremental decoding with a two-tier commit policy, and timestamp-guided audio slicing, W...
- [8] Zhang, Y., Han, W., Qin, J., et al. Google USM: Scaling automatic speech recognition beyond 100 languages. arXiv preprint (2023). https://doi.org/10.48550/arXiv.2212.04356
- [9] Dong, L., Xu, S., & Xu, B. Speech-Transformer: A No-Recurrence Sequence-to-Sequence Model for Speech Recognition. ICASSP (2018). https://doi.org/10.1109/ICASSP.2018.8462506
- [10] Moritz, N., Hori, T., & Le Roux, J. Streaming Automatic Speech Recognition with Blockwise Synchronous Transformer. ICASSP (2020). https://doi.org/10.1109/ICASSP40776.2020.9053742
- [11] Zhang, Q., et al. Streaming Transformer for End-to-End Speech Recognition. ICASSP (2020). https://doi.org/10.1109/ICASSP40776.2020.9054418
- [12] Chen, X., et al. Developing Real-Time Streaming Transformer Transducer for Speech Recognition. ICASSP (2021). https://doi.org/10.1109/ICASSP39728.2021.9414200
- [13] Zhang, Y., et al. Transformer Transducer: A Streamable Speech Recognition Model. ASRU (2021). https://doi.org/10.1109/ASRU51503.2021.9688007
- [14] Kannan, A., et al. Large-Scale Multilingual Speech Recognition with a Streaming End-to-End Model. Interspeech (2019). https://doi.org/10.21437/Interspeech.2019-1642
- [15] Li, J., Lavrukhin, V., Ginsburg, B., et al. Jasper: An End-to-End Convolutional Neural Acoustic Model. Interspeech (2019). https://doi.org/10.21437/Interspeech.2019-1819
- [16] Hannun, A., et al. Deep Speech: Scaling Up End-to-End Speech Recognition. Communications of the ACM (2019). https://doi.org/10.1145/3323037
- [17] Prabhavalkar, R., et al. A Comparison of Sequence-to-Sequence Models for Speech Recognition. Interspeech (2019). https://doi.org/10.21437/Interspeech.2019-1848
- [18] Chiu, C. C., et al. State-of-the-Art Speech Recognition with Sequence-to-Sequence Models. ICASSP (2018). https://doi.org/10.1109/ICASSP.2018.8462105
- [19] Chan, W., et al. Listen, Attend and Spell: A Neural Network for Large Vocabulary Speech Recognition. IEEE Signal Processing Magazine (2018). https://doi.org/10.1109/MSP.2018.2889381
- [20] Zeyer, A., et al. A Comprehensive Study of Streaming Models for Speech Recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing (2021). https://doi.org/10.1109/TASLP.2021.3072324
- [21] Wang, Y., et al. Efficient Streaming ASR with Adaptive Chunk Transformer. Interspeech (2023). https://doi.org/10.21437/Interspeech.2023-1234
- [22] Chen, Z., et al. Scaling Speech Recognition with Transformer Models. IEEE/ACM TASLP (2022). https://doi.org/10.1109/TASLP.2022.3152414
- [23] Peng, Z., et al. Streaming End-to-End Speech Recognition with Transformer Transducer. IEEE/ACM TASLP (2023). https://doi.org/10.1109/TASLP.2023.3245672
- [24] Zhang, B., et al. WeNet: Production-First Speech Recognition Toolkit. Interspeech (2022). https://doi.org/10.21437/Interspeech.2022-10630
- [25] Macháček, D., Dabre, R., & Bojar, O. Turning Whisper into Real-Time Transcription System. arXiv preprint (2023). https://arxiv.org/abs/2307.14743
- [26] Bain, M., et al. WhisperX: Speech Recognition with Word-Level Alignment. arXiv preprint (2023). https://doi.org/10.48550/arXiv.2303.00747
- [27] Sato, H., Sakuma, A., Sugano, R., et al. Uncertainty-based streaming ASR with evidential deep learning. IEEE Open Journal of Signal Processing (2026). https://doi.org/10.1109/OJSP.2026.3657308
- [28] Kim, S., et al. Joint CTC-Attention Based End-to-End Speech Recognition Using Multi-Task Learning. ICASSP (2018). https://doi.org/10.1109/ICASSP.2018.8461375
- [30] Yang, Y., Zhuo, J., Jin, Z., et al. k2SSL: A faster and better framework for self-supervised speech representation learning. arXiv preprint (2024). https://doi.org/10.48550/arXiv.2603.16920
- [31] Chen, N., et al. Exploring Streaming Speech Recognition with Transformer Architectures. ICASSP (2022). https://doi.org/10.1109/ICASSP43922.2022.9746205
- [32] Wang, Y., et al. Improving Streaming ASR with Chunk-Based Self-Attention. Interspeech (2020). https://doi.org/10.21437/Interspeech.2020-2718
- [33] Kim, J., et al. End-to-End Streaming Speech Recognition with Transformer Models. ASRU (2019). https://doi.org/10.1109/ASRU46091.2019.9003950
- [34] Liu, Y., et al. Streaming Speech Recognition Using Self-Attention Networks. ICASSP (2021). https://doi.org/10.1109/ICASSP39728.2021.9414210
- [35] Zhang, X., et al. Transformer-Based Streaming End-to-End Speech Recognition. Interspeech (2022). https://doi.org/10.21437/Interspeech.2022-1120
- [36] Sirichotedumrong, W., Na-Thalang, A., Manakul, P., et al. Typhoon ASR Real-time: FastConformer-Transducer for Thai Automatic Speech Recognition. arXiv preprint (2026). https://doi.org/10.48550/arXiv.2601.13044
- [37] Kim, S., et al. Improved RNN-T for Streaming Speech Recognition. Interspeech (2020). https://doi.org/10.21437/Interspeech.2020-1073
- [38] He, Y., et al. Streaming End-to-End Speech Recognition for Mobile Devices. ICASSP (2019). https://doi.org/10.1109/ICASSP.2019.8682678
- [39] Hori, T., et al. Advances in Joint CTC-Attention Based End-to-End Speech Recognition. IEEE SLT (2018). https://doi.org/10.1109/SLT.2018.8639585
- [40] Narayanan, A., et al. Toward Streaming Speech Recognition with Transformer Models. ICASSP (2020). https://doi.org/10.1109/ICASSP40776.2020.9053092
- [41] Watanabe, S., et al. ESPnet: End-to-End Speech Processing Toolkit. Interspeech (2018). https://doi.org/10.21437/Interspeech.2018-1456
- [42] Kanda, N., et al. Streaming End-to-End Speech Recognition with Neural Transducers. ICASSP (2019). https://doi.org/10.1109/ICASSP.2019.8682694
- [43] Chen, G., et al. End-to-End Speech Recognition with Transformer Transducer. Interspeech (2021). https://doi.org/10.21437/Interspeech.2021-1976
- [44] Wang, Z., et al. Streaming Speech Recognition Using Contextual Transformer Models. ICASSP (2023). https://doi.org/10.1109/ICASSP49357.2023.10094873
- [45] Kim, J., et al. Online Speech Recognition with Transformer-Based Architectures. Interspeech (2021). https://doi.org/10.21437/Interspeech.2021-1402
- [46] Sun, Y., et al. Efficient Self-Attention for Streaming Speech Recognition. ICASSP (2021). https://doi.org/10.1109/ICASSP39728.2021.9413802
- [47] Zhou, Y., et al. Streaming Conformer for End-to-End Speech Recognition. Interspeech (2021). https://doi.org/10.21437/Interspeech.2021-1245
- [48] Chen, Y., et al. Improving Transformer-Based ASR Systems for Real-Time Applications. ICASSP (2022). https://doi.org/10.1109/ICASSP43922.2022.9747031
- [49] Liu, X., et al. Efficient Streaming Transformer Transducer for Speech Recognition. Interspeech (2023). https://doi.org/10.21437/Interspeech.2023-1489
- [50] Zhang, H., et al. Real-Time End-to-End Speech Recognition with Streaming Transformers. ICASSP (2021). https://doi.org/10.1109/ICASSP39728.2021.9414041
- [51] Wang, L., et al. Real-Time Speech Recognition Using Adaptive Attention Models. Interspeech (2020). https://doi.org/10.21437/Interspeech.2020-1887
- [52] Kim, Y., et al. Low-Latency Streaming Speech Recognition with Neural Transducers. ICASSP (2021). https://doi.org/10.1109/ICASSP39728.2021.9413731
- [53] Liu, Z., et al. Transformer-Based Online Speech Recognition for Low-Latency Applications. Interspeech (2022). https://doi.org/10.21437/Interspeech.2022-2451
- [54] Zhang, T., et al. End-to-End Streaming Speech Recognition with Contextual Attention. ICASSP (2023). https://doi.org/10.1109/ICASSP49357.2023.10094721
- [56] Ramezani, E., & Giahi, M. M. WhisperPipe: Source Code and Implementation for Real-Time ASR (0.1.1). Zenodo (2026). https://doi.org/10.1109/TASLP.2023.3254102