Contextual Biasing for Streaming ASR via CTC-based Word Spotting

Berlin Chen; Kai-Chen Tsai; Tien-Hong Lo; Yun-Ting Sun

arxiv: 2605.18222 · v2 · pith:OQWRAWNEnew · submitted 2026-05-18 · 📡 eess.AS

Kai-Chen Tsai, Tien-Hong Lo, Yun-Ting Sun, Berlin Chen

Pith reviewed 2026-05-20 08:20 UTC · model grok-4.3

classification 📡 eess.AS

keywords contextual biasing · streaming ASR · CTC word spotting · real-time speech recognition · keyword detection · word error rate · incremental commitment · token passing

The pith

A streaming extension of CTC word spotting supports real-time contextual biasing in ASR by tracking keywords across audio chunks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper develops a streaming version of CTC-based word spotting to enable contextual biasing during live automatic speech recognition. It keeps track of possible keyword matches that cross from one audio chunk to the next and only releases output segments once they are certain to stay the same even with more audio arriving later. The method works on top of an existing acoustic model without any retraining or architecture changes. If successful, it allows systems to recognize rare or specialized words more accurately while keeping latency low enough for real-time use such as live captions or voice commands.

Core claim

The paper presents a streaming extension of CTC-WS that maintains active keyword paths across audio chunks with a stateful token passing algorithm and adds an incremental commitment mechanism that emits only segments guaranteed to remain unchanged by future audio, yielding lower overall word error rate and higher keyword F-score in real-time ASR without modifying the acoustic model or requiring extra training.

What carries the argument

Stateful token passing algorithm that keeps keyword paths alive across successive audio chunks, paired with an incremental commitment mechanism that defers uncertain output regions.
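
To make the idea concrete, here is a minimal Python sketch of token passing for a single keyword whose state survives between chunks. It is an illustration under stated assumptions, not the authors' algorithm: the class name `KeywordSpotter`, the state layout (keyword tokens separated by optional blanks), and the raw unnormalized log-score are all hypothetical choices. A real CTC-WS system would likely track many keywords through a trie and filter hypotheses with a score threshold.

```python
from dataclasses import dataclass, field

NEG_INF = float("-inf")


@dataclass
class KeywordSpotter:
    """Viterbi token passing for one keyword over CTC log-probs (hypothetical sketch)."""
    tokens: list                    # keyword as a list of token ids
    blank: int = 0                  # assumed id of the CTC blank symbol
    frame: int = 0                  # absolute index of the next frame to consume
    scores: list = field(default_factory=list)   # best log-score per state
    starts: list = field(default_factory=list)   # start frame of that best path

    def __post_init__(self):
        # State layout: tok1, blank, tok2, blank, ..., tokK
        self.labels = []
        for i, tok in enumerate(self.tokens):
            if i:
                self.labels.append(self.blank)
            self.labels.append(tok)
        self.scores = [NEG_INF] * len(self.labels)
        self.starts = [-1] * len(self.labels)

    def process_chunk(self, log_probs):
        """Consume one chunk (a list of per-frame log-prob rows).

        Returns (start_frame, end_frame, log_score) for every frame at which
        the last keyword token is reachable. State is kept on the object, so
        a path opened in one chunk can be completed in a later one.
        """
        hits = []
        n = len(self.labels)
        for row in log_probs:
            new_scores = [NEG_INF] * n
            new_starts = [-1] * n
            for s in range(n):
                cands = [(self.scores[s], self.starts[s])]            # stay
                if s >= 1:                                            # advance
                    cands.append((self.scores[s - 1], self.starts[s - 1]))
                if (s >= 2 and self.labels[s] != self.blank
                        and self.labels[s] != self.labels[s - 2]):    # skip blank
                    cands.append((self.scores[s - 2], self.starts[s - 2]))
                if s == 0:                                            # fresh token
                    cands.append((0.0, self.frame))
                best, start = max(cands, key=lambda c: c[0])
                if best > NEG_INF:
                    new_scores[s] = best + row[self.labels[s]]
                    new_starts[s] = start
            self.scores, self.starts = new_scores, new_starts
            if self.scores[-1] > NEG_INF:
                hits.append((self.starts[-1], self.frame, self.scores[-1]))
            self.frame += 1
        return hits
```

Because `scores`, `starts`, and `frame` persist on the object, feeding the same frames in two chunks yields the same hypotheses as feeding them in one call, which is the property a streaming extension needs for keywords that cross a chunk boundary.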

If this is right

The approach integrates directly into existing streaming ASR pipelines.
No modifications to the acoustic model or additional training are needed.
Overall word error rate decreases while keyword F-score increases.
Output latency remains low enough for real-time applications.
Rare and domain-specific words receive improved recognition during live transcription.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same chunk-wise path maintenance could apply to other partial-sequence tasks that need stable early commitments.
Combining this spotting layer with dynamic language model adaptation might further lift accuracy for changing contexts.
Scaling tests on utterances where keywords span many chunks would show whether path maintenance stays reliable at longer horizons.

Load-bearing premise

The assumption that a stateful token passing algorithm can reliably maintain keyword paths across successive audio chunks while the incremental commitment mechanism guarantees segments are unaffected by future audio.
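
One way such a commit rule could look is sketched below in Python. This is a guess at the shape of a commit-and-hold buffer, not the paper's mechanism: `CommitBuffer`, `push`, and the `active_starts` argument are hypothetical names, and the sketch simply assumes that some upstream component reports the start frames of keyword hypotheses still alive at the end of each chunk. Whether the released frames are truly unaffected by future audio depends entirely on how "alive" is defined, which is exactly the premise in question.

```python
from dataclasses import dataclass, field


@dataclass
class CommitBuffer:
    """Release per-frame labels only before the earliest live keyword path."""
    committed: int = 0                            # frames [0, committed) are emitted
    pending: list = field(default_factory=list)   # labels held back so far

    def push(self, frame_labels, active_starts):
        """Add one chunk of per-frame labels and return the releasable prefix.

        active_starts: absolute start frames of keyword hypotheses that are
        still alive at the end of this chunk (assumed to be supplied by the
        word spotter).
        """
        self.pending.extend(frame_labels)
        end = self.committed + len(self.pending)
        # Hold everything from the earliest live hypothesis onward.
        live = [s for s in active_starts if s >= self.committed]
        safe_end = min(live + [end])
        n = safe_end - self.committed
        released, self.pending = self.pending[:n], self.pending[n:]
        self.committed = safe_end
        return released

    def flush(self):
        """Release everything that is still held, for example at end of stream."""
        released, self.pending = self.pending, []
        self.committed += len(released)
        return released
```

With a first chunk of four frames and one hypothesis starting at frame 3, `push` releases frames 0 to 2 and holds frame 3; once a later chunk reports no live hypotheses, the held frame is released together with the new ones.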

What would settle it

Compare keyword F-score and overall WER on a held-out set of live audio streams containing keywords that cross chunk boundaries against the offline CTC-WS baseline; a large drop in performance would falsify the streaming extension claim.
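
The two metrics in this comparison can be computed as in the following Python sketch. The function names are hypothetical, keywords are assumed to be single whitespace-delimited words, and the F-score uses a common per-utterance counting convention; the paper's exact scoring protocol is not stated here and may differ.

```python
from collections import Counter


def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (r != h)))   # substitution or match
        prev = cur
    return prev[-1]


def wer(refs, hyps):
    """Corpus-level word error rate: total edits divided by total reference words."""
    errors = sum(edit_distance(r.split(), h.split()) for r, h in zip(refs, hyps))
    words = sum(len(r.split()) for r in refs)
    return errors / words


def keyword_f_score(refs, hyps, keywords):
    """F1 over keyword occurrences, counted per utterance."""
    tp = fp = fn = 0
    for r, h in zip(refs, hyps):
        ref_counts, hyp_counts = Counter(r.split()), Counter(h.split())
        for k in keywords:
            hit = min(ref_counts[k], hyp_counts[k])
            tp += hit
            fp += hyp_counts[k] - hit
            fn += ref_counts[k] - hit
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

Running both functions on the streaming and offline hypotheses for the same references, restricted to utterances whose keywords cross a chunk boundary, gives the side-by-side numbers the test calls for.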

Figures

Figures reproduced from arXiv: 2605.18222 by Berlin Chen, Kai-Chen Tsai, Tien-Hong Lo, Yun-Ting Sun.

**Figure 1.** The proposed Streaming CTC-WS pipeline.

**Figure 2.** Commit and Hold for Cross-Chunk Keyword Tracking - Completed

**Figure 3.** Effect of chunk size on WER and F-score with and without contextual

Original abstract

Contextual biasing is essential to improving the recognition of rare and domain-specific words in an automatic speech recognition (ASR) system. While numerous methods have been proposed in recent years, most of them focus on offline settings and do not explicitly address the challenges of streaming ASR. For example, CTC-based word spotting (CTC-WS) have demonstrated strong performance by directly detecting keywords from CTC log-probabilities, but they are limited to offline processing and require access to the full utterance. In This work, we present a streaming extension of CTC-WS for real-time contextual biasing. Our method maintains active keyword paths across audio chunks using a stateful token passing algorithm, enabling the detection of keywords that span multiple chunks. To ensure low latency and stable output, we introduce an incremental commitment mechanism that only emits segments guaranteed not to be affected by future audio, while deferring uncertain regions. This method naturally integrates with streaming ASR pipelines and does not require modifications to the underlying acoustic model or additional training, making it practical for real-world deployment. Experimental results show that our method reduces overall WER and effectively improves keyword F-score, demonstrating its effectiveness for real-time ASR applications.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper adapts offline CTC word spotting to streaming ASR with stateful token passing and incremental commitment, but the experiments stay too thin to judge the real gains.

The core move here is taking CTC-based word spotting, which worked offline, and making it work on audio chunks in real time. Stateful token passing keeps partial keyword paths alive across boundaries, and the commitment rule tries to emit only the parts that future audio cannot change. That combination lets the system add contextual biasing without retraining the acoustic model or touching the decoder much, which is the practical hook for live voice systems.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes a streaming extension of CTC-based word spotting (CTC-WS) for contextual biasing in automatic speech recognition. It introduces a stateful token passing algorithm to maintain active keyword paths across successive audio chunks and an incremental commitment mechanism that emits only segments guaranteed to be unaffected by future audio. The approach integrates with existing streaming ASR pipelines without acoustic-model changes or retraining. Experimental results are reported to show reduced overall WER and improved keyword F-score for real-time applications.

Significance. If the algorithmic claims and experimental results hold, the work addresses a practical gap in low-latency contextual biasing for streaming ASR. The no-retraining and no-AM-modification properties are clear practical strengths that could facilitate deployment. The focus on cross-chunk keyword detection and stable output is relevant to real-world streaming constraints.

major comments (2)

[Method overview] Method overview (abstract and §3): the stateful token passing algorithm and incremental commitment mechanism are described only at a high level. No equations, pseudocode, or boundary conditions are supplied for carrying partial keyword hypotheses across chunk boundaries using only CTC log-probabilities seen so far, or for provably identifying segments whose posterior is unaffected by future audio. These two mechanisms are load-bearing for the streaming claim; without them the reported WER and F-score gains cannot be realized.
[Experimental results] Experimental results (abstract and §4): the claims of overall WER reduction and improved keyword F-score are stated without dataset names, baseline systems, statistical tests, number of runs, or error analysis. This leaves the central effectiveness claim for real-time ASR without visible supporting evidence.

minor comments (1)

[Abstract] Abstract: 'In This work' contains an erroneous capital 'T'.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and for recognizing the practical strengths of our streaming CTC-WS approach for contextual biasing. We address each major comment below and will incorporate the requested clarifications into the revised manuscript.

Referee: [Method overview] Method overview (abstract and §3): the stateful token passing algorithm and incremental commitment mechanism are described only at a high level. No equations, pseudocode, or boundary conditions are supplied for carrying partial keyword hypotheses across chunk boundaries using only CTC log-probabilities seen so far, or for provably identifying segments whose posterior is unaffected by future audio. These two mechanisms are load-bearing for the streaming claim; without them the reported WER and F-score gains cannot be realized.

Authors: We agree that the current presentation of the stateful token passing and incremental commitment mechanisms is at a high level. In the revised version we will expand §3 with explicit equations for updating active keyword paths across chunk boundaries from CTC log-probabilities observed so far, together with pseudocode and boundary conditions for the incremental commitment step that identifies segments whose posteriors are guaranteed to be unaffected by future audio. These additions will make the streaming properties fully explicit.

revision: yes

Referee: [Experimental results] Experimental results (abstract and §4): the claims of overall WER reduction and improved keyword F-score are stated without dataset names, baseline systems, statistical tests, number of runs, or error analysis. This leaves the central effectiveness claim for real-time ASR without visible supporting evidence.

Authors: We acknowledge that the experimental claims require more supporting detail. We will revise §4 to name the evaluation datasets, fully describe the baseline systems, report statistical significance tests, state the number of runs performed, and include an error analysis. These changes will provide the concrete evidence needed to substantiate the reported WER and keyword F-score improvements.

revision: yes

Circularity Check

0 steps flagged

No circularity: algorithmic extension remains independent of fitted inputs

The paper presents a streaming extension of CTC-WS via a stateful token passing algorithm and incremental commitment mechanism. No equations, parameter fits, or predictions are described that reduce by construction to the method's own inputs or prior self-citations. The claimed WER reduction and keyword F-score gains are presented as empirical outcomes of the implemented algorithm on real-time pipelines, without any self-referential definitions or load-bearing uniqueness theorems imported from the authors' prior work. The approach is explicitly positioned as a practical integration that requires no acoustic model changes or retraining, confirming the derivation chain is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The approach builds on standard CTC decoding and token-passing concepts from prior ASR literature; the abstract introduces no explicit new free parameters, axioms, or invented entities beyond the described algorithmic components.
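
For readers unfamiliar with the standard CTC decoding that the method builds on, the usual greedy rule is: take the best label per frame, merge consecutive repeats, then drop blanks. A minimal Python sketch follows; the function name and the choice of `0` as the blank id are assumptions made for illustration.

```python
def ctc_greedy_decode(log_probs, blank=0):
    """Greedy CTC decoding: per-frame argmax, collapse repeats, remove blanks.

    log_probs: list of per-frame rows, one log-probability per vocabulary id.
    blank: id of the CTC blank symbol (assumed to be 0 here).
    """
    best = [max(range(len(row)), key=row.__getitem__) for row in log_probs]
    out = []
    prev = blank
    for k in best:
        # Emit a label only when it is not blank and differs from the previous frame.
        if k != blank and k != prev:
            out.append(k)
        prev = k
    return out
```

A blank between two identical labels separates them, so the frame sequence 1, 1, blank, 1, 2 decodes to 1, 1, 2 rather than 1, 2.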

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Paper passage: "maintains active keyword paths across audio chunks using a stateful token passing algorithm... incremental commitment mechanism that only emits segments guaranteed not to be affected by future audio"

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Paper passage: "CTC-WS uses CTC log probabilities and trie to detect predefined terms... no modifications to the underlying acoustic model or additional training"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

18 extracted references · 18 canonical work pages · 1 internal anchor

[1] J. Li, A. Sun, J. Han, and C. Li, “A Survey on Deep Learning for Named Entity Recognition,” IEEE Transactions on Knowledge and Data Engineering, 2022.
[2] G. Pundak, T. N. Sainath, R. Prabhavalkar, A. Kannan, and D. Zhao, “Deep Context: End-to-End Contextual Speech Recognition,” in Proc. IEEE Spoken Language Technology Workshop (SLT), 2018.
[3] M. Jain, G. Keren, J. Mahadeokar, G. Zweig, F. Metze, and Y. Saraf, “Contextual RNN-T for Open Domain ASR,” in Proc. Interspeech, 2020.
[4] X. Yang, W. Kang, Z. Yao, Y. Yang, L. Guo, F. Kuang, L. Lin, and D. Povey, “PromptASR for Contextualized ASR with Controllable Style,” in Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2024.
[5] D. Le, M. Jain, G. Keren, S. Kim, Y. Shi, J. Mahadeokar, J. Chan, Y. Shangguan, C. Fuegen, O. Kalinli, Y. Saraf, and M. L. Seltzer, “Contextualized Streaming End-to-End Speech Recognition with Trie-Based Deep Biasing and Shallow Fusion,” in Proc. Interspeech, 2021.
[6] P. Harding, S. Tong, and S. Wiesler, “Selective Biasing with Trie-Based Contextual Adapters for Personalised Speech Recognition using Neural Transducers,” in Proc. Interspeech, 2023.
[7] M. Fang, T. Wei, K. Guo, Z. Zhuang, Y. Shi, N. Cheng, S. Wang, and J. Xiao, “Improving Contextual ASR with Enhanced Phrase-Level Representation Based on MCTC Loss,” in Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2025.
[8] D. Zhao, T. N. Sainath, D. Rybach, P. Rondon, D. Bhatia, B. Li, and R. Pang, “Shallow-Fusion End-to-End Contextual Biasing,” in Proc. Interspeech, 2019.
[9] N. Jung, G. Kim, and J. S. Chung, “Spell My Name: Keyword Boosted Speech Recognition,” in Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2022.
[10] V. Bataev, A. Andrusenko, L. Grigoryan, A. Laptev, V. Lavrukhin, and B. Ginsburg, “NGPU-LM: GPU-Accelerated N-Gram Language Model for Context-Biasing in Greedy ASR Decoding,” in Proc. Interspeech, 2025.
[11] A. Andrusenko, V. Bataev, L. Grigoryan, V. Lavrukhin, and B. Ginsburg, “TurboBias: Universal ASR Context-Biasing Powered by GPU-Accelerated Phrase-Boosting Tree,” in Proc. IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), 2025.
[12] A. Andrusenko, A. Laptev, V. Bataev, V. Lavrukhin, and B. Ginsburg, “Fast Context-Biasing for CTC and Transducer ASR Models with CTC-Based Word Spotter,” in Proc. Interspeech, 2024.
[13] A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber, “Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks,” in Proc. International Conference on Machine Learning (ICML), 2006.
[14] A. Graves, “Sequence Transduction with Recurrent Neural Networks,” arXiv preprint arXiv:1211.3711, 2012.
[15] D. Rekesh, N. R. Koluguri, S. Kriman, S. Majumdar, V. Noroozi, H. Huang, O. Hrinchuk, K. Puvvada, A. Kumar, J. Balam, and B. Ginsburg, “Fast Conformer with Linearly Scalable Attention for Efficient Speech Recognition,” in Proc. IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), 2023.
[16] V. Noroozi, S. Majumdar, A. Kumar, J. Balam, and B. Ginsburg, “Stateful Conformer with Cache-Based Inference for Streaming Automatic Speech Recognition,” in Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2024.
[17] R. Sennrich, B. Haddow, and A. Birch, “Neural Machine Translation of Rare Words with Subword Units,” in Proc. Annual Meeting of the Association for Computational Linguistics (ACL), 2016.
[18] P. Tomasello, A. Shrivastava, D. Lazar, P.-C. Hsu, D. Le, A. Sagar, A. Elkahky, J. Copet, W.-N. Hsu, Y. Adi, R. Algayres, T. A. Nguyen, E. Dupoux, L. Zettlemoyer, and A. Mohamed, “STOP: A Dataset for Spoken Task Oriented Semantic Parsing,” in Proc. IEEE Spoken Language Technology Workshop (SLT), 2022.
