arxiv: 2605.18222 · v2 · pith:OQWRAWNEnew · submitted 2026-05-18 · 📡 eess.AS

Contextual Biasing for Streaming ASR via CTC-based Word Spotting

Pith reviewed 2026-05-20 08:20 UTC · model grok-4.3

classification 📡 eess.AS
keywords contextual biasing · streaming ASR · CTC word spotting · real-time speech recognition · keyword detection · word error rate · incremental commitment · token passing

The pith

A streaming extension of CTC word spotting supports real-time contextual biasing in ASR by tracking keywords across audio chunks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper develops a streaming version of CTC-based word spotting to enable contextual biasing during live automatic speech recognition. It keeps track of possible keyword matches that cross from one audio chunk to the next and only releases output segments once they are certain to stay the same even with more audio arriving later. The method works on top of an existing acoustic model without any retraining or architecture changes. If successful, it allows systems to recognize rare or specialized words more accurately while keeping latency low enough for real-time use such as live captions or voice commands.

Core claim

The paper presents a streaming extension of CTC-WS that maintains active keyword paths across audio chunks with a stateful token passing algorithm and adds an incremental commitment mechanism that emits only segments guaranteed to remain unchanged by future audio, yielding lower overall word error rate and higher keyword F-score in real-time ASR without modifying the acoustic model or requiring extra training.

What carries the argument

Stateful token passing algorithm that keeps keyword paths alive across successive audio chunks, paired with an incremental commitment mechanism that defers uncertain output regions.
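To make the idea of keeping keyword paths alive across chunks concrete, here is a minimal sketch in Python. It is not the authors' algorithm, whose details are not given on this page. It assumes a single keyword, a plain Viterbi pass over the usual CTC topology (keyword labels separated by optional blanks), and a fresh token allowed to enter at any frame. The class name `StreamingKeywordSpotter`, the method `process_chunk`, and the toy probabilities are all hypothetical. The only point it illustrates is that carrying the per-state scores and start frames from one call to the next makes chunked processing agree with a single offline pass.

```python
import math

NEG_INF = float("-inf")


class StreamingKeywordSpotter:
    """Viterbi token passing over one keyword's CTC topology.

    Illustrative assumption: scores are raw sums of log-probabilities, with
    no length normalisation or thresholding.
    """

    def __init__(self, keyword_ids, blank_id=0):
        # Extended label sequence: k1, blank, k2, blank, ..., kn
        self.ext = []
        for i, tok in enumerate(keyword_ids):
            if i > 0:
                self.ext.append(blank_id)
            self.ext.append(tok)
        self.blank_id = blank_id
        self.scores = [NEG_INF] * len(self.ext)  # best token score per state
        self.starts = [-1] * len(self.ext)       # frame where that token entered
        self.frame = 0                           # global frame index across chunks

    def process_chunk(self, log_probs):
        """Consume one chunk (a list of per-frame log-probability lists).

        Returns (start_frame, end_frame, score) for every frame at which a
        token occupies the final keyword state.
        """
        hits = []
        for lp in log_probs:
            new_scores = [NEG_INF] * len(self.ext)
            new_starts = [-1] * len(self.ext)
            for s, label in enumerate(self.ext):
                # Predecessors: stay, advance by one, or skip over a blank.
                cands = [(self.scores[s], self.starts[s])]
                if s == 0:
                    cands.append((0.0, self.frame))  # fresh token may enter here
                if s >= 1:
                    cands.append((self.scores[s - 1], self.starts[s - 1]))
                if s >= 2 and label != self.blank_id and label != self.ext[s - 2]:
                    cands.append((self.scores[s - 2], self.starts[s - 2]))
                best_score, best_start = max(cands, key=lambda c: c[0])
                if best_score > NEG_INF:
                    new_scores[s] = best_score + lp[label]
                    new_starts[s] = best_start
            # The state below survives the end of the chunk.
            self.scores, self.starts = new_scores, new_starts
            if self.scores[-1] > NEG_INF:
                hits.append((self.starts[-1], self.frame, self.scores[-1]))
            self.frame += 1
        return hits


# Toy posteriors over labels [blank, "a", "b"]; the keyword is "a b" = [1, 2].
frames = [[math.log(p) for p in row] for row in [
    [0.8, 0.1, 0.1],
    [0.2, 0.7, 0.1],
    [0.3, 0.6, 0.1],
    [0.7, 0.2, 0.1],
    [0.1, 0.1, 0.8],
    [0.9, 0.05, 0.05],
]]

# Offline reference: the whole utterance in one call.
offline = StreamingKeywordSpotter([1, 2]).process_chunk(frames)

# Streaming: chunks of two frames, state carried inside the spotter.
spotter = StreamingKeywordSpotter([1, 2])
streaming = []
for i in range(0, len(frames), 2):
    streaming.extend(spotter.process_chunk(frames[i:i + 2]))

# Best hit starts in the second chunk and ends in the third.
best = max(streaming, key=lambda h: h[2])
```

In this toy run the best hypothesis enters at frame 2 and completes at frame 4, so it crosses a chunk boundary. A spotter that reset its state at each chunk would miss it.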

If this is right

  • The approach integrates directly into existing streaming ASR pipelines.
  • No modifications to the acoustic model or additional training are needed.
  • Overall word error rate decreases while keyword F-score increases.
  • Output latency remains low enough for real-time applications.
  • Rare and domain-specific words receive improved recognition during live transcription.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same chunk-wise path maintenance could apply to other partial-sequence tasks that need stable early commitments.
  • Combining this spotting layer with dynamic language model adaptation might further lift accuracy for changing contexts.
  • Scaling tests on utterances where keywords span many chunks would show whether path maintenance stays reliable at longer horizons.

Load-bearing premise

The assumption that a stateful token passing algorithm can reliably maintain keyword paths across successive audio chunks while the incremental commitment mechanism guarantees segments are unaffected by future audio.
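One way to read the commitment half of this premise is sketched below. The rule is an assumption made for illustration, not the paper's stated criterion: a transcript segment is treated as safe to emit only if it ends before the earliest start frame of any keyword path that is still active, since such a path could later replace the text it overlaps. The function name `split_commit_hold` and the `(start_frame, end_frame, text)` segment layout are hypothetical.

```python
def split_commit_hold(pending, active_starts):
    """Split time-ordered pending segments into (committed, held).

    pending: list of (start_frame, end_frame, text), sorted by time.
    active_starts: start frames of keyword paths that are still alive.
    Assumed rule: only a prefix of segments ending strictly before the
    earliest active start frame is emitted; everything after is held.
    """
    boundary = min(active_starts, default=float("inf"))
    committed = []
    for i, (start, end, text) in enumerate(pending):
        if end >= boundary:
            # This segment overlaps a live keyword path: hold it and the rest.
            return committed, pending[i:]
        committed.append((start, end, text))
    return committed, []


pending = [(0, 3, "play"), (4, 7, "songs"), (8, 11, "by")]

# A keyword path that started at frame 8 is still open: hold the last segment.
committed, held = split_commit_hold(pending, active_starts=[8])

# No active paths: everything can be emitted.
all_committed, none_held = split_commit_hold(pending, active_starts=[])
```

Emitting only a prefix keeps the output in time order. The cost is latency, because a long-lived keyword path delays every segment behind it.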

What would settle it

Compare keyword F-score and overall WER on a held-out set of live audio streams containing keywords that cross chunk boundaries against the offline CTC-WS baseline; a large drop in performance would falsify the streaming extension claim.
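The comparison described above needs two numbers per system. The sketch below shows one plausible way to compute them, under assumptions this page does not confirm: WER is the word-level Levenshtein distance summed over utterances and divided by the number of reference words, and keyword F-score uses per-utterance bag-of-words matching of single-word keywords. The function names and the example sentences are hypothetical.

```python
from collections import Counter


def word_error_rate(refs, hyps):
    """Corpus-level WER: total word edit distance / total reference words."""
    errors = words = 0
    for ref, hyp in zip(refs, hyps):
        r, h = ref.split(), hyp.split()
        prev = list(range(len(h) + 1))  # distances for the empty reference prefix
        for i, rw in enumerate(r, 1):
            cur = [i]
            for j, hw in enumerate(h, 1):
                cur.append(min(prev[j] + 1,                # deletion
                               cur[j - 1] + 1,             # insertion
                               prev[j - 1] + (rw != hw)))  # substitution or match
            prev = cur
        errors += prev[-1]
        words += len(r)
    return errors / words if words else 0.0


def keyword_f_score(refs, hyps, keywords):
    """Precision, recall and F1 over keyword occurrences (single-word keywords)."""
    kw = set(keywords)
    tp = fp = fn = 0
    for ref, hyp in zip(refs, hyps):
        ref_c = Counter(w for w in ref.split() if w in kw)
        hyp_c = Counter(w for w in hyp.split() if w in kw)
        overlap = sum((ref_c & hyp_c).values())  # multiset intersection
        tp += overlap
        fp += sum(hyp_c.values()) - overlap
        fn += sum(ref_c.values()) - overlap
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


refs = ["call doctor nguyen now", "email nguyen"]
hyps = ["call doctor nguyen now", "email when"]

wer = word_error_rate(refs, hyps)
precision, recall, f1 = keyword_f_score(refs, hyps, ["nguyen"])
```

Running both functions on the streaming and the offline outputs for the same audio gives the paired numbers the test calls for.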

Figures

Figures reproduced from arXiv: 2605.18222 by Berlin Chen, Kai-Chen Tsai, Tien-Hong Lo, Yun-Ting Sun.

Figure 1: The proposed Streaming CTC-WS pipeline. Various contextual biasing methods have been proposed, and they can be categorized according to how contextual information is incorporated into the recognition process. One representative direction is deep-fusion-based contextual biasing, which introduces biasing information directly into the neural ASR model. In these methods, the biasing list is encoded as addition…
Figure 2: Commit and Hold for Cross-Chunk Keyword Tracking - Completed
Figure 3: Effect of chunk size on WER and F-score with and without contextual
Original abstract

Contextual biasing is essential to improving the recognition of rare and domain-specific words in an automatic speech recognition (ASR) system. While numerous methods have been proposed in recent years, most of them focus on offline settings and do not explicitly address the challenges of streaming ASR. For example, CTC-based word spotting (CTC-WS) have demonstrated strong performance by directly detecting keywords from CTC log-probabilities, but they are limited to offline processing and require access to the full utterance. In This work, we present a streaming extension of CTC-WS for real-time contextual biasing. Our method maintains active keyword paths across audio chunks using a stateful token passing algorithm, enabling the detection of keywords that span multiple chunks. To ensure low latency and stable output, we introduce an incremental commitment mechanism that only emits segments guaranteed not to be affected by future audio, while deferring uncertain regions. This method naturally integrates with streaming ASR pipelines and does not require modifications to the underlying acoustic model or additional training, making it practical for real-world deployment. Experimental results show that our method reduces overall WER and effectively improves keyword F-score, demonstrating its effectiveness for real-time ASR applications.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes a streaming extension of CTC-based word spotting (CTC-WS) for contextual biasing in automatic speech recognition. It introduces a stateful token passing algorithm to maintain active keyword paths across successive audio chunks and an incremental commitment mechanism that emits only segments guaranteed to be unaffected by future audio. The approach integrates with existing streaming ASR pipelines without acoustic-model changes or retraining. Experimental results are reported to show reduced overall WER and improved keyword F-score for real-time applications.

Significance. If the algorithmic claims and experimental results hold, the work addresses a practical gap in low-latency contextual biasing for streaming ASR. The no-retraining and no-AM-modification properties are clear practical strengths that could facilitate deployment. The focus on cross-chunk keyword detection and stable output is relevant to real-world streaming constraints.

major comments (2)
  1. [Method overview] Method overview (abstract and §3): the stateful token passing algorithm and incremental commitment mechanism are described only at a high level. No equations, pseudocode, or boundary conditions are supplied for carrying partial keyword hypotheses across chunk boundaries using only CTC log-probabilities seen so far, or for provably identifying segments whose posterior is unaffected by future audio. These two mechanisms are load-bearing for the streaming claim; without them the reported WER and F-score gains cannot be realized.
  2. [Experimental results] Experimental results (abstract and §4): the claims of overall WER reduction and improved keyword F-score are stated without dataset names, baseline systems, statistical tests, number of runs, or error analysis. This leaves the central effectiveness claim for real-time ASR without visible supporting evidence.
minor comments (1)
  1. [Abstract] Abstract: 'In This work' contains an erroneous capital 'T'.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and for recognizing the practical strengths of our streaming CTC-WS approach for contextual biasing. We address each major comment below and will incorporate the requested clarifications into the revised manuscript.

  1. Referee: [Method overview] Method overview (abstract and §3): the stateful token passing algorithm and incremental commitment mechanism are described only at a high level. No equations, pseudocode, or boundary conditions are supplied for carrying partial keyword hypotheses across chunk boundaries using only CTC log-probabilities seen so far, or for provably identifying segments whose posterior is unaffected by future audio. These two mechanisms are load-bearing for the streaming claim; without them the reported WER and F-score gains cannot be realized.

    Authors: We agree that the current presentation of the stateful token passing and incremental commitment mechanisms is at a high level. In the revised version we will expand §3 with explicit equations for updating active keyword paths across chunk boundaries from CTC log-probabilities observed so far, together with pseudocode and boundary conditions for the incremental commitment step that identifies segments whose posteriors are guaranteed to be unaffected by future audio. These additions will make the streaming properties fully explicit. revision: yes

  2. Referee: [Experimental results] Experimental results (abstract and §4): the claims of overall WER reduction and improved keyword F-score are stated without dataset names, baseline systems, statistical tests, number of runs, or error analysis. This leaves the central effectiveness claim for real-time ASR without visible supporting evidence.

    Authors: We acknowledge that the experimental claims require more supporting detail. We will revise §4 to name the evaluation datasets, fully describe the baseline systems, report statistical significance tests, state the number of runs performed, and include an error analysis. These changes will provide the concrete evidence needed to substantiate the reported WER and keyword F-score improvements. revision: yes

Circularity Check

0 steps flagged

No circularity: algorithmic extension remains independent of fitted inputs

The paper presents a streaming extension of CTC-WS via a stateful token passing algorithm and incremental commitment mechanism. No equations, parameter fits, or predictions are described that reduce by construction to the method's own inputs or prior self-citations. The claimed WER reduction and keyword F-score gains are presented as empirical outcomes of the implemented algorithm on real-time pipelines, without any self-referential definitions or load-bearing uniqueness theorems imported from the authors' prior work. The approach is explicitly positioned as a practical integration that requires no acoustic model changes or retraining, confirming the derivation chain is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The approach builds on standard CTC decoding and token-passing concepts from prior ASR literature; the abstract introduces no explicit new free parameters, axioms, or invented entities beyond the described algorithmic components.

pith-pipeline@v0.9.0 · 5739 in / 1037 out tokens · 53435 ms · 2026-05-20T08:20:17.792743+00:00 · methodology
