arxiv: 2605.16896 · v1 · pith:WVAJLHNRnew · submitted 2026-05-16 · 💻 cs.CL

JSPG: Dynamic Dictionary Filtering via Joint Semantic-Pinyin-Glyph Retrieval for Chinese Contextual ASR

Pith reviewed 2026-05-19 20:52 UTC · model grok-4.3

classification 💻 cs.CL
keywords Chinese contextual ASR · dynamic dictionary filtering · semantic-pinyin-glyph retrieval · homophonic errors · keyword recognition · extended Smith-Waterman · Aishell-1 · RWCS-NER

The pith

Joint semantic-pinyin-glyph retrieval filters large keyword dictionaries more effectively for Chinese contextual ASR by recovering from homophonic errors that defeat standard semantic methods.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Contextual ASR for Chinese must handle very large keyword dictionaries, but feeding too many irrelevant terms adds noise and hurts accuracy. Base ASR models often output homophonic or near-homophonic mistakes that keep the sound of the target keyword while destroying its semantic meaning, so pure semantic retrievers cannot find the right entries. The paper introduces a joint system that adds pinyin-based phonetic matching and glyph-based structural matching to semantic retrieval, then uses an extended alignment procedure to score full hypothesis sequences against candidate keywords. This combination produces a compact, relevant subset that downstream ASR models can use for better context. A reader would care because the method makes large-dictionary contextual recognition feasible in a language where sound-alike errors are common.

Core claim

The JSPG framework jointly integrates semantic, pinyin, and glyph features for dynamic dictionary filtering. Pinyin retrieves targets via phonetic similarity to counter homophonic distortions from the base ASR model, while glyph supplies complementary structural cues to discard numerous irrelevant homophones typical in Chinese. An extended Smith-Waterman algorithm computes similarity scores between N-best hypothesis sequences and keywords, bridging character-level pinyin/glyph metrics to sequence-level filtering decisions.
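A minimal sketch of how a Smith-Waterman style local alignment can lift a character-level similarity to a sequence-level score. This is not the authors' extension, whose details are not given on this page. The function name `local_alignment_score`, the gap penalty, the mapping from similarity to substitution score, the normalisation by keyword length, and the hard-coded `TOY_PINYIN` table are all assumptions made for illustration.

```python
from typing import Callable, Sequence


def local_alignment_score(
    hyp: Sequence[str],
    keyword: Sequence[str],
    sim: Callable[[str, str], float],
    gap: float = 0.5,
) -> float:
    """Smith-Waterman local alignment with a soft substitution score.

    sim(a, b) is assumed to return a similarity in [0, 1]. It is mapped to
    [-1, 1] so that dissimilar characters are penalised. The best local
    score is divided by the keyword length (an assumed normalisation).
    """
    if not hyp or not keyword:
        return 0.0
    prev = [0.0] * (len(keyword) + 1)
    best = 0.0
    for h in hyp:
        curr = [0.0]
        for j, k in enumerate(keyword, start=1):
            match = prev[j - 1] + 2.0 * sim(h, k) - 1.0
            delete = prev[j] - gap       # skip a hypothesis character
            insert = curr[j - 1] - gap   # skip a keyword character
            cell = max(0.0, match, delete, insert)  # 0.0 restarts the alignment
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best / len(keyword)


# Hypothetical toy pinyin table (tones ignored), for illustration only.
TOY_PINYIN = {"弃": "qi", "期": "qi", "权": "quan", "放": "fang"}


def toy_pinyin_sim(a: str, b: str) -> float:
    """1.0 for identical characters, 0.8 for toy homophones, else 0.0."""
    if a == b:
        return 1.0
    pa, pb = TOY_PINYIN.get(a), TOY_PINYIN.get(b)
    return 0.8 if pa is not None and pa == pb else 0.0


hypothesis = "买入弃权"  # invented hypothesis containing the Figure 1 error
target_score = local_alignment_score(hypothesis, "期权", toy_pinyin_sim)
distractor_score = local_alignment_score(hypothesis, "放弃", toy_pinyin_sim)
```

Under these toy settings the target 期权 scores 0.8 and the distractor 放弃 scores 0.5, so a phonetic substitution score ranks the intended keyword first even though the hypothesis never contains it verbatim.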

What carries the argument

The JSPG joint retrieval system, which augments semantic matching with pinyin phonetic similarity and glyph structural similarity, then applies extended Smith-Waterman alignment to score sequences against keywords.
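To make the filtering step concrete, the sketch below ranks dictionary entries by a combined score and keeps the top K. The page does not state how the three signals are combined, so the weighted sum, the equal default weights, the names `joint_score` and `filter_dictionary`, and every number in `TOY_SCORES` are assumptions, not values from the paper.

```python
from typing import Dict, List, Tuple

# (semantic, pinyin, glyph) similarities, each assumed to lie in [0, 1].
Scores = Tuple[float, float, float]


def joint_score(
    sem: float,
    pin: float,
    gly: float,
    weights: Scores = (1 / 3, 1 / 3, 1 / 3),
) -> float:
    """Assumed combination rule: a plain weighted sum of the three signals."""
    w_sem, w_pin, w_gly = weights
    return w_sem * sem + w_pin * pin + w_gly * gly


def filter_dictionary(
    scores: Dict[str, Scores],
    k: int,
    weights: Scores = (1 / 3, 1 / 3, 1 / 3),
) -> List[str]:
    """Return the k keywords with the highest joint score."""
    ranked = sorted(
        scores,
        key=lambda kw: joint_score(*scores[kw], weights=weights),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical scores for a hypothesis containing the error 弃权.
TOY_SCORES: Dict[str, Scores] = {
    "期权": (0.30, 0.90, 0.60),  # target: weak semantics, strong pinyin
    "放弃": (0.80, 0.40, 0.50),  # semantic neighbour of the error
    "气球": (0.10, 0.50, 0.10),  # unrelated entry
}

semantic_only = filter_dictionary(TOY_SCORES, k=1, weights=(1.0, 0.0, 0.0))
joint = filter_dictionary(TOY_SCORES, k=1)
```

With these invented numbers the semantic-only ranking keeps 放弃, while the equal-weight joint ranking keeps 期权, mirroring the two paths in Figure 1.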

If this is right

  • JSPG outperforms single-feature baselines on the Aishell-1 and RWCS-NER datasets.
  • Downstream contextual ASR models guided by JSPG achieve substantial improvements in keyword recognition accuracy.
  • The method reduces noise from excessive irrelevant candidates when large-scale keyword dictionaries are used.
  • Pinyin handles phonetic similarity while glyph filters out homophones that semantic retrieval alone cannot distinguish.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same multi-feature idea might help error recovery in non-Chinese ASR systems that suffer from frequent sound-alike substitutions.
  • Replacing the extended Smith-Waterman step with a learned sequence scorer could further tighten the link between character metrics and final filtering decisions.
  • Applying the filter inside a streaming decoder rather than on N-best lists could reduce latency while preserving the accuracy gains.

Load-bearing premise

Phonetic cues from pinyin and structural cues from glyph remain sufficiently discriminative and non-redundant even after the base ASR model has already introduced homophonic distortions.

What would settle it

Apply JSPG and a semantic-only baseline to a held-out Chinese speech set containing many homophone substitutions in the N-best hypotheses; if keyword recall and downstream ASR accuracy show no gain, the joint approach adds nothing beyond semantic retrieval.
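One way to run that comparison is to measure keyword recall at a fixed cutoff K for each retriever. The sketch below is an illustration under assumptions: the function name `recall_at_k` and the toy retrieval lists are hypothetical, and the page does not say which retrieval metric the paper reports.

```python
from typing import Sequence, Set


def recall_at_k(
    retrieved: Sequence[Sequence[str]],
    gold: Sequence[Set[str]],
    k: int,
) -> float:
    """Fraction of gold keywords found in the top-k list, pooled over utterances."""
    hits = 0
    total = 0
    for candidates, targets in zip(retrieved, gold):
        top_k = set(candidates[:k])
        hits += len(top_k & targets)
        total += len(targets)
    return hits / total if total else 0.0


# Invented example: two utterances, one gold keyword each.
gold_keywords = [{"期权"}, {"苏州"}]
semantic_lists = [["放弃", "期权"], ["苏州", "宿州"]]
joint_lists = [["期权", "放弃"], ["苏州", "宿州"]]

semantic_r1 = recall_at_k(semantic_lists, gold_keywords, k=1)
joint_r1 = recall_at_k(joint_lists, gold_keywords, k=1)
```

If the two recall values coincide across a range of K on a homophone-heavy test set, that would count against the joint approach in the sense described above.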

Figures

Figures reproduced from arXiv: 2605.16896 by Shilin Zhou, Zhenghua Li.

Figure 1: Illustration of the retrieval process. Upper Path: The semantic retriever is misled by the ASR error “弃权 (Abstention)” and retrieves “放弃 (Give Up)”. Lower Path: Our JSPG retriever utilizes the joint semantic-pinyin-glyph features to correctly retrieve the target “期权 (Options)”.
Figure 2: Overview of the proposed JSPG filtering framework. Given an input utterance, a base ASR model…
Figure 3: The performance of retrieval methods on Aishell, DC, and ICI datasets. The x-axis indicates the K value…
Original abstract

Contextual Automatic Speech Recognition (ASR) faces challenges with large-scale keyword dictionaries, as excessive irrelevant candidates introduce noise that degrades accuracy. To address this, dynamic filtering typically uses a base ASR model to generate preliminary hypotheses, followed by semantic text retrievers to fetch a concise subset of relevant keywords. However, this approach frequently fails in Chinese ASR. Base models often produce homophonic or near-homophonic errors that preserve the phonetic cues of the target keywords but severely distort their semantic meaning, rendering standard semantic retrievers ineffective. To resolve this, we propose a filtering framework that jointly integrates Semantic, Pinyin, and Glyph features (JSPG). Pinyin effectively retrieves targets based on phonetic similarity, while glyph provides complementary structural cues to filter out numerous irrelevant homophones inherent in Chinese. To bridge the gap between character-level pinyin/glyph metrics and sequence-level filtering, we introduce an extended Smith-Waterman algorithm that computes similarity scores between the N-best hypothesis sequences and keywords. Experiments on the Aishell-1 and RWCS-NER datasets demonstrate that JSPG significantly outperforms single-feature baselines. Furthermore, downstream contextual ASR models guided by JSPG achieve substantial improvements in keyword recognition accuracy.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes JSPG, a dynamic dictionary filtering framework for Chinese contextual ASR that jointly retrieves keywords using semantic, pinyin, and glyph features. It introduces an extended Smith-Waterman algorithm to compute similarity scores between N-best ASR hypotheses and dictionary entries, addressing cases where homophonic ASR errors distort semantic meaning while preserving phonetic and structural cues. Experiments on the Aishell-1 and RWCS-NER datasets are claimed to show that JSPG outperforms single-feature baselines, with downstream contextual ASR models achieving substantial gains in keyword recognition accuracy.

Significance. If the experimental claims hold after verification, the work addresses a practical bottleneck in scaling contextual ASR for Chinese by reducing noise from large dictionaries while leveraging language-specific phonetic and glyph information. The joint feature approach and sequence-level alignment extension are well-motivated responses to known limitations of semantic-only retrievers under ASR distortions. The paper correctly identifies the homophony problem as central and provides a targeted, multi-cue solution that could be useful for other logographic languages.

major comments (2)
  1. [Method] Method section on extended Smith-Waterman: the central claim that this alignment reliably bridges character-level pinyin/glyph metrics to sequence-level filtering decisions after ASR distortions is load-bearing for the outperformance result, yet the manuscript provides no ablation replacing the extension with standard Levenshtein distance or cosine similarity on embeddings; without this, it remains unclear whether the reported gains over single-feature baselines depend on the specific extension or would arise from simpler alignment.
  2. [Experiments] Experiments section (results on Aishell-1 and RWCS-NER): the claim of significant outperformance and substantial downstream improvements lacks reported numerical values, exact baselines, statistical tests, or error analysis in the provided text; this directly affects verifiability of the strongest claim that JSPG outperforms single-feature baselines.
minor comments (2)
  1. [Abstract] The abstract would be strengthened by including at least one key quantitative result (e.g., relative WER reduction or keyword accuracy gain) to support the outperformance statement.
  2. [Method] Notation for the joint similarity score combining semantic, pinyin, and glyph components should be explicitly defined with an equation or pseudocode for reproducibility.
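For reference, the standard Levenshtein baseline named in major comment 1 can be sketched as follows. This is the textbook dynamic program with unit costs and hard character matching, not code from the paper. The example hypotheses are invented.

```python
from typing import Sequence


def levenshtein(a: Sequence[str], b: Sequence[str]) -> int:
    """Global edit distance with unit insert, delete, and substitute costs."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(
                prev[j] + 1,         # delete x from a
                curr[j - 1] + 1,     # insert y into a
                prev[j - 1] + cost,  # match or substitute
            ))
        prev = curr
    return prev[-1]


# Invented hypothesis containing the Figure 1 error 弃权.
dist_target = levenshtein("我想买入弃权", "期权")
dist_distractor = levenshtein("我想买入弃权", "放弃")
```

In this toy case both keywords sit at distance 5 from the hypothesis, and four of those edits only account for the length difference. A global, hard-matching metric therefore cannot separate the target from the distractor here, which is the kind of case the requested ablation would probe.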

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and for recognizing the practical relevance of addressing homophony issues in Chinese contextual ASR. We address each major comment below and will revise the manuscript to improve clarity, verifiability, and empirical support.

Point-by-point responses
  1. Referee: [Method] Method section on extended Smith-Waterman: the central claim that this alignment reliably bridges character-level pinyin/glyph metrics to sequence-level filtering decisions after ASR distortions is load-bearing for the outperformance result, yet the manuscript provides no ablation replacing the extension with standard Levenshtein distance or cosine similarity on embeddings; without this, it remains unclear whether the reported gains over single-feature baselines depend on the specific extension or would arise from simpler alignment.

    Authors: We agree that an ablation would help isolate the contribution of the extended Smith-Waterman algorithm. The extension is specifically motivated by its local alignment properties, which tolerate insertions, deletions, and substitutions typical in ASR N-best hypotheses while jointly scoring pinyin and glyph matches at the character level; simpler global metrics like Levenshtein or embedding cosine may not handle partial matches or multi-cue weighting as effectively. In the revision we will add a dedicated ablation subsection comparing the extended Smith-Waterman against (i) standard Levenshtein distance on the same multi-feature representations and (ii) cosine similarity on averaged embeddings, reporting the resulting keyword retrieval F1 and downstream ASR accuracy to demonstrate that the sequence-level alignment is responsible for the observed gains. revision: yes

  2. Referee: [Experiments] Experiments section (results on Aishell-1 and RWCS-NER): the claim of significant outperformance and substantial downstream improvements lacks reported numerical values, exact baselines, statistical tests, or error analysis in the provided text; this directly affects verifiability of the strongest claim that JSPG outperforms single-feature baselines.

    Authors: We acknowledge that the current text does not present the full numerical results, baseline specifications, or statistical details needed for immediate verification. The experiments section of the manuscript contains tables and figures with concrete metrics, but we will expand it in the revision to include: exact numerical values (e.g., retrieval precision/recall and downstream CER/WER improvements on both datasets), precise descriptions of all single-feature baselines and their implementations, paired statistical significance tests, and a concise error analysis focused on homophonic error cases resolved by the joint features. These additions will make the outperformance claims fully transparent and reproducible. revision: yes
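The rebuttal promises paired significance tests without naming one. As an illustration only, the sketch below uses paired bootstrap resampling over per-utterance scores. The choice of test, the function name `paired_bootstrap_pvalue`, and the toy inputs are assumptions.

```python
import random
from typing import Sequence


def paired_bootstrap_pvalue(
    system_a: Sequence[float],
    system_b: Sequence[float],
    n_resamples: int = 10_000,
    seed: int = 0,
) -> float:
    """Fraction of resamples in which system A fails to beat system B.

    Inputs are per-utterance scores on the same utterances (higher is better).
    A small return value suggests that A's advantage is unlikely to be a
    sampling artefact.
    """
    if len(system_a) != len(system_b) or not system_a:
        raise ValueError("inputs must be non-empty and of equal length")
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(system_a, system_b)]
    n = len(diffs)
    not_better = 0
    for _ in range(n_resamples):
        # Resample utterances with replacement, keeping the pairing intact.
        total = sum(diffs[rng.randrange(n)] for _ in range(n))
        if total <= 0:
            not_better += 1
    return not_better / n_resamples


# Invented per-utterance keyword hits (1.0 = keyword recognised).
always_better = paired_bootstrap_pvalue([1.0] * 20, [0.0] * 20, n_resamples=200)
identical = paired_bootstrap_pvalue([1.0, 0.0, 1.0], [1.0, 0.0, 1.0], n_resamples=200)
```

Pairing matters because both systems are scored on the same utterances, so resampling the per-utterance differences keeps utterance difficulty from inflating the variance.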

Circularity Check

0 steps flagged

No circularity detected; empirical claims rest on external dataset experiments

Full rationale

The paper proposes the JSPG framework as a practical joint retrieval method using semantic, pinyin, and glyph features plus an extended Smith-Waterman alignment to filter keyword dictionaries for Chinese contextual ASR. All load-bearing claims of outperformance and downstream accuracy gains are tied directly to experimental results on the independent Aishell-1 and RWCS-NER datasets rather than any internal derivation, fitted parameter renamed as prediction, or self-citation chain. No equations, uniqueness theorems, or ansatzes are introduced that reduce the reported improvements to quantities defined by the method's own inputs or prior author work.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that base ASR errors are predominantly homophonic and that pinyin plus glyph supply independent complementary signal; no free parameters or invented entities are described in the abstract.

axioms (1)
  • Domain assumption: Base ASR models produce homophonic or near-homophonic errors that preserve phonetic cues but distort semantic meaning, rendering standard semantic retrievers ineffective.
    Explicitly stated in the abstract as the core motivation for adding pinyin and glyph.


Reference graph

Works this paper leans on

27 extracted references · 27 canonical work pages · 3 internal anchors

  1. [1]

    Uri Alon, Golan Pundak, and Tara N Sainath. 2019. Contextual speech recognition with difficult negative training examples. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6440--6444. IEEE

  2. [2]

    Ye Bai, Jingping Chen, Jitong Chen, Wei Chen, Zhuo Chen, Chuang Ding, Linhao Dong, Qianqian Dong, Yujiao Du, Kepan Gao, and 1 others. 2024. Seed-asr: Understanding diverse speech and contexts with llm-based speech recognition. arXiv preprint arXiv:2407.04675

  3. [3]

    Hui Bu, Jiayu Du, Xingyu Na, Bengu Wu, and Hao Zheng. 2017. Aishell-1: An open-source mandarin speech corpus and a speech recognition baseline. In 2017 20th Conference of the Oriental Chapter of the International Coordinating Committee on Speech Databases and Speech I/O Systems and Assessment (O-COCOSDA), pages 1--5

  4. [4]

    Boli Chen, Guangwei Xu, Xiaobin Wang, Pengjun Xie, Meishan Zhang, and Fei Huang. 2022. Aishell-ner: Named entity recognition from chinese speech. In 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 8352--8356

  5. [5]

    Yifu Chen, Shengpeng Ji, Haoxiao Wang, Ziqing Wang, Siyu Chen, Jinzheng He, Jin Xu, and Zhou Zhao. 2025. https://doi.org/10.18653/v1/2025.acl-long.613 WavRAG: Audio-integrated retrieval augmented generation for spoken dialogue models. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pa...

  6. [6]

    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), pages 4171--4186, Minneapolis, Minnesota. Association for Computational Linguistics

  7. [7]

    Siskos Dimitrios, Stavros Papadopoulos, Pablo Peso Parada, Jisi Zhang, Karthikeyan Saravanan, and Anastasios Drosou. 2025. https://doi.org/10.18653/v1/2025.findings-emnlp.768 Retrieval augmented generation based context discovery for ASR. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 14247--14254, Suzhou, China. Associat...

  8. [8]

    Xun Gong, Anqi Lv, Zhiming Wang, Huijia Zhu, and Yanmin Qian. 2025. Br-asr: Efficient and scalable bias retrieval framework for contextual biasing asr in speech llm. arXiv preprint arXiv:2505.19179

  9. [9]

    Zhihong Lei, Xingyu Na, Mingbin Xu, Ernest Pusateri, Christophe Van Gysel, Yuanyuan Zhang, Shiyi Han, and Zhen Huang. 2025. Contextualization of asr with llm using phonetic retrieval-based augmentation. In ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1--5. IEEE

  10. [11]

    Shaojun Li, Hengchao Shang, Daimeng Wei, Jiaxin Guo, Zongyao Li, Xianghui He, Min Zhang, and Hao Yang. 2024b. La-rag: Enhancing llm-based asr accuracy with retrieval-augmented generation. arXiv preprint arXiv:2409.08597

  11. [12]

    Puneet Mathur, Zhe Liu, Ke Li, Yingyi Ma, Gil Karen, Zeeshan Ahmed, Dinesh Manocha, and Xuedong Zhang. 2024. https://aclanthology.org/2024.lrec-main.457/ DOC-RAG: ASR language model personalization with domain-distributed co-occurrence retrieval augmentation. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Langu...

  12. [13]

    ASR Omnilingual, Gil Keren, Artyom Kozhevnikov, Yen Meng, Christophe Ropers, Matthew Setzler, Skyler Wang, Ife Adebara, Michael Auli, Can Balioglu, and 1 others. 2025. Omnilingual asr: Open-source multilingual speech recognition for 1600+ languages. arXiv preprint arXiv:2511.09690

  13. [14]

    OpenAI. 2025. https://arxiv.org/abs/2508.10925 gpt-oss-120b and gpt-oss-20b model card . Preprint, arXiv:2508.10925

  14. [15]

    Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), pages 2227--2237, New Orleans, Louisiana. Association for Computational Linguistics

  15. [16]

    Golan Pundak, Tara N Sainath, Rohit Prabhavalkar, Anjuli Kannan, and Ding Zhao. 2018. Deep context: end-to-end contextual speech recognition. In 2018 IEEE spoken language technology workshop (SLT), pages 418--425. IEEE

  16. [17]

    Ziheng Qiao, Houquan Zhou, Yumeng Liu, Zhenghua Li, Min Zhang, Bo Zhang, Chen Li, Ji Zhang, and Fei Huang. 2025. https://doi.org/10.18653/v1/2025.acl-long.1373 DISC: Plug-and-play decoding intervention with similarity of characters for Chinese spelling check. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Vo...

  17. [18]

    Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. 2022. Robust speech recognition via large-scale weak supervision. arXiv preprint arXiv:2212.04356

  18. [19]

    Temple F Smith, Michael S Waterman, and 1 others. 1981. Identification of common molecular subsequences. Journal of molecular biology, 147(1):195--197

  19. [20]

    Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, and Tie-Yan Liu. 2020. Mpnet: Masked and permuted pre-training for language understanding. Advances in neural information processing systems, 33:16857--16867

  20. [21]

    Yui Sudo, Yosuke Fukumoto, Muhammad Shakeel, Yifan Peng, and Shinji Watanabe. 2024a. Contextualized automatic speech recognition with dynamic vocabulary. In 2024 IEEE Spoken Language Technology Workshop (SLT), pages 78--85. IEEE

  21. [22]

    Yui Sudo, Muhammad Shakeel, Yosuke Fukumoto, Yifan Peng, and Shinji Watanabe. 2024b. Contextualized automatic speech recognition with attention-based bias phrase boosted beam search. In ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 10896--10900. IEEE

  22. [23]

    Li Hai Tan, Angela R Laird, Karl Li, and Peter T Fox. 2005. Neuroanatomical correlates of phonological processing of chinese characters and alphabetic words: A meta-analysis. Human brain mapping, 25(1):83--91

  23. [24]

    Cihan Xiao, Zejiang Hou, Daniel Garcia-Romero, and Kyu J Han. 2025. Contextual asr with retrieval augmented large language model. In ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1--5. IEEE

  24. [25]

    Yanzhao Zhang, Mingxin Li, Dingkun Long, Xin Zhang, Huan Lin, Baosong Yang, Pengjun Xie, An Yang, Dayiheng Liu, Junyang Lin, and 1 others. 2025. Qwen3 embedding: Advancing text embedding and reranking through foundation models. arXiv preprint arXiv:2506.05176

  25. [26]

    Shilin Zhou and Zhenghua Li. 2025. Improving contextual asr via multi-grained fusion with large language models. arXiv preprint arXiv:2507.12252

  26. [27]

    Shilin Zhou, Zhenghua Li, Chen Gong, Lei Zhang, Yu Hong, and Min Zhang. 2024a. https://doi.org/10.18653/v1/2024.findings-acl.111 Chinese spoken named entity recognition in real-world scenarios: Dataset and approaches. In Findings of the Association for Computational Linguistics: ACL 2024, pages 1872--1884, Bangkok, Thailand. Association for Computatio...

  27. [28]

    Shilin Zhou, Zhenghua Li, Yu Hong, Min Zhang, Zhefeng Wang, and Baoxing Huai. 2024b. https://doi.org/10.18653/v1/2024.acl-long.147 CopyNE: Better contextual ASR by copying named entities. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2675--2686, Bangkok, Thailand. Associatio...