JSPG: Dynamic Dictionary Filtering via Joint Semantic-Pinyin-Glyph Retrieval for Chinese Contextual ASR
Pith reviewed 2026-05-19 20:52 UTC · model grok-4.3
The pith
Joint semantic-pinyin-glyph retrieval filters large keyword dictionaries more effectively for Chinese contextual ASR by recovering from homophonic errors that defeat standard semantic methods.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The JSPG framework jointly integrates semantic, pinyin, and glyph features for dynamic dictionary filtering. Pinyin retrieves targets via phonetic similarity to counter homophonic distortions from the base ASR model, while glyph supplies complementary structural cues to discard numerous irrelevant homophones typical in Chinese. An extended Smith-Waterman algorithm computes similarity scores between N-best hypothesis sequences and keywords, bridging character-level pinyin/glyph metrics to sequence-level filtering decisions.
What carries the argument
The JSPG joint retrieval system, which augments semantic matching with pinyin phonetic similarity and glyph structural similarity, then applies extended Smith-Waterman alignment to score sequences against keywords.
If this is right
- JSPG outperforms single-feature baselines on the Aishell-1 and RWCS-NER datasets.
- Downstream contextual ASR models guided by JSPG achieve substantial improvements in keyword recognition accuracy.
- The method reduces noise from excessive irrelevant candidates when large-scale keyword dictionaries are used.
- Pinyin handles phonetic similarity while glyph filters out homophones that semantic retrieval alone cannot distinguish.
Where Pith is reading between the lines
- The same multi-feature idea might help error recovery in non-Chinese ASR systems that suffer from frequent sound-alike substitutions.
- Replacing the extended Smith-Waterman step with a learned sequence scorer could further tighten the link between character metrics and final filtering decisions.
- Applying the filter inside a streaming decoder rather than on N-best lists could reduce latency while preserving the accuracy gains.
Load-bearing premise
Phonetic cues from pinyin and structural cues from glyph remain sufficiently discriminative and non-redundant even after the base ASR model has already introduced homophonic distortions.
What would settle it
Apply JSPG and a semantic-only baseline to a held-out Chinese speech set containing many homophone substitutions in the N-best hypotheses; if keyword recall and downstream ASR accuracy show no gain, the joint approach adds nothing beyond semantic retrieval.
Figures
read the original abstract
Contextual Automatic Speech Recognition (ASR) faces challenges with large-scale keyword dictionaries, as excessive irrelevant candidates introduce noise that degrades accuracy. To address this, dynamic filtering typically uses a base ASR model to generate preliminary hypotheses, followed by semantic text retrievers to fetch a concise subset of relevant keywords. However, this approach frequently fails in Chinese ASR. Base models often produce homophonic or near-homophonic errors that preserve the phonetic cues of the target keywords but severely distort their semantic meaning, rendering standard semantic retrievers ineffective. To resolve this, we propose a filtering framework that jointly integrates Semantic, Pinyin, and Glyph features (JSPG). Pinyin effectively retrieves targets based on phonetic similarity, while glyph provides complementary structural cues to filter out numerous irrelevant homophones inherent in Chinese. To bridge the gap between character-level pinyin/glyph metrics and sequence-level filtering, we introduce an extended Smith-Waterman algorithm that computes similarity scores between the N-best hypothesis sequences and keywords. Experiments on the Aishell-1 and RWCS-NER datasets demonstrate that JSPG significantly outperforms single-feature baselines. Furthermore, downstream contextual ASR models guided by JSPG achieve substantial improvements in keyword recognition accuracy.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes JSPG, a dynamic dictionary filtering framework for Chinese contextual ASR that jointly retrieves keywords using semantic, pinyin, and glyph features. It introduces an extended Smith-Waterman algorithm to compute similarity scores between N-best ASR hypotheses and dictionary entries, addressing cases where homophonic ASR errors distort semantic meaning while preserving phonetic and structural cues. Experiments on the Aishell-1 and RWCS-NER datasets are claimed to show that JSPG outperforms single-feature baselines, with downstream contextual ASR models achieving substantial gains in keyword recognition accuracy.
Significance. If the experimental claims hold after verification, the work addresses a practical bottleneck in scaling contextual ASR for Chinese by reducing noise from large dictionaries while leveraging language-specific phonetic and glyph information. The joint feature approach and sequence-level alignment extension are well-motivated responses to known limitations of semantic-only retrievers under ASR distortions. The paper correctly identifies the homophony problem as central and provides a targeted, multi-cue solution that could be useful for other logographic languages.
major comments (2)
- [Method] Method section on extended Smith-Waterman: the central claim that this alignment reliably bridges character-level pinyin/glyph metrics to sequence-level filtering decisions after ASR distortions is load-bearing for the outperformance result, yet the manuscript provides no ablation replacing the extension with standard Levenshtein distance or cosine similarity on embeddings; without this, it remains unclear whether the reported gains over single-feature baselines depend on the specific extension or would arise from simpler alignment.
- [Experiments] Experiments section (results on Aishell-1 and RWCS-NER): the claim of significant outperformance and substantial downstream improvements lacks reported numerical values, exact baselines, statistical tests, or error analysis in the provided text; this directly affects verifiability of the strongest claim that JSPG outperforms single-feature baselines.
minor comments (2)
- [Abstract] The abstract would be strengthened by including at least one key quantitative result (e.g., relative WER reduction or keyword accuracy gain) to support the outperformance statement.
- [Method] Notation for the joint similarity score combining semantic, pinyin, and glyph components should be explicitly defined with an equation or pseudocode for reproducibility.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and for recognizing the practical relevance of addressing homophony issues in Chinese contextual ASR. We address each major comment below and will revise the manuscript to improve clarity, verifiability, and empirical support.
read point-by-point responses
-
Referee: [Method] Method section on extended Smith-Waterman: the central claim that this alignment reliably bridges character-level pinyin/glyph metrics to sequence-level filtering decisions after ASR distortions is load-bearing for the outperformance result, yet the manuscript provides no ablation replacing the extension with standard Levenshtein distance or cosine similarity on embeddings; without this, it remains unclear whether the reported gains over single-feature baselines depend on the specific extension or would arise from simpler alignment.
Authors: We agree that an ablation would help isolate the contribution of the extended Smith-Waterman algorithm. The extension is specifically motivated by its local alignment properties, which tolerate insertions, deletions, and substitutions typical in ASR N-best hypotheses while jointly scoring pinyin and glyph matches at the character level; simpler global metrics like Levenshtein or embedding cosine may not handle partial matches or multi-cue weighting as effectively. In the revision we will add a dedicated ablation subsection comparing the extended Smith-Waterman against (i) standard Levenshtein distance on the same multi-feature representations and (ii) cosine similarity on averaged embeddings, reporting the resulting keyword retrieval F1 and downstream ASR accuracy to demonstrate that the sequence-level alignment is responsible for the observed gains. revision: yes
-
Referee: [Experiments] Experiments section (results on Aishell-1 and RWCS-NER): the claim of significant outperformance and substantial downstream improvements lacks reported numerical values, exact baselines, statistical tests, or error analysis in the provided text; this directly affects verifiability of the strongest claim that JSPG outperforms single-feature baselines.
Authors: We acknowledge that the current text does not present the full numerical results, baseline specifications, or statistical details needed for immediate verification. The experiments section of the manuscript contains tables and figures with concrete metrics, but we will expand it in the revision to include: exact numerical values (e.g., retrieval precision/recall and downstream CER/WER improvements on both datasets), precise descriptions of all single-feature baselines and their implementations, paired statistical significance tests, and a concise error analysis focused on homophonic error cases resolved by the joint features. These additions will make the outperformance claims fully transparent and reproducible. revision: yes
Circularity Check
No circularity detected; empirical claims rest on external dataset experiments
full rationale
The paper proposes the JSPG framework as a practical joint retrieval method using semantic, pinyin, and glyph features plus an extended Smith-Waterman alignment to filter keyword dictionaries for Chinese contextual ASR. All load-bearing claims of outperformance and downstream accuracy gains are tied directly to experimental results on the independent Aishell-1 and RWCS-NER datasets rather than any internal derivation, fitted parameter renamed as prediction, or self-citation chain. No equations, uniqueness theorems, or ansatzes are introduced that reduce the reported improvements to quantities defined by the method's own inputs or prior author work.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Base ASR models produce homophonic or near-homophonic errors that preserve phonetic cues but distort semantic meaning, rendering standard semantic retrievers ineffective.
Reference graph
Works this paper leans on
-
[1]
Uri Alon, Golan Pundak, and Tara N Sainath. 2019. Contextual speech recognition with difficult negative training examples. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6440--6444. IEEE
work page 2019
- [2]
-
[3]
Hui Bu, Jiayu Du, Xingyu Na, Bengu Wu, and Hao Zheng. 2017. Aishell-1: An open-source mandarin speech corpus and a speech recognition baseline. In 2017 20th Conference of the Oriental Chapter of the International Coordinating Committee on Speech Databases and Speech I/O Systems and Assessment (O-COCOSDA), pages 1--5
work page 2017
-
[4]
Boli Chen, Guangwei Xu, Xiaobin Wang, Pengjun Xie, Meishan Zhang, and Fei Huang. 2022. Aishell-ner: Named entity recognition from chinese speech. In 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 8352--8356
work page 2022
-
[5]
Yifu Chen, Shengpeng Ji, Haoxiao Wang, Ziqing Wang, Siyu Chen, Jinzheng He, Jin Xu, and Zhou Zhao. 2025. https://doi.org/10.18653/v1/2025.acl-long.613 W av RAG : Audio-integrated retrieval augmented generation for spoken dialogue models . In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pa...
-
[6]
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT : Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), pages 4171--4186, Minneapolis, Minnesota. Association for Computational Linguistics
work page 2019
-
[7]
Siskos Dimitrios, Stavros Papadopoulos, Pablo Peso Parada, Jisi Zhang, Karthikeyan Saravanan, and Anastasios Drosou. 2025. https://doi.org/10.18653/v1/2025.findings-emnlp.768 Retrieval augmented generation based context discovery for ASR . In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 14247--14254, Suzhou, China. Associat...
- [8]
-
[9]
Zhihong Lei, Xingyu Na, Mingbin Xu, Ernest Pusateri, Christophe Van Gysel, Yuanyuan Zhang, Shiyi Han, and Zhen Huang. 2025. Contextualization of asr with llm using phonetic retrieval-based augmentation. In ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1--5. IEEE
work page 2025
- [11]
-
[12]
Puneet Mathur, Zhe Liu, Ke Li, Yingyi Ma, Gil Karen, Zeeshan Ahmed, Dinesh Manocha, and Xuedong Zhang. 2024. https://aclanthology.org/2024.lrec-main.457/ DOC - RAG : ASR language model personalization with domain-distributed co-occurrence retrieval augmentation . In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Langu...
work page 2024
-
[13]
ASR Omnilingual, Gil Keren, Artyom Kozhevnikov, Yen Meng, Christophe Ropers, Matthew Setzler, Skyler Wang, Ife Adebara, Michael Auli, Can Balioglu, and 1 others. 2025. Omnilingual asr: Open-source multilingual speech recognition for 1600+ languages. arXiv preprint arXiv:2511.09690
-
[14]
OpenAI. 2025. https://arxiv.org/abs/2508.10925 gpt-oss-120b and gpt-oss-20b model card . Preprint, arXiv:2508.10925
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[15]
Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer
Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), pages 2227--2237, New Orleans, Louisiana. Association for Computational Linguistics
work page 2018
-
[16]
Golan Pundak, Tara N Sainath, Rohit Prabhavalkar, Anjuli Kannan, and Ding Zhao. 2018. Deep context: end-to-end contextual speech recognition. In 2018 IEEE spoken language technology workshop (SLT), pages 418--425. IEEE
work page 2018
-
[17]
Ziheng Qiao, Houquan Zhou, Yumeng Liu, Zhenghua Li, Min Zhang, Bo Zhang, Chen Li, Ji Zhang, and Fei Huang. 2025. https://doi.org/10.18653/v1/2025.acl-long.1373 DISC : Plug-and-play decoding intervention with similarity of characters for C hinese spelling check . In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Vo...
-
[18]
Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. 2022. Robust speech recognition via large-scale weak supervision. arXiv preprint arXiv:2212.04356
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[19]
Temple F Smith, Michael S Waterman, and 1 others. 1981. Identification of common molecular subsequences. Journal of molecular biology, 147(1):195--197
work page 1981
-
[20]
Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, and Tie-Yan Liu. 2020. Mpnet: Masked and permuted pre-training for language understanding. Advances in neural information processing systems, 33:16857--16867
work page 2020
-
[21]
Yui Sudo, Yosuke Fukumoto, Muhammad Shakeel, Yifan Peng, and Shinji Watanabe. 2024 a . Contextualized automatic speech recognition with dynamic vocabulary. In 2024 IEEE Spoken Language Technology Workshop (SLT), pages 78--85. IEEE
work page 2024
-
[22]
Yui Sudo, Muhammad Shakeel, Yosuke Fukumoto, Yifan Peng, and Shinji Watanabe. 2024 b . Contextualized automatic speech recognition with attention-based bias phrase boosted beam search. In ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 10896--10900. IEEE
work page 2024
-
[23]
Li Hai Tan, Angela R Laird, Karl Li, and Peter T Fox. 2005. Neuroanatomical correlates of phonological processing of chinese characters and alphabetic words: A meta-analysis. Human brain mapping, 25(1):83--91
work page 2005
-
[24]
Cihan Xiao, Zejiang Hou, Daniel Garcia-Romero, and Kyu J Han. 2025. Contextual asr with retrieval augmented large language model. In ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1--5. IEEE
work page 2025
-
[25]
Yanzhao Zhang, Mingxin Li, Dingkun Long, Xin Zhang, Huan Lin, Baosong Yang, Pengjun Xie, An Yang, Dayiheng Liu, Junyang Lin, and 1 others. 2025. Qwen3 embedding: Advancing text embedding and reranking through foundation models. arXiv preprint arXiv:2506.05176
work page internal anchor Pith review Pith/arXiv arXiv 2025
- [26]
-
[27]
Shilin Zhou, Zhenghua Li, Chen Gong, Lei Zhang, Yu Hong, and Min Zhang. 2024 a . https://doi.org/10.18653/v1/2024.findings-acl.111 C hinese spoken named entity recognition in real-world scenarios: Dataset and approaches . In Findings of the Association for Computational Linguistics: ACL 2024, pages 1872--1884, Bangkok, Thailand. Association for Computatio...
-
[28]
Shilin Zhou, Zhenghua Li, Yu Hong, Min Zhang, Zhefeng Wang, and Baoxing Huai. 2024 b . https://doi.org/10.18653/v1/2024.acl-long.147 C opy NE : Better contextual ASR by copying named entities . In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2675--2686, Bangkok, Thailand. Associatio...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.