Improving Performance of End-to-End ASR on Numeric Sequences
Pith reviewed 2026-05-25 11:27 UTC · model grok-4.3
The pith
End-to-end ASR models reduce word error rates on long numeric sequences by up to a factor of eight by augmenting training with TTS data and replacing large FST denormalizers with a small neural network.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Recognizing written-domain numeric utterances remains difficult for end-to-end models when numeric sequences are absent from training. Conventional pipelines address the issue by training on spoken-domain data and applying an FST verbalizer for denormalization, yet the verbalizer's memory footprint precludes its use in the on-device setting. Generating additional numeric training data with a text-to-speech system and substituting a small-footprint neural network for the FST verbalizer produces measurable gains across several numeric classes. The largest improvement occurs on the longest numeric sequences, where word error rate falls by up to a factor of eight.
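As a rough illustration of the augmentation step, the sketch below samples random digit strings and emits paired written-domain and spoken-domain texts, where the spoken text is what would be handed to a TTS system. The function names, the carrier phrase, and the uniform sampling are assumptions made for this example and are not taken from the paper.

```python
import random

DIGIT_WORDS = ["zero", "one", "two", "three", "four",
               "five", "six", "seven", "eight", "nine"]

def sample_numeric_pair(rng, min_len=3, max_len=10):
    """Return (written, spoken) text for one random digit string."""
    length = rng.randint(min_len, max_len)
    digits = [rng.randrange(10) for _ in range(length)]
    written = "".join(str(d) for d in digits)
    spoken = " ".join(DIGIT_WORDS[d] for d in digits)
    return written, spoken

def build_tts_prompts(n, seed=0, carrier="my code is {}"):
    """Build n prompt pairs; `carrier` is a hypothetical template sentence."""
    rng = random.Random(seed)  # fixed seed keeps the generated set reproducible
    prompts = []
    for _ in range(n):
        written, spoken = sample_numeric_pair(rng)
        prompts.append({
            "written": carrier.format(written),  # transcript for written-domain training
            "spoken": carrier.format(spoken),    # text to synthesize with TTS
        })
    return prompts
```

Longer values of `max_len` would target the long-sequence tail where the summary reports the largest gains.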
What carries the argument
A small-footprint neural network trained to map spoken-domain numeric output to written-domain form, used together with TTS-generated numeric utterances to augment the training set.
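To make the input and output of such a denormalizer concrete, here is a toy rule-based stand-in that handles digit runs only. It is not the paper's neural network; the function name and word tables are assumptions for illustration.

```python
DIGITS = {"zero": "0", "oh": "0", "one": "1", "two": "2", "three": "3",
          "four": "4", "five": "5", "six": "6", "seven": "7",
          "eight": "8", "nine": "9"}
REPEATS = {"double": 2, "triple": 3}

def denormalize_digit_runs(spoken: str) -> str:
    """Map spoken-domain digit words to a written-domain digit string."""
    written = []
    run = []               # digits of the numeric run being built
    pending_repeat = None  # "double"/"triple" still waiting for its digit

    def flush():
        nonlocal pending_repeat
        if run:
            written.append("".join(run))
            run.clear()
        if pending_repeat is not None:
            # The repeat word was not followed by a digit: keep it as a word.
            written.append(pending_repeat)
            pending_repeat = None

    for tok in spoken.split():
        if tok in DIGITS:
            count = REPEATS[pending_repeat] if pending_repeat else 1
            run.append(DIGITS[tok] * count)
            pending_repeat = None
        elif tok in REPEATS and pending_repeat is None:
            pending_repeat = tok
        else:
            flush()
            written.append(tok)
    flush()
    return " ".join(written)
```

For example, `denormalize_digit_runs("call double two double one oh")` returns `"call 22110"`. The toy also rewrites an isolated "one" as "1", which is often wrong in ordinary sentences; context-dependent cases like that are one reason a learned model is attractive here.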
If this is right
- Recognition accuracy improves on several distinct numeric classes such as prices, phone numbers, and dates.
- Word error rate on the longest numeric sequences drops by as much as a factor of eight.
- The entire pipeline remains compatible with the strict memory limits of on-device ASR because the neural denormalizer replaces the large FST component.
- End-to-end models can be trained to handle out-of-vocabulary numeric material without depending on the FST verbalizer of a conventional spoken-domain pipeline.
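The word error rate behind these claims can be computed as a word-level edit distance divided by the reference length. The sketch below is a standard textbook formulation, not the paper's scoring tool, and the example utterances are made up.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the ref words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,               # deletion
                            curr[j - 1] + 1,           # insertion
                            prev[j - 1] + (r != h)))   # substitution or match
        prev = curr
    return prev[-1] / len(ref)
```

In the written domain a long digit string is a single token, so one wrong digit makes the whole token an error: scoring `"call 22100 now"` against `"call 22110 now"` gives 1/3, while the same slip scored digit by digit in the spoken domain (`"call two two one zero zero now"` against `"call two two one one zero now"`) gives 1/7.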
Where Pith is reading between the lines
- The same TTS-plus-neural-denormalizer pattern could be applied to other categories of rare tokens, such as proper names or technical terms, that also suffer from domain mismatch.
- Placing the neural denormalizer inside the end-to-end model itself rather than as a post-processing step might further reduce latency on resource-constrained devices.
- Measuring how performance changes when the TTS voices are drawn from a wider range of accents would test whether the current gains hold under more varied real-world conditions.
Load-bearing premise
The distribution of numeric utterances produced by the text-to-speech system is close enough to real user speech that models trained on the synthetic data will generalize to actual spoken input.
What would settle it
Evaluating the augmented model on a large set of real-user numeric utterances recorded in the target acoustic conditions and finding no reduction, or an increase, in word error rate relative to the unaugmented baseline would falsify the central claim.
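One way to run such a comparison is a paired bootstrap over utterances, which also yields the kind of uncertainty estimate the referee report asks for. This is a sketch under stated assumptions: the input layout, the function name, and the percentile interval are choices made here, not a procedure described in the paper.

```python
import random

def bootstrap_wer_delta(utts, n_resamples=1000, seed=0):
    """Percentile interval for (baseline WER - augmented WER).

    utts: list of (ref_words, baseline_errors, augmented_errors) tuples,
          one per real-user utterance, with ref_words > 0.
    Returns (low, high), an approximate 95% interval for the WER difference.
    """
    rng = random.Random(seed)
    deltas = []
    for _ in range(n_resamples):
        # Resample utterances with replacement, keeping both systems paired.
        sample = [rng.choice(utts) for _ in utts]
        words = sum(n for n, _, _ in sample)
        base = sum(b for _, b, _ in sample) / words
        aug = sum(a for _, _, a in sample) / words
        deltas.append(base - aug)
    deltas.sort()
    low = deltas[round(0.025 * (n_resamples - 1))]
    high = deltas[round(0.975 * (n_resamples - 1))]
    return low, high
```

An interval that includes zero, or lies below it, on a large real-audio test set would be the outcome described above as falsifying the central claim.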
Original abstract
Recognizing written domain numeric utterances (e.g. I need $1.25.) can be challenging for ASR systems, particularly when numeric sequences are not seen during training. This out-of-vocabulary (OOV) issue is addressed in conventional ASR systems by training part of the model on spoken domain utterances (e.g. I need one dollar and twenty five cents.), for which numeric sequences are composed of in-vocabulary numbers, and then using an FST verbalizer to denormalize the result. Unfortunately, conventional ASR models are not suitable for the low memory setting of on-device speech recognition. E2E models such as RNN-T are attractive for on-device ASR, as they fold the AM, PM and LM of a conventional model into one neural network. However, in the on-device setting the large memory footprint of an FST denormer makes spoken domain training more difficult. In this paper, we investigate techniques to improve E2E model performance on numeric data. We find that using a text-to-speech system to generate additional numeric training data, as well as using a small-footprint neural network to perform spoken-to-written domain denorming, yields improvement in several numeric classes. In the case of the longest numeric sequences, we see reduction of WER by up to a factor of 8.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that for on-device E2E ASR (e.g., RNN-T), augmenting training with TTS-generated numeric utterances and replacing a large FST verbalizer with a small-footprint neural denormalizer improves recognition of written-domain numeric sequences, yielding WER reductions of up to a factor of 8 on the longest numeric classes.
Significance. If the reported gains prove robust, the approach would be significant for memory-constrained on-device ASR by addressing numeric OOV without relying on large FST components; the combination of data augmentation and compact denorming directly targets a practical deployment constraint.
major comments (2)
- [Abstract] Abstract: the claim of up to 8x WER reduction on longest sequences is presented without any baseline WER values, dataset sizes, model sizes, error bars, or ablation results, so it is impossible to determine whether the data support the stated improvement.
- [Abstract] Abstract: no details are supplied on numeric-sequence sampling for TTS, acoustic conditions modeled by the TTS system, speaker variability, or any side-by-side comparison of TTS-generated versus real numeric test utterances; this leaves the central assumption—that TTS data sufficiently approximates the target user-speech distribution—unverified and load-bearing for the reported gains.
minor comments (1)
- [Abstract] The abstract refers to 'several numeric classes' without defining the classes or providing a table that would allow readers to assess per-class results.
Simulated Author's Rebuttal
We thank the referee for the comments on the abstract. The full manuscript contains the supporting details and experiments; we address each point below and indicate where revisions to the abstract are feasible.
Point-by-point responses
-
Referee: [Abstract] Abstract: the claim of up to 8x WER reduction on longest sequences is presented without any baseline WER values, dataset sizes, model sizes, error bars, or ablation results, so it is impossible to determine whether the data support the stated improvement.
Authors: The abstract is a concise summary and therefore omits these specifics, but the full paper reports baseline WER values and the factor-of-8 improvement on the longest numeric class in Table 2, training-set sizes and TTS augmentation volumes in Section 3, model sizes in Section 2.1, and ablation results comparing TTS data and the neural denormalizer in Section 4. Error bars are not included because all runs used fixed random seeds; we can add a brief statement of the key baseline and improved WER numbers to the abstract if space permits.
revision: partial
-
Referee: [Abstract] Abstract: no details are supplied on numeric-sequence sampling for TTS, acoustic conditions modeled by the TTS system, speaker variability, or any side-by-side comparison of TTS-generated versus real numeric test utterances; this leaves the central assumption—that TTS data sufficiently approximates the target user-speech distribution—unverified and load-bearing for the reported gains.
Authors: Section 3.1 describes the numeric-sequence sampling procedure used to generate the TTS utterances. Section 3.2 specifies the acoustic conditions and speaker variability (multiple TTS voices) modeled by the TTS system. All reported WER numbers are measured on real user utterances; the performance gains on those real test sets (Section 4) therefore serve as the empirical verification that the TTS-augmented training distribution is sufficiently close to the target domain.
revision: no
Circularity Check
No circularity: empirical gains from external TTS data and independent NN denormer
Full rationale
The paper reports WER improvements from augmenting training with TTS-generated numeric utterances and replacing an FST verbalizer with a small-footprint neural denormalizer. No equations, fitted parameters, or predictions are defined in terms of the reported results. The approach relies on external TTS synthesis and a separate neural component whose accuracy is evaluated independently on real data. No self-citation chains, ansatzes, or renamings are load-bearing. The derivation is self-contained against external benchmarks.
Reference graph
Works this paper leans on
-
[1]
Improving Performance of End-to-End ASR on Numeric Sequences
Introduction An ongoing challenge of ASR systems is to model transcriptions that do not exactly reflect the words spoken in an utterance. For example, the spoken utterance “set an alarm for four fifteen” is typically decoded in the written form as “set an alarm for 4:15”. Numeric utterances, such as addresses, phone numbers, and postal codes are parti...
-
[2]
Methods In this section, we present different ideas explored to address the long-tail numeric issue of our RNN-T system. We give each approach a label, which we will reference in Table 2 below. 2.1. TTS Training Data (W1) To address the numeric data-sparsity issue, we generate additional training data that represent challenging and realistic numeric ...
-
[3]
22110” might be verbalized as “double two double one oh
Experiments 3.1. Data Sets Our experiments are conducted on a ∼30,000 hour training set consisting of 43 million English utterances. The training utterances are anonymized and hand-transcribed, and are representative of Google's voice search traffic in the United States. Multi-style training (MTR) data are created by artificially corrupting the clean ut...
-
[4]
Results Table 2 gives WER results for each of our experiments on the SAMPLED and TAIL test sets, as well as the real-audio VS and NUMERICS test sets. We use the labels given in Section 2 (W1, W2, S1, S2) for convenience. We use W0 to refer to the baseline RNN-T model. The results for the written domain models are characterized by a steep decline in quality...
-
[5]
Conclusions In this paper, we experimented with four approaches for improving end-to-end ASR performance on numeric utterances. We found that all approaches yield improvements, with the largest improvements occurring when TTS training data, spoken domain training, and neural denorming are all used together. The fact that we see the largest improveme...
-
[6]
Acknowledgements We thank Gabriel Mechali, Mark Epstein, Michael Riley, and Richard Sproat for help and comments on this work
-
[7]
Formatting time-aligned ASR transcripts for readability,
M. Shugrina, “Formatting time-aligned ASR transcripts for readability,” in Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, ser. HLT ’10. Stroudsburg, PA, USA: Association for Computational Linguistics, 2010, pp. 198–206. [Online]. Available: http://dl.acm.org/citation.c...
-
[8]
Language model verbalization for automatic speech recognition,
H. Sak, C. Allauzen, K. Nakajima, and F. Beaufay, “Language model verbalization for automatic speech recognition,” in Proc. ICASSP, 2013
-
[9]
Query Language Modeling for Voice Search,
C. Chelba, “Query Language Modeling for Voice Search,” in Proc. IEEE Workshop on Spoken Language Technology, 2010
-
[10]
Sequence-based class tagging for robust transcription in ASR,
L. Vasserman, V. Schogol, and K. Hall, “Sequence-based class tagging for robust transcription in ASR,” in INTERSPEECH, 2015
-
[11]
Neural models of text normalization for speech applications,
H. Zhang, R. Sproat, A. Ng, F. Stahlberg, X. Peng, K. Gorman, and B. Roark, “Neural models of text normalization for speech applications,” Computational Linguistics, vol. 45, no. 2, 2019
-
[12]
Streaming End-to-end Speech Recognition For Mobile Devices,
Y. He, T. N. Sainath, R. Prabhavalkar, I. McGraw, R. Alvarez, D. Zhao, D. Rybach, A. Kannan, Y. Wu, R. Pang, Q. Liang, D. Bhatia, Y. Shangguan, B. Li, G. Pundak, K. Sim, T. Bagby, S. Chang, K. Rao, and A. Gruenstein, “Streaming End-to-end Speech Recognition For Mobile Devices,” 2019
-
[13]
State-of-the-art speech recognition with sequence-to-sequence models,
C. Chiu, T. N. Sainath, Y. Wu, R. Prabhavalkar, P. Nguyen, Z. Chen, A. Kannan, R. J. Weiss, K. Rao, N. Jaitly, B. Li, and J. Chorowski, “State-of-the-art speech recognition with sequence-to-sequence models,” in Proc. ICASSP, 2018
-
[14]
Synthetic Data and Artificial Neural Networks for Natural Scene Text Recognition
M. Jaderberg, K. Simonyan, A. Vedaldi, and A. Zisserman, “Synthetic data and artificial neural networks for natural scene text recognition,” Conference on Neural Information Processing Systems, vol. abs/1406.2227, 2014. [Online]. Available: http://arxiv.org/abs/1406.2227
-
[15]
Synthetic Data for Text Localisation in Natural Images
A. Gupta, A. Vedaldi, and A. Zisserman, “Synthetic data for text localisation in natural images,” Computer Vision and Pattern Recognition, vol. abs/1604.06646, 2016. [Online]. Available: http://arxiv.org/abs/1604.06646
-
[16]
Training Deep Networks with Synthetic Data: Bridging the Reality Gap by Domain Randomization
J. Tremblay, A. Prakash, D. Acuna, M. Brophy, V. Jampani, C. Anil, T. To, E. Cameracci, S. Boochoon, and S. Birchfield, “Training deep networks with synthetic data: Bridging the reality gap by domain randomization,” Conference on Computer Vision and Pattern Recognition, vol. abs/1804.06516, 2018. [Online]. Available: http://arxiv.org/abs/1804.06516
-
[17]
Streaming End-to-end Speech Recognition For Mobile Devices
Y. He, T. Sainath, R. Prabhavalkar, I. McGraw, R. Alvarez, D. Zhao, D. Rybach, A. Kannan, Y. Wu, R. Pang, Q. Liang, D. Bhatia, Y. Shangguan, B. Li, G. Pundak, K. C. Sim, T. Bagby, S.-y. Chang, K. Rao, and A. Gruenstein, “Streaming end-to-end speech recognition for mobile devices,” 2019. [Online]. Available: https://arxiv.org/abs/1811.06621
-
[18]
A neural probabilistic language model,
Y. Bengio, R. Ducharme, and P. Vincent, “A neural probabilistic language model,” in Proceedings of the 13th International Conference on Neural Information Processing Systems, ser. Conference on Neural Information Processing Systems. Cambridge, MA, USA: MIT Press, 2000, pp. 893–899. [Online]. Available: http://dl.acm.org/citation.cfm?id=3008751.3008881
-
[19]
LSTM neural networks for language modeling,
M. Sundermeyer, R. Schlüter, and H. Ney, “LSTM neural networks for language modeling,” 09 2012
-
[20]
Recurrent neural network based language modeling in meeting recognition,
S. Kombrink, T. Mikolov, M. Karafiát, and L. Burget, “Recurrent neural network based language modeling in meeting recognition,” in INTERSPEECH, 2011
-
[21]
Multi-domain recurrent neural network language model for medical speech recognition,
O. Tilk and T. Alumäe, “Multi-domain recurrent neural network language model for medical speech recognition,” 09 2014
-
[22]
A Spelling Correction Model for End-to-End Speech Recognition,
J. Guo, T. N. Sainath, and R. J. Weiss, “A Spelling Correction Model for End-to-End Speech Recognition,” 2019
-
[23]
Neural error corrective language models for automatic speech recognition,
T. Tanaka, R. Masumura, H. Masataki, and Y. Aono, “Neural error corrective language models for automatic speech recognition,” in Proc. Interspeech 2018, 2018, pp. 401–405. [Online]. Available: http://dx.doi.org/10.21437/Interspeech.2018-1430
-
[24]
RNN Approaches to Text Normalization: A Challenge
R. Sproat and N. Jaitly, “RNN approaches to text normalization: A challenge,” arXiv preprint, vol. abs/1611.00068, 2016. [Online]. Available: http://arxiv.org/abs/1611.00068
-
[25]
C. Kim, A. Misra, K. Chin, T. Hughes, A. Narayanan, T. N. Sainath, and M. Bacchiani, “Generation of large-scale simulated utterances in virtual rooms to train deep-neural networks for far-field speech recognition in Google Home,” in Proc. Interspeech, 2017
-
[26]
Hierarchical generative modeling for controllable speech synthesis,
W.-N. Hsu, Y. Zhang, R. Weiss, H. Zen, Y. Wu, Y. Wang, Y. Cao, Y. Jia, Z. Chen, J. Shen, P. Nguyen, and R. Pang, “Hierarchical generative modeling for controllable speech synthesis,” in Proc. ICLR, 2019, to appear, 2019
-
[27]
Tacotron: Towards End-to-End Speech Synthesis
Y. Wang, R. J. Skerry-Ryan, D. Stanton, Y. Wu, R. J. Weiss, N. Jaitly, Z. Yang, Y. Xiao, Z. Chen, S. Bengio, Q. V. Le, Y. Agiomyrgiannakis, R. Clark, and R. A. Saurous, “Tacotron: A fully end-to-end text-to-speech synthesis model,” Proc. Interspeech, 2017. [Online]. Available: http://arxiv.org/abs/1703.10135
-
[28]
Efficient Neural Audio Synthesis
N. Kalchbrenner, E. Elsen, K. Simonyan, S. Noury, N. Casagrande, E. Lockhart, F. Stimberg, A. van den Oord, S. Dieleman, and K. Kavukcuoglu, “Efficient neural audio synthesis,” ICML, vol. abs/1802.08435, 2018. [Online]. Available: http://arxiv.org/abs/1802.08435
-
[30]
Parallel WaveNet: Fast High-Fidelity Speech Synthesis
[Online]. Available: http://arxiv.org/abs/1711.10433
-
[31]
Recent advances in google real-time hmm-driven unit selection synthesizer,
X. Gonzalvo, S. Tazari, C. an Chan, M. Becker, A. Gutkin, and H. Silen, “Recent advances in google real-time hmm-driven unit selection synthesizer,” in Proc. Interspeech, 2016
-
[32]
Tensorflow: Large-scale machine learning on heterogeneous distributed systems,
M. Abadi et al., “Tensorflow: Large-scale machine learning on heterogeneous distributed systems,” Available online: http://download.tensorflow.org/paper/whitepaper2015.pdf, 2015