FormalASR: End-to-End Spoken Chinese to Formal Text
Pith reviewed 2026-05-20 06:28 UTC · model grok-4.3
The pith
Compact end-to-end models can turn spoken Chinese directly into formal written text without any separate LLM post-editing step.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
FormalASR consists of two compact models, at 0.6B and 1.7B parameters, obtained by full-parameter supervised fine-tuning of Qwen3-ASR on the 969K-sample WenetSpeech-Formal training split; Speechio-Formal (43K samples) serves as a cross-domain test set. Both datasets were built by applying LLM rewriting and quality filtering to turn verbatim transcripts into formal written targets. When tested on these formal datasets, the models achieve lower character error rates (CER) than standard verbatim ASR baselines, with a relative reduction of up to 37.4%, and also improve ROUGE-L and BERTScore.
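To make the headline metric concrete, here is a minimal Python sketch of corpus-level CER and of a relative reduction figure. It is not the paper's evaluation script: the function names are illustrative, and the example strings are invented toy data rather than samples from WenetSpeech-Formal or Speechio-Formal.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Character-level Levenshtein distance between reference and hypothesis."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # reference character r has no counterpart
                curr[j - 1] + 1,         # hypothesis character h is extra
                prev[j - 1] + (r != h),  # substitution, or a match at zero cost
            ))
        prev = curr
    return prev[-1]


def cer(refs: list[str], hyps: list[str]) -> float:
    """Corpus-level CER: total edits divided by total reference characters."""
    total_edits = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    total_chars = sum(len(r) for r in refs)
    return total_edits / total_chars


def relative_reduction(baseline: float, model: float) -> float:
    """Fraction of the baseline error rate removed by the model."""
    return (baseline - model) / baseline


# Toy data (invented for illustration): one formal reference, a verbatim-style
# hypothesis that keeps fillers, and a formal-style hypothesis that drops them.
refs = ["我们明天开会"]
verbatim_hyps = ["嗯我们那个明天开会"]
formal_hyps = ["我们明天开会"]

baseline_cer = cer(refs, verbatim_hyps)  # 3 extra characters over 6 reference characters
model_cer = cer(refs, formal_hyps)
print(baseline_cer, model_cer, relative_reduction(baseline_cer, model_cer))
```

In this toy case the verbatim hypothesis is penalized purely for fillers that the formal reference omits, which illustrates why CER against formal references rewards matching the reference style as well as recognizing the audio correctly.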
What carries the argument
Supervised fine-tuning of compact Qwen3-ASR models on LLM-rewritten spoken-to-formal data, so that a single model maps audio input directly to formal text output.
If this is right
- Deployment becomes possible on resource-limited devices because no second LLM stage is required at inference time.
- The same training approach could be applied to produce other specialized output styles beyond formal writing.
- Latency for producing ready-to-use text from speech drops because the entire conversion happens in a single decoding pass of one model instead of two sequential stages.
- Memory footprint shrinks compared with running both an ASR model and a separate post-editing model in sequence.
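The memory point can be checked with back-of-the-envelope arithmetic. The sketch below counts weight storage only (no activations, KV cache, or runtime overhead). The 0.6B and 1.7B sizes come from the paper; the 7B post-editing LLM is a hypothetical stand-in chosen purely for illustration, as is the assumption that the two-stage pipeline uses an ASR model of the same size.

```python
# Bytes needed to store one parameter at common precisions.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_gb(num_params: float, dtype: str = "fp16") -> float:
    """Weight-only footprint in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


FORMALASR_SMALL = 0.6e9   # parameter count from the paper
FORMALASR_LARGE = 1.7e9   # parameter count from the paper
HYPOTHETICAL_LLM = 7e9    # assumption: size of a separate post-editing LLM

end_to_end = weight_gb(FORMALASR_LARGE)
# Assumption: the two-stage pipeline pairs a same-size ASR model with the LLM.
two_stage = weight_gb(FORMALASR_LARGE) + weight_gb(HYPOTHETICAL_LLM)

print(f"end-to-end: {end_to_end:.1f} GB, two-stage: {two_stage:.1f} GB")
for dtype in ("fp16", "int8", "int4"):
    print(dtype, f"{weight_gb(FORMALASR_SMALL, dtype):.2f} GB")
```

Under these assumptions the weights of the second-stage LLM dominate the pipeline's footprint, so removing that stage matters more than the choice between the 0.6B and 1.7B variants.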
Where Pith is reading between the lines
- The same dataset-construction method might be reused to train models that output other cleaned-up styles such as summaries or bullet points directly from speech.
- If the approach generalizes, voice interfaces could start producing professional documents without users having to edit raw transcripts afterward.
- Testing the models on spontaneous conversations outside the filtered training domains would reveal how much the gains depend on the LLM rewriting step.
Load-bearing premise
That the LLM rewriting process used to build the training targets produces formal text that truly matches what users would want from spoken input.
What would settle it
A side-by-side human evaluation on fresh spoken recordings where the end-to-end model outputs receive lower suitability ratings for formal writing than the outputs of a standard ASR plus separate LLM pipeline.
Original abstract
Automatic speech recognition (ASR) systems are typically optimized for verbatim transcription, which preserves disfluencies, filler words, and informal spoken structures that are often unsuitable for downstream writing-oriented applications. A common workaround is a two-stage ASR+LLM pipeline for post-editing, but this design increases latency and memory cost and is difficult to deploy on-device. We present FormalASR, two compact end-to-end models (0.6B and 1.7B) that directly transcribe spoken Chinese into formal written text. To enable this setting, we build WenetSpeech-Formal and Speechio-Formal, two large-scale spoken-to-formal datasets constructed by LLM-based rewriting and quality filtering. We then fine-tune Qwen3-ASR at two scales (0.6B and 1.7B) with supervised fine-tuning. Experiments on WenetSpeech-Formal and Speechio-Formal show that FormalASR achieves up to 37.4% relative CER reduction over verbatim baselines, while also improving ROUGE-L and BERTScore. FormalASR requires no post-processing LLM at deployment time, providing a lightweight, on-device solution for spoken-to-formal transcription.
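The abstract reports ROUGE-L alongside CER. As a rough illustration of what that metric measures, the sketch below computes a character-level ROUGE-L F1 score from the longest common subsequence (LCS). Character tokenization and the balanced F1 weighting are assumptions made here for Chinese text; the paper's exact ROUGE-L configuration is not given in this summary, and the example strings are invented.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of two character sequences."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)            # extend the common subsequence
            else:
                curr.append(max(prev[j], curr[j - 1]))  # carry over the best so far
        prev = curr
    return prev[-1]


def rouge_l_f1(reference: str, hypothesis: str) -> float:
    """Character-level ROUGE-L F1 (assumed variant: beta = 1)."""
    lcs = lcs_length(reference, hypothesis)
    if lcs == 0:
        return 0.0
    precision = lcs / len(hypothesis)
    recall = lcs / len(reference)
    return 2 * precision * recall / (precision + recall)


# Toy strings (invented): a formal reference and a filler-laden hypothesis.
reference = "我们明天开会"
hypothesis = "嗯我们那个明天开会"
print(rouge_l_f1(reference, hypothesis))
```

Unlike CER, ROUGE-L ignores where the extra characters sit and only penalizes them through precision, so the two metrics can disagree on how harmful leftover fillers are.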
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents FormalASR, two compact end-to-end models (0.6B and 1.7B parameters) fine-tuned from Qwen3-ASR to directly transcribe spoken Chinese into formal written text. It constructs the WenetSpeech-Formal and Speechio-Formal datasets via LLM-based rewriting and quality filtering of existing speech corpora. Supervised fine-tuning on WenetSpeech-Formal yields up to 37.4% relative CER reduction over verbatim baselines when evaluated on the two formal datasets, plus gains in ROUGE-L and BERTScore, positioning the system as a lightweight on-device alternative to two-stage ASR+LLM pipelines.
Significance. If the performance gains prove robust and independent of the LLM reference construction process, the work would meaningfully advance practical spoken-to-formal transcription for Chinese, particularly in latency-sensitive or on-device scenarios such as meeting summarization and subtitles. The compact model sizes and end-to-end design address real deployment constraints, and the large-scale dataset construction offers a reusable methodology for similar style-transfer tasks in speech.
major comments (2)
- [Experiments] The reported 37.4% relative CER reduction (and ROUGE-L/BERTScore gains) is measured on test sets whose references were generated by the identical LLM rewriting + filtering pipeline used to create the training data. Because the models are fine-tuned to predict exactly those targets, the metric improvements may reflect stylistic imitation of the LLM rather than independent production of human-preferred formal text. No human-annotated formal references or inter-annotator agreement statistics are reported to break this dependency (see Abstract and Experiments sections).
- [Experiments] The manuscript provides no details on experimental controls, statistical tests, exact baseline implementations, or safeguards against data leakage from the LLM rewriting step into the test sets. These omissions make it impossible to verify that the claimed reductions are attributable to the proposed approach rather than artifacts of the data construction process (see Abstract and Experiments sections).
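One simple safeguard of the kind this comment asks for is an n-gram overlap check between training targets and test references. The sketch below is a hedged illustration, not the authors' procedure: the function names are hypothetical, and the n-gram length and flagging threshold are arbitrary assumptions that would need tuning on real data.

```python
def char_ngrams(text: str, n: int) -> set[str]:
    """All distinct character n-grams of a string (empty if the text is shorter than n)."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def build_index(train_texts: list[str], n: int) -> set[str]:
    """Union of character n-grams over all training targets."""
    index: set[str] = set()
    for text in train_texts:
        index |= char_ngrams(text, n)
    return index


def overlap_ratio(test_text: str, train_index: set[str], n: int) -> float:
    """Fraction of a test reference's n-grams that also occur in training targets."""
    grams = char_ngrams(test_text, n)
    if not grams:
        return 0.0
    return len(grams & train_index) / len(grams)


def flag_suspects(train_texts, test_texts, n=8, threshold=0.8):
    """Return (index, ratio) for test items whose overlap meets the threshold.

    n and threshold are illustrative defaults, not values from the paper.
    """
    index = build_index(train_texts, n)
    suspects = []
    for i, text in enumerate(test_texts):
        ratio = overlap_ratio(text, index, n)
        if ratio >= threshold:
            suspects.append((i, ratio))
    return suspects


# Toy usage with placeholder strings.
print(flag_suspects(["abcdefghij"], ["abcdefghij", "zyxwvutsrq"], n=4))
```

A check like this catches near-duplicate text between splits, but it cannot detect the stylistic dependence raised in the first major comment, because that dependence comes from sharing the same rewriting model rather than the same sentences.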
minor comments (2)
- [Dataset Construction] Add concrete examples of spoken input, LLM-rewritten formal output, and model prediction to illustrate the target transformation in the dataset construction section.
- [Model Training] Clarify the precise fine-tuning hyperparameters, learning rate schedules, and any differences in training procedure between the 0.6B and 1.7B models.
Simulated Author's Rebuttal
We thank the referee for the constructive comments, which help clarify the evaluation methodology and experimental details. We address each major comment below and indicate revisions to the manuscript.
Point-by-point responses
-
Referee: [Experiments] The reported 37.4% relative CER reduction (and ROUGE-L/BERTScore gains) is measured on test sets whose references were generated by the identical LLM rewriting + filtering pipeline used to create the training data. Because the models are fine-tuned to predict exactly those targets, the metric improvements may reflect stylistic imitation of the LLM rather than independent production of human-preferred formal text. No human-annotated formal references or inter-annotator agreement statistics are reported to break this dependency (see Abstract and Experiments sections).
Authors: We acknowledge that the evaluation uses references from the same LLM rewriting pipeline as the training data, which defines a consistent notion of formal text. The relative CER reductions and other metrics demonstrate that the end-to-end models learn to map spoken input to this target style more effectively than verbatim baselines. This setup is intentional to isolate the style-transfer capability without confounding factors from mismatched reference distributions. However, we agree this does not directly validate against independent human preferences. In the revised manuscript, we have added a dedicated Limitations subsection discussing the reliance on LLM-generated targets, the risk of stylistic imitation, and our plans to collect human annotations in follow-up work. We have also included qualitative examples comparing model outputs to both LLM references and human-edited versions where available. revision: partial
-
Referee: [Experiments] The manuscript provides no details on experimental controls, statistical tests, exact baseline implementations, or safeguards against data leakage from the LLM rewriting step into the test sets. These omissions make it impossible to verify that the claimed reductions are attributable to the proposed approach rather than artifacts of the data construction process (see Abstract and Experiments sections).
Authors: We apologize for these omissions in the initial submission. The revised manuscript now includes an expanded Experiments section with: (i) precise descriptions of baseline implementations, including the verbatim Qwen3-ASR fine-tuning procedure and any post-processing; (ii) statistical significance testing via bootstrap resampling with reported confidence intervals and p-values for the CER reductions; (iii) details on experimental controls such as hyperparameter search ranges, early stopping criteria, and multiple random seeds; and (iv) explicit safeguards against leakage, including n-gram overlap analysis between train and test sets after rewriting, separate LLM calls for test data, and verification that no test utterances were used in training data construction. These additions enable independent verification of the results. revision: yes
- We do not have human-annotated formal references for the test sets and therefore cannot report inter-annotator agreement statistics or direct human preference comparisons in the current work.
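The rebuttal mentions bootstrap resampling with confidence intervals for the CER reductions. A minimal sketch of one common variant, a paired bootstrap over utterances, is shown below. It is an assumption about how such a test could be run, not the authors' implementation: the input format (per-utterance edit counts and reference lengths), the function names, and the one-sided p-value estimate are all illustrative.

```python
import random


def corpus_cer(stats, indices):
    """CER over the selected utterances; stats[i] = (edit_count, reference_length)."""
    edits = sum(stats[i][0] for i in indices)
    chars = sum(stats[i][1] for i in indices)
    return edits / chars


def paired_bootstrap(baseline, model, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile interval for (baseline CER - model CER) under paired resampling.

    Both systems are scored on the same resampled utterances each round.
    Assumes every reference length is positive.
    """
    if len(baseline) != len(model):
        raise ValueError("both systems must be scored on the same utterances")
    rng = random.Random(seed)
    n = len(baseline)
    deltas = []
    for _ in range(n_resamples):
        indices = [rng.randrange(n) for _ in range(n)]
        deltas.append(corpus_cer(baseline, indices) - corpus_cer(model, indices))
    deltas.sort()
    lo_idx = int(n_resamples * alpha / 2)
    hi_idx = n_resamples - lo_idx - 1
    # Share of resamples in which the model is not better than the baseline.
    p_value = sum(d <= 0 for d in deltas) / n_resamples
    return deltas[lo_idx], deltas[hi_idx], p_value


# Toy per-utterance statistics (invented): (edit_count, reference_length).
baseline_stats = [(3, 10), (2, 12), (4, 9), (1, 11)] * 25
model_stats = [(1, 10), (2, 12), (2, 9), (0, 11)] * 25
print(paired_bootstrap(baseline_stats, model_stats))
```

Such a test quantifies sampling noise in the test set only. It says nothing about whether the LLM-generated references match human preferences, which is the limitation the authors acknowledge above.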
Circularity Check
No significant circularity; empirical results on constructed data with no self-referential reduction
Full rationale
The paper describes dataset construction via LLM rewriting and quality filtering, followed by standard supervised fine-tuning of ASR models and evaluation with CER, ROUGE-L, and BERTScore on the resulting test sets. No equations, derivations, or self-citations appear in the provided text that would reduce any claimed result (such as the 37.4% relative CER reduction) to an input by construction. The performance numbers are measured empirical outcomes comparing fine-tuned models against verbatim baselines under identical reference conditions, which does not match any of the enumerated circularity patterns. The pipeline is self-contained as a conventional end-to-end fine-tuning experiment on synthetically labeled data.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: LLM-based rewriting and quality filtering produce accurate formal written equivalents for spoken Chinese utterances
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
unclear: Relation between the paper passage and the cited Recognition theorem.
We construct WenetSpeech-Formal and Speechio-Formal... by rewriting verbatim transcriptions with DeepSeek-V3.2 and applying quality filtering... fine-tune Qwen3-ASR... achieving up to 37.4% relative CER reduction
-
IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
unclear: Relation between the paper passage and the cited Recognition theorem.
FormalASR requires no post-processing LLM at deployment time, providing a lightweight, on-device solution
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
INTRODUCTION. Automatic speech recognition (ASR) has become a foundational component of modern human-computer interaction, powering applications ranging from voice assistants and meeting transcription to real-time captioning and document dictation. State-of-the-art systems such as Whisper [1], Qwen3-ASR [2], and SenseVoice [3] have achieved remark- ...
-
[2]
RELATED WORKS. 2.1. Automatic Speech Recognition. Modern ASR systems have evolved from traditional hybrid HMM-DNN architectures [12] toward end-to-end models based on CTC [13] and attention-based encoder-decoder frameworks [14]. Large-scale pre-trained models have further advanced the field: Whisper [1] demonstrates that training on hundreds of thousands of...
-
[3]
DATASETS: WENETSPEECH-FORMAL AND SPEECHIO-FORMAL. 3.1. Construction Pipeline. We construct WenetSpeech-Formal and Speechio-Formal from the WenetSpeech corpus [9] and Speechio benchmark data [10], following a three-stage pipeline. Verbatim transcription collection: We use the original audio files and their verbatim transcriptions from WenetSpeech and Spee...
-
[4]
METHOD. Given an input audio utterance $x$, our objective is to directly predict a formal written transcription $\hat{y}$ in a single pass:
$$\hat{y} = \arg\max_{y} P_{\theta}(y \mid x), \tag{1}$$
where $y$ denotes a well-formed written sentence rather than a verbatim spoken transcript. Different from the conventional ASR→LLM pipeline, this formulation couples acoustic recognition and linguistic form...
-
[5]
EXPERIMENTS. 5.1. Experimental Setup. We fine-tune Qwen3-ASR at two scales (0.6B and 1.7B) on WenetSpeech-Formal using full-parameter supervised fine-tuning (SFT). Both models are initialized from the official Qwen3-ASR [2] checkpoints and trained for 2 epochs on the 969K-sample training split. All experiments are conducted on 2 NVIDIA A800-SXM4-80GB GPUs. ...
-
[6]
CONCLUSION. We presented two contributions toward end-to-end spoken-to-formal Chinese ASR. First, we constructed and open-sourced WenetSpeech-Formal with 969K training samples and Speechio-Formal with 43K cross-domain test samples, two large-scale spoken-to-formal datasets built by rewriting verbatim transcriptions with DeepSeek-V3.2 and applying quality...
-
[7]
Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever, “Robust speech recognition via large-scale weak supervision,” arXiv preprint arXiv:2212.04356, 2023.
-
[8]
Qwen Team, “Qwen3-ASR technical report,” https://github.com/QwenLM/Qwen3-ASR, 2025. Accessed: 2026-05-07.
-
[9]
Keyu An, Qian Chen, Chong Deng, Zhihao Du, Changfeng Gao, Zhifu Gao, Yue Gu, Ting He, et al., “FunAudioLLM: Voice understanding and generation foundation models for natural interaction between humans and LLMs,” https://arxiv.org/abs/2407.04051, 2024.
-
[10]
Victoria Zayats, Mari Ostendorf, and Hannaneh Hajishirzi, “Disfluency detection using a bidirectional LSTM,” in Proc. Interspeech, 2016, pp. 2523–2527.
-
[11]
Paria Jamshid Lou and Mark Johnson, “Improving disfluency detection by self-training a self-attentive model,” in Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020, pp. 3754–3763.
-
[12]
Bing Wang, Wanxiang Che, and Ting Liu, “Spoken language understanding with spoken-to-written conversion,” in Proc. Interspeech, 2020, pp. 4661–4665.
-
[13]
Chen Chen, Yuchen Hu, Chao-Han Huck Yang, Sabato Marco Siniscalchi, Pin-Yu Chen, and Eng Siong Chng, “HyPoradise: An open baseline for generative speech recognition with large language models,” in Advances in Neural Information Processing Systems, 2023, vol. 36.
-
[14]
OpenAI, “GPT-4o system card and model release,” https://openai.com/index/hello-gpt-4o/, 2024. Accessed: 2026-05-07.
-
[15]
Binbin Zhang, Hang Lv, Pengcheng Guo, Qijie Shao, Chao Yang, Lei Xie, Xin Xu, Hui Bu, Xiaoyu Chen, Chenchen Zeng, et al., “WenetSpeech: A 10000+ hours multi-domain Mandarin corpus for speech recognition,” in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2022, pp. 6363–6367.
-
[16]
SpeechColab, “SpeechIO TIOBE: A large-scale benchmarking platform for Chinese automatic speech recognition,” https://github.com/SpeechColab/Leaderboard, 2021. Accessed: 2026-05-18.
-
[17]
DeepSeek-AI, Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, et al., “DeepSeek-V3 technical report,” https://arxiv.org/abs/2412.19437, 2024.
-
[18]
Geoffrey Hinton, Li Deng, Dong Yu, George E Dahl, Abdel-rahman Mohamed, Navdeep Jaitly, Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, Tara N Sainath, et al., “Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups,” IEEE Signal Processing Magazine, vol. 29, no. 6, pp. 82–97, 2012. Table 6. Bitsand...
-
[19]
Alex Graves, Santiago Fernández, Faustino Gomez, and Jürgen Schmidhuber, “Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks,” in Proceedings of the 23rd International Conference on Machine Learning, 2006, pp. 369–376.
-
[20]
William Chan, Navdeep Jaitly, Quoc Le, and Oriol Vinyals, “Listen, attend and spell: A neural network for large vocabulary conversational speech recognition,” in 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2016, pp. 4960–4964.
-
[21]
Richard Sproat, Alan W Black, Stanley Chen, Shankar Kumar, Mari Ostendorf, and Christopher Richards, “Normalization of non-standard words,” Computer Speech & Language, vol. 15, no. 3, pp. 287–333, 2001.
-
[22]
Richard Sproat and Navdeep Jaitly, “RNN approaches to text normalization: A challenge,” arXiv preprint arXiv:1611.00068, 2017.
-
[23]
Ilya Loshchilov and Frank Hutter, “Decoupled weight decay regularization,” in International Conference on Learning Representations, 2019.
-
[24]
Georgi Gerganov et al., “llama.cpp: Efficient LLM inference in C/C++,” https://github.com/ggerganov/llama.cpp, 2023. Introduces the GGUF model format for portable, quantized on-device inference. Accessed: 2026-05-11.
-
[25]
Tim Dettmers, Mike Lewis, Younes Belkada, and Luke Zettlemoyer, “LLM.int8(): 8-bit matrix multiplication for transformers at scale,” https://arxiv.org/abs/2208.07339, 2022.
-
[26]
Zhihong Shao, Peiyi Wang, Qihao Zhu, et al., “DeepSeekMath: Pushing the limits of mathematical reasoning in open language models,” arXiv preprint arXiv:2402.03300, 2024.
A. APPENDIX. A.1. Bitsandbytes Quantization Results. Table 6 reports bitsandbytes [19] INT8/INT4 quantization results as a complement to the GGUF results in Section 5. INT8 is near-lossless...