
arxiv: 2605.19266 · v1 · pith:KGD5V4ONnew · submitted 2026-05-19 · 💻 cs.CL · cs.AI

FormalASR: End-to-End Spoken Chinese to Formal Text

Pith reviewed 2026-05-20 06:28 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords end-to-end ASR · spoken to formal text · Chinese speech recognition · LLM data rewriting · on-device transcription · Qwen3-ASR fine-tuning · verbatim vs formal output

The pith

Compact end-to-end models can turn spoken Chinese directly into formal written text without any separate LLM post-editing step.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to replace the usual two-stage process, in which verbatim speech recognition is followed by cleanup with a large language model, with small ASR models trained to output formal text directly from the audio. This matters for applications that need clean, writing-ready output from speech, such as note-taking or report generation, because it cuts latency and memory use and makes on-device deployment feasible. The work rests on new paired datasets in which spoken audio is matched to LLM-rewritten formal versions rather than raw transcripts.

Core claim

FormalASR consists of two compact models, at 0.6B and 1.7B parameters, obtained by supervised fine-tuning of Qwen3-ASR on the WenetSpeech-Formal training split, with Speechio-Formal serving as a cross-domain test set. Both datasets were built by applying LLM rewriting and quality filtering to turn verbatim transcripts into formal written targets. When scored against these formal references, the models produce lower character error rates than standard verbatim ASR baselines and also register gains on ROUGE-L and BERTScore.
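
For readers unfamiliar with the headline metric, here is a minimal sketch of how a character error rate (CER) and a relative reduction are typically computed. It assumes plain Levenshtein distance over characters with no text normalization; the paper's exact scoring setup is not described on this page, and the function names and numbers are illustrative rather than taken from the paper.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, counted in characters."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # character of ref missing from hyp
                curr[j - 1] + 1,     # extra character in hyp
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


def relative_reduction(baseline: float, system: float) -> float:
    """Fraction by which `system` lowers the error of `baseline`."""
    if baseline <= 0:
        raise ValueError("baseline error must be positive")
    return (baseline - system) / baseline
```

With these definitions, a verbatim hypothesis that keeps one filler character in front of a six-character formal reference has a CER of 1/6, and made-up CERs of 0.20 (baseline) and 0.15 (system) give a relative reduction of 0.25, that is, 25%.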

What carries the argument

Supervised fine-tuning of compact Qwen3-ASR models on LLM-rewritten spoken-to-formal datasets that directly map audio input to formal text output.

If this is right

  • Deployment becomes possible on resource-limited devices because no second LLM stage is required at inference time.
  • The same training approach could be applied to produce other specialized output styles beyond formal writing.
  • Latency for producing ready-to-use text from speech drops because the entire conversion happens inside one model forward pass.
  • Memory footprint shrinks compared with running both an ASR model and a separate post-editing model in sequence.
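
As a rough back-of-envelope illustration of the memory point, the sketch below estimates weight storage alone for the two model sizes at a few precisions. It assumes nominal parameter counts and idealized bytes per parameter, and it ignores activations, caches, and quantization metadata, so real footprints will be larger. None of these figures come from the paper.

```python
# Idealized storage cost per parameter (assumption: no scales or metadata).
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(num_params: float, dtype: str) -> float:
    """Approximate weight storage in gigabytes (10**9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


if __name__ == "__main__":
    # Nominal sizes of the two FormalASR models named in the paper.
    for name, n_params in [("0.6B", 0.6e9), ("1.7B", 1.7e9)]:
        row = ", ".join(
            f"{dtype}: {weight_memory_gb(n_params, dtype):.2f} GB"
            for dtype in BYTES_PER_PARAM
        )
        print(f"{name} -> {row}")
```

Under these assumptions the 0.6B model needs about 1.2 GB for fp16 weights and the 1.7B model about 3.4 GB. A two-stage pipeline would add the weights of a separate post-editing LLM on top of the ASR model.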

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same dataset-construction method might be reused to train models that output other cleaned-up styles such as summaries or bullet points directly from speech.
  • If the approach generalizes, voice interfaces could start producing professional documents without users having to edit raw transcripts afterward.
  • Testing the models on spontaneous conversations outside the filtered training domains would reveal how much the gains depend on the LLM rewriting step.

Load-bearing premise

That the LLM rewriting process used to build the training targets produces formal text that truly matches what users would want from spoken input.

What would settle it

A side-by-side human evaluation on fresh spoken recordings where the end-to-end model outputs receive lower suitability ratings for formal writing than the outputs of a standard ASR plus separate LLM pipeline.

Original abstract

Automatic speech recognition (ASR) systems are typically optimized for verbatim transcription, which preserves disfluencies, filler words, and informal spoken structures that are often unsuitable for downstream writing-oriented applications. A common workaround is a two-stage ASR+LLM pipeline for post-editing, but this design increases latency and memory cost and is difficult to deploy on-device. We present FormalASR, two compact end-to-end models (0.6B and 1.7B) that directly transcribe spoken Chinese into formal written text. To enable this setting, we build WenetSpeech-Formal and Speechio-Formal, two large-scale spoken-to-formal datasets constructed by LLM-based rewriting and quality filtering. We then fine-tune Qwen3-ASR at two scales (0.6B and 1.7B) with supervised fine-tuning. Experiments on WenetSpeech-Formal and Speechio-Formal show that FormalASR achieves up to 37.4% relative CER reduction over verbatim baselines, while also improving ROUGE-L and BERTScore. FormalASR requires no post-processing LLM at deployment time, providing a lightweight, on-device solution for spoken-to-formal transcription.
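
Alongside CER, the abstract reports ROUGE-L, which is built on the longest common subsequence (LCS) between reference and hypothesis. The sketch below is one simple reading of that metric: it assumes character-level units, which is a natural but unconfirmed choice for Chinese, and a balanced F1 score. The paper's actual ROUGE configuration is not given on this page.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of two strings."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(ref: str, hyp: str) -> float:
    """Character-level ROUGE-L as the F1 of LCS precision and recall."""
    lcs = lcs_length(ref, hyp)
    if lcs == 0:
        return 0.0
    precision = lcs / len(hyp)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)
```

Unlike CER, this score does not penalize reordering outside the common subsequence, which is one reason the two metrics can move differently on rewritten text.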

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript presents FormalASR, two compact end-to-end models (0.6B and 1.7B parameters) fine-tuned from Qwen3-ASR to directly transcribe spoken Chinese into formal written text. It constructs WenetSpeech-Formal and Speechio-Formal datasets via LLM-based rewriting and quality filtering of existing speech corpora. Supervised fine-tuning on these datasets yields up to 37.4% relative CER reduction over verbatim baselines, plus gains in ROUGE-L and BERTScore, positioning the system as a lightweight on-device alternative to two-stage ASR+LLM pipelines.

Significance. If the performance gains prove robust and independent of the LLM reference construction process, the work would meaningfully advance practical spoken-to-formal transcription for Chinese, particularly in latency-sensitive or on-device scenarios such as meeting summarization and subtitles. The compact model sizes and end-to-end design address real deployment constraints, and the large-scale dataset construction offers a reusable methodology for similar style-transfer tasks in speech.

major comments (2)
  1. [Experiments] The reported 37.4% relative CER reduction (and ROUGE-L/BERTScore gains) is measured on test sets whose references were generated by the identical LLM rewriting + filtering pipeline used to create the training data. Because the models are fine-tuned to predict exactly those targets, the metric improvements may reflect stylistic imitation of the LLM rather than independent production of human-preferred formal text. No human-annotated formal references or inter-annotator agreement statistics are reported to break this dependency (see Abstract and Experiments sections).
  2. [Experiments] The manuscript provides no details on experimental controls, statistical tests, exact baseline implementations, or safeguards against data leakage from the LLM rewriting step into the test sets. These omissions make it impossible to verify that the claimed reductions are attributable to the proposed approach rather than artifacts of the data construction process (see Abstract and Experiments sections).
minor comments (2)
  1. [Dataset Construction] Add concrete examples of spoken input, LLM-rewritten formal output, and model prediction to illustrate the target transformation in the dataset construction section.
  2. [Model Training] Clarify the precise fine-tuning hyperparameters, learning rate schedules, and any differences in training procedure between the 0.6B and 1.7B models.
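
One leakage check of the kind the second major comment asks for can be sketched as a character n-gram overlap between training and test targets. This is an illustration only: the function names, the choice of n, and the use of set overlap are assumptions, not a description of anything the authors ran.

```python
def char_ngrams(text: str, n: int) -> set[str]:
    """All character n-grams in `text` (empty set if the text is shorter than n)."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def ngram_overlap(train_texts: list[str], test_texts: list[str], n: int = 8) -> float:
    """Fraction of distinct test n-grams that also occur in the training texts."""
    train_grams: set[str] = set()
    for text in train_texts:
        train_grams |= char_ngrams(text, n)

    test_grams: set[str] = set()
    for text in test_texts:
        test_grams |= char_ngrams(text, n)

    if not test_grams:
        return 0.0
    return len(test_grams & train_grams) / len(test_grams)
```

A high overlap at long n would suggest that test targets share near-verbatim spans with training targets. A low overlap would not by itself rule out stylistic leakage from using the same rewriting LLM on both splits.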

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive comments, which help clarify the evaluation methodology and experimental details. We address each major comment below and indicate revisions to the manuscript.

Point-by-point responses
  1. Referee: [Experiments] The reported 37.4% relative CER reduction (and ROUGE-L/BERTScore gains) is measured on test sets whose references were generated by the identical LLM rewriting + filtering pipeline used to create the training data. Because the models are fine-tuned to predict exactly those targets, the metric improvements may reflect stylistic imitation of the LLM rather than independent production of human-preferred formal text. No human-annotated formal references or inter-annotator agreement statistics are reported to break this dependency (see Abstract and Experiments sections).

    Authors: We acknowledge that the evaluation uses references from the same LLM rewriting pipeline as the training data, which defines a consistent notion of formal text. The relative CER reductions and other metrics demonstrate that the end-to-end models learn to map spoken input to this target style more effectively than verbatim baselines. This setup is intentional to isolate the style-transfer capability without confounding factors from mismatched reference distributions. However, we agree this does not directly validate against independent human preferences. In the revised manuscript, we have added a dedicated Limitations subsection discussing the reliance on LLM-generated targets, the risk of stylistic imitation, and our plans to collect human annotations in follow-up work. We have also included qualitative examples comparing model outputs to both LLM references and human-edited versions where available. revision: partial

  2. Referee: [Experiments] The manuscript provides no details on experimental controls, statistical tests, exact baseline implementations, or safeguards against data leakage from the LLM rewriting step into the test sets. These omissions make it impossible to verify that the claimed reductions are attributable to the proposed approach rather than artifacts of the data construction process (see Abstract and Experiments sections).

    Authors: We apologize for these omissions in the initial submission. The revised manuscript now includes an expanded Experiments section with: (i) precise descriptions of baseline implementations, including the verbatim Qwen3-ASR fine-tuning procedure and any post-processing; (ii) statistical significance testing via bootstrap resampling with reported confidence intervals and p-values for the CER reductions; (iii) details on experimental controls such as hyperparameter search ranges, early stopping criteria, and multiple random seeds; and (iv) explicit safeguards against leakage, including n-gram overlap analysis between train and test sets after rewriting, separate LLM calls for test data, and verification that no test utterances were used in training data construction. These additions enable independent verification of the results. revision: yes
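
The bootstrap test mentioned in the simulated response could take the following form: a paired percentile bootstrap over utterances for the difference in corpus-level CER. This is a minimal sketch under stated assumptions (per-utterance edit counts and reference lengths are already available, and utterances are resampled with replacement). The function and argument names are hypothetical and do not reflect the authors' code.

```python
import random


def bootstrap_cer_diff(base_errs, sys_errs, ref_lens,
                       n_boot=1000, alpha=0.05, seed=0):
    """Percentile confidence interval for CER(baseline) - CER(system).

    base_errs / sys_errs: per-utterance edit distances for each model.
    ref_lens: per-utterance reference lengths (all positive).
    """
    if not (len(base_errs) == len(sys_errs) == len(ref_lens)) or not ref_lens:
        raise ValueError("inputs must be non-empty and of equal length")

    rng = random.Random(seed)
    n = len(ref_lens)
    diffs = []
    for _ in range(n_boot):
        # Resample utterance indices with replacement, keeping pairs aligned.
        idx = [rng.randrange(n) for _ in range(n)]
        total_len = sum(ref_lens[i] for i in idx)
        base_cer = sum(base_errs[i] for i in idx) / total_len
        sys_cer = sum(sys_errs[i] for i in idx) / total_len
        diffs.append(base_cer - sys_cer)

    diffs.sort()
    lo_idx = int(alpha / 2 * n_boot)
    hi_idx = max(lo_idx, int((1 - alpha / 2) * n_boot) - 1)
    return diffs[lo_idx], diffs[hi_idx]
```

If the resulting interval excludes zero, the CER gap is unlikely to be a sampling artifact of the test set. It still says nothing about whether the LLM-written references reflect human preferences, which is the unresolved objection.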

standing simulated objections not resolved
  • We do not have human-annotated formal references for the test sets and therefore cannot report inter-annotator agreement statistics or direct human preference comparisons in the current work.

Circularity Check

0 steps flagged

No significant circularity; empirical results on constructed data with no self-referential reduction

Full rationale

The paper describes dataset construction via LLM rewriting and quality filtering, followed by standard supervised fine-tuning of ASR models and evaluation with CER, ROUGE-L, and BERTScore on the resulting test sets. No equations, derivations, or self-citations appear in the provided text that would reduce any claimed result (such as the 37.4% relative CER reduction) to an input by construction. The performance numbers are measured empirical outcomes comparing fine-tuned models against verbatim baselines under identical reference conditions, which does not match any of the enumerated circularity patterns. The pipeline is self-contained as a conventional end-to-end fine-tuning experiment on synthetically labeled data.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Central claim depends on the assumption that LLM rewriting yields high-quality formal targets and that fine-tuning on these targets generalizes to real spoken input without introducing artifacts.

axioms (1)
  • domain assumption: LLM-based rewriting and quality filtering produce accurate formal written equivalents for spoken Chinese utterances
    This premise is required to create the training targets in WenetSpeech-Formal and Speechio-Formal.

pith-pipeline@v0.9.0 · 5750 in / 1212 out tokens · 45250 ms · 2026-05-20T06:28:10.030213+00:00 · methodology



Reference graph

Works this paper leans on

26 extracted references · 26 canonical work pages · 6 internal anchors

  1. [1]

    INTRODUCTION Automatic speech recognition (ASR) has become a foundational component of modern human-computer interaction, powering applications ranging from voice assistants and meeting transcription to real-time captioning and document dictation. State-of-the-art systems such as Whisper [1], Qwen3-ASR [2], and SenseVoice [3] have achieved remark- ...

  2. [2]

    RELATED WORKS 2.1. Automatic Speech Recognition Modern ASR systems have evolved from traditional hybrid HMM-DNN architectures [12] toward end-to-end models based on CTC [13] and attention-based encoder-decoder frameworks [14]. Large-scale pre-trained models have further advanced the field: Whisper [1] demonstrates that training on hundreds of thousands of...

  3. [3]

    DATASETS: WENETSPEECH-FORMAL AND SPEECHIO-FORMAL 3.1. Construction Pipeline We construct WenetSpeech-Formal and Speechio-Formal from the WenetSpeech corpus [9] and Speechio benchmark data [10], following a three-stage pipeline: Verbatim transcription collection. We use the original audio files and their verbatim transcriptions from WenetSpeech and Spee...

  4. [4]

    METHOD Given an input audio utterance $x$, our objective is to directly predict a formal written transcription $\hat{y}$ in a single pass: $\hat{y} = \arg\max_{y} P_\theta(y \mid x)$ (1), where $y$ denotes a well-formed written sentence rather than a verbatim spoken transcript. Different from the conventional ASR→LLM pipeline, this formulation couples acoustic recognition and linguistic form...

  5. [5]

    EXPERIMENTS 5.1. Experimental Setup We fine-tune Qwen3-ASR at two scales (0.6B and 1.7B) on WenetSpeech-Formal using full-parameter supervised fine-tuning (SFT). Both models are initialized from the official Qwen3-ASR [2] checkpoints and trained for 2 epochs on the 969K-sample training split. All experiments are conducted on 2 NVIDIA A800-SXM4-80GB GPUs....

  6. [6]

    CONCLUSION We presented two contributions toward end-to-end spoken-to-formal Chinese ASR. First, we constructed and open-sourced WenetSpeech-Formal with 969K training samples and Speechio-Formal with 43K cross-domain test samples, two large-scale spoken-to-formal datasets built by rewriting verbatim transcriptions with DeepSeek-V3.2 and applying quality...

  7. [7]

    Robust Speech Recognition via Large-Scale Weak Supervision

    Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever, “Robust speech recognition via large-scale weak supervision,” arXiv preprint arXiv:2212.04356, 2023

  8. [8]

    Qwen3-asr technical report,

    Qwen Team, “Qwen3-asr technical report,” https://github.com/QwenLM/Qwen3-ASR, 2025, Accessed: 2026-05-07

  9. [9]

    FunAudioLLM: Voice understanding and generation foundation models for natural interaction between humans and LLMs,

    Keyu An, Qian Chen, Chong Deng, Zhihao Du, Changfeng Gao, Zhifu Gao, Yue Gu, Ting He, et al., “FunAudioLLM: Voice understanding and generation foundation models for natural interaction between humans and LLMs,” https://arxiv.org/abs/2407.04051, 2024

  10. [10]

    Disfluency detection using a bidirectional LSTM,

    Victoria Zayats, Mari Ostendorf, and Hannaneh Hajishirzi, “Disfluency detection using a bidirectional LSTM,” in Proc. Interspeech, 2016, pp. 2523–2527

  11. [11]

    Improving disfluency detection by self-training a self-attentive model,

    Paria Jamshid Lou and Mark Johnson, “Improving disfluency detection by self-training a self-attentive model,” in Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020, pp. 3754–3763

  12. [12]

    Spoken language understanding with spoken-to-written conversion,

    Bing Wang, Wanxiang Che, and Ting Liu, “Spoken language understanding with spoken-to-written conversion,” in Proc. Interspeech, 2020, pp. 4661–4665

  13. [13]

    HyPoradise: An open baseline for generative speech recognition with large language models,

    Chen Chen, Yuchen Hu, Chao-Han Huck Yang, Sabato Marco Siniscalchi, Pin-Yu Chen, and Eng Siong Chng, “HyPoradise: An open baseline for generative speech recognition with large language models,” in Advances in Neural Information Processing Systems, 2023, vol. 36

  14. [14]

    Gpt-4o system card and model release,

    OpenAI, “Gpt-4o system card and model release,” https://openai.com/index/hello-gpt-4o/, 2024, Accessed: 2026-05-07

  15. [15]

    WenetSpeech: A 10000+ hours multi-domain mandarin corpus for speech recognition,

    Binbin Zhang, Hang Lv, Pengcheng Guo, Qijie Shao, Chao Yang, Lei Xie, Xin Xu, Hui Bu, Xiaoyu Chen, Chenchen Zeng, et al., “WenetSpeech: A 10000+ hours multi-domain mandarin corpus for speech recognition,” in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2022, pp. 6363–6367

  16. [16]

    SpeechIO TIOBE: A large-scale benchmarking platform for Chinese automatic speech recognition,

    SpeechColab, “SpeechIO TIOBE: A large-scale benchmarking platform for Chinese automatic speech recognition,” https://github.com/SpeechColab/Leaderboard, 2021, Accessed: 2026-05-18

  17. [17]

    DeepSeek-V3 Technical Report

    DeepSeek-AI, Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, et al., “DeepSeek-V3 technical report,” https://arxiv.org/abs/2412.19437, 2024

  18. [18]

    Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups,

    Geoffrey Hinton, Li Deng, Dong Yu, George E Dahl, Abdel-rahman Mohamed, Navdeep Jaitly, Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, Tara N Sainath, et al., “Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups,” IEEE Signal Processing Magazine, vol. 29, no. 6, pp. 82–97, 2012. Table 6. Bitsand...

  19. [19]

    Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks,

    Alex Graves, Santiago Fernández, Faustino Gomez, and Jürgen Schmidhuber, “Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks,” in Proceedings of the 23rd International Conference on Machine Learning, 2006, pp. 369–376

  20. [20]

    Listen, attend and spell: A neural network for large vocabulary conversational speech recognition,

    William Chan, Navdeep Jaitly, Quoc Le, and Oriol Vinyals, “Listen, attend and spell: A neural network for large vocabulary conversational speech recognition,” in 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2016, pp. 4960–4964

  21. [21]

    Normalization of non-standard words,

    Richard Sproat, Alan W Black, Stanley Chen, Shankar Kumar, Mari Ostendorf, and Christopher Richards, “Normalization of non-standard words,” Computer Speech & Language, vol. 15, no. 3, pp. 287–333, 2001

  22. [22]

    RNN Approaches to Text Normalization: A Challenge

    Richard Sproat and Navdeep Jaitly, “RNN approaches to text normalization: A challenge,” arXiv preprint arXiv:1611.00068, 2017

  23. [23]

    Decoupled weight decay regularization,

    Ilya Loshchilov and Frank Hutter, “Decoupled weight decay regularization,” in International Conference on Learning Representations, 2019

  24. [24]

    llama.cpp: Efficient LLM inference in C/C++,

    Georgi Gerganov et al., “llama.cpp: Efficient LLM inference in C/C++,” https://github.com/ggerganov/llama.cpp, 2023, Introduces the GGUF model format for portable, quantized on-device inference. Accessed: 2026-05-11

  25. [25]

    LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale

    Tim Dettmers, Mike Lewis, Younes Belkada, and Luke Zettlemoyer, “LLM.int8(): 8-bit matrix multiplication for transformers at scale,” https://arxiv.org/abs/2208.07339, 2022

  26. [26]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, et al., “Deepseekmath: Pushing the limits of mathematical reasoning in open language models,” arXiv preprint arXiv:2402.03300, 2024. A. APPENDIX A.1. Bitsandbytes Quantization Results Table 6 reports bitsandbytes [19] INT8/INT4 quantization results as a complement to the GGUF results in Section 5. INT8 is near-lossless...