FormalASR: End-to-End Spoken Chinese to Formal Text

Haitao Qian; Jiyuan Cheng; Wanyi Ning; Weiyuan Feng; Yinshang Guo; Yufei Zhang

arXiv: 2605.19266 · v1 · pith:KGD5V4ON · submitted 2026-05-19 · cs.CL · cs.AI

Pith reviewed 2026-05-20 06:28 UTC · model grok-4.3

keywords: end-to-end ASR · spoken to formal text · Chinese speech recognition · LLM data rewriting · on-device transcription · Qwen3-ASR fine-tuning · verbatim vs formal output

The pith

Compact end-to-end models can turn spoken Chinese directly into formal written text without any separate LLM post-editing step.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to replace the usual two-stage process of verbatim speech recognition followed by cleanup with a large language model. Instead, it trains small ASR models to output formal text directly from the audio. This matters for applications that need clean, writing-ready output from speech, such as note-taking or report generation, because it cuts latency and memory use and allows on-device deployment. The work rests on new paired datasets in which spoken audio is matched to LLM-rewritten formal versions rather than raw transcripts.

Core claim

FormalASR consists of two compact models, at 0.6B and 1.7B parameters, obtained by supervised fine-tuning of Qwen3-ASR on WenetSpeech-Formal; Speechio-Formal serves as a cross-domain test set. Both datasets were built by applying LLM rewriting and quality filtering to turn verbatim transcripts into formal written targets. When tested on these formal datasets, the models produce lower character error rates than standard verbatim ASR baselines (up to 37.4% relative reduction) and also register gains on ROUGE-L and BERTScore.
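To make the headline metric concrete, here is a minimal sketch of how character error rate (CER) and a relative CER reduction can be computed. The sentences and resulting numbers are toy values invented for illustration; they are not taken from the paper, and the paper's own scoring script may normalize text differently.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two character sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,           # delete a reference character
                            curr[j - 1] + 1,       # insert a hypothesis character
                            prev[j - 1] + cost))   # substitute or match
        prev = curr
    return prev[-1]


def cer(refs, hyps):
    """Corpus-level CER: total edits divided by total reference characters."""
    edits = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    chars = sum(len(r) for r in refs)
    return edits / chars


def relative_reduction(baseline_cer, model_cer):
    """Fraction of the baseline error removed by the model."""
    return (baseline_cer - model_cer) / baseline_cer


# Toy example (hypothetical): one formal reference, a verbatim hypothesis
# that keeps filler words, and a formal-style hypothesis with one wrong character.
formal_ref = ["我们明天开会"]
verbatim_hyp = ["嗯我们明天那个开会"]
formal_hyp = ["我们明日开会"]

baseline = cer(formal_ref, verbatim_hyp)
model = cer(formal_ref, formal_hyp)
print(baseline, model, relative_reduction(baseline, model))
```

Scoring a verbatim system against formal references penalizes every retained filler word as an insertion, which is why a model trained on the formal targets can show a large relative reduction on this metric.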

What carries the argument

Supervised fine-tuning of compact Qwen3-ASR models on LLM-rewritten spoken-to-formal datasets that directly map audio input to formal text output.

If this is right

Deployment becomes possible on resource-limited devices because no second LLM stage is required at inference time.
The same training approach could be applied to produce other specialized output styles beyond formal writing.
Latency for producing ready-to-use text from speech drops because the entire conversion happens inside one model forward pass.
Memory footprint shrinks compared with running both an ASR model and a separate post-editing model in sequence.
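The memory point can be illustrated with back-of-the-envelope arithmetic on weight storage alone. The sketch below assumes 16-bit weights and a hypothetical 7B-parameter post-editing LLM for the two-stage pipeline; neither figure comes from the paper, and activations, caches, and runtime overhead are ignored.

```python
def weight_memory_gb(params_billion: float, bytes_per_param: float) -> float:
    """Approximate weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    return params_billion * bytes_per_param


FP16_BYTES = 2            # assumption: weights stored in 16-bit precision
HYPOTHETICAL_LLM_B = 7.0  # assumption: size of a separate post-editing LLM

single_small = weight_memory_gb(0.6, FP16_BYTES)   # 0.6B end-to-end model
single_large = weight_memory_gb(1.7, FP16_BYTES)   # 1.7B end-to-end model
# Two-stage pipeline: an ASR model of the same size plus the post-editing LLM.
pipeline = weight_memory_gb(1.7 + HYPOTHETICAL_LLM_B, FP16_BYTES)

print(f"0.6B model: {single_small:.1f} GB")
print(f"1.7B model: {single_large:.1f} GB")
print(f"1.7B ASR + 7B LLM: {pipeline:.1f} GB")
```

The gap depends entirely on the size assumed for the post-editing model, so the comparison should be read as an order-of-magnitude illustration rather than a measurement.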

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same dataset-construction method might be reused to train models that output other cleaned-up styles such as summaries or bullet points directly from speech.
If the approach generalizes, voice interfaces could start producing professional documents without users having to edit raw transcripts afterward.
Testing the models on spontaneous conversations outside the filtered training domains would reveal how much the gains depend on the LLM rewriting step.

Load-bearing premise

That the LLM rewriting process used to build the training targets produces formal text that truly matches what users would want from spoken input.

What would settle it

A side-by-side human evaluation on fresh spoken recordings where the end-to-end model outputs receive lower suitability ratings for formal writing than the outputs of a standard ASR plus separate LLM pipeline.

Original abstract

Automatic speech recognition (ASR) systems are typically optimized for verbatim transcription, which preserves disfluencies, filler words, and informal spoken structures that are often unsuitable for downstream writing-oriented applications. A common workaround is a two-stage ASR+LLM pipeline for post-editing, but this design increases latency and memory cost and is difficult to deploy on-device. We present FormalASR, two compact end-to-end models (0.6B and 1.7B) that directly transcribe spoken Chinese into formal written text. To enable this setting, we build WenetSpeech-Formal and Speechio-Formal, two large-scale spoken-to-formal datasets constructed by LLM-based rewriting and quality filtering. We then fine-tune Qwen3-ASR at two scales (0.6B and 1.7B) with supervised fine-tuning. Experiments on WenetSpeech-Formal and Speechio-Formal show that FormalASR achieves up to 37.4% relative CER reduction over verbatim baselines, while also improving ROUGE-L and BERTScore. FormalASR requires no post-processing LLM at deployment time, providing a lightweight, on-device solution for spoken-to-formal transcription.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

Private letter to a colleague

FormalASR gives compact on-device models for spoken-to-formal Chinese with reported CER gains, but the evaluation sits on LLM-rewritten references that match the training targets.

FormalASR fine-tunes two small Qwen3-ASR models (0.6B and 1.7B) to map spoken Chinese directly to formal written text. The authors built WenetSpeech-Formal and Speechio-Formal by running an LLM over existing transcripts, applying quality filters, and using the outputs as targets. On those test sets the models cut character error rate by up to 37.4% relative to verbatim baselines and also improved ROUGE-L and BERTScore. At inference there is no separate LLM step, which keeps the footprint light for on-device use. That combination of new data and end-to-end training for this narrow task is what the paper actually adds. Earlier work mostly handled the formality step with a second LLM after standard ASR. The practical payoff is clear for anyone who wants formal output without extra latency or memory.

The main limitation is the evaluation loop. Both training and test references come from the same LLM rewriting pipeline, so the model is optimized to reproduce the LLM's stylistic choices. Lower CER and higher overlap scores may simply reflect better imitation rather than independent production of human-preferred formal text. The abstract gives no human-annotated formal references, no inter-annotator agreement numbers, and little detail on baseline implementations or statistical controls. If the full paper adds those checks, it would strengthen the claims; without them the numbers are harder to interpret as general progress.

This paper is aimed at engineers and researchers building Chinese ASR systems for writing-oriented applications. Someone working on on-device transcription or formal note-taking would find the datasets and the compact-model results worth a look.

I would send it to peer review. The datasets are a concrete addition and the deployment angle is worth referee scrutiny even with the evaluation caveats.

Referee Report

2 major / 2 minor

Summary. The manuscript presents FormalASR, two compact end-to-end models (0.6B and 1.7B parameters) fine-tuned from Qwen3-ASR to directly transcribe spoken Chinese into formal written text. It constructs WenetSpeech-Formal and Speechio-Formal datasets via LLM-based rewriting and quality filtering of existing speech corpora. Supervised fine-tuning on these datasets yields up to 37.4% relative CER reduction over verbatim baselines, plus gains in ROUGE-L and BERTScore, positioning the system as a lightweight on-device alternative to two-stage ASR+LLM pipelines.

Significance. If the performance gains prove robust and independent of the LLM reference construction process, the work would meaningfully advance practical spoken-to-formal transcription for Chinese, particularly in latency-sensitive or on-device scenarios such as meeting summarization and subtitles. The compact model sizes and end-to-end design address real deployment constraints, and the large-scale dataset construction offers a reusable methodology for similar style-transfer tasks in speech.

major comments (2)

[Experiments] The reported 37.4% relative CER reduction (and ROUGE-L/BERTScore gains) is measured on test sets whose references were generated by the identical LLM rewriting + filtering pipeline used to create the training data. Because the models are fine-tuned to predict exactly those targets, the metric improvements may reflect stylistic imitation of the LLM rather than independent production of human-preferred formal text. No human-annotated formal references or inter-annotator agreement statistics are reported to break this dependency (see Abstract and Experiments sections).
[Experiments] The manuscript provides no details on experimental controls, statistical tests, exact baseline implementations, or safeguards against data leakage from the LLM rewriting step into the test sets. These omissions make it impossible to verify that the claimed reductions are attributable to the proposed approach rather than artifacts of the data construction process (see Abstract and Experiments sections).
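One possible form of the leakage safeguard requested in the second major comment is a character n-gram overlap check between training targets and test references. The sketch below is an assumption about how such a check might look; the function names, the n-gram length, and the sample sentences are all hypothetical and not taken from the manuscript.

```python
def char_ngrams(text: str, n: int) -> set:
    """All distinct character n-grams in a string."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def overlap_rate(train_texts, test_text: str, n: int = 4) -> float:
    """Fraction of the test text's distinct n-grams that occur in any training text."""
    train = set()
    for t in train_texts:
        train |= char_ngrams(t, n)
    test = char_ngrams(test_text, n)
    if not test:
        return 0.0  # text shorter than n characters: nothing to compare
    return len(test & train) / len(test)


# Hypothetical formal targets, for illustration only.
train_targets = ["今天的会议定于下午三点举行"]
test_reference = "会议定于下午三点举行，请准时参加"

print(overlap_rate(train_targets, test_reference, n=4))
```

A high overlap rate on long n-grams would suggest near-duplicate content across the split. Character n-grams are used here because Chinese text has no whitespace word boundaries; a word-level variant would need a segmenter.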

minor comments (2)

[Dataset Construction] Add concrete examples of spoken input, LLM-rewritten formal output, and model prediction to illustrate the target transformation in the dataset construction section.
[Model Training] Clarify the precise fine-tuning hyperparameters, learning rate schedules, and any differences in training procedure between the 0.6B and 1.7B models.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive comments, which help clarify the evaluation methodology and experimental details. We address each major comment below and indicate revisions to the manuscript.

Referee: [Experiments] The reported 37.4% relative CER reduction (and ROUGE-L/BERTScore gains) is measured on test sets whose references were generated by the identical LLM rewriting + filtering pipeline used to create the training data. Because the models are fine-tuned to predict exactly those targets, the metric improvements may reflect stylistic imitation of the LLM rather than independent production of human-preferred formal text. No human-annotated formal references or inter-annotator agreement statistics are reported to break this dependency (see Abstract and Experiments sections).

Authors: We acknowledge that the evaluation uses references from the same LLM rewriting pipeline as the training data, which defines a consistent notion of formal text. The relative CER reductions and other metrics demonstrate that the end-to-end models learn to map spoken input to this target style more effectively than verbatim baselines. This setup is intentional to isolate the style-transfer capability without confounding factors from mismatched reference distributions. However, we agree this does not directly validate against independent human preferences. In the revised manuscript, we have added a dedicated Limitations subsection discussing the reliance on LLM-generated targets, the risk of stylistic imitation, and our plans to collect human annotations in follow-up work. We have also included qualitative examples comparing model outputs to both LLM references and human-edited versions where available.

revision: partial

Referee: [Experiments] The manuscript provides no details on experimental controls, statistical tests, exact baseline implementations, or safeguards against data leakage from the LLM rewriting step into the test sets. These omissions make it impossible to verify that the claimed reductions are attributable to the proposed approach rather than artifacts of the data construction process (see Abstract and Experiments sections).

Authors: We apologize for these omissions in the initial submission. The revised manuscript now includes an expanded Experiments section with: (i) precise descriptions of baseline implementations, including the verbatim Qwen3-ASR fine-tuning procedure and any post-processing; (ii) statistical significance testing via bootstrap resampling with reported confidence intervals and p-values for the CER reductions; (iii) details on experimental controls such as hyperparameter search ranges, early stopping criteria, and multiple random seeds; and (iv) explicit safeguards against leakage, including n-gram overlap analysis between train and test sets after rewriting, separate LLM calls for test data, and verification that no test utterances were used in training data construction. These additions enable independent verification of the results.

revision: yes
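The bootstrap test mentioned in item (ii) could be sketched as a paired bootstrap over test utterances. This is an illustrative assumption about the procedure, not the authors' actual implementation; the per-utterance edit counts and reference lengths below are invented toy data.

```python
import random


def bootstrap_ci(base_edits, model_edits, ref_lens, n_resamples=1000, seed=0):
    """95% percentile interval for relative CER reduction under a paired bootstrap."""
    rng = random.Random(seed)
    n = len(ref_lens)
    stats = []
    for _ in range(n_resamples):
        # Resample utterance indices with replacement; the same indices are
        # used for both systems, which keeps the comparison paired.
        idx = [rng.randrange(n) for _ in range(n)]
        chars = sum(ref_lens[i] for i in idx)
        base_cer = sum(base_edits[i] for i in idx) / chars
        model_cer = sum(model_edits[i] for i in idx) / chars
        if base_cer > 0:  # relative reduction is undefined for a zero baseline
            stats.append((base_cer - model_cer) / base_cer)
    if not stats:
        raise ValueError("baseline CER was zero in every resample")
    stats.sort()
    lower = stats[int(0.025 * len(stats))]
    upper = stats[int(0.975 * len(stats)) - 1]
    return lower, upper


# Toy per-utterance character edit counts and reference lengths (hypothetical).
base_edits = [3, 2, 4, 1, 5, 2]
model_edits = [1, 1, 2, 0, 3, 1]
ref_lens = [10, 8, 12, 6, 15, 9]

low, high = bootstrap_ci(base_edits, model_edits, ref_lens)
print(f"95% interval for relative CER reduction: [{low:.3f}, {high:.3f}]")
```

If the lower bound of the interval stays above zero, the reduction is unlikely to be an artifact of which utterances happened to land in the test set. It says nothing about whether the references themselves are the right target, which is the unresolved objection.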

standing simulated objections not resolved

We do not have human-annotated formal references for the test sets and therefore cannot report inter-annotator agreement statistics or direct human preference comparisons in the current work.

Circularity Check

0 steps flagged

No significant circularity; empirical results on constructed data with no self-referential reduction

The paper describes dataset construction via LLM rewriting and quality filtering, followed by standard supervised fine-tuning of ASR models and evaluation with CER, ROUGE-L, and BERTScore on the resulting test sets. No equations, derivations, or self-citations appear in the provided text that would reduce any claimed result (such as the 37.4% relative CER reduction) to an input by construction. The performance numbers are measured empirical outcomes comparing fine-tuned models against verbatim baselines under identical reference conditions, which does not match any of the enumerated circularity patterns. The pipeline is self-contained as a conventional end-to-end fine-tuning experiment on synthetically labeled data.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Central claim depends on the assumption that LLM rewriting yields high-quality formal targets and that fine-tuning on these targets generalizes to real spoken input without introducing artifacts.

axioms (1)

domain assumption: LLM-based rewriting and quality filtering produce accurate formal written equivalents for spoken Chinese utterances.
This premise is required to create the training targets in WenetSpeech-Formal and Speechio-Formal.

pith-pipeline@v0.9.0 · 5750 in / 1212 out tokens · 45250 ms · 2026-05-20T06:28:10.030213+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Paper passage: "We construct WenetSpeech-Formal and Speechio-Formal... by rewriting verbatim transcriptions with DeepSeek-V3.2 and applying quality filtering... fine-tune Qwen3-ASR... achieving up to 37.4% relative CER reduction"

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear

Paper passage: "FormalASR requires no post-processing LLM at deployment time, providing a lightweight, on-device solution"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

26 extracted references · 26 canonical work pages · 6 internal anchors

[1] INTRODUCTION. Automatic speech recognition (ASR) has become a foundational component of modern human-computer interaction, powering applications ranging from voice assistants and meeting transcription to real-time captioning and document dictation. State-of-the-art systems such as Whisper [1], Qwen3-ASR [2], and SenseVoice [3] have achieved remark- ...

[2] RELATED WORKS. 2.1. Automatic Speech Recognition. Modern ASR systems have evolved from traditional hybrid HMM-DNN architectures [12] toward end-to-end models based on CTC [13] and attention-based encoder-decoder frameworks [14]. Large-scale pre-trained models have further advanced the field: Whisper [1] demonstrates that training on hundreds of thousands of...

[3] DATASETS: WENETSPEECH-FORMAL AND SPEECHIO-FORMAL. 3.1. Construction Pipeline. We construct WenetSpeech-Formal and Speechio-Formal from the WenetSpeech corpus [9] and Speechio benchmark data [10], following a three-stage pipeline: Verbatim transcription collection. We use the original audio files and their verbatim transcriptions from WenetSpeech and Spee...

[4] METHOD. Given an input audio utterance $x$, our objective is to directly predict a formal written transcription $\hat{y}$ in a single pass: $\hat{y} = \arg\max_{y} P_\theta(y \mid x)$ (1), where $y$ denotes a well-formed written sentence rather than a verbatim spoken transcript. Different from the conventional ASR→LLM pipeline, this formulation couples acoustic recognition and linguistic form...

[5] EXPERIMENTS. 5.1. Experimental Setup. We fine-tune Qwen3-ASR at two scales (0.6B and 1.7B) on WenetSpeech-Formal using full-parameter supervised fine-tuning (SFT). Both models are initialized from the official Qwen3-ASR [2] checkpoints and trained for 2 epochs on the 969K-sample training split. All experiments are conducted on 2 NVIDIA A800-SXM4-80GB GPUs....

[6] CONCLUSION. We presented two contributions toward end-to-end spoken-to-formal Chinese ASR. First, we constructed and open-sourced WenetSpeech-Formal with 969K training samples and Speechio-Formal with 43K cross-domain test samples, two large-scale spoken-to-formal datasets built by rewriting verbatim transcriptions with DeepSeek-V3.2 and applying quality...

[7] Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever, “Robust speech recognition via large-scale weak supervision,” arXiv preprint arXiv:2212.04356, 2023

[8] Qwen Team, “Qwen3-ASR technical report,” https://github.com/QwenLM/Qwen3-ASR, 2025, Accessed: 2026-05-07

[9] Keyu An, Qian Chen, Chong Deng, Zhihao Du, Changfeng Gao, Zhifu Gao, Yue Gu, Ting He, et al., “FunAudioLLM: Voice understanding and generation foundation models for natural interaction between humans and LLMs,” https://arxiv.org/abs/2407.04051, 2024

[10] Victoria Zayats, Mari Ostendorf, and Hannaneh Hajishirzi, “Disfluency detection using a bidirectional LSTM,” in Proc. Interspeech, 2016, pp. 2523–2527

[11] Paria Jamshid Lou and Mark Johnson, “Improving disfluency detection by self-training a self-attentive model,” in Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020, pp. 3754–3763

[12] Bing Wang, Wanxiang Che, and Ting Liu, “Spoken language understanding with spoken-to-written conversion,” in Proc. Interspeech, 2020, pp. 4661–4665

[13] Chen Chen, Yuchen Hu, Chao-Han Huck Yang, Sabato Marco Siniscalchi, Pin-Yu Chen, and Eng Siong Chng, “HyPoradise: An open baseline for generative speech recognition with large language models,” in Advances in Neural Information Processing Systems, 2023, vol. 36

[14] OpenAI, “GPT-4o system card and model release,” https://openai.com/index/hello-gpt-4o/, 2024, Accessed: 2026-05-07

[15] Binbin Zhang, Hang Lv, Pengcheng Guo, Qijie Shao, Chao Yang, Lei Xie, Xin Xu, Hui Bu, Xiaoyu Chen, Chenchen Zeng, et al., “WenetSpeech: A 10000+ hours multi-domain Mandarin corpus for speech recognition,” in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2022, pp. 6363–6367

[16] SpeechColab, “SpeechIO TIOBE: A large-scale benchmarking platform for Chinese automatic speech recognition,” https://github.com/SpeechColab/Leaderboard, 2021, Accessed: 2026-05-18

[17] DeepSeek-AI, Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, et al., “DeepSeek-V3 technical report,” https://arxiv.org/abs/2412.19437, 2024

[18] Geoffrey Hinton, Li Deng, Dong Yu, George E Dahl, Abdel-rahman Mohamed, Navdeep Jaitly, Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, Tara N Sainath, et al., “Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups,” IEEE Signal Processing Magazine, vol. 29, no. 6, pp. 82–97, 2012. Table 6. Bitsand...

[19] Alex Graves, Santiago Fernández, Faustino Gomez, and Jürgen Schmidhuber, “Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks,” in Proceedings of the 23rd International Conference on Machine Learning, 2006, pp. 369–376

[20] William Chan, Navdeep Jaitly, Quoc Le, and Oriol Vinyals, “Listen, attend and spell: A neural network for large vocabulary conversational speech recognition,” in 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2016, pp. 4960–4964

[21] Richard Sproat, Alan W Black, Stanley Chen, Shankar Kumar, Mari Ostendorf, and Christopher Richards, “Normalization of non-standard words,” Computer Speech & Language, vol. 15, no. 3, pp. 287–333, 2001

[22] Richard Sproat and Navdeep Jaitly, “RNN approaches to text normalization: A challenge,” arXiv preprint arXiv:1611.00068, 2017

[23] Ilya Loshchilov and Frank Hutter, “Decoupled weight decay regularization,” in International Conference on Learning Representations, 2019

[24] Georgi Gerganov et al., “llama.cpp: Efficient LLM inference in C/C++,” https://github.com/ggerganov/llama.cpp, 2023, Introduces the GGUF model format for portable, quantized on-device inference. Accessed: 2026-05-11

[25] Tim Dettmers, Mike Lewis, Younes Belkada, and Luke Zettlemoyer, “LLM.int8(): 8-bit matrix multiplication for transformers at scale,” https://arxiv.org/abs/2208.07339, 2022

[26] Zhihong Shao, Peiyi Wang, Qihao Zhu, et al., “DeepSeekMath: Pushing the limits of mathematical reasoning in open language models,” arXiv preprint arXiv:2402.03300, 2024.

A. APPENDIX. A.1. Bitsandbytes Quantization Results. Table 6 reports bitsandbytes [19] INT8/INT4 quantization results as a complement to the GGUF results in Section 5. INT8 is near-lossless...
