FormalASR: End-to-End Spoken Chinese to Formal Text

2026 · cs.CL · arXiv 2605.19266

1 Pith paper cites this work. Polarity classification of the citation is still being indexed.

abstract

Automatic speech recognition (ASR) systems are typically optimized for verbatim transcription, which preserves disfluencies, filler words, and informal spoken structures that are often unsuitable for downstream writing-oriented applications. A common workaround is a two-stage ASR+LLM pipeline in which an LLM post-edits the transcript, but this design increases latency and memory cost and is difficult to deploy on-device. We present FormalASR, two compact end-to-end models (0.6B and 1.7B) that directly transcribe spoken Chinese into formal written text. To enable this setting, we build WenetSpeech-Formal and Speechio-Formal, two large-scale spoken-to-formal datasets constructed by LLM-based rewriting and quality filtering. We then obtain both models by supervised fine-tuning of Qwen3-ASR at the corresponding scales. Experiments on WenetSpeech-Formal and Speechio-Formal show that FormalASR achieves up to 37.4% relative CER reduction over verbatim baselines, while also improving ROUGE-L and BERTScore. FormalASR requires no post-processing LLM at deployment time, providing a lightweight, on-device solution for spoken-to-formal transcription.
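The headline metric in the abstract, relative CER reduction, can be illustrated with a minimal sketch. This is not the authors' evaluation script: the function names and the toy sentences below are hypothetical, and CER is taken here in its usual form, character-level edit distance divided by reference length. The relative reduction is then (baseline CER − model CER) / baseline CER, with the formal written text as the reference.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Character-level Levenshtein distance between reference and hypothesis."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # turning i reference characters into an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference character
                curr[j - 1] + 1,     # insertion of a hypothesis character
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance normalized by reference length."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


def relative_cer_reduction(baseline_cer: float, model_cer: float) -> float:
    """Fraction of the baseline CER removed by the model."""
    if baseline_cer <= 0:
        raise ValueError("baseline CER must be positive")
    return (baseline_cer - model_cer) / baseline_cer


# Toy example (hypothetical sentences, not taken from the datasets).
formal_ref = "我们明天开会"             # formal written target
verbatim_hyp = "嗯我们那个明天开会"     # verbatim output that keeps filler words
formal_hyp = "我们明天开会了"           # formal-style output with one extra character

baseline = cer(formal_ref, verbatim_hyp)
model = cer(formal_ref, formal_hyp)
print(f"baseline CER = {baseline:.3f}, model CER = {model:.3f}")
print(f"relative CER reduction = {relative_cer_reduction(baseline, model):.1%}")
```

In this toy case the verbatim output is penalized for every filler character it keeps, which is why a verbatim baseline scores a higher CER against a formal reference even when it transcribes the audio faithfully.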

representative citing papers

FormalASR: End-to-End Spoken Chinese to Formal Text

cs.CL · 2026-05-19 · unverdicted · novelty 5.0

FormalASR fine-tunes small Qwen3-ASR models on new spoken-to-formal Chinese datasets to achieve direct transcription with up to 37.4% relative CER reduction over verbatim baselines.

citing papers explorer

Showing 1 of 1 citing paper.

FormalASR: End-to-End Spoken Chinese to Formal Text

cs.CL · 2026-05-19 · unverdicted · none · ref 1 · internal anchor
