FunASR: A Fundamental End-to-End Speech Recognition Toolkit

Haoneng Luo; Jiaming Wang; Lingyun Zuo; Mengzhe Chen; Shiliang Zhang; Xian Shi; Yabin Li; Zerui Li; Zhangyu Xiao; Zhifu Gao

arxiv: 2305.11013 · v1 · pith:DFITJTZInew · submitted 2023-05-18 · 💻 cs.SD · cs.CL · eess.AS


Zhifu Gao, Zerui Li, Jiaming Wang, Haoneng Luo, Xian Shi, Mengzhe Chen, Yabin Li, Lingyun Zuo, Zhihao Du, Zhangyu Xiao, Shiliang Zhang


classification 💻 cs.SD · cs.CL · eess.AS

keywords speech · model · recognition · paraformer · trained · funasr · industrial · toolkit


Abstract

This paper introduces FunASR, an open-source speech recognition toolkit designed to bridge the gap between academic research and industrial applications. FunASR offers models trained on large-scale industrial corpora and the ability to deploy them in applications. The toolkit's flagship model, Paraformer, is a non-autoregressive end-to-end speech recognition model that has been trained on a manually annotated Mandarin speech recognition dataset that contains 60,000 hours of speech. To improve the performance of Paraformer, we have added timestamp prediction and hotword customization capabilities to the standard Paraformer backbone. In addition, to facilitate model deployment, we have open-sourced a voice activity detection model based on the Feedforward Sequential Memory Network (FSMN-VAD) and a text post-processing punctuation model based on the controllable time-delay Transformer (CT-Transformer), both of which were trained on industrial corpora. These functional modules provide a solid foundation for building high-precision long audio speech recognition services. Compared to other models trained on open datasets, Paraformer demonstrates superior performance.
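The abstract describes a long-audio pipeline in which voice activity detection splits the recording, a recognizer transcribes each segment, and a punctuation model post-processes the text. The sketch below illustrates only that control flow, under loud assumptions: `detect_speech_segments` is a toy energy threshold standing in for FSMN-VAD, and the `recognize` and `punctuate` callbacks are hypothetical placeholders for Paraformer and CT-Transformer. None of these names come from the FunASR codebase.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Segment = Tuple[int, int]  # [start, end) frame indices


def detect_speech_segments(energies: Sequence[float],
                           threshold: float = 0.5) -> List[Segment]:
    """Toy energy-threshold VAD (a stand-in, NOT the FSMN-VAD model)."""
    segments: List[Segment] = []
    start = None
    for i, energy in enumerate(energies):
        if energy >= threshold and start is None:
            start = i                       # speech begins at this frame
        elif energy < threshold and start is not None:
            segments.append((start, i))     # speech ended before this frame
            start = None
    if start is not None:                   # audio ended while still in speech
        segments.append((start, len(energies)))
    return segments


def transcribe_long_audio(energies: Sequence[float],
                          recognize: Callable[[int, int], str],
                          punctuate: Callable[[str], str],
                          threshold: float = 0.5) -> List[Dict[str, object]]:
    """Chain VAD, per-segment recognition, and punctuation (hypothetical hooks)."""
    results: List[Dict[str, object]] = []
    for start, end in detect_speech_segments(energies, threshold):
        raw_text = recognize(start, end)    # placeholder for the ASR model
        results.append({"start": start, "end": end, "text": punctuate(raw_text)})
    return results


# Usage with fake models, just to exercise the control flow.
energies = [0.0, 0.9, 0.8, 0.1, 0.0, 0.7, 0.6]
fake_asr = lambda s, e: f"segment {s} {e}"
fake_punc = lambda text: text.capitalize() + "."
transcript = transcribe_long_audio(energies, fake_asr, fake_punc)
```

Keeping the segment boundaries alongside each piece of text is one simple way to carry coarse timing through the pipeline; the timestamp prediction mentioned in the abstract is a model capability and is not reproduced here.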



Forward citations

Cited by 11 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

FlexiSLM: A Dynamic and Controllable Frame Rate Spoken Language Model
cs.SD · 2026-06 · unverdicted · novelty 7.0

FlexiSLM is the first spoken language model supporting dynamic and controllable frame rates on speech input and output, outperforming fixed-rate 7B models at high quality and enabling faster inference at lower rates l...

SpeechEditBench: A Bilingual Multi-Attribute Benchmark for Instruction-Guided Speech Editing
eess.AS · 2026-06 · unverdicted · novelty 7.0

SpeechEditBench provides seven atomic editing tasks, compositional multi-operation instructions, and an anchor-based protocol yielding target success, preservation success, and joint success metrics; evaluations show ...

CosyVoice 3: Towards In-the-wild Speech Generation via Scaling-up and Post-training
cs.SD · 2025-05 · unverdicted · novelty 6.0

CosyVoice 3 achieves better content consistency, speaker similarity, and prosody naturalness in zero-shot multilingual speech synthesis by scaling data to one million hours, model size to 1.5 billion parameters, and i...

Seed-TTS: A Family of High-Quality Versatile Speech Generation Models
eess.AS · 2024-06 · unverdicted · novelty 6.0

Seed-TTS models produce speech matching human naturalness and speaker similarity, with added controllability via self-distillation and reinforcement learning.

Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models
eess.AS · 2023-11 · unverdicted · novelty 6.0

Qwen-Audio trains a unified model on diverse audio and tasks with hierarchical tags to enable strong zero-shot performance on audio understanding benchmarks and multi-turn audio chat.

Beyond Acoustic Emotion Recognition: Multimodal Pathos Analysis in Political Speech Using LLM-Based and Acoustic Emotion Models
cs.AI · 2026-05 · unverdicted · novelty 5.0

Multimodal LLM analysis correlates better with TRUST-Pathos than acoustic SER models in a case study of one Bundestag speech, while acoustic features help with arousal.

AudioFace: Language-Assisted Speech-Driven Facial Animation with Multimodal Language Models
cs.CV · 2026-05 · unverdicted · novelty 5.0

AudioFace improves speech-driven facial animation by guiding blendshape prediction with linguistic and articulatory information extracted via multimodal language models.

End-to-end Contrastive Language-Speech Pretraining Model For Long-form Spoken Question Answering
cs.SD · 2025-11 · unverdicted · novelty 5.0

CLSR is an end-to-end contrastive language-speech retriever using an intermediate text-like conversion step to improve retrieval of relevant segments from long audio for spoken question answering.

F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching
eess.AS · 2024-10 · unverdicted · novelty 5.0

F5-TTS generates natural speech from text via flow matching on DiT with simple text padding, ConvNeXt refinement, and sway sampling, trained on 100K hours multilingual data.

Qwen2-Audio Technical Report
eess.AS · 2024-07 · unverdicted · novelty 4.0

Qwen2-Audio is an open-source audio-language model that outperforms prior systems such as Gemini-1.5-pro on audio-centric instruction-following benchmarks after simplified prompt-based pre-training and expanded data.

Audio Editing in the Era of Foundation Models: A Survey
eess.AS · 2026-06 · unverdicted · novelty 3.0

A survey that presents a unified taxonomy of audio editing tasks, summarizes training-based and training-free foundation model approaches, reviews datasets and evaluation protocols, and identifies future challenges.