arxiv: 2605.23463 · v1 · pith:52L7GHSYnew · submitted 2026-05-22 · 📡 eess.AS

StepAudio 2.5 Technical Report

Pith reviewed 2026-05-25 02:50 UTC · model grok-4.3

classification 📡 eess.AS
keywords unified audio-language model · automatic speech recognition · text-to-speech synthesis · realtime spoken interaction · reinforcement learning from human feedback · multimodal foundation model

The pith

A single audio-language model matches specialized systems at speech recognition, synthesis, and realtime dialogue.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces StepAudio 2.5 as a unified foundation that reaches state-of-the-art results in automatic speech recognition, text-to-speech synthesis, and realtime spoken interaction. It claims that once text and audio occupy a shared multimodal space, the three tasks no longer require separate architectures but can instead be handled through different choices of training data, optimization targets, and decoding rules. The work shifts from ordinary supervised training to task-specific reinforcement learning from human feedback as the main way to set those targets. A sympathetic reader would care because the result points toward fewer models needed to cover the full range of audio capabilities in applications such as voice assistants and live conversation systems.

Core claim

StepAudio 2.5 shows that a shared audio-language backbone can internalize the distinct deployment objectives of speech understanding, generation, and live interaction by advancing post-training to task-tailored RLHF together with specialized decoding, thereby matching or exceeding the performance of systems built separately for ASR, TTS, and realtime dialogue.

What carries the argument

Task-tailored Reinforcement Learning from Human Feedback applied after text and audio share a multimodal representational space, used to set distinct optimization targets and decoding constraints for each operational mode.

If this is right

  • ASR mode improves transcription efficiency through verifiable multi-token decoding.
  • TTS mode produces controllable and expressive output via preference-based RLHF and context-rich supervision.
  • Realtime mode delivers low-latency, persona-consistent dialogue through generative reward modeling inside the RLHF framework.
  • The single backbone achieves state-of-the-art numbers across all three tasks on standard benchmarks.
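One way to picture the "verifiable multi-token decoding" in the first bullet is a draft-and-verify step: extra branches propose several future tokens, the main autoregressive decoder checks them, and only the agreeing prefix is kept. The sketch below is a toy illustration under that assumption. The names `verify_draft` and `toy_next_token` are hypothetical, the verification runs sequentially for readability, and nothing here is taken from the paper's implementation.

```python
from typing import Callable, List, Sequence


def verify_draft(
    prefix: Sequence[int],
    draft: Sequence[int],
    main_next_token: Callable[[List[int]], int],
) -> List[int]:
    """Accept the longest draft prefix the main decoder agrees with.

    Always returns at least one token: either the main decoder's
    correction at the first mismatch, or one extra token after a
    fully accepted draft.
    """
    context = list(prefix)
    accepted: List[int] = []
    for token in draft:
        expected = main_next_token(context)
        if token != expected:
            # First disagreement: keep the main decoder's token and stop.
            accepted.append(expected)
            return accepted
        accepted.append(token)
        context.append(token)
    # Every drafted token was confirmed; the main decoder adds one more.
    accepted.append(main_next_token(context))
    return accepted


def toy_next_token(context: List[int]) -> int:
    """Hypothetical stand-in for a greedy decoder: counts upward mod 10."""
    return (context[-1] + 1) % 10
```

With the toy decoder, `verify_draft([1, 2], [3, 4, 9], toy_next_token)` keeps the two correct draft tokens and replaces the wrong third one, so the output is identical to what plain greedy decoding would have produced, only reached in fewer verification rounds.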

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the premise holds, developers could maintain one model instead of three separate pipelines for audio tasks.
  • The same operational-regime approach might allow additional audio capabilities to be added without redesigning the core architecture.
  • Consistent persona across understanding and generation modes could simplify building reliable conversational agents.

Load-bearing premise

Once text and audio share a multimodal representational space, task specialization reduces to choices in data construction, optimization targets, and decoding constraints.

What would settle it

Head-to-head evaluation on a standard benchmark in which StepAudio 2.5 fails to match or exceed the best specialized system in at least one of ASR, TTS, or realtime interaction.

Figures

Figures reproduced from arXiv: 2605.23463 by Bin Lin, Boyong Wu, Bo Zhao, Brian Li, Changlin Zhang, Chang Zeng, Chao Yan, Chen Geng, Chenghao Dong, Chengli Feng, Cheng Yi, Chengyuan Yao, Chen Wu, Daijiao Liu, Danni Wan, Dan Zhou, Daxin Jiang, Di Chen, Die Zhang, Dongqing Pang, Fei Tian, Feng Tian, Future Li, Gang Yu, Guanglong Yang, Guoqiang Hu, Haiyang Sun, Haoyang Zhang, Huangxi Zhu, Jiangjie Zhen, Jianzheng Gao, Jinghua Liang, Jinglan Gong, Jinmei Wan, Jun Chen, Junjie Yuan, Kang An, Lei Lei, Limin Zhong, Li Xie, Lun Cai, Mengqiang Ren, Mingliang Li, Mingxiao Li, Min Xu, Na Wang, Peilin Li, Pengfei Tan, Peng Yang, Qiang Tong, Qiaoling Huang, Qingfu Du, Qingjian Lin, Rui Wang, Runze Li, Shengchen Zhou, Shenghua Hu, Shihao Peng, Shiliang Yang, Shi Qiu, Siqi Tu, Siyi Zhou, Tianjiao Deng, Ting Xu, Tong Wang, WeiMing Niu, Wenwen Qu, Wuxun Xie, Xiangyu Li, Xiangyu Tony Zhang, Xiangyu Zhang, Xianwei Zhang, Xianyu Feng, Xiaojia Liu, Xing Chen, Xiongbin Wu, Xuerui Yang, Yang Li, Yang Yang, Yan Wu, Yechang Huang, Yibo Zhu, Yifan Zhang, Yile Liu, Yi Liu, Yongshen Long, Yuanhao Ding, Yuchu Luo, Yu Fu, Yuhao Wang, Yuhe Yin, Yu Luo, Yunfang Xu, Yuxiang Yang, Yuxin Li, Yuxin Zhang, Zhengyan Sheng, Zhiguo Huang, Zhiyue Wu, Zichao Li, Zichao Zhou.

Figure 1: A unified view of the StepAudio 2.5 model family. The shared audio-language stack provides the common architectural basis used to organize ASR, TTS, and Realtime, while the three systems serve different deployment goals.

Surrounding text (truncated): … prior plus a mechanism to route supervision through different output spaces and deployment regimes. Recognition, synthesis, and realtime dialogue then become three ways of querying the sam…
Figure 2: ASR architecture in StepAudio 2.5. The shared encoder-adaptor-decoder backbone is augmented with parallel future-token branches, making decoding substantially more efficient while preserving autoregressive verification.

Surrounding text (truncated): … processed by a decoder-style Transformer block. All branches share the same embedding layer and vocabulary output head as the main decoder. 4.1 Training Pipeline. ASR SFT: Supervised fine-tun…
Figure 3: Long-form ASR data construction pipeline. The process transitions from individual clip transcription to global session-level refinement to ensure both accuracy and consistency.

Surrounding text (truncated): Both stages inherit the 32K sequence budget, 32 global batch size, and 10K-step training horizon. During training, the main branch predicts the next token $x_{t+1}$ at position $t$, while the $h$-th MTP branch targets the future token $x_{t+1+h}$…
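The target layout described in this excerpt (the main branch predicts the token at index t+1 from position t, and the h-th MTP branch targets the token at index t+1+h) can be illustrated with plain lists. This is a minimal sketch: the function name `mtp_targets`, the `IGNORE` padding value, and treating the main branch as h = 0 with MTP branches numbered from 1 are assumptions made for illustration, since the excerpt does not show the paper's actual data layout.

```python
from typing import Dict, List

IGNORE = -100  # assumed padding label for positions with no valid target


def mtp_targets(tokens: List[int], num_branches: int) -> Dict[int, List[int]]:
    """Build per-branch training targets for multi-token prediction.

    Key 0 is the main branch (target index t + 1); key h >= 1 is the
    h-th MTP branch (target index t + 1 + h). Positions whose target
    would fall past the end of the sequence are filled with IGNORE.
    """
    n = len(tokens)
    targets: Dict[int, List[int]] = {}
    for h in range(num_branches + 1):
        offset = 1 + h
        targets[h] = [
            tokens[t + offset] if t + offset < n else IGNORE
            for t in range(n)
        ]
    return targets
```

For `tokens = [10, 11, 12, 13, 14]` and two MTP branches, the main branch sees `[11, 12, 13, 14, IGNORE]`, while branch 2 sees `[13, 14, IGNORE, IGNORE, IGNORE]`: each additional branch looks one step further ahead and loses one more valid position at the end.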
Figure 4: Arena Win Rates of StepAudio-2.5-TTS.

Surrounding text (truncated): Finally, we select three leading models with controllable generation capabilities: MiniMax-2.8-HD, Elevenlabs-v3, and Gemini-3.1-Flash-TTS. For each model, we adopt its officially recommended optimal voice preset and conduct arena-based evaluation using 774 prompts. The results in …
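Arena-style results like these are usually summarized as a win rate over pairwise judgments. The sketch below shows one common way to compute it. The outcome labels and the `tie_weight` parameter are assumptions for illustration; the excerpt does not say how the paper counts ties.

```python
from collections import Counter
from typing import Iterable


def win_rate(outcomes: Iterable[str], tie_weight: float = 0.5) -> float:
    """Fraction of pairwise comparisons won, with ties partially credited.

    `outcomes` holds one label per prompt: "win", "tie", or "loss".
    `tie_weight` is an assumed convention (0.5 splits ties evenly;
    0.0 treats them as non-wins).
    """
    counts = Counter(outcomes)
    total = counts["win"] + counts["tie"] + counts["loss"]
    if total == 0:
        raise ValueError("no valid outcomes to score")
    return (counts["win"] + tie_weight * counts["tie"]) / total
```

Reporting the tie convention alongside the number matters: the same set of judgments gives a different win rate under `tie_weight=0.5` than under `tie_weight=0.0`.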
Figure 5: Realtime interaction evaluation. Higher is better. Best results are in bold.

Surrounding text (truncated): Results Analysis: As shown in …
Original abstract

Unified audio-language modeling has emerged as a prominent trend in modern speech systems, promising to bring the reasoning capabilities of large language models to auditory tasks. However, existing unified foundations often struggle to match the depth of specialized systems across automatic speech recognition (ASR), text-to-speech synthesis (TTS), and realtime spoken interaction. Bridging this gap remains an open challenge. This report presents StepAudio 2.5, a unified audio-language foundation model that matches or exceeds specialized systems across all three capabilities. Rather than treating these tasks as architecturally distinct, we operate on the premise that once text and audio share a multimodal representational space, task specialization becomes a matter of operational regimes: data construction, optimization targets, and decoding constraints. Guided by this insight, we advance the post-training paradigm from standard supervised learning to task-tailored Reinforcement Learning from Human Feedback (RLHF), using it as the primary mechanism to define complex optimization targets. We leverage this RLHF-centric alignment, alongside specialized decoding, to shape a shared backbone into three distinct operational modes. Concretely, the ASR branch advances transcription efficiency via verifiable multi-token decoding; the TTS branch achieves controllable, expressive synthesis through preference-based RLHF and context-rich supervision; and the Realtime branch realizes low-latency, persona-consistent dialogue via generative reward modeling within an RLHF framework. On standard benchmarks, StepAudio 2.5 achieves state-of-the-art results across ASR, TTS, and Realtime, demonstrating that a singular audio-language foundation can successfully internalize the distinct deployment objectives of speech understanding, generation, and live interaction.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The manuscript presents StepAudio 2.5, a unified audio-language foundation model that matches or exceeds specialized systems across ASR, TTS, and realtime spoken interaction. It operates on the premise that a shared multimodal representational space allows task specialization via operational regimes (data construction, optimization targets, and decoding constraints), with RLHF as the primary post-training mechanism: verifiable multi-token decoding for ASR, preference-based RLHF for TTS, and generative reward modeling for realtime.

Significance. If the SOTA claims are substantiated with detailed, reproducible benchmarks including error bars, dataset specifications, and direct comparisons to specialized baselines, the work would be significant for showing that a single backbone can internalize distinct deployment objectives through RLHF-centric alignment rather than separate architectures.

major comments (1)
  1. [Abstract] Abstract: the central claim that StepAudio 2.5 'achieves state-of-the-art results across ASR, TTS, and Realtime' is presented without any quantitative metrics (e.g., WER, MOS, latency figures), error bars, dataset details, or comparison tables. This directly undermines verification of the performance claim that is load-bearing for the entire contribution.
minor comments (1)
  1. [Abstract] Abstract, paragraph 3: the phrase 'standard benchmarks' is used without naming the specific datasets or metrics, reducing clarity on how the SOTA comparisons were performed.
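To make the request for metrics with error bars concrete, the sketch below computes corpus-level word error rate (WER) and a percentile bootstrap confidence interval over utterances. It is a minimal illustration, not the paper's evaluation code: the function names are hypothetical, text normalization is reduced to whitespace splitting, and at least one reference in each resample is assumed to be non-empty.

```python
import random
from typing import List, Sequence, Tuple

Pair = Tuple[str, str]  # (reference transcript, hypothesis transcript)


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Word-level Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def corpus_wer(pairs: List[Pair]) -> float:
    """Total word errors divided by total reference words."""
    errors = sum(edit_distance(r.split(), h.split()) for r, h in pairs)
    words = sum(len(r.split()) for r, _ in pairs)
    return errors / words


def bootstrap_ci(
    pairs: List[Pair], n_resamples: int = 1000, alpha: float = 0.05, seed: int = 0
) -> Tuple[float, float]:
    """Percentile bootstrap interval for corpus WER, resampling utterances."""
    rng = random.Random(seed)
    stats = sorted(
        corpus_wer([rng.choice(pairs) for _ in pairs]) for _ in range(n_resamples)
    )
    lower = stats[int((alpha / 2) * n_resamples)]
    upper = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper
```

Resampling whole utterances rather than individual words keeps the errors within an utterance together, which is usually the more conservative choice for ASR test sets.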

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on the abstract. We address the single major comment below and will revise accordingly to strengthen verifiability of the claims.

Point-by-point responses
  1. Referee: [Abstract] Abstract: the central claim that StepAudio 2.5 'achieves state-of-the-art results across ASR, TTS, and Realtime' is presented without any quantitative metrics (e.g., WER, MOS, latency figures), error bars, dataset details, or comparison tables. This directly undermines verification of the performance claim that is load-bearing for the entire contribution.

    Authors: We agree that the abstract would benefit from explicit quantitative support to allow immediate assessment of the SOTA claims. The full manuscript contains detailed benchmark tables, dataset specifications, and direct comparisons in the experimental sections, but the abstract relies on a summary statement. In the revised version we will update the abstract to include representative metrics (e.g., WER on LibriSpeech, MOS on standard TTS test sets, and end-to-end latency for realtime), along with brief references to baselines and error bars where reported. This change improves transparency without altering the technical narrative or results.

    Revision: yes

Circularity Check

0 steps flagged

No significant circularity identified

Full rationale

The paper states its central premise explicitly as an operating assumption ('we operate on the premise that once text and audio share a multimodal representational space, task specialization becomes a matter of operational regimes') and then describes the application of RLHF and specialized decoding to produce three modes. No equations, fitted parameters renamed as predictions, or self-citation chains are present in the provided text that would reduce the claimed results to the inputs by construction. The SOTA claims rest on benchmark outcomes rather than any definitional equivalence or load-bearing self-reference. This is the normal case of a self-contained empirical report.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The report relies on the untested premise that a shared multimodal space plus task-specific RLHF is sufficient to match specialized systems; no independent evidence for this premise is supplied in the abstract.

axioms (1)
  • Domain assumption: Once text and audio share a multimodal representational space, task specialization reduces to data construction, optimization targets, and decoding constraints.
    Stated explicitly in the abstract as the guiding insight.

pith-pipeline@v0.9.0 · 6189 in / 1212 out tokens · 13214 ms · 2026-05-25T02:50:13.420355+00:00 · methodology

Reference graph

Works this paper leans on

39 extracted references · 39 canonical work pages · 11 internal anchors

  1. [1]

    Connectionist temporal classification

    Alex Graves. Connectionist temporal classification. In Supervised sequence labelling with recurrent neural networks, pages 61–93. Springer, 2012

  2. [2]

    Sequence Transduction with Recurrent Neural Networks

    Alex Graves. Sequence transduction with recurrent neural networks. arXiv preprint arXiv:1211.3711, 2012

  3. [3]

    Listen, Attend and Spell

    William Chan, Navdeep Jaitly, Quoc V Le, and Oriol Vinyals. Listen, attend and spell. arXiv preprint arXiv:1508.01211, 2015

  4. [4]

    Robust speech recognition via large-scale weak supervision

    Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. Robust speech recognition via large-scale weak supervision. pages 28492–28518, 2023

  5. [5]

    VIBEVOICE-ASR technical report

    Zhiliang Peng, Jianwei Yu, Yaoyao Chang, Zilong Wang, Li Dong, Yingbo Hao, et al. VIBEVOICE-ASR technical report. arXiv preprint arXiv:2601.18184, 2026

  6. [6]

    Fun-ASR technical report

    Keyu An, Yanni Chen, Zhigao Chen, Chong Deng, Zhihao Du, Changfeng Gao, et al. Fun-ASR technical report. arXiv preprint arXiv:2509.12508, 2025

  7. [7]

    Seed-asr: Understanding diverse speech and contexts with llm-based speech recognition

    Ye Bai, Jingping Chen, Jitong Chen, Wei Chen, Zhuo Chen, Chuang Ding, Linhao Dong, Qianqian Dong, Yujiao Du, Kepan Gao, et al. Seed-asr: Understanding diverse speech and contexts with llm-based speech recognition. arXiv preprint arXiv:2407.04675, 2024

  8. [8]

    Qwen3-ASR Technical Report

    Xian Shi, Xiong Wang, Zhifang Guo, Yongqi Wang, Pei Zhang, Xinyu Zhang, et al. Qwen3-ASR technical report. arXiv preprint arXiv:2601.21337, 2026

  9. [9]

    Step-Audio 2 Technical Report

    Boyong Wu, Chao Yan, Chen Hu, Cheng Yi, Chengli Feng, Fei Tian, Feiyu Shen, Gang Yu, Haoyang Zhang, Jingbei Li, et al. StepAudio 2 technical report. arXiv preprint arXiv:2507.16632, 2025

  10. [10]

    Qwen3-Omni Technical Report

    Jin Xu, Zhifang Guo, Hangrui Hu, Yunfei Chu, Xiong Wang, Jinzheng He, Yuxuan Wang, Xian Shi, Ting He, Xinfa Zhu, et al. Qwen3-Omni technical report. arXiv preprint arXiv:2509.17765, 2025

  11. [11]

    Boosting Omni-Modal Language Models: Staged Post-Training with Visually Debiased Evaluation

    Che Liu, Lichao Ma, Xiangyu Tony Zhang, Yuxin Zhang, Haoyang Zhang, Xuerui Yang, and Fei Tian. Boosting omni-modal language models: Staged post-training with visually debiased evaluation, 2026. URL https://arxiv.org/abs/2605.12034

  12. [12]

    Salmonn: Towards generic hearing abilities for large language models

    Changli Tang, Wenyi Yu, Guangzhi Sun, Xianzhao Chen, Tian Tan, Wei Li, Lu Lu, Zejun Ma, and Chao Zhang. Salmonn: Towards generic hearing abilities for large language models. In International Conference on Learning Representations, volume 2024, pages 16607–16629, 2024

  13. [13]

    Audiolm: a language modeling approach to audio generation

    Zalán Borsos, Raphaël Marinier, Damien Vincent, Eugene Kharitonov, Olivier Pietquin, Matt Sharifi, Dominik Roblek, Olivier Teboul, David Grangier, Marco Tagliasacchi, et al. Audiolm: a language modeling approach to audio generation. IEEE/ACM transactions on audio, speech, and language processing, 31:2523–2533, 2023

  14. [14]

    Recent advances in speech language models: A survey

    Wenqian Cui, Dianzhi Yu, Xiaoqi Jiao, Ziqiao Meng, Guangyan Zhang, Qichao Wang, Steven Y Guo, and Irwin King. Recent advances in speech language models: A survey. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 13943–13970, 2025

  15. [15]

    Paralinguistics-aware speech-empowered large language models for natural conversation

    Heeseung Kim, Soonshin Seo, Kyeongseok Jeong, Ohsung Kwon, Soyoon Kim, Jungwhan Kim, Jaehong Lee, Eunwoo Song, Myungwoo Oh, Jung-Woo Ha, et al. Paralinguistics-aware speech-empowered large language models for natural conversation. Advances in Neural Information Processing Systems, 37:131072–131103, 2024

  16. [16]

    Freeze-omni: A smart and low latency speech-to-speech dialogue model with frozen LLM

    Xiong Wang, Yangze Li, Chaoyou Fu, Yike Zhang, Yunhang Shen, Lei Xie, Ke Li, Xing Sun, and Long MA. Freeze-omni: A smart and low latency speech-to-speech dialogue model with frozen LLM. In Forty-second International Conference on Machine Learning, 2025. URL https://openreview.net/forum?id=s1EImzs5Id

  17. [17]

    Depflow: Disentangled speech generation to mitigate semantic bias in depression detection

    Yuxin Li, Xiangyu Zhang, Yifei Li, Zhiwei Guo, Haoyang Zhang, Eng Siong Chng, and Cuntai Guan. Depflow: Disentangled speech generation to mitigate semantic bias in depression detection. arXiv preprint arXiv:2601.00303, 2026

  18. [18]

    A new approach to extract fetal electrocardiogram using affine combination of adaptive filters

    Yu Xuan, Xiangyu Zhang, Shuyue Stella Li, Zihan Shen, Xin Xie, Leibny Paola Garcia, and Roberto Togneri. A new approach to extract fetal electrocardiogram using affine combination of adaptive filters. In ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5. IEEE, 2023

  19. [19]

    Multi-bench: A multi-turn interactive benchmark for assessing emotional intelligence ability of spoken dialogue models

    Yayue Deng, Guoqiang Hu, Haiyang Sun, Xiangyu Zhang, Haoyang Zhang, Fei Tian, Xuerui Yang, Gang Yu, and Eng Siong Chng. Multi-bench: A multi-turn interactive benchmark for assessing emotional intelligence ability of spoken dialogue models. arXiv preprint arXiv:2511.00850, 2025

  20. [20]

    Gemini: A Family of Highly Capable Multimodal Models

    Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023

  21. [22]

    Mind-Paced Speaking: A Dual-Brain Approach to Real-Time Reasoning in Spoken Language Models

    Donghang Wu, Haoyang Zhang, Jun Chen, Hexin Liu, Eng Siong Chng, Fei Tian, Xuerui Yang, Xiangyu Zhang, Daxin Jiang, Gang Yu, et al. Mind-paced speaking: A dual-brain approach to real-time reasoning in spoken language models. arXiv preprint arXiv:2510.09592, 2025

  22. [23]

    Chronological thinking in full-duplex spoken dialogue language models

    Donghang Wu, Haoyang Zhang, Chen Chen, Tianyu Zhang, Fei Tian, Xuerui Yang, Gang Yu, Hexin Liu, Nana Hou, Yuchen Hu, et al. Chronological thinking in full-duplex spoken dialogue language models. arXiv preprint arXiv:2510.05150, 2025

  23. [24]

    Duplexsla: A full-duplex spoken language model with synchronized speech, language, and action

    Haoyang Zhang, Jun Chen, Donghang Wu, Yuxin Li, Yuxin Zhang, Xiangyu Tony Zhang, Che Liu, Qingjian Lin, Yizhou Peng, Hexin Liu, Eng Siong Chng, Chao Yan, Boyong Wu, Yechang Huang, Xuerui Yang, and Fei Tian. Duplexsla: A full-duplex spoken language model with synchronized speech, language, and action, 2026. URL https://arxiv.org/abs/2605.20755

  24. [25]

    Mamba in speech: Towards an alternative to self-attention

    Xiangyu Zhang, Qiquan Zhang, Hexin Liu, Tianyi Xiao, Xinyuan Qian, Beena Ahmed, Eliathamby Ambikairajah, Haizhou Li, and Julien Epps. Mamba in speech: Towards an alternative to self-attention. IEEE Transactions on Audio, Speech and Language Processing, 2025

  25. [26]

    Code-switching speech recognition under the lens: Model-and data-centric perspectives

    Hexin Liu, Haoyang Zhang, Qiquan Zhang, Xiangyu Zhang, Dongyuan Shi, Eng Siong Chng, and Haizhou Li. Code-switching speech recognition under the lens: Model-and data-centric perspectives. IEEE Transactions on Audio, Speech and Language Processing, 2026

  26. [27]

    Step-audio-r1 technical report

    Fei Tian, Xiangyu Tony Zhang, Yuxin Zhang, Haoyang Zhang, Yuxin Li, Daijiao Liu, Yayue Deng, Donghang Wu, Jun Chen, Liang Zhao, et al. Step-audio-r1 technical report. arXiv preprint arXiv:2511.15848, 2025

  27. [28]

    Step-Audio-R1.5 Technical Report

    Yuxin Zhang, Xiangyu Tony Zhang, Daijiao Liu, Fei Tian, Yayue Deng, Jun Chen, Qingjian Lin, Haoyang Zhang, Yuxin Li, Jinglan Gong, et al. Step-audio-r1.5 technical report. arXiv preprint arXiv:2604.25719, 2026

  28. [29]

    SpecAugment: A simple data augmentation method for automatic speech recognition

    Daniel S. Park, William Chan, Yu Zhang, et al. SpecAugment: A simple data augmentation method for automatic speech recognition. In Interspeech 2019, pages 2613–2617, 2019

  29. [30]

    A post-processing system to yield reduced word error rates: Recognizer output voting error reduction (ROVER)

    J. G. Fiscus. A post-processing system to yield reduced word error rates: Recognizer output voting error reduction (ROVER). In 1997 IEEE Workshop on Automatic Speech Recognition and Understanding Proceedings, pages 347–354, 1997

  30. [31]

    AIShell-1: An open-source mandarin speech corpus and a speech recognition baseline

    Hui Bu, Jiatong Du, Xingyu Na, Bengu Wu, and Hao Zheng. AIShell-1: An open-source mandarin speech corpus and a speech recognition baseline. In 20th Conference of the Oriental Chapter of the International Coordinating Committee on Speech Databases and Speech I/O Systems and Assessment, pages 1–5, 2017

  31. [32]

    AISHELL-2: Transforming Mandarin ASR Research Into Industrial Scale

    Jiatong Du, Xingyu Na, Xuechen Liu, and Hui Bu. AISHELL-2: Transforming mandarin ASR research into industrial scale. arXiv preprint arXiv:1808.10583, 2018

  32. [33]

    WenetSpeech: A 10000+ hours multi-domain mandarin corpus for speech recognition

    Binbin Zhang, Hang Lv, Pengcheng Guo, Qijie Shao, Chao Yang, Lei Xie, Xin Xu, Hui Bu, Xiaoyu Chen, Chenchen Zeng, et al. WenetSpeech: A 10000+ hours multi-domain mandarin corpus for speech recognition. In ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing, pages 6182–6186, 2022

  33. [34]

    FLEURS: Few-shot learning evaluation of universal representations of speech

    Alexis Conneau, Min Ma, Simran Khanuja, et al. FLEURS: Few-shot learning evaluation of universal representations of speech. arXiv preprint arXiv:2205.12446, 2022

  34. [35]

    LibriSpeech: An ASR corpus based on public domain audio books

    Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur. LibriSpeech: An ASR corpus based on public domain audio books. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing, pages 5206–5210, 2015

  35. [36]

    Common voice: A massively-multilingual speech corpus

    Rosana Ardila, Megan Branson, Kelly Davis, et al. Common voice: A massively-multilingual speech corpus. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 4218–4222, 2020

  36. [37]

    Voxpopuli-cleaned-aa: Cleaned ground truth transcripts for voxpopuli english test set

    Artificial Analysis. Voxpopuli-cleaned-aa: Cleaned ground truth transcripts for voxpopuli english test set, 2026. URL https://artificialanalysis.ai/articles/aa-wer-v2

  37. [38]

    Earnings22-cleaned-aa: Cleaned ground truth transcripts for earnings22 english test set

    Artificial Analysis. Earnings22-cleaned-aa: Cleaned ground truth transcripts for earnings22 english test set, 2026. URL https://artificialanalysis.ai/articles/aa-wer-v2

  38. [39]

    Step-audio-editx technical report

    Chao Yan, Boyong Wu, Peng Yang, Pengfei Tan, Guoqiang Hu, Yuxin Zhang, Xiangyu, Zhang, Fei Tian, Xuerui Yang, Xiangyu Zhang, Daxin Jiang, and Gang Yu. Step-audio-editx technical report, 2025. URL https://arxiv.org/abs/2511.03601

  39. [40]

    Proximal Policy Optimization Algorithms

    John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017