
arXiv: 2502.11946 · v2 · pith:5TPEGTL3 · submitted 2025-02-17 · cs.CL · cs.AI · cs.HC · cs.SD · eess.AS

Step-Audio: Unified Understanding and Generation in Intelligent Speech Interaction

Ailin Huang, Boyong Wu, Bruce Wang, Chao Yan, Chen Hu, Chengli Feng, Fei Tian, Feiyu Shen, Jingbei Li, Mingrui Chen, Peng Liu, Ruihang Miao, Wang You, Xi Chen, Xuerui Yang, Yechang Huang, Yuxiang Zhang, Zheng Gong, Zixin Zhang, Hongyu Zhou, Jianjian Sun, Brian Li, Chengting Feng, Changyi Wan, Hanpeng Hu, Jianchang Wu, Jiangjie Zhen, Ranchen Ming, Song Yuan, Xuelin Zhang, Yu Zhou, Bingxin Li, Buyun Ma, Hongyuan Wang, Kang An, Wei Ji, Wen Li, Xuan Wen, Xiangwen Kong, Yuankai Ma, Yuanwei Liang, Yun Mou, Bahtiyar Ahmidi, Bin Wang, Bo Li, Changxin Miao, Chen Xu, Chenrun Wang, Dapeng Shi, Deshan Sun, Dingyuan Hu, Dula Sai, Enle Liu, Guanzhe Huang, Gulin Yan, Heng Wang, Haonan Jia, Haoyang Zhang, Jiahao Gong, Junjing Guo, Jiashuai Liu, Jiahong Liu, Jie Feng, Jie Wu, Jiaoren Wu, Jie Yang, Jinguo Wang, Jingyang Zhang, Junzhe Lin, Kaixiang Li, Lei Xia, Li Zhou, Liang Zhao, Longlong Gu, Mei Chen, Menglin Wu, Ming Li, Mingxiao Li, Mingliang Li, Mingyao Liang, Na Wang, Nie Hao, Qiling Wu, Qinyuan Tan, Ran Sun, Shuai Shuai, Shaoliang Pang, Shiliang Yang, Shuli Gao, Shanshan Yuan, Siqi Liu, Shihong Deng, Shilei Jiang, Sitong Liu, Tiancheng Cao, Tianyu Wang, Wenjin Deng, Wuxun Xie, Weipeng Ming, Wenqing He, Wen Sun, Xin Han, Xin Huang, Xiaomin Deng, Xiaojia Liu, Xin Wu, Xu Zhao, Yanan Wei, Yanbo Yu, Yang Cao, Yangguang Li, Yangzhen Ma, Yanming Xu, Yaoyu Wang, Yaqiang Shi, Yilei Wang, Yizhuang Zhou, Yinmin Zhong, Yang Zhang, Yaoben Wei, Yu Luo, Yuanwei Lu, Yuhe Yin, Yuchu Luo, Yuanhao Ding, Yuting Yan, Yaqi Dai, Yuxiang Yang, Zhe Xie, Zheng Ge, Zheng Sun, Zhewei Huang, Zhichao Chang, Zhisheng Guan, Zidong Yang, Zili Zhang, Binxing Jiao, Daxin Jiang, Heung-Yeung Shum, Jiansheng Chen, Jing Li, Shuchang Zhou, Xiangyu Zhang, Xinhao Zhang, Yibo Zhu

Pith reviewed 2026-05-18 13:34 UTC · model grok-4.3

classification cs.CL · cs.AI · cs.HC · cs.SD · eess.AS
keywords unified speech-text model · real-time speech interaction · voice cloning · instruction following · multi-modal model · dynamic speech control · open-source AI

The pith

A 130B-parameter unified speech-text model enables real-time interactive conversations with dynamic control.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes a unified multi-modal system that combines speech understanding and generation to support real-time human-machine interaction. It addresses three limitations of prior open-source work (high voice data costs, weak dynamic control, and limited intelligence) by adding a generative data engine, instruction-based adjustments to speech traits, and cognitive features such as tool use. A sympathetic reader would expect this to make advanced voice interfaces more widely available without proprietary barriers. The paper also introduces a new benchmark for testing these capabilities and reports stronger results in instruction following along with gains on other evaluations. If correct, the work shows how scaling and integration can produce capable open systems for speech tasks.

Core claim

The paper claims that four components together deliver the first production-ready open-source solution for real-time speech interaction: a 130B-parameter unified speech-text multi-modal model; a generative speech data engine for affordable voice cloning; an instruction-driven fine control system for adjustments across dialects, emotions, singing, and RAP; and an enhanced cognitive architecture with tool calling and role-playing. It reports state-of-the-art human evaluation results, especially in instruction following, and a 9.3 percent average improvement on open-source benchmarks such as LLaMA Question.

What carries the argument

The 130B-parameter unified speech-text multi-modal model that performs both understanding and generation, supported by the generative speech data engine and the instruction-driven fine control system.

If this is right

  • Real-time speech interaction becomes feasible in open-source settings with unified understanding and generation.
  • Speech output can be adjusted dynamically for dialects, emotions, singing, and RAP using instructions.
  • Complex tasks are handled through added tool calling and role-playing abilities.
  • A lightweight 3B-parameter model is obtained via distillation for efficient voice synthesis.
  • The open-sourced chat version supports broader community use and further development.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Developers could build voice interfaces without depending on closed systems.
  • The unification pattern may extend to other input and output modalities.
  • Applications in accessibility or tutoring could benefit from the dynamic control features.
  • Integration with additional external tools could further expand task handling.

Load-bearing premise

The human evaluations on the new benchmark and the reported gains on existing benchmarks reflect genuine model capability rather than effects from test conditions or selection.

What would settle it

Evaluate the model on a fresh collection of real-time speech interaction scenarios created independently of the introduced benchmark and check whether the state-of-the-art human scores and 9.3 percent average benchmark improvement remain.
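
One way to run the check above, sketched in Python under stated assumptions: each fresh prompt yields one human score per system, and a sign-flip permutation test asks whether the mean paired difference could plausibly arise by chance. The function name, the scores, and the choice of test are illustrative assumptions, not the paper's protocol.

```python
import random
from statistics import mean


def paired_permutation_test(scores_a, scores_b, n_resamples=10_000, seed=0):
    """Two-sided sign-flip permutation test on paired per-prompt scores.

    Returns the observed mean difference (a minus b) and an estimated p-value.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("scores must be paired, one pair per prompt")
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = mean(diffs)
    extreme = 0
    for _ in range(n_resamples):
        # Under the null hypothesis the sign of each paired difference is arbitrary.
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(mean(flipped)) >= abs(observed):
            extreme += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    p_value = (extreme + 1) / (n_resamples + 1)
    return observed, p_value


# Hypothetical human ratings on eight independently written prompts.
candidate = [4.5, 4.0, 4.8, 3.9, 4.6, 4.2, 4.7, 4.1]
baseline = [3.9, 3.8, 4.1, 3.7, 4.0, 4.3, 4.0, 3.6]
observed, p_value = paired_permutation_test(candidate, baseline)
print(f"mean difference = {observed:.3f}, p = {p_value:.4f}")
```

A paired test is used here because both systems would be rated on the same prompts; an unpaired comparison would discard that structure.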

Original abstract

Real-time speech interaction, serving as a fundamental interface for human-machine collaboration, holds immense potential. However, current open-source models face limitations such as high costs in voice data collection, weakness in dynamic control, and limited intelligence. To address these challenges, this paper introduces Step-Audio, the first production-ready open-source solution. Key contributions include: 1) a 130B-parameter unified speech-text multi-modal model that achieves unified understanding and generation, with the Step-Audio-Chat version open-sourced; 2) a generative speech data engine that establishes an affordable voice cloning framework and produces the open-sourced lightweight Step-Audio-TTS-3B model through distillation; 3) an instruction-driven fine control system enabling dynamic adjustments across dialects, emotions, singing, and RAP; 4) an enhanced cognitive architecture augmented with tool calling and role-playing abilities to manage complex tasks effectively. Based on our new StepEval-Audio-360 evaluation benchmark, Step-Audio achieves state-of-the-art performance in human evaluations, especially in terms of instruction following. On open-source benchmarks like LLaMA Question, shows 9.3% average performance improvement, demonstrating our commitment to advancing the development of open-source multi-modal language technologies. Our code and models are available at https://github.com/stepfun-ai/Step-Audio.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces Step-Audio, a 130B-parameter unified speech-text multimodal model for real-time intelligent speech interaction. It claims to be the first production-ready open-source solution, featuring a generative speech data engine for voice cloning and distillation to a 3B TTS model, an instruction-driven fine control system for dialects/emotions/singing/RAP, and an enhanced cognitive architecture with tool calling and role-playing. On the newly introduced StepEval-Audio-360 benchmark, it reports state-of-the-art human evaluation results especially in instruction following; it also claims a 9.3% average improvement on open-source benchmarks such as LLaMA Question. The Step-Audio-Chat variant is open-sourced.

Significance. If the human-evaluation and benchmark claims hold after full disclosure of protocols and controls, the work would represent a notable contribution to open-source multimodal speech systems by combining large-scale unified modeling with practical control and cognitive extensions. The open release of models, code, and a new evaluation benchmark could accelerate research in real-time speech interfaces, provided the reported gains reflect genuine generalization rather than evaluation artifacts.

major comments (3)
  1. StepEval-Audio-360 benchmark and human evaluation protocol (Evaluation section): the manuscript provides no details on annotator blinding, inter-rater reliability metrics (e.g., Cohen’s kappa or Fleiss’ kappa), prompt sampling strategy, or statistical significance testing. Without these, the SOTA claim for instruction following and the 9.3% benchmark gains cannot be assessed for robustness against selection effects or evaluator bias.
  2. Training procedure and data composition (Training and Data sections): the abstract and available text report neither the composition of the training corpus, voice data sources, nor any statistical tests or error bars on the reported improvements. This absence directly affects the load-bearing claim that the 130B unified model achieves genuine generalization.
  3. Real-time and production-readiness claims (Introduction and System Overview): the assertion of being the “first production-ready open-source solution” rests on unshown comparisons to prior open-source systems regarding latency, stability, and deployment metrics; no quantitative real-time performance tables or ablation studies are referenced to support this framing.
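
For the inter-rater reliability metric named in major comment 1, a minimal Python sketch of Fleiss' kappa follows. It assumes every response is rated by the same number of annotators (at least two) and that ratings are categorical; the function name and the rating counts are hypothetical and are not taken from the paper.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for a table where counts[i][j] is the number of
    raters who assigned item i to category j."""
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2:
        raise ValueError("at least two ratings per item are required")
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("every item needs the same number of ratings")
    n_categories = len(counts[0])

    # Share of all ratings that fall into each category.
    p_j = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    # Observed agreement among rater pairs, per item.
    p_i = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_i) / n_items       # mean observed agreement
    p_e = sum(p * p for p in p_j)    # agreement expected by chance
    if p_e == 1.0:
        raise ValueError("kappa is undefined when all ratings share one category")
    return (p_bar - p_e) / (1.0 - p_e)


# Hypothetical data: five responses, three annotators, three rating categories.
example_counts = [
    [3, 0, 0],
    [2, 1, 0],
    [0, 3, 0],
    [1, 1, 1],
    [0, 0, 3],
]
kappa = fleiss_kappa(example_counts)
print(f"Fleiss' kappa = {kappa:.3f}")
```

Values near 1 indicate agreement well above chance, and values near 0 indicate agreement no better than chance.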
minor comments (2)
  1. Clarify the exact relationship between the 130B unified model and the distilled 3B TTS model; a diagram or parameter-flow figure would improve readability.
  2. Add missing references to prior open-source speech interaction systems (e.g., recent works on unified audio-language models) to better contextualize the novelty.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their thoughtful and constructive comments on our manuscript. We have carefully considered each point and provide detailed responses below. Where appropriate, we will revise the manuscript to address the concerns raised.

Point-by-point responses
  1. Referee: StepEval-Audio-360 benchmark and human evaluation protocol (Evaluation section): the manuscript provides no details on annotator blinding, inter-rater reliability metrics (e.g., Cohen’s kappa or Fleiss’ kappa), prompt sampling strategy, or statistical significance testing. Without these, the SOTA claim for instruction following and the 9.3% benchmark gains cannot be assessed for robustness against selection effects or evaluator bias.

    Authors: We agree with the referee that additional details on the evaluation protocol are essential to substantiate our claims. In the revised manuscript, we will expand the Evaluation section to include information on annotator blinding, inter-rater reliability metrics including Fleiss’ kappa, the prompt sampling strategy used, and the results of statistical significance testing. These additions will help demonstrate the robustness of the reported SOTA performance in instruction following and the benchmark improvements.
    Revision: yes

  2. Referee: Training procedure and data composition (Training and Data sections): the abstract and available text report neither the composition of the training corpus, voice data sources, nor any statistical tests or error bars on the reported improvements. This absence directly affects the load-bearing claim that the 130B unified model achieves genuine generalization.

    Authors: We acknowledge the importance of transparency regarding the training data and procedures. We will revise the Training and Data sections to provide a detailed description of the training corpus composition, the sources of the voice data, and incorporate statistical tests along with error bars for the reported performance improvements. This will better support the claims of generalization in the 130B model.
    Revision: yes

  3. Referee: Real-time and production-readiness claims (Introduction and System Overview): the assertion of being the “first production-ready open-source solution” rests on unshown comparisons to prior open-source systems regarding latency, stability, and deployment metrics; no quantitative real-time performance tables or ablation studies are referenced to support this framing.

    Authors: We appreciate this feedback on strengthening the production-readiness claims. In the revision, we will include quantitative comparisons with prior open-source systems on metrics such as latency and stability, along with a table presenting real-time performance data and relevant ablation studies. While we maintain that the combination of features and open-sourcing makes it production-ready, these additions will provide more concrete evidence.
    Revision: partial
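
The error bars promised in response 2 could be produced in several ways; the Python sketch below shows one common option, a percentile bootstrap confidence interval for a mean improvement. The function name and the per-item improvements are hypothetical, and nothing here reflects the authors' actual analysis.

```python
import random
from statistics import mean


def bootstrap_mean_ci(values, n_resamples=5_000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`.

    Returns (point_estimate, lower_bound, upper_bound).
    """
    rng = random.Random(seed)
    n = len(values)
    # Resample with replacement and record the mean of each resample.
    resampled_means = sorted(
        mean(rng.choices(values, k=n)) for _ in range(n_resamples)
    )
    lower = resampled_means[int((alpha / 2) * n_resamples)]
    upper = resampled_means[int((1 - alpha / 2) * n_resamples) - 1]
    return mean(values), lower, upper


# Hypothetical per-item score improvements of one system over a baseline.
improvements = [0.6, 0.2, 0.7, 0.2, 0.6, -0.1, 0.7, 0.5, 0.4, 0.3]
point, lo, hi = bootstrap_mean_ci(improvements)
print(f"mean improvement = {point:.2f}, 95% CI = [{lo:.2f}, {hi:.2f}]")
```

An interval that excludes zero would support a claimed gain more strongly than a bare average does.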

Circularity Check

0 steps flagged

No load-bearing circularity; new benchmark and empirical claims are independent of fitted inputs

full rationale

The paper introduces a new 130B unified model, a generative data engine, an instruction-driven control system, and a new StepEval-Audio-360 benchmark, then reports human-evaluation SOTA and a 9.3% average gain on open-source benchmarks such as LLaMA Question. These are empirical system-building and benchmarking results rather than a closed derivation chain. No equations, self-definitional reductions, or fitted parameters renamed as predictions appear in the abstract or described contributions. The 'first production-ready' framing rests on external comparisons and the new benchmark rather than reducing to self-citation or ansatz smuggling. A minor self-citation risk exists around benchmark construction details, but it is not load-bearing for the core claims and does not trigger higher circularity under the rules.

Axiom & Free-Parameter Ledger

1 free parameter · 0 axioms · 0 invented entities

The central claims rest on the effectiveness of the unified 130B architecture and the quality of the generative speech data engine; these are treated as engineering achievements whose internal validity cannot be audited from the abstract alone.

free parameters (1)
  • 130B parameter count
    Scale chosen to achieve the claimed performance; exact training hyperparameters and data mixture ratios are unspecified in the abstract.

pith-pipeline@v0.9.0 · 6343 in / 1243 out tokens · 49382 ms · 2026-05-18T13:34:10.275230+00:00 · methodology


Forward citations

Cited by 19 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. ReasonAudio: A Benchmark for Evaluating Reasoning Beyond Matching in Text-Audio Retrieval

    cs.AI 2026-05 unverdicted novelty 8.0

    ReasonAudio benchmark reveals that state-of-the-art text-audio retrieval models struggle with reasoning tasks like negation and duration, and multimodal LLMs lose reasoning ability after contrastive fine-tuning.

  2. Boosting Omni-Modal Language Models: Staged Post-Training with Visually Debiased Evaluation

    cs.MM 2026-05 unverdicted novelty 7.0

    Visual debiasing of omni-modal benchmarks combined with staged post-training lets a 3B model match or exceed a 30B model without a stronger teacher.

  3. Sparse Tokens Suffice: Jailbreaking Audio Language Models via Token-Aware Gradient Optimization

    cs.CR 2026-05 conditional novelty 7.0

    Sparse selection of high-gradient-energy audio tokens suffices for effective jailbreaking of audio language models with minimal drop in attack success rate.

  4. ReasonAudio: A Benchmark for Evaluating Reasoning Beyond Matching in Text-Audio Retrieval

    cs.AI 2026-05 unverdicted novelty 7.0

    ReasonAudio benchmark shows current text-audio retrieval models fail at reasoning tasks like negation and duration discrimination beyond simple semantic matching.

  5. Same Words, Different Judgments: How Preferences Vary Across Modalities

    cs.SD 2026-02 unverdicted novelty 7.0

    Human preferences for the same semantic content show near-chance agreement between text and audio, with audio raters using narrower decision thresholds, less length bias, and more user-oriented criteria.

  6. Style Amnesia: Investigating Speaking Style Degradation and Mitigation in Multi-Turn Spoken Language Models

    cs.CL 2025-12 accept novelty 7.0

    Spoken language models exhibit style amnesia and fail to maintain instructed paralinguistic styles across multi-turn conversations, with explicit recall offering partial mitigation.

  7. ViBES: A Conversational Agent with Behaviorally-Intelligent 3D Virtual Body

    cs.CV 2025-12 unverdicted novelty 7.0

    ViBES introduces a speech-language-behavior model using modality-specific transformer experts that jointly generates dialogue and 3D body actions, showing gains over separate co-speech and text-to-motion baselines on ...

  8. Audio Flamingo 3: Advancing Audio Intelligence with Fully Open Large Audio Language Models

    cs.SD 2025-07 unverdicted novelty 7.0

    Audio Flamingo 3 introduces an open large audio-language model achieving new state-of-the-art results on over 20 audio understanding and reasoning benchmarks using a unified encoder and curriculum training on open data.

  9. Boosting Omni-Modal Language Models: Staged Post-Training with Visually Debiased Evaluation

    cs.MM 2026-05 unverdicted novelty 6.0

    Staged post-training with self-distillation lets a 3B omni-modal model match or slightly exceed a 30B model on a visually debiased benchmark.

  10. MegaScale-Omni: A Hyper-Scale, Workload-Resilient System for MultiModal LLM Training in Production

    cs.DC 2026-05 unverdicted novelty 6.0

    MegaScale-Omni delivers 1.27x-7.57x higher throughput for dynamic multimodal LLM training by decoupling encoder and LLM parallelism, using unified colocation, and applying adaptive workload balancing.

  11. MiniMind-O Technical Report: An Open Small-Scale Speech-Native Omni Model

    cs.SD 2026-05 accept novelty 6.0

    MiniMind-O delivers a working 0.1B-scale open omni model with speech-native output, Thinker-Talker split, frozen encoders, and full release of code, checkpoints, and training data.

  12. Why Your Tokenizer Fails in Information Fusion: A Timing-Aware Pre-Quantization Fusion for Video-Enhanced Audio Tokenization

    eess.AS 2026-04 unverdicted novelty 6.0

    A timing-aware pre-quantization fusion approach integrates visual cues into audio tokenizers along the temporal axis, maintaining reconstruction quality while outperforming audio-only and prior multimodal baselines on...

  13. Bridging What the Model Thinks and How It Speaks: Self-Aware Speech Language Models for Expressive Speech Generation

    cs.CL 2026-04 unverdicted novelty 6.0

    SA-SLM uses variational information bottleneck for intent-aware bridging and self-criticism for realization-aware alignment to close the semantic-acoustic gap, outperforming open-source models and nearing GPT-4o-Audio...

  14. StableToken: A Noise-Robust Semantic Speech Tokenizer for Resilient SpeechLLMs

    cs.CL 2025-09 unverdicted novelty 6.0

    StableToken introduces a multi-branch architecture with bit-wise voting to create noise-robust semantic speech tokens, achieving lower Unit Edit Distance and better SpeechLLM robustness than prior single-path tokenizers.

  15. Step-Audio 2 Technical Report

    cs.CL 2025-07 unverdicted novelty 6.0

    Step-Audio 2 integrates a latent audio encoder, reasoning-centric reinforcement learning, and discrete audio token generation into language modeling to deliver state-of-the-art performance on audio understanding and c...

  16. Sema: Semantic Transport for Real-Time Multimodal Agents

    cs.MM 2026-04 unverdicted novelty 5.0

    Sema reduces uplink bandwidth by 64x for audio and 130-210x for screenshots while keeping multimodal agent task accuracy within 0.7 percentage points of raw baselines in WAN simulations.

  17. On the Distillation Loss Functions of Speech VAE for Unified Reconstruction, Understanding, and Generation

    cs.SD 2026-04 unverdicted novelty 5.0

    Joint-marginal alignment plus adaptive weighting in speech VAE distillation yields the best combined performance on reconstruction, understanding, and generation tasks.

  18. Mismatch Aware Guidance for Robust Emotion Control in Auto-Regressive TTS Models

    cs.CL 2025-10 unverdicted novelty 5.0

    An adaptive CFG method that tunes guidance based on LLM-detected mismatch between emotion prompts and text semantics improves emotional expressiveness in AR TTS while preserving audio quality and intelligibility.

  19. Kimi-Audio Technical Report

    eess.AS 2025-04 unverdicted novelty 5.0

    Kimi-Audio is an open-source audio foundation model that achieves state-of-the-art results on speech recognition, audio understanding, question answering, and conversation after pre-training on more than 13 million ho...
