Step-Audio: Unified Understanding and Generation in Intelligent Speech Interaction
Pith reviewed 2026-05-18 13:34 UTC · model grok-4.3
The pith
A 130B-parameter unified speech-text model enables real-time interactive conversations with dynamic control.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that a 130B-parameter unified speech-text multi-modal model delivers the first production-ready open-source solution for real-time speech interaction. The model is paired with a generative speech data engine for affordable voice cloning, an instruction-driven fine control system for adjustments across dialects, emotions, singing, and RAP, and an enhanced cognitive architecture with tool calling and role-playing. The paper reports state-of-the-art human evaluation results, especially in instruction following, and a 9.3% average improvement on open-source benchmarks such as LLaMA Question.
What carries the argument
The 130B-parameter unified speech-text multi-modal model that performs both understanding and generation, supported by the generative speech data engine and the instruction-driven fine control system.
If this is right
- Real-time speech interaction becomes feasible in open-source settings with unified understanding and generation.
- Speech output can be adjusted dynamically for dialects, emotions, singing, and RAP using instructions.
- Complex tasks are handled through added tool calling and role-playing abilities.
- A lightweight 3B-parameter model is obtained via distillation for efficient voice synthesis.
- The open-sourced chat version supports broader community use and further development.
Where Pith is reading between the lines
- Developers could build voice interfaces without depending on closed systems.
- The unification pattern may extend to other input and output modalities.
- Applications in accessibility or tutoring could benefit from the dynamic control features.
- Integration with additional external tools could further expand task handling.
Load-bearing premise
The human evaluations on the new benchmark and the reported gains on existing benchmarks reflect genuine model capability rather than effects from test conditions or selection.
What would settle it
Evaluate the model on a fresh collection of real-time speech interaction scenarios created independently of the introduced benchmark and check whether the state-of-the-art human scores and 9.3 percent average benchmark improvement remain.
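One way to run the check described above is a paired bootstrap over per-item scores from two systems on the fresh evaluation set. The sketch below is illustrative only: the function name `paired_bootstrap_ci`, the per-item score lists, and the resample count are assumptions introduced here, not anything taken from the paper.

```python
import random
from statistics import mean
from typing import Sequence, Tuple


def paired_bootstrap_ci(
    scores_a: Sequence[float],
    scores_b: Sequence[float],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Return (observed mean difference, CI lower bound, CI upper bound).

    scores_a[i] and scores_b[i] must be scores for the same test item.
    The interval is a percentile bootstrap over per-item differences.
    """
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two non-empty score lists of equal length")
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    # Resample items with replacement and record the mean difference each time.
    resampled = sorted(mean(rng.choices(diffs, k=n)) for _ in range(n_resamples))
    lower = resampled[int((alpha / 2) * n_resamples)]
    upper = resampled[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return mean(diffs), lower, upper


if __name__ == "__main__":
    # Synthetic placeholder correctness flags, NOT results from the paper.
    system_a = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1]
    system_b = [1, 0, 0, 1, 0, 1, 0, 1, 0, 1]
    print(paired_bootstrap_ci(system_a, system_b))
```

If the resulting interval excludes zero on independently built data, the claimed gain is less likely to be an artifact of the original test selection; an interval that straddles zero would suggest the improvement does not carry over.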
Original abstract
Real-time speech interaction, serving as a fundamental interface for human-machine collaboration, holds immense potential. However, current open-source models face limitations such as high costs in voice data collection, weakness in dynamic control, and limited intelligence. To address these challenges, this paper introduces Step-Audio, the first production-ready open-source solution. Key contributions include: 1) a 130B-parameter unified speech-text multi-modal model that achieves unified understanding and generation, with the Step-Audio-Chat version open-sourced; 2) a generative speech data engine that establishes an affordable voice cloning framework and produces the open-sourced lightweight Step-Audio-TTS-3B model through distillation; 3) an instruction-driven fine control system enabling dynamic adjustments across dialects, emotions, singing, and RAP; 4) an enhanced cognitive architecture augmented with tool calling and role-playing abilities to manage complex tasks effectively. Based on our new StepEval-Audio-360 evaluation benchmark, Step-Audio achieves state-of-the-art performance in human evaluations, especially in terms of instruction following. On open-source benchmarks like LLaMA Question, it shows a 9.3% average performance improvement, demonstrating our commitment to advancing the development of open-source multi-modal language technologies. Our code and models are available at https://github.com/stepfun-ai/Step-Audio.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Step-Audio, a 130B-parameter unified speech-text multimodal model for real-time intelligent speech interaction. It claims to be the first production-ready open-source solution, featuring a generative speech data engine for voice cloning and distillation to a 3B TTS model, an instruction-driven fine control system for dialects/emotions/singing/RAP, and an enhanced cognitive architecture with tool calling and role-playing. On the newly introduced StepEval-Audio-360 benchmark, it reports state-of-the-art human evaluation results especially in instruction following; it also claims a 9.3% average improvement on open-source benchmarks such as LLaMA Question. The Step-Audio-Chat variant is open-sourced.
Significance. If the human-evaluation and benchmark claims hold after full disclosure of protocols and controls, the work would represent a notable contribution to open-source multimodal speech systems by combining large-scale unified modeling with practical control and cognitive extensions. The open release of models, code, and a new evaluation benchmark could accelerate research in real-time speech interfaces, provided the reported gains reflect genuine generalization rather than evaluation artifacts.
major comments (3)
- StepEval-Audio-360 benchmark and human evaluation protocol (Evaluation section): the manuscript provides no details on annotator blinding, inter-rater reliability metrics (e.g., Cohen’s kappa or Fleiss’ kappa), prompt sampling strategy, or statistical significance testing. Without these, the SOTA claim for instruction following and the 9.3% benchmark gains cannot be assessed for robustness against selection effects or evaluator bias.
- Training procedure and data composition (Training and Data sections): the abstract and available text report neither the composition of the training corpus and the voice data sources nor any statistical tests or error bars on the reported improvements. This absence directly affects the load-bearing claim that the 130B unified model achieves genuine generalization.
- Real-time and production-readiness claims (Introduction and System Overview): the assertion of being the “first production-ready open-source solution” rests on unshown comparisons to prior open-source systems regarding latency, stability, and deployment metrics; no quantitative real-time performance tables or ablation studies are referenced to support this framing.
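The inter-rater reliability metric named in the first major comment can be computed directly from a table of annotator votes. The following is a minimal sketch of Fleiss' kappa under stated assumptions: every item is rated by the same number of annotators, and `counts[i][j]` holds how many annotators placed item `i` in category `j`. The function name and input layout are illustrative choices, not the paper's evaluation code.

```python
from typing import Sequence


def fleiss_kappa(counts: Sequence[Sequence[int]]) -> float:
    """Fleiss' kappa for a table where counts[i][j] is the number of
    raters who assigned item i to category j.

    Assumes every item has the same total number of raters (at least 2).
    Returns NaN when chance agreement is 1 (kappa is undefined there).
    """
    if not counts:
        raise ValueError("counts must contain at least one item")
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2:
        raise ValueError("need at least two raters per item")
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("every item must have the same number of raters")
    n_cats = len(counts[0])

    # Share of all ratings that fell into each category.
    p_cat = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_cats)
    ]
    # Observed agreement per item: fraction of rater pairs that agree.
    p_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_observed = sum(p_item) / n_items
    p_chance = sum(p * p for p in p_cat)
    if p_chance == 1.0:
        return float("nan")
    return (p_observed - p_chance) / (1.0 - p_chance)
```

For example, three annotators who agree completely on two items that fall into different categories, `fleiss_kappa([[3, 0], [0, 3]])`, give a kappa of 1.0, while split votes such as `[[2, 1], [1, 2]]` give a negative value, which indicates agreement below chance.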
minor comments (2)
- Clarify the exact relationship between the 130B unified model and the distilled 3B TTS model; a diagram or parameter-flow figure would improve readability.
- Add missing references to prior open-source speech interaction systems (e.g., recent works on unified audio-language models) to better contextualize the novelty.
Simulated Author's Rebuttal
We thank the referee for their thoughtful and constructive comments on our manuscript. We have carefully considered each point and provide detailed responses below. Where appropriate, we will revise the manuscript to address the concerns raised.
Point-by-point responses
-
Referee: StepEval-Audio-360 benchmark and human evaluation protocol (Evaluation section): the manuscript provides no details on annotator blinding, inter-rater reliability metrics (e.g., Cohen’s kappa or Fleiss’ kappa), prompt sampling strategy, or statistical significance testing. Without these, the SOTA claim for instruction following and the 9.3% benchmark gains cannot be assessed for robustness against selection effects or evaluator bias.
Authors: We agree with the referee that additional details on the evaluation protocol are essential to substantiate our claims. In the revised manuscript, we will expand the Evaluation section to include information on annotator blinding, inter-rater reliability metrics including Fleiss’ kappa, the prompt sampling strategy used, and the results of statistical significance testing. These additions will help demonstrate the robustness of the reported SOTA performance in instruction following and the benchmark improvements.
revision: yes
-
Referee: Training procedure and data composition (Training and Data sections): the abstract and available text report neither the composition of the training corpus and the voice data sources nor any statistical tests or error bars on the reported improvements. This absence directly affects the load-bearing claim that the 130B unified model achieves genuine generalization.
Authors: We acknowledge the importance of transparency regarding the training data and procedures. We will revise the Training and Data sections to describe the training corpus composition and the sources of the voice data in detail, and to add statistical tests and error bars for the reported performance improvements. This will better support the claims of generalization in the 130B model.
revision: yes
-
Referee: Real-time and production-readiness claims (Introduction and System Overview): the assertion of being the “first production-ready open-source solution” rests on unshown comparisons to prior open-source systems regarding latency, stability, and deployment metrics; no quantitative real-time performance tables or ablation studies are referenced to support this framing.
Authors: We appreciate this feedback on strengthening the production-readiness claims. In the revision, we will include quantitative comparisons with prior open-source systems on metrics such as latency and stability, along with a table presenting real-time performance data and relevant ablation studies. While we maintain that the combination of features and open-sourcing makes it production-ready, these additions will provide more concrete evidence.
revision: partial
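The latency comparison requested in this exchange could be gathered with a small timing harness like the sketch below. Everything here is hypothetical: `respond` stands in for whatever callable wraps a speech system under test, and the nearest-rank percentile is one reasonable reporting choice among several. Nothing is taken from the Step-Audio codebase.

```python
import math
import time
from typing import Callable, Iterable, List, Sequence


def measure_latencies(
    respond: Callable[[str], object], prompts: Iterable[str]
) -> List[float]:
    """Time each call to `respond` (a hypothetical system wrapper), in seconds."""
    latencies = []
    for prompt in prompts:
        start = time.perf_counter()
        respond(prompt)
        latencies.append(time.perf_counter() - start)
    return latencies


def percentile(values: Sequence[float], q: float) -> float:
    """Nearest-rank percentile for q in (0, 100]."""
    if not values:
        raise ValueError("values must be non-empty")
    if not 0 < q <= 100:
        raise ValueError("q must be in (0, 100]")
    ordered = sorted(values)
    rank = max(1, math.ceil(q / 100 * len(ordered)))
    return ordered[rank - 1]


if __name__ == "__main__":
    # Stand-in workload; a real comparison would call each system's API here.
    fake_system = lambda prompt: sum(range(10_000))
    samples = measure_latencies(fake_system, ["hello"] * 20)
    print("p50:", percentile(samples, 50), "p95:", percentile(samples, 95))
```

Reporting median and tail latency for each system on the same prompts and hardware would make a "production-ready" comparison easier to audit than a single average.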
Circularity Check
No load-bearing circularity; new benchmark and empirical claims are independent of fitted inputs
Full rationale
The paper introduces a new 130B unified model, a generative data engine, an instruction-driven control system, and a new StepEval-Audio-360 benchmark, then reports human-evaluation SOTA and 9.3% gains on LLaMA Question. These are empirical system-building and benchmarking results rather than a closed derivation chain. No equations, self-definitional reductions, or fitted parameters renamed as predictions appear in the abstract or described contributions. The 'first production-ready' framing rests on external comparisons and the new benchmark rather than reducing to self-citation or ansatz smuggling. A minor self-citation risk exists around benchmark construction details, but it is not load-bearing for the core claims and does not trigger higher circularity under the rules.
Axiom & Free-Parameter Ledger
free parameters (1)
- 130B parameter count
Forward citations
Cited by 19 Pith papers
-
ReasonAudio: A Benchmark for Evaluating Reasoning Beyond Matching in Text-Audio Retrieval
ReasonAudio benchmark reveals that state-of-the-art text-audio retrieval models struggle with reasoning tasks like negation and duration, and multimodal LLMs lose reasoning ability after contrastive fine-tuning.
-
Boosting Omni-Modal Language Models: Staged Post-Training with Visually Debiased Evaluation
Visual debiasing of omni-modal benchmarks combined with staged post-training lets a 3B model match or exceed a 30B model without a stronger teacher.
-
Sparse Tokens Suffice: Jailbreaking Audio Language Models via Token-Aware Gradient Optimization
Sparse selection of high-gradient-energy audio tokens suffices for effective jailbreaking of audio language models with minimal drop in attack success rate.
-
ReasonAudio: A Benchmark for Evaluating Reasoning Beyond Matching in Text-Audio Retrieval
ReasonAudio benchmark shows current text-audio retrieval models fail at reasoning tasks like negation and duration discrimination beyond simple semantic matching.
-
Same Words, Different Judgments: How Preferences Vary Across Modalities
Human preferences for the same semantic content show near-chance agreement between text and audio, with audio raters using narrower decision thresholds, less length bias, and more user-oriented criteria.
-
Style Amnesia: Investigating Speaking Style Degradation and Mitigation in Multi-Turn Spoken Language Models
Spoken language models exhibit style amnesia and fail to maintain instructed paralinguistic styles across multi-turn conversations, with explicit recall offering partial mitigation.
-
ViBES: A Conversational Agent with Behaviorally-Intelligent 3D Virtual Body
ViBES introduces a speech-language-behavior model using modality-specific transformer experts that jointly generates dialogue and 3D body actions, showing gains over separate co-speech and text-to-motion baselines on ...
-
Audio Flamingo 3: Advancing Audio Intelligence with Fully Open Large Audio Language Models
Audio Flamingo 3 introduces an open large audio-language model achieving new state-of-the-art results on over 20 audio understanding and reasoning benchmarks using a unified encoder and curriculum training on open data.
-
Boosting Omni-Modal Language Models: Staged Post-Training with Visually Debiased Evaluation
Staged post-training with self-distillation lets a 3B omni-modal model match or slightly exceed a 30B model on a visually debiased benchmark.
-
MegaScale-Omni: A Hyper-Scale, Workload-Resilient System for MultiModal LLM Training in Production
MegaScale-Omni delivers 1.27x-7.57x higher throughput for dynamic multimodal LLM training by decoupling encoder and LLM parallelism, using unified colocation, and applying adaptive workload balancing.
-
MiniMind-O Technical Report: An Open Small-Scale Speech-Native Omni Model
MiniMind-O delivers a working 0.1B-scale open omni model with speech-native output, Thinker-Talker split, frozen encoders, and full release of code, checkpoints, and training data.
-
Why Your Tokenizer Fails in Information Fusion: A Timing-Aware Pre-Quantization Fusion for Video-Enhanced Audio Tokenization
A timing-aware pre-quantization fusion approach integrates visual cues into audio tokenizers along the temporal axis, maintaining reconstruction quality while outperforming audio-only and prior multimodal baselines on...
-
Bridging What the Model Thinks and How It Speaks: Self-Aware Speech Language Models for Expressive Speech Generation
SA-SLM uses variational information bottleneck for intent-aware bridging and self-criticism for realization-aware alignment to close the semantic-acoustic gap, outperforming open-source models and nearing GPT-4o-Audio...
-
StableToken: A Noise-Robust Semantic Speech Tokenizer for Resilient SpeechLLMs
StableToken introduces a multi-branch architecture with bit-wise voting to create noise-robust semantic speech tokens, achieving lower Unit Edit Distance and better SpeechLLM robustness than prior single-path tokenizers.
-
Step-Audio 2 Technical Report
Step-Audio 2 integrates a latent audio encoder, reasoning-centric reinforcement learning, and discrete audio token generation into language modeling to deliver state-of-the-art performance on audio understanding and c...
-
Sema: Semantic Transport for Real-Time Multimodal Agents
Sema reduces uplink bandwidth by 64x for audio and 130-210x for screenshots while keeping multimodal agent task accuracy within 0.7 percentage points of raw baselines in WAN simulations.
-
On the Distillation Loss Functions of Speech VAE for Unified Reconstruction, Understanding, and Generation
Joint-marginal alignment plus adaptive weighting in speech VAE distillation yields the best combined performance on reconstruction, understanding, and generation tasks.
-
Mismatch Aware Guidance for Robust Emotion Control in Auto-Regressive TTS Models
An adaptive CFG method that tunes guidance based on LLM-detected mismatch between emotion prompts and text semantics improves emotional expressiveness in AR TTS while preserving audio quality and intelligibility.
-
Kimi-Audio Technical Report
Kimi-Audio is an open-source audio foundation model that achieves state-of-the-art results on speech recognition, audio understanding, question answering, and conversation after pre-training on more than 13 million ho...