{"paper":{"title":"Step-Audio: Unified Understanding and Generation in Intelligent Speech Interaction","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"","cross_cats":["cs.AI","cs.HC","cs.SD","eess.AS"],"primary_cat":"cs.CL","authors_text":"Ailin Huang, Bahtiyar Ahmidi, Bingxin Li, Bin Wang, Binxing Jiao, Bo Li, Boyong Wu, Brian Li, Bruce Wang, Buyun Ma, Changxin Miao, Changyi Wan, Chao Yan, Chengli Feng, Chengting Feng, Chen Hu, Chenrun Wang, Chen Xu, Dapeng Shi, Daxin Jiang, Deshan Sun, Dingyuan Hu, Dula Sai, Enle Liu, Fei Tian, Feiyu Shen, Guanzhe Huang, Gulin Yan, Hanpeng Hu, Haonan Jia, Haoyang Zhang, Heng Wang, Heung-Yeung Shum, Hongyuan Wang, Hongyu Zhou, Jiahao Gong, Jiahong Liu, Jianchang Wu, Jiangjie Zhen, Jianjian Sun, Jiansheng Chen, Jiaoren Wu, Jiashuai Liu, Jie Feng, Jie Wu, Jie Yang, Jingbei Li, Jing Li, Jinguo Wang, Jingyang Zhang, Junjing Guo, Junzhe Lin, Kaixiang Li, Kang An, Lei Xia, Liang Zhao, Li Zhou, Longlong Gu, Mei Chen, Menglin Wu, Ming Li, Mingliang Li, Mingrui Chen, Mingxiao Li, Mingyao Liang, Na Wang, Nie Hao, Peng Liu, Qiling Wu, Qinyuan Tan, Ranchen Ming, Ran Sun, Ruihang Miao, Shanshan Yuan, Shaoliang Pang, Shihong Deng, Shilei Jiang, Shiliang Yang, Shuai Shuai, Shuchang Zhou, Shuli Gao, Siqi Liu, Sitong Liu, Song Yuan, Tiancheng Cao, Tianyu Wang, Wang You, Wei Ji, Weipeng Ming, Wenjin Deng, Wen Li, Wenqing He, Wen Sun, Wuxun Xie, Xiangwen Kong, Xiangyu Zhang, Xiaojia Liu, Xiaomin Deng, Xi Chen, Xin Han, Xinhao Zhang, Xin Huang, Xin Wu, Xuan Wen, Xuelin Zhang, Xuerui Yang, Xu Zhao, Yanan Wei, Yanbo Yu, Yang Cao, Yangguang Li, Yang Zhang, Yangzhen Ma, Yanming Xu, Yaoben Wei, Yaoyu Wang, Yaqiang Shi, Yaqi Dai, Yechang Huang, Yibo Zhu, Yilei Wang, Yinmin Zhong, Yizhuang Zhou, Yuanhao Ding, Yuankai Ma, Yuanwei Liang, Yuanwei Lu, Yuchu Luo, Yuhe Yin, Yu Luo, Yun Mou, Yuting Yan, Yuxiang Yang, Yuxiang Zhang, Yu Zhou, Zheng Ge, Zheng Gong, Zheng Sun, Zhewei Huang, Zhe Xie, Zhichao Chang, Zhisheng Guan, Zidong Yang, Zili Zhang, Zixin Zhang","submitted_at":"2025-02-17T15:58:56Z","abstract_excerpt":"Real-time speech interaction, serving as a fundamental interface for human-machine collaboration, holds immense potential. However, current open-source models face limitations such as high costs in voice data collection, weakness in dynamic control, and limited intelligence. To address these challenges, this paper introduces Step-Audio, the first production-ready open-source solution. Key contributions include: 1) a 130B-parameter unified speech-text multi-modal model that achieves unified understanding and generation, with the Step-Audio-Chat version open-sourced; 2) a generative speech data "},"claims":{"count":0,"items":[],"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"source":{"id":"2502.11946","kind":"arxiv","version":2},"verdict":{"id":null,"model_set":{},"created_at":null,"strongest_claim":"","one_line_summary":"","pipeline_version":null,"weakest_assumption":"","pith_extraction_headline":""},"references":{"count":0,"sample":[],"resolved_work":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57","internal_anchors":0},"formal_canon":{"evidence_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"}