Wan-Streamer v0.1: End-to-end Real-time Interactive Foundation Models

arxiv: 2606.25041 · v2 · pith:XELZR3XNnew · submitted 2026-06-23 · 💻 cs.CV · cs.AI · cs.GR · cs.SD

Lianghua Huang, Zhi-Fan Wu, Wei Wang, Yupeng Shi, Mengyang Feng, Junjie He, Chen-Wei Xie, Yu Liu, Jingren Zhou, Ang Wang, Bang Zhang, Baole Ai, Chen Liang, Cheng Yu, Chongyang Zhong, Jinwei Qi, Kai Zhu, Pandeng Li, Peng Zhang, Wenyuan Zhang, Xinhua Cheng, Yitong Huang, Yun Zheng, Zoubin Bi

Pith reviewed 2026-06-26 05:25 UTC · model grok-4.3

keywords: real-time multimodal interaction, streaming transformer, end-to-end foundation model, full-duplex audio-visual, block-causal attention, low-latency streaming, unified perception and generation

The pith

A single Transformer unifies audio, video and text to deliver sub-second full-duplex interaction without external modules.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Wan-Streamer as an end-to-end foundation model that represents all modalities as interleaved input and output tokens inside one Transformer. It claims that block-causal attention together with causal encoders and decoders lets the model jointly learn perception, reasoning, generation, timing and synchronization. This design removes the separate VAD, ASR, TTS, animation and video modules that cascaded systems require. A reader would care because the reported latencies are roughly 200 ms on the model side and 550 ms end-to-end once 350 ms of bidirectional network latency is included, which would support natural, real-time audio-visual conversations.

Core claim

Wan-Streamer is a native-streaming interactive foundation model that treats language, audio and video as both inputs and outputs inside a single Transformer. The sequence consists of interleaved visual, audio and text tokens coordinated by block-causal attention, which supports incremental streaming units as short as 160 ms at 25 fps. All components of interaction—perception, reasoning, generation, response timing, turn management and cross-modal synchronization—are learned jointly, eliminating reliance on external specialized modules and the associated pipeline latency and error accumulation. The model reports approximately 200 ms model-side response latency and 550 ms total interaction lat

What carries the argument

Block-causal attention over interleaved visual, audio and text input and output tokens inside a single Transformer, enabling incremental streaming.
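
To make the mechanism concrete, here is a minimal sketch of a block-causal attention mask in plain Python. It assumes the common definition, in which a token may attend to every token in its own block and in all earlier blocks. The abstract does not say whether Wan-Streamer attends bidirectionally inside a block, so that detail, the function name `block_causal_mask`, and the block sizes are illustrative assumptions rather than the authors' implementation.

```python
def block_causal_mask(block_sizes):
    """Build a boolean attention mask for a sequence split into blocks.

    mask[i][j] is True when query token i may attend to key token j.
    Assumption (not from the paper): attention is unrestricted inside a
    block and causal across blocks.
    """
    # Label every token with the index of the block it belongs to.
    block_ids = [b for b, size in enumerate(block_sizes) for _ in range(size)]
    # A query sees a key if the key's block is not later than its own.
    return [[q_block >= k_block for k_block in block_ids] for q_block in block_ids]


# Hypothetical example: three streaming units holding 2, 3 and 1 tokens.
mask = block_causal_mask([2, 3, 1])
for row in mask:
    print("".join("1" if allowed else "0" for allowed in row))
```

Because a new block only adds rows and columns at the end of the mask, earlier blocks never need to be recomputed when the next streaming unit arrives, which is the property that makes incremental streaming possible.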

If this is right

Pipeline latency drops because separate VAD, ASR, language, TTS and generation stages are removed.
Error accumulation from module hand-offs is reduced.
Streaming units of 160 ms at 25 fps become feasible through redesigned causal encoders, decoders and token scheduling.
Natural responsiveness emerges from joint learning of timing and turn management.
Full-duplex audio-visual communication reaches sub-second total latency.
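
The timing figures in this list can be checked with simple arithmetic. The sketch below uses only numbers stated in the abstract (25 fps, 160 ms units, roughly 200 ms model-side latency, 350 ms bidirectional network latency). Reading a 160 ms unit as a whole number of video frames is an inference from those numbers, not a detail the abstract spells out, and the constant names are illustrative.

```python
FPS = 25                    # video frame rate quoted in the abstract
UNIT_MS = 160               # shortest streaming unit quoted in the abstract
MODEL_LATENCY_MS = 200      # approximate model-side response latency
NETWORK_LATENCY_MS = 350    # bidirectional network latency assumed by the authors

frame_ms = 1000 / FPS                              # duration of one frame
frames_per_unit = UNIT_MS / frame_ms               # frames covered by one unit
units_per_second = 1000 / UNIT_MS                  # streaming units per second
total_ms = MODEL_LATENCY_MS + NETWORK_LATENCY_MS   # total interaction latency

print(f"one frame lasts {frame_ms} ms")
print(f"one unit spans {frames_per_unit} frames")
print(f"{units_per_second} units per second")
print(f"total interaction latency is about {total_ms} ms")
```

Under these figures one streaming unit covers four frames, and the sub-second total depends on the 350 ms network assumption as much as on the model itself.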

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Deployment of interactive agents could simplify to a single model rather than maintaining multiple specialized services.
The same streaming token design might extend to additional modalities while preserving low latency.
Real-world tests on variable networks would reveal whether the reported 550 ms total latency holds outside controlled conditions.
Edge-device implementations could become practical if the unified model reduces memory and compute overhead compared with cascaded stacks.

Load-bearing premise

Perception, reasoning, generation, response timing, turn management and cross-modal synchronization can be learned jointly inside one model without external modules or significant performance loss.

What would settle it

A controlled side-by-side measurement of end-to-end latency and interaction quality between Wan-Streamer and an equivalent cascaded pipeline under identical network and hardware conditions.
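
A side-by-side measurement of this kind could be organised along the following lines. This is a hedged sketch of a generic timing harness, not a protocol from the paper: `measure_latency_ms`, `summarize`, and the two stand-in systems are hypothetical names, and a real comparison would also need matched hardware, matched network conditions, and a quality metric alongside latency.

```python
import statistics
import time


def measure_latency_ms(respond, inputs):
    """Time each call to respond(x) and return the latencies in milliseconds."""
    latencies = []
    for x in inputs:
        start = time.perf_counter()
        respond(x)
        latencies.append((time.perf_counter() - start) * 1000.0)
    return latencies


def summarize(latencies_ms):
    """Return the median and the nearest-rank 95th percentile of the samples."""
    ordered = sorted(latencies_ms)
    # Nearest-rank percentile computed with integer arithmetic.
    p95_index = (95 * len(ordered) + 99) // 100 - 1
    return {"median_ms": statistics.median(ordered), "p95_ms": ordered[p95_index]}


# Placeholder systems: stand-ins only, neither one is a real model.
def unified_system(x):
    return x


def cascaded_system(x):
    return x


prompts = list(range(20))  # the same inputs are fed to both systems
for name, system in [("unified", unified_system), ("cascaded", cascaded_system)]:
    print(name, summarize(measure_latency_ms(system, prompts)))
```

Reporting a tail percentile next to the median matters here because interactive systems are judged by their slow responses as much as by their typical ones.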

Original abstract

We present Wan-Streamer, a native-streaming, end-to-end interactive foundation model designed from the ground up for real-time, low-latency, full-duplex audio-visual interaction. Wan-Streamer seamlessly models language, audio, and video as both input and output within a single Transformer, where the sequence is represented as interleaved visual, audio, and text input tokens together with visual, audio, and text output tokens, coordinated by block-causal attention for incremental streaming. Unlike cascaded interactive systems that rely on separate VAD, ASR, language, TTS, audio-driven animation, or video-generation modules, Wan-Streamer does not rely on external language, speech, avatar, or video-generation modules: perception, reasoning, generation, response timing, turn management, and cross-modal synchronization are learned jointly within one unified model, reducing pipeline latency and error accumulation. To support natural audio-visual responsiveness, we redesign the entire stack around streamability, including causal encoders, causal decoders, block-causal attention, and low-latency multimodal token scheduling, enabling streaming units as short as 160 ms at 25 fps. Wan-Streamer achieves approximately 200 ms model-side response latency and approximately 550 ms total interaction latency when combined with 350 ms bidirectional network latency, supporting sub-second duplex audio-visual communication. These results position Wan-Streamer as a unified, end-to-end, multimodal interactive foundation model for low-latency streaming interaction.
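
The interleaved sequence described in the abstract can be pictured with a small data-structure sketch. The abstract says only that visual, audio, and text input tokens are interleaved with visual, audio, and text output tokens. The order inside a streaming unit, the number of tokens per modality, and the names `Token` and `interleave` below are assumptions made for illustration.

```python
from dataclasses import dataclass

MODALITIES = ("visual", "audio", "text")


@dataclass(frozen=True)
class Token:
    unit: int        # index of the streaming unit the token belongs to
    direction: str   # "in" for perceived tokens, "out" for generated tokens
    modality: str    # one of MODALITIES


def interleave(num_units, tokens_per_modality=1):
    """Lay out one sequence, unit by unit.

    Assumed order inside each unit: all input tokens, then all output tokens.
    The paper's actual scheduling may differ.
    """
    sequence = []
    for unit in range(num_units):
        for direction in ("in", "out"):
            for modality in MODALITIES:
                for _ in range(tokens_per_modality):
                    sequence.append(Token(unit, direction, modality))
    return sequence


for token in interleave(num_units=2):
    print(token.unit, token.direction, token.modality)
```

Keeping the input and output tokens of a unit adjacent in one sequence is what lets a single model both listen and respond within each streaming step, instead of waiting for a full turn to finish.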

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Wan-Streamer claims a single block-causal Transformer can do joint multimodal streaming at 200 ms model latency, but the abstract supplies no measurements, model details, or comparisons to support it.

The main takeaway is that this paper describes a unified Transformer that interleaves visual, audio, and text tokens for both input and output, using block-causal attention to support streaming at 160 ms units. It argues this avoids the latency and error buildup of separate VAD, ASR, TTS, and generation modules.

What is actually new is the explicit redesign of the full stack—causal encoders, causal decoders, and low-latency token scheduling—to make everything native to streaming rather than bolted on. The framing around full-duplex turn management and cross-modal synchronization inside one model is a clean way to state the goal.

The soft spots are straightforward. The abstract states 200 ms model-side and 550 ms total latency numbers with zero accompanying information on how they were measured, what the model size or hardware is, what token rates were used, or how end-to-end latency was separated from network effects. There are no baselines, no ablations on joint versus cascaded performance, and no error analysis. The central premise that perception, reasoning, generation, and timing can all be learned jointly without significant loss is asserted but not shown.

This paper is aimed at groups working on real-time multimodal interfaces. A reader who wants concrete evidence or reproducible numbers will not find it here. The stress-test concern about unsupported latency claims holds up on the given text.

I would not bring it to a reading group yet and would not cite it. It does not look ready for peer review until the experiments and measurement details are added.

Referee Report

2 major / 0 minor

Summary. The manuscript introduces Wan-Streamer v0.1, a single block-causal Transformer foundation model that jointly performs perception, reasoning, generation, turn management, and cross-modal synchronization over interleaved visual/audio/text input and output tokens for native-streaming, full-duplex audio-visual interaction. It claims redesign of the full stack (causal encoders/decoders, block-causal attention, 160 ms streaming units at 25 fps) yields approximately 200 ms model-side response latency and 550 ms total interaction latency (including 350 ms network), eliminating cascaded modules such as VAD, ASR, TTS, and separate video generators.

Significance. If the latency and joint-modeling claims are substantiated with reproducible measurements, the work would be significant for demonstrating that a unified streaming Transformer can replace multi-module pipelines while preserving sub-second responsiveness; this would directly address error accumulation and latency in interactive multimodal systems.

major comments (2)

[Abstract] The central latency claims (200 ms model-side, 550 ms total) are stated without any measurement protocol, model scale (parameter count), hardware, token rates, input/output streaming configuration, or comparison to cascaded baselines; these numbers are load-bearing for the claim that joint modeling inside one block-causal Transformer achieves the reported performance.

[Abstract] No ablation, error analysis, or benchmark results are supplied to support the assertion that perception/reasoning/generation/turn-taking can be learned jointly without external specialized modules or significant performance loss; the absence of any experimental section or table makes the joint-modeling premise impossible to evaluate.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address the points on the abstract below and commit to revisions that add the requested details and validation.

Referee: [Abstract] The central latency claims (200 ms model-side, 550 ms total) are stated without any measurement protocol, model scale (parameter count), hardware, token rates, input/output streaming configuration, or comparison to cascaded baselines; these numbers are load-bearing for the claim that joint modeling inside one block-causal Transformer achieves the reported performance.

Authors: We agree the abstract is too terse on these load-bearing details. In the revised manuscript we will expand the abstract (or add an immediately following paragraph) to specify the measurement protocol, model scale, hardware, token rates, streaming configuration, and cascaded baseline comparisons so the latency numbers can be properly evaluated.

Revision: yes

Referee: [Abstract] No ablation, error analysis, or benchmark results are supplied to support the assertion that perception/reasoning/generation/turn-taking can be learned jointly without external specialized modules or significant performance loss; the absence of any experimental section or table makes the joint-modeling premise impossible to evaluate.

Authors: The current version is a system-description paper focused on the unified architecture. We acknowledge that empirical support is required to substantiate the joint-modeling claims. We will add a dedicated experimental section containing ablations, error analysis, and benchmark results versus cascaded pipelines in the revised manuscript.

Revision: yes

Circularity Check

0 steps flagged

No derivation chain or equations present; latency claims are direct assertions

The paper text supplies only architectural descriptions and numerical latency assertions, with no equations, derivations, fitted parameters, or self-citations that could be inspected for reduction to inputs. The central claims about joint modeling and the 200 ms / 550 ms latencies are stated without any mathematical steps, so a circularity analysis does not apply. This is the usual finding for a self-contained descriptive paper with no load-bearing derivation to evaluate.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract supplies no mathematical formulation, training details, or explicit assumptions; the ledger is therefore empty.

pith-pipeline@v0.9.1-grok · 5880 in / 985 out tokens · 24637 ms · 2026-06-26T05:25:26.346940+00:00 · methodology


[1] [1]

Body of her: A preliminary study on end-to-end humanoid agent.arXiv preprint arXiv:2408.02879, 2024

Tenglong Ao. Body of her: A preliminary study on end-to-end humanoid agent.arXiv preprint arXiv:2408.02879, 2024

arXiv 2024

[2] [2]

Recammaster: Camera-controlled generative rendering from a single video

Jianhong Bai, Menghan Xia, Xiao Fu, Xintao Wang, Lianrui Mu, Jinwen Cao, Zuozhu Liu, Haoji Hu, Xiang Bai, Pengfei Wan, et al. Recammaster: Camera-controlled generative rendering from a single video. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 14834–14844, 2025

2025

[3] [3]

Qwen2.5-vl technical report.arXiv preprint arXiv:2502.13923, 2025

Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-vl technical report.arXiv preprint arXiv:2502.13923, 2025

Pith/arXiv arXiv 2025

[4] [4]

Doubao realtime voice model.https://seed.bytedance.com/en/realtime_voice, 2025

ByteDance Seed Team. Doubao realtime voice model.https://seed.bytedance.com/en/realtime_voice, 2025. Model page, January 20, 2025

2025

[5] [5]

Introducing seed full-duplex speech llm: Attentive listening, robust interference suppression, enabling more natural interaction

ByteDance Seed Team. Introducing seed full-duplex speech llm: Attentive listening, robust interference suppression, enabling more natural interaction. ByteDance Seed Blog, 2026. Blog post, April 9, 2026

2026

[6] [6]

Towards interactive intelligence for digital humans.arXiv preprint arXiv:2512.13674, 2025

Yiyi Cai, Xuangeng Chu, Xiwei Gao, Sitong Gong, Yifei Huang, Caixin Kang, Kunhang Li, et al. Towards interactive intelligence for digital humans.arXiv preprint arXiv:2512.13674, 2025

arXiv 2025

[7] [7]

Diffusion forcing: Next-token prediction meets full-sequence diffusion

Boyuan Chen, Diego Marti Monsó, Yilun Du, Max Simchowitz, Russ Tedrake, and Vincent Sitzmann. Diffusion forcing: Next-token prediction meets full-sequence diffusion. InAdvances in Neural Information Processing Systems, 2024

2024

[8] [8]

Midas: Multimodal interactive digital-human synthesis via real-time autoregressive video generation.arXiv preprint arXiv:2508.19320, 2025

Ming Chen, Liyuan Cui, Wenyuan Zhang, Haoxian Zhang, Yan Zhou, Xiaohan Li, Songlin Tang, Jiwen Liu, Borui Liao, Hejia Chen, et al. Midas: Multimodal interactive digital-human synthesis via real-time autoregressive video generation.arXiv preprint arXiv:2508.19320, 2025

arXiv 2025

[9] [9]

From turn-taking to synchronous dialogue: A survey of full-duplex spoken language models.arXiv preprint arXiv:2509.14515, 2025

Yuxuan Chen and Haoyuan Yu. From turn-taking to synchronous dialogue: A survey of full-duplex spoken language models.arXiv preprint arXiv:2509.14515, 2025

arXiv 2025

[10] [10]

Livetalk: Real-time multimodal interactive video diffusion via improved on-policy distillation.arXiv preprint arXiv:2512.23576, 2025

Ethan Chern, Zhulin Hu, Bohao Tang, Jiadi Su, Steffi Chern, Zhijie Deng, and Pengfei Liu. Livetalk: Real-time multimodal interactive video diffusion via improved on-policy distillation.arXiv preprint arXiv:2512.23576, 2025

arXiv 2025

[11] Liyuan Cui, Wentao Hu, Wenyuan Zhang, Zesong Yang, Fan Shi, and Xiaoqiang Liu. Avatarforcing: One-step streaming talking avatars via local-future sliding-window denoising. arXiv preprint arXiv:2603.14331, 2026

[12] Alexandre Défossez, Laurent Mazaré, Manu Orsini, Amélie Royer, Patrick Pérez, Hervé Jégou, Edouard Grave, and Neil Zeghidour. Moshi: a speech-text foundation model for real-time dialogue. arXiv preprint arXiv:2410.00037, 2024

[13] Xiang Deng, Feng Gao, Yong Zhang, Youxin Pang, Xu Xiaoming, Zhuoliang Kang, Xiaoming Wei, and Yebin Liu. U-mind: A unified framework for real-time multimodal interaction with audiovisual generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10874–10886, 2026

[14] Yikang Ding, Jiwen Liu, Wenyuan Zhang, Zekun Wang, Wentao Hu, Liyuan Cui, Mingming Lao, Yingchao Shao, Hui Liu, Xiaohan Li, Ming Chen, Xiaoqiang Liu, Yu-shen Liu, and Pengfei Wan. Kling-avatar: Grounding multimodal instructions for cascaded long-duration avatar animation synthesis. arXiv preprint arXiv:2509.09595, 2025

[15] Wenyi Hong, Wenmeng Yu, Xiaotao Gu, Guo Wang, Guobing Gan, Haomiao Tang, Jiale Cheng, Ji Qi, Junhui Ji, Lihang Pan, et al. Glm-4.5v and glm-4.1v-thinking: Towards versatile multimodal reasoning with scalable reinforcement learning. arXiv preprint arXiv:2507.01006, 2025

[16] Ke Hu, Ehsan Hosseini-Asl, Chen Chen, Edresson Casanova, Subhankar Ghosh, Piotr Želasko, Zhehuai Chen, Jason Li, Jagadeesh Balam, and Boris Ginsburg. Salm-duplex: Efficient and direct duplex modeling for speech-to-speech language model. arXiv preprint arXiv:2505.15670, 2025

[17] Xun Huang, Zhengqi Li, Guande He, Mingyuan Zhou, and Eli Shechtman. Self forcing: Bridging the train-test gap in autoregressive video diffusion. arXiv preprint arXiv:2506.08009, 2025

[18] Hume AI. Introducing evi 3: The world’s most realistic and instructible speech-language model. https://www.hume.ai/blog/introducing-evi-3, 2025. Blog post

[19] Taekyung Ki, Sangwon Jang, Jaehyeong Jo, Jaehong Yoon, and Sung Ju Hwang. Avatar forcing: Real-time interactive head avatar generation for natural conversation. arXiv preprint arXiv:2601.00664, 2026

[20] Latent.Space. Openai realtime api: The missing manual. https://www.latent.space/p/realtime-api, 2024. Technical blog, December 2024

[21] Chunyu Li, Jiaye Li, Ruiqiao Mei, Haoyuan Xia, Hao Zhu, Jingdong Wang, and Siyu Zhu. Hallo-live: Real-time streaming joint audio-video avatar generation with asynchronous dual-stream and human-centric preference distillation. arXiv preprint arXiv:2604.23632, 2026

[22] Lin Li, Qihang Zhang, Yiming Luo, Shuai Yang, Ruilin Wang, Fei Han, Mingrui Yu, Zelin Gao, Nan Xue, Xing Zhu, Yujun Shen, and Yinghao Xu. Causal world modeling for robot control. arXiv preprint arXiv:2601.21998, 2026

[23] Gaojie Lin, Jianwen Jiang, Jiaqi Yang, Zerong Zheng, and Chao Liang. Omnihuman-1: Rethinking the scaling-up of one-stage conditioned human animation models. arXiv preprint arXiv:2502.01061, 2025

[24] Kunhao Liu, Wenbo Hu, Jiale Xu, Ying Shan, and Shijian Lu. Rolling forcing: Autoregressive long video diffusion in real time. arXiv preprint arXiv:2509.25161, 2025

[25] Chetwin Low and Weimin Wang. Talkingmachines: Real-time audio-driven facetime-style video via autoregressive diffusion models. arXiv preprint arXiv:2506.03099, 2025

[26] OpenAI. Hello gpt-4o. https://openai.com/index/hello-gpt-4o/, 2024. Blog post, May 13, 2024

[27] OpenBMB Team. Minicpm-o 4.5: Towards real-time full-duplex omni-modal interaction. arXiv preprint arXiv:2604.27393, 2026

[28] Youxin Pang, Jiajun Liu, Lingfeng Tan, Yong Zhang, Feng Gao, Xiang Deng, Zhuoliang Kang, Xiaoming Wei, and Yebin Liu. Mavid: A multimodal framework for audio-visual dialogue understanding and generation. arXiv preprint arXiv:2512.03034, 2025

[29] Qwen Team. Qwen3-omni technical report. arXiv preprint arXiv:2509.17765, 2025

[30] Qwen Team. Qwen3.5-omni technical report. arXiv preprint arXiv:2604.15804, 2026

[31] Team Seedance, De Chen, Liyang Chen, Xin Chen, Ying Chen, Zhuo Chen, Zhuowei Chen, Feng Cheng, Tianheng Cheng, Yufeng Cheng, et al. Seedance 2.0: Advancing video generation for world complexity. arXiv preprint arXiv:2604.14148, 2026

[32] Yaofeng Su, Yuming Li, Zeyue Xue, Jie Huang, Siming Fu, Haoran Li, Ying Li, et al. Omniforcing: Unleashing real-time joint audio-visual generation. arXiv preprint arXiv:2603.11647, 2026

[33] Wenqiang Sun, Haiyu Zhang, Haoyuan Wang, Junta Wu, Zehan Wang, Zhenwei Wang, Yunhong Wang, Jun Zhang, Tengfei Wang, and Chunchao Guo. Worldplay: Towards long-term geometric consistency for real-time interactive world modeling. arXiv preprint arXiv:2512.14614, 2025

[34] Zhiyao Sun, Ziqiao Peng, Yifeng Ma, Yi Chen, Zhengguang Zhou, Zixiang Zhou, Guozhen Zhang, Youliang Zhang, Yuan Zhou, Qinglin Lu, and Yong-Jin Liu. Streamavatar: Streaming diffusion models for real-time interactive human avatars. arXiv preprint arXiv:2512.22065, 2025

[35] Kimi Team, Angang Du, Bohong Yin, Bowei Xing, Bowen Qu, Bowen Wang, Cheng Chen, Chenlin Zhang, Chenzhuang Du, Chu Wei, et al. Kimi-vl technical report. arXiv preprint arXiv:2504.07491, 2025

[36] Volcengine. Doubao end-to-end realtime voice model. Volcengine product page, 2025

[37] Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al. Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314, 2025

[38] Lizhen Wang, Yongming Zhu, Zhipeng Ge, Youwei Zheng, Longhao Zhang, Tianshu Hu, Shiyang Qin, et al. Flowact-r1: Towards interactive humanoid video generation. arXiv preprint arXiv:2601.10103, 2026

[39] Zile Wang, Zexiang Liu, Jiaxing Li, Kaichen Huang, Baixin Xu, Fei Kang, Mengyin An, et al. Matrix-game 3.0: Real-time and streaming interactive world model with long-horizon memory. arXiv preprint arXiv:2604.08995, 2026

[40] You Xie, Tianpei Gu, Zenan Li, Chenxu Zhang, Guoxian Song, Xiaochen Zhao, Chao Liang, Jianwen Jiang, Hongyi Xu, and Linjie Luo. X-streamer: Unified human world modeling with audiovisual interaction. arXiv preprint arXiv:2509.21574, 2025

[41] Sicheng Xu, Guojun Chen, Yu-Xiao Guo, Jiaolong Yang, Chong Li, Zhenyu Zang, Yizhong Zhang, Xin Tong, and Baining Guo. Vasa-1: Lifelike audio-driven talking faces generated in real time. arXiv preprint arXiv:2404.10667, 2024

[42] An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025

[43] An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, et al. Qwen2.5 technical report. arXiv preprint arXiv:2412.15115, 2025

[44] Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, et al. Cogvideox: Text-to-video diffusion models with an expert transformer. In ICLR, 2025

[45] Tianwei Yin, Michaël Gharbi, Taesung Park, Richard Zhang, Eli Shechtman, Fredo Durand, and William T. Freeman. Improved distribution matching distillation for fast image synthesis. arXiv preprint arXiv:2405.14867, 2024

[46] Tianwei Yin, Michaël Gharbi, Richard Zhang, Eli Shechtman, Fredo Durand, William T. Freeman, and Taesung Park. One-step diffusion with distribution matching distillation. arXiv preprint arXiv:2311.18828, 2024

[47] Ailing Zeng, Casper Yang, Chauncey Ge, Eddie Zhang, Garvey Xu, Gavin Lin, Gilbert Gu, et al. Lpm 1.0: Video-based character performance model. arXiv preprint arXiv:2604.07823, 2026

[48] Haoyang Zhang, Jun Chen, Donghang Wu, Yuxin Li, Yuxin Zhang, Xiangyu Tony Zhang, Che Liu, Qingjian Lin, Yizhou Peng, Hexin Liu, Eng Siong Chng, Chao Yan, Boyong Wu, Yechang Huang, Xuerui Yang, and Fei Tian. Duplexsla: A full-duplex spoken language model with synchronized speech, language, and action. arXiv preprint arXiv:2605.20755, 2026

[49] Qinglin Zhang, Luyao Cheng, Chong Deng, Qian Chen, Wen Wang, Siqi Zheng, Jiaqing Liu, Hai Yu, Chaohong Tan, Zhihao Du, and Shiliang Zhang. Omniflatten: An end-to-end gpt model for seamless voice conversation. arXiv preprint arXiv:2410.17799, 2024

Appendix A Contributions and Acknowledgements

A.1 Core Contributors

Lianghua Huang, Zhi-Fan Wu, Wei Wang, ...