Wan-Streamer v0.1: End-to-end Real-time Interactive Foundation Models
Pith reviewed 2026-06-26 05:25 UTC · model grok-4.3
The pith
A single Transformer unifies audio, video and text to deliver sub-second full-duplex interaction without external modules.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Wan-Streamer is a native-streaming interactive foundation model that treats language, audio and video as both inputs and outputs inside a single Transformer. The sequence consists of interleaved visual, audio and text tokens coordinated by block-causal attention, which supports incremental streaming units as short as 160 ms at 25 fps. All components of interaction (perception, reasoning, generation, response timing, turn management and cross-modal synchronization) are learned jointly, removing reliance on external specialized modules and reducing the associated pipeline latency and error accumulation. The paper reports approximately 200 ms model-side response latency and approximately 550 ms total interaction latency.
What carries the argument
Block-causal attention over interleaved visual, audio and text input and output tokens inside a single Transformer, enabling incremental streaming.
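To make the mechanism concrete, here is a minimal sketch of a block-causal attention mask. It is not the paper's implementation: the function name `block_causal_mask` and the block sizes are assumptions for illustration, and the paper does not state how many tokens each streaming unit contains. The sketch only shows the general rule that tokens attend freely within their own block and to all earlier blocks, but never to later ones.

```python
def block_causal_mask(block_sizes):
    """Build a block-causal attention mask as a list of lists of bools.

    mask[q][k] is True when query token q may attend to key token k.
    block_sizes is a hypothetical list giving the token count of each
    streaming block, in temporal order.
    """
    # Block index of every token, e.g. [3, 2] -> [0, 0, 0, 1, 1]
    block_ids = [b for b, size in enumerate(block_sizes) for _ in range(size)]
    # A query may see a key if the key's block is the same or earlier.
    return [[key_block <= query_block for key_block in block_ids]
            for query_block in block_ids]


if __name__ == "__main__":
    # Two blocks of 3 and 2 tokens (sizes chosen only for illustration).
    for row in block_causal_mask([3, 2]):
        print("".join("1" if allowed else "0" for allowed in row))
```

Because each new block only adds rows and columns at the end of the mask, earlier blocks never need to be recomputed, which is what makes incremental streaming possible under this kind of attention.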
If this is right
- Pipeline latency drops because separate VAD, ASR, language, TTS and generation stages are removed.
- Error accumulation from module hand-offs is eliminated.
- Streaming units of 160 ms at 25 fps become feasible through redesigned causal encoders, decoders and token scheduling.
- Natural responsiveness emerges from joint learning of timing and turn management.
- Full-duplex audio-visual communication reaches sub-second total latency.
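The figures behind these bullets can be checked with simple arithmetic. The sketch below uses only numbers quoted in the abstract (160 ms units, 25 fps, about 200 ms model-side latency, 350 ms bidirectional network latency). The function names are illustrative, and treating the two latencies as simply additive is an assumption that matches how the abstract combines them.

```python
def frames_per_unit(unit_ms, fps):
    """Number of video frames covered by one streaming unit."""
    return unit_ms * fps / 1000.0


def total_interaction_latency_ms(model_ms, network_ms):
    """Model-side latency plus network latency, assuming they add linearly."""
    return model_ms + network_ms


if __name__ == "__main__":
    print(frames_per_unit(160, 25))                # 4.0 frames per 160 ms unit
    print(total_interaction_latency_ms(200, 350))  # 550 ms, below one second
```

At 25 fps each frame lasts 40 ms, so a 160 ms unit spans four frames, and the 550 ms total is the sum of the two stated components.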
Where Pith is reading between the lines
- Deployment of interactive agents could simplify to a single model rather than maintaining multiple specialized services.
- The same streaming token design might extend to additional modalities while preserving low latency.
- Real-world tests on variable networks would reveal whether the reported 550 ms total latency holds outside controlled conditions.
- Edge-device implementations could become practical if the unified model reduces memory and compute overhead compared with cascaded stacks.
Load-bearing premise
Perception, reasoning, generation, response timing, turn management and cross-modal synchronization can be learned jointly inside one model without external modules or significant performance loss.
What would settle it
A controlled side-by-side measurement of end-to-end latency and interaction quality between Wan-Streamer and an equivalent cascaded pipeline under identical network and hardware conditions.
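One way such a comparison could be summarized is sketched below. This is a hypothetical analysis helper, not a protocol from the paper: the function names, the choice of median and nearest-rank 95th percentile, and the pairing of trials by identical prompt and conditions are all assumptions. The code contains no measurements; the caller supplies latencies recorded on both systems.

```python
import math
import statistics


def summarize(latencies_ms):
    """Median and nearest-rank 95th percentile of a list of latencies."""
    if not latencies_ms:
        raise ValueError("need at least one latency sample")
    ordered = sorted(latencies_ms)
    rank = max(1, math.ceil(0.95 * len(ordered)))  # nearest-rank percentile
    return {
        "n": len(ordered),
        "median_ms": statistics.median(ordered),
        "p95_ms": ordered[rank - 1],
    }


def paired_difference(unified_ms, cascaded_ms):
    """Compare trials run under identical conditions on both systems.

    Trial i of each list is assumed to use the same input, hardware and
    network settings, so the per-trial difference isolates the system.
    """
    if len(unified_ms) != len(cascaded_ms) or not unified_ms:
        raise ValueError("need equally sized, non-empty paired samples")
    savings = [c - u for u, c in zip(unified_ms, cascaded_ms)]
    return {
        "median_saving_ms": statistics.median(savings),
        "unified_faster_fraction": sum(s > 0 for s in savings) / len(savings),
    }
```

Reporting a tail percentile alongside the median matters for interactive systems, since occasional slow responses are what users notice. Interaction quality would need its own separate metric.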
Original abstract
We present Wan-Streamer, a native-streaming, end-to-end interactive foundation model designed from the ground up for real-time, low-latency, full-duplex audio-visual interaction. Wan-Streamer seamlessly models language, audio, and video as both input and output within a single Transformer, where the sequence is represented as interleaved visual, audio, and text input tokens together with visual, audio, and text output tokens, coordinated by block-causal attention for incremental streaming. Unlike cascaded interactive systems that rely on separate VAD, ASR, language, TTS, audio-driven animation, or video-generation modules, Wan-Streamer does not rely on external language, speech, avatar, or video-generation modules: perception, reasoning, generation, response timing, turn management, and cross-modal synchronization are learned jointly within one unified model, reducing pipeline latency and error accumulation. To support natural audio-visual responsiveness, we redesign the entire stack around streamability, including causal encoders, causal decoders, block-causal attention, and low-latency multimodal token scheduling, enabling streaming units as short as 160 ms at 25 fps. Wan-Streamer achieves approximately 200 ms model-side response latency and approximately 550 ms total interaction latency when combined with 350 ms bidirectional network latency, supporting sub-second duplex audio-visual communication. These results position Wan-Streamer as a unified, end-to-end, multimodal interactive foundation model for low-latency streaming interaction.
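The interleaved sequence described in the abstract can be pictured with a small sketch. Everything concrete here is an assumption made for illustration: the ordering of modalities inside a unit, placing inputs before outputs, and the `tokens_per_slot` parameter are not specified in the abstract, which only states that input and output tokens of the three modalities are interleaved and processed incrementally.

```python
MODALITIES = ("visual", "audio", "text")  # order within a unit is assumed
ROLES = ("input", "output")               # inputs before outputs is assumed


def interleave_units(num_units, tokens_per_slot=1):
    """Lay out a token sequence as consecutive streaming units.

    Each element is a (unit_index, role, modality) label standing in for a
    real token. tokens_per_slot is a hypothetical knob for how many tokens
    each modality contributes per unit.
    """
    sequence = []
    for unit in range(num_units):
        for role in ROLES:
            for modality in MODALITIES:
                sequence.extend([(unit, role, modality)] * tokens_per_slot)
    return sequence


if __name__ == "__main__":
    for label in interleave_units(2):
        print(label)
```

In a layout like this, each unit forms one contiguous run of tokens, so a new unit can be appended to the sequence as soon as its 160 ms of input arrives.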
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces Wan-Streamer v0.1, a single block-causal Transformer foundation model that jointly performs perception, reasoning, generation, turn management, and cross-modal synchronization over interleaved visual/audio/text input and output tokens for native-streaming, full-duplex audio-visual interaction. It claims that redesigning the full stack for streamability (causal encoders and decoders, block-causal attention, 160 ms streaming units at 25 fps) yields approximately 200 ms model-side response latency and approximately 550 ms total interaction latency (including 350 ms of network latency), and that it removes the need for cascaded modules such as VAD, ASR, TTS, and separate video generators.
Significance. If the latency and joint-modeling claims are substantiated with reproducible measurements, the work would be significant for demonstrating that a unified streaming Transformer can replace multi-module pipelines while preserving sub-second responsiveness; this would directly address error accumulation and latency in interactive multimodal systems.
Major comments (2)
- [Abstract] The central latency claims (200 ms model-side, 550 ms total) are stated without any measurement protocol, model scale (parameter count), hardware, token rates, input/output streaming configuration, or comparison to cascaded baselines; these numbers are load-bearing for the claim that joint modeling inside one block-causal Transformer achieves the reported performance.
- [Abstract] No ablation, error analysis, or benchmark results are supplied to support the assertion that perception/reasoning/generation/turn-taking can be learned jointly without external specialized modules or significant performance loss; the absence of any experimental section or table makes the joint-modeling premise impossible to evaluate.
Simulated Author's Rebuttal
We thank the referee for the constructive comments. We address the points on the abstract below and commit to revisions that add the requested details and validation.
Point-by-point responses
- Referee: [Abstract] The central latency claims (200 ms model-side, 550 ms total) are stated without any measurement protocol, model scale (parameter count), hardware, token rates, input/output streaming configuration, or comparison to cascaded baselines; these numbers are load-bearing for the claim that joint modeling inside one block-causal Transformer achieves the reported performance.
  Authors: We agree the abstract is too terse on these load-bearing details. In the revised manuscript we will expand the abstract (or add an immediately following paragraph) to specify the measurement protocol, model scale, hardware, token rates, streaming configuration, and cascaded baseline comparisons so the latency numbers can be properly evaluated.
  Revision: yes
- Referee: [Abstract] No ablation, error analysis, or benchmark results are supplied to support the assertion that perception/reasoning/generation/turn-taking can be learned jointly without external specialized modules or significant performance loss; the absence of any experimental section or table makes the joint-modeling premise impossible to evaluate.
  Authors: The current version is a system-description paper focused on the unified architecture. We acknowledge that empirical support is required to substantiate the joint-modeling claims. We will add a dedicated experimental section containing ablations, error analysis, and benchmark results versus cascaded pipelines in the revised manuscript.
  Revision: yes
Circularity Check
No derivation chain or equations are present; the latency claims are direct assertions.
Full rationale
The paper text supplies only architectural descriptions and numerical latency assertions, with no equations, derivations, fitted parameters, or self-citations that could be inspected for reduction to inputs. The central claims about joint modeling and the 200 ms and 550 ms latencies are stated without any mathematical steps, so circularity analysis does not apply. This is the expected finding for a self-contained descriptive paper that has no load-bearing derivation to evaluate.
Appendix A Contributions and Acknowledgements
A.1 Core Contributors
Lianghua Huang, Zhi-Fan Wu, Wei Wang, ...