Echo-Memory: A Controlled Study of Memory in Action World Models
Pith reviewed 2026-06-27 16:56 UTC · model grok-4.3
The pith
Block-wise state-space recurrence stores scene history best for open-domain return in action video models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Under a shared video diffusion backbone, optimizer, camera-action representation, sampler, and evaluation pipeline, block-wise state-space recurrence yields the highest open-domain return scores among raw context, compression-based memory, and spatial summaries with varying read-out paths, while replay quality, in-domain loop revisit, and open-domain return probes routinely disagree.
What carries the argument
The Echo-Memory matched matrix, which separates capacity, compression, read-out, and recurrence by varying only how history is stored and read while holding the action-to-video generator fixed.
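One way to picture the matched matrix is as a set of run configurations that share every fixed component and differ only in the memory variant. The sketch below is an illustration under stated assumptions, not the authors' code: the string labels, `RunConfig`, `build_matrix`, and `differs_only_in_memory` are hypothetical names, and the paper's specific read-out paths are not modelled.

```python
from dataclasses import dataclass

# Components held fixed across every run (labels are illustrative placeholders,
# not identifiers from the paper).
FIXED = {
    "backbone": "shared_video_diffusion",
    "optimizer": "shared_optimizer",
    "action_repr": "camera_action",
    "sampler": "shared_sampler",
    "eval_pipeline": "three_branch",
}

# The only axis that varies: how history is stored and read.
MEMORY_VARIANTS = [
    "raw_context",
    "compression",
    "spatial_summary",
    "state_space_recurrence",
]


@dataclass(frozen=True)
class RunConfig:
    memory: str
    fixed: tuple  # sorted (key, value) pairs so the config stays hashable


def build_matrix():
    """Return one config per memory variant, all sharing the same fixed parts."""
    fixed = tuple(sorted(FIXED.items()))
    return [RunConfig(memory=m, fixed=fixed) for m in MEMORY_VARIANTS]


def differs_only_in_memory(a, b):
    """True when two runs share all fixed components but use different memory."""
    return a.fixed == b.fixed and a.memory != b.memory
```

A check such as `differs_only_in_memory` makes the "vary only memory" premise explicit and easy to assert before any comparison is reported.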
If this is right
- Raw context improves open-domain return far more than it improves replay metrics.
- Aggressive spatial and hybrid-compression memories lose the salient evidence needed for return.
- Block-wise state-space recurrence is the strongest open-domain return mechanism in the tested matrix.
- Replay fidelity is not a sufficient proxy for remembering a world.
- The three evaluation branches disagree, requiring multiple protocols to assess memory.
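Disagreement between evaluation branches can be quantified as a rank correlation between the orderings each branch assigns to the memory designs. The following is a minimal sketch, assuming each branch yields one scalar score per design with no ties; `ranks` and `spearman` are hypothetical helper names, and the review does not state which statistic, if any, the paper uses.

```python
def ranks(scores):
    """Map each design name to its rank (1 = highest score). Assumes no ties."""
    order = sorted(scores, key=scores.get, reverse=True)
    return {name: i + 1 for i, name in enumerate(order)}


def spearman(branch_a, branch_b):
    """Spearman rank correlation between two branches' per-design scores.

    Both arguments are dicts keyed by the same design names.
    Returns 1.0 for identical orderings and -1.0 for fully reversed ones.
    """
    ra, rb = ranks(branch_a), ranks(branch_b)
    n = len(ra)
    d2 = sum((ra[k] - rb[k]) ** 2 for k in ra)
    return 1 - 6 * d2 / (n * (n ** 2 - 1))
```

A value near 1 would mean two branches rank the designs alike; a low or negative value is the kind of disagreement the list above describes.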
Where Pith is reading between the lines
- Designers of future action world models may need to embed recurrence structures rather than simply lengthening context windows.
- Benchmarks should adopt the three-branch protocol instead of replay-only tests to measure actual scene memory.
- Targeted ablations on read-out paths within state-space models could reveal further gains without increasing capacity.
Load-bearing premise
Fixing the action-to-video interface, backbone, optimizer, camera-action representation, sampler, and evaluation pipeline isolates memory design effects without hidden interactions among components.
What would settle it
A replication run under the same fixed pipeline in which block-wise state-space recurrence no longer scores highest on open-domain return after camera leave-and-return sequences.
Original abstract
We present Echo-Memory, a controlled study of memory mechanisms in action-conditioned world models. These models generate multi-segment videos from a first frame, text prompt, and camera-action sequence, but their central failure is often memory rather than local image synthesis: after the camera leaves and returns, the scene or salient object may silently change. Existing memory designs are hard to compare because gains are entangled with backbone, training, retrieval, and evaluation differences. Echo-Memory fixes the action-to-video interface and varies only how history is stored and read by the generator. Under a shared video diffusion backbone, optimizer, camera-action representation, sampler, and evaluation pipeline, we compare raw context, compression-based memory, spatial summaries with different read-out paths, and state-space recurrence. This matched matrix separates four otherwise conflated axes: capacity, compression, read-out, and recurrence. We also evaluate memory through a three-branch protocol: replay quality, in-domain loop revisit, and open-domain return probes. The branches routinely disagree, showing that replay fidelity is not a sufficient proxy for remembering a world. Three findings follow. Raw context is a strong capacity baseline and improves open-domain return far more than it improves replay metrics. Compactness is not a free substitute for capacity: aggressive spatial and hybrid-compression memories lose the salient evidence needed for return. Finally, block-wise state-space recurrence is the strongest open-domain return mechanism in our matrix, showing that the structure of implicit memory matters as much as the decision to use it. These results provide a compact protocol for studying memory in action world models beyond isolated replay metrics.
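The abstract does not spell out how the block-wise state-space recurrence is built. As a rough intuition only, the sketch below uses a generic scalar linear recurrence, h_t = a·h_{t-1} + b·x_t with output y_t = c·h_t, processed one block at a time while the state is carried across block boundaries. The function names and the scalar parameters `a`, `b`, `c` are assumptions for illustration and are not the authors' architecture.

```python
def ssm_scan(xs, h0, a, b, c):
    """Run the linear recurrence over one sequence, starting from state h0."""
    h = h0
    ys = []
    for x in xs:
        h = a * h + b * x   # state update
        ys.append(c * h)    # read-out
    return ys, h


def blockwise_scan(xs, block_size, a, b, c):
    """Process xs in consecutive blocks, carrying the state between blocks.

    Only the fixed-size state h crosses a block boundary, so the memory
    handed to later blocks does not grow with the length of the history.
    """
    h = 0.0
    ys = []
    for start in range(0, len(xs), block_size):
        block_ys, h = ssm_scan(xs[start:start + block_size], h, a, b, c)
        ys.extend(block_ys)
    return ys, h
```

In this toy form the block-wise pass reproduces a single full-sequence scan exactly, which is the sense in which the state acts as an implicit summary of everything seen before the current block.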
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. Echo-Memory presents a controlled empirical comparison of memory mechanisms in action-conditioned video world models. The authors fix the action-to-video interface, shared video diffusion backbone, optimizer, camera-action representation, sampler, and evaluation pipeline while varying only memory storage and readout across raw context, compression-based designs, spatial summaries with different read-out paths, and state-space recurrence. They evaluate via a three-branch protocol (replay quality, in-domain loop revisit, open-domain return) and report three findings: raw context improves open-domain return more than replay metrics; aggressive compression loses salient evidence needed for return; and block-wise state-space recurrence is the strongest open-domain return mechanism, indicating that the structure of implicit memory matters as much as the decision to use memory.
Significance. If the controlled comparisons are robust, the work supplies a reusable matched-matrix protocol for studying memory in action world models and demonstrates that replay fidelity is not a sufficient proxy for world remembering. The explicit separation of capacity, compression, read-out, and recurrence axes, together with the multi-branch evaluation, strengthens comparability across future studies. The result that block-wise state-space recurrence outperforms other designs on open-domain return, if confirmed, would be a concrete, actionable finding for architecture design.
major comments (2)
- [description of the matched matrix and experimental protocol] The central claim that the matched matrix cleanly separates the four axes (capacity, compression, read-out, recurrence) and that block-wise state-space recurrence is therefore the strongest open-domain mechanism rests on the assumption that the fixed optimizer and sampler induce no differential interactions with memory type. The manuscript provides no reported checks (e.g., per-memory training curves, convergence statistics, or sensitivity to optimizer hyperparameters) that would rule out such cross-terms; without them the ranking cannot be confidently attributed to memory structure alone.
- [results and findings paragraphs] The three findings are stated without accompanying quantitative values, effect sizes, or statistical tests in the abstract; the full results section should make explicit the magnitude of improvement of block-wise recurrence over the next-best design on the open-domain return probe and whether the difference survives multiple-comparison correction.
minor comments (1)
- The abstract would be strengthened by a single sentence reporting the key numerical outcome (e.g., open-domain return score for the top memory versus baseline) so readers can immediately gauge effect size.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on Echo-Memory. We address each major comment below and agree to strengthen the manuscript with additional verification and quantitative details.
Point-by-point responses
- Referee: The central claim that the matched matrix cleanly separates the four axes (capacity, compression, read-out, recurrence) and that block-wise state-space recurrence is therefore the strongest open-domain mechanism rests on the assumption that the fixed optimizer and sampler induce no differential interactions with memory type. The manuscript provides no reported checks (e.g., per-memory training curves, convergence statistics, or sensitivity to optimizer hyperparameters) that would rule out such cross-terms; without them the ranking cannot be confidently attributed to memory structure alone.
Authors: We agree that unexamined optimizer-memory interactions could in principle affect rankings. Our protocol holds the optimizer, learning-rate schedule, batch size, and sampler fixed across all variants precisely to minimize such confounds. In revision we will add per-memory training curves and final convergence losses (in supplementary material) to demonstrate that all designs reached comparable optimization states, thereby supporting attribution of performance differences to memory structure rather than training dynamics.
Revision: yes
- Referee: The three findings are stated without accompanying quantitative values, effect sizes, or statistical tests in the abstract; the full results section should make explicit the magnitude of improvement of block-wise recurrence over the next-best design on the open-domain return probe and whether the difference survives multiple-comparison correction.
Authors: The abstract is intentionally concise. We will revise the results section to report the precise improvement (e.g., absolute and relative gain in open-domain return success rate) of block-wise state-space recurrence over the next-best design, together with effect sizes and p-values after multiple-comparison correction across the three evaluation branches. These numbers and the corrected statistical tests will be added to the main text and tables.
Revision: yes
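Neither the referee nor the authors name a specific correction method. As one common choice, the Holm step-down procedure could be applied to the per-branch p-values; the sketch below assumes that choice, and `holm_adjust` is a hypothetical helper rather than anything described in the manuscript.

```python
def holm_adjust(p_values):
    """Return Holm-adjusted p-values in the same order as the input.

    The smallest p-value is multiplied by m, the next by m - 1, and so on;
    a running maximum keeps the adjusted values monotone, and each is
    capped at 1.0.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        adj = min(1.0, (m - rank) * p_values[i])
        running_max = max(running_max, adj)
        adjusted[i] = running_max
    return adjusted
```

A comparison would then be declared significant at level alpha only if its adjusted p-value is at most alpha, which is the sense in which a difference "survives" correction.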
Circularity Check
Empirical controlled comparison; no derivation chain or equations present
Full rationale
The paper is an empirical study that fixes interfaces and varies only memory designs across a matrix of implementations, then reports experimental outcomes on replay, loop, and return metrics. No equations, derivations, fitted parameters, or self-citation chains are described in the provided text that could reduce any claim to its inputs by construction. The central findings (raw context strength, limits of compression, block-wise recurrence ranking) are direct experimental observations under the stated protocol, not algebraic identities or renamed fits. This matches the default expectation of no significant circularity for non-derivational work.
Axiom & Free-Parameter Ledger
axioms (2)
- Domain assumption: The controlled setup with fixed backbone and interface isolates memory effects without significant interactions.
- Domain assumption: The three-branch protocol measures distinct aspects of memory performance.