Echo-Memory: A Controlled Study of Memory in Action World Models

Haoran Li; Haoyang Huang; Haoyu Wang; Jie Huang; Junhao Zhuang; Nan Duan; Shiyi Zhang; Sihan Xu; Songchun Zhang; Wayne King

arXiv: 2606.09803 · v1 · pith:GY4MX5U7 · submitted 2026-06-08 · cs.CV · cs.GR · cs.LG

Wayne King, Zeyue Xue, Yuxuan Bian, Jie Huang, Haoran Li, Yaowei Li, Yaofeng Su, Yuming Li, Haoyu Wang, Shiyi Zhang, Songchun Zhang, Yuwei Niu, Sihan Xu, Junhao Zhuang, Haoyang Huang, Nan Duan

Pith reviewed 2026-06-27 16:56 UTC · model grok-4.3

classification cs.CV · cs.GR · cs.LG

keywords memory mechanisms · action world models · video generation · state-space recurrence · controlled study · open-domain return · replay quality · scene consistency

The pith

Block-wise state-space recurrence stores scene history best for open-domain return in action video models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Echo-Memory runs a controlled comparison of memory designs inside video generators that create multi-segment clips from an initial frame, text prompt, and camera-action sequence. The study keeps the diffusion backbone, optimizer, inputs, sampler, and evaluation pipeline identical so that only the way past frames are stored and retrieved varies. Raw context windows improve return after the camera leaves and comes back more than they improve simple replay. Compression methods often drop the exact details needed to recognize a revisited scene. Block-wise state-space recurrence outperforms the other options on open-domain return. The three test branches routinely disagree, which shows that replay scores alone do not confirm a model has remembered its world.

Core claim

Under a shared video diffusion backbone, optimizer, camera-action representation, sampler, and evaluation pipeline, block-wise state-space recurrence yields higher open-domain return scores than raw context, compression-based memory, and spatial summaries with varying read-out paths. Across the same matrix, the replay quality, in-domain loop revisit, and open-domain return probes routinely disagree.
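For readers unfamiliar with the term, block-wise state-space recurrence can be pictured as a fixed-size state that is updated once per block of frames rather than once per frame. The sketch below is a generic diagonal linear state-space update, not the paper's architecture: the function names, the `decay` and `gain` values, and the mean-pooling of each block are all assumptions made for illustration.

```python
from typing import List, Sequence


def summarize_block(block: Sequence[Sequence[float]]) -> List[float]:
    """Mean-pool a block of per-frame feature vectors (illustrative choice)."""
    dim = len(block[0])
    return [sum(frame[i] for frame in block) / len(block) for i in range(dim)]


def blockwise_ssm(
    frames: Sequence[Sequence[float]],
    block_size: int,
    decay: float = 0.9,  # hypothetical diagonal state matrix A = decay * I
    gain: float = 0.1,   # hypothetical input matrix B = gain * I
) -> List[List[float]]:
    """Return the state after each block: h_k = decay * h_{k-1} + gain * u_k."""
    dim = len(frames[0])
    state = [0.0] * dim
    states = []
    for start in range(0, len(frames), block_size):
        u = summarize_block(frames[start:start + block_size])
        state = [decay * h + gain * x for h, x in zip(state, u)]
        states.append(state)
    return states


# Six frames of 2-d features, processed in blocks of three frames.
frames = [[1.0, 0.0]] * 3 + [[0.0, 1.0]] * 3
states = blockwise_ssm(frames, block_size=3)
```

The state has the same size after every block, so memory cost does not grow with clip length. That is the property that separates recurrence from a raw context window, which grows with the history it keeps.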

What carries the argument

The Echo-Memory matched matrix carries the argument: it separates capacity, compression, read-out, and recurrence by varying only how history is stored and read while keeping the action-to-video generator fixed.
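A matched matrix of this kind can be sketched as a set of run configurations that share every field except one. The class name, field names, and string values below are hypothetical placeholders, not the paper's actual configuration; only the list of fixed components and memory families comes from the abstract.

```python
from dataclasses import dataclass, replace, asdict


@dataclass(frozen=True)
class RunConfig:
    # Fields held fixed across the matrix (values are placeholders).
    backbone: str = "shared-video-diffusion"
    optimizer: str = "shared-optimizer"
    action_repr: str = "camera-action"
    sampler: str = "shared-sampler"
    eval_pipeline: str = "three-branch"
    # The only field allowed to vary.
    memory: str = "raw-context"


MEMORY_VARIANTS = [
    "raw-context",
    "compression",
    "spatial-summary",
    "state-space-recurrence",
]

base = RunConfig()
matrix = [replace(base, memory=m) for m in MEMORY_VARIANTS]


def differing_fields(a: RunConfig, b: RunConfig) -> set:
    """Names of fields whose values differ between two configs."""
    da, db = asdict(a), asdict(b)
    return {k for k in da if da[k] != db[k]}


# Every run should differ from the base in at most the memory field.
assert all(differing_fields(base, cfg) <= {"memory"} for cfg in matrix)
```

Checking the invariant in code catches accidental drift in a "fixed" field. It does not address the subtler concern that a fixed optimizer may still interact differently with each memory type.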

If this is right

Raw context improves open-domain return far more than it improves replay metrics.
Aggressive spatial and hybrid-compression memories lose the salient evidence needed for return.
Block-wise state-space recurrence is the strongest open-domain return mechanism in the tested matrix.
Replay fidelity is not a sufficient proxy for remembering a world.
The three evaluation branches disagree, requiring multiple protocols to assess memory.
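One way to quantify how much the branches disagree is to rank the memory designs under each branch and compare the rankings. The sketch below uses Spearman rank correlation, which is an assumption here since the source does not say how disagreement was measured. The scores are placeholders for four hypothetical designs and are not results from the paper.

```python
from itertools import combinations
from typing import Dict, List


def ranks(scores: List[float]) -> List[int]:
    """Rank positions (1 = best, higher score is better); assumes no ties."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    out = [0] * len(scores)
    for position, idx in enumerate(order, start=1):
        out[idx] = position
    return out


def spearman(a: List[float], b: List[float]) -> float:
    """Spearman rank correlation for tie-free score lists."""
    n = len(a)
    d2 = sum((ra - rb) ** 2 for ra, rb in zip(ranks(a), ranks(b)))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))


# Placeholder scores for four hypothetical designs; NOT results from the paper.
branch_scores: Dict[str, List[float]] = {
    "replay": [0.80, 0.78, 0.75, 0.70],
    "loop_revisit": [0.60, 0.65, 0.55, 0.62],
    "open_return": [0.40, 0.35, 0.30, 0.55],
}

# Pairwise rank agreement between branches.
agreement = {
    (x, y): spearman(branch_scores[x], branch_scores[y])
    for x, y in combinations(branch_scores, 2)
}
```

Values near 1 mean two branches order the designs the same way. Values near 0 or below mean a design that looks strong on one branch can look weak on another, which is the situation the list above describes.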

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Designers of future action world models may need to embed recurrence structures rather than simply lengthening context windows.
Benchmarks should adopt the three-branch protocol instead of replay-only tests to measure actual scene memory.
Targeted ablations on read-out paths within state-space models could reveal further gains without increasing capacity.

Load-bearing premise

Fixing the action-to-video interface, backbone, optimizer, camera-action representation, sampler, and evaluation pipeline isolates memory design effects without hidden interactions among components.

What would settle it

A replication run under the same fixed pipeline in which block-wise state-space recurrence no longer scores highest on open-domain return after camera leave-and-return sequences.

Original abstract

We present Echo-Memory, a controlled study of memory mechanisms in action-conditioned world models. These models generate multi-segment videos from a first frame, text prompt, and camera-action sequence, but their central failure is often memory rather than local image synthesis: after the camera leaves and returns, the scene or salient object may silently change. Existing memory designs are hard to compare because gains are entangled with backbone, training, retrieval, and evaluation differences. Echo-Memory fixes the action-to-video interface and varies only how history is stored and read by the generator. Under a shared video diffusion backbone, optimizer, camera-action representation, sampler, and evaluation pipeline, we compare raw context, compression-based memory, spatial summaries with different read-out paths, and state-space recurrence. This matched matrix separates four otherwise conflated axes: capacity, compression, read-out, and recurrence. We also evaluate memory through a three-branch protocol: replay quality, in-domain loop revisit, and open-domain return probes. The branches routinely disagree, showing that replay fidelity is not a sufficient proxy for remembering a world. Three findings follow. Raw context is a strong capacity baseline and improves open-domain return far more than it improves replay metrics. Compactness is not a free substitute for capacity: aggressive spatial and hybrid-compression memories lose the salient evidence needed for return. Finally, block-wise state-space recurrence is the strongest open-domain return mechanism in our matrix, showing that the structure of implicit memory matters as much as the decision to use it. These results provide a compact protocol for studying memory in action world models beyond isolated replay metrics.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Echo-Memory gives a matched comparison matrix for memory designs in action world models, plus a three-branch evaluation protocol whose branches disagree. The abstract supplies no numbers, however, so the specific rankings are hard to assess.

The main thing to know is that this paper fixes the backbone, optimizer, camera-action interface, sampler, and pipeline, then varies only the memory component across raw context, compression, spatial summaries, and state-space recurrence. It also runs three separate evaluation branches—replay quality, in-domain loop revisit, and open-domain return—and reports that they routinely disagree. That separation of capacity, compression, read-out, and recurrence is the clearest new piece.

The design is better than most prior comparisons because it removes the usual entanglements. Holding everything else constant makes it easier to attribute differences to the memory choice itself, and the disagreement between branches is a useful warning against relying on replay metrics alone.

The soft spot is the complete absence of quantitative results, effect sizes, or statistical details in the abstract. The claims that raw context helps open-domain return more than replay, that aggressive compression loses salient evidence, and that block-wise state-space recurrence is strongest on return cannot be checked without the actual data. The stress-test point about possible cross-term interactions is also reasonable: even with a fixed optimizer and sampler, different memory structures could produce different effective training dynamics, and nothing in the abstract shows this was measured or ruled out.
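The training-dynamics concern could be checked by comparing where each memory variant's loss curve ends up. The sketch below is one hypothetical form of such a check: the function names, the 5% tolerance, and the curves are illustrative assumptions and not measurements from the paper. It also assumes all losses are positive.

```python
from statistics import mean
from typing import Dict, List


def final_loss(curve: List[float], tail: int = 3) -> float:
    """Average of the last `tail` logged losses, to smooth step noise."""
    return mean(curve[-tail:])


def comparable_convergence(
    curves: Dict[str, List[float]],
    rel_tol: float = 0.05,  # hypothetical threshold, not from the paper
) -> bool:
    """True if every variant's final loss is within rel_tol of the best one."""
    finals = {name: final_loss(c) for name, c in curves.items()}
    best = min(finals.values())
    return all((v - best) / best <= rel_tol for v in finals.values())


# Placeholder curves for illustration only; NOT measurements from the paper.
curves = {
    "raw-context": [1.00, 0.60, 0.42, 0.40, 0.41],
    "state-space-recurrence": [1.00, 0.70, 0.45, 0.42, 0.40],
    "compression": [1.00, 0.80, 0.65, 0.60, 0.58],
}
ok = comparable_convergence(curves)
```

A failed check would not prove the ranking wrong. It would only signal that a gap between variants might reflect under-training rather than memory structure.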

This is for researchers working on memory in video world models and generative simulation. The protocol itself is worth seeing in full, so the paper deserves peer review even though the current write-up leaves the strength of the findings open.

Referee Report

2 major / 1 minor

Summary. Echo-Memory presents a controlled empirical comparison of memory mechanisms in action-conditioned video world models. The authors fix the action-to-video interface, shared video diffusion backbone, optimizer, camera-action representation, sampler, and evaluation pipeline while varying only memory storage and read-out across raw context, compression-based designs, spatial summaries with different read-out paths, and state-space recurrence. They evaluate via a three-branch protocol (replay quality, in-domain loop revisit, open-domain return) and report three findings: raw context improves open-domain return more than replay metrics; aggressive compression loses salient evidence needed for return; and block-wise state-space recurrence is the strongest open-domain return mechanism, indicating that the structure of implicit memory matters as much as the decision to use memory.

Significance. If the controlled comparisons are robust, the work supplies a reusable matched-matrix protocol for studying memory in action world models and demonstrates that replay fidelity is not a sufficient proxy for world remembering. The explicit separation of capacity, compression, read-out, and recurrence axes, together with the multi-branch evaluation, strengthens comparability across future studies. The result that block-wise state-space recurrence outperforms other designs on open-domain return, if confirmed, would be a concrete, actionable finding for architecture design.

major comments (2)

[description of the matched matrix and experimental protocol] The central claim that the matched matrix cleanly separates the four axes (capacity, compression, read-out, recurrence) and that block-wise state-space recurrence is therefore the strongest open-domain mechanism rests on the assumption that the fixed optimizer and sampler induce no differential interactions with memory type. The manuscript provides no reported checks (e.g., per-memory training curves, convergence statistics, or sensitivity to optimizer hyperparameters) that would rule out such cross-terms; without them the ranking cannot be confidently attributed to memory structure alone.
[results and findings paragraphs] The three findings are stated without accompanying quantitative values, effect sizes, or statistical tests in the abstract; the full results section should make explicit the magnitude of improvement of block-wise recurrence over the next-best design on the open-domain return probe and whether the difference survives multiple-comparison correction.
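The multiple-comparison correction requested in the second comment could take several forms. The sketch below uses Holm's step-down procedure as one common choice; the source does not say which correction would be applied. The p-values are placeholders and are not taken from the paper.

```python
from typing import List


def holm_reject(p_values: List[float], alpha: float = 0.05) -> List[bool]:
    """Holm step-down procedure: which hypotheses are rejected at level alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for step, idx in enumerate(order):
        # Compare the (step+1)-th smallest p-value against alpha / (m - step).
        if p_values[idx] <= alpha / (m - step):
            reject[idx] = True
        else:
            break  # once one test fails, all larger p-values also fail
    return reject


# Placeholder p-values, one per evaluation branch; NOT from the paper.
p_by_branch = {"replay": 0.30, "loop_revisit": 0.04, "open_return": 0.01}
decisions = dict(zip(p_by_branch, holm_reject(list(p_by_branch.values()))))
```

With these placeholder inputs, a p-value of 0.04 that would pass an uncorrected 0.05 threshold no longer does so after correction. That is the kind of change the referee asks the authors to report.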

minor comments (1)

The abstract would be strengthened by a single sentence reporting the key numerical outcome (e.g., open-domain return score for the top memory versus baseline) so readers can immediately gauge effect size.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on Echo-Memory. We address each major comment below and agree to strengthen the manuscript with additional verification and quantitative details.

Referee: The central claim that the matched matrix cleanly separates the four axes (capacity, compression, read-out, recurrence) and that block-wise state-space recurrence is therefore the strongest open-domain mechanism rests on the assumption that the fixed optimizer and sampler induce no differential interactions with memory type. The manuscript provides no reported checks (e.g., per-memory training curves, convergence statistics, or sensitivity to optimizer hyperparameters) that would rule out such cross-terms; without them the ranking cannot be confidently attributed to memory structure alone.

Authors: We agree that unexamined optimizer-memory interactions could in principle affect rankings. Our protocol holds the optimizer, learning-rate schedule, batch size, and sampler fixed across all variants precisely to minimize such confounds. In revision we will add per-memory training curves and final convergence losses (in supplementary material) to demonstrate that all designs reached comparable optimization states, thereby supporting attribution of performance differences to memory structure rather than training dynamics. revision: yes
Referee: The three findings are stated without accompanying quantitative values, effect sizes, or statistical tests in the abstract; the full results section should make explicit the magnitude of improvement of block-wise recurrence over the next-best design on the open-domain return probe and whether the difference survives multiple-comparison correction.

Authors: The abstract is intentionally concise. We will revise the results section to report the precise improvement (e.g., absolute and relative gain in open-domain return success rate) of block-wise state-space recurrence over the next-best design, together with effect sizes and p-values after multiple-comparison correction across the three evaluation branches. These numbers and the corrected statistical tests will be added to the main text and tables. revision: yes

Circularity Check

0 steps flagged

Empirical controlled comparison; no derivation chain or equations present

The paper is an empirical study that fixes interfaces and varies only memory designs across a matrix of implementations, then reports experimental outcomes on replay, loop, and return metrics. No equations, derivations, fitted parameters, or self-citation chains are described in the provided text that could reduce any claim to its inputs by construction. The central findings (raw context strength, limits of compression, block-wise recurrence ranking) are direct experimental observations under the stated protocol, not algebraic identities or renamed fits. This matches the default expectation of no significant circularity for non-derivational work.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

No free parameters or invented entities are introduced in the abstract. The primary domain assumption is the validity of the controlled experimental design for isolating memory effects and the utility of the three-branch protocol for evaluating memory.

axioms (2)

domain assumption: The controlled setup with fixed backbone and interface isolates memory effects without significant interactions.
This is central to the claim that differences are due to memory designs.
domain assumption: The three-branch protocol measures distinct aspects of memory performance.
Invoked to conclude that replay is not a sufficient proxy.

pith-pipeline@v0.9.1-grok · 5893 in / 1407 out tokens · 32426 ms · 2026-06-27T16:56:32.137986+00:00 · methodology

Reference graph

Works this paper leans on

72 extracted references · 19 linked inside Pith

[1]

Cosmos world foundation model platform for physical ai.arXiv preprint arXiv:2501.03575, 2025

Niket Agarwal, Arslan Ali, Maciej Bala, Yogesh Balaji, Erik Barker, Tiffany Cai, Prithvijit Chattopadhyay, Yongxin Chen, Yin Cui, Yifan Ding, et al. Cosmos world foundation model platform for physical ai.arXiv preprint arXiv:2501.03575, 2025

Pith/arXiv arXiv 2025
[2]

World simulation with video foundation models for physical ai

Arslan Ali, Junjie Bai, Maciej Bala, Yogesh Balaji, Aaron Blakeman, Tiffany Cai, Jiaxin Cao, Tianshi Cao, Elizabeth Cha, Yu-Wei Chao, et al. World simulation with video foundation models for physical ai. arXiv preprint arXiv:2511.00062, 2025

Pith/arXiv arXiv 2025
[3]

Onestory: Coherent multi-shot video generation with adaptive memory.arXiv preprint arXiv:2512.07802, 2025

Zhaochong An, Menglin Jia, Haonan Qiu, Zijian Zhou, Xiaoke Huang, Zhiheng Liu, Weiming Ren, Kumara Kahatapitiya, Ding Liu, Sen He, Chenyang Zhang, Tao Xiang, Fanny Yang, Serge Belongie, and Tian Xie. Onestory: Coherent multi-shot video generation with adaptive memory.arXiv preprint arXiv:2512.07802, 2025

arXiv 2025
[4]

Recammaster: Camera-controlled generative rendering from a single video

Jianhong Bai, Menghan Xia, Xiao Fu, Xintao Wang, Lianrui Mu, Jinwen Cao, Zuozhu Liu, Haoji Hu, Xiang Bai, Pengfei Wan, et al. Recammaster: Camera-controlled generative rendering from a single video. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 14834–14844, 2025

2025
[5]

Mixture of contexts for long video generation

Shengqu Cai, Ceyuan Yang, Lvmin Zhang, Yuwei Guo, Junfei Xiao, Ziyan Yang, Yinghao Xu, Zhenheng Yang, Alan Yuille, Leonidas Guibas, Maneesh Agrawala, Lu Jiang, and Gordon Wetzstein. Mixture of contexts for long video generation. arXiv preprint arXiv:2508.21058, 2025

arXiv 2025
[6]

Spatialvlm: En- dowing vision-language models with spatial reasoning capabilities

BoyuanChen, ZhuoXu, SeanKirmani, BrainIchter, DorsaSadigh, LeonidasGuibas, andFeiXia. Spatialvlm: En- dowing vision-language models with spatial reasoning capabilities. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14455–14465, 2024

2024
[7]

First frame is the place to go for video content customization.arXiv preprint arXiv:2511.15700, 2025

Jingxi Chen, Zongxia Li, Zhichao Liu, Guangyao Shi, Xiyang Wu, Fuxiao Liu, Cornelia Fermuller, Brandon Y Feng, and Yiannis Aloimonos. First frame is the place to go for video content customization.arXiv preprint arXiv:2511.15700, 2025

arXiv 2025
[8]

Out of sight but not out of mind: Hybrid memory for dynamic video world models.arXiv preprint arXiv:2603.25716, 2026

Kaijin Chen, Dingkang Liang, Xin Zhou, Yikang Ding, Xiaoqiang Liu, Pengfei Wan, and Xiang Bai. Out of sight but not out of mind: Hybrid memory for dynamic video world models.arXiv preprint arXiv:2603.25716, 2026

arXiv 2026
[9]

Recurrent autore- gressive diffusion: Global memory meets local attention.arXiv preprint arXiv:2511.12940, 2025

Taiye Chen, Zihan Ding, Anjian Li, Christina Zhang, Zeqi Xiao, Yisen Wang, and Chi Jin. Recurrent autore- gressive diffusion: Global memory meets local attention.arXiv preprint arXiv:2511.12940, 2025

arXiv 2025
[10]

Learning world models for interactive video generation.arXiv preprint arXiv:2505.21996, 2025

Taiye Chen, Xun Hu, Zihan Ding, and Chi Jin. Learning world models for interactive video generation.arXiv preprint arXiv:2505.21996, 2025

Pith/arXiv arXiv 2025
[11]

Teleworld: Towards dynamic multimodal synthesis with a 4d world model

Yabo Chen, Yuanzhi Liang, Jiepeng Wang, Tingxi Chen, Junfei Cheng, Zixiao Gu, Yuyang Huang, Zicheng Jiang, Wei Li, Tian Li, Weichen Li, Zuoxin Li, Guangce Liu, Jialun Liu, Junqi Liu, Haoyuan Wang, Qizhen Weng, Xuan’er Wu, Xunzhi Xiang, Xiaoyan Yang, Xin Zhang, Shiwen Zhang, Junyu Zhou, Chengcheng Zhou, Haibin Huang, Chi Zhang, and Xuelong Li. Teleworld: T...

arXiv 2026
[12]

Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities.arXiv preprint arXiv:2507.06261, 2025

Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities.arXiv preprint arXiv:2507.06261, 2025

Pith/arXiv arXiv 2025
[13]

Yufeng Cui, Honghao Chen, Haoge Deng, Xu Huang, Xinghang Li, Jirong Liu, Yang Liu, Zhuoyan Luo, Jin- sheng Wang, Wenxuan Wang, et al. Emu3. 5: Native multimodal models are world learners.arXiv preprint arXiv:2510.26583, 2025

Pith/arXiv arXiv 2025
[14]

Veo 3, 2025

Google DeepMind. Veo 3, 2025. URLhttps://deepmind.google/technologies/veo

2025
[15]

Molmo and pixmo: Open weights and open data for state- of-the-art vision-language models

Matt Deitke, Christopher Clark, Sangho Lee, Rohun Tripathi, Yue Yang, Jae Sung Park, Mohammadreza Salehi, Niklas Muennighoff, Kyle Lo, Luca Soldaini, et al. Molmo and pixmo: Open weights and open data for state- of-the-art vision-language models. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 91–104, 2025

2025
[16]

Liveworld: Simulating out-of-sight dynamics in generative video world models.arXiv preprint arXiv:2603.07145, 2026

Zicheng Duan, Jiatong Xia, Zeyu Zhang, Wenbo Zhang, Gengze Zhou, Chenhui Gou, Yefei He, Feng Chen, Xinyu Zhang, and Lingqiao Liu. Liveworld: Simulating out-of-sight dynamics in generative video world models.arXiv preprint arXiv:2603.07145, 2026

arXiv 2026
[17]

A survey of world models for autonomous driving.arXiv preprint arXiv:2501.11260, 2025

Tuo Feng, Wenguan Wang, and Yi Yang. A survey of world models for autonomous driving.arXiv preprint arXiv:2501.11260, 2025. 14

arXiv 2025
[18]

Plenoptic video generation

Xiao Fu, Shitao Tang, Min Shi, Xian Liu, Jinwei Gu, Ming-Yu Liu, Dahua Lin, and Chen-Hsuan Lin. Plenoptic video generation. arXiv preprint arXiv:2601.05239, 2026

arXiv 2026
[19]

Seedance 1.0: Exploring the boundaries of video generation models

Yu Gao, Haoyuan Guo, Tuyen Hoang, Weilin Huang, Lu Jiang, Fangyuan Kong, Huixia Li, Jiashi Li, Liang Li, Xiaojie Li, et al. Seedance 1.0: Exploring the boundaries of video generation models. arXiv preprint arXiv:2506.09113, 2025

Pith/arXiv arXiv 2025
[20]

Beyond pixel histories: World models with persistent 3d state.arXiv preprint arXiv:2603.03482, 2026

Samuel Garcin, Thomas Walker, Steven McDonagh, Tim Pearce, Hakan Bilen, Tianyu He, Kaixin Wang, and Jiang Bian. Beyond pixel histories: World models with persistent 3d state.arXiv preprint arXiv:2603.03482, 2026

Pith/arXiv arXiv 2026
[21]

World models.arXiv preprint arXiv:1803.10122, 2(3), 2018

David Ha and Jürgen Schmidhuber. World models.arXiv preprint arXiv:1803.10122, 2(3), 2018

Pith/arXiv arXiv 2018
[22]

Mastering diverse domains through world models

Danijar Hafner, Jurgis Pasukonis, Jimmy Ba, and Timothy Lillicrap. Mastering diverse domains through world models. arXiv preprint arXiv:2301.04104, 2023

Pith/arXiv arXiv 2023
[23]

Imagen video: High definition video generation with diffusion models

Jonathan Ho, William Chan, Chitwan Saharia, Jay Whang, Ruiqi Gao, Alexey Gritsenko, Diederik P Kingma, Ben Poole, Mohammad Norouzi, David J Fleet, et al. Imagen video: High definition video generation with diffusion models. arXiv preprint arXiv:2210.02303, 2022

Pith/arXiv arXiv 2022
[24]

Relic: Interactive video world model with long-horizon memory.arXiv preprint arXiv:2512.04040, 2025

Yicong Hong, Yiqun Mei, Chongjian Ge, Yiran Xu, Yang Zhou, Sai Bi, Yannick Hold-Geoffroy, Mike Roberts, Matthew Fisher, Eli Shechtman, Kalyan Sunkavalli, Feng Liu, Zhengqi Li, and Hao Tan. Relic: Interactive video world model with long-horizon memory.arXiv preprint arXiv:2512.04040, 2025

arXiv 2025
[25]

Geometry-as-context: Modulating explicit 3d in scene-consistent video generation to geometry context.arXiv preprint arXiv:2602.21929, 2026

JiaKui Hu, Jialun Liu, Liying Yang, Xinliang Zhang, Kaiwen Li, Shuang Zeng, Yuanwei Li, Haibin Huang, Chi Zhang, and Yanye Lu. Geometry-as-context: Modulating explicit 3d in scene-consistent video generation to geometry context.arXiv preprint arXiv:2602.21929, 2026

arXiv 2026
[26]

Simulating the real world: A unified survey of multimodal generative models.arXiv preprint arXiv:2503.04641, 2025

Yuqi Hu, Longguang Wang, Xian Liu, Ling-Hao Chen, Yuwei Guo, Yukai Shi, Ce Liu, Anyi Rao, Zeyu Wang, and Hui Xiong. Simulating the real world: A unified survey of multimodal generative models.arXiv preprint arXiv:2503.04641, 2025

arXiv 2025
[27]

Memory forcing: Spatio-temporal memory for consistent scene generation on minecraft.arXiv preprint arXiv:2510.03198, 2025

Junchao Huang, Xinting Hu, Boyao Han, Shaoshuai Shi, Zhuotao Tian, Tianyu He, and Li Jiang. Memory forcing: Spatio-temporal memory for consistent scene generation on minecraft.arXiv preprint arXiv:2510.03198, 2025

arXiv 2025
[28]

Memflow: Flowing adaptive memory for consistent and efficient long video narratives.arXiv preprint arXiv:2512.14699, 2025

Sihui Ji, Xi Chen, Shuai Yang, Xin Tao, Pengfei Wan, and Hengshuang Zhao. Memflow: Flowing adaptive memory for consistent and efficient long video narratives.arXiv preprint arXiv:2512.14699, 2025

arXiv 2025
[29]

A path towards autonomous machine intelligence version 0.9

Yann LeCun et al. A path towards autonomous machine intelligence version 0.9. 2, 2022-06-27.Open Review, 62(1):1–62, 2022

2022
[30]

Memory-v2v: Memory-augmented video-to-video diffusion for consistent multi-turn editing.arXiv preprint arXiv:2601.16296, 2026

DohunLee, Chun-HaoPaulHuang, XuelinChen, JongChulYe, DuyguCeylan, andHyeonhoJeong. Memory-v2v: Memory-augmented video-to-video diffusion for consistent multi-turn editing.arXiv preprint arXiv:2601.16296, 2026

arXiv 2026
[31]

3d scene prompting for scene-consistent camera-controllable video generation

JoungBin Lee, JaewooJung, Jisang Han, TakuyaNarihira, Kazumi Fukuda, Junyoung Seo, SunghwanHong, Yuki Mitsufuji, and Seungryong Kim. 3d scene prompting for scene-consistent camera-controllable video generation. arXiv preprint arXiv:2510.14945, 2025

arXiv 2025
[32]

Cubecomposer: Spatio-temporal autoregressive 4k 360°video generation from perspective video.arXiv preprint arXiv:2603.04291, 2026

Lingen Li, Guangzhi Wang, Xiaoyu Li, Zhaoyang Zhang, Qi Dou, Jinwei Gu, Tianfan Xue, and Ying Shan. Cubecomposer: Spatio-temporal autoregressive 4k 360°video generation from perspective video.arXiv preprint arXiv:2603.04291, 2026

arXiv 2026
[33]

A comprehensive survey on world models for embodied ai

Xinqing Li, Xin He, Le Zhang, Min Wu, Xiaoli Li, and Yun Liu. A comprehensive survey on world models for embodied ai. arXiv preprint arXiv:2510.16732, 2025

arXiv 2025
[34]

Flow equivariant world models: Memory for partially observed dynamic environments.arXiv preprint arXiv:2601.01075, 2026

Hansen Jin Lillemark, Benhao Huang, Fangneng Zhan, Yilun Du, and Thomas Anderson Keller. Flow equivariant world models: Memory for partially observed dynamic environments.arXiv preprint arXiv:2601.01075, 2026

Pith/arXiv arXiv 2026
[35]

Javisdit: Joint audio-video diffusion transformer with hierarchical spatio-temporal prior synchronization

Kai Liu, Wei Li, Lai Chen, Shengqiong Wu, Yanhao Zheng, Jiayi Ji, Fan Zhou, Rongxin Jiang, Jiebo Luo, Hao Fei, et al. Javisdit: Joint audio-video diffusion transformer with hierarchical spatio-temporal prior synchronization. arXiv preprint arXiv:2503.23377, 2025. 15

arXiv 2025
[36]

A survey: Learning embodied intelligence from physical simulators and world models

Xiaoxiao Long, Qingrui Zhao, Kaiwen Zhang, Zihao Zhang, Dingrui Wang, Yumeng Liu, Zhengjie Shu, Yi Lu, Shouzheng Wang, Xinzhe Wei, et al. A survey: Learning embodied intelligence from physical simulators and world models. arXiv preprint arXiv:2507.00917, 2025

arXiv 2025
[37]

Camclonemaster: Enabling reference-based camera control for video generation

Yawen Luo, Xiaoyu Shi, Jianhong Bai, Menghan Xia, Tianfan Xue, Xintao Wang, Pengfei Wan, Di Zhang, and Kun Gai. Camclonemaster: Enabling reference-based camera control for video generation. InProceedings of the SIGGRAPH Asia 2025 Conference Papers, pages 1–10, 2025

2025
[38]

Spatialrea- soner: Towards explicit and generalizable 3d spatial reasoning.arXiv preprint arXiv:2504.20024, 2025

Wufei Ma, Yu-Cheng Chou, Qihao Liu, Xingrui Wang, Celso de Melo, Jianwen Xie, and Alan Yuille. Spatialrea- soner: Towards explicit and generalizable 3d spatial reasoning.arXiv preprint arXiv:2504.20024, 2025

arXiv 2025
[39]

Yume: An interactive world generation model.arXiv preprint arXiv:2507.17744, 2025

Xiaofeng Mao, Shaoheng Lin, Zhen Li, Chuanhao Li, Wenshuo Peng, Tong He, Jiangmiao Pang, Mingmin Chi, Yu Qiao, and Kaipeng Zhang. Yume: An interactive world generation model.arXiv preprint arXiv:2507.17744, 2025

arXiv 2025
[40]

Holocine: Holistic generation of cinematic multi-shot long video narratives.arXiv preprint arXiv:2510.20822, 2025

Yihao Meng, Hao Ouyang, Yue Yu, Qiuyu Wang, Wen Wang, Ka Leong Cheng, Hanlin Wang, Yixuan Li, Cheng Chen, Yanhong Zeng, Yujun Shen, and Huamin Qu. Holocine: Holistic generation of cinematic multi-shot long video narratives.arXiv preprint arXiv:2510.20822, 2025

arXiv 2025
[41]

Sora, 2024

OpenAI. Sora, 2024. URLhttps://openai.com/sora

2024
[42]

Gpt-5, 2025

OpenAI. Gpt-5, 2025. URLhttps://openai.com/gpt-5

2025
[43]

Sora 2: Video generation model, 2025

OpenAI. Sora 2: Video generation model, 2025. URLhttps://openai.com/sora

2025
[44]

Long-context state-space video world models.arXiv preprint arXiv:2505.20171, 2025

Ryan Po, Yotam Nitzan, Richard Zhang, Berlin Chen, Tri Dao, Eli Shechtman, Gordon Wetzstein, and Xun Huang. Long-context state-space video world models.arXiv preprint arXiv:2505.20171, 2025

arXiv 2025
[45]

Multigen: Level-design for editable multiplayer worlds in diffusion game engines.arXiv preprint arXiv:2603.06679, 2026

Ryan Po, David Junhao Zhang, Amir Hertz, Gordon Wetzstein, Neal Wadhwa, and Nataniel Ruiz. Multigen: Level-design for editable multiplayer worlds in diffusion game engines.arXiv preprint arXiv:2603.06679, 2026

arXiv 2026
[46]

Solaris: Building a multiplayer video world model in minecraft

Georgy Savva, Oscar Michel, Daohan Lu, Suppakit Waiwitlikhit, Timothy Meehan, Dhairya Mishra, Srivats Poddar, Jack Lu, and Saining Xie. Solaris: Building a multiplayer video world model in minecraft. arXiv preprint arXiv:2602.22208, 2026

arXiv 2026
[47]

Learning plug- and-play memory for guiding video diffusion models.arXiv preprint arXiv:2511.19229, 2025

Selena Song, Ziming Xu, Zijun Zhang, Kun Zhou, Jiaxian Guo, Lianhui Qin, and Biwei Huang. Learning plug- and-play memory for guiding video diffusion models.arXiv preprint arXiv:2511.19229, 2025

arXiv 2025
[48]

Worldplay: Towards long-term geometric consistency for real-time interactive world modeling.arXiv preprint arXiv:2512.14614, 2025

Wenqiang Sun, Haiyu Zhang, Haoyuan Wang, Junta Wu, Zehan Wang, Zhenwei Wang, Yunhong Wang, Jun Zhang, Tengfei Wang, and Chunchao Guo. Worldplay: Towards long-term geometric consistency for real-time interactive world modeling.arXiv preprint arXiv:2512.14614, 2025

Pith/arXiv arXiv 2025
[49]

Fsvideo: Fast speed video diffusion model in a highly-compressed latent space.arXiv preprint arXiv:2602.02092, 2026

FSVideo Team, Qingyu Chen, Zhiyuan Fang, Haibin Huang, Xinwei Huang, Tong Jin, Minxuan Lin, Bo Liu, Celong Liu, Chongyang Ma, Xing Mei, Xiaohui Shen, Yaojie Shen, Fuwen Tan, Angtian Wang, Xiao Yang, Yiding Yang, Jiamin Yuan, Lingxi Zhang, and Yuxin Zhang. Fsvideo: Fast speed video diffusion model in a highly-compressed latent space.arXiv preprint arXiv:26...

arXiv 2026
[50]

Inspatio-worldfm: An open-source real-time generative frame model.arXiv preprint arXiv:2603.11911, 2026

InSpatio Team, Donghui Shen, Guofeng Zhang, Haomin Liu, Haoyu Ji, Jialin Liu, Jing Guo, Nan Wang, Siji Pan, Weihong Pan, Weijian Xie, Xiaojun Xiang, Xiaoyu Zhang, Xianbin Liu, Yifu Wang, Yipeng Chen, Zhewen Le, Zhichao Ye, and Ziqiang Zhao. Inspatio-worldfm: An open-source real-time generative frame model.arXiv preprint arXiv:2603.11911, 2026

Pith/arXiv arXiv 2026
[51]

Advancing open-source world models.arXiv preprint arXiv:2601.20540, 2026

Robbyant Team, Zelin Gao, Qiuyu Wang, Yanhong Zeng, Jiapeng Zhu, Ka Leong Cheng, Yixuan Li, Hanlin Wang, Yinghao Xu, Shuailei Ma, et al. Advancing open-source world models.arXiv preprint arXiv:2601.20540, 2026

Pith/arXiv arXiv 2026
[52]

Wan 2.5: Unified multi-modal video generation framework, 2025

Alibaba Tongyi. Wan 2.5: Unified multi-modal video generation framework, 2025. URLhttps://tongyi.aliyun. com/wan

2025
[53]

Wan: Open and advanced large-scale video generative models.arXiv preprint arXiv:2503.20314, 2025

TeamWan, AngWang, BaoleAi, BinWen, ChaojieMao, Chen-WeiXie, DiChen, FeiwuYu, HaimingZhao, Jianx- iao Yang, et al. Wan: Open and advanced large-scale video generative models.arXiv preprint arXiv:2503.20314, 2025

Pith/arXiv arXiv 2025
[54]

Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution. arXiv preprint arXiv:2409.12191, 2024.

[55]

Zun Wang, Han Lin, Jaehong Yoon, Jaemin Cho, Yue Zhang, and Mohit Bansal. Anchorweave: World-consistent video generation with retrieved local spatial memories. arXiv preprint arXiv:2602.14941, 2026.

[56]

Ruiqi Wu, Xuanhua He, Meng Cheng, Tianyu Yang, Yong Zhang, Zhuoliang Kang, Xunliang Cai, Xiaoming Wei, Chunle Guo, Chongyi Li, et al. Infinite-world: Scaling interactive world models to 1000-frame horizons via pose-free hierarchical memory. arXiv preprint arXiv:2602.02393, 2026.

[57]

Tong Wu, Shuai Yang, Ryan Po, Yinghao Xu, Ziwei Liu, Dahua Lin, and Gordon Wetzstein. Video world models with long-term spatial memory. arXiv preprint arXiv:2506.05284, 2025.

[58]

Xiaofei Wu, Guozhen Zhang, Zhiyong Xu, Yuan Zhou, Qinglin Lu, and Xuming He. Pack and force your memory: Long-form and consistent video generation. arXiv preprint arXiv:2510.01784, 2025.

[59]

Jin Xu, Zhifang Guo, Hangrui Hu, Yunfei Chu, Xiong Wang, Jinzheng He, Yuxuan Wang, Xian Shi, Ting He, Xinfa Zhu, et al. Qwen3-omni technical report. arXiv preprint arXiv:2509.17765, 2025.

[60]

Tianxing Xu, Zixuan Wang, Guangyuan Wang, Li Hu, Zhongyi Zhang, Peng Zhang, Bang Zhang, and Song-Hai Zhang. Ucm: Unifying camera control and memory with time-aware positional encoding warping for world models. arXiv preprint arXiv:2602.22960, 2026.

[61]

Shuai Yang, Wei Huang, Ruihang Chu, Yicheng Xiao, Yuyang Zhao, Xianbang Wang, Muyang Li, Enze Xie, Yingcong Chen, Yao Lu, Song Han, and Yukang Chen. Longlive: Real-time interactive long video generation. arXiv preprint arXiv:2509.22622, 2025.

[62]

Jiwen Yu, Jianhong Bai, Yiran Qin, Quande Liu, Xintao Wang, Pengfei Wan, Di Zhang, and Xihui Liu. Context as memory: Scene-consistent interactive long video generation with memory retrieval. arXiv preprint arXiv:2506.03141, 2025.

[63]

Jiwen Yu, Yiran Qin, Haoxuan Che, Quande Liu, Xintao Wang, Pengfei Wan, Di Zhang, Kun Gai, Hao Chen, and Xihui Liu. A survey of interactive generative video. arXiv preprint arXiv:2504.21853, 2025.

[64]

Yifei Yu, Xiaoshan Wu, Xinting Hu, Tao Hu, Yangtian Sun, Xiaoyang Lyu, Bo Wang, Lin Ma, Yuewen Ma, Zhongrui Wang, and Xiaojuan Qi. Videossm: Autoregressive long video generation with hybrid state-space memory. arXiv preprint arXiv:2512.04519, 2025.

[65]

Kaiwen Zhang, Liming Jiang, Angtian Wang, Jacob Zhiyuan Fang, Tiancheng Zhi, Qing Yan, Hao Kang, Xin Lu, and Xingang Pan. Storymem: Multi-shot long video storytelling with memory. arXiv preprint arXiv:2512.19539, 2025.

[66]

Tianyuan Zhang, Sai Bi, Yicong Hong, Kai Zhang, Fujun Luan, Songlin Yang, Kalyan Sunkavalli, William T Freeman, and Hao Tan. Test-time training done right. arXiv preprint arXiv:2505.23884, 2025.

[67]

Yifu Zhang, Hao Yang, Yuqi Zhang, Yifei Hu, Fengda Zhu, Chuang Lin, Xiaofeng Mei, Yi Jiang, Bingyue Peng, and Zehuan Yuan. Waver: Wave your way to lifelike video generation. arXiv preprint arXiv:2508.15761, 2025.

[68]

Jinjing Zhao, Fangyun Wei, Zhening Liu, Hongyang Zhang, Chang Xu, and Yan Lu. Spatia: Video generation with updatable spatial memory. arXiv preprint arXiv:2512.15716, 2025.

[69]

Dingcheng Zhen, Xu Zheng, Ruixin Zhang, Zhiqi Jiang, Yichao Yan, Ming Tao, and Shunshun Yin. Soulx-liveact: Towards hour-scale real-time human animation with neighbor forcing and convkv memory. arXiv preprint arXiv:2603.11746, 2026.

[70]

Jinsong Zhou, Yihua Du, Xinli Xu, Luozhou Wang, Zijie Zhuang, Yehang Zhang, Shuaibo Li, Xiaojun Hu, Bolan Su, and Ying-cong Chen. Videomemory: Toward consistent video generation via memory integration. arXiv preprint arXiv:2601.03655, 2026.

[71]

Tianrui Zhu, Shiyi Zhang, Zhirui Sun, Jingqi Tian, and Yansong Tang. Memorize-and-generate: Towards long-term consistency in real-time video generation. arXiv preprint arXiv:2512.18741, 2025.

[72]

Zheng Zhu, Xiaofeng Wang, Wangbo Zhao, Chen Min, Bohan Li, Nianchen Deng, Min Dou, Yuqi Wang, Botian Shi, Kai Wang, et al. Is sora a world simulator? a comprehensive survey on general world models and beyond. arXiv preprint arXiv:2405.03520, 2024.

A Action World Models: Preliminaries and Related Work

The key components of an action world model are thev...
