arxiv: 2605.18601 · v1 · pith:JT2CNRU6new · submitted 2026-05-18 · 💻 cs.CV

Incantation: Natural Language as the Action Interface for Multi-Entity Video World Models

Pith reviewed 2026-05-20 10:53 UTC · model grok-4.3

classification 💻 cs.CV
keywords video world models · natural language conditioning · multi-entity control · cross-entity transfer · interactive video generation · action interface · long-horizon streaming · Elden Ring dataset

The pith

Natural language conditioning enables simultaneous multi-entity control and cross-entity concept transfer in video world models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that the main barrier in current interactive video world models is their rigid action interfaces, which bind controls to fixed entities or engines at design time. The authors argue that switching to natural language as the per-latent-frame conditioning signal removes this restriction and allows expressive, simultaneous control over multiple entities along with concept-level transfer between them. Incantation demonstrates this by adapting a pretrained bidirectional video backbone with frame-local text cross-attention and supporting long rollouts through specialized distillation and caching techniques. A sympathetic reader would care because such an interface could make complex scene control in games or simulations as intuitive as writing descriptions rather than selecting predefined actions.

Core claim

Incantation is the first interactive video world model that treats natural language as the per-latent-frame (0.25 s) action interface. It pairs a pretrained bidirectional video backbone with frame-local text cross-attention to support simultaneous multi-entity control and concept-level cross-entity transfer beyond any fixed rendering pipeline. Real-time long-horizon streaming is enabled by ODE-initialized Self-Forcing distillation together with a RoPE-decoupled sliding KV-cache. The system outperforms the Action-Index baseline on cross-entity transfer (89 percent versus 43 percent) and out-of-vocabulary prompts (90 percent versus 0 percent) while sustaining 19.7 FPS at 480p with stable FVD.
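
The abstract names a RoPE-decoupled sliding KV-cache, but this page does not describe how it works. One plausible reading, sketched here purely as an assumption, is that keys are cached without positional rotation and RoPE is applied at read time with window-relative positions, so position indices stay bounded however long the rollout runs. The class name, the window size, and the two-dimensional toy keys are all hypothetical and are not taken from the paper.

```python
import math
from collections import deque


def rope_rotate(vec, position, theta=0.1):
    """Rotate a 2-D vector by position * theta (a one-pair toy version of RoPE)."""
    angle = position * theta
    x, y = vec
    return (x * math.cos(angle) - y * math.sin(angle),
            x * math.sin(angle) + y * math.cos(angle))


class SlidingKVCache:
    """Hypothetical cache: stores un-rotated keys and evicts the oldest entry."""

    def __init__(self, window):
        self.keys = deque(maxlen=window)    # raw keys, no position baked in
        self.values = deque(maxlen=window)

    def append(self, key, value):
        self.keys.append(key)
        self.values.append(value)

    def positioned_keys(self):
        # Positions are reassigned relative to the window on every read,
        # so they never exceed window - 1 (an assumed design, not the paper's).
        return [rope_rotate(k, pos) for pos, k in enumerate(self.keys)]


cache = SlidingKVCache(window=4)
for frame in range(10):                     # stream 10 latent frames
    cache.append((1.0, 0.0), f"v{frame}")

positions = list(range(len(cache.keys)))
rotated = cache.positioned_keys()
```

After ten appends the cache holds only the last four entries, and the positions seen by attention run from 0 to 3 regardless of how many frames have been streamed.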

What carries the argument

Frame-local text cross-attention applied to each latent frame of a pretrained bidirectional video backbone, which injects natural language instructions independently per frame to drive multi-entity actions.
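
The frame-local restriction can be illustrated with a boolean attention mask. This is a minimal sketch that assumes visual and text tokens are flattened frame by frame; the function name, the token counts, and the layout are illustrative choices, not details reported by the authors.

```python
def frame_local_mask(num_frames, visual_per_frame, text_per_frame):
    """mask[q][k] is True when visual token q may attend to text token k."""
    mask = []
    for q in range(num_frames * visual_per_frame):
        q_frame = q // visual_per_frame          # frame that owns visual token q
        row = [k // text_per_frame == q_frame    # same frame's prompt tokens only
               for k in range(num_frames * text_per_frame)]
        mask.append(row)
    return mask


# Toy sizes (assumed): 3 latent frames, 2 visual and 2 text tokens per frame.
mask = frame_local_mask(num_frames=3, visual_per_frame=2, text_per_frame=2)
```

Each row allows exactly the text tokens of the query's own frame, which is the block-diagonal structure that keeps one frame's instruction from leaking into another.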

If this is right

  • The model surpasses the Action-Index baseline by achieving 89 percent success on cross-entity transfer and 90 percent on out-of-vocabulary prompts.
  • It sustains real-time generation at 19.7 FPS at 480p with stable FVD across 2-hour rollouts.
  • The same architecture applies to other environments such as The King of Fighters by changing only the per-entity action vocabulary slots.
  • A preview dataset of Elden Ring player-boss combat scenes with structured action metadata has been released to support further training.
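
A quick back-of-the-envelope calculation shows what the reported throughput implies. The sketch assumes the 2-hour figure is wall-clock generation time when counting frames, and video time when counting prompt slots; the page does not say which is meant.

```python
FPS = 19.7                     # reported generation throughput at 480p
LATENT_FRAME_SECONDS = 0.25    # reported conditioning granularity
ROLLOUT_SECONDS = 2 * 60 * 60  # 2-hour rollout

# Frames produced if the model runs at 19.7 FPS for 2 hours of wall-clock time.
frames_generated = round(FPS * ROLLOUT_SECONDS)

# Language-conditioning slots in 2 hours of video, one per latent frame.
prompt_slots = round(ROLLOUT_SECONDS / LATENT_FRAME_SECONDS)
```

Under these assumptions the rollout covers 141,840 generated frames and 28,800 separate points at which a new instruction could be issued.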

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The language interface could extend beyond games to domains like robotic simulation or animated storytelling where users describe behaviors for multiple agents.
  • It opens the possibility of creating novel scenarios by describing entity interactions in everyday words rather than predefined controls.
  • Strong performance on out-of-vocabulary prompts suggests the approach may handle instructions that go beyond the training distribution.
  • Stable long-horizon coherence indicates the method could support extended interactive sessions without frequent resets.

Load-bearing premise

That adding frame-local text cross-attention to a pretrained video backbone is enough to achieve multi-entity control and cross-entity concept transfer while preserving visual fidelity and temporal coherence over long sequences.

What would settle it

Generate a video sequence in which two distinct entities receive contradictory natural language instructions within the same 0.25-second frame and check whether the output shows coherent, separate actions for each entity instead of merged or incoherent motion.
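
Such a probe needs a per-slot, per-entity instruction schedule. The following is a minimal sketch of one way to build it, assuming a simple `entity: instruction` prompt format joined by semicolons; that format, the entity names, and the example instructions are hypothetical.

```python
LATENT_FRAME_SECONDS = 0.25  # conditioning granularity reported in the paper


def to_latent_slots(start_s, end_s):
    """Latent-frame indices covered by the half-open interval [start_s, end_s)."""
    first = round(start_s / LATENT_FRAME_SECONDS)
    last = round(end_s / LATENT_FRAME_SECONDS)
    return list(range(first, last))


def build_schedule(timeline):
    """Map slot index -> {entity: instruction} from (entity, start, end, text) rows."""
    schedule = {}
    for entity, start_s, end_s, text in timeline:
        for slot in to_latent_slots(start_s, end_s):
            schedule.setdefault(slot, {})[entity] = text
    return schedule


# Hypothetical probe: opposite instructions for two entities in the same slots.
timeline = [
    ("player", 0.0, 0.5, "roll backward away from the boss"),
    ("boss", 0.0, 0.5, "leap forward and slam the ground"),
]
schedule = build_schedule(timeline)
prompt = "; ".join(f"{name}: {text}" for name, text in sorted(schedule[0].items()))
```

Feeding the resulting prompt for each slot to the model and inspecting whether each entity follows only its own instruction would be one way to run the check described above.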

Figures

Figures reproduced from arXiv: 2605.18601 by Fan Cheng, Huangji Wang, Jian Zhao, Qianyu Peng, Ruili Feng, Shangwen Zhu, Xiangrui Ke, Xinyu Cui, Yeying Jin, Zeqing Wang, Zhaohu Xing, Zhao Pu, Zhilei Shu, Zizhao Tong.

Figure 1: Demonstrations of Incantation’s cross-entity action transfer and multi-entity control in the game Elden Ring. (i) Two bosses, Margit and Crucible Knight, each possessing character-exclusive moves, are conditioned via natural language to perform each other’s actions, each executed by both its native character and the other: Light Blade Attack (Margit-exclusive, green rows) and Tail of the Crucible (Crucible…
Figure 2: Workflow of Incantation. Left: Incantation translates combatant keyboard inputs into natural language prompts and autoregressively generates video frames in a causal streaming manner. Right: Training proceeds in two stages: (1) Language-Conditioned Pretraining adapts the base model for per-frame text-driven generation; (2) ODE-Initialized Self-Forcing Distillation enables real-time streaming via ODE-based …
Figure 3: Demonstrations of fine-grained multi-entity action control of Incantation in KOF. Incantation precisely responds to rapid action inputs and successfully captures actions as brief as 0.25 s (e.g., Punching), demonstrating its fine-grained and responsive control capability.
Figure 4: Attention design. Bidirectional self-attention is retained over history frames to preserve the spatio-temporal priors of the pretrained base model. Action cross-attention is restricted exclusively to the noisy target frame, preventing temporal cross-contamination. Together, these two constraints improve per-frame controllability without degrading generation quality.
Figure 5: Qualitative comparison of Incantation against leading video generation models on Elden Ring. Seedance 2.0 [33] and Kling 3.0 [24] achieve high visual fidelity yet fail on fine-grained player–boss interactions; LongLive [45] partially captures multi-entity dynamics but loses action fidelity and visual coherence. Only Incantation delivers precise per-frame multi-entity action control with genuine interactive…
Figure 6: Elden Ring rollout from a continuous Margit session. We show 40 frames sampled from the generated stream starting at the 1-minute mark. The sequence illustrates long-horizon visual stability and fine-grained player–boss interaction in a complex 3D adversarial scene. Environment: Stormveil Castle bridge, overcast sky, cinematic combat. Agents: Player (Greatsword user) vs. Boss (Margit, the Fell Omen). Playe…
Figure 7: KOF rollout under the same architecture and training recipe. We show 30 frames from a KOF rollout. The sequence illustrates that the same per-entity language-conditioning recipe also supports visually distinct 2D fighting gameplay.
Figure 8: Annotation interface for the human evaluation of Action Control Accuracy (ACA). Each trial presents the annotators with a generated video clip alongside the per-entity target action label (here: KYO—Light kick; YURI—Blocking). We strip conditioning-variant identities (NL vs. Action-Index) and prompt-source labels before rating, which ensures a fully blinded evaluation. Each annotator rates each entity’s ac…
Figure 9: Stage 1 ablation: FVD vs. training steps. Companion to …
Original abstract

Modern interactive video world models have achieved impressive visual fidelity, yet lack fine-grained multi-entity control and cross-entity, cross-world generalization. We trace this gap to the action interface: standard control protocols (e.g. animation IDs, device inputs, scene-level captions) bind action semantics to specific entities or engines at design time. We propose natural language as the interface to unlock expressiveness that no prior interface can achieve, and we present Incantation, the first interactive video world model with per-latent-frame (0.25 s) natural-language conditioning that supports simultaneous multi-entity control and concept-level cross-entity transfer beyond any fixed rendering pipeline. We pair a pretrained bidirectional video backbone with frame-local text cross-attention, and enable real-time long-horizon streaming through ODE-initialized Self-Forcing distillation with a RoPE-decoupled sliding KV-cache. We surpass the Action-Index baseline on cross-entity transfer (89% vs. 43%) and out-of-vocabulary prompts (90% vs. 0%), and our 2-step student sustains 19.7 FPS at 480p with stable FVD over 2-hour rollouts. We further apply the same architecture and training recipe to The King of Fighters, changing only the per-entity action vocabulary slots. We have released a preview subset of the Incantation dataset at https://huggingface.co/datasets/zhush/incantation-elden-ring-scenes, containing manually collected Elden Ring player-boss combat clips with structured action-oriented metadata. Larger-scale Elden Ring and KOF data will be released with the full project.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript presents Incantation, an interactive video world model that uses natural language prompts for per-latent-frame (0.25 s) conditioning on a pretrained bidirectional video backbone via frame-local text cross-attention. It claims this interface enables simultaneous multi-entity control and concept-level cross-entity transfer beyond fixed action indices or rendering pipelines, demonstrated through superior performance on Elden Ring and King of Fighters scenes (89% cross-entity transfer vs. 43% baseline; 90% OOV vs. 0%). Additional contributions include ODE-initialized Self-Forcing distillation for real-time long-horizon streaming at 19.7 FPS with stable FVD over 2-hour rollouts, and release of a preview dataset subset.

Significance. If the empirical claims hold under rigorous controls, the work would represent a meaningful advance in video world models by replacing rigid action interfaces with expressive natural language, potentially enabling broader generalization and multi-entity interactions. Strengths include the dataset release for reproducibility, the engineering for real-time performance, and direct comparisons showing gains on transfer and OOV tasks. The significance is limited by the current lack of verification details that would confirm the gains arise from the NL interface rather than backbone statistics.

major comments (2)
  1. [§4] §4 (Experiments) and associated tables: the central claim of simultaneous independent multi-entity control and concept-level cross-entity transfer rests on aggregate metrics (89% transfer, 90% OOV) without reported dataset size, number of evaluation episodes, statistical significance tests, per-entity fidelity breakdowns, or controls for prompt difficulty. This leaves open whether the observed gains isolate the NL interface or reflect global scene statistics from the pretrained backbone.
  2. [§3.1] §3.1 (Architecture, frame-local text cross-attention): the conditioning mechanism applies text cross-attention to global frame features without entity-specific tokens, phrase-to-entity binding, attention masks, or visual grounding (e.g., segmentation). This design risks non-independent control where conditioning for one entity bleeds into others, undermining the claim that the NL interface itself enables independent multi-entity control beyond the backbone's implicit statistics.
minor comments (2)
  1. [Abstract and §2] The abstract and §2 mention 'per-latent-frame (0.25 s)' conditioning but do not specify the exact latent frame rate or how it aligns with the video backbone's temporal resolution; a brief equation or diagram would clarify.
  2. [Figures] Figure captions for rollout examples could include quantitative FVD values per sequence length to better support the 'stable over 2-hour rollouts' claim.
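
The relation requested in the first minor comment fits in one line. As a sketch, let $c_t$ be the temporal compression factor of the video autoencoder and $f_{\text{video}}$ the pixel-space frame rate. Neither value is stated on this page, so the numbers in the example are assumptions chosen only because they reproduce the reported 0.25 s.

```latex
\Delta t_{\text{latent}} = \frac{c_t}{f_{\text{video}}},
\qquad \text{e.g. } c_t = 4,\; f_{\text{video}} = 16\ \text{fps}
\;\Rightarrow\; \Delta t_{\text{latent}} = \frac{4}{16}\ \text{s} = 0.25\ \text{s}.
```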

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below with clarifications and indicate the revisions we will make to strengthen the empirical and architectural descriptions.

Point-by-point responses
  1. Referee: [§4] §4 (Experiments) and associated tables: the central claim of simultaneous independent multi-entity control and concept-level cross-entity transfer rests on aggregate metrics (89% transfer, 90% OOV) without reported dataset size, number of evaluation episodes, statistical significance tests, per-entity fidelity breakdowns, or controls for prompt difficulty. This leaves open whether the observed gains isolate the NL interface or reflect global scene statistics from the pretrained backbone.

    Authors: We agree that the current presentation of results would benefit from greater transparency on the evaluation protocol. In the revised manuscript we will report the exact size of the held-out evaluation set, the number of episodes per metric, and the outcomes of statistical significance tests (e.g., bootstrap confidence intervals or paired tests) for the 89% vs. 43% and 90% vs. 0% differences. We will also add per-entity fidelity tables and a breakdown of prompt difficulty (simple vs. compound instructions) to control for that variable. Because the Action-Index baseline uses the identical pretrained backbone and training distribution, the comparison already isolates the effect of the conditioning interface to a meaningful degree; we will make this point explicit in the revision. revision: yes

  2. Referee: [§3.1] §3.1 (Architecture, frame-local text cross-attention): the conditioning mechanism applies text cross-attention to global frame features without entity-specific tokens, phrase-to-entity binding, attention masks, or visual grounding (e.g., segmentation). This design risks non-independent control where conditioning for one entity bleeds into others, undermining the claim that the NL interface itself enables independent multi-entity control beyond the backbone's implicit statistics.

    Authors: The architecture deliberately applies frame-local cross-attention to the full set of visual tokens so that a single natural-language prompt can describe multiple entities without requiring explicit segmentation or per-entity tokens at inference time. The pretrained bidirectional backbone already encodes rich multi-entity scene structure; the text cross-attention therefore lets the language prompt modulate those existing representations rather than learning bindings from scratch. Our qualitative rollouts and the large gap versus the Action-Index baseline indicate that control remains largely independent in practice. We will expand §3.1 with a short discussion of implicit phrase-to-entity binding and will include attention-map visualizations in the supplement to illustrate separation of control signals. revision: partial
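
The bootstrap confidence intervals mentioned in the first response could be computed along the following lines. This is a sketch under a loud assumption: the paper does not report the number of evaluation trials, so 100 trials per system is a hypothetical figure used only to turn the 89% and 43% rates into binary outcomes. The two systems are resampled independently because no pairing information is given.

```python
import random


def bootstrap_diff_ci(a, b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for mean(a) - mean(b) over binary outcomes."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_resamples):
        resample_a = [rng.choice(a) for _ in a]   # resample with replacement
        resample_b = [rng.choice(b) for _ in b]
        diffs.append(sum(resample_a) / len(a) - sum(resample_b) / len(b))
    diffs.sort()
    lower = diffs[int(alpha / 2 * n_resamples)]
    upper = diffs[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Hypothetical trial counts: 100 per system is NOT reported in the paper.
nl_outcomes = [1] * 89 + [0] * 11        # 89% success (natural language)
index_outcomes = [1] * 43 + [0] * 57     # 43% success (Action-Index)

point_estimate = sum(nl_outcomes) / 100 - sum(index_outcomes) / 100
lower, upper = bootstrap_diff_ci(nl_outcomes, index_outcomes)
```

If the interval excludes zero at the real sample size, the gap is unlikely to be a sampling artifact; with far fewer trials than assumed here the interval would widen accordingly.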

Circularity Check

0 steps flagged

No significant circularity; claims rest on empirical comparisons

Full rationale

The paper describes an empirical architecture (pretrained bidirectional video backbone + frame-local text cross-attention + ODE-initialized Self-Forcing distillation) and reports direct performance numbers against external baselines (89% vs 43% cross-entity transfer, 90% vs 0% OOV prompts, 19.7 FPS). No equations, fitted parameters, or uniqueness theorems are presented that reduce by construction to the inputs or to self-citations. The central claims of multi-entity control and cross-entity transfer are supported by aggregate metrics on held-out rollouts rather than quantities defined in terms of the model's own conditioning variables. This is the normal case of a system paper whose results are falsifiable against stated baselines.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim depends on the transferability of a pretrained video backbone via added cross-attention and on the effectiveness of the distillation procedure for long-horizon stability; no explicit free parameters, axioms, or invented entities are detailed in the provided abstract.

pith-pipeline@v0.9.0 · 5870 in / 1162 out tokens · 52403 ms · 2026-05-20T10:53:33.815251+00:00 · methodology

Reference graph

Works this paper leans on

51 extracted references · 51 canonical work pages · 10 internal anchors

  1. [1]

    Anmol Agarwal, Pranay Meshram, Sumer Singh, Saurav Suman, Andrew Lapp, Shahbuland Matiana, Louis Castricato, and Spencer Frazier. COMBAT: Conditional world models for behavioral agent training. arXiv preprint arXiv:2603.00825, 2026

  2. [2]

    Eloi Alonso, Adam Jelley, Vincent Micheli, Anssi Kanervisto, Amos Storkey, Tim Pearce, and François Fleuret. Diffusion for world modeling: Visual details matter in Atari. In Advances in Neural Information Processing Systems (NeurIPS), 2024

  3. [3]

    Ali Baheri. Logic-guided vector fields for constrained generative modeling. arXiv preprint arXiv:2602.02009, 2026

  4. [4]

    Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. Qwen technical report. arXiv preprint arXiv:2309.16609, 2023

  5. [5]

    Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-VL technical report. arXiv preprint arXiv:2511.21631, 2025

  6. [6]

    Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram Voleti, Adam Letts, Varun Jampani, and Robin Rombach. Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127, 2023

  7. [7]

    Ollin Boer Bohan. Taehv: Tiny autoencoder for hunyuan video. https://github.com/madebyollin/taehv, 2025

  8. [8]

    Jake Bruce, Michael Dennis, Ashley Edwards, Jack Parker-Holder, Yuge Shi, Edward Hughes, Matthew Lai, Aditi Mavalankar, Richie Steigerwald, Chris Apps, et al. Genie: Generative interactive environments. In International Conference on Machine Learning (ICML), 2024

  9. [9]

    Haoxuan Che, Xuanhua He, Quande Liu, Cheng Jin, and Hao Chen. GameGen-X: Interactive open-world game video generation. In International Conference on Learning Representations (ICLR), 2025

  10. [10]

    Jacob K. Christopher, Michael Cardei, Jinhao Liang, and Ferdinando Fioretto. Neuro-symbolic generative diffusion models for physically grounded, robust, and safe generation. In Proceedings of the International Conference on Neuro-Symbolic Systems, volume 288 of Proceedings of Machine Learning Research, pages 188–213. PMLR, 2025

  11. [11]

    Decart AI and Etched AI. Oasis: A universe in a transformer. https://oasis-model.github.io/, 2024

  12. [12]

    Zicheng Duan, Jiatong Xia, Zeyu Zhang, Wenbo Zhang, Gengze Zhou, Chenhui Gou, Yefei He, Feng Chen, Xinyu Zhang, and Lingqiao Liu. LiveWorld: Simulating out-of-sight dynamics in generative video world models. arXiv preprint arXiv:2603.07145, 2026

  13. [13]

    Ruili Feng, Han Zhang, Zhantao Yang, Jie Xiao, Zhilei Shu, Zhiheng Liu, Andy Zheng, Yukun Huang, Yu Liu, and Hongyang Zhang. The matrix: Infinite-horizon world generation with real-time moving control. arXiv preprint arXiv:2412.03568, 2024

  14. [14]

    Artur d’Avila Garcez and Luis C. Lamb. Neural-symbolic learning and reasoning: A survey and interpretation. In Neuro-Symbolic Artificial Intelligence: The State of the Art, volume 342 of Frontiers in Artificial Intelligence and Applications, pages 1–51. IOS Press, 2022

  15. [15]

    Google DeepMind. Genie 3: A new frontier for world models. https://deepmind.google/blog/genie-3-a-new-frontier-for-world-models/, 2025. Google DeepMind Blog

  16. [16]

    Junliang Guo, Yang Ye, Tianyu He, Haoyu Wu, Yushu Jiang, Tim Pearce, and Jiang Bian. MineWorld: A real-time and open-source interactive world model on Minecraft. arXiv preprint arXiv:2504.08388, 2025

  17. [17]

    David Ha and Jürgen Schmidhuber. Recurrent world models facilitate policy evolution. In Advances in Neural Information Processing Systems (NeurIPS), 2018

  18. [18]

    Danijar Hafner, Jurgis Pasukonis, Jimmy Ba, and Timothy Lillicrap. Mastering diverse control tasks through world models. Nature, 640:647–653, 2025

  19. [19]

    Chi Han, Qifan Wang, Hao Peng, Wenhan Xiong, Yu Chen, Heng Ji, and Sinong Wang. LM-Infinite: Zero-shot extreme length generalization for large language models. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), pages 3991–4008, 2024

  20. [20]

    Xianglong He, Chunli Peng, Zexiang Liu, Boyang Wang, Yifan Zhang, Qi Cui, Fei Kang, Biao Jiang, Mengyin An, Yangyang Ren, Baixin Xu, Hao-Xiang Guo, Kaixiong Gong, Size Wu, Wei Li, Xuchen Song, Yang Liu, Yangguang Li, and Yahui Zhou. Matrix-game 2.0: An open-source real-time and streaming interactive world model. arXiv preprint arXiv:2508.13009, 2025

  21. [21]

    Jonathan Ho, Tim Salimans, Alexey Gritsenko, William Chan, Mohammad Norouzi, and David J. Fleet. Video diffusion models. In Advances in Neural Information Processing Systems (NeurIPS), 2022

  22. [22]

    Siqiao Huang, Jialong Wu, Qixing Zhou, Shangchen Miao, and Mingsheng Long. Vid2World: Crafting video diffusion models to interactive world models. In Advances in Neural Information Processing Systems (NeurIPS), 2025

  23. [23]

    Xun Huang, Zhengqi Li, Guande He, Mingyuan Zhou, and Eli Shechtman. Self forcing: Bridging the train-test gap in autoregressive video diffusion. In Advances in Neural Information Processing Systems (NeurIPS), 2025

  24. [24]

    Kuaishou Technology. Kling AI launches 3.0 model, ushering in an era where everyone can be a director. https://ir.kuaishou.com/news-releases/news-release-details/kling-ai-launches-30-model-ushering-era-where-everyone-can-be, February. Accessed: 2026-05-01

  25. [25]

    Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. Retrieval-augmented generation for knowledge-intensive NLP tasks. In Advances in Neural Information Processing Systems (NeurIPS), 2020

  26. [26]

    Wang Lin, Liyu Jia, Wentao Hu, Kaihang Pan, Zhongqi Yue, Wei Zhao, Jingyuan Chen, Fei Wu, and Hanwang Zhang. Reasoning physical video generation with diffusion timestep tokens via reinforcement learning. arXiv preprint arXiv:2504.15932, 2025

  27. [27]

    Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling. In International Conference on Learning Representations (ICLR), 2023

  28. [28]

    Jack Parker-Holder, Philip Ball, Jake Bruce, Vibhavari Dasagi, Kristian Holsheimer, Christos Kaplanis, Alexandre Moufarek, Guy Scully, Jeremy Shar, Jimmy Shi, et al. Genie 2: A large-scale foundation world model. https://deepmind.google/blog/genie-2-a-large-scale-foundation-world-model/, 2024. Google DeepMind Blog

  29. [29]

    Ryan Po, David Junhao Zhang, Amir Hertz, Gordon Wetzstein, Neal Wadhwa, and Nataniel Ruiz. MultiGen: Level-design for editable multiplayer worlds in diffusion game engines. arXiv preprint arXiv:2603.06679, 2026

  30. [30]

    Marc Rigter, Tarun Gupta, Agrin Hilmkil, and Chao Ma. AVID: Adapting video diffusion models to world models. In International Conference on Learning Representations (ICLR), 2025

  31. [31]

    Georgy Savva, Oscar Michel, Daohan Lu, Suppakit Waiwitlikhit, Timothy Meehan, Dhairya Mishra, Srivats Poddar, Jack Lu, and Saining Xie. Solaris: Building a multiplayer video world model in Minecraft. arXiv preprint arXiv:2602.22208, 2026

  32. [32]

    Davide Scassola, Sebastiano Saccani, Ginevra Carbone, and Luca Bortolussi. Zero-shot conditioning of score-based diffusion models by neuro-symbolic constraints. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 20302–20309, 2025

  33. [33]

    Team Seedance, De Chen, Liyang Chen, Xin Chen, Ying Chen, Zhuo Chen, Zhuowei Chen, Feng Cheng, Tianheng Cheng, Yufeng Cheng, Mojie Chi, Xuyan Chi, Jian Cong, Qinpeng Cui, Fei Ding, Qide Dong, Yujiao Du, Haojie Duanmu, Junliang Fan, Jiarui Fang, Jing Fang, Zetao Fang, Chengjian Feng, Yu Gao, Diandian Gu, Dong Guo, Hanzhong Guo, Qiushan Guo, Boyang Hao, Hon...

  34. [34]

    Hikaru Shindo, Quentin Delfosse, Devendra Singh Dhami, and Kristian Kersting. BlendRL: A framework for merging symbolic and neural policy learning. In International Conference on Learning Representations (ICLR), 2025

  35. [35]

    David Silver, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre, George van den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529:484–489, 2016

  36. [36]

    Wenqiang Sun, Haiyu Zhang, Haoyuan Wang, Junta Wu, Zehan Wang, Zhenwei Wang, Yunhong Wang, Jun Zhang, Tengfei Wang, and Chunchao Guo. WorldPlay: Towards long-term geometric consistency for real-time interactive world modeling. arXiv preprint arXiv:2512.14614, 2025

  37. [37]

    Junshu Tang, Jiacheng Liu, Jiaqi Li, Longhuang Wu, Haoyu Yang, Penghao Zhao, Siruis Gong, Xiang Yuan, Shuai Shao, Linfeng Zhang, and Qinglin Lu. Hunyuan-gamecraft-2: Instruction-following interactive game world model. arXiv preprint arXiv:2511.23429, 2025

  38. [38]

    Robbyant Team, Zelin Gao, Qiuyu Wang, Yanhong Zeng, Jiapeng Zhu, Ka Leong Cheng, Yixuan Li, Hanlin Wang, Yinghao Xu, Shuailei Ma, et al. Advancing open-source world models. arXiv preprint arXiv:2601.20540, 2026

  39. [39]

    Wan Team, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al. Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314, 2025

  40. [40]

    Dani Valevski, Yaniv Leviathan, Moab Arar, and Shlomi Fruchter. Diffusion models are real-time game engines. In International Conference on Learning Representations (ICLR), 2025

  41. [41]

    Oriol Vinyals, Igor Babuschkin, Wojciech M. Czarnecki, Michaël Mathieu, Andrew Dudzik, Junyoung Chung, David H. Choi, Richard Powell, Timo Ewalds, Petko Georgiev, et al. Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature, 575:350–354, 2019

  42. [42]

    Jason Weston, Sumit Chopra, and Antoine Bordes. Memory networks. arXiv preprint arXiv:1410.3916, 2014

  43. [43]

    Ruiqi Wu, Xuanhua He, Meng Cheng, Tianyu Yang, Yong Zhang, Zhuoliang Kang, Xunliang Cai, Xiaoming Wei, Chunle Guo, Chongyi Li, and Ming-Ming Cheng. Infinite-World: Scaling interactive world models to 1000-frame horizons via pose-free hierarchical memory. arXiv preprint arXiv:2602.02393, 2026

  44. [44]

    Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. Efficient streaming language models with attention sinks. In International Conference on Learning Representations (ICLR), 2024

  45. [45]

    Shuai Yang, Wei Huang, Ruihang Chu, Yicheng Xiao, Yuyang Zhao, Xianbang Wang, Muyang Li, Enze Xie, Yingcong Chen, Yao Lu, et al. LongLive: Real-time interactive long video generation. arXiv preprint arXiv:2509.22622, 2025

  46. [46]

    Tianwei Yin, Michaël Gharbi, Richard Zhang, Eli Shechtman, Frédo Durand, William T. Freeman, and Taesung Park. One-step diffusion with distribution matching distillation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 6613–6623, 2024

  47. [47]

    Jiwen Yu, Yiran Qin, Xintao Wang, Pengfei Wan, Di Zhang, and Xihui Liu. GameFactory: Creating new games with generative interactive videos. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 11590–11599, 2025

  48. [48]

    Yifan Zhang, Chunli Peng, Boyang Wang, Puyi Wang, Qingcheng Zhu, Fei Kang, Biao Jiang, Zedong Gao, Eric Li, Yang Liu, and Yahui Zhou. Matrix-game: Interactive world foundation model. arXiv preprint arXiv:2506.18701, 2025

  49. [49]

    Hongyu Zhao, Siyu Zhou, Haolin Yang, Zengyi Qin, and Tianyi Zhou. Neuro-symbolic synergy for interactive world modeling. arXiv preprint arXiv:2602.10480, 2026

  50. [50]

    Jiayi Zhu, Jianing Zhang, Yiying Yang, Wei Cheng, and Xiaoyun Yuan. ShareVerse: Multi-agent consistent video generation for shared world modeling. arXiv preprint arXiv:2603.02697, 2026

Table 3: Systematic comparison of interactive video world models. ✓ = supported, ✗ = not supported, ∼ = partial. Multi-entity requires independent and simultaneous control...