SCOPE: Simulating Cross-game Operations in Playable Environments for FPS World Models

Haoran Xu; Hao Tang; Hongfeng Lai; Jian Zhao; Kexu Cheng; Ling Shao; Ruili Feng; Shangwen Zhu; Yan Zhang; Yeying Jin

A per-pixel conditioning module added to video diffusion models separates localized weapon actions from global camera motion in FPS environments, allowing cross-game generalization without segmentation labels.

Reviewed by Pith at T0; open to challenge. T0 means a machine referee read the full paper against a public rubric.

T0 review · grok-4.3

2026-05-25 04:57 UTC pith:POI6NBKH

Load-bearing objection: SCOPE's core assumption that discrete FPS actions only affect local weapon pixels does not hold for real mechanics like distant projectile impacts.

arXiv 2605.23345 v1 · pith:POI6NBKH · submitted 2026-05-22 · cs.CV

Zizhao Tong, Hongfeng Lai, Zeqing Wang, Zhaohu Xing, Kexu Cheng, Haoran Xu, Zhao Pu, Shangwen Zhu, Ruili Feng, Jian Zhao, Yan Zhang, Hao Tang, Yeying Jin, Ling Shao

classification cs.CV

keywords FPS world models, video diffusion, action conditioning, cross-game transfer, spatial selectivity, playable environments, zero-shot generalization, scope separation

verification ladder: T0 review · T1 audit · T2 compute · T3 formal · T4 reserved

The pith

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to build interactive world models for first-person shooters that respond correctly to dense, overlapping control inputs at every frame. It starts from the observation that discrete actions like firing affect only the small region around the weapon while continuous movement governs the rest of the scene. SCOPE inserts a lightweight conditioning block into each transformer layer of a pretrained video diffusion model; the block flattens features into per-pixel time sequences so that each location decides its own action response from local visual evidence. A new dataset called CrossFPS supplies 69K aligned clips from seven different titles to train the model on general rather than title-specific patterns. The result is claimed to be zero-shot transfer to new scenes together with clean separation of in-scope and out-of-scope effects.

Core claim

SCOPE inserts a conditioning module into each transformer block of a pretrained video diffusion model. The module reshapes the feature map into per-pixel temporal sequences so every spatial position can compute its response to the incoming 10-DoF action vector from its own local visual content. This produces spatially selective generation in which discrete events remain confined to the weapon scope while continuous camera and movement signals update the stable surroundings. Trained on the CrossFPS multi-game dataset, the resulting model learns visual-to-action mappings that transfer to unseen titles and scenes.
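The reshaping step described above can be illustrated with a minimal sketch. It assumes features are indexed as [frame][row][column][channel] and uses plain Python lists; the function name `to_per_pixel_sequences` and the toy layout are illustrative assumptions, not the paper's implementation.

```python
from typing import List

Feature = List[float]


def to_per_pixel_sequences(video: List[List[List[Feature]]]) -> List[List[Feature]]:
    """Rearrange features indexed [t][y][x] into sequences indexed [pixel][t].

    The pixel index is y * width + x, so each output row is the temporal
    trajectory of one spatial location.
    """
    num_frames = len(video)
    height = len(video[0])
    width = len(video[0][0])
    sequences = []
    for y in range(height):
        for x in range(width):
            sequences.append([video[t][y][x] for t in range(num_frames)])
    return sequences


# Toy input (hypothetical): 3 frames, 2x2 grid, 1 channel.
# The stored value encodes its own position as 100*t + 10*y + x.
toy = [[[[100.0 * t + 10.0 * y + x] for x in range(2)] for y in range(2)] for t in range(3)]
seqs = to_per_pixel_sequences(toy)
print(len(seqs), len(seqs[0]))  # number of pixels, number of time steps
print(seqs[3])                  # trajectory of pixel (y=1, x=1)
```

Once each location owns its own time series, any per-sequence operation (attention, an MLP, a gate) acts on that location independently, which is the property the core claim relies on.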

What carries the argument

SCOPE conditioning module that reshapes video features into per-pixel temporal sequences inside each transformer block to compute local action responses.

Load-bearing premise

Discrete FPS actions affect only a localized region around the weapon while continuous movement signals affect the stable surroundings, so local visual content alone suffices to separate the two without any segmentation labels.

What would settle it

Apply a firing or reload action to a generated frame that contains no visible weapon. If the model still modifies only a small localized patch rather than the entire frame, the spatial-selectivity claim holds; a frame-wide change would count against it.
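One way such a check might be scored is sketched below: generate the same frame with and without the discrete action, then measure what fraction of pixels changed. The grayscale frame layout, the `threshold` value, and the function name `changed_fraction` are assumptions made for illustration; the page does not specify a protocol.

```python
from typing import List

Frame = List[List[float]]  # grayscale frame indexed [y][x]


def changed_fraction(base: Frame, acted: Frame, threshold: float = 0.05) -> float:
    """Fraction of pixels whose absolute difference exceeds `threshold`."""
    total = 0
    changed = 0
    for row_a, row_b in zip(base, acted):
        for a, b in zip(row_a, row_b):
            total += 1
            if abs(a - b) > threshold:
                changed += 1
    return changed / total


# Hypothetical 4x4 frames: the action alters only a 2x2 patch.
base = [[0.5] * 4 for _ in range(4)]
acted = [row[:] for row in base]
for y in (2, 3):
    for x in (1, 2):
        acted[y][x] = 0.9

print(changed_fraction(base, acted))  # 4 of 16 pixels differ
```

A small fraction would be consistent with localized edits, while a value near 1.0 would indicate a frame-wide perturbation. Where to draw the line between the two is a judgment call this sketch does not settle.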

If this is right

Zero-shot transfer of action responsiveness to completely unseen FPS scenes and titles.
Precise in-scope versus out-of-scope separation emerges without any segmentation supervision.
General visual-to-action mappings replace game-specific patterns across seven different titles.
Stable background generation remains intact while discrete events stay confined to the weapon region.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same per-pixel conditioning pattern could be tested on non-FPS interactive simulators such as driving or robotics environments that also mix localized and global controls.
Training cost might drop if the module allows reuse of a single video diffusion backbone across many different game genres.
Extending the approach to continuous rather than discrete actions would test whether the local-response assumption scales beyond weapon events.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit.

Desk Editor's Note

SCOPE's core assumption that discrete FPS actions only affect local weapon pixels does not hold for real mechanics like distant projectile impacts.

The paper's concrete additions are the SCOPE per-pixel temporal conditioning module inserted into a pretrained video diffusion model's transformer blocks and the CrossFPS dataset of 69K clips from seven games with aligned 10-DoF telemetry. The dataset is curated to reduce gameplay bias and is presented as the first multi-game FPS resource of its kind. The conditioning reshapes features so each pixel computes its response from its own local visual content and temporal sequence, which is a direct attempt to handle overlapping high-frequency controls without global action injection or segmentation labels. That technical choice is distinct from prior single-game or global-conditioning baselines. The work is aimed at people building interactive world models or simulators that need cross-game generalization. The dataset could be picked up by others working on similar problems even if the method is revised.

The central claim that the model learns general visual-to-action mappings for zero-shot transfer rests on the premise that discrete events like firing affect only the localized scope region while continuous signals handle the rest. This premise does not match typical FPS behavior, where firing also produces non-local changes at impact locations. The per-pixel separation therefore lacks a clear mechanism to attribute those distant effects correctly. The abstract states that experiments confirm responsiveness, scope separation, and generalization, but the absence of detailed ablations, error breakdowns, or quantitative support for handling non-local effects leaves the main result under-supported.

This is enough to send for peer review so referees can examine the full results and data, though the locality assumption needs direct testing.

Referee Report

2 major / 1 minor

Summary. The paper proposes SCOPE, which inserts a per-pixel temporal-sequence conditioning module into each transformer block of a pretrained video diffusion model to handle spatially selective FPS actions. Discrete actions (e.g., firing) are assumed to affect only a localized weapon-scope region while continuous signals govern the surroundings, enabling label-free separation of in-scope effects. The authors introduce the CrossFPS dataset (69K clips from 7 titles with 10-DoF telemetry) and claim the model learns general visual-to-action mappings that support strong action responsiveness, precise scope separation, and zero-shot transfer to unseen scenes.

Significance. If the central claims hold, the work would advance interactive world models by providing a mechanism for dense overlapping control signals without segmentation labels or game-specific training, with the new multi-game dataset as a concrete contribution for studying cross-title generalization.

major comments (2)

[Abstract, §3] Abstract and §3 (method): The load-bearing premise that discrete actions affect only localized weapon-scope pixels is stated without addressing counterexamples such as muzzle flash or distant projectile impacts, which would produce non-local visual changes and break the per-pixel attribution in the transformer blocks.
[Experiments] Experiments section: The abstract states that experiments confirm responsiveness, separation, and cross-game generalization, yet no quantitative metrics, baselines, ablation results, or error analysis are referenced; this prevents verification that the per-pixel conditioning actually isolates effects as claimed.

minor comments (1)

[§4] The dataset curation process to remove gameplay bias is mentioned but lacks detail on the exact filtering criteria or statistics per game.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and indicate planned revisions to improve clarity and rigor.

Referee: [Abstract, §3] Abstract and §3 (method): The load-bearing premise that discrete actions affect only localized weapon-scope pixels is stated without addressing counterexamples such as muzzle flash or distant projectile impacts, which would produce non-local visual changes and break the per-pixel attribution in the transformer blocks.

Authors: We agree this assumption merits explicit discussion. While muzzle flash remains localized to the weapon region, distant impacts are a valid counterexample that could violate per-pixel attribution. In the revised manuscript we will expand §3 to qualify the assumption, discuss these cases, and list them as a limitation of the current formulation.
Revision: yes

Referee: [Experiments] Experiments section: The abstract states that experiments confirm responsiveness, separation, and cross-game generalization, yet no quantitative metrics, baselines, ablation results, or error analysis are referenced; this prevents verification that the per-pixel conditioning actually isolates effects as claimed.

Authors: Section 4 already reports quantitative metrics for responsiveness (action-conditioned FID and prediction accuracy), scope separation (region-specific reconstruction error), cross-game zero-shot transfer, plus ablations and baselines. We will revise the abstract and §3 to cite these results explicitly so readers can locate the supporting evidence without ambiguity.
Revision: yes
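For readers unfamiliar with a region-specific reconstruction error, the sketch below shows one plausible form: mean squared error computed separately inside and outside a weapon-region mask. The mask is an evaluation-only annotation assumed here for illustration, and `region_mse` is a hypothetical name; the exact metric used in the paper is not given on this page.

```python
from typing import List, Tuple

Frame = List[List[float]]
Mask = List[List[bool]]  # True marks in-scope (weapon region) pixels


def region_mse(pred: Frame, target: Frame, mask: Mask) -> Tuple[float, float]:
    """Return (in_scope_mse, out_of_scope_mse) for one frame."""
    sums = {True: 0.0, False: 0.0}
    counts = {True: 0, False: 0}
    for pred_row, target_row, mask_row in zip(pred, target, mask):
        for p, t, m in zip(pred_row, target_row, mask_row):
            sums[m] += (p - t) ** 2
            counts[m] += 1
    # Guard against an empty region so the function never divides by zero.
    in_scope = sums[True] / counts[True] if counts[True] else 0.0
    out_scope = sums[False] / counts[False] if counts[False] else 0.0
    return in_scope, out_scope


# Hypothetical 2x2 example: the only error sits on the single in-scope pixel.
target = [[0.0, 0.0], [0.0, 0.0]]
pred = [[0.0, 0.0], [0.0, 2.0]]
mask = [[False, False], [False, True]]
print(region_mse(pred, target, mask))  # in-scope error 4.0, out-of-scope error 0.0
```

Reporting the two numbers side by side makes it visible whether an action disturbs the surroundings, which is the question the referee's second comment asks the experiments to answer.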

Circularity Check

0 steps flagged

No circularity; derivation is self-contained

The paper presents an architectural modification (per-pixel temporal conditioning in transformer blocks) motivated by an explicit observation about spatial selectivity of FPS actions, plus a new multi-game dataset (CrossFPS). No equations, fitted parameters, or predictions are shown to reduce to inputs by construction. No self-citations are invoked as load-bearing uniqueness theorems or ansatzes. Central claims rest on empirical results from the introduced dataset rather than tautological redefinitions or renamed known results. This matches the default case of an honest, non-circular contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no explicit free parameters, axioms, or invented entities; full text required for audit.

pith-pipeline@v0.9.0 · 5766 in / 1142 out tokens · 27741 ms · 2026-05-25T04:57:49.419424+00:00 · methodology

Original abstract

Interactive world models for first-person shooter (FPS) games must resolve high-frequency overlapping control signals at every frame without disrupting unaffected regions. Existing methods inject actions globally and train on single titles, failing under dense FPS inputs. We observe that FPS actions are spatially selective: discrete events such as firing or reloading affect only a localized region around the weapon (the scope), while continuous camera and movement signals govern stable surroundings. We propose SCOPE, which inserts a conditioning module into each transformer block of a pretrained video diffusion model. It reshapes features into per-pixel temporal sequences so that each position computes its action response from local visual content. This separates in-scope effects from out-of-scope generation without segmentation labels. We also introduce CrossFPS, the first multi-game FPS dataset with frame-aligned action telemetry. It comprises 69K clips from 7 titles with 10-DoF controller signals, curated to remove gameplay bias. The model learns general visual-to-action mappings rather than game-specific patterns, enabling zero-shot transfer to unseen scenes. Experiments confirm strong action responsiveness, precise scope separation, and effective cross-game generalization.

Figures

Figures reproduced from arXiv: 2605.23345 by Haoran Xu, Hao Tang, Hongfeng Lai, Jian Zhao, Kexu Cheng, Ling Shao, Ruili Feng, Shangwen Zhu, Yan Zhang, Yeying Jin, Zeqing Wang, Zhaohu Xing, Zhao Pu, Zizhao Tong.

**Figure 2.** SCOPE architecture. A SCOPE module is inserted into each DiT block. Discrete inputs use cross-attention with visual queries to confine effects to in-scope regions. Continuous inputs use MLP fusion and temporal self-attention for out-of-scope generation. Pathways combine via residual connections.
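The discrete pathway in the Figure 2 caption (visual queries attending to action tokens, combined through a residual connection) can be sketched in a few lines. Everything concrete here is an assumption for illustration: the single attention head, the two-dimensional features, the hand-picked keys and values, and in particular the zero-valued null token that lets a pixel opt out of the action. None of it is taken from the paper.

```python
import math
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def softmax(scores: List[float]) -> List[float]:
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(query: Vector, keys: List[Vector], values: List[Vector]) -> Vector:
    """Single-head scaled dot-product attention for one visual query."""
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, k) / scale for k in keys])
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]


def scope_update(feature: Vector, keys: List[Vector], values: List[Vector]) -> Vector:
    """Residual update: the feature plus its attended action response."""
    response = cross_attend(feature, keys, values)
    return [f + r for f, r in zip(feature, response)]


# Hypothetical tokens: index 0 is a null token (zero value), index 1 is "fire".
keys = [[0.0, 8.0], [8.0, 0.0]]
values = [[0.0, 0.0], [1.0, 1.0]]

weapon_pixel = [4.0, 0.0]      # feature aligned with the "fire" key
background_pixel = [0.0, 4.0]  # feature aligned with the null key

print(scope_update(weapon_pixel, keys, values))      # close to [5.0, 1.0]
print(scope_update(background_pixel, keys, values))  # close to [0.0, 4.0]
```

In this toy setup the weapon pixel absorbs nearly the full "fire" value while the background pixel is left almost unchanged, because its query matches the null token. Whether a trained model reaches that separation from local visual content alone is the empirical question the review raises.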

**Figure 3.** CrossFPS overview. Clip distribution across 7 FPS titles (69K total) with frame-aligned …

**Figure 4.** Qualitative comparison under high-frequency actions. Our method maintains out-of-scope …

**Figure 5.** Qualitative ablation. Left: without spatial selectivity, actions perturb the entire frame (…

**Figure 6.** Action controllability on unseen scenes. Left: single and multi-action execution with …

**Figure 7.** CrossFPS statistics. (a) Linear velocity distribution. (b) Angular velocity for yaw and pitch. …

**Figure 8.**

**Figure 9.** Call of Duty: Warzone example. The first frame (highlighted with a green box) is used for caption generation and as the image-to-video condition. The action input sequence shows a leftward camera rotation transitioning to forward movement with simultaneous fire and reload events. Caption: “A dark narrow stairwell inside a building in Caldera Capital City, with a wooden ladder leading upward through a dimly…

**Figure 10.** Xonotic example. The first frame (highlighted with a green box) is used for caption generation and as the image-to-video condition. The action input sequence shows leftward movement combined with forward camera motion, rightward sweep, and a diagonal turn. Caption: “A dark military-industrial interior room labeled ‘Computer Room’ with large metal panel walls featuring riveted circular patterns, grid-patte…

Forward citations

Cited by 5 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

VideoCoCo: Code-as-CoT for Physically-Consistent Video Generation via an Agentic Dual-Engine System
cs.CV · 2026-07 · conditional · novelty 6.0
Using executable Blender code as an intermediate simulation draft improves physical consistency in text-to-video generation, lifting OmniWeaving from 0.475 to 0.558 on PhyGenBench and from 52.18% to 77.88% on VBench-2.0.

StatePlay: State-Aware Game World Models for Mechanics-Consistent Generation
cs.CV · 2026-07 · conditional · novelty 6.0
Coupling explicit state prediction with MoT-style video generation raises mechanics fidelity of Street Fighter 3 rollouts by about 18.6% over stateless game world models.

From Pixels to States: Rethinking Interactive World Models as Game Engines
cs.CV · 2026-07 · conditional · novelty 5.0
Interactive world models are reorganized around the game-engine action-state-observation loop, and a 90-hour Black Myth: Wukong dataset with frame-aligned actions, ground-truth states, and observations is introduced.

PhysRAG: Enhancing Physics-Awareness in Video Generation via Retrieval-Augmented Generation
cs.CV · 2026-06 · unverdicted · novelty 5.0
PhysRAG curates 7K videos from WISA-80K, builds a physical video database, and injects knowledge via learnable queries into a diffusion model to reach SOTA visual quality and physical compliance on PhyGenBench and VBench.

HyBDM: Multi-Scale Hybrid Experts for Time Series Forecasting with Bidirectional Dependency Modeling
cs.LG · 2026-07 · conditional · novelty 3.0
HyBDM combines a Mamba-style global-pattern expert with a local window transformer and a learned router to forecast multivariate time series, reporting state-of-the-art results on six benchmarks.

Reference graph

Works this paper leans on

70 extracted references · 70 canonical work pages · cited by 5 Pith papers · 20 internal anchors

[1]

Halo: The master chief collection

343 Industries. Halo: The master chief collection. https://www.xbox.com/en-US/games/halo, 2014

work page 2014
[2]

Halo infinite.https://www.xbox.com/en-US/games/halo-infinite, 2021

343 Industries. Halo infinite.https://www.xbox.com/en-US/games/halo-infinite, 2021

work page 2021
[3]

Cosmos World Foundation Model Platform for Physical AI

Niket Agarwal, Arslan Ali, Maciej Bala, Yogesh Balaji, Erik Barker, Tiffany Cai, Prithvijit Chattopadhyay, Yongxin Chen, Yin Cui, Yifan Ding, et al. Cosmos world foundation model platform for physical ai.arXiv preprint arXiv:2501.03575, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[4]

Diffusion for world modeling: Visual details matter in atari.Advances in Neural Information Processing Systems, 37:58757–58791, 2024

Eloi Alonso, Adam Jelley, Vincent Micheli, Anssi Kanervisto, Amos Storkey, Tim Pearce, and François Fleuret. Diffusion for world modeling: Visual details matter in atari.Advances in Neural Information Processing Systems, 37:58757–58791, 2024

work page 2024
[5]

Philip J. Ball, Jakob Bauer, Frank Belletti, Bethanie Brownfield, Ariel Ephrat, Shlomi Fruchter, Agrim Gupta, Kristian Holsheimer, Aleksander Holynski, Jiri Hron, Christos Kaplanis, Marjorie Limont, Matt McGill, Yanko Oliveira, Jack Parker-Holder, Frank Perbet, Guy Scully, Jeremy Shar, Stephen Spencer, Omer Tov, Ruben Villegas, Emma Wang, Jessica Yung, Ci...

work page 2025
[6]

V-jepa: latent video prediction for visual representation learning (2024)

Adrien Bardes, Quentin Garrido, Jean Ponce, Xinlei Chen, Michael Rabbat, Yann LeCun, Mido Assran, and Nicolas Ballas. V-jepa: latent video prediction for visual representation learning (2024). InURL https://openreview. net/forum, 2024

work page 2024
[7]

Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets

Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram V oleti, Adam Letts, et al. Stable video diffusion: Scaling latent video diffusion models to large datasets.arXiv preprint arXiv:2311.15127, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[8]

Video generation models as world simulators.OpenAI Blog, 1(8):1, 2024

Tim Brooks, Bill Peebles, Connor Holmes, Will DePue, Yufei Guo, Leo Jing, David Schnurr, Joe Taylor, Troy Luhman, Eric Luhman, et al. Video generation models as world simulators.OpenAI Blog, 1(8):1, 2024

work page 2024
[9]

Genie: Generative interactive environments

Jake Bruce, Michael D Dennis, Ashley Edwards, Jack Parker-Holder, Yuge Shi, Edward Hughes, Matthew Lai, Aditi Mavalankar, Richie Steigerwald, Chris Apps, et al. Genie: Generative interactive environments. InForty-first International Conference on Machine Learning, 2024

work page 2024
[10]

Gamegen-x: Interactive open-world game video generation.arXiv preprint arXiv:2411.00769, 2024

Haoxuan Che, Xuanhua He, Quande Liu, Cheng Jin, and Hao Chen. Gamegen-x: Interactive open-world game video generation.arXiv preprint arXiv:2411.00769, 2024

work page arXiv 2024
[11]

Videocrafter2: Overcoming data limitations for high-quality video diffusion models

Haoxin Chen, Yong Zhang, Xiaodong Cun, Menghan Xia, Xintao Wang, Chao Weng, and Ying Shan. Videocrafter2: Overcoming data limitations for high-quality video diffusion models. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 7310–7320, 2024

work page 2024
[12]

PixArt-$\alpha$: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis

Junsong Chen, Jincheng Yu, Chongjian Ge, Lewei Yao, Enze Xie, Yue Wu, Zhongdao Wang, James Kwok, Ping Luo, Huchuan Lu, et al. Pixart- α: Fast training of diffusion transformer for photorealistic text-to-image synthesis.arXiv preprint arXiv:2310.00426, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[13]

Agentic World Modeling: Foundations, Capabilities, Laws, and Beyond

Meng Chu, Xuan Billy Zhang, Kevin Qinghong Lin, Lingdong Kong, Jize Zhang, Teng Tu, Weijian Ma, Ziqi Huang, Senqiao Yang, Wei Huang, et al. Agentic world modeling: Foundations, capabilities, laws, and beyond.arXiv preprint arXiv:2604.22748, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[14]

CUP Archive, 1967

Kenneth James Williams Craik.The nature of explanation, volume 445. CUP Archive, 1967

work page 1967
[15]

Oasis: A universe in a transformer.URL: https://oasis-model

Etched Decart, Quinn McIntyre, Spruce Campbell, Xinlei Chen, and Robert Wachen. Oasis: A universe in a transformer.URL: https://oasis-model. github. io, 2(3):6, 2024

work page 2024
[16]

Understanding world or predicting future? a comprehensive survey of world models.ACM Computing Surveys, 58(3):1–38, 2025

Jingtao Ding, Yunke Zhang, Yu Shang, Yuheng Zhang, Zefang Zong, Jie Feng, Yuan Yuan, Hongyuan Su, Nian Li, Nicholas Sukiennik, et al. Understanding world or predicting future? a comprehensive survey of world models.ACM Computing Surveys, 58(3):1–38, 2025

work page 2025
[17]

Worldscore: A unified evaluation benchmark for world generation

Haoyi Duan, Hong-Xing Yu, Sirui Chen, Li Fei-Fei, and Jiajun Wu. Worldscore: A unified evaluation benchmark for world generation. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 27713–27724, 2025

work page 2025
[18]

MineWorld: a Real-Time and Open-Source Interactive World Model on Minecraft

Junliang Guo, Yang Ye, Tianyu He, Haoyu Wu, Yushu Jiang, Tim Pearce, and Jiang Bian. Mineworld: a real-time and open-source interactive world model on minecraft.arXiv preprint arXiv:2504.08388, 2025

work page Pith review arXiv 2025
[19]

World Models

David Ha and Jürgen Schmidhuber. World models.arXiv preprint arXiv:1803.10122, 2(3):440, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[20]

Dream to control: Learning behaviors by latent imagination, 2019

Danijar Hafner, Timothy Lillicrap, Jimmy Ba, and Mohammad Norouzi. Dream to control: Learning behaviors by latent imagination, 2019

work page 2019
[21]

Mastering diverse control tasks through world models.Nature, 640(8059):647–653, 2025

Danijar Hafner, Jurgis Pasukonis, Jimmy Ba, and Timothy Lillicrap. Mastering diverse control tasks through world models.Nature, 640(8059):647–653, 2025

work page 2025
[22]

Denoising diffusion probabilistic models.Advances in neural information processing systems, 33:6840–6851, 2020

Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models.Advances in neural information processing systems, 33:6840–6851, 2020

work page 2020
[23]

Vbench: Comprehensive benchmark suite for video generative models

Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, et al. Vbench: Comprehensive benchmark suite for video generative models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21807–21818, 2024

work page 2024
[24]

Call of duty.https://www.callofduty.com, 2003

Infinity Ward. Call of duty.https://www.callofduty.com, 2003. 11

work page 2003
[25]

Call of duty: Modern warfare.https://www.callofduty.com/modernwarfare, 2019

Infinity Ward. Call of duty: Modern warfare.https://www.callofduty.com/modernwarfare, 2019

work page 2019
[26]

Call of duty: Warzone

Infinity Ward and Raven Software. Call of duty: Warzone. https://www.callofduty.com/warzone, 2020

work page 2020
[27]

Drivegan: Towards a controllable high-quality neural simulation

Seung Wook Kim, Jonah Philion, Antonio Torralba, and Sanja Fidler. Drivegan: Towards a controllable high-quality neural simulation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5820–5829, 2021

work page 2021
[28]

Learning to simulate dynamic environments with gamegan

Seung Wook Kim, Yuhao Zhou, Jonah Philion, Antonio Torralba, and Sanja Fidler. Learning to simulate dynamic environments with gamegan. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 1231–1240, 2020

work page 2020
[29]

HunyuanVideo: A Systematic Framework For Large Video Generative Models

Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, et al. Hunyuanvideo: A systematic framework for large video generative models.arXiv preprint arXiv:2412.03603, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[30]

Open-Sora Plan: Open-Source Large Video Generation Model

Bin Lin, Yunyang Ge, Xinhua Cheng, Zongjian Li, Bin Zhu, Shaodong Wang, Xianyi He, Yang Ye, Shenghai Yuan, Liuhan Chen, et al. Open-sora plan: Open-source large video generation model.arXiv preprint arXiv:2412.00131, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[31]

Flow Matching for Generative Modeling

Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling.arXiv preprint arXiv:2210.02747, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[32]

Evalcrafter: Benchmarking and evaluating large video generation models

Yaofang Liu, Xiaodong Cun, Xuebo Liu, Xintao Wang, Yong Zhang, Haoxin Chen, Yang Liu, Tieyong Zeng, Raymond Chan, and Ying Shan. Evalcrafter: Benchmarking and evaluating large video generation models. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 22139–22149, 2024

work page 2024
[33]

Beyond FVD: Enhanced Evaluation Metrics for Video Generation Quality

Ge Ya Luo, Gian Mario Favero, Zhi Hao Luo, Alexia Jolicoeur-Martineau, and Christopher Pal. Beyond fvd: Enhanced evaluation metrics for video generation quality.arXiv preprint arXiv:2410.05203, 2024

work page Pith review arXiv 2024
[34]

Magne, A

Loïc Magne, Anas Awadalla, Guanzhi Wang, Yinzhen Xu, Joshua Belofsky, Fengyuan Hu, Joohwan Kim, Ludwig Schmidt, Georgia Gkioxari, Jan Kautz, et al. Nitrogen: An open foundation model for generalist gaming agents.arXiv preprint arXiv:2601.02427, 2026

work page arXiv 2026
[35]

Driveworld: 4d pre-trained scene understanding via world models for autonomous driving

Chen Min, Dawei Zhao, Liang Xiao, Jian Zhao, Xinli Xu, Zheng Zhu, Lei Jin, Jianshu Li, Yulan Guo, Junliang Xing, et al. Driveworld: 4d pre-trained scene understanding via world models for autonomous driving. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 15522–15533, 2024

work page 2024
[36]

Worldcam: Interactive autoregressive 3d gaming worlds with camera pose as a unifying geometric representation.arXiv preprint arXiv:2603.16871,

Jisu Nam, Yicong Hong, Chun-Hao Paul Huang, Feng Liu, JoungBin Lee, Jiyoung Kim, Siyoon Jin, Yunsung Lee, Jaeyoon Jung, Suhwan Choi, et al. Worldcam: Interactive autoregressive 3d gaming worlds with camera pose as a unifying geometric representation.arXiv preprint arXiv:2603.16871, 2026

work page arXiv 2026
[37]

Introducing ChatGPT images 2.0

OpenAI. Introducing ChatGPT images 2.0. https://openai.com/index/ introducing-chatgpt-images-2-0/, 2026

work page 2026
[38]

Genie 2: A large-scale foundation world model.URL: https://deepmind

Jack Parker-Holder, Philip Ball, Jake Bruce, Vibhavari Dasagi, Kristian Holsheimer, Christos Kaplanis, Alexandre Moufarek, Guy Scully, Jeremy Shar, Jimmy Shi, et al. Genie 2: A large-scale foundation world model.URL: https://deepmind. google/discover/blog/genie-2-a-large-scale-foundation-world-model, 2, 2024

work page 2024
[39]

Scalable diffusion models with transformers

William Peebles and Saining Xie. Scalable diffusion models with transformers. InProceedings of the IEEE/CVF international conference on computer vision, pages 4195–4205, 2023

work page 2023
[40]

Sdxl: Improving latent diffusion models for high-resolution image synthesis, 2023

Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. Sdxl: Improving latent diffusion models for high-resolution image synthesis, 2023

work page 2023
[41]

High-resolution image synthesis with latent diffusion models

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022

work page 2022
[42]

Mastering atari, go, chess and shogi by planning with a learned model.Nature, 588(7839):604–609, 2020

Julian Schrittwieser, Ioannis Antonoglou, Thomas Hubert, Karen Simonyan, Laurent Sifre, Simon Schmitt, Arthur Guez, Edward Lockhart, Demis Hassabis, Thore Graepel, et al. Mastering atari, go, chess and shogi by planning with a learned model.Nature, 588(7839):604–609, 2020

work page 2020
[43]

arXiv preprint arXiv:2602.08971 , year=

Yu Shang, Zhuohang Li, Yiding Ma, Weikang Su, Xin Jin, Ziyou Wang, Lei Jin, Xin Zhang, Yinzhou Tang, Haisheng Su, et al. Worldarena: A unified benchmark for evaluating perception and functional utility of embodied world models.arXiv preprint arXiv:2602.08971, 2026. 12

work page arXiv 2026
[44]

Call of duty: Modern warfare iii

Sledgehammer Games. Call of duty: Modern warfare iii. https://www.callofduty.com/store/ games/modernwarfare3, 2023

work page 2023
[45]

Generative modeling by estimating gradients of the data distribution, 2019

Yang Song and Stefano Ermon. Generative modeling by estimating gradients of the data distribution, 2019

work page 2019
[46]

Score-based generative modeling through stochastic differential equations, 2020

Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations, 2020

work page 2020
[47]

WorldPlay: Towards Long-Term Geometric Consistency for Real-Time Interactive World Modeling

Wenqiang Sun, Haiyu Zhang, Haoyuan Wang, Junta Wu, Zehan Wang, Zhenwei Wang, Yunhong Wang, Jun Zhang, Tengfei Wang, and Chunchao Guo. Worldplay: Towards long-term geometric consistency for real-time interactive world modeling.arXiv preprint arXiv:2512.14614, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[48]

Dyna, an integrated architecture for learning, planning, and reacting.ACM Sigart Bulletin, 2(4):160–163, 1991

Richard S Sutton. Dyna, an integrated architecture for learning, planning, and reacting.ACM Sigart Bulletin, 2(4):160–163, 1991

[49]

Hunyuan-gamecraft-2: Instruction-following interactive game world model

Junshu Tang, Jiacheng Liu, Jiaqi Li, Longhuang Wu, Haoyu Yang, Penghao Zhao, Siruis Gong, Xiang Yuan, Shuai Shao, Linfeng Zhang, et al. Hunyuan-gamecraft-2: Instruction-following interactive game world model. arXiv preprint arXiv:2511.23429, 2025

[50]

Gemini: A Family of Highly Capable Multimodal Models

Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023

[51]

Advancing Open-source World Models

Robbyant Team, Zelin Gao, Qiuyu Wang, Yanhong Zeng, Jiapeng Zhu, Ka Leong Cheng, Yixuan Li, Hanlin Wang, Yinghao Xu, Shuailei Ma, et al. Advancing open-source world models. arXiv preprint arXiv:2601.20540, 2026

[52]

Xonotic

Team Xonotic. Xonotic. https://xonotic.org/, 2011

[53]

Towards Accurate Generative Models of Video: A New Metric & Challenges

Thomas Unterthiner, Sjoerd Van Steenkiste, Karol Kurach, Raphael Marinier, Marcin Michalski, and Sylvain Gelly. Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:1812.01717, 2018

[54]

Diffusion Models Are Real-Time Game Engines

Dani Valevski, Yaniv Leviathan, Moab Arar, and Shlomi Fruchter. Diffusion models are real-time game engines. arXiv preprint arXiv:2408.14837, 2024

[55]

Wan: Open and Advanced Large-Scale Video Generative Models

Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al. Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314, 2025

[56]

Wisa: World simulator assistant for physics-aware text-to-video generation

Jing Wang, Ao Ma, Ke Cao, Jun Zheng, Zhanjie Zhang, Jiasong Feng, Shanyuan Liu, Yuhang Ma, Bo Cheng, Dawei Leng, et al. Wisa: World simulator assistant for physics-aware text-to-video generation. arXiv preprint arXiv:2503.08153, 2025

[57]

Matrix-Game 3.0: Real-Time and Streaming Interactive World Model with Long-Horizon Memory

Zile Wang, Zexiang Liu, Jaixing Li, Kaichen Huang, Baixin Xu, Fei Kang, Mengyin An, Peiyu Wang, Biao Jiang, Yichen Wei, et al. Matrix-game 3.0: Real-time and streaming interactive world model with long-horizon memory. arXiv preprint arXiv:2604.08995, 2026

[58]

Daydreamer: World models for physical robot learning

Philipp Wu, Alejandro Escontrela, Danijar Hafner, Pieter Abbeel, and Ken Goldberg. Daydreamer: World models for physical robot learning. In Conference on robot learning, pages 2226–2240. PMLR, 2023

[59]

Worldmem: Long-term consistent world simulation with memory

Zeqi Xiao, Yushi Lan, Yifan Zhou, Wenqi Ouyang, Shuai Yang, Yanhong Zeng, and Xingang Pan. Worldmem: Long-term consistent world simulation with memory, 2025

[60]

Learning Interactive Real-World Simulators

Sherry Yang, Yilun Du, Kamyar Ghasemipour, Jonathan Tompson, Leslie Kaelbling, Dale Schuurmans, and Pieter Abbeel. Learning interactive real-world simulators. arXiv preprint arXiv:2310.06114, 2023

[61]

CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer

Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, et al. Cogvideox: Text-to-video diffusion models with an expert transformer. arXiv preprint arXiv:2408.06072, 2024

[62]

One-step diffusion with distribution matching distillation

Tianwei Yin, Michaël Gharbi, Richard Zhang, Eli Shechtman, Fredo Durand, William T Freeman, and Taesung Park. One-step diffusion with distribution matching distillation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6613–6623, 2024

[63]

Context as memory: Scene-consistent interactive long video generation with memory retrieval

Jiwen Yu, Jianhong Bai, Yiran Qin, Quande Liu, Xintao Wang, Pengfei Wan, Di Zhang, and Xihui Liu. Context as memory: Scene-consistent interactive long video generation with memory retrieval. In Proceedings of the SIGGRAPH Asia 2025 Conference Papers, pages 1–11, 2025

[64]

A survey of interactive generative video

Jiwen Yu, Yiran Qin, Haoxuan Che, Quande Liu, Xintao Wang, Pengfei Wan, Di Zhang, Kun Gai, Hao Chen, and Xihui Liu. A survey of interactive generative video. arXiv preprint arXiv:2504.21853, 2025

[65]

Gamefactory: Creating new games with generative interactive videos

Jiwen Yu, Yiran Qin, Xintao Wang, Pengfei Wan, Di Zhang, and Xihui Liu. Gamefactory: Creating new games with generative interactive videos. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 11590–11599, 2025

[66]

Vfimamba: Video frame interpolation with state space models

Guozhen Zhang, Chunxu Liu, Yutao Cui, Xiaotong Zhao, Kai Ma, and Limin Wang. Vfimamba: Video frame interpolation with state space models. Advances in Neural Information Processing Systems, 37:107225–107248, 2024

[67]

The unreasonable effectiveness of deep features as a perceptual metric

Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 586–595, 2018

[68]

Open-Sora: Democratizing Efficient Video Production for All

Zangwei Zheng, Xiangyu Peng, Tianji Yang, Chenhui Shen, Shenggui Li, Hongxin Liu, Yukun Zhou, Tianyi Li, and Yang You. Open-sora: Democratizing efficient video production for all. arXiv preprint arXiv:2412.20404, 2024

[69]

SANA-WM: Efficient Minute-Scale World Modeling with Hybrid Linear Diffusion Transformer

Haoyi Zhu, Haozhe Liu, Yuyang Zhao, Tian Ye, Junsong Chen, Jincheng Yu, Tong He, Song Han, and Enze Xie. Sana-wm: Efficient minute-scale world modeling with hybrid linear diffusion transformer. arXiv preprint arXiv:2605.15178, 2026

[70]

Causal Forcing: Autoregressive Diffusion Distillation Done Right for High-Quality Real-Time Interactive Video Generation

Hongzhou Zhu, Min Zhao, Guande He, Hang Su, Chongxuan Li, and Jun Zhu. Causal forcing: Autoregressive diffusion distillation done right for high-quality real-time interactive video generation. arXiv preprint arXiv:2602.02214, 2026

A CrossFPS Dataset Details

This appendix provides complete details on the CrossFPS dataset, organized as follows: Section A....


