NeuWorld uses a transformer VAE to learn compact Neural Implicit Scenes from sparse posed frames and a diffusion transformer to evolve them conditioned on camera trajectories for consistent interactive exploration.
Matrix-Game 2.0: An open-source real-time and streaming interactive world model
Canonical reference. 91% of citing Pith papers cite this work as background.
abstract
Recent advances in interactive video generation have demonstrated the potential of diffusion models as world models that capture complex physical dynamics and interactive behaviors. However, existing interactive world models depend on bidirectional attention and lengthy inference steps, severely limiting real-time performance. Consequently, they struggle to simulate real-world dynamics, where outcomes must update instantaneously based on historical context and current actions. To address this, we present Matrix-Game 2.0, an interactive world model that generates long videos on-the-fly via few-step auto-regressive diffusion. Our framework consists of three key components: (1) a scalable data production pipeline for Unreal Engine and GTA5 environments that produces massive amounts (about 1200 hours) of video data with diverse interaction annotations; (2) an action injection module that enables frame-level mouse and keyboard inputs as interactive conditions; (3) a few-step distillation based on the causal architecture for real-time and streaming video generation. Matrix-Game 2.0 can generate high-quality minute-level videos across diverse scenes at an ultra-fast speed of 25 FPS. We open-source our model weights and codebase to advance research in interactive world modeling.
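The streaming loop the abstract describes (few denoising steps per frame, conditioning only on past frames and the current action) can be illustrated with a toy sketch. This is not the Matrix-Game 2.0 implementation: `toy_denoiser`, `stream_frames`, the vector "frames", and every parameter are hypothetical stand-ins chosen so the example runs with the Python standard library alone.

```python
import random
from collections import deque


def toy_denoiser(noisy, context, action, step, num_steps):
    """Hypothetical stand-in for a learned causal denoiser (assumption).

    Toy dynamics: the clean frame is the previous frame shifted by the action.
    """
    prev = context[-1] if context else [0.0] * len(noisy)
    target = [p + action for p in prev]
    # Blend the noisy sample toward the target; the last step lands on it.
    alpha = (step + 1) / num_steps
    return [(1 - alpha) * n + alpha * t for n, t in zip(noisy, target)]


def stream_frames(actions, frame_dim=4, num_steps=3, context_len=8, seed=0):
    """Yield one frame per action, as soon as that frame is denoised."""
    rng = random.Random(seed)
    # Bounded history of past frames only: no access to future frames.
    context = deque(maxlen=context_len)
    for action in actions:
        # Each frame starts from fresh Gaussian noise.
        x = [rng.gauss(0.0, 1.0) for _ in range(frame_dim)]
        # Few-step refinement, conditioned on history and the current action.
        for step in range(num_steps):
            x = toy_denoiser(x, list(context), action, step, num_steps)
        context.append(x)
        yield x


# At 25 FPS the average time available per frame is 1000 / 25 = 40 ms.
frame_budget_ms = 1000 / 25
frames = list(stream_frames([1.0, 0.0, -1.0]))
```

Because each frame depends only on earlier frames and the action supplied at that step, frames can be emitted as a stream while the user keeps providing input; a bidirectional model would need the whole clip before producing any of it.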
representative citing papers
SPAWN enables training-free insertion of custom visual concepts into autoregressive world models by swapping the pinned context-memory anchor over a short injection window.
MBench is a new benchmark that quantifies long-term memory in video world models via three hierarchical consistency dimensions evaluated on curated real videos.
WBench is a benchmark with 289 test cases and 1,058 turns for evaluating interactive world models using 22 automated metrics validated against human judgments.
Incantation is the first video world model to use per-frame natural language conditioning for simultaneous multi-entity control and concept-level cross-entity transfer in interactive video generation.
M²-REPA decouples modality-specific features from diffusion intermediates and aligns them to complementary expert foundation models via a multi-modal alignment loss and modality-specific decoupling regularization for improved multimodal video generation.
WorldMark is the first public benchmark that standardizes scenes, trajectories, and control interfaces across heterogeneous interactive image-to-video world models.
A survey that groups efficient video diffusion methods into four paradigms—step distillation, efficient attention, model compression, and cache/trajectory optimization—and outlines open challenges for practical use.
MoRight disentangles object and camera motion via canonical-view specification and temporal cross-view attention, while decomposing motion into active user-driven and passive consequence components to learn and apply causality in video generation.
One-to-All Animation enables alignment-free character animation and image pose transfer via self-supervised outpainting reformulation, reference extraction, hybrid fusion attention, identity-robust pose control, and token replacement for long videos.
Dreamer 4 is the first agent to obtain diamonds in Minecraft from only offline data by reinforcement learning inside a scalable world model that accurately predicts game mechanics.
RoboWorld introduces an automated pipeline using autoregressive video world models and task-progress VLM scoring, plus Step Forcing for long-horizon stability, to achieve high correlation with real robot policy evaluation.
PermaVid disentangles spatial context into semantic appearance and geometric structure via multi-modal memory banks and edit-aware updates to maintain long-term consistency in video generation after edits.
WEAVER is a multi-view world model using flow-matching that jointly satisfies fidelity, consistency, and efficiency for robotic manipulation, yielding 0.87 correlation with real success and policy gains on hardware.
MoVerse generates real-time interactive video world models from single narrow-FOV images via panoramic diffusion expansion, Gaussian scaffold lifting, and distillation of a bidirectional diffusion teacher into a causal autoregressive renderer.
Prisma-World is a diffusion-based multi-agent video model that uses joint full-attention, multi-agent RoPE, and relative camera geometry injection plus curriculum training to produce consistent cross-view videos from flexible agent counts.
DisCo uses discrete action primitives for camera control in video world models to achieve more reliable action following than continuous trajectories.
StreamForce presents a unified causal model for force-controllable streaming video generation using a new force representation and distillation pipeline, claiming SOTA force adherence and 16.6 FPS performance.
WorldFly integrates a world model into a VLA framework via dual-branch coupled flow matching to jointly generate future videos and actions, outperforming baselines on an urban canyon traversal benchmark especially in unseen environments.
GIM-World adds a camera-queryable geometry distillation head and pruning rule to implicit memory in video world models, claiming better long-horizon geometric consistency on the MIND benchmark than explicit and implicit baselines.
Robust Dreamer uses Latent Gaussian Memory anchored to diffusion latents and Deviation Learning with a Dynamic Deviation Archive to reduce drift in long-horizon action-controlled image-to-video generation, reporting SOTA results on ScanNet, DL3DV, and OmniWorldGame.
minWM supplies an end-to-end pipeline that fine-tunes bidirectional T2V/TI2V models with camera control then distills them via Causal Forcing into few-step autoregressive generators for low-latency rollout.
A multi-agent video world model using simplex rotary agent encoding and sparse hub attention achieves better fidelity, controllability, and consistency than baselines while generalizing from 2 to 4 players.
DexSIM is a bi-directional video diffusion model with hand trajectory embedding and spatial memory cache for real-time dexterous hand-object simulation at 15 FPS.