pith. machine review for the scientific record.

hub Canonical reference

Matrix-game 2.0: An open-source real-time and streaming interactive world model

Canonical reference. 83% of citing Pith papers cite this work as background.

19 Pith papers citing it
Background 83% of classified citations
abstract

Recent advances in interactive video generation have demonstrated diffusion models' potential as world models by capturing complex physical dynamics and interactive behaviors. However, existing interactive world models depend on bidirectional attention and lengthy inference steps, severely limiting real-time performance. Consequently, they struggle to simulate real-world dynamics, where outcomes must update instantaneously based on historical context and current actions. To address this, we present Matrix-Game 2.0, an interactive world model that generates long videos on the fly via few-step auto-regressive diffusion. Our framework consists of three key components: (1) a scalable data production pipeline for Unreal Engine and GTA5 environments that efficiently produces massive amounts (about 1200 hours) of video data with diverse interaction annotations; (2) an action injection module that accepts frame-level mouse and keyboard inputs as interactive conditions; (3) few-step distillation based on a causal architecture for real-time, streaming video generation. Matrix-Game 2.0 can generate high-quality minute-level videos across diverse scenes at an ultra-fast 25 FPS. We open-source our model weights and codebase to advance research in interactive world modeling.
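The abstract's generation recipe — each frame conditioned causally on history plus the current frame-level keyboard/mouse action, then refined in a few denoising steps — can be sketched as a toy loop. All names here (`embed_action`, `denoise_step`, the shapes, the update rule) are illustrative assumptions for exposition, not the paper's actual model or API.

```python
import numpy as np

rng = np.random.default_rng(0)

def embed_action(keys, mouse_dxdy, dim=8):
    """Map one frame's keyboard/mouse input to a conditioning vector.

    keys: set of pressed key names; mouse_dxdy: (dx, dy) cursor delta.
    Hypothetical encoding chosen for illustration only.
    """
    vec = np.zeros(dim)
    for i, k in enumerate(("W", "A", "S", "D")):
        vec[i] = 1.0 if k in keys else 0.0
    vec[4:6] = mouse_dxdy
    return vec

def denoise_step(frame, action_vec, history):
    """Stand-in for one denoiser pass: a causal update that sees only
    past frames (never future ones) plus the current action signal."""
    context = history[-1] if history else np.zeros_like(frame)
    return 0.9 * frame + 0.05 * context + 0.05 * action_vec.mean()

def generate_stream(num_frames, actions, steps=4, frame_shape=(4, 4)):
    """Few-step autoregressive rollout: each frame starts from noise
    and is refined in a handful of steps, so frames can be emitted
    on the fly instead of after a long bidirectional denoising pass."""
    history = []
    for t in range(num_frames):
        frame = rng.standard_normal(frame_shape)
        a = embed_action(*actions[t])
        for _ in range(steps):
            frame = denoise_step(frame, a, history)
        history.append(frame)
    return history

# Three frames of user input: walk forward, turn while walking, idle.
actions = [({"W"}, (0.1, 0.0)), ({"W", "D"}, (0.3, -0.1)), (set(), (0.0, 0.0))]
frames = generate_stream(len(actions), actions)
print(len(frames), frames[0].shape)
```

The point of the sketch is the control flow, not the arithmetic: because each frame depends only on `history` and its own action, latency per frame is a few cheap steps, which is what makes streaming at interactive frame rates plausible.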


citation-role summary

background: 5 · baseline: 1

citation-polarity summary

years

2026: 16 · 2025: 3

representative citing papers

Efficient Video Diffusion Models: Advancements and Challenges

cs.CV · 2026-04-17 · unverdicted · novelty 7.0

A survey that groups efficient video diffusion methods into four paradigms—step distillation, efficient attention, model compression, and cache/trajectory optimization—and outlines open challenges for practical use.

MoRight: Motion Control Done Right

cs.CV · 2026-04-08 · unverdicted · novelty 7.0

MoRight disentangles object and camera motion via canonical-view specification and temporal cross-view attention, while decomposing motion into active user-driven and passive consequence components to learn and apply causality in video generation.

Training Agents Inside of Scalable World Models

cs.AI · 2025-09-29 · conditional · novelty 7.0

Dreamer 4 is the first agent to obtain diamonds in Minecraft from only offline data by reinforcement learning inside a scalable world model that accurately predicts game mechanics.

Lyra 2.0: Explorable Generative 3D Worlds

cs.CV · 2026-04-14 · unverdicted · novelty 6.0

Lyra 2.0 produces persistent 3D-consistent video sequences for large explorable worlds by using per-frame geometry for information routing and self-augmented training to correct temporal drift.

UNICA: A Unified Neural Framework for Controllable 3D Avatars

cs.CV · 2026-04-03 · unverdicted · novelty 6.0

UNICA unifies motion planning, rigging, physical simulation, and rendering into a single skeleton-free neural framework that produces next-frame 3D avatar geometry from action inputs and renders it with Gaussian splatting.

Neural Computers

cs.LG · 2026-04-07 · unverdicted · novelty 5.0

Neural Computers are introduced as a new machine form where computation, memory, and I/O are unified in a learned runtime state, with initial video-model experiments showing acquisition of basic interface primitives from traces.

Advancing Open-source World Models

cs.CV · 2026-01-28 · unverdicted · novelty 4.0

LingBot-World is presented as an open-source world model that delivers high-fidelity simulation, minute-level contextual consistency, and real-time interactivity with sub-second latency.

citing papers explorer

Showing 19 of 19 citing papers.

  • WorldMark: A Unified Benchmark Suite for Interactive Video World Models cs.CV · 2026-04-23 · unverdicted · none · ref 15 · internal anchor

    WorldMark is the first public benchmark that standardizes scenes, trajectories, and control interfaces across heterogeneous interactive image-to-video world models.

  • Efficient Video Diffusion Models: Advancements and Challenges cs.CV · 2026-04-17 · unverdicted · none · ref 290 · internal anchor

    A survey that groups efficient video diffusion methods into four paradigms—step distillation, efficient attention, model compression, and cache/trajectory optimization—and outlines open challenges for practical use.

  • MoRight: Motion Control Done Right cs.CV · 2026-04-08 · unverdicted · none · ref 32 · internal anchor

    MoRight disentangles object and camera motion via canonical-view specification and temporal cross-view attention, while decomposing motion into active user-driven and passive consequence components to learn and apply causality in video generation.

  • Training Agents Inside of Scalable World Models cs.AI · 2025-09-29 · conditional · none · ref 9 · internal anchor

    Dreamer 4 is the first agent to obtain diamonds in Minecraft from only offline data by reinforcement learning inside a scalable world model that accurately predicts game mechanics.

  • Pyramid Forcing: Head-Aware Pyramid KV Cache Policy for High-Quality Long Video Generation cs.CV · 2026-05-13 · unverdicted · none · ref 7 · internal anchor

    Pyramid Forcing classifies attention heads into Anchor, Wave, and Veil types and applies type-specific KV cache policies to improve long-horizon autoregressive video generation quality.

  • Divide and Conquer: Decoupled Representation Alignment for Multimodal World Models cs.CV · 2026-05-03 · unverdicted · none · ref 12 · internal anchor

    M²-REPA decouples modality-specific features inside a diffusion model and aligns each to its matching expert foundation model via an alignment loss plus a decoupling regularizer, yielding better visual quality and long-term consistency in multi-modal video generation.

  • Memorize When Needed: Decoupled Memory Control for Spatially Consistent Long-Horizon Video Generation cs.CV · 2026-04-20 · unverdicted · none · ref 17 · internal anchor

    A decoupled memory branch with hybrid cues, cross-attention, and gating improves spatial consistency and data efficiency in long-horizon camera-trajectory video generation.

  • Lyra 2.0: Explorable Generative 3D Worlds cs.CV · 2026-04-14 · unverdicted · none · ref 28 · internal anchor

    Lyra 2.0 produces persistent 3D-consistent video sequences for large explorable worlds by using per-frame geometry for information routing and self-augmented training to correct temporal drift.

  • INSPATIO-WORLD: A Real-Time 4D World Simulator via Spatiotemporal Autoregressive Modeling cs.CV · 2026-04-08 · unverdicted · none · ref 32 · internal anchor

    INSPATIO-WORLD is a real-time framework for high-fidelity 4D scene generation and navigation from monocular videos via STAR architecture with implicit caching, explicit geometric constraints, and distribution-matching distillation.

  • UNICA: A Unified Neural Framework for Controllable 3D Avatars cs.CV · 2026-04-03 · unverdicted · none · ref 21 · internal anchor

    UNICA unifies motion planning, rigging, physical simulation, and rendering into a single skeleton-free neural framework that produces next-frame 3D avatar geometry from action inputs and renders it with Gaussian splatting.

  • Video Generation Models as World Models: Efficient Paradigms, Architectures and Algorithms eess.IV · 2026-03-30 · unverdicted · none · ref 42 · internal anchor

    Video generation models can function as world simulators if efficiency gaps in spatiotemporal modeling are bridged via organized paradigms, architectures, and algorithms.

  • Quant VideoGen: Auto-Regressive Long Video Generation via 2-Bit KV-Cache Quantization cs.LG · 2026-02-03 · unverdicted · none · ref 3 · internal anchor

Quant VideoGen reduces KV-cache memory by up to 7× in autoregressive video diffusion models via semantic-aware smoothing and progressive residual quantization, achieving better quality than baselines with under 4% latency overhead.

  • WorldPlay: Towards Long-Term Geometric Consistency for Real-Time Interactive World Modeling cs.CV · 2025-12-16 · unverdicted · none · ref 17 · internal anchor

    WorldPlay uses dual action representation, reconstituted context memory, and context forcing distillation to produce consistent 720p streaming video at 24 FPS for interactive world modeling.

  • Self-Forcing++: Towards Minute-Scale High-Quality Video Generation cs.CV · 2025-10-02 · conditional · none · ref 15 · internal anchor

    Self-Forcing++ scales autoregressive video diffusion to over 4 minutes by using self-generated segments for guidance, reducing error accumulation and outperforming baselines in fidelity and consistency.

  • Towards Generalist Game Players: An Investigation of Foundation Models in the Game Multiverse cs.CV · 2026-05-11 · unverdicted · none · ref 67 · 2 links · internal anchor

    The paper organizes research on generalist game AI into Dataset, Model, Harness, and Benchmark pillars and charts a five-level progression from single-game mastery to agents that create and live inside game multiverses.

  • Neural Computers cs.LG · 2026-04-07 · unverdicted · none · ref 15 · internal anchor

    Neural Computers are introduced as a new machine form where computation, memory, and I/O are unified in a learned runtime state, with initial video-model experiments showing acquisition of basic interface primitives from traces.

  • Matrix-Game 3.0: Real-Time and Streaming Interactive World Model with Long-Horizon Memory cs.CV · 2026-04-10 · unverdicted · none · ref 15 · internal anchor

    Matrix-Game 3.0 delivers 720p real-time video generation at 40 FPS with minute-scale memory consistency by combining residual self-correction training, camera-aware memory injection, and DMD-based autoregressive distillation on a 5B model.

  • OpenWorldLib: A Unified Codebase and Definition of Advanced World Models cs.CV · 2026-04-06 · unverdicted · none · ref 43 · internal anchor

    OpenWorldLib offers a standardized codebase and definition for world models that combine perception, interaction, and memory to understand and predict the world.

  • Advancing Open-source World Models cs.CV · 2026-01-28 · unverdicted · none · ref 27 · internal anchor

LingBot-World is presented as an open-source world model that delivers high-fidelity simulation, minute-level contextual consistency, and real-time interactivity with sub-second latency.