A survey that groups efficient video diffusion methods into four paradigms—step distillation, efficient attention, model compression, and cache/trajectory optimization—and outlines open challenges for practical use.
Yan: Foundational interactive video generation
11 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
roles
background 2polarities
background 2representative citing papers
DreamDojo is a foundation world model pretrained on the largest human video dataset to date that uses continuous latent actions to transfer interaction knowledge and achieves controllable physics simulation after robot post-training.
Dreamer 4 is the first agent to obtain diamonds in Minecraft from only offline data by reinforcement learning inside a scalable world model that accurately predicts game mechanics.
DisCo uses discrete action primitives for camera control in video world models to achieve more reliable action following than continuous trajectories.
StreamForce presents a unified causal model for force-controllable streaming video generation using a new force representation and distillation pipeline, claiming SOTA force adherence and 16.6 FPS performance.
minWM supplies an end-to-end pipeline that fine-tunes bidirectional T2V/TI2V models with camera control then distills them via Causal Forcing into few-step autoregressive generators for low-latency rollout.
WorldKV enables persistent world memory in autoregressive video diffusion models by selectively retrieving and compressing KV-cache chunks, matching full-cache fidelity at roughly twice the throughput without training.
AstraNav-World unifies diffusion video generation and vision-language action planning in a single bidirectional model that improves trajectory accuracy, success rates, and zero-shot real-world adaptation in embodied navigation.
AnchorWorld proposes a simulation framework that adds exogenous viewpoint supervision for full-body grounding and anchor-view text customization for dynamic world evolution in egocentric settings.
DecMem proposes a decoupled memory system using sparse global and anchored local components to enable consistent minute-long controllable video generation in world models.
This survey reviews trends, challenges, benchmarks, and future directions in action-conditioned interactive world modeling for video and 3D generation.
citing papers explorer
-
Efficient Video Diffusion Models: Advancements and Challenges
A survey that groups efficient video diffusion methods into four paradigms—step distillation, efficient attention, model compression, and cache/trajectory optimization—and outlines open challenges for practical use.
-
DreamDojo: A Generalist Robot World Model from Large-Scale Human Videos
DreamDojo is a foundation world model pretrained on the largest human video dataset to date that uses continuous latent actions to transfer interaction knowledge and achieves controllable physics simulation after robot post-training.
-
DisCo: World Models with Discrete Camera Motion Control
DisCo uses discrete action primitives for camera control in video world models to achieve more reliable action following than continuous trajectories.
-
Streaming Video Generation with Streaming Force Control
StreamForce presents a unified causal model for force-controllable streaming video generation using a new force representation and distillation pipeline, claiming SOTA force adherence and 16.6 FPS performance.
-
minWM: A Full-Stack Open-Source Framework for Real-Time Interactive Video World Models
minWM supplies an end-to-end pipeline that fine-tunes bidirectional T2V/TI2V models with camera control then distills them via Causal Forcing into few-step autoregressive generators for low-latency rollout.
-
WorldKV: Efficient World Memory with World Retrieval and Compression
WorldKV enables persistent world memory in autoregressive video diffusion models by selectively retrieving and compressing KV-cache chunks, matching full-cache fidelity at roughly twice the throughput without training.
-
AstraNav-World: World Model for Foresight Control and Consistency
AstraNav-World unifies diffusion video generation and vision-language action planning in a single bidirectional model that improves trajectory accuracy, success rates, and zero-shot real-world adaptation in embodied navigation.
-
AnchorWorld: Embodied Egocentric World Simulation with View-based Evolution Customization
AnchorWorld proposes a simulation framework that adds exogenous viewpoint supervision for full-body grounding and anchor-view text customization for dynamic world evolution in egocentric settings.
-
DecMem: Towards Minute-Long Consistent World Generation with Decoupled Memory
DecMem proposes a decoupled memory system using sparse global and anchored local components to enable consistent minute-long controllable video generation in world models.
-
Towards Interactive Video World Modeling: Frontiers, Challenges, Benchmarks, and Future Trends
This survey reviews trends, challenges, benchmarks, and future directions in action-conditioned interactive world modeling for video and 3D generation.