hub Canonical reference

Relic: Interactive video world model with long-horizon memory.arXiv preprint arXiv:2512.04040

Yicong Hong, Yiqun Mei, Chongjian Ge, Yiran Xu, Yang Zhou, Sai Bi, Yannick Hold-Geoffroy, Mike Roberts, Matthew Fisher, Eli Shechtman, Kalyan Sunkavalli, Feng Liu, Zhengqi Li, Hao Tan · 2025 · arXiv 2512.04040

Canonical reference. 91% of citing Pith papers cite this work as background.

36 Pith papers citing it

Background 91% of classified citations

read on arXiv browse 36 citing papers

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 11

citation-polarity summary

background 10 unclear 1

representative citing papers

SimWorld Studio: Automatic Environment Generation with Evolving Coding Agent for Embodied Agent Learning

cs.AI · 2026-05-10 · accept · novelty 8.0 · 2 refs

SimWorld Studio deploys an evolving coding agent to create adaptive 3D environments that co-evolve with embodied learners, delivering 18-point success-rate gains over fixed environments in navigation benchmarks.

MemLearner: Learning to Query Context memory for Video World Models

cs.CV · 2026-06-30 · unverdicted · novelty 7.0

MemLearner introduces a learning-based adaptive context query method using query tokens in video world models to improve long-term scene consistency over rule-based retrieval.

EgoCS-400K: An Egocentric Gameplay Dataset for World Models

cs.CV · 2026-06-16 · unverdicted · novelty 7.0

EgoCS-400K is a new 400K-video egocentric CS dataset with action-state-event alignment from public match demos for world model training.

World Model Self-Distillation: Training World Models to Solve General Tasks

cs.CV · 2026-06-10 · unverdicted · novelty 7.0

Self-distillation from a caption-conditioned video diffusion model to an image-and-prompt-conditioned executor, enhanced by RL from VLM feedback, enables task solving in world models.

Benchmarking Visual State Tracking in Multimodal Video Understanding

cs.CV · 2026-06-02 · unverdicted · novelty 7.0

VSTAT benchmark shows state-of-the-art MLLMs perform far below humans and only modestly above answer-prior baselines on visual state tracking, failing at visual perception despite correct textual reasoning.

LongLive-RAG: A General Retrieval-Augmented Framework for Long Video Generation

cs.CV · 2026-06-01 · unverdicted · novelty 7.0

LongLive-RAG formulates long video generation as retrieval-augmented generation by treating self-generated latents as a dynamic searchable history and adding a Window Temporal Delta Loss for better retrieval.

Latent State Design for World Models under Sufficiency Constraints

cs.AI · 2026-05-03 · unverdicted · novelty 7.0

World models succeed when their latent states are built to meet task-specific sufficiency constraints rather than preserving the maximum amount of information.

MoRight: Motion Control Done Right

cs.CV · 2026-04-08 · unverdicted · novelty 7.0

MoRight disentangles object and camera motion via canonical-view specification and temporal cross-view attention, while decomposing motion into active user-driven and passive consequence components to learn and apply causality in video generation.

DreamDojo: A Generalist Robot World Model from Large-Scale Human Videos

cs.RO · 2026-02-06 · unverdicted · novelty 7.0

DreamDojo is a foundation world model pretrained on the largest human video dataset to date that uses continuous latent actions to transfer interaction knowledge and achieves controllable physics simulation after robot post-training.

Causal-rCM: A Unified Teacher-Forcing and Self-Forcing Open Recipe for Autoregressive Diffusion Distillation in Streaming Video Generation and Interactive World Models

cs.CV · 2026-06-24 · unverdicted · novelty 6.0

Causal-rCM unifies teacher-forcing and self-forcing distillation for autoregressive video diffusion, delivering a 2-step model with VBench-T2V score 84.63 and enabling interactive world models on Cosmos 3 using only synthetic data.

Compression and Retrieval: Implicit Memory Retrieval for Video World Models

cs.CV · 2026-06-22 · unverdicted · novelty 6.0

CaR uses attention with viewpoint positional encoding and context compression for flexible memory retrieval in video world models, backed by a new SceneFly dataset, and reports SOTA results with open-domain generalization.

PermaVid: Consistent Video Generation Across Edits via Disentangled Context Memory

cs.CV · 2026-06-15 · unverdicted · novelty 6.0

PermaVid disentangles spatial context into semantic appearance and geometric structure via multi-modal memory banks and edit-aware updates to maintain long-term consistency in video generation after edits.

MoVerse: Real-Time Video World Modeling with Panoramic Gaussian Scaffold

cs.CV · 2026-06-11 · unverdicted · novelty 6.0

MoVerse generates real-time interactive video world models from single narrow-FOV images via panoramic diffusion expansion, Gaussian scaffold lifting, and distillation of a bidirectional diffusion teacher into a causal autoregressive renderer.

Echo-Memory: A Controlled Study of Memory in Action World Models

cs.CV · 2026-06-08 · unverdicted · novelty 6.0

A controlled study finds that block-wise state-space recurrence outperforms other memory designs for open-domain scene return in action-conditioned video models, and that standard replay metrics do not adequately measure memory quality.

Prisma-World: Camera-Controllable Multi-Agent Video World Model

cs.CV · 2026-06-08 · unverdicted · novelty 6.0

Prisma-World is a diffusion-based multi-agent video model that uses joint full-attention, multi-agent RoPE, and relative camera geometry injection plus curriculum training to produce consistent cross-view videos from flexible agent counts.

AAD-1: Asymmetric Adversarial Distillation for One-Step Autoregressive Video Generation

cs.CV · 2026-06-02 · unverdicted · novelty 6.0

AAD-1 uses a causal generator with a bidirectional holistic discriminator plus phased distribution matching before adversarial training to reach state-of-the-art one-step autoregressive video generation on VBench.

MetaWorld: Scaling Multi-Agent Video World Model from Single-view Video Data

cs.CV · 2026-06-01 · unverdicted · novelty 6.0

MetaWorld scales multi-agent video world models from single-view videos using monocular decomposition into ego-motion and trajectories, subject-aware generation, and cross-attention alignment for consistency.

Geometry-Aware Implicit Memory for Video World Models

cs.CV · 2026-06-01 · unverdicted · novelty 6.0

GIM-World adds a camera-queryable geometry distillation head and pruning rule to implicit memory in video world models, claiming better long-horizon geometric consistency on the MIND benchmark than explicit and implicit baselines.

Robust Dreamer: Deviation-Aware Latent Gaussian Memory for Action-Controlled AR Video Generation

cs.CV · 2026-05-29 · unverdicted · novelty 6.0

Robust Dreamer uses Latent Gaussian Memory anchored to diffusion latents and Deviation Learning with a Dynamic Deviation Archive to reduce drift in long-horizon action-controlled image-to-video generation, reporting SOTA results on ScanNet, DL3DV, and OmniWorldGame.

minWM: A Full-Stack Open-Source Framework for Real-Time Interactive Video World Models

cs.CV · 2026-05-28 · unverdicted · novelty 6.0

minWM supplies an end-to-end pipeline that fine-tunes bidirectional T2V/TI2V models with camera control then distills them via Causal Forcing into few-step autoregressive generators for low-latency rollout.

WorldKV: Efficient World Memory with World Retrieval and Compression

cs.CV · 2026-05-21 · unverdicted · novelty 6.0

WorldKV enables persistent world memory in autoregressive video diffusion models by selectively retrieving and compressing KV-cache chunks, matching full-cache fidelity at roughly twice the throughput without training.

Memorize When Needed: Decoupled Memory Control for Spatially Consistent Long-Horizon Video Generation

cs.CV · 2026-04-20 · unverdicted · novelty 6.0

A decoupled memory branch with hybrid cues, cross-attention, and gating improves spatial consistency and data efficiency in long-horizon camera-trajectory video generation.

Lyra 2.0: Explorable Generative 3D Worlds

cs.CV · 2026-04-14 · unverdicted · novelty 6.0

Lyra 2.0 produces persistent 3D-consistent video sequences for large explorable worlds by using per-frame geometry for information routing and self-augmented training to correct temporal drift.

Long-Horizon Streaming Video Generation via Hybrid Attention with Decoupled Distillation

cs.CV · 2026-04-11 · conditional · novelty 6.0

Hybrid Forcing combines linear temporal attention for long-range retention, block-sparse attention for efficiency, and decoupled distillation to achieve real-time unbounded 832x480 streaming video generation at 29.5 FPS.

citing papers explorer

Showing 36 of 36 citing papers.

SimWorld Studio: Automatic Environment Generation with Evolving Coding Agent for Embodied Agent Learning cs.AI · 2026-05-10 · accept · none · ref 34 · 2 links
SimWorld Studio deploys an evolving coding agent to create adaptive 3D environments that co-evolve with embodied learners, delivering 18-point success-rate gains over fixed environments in navigation benchmarks.
MemLearner: Learning to Query Context memory for Video World Models cs.CV · 2026-06-30 · unverdicted · none · ref 25
MemLearner introduces a learning-based adaptive context query method using query tokens in video world models to improve long-term scene consistency over rule-based retrieval.
EgoCS-400K: An Egocentric Gameplay Dataset for World Models cs.CV · 2026-06-16 · unverdicted · none · ref 4
EgoCS-400K is a new 400K-video egocentric CS dataset with action-state-event alignment from public match demos for world model training.
World Model Self-Distillation: Training World Models to Solve General Tasks cs.CV · 2026-06-10 · unverdicted · none · ref 29
Self-distillation from a caption-conditioned video diffusion model to an image-and-prompt-conditioned executor, enhanced by RL from VLM feedback, enables task solving in world models.
Benchmarking Visual State Tracking in Multimodal Video Understanding cs.CV · 2026-06-02 · unverdicted · none · ref 28
VSTAT benchmark shows state-of-the-art MLLMs perform far below humans and only modestly above answer-prior baselines on visual state tracking, failing at visual perception despite correct textual reasoning.
LongLive-RAG: A General Retrieval-Augmented Framework for Long Video Generation cs.CV · 2026-06-01 · unverdicted · none · ref 19
LongLive-RAG formulates long video generation as retrieval-augmented generation by treating self-generated latents as a dynamic searchable history and adding a Window Temporal Delta Loss for better retrieval.
Latent State Design for World Models under Sufficiency Constraints cs.AI · 2026-05-03 · unverdicted · none · ref 32
World models succeed when their latent states are built to meet task-specific sufficiency constraints rather than preserving the maximum amount of information.
MoRight: Motion Control Done Right cs.CV · 2026-04-08 · unverdicted · none · ref 35
MoRight disentangles object and camera motion via canonical-view specification and temporal cross-view attention, while decomposing motion into active user-driven and passive consequence components to learn and apply causality in video generation.
DreamDojo: A Generalist Robot World Model from Large-Scale Human Videos cs.RO · 2026-02-06 · unverdicted · none · ref 34
DreamDojo is a foundation world model pretrained on the largest human video dataset to date that uses continuous latent actions to transfer interaction knowledge and achieves controllable physics simulation after robot post-training.
Causal-rCM: A Unified Teacher-Forcing and Self-Forcing Open Recipe for Autoregressive Diffusion Distillation in Streaming Video Generation and Interactive World Models cs.CV · 2026-06-24 · unverdicted · none · ref 21
Causal-rCM unifies teacher-forcing and self-forcing distillation for autoregressive video diffusion, delivering a 2-step model with VBench-T2V score 84.63 and enabling interactive world models on Cosmos 3 using only synthetic data.
Compression and Retrieval: Implicit Memory Retrieval for Video World Models cs.CV · 2026-06-22 · unverdicted · none · ref 6
CaR uses attention with viewpoint positional encoding and context compression for flexible memory retrieval in video world models, backed by a new SceneFly dataset, and reports SOTA results with open-domain generalization.
PermaVid: Consistent Video Generation Across Edits via Disentangled Context Memory cs.CV · 2026-06-15 · unverdicted · none · ref 2
PermaVid disentangles spatial context into semantic appearance and geometric structure via multi-modal memory banks and edit-aware updates to maintain long-term consistency in video generation after edits.
MoVerse: Real-Time Video World Modeling with Panoramic Gaussian Scaffold cs.CV · 2026-06-11 · unverdicted · none · ref 16
MoVerse generates real-time interactive video world models from single narrow-FOV images via panoramic diffusion expansion, Gaussian scaffold lifting, and distillation of a bidirectional diffusion teacher into a causal autoregressive renderer.
Echo-Memory: A Controlled Study of Memory in Action World Models cs.CV · 2026-06-08 · unverdicted · none · ref 24
A controlled study finds that block-wise state-space recurrence outperforms other memory designs for open-domain scene return in action-conditioned video models, and that standard replay metrics do not adequately measure memory quality.
Prisma-World: Camera-Controllable Multi-Agent Video World Model cs.CV · 2026-06-08 · unverdicted · none · ref 43
Prisma-World is a diffusion-based multi-agent video model that uses joint full-attention, multi-agent RoPE, and relative camera geometry injection plus curriculum training to produce consistent cross-view videos from flexible agent counts.
AAD-1: Asymmetric Adversarial Distillation for One-Step Autoregressive Video Generation cs.CV · 2026-06-02 · unverdicted · none · ref 5
AAD-1 uses a causal generator with a bidirectional holistic discriminator plus phased distribution matching before adversarial training to reach state-of-the-art one-step autoregressive video generation on VBench.
MetaWorld: Scaling Multi-Agent Video World Model from Single-view Video Data cs.CV · 2026-06-01 · unverdicted · none · ref 8
MetaWorld scales multi-agent video world models from single-view videos using monocular decomposition into ego-motion and trajectories, subject-aware generation, and cross-attention alignment for consistency.
Geometry-Aware Implicit Memory for Video World Models cs.CV · 2026-06-01 · unverdicted · none · ref 22
GIM-World adds a camera-queryable geometry distillation head and pruning rule to implicit memory in video world models, claiming better long-horizon geometric consistency on the MIND benchmark than explicit and implicit baselines.
Robust Dreamer: Deviation-Aware Latent Gaussian Memory for Action-Controlled AR Video Generation cs.CV · 2026-05-29 · unverdicted · none · ref 22
Robust Dreamer uses Latent Gaussian Memory anchored to diffusion latents and Deviation Learning with a Dynamic Deviation Archive to reduce drift in long-horizon action-controlled image-to-video generation, reporting SOTA results on ScanNet, DL3DV, and OmniWorldGame.
minWM: A Full-Stack Open-Source Framework for Real-Time Interactive Video World Models cs.CV · 2026-05-28 · unverdicted · none · ref 15
minWM supplies an end-to-end pipeline that fine-tunes bidirectional T2V/TI2V models with camera control then distills them via Causal Forcing into few-step autoregressive generators for low-latency rollout.
WorldKV: Efficient World Memory with World Retrieval and Compression cs.CV · 2026-05-21 · unverdicted · none · ref 8
WorldKV enables persistent world memory in autoregressive video diffusion models by selectively retrieving and compressing KV-cache chunks, matching full-cache fidelity at roughly twice the throughput without training.
Memorize When Needed: Decoupled Memory Control for Spatially Consistent Long-Horizon Video Generation cs.CV · 2026-04-20 · unverdicted · none · ref 20
A decoupled memory branch with hybrid cues, cross-attention, and gating improves spatial consistency and data efficiency in long-horizon camera-trajectory video generation.
Lyra 2.0: Explorable Generative 3D Worlds cs.CV · 2026-04-14 · unverdicted · none · ref 33
Lyra 2.0 produces persistent 3D-consistent video sequences for large explorable worlds by using per-frame geometry for information routing and self-augmented training to correct temporal drift.
Long-Horizon Streaming Video Generation via Hybrid Attention with Decoupled Distillation cs.CV · 2026-04-11 · conditional · none · ref 14
Hybrid Forcing combines linear temporal attention for long-range retention, block-sparse attention for efficiency, and decoupled distillation to achieve real-time unbounded 832x480 streaming video generation at 29.5 FPS.
Rolling Sink: Bridging Limited-Horizon Training and Open-Ended Testing in Autoregressive Video Diffusion cs.CV · 2026-02-08 · unverdicted · none · ref 37
Rolling Sink is a training-free cache adjustment technique that maintains visual consistency in autoregressive video diffusion models for ultra-long open-ended generation beyond training horizons.
Quant VideoGen: Auto-Regressive Long Video Generation via 2-Bit KV-Cache Quantization cs.LG · 2026-02-03 · unverdicted · none · ref 5
Quant VideoGen reduces KV cache memory by up to 7 times in autoregressive video diffusion models via semantic aware smoothing and progressive residual quantization, achieving better quality than baselines with under 4% latency overhead.
WorldDirector: Building Controllable World Simulators with Persistent Dynamic Memory cs.CV · 2026-07-02 · unverdicted · none · ref 22
A video world model framework that uses LLM-orchestrated 3D trajectories as control signals for generation to achieve persistent dynamic object memory and viewpoint freedom.
Directing the World: Fast Autoregressive Video Generation with Compositional Human-Camera Control cs.CV · 2026-06-26 · unverdicted · none · ref 49
A decoupled-control autoregressive video model using Fast-Slow Memory training, dynamic projection, and staged camera control to produce stable long-horizon outputs with human and viewpoint guidance.
AnchorWorld: Embodied Egocentric World Simulation with View-based Evolution Customization cs.CV · 2026-06-05 · unverdicted · none · ref 17
AnchorWorld proposes a simulation framework that adds exogenous viewpoint supervision for full-body grounding and anchor-view text customization for dynamic world evolution in egocentric settings.
DecMem: Towards Minute-Long Consistent World Generation with Decoupled Memory cs.CV · 2026-05-29 · unverdicted · none · ref 14
DecMem proposes a decoupled memory system using sparse global and anchored local components to enable consistent minute-long controllable video generation in world models.
SANA-WM: Efficient Minute-Scale World Modeling with Hybrid Linear Diffusion Transformer cs.CV · 2026-05-14 · unverdicted · none · ref 40
SANA-WM is a 2.6B-parameter efficient world model that synthesizes minute-scale 720p videos with 6-DoF camera control, trained on 213K public clips in 15 days on 64 H100s and runnable on single GPUs at 36x higher throughput than prior open baselines.
Matrix-Game 3.0: Real-Time and Streaming Interactive World Model with Long-Horizon Memory cs.CV · 2026-04-10 · unverdicted · none · ref 16
Matrix-Game 3.0 delivers 720p real-time video generation at 40 FPS with minute-scale memory consistency by combining residual self-correction training, camera-aware memory injection, and DMD-based autoregressive distillation on a 5B model.
Advancing Open-source World Models cs.CV · 2026-01-28 · unverdicted · none · ref 28
LingBot-World is presented as an open-source world model that delivers high-fidelity simulation, minute-level contextual consistency, and real-time interactivity under one second latency.
Towards Interactive Video World Modeling: Frontiers, Challenges, Benchmarks, and Future Trends cs.CV · 2026-05-31 · unverdicted · none · ref 84
This survey reviews trends, challenges, benchmarks, and future directions in action-conditioned interactive world modeling for video and 3D generation.
Evolution of Video Generative Foundations cs.CV · 2026-04-07 · unverdicted · none · ref 293
This survey traces video generation technology from GANs to diffusion models and then to autoregressive and multimodal approaches while analyzing principles, strengths, and future trends.
Causal Forcing: Autoregressive Diffusion Distillation Done Right for High-Quality Real-Time Interactive Video Generation cs.CV · 2026-02-02 · unreviewed · ref 15 · 2 links

Relic: Interactive video world model with long-horizon memory.arXiv preprint arXiv:2512.04040

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer