Dream.exe evaluates 8 video generation models on 101 manipulation tasks by converting generated videos into executable robot trajectories in a simulator, finding measurable success rates that visual metrics do not predict.
Mind: Benchmarking memory consistency and action control in world models.arXiv preprint arXiv:2602.08025, 2026
7 Pith papers cite this work. Polarity classification is still indexing.
fields
cs.CV 7years
2026 7representative citing papers
WBench is a benchmark with 289 test cases and 1,058 turns for evaluating interactive world models using 22 automated metrics validated against human judgments.
Formalizes video world models as group actions on states and uses latent regularization with synthesized supervision to enforce consistency, introducing GAC and GAR metrics that improve structural correctness in SOTA models.
WorldMark is the first public benchmark that standardizes scenes, trajectories, and control interfaces across heterogeneous interactive image-to-video world models.
GIM-World adds a camera-queryable geometry distillation head and pruning rule to implicit memory in video world models, claiming better long-horizon geometric consistency on the MIND benchmark than explicit and implicit baselines.
WorldOlympiad is a new benchmark decomposing world-model evaluation into physical, geometry, and interaction tracks using segmentation, MLLM judges, Gaussian splatting, and action prompts on diverse scenarios.
citing papers explorer
-
Dream.exe: Can Video Generation Models Dream Executable Robot Manipulation?
Dream.exe evaluates 8 video generation models on 101 manipulation tasks by converting generated videos into executable robot trajectories in a simulator, finding measurable success rates that visual metrics do not predict.
-
WBench: A Comprehensive Multi-turn Benchmark for Interactive Video World Model Evaluation
WBench is a benchmark with 289 test cases and 1,058 turns for evaluating interactive world models using 22 automated metrics validated against human judgments.
-
World Models as Group Actions
Formalizes video world models as group actions on states and uses latent regularization with synthesized supervision to enforce consistency, introducing GAC and GAR metrics that improve structural correctness in SOTA models.
-
WorldMark: A Unified Benchmark Suite for Interactive Video World Models
WorldMark is the first public benchmark that standardizes scenes, trajectories, and control interfaces across heterogeneous interactive image-to-video world models.
-
Geometry-Aware Implicit Memory for Video World Models
GIM-World adds a camera-queryable geometry distillation head and pruning rule to implicit memory in video world models, claiming better long-horizon geometric consistency on the MIND benchmark than explicit and implicit baselines.
-
WorldOlympiad: Can Your World Model Survive a Triathlon?
WorldOlympiad is a new benchmark decomposing world-model evaluation into physical, geometry, and interaction tracks using segmentation, MLLM judges, Gaussian splatting, and action prompts on diverse scenarios.