Genie: Generative Interactive Environments

Aditi Mavalankar; Ashley Edwards; Chris Apps; Edward Hughes; Feryal Behbahani; Jack Parker-Holder; Jake Bruce; Jeff Clune; Jingwei Zhang; Konrad Zolna

arXiv: 2402.15391 · v1 · pith:LKVB4U57new · submitted 2024-02-23 · cs.LG · cs.AI · cs.CV


Jake Bruce, Michael Dennis, Ashley Edwards, Jack Parker-Holder, Yuge Shi, Edward Hughes, Matthew Lai, Aditi Mavalankar, Richie Steigerwald, Chris Apps, Yusuf Aytar, Sarah Bechtle, Feryal Behbahani, Stephanie Chan, Nicolas Heess, Lucy Gonzalez, Simon Osindero, Sherjil Ozair, Scott Reed, Jingwei Zhang, Konrad Zolna, Jeff Clune, Nando de Freitas, Satinder Singh, Tim Rocktäschel


keywords: model, genie, action, training, agents, environments, generative, interactive


Abstract

We introduce Genie, the first generative interactive environment trained in an unsupervised manner from unlabelled Internet videos. The model can be prompted to generate an endless variety of action-controllable virtual worlds described through text, synthetic images, photographs, and even sketches. At 11B parameters, Genie can be considered a foundation world model. It comprises a spatiotemporal video tokenizer, an autoregressive dynamics model, and a simple and scalable latent action model. Genie enables users to act in the generated environments on a frame-by-frame basis despite training without any ground-truth action labels or other domain-specific requirements typically found in the world model literature. Further, the resulting learned latent action space facilitates training agents to imitate behaviors from unseen videos, opening the path for training generalist agents of the future.
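
The frame-by-frame interaction described in the abstract can be pictured as a loop over three components: tokenize a prompt frame, let the user pick a latent action, and have the dynamics model predict the next frame's tokens. The sketch below is only a toy illustration of that control flow. The abstract does not specify any interfaces, so every name and constant here (`tokenize`, `dynamics_step`, `rollout`, `NUM_LATENT_ACTIONS`, `VOCAB_SIZE`) is a hypothetical stand-in, and the arithmetic inside each function is a placeholder rather than anything the model actually computes.

```python
# Toy sketch of a frame-by-frame interactive rollout.
# All names, sizes, and update rules are illustrative assumptions,
# not the interfaces or methods of the actual model.

NUM_LATENT_ACTIONS = 8  # assumption: a small discrete latent action set
VOCAB_SIZE = 16         # assumption: toy size of the discrete token vocabulary


def tokenize(frame):
    """Stand-in for a video tokenizer: map raw values to discrete tokens."""
    return [int(p) % VOCAB_SIZE for p in frame]


def detokenize(tokens):
    """Stand-in for the tokenizer's decoder: map tokens back to a 'frame'."""
    return list(tokens)


def dynamics_step(history, action):
    """Stand-in for an autoregressive dynamics model.

    Given the token history and a latent action index, return the tokens
    of the next frame. A real model would be learned; this placeholder
    just shifts the last frame's tokens by an action-dependent amount.
    """
    last = history[-1]
    return [(t + action + 1) % VOCAB_SIZE for t in last]


def rollout(prompt_frame, actions):
    """Generate one frame per user-chosen latent action."""
    history = [tokenize(prompt_frame)]
    frames = []
    for action in actions:
        if not 0 <= action < NUM_LATENT_ACTIONS:
            raise ValueError(f"latent action {action} out of range")
        next_tokens = dynamics_step(history, action)
        history.append(next_tokens)  # condition later steps on this frame
        frames.append(detokenize(next_tokens))
    return frames


if __name__ == "__main__":
    for frame in rollout([0, 1, 2, 3], actions=[0, 1]):
        print(frame)
```

The point of the sketch is the shape of the loop: the user supplies one discrete action per step, and each generated frame is appended to the history that conditions the next prediction.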


Forward citations

Cited by 33 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Space Is Intelligence: Neural Semigroup Superposition for Riemannian Metric Generation
cs.RO 2026-06 unverdicted novelty 7.0

A single Encoder-Router network uses semigroup superposition of frame, modulation, and coefficient parameters to produce a scene-specific Riemannian metric field that supports zero-shot geodesic planning after trainin...
Stealthy World Model Manipulation via Data Poisoning
cs.LG 2026-06 unverdicted novelty 7.0

SWAAP is the first two-stage poisoning framework that identifies a harmful target world model via bilevel optimization and realizes it through stealth-constrained gradient matching on a limited fraction of fine-tuning...
MiraBench: Evaluating Action-Conditioned Reliability in Robotic World Models
cs.AI 2026-05 unverdicted novelty 7.0

MiraBench defines action-conditioned reliability via three levels (physics adherence, action-following fidelity, optimism bias detection) and applies it to 12 model configurations using a 16,000-judgment human corpus,...
UniLACT: Depth-Aware RGB Latent Action Learning for Vision-Language-Action Models
cs.RO 2026-02 unverdicted novelty 7.0

UniLACT improves VLA models by adding depth-aware unified latent action pretraining that outperforms RGB-only baselines on seen and unseen manipulation tasks.
DreamGen: Unlocking Generalization in Robot Learning through Video World Models
cs.RO 2025-05 unverdicted novelty 7.0

DreamGen trains robot policies on synthetic trajectories from adapted video world models, enabling a humanoid robot to perform 22 new behaviors in seen and unseen environments from a single pick-and-place teleoperatio...
Diffusion Models Are Real-Time Game Engines
cs.LG 2024-08 conditional novelty 7.0

A diffusion model trained on DOOM play sessions generates stable real-time interactive game frames at 20 FPS with quality near lossy JPEG.
HandsOnWorld: Unconstrained Egocentric Video Generation with Camera-Disentangled Hand Control
cs.CV 2026-07 unverdicted novelty 6.0

HandsOnWorld creates a hand-controlled egocentric video generator from unconstrained monocular video via a new EgoVid-Pro dataset from monocular reconstruction and a Plücker Hand Map that disentangles camera and hand motion.
Path Planning in Physically Viable World Models
cs.RO 2026-07 unverdicted novelty 6.0

A physically viable world model augments 3D Gaussian splats with physics simulation to assess robot route feasibility under simulated terrain changes like flooding, revealing failures not visible in static maps.
PoLAR: Factorizing Extent and Mode in Latent Actions for Robot Policy Learning
cs.RO 2026-06 unverdicted novelty 6.0

PoLAR imposes radial structure on latent actions in hyperbolic space to factorize extent and mode, improving robot policy performance over baselines.
MolmoMotion: Forecasting Point Trajectories in 3D with Language Instruction
cs.CV 2026-06 unverdicted novelty 6.0

Introduces a new task of goal-conditioned 3D point motion forecasting along with a 1.16M-video dataset, a 111-category benchmark, and a model that outperforms baselines while transferring to robotics and video generation.
Intercepting the Future: Latent-Space Predictive World Model for Dynamic VLA Manipulation
cs.RO 2026-06 unverdicted novelty 6.0

AHEAD augments frozen VLAs with a 4.9M-parameter latent world model that forecasts future visual features using optical-flow motion cues, achieving 79-97% success on dynamic simulation tasks and high real-robot succes...
Mechanisms of Misgeneralization in Physical Sequence Modeling
cs.LG 2026-05 unverdicted novelty 6.0

Generative sequence models for physical tasks exhibit physical misgeneralization where local prediction errors propagate through physical measurements to distort aggregate distributions over quantities like distance o...
Why Latent Actions Fail, and How to Prevent It
cs.CV 2026-05 unverdicted novelty 6.0

Extending linear LAMs to model exogenous state shows standard reconstruction encodes future exogenous info in latent actions, while endogenous-focused spaces and auxiliary objectives like action-supervision enforce co...
ExoActor: Exocentric Video Generation as Generalizable Interactive Humanoid Control
cs.RO 2026-04 unverdicted novelty 6.0

ExoActor uses exocentric video generation to implicitly model robot-environment-object interactions and converts the resulting videos into task-conditioned humanoid control sequences.
Grounded World Model for Semantically Generalizable Planning
cs.RO 2026-04 conditional novelty 6.0

A vision-language-aligned world model turns visuomotor MPC into a language-following planner that reaches 87% success on 288 unseen semantic tasks where standard VLAs drop to 22%.
Evolving Many Worlds: Towards Open-Ended Discovery in Petri Dish NCA via Population-Based Training
cs.NE 2026-04 unverdicted novelty 6.0

PBT-NCA evolves PD-NCAs via a composite novelty-diversity objective to generate sustained emergent lifelike behaviors including waves, spore scattering, and migrating macro-structures at the edge of chaos.
Safety, Security, and Cognitive Risks in World Models
cs.CR 2026-04 unverdicted novelty 6.0

World models enable efficient AI planning but create risks from adversarial corruption, goal misgeneralization, and human bias, demonstrated via attacks that amplify errors and reduce rewards on models like RSSM and D...
LeWorldModel: Stable End-to-End Joint-Embedding Predictive Architecture from Pixels
cs.LG 2026-03 unverdicted novelty 6.0

LeWM is the first end-to-end trainable JEPA from pixels that uses only two loss terms for stable training and fast planning on 2D/3D control tasks.
Quant VideoGen: Auto-Regressive Long Video Generation via 2-Bit KV-Cache Quantization
cs.LG 2026-02 unverdicted novelty 6.0

Quant VideoGen reduces KV cache memory by up to 7 times in autoregressive video diffusion models via semantic-aware smoothing and progressive residual quantization, achieving better quality than baselines with under 4...
VideoGPA: Distilling Geometry Priors for 3D-Consistent Video Generation
cs.CV 2026-01 unverdicted novelty 6.0

VideoGPA distills geometry priors via self-supervised DPO to enhance 3D consistency, temporal stability, and motion coherence in video diffusion models.
FLARE: Robot Learning with Implicit World Modeling
cs.RO 2025-05 unverdicted novelty 6.0

FLARE integrates predictive latent world modeling into diffusion transformer policies for robots, delivering up to 26% gains on multitask manipulation benchmarks and enabling co-training with action-free human videos.
DINO-WM: World Models on Pre-trained Visual Features enable Zero-shot Planning
cs.RO 2024-11 unverdicted novelty 6.0

DINO-WM builds world models on pre-trained DINOv2 features to enable zero-shot planning from offline data without rewards or demonstrations.
VideoPhy: Evaluating Physical Commonsense for Video Generation
cs.CV 2024-06 conditional novelty 6.0

VideoPhy benchmark shows state-of-the-art text-to-video models follow physical commonsense and text prompts in only 39.6% of cases for the best model.
World Pilot: Steering Vision-Language-Action Models with World-Action Priors
cs.RO 2026-06 unverdicted novelty 5.0

World Pilot augments VLA policies with world-action priors through latent and action steering pathways, reporting 84.7% success on LIBERO-Plus zero-shot OOD and top real-robot results across four tasks.
From Video to Control: A Survey of Learning Manipulation Interfaces from Temporal Visual Data
cs.RO 2026-04 accept novelty 5.0

A survey introduces an interface-centric taxonomy for video-to-control methods in robotic manipulation and identifies the robotics integration layer as the central open challenge.
LaGO: Latent Action Guidance for Online Reinforcement Learning
cs.AI 2026-06 unverdicted novelty 4.0

LaGO improves online RL success rates over vanilla PPO by using pretrained LLMs as latent action priors, raising rates from 15.1% to 27.2% on CLEVR-Robot and 2.7% to 15.2% on Meta-World.
EA-WM: Event-Aware World Models with Task-Specification Grounding for Long-Horizon Manipulation
cs.RO 2026-06 unverdicted novelty 4.0

EA-WM adds task-specification-grounded event prediction and verification to frozen visual-feature world models for improved long-horizon robot manipulation planning.
World Action Models: The Next Frontier in Embodied AI
cs.RO 2026-05 unverdicted novelty 4.0

The paper introduces World Action Models as a new paradigm unifying predictive world modeling with action generation in embodied foundation models and provides a taxonomy of existing approaches.
DreamForge-World 0.1 Preview: A Low-Compute Real-Time Controllable World Model
cs.LG 2026-06 unverdicted novelty 3.0

A preview system demonstrates real-time controllable world modeling at 14-15 FPS on RTX 4090 by adapting open video backbones with action pathways for keyboard/mouse control and multimodal features.
World Models: A Comprehensive Survey of Architectures, Methodologies, Reasoning Paradigms, and Applications
cs.LG 2026-05 unverdicted novelty 3.0

The paper delivers a multi-axis taxonomy for world models that maps architectures, training families, reasoning strategies, and domains from early cognitive foundations through systems such as Dreamer, MuZero, and Sor...
Fewer, Better Frames: A Compute-Normalized Proof of Concept for Coherence-First World-Model Rendering with Model-Guided FSR4 Frame Generation
cs.GR 2026-05 unverdicted novelty 3.0

Coherence-first rendering with 15 FPS anchors plus FSR4 upsampling to 30 FPS preserves scene geometry and identity longer than native 30 FPS generation across tested forest, sword, desert, and snow scenes, with LPIPS ...
Vision-Language-Action in Robotics: A Survey of Datasets, Benchmarks, and Data Engines
cs.RO 2026-04 unverdicted novelty 3.0

A survey of VLA robotics research identifies data infrastructure as the primary bottleneck and distills four open challenges in representation alignment, multimodal supervision, reasoning assessment, and scalable data...
A Tutorial on World Models and Physical AI
cs.AI 2026-06 unverdicted novelty 2.0

A tutorial that unifies explicit and implicit world models through shared predictive structure for applications in physical AI such as robotics.