OVOW reconstructs instance-level, simulation-ready 4D mesh scenes from monocular video via a four-stage training-free pipeline and introduces a new benchmark for structured Video-to-4D evaluation.
hub Canonical reference
3d and 4d world modeling: A survey
Canonical reference. 90% of citing Pith papers cite this work as background.
hub tools
citation-role summary
citation-polarity summary
roles
background 10representative citing papers
VSTAT benchmark shows state-of-the-art MLLMs perform far below humans and only modestly above answer-prior baselines on visual state tracking, failing at visual perception despite correct textual reasoning.
GTA generates 3D worlds from single images via a two-stage video diffusion process that prioritizes geometry before appearance to improve structural consistency.
WorldLens benchmark reveals no driving world model dominates across visual, geometric, behavioral, and perceptual fidelity, with contributions of a 26K human-annotated dataset and a distilled vision-language evaluator.
A kinematic-to-visual lifting paradigm combined with hierarchically routed control generates action-conditioned surgical videos with better faithfulness, fidelity, and efficiency.
Driver-WM is a driver-centric latent world model for causal rollout of in-cabin dynamics conditioned on out-cabin traffic, unifying kinematics forecasting with behavioral and emotional recognition via dual-stream architecture and gated injection.
DreamDojo is a foundation world model pretrained on the largest human video dataset to date that uses continuous latent actions to transfer interaction knowledge and achieves controllable physics simulation after robot post-training.
A new framework factorizes weather video synthesis into semantic appearance anchoring, physics-informed Gaussian particle simulation under gravity/wind/turbulence, and geometry-grounded alignment to produce diverse realistic weather effects.
U4D introduces an uncertainty-guided two-stage diffusion framework for 4D LiDAR scene synthesis that prioritizes high-entropy regions for geometry and uses a spatio-temporal block for consistency, reporting SOTA results on nuScenes and SemanticKITTI.
A unified text-conditioned diffusion model generates high-fidelity LiDAR scans across eight domains spanning weather, sensor, and platform shifts using cross-domain training and feature modeling.
GEM is a new LiDAR world model using deformable Mamba that disentangles dynamic and static features to generate high-fidelity simulations and achieve state-of-the-art results on autonomous driving benchmarks.
Embody4D generates novel-view videos from monocular robot videos via a 3D-aware synthesis pipeline, confidence-aware expert modulation, and interaction-aware attention for embodied 4D world modeling.
OneVL achieves superior accuracy to explicit chain-of-thought reasoning at answer-only latency by supervising latent tokens with a visual world model decoder that predicts future frames.
GaussianDWM uses 3D Gaussians with embedded linguistic features, language-guided sampling, and dual-condition generation for unified scene understanding and multi-modal output in driving world models.
DynFlowDrive models action-conditioned scene transitions via rectified flow in latent space and adds stability-aware trajectory selection, showing gains on nuScenes and NavSim without added inference cost.
A sparse transformer predicts multi-frame 3D occupancy from images without BEV or VAE tokenization and reports SOTA results on nuScenes for 1-3s forecasting under arbitrary trajectories.
A survey of physical AI that distinguishes theoretical physics reasoning from applied understanding and synthesizes advances in symbolic reasoning, embodied systems, and generative models to advocate for physics-grounded world models.
The paper delivers a multi-axis taxonomy for world models that maps architectures, training families, reasoning strategies, and domains from early cognitive foundations through systems such as Dreamer, MuZero, and Sora while noting evaluation gaps.
citing papers explorer
-
One Video, One World: Turning Monocular Video into Physical 4D Scenes
OVOW reconstructs instance-level, simulation-ready 4D mesh scenes from monocular video via a four-stage training-free pipeline and introduces a new benchmark for structured Video-to-4D evaluation.
-
Benchmarking Visual State Tracking in Multimodal Video Understanding
VSTAT benchmark shows state-of-the-art MLLMs perform far below humans and only modestly above answer-prior baselines on visual state tracking, failing at visual perception despite correct textual reasoning.
-
GTA: Advancing Image-to-3D World Generation via Geometry Then Appearance Video Diffusion
GTA generates 3D worlds from single images via a two-stage video diffusion process that prioritizes geometry before appearance to improve structural consistency.
-
Is Your Driving World Model an All-Around Player?
WorldLens benchmark reveals no driving world model dominates across visual, geometric, behavioral, and perceptual fidelity, with contributions of a 26K human-annotated dataset and a distilled vision-language evaluator.
-
From Articulated Kinematics to Routed Visual Control for Action-Conditioned Surgical Video Generation
A kinematic-to-visual lifting paradigm combined with hierarchically routed control generates action-conditioned surgical videos with better faithfulness, fidelity, and efficiency.
-
Driver-WM: A Driver-Centric Traffic-Conditioned Latent World Model for In-Cabin Dynamics Rollout
Driver-WM is a driver-centric latent world model for causal rollout of in-cabin dynamics conditioned on out-cabin traffic, unifying kinematics forecasting with behavioral and emotional recognition via dual-stream architecture and gated injection.
-
DreamDojo: A Generalist Robot World Model from Large-Scale Human Videos
DreamDojo is a foundation world model pretrained on the largest human video dataset to date that uses continuous latent actions to transfer interaction knowledge and achieves controllable physics simulation after robot post-training.
-
Semantic-Aware, Physics-Informed, Geometry-Grounded Weather Video Synthesis
A new framework factorizes weather video synthesis into semantic appearance anchoring, physics-informed Gaussian particle simulation under gravity/wind/turbulence, and geometry-grounded alignment to produce diverse realistic weather effects.
-
Not All Points Are Equal: Uncertainty-Aware 4D LiDAR Scene Synthesis
U4D introduces an uncertainty-guided two-stage diffusion framework for 4D LiDAR scene synthesis that prioritizes high-entropy regions for geometry and uses a spatio-temporal block for consistency, reporting SOTA results on nuScenes and SemanticKITTI.
-
OmniLiDAR: A Unified Diffusion Framework for Multi-Domain 3D LiDAR Generation
A unified text-conditioned diffusion model generates high-fidelity LiDAR scans across eight domains spanning weather, sensor, and platform shifts using cross-domain training and feature modeling.
-
GEM: Generating LiDAR World Model via Deformable Mamba
GEM is a new LiDAR world model using deformable Mamba that disentangles dynamic and static features to generate high-fidelity simulations and achieve state-of-the-art results on autonomous driving benchmarks.
-
Embody4D: A Generalist Data Engine for Embodied 4D World Modeling
Embody4D generates novel-view videos from monocular robot videos via a 3D-aware synthesis pipeline, confidence-aware expert modulation, and interaction-aware attention for embodied 4D world modeling.
-
Xiaomi OneVL: One-Step Latent Reasoning and Planning with Vision-Language Explanation
OneVL achieves superior accuracy to explicit chain-of-thought reasoning at answer-only latency by supervising latent tokens with a visual world model decoder that predicts future frames.
-
GaussianDWM: 3D Gaussian Driving World Model for Unified Scene Understanding and Multi-Modal Generation
GaussianDWM uses 3D Gaussians with embedded linguistic features, language-guided sampling, and dual-condition generation for unified scene understanding and multi-modal output in driving world models.
-
DynFlowDrive: Flow-Based Dynamic World Modeling for Autonomous Driving
DynFlowDrive models action-conditioned scene transitions via rectified flow in latent space and adds stability-aware trajectory selection, showing gains on nuScenes and NavSim without added inference cost.
-
SparseWorld-TC: Trajectory-Conditioned Sparse Occupancy World Model
A sparse transformer predicts multi-frame 3D occupancy from images without BEV or VAE tokenization and reports SOTA results on nuScenes for 1-3s forecasting under arbitrary trajectories.
-
Aligning Perception, Reasoning, Modeling and Interaction: A Survey on Physical AI
A survey of physical AI that distinguishes theoretical physics reasoning from applied understanding and synthesizes advances in symbolic reasoning, embodied systems, and generative models to advocate for physics-grounded world models.
-
World Models: A Comprehensive Survey of Architectures, Methodologies, Reasoning Paradigms, and Applications
The paper delivers a multi-axis taxonomy for world models that maps architectures, training families, reasoning strategies, and domains from early cognitive foundations through systems such as Dreamer, MuZero, and Sora while noting evaluation gaps.
- Agentic World Modeling: Foundations, Capabilities, Laws, and Beyond
- Stitch4D: Sparse Multi-Location 4D Urban Reconstruction via Spatio-Temporal Interpolation
- OpenWorldLib: A Unified Codebase and Definition of Advanced World Models