hub Canonical reference

3d and 4d world modeling: A survey

Lingdong Kong, Wesley Yang, Jianbiao Mei, Youquan Liu, Ao Liang, Dekai Zhu, Dongyue Lu, Wei Yin, Xiaotao Hu, Mingkai Jia, et al · 2025 · arXiv 2509.07996

Canonical reference. 90% of citing Pith papers cite this work as background.

21 Pith papers citing it

Background 90% of classified citations

read on arXiv browse 21 citing papers

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 10

citation-polarity summary

background 9 unclear 1

representative citing papers

One Video, One World: Turning Monocular Video into Physical 4D Scenes

cs.CV · 2026-06-30 · unverdicted · novelty 8.0

OVOW reconstructs instance-level, simulation-ready 4D mesh scenes from monocular video via a four-stage training-free pipeline and introduces a new benchmark for structured Video-to-4D evaluation.

Benchmarking Visual State Tracking in Multimodal Video Understanding

cs.CV · 2026-06-02 · unverdicted · novelty 7.0

VSTAT benchmark shows state-of-the-art MLLMs perform far below humans and only modestly above answer-prior baselines on visual state tracking, failing at visual perception despite correct textual reasoning.

GTA: Advancing Image-to-3D World Generation via Geometry Then Appearance Video Diffusion

cs.CV · 2026-05-13 · unverdicted · novelty 7.0

GTA generates 3D worlds from single images via a two-stage video diffusion process that prioritizes geometry before appearance to improve structural consistency.

Is Your Driving World Model an All-Around Player?

cs.CV · 2026-05-11 · unverdicted · novelty 7.0

WorldLens benchmark reveals no driving world model dominates across visual, geometric, behavioral, and perceptual fidelity, with contributions of a 26K human-annotated dataset and a distilled vision-language evaluator.

From Articulated Kinematics to Routed Visual Control for Action-Conditioned Surgical Video Generation

cs.CV · 2026-05-09 · unverdicted · novelty 7.0

A kinematic-to-visual lifting paradigm combined with hierarchically routed control generates action-conditioned surgical videos with better faithfulness, fidelity, and efficiency.

Driver-WM: A Driver-Centric Traffic-Conditioned Latent World Model for In-Cabin Dynamics Rollout

cs.RO · 2026-05-06 · unverdicted · novelty 7.0 · 2 refs

Driver-WM is a driver-centric latent world model for causal rollout of in-cabin dynamics conditioned on out-cabin traffic, unifying kinematics forecasting with behavioral and emotional recognition via dual-stream architecture and gated injection.

DreamDojo: A Generalist Robot World Model from Large-Scale Human Videos

cs.RO · 2026-02-06 · unverdicted · novelty 7.0

DreamDojo is a foundation world model pretrained on the largest human video dataset to date that uses continuous latent actions to transfer interaction knowledge and achieves controllable physics simulation after robot post-training.

Semantic-Aware, Physics-Informed, Geometry-Grounded Weather Video Synthesis

cs.CV · 2026-06-27 · unverdicted · novelty 6.0

A new framework factorizes weather video synthesis into semantic appearance anchoring, physics-informed Gaussian particle simulation under gravity/wind/turbulence, and geometry-grounded alignment to produce diverse realistic weather effects.

Not All Points Are Equal: Uncertainty-Aware 4D LiDAR Scene Synthesis

cs.CV · 2026-06-01 · unverdicted · novelty 6.0

U4D introduces an uncertainty-guided two-stage diffusion framework for 4D LiDAR scene synthesis that prioritizes high-entropy regions for geometry and uses a spatio-temporal block for consistency, reporting SOTA results on nuScenes and SemanticKITTI.

OmniLiDAR: A Unified Diffusion Framework for Multi-Domain 3D LiDAR Generation

cs.CV · 2026-05-13 · conditional · novelty 6.0

A unified text-conditioned diffusion model generates high-fidelity LiDAR scans across eight domains spanning weather, sensor, and platform shifts using cross-domain training and feature modeling.

GEM: Generating LiDAR World Model via Deformable Mamba

cs.CV · 2026-05-08 · unverdicted · novelty 6.0

GEM is a new LiDAR world model using deformable Mamba that disentangles dynamic and static features to generate high-fidelity simulations and achieve state-of-the-art results on autonomous driving benchmarks.

Embody4D: A Generalist Data Engine for Embodied 4D World Modeling

cs.CV · 2026-05-03 · unverdicted · novelty 6.0 · 2 refs

Embody4D generates novel-view videos from monocular robot videos via a 3D-aware synthesis pipeline, confidence-aware expert modulation, and interaction-aware attention for embodied 4D world modeling.

Xiaomi OneVL: One-Step Latent Reasoning and Planning with Vision-Language Explanation

cs.CV · 2026-04-20 · unverdicted · novelty 6.0 · 2 refs

OneVL achieves superior accuracy to explicit chain-of-thought reasoning at answer-only latency by supervising latent tokens with a visual world model decoder that predicts future frames.

GaussianDWM: 3D Gaussian Driving World Model for Unified Scene Understanding and Multi-Modal Generation

cs.CV · 2025-12-29 · unverdicted · novelty 6.0

GaussianDWM uses 3D Gaussians with embedded linguistic features, language-guided sampling, and dual-condition generation for unified scene understanding and multi-modal output in driving world models.

DynFlowDrive: Flow-Based Dynamic World Modeling for Autonomous Driving

cs.CV · 2026-03-20 · unverdicted · novelty 5.0

DynFlowDrive models action-conditioned scene transitions via rectified flow in latent space and adds stability-aware trajectory selection, showing gains on nuScenes and NavSim without added inference cost.

SparseWorld-TC: Trajectory-Conditioned Sparse Occupancy World Model

cs.CV · 2025-11-27 · unverdicted · novelty 5.0

A sparse transformer predicts multi-frame 3D occupancy from images without BEV or VAE tokenization and reports SOTA results on nuScenes for 1-3s forecasting under arbitrary trajectories.

Aligning Perception, Reasoning, Modeling and Interaction: A Survey on Physical AI

cs.AI · 2025-10-06 · unverdicted · novelty 4.0

A survey of physical AI that distinguishes theoretical physics reasoning from applied understanding and synthesizes advances in symbolic reasoning, embodied systems, and generative models to advocate for physics-grounded world models.

World Models: A Comprehensive Survey of Architectures, Methodologies, Reasoning Paradigms, and Applications

cs.LG · 2026-05-28 · unverdicted · novelty 3.0

The paper delivers a multi-axis taxonomy for world models that maps architectures, training families, reasoning strategies, and domains from early cognitive foundations through systems such as Dreamer, MuZero, and Sora while noting evaluation gaps.

Agentic World Modeling: Foundations, Capabilities, Laws, and Beyond

cs.AI · 2026-04-24

Stitch4D: Sparse Multi-Location 4D Urban Reconstruction via Spatio-Temporal Interpolation

cs.CV · 2026-04-09

OpenWorldLib: A Unified Codebase and Definition of Advanced World Models

cs.CV · 2026-04-06

citing papers explorer

Showing 21 of 21 citing papers.

One Video, One World: Turning Monocular Video into Physical 4D Scenes cs.CV · 2026-06-30 · unverdicted · none · ref 39
OVOW reconstructs instance-level, simulation-ready 4D mesh scenes from monocular video via a four-stage training-free pipeline and introduces a new benchmark for structured Video-to-4D evaluation.
Benchmarking Visual State Tracking in Multimodal Video Understanding cs.CV · 2026-06-02 · unverdicted · none · ref 31
VSTAT benchmark shows state-of-the-art MLLMs perform far below humans and only modestly above answer-prior baselines on visual state tracking, failing at visual perception despite correct textual reasoning.
GTA: Advancing Image-to-3D World Generation via Geometry Then Appearance Video Diffusion cs.CV · 2026-05-13 · unverdicted · none · ref 13
GTA generates 3D worlds from single images via a two-stage video diffusion process that prioritizes geometry before appearance to improve structural consistency.
Is Your Driving World Model an All-Around Player? cs.CV · 2026-05-11 · unverdicted · none · ref 17
WorldLens benchmark reveals no driving world model dominates across visual, geometric, behavioral, and perceptual fidelity, with contributions of a 26K human-annotated dataset and a distilled vision-language evaluator.
From Articulated Kinematics to Routed Visual Control for Action-Conditioned Surgical Video Generation cs.CV · 2026-05-09 · unverdicted · none · ref 35
A kinematic-to-visual lifting paradigm combined with hierarchically routed control generates action-conditioned surgical videos with better faithfulness, fidelity, and efficiency.
Driver-WM: A Driver-Centric Traffic-Conditioned Latent World Model for In-Cabin Dynamics Rollout cs.RO · 2026-05-06 · unverdicted · none · ref 14 · 2 links
Driver-WM is a driver-centric latent world model for causal rollout of in-cabin dynamics conditioned on out-cabin traffic, unifying kinematics forecasting with behavioral and emotional recognition via dual-stream architecture and gated injection.
DreamDojo: A Generalist Robot World Model from Large-Scale Human Videos cs.RO · 2026-02-06 · unverdicted · none · ref 53
DreamDojo is a foundation world model pretrained on the largest human video dataset to date that uses continuous latent actions to transfer interaction knowledge and achieves controllable physics simulation after robot post-training.
Semantic-Aware, Physics-Informed, Geometry-Grounded Weather Video Synthesis cs.CV · 2026-06-27 · unverdicted · none · ref 34
A new framework factorizes weather video synthesis into semantic appearance anchoring, physics-informed Gaussian particle simulation under gravity/wind/turbulence, and geometry-grounded alignment to produce diverse realistic weather effects.
Not All Points Are Equal: Uncertainty-Aware 4D LiDAR Scene Synthesis cs.CV · 2026-06-01 · unverdicted · none · ref 7
U4D introduces an uncertainty-guided two-stage diffusion framework for 4D LiDAR scene synthesis that prioritizes high-entropy regions for geometry and uses a spatio-temporal block for consistency, reporting SOTA results on nuScenes and SemanticKITTI.
OmniLiDAR: A Unified Diffusion Framework for Multi-Domain 3D LiDAR Generation cs.CV · 2026-05-13 · conditional · none · ref 30
A unified text-conditioned diffusion model generates high-fidelity LiDAR scans across eight domains spanning weather, sensor, and platform shifts using cross-domain training and feature modeling.
GEM: Generating LiDAR World Model via Deformable Mamba cs.CV · 2026-05-08 · unverdicted · none · ref 16
GEM is a new LiDAR world model using deformable Mamba that disentangles dynamic and static features to generate high-fidelity simulations and achieve state-of-the-art results on autonomous driving benchmarks.
Embody4D: A Generalist Data Engine for Embodied 4D World Modeling cs.CV · 2026-05-03 · unverdicted · none · ref 28 · 2 links
Embody4D generates novel-view videos from monocular robot videos via a 3D-aware synthesis pipeline, confidence-aware expert modulation, and interaction-aware attention for embodied 4D world modeling.
Xiaomi OneVL: One-Step Latent Reasoning and Planning with Vision-Language Explanation cs.CV · 2026-04-20 · unverdicted · none · ref 56 · 2 links
OneVL achieves superior accuracy to explicit chain-of-thought reasoning at answer-only latency by supervising latent tokens with a visual world model decoder that predicts future frames.
GaussianDWM: 3D Gaussian Driving World Model for Unified Scene Understanding and Multi-Modal Generation cs.CV · 2025-12-29 · unverdicted · none · ref 27
GaussianDWM uses 3D Gaussians with embedded linguistic features, language-guided sampling, and dual-condition generation for unified scene understanding and multi-modal output in driving world models.
DynFlowDrive: Flow-Based Dynamic World Modeling for Autonomous Driving cs.CV · 2026-03-20 · unverdicted · none · ref 22
DynFlowDrive models action-conditioned scene transitions via rectified flow in latent space and adds stability-aware trajectory selection, showing gains on nuScenes and NavSim without added inference cost.
SparseWorld-TC: Trajectory-Conditioned Sparse Occupancy World Model cs.CV · 2025-11-27 · unverdicted · none · ref 16
A sparse transformer predicts multi-frame 3D occupancy from images without BEV or VAE tokenization and reports SOTA results on nuScenes for 1-3s forecasting under arbitrary trajectories.
Aligning Perception, Reasoning, Modeling and Interaction: A Survey on Physical AI cs.AI · 2025-10-06 · unverdicted · none · ref 55
A survey of physical AI that distinguishes theoretical physics reasoning from applied understanding and synthesizes advances in symbolic reasoning, embodied systems, and generative models to advocate for physics-grounded world models.
World Models: A Comprehensive Survey of Architectures, Methodologies, Reasoning Paradigms, and Applications cs.LG · 2026-05-28 · unverdicted · none · ref 68
The paper delivers a multi-axis taxonomy for world models that maps architectures, training families, reasoning strategies, and domains from early cognitive foundations through systems such as Dreamer, MuZero, and Sora while noting evaluation gaps.
Agentic World Modeling: Foundations, Capabilities, Laws, and Beyond cs.AI · 2026-04-24 · unreviewed · ref 189
Stitch4D: Sparse Multi-Location 4D Urban Reconstruction via Spatio-Temporal Interpolation cs.CV · 2026-04-09 · unreviewed · ref 20
OpenWorldLib: A Unified Codebase and Definition of Advanced World Models cs.CV · 2026-04-06 · unreviewed · ref 58

3d and 4d world modeling: A survey

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer