{"total":16,"items":[{"citing_arxiv_id":"2605.13815","ref_index":30,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"OmniLiDAR: A Unified Diffusion Framework for Multi-Domain 3D LiDAR Generation","primary_cat":"cs.CV","submitted_at":"2026-05-13T17:42:20+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A unified text-conditioned diffusion model generates high-fidelity LiDAR scans across eight domains spanning weather, sensor, and platform shifts using cross-domain training and feature modeling.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.12957","ref_index":13,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"GTA: Advancing Image-to-3D World Generation via Geometry Then Appearance Video Diffusion","primary_cat":"cs.CV","submitted_at":"2026-05-13T03:43:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"GTA generates 3D worlds from single images via a two-stage video diffusion process that prioritizes geometry before appearance to improve structural consistency.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Language guided generation of 3d embodied ai environments. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16227-16237 (2024) [12] Wang, X., Liu, L., Cao, Y., Wu, R., Qin, W., Wang, D., Sui, W., Su, Z.: Embodiedgen: Towards a generative 3d world engine for embodied intelligence. arXiv preprint arXiv:2506.10600 (2025) [13] Kong, L., Yang, W., Mei, J., Liu, Y., Liang, A., Zhu, D., Lu, D., Yin, W., Hu, X., Jia, M., et al.: 3d and 4d world modeling: A survey.
arXiv preprint arXiv:2509.07996 (2025) [14] Ding, J., Zhang, Y., Shang, Y., Zhang, Y., Zong, Z., Feng, J., Yuan, Y., Su, H., Li, N., Sukiennik, N., et al.: Understanding world or predicting future? a comprehensive survey"},{"citing_arxiv_id":"2605.10858","ref_index":17,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Is Your Driving World Model an All-Around Player?","primary_cat":"cs.CV","submitted_at":"2026-05-11T17:05:49+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"WorldLens benchmark reveals no driving world model dominates across visual, geometric, behavioral, and perceptual fidelity, with contributions of a 26K human-annotated dataset and a distilled vision-language evaluator.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"difficult to distinguish from real footage. Yet the prevailing evaluation metrics, such as FID, FVD, and LPIPS, were designed for image quality, not world fidelity [13, 15]. These metrics quantify perceptual similarity but reveal nothing about whether the underlying geometry is coherent, whether the physics are plausible, or whether the generated world can support downstream autonomy tasks [17, 20]. As such, the field has been optimizing for an incomplete objective: worlds that appear real but do not behave realistically.
The absence of a comprehensive evaluation protocol means that progress on one axis (e.g., texture realism) can mask regression on others (e.g., 3D consistency or action controllability), making it difficult to compare models or identify"},{"citing_arxiv_id":"2605.08712","ref_index":35,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"From Articulated Kinematics to Routed Visual Control for Action-Conditioned Surgical Video Generation","primary_cat":"cs.CV","submitted_at":"2026-05-09T05:48:51+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"A kinematic-to-visual lifting paradigm combined with hierarchically routed control generates action-conditioned surgical videos with better faithfulness, fidelity, and efficiency.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"In MICCAI Workshop on Data Engineering in Medical Imaging, pages 1-10. Springer, 2025. [34] Çağhan Köksal, Ghazal Ghazaei, Felix Holm, Azade Farshad, and Nassir Navab. Sangria: surgical video scene graph optimization for surgical workflow prediction. In International Workshop on Graphs in Biomedical Image Analysis, pages 106-117. Springer, 2024. [35] Lingdong Kong, Wesley Yang, Jianbiao Mei, Youquan Liu, Ao Liang, Dekai Zhu, Dongyue Lu, Wei Yin, Xiaotao Hu, Mingkai Jia, et al. 3d and 4d world modeling: A survey. arXiv preprint arXiv:2509.07996, 2025. [36] Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, et al.
Hunyuanvideo: A systematic framework for large video"},{"citing_arxiv_id":"2605.07326","ref_index":16,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"GEM: Generating LiDAR World Model via Deformable Mamba","primary_cat":"cs.CV","submitted_at":"2026-05-08T06:32:12+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"GEM is a new LiDAR world model using deformable Mamba that disentangles dynamic and static features to generate high-fidelity simulations and achieve state-of-the-art results on autonomous driving benchmarks.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"periments show that GEM achieves state-of-the-art performances across diverse benchmarks and evaluation settings, demonstrating its superiority and effectiveness. Project page: https://github.com/wuyang98/GEM. 1. Introduction World models represent a transformative technology, enabling autonomous vehicles to evolve from passively reacting to their environment to actively reasoning about the future [16]. While significant progress has been made for camera video-based and occupancy-based methods [1, 54], the potential of LiDAR-based world models remains largely unexplored, despite LiDAR's inherent advantage in providing precise geometric capture of driving environments. *Corresponding author. † Independent researcher.
Disentangled Static Dynamic"},{"citing_arxiv_id":"2605.05092","ref_index":14,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Driver-WM: A Driver-Centric Traffic-Conditioned Latent World Model for In-Cabin Dynamics Rollout","primary_cat":"cs.RO","submitted_at":"2026-05-06T16:30:48+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Driver-WM rolls out in-cabin driver states in a compact latent space from frozen vision-language features, using traffic-conditioned dual streams and gated causal injection for long-horizon geometric and semantic forecasting.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"failures are associated with inadequate takeover readiness rather than incorrect scene understanding [30], and the risk increases when the system operates beyond its functional domain under evolving traffic interactions [28]. Motivated by the need to reason about long-horizon safety and interaction, world models have recently emerged as a principled framework for autonomous driving [14]. By predicting how the external environment evolves conditioned on current observations and actions, they have been widely adopted for forward simulation, maneuver reasoning, and policy training, ranging from unified full-stack driving systems to generative traffic simulators and driving foundation models [8,10,29].
To support efficient long-horizon prediction and scalable closed-"},{"citing_arxiv_id":"2605.01799","ref_index":28,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Embody4D: A Generalist Data Engine for Embodied 4D World Modeling","primary_cat":"cs.CV","submitted_at":"2026-05-03T09:39:44+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"World modeling stands as a cornerstone of robotics, endowing systems with the capacity to comprehend, encode, and forecast the evolution of dynamic scenes [11]. As the core of interactive environment construction, 3D and 4D generation techniques are critical for bridging the gap between perception and decision-making, synthesizing high-fidelity dynamic scenes for embodied planning [28]. Most existing embodied world models are confined to single-view video prediction [63].
While recent studies acknowledge the necessity of multi-view perception and support generating synchronized head and wrist views [26,33,57], they typically rely on fixed multi-view reference frames during inference, thus falling short of flexible, arbitrary viewpoint generation."},{"citing_arxiv_id":"2604.22748","ref_index":189,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Agentic World Modeling: Foundations, Capabilities, Laws, and Beyond","primary_cat":"cs.AI","submitted_at":"2026-04-24T17:48:47+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.18486","ref_index":56,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Xiaomi OneVL: One-Step Latent Reasoning and Planning with Vision-Language Explanation","primary_cat":"cs.CV","submitted_at":"2026-04-20T16:37:22+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"OneVL achieves superior accuracy to explicit chain-of-thought reasoning at answer-only latency by supervising latent tokens with a visual world model decoder that predicts future frames.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"RoboDrive challenge: Drive anytime anywhere in any condition. arXiv preprint arXiv:2405.08816, 2024. [55] Lingdong Kong, Xiang Xu, Jiawei Ren, Wenwei Zhang, Liang Pan, Kai Chen, Wei Tsang Ooi, and Ziwei Liu. Multi-modal data-efficient 3D scene understanding for autonomous driving. IEEE Transactions on Pattern Analysis and Machine Intelligence, 47(5):3748-3765, 2025.
[56] Lingdong Kong, Wesley Yang, Jianbiao Mei, Youquan Liu, Ao Liang, Dekai Zhu, Dongyue Lu, Wei Yin, Xiaotao Hu, Mingkai Jia, Junyuan Deng, Kaiwen Zhang, Yang Wu, Tianyi Yan, Shenyuan Gao, Song Wang, Linfeng Li, Liang Pan, Yong Liu, Jianke Zhu, Wei Tsang Ooi, Steven C. H. Hoi, and Ziwei Liu. 3D and 4D world modeling: A survey. arXiv preprint arXiv:2509."},{"citing_arxiv_id":"2604.07923","ref_index":20,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Stitch4D: Sparse Multi-Location 4D Urban Reconstruction via Spatio-Temporal Interpolation","primary_cat":"cs.CV","submitted_at":"2026-04-09T07:45:51+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Stitch4D reconstructs coherent 4D urban scenes from sparse non-overlapping camera placements by synthesizing bridge views and enforcing inter-location spatio-temporal consistency.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.04707","ref_index":58,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"OpenWorldLib: A Unified Codebase and Definition of Advanced World Models","primary_cat":"cs.CV","submitted_at":"2026-04-06T14:19:48+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Causal reasoning and large language models: Opening a new frontier for causality. Transactions on Machine Learning Research, 2023. [57] Eric Kolve, Roozbeh Mottaghi, Winson Han, Eli VanderBilt, Luca Weihs, Alvaro Herrasti, Matt Deitke, Kiana Ehsani, Daniel Gordon, Yuke Zhu, et al. Ai2-thor: An interactive 3d environment for visual ai. arXiv preprint arXiv:1712.05474, 2017.
[58] Lingdong Kong, Wesley Yang, Jianbiao Mei, Youquan Liu, Ao Liang, Dekai Zhu, Dongyue Lu, Wei Yin, Xiaotao Hu, Mingkai Jia, et al. 3d and 4d world modeling: A survey. arXiv preprint arXiv:2509.07996, 2025. [59] Yann LeCun et al. A path towards autonomous machine intelligence version 0.9.2, 2022-06-27. Open Review, 62(1):1-62, 2022. [60] Bohan Li, Zhuang Ma, Dalong Du, Baorui Peng, Zhujin Liang, Zhenqiang Liu, Chao Ma, Yueming Jin, Hao Zhao,"},{"citing_arxiv_id":"2603.19675","ref_index":22,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"DynFlowDrive: Flow-Based Dynamic World Modeling for Autonomous Driving","primary_cat":"cs.CV","submitted_at":"2026-03-20T06:19:31+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"DynFlowDrive models action-conditioned scene transitions via rectified flow in latent space and adds stability-aware trajectory selection, showing gains on nuScenes and NavSim without added inference cost.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2602.06949","ref_index":53,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"DreamDojo: A Generalist Robot World Model from Large-Scale Human Videos","primary_cat":"cs.RO","submitted_at":"2026-02-06T18:49:43+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"DreamDojo is a foundation world model pretrained on the largest human video dataset to date that uses continuous latent actions to transfer interaction knowledge and achieves controllable physics simulation after robot post-training.","context_count":1,"top_context_role":"background","top_context_polarity":"unclear","context_text":"on Computer Vision and Pattern Recognition (CVPR), 2021. 16 [51] Diederik Kingma and Max Welling.
Auto-Encoding Variational Bayes. arXiv preprint arXiv:1312.6114, 2013. 7 [52] Lingdong Kong, Wesley Yang, Jianbiao Mei, Youquan Liu, Ao Liang, Dekai Zhu, Dongyue Lu, Wei Yin, Xiaotao Hu, Mingkai Jia, et al. 3D and 4D World Modeling: A Survey. arXiv preprint arXiv:2509.07996, 2025. 16 [53] Yann LeCun. A Path Towards Autonomous Machine Intelligence. Open Review, 2022. 2 [54] Hengtao Li, Pengxiang Ding, Runze Suo, Yihao Wang, Zirui Ge, Dongyuan Zang, Kexian Yu, Mingyang Sun, Hongyin Zhang, Donglin Wang, et al. VLA-RFT: Vision-Language-Action Reinforcement Fine-tuning with Verified Rewards in World Simulators. arXiv preprint arXiv:2510.00406, 2025."},{"citing_arxiv_id":"2512.23180","ref_index":27,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"GaussianDWM: 3D Gaussian Driving World Model for Unified Scene Understanding and Multi-Modal Generation","primary_cat":"cs.CV","submitted_at":"2025-12-29T03:40:05+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"GaussianDWM uses 3D Gaussians with embedded linguistic features, language-guided sampling, and dual-condition generation for unified scene understanding and multi-modal output in driving world models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2511.22039","ref_index":16,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"SparseWorld-TC: Trajectory-Conditioned Sparse Occupancy World Model","primary_cat":"cs.CV","submitted_at":"2025-11-27T02:48:45+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"A sparse transformer predicts multi-frame 3D occupancy from images without BEV or VAE tokenization and reports SOTA results on nuScenes for 1-3s forecasting under arbitrary
trajectories.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2510.04978","ref_index":55,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Aligning Perception, Reasoning, Modeling and Interaction: A Survey on Physical AI","primary_cat":"cs.AI","submitted_at":"2025-10-06T16:16:03+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"A survey of physical AI that distinguishes theoretical physics reasoning from applied understanding and synthesizes advances in symbolic reasoning, embodied systems, and generative models to advocate for physics-grounded world models.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"input features toward generating and interacting with realistic physical scenarios. Scope Comparison and Contributions. As summarized in Table 1, existing surveys have examined individual dimensions of physical understanding in isolation, addressing perception [40], [41], [42], [43], [44], reasoning [45], [46], [47], [48], [49], modeling [50], [51], [52], [53], [54], [55], [56], [57], [58], [59], and interaction [60], [61], [62], [63], [64], [65], [66], [67], [68], [69], [70], [71] as separate research areas without examining the synergistic connections between them. Our survey uniquely focuses on the evolutionary trajectory that unites these four capabilities into a coherent paradigm, analyzing how they interact and inform one another toward unified"}],"limit":50,"offset":0}