arxiv: 2606.18375 · v3 · pith:LRGJVWMCnew · submitted 2026-06-16 · 💻 cs.RO

PAIWorld: A 3D-Consistent World Foundation Model for Robotic Manipulation

Pith reviewed 2026-06-27 00:16 UTC · model grok-4.3

classification 💻 cs.RO
keywords world foundation models · multi-view 3D consistency · robotic manipulation · diffusion transformer · cross-view attention · geometric position embedding · 3D feature distillation · robotic simulation

The pith

PAIWorld adds inter-view attention, ray embeddings, and 3D distillation to diffusion world models to eliminate cross-view drift.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that multi-view world foundation models fail for robotics because they simply concatenate view tokens without mechanisms for communication between views or explicit 3D geometric knowledge. It identifies these two missing elements as the direct cause of object drift, depth errors, and texture misalignment across cameras. To fix both at once, the authors add three components to a standard diffusion-transformer base: attention blocks that let views exchange information, position embeddings that encode camera rays and poses, and a distillation step that pulls 3D features from frozen models. If the argument holds, robotic simulators can finally produce outputs that stay consistent no matter which camera is used, supporting more accurate planning and policy learning from multiple views.
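
The distillation step can be pictured as a representation-alignment loss between the world model's tokens and features from a frozen 3D model. The sketch below is a generic negative-cosine alignment in the spirit of REPA, not the paper's Latent 3D-REPA objective, whose details are not given here. The function name `alignment_loss`, the linear projection `proj`, and the token shapes are assumptions made for illustration.

```python
import numpy as np

def alignment_loss(student_feats, teacher_feats, proj, eps=1e-8):
    """Mean negative cosine similarity between projected student tokens
    and frozen teacher tokens (hypothetical stand-in for a 3D teacher).

    student_feats: (N, D_student) hidden tokens from the world model
    teacher_feats: (N, D_teacher) tokens from a frozen 3D-aware model
    proj:          (D_student, D_teacher) trainable linear projection
    """
    z = student_feats @ proj                                   # (N, D_teacher)
    z = z / (np.linalg.norm(z, axis=-1, keepdims=True) + eps)  # unit-normalise
    t = teacher_feats / (np.linalg.norm(teacher_feats, axis=-1, keepdims=True) + eps)
    # Cosine similarity per token, averaged; negated so lower is better.
    return float(-(z * t).sum(axis=-1).mean())
```

The loss approaches -1 when the projected student tokens point in the same direction as the teacher tokens, so adding it to the diffusion objective would pull the latents toward the teacher's 3D-aware features.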

Core claim

PAIWorld augments diffusion-transformer world models with three components: Geometry-Aware Cross-View Attention blocks that establish explicit pathways across views, Geometric Rotary Position Embedding that injects camera ray directions and extrinsic poses, and Latent 3D-REPA that distills 3D-aware features from frozen foundation models. The authors argue that addressing the absence of inter-view communication and the lack of a 3D geometric prior simultaneously is necessary and sufficient to resolve cross-view object drift, depth inconsistency, and texture misalignment. They report state-of-the-art multi-view 3D consistency on robotic manipulation benchmarks.
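
To make "explicit pathways across views" concrete, here is a minimal sketch of attention applied along the view axis, so that each token position mixes information from the same position in every camera. This is one possible reading, not the paper's Geometry-Aware Cross-View Attention block. The `(V, N, D)` token layout, the single-head form, and the name `cross_view_attention` are assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_view_attention(tokens, w_q, w_k, w_v):
    """Single-head attention over the view axis (illustrative only).

    tokens: (V, N, D) array, V views with N tokens of width D each.
    w_q, w_k, w_v: (D, D) projection matrices.
    Returns an array of shape (V, N, D).
    """
    x = np.swapaxes(tokens, 0, 1)                       # (N, V, D)
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(q.shape[-1])  # (N, V, V)
    out = softmax(scores, axis=-1) @ v                  # mix across views
    return np.swapaxes(out, 0, 1)                       # back to (V, N, D)
```

With this layout, a change in one camera's tokens alters the output for every other camera, which is the kind of inter-view communication the core claim calls for.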

What carries the argument

Three components carry the argument: Geometry-Aware Cross-View Attention blocks, Geometric Rotary Position Embedding, and Latent 3D-REPA. Together they supply inter-view communication and 3D geometric priors inside a DiT-based world model.
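
The geometric quantity that a ray-based position embedding would encode can be computed from standard pinhole-camera parameters. The sketch below derives a unit ray direction per pixel from an intrinsics matrix and a camera-to-world rotation. How PAIWorld folds these rays into its rotary embedding is not described here, so the code stops at the rays themselves. The function name and argument conventions are assumptions.

```python
import numpy as np

def pixel_ray_directions(K, R_c2w, height, width):
    """Unit ray directions in the world frame for every pixel centre.

    K:      (3, 3) pinhole intrinsics matrix
    R_c2w:  (3, 3) camera-to-world rotation (from the extrinsic pose)
    Returns an array of shape (height, width, 3).
    """
    # Pixel-centre coordinates; u runs along columns, v along rows.
    u, v = np.meshgrid(np.arange(width) + 0.5, np.arange(height) + 0.5)
    pix = np.stack([u, v, np.ones_like(u)], axis=-1)    # homogeneous pixels
    d_cam = pix @ np.linalg.inv(K).T                    # back-project: K^-1 p
    d_world = d_cam @ R_c2w.T                           # rotate into world frame
    return d_world / np.linalg.norm(d_world, axis=-1, keepdims=True)
```

Because the rays live in a shared world frame, two pixels from different cameras that look along similar directions receive similar geometric descriptors, regardless of where they sit in their respective images.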

If this is right

  • Multi-view world models can now maintain stable object positions, depths, and textures across egocentric, eye-to-hand, and wrist cameras.
  • Model-based planning gains reliability because simulated future states remain geometrically consistent from every viewpoint.
  • World action models can be trained directly on the consistent multi-view outputs without additional correction steps.
  • Multi-view policy post-training becomes feasible because the underlying simulator no longer drifts between training views.
  • The same base model ranks 1st on the WorldArena leaderboard and 2nd on the AgiBot-Challenge2026 leaderboard.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same three-component pattern could be tested in non-robotic multi-camera settings such as scene reconstruction or video surveillance to check whether the consistency gains transfer.
  • If the geometric prior is the dominant factor, future models might achieve similar results with lighter attention mechanisms by increasing reliance on the distillation step.
  • The approach suggests a modular way to retrofit existing single-view world models rather than training entirely new architectures from scratch.

Load-bearing premise

The two identified deficiencies are the root causes of the observed inconsistencies, and simultaneously adding the three proposed components is necessary and sufficient to resolve them without introducing new problems.

What would settle it

An experiment in which a model using only one or two of the three components reaches comparable consistency scores on the same benchmarks, or in which the full PAIWorld model produces new inconsistencies not seen in the baseline.
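
The experiment described above amounts to an ablation grid over the three components. A minimal sketch of enumerating that grid follows. The configuration keys are hypothetical labels chosen for illustration, not identifiers from the paper or its code.

```python
from itertools import product

# Hypothetical flag names for the three components under test.
COMPONENTS = ("cross_view_attention", "geometric_rope", "latent_3d_repa")

def ablation_variants():
    """Return every on/off combination of the three components.

    The first entry (all False) is the base model; the last entry
    (all True) is the full model.
    """
    return [dict(zip(COMPONENTS, flags))
            for flags in product((False, True), repeat=len(COMPONENTS))]

if __name__ == "__main__":
    for cfg in ablation_variants():
        enabled = [name for name, on in cfg.items() if on] or ["base only"]
        print(", ".join(enabled))
```

Evaluating all eight variants on the same consistency metrics would show whether any one- or two-component subset matches the full model, which is exactly the evidence the necessity claim needs.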

Original abstract

World foundation models (WFMs) are powerful simulators, yet they predominantly operate in a single-view setting and lack the multi-view 3D consistency required for robotic manipulation. While robotic systems rely on multiple cameras (egocentric, eye-to-hand, and wrist-mounted) for policy learning, current multi-view world models simply concatenate view tokens without explicit geometric reasoning. This causes cross-view object drift, depth inconsistency, and texture misalignment. We trace these failures to two deficiencies: the absence of an explicit inter-view communication mechanism and the lack of a 3D geometric prior. We argue that resolving both simultaneously is necessary and sufficient. To address this, we present PAIWorld, a framework that augments diffusion-transformer world models via three core components: (1) Geometry-Aware Cross-View Attention blocks that establish an explicit pathway across views, (2) Geometric Rotary Position Embedding that encodes camera ray directions and extrinsic poses into the attention mechanism, and (3) Latent 3D-REPA, which distills 3D-aware features from frozen 3D foundation models to ensure 3D consistency. Built upon a DiT-based world foundation model, PAIWorld achieves state-of-the-art multi-view 3D consistency on robotic manipulation benchmarks, ranking 1st on the WorldArena leaderboard and 2nd on the AgiBot-Challenge2026 leaderboard, while enabling downstream applications such as model-based planning, world action models, and multi-view policy post-training.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces PAIWorld, a DiT-based world foundation model for robotic manipulation that augments standard multi-view diffusion transformers with three components: Geometry-Aware Cross-View Attention blocks, Geometric Rotary Position Embedding (encoding camera rays and extrinsics), and Latent 3D-REPA (distilling from frozen 3D models). It identifies two root causes of cross-view drift, depth inconsistency, and texture misalignment—lack of explicit inter-view communication and absence of 3D geometric priors—and claims that addressing both simultaneously via these additions is necessary and sufficient. The resulting model reports SOTA multi-view 3D consistency, ranking 1st on WorldArena and 2nd on AgiBot-Challenge2026, with downstream uses in model-based planning, world action models, and multi-view policy post-training.

Significance. If the attribution of gains to the three components is empirically validated, the work would provide a concrete architectural recipe for injecting geometric consistency into world models, which is a load-bearing requirement for reliable multi-camera robotic simulation and policy learning. The leaderboard results and downstream application claims would then represent a meaningful advance over simple token concatenation baselines.

major comments (2)
  1. [Abstract and §4 (Experiments)] The central claim that the two deficiencies are the root causes and that the three components are simultaneously necessary and sufficient is not supported by ablation studies. No controlled experiments are described that compare the full PAIWorld model against ablated variants (e.g., without Geometry-Aware Cross-View Attention, without Geometric RoPE, or without Latent 3D-REPA) on the WorldArena consistency metrics or the reported leaderboard scores. Without such results, the attribution of the 1st/2nd place rankings and downstream improvements to this specific design cannot be substantiated.
  2. [§3 (Method)] The necessity argument, that inter-view communication and 3D priors must be resolved together, rests on the design of the three components, yet no analysis is provided showing that partial combinations fail to achieve comparable consistency or that alternative mechanisms (e.g., explicit depth supervision or different attention patterns) would not suffice. This leaves open the possibility that the observed gains arise from increased capacity rather than the claimed geometric mechanisms.
minor comments (1)
  1. [Abstract] The abstract and introduction would benefit from explicit definitions or citations for the WorldArena and AgiBot-Challenge2026 metrics used to claim SOTA multi-view 3D consistency.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive report. The two major comments correctly identify that the manuscript's central claims about root causes and the necessity/sufficiency of the three components rest on design motivation and overall results rather than direct empirical isolation. We address each point below and commit to revisions that add the requested controlled experiments.

Point-by-point responses
  1. Referee: [Abstract and §4 (Experiments)] The central claim that the two deficiencies are the root causes and that the three components are simultaneously necessary and sufficient is not supported by ablation studies. No controlled experiments are described that compare the full PAIWorld model against ablated variants (e.g., without Geometry-Aware Cross-View Attention, without Geometric RoPE, or without Latent 3D-REPA) on the WorldArena consistency metrics or the reported leaderboard scores. Without such results, the attribution of the 1st/2nd place rankings and downstream improvements to this specific design cannot be substantiated.

    Authors: We agree that the original submission lacks explicit ablation tables isolating each of the three components on the WorldArena metrics and leaderboard scores. The manuscript instead presents the joint model and its end-to-end gains. In the revision we will add a dedicated ablation subsection in §4 that reports performance for the base DiT, plus three single-component removals and two partial combinations, all evaluated on the same consistency metrics and downstream tasks. This will allow direct attribution of gains to the geometric mechanisms rather than capacity alone.

    Revision: yes

  2. Referee: [§3 (Method)] The necessity argument, that inter-view communication and 3D priors must be resolved together, rests on the design of the three components, yet no analysis is provided showing that partial combinations fail to achieve comparable consistency or that alternative mechanisms (e.g., explicit depth supervision or different attention patterns) would not suffice. This leaves open the possibility that the observed gains arise from increased capacity rather than the claimed geometric mechanisms.

    Authors: The necessity claim in the manuscript is presented as a design hypothesis rather than an experimentally proven theorem. While the joint architecture is shown to work, we did not include systematic comparisons of partial combinations or alternative priors such as explicit depth losses. In the revision we will expand §3 with a short discussion of why the chosen mechanisms address both deficiencies simultaneously and will reference the new ablation results (from the §4 revision) to show that removing any one component measurably degrades 3D consistency. We will also note that alternatives like depth supervision remain compatible future directions but are outside the current scope.

    Revision: yes

Circularity Check

0 steps flagged

No circularity; empirical claims rest on external benchmarks without self-referential reductions

Full rationale

The paper identifies deficiencies in prior multi-view world models, proposes three architectural components to address them, and reports empirical results on WorldArena and AgiBot-Challenge2026 leaderboards plus downstream tasks. No equations, parameter-fitting steps presented as predictions, or self-citation chains appear in the provided text that would make any central claim equivalent to its inputs by construction. The necessity-and-sufficiency argument is stated as a design hypothesis tested via benchmarks rather than a closed logical loop.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only; no explicit free parameters, axioms, or invented entities are described. The framework implicitly assumes standard properties of diffusion transformers and frozen 3D foundation models, but these are not enumerated.

pith-pipeline@v0.9.1-grok · 5905 in / 1180 out tokens · 39153 ms · 2026-06-27T00:16:22.305413+00:00 · methodology

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. PhysRAG: Enhancing Physics-Awareness in Video Generation via Retrieval-Augmented Generation

    cs.CV 2026-06 unverdicted novelty 5.0

    PhysRAG curates 7K videos from WISA-80K, builds a physical video database, and injects knowledge via learnable queries into a diffusion model to reach SOTA visual quality and physical compliance on PhyGenBench and VBench.
