PAIWorld: A 3D-Consistent World Foundation Model for Robotic Manipulation

Bin Wu; Chenyang Zhu; Chunhe Song; Douhui Wu; Haibin Yu; Hang Zhao; Hewen Xiao; Jiazhao Zhang; Junyan Xu; Juzhan Xu

arxiv: 2606.18375 · v3 · pith:LRGJVWMCnew · submitted 2026-06-16 · 💻 cs.RO

Yuhang Huang, Xuan Lv, Junyan Xu, Zhiyuan Yu, Jiazhao Zhang, Ruizhen Hu, Wancheng Feng, Shilong Zou, Hewen Xiao, Ziqiao Zhou, Kaiyun Huang, Zhiyu Peng, Juzhan Xu, Hang Zhao, Chenyang Zhu, Renjiao Yi, Yifei Huang, Douhui Wu, Yan Zhang, Kexu Cheng, Chunhe Song, Yunzhi Xue, Xiuhong Zhang, Leitao Guo, Yunji Chen, Bin Wu, Haibin Yu, Kai Xu

Pith reviewed 2026-06-27 00:16 UTC · model grok-4.3

keywords: world foundation models, multi-view 3D consistency, robotic manipulation, diffusion transformer, cross-view attention, geometric position embedding, 3D feature distillation, robotic simulation

The pith

PAIWorld adds inter-view attention, ray embeddings, and 3D distillation to diffusion world models to eliminate cross-view drift.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that multi-view world foundation models fail for robotics because they simply concatenate view tokens without mechanisms for communication between views or explicit 3D geometric knowledge. It identifies these two missing elements as the direct cause of object drift, depth errors, and texture misalignment across cameras. To fix both at once, the authors add three components to a standard diffusion-transformer base: attention blocks that let views exchange information, position embeddings that encode camera rays and poses, and a distillation step that pulls 3D features from frozen models. If the argument holds, robotic simulators can finally produce outputs that stay consistent no matter which camera is used, supporting more accurate planning and policy learning from multiple views.

Core claim

PAIWorld augments diffusion-transformer world models with three components: Geometry-Aware Cross-View Attention blocks that establish explicit pathways across views, Geometric Rotary Position Embedding that injects camera ray directions and extrinsic poses, and Latent 3D-REPA that distills 3D-aware features from frozen foundation models. The authors argue that addressing the absence of inter-view communication and the lack of a 3D geometric prior simultaneously is necessary and sufficient to resolve cross-view object drift, depth inconsistency, and texture misalignment. They report state-of-the-art multi-view 3D consistency on robotic manipulation benchmarks.
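To make "camera ray directions and extrinsic poses" concrete, the sketch below computes per-pixel unit ray directions in a shared world frame from a pinhole intrinsic matrix and a camera-to-world rotation. It is a minimal illustration and not PAIWorld's Geometric Rotary Position Embedding: the pinhole model, the function name `pixel_rays`, and the example intrinsics are all assumptions, and the text above does not say how the paper converts such rays into rotary phases.

```python
import numpy as np

def pixel_rays(K, R_wc, height, width):
    """Unit ray directions in the world frame for every pixel centre.

    K     : (3, 3) pinhole intrinsics (assumed camera model, not from the paper).
    R_wc  : (3, 3) camera-to-world rotation taken from the extrinsic pose.
    Returns an array of shape (height, width, 3).
    """
    # Pixel centres in homogeneous image coordinates (u, v, 1).
    u, v = np.meshgrid(np.arange(width) + 0.5, np.arange(height) + 0.5)
    pix = np.stack([u, v, np.ones_like(u)], axis=-1)  # (H, W, 3)

    # Back-project through the inverse intrinsics to camera-frame directions.
    dirs_cam = pix @ np.linalg.inv(K).T
    # Rotate into the world frame so rays from different cameras are comparable.
    dirs_world = dirs_cam @ R_wc.T
    return dirs_world / np.linalg.norm(dirs_world, axis=-1, keepdims=True)

# Hypothetical intrinsics for a 64x48 image; identity rotation for simplicity.
K = np.array([[100.0, 0.0, 32.0],
              [0.0, 100.0, 24.0],
              [0.0, 0.0, 1.0]])
rays = pixel_rays(K, np.eye(3), height=48, width=64)
```

Because the rays are expressed in one world frame, two pixels in different cameras that look along similar directions receive similar encodings, which is the kind of signal a geometry-aware position embedding could exploit.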

What carries the argument

The three components—Geometry-Aware Cross-View Attention blocks, Geometric Rotary Position Embedding, and Latent 3D-REPA—that together supply inter-view communication and 3D geometric priors inside a DiT-based world model.
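As a toy picture of what an "explicit pathway across views" could look like, the sketch below implements single-head attention in which each token attends only to tokens from the other cameras. This is one possible reading, offered as an assumption: the design of the paper's Geometry-Aware Cross-View Attention block is not described above, and the sketch omits learned query/key/value projections and any geometric bias.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis.
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def cross_view_attention(tokens):
    """Attention restricted to tokens from other views.

    tokens: (views, tokens_per_view, dim). Returns (output, weights), where
    weights has shape (views * tokens_per_view, views * tokens_per_view).
    """
    views, n, dim = tokens.shape
    flat = tokens.reshape(views * n, dim)

    # Mark token pairs that belong to the same camera.
    view_id = np.repeat(np.arange(views), n)
    same_view = view_id[:, None] == view_id[None, :]

    # Scaled dot-product scores; same-view pairs are masked out.
    scores = flat @ flat.T / np.sqrt(dim)
    scores = np.where(same_view, -np.inf, scores)
    weights = softmax(scores)

    out = (weights @ flat).reshape(views, n, dim)
    return out, weights

rng = np.random.default_rng(0)
tokens = rng.normal(size=(3, 4, 8))  # 3 cameras, 4 tokens each, 8 channels
out, weights = cross_view_attention(tokens)
```

Masking same-view pairs forces all information in the output to come from other cameras, which makes the inter-view pathway easy to inspect in isolation.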

If this is right

Multi-view world models can now maintain stable object positions, depths, and textures across egocentric, eye-to-hand, and wrist cameras.
Model-based planning gains reliability because simulated future states remain geometrically consistent from every viewpoint.
World action models can be trained directly on the consistent multi-view outputs without additional correction steps.
Multi-view policy post-training becomes feasible because the underlying simulator no longer drifts between training views.
The same base model reaches top leaderboard positions on WorldArena and AgiBot-Challenge2026 consistency metrics.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same three-component pattern could be tested in non-robotic multi-camera settings such as scene reconstruction or video surveillance to check whether the consistency gains transfer.
If the geometric prior is the dominant factor, future models might achieve similar results with lighter attention mechanisms by increasing reliance on the distillation step.
The approach suggests a modular way to retrofit existing single-view world models rather than training entirely new architectures from scratch.

Load-bearing premise

The two identified deficiencies are the root causes of the observed inconsistencies, and simultaneously adding the three proposed components is necessary and sufficient to resolve them without introducing new problems.

What would settle it

An experiment in which a model using only one or two of the three components reaches comparable consistency scores on the same benchmarks, or in which the full PAIWorld model produces new inconsistencies not seen in the baseline.
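The experiment described above amounts to evaluating every on/off combination of the three components. A short sketch of that grid follows; the component identifiers are hypothetical labels chosen for illustration, not names from the paper's code.

```python
from itertools import product

# Hypothetical identifiers for the three components named in the paper.
COMPONENTS = ("cross_view_attention", "geometric_rope", "latent_3d_repa")

def ablation_grid(components=COMPONENTS):
    """Enumerate every on/off combination of the components."""
    grid = []
    for flags in product((False, True), repeat=len(components)):
        enabled = tuple(c for c, on in zip(components, flags) if on)
        grid.append(enabled)
    # Order by number of enabled components: baseline first, full model last.
    return sorted(grid, key=len)

grid = ablation_grid()
for config in grid:
    print(config if config else "(baseline: no added components)")
```

Three components give eight configurations: the baseline, three single-component models, three pairs, and the full model. Comparable scores from any strict subset would count against necessity.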

Original abstract

World foundation models (WFMs) are powerful simulators, yet they predominantly operate in a single-view setting and lack the multi-view 3D consistency required for robotic manipulation. While robotic systems rely on multiple cameras (egocentric, eye-to-hand, and wrist-mounted) for policy learning, current multi-view world models simply concatenate view tokens without explicit geometric reasoning. This causes cross-view object drift, depth inconsistency, and texture misalignment. We trace these failures to two deficiencies: the absence of an explicit inter-view communication mechanism and the lack of a 3D geometric prior. We argue that resolving both simultaneously is necessary and sufficient. To address this, we present PAIWorld, a framework that augments diffusion-transformer world models via three core components: (1) Geometry-Aware Cross-View Attention blocks that establish an explicit pathway across views, (2) Geometric Rotary Position Embedding that encodes camera ray directions and extrinsic poses into the attention mechanism, and (3) Latent 3D-REPA, which distills 3D-aware features from frozen 3D foundation models to ensure 3D consistency. Built upon a DiT-based world foundation model, PAIWorld achieves state-of-the-art multi-view 3D consistency on robotic manipulation benchmarks, ranking 1st on the WorldArena leaderboard and 2nd on the AgiBot-Challenge2026 leaderboard, while enabling downstream applications such as model-based planning, world action models, and multi-view policy post-training.
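The distillation idea behind component (3) can be illustrated with a REPA-style alignment objective: project the generator's latent tokens into the feature space of a frozen teacher and maximize cosine similarity token by token. The sketch below is an assumption-laden illustration; the abstract does not give the exact loss used by Latent 3D-REPA, the single linear projection `proj` is a hypothetical stand-in for whatever projection head the authors use, and the random arrays stand in for real features.

```python
import numpy as np

def alignment_loss(latent_feats, teacher_feats, proj):
    """Negative mean cosine similarity between student and teacher tokens.

    latent_feats : (tokens, d_student) features from the diffusion model.
    teacher_feats: (tokens, d_teacher) features from a frozen 3D model.
    proj         : (d_student, d_teacher) linear projection (hypothetical).
    """
    student = latent_feats @ proj
    # Normalize both sides so the dot product is a cosine similarity.
    student = student / np.linalg.norm(student, axis=-1, keepdims=True)
    teacher = teacher_feats / np.linalg.norm(teacher_feats, axis=-1, keepdims=True)
    return -np.mean(np.sum(student * teacher, axis=-1))

rng = np.random.default_rng(0)
feats = rng.normal(size=(16, 32))
loss_aligned = alignment_loss(feats, feats, np.eye(32))    # perfectly aligned
loss_opposed = alignment_loss(feats, -feats, np.eye(32))   # perfectly opposed
```

The loss is bounded between -1 (perfect alignment) and 1 (opposite directions), and in training only the student side and the projection would receive gradients because the teacher is frozen.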

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

PAIWorld adds three geometric components to a DiT world model and reports SOTA multi-view consistency on robot benchmarks, but the claim that exactly those components are necessary and sufficient lacks direct ablation support.


PAIWorld adds Geometry-Aware Cross-View Attention, Geometric Rotary Position Embedding, and Latent 3D-REPA to a DiT-based world foundation model. The authors say these fix the lack of inter-view communication and 3D geometric priors that cause drift and misalignment when views are simply concatenated.

The combination itself is the main new element. The paper does a clear job naming the practical failure modes that matter for real robot setups with multiple cameras and links the method to downstream uses like planning and policy post-training.

The soft spot is the load-bearing claim that both deficiencies must be fixed simultaneously and that these three additions are the right way to do it. The abstract presents this as necessary and sufficient, yet the stress-test note is right that controlled ablations isolating each piece or testing partial combinations are not visible here. Leaderboard rankings are concrete outcomes, but without those experiments the attribution stays provisional.

This is for people working on world models for manipulation who already use DiT backbones and need multi-view consistency. A reader who wants a concrete recipe to try on their own simulator would get value from the component descriptions and benchmark numbers.

It deserves peer review because the problem is real for robotics and the approach is specific enough to test, though the experimental section will need more controls to back the causal story.

Referee Report

2 major / 1 minor

Summary. The paper introduces PAIWorld, a DiT-based world foundation model for robotic manipulation that augments standard multi-view diffusion transformers with three components: Geometry-Aware Cross-View Attention blocks, Geometric Rotary Position Embedding (encoding camera rays and extrinsics), and Latent 3D-REPA (distilling from frozen 3D models). It identifies two root causes of cross-view drift, depth inconsistency, and texture misalignment—lack of explicit inter-view communication and absence of 3D geometric priors—and claims that addressing both simultaneously via these additions is necessary and sufficient. The resulting model reports SOTA multi-view 3D consistency, ranking 1st on WorldArena and 2nd on AgiBot-Challenge2026, with downstream uses in model-based planning, world action models, and multi-view policy post-training.

Significance. If the attribution of gains to the three components is empirically validated, the work would provide a concrete architectural recipe for injecting geometric consistency into world models, which is a load-bearing requirement for reliable multi-camera robotic simulation and policy learning. The leaderboard results and downstream application claims would then represent a meaningful advance over simple token concatenation baselines.

major comments (2)

[Abstract and §4 (Experiments)] The central claim that the two deficiencies are the root causes and that the three components are simultaneously necessary and sufficient is not supported by ablation studies. No controlled experiments are described that compare the full PAIWorld model against ablated variants (e.g., without Geometry-Aware Cross-View Attention, without Geometric RoPE, or without Latent 3D-REPA) on the WorldArena consistency metrics or the reported leaderboard scores. Without such results, the attribution of the 1st/2nd place rankings and downstream improvements to this specific design cannot be substantiated.
[§3 (Method)] The necessity argument, that inter-view communication and 3D priors must be resolved together, rests on the design of the three components, yet no analysis is provided showing that partial combinations fail to achieve comparable consistency or that alternative mechanisms (e.g., explicit depth supervision or different attention patterns) would not suffice. This leaves open the possibility that the observed gains arise from increased capacity rather than the claimed geometric mechanisms.

minor comments (1)

[Abstract] The abstract and introduction would benefit from explicit definitions or citations for the WorldArena and AgiBot-Challenge2026 metrics used to claim SOTA multi-view 3D consistency.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive report. The two major comments correctly identify that the manuscript's central claims about root causes and the necessity/sufficiency of the three components rest on design motivation and overall results rather than direct empirical isolation. We address each point below and commit to revisions that add the requested controlled experiments.


Referee: [Abstract and §4 (Experiments)] The central claim that the two deficiencies are the root causes and that the three components are simultaneously necessary and sufficient is not supported by ablation studies. No controlled experiments are described that compare the full PAIWorld model against ablated variants (e.g., without Geometry-Aware Cross-View Attention, without Geometric RoPE, or without Latent 3D-REPA) on the WorldArena consistency metrics or the reported leaderboard scores. Without such results, the attribution of the 1st/2nd place rankings and downstream improvements to this specific design cannot be substantiated.

Authors: We agree that the original submission lacks explicit ablation tables isolating each of the three components on the WorldArena metrics and leaderboard scores. The manuscript instead presents the joint model and its end-to-end gains. In the revision we will add a dedicated ablation subsection in §4 that reports performance for the base DiT, plus three single-component removals and two partial combinations, all evaluated on the same consistency metrics and downstream tasks. This will allow direct attribution of gains to the geometric mechanisms rather than capacity alone.

revision: yes

Referee: [§3 (Method)] The necessity argument, that inter-view communication and 3D priors must be resolved together, rests on the design of the three components, yet no analysis is provided showing that partial combinations fail to achieve comparable consistency or that alternative mechanisms (e.g., explicit depth supervision or different attention patterns) would not suffice. This leaves open the possibility that the observed gains arise from increased capacity rather than the claimed geometric mechanisms.

Authors: The necessity claim in the manuscript is presented as a design hypothesis rather than an experimentally proven theorem. While the joint architecture is shown to work, we did not include systematic comparisons of partial combinations or alternative priors such as explicit depth losses. In the revision we will expand §3 with a short discussion of why the chosen mechanisms address both deficiencies simultaneously and will reference the new ablation results (from the §4 revision) to show that removing any one component measurably degrades 3D consistency. We will also note that alternatives like depth supervision remain compatible future directions but are outside the current scope.

revision: yes

Circularity Check

0 steps flagged

No circularity; empirical claims rest on external benchmarks without self-referential reductions


The paper identifies deficiencies in prior multi-view world models, proposes three architectural components to address them, and reports empirical results on WorldArena and AgiBot-Challenge2026 leaderboards plus downstream tasks. No equations, parameter-fitting steps presented as predictions, or self-citation chains appear in the provided text that would make any central claim equivalent to its inputs by construction. The necessity-and-sufficiency argument is stated as a design hypothesis tested via benchmarks rather than a closed logical loop.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only; no explicit free parameters, axioms, or invented entities are described. The framework implicitly assumes standard properties of diffusion transformers and frozen 3D foundation models, but these are not enumerated.

pith-pipeline@v0.9.1-grok · 5905 in / 1180 out tokens · 39153 ms · 2026-06-27T00:16:22.305413+00:00 · methodology


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

PhysRAG: Enhancing Physics-Awareness in Video Generation via Retrieval-Augmented Generation
cs.CV · 2026-06 · unverdicted · novelty 5.0

PhysRAG curates 7K videos from WISA-80K, builds a physical video database, and injects knowledge via learnable queries into a diffusion model to reach SOTA visual quality and physical compliance on PhyGenBench and VBench.

Reference graph

Works this paper leans on

61 extracted references · 15 linked inside Pith · cited by 1 Pith paper

[1]

World models.arXiv preprint arXiv:1803.10122, 2018

David Ha and Jürgen Schmidhuber. World models.arXiv preprint arXiv:1803.10122, 2018

Pith/arXiv arXiv 2018
[2]

Dream to control: Learning behaviors by latent imagination

Danijar Hafner, Timothy Lillicrap, Jimmy Ba, and Mohammad Norouzi. Dream to control: Learning behaviors by latent imagination. InInternational Conference on Learning Representations (ICLR), 2020

2020
[3]

Cosmos world foundation model platform for physical AI.arXiv preprint arXiv:2501.03575, 2025

NVIDIA,NiketAgarwal, ArslanAli, MaciejBala, YogeshBalaji, ErikBarker, TiffanyCai, PrithvijitChattopadhyay, Yongxin Chen, Yin Cui, et al. Cosmos world foundation model platform for physical AI.arXiv preprint arXiv:2501.03575, 2025

Pith/arXiv arXiv 2025
[4]

Cosmos 3: Omnimodal world models for physical AI.arXiv preprint arXiv:2606.02800, 2026

NVIDIA. Cosmos 3: Omnimodal world models for physical AI.arXiv preprint arXiv:2606.02800, 2026

Pith/arXiv arXiv 2026
[5]

Sora: A review on background, technology, limitations, and opportunities of large vision models.arXiv preprint arXiv:2402.17177, 2024

Yixin Liu, Kai Zhang, Yuan Li, Zhiling Yan, Chujie Gao, Ruoxi Chen, Zhengqing Yuan, Yue Huang, Hanchi Sun, Jianfeng Gao, Lifang He, and Lichao Sun. Sora: A review on background, technology, limitations, and opportunities of large vision models.arXiv preprint arXiv:2402.17177, 2024

Pith/arXiv arXiv 2024
[6]

CogVideoX: Text-to-video diffusion models with an expert transformer

Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, et al. CogVideoX: Text-to-video diffusion models with an expert transformer. In International Conference on Learning Representations (ICLR), 2025

2025
[7]

Vista: A generalizable driving world model with high fidelity and versatile controllability

Shenyuan Gao, Jiazhi Yang, Li Chen, Kashyap Chitta, Yihang Qiu, Andreas Geiger, Jun Zhang, and Hongyang Li. Vista: A generalizable driving world model with high fidelity and versatile controllability. InAdvances in Neural Information Processing Systems, volume 37, 2024

2024
[8]

HunyuanVideo: A systematic framework for large video generative models.arXiv preprint arXiv:2412.03603, 2024

Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, et al. HunyuanVideo: A systematic framework for large video generative models.arXiv preprint arXiv:2412.03603, 2024

Pith/arXiv arXiv 2024
[9]

Wan: Open and advanced large-scale video generative models.arXiv preprint arXiv:2503.20314, 2025

Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianx- iao Yang, et al. Wan: Open and advanced large-scale video generative models.arXiv preprint arXiv:2503.20314, 2025

Pith/arXiv arXiv 2025
[10]

DayDreamer: World models for physical robot learning

Philipp Wu, Alejandro Escontrela, Danijar Hafner, Ken Goldberg, and Pieter Abbeel. DayDreamer: World models for physical robot learning. InProceedings of the Conference on Robot Learning (CoRL), volume 205, pages 2226–2240, 2022

2022
[11]

Learning universal policies via text-guided video generation

Yilun Du, Sherry Yang, Bo Dai, Hanjun Dai, Ofir Nachum, Josh Tenenbaum, Dale Schuurmans, and Pieter Abbeel. Learning universal policies via text-guided video generation. InAdvances in Neural Information Processing Systems, volume 36, 2023

2023
[12]

RoboDreamer: Learning compositional world models for robot imagination

Siyuan Zhou, Yilun Du, Jiaben Chen, Yandong Li, Dit-Yan Yeung, and Chuang Gan. RoboDreamer: Learning compositional world models for robot imagination. InProceedings of the 41st International Conference on Machine Learning (ICML), 2024

2024
[13]

LaDi-WM: A latent diffusion-based world model for predictive manipulation.arXiv preprint arXiv:2505.11528, 2025

Yuhang Huang, Jiazhao Zhang, Shilong Zou, Xinwang Liu, Ruizhen Hu, and Kai Xu. LaDi-WM: A latent diffusion-based world model for predictive manipulation.arXiv preprint arXiv:2505.11528, 2025

arXiv 2025
[14]

Scalable diffusion models with transformers

William Peebles and Saining Xie. Scalable diffusion models with transformers. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 4195–4205, 2023

2023
[15]

Pelican-Unified 1.0: A unified embodied intelligence model for understanding, reasoning, imagination and action.arXiv preprint arXiv:2605.15153, 2026

Yi Zhang, Yinda Chen, Che Liu, Zeyuan Ding, Jin Xu, Shilong Zou, Junwei Liao, Jiayu Hu, Xiancong Ren, Xiaopeng Zhang, et al. Pelican-Unified 1.0: A unified embodied intelligence model for understanding, reasoning, imagination and action.arXiv preprint arXiv:2605.15153, 2026. 17

Pith/arXiv arXiv 2026
[16]

RT-1: Robotics transformer for real-world control at scale

Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Joseph Dabis, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, Jasmine Hsu, et al. RT-1: Robotics transformer for real-world control at scale. InRobotics: Science and Systems (RSS), 2023

2023
[17]

RT-2: Vision-language-action models transfer web knowledge to robotic control

Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Xi Chen, Krzysztof Choromanski, Tianli Ding, Danny Driess, Avinava Dubey, Chelsea Finn, et al. RT-2: Vision-language-action models transfer web knowledge to robotic control. InProceedings of the Conference on Robot Learning (CoRL), volume 229, pages 2165–2183, 2023

2023
[18]

Open X-embodiment: Robotic learning datasets and RT-X models

Open X-Embodiment Collaboration. Open X-embodiment: Robotic learning datasets and RT-X models. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), 2024

2024
[19]

Diffusion policy: Visuomotor policy learning via action diffusion

Cheng Chi, Siyuan Feng, Yilun Du, Zhenjia Xu, Eric Cousineau, Benjamin Burchfiel, and Shuran Song. Diffusion policy: Visuomotor policy learning via action diffusion. InRobotics: Science and Systems (RSS), 2023

2023
[20]

Genie: Generative interactive environments

Jake Bruce, Michael Dennis, Ashley Edwards, Jack Parker-Holder, Yuge Shi, Edward Hughes, Matthew Lai, Aditi Mavalankar, Richie Steigerwald, Chris Apps, et al. Genie: Generative interactive environments. InProceedings of the 41st International Conference on Machine Learning, 2024

2024
[21]

iVideoGPT: Interactive VideoGPTs are scalable world models

Jialong Wu, Shaofeng Yin, Ningya Feng, Xu He, Dong Li, Jianye Hao, and Mingsheng Long. iVideoGPT: Interactive VideoGPTs are scalable world models. InAdvances in Neural Information Processing Systems, volume 37, 2024

2024
[22]

RoFormer: Enhanced transformer with rotary position embedding.Neurocomputing, 568:127063, 2024

Jianlin Su, Yu Lu, Shengfeng Pan, Bo Wen, and Yunfeng Liu. RoFormer: Enhanced transformer with rotary position embedding.Neurocomputing, 568:127063, 2024

2024
[23]

CameraCtrl: Enabling camera control for video diffusion models

Hao He, Yinghao Xu, Yuwei Guo, Gordon Wetzstein, Bo Dai, Hongsheng Li, and Ceyuan Yang. CameraCtrl: Enabling camera control for video diffusion models. InInternational Conference on Learning Representations (ICLR), 2025

2025
[24]

REPA: Representation alignment for generation: Training diffusion transformers is easier than you think

Sihyun Yu, Sangkyung Kwak, Huiwon Jang, Jongheon Jeong, Jonathan Huang, Jinwoo Shin, and Saining Xie. REPA: Representation alignment for generation: Training diffusion transformers is easier than you think. In International Conference on Learning Representations, 2025

2025
[25]

Stable video diffusion: Scaling latent video diffusion models to large datasets.arXiv preprint arXiv:2311.15127, 2023

Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram Voleti, Adam Letts, et al. Stable video diffusion: Scaling latent video diffusion models to large datasets.arXiv preprint arXiv:2311.15127, 2023

Pith/arXiv arXiv 2023
[26]

DIAMOND: Diffusion for world modeling: Visual details matter in atari

Eloi Alonso, Adam Jelley, Vincent Micheli, Anssi Kanervisto, Amos Storkey, Tim Pearce, and Francois Fleuret. DIAMOND: Diffusion for world modeling: Visual details matter in atari. InAdvances in Neural Information Processing Systems, volume 37, 2024

2024
[27]

Learning interactive real-world simulators

Sherry Yang, Yilun Du, Kamyar Ghasemipour, Jonathan Tompson, Leslie Kaelbling, Dale Schuurmans, and Pieter Abbeel. Learning interactive real-world simulators. InInternational Conference on Learning Representations (ICLR), 2024

2024
[28]

Genie 2: A large-scale foundation world model

Jack Parker-Holder, Stephen Spencer, Philip Ball, Jake Bruce, Vibhavari Dasagi, Kristian Holsheimer, Christos Ka- planis, Alexandre Moufarek, Guy Scully, Jeremy Shar, et al. Genie 2: A large-scale foundation world model. Google DeepMind Blog, 2024.https://deepmind.google/blog/genie-2-a-large-scale-foundation-world-model/

2024
[29]

GR-2: A generative video-language-action model with web-scale knowledge for robot manipulation.arXiv preprint arXiv:2410.06158, 2024

Chi-Lam Cheang, Guangzeng Chen, Ya Jing, Tao Kong, Hang Li, Yifeng Li, Yuxiao Liu, Hongtao Wu, Jiafeng Xu, Yichu Yang, Hanbo Zhang, and Minzhao Zhu. GR-2: A generative video-language-action model with web-scale knowledge for robot manipulation.arXiv preprint arXiv:2410.06158, 2024

Pith/arXiv arXiv 2024
[30]

IRASim: Learning interactive real-robot action simulators

Fangqi Zhu, Hongtao Wu, Song Guo, et al. IRASim: Learning interactive real-robot action simulators. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2025

2025
[31]

EnerVerse: Envisioning embodied future space for robotics manipulation

Siyuan Huang, Liliang Chen, Pengfei Zhou, Shengcong Chen, Zhengkai Jiang, Yue Hu, Yue Liao, Peng Gao, Hongsheng Li, Maoqing Yao, and Guanghui Ren. EnerVerse: Envisioning embodied future space for robotics manipulation. InAdvances in Neural Information Processing Systems, 2025

2025
[32]

AgiBot World Colosseo: A large-scale manipulation platform for scalable and intelligent embodied systems

Qingwen Bu, Jisong Cai, Li Chen, Xiuqi Cui, Yan Ding, Siyuan Feng, Shenyuan Gao, Xindong He, Xuan Hu, Xu Huang, et al. AgiBot World Colosseo: A large-scale manipulation platform for scalable and intelligent embodied systems. InProceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2025. 18

2025
[33]

WorldArena: A unified benchmark for evaluating perception and functional utility of embodied world models.arXiv preprint arXiv:2602.08971, 2026

Yu Shang, Zhuohang Li, Yiding Ma, Weikang Su, Xin Jin, Ziyou Wang, Lei Jin, Xin Zhang, Yinzhou Tang, Haisheng Su, et al. WorldArena: A unified benchmark for evaluating perception and functional utility of embodied world models.arXiv preprint arXiv:2602.08971, 2026

arXiv 2026
[34]

Zero-1-to-3: Zero-shot one image to 3D object

Ruoshi Liu, Rundi Wu, Basile Van Hoorick, Pavel Tokmakov, Sergey Zakharov, and Carl Vondrick. Zero-1-to-3: Zero-shot one image to 3D object. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 9298–9309, 2023

2023
[35]

SyncDreamer: Generating multiview-consistent images from a single-view image

Yuan Liu, Cheng Lin, Zijiao Zeng, Xiaoxiao Long, Lingjie Liu, Taku Komura, and Wenping Wang. SyncDreamer: Generating multiview-consistent images from a single-view image. InInternational Conference on Learning Representations (ICLR), 2024

2024
[36]

MVDream: Multi-view diffusion for 3D generation

Yichun Shi, Peng Wang, Jianglong Ye, Long Mai, Kejie Li, and Xiao Yang. MVDream: Multi-view diffusion for 3D generation. InInternational Conference on Learning Representations, 2024

2024
[37]

MVDiffusion: Enabling holistic multi-view image generation with correspondence-aware diffusion

Shitao Tang, Fuyang Zhang, Jiacheng Chen, Peng Wang, and Yasutaka Furukawa. MVDiffusion: Enabling holistic multi-view image generation with correspondence-aware diffusion. InAdvances in Neural Information Processing Systems, volume 36, 2023

2023
[38]

SV3D: Novel multi-view synthesis and 3D generation from a single image using latent video diffusion

Vikram Voleti, Chun-Han Yao, Mark Boss, Adam Letts, David Pankratz, Dmitrii Tochilkin, Christian Laforte, Robin Rombach, and Varun Jampani. SV3D: Novel multi-view synthesis and 3D generation from a single image using latent video diffusion. InProceedings of the European Conference on Computer Vision (ECCV), 2024

2024
[39]

SV4D: Dynamic 3D content generation with multi-frame and multi-view consistency.arXiv preprint arXiv:2407.17470, 2024

Yiming Xie, Chun-Han Yao, Vikram Voleti, Huaizu Jiang, and Varun Jampani. SV4D: Dynamic 3D content generation with multi-frame and multi-view consistency.arXiv preprint arXiv:2407.17470, 2024

arXiv 2024
[40]

Barron, and Ben Poole

Ruiqi Gao, Aleksander Holynski, Philipp Henzler, Arthur Brussee, Ricardo Martin-Brualla, Pratul Srinivasan, Jonathan T. Barron, and Ben Poole. CAT3D: Create anything in 3D with multi-view diffusion models. In Advances in Neural Information Processing Systems, volume 37, 2024

2024
[41]

ViewCrafter: Taming video diffusion models for high-fidelity novel view synthesis

Wangbo Yu, Jinbo Xing, Li Yuan, Wenbo Hu, Xiaoyu Li, Zhipeng Huang, Xiangjun Gao, Tien-Tsin Wong, Ying Shan, and Yonghong Tian. ViewCrafter: Taming video diffusion models for high-fidelity novel view synthesis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025

2025
[42]

Srinivasan, Matthew Tancik, Jonathan T

Ben Mildenhall, Pratul P. Srinivasan, Matthew Tancik, Jonathan T. Barron, Ravi Ramamoorthi, and Ren Ng. NeRF: Representing scenes as neural radiance fields for view synthesis. InProceedings of the European Conference on Computer Vision (ECCV), pages 405–421, 2020

2020
[43]

3D gaussian splatting for real-time radiance field rendering.ACM Transactions on Graphics, 42(4):1–14, 2023

Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis. 3D gaussian splatting for real-time radiance field rendering.ACM Transactions on Graphics, 42(4):1–14, 2023

2023
[44]

DUSt3R: Geometric 3D vision made easy

Shuzhe Wang, Vincent Leroy, Yohann Cabon, Boris Chidlovskii, and Jerome Revaud. DUSt3R: Geometric 3D vision made easy. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 20697–20709, 2024

2024
[45]

Grounding image matching in 3D with MASt3R

Vincent Leroy, Yohann Cabon, and Jerome Revaud. Grounding image matching in 3D with MASt3R. In Proceedings of the European Conference on Computer Vision (ECCV), 2024

2024
[46]

Depth Anything: Unleashing the power of large-scale unlabeled data

Lihe Yang, Bingyi Kang, Zilong Huang, Xiaogang Xu, Jiashi Feng, and Hengshuang Zhao. Depth Anything: Unleashing the power of large-scale unlabeled data. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10371–10381, 2024

2024
[47]

Depth Anything V2

Lihe Yang, Bingyi Kang, Zilong Huang, Zhen Zhao, Xiaogang Xu, Jiashi Feng, and Hengshuang Zhao. Depth Anything V2. InAdvances in Neural Information Processing Systems, volume 37, 2024

2024
[48]

Chen, Zhenyu Li, Guang Shi, Jiashi Feng, and Bingyi Kang

Haotong Lin, Sili Chen, Jun Hao Liew, Donny Y. Chen, Zhenyu Li, Guang Shi, Jiashi Feng, and Bingyi Kang. Depth Anything 3: Recovering the visual space from any views. InInternational Conference on Learning Representations (ICLR), 2026

2026
[49]

VGGT: Visual geometry grounded transformer

Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. VGGT: Visual geometry grounded transformer. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025

2025
[50]

Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling. In International Conference on Learning Representations (ICLR), 2023

[51]

Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. In Proceedings of the 41st International Conference on Machine Learning (ICML), 2024

[52]

Yuxin Jiang, Shengcong Chen, Siyuan Huang, Liliang Chen, Pengfei Zhou, Yue Liao, Xindong He, Chiming Liu, Hongsheng Li, Maoqing Yao, and Guanghui Ren. EnerVerse-AC: Envisioning embodied environments with action condition. arXiv preprint arXiv:2505.09723, 2025

[53]

NVIDIA. World simulation with video foundation models for physical AI. arXiv preprint arXiv:2511.00062, 2025

[54]

NVIDIA. Cosmos-Reason1: From physical common sense to embodied reasoning. arXiv preprint arXiv:2503.15558, 2025

[55]

Kun Wu, Chengkai Hou, Jiaming Liu, Zhengping Che, Xiaozhu Ju, Zhuqin Yang, Meng Li, Yinuo Zhao, Zhiyuan Xu, Guang Yang, et al. RoboMIND: Benchmark on multi-embodiment intelligence normative data for robot manipulation. arXiv preprint arXiv:2412.13877, 2024

[56]

Tao Jiang, Tianyuan Yuan, Yicheng Liu, Chenhao Lu, Jianning Cui, Xiao Liu, Shuiqi Cheng, Jiyang Gao, Huazhe Xu, and Hang Zhao. Galaxea open-world dataset and G0 dual-system VLA model. arXiv preprint arXiv:2509.00576, 2025

[57]

Yao Mu, Tianxing Chen, Zeyu Gao, Zhiqian Lan, Yude Zou, Lunkai Lin, Zhiqiang Xie, and Ping Luo. RoboTwin: Dual-arm robot benchmark with generative digital twins. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025

[58]

Shihan Wu, Xuecheng Liu, Shaoxuan Xie, Pengwei Wang, Xinghang Li, et al. RoboCOIN: An open-sourced bimanual robotic data collection for integrated manipulation. arXiv preprint arXiv:2511.17441, 2025

[59]

MotuBrain Team, Chendong Xiang, Fan Bao, Haitian Liu, Hengkai Tan, Hongzhe Bi, James Li, Jiabao Liu, Jingrui Pang, Kiro Jing, Louis Liu, Mengchen Cai, Rongxu Cui, Ruowen Zhao, Runqing Wang, Shuhe Huang, Yao Feng, Yinze Rong, Zeyuan Wang, and Jun Zhu. Motubrain: An advanced world action model for robot control,

[60]

URL https://arxiv.org/abs/2604.27792

[61]

Yue Liao, Yuxin Jiang, Liliang Chen, Siyuan Huang, Pengfei Zhou, Shengcong Chen, Chiming Liu, Xindong He, Yi Liu, Maoqing Yao, Guanghui Ren, and Hongsheng Li. Genie Envisioner: A unified world foundation platform for robotic manipulation. arXiv preprint arXiv:2508.05635, 2025
