PAIWorld: A 3D-Consistent World Foundation Model for Robotic Manipulation
Pith reviewed 2026-06-27 00:16 UTC · model grok-4.3
The pith
PAIWorld adds inter-view attention, ray embeddings and 3D distillation to diffusion world models to eliminate cross-view drift.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
PAIWorld augments diffusion-transformer world models with three components: Geometry-Aware Cross-View Attention blocks that establish explicit pathways across views, Geometric Rotary Position Embedding that injects camera ray directions and extrinsic poses, and Latent 3D-REPA that distills 3D-aware features from frozen foundation models. The authors argue that simultaneously addressing the absence of inter-view communication and the lack of a 3D geometric prior is necessary and sufficient to resolve cross-view object drift, depth inconsistency, and texture misalignment, and they report state-of-the-art multi-view 3D consistency on robotic manipulation benchmarks.
What carries the argument
The three components—Geometry-Aware Cross-View Attention blocks, Geometric Rotary Position Embedding, and Latent 3D-REPA—that together supply inter-view communication and 3D geometric priors inside a DiT-based world model.
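As a rough illustration of what injecting camera ray directions into a rotary embedding can mean, the sketch below computes a world-frame ray for one pixel of a pinhole camera and rotates feature pairs by angles derived from that ray. The paper's actual Geometric Rotary Position Embedding is not specified in the text above; the pinhole model, the function names, and the azimuth/elevation mapping in `ray_angles` are all assumptions made for illustration.

```python
import math

def pixel_ray(u, v, fx, fy, cx, cy, R_c2w):
    """Unit ray direction in the world frame for pixel (u, v) of a pinhole camera.

    R_c2w is a 3x3 camera-to-world rotation given as nested lists.
    """
    d = [(u - cx) / fx, (v - cy) / fy, 1.0]
    n = math.sqrt(sum(c * c for c in d))
    d = [c / n for c in d]
    return [sum(R_c2w[i][j] * d[j] for j in range(3)) for i in range(3)]

def ray_angles(ray):
    """Hypothetical mapping from a unit ray to two rotation angles (azimuth, elevation)."""
    dx, dy, dz = ray
    return [math.atan2(dx, dz), math.asin(max(-1.0, min(1.0, dy)))]

def rotate_pairs(x, angles):
    """RoPE-style rotation: rotate each pair (x[2i], x[2i+1]) by angles[i].

    Assumes len(x) == 2 * len(angles).
    """
    out = []
    for i, a in enumerate(angles):
        x0, x1 = x[2 * i], x[2 * i + 1]
        out.append(x0 * math.cos(a) - x1 * math.sin(a))
        out.append(x0 * math.sin(a) + x1 * math.cos(a))
    return out

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))
```

When the same rotation scheme is applied to queries and keys, their dot product depends only on the difference between the two tokens' angles. That relative property is what makes a rotary embedding a plausible carrier for geometric relations between views.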
If this is right
- Multi-view world models can now maintain stable object positions, depths, and textures across egocentric, eye-to-hand, and wrist cameras.
- Model-based planning gains reliability because simulated future states remain geometrically consistent from every viewpoint.
- World action models can be trained directly on the consistent multi-view outputs without additional correction steps.
- Multi-view policy post-training becomes feasible because the underlying simulator no longer drifts between training views.
- The same base model ranks 1st on the WorldArena leaderboard and 2nd on the AgiBot-Challenge2026 leaderboard.
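One way to make "stable object positions across views" measurable is a reprojection check: lift a pixel from one view to 3D using its depth, move it into a second camera, and compare against where the second view actually shows the object. The sketch below is a generic pinhole-camera version of that idea; it is not the metric used by WorldArena or AgiBot-Challenge2026, and every function name and the `(R, t)` camera-to-world pose convention are assumptions.

```python
import math

def unproject(u, v, depth, K):
    """Pixel plus depth to a 3D point in the camera frame. K = (fx, fy, cx, cy)."""
    fx, fy, cx, cy = K
    return [(u - cx) / fx * depth, (v - cy) / fy * depth, depth]

def cam_to_world(X, R, t):
    """Apply a camera-to-world pose: R @ X + t."""
    return [sum(R[i][j] * X[j] for j in range(3)) + t[i] for i in range(3)]

def world_to_cam(X, R, t):
    """Invert a camera-to-world pose: R^T @ (X - t)."""
    d = [X[i] - t[i] for i in range(3)]
    return [sum(R[j][i] * d[j] for j in range(3)) for i in range(3)]

def project(X, K):
    """3D camera-frame point to pixel. Assumes the point is in front of the camera (X[2] > 0)."""
    fx, fy, cx, cy = K
    return (fx * X[0] / X[2] + cx, fy * X[1] / X[2] + cy)

def cross_view_drift(pix_a, depth_a, pix_b, K_a, pose_a, K_b, pose_b):
    """Pixel distance between the object's location in view B (pix_b)
    and the location implied by view A's pixel and depth."""
    X_w = cam_to_world(unproject(pix_a[0], pix_a[1], depth_a, K_a), *pose_a)
    expected = project(world_to_cam(X_w, *pose_b), K_b)
    return math.hypot(expected[0] - pix_b[0], expected[1] - pix_b[1])
```

A drift near zero for tracked points across the egocentric, eye-to-hand, and wrist cameras would be consistent with the stability described in the list above; a growing value over predicted frames would indicate cross-view drift.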
Where Pith is reading between the lines
- The same three-component pattern could be tested in non-robotic multi-camera settings such as scene reconstruction or video surveillance to check whether the consistency gains transfer.
- If the geometric prior is the dominant factor, future models might achieve similar results with lighter attention mechanisms by increasing reliance on the distillation step.
- The approach suggests a modular way to retrofit existing single-view world models rather than training entirely new architectures from scratch.
Load-bearing premise
The two identified deficiencies are the root causes of the observed inconsistencies, and simultaneously adding the three proposed components is necessary and sufficient to resolve them without introducing new problems.
What would settle it
An experiment in which a model using only one or two of the three components reaches comparable consistency scores on the same benchmarks, or in which the full PAIWorld model produces new inconsistencies not seen in the baseline.
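The experiment described above amounts to scoring every subset of the three components and checking whether any proper subset matches the full model. A minimal sketch of that bookkeeping follows; the component identifiers, function names, and the `tolerance` threshold are hypothetical, and no scores are supplied because the source reports none for ablated variants.

```python
from itertools import combinations

# Hypothetical identifiers for the three components named in the paper.
COMPONENTS = ("cross_view_attention", "geometric_rope", "latent_3d_repa")

def ablation_grid(components=COMPONENTS):
    """Return every subset of components, ordered from the bare baseline to the full model."""
    grid = []
    for k in range(len(components) + 1):
        for subset in combinations(components, k):
            grid.append(frozenset(subset))
    return grid

def subsets_matching_full(scores, full, tolerance):
    """Return the proper subsets whose score is within `tolerance` of the full model.

    `scores` maps each subset (a frozenset) to a consistency score where higher
    is better. A non-empty result would count against the necessity claim.
    """
    return [s for s in scores
            if s != full and scores[full] - scores[s] <= tolerance]
```

With three components the grid has eight configurations: the baseline, three single-component models, three two-component models, and the full model.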
Original abstract
World foundation models (WFMs) are powerful simulators, yet they predominantly operate in a single-view setting and lack the multi-view 3D consistency required for robotic manipulation. While robotic systems rely on multiple cameras (egocentric, eye-to-hand, and wrist-mounted) for policy learning, current multi-view world models simply concatenate view tokens without explicit geometric reasoning. This causes cross-view object drift, depth inconsistency, and texture misalignment. We trace these failures to two deficiencies: the absence of an explicit inter-view communication mechanism and the lack of a 3D geometric prior. We argue that resolving both simultaneously is necessary and sufficient. To address this, we present PAIWorld, a framework that augments diffusion-transformer world models via three core components: (1) Geometry-Aware Cross-View Attention blocks that establish an explicit pathway across views, (2) Geometric Rotary Position Embedding that encodes camera ray directions and extrinsic poses into the attention mechanism, and (3) Latent 3D-REPA, which distills 3D-aware features from frozen 3D foundation models to ensure 3D consistency. Built upon a DiT-based world foundation model, PAIWorld achieves state-of-the-art multi-view 3D consistency on robotic manipulation benchmarks, ranking 1st on the WorldArena leaderboard and 2nd on the AgiBot-Challenge2026 leaderboard, while enabling downstream applications such as model-based planning, world action models, and multi-view policy post-training.
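The abstract does not say how Latent 3D-REPA computes its distillation signal. The original REPA technique aligns projected diffusion hidden states with features from a frozen encoder by maximizing patch-wise similarity, so a plausible reading is a loss of the following shape. This is a sketch under that assumption: the linear projection, the cosine similarity choice, and all names are illustrative, and the teacher model, layer choice, and loss weight used by PAIWorld are unknown here.

```python
import math

def linear(h, W):
    """Project a hidden vector h with a weight matrix W (list of rows)."""
    return [sum(w * x for w, x in zip(row, h)) for row in W]

def cosine(a, b, eps=1e-8):
    """Cosine similarity with a small epsilon to avoid division by zero."""
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / (den + eps)

def alignment_loss(student_tokens, teacher_tokens, W):
    """Negative mean per-token cosine similarity.

    student_tokens: hidden states from the generative model (trainable side).
    teacher_tokens: features from a frozen 3D-aware model (treated as constants).
    W: hypothetical projection mapping student width to teacher width.
    """
    sims = [cosine(linear(s, W), t) for s, t in zip(student_tokens, teacher_tokens)]
    return -sum(sims) / len(sims)
```

The loss approaches -1 when every projected student token points in the same direction as its teacher feature and rises toward +1 as they become opposed. In training it would be added to the diffusion objective with some weight, with gradients flowing only into the student and the projection.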
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces PAIWorld, a DiT-based world foundation model for robotic manipulation that augments standard multi-view diffusion transformers with three components: Geometry-Aware Cross-View Attention blocks, Geometric Rotary Position Embedding (encoding camera rays and extrinsics), and Latent 3D-REPA (distilling from frozen 3D models). It identifies two root causes of cross-view drift, depth inconsistency, and texture misalignment—lack of explicit inter-view communication and absence of 3D geometric priors—and claims that addressing both simultaneously via these additions is necessary and sufficient. The resulting model reports SOTA multi-view 3D consistency, ranking 1st on WorldArena and 2nd on AgiBot-Challenge2026, with downstream uses in model-based planning, world action models, and multi-view policy post-training.
Significance. If the attribution of gains to the three components is empirically validated, the work would provide a concrete architectural recipe for injecting geometric consistency into world models, which is a load-bearing requirement for reliable multi-camera robotic simulation and policy learning. The leaderboard results and downstream application claims would then represent a meaningful advance over simple token concatenation baselines.
major comments (2)
- [Abstract and §4 (Experiments)] The central claim that the two deficiencies are the root causes and that the three components are simultaneously necessary and sufficient is not supported by ablation studies. No controlled experiments are described that compare the full PAIWorld model against ablated variants (e.g., without Geometry-Aware Cross-View Attention, without Geometric RoPE, or without Latent 3D-REPA) on the WorldArena consistency metrics or the reported leaderboard scores. Without such results, the attribution of the 1st/2nd place rankings and downstream improvements to this specific design cannot be substantiated.
- [§3 (Method)] The necessity argument, that inter-view communication and 3D priors must be resolved together, rests on the design of the three components, yet no analysis is provided showing that partial combinations fail to achieve comparable consistency or that alternative mechanisms (e.g., explicit depth supervision or different attention patterns) would not suffice. This leaves open the possibility that the observed gains arise from increased capacity rather than the claimed geometric mechanisms.
minor comments (1)
- [Abstract] The abstract and introduction would benefit from explicit definitions or citations for the WorldArena and AgiBot-Challenge2026 metrics used to claim SOTA multi-view 3D consistency.
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive report. The two major comments correctly identify that the manuscript's central claims about root causes and the necessity/sufficiency of the three components rest on design motivation and overall results rather than direct empirical isolation. We address each point below and commit to revisions that add the requested controlled experiments.
Point-by-point responses
- Referee: [Abstract and §4 (Experiments)] The central claim that the two deficiencies are the root causes and that the three components are simultaneously necessary and sufficient is not supported by ablation studies. No controlled experiments are described that compare the full PAIWorld model against ablated variants (e.g., without Geometry-Aware Cross-View Attention, without Geometric RoPE, or without Latent 3D-REPA) on the WorldArena consistency metrics or the reported leaderboard scores. Without such results, the attribution of the 1st/2nd place rankings and downstream improvements to this specific design cannot be substantiated.
Authors: We agree that the original submission lacks explicit ablation tables isolating each of the three components on the WorldArena metrics and leaderboard scores. The manuscript instead presents the joint model and its end-to-end gains. In the revision we will add a dedicated ablation subsection in §4 that reports performance for the base DiT, plus three single-component removals and two partial combinations, all evaluated on the same consistency metrics and downstream tasks. This will allow direct attribution of gains to the geometric mechanisms rather than capacity alone.
Revision: yes
- Referee: [§3 (Method)] The necessity argument, that inter-view communication and 3D priors must be resolved together, rests on the design of the three components, yet no analysis is provided showing that partial combinations fail to achieve comparable consistency or that alternative mechanisms (e.g., explicit depth supervision or different attention patterns) would not suffice. This leaves open the possibility that the observed gains arise from increased capacity rather than the claimed geometric mechanisms.
Authors: The necessity claim in the manuscript is presented as a design hypothesis rather than an experimentally proven theorem. While the joint architecture is shown to work, we did not include systematic comparisons of partial combinations or alternative priors such as explicit depth losses. In the revision we will expand §3 with a short discussion of why the chosen mechanisms address both deficiencies simultaneously and will reference the new ablation results (from the §4 revision) to show that removing any one component measurably degrades 3D consistency. We will also note that alternatives like depth supervision remain compatible future directions but are outside the current scope.
Revision: yes
Circularity Check
No circularity; empirical claims rest on external benchmarks without self-referential reductions
The paper identifies deficiencies in prior multi-view world models, proposes three architectural components to address them, and reports empirical results on WorldArena and AgiBot-Challenge2026 leaderboards plus downstream tasks. No equations, parameter-fitting steps presented as predictions, or self-citation chains appear in the provided text that would make any central claim equivalent to its inputs by construction. The necessity-and-sufficiency argument is stated as a design hypothesis tested via benchmarks rather than a closed logical loop.
Forward citations
Cited by 1 Pith paper
- PhysRAG: Enhancing Physics-Awareness in Video Generation via Retrieval-Augmented Generation
PhysRAG curates 7K videos from WISA-80K, builds a physical video database, and injects knowledge via learnable queries into a diffusion model to reach SOTA visual quality and physical compliance on PhyGenBench and VBench.