pith. machine review for the scientific record.

arxiv: 2602.00937 · v3 · submitted 2026-01-31 · 💻 cs.RO · cs.AI · cs.CV · cs.LG

Recognition: 2 theorem links · Lean Theorem

CLAMP: Contrastive Learning for 3D Multi-View Action-Conditioned Robotic Manipulation Pretraining

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 08:21 UTC · model grok-4.3

classification 💻 cs.RO · cs.AI · cs.CV · cs.LG
keywords 3D pretraining · contrastive learning · robotic manipulation · point clouds · sim-to-real · diffusion policy · multi-view

The pith

CLAMP shows that contrastive pretraining on action-conditioned, multi-view 3D observations from simulation transfers to real robots, yielding superior manipulation performance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces CLAMP, a framework for pretraining robotic manipulation policies using 3D point cloud data and contrastive learning. It processes RGB-D images into merged point clouds, re-renders multi-view images with depth and coordinates, and trains encoders to link object geometry with robot actions on large simulated datasets. A diffusion policy is also pretrained to initialize weights. This leads to more efficient fine-tuning on real tasks with limited demonstrations, outperforming baselines in both simulation and the real world. A reader would care because it offers a way to leverage cheap simulation data to reduce expensive real-robot training needs.
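The diffusion-policy pretraining mentioned above rests on the standard DDPM objective [18] used by Diffusion Policy [6]: corrupt an action chunk with noise at a random timestep and train the network to predict that noise. A minimal NumPy sketch, with batch shapes and noise schedule chosen for illustration rather than taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative shapes: a batch of action chunks (batch, horizon, action_dim),
# in the style of Diffusion Policy [6]; the observation encoder is elided.
B, H, A = 32, 16, 14
actions = rng.standard_normal((B, H, A))

# Linear beta schedule over T diffusion steps, as in DDPM [18].
T = 100
betas = np.linspace(1e-4, 2e-2, T)
alphas_bar = np.cumprod(1.0 - betas)

def ddpm_training_pair(a0):
    """Sample (noisy action chunk, timestep, noise target) for one step."""
    t = rng.integers(0, T, size=a0.shape[0])          # per-sample timestep
    eps = rng.standard_normal(a0.shape)               # Gaussian noise
    ab = alphas_bar[t][:, None, None]
    a_t = np.sqrt(ab) * a0 + np.sqrt(1.0 - ab) * eps  # q(a_t | a_0)
    return a_t, t, eps

a_t, t, eps_target = ddpm_training_pair(actions)
# Pretraining loss: mean((eps_pred(a_t, t, obs_embedding) - eps_target)**2),
# where eps_pred is the policy network whose weights are being initialized.
```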

Core claim

CLAMP establishes that contrastive learning on action-conditioned 3D multi-view representations derived from point clouds, combined with diffusion policy pretraining, produces encoders and policies that, when fine-tuned on few task demonstrations, achieve substantially higher success rates on unseen manipulation tasks in both simulated and real environments.

What carries the argument

The contrastive pretraining objective that associates 3D geometric and positional information from re-rendered multi-view images with robot action patterns, using merged point clouds from RGB-D and extrinsics.
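Figure 7 indicates the pairwise matching follows SigLIP [62]. A minimal NumPy sketch of that sigmoid contrastive loss applied to image and action embeddings; the temperature and bias values and the embedding shapes are assumptions for illustration:

```python
import numpy as np

def siglip_pairwise_loss(img_emb, act_emb, temp=10.0, bias=-10.0):
    """Sigmoid pairwise matching loss (SigLIP [62]) between L2-normalized
    image and action embeddings; in-batch pairs from the same trajectory
    step are positives, all other pairs negatives."""
    img = img_emb / np.linalg.norm(img_emb, axis=-1, keepdims=True)
    act = act_emb / np.linalg.norm(act_emb, axis=-1, keepdims=True)
    logits = temp * img @ act.T + bias            # (B, B) pairwise scores
    labels = 2.0 * np.eye(len(img)) - 1.0         # +1 on diagonal, -1 off
    # -log sigmoid(label * logit), averaged over all pairs
    return np.mean(np.log1p(np.exp(-labels * logits)))

rng = np.random.default_rng(0)
img_emb = rng.standard_normal((8, 256))  # pooled multi-view ViT features
act_emb = rng.standard_normal((8, 256))  # Transformer-encoded action history
print(siglip_pairwise_loss(img_emb, act_emb))
```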

If this is right

  • The pretrained encoders capture 3D spatial details essential for precise actions.
  • Sample efficiency improves during fine-tuning on real demonstrations.
  • Generalization to new tasks increases compared to non-pretrained policies.
  • The method works without major changes to the policy architecture.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The method could extend to other robot learning domains like navigation or assembly if similar 3D data is available.
  • Increasing the diversity of simulated trajectories might further close the sim-to-real gap.
  • Applying the same pretraining to different policy architectures beyond diffusion policies could yield similar benefits.

Load-bearing premise

The representations learned from simulated point clouds and actions remain effective despite differences in real-world sensing noise, lighting, and physical dynamics.

What would settle it

A real-world experiment where fine-tuning the CLAMP-pretrained model on limited demos yields success rates no higher than a randomly initialized model or a 2D-pretrained baseline.

Figures

Figures reproduced from arXiv: 2602.00937 by Connor Schenck, I-Chun Arthur Liu, Krzysztof Choromanski, Sandy Huang.

Figure 1. CLAMP is a 3D pre-training framework for robotic …
Figure 2. Overview of CLAMP. (i): CLAMP consists of three encoders: image, action, and text. The image encoder is a Vision Transformer (ViT) [8] that takes five multi-view observations (overhead, back-right, front-left, wrist-left, and wrist-right) as input. These views are rendered from a merged point cloud and include depth and 3D coordinates. The action encoder is a Transformer encoder that takes a history of prev…
Figure 3. Pictorial description of the STRING mechanism applied in CLAMP's image encoder to correlate tokens from different …
Figure 4. Evaluation curves comparing our method (green) to baselines across three seeds in simulation. All methods are trained …
Figure 5. Successful rollouts of our method on the ALOHA robot. Top: …
Figure 6. Recall@5 results for cross-modal retrieval tasks on the validation data during pre-training.
Figure 7. SigLIP pairwise matching probabilities on an unseen task: image-to-previous-action (top) and previous-action-to-image …
Figure 8. Examples of initial configurations for the real-world …
Figure 9. Evaluation curves comparing our method (green) with baselines across three seeds in simulation under a low-data …
read the original abstract

Leveraging pre-trained 2D image representations in behavior cloning policies has achieved great success and has become a standard approach for robotic manipulation. However, such representations fail to capture the 3D spatial information about objects and scenes that is essential for precise manipulation. In this work, we introduce Contrastive Learning for 3D Multi-View Action-Conditioned Robotic Manipulation Pretraining (CLAMP), a novel 3D pre-training framework that utilizes point clouds and robot actions. From the merged point cloud computed from RGB-D images and camera extrinsics, we re-render multi-view four-channel image observations with depth and 3D coordinates, including dynamic wrist views, to provide clearer views of target objects for high-precision manipulation tasks. The pre-trained encoders learn to associate the 3D geometric and positional information of objects with robot action patterns via contrastive learning on large-scale simulated robot trajectories. During encoder pre-training, we pre-train a Diffusion Policy to initialize the policy weights for fine-tuning, which is essential for improving fine-tuning sample efficiency and performance. After pre-training, we fine-tune the policy on a limited amount of task demonstrations using the learned image and action representations. We demonstrate that this pre-training and fine-tuning design substantially improves learning efficiency and policy performance on unseen tasks. Furthermore, we show that CLAMP outperforms state-of-the-art baselines across six simulated tasks and five real-world tasks. The project website and videos can be found at https://clamp3d.github.io/CLAMP/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces CLAMP, a 3D pre-training framework for robotic manipulation that merges RGB-D point clouds using camera extrinsics, re-renders them as multi-view 4-channel images (depth plus 3D coordinates) including dynamic wrist views, and applies contrastive learning on large-scale simulated trajectories to associate geometric features with action patterns. It additionally pre-trains a Diffusion Policy for weight initialization and fine-tunes the resulting encoders and policy on limited task demonstrations, claiming improved sample efficiency and outperformance versus state-of-the-art baselines on six simulated tasks and five real-world tasks.

Significance. If the empirical claims hold, the work would be significant for the robotics community by demonstrating that action-conditioned contrastive pre-training on clean simulated 3D renderings can produce representations that transfer to real RGB-D settings, addressing the spatial limitations of 2D image pre-training and reducing reliance on large real-world datasets.

major comments (2)
  1. [Abstract] The central claim of outperformance on six simulated and five real-world tasks is stated without quantitative tables, ablation results, baseline details, or statistical tests, so the magnitude and reliability of the reported gains cannot be assessed from the provided text.
  2. [Experiments] The experiments section (implied by the sim-to-real claims) describes no ablations on depth-sensor noise models, calibration perturbations, or domain randomization during pre-training, leaving the robustness of the merged point-cloud re-rendering pipeline to real-world RGB-D artifacts unverified and directly threatening the transfer results on the five real tasks.
minor comments (2)
  1. [Methods] The exact construction of the four-channel re-rendered images (depth plus 3D coordinates) from the merged point cloud should be formalized with an equation or pseudocode for reproducibility; one plausible construction is sketched after these comments.
  2. [Introduction] The paper should cite prior contrastive-learning work in robotics (e.g., on point-cloud or multi-view representations) to situate the novelty of the action-conditioned objective.
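As a reading aid for minor comment 1, here is a minimal NumPy sketch of one plausible construction of a four-channel virtual view from a merged point cloud. The pinhole camera model, per-point z-buffer splatting, resolution, and function names are assumptions for illustration, not the paper's actual renderer.

```python
import numpy as np

def render_four_channel_view(points, K, cam_from_world, hw=(224, 224)):
    """Z-buffer splat of a merged point cloud into one virtual pinhole view.

    Output channels are (x, y, z, depth): the world coordinates of the
    nearest point landing on each pixel, plus its camera-frame depth. The
    pinhole model and per-point splatting are assumptions; the paper may
    render differently (e.g., with meshing or larger splats).
    """
    h, w = hw
    R, t = cam_from_world[:3, :3], cam_from_world[:3, 3]
    p_cam = points @ R.T + t                  # world -> camera frame
    z = p_cam[:, 2]
    keep = z > 1e-6                           # only points in front of camera
    pts, p_cam, z = points[keep], p_cam[keep], z[keep]
    uvw = p_cam @ K.T                         # pinhole projection
    u = np.round(uvw[:, 0] / z).astype(int)
    v = np.round(uvw[:, 1] / z).astype(int)
    inside = (0 <= u) & (u < w) & (0 <= v) & (v < h)
    u, v, pts, z = u[inside], v[inside], pts[inside], z[inside]

    img = np.zeros((h, w, 4), dtype=np.float32)
    zbuf = np.full((h, w), np.inf, dtype=np.float32)
    for i in range(len(z)):                   # unoptimized, illustrative splat
        if z[i] < zbuf[v[i], u[i]]:
            zbuf[v[i], u[i]] = z[i]
            img[v[i], u[i], :3] = pts[i]      # 3D coordinate channels
            img[v[i], u[i], 3] = z[i]         # depth channel
    return img
```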

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below and indicate the revisions we will incorporate.

read point-by-point responses
  1. Referee: [Abstract] The central claim of outperformance on six simulated and five real-world tasks is stated without quantitative tables, ablation results, baseline details, or statistical tests, so the magnitude and reliability of the reported gains cannot be assessed from the provided text.

    Authors: We acknowledge that the abstract, as a concise summary, does not include quantitative details or tables. The full results—including success rate tables comparing CLAMP to baselines, ablation studies on pretraining components, baseline descriptions, and run-to-run variance—are presented in Section 4 (Experiments). To address the concern, we will revise the abstract to include specific quantitative highlights (e.g., average success rate improvements of X% on simulated tasks and Y% on real tasks) while remaining within length limits. This will better convey the magnitude of gains directly in the abstract. revision: yes

  2. Referee: [Experiments] The experiments section (implied by the sim-to-real claims) describes no ablations on depth-sensor noise models, calibration perturbations, or domain randomization during pre-training, leaving the robustness of the merged point-cloud re-rendering pipeline to real-world RGB-D artifacts unverified and directly threatening the transfer results on the five real tasks.

    Authors: We agree this is an important aspect for strengthening the sim-to-real claims. Our current experiments demonstrate empirical transfer from clean simulated pretraining to real RGB-D tasks, but do not include explicit ablations on sensor noise, calibration perturbations, or domain randomization. We will add a dedicated subsection in the revised Experiments section with these analyses: we will introduce controlled noise models and calibration variations during pretraining and report their impact on downstream fine-tuning performance. This will directly verify robustness of the merged point-cloud pipeline. revision: yes
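The robustness study the authors commit to could start from perturbations like the following sketch; the noise model, magnitudes, and function names are illustrative assumptions rather than the authors' protocol.

```python
import numpy as np

rng = np.random.default_rng(0)

def perturb_depth(depth, sigma=0.005, dropout=0.02):
    """Additive Gaussian noise plus random holes, a crude stand-in for
    real RGB-D artifacts; both magnitudes are illustrative guesses."""
    noisy = depth + rng.normal(0.0, sigma, depth.shape)
    holes = rng.random(depth.shape) < dropout
    noisy[holes] = 0.0                        # missing returns
    return noisy

def perturb_extrinsics(world_from_cam, rot_deg=0.5, trans_m=0.005):
    """Small random rotation and translation applied to a camera pose,
    mimicking calibration error before point clouds are merged."""
    axis = rng.normal(size=3)
    axis /= np.linalg.norm(axis)
    angle = np.radians(rot_deg) * rng.standard_normal()
    Kx = np.array([[0.0, -axis[2], axis[1]],
                   [axis[2], 0.0, -axis[0]],
                   [-axis[1], axis[0], 0.0]])
    # Rodrigues' rotation formula for the perturbing rotation
    R = np.eye(3) + np.sin(angle) * Kx + (1.0 - np.cos(angle)) * (Kx @ Kx)
    out = world_from_cam.copy()
    out[:3, :3] = R @ out[:3, :3]
    out[:3, 3] += rng.normal(0.0, trans_m, 3)
    return out
```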

Circularity Check

0 steps flagged

No circularity in pretraining-to-evaluation chain

full rationale

The paper describes an empirical pipeline: contrastive pretraining of encoders on large-scale simulated trajectories (using merged point clouds re-rendered as multi-view 4-channel images), joint pretraining of a Diffusion Policy for weight initialization, followed by fine-tuning on limited task demonstrations and evaluation on held-out simulated and real-world tasks. No equations, derivations, or load-bearing steps reduce reported performance gains to quantities defined by fitted parameters inside the paper or to self-referential definitions. The evaluation uses separate data splits and external benchmarks, so the outperformance claims remain independent of the training inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Only the abstract is available, so the ledger is necessarily incomplete. The method rests on the domain assumption that simulated 3D trajectories contain transferable geometric-action associations and on standard contrastive and diffusion techniques imported from prior literature.

axioms (1)
  • domain assumption: Re-rendered multi-view four-channel images supply clearer 3D geometric and positional cues than raw 2D RGB images for manipulation.
    Invoked to justify the pretraining data generation step.

pith-pipeline@v0.9.0 · 5591 in / 1251 out tokens · 36238 ms · 2026-05-16T08:21:53.666521+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • Foundation/AlexanderDuality · alexander_duality_circle_linking · tagged: echoes

    ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

    "From the merged point cloud computed from RGB-D images and camera extrinsics, we re-render multi-view four-channel image observations with depth and 3D coordinates... We are the first to apply STRING relative positional encoding using 3D coordinates derived from point clouds."

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

72 extracted references · 72 canonical work pages · 10 internal anchors

  1. [1] Lucas Beyer, Xiaohua Zhai, and Alexander Kolesnikov. Big Vision. https://github.com/google-research/big_vision, 2022.
  2. [2] Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, Szymon Jakubczak, Tim Jones, Liyiming Ke, Sergey Levine, Adrian Li-Bell, Mohith Mothukuri, Suraj Nair, Karl Pertsch, Lucy Xiaoyang Shi, James Tanner, Quan Vuong, Anna Walling, Haohuan Wang, and Ury Zhilinsky. π0: A Vi…
  3. [3] James Bradbury, Roy Frostig, Peter Hawkins, Matthew James Johnson, Chris Leary, Dougal Maclaurin, George Necula, Adam Paszke, Jake VanderPlas, Skye Wanderman-Milne, and Qiao Zhang. JAX: composable transformations of Python+NumPy programs, 2018. http://github.com/jax-ml/jax.
  4. [4] Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. InternVL: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 24185–24198, 2024.
  5. [5] Mehdi Cherti, Romain Beaumont, Ross Wightman, Mitchell Wortsman, Gabriel Ilharco, Cade Gordon, Christoph Schuhmann, Ludwig Schmidt, and Jenia Jitsev. Reproducible scaling laws for contrastive language-image learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2818–2829, 2023.
  6. [6] Cheng Chi, Siyuan Feng, Yilun Du, Zhenjia Xu, Eric Cousineau, Benjamin Burchfiel, and Shuran Song. Diffusion Policy: Visuomotor policy learning via action diffusion. In Proceedings of Robotics: Science and Systems (RSS), 2023.
  7. [7] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255, 2009. doi: 10.1109/CVPR.2009.5206848.
  8. [8] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In 9th International Conference on Learning Representations (ICLR), 2021.
  9. [9] Yuxin Fang, Wen Wang, Binhui Xie, Quan Sun, Ledell Wu, Xinggang Wang, Tiejun Huang, Xinlong Wang, and Yue Cao. EVA: Exploring the limits of masked visual representation learning at scale. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19358–19369, 2023.
  10. [10] Yuxin Fang, Quan Sun, Xinggang Wang, Tiejun Huang, Xinlong Wang, and Yue Cao. EVA-02: A visual representation for neon genesis. Image and Vision Computing, 149:105171, 2024.
  11. [11] Theophile Gervet, Zhou Xian, Nikolaos Gkanatsios, and Katerina Fragkiadaki. Act3D: 3D feature field transformers for multi-task robotic manipulation. In Conference on Robot Learning, pages 3949–3965. PMLR, 2023.
  12. [12] Ankit Goyal, Hei Law, Bowei Liu, Alejandro Newell, and Jia Deng. Revisiting point cloud shape classification with a simple and effective baseline. In International Conference on Machine Learning, pages 3809–3820. PMLR, 2021.
  13. [13] Ankit Goyal, Jie Xu, Yijie Guo, Valts Blukis, Yu-Wei Chao, and Dieter Fox. RVT: Robotic view transformer for 3D object manipulation. In Conference on Robot Learning, pages 694–710. PMLR, 2023.
  14. [14] Christian Graf, David B. Adrian, Joshua Weil, Miroslav Gabriel, Philipp Schillinger, Markus Spies, Heiko Neumann, and Andras Gabor Kupcsik. Learning dense visual descriptors using image augmentations for robot manipulation tasks. In Conference on Robot Learning, pages 871–880. PMLR, 2023.
  15. [15] Meng-Hao Guo, Jun-Xiong Cai, Zheng-Ning Liu, Tai-Jiang Mu, Ralph R. Martin, and Shi-Min Hu. PCT: Point cloud transformer. Computational Visual Media, 7(2):187–199, 2021.
  16. [16] Abdullah Hamdi, Silvio Giancola, and Bernard Ghanem. MVTN: Multi-view transformation network for 3D shape recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1–11, 2021.
  17. [17] Abdullah Hamdi, Silvio Giancola, and Bernard Ghanem. Voint Cloud: Multi-view point cloud representation for 3D understanding. arXiv preprint arXiv:2111.15363, 2021.
  18. [18] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. arXiv preprint arXiv:2006.11239, 2020.
  19. [19] Ji Hou, Saining Xie, Benjamin Graham, Angela Dai, and Matthias Nießner. Pri3D: Can 3D priors help 2D representation learning? In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5693–5702, 2021.
  20. [20] Haigen Hu, Xiaoyuan Wang, Yan Zhang, Qi Chen, and Qiu Guan. A comprehensive survey on contrastive learning. Neurocomputing, 610:128645, 2024.
  21. [21] Jiangyong Huang, Silong Yong, Xiaojian Ma, Xiongkun Linghu, Puhao Li, Yan Wang, Qing Li, Song-Chun Zhu, Baoxiong Jia, and Siyuan Huang. An embodied generalist agent in 3D world. arXiv preprint arXiv:2311.12871, 2023.
  22. [22] Shijia Huang, Yilun Chen, Jiaya Jia, and Liwei Wang. Multi-view transformer for 3D visual grounding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15524–15533, 2022.
  23. [23] Xiaoshui Huang, Zhou Huang, Sheng Li, Wentao Qu, Tong He, Yuenan Hou, Yifan Zuo, and Wanli Ouyang. Frozen CLIP transformer is an efficient point cloud encoder. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 2382–2390, 2024.
  24. [24] Physical Intelligence, Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Manuel Y. Galliker, Dibya Ghosh, Lachy Groom, Karol Hausman, Brian Ichter, Szymon Jakubczak, Tim Jones, Liyiming Ke, Devin LeBlanc, Sergey Levine, Adrian Li-Bell, Mohith Mothukuri, Suraj Nair, Karl Pert…
  25. [25] Andrew Jaegle, Sebastian Borgeaud, Jean-Baptiste Alayrac, Carl Doersch, Catalin Ionescu, David Ding, Skanda Koppula, Daniel Zoran, Andrew Brock, Evan Shelhamer, et al. Perceiver IO: A general architecture for structured inputs & outputs. arXiv preprint arXiv:2107.14795, 2021.
  26. [26] Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. In International Conference on Machine Learning, pages 4904–4916. PMLR, 2021.
  27. [27] Yueru Jia, Jiaming Liu, Sixiang Chen, Chenyang Gu, Zhilve Wang, Longzan Luo, Xiaoqi Li, Pengwei Wang, Zhongyuan Wang, Renrui Zhang, et al. Lift3D Policy: Lifting 2D foundation models for robust 3D robotic manipulation. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 17347–17358, 2025.
  28. [28] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations (ICLR), San Diego, CA, USA, 2015.
  29. [29] http://arxiv.org/abs/1412.6980 (URL continuing [28]).
  30. [30] Juho Lee, Yoonho Lee, Jungtaek Kim, Adam Kosiorek, Seungjin Choi, and Yee Whye Teh. Set Transformer: A framework for attention-based permutation-invariant neural networks. In International Conference on Machine Learning, pages 3744–3753. PMLR, 2019.
  31. [31] Sung-Wook Lee, Xuhui Kang, Brandon Yang, and Yen-Ling Kuo. CLASS: Contrastive learning via action sequence supervision for robot manipulation. In Conference on Robot Learning (CoRL), 2025.
  32. [32] Xu Ma, Can Qin, Haoxuan You, Haoxi Ran, and Yun Fu. Rethinking network design and local geometry in point cloud: A simple residual MLP framework. arXiv preprint arXiv:2202.07123, 2022.
  33. [33] Zijing Ma, Zhi Yang, Aihua Mao, Shuyi Wen, Ran Yi, and Yongjin Liu. A multi-view projection-based object-aware graph network for dense captioning of point clouds. Computers & Graphics, 126:104156, 2025.
  34. [34] Kevis-Kokitsi Maninis, Kaifeng Chen, Soham Ghosh, Arjun Karpur, Koert Chen, Ye Xia, Bingyi Cao, Daniel Salz, Guangxing Han, Jan Dlabal, et al. TIPS: Text-image pretraining with spatial awareness. arXiv preprint arXiv:2410.16512, 2024.
  35. [35] NVIDIA: Johan Bjorck, Fernando Castañeda, Nikita Cherniadev, Xingye Da, Runyu Ding, Linxi "Jim" Fan, Yu Fang, Dieter Fox, Fengyuan Hu, Spencer Huang, Joel Jang, Zhenyu Jiang, Jan Kautz, Kaushil Kundalia, Lawrence Lao, Zhiqi Li, Zongyu Lin, Kevin Lin, Guilin Liu, Edith Llontop, Loic Magne, Ajay Mandlekar, Avnish Narayan, Soroush Nasiriany, Scott Reed, … GR00T N1: An Open Foundation Model for Generalist Humanoid Robots.
  36. [36] Charles R. Qi, Hao Su, Kaichun Mo, and Leonidas J. Guibas. PointNet: Deep learning on point sets for 3D classification and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 652–660, 2017.
  37. [37] Charles Ruizhongtai Qi, Li Yi, Hao Su, and Leonidas J. Guibas. PointNet++: Deep hierarchical feature learning on point sets in a metric space. Advances in Neural Information Processing Systems, 30, 2017.
  38. [38] Shengyi Qian, Kaichun Mo, Valts Blukis, David F. Fouhey, Dieter Fox, and Ankit Goyal. 3D-MVP: 3D multi-view pretraining for manipulation. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 22530–22539, 2025.
  39. [39] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.
  40. [40] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67, 2020.
  41. [41] Connor Schenck, Isaac Reid, Mithun George Jacob, Alex Bewley, Joshua Ainslie, David Rendleman, Deepali Jain, Mohit Sharma, Avinava Dubey, Ayzaan Wahid, et al. Learning the RoPEs: Better 2D and 3D position encodings with STRING. arXiv preprint arXiv:2502.02562, 2025.
  42. [42] Tanner Schmidt, Richard Newcombe, and Dieter Fox. Self-supervised visual descriptor learning for dense correspondence. IEEE Robotics and Automation Letters, 2(2):420–427, 2016.
  43. [43] Mohit Shridhar, Lucas Manuelli, and Dieter Fox. Perceiver-Actor: A multi-task transformer for robotic manipulation. In Conference on Robot Learning, pages 785–…
  44. [44] Quan Sun, Yuxin Fang, Ledell Wu, Xinlong Wang, and Yue Cao. EVA-CLIP: Improved training techniques for CLIP at scale. arXiv preprint arXiv:2303.15389, 2023.
  45. [45] Gemini Robotics Team. Gemini Robotics 1.5: Pushing the frontier of generalist robots with advanced embodied reasoning, thinking, and motion transfer, 2025. https://arxiv.org/abs/2510.03342.
  46. [46] Emanuel Todorov, Tom Erez, and Yuval Tassa. MuJoCo: A physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 5026–5033. IEEE, 2012.
  47. [47] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.
  48. [48] Yue Wang, Yongbin Sun, Ziwei Liu, Sanjay E. Sarma, Michael M. Bronstein, and Justin M. Solomon. Dynamic graph CNN for learning on point clouds. ACM Transactions on Graphics (TOG), 38(5):1–12, 2019.
  49. [49] Yufei Wang, Ziyu Wang, Mino Nakura, Pratik Bhowal, Chia-Liang Kuo, Yi-Ting Chen, Zackory Erickson, and David Held. ArticuBot: Learning universal articulated object manipulation policy via large scale simulation. arXiv preprint arXiv:2503.03045, 2025.
  50. [50] Xiaoyang Wu, Yixing Lao, Li Jiang, Xihui Liu, and Hengshuang Zhao. Point Transformer V2: Grouped vector attention and partition-based pooling. Advances in Neural Information Processing Systems, 35:33330–33342, 2022.
  51. [51] Xiaoyang Wu, Li Jiang, Peng-Shuai Wang, Zhijian Liu, Xihui Liu, Yu Qiao, Wanli Ouyang, Tong He, and Hengshuang Zhao. Point Transformer V3: Simpler, faster, stronger. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4840–4851, 2024.
  52. [52] Runsen Xu, Xiaolong Wang, Tai Wang, Yilun Chen, Jiangmiao Pang, and Dahua Lin. PointLLM: Empowering large language models to understand point clouds. In European Conference on Computer Vision, pages 131–…
  53. [53] Le Xue, Mingfei Gao, Chen Xing, Roberto Martín-Martín, Jiajun Wu, Caiming Xiong, Ran Xu, Juan Carlos Niebles, and Silvio Savarese. ULIP: Learning a unified representation of language, images, and point clouds for 3D understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1179–1189, 2023.
  54. [54] Le Xue, Ning Yu, Shu Zhang, Artemis Panagopoulou, Junnan Li, Roberto Martín-Martín, Jiajun Wu, Caiming Xiong, Ran Xu, Juan Carlos Niebles, et al. ULIP-2: Towards scalable multimodal pre-training for 3D understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 27091–27101, 2024.
  55. [55] Linting Xue, Noah Constant, Adam Roberts, Mihir Kale, Rami Al-Rfou, Aditya Siddhant, Aditya Barua, and Colin Raffel. mT5: A massively multilingual pre-trained text-to-text transformer. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 483–498, 2021.
  56. [56] Jiahui Yu, Zirui Wang, Vijay Vasudevan, Legg Yeung, Mojtaba Seyedhosseini, and Yonghui Wu. CoCa: Contrastive captioners are image-text foundation models. arXiv preprint arXiv:2205.01917, 2022.
  57. [57] Xumin Yu, Lulu Tang, Yongming Rao, Tiejun Huang, Jie Zhou, and Jiwen Lu. Point-BERT: Pre-training 3D point cloud transformers with masked point modeling. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19313–19322, 2022.
  58. [58] Lu Yuan, Dongdong Chen, Yi-Ling Chen, Noel Codella, Xiyang Dai, Jianfeng Gao, Houdong Hu, Xuedong Huang, Boxin Li, Chunyuan Li, et al. Florence: A new foundation model for computer vision. arXiv preprint arXiv:2111.11432, 2021.
  59. [59] K. Zakka, Y. Tassa, and MuJoCo Menagerie Contributors. MuJoCo Menagerie: A collection of high-quality simulation models for MuJoCo. http://github.com/google-deepmind/mujoco_menagerie, 2022. Accessed: 2025-12-15.
  60. [60] Yanjie Ze, Ge Yan, Yueh-Hua Wu, Annabella Macaluso, Yuying Ge, Jianglong Ye, Nicklas Hansen, Li Erran Li, and Xiaolong Wang. GNFactor: Multi-task real robot learning with generalizable neural feature fields. In Conference on Robot Learning, pages 284–301. PMLR, 2023.
  61. [61] Yanjie Ze, Gu Zhang, Kangning Zhang, Chenyuan Hu, Muhan Wang, and Huazhe Xu. 3D Diffusion Policy: Generalizable visuomotor policy learning via simple 3D representations. In Proceedings of Robotics: Science and Systems, Delft, Netherlands, July 2024.
  62. [62] Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 11975–11986, 2023.
  63. [63] Renrui Zhang, Ziyu Guo, Peng Gao, Rongyao Fang, Bin Zhao, Dong Wang, Yu Qiao, and Hongsheng Li. Point-M2AE: Multi-scale masked autoencoders for hierarchical point cloud pre-training. Advances in Neural Information Processing Systems, 35:27061–27074, 2022.
  64. [64] Renrui Zhang, Ziyu Guo, Wei Zhang, Kunchang Li, Xupeng Miao, Bin Cui, Yu Qiao, Peng Gao, and Hongsheng Li. PointCLIP: Point cloud understanding by CLIP. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8552–8562, 2022.
  65. [65] Hengshuang Zhao, Li Jiang, Jiaya Jia, Philip H. S. Torr, and Vladlen Koltun. Point Transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 16259–16268, 2021.
  66. [66] Tony Z. Zhao, Vikash Kumar, Sergey Levine, and Chelsea Finn. Learning fine-grained bimanual manipulation with low-cost hardware. 2023.
  67. [67] Tony Z. Zhao, Jonathan Tompson, Danny Driess, Pete Florence, Kamyar Ghasemipour, Chelsea Finn, and Ayzaan Wahid. ALOHA Unleashed: A simple recipe for robot dexterity. arXiv preprint arXiv:2410.13126, 2024.
  68. [68] Junsheng Zhou, Jinsheng Wang, Baorui Ma, Yu-Shen Liu, Tiejun Huang, and Xinlong Wang. Uni3D: Exploring unified 3D representation at scale. arXiv preprint arXiv:2310.06773, 2023.
  69. [69] Wenxuan Zhou, Bowen Jiang, Fan Yang, Chris Paxton, and David Held. HACMan: Learning hybrid actor-critic maps for 6D non-prehensile manipulation. arXiv preprint arXiv:2305.03942, 2023.
  70. [70] Xiangyang Zhu, Renrui Zhang, Bowei He, Ziyu Guo, Ziyao Zeng, Zipeng Qin, Shanghang Zhang, and Peng Gao. PointCLIP V2: Prompting CLIP and GPT for powerful 3D open-world learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2639–2650, 2023.
  71. [71] Internal anchor (appendix, implementation details): "For image preprocessing, we use a voxel size of 0.001 for voxel-grid downsampling. If the point cloud contains more than 0.3M points, we randomly subsample to 0.3M; otherwise, we pad with points at the origin (0,0,0) to reach 0.3M points. Each virtual view has dimensions 224×224×4, and we tile the views horizontally to form a 224×1120×4 input. The action enc…"
  72. [72] Internal anchor (appendix, GR1.5 tasks and episode counts): multitoolsmagnifierincaddy_left: 47,679; multitoolsscrewdriverincaddy_left: 42,466; multitoolsmagnifierincaddy_right: 47,613; handoverpen: 39,603; multitoolsscissorsincaddy_right: 42,167; multitoolscanopenerincaddy_right: 42,069; multitoolsscissorsincaddy_left: 40,568; multitoysraccooninbasket: 40,409; multitoysfireenginei…
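Entry [71] describes the preprocessing concretely enough to sketch. The following NumPy illustration follows the quoted numbers (0.001 voxel size, a 0.3M-point budget, five 224×224×4 views tiled into a 224×1120×4 input); the one-point-per-voxel reduction rule and the padding details are assumptions about the exact implementation.

```python
import numpy as np

def preprocess_cloud(points, voxel=0.001, n_target=300_000, rng=None):
    """Voxel-grid downsample, then subsample or zero-pad to a fixed budget,
    following the quoted appendix text; keeping one arbitrary point per
    voxel is an assumption about the exact reduction rule."""
    rng = rng if rng is not None else np.random.default_rng(0)
    keys = np.floor(points / voxel).astype(np.int64)
    _, first = np.unique(keys, axis=0, return_index=True)
    pts = points[first]
    if len(pts) > n_target:
        pts = pts[rng.choice(len(pts), n_target, replace=False)]
    elif len(pts) < n_target:
        pad = np.zeros((n_target - len(pts), 3), dtype=points.dtype)
        pts = np.concatenate([pts, pad])      # pad with points at the origin
    return pts

# Five 224x224x4 virtual views tiled horizontally into one 224x1120x4 input.
views = [np.zeros((224, 224, 4), dtype=np.float32) for _ in range(5)]
model_input = np.concatenate(views, axis=1)   # shape (224, 1120, 4)
```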