pith. machine review for the scientific record.

arxiv: 2602.00937 · v3 · submitted 2026-01-31 · 💻 cs.RO · cs.AI · cs.CV · cs.LG

Recognition: 2 theorem links · Lean Theorem

CLAMP: Contrastive Learning for 3D Multi-View Action-Conditioned Robotic Manipulation Pretraining

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 08:21 UTC · model grok-4.3

classification 💻 cs.RO · cs.AI · cs.CV · cs.LG
keywords 3D pretraining · contrastive learning · robotic manipulation · point clouds · sim-to-real · diffusion policy · multi-view

The pith

CLAMP shows that contrastive pretraining on action-conditioned, multi-view 3D observations from simulation transfers to real robots, yielding superior manipulation performance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces CLAMP, a framework for pretraining robotic manipulation policies using 3D point cloud data and contrastive learning. It processes RGB-D images into merged point clouds, re-renders multi-view images with depth and coordinates, and trains encoders to link object geometry with robot actions on large simulated datasets. A diffusion policy is also pretrained to initialize weights. This leads to more efficient fine-tuning on real tasks with limited demonstrations, outperforming baselines in both simulation and the real world. A reader would care because it offers a way to leverage cheap simulation data to reduce expensive real-robot training needs.
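The diffusion-policy pretraining mentioned above rests on the standard DDPM objective [18] used by Diffusion Policy [6]: corrupt an action chunk with noise at a random timestep and train the network to predict that noise. A minimal NumPy sketch, with batch shapes and noise schedule chosen for illustration rather than taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative shapes: a batch of action chunks (batch, horizon, action_dim),
# in the style of Diffusion Policy [6]; the observation encoder is elided.
B, H, A = 32, 16, 14
actions = rng.standard_normal((B, H, A))

# Linear beta schedule over T diffusion steps, as in DDPM [18].
T = 100
betas = np.linspace(1e-4, 2e-2, T)
alphas_bar = np.cumprod(1.0 - betas)

def ddpm_training_pair(a0):
    """Sample (noisy action chunk, timestep, noise target) for one step."""
    t = rng.integers(0, T, size=a0.shape[0])          # per-sample timestep
    eps = rng.standard_normal(a0.shape)               # Gaussian noise
    ab = alphas_bar[t][:, None, None]
    a_t = np.sqrt(ab) * a0 + np.sqrt(1.0 - ab) * eps  # q(a_t | a_0)
    return a_t, t, eps

a_t, t, eps_target = ddpm_training_pair(actions)
# Pretraining loss: mean((eps_pred(a_t, t, obs_embedding) - eps_target)**2),
# where eps_pred is the policy network whose weights are being initialized.
```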

Core claim

CLAMP establishes that contrastive learning on action-conditioned 3D multi-view representations derived from point clouds, combined with diffusion policy pretraining, produces encoders and policies that, when fine-tuned on few task demonstrations, achieve substantially higher success rates on unseen manipulation tasks in both simulated and real environments.

What carries the argument

The contrastive pretraining objective that associates 3D geometric and positional information from re-rendered multi-view images with robot action patterns, using merged point clouds from RGB-D and extrinsics.
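Figure 7 indicates the pairwise matching follows SigLIP [62]. A minimal NumPy sketch of that sigmoid contrastive loss applied to image and action embeddings; the temperature and bias values and the embedding shapes are assumptions for illustration:

```python
import numpy as np

def siglip_pairwise_loss(img_emb, act_emb, temp=10.0, bias=-10.0):
    """Sigmoid pairwise matching loss (SigLIP [62]) between L2-normalized
    image and action embeddings; in-batch pairs from the same trajectory
    step are positives, all other pairs negatives."""
    img = img_emb / np.linalg.norm(img_emb, axis=-1, keepdims=True)
    act = act_emb / np.linalg.norm(act_emb, axis=-1, keepdims=True)
    logits = temp * img @ act.T + bias            # (B, B) pairwise scores
    labels = 2.0 * np.eye(len(img)) - 1.0         # +1 on diagonal, -1 off
    # -log sigmoid(label * logit), averaged over all pairs
    return np.mean(np.log1p(np.exp(-labels * logits)))

rng = np.random.default_rng(0)
img_emb = rng.standard_normal((8, 256))  # pooled multi-view ViT features
act_emb = rng.standard_normal((8, 256))  # Transformer-encoded action history
print(siglip_pairwise_loss(img_emb, act_emb))
```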

If this is right

  • The pretrained encoders capture 3D spatial details essential for precise actions.
  • Sample efficiency improves during fine-tuning on real demonstrations.
  • Generalization to new tasks increases compared to non-pretrained policies.
  • The method works without major changes to the policy architecture.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The method could extend to other robot learning domains like navigation or assembly if similar 3D data is available.
  • Increasing the diversity of simulated trajectories might further close the sim-to-real gap.
  • Applying the same pretraining to different policy architectures beyond diffusion policies could yield similar benefits.

Load-bearing premise

The representations learned from simulated point clouds and actions remain effective despite differences in real-world sensing noise, lighting, and physical dynamics.

What would settle it

A real-world experiment where fine-tuning the CLAMP-pretrained model on limited demos yields success rates no higher than a randomly initialized model or a 2D-pretrained baseline.

Figures

Figures reproduced from arXiv: 2602.00937 by Connor Schenck, I-Chun Arthur Liu, Krzysztof Choromanski, Sandy Huang.

Figure 1. CLAMP is a 3D pre-training framework for robotic …
Figure 2. Overview of CLAMP. (i): CLAMP consists of three encoders: image, action, and text. The image encoder is a Vision Transformer (ViT) [8] that takes five multi-view observations (overhead, back-right, front-left, wrist-left, and wrist-right) as input. These views are rendered from a merged point cloud and include depth and 3D coordinates. The action encoder is a Transformer encoder that takes a history of prev…
Figure 3. Pictorial description of the STRING mechanism applied in CLAMP's image encoder to correlate tokens from different …
Figure 4. Evaluation curves comparing our method (green) to baselines across three seeds in simulation. All methods are trained …
Figure 5. Successful rollouts of our method on the ALOHA robot. Top: …
Figure 6. Recall@5 results for cross-modal retrieval tasks on the validation data during pre-training.
Figure 7. SigLIP pairwise matching probabilities on an unseen task: image-to-previous-action (top) and previous-action-to-image …
Figure 8. Examples of initial configurations for the real-world …
Figure 9. Evaluation curves comparing our method (green) with baselines across three seeds in simulation under a low-data …
read the original abstract

Leveraging pre-trained 2D image representations in behavior cloning policies has achieved great success and has become a standard approach for robotic manipulation. However, such representations fail to capture the 3D spatial information about objects and scenes that is essential for precise manipulation. In this work, we introduce Contrastive Learning for 3D Multi-View Action-Conditioned Robotic Manipulation Pretraining (CLAMP), a novel 3D pre-training framework that utilizes point clouds and robot actions. From the merged point cloud computed from RGB-D images and camera extrinsics, we re-render multi-view four-channel image observations with depth and 3D coordinates, including dynamic wrist views, to provide clearer views of target objects for high-precision manipulation tasks. The pre-trained encoders learn to associate the 3D geometric and positional information of objects with robot action patterns via contrastive learning on large-scale simulated robot trajectories. During encoder pre-training, we pre-train a Diffusion Policy to initialize the policy weights for fine-tuning, which is essential for improving fine-tuning sample efficiency and performance. After pre-training, we fine-tune the policy on a limited amount of task demonstrations using the learned image and action representations. We demonstrate that this pre-training and fine-tuning design substantially improves learning efficiency and policy performance on unseen tasks. Furthermore, we show that CLAMP outperforms state-of-the-art baselines across six simulated tasks and five real-world tasks. The project website and videos can be found at https://clamp3d.github.io/CLAMP/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces CLAMP, a 3D pre-training framework for robotic manipulation that merges RGB-D point clouds using camera extrinsics, re-renders them as multi-view 4-channel images (depth plus 3D coordinates) including dynamic wrist views, and applies contrastive learning on large-scale simulated trajectories to associate geometric features with action patterns. It additionally pre-trains a Diffusion Policy for weight initialization and fine-tunes the resulting encoders and policy on limited task demonstrations, claiming improved sample efficiency and outperformance versus state-of-the-art baselines on six simulated tasks and five real-world tasks.

Significance. If the empirical claims hold, the work would be significant for the robotics community by demonstrating that action-conditioned contrastive pre-training on clean simulated 3D renderings can produce representations that transfer to real RGB-D settings, addressing the spatial limitations of 2D image pre-training and reducing reliance on large real-world datasets.

major comments (2)
  1. [Abstract] The central claim of outperformance on six simulated and five real-world tasks is stated without quantitative tables, ablation results, baseline details, or statistical tests, so the magnitude and reliability of the reported gains cannot be assessed from the provided text.
  2. [Experiments] The experiments section (implied by the sim-to-real claims) describes no ablations on depth-sensor noise models, calibration perturbations, or domain randomization during pre-training, leaving the robustness of the merged point-cloud re-rendering pipeline to real-world RGB-D artifacts unverified and directly threatening the transfer results on the five real tasks.
minor comments (2)
  1. [Methods] The exact construction of the four-channel re-rendered images (depth plus 3D coordinates) from the merged point cloud should be formalized with an equation or pseudocode for reproducibility; one plausible construction is sketched after these comments.
  2. [Introduction] The paper should cite prior contrastive-learning work in robotics (e.g., on point-cloud or multi-view representations) to situate the novelty of the action-conditioned objective.
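As a reading aid for minor comment 1, here is a minimal NumPy sketch of one plausible construction of a four-channel virtual view from a merged point cloud. The pinhole camera model, per-point z-buffer splatting, resolution, and function names are assumptions for illustration, not the paper's actual renderer.

```python
import numpy as np

def render_four_channel_view(points, K, cam_from_world, hw=(224, 224)):
    """Z-buffer splat of a merged point cloud into one virtual pinhole view.

    Output channels are (x, y, z, depth): the world coordinates of the
    nearest point landing on each pixel, plus its camera-frame depth. The
    pinhole model and per-point splatting are assumptions; the paper may
    render differently (e.g., with meshing or larger splats).
    """
    h, w = hw
    R, t = cam_from_world[:3, :3], cam_from_world[:3, 3]
    p_cam = points @ R.T + t                  # world -> camera frame
    z = p_cam[:, 2]
    keep = z > 1e-6                           # only points in front of camera
    pts, p_cam, z = points[keep], p_cam[keep], z[keep]
    uvw = p_cam @ K.T                         # pinhole projection
    u = np.round(uvw[:, 0] / z).astype(int)
    v = np.round(uvw[:, 1] / z).astype(int)
    inside = (0 <= u) & (u < w) & (0 <= v) & (v < h)
    u, v, pts, z = u[inside], v[inside], pts[inside], z[inside]

    img = np.zeros((h, w, 4), dtype=np.float32)
    zbuf = np.full((h, w), np.inf, dtype=np.float32)
    for i in range(len(z)):                   # unoptimized, illustrative splat
        if z[i] < zbuf[v[i], u[i]]:
            zbuf[v[i], u[i]] = z[i]
            img[v[i], u[i], :3] = pts[i]      # 3D coordinate channels
            img[v[i], u[i], 3] = z[i]         # depth channel
    return img
```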

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below and indicate the revisions we will incorporate.

read point-by-point responses
  1. Referee: [Abstract] The central claim of outperformance on six simulated and five real-world tasks is stated without quantitative tables, ablation results, baseline details, or statistical tests, so the magnitude and reliability of the reported gains cannot be assessed from the provided text.

    Authors: We acknowledge that the abstract, as a concise summary, does not include quantitative details or tables. The full results—including success rate tables comparing CLAMP to baselines, ablation studies on pretraining components, baseline descriptions, and run-to-run variance—are presented in Section 4 (Experiments). To address the concern, we will revise the abstract to include specific quantitative highlights (e.g., average success rate improvements of X% on simulated tasks and Y% on real tasks) while remaining within length limits. This will better convey the magnitude of gains directly in the abstract. revision: yes

  2. Referee: [Experiments] The experiments section (implied by the sim-to-real claims) describes no ablations on depth-sensor noise models, calibration perturbations, or domain randomization during pre-training, leaving the robustness of the merged point-cloud re-rendering pipeline to real-world RGB-D artifacts unverified and directly threatening the transfer results on the five real tasks.

    Authors: We agree this is an important aspect for strengthening the sim-to-real claims. Our current experiments demonstrate empirical transfer from clean simulated pretraining to real RGB-D tasks, but do not include explicit ablations on sensor noise, calibration perturbations, or domain randomization. We will add a dedicated subsection in the revised Experiments section with these analyses: we will introduce controlled noise models and calibration variations during pretraining and report their impact on downstream fine-tuning performance. This will directly verify robustness of the merged point-cloud pipeline. revision: yes
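The robustness study the authors commit to could start from perturbations like the following sketch; the noise model, magnitudes, and function names are illustrative assumptions rather than the authors' protocol.

```python
import numpy as np

rng = np.random.default_rng(0)

def perturb_depth(depth, sigma=0.005, dropout=0.02):
    """Additive Gaussian noise plus random holes, a crude stand-in for
    real RGB-D artifacts; both magnitudes are illustrative guesses."""
    noisy = depth + rng.normal(0.0, sigma, depth.shape)
    holes = rng.random(depth.shape) < dropout
    noisy[holes] = 0.0                        # missing returns
    return noisy

def perturb_extrinsics(world_from_cam, rot_deg=0.5, trans_m=0.005):
    """Small random rotation and translation applied to a camera pose,
    mimicking calibration error before point clouds are merged."""
    axis = rng.normal(size=3)
    axis /= np.linalg.norm(axis)
    angle = np.radians(rot_deg) * rng.standard_normal()
    Kx = np.array([[0.0, -axis[2], axis[1]],
                   [axis[2], 0.0, -axis[0]],
                   [-axis[1], axis[0], 0.0]])
    # Rodrigues' rotation formula for the perturbing rotation
    R = np.eye(3) + np.sin(angle) * Kx + (1.0 - np.cos(angle)) * (Kx @ Kx)
    out = world_from_cam.copy()
    out[:3, :3] = R @ out[:3, :3]
    out[:3, 3] += rng.normal(0.0, trans_m, 3)
    return out
```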

Circularity Check

0 steps flagged

No circularity in pretraining-to-evaluation chain

full rationale

The paper describes an empirical pipeline: contrastive pretraining of encoders on large-scale simulated trajectories (using merged point clouds re-rendered as multi-view 4-channel images), joint pretraining of a Diffusion Policy for weight initialization, followed by fine-tuning on limited task demonstrations and evaluation on held-out simulated and real-world tasks. No equations, derivations, or load-bearing steps reduce reported performance gains to quantities defined by fitted parameters inside the paper or to self-referential definitions. The evaluation uses separate data splits and external benchmarks, so the outperformance claims remain independent of the training inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Only the abstract is available, so the ledger is necessarily incomplete. The method rests on the domain assumption that simulated 3D trajectories contain transferable geometric-action associations and on standard contrastive and diffusion techniques imported from prior literature.

axioms (1)
  • domain assumption: Re-rendered multi-view four-channel images supply clearer 3D geometric and positional cues than raw 2D RGB images for manipulation.
    Invoked to justify the pretraining data generation step.

pith-pipeline@v0.9.0 · 5591 in / 1251 out tokens · 36238 ms · 2026-05-16T08:21:53.666521+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • Foundation/AlexanderDuality · alexander_duality_circle_linking · tagged: echoes

    ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

    "From the merged point cloud computed from RGB-D images and camera extrinsics, we re-render multi-view four-channel image observations with depth and 3D coordinates... We are the first to apply STRING relative positional encoding using 3D coordinates derived from point clouds."

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

72 extracted references · 72 canonical work pages · 10 internal anchors

  1. [1] Lucas Beyer, Xiaohua Zhai, and Alexander Kolesnikov. Big Vision. https://github.com/google-research/big_vision, 2022.
  2. [2] Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, Szymon Jakubczak, Tim Jones, Liyiming Ke, Sergey Levine, Adrian Li-Bell, Mohith Mothukuri, Suraj Nair, Karl Pertsch, Lucy Xiaoyang Shi, James Tanner, Quan Vuong, Anna Walling, Haohuan Wang, and Ury Zhilinsky. π0: A Vi…
  3. [3] James Bradbury, Roy Frostig, Peter Hawkins, Matthew James Johnson, Chris Leary, Dougal Maclaurin, George Necula, Adam Paszke, Jake VanderPlas, Skye Wanderman-Milne, and Qiao Zhang. JAX: composable transformations of Python+NumPy programs, 2018. http://github.com/jax-ml/jax.
  4. [4] Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. InternVL: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 24185–24198, 2024.
  5. [5] Mehdi Cherti, Romain Beaumont, Ross Wightman, Mitchell Wortsman, Gabriel Ilharco, Cade Gordon, Christoph Schuhmann, Ludwig Schmidt, and Jenia Jitsev. Reproducible scaling laws for contrastive language-image learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2818–2829, 2023.
  6. [6] Cheng Chi, Siyuan Feng, Yilun Du, Zhenjia Xu, Eric Cousineau, Benjamin Burchfiel, and Shuran Song. Diffusion Policy: Visuomotor policy learning via action diffusion. In Proceedings of Robotics: Science and Systems (RSS), 2023.
  7. [7] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255, 2009. doi: 10.1109/CVPR.2009.5206848.
  8. [8] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In 9th International Conference on Learning Representations (ICLR), 2021.
  9. [9] Yuxin Fang, Wen Wang, Binhui Xie, Quan Sun, Ledell Wu, Xinggang Wang, Tiejun Huang, Xinlong Wang, and Yue Cao. EVA: Exploring the limits of masked visual representation learning at scale. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19358–19369, 2023.
  10. [10] Yuxin Fang, Quan Sun, Xinggang Wang, Tiejun Huang, Xinlong Wang, and Yue Cao. EVA-02: A visual representation for neon genesis. Image and Vision Computing, 149:105171, 2024.
  11. [11] Theophile Gervet, Zhou Xian, Nikolaos Gkanatsios, and Katerina Fragkiadaki. Act3D: 3D feature field transformers for multi-task robotic manipulation. In Conference on Robot Learning, pages 3949–3965. PMLR, 2023.
  12. [12] Ankit Goyal, Hei Law, Bowei Liu, Alejandro Newell, and Jia Deng. Revisiting point cloud shape classification with a simple and effective baseline. In International Conference on Machine Learning, pages 3809–3820. PMLR, 2021.
  13. [13] Ankit Goyal, Jie Xu, Yijie Guo, Valts Blukis, Yu-Wei Chao, and Dieter Fox. RVT: Robotic view transformer for 3D object manipulation. In Conference on Robot Learning, pages 694–710. PMLR, 2023.
  14. [14] Christian Graf, David B. Adrian, Joshua Weil, Miroslav Gabriel, Philipp Schillinger, Markus Spies, Heiko Neumann, and Andras Gabor Kupcsik. Learning dense visual descriptors using image augmentations for robot manipulation tasks. In Conference on Robot Learning, pages 871–880. PMLR, 2023.
  15. [15] Meng-Hao Guo, Jun-Xiong Cai, Zheng-Ning Liu, Tai-Jiang Mu, Ralph R. Martin, and Shi-Min Hu. PCT: Point cloud transformer. Computational Visual Media, 7(2):187–199, 2021.
  16. [16] Abdullah Hamdi, Silvio Giancola, and Bernard Ghanem. MVTN: Multi-view transformation network for 3D shape recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1–11, 2021.
  17. [17] Abdullah Hamdi, Silvio Giancola, and Bernard Ghanem. Voint Cloud: Multi-view point cloud representation for 3D understanding. arXiv preprint arXiv:2111.15363, 2021.
  18. [18] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. arXiv preprint arXiv:2006.11239, 2020.
  19. [19] Ji Hou, Saining Xie, Benjamin Graham, Angela Dai, and Matthias Nießner. Pri3D: Can 3D priors help 2D representation learning? In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5693–5702, 2021.
  20. [20] Haigen Hu, Xiaoyuan Wang, Yan Zhang, Qi Chen, and Qiu Guan. A comprehensive survey on contrastive learning. Neurocomputing, 610:128645, 2024.
  21. [21] Jiangyong Huang, Silong Yong, Xiaojian Ma, Xiongkun Linghu, Puhao Li, Yan Wang, Qing Li, Song-Chun Zhu, Baoxiong Jia, and Siyuan Huang. An embodied generalist agent in 3D world. arXiv preprint arXiv:2311.12871, 2023.
  22. [22] Shijia Huang, Yilun Chen, Jiaya Jia, and Liwei Wang. Multi-view transformer for 3D visual grounding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15524–15533, 2022.
  23. [23] Xiaoshui Huang, Zhou Huang, Sheng Li, Wentao Qu, Tong He, Yuenan Hou, Yifan Zuo, and Wanli Ouyang. Frozen CLIP transformer is an efficient point cloud encoder. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 2382–2390, 2024.
  24. [24] Physical Intelligence, Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Manuel Y. Galliker, Dibya Ghosh, Lachy Groom, Karol Hausman, Brian Ichter, Szymon Jakubczak, Tim Jones, Liyiming Ke, Devin LeBlanc, Sergey Levine, Adrian Li-Bell, Mohith Mothukuri, Suraj Nair, Karl Pert…
  25. [25] Andrew Jaegle, Sebastian Borgeaud, Jean-Baptiste Alayrac, Carl Doersch, Catalin Ionescu, David Ding, Skanda Koppula, Daniel Zoran, Andrew Brock, Evan Shelhamer, et al. Perceiver IO: A general architecture for structured inputs & outputs. arXiv preprint arXiv:2107.14795, 2021.
  26. [26] Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. In International Conference on Machine Learning, pages 4904–4916. PMLR, 2021.
  27. [27] Yueru Jia, Jiaming Liu, Sixiang Chen, Chenyang Gu, Zhilve Wang, Longzan Luo, Xiaoqi Li, Pengwei Wang, Zhongyuan Wang, Renrui Zhang, et al. Lift3D Policy: Lifting 2D foundation models for robust 3D robotic manipulation. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 17347–17358, 2025.
  28. [28] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations (ICLR), San Diego, CA, USA, 2015.
  29. [29] http://arxiv.org/abs/1412.6980 (URL continuing [28]).
  30. [30] Juho Lee, Yoonho Lee, Jungtaek Kim, Adam Kosiorek, Seungjin Choi, and Yee Whye Teh. Set Transformer: A framework for attention-based permutation-invariant neural networks. In International Conference on Machine Learning, pages 3744–3753. PMLR, 2019.
  31. [31] Sung-Wook Lee, Xuhui Kang, Brandon Yang, and Yen-Ling Kuo. CLASS: Contrastive learning via action sequence supervision for robot manipulation. In Conference on Robot Learning (CoRL), 2025.
  32. [32] Xu Ma, Can Qin, Haoxuan You, Haoxi Ran, and Yun Fu. Rethinking network design and local geometry in point cloud: A simple residual MLP framework. arXiv preprint arXiv:2202.07123, 2022.
  33. [33] Zijing Ma, Zhi Yang, Aihua Mao, Shuyi Wen, Ran Yi, and Yongjin Liu. A multi-view projection-based object-aware graph network for dense captioning of point clouds. Computers & Graphics, 126:104156, 2025.
  34. [34] Kevis-Kokitsi Maninis, Kaifeng Chen, Soham Ghosh, Arjun Karpur, Koert Chen, Ye Xia, Bingyi Cao, Daniel Salz, Guangxing Han, Jan Dlabal, et al. TIPS: Text-image pretraining with spatial awareness. arXiv preprint arXiv:2410.16512, 2024.
  35. [35] NVIDIA: Johan Bjorck, Fernando Castañeda, Nikita Cherniadev, Xingye Da, Runyu Ding, Linxi "Jim" Fan, Yu Fang, Dieter Fox, Fengyuan Hu, Spencer Huang, Joel Jang, Zhenyu Jiang, Jan Kautz, Kaushil Kundalia, Lawrence Lao, Zhiqi Li, Zongyu Lin, Kevin Lin, Guilin Liu, Edith Llontop, Loic Magne, Ajay Mandlekar, Avnish Narayan, Soroush Nasiriany, Scott Reed, … GR00T N1: An Open Foundation Model for Generalist Humanoid Robots.
  36. [36] Charles R. Qi, Hao Su, Kaichun Mo, and Leonidas J. Guibas. PointNet: Deep learning on point sets for 3D classification and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 652–660, 2017.
  37. [37] Charles Ruizhongtai Qi, Li Yi, Hao Su, and Leonidas J. Guibas. PointNet++: Deep hierarchical feature learning on point sets in a metric space. Advances in Neural Information Processing Systems, 30, 2017.
  38. [38] Shengyi Qian, Kaichun Mo, Valts Blukis, David F. Fouhey, Dieter Fox, and Ankit Goyal. 3D-MVP: 3D multi-view pretraining for manipulation. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 22530–22539, 2025.
  39. [39] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.
  40. [40] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67, 2020.
  41. [41] Connor Schenck, Isaac Reid, Mithun George Jacob, Alex Bewley, Joshua Ainslie, David Rendleman, Deepali Jain, Mohit Sharma, Avinava Dubey, Ayzaan Wahid, et al. Learning the RoPEs: Better 2D and 3D position encodings with STRING. arXiv preprint arXiv:2502.02562, 2025.
  42. [42] Tanner Schmidt, Richard Newcombe, and Dieter Fox. Self-supervised visual descriptor learning for dense correspondence. IEEE Robotics and Automation Letters, 2(2):420–427, 2016.
  43. [43] Mohit Shridhar, Lucas Manuelli, and Dieter Fox. Perceiver-Actor: A multi-task transformer for robotic manipulation. In Conference on Robot Learning, pages 785–…
  44. [44] Quan Sun, Yuxin Fang, Ledell Wu, Xinlong Wang, and Yue Cao. EVA-CLIP: Improved training techniques for CLIP at scale. arXiv preprint arXiv:2303.15389, 2023.
  45. [45] Gemini Robotics Team. Gemini Robotics 1.5: Pushing the frontier of generalist robots with advanced embodied reasoning, thinking, and motion transfer, 2025. https://arxiv.org/abs/2510.03342.
  46. [46] Emanuel Todorov, Tom Erez, and Yuval Tassa. MuJoCo: A physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 5026–5033. IEEE, 2012.
  47. [47] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.
  48. [48] Yue Wang, Yongbin Sun, Ziwei Liu, Sanjay E. Sarma, Michael M. Bronstein, and Justin M. Solomon. Dynamic graph CNN for learning on point clouds. ACM Transactions on Graphics (TOG), 38(5):1–12, 2019.
  49. [49] Yufei Wang, Ziyu Wang, Mino Nakura, Pratik Bhowal, Chia-Liang Kuo, Yi-Ting Chen, Zackory Erickson, and David Held. ArticuBot: Learning universal articulated object manipulation policy via large scale simulation. arXiv preprint arXiv:2503.03045, 2025.
  50. [50] Xiaoyang Wu, Yixing Lao, Li Jiang, Xihui Liu, and Hengshuang Zhao. Point Transformer V2: Grouped vector attention and partition-based pooling. Advances in Neural Information Processing Systems, 35:33330–33342, 2022.
  51. [51] Xiaoyang Wu, Li Jiang, Peng-Shuai Wang, Zhijian Liu, Xihui Liu, Yu Qiao, Wanli Ouyang, Tong He, and Hengshuang Zhao. Point Transformer V3: Simpler, faster, stronger. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4840–4851, 2024.
  52. [52] Runsen Xu, Xiaolong Wang, Tai Wang, Yilun Chen, Jiangmiao Pang, and Dahua Lin. PointLLM: Empowering large language models to understand point clouds. In European Conference on Computer Vision, pages 131–…
  53. [53] Le Xue, Mingfei Gao, Chen Xing, Roberto Martín-Martín, Jiajun Wu, Caiming Xiong, Ran Xu, Juan Carlos Niebles, and Silvio Savarese. ULIP: Learning a unified representation of language, images, and point clouds for 3D understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1179–1189, 2023.
  54. [54] Le Xue, Ning Yu, Shu Zhang, Artemis Panagopoulou, Junnan Li, Roberto Martín-Martín, Jiajun Wu, Caiming Xiong, Ran Xu, Juan Carlos Niebles, et al. ULIP-2: Towards scalable multimodal pre-training for 3D understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 27091–27101, 2024.
  55. [55] Linting Xue, Noah Constant, Adam Roberts, Mihir Kale, Rami Al-Rfou, Aditya Siddhant, Aditya Barua, and Colin Raffel. mT5: A massively multilingual pre-trained text-to-text transformer. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 483–498, 2021.
  56. [56] Jiahui Yu, Zirui Wang, Vijay Vasudevan, Legg Yeung, Mojtaba Seyedhosseini, and Yonghui Wu. CoCa: Contrastive captioners are image-text foundation models. arXiv preprint arXiv:2205.01917, 2022.
  57. [57] Xumin Yu, Lulu Tang, Yongming Rao, Tiejun Huang, Jie Zhou, and Jiwen Lu. Point-BERT: Pre-training 3D point cloud transformers with masked point modeling. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19313–19322, 2022.
  58. [58] Lu Yuan, Dongdong Chen, Yi-Ling Chen, Noel Codella, Xiyang Dai, Jianfeng Gao, Houdong Hu, Xuedong Huang, Boxin Li, Chunyuan Li, et al. Florence: A new foundation model for computer vision. arXiv preprint arXiv:2111.11432, 2021.
  59. [59] K. Zakka, Y. Tassa, and MuJoCo Menagerie Contributors. MuJoCo Menagerie: A collection of high-quality simulation models for MuJoCo. http://github.com/google-deepmind/mujoco_menagerie, 2022. Accessed: 2025-12-15.
  60. [60] Yanjie Ze, Ge Yan, Yueh-Hua Wu, Annabella Macaluso, Yuying Ge, Jianglong Ye, Nicklas Hansen, Li Erran Li, and Xiaolong Wang. GNFactor: Multi-task real robot learning with generalizable neural feature fields. In Conference on Robot Learning, pages 284–301. PMLR, 2023.
  61. [61] Yanjie Ze, Gu Zhang, Kangning Zhang, Chenyuan Hu, Muhan Wang, and Huazhe Xu. 3D Diffusion Policy: Generalizable visuomotor policy learning via simple 3D representations. In Proceedings of Robotics: Science and Systems, Delft, Netherlands, July 2024.
  62. [62] Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 11975–11986, 2023.
  63. [63] Renrui Zhang, Ziyu Guo, Peng Gao, Rongyao Fang, Bin Zhao, Dong Wang, Yu Qiao, and Hongsheng Li. Point-M2AE: Multi-scale masked autoencoders for hierarchical point cloud pre-training. Advances in Neural Information Processing Systems, 35:27061–27074, 2022.
  64. [64] Renrui Zhang, Ziyu Guo, Wei Zhang, Kunchang Li, Xupeng Miao, Bin Cui, Yu Qiao, Peng Gao, and Hongsheng Li. PointCLIP: Point cloud understanding by CLIP. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8552–8562, 2022.
  65. [65] Hengshuang Zhao, Li Jiang, Jiaya Jia, Philip H. S. Torr, and Vladlen Koltun. Point Transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 16259–16268, 2021.
  66. [66] Tony Z. Zhao, Vikash Kumar, Sergey Levine, and Chelsea Finn. Learning fine-grained bimanual manipulation with low-cost hardware. 2023.
  67. [67] Tony Z. Zhao, Jonathan Tompson, Danny Driess, Pete Florence, Kamyar Ghasemipour, Chelsea Finn, and Ayzaan Wahid. ALOHA Unleashed: A simple recipe for robot dexterity. arXiv preprint arXiv:2410.13126, 2024.
  68. [68] Junsheng Zhou, Jinsheng Wang, Baorui Ma, Yu-Shen Liu, Tiejun Huang, and Xinlong Wang. Uni3D: Exploring unified 3D representation at scale. arXiv preprint arXiv:2310.06773, 2023.
  69. [69] Wenxuan Zhou, Bowen Jiang, Fan Yang, Chris Paxton, and David Held. HACMan: Learning hybrid actor-critic maps for 6D non-prehensile manipulation. arXiv preprint arXiv:2305.03942, 2023.
  70. [70] Xiangyang Zhu, Renrui Zhang, Bowei He, Ziyu Guo, Ziyao Zeng, Zipeng Qin, Shanghang Zhang, and Peng Gao. PointCLIP V2: Prompting CLIP and GPT for powerful 3D open-world learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2639–2650, 2023.
  71. [71] Internal anchor (appendix, implementation details): "For image preprocessing, we use a voxel size of 0.001 for voxel-grid downsampling. If the point cloud contains more than 0.3M points, we randomly subsample to 0.3M; otherwise, we pad with points at the origin (0,0,0) to reach 0.3M points. Each virtual view has dimensions 224×224×4, and we tile the views horizontally to form a 224×1120×4 input. The action enc…"
  72. [72] Internal anchor (appendix, GR1.5 tasks and episode counts): multitoolsmagnifierincaddy_left: 47,679; multitoolsscrewdriverincaddy_left: 42,466; multitoolsmagnifierincaddy_right: 47,613; handoverpen: 39,603; multitoolsscissorsincaddy_right: 42,167; multitoolscanopenerincaddy_right: 42,069; multitoolsscissorsincaddy_left: 40,568; multitoysraccooninbasket: 40,409; multitoysfireenginei…
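Entry [71] describes the preprocessing concretely enough to sketch. The following NumPy illustration follows the quoted numbers (0.001 voxel size, a 0.3M-point budget, five 224×224×4 views tiled into a 224×1120×4 input); the one-point-per-voxel reduction rule and the padding details are assumptions about the exact implementation.

```python
import numpy as np

def preprocess_cloud(points, voxel=0.001, n_target=300_000, rng=None):
    """Voxel-grid downsample, then subsample or zero-pad to a fixed budget,
    following the quoted appendix text; keeping one arbitrary point per
    voxel is an assumption about the exact reduction rule."""
    rng = rng if rng is not None else np.random.default_rng(0)
    keys = np.floor(points / voxel).astype(np.int64)
    _, first = np.unique(keys, axis=0, return_index=True)
    pts = points[first]
    if len(pts) > n_target:
        pts = pts[rng.choice(len(pts), n_target, replace=False)]
    elif len(pts) < n_target:
        pad = np.zeros((n_target - len(pts), 3), dtype=points.dtype)
        pts = np.concatenate([pts, pad])      # pad with points at the origin
    return pts

# Five 224x224x4 virtual views tiled horizontally into one 224x1120x4 input.
views = [np.zeros((224, 224, 4), dtype=np.float32) for _ in range(5)]
model_input = np.concatenate(views, axis=1)   # shape (224, 1120, 4)
```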