
arxiv: 2511.17001 · v2 · pith:5QU5RVJ6new · submitted 2025-11-21 · 💻 cs.RO

Unify Robot Actions in Camera Frame

Pith reviewed 2026-05-17 21:13 UTC · model grok-4.3

classification 💻 cs.RO
keywords cross-embodiment · action representation · camera frame · extrinsics estimation · robot learning · calibration pipeline · pretraining · bimanual

The pith

Unifying robot actions in the camera frame creates consistent geometric semantics across embodiments.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Existing action representations differ by robot platform, blocking effective cross-embodiment learning. The paper proposes representing actions in the shared camera frame so that movements have the same meaning no matter the robot body. To enable this on past datasets that lack the needed camera information, the authors introduce CalibAll, a method that estimates camera extrinsics in a training-free way. CalibAll starts with a rough guess from temporal point matching and then sharpens it using image rendering optimization. With this, actions from many robots can be standardized and used together for pretraining that outperforms prior approaches.

Core claim

By estimating camera extrinsics, robot actions are converted to the camera coordinate system where they share consistent geometric semantics independent of the specific robot embodiment. CalibAll achieves this annotation for offline datasets through a coarse-to-fine process consisting of temporal PnP initialization followed by differentiable rendering refinement, resulting in standardized camera-frame TCP-pose actions applicable to single-arm and bimanual setups.
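
The conversion described above can be sketched as a single change of frame. This is a minimal illustration, not the paper's code: the helper names (`make_pose`, `invert_pose`, `tcp_in_camera`) and all numeric poses are hypothetical, and rotations are restricted to yaw for brevity. It shows two robots with different base placements reporting different base-frame TCP poses for the same physical pose, which coincide once expressed in the camera frame.

```python
import math

def matmul(a, b):
    """Multiply two 4x4 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def make_pose(yaw_deg, translation):
    """Build a 4x4 homogeneous transform from a yaw angle (about z) and a translation."""
    c, s = math.cos(math.radians(yaw_deg)), math.sin(math.radians(yaw_deg))
    x, y, z = translation
    return [[c, -s, 0.0, x],
            [s, c, 0.0, y],
            [0.0, 0.0, 1.0, z],
            [0.0, 0.0, 0.0, 1.0]]

def invert_pose(t):
    """Invert a rigid transform: [R | p]^-1 = [R^T | -R^T p]."""
    r_t = [[t[j][i] for j in range(3)] for i in range(3)]
    p = [t[i][3] for i in range(3)]
    new_p = [-sum(r_t[i][k] * p[k] for k in range(3)) for i in range(3)]
    return [r_t[0] + [new_p[0]], r_t[1] + [new_p[1]], r_t[2] + [new_p[2]],
            [0.0, 0.0, 0.0, 1.0]]

def tcp_in_camera(t_cam_base, t_base_tcp):
    """Express a base-frame TCP pose in the camera frame using the extrinsic."""
    return matmul(t_cam_base, t_base_tcp)

# One physical TCP pose, expressed in the camera frame (hypothetical values, metres).
t_cam_tcp = make_pose(30.0, (0.1, -0.2, 0.8))

# Two robots whose bases sit at different places relative to the camera.
t_cam_base_a = make_pose(90.0, (0.5, 0.0, 1.0))
t_cam_base_b = make_pose(-45.0, (-0.3, 0.4, 1.2))

# What each robot would log in its own base frame for that same pose.
tcp_a = matmul(invert_pose(t_cam_base_a), t_cam_tcp)
tcp_b = matmul(invert_pose(t_cam_base_b), t_cam_tcp)

# After conversion with the respective extrinsics, both agree again.
cam_a = tcp_in_camera(t_cam_base_a, tcp_a)
cam_b = tcp_in_camera(t_cam_base_b, tcp_b)
```

The base-frame poses `tcp_a` and `tcp_b` differ, while `cam_a` and `cam_b` match up to floating-point error, which is the consistency property the core claim relies on.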

What carries the argument

CalibAll, a training-free pipeline that estimates camera extrinsics via temporal PnP initialization and differentiable rendering refinement to unify actions in the camera frame.
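
The coarse-to-fine idea can be illustrated with a deliberately simplified toy. It is not CalibAll: the camera rotation is assumed known (identity), a keypoint reprojection loss stands in for the rendering loss, and central finite differences stand in for a differentiable renderer. All names and values are hypothetical.

```python
import math

def project_normalized(point_base, translation):
    """Project a base-frame point into normalized image coordinates.

    Toy assumption: rotation is identity, so only the translation is unknown.
    """
    x = point_base[0] + translation[0]
    y = point_base[1] + translation[1]
    z = point_base[2] + translation[2]
    return (x / z, y / z)

def reprojection_loss(translation, points_base, observations):
    """Mean squared distance between projected and observed keypoints."""
    total = 0.0
    for point, (u_obs, v_obs) in zip(points_base, observations):
        u, v = project_normalized(point, translation)
        total += (u - u_obs) ** 2 + (v - v_obs) ** 2
    return total / len(points_base)

def refine(translation, points_base, observations, lr=0.2, steps=3000, eps=1e-6):
    """Gradient descent with central finite differences (stand-in for autodiff)."""
    t = list(translation)
    for _ in range(steps):
        grad = []
        for i in range(3):
            plus, minus = list(t), list(t)
            plus[i] += eps
            minus[i] -= eps
            diff = (reprojection_loss(plus, points_base, observations)
                    - reprojection_loss(minus, points_base, observations))
            grad.append(diff / (2 * eps))
        t = [t[i] - lr * grad[i] for i in range(3)]
    return t

def distance(a, b):
    """Euclidean distance between two 3-vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical end-effector positions in the base frame (metres), one per frame.
points_base = [(-0.3, -0.2, 0.0), (0.3, -0.2, 0.1), (0.3, 0.25, -0.1),
               (-0.3, 0.25, 0.2), (0.0, 0.0, 0.05), (0.15, -0.1, 0.15)]
true_translation = (0.05, -0.02, 1.0)
observations = [project_normalized(p, true_translation) for p in points_base]

coarse = (0.0, 0.0, 1.2)  # stands in for the coarse PnP initialization
refined = refine(coarse, points_base, observations)

coarse_error = distance(coarse, true_translation)
refined_error = distance(refined, true_translation)
print(f"coarse error:  {coarse_error:.4f} m")
print(f"refined error: {refined_error:.6f} m")
```

The point of the split is that the refinement only needs to be locally well behaved: a reasonable starting estimate keeps the gradient-based stage away from poor local minima.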

Load-bearing premise

The estimated camera extrinsics must accurately reflect the true geometry so that action semantics remain consistent when moving between different robot embodiments.

What would settle it

If experiments show that pretraining with camera-frame actions does not surpass current methods that use embodiment-specific action heads, the advantage of unification would be called into question.

Figures

Figures reproduced from arXiv: 2511.17001 by Haidong Cao, Jiaqi Leng, Lingchen Meng, Mingsheng Li, Qiuyue Wang, Shuai Bai, Shuyuan Tu, Sicheng Xie, Yu-Gang Jiang, Zhiying Du, Zijie Diao, Zuxuan Wu.

Figure 1
Overview of CalibAll, which estimates camera extrinsics automatically and without training for data from any robot type, and provides additional notations from a single mark.
Figure 2
Architecture overview of CalibAll, a simple yet effective method that requires only a single mark. It follows a coarse-to-fine calibration pipeline that achieves training-free, stable, and accurate camera extrinsic estimation across diverse datasets and robot platforms. CalibAll first uses EEF Recognition to obtain the end-effector tracking point, then applies temporal PnP to estimate a coarse extrinsic, and …
Figure 3
Detailed results of CalibAll on DREAM [19]. The x-axis represents the number of iterations of the rendering-based optimization.
Figure 4
Qualitative result of EEF recognition on Franka, xArm and UR5e. The first row shows the heatmaps obtained from feature matching. The second row visualizes the selected tracking point based on the maximum similarity.
Figure 5
Qualitative result of coarse initialization and extrinsic refinement on Franka, xArm and UR5e. The first row presents the source RGB images. The second row shows the rendered result using the camera extrinsic obtained from the automatic coarse initialization approach. The last row shows the rendered result of the final camera extrinsic produced by CalibAll.
Figure 6
Additional notations. 1) Top-left: the original RGB image; 2) Top-right: the mask of each robot link; 3) Bottom-left: the depth map of the robot arm; 4) Bottom-right: the GT 2D trajectory of the end-effector.
Figure 7
Camera extrinsic for evaluation. The images show the trajectory of performing the task “close the drawer.” The left image depicts the ground-truth trajectory, while the right image visualizes the action predicted by π0 [1]. The green points represent the starting points, and the red points indicate the ending points.
Figure 9
Rendering result using the DROID ground-truth camera extrinsic and the camera extrinsic estimated by CalibAll.
Figure 10
Overview of rendering.
Original abstract

Cross-embodiment robot learning requires a unified action representation with consistent semantics across robot platforms. Existing representations suffer from platform-specific inconsistencies, while current solutions either maintain embodiment-specific action heads or learn latent action spaces, without fundamentally resolving the mismatch. We propose to unify robot actions in the camera frame using camera extrinsics, so that actions share consistent geometric semantics across different robot embodiments, including both single-arm and bimanual robots. However, most existing datasets lack camera extrinsic annotations, and existing offline calibration methods either suffer from local minima or require robot-specific training data. To address this gap, we present CalibAll, a training-free, robot-independent annotation pipeline that estimates camera extrinsics for offline datasets and converts heterogeneous robot actions into standardized camera-frame actions. CalibAll follows a coarse-to-fine calibration strategy: temporal PnP provides a stable initialization, followed by differentiable rendering-based refinement for high precision. Beyond extrinsics, CalibAll produces standardized TCP-pose actions and auxiliary annotations. We apply CalibAll to 16 datasets across 4 robot platforms, producing approximately 97K calibrated data episodes. Downstream simulation and real-robot experiments show that cross-embodiment pretraining with camera-frame actions achieves state-of-the-art performance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes unifying heterogeneous robot actions into a camera-frame representation using estimated camera extrinsics. It introduces CalibAll, a training-free, robot-independent coarse-to-fine calibration pipeline (temporal PnP initialization followed by differentiable rendering refinement) that annotates offline datasets lacking extrinsic labels. The method is applied to 16 datasets across 4 robot platforms to produce ~97K calibrated episodes with standardized TCP-pose actions. Downstream simulation and real-robot experiments then demonstrate that cross-embodiment pretraining with these camera-frame actions achieves state-of-the-art performance.

Significance. If the reported calibration precision is sufficient to preserve consistent geometric semantics across embodiments, the approach offers a simple, geometrically grounded alternative to embodiment-specific action heads or learned latent spaces. The scale of application (16 datasets, 4 platforms) and the downstream SOTA claims would represent a practical contribution to cross-embodiment robot learning.

major comments (2)
  1. [Abstract] Abstract and experimental section: the claim that camera-frame actions achieve SOTA cross-embodiment performance rests on the unverified assumption that CalibAll recovers extrinsics accurate enough for consistent semantics. No reprojection error, pose error versus ground truth, ablation on the differentiable-rendering refinement step, or comparison against alternative extrinsic estimators is reported for the 16 datasets.
  2. [Method] Method description of the coarse-to-fine pipeline: without quantitative metrics on residual rotation/translation error after refinement, it is impossible to determine whether any remaining mismatch is smaller than the reported unification benefit, undermining the central claim that actions share consistent geometric semantics across single-arm and bimanual embodiments.
minor comments (2)
  1. Clarify the precise definition and standardization procedure for camera-frame TCP-pose actions, especially how they are adapted for bimanual robots.
  2. Add a table or section summarizing the 4 robot platforms, camera setups, and dataset characteristics to help readers assess the diversity of the 16 datasets.
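
The reprojection error requested in the first major comment can be sketched as follows. This is a minimal example under a pinhole model with known intrinsics; the function names, intrinsics, and points are hypothetical and not taken from the paper.

```python
import math

def to_camera(rotation, translation, point_base):
    """Apply the extrinsic (R, t): p_cam = R @ p_base + t."""
    return tuple(
        sum(rotation[i][k] * point_base[k] for k in range(3)) + translation[i]
        for i in range(3)
    )

def project(point_cam, fx, fy, cx, cy):
    """Pinhole projection of a camera-frame point to pixel coordinates."""
    x, y, z = point_cam
    if z <= 0.0:
        raise ValueError("point is behind the camera")
    return (fx * x / z + cx, fy * y / z + cy)

def mean_reprojection_error(rotation, translation, points_base, pixels, intrinsics):
    """Average pixel distance between projected 3D points and observed 2D points."""
    fx, fy, cx, cy = intrinsics
    errors = []
    for point, (u_obs, v_obs) in zip(points_base, pixels):
        u, v = project(to_camera(rotation, translation, point), fx, fy, cx, cy)
        errors.append(math.hypot(u - u_obs, v - v_obs))
    return sum(errors) / len(errors)

identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
intrinsics = (500.0, 500.0, 320.0, 240.0)  # hypothetical fx, fy, cx, cy in pixels
points_base = [(0.1, 0.0, 0.0), (0.0, 0.1, 0.0), (-0.1, -0.1, 0.0)]
true_t = (0.0, 0.0, 1.0)

# Synthetic observations generated from the true extrinsic.
pixels = [project(to_camera(identity, true_t, p), *intrinsics) for p in points_base]

exact = mean_reprojection_error(identity, true_t, points_base, pixels, intrinsics)
shifted = mean_reprojection_error(identity, (0.01, 0.0, 1.0), points_base, pixels, intrinsics)
print(exact, shifted)
```

In this setup a 1 cm lateral error at 1 m depth with a 500 px focal length shifts every projection by about 5 px, which gives a sense of scale for judging reported residuals.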

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript accordingly to strengthen the validation of CalibAll.

read point-by-point responses
  1. Referee: [Abstract] Abstract and experimental section: the claim that camera-frame actions achieve SOTA cross-embodiment performance rests on the unverified assumption that CalibAll recovers extrinsics accurate enough for consistent semantics. No reprojection error, pose error versus ground truth, ablation on the differentiable-rendering refinement step, or comparison against alternative extrinsic estimators is reported for the 16 datasets.

    Authors: We agree that direct quantitative validation of extrinsic accuracy is necessary to support the SOTA claims. Ground-truth extrinsics are unavailable for most of the 16 datasets, so pose error versus ground truth cannot be computed universally. In the revised manuscript we will report reprojection errors across the datasets, add an ablation isolating the differentiable rendering refinement step, and include comparisons to alternative extrinsic estimators on the subsets that possess ground-truth labels. These additions will demonstrate that residual errors remain small relative to the observed unification benefits. revision: yes

  2. Referee: [Method] Method description of the coarse-to-fine pipeline: without quantitative metrics on residual rotation/translation error after refinement, it is impossible to determine whether any remaining mismatch is smaller than the reported unification benefit, undermining the central claim that actions share consistent geometric semantics across single-arm and bimanual embodiments.

    Authors: We acknowledge the value of residual error metrics after refinement for assessing whether mismatches affect semantic consistency. The revised manuscript will include quantitative rotation and translation residuals post-refinement together with a discussion relating these errors to the scale of the unification gains reported in the experiments. This will clarify that the remaining mismatch does not undermine the geometric consistency claim. revision: yes

standing simulated objections not resolved
  • Ground-truth pose error cannot be reported for the majority of the 16 datasets because they lack ground-truth extrinsic annotations.
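
For the subsets that do carry ground-truth extrinsics, rotation and translation residuals can be computed as in the sketch below. It uses the standard geodesic angle between rotation matrices and the Euclidean distance between translations; the helper names and the example poses are hypothetical.

```python
import math

def rotation_error_deg(r_est, r_gt):
    """Geodesic angle between two rotation matrices, in degrees.

    Uses trace(R_est^T R_gt) = sum_ij R_est[i][j] * R_gt[i][j].
    """
    trace = sum(r_est[i][j] * r_gt[i][j] for i in range(3) for j in range(3))
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))  # clamp for numerical safety
    return math.degrees(math.acos(cos_angle))

def translation_error(t_est, t_gt):
    """Euclidean distance between two translations (same unit as the inputs)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(t_est, t_gt)))

def rot_z(deg):
    """Rotation matrix for a rotation of `deg` degrees about the z-axis."""
    c, s = math.cos(math.radians(deg)), math.sin(math.radians(deg))
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

# Hypothetical ground-truth and estimated extrinsics (translations in metres).
r_gt, t_gt = rot_z(10.0), (0.50, 0.00, 1.00)
r_est, t_est = rot_z(13.0), (0.53, 0.04, 1.00)

rot_err = rotation_error_deg(r_est, r_gt)
pos_err_cm = 100.0 * translation_error(t_est, t_gt)
print(f"rotation error: {rot_err:.2f} deg, position error: {pos_err_cm:.2f} cm")
```

Reporting both numbers per dataset, in degrees and centimetres, would let readers compare the residual mismatch directly against the size of the downstream gains.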

Circularity Check

0 steps flagged

No significant circularity; calibration uses independent standard CV primitives

full rationale

The paper's core derivation applies standard, externally validated computer-vision methods (temporal PnP for initialization and differentiable rendering for refinement) to estimate camera extrinsics on existing datasets. These primitives are independent of the robot-learning results and do not depend on the claimed cross-embodiment gains. The unification step simply transforms TCP-pose actions into camera-frame coordinates using the recovered extrinsics; the SOTA performance is then shown via separate downstream pretraining experiments on the resulting ~97K episodes. No equations, definitions, or self-citations reduce the performance claim to a quantity fitted from the same data or defined circularly by construction. The pipeline remains self-contained against external CV benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The method rests on standard pinhole-camera and rigid-body assumptions plus the claim that differentiable rendering can refine extrinsics without robot-specific training data. No new physical entities or free parameters are introduced beyond the usual hyperparameters of PnP and rendering optimization.

axioms (2)
  • standard math Pinhole camera model with known intrinsics
    Invoked when temporal PnP is used to obtain initial extrinsics.
  • domain assumption Differentiable rendering can produce gradients that improve extrinsic estimates without embodiment-specific supervision
    Central to the refinement stage described in the abstract.
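
For reference, the pinhole assumption and the reprojection objective that a PnP solver minimizes can be written as below. The symbols ($K$, $R$, $t$, $P_i$, $(u_i, v_i)$, $\pi$) are notation introduced here for illustration and may differ from the paper's; $P_i$ are 3D points in the robot base frame and $(u_i, v_i)$ their observed pixels.

```latex
\[
\lambda_i \begin{bmatrix} u_i \\ v_i \\ 1 \end{bmatrix}
  = K \left( R\, P_i + t \right),
\qquad
K = \begin{bmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix}
\]
\[
(R^{\ast}, t^{\ast})
  = \arg\min_{R \in SO(3),\; t \in \mathbb{R}^{3}}
    \sum_{i} \left\| \pi\!\left( K \left( R\, P_i + t \right) \right)
      - \begin{bmatrix} u_i \\ v_i \end{bmatrix} \right\|^{2},
\qquad
\pi\!\left( [x, y, z]^{\top} \right) = \left[ x/z,\; y/z \right]^{\top}
\]
```

Here $\lambda_i$ is the depth of point $i$ in the camera frame, and $(R, t)$ is the camera extrinsic that maps base-frame points into the camera frame.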

pith-pipeline@v0.9.0 · 5559 in / 1337 out tokens · 43477 ms · 2026-05-17T21:13:16.960571+00:00 · methodology


Reference graph

Works this paper leans on

44 extracted references · 44 canonical work pages · 12 internal anchors

  1. [1]

    $\pi_0$: A Vision-Language-Action Flow Model for General Robot Control

    Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, et al. $\pi_0$: A vision-language-action flow model for general robot control. arXiv preprint arXiv:2410.24164, 2024.

  2. [2]

    Easyhec: Accurate and automatic hand-eye calibration via differentiable rendering and space exploration

    Linghao Chen, Yuzhe Qin, Xiaowei Zhou, and Hao Su. Easyhec: Accurate and automatic hand-eye calibration via differentiable rendering and space exploration. RA-L, 2023.

  3. [3]

    Diffusion policy: Visuomotor policy learning via action diffusion

    Cheng Chi, Zhenjia Xu, Siyuan Feng, Eric Cousineau, Yilun Du, Benjamin Burchfiel, Russ Tedrake, and Shuran Song. Diffusion policy: Visuomotor policy learning via action diffusion. IJRR, 2025.

  4. [4]

    Hand-eye calibration using dual quaternions

    Konstantinos Daniilidis. Hand-eye calibration using dual quaternions. IJRR, 1999.

  5. [5]

    An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

    Alexey Dosovitskiy. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.

  6. [6]

    Airexo-2: Scaling up generalizable robotic imitation learning with low-cost exoskeletons

    Hongjie Fang, Chenxi Wang, Yiming Wang, Jingjing Chen, Shangning Xia, Jun Lv, Zihao He, Xiyan Yi, Yunhan Guo, Xinyu Zhan, et al. Airexo-2: Scaling up generalizable robotic imitation learning with low-cost exoskeletons. arXiv preprint arXiv:2503.03081, 2025.

  7. [7]

    Mobile ALOHA: Learning Bimanual Mobile Manipulation with Low-Cost Whole-Body Teleoperation

    Zipeng Fu, Tony Z Zhao, and Chelsea Finn. Mobile aloha: Learning bimanual mobile manipulation with low-cost whole-body teleoperation. arXiv preprint arXiv:2401.02117,

  8. [8]

    Easyhec++: Fully automatic hand-eye calibration with pretrained image models

    Zhengdong Hong, Kangfu Zheng, and Linghao Chen. Easyhec++: Fully automatic hand-eye calibration with pretrained image models. In IROS, 2024.

  9. [9]

    Robust robot-camera calibration

    Jarmo Ilonen and Ville Kyrki. Robust robot-camera calibration. In ICAR, 2011.

  10. [10]

    $\pi_{0.5}$: a Vision-Language-Action Model with Open-World Generalization

    Physical Intelligence, Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, et al. $\pi_{0.5}$: a vision-language-action model with open-world generalization. arXiv preprint arXiv:2504.16054, 2025.

  11. [11]

    Rlbench: The robot learning benchmark & learning environment

    Stephen James, Zicong Ma, David Rovick Arrojo, and Andrew J Davison. Rlbench: The robot learning benchmark & learning environment. RA-L, 2020.

  12. [12]

    Do you know where your camera is? view-invariant policy learning with camera conditioning

    Tianchong Jiang, Jingtian Ji, Xiangshan Tan, Jiading Fang, Anand Bhattad, Vitor Guizilini, and Matthew R Walter. Do you know where your camera is? view-invariant policy learning with camera conditioning. arXiv preprint arXiv:2510.02268, 2025.

  13. [13]

    Cotracker: It is better to track together

    Nikita Karaev, Ignacio Rocco, Benjamin Graham, Natalia Neverova, Andrea Vedaldi, and Christian Rupprecht. Cotracker: It is better to track together. In ECCV, 2024.

  14. [14]

    Cotracker3: Simpler and better point tracking by pseudo-labelling real videos

    Nikita Karaev, Yuri Makarov, Jianyuan Wang, Natalia Neverova, Andrea Vedaldi, and Christian Rupprecht. Cotracker3: Simpler and better point tracking by pseudo-labelling real videos. In ICCV, 2025.

  15. [15]

    DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset

    Alexander Khazatsky, Karl Pertsch, Suraj Nair, Ashwin Balakrishna, Sudeep Dasari, Siddharth Karamcheti, Soroush Nasiriany, Mohan Kumar Srirama, Lawrence Yunliang Chen, Kirsty Ellis, et al. Droid: A large-scale in-the-wild robot manipulation dataset. arXiv preprint arXiv:2403.12945, 2024.

  16. [16]

    OpenVLA: An Open-Source Vision-Language-Action Model

    Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, et al. Openvla: An open-source vision-language-action model. arXiv preprint arXiv:2406.09246, 2024.

  17. [17]

    Single-view robot pose and joint angle estimation via render & compare

    Yann Labbé, Justin Carpentier, Mathieu Aubry, and Josef Sivic. Single-view robot pose and joint angle estimation via render & compare. In CVPR, 2021.

  18. [18]

    Modular primitives for high-performance differentiable rendering

    Samuli Laine, Janne Hellsten, Tero Karras, Yeongho Seol, Jaakko Lehtinen, and Timo Aila. Modular primitives for high-performance differentiable rendering. ToG, 2020.

  19. [19]

    Camera-to-robot pose estimation from a single image

    Timothy E Lee, Jonathan Tremblay, Thang To, Jia Cheng, Terry Mosier, Oliver Kroemer, Dieter Fox, and Stan Birchfield. Camera-to-robot pose estimation from a single image. In ICRA, 2020.

  20. [20]

    Phantom: Training robots without robots using only human videos

    Marion Lepert, Jiaying Fang, and Jeannette Bohg. Phantom: Training robots without robots using only human videos. arXiv preprint arXiv:2503.00779, 2025.

  21. [21]

    EPnP: An accurate O(n) solution to the PnP problem

    Vincent Lepetit, Francesc Moreno-Noguer, and Pascal Fua. EPnP: An accurate O(n) solution to the PnP problem. IJCV,

  22. [22]

    Prompting depth anything for 4k resolution accurate metric depth estimation

    Haotong Lin, Sida Peng, Jingxiao Chen, Songyou Peng, Jiaming Sun, Minghuan Liu, Hujun Bao, Jiashi Feng, Xiaowei Zhou, and Bingyi Kang. Prompting depth anything for 4k resolution accurate metric depth estimation. In CVPR, 2025.

  23. [23]

    Libero: Benchmarking knowledge transfer for lifelong robot learning

    Bo Liu, Yifeng Zhu, Chongkai Gao, Yihao Feng, Qiang Liu, Yuke Zhu, and Peter Stone. Libero: Benchmarking knowledge transfer for lifelong robot learning. NeurIPS, 2023.

  24. [24]

    Grounding dino: Marrying dino with grounded pre-training for open-set object detection

    Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Qing Jiang, Chunyuan Li, Jianwei Yang, Hang Su, et al. Grounding dino: Marrying dino with grounded pre-training for open-set object detection. In ECCV, 2024.

  25. [25]

    Markerless camera-to-robot pose estimation via self-supervised sim-to-real transfer

    Jingpei Lu, Florian Richter, and Michael C Yip. Markerless camera-to-robot pose estimation via self-supervised sim-to-real transfer. In CVPR, 2023.

  26. [26]

    Ctrnet-x: Camera-to-robot pose estimation in real-world conditions using a single camera

    Jingpei Lu, Zekai Liang, Tristin Xie, Florian Richter, Shan Lin, Sainan Liu, and Michael C Yip. Ctrnet-x: Camera-to-robot pose estimation in real-world conditions using a single camera. In ICRA, 2025.

  27. [27]

    Isaac Gym: High Performance GPU-Based Physics Simulation For Robot Learning

    Viktor Makoviychuk, Lukasz Wawrzyniak, Yunrong Guo, Michelle Lu, Kier Storey, Miles Macklin, David Hoeller, Nikita Rudin, Arthur Allshire, Ankur Handa, et al. Isaac gym: High performance gpu-based physics simulation for robot learning. arXiv preprint arXiv:2108.10470, 2021.

  28. [28]

    Segic: Unleashing the emergent correspondence for in-context segmentation

    Lingchen Meng, Shiyi Lan, Hengduo Li, Jose M Alvarez, Zuxuan Wu, and Yu-Gang Jiang. Segic: Unleashing the emergent correspondence for in-context segmentation. In ECCV, 2024.

  29. [29]

    DINOv2: Learning Robust Visual Features without Supervision

    Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023.

  30. [30]

    Open x-embodiment: Robotic learning datasets and rt-x models: Open x-embodiment collaboration

    Abby O’Neill, Abdul Rehman, Abhiram Maddukuri, Abhishek Gupta, Abhishek Padalkar, Abraham Lee, Acorn Pooley, Agrim Gupta, Ajay Mandlekar, Ajinkya Jain, et al. Open x-embodiment: Robotic learning datasets and rt-x models: Open x-embodiment collaboration. In ICRA, 2024.

  31. [31]

    Robot sensor calibration: solving AX = XB on the Euclidean group

    Frank C Park and Bryan J Martin. Robot sensor calibration: solving AX = XB on the Euclidean group. T-RO, 1994.

  32. [32]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In ICML, 2021.

  33. [33]

    SAM 2: Segment Anything in Images and Videos

    Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, et al. Sam 2: Segment anything in images and videos. arXiv preprint arXiv:2408.00714, 2024.

  34. [34]

    Grounding DINO 1.5: Advance the “edge” of open-set object detection

    Tianhe Ren, Qing Jiang, Shilong Liu, Zhaoyang Zeng, Wenlong Liu, Han Gao, Hongjie Huang, Zhengyu Ma, Xiaoke Jiang, Yihao Chen, et al. Grounding dino 1.5: Advance the “edge” of open-set object detection. arXiv preprint arXiv:2405.10300, 2024.

  35. [35]

    Emergent correspondence from image diffusion

    Luming Tang, Menglin Jia, Qianqian Wang, Cheng Perng Phoo, and Bharath Hariharan. Emergent correspondence from image diffusion. NeurIPS, 2023.

  36. [36]

    A new technique for fully autonomous and efficient 3D robotics hand/eye calibration

    Roger Y Tsai, Reimar K Lenz, et al. A new technique for fully autonomous and efficient 3D robotics hand/eye calibration. T-RO, 1989.

  37. [37]

    Mimicplay: Long-horizon imitation learning by watching human play

    Chen Wang, Linxi Fan, Jiankai Sun, Ruohan Zhang, Li Fei-Fei, Danfei Xu, Yuke Zhu, and Anima Anandkumar. Mimicplay: Long-horizon imitation learning by watching human play. arXiv preprint arXiv:2302.12422, 2023.

  38. [38]

    Meta-world: A benchmark and evaluation for multi-task and meta reinforcement learning

    Tianhe Yu, Deirdre Quillen, Zhanpeng He, Ryan Julian, Karol Hausman, Chelsea Finn, and Sergey Levine. Meta-world: A benchmark and evaluation for multi-task and meta reinforcement learning. In CoRL, 2020.

  39. [39]

    DINO: DETR with Improved DeNoising Anchor Boxes for End-to-End Object Detection

    Hao Zhang, Feng Li, Shilong Liu, Lei Zhang, Hang Su, Jun Zhu, Lionel M Ni, and Heung-Yeung Shum. Dino: Detr with improved denoising anchor boxes for end-to-end object detection. arXiv preprint arXiv:2203.03605, 2022.

  40. [40]

    Vlabench: A large-scale benchmark for language-conditioned robotics manipulation with long-horizon reasoning tasks

    Shiduo Zhang, Zhe Xu, Peiju Liu, Xiaopeng Yu, Yuan Li, Qinghui Gao, Zhaoye Fei, Zhangyue Yin, Zuxuan Wu, Yu-Gang Jiang, et al. Vlabench: A large-scale benchmark for language-conditioned robotics manipulation with long-horizon reasoning tasks. In ICCV, 2025.

  41. [41]

    Grounding actions in camera space: Observation-centric vision-language-action policy

    Tianyi Zhang, Haonan Duan, Haoran Hao, Yu Qiao, Jifeng Dai, and Zhi Hou. Grounding actions in camera space: Observation-centric vision-language-action policy. arXiv preprint arXiv:2508.13103, 2025.

  42. [42]

    Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware

    Tony Z Zhao, Vikash Kumar, Sergey Levine, and Chelsea Finn. Learning fine-grained bimanual manipulation with low-cost hardware. arXiv preprint arXiv:2304.13705, 2023.

  43. [43]

    robosuite: A Modular Simulation Framework and Benchmark for Robot Learning

    Yuke Zhu, Josiah Wong, Ajay Mandlekar, Roberto Martín-Martín, Abhishek Joshi, Soroush Nasiriany, and Yifeng Zhu. robosuite: A modular simulation framework and benchmark for robot learning. arXiv preprint arXiv:2009.12293, 2020.

  44. [44]

    Rt-2: Vision-language-action models transfer web knowledge to robotic control

    Brianna Zitkovich, Tianhe Yu, Sichun Xu, Peng Xu, Ted Xiao, Fei Xia, Jialin Wu, Paul Wohlhart, Stefan Welker, Ayzaan Wahid, et al. Rt-2: Vision-language-action models transfer web knowledge to robotic control. In CoRL, 2023.