Unify Robot Actions in Camera Frame

arXiv: 2511.17001 · v2 · pith:5QU5RVJ6 · submitted 2025-11-21 · 💻 cs.RO

Sicheng Xie, Lingchen Meng, Zijie Diao, Haidong Cao, Zhiying Du, Shuyuan Tu, Jiaqi Leng, Qiuyue Wang, Mingsheng Li, Shuai Bai, Zuxuan Wu, Yu-Gang Jiang

Pith reviewed 2026-05-17 21:13 UTC · model grok-4.3

classification 💻 cs.RO

keywords cross-embodiment · action representation · camera frame · extrinsics estimation · robot learning · calibration pipeline · pretraining · bimanual

The pith

Unifying robot actions in the camera frame creates consistent geometric semantics across embodiments.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Existing action representations differ by robot platform, blocking effective cross-embodiment learning. The paper proposes representing actions in the shared camera frame so that movements have the same meaning no matter the robot body. To enable this on past datasets that lack the needed camera information, the authors introduce CalibAll, a method that estimates camera extrinsics in a training-free way. CalibAll starts with a rough guess from temporal point matching and then sharpens it using image rendering optimization. With this, actions from many robots can be standardized and used together for pretraining that outperforms prior approaches.

Core claim

By estimating camera extrinsics, robot actions are converted to the camera coordinate system where they share consistent geometric semantics independent of the specific robot embodiment. CalibAll achieves this annotation for offline datasets through a coarse-to-fine process consisting of temporal PnP initialization followed by differentiable rendering refinement, resulting in standardized camera-frame TCP-pose actions applicable to single-arm and bimanual setups.
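
As a minimal sketch of the conversion described above, assume the extrinsic is available as a 4x4 homogeneous transform that maps robot-base coordinates to camera coordinates. The snippet re-expresses a base-frame TCP pose in the camera frame. The names `T_cam_base`, `T_base_tcp`, and `to_camera_frame`, and all numeric values, are illustrative assumptions and not the paper's code or data.

```python
# Sketch: re-express a base-frame TCP pose in the camera frame.
# Assumption: T_cam_base is a 4x4 homogeneous transform (base -> camera).

def matmul4(a, b):
    """Multiply two 4x4 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def to_camera_frame(T_cam_base, T_base_tcp):
    """Return T_cam_tcp = T_cam_base @ T_base_tcp."""
    return matmul4(T_cam_base, T_base_tcp)

# Hypothetical extrinsic: the base x-axis points along the camera y-axis,
# and the base origin sits 1 m along the camera z-axis.
T_cam_base = [
    [0.0, -1.0, 0.0, 0.0],
    [1.0,  0.0, 0.0, 0.0],
    [0.0,  0.0, 1.0, 1.0],
    [0.0,  0.0, 0.0, 1.0],
]

# Hypothetical TCP pose: identity orientation, 0.5 m along the base x-axis.
T_base_tcp = [
    [1.0, 0.0, 0.0, 0.5],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]

T_cam_tcp = to_camera_frame(T_cam_base, T_base_tcp)
tcp_position_cam = [T_cam_tcp[i][3] for i in range(3)]
```

With these made-up values the TCP lands at (0, 0.5, 1) in the camera frame. Two robots with different base placements would produce the same camera-frame pose for the same observed motion, which is the consistency the core claim relies on.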

What carries the argument

CalibAll, a training-free pipeline that estimates camera extrinsics via temporal PnP initialization and differentiable rendering refinement to unify actions in the camera frame.

Load-bearing premise

The estimated camera extrinsics must accurately reflect the true geometry so that action semantics remain consistent when moving between different robot embodiments.

What would settle it

If experiments show that pretraining with camera-frame actions does not surpass current methods that use embodiment-specific action heads, the advantage of unification would be called into question.

Figures

Figures reproduced from arXiv: 2511.17001 by Haidong Cao, Jiaqi Leng, Lingchen Meng, Mingsheng Li, Qiuyue Wang, Shuai Bai, Shuyuan Tu, Sicheng Xie, Yu-Gang Jiang, Zhiying Du, Zijie Diao, Zuxuan Wu.

**Figure 1.** Overview of CalibAll, which can automatically and training-free estimate the camera extrinsic for data from any robot type, along with providing additional notations with one mark.

**Figure 2.** Architecture overview of CalibAll, a simple yet effective method that requires only a single mark. It follows a coarse-to-fine calibration pipeline that achieves training-free, stable, and accurate camera extrinsic estimation across diverse datasets and robot platforms. CalibAll first uses EEF Recognition to obtain the end-effector tracking point, then applies temporal PnP to estimate a coarse extrinsic, and …

**Figure 3.** Detailed results of CalibAll on DREAM [19], with the x-axis representing the number of iterations in the rendering-based optimization method.

**Figure 4.** Qualitative result of EEF recognition on Franka, xArm and UR5e. The first row shows the heatmaps obtained from feature matching. The second row visualizes the selected tracking point based on the maximum similarity.

**Figure 5.** Qualitative result of coarse initialization and extrinsic refinement on Franka, xArm and UR5e. The first row presents the source RGB images. The second row shows the rendered result using the camera extrinsic obtained from the automatic coarse initialization approach. The last row shows the rendered result of the final camera extrinsic produced by CalibAll.

**Figure 6.** Additional notations. 1) Top-left: the original RGB image; 2) Top-right: the mask of each robot link; 3) Bottom-left: the depth map of the robot arm; 4) Bottom-right: the GT 2D trajectory of the end-effector.

**Figure 7.** Camera extrinsic for evaluation. The images show the trajectory of performing the task “close the drawer.” The left image depicts the ground-truth trajectory, while the right image visualizes the action predicted by π0 [1]. The green points represent the starting points, and the red points indicate the ending points.

**Figure 9.** Rendering result using the DROID ground-truth camera extrinsic and the camera extrinsic estimated by CalibAll.

**Figure 10.** Overview of rendering.

Original abstract

Cross-embodiment robot learning requires a unified action representation with consistent semantics across robot platforms. Existing representations suffer from platform-specific inconsistencies, while current solutions either maintain embodiment-specific action heads or learn latent action spaces, without fundamentally resolving the mismatch. We propose to unify robot actions in the camera frame using camera extrinsics, so that actions share consistent geometric semantics across different robot embodiments, including both single-arm and bimanual robots. However, most existing datasets lack camera extrinsic annotations, and existing offline calibration methods either suffer from local minima or require robot-specific training data. To address this gap, we present CalibAll, a training-free, robot-independent annotation pipeline that estimates camera extrinsics for offline datasets and converts heterogeneous robot actions into standardized camera-frame actions. CalibAll follows a coarse-to-fine calibration strategy: temporal PnP provides a stable initialization, followed by differentiable rendering-based refinement for high precision. Beyond extrinsics, CalibAll produces standardized TCP-pose actions and auxiliary annotations. We apply CalibAll to 16 datasets across 4 robot platforms, producing approximately 97K calibrated data episodes. Downstream simulation and real-robot experiments show that cross-embodiment pretraining with camera-frame actions achieves state-of-the-art performance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

CalibAll gives a straightforward way to turn offline robot datasets into camera-frame actions, but the lack of reported calibration errors leaves the unification benefit unproven.

The main thing here is a training-free pipeline that takes existing robot datasets, estimates camera extrinsics with temporal PnP followed by differentiable rendering, and rewrites the actions as TCP poses in the camera frame. They ran it on 16 datasets across four platforms and ended up with about 97K episodes, then showed downstream gains in simulation and real-robot cross-embodiment pretraining. That part is concrete and addresses a clear pain point: different robot arms have mismatched action spaces that make pooling data awkward. The method itself is built from standard CV pieces, so the novelty sits in the specific coarse-to-fine combination and the explicit goal of producing consistent geometric actions without embodiment-specific heads. They also ship auxiliary annotations, which could be handy for others.

The soft spot is exactly what the stress test flags. The abstract and pipeline description give no reprojection errors, no pose accuracy against ground truth, no ablation on the refinement step, and no direct check that residual extrinsic errors are small enough not to break action semantics across single-arm and bimanual setups. Without those numbers it is difficult to judge whether the reported SOTA lift comes from the unification or from other factors in the training setup. The circularity concern is low because the calibration does not depend on the downstream learning results.

This paper is aimed at people working on cross-embodiment pretraining and dataset standardization. A reader who needs a practical way to align actions from heterogeneous sources will find usable ideas even if the validation is incomplete. It deserves a serious referee because the problem is real and the approach is reproducible in principle; reviewers can ask for the missing calibration metrics and ablations without the paper falling apart. I would send it to review rather than desk reject.

Referee Report

2 major / 2 minor

Summary. The paper proposes unifying heterogeneous robot actions into a camera-frame representation using estimated camera extrinsics. It introduces CalibAll, a training-free, robot-independent coarse-to-fine calibration pipeline (temporal PnP initialization followed by differentiable rendering refinement) that annotates offline datasets lacking extrinsic labels. The method is applied to 16 datasets across 4 robot platforms to produce ~97K calibrated episodes with standardized TCP-pose actions. Downstream simulation and real-robot experiments then demonstrate that cross-embodiment pretraining with these camera-frame actions achieves state-of-the-art performance.

Significance. If the reported calibration precision is sufficient to preserve consistent geometric semantics across embodiments, the approach offers a simple, geometrically grounded alternative to embodiment-specific action heads or learned latent spaces. The scale of application (16 datasets, 4 platforms) and the downstream SOTA claims would represent a practical contribution to cross-embodiment robot learning.

major comments (2)

[Abstract] Abstract and experimental section: the claim that camera-frame actions achieve SOTA cross-embodiment performance rests on the unverified assumption that CalibAll recovers extrinsics accurate enough for consistent semantics. No reprojection error, pose error versus ground truth, ablation on the differentiable-rendering refinement step, or comparison against alternative extrinsic estimators is reported for the 16 datasets.
[Method] Method description of the coarse-to-fine pipeline: without quantitative metrics on residual rotation/translation error after refinement, it is impossible to determine whether any remaining mismatch is smaller than the reported unification benefit, undermining the central claim that actions share consistent geometric semantics across single-arm and bimanual embodiments.

minor comments (2)

Clarify the precise definition and standardization procedure for camera-frame TCP-pose actions, especially how they are adapted for bimanual robots.
Add a table or section summarizing the 4 robot platforms, camera setups, and dataset characteristics to help readers assess the diversity of the 16 datasets.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript accordingly to strengthen the validation of CalibAll.

Referee: [Abstract] Abstract and experimental section: the claim that camera-frame actions achieve SOTA cross-embodiment performance rests on the unverified assumption that CalibAll recovers extrinsics accurate enough for consistent semantics. No reprojection error, pose error versus ground truth, ablation on the differentiable-rendering refinement step, or comparison against alternative extrinsic estimators is reported for the 16 datasets.

Authors: We agree that direct quantitative validation of extrinsic accuracy is necessary to support the SOTA claims. Ground-truth extrinsics are unavailable for most of the 16 datasets, so pose error versus ground truth cannot be computed universally. In the revised manuscript we will report reprojection errors across the datasets, add an ablation isolating the differentiable rendering refinement step, and include comparisons to alternative extrinsic estimators on the subsets that possess ground-truth labels. These additions will demonstrate that residual errors remain small relative to the observed unification benefits. revision: yes
Referee: [Method] Method description of the coarse-to-fine pipeline: without quantitative metrics on residual rotation/translation error after refinement, it is impossible to determine whether any remaining mismatch is smaller than the reported unification benefit, undermining the central claim that actions share consistent geometric semantics across single-arm and bimanual embodiments.

Authors: We acknowledge the value of residual error metrics after refinement for assessing whether mismatches affect semantic consistency. The revised manuscript will include quantitative rotation and translation residuals post-refinement together with a discussion relating these errors to the scale of the unification gains reported in the experiments. This will clarify that the remaining mismatch does not undermine the geometric consistency claim. revision: yes
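
The residual metrics discussed in these responses can be illustrated with a short sketch. It assumes both the estimated and the reference extrinsic are 4x4 homogeneous transforms with valid rotation blocks, measures rotation error as the geodesic angle between the two rotations, and measures translation error as the Euclidean distance between the two translation vectors. The function name `pose_errors` and the example matrices are hypothetical; the paper may define its errors differently.

```python
import math

def pose_errors(T_est, T_ref):
    """Return (rotation error in degrees, translation error in input units).

    Assumption: both inputs are 4x4 homogeneous transforms (nested lists).
    """
    # trace(R_est^T @ R_ref) equals the sum of elementwise products.
    trace = sum(T_est[i][j] * T_ref[i][j] for i in range(3) for j in range(3))
    # Clamp to guard against tiny numerical overshoot before acos.
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    rot_err_deg = math.degrees(math.acos(cos_angle))
    t_est = [T_est[i][3] for i in range(3)]
    t_ref = [T_ref[i][3] for i in range(3)]
    return rot_err_deg, math.dist(t_est, t_ref)

# Hypothetical reference extrinsic (identity) and a deliberately poor estimate:
# 90 degrees off about z, and 5 cm off in translation (values in metres).
T_ref = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]
T_est = [
    [0.0, -1.0, 0.0, 0.03],
    [1.0,  0.0, 0.0, 0.04],
    [0.0,  0.0, 1.0, 0.00],
    [0.0,  0.0, 0.0, 1.0],
]

rot_err_deg, trans_err_m = pose_errors(T_est, T_ref)
```

This kind of check only applies to the subsets that carry ground-truth extrinsics; for the remaining datasets an image-space measure such as reprojection error is the available substitute.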

standing simulated objections not resolved

Ground-truth pose error cannot be reported for the majority of the 16 datasets because they lack ground-truth extrinsic annotations.

Circularity Check

0 steps flagged

No significant circularity; calibration uses independent standard CV primitives

The paper's core derivation applies standard, externally validated computer-vision methods (temporal PnP for initialization and differentiable rendering for refinement) to estimate camera extrinsics on existing datasets. These primitives are independent of the robot-learning results and do not depend on the claimed cross-embodiment gains. The unification step simply transforms TCP-pose actions into camera-frame coordinates using the recovered extrinsics; the SOTA performance is then shown via separate downstream pretraining experiments on the resulting ~97K episodes. No equations, definitions, or self-citations reduce the performance claim to a quantity fitted from the same data or defined circularly by construction. The pipeline remains self-contained against external CV benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The method rests on standard pinhole-camera and rigid-body assumptions plus the claim that differentiable rendering can refine extrinsics without robot-specific training data. No new physical entities or free parameters are introduced beyond the usual hyperparameters of PnP and rendering optimization.

axioms (2)

[standard math] Pinhole camera model with known intrinsics. Invoked when temporal PnP is used to obtain initial extrinsics.
[domain assumption] Differentiable rendering can produce gradients that improve extrinsic estimates without embodiment-specific supervision. Central to the refinement stage described in the abstract.
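
To make the pinhole axiom concrete, the sketch below projects base-frame 3D points into the image with a candidate extrinsic and reports the mean reprojection error against observed 2D points. It assumes distortion-free intrinsics given as `(fx, fy, cx, cy)` and a 4x4 base-to-camera transform. The names `project` and `mean_reprojection_error` and all numbers are illustrative assumptions, not values from the paper.

```python
import math

def project(K, T_cam_base, p_base):
    """Project a base-frame 3D point to pixel coordinates (pinhole, no distortion)."""
    fx, fy, cx, cy = K
    # Move the point into the camera frame: p_cam = R @ p_base + t.
    x, y, z = (
        sum(T_cam_base[i][j] * p_base[j] for j in range(3)) + T_cam_base[i][3]
        for i in range(3)
    )
    if z <= 0:
        raise ValueError("point lies behind the camera")
    return (fx * x / z + cx, fy * y / z + cy)

def mean_reprojection_error(K, T_cam_base, points_base, pixels):
    """Average pixel distance between projected points and observed pixels."""
    errors = [math.dist(project(K, T_cam_base, p), uv)
              for p, uv in zip(points_base, pixels)]
    return sum(errors) / len(errors)

# Hypothetical intrinsics and extrinsic (camera 1 m from the base along z).
K = (500.0, 500.0, 320.0, 240.0)
T_cam_base = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 1.0],
    [0.0, 0.0, 0.0, 1.0],
]

# Made-up 3D points (e.g. end-effector positions over time) and observed
# pixels that are each offset by (3, 4) px from the ideal projection.
points_base = [(0.0, 0.0, 0.0), (0.5, 0.25, 1.0)]
observed = [(323.0, 244.0), (448.0, 306.5)]

err_px = mean_reprojection_error(K, T_cam_base, points_base, observed)
```

A PnP solver searches for the extrinsic that makes this kind of error small over many 3D-2D correspondences; pooling correspondences across frames is one plausible reading of "temporal" here, though the page does not spell out the paper's exact formulation.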

pith-pipeline@v0.9.0 · 5559 in / 1337 out tokens · 43477 ms · 2026-05-17T21:13:16.960571+00:00 · methodology

Reference graph

Works this paper leans on

44 extracted references · 44 canonical work pages · 12 internal anchors

[1] Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, et al. $\pi_0$: A vision-language-action flow model for general robot control. arXiv preprint arXiv:2410.24164, 2024.

[2] Linghao Chen, Yuzhe Qin, Xiaowei Zhou, and Hao Su. EasyHeC: Accurate and automatic hand-eye calibration via differentiable rendering and space exploration. RA-L, 2023.

[3] Cheng Chi, Zhenjia Xu, Siyuan Feng, Eric Cousineau, Yilun Du, Benjamin Burchfiel, Russ Tedrake, and Shuran Song. Diffusion policy: Visuomotor policy learning via action diffusion. IJRR, 2025.

[4] Konstantinos Daniilidis. Hand-eye calibration using dual quaternions. IJRR, 1999.

[5] Alexey Dosovitskiy. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.

[6] Hongjie Fang, Chenxi Wang, Yiming Wang, Jingjing Chen, Shangning Xia, Jun Lv, Zihao He, Xiyan Yi, Yunhan Guo, Xinyu Zhan, et al. AirExo-2: Scaling up generalizable robotic imitation learning with low-cost exoskeletons. arXiv preprint arXiv:2503.03081, 2025.

[7] Zipeng Fu, Tony Z Zhao, and Chelsea Finn. Mobile ALOHA: Learning bimanual mobile manipulation with low-cost whole-body teleoperation. arXiv preprint arXiv:2401.02117.

[8] Zhengdong Hong, Kangfu Zheng, and Linghao Chen. EasyHeC++: Fully automatic hand-eye calibration with pretrained image models. In IROS, 2024.

[9] Jarmo Ilonen and Ville Kyrki. Robust robot-camera calibration. In ICAR, 2011.

[10] Physical Intelligence, Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, et al. $\pi_{0.5}$: A vision-language-action model with open-world generalization. arXiv preprint arXiv:2504.16054, 2025.
[11] Stephen James, Zicong Ma, David Rovick Arrojo, and Andrew J Davison. RLBench: The robot learning benchmark & learning environment. RA-L, 2020.

[12] Tianchong Jiang, Jingtian Ji, Xiangshan Tan, Jiading Fang, Anand Bhattad, Vitor Guizilini, and Matthew R Walter. Do you know where your camera is? View-invariant policy learning with camera conditioning. arXiv preprint arXiv:2510.02268, 2025.

[13] Nikita Karaev, Ignacio Rocco, Benjamin Graham, Natalia Neverova, Andrea Vedaldi, and Christian Rupprecht. CoTracker: It is better to track together. In ECCV, 2024.

[14] Nikita Karaev, Yuri Makarov, Jianyuan Wang, Natalia Neverova, Andrea Vedaldi, and Christian Rupprecht. CoTracker3: Simpler and better point tracking by pseudo-labelling real videos. In ICCV, 2025.

[15] Alexander Khazatsky, Karl Pertsch, Suraj Nair, Ashwin Balakrishna, Sudeep Dasari, Siddharth Karamcheti, Soroush Nasiriany, Mohan Kumar Srirama, Lawrence Yunliang Chen, Kirsty Ellis, et al. DROID: A large-scale in-the-wild robot manipulation dataset. arXiv preprint arXiv:2403.12945, 2024.

[16] Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, et al. OpenVLA: An open-source vision-language-action model. arXiv preprint arXiv:2406.09246, 2024.

[17] Yann Labbé, Justin Carpentier, Mathieu Aubry, and Josef Sivic. Single-view robot pose and joint angle estimation via render & compare. In CVPR, 2021.

[18] Samuli Laine, Janne Hellsten, Tero Karras, Yeongho Seol, Jaakko Lehtinen, and Timo Aila. Modular primitives for high-performance differentiable rendering. ToG, 2020.

[19] Timothy E Lee, Jonathan Tremblay, Thang To, Jia Cheng, Terry Mosier, Oliver Kroemer, Dieter Fox, and Stan Birchfield. Camera-to-robot pose estimation from a single image. In ICRA, 2020.

[20] Marion Lepert, Jiaying Fang, and Jeannette Bohg. Phantom: Training robots without robots using only human videos. arXiv preprint arXiv:2503.00779, 2025.
[21] Vincent Lepetit, Francesc Moreno-Noguer, and Pascal Fua. EPnP: An accurate O(n) solution to the PnP problem. IJCV.

[22] Haotong Lin, Sida Peng, Jingxiao Chen, Songyou Peng, Jiaming Sun, Minghuan Liu, Hujun Bao, Jiashi Feng, Xiaowei Zhou, and Bingyi Kang. Prompting depth anything for 4K resolution accurate metric depth estimation. In CVPR, 2025.

[23] Bo Liu, Yifeng Zhu, Chongkai Gao, Yihao Feng, Qiang Liu, Yuke Zhu, and Peter Stone. LIBERO: Benchmarking knowledge transfer for lifelong robot learning. NeurIPS, 2023.

[24] Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Qing Jiang, Chunyuan Li, Jianwei Yang, Hang Su, et al. Grounding DINO: Marrying DINO with grounded pre-training for open-set object detection. In ECCV, 2024.

[25] Jingpei Lu, Florian Richter, and Michael C Yip. Markerless camera-to-robot pose estimation via self-supervised sim-to-real transfer. In CVPR, 2023.

[26] Jingpei Lu, Zekai Liang, Tristin Xie, Florian Richter, Shan Lin, Sainan Liu, and Michael C Yip. CtRNet-X: Camera-to-robot pose estimation in real-world conditions using a single camera. In ICRA, 2025.

[27] Viktor Makoviychuk, Lukasz Wawrzyniak, Yunrong Guo, Michelle Lu, Kier Storey, Miles Macklin, David Hoeller, Nikita Rudin, Arthur Allshire, Ankur Handa, et al. Isaac Gym: High performance GPU-based physics simulation for robot learning. arXiv preprint arXiv:2108.10470, 2021.

[28] Lingchen Meng, Shiyi Lan, Hengduo Li, Jose M Alvarez, Zuxuan Wu, and Yu-Gang Jiang. SegIC: Unleashing the emergent correspondence for in-context segmentation. In ECCV, 2024.

[29] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. DINOv2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023.

[30] Abby O'Neill, Abdul Rehman, Abhiram Maddukuri, Abhishek Gupta, Abhishek Padalkar, Abraham Lee, Acorn Pooley, Agrim Gupta, Ajay Mandlekar, Ajinkya Jain, et al. Open X-Embodiment: Robotic learning datasets and RT-X models: Open X-Embodiment collaboration. In ICRA, 2024.
[31] Frank C Park and Bryan J Martin. Robot sensor calibration: Solving AX = XB on the Euclidean group. T-RO, 1994.

[32] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In ICML, 2021.

[33] Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, et al. SAM 2: Segment anything in images and videos. arXiv preprint arXiv:2408.00714, 2024.

[34] Tianhe Ren, Qing Jiang, Shilong Liu, Zhaoyang Zeng, Wenlong Liu, Han Gao, Hongjie Huang, Zhengyu Ma, Xiaoke Jiang, Yihao Chen, et al. Grounding DINO 1.5: Advance the "edge" of open-set object detection. arXiv preprint arXiv:2405.10300, 2024.

[35] Luming Tang, Menglin Jia, Qianqian Wang, Cheng Perng Phoo, and Bharath Hariharan. Emergent correspondence from image diffusion. NeurIPS, 2023.

[36] Roger Y Tsai, Reimar K Lenz, et al. A new technique for fully autonomous and efficient 3D robotics hand/eye calibration. T-RO, 1989.

[37] Chen Wang, Linxi Fan, Jiankai Sun, Ruohan Zhang, Li Fei-Fei, Danfei Xu, Yuke Zhu, and Anima Anandkumar. MimicPlay: Long-horizon imitation learning by watching human play. arXiv preprint arXiv:2302.12422, 2023.

[38] Tianhe Yu, Deirdre Quillen, Zhanpeng He, Ryan Julian, Karol Hausman, Chelsea Finn, and Sergey Levine. Meta-World: A benchmark and evaluation for multi-task and meta reinforcement learning. In CoRL, 2020.

[39] Hao Zhang, Feng Li, Shilong Liu, Lei Zhang, Hang Su, Jun Zhu, Lionel M Ni, and Heung-Yeung Shum. DINO: DETR with improved denoising anchor boxes for end-to-end object detection. arXiv preprint arXiv:2203.03605, 2022.

[40] Shiduo Zhang, Zhe Xu, Peiju Liu, Xiaopeng Yu, Yuan Li, Qinghui Gao, Zhaoye Fei, Zhangyue Yin, Zuxuan Wu, Yu-Gang Jiang, et al. VLABench: A large-scale benchmark for language-conditioned robotics manipulation with long-horizon reasoning tasks. In ICCV, 2025.

[41] Tianyi Zhang, Haonan Duan, Haoran Hao, Yu Qiao, Jifeng Dai, and Zhi Hou. Grounding actions in camera space: Observation-centric vision-language-action policy. arXiv preprint arXiv:2508.13103, 2025.

[42] Tony Z Zhao, Vikash Kumar, Sergey Levine, and Chelsea Finn. Learning fine-grained bimanual manipulation with low-cost hardware. arXiv preprint arXiv:2304.13705, 2023.

[43] Yuke Zhu, Josiah Wong, Ajay Mandlekar, Roberto Martín-Martín, Abhishek Joshi, Soroush Nasiriany, and Yifeng Zhu. robosuite: A modular simulation framework and benchmark for robot learning. arXiv preprint arXiv:2009.12293, 2020.

[44] Brianna Zitkovich, Tianhe Yu, Sichun Xu, Peng Xu, Ted Xiao, Fei Xia, Jialin Wu, Paul Wohlhart, Stefan Welker, Ayzaan Wahid, et al. RT-2: Vision-language-action models transfer web knowledge to robotic control. In CoRL, 2023.

[1] [1]

$\pi_0$: A Vision-Language-Action Flow Model for General Robot Control

Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, et al. A vision-language- action flow model for general robot control.arXiv preprint arXiv:2410.24164, 2024. 1, 2, 8

work page internal anchor Pith review Pith/arXiv arXiv 2024

[2] [2]

Easyhec: Accurate and automatic hand-eye calibration via differentiable rendering and space exploration.RA-L, 2023

Linghao Chen, Yuzhe Qin, Xiaowei Zhou, and Hao Su. Easyhec: Accurate and automatic hand-eye calibration via differentiable rendering and space exploration.RA-L, 2023. 1, 2, 3, 4, 5

work page 2023

[3] Cheng Chi, Zhenjia Xu, Siyuan Feng, Eric Cousineau, Yilun Du, Benjamin Burchfiel, Russ Tedrake, and Shuran Song. Diffusion policy: Visuomotor policy learning via action diffusion. IJRR, 2025.

[4] Konstantinos Daniilidis. Hand-eye calibration using dual quaternions. IJRR, 1999.

[5] Alexey Dosovitskiy. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.

[6] Hongjie Fang, Chenxi Wang, Yiming Wang, Jingjing Chen, Shangning Xia, Jun Lv, Zihao He, Xiyan Yi, Yunhan Guo, Xinyu Zhan, et al. Airexo-2: Scaling up generalizable robotic imitation learning with low-cost exoskeletons. arXiv preprint arXiv:2503.03081, 2025.

[7] Zipeng Fu, Tony Z Zhao, and Chelsea Finn. Mobile aloha: Learning bimanual mobile manipulation with low-cost whole-body teleoperation. arXiv preprint arXiv:2401.02117.

[8] Zhengdong Hong, Kangfu Zheng, and Linghao Chen. Easyhec++: Fully automatic hand-eye calibration with pretrained image models. In IROS, 2024.

[9] Jarmo Ilonen and Ville Kyrki. Robust robot-camera calibration. In ICAR, 2011.

[10] Physical Intelligence, Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, et al. $\pi_{0.5}$: a vision-language-action model with open-world generalization. arXiv preprint arXiv:2504.16054, 2025.

[11] Stephen James, Zicong Ma, David Rovick Arrojo, and Andrew J Davison. Rlbench: The robot learning benchmark & learning environment. RA-L, 2020.

[12] Tianchong Jiang, Jingtian Ji, Xiangshan Tan, Jiading Fang, Anand Bhattad, Vitor Guizilini, and Matthew R Walter. Do you know where your camera is? view-invariant policy learning with camera conditioning. arXiv preprint arXiv:2510.02268, 2025.

[13] Nikita Karaev, Ignacio Rocco, Benjamin Graham, Natalia Neverova, Andrea Vedaldi, and Christian Rupprecht. Cotracker: It is better to track together. In ECCV, 2024.

[14] Nikita Karaev, Yuri Makarov, Jianyuan Wang, Natalia Neverova, Andrea Vedaldi, and Christian Rupprecht. Cotracker3: Simpler and better point tracking by pseudo-labelling real videos. In ICCV, 2025.

[15] Alexander Khazatsky, Karl Pertsch, Suraj Nair, Ashwin Balakrishna, Sudeep Dasari, Siddharth Karamcheti, Soroush Nasiriany, Mohan Kumar Srirama, Lawrence Yunliang Chen, Kirsty Ellis, et al. Droid: A large-scale in-the-wild robot manipulation dataset. arXiv preprint arXiv:2403.12945, 2024.

[16] Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, et al. Openvla: An open-source vision-language-action model. arXiv preprint arXiv:2406.09246, 2024.

[17] Yann Labbé, Justin Carpentier, Mathieu Aubry, and Josef Sivic. Single-view robot pose and joint angle estimation via render & compare. In CVPR, 2021.

[18] Samuli Laine, Janne Hellsten, Tero Karras, Yeongho Seol, Jaakko Lehtinen, and Timo Aila. Modular primitives for high-performance differentiable rendering. ToG, 2020.

[19] Timothy E Lee, Jonathan Tremblay, Thang To, Jia Cheng, Terry Mosier, Oliver Kroemer, Dieter Fox, and Stan Birchfield. Camera-to-robot pose estimation from a single image. In ICRA, 2020.

[20] Marion Lepert, Jiaying Fang, and Jeannette Bohg. Phantom: Training robots without robots using only human videos. arXiv preprint arXiv:2503.00779, 2025.

[21] Vincent Lepetit, Francesc Moreno-Noguer, and Pascal Fua. EPnP: An accurate O(n) solution to the PnP problem. IJCV.

[22] Haotong Lin, Sida Peng, Jingxiao Chen, Songyou Peng, Jiaming Sun, Minghuan Liu, Hujun Bao, Jiashi Feng, Xiaowei Zhou, and Bingyi Kang. Prompting depth anything for 4k resolution accurate metric depth estimation. In CVPR, 2025.

[23] Bo Liu, Yifeng Zhu, Chongkai Gao, Yihao Feng, Qiang Liu, Yuke Zhu, and Peter Stone. Libero: Benchmarking knowledge transfer for lifelong robot learning. NeurIPS, 2023.

[24] Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Qing Jiang, Chunyuan Li, Jianwei Yang, Hang Su, et al. Grounding dino: Marrying dino with grounded pre-training for open-set object detection. In ECCV, 2024.

[25] Jingpei Lu, Florian Richter, and Michael C Yip. Markerless camera-to-robot pose estimation via self-supervised sim-to-real transfer. In CVPR, 2023.

[26] Jingpei Lu, Zekai Liang, Tristin Xie, Florian Richter, Shan Lin, Sainan Liu, and Michael C Yip. Ctrnet-x: Camera-to-robot pose estimation in real-world conditions using a single camera. In ICRA, 2025.

[27] Viktor Makoviychuk, Lukasz Wawrzyniak, Yunrong Guo, Michelle Lu, Kier Storey, Miles Macklin, David Hoeller, Nikita Rudin, Arthur Allshire, Ankur Handa, et al. Isaac gym: High performance gpu-based physics simulation for robot learning. arXiv preprint arXiv:2108.10470, 2021.

[28] Lingchen Meng, Shiyi Lan, Hengduo Li, Jose M Alvarez, Zuxuan Wu, and Yu-Gang Jiang. Segic: Unleashing the emergent correspondence for in-context segmentation. In ECCV, 2024.

[29] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023.

[30] Abby O’Neill, Abdul Rehman, Abhiram Maddukuri, Abhishek Gupta, Abhishek Padalkar, Abraham Lee, Acorn Pooley, Agrim Gupta, Ajay Mandlekar, Ajinkya Jain, et al. Open x-embodiment: Robotic learning datasets and rt-x models: Open x-embodiment collaboration 0. In ICRA, 2024.

[31] Frank C Park and Bryan J Martin. Robot sensor calibration: solving AX = XB on the Euclidean group. T-RO, 1994.

[32] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In ICML, 2021.

[33] Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, et al. Sam 2: Segment anything in images and videos. arXiv preprint arXiv:2408.00714, 2024.

[34] Tianhe Ren, Qing Jiang, Shilong Liu, Zhaoyang Zeng, Wenlong Liu, Han Gao, Hongjie Huang, Zhengyu Ma, Xiaoke Jiang, Yihao Chen, et al. Grounding dino 1.5: Advance the “edge” of open-set object detection. arXiv preprint arXiv:2405.10300, 2024.

[35] Luming Tang, Menglin Jia, Qianqian Wang, Cheng Perng Phoo, and Bharath Hariharan. Emergent correspondence from image diffusion. NeurIPS, 2023.

[36] Roger Y Tsai, Reimar K Lenz, et al. A new technique for fully autonomous and efficient 3D robotics hand/eye calibration. T-RO, 1989.

[37] Chen Wang, Linxi Fan, Jiankai Sun, Ruohan Zhang, Li Fei-Fei, Danfei Xu, Yuke Zhu, and Anima Anandkumar. Mimicplay: Long-horizon imitation learning by watching human play. arXiv preprint arXiv:2302.12422, 2023.

[38] Tianhe Yu, Deirdre Quillen, Zhanpeng He, Ryan Julian, Karol Hausman, Chelsea Finn, and Sergey Levine. Meta-world: A benchmark and evaluation for multi-task and meta reinforcement learning. In CoRL, 2020.

[39] Hao Zhang, Feng Li, Shilong Liu, Lei Zhang, Hang Su, Jun Zhu, Lionel M Ni, and Heung-Yeung Shum. Dino: Detr with improved denoising anchor boxes for end-to-end object detection. arXiv preprint arXiv:2203.03605, 2022.

[40] Shiduo Zhang, Zhe Xu, Peiju Liu, Xiaopeng Yu, Yuan Li, Qinghui Gao, Zhaoye Fei, Zhangyue Yin, Zuxuan Wu, Yu-Gang Jiang, et al. Vlabench: A large-scale benchmark for language-conditioned robotics manipulation with long-horizon reasoning tasks. In ICCV, 2025.

[41] Tianyi Zhang, Haonan Duan, Haoran Hao, Yu Qiao, Jifeng Dai, and Zhi Hou. Grounding actions in camera space: Observation-centric vision-language-action policy. arXiv preprint arXiv:2508.13103, 2025.

[42] Tony Z Zhao, Vikash Kumar, Sergey Levine, and Chelsea Finn. Learning fine-grained bimanual manipulation with low-cost hardware. arXiv preprint arXiv:2304.13705, 2023.

[43] Yuke Zhu, Josiah Wong, Ajay Mandlekar, Roberto Martín-Martín, Abhishek Joshi, Soroush Nasiriany, and Yifeng Zhu. robosuite: A modular simulation framework and benchmark for robot learning. arXiv preprint arXiv:2009.12293, 2020.

[44] Brianna Zitkovich, Tianhe Yu, Sichun Xu, Peng Xu, Ted Xiao, Fei Xia, Jialin Wu, Paul Wohlhart, Stefan Welker, Ayzaan Wahid, et al. Rt-2: Vision-language-action models transfer web knowledge to robotic control. In CoRL, 2023.