SPACE: Enabling Learning from Cross-Robot Data Toward Generalist Policies

Byeongguk Jeon; Haeone Lee; Jian Kim; Kimin Lee; Suchae Jeong

arxiv: 2606.24049 · v1 · pith:KWZNOBYSnew · submitted 2026-06-23 · 💻 cs.RO

Pith reviewed 2026-06-26 00:47 UTC · model grok-4.3

keywords: robot learning, cross-embodiment, behavior cloning, action representation, generalist policies, Cartesian state delta, dynamics adaptation

The pith

SPACE lets robot policies learn from mixed data across different machines by predicting Cartesian state deltas instead of specific commands.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Robot actions are tied to each machine's dynamics, so data from different robots cannot be mixed directly for training generalist policies via behavior cloning. SPACE addresses this by training a policy to predict geometric end-effector displacements in Cartesian space as a shared representation, then using an Action Adapter to convert those predictions into robot-specific commands. The framework targets dynamics differences across embodiments, across units of the same embodiment, and during single-robot operation. Experiments show it outperforms direct command prediction on cross-robot datasets and stays effective under deployment shifts such as altered control frequency, object weight, or gains. This approach supports scaling training data across hardware without the mismatch that normally blocks combined datasets.
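The mismatch described above can be made concrete with a toy sketch. It is not taken from the paper: the linear response model, the gain values, and the function names are illustrative assumptions, and real robot dynamics are far richer than a single scalar gain.

```python
import numpy as np

# Hypothetical per-robot gains: achieved displacement = gain * command.
# These numbers are illustrative assumptions, not measured values.
GAINS = {"robot_a": 1.0, "robot_b": 0.6}


def achieved_delta(robot: str, command: np.ndarray) -> np.ndarray:
    """Toy dynamics: each robot scales the command by its own gain."""
    return GAINS[robot] * command


def command_for(robot: str, target_delta: np.ndarray) -> np.ndarray:
    """Command that makes `robot` move by `target_delta` under the toy model."""
    return target_delta / GAINS[robot]


target = np.array([0.02, 0.0, -0.01])  # same desired motion, in metres

cmd_a = command_for("robot_a", target)
cmd_b = command_for("robot_b", target)

# The command labels disagree across robots...
print("commands differ:", not np.allclose(cmd_a, cmd_b))
# ...but the Cartesian displacement they produce is the same.
print("deltas agree:", np.allclose(achieved_delta("robot_a", cmd_a),
                                   achieved_delta("robot_b", cmd_b)))
# Replaying robot A's command on robot B undershoots the target.
replayed = achieved_delta("robot_b", cmd_a)
print("replay error (m):", np.linalg.norm(replayed - target))
```

In this toy setting, pooling the two robots' command labels would give a behavior-cloning policy two different targets for the same motion, while the displacement labels coincide.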

Core claim

SPACE consists of a Cartesian state delta policy that predicts geometric end-effector displacement and an Action Adapter that converts the prediction into robot-specific control commands. This structure handles robot dynamics variation at three levels: across different embodiments, across hardware units of the same embodiment, and within a single robot during operation.
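As a rough illustration of what a Cartesian state delta label might look like, the sketch below differences consecutive end-effector positions from a logged trajectory. The function name and the sample trajectory are hypothetical, and only position is handled; the text available here does not say how the paper treats orientation or the gripper.

```python
import numpy as np


def cartesian_state_deltas(ee_positions: np.ndarray) -> np.ndarray:
    """Return per-step displacements p[t+1] - p[t] for a (T, D) trajectory."""
    ee_positions = np.asarray(ee_positions, dtype=float)
    if ee_positions.ndim != 2 or ee_positions.shape[0] < 2:
        raise ValueError("need a (T, D) trajectory with T >= 2")
    return np.diff(ee_positions, axis=0)


# A short made-up trajectory (metres), standing in for logged robot states.
trajectory = np.array([
    [0.40, 0.00, 0.30],
    [0.41, 0.00, 0.30],
    [0.43, 0.01, 0.29],
])
deltas = cartesian_state_deltas(trajectory)
print(deltas.shape)  # (2, 3)
```

Because these labels are computed from observed states rather than from the commands that were sent, they do not depend on which controller produced the motion.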

What carries the argument

Cartesian state delta as a universal action representation, together with an Action Adapter that maps the predicted delta to embodiment-specific commands.
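The pseudocode fragment that survives at the end of this page defines the adapter as an affine map from the predicted delta to a command, with its own learning rate, and the paper cites least-mean-square adaptive filters [39]. The sketch below shows one way such an adapter could be updated online. The update rule is a generic LMS step and the simulated robot is a made-up linear system; whether SPACE uses exactly this rule is not stated in the text available here.

```python
import numpy as np


class AffineActionAdapter:
    """Affine map u_hat = W @ delta_p + b with an LMS-style online update."""

    def __init__(self, dim: int = 3, lr: float = 0.1):
        self.W = np.eye(dim)      # initial guess: command equals delta
        self.b = np.zeros(dim)
        self.lr = lr              # learning rate (mu in the fragment)

    def command(self, delta_p: np.ndarray) -> np.ndarray:
        return self.W @ delta_p + self.b

    def update(self, achieved_delta: np.ndarray, sent_command: np.ndarray) -> None:
        # The pair (achieved_delta, sent_command) is an observed example of
        # "this command produced this displacement"; nudge the map toward it.
        error = sent_command - self.command(achieved_delta)
        self.W += self.lr * np.outer(error, achieved_delta)
        self.b += self.lr * error


# Toy robot (assumption): displacement = G @ command + d, unknown to the adapter.
G = np.diag([0.5, 0.8, 1.2])
d = np.array([0.0, 0.0, -0.1])   # e.g. a constant sag along z

rng = np.random.default_rng(0)
adapter = AffineActionAdapter()
for _ in range(5000):
    desired = rng.uniform(-1.0, 1.0, size=3)   # stand-in for the policy output
    u = adapter.command(desired)
    achieved = G @ u + d
    adapter.update(achieved, u)

test_delta = np.array([0.3, -0.2, 0.5])
tracking_error = np.linalg.norm(G @ adapter.command(test_delta) + d - test_delta)
print("tracking error after adaptation:", tracking_error)
```

In this noise-free toy, the adapter recovers the inverse of the simulated dynamics, so the policy's predicted delta and the achieved delta line up. Real hardware adds noise, delays, and nonlinearity that this sketch ignores.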

If this is right

Policies trained with SPACE substantially outperform those that directly predict control commands when using data collected across different embodiments and across hardware units of the same embodiment.
SPACE remains robust under dynamics shifts at deployment, including changes in control frequency, object weight, and controller gains.
The shared Cartesian representation enables effective behavior cloning from aggregated datasets that would otherwise be incompatible due to embodiment-specific actions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same delta-based separation could be tested with additional sensor modalities to further reduce embodiment dependence in policy training.
If the adapter requires per-robot calibration data, the method's advantage would shrink for entirely novel hardware with no prior examples.
Extending the approach to include visual or force observations as part of the shared state could address cases where end-effector position alone is insufficient.

Load-bearing premise

The assumption that an Action Adapter can reliably map a predicted Cartesian state delta into accurate robot-specific commands for any embodiment or hardware unit without introducing large errors that would negate the benefit of the shared representation.

What would settle it

A controlled test on a new robot embodiment where the Action Adapter produces large command errors, causing the full SPACE policy to achieve lower task success than a baseline that directly predicts control commands from the same mixed data.

Figures

Figures reproduced from arXiv: 2606.24049 by Byeongguk Jeon, Haeone Lee, Jian Kim, Kimin Lee, Suchae Jeong.

**Figure 2.** Different robots (e.g., UR5 vs. Franka Research 3) require different control commands to perform the same movement.

**Figure 3.** We study co-training between UR5 and FR3 robots for 3 different tasks, and zero-shot …

**Figure 4.** Success rate for cross-embodiment experiments. We study (a) co-training on FR3 and …

**Figure 5.** FR3 trajectory replayed on a differ…

**Figure 7.** Success rate in DROID. We compare SPACE against policies trained using different action spaces available in DROID.

Learning from multi-hardware data. While the above results show that SPACE enables transfer from one hardware unit to another, an equally important scenario is training a policy on data collected from many hardware units simultaneously. To test this, we use the DROID dataset [7], a large-scal…

**Figure 9.** Success rate after increasing the object weight. [Plot residue from the adjacent figure: b_t (z-axis) versus step t, for box weights of 90 g and 530 g.]

**Figure 10.** Action Adapter bias (b_t) z-axis visualization in the PnP Box task. Heavy box weight leads to an increase in the z-axis value after grasping (highlighted in yellow). The value is averaged over three rollouts.

We also show that SPACE can handle dynamics changes driven by environmental factors such as object weight. Specifically, we use the same PnP Box task and add metal pieces to the box, increasing its weig…

**Figure 11.** Different trajectories for D_cal used for ablation, with circle and square shapes in addition to our original choice, random trajectories.

Calibration trajectories. We also ablate how different choices for initial calibration trajectories affect the Action Adapter performance. To test this, we attempt new calibration trajectories named Circle and Square, in which calibration trajectories draw a circle and a s…
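The three calibration shapes named in the Figure 11 caption can be sketched as planar waypoint generators. Everything concrete here is an assumption for illustration: the function names, the 2D plane, the 5 cm scale, and the point counts. The caption does not give the size, plane, or speed of the trajectories actually used.

```python
import math
import random


def circle_waypoints(radius: float, n: int) -> list[tuple[float, float]]:
    """n points evenly spaced on a circle centred at the origin."""
    return [(radius * math.cos(2 * math.pi * k / n),
             radius * math.sin(2 * math.pi * k / n)) for k in range(n)]


def square_waypoints(half_side: float, n_per_edge: int) -> list[tuple[float, float]]:
    """Points along the perimeter of an axis-aligned square, counter-clockwise."""
    corners = [(-half_side, -half_side), (half_side, -half_side),
               (half_side, half_side), (-half_side, half_side)]
    points = []
    for i, (x0, y0) in enumerate(corners):
        x1, y1 = corners[(i + 1) % 4]
        for k in range(n_per_edge):
            t = k / n_per_edge
            points.append((x0 + t * (x1 - x0), y0 + t * (y1 - y0)))
    return points


def random_waypoints(extent: float, n: int, seed: int = 0) -> list[tuple[float, float]]:
    """Uniformly random points in a square workspace patch."""
    rng = random.Random(seed)
    return [(rng.uniform(-extent, extent), rng.uniform(-extent, extent))
            for _ in range(n)]


circle = circle_waypoints(radius=0.05, n=40)
square = square_waypoints(half_side=0.05, n_per_edge=10)
rand = random_waypoints(extent=0.05, n=40)
print(len(circle), len(square), len(rand))  # 40 40 40
```

One reason such an ablation is informative: a circle or square excites only a structured subset of motion directions at each moment, whereas random waypoints spread the excitation more evenly, which may affect how well an adapter can be fitted from the calibration data.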

Original abstract

In robot learning, scaling training datasets across diverse embodiments and environments has become a dominant paradigm for learning generalizable robot policies. These policies are commonly trained via behavior cloning to imitate actions from pre-collected demonstrations. However, since robot actions are tied to the dynamics of the data collection robot, different robots may require different actions to achieve the same motion. This discrepancy hinders both policy training and deployment across diverse robots. To address this, we propose using Cartesian state delta as a universal action representation across robots, and introduce State Prediction and Adaptive Command Execution (SPACE) framework. SPACE handles robot dynamics variation at three levels: across different embodiments, across hardware units of the same embodiment, and within a single robot during operation. It consists of two components: (i) a Cartesian state delta policy that predicts geometric end-effector displacement, and (ii) Action Adapter, which converts the predicted Cartesian state delta into robot-specific control commands. Experiments show that SPACE substantially outperforms policies that directly predict control commands when learning from data collected across different embodiments and across hardware units of the same embodiment. SPACE also remains robust under dynamics shifts at deployment, including changes in control frequency, object weight, and controller gains. The project page is available at http://haeone.site/space-website/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

SPACE tries to fix cross-embodiment data pooling with Cartesian state deltas and an Action Adapter, but the abstract gives no numbers so the claims can't be checked.

The core idea is to predict end-effector position changes instead of robot-specific commands, then let a separate Action Adapter turn those deltas into actual controls. This split is meant to let data from different robots and even different units of the same robot be used together for training.

What stands out is the explicit handling of three variation levels: embodiment differences, unit-to-unit hardware variation, and runtime shifts like frequency or payload changes. The framing of why direct behavior cloning breaks across robots is clear and practical.

The abstract reports no numbers, baselines, dataset sizes, or adapter error metrics. Without those, it is impossible to tell whether the adapter keeps mapping error low enough for the shared policy to actually win. The stress-test note is correct on this point: adapter accuracy is load-bearing, and nothing in the provided text shows an ablation or failure analysis for it.

The approach is aimed at people building generalist policies who already have multi-robot datasets but cannot merge them directly. A reader working on embodiment transfer would find the split between policy and adapter worth examining, provided the full paper supplies the missing quantitative checks.

The work deserves peer review because the problem is real and the proposed structure is simple enough to test. A referee should ask for adapter error numbers, cross-embodiment ablations, and direct comparison to training separate policies on each robot's data.

Referee Report

2 major / 0 minor

Summary. The manuscript proposes the SPACE framework for learning generalist robot policies from cross-embodiment and cross-hardware-unit data. It replaces direct command prediction with a policy that outputs Cartesian state deltas (geometric end-effector displacements) as a universal representation, paired with an Action Adapter that converts these deltas into robot-specific control commands. The approach is intended to handle dynamics variation at three levels: across embodiments, across units of the same embodiment, and within a single robot at deployment. The abstract states that experiments demonstrate substantial outperformance over direct command prediction and robustness to shifts in control frequency, object weight, and controller gains.

Significance. If the empirical claims are substantiated with quantitative results, the separation of a shared geometric policy from an embodiment-specific adapter could provide a practical route to scaling behavior-cloning datasets across heterogeneous robots. The core idea of using Cartesian deltas to decouple policy learning from robot dynamics is a clear conceptual contribution to cross-robot generalization.

major comments (2)

[Abstract] The central empirical claim that SPACE 'substantially outperforms policies that directly predict control commands' is presented without any reported metrics, baseline details, dataset sizes, number of trials, or statistical tests, preventing assessment of whether the data support the claim.
[Abstract] The load-bearing assumption that the Action Adapter maps predicted Cartesian state deltas to accurate robot-specific commands across embodiments, hardware units, and deployment shifts (frequency, mass, gains) receives no quantitative validation; no adapter command-error metrics, ablations isolating the adapter, or failure-case analysis are described.
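For the first comment, one conventional way to report a success rate with its uncertainty is a Wilson score interval over the evaluation trials. The sketch below is a generic statistical helper, not something the paper is known to use, and the counts in the example are made up.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial success rate (z = 1.96 is about 95%)."""
    if trials <= 0 or not 0 <= successes <= trials:
        raise ValueError("need 0 <= successes <= trials and trials > 0")
    p_hat = successes / trials
    denom = 1 + z * z / trials
    centre = (p_hat + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / trials
                         + z * z / (4 * trials * trials)) / denom
    return centre - half, centre + half


# Made-up counts purely for illustration; not results from the paper.
low, high = wilson_interval(successes=8, trials=10)
print(f"8/10 successes -> 95% interval [{low:.2f}, {high:.2f}]")
```

With only ten trials the interval is wide, which is why the number of evaluation rollouts matters when judging a claim of substantial outperformance.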

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful review and constructive feedback on the abstract. We agree that the abstract would benefit from additional quantitative details to better support the claims. We will revise the manuscript to address these points.

Referee: [Abstract] The central empirical claim that SPACE 'substantially outperforms policies that directly predict control commands' is presented without any reported metrics, baseline details, dataset sizes, number of trials, or statistical tests, preventing assessment of whether the data support the claim.

Authors: We agree that the abstract lacks specific metrics and details. In the revised version, we will update the abstract to include key quantitative results such as success rate improvements (e.g., average success rates across tasks), number of evaluation trials, and references to the experimental sections and tables that detail baselines, dataset sizes, and any statistical comparisons. This will make the empirical claims more self-contained while preserving the abstract's brevity. Revision: yes.

Referee: [Abstract] The load-bearing assumption that the Action Adapter maps predicted Cartesian state deltas to accurate robot-specific commands across embodiments, hardware units, and deployment shifts (frequency, mass, gains) receives no quantitative validation; no adapter command-error metrics, ablations isolating the adapter, or failure-case analysis are described.

Authors: We acknowledge that the abstract does not provide quantitative validation or metrics for the Action Adapter. The full manuscript includes relevant ablations, command-error metrics under varying conditions, and robustness results in the experiments (Section 4). We will revise the abstract to briefly reference these adapter-specific results and the demonstrated robustness to shifts, ensuring the adapter's contribution is supported by evidence from the paper. Revision: yes.
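A command-error metric of the kind discussed in this exchange could be as simple as a per-axis root-mean-square error between the commands an adapter outputs and the commands actually logged for the same displacements. The sketch below assumes that framing; the function name and all numbers are hypothetical, and the paper's own metric may differ.

```python
import numpy as np


def command_rmse(predicted: np.ndarray, logged: np.ndarray) -> np.ndarray:
    """Per-axis root-mean-square error between two (N, D) command arrays."""
    predicted = np.asarray(predicted, dtype=float)
    logged = np.asarray(logged, dtype=float)
    if predicted.shape != logged.shape:
        raise ValueError("arrays must have the same shape")
    return np.sqrt(np.mean((predicted - logged) ** 2, axis=0))


# Made-up numbers for illustration only.
logged_cmds = np.array([[0.10, 0.00], [0.20, 0.10], [0.00, -0.10]])
adapter_cmds = logged_cmds + np.array([0.01, -0.02])  # constant offset per axis
print(command_rmse(adapter_cmds, logged_cmds))  # approximately [0.01 0.02]
```

Reporting this per axis, and separately for each deployment shift, would show where an adapter's error concentrates rather than averaging it away.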

Circularity Check

0 steps flagged

No circularity; empirical framework with no derivation chain

The paper introduces SPACE as a two-component framework (Cartesian state delta policy + Action Adapter) and supports its claims exclusively via experimental comparisons on cross-embodiment and cross-unit data. No equations, fitted parameters, or first-principles derivations are present in the provided text. The central claims (outperformance and robustness) are statistical outcomes of training and evaluation, not quantities that reduce to their own inputs by construction. Self-citation patterns, ansatz smuggling, or uniqueness theorems are absent. This is a standard empirical robotics paper whose validity hinges on experimental design rather than any self-referential mathematical structure.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no free parameters, axioms, or invented entities are stated or can be inferred from the provided text.

pith-pipeline@v0.9.1-grok · 5764 in / 1097 out tokens · 17360 ms · 2026-06-26T00:47:48.629421+00:00 · methodology

Reference graph

Works this paper leans on

42 extracted references · 16 linked inside Pith

[1] R. Bommasani, D. A. Hudson, E. Adeli, R. Altman, S. Arora, S. von Arx, M. S. Bernstein, J. Bohg, A. Bosselut, E. Brunskill, et al. On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258, 2021.
[2] J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
[3] G. Comanici, E. Bieber, M. Schaekermann, I. Pasupat, N. Sachdeva, I. Dhillon, M. Blistein, O. Ram, D. Zhang, E. Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261, 2025.
[4] A. O’Neill, A. Rehman, A. Maddukuri, A. Gupta, A. Padalkar, A. Lee, A. Pooley, A. Gupta, A. Mandlekar, A. Jain, et al. Open x-embodiment: Robotic learning datasets and rt-x models. In International Conference on Robotics and Automation, 2024.
[5] H.-S. Fang, H. Fang, Z. Tang, J. Liu, C. Wang, J. Wang, H. Zhu, and C. Lu. Rh20t: A comprehensive robotic dataset for learning diverse skills in one-shot. In International Conference on Robotics and Automation, 2024.
[6] K. Wu, C. Hou, J. Liu, Z. Che, X. Ju, Z. Yang, M. Li, Y. Zhao, Z. Xu, G. Yang, et al. Robomind: Benchmark on multi-embodiment intelligence normative data for robot manipulation. arXiv preprint arXiv:2412.13877, 2024.
[7] A. Khazatsky, K. Pertsch, S. Nair, A. Balakrishna, S. Dasari, S. Karamcheti, S. Nasiriany, M. K. Srirama, L. Y. Chen, K. Ellis, et al. Droid: A large-scale in-the-wild robot manipulation dataset. arXiv preprint arXiv:2403.12945, 2024.
[8] Q. Bu, J. Cai, L. Chen, X. Cui, Y. Ding, S. Feng, S. Gao, X. He, X. Hu, X. Huang, et al. Agibot world colosseo: A large-scale manipulation platform for scalable and intelligent embodied systems. arXiv preprint arXiv:2503.06669, 2025.
[9] M. Kim, K. Pertsch, S. Karamcheti, T. Xiao, A. Balakrishna, S. Nair, R. Rafailov, E. Foster, G. Lam, P. Sanketi, Q. Vuong, T. Kollar, B. Burchfiel, R. Tedrake, D. Sadigh, S. Levine, P. Liang, and C. Finn. Openvla: An open-source vision-language-action model. arXiv preprint arXiv:2406.09246, 2024.
[10] K. Black, N. Brown, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusai, L. Groom, K. Hausman, B. Ichter, et al. π0: A vision-language-action flow model for general robot control. arXiv preprint arXiv:2410.24164, 2024.
[11] K. Black, N. Brown, J. Darpinian, K. Dhabalia, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusai, et al. π0.5: a vision-language-action model with open-world generalization. arXiv preprint arXiv:2504.16054, 2025.
[12] J. Bjorck, F. Castañeda, N. Cherniadev, X. Da, R. Ding, L. Fan, Y. Fang, D. Fox, F. Hu, S. Huang, et al. Gr00t n1: An open foundation model for generalist humanoid robots. arXiv preprint arXiv:2503.14734, 2025.
[13] J. Zheng, J. Li, D. Liu, Y. Zheng, Z. Wang, Z. Ou, Y. Liu, J. Liu, Y.-Q. Zhang, and X. Zhan. Universal actions for enhanced embodied foundation models. In Computer Vision and Pattern Recognition Conference, 2025.
[14] A. Bronars, Y. Park, and P. Agrawal. Tune to learn: How controller gains shape robot policy learning. arXiv preprint arXiv:2604.02523, 2026.
[15] O. M. Team, D. Ghosh, H. Walke, K. Pertsch, K. Black, O. Mees, S. Dasari, J. Hejna, T. Kreiman, C. Xu, et al. Octo: An open-source generalist robot policy. arXiv preprint arXiv:2405.12213, 2024.
[16] A. Brohan, N. Brown, J. Carbajal, Y. Chebotar, J. Dabis, C. Finn, K. Gopalakrishnan, K. Hausman, A. Herzog, J. Hsu, et al. Rt-1: Robotics transformer for real-world control at scale. In Robotics: Science and Systems, 2023.
[17] B. Zitkovich, T. Yu, S. Xu, P. Xu, T. Xiao, F. Xia, J. Wu, P. Wohlhart, S. Welker, A. Wahid, et al. Rt-2: Vision-language-action models transfer web knowledge to robotic control. In Conference on Robot Learning, 2023.
[18] J. Wen, Y. Zhu, J. Li, Z. Tang, C. Shen, and F. Feng. Dexvla: Vision-language model with plug-in diffusion expert for general robot control. In Conference on Robot Learning, 2025.
[19] J. Zheng, J. Li, Z. Wang, D. Liu, X. Kang, Y. Feng, Y. Zheng, J. Zou, Y. Chen, J. Zeng, et al. X-vla: Soft-prompted transformer as scalable cross-embodiment vision-language-action model. arXiv preprint arXiv:2510.10274, 2025.
[20] S. Ye, J. Jang, B. Jeon, S. J. Joo, J. Yang, B. Peng, A. Mandlekar, R. Tan, Y.-W. Chao, B. Y. Lin, et al. Latent action pretraining from videos. In International Conference on Learning Representations, 2025.
[21] X. B. Peng, M. Andrychowicz, W. Zaremba, and P. Abbeel. Sim-to-real transfer of robotic control with dynamics randomization. In International Conference on Robotics and Automation, 2018.
[22] M. Andrychowicz, B. Baker, M. Chociej, R. Jozefowicz, B. McGrew, J. Pachocki, A. Petron, M. Plappert, G. Powell, A. Ray, et al. Learning dexterous in-hand manipulation. The International Journal of Robotics Research, 39(1):3–20, 2020.
[23] A. Kumar, Z. Fu, D. Pathak, and J. Malik. Rma: Rapid motor adaptation for legged robots. arXiv preprint arXiv:2107.04034, 2021.
[24] H. Qi, A. Kumar, R. Calandra, Y. Ma, and J. Malik. In-hand object rotation via rapid motor adaptation. In Conference on Robot Learning, 2023.
[25] F. Torabi, G. Warnell, and P. Stone. Behavioral cloning from observation. In International Joint Conference on Artificial Intelligence, 2018.
[26] I. Radosavovic, X. Wang, L. Pinto, and J. Malik. State-only imitation learning for dexterous manipulation. In International Conference on Intelligent Robots and Systems, 2021.
[27] N. R. Arachchige, Z. Chen, W. Jung, W. C. Shin, R. Bansal, P. Barroso, Y. H. He, Y. C. Lin, B. Joffe, S. Kousik, et al. Sail: Faster-than-demonstration execution of imitation learning policies. In Conference on Robot Learning, 2025.
[28] S. Kim, J. Kim, and J. J. Lim. Time optimal execution of action chunk policies beyond demonstration speed. In International Conference on Learning Representations, 2026.
[29] C. Chi, Z. Xu, C. Pan, E. Cousineau, B. Burchfiel, S. Feng, R. Tedrake, and S. Song. Universal manipulation interface: In-the-wild robot teaching without in-the-wild robots. In Robotics: Science and Systems, 2024.
[30] Z. Zhaxizhuoma, K. Liu, C. Guan, Z. Jia, Z. Wu, X. Liu, T. Wang, S. Liang, P. Chen, P. Zhang, et al. Fastumi: A scalable and hardware-independent universal manipulation interface with dataset. In Conference on Robot Learning, 2025.
[31] D. A. Pomerleau. Alvinn: An autonomous land vehicle in a neural network. In Advances in Neural Information Processing Systems, 1988.
[32] Y. Feng, J. Zheng, Z. Wang, D. Liu, J. Li, J. Pang, T. Wang, and X. Zhan. Demystifying action space design for robotic manipulation policies. arXiv preprint arXiv:2602.23408, 2026.
[33] O. Khatib. A unified approach for motion and force control of robot manipulators: The operational space formulation. IEEE Journal on Robotics and Automation, 3(1):43–53, 1987.
[34] L. Y. Chen, K. Hari, K. Dharmarajan, C. Xu, Q. Vuong, and K. Goldberg. Mirage: Cross-embodiment zero-shot policy transfer with cross-painting. In Robotics: Science and Systems, 2024.
[35] L. Hao, R. Pagani, M. Beschi, and G. Legnani. Dynamic and friction parameters of an industrial robot: Identification, comparison and repetitiveness analysis. Robotics, 10(1):49, 2021.
[36] T. Z. Zhao, V. Kumar, S. Levine, and C. Finn. Learning fine-grained bimanual manipulation with low-cost hardware. In Robotics: Science and Systems, 2023.
[37] B. Akgun, M. Cakmak, J. W. Yoo, and A. L. Thomaz. Trajectories and keyframes for kinesthetic teaching: A human-robot interaction perspective. In International Conference on Human-Robot Interaction, 2012.
[38] H. Li, Y. Cui, and D. Sadigh. How to train your robots? the impact of demonstration modality on imitation learning. In International Conference on Robotics and Automation, 2025.
[39] S. Haykin and B. Widrow. Least-mean-square adaptive filters. 2003.
[40] S. Ye, Y. Ge, K. Zheng, S. Gao, S. Yu, G. Kurian, S. Indupuru, Y. L. Tan, C. Zhu, J. Xiang, et al. World action models are zero-shot policies. arXiv preprint arXiv:2602.15922, 2026.
[41] P. Atreya, K. Pertsch, T. Lee, M. J. Kim, A. Jain, A. Kuramshin, C. Neary, E. S. Hu, K. Arora, K. Ellis, et al. Roboarena: Distributed real-world evaluation of generalist robot policies. In Conference on Robot Learning, 2025.
[42] L. Y. Chen, S. Adebola, and K. Goldberg. Berkeley UR5 demonstration dataset. https://sites.google.com/view/berkeley-ur5/home.

Appendix A Pseudo Code

Algorithm 1 SPACE rollout with Action Adapter
1: Given: Cartesian state delta policy $\pi_\theta$
2: Define: Action adapter $\hat{u} = W_0 \Delta p + b_0$ and its learning rate $\mu$
3: Collect M calibration trajectories with length K, D_cal = {{p_0...
