Keypose Exploration: Efficient Automatic Trajectory Labelling and Cross-Embodiment Policy Transfer

Hang Xu; Jia Pan; Yizhou Chen; Yupu Lu

arxiv: 2606.29028 · v1 · pith:JMIT4CPBnew · submitted 2026-06-27 · 💻 cs.RO

Keypose Exploration: Efficient Automatic Trajectory Labelling and Cross-Embodiment Policy Transfer

Yupu Lu , Hang Xu , Yizhou Chen , Jia Pan This is my paper

Pith reviewed 2026-06-30 09:20 UTC · model grok-4.3

classification 💻 cs.RO

keywords automatic trajectory labellingkeypose conditioningdiffusion policycross-embodiment transfervision-language modelsrobotic manipulationreachability mapzero-shot transfer

0 comments

The pith

An automatic pipeline labels keyposes from one demonstration to train policies matching standard baselines and supporting cross-embodiment transfer.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops an automatic labelling method that applies vision-language models to detect semantic events in a single demonstration and aligns them with classical trajectory analysis to produce keyposes for grasp-related tasks. These labels train a diffusion policy conditioned on the keyposes, allowing the policy to steer toward specific waypoints during generation. The work tests whether reachability filtering of candidate keyposes can steer the policy toward kinematically feasible options on a new robot body. A sympathetic reader would care because the approach reduces reliance on task-specific manual annotation and explores a route to reuse policies across different hardware without full retraining.

Core claim

The authors establish that a hybrid pipeline of VLM-based semantic event detection on a single demonstration combined with classical trajectory analysis produces labels sufficient for training keypose-conditioned diffusion policies that achieve performance equivalent to a standard diffusion policy baseline on robomimic tasks, and that filtering keypose candidates by a reachability map can support zero-shot cross-embodiment transfer on multimodal insertion when valid options are present.

What carries the argument

The automatic trajectory labelling pipeline that uses VLMs for semantic event detection on one demonstration and classical analysis for temporal alignment, paired with a reachability map to filter keypose candidates for cross-embodiment transfer.

If this is right

Labelled data from the pipeline produces policies that match standard DP baseline performance.
Reachability-filtered keypose conditioning benefits zero-shot transfer on the multimodal insertion task when feasible candidates are available.
VLM inference is required only on one demo among repeating demonstrations per task.
Keypose conditioning intervenes demonstration distributions inside the diffusion policy.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The single-demo labelling approach could lower data preparation costs for new manipulation tasks.
Accurate reachability maps might extend transfer benefits to a broader set of robot morphologies.
The method may apply to additional task types if VLM detection remains reliable outside grasp-related domains.

Load-bearing premise

The VLM semantic event detection on a single demonstration produces accurate enough labels for policy training, and the reachability map correctly identifies kinematically feasible keyposes for the target embodiment.

What would settle it

Policies trained on the automatically labelled data achieve success rates significantly below the standard diffusion policy baseline on the tested tasks, or reachability-filtered keypose conditioning produces no improvement or lower success rates on the multimodal insertion transfer.

Figures

Figures reproduced from arXiv: 2606.29028 by Hang Xu, Jia Pan, Yizhou Chen, Yupu Lu.

**Figure 2.** Figure 2: One square peg-insertion demo echos Fig. 1 . Keypose [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗

**Figure 3.** Figure 3: Rollout comparison for Square on UR5e across seven time stamps ( [PITH_FULL_IMAGE:figures/full_fig_p006_3.png] view at source ↗

read the original abstract

Keypose-based manipulation decomposes tasks into critical waypoints to simplify policy learning for long-horizon tasks, but existing approaches rely on task-specific heuristics or manual annotation to extract keyposes from demonstrations. We present an automatic trajectory labelling pipeline for grasp-related tasks. This pipeline combines vision-language models (VLMs) for semantic event detection with classical trajectory analysis for precise temporal alignment, requiring VLM inference only on one single demo among repeating ones per task. Using the labelled data, we train a keypose-guided Diffusion Policy (DP) that exploits keypose conditioning to intervene demonstration distributions. We explore the possibility to apply this property for cross-embodiment transfer: candidate keyposes are sampled and filtered via a reachability map, steering the policy toward kinematically feasible keyposes for the target robot. As a preliminary feasibility study, experiments on two robomimic tasks show that the labelled data produces policies matching a standard DP baseline, and that reachability-filtered keypose conditioning may benefit zero-shot transfer on the multimodal insertion task when feasible candidates are available.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper gives a workable pipeline for single-demo VLM keypose labelling plus reachability-filtered conditioning in diffusion policies, but the VLM step has no accuracy check so the results stay tentative.

read the letter

The main thing here is a pipeline that runs a VLM once on a single demonstration to spot grasp-related events, then uses classical trajectory analysis to label the rest of the demos. Those labels condition a diffusion policy, and the same conditioning is filtered by a reachability map to steer toward feasible keyposes on a different robot.

That combination is new enough for a conference paper. It cuts annotation effort and the two robomimic experiments show the resulting policies match a plain diffusion baseline. The reachability filter also looks like it can help zero-shot transfer on the multimodal insertion task when good candidates exist.

The soft spot is exactly the one in the stress-test note. There is no reported check on whether the VLM actually finds the right events with usable timing on that single reference demo. No precision numbers, no temporal error stats, no ablation that removes the VLM labels. If the detection is off, every downstream label is off, yet the policy parity could still appear if the tasks are forgiving. The study is also small—only two tasks, no error bars, and the transfer claim is described as preliminary.

This is for people already working with diffusion policies and keypose methods who need a practical labelling shortcut. It is coherent on its own terms and shows honest engagement with the problem, so it deserves a serious referee even though the current evidence is limited.

Referee Report

2 major / 1 minor

Summary. The manuscript presents an automatic trajectory labelling pipeline for grasp-related manipulation tasks. It uses vision-language models (VLMs) to detect semantic events on a single demonstration, combined with classical trajectory analysis for precise alignment across multiple demos. The labelled data is used to train a keypose-guided Diffusion Policy (DP), which conditions on keyposes to steer the policy. The approach is explored for cross-embodiment policy transfer by sampling and filtering candidate keyposes using a reachability map to ensure kinematic feasibility for the target robot. As a preliminary study, experiments on two robomimic tasks demonstrate that policies from the labelled data match a standard DP baseline, and reachability-filtered conditioning may improve zero-shot transfer on the multimodal insertion task.

Significance. If the VLM labelling step proves reliable, the pipeline offers an efficiency advantage by requiring VLM inference on only one demo per task and could support more scalable keypose-based policies and cross-embodiment transfer without task-specific heuristics. The preliminary results on standard robomimic tasks provide a starting point for such methods. However, as a feasibility study limited to two tasks without detailed metrics or error bars, the significance remains exploratory.

major comments (2)

[Abstract / experimental evaluation] The claim that labelled data produces policies matching a standard DP baseline (abstract) rests on the VLM correctly identifying grasp-related semantic events with sufficient temporal precision from a single demo. No quantitative validation of this step (e.g., precision/recall of detected events, temporal offset statistics, or ablation removing VLM labels) is reported against ground-truth annotations. This is load-bearing for the central claim, as systematic detection errors would invalidate the labels while still permitting apparent policy parity if the baseline is weak or tasks are simple.
[Abstract / results discussion] The cross-embodiment transfer result is described only as 'may benefit' zero-shot transfer 'when feasible candidates are available' (abstract). Without reported quantitative improvements, conditions for candidate availability, or comparison to unfiltered conditioning, it is unclear whether the reachability map provides a meaningful advantage beyond the baseline DP.

minor comments (1)

[Abstract] The abstract and introduction could more clearly distinguish the preliminary nature of the study (only two tasks) from the main claims to set appropriate expectations.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for their constructive feedback on this preliminary feasibility study. We address the major comments point by point below.

read point-by-point responses

Referee: [Abstract / experimental evaluation] The claim that labelled data produces policies matching a standard DP baseline (abstract) rests on the VLM correctly identifying grasp-related semantic events with sufficient temporal precision from a single demo. No quantitative validation of this step (e.g., precision/recall of detected events, temporal offset statistics, or ablation removing VLM labels) is reported against ground-truth annotations. This is load-bearing for the central claim, as systematic detection errors would invalidate the labels while still permitting apparent policy parity if the baseline is weak or tasks are simple.

Authors: We agree that direct quantitative validation of the VLM labelling (e.g., precision/recall or temporal offsets vs. ground truth) is absent and would strengthen the claims. As this is a preliminary end-to-end feasibility study on standard tasks, such annotations were not collected; policy parity with the baseline serves as indirect evidence of usable labels. We will partially revise the abstract and discussion to explicitly note this limitation and the preliminary scope. revision: partial
Referee: [Abstract / results discussion] The cross-embodiment transfer result is described only as 'may benefit' zero-shot transfer 'when feasible candidates are available' (abstract). Without reported quantitative improvements, conditions for candidate availability, or comparison to unfiltered conditioning, it is unclear whether the reachability map provides a meaningful advantage beyond the baseline DP.

Authors: The abstract already employs cautious phrasing ('may benefit'). We will revise to add a brief clarification on the conditions observed for the insertion task and the role of the reachability map in ensuring kinematic feasibility, while reiterating the preliminary nature of the result. revision: yes

standing simulated objections not resolved

Quantitative validation of VLM event detection (precision/recall, temporal offsets) against ground-truth annotations, as no such annotations were collected in the original study.

Circularity Check

0 steps flagged

No significant circularity; empirical pipeline with external components

full rationale

The paper describes an empirical pipeline that applies an external VLM for semantic event detection on a single demonstration, followed by classical trajectory alignment to label the rest, then trains a conditioned Diffusion Policy and evaluates it on robomimic tasks. No equations, uniqueness theorems, or derivations are presented that reduce by construction to the inputs; the central claims are empirical performance matches to a baseline and preliminary transfer observations. The method relies on off-the-shelf VLM inference and standard reachability analysis rather than any self-referential fitting or self-citation chain. This is a standard applied robotics contribution whose results stand or fall on the reported experiments, not on tautological re-labeling of its own assumptions.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

Based on abstract only; no free parameters or invented entities are described. Relies on domain assumptions about VLM accuracy and reachability without explicit fitting or new entities.

axioms (2)

domain assumption VLMs can detect semantic events in robot demonstration videos accurately enough for labelling
The pipeline relies on this for automatic labelling without manual annotation.
domain assumption Reachability maps can accurately filter kinematically feasible keyposes for target robots
Used for cross-embodiment transfer.

pith-pipeline@v0.9.1-grok · 5718 in / 1353 out tokens · 43910 ms · 2026-06-30T09:20:55.881786+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

25 extracted references · 7 canonical work pages · 4 internal anchors

[1]

Act3D: 3D Feature Field Transformers for Multi-Task Robotic Manipulation,

T. Gervet, Z. Xian, N. Gkanatsios, and K. Fragkiadaki, “Act3D: 3D Feature Field Transformers for Multi-Task Robotic Manipulation,” in Conference on Robot Learning, 2023

2023
[2]

ChainedDiffuser: Unifying Trajectory Diffusion and Keypose Predic- tion for Robotic Manipulation,

Z. Xian, N. Gkanatsios, T. Gervet, T.-W. Ke, and K. Fragkiadaki, “ChainedDiffuser: Unifying Trajectory Diffusion and Keypose Predic- tion for Robotic Manipulation,” inConference on Robot Learning, 2023

2023
[3]

Hierarchical Diffusion Policy for Kinematics-Aware Multi-Task Robotic Manipulation,

X. Ma, S. Patidar, I. Haughton, and S. James, “Hierarchical Diffusion Policy for Kinematics-Aware Multi-Task Robotic Manipulation,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 18 081–18 090

2024
[4]

Bikc: Keypose-conditioned consistency policy for bimanual robotic manipulation,

D. Yu, H. Xu, Y . Chen, Y . Ren, and J. Pan, “Bikc: Keypose-conditioned consistency policy for bimanual robotic manipulation,”arXiv preprint arXiv:2406.10093, 2024

work page arXiv 2024
[5]

Bikc+: Bimanual hierarchical imitation with keypose-conditioned coordination-aware consistency policies,

H. Xu, Y . Chen, D. Yu, Y . Ren, and J. Pan, “Bikc+: Bimanual hierarchical imitation with keypose-conditioned coordination-aware consistency policies,”IEEE Transactions on Automation Science and Engineering, vol. 23, pp. 1064–1079, 2025

2025
[6]

Bi-KVIL: Keypoints-based visual imitation learning of bimanual manipulation tasks,

J. Gao, X. Jin, F. Krebs, N. Jaquier, and T. Asfour, “Bi-KVIL: Keypoints-based visual imitation learning of bimanual manipulation tasks,” in2024 IEEE International Conference on Robotics and Automation (ICRA), 2024, pp. 16 850–16 857

2024
[7]

A Pragmatic VLA Foundation Model

W. Wu, F. Lu, Y . Wang, S. Yang, S. Liu, F. Wang, Q. Zhu, H. Sun, Y . Wang, S. Maet al., “A pragmatic vla foundation model,”arXiv preprint arXiv:2601.18692, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[8]

RichMap: A Reachability Map Balancing Precision, Efficiency, and Flexibility for Rich Robot Manipulation Tasks

Y . Lu, Y . Ma, and J. Pan, “Richmap: A reachability map balancing precision, efficiency, and flexibility for rich robot manipulation tasks,” arXiv preprint arXiv:2604.06778, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[9]

What Matters in Learning from Offline Human Demonstrations for Robot Manipulation

A. Mandlekar, D. Xu, J. Wong, S. Nasiriany, C. Wang, R. Kulkarni, L. Fei-Fei, S. Savarese, Y . Zhu, and R. Martín-Martín, “What matters in learning from offline human demonstrations for robot manipulation,” inarXiv preprint arXiv:2108.03298, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[10]

Modeling human under- standing of complex intentional action with a bayesian nonparametric subgoal model,

R. Nakahashi, C. Baker, and J. Tenenbaum, “Modeling human under- standing of complex intentional action with a bayesian nonparametric subgoal model,” inAAAI Conference on Artificial Intelligence, vol. 30, no. 1, 2016

2016
[11]

Humans choose visual subgoals to reduce cognitive cost,

F. J. Binder, M. G. Mattar, D. Kirsh, and J. E. Fan, “Humans choose visual subgoals to reduce cognitive cost,” inAnnual Meeting of the Cognitive Science Society, vol. 45, no. 45, 2023

2023
[12]

Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware,

T. Zhao, V . Kumar, S. Levine, and C. Finn, “Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware,” inRobotics: Science and Systems, 2023

2023
[13]

Universal Manipulation Interface: In-The-Wild Robot Teaching Without In-The-Wild Robots,

C. Chi, Z. Xu, C. Pan, E. Cousineau, B. Burchfiel, S. Feng, R. Tedrake, and S. Song, “Universal Manipulation Interface: In-The-Wild Robot Teaching Without In-The-Wild Robots,” inRobotics: Science and Systems, 2024

2024
[14]

Body Extension by Using Two Mobile Manipulators,

Y . Hirao, W. Wan, D. Kanoulas, and K. Harada, “Body Extension by Using Two Mobile Manipulators,”Cyborg and Bionic Systems, vol. 4, p. 0014, Jan. 2023

2023
[15]

A Novel Robust Imitation Learning Framework for Complex Skills With Limited Demonstrations,

W. Wang, C. Zeng, H. Zhan, and C. Yang, “A Novel Robust Imitation Learning Framework for Complex Skills With Limited Demonstrations,” IEEE Transactions on Automation Science and Engineering, pp. 1–13, 2024

2024
[16]

LeTO: Learning Constrained Visuomotor Policy With Differentiable Trajectory Optimization,

Z. Xu and Y . She, “LeTO: Learning Constrained Visuomotor Policy With Differentiable Trajectory Optimization,”IEEE Transactions on Automation Science and Engineering, pp. 1–12, 2024

2024
[17]

Diffusion policy: Visuomotor policy learning via action diffusion,

C. Chi, S. Feng, Y . Du, Z. Xu, E. Cousineau, B. Burchfiel, and S. Song, “Diffusion policy: Visuomotor policy learning via action diffusion,” in Proceedings of Robotics: Science and Systems (RSS), 2023

2023
[18]

ManiCM: Real-time 3D Diffusion Policy via Consistency Model for Robotic Manipulation,

G. Lu, Z. Gao, T. Chen, W. Dai, Z. Wang, and Y . Tang, “ManiCM: Real-time 3D Diffusion Policy via Consistency Model for Robotic Manipulation,” 2024

2024
[19]

Consistency Policy: Accelerated Visuomotor Policies via Consistency Distillation,

A. Prasad, K. Lin, J. Wu, L. Zhou, and J. Bohg, “Consistency Policy: Accelerated Visuomotor Policies via Consistency Distillation,” in Robotics: Science and Systems, 2024

2024
[20]

Consistency models,

Y . Song, P. Dhariwal, M. Chen, and I. Sutskever, “Consistency models,” inInternational Conference on Machine Learning. PMLR, 2023, pp. 32 211–32 252

2023
[21]

Improved Techniques for Training Consis- tency Models,

Y . Song and P. Dhariwal, “Improved Techniques for Training Consis- tency Models,” inInternational Conference on Learning Representa- tions, 2024

2024
[22]

Qwen3-VL Technical Report

S. Bai, Y . Cai, R. Chen, K. Chen, X. Chen, Z. Cheng, L. Deng, W. Ding, C. Gao, C. Ge, W. Ge, Z. Guo, Q. Huang, J. Huang, F. Huang, B. Hui, S. Jiang, Z. Li, M. Li, M. Li, K. Li, Z. Lin, J. Lin, X. Liu, J. Liu, C. Liu, Y . Liu, D. Liu, S. Liu, D. Lu, R. Luo, C. Lv, R. Men, L. Meng, X. Ren, X. Ren, S. Song, Y . Sun, J. Tang, J. Tu, J. Wan, P. Wang, P. Wang,...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[23]

On the continuity of rotation representations in neural networks,

Y . Zhou, C. Barnes, J. Lu, J. Yang, and H. Li, “On the continuity of rotation representations in neural networks,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 5745–5753

2019
[24]

Mirage: Cross-embodiment zero-shot policy transfer with cross-painting,

L. Y . Chen, K. Hari, K. Dharmarajan, C. Xu, Q. Vuong, and K. Goldberg, “Mirage: Cross-embodiment zero-shot policy transfer with cross-painting,”arXiv preprint arXiv:2402.19249, 2024

work page arXiv 2024
[25]

Oxe-auge: A large-scale robot augmentation of oxe for scaling cross-embodiment policy learning,

G. Ji, H. Polavaram, L. Y . Chen, S. Bajamahal, Z. Ma, S. Adebola, C. Xu, and K. Goldberg, “Oxe-auge: A large-scale robot augmentation of oxe for scaling cross-embodiment policy learning,”arXiv preprint arXiv:2512.13100, 2025

work page arXiv 2025

[1] [1]

Act3D: 3D Feature Field Transformers for Multi-Task Robotic Manipulation,

T. Gervet, Z. Xian, N. Gkanatsios, and K. Fragkiadaki, “Act3D: 3D Feature Field Transformers for Multi-Task Robotic Manipulation,” in Conference on Robot Learning, 2023

2023

[2] [2]

ChainedDiffuser: Unifying Trajectory Diffusion and Keypose Predic- tion for Robotic Manipulation,

Z. Xian, N. Gkanatsios, T. Gervet, T.-W. Ke, and K. Fragkiadaki, “ChainedDiffuser: Unifying Trajectory Diffusion and Keypose Predic- tion for Robotic Manipulation,” inConference on Robot Learning, 2023

2023

[3] [3]

Hierarchical Diffusion Policy for Kinematics-Aware Multi-Task Robotic Manipulation,

X. Ma, S. Patidar, I. Haughton, and S. James, “Hierarchical Diffusion Policy for Kinematics-Aware Multi-Task Robotic Manipulation,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 18 081–18 090

2024

[4] [4]

Bikc: Keypose-conditioned consistency policy for bimanual robotic manipulation,

D. Yu, H. Xu, Y . Chen, Y . Ren, and J. Pan, “Bikc: Keypose-conditioned consistency policy for bimanual robotic manipulation,”arXiv preprint arXiv:2406.10093, 2024

work page arXiv 2024

[5] [5]

Bikc+: Bimanual hierarchical imitation with keypose-conditioned coordination-aware consistency policies,

H. Xu, Y . Chen, D. Yu, Y . Ren, and J. Pan, “Bikc+: Bimanual hierarchical imitation with keypose-conditioned coordination-aware consistency policies,”IEEE Transactions on Automation Science and Engineering, vol. 23, pp. 1064–1079, 2025

2025

[6] [6]

Bi-KVIL: Keypoints-based visual imitation learning of bimanual manipulation tasks,

J. Gao, X. Jin, F. Krebs, N. Jaquier, and T. Asfour, “Bi-KVIL: Keypoints-based visual imitation learning of bimanual manipulation tasks,” in2024 IEEE International Conference on Robotics and Automation (ICRA), 2024, pp. 16 850–16 857

2024

[7] [7]

A Pragmatic VLA Foundation Model

W. Wu, F. Lu, Y . Wang, S. Yang, S. Liu, F. Wang, Q. Zhu, H. Sun, Y . Wang, S. Maet al., “A pragmatic vla foundation model,”arXiv preprint arXiv:2601.18692, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[8] [8]

RichMap: A Reachability Map Balancing Precision, Efficiency, and Flexibility for Rich Robot Manipulation Tasks

Y . Lu, Y . Ma, and J. Pan, “Richmap: A reachability map balancing precision, efficiency, and flexibility for rich robot manipulation tasks,” arXiv preprint arXiv:2604.06778, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[9] [9]

What Matters in Learning from Offline Human Demonstrations for Robot Manipulation

A. Mandlekar, D. Xu, J. Wong, S. Nasiriany, C. Wang, R. Kulkarni, L. Fei-Fei, S. Savarese, Y . Zhu, and R. Martín-Martín, “What matters in learning from offline human demonstrations for robot manipulation,” inarXiv preprint arXiv:2108.03298, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021

[10] [10]

Modeling human under- standing of complex intentional action with a bayesian nonparametric subgoal model,

R. Nakahashi, C. Baker, and J. Tenenbaum, “Modeling human under- standing of complex intentional action with a bayesian nonparametric subgoal model,” inAAAI Conference on Artificial Intelligence, vol. 30, no. 1, 2016

2016

[11] [11]

Humans choose visual subgoals to reduce cognitive cost,

F. J. Binder, M. G. Mattar, D. Kirsh, and J. E. Fan, “Humans choose visual subgoals to reduce cognitive cost,” inAnnual Meeting of the Cognitive Science Society, vol. 45, no. 45, 2023

2023

[12] [12]

Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware,

T. Zhao, V . Kumar, S. Levine, and C. Finn, “Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware,” inRobotics: Science and Systems, 2023

2023

[13] [13]

Universal Manipulation Interface: In-The-Wild Robot Teaching Without In-The-Wild Robots,

C. Chi, Z. Xu, C. Pan, E. Cousineau, B. Burchfiel, S. Feng, R. Tedrake, and S. Song, “Universal Manipulation Interface: In-The-Wild Robot Teaching Without In-The-Wild Robots,” inRobotics: Science and Systems, 2024

2024

[14] [14]

Body Extension by Using Two Mobile Manipulators,

Y . Hirao, W. Wan, D. Kanoulas, and K. Harada, “Body Extension by Using Two Mobile Manipulators,”Cyborg and Bionic Systems, vol. 4, p. 0014, Jan. 2023

2023

[15] [15]

A Novel Robust Imitation Learning Framework for Complex Skills With Limited Demonstrations,

W. Wang, C. Zeng, H. Zhan, and C. Yang, “A Novel Robust Imitation Learning Framework for Complex Skills With Limited Demonstrations,” IEEE Transactions on Automation Science and Engineering, pp. 1–13, 2024

2024

[16] [16]

LeTO: Learning Constrained Visuomotor Policy With Differentiable Trajectory Optimization,

Z. Xu and Y . She, “LeTO: Learning Constrained Visuomotor Policy With Differentiable Trajectory Optimization,”IEEE Transactions on Automation Science and Engineering, pp. 1–12, 2024

2024

[17] [17]

Diffusion policy: Visuomotor policy learning via action diffusion,

C. Chi, S. Feng, Y . Du, Z. Xu, E. Cousineau, B. Burchfiel, and S. Song, “Diffusion policy: Visuomotor policy learning via action diffusion,” in Proceedings of Robotics: Science and Systems (RSS), 2023

2023

[18] [18]

ManiCM: Real-time 3D Diffusion Policy via Consistency Model for Robotic Manipulation,

G. Lu, Z. Gao, T. Chen, W. Dai, Z. Wang, and Y . Tang, “ManiCM: Real-time 3D Diffusion Policy via Consistency Model for Robotic Manipulation,” 2024

2024

[19] [19]

Consistency Policy: Accelerated Visuomotor Policies via Consistency Distillation,

A. Prasad, K. Lin, J. Wu, L. Zhou, and J. Bohg, “Consistency Policy: Accelerated Visuomotor Policies via Consistency Distillation,” in Robotics: Science and Systems, 2024

2024

[20] [20]

Consistency models,

Y . Song, P. Dhariwal, M. Chen, and I. Sutskever, “Consistency models,” inInternational Conference on Machine Learning. PMLR, 2023, pp. 32 211–32 252

2023

[21] [21]

Improved Techniques for Training Consis- tency Models,

Y . Song and P. Dhariwal, “Improved Techniques for Training Consis- tency Models,” inInternational Conference on Learning Representa- tions, 2024

2024

[22] [22]

Qwen3-VL Technical Report

S. Bai, Y . Cai, R. Chen, K. Chen, X. Chen, Z. Cheng, L. Deng, W. Ding, C. Gao, C. Ge, W. Ge, Z. Guo, Q. Huang, J. Huang, F. Huang, B. Hui, S. Jiang, Z. Li, M. Li, M. Li, K. Li, Z. Lin, J. Lin, X. Liu, J. Liu, C. Liu, Y . Liu, D. Liu, S. Liu, D. Lu, R. Luo, C. Lv, R. Men, L. Meng, X. Ren, X. Ren, S. Song, Y . Sun, J. Tang, J. Tu, J. Wan, P. Wang, P. Wang,...

work page internal anchor Pith review Pith/arXiv arXiv 2025

[23] [23]

On the continuity of rotation representations in neural networks,

Y . Zhou, C. Barnes, J. Lu, J. Yang, and H. Li, “On the continuity of rotation representations in neural networks,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 5745–5753

2019

[24] [24]

Mirage: Cross-embodiment zero-shot policy transfer with cross-painting,

L. Y . Chen, K. Hari, K. Dharmarajan, C. Xu, Q. Vuong, and K. Goldberg, “Mirage: Cross-embodiment zero-shot policy transfer with cross-painting,”arXiv preprint arXiv:2402.19249, 2024

work page arXiv 2024

[25] [25]

Oxe-auge: A large-scale robot augmentation of oxe for scaling cross-embodiment policy learning,

G. Ji, H. Polavaram, L. Y . Chen, S. Bajamahal, Z. Ma, S. Adebola, C. Xu, and K. Goldberg, “Oxe-auge: A large-scale robot augmentation of oxe for scaling cross-embodiment policy learning,”arXiv preprint arXiv:2512.13100, 2025

work page arXiv 2025