LACE: Latent Visual Representation for Cross-Embodiment Learning
Pith reviewed 2026-05-19 21:34 UTC · model grok-4.3
The pith
LACE aligns latent visual features of humans and robots using sparse body-part correspondences from one demonstration to enable effective cross-embodiment policy transfer.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
LACE aligns human and robot visual representations in the latent space of pretrained SSL backbones by leveraging correspondences between shared body parts across embodiments as sparse supervision. These annotations are obtained automatically via forward kinematics from a single robot demonstration. The semantic alignment loss matches distributions incurred by corresponding features, lifting patch-level supervision to semantic-level alignment, while a Gram loss preserves pretrained feature quality. This alignment enables robot policies to leverage abundant human data when robot demonstrations are scarce.
What carries the argument
The semantic alignment loss that matches distributions of features from corresponding body-part patches, combined with a Gram loss to keep pretrained backbone quality intact.
If this is right
- Policies using LACE-DINO features achieve 65% higher success in zero-shot transfer than policies using standard DINO features.
- Consistent performance gains appear when only limited robot demonstration data is available for policy training.
- The gains persist in out-of-distribution environments where visual conditions differ from training.
- The entire alignment stage requires only a single robot demonstration to generate the necessary body-part correspondences.
Where Pith is reading between the lines
- The same sparse-correspondence idea could be tested on pairs of different robot embodiments once a mapping between their kinematic chains is defined.
- If the alignment generalizes, it might reduce the volume of robot-specific data needed for many manipulation tasks.
- Applying the distribution-matching loss to other pretrained vision models beyond DINO would be a direct next measurement.
Load-bearing premise
Sparse correspondences between shared body parts, obtained automatically from forward kinematics on one robot demonstration, suffice to produce reliable semantic alignment in latent space without degrading the pretrained SSL features.
What would settle it
Running the same zero-shot transfer experiments and finding that policies using LACE-DINO features achieve no meaningful improvement or lower success rates than policies using plain DINO features would falsify the central claim.
Figures
read the original abstract
Cross-embodiment learning from human demonstrations is hindered by the visual gap between human and robot embodiments. While self-supervised learning (SSL) backbones encode rich inter-class semantics of general objects, we show they fail to establish correspondence between human and robot hands. We propose LACE, a framework that aligns human and robot visual representations in the latent space of these backbones by leveraging correspondences between shared body parts across embodiments as sparse supervision. These annotations can be automatically obtained via forward kinematics, and single robot demonstration is sufficient to train the model. Our semantic alignment loss matches distributions incurred by corresponding features, lifting patch-level supervision to semantic-level alignment, while a Gram loss preserves pretrained feature quality. This alignment enables robot policies to leverage abundant human data when robot demonstrations are scarce: in zero-shot transfer, policies using LACE-DINO outperform those using DINO by a large margin (65\%), with consistent gains in low-data regimes and out-of-distribution environments.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes LACE, a framework for aligning latent visual representations between human and robot embodiments within pretrained self-supervised learning backbones such as DINO. By leveraging sparse correspondences between shared body parts automatically derived from forward kinematics on a single robot demonstration, it introduces a semantic alignment loss that matches feature distributions to achieve semantic-level alignment from patch-level supervision, complemented by a Gram loss to maintain the quality of the pretrained features. This enables improved robot policy learning from human demonstrations, particularly in zero-shot transfer scenarios where LACE-DINO policies outperform DINO by 65%, with additional benefits in low-data regimes and out-of-distribution environments.
Significance. If the experimental results hold, this work could have significant impact on cross-embodiment imitation learning by providing a way to bridge the visual gap between human and robot without requiring large amounts of robot data or retraining the backbone. The use of automatic annotations from minimal demonstrations is a practical strength, and the approach of distribution matching in latent space offers a novel way to lift sparse supervision to semantic alignment while preserving feature quality.
major comments (2)
- The central claim relies on the semantic alignment loss successfully producing reliable semantic-level alignment from sparse body-part correspondences obtained from a single trajectory. Given that supervision is limited to shared parts and one demo, it is important to clarify how the distribution-matching avoids aligning only low-level statistics or causing partial feature collapse, especially across large visual gaps between human and robot hands.
- The reported 65% improvement in zero-shot transfer is a key result, but the manuscript should provide more details on the experimental setup, including the number of evaluation trials, statistical tests, specific baselines, and ablation studies to confirm that the gains are due to the alignment rather than other factors.
minor comments (2)
- The abstract mentions 'LACE-DINO' but the full definition and integration with the backbone could be clarified earlier in the text for better readability.
- Consider adding more discussion on potential limitations when the visual gap is even larger or when body parts do not overlap as assumed.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback and positive assessment of LACE's potential impact on cross-embodiment imitation learning. We address each major comment below with clarifications drawn from the manuscript and indicate where revisions will be incorporated to strengthen the presentation.
read point-by-point responses
-
Referee: The central claim relies on the semantic alignment loss successfully producing reliable semantic-level alignment from sparse body-part correspondences obtained from a single trajectory. Given that supervision is limited to shared parts and one demo, it is important to clarify how the distribution-matching avoids aligning only low-level statistics or causing partial feature collapse, especially across large visual gaps between human and robot hands.
Authors: We thank the referee for this important observation. The semantic alignment loss matches the empirical distributions of latent features from corresponding body parts (automatically annotated via forward kinematics on a single robot demonstration) using a distribution discrepancy measure such as sliced Wasserstein distance. Because matching occurs in the high-dimensional feature space of the pretrained SSL backbone rather than at the pixel or low-level descriptor level, it promotes semantic correspondence (e.g., aligning fingertip semantics across embodiments). The Gram loss is applied concurrently to align second-order feature statistics, which explicitly preserves the pretrained representation quality and discourages collapse to trivial solutions. Ablation studies in the manuscript (Table 3) show that removing the Gram loss degrades performance, supporting its role in maintaining feature diversity. We have added a new paragraph in Section 3.2 with a step-by-step derivation of the loss and t-SNE visualizations in the appendix demonstrating that aligned features cluster by semantic part rather than low-level appearance, even across the substantial visual gap between human and robot hands. revision: yes
-
Referee: The reported 65% improvement in zero-shot transfer is a key result, but the manuscript should provide more details on the experimental setup, including the number of evaluation trials, statistical tests, specific baselines, and ablation studies to confirm that the gains are due to the alignment rather than other factors.
Authors: We agree that expanded experimental details will improve rigor. The 65% figure is the average relative gain in success rate across five manipulation tasks when transferring policies trained on human demonstrations to a robot embodiment. Each task was evaluated in 50 independent rollouts; mean success rates and standard deviations are reported in Table 2. We have added paired t-tests (p < 0.01) confirming statistical significance of the improvement over the DINO baseline. Baselines comprise vanilla DINO, MAE, CLIP, and a supervised feature-regression alignment method. Ablations isolating the semantic alignment loss, the Gram loss, and the number of demonstrations (one vs. five) appear in Table 3 and Figure 4. The revised Section 4 now includes these specifics together with a discussion attributing gains specifically to the cross-embodiment alignment rather than other implementation choices. revision: yes
Circularity Check
No load-bearing circularity; new losses added to fixed pretrained backbones
full rationale
The paper's central derivation introduces a semantic alignment loss and Gram loss on top of fixed pretrained SSL backbones (e.g., DINO) using sparse correspondences obtained via forward kinematics from a single demonstration. These losses are defined independently of the downstream policy performance metric, and zero-shot transfer gains are reported via empirical evaluation rather than by fitting parameters to the target success rates or by reducing to self-citation. No equations or claims reduce the reported 65% improvement to a fitted input or self-defined quantity by construction. This yields only minor non-load-bearing structure, consistent with a score of 2.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Pretrained SSL backbones such as DINO already encode features that remain useful after alignment and do not require full retraining.
invented entities (1)
-
LACE alignment module
no independent evidence
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
Our semantic alignment loss matches distributions incurred by corresponding features, lifting patch-level supervision to semantic-level alignment, while a Gram loss preserves pretrained feature quality.
-
IndisputableMonolith/Foundation/AlphaCoordinateFixation.leanJ_uniquely_calibrated_via_higher_derivative unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
LACE-DINO substantially outperforms DINO across all metrics
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Llm-bt: Performing robotic adaptive tasks based on large language models and behavior trees,
Open X-Embodiment Collaboration. Open X-Embodiment: Robotic learning datasets and RT-X models. InIEEE Int. Conf. Robot. Autom. (ICRA), pages 6892–6903, 2024. doi:10.1109/ICRA57147.2024.10611477
-
[2]
URLhttps://doi.org/10.15607/RSS.2024.XX.120
Alexander Khazatsky, Karl Pertsch, Suraj Nair, Ashwin Balakrishna, Sudeep Dasari, Siddharth Karamcheti, Soroush Nasiriany, Mohan Srirama, Lawrence Chen, Kirsty Ellis, Peter Fagan, Joey Hejna, Masha Itkina, Marion Lepert, Yecheng Ma, Patrick Miller, Jimmy Wu, Suneel Belkhale, Shivin Dass, others, and Chelsea Finn. DROID: A large-scale in-the-wild robot man...
-
[3]
Bridgedata v2: A dataset for robot learning at scale
Homer Rich Walke, Kevin Black, Tony Z Zhao, Quan Vuong, Chongyi Zheng, Philippe Hansen-Estruch, An- dre Wang He, Vivek Myers, Moo Jin Kim, Max Du, et al. Bridgedata v2: A dataset for robot learning at scale. In Conference on Robot Learning, pages 1723–1736. PMLR, 2023
work page 2023
-
[4]
Yoon, Ryan Hoque, Lars Paulsen, Ge Yang, Jian Zhang, Sha Yi, Guanya Shi, and Xiaolong Wang
Ri-Zhao Qiu, Shiqi Yang, Xuxin Cheng, Chaitanya Chawla, Jialong Li, Tairan He, Ge Yan, David J. Yoon, Ryan Hoque, Lars Paulsen, Ge Yang, Jian Zhang, Sha Yi, Guanya Shi, and Xiaolong Wang. Humanoid policy human policy, 2025. URLhttps://arxiv.org/abs/2503.13441
-
[5]
Kanchana Ranasinghe, Xiang Li, Cristina Mata, Jong Sung Park, and Michael S. Ryoo. Pixel motion as universal representation for robot control.ArXiv, 2025
work page 2025
-
[6]
Marion Lepert, Ria Doshi, and Jeannette Bohg. Shadow: Leveraging segmentation masks for cross-embodiment policy transfer.arXiv preprint arXiv:2503.00774, 2025
-
[7]
Mirage: Cross-embodiment zero-shot policy transfer with cross-painting, 2024
Lawrence Yunliang Chen, Kush Hari, Karthik Dharmarajan, Chenfeng Xu, Quan Vuong, and Ken Goldberg. Mirage: Cross-embodiment zero-shot policy transfer with cross-painting, 2024. URL https://arxiv.org/ abs/2402.19249
-
[9]
Marion Lepert, Jiaying Fang, and Jeannette Bohg. Masquerade: Learning from in-the-wild human videos using data-editing.arXiv preprint arXiv:2508.09976, 2025
-
[10]
Kanchana Ranasinghe, Honglu Zhou, Yu Fang, Luyu Yang, Le Xue, Ran Xu, Caiming Xiong, Silvio Savarese, Michael S. Ryoo, and Juan Carlos Niebles. Future optical flow prediction improves robot control & video generation.CVPR Findings, 2026
work page 2026
-
[11]
Siddhant Haldar and Lerrel Pinto. Point policy: Unifying observations and actions with key points for robot manipulation.arXiv preprint arXiv:2502.20391, 2025
-
[12]
Rt-trajectory: Robotic task generalization via hindsight trajectory sketches,
Jiayuan Gu, Sean Kirmani, Paul Wohlhart, Yao Lu, Montserrat Gonzalez Arenas, Kanishka Rao, Wenhao Yu, Chuyuan Fu, Keerthana Gopalakrishnan, Zhuo Xu, Priya Sundaresan, Peng Xu, Hao Su, Karol Hausman, Chelsea Finn, Quan Vuong, and Ted Xiao. Rt-trajectory: Robotic task generalization via hindsight trajectory sketches,
- [13]
-
[14]
Any-point Trajectory Modeling for Policy Learning
Chuan Wen, Xingyu Lin, John So, Kai Chen, Qi Dou, Yang Gao, and Pieter Abbeel. Any-point trajectory modeling for policy learning.arXiv preprint arXiv:2401.00025, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[15]
Phantom: Training robots without robots using only human videos, 2025
Marion Lepert, Jiaying Fang, and Jeannette Bohg. Phantom: Training robots without robots using only human videos, 2025. URLhttps://arxiv.org/abs/2503.00779
-
[16]
OpenVLA: An Open-Source Vision-Language-Action Model
Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, et al. Openvla: An open-source vision-language-action model.arXiv preprint arXiv:2406.09246, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[17]
Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, Szymon Jakubczak, Tim Jones, Liyiming Ke, Sergey Levine, Adrian Li-Bell, Mohith Mothukuri, Suraj Nair, Karl Pertsch, Lucy Xiaoyang Shi, James Tanner, Quan Vuong, Anna Walling, Haohuan Wang, and Ury Zhilinsky.π 0: A visi...
-
[18]
$\pi_0$: A Vision-Language-Action Flow Model for General Robot Control
URLhttps://arxiv.org/abs/2410.24164. 12 LACE : Latent Visual Representation for Cross-Embodiment LearningA PREPRINT
work page internal anchor Pith review Pith/arXiv arXiv
-
[19]
RT-1: Robotics Transformer for Real-World Control at Scale
Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Joseph Dabis, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, Jasmine Hsu, et al. Rt-1: Robotics transformer for real-world control at scale.arXiv preprint arXiv:2212.06817, 2022
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[20]
Rt-2: Vision-language-action models transfer web knowledge to robotic control
Brianna Zitkovich, Tianhe Yu, Sichun Xu, Peng Xu, Ted Xiao, Fei Xia, Jialin Wu, Paul Wohlhart, Stefan Welker, Ayzaan Wahid, et al. Rt-2: Vision-language-action models transfer web knowledge to robotic control. In Conference on Robot Learning, pages 2165–2183. PMLR, 2023
work page 2023
-
[21]
Octo: An Open-Source Generalist Robot Policy
Octo Model Team, Dibya Ghosh, Homer Walke, Karl Pertsch, Kevin Black, Oier Mees, Sudeep Dasari, Joey Hejna, Tobias Kreiman, Charles Xu, et al. Octo: An open-source generalist robot policy.arXiv preprint arXiv:2405.12213, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[22]
Universal Manipulation Interface: In-The-Wild Robot Teaching Without In-The-Wild Robots
Cheng Chi, Zhenjia Xu, Chuer Pan, Eric Cousineau, Benjamin Burchfiel, Siyuan Feng, Russ Tedrake, and Shuran Song. Universal manipulation interface: In-the-wild robot teaching without in-the-wild robots.arXiv preprint arXiv:2402.10329, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[23]
Omar Rayyan, John Abanes, Mahmoud Hafez, Anthony Tzes, and Fares J. Abu-Dakka. Mv-umi: A scal- able multi-view interface for cross-embodiment learning.ArXiv, abs/2509.18757, 2025. URL https: //api.semanticscholar.org/CorpusID:281496577
-
[24]
Xiongyi Cai, Ri-Zhao Qiu, Geng Chen, Lai Wei, I-Chun Arthur Liu, Tianshu Huang, Xuxin Cheng, and Xiaolong Wang. In-n-on: Scaling egocentric manipulation with in-the-wild and on-task data.ArXiv, abs/2511.15704, 2025. URLhttps://api.semanticscholar.org/CorpusID:283103363
-
[25]
Hommi: Learning whole-body mobile manipulation from human demonstrations, 2026
Xiaomeng Xu, Jisang Park, Han Zhang, Eric Cousineau, Aditya Bhat, Jose Barreiros, Dian Wang, and Shuran Song. Hommi: Learning whole-body mobile manipulation from human demonstrations, 2026
work page 2026
-
[26]
Simar Kareer, Dhruv Patel, Ryan Punamiya, Pranay Mathur, Shuo Cheng, Chen Wang, Judy Hoffman, and Danfei Xu. Egomimic: Scaling imitation learning via egocentric video.2025 IEEE International Conference on Robotics and Automation (ICRA), pages 13226–13233, 2024. URL https://api.semanticscholar.org/CorpusID: 273707799
work page 2025
-
[27]
EgoDex: Learning Dexterous Manipulation from Large-Scale Egocentric Video
Ryan Hoque, Peide Huang, David J. Yoon, Mouli Sivapurapu, and Jian Zhang. Egodex: Learning dexterous manip- ulation from large-scale egocentric video.ArXiv, abs/2505.11709, 2025. URL https://api.semanticscholar. org/CorpusID:278739529
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[28]
Egomimic: Scaling imitation learning via egocentric video
Simar Kareer, Dhruv Patel, Ryan Punamiya, Pranay Mathur, Shuo Cheng, Chen Wang, Judy Hoffman, and Danfei Xu. Egomimic: Scaling imitation learning via egocentric video. In2025 IEEE International Conference on Robotics and Automation (ICRA), pages 13226–13233. IEEE, 2025
work page 2025
-
[29]
Robopaint: From human demonstration to any robot and any view, 2026
Jiacheng Fan, Zhiyu Zhao, Yiqian Zhang, Chao Chen, Peide Wang, Hengdi Zhang, and Zhengxue Cheng. Robopaint: From human demonstration to any robot and any view, 2026
work page 2026
- [30]
-
[31]
Video prediction policy: A generalist robot policy with predictive visual representations.ICML, 2025
Yucheng Hu, Yanjiang Guo, Pengchao Wang, Xiaoyu Chen, Yen-Jen Wang, Jianke Zhang, Koushil Sreenath, Chaochao Lu, and Jianyu Chen. Video prediction policy: A generalist robot policy with predictive visual representations.ICML, 2025
work page 2025
-
[32]
Raktim Gautam Goswami, Amir Bar, David Fan, Tsung-Yen Yang, Gaoyue Zhou, Prashanth Krishnamurthy, Michael Rabbat, Farshad Khorrami, and Yann LeCun. World models can leverage human videos for dexter- ous manipulation.ArXiv, abs/2512.13644, 2025. URL https://api.semanticscholar.org/CorpusID: 283896258
-
[33]
Simar Kareer, Karl Pertsch, James Darpinian, Judy Hoffman, Danfei Xu, Sergey Levine, Chelsea Finn, and Suraj Nair. Emergence of human to robot transfer in vision-language-action models, 2025. URL https: //arxiv.org/abs/2512.22414
-
[34]
Egobridge: Domain adaptation for generalizable imitation from egocentric human data
Ryan Punamiya, Dhruv Patel, Patcharapong Aphiwetsa, Pranav Kuppili, Lawrence Y Zhu, Simar Kareer, Judy Hoffman, and Danfei Xu. Egobridge: Domain adaptation for generalizable imitation from egocentric human data. InHuman to Robot: Workshop on Sensorizing, Modeling, and Learning from Humans, 2025
work page 2025
-
[35]
Hao Luo, Ye Wang, Wanpeng Zhang, Sipeng Zheng, Ziheng Xi, Chaoyi Xu, Haiweng Xu, Haoqi Yuan, Chi Zhang, Yiqin Wang, Yicheng Feng, and Zongqing Lu. Being-h0.5: Scaling human-centric robot learning for cross-embodiment generalization.ArXiv, abs/2601.12993, 2026. URL https://api.semanticscholar.org/ CorpusID:284909770. 13 LACE : Latent Visual Representation ...
-
[36]
Hanjung Kim, Jaehyun Kang, Hyolim Kang, Meedeum Cho, Seon Joo Kim, and Youngwoon Lee. Uniskill: Imitating human videos via cross-embodiment skill representations.ArXiv, abs/2505.08787, 2025. URL https: //api.semanticscholar.org/CorpusID:278535353
-
[37]
Mengda Xu, Zhenjia Xu, Cheng Chi, Manuela M. Veloso, and Shuran Song. Xskill: Cross embodiment skill discovery. InConference on Robot Learning, 2023. URL https://api.semanticscholar.org/CorpusID: 259982636
work page 2023
-
[38]
DINOv2: Learning Robust Visual Features without Supervision
Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy V o, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision.arXiv preprint arXiv:2304.07193, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[39]
Oriane Siméoni, Huy V V o, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Michaël Ramamonjisoa, et al. Dinov3.arXiv preprint arXiv:2508.10104, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[40]
Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muhammad Ferjad Naeem, Ibrahim Alabdulmohsin, Nikhil Parthasarathy, Talfan Evans, Lucas Beyer, Ye Xia, Basil Mustafa, et al. Siglip 2: Multilingual vision-language en- coders with improved semantic understanding, localization, and dense features.arXiv preprint arXiv:2502.14786, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[41]
V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning
Mido Assran, Adrien Bardes, David Fan, Quentin Garrido, Russell Howes, Matthew Muckley, Ammar Rizvi, Claire Roberts, Koustuv Sinha, Artem Zholus, et al. V-jepa 2: Self-supervised video models enable understanding, prediction and planning.arXiv preprint arXiv:2506.09985, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[42]
Jinghuan Shang, Karl Schmeckpeper, Brandon B May, Maria Vittoria Minniti, Tarik Kelestemur, David Watkins, and Laura Herlant. Theia: Distilling diverse vision foundation models for robot learning.arXiv preprint arXiv:2407.20179, 2024
-
[43]
Dinobot: Robot manipulation via retrieval and alignment with vision foundation models
Norman Di Palo and Edward Johns. Dinobot: Robot manipulation via retrieval and alignment with vision foundation models. In2024 IEEE International Conference on Robotics and Automation (ICRA), pages 2798–
-
[44]
Sicheng Xie, Haidong Cao, Zejia Weng, Zhen Xing, Haoran Chen, Shiwei Shen, Jiaqi Leng, Zuxuan Wu, and Yu-Gang Jiang. Human2robot: Learning robot actions from paired human-robot videos.arXiv preprint arXiv:2502.16587, 2025
-
[45]
Cross-embodiment dexterous grasping with reinforcement learning.ArXiv, abs/2410.02479, 2024
Haoqi Yuan, Bohan Zhou, Yuhui Fu, and Zongqing Lu. Cross-embodiment dexterous grasping with reinforcement learning.ArXiv, abs/2410.02479, 2024. URLhttps://api.semanticscholar.org/CorpusID:273098035
-
[46]
Zhiyuan Wu, Xiangyue Zhang, Zhuo Chen, Jiankang Deng, Rolandos Alexandros Potamias, and Shan Luo. Unimorphgrasp: Diffusion model with morphology-awareness for cross-embodiment dexterous grasp generation
-
[47]
URLhttps://api.semanticscholar.org/CorpusID:285270494
-
[48]
Zhiyuan Wu, Rolandos Alexandros Potamias, Xuyang Zhang, Zhongqun Zhang, Jiankang Deng, and Shan Luo. Cedex: Cross-embodiment dexterous grasp generation at scale from human-like contact representations.ArXiv, abs/2509.24661, 2025. URLhttps://api.semanticscholar.org/CorpusID:281674748
-
[49]
Heng Zhang, Kevin Yuchen Ma, Mike Zheng Shou, Weisi Lin, and Yanjun Wu. Morphartgrasp: Morphology- aware cross-embodiment dexterous hand articulation generation for grasping. 2025. URL https://api. semanticscholar.org/CorpusID:281886756
work page 2025
-
[50]
Zihao He, Bo Ai, Tongzhou Mu, Yulin Liu, Weikang Wan, Jiawei Fu, Yilun Du, Henrik I. Christensen, and Hao Su. Scaling cross-embodiment world models for dexterous manipulation.ArXiv, abs/2511.01177, 2025. URL https://api.semanticscholar.org/CorpusID:282275179
-
[51]
House of Dextra: Cross-embodied Co-design for Dexterous Hands
Kehlani Fay, Darin Anthony Djapri, Anya Zorin, James Clinton, Ali El Lahib, Hao Su, Michael Thomas Tolley, Sha Yi, and Xiaolong Wang. Cross-embodied co-design for dexterous hands.ArXiv, abs/2512.03743, 2025. URL https://api.semanticscholar.org/CorpusID:283466942
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[52]
AnyDexGrasp: General dexterous grasping for different hands with human-level learning efficiency,
Haoshu Fang, Hengxu Yan, Zhenyu Tang, Hongjie Fang, Chenxi Wang, and Cewu Lu. Anydexgrasp: General dexterous grasping for different hands with human-level learning efficiency.ArXiv, abs/2502.16420, 2025. URL https://api.semanticscholar.org/CorpusID:276575198
-
[53]
Robustdexgrasp: Robust dexterous grasping of general objects
Hui Zhang, Zijian Wu, Linyi Huang, Sammy Joe Christen, and Jie Song. Robustdexgrasp: Robust dexterous grasping of general objects. 2025. URLhttps://api.semanticscholar.org/CorpusID:277626910
work page 2025
-
[54]
Tyler Ga, Wei Lum, Olivia Y . Lee, C. Karen Liu, Jeannette Bohg, and Pre-Manip Hand Pose. Crossing the human-robot embodiment gap with sim-to-real rl using one human demonstration.ArXiv, abs/2504.12609, 2025. URLhttps://api.semanticscholar.org/CorpusID:277857482. 14 LACE : Latent Visual Representation for Cross-Embodiment LearningA PREPRINT
-
[55]
H-rdt: Human manipulation enhanced bimanual robotic manipulation
Hongzhe Bi, Lingxuan Wu, Tianwei Lin, Hengkai Tan, Zhizhong Su, Hang Su, and Jun Zhu. H-rdt: Human manipulation enhanced bimanual robotic manipulation.ArXiv, abs/2507.23523, 2025. URL https://api. semanticscholar.org/CorpusID:280400964
-
[56]
R3M: A Universal Visual Representation for Robot Manipulation
Suraj Nair, Aravind Rajeswaran, Vikash Kumar, Chelsea Finn, and Abhinav Gupta. R3m: A universal visual representation for robot manipulation.arXiv preprint arXiv:2203.12601, 2022
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[57]
Real-world robot learning with masked visual pre-training
Ilija Radosavovic, Tete Xiao, Stephen James, Pieter Abbeel, Jitendra Malik, and Trevor Darrell. Real-world robot learning with masked visual pre-training. InConference on Robot Learning, pages 416–426. PMLR, 2023
work page 2023
-
[58]
Arjun Majumdar, Karmesh Yadav, Sergio Arnaud, Jason Ma, Claire Chen, Sneha Silwal, Aryan Jain, Vincent- Pierre Berges, Tingfan Wu, Jay Vakil, et al. Where are we in the search for an artificial visual cortex for embodied intelligence?Advances in Neural Information Processing Systems, 36:655–677, 2023
work page 2023
-
[59]
4d visual pre-training for robot learning
Chengkai Hou, Yanjie Ze, Yankai Fu, Zeyu Gao, Songbo Hu, Yue Yu, Shanghang Zhang, and Huazhe Xu. 4d visual pre-training for robot learning. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 8451–8461, 2025
work page 2025
-
[60]
Ahmad Darkhalil, Dandan Shan, Bin Zhu, Jian Ma, Amlan Kar, Richard Higgins, Sanja Fidler, David Fouhey, and Dima Damen. Epic-kitchens visor benchmark: Video segmentations and object relations.Advances in Neural Information Processing Systems, 35:13745–13758, 2022
work page 2022
-
[61]
Learning to Estimate 3D Hand Pose from Single RGB Images
Christian Zimmermann and Thomas Brox. Learning to estimate 3d hand pose from single rgb images. Technical report, arXiv:1705.01389, 2017. URL https://lmb.informatik.uni-freiburg.de/projects/hand3d/. https://arxiv.org/abs/1705.01389
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[62]
Deeppose: Human pose estimation via deep neural networks
Alexander Toshev and Christian Szegedy. Deeppose: Human pose estimation via deep neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1653–1660, 2014
work page 2014
-
[63]
LeJEPA: Provable and Scalable Self-Supervised Learning Without the Heuristics
Randall Balestriero and Yann LeCun. Lejepa: Provable and scalable self-supervised learning without the heuristics. arXiv preprint arXiv:2511.08544, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[64]
Reconstructing hands in 3D with transformers
Georgios Pavlakos, Dandan Shan, Ilija Radosavovic, Angjoo Kanazawa, David Fouhey, and Jitendra Malik. Reconstructing hands in 3D with transformers. InCVPR, 2024
work page 2024
-
[65]
Yuxin Wu, Alexander Kirillov, Francisco Massa, Wan-Yen Lo, and Ross Girshick. Detectron2.https://github. com/facebookresearch/detectron2, 2019
work page 2019
-
[66]
Wilor: End-to-end 3d hand localization and reconstruction in-the-wild
Rolandos Alexandros Potamias, Jinglei Zhang, Jiankang Deng, and Stefanos Zafeiriou. Wilor: End-to-end 3d hand localization and reconstruction in-the-wild. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 12242–12254, 2025
work page 2025
-
[67]
Kenneth Shaw, Ananye Agarwal, and Deepak Pathak. Leap hand: Low-cost, efficient, and anthropomorphic hand for robot learning.Robotics: Science and Systems (RSS), 2023
work page 2023
-
[68]
Using apple vision pro to train and control robots, 2024
Younghyo Park and Pulkit Agrawal. Using apple vision pro to train and control robots, 2024. URL https: //github.com/Improbable-AI/VisionProTeleop
work page 2024
-
[69]
Simple open-vocabulary object detection
Matthias Minderer, Alexey Gritsenko, Austin Stone, Maxim Neumann, Dirk Weissenborn, Alexey Dosovitskiy, Aravindh Mahendran, Anurag Arnab, Mostafa Dehghani, Zhuoran Shen, et al. Simple open-vocabulary object detection. InEuropean conference on computer vision, pages 728–755. Springer, 2022
work page 2022
-
[70]
SAM 3: Segment Anything with Concepts
Nicolas Carion, Laura Gustafson, Yuan-Ting Hu, Shoubhik Debnath, Ronghang Hu, Didac Suris, Chaitanya Ryali, Kalyan Vasudev Alwala, Haitham Khedr, Andrew Huang, et al. Sam 3: Segment anything with concepts.arXiv preprint arXiv:2511.16719, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[71]
End-to-end object detection with transformers
Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. InEuropean conference on computer vision, pages 213–229. Springer, 2020
work page 2020
-
[72]
Self-supervised learning from images with a joint-embedding predictive architecture
Mahmoud Assran, Quentin Duval, Ishan Misra, Piotr Bojanowski, Pascal Vincent, Michael Rabbat, Yann LeCun, and Nicolas Ballas. Self-supervised learning from images with a joint-embedding predictive architecture. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 15619–15629, 2023
work page 2023
-
[73]
Masked autoencoders are scalable vision learners
Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 16000–16009, 2022
work page 2022
-
[74]
Stephen Batifol, Andreas Blattmann, Frederic Boesel, Saksham Consul, Cyril Diagne, Tim Dockhorn, Jack English, Zion English, Patrick Esser, Sumith Kulal, et al. Flux. 1 kontext: Flow matching for in-context image generation and editing in latent space.arXiv e-prints, pages arXiv–2506, 2025. 15 LACE : Latent Visual Representation for Cross-Embodiment Learn...
work page 2025
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.