pith. sign in

arxiv: 2606.17054 · v1 · pith:VJMYNY25new · submitted 2026-06-15 · 💻 cs.RO · cs.AI· cs.CV· cs.LG

Human Universal Grasping

Pith reviewed 2026-06-27 03:49 UTC · model grok-4.3

classification 💻 cs.RO cs.AIcs.CVcs.LG
keywords human graspingflow matchingegocentric visionrobot manipulationRGB-D inputMANO modelzero-shot transfergrasp generation
0
0 comments X

The pith

A flow-matching model generates natural human grasps from single RGB-D images after training on a million-frame egocentric dataset.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that the best way to achieve general robot grasping is to learn directly from how humans grasp objects in daily life. They collect a large-scale dataset of 1M frames using smart glasses to capture natural grasps across many objects and scenes. A flow-matching model is then trained to predict grasp parameters from RGB-D input. This model can be retargeted to robot hands and shows improved performance over existing methods on a new benchmark of challenging objects.

Core claim

HUG is a flow-matching model that fuses RGB and depth from a stereo camera to generate grasps parameterized by wrist translation, wrist rotation, and MANO hand pose. Trained on the 1M-HUGs dataset of 1M frames and 6707 object instances, it produces diverse grasps that transfer to various robot embodiments for zero-shot grasping in real environments.

What carries the argument

flow-matching model that fuses RGB and depth observations to output a grasp parameterized by wrist translation, wrist rotation, and MANO hand pose

If this is right

  • HUG enables zero-shot grasping in everyday scenes across multiple robot hands and household environments.
  • The approach outperforms state-of-the-art grasping baselines by 23% and 34% on a challenging object set.
  • A new simulated benchmark HUG-Bench with 90 unseen objects standardizes evaluation of grasp generation methods.
  • Predicted grasps from the model can be retargeted to various robot hands.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This suggests that large-scale human egocentric data can provide a more general distribution of grasps than synthetic or limited datasets.
  • Future work could extend the model to dynamic scenes or multi-object interactions by collecting additional egocentric data.
  • Retargeting success implies that human grasp distributions are transferable across different hand morphologies with appropriate mapping.
  • The use of flow-matching indicates that modeling grasp distributions as continuous flows may capture natural variability better than discrete sampling methods.

Load-bearing premise

The 1M-frame egocentric dataset collected via smart glasses is representative of the full distribution of natural human grasps across object geometries, sizes, and everyday scenes without significant selection or recording bias.

What would settle it

Evaluating the model on a set of objects and scenes deliberately chosen to be outside the distribution of the 1M-HUGs dataset, such as rare geometries or unusual environments, and checking if the performance advantage over baselines disappears.

Figures

Figures reproduced from arXiv: 2606.17054 by Billy Yan, Dandan Shan, David Fouhey, Irmak Guzey, Isaac Tu, Kevin Yuanbo Wu, Lerrel Pinto, Tianxing Zhou.

Figure 1
Figure 1. Figure 1: HUG learns dexterous grasping without any robot data. Trained solely on egocentric human grasp data, HUG generates diverse human grasps for real-world objects in a single RGBD image captured from a stereo camera, which can be retargeted to robot hands for zero-shot, in-the￾wild dexterous grasping. † Correspondence to: k.wu@nyu.edu. ‡ Equal advising. arXiv:2606.17054v1 [cs.RO] 15 Jun 2026 [PITH_FULL_IMAGE:… view at source ↗
Figure 2
Figure 2. Figure 2: 1M-HUGS dataset. Our training data comprises 1M egocentric frames of human grasps, spanning 6,707 object instances. Each entry provides synchronized RGB and grayscale views, met￾ric depth, an object mask, and a MANO hand pose with wrist transformation in the camera frame. Robot learning from non-robot datasets. Given the difficulty of collecting robot-specific data, recent work learns robot behaviors from … view at source ↗
Figure 3
Figure 3. Figure 3: HUG architecture. Conditioned on an RGB-D image and a query point on the target object, HUG predicts MANO hand grasps via a flow-matching transformer over fused RGB and point cloud features. Predicted human grasps are then retargeted to robot hands. Dataset statistics and labels. The dataset contains 6,707 recordings across 41 buildings with an estimated ∼1.5K unique objects. Within each building we grasp … view at source ↗
Figure 4
Figure 4. Figure 4: Predicted grasps on HUG-BENCH. HUG’s predicted grasps for 30 unseen objects across six scenes of the HUG-BENCH test split. HUG generalizes across a variety of object shapes and sizes, environments, and camera viewpoints. Top row: small 2, medium 1, large 1. Bottom row: large 2, medium 2, small 1. See Appendix [PITH_FULL_IMAGE:figures/full_fig_p005_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Real world grasping with HUG. Grasp executions on unseen objects from HUG-BENCH test split in an unseen home, performed by a YOR mobile manipulator equipped with WUJI hands. 5 Experiments We introduce HUG-BENCH (§ 5.1), then evaluate HUG in simulation (§ 5.2) and real-world (§ 5.3). 5.1 HUG-BENCH [PITH_FULL_IMAGE:figures/full_fig_p006_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: HUG-BENCH test split. 30 unseen objects (top) from 5 ge￾ometric categories × 3 size bins (2 per class), with their simulation assets (bottom). Evaluation dataset. We propose a difficult-to-grasp set of objects for evaluation. HUG-BENCH comprises 90 objects spanning five geometric categories (cylindrical, spheroidal, prismatic, appendaged, amorphous) and three size bins (small, medium, large), with six obje… view at source ↗
Figure 7
Figure 7. Figure 7: Dataset scaling. Impact of dataset size on HUG-BENCH SR and FC error (Eq. 2); training sets are nested proper subsets. predicted MANO joint angles, with a small extra flexion to apply force. Then, the wrist lifts straight up by 0.5 m. A grasp succeeds if the object is no longer in contact with the surface after the lift. 5.2 Simulation Grasping Evaluation Metrics. Success rate (SR, %) is the fraction of tr… view at source ↗
Figure 8
Figure 8. Figure 8: Real-to-sim grasping. Eval￾uating HUG on HUG-BENCH in sim￾ulation using real captured inputs. The human grasp oracle replays the 10 recorded human grasps on each object, estimating an upper bound on what is achievable in our simulator. It falls short of 100% due to hand-tracking error in our Aria Gen 2 data [72], slight inaccuracies in object assets, and open-loop execution fail￾ures (Appendix E.2). Our fu… view at source ↗
Figure 9
Figure 9. Figure 9: Single-modality failures. Cases where RGB-only or PC-only prediction fails but RGB+PC succeeds. Objects, left to right: pineapple, hair brush, anchovies, spoon, softball. RGB and depth are complementary. On the modality axis, PC-only remains a strong standalone baseline at 64.2% val SR and 70.7% test SR, while RGB-only collapses to 26.8% val SR and 29.7% test SR. The contrast is sharper in FC error: RGB-on… view at source ↗
Figure 10
Figure 10. Figure 10: Hand sizes. The fixed-shape MANO hand alongside its simulation mesh and the Abil￾ity and WUJI robot hands. WUJI is a similar size to HUG’s fixed hand size, but Ability is much smaller. Baselines. We compare HUG against two recent learning-based grasping methods. Dex1B [3] is a generative multi-fingered grasp￾ing model trained on 1B simulated demonstra￾tions by combining grasp optimization with generative … view at source ↗
Figure 11
Figure 11. Figure 11: Failure mode breakdown. Grasp-outcome flow for the 300 HUG-BENCH test trials in each real-world setting, tracing every attempt through the pre-grasp, grasp, and lift stages into success or a specific failure mode. Failure modes [PITH_FULL_IMAGE:figures/full_fig_p010_11.png] view at source ↗
Figure 12
Figure 12. Figure 12: aria2mano MANO fitting. aria2mano fits a full articulated MANO hand (bottom, blue) to the sparse 21-landmark Aria skeleton (top, red) via a per-frame anatomically￾constrained optimization, recovering pose, mesh, and dense joint angles from each grasp recording. We elaborate on the MANO [19] fitting pro￾cedure summarized in § 3. Aria Gen 2 re￾ports the wearer’s hand only as a sparse 21- landmark skeleton i… view at source ↗
Figure 13
Figure 13. Figure 13: Data collection. A wearer captures a grasp with Aria Gen 2 glasses. stationary object: the wearer moves their head around the object for 15–30 seconds with both hands behind their back, then grasps the object without lifting it. From this footage we identify the grasped object, segment it across all frames, select the grasp frame, manually verify the result, and prepare the final training entries. The ful… view at source ↗
Figure 14
Figure 14. Figure 14: Grasp annotation app. The web interface used to verify and correct the automatic labels. The left panel refines the object mask with point prompts and SAM3 re-segmentation; the right panel steps through frames to select and verify the grasp frame. Each recording is marked checked once its mask and grasp pass review. B.3 Dataset Preparation [PITH_FULL_IMAGE:figures/full_fig_p020_14.png] view at source ↗
Figure 15
Figure 15. Figure 15: RGB crop. Stereo-left depth repro￾jected into the RGB frame: valid reprojected depth (black), the crop (green), and parallax holes filled by nearest neighbor (red). We export training entries from the checked recordings under a sequence of filters. Shared filters keep frames whose index lies in [20, grasp−10) (skipping the SLAM warm-up and the 10 frames before the grasp), drop the annotation frame, requir… view at source ↗
Figure 16
Figure 16. Figure 16: Point cloud crop. The 0.3 m radius crop around the 3D query point. which lifts the 2D click to the 3D query point pq, and the inverse projection [ u ′ , v′ , 1 ]⊤ ∝ K ci that maps each cen￾troid ci onto the image plane for point painting. No layer takes K as a learned input. The model therefore transfers across stereo cameras with different intrinsics without retrain￾ing, which is why training on Aria rec… view at source ↗
Figure 17
Figure 17. Figure 17: Training curves. Success rate and fingertip contact error on the HUG-BENCH val split for model ablations (top) and data scaling experiments (bottom) [PITH_FULL_IMAGE:figures/full_fig_p023_17.png] view at source ↗
Figure 18
Figure 18. Figure 18: aria2mesh asset reconstruction. Five egocentric Aria views (camera frustums) are fused with Multi-view SAM3D, then pose-optimized and gravity-aligned, and finally manually edited against the semidense point cloud and dense stereo depth points to recover per-object metric￾scale meshes and poses (axes). object is turned into a metric-scale, simulation-ready asset in minutes, making it practical to grow the … view at source ↗
Figure 19
Figure 19. Figure 19: GT grasp offset. HUG￾BENCH SR (%) vs. ground-truth grasp displacement along all three wrist axes. How precise must grasps be? To quantify how much spatial precision a successful grasp demands, we displace each ground-truth grasp simultaneously along all three wrist axes before execution, t ′ = t + δ (ex + ey + ez), δ ∈ {1, 2, . . .} cm, (6) and measure the resulting SR over all 90 HUG-BENCH objects (val a… view at source ↗
Figure 20
Figure 20. Figure 20: Real-robot tabletop setup. Abil￾ity hand + ZED cam￾era + xArm. We select the checkpoint with the best HUG-BENCH val SR in simu￾lation, and without ever testing the model in the real world, run all 300 HUG-BENCH test trials consecutively, each a single grasp prediction followed by one open-loop execution, with no cuts and no retries. We do not tune our model nor the open-loop execution strategy, discussed … view at source ↗
read the original abstract

Humans can grasp objects effortlessly, whereas multi-fingered robots are far from this level of generality. We argue that the most natural source of robot grasping data is from humans, who pick up thousands of objects every day. We present HUG, a flow-matching model that generates diverse human grasps for any user-specified object in a single RGB-D image captured from a stereo camera. Using smart glasses, we first collect 1M-HUGs, an egocentric dataset of human grasps spanning 1M frames (27.8 hrs) and 6,707 object instances across 41 buildings. Next, to model the distribution of natural human grasps, our novel flow-matching model fuses RGB and depth observations to output a grasp parameterized by wrist translation, wrist rotation, and MANO hand pose. Predicted grasps can be retargeted to various robot hands, enabling zero-shot grasping in everyday scenes. To standardize evaluation, we build a new simulated benchmark, HUG-Bench, of 90 unseen objects from five geometric categories and various sizes, with metric-scale 3D meshes. We evaluate HUG in the real world on the 30-object test set of HUG-Bench across multiple stereo cameras, robot embodiments, and household environments. HUG outperforms the state-of-the-art grasping baselines by +23% and +34% on our challenging object set. Code, data, benchmark, checkpoints, and an interactive demo are released on our website: https://grasping.io/

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper presents HUG, a flow-matching model trained on the new 1M-HUGs egocentric dataset (1M frames, 6,707 instances) collected via smart glasses to generate diverse human grasps (wrist pose + MANO parameters) from a single RGB-D image. Grasps are retargeted to robot hands for zero-shot use. A new simulated benchmark HUG-Bench (90 unseen objects, five geometric categories) and a 30-object real-world test set are introduced; HUG is reported to outperform SOTA baselines by +23% and +34%. Code, data, benchmark, and checkpoints are released.

Significance. If the generalization claims hold after addressing evaluation gaps, the work offers a scalable route to natural grasp distributions from everyday human data, with potential impact on multi-fingered robotic grasping generality. The public release of the 1M-HUGs dataset, HUG-Bench meshes, model checkpoints, and interactive demo is a clear strength for reproducibility and follow-on research.

major comments (2)
  1. [Abstract / Experiments] Abstract and Experiments section: The +23% and +34% outperformance claims on HUG-Bench and the real-world 30-object set supply no information on the exact success metric (e.g., grasp success rate definition, contact thresholds), baseline implementations, statistical significance testing, or controls for data leakage between the 6,707 training instances and the 90 test objects.
  2. [Dataset / Experiments] Dataset collection and evaluation sections: The central generalization claim rests on the assumption that the smart-glasses egocentric protocol yields an unbiased sample of natural grasps; however, no quantitative validation (grasp-type histograms, object-size coverage statistics, or direct comparison against third-party human grasp datasets) is reported to rule out systematic biases in wrist orientation, visibility, or object selection.
minor comments (1)
  1. [Abstract] The abstract refers to 'our challenging object set' without a forward reference to the precise composition of the 90-object HUG-Bench or the 30-object real-world subset.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments. We address each major comment below and will revise the manuscript accordingly to improve clarity and rigor.

read point-by-point responses
  1. Referee: [Abstract / Experiments] Abstract and Experiments section: The +23% and +34% outperformance claims on HUG-Bench and the real-world 30-object set supply no information on the exact success metric (e.g., grasp success rate definition, contact thresholds), baseline implementations, statistical significance testing, or controls for data leakage between the 6,707 training instances and the 90 test objects.

    Authors: We agree these details are essential. In the revised manuscript we will explicitly define the grasp success metric (percentage of grasps maintaining stable contact without slippage under gravity for 5 seconds, with 0.01 m penetration threshold) in the Experiments section. Baseline implementations and any adaptations will be detailed with hyperparameters. Statistical significance will be reported via paired tests with p-values. We will add an explicit statement confirming the 90 test objects are disjoint from the 6,707 training instances (collected in separate sessions and buildings) along with supporting evidence. revision: yes

  2. Referee: [Dataset / Experiments] Dataset collection and evaluation sections: The central generalization claim rests on the assumption that the smart-glasses egocentric protocol yields an unbiased sample of natural grasps; however, no quantitative validation (grasp-type histograms, object-size coverage statistics, or direct comparison against third-party human grasp datasets) is reported to rule out systematic biases in wrist orientation, visibility, or object selection.

    Authors: We acknowledge the value of such validation. The revised version will include grasp-type histograms derived from MANO parameters and object-size coverage statistics for the 1M-HUGs dataset. Direct quantitative comparisons against third-party datasets require aligning incompatible capture protocols and taxonomies and are not feasible without substantial new effort; we will discuss this limitation and describe how the multi-building, everyday-environment protocol was designed to reduce bias. revision: partial

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper collects an external egocentric dataset (1M-HUGs) via smart glasses, trains a flow-matching model on it to predict grasp parameters from RGB-D input, and evaluates generalization on held-out objects in a newly constructed benchmark (HUG-Bench) plus real-world tests. No equations, parameters, or claims reduce the reported performance gains to quantities defined by the same fitted values or self-citations; the central results rest on standard supervised learning from independent data with external baselines. Dataset representativeness is an empirical assumption, not a definitional or self-referential reduction.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The central claim depends on the representativeness of the collected human grasp data and on standard assumptions of neural network training and retargeting; no new physical entities are postulated.

free parameters (1)
  • flow-matching model parameters
    Weights of the neural network are fitted to the 1M-frame dataset to approximate the conditional distribution of grasps.
axioms (1)
  • domain assumption Egocentric RGB-D frames from smart glasses capture the natural distribution of human grasps without major bias from recording setup or participant behavior.
    The model is trained directly on this data to learn the grasp distribution.

pith-pipeline@v0.9.1-grok · 5822 in / 1340 out tokens · 47032 ms · 2026-06-27T03:49:49.027617+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

82 extracted references · 6 canonical work pages

  1. [1]

    R. Wang, J. Zhang, J. Chen, Y . Xu, P. Li, T. Liu, and H. Wang. Dexgraspnet: A large-scale robotic dexterous grasp dataset for general objects based on simulation, 2023. URLhttps: //arxiv.org/abs/2210.02697

  2. [2]

    H.-S. Fang, H. Yan, Z. Tang, H. Fang, C. Wang, and C. Lu. Anydexgrasp: General dexter- ous grasping for different hands with human-level learning efficiency, 2025. URLhttps: //arxiv.org/abs/2502.16420

  3. [3]

    J. Ye, K. Wang, C. Yuan, R. Yang, Y . Li, J. Zhu, Y . Qin, X. Zou, and X. Wang. Dex1b: Learning with 1b demonstrations for dexterous manipulation, 2025. URLhttps://arxiv.org/abs/ 2506.17198

  4. [4]

    Y . Xu, W. Wan, J. Zhang, H. Liu, Z. Shan, H. Shen, R. Wang, H. Geng, Y . Weng, J. Chen, T. Liu, L. Yi, and H. Wang. Unidexgrasp: Universal robotic dexterous grasping via learning diverse proposal generation and goal-conditioned policy, 2023. URLhttps://arxiv.org/ abs/2303.00938

  5. [5]

    H. Chen, Y . Yao, Y . Ye, Z. Xu, H. Bharadhwaj, J. Wang, S. Tulsiani, Z. Erickson, and J. Ich- nowski. Web2grasp: Learning functional grasps from web images of hand-object interactions. arXiv preprint arXiv:2505.05517, 2025

  6. [6]

    Gupta, M

    H. Gupta, M. A. Mirzaee, and W. Yuan. Grasp to act: Dexterous grasping for tool use in dynamic settings.IEEE Robotics and Automation Letters, 11(5):6288–6295, 2026

  7. [7]

    A. Iyer, Z. Peng, Y . Dai, I. Guzey, S. Haldar, S. Chintala, and L. Pinto. Open teach: A versatile teleoperation system for robotic manipulation. InConference on Robot Learning (CoRL), 2024

  8. [8]

    S. P. Arunachalam, I. G ¨uzey, S. Chintala, and L. Pinto. Holo-dex: Teaching dexterity with im- mersive mixed reality. InIEEE International Conference on Robotics and Automation (ICRA), 2023

  9. [9]

    R. Ding, Y . Qin, J. Zhu, C. Jia, S. Yang, R. Yang, X. Qi, and X. Wang. Bunny-VisionPro: Real- time bimanual dexterous teleoperation for imitation learning, 2024. URLhttps://arxiv. org/abs/2407.03162

  10. [10]

    Y . Qin, W. Yang, B. Huang, K. Van Wyk, H. Su, X. Wang, Y .-W. Chao, and D. Fox. AnyTeleop: A general vision-based dexterous robot arm-hand teleoperation system. InRobotics: Science and Systems (RSS), 2023

  11. [11]

    Project Aria Gen 2.https://facebookresearch.github

    Meta Reality Labs Research. Project Aria Gen 2.https://facebookresearch.github. io/projectaria_tools/gen2/, 2026. Accessed: 2026-06-15

  12. [12]

    Zorin, I

    A. Zorin, I. Guzey, B. Yan, A. Iyer, L. Kondrich, N. X. Bhattasali, and L. Pinto. Ruka: Re- thinking the design of humanoid hands with learning, 2025. URLhttps://arxiv.org/abs/ 2504.13165. 11

  13. [13]

    Ability Hand.https://www.psyonic.io/ability-hand, 2026

    Psyonic. Ability Hand.https://www.psyonic.io/ability-hand, 2026. Accessed: 2026- 06-15

  14. [14]

    K. Shaw, A. Agarwal, and D. Pathak. LEAP Hand: Low-cost, efficient, and anthropomorphic hand for robot learning. InRobotics: Science and Systems (RSS), 2023

  15. [15]

    WUJI Hand.https://docs.wuji.tech/docs/en/wuji-hand/v1/,

    WUJI Technology. WUJI Hand.https://docs.wuji.tech/docs/en/wuji-hand/v1/,

  16. [16]

    Accessed: 2026-06-15

  17. [17]

    Mandi, Y

    Z. Mandi, Y . Hou, D. Fox, Y . Narang, A. Mandlekar, and S. Song. DexMachina: Functional retargeting for bimanual dexterous manipulation, 2025. URLhttps://arxiv.org/abs/ 2505.24853

  18. [18]

    K. Li, P. Li, T. Liu, Y . Li, and S. Huang. Maniptrans: Efficient dexterous bimanual manipula- tion transfer via residual learning. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6991–7003, 2025

  19. [19]

    He and W

    G. He and W. Zhang. Wujihand retargeting.https://github.com/wuji-technology/ wuji-retargeting, 2026. Accessed: 2026-06-15

  20. [20]

    ACM Trans

    J. Romero, D. Tzionas, and M. J. Black. Embodied hands: modeling and capturing hands and bodies together.ACM Transactions on Graphics, 36(6):1–17, Nov. 2017. ISSN 1557-7368. doi:10.1145/3130800.3130883. URLhttp://dx.doi.org/10.1145/3130800.3130883

  21. [21]

    Pinto and A

    L. Pinto and A. Gupta. Supersizing self-supervision: Learning to grasp from 50k tries and 700 robot hours, 2015. URLhttps://arxiv.org/abs/1509.06825

  22. [22]

    H.-S. Fang, C. Wang, M. Gou, and C. Lu. Graspnet-1billion: A large-scale benchmark for gen- eral object grasping. In2020 IEEE/CVF Conference on Computer Vision and Pattern Recog- nition (CVPR), pages 11441–11450, 2020. doi:10.1109/CVPR42600.2020.01146

  23. [23]

    H.-S. Fang, C. Wang, H. Fang, M. Gou, J. Liu, H. Yan, W. Liu, Y . Xie, and C. Lu. Anygrasp: Robust and efficient grasp perception in spatial and temporal domains, 2023. URLhttps: //arxiv.org/abs/2212.08333

  24. [24]

    Mousavian, C

    A. Mousavian, C. Eppner, and D. Fox. 6-dof graspnet: Variational grasp generation for object manipulation. InProceedings of the IEEE/CVF international conference on computer vision, pages 2901–2910, 2019

  25. [25]

    Sundermeyer, A

    M. Sundermeyer, A. Mousavian, R. Triebel, and D. Fox. Contact-graspnet: Efficient 6-dof grasp generation in cluttered scenes. In2021 IEEE international conference on robotics and automation (ICRA), pages 13438–13444. IEEE, 2021

  26. [26]

    Chavan-Dafle, S

    N. Chavan-Dafle, S. Popovych, S. Agrawal, D. D. Lee, and V . Isler. Simultaneous object reconstruction and grasp prediction using a camera-centric object shell representation, 2022. URLhttps://arxiv.org/abs/2109.06837

  27. [27]

    P. Liu, Y . Orru, J. Vakil, C. Paxton, N. Shafiullah, and L. Pinto. Demonstrating ok-robot: What really matters in integrating open-knowledge models for robotics. InRobotics: Science and Systems XX, RSS2024. Robotics: Science and Systems Foundation, July 2024. doi:10.15607/ rss.2024.xx.091. URLhttp://dx.doi.org/10.15607/RSS.2024.XX.091

  28. [28]

    Z. J. Cui, O. Rayyan, H. Etukuru, B. Tan, Z. Andrianarivo, Z. Teng, Y . Zhou, K. Mehta, N. Wojno, K. Y . Wu, M. H. Anjaria, Z. Wu, M. Mao, G. Zhang, B. Shah, Y . Kim, S. Chintala, L. Pinto, and N. M. M. Shafiullah. Contact-anchored policies: Contact conditioning creates strong robot utility models, 2026. URLhttps://arxiv.org/abs/2602.09017

  29. [29]

    T. G. W. Lum, M. Matak, V . Makoviychuk, A. Handa, A. Allshire, T. Hermans, N. D. Ratliff, and K. V . Wyk. DextrAH-g: Pixels-to-action dexterous arm-hand grasping with geometric fabrics. In8th Annual Conference on Robot Learning, 2024. URLhttps://openreview. net/forum?id=S2Jwb0i7HN. 12

  30. [30]

    Singh, A

    R. Singh, A. Allshire, A. Handa, N. Ratliff, and K. V . Wyk. Dextrah-rgb: Visuomotor policies to grasp anything with dexterous hands, 2025. URLhttps://arxiv.org/abs/2412.01791

  31. [31]

    Christen, M

    S. Christen, M. Kocabas, E. Aksan, J. Hwangbo, J. Song, and O. Hilliges. D-grasp: Phys- ically plausible dynamic grasp synthesis for hand-object interactions. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20577–20586, 2022

  32. [32]

    W. Wan, H. Geng, Y . Liu, Z. Shan, Y . Yang, L. Yi, and H. Wang. Unidexgrasp++: Improving dexterous grasping policy learning via geometry-aware curriculum and iterative generalist- specialist learning. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 3891–3902, 2023

  33. [33]

    Zhong, Q

    Y . Zhong, Q. Jiang, J. Yu, and Y . Ma. Dexgrasp anything: Towards universal robotic dexterous grasping with physics awareness, 2025. URLhttps://arxiv.org/abs/2503.08257

  34. [34]

    J. Lu, H. Kang, H. Li, B. Liu, Y . Yang, Q. Huang, and G. Hua.UGG: Unified Generative Grasping, page 414–433. Springer Nature Switzerland, Nov. 2024. ISBN 9783031728556. doi:10.1007/978-3-031-72855-6 24. URLhttp://dx.doi.org/10. 1007/978-3-031-72855-6_24

  35. [35]

    Etukuru, N

    H. Etukuru, N. Naka, Z. Hu, S. Lee, J. Mehu, A. Edsinger, C. Paxton, S. Chintala, L. Pinto, and N. M. M. Shafiullah. Robot utility models: General policies for zero-shot deployment in new environments. In2025 IEEE International Conference on Robotics and Automation (ICRA), pages 8275–8283. IEEE, 2025

  36. [36]

    C. Chi, Z. Xu, C. Pan, E. Cousineau, B. Burchfiel, S. Feng, R. Tedrake, and S. Song. Universal manipulation interface: In-the-wild robot teaching without in-the-wild robots.arXiv preprint arXiv:2402.10329, 2024

  37. [37]

    M. Xu, H. Zhang, Y . Hou, Z. Xu, L. Fan, M. Veloso, and S. Song. Dexumi: Using hu- man hand as the universal manipulation interface for dexterous manipulation.arXiv preprint arXiv:2505.21864, 2025

  38. [38]

    Guzey, H

    I. Guzey, H. Qi, J. Urain, C. Wang, J. Yin, K. Bodduluri, M. Lambeta, L. Pinto, A. Rai, J. Malik, T. Wu, A. Sharma, and H. Bharadhwaj. Dexterity from smart lenses: Multi-fingered robot manipulation with in-the-wild human demonstrations, 2025. URLhttps://arxiv. org/abs/2511.16661

  39. [39]

    Guzey, Y

    I. Guzey, Y . Dai, G. Savva, R. Bhirangi, and L. Pinto. Bridging the human to robot dexterity gap through object-oriented rewards. In2025 IEEE International Conference on Robotics and Automation (ICRA), pages 3344–3351. IEEE, 2025

  40. [40]

    Karaev, I

    N. Karaev, I. Rocco, B. Graham, N. Neverova, A. Vedaldi, and C. Rupprecht. Cotracker: It is better to track together. InEuropean Conference on Computer Vision (ECCV), 2024

  41. [41]

    Doersch, Y

    C. Doersch, Y . Yang, M. Vecerik, D. Gokay, A. Gupta, Y . Aytar, J. Carreira, and A. Zisserman. Tapir: Tracking any point with per-frame initialization and temporal refinement, 2023. URL https://arxiv.org/abs/2306.08637

  42. [42]

    Y . Ye, P. Hebbar, A. Gupta, and S. Tulsiani. Diffusion-guided reconstruction of everyday hand-object interaction clips. InProceedings of the IEEE/CVF international conference on computer vision, pages 19717–19728, 2023

  43. [43]

    Pavlakos, D

    G. Pavlakos, D. Shan, I. Radosavovic, A. Kanazawa, D. Fouhey, and J. Malik. Reconstructing hands in 3d with transformers. InIEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024. 13

  44. [44]

    Y . Ye, X. Li, A. Gupta, S. De Mello, S. Birchfield, J. Song, S. Tulsiani, and S. Liu. Affordance diffusion: Synthesizing hand-object interactions. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22479–22489, 2023

  45. [45]

    S. Park, H. Bharadhwaj, and S. Tulsiani. Demodiffusion: One-shot human imitation using pre-trained diffusion policy, 2025. URLhttps://arxiv.org/abs/2506.20668

  46. [46]

    Haldar and L

    S. Haldar and L. Pinto. Point policy: Unifying observations and actions with key points for robot manipulation.arXiv preprint arXiv:2502.20391, 2025

  47. [47]

    C. Wang, L. Fan, J. Sun, R. Zhang, L. Fei-Fei, D. Xu, Y . Zhu, and A. Anandkumar. Mimicplay: Long-horizon imitation learning by watching human play.arXiv preprint arXiv:2302.12422, 2023

  48. [48]

    Grauman, A

    K. Grauman, A. Westbury, E. Byrne, Z. Chavis, A. Furnari, R. Girdhar, J. Hamburger, H. Jiang, M. Liu, X. Liu, et al. Ego4d: Around the world in 3,000 hours of egocentric video. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022

  49. [49]

    Bharadhwaj, R

    H. Bharadhwaj, R. Mottaghi, A. Gupta, and S. Tulsiani. Track2act: Predicting point tracks from internet videos enables generalizable robot manipulation. InEuropean Conference on Computer Vision (ECCV), 2024

  50. [50]

    K. Shaw, S. Bahl, and D. Pathak. Videodex: Learning dexterity from internet videos. In Conference on Robot Learning (CoRL), 2023

  51. [51]

    J. Shi, Z. Zhao, T. Wang, I. Pedroza, A. Luo, J. Wang, J. Ma, and D. Jayaraman. Ze- romimic: Distilling robotic manipulation skills from web videos. InInternational Conference on Robotics and Automation (ICRA), 2025

  52. [52]

    M. K. Srirama, S. Dasari, S. Bahl, and A. Gupta. Hrp: Human affordances for robotic pre- training. InRobotics: Science and Systems (RSS), 2024

  53. [53]

    T. Tao, M. K. Srirama, J. J. Liu, K. Shaw, and D. Pathak. Dexwild: Dexterous human interac- tions for in-the-wild robot policies.arXiv preprint arXiv:2505.07813, 2025

  54. [54]

    S. Gao, W. Liang, K. Zheng, A. Malik, S. Ye, S. Yu, W.-C. Tseng, Y . Dong, K. Mo, C.-H. Lin, et al. Dreamdojo: A generalist robot world model from large-scale human videos.arXiv preprint arXiv:2602.06949, 2026

  55. [55]

    Engel, K

    J. Engel, K. Somasundaram, M. Goesele, A. Sun, A. Gamino, A. Turner, A. Talattof, A. Yuan, B. Souti, B. Meredith, et al. Project aria: A new tool for egocentric multi-modal ai research. arXiv preprint arXiv:2308.13561, 2023

  56. [56]

    V . Liu, A. Adeniji, H. Zhan, R. Bhirangi, P. Abbeel, and L. Pinto. Egozero: Robot learning from smart glasses, 2025. URLhttps://arxiv.org/abs/2505.20290

  57. [57]

    Kareer, D

    S. Kareer, D. Patel, R. Punamiya, P. Mathur, S. Cheng, C. Wang, J. Hoffman, and D. Xu. Egomimic: Scaling imitation learning via egocentric video, 2024. URLhttps://arxiv. org/abs/2410.24221

  58. [58]

    Y .-W. Chao, W. Yang, Y . Xiang, P. Molchanov, A. Handa, J. Tremblay, Y . S. Narang, K. Van Wyk, U. Iqbal, S. Birchfield, J. Kautz, and D. Fox. DexYCB: A benchmark for cap- turing hand grasping of objects. InIEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 9044–9053, 2021

  59. [59]

    Carion, L

    N. Carion, L. Gustafson, Y .-T. Hu, S. Debnath, R. Hu, D. Suris, C. Ryali, K. V . Alwala, H. Khedr, A. Huang, J. Lei, T. Ma, B. Guo, A. Kalla, M. Marks, J. Greer, M. Wang, P. Sun, R. R¨adle, T. Afouras, E. Mavroudi, K. Xu, T.-H. Wu, Y . Zhou, L. Momeni, R. Hazra, S. Ding, S. Vaze, F. Porcher, F. Li, S. Li, A. Kamath, H. K. Cheng, P. Doll ´ar, N. Ravi, K. ...

  60. [60]

    J. Min, Y . Jeon, J. Kim, and M. Choi. S2M2: Scalable stereo matching model for reliable depth estimation, 2025. URLhttps://arxiv.org/abs/2507.13229

  61. [61]

    L. Yang, X. Zhan, K. Li, W. Xu, J. Li, and C. Lu. CPF: Learning a contact potential field to model the hand-object interaction. InICCV, 2021

  62. [62]

    Y . Zhou, C. Barnes, J. Lu, J. Yang, and H. Li. On the continuity of rotation representations in neural networks. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5745–5753, 2019

  63. [63]

    Oquab, T

    M. Oquab, T. Darcet, T. Moutakanni, H. V o, M. Szafraniec, V . Khalidov, P. Fernandez, D. Haz- iza, F. Massa, A. El-Nouby, et al. Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023

  64. [64]

    Darcet, M

    T. Darcet, M. Oquab, J. Mairal, and P. Bojanowski. Vision transformers need registers. In International Conference on Learning Representations (ICLR), 2024

  65. [65]

    G. Qian, Y . Li, H. Peng, J. Mai, H. Hammoud, M. Elhoseiny, and B. Ghanem. PointNeXt: Revisiting PointNet++ with improved training and scaling strategies. InAdvances in Neural Information Processing Systems (NeurIPS), 2022

  66. [66]

    Tancik, P

    M. Tancik, P. P. Srinivasan, B. Mildenhall, S. Fridovich-Keil, N. Raghavan, U. Singhal, R. Ra- mamoorthi, J. T. Barron, and R. Ng. Fourier features let networks learn high frequency func- tions in low dimensional domains. InAdvances in Neural Information Processing Systems (NeurIPS), 2020

  67. [67]

    Peebles and S

    W. Peebles and S. Xie. Scalable diffusion models with transformers. InIEEE/CVF Interna- tional Conference on Computer Vision (ICCV), pages 4195–4205, 2023

  68. [68]

    Todorov, T

    E. Todorov, T. Erez, and Y . Tassa. MuJoCo: A physics engine for model-based control. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 5026– 5033, 2012

  69. [69]

    B. Li, D. Wu, J. Li, S. Zhou, Z. Zeng, L. Li, and H. Zha. Mv-sam3d: Adaptive multi-view fusion for layout-aware 3d generation, 2026. URLhttps://arxiv.org/abs/2603.11633

  70. [70]

    B. Yi, C. M. Kim, J. Kerr, G. Wu, R. Feng, A. Zhang, J. Kulhanek, H. Choi, Y . Ma, M. Tancik, and A. Kanazawa. Viser: Imperative, web-based 3d visualization in python, 2025. URL https://arxiv.org/abs/2507.22885

  71. [71]

    Portaneri, M

    C. Portaneri, M. Rouxel-Labb ´e, M. Hemmer, D. Cohen-Steiner, and P. Alliez. Alpha wrapping with an offset.ACM Trans. Graph., 41(4), July 2022. ISSN 0730-0301. doi:10.1145/3528223. 3530152. URLhttps://doi.org/10.1145/3528223.3530152

  72. [72]

    X. Wei, M. Liu, Z. Ling, and H. Su. Approximate convex decomposition for 3d meshes with collision-aware concavity and tree search.ACM Transactions on Graphics (TOG), 41(4):1–18, 2022

  73. [73]

    Project aria gen 2 mps performance benchmarks.https: //facebookresearch.github.io/projectaria_tools/gen2/technical-specs/ mps/benchmarks/performance, 2025

    Meta Reality Labs Research. Project aria gen 2 mps performance benchmarks.https: //facebookresearch.github.io/projectaria_tools/gen2/technical-specs/ mps/benchmarks/performance, 2025. Accessed: 2026-06-14

  74. [74]

    Intelligence, K

    P. Intelligence, K. Black, N. Brown, J. Darpinian, K. Dhabalia, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusai, M. Y . Galliker, D. Ghosh, L. Groom, K. Hausman, B. Ichter, S. Jakubczak, T. Jones, L. Ke, D. LeBlanc, S. Levine, A. Li-Bell, M. Mothukuri, S. Nair, K. Pertsch, A. Z. Ren, L. X. Shi, L. Smith, J. T. Springenberg, K. Stachowicz, J. Tanner, Q. V...

  75. [75]

    xArm 7.https://www.ufactory.us/product/ufactory-xarm-7, 2026

    UFACTORY. xArm 7.https://www.ufactory.us/product/ufactory-xarm-7, 2026. Accessed: 2026-06-15

  76. [76]

    M. H. Anjaria, M. E. Erciyes, V . Ghatnekar, N. Navarkar, H. Etukuru, X. Jiang, K. Patel, D. Kabra, N. Wojno, R. A. Prayage, S. Chintala, L. Pinto, N. M. M. Shafiullah, and Z. J. Cui. Yor: Your own mobile manipulator for generalizable robotics, 2026. URLhttps://arxiv. org/abs/2602.11150

  77. [77]

    NERO.https://global.agilex.ai/products/nero, 2026

    AgileX Robotics. NERO.https://global.agilex.ai/products/nero, 2026. Accessed: 2026-06-15

  78. [78]

    Gyenes, E

    B. Gyenes, E. Gospodinov, J. Frieling, E. Krohmer, N. Schreiber, X. Jia, N. Freymuth, and G. Neumann. Fourier features let agents learn high precision policies with imitation learning,

  79. [79]

    URLhttps://arxiv.org/abs/2606.12334

  80. [80]

    S. D. Team, X. Chen, F.-J. Chu, P. Gleize, K. J. Liang, A. Sax, H. Tang, W. Wang, M. Guo, T. Hardin, X. Li, A. Lin, J. Liu, Z. Ma, A. Sagar, B. Song, X. Wang, J. Yang, B. Zhang, P. Doll´ar, G. Gkioxari, M. Feiszli, and J. Malik. Sam 3d: 3dfy anything in images, 2026. URL https://arxiv.org/abs/2511.16624

Showing first 80 references.