
arxiv: 2510.27420 · v3 · submitted 2025-10-31 · 💻 cs.RO

Towards a Multi-Embodied Grasping Agent

Pith reviewed 2026-05-18 03:00 UTC · model grok-4.3

classification 💻 cs.RO
keywords multi-embodiment grasping · equivariant grasp synthesis · flow-based architecture · gripper geometry · kinematic model deduction · robotic manipulation · data-efficient learning

The pith

A grasp synthesis method handles any gripper by deducing its full kinematics from shape and scene geometry alone.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a grasping approach that works across many different robot hands and grippers without large custom datasets for each design. It builds a flow-based equivariant architecture that reads the gripper geometry and the scene to determine how the gripper can move and what grasps are possible. This replaces the need to supply explicit joint parameters or train separately on each embodiment. A reader would care because most current grasping systems are locked to one robot hand and demand enormous amounts of new data when the hardware changes. If the claim holds, robots could switch between humanoid hands, parallel jaws, and other designs using the same learned model.

Core claim

The central claim is that a data-efficient, flow-based, equivariant grasp synthesis architecture can handle different gripper types with variable degrees of freedom by successfully exploiting the underlying kinematic model, with all necessary information deduced solely from the gripper and scene geometry. The method reimplements every module from the ground up in JAX to support batching over scenes, grippers, and grasps, which the authors report yields smoother learning, improved performance, and faster inference. Supporting evidence comes from a dataset spanning humanoid hands to parallel jaw grippers, 25,000 scenes, and 20 million grasps.

What carries the argument

The flow-based equivariant grasp synthesis architecture that deduces the gripper kinematic model directly from gripper geometry and scene geometry inputs.

If this is right

  • The same model can be applied to grippers with varying numbers of degrees of freedom without retraining or new kinematic labels.
  • Batching over scenes, grippers, and grasps in the JAX implementation produces faster inference and smoother optimization than prior equivariant methods.
  • A single trained system achieves grasping success across humanoid hands and parallel jaw grippers.
  • Training data requirements drop because the architecture does not need embodiment-specific large-scale datasets.
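The batching and variable-DoF points above can be illustrated with a minimal JAX sketch. This is not the paper's implementation: `MAX_DOF`, `pad_joints`, `grasp_score`, and the toy scoring rule are hypothetical names and choices made only for illustration. The sketch assumes joint vectors are zero-padded to a common width with a validity mask, and then nests `jax.vmap` over grasps, grippers, and scenes.

```python
import jax
import jax.numpy as jnp

MAX_DOF = 4  # hypothetical padding width; not a value from the paper


def pad_joints(q, max_dof=MAX_DOF):
    """Zero-pad a joint vector to max_dof and return it with a validity mask."""
    q = jnp.asarray(q, dtype=jnp.float32)
    n = q.shape[0]
    padded = jnp.zeros(max_dof, dtype=jnp.float32).at[:n].set(q)
    mask = jnp.arange(max_dof) < n
    return padded, mask


def grasp_score(scene_feat, joints, mask):
    """Toy stand-in for a network: masked joint energy plus a scene term."""
    joint_term = jnp.sum(jnp.where(mask, joints ** 2, 0.0))
    return joint_term + jnp.mean(scene_feat)


# Innermost axis: grasps (joints vary; mask and scene are shared).
per_grasp = jax.vmap(grasp_score, in_axes=(None, 0, None))
# Middle axis: grippers (joints and mask vary; scene is shared).
per_gripper = jax.vmap(per_grasp, in_axes=(None, 0, 0))
# Outer axis: scenes (only the scene features vary).
batched_score = jax.jit(jax.vmap(per_gripper, in_axes=(0, None, None)))

# Two hypothetical grippers: a 1-DoF parallel jaw and a 3-DoF hand, two grasps each.
jaw = [pad_joints([0.5]), pad_joints([1.0])]
hand = [pad_joints([0.1, 0.2, 0.3]), pad_joints([1.0, 1.0, 1.0])]
joints = jnp.stack([jnp.stack([q for q, _ in g]) for g in (jaw, hand)])  # (2, 2, 4)
masks = jnp.stack([g[0][1] for g in (jaw, hand)])  # (2, 4)
scenes = jnp.zeros((3, 8))  # three scenes with 8 features each

scores = batched_score(scenes, joints, masks)  # shape (scenes, grippers, grasps)
```

Padding with a mask is one common way to batch embodiments of different sizes under `jit`; the paper may handle variable DoF differently, for example through its per-joint embeddings.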

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The geometry-only deduction could extend to other manipulation primitives such as in-hand reorientation or tool use.
  • Real-robot experiments with previously unseen gripper designs would provide a direct check on whether the inferred kinematics transfer without fine-tuning.
  • Combining the architecture with online scene reconstruction might allow a robot to select and use a new gripper on the fly.

Load-bearing premise

The full kinematic model of any gripper can be accurately recovered from static gripper and scene geometry without explicit parameters or extra supervision.

What would settle it

A test set of grippers whose motion cannot be inferred from geometry alone, such as those with hidden joints or non-rigid compliance, where the model produces invalid or unsafe grasp predictions.

Figures

Figures reproduced from arXiv: 2510.27420 by Alexander Qualmann, Gerhard Neumann, Ngo Anh Vien, Roman Freiberg.

Figure 1: Equivariant Gripper Embeddings. An initial gripper configuration (a) is represented by a learned feature embedding z. After a physical joint rotation ∆R, the gripper is in a new configuration (b). Our method ensures the features are correspondingly transformed via the Wigner-D matrices, z′ = D(∆R)z, keeping the representation consistent with the physical state.
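The relation z′ = D(∆R)z in the Figure 1 caption can be checked numerically in the simplest case. For degree-1 (vector) features, the Wigner-D matrix coincides with the ordinary 3×3 rotation matrix up to a change of basis, so the sketch below uses the rotation matrix directly. The embedding `vector_embedding` is a hypothetical hand-written function, not the learned encoder from the paper, and the rotation is assumed to act about the origin.

```python
import jax
import jax.numpy as jnp


def rotation_matrix(axis, angle):
    """Rodrigues' formula for a rotation about a given axis through the origin."""
    axis = axis / jnp.linalg.norm(axis)
    K = jnp.array([
        [0.0, -axis[2], axis[1]],
        [axis[2], 0.0, -axis[0]],
        [-axis[1], axis[0], 0.0],
    ])
    return jnp.eye(3) + jnp.sin(angle) * K + (1.0 - jnp.cos(angle)) * (K @ K)


def vector_embedding(points):
    """Hypothetical degree-1 feature: mean of points weighted by squared norm."""
    weights = jnp.sum(points ** 2, axis=-1, keepdims=True)
    return jnp.mean(weights * points, axis=0)


points = jax.random.normal(jax.random.PRNGKey(0), (16, 3))  # toy link geometry
delta_R = rotation_matrix(jnp.array([0.0, 0.0, 1.0]), 0.7)  # toy joint rotation

z = vector_embedding(points)
z_rotated_input = vector_embedding(points @ delta_R.T)  # embed the rotated geometry
z_transformed = delta_R @ z                             # rotate the original embedding
```

The two results agree because the squared-norm weights are rotation invariant while the points themselves rotate, which is the property the caption describes for higher-degree features under the corresponding Wigner-D matrices.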
Figure 2: Method Overview. (Left) Grippers are represented with per-joint equivariant embeddings. (a) Full Pipeline. A scene point cloud is encoded into a multi-scale equivariant feature pyramid. Time-conditioned joint features query this pyramid to extract pose and joint information. These scene-aware queries are then decoded to predict flow gradients, which generate the final pre-grasp configuration. (b) Kinematic…
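The Figure 2 caption describes predicted flow gradients that generate the final pre-grasp configuration. One common way to turn a velocity field into a sample is explicit Euler integration from noise at t = 0 to t = 1; whether the paper uses this integrator is not stated in the excerpt, so it is an assumption here. In the sketch below, `toy_velocity` is a hypothetical stand-in for the network (the closed-form velocity of a straight-line path toward a known target), and a flat joint vector replaces the full pose-plus-joints state.

```python
import jax
import jax.numpy as jnp


def toy_velocity(x, t, target):
    """Hypothetical stand-in for the learned flow: straight-line velocity toward target."""
    return (target - x) / (1.0 - t)


def euler_sample(x0, target, num_steps=10):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with fixed-size Euler steps."""
    dt = 1.0 / num_steps

    def step(k, x):
        t = k * dt  # t stays strictly below 1, so the division in toy_velocity is safe
        return x + dt * toy_velocity(x, t, target)

    return jax.lax.fori_loop(0, num_steps, step, x0)


noise = jax.random.normal(jax.random.PRNGKey(0), (3,))  # initial sample at t=0
target = jnp.array([0.2, -0.4, 0.9])  # hypothetical pre-grasp joint angles
pre_grasp = euler_sample(noise, target)
```

With this particular toy field the final Euler step lands on the target up to floating-point error; a learned field would instead carry noise samples to a distribution of grasps.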
Figure 3: Multi-Embodiment Grasp Synthesis Examples. Renderings of three sampled pre-grasp configurations for five distinct grippers in cluttered scenes. Included grippers (a) ViperX 300s parallel gripper, (b) Franka Emika parallel gripper, (c) DEX-EE dexterous hand, (d) Allegro Hand, and (e) Shadow Hand.
Original abstract

Multi-embodiment grasping focuses on developing approaches that exhibit generalist behavior across diverse gripper designs. Existing methods often learn the kinematic structure of the robot implicitly and face challenges due to the difficulty of sourcing the required large-scale data. In this work, we present a data-efficient, flow-based, equivariant grasp synthesis architecture that can handle different gripper types with variable degrees of freedom and successfully exploit the underlying kinematic model, deducing all necessary information solely from the gripper and scene geometry. Unlike previous equivariant grasping methods, we translated all modules from the ground up to JAX and provide a model with batching capabilities over scenes, grippers, and grasps, resulting in smoother learning, improved performance and faster inference time. Our dataset encompasses grippers ranging from humanoid hands to parallel jaw grippers and includes 25,000 scenes and 20 million grasps.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The manuscript presents a data-efficient, flow-based, equivariant grasp synthesis architecture for multi-embodiment grasping. It claims to handle grippers with variable degrees of freedom by exploiting the underlying kinematic model, deducing all necessary information solely from gripper and scene geometry inputs. The work includes a JAX reimplementation with batching over scenes, grippers, and grasps for improved performance and inference speed, supported by a dataset of 25,000 scenes and 20 million grasps spanning humanoid hands to parallel jaw grippers.

Significance. If the central claims are substantiated, the approach could meaningfully advance generalist grasping by reducing reliance on embodiment-specific data and explicit kinematic parameters, enabling better generalization across diverse grippers through geometric inputs alone. The JAX-based batching and large-scale dataset are practical strengths that could support reproducible follow-up work.

major comments (1)
  1. [Abstract] The load-bearing claim that the architecture 'successfully exploit[s] the underlying kinematic model, deducing all necessary information solely from the gripper and scene geometry' without explicit kinematic parameters requires stronger substantiation. Static geometry (meshes or point clouds) does not encode joint axes, limits, or configuration spaces for variable-DOF grippers; any kinematic exploitation must therefore be shown to arise from geometry rather than implicit learning on the 20M-grasp training set. Generalization experiments on unseen gripper topologies would directly test this distinction.
minor comments (1)
  1. The abstract references improved performance and faster inference but does not include quantitative metrics, baselines, or ablation results; adding these details with specific numbers and comparisons would improve clarity without altering the core contribution.
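The generalization experiment suggested in the major comment could be organized as a leave-one-gripper-out protocol. The sketch below is only an illustration of that idea, not an experiment from the paper: the gripper names come from the Figure 3 caption, while `leave_one_gripper_out` and the per-sample labels in `sample_gripper_ids` are hypothetical.

```python
import jax.numpy as jnp

# Gripper names as listed in the Figure 3 caption.
GRIPPERS = ["ViperX 300s", "Franka Emika", "DEX-EE", "Allegro Hand", "Shadow Hand"]


def leave_one_gripper_out(gripper_ids, held_out):
    """Return boolean (train, test) masks over grasp samples for one held-out gripper."""
    gripper_ids = jnp.asarray(gripper_ids)
    test_mask = gripper_ids == held_out
    return ~test_mask, test_mask


# Hypothetical per-sample gripper labels (indices into GRIPPERS).
sample_gripper_ids = jnp.array([0, 0, 1, 2, 2, 3, 4, 4])

# One split per gripper: train on the other four, evaluate on the held-out one.
splits = {
    name: leave_one_gripper_out(sample_gripper_ids, index)
    for index, name in enumerate(GRIPPERS)
}
train_mask, test_mask = splits["Shadow Hand"]
```

Success on a held-out gripper would point toward kinematics inferred from geometry, whereas a sharp drop would suggest the behavior was memorized per embodiment from the training grasps.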

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback and for identifying the need for stronger substantiation of the central claim in the abstract. We address this point below.

read point-by-point responses
  1. Referee: [Abstract] The load-bearing claim that the architecture 'successfully exploit[s] the underlying kinematic model, deducing all necessary information solely from the gripper and scene geometry' without explicit kinematic parameters requires stronger substantiation. Static geometry (meshes or point clouds) does not encode joint axes, limits, or configuration spaces for variable-DOF grippers; any kinematic exploitation must therefore be shown to arise from geometry rather than implicit learning on the 20M-grasp training set. Generalization experiments on unseen gripper topologies would directly test this distinction.

    Authors: We appreciate the referee highlighting this distinction. Our model receives only geometric inputs (gripper meshes or point clouds together with scene geometry) and no explicit kinematic parameters such as joint axes, limits, or configuration spaces at any stage. The flow-based equivariant architecture is trained to produce grasp distributions that respect the feasible motions of each gripper by learning from the geometric structure and the associated successful grasps in the dataset. Results across the range of embodiments (humanoid hands to parallel jaw grippers) show that the network generates kinematically plausible outputs for each gripper geometry without being supplied joint information. We acknowledge that this capability is acquired through training on the 20 million grasps rather than from an analytic kinematic model. To strengthen the presentation, we will revise the manuscript to include a clearer discussion of how geometric inputs alone enable the model to infer valid grasp configurations and to add further analysis of performance under gripper variations.

    Revision: partial

Circularity Check

0 steps flagged

No circularity; derivation relies on learned equivariant flow model from explicit geometry inputs and large dataset

full rationale

The paper's central claim is that a flow-based equivariant architecture, trained on 20M grasps across 25k scenes and diverse grippers, can exploit kinematics implicitly from gripper/scene geometry alone. This is presented as an empirical capability of the JAX-implemented model rather than a mathematical derivation that reduces to its inputs by construction. No self-definitional loops, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided abstract and claims. The architecture is described as translating modules from the ground up with batching, yielding performance gains, but the kinematic deduction is an assumption about what the trained model achieves, not a step that equates output to input definitionally. The approach is self-contained against external benchmarks via the dataset and equivariance properties.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that geometry alone suffices to recover kinematics for variable-DoF grippers; no free parameters or invented entities are described in the abstract.

axioms (1)
  • domain assumption: Kinematic structure of a gripper can be deduced solely from its geometry and the scene geometry
    Explicitly stated in the abstract as the basis for handling different gripper types without additional inputs.

pith-pipeline@v0.9.0 · 5678 in / 1197 out tokens · 34068 ms · 2026-05-18T03:00:41.588586+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking · unclear

    Relation between the paper passage and the cited Recognition theorem: unclear.

    We present a data-efficient, flow-based, equivariant grasp synthesis architecture that can handle different gripper types with variable degrees of freedom and successfully exploit the underlying kinematic model, deducing all necessary information solely from the gripper and scene geometry.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
