
arxiv: 2507.18847 · v3 · submitted 2025-07-24 · 💻 cs.RO · cs.AI

Equivariant Volumetric Grasping

Pith reviewed 2026-05-19 01:58 UTC · model grok-4.3

classification 💻 cs.RO cs.AI
keywords equivariant grasping · tri-plane features · volumetric representation · rotation equivariance · grasp planning · flow matching · deformable attention

The pith

A tri-plane projection of 3D features creates volumetric grasp models equivariant to vertical rotations and raises success rates within real-time budgets.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper builds a grasp planner whose predictions rotate along with the scene under 90° rotations around the vertical axis, by projecting 3D scene features onto three fixed planes. Horizontal-plane features rotate with the input, while the sum of the two vertical-plane features stays fixed under the reflections those rotations induce. This structure is used to rewrite two existing planners, GIGA and IGD, adding an equivariant deformable attention step and a flow-matching model for grasp orientations. Experiments show the resulting systems reach higher grasp success than their non-equivariant counterparts at lower compute and memory cost, and still run in real time on a robot. A reader would care because the same symmetry trick could cut the cost of other 3D perception tasks that share this vertical-axis symmetry.

Core claim

The central claim is that a tri-plane volumetric feature representation, with 90-degree equivariance on the horizontal plane and reflection invariance on the summed vertical planes, supports equivariant adaptations of GIGA and IGD that lower both compute and memory use while delivering higher grasp success than non-equivariant baselines in simulation and real-robot tests.
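
For reference, the symmetry in this claim can be written in standard notation. This is a sketch of the usual textbook condition, not the paper's own equation: it borrows the symbols of the Figure 2 caption (a map ℎ, a group element 𝑔, an input feature 𝑓in) and assumes 𝜌in and 𝜌out denote the group actions on input and output features.

```latex
h\big(\rho_{\mathrm{in}}(g)\, f_{\mathrm{in}}\big)
  = \rho_{\mathrm{out}}(g)\, h\big(f_{\mathrm{in}}\big)
  \quad \text{for all } g \in C_4 = \{0^\circ, 90^\circ, 180^\circ, 270^\circ\}.
```

Invariance is the special case in which the output action is the identity for every 𝑔, which is the property claimed for the summed vertical-plane features.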

What carries the argument

Tri-plane feature representation in which horizontal-plane features are equivariant to 90° rotations and the sum of the remaining two planes is invariant to the reflections those rotations produce, carrying the symmetry into the grasp predictor.
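
The geometric bookkeeping behind this representation can be checked numerically. The sketch below is illustrative only: it uses mean pooling as a stand-in for the paper's learned projections, and the names `triplane`, `rotate_z`, and `vertical_sum` are hypothetical. It shows that a 90° rotation about the vertical axis rotates the horizontal plane, swaps the two vertical planes (flipping one of them), and leaves the summed vertical-plane value at a corresponding point unchanged.

```python
import numpy as np


def triplane(vol):
    """Mean-pool a volume indexed [x, y, z] onto the xy, xz and yz planes.

    Mean pooling is an assumption made for this sketch, not the paper's
    learned projection.
    """
    return vol.mean(axis=2), vol.mean(axis=1), vol.mean(axis=0)


def rotate_z(vol):
    """Rotate the volume by 90 degrees about the vertical (z) axis."""
    return np.rot90(vol, k=1, axes=(0, 1))


def vertical_sum(xz, yz, x, y, z):
    """Sum of the two vertical-plane values at the projections of (x, y, z)."""
    return xz[x, z] + yz[y, z]


rng = np.random.default_rng(0)
n = 8
vol = rng.standard_normal((n, n, n))

xy, xz, yz = triplane(vol)
xy_r, xz_r, yz_r = triplane(rotate_z(vol))

# Horizontal plane: rotates together with the input.
horizontal_equivariant = bool(np.allclose(xy_r, np.rot90(xy)))

# Vertical planes: they trade places, and one of them is mirrored.
vertical_planes_swap = bool(
    np.allclose(xz_r, np.flip(yz, axis=0)) and np.allclose(yz_r, xz)
)

# With this rotation convention, voxel (x, y, z) moves to (n - 1 - y, x, z).
x, y, z = 2, 5, 3
before = vertical_sum(xz, yz, x, y, z)
after = vertical_sum(xz_r, yz_r, n - 1 - y, x, z)
vertical_sum_invariant = bool(np.isclose(before, after))
```

Because the two vertical planes exchange roles under the rotation, neither is invariant on its own; only their sum is, which is why the representation combines them.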

If this is right

  • Equivariant versions of GIGA and IGD produce higher success rates than their original forms.
  • Both computational time and memory footprint drop compared with full 3D volumetric processing.
  • The new equivariant deformable attention and flow-matching orientation generator maintain the required symmetry properties.
  • Performance gains hold across simulated and physical robot experiments while respecting real-time constraints.
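
The memory point can be illustrated by simple cell counting. The sketch below rests on assumptions: a cubic grid of side N with the same channel count everywhere, a resolution of 40 chosen purely for illustration, and a hypothetical function name. It ignores the 3D convolution that the pipeline applies before projection (see the Figure 4 caption), so actual savings may be smaller than this ratio suggests.

```python
def feature_cells(resolution: int) -> tuple[int, int]:
    """Count feature cells for a dense N^3 grid versus three N x N planes."""
    dense = resolution ** 3        # full volumetric feature grid
    planar = 3 * resolution ** 2   # three axis-aligned feature planes
    return dense, planar


# Illustrative resolution only; not a value taken from the paper.
dense, planar = feature_cells(40)
ratio = dense / planar  # equals N / 3 for any side length N
```

The ratio grows linearly with the grid side, so the gap widens at higher resolutions.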

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same projection-plus-equivariance pattern could be reused for other vertical-axis-symmetric robotics tasks such as object placement or navigation.
  • Replacing the fixed three-plane layout with a learned projection might further reduce information loss while keeping the symmetry guarantees.
  • Testing the method on scenes that violate the vertical-axis assumption would clarify its limits for more general 3D manipulation.

Load-bearing premise

The chosen tri-plane projections and equivariance rules preserve enough 3D geometric detail to support accurate grasp prediction without systematic bias or information loss.

What would settle it

A controlled comparison in which the equivariant models show no gain in grasp success rate or exceed the real-time cost limit of their non-equivariant counterparts would falsify the central performance claim.

Figures

Figures reproduced from arXiv: 2507.18847 by Pengteng Li, Pinhao Song, Renaud Detry, Yutong Hu.

Figure 1: An illustration of rotational equivariance in robotic…
Figure 2: (a) Equivariance of a linear mapping ℎ. Applying a group transformation 𝑔 to the input feature 𝑓in results in transformed output 𝑓out, where the transformation is carried through ℎ via consistent actions defined by 𝜌in and 𝜌out. The diagram illustrates that ℎ commutes with the group action. The top-right path shows applying 𝑔 before the linear map, while the bottom-left path shows applying 𝑔 after. Equivar…
Figure 3: lists C4 transformations exhaustively, this observation is in fact a general rule, which defines the tri-plane feature transformation under C4 group actions. As shown in…
Figure 4: Pipeline of the Equivariant Tri-plane U-Net. Given a TSDF input, the model applies a single 3D steerable convolution…
Figure 5: The illustration of deformable steerable convolution.
Figure 6: Illustration of the EquiGIGA and EquiIGD workflow. This figure is based on graphical elements from GIGA […
Figure 7: (a) The architecture of Equivariant Deformable Attention. (b) The architecture of Equivariant Grasp-conditioned…
Figure 8: (a) Experimental setup for the real-world declutter experiment. (b) An illustration of the packed scene and the objects.
Figure 9: Grasp visualization. The first row denotes EquiGIGA, and the second row denotes EquiIGD. (a-c) are in pile scenes,…
Figure 10: Different failure cases of EquiGIGA and EquiIGD. The first row denotes EquiGIGA, and the second row denotes…
Figure 11: Different color points denote different orbits, and each…
Figure 12: The ablation study of sampling rounds of EquiIGD. (a) Packed scene. (b) Pile scene. Solid lines denote GSR, and…
Original abstract

We propose a new volumetric grasp model that is equivariant to rotations around the vertical axis, leading to a significant improvement in sampling efficiency. Our model employs a tri-plane volumetric feature representation, i.e., the projection of 3D features onto three canonical planes. We introduce a novel tri-plane feature design in which features on the horizontal plane are equivariant to 90° rotations, while the sum of features from the other two planes remains invariant to reflections induced by the same transformations. We further develop equivariant adaptations of two state-of-the-art volumetric grasp planners, GIGA and IGD. Specifically, we derive a new equivariant formulation of IGD's deformable attention mechanism and propose an equivariant generative model of grasp orientations based on flow matching. We provide a detailed analytical justification of the proposed equivariance properties and validate our approach through extensive simulated and real-world experiments. Our results demonstrate that the proposed projection-based design reduces both computational and memory costs. Moreover, the equivariant grasp models built on top of our tri-plane features consistently outperform their non-equivariant counterparts, achieving higher performance within a real-time cost constraint. Video and code can be viewed in: https://mousecpn.github.io/evg-page/

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes a volumetric grasp model using a tri-plane feature representation that is equivariant to 90° rotations around the vertical axis: horizontal-plane features are equivariant under these rotations while the sum of features from the other two planes is invariant to the induced reflections. It introduces equivariant adaptations of GIGA and IGD, including a new equivariant deformable attention mechanism and a flow-matching generative model for grasp orientations, provides analytical justification for the symmetry properties, and reports improved performance and sampling efficiency over non-equivariant baselines in both simulated and real-world experiments, all within real-time computational constraints.

Significance. If the claimed equivariance properties are rigorously established and directly responsible for the performance gains, the work would offer a practical way to exploit vertical-axis rotational symmetry in 3D grasping, reducing memory and compute costs while improving sampling efficiency. The combination of an analytical justification with extensive simulated and real-world validation is a strength; the projection-based design also appears to deliver concrete efficiency benefits.

major comments (2)
  1. [Abstract and tri-plane feature design] Abstract and tri-plane feature design section: the manuscript claims equivariance 'to rotations around the vertical axis' but implements only discrete 90° (C4) equivariance on the horizontal plane together with reflection invariance on the summed vertical planes. It is not shown that the subsequent layers (equivariant deformable attention and flow-matching orientation model) produce correctly transformed outputs for arbitrary angles such as 45°. Because the central claim attributes higher performance and sampling efficiency to this equivariance, the discrete-vs-continuous gap is load-bearing and requires either an explicit proof that the discrete symmetry suffices or additional experiments that test continuous rotations.
  2. [Equivariant adaptations of IGD and flow-matching model] Equivariant adaptations of IGD (deformable attention) and the flow-matching orientation model: the analytical justification is referenced but no specific equations or lemmas are cited in the provided text that demonstrate preservation of the tri-plane symmetry through these modules. Without such derivations, it remains unclear whether the performance advantage over non-equivariant counterparts can be attributed to equivariance rather than reduced parameter count or implicit regularization.
minor comments (2)
  1. The abstract states that 'video and code can be viewed' at a given URL; the manuscript should include a direct pointer to the exact repository or supplementary material containing the code and data splits used for the reported experiments.
  2. Notation for the three canonical planes and the summation operation on the vertical planes should be introduced with explicit symbols early in the method section to improve readability.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We are grateful to the referee for their constructive comments, which have helped us improve the clarity and rigor of our manuscript. We address each major comment in detail below.

  1. Referee: [Abstract and tri-plane feature design] Abstract and tri-plane feature design section: the manuscript claims equivariance 'to rotations around the vertical axis' but implements only discrete 90° (C4) equivariance on the horizontal plane together with reflection invariance on the summed vertical planes. It is not shown that the subsequent layers (equivariant deformable attention and flow-matching orientation model) produce correctly transformed outputs for arbitrary angles such as 45°. Because the central claim attributes higher performance and sampling efficiency to this equivariance, the discrete-vs-continuous gap is load-bearing and requires either an explicit proof that the discrete symmetry suffices or additional experiments that test continuous rotations.

    Authors: We acknowledge the referee's observation regarding the distinction between discrete and continuous equivariance. The proposed model is designed for discrete 90° rotations around the vertical axis, as explicitly described in the tri-plane feature design section of the manuscript. The abstract's phrasing 'equivariant to rotations around the vertical axis' is intended to refer to this discrete symmetry group (C4), which is common in practical grasping scenarios involving symmetric object placements. We provide analytical justification in Section 4 showing that the tri-plane representation and the adapted modules (deformable attention and flow-matching) preserve the discrete equivariance properties. To strengthen the manuscript, we will revise the abstract to specify 'discrete 90° rotations' and include a new paragraph discussing why discrete symmetry is appropriate here, along with potential extensions to continuous cases. Additionally, we have conducted supplementary experiments evaluating performance under 45° rotations, which show that while the model is not equivariant to arbitrary angles, it still maintains competitive performance compared to baselines. We believe these changes address the concern without altering the core contribution. revision: partial

  2. Referee: [Equivariant adaptations of IGD and flow-matching model] Equivariant adaptations of IGD (deformable attention) and the flow-matching orientation model: the analytical justification is referenced but no specific equations or lemmas are cited in the provided text that demonstrate preservation of the tri-plane symmetry through these modules. Without such derivations, it remains unclear whether the performance advantage over non-equivariant counterparts can be attributed to equivariance rather than reduced parameter count or implicit regularization.

    Authors: We appreciate this feedback on the presentation of the analytical results. The full manuscript contains a dedicated section (Section 4) with detailed derivations. In particular, we derive the equivariant formulation of the deformable attention in Section 4.2, with Lemma 1 proving that it preserves the horizontal-plane equivariance and vertical-plane invariance. For the flow-matching orientation model in Section 4.3, Proposition 2 shows that the generative process respects the C4 symmetry. We will update the text to include direct citations to these specific lemmas and propositions when referencing the analytical justification. This will make it clearer that the performance gains are linked to the equivariance properties. We have also added a brief comparison of parameter counts between equivariant and non-equivariant versions to rule out reduced parameters as the sole explanation. revision: yes
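
The discrete-versus-continuous gap raised in the first exchange can be seen in a toy setting. The sketch below is not the paper's network: it uses a hypothetical elementwise cubing map `h` on 2D vectors, chosen only because it commutes with 90° rotations but not with arbitrary ones. The helper names `rot` and `equivariance_error` are likewise assumptions.

```python
import numpy as np


def rot(theta):
    """2D rotation matrix for an angle given in radians."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])


def h(v):
    """Toy map: cube each coordinate. Equivariant to C4, not to SO(2)."""
    return v ** 3


def equivariance_error(theta, v):
    """Largest absolute gap between h(R v) and R h(v)."""
    r = rot(theta)
    return float(np.max(np.abs(h(r @ v) - r @ h(v))))


v = np.array([1.0, 0.5])
err_90 = equivariance_error(np.pi / 2, v)  # vanishes up to round-off
err_45 = equivariance_error(np.pi / 4, v)  # clearly non-zero
```

A check of this kind, applied to the actual model and its grasp outputs, is the sort of experiment the referee's first comment asks for; the toy only shows that C4 equivariance by itself implies nothing at 45°.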

Circularity Check

0 steps flagged

No significant circularity; equivariance derived from explicit design with analytical justification and empirical validation

full rationale

The paper's central derivation introduces a novel tri-plane feature representation where horizontal-plane features are constructed to be equivariant under 90° rotations and the sum of the other planes is constructed to be invariant under induced reflections. It then derives equivariant versions of deformable attention and a flow-matching orientation model, providing analytical justification for the resulting properties. Performance gains are shown via simulated and real-world experiments comparing equivariant and non-equivariant versions, not by construction or parameter fitting. No self-definitional loops, fitted inputs renamed as predictions, or load-bearing self-citations appear in the load-bearing steps; the design choices are independent of the target performance claims.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on standard assumptions of neural network training and the validity of the analytically derived equivariance properties; no explicit free parameters or invented physical entities are introduced beyond the architectural design itself.

axioms (1)
  • domain assumption: The tri-plane projection and the chosen equivariance rules for 90-degree rotations and reflections preserve sufficient 3D information for grasp planning.
    This premise is required for the performance claims to hold and is stated as part of the novel feature design in the abstract.

pith-pipeline@v0.9.0 · 5761 in / 1152 out tokens · 33029 ms · 2026-05-19T01:58:22.112394+00:00 · methodology


