
arxiv: 1906.08989 · v1 · pith:F3OXSALVnew · submitted 2019-06-21 · 💻 cs.RO · cs.CV

Data-Efficient Learning for Sim-to-Real Robotic Grasping using Deep Point Cloud Prediction Networks

Pith reviewed 2026-05-25 19:09 UTC · model grok-4.3

classification 💻 cs.RO cs.CV
keywords sim-to-real transfer · robotic grasping · point cloud prediction · domain-invariant representation · data-efficient learning · deep networks for robotics · table-top grasping · RGBD snapshots

The pith

A two-step process learns domain-invariant 3D point clouds from simulation episodes and real snapshots to train grasping policies entirely in simulation that transfer to the real world.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that a grasping policy can be trained for table-top instance grasping of varied objects with no real-world grasping data by first learning 3D point cloud predictions that remain consistent across simulation and reality. This representation comes from roughly 76,000 simulation episodes plus 530 short real-world RGBD snapshot sequences. A critic network is then trained only in simulation on top of these 3D shapes. The resulting policy outperforms a 2.5D shape baseline by 10 percent when deployed on a real robot. Real data collection requires only passive camera snapshots rather than active grasping attempts, which lowers cost and time.

Core claim

The method learns a domain-invariant 3D shape representation of objects from about 76K episodes in simulation and about 530 episodes in the real world, where each episode lasts less than a minute, then trains a critic grasping policy in simulation only based on that 3D representation; the learned policy performs table-top instance grasping of a wide variety of objects in the real world without any real grasping data and outperforms the 2.5D baseline by 10 percent.

What carries the argument

Deep point cloud prediction network that produces domain-invariant 3D shape representations from RGBD inputs for use by a simulation-trained grasping critic.
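
To make the 2.5D versus 3D distinction concrete, the sketch below back-projects a depth image into camera-frame points with a standard pinhole model. A depth image is often called 2.5D because it records only the camera-facing surface, so the resulting points cover the visible side of an object and nothing behind it. This is not the paper's predictor, which is a learned network; the function name, the tiny depth crop, and the intrinsics `fx`, `fy`, `cx`, `cy` are all hypothetical values chosen for illustration.

```python
from typing import List, Tuple

Point = Tuple[float, float, float]


def backproject_depth(depth: List[List[float]], fx: float, fy: float,
                      cx: float, cy: float) -> List[Point]:
    """Lift a depth image (metres) to camera-frame 3D points (pinhole model).

    Pixels with non-positive depth are treated as missing and skipped.
    """
    points: List[Point] = []
    for v, row in enumerate(depth):        # v: pixel row
        for u, z in enumerate(row):        # u: pixel column
            if z <= 0.0:
                continue
            x = (u - cx) * z / fx
            y = (v - cy) * z / fy
            points.append((x, y, z))
    return points


# Hypothetical 2x3 depth crop and intrinsics, for illustration only.
depth_crop = [[0.5, 0.5, 0.0],
              [0.6, 0.6, 0.6]]
visible_surface = backproject_depth(depth_crop, fx=2.0, fy=2.0, cx=1.0, cy=0.5)
```

Every point in `visible_surface` lies on the surface the camera can see; a shape predictor of the kind described above would additionally have to produce points for the occluded side.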

If this is right

  • Grasping policies for new objects and arrangements can be developed without collecting real grasping trials.
  • The 3D representation learned in the first step can be reused for other robotic interaction tasks.
  • Data collection effort drops because real episodes need only multiple RGBD snapshots rather than physical attempts.
  • The 10 percent gain over the 2.5D baseline holds across a wide variety of objects in table-top settings.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same two-step pattern could extend to other manipulation skills such as pushing or stacking if suitable critics are defined in simulation.
  • If point cloud accuracy proves sufficient, it may reduce the amount of domain randomization needed during simulation training.
  • Testing the representation on cluttered or partially occluded scenes would reveal how far the current snapshot collection suffices.

Load-bearing premise

A domain-invariant 3D shape learned only from snapshots without any grasping attempts is sufficient for a simulation-trained policy to transfer and succeed at real-world grasping.

What would settle it

A real-robot test in which the sim-trained policy achieves grasping success rates at or below the 2.5D baseline even when point cloud predictions on real scenes match simulation accuracy would falsify the transfer claim.
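
One way to frame such a test is a one-sided two-proportion z-test on grasp success counts. The sketch below is a minimal illustration under stated assumptions: the function name is hypothetical, the pooled-variance z-test is one common choice rather than anything the paper reports, and the success counts are made-up placeholders, not results from the paper.

```python
import math
from typing import Tuple


def two_proportion_z_test(success_a: int, trials_a: int,
                          success_b: int, trials_b: int) -> Tuple[float, float]:
    """Return (z, one-sided p-value) for the alternative rate_a > rate_b.

    Uses the pooled-variance normal approximation, which is reasonable
    only when both trial counts are fairly large.
    """
    if trials_a <= 0 or trials_b <= 0:
        raise ValueError("trial counts must be positive")
    p_a = success_a / trials_a
    p_b = success_b / trials_b
    pooled = (success_a + success_b) / (trials_a + trials_b)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / trials_a + 1.0 / trials_b))
    if se == 0.0:
        raise ValueError("degenerate case: all trials succeeded or all failed")
    z = (p_a - p_b) / se
    # Upper-tail probability of the standard normal distribution.
    p_value = 0.5 * math.erfc(z / math.sqrt(2.0))
    return z, p_value


# Placeholder counts (NOT from the paper): 3D policy vs. 2.5D baseline.
z, p_value = two_proportion_z_test(80, 100, 70, 100)
```

The same gap in success rate yields very different p-values at different trial counts, so the number of real-robot trials matters as much as the headline difference.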

Figures

Figures reproduced from arXiv: 1906.08989 by Honglak Lee, Jasmine Hsu, Mohi Khansari, Sören Pirk, Xinchen Yan, Yuanzheng Gong, Yunfei Bai.

Figure 1: Architecture overview: (a) we use an object detection …
Figure 2: Overview of our object detection and point cloud prediction networks: we detect an object and obtain its cropped color and depth …
Figure 3: Overview of our point cloud-based grasping network.
Figure 4: Overview of the dataset used for learning the domain …
Figure 5: Examples of data collection for shape prediction in the …
Figure 6: Visualizations of point clouds generated with our point …
Figure 7: Grasping sequence evaluation: we visualize the real world grasping sequences for the baseline model (left) and our model (right).
Original abstract

Training a deep network policy for robot manipulation is notoriously costly and time-consuming, as it depends on collecting a significant amount of real-world data. To work well in the real world, the policy needs to see many instances of the task, including various object arrangements in the scene as well as variations in object geometry, texture, material, and environmental illumination. In this paper, we propose a method that learns to perform table-top instance grasping of a wide variety of objects while using no real-world grasping data, outperforming the baseline using 2.5D shape by 10%. Our method learns a 3D point cloud of the object and uses it to train a domain-invariant grasping policy. We formulate the learning process as a two-step procedure: 1) learning a domain-invariant 3D shape representation of objects from about 76K episodes in simulation and about 530 episodes in the real world, where each episode lasts less than a minute, and 2) learning a critic grasping policy in simulation only, based on the 3D shape representation from step 1. Our real-world data collection in step 1 is both cheaper and faster than existing approaches, as it only requires taking multiple snapshots of the scene using an RGBD camera. Finally, the learned 3D representation is not specific to grasping and can potentially be used in other interaction tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims a two-step data-efficient sim-to-real method for table-top robotic grasping: first learn a domain-invariant 3D point cloud predictor from about 76K simulation episodes plus about 530 real-world episodes of RGBD snapshots (no grasping attempts), then train a grasping critic policy entirely in simulation on the resulting 3D representation. The approach reportedly enables grasping of diverse objects without any real grasping data and yields a 10% improvement over a 2.5D shape baseline.

Significance. If the domain-invariance claim holds with supporting evidence, the separation of cheap snapshot-based shape learning from simulation-only policy training would meaningfully lower the barrier to real-world deployment of manipulation policies. The non-grasping-specific nature of the learned representation is also noted as potentially reusable for other tasks.

major comments (2)
  1. [Abstract, §3] Abstract and §3 (method): the central claim that the learned 3D representation is 'domain-invariant' and thereby enables zero-shot transfer is load-bearing, yet the manuscript provides no description of the invariance mechanism (adversarial loss, cycle consistency, shared latent space, etc.) nor any quantitative alignment metric (Chamfer distance, feature-space MMD, or classifier accuracy on sim vs. real predicted clouds) between the two domains.
  2. [§4, Abstract] §4 (experiments) and abstract: the reported 10% outperformance over the 2.5D baseline is stated without the number of real-world trials, standard deviation, statistical significance test, or precise definition of the baseline architecture and input representation, preventing assessment of whether the gain is attributable to successful domain transfer versus simply better shape estimation.
minor comments (1)
  1. [§3] Notation for the point-cloud predictor and critic networks is introduced without a consolidated table of layer dimensions or loss weights.
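
As a concrete instance of the alignment metric requested in major comment 1, the following is a minimal brute-force sketch of a symmetric Chamfer distance between two point clouds. It assumes the squared-distance, mean-per-direction convention (other conventions exist), and the two tiny clouds are hypothetical stand-ins for predictions on matched simulated and real scenes, not data from the paper.

```python
from typing import Sequence, Tuple

Point = Tuple[float, float, float]


def _sq_dist(p: Point, q: Point) -> float:
    """Squared Euclidean distance between two 3D points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def chamfer_distance(a: Sequence[Point], b: Sequence[Point]) -> float:
    """Symmetric Chamfer distance between point clouds a and b.

    For each point, take the squared distance to its nearest neighbour in
    the other cloud; average per direction and sum the two directions.
    Brute force, O(len(a) * len(b)), fine for small clouds only.
    """
    if not a or not b:
        raise ValueError("both point clouds must be non-empty")
    a_to_b = sum(min(_sq_dist(p, q) for q in b) for p in a) / len(a)
    b_to_a = sum(min(_sq_dist(q, p) for p in a) for q in b) / len(b)
    return a_to_b + b_to_a


# Hypothetical predicted clouds for the same object in sim and in reality.
sim_cloud = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
real_cloud = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.1)]
gap = chamfer_distance(sim_cloud, real_cloud)
```

Reporting this value over a held-out set of matched scenes would give a direct, task-independent measure of how closely the predictor's outputs agree across the two domains.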

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the major comments point-by-point below and will revise the manuscript to improve clarity where the points are valid.

  1. Referee: [Abstract, §3] Abstract and §3 (method): the central claim that the learned 3D representation is 'domain-invariant' and thereby enables zero-shot transfer is load-bearing, yet the manuscript provides no description of the invariance mechanism (adversarial loss, cycle consistency, shared latent space, etc.) nor any quantitative alignment metric (Chamfer distance, feature-space MMD, or classifier accuracy on sim vs. real predicted clouds) between the two domains.

    Authors: The domain invariance arises from jointly training the point cloud prediction network on the combined set of about 76K simulated episodes and about 530 real-world RGBD snapshot episodes, using a shared architecture and loss. This mixed-domain training encourages features that are consistent across domains without requiring an explicit adversarial or cycle-consistency term. We will revise §3 to explicitly describe this training procedure and architecture. Quantitative alignment metrics between simulated and real predicted clouds were not computed in the original work, as downstream grasping success served as the primary validation; adding them would require new analysis. revision: partial

  2. Referee: [§4, Abstract] §4 (experiments) and abstract: the reported 10% outperformance over the 2.5D baseline is stated without the number of real-world trials, standard deviation, statistical significance test, or precise definition of the baseline architecture and input representation, preventing assessment of whether the gain is attributable to successful domain transfer versus simply better shape estimation.

    Authors: We agree that the experimental reporting requires additional detail. The 10% improvement is measured over real-world grasping trials on a fixed set of objects; we will specify the exact trial count, include standard deviations, add a statistical significance test, and provide a precise description of the 2.5D baseline (raw depth image input to an otherwise identical critic network) in the revised §4 and abstract. revision: yes
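
The mixed-domain training described in response 1 has to cope with a large imbalance: about 530 of roughly 76,530 episodes are real, or about 0.7 percent, so a uniformly sampled batch of 32 would contain well under one real episode on average. The sketch below shows one possible remedy, a sampler that fixes the share of real episodes per batch. It is an assumption for illustration only: the text above does not say how batches were composed, and the function name, batch size, and `real_fraction` value are hypothetical.

```python
import random
from typing import List, Sequence, Tuple


def mixed_domain_batch(sim_ids: Sequence[int], real_ids: Sequence[int],
                       batch_size: int, real_fraction: float,
                       rng: random.Random) -> List[Tuple[str, int]]:
    """Draw one batch with a fixed share of real episodes.

    Episodes are sampled with replacement within each domain, so the small
    real set is revisited far more often than the simulated set.
    """
    if not 0.0 <= real_fraction <= 1.0:
        raise ValueError("real_fraction must lie in [0, 1]")
    n_real = round(batch_size * real_fraction)
    n_sim = batch_size - n_real
    batch = [("real", rng.choice(real_ids)) for _ in range(n_real)]
    batch += [("sim", rng.choice(sim_ids)) for _ in range(n_sim)]
    rng.shuffle(batch)  # avoid a fixed real-then-sim ordering
    return batch


# Approximate episode counts from the abstract; the 25% share is arbitrary.
sim_ids = range(76_000)
real_ids = range(530)
batch = mixed_domain_batch(sim_ids, real_ids, batch_size=32,
                           real_fraction=0.25, rng=random.Random(0))
```

Oversampling the real set in this way trades a higher risk of overfitting to the few real episodes against giving them a meaningful gradient contribution; whether that trade is favorable here is an empirical question the summary above does not answer.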

Circularity Check

0 steps flagged

No significant circularity; empirical two-stage training is self-contained


The paper presents an empirical two-step procedure: a 3D point cloud predictor is trained on about 76K simulation episodes plus about 530 real-world RGBD snapshot episodes (no grasping attempts), after which a grasping critic policy is trained exclusively in simulation using the predictor's outputs as input. The final real-world grasping performance is reported as an experimental outcome of this pipeline, with no equations, fitted parameters, or self-citations that reduce the claimed 10% gain to a definitional equivalence or input by construction. The domain-invariance claim is treated as a learned property of the first stage rather than an imposed identity. No load-bearing self-citation, uniqueness theorem, or ansatz smuggling is present in the provided text.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 0 invented entities

The method relies on standard deep learning assumptions and the domain-invariance of the learned 3D representation, with no new physical entities postulated. Free parameters are the typical ones in neural network training.

free parameters (1)
  • network architectures and hyperparameters for point cloud prediction and grasping critic
    Deep networks have many parameters tuned during training on the 76K sim and 530 real episodes.
axioms (2)
  • domain assumption: The 3D shape representation learned from RGBD snapshots is domain-invariant between simulation and real world.
    Invoked in step 1 of the method to enable transfer.
  • domain assumption: A grasping policy trained on predicted 3D shapes in simulation will perform well in the real world when using the same shape predictor.
    Central to the two-step procedure.


