Data-Efficient Learning for Sim-to-Real Robotic Grasping using Deep Point Cloud Prediction Networks

Honglak Lee; Jasmine Hsu; Mohi Khansari; Sören Pirk; Xinchen Yan; Yuanzheng Gong; Yunfei Bai

arxiv: 1906.08989 · v1 · pith:F3OXSALVnew · submitted 2019-06-21 · 💻 cs.RO · cs.CV


Xinchen Yan, Mohi Khansari, Jasmine Hsu, Yuanzheng Gong, Yunfei Bai, Sören Pirk, Honglak Lee

Pith reviewed 2026-05-25 19:09 UTC · model grok-4.3


keywords sim-to-real transfer · robotic grasping · point cloud prediction · domain-invariant representation · data-efficient learning · deep networks for robotics · table-top grasping · RGBD snapshots


The pith

A two-step process learns domain-invariant 3D point clouds from simulation episodes and real snapshots to train grasping policies entirely in simulation that transfer to the real world.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that a grasping policy can be trained for table-top instance grasping of varied objects with no real-world grasping data by first learning 3D point cloud predictions that remain consistent across simulation and reality. This representation comes from roughly 76,000 simulation episodes plus 530 short real-world RGBD snapshot sequences. A critic network is then trained only in simulation on top of these 3D shapes. The resulting policy outperforms a 2.5D shape baseline by 10 percent when deployed on a real robot. Real data collection requires only passive camera snapshots rather than active grasping attempts, which lowers cost and time.

Core claim

The method learns a domain-invariant 3D shape representation of objects from about 76K episodes in simulation and about 530 episodes in the real world, where each episode lasts less than a minute, then trains a critic grasping policy in simulation only based on that 3D representation; the learned policy performs table-top instance grasping of a wide variety of objects in the real world without any real grasping data and outperforms the 2.5D baseline by 10 percent.
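As a rough sanity check on the data budget in this claim, the arithmetic can be worked through directly. The sketch below uses only the approximate counts quoted above and treats "less than a minute" as a one-minute upper bound per real-world episode. The variable names are illustrative, and setup or reset time between episodes is not included.

```python
# Approximate counts quoted in the core claim ("about 76K", "about 530").
SIM_EPISODES = 76_000
REAL_EPISODES = 530
# Assumption: "less than a minute" is treated as a 1-minute upper bound.
MAX_MINUTES_PER_EPISODE = 1.0

# Fraction of all shape-learning episodes that come from the real world.
real_share = REAL_EPISODES / (SIM_EPISODES + REAL_EPISODES)

# Upper bound on pure recording time for the real-world episodes.
max_real_hours = REAL_EPISODES * MAX_MINUTES_PER_EPISODE / 60.0

print(f"real share of episodes: {real_share:.2%}")
print(f"upper bound on real collection time: {max_real_hours:.1f} h")
```

Under these assumptions the real-world episodes make up well under one percent of the shape-learning data and fit within roughly one working day of recording.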

What carries the argument

Deep point cloud prediction network that produces domain-invariant 3D shape representations from RGBD inputs for use by a simulation-trained grasping critic.

If this is right

Grasping policies for new objects and arrangements can be developed without collecting real grasping trials.
The 3D representation learned in the first step can be reused for other robotic interaction tasks.
Data collection effort drops because real episodes need only multiple RGBD snapshots rather than physical attempts.
Performance gains of 10 percent over 2.5D methods hold across a wide variety of objects in table-top settings.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same two-step pattern could extend to other manipulation skills such as pushing or stacking if suitable critics are defined in simulation.
If point cloud accuracy proves sufficient, it may reduce the amount of domain randomization needed during simulation training.
Testing the representation on cluttered or partially occluded scenes would reveal how far the current snapshot collection suffices.

Load-bearing premise

A domain-invariant 3D shape learned only from snapshots without any grasping attempts is sufficient for a simulation-trained policy to transfer and succeed at real-world grasping.

What would settle it

A real-robot test in which the sim-trained policy achieves grasping success rates at or below the 2.5D baseline even when point cloud predictions on real scenes match simulation accuracy would falsify the transfer claim.

Figures

Figures reproduced from arXiv: 1906.08989 by Honglak Lee, Jasmine Hsu, Mohi Khansari, Sören Pirk, Xinchen Yan, Yuanzheng Gong, Yunfei Bai.

**Figure 2.** Overview of our object detection and point cloud prediction networks: we detect an object and obtain its cropped color and depth

**Figure 3.** Overview of our point cloud-based grasping network.

**Figure 5.** Examples of data collection for shape prediction in the

**Figure 4.** Overview of the dataset used for learning the domain

**Figure 6.** Visualizations of point clouds generated with our point

**Figure 7.** Grasping sequence evaluation: we visualize the real world grasping sequences for the baseline model (left) and our model (right).

Original abstract

Training a deep network policy for robot manipulation is notoriously costly and time consuming as it depends on collecting a significant amount of real world data. To work well in the real world, the policy needs to see many instances of the task, including various object arrangements in the scene as well as variations in object geometry, texture, material, and environmental illumination. In this paper, we propose a method that learns to perform table-top instance grasping of a wide variety of objects while using no real world grasping data, outperforming the baseline using 2.5D shape by 10%. Our method learns a 3D point cloud of the object and uses that to train a domain-invariant grasping policy. We formulate the learning process as a two-step procedure: 1) Learning a domain-invariant 3D shape representation of objects from about 76K episodes in simulation and about 530 episodes in the real world, where each episode lasts less than a minute and 2) Learning a critic grasping policy in simulation only based on the 3D shape representation from step 1. Our real world data collection in step 1 is both cheaper and faster compared to existing approaches as it only requires taking multiple snapshots of the scene using an RGBD camera. Finally, the learned 3D representation is not specific to grasping, and can potentially be used in other interaction tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper splits shape learning (sim + cheap real snapshots, no grasps) from policy training (sim only) and claims 10% better real grasping than 2.5D, but the abstract gives almost no experimental detail to back the transfer claim.


The core move here is training a point cloud predictor on about 76K simulation episodes plus about 530 real-world RGBD snapshot episodes, then using the resulting 3D representation to train a grasping critic entirely in simulation. That critic is supposed to transfer zero-shot because the representation is domain-invariant. The practical payoff they advertise is skipping all real grasping trials, which are a real bottleneck. That separation of concerns is the main novelty relative to standard sim-to-real grasping work. It also means the real-world data collection is genuinely cheaper: camera snapshots instead of repeated grasp attempts. If the numbers hold, this is the kind of incremental but useful reduction in real-robot time that people deploying these systems actually care about.

The 10% gain over the 2.5D baseline is the headline result, but the abstract supplies no trial counts, no variance, no statistical test, and no description of how invariance was achieved or measured. Without a quantitative check that the predicted real point clouds sit inside the distribution the critic was trained on, it is hard to know whether the improvement comes from better shape estimation or from actual domain transfer. The paper would be stronger with an ablation that shows the critic fails when fed raw real point clouds but succeeds on the predicted ones, plus some distance metric between sim and real predicted clouds.

The citation pattern is not visible from the abstract, but the approach itself looks like a straightforward extension of existing point-cloud and sim-to-real ideas rather than a wholesale reinvention. This is the kind of paper that belongs in a robotics conference if the experiments are solid and reproducible. It is aimed at people who need to move learned manipulation policies to hardware without burning weeks on real data collection. The central claim is practically important enough that a serious editor should send it out for review rather than desk-reject, even if the current write-up leaves the transfer mechanism under-specified.
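One candidate for the sim-versus-real distance metric mentioned in this note is the maximum mean discrepancy (MMD) between feature vectors from the two domains. The sketch below is a minimal, brute-force version of the biased squared-MMD estimate with an RBF kernel. The toy feature vectors, the bandwidth value, and the function names are all illustrative assumptions, not anything taken from the paper.

```python
import math


def rbf_kernel(x, y, bandwidth):
    """RBF kernel exp(-||x - y||^2 / (2 * bandwidth^2)) for equal-length tuples."""
    squared_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-squared_dist / (2.0 * bandwidth ** 2))


def mmd_squared(xs, ys, bandwidth=1.0):
    """Biased (V-statistic) estimate of squared MMD between two samples."""
    if not xs or not ys:
        raise ValueError("both samples must be non-empty")

    def mean_kernel(us, vs):
        total = sum(rbf_kernel(u, v, bandwidth) for u in us for v in vs)
        return total / (len(us) * len(vs))

    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)


# Hypothetical 2D "features" standing in for sim and real embeddings.
sim_feats = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
real_feats = [(3.0, 3.0), (4.0, 3.0), (3.0, 4.0)]

print(mmd_squared(sim_feats, sim_feats))   # same sample on both sides
print(mmd_squared(sim_feats, real_feats))  # clearly separated samples
```

A value near zero suggests the two samples are hard to tell apart under the chosen kernel; the result depends on the bandwidth, so in practice it would be reported for several bandwidths.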

Referee Report

2 major / 1 minor

Summary. The paper claims a two-step data-efficient sim-to-real method for table-top robotic grasping: first learn a domain-invariant 3D point cloud predictor from ~76K simulation episodes plus ~530 real-world RGBD snapshot episodes (no grasping attempts), then train a grasping critic policy entirely in simulation on the resulting 3D representation; the approach reportedly enables grasping of diverse objects without any real grasping data and yields a 10% improvement over a 2.5D shape baseline.

Significance. If the domain-invariance claim holds with supporting evidence, the separation of cheap snapshot-based shape learning from simulation-only policy training would meaningfully lower the barrier to real-world deployment of manipulation policies. The non-grasping-specific nature of the learned representation is also noted as potentially reusable for other tasks.

major comments (2)

[Abstract, §3] Abstract and §3 (method): the central claim that the learned 3D representation is 'domain-invariant' and thereby enables zero-shot transfer is load-bearing, yet the manuscript provides no description of the invariance mechanism (adversarial loss, cycle consistency, shared latent space, etc.) nor any quantitative alignment metric (Chamfer distance, feature-space MMD, or classifier accuracy on sim vs. real predicted clouds) between the two domains.
[§4, Abstract] §4 (experiments) and abstract: the reported 10% outperformance over the 2.5D baseline is stated without the number of real-world trials, standard deviation, statistical significance test, or precise definition of the baseline architecture and input representation, preventing assessment of whether the gain is attributable to successful domain transfer versus simply better shape estimation.
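Of the alignment metrics named in the first major comment, Chamfer distance is the simplest to state. The sketch below shows one common variant (mean nearest-neighbour Euclidean distance, summed over both directions) in brute-force form. The toy point clouds and the function name are illustrative assumptions; other variants use squared distances or a different normalization.

```python
import math


def chamfer_distance(cloud_a, cloud_b):
    """Symmetric Chamfer distance between two non-empty 3D point sets.

    Each cloud is a list of (x, y, z) tuples. The result is the mean
    nearest-neighbour distance from A to B plus the same from B to A.
    Brute force, O(len(cloud_a) * len(cloud_b)).
    """
    if not cloud_a or not cloud_b:
        raise ValueError("both point clouds must be non-empty")

    def mean_nearest(src, dst):
        return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)

    return mean_nearest(cloud_a, cloud_b) + mean_nearest(cloud_b, cloud_a)


# Hypothetical toy clouds standing in for predicted sim and real shapes.
sim_cloud = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
real_cloud = [(0.0, 0.0, 0.1), (1.0, 0.0, 0.1), (0.0, 1.0, 0.1)]

print(chamfer_distance(sim_cloud, sim_cloud))   # identical clouds
print(chamfer_distance(sim_cloud, real_cloud))  # offset of 0.1 along z
```

For clouds of realistic size a spatial index such as a k-d tree would replace the inner brute-force search.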

minor comments (1)

[§3] Notation for the point-cloud predictor and critic networks is introduced without a consolidated table of layer dimensions or loss weights.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the major comments point-by-point below and will revise the manuscript to improve clarity where the points are valid.


Referee: [Abstract, §3] Abstract and §3 (method): the central claim that the learned 3D representation is 'domain-invariant' and thereby enables zero-shot transfer is load-bearing, yet the manuscript provides no description of the invariance mechanism (adversarial loss, cycle consistency, shared latent space, etc.) nor any quantitative alignment metric (Chamfer distance, feature-space MMD, or classifier accuracy on sim vs. real predicted clouds) between the two domains.

Authors: The domain invariance arises from jointly training the point cloud prediction network on the combined set of ~76K simulated episodes and ~530 real-world RGBD snapshot episodes using a shared architecture and loss; this mixed-domain training encourages features that are consistent across domains without requiring an explicit adversarial or cycle-consistency term. We will revise §3 to explicitly describe this training procedure and architecture. Quantitative alignment metrics between simulated and real predicted clouds were not computed in the original work, as downstream grasping success served as the primary validation; adding them would require new analysis.

Revision: partial

Referee: [§4, Abstract] §4 (experiments) and abstract: the reported 10% outperformance over the 2.5D baseline is stated without the number of real-world trials, standard deviation, statistical significance test, or precise definition of the baseline architecture and input representation, preventing assessment of whether the gain is attributable to successful domain transfer versus simply better shape estimation.

Authors: We agree that the experimental reporting requires additional detail. The 10% improvement is measured over real-world grasping trials on a fixed set of objects; we will specify the exact trial count, include standard deviations, add a statistical significance test, and provide a precise description of the 2.5D baseline (raw depth image input to an otherwise identical critic network) in the revised §4 and abstract.

Revision: yes
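To illustrate why the trial count matters for the significance test discussed above, here is a minimal sketch of a pooled two-proportion z-test. The success and trial counts are hypothetical, chosen only to give a 10-point gap in success rate; the page does not report the paper's actual counts, nor whether the 10% is an absolute or relative difference.

```python
import math


def two_proportion_z_test(successes_a, trials_a, successes_b, trials_b):
    """Two-sided z-test for equal success probabilities (pooled variance).

    Returns (z, p_value) using the normal approximation.
    """
    if trials_a <= 0 or trials_b <= 0:
        raise ValueError("trial counts must be positive")
    p_a = successes_a / trials_a
    p_b = successes_b / trials_b
    pooled = (successes_a + successes_b) / (trials_a + trials_b)
    std_err = math.sqrt(pooled * (1.0 - pooled) * (1.0 / trials_a + 1.0 / trials_b))
    if std_err == 0.0:
        raise ValueError("degenerate case: pooled rate is 0 or 1")
    z = (p_a - p_b) / std_err
    p_value = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal tail
    return z, p_value


# Hypothetical counts: 80% vs 70% success, at two different sample sizes.
z_small, p_small = two_proportion_z_test(40, 50, 35, 50)
z_large, p_large = two_proportion_z_test(400, 500, 350, 500)

print(f"50 trials each:  z = {z_small:.2f}, p = {p_small:.3f}")
print(f"500 trials each: z = {z_large:.2f}, p = {p_large:.5f}")
```

With these made-up numbers the same 10-point gap is not significant at the 0.05 level with 50 trials per method but is with 500, which is why the exact trial count belongs in the revised §4.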

Circularity Check

0 steps flagged

No significant circularity; empirical two-stage training is self-contained


The paper presents an empirical two-step procedure: a 3D point cloud predictor is trained on ~76K simulation episodes plus ~530 real-world RGBD snapshot episodes (no grasping attempts), after which a grasping critic policy is trained exclusively in simulation using the predictor's outputs as input. The final real-world grasping performance is reported as an experimental outcome of this pipeline, with no equations, fitted parameters, or self-citations that reduce the claimed 10% gain to a definitional equivalence or input by construction. The domain-invariance claim is treated as a learned property of the first stage rather than an imposed identity. No load-bearing self-citation, uniqueness theorem, or ansatz smuggling is present in the provided text.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 0 invented entities

The method relies on standard deep learning assumptions and the domain-invariance of the learned 3D representation, with no new physical entities postulated. Free parameters are the typical ones in neural network training.

free parameters (1)

network architectures and hyperparameters for point cloud prediction and grasping critic
Deep networks have many parameters tuned during training on the 76K sim and 530 real episodes.

axioms (2)

Domain assumption: The 3D shape representation learned from RGBD snapshots is domain-invariant between simulation and the real world.
Invoked in step 1 of the method to enable transfer.
Domain assumption: A grasping policy trained on predicted 3D shapes in simulation will perform well in the real world when using the same shape predictor.
Central to the two-step procedure.



[1] [1]

Achlioptas, O

P. Achlioptas, O. Diamanti, I. Mitliagkas, and L. J. Guibas. Learning representations and generative mod- els for 3d point clouds. In ICML, 2018

work page 2018

[2] [2]

Bohg and D

J. Bohg and D. Kragic. Learning grasping points with shape context. Robot. Autonom. Syst., 58(4):362–377, 2010

work page 2010

[3] [3]

Bousmalis, A

K. Bousmalis, A. Irpan, P. Wohlhart, Y . Bai, M. Kelcey, M. Kalakrishnan, L. Downs, J. Ibarz, P. P. Sampedro, K. Konolige, S. Levine, and V . Vanhoucke. Using simulation and domain adaptation to improve efﬁciency of deep robotic grasping. In ICRA, 2018

work page 2018

[4] [4]

A. X. Chang, T. A. Funkhouser, L. J. Guibas, P. Han- rahan, Q.-X. Huang, Z. Li, S. Savarese, M. Savva, S. Song, H. Su, J. Xiao, L. Yi, and F. Yu. Shapenet: An information-rich 3d model repository. CoRR, 2015

work page 2015

[5] [5]

Coumans and Y

E. Coumans and Y . Bai. Pybullet, a python module for physics simulation for games, robotics and machine learning. http://pybullet.org, 2016–2019

work page 2016

[6] [6]

G. Csurka. Domain adaptation for visual applications: A comprehensive survey. CoRR, abs/1702.05374, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017

[7] [7]

A. Dai, D. Ritchie, M. Bokeloh, S. Reed, J. Sturm, and M. Nießner. Scancomplete: Large-scale scene completion and semantic segmentation for 3d scans. In CVPR, 2018

work page 2018

[8] [8]

Dang and P

H. Dang and P. K. Allen. Semantic grasping: planning task-speciﬁc stable robotic grasps.Autonomous Robots, 37(3):301–316, 2014

work page 2014

[9] [9]

C. M. Devin, E. Jang, S. Levine, and V . Vanhoucke. Grasp2vec: Learning object representations from self- supervised grasping. 2018

work page 2018

[10] [10]

Dogar, K

M. Dogar, K. Hsiao, M. Ciocarlie, and S. Srinivasa. Physics-based grasp planning through clutter. In Robotics: Science and Systems VIII, July 2012

work page 2012

[11] [11]

Eigen and R

D. Eigen and R. Fergus. Predicting depth, surface normals and semantic labels with a common multi- scale convolutional architecture. In ICCV, pages 2650– 2658, 2015

work page 2015

[12] [12]

S. A. Eslami, D. J. Rezende, F. Besse, F. Viola, A. S. Morcos, M. Garnelo, A. Ruderman, A. A. Rusu, I. Danihelka, K. Gregor, et al. Neural scene repre- sentation and rendering. Science, 2018

work page 2018

[13] [13]

H. Fan, H. Su, and L. J. Guibas. A point set genera- tion network for 3d object reconstruction from a single image. In CVPR, pages 2463–2471, 2017

work page 2017

[14] [14]

K. Fang, Y . Bai, S. Hinterstoißer, S. Savarese, and M. Kalakrishnan. Multi-task domain adaptation for deep learning of instance grasping from simulation. ICRA, pages 3516–3523, 2018

work page 2018

[15] [15]

Gadelha, R

M. Gadelha, R. Wang, and S. Maji. Multiresolution tree networks for 3d point cloud processing. In ECCV, 2018

work page 2018

[16] [16]

Ganapathi-Subramanian, O

V . Ganapathi-Subramanian, O. Diamanti, S. Pirk, C. Tang, M. Niessner, and L. Guibas. Parsing geometry using structure-aware shape templates. In 3DV, 2018

work page 2018

[17] [17]

R. Garg, V . K. BG, G. Carneiro, and I. Reid. Unsuper- vised cnn for single view depth estimation: Geometry to the rescue. In ECCV, pages 740–756, 2016

work page 2016

[18] [18]

Goldfeder, M

C. Goldfeder, M. Ciocarlie, H. Dang, and P. K. Allen. The columbia grasp database. In ICRA, 2009

work page 2009

[19] [19]

Gualtieri, A

M. Gualtieri, A. ten Pas, K. Saenko, and R. Platt. High precision grasp pose detection in dense clutter. InIROS. IEEE, 2016

work page 2016

[20] [20]

Gualtieri, A

M. Gualtieri, A. ten Pas, K. Saenko, and R. Platt. High precision grasp pose detection in dense clutter. IROS, pages 598–605, 2016

work page 2016

[21] [21]

K. He, G. Gkioxari, P. Doll´ar, and R. Girshick. Mask R-CNN. In ICCV, 2017

work page 2017

[22] [22]

Henderson and V

P. Henderson and V . Ferrari. Learning to generate and reconstruct 3d meshes with only 2d supervision. In BMVC, 2018

work page 2018

[23] [23]

James, A

S. James, A. J. Davison, and E. Johns. Transferring end-to-end visuomotor control from simulation to real world for a multi-stage task. In CoRL, 2017

work page 2017

[24] [24]

James, P

S. James, P. Wohlhart, M. Kalakrishnan, D. Kalash- nikov, A. Irpan, J. Ibarz, S. Levine, R. Hadsell, and K. Bousmalis. Sim-to-real via sim-to-sim: Data- efﬁcient robotic grasping via randomized-to-canonical adaptation networks. 12 2018

work page 2018

[25] [25]

Jiang, S

L. Jiang, S. Shi, X. Qi, and J. Jia. Gal: Geometric ad- versarial loss for single-view 3d-object reconstruction. In ECCV, 2018

work page 2018

[26] [26]

Johns, S

E. Johns, S. Leutenegger, and A. J. Davison. Deep learning a grasp function for grasping under gripper pose uncertainty. In IROS, 2016

work page 2016

[27] [27]

H. Kato, Y . Ushiku, and T. Harada. Neural 3d mesh renderer. In CVPR, 2018

work page 2018

[28] [28]

D. Katz, A. Venkatraman, M. Kazemi, J. A. Bagnell, and A. Stentz. Perceiving, learning, and exploiting object affordances for autonomous pile manipulation. Autonomous Robots, 37(4):369–382, 2014

work page 2014

[29] [29]

Kopicki, R

M. Kopicki, R. Detry, M. Adjigble, R. Stolkin, A. Leonardis, and J. L. Wyatt. One-shot learning and generation of dexterous grasps for novel objects. Int. J. Robotics Res., 35(8):959–976, 2016

work page 2016

[30] [30]

I. Lenz, H. Lee, and A. Saxena. Deep learning for detecting robotic grasps. Int. J. Robotics Res. , 34(4- 5):705–724, 2015

work page 2015

[31]

B. León, S. Ulbrich, R. Diankov, G. Puche, M. Przybylski, A. Morales, T. Asfour, S. Moisio, J. Bohg, J. Kuffner, et al. Opengrasp: A toolkit for robot grasping simulation

[32]

S. Levine, P. Pastor, A. Krizhevsky, J. Ibarz, and D. Quillen. Learning hand-eye coordination for robotic grasping with deep learning and large-scale data collection. Int. J. Robotics Res., page 0278364917710318

[33]

M. Li, K. Hang, D. Kragic, and A. Billard. Dexterous grasping under shape uncertainty. Robot. Autonom. Syst., 75:352–364, 2016

[34]

Y. Li, A. Dai, L. Guibas, and M. Niessner. Database-assisted object retrieval for real-time 3d reconstruction. Comput. Graph. Forum, 34(2):435–446, 2015

[35]

S. Liu, W. Chen, T. Li, and H. Li. Soft rasterizer: Differentiable rendering for unsupervised single-view mesh reconstruction. CoRR, 2019

[36]

J. Mahler, J. Liang, S. Niyaz, M. Laskey, R. Doan, X. Liu, J. A. Ojea, and K. Goldberg. Dex-net 2.0: Deep learning to plan robust grasps with synthetic point clouds and analytic grasp metrics. 2017

[37]

J. Mahler, F. T. Pokorny, B. Hou, M. Roderick, M. Laskey, M. Aubry, K. Kohlhoff, T. Kröger, J. Kuffner, and K. Goldberg. Dex-net 1.0: A cloud-based network of 3d objects for robust grasp planning using a multi-armed bandit model with correlated rewards. In ICRA, 2016

[38]

L. Montesano and M. Lopes. Active learning of visual descriptors for grasping using non-parametric smoothed beta distributions. Robot. Autonom. Syst., 60(3):452–462, 2012

[39]

R. A. Newcombe, S. Izadi, O. Hilliges, D. Molyneaux, D. Kim, A. J. Davison, P. Kohli, J. Shotton, S. Hodges, and A. Fitzgibbon. Kinectfusion: Real-time dense surface mapping and tracking. In ISMAR, pages 127–136, 2011

[40]

D. T. Nguyen, B. Hua, M. Tran, Q. Pham, and S. Yeung. A field model for repairing 3d shapes. In CVPR, pages 5676–5684, 2016

[41]

E. Nikandrova and V. Kyrki. Category-based task specific grasping. Robot. Autonom. Syst., 70:25–35, 2015

[42]

T. Osa, J. Peters, and G. Neumann. Experiments with hierarchical reinforcement learning of multiple grasping policies. In ISER, pages 160–172. Springer, 2016

[43]

V. M. Patel, R. Gopalan, R. Li, and R. Chellappa. Visual domain adaptation: A survey of recent advances. IEEE Signal Processing Magazine, 32(3):53–69, 2015

[44]

L. Pinto and A. Gupta. Supersizing self-supervision: Learning to grasp from 50k tries and 700 robot hours. In ICRA, 2016

[45]

C. R. Qi, H. Su, K. Mo, and L. J. Guibas. Pointnet: Deep learning on point sets for 3d classification and segmentation. In CVPR, pages 652–660, 2017

[46]

R. Rubinstein and D. Kroese. The cross-entropy method: A unified approach to combinatorial optimization, monte-carlo simulation, and machine learning. 2004

[47]

A. Saxena, J. Driemeyer, and A. Y. Ng. Robotic grasping of novel objects using vision. Int. J. Robotics Res., 27(2):157–173, 2008

[48]

S. Song, F. Yu, A. Zeng, A. X. Chang, M. Savva, and T. A. Funkhouser. Semantic scene completion from a single depth image. In CVPR, 2017

[49]

J. Tobin, R. Fong, A. Ray, J. Schneider, W. Zaremba, and P. Abbeel. Domain randomization for transferring deep neural networks from simulation to the real world. IROS, pages 23–30, 2017

[50]

N. Vahrenkamp, L. Westkamp, N. Yamanobe, E. E. Aksoy, and T. Asfour. Part-based grasp planning for familiar objects. In Humanoid Robots (Humanoids), pages 919–925, 2016

[51]

J. Varley, C. DeChant, A. Richardson, A. Nair, J. Ruales, and P. Allen. Shape completion enabled robotic grasping. 2016

[52]

U. Viereck, A. ten Pas, K. Saenko, and R. Platt. Learning a visuomotor controller for real world robotic grasping using easily simulated depth images. CoRR, abs/1706.04652, 2017

[53]

C. Wang, D. Xu, Y. Zhu, R. Martín-Martín, C. Lu, L. Fei-Fei, and S. Savarese. Densefusion: 6d object pose estimation by iterative dense fusion. In CVPR, 2019

[54]

N. Wang, Y. Zhang, Z. Li, Y. Fu, W. Liu, and Y.-G. Jiang. Pixel2mesh: Generating 3d mesh models from single rgb images. In ECCV, 2018

[55]

S. Wang, J. Wu, X. Sun, W. Yuan, W. T. Freeman, J. B. Tenenbaum, and E. H. Adelson. 3d shape perception from monocular vision, touch, and shape priors. In IROS. IEEE, 2018

[56]

J. Wu, C. Zhang, T. Xue, B. Freeman, and J. Tenenbaum. Learning a probabilistic latent space of object shapes via 3d generative-adversarial modeling. In NeurIPS, 2016

[57]

D. Xu, D. Anguelov, and A. Jain. Pointfusion: Deep sensor fusion for 3d bounding box estimation. In CVPR, pages 244–253, 2018

[58]

X. Yan, J. Hsu, M. Khansari, Y. Bai, A. Pathak, A. Gupta, J. Davidson, and H. Lee. Learning 6-dof grasping interaction via deep geometry-aware 3d representations. In ICRA, 2018

[59]

X. Yan, J. Yang, E. Yumer, Y. Guo, and H. Lee. Perspective transformer nets: Learning single-view 3d object reconstruction without 3d supervision. In NeurIPS, 2016

[60]

T. Zhou, M. Brown, N. Snavely, and D. G. Lowe. Unsupervised learning of depth and ego-motion from video. In CVPR, 2017