Rotation-Aware Point-Cloud Embeddings for Vision-Based In-Hand Reorientation

Karthik Dantu; Yashom Dighe

arxiv: 2606.21788 · v1 · pith:WWRO4IVJnew · submitted 2026-06-19 · 💻 cs.RO · cs.CV

Rotation-Aware Point-Cloud Embeddings for Vision-Based In-Hand Reorientation

Yashom Dighe , Karthik Dantu This is my paper

Pith reviewed 2026-06-26 13:52 UTC · model grok-4.3

classification 💻 cs.RO cs.CV

keywords point-cloud embeddingin-hand reorientationrotation-aware representationmodel-free reinforcement learningSO(3) geodesic distancedexterous manipulationvision-based controlgoal-conditioned policy

0 comments

The pith

A learned embedding makes Euclidean distance between point clouds equal the SO(3) geodesic rotation error, letting model-free RL policies reorient objects from raw current and goal clouds alone.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that raw point-cloud goals for in-hand reorientation entangle rotation with sampling and visibility noise, so prior methods add external pose, flow, or teacher signals. It trains an embedding network so that the straight-line distance in its latent space matches the true rotation difference measured along the shortest path on the sphere of orientations. With this calibrated distance as the comparison signal, plus only proprioception and centroid data, a reinforcement-learning policy learns the task end-to-end without ever receiving object pose, relative pose, dense correspondences, or distilled actions. The same interface reaches the success rates of privileged-state and distillation baselines while removing their test-time brittle modules. The work also finds that generic point-cloud pretraining does not suffice, because it keeps shape but drops the rotation state needed for current-to-goal comparison.

Core claim

We close this gap by learning a rotation-aware point-cloud embedding whose Euclidean latent distance is calibrated to the SO(3) geodesic error between object orientations. The resulting representation turns current-goal comparison into a smooth control signal, allowing a model-free RL policy to act from current and goal point-cloud embeddings, proprioception, and centroid metadata, without object pose, relative pose, dense flow, or teacher-action supervision. In in-hand reorientation experiments, this interface matches privileged-state and distillation-based baselines while avoiding brittle test-time computation of structured pose or flow inputs. These results suggest that point-cloud goals

What carries the argument

The rotation-aware point-cloud embedding: a neural network trained so its output Euclidean distance equals the geodesic rotation error on SO(3) for any pair of object point clouds.

If this is right

Model-free RL policies can solve in-hand reorientation directly from current and goal point clouds without any pose estimation or flow module at test time.
The same embedding supplies a usable policy gradient signal that matches the performance of methods relying on privileged state or teacher distillation.
Generic visual pretraining on point clouds is not enough, because it preserves shape but discards the rotation state required for goal comparison.
Point-cloud goals become a practical interface once the representation itself carries the rotation geometry instead of external structure.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same calibrated-distance idea could be tested on other rotation-heavy tasks such as peg insertion or lid opening by swapping the object class.
If the embedding generalizes across object shapes, it would let a single policy handle families of objects without per-object pose frames.
Replacing external pose solvers with an end-to-end learned metric reduces the number of brittle components that must be maintained at deployment.

Load-bearing premise

A neural network can be trained so that its latent Euclidean distance reliably equals the true SO(3) geodesic rotation error across arbitrary sampled point clouds of the object.

What would settle it

Train the embedding and then measure the correlation between its latent distances and actual rotation angles on a large held-out set of point-cloud pairs; if the correlation is near zero or the RL policy using the embedding fails to reach reorientation success rates above random, the central claim is false.

Figures

Figures reproduced from arXiv: 2606.21788 by Karthik Dantu, Yashom Dighe.

**Figure 1.** Figure 1: Policy rollout. We train a rotation-aware point-cloud encoder that enables direct RL policy learning from observed and goal point clouds, rotating the object toward the target geometry without object pose, relative pose, dense flow, or teacher-action supervision. Abstract: Point-cloud goals provide a direct way to specify dexterous in-hand reorientation: instead of defining an object-specific pose frame or… view at source ↗

**Figure 2.** Figure 2: Overview of the method. (a) Encoder architecture. (b) Encoder training uses paired ro [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗

**Figure 3.** Figure 3: Cosine similarity between embedding deltas induced by rotations around different axes for pear [PITH_FULL_IMAGE:figures/full_fig_p006_3.png] view at source ↗

**Figure 4.** Figure 4: Pear t-SNE. To test this, we perform two controlled diagnostics. In the first, we rotate the object by the same angle about different axes and compute the cosine similarity between the resulting embeddings. In the second, we rotate the object by different angles about the same axis and again measure cosine similarity between the embeddings. This allows us to test whether embeddings are organized by rotati… view at source ↗

**Figure 5.** Figure 5: Latent embedding distance as a function of simultaneous rotations around pairs of axes [PITH_FULL_IMAGE:figures/full_fig_p007_5.png] view at source ↗

**Figure 6.** Figure 6: Occlusion curriculum visualizations for Rubik’s cube (top row) and pear (bottom row). [PITH_FULL_IMAGE:figures/full_fig_p013_6.png] view at source ↗

**Figure 7.** Figure 7: Rendered RGB-D fine-tuning examples for Rubik’s cube (top row) and pear (bottom row) [PITH_FULL_IMAGE:figures/full_fig_p013_7.png] view at source ↗

**Figure 8.** Figure 8: Smoothed policy learning curves for all four evaluated YCB objects. Columns (a)–(d) [PITH_FULL_IMAGE:figures/full_fig_p016_8.png] view at source ↗

**Figure 9.** Figure 9: Qualitative t-SNE visualizations of rotation samples from our encoder, colored by rotation [PITH_FULL_IMAGE:figures/full_fig_p016_9.png] view at source ↗

**Figure 10.** Figure 10: Cosine similarity of embedding displacement directions under controlled single-axis ro [PITH_FULL_IMAGE:figures/full_fig_p017_10.png] view at source ↗

**Figure 11.** Figure 11: Cosine similarity of embedding displacement directions under controlled single-axis [PITH_FULL_IMAGE:figures/full_fig_p018_11.png] view at source ↗

**Figure 12.** Figure 12: Cosine similarity of embedding displacement directions under controlled single-axis [PITH_FULL_IMAGE:figures/full_fig_p019_12.png] view at source ↗

**Figure 13.** Figure 13: Angle t-SNE visualizations for the medium clamp, scissors, and large marker. Marker [PITH_FULL_IMAGE:figures/full_fig_p020_13.png] view at source ↗

**Figure 14.** Figure 14: Axis similarity diagnostics for the three symmetry stress cases. The high cross-axis [PITH_FULL_IMAGE:figures/full_fig_p021_14.png] view at source ↗

**Figure 15.** Figure 15: Raw colored point-cloud renders and overlaid comparison clouds for selected high-cosine [PITH_FULL_IMAGE:figures/full_fig_p022_15.png] view at source ↗

**Figure 16.** Figure 16: Latent embedding distance as a function of simultaneous rotations around pairs of axes [PITH_FULL_IMAGE:figures/full_fig_p023_16.png] view at source ↗

**Figure 17.** Figure 17: Latent embedding distance as a function of simultaneous rotations around pairs of axes [PITH_FULL_IMAGE:figures/full_fig_p024_17.png] view at source ↗

**Figure 18.** Figure 18: Latent embedding distance as a function of simultaneous rotations around pairs of axes [PITH_FULL_IMAGE:figures/full_fig_p025_18.png] view at source ↗

read the original abstract

Point-cloud goals provide a direct way to specify dexterous in-hand reorientation: instead of defining an object-specific pose frame or estimating 6D pose at test time, the policy is given the desired 3D geometry of the object. Yet raw point-cloud goal conditioning is poorly conditioned for policy learning. Current and goal clouds are unordered, independently sampled, and often visibility-dependent, so their discrepancy entangles object rotation with permutation, resampling, and unstable correspondence structure. For this reason, prior point-cloud manipulation methods typically add structure outside the representation itself, such as explicit pose or relative-pose inputs, dense flow features, or distillation from privileged teachers. We close this gap by learning a rotation-aware point-cloud embedding whose Euclidean latent distance is calibrated to the SO(3) geodesic error between object orientations. The resulting representation turns current-goal comparison into a smooth control signal, allowing a model-free RL policy to act from current and goal point-cloud embeddings, proprioception, and centroid metadata, without object pose, relative pose, dense flow, or teacher-action supervision. In in-hand reorientation experiments, this interface matches privileged-state and distillation-based baselines while avoiding brittle test-time computation of structured pose or flow inputs. These results suggest that point-cloud goals become practical for this task when the representation, rather than an external module, encodes the task-relevant geometry of rotation. We also show evidence that generic visual point-cloud pretraining is insufficient for such a current-goal comparison because it discards the task-relevant state and preserves only shape features.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper trains a point-cloud embedding so Euclidean latent distance approximates SO(3) geodesic rotation error, letting model-free RL compare current and goal clouds directly without pose or flow modules.

read the letter

The main thing here is that they learn an embedding where the latent Euclidean distance is calibrated to match SO(3) geodesic error between object orientations. This lets the RL policy treat current-goal point cloud comparison as a smooth signal using only the embeddings, proprioception, and centroid info.

What the work actually does is remove the need for external pose estimation, relative pose, dense flow, or teacher distillation at test time. The experiments report that this matches privileged-state and distillation baselines on in-hand reorientation while avoiding brittle test-time modules. They also show that generic visual pretraining does not work for this comparison task because it drops the rotation-relevant state.

The calibration step is the concrete new piece relative to prior point-cloud manipulation methods. It is a direct attempt to make the representation itself encode the task geometry instead of bolting on structure afterward.

The soft spot is the stress-test concern about independent sampling and visibility mismatch. Current and goal clouds differ in point selection and occlusion, so any residual permutation or density signal in the latent distance would pollute the control signal. The abstract notes generic pretraining fails for exactly this reason, but without seeing the loss details, ablations, or quantitative checks on how well the task-specific training suppresses those factors, it is hard to judge whether the claimed clean gradient signal holds. If those checks are weak, the advantage over prior methods shrinks.

This is for people working on vision-based dexterous manipulation who want simpler goal interfaces. It has a clear method, a practical motivation, and baseline comparisons, so it deserves peer review even if the embedding robustness needs closer examination.

Referee Report

1 major / 1 minor

Summary. The paper claims to close a gap in vision-based in-hand reorientation by learning a rotation-aware point-cloud embedding f such that the Euclidean distance ||f(P) - f(Q)|| is calibrated to the SO(3) geodesic rotation error between object orientations. This calibrated representation supplies a smooth control signal for a model-free RL policy that takes current/goal embeddings, proprioception, and centroid metadata as input, without requiring object pose, relative pose, dense flow, or teacher-action supervision. Experiments on in-hand reorientation tasks show performance matching privileged-state and distillation baselines, while also demonstrating that generic point-cloud pretraining is insufficient because it discards task-relevant rotation state.

Significance. If the calibration result holds under independent sampling, the work would make point-cloud goals practical for dexterous manipulation by moving the necessary structure into the learned representation rather than external modules. The explicit demonstration that generic pretraining fails for current-goal comparison is a useful negative result that clarifies the requirements for task-specific embeddings.

major comments (1)

[§3] §3 (distance-calibration loss): the central claim requires that ||f(P)-f(Q)|| equals the SO(3) geodesic even when P and Q are independently sampled point clouds with different visibility and point density. The skeptic concern is load-bearing: any residual permutation or density variance orthogonal to rotation would inject spurious components into the latent distance and undermine the claim that the embedding alone supplies a usable, smooth policy gradient without correspondence structure. The manuscript must either provide quantitative bounds on how well the task-specific loss overcomes sampling variance or show ablations that isolate this factor.

minor comments (1)

The abstract states that the embedding 'turns current-goal comparison into a smooth control signal,' but the precise form of the RL reward or policy input that uses this distance should be stated explicitly in the methods for reproducibility.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive feedback and the opportunity to clarify the robustness of the distance calibration. We address the concern about independent sampling variance below.

read point-by-point responses

Referee: [§3] §3 (distance-calibration loss): the central claim requires that ||f(P)-f(Q)|| equals the SO(3) geodesic even when P and Q are independently sampled point clouds with different visibility and point density. The skeptic concern is load-bearing: any residual permutation or density variance orthogonal to rotation would inject spurious components into the latent distance and undermine the claim that the embedding alone supplies a usable, smooth policy gradient without correspondence structure. The manuscript must either provide quantitative bounds on how well the task-specific loss overcomes sampling variance or show ablations that isolate this factor.

Authors: The embedding is trained precisely on pairs (P, Q) that are independently sampled from the object mesh at random orientations, with randomized point counts (500–2000), density variation, and partial visibility to simulate sensor conditions. The loss directly regresses the latent Euclidean distance to the SO(3) geodesic on these pairs. Experiments already evaluate calibration on held-out independently sampled test pairs and show strong correlation. To directly address the request, the revision will add (i) an ablation isolating sampling variance by comparing calibration error of the task-specific loss versus a generic point-cloud autoencoder (confirming the latter fails to calibrate under density/visibility changes) and (ii) quantitative bounds via mean absolute deviation |‖f(P)−f(Q)‖ − geodesic| stratified by point density and visibility level. revision: yes

Circularity Check

0 steps flagged

No circularity; empirical training result with no self-referential derivation

full rationale

The provided abstract and description contain no equations, no derivation chain, and no load-bearing self-citations. The central claim is an empirical statement that a neural embedding can be trained such that its Euclidean distance approximates SO(3) geodesic error; this is presented as the outcome of a distance-calibration loss rather than a mathematical reduction that equals its own inputs by construction. No fitted parameter is renamed as a prediction, no uniqueness theorem is invoked from prior self-work, and no ansatz is smuggled via citation. The paper is therefore self-contained as a standard empirical result evaluated against baselines, warranting a score of 0.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only; no explicit free parameters, axioms, or invented entities are stated. The central claim implicitly rests on the learnability of the desired embedding property.

pith-pipeline@v0.9.1-grok · 5811 in / 1209 out tokens · 18791 ms · 2026-06-26T13:52:06.102117+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

39 extracted references · 6 linked inside Pith

[1]

Huang, I

W. Huang, I. Mordatch, P. Abbeel, and D. Pathak. Generalization in dexterous manipulation via geometry-aware multi-task learning.arXiv preprint arXiv:2111.03062, 2021

arXiv 2021
[2]

W. Zhou, B. Jiang, F. Yang, C. Paxton, and D. Held. Hacman: Learning hybrid actor-critic maps for 6d non-prehensile manipulation.arXiv preprint arXiv:2305.03942, 2023

arXiv 2023
[3]

B. Wen, W. Yang, J. Kautz, and S. Birchfield. Foundationpose: Unified 6d pose estimation and tracking of novel objects, 2024. URLhttps://arxiv.org/abs/2312.08344

arXiv 2024
[4]

Hodan, F

T. Hodan, F. Michel, E. Brachmann, W. Kehl, A. GlentBuch, D. Kraft, B. Drost, J. Vidal, S. Ihrke, X. Zabulis, et al. Bop: Benchmark for 6d object pose estimation. InProceedings of the European conference on computer vision (ECCV), pages 19–34, 2018

2018
[5]

T. Chen, M. Tippur, S. Wu, V . Kumar, E. Adelson, and P. Agrawal. Visual dexterity: In-hand reorientation of novel and complex object shapes.Science Robotics, 8(84):eadc9244, 2023

2023
[6]

T. Chen, J. Xu, and P. Agrawal. A system for general in-hand object re-orientation.Conference on Robot Learning, 2021

2021
[7]

Kaya and H

M. Kaya and H. S ¸. Bilge. Deep metric learning: A survey.Symmetry, 11(9):1066, 2019

2019
[8]

C. R. Qi, L. Yi, H. Su, and L. J. Guibas. Pointnet++: Deep hierarchical feature learning on point sets in a metric space.Advances in neural information processing systems, 30, 2017

2017
[9]

Bromley, I

J. Bromley, I. Guyon, Y . LeCun, E. S ¨ackinger, and R. Shah. Signature verification using a” siamese” time delay neural network.Advances in neural information processing systems, 6, 1993

1993
[10]

Calli, A

B. Calli, A. Singh, J. Bruce, A. Walsman, K. Konolige, S. Srinivasa, P. Abbeel, and A. M. Dol- lar. Yale-cmu-berkeley dataset for robotic manipulation research.The International Journal of Robotics Research, 36(3):261–268, 2017

2017
[11]

Mittal, P

M. Mittal, P. Roth, J. Tigue, A. Richard, O. Zhang, P. Du, A. Serrano-Mu ˜noz, X. Yao, R. Zurbr ¨ugg, N. Rudin, L. Wawrzyniak, M. Rakhsha, A. Denzler, E. Heiden, A. Borovicka, O. Ahmed, I. Akinola, A. Anwar, M. T. Carlson, J. Y . Feng, A. Garg, R. Gasoto, L. Gulich, Y . Guo, M. Gussert, A. Hansen, M. Kulkarni, C. Li, W. Liu, V . Makoviychuk, G. Malczyk, H...

Pith/arXiv arXiv 2025
[12]

https://www.shadowrobot.com/dexterous-hand-series/

Shadow Dexterous Hand Series - Research and Development Tool — shadowrobot.com. https://www.shadowrobot.com/dexterous-hand-series/. [Accessed 27-05-2026]

2026
[13]

Schulman, F

J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal policy optimization algorithms.arXiv preprint arXiv:1707.06347, 2017

Pith/arXiv arXiv 2017
[14]

Szeliski.Computer vision: algorithms and applications

R. Szeliski.Computer vision: algorithms and applications. Springer Nature, 2022. 9

2022
[15]

[Accessed 28-05-2026]

Colored point cloud registration - Open3D primary (unknown) documentation — open3d.org.https://www.open3d.org/docs/latest/tutorial/pipelines/colored_ pointcloud_registration.html. [Accessed 28-05-2026]

2026
[16]

Schwarke, M

C. Schwarke, M. Mittal, N. Rudin, D. Hoeller, and M. Hutter. Rsl-rl: A learning library for robotics research.arXiv preprint arXiv:2509.10771, 2025

arXiv 2025
[17]

Y . Pang, W. Wang, F. E. Tay, W. Liu, Y . Tian, and L. Yuan. Masked autoencoders for point cloud self-supervised learning. InComputer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part II, pages 604–621. Springer, 2022

2022
[18]

Van der Maaten and G

L. Van der Maaten and G. Hinton. Visualizing data using t-sne.Journal of machine learning research, 9(11), 2008

2008
[19]

A. V . Nair, V . Pong, M. Dalal, S. Bahl, S. Lin, and S. Levine. Visual reinforcement learning with imagined goals. volume 31, 2018

2018
[20]

V . H. Pong, M. Dalal, S. Lin, A. Nair, S. Bahl, and S. Levine. Skew-fit: State-covering self- supervised reinforcement learning.arXiv preprint arXiv:1903.03698, 2019

arXiv 1903
[21]

Andrychowicz, F

M. Andrychowicz, F. Wolski, A. Ray, J. Schneider, R. Fong, P. Welinder, B. McGrew, J. Tobin, O. Pieter Abbeel, and W. Zaremba. Hindsight experience replay. volume 30, 2017

2017
[22]

Z. Ren, K. Dong, Y . Zhou, Q. Liu, and J. Peng. Exploration via hindsight goal generation. volume 32, 2019

2019
[23]

Eysenbach, T

B. Eysenbach, T. Zhang, S. Levine, and R. R. Salakhutdinov. Contrastive learning as goal- conditioned reinforcement learning. volume 35, pages 35603–35620, 2022

2022
[24]

Mendonca, O

R. Mendonca, O. Rybkin, K. Daniilidis, D. Hafner, and D. Pathak. Discovering and achieving goals via world models. volume 34, pages 24379–24391, 2021

2021
[25]

Finn and S

C. Finn and S. Levine. Deep visual foresight for planning robot motion. In2017 IEEE inter- national conference on robotics and automation (ICRA), pages 2786–2793. IEEE, 2017

2017
[26]

Ebert, S

F. Ebert, S. Dasari, A. X. Lee, S. Levine, and C. Finn. Robustness via retrying: Closed-loop robotic manipulation with self-supervised learning. InConference on robot learning, pages 983–993. PMLR, 2018

2018
[27]

A. Xie, F. Ebert, S. Levine, and C. Finn. Improvisation through physical understanding: Using novel objects as tools with visual foresight. 2019

2019
[28]

Andrychowicz, B

OpenAI, M. Andrychowicz, B. Baker, M. Chociej, R. Jozefowicz, B. McGrew, J. Pachocki, A. Petron, M. Plappert, G. Powell, A. Ray, J. Schneider, S. Sidor, J. Tobin, P. Welin- der, L. Weng, and W. Zaremba. Learning dexterous in-hand manipulation, 2019. URL https://arxiv.org/abs/1808.00177

Pith/arXiv arXiv 2019
[29]

Myers, A

V . Myers, A. W. He, K. Fang, H. R. Walke, P. Hansen-Estruch, C.-A. Cheng, M. Jalobeanu, A. Kolobov, A. Dragan, and S. Levine. Goal representations for instruction following: A semi-supervised language interface to control. pages 3894–3908, 2023

2023
[30]

Belkhale, T

S. Belkhale, T. Ding, T. Xiao, P. Sermanet, Q. Vuong, J. Tompson, Y . Chebotar, D. Dwibedi, and D. Sadigh. Rt-h: Action hierarchies using language.arXiv preprint arXiv:2403.01823, 2024

Pith/arXiv arXiv 2024
[31]

Shridhar, L

M. Shridhar, L. Manuelli, and D. Fox. Perceiver-actor: A multi-task transformer for robotic manipulation. InConference on Robot Learning, pages 785–799. PMLR, 2023

2023
[32]

Huang, C

W. Huang, C. Wang, R. Zhang, Y . Li, J. Wu, and L. Fei-Fei. V oxposer: Composable 3d value maps for robotic manipulation with language models.arXiv preprint arXiv:2307.05973, 2023. 10

Pith/arXiv arXiv 2023
[33]

S. Chen, R. Garcia, C. Schmid, and I. Laptev. Polarnet: 3d point clouds for language-guided robotic manipulation.arXiv preprint arXiv:2309.15596, 2023

arXiv 2023
[34]

Gervet, Z

T. Gervet, Z. Xian, N. Gkanatsios, and K. Fragkiadaki. Act3d: 3d feature field transformers for multi-task robotic manipulation. InConference on Robot Learning (CoRL), 2023

2023
[35]

T.-W. Ke, N. Gkanatsios, and K. Fragkiadaki. 3d diffuser actor: Policy diffusion with 3d scene representations.arXiv preprint arXiv:2402.10885, 2024

Pith/arXiv arXiv 2024
[36]

T.-W. Ke, N. Gkanatsios, J. Xu, and K. Fragkiadaki. Bi3d diffuser actor: 3d policy diffusion for bi-manual robot manipulation. InCoRL 2024 Workshop on Mastering Robot Manipulation in a World of Abundant Data, 2024

2024
[37]

J. Wan, X. Liu, and Y . Dong. Dexremoe: In-hand reorientation of general object via mixtures of experts.arXiv preprint arXiv:2508.01695, 2025

arXiv 2025
[38]

H. Zhu, Y . Wang, D. Huang, W. Ye, W. Ouyang, and T. He. Point cloud matters: Rethinking the impact of different observation spaces on robot learning.Advances in Neural Information Processing Systems, 37:77799–77830, 2024

2024
[39]

Bartsch, A

A. Bartsch, A. Car, C. Avra, and A. B. Farimani. Sculptdiff: Learning robotic clay sculpting from humans with goal conditioned diffusion policy. In2024 IEEE/RSJ International Confer- ence on Intelligent Robots and Systems (IROS), pages 7307–7314. IEEE, 2024. 11 Appendix A1 Evidence for Sim-to-Real Transfer Our experiments are implemented in IsaacLab with ...

2024

[1] [1]

Huang, I

W. Huang, I. Mordatch, P. Abbeel, and D. Pathak. Generalization in dexterous manipulation via geometry-aware multi-task learning.arXiv preprint arXiv:2111.03062, 2021

arXiv 2021

[2] [2]

W. Zhou, B. Jiang, F. Yang, C. Paxton, and D. Held. Hacman: Learning hybrid actor-critic maps for 6d non-prehensile manipulation.arXiv preprint arXiv:2305.03942, 2023

arXiv 2023

[3] [3]

B. Wen, W. Yang, J. Kautz, and S. Birchfield. Foundationpose: Unified 6d pose estimation and tracking of novel objects, 2024. URLhttps://arxiv.org/abs/2312.08344

arXiv 2024

[4] [4]

Hodan, F

T. Hodan, F. Michel, E. Brachmann, W. Kehl, A. GlentBuch, D. Kraft, B. Drost, J. Vidal, S. Ihrke, X. Zabulis, et al. Bop: Benchmark for 6d object pose estimation. InProceedings of the European conference on computer vision (ECCV), pages 19–34, 2018

2018

[5] [5]

T. Chen, M. Tippur, S. Wu, V . Kumar, E. Adelson, and P. Agrawal. Visual dexterity: In-hand reorientation of novel and complex object shapes.Science Robotics, 8(84):eadc9244, 2023

2023

[6] [6]

T. Chen, J. Xu, and P. Agrawal. A system for general in-hand object re-orientation.Conference on Robot Learning, 2021

2021

[7] [7]

Kaya and H

M. Kaya and H. S ¸. Bilge. Deep metric learning: A survey.Symmetry, 11(9):1066, 2019

2019

[8] [8]

C. R. Qi, L. Yi, H. Su, and L. J. Guibas. Pointnet++: Deep hierarchical feature learning on point sets in a metric space.Advances in neural information processing systems, 30, 2017

2017

[9] [9]

Bromley, I

J. Bromley, I. Guyon, Y . LeCun, E. S ¨ackinger, and R. Shah. Signature verification using a” siamese” time delay neural network.Advances in neural information processing systems, 6, 1993

1993

[10] [10]

Calli, A

B. Calli, A. Singh, J. Bruce, A. Walsman, K. Konolige, S. Srinivasa, P. Abbeel, and A. M. Dol- lar. Yale-cmu-berkeley dataset for robotic manipulation research.The International Journal of Robotics Research, 36(3):261–268, 2017

2017

[11] [11]

Mittal, P

M. Mittal, P. Roth, J. Tigue, A. Richard, O. Zhang, P. Du, A. Serrano-Mu ˜noz, X. Yao, R. Zurbr ¨ugg, N. Rudin, L. Wawrzyniak, M. Rakhsha, A. Denzler, E. Heiden, A. Borovicka, O. Ahmed, I. Akinola, A. Anwar, M. T. Carlson, J. Y . Feng, A. Garg, R. Gasoto, L. Gulich, Y . Guo, M. Gussert, A. Hansen, M. Kulkarni, C. Li, W. Liu, V . Makoviychuk, G. Malczyk, H...

Pith/arXiv arXiv 2025

[12] [12]

https://www.shadowrobot.com/dexterous-hand-series/

Shadow Dexterous Hand Series - Research and Development Tool — shadowrobot.com. https://www.shadowrobot.com/dexterous-hand-series/. [Accessed 27-05-2026]

2026

[13] [13]

Schulman, F

J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal policy optimization algorithms.arXiv preprint arXiv:1707.06347, 2017

Pith/arXiv arXiv 2017

[14] [14]

Szeliski.Computer vision: algorithms and applications

R. Szeliski.Computer vision: algorithms and applications. Springer Nature, 2022. 9

2022

[15] [15]

[Accessed 28-05-2026]

Colored point cloud registration - Open3D primary (unknown) documentation — open3d.org.https://www.open3d.org/docs/latest/tutorial/pipelines/colored_ pointcloud_registration.html. [Accessed 28-05-2026]

2026

[16] [16]

Schwarke, M

C. Schwarke, M. Mittal, N. Rudin, D. Hoeller, and M. Hutter. Rsl-rl: A learning library for robotics research.arXiv preprint arXiv:2509.10771, 2025

arXiv 2025

[17] [17]

Y . Pang, W. Wang, F. E. Tay, W. Liu, Y . Tian, and L. Yuan. Masked autoencoders for point cloud self-supervised learning. InComputer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part II, pages 604–621. Springer, 2022

2022

[18] [18]

Van der Maaten and G

L. Van der Maaten and G. Hinton. Visualizing data using t-sne.Journal of machine learning research, 9(11), 2008

2008

[19] [19]

A. V . Nair, V . Pong, M. Dalal, S. Bahl, S. Lin, and S. Levine. Visual reinforcement learning with imagined goals. volume 31, 2018

2018

[20] [20]

V . H. Pong, M. Dalal, S. Lin, A. Nair, S. Bahl, and S. Levine. Skew-fit: State-covering self- supervised reinforcement learning.arXiv preprint arXiv:1903.03698, 2019

arXiv 1903

[21] [21]

Andrychowicz, F

M. Andrychowicz, F. Wolski, A. Ray, J. Schneider, R. Fong, P. Welinder, B. McGrew, J. Tobin, O. Pieter Abbeel, and W. Zaremba. Hindsight experience replay. volume 30, 2017

2017

[22] [22]

Z. Ren, K. Dong, Y . Zhou, Q. Liu, and J. Peng. Exploration via hindsight goal generation. volume 32, 2019

2019

[23] [23]

Eysenbach, T

B. Eysenbach, T. Zhang, S. Levine, and R. R. Salakhutdinov. Contrastive learning as goal- conditioned reinforcement learning. volume 35, pages 35603–35620, 2022

2022

[24] [24]

Mendonca, O

R. Mendonca, O. Rybkin, K. Daniilidis, D. Hafner, and D. Pathak. Discovering and achieving goals via world models. volume 34, pages 24379–24391, 2021

2021

[25] [25]

Finn and S

C. Finn and S. Levine. Deep visual foresight for planning robot motion. In2017 IEEE inter- national conference on robotics and automation (ICRA), pages 2786–2793. IEEE, 2017

2017

[26] [26]

Ebert, S

F. Ebert, S. Dasari, A. X. Lee, S. Levine, and C. Finn. Robustness via retrying: Closed-loop robotic manipulation with self-supervised learning. InConference on robot learning, pages 983–993. PMLR, 2018

2018

[27] [27]

A. Xie, F. Ebert, S. Levine, and C. Finn. Improvisation through physical understanding: Using novel objects as tools with visual foresight. 2019

2019

[28] [28]

Andrychowicz, B

OpenAI, M. Andrychowicz, B. Baker, M. Chociej, R. Jozefowicz, B. McGrew, J. Pachocki, A. Petron, M. Plappert, G. Powell, A. Ray, J. Schneider, S. Sidor, J. Tobin, P. Welin- der, L. Weng, and W. Zaremba. Learning dexterous in-hand manipulation, 2019. URL https://arxiv.org/abs/1808.00177

Pith/arXiv arXiv 2019

[29] [29]

Myers, A

V . Myers, A. W. He, K. Fang, H. R. Walke, P. Hansen-Estruch, C.-A. Cheng, M. Jalobeanu, A. Kolobov, A. Dragan, and S. Levine. Goal representations for instruction following: A semi-supervised language interface to control. pages 3894–3908, 2023

2023

[30] [30]

Belkhale, T

S. Belkhale, T. Ding, T. Xiao, P. Sermanet, Q. Vuong, J. Tompson, Y . Chebotar, D. Dwibedi, and D. Sadigh. Rt-h: Action hierarchies using language.arXiv preprint arXiv:2403.01823, 2024

Pith/arXiv arXiv 2024

[31] [31]

Shridhar, L

M. Shridhar, L. Manuelli, and D. Fox. Perceiver-actor: A multi-task transformer for robotic manipulation. InConference on Robot Learning, pages 785–799. PMLR, 2023

2023

[32] [32]

Huang, C

W. Huang, C. Wang, R. Zhang, Y . Li, J. Wu, and L. Fei-Fei. V oxposer: Composable 3d value maps for robotic manipulation with language models.arXiv preprint arXiv:2307.05973, 2023. 10

Pith/arXiv arXiv 2023

[33] [33]

S. Chen, R. Garcia, C. Schmid, and I. Laptev. Polarnet: 3d point clouds for language-guided robotic manipulation.arXiv preprint arXiv:2309.15596, 2023

arXiv 2023

[34] [34]

Gervet, Z

T. Gervet, Z. Xian, N. Gkanatsios, and K. Fragkiadaki. Act3d: 3d feature field transformers for multi-task robotic manipulation. InConference on Robot Learning (CoRL), 2023

2023

[35] [35]

T.-W. Ke, N. Gkanatsios, and K. Fragkiadaki. 3d diffuser actor: Policy diffusion with 3d scene representations.arXiv preprint arXiv:2402.10885, 2024

Pith/arXiv arXiv 2024

[36] [36]

T.-W. Ke, N. Gkanatsios, J. Xu, and K. Fragkiadaki. Bi3d diffuser actor: 3d policy diffusion for bi-manual robot manipulation. InCoRL 2024 Workshop on Mastering Robot Manipulation in a World of Abundant Data, 2024

2024

[37] [37]

J. Wan, X. Liu, and Y . Dong. Dexremoe: In-hand reorientation of general object via mixtures of experts.arXiv preprint arXiv:2508.01695, 2025

arXiv 2025

[38] [38]

H. Zhu, Y . Wang, D. Huang, W. Ye, W. Ouyang, and T. He. Point cloud matters: Rethinking the impact of different observation spaces on robot learning.Advances in Neural Information Processing Systems, 37:77799–77830, 2024

2024

[39] [39]

Bartsch, A

A. Bartsch, A. Car, C. Avra, and A. B. Farimani. Sculptdiff: Learning robotic clay sculpting from humans with goal conditioned diffusion policy. In2024 IEEE/RSJ International Confer- ence on Intelligent Robots and Systems (IROS), pages 7307–7314. IEEE, 2024. 11 Appendix A1 Evidence for Sim-to-Real Transfer Our experiments are implemented in IsaacLab with ...

2024