You Only Touch Once: 6-DoF Object Pose Estimation from Single Tactile Contact

Brian Sheil; Edward Adelson; Guangming Wang; Haonan Chen; Pengfei Ye; Yilun Du; Yixiong Jing; Yuxiang Ma

arxiv: 2606.28899 · v1 · pith:ICSJC3CSnew · submitted 2026-06-27 · 💻 cs.RO

Pith reviewed 2026-06-30 09:46 UTC · model grok-4.3

classification 💻 cs.RO

keywords tactile pose estimation · 6-DoF object pose · GelSight sensor · coarse-to-fine localization · normal-aware SVD solver · robotic manipulation · vision-free sensing · contact localization

The pith

Two simultaneous tactile contacts recover an object's full 6-DoF pose through surface localization and a closed-form solver.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces YOTO, a tactile-only method that estimates the complete six-degree-of-freedom pose of an object from one pair of simultaneous contacts without any motion history or visual input. Each contact is captured as a local 3D point cloud and mapped to the object surface by a coarse-to-fine neural network; the two localized points together with known sensor poses are then passed to a normal-aware SVD solver that yields the rigid transformation in a single closed-form step. Pretraining on virtual patches from the object model followed by light fine-tuning on real contacts reduces the need for extensive physical data collection. A reader would care because the approach targets scenarios where cameras fail due to occlusion, lighting, or surface properties, offering a direct alternative for robotic manipulation tasks.

Core claim

YOTO recovers the full 6-DoF object pose from a single pair of simultaneous tactile contacts by representing each contact as a local 3D point cloud, localizing the contacts on the object surface with a coarse-to-fine network that is pretrained on virtual tactile patches and fine-tuned on a small number of real contacts, and then feeding the localized contacts along with calibrated sensor poses into a closed-form normal-aware SVD solver that computes the pose in one step.

What carries the argument

The closed-form normal-aware SVD solver that computes the rigid 6-DoF transformation directly from two localized contact points and their associated sensor poses.
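As a rough illustration of what such a solver can look like, here is a minimal sketch of a Kabsch/Arun-style SVD alignment that stacks point and normal correspondences into one cross-covariance matrix. It is not the paper's solver: the function name `solve_pose_points_normals`, the `normal_weight` parameter, the synthetic contacts, and the convention that localized and sensed normals point the same way are all assumptions made for this example.

```python
import numpy as np

def solve_pose_points_normals(p_obj, n_obj, q_world, m_world, normal_weight=1.0):
    """Closed-form rigid fit q ~ R p + t, m ~ R n (hypothetical helper).

    p_obj, n_obj: (N, 3) contact points and unit normals in the object frame.
    q_world, m_world: (N, 3) the same contacts expressed in the world frame.
    normal_weight: assumed scalar trading off normal terms against point terms.
    Returns a 4x4 homogeneous transform from object frame to world frame.
    """
    p_obj, n_obj = np.asarray(p_obj, float), np.asarray(n_obj, float)
    q_world, m_world = np.asarray(q_world, float), np.asarray(m_world, float)

    p_bar, q_bar = p_obj.mean(axis=0), q_world.mean(axis=0)
    # Cross-covariance: centred points plus weighted normals. Normals are
    # directions, so they are not centred.
    H = (p_obj - p_bar).T @ (q_world - q_bar) + normal_weight * (n_obj.T @ m_world)

    U, _, Vt = np.linalg.svd(H)
    # Flip the last axis if needed so that R is a proper rotation (det = +1).
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T

    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = q_bar - R @ p_bar
    return T


# Synthetic check with two contacts and a made-up ground-truth pose.
angle = np.deg2rad(30.0)
R_true = np.array([[np.cos(angle), -np.sin(angle), 0.0],
                   [np.sin(angle),  np.cos(angle), 0.0],
                   [0.0, 0.0, 1.0]])
t_true = np.array([0.10, -0.05, 0.20])
p_obj = np.array([[0.03, 0.00, 0.01], [-0.02, 0.04, 0.00]])
n_obj = np.array([[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
q_world = p_obj @ R_true.T + t_true   # rows are R p + t
m_world = n_obj @ R_true.T            # rows are R n
T_est = solve_pose_points_normals(p_obj, n_obj, q_world, m_world)
```

With only two contacts, the centred points contribute a single direction (the baseline between the contacts), so the normals carry the remaining rotational information. In this sketch, if both normals were parallel to that baseline, the rotation about it would be unobservable.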

If this is right

The system produces accurate localization and pose estimates across four geometrically diverse objects.
Performance exceeds both vision-based and purely geometric baselines, particularly under conditions where visual sensing is unreliable.
The method functions with object models obtained from consumer-grade mobile scans, though with a measurable accuracy reduction relative to CAD models.
Virtual pretraining plus limited real fine-tuning suffices to train the localization network without large real-world datasets.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The one-step solver could be extended to incorporate additional contacts for improved robustness if the normal-aware formulation is generalized.
Integration with force or slip sensing might allow the same contacts to support both pose estimation and grasp stability checks.
The approach could be tested on sequences of touches to handle cases where two contacts are insufficient due to symmetry.
Performance on objects with deformable surfaces would test the rigid-body assumption implicit in the SVD step.

Load-bearing premise

The coarse-to-fine localization network, after pretraining on virtual tactile patches and fine-tuning on few real contacts, accurately maps real tactile point clouds to positions on the object surface.

What would settle it

Large errors relative to ground-truth 6-DoF poses, measured by an external tracking system on the same objects, when the method is run on real GelSight contacts from two simultaneous touches.
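A check of this kind needs a pose error metric. The sketch below shows one common choice, Euclidean translation error plus the geodesic rotation angle, using only the standard library. The function name `pose_errors` and the example poses are assumptions for illustration; the paper's own evaluation protocol is not given in the text above.

```python
import math

def pose_errors(T_est, T_gt):
    """Return (translation error, rotation error in degrees) for two 4x4 poses.

    Poses are nested lists in row-major order; the translation error is in
    whatever length unit the poses use.
    """
    t_err = math.sqrt(sum((T_est[i][3] - T_gt[i][3]) ** 2 for i in range(3)))
    # trace(R_est^T R_gt) equals the sum of elementwise products of the two
    # rotation blocks.
    trace = sum(T_est[i][j] * T_gt[i][j] for i in range(3) for j in range(3))
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    return t_err, math.degrees(math.acos(cos_angle))


# Made-up example: estimate is off by 10 degrees about z and 5 mm in the plane.
c, s = math.cos(math.radians(10.0)), math.sin(math.radians(10.0))
T_gt = [[1.0, 0.0, 0.0, 0.0],
        [0.0, 1.0, 0.0, 0.0],
        [0.0, 0.0, 1.0, 0.0],
        [0.0, 0.0, 0.0, 1.0]]
T_est = [[c, -s, 0.0, 0.003],
         [s,  c, 0.0, 0.004],
         [0.0, 0.0, 1.0, 0.0],
         [0.0, 0.0, 0.0, 1.0]]
t_err, rot_err_deg = pose_errors(T_est, T_gt)
```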

Figures

Figures reproduced from arXiv: 2606.28899 by Brian Sheil, Edward Adelson, Guangming Wang, Haonan Chen, Pengfei Ye, Yilun Du, Yixiong Jing, Yuxiang Ma.

**Figure 2.** YOTO system pipeline (Sec. 3): surface representation, coarse-to-fine localization with virtual pretraining and few-shot real fine-tuning, and a closed-form normal-aware SVD pose solver. Given an object model M and a simultaneous dual-GelSight contact, YOTO outputs the 6-DoF pose $\hat{T}^{W}_{O} \in SE(3)$

**Figure 3.** Coarse-to-fine tactile surface localization network (Sec.

**Figure 4.** Representative tactile contact predictions (blue) vs. ground-truth contact regions (red) for the four

**Figure 5.** Per-frame 6-DoF tracking error for a representative trajectory per object (top: translation; bottom:

**Figure 6.** CAD source meshes (grey) versus KIRI Engine mobile scans (yellow) for the four evaluation objects.

**Figure 7.** The pair of 3D-printed L-shaped handheld rigs used for real-world data collection (Sec.

Original abstract

Accurate 6-DoF object pose estimation is fundamental to robotic manipulation, yet vision-based methods often fail under occlusion, poor lighting, and reflective or transparent surfaces. We present YOTO, a tactile-only pose estimation system that recovers the full 6-DoF object pose from a single pair of simultaneous contacts, without requiring contact history. YOTO represents each tactile contact as a local 3D point cloud and localizes it on the object surface through a coarse-to-fine network. The two localized contacts, together with the calibrated sensor poses, are then fed to a closed-form normal-aware SVD solver that recovers the full 6-DoF object pose in one step. To reduce real-data requirements, the localization network is pretrained on virtual tactile patches sampled from the object model and fine-tuned with a small number of real contacts. We further show that YOTO can operate on object models reconstructed from consumer-grade mobile scans, and quantify the gap relative to CAD-based models. Experiments on four geometrically diverse objects demonstrate accurate tactile contact localization and pose estimation, outperforming vision-based and geometric baselines, especially when visual perception is unreliable. Code, trained models, and the real GelSight dataset will be released upon publication.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper gives a clean tactile pipeline from two contacts to 6-DoF pose via virtual-pretrained localization plus closed-form SVD, but the real-data localization accuracy remains the unproven step.

The new piece is using a single simultaneous pair of contacts, localizing each via a coarse-to-fine network that starts from virtual patches off the object mesh, then feeding the results plus sensor poses into a normal-aware SVD that solves for full pose in closed form. That combination is not in the cited prior work.

The algebraic solver is a genuine plus: once the contacts are localized correctly it needs no learned parameters and is exact. Pretraining on virtual data plus light real fine-tuning also lowers the data requirement, and testing on mobile-scanned models instead of CAD is a practical touch.

The weakest link is still the localization network on real GelSight data. Virtual pretraining helps, but contact geometry, gel deformation, and sensor noise create domain shift that the SVD cannot fix. If the surface correspondences are off by more than a few millimeters the final pose will be wrong, and the abstract gives no error distributions or ablation numbers to show how often that happens. The stress-test concern lands.

This is for robotics groups already working with GelSight or similar tactile sensors who need pose under occlusion. The solver and data-release plan make it reproducible enough to referee. I would send it out for review rather than desk-reject; the experiments will decide whether the localization step actually delivers.

Referee Report

2 major / 1 minor

Summary. The paper presents YOTO, a tactile-only 6-DoF pose estimation pipeline that represents each of two simultaneous contacts as a local 3D point cloud, localizes them on the object surface via a coarse-to-fine network (pretrained on virtual patches from the object model and fine-tuned on a small real GelSight set), and recovers the full pose in one step by feeding the localized positions+normals plus known sensor poses into a closed-form normal-aware SVD solver. It reports accurate localization and pose results on four geometrically diverse objects, outperforming vision-based and geometric baselines (especially under poor visual conditions), and shows viability with mobile-scanned object models.

Significance. If the real-data localization accuracy holds, the work offers a practical tactile alternative for pose estimation where vision fails, with the algebraic SVD step providing an exact, parameter-free recovery given accurate inputs and the virtual pretraining strategy lowering real-data requirements. The planned release of code, trained models, and the real GelSight dataset is a clear strength that supports reproducibility.

major comments (2)

[abstract and Experiments section] The central claim that the closed-form SVD recovers accurate 6-DoF pose from two contacts reduces directly to the accuracy of the coarse-to-fine localization network on real tactile point clouds (abstract, final paragraph). No quantitative localization metrics (e.g., mean position or normal error on held-out real contacts), error distributions, or propagation analysis to the SVD output are supplied, leaving the load-bearing assumption unverified.
[Method (localization network) and Experiments] The fine-tuning dataset size is listed as a free parameter yet no ablation is reported on how localization or final pose error varies with the number of real contacts used for fine-tuning (abstract: 'fine-tuned with a small number of real contacts'). This directly affects the claim of reduced real-data requirements.
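A propagation analysis of the kind requested in the first comment could start from a Monte Carlo sketch like the one below. Everything in it is an assumption for illustration: the stand-in solver `fit_rigid` is a generic Kabsch-style fit on points and normals rather than the paper's solver, the two-contact geometry is made up, and the noise levels `sigma_m` and `sigma_n` are placeholders, not measured localization errors.

```python
import numpy as np

def fit_rigid(p, n, q, m, w=1.0):
    """Kabsch-style fit on points and normals (illustrative stand-in solver)."""
    pb, qb = p.mean(0), q.mean(0)
    U, _, Vt = np.linalg.svd((p - pb).T @ (q - qb) + w * (n.T @ m))
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    return R, qb - R @ pb

def propagate(sigma_m, sigma_n=0.0, trials=500, seed=0):
    """Monte Carlo: noise on localized contacts -> pose error samples."""
    rng = np.random.default_rng(seed)
    # Made-up two-contact configuration in the object frame (metres).
    p = np.array([[0.03, 0.00, 0.01], [-0.02, 0.04, 0.00]])
    n = np.array([[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
    # World-frame measurements; the true pose is the identity for simplicity.
    q, m = p.copy(), n.copy()
    rot_deg, trans_m = [], []
    for _ in range(trials):
        # Perturb the localized object-frame contacts, as a localization
        # network with imperfect output would.
        p_noisy = p + rng.normal(0.0, sigma_m, size=p.shape)
        n_noisy = n + rng.normal(0.0, sigma_n, size=n.shape)
        n_noisy /= np.linalg.norm(n_noisy, axis=1, keepdims=True)
        R, t = fit_rigid(p_noisy, n_noisy, q, m)
        # Errors relative to the identity ground-truth pose.
        cos_a = np.clip((np.trace(R) - 1.0) / 2.0, -1.0, 1.0)
        rot_deg.append(np.degrees(np.arccos(cos_a)))
        trans_m.append(np.linalg.norm(t))
    return np.array(rot_deg), np.array(trans_m)

rot0, trans0 = propagate(sigma_m=0.0, trials=5)
rot2, trans2 = propagate(sigma_m=0.002, sigma_n=0.02)  # assumed noise levels
print(f"median rotation error: {np.median(rot2):.2f} deg")
print(f"median translation error: {1000 * np.median(trans2):.2f} mm")
```

Sweeping `sigma_m` and `sigma_n` over the localization errors actually measured on held-out real contacts would give the error-propagation curve the comment asks for.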

minor comments (1)

[Pose Recovery subsection] The description of the normal-aware SVD solver would benefit from an explicit equation or pseudocode block showing the input matrix construction and the normal weighting term.
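One generic way to write such a solver, offered only as a hedged sketch and not as the paper's actual formulation, is a weighted least-squares problem over point and normal correspondences. The symbols are assumptions: $p_i, n_i$ are the localized contact point and normal in the object frame, $q_i, m_i$ the same quantities in the world frame obtained from the sensor poses, $\bar{p}, \bar{q}$ the point centroids, and $\lambda$ a hypothetical normal weight.

```latex
\begin{aligned}
(\hat{R}, \hat{t}) &= \arg\min_{R \in SO(3),\, t \in \mathbb{R}^3}
  \sum_{i=1}^{2} \lVert q_i - (R p_i + t) \rVert^2
  + \lambda \sum_{i=1}^{2} \lVert m_i - R n_i \rVert^2, \\
H &= \sum_{i=1}^{2} (p_i - \bar{p})(q_i - \bar{q})^{\top}
  + \lambda \sum_{i=1}^{2} n_i m_i^{\top} = U \Sigma V^{\top}, \\
\hat{R} &= V \operatorname{diag}\bigl(1, 1, \det(V U^{\top})\bigr) U^{\top},
\qquad \hat{t} = \bar{q} - \hat{R}\,\bar{p}.
\end{aligned}
```

This is the classical SVD alignment of Arun et al. [28] and Umeyama [29] with an added normal term; the normals are not centred because they are directions and do not depend on $t$.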

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript to incorporate the suggested additions.

Referee: [abstract and Experiments section] The central claim that the closed-form SVD recovers accurate 6-DoF pose from two contacts reduces directly to the accuracy of the coarse-to-fine localization network on real tactile point clouds (abstract, final paragraph). No quantitative localization metrics (e.g., mean position or normal error on held-out real contacts), error distributions, or propagation analysis to the SVD output are supplied, leaving the load-bearing assumption unverified.

Authors: We agree that localization accuracy on real data is central to the claims. While the manuscript reports overall pose estimation accuracy, we acknowledge the absence of explicit quantitative localization metrics (mean position/normal error on held-out real contacts), error distributions, and propagation analysis to the SVD. In revision we will add these metrics, distributions, and a propagation study to the Experiments section. revision: yes
Referee: [Method (localization network) and Experiments] The fine-tuning dataset size is listed as a free parameter yet no ablation is reported on how localization or final pose error varies with the number of real contacts used for fine-tuning (abstract: 'fine-tuned with a small number of real contacts'). This directly affects the claim of reduced real-data requirements.

Authors: We thank the referee for this observation. The manuscript states that fine-tuning uses a small number of real contacts but does not include an ablation varying that number. We will add an ablation study in the revised Experiments section showing localization and pose errors as a function of the number of real contacts used for fine-tuning. revision: yes

Circularity Check

0 steps flagged

No significant circularity; pose recovery is closed-form algebraic step

The paper's derivation chain consists of (1) a coarse-to-fine localization network trained on virtual patches sampled from the object model plus a small real GelSight set, followed by (2) a closed-form normal-aware SVD solver that takes the resulting contact positions and normals together with the calibrated sensor poses. The SVD step is presented as an exact algebraic recovery given its inputs and does not reduce to any fitted parameter or self-citation in the paper's own equations. No load-bearing self-citation, uniqueness theorem, smuggled ansatz, or renaming of known results appears in the provided text. The central claim therefore rests on comparison with external benchmarks rather than on circular reasoning.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The method rests on the availability of an object 3D model for virtual sampling and on accurate sensor calibration; these are domain-standard but load-bearing for the localization and solver steps.

free parameters (1)

fine-tuning dataset size
A small number of real contacts is used to adapt the network from virtual to real data; the exact count and selection criteria are not specified.

axioms (1)

domain assumption: An accurate 3D model of the target object is available for pretraining virtual tactile patches and for localization reference.
The abstract states that virtual patches are sampled from the object model and that the system operates on reconstructed models.

Reference graph

Works this paper leans on

37 extracted references · 9 canonical work pages · 1 internal anchor

[1] A. Kendall, M. Grimes, and R. Cipolla. Posenet: A convolutional network for real-time 6-dof camera relocalization. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), December 2015.

[2] S. Peng, Y. Liu, Q. Huang, X. Zhou, and H. Bao. Pvnet: Pixel-wise voting network for 6dof pose estimation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4561–4570, 2019.

[3] B. Wen, W. Yang, J. Kautz, and S. Birchfield. Foundationpose: Unified 6d pose estimation and tracking of novel objects. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 17868–17879, 2024.

[4] E. P. Örnek, Y. Labbé, B. Tekin, L. Ma, C. Keskin, C. Forster, and T. Hodan. Foundpose: Unseen object pose estimation with foundation features. In European Conference on Computer Vision, pages 163–182. Springer, 2024.

[5] M. Bauza, A. Bronars, and A. Rodriguez. Tac2pose: Tactile object pose estimation from the first touch. The International Journal of Robotics Research, 42(13):1185–1209, 2023.

[6] H.-J. Huang, M. Kaess, and W. Yuan. Normalflow: Fast, robust, and accurate contact-based object 6dof pose tracking with vision-based tactile sensors. IEEE Robotics and Automation Letters, 10(1):452–459, 2025. doi:10.1109/LRA.2024.3505815

[7] S. Hinterstoisser, V. Lepetit, S. Ilic, S. Holzer, G. Bradski, K. Konolige, and N. Navab. Model based training, detection and pose estimation of texture-less 3d objects in heavily cluttered scenes. In Asian conference on computer vision, pages 548–562. Springer, 2012.

[8] M. Sundermeyer, Z.-C. Marton, M. Durner, M. Brucker, and R. Triebel. Implicit 3d orientation learning for 6d object detection from rgb images. In Proceedings of the european conference on computer vision (ECCV), pages 699–715, 2018.

[9] A. Zeng, S. Song, M. Niessner, M. Fisher, J. Xiao, and T. Funkhouser. 3dmatch: Learning local geometric descriptors from rgb-d reconstructions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.

[10] C. Choy, J. Park, and V. Koltun. Fully convolutional geometric features. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2019.

[11] S. Dikhale, K. Patel, D. Dhingra, I. Naramura, A. Hayashi, S. Iba, and N. Jamali. Visuotactile 6d pose estimation of an in-hand object using vision and tactile sensor data. IEEE Robotics and Automation Letters, 7(2):2148–2155, 2022.

[12] S. Suresh, Z. Si, J. G. Mangelson, W. Yuan, and M. Kaess. Shapemap 3-d: Efficient shape mapping through dense touch and vision. In 2022 International Conference on Robotics and Automation (ICRA), pages 7073–7080. IEEE, 2022.

[13] S. Suresh, H. Qi, T. Wu, T. Fan, L. Pineda, M. Lambeta, J. Malik, M. Kalakrishnan, R. Calandra, M. Kaess, J. Ortiz, and M. Mukadam. Neural feels with neural fields: Visuo-tactile perception for in-hand manipulation. Science Robotics, page adl0628, 2024.

[14] A. Petrovskaya and O. Khatib. Global localization of objects via touch. IEEE Transactions on Robotics, 27(3):569–585, 2011.

[15] M. B. Villalonga, A. Rodriguez, B. Lim, E. Valls, and T. Sechopoulos. Tactile object pose estimation from the first touch with geometric contact rendering. In J. Kober, F. Ramos, and C. Tomlin, editors, Proceedings of the 2020 Conference on Robot Learning, volume 155 of Proceedings of Machine Learning Research, pages 1015–1029. PMLR, 16–18 Nov 2021. URL ht...

[16] G. M. Caddeo, N. A. Piga, F. Bottarel, and L. Natale. Collision-aware in-hand 6d object pose estimation using multiple vision-based tactile sensors. arXiv preprint arXiv:2301.13667, 2023.

[17] P. Sodhi, M. Kaess, M. Mukadam, and S. Anderson. Patchgraph: In-hand tactile tracking with learned surface normals. In 2022 International Conference on Robotics and Automation (ICRA), pages 2164–2170. IEEE, 2022.

[18] S. Suresh, Z. Si, S. Anderson, M. Kaess, and M. Mukadam. Midastouch: Monte-carlo inference over distributions across sliding touch. In Conference on Robot Learning, pages 319–331. PMLR, 2023.

[19] R. A. Newcombe, S. Izadi, O. Hilliges, D. Molyneaux, D. Kim, A. J. Davison, P. Kohli, J. Shotton, S. Hodges, and A. Fitzgibbon. Kinectfusion: Real-time dense surface mapping and tracking. In 2011 10th IEEE International Symposium on Mixed and Augmented Reality, pages 127–136, 2011. doi:10.1109/ISMAR.2011.6092378

[20] J. L. Schonberger and J.-M. Frahm. Structure-from-motion revisited. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.

[21] M. Kazhdan, M. Bolitho, and H. Hoppe. Poisson surface reconstruction. In Proceedings of the fourth Eurographics symposium on Geometry processing, volume 7, 2006.

[22] C. R. Qi, H. Su, K. Mo, and L. J. Guibas. Pointnet: Deep learning on point sets for 3d classification and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.

[23] C. R. Qi, L. Yi, H. Su, and L. Guibas. Pointnet++: Deep hierarchical feature learning on point sets in a metric space. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017. URL https://proceedings.neurips.cc/pap...

[24] J. Tobin, R. Fong, A. Ray, J. Schneider, W. Zaremba, and P. Abbeel. Domain randomization for transferring deep neural networks from simulation to the real world. In 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 23–30, 2017. doi:10.1109/IROS.2017.8202133

[25] Z. Si and W. Yuan. Taxim: An example-based simulation model for gelsight tactile sensors. IEEE Robotics and Automation Letters, 7(2):2361–2368, 2022. doi:10.1109/LRA.2022.3142412

[26] B. Drost, M. Ulrich, N. Navab, and S. Ilic. Model globally, match locally: Efficient and robust 3d object recognition. In 2010 IEEE computer society conference on computer vision and pattern recognition, pages 998–1005. IEEE, 2010.

[27] Y. Wang and J. M. Solomon. Deep closest point: Learning representations for point cloud registration. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2019.

[28] K. S. Arun, T. S. Huang, and S. D. Blostein. Least-squares fitting of two 3-d point sets. IEEE Transactions on pattern analysis and machine intelligence, (5):698–700, 1987.

[29] S. Umeyama. Least-squares estimation of transformation parameters between two point patterns. IEEE Transactions on pattern analysis and machine intelligence, 13(4):376–380, 2002.

[30] W. Yuan, S. Dong, and E. H. Adelson. Gelsight: High-resolution robot tactile sensors for estimating geometry and force. Sensors, 17(12), 2017. ISSN 1424-8220. doi:10.3390/s17122762. URL https://www.mdpi.com/1424-8220/17/12/2762.

[31] P. Ye, Y. Ma, Y. Zhou, W. Chen, W. Dong, and M. Duan. Invariantcloud: A globally invariant, uniquely indexed point cloud framework for robust 6-dof tactile pose tracking. arXiv preprint arXiv:2605.25216, 2026.

[32] J. Zhao, Y. Ma, L. Wang, and E. H. Adelson. Transferable tactile transformers for representation learning across diverse sensors and tasks. arXiv preprint arXiv:2406.13640, 2024.

R. Feng, J. Hu, W. Xia, T. Gao, A. Shen, Y . Sun, B. Fang, and D. Hu. Anytouch: Learn- ing unified static-dynamic representation across multiple visuo-tactile sensors.arXiv preprint arXiv:2502.12191, 2025. 12 A Dataset and Object Models This section details how YOTO acquires per-object surface representations and constructs the virtual tactile patch datas...

work page arXiv 2025
[34]

the patch centroid is offset to the in-cloud point closest to it, giving the object-frame contact locationp O
[35]

thedominantblocki ⋆, defined as the valid block holding the largest fraction of the patch’s points, is recorded as the coarse-stage retrieval target
[36]

the residual∆p O =p O −c O i⋆ is recorded as the fine-stage regression target
[37]

Parent cloudN

the patch’s concave direction is estimated from PCA (smallest-eigenvalue eigenvector, sign-disambiguated by comparing centre-vs-edge height along that direction), and the patch coordinates and normals are rotated so this direction aligns with positivezaxis. Why PCA for patches but normals for blocks?Block standardisation uses averaged surface normals beca...
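The patch-labelling fragments listed as [34] to [36] can be read as a small target-construction routine. The sketch below is one possible reading using only the standard library, not the authors' code: the function name `make_targets`, the data layout (a point list, a per-point block id, a dict of block centres), and the interpretation of "in-cloud point" as the nearest point of the full cloud are all assumptions.

```python
from collections import Counter
import math

def make_targets(cloud, block_of, patch_idx, block_centers, valid_blocks):
    """Build coarse/fine targets for one virtual patch (hypothetical helper).

    cloud: list of (x, y, z) points in the object frame.
    block_of: block id for every point in `cloud`.
    patch_idx: indices into `cloud` that make up the patch.
    block_centers: dict mapping block id -> (x, y, z) centre.
    valid_blocks: set of block ids allowed as retrieval targets.
    Assumes at least one patch point lies in a valid block.
    """
    patch = [cloud[i] for i in patch_idx]
    centroid = tuple(sum(pt[k] for pt in patch) / len(patch) for k in range(3))
    # Snap the centroid to the closest point of the cloud (assumed reading
    # of "in-cloud point"): this is the object-frame contact location.
    p_obj = min(cloud, key=lambda pt: math.dist(pt, centroid))
    # Coarse target: the valid block holding the most patch points.
    counts = Counter(block_of[i] for i in patch_idx if block_of[i] in valid_blocks)
    i_star = counts.most_common(1)[0][0]
    # Fine target: offset of the contact location from that block's centre.
    c = block_centers[i_star]
    residual = tuple(p_obj[k] - c[k] for k in range(3))
    return p_obj, i_star, residual


# Toy data: five collinear points split into two blocks.
cloud = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (0.2, 0.0, 0.0),
         (0.9, 0.0, 0.0), (1.0, 0.0, 0.0)]
block_of = [0, 0, 0, 1, 1]
block_centers = {0: (0.0, 0.0, 0.0), 1: (1.0, 0.0, 0.0)}
p_obj, i_star, residual = make_targets(
    cloud, block_of, patch_idx=[1, 2, 3],
    block_centers=block_centers, valid_blocks={0, 1})
```

In the toy data the patch centroid falls between points, so it snaps to the nearest cloud point, and two of the three patch points vote for block 0.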
