You Only Touch Once: 6-DoF Object Pose Estimation from Single Tactile Contact
Pith reviewed 2026-06-30 09:46 UTC · model grok-4.3
The pith
Two simultaneous tactile contacts recover an object's full 6-DoF pose through surface localization and a closed-form solver.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
YOTO recovers the full 6-DoF object pose from a single pair of simultaneous tactile contacts by representing each contact as a local 3D point cloud, localizing the contacts on the object surface with a coarse-to-fine network that is pretrained on virtual tactile patches and fine-tuned on a small number of real contacts, and then feeding the localized contacts along with calibrated sensor poses into a closed-form normal-aware SVD solver that computes the pose in one step.
What carries the argument
The closed-form normal-aware SVD solver that computes the rigid 6-DoF transformation directly from two localized contact points and their associated sensor poses.
If this is right
- The system produces accurate localization and pose estimates across four geometrically diverse objects.
- Performance exceeds both vision-based and purely geometric baselines, particularly under conditions where visual sensing is unreliable.
- The method functions with object models obtained from consumer-grade mobile scans, though with a measurable accuracy reduction relative to CAD models.
- Virtual pretraining plus limited real fine-tuning suffices to train the localization network without large real-world datasets.
Where Pith is reading between the lines
- The one-step solver could be extended to incorporate additional contacts for improved robustness if the normal-aware formulation is generalized.
- Integration with force or slip sensing might allow the same contacts to support both pose estimation and grasp stability checks.
- The approach could be tested on sequences of touches to handle cases where two contacts are insufficient due to symmetry.
- Performance on objects with deformable surfaces would test the rigid-body assumption implicit in the SVD step.
Load-bearing premise
The coarse-to-fine localization network, after pretraining on virtual tactile patches and fine-tuning on few real contacts, accurately maps real tactile point clouds to positions on the object surface.
What would settle it
Ground-truth 6-DoF pose measured by an external tracking system on the same objects shows large errors when the method is run on real GelSight contacts from two simultaneous touches.
Figures
read the original abstract
Accurate 6-DoF object pose estimation is fundamental to robotic manipulation, yet vision-based methods often fail under occlusion, poor lighting, and reflective or transparent surfaces. We present YOTO, a tactile-only pose estimation system that recovers the full 6-DoF object pose from a single pair of simultaneous contacts, without requiring contact history. YOTO represents each tactile contact as a local 3D point cloud and localizes it on the object surface through a coarse-to-fine network. The two localized contacts, together with the calibrated sensor poses, are then fed to a closed-form normal-aware SVD solver that recovers the full 6-DoF object pose in one step. To reduce real-data requirements, the localization network is pretrained on virtual tactile patches sampled from the object model and fine-tuned with a small number of real contacts. We further show that YOTO can operate on object models reconstructed from consumer-grade mobile scans, and quantify the gap relative to CAD-based models. Experiments on four geometrically diverse objects demonstrate accurate tactile contact localization and pose estimation, outperforming vision-based and geometric baselines, especially when visual perception is unreliable. Code, trained models, and the real GelSight dataset will be released upon publication.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents YOTO, a tactile-only 6-DoF pose estimation pipeline that represents each of two simultaneous contacts as a local 3D point cloud, localizes them on the object surface via a coarse-to-fine network (pretrained on virtual patches from the object model and fine-tuned on a small real GelSight set), and recovers the full pose in one step by feeding the localized positions+normals plus known sensor poses into a closed-form normal-aware SVD solver. It reports accurate localization and pose results on four geometrically diverse objects, outperforming vision-based and geometric baselines (especially under poor visual conditions), and shows viability with mobile-scanned object models.
Significance. If the real-data localization accuracy holds, the work offers a practical tactile alternative for pose estimation where vision fails, with the algebraic SVD step providing an exact, parameter-free recovery given accurate inputs and the virtual pretraining strategy lowering real-data requirements. The planned release of code, trained models, and the real GelSight dataset is a clear strength that supports reproducibility.
major comments (2)
- [abstract and Experiments section] The central claim that the closed-form SVD recovers accurate 6-DoF pose from two contacts reduces directly to the accuracy of the coarse-to-fine localization network on real tactile point clouds (abstract, final paragraph). No quantitative localization metrics (e.g., mean position or normal error on held-out real contacts), error distributions, or propagation analysis to the SVD output are supplied, leaving the load-bearing assumption unverified.
- [Method (localization network) and Experiments] The fine-tuning dataset size is listed as a free parameter yet no ablation is reported on how localization or final pose error varies with the number of real contacts used for fine-tuning (abstract: 'fine-tuned with a small number of real contacts'). This directly affects the claim of reduced real-data requirements.
minor comments (1)
- [Pose Recovery subsection] The description of the normal-aware SVD solver would benefit from an explicit equation or pseudocode block showing the input matrix construction and the normal weighting term.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript to incorporate the suggested additions.
read point-by-point responses
-
Referee: [abstract and Experiments section] The central claim that the closed-form SVD recovers accurate 6-DoF pose from two contacts reduces directly to the accuracy of the coarse-to-fine localization network on real tactile point clouds (abstract, final paragraph). No quantitative localization metrics (e.g., mean position or normal error on held-out real contacts), error distributions, or propagation analysis to the SVD output are supplied, leaving the load-bearing assumption unverified.
Authors: We agree that localization accuracy on real data is central to the claims. While the manuscript reports overall pose estimation accuracy, we acknowledge the absence of explicit quantitative localization metrics (mean position/normal error on held-out real contacts), error distributions, and propagation analysis to the SVD. In revision we will add these metrics, distributions, and a propagation study to the Experiments section. revision: yes
-
Referee: [Method (localization network) and Experiments] The fine-tuning dataset size is listed as a free parameter yet no ablation is reported on how localization or final pose error varies with the number of real contacts used for fine-tuning (abstract: 'fine-tuned with a small number of real contacts'). This directly affects the claim of reduced real-data requirements.
Authors: We thank the referee for this observation. The manuscript states that fine-tuning uses a small number of real contacts but does not include an ablation varying that number. We will add an ablation study in the revised Experiments section showing localization and pose errors as a function of the number of real contacts used for fine-tuning. revision: yes
Circularity Check
No significant circularity; pose recovery is closed-form algebraic step
full rationale
The paper's derivation chain consists of (1) a coarse-to-fine localization network trained on virtual patches sampled from the object model plus a small real GelSight set, followed by (2) feeding the resulting contact positions+normals plus calibrated sensor poses into a closed-form normal-aware SVD solver. The SVD step is presented as an exact algebraic recovery given its inputs and does not reduce to any fitted parameter or self-citation by the paper's own equations. No self-citation load-bearing, uniqueness theorem, ansatz smuggling, or renaming of known results appears in the provided text. The central claim therefore remains self-contained against external benchmarks rather than circular.
Axiom & Free-Parameter Ledger
free parameters (1)
- fine-tuning dataset size
axioms (1)
- domain assumption An accurate 3D model of the target object is available for pretraining virtual tactile patches and for localization reference.
Reference graph
Works this paper leans on
-
[1]
Kendall, M
A. Kendall, M. Grimes, and R. Cipolla. Posenet: A convolutional network for real-time 6- dof camera relocalization. InProceedings of the IEEE International Conference on Computer Vision (ICCV), December 2015
2015
-
[2]
S. Peng, Y . Liu, Q. Huang, X. Zhou, and H. Bao. Pvnet: Pixel-wise voting network for 6dof pose estimation. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4561–4570, 2019
2019
-
[3]
B. Wen, W. Yang, J. Kautz, and S. Birchfield. Foundationpose: Unified 6d pose estimation and tracking of novel objects. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 17868–17879, 2024
2024
-
[4]
E. P. ¨Ornek, Y . Labb´e, B. Tekin, L. Ma, C. Keskin, C. Forster, and T. Hodan. Foundpose: Unseen object pose estimation with foundation features. InEuropean Conference on Computer Vision, pages 163–182. Springer, 2024
2024
-
[5]
Bauza, A
M. Bauza, A. Bronars, and A. Rodriguez. Tac2pose: Tactile object pose estimation from the first touch.The International Journal of Robotics Research, 42(13):1185–1209, 2023
2023
-
[6]
H.-J. Huang, M. Kaess, and W. Yuan. Normalflow: Fast, robust, and accurate contact-based object 6dof pose tracking with vision-based tactile sensors.IEEE Robotics and Automation Letters, 10(1):452–459, 2025. doi:10.1109/LRA.2024.3505815
-
[7]
Hinterstoisser, V
S. Hinterstoisser, V . Lepetit, S. Ilic, S. Holzer, G. Bradski, K. Konolige, and N. Navab. Model based training, detection and pose estimation of texture-less 3d objects in heavily cluttered scenes. InAsian conference on computer vision, pages 548–562. Springer, 2012
2012
-
[8]
Sundermeyer, Z.-C
M. Sundermeyer, Z.-C. Marton, M. Durner, M. Brucker, and R. Triebel. Implicit 3d orientation learning for 6d object detection from rgb images. InProceedings of the european conference on computer vision (ECCV), pages 699–715, 2018
2018
-
[9]
A. Zeng, S. Song, M. Niessner, M. Fisher, J. Xiao, and T. Funkhouser. 3dmatch: Learning local geometric descriptors from rgb-d reconstructions. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017
2017
-
[10]
C. Choy, J. Park, and V . Koltun. Fully convolutional geometric features. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2019
2019
-
[11]
Dikhale, K
S. Dikhale, K. Patel, D. Dhingra, I. Naramura, A. Hayashi, S. Iba, and N. Jamali. Visuotactile 6d pose estimation of an in-hand object using vision and tactile sensor data.IEEE Robotics and Automation Letters, 7(2):2148–2155, 2022
2022
-
[12]
Suresh, Z
S. Suresh, Z. Si, J. G. Mangelson, W. Yuan, and M. Kaess. Shapemap 3-d: Efficient shape mapping through dense touch and vision. In2022 International Conference on Robotics and Automation (ICRA), pages 7073–7080. IEEE, 2022
2022
-
[13]
Suresh, H
S. Suresh, H. Qi, T. Wu, T. Fan, L. Pineda, M. Lambeta, J. Malik, M. Kalakrishnan, R. Ca- landra, M. Kaess, J. Ortiz, and M. Mukadam. Neural feels with neural fields: Visuo-tactile perception for in-hand manipulation.Science Robotics, page adl0628, 2024
2024
-
[14]
Petrovskaya and O
A. Petrovskaya and O. Khatib. Global localization of objects via touch.IEEE Transactions on Robotics, 27(3):569–585, 2011
2011
-
[15]
M. B. Villalonga, A. Rodriguez, B. Lim, E. Valls, and T. Sechopoulos. Tactile object pose estimation from the first touch with geometric contact rendering. In J. Kober, F. Ramos, and C. Tomlin, editors,Proceedings of the 2020 Conference on Robot Learning, volume 155 of Proceedings of Machine Learning Research, pages 1015–1029. PMLR, 16–18 Nov 2021. URL ht...
2020
- [16]
-
[17]
Sodhi, M
P. Sodhi, M. Kaess, M. Mukadanr, and S. Anderson. Patchgraph: In-hand tactile tracking with learned surface normals. In2022 International Conference on Robotics and Automation (ICRA), pages 2164–2170. IEEE, 2022
2022
-
[18]
Suresh, Z
S. Suresh, Z. Si, S. Anderson, M. Kaess, and M. Mukadam. Midastouch: Monte-carlo infer- ence over distributions across sliding touch. InConference on Robot Learning, pages 319–331. PMLR, 2023
2023
-
[19]
R. A. Newcombe, S. Izadi, O. Hilliges, D. Molyneaux, D. Kim, A. J. Davison, P. Kohi, J. Shot- ton, S. Hodges, and A. Fitzgibbon. Kinectfusion: Real-time dense surface mapping and track- ing. In2011 10th IEEE International Symposium on Mixed and Augmented Reality, pages 127–136, 2011. doi:10.1109/ISMAR.2011.6092378
-
[20]
J. L. Schonberger and J.-M. Frahm. Structure-from-motion revisited. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016
2016
-
[21]
Kazhdan, M
M. Kazhdan, M. Bolitho, and H. Hoppe. Poisson surface reconstruction. InProceedings of the fourth Eurographics symposium on Geometry processing, volume 7, 2006
2006
-
[22]
C. R. Qi, H. Su, K. Mo, and L. J. Guibas. Pointnet: Deep learning on point sets for 3d classification and segmentation. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017
2017
-
[23]
C. R. Qi, L. Yi, H. Su, and L. Guibas. Pointnet++: Deep hierarchical feature learning on point sets in a metric space. In I. Guyon, U. V . Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vish- wanathan, and R. Garnett, editors,Advances in Neural Information Processing Systems, vol- ume 30. Curran Associates, Inc., 2017. URLhttps://proceedings.neurips.cc/pap...
2017
-
[24]
J. Tobin, R. Fong, A. Ray, J. Schneider, W. Zaremba, and P. Abbeel. Domain randomization for transferring deep neural networks from simulation to the real world. In2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 23–30, 2017. doi: 10.1109/IROS.2017.8202133
-
[25]
Z. Si and W. Yuan. Taxim: An example-based simulation model for gelsight tactile sen- sors.IEEE Robotics and Automation Letters, 7(2):2361–2368, 2022. doi:10.1109/LRA.2022. 3142412
-
[26]
Drost, M
B. Drost, M. Ulrich, N. Navab, and S. Ilic. Model globally, match locally: Efficient and robust 3d object recognition. In2010 IEEE computer society conference on computer vision and pattern recognition, pages 998–1005. Ieee, 2010
2010
-
[27]
Wang and J
Y . Wang and J. M. Solomon. Deep closest point: Learning representations for point cloud registration. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2019
2019
-
[28]
K. S. Arun, T. S. Huang, and S. D. Blostein. Least-squares fitting of two 3-d point sets.IEEE Transactions on pattern analysis and machine intelligence, (5):698–700, 1987
1987
-
[29]
S. Umeyama. Least-squares estimation of transformation parameters between two point pat- terns.IEEE Transactions on pattern analysis and machine intelligence, 13(4):376–380, 2002
2002
-
[30]
W. Yuan, S. Dong, and E. H. Adelson. Gelsight: High-resolution robot tactile sensors for esti- mating geometry and force.Sensors, 17(12), 2017. ISSN 1424-8220. doi:10.3390/s17122762. URLhttps://www.mdpi.com/1424-8220/17/12/2762. 11
-
[31]
P. Ye, Y . Ma, Y . Zhou, W. Chen, W. Dong, and M. Duan. Invariantcloud: A globally invariant, uniquely indexed point cloud framework for robust 6-dof tactile pose tracking.arXiv preprint arXiv:2605.25216, 2026
work page internal anchor Pith review Pith/arXiv arXiv 2026
- [32]
-
[33]
R. Feng, J. Hu, W. Xia, T. Gao, A. Shen, Y . Sun, B. Fang, and D. Hu. Anytouch: Learn- ing unified static-dynamic representation across multiple visuo-tactile sensors.arXiv preprint arXiv:2502.12191, 2025. 12 A Dataset and Object Models This section details how YOTO acquires per-object surface representations and constructs the virtual tactile patch datas...
-
[34]
the patch centroid is offset to the in-cloud point closest to it, giving the object-frame contact locationp O
-
[35]
thedominantblocki ⋆, defined as the valid block holding the largest fraction of the patch’s points, is recorded as the coarse-stage retrieval target
-
[36]
the residual∆p O =p O −c O i⋆ is recorded as the fine-stage regression target
-
[37]
Parent cloudN
the patch’s concave direction is estimated from PCA (smallest-eigenvalue eigenvector, sign-disambiguated by comparing centre-vs-edge height along that direction), and the patch coordinates and normals are rotated so this direction aligns with positivezaxis. Why PCA for patches but normals for blocks?Block standardisation uses averaged surface normals beca...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.