arxiv: 2606.28899 · v1 · pith:ICSJC3CSnew · submitted 2026-06-27 · 💻 cs.RO

You Only Touch Once: 6-DoF Object Pose Estimation from Single Tactile Contact

Pith reviewed 2026-06-30 09:46 UTC · model grok-4.3

classification 💻 cs.RO
keywords tactile pose estimation · 6-DoF object pose · GelSight sensor · coarse-to-fine localization · normal-aware SVD solver · robotic manipulation · vision-free sensing · contact localization

The pith

Two simultaneous tactile contacts recover an object's full 6-DoF pose through surface localization and a closed-form solver.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces YOTO, a tactile-only method that estimates the complete six-degree-of-freedom pose of an object from one pair of simultaneous contacts without any motion history or visual input. Each contact is captured as a local 3D point cloud and mapped to the object surface by a coarse-to-fine neural network; the two localized points together with known sensor poses are then passed to a normal-aware SVD solver that yields the rigid transformation in a single closed-form step. Pretraining on virtual patches from the object model followed by light fine-tuning on real contacts reduces the need for extensive physical data collection. A reader would care because the approach targets scenarios where cameras fail due to occlusion, lighting, or surface properties, offering a direct alternative for robotic manipulation tasks.

Core claim

YOTO recovers the full 6-DoF object pose from a single pair of simultaneous tactile contacts by representing each contact as a local 3D point cloud, localizing the contacts on the object surface with a coarse-to-fine network that is pretrained on virtual tactile patches and fine-tuned on a small number of real contacts, and then feeding the localized contacts along with calibrated sensor poses into a closed-form normal-aware SVD solver that computes the pose in one step.

What carries the argument

The closed-form normal-aware SVD solver that computes the rigid 6-DoF transformation directly from two localized contact points and their associated sensor poses.
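
To make the one-step recovery concrete, here is a minimal Python sketch of a generic Kabsch-style SVD fit with an added normal-alignment term. It is not the paper's solver: the function name `solve_pose_with_normals`, the `normal_weight` parameter, and the cost being minimized are assumptions chosen for illustration, and the contact normals are assumed to be already expressed with a consistent sign in both frames.

```python
import numpy as np

def solve_pose_with_normals(p_obj, n_obj, p_world, n_world, normal_weight=1.0):
    """Closed-form rigid fit (hypothetical sketch, not the paper's solver).

    Finds R, t such that R @ p_obj[i] + t ~ p_world[i] and R @ n_obj[i] ~ n_world[i].
    p_* are (N, 3) contact positions, n_* are (N, 3) unit normals.
    normal_weight is an assumed knob balancing metres against unitless normals.
    """
    p_obj, n_obj = np.asarray(p_obj, float), np.asarray(n_obj, float)
    p_world, n_world = np.asarray(p_world, float), np.asarray(n_world, float)

    # Centre the positions; normals are directions and are not centred.
    c_obj, c_world = p_obj.mean(axis=0), p_world.mean(axis=0)

    # Cross-covariance with an extra normal-alignment term.
    H = (p_obj - c_obj).T @ (p_world - c_world) + normal_weight * n_obj.T @ n_world

    # Kabsch step: R = V * diag(1, 1, det(V U^T)) * U^T avoids reflections.
    U, _, Vt = np.linalg.svd(H)
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = c_world - R @ c_obj
    return R, t

# Usage with made-up contacts: 90 degree rotation about z plus a translation.
R_true = np.array([[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
t_true = np.array([0.10, -0.20, 0.30])
p_obj = np.array([[0.05, 0.0, 0.0], [-0.05, 0.01, 0.02]])
n_obj = np.array([[1.0, 0.0, 0.0], [0.0, 0.6, 0.8]])
R_est, t_est = solve_pose_with_normals(
    p_obj, n_obj, p_obj @ R_true.T + t_true, n_obj @ R_true.T
)
print(np.allclose(R_est, R_true), np.allclose(t_est, t_true))
```

Two points alone leave the rotation about the line joining them undetermined; in this sketch the normal term is what removes that ambiguity, which is one plausible reading of why the solver is described as normal-aware.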

If this is right

  • The system produces accurate localization and pose estimates across four geometrically diverse objects.
  • Performance exceeds both vision-based and purely geometric baselines, particularly under conditions where visual sensing is unreliable.
  • The method functions with object models obtained from consumer-grade mobile scans, though with a measurable accuracy reduction relative to CAD models.
  • Virtual pretraining plus limited real fine-tuning suffices to train the localization network without large real-world datasets.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The one-step solver could be extended to incorporate additional contacts for improved robustness if the normal-aware formulation is generalized.
  • Integration with force or slip sensing might allow the same contacts to support both pose estimation and grasp stability checks.
  • The approach could be tested on sequences of touches to handle cases where two contacts are insufficient due to symmetry.
  • Performance on objects with deformable surfaces would test the rigid-body assumption implicit in the SVD step.

Load-bearing premise

The coarse-to-fine localization network, after pretraining on virtual tactile patches and fine-tuning on few real contacts, accurately maps real tactile point clouds to positions on the object surface.

What would settle it

The claim would be undermined if pose estimates computed from real GelSight contacts (two simultaneous touches) deviated substantially from ground-truth 6-DoF poses measured by an external tracking system on the same objects.
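
Such a comparison needs a concrete error measure. The sketch below computes one common pair of metrics, translation distance and geodesic rotation angle, between an estimated and a ground-truth pose given as 4x4 homogeneous matrices. The function name `pose_errors` and the example values are hypothetical; the paper may report different metrics.

```python
import numpy as np

def pose_errors(T_est, T_gt):
    """Return (translation error, rotation error in degrees) between two 4x4 poses."""
    T_est = np.asarray(T_est, dtype=float)
    T_gt = np.asarray(T_gt, dtype=float)

    # Euclidean distance between the translation parts (same unit as the inputs).
    trans_err = np.linalg.norm(T_est[:3, 3] - T_gt[:3, 3])

    # Angle of the relative rotation, recovered from its trace.
    R_rel = T_est[:3, :3].T @ T_gt[:3, :3]
    cos_angle = np.clip((np.trace(R_rel) - 1.0) / 2.0, -1.0, 1.0)
    rot_err_deg = np.degrees(np.arccos(cos_angle))
    return float(trans_err), float(rot_err_deg)

# Made-up example: estimate is off by 5 mm and by a 90 degree turn about z.
T_gt = np.eye(4)
T_est = np.eye(4)
T_est[:3, :3] = [[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
T_est[:3, 3] = [0.003, 0.004, 0.0]
trans_err, rot_err_deg = pose_errors(T_est, T_gt)
print(trans_err, rot_err_deg)  # about 0.005 (metres) and 90 (degrees)
```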

Figures

Figures reproduced from arXiv: 2606.28899 by Brian Sheil, Edward Adelson, Guangming Wang, Haonan Chen, Pengfei Ye, Yilun Du, Yixiong Jing, Yuxiang Ma.

Figure 1: YOTO recovers 6-DoF object pose from a single dual-GelSight contact.
Figure 2: YOTO system pipeline (Sec. 3): surface representation, coarse-to-fine localization with virtual pre-training and few-shot real fine-tuning, and a closed-form normal-aware SVD pose solver. Given an object model M and a simultaneous dual-GelSight contact, YOTO outputs the 6-DoF pose $\hat{T}^{W}_{O} \in SE(3)$.
Figure 3: Coarse-to-fine tactile surface localization network (Sec. ...
Figure 4: Representative tactile contact predictions (blue) vs. ground-truth contact regions (red) for the four ...
Figure 5: Per-frame 6-DoF tracking error for a representative trajectory per object (top: translation; bottom: ...
Figure 6: CAD source meshes (grey) versus KIRI Engine mobile scans (yellow) for the four evaluation objects.
Figure 7: The pair of 3D-printed L-shaped handheld rigs used for real-world data collection (Sec. ...
Original abstract

Accurate 6-DoF object pose estimation is fundamental to robotic manipulation, yet vision-based methods often fail under occlusion, poor lighting, and reflective or transparent surfaces. We present YOTO, a tactile-only pose estimation system that recovers the full 6-DoF object pose from a single pair of simultaneous contacts, without requiring contact history. YOTO represents each tactile contact as a local 3D point cloud and localizes it on the object surface through a coarse-to-fine network. The two localized contacts, together with the calibrated sensor poses, are then fed to a closed-form normal-aware SVD solver that recovers the full 6-DoF object pose in one step. To reduce real-data requirements, the localization network is pretrained on virtual tactile patches sampled from the object model and fine-tuned with a small number of real contacts. We further show that YOTO can operate on object models reconstructed from consumer-grade mobile scans, and quantify the gap relative to CAD-based models. Experiments on four geometrically diverse objects demonstrate accurate tactile contact localization and pose estimation, outperforming vision-based and geometric baselines, especially when visual perception is unreliable. Code, trained models, and the real GelSight dataset will be released upon publication.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper presents YOTO, a tactile-only 6-DoF pose estimation pipeline that represents each of two simultaneous contacts as a local 3D point cloud, localizes them on the object surface via a coarse-to-fine network (pretrained on virtual patches from the object model and fine-tuned on a small real GelSight set), and recovers the full pose in one step by feeding the localized positions+normals plus known sensor poses into a closed-form normal-aware SVD solver. It reports accurate localization and pose results on four geometrically diverse objects, outperforming vision-based and geometric baselines (especially under poor visual conditions), and shows viability with mobile-scanned object models.

Significance. If the real-data localization accuracy holds, the work offers a practical tactile alternative for pose estimation where vision fails, with the algebraic SVD step providing an exact, parameter-free recovery given accurate inputs and the virtual pretraining strategy lowering real-data requirements. The planned release of code, trained models, and the real GelSight dataset is a clear strength that supports reproducibility.

major comments (2)
  1. [abstract and Experiments section] The central claim that the closed-form SVD recovers accurate 6-DoF pose from two contacts reduces directly to the accuracy of the coarse-to-fine localization network on real tactile point clouds (abstract, final paragraph). No quantitative localization metrics (e.g., mean position or normal error on held-out real contacts), error distributions, or propagation analysis to the SVD output are supplied, leaving the load-bearing assumption unverified.
  2. [Method (localization network) and Experiments] The fine-tuning dataset size is listed as a free parameter yet no ablation is reported on how localization or final pose error varies with the number of real contacts used for fine-tuning (abstract: 'fine-tuned with a small number of real contacts'). This directly affects the claim of reduced real-data requirements.
minor comments (1)
  1. [Pose Recovery subsection] The description of the normal-aware SVD solver would benefit from an explicit equation or pseudocode block showing the input matrix construction and the normal weighting term.
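
For orientation, one generic way to write a normal-aware closed-form fit of the kind the minor comment asks for is given below. This is an illustrative formulation, not the paper's equation: the symbols $a_i, n_i$ (localized contact positions and normals in the object frame), $b_i, m_i$ (the same contacts in the world frame, from the calibrated sensor poses), and the weight $\lambda$ are all assumptions introduced here.

```latex
% Illustrative only; notation is assumed, not taken from the paper.
\begin{align}
(\hat{R}, \hat{t}) &= \arg\min_{R \in SO(3),\; t \in \mathbb{R}^3}
  \sum_{i=1}^{2} \lVert R\,a_i + t - b_i \rVert^2
  + \lambda \sum_{i=1}^{2} \lVert R\,n_i - m_i \rVert^2, \\
H &= \sum_{i=1}^{2} (a_i - \bar{a})(b_i - \bar{b})^{\top}
  + \lambda \sum_{i=1}^{2} n_i\, m_i^{\top}
  = U \Sigma V^{\top}, \\
\hat{R} &= V \,\mathrm{diag}\!\left(1,\, 1,\, \det(V U^{\top})\right) U^{\top},
\qquad
\hat{t} = \bar{b} - \hat{R}\,\bar{a}.
\end{align}
```

Here $\bar{a}$ and $\bar{b}$ are the centroids of the two position sets. Both cost terms reduce to maximizing $\mathrm{tr}(R H)$, which is why a single SVD suffices under this formulation.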

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript to incorporate the suggested additions.

point-by-point responses
  1. Referee: [abstract and Experiments section] The central claim that the closed-form SVD recovers accurate 6-DoF pose from two contacts reduces directly to the accuracy of the coarse-to-fine localization network on real tactile point clouds (abstract, final paragraph). No quantitative localization metrics (e.g., mean position or normal error on held-out real contacts), error distributions, or propagation analysis to the SVD output are supplied, leaving the load-bearing assumption unverified.

    Authors: We agree that localization accuracy on real data is central to the claims. While the manuscript reports overall pose estimation accuracy, we acknowledge the absence of explicit quantitative localization metrics (mean position/normal error on held-out real contacts), error distributions, and propagation analysis to the SVD. In revision we will add these metrics, distributions, and a propagation study to the Experiments section. revision: yes

  2. Referee: [Method (localization network) and Experiments] The fine-tuning dataset size is listed as a free parameter yet no ablation is reported on how localization or final pose error varies with the number of real contacts used for fine-tuning (abstract: 'fine-tuned with a small number of real contacts'). This directly affects the claim of reduced real-data requirements.

    Authors: We thank the referee for this observation. The manuscript states that fine-tuning uses a small number of real contacts but does not include an ablation varying that number. We will add an ablation study in the revised Experiments section showing localization and pose errors as a function of the number of real contacts used for fine-tuning. revision: yes
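
The propagation study promised in response 1 could be prototyped as a Monte Carlo simulation: perturb the localized contact positions with Gaussian noise, run a closed-form fit, and record the resulting pose error. The sketch below does this with a generic Kabsch-style stand-in solver; the function names, the ground-truth pose, the contact geometry, and the noise model are all hypothetical and are not taken from the paper.

```python
import numpy as np

def solve_pose(p_obj, n_obj, p_world, n_world, normal_weight=1.0):
    """Stand-in closed-form fit: Kabsch on positions plus a normal term (assumed)."""
    c_obj, c_world = p_obj.mean(axis=0), p_world.mean(axis=0)
    H = (p_obj - c_obj).T @ (p_world - c_world) + normal_weight * n_obj.T @ n_world
    U, _, Vt = np.linalg.svd(H)
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    return R, c_world - R @ c_obj

def rotation_error_deg(R_est, R_gt):
    """Geodesic angle between two rotation matrices, in degrees."""
    cos_angle = np.clip((np.trace(R_est.T @ R_gt) - 1.0) / 2.0, -1.0, 1.0)
    return float(np.degrees(np.arccos(cos_angle)))

def propagate(sigma_pos, n_trials=200, seed=0):
    """Mean (translation, rotation) error for position noise of std sigma_pos."""
    rng = np.random.default_rng(seed)
    # Hypothetical ground truth: 90 degree rotation about z plus a translation (metres).
    R_gt = np.array([[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
    t_gt = np.array([0.10, -0.20, 0.30])
    # Hypothetical contacts in the object frame (metres, unit normals).
    p_obj = np.array([[0.05, 0.0, 0.0], [-0.05, 0.01, 0.02]])
    n_obj = np.array([[1.0, 0.0, 0.0], [0.0, 0.6, 0.8]])
    # World-frame contacts are treated as exact (known from sensor poses).
    p_world = p_obj @ R_gt.T + t_gt
    n_world = n_obj @ R_gt.T

    errors = []
    for _ in range(n_trials):
        # Noise models localization error on the object-frame positions only.
        noisy_p = p_obj + rng.normal(0.0, sigma_pos, size=p_obj.shape)
        R, t = solve_pose(noisy_p, n_obj, p_world, n_world)
        errors.append((np.linalg.norm(t - t_gt), rotation_error_deg(R, R_gt)))
    return np.mean(errors, axis=0)

for sigma in (0.0, 0.0005, 0.002):
    trans_err, rot_err = propagate(sigma)
    print(f"sigma={sigma:.4f} m  trans={trans_err:.5f} m  rot={rot_err:.4f} deg")
```

A fuller study would also perturb the normals and replace the Gaussian model with the empirical localization error distribution measured on held-out real contacts.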

Circularity Check

0 steps flagged

No significant circularity; pose recovery is closed-form algebraic step

full rationale

The paper's derivation chain consists of (1) a coarse-to-fine localization network trained on virtual patches sampled from the object model plus a small real GelSight set, followed by (2) feeding the resulting contact positions and normals, together with the calibrated sensor poses, into a closed-form normal-aware SVD solver. The SVD step is presented as an exact algebraic recovery given its inputs and does not reduce to any fitted parameter or self-citation in the paper's own equations. No load-bearing self-citation, uniqueness theorem, smuggled ansatz, or renaming of known results appears in the provided text. The central claim therefore stands or falls on external benchmarks rather than on circular reasoning.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The method rests on the availability of an object 3D model for virtual sampling and on accurate sensor calibration; these are domain-standard but load-bearing for the localization and solver steps.

free parameters (1)
  • fine-tuning dataset size
    A small number of real contacts is used to adapt the network from virtual to real data; the exact count and selection criteria are not specified.
axioms (1)
  • domain assumption: An accurate 3D model of the target object is available for pretraining virtual tactile patches and for localization reference.
    The abstract states that virtual patches are sampled from the object model and that the system operates on reconstructed models.

pith-pipeline@v0.9.1-grok · 5771 in / 1300 out tokens · 42457 ms · 2026-06-30T09:46:52.609705+00:00 · methodology


Reference graph

Works this paper leans on

37 extracted references · 9 canonical work pages · 1 internal anchor

  1. A. Kendall, M. Grimes, and R. Cipolla. Posenet: A convolutional network for real-time 6-dof camera relocalization. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), December 2015.

  2. S. Peng, Y. Liu, Q. Huang, X. Zhou, and H. Bao. Pvnet: Pixel-wise voting network for 6dof pose estimation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4561–4570, 2019.

  3. B. Wen, W. Yang, J. Kautz, and S. Birchfield. Foundationpose: Unified 6d pose estimation and tracking of novel objects. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 17868–17879, 2024.

  4. E. P. Örnek, Y. Labbé, B. Tekin, L. Ma, C. Keskin, C. Forster, and T. Hodan. Foundpose: Unseen object pose estimation with foundation features. In European Conference on Computer Vision, pages 163–182. Springer, 2024.

  5. M. Bauza, A. Bronars, and A. Rodriguez. Tac2pose: Tactile object pose estimation from the first touch. The International Journal of Robotics Research, 42(13):1185–1209, 2023.

  6. H.-J. Huang, M. Kaess, and W. Yuan. Normalflow: Fast, robust, and accurate contact-based object 6dof pose tracking with vision-based tactile sensors. IEEE Robotics and Automation Letters, 10(1):452–459, 2025. doi:10.1109/LRA.2024.3505815

  7. S. Hinterstoisser, V. Lepetit, S. Ilic, S. Holzer, G. Bradski, K. Konolige, and N. Navab. Model based training, detection and pose estimation of texture-less 3d objects in heavily cluttered scenes. In Asian conference on computer vision, pages 548–562. Springer, 2012.

  8. M. Sundermeyer, Z.-C. Marton, M. Durner, M. Brucker, and R. Triebel. Implicit 3d orientation learning for 6d object detection from rgb images. In Proceedings of the european conference on computer vision (ECCV), pages 699–715, 2018.

  9. A. Zeng, S. Song, M. Niessner, M. Fisher, J. Xiao, and T. Funkhouser. 3dmatch: Learning local geometric descriptors from rgb-d reconstructions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.

  10. C. Choy, J. Park, and V. Koltun. Fully convolutional geometric features. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2019.

  11. S. Dikhale, K. Patel, D. Dhingra, I. Naramura, A. Hayashi, S. Iba, and N. Jamali. Visuotactile 6d pose estimation of an in-hand object using vision and tactile sensor data. IEEE Robotics and Automation Letters, 7(2):2148–2155, 2022.

  12. S. Suresh, Z. Si, J. G. Mangelson, W. Yuan, and M. Kaess. Shapemap 3-d: Efficient shape mapping through dense touch and vision. In 2022 International Conference on Robotics and Automation (ICRA), pages 7073–7080. IEEE, 2022.

  13. S. Suresh, H. Qi, T. Wu, T. Fan, L. Pineda, M. Lambeta, J. Malik, M. Kalakrishnan, R. Calandra, M. Kaess, J. Ortiz, and M. Mukadam. Neural feels with neural fields: Visuo-tactile perception for in-hand manipulation. Science Robotics, page adl0628, 2024.

  14. A. Petrovskaya and O. Khatib. Global localization of objects via touch. IEEE Transactions on Robotics, 27(3):569–585, 2011.

  15. M. B. Villalonga, A. Rodriguez, B. Lim, E. Valls, and T. Sechopoulos. Tactile object pose estimation from the first touch with geometric contact rendering. In J. Kober, F. Ramos, and C. Tomlin, editors, Proceedings of the 2020 Conference on Robot Learning, volume 155 of Proceedings of Machine Learning Research, pages 1015–1029. PMLR, 16–18 Nov 2021. URL ht...

  16. G. M. Caddeo, N. A. Piga, F. Bottarel, and L. Natale. Collision-aware in-hand 6d object pose estimation using multiple vision-based tactile sensors. arXiv preprint arXiv:2301.13667, 2023.

  17. P. Sodhi, M. Kaess, M. Mukadam, and S. Anderson. Patchgraph: In-hand tactile tracking with learned surface normals. In 2022 International Conference on Robotics and Automation (ICRA), pages 2164–2170. IEEE, 2022.

  18. S. Suresh, Z. Si, S. Anderson, M. Kaess, and M. Mukadam. Midastouch: Monte-carlo inference over distributions across sliding touch. In Conference on Robot Learning, pages 319–331. PMLR, 2023.

  19. R. A. Newcombe, S. Izadi, O. Hilliges, D. Molyneaux, D. Kim, A. J. Davison, P. Kohli, J. Shotton, S. Hodges, and A. Fitzgibbon. Kinectfusion: Real-time dense surface mapping and tracking. In 2011 10th IEEE International Symposium on Mixed and Augmented Reality, pages 127–136, 2011. doi:10.1109/ISMAR.2011.6092378

  20. J. L. Schonberger and J.-M. Frahm. Structure-from-motion revisited. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.

  21. M. Kazhdan, M. Bolitho, and H. Hoppe. Poisson surface reconstruction. In Proceedings of the fourth Eurographics symposium on Geometry processing, volume 7, 2006.

  22. C. R. Qi, H. Su, K. Mo, and L. J. Guibas. Pointnet: Deep learning on point sets for 3d classification and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.

  23. C. R. Qi, L. Yi, H. Su, and L. Guibas. Pointnet++: Deep hierarchical feature learning on point sets in a metric space. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017. URL https://proceedings.neurips.cc/pap...

  24. J. Tobin, R. Fong, A. Ray, J. Schneider, W. Zaremba, and P. Abbeel. Domain randomization for transferring deep neural networks from simulation to the real world. In 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 23–30, 2017. doi:10.1109/IROS.2017.8202133

  25. Z. Si and W. Yuan. Taxim: An example-based simulation model for gelsight tactile sensors. IEEE Robotics and Automation Letters, 7(2):2361–2368, 2022. doi:10.1109/LRA.2022.3142412

  26. B. Drost, M. Ulrich, N. Navab, and S. Ilic. Model globally, match locally: Efficient and robust 3d object recognition. In 2010 IEEE computer society conference on computer vision and pattern recognition, pages 998–1005. IEEE, 2010.

  27. Y. Wang and J. M. Solomon. Deep closest point: Learning representations for point cloud registration. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2019.

  28. K. S. Arun, T. S. Huang, and S. D. Blostein. Least-squares fitting of two 3-d point sets. IEEE Transactions on pattern analysis and machine intelligence, (5):698–700, 1987.

  29. S. Umeyama. Least-squares estimation of transformation parameters between two point patterns. IEEE Transactions on pattern analysis and machine intelligence, 13(4):376–380, 1991.

  30. W. Yuan, S. Dong, and E. H. Adelson. Gelsight: High-resolution robot tactile sensors for estimating geometry and force. Sensors, 17(12), 2017. ISSN 1424-8220. doi:10.3390/s17122762. URL https://www.mdpi.com/1424-8220/17/12/2762.

  31. P. Ye, Y. Ma, Y. Zhou, W. Chen, W. Dong, and M. Duan. Invariantcloud: A globally invariant, uniquely indexed point cloud framework for robust 6-dof tactile pose tracking. arXiv preprint arXiv:2605.25216, 2026.

  32. J. Zhao, Y. Ma, L. Wang, and E. H. Adelson. Transferable tactile transformers for representation learning across diverse sensors and tasks. arXiv preprint arXiv:2406.13640, 2024.

  33. R. Feng, J. Hu, W. Xia, T. Gao, A. Shen, Y. Sun, B. Fang, and D. Hu. Anytouch: Learning unified static-dynamic representation across multiple visuo-tactile sensors. arXiv preprint arXiv:2502.12191, 2025. A Dataset and Object Models. This section details how YOTO acquires per-object surface representations and constructs the virtual tactile patch datas...

  34. the patch centroid is offset to the in-cloud point closest to it, giving the object-frame contact location $p^{O}$

  35. the dominant block $i^{\star}$, defined as the valid block holding the largest fraction of the patch’s points, is recorded as the coarse-stage retrieval target

  36. the residual $\Delta p^{O} = p^{O} - c^{O}_{i^{\star}}$ is recorded as the fine-stage regression target

  37. [37]

    Parent cloudN

    the patch’s concave direction is estimated from PCA (smallest-eigenvalue eigenvector, sign-disambiguated by comparing centre-vs-edge height along that direction), and the patch coordinates and normals are rotated so this direction aligns with the positive z axis. Why PCA for patches but normals for blocks? Block standardisation uses averaged surface normals beca...