Tac-DINO: Learning Vision-Tactile Features with Patch Alignment

Hong Li; Jiamin Qiu; Mingzhu Li; Nan Xue; Qihang Yao; Xing Zhu; Yankang Dong; Yihan Tang; Yong-Lu Li; Yue Xu; Yujun Shen

arxiv: 2606.12069 · v1 · pith:U45GE3E7new · submitted 2026-06-10 · 💻 cs.CV

Pith reviewed 2026-06-27 10:17 UTC · model grok-4.3

classification 💻 cs.CV

keywords vision-tactile learning, patch alignment, tactile dataset, holographic matching, cross-modal representation, Tac-DINO, VTPA, local-to-global alignment

The pith

Tac-DINO learns vision-tactile features by aligning local patches rather than whole images.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to advance vision-tactile representation learning beyond image-level methods by emphasizing scale alignment and local-to-global correspondence. It supports this with a new data collection system yielding over 20,000 tactile contacts across 505 objects and a Vis-Tac Holographic Matching Benchmark to test alignment ability. The authors introduce Vision-Tactile Patch Alignment methods inside Tac-DINO and report that these exceed non-alignment baselines and yield features that align with whole-object images. A sympathetic reader would care because tactile signals provide local contact information that vision alone cannot supply, and improved cross-modal features could help systems handle partial observations more effectively.

Core claim

By building a large-scale tactile dataset and the Vis-Tac Holographic Matching Benchmark, the work proposes Vision-Tactile Patch Alignment (VTPA) methods in Tac-DINO for vision-tactile representation learning; experiments on the benchmark show these methods exceed the performance of approaches without alignment and produce features that align with whole-object images.

What carries the argument

Vision-Tactile Patch Alignment (VTPA), the process of matching local tactile contact patches to corresponding visual patches to learn cross-modal features.
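One common way to realise this kind of patch matching is a symmetric InfoNCE-style contrastive loss, in which each tactile embedding should score highest against the visual patch it touched. The sketch below is a minimal pure-Python illustration under that assumption. The function names, the temperature value, and the choice of loss are illustrative and are not taken from the paper, which also lists other alignment variants (GA, DSCMR, DAR).

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross_entropy_diag(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)  # subtract the max for numerical stability
        log_sum = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_sum - row[i]
    return total / len(logits)


def patch_alignment_loss(tactile, visual, temperature=0.1):
    """Symmetric InfoNCE over paired embeddings.

    tactile[i] and visual[i] are assumed to come from the same contact
    location; every other pairing in the batch is treated as a negative.
    """
    t = [l2_normalize(v) for v in tactile]
    p = [l2_normalize(v) for v in visual]
    logits = [[dot(ti, pj) / temperature for pj in p] for ti in t]
    logits_transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits_transposed))


if __name__ == "__main__":
    # Toy 2-D embeddings, hypothetical values for illustration only.
    tactile = [[1.0, 0.0], [0.0, 1.0]]
    aligned_visual = [[1.0, 0.0], [0.0, 1.0]]
    swapped_visual = [[0.0, 1.0], [1.0, 0.0]]
    print("aligned loss:", patch_alignment_loss(tactile, aligned_visual))
    print("swapped loss:", patch_alignment_loss(tactile, swapped_visual))
```

Correctly paired patches should give a much smaller loss than swapped ones, which is the signal a patch-level objective of this kind optimises.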

If this is right

Tac-DINO exceeds the performance of methods without alignment on the Vis-Tac Holographic Matching Benchmark.
The learned features align with whole-object images.
Patch-level alignment enables local-to-global correspondence learning for vision-tactile data.
The new dataset of over 20K contacts supports training of such alignment methods.
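The local-to-global matching that the benchmark targets can be read as a retrieval problem: the retrieval rule that appears with Figure 3 picks the global candidate with the highest similarity to the local embedding. The following is a minimal sketch of that arg-max rule and a top-1 accuracy score. It assumes cosine similarity and a dictionary-based gallery, and all names and toy embeddings are hypothetical, not the paper's evaluation code.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def retrieve(local_embedding, gallery):
    """Return the object id whose global embedding is most similar.

    gallery maps object id -> global (whole-object) embedding.
    """
    return max(gallery, key=lambda oid: cosine(local_embedding, gallery[oid]))


def top1_accuracy(queries, gallery):
    """queries is a list of (local_embedding, true_object_id) pairs."""
    hits = sum(1 for emb, oid in queries if retrieve(emb, gallery) == oid)
    return hits / len(queries)


if __name__ == "__main__":
    # Toy embeddings, hypothetical values for illustration only.
    gallery = {
        "mug": [1.0, 0.0, 0.0],
        "towel": [0.0, 1.0, 0.0],
        "scissors": [0.0, 0.0, 1.0],
    }
    queries = [
        ([0.9, 0.1, 0.0], "mug"),
        ([0.2, 0.8, 0.1], "towel"),
        ([0.6, 0.1, 0.5], "scissors"),  # closer to "mug", so this one misses
    ]
    print("top-1 accuracy:", top1_accuracy(queries, gallery))
```

In this toy setup the third query is retrieved incorrectly, so two of the three local contacts are matched to the right object.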

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Robots could combine these features with partial touch observations to identify objects more reliably than with vision alone.
The benchmark offers a standardized test for other cross-modal alignment techniques beyond the ones proposed here.
Extensions to multi-contact or dynamic scenes might reveal whether patch alignment generalizes to real-time manipulation.

Load-bearing premise

The collected tactile dataset and holographic matching benchmark accurately capture the local-to-global correspondence problem real robotic systems face without biases from contact geometry or sensor calibration.

What would settle it

An evaluation on a held-out set of objects or a different tactile sensor where Tac-DINO shows no improvement over non-alignment methods.
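A held-out-object test only means something if no object contributes contacts to both splits. The sketch below shows one way to enforce that by splitting on object identity instead of on individual contacts. The record layout (`object_id` key), the 20% fraction, and the function name are assumptions made for illustration.

```python
import random


def split_by_object(contacts, held_out_fraction=0.2, seed=0):
    """Split contact records so that train and test share no objects.

    contacts is a list of dicts that each carry an "object_id" key.
    Returns (train_contacts, test_contacts).
    """
    object_ids = sorted({c["object_id"] for c in contacts})
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    rng.shuffle(object_ids)
    n_test = max(1, round(len(object_ids) * held_out_fraction))
    test_ids = set(object_ids[:n_test])
    train = [c for c in contacts if c["object_id"] not in test_ids]
    test = [c for c in contacts if c["object_id"] in test_ids]
    return train, test


if __name__ == "__main__":
    # 100 synthetic contacts spread over 10 synthetic objects.
    contacts = [{"object_id": i % 10, "contact_index": i} for i in range(100)]
    train, test = split_by_object(contacts)
    print(len(train), "train contacts,", len(test), "held-out contacts")
```

A random split over contacts would leak object identity into the test set, so a gain measured that way would not address the question raised here.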

Figures

Figures reproduced from arXiv: 2606.12069 by Hong Li, Jiamin Qiu, Mingzhu Li, Nan Xue, Qihang Yao, Xing Zhu, Yankang Dong, Yihan Tang, Yong-Lu Li, Yue Xu, Yujun Shen.

**Figure 1.** Overview of Tac-DINO. (a) Research focus: compared to existing work, we target vision-tactile patch alignment. (b) The data collection platform we propose for collecting 3D-VisionTactile data. (c) Benchmark for evaluating vision-tactile patch alignment and local-contact-to-global alignment ability.

∗Correspondence to: Yong-Lu Li <yonglu li@sjtu.edu.cn>.

**Figure 2.** Data distribution and labeling comparison. The top image shows visualization results for each scenario of our Touch3D. The bottom-left image displays the object distribution across different scenarios. The bottom-right image shows a comparison of tactile material labeling against OBJECTFOLDER REAL Gao et al. (2023). … over 700 hours, including 200 hours to scan 3D shapes at a rate of 2-3 objects per hour, and 500 ho…

**Figure 3.** VisTac Dataset Curation and Patch-Aligned Vision-Tactile Data. (a) Based on the collected 3D vision-tactile data and labeled contacts, we render the 3D shapes from specific camera views to compute the pixel-labeled vision-tactile data. (b), (c): Visualizations of the generated data on Touch3D and OBJECTFOLDER REAL. … follows: $L \in \{I_{\text{local}}, T_{\text{local}}, F(I_{\text{local}}, T_{\text{local}})\}$ (1), $G^{*} = \arg\max_{G \in \mathcal{G}} \operatorname{Sim}(E_{\text{local}}(L), E_{\text{glo}}\dots$

**Figure 4.** VTPA: Vision-Tactile Patch Alignment. Following DINOv2 Oquab et al. (2023), input images are cropped to local and global sizes. We further constrain local crops to correspond to a single tactile input. Local information originates from three modalities: tactile, vision, and vision-tactile fusion, each combined with global alignment (GA), Contrastive Learning (CL), DSCMR Aytar et al. (2017), and DAR Zh…

**Figure 5.** Local-to-Global Retrieval results on Touch3D. Top and bottom sections show selected flexible deformable and complex functional cases. Each example presents local visual and tactile data, comparing baseline methods (the first two rows) with our proposed methods (the last four rows).

**Figure 6.** Attention Visualizations of Linear Probing on Touch3D. Top row: local tactile data. Bottom row: local vision data. From left to right: normal local data, normal local input with Global Alignment (GA), patch input, patch input with Contrastive Learning (CL), DSCMR Aytar et al. (2017), and DAR Zhen et al. (2019). … cropped local inputs. Specifically, on our Touch3D top-1 k-NN results, the Patch Tac configurat…

**Figure 7.** Same as Fig. 6, but for VisTac local data on our Touch3D dataset.

**Figure 8.** Metric Consistency. Geometric matching requires more advanced semantic strategies.

**Figure 9.** Data labeling system. The system includes two functions: loading multisensory captured data (which includes scanned 3D shapes with contact locations, contact vision data, and tactile data), and adding or selecting tactile materials.

**Figure 10.** Comparison of Tactile Material Labeling. Orange indicates materials exclusive to OBJECTFOLDER REAL, blue indicates materials unique to our Touch3D dataset, and green represents shared materials.

**Figure 11.** Comparison of Object Word Cloud. From left to right: objects exclusive to OBJECTFOLDER REAL, objects shared between both datasets, and objects unique to our Touch3D dataset. As discussed in the main paper, we further relabel the tactile materials in OBJECTFOLDER REAL Gao et al. (2023) due to their visual and tactile indistinguishability. Specifically, we manually group objects into broader material cate…

**Figure 12.** Comparison of Object Count. Objects are sorted by the number of instances per object. Blue denotes OBJECTFOLDER REAL, while orange represents our Touch3D.

**Figure 13.** Additional local-to-global retrieval results. The left and right panels both display tactile local data. The top row shows selected simple rigid-body structures, while the bottom row presents geometrically clear structures. Each example is evaluated using baselines (Normal and Normal + GA) and our proposed methods (Patch, Patch + CL, Patch + DSCMR, and Patch + DAR).

**Figure 14.** Additional attention results. The top row displays tactile local data, while the bottom row shows visual local data. Each section includes attention maps from Layers 1 through 4. Each column indicates a specific setup, including the baselines (Normal, Normal + GA) and our proposed methods (Patch, Patch + CL, Patch + DSCMR, Patch + DAR).

**Figure 15.** VisTac local-to-global retrieval results. Overall, the four panels represent four classic cases: simple rigid-body structures, geometrically clear structures, flexible deformable structures, and complex functional structures. Each case includes VisTac early and late fusion methods, combined with Contrastive Loss, DSCMR, and DAR.
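The captions for Figures 3 and 4 describe rendering 3D shapes from chosen camera views to obtain pixel-labeled contacts, and constraining each local crop to a single tactile input. The sketch below shows how such a labeling step could look under a plain pinhole camera model: a 3D contact point in camera coordinates is projected to a pixel, mapped to a ViT patch index, and used to centre a local crop. The intrinsics, the 224-pixel image size, and the helper names are hypothetical. A patch size of 14 matches the released DINOv2 ViT backbones, but whether Tac-DINO uses it is an assumption.

```python
def project_to_pixel(point_cam, fx, fy, cx, cy):
    """Pinhole projection of a 3D point (camera frame) to pixel coordinates."""
    x, y, z = point_cam
    if z <= 0:
        raise ValueError("point is behind the camera")
    return fx * x / z + cx, fy * y / z + cy


def pixel_to_patch(u, v, image_size=224, patch_size=14):
    """Row-major index of the ViT patch containing pixel (u, v), or None."""
    if not (0 <= u < image_size and 0 <= v < image_size):
        return None  # the contact projects outside the rendered view
    grid = image_size // patch_size
    row, col = int(v) // patch_size, int(u) // patch_size
    return row * grid + col


def local_crop_box(u, v, crop_size, image_size=224):
    """(left, top, right, bottom) crop around (u, v), clamped to the image."""
    half = crop_size // 2
    left = min(max(int(u) - half, 0), image_size - crop_size)
    top = min(max(int(v) - half, 0), image_size - crop_size)
    return left, top, left + crop_size, top + crop_size


if __name__ == "__main__":
    # Hypothetical intrinsics and a contact point one unit in front of the camera.
    u, v = project_to_pixel((0.0, 0.0, 1.0), fx=200.0, fy=200.0, cx=112.0, cy=112.0)
    print("pixel:", (u, v))
    print("patch index:", pixel_to_patch(u, v))
    print("local crop:", local_crop_box(u, v, crop_size=96))
```

Clamping the crop keeps it inside the image when a contact lies near the border, at the cost of the contact no longer sitting exactly at the crop centre.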

Original abstract

Touch is the primary medium through which humans interact with the environment. Currently, tactile learning mainly focuses on image-level pretraining or alignment. However, tactile signals correspond to local object contact, while research into scale alignment and holographic matching remains limited and proper datasets and benchmarks also lack. To bridge this gap, we first construct a data collection system to acquire a large-scale tactile dataset, with over 20 K tactile contacts from 505 real-world objects. Building on this dataset, we design a Vis-Tac Holographic Matching Benchmark to evaluate vision-tactile local-to-global alignment ability. Then we propose Vision-Tactile Patch Alignment (VTPA) methods for vision-tactile representation learning. Experiments demonstrate that these exceed the performance of methods without alignment and align with whole-object images.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The new 20K-contact tactile dataset and holographic benchmark are the real additions here, but missing experimental numbers and unaddressed collection biases make the performance claims hard to trust for now.

The paper's main contribution is the new dataset of over 20k tactile contacts from 505 objects plus the Vis-Tac Holographic Matching Benchmark for local-to-global alignment. VTPA then applies patch-level contrastive learning across vision and touch.

They correctly note that most prior tactile work stays at image level and that local contact alignment has lacked both data and evaluation tools. Collecting real-object contacts at this scale is useful infrastructure work, and the benchmark directly targets the scale mismatch problem.

The experiments are the soft spot. The abstract asserts that VTPA beats non-aligned baselines and matches whole-object performance, yet supplies no numbers, ablations, or variance. Without those, the central claim cannot be checked. The stress-test point on possible rig biases also lands: fixed poses, pressure mapping, or calibration choices could make the local correspondences easier than they are on a moving robot, and the abstract mentions no cross-check against other tactile sets.

This is for people building multimodal perception stacks in robotics who need contact data or a local alignment test. The resources could be worth using even if the method itself needs tighter validation. It deserves peer review so the full results and data pipeline can be examined.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces a data collection system yielding a tactile dataset of over 20K contacts from 505 real-world objects, defines the Vis-Tac Holographic Matching Benchmark to assess local-to-global vision-tactile alignment, and proposes Vision-Tactile Patch Alignment (VTPA) methods for representation learning. It claims that VTPA outperforms non-alignment baselines and achieves performance comparable to whole-object image alignment.

Significance. If the experimental claims hold after standard controls, the dataset and benchmark would constitute useful resources for multimodal tactile-vision research, and the patch-alignment approach could clarify how local contact signals relate to global object structure in robotic perception.

major comments (2)

[Abstract / Experiments] Abstract and Experiments section: the central claim that VTPA methods 'exceed the performance of methods without alignment' is stated without any quantitative metrics, ablation tables, error bars, or statistical tests, preventing verification that the reported gains survive standard controls for dataset bias or hyperparameter tuning.
[Dataset and Benchmark] Dataset construction and Vis-Tac Holographic Matching Benchmark: the claim that superior benchmark numbers imply transferable local-to-global alignment capability rests on the untested assumption that the >20K-contact collection rig introduces no systematic distortions in contact geometry, force distribution, or sensor calibration; no cross-validation against external tactile corpora or sensitivity analysis to pose/pressure variations is described.
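The first major comment asks for error bars and statistical tests. As a rough illustration of what that reporting could involve, the sketch below summarises per-seed scores as mean and sample standard deviation and runs a two-sided paired sign-flip permutation test between two methods. The function names are hypothetical and the scores are placeholders invented for the example, not results from the paper.

```python
import random
import statistics


def summarize(scores):
    """Mean and sample standard deviation across runs (needs >= 2 runs)."""
    return statistics.mean(scores), statistics.stdev(scores)


def paired_permutation_pvalue(a, b, n_resamples=10000, seed=0):
    """Two-sided sign-flip permutation test on paired differences.

    a[i] and b[i] are scores of two methods on the same seed or split.
    """
    diffs = [x - y for x, y in zip(a, b)]
    observed = abs(sum(diffs) / len(diffs))
    rng = random.Random(seed)
    extreme = 0
    for _ in range(n_resamples):
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        # Small tolerance so floating-point ties count as "at least as extreme".
        if abs(sum(flipped) / len(flipped)) >= observed - 1e-12:
            extreme += 1
    # Add-one correction keeps the estimate strictly positive.
    return (extreme + 1) / (n_resamples + 1)


if __name__ == "__main__":
    # Placeholder top-1 scores over five seeds; NOT numbers from the paper.
    with_alignment = [0.61, 0.63, 0.60, 0.64, 0.62]
    without_alignment = [0.50, 0.52, 0.51, 0.50, 0.53]
    print("with alignment (mean, std):", summarize(with_alignment))
    print("without alignment (mean, std):", summarize(without_alignment))
    print("p-value:", paired_permutation_pvalue(with_alignment, without_alignment))
```

With only five paired runs there are 32 sign patterns, so this test cannot return a p-value much below 2/32 even when one method wins on every seed. That is one reason more seeds or more splits are usually needed before a gain can be called significant.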

minor comments (1)

[Abstract] Abstract: the phrase 'align with whole-object images' is ambiguous; clarify whether this refers to feature similarity, downstream task performance, or a specific metric.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript to strengthen the presentation of results and dataset validation.

Referee: [Abstract / Experiments] Abstract and Experiments section: the central claim that VTPA methods 'exceed the performance of methods without alignment' is stated without any quantitative metrics, ablation tables, error bars, or statistical tests, preventing verification that the reported gains survive standard controls for dataset bias or hyperparameter tuning.

Authors: We agree the abstract is high-level. The experiments section reports comparative results, but we will expand it in revision to include full ablation tables, quantitative metrics, error bars across runs, and statistical significance tests. This will allow direct verification that gains hold under controls for bias and hyperparameter choices. revision: yes
Referee: [Dataset and Benchmark] Dataset construction and Vis-Tac Holographic Matching Benchmark: the claim that superior benchmark numbers imply transferable local-to-global alignment capability rests on the untested assumption that the >20K-contact collection rig introduces no systematic distortions in contact geometry, force distribution, or sensor calibration; no cross-validation against external tactile corpora or sensitivity analysis to pose/pressure variations is described.

Authors: We recognize that additional validation would strengthen claims about the rig. In revision we will add sensitivity analyses for pose and pressure variations and, where feasible, cross-checks against available external tactile datasets to test for systematic distortions in geometry, force, or calibration. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical dataset construction and benchmark evaluation with no self-referential derivations

The paper describes an empirical pipeline: building a data collection system for a >20K-contact tactile dataset from 505 objects, designing a Vis-Tac Holographic Matching Benchmark, proposing VTPA methods for representation learning, and reporting experimental comparisons. No equations, fitted parameters presented as predictions, self-definitional constructs, or load-bearing self-citations appear in the provided text. The central claims rest on external data collection and standard performance metrics rather than any reduction of outputs to inputs by construction. This matches the default expectation for non-circular empirical work.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no explicit free parameters, axioms, or invented entities; the central claim rests on the empirical performance of a standard contrastive-style patch alignment loss applied to a newly collected dataset.

Reference graph

Works this paper leans on

166 extracted references · 2 canonical work pages

[1]

Scaling Learning Algorithms Towards

Bengio, Yoshua and LeCun, Yann , booktitle =. Scaling Learning Algorithms Towards
[2]

and Osindero, Simon and Teh, Yee Whye , journal =

Hinton, Geoffrey E. and Osindero, Simon and Teh, Yee Whye , journal =. A Fast Learning Algorithm for Deep Belief Nets , volume =
[3]

2016 , publisher=

Deep learning , author=. 2016 , publisher=

2016
[4]

2024 IEEE International Conference on Robotics and Automation (ICRA) , pages=

Sim2real manipulation on unknown objects with tactile-based reinforcement learning , author=. 2024 IEEE International Conference on Robotics and Automation (ICRA) , pages=. 2024 , organization=

2024
[5]

IEEE Robotics and Automation Letters , year=

Dextouch: Learning to seek and manipulate objects with tactile dexterity , author=. IEEE Robotics and Automation Letters , year=
[6]

2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) , pages=

Eyesight hand: Design of a fully-actuated dexterous robot hand with integrated vision-based tactile sensors and compliant actuation , author=. 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) , pages=. 2024 , organization=

2024
[7]

2024 IEEE International Conference on Robotics and Automation (ICRA) , pages=

Generalize by touching: Tactile ensemble skill transfer for robotic furniture assembly , author=. 2024 IEEE International Conference on Robotics and Automation (ICRA) , pages=. 2024 , organization=

2024
[8]

arXiv preprint arXiv:2410.11834 , year=

Contrastive touch-to-touch pretraining , author=. arXiv preprint arXiv:2410.11834 , year=

arXiv
[10]

arXiv preprint arXiv:2310.16917 , year=

Mimictouch: Leveraging multi-modal human tactile demonstrations for contact-rich manipulation , author=. arXiv preprint arXiv:2310.16917 , year=

arXiv
[11]

arXiv preprint arXiv:2409.17549 , year=

Canonical representation and force-based pretraining of 3d tactile for dexterous visuo-tactile policy learning , author=. arXiv preprint arXiv:2409.17549 , year=

arXiv
[12]

2025 IEEE International Conference on Robotics and Automation (ICRA) , pages=

Contrastive touch-to-touch pretraining , author=. 2025 IEEE International Conference on Robotics and Automation (ICRA) , pages=. 2025 , organization=

2025
[14]

arXiv preprint arXiv:2405.02794 , year=

Octopi: Object property reasoning with large tactile-language models , author=. arXiv preprint arXiv:2405.02794 , year=

arXiv
[15]

arXiv preprint arXiv:2507.09985 , year=

Demonstrating the Octopi-1.5 Visual-Tactile-Language Model , author=. arXiv preprint arXiv:2507.09985 , year=

arXiv
[16]

Information Fusion , pages=

Touch100k: A large-scale touch-language-vision dataset for touch-centric multimodal representation , author=. Information Fusion , pages=. 2025 , publisher=

2025
[17]

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

The objectfolder benchmark: Multisensory learning with neural and real objects , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=
[18]

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

Tactile-augmented radiance fields , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=
[19]

2022 International Conference on Robotics and Automation (ICRA) , pages=

Gelslim 3.0: High-resolution measurement of shape, force and slip in a compact tactile-sensing finger , author=. 2022 International Conference on Robotics and Automation (ICRA) , pages=. 2022 , organization=

2022
[20]

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

Objaverse: A universe of annotated 3d objects , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=
[21]

Advances in Neural Information Processing Systems , volume=

Objaverse-xl: A universe of 10m+ 3d objects , author=. Advances in Neural Information Processing Systems , volume=
[22]

IEEE Robotics and Automation Letters , year=

9dtact: A compact vision-based tactile sensor for accurate 3d shape reconstruction and generalizable 6d force estimation , author=. IEEE Robotics and Automation Letters , year=
[23]

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

From isolated islands to pangea: Unifying semantic space for human action understanding , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=
[24]

Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=

Beyond object recognition: A new benchmark towards object concept learning , author=. Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=
[25]

Computer Vision--ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13 , pages=

Microsoft coco: Common objects in context , author=. Computer Vision--ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13 , pages=. 2014 , organization=

2014
[26]

arXiv preprint arXiv:1512.03012 , year=

Shapenet: An information-rich 3d model repository , author=. arXiv preprint arXiv:1512.03012 , year=

Pith/arXiv arXiv
[27]

Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

Abo: Dataset and benchmarks for real-world 3d object understanding , author=. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=
[28]

2022 International Conference on Robotics and Automation (ICRA) , pages=

Google scanned objects: A high-quality dataset of 3d scanned household items , author=. 2022 International Conference on Robotics and Automation (ICRA) , pages=. 2022 , organization=

2022
[29]

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

Omniobject3d: Large-vocabulary 3d object dataset for realistic perception, reconstruction and generation , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=
[30]

Sensors , volume=

Gelsight: High-resolution robot tactile sensors for estimating geometry and force , author=. Sensors , volume=. 2017 , publisher=

2017
[31]

2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) , pages=

GelTip: A finger-shaped optical tactile sensor for robotic manipulation , author=. 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) , pages=. 2020 , organization=

2020
[32]

2020 IEEE International Conference on Robotics and Automation (ICRA) , pages=

Omnitact: A multi-directional high-resolution touch sensor , author=. 2020 IEEE International Conference on Robotics and Automation (ICRA) , pages=. 2020 , organization=

2020
[33]

2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) , pages=

Improved gelsight tactile sensor for measuring geometry and slip , author=. 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) , pages=. 2017 , organization=

2017
[35]

2013 World Haptics Conference (WHC) , pages=

Tactile sensing over articulated joints with stretchable sensors , author=. 2013 World Haptics Conference (WHC) , pages=. 2013 , organization=

2013
[37]

2025 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) , pages=

Adaptive visuo-tactile fusion with predictive force attention for dexterous manipulation , author=. 2025 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) , pages=. 2025 , organization=

2025
[41]

2025 IEEE International Conference on Robotics and Automation (ICRA) , pages=

Anyskin: Plug-and-play skin sensing for robotic touch , author=. 2025 IEEE International Conference on Robotics and Automation (ICRA) , pages=. 2025 , organization=

2025
[42]

Science Robotics , volume=

NeuralFeels with neural fields: Visuotactile perception for in-hand manipulation , author=. Science Robotics , volume=. 2024 , publisher=

2024
[44]

2024 IEEE International Conference on Robotics and Automation (ICRA) , pages=

Robot synesthesia: In-hand manipulation with visuotactile sensing , author=. 2024 IEEE International Conference on Robotics and Automation (ICRA) , pages=. 2024 , organization=

2024
[48]

IEEE Robotics and Automation Letters , volume=

Tacipc: Intersection-and inversion-free fem-based elastomer simulation for optical tactile sensors , author=. IEEE Robotics and Automation Letters , volume=. 2024 , publisher=

2024
[49]

The Fourteenth International Conference on Learning Representations , year=

TaCo: A Benchmark for Lossless and Lossy Codecs of Heterogeneous Tactile Data , author=. The Fourteenth International Conference on Learning Representations , year=
[50]

Sensors , volume=

Flexible tactile sensing based on piezoresistive composites: A review , author=. Sensors , volume=. 2014 , publisher=

2014
[51]

Nature , volume=

Learning the signatures of the human grasp using a scalable tactile glove , author=. Nature , volume=. 2019 , publisher=

2019
[52]

Smart Materials and Structures , volume=

Stretch not flex: programmable rubber keyboard , author=. Smart Materials and Structures , volume=. 2016 , publisher=

2016
[53]

Proceedings of the 33rd annual acm symposium on user interface software and technology , pages=

Capacitivo: Contact-based object recognition on interactive fabrics using capacitive sensing , author=. Proceedings of the 33rd annual acm symposium on user interface software and technology , pages=
[54]

ACM Transactions on Graphics (TOG) , volume=

Deformation capture via soft and stretchable sensor arrays , author=. ACM Transactions on Graphics (TOG) , volume=. 2019 , publisher=

2019
[55]

Nature neuroscience , volume=

Extrastriate body area in human occipital cortex responds to the performance of motor actions , author=. Nature neuroscience , volume=. 2004 , publisher=

2004
[57]

arXiv preprint arXiv:2601.20239 , year=

TouchGuide: Inference-Time Steering of Visuomotor Policies via Touch Guidance , author=. arXiv preprint arXiv:2601.20239 , year=

Pith/arXiv arXiv
[58]

International Conference on Intelligent Robotics and Applications , pages=

MC-TAC: Modular camera-based tactile sensor for robot gripper , author=. International Conference on Intelligent Robotics and Applications , pages=. 2023 , organization=

2023
[59]

Neuron , volume=

Topographic representation of the human body in the occipitotemporal cortex , author=. Neuron , volume=. 2010 , publisher=

2010
[60]

Nature , volume=

Vicarious body maps bridge vision and touch in the human brain , author=. Nature , volume=
[61]

IEEE Robotics and Automation Letters , year=

Tactile-driven dexterous in-hand writing via extrinsic contact sensing , author=. IEEE Robotics and Automation Letters , year=
[62]

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

Binding touch to everything: Learning unified multimodal tactile representations , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=
[65]

Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

Deep supervised cross-modal retrieval , author=. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=
[67]

Advances in Neural Information Processing Systems , volume=

Tactile dreamfusion: Exploiting tactile sensing for 3d generation , author=. Advances in Neural Information Processing Systems , volume=
[68]

Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

Structured 3d latents for scalable and versatile 3d generation , author=. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=
[69]

Proceedings of the Computer Vision and Pattern Recognition Conference , pages=

Vggt: Visual geometry grounded transformer , author=. Proceedings of the Computer Vision and Pattern Recognition Conference , pages=
[72]

First conference on language modeling , year=

Mamba: Linear-time sequence modeling with selective state spaces , author=. First conference on language modeling , year=
[73]

European Conference on Computer Vision , pages=

Pace: A large-scale dataset with pose annotations in cluttered environments , author=. European Conference on Computer Vision , pages=. 2024 , organization=

2024
[75]

Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=

Weakly-Supervised Learning of Dense Functional Correspondences , author=. Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=
[76]

Sensors , volume=

Design and evaluation of a rapid monolithic manufacturing technique for a novel vision-based tactile sensor: C-Sight , author=. Sensors , volume=. 2024 , publisher=

2024
[78]

Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

Objectfolder 2.0: A multisensory object dataset for sim2real transfer , author=. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=
[79]

Proceedings of the IEEE/CVF international conference on computer vision , pages=

Revisiting point cloud classification: A new benchmark dataset and classification model on real-world data , author=. Proceedings of the IEEE/CVF international conference on computer vision , pages=
[80]

Proceedings of the IEEE conference on computer vision and pattern recognition , pages=

Scannet: Richly-annotated 3d reconstructions of indoor scenes , author=. Proceedings of the IEEE conference on computer vision and pattern recognition , pages=
[81]

2016 fourth international conference on 3D vision (3DV) , pages=

Scenenn: A scene meshes dataset with annotations , author=. 2016 fourth international conference on 3D vision (3DV) , pages=. 2016 , organization=

2016
[82]

Proceedings of the IEEE conference on computer vision and pattern recognition , pages=

Image-to-image translation with conditional adversarial networks , author=. Proceedings of the IEEE conference on computer vision and pattern recognition , pages=
[83]

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

Connecting touch and vision via cross-modal prediction , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=
[84]

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

Ulip-2: Towards scalable multimodal pre-training for 3d understanding , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=
[85]

A touch, vision, and language dataset for multimodal alignment. arXiv preprint arXiv:2402.13232.
[86]

Surface simplification using quadric error metrics. Proceedings of the 24th Annual Conference on Computer Graphics and Interactive Techniques.
[87]

Blender Foundation.
[88]

Part123: Part-aware 3D reconstruction from a single-view image. ACM SIGGRAPH 2024 Conference Papers, 2024.
[89]

SAMPart3D: Segment any part in 3D objects. arXiv preprint arXiv:2411.07184.
[90]

Method for registration of 3-D shapes. Sensor Fusion IV: Control Paradigms and Data Structures, 1992.
[93]

P. Langley. Proceedings of the 17th International Conference on Machine Learning (ICML 2000), 2000.
[94]

T. M. Mitchell. The Need for Biases in Learning Generalizations. 1980.
[95]

M. J. Kearns.
[96]

Machine Learning: An Artificial Intelligence Approach, Vol. I. 1983.
[97]

R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification. 2000.
[98]

Suppressed for Anonymity.
[99]

A. Newell and P. S. Rosenbloom. Mechanisms of Skill Acquisition and the Law of Practice. Cognitive Skills and Their Acquisition. 1981.
[100]

A. L. Samuel. Some Studies in Machine Learning Using the Game of Checkers. IBM Journal of Research and Development. 1959.
[101]

DINO: DETR with Improved DeNoising Anchor Boxes for End-to-End Object Detection. 2022.
