Tactile-based Multimodal Fusion in Embodied Intelligence: A Survey of Vision, Language, and Contact-Driven Paradigms

Alex Zhou; Bin Fang; Daizong Liu; Di Tian; Henghui Ding; Hui Xiong; Qing-Long Han; Runwei Guan; Shaofeng Liang; Tao Huang

arXiv: 2605.17336 · v1 · pith:Z6MUGU7Mnew · submitted 2026-05-17 · cs.RO · cs.CV · eess.SP

Zhixiang Cao, Di Tian, Runwei Guan, Yanzhou Mu, Xiaolou Sun, Shaofeng Liang, Daizong Liu, Tao Huang, Yutao Yue, Henghui Ding, Bin Fang, Alex Zhou, Qing-Long Han, Hui Xiong

Pith reviewed 2026-05-20 12:48 UTC · model grok-4.3

classification cs.RO · cs.CV · eess.SP

keywords tactile sensing, multimodal fusion, embodied intelligence, vision-language-tactile, cross-modal generation, robot manipulation, contact-driven paradigms, perception and interaction

The pith

This survey unifies fragmented tactile-vision-language research in robotics through a new hierarchical taxonomy of datasets and methods.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper surveys work on combining tactile sensing with vision and language to support embodied intelligence in robots and agents. It notes that unimodal tactile data lacks context while remote sensors miss contact details, so fusion is needed for semantic reasoning during physical tasks. The authors organize datasets by modality combinations and methods into three pillars focused on understanding objects, generating cross-modal outputs, and guiding interactions. This structure is intended to make scattered results easier to compare and extend.

Core claim

The paper establishes a hierarchical taxonomy that organizes multimodal tactile fusion research into multimodal datasets (Tactile-Vision, Tactile-Language, Tactile-Vision-Language, and Tactile-Vision-Other) and three core method pillars: Multimodal Perception and Recognition for object understanding and grasp prediction, Cross-Modal Generation for bidirectional translation between tactile, vision, and text, and Multimodal Interaction for feedback control and language-guided manipulation. It also reviews tactile hardware, evaluation metrics, benchmark settings, challenges, and future directions up to the first quarter of 2026.
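To make the taxonomy concrete, the sketch below encodes the four dataset categories and three method pillars named above as Python enums. The `Modality` enum, the `categorize_dataset` helper, and its tie-break rules are assumptions made for illustration only; the paper's own assignment criteria are not given on this page.

```python
from enum import Enum
from typing import Optional, Set


class Modality(Enum):
    TACTILE = "tactile"
    VISION = "vision"
    LANGUAGE = "language"
    OTHER = "other"  # placeholder for any further signal (assumption)


class DatasetCategory(Enum):
    TACTILE_VISION = "Tactile-Vision"
    TACTILE_LANGUAGE = "Tactile-Language"
    TACTILE_VISION_LANGUAGE = "Tactile-Vision-Language"
    TACTILE_VISION_OTHER = "Tactile-Vision-Other"


class MethodPillar(Enum):
    PERCEPTION_AND_RECOGNITION = "Multimodal Perception and Recognition"
    CROSS_MODAL_GENERATION = "Cross-Modal Generation"
    INTERACTION = "Multimodal Interaction"


def categorize_dataset(modalities: Set[Modality]) -> Optional[DatasetCategory]:
    """Map a set of modalities to one of the four dataset categories.

    Hypothetical rule set: tactile data is required, and a dataset with
    vision, language, and a further modality is filed under
    Tactile-Vision-Language. Returns None when no category fits.
    """
    if Modality.TACTILE not in modalities:
        return None
    has_vision = Modality.VISION in modalities
    has_language = Modality.LANGUAGE in modalities
    has_other = Modality.OTHER in modalities
    if has_vision and has_language:
        return DatasetCategory.TACTILE_VISION_LANGUAGE
    if has_vision and has_other:
        return DatasetCategory.TACTILE_VISION_OTHER
    if has_vision:
        return DatasetCategory.TACTILE_VISION
    if has_language and not has_other:
        return DatasetCategory.TACTILE_LANGUAGE
    return None
```

Under these assumed rules, `categorize_dataset({Modality.TACTILE, Modality.VISION})` returns `DatasetCategory.TACTILE_VISION`, while a tactile-only dataset returns `None` because it falls outside the four categories.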

What carries the argument

The hierarchical taxonomy that divides the field into modality-based datasets and the three method pillars of perception, cross-modal generation, and interaction.

If this is right

Datasets can be systematically located by whether they pair tactile data with vision, language, both, or other signals.
Perception and recognition methods improve grasp prediction by fusing local contact information with global visual context.
Cross-modal generation allows models to produce tactile outputs from images or text descriptions and vice versa.
Multimodal interaction supports closed-loop control where language instructions adjust actions based on real-time tactile feedback.
Standardized metrics and benchmarks become comparable once work is mapped onto the same taxonomy.
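As an illustration of the closed-loop point above, the following minimal sketch lets a language instruction set a target contact force and then adjusts the grip from simulated tactile readings with a proportional controller. Everything in it is hypothetical: the keyword table, the force values, the linear contact model, and the gain are stand-ins chosen for readability, not methods taken from the surveyed work.

```python
# Hypothetical mapping from instruction keywords to a target force in newtons.
TARGET_FORCE_N = {"gently": 1.0, "firmly": 5.0}
DEFAULT_FORCE_N = 3.0


def target_force_from_instruction(instruction: str) -> float:
    """Pick a target contact force from whitespace-separated keywords."""
    words = instruction.lower().split()
    for keyword, force in TARGET_FORCE_N.items():
        if keyword in words:
            return force
    return DEFAULT_FORCE_N


def simulated_tactile_force(grip_mm: float,
                            contact_at_mm: float = 2.0,
                            stiffness_n_per_mm: float = 2.0) -> float:
    """Toy contact model: zero force before contact, linear afterwards."""
    return max(0.0, grip_mm - contact_at_mm) * stiffness_n_per_mm


def closed_loop_grasp(instruction: str,
                      gain_mm_per_n: float = 0.2,
                      steps: int = 50):
    """Proportional control: close the gripper until the sensed force
    matches the target implied by the instruction."""
    target = target_force_from_instruction(instruction)
    grip_mm = 0.0
    history = []
    for _ in range(steps):
        measured = simulated_tactile_force(grip_mm)  # tactile feedback
        error = target - measured                    # language-derived goal
        grip_mm += gain_mm_per_n * error             # adjust the action
        history.append(measured)
    return target, history
```

With the default gain and stiffness, the force error shrinks by a constant factor each step once contact is made, so the sensed force settles at the target. A real system would replace the keyword lookup with a language model and the toy contact model with actual sensor readings.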

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The taxonomy could be extended by adding a fourth pillar for long-horizon planning that combines all three existing ones.
Hardware reviews in the survey imply that future sensor designs should prioritize dense spatial coverage to better match vision resolution.
Language-guided manipulation results suggest that large language models could be fine-tuned directly on tactile sequences to improve physical commonsense.
Benchmark summaries point to the need for new testbeds that measure transfer from simulation to real contact-rich tasks.

Load-bearing premise

Existing tactile fusion studies are fragmented enough across datasets and tasks that a single new taxonomy can organize them without major omissions or overlaps through early 2026.

What would settle it

Discovery of a substantial body of post-2026 work or pre-2026 studies that cannot be placed into any of the four dataset categories or three method pillars without forcing overlaps or gaps.

Figures

Figures reproduced from arXiv: 2605.17336 by Alex Zhou, Bin Fang, Daizong Liu, Di Tian, Henghui Ding, Hui Xiong, Qing-Long Han, Runwei Guan, Shaofeng Liang, Tao Huang, Xiaolou Sun, Yanzhou Mu, Yutao Yue, Zhixiang Cao.

**Figure 2.** Overview of representative datasets, methods in multimodal tactile fusion.

**Figure 3.** Representative tactile sensors.

**Figure 4.** Publication trend of multimodal tactile fusion papers

**Figure 5.** General paradigm of multimodal tactile fusion with…

**Figure 6.** Categorization of multimodal perception and recogni…

**Figure 7.** Categorization of multimodal cross-modal generation

**Figure 8.** Categorization of multimodal interaction and manipula…

Original abstract

Tactile sensing is a fundamental modality for embodied intelligence, offering unique and direct feedback on contact geometry, material properties, and interaction dynamics that remote sensors cannot replace. However, unimodal tactile perception is inherently limited by its sparse spatial coverage and lack of global semantic context. With the recent explosion in deep learning and large language models, integrating tactile with vision and language has become essential to bridge physical interaction with semantic reasoning, leading to the emergence of Multimodal Tactile Fusion. Despite rapid progress, the existing researches remain fragmented across disparate datasets, sensing modalities, and tasks, lacking a unified theoretical framework. To address this gap, this paper provides a comprehensive survey of multimodal tactile fusion research up to the first quarter of 2026. We propose a hierarchical taxonomy that organizes the field into two primary dimensions: multimodal datasets and multimodal methods. On the data side, we categorize resources ranging from Tactile-Vision datasets, Tactile-Language datasets, Tactile-Vision-Language datasets, and Tactile-Vision-Other datasets. On the method side, we structure prior work into three core pillars: (1) Multimodal Perception and Recognition, which focuses on object understanding and grasp prediction; (2) Cross-Modal Generation, focusing on bidirectional translation between tactile, vision, and text; and (3) Multimodal Interaction, emphasizing feedback control and language-guided manipulation. Furthermore, we summarize representative tactile sensing hardware, review commonly used evaluation metrics and benchmark settings, and discuss current challenges and promising future directions.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This survey gives a workable taxonomy for tactile-vision-language fusion but stays within standard survey bounds.

This survey organizes the growing body of work on combining tactile sensing with vision and language for embodied robots. Its main contribution is a taxonomy with four dataset types and three method pillars, intended to reduce fragmentation in the field. The authors split datasets into Tactile-Vision, Tactile-Language, Tactile-Vision-Language, and Tactile-Vision-Other groups. Methods fall under Multimodal Perception and Recognition for object understanding and grasp prediction, Cross-Modal Generation for bidirectional translation across modalities, and Multimodal Interaction for feedback control and language-guided manipulation. They also review hardware options, common metrics, and some future challenges. That compilation is the useful part for anyone trying to get oriented quickly.

The structure looks reasonable on the surface and pulls together scattered pieces without obvious internal contradictions. The weaker point is how complete the coverage actually is. Any taxonomy like this rests on selection choices, and overlaps between categories or gaps in recent work up to the 2026 cutoff are always possible. The fragmentation argument is standard for surveys and does not require extra proof, but it does not turn the paper into a first-principles advance.

Readers who would benefit most are researchers in embodied AI or tactile robotics who need a map of datasets and method families before starting a project. It is not aimed at people outside that subfield. The paper structures the literature clearly enough to deserve referee time. I would send it for peer review rather than desk reject.

Referee Report

0 major / 3 minor

Summary. This survey paper reviews multimodal tactile fusion research in embodied intelligence up to the first quarter of 2026. It proposes a hierarchical taxonomy organizing the literature along two dimensions: multimodal datasets (categorized as Tactile-Vision, Tactile-Language, Tactile-Vision-Language, and Tactile-Vision-Other) and multimodal methods (structured into three pillars: Multimodal Perception and Recognition for object understanding and grasp prediction; Cross-Modal Generation for bidirectional translation between tactile, vision, and text; and Multimodal Interaction for feedback control and language-guided manipulation). The manuscript additionally summarizes tactile sensing hardware, common evaluation metrics and benchmarks, current challenges, and future directions.

Significance. If the taxonomy proves comprehensive and the coverage thorough without major omissions or overlaps, the paper would offer a valuable unifying framework for a fragmented research area. This could help researchers efficiently navigate datasets and methods for integrating tactile sensing with vision and language in robotics applications such as grasp prediction and language-guided manipulation. The organizational synthesis itself constitutes the primary contribution, as is typical for high-quality surveys.

minor comments (3)

Abstract: the phrase 'existing researches remain fragmented' should be revised to 'existing research remains fragmented' or 'existing studies remain fragmented' for grammatical accuracy.
Dataset categorization section: the boundary between 'Tactile-Vision-Language datasets' and 'Tactile-Vision-Other datasets' would benefit from an explicit statement of the decision criteria used to assign papers to each category, to minimize potential reader confusion about overlaps.
Hardware review: a comparative table listing key specifications (spatial resolution, sensing area, sampling rate, and typical use cases) for the representative tactile sensors would improve clarity and utility.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive summary of our survey and the recommendation for minor revision. The report accurately captures the scope, taxonomy, and contributions of the manuscript. No specific major comments were provided in the referee report.

Circularity Check

0 steps flagged

No significant circularity in survey taxonomy or synthesis

The paper is a literature survey whose central contribution is a proposed hierarchical taxonomy that organizes existing multimodal tactile fusion research into dataset categories and three method pillars. This taxonomy is presented as an organizational synthesis of prior work rather than a derivation, prediction, or proof that reduces to the paper's own inputs by construction. All load-bearing elements rely on citations to external literature, with no self-referential definitions, fitted parameters renamed as predictions, or uniqueness theorems imported from the authors' prior work. The structure is self-contained as a review and does not exhibit any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is a survey paper that reviews and categorizes prior literature without introducing new free parameters, axioms, or invented entities.

pith-pipeline@v0.9.0 · 5869 in / 915 out tokens · 45246 ms · 2026-05-20T12:48:32.611952+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean reality_from_one_distinction unclear

Relation between the paper passage and the cited Recognition theorem.

We propose a hierarchical taxonomy that organizes the field into two primary dimensions: multimodal datasets and multimodal methods. On the data side, we categorize resources ranging from Tactile-Vision datasets, Tactile-Language datasets, Tactile-Vision-Language datasets, and Tactile-Vision-Other datasets. On the method side, we structure prior work into three core pillars: (1) Multimodal Perception and Recognition... (2) Cross-Modal Generation... (3) Multimodal Interaction...
IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

Relation between the paper passage and the cited Recognition theorem.

multimodal tactile fusion process comprises the following hierarchical stages... Modality-Specific Representation Learning... Cross-Modal Fusion and Joint Representation... Embodied Decoding and Task Execution

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

119 extracted references · 119 canonical work pages · 9 internal anchors

[1]

Multimodal visual- tactile representation learning through self-supervised con- trastive pre-training,

V . Dave, F. Lygerakis, and E. Rueckert, “Multimodal visual- tactile representation learning through self-supervised con- trastive pre-training,” in2024 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2024, pp. 8013– 8020

work page 2024
[2]

Bind- ing touch to everything: Learning unified multimodal tactile representations,

F. Yang, C. Feng, Z. Chen, H. Park, D. Wang, Y . Dou, Z. Zeng, X. Chen, R. Gangopadhyay, A. Owenset al., “Bind- ing touch to everything: Learning unified multimodal tactile representations,” inProceedings of the IEEE/CVF Confer- ence on Computer Vision and Pattern Recognition, 2024, pp. 26 340–26 353

work page 2024
[3]

A touch, vision, and language dataset for multimodal alignment,

L. Fu, G. Datta, H. Huang, W. C.-H. Panitch, J. Drake, J. Ortiz, M. Mukadam, M. Lambeta, R. Calandra, and K. Goldberg, “A touch, vision, and language dataset for multimodal alignment,” arXiv preprint arXiv:2402.13232, 2024

work page arXiv 2024
[4]

Towards comprehensive multimodal perception: Introducing the touch- language-vision dataset,

N. Cheng, Y . Li, J. Gao, B. Fang, J. Xu, and W. Han, “Towards comprehensive multimodal perception: Introducing the touch- language-vision dataset,”arXiv preprint arXiv:2403.09813, 2024

work page arXiv 2024
[5]

Cltp: Contrastive language- tactile pre-training for 3d contact geometry understanding,

W. Ma, X. Cao, Y . Zhang, C. Zhang, S. Yang, P. Hao, B. Fang, Y . Cai, S. Cui, and S. Wang, “Cltp: Contrastive language- tactile pre-training for 3d contact geometry understanding,” arXiv preprint arXiv:2505.08194, 2025

work page arXiv 2025
[6]

Touch100k: A large-scale touch-language-vision dataset for touch-centric multimodal representation,

N. Cheng, J. Xu, C. Guan, J. Gao, W. Wang, Y . Li, F. Meng, J. Zhou, B. Fang, and W. Han, “Touch100k: A large-scale touch-language-vision dataset for touch-centric multimodal representation,”Information Fusion, p. 103305, 2025

work page 2025
[7]

Vtla: Vision-tactile-language-action model with preference learning for insertion manipulation,

C. Zhang, P. Hao, X. Cao, X. Hao, S. Cui, and S. Wang, “Vtla: Vision-tactile-language-action model with preference learning for insertion manipulation,”arXiv preprint arXiv:2505.09577, 2025

work page arXiv 2025
[8]

Universal visuo-tactile video understanding for em- bodied interaction,

Y . Xie, M. Li, S. Li, X. Li, G. Chen, F. Ma, F. R. Yu, and W. Ding, “Universal visuo-tactile video understanding for em- bodied interaction,”arXiv preprint arXiv:2505.22566, 2025

work page arXiv 2025
[9]

Anytouch: Learning unified static-dynamic repre- sentation across multiple visuo-tactile sensors,

R. Feng, J. Hu, W. Xia, T. Gao, A. Shen, Y . Sun, B. Fang, and D. Hu, “Anytouch: Learning unified static-dynamic repre- sentation across multiple visuo-tactile sensors,”arXiv preprint arXiv:2502.12191, 2025

work page arXiv 2025
[10]

Vitac: Feature sharing between vision and tactile sensing for cloth texture recognition,

S. Luo, W. Yuan, E. Adelson, A. G. Cohn, and R. Fuentes, “Vitac: Feature sharing between vision and tactile sensing for cloth texture recognition,” in2018 IEEE International Confer- ence on Robotics and Automation (ICRA). IEEE, 2018, pp. 2722–2727

work page 2018
[11]

Can vision feel touch? tactile-aware visual grasping for transparent objects,

L. Tong, K. Qian, Z. Yue, and S. Luo, “Can vision feel touch? tactile-aware visual grasping for transparent objects,”IEEE Transactions on Circuits and Systems for Video Technology, 2025

work page 2025
[12]

Surformer v1: Transformer-based surface classification using 18 tactile and vision features,

M. Kansana, E. Hossain, S. Rahimi, and N. Amiri Golilarz, “Surformer v1: Transformer-based surface classification using 18 tactile and vision features,”Information, vol. 16, no. 10, p. 839, 2025

work page 2025
[13]

Ra- touch: Retrieval-augmented touch understanding with enriched visual data,

Y . Cho, H. Kim, S. Kim, Y . Zhang, Y . Choi, and S. Hong, “Ra- touch: Retrieval-augmented touch understanding with enriched visual data,” inProceedings of the 33rd ACM International Conference on Multimedia, 2025, pp. 1288–1297

work page 2025
[14]

A survey of deep learning and its applications: a new paradigm to machine learning,

S. Dargan, M. Kumar, M. R. Ayyagari, and G. Kumar, “A survey of deep learning and its applications: a new paradigm to machine learning,”Archives of computational methods in engineering, vol. 27, no. 4, pp. 1071–1092, 2020

work page 2020
[15]

Attention is all you need,

A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,”Advances in neural information processing systems, vol. 30, 2017

work page 2017
[16]

A Survey of Large Language Models

W. X. Zhao, K. Zhou, J. Li, T. Tang, X. Wang, Y . Hou, Y . Min, B. Zhang, J. Zhang, Z. Donget al., “A survey of large language models,”arXiv preprint arXiv:2303.18223, vol. 1, no. 2, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[17]

Imagebind: One embedding space to bind them all,

R. Girdhar, A. El-Nouby, Z. Liu, M. Singh, K. V . Alwala, A. Joulin, and I. Misra, “Imagebind: One embedding space to bind them all,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2023, pp. 15 180– 15 190

work page 2023
[18]

Transformer in touch: A survey,

J. Gao, N. Cheng, B. Fang, and W. Han, “Transformer in touch: A survey,”arXiv preprint arXiv:2405.12779, 2024

work page arXiv 2024
[19]

Tactile data generation and applications based on visuo-tactile sensors: A review,

Y . Sun, N. Cheng, S. Zhang, W. Li, L. Yang, S. Cui, H. Liu, F. Sun, J. Zhang, D. Guoet al., “Tactile data generation and applications based on visuo-tactile sensors: A review,”Infor- mation Fusion, vol. 121, p. 103162, 2025

work page 2025
[20]

Deep residual learning for image recognition,

K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” inProceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770– 778

work page 2016
[21]

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

A. Dosovitskiy, “An image is worth 16x16 words: Trans- formers for image recognition at scale,”arXiv preprint arXiv:2010.11929, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2010
[22]

Bert: Pre-training of deep bidirectional transformers for language understanding,

J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “Bert: Pre-training of deep bidirectional transformers for language understanding,” inProceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1 (long and short papers), 2019, pp. 4171–4186

work page 2019
[23]

Learning transferable visual models from natural language supervision,

A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clarket al., “Learning transferable visual models from natural language supervision,” inInternational conference on machine learning. PmLR, 2021, pp. 8748–8763

work page 2021
[24]

Recent progress in pressure and temperature tactile sensors: principle, classification, integration and outlook,

J. Yu, K. Zhang, and Y . Deng, “Recent progress in pressure and temperature tactile sensors: principle, classification, integration and outlook,”Soft Science, vol. 1, no. 1, pp. N–A, 2021

work page 2021
[25]

Classification of vision-based tactile sensors: A review,

H. Li, Y . Lin, C. Lu, M. Yang, E. Psomopoulou, and N. F. Lep- ora, “Classification of vision-based tactile sensors: A review,” IEEE Sensors Journal, 2025

work page 2025
[26]

Tactile sensors: A review,

M. Meribout, N. A. Takele, O. Derege, N. Rifiki, M. El Khalil, V . Tiwari, and J. Zhong, “Tactile sensors: A review,”Measure- ment, vol. 238, p. 115332, 2024

work page 2024
[27]

Recent progresses on flexi- ble tactile sensors,

Y . Wan, Y . Wang, and C. F. Guo, “Recent progresses on flexi- ble tactile sensors,”Materials Today Physics, vol. 1, pp. 61–73, 2017

work page 2017
[28]

A review of tactile information: Perception and action through touch,

Q. Li, O. Kroemer, Z. Su, F. F. Veiga, M. Kaboli, and H. J. Ritter, “A review of tactile information: Perception and action through touch,”IEEE Transactions on Robotics, vol. 36, no. 6, pp. 1619–1634, 2020

work page 2020
[29]

Biomimetic tactile sensors and signal processing with spike trains: A review,

Z. Yi, Y . Zhang, and J. Peters, “Biomimetic tactile sensors and signal processing with spike trains: A review,”Sensors and Actuators A: Physical, vol. 269, pp. 41–52, 2018

work page 2018
[30]

Gelsight: High- resolution robot tactile sensors for estimating geometry and force,

W. Yuan, S. Dong, and E. H. Adelson, “Gelsight: High- resolution robot tactile sensors for estimating geometry and force,”Sensors, vol. 17, no. 12, p. 2762, 2017

work page 2017
[31]

Digit: A novel design for a low-cost compact high-resolution tactile sensor with application to in-hand manipulation,

M. Lambeta, P.-W. Chou, S. Tian, B. Yang, B. Maloon, V . R. Most, D. Stroud, R. Santos, A. Byagowi, G. Kammereret al., “Digit: A novel design for a low-cost compact high-resolution tactile sensor with application to in-hand manipulation,”IEEE Robotics and Automation Letters, vol. 5, no. 3, pp. 3838–3845, 2020

work page 2020
[32]

Tac3d: A novel vision- based tactile sensor for measuring forces distribution and estimating friction coefficient distribution,

L. Zhang, Y . Wang, and Y . Jiang, “Tac3d: A novel vision- based tactile sensor for measuring forces distribution and estimating friction coefficient distribution,”arXiv preprint arXiv:2202.06211, 2022

work page arXiv 2022
[33]

Gelstereo 2.0: An improved gelstereo sensor with multimedium refractive stereo calibration,

C. Zhang, S. Cui, S. Wang, J. Hu, Y . Cai, R. Wang, and Y . Wang, “Gelstereo 2.0: An improved gelstereo sensor with multimedium refractive stereo calibration,”IEEE Transactions on Industrial Electronics, vol. 71, no. 7, pp. 7452–7462, 2023

work page 2023
[34]

Gelslim 3.0: High- resolution measurement of shape, force and slip in a compact tactile-sensing finger,

I. H. Taylor, S. Dong, and A. Rodriguez, “Gelslim 3.0: High- resolution measurement of shape, force and slip in a compact tactile-sensing finger,” in2022 International Conference on Robotics and Automation (ICRA). IEEE, 2022, pp. 10 781– 10 787

work page 2022
[35]

Omnitact: A multi-directional high-resolution touch sensor,

A. Padmanabha, F. Ebert, S. Tian, R. Calandra, C. Finn, and S. Levine, “Omnitact: A multi-directional high-resolution touch sensor,” in2020 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2020, pp. 618–624

work page 2020
[36]

Seeing through your skin: Rec- ognizing objects with a novel visuotactile sensor,

F. R. Hogan, M. Jenkin, S. Rezaei-Shoshtari, Y . Girdhar, D. Meger, and G. Dudek, “Seeing through your skin: Rec- ognizing objects with a novel visuotactile sensor,” inProceed- ings of the IEEE/CVF winter conference on applications of computer vision, 2021, pp. 1218–1227

work page 2021
[37]

Multimodal alignment and fusion: A sur- vey,

S. Li and H. Tang, “Multimodal alignment and fusion: A sur- vey,”arXiv preprint arXiv:2411.17040, 2024

work page arXiv 2024
[38]

A survey on multimodal large language models,

S. Yin, C. Fu, S. Zhao, K. Li, X. Sun, T. Xu, and E. Chen, “A survey on multimodal large language models,”National Science Review, vol. 11, no. 12, p. nwae403, 2024

work page 2024
[39]

Vhtformer: A joint query perception method for visual-haptic- textual information based on transformer,

L. Li, G. Chen, H. Wang, B. Li, B. Wang, Z. Yi, and C. Zhao, “Vhtformer: A joint query perception method for visual-haptic- textual information based on transformer,”Applied Soft Com- puting, p. 113529, 2025

work page 2025
[40]

Visual–tactile fusion for object recognition,

H. Liu, Y . Yu, F. Sun, and J. Gu, “Visual–tactile fusion for object recognition,”IEEE Transactions on Automation Science and Engineering, vol. 14, no. 2, pp. 996–1008, 2016

work page 2016
[41]

The feeling of success: Does touch sensing help predict grasp outcomes?

R. Calandra, A. Owens, M. Upadhyaya, W. Yuan, J. Lin, E. H. Adelson, and S. Levine, “The feeling of success: Does touch sensing help predict grasp outcomes?”arXiv preprint arXiv:1710.05512, 2017

work page arXiv 2017
[42]

Connecting look and feel: Associating the visual and tactile properties of physical materials,

W. Yuan, S. Wang, S. Dong, and E. Adelson, “Connecting look and feel: Associating the visual and tactile properties of physical materials,” inProceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 5580–5588

work page 2017
[43]

More than a feeling: Learning to grasp and regrasp using vision and touch,

R. Calandra, A. Owens, D. Jayaraman, J. Lin, W. Yuan, J. Ma- lik, E. H. Adelson, and S. Levine, “More than a feeling: Learning to grasp and regrasp using vision and touch,”IEEE Robotics and Automation Letters, vol. 3, no. 4, pp. 3300–3307, 2018

work page 2018
[44]

Multimodal grasp data set: A novel visual–tactile data set for robotic manipulation,

T. Wang, C. Yang, F. Kirchner, P. Du, F. Sun, and B. Fang, “Multimodal grasp data set: A novel visual–tactile data set for robotic manipulation,”International Journal of Advanced Robotic Systems, vol. 16, no. 1, p. 1729881418821571, 2019

work page 2019
[45]

Connecting touch and vision via cross-modal prediction,

Y . Li, J.-Y . Zhu, R. Tedrake, and A. Torralba, “Connecting touch and vision via cross-modal prediction,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 10 609–10 618

work page 2019
[46]

Touch and go: Learning from human-collected vision and touch,

F. Yang, C. Ma, J. Zhang, J. Zhu, W. Yuan, and A. Owens, “Touch and go: Learning from human-collected vision and touch,”arXiv preprint arXiv:2211.12498, 2022

work page arXiv 2022
[47]

Controllable visual-tactile synthesis,

R. Gao, W. Yuan, and J.-Y . Zhu, “Controllable visual-tactile synthesis,” inProceedings of the IEEE/CVF International Con- ference on Computer Vision, 2023, pp. 7040–7052

work page 2023
[48]

Learning to jointly understand visual and tactile signals,

Y . Li, Y . Du, C. Liu, F. Williams, M. Foshey, B. Eckart, J. Kautz, J. B. Tenenbaum, A. Torralba, and W. Matusik, “Learning to jointly understand visual and tactile signals,” in 19 The Twelfth International Conference on Learning Represen- tations, 2023

work page 2023
[49]

Touch in the wild: Learning fine-grained manipulation with a portable visuo-tactile grip- per,

X. Zhu, B. Huang, and Y . Li, “Touch in the wild: Learning fine-grained manipulation with a portable visuo-tactile grip- per,”arXiv preprint arXiv:2507.15062, 2025

work page arXiv 2025
[50]

Octopi: Object property reasoning with large tactile-language models,

S. Yu, K. Lin, A. Xiao, J. Duan, and H. Soh, “Octopi: Object property reasoning with large tactile-language models,”arXiv preprint arXiv:2405.02794, 2024

work page arXiv 2024
[51]

Stola: Self- adaptive touch-language framework with tactile common- sense reasoning in open-ended scenarios,

N. Cheng, J. Xu, J. Chen, and W. Han, “Stola: Self- adaptive touch-language framework with tactile common- sense reasoning in open-ended scenarios,”arXiv preprint arXiv:2505.04201, 2025

work page arXiv 2025
[52]

Multi-modal representation learning with tactile data,

H.-G. Chi, J. Barreiros, J. Mercat, K. Ramani, and T. Kol- lar, “Multi-modal representation learning with tactile data,” in 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2024, pp. 9660–9667

work page 2024
[53]

Damf: A semantic-guided dynamic attention frame- work for visual-haptic-textual multimodal fusion,

B. Wang, B. Li, T. Gao, L. Li, H. Wang, C. Zhao, and Z. Yi, “Damf: A semantic-guided dynamic attention frame- work for visual-haptic-textual multimodal fusion,”Knowledge- Based Systems, p. 114244, 2025

work page 2025
[54]

Tvt- transformer: A tactile-visual-textual fusion network for object recognition,

B. Li, L. Li, H. Wang, G. Chen, B. Wang, and S. Qiu, “Tvt- transformer: A tactile-visual-textual fusion network for object recognition,”Information Fusion, vol. 118, p. 102943, 2025

work page 2025
[55]

Z. Cheng, Y. Zhang, W. Zhang, H. Li, K. Wang, L. Song, and H. Zhang, “Omnivtla: Vision-tactile-language-action model with semantic-aligned tactile sensing,” arXiv preprint arXiv:2508.08706, 2025
[56]

R. Gao, Y.-Y. Chang, S. Mall, L. Fei-Fei, and J. Wu, “Objectfolder: A dataset of objects with implicit visual, auditory, and tactile representations,” arXiv preprint arXiv:2109.07991, 2021
[57]

R. Gao, Z. Si, Y.-Y. Chang, S. Clarke, J. Bohg, L. Fei-Fei, W. Yuan, and J. Wu, “Objectfolder 2.0: A multisensory object dataset for sim2real transfer,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 10 598–10 608
[58]

R. Gao, Y. Dou, H. Li, T. Agarwal, J. Bohg, Y. Li, L. Fei-Fei, and J. Wu, “The objectfolder benchmark: Multisensory learning with neural and real objects,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 17 276–17 286
[59]

P. Hao, C. Zhang, D. Li, X. Cao, X. Hao, S. Cui, and S. Wang, “Tla: Tactile-language-action model for contact-rich manipulation,” arXiv preprint arXiv:2503.08548, 2025
[60]

L. Wu, C. Yu, J. Ren, L. Chen, Y. Jiang, R. Huang, G. Gu, and H. Li, “Freetacman: Robot-free visuo-tactile data collection system for contact-rich manipulation,” arXiv preprint arXiv:2506.01941, 2025
[61]

Y. R. Song, J. Li, R. Fu, D. Murphy, K. Zhou, R. Shiv, Y. Li, H. Xiong, C. E. Owens, Y. Du et al., “Opentouch: Bringing full-hand touch to real-world interaction,” arXiv preprint arXiv:2512.16842, 2025
[62]

T. Engelbracht, R. Zurbrügg, M. Wohlrapp, M. Büchner, A. Valada, M. Pollefeys, H. Blum, and Z. Bauer, “Hoi!–a multimodal dataset for force-grounded, cross-view articulated manipulation,” arXiv preprint arXiv:2512.04884, 2025
[63]

Z. Wan, Y. Ling, S. Yi, L. Qi, W. Lee, M. Lu, S. Yang, X. Teng, P. Lu, X. Yang et al., “Vint-6d: A large-scale object-in-hand dataset from vision, touch and proprioception,” arXiv preprint arXiv:2501.00510, 2024
[64]

Y. Zheng, S. Gu, W. Li, Y. Zheng, Y. Zang, S. Tian, X. Li, R. Wu, C. Hao, C. Gao, S. Liu, H. Li, Y. Chen, S. Yan, and W. Ding, “Omnivta: Visuo-tactile world modeling for contact-rich robotic manipulation,” 2026
[65]

A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” Advances in neural information processing systems, vol. 25, 2012
[66]

“See, feel, act: Hierarchical learning for complex manipulation skills with multisensory fusion,” Science Robotics, vol. 4, no. 26, p. eaav3123, 2019
[67]

K. He, G. Gkioxari, P. Dollár, and R. Girshick, “Mask r-cnn,” in Proceedings of the IEEE international conference on computer vision, 2017, pp. 2961–2969
[68]

I. Kononenko, “Bayesian neural networks,” Biological Cybernetics, vol. 61, no. 5, pp. 361–370, 1989
[69]

X. Li, H. Liu, J. Zhou, and F. Sun, “Learning cross-modal visual-tactile representation using ensembled generative adversarial networks,” Cognitive Computation and Systems, vol. 1, no. 2, pp. 40–44, 2019
[70]

I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” Advances in neural information processing systems, vol. 27, 2014
[71]

M. A. Lee, Y. Zhu, K. Srinivasan, P. Shah, S. Savarese, L. Fei-Fei, A. Garg, and J. Bohg, “Making sense of vision and touch: Self-supervised learning of multimodal representations for contact-rich tasks,” in 2019 International conference on robotics and automation (ICRA). IEEE, 2019, pp. 8943–8950
[72]

A. Dosovitskiy, P. Fischer, E. Ilg, P. Hausser, C. Hazirbas, V. Golkov, P. Van Der Smagt, D. Cremers, and T. Brox, “Flownet: Learning optical flow with convolutional networks,” in Proceedings of the IEEE international conference on computer vision, 2015, pp. 2758–2766
[73]

W. Zheng, H. Liu, and F. Sun, “Lifelong visual-tactile cross-modal learning for robotic material perception,” IEEE transactions on neural networks and learning systems, vol. 32, no. 3, pp. 1192–1203, 2020
[74]

K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014
[75]

Y. Chen, A. Sipos, M. Van der Merwe, and N. Fazeli, “Visuo-tactile transformers for manipulation,” arXiv preprint arXiv:2210.00121, 2022
[76]

J. Hansen, F. Hogan, D. Rivkin, D. Meger, M. Jenkin, and G. Dudek, “Visuotactile-rl: Learning multimodal manipulation policies with deep reinforcement learning,” in 2022 International Conference on Robotics and Automation (ICRA). IEEE, 2022, pp. 8298–8304
[77]

D. Yarats, R. Fergus, A. Lazaric, and L. Pinto, “Mastering visual continuous control: Improved data-augmented reinforcement learning,” arXiv preprint arXiv:2107.09645, 2021
[78]

B. Li, J. Bai, S. Qiu, H. Wang, and Y. Guo, “Vito-transformer: a visual-tactile fusion network for object recognition,” IEEE Transactions on Instrumentation and Measurement, vol. 72, pp. 1–10, 2023
[79]

I. O. Tolstikhin, N. Houlsby, A. Kolesnikov, L. Beyer, X. Zhai, T. Unterthiner, J. Yung, A. Steiner, D. Keysers, J. Uszkoreit et al., “Mlp-mixer: An all-mlp architecture for vision,” Advances in neural information processing systems, vol. 34, pp. 24 261–24 272, 2021
[80]

H. Rasheed, M. U. Khattak, M. Maaz, S. Khan, and F. S. Khan, “Fine-tuned clip models are efficient video learners,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2023, pp. 6545–6554
