UniTacVLA: Unified Tactile Understanding and Prediction in Vision Language Action Models

Fucai Zhu; Jiaxin Shi; Michael Yu Wang; Siyu Zhu; Weihao Yuan; Xiaojun Wu; Xidong Zhang; Yichi Zhang

arxiv: 2606.31723 · v1 · pith:NRVEMB56new · submitted 2026-06-30 · 💻 cs.RO

Xidong Zhang, Yichi Zhang, Jiaxin Shi, Fucai Zhu, Siyu Zhu, Michael Yu Wang, Xiaojun Wu, Weihao Yuan

Pith reviewed 2026-07-01 05:20 UTC · model grok-4.3

keywords tactile sensing · vision language action models · contact-rich manipulation · unified latent space · chain-of-thought reasoning · future tactile prediction · dexterous robotics · mixed controller

The pith

A unified tactile latent space with chain-of-thought reasoning and coarse-to-fine prediction lets vision-language-action models handle contact-rich manipulation more reliably.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to show that treating tactile signals as active dynamic cues rather than passive inputs improves dexterous robotic tasks. It does this by building one latent space that jointly represents current contact states and future changes, then feeding the resulting prior into a mixed controller that blends real-time and predicted feedback. If the approach holds, robots would achieve higher success rates, better accuracy, and greater robustness on tasks such as insertion, wiping, and assembly, even when external disturbances occur. Existing vision-tactile-language-action methods fall short because they do not explicitly model future physical interactions.

Core claim

The authors claim that constructing a unified tactile latent space and jointly modeling current tactile states and future contact changes through tactile chain-of-thought reasoning and coarse-to-fine future tactile prediction forms a state-aware and dynamics-aware tactile prior; a tactile-action mixed controller then uses real-time and predicted tactile feedback to refine low-frequency action chunks with high-frequency corrections, yielding higher success rates, manipulation accuracy, and contact robustness on four categories of contact-rich tasks under both clean and perturbed conditions.

What carries the argument

Unified tactile latent space that supports chain-of-thought reasoning for current states and coarse-to-fine prediction for future contact changes, serving as a dynamics-aware prior for action refinement.
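As a rough illustration of what coarse-to-fine prediction over a latent sequence can look like, the toy sketch below first predicts a few coarse keyframes across the horizon, interpolates them to one value per step, and then adds a fine residual. All names (`coarse_predict`, `upsample`, `coarse_to_fine`, `residual_fn`), the 1-D latent, and the linear extrapolation standing in for a learned predictor are assumptions made for illustration; the abstract does not specify the paper's actual architecture.

```python
from typing import Callable, List


def coarse_predict(history: List[float], horizon: int, stride: int) -> List[float]:
    """Hypothetical coarse stage: extrapolate linearly, one keyframe every `stride` steps."""
    slope = history[-1] - history[-2]
    return [history[-1] + slope * k for k in range(stride, horizon + 1, stride)]


def upsample(last: float, coarse: List[float], stride: int) -> List[float]:
    """Linearly interpolate the coarse keyframes back to one value per step."""
    fine, prev = [], last
    for key in coarse:
        for j in range(1, stride + 1):
            fine.append(prev + (key - prev) * j / stride)
        prev = key
    return fine


def coarse_to_fine(
    history: List[float],
    horizon: int,
    stride: int,
    residual_fn: Callable[[int, float], float],
) -> List[float]:
    """Coarse trajectory first, then a per-step residual (stand-in for a learned fine stage).

    Assumes `horizon` is a multiple of `stride` and `history` has at least two entries.
    """
    coarse = coarse_predict(history, horizon, stride)
    base = upsample(history[-1], coarse, stride)
    return [b + residual_fn(i, b) for i, b in enumerate(base)]
```

With a zero residual the output is just the interpolated coarse trajectory, which makes the division of labor between the two stages easy to inspect.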

If this is right

The tactile-action mixed controller produces higher success rates on adjustment, insertion, wiping, and assembly tasks.
Manipulation accuracy and contact robustness increase under both clean and externally perturbed conditions.
Low-frequency action chunks receive high-frequency corrections from combined real-time and predicted tactile feedback.
Tactile signals function as dynamic interaction cues rather than auxiliary inputs.
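The refinement of low-frequency action chunks by high-frequency corrections can be sketched as a simple proportional rule. This is a minimal sketch under stated assumptions: scalar actions and tactile readings, a fixed `blend` between measured and predicted feedback, and a proportional `gain` are all hypothetical choices, and `refine_chunk` is not a function from the paper.

```python
from typing import Callable, List


def refine_chunk(
    chunk: List[float],
    predicted_tactile: List[float],
    read_tactile: Callable[[], float],
    target: float,
    substeps: int = 4,
    gain: float = 0.5,
    blend: float = 0.7,
) -> List[float]:
    """Expand one low-rate action chunk into high-rate commands.

    Each low-rate action is repeated `substeps` times; every repeat is
    corrected by the error between a blended tactile estimate and `target`.
    """
    commands = []
    for action, predicted in zip(chunk, predicted_tactile):
        for _ in range(substeps):
            measured = read_tactile()  # real-time feedback (assumed scalar)
            # Mix real-time and predicted feedback (assumed fixed weighting).
            estimate = blend * measured + (1.0 - blend) * predicted
            commands.append(action - gain * (estimate - target))
    return commands
```

When the blended estimate matches the target, the commands reduce to the original chunk repeated at the higher rate, so the correction only acts when contact deviates from what is expected.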

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same latent-space construction could be tested on additional sensory modalities to create multi-modal priors for manipulation.
If the coarse-to-fine prediction generalizes, it might allow lower control frequencies without sacrificing contact stability.
The framework might scale to more complex multi-fingered hands once the latent space is shown to transfer across hardware.

Load-bearing premise

A single unified tactile latent space combined with chain-of-thought reasoning and coarse-to-fine prediction can capture both current contact semantics and future physical interaction dynamics without loss of critical information or introduction of artifacts.

What would settle it

A controlled experiment on the same four task categories that shows no improvement in success rate or accuracy, or that demonstrates measurable artifacts in predicted tactile signals, when the unified latent space and prediction modules are used versus a passive-tactile baseline.
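One hedged way to decide whether such a comparison shows "no improvement" is a one-sided Fisher exact test on per-task success counts. The sketch below uses only the standard library; the function name and any counts passed to it are illustrative assumptions, not figures reported by the paper.

```python
from math import comb


def fisher_one_sided(success_a: int, trials_a: int, success_b: int, trials_b: int) -> float:
    """One-sided Fisher exact p-value.

    Probability, under the null of equal success rates with the margins fixed,
    that method A records at least `success_a` successes.
    """
    total_success = success_a + success_b
    total = trials_a + trials_b
    denom = comb(total, trials_a)
    upper = min(trials_a, total_success)
    p = 0.0
    for k in range(success_a, upper + 1):
        # Hypergeometric probability of exactly k successes landing in group A.
        p += comb(total_success, k) * comb(total - total_success, trials_a - k) / denom
    return p
```

For example, `fisher_one_sided(18, 20, 11, 20)` would return the p-value for a hypothetical 18/20 versus 11/20 split between the full method and a passive-tactile baseline; a large p-value on every task category would count against the claimed gains.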

Figures

Figures reproduced from arXiv: 2606.31723 by Fucai Zhu, Jiaxin Shi, Michael Yu Wang, Siyu Zhu, Weihao Yuan, Xiaojun Wu, Xidong Zhang, Yichi Zhang.

**Figure 2.** Overview of the real-robot setup and task setup. We evaluate UniTacVLA on four categories of …

**Figure 3.** Demonstration of the four contact-rich manipulation subtasks evaluated in our experiments.

**Figure 4.** Effect of prediction window size on USB task without disturbance. Panels: (a) stage-prediction visualization; (b) attention-weight visualization; (c) t-SNE visualization. Legend: Loose, Holding, Contact; visual weight, tactile weight; w/o T-CoT, w/ T-CoT.

**Figure 5.** Qualitative results of T-CoT reasoning. (a) Our method accurately predicts rapid contact-stage tran…

**Figure 6.** Qualitative results of coarse-to-fine future tactile prediction on the board-wiping task with distur…

**Figure 7.** Analysis of the high-frequency controller on the USB insertion task. The controller produces timely …

**Figure 8.** Gripper setup with two DM-Tac W visuo-tactile sensors mounted on the fingertips. The two tactile …

**Figure 9.** Representative trajectories and task setups for the four categories of contact-rich manipulation tasks: …

**Figure 10.** Tactile data collection for encoder pretraining using a handheld gripper with the same tactile sensing …

**Figure 11.** t-SNE visualization of the learned tactile latent space after encoder pretraining. The pretrained …

**Figure 12.** Prompt template used for tactile chain-of-thought annotation.

**Figure 13.** Qualitative comparison between fine-level tactile prediction and ground-truth tactile observations …

**Figure 14.** Qualitative comparison between fine-level tactile prediction and ground-truth tactile observations …

**Figure 15.** Qualitative comparison of tactile prediction within an action-tactile pair chunk on the plug insertion …

**Figure 16.** Visualization of the state-awareness capability of T-CoT across different tasks. The vertical axis …

**Figure 17.** Qualitative behaviors of the action-tactile mixed controller in insertion tasks. The first row shows …

Abstract

Vision-language-action (VLA) models have achieved strong performance in many robotic manipulation tasks, yet remain limited in contact-rich dexterous manipulation. To overcome this limitation, recent vision-tactile-language-action (VTLA) methods incorporate tactile sensing into VLA models to provide direct contact information. However, they typically treat tactile signals as passive auxiliary inputs, making it difficult to model tactile semantics and future physical interactions. To this end, we propose a unified tactile learning framework for contact-rich manipulation that models tactile signals as dynamic interaction cues for both contact understanding and prediction. Specifically, we construct a unified tactile latent space and jointly model current tactile states and future contact changes through tactile chain-of-thought reasoning and coarse-to-fine future tactile prediction, thereby forming a state-aware and dynamics-aware tactile prior. Based on this prior, we introduce a tactile-action mixed controller that combines real-time and predicted tactile feedback to refine low-frequency action chunks with high-frequency corrections. Real-world experiments on four categories of contact-rich tasks, including adjustment, insertion, wiping, and assembly, under both clean and externally perturbed settings, show that our method improves success rate, manipulation accuracy, and contact robustness over existing methods, demonstrating its effectiveness in dexterous physical interaction.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper adds a unified tactile latent space with chain-of-thought reasoning and coarse-to-fine prediction to VLA models, then uses a mixed controller to show gains on four real-world contact-rich tasks under clean and perturbed conditions.

The punchline is that this work gives VLA models a way to actively predict and use tactile information rather than just reacting to it as an extra input. They build a single latent space for tactile signals, use chain-of-thought reasoning to link current states to future changes, and add a coarse-to-fine prediction step. This prior then feeds into a controller that mixes real-time tactile feedback with predicted values to adjust actions at higher frequency.

What the paper does well is identify the limitation in existing VTLA approaches where tactile is treated passively, and then propose concrete steps to make it predictive. The experiments cover four categories of tasks—adjustment, insertion, wiping, and assembly—under both normal and perturbed conditions. The reported improvements in success rate, accuracy, and robustness are the kind of evidence that matters for manipulation work.

The new elements are the unified latent space, the tactile-specific chain-of-thought, and the mixed controller design. These are presented as a framework rather than isolated tricks, which helps tie the pieces together.

On the soft spots, the central modeling choice—that one latent space plus the reasoning and prediction steps can capture both semantics and dynamics without losing key information—needs strong support in the full paper. The abstract does not show equations or detailed protocols, so the strength of the results depends on how well the experiments control for variables and compare to strong baselines. If the gains hold up under closer inspection, fine; if they rely on particular task setups, that would be worth noting.

This paper is for researchers focused on integrating multiple modalities into robotic control, especially those dealing with contact-rich scenarios where vision alone falls short. Readers who work on VLA extensions or tactile sensing will find the architecture and the controller idea useful to consider. The real-world validation puts it in a position where it deserves a serious referee review rather than a desk reject, though the reviewers will likely ask for more ablation studies and clearer metrics.

I would send this to peer review.

Referee Report

0 major / 3 minor

Summary. The paper proposes UniTacVLA, a unified tactile learning framework for vision-language-action (VLA) models to address limitations in contact-rich dexterous manipulation. It constructs a unified tactile latent space and uses tactile chain-of-thought reasoning together with coarse-to-fine future tactile prediction to form a state-aware and dynamics-aware tactile prior. This prior informs a tactile-action mixed controller that combines real-time and predicted tactile feedback for refining action chunks. Real-world experiments across four contact-rich task categories (adjustment, insertion, wiping, assembly) under clean and externally perturbed conditions report improvements in success rate, manipulation accuracy, and contact robustness relative to existing methods.

Significance. If the reported gains hold under detailed scrutiny, the work would meaningfully extend VTLA models by shifting tactile signals from passive auxiliaries to active, predictive components of interaction dynamics. The emphasis on real-world validation across multiple task categories with external perturbations provides a practical test of robustness that is directly relevant to deployment in unstructured environments.

minor comments (3)

The abstract and introduction would benefit from explicit quantitative results (e.g., success-rate deltas and statistical significance) rather than qualitative statements of improvement; this would allow readers to gauge effect sizes immediately.
Notation for the unified tactile latent space and the coarse-to-fine prediction modules should be introduced with a clear diagram or equation set early in the methods section to avoid ambiguity when describing the chain-of-thought reasoning.
The description of the tactile-action mixed controller would be clearer if the frequency separation between low-frequency action chunks and high-frequency corrections were illustrated with a timing diagram or pseudocode.
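The frequency separation raised in the last comment can be pictured with a small schedule generator. This is only a sketch: the tick counts are placeholders, the function names are hypothetical, and it assumes the slow loop fires on a fixed multiple of the fast loop, which the abstract does not confirm.

```python
from typing import List


def schedule(ticks: int, fast_per_slow: int) -> List[str]:
    """Label each fast-loop tick.

    'chunk+correct' marks ticks where a new low-frequency action chunk is
    inferred; every tick also applies a high-frequency correction.
    """
    return [
        "chunk+correct" if t % fast_per_slow == 0 else "correct"
        for t in range(ticks)
    ]


def timing_diagram(ticks: int, fast_per_slow: int) -> str:
    """Render the schedule as two text rows, one per control loop."""
    events = schedule(ticks, fast_per_slow)
    slow = "".join("C" if e.startswith("chunk") else "." for e in events)
    fast = "x" * ticks
    return f"slow (chunks):      {slow}\nfast (corrections): {fast}"
```

Printing `timing_diagram(6, 3)` gives one row marking the ticks where a new chunk arrives and one row marking every correction tick, which is roughly the diagram the referee is asking for.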

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the detailed summary of our work and the positive assessment of its significance for extending VTLA models with predictive tactile components. The recommendation for minor revision is noted. However, no specific major comments were provided in the report.

Circularity Check

0 steps flagged

No significant circularity

The paper introduces an architectural framework (unified tactile latent space + CoT reasoning + coarse-to-fine prediction) whose value is asserted via downstream real-world task success rates on four contact-rich categories under clean and perturbed conditions. No equations, fitted parameters, or self-citations are presented that reduce any claimed prediction or uniqueness result to the inputs by construction. The modeling choices are standard extensions of VLA/VTLA architectures and are validated externally by empirical comparisons rather than by internal tautology or self-referential fitting.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The input was abstract-only and supplies no explicit free parameters, axioms, or invented entities, so none can be enumerated.


[1] [1]

H. Li, Y . Zhang, J. Zhu, S. Wang, M. A. Lee, H. Xu, E. Adelson, L. Fei-Fei, R. Gao, and J. Wu. See, hear, and feel: Smart sensory fusion for robotic manipulation.arXiv preprint arXiv:2212.03858, 2022

work page arXiv 2022

[2] [2]

Huang, S

J. Huang, S. Wang, F. Lin, Y . Hu, C. Wen, and Y . Gao. Tactile-vla: unlocking vision- language-action model’s physical knowledge for tactile generalization.arXiv preprint arXiv:2507.09160, 2025

work page arXiv 2025

[3] [3]

Zhang, H

K. Zhang, H. Zhang, Z. Xu, Z. Zhang, M. R. I. Prince, X. Li, X. Han, Y . Zhou, A. Ajoudani, and Y . She. Tacvla: Contact-aware tactile fusion for robust vision-language-action manipulation. arXiv preprint arXiv:2603.12665, 2026

work page arXiv 2026

[4] [4]

Zhang, P

C. Zhang, P. Hao, X. Cao, X. Hao, S. Cui, and S. Wang. Vtla: Vision-tactile-language- action model with preference learning for insertion manipulation.Biomimetic Intelligence and Robotics, page 100333, 2026

2026

[5] [5]

R. Feng, D. Hu, W. Ma, and X. Li. Play to the score: Stage-guided dynamic multi-sensory fusion for robotic manipulation.arXiv preprint arXiv:2408.01366, 2024

work page arXiv 2024

[6] [6]

Nazari, W

K. Nazari, W. Mandill, M. Hanheide, and A. G. Esfahani. Tactile dynamic behaviour prediction based on robot action. InAnnual Conference Towards Autonomous Robotic Systems, pages 284–293. Springer, 2021

2021

[7] [7]

Mandil, K

W. Mandil, K. Nazari, et al. Action conditioned tactile prediction: case study on slip prediction. arXiv preprint arXiv:2205.09430, 2022

work page arXiv 2022

[8] [8]

G. Ye, Z. Zhang, X. Zhao, S. Wu, H. Lu, S. Lu, and H. Liu. Learning to feel the future: Dreamtacvla for contact-rich manipulation.arXiv preprint arXiv:2512.23864, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[9] [9]

Zheng, S

Y . Zheng, S. Gu, W. Li, Y . Zheng, Y . Zang, S. Tian, X. Li, C. Hao, C. Gao, S. Liu, et al. Omnivta: Visuo-tactile world modeling for contact-rich robotic manipulation.arXiv preprint arXiv:2603.19201, 2026

work page arXiv 2026

[10] [10]

Calandra, A

R. Calandra, A. Owens, D. Jayaraman, J. Lin, W. Yuan, J. Malik, E. H. Adelson, and S. Levine. More than a feeling: Learning to grasp and regrasp using vision and touch.IEEE Robotics and Automation Letters, 3(4):3300–3307, 2018

2018

[11] [11]

S. Dong, D. K. Jha, D. Romeres, S. Kim, D. Nikovski, and A. Rodriguez. Tactile-rl for inser- tion: Generalization to objects of unknown geometry. In2021 IEEE International Conference on Robotics and Automation (ICRA), pages 6437–6443. IEEE, 2021

2021

[12] [12]

H. Qi, B. Yi, S. Suresh, M. Lambeta, Y . Ma, R. Calandra, and J. Malik. General in-hand object rotation with vision and touch. InConference on Robot Learning, pages 2549–2564. PMLR, 2023

2023

[13] [13]

Sunil, S

N. Sunil, S. Wang, Y . She, E. Adelson, and A. R. Garcia. Visuotactile affordances for cloth manipulation with local control. InConference on Robot Learning, pages 1596–1606. PMLR, 2023

2023

[14] [14]

Schoettler, A

G. Schoettler, A. Nair, J. Luo, S. Bahl, J. A. Ojea, E. Solowjow, and S. Levine. Deep re- inforcement learning for industrial insertion tasks with visual inputs and natural rewards. In 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 5548–5555. IEEE, 2020

2020

[15] [15]

W. Liu, J. Wang, Y . Wang, W. Wang, and C. Lu. Forcemimic: Force-centric imitation learning with force-motion capture system for contact-rich manipulation. In2025 IEEE International Conference on Robotics and Automation (ICRA), pages 1105–1112. IEEE, 2025. 9

2025

[16] [16]

H. Xue, J. Ren, W. Chen, G. Zhang, Y . Fang, G. Gu, H. Xu, and C. Lu. Reactive diffusion policy: Slow-fast visual-tactile policy learning for contact-rich manipulation.arXiv preprint arXiv:2503.02881, 2025

work page arXiv 2025

[17] [17]

Huang, Y

B. Huang, Y . Wang, X. Yang, Y . Luo, and Y . Li. 3d-vitac: Learning fine-grained manipulation with visuo-tactile sensing.arXiv preprint arXiv:2410.24091, 2024

work page arXiv 2024

[18] [18]

T. Z. Zhao, V . Kumar, S. Levine, and C. Finn. Learning fine-grained bimanual manipulation with low-cost hardware.arXiv preprint arXiv:2304.13705, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[19] [19]

M. J. Kim, K. Pertsch, S. Karamcheti, T. Xiao, A. Balakrishna, S. Nair, R. Rafailov, E. Foster, G. Lam, P. Sanketi, et al. Openvla: An open-source vision-language-action model.arXiv preprint arXiv:2406.09246, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[20] [20]

X-VLA: Soft-Prompted Transformer as Scalable Cross-Embodiment Vision-Language-Action Model

J. Zheng, J. Li, Z. Wang, D. Liu, X. Kang, Y . Feng, Y . Zheng, J. Zou, Y . Chen, J. Zeng, et al. X- vla: Soft-prompted transformer as scalable cross-embodiment vision-language-action model. arXiv preprint arXiv:2510.10274, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[21] [21]

S. Liu, L. Wu, B. Li, H. Tan, H. Chen, Z. Wang, K. Xu, H. Su, and J. Zhu. Rdt-1b: a diffu- sion foundation model for bimanual manipulation. InInternational Conference on Learning Representations, volume 2025, pages 29982–30009, 2025

2025

[22] [22]

$\pi_0$: A Vision-Language-Action Flow Model for General Robot Control

K. Black, N. Brown, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusai, L. Groom, K. Hausman, B. Ichter, et al.pi 0: A vision-language-action flow model for general robot control.arXiv preprint arXiv:2410.24164, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[23] [23]

$\pi_{0.5}$: a Vision-Language-Action Model with Open-World Generalization

P. Intelligence, K. Black, N. Brown, J. Darpinian, K. Dhabalia, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusai, et al.π 0.5: a vision-language-action model with open-world generalization. arXiv preprint arXiv:2504.16054, 2025

[24]

Y. Zhang, W. Yuan, Y. Zhang, X. Zhang, and J. Wan. Focusvla: Focused visual utilization for vision-language-action models. arXiv preprint arXiv:2603.28740, 2026

[25]

P. Hao, C. Zhang, D. Li, X. Cao, X. Hao, S. Cui, and S. Wang. Tla: Tactile-language-action model for contact-rich manipulation. arXiv preprint arXiv:2503.08548, 2025

[26]

J. Bi, K. Y. Ma, C. Hao, M. S. Zheng, and H. Soh. Vla-touch: Enhancing vision-language-action model with dual-level tactile feedback. IEEE Robotics and Automation Letters, 2026

[27]

J. Yu, H. Liu, Q. Yu, J. Ren, C. Hao, H. Ding, G. Huang, G. Huang, Y. Song, P. Cai, et al. Forcevla: Enhancing vla models with a force-aware moe for contact-rich manipulation. Advances in Neural Information Processing Systems, 38:93409–93439, 2026

[28]

J. Jones, O. Mees, C. Sferrazza, K. Stachowicz, P. Abbeel, and S. Levine. Beyond sight: Finetuning generalist robot policies with heterogeneous sensors via language grounding. In 2025 IEEE International Conference on Robotics and Automation (ICRA), pages 5961–5968. IEEE, 2025

[29]

W. Song, Z. Zhou, H. Zhao, J. Chen, P. Ding, H. Yan, Y. Huang, F. Tang, D. Wang, and H. Li. Reconvla: Reconstructive vision-language-action model as effective robot perceiver. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 18549–18557, 2026

[30]

W. Zhang, H. Liu, Z. Qi, Y. Wang, X. Yu, J. Zhang, R. Dong, J. He, H. Wang, Z. Zhang, et al. Dreamvla: a vision-language-action model dreamed with comprehensive world knowledge. Advances in Neural Information Processing Systems, 38:24195–24228, 2026

[31]

F. Li, W. Song, H. Zhao, J. Wang, P. Ding, D. Wang, L. Zeng, and H. Li. Spatial forcing: Implicit spatial representation alignment for vision-language-action model. arXiv preprint arXiv:2510.12276, 2025.

[32]

Z. Zhong, J. Li, J. He, H. Yan, X. Gong, G. Zhao, Y. Cai, J. Gao, X. Yan, B. Liu, et al. Dualcot-vla: Visual-linguistic chain of thought via parallel reasoning for vision-language-action models. arXiv preprint arXiv:2603.22280, 2026

[33]

S. Routray, H. Pan, U. Jain, S. Bahl, and D. Pathak. Vipra: Video prediction for robot actions. arXiv preprint arXiv:2511.07732, 2025

[34]

M. J. Kim, Y. Gao, T.-Y. Lin, Y.-C. Lin, Y. Ge, G. Lam, P. Liang, S. Song, M.-Y. Liu, C. Finn, and J. Gu. Cosmos policy: Fine-tuning video models for visuomotor control and planning. arXiv preprint arXiv:2601.16163, 2026

[35]

D. Ha and J. Schmidhuber. World models. arXiv preprint arXiv:1803.10122, 2(3):440, 2018

[36]

D. Hafner, J. Pasukonis, J. Ba, and T. Lillicrap. Mastering diverse domains through world models. arXiv preprint arXiv:2301.04104, 2023

[37]

M. Assran, A. Bardes, D. Fan, Q. Garrido, R. Howes, M. Muckley, A. Rizvi, C. Roberts, K. Sinha, A. Zholus, et al. V-jepa 2: Self-supervised video models enable understanding, prediction and planning. arXiv preprint arXiv:2506.09985, 2025

[38]

L. Maes, Q. L. Lidec, D. Scieur, Y. LeCun, and R. Balestriero. Leworldmodel: Stable end-to-end joint-embedding predictive architecture from pixels. arXiv preprint arXiv:2603.19312, 2026

[39]

B. Ai, S. Tian, H. Shi, Y. Wang, C. Tan, Y. Li, and J. Wu. Robopack: Learning tactile-informed dynamics models for dense packing. arXiv preprint arXiv:2407.01418, 2024

[40]

H. Chen, J. Liu, C. Gu, Z. Liu, R. Zhang, X. Li, X. He, Y. Guo, C.-W. Fu, S. Zhang, et al. Fast-in-slow: A dual-system foundation model unifying fast manipulation within slow reasoning. arXiv preprint arXiv:2506.01953, 2025

[41]

Z. Liu, J. Liu, H. Chen, J. Yu, Z. Guo, C. Hou, C. Gu, X. Mi, R. Zhang, K. Wu, et al. Last$_0$: Latent spatio-temporal chain-of-thought for robotic vision-language-action model. arXiv preprint arXiv:2601.05248, 2026

[42]

Y. Li, P. Tang, W. Zhang, C. Zhu, Y. Duan, W. Shi, X. Zhang, Z. Yang, J. Ji, and Y. Zhang. Favla: A force-adaptive fast-slow vla model for contact-rich robotic manipulation. arXiv preprint arXiv:2602.23648, 2026

[43]

Y. Li, H. Jiang, J. Xia, H. Zhang, J. Du, Y. Zhou, J. Zeng, C. Hao, J. Ren, Q. Yu, et al. Forcevla2: Unleashing hybrid force-position control with force awareness for contact-rich manipulation. arXiv preprint arXiv:2603.15169, 2026

[44]

J. Lee, J. Shin, H. Choi, and J. Lee. Latent diffusion models with masked autoencoders. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 17422–17431, October 2025.

A Dataset and Task Details

A.1 Hardware Details

All demonstrations are collected using a teleoperation system based on an ALOHA-style master-slave co...
