arxiv: 2512.23864 · v3 · submitted 2025-12-29 · 💻 cs.RO · cs.CV

Learning to Feel the Future: DreamTacVLA for Contact-Rich Manipulation

Pith reviewed 2026-05-16 18:36 UTC · model grok-4.3

classification 💻 cs.RO cs.CV
keywords tactile sensing · vision-language-action · contact-rich manipulation · world model · robotic manipulation · hierarchical alignment · digital twin

The pith

DreamTacVLA lets robots anticipate physical contact by predicting future tactile signals.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Vision-language-action models map language to robot actions but stay blind to force and slip because they lack tactile input. DreamTacVLA adds high-resolution tactile images as a third visual stream alongside wrist and third-person cameras. It first aligns all three streams with a Hierarchical Spatial Alignment loss so tactile tokens register correctly with their visual counterparts. The model is then finetuned with a tactile world model that forecasts the next tactile image given the current state and planned action. Training uses a large hybrid set of digital-twin simulations and real trials, which limits wear on the physical tactile sensors. The result is a policy that chooses actions based on both observed and imagined contact physics.

Core claim

DreamTacVLA grounds VLA models in contact physics by learning to feel the future. High-resolution tactile images serve as micro-vision inputs coupled with wrist-camera local vision and third-person macro vision. A Hierarchical Spatial Alignment loss first reconciles the multi-scale streams by aligning tactile tokens with their spatial counterparts. Finetuning with a tactile world model that predicts future tactile signals lets the policy condition actions on anticipated contact dynamics. The hybrid digital-twin plus real-world dataset overcomes tactile data scarcity and sensor wear, producing up to 95 percent success on contact-rich manipulation tasks.
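
The Figure 1 caption describes a Think–Dream–Act loop that runs the policy in two passes per step. A minimal sketch of that control flow follows, assuming the Act pass re-runs the policy on the world model's prediction (the caption is truncated before that point, so this step is an inference). The `policy` and `world_model` callables and the toy stand-ins are hypothetical, not the paper's networks.

```python
from typing import Callable, List, Optional

Vector = List[float]


def think_dream_act(
    policy: Callable[[Vector, Optional[Vector]], Vector],
    world_model: Callable[[Vector, Vector], Vector],
    state: Vector,
) -> Vector:
    """Run one control step in two policy passes (hypothetical interfaces)."""
    # Think: draft an action using a null tactile prediction.
    draft_action = policy(state, None)
    # Dream: the frozen world model imagines the tactile outcome of the draft.
    imagined_tactile = world_model(state, draft_action)
    # Act (assumed): re-run the policy conditioned on the imagined contact.
    return policy(state, imagined_tactile)


# Toy stand-ins so the sketch runs; they are NOT the paper's models.
def toy_policy(state: Vector, tactile_pred: Optional[Vector]) -> Vector:
    # Move less aggressively when strong contact is anticipated.
    gain = 1.0 if tactile_pred is None else 1.0 - min(1.0, tactile_pred[0])
    return [gain * s for s in state]


def toy_world_model(state: Vector, action: Vector) -> Vector:
    # Pretend "contact intensity" grows with the size of the planned motion.
    return [sum(abs(a) for a in action)]


draft = toy_policy([0.2, 0.1], None)
final_action = think_dream_act(toy_policy, toy_world_model, [0.2, 0.1])
```

In this toy setup the final action is a scaled-down version of the draft, which mirrors the idea that anticipated contact can temper the initial proposal.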

What carries the argument

A tactile world model that predicts upcoming tactile signals, paired with a Hierarchical Spatial Alignment (HSA) loss that unifies the multi-scale tactile and visual inputs.

If this is right

  • Actions become conditioned on both current observations and predicted future contact states.
  • High-resolution tactile feedback supports explicit reasoning about force, texture, and slip.
  • Hybrid simulation-real training bypasses the practical limits of physical tactile sensor durability.
  • The hierarchical scheme aligns fine tactile detail with coarser visual context for coherent multi-scale perception.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Future-prediction world models of this form could be attached to additional modalities such as audio or force-torque to enrich overall scene understanding.
  • Accurate contact anticipation may allow reliable deployment in unstructured settings where surface properties vary from training examples.
  • Extending the same prediction loop to longer horizons could support planning sequences of delicate manipulations without repeated real-world trials.

Load-bearing premise

The tactile world model trained on the hybrid dataset will generalize to unseen real-world contact dynamics without large distribution shifts or sensor degradation.

What would settle it

Test the trained model on contact-rich tasks using objects or surface materials absent from both the digital-twin and real training sets and measure whether success rate falls substantially below 95 percent.
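
Whether a measured success rate falls "substantially below" 95 percent depends on how many trials back each number. As a rough aid, the sketch below computes a Wilson score interval for a binomial success rate. The counts used are illustrative placeholders, not data from the paper.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    if trials <= 0 or not 0 <= successes <= trials:
        raise ValueError("need trials > 0 and 0 <= successes <= trials")
    p_hat = successes / trials
    z2 = z * z
    denom = 1.0 + z2 / trials
    center = (p_hat + z2 / (2.0 * trials)) / denom
    half_width = (
        z * math.sqrt(p_hat * (1.0 - p_hat) / trials + z2 / (4.0 * trials * trials))
    ) / denom
    return center - half_width, center + half_width


# Illustrative counts only: 19 successes in 20 trials (a 95% point estimate).
low, high = wilson_interval(19, 20)
```

With these placeholder counts the interval spans roughly 0.76 to 0.99, so a 20-trial run cannot on its own separate a policy near 80 percent from one near 99 percent. Held-out evaluations would need more trials per condition to show a real drop.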

Figures

Figures reproduced from arXiv: 2512.23864 by Guo Ye, Han Liu, Haoran Lu, Shang Wu, Shihan Lu, Xu Zhao, Zexi Zhang.

Figure 1: Hybrid tactile dataset and the Tactile-DreamVLA inference mechanism. (Top) We collect a large-scale tactile dataset covering 4 manipulation tasks and 9 objects, totaling 2M tactile frames. (Bottom) Our Think–Dream–Act loop executes each step of the policy in two passes. In the Think stage, the policy proposes a draft action using the current state and a null tactile prediction. In the Dream stage, a frozen…
Figure 2: The proposed framework operates in two stages. Stage 1 (Left): A multimodal encoder $E_\psi$ processes diverse inputs. This stage employs Hierarchical Spatial Alignment (HSA) to effectively fuse the features from different modalities, guided by the $\mathcal{L}_{\mathrm{HSA}}$ and $\mathcal{L}_W$ losses. A policy $\pi_\theta$ is trained to output an initial draft action $a^{(t)}_{\mathrm{draft}}$. Stage 2 (Right): A world model $W_\phi$ is trained to predict future tactile image…
Figure 3: The three-scale visual hierarchy of our model. Our framework fuses information from three distinct visual modalities. Our Hierarchical Spatial Alignment (HSA) loss is designed to explicitly ground the micro-vision (what the robot feels) within the local and macro visual contexts (what the robot sees). First, using the robot's forward kinematics and calibrated camera parameters (extrinsics $E_{tp}$, $E_w$ and intr…
Figure 4: Visualization of the world model's predicted future-state embedding $H_{\mathrm{dream}}$ across training. Initially, the embedding is noisy and unstructured, indicating weak predictive ability. As training advances, the embedding becomes increasingly concentrated and stable, revealing that the world model is learning a coherent representation of future tactile–visual dynamics. A key component of our architecture is a p…
Figure 5: Task suite used to evaluate DreamTacVLA. From left to right: Peg-in-Hole, USB Insert, Gear Assembly, and Tool Stabilization. Each task demands precise, contact-rich manipulation, including aligning tight tolerances, detecting slip, or maintaining stable tool contact. It provides a comprehensive benchmark for assessing tactile-aware policies. and finetuned on our dataset. This CLIP model is also responsibl…
Figure 6: The dataset consists of 80% simulated demonstrations and 20% real-world demonstrations, each containing four task categories: Peg-in-Hole, USB Insert, Gear Assembly, and Tool Stabilization. Blue segments represent simulated data, while orange segments denote real-world data. Baselines. We compare DreamTacVLA against strong state-of-the-art policies and controlled ablations of our own method. External base…
Figure 7: Qualitative comparison of our model's tactile prediction. For both the Peg-in-Hole and Tool Stabilization tasks, we visualize the sequence (left to right) comparing our model's Prediction (bottom row) to the Ground Truth tactile data (fourth row). The corresponding tactile images are provided as well. the most critical component. However, training the model to predict all future modalities (Tactile+Vision) yi…
Figure 8: Ablation studies on model and data scaling. Tactile Dataset Size. We further investigated the influence of the tactile dataset size on our model's performance. To do this, we trained separate instances of our model using progressively larger subsets of our collected data, ranging from 20% to 100% of the total available samples
Figure 9: Keyframes of the gear assembly and tool stabilization task workflow
Figure 10: Keyframes of one example failure case.
D.3 Pretraining Data and Temporal Sampling. We pretrain the tactile world model on unlabeled tactile sequences extracted from both simulation and real demonstrations. We sample short clips of length $T$ with stride $s$, and randomly choose a prediction horizon $N \in \{1, \dots, N_{\max}\}$. This yields pairs $(I_\tau^{(t)}, I_\tau^{(t+N)})$ to encourage multi-step predictive structure. …
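
The temporal sampling described in the Figure 10 excerpt (clips of length T taken with stride s, with a random horizon N up to N_max) can be sketched as follows. This is one plausible reading, assuming one (current, future) pair is drawn per clip and that both frames lie inside the clip. The function name and arguments are hypothetical.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

Frame = TypeVar("Frame")


def sample_prediction_pairs(
    frames: Sequence[Frame],
    clip_len: int,   # T in the excerpt
    stride: int,     # s in the excerpt
    n_max: int,      # N_max in the excerpt
    rng: random.Random,
) -> List[Tuple[Frame, Frame, int]]:
    """Return (frame_t, frame_{t+N}, N) triples, one per clip (assumed)."""
    if clip_len < 2 or stride < 1 or n_max < 1:
        raise ValueError("need clip_len >= 2, stride >= 1, n_max >= 1")
    pairs = []
    for start in range(0, len(frames) - clip_len + 1, stride):
        clip = frames[start:start + clip_len]
        # Horizon N is uniform on {1, ..., min(N_max, T - 1)}.
        horizon = rng.randint(1, min(n_max, clip_len - 1))
        # Pick t so that t + N still falls inside the clip.
        t = rng.randrange(0, clip_len - horizon)
        pairs.append((clip[t], clip[t + horizon], horizon))
    return pairs


# Stand-in "frames": integers equal to their time index.
pairs = sample_prediction_pairs(list(range(100)), clip_len=8, stride=4,
                                n_max=3, rng=random.Random(0))
```

Passing an explicit `random.Random` keeps the sampling reproducible, which helps when comparing runs that differ only in horizon settings.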
Original abstract

Vision-Language-Action (VLA) models have shown remarkable generalization by mapping web-scale knowledge to robotic control, yet they remain blind to physical contact. Consequently, they struggle with contact-rich manipulation tasks that require reasoning about force, texture, and slip. While some approaches incorporate low-dimensional tactile signals, they fail to capture the high-resolution dynamics essential for such interactions. To address this limitation, we introduce DreamTacVLA, a framework that grounds VLA models in contact physics by learning to feel the future. Our model adopts a hierarchical perception scheme in which high-resolution tactile images serve as micro-vision inputs coupled with wrist-camera local vision and third-person macro vision. To reconcile these multi-scale sensory streams, we first train a unified policy with a Hierarchical Spatial Alignment (HSA) loss that aligns tactile tokens with their spatial counterparts in the wrist and third-person views. To further deepen the model's understanding of fine-grained contact dynamics, we finetune the system with a tactile world model that predicts future tactile signals. To mitigate tactile data scarcity and the wear-prone nature of tactile sensors, we construct a hybrid large-scale dataset sourced from both high-fidelity digital twin and real-world experiments. By anticipating upcoming tactile states, DreamTacVLA acquires a rich model of contact physics and conditions its actions on both real observations and imagined consequences. Across contact-rich manipulation tasks, it outperforms state-of-the-art VLA baselines, achieving up to 95% success, highlighting the importance of understanding physical contact for robust, touch-aware robotic agents.
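
The HSA loss is described only at a high level: tactile tokens are aligned with their spatial counterparts in the camera views, located (per the Figure 3 caption) through forward kinematics and calibrated camera parameters. One way this could look, as a sketch under assumptions: project the contact point into a camera with a pinhole model, find the image patch it lands in, and penalize the cosine distance between the tactile token and that patch's token. The pinhole model, the cosine form of the loss, and every function and parameter name below are illustrative assumptions, not the paper's formulation.

```python
import math
from typing import Sequence


def project_to_patch(
    point_world: Sequence[float],          # contact point (x, y, z), world frame
    extrinsic: Sequence[Sequence[float]],  # 3x4 [R | t], world -> camera
    fx: float, fy: float, cx: float, cy: float,  # pinhole intrinsics
    patch_px: int, grid_w: int, grid_h: int,
) -> int:
    """Return the row-major index of the image patch containing the point."""
    px, py, pz = point_world
    x, y, z = (r[0] * px + r[1] * py + r[2] * pz + r[3] for r in extrinsic)
    if z <= 0:
        raise ValueError("point is behind the camera")
    u = fx * x / z + cx
    v = fy * y / z + cy
    # Clamp to the grid so slightly out-of-view points map to border patches.
    col = min(max(int(u // patch_px), 0), grid_w - 1)
    row = min(max(int(v // patch_px), 0), grid_h - 1)
    return row * grid_w + col


def cosine_alignment_loss(tactile_token: Sequence[float],
                          visual_token: Sequence[float]) -> float:
    """1 - cosine similarity; 0 when the two tokens point the same way."""
    dot = sum(a * b for a, b in zip(tactile_token, visual_token))
    norm = (math.sqrt(sum(a * a for a in tactile_token))
            * math.sqrt(sum(b * b for b in visual_token)))
    if norm == 0:
        raise ValueError("tokens must be non-zero")
    return 1.0 - dot / norm


identity_extrinsic = [[1.0, 0.0, 0.0, 0.0],
                      [0.0, 1.0, 0.0, 0.0],
                      [0.0, 0.0, 1.0, 0.0]]
patch_index = project_to_patch((0.0, 0.0, 1.0), identity_extrinsic,
                               100.0, 100.0, 112.0, 112.0,
                               patch_px=16, grid_w=14, grid_h=14)
loss = cosine_alignment_loss([1.0, 0.0], [1.0, 0.0])
```

The example uses a 224×224 image split into a 14×14 grid of 16-pixel patches, a common vision-transformer layout chosen here only for illustration.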

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. DreamTacVLA is a hierarchical Vision-Language-Action framework that integrates high-resolution tactile images as micro-vision inputs with wrist and third-person camera views. It first trains a unified policy using a Hierarchical Spatial Alignment (HSA) loss to align tactile tokens spatially, then finetunes with a tactile world model that predicts future tactile states on a hybrid dataset of high-fidelity digital-twin and real-world data. The model conditions actions on both observed and imagined tactile consequences, reporting up to 95% success on contact-rich manipulation tasks while outperforming state-of-the-art VLA baselines.

Significance. If the empirical claims hold after proper validation, the work would be significant for robotics by addressing the blindness of current VLAs to physical contact dynamics. It introduces a concrete mechanism (tactile world model + HSA alignment) to ground policies in contact physics, which could improve robustness in force-sensitive tasks where vision alone fails.

major comments (3)
  1. [Experiments] Experimental evaluation: The abstract and results claim quantitative gains up to 95% success and outperformance of VLA baselines, yet provide no details on task definitions, baseline implementations, number of trials per condition, or statistical tests. This absence makes it impossible to verify whether the performance edge is statistically reliable or reproducible.
  2. [Tactile World Model] Tactile world model and hybrid dataset: The finetuning stage relies on a world model trained on hybrid digital-twin plus real data to acquire a 'rich model of contact physics.' However, no evidence is presented that the digital-twin tactile images replicate real sensor noise, wear, material deformation, or force resolution at the pixel level; without this, the generalization assumption to unseen real-world contact dynamics (new objects, forces, or sensor aging) remains untested and load-bearing for the headline claim.
  3. [Hierarchical Perception Scheme] Hierarchical Spatial Alignment (HSA) loss: The loss is presented as the mechanism to reconcile multi-scale sensory streams, yet the manuscript does not report ablations isolating its contribution versus a standard multi-view fusion baseline. If the alignment does not measurably improve policy performance, the hierarchical perception scheme's necessity is undermined.
minor comments (2)
  1. [Abstract] The abstract states 'up to 95% success' without specifying the exact tasks, conditions, or variance; this should be clarified with precise metrics and task names.
  2. [Method] Notation for the tactile prediction horizon and HSA weighting coefficient is introduced without explicit equations or hyperparameter sensitivity analysis in the main text.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their thorough review and constructive comments. We address each major point below and will revise the manuscript to incorporate clarifications and additional analyses as needed.

Point-by-point responses
  1. Referee: [Experiments] Experimental evaluation: The abstract and results claim quantitative gains up to 95% success and outperformance of VLA baselines, yet provide no details on task definitions, baseline implementations, number of trials per condition, or statistical tests. This absence makes it impossible to verify whether the performance edge is statistically reliable or reproducible.

    Authors: We agree that the manuscript would benefit from more explicit details to ensure reproducibility. In the revised version, we will expand the experimental setup section to include precise definitions of each contact-rich task (e.g., object types, success criteria), descriptions of how the VLA baselines were implemented (including any adaptations from their original papers), the number of trials conducted per condition (typically 20-50 depending on the task), and results of statistical significance tests such as paired t-tests with p-values. This will allow readers to better assess the reliability of the reported gains. revision: yes

  2. Referee: [Tactile World Model] Tactile world model and hybrid dataset: The finetuning stage relies on a world model trained on hybrid digital-twin plus real data to acquire a 'rich model of contact physics.' However, no evidence is presented that the digital-twin tactile images replicate real sensor noise, wear, material deformation, or force resolution at the pixel level; without this, the generalization assumption to unseen real-world contact dynamics (new objects, forces, or sensor aging) remains untested and load-bearing for the headline claim.

    Authors: The hybrid dataset construction is detailed in the manuscript, but we acknowledge the lack of direct fidelity validation between digital-twin and real tactile images. To strengthen this, we will add in the revision a dedicated analysis (e.g., in Section 4.2 or an appendix) providing side-by-side comparisons of simulated and real tactile images under similar contact conditions, including metrics for noise similarity and deformation patterns. We will also discuss the limitations regarding generalization to sensor aging and new dynamics as future work. This addresses the core concern while noting that the current results demonstrate practical improvements on real hardware. revision: partial

  3. Referee: [Hierarchical Perception Scheme] Hierarchical Spatial Alignment (HSA) loss: The loss is presented as the mechanism to reconcile multi-scale sensory streams, yet the manuscript does not report ablations isolating its contribution versus a standard multi-view fusion baseline. If the alignment does not measurably improve policy performance, the hierarchical perception scheme's necessity is undermined.

    Authors: We recognize the value of ablations to isolate the HSA loss's impact. To address this, we will include an ablation study in the revised manuscript comparing the policy trained with the HSA loss to one using a standard multi-view fusion approach without the alignment objective. We will report the performance differences to demonstrate the contribution of the hierarchical alignment. revision: yes
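
The second response promises side-by-side comparisons of simulated and real tactile images with similarity metrics, without naming one. As a simple stand-in, the sketch below computes peak signal-to-noise ratio (PSNR) between a simulated and a real image of the same contact. Using PSNR here is an assumption, and the function name is hypothetical.

```python
import math
from typing import Sequence


def psnr(sim_image: Sequence[Sequence[float]],
         real_image: Sequence[Sequence[float]],
         max_value: float = 255.0) -> float:
    """PSNR in dB between two same-sized grayscale images (higher = closer)."""
    if len(sim_image) != len(real_image) or any(
        len(a) != len(b) for a, b in zip(sim_image, real_image)
    ):
        raise ValueError("images must have the same shape")
    squared_errors = [
        (a - b) ** 2
        for row_a, row_b in zip(sim_image, real_image)
        for a, b in zip(row_a, row_b)
    ]
    mse = sum(squared_errors) / len(squared_errors)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


# Tiny placeholder images, not real sensor data.
same = psnr([[0.0, 10.0], [20.0, 30.0]], [[0.0, 10.0], [20.0, 30.0]])
worst = psnr([[0.0, 0.0]], [[255.0, 255.0]])
```

A pixel-level score like this would not capture deformation patterns or sensor wear by itself, so it would complement rather than replace the qualitative comparison the referee asks for.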

Circularity Check

0 steps flagged

No circularity; empirical training and evaluation chain is self-contained

full rationale

The paper describes a standard hierarchical VLA training pipeline: HSA loss for multi-view alignment followed by finetuning a tactile world model on a hybrid dataset, with final claims resting on measured success rates (up to 95%) against external baselines. No equations, uniqueness theorems, or self-citations are invoked to force results by construction; the 'prediction' of future tactile states is an ordinary supervised objective whose outputs are evaluated on held-out real-world rollouts rather than being definitionally identical to the training inputs.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The approach rests on standard deep-learning assumptions about representation learning and the fidelity of the digital-twin simulation. The abstract gives no values for the free parameters listed below and names no invented physical entities.

free parameters (2)
  • HSA loss weighting coefficient
    Hyperparameter balancing tactile-visual alignment against task loss; value not reported in abstract.
  • Tactile prediction horizon
    Number of future steps the world model is trained to predict; chosen during finetuning.
axioms (2)
  • domain assumption: High-resolution tactile images can be processed as spatial tokens equivalent to visual patches.
    Invoked when treating tactile data as micro-vision inputs in the hierarchical scheme.
  • domain assumption: The digital twin faithfully reproduces real contact physics for the chosen tasks.
    Required for the hybrid dataset to serve as effective training data.
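
The first axiom treats a tactile image like any other image that can be cut into patch tokens. A minimal sketch of that tokenization follows, assuming non-overlapping square patches flattened in row-major order. The patch size and function name are illustrative, and a real encoder would apply a learned projection to each flattened patch.

```python
from typing import List, Sequence


def patchify(image: Sequence[Sequence[float]], patch: int) -> List[List[float]]:
    """Split an H x W image into flattened, non-overlapping patch x patch tokens."""
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image sides must be multiples of the patch size")
    tokens = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            tokens.append([
                image[r][c]
                for r in range(top, top + patch)
                for c in range(left, left + patch)
            ])
    return tokens


# 4x4 placeholder "tactile image" whose pixel value encodes its position.
toy_image = [[float(r * 4 + c) for c in range(4)] for r in range(4)]
tokens = patchify(toy_image, patch=2)
```

Each token keeps a fixed spatial location in the grid, which is what lets an alignment objective pair a tactile token with a specific region of a camera view.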

pith-pipeline@v0.9.0 · 5592 in / 1380 out tokens · 54197 ms · 2026-05-16T18:36:51.026338+00:00 · methodology

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Learning Versatile Humanoid Manipulation with Touch Dreaming

    cs.RO 2026-04 conditional novelty 5.0

    HTD, a multimodal transformer policy trained with behavioral cloning and touch dreaming to predict future tactile latents, achieves a 90.9% relative success rate improvement over baselines on five real-world contact-r...

  2. Tactile-based Multimodal Fusion in Embodied Intelligence: A Survey of Vision, Language, and Contact-Driven Paradigms

    cs.RO 2026-05 unverdicted novelty 4.0

    A survey proposing a hierarchical taxonomy for multimodal tactile fusion datasets and methods across perception, generation, and interaction in embodied intelligence.
