Learning Versatile Humanoid Manipulation with Touch Dreaming
Pith reviewed 2026-05-10 14:45 UTC · model grok-4.3
The pith
Touch dreaming in a multimodal Transformer policy yields a 90.9 percent relative improvement in average success rate over the stronger baseline across five real-world contact-rich humanoid manipulation tasks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors present Humanoid Transformer with Touch Dreaming (HTD), a multimodal encoder-decoder Transformer that treats touch as a core modality alongside multi-view vision and proprioception. HTD is trained in a single stage to predict action chunks, future hand-joint forces, and future tactile latents, with the tactile-latent targets supplied by an EMA target encoder. Across five real-world contact-rich tasks, it achieves a 90.9 percent relative improvement in average success rate over the stronger baseline, and an ablation shows latent tactile prediction outperforming raw tactile prediction by a 30 percent relative gain in success rate.
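The headline figure is a relative improvement, not an absolute one. A minimal sketch of the arithmetic follows; the two success rates are hypothetical placeholders chosen only to illustrate the formula and are not values reported in the paper.

```python
def relative_improvement(baseline: float, new: float) -> float:
    """Relative gain of `new` over `baseline`, as a fraction of the baseline."""
    if baseline <= 0:
        raise ValueError("baseline success rate must be positive")
    return (new - baseline) / baseline


# Hypothetical average success rates (NOT taken from the paper).
baseline_rate = 0.40
policy_rate = 0.60

gain = relative_improvement(baseline_rate, policy_rate)
print(f"absolute gain: {policy_rate - baseline_rate:.2f}")
print(f"relative gain: {gain:.1%}")
```

By the same formula, a 90.9 percent relative improvement means the policy's average success rate is about 1.909 times the baseline's. The absolute gap depends on the baseline level, which the abstract does not state.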
What carries the argument
The touch dreaming component, which augments behavioral cloning by having the policy predict future hand-joint forces and future tactile latents, with the latent targets produced by an exponential moving average (EMA) target encoder, so that it learns contact-aware representations.
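As a rough illustration of this mechanism, the sketch below pairs an online tactile encoder with an EMA copy that supplies the prediction targets. It uses NumPy and linear maps in place of the Transformer. The function names (`encode`, `ema_update`, `touch_dreaming_loss`), the dimensions, the momentum value, and the mean-squared-error objective are all assumptions made for illustration; the paper's actual architecture and loss may differ.

```python
import numpy as np

rng = np.random.default_rng(0)
TACTILE_DIM, LATENT_DIM = 16, 4  # hypothetical sizes

# Online tactile encoder weights and their EMA copy (the target encoder).
online_w = rng.normal(size=(TACTILE_DIM, LATENT_DIM))
target_w = online_w.copy()


def encode(weights, tactile):
    """Linear stand-in for a tactile encoder: raw readings -> latent."""
    return tactile @ weights


def ema_update(target, online, momentum):
    """Move the target weights a small step toward the online weights."""
    return momentum * target + (1.0 - momentum) * online


def touch_dreaming_loss(predicted_latent, future_tactile, target_weights):
    """MSE between a predicted latent and the target encoder's latent of the
    actual future tactile reading. The target is treated as a constant, so in
    a real training loop no gradient would flow into the target encoder."""
    target_latent = encode(target_weights, future_tactile)
    return float(np.mean((predicted_latent - target_latent) ** 2))


future_tactile = rng.normal(size=(8, TACTILE_DIM))  # synthetic batch
# Stand-in for the policy's prediction: the exact latent plus a constant error.
predicted = encode(online_w, future_tactile) + 0.1
loss = touch_dreaming_loss(predicted, future_tactile, target_w)

# In training, the online weights would change by gradient descent first;
# the target is then refreshed by EMA rather than by gradients.
target_w = ema_update(target_w, online_w, momentum=0.99)
print(f"latent prediction loss: {loss:.4f}")
```

Because the target encoder is updated by averaging rather than trained separately, the sketch needs no tactile pretraining stage, which matches the single-stage training the paper describes.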
Load-bearing premise
That the contact dynamics and stability from VR-collected demonstrations transfer to the real robot without substantial distribution shift.
What would settle it
A controlled experiment showing equivalent or lower success rates for the HTD policy versus the baseline when evaluated on the same five tasks with varied surface conditions or speeds.
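To make such a comparison concrete, one option is to put an interval around each success rate. The sketch below uses the Wilson score interval, which is a choice made here and not a method taken from the paper; the outcome counts are hypothetical.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial success rate (z=1.96 is ~95%)."""
    if trials <= 0 or not 0 <= successes <= trials:
        raise ValueError("need trials > 0 and 0 <= successes <= trials")
    p = successes / trials
    denom = 1.0 + z * z / trials
    centre = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return centre - half, centre + half


# Hypothetical outcome counts (NOT from the paper).
baseline_lo, baseline_hi = wilson_interval(successes=4, trials=10)
policy_lo, policy_hi = wilson_interval(successes=8, trials=10)

print(f"baseline: [{baseline_lo:.3f}, {baseline_hi:.3f}]")
print(f"policy:   [{policy_lo:.3f}, {policy_hi:.3f}]")
print("intervals overlap:", policy_lo <= baseline_hi)
```

Overlapping intervals are only a coarse signal, not a formal equivalence test, but they show why small trial counts make it hard to tell whether two policies really differ.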
Original abstract
Humanoid robots promise general-purpose assistance, yet real-world humanoid loco-manipulation remains challenging because it requires whole-body stability, end-effector dexterity, and contact-aware interaction under frequent contact changes. In this work, we study dexterous, contact-rich humanoid loco-manipulation. We first develop an RL-based lower-body controller that serves as the stability backbone for whole-body execution during complex manipulation. Built on this controller, we develop a VR-based whole-body humanoid data collection system that integrates dexterous hands and tactile sensing for contact-rich manipulation. We then propose Humanoid Transformer with Touch Dreaming (HTD), a multimodal encoder-decoder Transformer that models touch as a core modality alongside multi-view vision and proprioception. HTD is trained in a single stage with behavioral cloning augmented by touch dreaming: in addition to predicting action chunks, the policy predicts future hand-joint forces and future tactile latents, with tactile-latent targets provided by an exponential moving average target encoder without requiring a separate tactile pretraining stage. This encourages the policy to learn contact-aware representations for dexterous manipulation. Across five real-world contact-rich tasks, HTD achieves a 90.9% relative improvement in average success rate over the stronger baseline. Ablation results further show that latent-space tactile prediction is more effective than raw tactile prediction, yielding a 30% relative gain in success rate. These results demonstrate that our touch-dreaming-enhanced learning system enables versatile, high-dexterity humanoid manipulation in the real world. More information and open-source materials are available at: humanoid-touch-dream.github.io.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces Humanoid Transformer with Touch Dreaming (HTD), a multimodal encoder-decoder Transformer policy for humanoid loco-manipulation. It combines behavioral cloning on VR-collected whole-body demonstrations with auxiliary losses for predicting future hand-joint forces and future tactile latents, where the tactile latent targets are supplied by an exponential moving average (EMA) target encoder in a single training stage without separate tactile pretraining. The central empirical claim is a 90.9% relative improvement in average success rate over the stronger baseline across five real-world contact-rich tasks, plus a 30% relative gain from latent-space over raw tactile prediction in ablations.
Significance. If the results hold under rigorous evaluation, the work offers a practical advance for contact-aware humanoid policies by demonstrating that single-stage touch dreaming can yield contact-aware representations that improve real-robot dexterity and stability. The integration of an RL lower-body controller with VR data collection and the open-source release are positive contributions to reproducible humanoid research.
major comments (3)
- [Section 5] Section 5 (Experiments and Results): The abstract and main results report a 90.9% relative success-rate improvement and 30% ablation gain, yet provide no information on the number of trials per task, per-seed variance, statistical significance tests, or failure-mode analysis. Without these, the load-bearing empirical claim cannot be properly evaluated for robustness.
- [Section 4.2] Section 4.2 (Touch Dreaming formulation): The method relies on the EMA target encoder supplying stable, non-collapsing tactile latent targets to drive the auxiliary loss and the reported gains. The text contains no analysis (e.g., latent variance trajectories, cosine similarity to a constant target, or a collapse ablation) confirming that the EMA remains informative throughout training. This directly affects attribution of the 30% latent-vs-raw gain to touch dreaming rather than other training factors.
- [Section 3.1 and 5.1] Section 3.1 and 5.1 (VR data collection and transfer): The weakest assumption—that VR demonstrations transfer contact dynamics and stability without significant distribution shift—is stated but not quantified (e.g., no sim-to-real gap metrics or real-world force/tactile distribution comparisons). This is load-bearing for claiming the policy's real-world performance stems from the learned representations.
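The collapse checks named in the second comment can be sketched in a few lines. The two diagnostics below, mean per-dimension variance and mean pairwise cosine similarity over a batch of latents, are illustrative stand-ins chosen here; the function names are hypothetical and the latents are synthetic, not outputs of the paper's encoder.

```python
import numpy as np


def latent_variance(latents):
    """Mean per-dimension variance across the batch; near 0 suggests collapse."""
    return float(np.mean(np.var(latents, axis=0)))


def mean_pairwise_cosine(latents, eps=1e-8):
    """Mean cosine similarity over distinct pairs; near 1 suggests collapse."""
    norms = np.linalg.norm(latents, axis=1, keepdims=True)
    unit = latents / np.maximum(norms, eps)
    sim = unit @ unit.T
    off_diagonal = sim[~np.eye(latents.shape[0], dtype=bool)]
    return float(np.mean(off_diagonal))


rng = np.random.default_rng(0)
healthy = rng.normal(size=(256, 32))                  # varied synthetic latents
collapsed = np.tile(rng.normal(size=(1, 32)), (256, 1))  # every row identical

for name, batch in [("healthy", healthy), ("collapsed", collapsed)]:
    print(name, latent_variance(batch), mean_pairwise_cosine(batch))
```

Logged at intervals during training, these two numbers would give the kind of trajectory the referee asks for: variance staying away from zero and pairwise similarity staying away from one.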
minor comments (3)
- [Section 4.2] Notation for the EMA target encoder (Eq. in §4.2) should explicitly define the momentum coefficient and update schedule to allow reproduction.
- [Figure 3] Figure 3 (qualitative results) would benefit from clearer labeling of success/failure cases and corresponding tactile predictions.
- [Section 5.2] The baseline implementations in §5.2 lack sufficient detail on architecture and hyperparameter matching to HTD, hindering fair comparison.
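For readers unfamiliar with what a momentum coefficient and update schedule look like in practice, one common form is a cosine ramp of the momentum toward 1.0. The sketch below is only an example of such a specification; the start value, end value, and cosine shape are assumptions, not the paper's settings.

```python
import math


def ema_momentum(step: int, total_steps: int, start: float = 0.99, end: float = 1.0) -> float:
    """Cosine ramp of the EMA momentum from `start` (step 0) to `end` (last step)."""
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    progress = min(max(step / total_steps, 0.0), 1.0)
    return end - (end - start) * (math.cos(math.pi * progress) + 1.0) / 2.0


# Hypothetical training length, used only to print a few sample values.
for step in (0, 2500, 5000, 7500, 10000):
    print(step, round(ema_momentum(step, total_steps=10000), 5))
```

Stating these three quantities (initial momentum, final momentum, and the shape of the ramp) is usually enough for a reader to reproduce the target-encoder update.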
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback on our manuscript. We have carefully addressed each major comment below and revised the paper accordingly to strengthen the empirical claims and clarify methodological details.
Point-by-point responses
-
Referee: [Section 5] Section 5 (Experiments and Results): The abstract and main results report a 90.9% relative success-rate improvement and 30% ablation gain, yet provide no information on the number of trials per task, per-seed variance, statistical significance tests, or failure-mode analysis. Without these, the load-bearing empirical claim cannot be properly evaluated for robustness.
Authors: We agree that additional details on experimental rigor are essential for evaluating the robustness of the reported improvements. In the revised manuscript, we have expanded Section 5 to specify that each task was evaluated over 10 trials per method, with results averaged across three independent random seeds and reported with standard deviations. We have also added paired t-test results confirming statistical significance (p < 0.05) of the 90.9% relative improvement and included a failure-mode analysis categorizing common issues such as grasp slippage and balance loss. These changes directly address the concern and allow proper assessment of the claims.
revision: yes
-
Referee: [Section 4.2] Section 4.2 (Touch Dreaming formulation): The method relies on the EMA target encoder supplying stable, non-collapsing tactile latent targets to drive the auxiliary loss and the reported gains. The text contains no analysis (e.g., latent variance trajectories, cosine similarity to a constant target, or a collapse ablation) confirming that the EMA remains informative throughout training. This directly affects attribution of the 30% latent-vs-raw gain to touch dreaming rather than other training factors.
Authors: We concur that verifying the stability of the EMA target encoder is important for attributing the ablation gains. We have revised Section 4.2 to include new analysis: plots of tactile latent variance trajectories over training epochs and of the average cosine similarity between the online encoder and the EMA target, both of which remain high and non-constant. We further added a collapse ablation comparing the EMA target encoder to a fixed target encoder, which shows degraded performance and confirms that the dynamic targets contribute to the observed 30% relative gain in success rate.
revision: yes
-
Referee: [Section 3.1 and 5.1] Section 3.1 and 5.1 (VR data collection and transfer): The weakest assumption—that VR demonstrations transfer contact dynamics and stability without significant distribution shift—is stated but not quantified (e.g., no sim-to-real gap metrics or real-world force/tactile distribution comparisons). This is load-bearing for claiming the policy's real-world performance stems from the learned representations.
Authors: We acknowledge that explicit quantification of the distribution shift would provide stronger support. In the revised Sections 3.1 and 5.1, we have expanded the description of the VR data collection system, including details on sensor calibration and whole-body tracking to minimize shift, along with qualitative comparisons of observed contact patterns. However, we do not have paired quantitative force/tactile distribution metrics between VR-collected and real-world data, owing to practical constraints in data collection. We have added an explicit discussion of this limitation and its implications for future work, while noting that the real-world task success rates provide the primary empirical validation of effective transfer.
revision: partial
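One simple way to quantify the kind of shift discussed in the third exchange is a distance between two sets of scalar force readings. The sketch below computes the one-dimensional Wasserstein-1 distance for equal-sized samples; the metric is a choice made here, and the arrays `demo_forces` and `rollout_forces` are synthetic stand-ins in arbitrary units, not data from the paper.

```python
import numpy as np


def wasserstein_1d(a, b):
    """Wasserstein-1 distance between two equal-sized 1D empirical samples."""
    a = np.sort(np.asarray(a, dtype=float))
    b = np.sort(np.asarray(b, dtype=float))
    if a.ndim != 1 or a.shape != b.shape:
        raise ValueError("expected two 1D samples of equal length")
    # With equal sample sizes, the optimal matching pairs sorted values.
    return float(np.mean(np.abs(a - b)))


rng = np.random.default_rng(0)
# Synthetic force readings (arbitrary units), for illustration only.
demo_forces = rng.normal(loc=2.0, scale=0.5, size=1000)
rollout_forces = rng.normal(loc=2.3, scale=0.5, size=1000)

shift = wasserstein_1d(demo_forces, rollout_forces)
print(f"estimated shift: {shift:.3f}")
```

A per-sensor table of such distances, comparing demonstration data with policy rollouts, would be one form the missing quantitative comparison could take.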
Circularity Check
No circularity: empirical results from standard BC + EMA-augmented auxiliary losses
The paper's central claims are empirical success rates on real-world tasks (90.9% relative improvement) and an ablation (30% gain from latent vs. raw tactile prediction). The training procedure is described as single-stage behavioral cloning augmented by predicting future tactile latents whose targets are supplied by a standard EMA target encoder; this is a conventional self-supervised technique (online network predicts EMA target) that does not reduce any reported metric to a tautology by the paper's own equations. No self-definitional steps, no fitted parameters renamed as predictions, and no load-bearing self-citations appear in the provided derivation chain. The method is self-contained against external benchmarks (real-robot evaluation) and does not invoke uniqueness theorems or ansatzes that collapse back to the inputs.
Axiom & Free-Parameter Ledger
free parameters (1)
- loss weighting coefficients for action, force, and tactile prediction terms
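Read against the abstract, this entry plausibly corresponds to a weighted sum of three training terms. The formalization below is an assumption introduced for illustration: the symbols $\lambda_f$ and $\lambda_z$, the names of the individual terms, and the convention of fixing the action weight to 1 do not come from this page, and the paper's exact objective may be written differently.

```latex
\mathcal{L}_{\text{total}}
  = \mathcal{L}_{\text{action}}
  + \lambda_f \, \mathcal{L}_{\text{force}}
  + \lambda_z \, \mathcal{L}_{\text{tactile}}
```

Here $\mathcal{L}_{\text{action}}$ would be the behavioral cloning loss on action chunks, $\mathcal{L}_{\text{force}}$ the loss on predicted future hand-joint forces, and $\mathcal{L}_{\text{tactile}}$ the loss on predicted future tactile latents. The reported success rates may be sensitive to how these weights are chosen.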
axioms (2)
- domain assumption: The lower-body RL controller provides sufficient stability for upper-body manipulation without requiring joint optimization of the full body.
- domain assumption: VR demonstrations capture contact-rich dynamics that are sufficiently close to real-world execution for behavioral cloning to succeed.
Forward citations
Cited by 1 Pith paper
- BifrostUMI: Bridging Robot-Free Demonstrations and Humanoid Whole-Body Manipulation
  BifrostUMI enables robot-free human demonstration capture via VR and wrist cameras to train visuomotor policies that predict keypoint trajectories for transfer to humanoid whole-body control through retargeting.