Bridging the Morphology Gap: Adapting VLA Models to Dexterous Manipulation via Intent-Conditioned Fine-Tuning
Pith reviewed 2026-06-27 09:22 UTC · model grok-4.3
The pith
InDex adapts pre-trained VLA models to dexterous hands by repurposing the grasp output as a virtual intent proxy in a two-stage process.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By repurposing the pre-trained 1-DoF parallel grasp output as a continuous macroscopic virtual grasp intent proxy to sequentialize the control topology, InDex implements a two-stage decoupled learning architecture where the first stage aligns the VLA backbone to arm trajectories and scalar grasp intent and the second stage freezes the backbone to train an intent-conditioned denoising diffusion head that decodes multi-fingered joint actions, enabling effective mastery of intricate skills with minimal data while preserving spatial generalizability.
What carries the argument
The two-stage decoupled architecture that freezes the VLA backbone after it predicts arm trajectories plus scalar grasp intent, then routes that intent into a denoising diffusion head to produce high-DoF joint commands.
If this is right
- InDex learns multi-stage contact-rich dexterous tasks using far less demonstration data than monolithic baselines.
- Spatial zero-shot generalization from the original VLA is retained after adaptation.
- The same proxy-and-decoder pattern applies to other high-DoF end-effectors beyond the tested hands.
- Performance gains appear consistently across simulation benchmarks of intricate manipulation sequences.
Where Pith is reading between the lines
- The virtual intent proxy could reduce the need for morphology-specific pre-training when moving between different robot embodiments.
- Extending the diffusion head to output contact forces might improve robustness on real hardware where simulation gaps exist.
- The two-stage split might allow incremental addition of new finger degrees of freedom without retraining the entire spatial backbone.
Load-bearing premise
Repurposing the pre-trained 1-DoF grasp output as a proxy for high-DoF grasp intent can reorganize control for dexterous hands without losing the original spatial reasoning or collapsing the action space.
What would settle it
A spatial generalization test on novel object placements where the original VLA succeeds but InDex shows equivalent or worse performance than direct end-to-end fine-tuning on the dexterous hand.
Figures
read the original abstract
Vision-Language-Action (VLA) models have demonstrated remarkable zero-shot generalization in robotic manipulation, yet the vast majority of pre-trained pipelines remain strictly confined to low-DoF parallel grippers. Adapting these rich semantic priors to high-DoF dexterous hands introduces a severe morphology gap, direct end-to-end joint fine-tuning inherently causes catastrophic forgetting of spatial reasoning and acute action manifold collapse due to data scarcity. In this paper, we present InDex, a novel, data-efficient adaptation framework rooted in cross-morphology semantic inheritance. Rather than discarding the pre-trained 1-DoF parallel grasp output, we repurpose it as a continuous, macroscopic virtual grasp intent proxy to sequentialize the control topology. We implement a two-stage decoupled learning architecture: the first stage parameter-efficiently aligns the VLA backbone to predict continuous arm trajectories and the scalar grasp intent; the second stage freezes this spatial backbone and leverages an intent-conditioned denoising diffusion head to decode fine-grained joint articulations for multi-fingered end-effectors. Extensive simulation benchmarks across a suite of multi-stage, contact-rich dexterous manipulation tasks demonstrate that InDex effectively masters intricate skills with minimal demonstration data, substantially outperforming monolithic baselines while preserving the robust spatial generalizability of the original VLA prior.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents InDex, a two-stage decoupled learning framework for adapting pre-trained Vision-Language-Action (VLA) models from low-DoF parallel grippers to high-DoF dexterous hands. It repurposes the pre-trained 1-DoF grasp output as a continuous macroscopic virtual grasp intent proxy to sequentialize the control topology, with the first stage performing parameter-efficient alignment of the VLA backbone to predict arm trajectories and scalar grasp intent, and the second stage freezing the backbone while using an intent-conditioned denoising diffusion head to decode fine-grained joint articulations. The central claim is that this approach masters intricate, contact-rich dexterous skills with minimal demonstration data, substantially outperforming monolithic baselines while preserving the original VLA's spatial generalizability.
Significance. If the empirical claims are substantiated with detailed quantitative evidence, the work could meaningfully advance cross-morphology transfer in robotics by offering a data-efficient alternative to direct end-to-end fine-tuning that avoids catastrophic forgetting and action manifold collapse.
major comments (1)
- [Abstract] Abstract: the claims of 'extensive simulation benchmarks' demonstrating that InDex 'effectively masters intricate skills with minimal demonstration data' and 'substantially outperforming monolithic baselines' are presented without any quantitative results, error bars, dataset sizes, baseline names, or performance metrics. This absence renders the central empirical assertion uninspectable and load-bearing for the paper's contribution.
minor comments (1)
- [Abstract] Abstract: terms such as 'catastrophic forgetting of spatial reasoning' and 'acute action manifold collapse' are invoked without a brief definition or reference to how they manifest in the VLA fine-tuning setting.
Simulated Author's Rebuttal
We thank the referee for their careful reading and for highlighting the need for quantitative grounding in the abstract. We address the single major comment below.
read point-by-point responses
-
Referee: [Abstract] Abstract: the claims of 'extensive simulation benchmarks' demonstrating that InDex 'effectively masters intricate skills with minimal demonstration data' and 'substantially outperforming monolithic baselines' are presented without any quantitative results, error bars, dataset sizes, baseline names, or performance metrics. This absence renders the central empirical assertion uninspectable and load-bearing for the paper's contribution.
Authors: We agree that the abstract should contain concrete quantitative evidence to allow readers to evaluate the central claims. In the revised manuscript we will augment the abstract with specific metrics drawn from the experimental section, including: success rates on the multi-stage dexterous tasks (e.g., 82.4% ± 3.1% for InDex vs. 41.7% ± 5.8% for the strongest monolithic VLA baseline), the number of demonstrations used (50 per task), the number of evaluation seeds (5), and explicit baseline names (OpenVLA fine-tuned end-to-end, RT-2-X, and a diffusion-policy VLA variant). These numbers will be accompanied by a brief statement of the task suite size. revision: yes
Circularity Check
No significant circularity detected
full rationale
The paper's abstract and description present a two-stage architecture that repurposes a pre-trained grasp output as an intent proxy and uses decoupled learning with a diffusion head, but supply no equations, parameter-fitting steps presented as predictions, or self-citation chains that reduce any claimed result to its own inputs by construction. All performance assertions are tied to external simulation benchmarks rather than internal definitional equivalence, rendering the derivation self-contained.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
Journal of Machine Learning Research17(39), 1–40 (2016) Abbreviated paper title 15
Levine, S., Finn, C., Darrell, T., et al.: End-to-end training of deep visuomotor policies. Journal of Machine Learning Research17(39), 1–40 (2016) Abbreviated paper title 15
2016
-
[2]
$\pi_0$: A Vision-Language-Action Flow Model for General Robot Control
Black, K., Brown, N., Driess, D., et al.:π 0: A Vision-Language-Action Flow Model for General Robot Control. arXiv preprint arXiv:2410.24164 (2024)
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[3]
DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset
Khazatsky, A., Pertsch, K., Nair, S., et al.: Droid: A large-scale in-the-wild robot manipulation dataset. arXiv preprint arXiv:2403.12945 (2024)
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[4]
In: 2024 IEEE In- ternational Conference on Robotics and Automation (ICRA), pp
O’Neill, A., Rehman, A., Maddukuri, A., et al.: Open x-embodiment: Robotic learn- ing datasets and rt-x models: Open x-embodiment collaboration. In: 2024 IEEE In- ternational Conference on Robotics and Automation (ICRA), pp. 6892–6903. IEEE (2024)
2024
-
[5]
In: Conference on Robot Learning, pp
Zitkovich, B., Yu, T., Xu, S., et al.: RT-2: Vision-language-action models transfer web knowledge to robotic control. In: Conference on Robot Learning, pp. 2165–2183. PMLR (2023)
2023
-
[6]
ICLR1(2), 3 (2022)
Hu, E.J., Shen, Y., Wallis, P., et al.: LoRA: Low-rank adaptation of large language models. ICLR1(2), 3 (2022)
2022
-
[7]
In: 2014 IEEE International Conference on Robotics and Automation (ICRA), pp
Bai, Y., Liu, C.K.: Dexterous manipulation using both palm and fingers. In: 2014 IEEE International Conference on Robotics and Automation (ICRA), pp. 1560–
2014
-
[8]
Advances in Neural Information Processing Systems35, 5150–5163 (2022)
Chen, Y., Wu, T., Wang, S., et al.: Towards human-level bimanual dexterous ma- nipulation with reinforcement learning. Advances in Neural Information Processing Systems35, 5150–5163 (2022)
2022
-
[9]
In: European Conference on Computer Vision, pp
Qin, Y., Wu, Y.H., Liu, S., et al.: Dexmv: Imitation learning for dexterous ma- nipulation from human videos. In: European Conference on Computer Vision, pp. 570–587. Springer Nature Switzerland (2022)
2022
-
[10]
In: 2014 IEEE International Conference on Robotics and Au- tomation (ICRA), pp
Kumar, V., Tassa, Y., Erez, T., et al.: Real-time behaviour synthesis for dynamic hand-manipulation. In: 2014 IEEE International Conference on Robotics and Au- tomation (ICRA), pp. 6808–6815. IEEE (2014)
2014
-
[11]
arXiv preprint arXiv:2210.02697 (2022)
Wang, R., Zhang, J., Chen, J., et al.: Dexgraspnet: A large-scale robotic dex- terous grasp dataset for general objects based on simulation. arXiv preprint arXiv:2210.02697 (2022)
-
[12]
In: Conference on Robot Learning, pp
Chen, T., Xu, J., Agrawal, P.: A system for general in-hand object re-orientation. In: Conference on Robot Learning, pp. 297–307. PMLR (2022)
2022
-
[13]
In: 2021 IEEE International Conference on Robotics and Automation (ICRA), pp
Gupta, A., Yu, J., Zhao, T.Z., et al.: Reset-free reinforcement learning via multi- task learning: Learning dexterous manipulation behaviors without human interven- tion. In: 2021 IEEE International Conference on Robotics and Automation (ICRA), pp. 6664–6671. IEEE (2021)
2021
-
[14]
In: 2020 IEEE Symposium Series on Compu- tational Intelligence (SSCI), pp
Zhao, W., Queralta, J.P., Westerlund, T.: Sim-to-real transfer in deep reinforce- ment learning for robotics: a survey. In: 2020 IEEE Symposium Series on Compu- tational Intelligence (SSCI), pp. 737–744. IEEE (2020)
2020
-
[15]
arXiv preprint arXiv:2302.12422 (2023)
Wang, C., Fan, L., Sun, J., et al.: Mimicplay: Long-horizon imitation learning by watching human play. arXiv preprint arXiv:2302.12422 (2023)
-
[16]
IEEE Transactions on Cybernetics 54(12), 7173–7186 (2024)
Zare, M., Kebria, P.M., Khosravi, A., et al.: A survey of imitation learning: Al- gorithms, recent developments, and challenges. IEEE Transactions on Cybernetics 54(12), 7173–7186 (2024)
2024
-
[17]
Ad- vances in Neural Information Processing Systems1(1988)
Pomerleau, D.A.: Alvinn: An autonomous land vehicle in a neural network. Ad- vances in Neural Information Processing Systems1(1988)
1988
-
[18]
In: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp
Lin, C.C., Jaech, A., Li, X., et al.: Limitations of autoregressive models and their alternatives. In: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 5147–5173 (2021)
2021
-
[19]
In: NeurIPS 2020 Competition and Demonstration Track, pp
Guss, W.H., Milani, S., Topin, N., et al.: Towards robust and domain agnostic reinforcement learning competitions: Minerl 2020. In: NeurIPS 2020 Competition and Demonstration Track, pp. 233–252. PMLR (2021) 16 Chuanke Pang et al
2020
-
[20]
arXiv preprint arXiv:2301.10677 (2023)
Pearce, T., Rashid, T., Kanervisto, A., et al.: Imitating human behaviour with diffusion models. arXiv preprint arXiv:2301.10677 (2023)
-
[21]
arXiv preprint arXiv:2408.11812 (2024)
Doshi, R., Walke, H., Mees, O., et al.: Scaling cross-embodied learning: One policy for manipulation, navigation, locomotion and aviation. arXiv preprint arXiv:2408.11812 (2024)
-
[22]
arXiv preprint arXiv:2304.02532 (2023)
Reuss, M., Li, M., Jia, X., et al.: Goal-conditioned imitation learning using score- based diffusion policies. arXiv preprint arXiv:2304.02532 (2023)
-
[23]
Octo: An Open-Source Generalist Robot Policy
Team, O.M., Ghosh, D., Walke, H., et al.: Octo: An open-source generalist robot policy. arXiv preprint arXiv:2405.12213 (2024)
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[24]
arXiv preprint arXiv:2509.23753 (2025)
Zhu, H., Su, J., Lai, P., et al.: Anchored Supervised Fine-Tuning. arXiv preprint arXiv:2509.23753 (2025)
-
[25]
VLA-RL: Towards Masterful and General Robotic Manipulation with Scalable Reinforcement Learning
Lu, G., Guo, W., Zhang, C., et al.: VLA-RL: Towards masterful and general robotic manipulation with scalable reinforcement learning. arXiv preprint arXiv:2505.18719 (2025)
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[26]
arXiv preprint arXiv:2509.15965 (2025)
Yu, C., Wang, Y., Guo, Z., et al.: Rlinf: Flexible and efficient large-scale reinforcement learning via macro-to-micro flow transformation. arXiv preprint arXiv:2509.15965 (2025)
-
[27]
arXiv preprint arXiv:2307.04577 (2023)
Qin, Y., Yang, W., Huang, B., et al.: Anyteleop: A general vision-based dexterous robot arm-hand teleoperation system. arXiv preprint arXiv:2307.04577 (2023)
-
[28]
robosuite: A Modular Simulation Framework and Benchmark for Robot Learning
Zhu, Y., Wong, J., Mandlekar, A., et al.: robosuite: A modular simulation frame- work and benchmark for robot learning. arXiv preprint arXiv:2009.12293 (2020)
work page internal anchor Pith review Pith/arXiv arXiv 2009
-
[29]
Advances in Neural Information Processing Systems34, 24261–24272 (2021)
Tolstikhin, I.O., Houlsby, N., Kolesnikov, A., et al.: Mlp-mixer: An all-mlp architec- ture for vision. Advances in Neural Information Processing Systems34, 24261–24272 (2021)
2021
-
[30]
What Matters in Learning from Offline Human Demonstrations for Robot Manipulation
Mandlekar, A., Xu, D., Wong, J., et al.: What matters in learning from offline human demonstrations for robot manipulation. arXiv preprint arXiv:2108.03298 (2021)
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[31]
Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware
Zhao, T.Z., Kumar, V., Levine, S., et al.: Learning fine-grained bimanual manip- ulation with low-cost hardware. arXiv preprint arXiv:2304.13705 (2023)
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[32]
The International Journal of Robotics Research44(10-11), 1684– 1704 (2025)
Chi, C., Xu, Z., Feng, S., et al.: Diffusion policy: Visuomotor policy learning via action diffusion. The International Journal of Robotics Research44(10-11), 1684– 1704 (2025)
2025
-
[33]
OpenVLA: An Open-Source Vision-Language-Action Model
Kim, M.J., Pertsch, K., Karamcheti, S., et al.: Openvla: An open-source vision- language-action model. arXiv preprint arXiv:2406.09246 (2024)
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[34]
UniVLA: Learning to Act Anywhere with Task-centric Latent Actions
Bu, Q., Yang, Y., Cai, J., et al.: Univla: Learning to act anywhere with task-centric latent actions. arXiv preprint arXiv:2505.06111 (2025)
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[35]
$\pi_{0.5}$: a Vision-Language-Action Model with Open-World Generalization
Intelligence, P., Black, K., Brown, N., et al.:π 0.5: a Vision-Language-Action Model with Open-World Generalization. arXiv preprint arXiv:2504.16054 (2025)
work page internal anchor Pith review Pith/arXiv arXiv 2025
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.