Recognition: no theorem link
CapVector: Learning Transferable Capability Vectors in Parametric Space for Vision-Language-Action Models
Pith reviewed 2026-05-12 04:03 UTC · model grok-4.3
The pith
Parameter differences between two finetuned VLA models yield transferable capability vectors that can be merged into pretrained weights, delivering performance on par with auxiliary-objective finetuning at standard-SFT cost.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We train one VLA model to convergence on a small-scale task set using standard supervised finetuning and a second model using auxiliary-objective finetuning; the element-wise difference between their parameters is interpreted as the capability vector supplied by the auxiliary losses. This vector is added to the weights of a pretrained VLA model to obtain a capability-enhanced meta-model. When standard supervised finetuning is performed with an additional lightweight orthogonal regularization loss, the merged model attains performance comparable to auxiliary-objective baselines while incurring substantially lower computational overhead.
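A minimal sketch of this construction in PyTorch, assuming two already-finetuned models with identical architectures; the helper names and the merge coefficient `alpha` are illustrative assumptions, not details taken from the paper.

```python
import torch


def extract_capability_vector(sft_model: torch.nn.Module,
                              aux_model: torch.nn.Module) -> dict:
    """Element-wise parameter difference: auxiliary-objective SFT minus standard SFT."""
    sft_state = sft_model.state_dict()
    return {
        name: (param.detach() - sft_state[name].detach())
        for name, param in aux_model.state_dict().items()
    }


def merge_into_pretrained(pretrained: torch.nn.Module,
                          cap_vector: dict,
                          alpha: float = 1.0) -> torch.nn.Module:
    """Add the capability vector to pretrained weights to form the meta-model.

    `alpha` is an assumed task-arithmetic-style scaling knob; the excerpt does not
    say whether the paper uses one."""
    merged_state = {
        name: param + alpha * cap_vector.get(name, torch.zeros_like(param))
        for name, param in pretrained.state_dict().items()
    }
    pretrained.load_state_dict(merged_state)
    return pretrained
```

The meta-model produced by `merge_into_pretrained` is what standard SFT (plus the lightweight orthogonal regularization) would then be run on.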
What carries the argument
The capability vector, defined as the parameter difference between a standard-SFT model and an auxiliary-objective SFT model trained on identical small-scale data, which is claimed to isolate the general capability enhancement from task-specific action fitting.
If this is right
- Capability vectors transfer effectively across diverse VLA model architectures.
- The vectors generalize out of the box to novel environments and robot embodiments.
- Merged models reach auxiliary-finetuning performance levels with reduced computational cost.
- Orthogonal regularization during standard finetuning suffices to achieve the same gains as full auxiliary objectives.
Where Pith is reading between the lines
- The same subtraction technique could be applied to other multimodal models to extract reusable skill vectors without domain-specific auxiliary losses.
- Vectors from multiple auxiliary runs could be added or subtracted to combine or cancel specific capabilities in a single base model (see the composition sketch after this list).
- If the vectors prove low-rank, they could be stored and transmitted at low cost for on-device adaptation of large VLA systems.
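A hedged sketch of the two ideas above (composition and low-rank storage), assuming capability vectors are plain parameter-delta dictionaries as in the earlier sketch; the linear-composition assumption and the rank cutoff are speculative extrapolations, not claims from the paper.

```python
import torch


def compose_capability_vectors(vectors: list, coefficients: list) -> dict:
    """Signed, scaled sum of capability vectors (task-arithmetic style):
    positive coefficients add a capability, negative ones cancel it."""
    composed = {}
    for vec, coef in zip(vectors, coefficients):
        for name, delta in vec.items():
            composed[name] = composed.get(name, torch.zeros_like(delta)) + coef * delta
    return composed


def truncate_delta_low_rank(delta: torch.Tensor, rank: int):
    """If a 2-D delta turns out to be approximately low-rank, store it as a
    rank-`rank` factorization (U, S, Vh) instead of the full matrix."""
    U, S, Vh = torch.linalg.svd(delta, full_matrices=False)
    return U[:, :rank], S[:rank], Vh[:rank, :]
```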
Load-bearing premise
The parameter difference between the two finetuned models isolates only the desired capability enhancement from auxiliary objectives and is not mixed with other training dynamics or task-specific fitting.
What would settle it
Apply the extracted capability vector to a new pretrained VLA model, run standard finetuning on a held-out task set, and measure whether success rate or convergence speed fails to exceed that of an identical run without the vector.
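One way to operationalize that comparison, as a hedged sketch: assume a hypothetical `finetune_and_evaluate` harness that runs standard SFT on the held-out task set and returns a success rate and the number of steps to convergence; `merged_model` would come from the merge sketched under the core claim.

```python
import copy


def capability_vector_ab_test(base_model, merged_model, heldout_tasks,
                              finetune_and_evaluate):
    """Run two otherwise-identical standard-SFT runs and compare outcomes.

    `finetune_and_evaluate(model, tasks)` is a hypothetical harness returning
    (success_rate, steps_to_convergence)."""
    base_rate, base_steps = finetune_and_evaluate(copy.deepcopy(base_model), heldout_tasks)
    merged_rate, merged_steps = finetune_and_evaluate(copy.deepcopy(merged_model), heldout_tasks)
    # The load-bearing premise is in trouble if the merged run improves neither metric.
    return {
        "success_rate_gain": merged_rate - base_rate,
        "convergence_speedup_steps": base_steps - merged_steps,
    }
```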
Original abstract
This paper proposes a novel approach to address the challenge that pretrained VLA models often fail to effectively improve performance and reduce adaptation costs during standard supervised finetuning (SFT). Some advanced finetuning methods with auxiliary training objectives can improve performance and reduce the number of convergence steps. However, they typically incur significant computational overhead due to the additional losses from auxiliary objectives. To simultaneously achieve the enhanced capabilities of auxiliary training with the simplicity of standard SFT, we decouple the two objectives of auxiliary-objective SFT within the parameter space, namely, enhancing general capabilities and fitting task-specific action distributions. To achieve this goal, we only need to train the model to converge on a small-scale task set using two distinct training strategies, resulting in two finetuned models. The parameter difference between the two models can then be interpreted as capability vectors provided by auxiliary objectives. These vectors are then merged with pretrained parameters to form a capability-enhanced meta model. Moreover, when standard SFT is augmented with a lightweight orthogonal regularization loss, the merged model attains performance comparable to auxiliary-finetuned baselines with reduced computational overhead. Internal and external experiments demonstrate that our capability vectors (1) are effective and versatile across diverse models, and (2) can generalize to novel environments and embodiments out of the box.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes CapVector, which extracts transferable 'capability vectors' as the parameter difference between two models finetuned on the same small-scale task set: one via standard SFT and one via auxiliary-objective SFT. These vectors are merged into pretrained VLA parameters to form an enhanced meta-model. With an optional lightweight orthogonal regularization during standard SFT, the merged model is claimed to match auxiliary-finetuned performance at lower cost. Internal and external experiments are said to show the vectors are effective across models and generalize out-of-the-box to novel environments and embodiments.
Significance. If the parameter-difference construction reliably isolates general capability enhancements orthogonal to task-specific fitting, the method could offer a low-overhead way to transfer auxiliary-objective benefits to standard SFT pipelines for VLA models, potentially improving adaptation efficiency and cross-embodiment transfer without repeated auxiliary training.
major comments (1)
- [Method (parameter-difference construction)] The central construction (parameter difference between standard-SFT and auxiliary-SFT models on identical small-scale tasks) treats the delta as a pure capability vector. This requires that the auxiliary losses affect only general capability parameters while task-specific action fitting remains identical; otherwise the delta absorbs convergence-speed, regularization, or distribution-fitting differences. The manuscript provides no evidence that the two models reach equivalent task performance before subtraction, nor quantitative checks (e.g., success rates or loss values on the small-scale set) that the delta is orthogonal to task gradients beyond the optional regularization. This directly threatens the versatility and out-of-the-box generalization claims. One possible form of such a check is sketched below.
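One concrete form the missing check could take, sketched under assumptions (flattened parameter space, a single task-loss evaluation); this is an illustration of a possible diagnostic, not the paper's protocol.

```python
import torch


def delta_vs_task_gradient_cosine(model: torch.nn.Module,
                                  cap_vector: dict,
                                  task_loss: torch.Tensor) -> torch.Tensor:
    """Cosine similarity between the flattened capability vector and the flattened
    gradient of a task loss at the current parameters; values near zero would
    support the 'pure capability direction' reading."""
    task_loss.backward()
    grads, deltas = [], []
    for name, param in model.named_parameters():
        if param.grad is not None and name in cap_vector:
            grads.append(param.grad.detach().flatten())
            deltas.append(cap_vector[name].flatten())
    return torch.nn.functional.cosine_similarity(
        torch.cat(grads), torch.cat(deltas), dim=0
    )
```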
minor comments (1)
- [Abstract] The abstract states that experiments support effectiveness and generalization yet supplies no quantitative metrics, baselines, error bars, or dataset details; the full paper should include these in the experimental section to allow verification.
Simulated Author's Rebuttal
Thank you for the detailed review of our paper. We appreciate the referee's focus on the methodological foundation of the capability vector construction and will clarify and strengthen the supporting evidence in the revision.
Point-by-point responses
-
Referee: The central construction (parameter difference between standard-SFT and auxiliary-SFT models on identical small-scale tasks) treats the delta as a pure capability vector. This requires that the auxiliary losses affect only general capability parameters while task-specific action fitting remains identical; otherwise the delta absorbs convergence-speed, regularization, or distribution-fitting differences. The manuscript provides no evidence that the two models reach equivalent task performance before subtraction, nor quantitative checks (e.g., success rates or loss values on the small-scale set) that the delta is orthogonal to task gradients beyond the optional regularization. This directly threatens the versatility and out-of-the-box generalization claims.
Authors: We agree that demonstrating equivalent task performance between the standard-SFT and auxiliary-SFT models on the small-scale task set is crucial for validating the capability-vector interpretation. Although the manuscript states that both models are trained to convergence on the same tasks, we did not include explicit quantitative comparisons such as success rates or final loss values. In the revised manuscript, we will add these metrics to show that the models achieve comparable task-specific performance, thereby supporting the claim that the parameter difference primarily isolates the general capability enhancements provided by the auxiliary objectives. Regarding orthogonality, the lightweight orthogonal regularization is applied during standard SFT to encourage the capability vector to remain orthogonal to task gradients, and we will include additional ablations and analyses to quantify this effect. These additions will bolster the claims regarding the vectors' versatility and out-of-the-box generalization to novel environments and embodiments. revision: partial
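A hedged sketch of one plausible form such a regularizer could take, assuming the merged initialization and the capability vector are kept around during standard SFT; the exact loss used in the paper is not given in this excerpt.

```python
import torch


def orthogonal_regularization(model: torch.nn.Module,
                              merged_init: dict,
                              cap_vector: dict,
                              weight: float = 1e-2) -> torch.Tensor:
    """Penalize alignment between the task-specific drift accumulated during SFT
    (theta - merged_init) and the capability vector; added to the standard SFT loss."""
    drift, delta = [], []
    for name, param in model.named_parameters():
        if name in cap_vector:
            drift.append((param - merged_init[name]).flatten())
            delta.append(cap_vector[name].flatten())
    cos = torch.nn.functional.cosine_similarity(
        torch.cat(drift), torch.cat(delta), dim=0
    )
    return weight * cos.pow(2)
```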
Circularity Check
No circularity: capability vector is an explicit definitional construction validated empirically
Full rationale
The paper defines the capability vector directly as the parameter difference between a standard-SFT model and an auxiliary-objective model, both trained to convergence on the identical small-scale task set. This is an explicit modeling choice rather than a derived result that reduces to its own inputs by construction. No equations or claims in the provided text perform a 'prediction' that is statistically forced by a fitted subset, nor does any load-bearing premise rest on self-citation, imported uniqueness theorems, or smuggled ansatzes. Effectiveness claims rest on internal/external experiments rather than tautological reduction. The interpretation of the delta as isolating 'general capabilities' is an assumption subject to empirical test, not a self-referential loop.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: the parameter difference between a standard-SFT model and an auxiliary-objective SFT model isolates the capability enhancement attributable to the auxiliary objectives.
invented entities (1)
- capability vectors: no independent evidence
Reference graph
Works this paper leans on
-
[1]
gpt-oss-120b & gpt-oss-20b Model Card
Sandhini Agarwal, Lama Ahmad, Jason Ai, Sam Altman, Andy Applebaum, Edwin Arbus, Rahul K Arora, Yu Bai, Bowen Baker, Haiming Bao, et al. gpt-oss-120b & gpt-oss-20b model card. arXiv preprint arXiv:2508.10925.
-
[2]
Towards a unified understanding of robot manipulation: A comprehensive survey
Shuanghao Bai, Wenxuan Song, Jiayi Chen, Yuheng Ji, Zhide Zhong, Jin Yang, Han Zhao, Wanqi Zhou, Wei Zhao, Zhe Li, et al. Towards a unified understanding of robot manipulation: A comprehensive survey. arXiv preprint arXiv:2510.10903.
-
[3]
HEX: Humanoid-Aligned Experts for Cross-Embodiment Whole-Body Manipulation
Shuanghao Bai, Meng Li, Xinyuan Lv, Jiawei Wang, Xinhua Wang, Fei Liao, Chengkai Hou, Langzhe Gu, Wanqi Zhou, Kun Wu, et al. Hex: Humanoid-aligned experts for cross-embodiment whole-body manipulation. arXiv preprint arXiv:2604.07993, 2026a. Shuanghao Bai, Jing Lyu, Wanqi Zhou, Zhe Li, Dakai Wang, Lei Xing, Xiaoguang Zhao, Pengwei Wang, Zhongyuan Wang, Chen...
-
[4]
GR00T N1: An Open Foundation Model for Generalist Humanoid Robots
Johan Bjorck, Fernando Castañeda, Nikita Cherniadev, Xingye Da, Runyu Ding, Linxi Fan, Yu Fang, Dieter Fox, Fengyuan Hu, Spencer Huang, et al. Gr00t n1: An open foundation model for generalist humanoid robots. arXiv preprint arXiv:2503.14734.
-
[5]
$\pi_0$: A Vision-Language-Action Flow Model for General Robot Control
Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, et al. $\pi_{0}$: A vision-language-action flow model for general robot control. arXiv preprint arXiv:2410.24164.
-
[6]
Olora: Orthonormal low-rank adaptation of large language models
Kerim Büyükakyüz. Olora: Orthonormal low-rank adaptation of large language models. arXiv preprint arXiv:2406.01775.
-
[7]
Shiqi Chen, Jinghan Zhang, Tongyao Zhu, Wei Liu, Siyang Gao, Miao Xiong, Manling Li, and Junxian He. Bring reason to vision: Understanding perception and reasoning through model merging. In Forty-second International Conference on Machine Learning. Tianxing Chen, Zanxin Chen, Baijun Chen, Zijian Cai, Yibin Liu, Zixuan Li, Qiwei Liang, Xianliang Lin, Yiheng...
-
[8]
Yuxia Fu, Zhizhen Zhang, Yuqi Zhang, Zijian Wang, Zi Huang, and Yadan Luo. Mergevla: Cross-skill model merging toward a generalist vision-language-action agent. arXiv preprint arXiv:2511.18810.
-
[9]
Editing Models with Task Arithmetic
Gabriel Ilharco, Marco Tulio Ribeiro, Mitchell Wortsman, Suchin Gururangan, Ludwig Schmidt, Hannaneh Hajishirzi, and Ali Farhadi. Editing models with task arithmetic. arXiv preprint arXiv:2212.04089.
-
[10]
DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset
Alexander Khazatsky, Karl Pertsch, Suraj Nair, Ashwin Balakrishna, Sudeep Dasari, Siddharth Karamcheti, Soroush Nasiriany, Mohan Kumar Srirama, Lawrence Yunliang Chen, Kirsty Ellis, et al. Droid: A large-scale in-the-wild robot manipulation dataset. arXiv preprint arXiv:2403.12945.
-
[11]
Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success
Moo Jin Kim, Chelsea Finn, and Percy Liang. Fine-tuning vision-language-action models: Optimizing speed and success. arXiv preprint arXiv:2502.19645, 2025a. Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan P Foster, Pannag R Sanketi, Quan Vuong, et al. Openvla: An open-source vision-langua...
-
[12]
RDT-1B: a Diffusion Foundation Model for Bimanual Manipulation
Songming Liu, Lingxuan Wu, Bangguo Li, Hengkai Tan, Huayu Chen, Zhengyi Wang, Ke Xu, Hang Su, and Jun Zhu. Rdt-1b: a diffusion foundation model for bimanual manipulation. arXiv preprint arXiv:2410.07864.
-
[13]
Zhuoyang Liu, Jiaming Liu, Hao Chen, Jiale Yu, Ziyu Guo, Chengkai Hou, Chenyang Gu, Xiangju Mi, Renrui Zhang, Kun Wu, et al. Last_{0}: Latent spatio-temporal chain-of-thought for robotic vision-language-action model. arXiv preprint arXiv:2601.05248.
-
[14]
Pd-vla: Accelerating vision-language-action model integrated with action chunking via parallel decoding
Wenxuan Song, Jiayi Chen, Pengxiang Ding, Han Zhao, Wei Zhao, Zhide Zhong, Zongyuan Ge, Zhijun Li, Donglin Wang, Lujia Wang, et al. Pd-vla: Accelerating vision-language-action model integrated with action chunking via parallel decoding. In 2025 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 13162–13169. IEEE, 2025a. Wenxu...
-
[15]
Orthogonal subspace learning for language model continual learning
Xiao Wang, Tianze Chen, Qiming Ge, Han Xia, Rong Bao, Rui Zheng, Qi Zhang, Tao Gui, and Xuan-Jing Huang. Orthogonal subspace learning for language model continual learning. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 10658–10671, 2023.
-
[16]
RoboCOIN: An Open-Sourced Bimanual Robotic Data Collection for Integrated Manipulation
Shihan Wu, Xuecheng Liu, Shaoxuan Xie, Pengwei Wang, Xinghang Li, Bowen Yang, Zhe Li, Kai Zhu, Hongyu Wu, Yiheng Liu, et al. Robocoin: An open-sourced bimanual robotic data collection for integrated manipulation. arXiv preprint arXiv:2511.17441.
-
[17]
Youguang Xing, Xu Luo, Junlin Xie, Lianli Gao, Hengtao Shen, and Jingkuan Song. Shortcut learning in generalist robot policies: The role of dataset diversity and fragmentation. arXiv preprint arXiv:2508.06426.
-
[18]
What matters for model merging at scale?
Prateek Yadav, Tu Vu, Jonathan Lai, Alexandra Chronopoulou, Manaal Faruqui, Mohit Bansal, and Tsendsuren Munkhdalai. What matters for model merging at scale? arXiv preprint arXiv:2410.03617.
-
[19]
Yajat Yadav, Zhiyuan Zhou, Andrew Wagenmaker, Karl Pertsch, and Sergey Levine. Robust finetuning of vision-language-action robot policies via parameter merging. arXiv preprint arXiv:2512.08333.
-
[20]
An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388.
-
[21]
Han Zhao, Jingbo Wang, Wenxuan Song, Shuai Chen, Yang Liu, Yan Wang, Haoang Li, and Donglin Wang. Frappe: Infusing world modeling into generalist policies via multiple future representation alignment. arXiv preprint arXiv:2602.17259.
-
[22]
Aloha unleashed: A simple recipe for robot dexterity
Tony Z Zhao, Jonathan Tompson, Danny Driess, Pete Florence, Kamyar Ghasemipour, Chelsea Finn, and Ayzaan Wahid. Aloha unleashed: A simple recipe for robot dexterity. arXiv preprint arXiv:2410.13126, 2024.
-
[23]
Flare: Robot learning with implicit world modeling, 2025
Ruijie Zheng, Jing Wang, Scott Reed, Johan Bjorck, Yu Fang, Fengyuan Hu, Joel Jang, Kaushil Kundalia, Zongyu Lin, Loic Magne, et al. Flare: Robot learning with implicit world modeling. arXiv preprint arXiv:2505.15659, 2025.
-
[24]
Zhide Zhong, Junfeng Li, Junjie He, Haodong Yan, Xin Gong, Guanyi Zhao, Yingjie Cai, Jiantao Gao, Xu Yan, Bingbing Liu, et al. Dualcot-vla: Visual-linguistic chain of thought via parallel reasoning for vision-language-action models. arXiv preprint arXiv:2603.22280.
-
[25]
robosuite: A Modular Simulation Framework and Benchmark for Robot Learning
Yuke Zhu, Josiah Wong, Ajay Mandlekar, Roberto Martín-Martín, Abhishek Joshi, Soroush Nasiriany, and Yifeng Zhu. robosuite: A modular simulation framework and benchmark for robot learning. arXiv preprint arXiv:2009.12293.
-
[26]
to reinforce language grounding. On the LIBERO simulation benchmark (Liu et al., 2023), OpenVLA-OFT achieves a 97.1% success rate while increasing action generation speed by 26 times. In real-world experiments with a bimanual ALOHA robot (Zhao et al., 2024), it surpasses strong finetuned VLAs such as π0 (Black et al.,
-
[27]
and RDT-1B (Liu et al., 2024), as well as policies trained from scratch, achieving up to a 15% absolute improvement in average success rate on dexterous manipulation tasks. π0.5 (Black et al., 2025). π0.5 is a vision-language-action (VLA) model designed to achieve open-world generalization for real-world robotic manipulation. Building on π0, π0.5 adopts a...
-
[28]
and Cosmos (Kim et al., 2026)) with four representative action-decoding paradigms (e.g., autoregressive tokenization, flow-matching). This unified architecture is further strengthened by reusable, paradigm-agnostic training strategies for multimodal co- training and a standardized server-client interface for cross-benchmark evaluation. Across major benchm...
discussion (0)