Recognition: no theorem link
CapVector: Learning Transferable Capability Vectors in Parametric Space for Vision-Language-Action Models
Pith reviewed 2026-05-12 04:03 UTC · model grok-4.3
The pith
Parameter differences between two finetuned VLA models yield transferable capability vectors that can be merged into pretrained weights, delivering performance on par with auxiliary-objective finetuning at standard-SFT cost.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We train one VLA model to convergence on a small-scale task set using standard supervised finetuning and a second model using auxiliary-objective finetuning; the element-wise difference between their parameters is interpreted as the capability vector supplied by the auxiliary losses. This vector is added to the weights of a pretrained VLA model to obtain a capability-enhanced meta-model. When standard supervised finetuning is performed with an additional lightweight orthogonal regularization loss, the merged model attains performance comparable to auxiliary-objective baselines while incurring substantially lower computational overhead.
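A minimal sketch of this construction in PyTorch, assuming two already-finetuned models with identical architectures; the helper names and the merge coefficient `alpha` are illustrative assumptions, not details taken from the paper.

```python
import torch


def extract_capability_vector(sft_model: torch.nn.Module,
                              aux_model: torch.nn.Module) -> dict:
    """Element-wise parameter difference: auxiliary-objective SFT minus standard SFT."""
    sft_state = sft_model.state_dict()
    return {
        name: (param.detach() - sft_state[name].detach())
        for name, param in aux_model.state_dict().items()
    }


def merge_into_pretrained(pretrained: torch.nn.Module,
                          cap_vector: dict,
                          alpha: float = 1.0) -> torch.nn.Module:
    """Add the capability vector to pretrained weights to form the meta-model.

    `alpha` is an assumed task-arithmetic-style scaling knob; the excerpt does not
    say whether the paper uses one."""
    merged_state = {
        name: param + alpha * cap_vector.get(name, torch.zeros_like(param))
        for name, param in pretrained.state_dict().items()
    }
    pretrained.load_state_dict(merged_state)
    return pretrained
```

The meta-model produced by `merge_into_pretrained` is what standard SFT (plus the lightweight orthogonal regularization) would then be run on.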
What carries the argument
The capability vector, defined as the parameter difference between a standard-SFT model and an auxiliary-objective SFT model trained on identical small-scale data, which is claimed to isolate the general capability enhancement from task-specific action fitting.
If this is right
- Capability vectors transfer effectively across diverse VLA model architectures.
- The vectors generalize out of the box to novel environments and robot embodiments.
- Merged models reach auxiliary-finetuning performance levels with reduced computational cost.
- Orthogonal regularization during standard finetuning suffices to achieve the same gains as full auxiliary objectives.
Where Pith is reading between the lines
- The same subtraction technique could be applied to other multimodal models to extract reusable skill vectors without domain-specific auxiliary losses.
- Vectors from multiple auxiliary runs could be added or subtracted to combine or cancel specific capabilities in a single base model (see the composition sketch after this list).
- If the vectors prove low-rank, they could be stored and transmitted at low cost for on-device adaptation of large VLA systems.
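A hedged sketch of the two ideas above (composition and low-rank storage), assuming capability vectors are plain parameter-delta dictionaries as in the earlier sketch; the linear-composition assumption and the rank cutoff are speculative extrapolations, not claims from the paper.

```python
import torch


def compose_capability_vectors(vectors: list, coefficients: list) -> dict:
    """Signed, scaled sum of capability vectors (task-arithmetic style):
    positive coefficients add a capability, negative ones cancel it."""
    composed = {}
    for vec, coef in zip(vectors, coefficients):
        for name, delta in vec.items():
            composed[name] = composed.get(name, torch.zeros_like(delta)) + coef * delta
    return composed


def truncate_delta_low_rank(delta: torch.Tensor, rank: int):
    """If a 2-D delta turns out to be approximately low-rank, store it as a
    rank-`rank` factorization (U, S, Vh) instead of the full matrix."""
    U, S, Vh = torch.linalg.svd(delta, full_matrices=False)
    return U[:, :rank], S[:rank], Vh[:rank, :]
```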
Load-bearing premise
The parameter difference between the two finetuned models isolates only the desired capability enhancement from auxiliary objectives and is not mixed with other training dynamics or task-specific fitting.
What would settle it
Apply the extracted capability vector to a new pretrained VLA model, run standard finetuning on a held-out task set, and measure whether success rate or convergence speed fails to exceed that of an identical run without the vector.
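One way to operationalize that comparison, as a hedged sketch: assume a hypothetical `finetune_and_evaluate` harness that runs standard SFT on the held-out task set and returns a success rate and the number of steps to convergence; `merged_model` would come from the merge sketched under the core claim.

```python
import copy


def capability_vector_ab_test(base_model, merged_model, heldout_tasks,
                              finetune_and_evaluate):
    """Run two otherwise-identical standard-SFT runs and compare outcomes.

    `finetune_and_evaluate(model, tasks)` is a hypothetical harness returning
    (success_rate, steps_to_convergence)."""
    base_rate, base_steps = finetune_and_evaluate(copy.deepcopy(base_model), heldout_tasks)
    merged_rate, merged_steps = finetune_and_evaluate(copy.deepcopy(merged_model), heldout_tasks)
    # The load-bearing premise is in trouble if the merged run improves neither metric.
    return {
        "success_rate_gain": merged_rate - base_rate,
        "convergence_speedup_steps": base_steps - merged_steps,
    }
```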
Original abstract
This paper proposes a novel approach to address the challenge that pretrained VLA models often fail to effectively improve performance and reduce adaptation costs during standard supervised finetuning (SFT). Some advanced finetuning methods with auxiliary training objectives can improve performance and reduce the number of convergence steps. However, they typically incur significant computational overhead due to the additional losses from auxiliary objectives. To simultaneously achieve the enhanced capabilities of auxiliary training with the simplicity of standard SFT, we decouple the two objectives of auxiliary-objective SFT within the parameter space, namely, enhancing general capabilities and fitting task-specific action distributions. To achieve this goal, we only need to train the model to converge on a small-scale task set using two distinct training strategies, resulting in two finetuned models. The parameter difference between the two models can then be interpreted as capability vectors provided by auxiliary objectives. These vectors are then merged with pretrained parameters to form a capability-enhanced meta model. Moreover, when standard SFT is augmented with a lightweight orthogonal regularization loss, the merged model attains performance comparable to auxiliary-finetuned baselines with reduced computational overhead. Internal and external experiments demonstrate that our capability vectors (1) are effective and versatile across diverse models, and (2) can generalize to novel environments and embodiments out of the box.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes CapVector, which extracts transferable 'capability vectors' as the parameter difference between two models finetuned on the same small-scale task set: one via standard SFT and one via auxiliary-objective SFT. These vectors are merged into pretrained VLA parameters to form an enhanced meta-model. With an optional lightweight orthogonal regularization during standard SFT, the merged model is claimed to match auxiliary-finetuned performance at lower cost. Internal and external experiments are said to show the vectors are effective across models and generalize out-of-the-box to novel environments and embodiments.
Significance. If the parameter-difference construction reliably isolates general capability enhancements orthogonal to task-specific fitting, the method could offer a low-overhead way to transfer auxiliary-objective benefits to standard SFT pipelines for VLA models, potentially improving adaptation efficiency and cross-embodiment transfer without repeated auxiliary training.
major comments (1)
- [Method (parameter-difference construction)] The central construction (parameter difference between standard-SFT and auxiliary-SFT models on identical small-scale tasks) treats the delta as a pure capability vector. This requires that the auxiliary losses affect only general capability parameters while task-specific action fitting remains identical; otherwise the delta absorbs convergence-speed, regularization, or distribution-fitting differences. The manuscript provides no evidence that the two models reach equivalent task performance before subtraction, nor quantitative checks (e.g., success rates or loss values on the small-scale set) that the delta is orthogonal to task gradients beyond the optional regularization. This directly threatens the versatility and out-of-the-box generalization claims. One possible form of such a check is sketched below.
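One concrete form the missing check could take, sketched under assumptions (flattened parameter space, a single task-loss evaluation); this is an illustration of a possible diagnostic, not the paper's protocol.

```python
import torch


def delta_vs_task_gradient_cosine(model: torch.nn.Module,
                                  cap_vector: dict,
                                  task_loss: torch.Tensor) -> torch.Tensor:
    """Cosine similarity between the flattened capability vector and the flattened
    gradient of a task loss at the current parameters; values near zero would
    support the 'pure capability direction' reading."""
    task_loss.backward()
    grads, deltas = [], []
    for name, param in model.named_parameters():
        if param.grad is not None and name in cap_vector:
            grads.append(param.grad.detach().flatten())
            deltas.append(cap_vector[name].flatten())
    return torch.nn.functional.cosine_similarity(
        torch.cat(grads), torch.cat(deltas), dim=0
    )
```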
minor comments (1)
- [Abstract] The abstract states that experiments support effectiveness and generalization yet supplies no quantitative metrics, baselines, error bars, or dataset details; the full paper should include these in the experimental section to allow verification.
Simulated Author's Rebuttal
Thank you for the detailed review of our paper. We appreciate the referee's focus on the methodological foundation of the capability vector construction and will clarify and strengthen the supporting evidence in the revision.
Point-by-point responses
-
Referee: The central construction (parameter difference between standard-SFT and auxiliary-SFT models on identical small-scale tasks) treats the delta as a pure capability vector. This requires that the auxiliary losses affect only general capability parameters while task-specific action fitting remains identical; otherwise the delta absorbs convergence-speed, regularization, or distribution-fitting differences. The manuscript provides no evidence that the two models reach equivalent task performance before subtraction, nor quantitative checks (e.g., success rates or loss values on the small-scale set) that the delta is orthogonal to task gradients beyond the optional regularization. This directly threatens the versatility and out-of-the-box generalization claims.
Authors: We agree that demonstrating equivalent task performance between the standard-SFT and auxiliary-SFT models on the small-scale task set is crucial for validating the capability-vector interpretation. Although the manuscript states that both models are trained to convergence on the same tasks, we did not include explicit quantitative comparisons such as success rates or final loss values. In the revised manuscript, we will add these metrics to show that the models achieve comparable task-specific performance, thereby supporting the claim that the parameter difference primarily isolates the general capability enhancements provided by the auxiliary objectives. Regarding orthogonality, the lightweight orthogonal regularization is applied during standard SFT to encourage the capability vector to remain orthogonal to task gradients, and we will include additional ablations and analyses to quantify this effect. These additions will bolster the claims regarding the vectors' versatility and out-of-the-box generalization to novel environments and embodiments. revision: partial
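A hedged sketch of one plausible form such a regularizer could take, assuming the merged initialization and the capability vector are kept around during standard SFT; the exact loss used in the paper is not given in this excerpt.

```python
import torch


def orthogonal_regularization(model: torch.nn.Module,
                              merged_init: dict,
                              cap_vector: dict,
                              weight: float = 1e-2) -> torch.Tensor:
    """Penalize alignment between the task-specific drift accumulated during SFT
    (theta - merged_init) and the capability vector; added to the standard SFT loss."""
    drift, delta = [], []
    for name, param in model.named_parameters():
        if name in cap_vector:
            drift.append((param - merged_init[name]).flatten())
            delta.append(cap_vector[name].flatten())
    cos = torch.nn.functional.cosine_similarity(
        torch.cat(drift), torch.cat(delta), dim=0
    )
    return weight * cos.pow(2)
```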
Circularity Check
No circularity: capability vector is an explicit definitional construction validated empirically
Full rationale
The paper defines the capability vector directly as the parameter difference between a standard-SFT model and an auxiliary-objective model, both trained to convergence on the identical small-scale task set. This is an explicit modeling choice rather than a derived result that reduces to its own inputs by construction. No equations or claims in the provided text perform a 'prediction' that is statistically forced by a fitted subset, nor does any load-bearing premise rest on self-citation, imported uniqueness theorems, or smuggled ansatzes. Effectiveness claims rest on internal/external experiments rather than tautological reduction. The interpretation of the delta as isolating 'general capabilities' is an assumption subject to empirical test, not a self-referential loop.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: the parameter difference between a standard-SFT model and an auxiliary-objective SFT model isolates the capability enhancement attributable to the auxiliary objectives.
invented entities (1)
- capability vectors: no independent evidence
Reference graph
Works this paper leans on
-
[1]
gpt-oss-120b & gpt-oss-20b Model Card
Sandhini Agarwal, Lama Ahmad, Jason Ai, Sam Altman, Andy Applebaum, Edwin Arbus, Rahul K Arora, Yu Bai, Bowen Baker, Haiming Bao, et al. gpt-oss-120b & gpt-oss-20b model card. arXiv preprint arXiv:2508.10925.
-
[2]
Towards a unified understanding of robot manipulation: A comprehensive survey
Shuanghao Bai, Wenxuan Song, Jiayi Chen, Yuheng Ji, Zhide Zhong, Jin Yang, Han Zhao, Wanqi Zhou, Wei Zhao, Zhe Li, et al. Towards a unified understanding of robot manipulation: A comprehensive survey. arXiv preprint arXiv:2510.10903.
-
[3]
HEX: Humanoid-Aligned Experts for Cross-Embodiment Whole-Body Manipulation
Shuanghao Bai, Meng Li, Xinyuan Lv, Jiawei Wang, Xinhua Wang, Fei Liao, Chengkai Hou, Langzhe Gu, Wanqi Zhou, Kun Wu, et al. Hex: Humanoid-aligned experts for cross-embodiment whole-body manipulation. arXiv preprint arXiv:2604.07993, 2026a. Shuanghao Bai, Jing Lyu, Wanqi Zhou, Zhe Li, Dakai Wang, Lei Xing, Xiaoguang Zhao, Pengwei Wang, Zhongyuan Wang, Chen...
-
[4]
GR00T N1: An Open Foundation Model for Generalist Humanoid Robots
Johan Bjorck, Fernando Castañeda, Nikita Cherniadev, Xingye Da, Runyu Ding, Linxi Fan, Yu Fang, Dieter Fox, Fengyuan Hu, Spencer Huang, et al. Gr00t n1: An open foundation model for generalist humanoid robots. arXiv preprint arXiv:2503.14734.
-
[5]
$\pi_0$: A Vision-Language-Action Flow Model for General Robot Control
Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, et al. $\pi_{0}$: A vision-language-action flow model for general robot control. arXiv preprint arXiv:2410.24164.
-
[6]
Olora: Orthonormal low-rank adaptation of large language models
Kerim Büyükakyüz. Olora: Orthonormal low-rank adaptation of large language models. arXiv preprint arXiv:2406.01775.
-
[7]
Shiqi Chen, Jinghan Zhang, Tongyao Zhu, Wei Liu, Siyang Gao, Miao Xiong, Manling Li, and Junxian He. Bring reason to vision: Understanding perception and reasoning through model merging. In Forty-second International Conference on Machine Learning. Tianxing Chen, Zanxin Chen, Baijun Chen, Zijian Cai, Yibin Liu, Zixuan Li, Qiwei Liang, Xianliang Lin, Yiheng...
-
[8]
Yuxia Fu, Zhizhen Zhang, Yuqi Zhang, Zijian Wang, Zi Huang, and Yadan Luo. Mergevla: Cross-skill model merging toward a generalist vision-language-action agent. arXiv preprint arXiv:2511.18810.
-
[9]
Editing Models with Task Arithmetic
Gabriel Ilharco, Marco Tulio Ribeiro, Mitchell Wortsman, Suchin Gururangan, Ludwig Schmidt, Hannaneh Hajishirzi, and Ali Farhadi. Editing models with task arithmetic. arXiv preprint arXiv:2212.04089.
-
[10]
DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset
Alexander Khazatsky, Karl Pertsch, Suraj Nair, Ashwin Balakrishna, Sudeep Dasari, Siddharth Karamcheti, Soroush Nasiriany, Mohan Kumar Srirama, Lawrence Yunliang Chen, Kirsty Ellis, et al. Droid: A large-scale in-the-wild robot manipulation dataset. arXiv preprint arXiv:2403.12945.
-
[11]
Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success
Moo Jin Kim, Chelsea Finn, and Percy Liang. Fine-tuning vision-language-action models: Optimizing speed and success. arXiv preprint arXiv:2502.19645, 2025a. Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan P Foster, Pannag R Sanketi, Quan Vuong, et al. Openvla: An open-source vision-langua...
-
[12]
RDT-1B: a Diffusion Foundation Model for Bimanual Manipulation
Songming Liu, Lingxuan Wu, Bangguo Li, Hengkai Tan, Huayu Chen, Zhengyi Wang, Ke Xu, Hang Su, and Jun Zhu. Rdt-1b: a diffusion foundation model for bimanual manipulation. arXiv preprint arXiv:2410.07864.
-
[13]
Zhuoyang Liu, Jiaming Liu, Hao Chen, Jiale Yu, Ziyu Guo, Chengkai Hou, Chenyang Gu, Xiangju Mi, Renrui Zhang, Kun Wu, et al. Last_{0}: Latent spatio-temporal chain-of-thought for robotic vision-language-action model. arXiv preprint arXiv:2601.05248.
-
[14]
Pd-vla: Accelerating vision-language-action model integrated with action chunking via parallel decoding
Wenxuan Song, Jiayi Chen, Pengxiang Ding, Han Zhao, Wei Zhao, Zhide Zhong, Zongyuan Ge, Zhijun Li, Donglin Wang, Lujia Wang, et al. Pd-vla: Accelerating vision-language-action model integrated with action chunking via parallel decoding. In 2025 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 13162–13169. IEEE, 2025a. Wenxu...
-
[15]
Orthogonal subspace learning for language model continual learning
Xiao Wang, Tianze Chen, Qiming Ge, Han Xia, Rong Bao, Rui Zheng, Qi Zhang, Tao Gui, and Xuan-Jing Huang. Orthogonal subspace learning for language model continual learning. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 10658–10671, 2023.
-
[16]
RoboCOIN: An Open-Sourced Bimanual Robotic Data Collection for Integrated Manipulation
Shihan Wu, Xuecheng Liu, Shaoxuan Xie, Pengwei Wang, Xinghang Li, Bowen Yang, Zhe Li, Kai Zhu, Hongyu Wu, Yiheng Liu, et al. Robocoin: An open-sourced bimanual robotic data collection for integrated manipulation. arXiv preprint arXiv:2511.17441.
-
[17]
Youguang Xing, Xu Luo, Junlin Xie, Lianli Gao, Hengtao Shen, and Jingkuan Song. Shortcut learning in generalist robot policies: The role of dataset diversity and fragmentation. arXiv preprint arXiv:2508.06426.
-
[18]
What matters for model merging at scale?
Prateek Yadav, Tu Vu, Jonathan Lai, Alexandra Chronopoulou, Manaal Faruqui, Mohit Bansal, and Tsendsuren Munkhdalai. What matters for model merging at scale? arXiv preprint arXiv:2410.03617.
-
[19]
Yajat Yadav, Zhiyuan Zhou, Andrew Wagenmaker, Karl Pertsch, and Sergey Levine. Robust finetuning of vision-language-action robot policies via parameter merging. arXiv preprint arXiv:2512.08333.
-
[20]
An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388.
-
[21]
Han Zhao, Jingbo Wang, Wenxuan Song, Shuai Chen, Yang Liu, Yan Wang, Haoang Li, and Donglin Wang. Frappe: Infusing world modeling into generalist policies via multiple future representation alignment. arXiv preprint arXiv:2602.17259.
-
[22]
Aloha unleashed: A simple recipe for robot dexterity
Tony Z Zhao, Jonathan Tompson, Danny Driess, Pete Florence, Kamyar Ghasemipour, Chelsea Finn, and Ayzaan Wahid. Aloha unleashed: A simple recipe for robot dexterity. arXiv preprint arXiv:2410.13126, 2024.
-
[23]
Flare: Robot learning with implicit world modeling, 2025
Ruijie Zheng, Jing Wang, Scott Reed, Johan Bjorck, Yu Fang, Fengyuan Hu, Joel Jang, Kaushil Kundalia, Zongyu Lin, Loic Magne, et al. Flare: Robot learning with implicit world modeling. arXiv preprint arXiv:2505.15659, 2025.
-
[24]
Zhide Zhong, Junfeng Li, Junjie He, Haodong Yan, Xin Gong, Guanyi Zhao, Yingjie Cai, Jiantao Gao, Xu Yan, Bingbing Liu, et al. Dualcot-vla: Visual-linguistic chain of thought via parallel reasoning for vision-language-action models. arXiv preprint arXiv:2603.22280.
-
[25]
robosuite: A Modular Simulation Framework and Benchmark for Robot Learning
Yuke Zhu, Josiah Wong, Ajay Mandlekar, Roberto Martín-Martín, Abhishek Joshi, Soroush Nasiriany, and Yifeng Zhu. robosuite: A modular simulation framework and benchmark for robot learning. arXiv preprint arXiv:2009.12293.
-
[26]
to reinforce language grounding. On the LIBERO simulation benchmark (Liu et al., 2023), OpenVLA-OFT achieves a 97.1% success rate while increasing action generation speed by 26 times. In real-world experiments with a bimanual ALOHA robot (Zhao et al., 2024), it surpasses strong finetuned VLAs such as π0 (Black et al.,
-
[27]
and RDT-1B (Liu et al., 2024), as well as policies trained from scratch, achieving up to a 15% absolute improvement in average success rate on dexterous manipulation tasks. π0.5 (Black et al., 2025). π0.5 is a vision-language-action (VLA) model designed to achieve open-world generalization for real-world robotic manipulation. Building on π0, π0.5 adopts a...
-
[28]
and Cosmos (Kim et al., 2026)) with four representative action-decoding paradigms (e.g., autoregressive tokenization, flow-matching). This unified architecture is further strengthened by reusable, paradigm-agnostic training strategies for multimodal co- training and a standardized server-client interface for cross-benchmark evaluation. Across major benchm...
discussion (0)