CKT-WAM: Parameter-Efficient Context Knowledge Transfer Between World Action Models
Pith reviewed 2026-05-08 08:58 UTC · model grok-4.3
The pith
CKT-WAM transfers knowledge between world action models by injecting a compact adapted context from the teacher into the student's text embeddings.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that knowledge moves from teacher to student world action model without output imitation or dense hidden-state matching. Intermediate teacher states are compressed by learnable-query cross attention, adapted by an always-on generalized adapter plus a router that triggers specialized adapters, and the resulting context is appended to the student's conditioning textual embeddings. This minimal addition lets the student generate improved actions on new tasks.
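The compression step described above can be sketched as a single cross-attention pass in which a small set of learnable queries attends over the teacher's hidden states; all dimensions and the random initialization here are illustrative, not taken from the paper:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)

T, d_t = 196, 64   # teacher token count and teacher hidden size (illustrative)
M, d = 8, 32       # compressed context length and student embedding size

teacher_hidden = rng.standard_normal((T, d_t))

# Learnable parameters (random here; trained jointly in practice)
queries = rng.standard_normal((M, d))                 # learnable queries
W_k = rng.standard_normal((d_t, d)) / np.sqrt(d_t)    # key projection
W_v = rng.standard_normal((d_t, d)) / np.sqrt(d_t)    # value projection

K = teacher_hidden @ W_k                       # (T, d)
V = teacher_hidden @ W_v                       # (T, d)
attn = softmax(queries @ K.T / np.sqrt(d))     # (M, T): each query attends over all T tokens
context = attn @ V                             # (M, d): T teacher tokens compressed to M
print(context.shape)                           # -> (8, 32)
```

Note that the key/value projections absorb the teacher's hidden dimensionality, which is what lets the compressed context land in the student's embedding size regardless of the teacher's internals.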
What carries the argument
The context injection mechanism that compresses teacher hidden states via learnable-query cross attention and adapts them with generalized and specialized adapters before appending to student text embeddings.
If this is right
- Zero-shot generalization improves consistently across manipulation tasks.
- The total success rate reaches 86.1 percent on LIBERO-Plus with only 1.17 percent trainable parameters.
- Real-world performance reaches an 83.3 percent average success rate on four multi-step long-horizon tasks.
- The method approaches full fine-tuning results while keeping adaptation cost low.
Where Pith is reading between the lines
- The text embedding space may act as a shared interface that lets similar transfers work across different model families.
- Practitioners could maintain one heavy teacher and many lightweight students that draw on it for new tasks.
- The same compression and adapter pattern might apply to other generative control models where direct state alignment is costly.
Load-bearing premise
That the compressed and adapted context from the teacher hidden states can be injected into the student's text embedding space and will improve action generation without any matched latent interfaces or dense alignment.
What would settle it
A controlled test on LIBERO-Plus comparing the student with context injection against an identical student run without any transferred context: if the injection yields no improvement, the claimed transfer benefit is falsified.
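That settling experiment could be run as a multi-condition harness in which only the injected context varies while the 1.17 percent of trainable modules stay frozen; `run_student` below is a hypothetical placeholder for the actual LIBERO-Plus rollout and scoring:

```python
import numpy as np

rng = np.random.default_rng(0)
M, d = 8, 32

teacher_context = rng.standard_normal((M, d))   # stand-in for the LQCA output

conditions = {
    "teacher": teacher_context,                 # full transfer pathway
    "zeros": np.zeros((M, d)),                  # ablated: no information
    "noise": rng.standard_normal((M, d)),       # ablated: matched scale, no signal
}

def run_student(context):
    # Placeholder: a real harness would roll out the frozen student with the
    # injected context and report the LIBERO-Plus success rate per condition.
    return {"context_shape": context.shape}

results = {name: run_student(ctx) for name, ctx in conditions.items()}
# The transfer claim survives only if "teacher" beats both "zeros" and "noise".
```

The zero and noise conditions separate transferred knowledge from mere adapter capacity, which is exactly the isolation the referee report asks for.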
Original abstract
World action models (WAMs) provide a powerful generative framework for embodied control, yet transferring knowledge across heterogeneous WAMs remains challenging due to mismatched latent interfaces, high adaptation cost, and the rigidity of conventional distillation objectives. We propose CKT-WAM, a parameter-efficient Context Knowledge Transfer framework that transfers the teacher WAM's knowledge into a student WAM through a compact context in the text embedding space, rather than output imitation or dense hidden-state matching. Specifically, CKT-WAM extracts intermediate teacher hidden states, reduces the number of tokens via the compressor's learnable-query cross attention (LQCA), and transforms them through an always-on generalized adapter, a lightweight router, and sparsely activated specialized adapters. The resulting context is then appended to the student's conditioning textual embeddings, thereby injecting the transferred knowledge into the student with minimal architectural modification. Experiments show that CKT-WAM consistently improves zero-shot generalization and achieves the best overall performance on LIBERO-Plus, reaching an 86.1% total success rate with only 1.17% trainable parameters, while approaching full fine-tuning performance. Beyond simulation, CKT-WAM also demonstrates strong real-world long-horizon manipulation ability, achieving the best average success rate of 83.3% across four multi-step and long-horizon tasks. Code is available at https://github.com/YuhuaJiang2002/CKT-WAM.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces CKT-WAM, a parameter-efficient framework for transferring knowledge between heterogeneous world action models (WAMs). It extracts intermediate hidden states from a teacher WAM, compresses them via learnable-query cross-attention (LQCA), processes them through an always-on generalized adapter plus router and sparsely activated specialized adapters (totaling 1.17% trainable parameters), and appends the resulting context to the student WAM's text conditioning embeddings. Experiments report that this yields the best overall performance on LIBERO-Plus (86.1% total success rate), improves zero-shot generalization, approaches full fine-tuning, and achieves 83.3% average success on four real-world long-horizon manipulation tasks. Code is released.
Significance. If the results hold, the work is significant for enabling efficient cross-model knowledge transfer in embodied AI without requiring latent interface alignment or dense distillation. The low parameter count and strong real-robot results on long-horizon tasks could facilitate practical deployment of adapted WAMs. Explicit credit is due for releasing code, which supports reproducibility.
major comments (2)
- [§4 Experiments] Including Table 1 and the ablation tables: The central claim that performance gains arise from teacher-specific context transfer (rather than adapter capacity) is load-bearing but under-supported. The architecture trains the generalized adapter, router, and specialized adapters jointly with the LQCA injection pathway; no control ablation is described that replaces teacher hidden states with noise, zeros, or student-only inputs while keeping the 1.17% trainable modules fixed. Without this isolation, the 86.1% LIBERO-Plus and 83.3% real-world results cannot be confidently attributed to transferred knowledge.
- [§3.2 Method] LQCA and adapter equations: The claim that 'no latent interface matching is needed' and that context injection works directly in text embedding space is not accompanied by a cross-architecture test (e.g., teacher and student WAMs with substantially different hidden-state dimensionalities or conditioning mechanisms). The reported gains on LIBERO-Plus therefore rest on an untested assumption about the robustness of the text-embedding injection pathway.
minor comments (2)
- [Abstract and §4] The abstract and §4 should explicitly state the number of random seeds, statistical tests, and exact data splits used for the 86.1% and 83.3% figures to allow direct comparison with baselines.
- [Figure 3 and §3.2] Figure 3 (architecture diagram) caption and §3.2 notation: the distinction between 'always-on generalized adapter' and 'sparsely activated specialized adapters' is visually clear but the router gating equation is not written out; adding it would improve clarity.
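One conventional way to write the missing gate, assuming top-k softmax routing over specialized adapters $A_e$ alongside an always-on generalized adapter $A_g$; the paper's exact formulation may differ:

```latex
g(x) = \operatorname{softmax}(W_r x), \qquad
y = x + A_g(x) + \sum_{e \,\in\, \operatorname{TopK}(g(x))} g_e(x)\, A_e(x)
```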
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive feedback. We address each major comment below, acknowledge where the manuscript can be strengthened, and describe the revisions we will make.
Point-by-point responses
Referee: [§4 Experiments] Including Table 1 and the ablation tables: The central claim that performance gains arise from teacher-specific context transfer (rather than adapter capacity) is load-bearing but under-supported. The architecture trains the generalized adapter, router, and specialized adapters jointly with the LQCA injection pathway; no control ablation is described that replaces teacher hidden states with noise, zeros, or student-only inputs while keeping the 1.17% trainable modules fixed. Without this isolation, the 86.1% LIBERO-Plus and 83.3% real-world results cannot be confidently attributed to transferred knowledge.
Authors: We agree that the requested control ablation—replacing teacher hidden states with noise, zeros, or student-only inputs while freezing the 1.17% trainable modules—would provide direct evidence that gains stem from transferred context rather than adapter capacity alone. Our existing ablations isolate the LQCA compressor and the router/specialized-adapter pathway, but they do not perform this exact isolation. We will add the control experiment to the revised manuscript (new row in the ablation table) to strengthen attribution of the 86.1% and 83.3% results. revision: yes
Referee: [§3.2 Method] LQCA and adapter equations: The claim that 'no latent interface matching is needed' and that context injection works directly in text embedding space is not accompanied by a cross-architecture test (e.g., teacher and student WAMs with substantially different hidden-state dimensionalities or conditioning mechanisms). The reported gains on LIBERO-Plus therefore rest on an untested assumption about the robustness of the text-embedding injection pathway.
Authors: The design intentionally decouples the transfer from latent-space alignment by compressing teacher states via LQCA into a fixed-length context that is appended to the student's text-conditioning embeddings; this pathway is architecture-agnostic by construction. While the LIBERO-Plus experiments already involve heterogeneous WAMs with differing internal representations, we acknowledge that explicit tests with larger dimensionality gaps or dissimilar conditioning mechanisms would further demonstrate robustness. We will expand the method discussion with a note on this generality and add a small-scale cross-architecture experiment or analysis in the revision. revision: partial
Circularity Check
No significant circularity in the proposed framework
Full rationale
The paper presents an empirical architecture for context knowledge transfer between heterogeneous world action models, using LQCA compression and adapter modules to inject teacher hidden states into student text embeddings. Reported metrics (86.1% on LIBERO-Plus, 83.3% real-world) are measured outcomes on external benchmarks, not quantities defined in terms of the fitted parameters themselves. No mathematical derivation chain, uniqueness theorems, or predictions that reduce to inputs by construction appear in the abstract or described method. The framework is a standard adapter-based transfer approach evaluated independently.
Axiom & Free-Parameter Ledger
free parameters (2)
- learnable queries in LQCA
- generalized adapter, router, and specialized adapter weights
axioms (1)
- Domain assumption: heterogeneous WAMs can exchange useful knowledge through a common text-embedding interface.
invented entities (2)
- LQCA compressor (no independent evidence)
- always-on generalized adapter plus router and specialized adapters (no independent evidence)