CKT-WAM: Parameter-Efficient Context Knowledge Transfer Between World Action Models
Pith reviewed 2026-05-08 08:58 UTC · model grok-4.3
The pith
CKT-WAM transfers knowledge between world action models by injecting a compact adapted context from the teacher into the student's text embeddings.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that knowledge moves from teacher to student world action model without output imitation or dense hidden-state matching. Intermediate teacher states are compressed by learnable-query cross attention, adapted by an always-on generalized adapter plus a router that triggers specialized adapters, and the resulting context is appended to the student's conditioning textual embeddings. This minimal addition lets the student generate improved actions on new tasks.
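The compression step described above can be sketched as a single cross-attention pass in which a small set of learnable queries attends over the teacher's hidden states; all dimensions and the random initialization here are illustrative, not taken from the paper:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)

T, d_t = 196, 64   # teacher token count and teacher hidden size (illustrative)
M, d = 8, 32       # compressed context length and student embedding size

teacher_hidden = rng.standard_normal((T, d_t))

# Learnable parameters (random here; trained jointly in practice)
queries = rng.standard_normal((M, d))                 # learnable queries
W_k = rng.standard_normal((d_t, d)) / np.sqrt(d_t)    # key projection
W_v = rng.standard_normal((d_t, d)) / np.sqrt(d_t)    # value projection

K = teacher_hidden @ W_k                       # (T, d)
V = teacher_hidden @ W_v                       # (T, d)
attn = softmax(queries @ K.T / np.sqrt(d))     # (M, T): each query attends over all T tokens
context = attn @ V                             # (M, d): T teacher tokens compressed to M
print(context.shape)                           # -> (8, 32)
```

Note that the key/value projections absorb the teacher's hidden dimensionality, which is what lets the compressed context land in the student's embedding size regardless of the teacher's internals.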
What carries the argument
The context injection mechanism that compresses teacher hidden states via learnable-query cross attention and adapts them with generalized and specialized adapters before appending to student text embeddings.
If this is right
- Zero-shot generalization improves consistently across manipulation tasks.
- The total success rate reaches 86.1 percent on LIBERO-Plus with only 1.17 percent trainable parameters.
- Real-world performance reaches an 83.3 percent average success rate on four multi-step long-horizon tasks.
- The method approaches full fine-tuning results while keeping adaptation cost low.
Where Pith is reading between the lines
- The text embedding space may act as a shared interface that lets similar transfers work across different model families.
- Practitioners could maintain one heavy teacher and many lightweight students that draw on it for new tasks.
- The same compression and adapter pattern might apply to other generative control models where direct state alignment is costly.
Load-bearing premise
That the compressed and adapted context from the teacher hidden states can be injected into the student's text embedding space and will improve action generation without any matched latent interfaces or dense alignment.
What would settle it
A controlled test on LIBERO-Plus comparing the student with context injection against an identical student run without any transferred context: if the injection yields no improvement, the claimed transfer benefit is falsified.
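That settling experiment could be run as a multi-condition harness in which only the injected context varies while the 1.17 percent of trainable modules stay frozen; `run_student` below is a hypothetical placeholder for the actual LIBERO-Plus rollout and scoring:

```python
import numpy as np

rng = np.random.default_rng(0)
M, d = 8, 32

teacher_context = rng.standard_normal((M, d))   # stand-in for the LQCA output

conditions = {
    "teacher": teacher_context,                 # full transfer pathway
    "zeros": np.zeros((M, d)),                  # ablated: no information
    "noise": rng.standard_normal((M, d)),       # ablated: matched scale, no signal
}

def run_student(context):
    # Placeholder: a real harness would roll out the frozen student with the
    # injected context and report the LIBERO-Plus success rate per condition.
    return {"context_shape": context.shape}

results = {name: run_student(ctx) for name, ctx in conditions.items()}
# The transfer claim survives only if "teacher" beats both "zeros" and "noise".
```

The zero and noise conditions separate transferred knowledge from mere adapter capacity, which is exactly the isolation the referee report asks for.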
Original abstract
World action models (WAMs) provide a powerful generative framework for embodied control, yet transferring knowledge across heterogeneous WAMs remains challenging due to mismatched latent interfaces, high adaptation cost, and the rigidity of conventional distillation objectives. We propose CKT-WAM, a parameter-efficient Context Knowledge Transfer framework that transfers the teacher WAM's knowledge into a student WAM through a compact context in the text embedding space, rather than output imitation or dense hidden-state matching. Specifically, CKT-WAM extracts intermediate teacher hidden states, reduces the number of tokens via the compressor's learnable-query cross attention (LQCA), and transforms them through an always-on generalized adapter, a lightweight router, and sparsely activated specialized adapters. The resulting context is then appended to the student's conditioning textual embeddings, thereby injecting the transferred knowledge into the student with minimal architectural modification. Experiments show that CKT-WAM consistently improves zero-shot generalization and achieves the best overall performance on LIBERO-Plus, reaching an 86.1% total success rate with only 1.17% trainable parameters, while approaching full fine-tuning performance. Beyond simulation, CKT-WAM also demonstrates strong real-world long-horizon manipulation ability, achieving the best average success rate of 83.3% across four multi-step and long-horizon tasks. Code is available at https://github.com/YuhuaJiang2002/CKT-WAM.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces CKT-WAM, a parameter-efficient framework for transferring knowledge between heterogeneous world action models (WAMs). It extracts intermediate hidden states from a teacher WAM, compresses them via learnable-query cross-attention (LQCA), processes them through an always-on generalized adapter plus router and sparsely activated specialized adapters (totaling 1.17% trainable parameters), and appends the resulting context to the student WAM's text conditioning embeddings. Experiments report that this yields the best overall performance on LIBERO-Plus (86.1% total success rate), improves zero-shot generalization, approaches full fine-tuning, and achieves 83.3% average success on four real-world long-horizon manipulation tasks. Code is released.
Significance. If the results hold, the work is significant for enabling efficient cross-model knowledge transfer in embodied AI without requiring latent interface alignment or dense distillation. The low parameter count and strong real-robot results on long-horizon tasks could facilitate practical deployment of adapted WAMs. Explicit credit is due for releasing code, which supports reproducibility.
major comments (2)
- [§4 Experiments] Including Table 1 and the ablation tables: The central claim that performance gains arise from teacher-specific context transfer (rather than adapter capacity) is load-bearing but under-supported. The architecture trains the generalized adapter, router, and specialized adapters jointly with the LQCA injection pathway; no control ablation is described that replaces teacher hidden states with noise, zeros, or student-only inputs while keeping the 1.17% trainable modules fixed. Without this isolation, the 86.1% LIBERO-Plus and 83.3% real-world results cannot be confidently attributed to transferred knowledge.
- [§3.2 Method] LQCA and adapter equations: The claim that 'no latent interface matching is needed' and that context injection works directly in text embedding space is not accompanied by a cross-architecture test (e.g., teacher and student WAMs with substantially different hidden-state dimensionalities or conditioning mechanisms). The reported gains on LIBERO-Plus therefore rest on an untested assumption about the robustness of the text-embedding injection pathway.
minor comments (2)
- [Abstract and §4] The abstract and §4 should explicitly state the number of random seeds, statistical tests, and exact data splits used for the 86.1% and 83.3% figures to allow direct comparison with baselines.
- [Figure 3 and §3.2] Figure 3 (architecture diagram) caption and §3.2 notation: the distinction between 'always-on generalized adapter' and 'sparsely activated specialized adapters' is visually clear but the router gating equation is not written out; adding it would improve clarity.
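One conventional way to write the missing gate, assuming top-k softmax routing over specialized adapters $A_e$ alongside an always-on generalized adapter $A_g$; the paper's exact formulation may differ:

```latex
g(x) = \operatorname{softmax}(W_r x), \qquad
y = x + A_g(x) + \sum_{e \,\in\, \operatorname{TopK}(g(x))} g_e(x)\, A_e(x)
```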
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive feedback. We address each major comment below, acknowledge where the manuscript can be strengthened, and describe the revisions we will make.
Point-by-point responses
Referee: [§4 Experiments] Including Table 1 and the ablation tables: The central claim that performance gains arise from teacher-specific context transfer (rather than adapter capacity) is load-bearing but under-supported. The architecture trains the generalized adapter, router, and specialized adapters jointly with the LQCA injection pathway; no control ablation is described that replaces teacher hidden states with noise, zeros, or student-only inputs while keeping the 1.17% trainable modules fixed. Without this isolation, the 86.1% LIBERO-Plus and 83.3% real-world results cannot be confidently attributed to transferred knowledge.
Authors: We agree that the requested control ablation—replacing teacher hidden states with noise, zeros, or student-only inputs while freezing the 1.17% trainable modules—would provide direct evidence that gains stem from transferred context rather than adapter capacity alone. Our existing ablations isolate the LQCA compressor and the router/specialized-adapter pathway, but they do not perform this exact isolation. We will add the control experiment to the revised manuscript (new row in the ablation table) to strengthen attribution of the 86.1% and 83.3% results. revision: yes
Referee: [§3.2 Method] LQCA and adapter equations: The claim that 'no latent interface matching is needed' and that context injection works directly in text embedding space is not accompanied by a cross-architecture test (e.g., teacher and student WAMs with substantially different hidden-state dimensionalities or conditioning mechanisms). The reported gains on LIBERO-Plus therefore rest on an untested assumption about the robustness of the text-embedding injection pathway.
Authors: The design intentionally decouples the transfer from latent-space alignment by compressing teacher states via LQCA into a fixed-length context that is appended to the student's text-conditioning embeddings; this pathway is architecture-agnostic by construction. While the LIBERO-Plus experiments already involve heterogeneous WAMs with differing internal representations, we acknowledge that explicit tests with larger dimensionality gaps or dissimilar conditioning mechanisms would further demonstrate robustness. We will expand the method discussion with a note on this generality and add a small-scale cross-architecture experiment or analysis in the revision. revision: partial
Circularity Check
No significant circularity in the proposed framework
Full rationale
The paper presents an empirical architecture for context knowledge transfer between heterogeneous world action models, using LQCA compression and adapter modules to inject teacher hidden states into student text embeddings. Reported metrics (86.1% on LIBERO-Plus, 83.3% real-world) are measured outcomes on external benchmarks, not quantities defined in terms of the fitted parameters themselves. No mathematical derivation chain, uniqueness theorems, or predictions that reduce to inputs by construction appear in the abstract or described method. The framework is a standard adapter-based transfer approach evaluated independently.
Axiom & Free-Parameter Ledger
free parameters (2)
- learnable queries in LQCA
- generalized adapter, router, and specialized adapter weights
axioms (1)
- Domain assumption: heterogeneous WAMs can exchange useful knowledge through a common text-embedding interface.
invented entities (2)
- LQCA compressor (no independent evidence)
- always-on generalized adapter plus router and specialized adapters (no independent evidence)