arxiv: 2511.18085 · v4 · submitted 2025-11-22 · 💻 cs.RO · cs.AI

Continually Evolving Skill Knowledge in Vision Language Action Model

Pith reviewed 2026-05-17 05:58 UTC · model grok-4.3

classification 💻 cs.RO cs.AI
keywords continual learning · vision-language-action · imitation learning · expert routing · knowledge evolution · LIBERO benchmark · dual-arm manipulation

The pith

Stellar VLA lets vision-language-action models acquire new skills by evolving a shared knowledge space and routing tasks to experts, without adding parameters or forgetting earlier skills.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to show that vision-language-action models can adapt continually to sequences of tasks by building and updating a learned knowledge space while using a routing system to direct each task to the right parts of the model. It does this through two model variants that jointly optimize task representations and the knowledge space, then apply a knowledge-guided expert routing step based on how tasks relate and on the top semantic matches. A sympathetic reader would care because large pretrained robot models are costly to retrain from scratch for every new skill, and most existing continual-learning fixes either grow the network or require heavy replay of past data. If the approach holds, it would let robots pick up manipulation skills over time in changing scenes while keeping model size fixed and retaining earlier capabilities.

Core claim

Stellar VLA is a knowledge-driven continual imitation learning framework that enables self-evolving knowledge learning by jointly optimizing task representations and a learned knowledge space. It introduces a knowledge-guided expert routing mechanism conditioned on knowledge relation and Top-K semantic embeddings that supports task specialization without increasing model size. On the LIBERO benchmark the resulting models achieve strong performance among VLA and CIL baselines while using only 1% data replay, and real-world dual-arm experiments confirm effective knowledge transfer across distinct embodiments and scenes.

What carries the argument

The knowledge-guided expert routing mechanism that conditions routing decisions on knowledge relations and Top-K semantic embeddings to assign tasks to specialized experts while preserving prior knowledge.
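
As a rough illustration of what such routing could look like, the sketch below builds a relation vector and a Top-K semantic embedding from a task representation and uses both to gate a fixed set of experts. It is not the paper's implementation: the function `knowledge_guided_gate`, the arrays `components`, `w_rel`, and `w_sem`, the dot-product similarity, and the sparse softmax gate are all assumptions made for this example.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D array.
    e = np.exp(x - x.max())
    return e / e.sum()

def knowledge_guided_gate(z, components, w_rel, w_sem, top_k=2, n_active=2):
    """Return sparse expert weights for one task representation.

    All names are hypothetical:
    z          : (d,)   task representation
    components : (C, d) embeddings of knowledge-space components
    w_rel      : (C, E) router weights applied to the relation vector
    w_sem      : (d, E) router weights applied to the Top-K semantic embedding
    """
    # Knowledge relation: soft similarity of z to every component.
    scores = components @ z / np.sqrt(z.shape[0])
    relation = softmax(scores)                      # shape (C,)

    # Top-K semantic embedding: relation-weighted mean of the K best matches.
    top_idx = np.argsort(scores)[-top_k:]
    weights = relation[top_idx] / relation[top_idx].sum()
    semantic = weights @ components[top_idx]        # shape (d,)

    # Router logits conditioned on both signals, then a sparse gate
    # that keeps only the n_active highest-scoring experts.
    logits = relation @ w_rel + semantic @ w_sem    # shape (E,)
    active = np.argsort(logits)[-n_active:]
    gate = np.zeros_like(logits)
    gate[active] = softmax(logits[active])
    return gate, top_idx

# Toy usage with random inputs (illustrative sizes only).
rng = np.random.default_rng(0)
d, C, E = 8, 5, 4
gate, top_idx = knowledge_guided_gate(
    rng.normal(size=d), rng.normal(size=(C, d)),
    rng.normal(size=(C, E)), rng.normal(size=(d, E)),
)
print(gate, top_idx)
```

The number of experts stays fixed in this sketch; only the gate changes with the task. Whether the router weights count toward model size is exactly the question the referee report raises.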

If this is right

  • Stellar VLAs match or exceed both VLA and CIL baselines on the LIBERO benchmark while replaying only 1% of prior data.
  • Knowledge transfer remains effective when the same models are deployed on real dual-arm hardware with new embodiments and scenes.
  • The hierarchical TS-Stellar variant shows particular strength on tasks that require composing skills in stages.
  • Visualizations of the learned knowledge space indicate both retention of earlier tasks and discovery of structure among new ones.
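
The abstract reports 1% data replay, but this summary does not say how that 1% is drawn. A minimal sketch, assuming uniform per-task sampling of whole demonstrations with at least one demonstration kept per task (both assumptions, as are the function names):

```python
import random

def build_replay_buffer(demos_by_task, fraction=0.01, seed=0):
    """Keep a small random subset of each earlier task's demonstrations.

    Assumption: the fraction is applied per task, and every task keeps
    at least one demonstration so no task vanishes from replay.
    """
    rng = random.Random(seed)
    buffer = {}
    for task, demos in demos_by_task.items():
        n_keep = max(1, round(len(demos) * fraction))
        buffer[task] = rng.sample(demos, n_keep)
    return buffer

def mix_with_replay(new_task_demos, buffer):
    """Training set for the next task: its own demos plus all replayed ones."""
    mixed = list(new_task_demos)
    for demos in buffer.values():
        mixed.extend(demos)
    return mixed

# Toy usage: integers stand in for demonstrations.
buffer = build_replay_buffer({"task_a": list(range(500)), "task_b": list(range(50))})
mixed = mix_with_replay(["new_demo"] * 10, buffer)
print({task: len(demos) for task, demos in buffer.items()}, len(mixed))
```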

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same routing idea could be tested on longer task sequences to see how far the fixed-size model can scale before interference appears.
  • Because the knowledge space is learned jointly with tasks, it might transfer to other continual-learning settings such as language-only or vision-only models.
  • Pairing this routing with stronger initial VLA pretraining could further lower the amount of replay needed for acceptable performance.
  • Real-world tests on additional robot platforms would show whether the dual-arm results generalize to single-arm or mobile-manipulator settings.

Load-bearing premise

The assumption that routing decisions based on knowledge relations and top semantic embeddings can reliably produce task specialization without any parameter growth or loss of earlier skills.

What would settle it

If sequential tasks on the LIBERO benchmark produce performance well below the reported VLA and CIL baselines when replay is limited to 1%, or if the real dual-arm experiments show clear forgetting of prior tasks across embodiment changes, the central claims would not hold.
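
To make this test concrete, the sketch below computes a final average success rate and an average forgetting score from a matrix of per-task success rates. The forgetting definition used here (best earlier success minus final success, averaged over all but the last task) is one common choice and may differ from the metrics the paper reports; the numbers in `R` are made up for illustration and are not results from the paper.

```python
import numpy as np

def final_average_success(R):
    """Mean success over all tasks after the last training stage.

    R[i][j] is the success rate on task j after training through task i.
    """
    R = np.asarray(R, dtype=float)
    return float(R[-1].mean())

def average_forgetting(R):
    """Average drop from each task's best earlier success to its final success."""
    R = np.asarray(R, dtype=float)
    T = R.shape[0]
    if T < 2:
        return 0.0
    # For task j, consider stages j .. T-2 (after it was learned, before the end).
    drops = [R[j:T - 1, j].max() - R[-1, j] for j in range(T - 1)]
    return float(np.mean(drops))

# Illustrative numbers only.
R = [
    [0.9, 0.0, 0.0],
    [0.8, 0.7, 0.0],
    [0.6, 0.7, 0.8],
]
print(final_average_success(R), average_forgetting(R))
```

A large forgetting score under 1% replay, or a final average well below the baselines, would be the kind of outcome described above.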

Figures

Figures reproduced from arXiv: 2511.18085 by Brian Sheil, Guangming Wang, Hesheng Wang, Maoqing Yao, Yuxuan Wu, Zhiheng Yang.

Figure 1. We present Stellar VLA, a continual learning VLA framework driven by a self-evolving knowledge space. Its variants T-Stellar and TS-Stellar demonstrate superior final average success rates over UniVLA [4] and MoDE [31].
Figure 2. Overall architecture of Stellar VLA. CLIP [29] and FiLM [28]-conditioned ResNet encode language and visual inputs, respectively. The task-centric representation z and knowledge space are jointly learned through knowledge update and latents aggregation, as detailed in Sec. 3.3. The learned knowledge prior finally guides the MoE action head for motion prediction, as detailed in Sec. 3.4.
Figure 3. Knowledge-prior-routed MoE action head. Two knowledge embeddings, relation and top-K semantic, are computed for expert routing, alongside language, noise, observation and noise action tokens fed into the denoising transformer.
Figure 4. T-SNE visualization of Stellar VLA. Latent representations after 1, 4, 6, 8, and 10 tasks on LIBERO-long are shown. Task names are abbreviated for clarity. T-Stellar models discrete task distributions, and TS-Stellar learns relevant skills across tasks.
Figure 5. Behavior visualization on “Pick up Bag” after training on “Handover Toy”. TS-Stellar achieves the most synchronized dual-arm motion; T-Stellar hesitates slightly; MoDE∗ and w/o KS show strong desynchronization and ultimately fail.
Original abstract

Vision-language-action (VLA) models show promising knowledge accumulation ability from pretraining, yet continual learning in VLA remains challenging, especially for efficient adaptation. Existing continual imitation learning (CIL) methods often rely on additional parameters or external modules, limiting scalability for large VLA models. We propose Stellar VLA, a knowledge-driven CIL framework without increasing network parameters. Two progressively extended variants are designed: T-Stellar for flat task-centric modeling and TS-Stellar for hierarchical task-skill structure. Stellar VLA enables self-evolving knowledge learning by jointly optimizing task representations and a learned knowledge space. We propose a knowledge-guided expert routing mechanism conditioned on knowledge relation and Top-K semantic embeddings, enabling task specialization without increasing model size. Experiments on the LIBERO benchmark show that Stellar VLAs achieve strong performance among both VLA and CIL baselines, using only 1% data replay. Real-world evaluation on a dual-arm platform with distinct embodiment and scene configurations validates effective knowledge transfer. TS-Stellar excels in hierarchical manipulation, and visualizations reveal robust knowledge retention and task discovery. Project Website: https://stellarvla.github.io/

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes Stellar VLA, a knowledge-driven continual imitation learning (CIL) framework for Vision-Language-Action (VLA) models that achieves self-evolving skill knowledge without increasing network parameters. It introduces two variants—T-Stellar for flat task-centric modeling and TS-Stellar for hierarchical task-skill structures—along with a knowledge-guided expert routing mechanism conditioned on knowledge relations and Top-K semantic embeddings. On the LIBERO benchmark, Stellar VLAs report strong performance relative to VLA and CIL baselines using only 1% data replay; real-world dual-arm experiments validate knowledge transfer, with TS-Stellar excelling in hierarchical manipulation and visualizations showing knowledge retention and task discovery.

Significance. If the routing mechanism demonstrably enables parameter-free specialization and forgetting mitigation, the work offers a scalable path for continual adaptation in large VLAs, addressing a key limitation of existing CIL methods that rely on added parameters or external modules. The combination of benchmark results with real-robot transfer and hierarchical modeling would represent a meaningful advance for efficient lifelong robotic learning.

major comments (2)
  1. [§5] §5 (Experiments): the claim that the knowledge-guided expert routing produces task specialization and prevents catastrophic forgetting without parameter growth is load-bearing for the LIBERO performance results, yet no ablation isolates the routing (conditioned on knowledge relation and Top-K embeddings) from standard continual baselines or from the 1% replay buffer; without this, the contribution of the proposed mechanism to the reported gains cannot be verified.
  2. [§4] §4 (Method, routing subsection): the assertion that the routing reuses existing parameters without adding new ones for the knowledge space or conditioning is central to the 'no increase in model size' claim, but the manuscript provides no explicit parameter-count comparison table or derivation showing how the Top-K semantic embeddings and knowledge relations are implemented inside the fixed-size network.
minor comments (2)
  1. [Abstract] Abstract and §5: quantitative metrics (e.g., success rates, exact baselines compared) and error bars or statistical significance are not reported in the summary of LIBERO results; adding these would improve verifiability of the 'strong performance' statement.
  2. [§6] §6 (Real-world evaluation): the dual-arm platform description lacks detail on embodiment differences and scene configurations relative to simulation; a brief table comparing sim-to-real gaps would clarify the transfer claims.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the two major comments point by point below, agreeing that additional clarifications and experiments will strengthen the manuscript. We will incorporate the suggested changes in the revised version.

Point-by-point responses
  1. Referee: [§5] §5 (Experiments): the claim that the knowledge-guided expert routing produces task specialization and prevents catastrophic forgetting without parameter growth is load-bearing for the LIBERO performance results, yet no ablation isolates the routing (conditioned on knowledge relation and Top-K embeddings) from standard continual baselines or from the 1% replay buffer; without this, the contribution of the proposed mechanism to the reported gains cannot be verified.

    Authors: We agree that an explicit ablation isolating the routing mechanism would strengthen the evidence. Our LIBERO results already compare Stellar VLA variants against both standard VLA models and existing CIL baselines (which lack the knowledge-guided routing), with performance gains observed under the 1% replay setting. To directly address the concern, the revised manuscript will include a new ablation study that disables the knowledge-guided expert routing and Top-K semantic embeddings while retaining the 1% replay buffer, reporting the resulting drop in task success rates and increased forgetting on LIBERO to quantify the routing's isolated contribution to specialization and continual performance. revision: yes

  2. Referee: [§4] §4 (Method, routing subsection): the assertion that the routing reuses existing parameters without adding new ones for the knowledge space or conditioning is central to the 'no increase in model size' claim, but the manuscript provides no explicit parameter-count comparison table or derivation showing how the Top-K semantic embeddings and knowledge relations are implemented inside the fixed-size network.

    Authors: We appreciate this observation. The routing mechanism reuses the fixed VLA backbone by computing knowledge relations and Top-K semantic embeddings from the jointly learned knowledge space, which is integrated via existing conditioning pathways without expanding parameter count. In the revised manuscript, we will add an explicit parameter-count comparison table (base VLA vs. T-Stellar vs. TS-Stellar) and a short derivation in §4 showing that the embeddings and relations are generated using the model's existing layers and do not introduce new parameters. revision: yes
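
The kind of parameter-count comparison the referee asks for can be produced mechanically from a model's parameter shapes. The sketch below is only an illustration of that bookkeeping: the layer names and shapes are placeholders, not the architecture of the base VLA, T-Stellar, or TS-Stellar, and `parameter_table` is a hypothetical helper.

```python
import numpy as np

def count_parameters(shapes):
    """Total number of scalar parameters given a name -> shape mapping."""
    return sum(int(np.prod(shape)) for shape in shapes.values())

def parameter_table(models):
    """Rows of (name, total, difference from the first model listed)."""
    names = list(models)
    base_total = count_parameters(models[names[0]])
    rows = []
    for name in names:
        total = count_parameters(models[name])
        rows.append((name, total, total - base_total))
    return rows

# Placeholder shapes for illustration only.
models = {
    "base": {"encoder.w": (64, 32), "head.w": (32, 7), "head.b": (7,)},
    "variant": {"encoder.w": (64, 32), "head.w": (32, 7), "head.b": (7,),
                "router.w": (32, 4)},
}
table = parameter_table(models)
for name, total, delta in table:
    print(f"{name:8s} {total:6d} {delta:+d}")
```

A nonzero difference in the last column flags any component, such as a router or a stored knowledge space, that adds parameters relative to the base model.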

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper describes Stellar VLA as jointly optimizing task representations and a learned knowledge space, with a knowledge-guided expert routing mechanism conditioned on knowledge relations and Top-K embeddings. These are presented as trainable components whose outputs are validated empirically on LIBERO (1% replay) and real-robot transfer, rather than defined to tautologically produce the reported specialization or forgetting resistance. No equations reduce the performance gains to fitted inputs by construction, and no load-bearing self-citation chain is invoked to force uniqueness. The central claims rest on experimental comparisons against VLA and CIL baselines, making the derivation self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The framework assumes that a jointly optimized knowledge space can be maintained without parameter growth and that semantic embeddings plus relation conditioning suffice for expert selection; these rest on standard neural-network assumptions plus domain-specific claims about task-skill separability.

free parameters (1)
  • Top-K value for semantic embeddings
    Chosen to select experts; value not specified in abstract but directly affects routing behavior.
axioms (1)
  • domain assumption: Task representations and knowledge space can be jointly optimized without interference or forgetting
    Invoked when describing self-evolving knowledge learning.

pith-pipeline@v0.9.0 · 5508 in / 1379 out tokens · 26342 ms · 2026-05-17T05:58:10.839445+00:00 · methodology

Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Retrieve-then-Steer: Online Success Memory for Test-Time Adaptation of Generative VLAs

    cs.RO 2026-05 unverdicted novelty 6.0

    Retrieve-then-steer stores successful observation-action segments in memory, retrieves relevant chunks, filters them, and uses an elite prior with confidence-adaptive guidance to steer a flow-matching action sampler f...

  2. Retrieve-then-Steer: Online Success Memory for Test-Time Adaptation of Generative VLAs

    cs.RO 2026-05 unverdicted novelty 6.0

    A retrieve-then-steer method stores successful robot actions in memory and uses them to steer a frozen VLA's flow-matching sampler for better test-time reliability without parameter updates.

  3. Escaping the Diversity Trap in Robotic Manipulation via Anchor-Centric Adaptation

    cs.RO 2026-05 unverdicted novelty 6.0

    Anchor-Centric Adaptation escapes the diversity trap by prioritizing repeated demonstrations at core anchors over broad coverage, yielding higher success rates under fixed data budgets in robotic manipulation.

  4. Towards Long-Lived Robots: Continual Learning VLA Models via Reinforcement Fine-Tuning

    cs.RO 2026-02 unverdicted novelty 6.0

    LifeLong-RFT applies chunking-level on-policy reinforcement learning with Quantized Action Consistency Reward, Continuous Trajectory Alignment Reward, and Format Compliance Reward to fine-tune VLA models, achieving a ...

Reference graph

Works this paper leans on

51 extracted references · 51 canonical work pages · cited by 3 Pith papers · 16 internal anchors

  1. [1]

    Mixtures of Dirichlet processes with applications to Bayesian nonparametric problems. The annals of statistics, pages 1152–1174, 1974

    Charles E Antoniak. Mixtures of Dirichlet processes with applications to Bayesian nonparametric problems. The annals of statistics, pages 1152–1174, 1974. 2

  2. [2]

    RT-H: Action Hierarchies Using Language

    Suneel Belkhale, Tianli Ding, Ted Xiao, Pierre Sermanet, Quon Vuong, Jonathan Tompson, Yevgen Chebotar, Debidatta Dwibedi, and Dorsa Sadigh. RT-H: Action hierarchies using language. arXiv preprint arXiv:2403.01823, 2024. 2

  3. [3]

    Meta-reinforcement learning in non-stationary and dynamic environments. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(3):3476–3491, 2022

    Zhenshan Bing, David Lerch, Kai Huang, and Alois Knoll. Meta-reinforcement learning in non-stationary and dynamic environments. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(3):3476–3491, 2022. 3

  4. [4]

    UniVLA: Learning to Act Anywhere with Task-centric Latent Actions

    Qingwen Bu, Yanting Yang, Jisong Cai, Shenyuan Gao, Guanghui Ren, Maoqing Yao, Ping Luo, and Hongyang Li. UniVLA: Learning to act anywhere with task-centric latent actions. arXiv preprint arXiv:2505.06111, 2025. 1, 2, 6

  5. [5]

    DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models

    Damai Dai, Chengqi Deng, Chenggang Zhao, RX Xu, Huazuo Gao, Deli Chen, Jiashi Li, Wangding Zeng, Xingkai Yu, Yu Wu, et al. DeepSeekMoE: Towards ultimate expert specialization in mixture-of-experts language models. arXiv preprint arXiv:2401.06066, 2024. 2

  6. [6]

    Don't forget, there is more than forgetting: new metrics for Continual Learning

    Natalia Díaz-Rodríguez, Vincenzo Lomonaco, David Filliat, and Davide Maltoni. Don’t forget, there is more than forgetting: new metrics for continual learning. arXiv preprint arXiv:1810.13166, 2018. 6

  7. [7]

    Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. Journal of Machine Learning Research, 23(120):1–39, 2022

    William Fedus, Barret Zoph, and Noam Shazeer. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. Journal of Machine Learning Research, 23(120):1–39, 2022. 2

  8. [8]

    Vla-os: Structuring and dissecting planning representations and paradigms in vision-language-action models. arXiv preprint arXiv:2506.17561, 2025

    Chongkai Gao, Zixuan Liu, Zhenghao Chi, Junshan Huang, Xin Fei, Yiwen Hou, Yuxuan Zhang, Yudi Lin, Zhirui Fang, Zeyu Jiang, et al. Vla-os: Structuring and dissecting planning representations and paradigms in vision-language-action models. arXiv preprint arXiv:2506.17561, 2025. 2

  9. [9]

    Thought cloning: Learning to think while acting by imitating human thinking. Advances in Neural Information Processing Systems, 36:44451–44469,

    Shengran Hu and Jeff Clune. Thought cloning: Learning to think while acting by imitating human thinking. Advances in Neural Information Processing Systems, 36:44451–44469,

  10. [10]

    Memoized online variational inference for Dirichlet process mixture models. Advances in neural information processing systems, 26,

    Michael C Hughes and Erik Sudderth. Memoized online variational inference for Dirichlet process mixture models. Advances in neural information processing systems, 26,

  11. [11]

    Mixtral of Experts

    Albert Q Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, et al. Mixtral of experts. arXiv preprint arXiv:2401.04088, 2024. 2

  12. [12]

    Efficient planning in a compact latent action space

    Zhengyao Jiang, Tianjun Zhang, Michael Janner, Yueying Li, Tim Rocktäschel, Edward Grefenstette, and Yuandong Tian. Efficient planning in a compact latent action space. arXiv preprint arXiv:2208.10291, 2022. 2

  13. [13]

    H-gap: Humanoid control with a generalist planner. arXiv preprint arXiv:2312.02682,

    Zhengyao Jiang, Yingchen Xu, Nolan Wagener, Yicheng Luo, Michael Janner, Edward Grefenstette, Tim Rocktäschel, and Yuandong Tian. H-gap: Humanoid control with a generalist planner. arXiv preprint arXiv:2312.02682,

  14. [14]

    Vision-language-action models for robotics: A review towards real-world applications. IEEE Access, 2025

    Kento Kawaharazuka, Jihoon Oh, Jun Yamada, Ingmar Posner, and Yuke Zhu. Vision-language-action models for robotics: A review towards real-world applications. IEEE Access, 2025. 1

  15. [15]

    OpenVLA: An Open-Source Vision-Language-Action Model

    Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, et al. OpenVLA: An open-source vision-language-action model. arXiv preprint arXiv:2406.09246, 2024. 1

  16. [16]

    Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success

    Moo Jin Kim, Chelsea Finn, and Percy Liang. Fine-tuning vision-language-action models: Optimizing speed and success. arXiv preprint arXiv:2502.19645, 2025. 1

  17. [17]

    Auto-Encoding Variational Bayes

    Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013. 4

  18. [18]

    Incremental learning of retrievable skills for efficient continual task adaptation. Advances in Neural Information Processing Systems, 37:17286–17312, 2024

    Daehee Lee, Minjong Yoo, Woo Kyung Kim, Wonje Choi, and Honguk Woo. Incremental learning of retrievable skills for efficient continual task adaptation. Advances in Neural Information Processing Systems, 37:17286–17312, 2024. 2, 3, 6

  19. [19]

    Behavior generation with latent actions

    Seungjae Lee, Yibin Wang, Haritheja Etukuru, H Jin Kim, Nur Muhammad Mahi Shafiullah, and Lerrel Pinto. Behavior generation with latent actions. arXiv preprint arXiv:2403.03181, 2024. 2, 3

  20. [20]

    Metavla: Unified meta co-training for efficient embodied adaption. arXiv preprint arXiv:2510.05580, 2025

    Chen Li, Zhantao Yang, Han Zhang, Fangyi Chen, Chenchen Zhu, Anudeepsekhar Bolimera, and Marios Savvides. Metavla: Unified meta co-training for efficient embodied adaption. arXiv preprint arXiv:2510.05580, 2025. 1

  21. [21]

    Learn to grow: A continual structure learning framework for overcoming catastrophic forgetting

    Xilai Li, Yingbo Zhou, Tianfu Wu, Richard Socher, and Caiming Xiong. Learn to grow: A continual structure learning framework for overcoming catastrophic forgetting. In International conference on machine learning, pages 3925–

  22. [22]

    MoE-LLaVA: Mixture of Experts for Large Vision-Language Models

    Bin Lin, Zhenyu Tang, Yang Ye, Jiaxi Cui, Bin Zhu, Peng Jin, Jinfa Huang, Junwu Zhang, Yatian Pang, Munan Ning, et al. MoE-LLaVA: Mixture of experts for large vision-language models. arXiv preprint arXiv:2401.15947, 2024. 2

  23. [23]

    LIBERO: Benchmarking knowledge transfer for lifelong robot learning. Advances in Neural Information Processing Systems, 36:44776–44791, 2023

    Bo Liu, Yifeng Zhu, Chongkai Gao, Yihao Feng, Qiang Liu, Yuke Zhu, and Peter Stone. LIBERO: Benchmarking knowledge transfer for lifelong robot learning. Advances in Neural Information Processing Systems, 36:44776–44791, 2023. 5, 6

  24. [24]

    Tail: Task-specific adapters for imitation learning with large pretrained models,

    Zuxin Liu, Jesse Zhang, Kavosh Asadi, Yao Liu, Ding Zhao, Shoham Sabach, and Rasool Fakoor. Tail: Task-specific adapters for imitation learning with large pretrained models. arXiv preprint arXiv:2310.05905, 2023. 2

  25. [25]

    Packnet: Adding multiple tasks to a single network by iterative pruning

    Arun Mallya and Svetlana Lazebnik. Packnet: Adding multiple tasks to a single network by iterative pruning. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pages 7765–7773, 2018. 2

  26. [26]

    Preserving and combining knowledge in robotic lifelong reinforcement learning. Nature Machine Intelligence, pages 1–14, 2025

    Yuan Meng, Zhenshan Bing, Xiangtong Yao, Kejia Chen, Kai Huang, Yang Gao, Fuchun Sun, and Alois Knoll. Preserving and combining knowledge in robotic lifelong reinforcement learning. Nature Machine Intelligence, pages 1–14, 2025. 2, 3

  27. [27]

    Quest: Self-supervised skill abstractions for learning continuous control. Advances in Neural Information Processing Systems, 37:4062–4089, 2024

    Atharva Mete, Haotian Xue, Albert Wilcox, Yongxin Chen, and Animesh Garg. Quest: Self-supervised skill abstractions for learning continuous control. Advances in Neural Information Processing Systems, 37:4062–4089, 2024. 2, 3

  28. [28]

    FiLM: Visual reasoning with a general conditioning layer

    Ethan Perez, Florian Strub, Harm De Vries, Vincent Dumoulin, and Aaron Courville. FiLM: Visual reasoning with a general conditioning layer. In Proceedings of the AAAI conference on artificial intelligence, 2018. 3

  29. [29]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021. 3

  30. [30]

    Continual unsupervised representation learning. Advances in neural information processing systems, 32, 2019

    Dushyant Rao, Francesco Visin, Andrei Rusu, Razvan Pascanu, Yee Whye Teh, and Raia Hadsell. Continual unsupervised representation learning. Advances in neural information processing systems, 32, 2019. 2

  31. [31]

    Efficient diffusion transformer policies with mixture of expert denoisers for multitask learning. arXiv preprint arXiv:2412.12953, 2024

    Moritz Reuss, Jyothish Pari, Pulkit Agrawal, and Rudolf Lioutikov. Efficient diffusion transformer policies with mixture of expert denoisers for multitask learning. arXiv preprint arXiv:2412.12953, 2024. 1, 2, 5, 6, 7

  32. [32]

    Progressive Neural Networks

    Andrei A Rusu, Neil C Rabinowitz, Guillaume Desjardins, Hubert Soyer, James Kirkpatrick, Koray Kavukcuoglu, Razvan Pascanu, and Raia Hadsell. Progressive neural networks. arXiv preprint arXiv:1606.04671, 2016. 2

  33. [33]

    A constructive definition of Dirichlet priors. Statistica sinica, pages 639–650, 1994

    Jayaram Sethuraman. A constructive definition of Dirichlet priors. Statistica sinica, pages 639–650, 1994. 3

  34. [34]

    Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer

    Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538, 2017. 2

  35. [35]

    Expertise need not monopolize: Action-specialized mixture of experts for vision-language-action learning. arXiv preprint arXiv:2510.14300, 2025

    Weijie Shen, Yitian Liu, Yuhao Wu, Zhixuan Liang, Sijia Gu, Dehui Wang, Tian Nian, Lei Xu, Yusen Qin, Jiangmiao Pang, et al. Expertise need not monopolize: Action-specialized mixture of experts for vision-language-action learning. arXiv preprint arXiv:2510.14300, 2025. 2

  36. [36]

    Kimi-VL Technical Report

    Kimi Team, Angang Du, Bohong Yin, Bowei Xing, Bowen Qu, Bowen Wang, Cheng Chen, Chenlin Zhang, Chenzhuang Du, Chu Wei, et al. Kimi-VL technical report. arXiv preprint arXiv:2504.07491, 2025. 2

  37. [37]

    Hierarchical Dirichlet processes. Journal of the american statistical association, 101(476):1566–1581,

    Yee Whye Teh, Michael I Jordan, Matthew J Beal, and David M Blei. Hierarchical Dirichlet processes. Journal of the american statistical association, 101(476):1566–1581,

  38. [38]

    Lifelong robot learning. Robotics and autonomous systems, 15(1-2):25–46,

    Sebastian Thrun and Tom M Mitchell. Lifelong robot learning. Robotics and autonomous systems, 15(1-2):25–46,

  39. [39]

    Lotus: Continual imitation learning for robot manipulation through unsupervised skill discovery

    Weikang Wan, Yifeng Zhu, Rutav Shah, and Yuke Zhu. Lotus: Continual imitation learning for robot manipulation through unsupervised skill discovery. In 2024 IEEE International Conference on Robotics and Automation (ICRA), pages 537–544. IEEE, 2024. 2, 3, 6

  40. [40]

    Sparse diffusion policy: A sparse, reusable, and flexible policy for robot learning

    Yixiao Wang, Yifei Zhang, Mingxiao Huo, Ran Tian, Xiang Zhang, Yichen Xie, Chenfeng Xu, Pengliang Ji, Wei Zhan, Mingyu Ding, et al. Sparse diffusion policy: A sparse, reusable, and flexible policy for robot learning. arXiv preprint arXiv:2407.01531, 2024. 2

  41. [41]

    Vq-vla: Improving vision-language-action models via scaling vector-quantized action tokenizers. arXiv preprint arXiv:2507.01016,

    Yating Wang, Haoyi Zhu, Mingyu Liu, Jiange Yang, Hao-Shu Fang, and Tong He. Vq-vla: Improving vision-language-action models via scaling vector-quantized action tokenizers. arXiv preprint arXiv:2507.01016, 2025. 2, 3

  42. [42]

    Omnijarvis: Unified vision-language-action tokenization enables open-world instruction following agents. Advances in Neural Information Processing Systems, 37:73278–73308, 2024

    Zihao Wang, Shaofei Cai, Zhancun Mu, Haowei Lin, Ceyao Zhang, Xuejie Liu, Qing Li, Anji Liu, Xiaojian Shawn Ma, and Yitao Liang. Omnijarvis: Unified vision-language-action tokenization enables open-world instruction following agents. Advances in Neural Information Processing Systems, 37:73278–73308, 2024. 2

  43. [43]

    Unleashing Large-Scale Video Generative Pre-training for Visual Robot Manipulation

    Hongtao Wu, Ya Jing, Chilam Cheang, Guangzeng Chen, Jiafeng Xu, Xinghang Li, Minghuan Liu, Hang Li, and Tao Kong. Unleashing large-scale video generative pre-training for visual robot manipulation. arXiv preprint arXiv:2312.13139, 2023. 2

  44. [44]

    DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding

    Zhiyu Wu, Xiaokang Chen, Zizheng Pan, Xingchao Liu, Wen Liu, Damai Dai, Huazuo Gao, Yiyang Ma, Chengyue Wu, Bingxuan Wang, et al. DeepSeek-VL2: Mixture-of-experts vision-language models for advanced multimodal understanding. arXiv preprint arXiv:2412.10302, 2024. 2

  45. [45]

    Bridging perception and action: Spatially-grounded mid-level representations for robot generalization. arXiv preprint arXiv:2506.06196,

    Jonathan Yang, Chuyuan Kelly Fu, Dhruv Shah, Dorsa Sadigh, Fei Xia, and Tingnan Zhang. Bridging perception and action: Spatially-grounded mid-level representations for robot generalization. arXiv preprint arXiv:2506.06196,

  46. [46]

    DriveMoE: Mixture-of-Experts for Vision-Language-Action Model in End-to-End Autonomous Driving

    Zhenjie Yang, Yilin Chai, Xiaosong Jia, Qifeng Li, Yuqian Shao, Xuekai Zhu, Haisheng Su, and Junchi Yan. DriveMoE: Mixture-of-experts for vision-language-action model in end-to-end autonomous driving. arXiv preprint arXiv:2505.16278, 2025. 2

  47. [47]

    Forcevla: Enhancing vla models with a force-aware moe for contact-rich manipulation. arXiv preprint arXiv:2505.22159, 2025

    Jiawen Yu, Hairuo Liu, Qiaojun Yu, Jieji Ren, Ce Hao, Haitong Ding, Guangyu Huang, Guofan Huang, Yan Song, Panpan Cai, et al. Forcevla: Enhancing vla models with a force-aware moe for contact-rich manipulation. arXiv preprint arXiv:2505.22159, 2025

  48. [48]

    Mole-vla: Dynamic layer-skipping vision language action model via mixture-of-layers for efficient robot manipulation

    Rongyu Zhang, Menghang Dong, Yuan Zhang, Liang Heng, Xiaowei Chi, Gaole Dai, Li Du, Yuan Du, and Shanghang Zhang. Mole-vla: Dynamic layer-skipping vision language action model via mixture-of-layers for efficient robot manipulation. arXiv preprint arXiv:2503.20384, 2025. 2

  49. [49]

    DreamVLA: A Vision-Language-Action Model Dreamed with Comprehensive World Knowledge

    Wenyao Zhang, Hongsi Liu, Zekun Qi, Yunnan Wang, Xinqiang Yu, Jiazhao Zhang, Runpei Dong, Jiawei He, Fan Lu, He Wang, et al. DreamVLA: A vision-language-action model dreamed with comprehensive world knowledge. arXiv preprint arXiv:2507.04447, 2025. 2

  50. [50]

    More: Unlocking scalability in reinforcement learning for quadruped vision-language-action models. arXiv preprint arXiv:2503.08007, 2025

    Han Zhao, Wenxuan Song, Donglin Wang, Xinyang Tong, Pengxiang Ding, Xuelian Cheng, and Zongyuan Ge. More: Unlocking scalability in reinforcement learning for quadruped vision-language-action models. arXiv preprint arXiv:2503.08007, 2025. 2

  51. [51]

    Prise: Learning temporal action abstractions as a sequence compression problem. CoRR,

    Ruijie Zheng, Ching-An Cheng, Hal Daumé III, Furong Huang, and Andrey Kolobov. Prise: Learning temporal action abstractions as a sequence compression problem. CoRR,