Continually Evolving Skill Knowledge in Vision Language Action Model

Brian Sheil; Guangming Wang; Hesheng Wang; Maoqing Yao; Yuxuan Wu; Zhiheng Yang

arXiv: 2511.18085 · v4 · submitted 2025-11-22 · cs.RO · cs.AI

Yuxuan Wu, Guangming Wang, Zhiheng Yang, Maoqing Yao, Brian Sheil, Hesheng Wang

Pith reviewed 2026-05-17 05:58 UTC · model grok-4.3

keywords: continual learning · vision-language-action · imitation learning · expert routing · knowledge evolution · LIBERO benchmark · dual-arm manipulation

The pith

Stellar VLA lets vision-language-action models acquire new skills by evolving a shared knowledge space and routing tasks to experts without adding parameters or forgetting old ones.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to show that vision-language-action models can adapt continually to sequences of tasks by building and updating a learned knowledge space while using a routing system to direct each task to the right parts of the model. It does this through two model variants that jointly optimize task representations and the knowledge space, then apply a knowledge-guided expert routing step based on how tasks relate and on the top semantic matches. A sympathetic reader would care because large pretrained robot models are costly to retrain from scratch for every new skill, and most existing continual-learning fixes either grow the network or require heavy replay of past data. If the approach holds, it would let robots pick up manipulation skills over time in changing scenes while keeping model size fixed and retaining earlier capabilities.

Core claim

Stellar VLA is a knowledge-driven continual imitation learning framework that enables self-evolving knowledge learning by jointly optimizing task representations and a learned knowledge space. It introduces a knowledge-guided expert routing mechanism conditioned on knowledge relation and Top-K semantic embeddings that supports task specialization without increasing model size. On the LIBERO benchmark the resulting models achieve strong performance among VLA and CIL baselines while using only 1% data replay, and real-world dual-arm experiments confirm effective knowledge transfer across distinct embodiments and scenes.

What carries the argument

The knowledge-guided expert routing mechanism that conditions routing decisions on knowledge relations and Top-K semantic embeddings to assign tasks to specialized experts while preserving prior knowledge.
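
To make the routing idea concrete, here is a minimal Python sketch. It is not the paper's implementation: the function names (`route`, `softmax`, `dot`), the dot-product relation score, and the way the top-K knowledge entries are mixed into a prior are all assumptions chosen for illustration. What it shows is that the set of experts stays fixed while only the gating weights depend on the knowledge space.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def route(task_vec, knowledge_centers, expert_keys, top_k=2):
    """Hypothetical knowledge-guided routing (illustrative only).

    task_vec: task representation (list of floats)
    knowledge_centers: one vector per entry of the knowledge space
    expert_keys: one vector per expert, same dimension as the centers
    Returns {expert_index: weight} for the selected experts.
    """
    # Assumed relation score: softmax of dot products with each knowledge entry.
    relation = softmax([dot(task_vec, c) for c in knowledge_centers])
    # Keep the top-K most related entries and mix them into a prior vector.
    top = sorted(range(len(relation)), key=lambda i: relation[i], reverse=True)[:top_k]
    norm = sum(relation[i] for i in top)
    prior = [
        sum(relation[i] / norm * knowledge_centers[i][d] for i in top)
        for d in range(len(task_vec))
    ]
    # Score the fixed set of experts against the prior; keep the top-K experts.
    gate = softmax([dot(prior, k) for k in expert_keys])
    chosen = sorted(range(len(gate)), key=lambda e: gate[e], reverse=True)[:top_k]
    z = sum(gate[e] for e in chosen)
    return {e: gate[e] / z for e in chosen}


# Toy usage with made-up 2-D vectors.
centers = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
experts = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
weights = route([2.0, 0.1], centers, experts, top_k=2)
print(weights)
```

In this toy setup no expert is ever added; a new task only changes which existing experts receive weight, which is the property the fixed-model-size claim depends on.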

If this is right

Stellar VLAs match or exceed both VLA and CIL baselines on the LIBERO benchmark while replaying only 1% of prior data.
Knowledge transfer remains effective when the same models are deployed on real dual-arm hardware with new embodiments and scenes.
The hierarchical TS-Stellar variant shows particular strength on tasks that require composing skills in stages.
Visualizations of the learned knowledge space indicate both retention of earlier tasks and discovery of structure among new ones.
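
The 1% replay setting in the first point can be pictured with a short sketch. Whether the 1% is taken per task or over the pooled data is not stated here, so the per-task sampling, the function names, and the keep-at-least-one rule below are assumptions, and the integer "demonstrations" are placeholders.

```python
import random


def build_replay_buffer(past_task_data, fraction=0.01, seed=0):
    """Keep a small random subset of each earlier task's demonstrations."""
    rng = random.Random(seed)
    buffer = []
    for task_id, demos in past_task_data.items():
        # Assumption: keep at least one demonstration per earlier task.
        keep = max(1, round(len(demos) * fraction))
        buffer.extend((task_id, d) for d in rng.sample(demos, keep))
    return buffer


def training_set(new_task_id, new_demos, replay_buffer):
    """Combine all new-task data with the replayed subset of old tasks."""
    return [(new_task_id, d) for d in new_demos] + list(replay_buffer)


# Placeholder data: integers stand in for demonstrations.
past = {"task_a": list(range(200)), "task_b": list(range(300))}
buffer = build_replay_buffer(past)
data = training_set("task_c", list(range(50)), buffer)
print(len(buffer), len(data))
```

With 200 and 300 stored demonstrations, the buffer holds 2 and 3 of them respectively, so the new task dominates each training round while older tasks are still represented.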

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same routing idea could be tested on longer task sequences to see how far the fixed-size model can scale before interference appears.
Because the knowledge space is learned jointly with tasks, it might transfer to other continual-learning settings such as language-only or vision-only models.
Pairing this routing with stronger initial VLA pretraining could further lower the amount of replay needed for acceptable performance.
Real-world tests on additional robot platforms would show whether the dual-arm results generalize to single-arm or mobile-manipulator settings.

Load-bearing premise

The assumption that routing decisions based on knowledge relations and top semantic embeddings can reliably produce task specialization without any parameter growth or loss of earlier skills.

What would settle it

If sequential tasks on the LIBERO benchmark produce performance well below the reported VLA and CIL baselines when replay is limited to 1%, or if the real dual-arm experiments show clear forgetting of prior tasks across embodiment changes, the central claims would not hold.

Figures

Figures reproduced from arXiv: 2511.18085 by Brian Sheil, Guangming Wang, Hesheng Wang, Maoqing Yao, Yuxuan Wu, Zhiheng Yang.

**Figure 1.** We present Stellar VLA, a continual learning VLA framework driven by a self-evolving knowledge space. Its variants T-Stellar and TS-Stellar demonstrate superior final average success rates over UniVLA [4] and MoDE [31].

**Figure 2.** Overall architecture of Stellar VLA. CLIP [29] and FiLM [28]-conditioned ResNet encode language and visual inputs respectively. The task-centric representation z and knowledge space are jointly learned through knowledge update and latents aggregation, as detailed in Sec. 3.3. The learned knowledge prior finally guides the MoE action head for motion prediction, as detailed in Sec. 3.4.

**Figure 3.** Knowledge-prior-routed MoE action head. Two knowledge embeddings, relation and top-K semantic, are computed for expert routing, alongside language, noise, observation and noise action tokens fed into the denoising transformer.

**Figure 4.** t-SNE visualization of Stellar VLA. Latent representations after 1, 4, 6, 8, and 10 tasks on LIBERO-long are shown. Task names are abbreviated for clarity. T-Stellar models discrete task distributions, and TS-Stellar learns relevant skills across tasks.

**Figure 5.** Behavior visualization on “Pick up Bag” after training on “Handover Toy”. TS-Stellar achieves the most synchronized dual-arm motion; T-Stellar hesitates slightly; MoDE∗ and w/o KS show strong desynchronization and ultimately fail.

Abstract

Vision-language-action (VLA) models show promising knowledge accumulation ability from pretraining, yet continual learning in VLA remains challenging, especially for efficient adaptation. Existing continual imitation learning (CIL) methods often rely on additional parameters or external modules, limiting scalability for large VLA models. We propose Stellar VLA, a knowledge-driven CIL framework without increasing network parameters. Two progressively extended variants are designed: T-Stellar for flat task-centric modeling and TS-Stellar for hierarchical task-skill structure. Stellar VLA enables self-evolving knowledge learning by jointly optimizing task representations and a learned knowledge space. We propose a knowledge-guided expert routing mechanism conditioned on knowledge relation and Top-K semantic embeddings, enabling task specialization without increasing model size. Experiments on the LIBERO benchmark show that Stellar VLAs achieve strong performance among both VLA and CIL baselines, using only 1% data replay. Real-world evaluation on a dual-arm platform with distinct embodiment and scene configurations validates effective knowledge transfer. TS-Stellar excels in hierarchical manipulation, and visualizations reveal robust knowledge retention and task discovery. Project Website: https://stellarvla.github.io/

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes Stellar VLA, a knowledge-driven continual imitation learning (CIL) framework for Vision-Language-Action (VLA) models that achieves self-evolving skill knowledge without increasing network parameters. It introduces two variants—T-Stellar for flat task-centric modeling and TS-Stellar for hierarchical task-skill structures—along with a knowledge-guided expert routing mechanism conditioned on knowledge relations and Top-K semantic embeddings. On the LIBERO benchmark, Stellar VLAs report strong performance relative to VLA and CIL baselines using only 1% data replay; real-world dual-arm experiments validate knowledge transfer, with TS-Stellar excelling in hierarchical manipulation and visualizations showing knowledge retention and task discovery.

Significance. If the routing mechanism demonstrably enables parameter-free specialization and forgetting mitigation, the work offers a scalable path for continual adaptation in large VLAs, addressing a key limitation of existing CIL methods that rely on added parameters or external modules. The combination of benchmark results with real-robot transfer and hierarchical modeling would represent a meaningful advance for efficient lifelong robotic learning.

major comments (2)

[§5] §5 (Experiments): the claim that the knowledge-guided expert routing produces task specialization and prevents catastrophic forgetting without parameter growth is load-bearing for the LIBERO performance results, yet no ablation isolates the routing (conditioned on knowledge relation and Top-K embeddings) from standard continual baselines or from the 1% replay buffer; without this, the contribution of the proposed mechanism to the reported gains cannot be verified.
[§4] §4 (Method, routing subsection): the assertion that the routing reuses existing parameters without adding new ones for the knowledge space or conditioning is central to the 'no increase in model size' claim, but the manuscript provides no explicit parameter-count comparison table or derivation showing how the Top-K semantic embeddings and knowledge relations are implemented inside the fixed-size network.

minor comments (2)

[Abstract] Abstract and §5: quantitative metrics (e.g., success rates, exact baselines compared) and error bars or statistical significance are not reported in the summary of LIBERO results; adding these would improve verifiability of the 'strong performance' statement.
[§6] §6 (Real-world evaluation): the dual-arm platform description lacks detail on embodiment differences and scene configurations relative to simulation; a brief table comparing sim-to-real gaps would clarify the transfer claims.
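
For reference, the kind of quantitative summary requested in the first minor comment could be computed along these lines. This is a hedged sketch using one common formulation of final average success and forgetting; the function names are hypothetical, the success-matrix values are made up for illustration, and the paper's own metric definitions may differ.

```python
def final_average_success(R):
    """Mean success over all tasks after training on the last task."""
    last = R[-1]
    return sum(last) / len(last)


def average_forgetting(R):
    """Mean drop from each task's best earlier success rate to its final one.

    R[i][j] is the success rate on task j after training through task i.
    The last task is excluded because nothing was trained after it.
    """
    n = len(R)
    drops = []
    for j in range(n - 1):
        best_before = max(R[i][j] for i in range(j, n - 1))
        drops.append(best_before - R[-1][j])
    return sum(drops) / len(drops)


# Made-up success matrix for three sequential tasks (illustration only).
R = [
    [0.90, 0.00, 0.00],
    [0.80, 0.85, 0.00],
    [0.75, 0.80, 0.95],
]
fas = final_average_success(R)
forget = average_forgetting(R)
print(round(fas, 3), round(forget, 3))
```

Reporting both numbers per method, ideally averaged over several seeds with a spread, would let a reader separate overall performance from retention of earlier tasks.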

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the two major comments point by point below, agreeing that additional clarifications and experiments will strengthen the manuscript. We will incorporate the suggested changes in the revised version.

Referee: [§5] §5 (Experiments): the claim that the knowledge-guided expert routing produces task specialization and prevents catastrophic forgetting without parameter growth is load-bearing for the LIBERO performance results, yet no ablation isolates the routing (conditioned on knowledge relation and Top-K embeddings) from standard continual baselines or from the 1% replay buffer; without this, the contribution of the proposed mechanism to the reported gains cannot be verified.

Authors: We agree that an explicit ablation isolating the routing mechanism would strengthen the evidence. Our LIBERO results already compare Stellar VLA variants against both standard VLA models and existing CIL baselines (which lack the knowledge-guided routing), with performance gains observed under the 1% replay setting. To directly address the concern, the revised manuscript will include a new ablation study that disables the knowledge-guided expert routing and Top-K semantic embeddings while retaining the 1% replay buffer, reporting the resulting drop in task success rates and increased forgetting on LIBERO to quantify the routing's isolated contribution to specialization and continual performance. revision: yes
Referee: [§4] §4 (Method, routing subsection): the assertion that the routing reuses existing parameters without adding new ones for the knowledge space or conditioning is central to the 'no increase in model size' claim, but the manuscript provides no explicit parameter-count comparison table or derivation showing how the Top-K semantic embeddings and knowledge relations are implemented inside the fixed-size network.

Authors: We appreciate this observation. The routing mechanism reuses the fixed VLA backbone by computing knowledge relations and Top-K semantic embeddings from the jointly learned knowledge space, which is integrated via existing conditioning pathways without expanding parameter count. In the revised manuscript, we will add an explicit parameter-count comparison table (base VLA vs. T-Stellar vs. TS-Stellar) and a short derivation in §4 showing that the embeddings and relations are generated using the model's existing layers and do not introduce new parameters. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

The paper describes Stellar VLA as jointly optimizing task representations and a learned knowledge space, with a knowledge-guided expert routing mechanism conditioned on knowledge relations and Top-K embeddings. These are presented as trainable components whose outputs are validated empirically on LIBERO (1% replay) and real-robot transfer, rather than defined to tautologically produce the reported specialization or forgetting resistance. No equations reduce the performance gains to fitted inputs by construction, and no load-bearing self-citation chain is invoked to force uniqueness. The central claims rest on experimental comparisons against VLA and CIL baselines, making the derivation self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The framework assumes that a jointly optimized knowledge space can be maintained without parameter growth and that semantic embeddings plus relation conditioning suffice for expert selection; these rest on standard neural-network assumptions plus domain-specific claims about task-skill separability.

free parameters (1)

Top-K value for semantic embeddings
Chosen to select experts; value not specified in abstract but directly affects routing behavior.

axioms (1)

Domain assumption: task representations and the knowledge space can be jointly optimized without interference or forgetting.
Invoked when describing self-evolving knowledge learning.

Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Retrieve-then-Steer: Online Success Memory for Test-Time Adaptation of Generative VLAs
cs.RO · 2026-05 · unverdicted · novelty 6.0

Retrieve-then-steer stores successful observation-action segments in memory, retrieves relevant chunks, filters them, and uses an elite prior with confidence-adaptive guidance to steer a flow-matching action sampler f...

Retrieve-then-Steer: Online Success Memory for Test-Time Adaptation of Generative VLAs
cs.RO · 2026-05 · unverdicted · novelty 6.0

A retrieve-then-steer method stores successful robot actions in memory and uses them to steer a frozen VLA's flow-matching sampler for better test-time reliability without parameter updates.

Escaping the Diversity Trap in Robotic Manipulation via Anchor-Centric Adaptation
cs.RO · 2026-05 · unverdicted · novelty 6.0

Anchor-Centric Adaptation escapes the diversity trap by prioritizing repeated demonstrations at core anchors over broad coverage, yielding higher success rates under fixed data budgets in robotic manipulation.

Towards Long-Lived Robots: Continual Learning VLA Models via Reinforcement Fine-Tuning
cs.RO · 2026-02 · unverdicted · novelty 6.0

LifeLong-RFT applies chunking-level on-policy reinforcement learning with Quantized Action Consistency Reward, Continuous Trajectory Alignment Reward, and Format Compliance Reward to fine-tune VLA models, achieving a ...

Reference graph

Works this paper leans on

51 extracted references · 51 canonical work pages · cited by 3 Pith papers · 16 internal anchors

[1] Charles E Antoniak. Mixtures of Dirichlet processes with applications to Bayesian nonparametric problems. The Annals of Statistics, pages 1152–1174, 1974.
[2] Suneel Belkhale, Tianli Ding, Ted Xiao, Pierre Sermanet, Quon Vuong, Jonathan Tompson, Yevgen Chebotar, Debidatta Dwibedi, and Dorsa Sadigh. RT-H: Action hierarchies using language. arXiv preprint arXiv:2403.01823, 2024.
[3] Zhenshan Bing, David Lerch, Kai Huang, and Alois Knoll. Meta-reinforcement learning in non-stationary and dynamic environments. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(3):3476–3491, 2022.
[4] Qingwen Bu, Yanting Yang, Jisong Cai, Shenyuan Gao, Guanghui Ren, Maoqing Yao, Ping Luo, and Hongyang Li. UniVLA: Learning to act anywhere with task-centric latent actions. arXiv preprint arXiv:2505.06111, 2025.
[5] Damai Dai, Chengqi Deng, Chenggang Zhao, RX Xu, Huazuo Gao, Deli Chen, Jiashi Li, Wangding Zeng, Xingkai Yu, Yu Wu, et al. DeepSeekMoE: Towards ultimate expert specialization in mixture-of-experts language models. arXiv preprint arXiv:2401.06066, 2024.
[6] Natalia Díaz-Rodríguez, Vincenzo Lomonaco, David Filliat, and Davide Maltoni. Don't forget, there is more than forgetting: new metrics for continual learning. arXiv preprint arXiv:1810.13166, 2018.
[7] William Fedus, Barret Zoph, and Noam Shazeer. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. Journal of Machine Learning Research, 23(120):1–39, 2022.
[8] Chongkai Gao, Zixuan Liu, Zhenghao Chi, Junshan Huang, Xin Fei, Yiwen Hou, Yuxuan Zhang, Yudi Lin, Zhirui Fang, Zeyu Jiang, et al. Vla-os: Structuring and dissecting planning representations and paradigms in vision-language-action models. arXiv preprint arXiv:2506.17561, 2025.
[9] Shengran Hu and Jeff Clune. Thought cloning: Learning to think while acting by imitating human thinking. Advances in Neural Information Processing Systems, 36:44451–44469,
[10] Michael C Hughes and Erik Sudderth. Memoized online variational inference for Dirichlet process mixture models. Advances in Neural Information Processing Systems, 26,
[11] Albert Q Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, et al. Mixtral of experts. arXiv preprint arXiv:2401.04088, 2024.
[12] Zhengyao Jiang, Tianjun Zhang, Michael Janner, Yueying Li, Tim Rocktäschel, Edward Grefenstette, and Yuandong Tian. Efficient planning in a compact latent action space. arXiv preprint arXiv:2208.10291, 2022.
[13] Zhengyao Jiang, Yingchen Xu, Nolan Wagener, Yicheng Luo, Michael Janner, Edward Grefenstette, Tim Rocktäschel, and Yuandong Tian. H-gap: Humanoid control with a generalist planner. arXiv preprint arXiv:2312.02682,
[14] Kento Kawaharazuka, Jihoon Oh, Jun Yamada, Ingmar Posner, and Yuke Zhu. Vision-language-action models for robotics: A review towards real-world applications. IEEE Access, 2025.
[15] Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, et al. OpenVLA: An open-source vision-language-action model. arXiv preprint arXiv:2406.09246, 2024.
[16] Moo Jin Kim, Chelsea Finn, and Percy Liang. Fine-tuning vision-language-action models: Optimizing speed and success. arXiv preprint arXiv:2502.19645, 2025.
[17] Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.
[18] Daehee Lee, Minjong Yoo, Woo Kyung Kim, Wonje Choi, and Honguk Woo. Incremental learning of retrievable skills for efficient continual task adaptation. Advances in Neural Information Processing Systems, 37:17286–17312, 2024.
[19] Seungjae Lee, Yibin Wang, Haritheja Etukuru, H Jin Kim, Nur Muhammad Mahi Shafiullah, and Lerrel Pinto. Behavior generation with latent actions. arXiv preprint arXiv:2403.03181, 2024.
[20] Chen Li, Zhantao Yang, Han Zhang, Fangyi Chen, Chenchen Zhu, Anudeepsekhar Bolimera, and Marios Savvides. Metavla: Unified meta co-training for efficient embodied adaption. arXiv preprint arXiv:2510.05580, 2025.
[21] Xilai Li, Yingbo Zhou, Tianfu Wu, Richard Socher, and Caiming Xiong. Learn to grow: A continual structure learning framework for overcoming catastrophic forgetting. In International Conference on Machine Learning, pages 3925–
[22] Bin Lin, Zhenyu Tang, Yang Ye, Jiaxi Cui, Bin Zhu, Peng Jin, Jinfa Huang, Junwu Zhang, Yatian Pang, Munan Ning, et al. MoE-LLaVA: Mixture of experts for large vision-language models. arXiv preprint arXiv:2401.15947, 2024.
[23] Bo Liu, Yifeng Zhu, Chongkai Gao, Yihao Feng, Qiang Liu, Yuke Zhu, and Peter Stone. LIBERO: Benchmarking knowledge transfer for lifelong robot learning. Advances in Neural Information Processing Systems, 36:44776–44791, 2023.
[24] Zuxin Liu, Jesse Zhang, Kavosh Asadi, Yao Liu, Ding Zhao, Shoham Sabach, and Rasool Fakoor. Tail: Task-specific adapters for imitation learning with large pretrained models. arXiv preprint arXiv:2310.05905, 2023.
[25] Arun Mallya and Svetlana Lazebnik. Packnet: Adding multiple tasks to a single network by iterative pruning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7765–7773, 2018.
[26] Yuan Meng, Zhenshan Bing, Xiangtong Yao, Kejia Chen, Kai Huang, Yang Gao, Fuchun Sun, and Alois Knoll. Preserving and combining knowledge in robotic lifelong reinforcement learning. Nature Machine Intelligence, pages 1–14, 2025.
[27] Atharva Mete, Haotian Xue, Albert Wilcox, Yongxin Chen, and Animesh Garg. Quest: Self-supervised skill abstractions for learning continuous control. Advances in Neural Information Processing Systems, 37:4062–4089, 2024.
[28] Ethan Perez, Florian Strub, Harm De Vries, Vincent Dumoulin, and Aaron Courville. FiLM: Visual reasoning with a general conditioning layer. In Proceedings of the AAAI Conference on Artificial Intelligence, 2018.
[29] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.
[30] Dushyant Rao, Francesco Visin, Andrei Rusu, Razvan Pascanu, Yee Whye Teh, and Raia Hadsell. Continual unsupervised representation learning. Advances in Neural Information Processing Systems, 32, 2019.
[31] Moritz Reuss, Jyothish Pari, Pulkit Agrawal, and Rudolf Lioutikov. Efficient diffusion transformer policies with mixture of expert denoisers for multitask learning. arXiv preprint arXiv:2412.12953, 2024.
[32] Andrei A Rusu, Neil C Rabinowitz, Guillaume Desjardins, Hubert Soyer, James Kirkpatrick, Koray Kavukcuoglu, Razvan Pascanu, and Raia Hadsell. Progressive neural networks. arXiv preprint arXiv:1606.04671, 2016.
[33] Jayaram Sethuraman. A constructive definition of Dirichlet priors. Statistica Sinica, pages 639–650, 1994.
[34] Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538, 2017.
[35] Weijie Shen, Yitian Liu, Yuhao Wu, Zhixuan Liang, Sijia Gu, Dehui Wang, Tian Nian, Lei Xu, Yusen Qin, Jiangmiao Pang, et al. Expertise need not monopolize: Action-specialized mixture of experts for vision-language-action learning. arXiv preprint arXiv:2510.14300, 2025.
[36] Kimi Team, Angang Du, Bohong Yin, Bowei Xing, Bowen Qu, Bowen Wang, Cheng Chen, Chenlin Zhang, Chenzhuang Du, Chu Wei, et al. Kimi-VL technical report. arXiv preprint arXiv:2504.07491, 2025.
[37] Yee Whye Teh, Michael I Jordan, Matthew J Beal, and David M Blei. Hierarchical Dirichlet processes. Journal of the American Statistical Association, 101(476):1566–1581,
[38] Sebastian Thrun and Tom M Mitchell. Lifelong robot learning. Robotics and Autonomous Systems, 15(1-2):25–46,
[39] Weikang Wan, Yifeng Zhu, Rutav Shah, and Yuke Zhu. Lotus: Continual imitation learning for robot manipulation through unsupervised skill discovery. In 2024 IEEE International Conference on Robotics and Automation (ICRA), pages 537–544. IEEE, 2024.
[40] Yixiao Wang, Yifei Zhang, Mingxiao Huo, Ran Tian, Xiang Zhang, Yichen Xie, Chenfeng Xu, Pengliang Ji, Wei Zhan, Mingyu Ding, et al. Sparse diffusion policy: A sparse, reusable, and flexible policy for robot learning. arXiv preprint arXiv:2407.01531, 2024.
[41] Yating Wang, Haoyi Zhu, Mingyu Liu, Jiange Yang, Hao-Shu Fang, and Tong He. Vq-vla: Improving vision-language-action models via scaling vector-quantized action tokenizers. arXiv preprint arXiv:2507.01016, 2025.
[42] Zihao Wang, Shaofei Cai, Zhancun Mu, Haowei Lin, Ceyao Zhang, Xuejie Liu, Qing Li, Anji Liu, Xiaojian Shawn Ma, and Yitao Liang. Omnijarvis: Unified vision-language-action tokenization enables open-world instruction following agents. Advances in Neural Information Processing Systems, 37:73278–73308, 2024.
[43] Hongtao Wu, Ya Jing, Chilam Cheang, Guangzeng Chen, Jiafeng Xu, Xinghang Li, Minghuan Liu, Hang Li, and Tao Kong. Unleashing large-scale video generative pre-training for visual robot manipulation. arXiv preprint arXiv:2312.13139, 2023.
[44] Zhiyu Wu, Xiaokang Chen, Zizheng Pan, Xingchao Liu, Wen Liu, Damai Dai, Huazuo Gao, Yiyang Ma, Chengyue Wu, Bingxuan Wang, et al. DeepSeek-VL2: Mixture-of-experts vision-language models for advanced multimodal understanding. arXiv preprint arXiv:2412.10302, 2024.
[45]

Bridging perception and action: Spatially-grounded mid-level representations for robot generalization.arXiv preprint arXiv:2506.06196,

Jonathan Yang, Chuyuan Kelly Fu, Dhruv Shah, Dorsa Sadigh, Fei Xia, and Tingnan Zhang. Bridging perception and action: Spatially-grounded mid-level representations for robot generalization.arXiv preprint arXiv:2506.06196,

work page arXiv
[46]

DriveMoE: Mixture-of-Experts for Vision-Language-Action Model in End-to-End Autonomous Driving

Zhenjie Yang, Yilin Chai, Xiaosong Jia, Qifeng Li, Yuqian Shao, Xuekai Zhu, Haisheng Su, and Junchi Yan. Drivemoe: Mixture-of-experts for vision-language-action model in end-to-end autonomous driving.arXiv preprint arXiv:2505.16278, 2025. 2

work page internal anchor Pith review arXiv 2025
[47]

Forcevla: Enhancing vla models with a force-aware moe for contact-rich manipulation.arXiv preprint arXiv:2505.22159, 2025

Jiawen Yu, Hairuo Liu, Qiaojun Yu, Jieji Ren, Ce Hao, Haitong Ding, Guangyu Huang, Guofan Huang, Yan Song, Panpan Cai, et al. Forcevla: Enhancing vla models with a force-aware moe for contact-rich manipulation.arXiv preprint arXiv:2505.22159, 2025

work page arXiv 2025
[48]

arXiv preprint arXiv:2503.20384 (2025)

Rongyu Zhang, Menghang Dong, Yuan Zhang, Liang Heng, Xiaowei Chi, Gaole Dai, Li Du, Yuan Du, and Shanghang Zhang. Mole-vla: Dynamic layer-skipping vision language action model via mixture-of-layers for efficient robot manip- ulation.arXiv preprint arXiv:2503.20384, 2025. 2

work page arXiv 2025
[49]

DreamVLA: A Vision-Language-Action Model Dreamed with Comprehensive World Knowledge

Wenyao Zhang, Hongsi Liu, Zekun Qi, Yunnan Wang, Xin- qiang Yu, Jiazhao Zhang, Runpei Dong, Jiawei He, Fan Lu, He Wang, et al. Dreamvla: a vision-language-action model dreamed with comprehensive world knowledge.arXiv preprint arXiv:2507.04447, 2025. 2

work page internal anchor Pith review Pith/arXiv arXiv 2025
[50]

More: Unlocking scalability in reinforcement learning for quadruped vision-language-action models.arXiv preprint arXiv:2503.08007, 2025

Han Zhao, Wenxuan Song, Donglin Wang, Xinyang Tong, Pengxiang Ding, Xuelian Cheng, and Zongyuan Ge. More: Unlocking scalability in reinforcement learning for quadruped vision-language-action models.arXiv preprint arXiv:2503.08007, 2025. 2

work page arXiv 2025
[51]

Prise: Learning temporal ac- tion abstractions as a sequence compression problem.CoRR,

Ruijie Zheng, Ching-An Cheng, Hal Daum ´e III, Furong Huang, and Andrey Kolobov. Prise: Learning temporal ac- tion abstractions as a sequence compression problem.CoRR,

work page

[1] Charles E Antoniak. Mixtures of Dirichlet processes with applications to Bayesian nonparametric problems. The annals of statistics, pages 1152–1174, 1974. 2

[2] Suneel Belkhale, Tianli Ding, Ted Xiao, Pierre Sermanet, Quon Vuong, Jonathan Tompson, Yevgen Chebotar, Debidatta Dwibedi, and Dorsa Sadigh. RT-H: Action hierarchies using language. arXiv preprint arXiv:2403.01823, 2024. 2

[3] Zhenshan Bing, David Lerch, Kai Huang, and Alois Knoll. Meta-reinforcement learning in non-stationary and dynamic environments. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(3):3476–3491, 2022. 3

[4] Qingwen Bu, Yanting Yang, Jisong Cai, Shenyuan Gao, Guanghui Ren, Maoqing Yao, Ping Luo, and Hongyang Li. UniVLA: Learning to act anywhere with task-centric latent actions. arXiv preprint arXiv:2505.06111, 2025. 1, 2, 6

[5] Damai Dai, Chengqi Deng, Chenggang Zhao, RX Xu, Huazuo Gao, Deli Chen, Jiashi Li, Wangding Zeng, Xingkai Yu, Yu Wu, et al. DeepSeekMoE: Towards ultimate expert specialization in mixture-of-experts language models. arXiv preprint arXiv:2401.06066, 2024. 2

[6] Natalia Díaz-Rodríguez, Vincenzo Lomonaco, David Filliat, and Davide Maltoni. Don’t forget, there is more than forgetting: new metrics for continual learning. arXiv preprint arXiv:1810.13166, 2018. 6

[7] William Fedus, Barret Zoph, and Noam Shazeer. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. Journal of Machine Learning Research, 23(120):1–39, 2022. 2

[8] Chongkai Gao, Zixuan Liu, Zhenghao Chi, Junshan Huang, Xin Fei, Yiwen Hou, Yuxuan Zhang, Yudi Lin, Zhirui Fang, Zeyu Jiang, et al. Vla-os: Structuring and dissecting planning representations and paradigms in vision-language-action models. arXiv preprint arXiv:2506.17561, 2025. 2

[9] Shengran Hu and Jeff Clune. Thought cloning: Learning to think while acting by imitating human thinking. Advances in Neural Information Processing Systems, 36:44451–44469,

[10] Michael C Hughes and Erik Sudderth. Memoized online variational inference for Dirichlet process mixture models. Advances in neural information processing systems, 26,

[11] Albert Q Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, et al. Mixtral of experts. arXiv preprint arXiv:2401.04088, 2024. 2

[12] Zhengyao Jiang, Tianjun Zhang, Michael Janner, Yueying Li, Tim Rocktäschel, Edward Grefenstette, and Yuandong Tian. Efficient planning in a compact latent action space. arXiv preprint arXiv:2208.10291, 2022. 2

[13] Zhengyao Jiang, Yingchen Xu, Nolan Wagener, Yicheng Luo, Michael Janner, Edward Grefenstette, Tim Rocktäschel, and Yuandong Tian. H-gap: Humanoid control with a generalist planner. arXiv preprint arXiv:2312.02682,

[14] Kento Kawaharazuka, Jihoon Oh, Jun Yamada, Ingmar Posner, and Yuke Zhu. Vision-language-action models for robotics: A review towards real-world applications. IEEE Access, 2025. 1

[15] Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, et al. OpenVLA: An open-source vision-language-action model. arXiv preprint arXiv:2406.09246, 2024. 1

[16] Moo Jin Kim, Chelsea Finn, and Percy Liang. Fine-tuning vision-language-action models: Optimizing speed and success. arXiv preprint arXiv:2502.19645, 2025. 1

[17] Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013. 4

[18] Daehee Lee, Minjong Yoo, Woo Kyung Kim, Wonje Choi, and Honguk Woo. Incremental learning of retrievable skills for efficient continual task adaptation. Advances in Neural Information Processing Systems, 37:17286–17312, 2024. 2, 3, 6

[19] Seungjae Lee, Yibin Wang, Haritheja Etukuru, H Jin Kim, Nur Muhammad Mahi Shafiullah, and Lerrel Pinto. Behavior generation with latent actions. arXiv preprint arXiv:2403.03181, 2024. 2, 3

[20] Chen Li, Zhantao Yang, Han Zhang, Fangyi Chen, Chenchen Zhu, Anudeepsekhar Bolimera, and Marios Savvides. Metavla: Unified meta co-training for efficient embodied adaption. arXiv preprint arXiv:2510.05580, 2025. 1

[21] Xilai Li, Yingbo Zhou, Tianfu Wu, Richard Socher, and Caiming Xiong. Learn to grow: A continual structure learning framework for overcoming catastrophic forgetting. In International conference on machine learning, pages 3925–

[22] Bin Lin, Zhenyu Tang, Yang Ye, Jiaxi Cui, Bin Zhu, Peng Jin, Jinfa Huang, Junwu Zhang, Yatian Pang, Munan Ning, et al. MoE-LLaVA: Mixture of experts for large vision-language models. arXiv preprint arXiv:2401.15947, 2024. 2

[23] Bo Liu, Yifeng Zhu, Chongkai Gao, Yihao Feng, Qiang Liu, Yuke Zhu, and Peter Stone. Libero: Benchmarking knowledge transfer for lifelong robot learning. Advances in Neural Information Processing Systems, 36:44776–44791, 2023. 5, 6

[24] Zuxin Liu, Jesse Zhang, Kavosh Asadi, Yao Liu, Ding Zhao, Shoham Sabach, and Rasool Fakoor. Tail: Task-specific adapters for imitation learning with large pretrained models. arXiv preprint arXiv:2310.05905, 2023. 2

[25] Arun Mallya and Svetlana Lazebnik. Packnet: Adding multiple tasks to a single network by iterative pruning. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pages 7765–7773, 2018. 2

[26] Yuan Meng, Zhenshan Bing, Xiangtong Yao, Kejia Chen, Kai Huang, Yang Gao, Fuchun Sun, and Alois Knoll. Preserving and combining knowledge in robotic lifelong reinforcement learning. Nature Machine Intelligence, pages 1–14, 2025. 2, 3

[27] Atharva Mete, Haotian Xue, Albert Wilcox, Yongxin Chen, and Animesh Garg. Quest: Self-supervised skill abstractions for learning continuous control. Advances in Neural Information Processing Systems, 37:4062–4089, 2024. 2, 3

[28] Ethan Perez, Florian Strub, Harm De Vries, Vincent Dumoulin, and Aaron Courville. Film: Visual reasoning with a general conditioning layer. In Proceedings of the AAAI conference on artificial intelligence, 2018. 3

[29] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021. 3

[30] Dushyant Rao, Francesco Visin, Andrei Rusu, Razvan Pascanu, Yee Whye Teh, and Raia Hadsell. Continual unsupervised representation learning. Advances in neural information processing systems, 32, 2019. 2

[31] Moritz Reuss, Jyothish Pari, Pulkit Agrawal, and Rudolf Lioutikov. Efficient diffusion transformer policies with mixture of expert denoisers for multitask learning. arXiv preprint arXiv:2412.12953, 2024. 1, 2, 5, 6, 7

[32] Andrei A Rusu, Neil C Rabinowitz, Guillaume Desjardins, Hubert Soyer, James Kirkpatrick, Koray Kavukcuoglu, Razvan Pascanu, and Raia Hadsell. Progressive neural networks. arXiv preprint arXiv:1606.04671, 2016. 2

[33] Jayaram Sethuraman. A constructive definition of Dirichlet priors. Statistica sinica, pages 639–650, 1994. 3

[34] Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538, 2017. 2

[35] Weijie Shen, Yitian Liu, Yuhao Wu, Zhixuan Liang, Sijia Gu, Dehui Wang, Tian Nian, Lei Xu, Yusen Qin, Jiangmiao Pang, et al. Expertise need not monopolize: Action-specialized mixture of experts for vision-language-action learning. arXiv preprint arXiv:2510.14300, 2025. 2

[36] Kimi Team, Angang Du, Bohong Yin, Bowei Xing, Bowen Qu, Bowen Wang, Cheng Chen, Chenlin Zhang, Chenzhuang Du, Chu Wei, et al. Kimi-VL technical report. arXiv preprint arXiv:2504.07491, 2025. 2

[37] Yee Whye Teh, Michael I Jordan, Matthew J Beal, and David M Blei. Hierarchical Dirichlet processes. Journal of the american statistical association, 101(476):1566–1581,

[38] Sebastian Thrun and Tom M Mitchell. Lifelong robot learning. Robotics and autonomous systems, 15(1-2):25–46,

[39] Weikang Wan, Yifeng Zhu, Rutav Shah, and Yuke Zhu. Lotus: Continual imitation learning for robot manipulation through unsupervised skill discovery. In 2024 IEEE International Conference on Robotics and Automation (ICRA), pages 537–544. IEEE, 2024. 2, 3, 6

[40] Yixiao Wang, Yifei Zhang, Mingxiao Huo, Ran Tian, Xiang Zhang, Yichen Xie, Chenfeng Xu, Pengliang Ji, Wei Zhan, Mingyu Ding, et al. Sparse diffusion policy: A sparse, reusable, and flexible policy for robot learning. arXiv preprint arXiv:2407.01531, 2024. 2

[41] Yating Wang, Haoyi Zhu, Mingyu Liu, Jiange Yang, Hao-Shu Fang, and Tong He. Vq-vla: Improving vision-language-action models via scaling vector-quantized action tokenizers. arXiv preprint arXiv:2507.01016, 2025. 2, 3

[42] Zihao Wang, Shaofei Cai, Zhancun Mu, Haowei Lin, Ceyao Zhang, Xuejie Liu, Qing Li, Anji Liu, Xiaojian Shawn Ma, and Yitao Liang. Omnijarvis: Unified vision-language-action tokenization enables open-world instruction following agents. Advances in Neural Information Processing Systems, 37:73278–73308, 2024. 2

[43] Hongtao Wu, Ya Jing, Chilam Cheang, Guangzeng Chen, Jiafeng Xu, Xinghang Li, Minghuan Liu, Hang Li, and Tao Kong. Unleashing large-scale video generative pre-training for visual robot manipulation. arXiv preprint arXiv:2312.13139, 2023. 2

[44] Zhiyu Wu, Xiaokang Chen, Zizheng Pan, Xingchao Liu, Wen Liu, Damai Dai, Huazuo Gao, Yiyang Ma, Chengyue Wu, Bingxuan Wang, et al. DeepSeek-VL2: Mixture-of-experts vision-language models for advanced multimodal understanding. arXiv preprint arXiv:2412.10302, 2024. 2

[45] Jonathan Yang, Chuyuan Kelly Fu, Dhruv Shah, Dorsa Sadigh, Fei Xia, and Tingnan Zhang. Bridging perception and action: Spatially-grounded mid-level representations for robot generalization. arXiv preprint arXiv:2506.06196,

[46] Zhenjie Yang, Yilin Chai, Xiaosong Jia, Qifeng Li, Yuqian Shao, Xuekai Zhu, Haisheng Su, and Junchi Yan. DriveMoE: Mixture-of-experts for vision-language-action model in end-to-end autonomous driving. arXiv preprint arXiv:2505.16278, 2025. 2

[47] Jiawen Yu, Hairuo Liu, Qiaojun Yu, Jieji Ren, Ce Hao, Haitong Ding, Guangyu Huang, Guofan Huang, Yan Song, Panpan Cai, et al. Forcevla: Enhancing vla models with a force-aware moe for contact-rich manipulation. arXiv preprint arXiv:2505.22159, 2025.

[48] Rongyu Zhang, Menghang Dong, Yuan Zhang, Liang Heng, Xiaowei Chi, Gaole Dai, Li Du, Yuan Du, and Shanghang Zhang. Mole-vla: Dynamic layer-skipping vision language action model via mixture-of-layers for efficient robot manipulation. arXiv preprint arXiv:2503.20384, 2025. 2

[49] Wenyao Zhang, Hongsi Liu, Zekun Qi, Yunnan Wang, Xinqiang Yu, Jiazhao Zhang, Runpei Dong, Jiawei He, Fan Lu, He Wang, et al. DreamVLA: A vision-language-action model dreamed with comprehensive world knowledge. arXiv preprint arXiv:2507.04447, 2025. 2

[50] Han Zhao, Wenxuan Song, Donglin Wang, Xinyang Tong, Pengxiang Ding, Xuelian Cheng, and Zongyuan Ge. More: Unlocking scalability in reinforcement learning for quadruped vision-language-action models. arXiv preprint arXiv:2503.08007, 2025. 2

[51] Ruijie Zheng, Ching-An Cheng, Hal Daumé III, Furong Huang, and Andrey Kolobov. Prise: Learning temporal action abstractions as a sequence compression problem. CoRR,