Continually Evolving Skill Knowledge in Vision Language Action Model
Pith reviewed 2026-05-17 05:58 UTC · model grok-4.3
The pith
Stellar VLA lets vision-language-action models acquire new skills by evolving a shared knowledge space and routing tasks to experts, without adding parameters or forgetting earlier skills.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Stellar VLA is a knowledge-driven continual imitation learning framework that enables self-evolving knowledge learning by jointly optimizing task representations and a learned knowledge space. It introduces a knowledge-guided expert routing mechanism, conditioned on knowledge relations and Top-K semantic embeddings, that supports task specialization without increasing model size. On the LIBERO benchmark, the resulting models perform strongly relative to VLA and CIL baselines while using only 1% data replay, and real-world dual-arm experiments confirm effective knowledge transfer across distinct embodiments and scenes.
What carries the argument
The knowledge-guided expert routing mechanism that conditions routing decisions on knowledge relations and Top-K semantic embeddings to assign tasks to specialized experts while preserving prior knowledge.
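A minimal sketch of what routing conditioned on Top-K semantic embeddings could look like, under stated assumptions. It is not the paper's implementation: the names `route`, `knowledge_centers`, and `expert_affinity` are hypothetical, cosine similarity stands in for whatever "knowledge relation" the authors use, and the expert pool is held at a fixed size.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def route(task_embedding, knowledge_centers, expert_affinity, k=2, n_select=1):
    """Pick experts for one task from a fixed-size pool (illustrative only).

    knowledge_centers: non-empty list of vectors, one per knowledge-space
        entry (hypothetical representation).
    expert_affinity: one row per knowledge entry, one column per expert
        (hypothetical; the paper's conditioning may differ).
    """
    # 1. Relate the task to every entry in the knowledge space.
    sims = [cosine(task_embedding, c) for c in knowledge_centers]
    # 2. Keep only the Top-K most similar entries.
    top = sorted(range(len(sims)), key=lambda i: sims[i], reverse=True)[:k]
    weights = softmax([sims[i] for i in top])
    # 3. Mix the expert affinities of those entries into gate scores.
    n_experts = len(expert_affinity[0])
    scores = [
        sum(w * expert_affinity[i][e] for w, i in zip(weights, top))
        for e in range(n_experts)
    ]
    gates = softmax(scores)
    # 4. Activate the highest-gated experts; the pool never grows.
    chosen = sorted(range(n_experts), key=lambda e: gates[e], reverse=True)[:n_select]
    return chosen, gates
```

In this sketch a new task changes which existing experts are activated rather than how many experts exist, which is the property the "no increase in model size" claim depends on.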
If this is right
- Stellar VLAs match or exceed both VLA and CIL baselines on the LIBERO benchmark while replaying only 1% of prior data.
- Knowledge transfer remains effective when the same models are deployed on real dual-arm hardware with new embodiments and scenes.
- The hierarchical TS-Stellar variant shows particular strength on tasks that require composing skills in stages.
- Visualizations of the learned knowledge space indicate both retention of earlier tasks and discovery of structure among new ones.
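The 1% replay setting in the first bullet can be pictured with a small sketch. This is an assumption about how such a buffer might be built, not the paper's procedure: `build_replay_buffer` and `training_stream` are hypothetical names, sampling is uniform at random, and keeping at least one demonstration per non-empty task is a choice made here for illustration.

```python
import random

def build_replay_buffer(demos_by_task, fraction=0.01, seed=0):
    """Keep a small random subset of each earlier task's demonstrations.

    demos_by_task: dict mapping task name -> list of demonstrations.
    fraction: share of each task's data to retain (0.01 means 1%).
    """
    rng = random.Random(seed)
    buffer = {}
    for task, demos in demos_by_task.items():
        # Assumption: retain at least one demo per task, never more than exist.
        keep = min(len(demos), max(1, round(fraction * len(demos))))
        buffer[task] = rng.sample(demos, keep)
    return buffer

def training_stream(new_demos, buffer):
    """Combine the new task's data with the replayed subset of old tasks."""
    replayed = [d for demos in buffer.values() for d in demos]
    return list(new_demos) + replayed
```

With 500 demonstrations for an earlier task, this keeps 5 of them, so almost all of each training stage is spent on the new task.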
Where Pith is reading between the lines
- The same routing idea could be tested on longer task sequences to see how far the fixed-size model can scale before interference appears.
- Because the knowledge space is learned jointly with tasks, it might transfer to other continual-learning settings such as language-only or vision-only models.
- Pairing this routing with stronger initial VLA pretraining could further lower the amount of replay needed for acceptable performance.
- Real-world tests on additional robot platforms would show whether the dual-arm results generalize to single-arm or mobile-manipulator settings.
Load-bearing premise
The assumption that routing decisions based on knowledge relations and top semantic embeddings can reliably produce task specialization without any parameter growth or loss of earlier skills.
What would settle it
If sequential tasks on the LIBERO benchmark produce performance well below the reported VLA and CIL baselines when replay is limited to 1%, or if the real dual-arm experiments show clear forgetting of prior tasks across embodiment changes, the central claims would not hold.
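To make this test concrete, forgetting can be read off a matrix of success rates. The sketch below uses one common continual-learning definition and is not necessarily the metric the paper reports; `R[i][j]` is assumed to be the success rate on task `j` measured after training through task `i`, and both function names are hypothetical.

```python
def final_average_success(R):
    """Mean success rate over all tasks after the last training stage."""
    last = R[-1]
    return sum(last) / len(last)

def average_forgetting(R):
    """Average drop from each earlier task's best past score to its final score.

    R is a square list of lists: R[i][j] is the success rate on task j
    after training through task i (an assumed layout).
    """
    T = len(R)
    if T < 2:
        return 0.0  # nothing can be forgotten with a single task
    drops = []
    for j in range(T - 1):
        # Best score task j reached at any stage before the final one.
        best_earlier = max(R[i][j] for i in range(j, T - 1))
        drops.append(best_earlier - R[-1][j])
    return sum(drops) / len(drops)
```

A high final average with forgetting near zero would support the central claims; a large positive forgetting value under 1% replay would count against them.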
Original abstract
Vision-language-action (VLA) models show promising knowledge accumulation ability from pretraining, yet continual learning in VLA remains challenging, especially for efficient adaptation. Existing continual imitation learning (CIL) methods often rely on additional parameters or external modules, limiting scalability for large VLA models. We propose Stellar VLA, a knowledge-driven CIL framework without increasing network parameters. Two progressively extended variants are designed: T-Stellar for flat task-centric modeling and TS-Stellar for hierarchical task-skill structure. Stellar VLA enables self-evolving knowledge learning by jointly optimizing task representations and a learned knowledge space. We propose a knowledge-guided expert routing mechanism conditioned on knowledge relation and Top-K semantic embeddings, enabling task specialization without increasing model size. Experiments on the LIBERO benchmark show that Stellar VLAs achieve strong performance among both VLA and CIL baselines, using only 1% data replay. Real-world evaluation on a dual-arm platform with distinct embodiment and scene configurations validates effective knowledge transfer. TS-Stellar excels in hierarchical manipulation, and visualizations reveal robust knowledge retention and task discovery. Project Website: https://stellarvla.github.io/
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Stellar VLA, a knowledge-driven continual imitation learning (CIL) framework for Vision-Language-Action (VLA) models that achieves self-evolving skill knowledge without increasing network parameters. It introduces two variants—T-Stellar for flat task-centric modeling and TS-Stellar for hierarchical task-skill structures—along with a knowledge-guided expert routing mechanism conditioned on knowledge relations and Top-K semantic embeddings. On the LIBERO benchmark, Stellar VLAs report strong performance relative to VLA and CIL baselines using only 1% data replay; real-world dual-arm experiments validate knowledge transfer, with TS-Stellar excelling in hierarchical manipulation and visualizations showing knowledge retention and task discovery.
Significance. If the routing mechanism demonstrably enables parameter-free specialization and forgetting mitigation, the work offers a scalable path for continual adaptation in large VLAs, addressing a key limitation of existing CIL methods that rely on added parameters or external modules. The combination of benchmark results with real-robot transfer and hierarchical modeling would represent a meaningful advance for efficient lifelong robotic learning.
major comments (2)
- [§5] §5 (Experiments): the claim that the knowledge-guided expert routing produces task specialization and prevents catastrophic forgetting without parameter growth is load-bearing for the LIBERO performance results, yet no ablation isolates the routing (conditioned on knowledge relation and Top-K embeddings) from standard continual baselines or from the 1% replay buffer; without this, the contribution of the proposed mechanism to the reported gains cannot be verified.
- [§4] §4 (Method, routing subsection): the assertion that the routing reuses existing parameters without adding new ones for the knowledge space or conditioning is central to the 'no increase in model size' claim, but the manuscript provides no explicit parameter-count comparison table or derivation showing how the Top-K semantic embeddings and knowledge relations are implemented inside the fixed-size network.
minor comments (2)
- [Abstract] Abstract and §5: quantitative metrics (e.g., success rates, exact baselines compared) and error bars or statistical significance are not reported in the summary of LIBERO results; adding these would improve verifiability of the 'strong performance' statement.
- [§6] §6 (Real-world evaluation): the dual-arm platform description lacks detail on embodiment differences and scene configurations relative to simulation; a brief table comparing sim-to-real gaps would clarify the transfer claims.
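The request for error bars in the first minor comment amounts to reporting a mean and a standard error over repeated runs. A minimal sketch, assuming each run yields one success rate per seed; the function name `summarize` is hypothetical and no values from the paper are used.

```python
import math
import statistics

def summarize(success_rates):
    """Return (mean, standard error of the mean) over per-seed success rates."""
    n = len(success_rates)
    mean = statistics.mean(success_rates)
    if n < 2:
        # A single run gives no spread estimate.
        return mean, float("nan")
    # Sample standard deviation divided by sqrt(n).
    sem = statistics.stdev(success_rates) / math.sqrt(n)
    return mean, sem
```

Reporting both numbers for each method would let a reader judge whether a gap between two methods is larger than the seed-to-seed variation.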
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address the two major comments point by point below, agreeing that additional clarifications and experiments will strengthen the manuscript. We will incorporate the suggested changes in the revised version.
- Referee: [§5] §5 (Experiments): the claim that the knowledge-guided expert routing produces task specialization and prevents catastrophic forgetting without parameter growth is load-bearing for the LIBERO performance results, yet no ablation isolates the routing (conditioned on knowledge relation and Top-K embeddings) from standard continual baselines or from the 1% replay buffer; without this, the contribution of the proposed mechanism to the reported gains cannot be verified.
  Authors: We agree that an explicit ablation isolating the routing mechanism would strengthen the evidence. Our LIBERO results already compare Stellar VLA variants against both standard VLA models and existing CIL baselines (which lack the knowledge-guided routing), with performance gains observed under the 1% replay setting. To directly address the concern, the revised manuscript will include a new ablation study that disables the knowledge-guided expert routing and Top-K semantic embeddings while retaining the 1% replay buffer, reporting the resulting drop in task success rates and increased forgetting on LIBERO to quantify the routing's isolated contribution to specialization and continual performance.
  Revision: yes
- Referee: [§4] §4 (Method, routing subsection): the assertion that the routing reuses existing parameters without adding new ones for the knowledge space or conditioning is central to the 'no increase in model size' claim, but the manuscript provides no explicit parameter-count comparison table or derivation showing how the Top-K semantic embeddings and knowledge relations are implemented inside the fixed-size network.
  Authors: We appreciate this observation. The routing mechanism reuses the fixed VLA backbone by computing knowledge relations and Top-K semantic embeddings from the jointly learned knowledge space, which is integrated via existing conditioning pathways without expanding parameter count. In the revised manuscript, we will add an explicit parameter-count comparison table (base VLA vs. T-Stellar vs. TS-Stellar) and a short derivation in §4 showing that the embeddings and relations are generated using the model's existing layers and do not introduce new parameters.
  Revision: yes
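The parameter-count comparison promised in the second response could be produced along these lines. This is a sketch under assumptions: each model is described by a hypothetical mapping from tensor name to shape, and the first model listed is treated as the baseline. None of the names or shapes come from the paper.

```python
from math import prod

def count_parameters(shapes):
    """Total number of scalar parameters given a dict of name -> shape tuple."""
    return sum(prod(shape) for shape in shapes.values())

def compare(models):
    """Return (name, total, delta vs. first model) rows for a comparison table.

    models: dict mapping model name -> dict of tensor name -> shape tuple.
    The first entry is treated as the baseline (an assumption of this sketch).
    """
    base_name = next(iter(models))
    base = count_parameters(models[base_name])
    rows = []
    for name, shapes in models.items():
        total = count_parameters(shapes)
        rows.append((name, total, total - base))
    return rows
```

A delta of zero for every variant would back the "no increase in model size" claim directly; any stored knowledge-space embeddings or routing weights that are trained separately would show up as a positive delta.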
Circularity Check
No significant circularity in derivation chain
The paper describes Stellar VLA as jointly optimizing task representations and a learned knowledge space, with a knowledge-guided expert routing mechanism conditioned on knowledge relations and Top-K embeddings. These are presented as trainable components whose outputs are validated empirically on LIBERO (1% replay) and real-robot transfer, rather than defined to tautologically produce the reported specialization or forgetting resistance. No equations reduce the performance gains to fitted inputs by construction, and no load-bearing self-citation chain is invoked to force uniqueness. The central claims rest on experimental comparisons against VLA and CIL baselines, making the derivation self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
free parameters (1)
- Top-K value for semantic embeddings
axioms (1)
- domain assumption: Task representations and knowledge space can be jointly optimized without interference or forgetting
Forward citations
Cited by 4 Pith papers
- Retrieve-then-Steer: Online Success Memory for Test-Time Adaptation of Generative VLAs
  Retrieve-then-steer stores successful observation-action segments in memory, retrieves relevant chunks, filters them, and uses an elite prior with confidence-adaptive guidance to steer a flow-matching action sampler f...
- Retrieve-then-Steer: Online Success Memory for Test-Time Adaptation of Generative VLAs
  A retrieve-then-steer method stores successful robot actions in memory and uses them to steer a frozen VLA's flow-matching sampler for better test-time reliability without parameter updates.
- Escaping the Diversity Trap in Robotic Manipulation via Anchor-Centric Adaptation
  Anchor-Centric Adaptation escapes the diversity trap by prioritizing repeated demonstrations at core anchors over broad coverage, yielding higher success rates under fixed data budgets in robotic manipulation.
- Towards Long-Lived Robots: Continual Learning VLA Models via Reinforcement Fine-Tuning
  LifeLong-RFT applies chunking-level on-policy reinforcement learning with Quantized Action Consistency Reward, Continuous Trajectory Alignment Reward, and Format Compliance Reward to fine-tune VLA models, achieving a ...