OxyGen: Unified KV Cache Management for VLA Inference under Multi-Task Parallelism

Huaizhi Tang; Ting Cao; Weijun Wang; Xiangyu Li; Xin Ding; Yunxin Liu

arxiv: 2603.14371 · v2 · pith:RDLM25SLnew · submitted 2026-03-15 · 💻 cs.RO · cs.AI

Xiangyu Li, Huaizhi Tang, Xin Ding, Weijun Wang, Ting Cao, Yunxin Liu

Pith reviewed 2026-05-21 11:41 UTC · model grok-4.3

classification 💻 cs.RO cs.AI

keywords KV cache management · VLA inference · multi-task parallelism · Mixture-of-Transformers · continuous batching · on-device robotics · embodied AI · action frequency

The pith

Unified KV cache management lets VLA models run multiple tasks in parallel by sharing observation prefill and batching language decoding with action cycles.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that current inference systems waste work when a single VLA model must produce language, actions, and memory outputs at different rates from the same camera input. The authors treat the KV cache as one shared pool rather than separate per-task stores. This change removes repeated prefill of common observations and lets language tokens be generated continuously while actions stay locked to fixed control cycles. If the design holds, on-device robots can sustain high language throughput and real-time control at once without extra hardware or quality loss.

Core claim

Isolated KV cache management is the root cause of redundant computation and resource contention in multi-task VLA inference. Treating the KV cache as a first-class shared resource across tasks and over time enables cross-task KV sharing, which eliminates redundant prefill of shared observations, and cross-frame continuous batching, which decouples variable-length language decoding from fixed-rate action generation. When implemented for the π_{0.5} MoT VLA, the approach produces up to 3.7× speedup over isolated execution while delivering over 200 tokens/s language throughput and 70 Hz action frequency simultaneously on both RTX 4090 and Jetson AGX Thor without degrading action quality.

What carries the argument

Unified KV cache management: an abstraction that treats the KV cache as a resource shared across tasks and over time. It supports cross-task KV sharing, which eliminates redundant prefill of shared observations, and cross-frame continuous batching, which decouples the rate of language decoding from the rate of action generation.
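The difference between isolated and shared caches can be illustrated with a toy count of prefill passes. This is a minimal sketch, not the paper's implementation: `FrameKVCache`, `count_prefills`, the string stand-ins for KV entries, and the three task names are all hypothetical, and the only thing modeled is how often the observation is prefilled.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class FrameKVCache:
    """Toy KV cache keyed by frame index (hypothetical, for illustration only)."""
    entries: Dict[int, Tuple[str, ...]] = field(default_factory=dict)
    prefill_calls: int = 0

    def get_or_prefill(self, frame: int, observation: List[str]) -> Tuple[str, ...]:
        if frame not in self.entries:
            self.prefill_calls += 1
            # Stand-in for the real prefill pass over the observation tokens.
            self.entries[frame] = tuple(f"kv({tok})" for tok in observation)
        return self.entries[frame]


def count_prefills(frames, tasks, shared: bool) -> int:
    """Count prefill passes when tasks share one cache vs. own one each."""
    pool = FrameKVCache()
    caches = {task: (pool if shared else FrameKVCache()) for task in tasks}
    for frame, obs in frames:
        for task in tasks:
            caches[task].get_or_prefill(frame, obs)
    unique = {id(c): c for c in caches.values()}
    return sum(c.prefill_calls for c in unique.values())


frames = [(t, [f"img{t}", "instruction"]) for t in range(4)]
tasks = ["action", "language", "memory"]
isolated_calls = count_prefills(frames, tasks, shared=False)
unified_calls = count_prefills(frames, tasks, shared=True)
print(isolated_calls, unified_calls)  # 12 4
```

With three tasks reading the same observation, the isolated layout prefills each frame three times and the shared layout once. The real saving depends on how prefill cost compares with decoding and action denoising, which this sketch does not model.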

If this is right

Language and action outputs run at high rates together from one model instance.
Redundant computation on repeated observations drops sharply across tasks.
The same design works on both workstation GPUs and edge platforms like Jetson AGX Thor.
Real-robot validation confirms the throughput and frequency numbers hold in hardware.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same sharing idea could apply to other models that emit outputs at mismatched rates, such as multi-modal generation systems.
Lower memory footprint from shared KV entries might extend battery life on mobile robots.
Dynamic adjustment of batch windows could further optimize for sudden changes in task priority.
The approach may combine with model compression techniques to reach even higher frequencies on smaller chips.

Load-bearing premise

Cross-task KV sharing and cross-frame batching can be applied to the MoT architecture without creating state inconsistencies or output errors when tasks have different time constraints.

What would settle it

Execute the system on a mix of manipulation, conversation, and memory tasks from the same observation stream and check whether action success rates or language coherence fall below the isolated baseline.

Figures

Figures reproduced from arXiv: 2603.14371 by Huaizhi Tang, Ting Cao, Weijun Wang, Xiangyu Li, Xin Ding, Yunxin Liu.

**Figure 1.** Left: An example of deploying a Mixture-of-Transformers (MoT) Vision-Language-Action (VLA) model for parallel multi-task inference: based on per-frame input observations, the VLA generates robot actions within each frame, while continuously generating language-based memories across multiple frames [38]. Right: Comparison between two paradigms of MoT VLA inference: existing systems manage KV cache in iso…

**Figure 2.** KV-centric dataflow at frame t with unified KV cache manager. KV[t] represents the KV cache prefilled at frame t (i.e., K_t defined in Eq. (1)); ΔLanguage[t] represents incremental language tokens in y_t, generated with K_t. Resumable generation state. Given the shared K_t from Eq. (1), cross-task KV sharing is straightforward: all experts at frame t consume the same prefill cache K_t. However, interrupting and re…

**Figure 3.** Timeline comparison of OxyGen vs. isolated execution (baseline), with an example workload of N = 12 total tokens per request. After the initial warmup, OxyGen steadily advances B = 3 parallel requests to produce k = 4 tokens per request per frame, significantly reducing the end-to-end inference latency per frame, and increasing both action frequency and language throughput, all by a factor of 1 + ΔT/T. …

**Figure 4.** Comparison of action frequency and language throughput under different configurations. OxyGen consistently outperforms baselines by up to 3.7×, achieving up to 200 tokens/s language throughput and 70 Hz action frequency simultaneously. … steps S = 10 during evaluation, unless otherwise specified. Language throughput (tokens/s): number of language tokens generated per second, reflecting the speed of languag…

**Figure 5.** Speedup ratio for action frequency of OxyGen vs. baseline under different configurations. Action denoising steps have modest impact on the speedup ratio. … and achieves significant speedup. (3) Naive MPS parallelization provides modest improvement over the sequential baseline, demonstrating that simply running tasks in parallel without eliminating redundant computation is insufficient. 4.3 Ablation Study We …

**Figure 6.** Ablation of action frequency on LIBERO across language decoding steps. Cross-task KV sharing provides the initial speedup for short decoding steps, while cross-frame continuous batching maintains high action frequency as decoding steps increase. [Plot labels that follow this caption in the source: panels Uniform and Poisson; x-axes Arrivals per Frame and Mean Arrivals per Frame (λ); y-axis Action Frequency (Hz).]

**Figure 7.** Action frequency (per request) and average batch size of OxyGen under different request arrival patterns. 4.4 Workload Generalization Besides the common workload of one new observation per frame, we demonstrate OxyGen’s capability of generalizing to more workloads, including uniform arrivals with varying requests per frame, Poisson arrivals with varying intensity λ (mean arrivals per frame), and random-l…
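The Figure 3 caption gives a concrete workload (N = 12 tokens per request, k = 4 tokens per request per frame, B = 3 parallel requests after warmup). The sketch below replays that schedule to show where B = 3 comes from. It rests on assumptions not stated in the captions: exactly one new request arrives per frame, and every in-flight request advances by exactly k tokens in each frame. The function name and its parameters are hypothetical.

```python
from typing import List, Tuple


def simulate_continuous_batching(
    num_frames: int, total_tokens: int = 12, tokens_per_frame: int = 4
) -> Tuple[List[int], int]:
    """Return the batch size at every frame and the number of finished requests."""
    remaining: List[int] = []    # tokens still owed to each in-flight request
    batch_sizes: List[int] = []
    completed = 0
    for _ in range(num_frames):
        remaining.append(total_tokens)        # a new observation opens a request
        batch_sizes.append(len(remaining))    # requests decoded together this frame
        remaining = [r - tokens_per_frame for r in remaining]
        completed += sum(1 for r in remaining if r <= 0)
        remaining = [r for r in remaining if r > 0]
    return batch_sizes, completed


batch_sizes, completed = simulate_continuous_batching(num_frames=6)
print(batch_sizes)  # [1, 2, 3, 3, 3, 3]
print(completed)    # 4
```

Under these assumptions the batch settles at N / k = 3 requests after a two-frame warmup, and one request finishes per frame from then on. Each frame runs k batched decode steps instead of the N sequential steps that decoding a whole request inside one frame would need.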

Original abstract

Embodied AI agents increasingly require parallel execution of multiple tasks, such as manipulation, conversation, and memory construction, from shared observations under distinct time constraints. Recent Mixture-of-Transformers (MoT) Vision-Language-Action Models (VLAs) architecturally support such heterogeneous outputs, yet existing inference systems fail to achieve efficient multi-task parallelism for on-device deployment because of redundant computation and resource contention. We identify isolated KV cache management as the root cause. To address this, we propose unified KV cache management, an inference design that treats the KV cache as a first-class shared resource across tasks and over time. This abstraction enables two key optimizations: cross-task KV sharing eliminates redundant prefill of shared observations, while cross-frame continuous batching decouples variable-length language decoding from fixed-rate action generation across control cycles. We implement this design for $\pi_{0.5}$, a popular MoT VLA, and evaluate it on both NVIDIA GeForce RTX 4090 and Jetson AGX Thor, two representative platforms for on-device VLA inference. OxyGen achieves up to 3.7$\times$ speedup over isolated execution, delivering over 200 tokens/s language throughput and 70 Hz action frequency simultaneously without degrading action quality, and we further validate the gains on a real humanoid robot with on-board Jetson AGX Thor.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

OxyGen shows a workable unified KV cache for parallel tasks in MoT VLAs that cuts redundant prefill and supports mixed rates, with hardware numbers that look usable for on-device robotics.

The paper's main contribution is to treat the KV cache as a resource shared across tasks and frames rather than isolated per task. Cross-task sharing reuses the prefilled observations from the same inputs, and cross-frame batching lets language decoding run at its own pace while actions stay at fixed control rates. Applied to π_{0.5}, the authors report up to 3.7× faster execution, over 200 tokens/s language throughput, and 70 Hz actions on both an RTX 4090 and a Jetson AGX Thor, plus a real humanoid test, all without a measurable drop in action quality. That combination of optimizations and concrete platform results is the part worth paying attention to for anyone deploying multi-task VLAs on limited hardware. The engineering is direct and the claims are testable rather than hand-wavy.

The stress-test worry about state inconsistencies or attention leakage across heterogeneous tasks is reasonable on the surface. But the paper reports no quality loss, so the implementation presumably includes some form of per-task masking or segmentation that keeps contexts separate while still sharing the cache. If those details are spelled out with clear code or diagrams, the concern does not land as a load-bearing flaw. The abstract is light on baselines and variance, but the full runs on two platforms and the robot test give it more weight than a pure simulation paper.

The audience is people who build or optimize on-device embodied systems, already work with MoT VLAs, and need better throughput under parallel loads. This is not a theoretical advance but a practical systems tweak that readers in robotics inference will actually try. The work is coherent enough and grounded enough in real hardware to deserve a serious referee rather than a desk reject.

Referee Report

2 major / 2 minor

Summary. The paper proposes OxyGen, a unified KV cache management design for efficient multi-task inference in Mixture-of-Transformers (MoT) Vision-Language-Action (VLA) models such as π_{0.5}. It treats the KV cache as a shared resource to enable cross-task KV sharing (eliminating redundant prefill of shared observations) and cross-frame continuous batching (decoupling variable-length language decoding from fixed-rate action generation). Implemented and evaluated on NVIDIA RTX 4090 and Jetson AGX Thor, OxyGen reports up to 3.7× speedup over isolated execution, >200 tokens/s language throughput, and 70 Hz action frequency simultaneously, without degrading action quality, with additional validation on a real humanoid robot.

Significance. If the central claims hold, the work would be significant for on-device deployment of multi-task VLAs, as it directly tackles redundant computation and resource contention in parallel execution of heterogeneous tasks (manipulation, conversation, memory) from shared observations. The concrete speedups, throughput numbers, and real-robot validation on representative hardware platforms constitute practical contributions. The engineering focus on KV cache as a first-class abstraction is a clear strength, though the absence of detailed experimental protocols in the provided abstract limits immediate assessment of reproducibility.

major comments (2)

[Section 3 (cross-task KV sharing)] The description of cross-task KV sharing (around the implementation for π_{0.5}'s MoT architecture) does not specify the concrete mechanisms—such as per-task attention masks, position ID offsets, or explicit cache segmentation—that prevent attention leakage or hidden-state inconsistencies between tasks with mismatched generation rates and output heads. This is load-bearing for the claim that action quality remains undegraded while achieving simultaneous 200 tokens/s and 70 Hz performance.
[Evaluation section / results table] Table or figure reporting the 3.7× speedup and quality metrics lacks explicit baselines, error bars, number of runs, or data exclusion criteria. Without these, it is difficult to verify that the gains are not artifacts of particular task pairings or hardware configurations, directly affecting confidence in the multi-task parallelism results.
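The first major comment names per-task attention masks as one candidate mechanism. The sketch below shows what such a mask could look like over a shared observation prefix. It is purely illustrative: the text available here does not confirm that OxyGen or π_{0.5} uses this layout, the causal ordering inside the prefix is a simplifying assumption, and `build_task_mask` and its arguments are hypothetical names.

```python
from typing import List, Tuple


def build_task_mask(prefix_len: int, task_lens: List[int]) -> Tuple[List[List[bool]], List[int]]:
    """Boolean attention mask: mask[q][k] is True when query q may attend to key k.

    Layout (assumed): shared prefix tokens first, then one segment per task.
    Segment id -1 marks the shared prefix; task i gets id i.
    """
    segment = [-1] * prefix_len
    for task_id, length in enumerate(task_lens):
        segment += [task_id] * length
    total = len(segment)
    mask = [[False] * total for _ in range(total)]
    for q in range(total):
        for k in range(q + 1):  # causal: never look at later positions
            # Visible keys: the shared prefix, or earlier tokens of the same task.
            if segment[k] == -1 or segment[k] == segment[q]:
                mask[q][k] = True
    return mask, segment


mask, segment = build_task_mask(prefix_len=2, task_lens=[2, 1])
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

In this layout every task token sees the shared prefix and its own earlier tokens, and never sees another task's tokens, which is the kind of isolation the referee asks the authors to document.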

minor comments (2)

[Section 3] Notation for the two optimizations (cross-task sharing vs. cross-frame batching) could be introduced with a small diagram or pseudocode to clarify how they interact over control cycles.
[Abstract and §4] The abstract and evaluation should explicitly state the isolated-execution baseline (e.g., separate processes or sequential scheduling) used for the 3.7× comparison.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful review and positive evaluation of the significance of our work on unified KV cache management for multi-task VLA inference. We have carefully considered the major comments and provide point-by-point responses below. Where appropriate, we have made revisions to the manuscript to address the concerns raised.

Referee: [Section 3 (cross-task KV sharing)] The description of cross-task KV sharing (around the implementation for π_{0.5}'s MoT architecture) does not specify the concrete mechanisms—such as per-task attention masks, position ID offsets, or explicit cache segmentation—that prevent attention leakage or hidden-state inconsistencies between tasks with mismatched generation rates and output heads. This is load-bearing for the claim that action quality remains undegraded while achieving simultaneous 200 tokens/s and 70 Hz performance.

Authors: We agree that a more detailed exposition of the mechanisms is essential for clarity and to substantiate the claims regarding action quality preservation. In the revised version of the manuscript, we have augmented Section 3 with explicit descriptions of the implementation details for cross-task KV sharing in the context of π_{0.5}'s MoT architecture. This includes the use of per-task attention masks to isolate computations, adjustments to position ID offsets to handle mismatched generation rates, and explicit cache segmentation to prevent hidden-state inconsistencies. These additions ensure that the shared KV cache does not lead to attention leakage across tasks, thereby supporting the reported performance without degradation in action quality. We believe this revision addresses the concern directly.

Revision: yes
Referee: [Evaluation section / results table] Table or figure reporting the 3.7× speedup and quality metrics lacks explicit baselines, error bars, number of runs, or data exclusion criteria. Without these, it is difficult to verify that the gains are not artifacts of particular task pairings or hardware configurations, directly affecting confidence in the multi-task parallelism results.

Authors: We acknowledge that the original presentation of results could benefit from additional statistical rigor and transparency. Accordingly, we have revised the evaluation section to include a more comprehensive table that specifies the baselines used (such as isolated task execution and alternative batching strategies), reports error bars based on multiple runs (specifically, mean and standard deviation over 10 independent trials per configuration), details the number of runs, and outlines the data exclusion criteria (e.g., runs affected by transient hardware issues were excluded and noted). These changes aim to provide greater confidence that the observed speedups and throughput metrics are robust and not dependent on specific task pairings or hardware setups. We have also added a discussion on reproducibility protocols.

Revision: yes

Circularity Check

0 steps flagged

No circularity: empirical engineering results independent of inputs

The paper presents a systems-level inference optimization for unified KV cache management in MoT-based VLAs, with performance claims (3.7× speedup, 200 tokens/s, 70 Hz) arising from direct implementation and hardware measurement on RTX 4090 and Jetson AGX Thor rather than any first-principles derivation, fitted parameter, or self-referential definition. No equations, predictions, or uniqueness theorems are shown that reduce to the paper's own inputs by construction; cross-task sharing and continuous batching are described as explicit design changes whose correctness is validated externally via action-quality preservation on a real robot. The work is self-contained against external benchmarks with no load-bearing self-citations or ansatz smuggling evident in the provided text.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The design rests on domain assumptions about the MoT VLA supporting heterogeneous parallel outputs from shared inputs; no new free parameters or invented entities are introduced.

axioms (1)

Domain assumption: MoT VLAs architecturally support heterogeneous outputs for multiple tasks from shared observations under distinct time constraints.
This premise underpins the need for unified cache management and is stated in the problem setup.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

DEFLECT: Delay-Robust Execution via Flow-matching Likelihood-Estimated Counterfactual Tuning for VLA Policies
cs.RO · 2026-05 · unverdicted · novelty 6.0

DEFLECT is an offline post-training method that improves async VLA policy success rates under high inference delays by using flow-matching likelihood ratios on counterfactual fresh/stale action pairs from a frozen ref...

Reference graph

Works this paper leans on

54 extracted references · 54 canonical work pages · cited by 1 Pith paper · 15 internal anchors

[1] 1X Technologies: NEO Home Robot: The World’s First Consumer-Ready Humanoid Robot. https://www.1x.tech/neo (2026), accessed: 2026-02-14

[2] Anwar, A., Welsh, J., Biswas, J., Pouya, S., Chang, Y.: Remembr: Building and reasoning over long-horizon spatio-temporal memory for robot navigation. In: 2025 IEEE International Conference on Robotics and Automation (ICRA). pp. 2838–

[3] Bi, H., Tan, H., Xie, S., Wang, Z., Huang, S., Liu, H., Zhao, R., Feng, Y., Xiang, C., Rong, Y., et al.: Motus: A unified latent action world model. arXiv preprint arXiv:2512.13030 (2025)

[4] Bjorck, J., Castañeda, F., Cherniadev, N., Da, X., Ding, R., Fan, L., Fang, Y., Fox, D., Hu, F., Huang, S., et al.: Gr00t n1: An open foundation model for generalist humanoid robots. arXiv preprint arXiv:2503.14734 (2025)

[5] Black, K., Brown, N., Driess, D., Esmail, A., Equi, M., Finn, C., Fusai, N., Groom, L., Hausman, K., Ichter, B., et al.: π_0: A vision-language-action flow model for general robot control. arXiv preprint arXiv:2410.24164 (2024)

[6] Black, K., Galliker, M.Y., Levine, S.: Real-time execution of action chunking flow policies. arXiv preprint arXiv:2506.07339 (2025)

[7] Bu, Q., Yang, Y., Cai, J., Gao, S., Ren, G., Yao, M., Luo, P., Li, H.: Univla: Learning to act anywhere with task-centric latent actions. arXiv preprint arXiv:2505.06111 (2025)

[8] Cen, J., Huang, S., Yuan, Y., Li, K., Yuan, H., Yu, C., Jiang, Y., Guo, J., Li, X., Luo, H., et al.: Rynnvla-002: A unified vision-language-action and world model. arXiv preprint arXiv:2511.17502 (2025)

[9] Fang, H., Liu, Y., Du, Y., Du, L., Yang, H.: Sqap-vla: A synergistic quantization-aware pruning framework for high-performance vision-language-action models. arXiv preprint arXiv:2509.09090 (2025)

[10] Figure AI: Introducing Figure 03: Third Generation Humanoid Robot. https://www.figure.ai/news/introducing-figure-03 (Oct 2025), accessed: 2026-02-14

[11] Fu, T., Min, Z., Zhang, H., Yan, J., Dai, G., Ouyang, W., Wang, Y.: Cache-to-cache: Direct semantic communication between large language models. arXiv preprint arXiv:2510.03215 (2025)

[12] Gu, Q., Kuwajerwala, A., Morin, S., Jatavallabhula, K.M., Sen, B., Agarwal, A., Rivera, C., Paul, W., Ellis, K., Chellappa, R., et al.: Conceptgraphs: Open-vocabulary 3d scene graphs for perception and planning. In: 2024 IEEE International Conference on Robotics and Automation (ICRA). pp. 5021–5028. IEEE (2024)

[13] Intelligence, P., Black, K., Brown, N., Darpinian, J., Dhabalia, K., Driess, D., Esmail, A., Equi, M., Finn, C., Fusai, N., et al.: π_{0.5}: a vision-language-action model with open-world generalization. arXiv preprint arXiv:2504.16054 (2025)

[14] Jiang, T., Yuan, T., Liu, Y., Lu, C., Cui, J., Liu, X., Cheng, S., Gao, J., Xu, H., Zhao, H.: Galaxea open-world dataset and g0 dual-system vla model. arXiv preprint arXiv:2509.00576 (2025)

[15] Jiang, T., Jiang, X., Ma, Y., Wen, X., Li, B., Zhan, K., Jia, P., Liu, Y., Sun, S., Lang, X.: The better you learn, the smarter you prune: Towards efficient vision-language-action models via differentiable token pruning. arXiv preprint arXiv:2509.12594 (2025)

[16] Jiang, W., Clemons, J., Sankaralingam, K., Kozyrakis, C.: How fast can i run my vla? demystifying vla inference performance with vla-perf. arXiv preprint arXiv:2602.18397 (2026)

[17] Khazatsky, A., Pertsch, K., Nair, S., Balakrishna, A., Dasari, S., Karamcheti, S., Nasiriany, S., Srirama, M.K., Chen, L.Y., Ellis, K., et al.: Droid: A large-scale in-the-wild robot manipulation dataset. arXiv preprint arXiv:2403.12945 (2024)

[18] Kim, M.J., Pertsch, K., Karamcheti, S., Xiao, T., Balakrishna, A., Nair, S., Rafailov, R., Foster, E., Lam, G., Sanketi, P., et al.: Openvla: An open-source vision-language-action model. arXiv preprint arXiv:2406.09246 (2024)

[19] Kim, N., Kwon, O., Yoo, H., Choi, Y., Park, J., Oh, S.: Topological semantic graph memory for image-goal navigation. In: Conference on Robot Learning. pp. 393–402. PMLR (2023)

[20] Kwon, W., Li, Z., Zhuang, S., Sheng, Y., Zheng, L., Yu, C.H., Gonzalez, J., Zhang, H., Stoica, I.: Efficient memory management for large language model serving with pagedattention. In: Proceedings of the 29th symposium on operating systems principles. pp. 611–626 (2023)

[21] Lee, D.W., Gillet, S., Morency, L.P., Breazeal, C., Park, H.W.: A modern system recipe for situated embodied human-robot conversation with real-time multimodal llms and tool-calling. arXiv preprint arXiv:2602.04157 (2026)

[22] Li, Q., Liang, Y., Wang, Z., Luo, L., Chen, X., Liao, M., Wei, F., Deng, Y., Xu, S., Zhang, Y., et al.: Cogact: A foundational vision-language-action model for synergizing cognition and action in robotic manipulation. arXiv preprint arXiv:2411.19650 (2024)

[23] Li, Y., Meng, Y., Sun, Z., Ji, K., Tang, C., Fan, J., Ma, X., Xia, S., Wang, Z., Zhu, W.: Sp-vla: A joint model scheduling and token pruning approach for vla model acceleration. arXiv preprint arXiv:2506.12723 (2025)

[24] Liang, W., Yu, L., Luo, L., Iyer, S., Dong, N., Zhou, C., Ghosh, G., Lewis, M., Yih, W.t., Zettlemoyer, L., et al.: Mixture-of-transformers: A sparse and scalable architecture for multi-modal foundation models. arXiv preprint arXiv:2411.04996 (2024)

[25] Liu, B., Zhu, Y., Gao, C., Feng, Y., Liu, Q., Zhu, Y., Stone, P.: Libero: Benchmarking knowledge transfer for lifelong robot learning. Advances in Neural Information Processing Systems 36, 44776–44791 (2023)

[26] Liu, Y., Huang, Y., Yao, J., Feng, S., Gu, Z., Du, K., Li, H., Cheng, Y., Jiang, J., Lu, S., et al.: Droidspeak: Kv cache sharing for cross-llm communication and multi-llm serving. arXiv preprint arXiv:2411.02820 (2024)

[27] Ma, Y., Zhou, Y., Yang, Y., Wang, T., Fan, H.: Running vlas at real-time speed. arXiv preprint arXiv:2510.26742 (2025)

[28] OpenGalaxea: Galaxeavla: Galaxea’s open-source vla repository (2025), https://github.com/OpenGalaxea/GalaxeaVLA, g0 PLUS Community License (Non-Commercial)

[29] Park, S., Kim, H., Jeon, W., Yang, J., Jeon, B., Oh, Y., Choi, J.: Quantization-aware imitation-learning for resource-efficient robotic control. arXiv preprint arXiv:2412.01034 (2024)

[30] Pertsch, K., Stachowicz, K., Ichter, B., Driess, D., Nair, S., Vuong, Q., Mees, O., Finn, C., Levine, S.: Fast: Efficient action tokenization for vision-language-action models. arXiv preprint arXiv:2501.09747 (2025)

[31] Physical Intelligence: openpi (2025), https://github.com/Physical-Intelligence/openpi, apache-2.0 license

[32] Rajvanshi, A., Sikka, K., Lin, X., Lee, B., Chiu, H.P., Velasquez, A.: Saynav: Grounding large language models for dynamic planning to navigation in new environments. In: Proceedings of the International Conference on Automated Planning and Scheduling. vol. 34, pp. 464–474 (2024)

[33] Reuss, M., Zhou, H., Rühle, M., Yağmurlu, Ö.E., Otto, F., Lioutikov, R.: Flower: Democratizing generalist robot policies with efficient vision-language-action flow policies. arXiv preprint arXiv:2509.04996 (2025)

[34] Robotics, X.: Xiaomi-robotics-0: An open-sourced vision-language-action model with real-time execution. arXiv preprint (2026)

[35] Shi, L.X., Ichter, B., Equi, M., Ke, L., Pertsch, K., Vuong, Q., Tanner, J., Walling, A., Wang, H., Fusai, N., et al.: Hi robot: Open-ended instruction following with hierarchical vision-language-action models. arXiv preprint arXiv:2502.19417 (2025)

[36] Shukor, M., Aubakirova, D., Capuano, F., Kooijmans, P., Palma, S., Zouitine, A., Aractingi, M., Pascal, C., Russi, M., Marafioti, A., et al.: Smolvla: A vision-language-action model for affordable and efficient robotics. arXiv preprint arXiv:2506.01844 (2025)

[37] Tan, X., Yang, Y., Ye, P., Zheng, J., Bai, B., Wang, X., Hao, J., Chen, T.: Think twice, act once: Token-aware compression and action reuse for efficient inference in vision-language-action models. arXiv preprint arXiv:2505.21200 (2025)

[38] Torne, M., Pertsch, K., Walke, H., Vedder, K., Nair, S., Ichter, B., Ren, A.Z., Wang, H., Tang, J., Stachowicz, K., Dhabalia, K., Equi, M., Vuong, Q., Springenberg, J.T., Levine, S., Finn, C., Driess, D.: Mem: Multi-scale embodied memory for vision language action models (2025), https://www.pi.website/download/Mem.pdf, technical report, Physical Intelligence

[39] Wang, H., Xiong, C., Wang, R., Chen, X.: Bitvla: 1-bit vision-language-action models for robotics manipulation. arXiv preprint arXiv:2506.07530 (2025)

[40] Wen, J., Zhu, Y., Li, J., Tang, Z., Shen, C., Feng, F.: Dexvla: Vision-language model with plug-in diffusion expert for general robot control. arXiv preprint arXiv:2502.05855 (2025)

[41] X-Square-Robot: wall-x: Building general-purpose robots based on embodied foundation model (2025), https://github.com/X-Square-Robot/wall-x

[42] Xiaomi Robotics: Xiaomi-robotics-0: An open-sourced vision-language-action model with real-time inference (2026), https://github.com/XiaomiRobotics/Xiaomi-Robotics-0, apache-2.0 license

[43] Xu, S., Wang, Y., Xia, C., Zhu, D., Huang, T., Xu, C.: Vla-cache: Efficient vision-language-action manipulation via adaptive token caching. arXiv preprint arXiv:2502.02175 (2025)

[44] Xu, W., Zhuang, L., Shan, L.: Kv-efficient vla: A method to speed up vision language models with rnn-gated chunked kv cache. arXiv preprint arXiv:2509.21354 (2025)

[45] Yang, H., Zhang, R., Huang, M., Wang, W., Tang, Y., Li, Y., Liu, Y., Zhang, D.: Kvshare: An llm service system with efficient and effective multi-tenant kv cache reuse. arXiv preprint arXiv:2503.16525 (2025)

[46] Yang, Y., Wang, Y., Wen, Z., Zhongwei, L., Zou, C., Zhang, Z., Wen, C., Zhang, L.: Efficientvla: Training-free acceleration and compression for vision-language-action models. arXiv preprint arXiv:2506.10100 (2025)

[47] Yao, J., Li, H., Liu, Y., Ray, S., Cheng, Y., Zhang, Q., Du, K., Lu, S., Jiang, J.: Cacheblend: Fast large language model serving for rag with cached knowledge fusion. In: Proceedings of the twentieth European conference on computer systems. pp. 94–109 (2025)

[48] Yue, Y., Wang, Y., Kang, B., Han, Y., Wang, S., Song, S., Feng, J., Huang, G.: Deer-vla: Dynamic inference of multimodal large language models for efficient robot execution. Advances in Neural Information Processing Systems 37, 56619–56643 (2024)

[49] Zhai, A., Liu, B., Fang, B., Cai, C., Ma, E., Yin, E., Wang, H., Zhou, H., Wang, J., Shi, L., et al.: Igniting vlms toward the embodied space. arXiv preprint arXiv:2509.11766 (2025)

[50] Zhang, R., Dong, M., Zhang, Y., Heng, L., Chi, X., Dai, G., Du, L., Du, Y., Zhang, S.: Mole-vla: Dynamic layer-skipping vision language action model via mixture-of-layers for efficient robot manipulation. arXiv preprint arXiv:2503.20384 (2025)

[51] Zhao, T.Z., Kumar, V., Levine, S., Finn, C.: Learning fine-grained bimanual manipulation with low-cost hardware. arXiv preprint arXiv:2304.13705 (2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023
[52]

arXiv preprint arXiv:2512.24673 (2025)

Zhao, Y., Zhao, L., Cheng, B., Yao, G., Wen, X., Gao, H.: Vla-rail: A real- time asynchronous inference linker for vla models and robots. arXiv preprint arXiv:2512.24673 (2025)

work page arXiv 2025
[53]

Advances in neural information processing systems37, 62557– 62583 (2024)

Zheng, L., Yin, L., Xie, Z., Sun, C.L., Huang, J., Yu, C.H., Cao, S., Kozyrakis, C., Stoica, I., Gonzalez, J.E., et al.: Sglang: Efficient execution of structured language model programs. Advances in neural information processing systems37, 62557– 62583 (2024)

work page 2024
[54]

In: Conference on Robot Learning

Zitkovich, B., Yu, T., Xu, S., Xu, P., Xiao, T., Xia, F., Wu, J., Wohlhart, P., Welker, S., Wahid, A., et al.: Rt-2: Vision-language-action models transfer web knowledge to robotic control. In: Conference on Robot Learning. pp. 2165–2183. PMLR (2023)

work page 2023

[1] 1X Technologies: NEO Home Robot: The World’s First Consumer-Ready Humanoid Robot. https://www.1x.tech/neo (2026), accessed: 2026-02-14

[2] Anwar, A., Welsh, J., Biswas, J., Pouya, S., Chang, Y.: Remembr: Building and reasoning over long-horizon spatio-temporal memory for robot navigation. In: 2025 IEEE International Conference on Robotics and Automation (ICRA). pp. 2838–

[3] Bi, H., Tan, H., Xie, S., Wang, Z., Huang, S., Liu, H., Zhao, R., Feng, Y., Xiang, C., Rong, Y., et al.: Motus: A unified latent action world model. arXiv preprint arXiv:2512.13030 (2025)

[4] Bjorck, J., Castañeda, F., Cherniadev, N., Da, X., Ding, R., Fan, L., Fang, Y., Fox, D., Hu, F., Huang, S., et al.: Gr00t n1: An open foundation model for generalist humanoid robots. arXiv preprint arXiv:2503.14734 (2025)

[5] Black, K., Brown, N., Driess, D., Esmail, A., Equi, M., Finn, C., Fusai, N., Groom, L., Hausman, K., Ichter, B., et al.: $\pi_0$: A vision-language-action flow model for general robot control. arXiv preprint arXiv:2410.24164 (2024)

[6] Black, K., Galliker, M.Y., Levine, S.: Real-time execution of action chunking flow policies. arXiv preprint arXiv:2506.07339 (2025)

[7] Bu, Q., Yang, Y., Cai, J., Gao, S., Ren, G., Yao, M., Luo, P., Li, H.: Univla: Learning to act anywhere with task-centric latent actions. arXiv preprint arXiv:2505.06111 (2025)

[8] Cen, J., Huang, S., Yuan, Y., Li, K., Yuan, H., Yu, C., Jiang, Y., Guo, J., Li, X., Luo, H., et al.: Rynnvla-002: A unified vision-language-action and world model. arXiv preprint arXiv:2511.17502 (2025)

[9] Fang, H., Liu, Y., Du, Y., Du, L., Yang, H.: Sqap-vla: A synergistic quantization-aware pruning framework for high-performance vision-language-action models. arXiv preprint arXiv:2509.09090 (2025)

[10] Figure AI: Introducing Figure 03: Third Generation Humanoid Robot. https://www.figure.ai/news/introducing-figure-03 (Oct 2025), accessed: 2026-02-14

[11] Fu, T., Min, Z., Zhang, H., Yan, J., Dai, G., Ouyang, W., Wang, Y.: Cache-to-cache: Direct semantic communication between large language models. arXiv preprint arXiv:2510.03215 (2025)

[12] Gu, Q., Kuwajerwala, A., Morin, S., Jatavallabhula, K.M., Sen, B., Agarwal, A., Rivera, C., Paul, W., Ellis, K., Chellappa, R., et al.: Conceptgraphs: Open-vocabulary 3d scene graphs for perception and planning. In: 2024 IEEE International Conference on Robotics and Automation (ICRA). pp. 5021–5028. IEEE (2024)

[13] Intelligence, P., Black, K., Brown, N., Darpinian, J., Dhabalia, K., Driess, D., Esmail, A., Equi, M., Finn, C., Fusai, N., et al.: $\pi_{0.5}$: a vision-language-action model with open-world generalization. arXiv preprint arXiv:2504.16054 (2025)

[14] Jiang, T., Yuan, T., Liu, Y., Lu, C., Cui, J., Liu, X., Cheng, S., Gao, J., Xu, H., Zhao, H.: Galaxea open-world dataset and g0 dual-system vla model. arXiv preprint arXiv:2509.00576 (2025)

[15] Jiang, T., Jiang, X., Ma, Y., Wen, X., Li, B., Zhan, K., Jia, P., Liu, Y., Sun, S., Lang, X.: The better you learn, the smarter you prune: Towards efficient vision-language-action models via differentiable token pruning. arXiv preprint arXiv:2509.12594 (2025)

[16] Jiang, W., Clemons, J., Sankaralingam, K., Kozyrakis, C.: How fast can I run my vla? demystifying vla inference performance with vla-perf. arXiv preprint arXiv:2602.18397 (2026)

[17] Khazatsky, A., Pertsch, K., Nair, S., Balakrishna, A., Dasari, S., Karamcheti, S., Nasiriany, S., Srirama, M.K., Chen, L.Y., Ellis, K., et al.: Droid: A large-scale in-the-wild robot manipulation dataset. arXiv preprint arXiv:2403.12945 (2024)

[18] Kim, M.J., Pertsch, K., Karamcheti, S., Xiao, T., Balakrishna, A., Nair, S., Rafailov, R., Foster, E., Lam, G., Sanketi, P., et al.: Openvla: An open-source vision-language-action model. arXiv preprint arXiv:2406.09246 (2024)

[19] Kim, N., Kwon, O., Yoo, H., Choi, Y., Park, J., Oh, S.: Topological semantic graph memory for image-goal navigation. In: Conference on Robot Learning. pp. 393–402. PMLR (2023)

[20] Kwon, W., Li, Z., Zhuang, S., Sheng, Y., Zheng, L., Yu, C.H., Gonzalez, J., Zhang, H., Stoica, I.: Efficient memory management for large language model serving with pagedattention. In: Proceedings of the 29th symposium on operating systems principles. pp. 611–626 (2023)

[21] Lee, D.W., Gillet, S., Morency, L.P., Breazeal, C., Park, H.W.: A modern system recipe for situated embodied human-robot conversation with real-time multimodal llms and tool-calling. arXiv preprint arXiv:2602.04157 (2026)

[22] Li, Q., Liang, Y., Wang, Z., Luo, L., Chen, X., Liao, M., Wei, F., Deng, Y., Xu, S., Zhang, Y., et al.: Cogact: A foundational vision-language-action model for synergizing cognition and action in robotic manipulation. arXiv preprint arXiv:2411.19650 (2024)

[23] Li, Y., Meng, Y., Sun, Z., Ji, K., Tang, C., Fan, J., Ma, X., Xia, S., Wang, Z., Zhu, W.: Sp-vla: A joint model scheduling and token pruning approach for vla model acceleration. arXiv preprint arXiv:2506.12723 (2025)

[24] Liang, W., Yu, L., Luo, L., Iyer, S., Dong, N., Zhou, C., Ghosh, G., Lewis, M., Yih, W.t., Zettlemoyer, L., et al.: Mixture-of-transformers: A sparse and scalable architecture for multi-modal foundation models. arXiv preprint arXiv:2411.04996 (2024)

[25] Liu, B., Zhu, Y., Gao, C., Feng, Y., Liu, Q., Zhu, Y., Stone, P.: Libero: Benchmarking knowledge transfer for lifelong robot learning. Advances in Neural Information Processing Systems 36, 44776–44791 (2023)

[26] Liu, Y., Huang, Y., Yao, J., Feng, S., Gu, Z., Du, K., Li, H., Cheng, Y., Jiang, J., Lu, S., et al.: Droidspeak: Kv cache sharing for cross-llm communication and multi-llm serving. arXiv preprint arXiv:2411.02820 (2024)

[27] Ma, Y., Zhou, Y., Yang, Y., Wang, T., Fan, H.: Running vlas at real-time speed. arXiv preprint arXiv:2510.26742 (2025)

[28] OpenGalaxea: Galaxeavla: Galaxea’s open-source vla repository (2025), https://github.com/OpenGalaxea/GalaxeaVLA, g0 PLUS Community License (Non-Commercial)

[29] Park, S., Kim, H., Jeon, W., Yang, J., Jeon, B., Oh, Y., Choi, J.: Quantization-aware imitation-learning for resource-efficient robotic control. arXiv preprint arXiv:2412.01034 (2024)

[30] Pertsch, K., Stachowicz, K., Ichter, B., Driess, D., Nair, S., Vuong, Q., Mees, O., Finn, C., Levine, S.: Fast: Efficient action tokenization for vision-language-action models. arXiv preprint arXiv:2501.09747 (2025)

[31] Physical Intelligence: openpi (2025), https://github.com/Physical-Intelligence/openpi, apache-2.0 license

[32] Rajvanshi, A., Sikka, K., Lin, X., Lee, B., Chiu, H.P., Velasquez, A.: Saynav: Grounding large language models for dynamic planning to navigation in new environments. In: Proceedings of the International Conference on Automated Planning and Scheduling. vol. 34, pp. 464–474 (2024)

[33] Reuss, M., Zhou, H., Rühle, M., Yağmurlu, Ö.E., Otto, F., Lioutikov, R.: Flower: Democratizing generalist robot policies with efficient vision-language-action flow policies. arXiv preprint arXiv:2509.04996 (2025)

[34] Robotics, X.: Xiaomi-robotics-0: An open-sourced vision-language-action model with real-time execution. arXiv preprint (2026)

[35] Shi, L.X., Ichter, B., Equi, M., Ke, L., Pertsch, K., Vuong, Q., Tanner, J., Walling, A., Wang, H., Fusai, N., et al.: Hi robot: Open-ended instruction following with hierarchical vision-language-action models. arXiv preprint arXiv:2502.19417 (2025)

[36] Shukor, M., Aubakirova, D., Capuano, F., Kooijmans, P., Palma, S., Zouitine, A., Aractingi, M., Pascal, C., Russi, M., Marafioti, A., et al.: Smolvla: A vision-language-action model for affordable and efficient robotics. arXiv preprint arXiv:2506.01844 (2025)

[37] Tan, X., Yang, Y., Ye, P., Zheng, J., Bai, B., Wang, X., Hao, J., Chen, T.: Think twice, act once: Token-aware compression and action reuse for efficient inference in vision-language-action models. arXiv preprint arXiv:2505.21200 (2025)

[38] Torne, M., Pertsch, K., Walke, H., Vedder, K., Nair, S., Ichter, B., Ren, A.Z., Wang, H., Tang, J., Stachowicz, K., Dhabalia, K., Equi, M., Vuong, Q., Springenberg, J.T., Levine, S., Finn, C., Driess, D.: Mem: Multi-scale embodied memory for vision language action models (2025), https://www.pi.website/download/Mem.pdf, technical report, Physical Intelligence

[39] Wang, H., Xiong, C., Wang, R., Chen, X.: Bitvla: 1-bit vision-language-action models for robotics manipulation. arXiv preprint arXiv:2506.07530 (2025)

[40] Wen, J., Zhu, Y., Li, J., Tang, Z., Shen, C., Feng, F.: Dexvla: Vision-language model with plug-in diffusion expert for general robot control. arXiv preprint arXiv:2502.05855 (2025)

[41] X-Square-Robot: wall-x: Building general-purpose robots based on embodied foundation model (2025), https://github.com/X-Square-Robot/wall-x

[42] Xiaomi Robotics: Xiaomi-robotics-0: An open-sourced vision-language-action model with real-time inference (2026), https://github.com/XiaomiRobotics/Xiaomi-Robotics-0, apache-2.0 license

[43] Xu, S., Wang, Y., Xia, C., Zhu, D., Huang, T., Xu, C.: Vla-cache: Efficient vision-language-action manipulation via adaptive token caching. arXiv preprint arXiv:2502.02175 (2025)

[44] Xu, W., Zhuang, L., Shan, L.: Kv-efficient vla: A method to speed up vision language models with rnn-gated chunked kv cache. arXiv preprint arXiv:2509.21354 (2025)

[45] Yang, H., Zhang, R., Huang, M., Wang, W., Tang, Y., Li, Y., Liu, Y., Zhang, D.: Kvshare: An llm service system with efficient and effective multi-tenant kv cache reuse. arXiv preprint arXiv:2503.16525 (2025)

[46] Yang, Y., Wang, Y., Wen, Z., Zhongwei, L., Zou, C., Zhang, Z., Wen, C., Zhang, L.: Efficientvla: Training-free acceleration and compression for vision-language-action models. arXiv preprint arXiv:2506.10100 (2025)

[47] Yao, J., Li, H., Liu, Y., Ray, S., Cheng, Y., Zhang, Q., Du, K., Lu, S., Jiang, J.: Cacheblend: Fast large language model serving for rag with cached knowledge fusion. In: Proceedings of the twentieth European conference on computer systems. pp. 94–109 (2025)

[48] Yue, Y., Wang, Y., Kang, B., Han, Y., Wang, S., Song, S., Feng, J., Huang, G.: Deer-vla: Dynamic inference of multimodal large language models for efficient robot execution. Advances in Neural Information Processing Systems 37, 56619–56643 (2024)

[49] Zhai, A., Liu, B., Fang, B., Cai, C., Ma, E., Yin, E., Wang, H., Zhou, H., Wang, J., Shi, L., et al.: Igniting vlms toward the embodied space. arXiv preprint arXiv:2509.11766 (2025)

[50] Zhang, R., Dong, M., Zhang, Y., Heng, L., Chi, X., Dai, G., Du, L., Du, Y., Zhang, S.: Mole-vla: Dynamic layer-skipping vision language action model via mixture-of-layers for efficient robot manipulation. arXiv preprint arXiv:2503.20384 (2025)

[51] Zhao, T.Z., Kumar, V., Levine, S., Finn, C.: Learning fine-grained bimanual manipulation with low-cost hardware. arXiv preprint arXiv:2304.13705 (2023)

[52] Zhao, Y., Zhao, L., Cheng, B., Yao, G., Wen, X., Gao, H.: Vla-rail: A real-time asynchronous inference linker for vla models and robots. arXiv preprint arXiv:2512.24673 (2025)

[53] Zheng, L., Yin, L., Xie, Z., Sun, C.L., Huang, J., Yu, C.H., Cao, S., Kozyrakis, C., Stoica, I., Gonzalez, J.E., et al.: Sglang: Efficient execution of structured language model programs. Advances in neural information processing systems 37, 62557–62583 (2024)

[54] Zitkovich, B., Yu, T., Xu, S., Xu, P., Xiao, T., Xia, F., Wu, J., Wohlhart, P., Welker, S., Wahid, A., et al.: Rt-2: Vision-language-action models transfer web knowledge to robotic control. In: Conference on Robot Learning. pp. 2165–2183. PMLR (2023)