pith. sign in

arxiv: 2603.14371 · v2 · pith:RDLM25SLnew · submitted 2026-03-15 · 💻 cs.RO · cs.AI

OxyGen: Unified KV Cache Management for VLA Inference under Multi-Task Parallelism

Pith reviewed 2026-05-21 11:41 UTC · model grok-4.3

classification 💻 cs.RO cs.AI
keywords KV cache managementVLA inferencemulti-task parallelismMixture-of-Transformerscontinuous batchingon-device roboticsembodied AIaction frequency
0
0 comments X

The pith

Unified KV cache management lets VLA models run multiple tasks in parallel by sharing observation prefill and batching language decoding with action cycles.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that current inference systems waste work when a single VLA model must produce language, actions, and memory outputs at different rates from the same camera input. The authors treat the KV cache as one shared pool rather than separate per-task stores. This change removes repeated prefill of common observations and lets language tokens be generated continuously while actions stay locked to fixed control cycles. If the design holds, on-device robots can sustain high language throughput and real-time control at once without extra hardware or quality loss.

Core claim

Isolated KV cache management is the root cause of redundant computation and resource contention in multi-task VLA inference. Treating the KV cache as a first-class shared resource across tasks and over time enables cross-task KV sharing, which eliminates redundant prefill of shared observations, and cross-frame continuous batching, which decouples variable-length language decoding from fixed-rate action generation. When implemented for the π_{0.5} MoT VLA, the approach produces up to 3.7× speedup over isolated execution while delivering over 200 tokens/s language throughput and 70 Hz action frequency simultaneously on both RTX 4090 and Jetson AGX Thor without degrading action quality.

What carries the argument

unified KV cache management, an abstraction that treats the KV cache as a shared resource across tasks and time to support cross-task KV sharing for redundant prefill elimination and cross-frame continuous batching for rate decoupling

If this is right

  • Language and action outputs run at high rates together from one model instance.
  • Redundant computation on repeated observations drops sharply across tasks.
  • The same design works on both workstation GPUs and edge platforms like Jetson AGX Thor.
  • Real-robot validation confirms the throughput and frequency numbers hold in hardware.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same sharing idea could apply to other models that emit outputs at mismatched rates, such as multi-modal generation systems.
  • Lower memory footprint from shared KV entries might extend battery life on mobile robots.
  • Dynamic adjustment of batch windows could further optimize for sudden changes in task priority.
  • The approach may combine with model compression techniques to reach even higher frequencies on smaller chips.

Load-bearing premise

Cross-task KV sharing and cross-frame batching can be applied to the MoT architecture without creating state inconsistencies or output errors when tasks have different time constraints.

What would settle it

Execute the system on a mix of manipulation, conversation, and memory tasks from the same observation stream and check whether action success rates or language coherence fall below the isolated baseline.

Figures

Figures reproduced from arXiv: 2603.14371 by Huaizhi Tang, Ting Cao, Weijun Wang, Xiangyu Li, Xin Ding, Yunxin Liu.

Figure 1
Figure 1. Figure 1: Left: An example of deploying a Mixture-of-Transformers (MoT) Vision￾Language-Action (VLA) model for parallel multi-task inference: based on per-frame input observations, the VLA generates robot actions within each frame, while contin￾uously generating language-based memories during multiple frames [38]. Right: Com￾parison between two paradigms of MoT VLA inference: existing systems manages KV cache in iso… view at source ↗
Figure 2
Figure 2. Figure 2: KV-centric dataflow at frame t with unified KV cache manager. KV[t] repre￾sents KV cache prefilled at frame t (i.e., Kt defined in Eq. (1)); ∆Language[t] represents incremental language tokens in yt, generated with Kt. Resumable generation state. Given the shared Kt from Eq. (1), cross-task KV sharing is straightforward: all experts at frame t consume the same prefill cache Kt. However, interrupting and re… view at source ↗
Figure 3
Figure 3. Figure 3: Timeline comparison of OxyGen vs. isolated execution (baseline), with an ex￾ample workload of N = 12 total tokens per request. After the initial warmup, OxyGen steadily advances B = 3 parallel requests to produce k = 4 tokens per request per frame, significantly reducing the end-to-end inference latency per frame, and increas￾ing both action frequency and language throughput, all by a factor of 1 + ∆T /T. … view at source ↗
Figure 4
Figure 4. Figure 4: Comparison of action frequency and language throughput under different con￾figurations. OxyGen consistently outperforms baselines by up to 3.7×, achieving up to 200 tokens/s language throughput and 70 Hz action frequency simultaneously. steps S = 10 during evaluation, unless otherwise specified. Language through￾put (tokens/s): number of language tokens generated per second, reflecting the speed of languag… view at source ↗
Figure 5
Figure 5. Figure 5: Speedup ratio for action frequency of OxyGen vs. baseline under different configurations. Action denoising steps have modest impact on the speedup ratio. and achieves significant speedup. (3) Naive MPS parallelization provides modest improvement over the sequential baseline, demonstrating that simply running tasks in parallel without eliminating redundant computation is insufficient. 4.3 Ablation Study We … view at source ↗
Figure 6
Figure 6. Figure 6: Ablation of action frequency on LIBERO across language decoding steps. Cross￾task KV sharing provides the initial speedup for short decoding steps, while cross-frame continuous batching maintains high action frequency as decoding steps increase. 1/4 1/2 1 2 4 Arrivals per Frame 0 30 60 90 120 150 Action Frequency (Hz) Uniform 0.5 1.0 1.5 2.0 Mean Arrivals per Frame (λ) 0 20 40 60 80 100 Poisson 1/9 3/7 5/5… view at source ↗
Figure 7
Figure 7. Figure 7: Action frequency (per request) and average batch size of OxyGen under differ￾ent request arrival patterns. 4.4 Workload Generalization Besides the common workload of one new observation per frame, we demon￾strate OxyGen’s capability of generalizing to more workloads, including uniform arrivals with varying requests per frame, Poisson arrivals with varying intensity λ (mean arrivals per frame), and random-l… view at source ↗
read the original abstract

Embodied AI agents increasingly require parallel execution of multiple tasks, such as manipulation, conversation, and memory construction, from shared observations under distinct time constraints. Recent Mixture-of-Transformers (MoT) Vision-Language-Action Models (VLAs) architecturally support such heterogeneous outputs, yet existing inference systems fail to achieve efficient multi-task parallelism for on-device deployment because of redundant computation and resource contention. We identify isolated KV cache management as the root cause. To address this, we propose unified KV cache management, an inference design that treats the KV cache as a first-class shared resource across tasks and over time. This abstraction enables two key optimizations: cross-task KV sharing eliminates redundant prefill of shared observations, while cross-frame continuous batching decouples variable-length language decoding from fixed-rate action generation across control cycles. We implement this design for $\pi_{0.5}$, a popular MoT VLA, and evaluate it on both NVIDIA GeForce RTX 4090 and Jetson AGX Thor, two representative platforms for on-device VLA inference. OxyGen achieves up to 3.7$\times$ speedup over isolated execution, delivering over 200 tokens/s language throughput and 70 Hz action frequency simultaneously without degrading action quality, and we further validate the gains on a real humanoid robot with on-board Jetson AGX Thor.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes OxyGen, a unified KV cache management design for efficient multi-task inference in Mixture-of-Transformers (MoT) Vision-Language-Action (VLA) models such as π_{0.5}. It treats the KV cache as a shared resource to enable cross-task KV sharing (eliminating redundant prefill of shared observations) and cross-frame continuous batching (decoupling variable-length language decoding from fixed-rate action generation). Implemented and evaluated on NVIDIA RTX 4090 and Jetson AGX Thor, OxyGen reports up to 3.7× speedup over isolated execution, >200 tokens/s language throughput, and 70 Hz action frequency simultaneously, without degrading action quality, with additional validation on a real humanoid robot.

Significance. If the central claims hold, the work would be significant for on-device deployment of multi-task VLAs, as it directly tackles redundant computation and resource contention in parallel execution of heterogeneous tasks (manipulation, conversation, memory) from shared observations. The concrete speedups, throughput numbers, and real-robot validation on representative hardware platforms constitute practical contributions. The engineering focus on KV cache as a first-class abstraction is a clear strength, though the absence of detailed experimental protocols in the provided abstract limits immediate assessment of reproducibility.

major comments (2)
  1. [Section 3 (cross-task KV sharing)] The description of cross-task KV sharing (around the implementation for π_{0.5}'s MoT architecture) does not specify the concrete mechanisms—such as per-task attention masks, position ID offsets, or explicit cache segmentation—that prevent attention leakage or hidden-state inconsistencies between tasks with mismatched generation rates and output heads. This is load-bearing for the claim that action quality remains undegraded while achieving simultaneous 200 tokens/s and 70 Hz performance.
  2. [Evaluation section / results table] Table or figure reporting the 3.7× speedup and quality metrics lacks explicit baselines, error bars, number of runs, or data exclusion criteria. Without these, it is difficult to verify that the gains are not artifacts of particular task pairings or hardware configurations, directly affecting confidence in the multi-task parallelism results.
minor comments (2)
  1. [Section 3] Notation for the two optimizations (cross-task sharing vs. cross-frame batching) could be introduced with a small diagram or pseudocode to clarify how they interact over control cycles.
  2. [Abstract and §4] The abstract and evaluation should explicitly state the isolated-execution baseline (e.g., separate processes or sequential scheduling) used for the 3.7× comparison.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful review and positive evaluation of the significance of our work on unified KV cache management for multi-task VLA inference. We have carefully considered the major comments and provide point-by-point responses below. Where appropriate, we have made revisions to the manuscript to address the concerns raised.

read point-by-point responses
  1. Referee: [Section 3 (cross-task KV sharing)] The description of cross-task KV sharing (around the implementation for π_{0.5}'s MoT architecture) does not specify the concrete mechanisms—such as per-task attention masks, position ID offsets, or explicit cache segmentation—that prevent attention leakage or hidden-state inconsistencies between tasks with mismatched generation rates and output heads. This is load-bearing for the claim that action quality remains undegraded while achieving simultaneous 200 tokens/s and 70 Hz performance.

    Authors: We agree that a more detailed exposition of the mechanisms is essential for clarity and to substantiate the claims regarding action quality preservation. In the revised version of the manuscript, we have augmented Section 3 with explicit descriptions of the implementation details for cross-task KV sharing in the context of π_{0.5}'s MoT architecture. This includes the use of per-task attention masks to isolate computations, adjustments to position ID offsets to handle mismatched generation rates, and explicit cache segmentation to prevent hidden-state inconsistencies. These additions ensure that the shared KV cache does not lead to attention leakage across tasks, thereby supporting the reported performance without degradation in action quality. We believe this revision addresses the concern directly. revision: yes

  2. Referee: [Evaluation section / results table] Table or figure reporting the 3.7× speedup and quality metrics lacks explicit baselines, error bars, number of runs, or data exclusion criteria. Without these, it is difficult to verify that the gains are not artifacts of particular task pairings or hardware configurations, directly affecting confidence in the multi-task parallelism results.

    Authors: We acknowledge that the original presentation of results could benefit from additional statistical rigor and transparency. Accordingly, we have revised the evaluation section to include a more comprehensive table that specifies the baselines used (such as isolated task execution and alternative batching strategies), reports error bars based on multiple runs (specifically, mean and standard deviation over 10 independent trials per configuration), details the number of runs, and outlines the data exclusion criteria (e.g., runs affected by transient hardware issues were excluded and noted). These changes aim to provide greater confidence that the observed speedups and throughput metrics are robust and not dependent on specific task pairings or hardware setups. We have also added a discussion on reproducibility protocols. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical engineering results independent of inputs

full rationale

The paper presents a systems-level inference optimization for unified KV cache management in MoT-based VLAs, with performance claims (3.7× speedup, 200 tokens/s, 70 Hz) arising from direct implementation and hardware measurement on RTX 4090 and Jetson AGX Thor rather than any first-principles derivation, fitted parameter, or self-referential definition. No equations, predictions, or uniqueness theorems are shown that reduce to the paper's own inputs by construction; cross-task sharing and continuous batching are described as explicit design changes whose correctness is validated externally via action-quality preservation on a real robot. The work is self-contained against external benchmarks with no load-bearing self-citations or ansatz smuggling evident in the provided text.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The design rests on domain assumptions about the MoT VLA supporting heterogeneous parallel outputs from shared inputs; no new free parameters or invented entities are introduced.

axioms (1)
  • domain assumption MoT VLAs architecturally support heterogeneous outputs for multiple tasks from shared observations under distinct time constraints.
    This premise underpins the need for unified cache management and is stated in the problem setup.

pith-pipeline@v0.9.0 · 5786 in / 1194 out tokens · 56550 ms · 2026-05-21T11:41:38.838395+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. DEFLECT: Delay-Robust Execution via Flow-matching Likelihood-Estimated Counterfactual Tuning for VLA Policies

    cs.RO 2026-05 unverdicted novelty 6.0

    DEFLECT is an offline post-training method that improves async VLA policy success rates under high inference delays by using flow-matching likelihood ratios on counterfactual fresh/stale action pairs from a frozen ref...

Reference graph

Works this paper leans on

54 extracted references · 54 canonical work pages · cited by 1 Pith paper · 15 internal anchors

  1. [1]

    1X Technologies: NEO Home Robot: The World’s First Consumer-Ready Hu- manoid Robot.https://www.1x.tech/neo(2026), accessed: 2026-02-14

  2. [2]

    In: 2025 IEEE International Conference on Robotics and Automation (ICRA)

    Anwar, A., Welsh, J., Biswas, J., Pouya, S., Chang, Y.: Remembr: Building and reasoning over long-horizon spatio-temporal memory for robot navigation. In: 2025 IEEE International Conference on Robotics and Automation (ICRA). pp. 2838–

  3. [3]

    Motus: A Unified Latent Action World Model

    Bi, H., Tan, H., Xie, S., Wang, Z., Huang, S., Liu, H., Zhao, R., Feng, Y., Xiang, C., Rong, Y., et al.: Motus: A unified latent action world model. arXiv preprint arXiv:2512.13030 (2025)

  4. [4]

    GR00T N1: An Open Foundation Model for Generalist Humanoid Robots

    Bjorck, J., Castañeda, F., Cherniadev, N., Da, X., Ding, R., Fan, L., Fang, Y., Fox, D., Hu, F., Huang, S., et al.: Gr00t n1: An open foundation model for generalist humanoid robots. arXiv preprint arXiv:2503.14734 (2025)

  5. [5]

    $\pi_0$: A Vision-Language-Action Flow Model for General Robot Control

    Black, K., Brown, N., Driess, D., Esmail, A., Equi, M., Finn, C., Fusai, N., Groom, L., Hausman, K., Ichter, B., et al.:π0: A vision-language-action flow model for general robot control. arXiv preprint arXiv:2410.24164 (2024)

  6. [6]

    Real-Time Execution of Action Chunking Flow Policies

    Black, K., Galliker, M.Y., Levine, S.: Real-time execution of action chunking flow policies. arXiv preprint arXiv:2506.07339 (2025)

  7. [7]

    UniVLA: Learning to Act Anywhere with Task-centric Latent Actions

    Bu, Q., Yang, Y., Cai, J., Gao, S., Ren, G., Yao, M., Luo, P., Li, H.: Uni- vla: Learning to act anywhere with task-centric latent actions. arXiv preprint arXiv:2505.06111 (2025)

  8. [8]

    Rynnvla-002: A unified vision-language-action and world model.arXiv preprint arXiv:2511.17502, 2025a

    Cen, J., Huang, S., Yuan, Y., Li, K., Yuan, H., Yu, C., Jiang, Y., Guo, J., Li, X., Luo, H., et al.: Rynnvla-002: A unified vision-language-action and world model. arXiv preprint arXiv:2511.17502 (2025)

  9. [9]

    arXiv preprint arXiv:2509.09090 (2025)

    Fang, H., Liu, Y., Du, Y., Du, L., Yang, H.: Sqap-vla: A synergistic quantization- aware pruning framework for high-performance vision-language-action models. arXiv preprint arXiv:2509.09090 (2025)

  10. [10]

    Figure AI: Introducing Figure 03: Third Generation Humanoid Robot.https: //www.figure.ai/news/introducing-figure-03(Oct 2025), accessed: 2026-02- 14

  11. [11]

    arXiv preprint arXiv:2510.03215 (2025)

    Fu, T., Min, Z., Zhang, H., Yan, J., Dai, G., Ouyang, W., Wang, Y.: Cache- to-cache: Direct semantic communication between large language models. arXiv preprint arXiv:2510.03215 (2025)

  12. [12]

    In: 2024 IEEE Inter- national Conference on Robotics and Automation (ICRA)

    Gu, Q., Kuwajerwala, A., Morin, S., Jatavallabhula, K.M., Sen, B., Agarwal, A., Rivera, C., Paul, W., Ellis, K., Chellappa, R., et al.: Conceptgraphs: Open- vocabulary 3d scene graphs for perception and planning. In: 2024 IEEE Inter- national Conference on Robotics and Automation (ICRA). pp. 5021–5028. IEEE (2024)

  13. [13]

    $\pi_{0.5}$: a Vision-Language-Action Model with Open-World Generalization

    Intelligence, P., Black, K., Brown, N., Darpinian, J., Dhabalia, K., Driess, D., Esmail, A., Equi, M., Finn, C., Fusai, N., et al.:π 0.5: a vision-language-action model with open-world generalization. arXiv preprint arXiv:2504.16054 (2025)

  14. [14]

    arXiv preprint arXiv:2509.00576 (2025)

    Jiang, T., Yuan, T., Liu, Y., Lu, C., Cui, J., Liu, X., Cheng, S., Gao, J., Xu, H., Zhao, H.: Galaxea open-world dataset and g0 dual-system vla model. arXiv preprint arXiv:2509.00576 (2025)

  15. [15]

    arXiv preprint arXiv:2509.12594 (2025) 16 X

    Jiang, T., Jiang, X., Ma, Y., Wen, X., Li, B., Zhan, K., Jia, P., Liu, Y., Sun, S., Lang, X.: The better you learn, the smarter you prune: Towards efficient vision-language-action models via differentiable token pruning. arXiv preprint arXiv:2509.12594 (2025) 16 X. Li et al

  16. [16]

    arXiv preprint arXiv:2602.18397 (2026)

    Jiang, W., Clemons, J., Sankaralingam, K., Kozyrakis, C.: How fast can i run my vla? demystifying vla inference performance with vla-perf. arXiv preprint arXiv:2602.18397 (2026)

  17. [17]

    DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset

    Khazatsky, A., Pertsch, K., Nair, S., Balakrishna, A., Dasari, S., Karamcheti, S., Nasiriany, S., Srirama, M.K., Chen, L.Y., Ellis, K., et al.: Droid: A large-scale in-the-wild robot manipulation dataset. arXiv preprint arXiv:2403.12945 (2024)

  18. [18]

    OpenVLA: An Open-Source Vision-Language-Action Model

    Kim, M.J., Pertsch, K., Karamcheti, S., Xiao, T., Balakrishna, A., Nair, S., Rafailov, R., Foster, E., Lam, G., Sanketi, P., et al.: Openvla: An open-source vision-language-action model. arXiv preprint arXiv:2406.09246 (2024)

  19. [19]

    In: Conference on Robot Learning

    Kim, N., Kwon, O., Yoo, H., Choi, Y., Park, J., Oh, S.: Topological semantic graph memory for image-goal navigation. In: Conference on Robot Learning. pp. 393–402. PMLR (2023)

  20. [20]

    In: Proceedings of the 29th symposium on operating systems prin- ciples

    Kwon, W., Li, Z., Zhuang, S., Sheng, Y., Zheng, L., Yu, C.H., Gonzalez, J., Zhang, H., Stoica, I.: Efficient memory management for large language model serving with pagedattention. In: Proceedings of the 29th symposium on operating systems prin- ciples. pp. 611–626 (2023)

  21. [21]

    arXiv preprint arXiv:2602.04157 (2026)

    Lee, D.W., Gillet, S., Morency, L.P., Breazeal, C., Park, H.W.: A modern system recipe for situated embodied human-robot conversation with real-time multimodal llms and tool-calling. arXiv preprint arXiv:2602.04157 (2026)

  22. [22]

    CogACT: A Foundational Vision-Language-Action Model for Synergizing Cognition and Action in Robotic Manipulation

    Li, Q., Liang, Y., Wang, Z., Luo, L., Chen, X., Liao, M., Wei, F., Deng, Y., Xu, S., Zhang,Y.,etal.:Cogact:Afoundationalvision-language-actionmodelforsynergiz- ing cognition and action in robotic manipulation. arXiv preprint arXiv:2411.19650 (2024)

  23. [23]

    arXiv preprint arXiv:2506.12723 (2025)

    Li, Y., Meng, Y., Sun, Z., Ji, K., Tang, C., Fan, J., Ma, X., Xia, S., Wang, Z., Zhu, W.: Sp-vla: A joint model scheduling and token pruning approach for vla model acceleration. arXiv preprint arXiv:2506.12723 (2025)

  24. [24]

    Mixture-of-Transformers: A Sparse and Scalable Architecture for Multi-Modal Foundation Models

    Liang, W., Yu, L., Luo, L., Iyer, S., Dong, N., Zhou, C., Ghosh, G., Lewis, M., Yih, W.t., Zettlemoyer, L., et al.: Mixture-of-transformers: A sparse and scalable architecture for multi-modal foundation models. arXiv preprint arXiv:2411.04996 (2024)

  25. [25]

    Advances in Neural Information Processing Systems36, 44776–44791 (2023)

    Liu, B., Zhu, Y., Gao, C., Feng, Y., Liu, Q., Zhu, Y., Stone, P.: Libero: Benchmark- ing knowledge transfer for lifelong robot learning. Advances in Neural Information Processing Systems36, 44776–44791 (2023)

  26. [26]

    arXiv preprint arXiv:2411.02820 (2024)

    Liu, Y., Huang, Y., Yao, J., Feng, S., Gu, Z., Du, K., Li, H., Cheng, Y., Jiang, J., Lu, S., et al.: Droidspeak: Kv cache sharing for cross-llm communication and multi-llm serving. arXiv preprint arXiv:2411.02820 (2024)

  27. [27]

    arXiv preprint arXiv:2510.26742 (2025)

    Ma, Y., Zhou, Y., Yang, Y., Wang, T., Fan, H.: Running vlas at real-time speed. arXiv preprint arXiv:2510.26742 (2025)

  28. [28]

    OpenGalaxea: Galaxeavla: Galaxea’s open-source vla repository (2025),https: //github.com/OpenGalaxea/GalaxeaVLA, g0 PLUS Community License (Non- Commercial)

  29. [29]

    arXiv preprint arXiv:2412.01034 (2024)

    Park, S., Kim, H., Jeon, W., Yang, J., Jeon, B., Oh, Y., Choi, J.: Quantization- aware imitation-learning for resource-efficient robotic control. arXiv preprint arXiv:2412.01034 (2024)

  30. [30]

    FAST: Efficient Action Tokenization for Vision-Language-Action Models

    Pertsch, K., Stachowicz, K., Ichter, B., Driess, D., Nair, S., Vuong, Q., Mees, O., Finn, C., Levine, S.: Fast: Efficient action tokenization for vision-language-action models. arXiv preprint arXiv:2501.09747 (2025)

  31. [31]

    com / Physical - Intelligence/openpi, apache-2.0 license OxyGen: Unified KV Cache Management for Multi-Task VLA 17

    Physical Intelligence: openpi (2025),https : / / github . com / Physical - Intelligence/openpi, apache-2.0 license OxyGen: Unified KV Cache Management for Multi-Task VLA 17

  32. [32]

    In: Proceedings of the International Conference on Automated Planning and Scheduling

    Rajvanshi, A., Sikka, K., Lin, X., Lee, B., Chiu, H.P., Velasquez, A.: Saynav: Grounding large language models for dynamic planning to navigation in new envi- ronments. In: Proceedings of the International Conference on Automated Planning and Scheduling. vol. 34, pp. 464–474 (2024)

  33. [33]

    E., Otto, F., and Lioutikov, R

    Reuss, M., Zhou, H., Rühle, M., Yağmurlu, Ö.E., Otto, F., Lioutikov, R.: Flower: Democratizing generalist robot policies with efficient vision-language-action flow policies. arXiv preprint arXiv:2509.04996 (2025)

  34. [34]

    arXiv preprint (2026)

    Robotics, X.: Xiaomi-robotics-0: An open-sourced vision-language-action model with real-time execution. arXiv preprint (2026)

  35. [35]

    Shi, L.X., Ichter, B., Equi, M., Ke, L., Pertsch, K., Vuong, Q., Tanner, J., Walling, A., Wang, H., Fusai, N., et al.: Hi robot: Open-ended instruction following with hierarchicalvision-language-actionmodels.arXivpreprintarXiv:2502.19417(2025)

  36. [36]

    SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics

    Shukor, M., Aubakirova, D., Capuano, F., Kooijmans, P., Palma, S., Zoui- tine, A., Aractingi, M., Pascal, C., Russi, M., Marafioti, A., et al.: Smolvla: A vision-language-action model for affordable and efficient robotics. arXiv preprint arXiv:2506.01844 (2025)

  37. [37]

    arXiv preprint arXiv:2505.21200 (2025)

    Tan, X., Yang, Y., Ye, P., Zheng, J., Bai, B., Wang, X., Hao, J., Chen, T.: Think twice, act once: Token-aware compression and action reuse for efficient inference in vision-language-action models. arXiv preprint arXiv:2505.21200 (2025)

  38. [38]

    pdf, technical report, Physical Intelligence

    Torne,M.,Pertsch,K.,Walke,H.,Vedder,K.,Nair,S.,Ichter,B.,Ren,A.Z.,Wang, H., Tang, J., Stachowicz, K., Dhabalia, K., Equi, M., Vuong, Q., Springenberg, J.T., Levine, S., Finn, C., Driess, D.: Mem: Multi-scale embodied memory for vision language action models (2025),https://www.pi.website/download/Mem. pdf, technical report, Physical Intelligence

  39. [39]

    Bitvla: 1-bit vision-language-action models for robotics manipulation.arXiv preprint arXiv:2506.07530,

    Wang, H., Xiong, C., Wang, R., Chen, X.: Bitvla: 1-bit vision-language-action models for robotics manipulation. arXiv preprint arXiv:2506.07530 (2025)

  40. [40]

    DexVLA: Vision-Language Model with Plug-In Diffusion Expert for General Robot Control

    Wen, J., Zhu, Y., Li, J., Tang, Z., Shen, C., Feng, F.: Dexvla: Vision-language model with plug-in diffusion expert for general robot control. arXiv preprint arXiv:2502.05855 (2025)

  41. [41]

    X-Square-Robot: wall-x: Building general-purpose robots based on embodied foun- dation model (2025),https://github.com/X-Square-Robot/wall-x

  42. [42]

    Xiaomi Robotics: Xiaomi-robotics-0: An open-sourced vision-language-action model with real-time inference (2026),https://github.com/XiaomiRobotics/ Xiaomi-Robotics-0, apache-2.0 license

  43. [43]

    arXiv preprint arXiv:2502.02175 (2025)

    Xu, S., Wang, Y., Xia, C., Zhu, D., Huang, T., Xu, C.: Vla-cache: Efficient vision-language-action manipulation via adaptive token caching. arXiv preprint arXiv:2502.02175 (2025)

  44. [44]

    arXiv preprint arXiv:2509.21354 (2025)

    Xu, W., Zhuang, L., Shan, L.: Kv-efficient vla: A method to speed up vision lan- guage models with rnn-gated chunked kv cache. arXiv preprint arXiv:2509.21354 (2025)

  45. [45]

    arXiv preprint arXiv:2503.16525 (2025)

    Yang, H., Zhang, R., Huang, M., Wang, W., Tang, Y., Li, Y., Liu, Y., Zhang, D.: Kvshare: An llm service system with efficient and effective multi-tenant kv cache reuse. arXiv preprint arXiv:2503.16525 (2025)

  46. [46]

    Efficientvla: Training-free acceleration and compression for vision-language- action models.arXiv preprint arXiv:2506.10100,

    Yang, Y., Wang, Y., Wen, Z., Zhongwei, L., Zou, C., Zhang, Z., Wen, C., Zhang, L.: Efficientvla: Training-free acceleration and compression for vision-language-action models. arXiv preprint arXiv:2506.10100 (2025)

  47. [47]

    In: Proceedings of the twentieth European conference on computer systems

    Yao, J., Li, H., Liu, Y., Ray, S., Cheng, Y., Zhang, Q., Du, K., Lu, S., Jiang, J.: Cacheblend: Fast large language model serving for rag with cached knowledge fusion. In: Proceedings of the twentieth European conference on computer systems. pp. 94–109 (2025) 18 X. Li et al

  48. [48]

    Advances in Neural Information Processing Systems37, 56619–56643 (2024)

    Yue, Y., Wang, Y., Kang, B., Han, Y., Wang, S., Song, S., Feng, J., Huang, G.: Deer-vla:Dynamicinferenceofmultimodallargelanguagemodelsforefficientrobot execution. Advances in Neural Information Processing Systems37, 56619–56643 (2024)

  49. [49]

    Igniting vlms toward the embodied space.arXiv preprint arXiv:2509.11766, 2025

    Zhai, A., Liu, B., Fang, B., Cai, C., Ma, E., Yin, E., Wang, H., Zhou, H., Wang, J., Shi, L., et al.: Igniting vlms toward the embodied space. arXiv preprint arXiv:2509.11766 (2025)

  50. [50]

    arXiv preprint arXiv:2503.20384 (2025)

    Zhang, R., Dong, M., Zhang, Y., Heng, L., Chi, X., Dai, G., Du, L., Du, Y., Zhang, S.: Mole-vla: Dynamic layer-skipping vision language action model via mixture-of- layers for efficient robot manipulation. arXiv preprint arXiv:2503.20384 (2025)

  51. [51]

    Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware

    Zhao, T.Z., Kumar, V., Levine, S., Finn, C.: Learning fine-grained bimanual ma- nipulation with low-cost hardware. arXiv preprint arXiv:2304.13705 (2023)

  52. [52]

    arXiv preprint arXiv:2512.24673 (2025)

    Zhao, Y., Zhao, L., Cheng, B., Yao, G., Wen, X., Gao, H.: Vla-rail: A real- time asynchronous inference linker for vla models and robots. arXiv preprint arXiv:2512.24673 (2025)

  53. [53]

    Advances in neural information processing systems37, 62557– 62583 (2024)

    Zheng, L., Yin, L., Xie, Z., Sun, C.L., Huang, J., Yu, C.H., Cao, S., Kozyrakis, C., Stoica, I., Gonzalez, J.E., et al.: Sglang: Efficient execution of structured language model programs. Advances in neural information processing systems37, 62557– 62583 (2024)

  54. [54]

    In: Conference on Robot Learning

    Zitkovich, B., Yu, T., Xu, S., Xu, P., Xiao, T., Xia, F., Wu, J., Wohlhart, P., Welker, S., Wahid, A., et al.: Rt-2: Vision-language-action models transfer web knowledge to robotic control. In: Conference on Robot Learning. pp. 2165–2183. PMLR (2023)