OxyGen: Unified KV Cache Management for VLA Inference under Multi-Task Parallelism
Pith reviewed 2026-05-21 11:41 UTC · model grok-4.3
The pith
Unified KV cache management lets VLA models run multiple tasks in parallel by sharing observation prefill and batching language decoding with action cycles.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Isolated KV cache management is the root cause of redundant computation and resource contention in multi-task VLA inference. Treating the KV cache as a first-class shared resource across tasks and over time enables cross-task KV sharing, which eliminates redundant prefill of shared observations, and cross-frame continuous batching, which decouples variable-length language decoding from fixed-rate action generation. When implemented for the π_{0.5} MoT VLA, the approach produces up to 3.7× speedup over isolated execution while delivering over 200 tokens/s language throughput and 70 Hz action frequency simultaneously on both RTX 4090 and Jetson AGX Thor without degrading action quality.
What carries the argument
unified KV cache management, an abstraction that treats the KV cache as a shared resource across tasks and time to support cross-task KV sharing for redundant prefill elimination and cross-frame continuous batching for rate decoupling
If this is right
- Language and action outputs run at high rates together from one model instance.
- Redundant computation on repeated observations drops sharply across tasks.
- The same design works on both workstation GPUs and edge platforms like Jetson AGX Thor.
- Real-robot validation confirms the throughput and frequency numbers hold in hardware.
Where Pith is reading between the lines
- The same sharing idea could apply to other models that emit outputs at mismatched rates, such as multi-modal generation systems.
- Lower memory footprint from shared KV entries might extend battery life on mobile robots.
- Dynamic adjustment of batch windows could further optimize for sudden changes in task priority.
- The approach may combine with model compression techniques to reach even higher frequencies on smaller chips.
Load-bearing premise
Cross-task KV sharing and cross-frame batching can be applied to the MoT architecture without creating state inconsistencies or output errors when tasks have different time constraints.
What would settle it
Execute the system on a mix of manipulation, conversation, and memory tasks from the same observation stream and check whether action success rates or language coherence fall below the isolated baseline.
Figures
read the original abstract
Embodied AI agents increasingly require parallel execution of multiple tasks, such as manipulation, conversation, and memory construction, from shared observations under distinct time constraints. Recent Mixture-of-Transformers (MoT) Vision-Language-Action Models (VLAs) architecturally support such heterogeneous outputs, yet existing inference systems fail to achieve efficient multi-task parallelism for on-device deployment because of redundant computation and resource contention. We identify isolated KV cache management as the root cause. To address this, we propose unified KV cache management, an inference design that treats the KV cache as a first-class shared resource across tasks and over time. This abstraction enables two key optimizations: cross-task KV sharing eliminates redundant prefill of shared observations, while cross-frame continuous batching decouples variable-length language decoding from fixed-rate action generation across control cycles. We implement this design for $\pi_{0.5}$, a popular MoT VLA, and evaluate it on both NVIDIA GeForce RTX 4090 and Jetson AGX Thor, two representative platforms for on-device VLA inference. OxyGen achieves up to 3.7$\times$ speedup over isolated execution, delivering over 200 tokens/s language throughput and 70 Hz action frequency simultaneously without degrading action quality, and we further validate the gains on a real humanoid robot with on-board Jetson AGX Thor.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes OxyGen, a unified KV cache management design for efficient multi-task inference in Mixture-of-Transformers (MoT) Vision-Language-Action (VLA) models such as π_{0.5}. It treats the KV cache as a shared resource to enable cross-task KV sharing (eliminating redundant prefill of shared observations) and cross-frame continuous batching (decoupling variable-length language decoding from fixed-rate action generation). Implemented and evaluated on NVIDIA RTX 4090 and Jetson AGX Thor, OxyGen reports up to 3.7× speedup over isolated execution, >200 tokens/s language throughput, and 70 Hz action frequency simultaneously, without degrading action quality, with additional validation on a real humanoid robot.
Significance. If the central claims hold, the work would be significant for on-device deployment of multi-task VLAs, as it directly tackles redundant computation and resource contention in parallel execution of heterogeneous tasks (manipulation, conversation, memory) from shared observations. The concrete speedups, throughput numbers, and real-robot validation on representative hardware platforms constitute practical contributions. The engineering focus on KV cache as a first-class abstraction is a clear strength, though the absence of detailed experimental protocols in the provided abstract limits immediate assessment of reproducibility.
major comments (2)
- [Section 3 (cross-task KV sharing)] The description of cross-task KV sharing (around the implementation for π_{0.5}'s MoT architecture) does not specify the concrete mechanisms—such as per-task attention masks, position ID offsets, or explicit cache segmentation—that prevent attention leakage or hidden-state inconsistencies between tasks with mismatched generation rates and output heads. This is load-bearing for the claim that action quality remains undegraded while achieving simultaneous 200 tokens/s and 70 Hz performance.
- [Evaluation section / results table] Table or figure reporting the 3.7× speedup and quality metrics lacks explicit baselines, error bars, number of runs, or data exclusion criteria. Without these, it is difficult to verify that the gains are not artifacts of particular task pairings or hardware configurations, directly affecting confidence in the multi-task parallelism results.
minor comments (2)
- [Section 3] Notation for the two optimizations (cross-task sharing vs. cross-frame batching) could be introduced with a small diagram or pseudocode to clarify how they interact over control cycles.
- [Abstract and §4] The abstract and evaluation should explicitly state the isolated-execution baseline (e.g., separate processes or sequential scheduling) used for the 3.7× comparison.
Simulated Author's Rebuttal
We thank the referee for their thoughtful review and positive evaluation of the significance of our work on unified KV cache management for multi-task VLA inference. We have carefully considered the major comments and provide point-by-point responses below. Where appropriate, we have made revisions to the manuscript to address the concerns raised.
read point-by-point responses
-
Referee: [Section 3 (cross-task KV sharing)] The description of cross-task KV sharing (around the implementation for π_{0.5}'s MoT architecture) does not specify the concrete mechanisms—such as per-task attention masks, position ID offsets, or explicit cache segmentation—that prevent attention leakage or hidden-state inconsistencies between tasks with mismatched generation rates and output heads. This is load-bearing for the claim that action quality remains undegraded while achieving simultaneous 200 tokens/s and 70 Hz performance.
Authors: We agree that a more detailed exposition of the mechanisms is essential for clarity and to substantiate the claims regarding action quality preservation. In the revised version of the manuscript, we have augmented Section 3 with explicit descriptions of the implementation details for cross-task KV sharing in the context of π_{0.5}'s MoT architecture. This includes the use of per-task attention masks to isolate computations, adjustments to position ID offsets to handle mismatched generation rates, and explicit cache segmentation to prevent hidden-state inconsistencies. These additions ensure that the shared KV cache does not lead to attention leakage across tasks, thereby supporting the reported performance without degradation in action quality. We believe this revision addresses the concern directly. revision: yes
-
Referee: [Evaluation section / results table] Table or figure reporting the 3.7× speedup and quality metrics lacks explicit baselines, error bars, number of runs, or data exclusion criteria. Without these, it is difficult to verify that the gains are not artifacts of particular task pairings or hardware configurations, directly affecting confidence in the multi-task parallelism results.
Authors: We acknowledge that the original presentation of results could benefit from additional statistical rigor and transparency. Accordingly, we have revised the evaluation section to include a more comprehensive table that specifies the baselines used (such as isolated task execution and alternative batching strategies), reports error bars based on multiple runs (specifically, mean and standard deviation over 10 independent trials per configuration), details the number of runs, and outlines the data exclusion criteria (e.g., runs affected by transient hardware issues were excluded and noted). These changes aim to provide greater confidence that the observed speedups and throughput metrics are robust and not dependent on specific task pairings or hardware setups. We have also added a discussion on reproducibility protocols. revision: yes
Circularity Check
No circularity: empirical engineering results independent of inputs
full rationale
The paper presents a systems-level inference optimization for unified KV cache management in MoT-based VLAs, with performance claims (3.7× speedup, 200 tokens/s, 70 Hz) arising from direct implementation and hardware measurement on RTX 4090 and Jetson AGX Thor rather than any first-principles derivation, fitted parameter, or self-referential definition. No equations, predictions, or uniqueness theorems are shown that reduce to the paper's own inputs by construction; cross-task sharing and continuous batching are described as explicit design changes whose correctness is validated externally via action-quality preservation on a real robot. The work is self-contained against external benchmarks with no load-bearing self-citations or ansatz smuggling evident in the provided text.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption MoT VLAs architecturally support heterogeneous outputs for multiple tasks from shared observations under distinct time constraints.
Forward citations
Cited by 1 Pith paper
-
DEFLECT: Delay-Robust Execution via Flow-matching Likelihood-Estimated Counterfactual Tuning for VLA Policies
DEFLECT is an offline post-training method that improves async VLA policy success rates under high inference delays by using flow-matching likelihood ratios on counterfactual fresh/stale action pairs from a frozen ref...
Reference graph
Works this paper leans on
-
[1]
1X Technologies: NEO Home Robot: The World’s First Consumer-Ready Hu- manoid Robot.https://www.1x.tech/neo(2026), accessed: 2026-02-14
work page 2026
-
[2]
In: 2025 IEEE International Conference on Robotics and Automation (ICRA)
Anwar, A., Welsh, J., Biswas, J., Pouya, S., Chang, Y.: Remembr: Building and reasoning over long-horizon spatio-temporal memory for robot navigation. In: 2025 IEEE International Conference on Robotics and Automation (ICRA). pp. 2838–
work page 2025
-
[3]
Motus: A Unified Latent Action World Model
Bi, H., Tan, H., Xie, S., Wang, Z., Huang, S., Liu, H., Zhao, R., Feng, Y., Xiang, C., Rong, Y., et al.: Motus: A unified latent action world model. arXiv preprint arXiv:2512.13030 (2025)
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[4]
GR00T N1: An Open Foundation Model for Generalist Humanoid Robots
Bjorck, J., Castañeda, F., Cherniadev, N., Da, X., Ding, R., Fan, L., Fang, Y., Fox, D., Hu, F., Huang, S., et al.: Gr00t n1: An open foundation model for generalist humanoid robots. arXiv preprint arXiv:2503.14734 (2025)
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[5]
$\pi_0$: A Vision-Language-Action Flow Model for General Robot Control
Black, K., Brown, N., Driess, D., Esmail, A., Equi, M., Finn, C., Fusai, N., Groom, L., Hausman, K., Ichter, B., et al.:π0: A vision-language-action flow model for general robot control. arXiv preprint arXiv:2410.24164 (2024)
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[6]
Real-Time Execution of Action Chunking Flow Policies
Black, K., Galliker, M.Y., Levine, S.: Real-time execution of action chunking flow policies. arXiv preprint arXiv:2506.07339 (2025)
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[7]
UniVLA: Learning to Act Anywhere with Task-centric Latent Actions
Bu, Q., Yang, Y., Cai, J., Gao, S., Ren, G., Yao, M., Luo, P., Li, H.: Uni- vla: Learning to act anywhere with task-centric latent actions. arXiv preprint arXiv:2505.06111 (2025)
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[8]
Rynnvla-002: A unified vision-language-action and world model.arXiv preprint arXiv:2511.17502, 2025a
Cen, J., Huang, S., Yuan, Y., Li, K., Yuan, H., Yu, C., Jiang, Y., Guo, J., Li, X., Luo, H., et al.: Rynnvla-002: A unified vision-language-action and world model. arXiv preprint arXiv:2511.17502 (2025)
-
[9]
arXiv preprint arXiv:2509.09090 (2025)
Fang, H., Liu, Y., Du, Y., Du, L., Yang, H.: Sqap-vla: A synergistic quantization- aware pruning framework for high-performance vision-language-action models. arXiv preprint arXiv:2509.09090 (2025)
-
[10]
Figure AI: Introducing Figure 03: Third Generation Humanoid Robot.https: //www.figure.ai/news/introducing-figure-03(Oct 2025), accessed: 2026-02- 14
work page 2025
-
[11]
arXiv preprint arXiv:2510.03215 (2025)
Fu, T., Min, Z., Zhang, H., Yan, J., Dai, G., Ouyang, W., Wang, Y.: Cache- to-cache: Direct semantic communication between large language models. arXiv preprint arXiv:2510.03215 (2025)
-
[12]
In: 2024 IEEE Inter- national Conference on Robotics and Automation (ICRA)
Gu, Q., Kuwajerwala, A., Morin, S., Jatavallabhula, K.M., Sen, B., Agarwal, A., Rivera, C., Paul, W., Ellis, K., Chellappa, R., et al.: Conceptgraphs: Open- vocabulary 3d scene graphs for perception and planning. In: 2024 IEEE Inter- national Conference on Robotics and Automation (ICRA). pp. 5021–5028. IEEE (2024)
work page 2024
-
[13]
$\pi_{0.5}$: a Vision-Language-Action Model with Open-World Generalization
Intelligence, P., Black, K., Brown, N., Darpinian, J., Dhabalia, K., Driess, D., Esmail, A., Equi, M., Finn, C., Fusai, N., et al.:π 0.5: a vision-language-action model with open-world generalization. arXiv preprint arXiv:2504.16054 (2025)
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[14]
arXiv preprint arXiv:2509.00576 (2025)
Jiang, T., Yuan, T., Liu, Y., Lu, C., Cui, J., Liu, X., Cheng, S., Gao, J., Xu, H., Zhao, H.: Galaxea open-world dataset and g0 dual-system vla model. arXiv preprint arXiv:2509.00576 (2025)
-
[15]
arXiv preprint arXiv:2509.12594 (2025) 16 X
Jiang, T., Jiang, X., Ma, Y., Wen, X., Li, B., Zhan, K., Jia, P., Liu, Y., Sun, S., Lang, X.: The better you learn, the smarter you prune: Towards efficient vision-language-action models via differentiable token pruning. arXiv preprint arXiv:2509.12594 (2025) 16 X. Li et al
-
[16]
arXiv preprint arXiv:2602.18397 (2026)
Jiang, W., Clemons, J., Sankaralingam, K., Kozyrakis, C.: How fast can i run my vla? demystifying vla inference performance with vla-perf. arXiv preprint arXiv:2602.18397 (2026)
-
[17]
DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset
Khazatsky, A., Pertsch, K., Nair, S., Balakrishna, A., Dasari, S., Karamcheti, S., Nasiriany, S., Srirama, M.K., Chen, L.Y., Ellis, K., et al.: Droid: A large-scale in-the-wild robot manipulation dataset. arXiv preprint arXiv:2403.12945 (2024)
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[18]
OpenVLA: An Open-Source Vision-Language-Action Model
Kim, M.J., Pertsch, K., Karamcheti, S., Xiao, T., Balakrishna, A., Nair, S., Rafailov, R., Foster, E., Lam, G., Sanketi, P., et al.: Openvla: An open-source vision-language-action model. arXiv preprint arXiv:2406.09246 (2024)
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[19]
In: Conference on Robot Learning
Kim, N., Kwon, O., Yoo, H., Choi, Y., Park, J., Oh, S.: Topological semantic graph memory for image-goal navigation. In: Conference on Robot Learning. pp. 393–402. PMLR (2023)
work page 2023
-
[20]
In: Proceedings of the 29th symposium on operating systems prin- ciples
Kwon, W., Li, Z., Zhuang, S., Sheng, Y., Zheng, L., Yu, C.H., Gonzalez, J., Zhang, H., Stoica, I.: Efficient memory management for large language model serving with pagedattention. In: Proceedings of the 29th symposium on operating systems prin- ciples. pp. 611–626 (2023)
work page 2023
-
[21]
arXiv preprint arXiv:2602.04157 (2026)
Lee, D.W., Gillet, S., Morency, L.P., Breazeal, C., Park, H.W.: A modern system recipe for situated embodied human-robot conversation with real-time multimodal llms and tool-calling. arXiv preprint arXiv:2602.04157 (2026)
-
[22]
Li, Q., Liang, Y., Wang, Z., Luo, L., Chen, X., Liao, M., Wei, F., Deng, Y., Xu, S., Zhang,Y.,etal.:Cogact:Afoundationalvision-language-actionmodelforsynergiz- ing cognition and action in robotic manipulation. arXiv preprint arXiv:2411.19650 (2024)
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[23]
arXiv preprint arXiv:2506.12723 (2025)
Li, Y., Meng, Y., Sun, Z., Ji, K., Tang, C., Fan, J., Ma, X., Xia, S., Wang, Z., Zhu, W.: Sp-vla: A joint model scheduling and token pruning approach for vla model acceleration. arXiv preprint arXiv:2506.12723 (2025)
-
[24]
Mixture-of-Transformers: A Sparse and Scalable Architecture for Multi-Modal Foundation Models
Liang, W., Yu, L., Luo, L., Iyer, S., Dong, N., Zhou, C., Ghosh, G., Lewis, M., Yih, W.t., Zettlemoyer, L., et al.: Mixture-of-transformers: A sparse and scalable architecture for multi-modal foundation models. arXiv preprint arXiv:2411.04996 (2024)
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[25]
Advances in Neural Information Processing Systems36, 44776–44791 (2023)
Liu, B., Zhu, Y., Gao, C., Feng, Y., Liu, Q., Zhu, Y., Stone, P.: Libero: Benchmark- ing knowledge transfer for lifelong robot learning. Advances in Neural Information Processing Systems36, 44776–44791 (2023)
work page 2023
-
[26]
arXiv preprint arXiv:2411.02820 (2024)
Liu, Y., Huang, Y., Yao, J., Feng, S., Gu, Z., Du, K., Li, H., Cheng, Y., Jiang, J., Lu, S., et al.: Droidspeak: Kv cache sharing for cross-llm communication and multi-llm serving. arXiv preprint arXiv:2411.02820 (2024)
-
[27]
arXiv preprint arXiv:2510.26742 (2025)
Ma, Y., Zhou, Y., Yang, Y., Wang, T., Fan, H.: Running vlas at real-time speed. arXiv preprint arXiv:2510.26742 (2025)
-
[28]
OpenGalaxea: Galaxeavla: Galaxea’s open-source vla repository (2025),https: //github.com/OpenGalaxea/GalaxeaVLA, g0 PLUS Community License (Non- Commercial)
work page 2025
-
[29]
arXiv preprint arXiv:2412.01034 (2024)
Park, S., Kim, H., Jeon, W., Yang, J., Jeon, B., Oh, Y., Choi, J.: Quantization- aware imitation-learning for resource-efficient robotic control. arXiv preprint arXiv:2412.01034 (2024)
-
[30]
FAST: Efficient Action Tokenization for Vision-Language-Action Models
Pertsch, K., Stachowicz, K., Ichter, B., Driess, D., Nair, S., Vuong, Q., Mees, O., Finn, C., Levine, S.: Fast: Efficient action tokenization for vision-language-action models. arXiv preprint arXiv:2501.09747 (2025)
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[31]
Physical Intelligence: openpi (2025),https : / / github . com / Physical - Intelligence/openpi, apache-2.0 license OxyGen: Unified KV Cache Management for Multi-Task VLA 17
work page 2025
-
[32]
In: Proceedings of the International Conference on Automated Planning and Scheduling
Rajvanshi, A., Sikka, K., Lin, X., Lee, B., Chiu, H.P., Velasquez, A.: Saynav: Grounding large language models for dynamic planning to navigation in new envi- ronments. In: Proceedings of the International Conference on Automated Planning and Scheduling. vol. 34, pp. 464–474 (2024)
work page 2024
-
[33]
E., Otto, F., and Lioutikov, R
Reuss, M., Zhou, H., Rühle, M., Yağmurlu, Ö.E., Otto, F., Lioutikov, R.: Flower: Democratizing generalist robot policies with efficient vision-language-action flow policies. arXiv preprint arXiv:2509.04996 (2025)
-
[34]
Robotics, X.: Xiaomi-robotics-0: An open-sourced vision-language-action model with real-time execution. arXiv preprint (2026)
work page 2026
-
[35]
Shi, L.X., Ichter, B., Equi, M., Ke, L., Pertsch, K., Vuong, Q., Tanner, J., Walling, A., Wang, H., Fusai, N., et al.: Hi robot: Open-ended instruction following with hierarchicalvision-language-actionmodels.arXivpreprintarXiv:2502.19417(2025)
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[36]
SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics
Shukor, M., Aubakirova, D., Capuano, F., Kooijmans, P., Palma, S., Zoui- tine, A., Aractingi, M., Pascal, C., Russi, M., Marafioti, A., et al.: Smolvla: A vision-language-action model for affordable and efficient robotics. arXiv preprint arXiv:2506.01844 (2025)
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[37]
arXiv preprint arXiv:2505.21200 (2025)
Tan, X., Yang, Y., Ye, P., Zheng, J., Bai, B., Wang, X., Hao, J., Chen, T.: Think twice, act once: Token-aware compression and action reuse for efficient inference in vision-language-action models. arXiv preprint arXiv:2505.21200 (2025)
-
[38]
pdf, technical report, Physical Intelligence
Torne,M.,Pertsch,K.,Walke,H.,Vedder,K.,Nair,S.,Ichter,B.,Ren,A.Z.,Wang, H., Tang, J., Stachowicz, K., Dhabalia, K., Equi, M., Vuong, Q., Springenberg, J.T., Levine, S., Finn, C., Driess, D.: Mem: Multi-scale embodied memory for vision language action models (2025),https://www.pi.website/download/Mem. pdf, technical report, Physical Intelligence
work page 2025
-
[39]
Wang, H., Xiong, C., Wang, R., Chen, X.: Bitvla: 1-bit vision-language-action models for robotics manipulation. arXiv preprint arXiv:2506.07530 (2025)
-
[40]
DexVLA: Vision-Language Model with Plug-In Diffusion Expert for General Robot Control
Wen, J., Zhu, Y., Li, J., Tang, Z., Shen, C., Feng, F.: Dexvla: Vision-language model with plug-in diffusion expert for general robot control. arXiv preprint arXiv:2502.05855 (2025)
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[41]
X-Square-Robot: wall-x: Building general-purpose robots based on embodied foun- dation model (2025),https://github.com/X-Square-Robot/wall-x
work page 2025
-
[42]
Xiaomi Robotics: Xiaomi-robotics-0: An open-sourced vision-language-action model with real-time inference (2026),https://github.com/XiaomiRobotics/ Xiaomi-Robotics-0, apache-2.0 license
work page 2026
-
[43]
arXiv preprint arXiv:2502.02175 (2025)
Xu, S., Wang, Y., Xia, C., Zhu, D., Huang, T., Xu, C.: Vla-cache: Efficient vision-language-action manipulation via adaptive token caching. arXiv preprint arXiv:2502.02175 (2025)
-
[44]
arXiv preprint arXiv:2509.21354 (2025)
Xu, W., Zhuang, L., Shan, L.: Kv-efficient vla: A method to speed up vision lan- guage models with rnn-gated chunked kv cache. arXiv preprint arXiv:2509.21354 (2025)
-
[45]
arXiv preprint arXiv:2503.16525 (2025)
Yang, H., Zhang, R., Huang, M., Wang, W., Tang, Y., Li, Y., Liu, Y., Zhang, D.: Kvshare: An llm service system with efficient and effective multi-tenant kv cache reuse. arXiv preprint arXiv:2503.16525 (2025)
-
[46]
Yang, Y., Wang, Y., Wen, Z., Zhongwei, L., Zou, C., Zhang, Z., Wen, C., Zhang, L.: Efficientvla: Training-free acceleration and compression for vision-language-action models. arXiv preprint arXiv:2506.10100 (2025)
-
[47]
In: Proceedings of the twentieth European conference on computer systems
Yao, J., Li, H., Liu, Y., Ray, S., Cheng, Y., Zhang, Q., Du, K., Lu, S., Jiang, J.: Cacheblend: Fast large language model serving for rag with cached knowledge fusion. In: Proceedings of the twentieth European conference on computer systems. pp. 94–109 (2025) 18 X. Li et al
work page 2025
-
[48]
Advances in Neural Information Processing Systems37, 56619–56643 (2024)
Yue, Y., Wang, Y., Kang, B., Han, Y., Wang, S., Song, S., Feng, J., Huang, G.: Deer-vla:Dynamicinferenceofmultimodallargelanguagemodelsforefficientrobot execution. Advances in Neural Information Processing Systems37, 56619–56643 (2024)
work page 2024
-
[49]
Igniting vlms toward the embodied space.arXiv preprint arXiv:2509.11766, 2025
Zhai, A., Liu, B., Fang, B., Cai, C., Ma, E., Yin, E., Wang, H., Zhou, H., Wang, J., Shi, L., et al.: Igniting vlms toward the embodied space. arXiv preprint arXiv:2509.11766 (2025)
-
[50]
arXiv preprint arXiv:2503.20384 (2025)
Zhang, R., Dong, M., Zhang, Y., Heng, L., Chi, X., Dai, G., Du, L., Du, Y., Zhang, S.: Mole-vla: Dynamic layer-skipping vision language action model via mixture-of- layers for efficient robot manipulation. arXiv preprint arXiv:2503.20384 (2025)
-
[51]
Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware
Zhao, T.Z., Kumar, V., Levine, S., Finn, C.: Learning fine-grained bimanual ma- nipulation with low-cost hardware. arXiv preprint arXiv:2304.13705 (2023)
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[52]
arXiv preprint arXiv:2512.24673 (2025)
Zhao, Y., Zhao, L., Cheng, B., Yao, G., Wen, X., Gao, H.: Vla-rail: A real- time asynchronous inference linker for vla models and robots. arXiv preprint arXiv:2512.24673 (2025)
-
[53]
Advances in neural information processing systems37, 62557– 62583 (2024)
Zheng, L., Yin, L., Xie, Z., Sun, C.L., Huang, J., Yu, C.H., Cao, S., Kozyrakis, C., Stoica, I., Gonzalez, J.E., et al.: Sglang: Efficient execution of structured language model programs. Advances in neural information processing systems37, 62557– 62583 (2024)
work page 2024
-
[54]
In: Conference on Robot Learning
Zitkovich, B., Yu, T., Xu, S., Xu, P., Xiao, T., Xia, F., Wu, J., Wohlhart, P., Welker, S., Wahid, A., et al.: Rt-2: Vision-language-action models transfer web knowledge to robotic control. In: Conference on Robot Learning. pp. 2165–2183. PMLR (2023)
work page 2023
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.