pith. machine review for the scientific record.

arxiv: 2605.13276 · v2 · submitted 2026-05-13 · 💻 cs.AI · cs.RO

Recognition: 2 theorem links

· Lean Theorem

D-VLA: A High-Concurrency Distributed Asynchronous Reinforcement Learning Framework for Vision-Language-Action Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 06:00 UTC · model grok-4.3

classification 💻 cs.AI cs.RO
keywords distributed reinforcement learning · vision-language-action models · plane decoupling · swimlane pipeline · embodied AI · asynchronous RL · throughput optimization · linear speedup

The pith

D-VLA decouples simulation and optimization planes to deliver linear speedup for trillion-parameter vision-language-action models in distributed RL.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper identifies resource conflicts between high-fidelity physical simulation and the heavy VRAM and bandwidth needs of deep learning as the main limiter on throughput when applying reinforcement learning to large VLA models. It proposes D-VLA, which introduces Plane Decoupling to physically isolate high-frequency training data from low-frequency weight control, plus a four-thread Swimlane pipeline that overlaps sampling, inference, gradient computation, and parameter distribution. Dual-pool VRAM management and topology-aware replication further reduce fragmentation and communication overhead. Experiments on LIBERO benchmarks report higher throughput and sampling efficiency than mainstream frameworks for billion-parameter models, with linear speedup and stability holding in trillion-parameter tests. A sympathetic reader would care because this setup could make training of embodied agents that integrate vision, language, and action practical at scale.
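The four-thread Swimlane pipeline described above can be sketched as a bounded-queue pipeline in which each stage runs in its own thread and overlaps with its neighbours. This is a minimal illustration of the pattern, not the paper's implementation; the lane names, queue sizes, and batch representation are assumptions.

```python
import queue
import threading

def lane(inbox, outbox, work):
    """One swimlane thread: pull a batch, transform it, push it downstream."""
    while True:
        item = inbox.get()
        if item is None:                 # shutdown signal propagates downstream
            if outbox is not None:
                outbox.put(None)
            return
        out = work(item)
        if outbox is not None:
            outbox.put(out)

def run_pipeline(num_batches):
    # Bounded queues decouple the lanes: each stage overlaps with its
    # neighbours instead of the whole system alternating between phases.
    q_infer, q_grad, q_dist = (queue.Queue(maxsize=4) for _ in range(3))
    delivered = []

    stages = [
        (q_infer, q_grad, lambda b: b + ["infer"]),        # inference
        (q_grad,  q_dist, lambda b: b + ["grad"]),         # gradient computation
        (q_dist,  None,   lambda b: delivered.append(b)),  # parameter distribution
    ]
    threads = [threading.Thread(target=lane, args=s) for s in stages]
    for t in threads:
        t.start()

    # The sampling lane drives the pipe, feeding rollout batches in order.
    for i in range(num_batches):
        q_infer.put([f"batch{i}", "sample"])
    q_infer.put(None)                    # end of epoch
    for t in threads:
        t.join()
    return delivered
```

With real workloads each `work` function would hold a GPU stream; the point of the pattern is that the queues, not a global step barrier, mediate the speed mismatch between stages.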

Core claim

D-VLA is a high-concurrency distributed asynchronous RL framework for large-scale embodied foundation models that uses Plane Decoupling to physically isolate high-frequency training data from low-frequency weight control, eliminating interference between simulation and optimization. A four-thread Swimlane pipeline enables full parallel overlap of sampling, inference, gradient computation, and parameter distribution. Dual-pool VRAM management and topology-aware replication address memory fragmentation and communication efficiency. On LIBERO benchmarks the framework outperforms mainstream RL systems in throughput and sampling efficiency for billion-parameter VLA models, while trillion-parameter scalability tests maintain exceptional stability and linear speedup.
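The dual-pool VRAM claim rests on separating allocation lifetimes: long-lived weights never interleave with transient rollout tensors, so neither fragments the other. A toy model of that idea (our assumption about the mechanism, not the paper's allocator; all sizes are illustrative):

```python
class DualPool:
    """Toy dual-pool VRAM manager: a static pool for long-lived weights
    and a transient pool that is reset wholesale each step, so the two
    allocation lifetimes never interleave and fragment each other."""

    def __init__(self, static_mb, transient_mb):
        self.static_cap, self.static_used = static_mb, 0
        self.transient_cap, self.transient_used = transient_mb, 0

    def alloc_weights(self, mb):
        # Bump-allocate from the static pool; these live for the whole run.
        if self.static_used + mb > self.static_cap:
            raise MemoryError("static pool exhausted")
        offset = self.static_used
        self.static_used += mb
        return ("static", offset)

    def alloc_transient(self, mb):
        # Bump-allocate rollout/activation buffers; freed en masse later.
        if self.transient_used + mb > self.transient_cap:
            raise MemoryError("transient pool exhausted")
        offset = self.transient_used
        self.transient_used += mb
        return ("transient", offset)

    def end_step(self):
        # All transient allocations die together: a single pointer reset
        # leaves no holes between steps, which is the anti-fragmentation win.
        self.transient_used = 0
```

The design choice being modelled: fragmentation arises when short- and long-lived blocks share one heap; two bump pools make deallocation O(1) and hole-free by construction.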

What carries the argument

Plane Decoupling, which physically isolates high-frequency training data from low-frequency weight control to eliminate interference between simulation and optimization.

Load-bearing premise

Physically isolating high-frequency training data from low-frequency weight control via Plane Decoupling and the Swimlane pipeline will eliminate interference and produce linear speedup without creating new bottlenecks on real distributed hardware.

What would settle it

Running D-VLA on a real distributed cluster with a billion-parameter VLA model and measuring throughput as nodes are added would settle the central claim: sub-linear gains or the emergence of new communication bottlenecks would disprove it, while sustained near-linear speedup would support it.
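The test above reduces to one ratio: throughput per worker, normalized to the smallest configuration. A small sketch of that check (the measurement numbers are hypothetical):

```python
def speedup_efficiency(workers, throughput):
    """Parallel efficiency relative to the smallest configuration.

    Linear speedup means throughput/workers stays flat (efficiency ~1.0);
    a falling ratio signals a new bottleneck, e.g. the parameter-
    distribution leg saturating the network.
    """
    base = throughput[0] / workers[0]
    return [(t / w) / base for w, t in zip(workers, throughput)]

# Hypothetical measurements: steps/sec at 1, 2, 4, 8 rollout nodes.
eff = speedup_efficiency([1, 2, 4, 8], [100, 198, 380, 640])
```

Here efficiency drifts from 1.0 down to 0.8 at 8 nodes, exactly the kind of degradation curve that would falsify a linear-speedup claim.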

Figures

Figures reproduced from arXiv: 2605.13276 by Haodong Yue, Haoran Sun, Junwu Xiong, Luqiao Wang, Shuai Di, Wen Huang, Xiaolong Xiang, Yicheng Gong, Yongjian Guo, Yucheng Guo, Zhen Sun, Zhong Guan.

Figure 1
Figure 1. Placement Strategies across Different Training Frameworks. Embodied AI, regarded as a pivotal pathway toward Artificial General Intelligence (AGI), is undergoing a profound paradigm shift driven by the emergence of Vision-Language-Action (VLA) models [1, 2, 3, 4, 5] such as OpenVLA [4], π0 [2], and GR00T [6]. These models achieve a significant transition from manually designed explicit models to data-dr… view at source ↗
Figure 2
Figure 2. The D-VLA Framework: Overview of the asynchronous embodied RL training architecture D-VLA. The GPU pool is partitioned into rollout workers and actor workers. Rollout GPUs co-locate PhysX-accelerated parallel environments with a frozen inference policy copy, eliminating inter-process observation transfer and model offload overhead. Upon completing a fixed-horizon rollout epoch, trajectory data is dispatch… view at source ↗
Figure 3
Figure 3. Schematic of Multi-Node Communication in D-VLA. D-VLA implements a fully non-blocking end-to-end data flow throughout its execution cycle. Environmental features collected by Rollout components are pushed to Actor components in real time. To mitigate the speed mismatch between data production and training, a resource buffer queue based on host memory is constructed on the Actor side, ensuring continuous … view at source ↗
Figure 4
Figure 4. Performance Benchmarking of π0.5 under Different Distributed Strategies. (Left) System throughput measured in steps per second; (Middle) Average inference latency per step in milliseconds; (Right) Percentage breakdown of execution time between Rollout and Actor components. Ratios (3:1 and 1:1) represent the resource partitioning between rollout/environment and actor modules. view at source ↗
Figure 5
Figure 5. Performance Evaluation of OpenVLA-OFT across Various Scaling Configurations. view at source ↗
Figure 6
Figure 6. Training Success Rate on ManiSkill with π0.5. To further investigate the capacity and performance evolution of D-VLA under large-scale parallel workloads, we conduct a systematic evaluation using the π0.5 model as a benchmark with a 3:1 resource placement strategy. We scale the environment count from 384 to 3,072 and monitor the dynamic changes in system throughput and sub-component latencies. The results… view at source ↗
Figure 7
Figure 7. Performance scaling of D-VLA on π0.5 across varying environment counts. The plots illustrate (Left) throughput scaling trends with a peak at 768 environments, (Middle) the linear growth of decoupled time components, and (Right) the stacked breakdown of Rollout and Actor latencies. Experimental data confirms that through precise pipeline alignment, our framework ensures a high degree of overlap between larg… view at source ↗
read the original abstract

The rapid evolution of Embodied AI has enabled Vision-Language-Action (VLA) models to excel in multimodal perception and task execution. However, applying Reinforcement Learning (RL) to these massive models in large-scale distributed environments faces severe systemic bottlenecks, primarily due to the resource conflict between high-fidelity physical simulation and the intensive VRAM/bandwidth demands of deep learning. This conflict often leaves overall throughput constrained by execution-phase inefficiencies. To address these challenges, we propose D-VLA, a high-concurrency, low-latency distributed RL framework for large-scale embodied foundation models. D-VLA introduces "Plane Decoupling," physically isolating high-frequency training data from low-frequency weight control to eliminate interference between simulation and optimization. We further design a four-thread asynchronous "Swimlane" pipeline, enabling full parallel overlap of sampling, inference, gradient computation, and parameter distribution. Additionally, a dual-pool VRAM management model and topology-aware replication resolve memory fragmentation and optimize communication efficiency. Experiments on benchmarks like LIBERO show that D-VLA significantly outperforms mainstream RL frameworks in throughput and sampling efficiency for billion-parameter VLA models. In trillion-parameter scalability tests, our framework maintains exceptional stability and linear speedup, providing a robust system for high-performance general-purpose embodied agents.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes D-VLA, a high-concurrency distributed asynchronous RL framework for large-scale Vision-Language-Action models. It introduces Plane Decoupling to physically isolate high-frequency training data from low-frequency weight control, a four-thread Swimlane pipeline to overlap sampling/inference/gradient computation/parameter distribution, dual-pool VRAM management, and topology-aware replication. Experiments claim that D-VLA significantly outperforms mainstream RL frameworks in throughput and sampling efficiency on LIBERO for billion-parameter VLAs, while maintaining exceptional stability and linear speedup in trillion-parameter scalability tests.

Significance. If the experimental claims hold under realistic hardware constraints, the framework could meaningfully advance scalable RL training for embodied foundation models by addressing VRAM/bandwidth conflicts that currently limit throughput in distributed settings.

major comments (2)
  1. [Abstract] Abstract: the headline claim of linear speedup and exceptional stability in trillion-parameter tests rests on the premise that Plane Decoupling plus the Swimlane pipeline fully overlaps all stages without new contention; however, the parameter-distribution leg must still transfer model deltas across the network, and no quantitative evidence (e.g., measured bandwidth saturation, communication volume, or topology-aware replication overhead) is provided to confirm that this leg does not become the new bottleneck at trillion-parameter scale.
  2. [Experiments] Experiments section (implied by LIBERO and scalability results): the statements that D-VLA 'significantly outperforms' mainstream frameworks and achieves 'linear speedup' are presented without reported metrics, error bars, ablation studies, baseline implementations, or hardware configuration details, rendering the central performance claims unverifiable from the available text.
minor comments (1)
  1. [Abstract] Abstract: the term 'physically isolating' for Plane Decoupling is used without a diagram or pseudocode clarifying the memory layout or thread affinity mechanism.
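The first major comment turns on whether the parameter-distribution leg saturates the inter-node network at scale. A back-of-envelope cost model of topology-aware replication makes the concern concrete; the cost model, link speeds, and cluster shape are all illustrative assumptions, not numbers from the paper.

```python
def broadcast_cost(nodes, gpus_per_node, model_gb, inter_gbps, intra_gbps):
    """Rough transfer-time model (seconds) for distributing weight deltas.

    Naive: the trainer sends the full model to every GPU over the slow
    inter-node links. Topology-aware: each node receives one copy over
    the inter-node link, then fans out over the faster intra-node fabric.
    """
    gb_to_gbit = 8  # convert GB of parameters to gigabits on the wire
    naive = nodes * gpus_per_node * model_gb * gb_to_gbit / inter_gbps
    aware = (nodes * model_gb * gb_to_gbit / inter_gbps
             + (gpus_per_node - 1) * model_gb * gb_to_gbit / intra_gbps)
    return naive, aware

# Hypothetical cluster: 8 nodes x 8 GPUs, 10 GB of deltas,
# 100 Gb/s inter-node links, 600 Gb/s intra-node fabric.
naive_s, aware_s = broadcast_cost(8, 8, 10, 100, 600)
```

Under these assumptions naive broadcast takes ~51 s per update while the topology-aware scheme takes ~7 s, which is why the referee asks for measured bandwidth saturation rather than the qualitative claim alone.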

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify the presentation of our performance claims. We address each major point below and have revised the manuscript to improve verifiability and add supporting quantitative details where the original text was insufficient.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the headline claim of linear speedup and exceptional stability in trillion-parameter tests rests on the premise that Plane Decoupling plus the Swimlane pipeline fully overlaps all stages without new contention; however, the parameter-distribution leg must still transfer model deltas across the network, and no quantitative evidence (e.g., measured bandwidth saturation, communication volume, or topology-aware replication overhead) is provided to confirm that this leg does not become the new bottleneck at trillion-parameter scale.

    Authors: We agree that the abstract's claims would be strengthened by explicit quantitative backing for the parameter-distribution stage. The full manuscript reports linear speedup and stability in trillion-parameter tests, but we acknowledge the absence of direct measurements such as bandwidth saturation curves or per-stage communication volumes in the summary text. In revision we will add these metrics (including measured network utilization during delta transfers and ablation of the topology-aware replication) to the Experiments section and update the abstract to reference them, confirming that the Swimlane pipeline keeps this leg from becoming the bottleneck. revision: yes

  2. Referee: [Experiments] Experiments section (implied by LIBERO and scalability results): the statements that D-VLA 'significantly outperforms' mainstream frameworks and achieves 'linear speedup' are presented without reported metrics, error bars, ablation studies, baseline implementations, or hardware configuration details, rendering the central performance claims unverifiable from the available text.

    Authors: The referee correctly identifies that the current manuscript text does not include explicit numerical throughput values, error bars, ablation tables, or hardware specifications in the main narrative. While the full paper contains supporting figures for LIBERO throughput and scalability, these elements are not sufficiently detailed or referenced in the prose. We will revise the Experiments section to report concrete metrics (e.g., samples/sec, speedup factors with standard deviations), include error bars on all plots, add ablation studies isolating Plane Decoupling and Swimlane contributions, specify baseline framework versions and implementations, and list exact hardware configurations (GPU count, interconnect topology, etc.). revision: yes

Circularity Check

0 steps flagged

No circularity detected; performance claims rest on independent experiments

full rationale

The paper introduces an engineering framework (Plane Decoupling, Swimlane pipeline, dual-pool VRAM) to address distributed RL bottlenecks for VLA models. All load-bearing claims of throughput gains and linear speedup are presented as outcomes of empirical benchmarks on LIBERO and trillion-parameter scalability tests, not as reductions from equations or self-referential definitions. No self-definitional steps, fitted parameters renamed as predictions, or load-bearing self-citations appear in the derivation chain. The design is justified by stated assumptions about interference elimination, with results serving as external validation rather than circular confirmation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Abstract-only review limits visibility into exact parameters; framework relies on standard distributed-systems assumptions about network behavior and hardware homogeneity.

axioms (1)
  • domain assumption Standard assumptions in distributed computing hold, including predictable network latency and no unexpected hardware contention.
    Required for claims of linear speedup and stability in trillion-parameter tests.

pith-pipeline@v0.9.0 · 5565 in / 1133 out tokens · 39503 ms · 2026-05-15T06:00:28.408357+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

28 extracted references · 28 canonical work pages · 11 internal anchors

  1. [1]

RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control

    Brianna Zitkovich, Tianhe Yu, Sichun Xu, Peng Xu, Ted Xiao, Fei Xia, Jialin Wu, Paul Wohlhart, Stefan Welker, Ayzaan Wahid, Quan Vuong, Vincent Vanhoucke, Huong Tran, Radu Soricut, Anikait Singh, Jaspiar Singh, Pierre Sermanet, Pannag R. Sanketi, Grecia Salazar, Michael S. Ryoo, Krista Reymann, Kanishka Rao, Karl Pertsch, Igor Mordatch, Henryk Michalewski...

  2. [2]

    $\pi_0$: A Vision-Language-Action Flow Model for General Robot Control

    Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, et al. π0: A vision-language-action flow model for general robot control.arXiv preprint arXiv:2410.24164, 2024

  3. [3]

Gemini Robotics 1.5: Pushing the Frontier of Generalist Robots with Advanced Embodied Reasoning, Thinking, and Motion Transfer

    Gemini Robotics Team. Gemini Robotics 1.5: Pushing the Frontier of Generalist Robots with Advanced Embodied Reasoning, Thinking, and Motion Transfer.arXiv e-prints, page arXiv:2510.03342, October 2025

  4. [4]

    OpenVLA: An Open-Source Vision-Language-Action Model

    Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, et al. Openvla: An open-source vision-language-action model.arXiv preprint arXiv:2406.09246, 2024

  5. [5]

    Smolvla: A vision-language-action model for affordable and efficient robotics, 2025

    Mustafa Shukor, Dana Aubakirova, Francesco Capuano, Pepijn Kooijmans, Steven Palma, Adil Zouitine, Michel Aractingi, Caroline Pascal, Martino Russi, Andres Marafioti, Simon Alibert, Matthieu Cord, Thomas Wolf, and Remi Cadene. Smolvla: A vision-language-action model for affordable and efficient robotics, 2025

  6. [6]

    GR00T N1: An Open Foundation Model for Generalist Humanoid Robots

    Johan Bjorck, Fernando Castañeda, Nikita Cherniadev, Xingye Da, Runyu Ding, Linxi Fan, Yu Fang, Dieter Fox, Fengyuan Hu, Spencer Huang, et al. Gr00t n1: An open foundation model for generalist humanoid robots.arXiv preprint arXiv:2503.14734, 2025

  7. [7]

    A Survey on Vision-Language-Action Models for Embodied AI

Yueen Ma, Zixing Song, Yuzheng Zhuang, Jianye Hao, and Irwin King. A survey on vision-language-action models for embodied AI. arXiv preprint arXiv:2405.14093, 2024

  8. [8]

A survey on vision-language-action models: An action tokenization perspective

Yifan Zhong, Fengshuo Bai, Shaofei Cai, Xuchuan Huang, Zhang Chen, Xiaowei Zhang, Yuanfei Wang, Shaoyang Guo, Tianrui Guan, Ka Nam Lui, et al. A survey on vision-language-action models: An action tokenization perspective. arXiv preprint arXiv:2507.01925, 2025

  9. [9]

Pure Vision Language Action (VLA) Models: A Comprehensive Survey

    Dapeng Zhang, Jing Sun, Chenghui Hu, Xiaoyan Wu, Zhenlong Yuan, Rui Zhou, Fei Shen, and Qingguo Zhou. Pure vision language action (vla) models: A comprehensive survey.arXiv preprint arXiv:2509.19012, 2025

  10. [10]

Lerobot: An open-source library for end-to-end robot learning

    Remi Cadene, Simon Aliberts, Francesco Capuano, Michel Aractingi, Adil Zouitine, Pepijn Kooijmans, Jade Choghari, Martino Russi, Caroline Pascal, Steven Palma, et al. Lerobot: An open-source library for end-to-end robot learning.arXiv preprint arXiv:2602.22818, 2026

  11. [11]

    Hybridflow: A flexible and efficient rlhf framework

    Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. Hybridflow: A flexible and efficient rlhf framework. In Proceedings of the Twentieth European Conference on Computer Systems, pages 1279–1297, 2025

  12. [12]

    OpenRLHF: An Easy-to-use, Scalable and High-performance RLHF Framework

    Jian Hu, Xibin Wu, Zilin Zhu, Weixun Wang, Dehao Zhang, Yu Cao, et al. Openrlhf: An easy-to-use, scalable and high-performance rlhf framework.arXiv preprint arXiv:2405.11143, 6, 2024

  13. [13]

    MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention

    Aili Chen, Aonian Li, Bangwei Gong, Binyang Jiang, Bo Fei, Bo Yang, Boji Shan, Changqing Yu, Chao Wang, Cheng Zhu, et al. Minimax-m1: Scaling test-time compute efficiently with lightning attention.arXiv preprint arXiv:2506.13585, 2025

  14. [14]

Klear-reasoner: Advancing reasoning capability via gradient-preserving clipping policy optimization

    Zhenpeng Su, Leiyu Pan, Xue Bai, Dening Liu, Guanting Dong, Jiaming Huang, Minxuan Lv, Wenping Hu, Fuzheng Zhang, Kun Gai, et al. Klear-reasoner: Advancing reasoning capability via gradient-preserving clipping policy optimization.arXiv preprint arXiv:2508.07629, 2025

  15. [15]

Rlinf-vla: A unified and efficient framework for VLA+RL training

Hongzhi Zang, Mingjie Wei, Si Xu, Yongji Wu, Zhen Guo, Yuanqing Wang, Hao Lin, Liangzhi Shi, Yuqing Xie, Zhexuan Xu, et al. Rlinf-vla: A unified and efficient framework for VLA+RL training. arXiv preprint arXiv:2510.06710, 2025

  16. [16]

    RL-VLA$^3$: A Flexible and Asynchronous Reinforcement Learning Framework for VLA Training

    Zhong Guan, Haoran Sun, Yongjian Guo, Shuai Di, Xiaodong Bai, Jing Long, Tianyun Zhao, Mingxi Luo, Chen Zhou, Yucheng Guo, et al. Rl-vla3: Reinforcement learning vla accelerating via full asynchronism.arXiv preprint arXiv:2602.05765, 2026

  17. [17]

    SimpleVLA-RL: Scaling VLA Training via Reinforcement Learning

    Haozhan Li, Yuxin Zuo, Jiale Yu, Yuhao Zhang, Zhaohui Yang, Kaiyan Zhang, Xuekai Zhu, Yuchen Zhang, Tianxing Chen, Ganqu Cui, et al. Simplevla-rl: Scaling vla training via reinforcement learning.arXiv preprint arXiv:2509.09674, 2025

  18. [18]

Dexbotic: Open-source vision-language-action toolbox

    Bin Xie, Erjin Zhou, Fan Jia, Hao Shi, Haoqiang Fan, Haowei Zhang, Hebei Li, Jianjian Sun, Jie Bin, Junwen Huang, et al. Dexbotic: Open-source vision-language-action toolbox.arXiv preprint arXiv:2510.23511, 2025

  19. [19]

Vlab: Your laboratory for pretraining VLAs

Mustafa Shukor, Dana Aubakirova, Jade Cholgari, and Leandro von Werra. Vlab: Your laboratory for pretraining VLAs. https://github.com/huggingface/vlab, 2025

  20. [20]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300, 2024

  21. [21]

Thousand-gpu large-scale training and optimization recipe for ai-native cloud embodied intelligence infrastructure

    Chen Zhou, Haoran Sun, Hedan Yang, Jing Long, Junwu Xiong, Luqiao Wang, Mingxi Luo, Qiming Yang, Shuai Di, Song Wang, et al. Thousand-gpu large-scale training and optimization recipe for ai-native cloud embodied intelligence infrastructure.arXiv preprint arXiv:2603.11101, 2026

  22. [22]

Open X-Embodiment: Robotic Learning Datasets and RT-X Models

Abby O’Neill, Abdul Rehman, Abhiram Maddukuri, Abhishek Gupta, Abhishek Padalkar, Abraham Lee, Acorn Pooley, Agrim Gupta, Ajay Mandlekar, Ajinkya Jain, et al. Open X-Embodiment: Robotic learning datasets and RT-X models. In 2024 IEEE International Conference on Robotics and Automation (ICRA), pages 6892–6903. IEEE, 2024

  23. [23]

    $\pi_{0.5}$: a Vision-Language-Action Model with Open-World Generalization

Physical Intelligence, Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, et al. π0.5: a vision-language-action model with open-world generalization. arXiv preprint arXiv:2504.16054, 2025

  24. [24]

    Maniskill: Generalizable manipulation skill benchmark with large-scale demonstrations

    T Mu, Z Ling, F Xiang, D Yang, X Li, S Tao, Z Huang, Z Jia, and H Su. Maniskill: Generalizable manipulation skill benchmark with large-scale demonstrations. In35th Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2), 2021

  25. [25]

Libero: Benchmarking knowledge transfer for lifelong robot learning

    Bo Liu, Yifeng Zhu, Chongkai Gao, Yihao Feng, Qiang Liu, Yuke Zhu, and Peter Stone. Libero: Benchmarking knowledge transfer for lifelong robot learning.Advances in Neural Information Processing Systems, 36:44776–44791, 2023

  26. [26]

Rollart: Scaling agentic RL training via disaggregated infrastructure

    Wei Gao, Yuheng Zhao, Tianyuan Wu, Shaopan Xiong, Weixun Wang, Dakai An, Lunxi Cao, Dilxat Muhtar, Zichen Liu, Haizhou Zhao, et al. Rollart: Scaling agentic rl training via disaggregated infrastructure.arXiv preprint arXiv:2512.22560, 2025

  27. [27]

Reward models in deep reinforcement learning: A survey

Rui Yu, Shenghua Wan, Yucen Wang, Chen-Xiao Gao, Le Gan, Zongzhang Zhang, and De-Chuan Zhan. Reward models in deep reinforcement learning: A survey. arXiv preprint arXiv:2506.15421, 2025

  28. [28]

    RoboCasa: Large-Scale Simulation of Everyday Tasks for Generalist Robots

Soroush Nasiriany, Abhiram Maddukuri, Lance Zhang, Adeet Parikh, Aaron Lo, Abhishek Joshi, Ajay Mandlekar, and Yuke Zhu. Robocasa: Large-scale simulation of everyday tasks for generalist robots. arXiv preprint arXiv:2406.02523, 2024