Learning While Deploying: Fleet-Scale Reinforcement Learning for Generalist Robot Policies

2026 · cs.RO · arXiv 2605.00416

2 Pith papers cite this work. Polarity classification is still indexing.

abstract

Generalist robot policies increasingly benefit from large-scale pretraining, but offline data alone is insufficient for robust real-world deployment. Deployed robots encounter distribution shifts, long-tail failures, task variations, and human correction opportunities that fixed demonstration datasets cannot fully capture. We present Learning While Deploying (LWD), a fleet-scale offline-to-online reinforcement learning framework for continual post-training of generalist Vision-Language-Action (VLA) policies. Starting from a pretrained VLA policy, LWD closes the loop between deployment, shared physical experience, policy improvement, and redeployment by using autonomous rollouts and human interventions collected across a robot fleet. To stabilize learning from heterogeneous, sparse-reward fleet data, LWD combines Distributional Implicit Value Learning (DIVL) for robust value estimation with Q-learning via Adjoint Matching (QAM) for policy extraction in flow-based VLA action generators. We validate LWD on a fleet of 16 dual-arm robots across eight real-world manipulation tasks, including semantic grocery restocking and 3–5 minute long-horizon tasks. A single generalist policy improves as fleet experience accumulates, reaching an average success rate of 95%, with the largest gains on long-horizon tasks.

representative citing papers

FlowDPG: Deterministic Policy Gradient on Flow Matching Policies for Real-World Manipulation

cs.RO · 2026-06-21 · unverdicted · novelty 6.0

FlowDPG distills critic gradients into flow matching velocity fields to enable BPTT-free DDPG-style policy improvement and reports 92% success on a real-world dual-arm AirPods assembly task.

UniIntervene: Agentic Intervention for Efficient Real-World Reinforcement Learning

cs.RO · 2026-06-10 · unverdicted · novelty 6.0

UniIntervene uses future-conditioned action-value estimation and a temporal value-risk critic to trigger memory-based recovery interventions, reporting 8.6% higher success rates and 57% fewer human interventions than prior HiL-RL methods on real manipulation tasks.

citing papers explorer

FlowDPG: Deterministic Policy Gradient on Flow Matching Policies for Real-World Manipulation

cs.RO · 2026-06-21 · unverdicted · none · ref 33 · internal anchor

UniIntervene: Agentic Intervention for Efficient Real-World Reinforcement Learning

cs.RO · 2026-06-10 · unverdicted · none · ref 40 · internal anchor
