Recognition: 2 theorem links
· Lean Theoremπ^{*}_{0.6}: a VLA That Learns From Experience
Pith reviewed 2026-05-12 10:29 UTC · model grok-4.3
The pith
Advantage-conditioned policies let a pre-trained VLA improve on real household tasks by training on its own deployments and corrections.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
RECAP uses advantage conditioning to turn heterogeneous real-world data into stable policy updates for VLAs. After offline pre-training of π*_{0.6}, the method collects on-robot rollouts, records advantages for each action, and trains the policy to favor higher-advantage actions while incorporating teleoperated corrections when the robot fails. This produces measurable gains: more than doubled task throughput and roughly halved failure rates on tasks such as laundry folding and espresso preparation.
What carries the argument
Advantage-conditioned policies in RECAP, which estimate the advantage of each action from mixed data sources and condition the VLA output on those values to blend demonstrations, self-generated data, and interventions without separate weighting.
If this is right
- A single generalist VLA can be specialized to new tasks through modest on-robot data collection rather than full retraining.
- Task throughput more than doubles and failure rates roughly halve on the hardest real-world activities when the full RECAP pipeline is applied.
- Teleoperated interventions during autonomous runs can be folded back into training to correct failures without discarding the entire rollout.
- The same pre-trained base model supports both broad capabilities and high performance on specific physical tasks after experience-based refinement.
Where Pith is reading between the lines
- If the method scales, fleets of robots could pool their deployment data to accelerate collective improvement across homes and factories.
- The approach reduces reliance on exhaustive expert demonstrations by turning ordinary failures and fixes into useful training signal.
- Similar conditioning could be tested on longer-horizon tasks or multi-robot coordination where data sources are even more varied.
Load-bearing premise
Advantage conditioning on mixed demonstrations, on-policy data, and interventions will produce stable improvement in the real world without triggering large distribution shifts or unsafe autonomous behavior.
What would settle it
Run the RECAP-trained π*_{0.6} and the offline-pretrained version side-by-side on the same set of new household tasks for 100 trials each; if the RECAP version shows no reduction in failure rate or throughput, the central claim is falsified.
read the original abstract
We study how vision-language-action (VLA) models can improve through real-world deployments via reinforcement learning (RL). We present a general-purpose method, RL with Experience and Corrections via Advantage-conditioned Policies (RECAP), that provides for RL training of VLAs via advantage conditioning. Our method incorporates heterogeneous data into the self-improvement process, including demonstrations, data from on-policy collection, and expert teleoperated interventions provided during autonomous execution. RECAP starts by pre-training a generalist VLA with offline RL, which we call $\pi^{*}_{0.6}$, that can then be specialized to attain high performance on downstream tasks through on-robot data collection. We show that the $\pi^{*}_{0.6}$ model trained with the full RECAP method can fold laundry in real homes, reliably assemble boxes, and make espresso drinks using a professional espresso machine. On some of the hardest tasks, RECAP more than doubles task throughput and roughly halves the task failure rate.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces RECAP (RL with Experience and Corrections via Advantage-conditioned Policies), a method for improving vision-language-action (VLA) models via reinforcement learning that incorporates heterogeneous data sources including demonstrations, on-policy rollouts, and teleoperated interventions during autonomous execution. It describes pre-training a generalist VLA called π*_{0.6} with offline RL, followed by specialization on real-robot tasks, and claims that the resulting model can fold laundry in homes, assemble boxes, and operate a professional espresso machine, with RECAP more than doubling task throughput and halving failure rates on the hardest tasks.
Significance. If the performance claims are supported by rigorous experiments, the work would be significant for real-world robotics by showing a scalable path for VLA self-improvement in unstructured settings using mixed data without requiring fully autonomous safe exploration. It directly targets the gap between offline pre-training and deployment-time adaptation. However, the absence of any experimental protocol, baselines, or analysis in the manuscript prevents determining whether these gains represent genuine policy improvement or artifacts of human intervention.
major comments (2)
- [Abstract] Abstract: The central empirical claims (more than doubling task throughput and roughly halving failure rates on hardest tasks) are stated without any accompanying experimental protocol, task definitions, trial counts, baselines, error bars, statistical tests, or ablation studies. This directly undermines evaluation of the reported numbers and is load-bearing for the paper's primary contribution.
- [Abstract] Abstract: No description is given of how advantages are estimated from the heterogeneous data mixture (demonstrations, on-policy collection, teleoperated interventions), including whether a learned critic, Monte-Carlo returns, or GAE is used, and whether importance sampling, clipping, or safety filters are applied. This leaves the skeptic's concern about high-variance advantage estimates and distribution shift unaddressed and is central to the stability of the claimed RL improvement.
Simulated Author's Rebuttal
We thank the referee for their constructive comments on our manuscript. We address the major concerns point-by-point below and will make revisions to improve clarity and completeness, particularly regarding the experimental details and methodological specifics.
read point-by-point responses
-
Referee: [Abstract] Abstract: The central empirical claims (more than doubling task throughput and roughly halving failure rates on hardest tasks) are stated without any accompanying experimental protocol, task definitions, trial counts, baselines, error bars, statistical tests, or ablation studies. This directly undermines evaluation of the reported numbers and is load-bearing for the paper's primary contribution.
Authors: We recognize that the abstract presents the results at a high level without the supporting experimental details. To address this, we will revise the abstract to incorporate a brief description of the experimental protocol, including task definitions, number of trials, and mention of baselines and statistical analysis. Additionally, we will ensure the Experiments section is expanded if needed to include all requested elements such as error bars and ablations. This will allow readers to properly evaluate the claims. revision: yes
-
Referee: [Abstract] Abstract: No description is given of how advantages are estimated from the heterogeneous data mixture (demonstrations, on-policy collection, teleoperated interventions), including whether a learned critic, Monte-Carlo returns, or GAE is used, and whether importance sampling, clipping, or safety filters are applied. This leaves the skeptic's concern about high-variance advantage estimates and distribution shift unaddressed and is central to the stability of the claimed RL improvement.
Authors: We agree that the description of advantage estimation from the heterogeneous data is insufficient in the current manuscript. We will add a comprehensive subsection detailing the advantage computation: specifying the use of a learned critic, the choice of Monte-Carlo returns for demonstrations and GAE for on-policy data, the application of importance sampling and clipping for mixed data, and safety filters for interventions. This will include discussion of variance and distribution shift mitigation to strengthen the methodological foundation. revision: yes
Circularity Check
No circularity: purely empirical claims with no derivation chain
full rationale
The paper's central claims consist of empirical performance results on real-world robotics tasks (laundry folding, box assembly, espresso making) after applying the RECAP method. No equations, derivations, fitted parameters, or mathematical predictions are presented in the abstract or described structure. The method is introduced as a general-purpose RL approach incorporating heterogeneous data, but the reported gains (doubled throughput, halved failure rates) are direct experimental outcomes rather than quantities derived from prior fitted values or self-referential definitions. No self-citation load-bearing steps, ansatz smuggling, or renaming of known results appear in the provided text. The derivation chain is absent, rendering the paper self-contained as an empirical report.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Advantage-conditioned policies can stably incorporate demonstrations, on-policy data, and expert interventions for real-world VLA improvement.
Forward citations
Cited by 39 Pith papers
-
RotVLA: Rotational Latent Action for Vision-Language-Action Model
RotVLA models latent actions as continuous SO(n) rotations with triplet-frame supervision and flow-matching to reach 98.2% success on LIBERO and 89.6%/88.5% on RoboTwin2.0 using a 1.7B-parameter model.
-
Runtime Monitoring of Perception-Based Autonomous Systems via Embedding Temporal Logic
Embedding Temporal Logic (ETL) performs runtime monitoring directly in learned embedding spaces using distance-based predicates composed with temporal operators, supported by conformal calibration for reliable predica...
-
Runtime Monitoring of Perception-Based Autonomous Systems via Embedding Temporal Logic
Embedding Temporal Logic enables runtime monitoring of temporally extended perceptual behaviors by defining predicates via distances between observed and reference embeddings in learned spaces, with conformal calibrat...
-
Offline Policy Evaluation for Manipulation Policies via Discounted Liveness Formulation
A liveness-based Bellman operator enables conservative offline policy evaluation for manipulation tasks by encoding task progression and reducing truncation bias from finite horizons.
-
DiscreteRTC: Discrete Diffusion Policies are Natural Asynchronous Executors
Discrete diffusion policies support native asynchronous execution via unmasking for real-time chunking, delivering higher success rates and 0.7x inference cost versus flow-matching RTC on dynamic robotics benchmarks a...
-
${\pi}_{0.7}$: a Steerable Generalist Robotic Foundation Model with Emergent Capabilities
π₀.₇ is a steerable generalist robotic model that uses rich multimodal prompts including language, subgoal images, and performance metadata to achieve out-of-the-box generalization across tasks and robot bodies.
-
ScoRe-Flow: Complete Distributional Control via Score-Based Reinforcement Learning for Flow Matching
ScoRe-Flow achieves decoupled mean-variance control in stochastic flow matching by deriving a closed-form score for drift modulation plus learned variance, yielding faster RL convergence and higher success rates on lo...
-
ViVa: A Video-Generative Value Model for Robot Reinforcement Learning
ViVa turns a video generator into a value model for robot RL that jointly forecasts future states and task value, yielding better performance on real-world box assembly when integrated with RECAP.
-
Action Images: End-to-End Policy Learning via Multiview Video Generation
Action Images turn robot arm motions into interpretable multiview pixel videos, letting video backbones serve as zero-shot policies for end-to-end robot learning.
-
Hand-in-the-Loop: Improving Dexterous VLA via Seamless Interventional Correction
HandITL blends human intent with policy execution to eliminate gesture jumps in dexterous VLA interventions, cutting jitter by 99.8%, grasp failures by 87.5%, and yielding 19% better refined policies.
-
Reinforcing VLAs in Task-Agnostic World Models
RAW-Dream lets VLAs learn new tasks in zero-shot imagination by using a world model pre-trained only on task-free behaviors and an unmodified VLM to supply rewards, with dual-noise verification to limit hallucinations.
-
TMRL: Diffusion Timestep-Modulated Pretraining Enables Exploration for Efficient Policy Finetuning
TMRL bridges behavioral cloning pretraining and RL finetuning via diffusion noise and timestep modulation to enable controlled exploration, improving sample efficiency and enabling real-world robot training in under one hour.
-
Unified Noise Steering for Efficient Human-Guided VLA Adaptation
UniSteer unifies human corrective actions and noise-space RL for VLA adaptation by inverting actions to noise targets, raising success rates from 20% to 90% in 66 minutes across four real-world manipulation tasks.
-
RePO-VLA: Recovery-Driven Policy Optimization for Vision-Language-Action Models
RePO-VLA raises average adversarial success rates in VLA manipulation from 20% to 75% by using recovery-aware initialization, a progress-aware semantic value function, and value-conditioned refinement on success and c...
-
How to utilize failure demo data?: Effective data selection for imitation learning using distribution differences in attention mechanism
The method uses attention discrepancy metrics on latent success-failure representations to select beneficial failure data for imitation learning, raising task success rates in simulations.
-
Long-Horizon Q-Learning: Accurate Value Learning via n-Step Inequalities
LQL turns n-step action-sequence lower bounds into a practical hinge-loss stabilizer for off-policy Q-learning without extra networks or forward passes.
-
Long-Horizon Q-Learning: Accurate Value Learning via n-Step Inequalities
LQL stabilizes Q-learning by penalizing violations of n-step action-sequence lower bounds with a hinge loss computed from standard network outputs.
-
Seeing Realism from Simulation: Efficient Video Transfer for Vision-Language-Action Data Augmentation
A video transfer pipeline augments simulated VLA data into realistic videos while preserving actions, yielding consistent performance gains on robot benchmarks such as 8% on Robotwin 2.0.
-
Learning while Deploying: Fleet-Scale Reinforcement Learning for Generalist Robot Policies
Fleet-scale RL framework improves a single generalist VLA policy from deployment data to 95% average success on eight real-world manipulation tasks with 16 dual-arm robots.
-
LaST-R1: Reinforcing Robotic Manipulation via Adaptive Physical Latent Reasoning
LaST-R1 reaches 99.8% average success on the LIBERO benchmark using one-shot warm-up plus LAPO reinforcement learning on latent physical reasoning, with up to 44% real-world gains on complex single- and dual-arm tasks.
-
LaST-R1: Reinforcing Robotic Manipulation via Adaptive Physical Latent Reasoning
LaST-R1 introduces a RL post-training method called LAPO that optimizes latent Chain-of-Thought reasoning in vision-language-action models, yielding 99.9% success on LIBERO and up to 22.5% real-world gains.
-
PRTS: A Primitive Reasoning and Tasking System via Contrastive Representations
PRTS pretrains VLA models with contrastive goal-conditioned RL to embed goal-reachability probabilities from offline data, yielding SOTA results on robotic benchmarks especially for long-horizon and novel instructions.
-
RL Token: Bootstrapping Online RL with Vision-Language-Action Models
RL Token enables sample-efficient online RL fine-tuning of large VLAs, delivering up to 3x speed gains and higher success rates on real-robot manipulation tasks within minutes to hours.
-
FASTER: Value-Guided Sampling for Fast RL
FASTER models multi-candidate denoising as an MDP and trains a value function to filter actions early, delivering the performance of full sampling at lower cost in diffusion RL policies.
-
VAG: Dual-Stream Video-Action Generation for Embodied Data Synthesis
VAG is a synchronized dual-stream flow-matching framework that generates aligned video-action pairs for synthetic embodied data synthesis and policy pretraining.
-
E-VLA: Event-Augmented Vision-Language-Action Model for Dark and Blurred Scenes
E-VLA integrates event streams directly into VLA models via lightweight fusion, raising Pick-Place success from 0% to 60-90% at 20 lux and from 0% to 20-25% under severe motion blur.
-
Multi-View Video Diffusion Policy: A 3D Spatio-Temporal-Aware Video Action Model
MV-VDP jointly predicts multi-view RGB and heatmap videos via diffusion to achieve data-efficient, robust robotic manipulation policies.
-
ARM: Advantage Reward Modeling for Long-Horizon Manipulation
ARM trains reward models on Progressive/Regressive/Stagnant labels to enable adaptive reweighting in offline RL, reaching 99.4% success on towel-folding with minimal human intervention.
-
Open-Loop Planning, Closed-Loop Verification: Speculative Verification for VLA
SV-VLA uses infrequent heavy VLA planning of action chunks plus a lightweight closed-loop verifier to achieve both efficiency and robustness in dynamic robot control.
-
ProcVLM: Learning Procedure-Grounded Progress Rewards for Robotic Manipulation
ProcVLM learns procedure-grounded dense progress rewards for robotic manipulation via a reasoning-before-estimation VLM trained on a 60M-frame synthesized corpus from 30 embodied datasets.
-
Is the Future Compatible? Diagnosing Dynamic Consistency in World Action Models
Action-state consistency in World Action Models distinguishes successful from failed imagined futures and supports value-free selection of better rollouts via consensus among predictions.
-
Cooptimizing Safety and Performance Using Safety Value-Constrained Model Predictive Control
Augments MPC with a safety value function terminal constraint to achieve recursive feasibility and persistent safety while co-optimizing performance.
-
VLA Foundry: A Unified Framework for Training Vision-Language-Action Models
VLA Foundry provides a single training stack for VLA models and releases open models that match prior closed-source performance or outperform baselines on multi-task manipulation in simulation.
-
SpanVLA: Efficient Action Bridging and Learning from Negative-Recovery Samples for Vision-Language-Action Model
SpanVLA reduces action generation latency via flow-matching conditioned on history and improves robustness by training on negative-recovery samples with GRPO and a dedicated reasoning dataset.
-
CoEnv: Driving Embodied Multi-Agent Collaboration via Compositional Environment
CoEnv introduces a compositional environment that integrates real and simulated spaces for multi-agent robotic collaboration, using real-to-sim reconstruction, VLM action synthesis, and validated sim-to-real transfer ...
-
RLDX-1 Technical Report
RLDX-1 outperforms frontier VLAs such as π0.5 and GR00T N1.6 on dexterous manipulation benchmarks, reaching 86.8% success on ALLEX humanoid tasks versus around 40% for the baselines.
-
RLDX-1 Technical Report
RLDX-1 achieves 86.8% success on complex ALLEX humanoid manipulation tasks where prior VLAs reach only around 40%.
-
OmniVLA-RL: A Vision-Language-Action Model with Spatial Understanding and Online RL
OmniVLA-RL uses a mix-of-transformers architecture and flow-matching reformulated as SDE with group segmented policy optimization to surpass prior VLA models on LIBERO benchmarks.
-
Redefining End-of-Life: Intelligent Automation for Electronics Remanufacturing Systems
A literature review of intelligent automation approaches using robotics, AI, and control for disassembly, inspection, sorting, and reprocessing of end-of-life electronics.
Reference graph
Works this paper leans on
-
[1]
Richard S Sutton and Andrew G Barto.Reinforcement learning: An introduction. MIT press, 2018. 1
work page 2018
-
[2]
Sascha Lange, Thomas Gabel, and Martin A. Ried- miller. Batch reinforcement learning. In Marco A. Wiering and Martijn van Otterlo, editors,Reinforce- ment Learning, volume 12 ofAdaptation, Learning, and Optimization, pages 45–73. Springer, 2012. doi: 10.1007/978-3-642-27645-3\ 2. 2, 4
-
[3]
Offline Reinforcement Learning: Tutorial, Review, and Perspectives on Open Problems
Sergey Levine, Aviral Kumar, George Tucker, and Justin Fu. Offline reinforcement learning: Tutorial, review, and perspectives on open problems.arXiv preprint arXiv:2005.01643, 2020. 2, 4
work page internal anchor Pith review Pith/arXiv arXiv 2005
-
[4]
Kevin Frans, Seohong Park, Pieter Abbeel, and Sergey Levine. Diffusion guidance is a controllable policy improvement operator.arXiv preprint, arXiv:2505.23458,
-
[5]
In9th Annual Conference on Robot Learning, 2025
Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Esmail, Michael Robert Equi, Chelsea Finn, Niccolo Fusai, Manuel Y Galliker, et al.π 0.5: a vision-language-action model with open- world generalization. In9th Annual Conference on Robot Learning, 2025. 2, 3, 5, 7, 8
work page 2025
-
[6]
Physical Intelligence Team.π 0.6 model card. 2025. 2, 5, 6, 8
work page 2025
-
[7]
A reduction of imitation learning and structured prediction to no-regret online learning
St ´ephane Ross, Geoffrey Gordon, and Drew Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. InAISTATS, pages 627–635,
-
[8]
Michael Laskey, Jonathan Lee, Roy Fox, Anca Dragan, and Ken Goldberg. Shiv: Reducing supervisor burden in dagger using support vectors for efficient learning from demonstrations in high dimensional state spaces. In Proceedings of the 2016 IEEE International Conference on Robotics and Automation (ICRA), pages 462–469,
work page 2016
-
[9]
doi: 10.1109/ICRA.2016.7487175. 2
-
[10]
Michael Laskey, Jonathan Lee, Roy Fox, Anca D. Dra- gan, and Ken Goldberg. Dart: Noise injection for robust imitation learning. InProceedings of the 34th Interna- tional Conference on Machine Learning (ICML), vol- ume 70 ofProceedings of Machine Learning Research, pages 1989–1998. PMLR, 2017
work page 1989
-
[11]
Bc-z: Zero-shot task generalization with robotic imitation learning
Eric Jang, Alex Irpan, Mohi Khansari, Daniel Kappler, Frederik Ebert, Corey Lynch, Sergey Levine, and Chelsea Finn. Bc-z: Zero-shot task generalization with robotic imitation learning. InConference on Robot Learning, pages 991–1002. PMLR, 2022. 2
work page 2022
-
[12]
Zheyuan Hu, Robyn Wu, Naveen Enock, Jasmine Li, Riya Kadakia, Zackory Erickson, and Aviral Kumar. Rac: Robot learning for long-horizon tasks by scaling recovery and correction.arXiv preprint, arXiv:2509.07953, 2025. 2
-
[13]
Hg-dagger: Inter- active imitation learning with human experts
Michael Kelly, Chelsea Sidrane, Katherine Driggs- Campbell, and Mykel J Kochenderfer. Hg-dagger: Inter- active imitation learning with human experts. InICRA,
-
[14]
End-to-end training of deep visuomotor policies
Sergey Levine, Chelsea Finn, Trevor Darrell, and Pieter Abbeel. End-to-end training of deep visuomotor policies. The Journal of Machine Learning Research, 17(1):1334– 1373, 2016. 2
work page 2016
-
[15]
Qt-opt: Scalable deep reinforcement learning for vision-based robotic manipulation, 2018
Dmitry Kalashnikov, Alex Irpan, Peter Pastor, Julian Ibarz, Alexander Herzog, Eric Jang, Deirdre Quillen, Ethan Holly, Mrinal Kalakrishnan, Vincent Vanhoucke, et al. QT-Opt: Scalable deep reinforcement learning for vision-based robotic manipulation.arXiv preprint arXiv:1806.10293, 2018
-
[16]
Ajay Mandlekar, Fabio Ramos, Byron Boots, Li Fei- Fei, Animesh Garg, and Dieter Fox. Iris: Implicit reinforcement without interaction at scale for learning control from offline robot manipulation data.ICRA, 2020
work page 2020
-
[17]
Ahmed Ahmed Rehaan Ahmad, and Chelsea Finn
Archit Sharma, M. Ahmed Ahmed Rehaan Ahmad, and Chelsea Finn. Self-improving robots: End-to-end autonomous visuomotor reinforcement learning. In Proceedings of the 7th Conference on Robot Learning (CoRL), volume 229, pages 3292–3308. PMLR, 2023
work page 2023
-
[18]
2023 IEEE International Conference on Robotics and Automation (ICRA) , volume =
Russell Mendonca, Shikhar Bahl, and Deepak Pathak. Alan: Autonomously exploring robotic agents in the real world. InProceedings of the 2023 IEEE Interna- tional Conference on Robotics and Automation (ICRA), pages 3044–3050, 2023. doi: 10.1109/ICRA48891.2023. 10013321
-
[19]
Continuously improving mobile manipulation with autonomous real- world rl
Russell Mendonca, Emmanuel Panov, Bernadette Bucher, Jiuguang Wang, and Deepak Pathak. Continuously improving mobile manipulation with autonomous real- world rl. InProceedings of the 8th Conference on Robot Learning (CoRL), pages 5204–5219, 2024
work page 2024
-
[20]
Serl: A software suite for sample-efficient robotic reinforcement learning, 2024
Jianlan Luo, Zheyuan Hu, Charles Xu, You Liang Tan, Jacob Berg, Archit Sharma, Stefan Schaal, Chelsea Finn, Abhishek Gupta, and Sergey Levine. Serl: A software suite for sample-efficient robotic reinforcement learning, 2024
work page 2024
-
[21]
Lars Ankile, Zhenyu Jiang, Rocky Duan, Guanya Shi, Pieter Abbeel, and Anusha Nagabandi. Residual off- policy rl for finetuning behavior cloning policies.arXiv preprint arXiv:2509.19301, 2025
-
[22]
Thomas Lampe, Abbas Abdolmaleki, Sarah Bechtle, Sandy H. Huang, Jost Tobias Springenberg, Michael Bloesch, Oliver Groth, Roland Hafner, Tim Hertweck, Michael Neunert, Markus Wulfmeier, Jingwei Zhang, Francesco Nori, Nicolas Heess, and Martin Riedmiller. Mastering stacking of diverse shapes with large-scale iterative reinforcement learning on real robots. ...
-
[23]
Perry Dong, Suvir Mirchandani, Dorsa Sadigh, and Chelsea Finn. What matters for batch online re- inforcement learning in robotics?arXiv preprint, arXiv:2505.08078, 2025. 2
-
[24]
Allen Z. Ren, Justin Lidard, Lars Lien Ankile, Anthony Simeonov, Pulkit Agrawal, Anirudha Majumdar, Ben- jamin Burchfiel, Hongkai Dai, and Max Simchowitz. Diffusion Policy Policy Optimization. InProceedings of the 2025 International Conference on Learning Rep- resentations (ICLR), 2025. 9, 17
work page 2025
-
[25]
Rl-100: Performant robotic ma- nipulation with real-world reinforcement learning,
Kun Lei, Huanyu Li, Dongjie Yu, Zhenyu Wei, Lingxiao Guo, Zhennan Jiang, Ziyu Wang, Shiyu Liang, and Huazhe Xu. Rl-100: Performant robotic manipulation with real-world reinforcement learning.arXiv preprint, arXiv:2510.14830, 2025. 2
-
[26]
Mt-opt: Continuous multi- task robotic reinforcement learning at scale.arXiv, 2021
Dmitry Kalashnkov, Jake Varley, Yevgen Chebotar, Ben Swanson, Rico Jonschkowski, Chelsea Finn, Sergey Levine, and Karol Hausman. Mt-opt: Continuous multi- task robotic reinforcement learning at scale.arXiv, 2021. 2
work page 2021
-
[27]
Zhao, Vikash Kumar, Aaron Rovinsky, Kelvin Xu, Thomas Devlin, and Sergey Levine
Abhishek Gupta, Justin Yu, Tony Z. Zhao, Vikash Kumar, Aaron Rovinsky, Kelvin Xu, Thomas Devlin, and Sergey Levine. Reset-free reinforcement learning via multi- task learning: Learning dexterous manipulation behaviors without human intervention. InProceedings of the 2021 IEEE International Conference on Robotics and Automation (ICRA), pages 6664–6671, 2021. 2
work page 2021
-
[28]
RoboCat : A self-improving foundation agent for robotic manipulation
Konstantinos Bousmalis, Giulia Vezzani, Dushyant Rao, Coline Devin, Alex X Lee, Maria Bauza, Todor Davchev, Yuxiang Zhou, Agrim Gupta, Akhil Raju, et al. Robocat: A self-improving foundation agent for robotic manipula- tion.arXiv preprint arXiv:2306.11706, 2023. 2
-
[29]
Aviral Kumar, Anikait Singh, Frederik Ebert, Mitsuhiko Nakamoto, Yanlai Yang, Chelsea Finn, and Sergey Levine. Pre-training for robots: Offline reinforcement learning enables learning new tasks from a handful of trials. InProceedings of Robotics: Science and Systems (RSS), 2023. doi: 10.15607/RSS.2023.XIX.019
-
[30]
Jingyun Yang, Max Sobol Mark, Brandon Vu, Archit Sharma, Jeannette Bohg, and Chelsea Finn. Robot fine-tuning made easy: Pre-training rewards and policies for autonomous real-world reinforcement learning. In Proceedings of the 2024 IEEE International Confer- ence on Robotics and Automation (ICRA), 2024. doi: 10.1109/ICRA57147.2024.10610421. 2
-
[31]
Interactive post-training for vision-language- action models, 2025
Shuhan Tan, Kairan Dou, Yue Zhao, and Philipp Kr ¨ahenb¨uhl. Interactive post-training for vision-language-action models.arXiv preprint, arXiv:2505.17016, 2025. 2
-
[32]
Vla-rl: Towards masterful and general robotic manipulation with scalable reinforcement learning,
Guanxing Lu, Wenkai Guo, Chubin Zhang, Yuheng Zhou, Haonan Jiang, Zifeng Gao, Yansong Tang, and Zi- wei Wang. Vla-rl: Towards masterful and general robotic manipulation with scalable reinforcement learning.arXiv preprint, arXiv:2505.18719, 2025
-
[33]
What can rl bring to vla generalization? an empirical study.arXiv preprint, arXiv:2505.19789, 2025
Jijia Liu, Feng Gao, Bingwen Wei, Xinlei Chen, Qingmin Liao, Yi Wu, Chao Yu, and Yu Wang. What can rl bring to vla generalization? an empirical study.arXiv preprint, arXiv:2505.19789, 2025
- [34]
-
[35]
arXiv preprint arXiv:2509.09674 , year=
Haozhan Li, Yuxin Zuo, Jiale Yu, Yuhao Zhang, Zhao- hui Yang, Kaiyan Zhang, Xuekai Zhu, Yuchen Zhang, Tianxing Chen, Ganqu Cui, Dehui Wang, Dingxiang Luo, Yuchen Fan, Youbang Sun, Jia Zeng, Jiangmiao Pang, Shanghang Zhang, Yu Wang, Yao Mu, Bowen Zhou, and Ning Ding. Simplevla-rl: Scaling vla training via rein- forcement learning.arXiv preprint, arXiv:2509.09674,
-
[36]
Improving vision- language-action model with online reinforcement learning
Yanjiang Guo, Jianke Zhang, Xiaoyu Chen, Xiang Ji, Yen-Jen Wang, Yucheng Hu, and Jianyu Chen. Improving vision-language-action model with online reinforcement learning.arXiv preprint, arXiv:2501.16664, 2025. 2
-
[37]
Self- improving vision-language-action models with data gen- eration via residual rl, 2025
Wenli Xiao, Haotian Lin, Andy Peng, Haoru Xue, Tairan He, Yuqi Xie, Fengyuan Hu, Jimmy Wu, Zhengyi Luo, Linxi ”Jim” Fan, Guanya Shi, and Yuke Zhu. Self- improving vision-language-action models with data gen- eration via residual rl, 2025. 2
work page 2025
-
[38]
Yuhui Chen, Shuai Tian, Shugao Liu, Yingting Zhou, Haoran Li, and Dongbin Zhao. Conrft: A reinforced fine-tuning method for vla models via consistency policy. arXiv preprint arXiv:2502.05450, 2025. 2
-
[39]
arXiv preprint arXiv:2412.06685 , year=
Max Sobol Mark, Tian Gao, Georgia Gabriela Sampaio, Mohan Kumar Srirama, Archit Sharma, Chelsea Finn, and Aviral Kumar. Policy-agnostic rl: Offline rl and online rl fine-tuning of any class and backbone.arXiv preprint, arXiv:2412.06685, 2024. 2
-
[40]
Steering your generalists: Improving robotic foundation models via value guidance
Mitsuhiko Nakamoto, Oier Mees, Aviral Kumar, and Sergey Levine. Steering your generalists: Improving robotic foundation models via value guidance. InCon- ference on Robot Learning, pages 4996–5013. PMLR, 2025
work page 2025
-
[41]
Yang Zhang, Chenwei Wang, Ouyang Lu, Yuan Zhao, Yunfei Ge, Zhenglong Sun, Xiu Li, Chi Zhang, Chenjia Bai, and Xuelong Li. Align-then-steer: Adapting the vision-language action models through unified latent guidance.arXiv preprint arXiv:2509.02055, 2025. 2
-
[42]
Steering your diffusion policy with latent space reinforcement learning
Andrew Wagenmaker, Mitsuhiko Nakamoto, Yunchu Zhang, Seohong Park, Waleed Yagoub, Anusha Naga- bandi, Abhishek Gupta, and Sergey Levine. Steering your diffusion policy with latent space reinforcement learning. InProceedings of the 9th Conference on Robot Learning (CoRL), 2025. 2
work page 2025
-
[43]
Charles Xu, Qiyang Li, Jianlan Luo, and Sergey Levine. Rldg: Robotic generalist policy distillation via reinforce- ment learning.arXiv preprint arXiv:2412.09858, 2024. 2
-
[44]
Dongchi Huang, Zhirui Fang, Tianle Zhang, Yihang Li, Lin Zhao, and Chunhe Xia. Co-rft: Efficient fine- tuning of vision-language-action models through chun- ked offline reinforcement learning.arXiv preprint, arXiv:2508.02219, 2025. 3
- [45]
-
[46]
Shaopeng Zhai, Qi Zhang, Tianyi Zhang, Fuxian Huang, Haoran Zhang, Ming Zhou, Shengzhe Zhang, Litao Liu, Sixu Lin, and Jiangmiao Pang. A vision-language- action-critic model for robotic real-world reinforcement learning.arXiv preprint, arXiv:2509.15937, 2025. 3
-
[47]
Self- improving embodied foundation models.arXiv preprint, arXiv:2509.15155, 2025
Seyed Kamyar Ghasemipour, Ayzaan Wahid, Jonathan Tompson, Pannag Sanketi, and Igor Mordatch. Self- improving embodied foundation models.arXiv preprint, arXiv:2509.15155, 2025. 3
-
[48]
Reinforcement learning upside down: Don’t predict rewards — just map them to actions
J ¨urgen Schmidhuber. Reinforcement learning upside down: Don’t predict rewards — just map them to actions. arXiv preprint, arXiv:1912.02875, 2019. 3, 4
-
[49]
Reward-conditioned policies.CoRR, abs/1912.13465,
Aviral Kumar, Xue Bin Peng, and Sergey Levine. Reward-conditioned policies.CoRR, abs/1912.13465,
-
[50]
Decision transformer: Rein- forcement learning via sequence modeling
Lili Chen, Kevin Lu, Aravind Rajeswaran, Kimin Lee, Aditya Grover, Michael Laskin, Pieter Abbeel, Aravind Srinivas, and Igor Mordatch. Decision transformer: Rein- forcement learning via sequence modeling. InAdvances in Neural Information Processing Systems (NeurIPS) 34, 2021
work page 2021
-
[51]
David Brandfonbrener, Alberto Bietti, Jacob Buckman, Romain Laroche, and Joan Bruna. When does return- conditioned supervised learning work for offline rein- forcement learning? InAdvances in Neural Information Processing Systems (NeurIPS) 35, 2022. 4
work page 2022
-
[52]
Scott Emmons, Benjamin Eysenbach, Ilya Kostrikov, and Sergey Levine. Rvs: What is essential for offline rl via supervised learning? InProceedings of the 10th International Conference on Learning Representations (ICLR), 2022
work page 2022
-
[53]
Generalized decision transformer for offline hindsight information matching
Hiroki Furuta, Yusuke Matsuo, and Shixiang Shane Gu. Generalized decision transformer for offline hindsight information matching. InProceedings of the 10th International Conference on Learning Representations (ICLR), 2022
work page 2022
-
[54]
Taku Yamagata, Ahmed Khalil, and Ra ´ul Santos- Rodr´ıguez. Q-learning decision transformer: Leveraging dynamic programming for conditional sequence mod- elling in offline rl. InProceedings of the 40th Interna- tional Conference on Machine Learning (ICML), volume 202 ofProceedings of Machine Learning Research, pages 38989–39007. PMLR, 2023
work page 2023
-
[55]
Qinqing Zheng, Amy Zhang, and Aditya Grover. Online decision transformer. InProceedings of the 39th Interna- tional Conference on Machine Learning (ICML), volume 162 ofProceedings of Machine Learning Research, pages 27042–27059. PMLR, 2022
work page 2022
-
[56]
Advantage-conditioned diffusion: Offline rl via general- ization
Jakub Grudzien Kuba, Pieter Abbeel, and Sergey Levine. Advantage-conditioned diffusion: Offline rl via general- ization. 2023
work page 2023
-
[57]
Yueh-Hua Wu, Xiaolong Wang, and Masashi Hamaya. Elastic decision transformer. InProceedings of the 37th Conference on Neural Information Processing Systems (NeurIPS), 2023. doi: 10.5555/3666122.3666936. 3
-
[58]
Concept2robot: Learning manipu- lation concepts from instructions and human demonstra- tions
Lin Shao, Toki Migimatsu, Qiang Zhang, Kaiyuan Yang, and Jeannette Bohg. Concept2robot: Learning manipu- lation concepts from instructions and human demonstra- tions. InProceedings of Robotics: Science & Systems (RSS), 2020. doi: 10.15607/RSS.2020.XVI.082. 3
-
[59]
Annie S. Chen, Suraj Nair, and Chelsea Finn. Learn- ing generalizable robotic reward functions from “in-the- wild” human videos. InProceedings of Robotics: Science & Systems (RSS) 2021, 2021
work page 2021
-
[60]
Learning language- conditioned robot behavior from offline data and crowd- sourced annotation
Suraj Nair, Eric Mitchell, Kevin Chen, Brian Ichter, Silvio Savarese, and Chelsea Finn. Learning language- conditioned robot behavior from offline data and crowd- sourced annotation. InProceedings of the 5th Conference on Robot Learning (CoRL), volume 164 ofProceed- ings of Machine Learning Research, pages 1303–1315. PMLR, 2022
work page 2022
-
[61]
Sontakke, Jesse Zhang, S ´ebastien M.R
Sumedh A. Sontakke, Jesse Zhang, S ´ebastien M.R. Arnold, Karl Pertsch, Erdem Bıyık, Dorsa Sadigh, Chelsea Finn, and Laurent Itti. Roboclip: One demon- stration is enough to learn robot policies. InProceedings of the 37th Conference on Neural Information Processing Systems (NeurIPS), 2023
work page 2023
-
[62]
Language to rewards for robotic skill synthesis
Wenhao Yu, Nimrod Gileadi, Chuyuan Fu, Sean Kir- mani, Kuang-Huei Lee, Montse Gonzalez Arenas, Hao- Tien Lewis Chiang, Tom Erez, Leonard Hasenclever, Jan Humplik, Brian Ichter, Ted Xiao, Peng Xu, Andy Zeng, Tingnan Zhang, Nicolas Heess, Dorsa Sadigh, Jie Tan, Yuval Tassa, and Fei Xia. Language to rewards for robotic skill synthesis. InProceedings of the 7...
-
[63]
Lim, Jesse Thomason, Erdem Bıyık, and Jesse Zhang
Jiahui Zhang, Yusen Luo, Abrar Anwar, Sumedh Anand Sontakke, Joseph J. Lim, Jesse Thomason, Erdem Bıyık, and Jesse Zhang. Rewind: Language-guided rewards teach robot policies without new demonstrations. In Proceedings of the 9th Conference on Robot Learning (CoRL), 2025
work page 2025
-
[64]
Minttu Alakuijala, Reginald McLean, Isaac Woungang, Nariman Farsad, Samuel Kaski, Pekka Marttinen, and Kai Yuan. Video-language critic: Transferable reward functions for language-conditioned robotics.Transac- tions on Machine Learning Research, 2025:1–22, 2025. 3
work page 2025
-
[65]
Liv: Language-image representations and rewards for robotic control
Yecheng Jason Ma, William Liang, Vaidehi Som, Vikash Kumar, Amy Zhang, Osbert Bastani, and Dinesh Jayara- man. Liv: Language-image representations and rewards for robotic control. InProceedings of the 40th Interna- tional Conference on Machine Learning (ICML), 2023. 3
work page 2023
-
[66]
Vision language models are in-context value learners
Yecheng Jason Ma, Joey Hejna, Chuyuan Fu, Dhruv Shah, Jacky Liang, Zhuo Xu, Sean Kirmani, Peng Xu, Danny Driess, Ted Xiao, Osbert Bastani, Dinesh Ja- yaraman, Wenhao Yu, Tingnan Zhang, Dorsa Sadigh, and Fei Xia. Vision language models are in-context value learners. InProceedings of the 13th International Conference on Learning Representations (ICLR), 2025. 3
work page 2025
-
[67]
Proximal Policy Optimization Algorithms
John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms.arXiv preprint arXiv:1707.06347, 2017. 3, 4, 17
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[68]
Maximum a posteriori policy optimisation
Abbas Abdolmaleki, Jost Tobias Springenberg, Yuval Tassa, Remi Munos, Nicolas Heess, and Martin Ried- miller. Maximum a posteriori policy optimisation. In International Conference on Learning Representations,
-
[69]
Advantage-Weighted Regression: Simple and Scalable Off-Policy Reinforcement Learning
Xue Bin Peng, Aviral Kumar, Grace Zhang, and Sergey Levine. Advantage-weighted regression: Simple and scalable off-policy reinforcement learning.arXiv preprint arXiv:1910.00177, 2019. 4, 9
work page internal anchor Pith review Pith/arXiv arXiv 1910
-
[70]
Peter Dayan and Geoffrey E. Hinton. Using expectation- maximization for reinforcement learning.Neural Com- putation, 9(2):271–278, 1997. doi: 10.1162/neco.1997.9. 2.271
-
[71]
Rel- ative entropy policy search
Jan Peters, Katharina M ¨ulling, and Yasemin Alt ¨un. Rel- ative entropy policy search. InProceedings of the Twenty-Fourth AAAI Conference on Artificial Intelli- gence, AAAI’10, page 1607–1612. AAAI Press, 2010. 3
work page 2010
-
[72]
Exponentially weighted imitation learning for batched historical data
Qing Wang, Jiechao Xiong, Lei Han, peng sun, Han Liu, and Tong Zhang. Exponentially weighted imitation learning for batched historical data. In S. Bengio, H. Wal- lach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors,Advances in Neural Information Processing Systems, volume 31, 2018. 3
work page 2018
-
[73]
A distributional perspective on reinforcement learning
Marc G Bellemare, Will Dabney, and R ´emi Munos. A distributional perspective on reinforcement learning. InInternational conference on machine learning, pages 449–458. PMLR, 2017. 4
work page 2017
-
[74]
Knowledge insulating vision-language-action models: Train fast, run fast, generalize better
Danny Driess, Jost Tobias Springenberg, Brian Ichter, Lili Yu, Adrian Li-Bell, Karl Pertsch, Allen Z Ren, Homer Walke, Quan Vuong, Lucy Xiaoyang Shi, et al. Knowledge insulating vision-language-action models: Train fast, run fast, generalize better. InProceedings of the 37th Conference on Neural Information Processing Systems (NeurIPS), 2025. 4, 6
work page 2025
-
[75]
Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor.ICML, 2018. 4
work page 2018
-
[76]
Ziyu Wang, Alexander Novikov, Konrad Zolna, Josh S Merel, Jost Tobias Springenberg, Scott E Reed, Bobak Shahriari, Noah Siegel, Caglar Gulcehre, Nicolas Heess, and Nando de Freitas. Critic regularized regression. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin, editors,Advances in Neural Information Processing Systems, volume 33, pages 7...
work page 2020
-
[77]
Offline reinforcement learning with implicit q-learning
Ilya Kostrikov, Ashvin Nair, and Sergey Levine. Offline reinforcement learning with implicit q-learning. InInter- national Conference on Learning Representations, 2022. 4
work page 2022
-
[78]
Karl Pertsch, Kyle Stachowicz, Brian Ichter, Danny Driess, Suraj Nair, Quan Vuong, Oier Mees, Chelsea Finn, and Sergey Levine. FAST: Efficient action tok- enization for vision-language-action models.Robotics: Science and Systems, 2025. 6
work page 2025
-
[79]
Gemma Team, Aishwarya Kamath, Johan Ferret, Shreya Pathak, Nino Vieillard, Ramona Merhej, Sarah Per- rin, Tatiana Matejovicova, Alexandre Ram ´e, Morgane Rivi`ere, Louis Rouillard, Thomas Mesnard, Geoffrey Cideron, Jean bastien Grill, Sabela Ramos, Edouard Yvinec, Michelle Casbon, Etienne Pot, Ivo Penchev, Ga¨el Liu, Francesco Visin, Kathleen Kenealy, Luc...
work page 2025
-
[80]
Flow Matching for Generative Modeling
Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maxim- ilian Nickel, and Matt Le. Flow matching for generative modeling.arXiv preprint arXiv:2210.02747, 2022. 6, 16
work page internal anchor Pith review Pith/arXiv arXiv 2022
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.