arxiv: 2511.14759 · v2 · submitted 2025-11-18 · 💻 cs.LG · cs.RO


π^{*}_{0.6}: a VLA That Learns From Experience

Physical Intelligence, Ali Amin, Raichelle Aniceto, Ashwin Balakrishna, Kevin Black, Ken Conley, Grace Connors, James Darpinian, Karan Dhabalia, Jared DiCarlo, Danny Driess, Michael Equi, Adnan Esmail, Yunhao Fang, Chelsea Finn, Catherine Glossop, Thomas Godden, Ivan Goryachev, Lachy Groom, Hunter Hancock, Karol Hausman, Gashon Hussein, Brian Ichter, Szymon Jakubczak, Rowan Jen, Tim Jones, Ben Katz, Liyiming Ke, Chandra Kuchi, Marinda Lamb, Devin LeBlanc, Sergey Levine, Adrian Li-Bell, Yao Lu, Vishnu Mano, Mohith Mothukuri, Suraj Nair, Karl Pertsch, Allen Z. Ren, Charvi Sharma, Lucy Xiaoyang Shi, Laura Smith, Jost Tobias Springenberg, Kyle Stachowicz, Will Stoeckle, Alex Swerdlow, James Tanner, Marcel Torne, Quan Vuong, Anna Walling, Haohuan Wang, Blake Williams, Sukwon Yoo, Lili Yu, Ury Zhilinsky, Zhiyuan Zhou


Pith reviewed 2026-05-12 10:29 UTC · model grok-4.3


keywords: vision-language-action models, reinforcement learning, robot learning, advantage conditioning, real-world deployment, heterogeneous data, self-improvement, policy refinement


The pith

Advantage-conditioned policies let a pre-trained VLA improve on real household tasks by training on its own deployments and corrections.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents RECAP, a method that trains vision-language-action models through reinforcement learning on data collected during actual robot use. It begins with offline RL pre-training of a generalist model called π*_{0.6}, then refines it using mixed sources: expert demonstrations, the model's own on-policy attempts, and human interventions that correct mistakes mid-execution. The resulting policy is shown to fold laundry in homes, assemble boxes reliably, and operate a professional espresso machine, with large gains in speed and success on the most difficult cases. A sympathetic reader would care because this outlines a concrete route for physical robots to accumulate skill from deployment experience rather than remaining frozen after initial training.

Core claim

RECAP uses advantage conditioning to turn heterogeneous real-world data into stable policy updates for VLAs. After offline RL pre-training of π*_{0.6}, the method collects on-robot rollouts, estimates an advantage for each action, and trains the policy to favor higher-advantage actions while incorporating teleoperated corrections when the robot fails. The reported result is more than doubled task throughput and roughly halved failure rates on some of the hardest tasks.

What carries the argument

Advantage-conditioned policies. RECAP estimates the advantage of each action across the mixed data sources and conditions the VLA's output on that value, which lets it blend demonstrations, self-generated data, and interventions without a separate weighting scheme for each source.
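
As a rough illustration of what advantage conditioning can look like in practice, the sketch below labels transitions from mixed sources with a binary "improvement" indicator derived from an advantage estimate, which a policy could then take as an extra input. The `Transition` layout, the source tags, the threshold `epsilon`, and the rule that forces intervention actions to be labeled positive are all assumptions made for illustration; this page does not specify how RECAP computes or discretizes advantages.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Transition:
    source: str   # "demo", "policy", or "intervention" (hypothetical tags)
    ret: float    # empirical return-to-go observed after this step
    value: float  # critic's value estimate for the state (assumed given)


def advantage(t: Transition) -> float:
    """Simple advantage estimate: observed return minus predicted value."""
    return t.ret - t.value


def improvement_indicator(t: Transition, epsilon: float = 0.0) -> int:
    """Binary conditioning label: 1 if the action looks better than expected."""
    # Assumption: human corrections are treated as good actions by construction.
    if t.source == "intervention":
        return 1
    return int(advantage(t) > epsilon)


def label_batch(batch: List[Transition], epsilon: float = 0.0) -> List[int]:
    """Label every transition in a mixed-source batch."""
    return [improvement_indicator(t, epsilon) for t in batch]


batch = [
    Transition("demo", ret=-10.0, value=-12.0),          # better than expected
    Transition("policy", ret=-30.0, value=-15.0),        # worse than expected
    Transition("intervention", ret=-20.0, value=-18.0),  # forced positive
]
labels = label_batch(batch)
print(labels)
```

Under this reading, the policy would see the indicator alongside the observation during training and would be conditioned on the positive label at deployment to request higher-advantage behavior.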

If this is right

A single generalist VLA can be specialized to new tasks through modest on-robot data collection rather than full retraining.
Task throughput more than doubles and failure rates roughly halve on the hardest real-world activities when the full RECAP pipeline is applied.
Teleoperated interventions during autonomous runs can be folded back into training to correct failures without discarding the entire rollout.
The same pre-trained base model supports both broad capabilities and high performance on specific physical tasks after experience-based refinement.
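
One way to picture keeping a rollout that contains a correction, rather than discarding it, is to tag each step with who was in control and retain both kinds of segment as training data. The following is a minimal sketch under assumed conventions (a per-step `human_control` flag and string source tags, both hypothetical); this page does not describe RECAP's actual data format.

```python
from typing import List, Tuple


def tag_rollout(actions: List[str], human_control: List[bool]) -> List[Tuple[str, str]]:
    """Pair each action with a source tag instead of dropping the episode."""
    if len(actions) != len(human_control):
        raise ValueError("actions and human_control must have the same length")
    return [
        (a, "intervention" if h else "policy")
        for a, h in zip(actions, human_control)
    ]


def segments(tagged: List[Tuple[str, str]]) -> List[Tuple[str, int, int]]:
    """Collapse per-step tags into (source, start, end_exclusive) runs."""
    runs = []
    start = 0
    for i in range(1, len(tagged) + 1):
        # Close the current run at the end of the episode or when the tag changes.
        if i == len(tagged) or tagged[i][1] != tagged[start][1]:
            runs.append((tagged[start][1], start, i))
            start = i
    return runs


actions = ["reach", "grasp", "slip", "regrasp", "fold", "place"]
human = [False, False, False, True, True, False]
tagged = tag_rollout(actions, human)
runs = segments(tagged)
print(runs)
```

Keeping the segment boundaries explicit lets a downstream trainer treat autonomous steps and corrective steps differently while still using the whole episode.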

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the method scales, fleets of robots could pool their deployment data to accelerate collective improvement across homes and factories.
The approach reduces reliance on exhaustive expert demonstrations by turning ordinary failures and fixes into useful training signal.
Similar conditioning could be tested on longer-horizon tasks or multi-robot coordination where data sources are even more varied.

Load-bearing premise

Advantage conditioning on mixed demonstrations, on-policy data, and interventions will produce stable improvement in the real world without triggering large distribution shifts or unsafe autonomous behavior.

What would settle it

Run the RECAP-trained π*_{0.6} and the offline-pretrained version side by side on the same set of new household tasks for 100 trials each; if the RECAP version shows neither a lower failure rate nor a higher throughput, the central claim is falsified.
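
To make the proposed comparison concrete, here is a small sketch of how failure counts from two 100-trial runs could be compared with a one-sided two-proportion z-test. The counts are made-up placeholders, not results from the paper, and the choice of test is an assumption; the paper's own evaluation protocol is not described on this page.

```python
import math


def two_proportion_z(fail_a: int, n_a: int, fail_b: int, n_b: int) -> tuple:
    """One-sided z-test for H1: the failure rate of B is lower than that of A."""
    p_a, p_b = fail_a / n_a, fail_b / n_b
    pooled = (fail_a + fail_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 0.0, 1.0  # no variation at all: nothing to distinguish
    z = (p_a - p_b) / se
    p_value = 0.5 * math.erfc(z / math.sqrt(2))  # P(Z >= z) under H0
    return z, p_value


# Placeholder counts for illustration only (A = offline-pretrained, B = RECAP).
z, p = two_proportion_z(fail_a=40, n_a=100, fail_b=20, n_b=100)
print(f"z = {z:.2f}, one-sided p = {p:.4f}")
```

Throughput (successful tasks per unit time) would need its own comparison, for example on per-episode completion times, since a lower failure rate alone does not establish a speed gain.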

Original abstract

We study how vision-language-action (VLA) models can improve through real-world deployments via reinforcement learning (RL). We present a general-purpose method, RL with Experience and Corrections via Advantage-conditioned Policies (RECAP), that provides for RL training of VLAs via advantage conditioning. Our method incorporates heterogeneous data into the self-improvement process, including demonstrations, data from on-policy collection, and expert teleoperated interventions provided during autonomous execution. RECAP starts by pre-training a generalist VLA with offline RL, which we call $\pi^{*}_{0.6}$, that can then be specialized to attain high performance on downstream tasks through on-robot data collection. We show that the $\pi^{*}_{0.6}$ model trained with the full RECAP method can fold laundry in real homes, reliably assemble boxes, and make espresso drinks using a professional espresso machine. On some of the hardest tasks, RECAP more than doubles task throughput and roughly halves the task failure rate.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

RECAP shows a workable path for VLAs to keep learning from mixed real-world data on tasks like laundry and espresso, but the abstract leaves the RL mechanics and evidence too thin to judge the gains yet.


The main point is that this paper gives a concrete method, RECAP, for training VLAs with advantage conditioning on a blend of demonstrations, on-policy data, and human interventions during deployment. The authors start with offline RL to produce a generalist model, π*_{0.6}, then use that base to collect more data and improve on specific skills. The result is a system that can fold laundry in actual homes, assemble boxes reliably, and operate a professional espresso machine, with reported gains of more than double the throughput and roughly half the failures on the harder cases.

Referee Report

2 major / 0 minor

Summary. The manuscript introduces RECAP (RL with Experience and Corrections via Advantage-conditioned Policies), a method for improving vision-language-action (VLA) models via reinforcement learning that incorporates heterogeneous data sources: demonstrations, on-policy rollouts, and teleoperated interventions during autonomous execution. It describes pre-training a generalist VLA called π*_{0.6} with offline RL, followed by specialization on real-robot tasks, and claims that the resulting model can fold laundry in homes, assemble boxes, and operate a professional espresso machine, with RECAP more than doubling task throughput and roughly halving failure rates on some of the hardest tasks.

Significance. If the performance claims are supported by rigorous experiments, the work would be significant for real-world robotics by showing a scalable path for VLA self-improvement in unstructured settings using mixed data without requiring fully autonomous safe exploration. It directly targets the gap between offline pre-training and deployment-time adaptation. However, the absence of any experimental protocol, baselines, or analysis in the manuscript prevents determining whether these gains represent genuine policy improvement or artifacts of human intervention.

major comments (2)

[Abstract] The central empirical claims (more than doubling task throughput and roughly halving failure rates on hardest tasks) are stated without any accompanying experimental protocol, task definitions, trial counts, baselines, error bars, statistical tests, or ablation studies. This directly undermines evaluation of the reported numbers and is load-bearing for the paper's primary contribution.
[Abstract] No description is given of how advantages are estimated from the heterogeneous data mixture (demonstrations, on-policy collection, teleoperated interventions), including whether a learned critic, Monte-Carlo returns, or GAE is used, and whether importance sampling, clipping, or safety filters are applied. This leaves the skeptic's concern about high-variance advantage estimates and distribution shift unaddressed and is central to the stability of the claimed RL improvement.
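
For readers unfamiliar with the estimators named in the second comment, the sketch below computes Monte-Carlo returns and generalized advantage estimation (GAE) for a single episode. It uses the standard textbook definitions; the reward values, critic values, `gamma`, and `lam` are illustrative assumptions, and nothing here indicates which estimator RECAP actually uses.

```python
from typing import List


def mc_returns(rewards: List[float], gamma: float = 1.0) -> List[float]:
    """Discounted return-to-go: G_t = r_t + gamma * G_{t+1}."""
    out, g = [0.0] * len(rewards), 0.0
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g
        out[t] = g
    return out


def gae(rewards: List[float], values: List[float],
        gamma: float = 0.99, lam: float = 0.95) -> List[float]:
    """GAE: A_t = sum_k (gamma * lam)^k * delta_{t+k},
    with delta_t = r_t + gamma * V(s_{t+1}) - V(s_t).
    `values` holds V(s_0), ..., V(s_T); the last entry is the bootstrap value."""
    if len(values) != len(rewards) + 1:
        raise ValueError("values must have one more entry than rewards")
    adv, running = [0.0] * len(rewards), 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        adv[t] = running
    return adv


rewards = [-1.0, -1.0, -1.0]      # e.g. a per-step time penalty
values = [-3.0, -2.0, -1.0, 0.0]  # critic estimates, terminal value 0
print(mc_returns(rewards))
print(gae(rewards, values, gamma=1.0, lam=1.0))
```

With `gamma = 1` and `lam = 1`, GAE reduces to the Monte-Carlo return minus the critic's value, so the perfectly calibrated critic in this example yields zero advantage at every step. Smaller `lam` leans more on the critic and trades variance for bias, which is the trade-off the referee's question is probing.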

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments on our manuscript. We address the major concerns point-by-point below and will make revisions to improve clarity and completeness, particularly regarding the experimental details and methodological specifics.


Referee: [Abstract] The central empirical claims (more than doubling task throughput and roughly halving failure rates on hardest tasks) are stated without any accompanying experimental protocol, task definitions, trial counts, baselines, error bars, statistical tests, or ablation studies. This directly undermines evaluation of the reported numbers and is load-bearing for the paper's primary contribution.

Authors: We recognize that the abstract presents the results at a high level without the supporting experimental details. To address this, we will revise the abstract to incorporate a brief description of the experimental protocol, including task definitions, number of trials, and mention of baselines and statistical analysis. Additionally, we will ensure the Experiments section is expanded if needed to include all requested elements such as error bars and ablations. This will allow readers to properly evaluate the claims.
revision: yes

Referee: [Abstract] No description is given of how advantages are estimated from the heterogeneous data mixture (demonstrations, on-policy collection, teleoperated interventions), including whether a learned critic, Monte-Carlo returns, or GAE is used, and whether importance sampling, clipping, or safety filters are applied. This leaves the skeptic's concern about high-variance advantage estimates and distribution shift unaddressed and is central to the stability of the claimed RL improvement.

Authors: We agree that the description of advantage estimation from the heterogeneous data is insufficient in the current manuscript. We will add a comprehensive subsection detailing the advantage computation: specifying the use of a learned critic, the choice of Monte-Carlo returns for demonstrations and GAE for on-policy data, the application of importance sampling and clipping for mixed data, and safety filters for interventions. This will include discussion of variance and distribution shift mitigation to strengthen the methodological foundation.
revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical claims with no derivation chain


The paper's central claims consist of empirical performance results on real-world robotics tasks (laundry folding, box assembly, espresso making) after applying the RECAP method. No equations, derivations, fitted parameters, or mathematical predictions are presented in the abstract or described structure. The method is introduced as a general-purpose RL approach incorporating heterogeneous data, but the reported gains (doubled throughput, halved failure rates) are direct experimental outcomes rather than quantities derived from prior fitted values or self-referential definitions. No self-citation load-bearing steps, ansatz smuggling, or renaming of known results appear in the provided text. The derivation chain is absent, rendering the paper self-contained as an empirical report.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the unproven effectiveness of advantage conditioning for integrating heterogeneous real-world data into VLA policies; no free parameters or new entities are named in the abstract.

axioms (1)

domain assumption: Advantage-conditioned policies can stably incorporate demonstrations, on-policy data, and expert interventions for real-world VLA improvement.
This premise is required for the RECAP method to function as described.

pith-pipeline@v0.9.0 · 5696 in / 1328 out tokens · 92348 ms · 2026-05-12T10:29:53.168398+00:00 · methodology


Forward citations

Cited by 39 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

RotVLA: Rotational Latent Action for Vision-Language-Action Model
cs.RO 2026-05 unverdicted novelty 7.0

RotVLA models latent actions as continuous SO(n) rotations with triplet-frame supervision and flow-matching to reach 98.2% success on LIBERO and 89.6%/88.5% on RoboTwin2.0 using a 1.7B-parameter model.
Runtime Monitoring of Perception-Based Autonomous Systems via Embedding Temporal Logic
cs.LG 2026-05 unverdicted novelty 7.0

Embedding Temporal Logic (ETL) performs runtime monitoring directly in learned embedding spaces using distance-based predicates composed with temporal operators, supported by conformal calibration for reliable predica...
Runtime Monitoring of Perception-Based Autonomous Systems via Embedding Temporal Logic
cs.LG 2026-05 unverdicted novelty 7.0

Embedding Temporal Logic enables runtime monitoring of temporally extended perceptual behaviors by defining predicates via distances between observed and reference embeddings in learned spaces, with conformal calibrat...
Offline Policy Evaluation for Manipulation Policies via Discounted Liveness Formulation
cs.RO 2026-05 conditional novelty 7.0

A liveness-based Bellman operator enables conservative offline policy evaluation for manipulation tasks by encoding task progression and reducing truncation bias from finite horizons.
DiscreteRTC: Discrete Diffusion Policies are Natural Asynchronous Executors
cs.RO 2026-04 unverdicted novelty 7.0

Discrete diffusion policies support native asynchronous execution via unmasking for real-time chunking, delivering higher success rates and 0.7x inference cost versus flow-matching RTC on dynamic robotics benchmarks a...
${\pi}_{0.7}$: a Steerable Generalist Robotic Foundation Model with Emergent Capabilities
cs.LG 2026-04 unverdicted novelty 7.0

π₀.₇ is a steerable generalist robotic model that uses rich multimodal prompts including language, subgoal images, and performance metadata to achieve out-of-the-box generalization across tasks and robot bodies.
ScoRe-Flow: Complete Distributional Control via Score-Based Reinforcement Learning for Flow Matching
cs.RO 2026-04 unverdicted novelty 7.0

ScoRe-Flow achieves decoupled mean-variance control in stochastic flow matching by deriving a closed-form score for drift modulation plus learned variance, yielding faster RL convergence and higher success rates on lo...
ViVa: A Video-Generative Value Model for Robot Reinforcement Learning
cs.RO 2026-04 unverdicted novelty 7.0

ViVa turns a video generator into a value model for robot RL that jointly forecasts future states and task value, yielding better performance on real-world box assembly when integrated with RECAP.
Action Images: End-to-End Policy Learning via Multiview Video Generation
cs.CV 2026-04 unverdicted novelty 7.0

Action Images turn robot arm motions into interpretable multiview pixel videos, letting video backbones serve as zero-shot policies for end-to-end robot learning.
Hand-in-the-Loop: Improving Dexterous VLA via Seamless Interventional Correction
cs.RO 2026-05 unverdicted novelty 6.0

HandITL blends human intent with policy execution to eliminate gesture jumps in dexterous VLA interventions, cutting jitter by 99.8%, grasp failures by 87.5%, and yielding 19% better refined policies.
Reinforcing VLAs in Task-Agnostic World Models
cs.AI 2026-05 unverdicted novelty 6.0

RAW-Dream lets VLAs learn new tasks in zero-shot imagination by using a world model pre-trained only on task-free behaviors and an unmodified VLM to supply rewards, with dual-noise verification to limit hallucinations.
TMRL: Diffusion Timestep-Modulated Pretraining Enables Exploration for Efficient Policy Finetuning
cs.RO 2026-05 unverdicted novelty 6.0

TMRL bridges behavioral cloning pretraining and RL finetuning via diffusion noise and timestep modulation to enable controlled exploration, improving sample efficiency and enabling real-world robot training in under one hour.
Unified Noise Steering for Efficient Human-Guided VLA Adaptation
cs.RO 2026-05 unverdicted novelty 6.0

UniSteer unifies human corrective actions and noise-space RL for VLA adaptation by inverting actions to noise targets, raising success rates from 20% to 90% in 66 minutes across four real-world manipulation tasks.
RePO-VLA: Recovery-Driven Policy Optimization for Vision-Language-Action Models
cs.RO 2026-05 unverdicted novelty 6.0

RePO-VLA raises average adversarial success rates in VLA manipulation from 20% to 75% by using recovery-aware initialization, a progress-aware semantic value function, and value-conditioned refinement on success and c...
How to utilize failure demo data?: Effective data selection for imitation learning using distribution differences in attention mechanism
cs.RO 2026-05 unverdicted novelty 6.0

The method uses attention discrepancy metrics on latent success-failure representations to select beneficial failure data for imitation learning, raising task success rates in simulations.
Long-Horizon Q-Learning: Accurate Value Learning via n-Step Inequalities
cs.AI 2026-05 unverdicted novelty 6.0

LQL turns n-step action-sequence lower bounds into a practical hinge-loss stabilizer for off-policy Q-learning without extra networks or forward passes.
Long-Horizon Q-Learning: Accurate Value Learning via n-Step Inequalities
cs.AI 2026-05 unverdicted novelty 6.0

LQL stabilizes Q-learning by penalizing violations of n-step action-sequence lower bounds with a hinge loss computed from standard network outputs.
Seeing Realism from Simulation: Efficient Video Transfer for Vision-Language-Action Data Augmentation
cs.CV 2026-05 unverdicted novelty 6.0

A video transfer pipeline augments simulated VLA data into realistic videos while preserving actions, yielding consistent performance gains on robot benchmarks such as 8% on Robotwin 2.0.
Learning while Deploying: Fleet-Scale Reinforcement Learning for Generalist Robot Policies
cs.RO 2026-05 unverdicted novelty 6.0

Fleet-scale RL framework improves a single generalist VLA policy from deployment data to 95% average success on eight real-world manipulation tasks with 16 dual-arm robots.
LaST-R1: Reinforcing Robotic Manipulation via Adaptive Physical Latent Reasoning
cs.RO 2026-04 unverdicted novelty 6.0

LaST-R1 reaches 99.8% average success on the LIBERO benchmark using one-shot warm-up plus LAPO reinforcement learning on latent physical reasoning, with up to 44% real-world gains on complex single- and dual-arm tasks.
LaST-R1: Reinforcing Robotic Manipulation via Adaptive Physical Latent Reasoning
cs.RO 2026-04 unverdicted novelty 6.0

LaST-R1 introduces a RL post-training method called LAPO that optimizes latent Chain-of-Thought reasoning in vision-language-action models, yielding 99.9% success on LIBERO and up to 22.5% real-world gains.
PRTS: A Primitive Reasoning and Tasking System via Contrastive Representations
cs.AI 2026-04 unverdicted novelty 6.0

PRTS pretrains VLA models with contrastive goal-conditioned RL to embed goal-reachability probabilities from offline data, yielding SOTA results on robotic benchmarks especially for long-horizon and novel instructions.
RL Token: Bootstrapping Online RL with Vision-Language-Action Models
cs.LG 2026-04 unverdicted novelty 6.0

RL Token enables sample-efficient online RL fine-tuning of large VLAs, delivering up to 3x speed gains and higher success rates on real-robot manipulation tasks within minutes to hours.
FASTER: Value-Guided Sampling for Fast RL
cs.LG 2026-04 unverdicted novelty 6.0

FASTER models multi-candidate denoising as an MDP and trains a value function to filter actions early, delivering the performance of full sampling at lower cost in diffusion RL policies.
VAG: Dual-Stream Video-Action Generation for Embodied Data Synthesis
cs.RO 2026-04 unverdicted novelty 6.0

VAG is a synchronized dual-stream flow-matching framework that generates aligned video-action pairs for synthetic embodied data synthesis and policy pretraining.
E-VLA: Event-Augmented Vision-Language-Action Model for Dark and Blurred Scenes
cs.CV 2026-04 conditional novelty 6.0

E-VLA integrates event streams directly into VLA models via lightweight fusion, raising Pick-Place success from 0% to 60-90% at 20 lux and from 0% to 20-25% under severe motion blur.
Multi-View Video Diffusion Policy: A 3D Spatio-Temporal-Aware Video Action Model
cs.RO 2026-04 conditional novelty 6.0

MV-VDP jointly predicts multi-view RGB and heatmap videos via diffusion to achieve data-efficient, robust robotic manipulation policies.
ARM: Advantage Reward Modeling for Long-Horizon Manipulation
cs.RO 2026-04 unverdicted novelty 6.0

ARM trains reward models on Progressive/Regressive/Stagnant labels to enable adaptive reweighting in offline RL, reaching 99.4% success on towel-folding with minimal human intervention.
Open-Loop Planning, Closed-Loop Verification: Speculative Verification for VLA
cs.RO 2026-04 unverdicted novelty 6.0

SV-VLA uses infrequent heavy VLA planning of action chunks plus a lightweight closed-loop verifier to achieve both efficiency and robustness in dynamic robot control.
ProcVLM: Learning Procedure-Grounded Progress Rewards for Robotic Manipulation
cs.RO 2026-05 unverdicted novelty 5.0

ProcVLM learns procedure-grounded dense progress rewards for robotic manipulation via a reasoning-before-estimation VLM trained on a 60M-frame synthesized corpus from 30 embodied datasets.
Is the Future Compatible? Diagnosing Dynamic Consistency in World Action Models
cs.RO 2026-05 unverdicted novelty 5.0

Action-state consistency in World Action Models distinguishes successful from failed imagined futures and supports value-free selection of better rollouts via consensus among predictions.
Cooptimizing Safety and Performance Using Safety Value-Constrained Model Predictive Control
cs.RO 2026-04 unverdicted novelty 5.0

Augments MPC with a safety value function terminal constraint to achieve recursive feasibility and persistent safety while co-optimizing performance.
VLA Foundry: A Unified Framework for Training Vision-Language-Action Models
cs.RO 2026-04 unverdicted novelty 5.0

VLA Foundry provides a single training stack for VLA models and releases open models that match prior closed-source performance or outperform baselines on multi-task manipulation in simulation.
SpanVLA: Efficient Action Bridging and Learning from Negative-Recovery Samples for Vision-Language-Action Model
cs.CV 2026-04 unverdicted novelty 5.0

SpanVLA reduces action generation latency via flow-matching conditioned on history and improves robustness by training on negative-recovery samples with GRPO and a dedicated reasoning dataset.
CoEnv: Driving Embodied Multi-Agent Collaboration via Compositional Environment
cs.RO 2026-04 unverdicted novelty 5.0

CoEnv introduces a compositional environment that integrates real and simulated spaces for multi-agent robotic collaboration, using real-to-sim reconstruction, VLM action synthesis, and validated sim-to-real transfer ...
RLDX-1 Technical Report
cs.RO 2026-05 unverdicted novelty 4.0

RLDX-1 outperforms frontier VLAs such as π0.5 and GR00T N1.6 on dexterous manipulation benchmarks, reaching 86.8% success on ALLEX humanoid tasks versus around 40% for the baselines.
RLDX-1 Technical Report
cs.RO 2026-05 unverdicted novelty 4.0

RLDX-1 achieves 86.8% success on complex ALLEX humanoid manipulation tasks where prior VLAs reach only around 40%.
OmniVLA-RL: A Vision-Language-Action Model with Spatial Understanding and Online RL
cs.RO 2026-04 unverdicted novelty 4.0

OmniVLA-RL uses a mix-of-transformers architecture and flow-matching reformulated as SDE with group segmented policy optimization to surpass prior VLA models on LIBERO benchmarks.
Redefining End-of-Life: Intelligent Automation for Electronics Remanufacturing Systems
eess.SY 2026-04 unverdicted novelty 2.0

A literature review of intelligent automation approaches using robotics, AI, and control for disassembly, inspection, sorting, and reprocessing of end-of-life electronics.

Reference graph

Works this paper leans on

87 extracted references · 87 canonical work pages · cited by 35 Pith papers · 5 internal anchors

[1]

MIT press, 2018

Richard S Sutton and Andrew G Barto.Reinforcement learning: An introduction. MIT press, 2018. 1

work page 2018
[2]

Riedmiller

Sascha Lange, Thomas Gabel, and Martin A. Ried- miller. Batch reinforcement learning. In Marco A. Wiering and Martijn van Otterlo, editors,Reinforce- ment Learning, volume 12 ofAdaptation, Learning, and Optimization, pages 45–73. Springer, 2012. doi: 10.1007/978-3-642-27645-3\ 2. 2, 4

work page doi:10.1007/978-3-642-27645-3 2012
[3]

Offline Reinforcement Learning: Tutorial, Review, and Perspectives on Open Problems

Sergey Levine, Aviral Kumar, George Tucker, and Justin Fu. Offline reinforcement learning: Tutorial, review, and perspectives on open problems.arXiv preprint arXiv:2005.01643, 2020. 2, 4

work page internal anchor Pith review Pith/arXiv arXiv 2005
[4]

Diffusion guidance is a controllable policy improvement operator.arXiv preprint arXiv:2505.23458, 2025

Kevin Frans, Seohong Park, Pieter Abbeel, and Sergey Levine. Diffusion guidance is a controllable policy improvement operator.arXiv preprint, arXiv:2505.23458,

work page arXiv
[5]

In9th Annual Conference on Robot Learning, 2025

Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Esmail, Michael Robert Equi, Chelsea Finn, Niccolo Fusai, Manuel Y Galliker, et al.π 0.5: a vision-language-action model with open- world generalization. In9th Annual Conference on Robot Learning, 2025. 2, 3, 5, 7, 8

work page 2025
[6]

Physical Intelligence Team.π 0.6 model card. 2025. 2, 5, 6, 8

work page 2025
[7]

A reduction of imitation learning and structured prediction to no-regret online learning

St ´ephane Ross, Geoffrey Gordon, and Drew Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. InAISTATS, pages 627–635,

work page
[8]

Shiv: Reducing supervisor burden in dagger using support vectors for efficient learning from demonstrations in high dimensional state spaces

Michael Laskey, Jonathan Lee, Roy Fox, Anca Dragan, and Ken Goldberg. Shiv: Reducing supervisor burden in dagger using support vectors for efficient learning from demonstrations in high dimensional state spaces. In Proceedings of the 2016 IEEE International Conference on Robotics and Automation (ICRA), pages 462–469,

work page 2016
[9]

doi: 10.1109/ICRA.2016.7487175. 2

work page doi:10.1109/icra.2016.7487175 2016
[10]

Dra- gan, and Ken Goldberg

Michael Laskey, Jonathan Lee, Roy Fox, Anca D. Dra- gan, and Ken Goldberg. Dart: Noise injection for robust imitation learning. InProceedings of the 34th Interna- tional Conference on Machine Learning (ICML), vol- ume 70 ofProceedings of Machine Learning Research, pages 1989–1998. PMLR, 2017

work page 1989
[11]

Bc-z: Zero-shot task generalization with robotic imitation learning

Eric Jang, Alex Irpan, Mohi Khansari, Daniel Kappler, Frederik Ebert, Corey Lynch, Sergey Levine, and Chelsea Finn. Bc-z: Zero-shot task generalization with robotic imitation learning. InConference on Robot Learning, pages 991–1002. PMLR, 2022. 2

work page 2022
[12]

Rac: Robot learning for long-horizon tasks by scaling recovery and correction.arXiv preprint arXiv:2509.07953, 2025

Zheyuan Hu, Robyn Wu, Naveen Enock, Jasmine Li, Riya Kadakia, Zackory Erickson, and Aviral Kumar. Rac: Robot learning for long-horizon tasks by scaling recovery and correction.arXiv preprint, arXiv:2509.07953, 2025. 2

work page arXiv 2025
[13]

Hg-dagger: Inter- active imitation learning with human experts

Michael Kelly, Chelsea Sidrane, Katherine Driggs- Campbell, and Mykel J Kochenderfer. Hg-dagger: Inter- active imitation learning with human experts. InICRA,

work page
[14]

End-to-end training of deep visuomotor policies

Sergey Levine, Chelsea Finn, Trevor Darrell, and Pieter Abbeel. End-to-end training of deep visuomotor policies. The Journal of Machine Learning Research, 17(1):1334– 1373, 2016. 2

work page 2016
[15]

Qt-opt: Scalable deep reinforcement learning for vision-based robotic manipulation, 2018

Dmitry Kalashnikov, Alex Irpan, Peter Pastor, Julian Ibarz, Alexander Herzog, Eric Jang, Deirdre Quillen, Ethan Holly, Mrinal Kalakrishnan, Vincent Vanhoucke, et al. QT-Opt: Scalable deep reinforcement learning for vision-based robotic manipulation.arXiv preprint arXiv:1806.10293, 2018

work page arXiv 2018
[16]

Iris: Implicit reinforcement without interaction at scale for learning control from offline robot manipulation data.ICRA, 2020

Ajay Mandlekar, Fabio Ramos, Byron Boots, Li Fei- Fei, Animesh Garg, and Dieter Fox. Iris: Implicit reinforcement without interaction at scale for learning control from offline robot manipulation data.ICRA, 2020

work page 2020
[17]

Ahmed Ahmed Rehaan Ahmad, and Chelsea Finn

Archit Sharma, M. Ahmed Ahmed Rehaan Ahmad, and Chelsea Finn. Self-improving robots: End-to-end autonomous visuomotor reinforcement learning. In Proceedings of the 7th Conference on Robot Learning (CoRL), volume 229, pages 3292–3308. PMLR, 2023

work page 2023
[18]

2023 IEEE International Conference on Robotics and Automation (ICRA) , volume =

Russell Mendonca, Shikhar Bahl, and Deepak Pathak. Alan: Autonomously exploring robotic agents in the real world. InProceedings of the 2023 IEEE Interna- tional Conference on Robotics and Automation (ICRA), pages 3044–3050, 2023. doi: 10.1109/ICRA48891.2023. 10013321

work page doi:10.1109/icra48891.2023 2023
[19]

Continuously improving mobile manipulation with autonomous real- world rl

Russell Mendonca, Emmanuel Panov, Bernadette Bucher, Jiuguang Wang, and Deepak Pathak. Continuously improving mobile manipulation with autonomous real- world rl. InProceedings of the 8th Conference on Robot Learning (CoRL), pages 5204–5219, 2024

work page 2024
[20]

Serl: A software suite for sample-efficient robotic reinforcement learning, 2024

Jianlan Luo, Zheyuan Hu, Charles Xu, You Liang Tan, Jacob Berg, Archit Sharma, Stefan Schaal, Chelsea Finn, Abhishek Gupta, and Sergey Levine. Serl: A software suite for sample-efficient robotic reinforcement learning, 2024

work page 2024
[21]

Residual off- policy rl for finetuning behavior cloning policies.arXiv preprint arXiv:2509.19301, 2025

Lars Ankile, Zhenyu Jiang, Rocky Duan, Guanya Shi, Pieter Abbeel, and Anusha Nagabandi. Residual off- policy rl for finetuning behavior cloning policies.arXiv preprint arXiv:2509.19301, 2025

work page arXiv 2025
[22]

Thomas Lampe, Abbas Abdolmaleki, Sarah Bechtle, Sandy H. Huang, Jost Tobias Springenberg, Michael Bloesch, Oliver Groth, Roland Hafner, Tim Hertweck, Michael Neunert, Markus Wulfmeier, Jingwei Zhang, Francesco Nori, Nicolas Heess, and Martin Riedmiller. Mastering stacking of diverse shapes with large-scale iterative reinforcement learning on real robots. ...

[23]

Perry Dong, Suvir Mirchandani, Dorsa Sadigh, and Chelsea Finn. What matters for batch online reinforcement learning in robotics? arXiv preprint arXiv:2505.08078, 2025.
[24]

Allen Z. Ren, Justin Lidard, Lars Lien Ankile, Anthony Simeonov, Pulkit Agrawal, Anirudha Majumdar, Benjamin Burchfiel, Hongkai Dai, and Max Simchowitz. Diffusion Policy Policy Optimization. In Proceedings of the 2025 International Conference on Learning Representations (ICLR), 2025.
[25]

Kun Lei, Huanyu Li, Dongjie Yu, Zhenyu Wei, Lingxiao Guo, Zhennan Jiang, Ziyu Wang, Shiyu Liang, and Huazhe Xu. Rl-100: Performant robotic manipulation with real-world reinforcement learning. arXiv preprint arXiv:2510.14830, 2025.
[26]

Dmitry Kalashnikov, Jake Varley, Yevgen Chebotar, Ben Swanson, Rico Jonschkowski, Chelsea Finn, Sergey Levine, and Karol Hausman. Mt-opt: Continuous multi-task robotic reinforcement learning at scale. arXiv, 2021.
[27]

Abhishek Gupta, Justin Yu, Tony Z. Zhao, Vikash Kumar, Aaron Rovinsky, Kelvin Xu, Thomas Devlin, and Sergey Levine. Reset-free reinforcement learning via multi-task learning: Learning dexterous manipulation behaviors without human intervention. In Proceedings of the 2021 IEEE International Conference on Robotics and Automation (ICRA), pages 6664–6671, 2021.
[28]

Konstantinos Bousmalis, Giulia Vezzani, Dushyant Rao, Coline Devin, Alex X Lee, Maria Bauza, Todor Davchev, Yuxiang Zhou, Agrim Gupta, Akhil Raju, et al. Robocat: A self-improving foundation agent for robotic manipulation. arXiv preprint arXiv:2306.11706, 2023.
[29]

Aviral Kumar, Anikait Singh, Frederik Ebert, Mitsuhiko Nakamoto, Yanlai Yang, Chelsea Finn, and Sergey Levine. Pre-training for robots: Offline reinforcement learning enables learning new tasks from a handful of trials. In Proceedings of Robotics: Science and Systems (RSS), 2023. doi: 10.15607/RSS.2023.XIX.019.
[30]

Jingyun Yang, Max Sobol Mark, Brandon Vu, Archit Sharma, Jeannette Bohg, and Chelsea Finn. Robot fine-tuning made easy: Pre-training rewards and policies for autonomous real-world reinforcement learning. In Proceedings of the 2024 IEEE International Conference on Robotics and Automation (ICRA), 2024. doi: 10.1109/ICRA57147.2024.10610421.
[31]

Shuhan Tan, Kairan Dou, Yue Zhao, and Philipp Krähenbühl. Interactive post-training for vision-language-action models. arXiv preprint arXiv:2505.17016, 2025.
[32]

Guanxing Lu, Wenkai Guo, Chubin Zhang, Yuheng Zhou, Haonan Jiang, Zifeng Gao, Yansong Tang, and Ziwei Wang. Vla-rl: Towards masterful and general robotic manipulation with scalable reinforcement learning. arXiv preprint arXiv:2505.18719, 2025.
[33]

Jijia Liu, Feng Gao, Bingwen Wei, Xinlei Chen, Qingmin Liao, Yi Wu, Chao Yu, and Yu Wang. What can rl bring to vla generalization? an empirical study. arXiv preprint arXiv:2505.19789, 2025.
[34]

Kang Chen, Zhihao Liu, Tonghe Zhang, Zhen Guo, Si Xu, Hao Lin, Hongzhi Zang, Quanlu Zhang, Zhaofei Yu, Guoliang Fan, Tiejun Huang, Yu Wang, and Chao Yu. π_RL: Online rl fine-tuning for flow-based vision-language-action models. arXiv preprint arXiv:2510.25889, 2025.
[35]

Haozhan Li, Yuxin Zuo, Jiale Yu, Yuhao Zhang, Zhaohui Yang, Kaiyan Zhang, Xuekai Zhu, Yuchen Zhang, Tianxing Chen, Ganqu Cui, Dehui Wang, Dingxiang Luo, Yuchen Fan, Youbang Sun, Jia Zeng, Jiangmiao Pang, Shanghang Zhang, Yu Wang, Yao Mu, Bowen Zhou, and Ning Ding. Simplevla-rl: Scaling vla training via reinforcement learning. arXiv preprint arXiv:2509.09674,
[36]

Yanjiang Guo, Jianke Zhang, Xiaoyu Chen, Xiang Ji, Yen-Jen Wang, Yucheng Hu, and Jianyu Chen. Improving vision-language-action model with online reinforcement learning. arXiv preprint arXiv:2501.16664, 2025.
[37]

Wenli Xiao, Haotian Lin, Andy Peng, Haoru Xue, Tairan He, Yuqi Xie, Fengyuan Hu, Jimmy Wu, Zhengyi Luo, Linxi "Jim" Fan, Guanya Shi, and Yuke Zhu. Self-improving vision-language-action models with data generation via residual rl, 2025.
[38]

Yuhui Chen, Shuai Tian, Shugao Liu, Yingting Zhou, Haoran Li, and Dongbin Zhao. Conrft: A reinforced fine-tuning method for vla models via consistency policy. arXiv preprint arXiv:2502.05450, 2025.
[39]

Max Sobol Mark, Tian Gao, Georgia Gabriela Sampaio, Mohan Kumar Srirama, Archit Sharma, Chelsea Finn, and Aviral Kumar. Policy-agnostic rl: Offline rl and online rl fine-tuning of any class and backbone. arXiv preprint arXiv:2412.06685, 2024.
[40]

Mitsuhiko Nakamoto, Oier Mees, Aviral Kumar, and Sergey Levine. Steering your generalists: Improving robotic foundation models via value guidance. In Conference on Robot Learning, pages 4996–5013. PMLR, 2025.
[41]

Yang Zhang, Chenwei Wang, Ouyang Lu, Yuan Zhao, Yunfei Ge, Zhenglong Sun, Xiu Li, Chi Zhang, Chenjia Bai, and Xuelong Li. Align-then-steer: Adapting the vision-language action models through unified latent guidance. arXiv preprint arXiv:2509.02055, 2025.
[42]

Andrew Wagenmaker, Mitsuhiko Nakamoto, Yunchu Zhang, Seohong Park, Waleed Yagoub, Anusha Nagabandi, Abhishek Gupta, and Sergey Levine. Steering your diffusion policy with latent space reinforcement learning. In Proceedings of the 9th Conference on Robot Learning (CoRL), 2025.
[43]

Charles Xu, Qiyang Li, Jianlan Luo, and Sergey Levine. Rldg: Robotic generalist policy distillation via reinforcement learning. arXiv preprint arXiv:2412.09858, 2024.
[44]

Dongchi Huang, Zhirui Fang, Tianle Zhang, Yihang Li, Lin Zhao, and Chunhe Xia. Co-rft: Efficient fine-tuning of vision-language-action models through chunked offline reinforcement learning. arXiv preprint arXiv:2508.02219, 2025.
[45]

Zijian Zhang, Kaiyuan Zheng, Zhaorun Chen, Joel Jang, Yi Li, Siwei Han, Chaoqi Wang, Mingyu Ding, Dieter Fox, and Huaxiu Yao. Grape: Generalizing robot policy via preference alignment. arXiv preprint arXiv:2411.19309, 2024.
[46]

Shaopeng Zhai, Qi Zhang, Tianyi Zhang, Fuxian Huang, Haoran Zhang, Ming Zhou, Shengzhe Zhang, Litao Liu, Sixu Lin, and Jiangmiao Pang. A vision-language-action-critic model for robotic real-world reinforcement learning. arXiv preprint arXiv:2509.15937, 2025.
[47]

Seyed Kamyar Ghasemipour, Ayzaan Wahid, Jonathan Tompson, Pannag Sanketi, and Igor Mordatch. Self-improving embodied foundation models. arXiv preprint arXiv:2509.15155, 2025.
[48]

Jürgen Schmidhuber. Reinforcement learning upside down: Don’t predict rewards -- just map them to actions. arXiv preprint arXiv:1912.02875, 2019.
[49]

Aviral Kumar, Xue Bin Peng, and Sergey Levine. Reward-conditioned policies. CoRR, abs/1912.13465,
[50]

Lili Chen, Kevin Lu, Aravind Rajeswaran, Kimin Lee, Aditya Grover, Michael Laskin, Pieter Abbeel, Aravind Srinivas, and Igor Mordatch. Decision transformer: Reinforcement learning via sequence modeling. In Advances in Neural Information Processing Systems (NeurIPS) 34, 2021.
[51]

David Brandfonbrener, Alberto Bietti, Jacob Buckman, Romain Laroche, and Joan Bruna. When does return-conditioned supervised learning work for offline reinforcement learning? In Advances in Neural Information Processing Systems (NeurIPS) 35, 2022.
[52]

Scott Emmons, Benjamin Eysenbach, Ilya Kostrikov, and Sergey Levine. Rvs: What is essential for offline rl via supervised learning? In Proceedings of the 10th International Conference on Learning Representations (ICLR), 2022.
[53]

Hiroki Furuta, Yusuke Matsuo, and Shixiang Shane Gu. Generalized decision transformer for offline hindsight information matching. In Proceedings of the 10th International Conference on Learning Representations (ICLR), 2022.
[54]

Taku Yamagata, Ahmed Khalil, and Raúl Santos-Rodríguez. Q-learning decision transformer: Leveraging dynamic programming for conditional sequence modelling in offline rl. In Proceedings of the 40th International Conference on Machine Learning (ICML), volume 202 of Proceedings of Machine Learning Research, pages 38989–39007. PMLR, 2023.
[55]

Qinqing Zheng, Amy Zhang, and Aditya Grover. Online decision transformer. In Proceedings of the 39th International Conference on Machine Learning (ICML), volume 162 of Proceedings of Machine Learning Research, pages 27042–27059. PMLR, 2022.
[56]

Jakub Grudzien Kuba, Pieter Abbeel, and Sergey Levine. Advantage-conditioned diffusion: Offline rl via generalization. 2023.
[57]

Yueh-Hua Wu, Xiaolong Wang, and Masashi Hamaya. Elastic decision transformer. In Proceedings of the 37th Conference on Neural Information Processing Systems (NeurIPS), 2023. doi: 10.5555/3666122.3666936.
[58]

Lin Shao, Toki Migimatsu, Qiang Zhang, Kaiyuan Yang, and Jeannette Bohg. Concept2robot: Learning manipulation concepts from instructions and human demonstrations. In Proceedings of Robotics: Science & Systems (RSS), 2020. doi: 10.15607/RSS.2020.XVI.082.
[59]

Annie S. Chen, Suraj Nair, and Chelsea Finn. Learning generalizable robotic reward functions from “in-the-wild” human videos. In Proceedings of Robotics: Science & Systems (RSS) 2021, 2021.
[60]

Suraj Nair, Eric Mitchell, Kevin Chen, Brian Ichter, Silvio Savarese, and Chelsea Finn. Learning language-conditioned robot behavior from offline data and crowd-sourced annotation. In Proceedings of the 5th Conference on Robot Learning (CoRL), volume 164 of Proceedings of Machine Learning Research, pages 1303–1315. PMLR, 2022.
[61]

Sumedh A. Sontakke, Jesse Zhang, Sébastien M.R. Arnold, Karl Pertsch, Erdem Bıyık, Dorsa Sadigh, Chelsea Finn, and Laurent Itti. Roboclip: One demonstration is enough to learn robot policies. In Proceedings of the 37th Conference on Neural Information Processing Systems (NeurIPS), 2023.
[62]

Wenhao Yu, Nimrod Gileadi, Chuyuan Fu, Sean Kirmani, Kuang-Huei Lee, Montse Gonzalez Arenas, Hao-Tien Lewis Chiang, Tom Erez, Leonard Hasenclever, Jan Humplik, Brian Ichter, Ted Xiao, Peng Xu, Andy Zeng, Tingnan Zhang, Nicolas Heess, Dorsa Sadigh, Jie Tan, Yuval Tassa, and Fei Xia. Language to rewards for robotic skill synthesis. In Proceedings of the 7...
[63]

Jiahui Zhang, Yusen Luo, Abrar Anwar, Sumedh Anand Sontakke, Joseph J. Lim, Jesse Thomason, Erdem Bıyık, and Jesse Zhang. Rewind: Language-guided rewards teach robot policies without new demonstrations. In Proceedings of the 9th Conference on Robot Learning (CoRL), 2025.
[64]

Minttu Alakuijala, Reginald McLean, Isaac Woungang, Nariman Farsad, Samuel Kaski, Pekka Marttinen, and Kai Yuan. Video-language critic: Transferable reward functions for language-conditioned robotics. Transactions on Machine Learning Research, 2025:1–22, 2025.
[65]

Yecheng Jason Ma, William Liang, Vaidehi Som, Vikash Kumar, Amy Zhang, Osbert Bastani, and Dinesh Jayaraman. Liv: Language-image representations and rewards for robotic control. In Proceedings of the 40th International Conference on Machine Learning (ICML), 2023.
[66]

Yecheng Jason Ma, Joey Hejna, Chuyuan Fu, Dhruv Shah, Jacky Liang, Zhuo Xu, Sean Kirmani, Peng Xu, Danny Driess, Ted Xiao, Osbert Bastani, Dinesh Jayaraman, Wenhao Yu, Tingnan Zhang, Dorsa Sadigh, and Fei Xia. Vision language models are in-context value learners. In Proceedings of the 13th International Conference on Learning Representations (ICLR), 2025.
[67]

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
[68]

Abbas Abdolmaleki, Jost Tobias Springenberg, Yuval Tassa, Remi Munos, Nicolas Heess, and Martin Riedmiller. Maximum a posteriori policy optimisation. In International Conference on Learning Representations,
[69]

Xue Bin Peng, Aviral Kumar, Grace Zhang, and Sergey Levine. Advantage-weighted regression: Simple and scalable off-policy reinforcement learning. arXiv preprint arXiv:1910.00177, 2019.
[70]

Peter Dayan and Geoffrey E. Hinton. Using expectation-maximization for reinforcement learning. Neural Computation, 9(2):271–278, 1997. doi: 10.1162/neco.1997.9.2.271.
[71]

Jan Peters, Katharina Mülling, and Yasemin Altün. Relative entropy policy search. In Proceedings of the Twenty-Fourth AAAI Conference on Artificial Intelligence, AAAI’10, pages 1607–1612. AAAI Press, 2010.
[72]

Qing Wang, Jiechao Xiong, Lei Han, Peng Sun, Han Liu, and Tong Zhang. Exponentially weighted imitation learning for batched historical data. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 31, 2018.
[73]

Marc G Bellemare, Will Dabney, and Rémi Munos. A distributional perspective on reinforcement learning. In International conference on machine learning, pages 449–458. PMLR, 2017.
[74]

Danny Driess, Jost Tobias Springenberg, Brian Ichter, Lili Yu, Adrian Li-Bell, Karl Pertsch, Allen Z Ren, Homer Walke, Quan Vuong, Lucy Xiaoyang Shi, et al. Knowledge insulating vision-language-action models: Train fast, run fast, generalize better. In Proceedings of the 37th Conference on Neural Information Processing Systems (NeurIPS), 2025.
[75]

Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. ICML, 2018.
[76]

Ziyu Wang, Alexander Novikov, Konrad Zolna, Josh S Merel, Jost Tobias Springenberg, Scott E Reed, Bobak Shahriari, Noah Siegel, Caglar Gulcehre, Nicolas Heess, and Nando de Freitas. Critic regularized regression. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 7...
[77]

Ilya Kostrikov, Ashvin Nair, and Sergey Levine. Offline reinforcement learning with implicit q-learning. In International Conference on Learning Representations, 2022.
[78]

Karl Pertsch, Kyle Stachowicz, Brian Ichter, Danny Driess, Suraj Nair, Quan Vuong, Oier Mees, Chelsea Finn, and Sergey Levine. FAST: Efficient action tokenization for vision-language-action models. Robotics: Science and Systems, 2025.
[79]

Gemma Team, Aishwarya Kamath, Johan Ferret, Shreya Pathak, Nino Vieillard, Ramona Merhej, Sarah Perrin, Tatiana Matejovicova, Alexandre Ramé, Morgane Rivière, Louis Rouillard, Thomas Mesnard, Geoffrey Cideron, Jean-bastien Grill, Sabela Ramos, Edouard Yvinec, Michelle Casbon, Etienne Pot, Ivo Penchev, Gaël Liu, Francesco Visin, Kathleen Kenealy, Luc...
[80]

Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling. arXiv preprint arXiv:2210.02747, 2022.
