EXPO: Stable Reinforcement Learning with Expressive Policies

Chelsea Finn; Dorsa Sadigh; Perry Dong; Qiyang Li

arxiv: 2507.07986 · v3 · submitted 2025-07-10 · 💻 cs.LG · cs.AI

EXPO: Stable Reinforcement Learning with Expressive Policies

Perry Dong , Qiyang Li , Dorsa Sadigh , Chelsea Finn This is my paper

Pith reviewed 2026-05-19 05:06 UTC · model grok-4.3

classification 💻 cs.LG cs.AI

keywords reinforcement learningexpressive policiesdiffusion policiesoffline to onlinepolicy optimizationsample efficiencyfine-tuning

0 comments

The pith

EXPO stabilizes value maximization for expressive policies in online RL by pairing a base imitation policy with a lightweight Gaussian edit policy that selects higher-value actions on the fly.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to show that expressive policies such as diffusion and flow-matching models can be trained and fine-tuned stably with online reinforcement learning when given an offline dataset. Direct gradient optimization through their long denoising chains tends to destabilize value maximization, so the method instead keeps the expressive policy trained only via imitation learning and adds a small Gaussian edit policy that adjusts sampled actions toward higher Q-values. The on-the-fly policy then picks the better of the original or edited action both for data collection and for temporal-difference backups. A sympathetic reader would care because this approach promises to bring the representational power of modern generative models into online RL without the usual instability or sample-inefficiency penalties.

Core claim

EXPO constructs an on-the-fly policy that maximizes Q-value by sampling actions from a large expressive base policy trained with a stable imitation objective, editing those actions with a lightweight Gaussian policy toward higher value, and then selecting the higher-Q action from the pair for both sampling and TD backup; this yields up to 2-3x better sample efficiency than prior methods when fine-tuning a pretrained policy or training online from offline data.

What carries the argument

The on-the-fly policy that selects between base-policy actions and actions edited by the lightweight Gaussian edit policy to maximize Q-value for both sampling and backups.

Load-bearing premise

The lightweight Gaussian edit policy can reliably produce higher-value actions from the base policy samples without introducing instability or bias into the TD backups or the on-the-fly selection process.

What would settle it

Train the same base and edit policies but replace the value-based selection step with random choice between base and edited actions; if sample efficiency then falls back to the level of prior direct-optimization methods, the selection mechanism is necessary for the reported gains.

Figures

Figures reproduced from arXiv: 2507.07986 by Chelsea Finn, Dorsa Sadigh, Perry Dong, Qiyang Li.

**Figure 1.** Figure 1: Left: Expressive Policy Optimization (EXPO) is a stable, sample efficient method for training expressive policies with reinforcement learning by avoiding direct optimization over the value function with the expressive policy. Right: Average performance over tasks of EXPO and prior methods. intermediate denoising steps to guide the denoising process towards high-value actions (Psenka et al., 2023; Fang et a… view at source ↗

**Figure 2.** Figure 2: The edit policy transforms actions of the base policy into actions that further maximize Q-value while encouraging exploration. The blue contour represents the Q-values of actions of a single state and the orange contours represent the Gaussian distributions of actions the edit policy changes the base actions into. rely on implicit Q-learning objective where the Q-target is computed without policy samples.… view at source ↗

**Figure 3.** Figure 3: Visualizations of 12 sparse-reward environments we evaluate on. Note that Antmaze medium and Antmaze large both have two dataset variants. We note that because the on-the-fly policy is parameterized to maximize the Q-function and the action a˜ ∗ t+1 is the action sample with the highest Q-value, this procedure can be viewed as equivalent to a standard Q-learning update with the implicit policy. 4.3 PRACTIC… view at source ↗

**Figure 4.** Figure 4: Online RL results on 12 challenging sparse-reward tasks. Across almost every task, EXPO consistently exceeds or matches the performance of the best baseline—even without any pretraining. 5.2 BASELINES We evaluate our method in both the online setting (no pre-traning) as well as the offline-to-online setting (offline pre-training followed by online fine-tuning). We compare our method against prior state-of-… view at source ↗

**Figure 5.** Figure 5: Offline-to-online RL results on 12 challenging sparse-reward tasks. EXPO consistently exceeds or matches the performance of the best baseline. The relative benefit of EXPO over baselines is especially large on the manipulation tasks, where prior methods often struggle to improve in performance. Importantly, EXPO does not drop in performance going from pre-training to fine-tuning. 5.5 WHAT COMPONENTS OF EXP… view at source ↗

**Figure 6.** Figure 6: Ablation over on-the-fly policy extraction in the TD backup. We find that using value-maximizing actions in TD backup is vital for performance [PITH_FULL_IMAGE:figures/full_fig_p009_6.png] view at source ↗

**Figure 8.** Figure 8: Varying the offline dataset. We find that better offline data, as measured by the performance of an imitation learning policy trained on the data, correlates strongly with performance of EXPO. The plot is averaged over 3 seeds. How does the offline dataset affect performance? Because EXPO trains the base expressive policy with imitation learning, a natural question to ask is how does the offline dataset i… view at source ↗

**Figure 9.** Figure 9: Ablation on not keeping the offline dataset for fine-tuning. We find that EXPO can learn effectively even without retaining the offline dataset after pre-training. Can EXPO effectively fine-tune a pre-trained policy without the offline dataset? To better understand the role of the offline dataset as a prior in EXPO, we study EXPO in the setting of finetuning a pre-trained policy without the offline datase… view at source ↗

read the original abstract

We study the problem of training and fine-tuning expressive policies with online reinforcement learning (RL) given an offline dataset. Training expressive policy classes with online RL present a unique challenge of stable value maximization. Unlike simpler Gaussian policies commonly used in online RL, expressive policies like diffusion and flow-matching policies are parameterized by a long denoising chain, which hinders stable gradient propagation from actions to policy parameters when optimizing against some value function. Our key insight is that we can address stable value maximization by avoiding direct optimization over value with the expressive policy and instead construct an on-the-fly RL policy to maximize Q-value. We propose Expressive Policy Optimization (EXPO), a sample-efficient online RL algorithm that utilizes an on-the-fly policy to maximize value with two parameterized policies -- a larger expressive base policy trained with a stable imitation learning objective and a light-weight Gaussian edit policy that edits the actions sampled from the base policy toward a higher value distribution. The on-the-fly policy optimizes the actions from the base policy with the learned edit policy and chooses the value maximizing action from the base and edited actions for both sampling and temporal-difference (TD) backup. Our approach yields up to 2-3x improvement in sample efficiency on average over prior methods both in the setting of fine-tuning a pretrained policy given offline data and in leveraging offline data to train online.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

EXPO stabilizes expressive policy RL online by using an imitation base plus lightweight Gaussian editor for on-the-fly max-Q actions, but the TD target selection risks uncorrected optimistic bias.

read the letter

The punchline is that EXPO gives a way to do online RL with diffusion or flow policies without the usual gradient problems by not optimizing the expressive part directly against Q. They train the big base policy with imitation learning on the offline data, then learn a small Gaussian edit policy that shifts base samples to better actions according to the current critic. For both environment steps and the TD target, they sample from the base, edit it, and pick whichever has higher Q.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces Expressive Policy Optimization (EXPO), an online RL algorithm for training and fine-tuning expressive policies (diffusion/flow-matching) given offline data. It trains a large base policy via stable imitation learning and a lightweight Gaussian edit policy; an on-the-fly policy then selects the higher-Q action between base and edited samples for both environment steps and TD backups, claiming up to 2-3x average sample-efficiency gains over prior methods in both fine-tuning and online-from-offline settings.

Significance. If the empirical claims hold under rigorous controls, the work would be significant for enabling stable online RL with complex policy classes that are otherwise difficult to optimize directly against value functions. The hybrid use of imitation-trained expressive policies plus lightweight edits offers a practical route to leverage offline data without full offline RL overhead.

major comments (2)

[Abstract and §3] Abstract and §3 (method): the on-the-fly max-Q selection for both sampling and TD targets uses the same Q-network without double-Q or target-network decoupling. This risks uncorrected optimistic bias when the edit policy is imperfect, directly threatening the stability claim and the reported 2-3x sample-efficiency gains. A concrete test (e.g., ablation with double-Q or target-network selection) is needed to show the bias does not inflate returns.
[§4] §4 (experiments): the reported gains lack error bars, full ablation tables on edit-policy scale/training frequency, and controls for the selection bias identified above. Without these, it is impossible to verify whether the central sample-efficiency claim is robust or driven by the optimistic selection mechanism.

minor comments (2)

[§3] Notation for the edit policy variance and the exact form of the on-the-fly policy (Eq. numbers in §3) should be clarified to avoid ambiguity with standard Gaussian policies.
[§4] Figure captions and axis labels in the experimental plots should explicitly state the number of seeds and whether shaded regions are standard error or min/max.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The concerns regarding potential optimistic bias in the on-the-fly selection mechanism and the need for more rigorous experimental reporting are well-taken. We address each major comment below and outline the revisions we will make to strengthen the paper.

read point-by-point responses

Referee: [Abstract and §3] Abstract and §3 (method): the on-the-fly max-Q selection for both sampling and TD targets uses the same Q-network without double-Q or target-network decoupling. This risks uncorrected optimistic bias when the edit policy is imperfect, directly threatening the stability claim and the reported 2-3x sample-efficiency gains. A concrete test (e.g., ablation with double-Q or target-network selection) is needed to show the bias does not inflate returns.

Authors: We appreciate the referee's identification of this potential source of bias. Our design intentionally avoids direct value optimization of the expressive base policy to promote stability, relying instead on imitation learning for the base and small Gaussian edits. The on-the-fly max-Q choice is applied only between the base action and the lightly edited action, which limits the scope for over-optimism compared to unconstrained maximization. Nevertheless, to directly test the concern, we have run an additional ablation that decouples the selection Q-network via a target network (updated with polyak averaging) and will include these results in the revised §4. The performance gains remain consistent, supporting that the reported improvements are not primarily driven by uncorrected bias. revision: yes
Referee: [§4] §4 (experiments): the reported gains lack error bars, full ablation tables on edit-policy scale/training frequency, and controls for the selection bias identified above. Without these, it is impossible to verify whether the central sample-efficiency claim is robust or driven by the optimistic selection mechanism.

Authors: We agree that error bars and expanded ablations are necessary to substantiate the empirical claims. In the revision we will add standard error bars computed over at least five random seeds for all main results. We will also include a fuller set of ablation tables varying edit-policy network size, update frequency, and perturbation scale. To control for selection bias we will add a comparison against a variant that uses random selection between base and edited actions (instead of max-Q) and report the resulting sample-efficiency curves; these controls will be presented alongside the original results in the updated §4. revision: yes

Circularity Check

0 steps flagged

No significant circularity: algorithmic construction remains independent of its inputs.

full rationale

The paper presents EXPO as a new algorithmic procedure that combines a base expressive policy trained by imitation learning, a lightweight Gaussian edit policy, and on-the-fly selection of the higher-Q action for both sampling and TD targets. This construction does not reduce any claimed performance gain or stability property to a fitted parameter or self-citation by definition; the 2-3x sample-efficiency result is framed as an empirical outcome measured against external baselines rather than a quantity forced by the method's own equations. No load-bearing uniqueness theorems, ansatzes smuggled via prior self-work, or renaming of known patterns appear in the derivation chain. The approach is therefore self-contained and externally falsifiable.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The method rests on standard RL assumptions plus the modeling choice that a small Gaussian edit policy can improve upon samples from a larger expressive policy without destabilizing learning.

free parameters (1)

edit policy scale and training frequency
Hyperparameters controlling how much the Gaussian editor can change actions and how often it is updated; these are chosen to balance stability and improvement.

axioms (1)

domain assumption The value function provides a reliable signal for selecting between base and edited actions during both data collection and TD updates.
Invoked when the on-the-fly policy chooses the value-maximizing action for sampling and backup.

pith-pipeline@v0.9.0 · 5768 in / 1284 out tokens · 26037 ms · 2026-05-19T05:06:15.768388+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

The on-the-fly policy optimizes the actions from the base policy with the learned edit policy and chooses the value maximizing action from the base and edited actions for both sampling and temporal-difference (TD) backup.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 5 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Coordinated Diffusion: Generating Multi-Agent Behavior Without Multi-Agent Demonstrations
cs.RO 2026-05 unverdicted novelty 7.0

CoDi decomposes the multi-agent diffusion score into pre-trained single-agent policies plus a gradient-free cost guidance term to generate coordinated behavior from single-agent data alone.
TMRL: Diffusion Timestep-Modulated Pretraining Enables Exploration for Efficient Policy Finetuning
cs.RO 2026-05 unverdicted novelty 6.0

TMRL bridges behavioral cloning pretraining and RL finetuning via diffusion noise and timestep modulation to enable controlled exploration, improving sample efficiency and enabling real-world robot training in under one hour.
Adaptive Action Chunking via Multi-Chunk Q Value Estimation
cs.LG 2026-05 unverdicted novelty 6.0

ACH lets RL policies dynamically pick action chunk lengths by jointly estimating Q-values for all candidate lengths via a single Transformer pass.
OGPO: Sample Efficient Full-Finetuning of Generative Control Policies
cs.LG 2026-05 unverdicted novelty 6.0

OGPO is a sample-efficient off-policy method for full finetuning of generative control policies that reaches SOTA on robotic manipulation tasks and can recover from poor behavior-cloning initializations without expert data.
FASTER: Value-Guided Sampling for Fast RL
cs.LG 2026-04 unverdicted novelty 6.0

FASTER models multi-candidate denoising as an MDP and trains a value function to filter actions early, delivering the performance of full sampling at lower cost in diffusion RL policies.

Reference graph

Works this paper leans on

30 extracted references · 30 canonical work pages · cited by 5 Pith papers · 11 internal anchors

[1]

Ankile, L., Simeonov, A., Shenfeld, I., and Agrawal, P

Lars Ankile, Anthony Simeonov, Idan Shenfeld, Marcel Torne, and Pulkit Agrawal. From imitation to refinement–residual rl for precise assembly. arXiv preprint arXiv:2407.16677,

work page arXiv
[2]

$\pi_0$: A Vision-Language-Action Flow Model for General Robot Control

Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, et al. π0: A vision-language-action flow model for general robot control. arXiv preprint arXiv:2410.24164,

work page internal anchor Pith review Pith/arXiv arXiv
[3]

Offline reinforcement learning via high-fidelity generative behavior modeling

Huayu Chen, Cheng Lu, Chengyang Ying, Hang Su, and Jun Zhu. Offline reinforcement learning via high-fidelity generative behavior modeling. arXiv preprint arXiv:2209.14548,

work page arXiv
[4]

arXiv preprint arXiv:2101.05982 , year=

Xinyue Chen, Che Wang, Zijian Zhou, and Keith Ross. Randomized ensembled double q-learning: Learning fast without a model. arXiv preprint arXiv:2101.05982,

work page arXiv
[5]

Diffusion-based reinforcement learning via q-weighted variational policy optimization

Shutong Ding, Ke Hu, Zhenhao Zhang, Kan Ren, Weinan Zhang, Jingyi Yu, Jingya Wang, and Ye Shi. Diffusion-based reinforcement learning via q-weighted variational policy optimization. arXiv preprint arXiv:2405.16173,

work page arXiv
[6]

Perry Dong, Alec M Lessing, Annie S Chen, and Chelsea Finn

URL https://openreview.net/forum?id=v8jdwkUNXb. Perry Dong, Alec M Lessing, Annie S Chen, and Chelsea Finn. Reinforcement learning via implicit imitation guidance. arXiv preprint arXiv:2506.07505,

work page arXiv
[7]

Linjiajie Fang, Ruoxue Liu, Jing Zhang, Wenjia Wang, and Bing-Yi Jing

URL https://proceedings.mlr.press/v80/espeholt18a.html. Linjiajie Fang, Ruoxue Liu, Jing Zhang, Wenjia Wang, and Bing-Yi Jing. Diffusion actor-critic: Formulating constrained policy iteration as diffusion noise regression for offline reinforcement learning. arXiv preprint arXiv:2405.20555,

work page arXiv
[8]

D4RL: Datasets for Deep Data-Driven Reinforcement Learning

URL https://arxiv.org/abs/2004.07219. Yuwei Fu, Di Wu, and Benoit Boulet. A closer look at offline rl agents. Advances in Neural Information Processing Systems, 35:8591–8604,

work page internal anchor Pith review Pith/arXiv arXiv 2004
[9]

MADE: Masked Autoencoder for Distribution Estimation

URL https://arxiv.org/abs/1502.03509. Seyed Kamyar Seyed Ghasemipour, Dale Schuurmans, and Shixiang Shane Gu. Emaq: Expected-max q-learning operator for simple yet effective offline and online rl. In International Conference on Machine Learning, pp. 3682–3691. PMLR,

work page internal anchor Pith review Pith/arXiv arXiv
[10]

Soft Actor-Critic Algorithms and Applications

Tuomas Haarnoja, Aurick Zhou, Kristian Hartikainen, George Tucker, Sehoon Ha, Jie Tan, Vikash Kumar, Henry Zhu, Abhishek Gupta, Pieter Abbeel, et al. Soft actor-critic algorithms and applications. arXiv preprint arXiv:1812.05905,

work page internal anchor Pith review Pith/arXiv arXiv
[11]

Philippe Hansen-Estruch, Ilya Kostrikov, Michael Janner, Jakub Grudzien Kuba, and Sergey Levine

URL https://arxiv.org/abs/2212.05698. Philippe Hansen-Estruch, Ilya Kostrikov, Michael Janner, Jakub Grudzien Kuba, and Sergey Levine. Idql: Implicit q-learning as an actor-critic method with diffusion policies. arXiv preprint arXiv:2304.10573,

work page arXiv
[12]

Aligniql: Policy alignment in implicit q-learning through constrained optimization

Longxiang He, Li Shen, Junbo Tan, and Xueqian Wang. Aligniql: Policy alignment in implicit q-learning through constrained optimization. arXiv preprint arXiv:2405.18187,

work page arXiv
[13]

Deep Q-learning from Demonstrations

URL https://arxiv. org/abs/1704.03732. Hengyuan Hu, Suvir Mirchandani, and Dorsa Sadigh. Imitation bootstrapped reinforcement learning. arXiv preprint arXiv:2311.02198,

work page internal anchor Pith review Pith/arXiv arXiv
[14]

Offline Reinforcement Learning with Implicit Q-Learning

URL https://arxiv.org/abs/2110.06169. 12 Seunghyun Lee, Younggyo Seo, Kimin Lee, Pieter Abbeel, and Jinwoo Shin. Offline-to-online reinforcement learning via balanced replay and pessimistic q-ensemble,

work page internal anchor Pith review Pith/arXiv arXiv
[15]

Qiyang Li, Jason Zhang, Dibya Ghosh, Amy Zhang, and Sergey Levine

URL https: //arxiv.org/abs/2107.00591. Qiyang Li, Jason Zhang, Dibya Ghosh, Amy Zhang, and Sergey Levine. Accelerating exploration with unlabeled prior data. Advances in Neural Information Processing Systems, 36:67434–67458,

work page arXiv
[16]

MimicGen: A Data Generation System for Scalable Robot Learning using Human Demonstrations

URL https://arxiv.org/abs/2310.17596. Max Sobol Mark, Archit Sharma, Fahim Tajwar, Rafael Rafailov, Sergey Levine, and Chelsea Finn. Offline retraining for online rl: Decoupled policy learning to mitigate exploration bias,

work page internal anchor Pith review Pith/arXiv arXiv
[17]

Max Sobol Mark, Tian Gao, Georgia Gabriela Sampaio, Mohan Kumar Srirama, Archit Sharma, Chelsea Finn, and Aviral Kumar

URL https://arxiv.org/abs/2310.08558. Max Sobol Mark, Tian Gao, Georgia Gabriela Sampaio, Mohan Kumar Srirama, Archit Sharma, Chelsea Finn, and Aviral Kumar. Policy agnostic rl: Offline rl and online rl fine-tuning of any class and backbone. arXiv preprint arXiv:2412.06685,

work page arXiv
[18]

Overcoming Exploration in Reinforcement Learning with Demonstrations

URL https://arxiv. org/abs/1709.10089. Ashvin Nair, Abhishek Gupta, Murtaza Dalal, and Sergey Levine. Awac: Accelerating online reinforcement learning with offline datasets,

work page internal anchor Pith review Pith/arXiv arXiv
[19]

URL https://arxiv.org/abs/2006. 09359. Mitsuhiko Nakamoto, Simon Zhai, Anikait Singh, Max Sobol Mark, Yi Ma, Chelsea Finn, Aviral Kumar, and Sergey Levine. Cal-ql: Calibrated offline rl pre-training for efficient online fine-tuning. Advances in Neural Information Processing Systems, 36:62244–62269,

work page 2006
[20]

Seohong Park, Kevin Frans, Sergey Levine, and Aviral Kumar

URL https://arxiv.org/abs/2303.05479. Seohong Park, Kevin Frans, Sergey Levine, and Aviral Kumar. Is value learning really the main bottleneck in offline rl? arXiv preprint arXiv:2406.09329,

work page arXiv
[21]

Flow Q-learning

Seohong Park, Qiyang Li, and Sergey Levine. Flow q-learning. arXiv preprint arXiv:2502.02538,

work page arXiv
[22]

Learning a

Michael Psenka, Alejandro Escontrela, Pieter Abbeel, and Yi Ma. Learning a diffusion model policy from rewards via q-score matching. arXiv preprint arXiv:2312.11752,

work page arXiv
[23]

Diffusion Policy Policy Optimization

Allen Z Ren, Justin Lidard, Lars L Ankile, Anthony Simeonov, Pulkit Agrawal, Anirudha Majumdar, Benjamin Burchfiel, Hongkai Dai, and Max Simchowitz. Diffusion policy policy optimization. arXiv preprint arXiv:2409.00588,

work page internal anchor Pith review Pith/arXiv arXiv
[24]

Proximal Policy Optimization Algorithms

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347,

work page internal anchor Pith review Pith/arXiv arXiv
[25]

Hybrid rl: Using both offline and online data can make rl efficient.arXiv preprint arXiv:2210.06718,

URL https: //arxiv.org/abs/2210.06718. Mel Vecerik, Todd Hester, Jonathan Scholz, Fumin Wang, Olivier Pietquin, Bilal Piot, Nicolas Heess, Thomas Rothörl, Thomas Lampe, and Martin Riedmiller. Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards,

work page arXiv
[26]

Leveraging Demonstrations for Deep Reinforcement Learning on Robotics Problems with Sparse Rewards

URL https: //arxiv.org/abs/1707.08817. Zhendong Wang, Jonathan J Hunt, and Mingyuan Zhou. Diffusion policies as an expressive policy class for offline reinforcement learning. arXiv preprint arXiv:2208.06193,

work page internal anchor Pith review Pith/arXiv arXiv
[27]

Policy decorator: Model- agnostic online refinement for large policy model.arXiv preprint arXiv:2412.13630, 2024

Xiu Yuan, Tongzhou Mu, Stone Tao, Yunhao Fang, Mengke Zhang, and Hao Su. Policy decorator: Model-agnostic online refinement for large policy model. arXiv preprint arXiv:2412.13630,

work page arXiv
[28]

Policy expansion for bridging offline-to- online reinforcement learning,

URL https://arxiv.org/abs/2302.00935. Shiyuan Zhang, Weitong Zhang, and Quanquan Gu. Energy-weighted flow matching for offline reinforcement learning. arXiv preprint arXiv:2503.04975,

work page arXiv
[29]

For offline-to-online training, we present the number of pretraining steps for each suite

Each training run presented is with three seeds and error bars indicating max and min. For offline-to-online training, we present the number of pretraining steps for each suite. We do not pretrain in the online setting. We use the same residual block structure for the base policy as IDQL (Hansen-Estruch et al., 2023). Hyperparameter Robomimic Adroit Antma...

work page 2023
[30]

The Adroit and Antmaze environments use the default D4RL provided datasets

We subsample 10 trajectories for Lift and use the MH dataset for Can to make the tasks harder. The Adroit and Antmaze environments use the default D4RL provided datasets. Hyperparameter Num Data Composition MimicGen Stack 200 10 human and 190 generated by MimicGen MimicGen Threading 50 10 human and 40 generated by MimicGen Robomimic Lift 10 PH Robomimic S...

work page 2023

[1] [1]

Ankile, L., Simeonov, A., Shenfeld, I., and Agrawal, P

Lars Ankile, Anthony Simeonov, Idan Shenfeld, Marcel Torne, and Pulkit Agrawal. From imitation to refinement–residual rl for precise assembly. arXiv preprint arXiv:2407.16677,

work page arXiv

[2] [2]

$\pi_0$: A Vision-Language-Action Flow Model for General Robot Control

Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, et al. π0: A vision-language-action flow model for general robot control. arXiv preprint arXiv:2410.24164,

work page internal anchor Pith review Pith/arXiv arXiv

[3] [3]

Offline reinforcement learning via high-fidelity generative behavior modeling

Huayu Chen, Cheng Lu, Chengyang Ying, Hang Su, and Jun Zhu. Offline reinforcement learning via high-fidelity generative behavior modeling. arXiv preprint arXiv:2209.14548,

work page arXiv

[4] [4]

arXiv preprint arXiv:2101.05982 , year=

Xinyue Chen, Che Wang, Zijian Zhou, and Keith Ross. Randomized ensembled double q-learning: Learning fast without a model. arXiv preprint arXiv:2101.05982,

work page arXiv

[5] [5]

Diffusion-based reinforcement learning via q-weighted variational policy optimization

Shutong Ding, Ke Hu, Zhenhao Zhang, Kan Ren, Weinan Zhang, Jingyi Yu, Jingya Wang, and Ye Shi. Diffusion-based reinforcement learning via q-weighted variational policy optimization. arXiv preprint arXiv:2405.16173,

work page arXiv

[6] [6]

Perry Dong, Alec M Lessing, Annie S Chen, and Chelsea Finn

URL https://openreview.net/forum?id=v8jdwkUNXb. Perry Dong, Alec M Lessing, Annie S Chen, and Chelsea Finn. Reinforcement learning via implicit imitation guidance. arXiv preprint arXiv:2506.07505,

work page arXiv

[7] [7]

Linjiajie Fang, Ruoxue Liu, Jing Zhang, Wenjia Wang, and Bing-Yi Jing

URL https://proceedings.mlr.press/v80/espeholt18a.html. Linjiajie Fang, Ruoxue Liu, Jing Zhang, Wenjia Wang, and Bing-Yi Jing. Diffusion actor-critic: Formulating constrained policy iteration as diffusion noise regression for offline reinforcement learning. arXiv preprint arXiv:2405.20555,

work page arXiv

[8] [8]

D4RL: Datasets for Deep Data-Driven Reinforcement Learning

URL https://arxiv.org/abs/2004.07219. Yuwei Fu, Di Wu, and Benoit Boulet. A closer look at offline rl agents. Advances in Neural Information Processing Systems, 35:8591–8604,

work page internal anchor Pith review Pith/arXiv arXiv 2004

[9] [9]

MADE: Masked Autoencoder for Distribution Estimation

URL https://arxiv.org/abs/1502.03509. Seyed Kamyar Seyed Ghasemipour, Dale Schuurmans, and Shixiang Shane Gu. Emaq: Expected-max q-learning operator for simple yet effective offline and online rl. In International Conference on Machine Learning, pp. 3682–3691. PMLR,

work page internal anchor Pith review Pith/arXiv arXiv

[10] [10]

Soft Actor-Critic Algorithms and Applications

Tuomas Haarnoja, Aurick Zhou, Kristian Hartikainen, George Tucker, Sehoon Ha, Jie Tan, Vikash Kumar, Henry Zhu, Abhishek Gupta, Pieter Abbeel, et al. Soft actor-critic algorithms and applications. arXiv preprint arXiv:1812.05905,

work page internal anchor Pith review Pith/arXiv arXiv

[11] [11]

Philippe Hansen-Estruch, Ilya Kostrikov, Michael Janner, Jakub Grudzien Kuba, and Sergey Levine

URL https://arxiv.org/abs/2212.05698. Philippe Hansen-Estruch, Ilya Kostrikov, Michael Janner, Jakub Grudzien Kuba, and Sergey Levine. Idql: Implicit q-learning as an actor-critic method with diffusion policies. arXiv preprint arXiv:2304.10573,

work page arXiv

[12] [12]

Aligniql: Policy alignment in implicit q-learning through constrained optimization

Longxiang He, Li Shen, Junbo Tan, and Xueqian Wang. Aligniql: Policy alignment in implicit q-learning through constrained optimization. arXiv preprint arXiv:2405.18187,

work page arXiv

[13] [13]

Deep Q-learning from Demonstrations

URL https://arxiv. org/abs/1704.03732. Hengyuan Hu, Suvir Mirchandani, and Dorsa Sadigh. Imitation bootstrapped reinforcement learning. arXiv preprint arXiv:2311.02198,

work page internal anchor Pith review Pith/arXiv arXiv

[14] [14]

Offline Reinforcement Learning with Implicit Q-Learning

URL https://arxiv.org/abs/2110.06169. 12 Seunghyun Lee, Younggyo Seo, Kimin Lee, Pieter Abbeel, and Jinwoo Shin. Offline-to-online reinforcement learning via balanced replay and pessimistic q-ensemble,

work page internal anchor Pith review Pith/arXiv arXiv

[15] [15]

Qiyang Li, Jason Zhang, Dibya Ghosh, Amy Zhang, and Sergey Levine

URL https: //arxiv.org/abs/2107.00591. Qiyang Li, Jason Zhang, Dibya Ghosh, Amy Zhang, and Sergey Levine. Accelerating exploration with unlabeled prior data. Advances in Neural Information Processing Systems, 36:67434–67458,

work page arXiv

[16] [16]

MimicGen: A Data Generation System for Scalable Robot Learning using Human Demonstrations

URL https://arxiv.org/abs/2310.17596. Max Sobol Mark, Archit Sharma, Fahim Tajwar, Rafael Rafailov, Sergey Levine, and Chelsea Finn. Offline retraining for online rl: Decoupled policy learning to mitigate exploration bias,

work page internal anchor Pith review Pith/arXiv arXiv

[17] [17]

Max Sobol Mark, Tian Gao, Georgia Gabriela Sampaio, Mohan Kumar Srirama, Archit Sharma, Chelsea Finn, and Aviral Kumar

URL https://arxiv.org/abs/2310.08558. Max Sobol Mark, Tian Gao, Georgia Gabriela Sampaio, Mohan Kumar Srirama, Archit Sharma, Chelsea Finn, and Aviral Kumar. Policy agnostic rl: Offline rl and online rl fine-tuning of any class and backbone. arXiv preprint arXiv:2412.06685,

work page arXiv

[18] [18]

Overcoming Exploration in Reinforcement Learning with Demonstrations

URL https://arxiv. org/abs/1709.10089. Ashvin Nair, Abhishek Gupta, Murtaza Dalal, and Sergey Levine. Awac: Accelerating online reinforcement learning with offline datasets,

work page internal anchor Pith review Pith/arXiv arXiv

[19] [19]

URL https://arxiv.org/abs/2006. 09359. Mitsuhiko Nakamoto, Simon Zhai, Anikait Singh, Max Sobol Mark, Yi Ma, Chelsea Finn, Aviral Kumar, and Sergey Levine. Cal-ql: Calibrated offline rl pre-training for efficient online fine-tuning. Advances in Neural Information Processing Systems, 36:62244–62269,

work page 2006

[20] [20]

Seohong Park, Kevin Frans, Sergey Levine, and Aviral Kumar

URL https://arxiv.org/abs/2303.05479. Seohong Park, Kevin Frans, Sergey Levine, and Aviral Kumar. Is value learning really the main bottleneck in offline rl? arXiv preprint arXiv:2406.09329,

work page arXiv

[21] [21]

Flow Q-learning

Seohong Park, Qiyang Li, and Sergey Levine. Flow q-learning. arXiv preprint arXiv:2502.02538,

work page arXiv

[22] [22]

Learning a

Michael Psenka, Alejandro Escontrela, Pieter Abbeel, and Yi Ma. Learning a diffusion model policy from rewards via q-score matching. arXiv preprint arXiv:2312.11752,

work page arXiv

[23] [23]

Diffusion Policy Policy Optimization

Allen Z Ren, Justin Lidard, Lars L Ankile, Anthony Simeonov, Pulkit Agrawal, Anirudha Majumdar, Benjamin Burchfiel, Hongkai Dai, and Max Simchowitz. Diffusion policy policy optimization. arXiv preprint arXiv:2409.00588,

work page internal anchor Pith review Pith/arXiv arXiv

[24] [24]

Proximal Policy Optimization Algorithms

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347,

work page internal anchor Pith review Pith/arXiv arXiv

[25] [25]

Hybrid rl: Using both offline and online data can make rl efficient.arXiv preprint arXiv:2210.06718,

URL https: //arxiv.org/abs/2210.06718. Mel Vecerik, Todd Hester, Jonathan Scholz, Fumin Wang, Olivier Pietquin, Bilal Piot, Nicolas Heess, Thomas Rothörl, Thomas Lampe, and Martin Riedmiller. Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards,

work page arXiv

[26] [26]

Leveraging Demonstrations for Deep Reinforcement Learning on Robotics Problems with Sparse Rewards

URL https: //arxiv.org/abs/1707.08817. Zhendong Wang, Jonathan J Hunt, and Mingyuan Zhou. Diffusion policies as an expressive policy class for offline reinforcement learning. arXiv preprint arXiv:2208.06193,

work page internal anchor Pith review Pith/arXiv arXiv

[27] [27]

Policy decorator: Model- agnostic online refinement for large policy model.arXiv preprint arXiv:2412.13630, 2024

Xiu Yuan, Tongzhou Mu, Stone Tao, Yunhao Fang, Mengke Zhang, and Hao Su. Policy decorator: Model-agnostic online refinement for large policy model. arXiv preprint arXiv:2412.13630,

work page arXiv

[28] [28]

Policy expansion for bridging offline-to- online reinforcement learning,

URL https://arxiv.org/abs/2302.00935. Shiyuan Zhang, Weitong Zhang, and Quanquan Gu. Energy-weighted flow matching for offline reinforcement learning. arXiv preprint arXiv:2503.04975,

work page arXiv

[29] [29]

For offline-to-online training, we present the number of pretraining steps for each suite

Each training run presented is with three seeds and error bars indicating max and min. For offline-to-online training, we present the number of pretraining steps for each suite. We do not pretrain in the online setting. We use the same residual block structure for the base policy as IDQL (Hansen-Estruch et al., 2023). Hyperparameter Robomimic Adroit Antma...

work page 2023

[30] [30]

The Adroit and Antmaze environments use the default D4RL provided datasets

We subsample 10 trajectories for Lift and use the MH dataset for Can to make the tasks harder. The Adroit and Antmaze environments use the default D4RL provided datasets. Hyperparameter Num Data Composition MimicGen Stack 200 10 human and 190 generated by MimicGen MimicGen Threading 50 10 human and 40 generated by MimicGen Robomimic Lift 10 PH Robomimic S...

work page 2023