EXPO: Stable Reinforcement Learning with Expressive Policies

· 2025 · cs.LG · arXiv 2507.07986

5 Pith papers cite this work. Polarity classification is still indexing.

5 Pith papers citing it

open full Pith review browse 5 citing papers arXiv PDF

abstract

We study the problem of training and fine-tuning expressive policies with online reinforcement learning (RL) given an offline dataset. Training expressive policy classes with online RL present a unique challenge of stable value maximization. Unlike simpler Gaussian policies commonly used in online RL, expressive policies like diffusion and flow-matching policies are parameterized by a long denoising chain, which hinders stable gradient propagation from actions to policy parameters when optimizing against some value function. Our key insight is that we can address stable value maximization by avoiding direct optimization over value with the expressive policy and instead construct an on-the-fly RL policy to maximize Q-value. We propose Expressive Policy Optimization (EXPO), a sample-efficient online RL algorithm that utilizes an on-the-fly policy to maximize value with two parameterized policies -- a larger expressive base policy trained with a stable imitation learning objective and a light-weight Gaussian edit policy that edits the actions sampled from the base policy toward a higher value distribution. The on-the-fly policy optimizes the actions from the base policy with the learned edit policy and chooses the value maximizing action from the base and edited actions for both sampling and temporal-difference (TD) backup. Our approach yields up to 2-3x improvement in sample efficiency on average over prior methods both in the setting of fine-tuning a pretrained policy given offline data and in leveraging offline data to train online.

citation-role summary

background 3 baseline 1

citation-polarity summary

background 3 baseline 1

representative citing papers

Coordinated Diffusion: Generating Multi-Agent Behavior Without Multi-Agent Demonstrations

cs.RO · 2026-05-12 · unverdicted · novelty 7.0

CoDi decomposes the multi-agent diffusion score into pre-trained single-agent policies plus a gradient-free cost guidance term to generate coordinated behavior from single-agent data alone.

TMRL: Diffusion Timestep-Modulated Pretraining Enables Exploration for Efficient Policy Finetuning

cs.RO · 2026-05-12 · unverdicted · novelty 6.0

TMRL bridges behavioral cloning pretraining and RL finetuning via diffusion noise and timestep modulation to enable controlled exploration, improving sample efficiency and enabling real-world robot training in under one hour.

Adaptive Action Chunking via Multi-Chunk Q Value Estimation

cs.LG · 2026-05-11 · unverdicted · novelty 6.0

ACH lets RL policies dynamically pick action chunk lengths by jointly estimating Q-values for all candidate lengths via a single Transformer pass.

OGPO: Sample Efficient Full-Finetuning of Generative Control Policies

cs.LG · 2026-05-04 · unverdicted · novelty 6.0

OGPO is a sample-efficient off-policy method for full finetuning of generative control policies that reaches SOTA on robotic manipulation tasks and can recover from poor behavior-cloning initializations without expert data.

FASTER: Value-Guided Sampling for Fast RL

cs.LG · 2026-04-21 · unverdicted · novelty 6.0

FASTER models multi-candidate denoising as an MDP and trains a value function to filter actions early, delivering the performance of full sampling at lower cost in diffusion RL policies.

citing papers explorer

Showing 5 of 5 citing papers.

Coordinated Diffusion: Generating Multi-Agent Behavior Without Multi-Agent Demonstrations cs.RO · 2026-05-12 · unverdicted · none · ref 27 · internal anchor
CoDi decomposes the multi-agent diffusion score into pre-trained single-agent policies plus a gradient-free cost guidance term to generate coordinated behavior from single-agent data alone.
TMRL: Diffusion Timestep-Modulated Pretraining Enables Exploration for Efficient Policy Finetuning cs.RO · 2026-05-12 · unverdicted · none · ref 31 · internal anchor
TMRL bridges behavioral cloning pretraining and RL finetuning via diffusion noise and timestep modulation to enable controlled exploration, improving sample efficiency and enabling real-world robot training in under one hour.
Adaptive Action Chunking via Multi-Chunk Q Value Estimation cs.LG · 2026-05-11 · unverdicted · none · ref 8 · internal anchor
ACH lets RL policies dynamically pick action chunk lengths by jointly estimating Q-values for all candidate lengths via a single Transformer pass.
OGPO: Sample Efficient Full-Finetuning of Generative Control Policies cs.LG · 2026-05-04 · unverdicted · none · ref 141 · internal anchor
OGPO is a sample-efficient off-policy method for full finetuning of generative control policies that reaches SOTA on robotic manipulation tasks and can recover from poor behavior-cloning initializations without expert data.
FASTER: Value-Guided Sampling for Fast RL cs.LG · 2026-04-21 · unverdicted · none · ref 5 · internal anchor
FASTER models multi-candidate denoising as an MDP and trains a value function to filter actions early, delivering the performance of full sampling at lower cost in diffusion RL policies.

EXPO: Stable Reinforcement Learning with Expressive Policies

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer