Decision Transformer casts RL as autoregressive sequence modeling conditioned on desired returns, past states and actions, matching or exceeding offline RL baselines on Atari, Gym and Key-to-Door tasks.
hub Canonical reference
AWAC: Accelerating Online Reinforcement Learning with Offline Datasets
Canonical reference. 70% of citing Pith papers cite this work as background.
abstract
Reinforcement learning (RL) provides an appealing formalism for learning control policies from experience. However, the classic active formulation of RL necessitates a lengthy active exploration process for each behavior, making it difficult to apply in real-world settings such as robotic control. If we can instead allow RL algorithms to effectively use previously collected data to aid the online learning process, such applications could be made substantially more practical: the prior data would provide a starting point that mitigates challenges due to exploration and sample complexity, while the online training enables the agent to perfect the desired skill. Such prior data could either constitute expert demonstrations or sub-optimal prior data that illustrates potentially useful transitions. While a number of prior methods have either used optimal demonstrations to bootstrap RL, or have used sub-optimal data to train purely offline, it remains exceptionally difficult to train a policy with offline data and actually continue to improve it further with online RL. In this paper we analyze why this problem is so challenging, and propose an algorithm that combines sample efficient dynamic programming with maximum likelihood policy updates, providing a simple and effective framework that is able to leverage large amounts of offline data and then quickly perform online fine-tuning of RL policies. We show that our method, advantage weighted actor critic (AWAC), enables rapid learning of skills with a combination of prior demonstration data and online experience. We demonstrate these benefits on simulated and real-world robotics domains, including dexterous manipulation with a real multi-fingered hand, drawer opening with a robotic arm, and rotating a valve. Our results show that incorporating prior data can reduce the time required to learn a range of robotic skills to practical time-scales.
hub tools
citation-role summary
citation-polarity summary
representative citing papers
D4RL supplies new offline RL benchmarks and datasets from expert and mixed sources to expose weaknesses in existing algorithms and standardize evaluation.
DRATS derives a minimax objective from a feasibility formulation of MTRL to adaptively sample tasks with the largest return gaps, leading to better worst-task performance on MetaWorld benchmarks.
Flow map policies enable fast one-step inference for flow-based RL policies, and FMQ provides an optimal closed-form Q-guided target for offline-to-online adaptation under trust-region constraints, achieving SOTA performance.
Anchor-TS defines arm indices as the median of an online posterior sample, a hybrid posterior sample, and the online sample mean to correct distribution-shift bias and safely accelerate online learning with offline data.
Reference-sampled weighted SFT with prompt-normalized Boltzmann weights induces the same policy as fixed-reference KL-regularized RLVR, with BOLT as the estimator and a finite one-shot error decomposition separating coverage, variance, and other terms.
DARE performs sample-level constraint relaxation in offline-to-online RL by conditioning on behavioral consistency with a behavior model via posterior-induced exchange, yielding improved fine-tuning stability and performance on D4RL benchmarks.
Reformulates constrained black-box optimization as posterior inference in latent space of flow-based models amortized by outsourced diffusion models, claiming superior performance on synthetic and real tasks.
DSRL steers pretrained diffusion policies for robotics by applying RL to their latent noise inputs, achieving sample-efficient real-world adaptation with only black-box access.
BiTrajDiff augments offline RL datasets by running independent forward and backward diffusion processes from intermediate states, yielding higher performance than prior one-directional data-augmentation baselines on D4RL.
Diffusion-QL uses conditional diffusion models as expressive policies in offline RL by coupling behavior cloning with Q-value maximization, achieving SOTA on most D4RL tasks.
DRIFT achieves multi-turn RL performance via offline importance-weighted SFT by leveraging the equivalence of KL-regularized RL to weighted supervised learning.
FLAG augments state space with flow latent variable to optimize a proxy MaxEnt-RL objective, enabling expressive policies with limited importance samples in high-dimensional control.
LEO enables efficient all-goals learning in goal-conditioned RL by jointly predicting for all goals in one network pass, yielding >250x speedup over relabelling and better performance on Craftax.
The paper introduces discipline stability, a trace-based evaluation paradigm for checking if RL agents maintain behavioral discipline like rule-based competitors in hidden-state competitive settings such as hotel pricing and bidding.
ROAD formulates data mixing as a bi-level optimization problem solved via multi-armed bandit to adaptively balance offline priors and online updates in RL.
DRIFT enables stable offline-to-online fine-tuning of CTMC policies in discrete RL via advantage-weighted discrete flow matching, path-space regularization, and candidate-set approximation.
ACSAC adaptively selects action chunk sizes via a causal Transformer Q-network in actor-critic RL, proves the Bellman operator is a contraction, and reports state-of-the-art results on long-horizon manipulation tasks.
DFP is a one-step generative policy using Wasserstein gradient flow on a drifting model backbone, with a top-K behavior cloning surrogate, that reaches SOTA on Robomimic and OGBench manipulation tasks.
In a hotel revenue-management simulator, standard RL agents game scalar RevPAR rewards under hidden competitor states, but Trace-Prior RL matches both revenue metrics and price distributions by training a stochastic policy with a KL penalty to a learned market prior.
SOPE dynamically controls offline training length in online RL using actor-aligned OPE on validation data to stop when benefits saturate, achieving up to 45.6% better performance and 22x less computation on Minari tasks.
Adaptive Q-Chunking selects optimal action chunk sizes at each state via normalized advantage comparisons to outperform fixed chunk sizes in offline-to-online RL on robot benchmarks.
An adaptive UCB-based policy selection and fine-tuning strategy improves performance over standard O2O-RL baselines under interaction budgets.
AdamO modifies Adam with an orthogonality correction to ensure the spectral radius of the TD update operator stays below one, providing a theoretical stability guarantee for offline RL.
citing papers explorer
-
Decision Transformer: Reinforcement Learning via Sequence Modeling
Decision Transformer casts RL as autoregressive sequence modeling conditioned on desired returns, past states and actions, matching or exceeding offline RL baselines on Atari, Gym and Key-to-Door tasks.
-
D4RL: Datasets for Deep Data-Driven Reinforcement Learning
D4RL supplies new offline RL benchmarks and datasets from expert and mixed sources to expose weaknesses in existing algorithms and standardize evaluation.
-
Distributionally Robust Multi-Task Reinforcement Learning via Adaptive Task Sampling
DRATS derives a minimax objective from a feasibility formulation of MTRL to adaptively sample tasks with the largest return gaps, leading to better worst-task performance on MetaWorld benchmarks.
-
Aligning Flow Map Policies with Optimal Q-Guidance
Flow map policies enable fast one-step inference for flow-based RL policies, and FMQ provides an optimal closed-form Q-guided target for offline-to-online adaptation under trust-region constraints, achieving SOTA performance.
-
Sample-Mean Anchored Thompson Sampling for Offline-to-Online Learning with Distribution Shift
Anchor-TS defines arm indices as the median of an online posterior sample, a hybrid posterior sample, and the online sample mean to correct distribution-shift bias and safely accelerate online learning with offline data.
-
Reference-Sampled Boltzmann Projection for KL-Regularized RLVR: Target-Matched Weighted SFT, Finite One-Shot Gaps, and Policy Mirror Descent
Reference-sampled weighted SFT with prompt-normalized Boltzmann weights induces the same policy as fixed-reference KL-regularized RLVR, with BOLT as the estimator and a finite one-shot error decomposition separating coverage, variance, and other terms.
-
From Static Constraints to Dynamic Adaptation: Sample-Level Constraint Relaxation for Offline-to-Online Reinforcement Learning
DARE performs sample-level constraint relaxation in offline-to-online RL by conditioning on behavioral consistency with a behavior model via posterior-induced exchange, yielding improved fine-tuning stability and performance on D4RL benchmarks.
-
Posterior Inference in Latent Space for Scalable Constrained Black-box Optimization
Reformulates constrained black-box optimization as posterior inference in latent space of flow-based models amortized by outsourced diffusion models, claiming superior performance on synthetic and real tasks.
-
Steering Your Diffusion Policy with Latent Space Reinforcement Learning
DSRL steers pretrained diffusion policies for robotics by applying RL to their latent noise inputs, achieving sample-efficient real-world adaptation with only black-box access.
-
BiTrajDiff: Bidirectional Trajectory Generation with Diffusion Models for Offline Reinforcement Learning
BiTrajDiff augments offline RL datasets by running independent forward and backward diffusion processes from intermediate states, yielding higher performance than prior one-directional data-augmentation baselines on D4RL.
-
Diffusion Policies as an Expressive Policy Class for Offline Reinforcement Learning
Diffusion-QL uses conditional diffusion models as expressive policies in offline RL by coupling behavior cloning with Q-value maximization, achieving SOTA on most D4RL tasks.
-
DRIFT: Decoupled Rollouts and Importance-Weighted Fine-Tuning for Efficient Multi-Turn Optimization
DRIFT achieves multi-turn RL performance via offline importance-weighted SFT by leveraging the equivalence of KL-regularized RL to weighted supervised learning.
-
FLAG: Flow Policy MaxEnt-RL by Latent Augmented Guidance
FLAG augments state space with flow latent variable to optimize a proxy MaxEnt-RL objective, enabling expressive policies with limited importance samples in high-dimensional control.
-
Goal-Conditioned Agents that Learn Everything All at Once
LEO enables efficient all-goals learning in goal-conditioned RL by jointly predicting for all goals in one network pass, yielding >250x speedup over relabelling and better performance on Craftax.
-
When Outcome Looks Right But Discipline Fails: Trace-Based Evaluation Under Hidden Competitor State
The paper introduces discipline stability, a trace-based evaluation paradigm for checking if RL agents maintain behavioral discipline like rule-based competitors in hidden-state competitive settings such as hotel pricing and bidding.
-
ROAD: Adaptive Data Mixing for Offline-to-Online Reinforcement Learning via Bi-Level Optimization
ROAD formulates data mixing as a bi-level optimization problem solved via multi-armed bandit to adaptively balance offline priors and online updates in RL.
-
Discrete Flow Matching for Offline-to-Online Reinforcement Learning
DRIFT enables stable offline-to-online fine-tuning of CTMC policies in discrete RL via advantage-weighted discrete flow matching, path-space regularization, and candidate-set approximation.
-
ACSAC: Adaptive Chunk Size Actor-Critic with Causal Transformer Q-Network
ACSAC adaptively selects action chunk sizes via a causal Transformer Q-network in actor-critic RL, proves the Bellman operator is a contraction, and reports state-of-the-art results on long-horizon manipulation tasks.
-
Drifting Field Policy: A One-Step Generative Policy via Wasserstein Gradient Flow
DFP is a one-step generative policy using Wasserstein gradient flow on a drifting model backbone, with a top-K behavior cloning surrogate, that reaches SOTA on Robomimic and OGBench manipulation tasks.
-
Market-Alignment Risk in Pricing Agents: Trace Diagnostics and Trace-Prior RL under Hidden Competitor State
In a hotel revenue-management simulator, standard RL agents game scalar RevPAR rewards under hidden competitor states, but Trace-Prior RL matches both revenue metrics and price distributions by training a stochastic policy with a KL penalty to a learned market prior.
-
SOPE: Stabilizing Off-Policy Evaluation for Online RL with Prior Data
SOPE dynamically controls offline training length in online RL using actor-aligned OPE on validation data to stop when benefits saturate, achieving up to 45.6% better performance and 22x less computation on Minari tasks.
-
Adaptive Q-Chunking for Offline-to-Online Reinforcement Learning
Adaptive Q-Chunking selects optimal action chunk sizes at each state via normalized advantage comparisons to outperform fixed chunk sizes in offline-to-online RL on robot benchmarks.
-
Adaptive Policy Selection and Fine-Tuning under Interaction Budgets for Offline-to-Online Reinforcement Learning
An adaptive UCB-based policy selection and fine-tuning strategy improves performance over standard O2O-RL baselines under interaction budgets.
-
AdamO: A Collapse-Suppressed Optimizer for Offline RL
AdamO modifies Adam with an orthogonality correction to ensure the spectral radius of the TD update operator stays below one, providing a theoretical stability guarantee for offline RL.
-
QHyer: Q-conditioned Hybrid Attention-mamba Transformer for Offline Goal-conditioned RL
QHyer replaces return-to-go with a state-conditioned Q-estimator and adds a gated hybrid attention-mamba backbone to achieve state-of-the-art performance in offline goal-conditioned RL on both Markovian and non-Markovian datasets.
-
When Policies Cannot Be Retrained: A Unified Closed-Form View of Post-Training Steering in Offline Reinforcement Learning
For diagonal-Gaussian frozen actors, PoE with alpha equals KL adaptation with beta = alpha/(1-alpha); empirically, composition shows an actor-competence ceiling with 4/5/3 HELP/FROZEN/HURT split on D4RL and zero success on AntMaze.
-
Fisher Decorator: Refining Flow Policy via a Local Transport Map
Fisher Decorator refines flow policies in offline RL via a local transport map and Fisher-matrix quadratic approximation of the KL constraint, yielding controllable error near the optimum and SOTA benchmark results.
-
Beyond Importance Sampling: Rejection-Gated Policy Optimization
RGPO replaces importance sampling with a smooth [0,1] acceptance gate in policy gradients, unifying TRPO/PPO/REINFORCE, bounding variance for heavy-tailed ratios, and showing gains in online RLHF experiments.
-
When Missing Becomes Structure: Intent-Preserving Policy Completion from Financial KOL Discourse
KICL completes execution decisions in KOL financial discourse using offline RL, achieving top returns and Sharpe ratios with no unsupported trades or direction changes on YouTube and X data from 2022-2025.
-
MoRI: Mixture of RL and IL Experts for Long-Horizon Manipulation Tasks
MoRI dynamically mixes RL and IL experts with variance-based switching and IL regularization to reach 97.5% success in four real-world robotic tasks while cutting human intervention by 85.8%.
-
Value-Guidance MeanFlow for Offline Multi-Agent Reinforcement Learning
VGM²P achieves SOTA-comparable performance in offline MARL via value-guided conditional behavior cloning with MeanFlow, enabling efficient single-step action generation insensitive to regularization coefficients.
-
PriPG-RL: Privileged Planner-Guided Reinforcement Learning for Partially Observable Systems with Anytime-Feasible MPC
PriPG-RL trains RL policies for POMDPs by distilling knowledge from a privileged anytime-feasible MPC planner into a P2P-SAC policy, improving sample efficiency and performance in partially observable robotic navigation.
-
Unleashing the Potential of Diffusion Models for End-to-End Autonomous Driving
The paper introduces Hyper Diffusion Planner (HDP), a diffusion-based E2E AD framework that identifies insights on loss space, trajectory representation and data scaling, adds RL post-training, and reports 10x performance gains over 200 km of real-world testing across 6 scenarios.
-
Pseudo-Expert Regularized Offline RL for End-to-End Autonomous Driving in Photorealistic Closed-Loop Environments
Pseudo-expert regularized offline RL reduces collisions and improves route completion for camera-based driving models trained on fixed simulator datasets from nuScenes.
-
Reinforcement Learning with Action Chunking
Q-chunking improves offline-to-online RL sample efficiency on long-horizon sparse-reward manipulation tasks by applying action chunking to TD learning.
-
VLA-RL: Towards Masterful and General Robotic Manipulation with Scalable Reinforcement Learning
VLA-RL applies online RL to pretrained VLAs, yielding a 4.5% gain over strong baselines on 40 LIBERO manipulation tasks and matching commercial models like π₀-FAST.
-
COLSON: Controllable Learning-Based Social Navigation via Diffusion-Based Reinforcement Learning
COLSON applies diffusion models to reinforcement learning for social robot navigation and adds controllability mechanisms that enable zero-shot adaptation to unseen static obstacles and altered objectives.
-
Diffusion Policy Policy Optimization
DPPO fine-tunes diffusion policies via policy gradients and outperforms prior RL approaches for diffusion policies and PG-tuned alternatives on robot benchmarks while enabling stable training and hardware deployment.
-
Training Diffusion Models with Reinforcement Learning
DDPO uses policy gradients on the denoising process to optimize diffusion models for arbitrary rewards like human feedback or compressibility.
-
IDQL: Implicit Q-Learning as an Actor-Critic Method with Diffusion Policies
IDQL generalizes IQL into an actor-critic framework and uses diffusion policies for robust policy extraction, outperforming prior offline RL methods.
-
Is Conditional Generative Modeling all you need for Decision-Making?
Return-conditional diffusion models for policies outperform offline RL on benchmarks by circumventing dynamic programming and enable constraint or skill composition.
-
COOPO: Cyclic Offline-Online Policy Optimization Algorithm
COOPO is a cyclic offline-online RL algorithm that repeatedly anchors the policy to a dataset via KL-regularized updates then fine-tunes online, claiming better sample efficiency and monotonic improvement under coverage assumptions.
-
Behavioral Mode Discovery for Fine-tuning Multimodal Generative Policies
Unsupervised behavioral mode discovery combined with mutual information rewards enables RL fine-tuning of multimodal generative policies that achieves higher success rates without losing action diversity.
-
RankQ: Offline-to-Online Reinforcement Learning via Self-Supervised Action Ranking
RankQ augments temporal-difference Q-learning with a multi-term self-supervised ranking loss to enforce structured action ordering, yielding competitive or better results than prior methods on D4RL and large gains in vision-based robot fine-tuning.
-
XQCfD: Accelerating Fast Actor-Critic Algorithms with Prior Data and Prior Policies
XQCfD accelerates actor-critic RL by using prior data, pretrained policies, and stationary architectures to achieve state-of-the-art results on Adroit, Robomimic, and MimicGen manipulation benchmarks with low update-to-data ratios.
-
LLM-Guided Task- and Affordance-Level Exploration in Reinforcement Learning
LLM-TALE steers RL exploration using LLM-generated plans at task and affordance levels with online suboptimality correction, improving sample efficiency and success rates on pick-and-place tasks without human supervision.
-
Efficient Post-training of LLMs for Code Generation With Offline Reinforcement Learning
Offline RL post-training boosts code generation performance in LLMs, with larger gains for small models and hard problems, using pre-collected datasets.
-
Offline Reinforcement Learning: Tutorial, Review, and Perspectives on Open Problems
Offline RL promises to extract high-utility policies from static datasets but faces fundamental challenges that current methods only partially address.
- Q-Flow: Stable and Expressive Reinforcement Learning with Flow-Based Policy
- Learning While Deploying: Fleet-Scale Reinforcement Learning for Generalist Robot Policies