arxiv: 2006.09359 · v6 · submitted 2020-06-16 · 💻 cs.LG · cs.RO· stat.ML

Recognition: 2 theorem links

· Lean Theorem

AWAC: Accelerating Online Reinforcement Learning with Offline Datasets

Ashvin Nair , Abhishek Gupta , Murtaza Dalal , Sergey Levine

Authors on Pith no claims yet

Pith reviewed 2026-05-13 03:28 UTC · model grok-4.3

classification 💻 cs.LG cs.ROstat.ML

keywords reinforcement learningoffline dataonline fine-tuningactor-criticrobotic manipulationpolicy optimizationadvantage weighting

0 comments

The pith

AWAC combines offline data with online reinforcement learning to accelerate policy improvement for robotic control.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Standard reinforcement learning demands extensive active exploration to learn control policies, which limits its use in real settings like robotics where data collection is costly. The paper claims that previously collected datasets, whether expert demonstrations or merely useful sub-optimal trajectories, can serve as a bootstrap that reduces the online samples required. AWAC achieves this by estimating advantages through dynamic programming on the mixed data and then performing maximum-likelihood policy updates weighted by those advantages. This framework supports a smooth shift from offline initialization to online fine-tuning, enabling the agent to exceed the quality of the original data. If the approach holds, it makes a range of manipulation skills trainable within practical time limits on both simulated and physical robots.

Core claim

The paper proposes advantage weighted actor critic (AWAC), an algorithm that first uses sample-efficient dynamic programming to compute action advantages from a combination of offline data and online experience, then updates the policy via maximum-likelihood estimation weighted by those advantages. This simple combination lets the method leverage large prior datasets to mitigate exploration difficulties while still permitting further improvement during online training. Experiments show the resulting policies reach high performance on dexterous manipulation with a real multi-fingered hand, drawer opening with a robotic arm, and valve rotation, reducing the online interaction time needed to a

What carries the argument

Advantage weighted actor critic (AWAC), which estimates advantages via dynamic programming on offline and online data then applies them as weights in maximum-likelihood policy updates.

If this is right

Prior data, expert or sub-optimal, provides an initial policy that reduces the exploration burden in subsequent online training.
The agent can continue to improve beyond the performance level present in the offline dataset.
The same method works across both simulated environments and physical robot hardware for manipulation skills.
Transition difficulties between offline pre-training and online fine-tuning are avoided by the advantage-weighted updates.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Hybrid offline-online pipelines become viable for other domains where batch data is cheap but live interaction is expensive.
Large public or previously recorded datasets could serve as starting points for new robot tasks before targeted online refinement.
The weighting mechanism might be portable to other actor-critic or policy-gradient methods to stabilize data mixing.

Load-bearing premise

Offline data supplies transitions whose advantages remain accurate and useful guides for policy updates even after online experience begins to dominate the dataset.

What would settle it

If AWAC requires the same number of online steps as a standard online RL baseline to reach equivalent performance on the real multi-fingered hand or valve tasks, the claim of accelerated learning from offline data would not hold.

read the original abstract

Reinforcement learning (RL) provides an appealing formalism for learning control policies from experience. However, the classic active formulation of RL necessitates a lengthy active exploration process for each behavior, making it difficult to apply in real-world settings such as robotic control. If we can instead allow RL algorithms to effectively use previously collected data to aid the online learning process, such applications could be made substantially more practical: the prior data would provide a starting point that mitigates challenges due to exploration and sample complexity, while the online training enables the agent to perfect the desired skill. Such prior data could either constitute expert demonstrations or sub-optimal prior data that illustrates potentially useful transitions. While a number of prior methods have either used optimal demonstrations to bootstrap RL, or have used sub-optimal data to train purely offline, it remains exceptionally difficult to train a policy with offline data and actually continue to improve it further with online RL. In this paper we analyze why this problem is so challenging, and propose an algorithm that combines sample efficient dynamic programming with maximum likelihood policy updates, providing a simple and effective framework that is able to leverage large amounts of offline data and then quickly perform online fine-tuning of RL policies. We show that our method, advantage weighted actor critic (AWAC), enables rapid learning of skills with a combination of prior demonstration data and online experience. We demonstrate these benefits on simulated and real-world robotics domains, including dexterous manipulation with a real multi-fingered hand, drawer opening with a robotic arm, and rotating a valve. Our results show that incorporating prior data can reduce the time required to learn a range of robotic skills to practical time-scales.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

AWAC gives a workable way to mix offline data into online RL without the usual breakdown, and the real-robot results are the part that stands out.

read the letter

The paper's main contribution is a clear analysis of why offline-to-online RL usually stalls, followed by AWAC, which learns a critic via dynamic programming and then does advantage-weighted maximum-likelihood policy updates. This keeps the policy from drifting too far from the data while still allowing improvement. The construction is straightforward and avoids the overestimation and distribution-shift traps that plague standard actor-critic methods in this setting. They test it on both simulated environments and real hardware, including multi-fingered dexterous manipulation, drawer opening, and valve rotation, with comparisons against relevant baselines. The real-robot speedups are the strongest evidence that the approach actually reduces the sample complexity barrier they describe. What the paper does well is keep the method simple enough to implement while still targeting the exact failure mode. The experiments are run on tasks that matter for robotics, and the results show consistent gains when prior data is available. Soft spots are limited. The method still depends on the critic being accurate enough to produce useful advantages, and the paper does not explore how performance degrades when the offline data is very noisy or low-coverage. A few more ablations on the weighting term or on data quality would have made the claims tighter, but these are incremental rather than central. The work is aimed at researchers who already run RL on robots and have access to some prior trajectories. It is the kind of paper that gives practitioners a concrete recipe rather than a purely theoretical fix. I would bring it to a reading group focused on practical RL. It deserves a serious referee because the empirical support is direct and the problem it solves is a real deployment obstacle.

Referee Report

0 major / 3 minor

Summary. The manuscript analyzes the difficulties of transitioning from offline RL (using expert or sub-optimal demonstration data) to continued online improvement, then proposes the Advantage-Weighted Actor-Critic (AWAC) algorithm. AWAC combines sample-efficient dynamic programming for the critic with advantage-weighted maximum-likelihood policy updates that directly mitigate distribution shift and value overestimation. The central claim is that this enables rapid online fine-tuning of policies when seeded with large amounts of prior data, demonstrated on simulated robotics tasks and real-world domains including dexterous manipulation with a multi-fingered hand, drawer opening, and valve rotation.

Significance. If the reported results hold, the work has clear practical significance for robotics RL: prior data (even sub-optimal) can be leveraged to reduce the exploration burden and sample complexity that currently limit real-world deployment. The construction is a strength—it targets the precise failure modes (distribution shift, overestimation) that typically prevent offline-to-online progress without introducing circularity or parameter-free claims. The inclusion of real-robot experiments with comparisons to relevant baselines provides direct, falsifiable support for the claim that offline data can accelerate online learning to practical timescales.

minor comments (3)

[Abstract] Abstract: the claim of results on 'simulated and real robotics domains' is not accompanied by any quantitative metrics, error bars, or baseline comparisons. While the full text supplies these, adding a concise summary of key performance numbers to the abstract would improve accessibility.
[Method] The notation for the advantage-weighted policy update (likely in §3 or §4) could be clarified by explicitly stating the objective as an expectation under the offline data distribution rather than leaving the weighting implicit in the text description.
[Experiments] Figure captions for the real-robot experiments should include the number of trials and whether error bars represent standard error or standard deviation to allow readers to assess statistical reliability without consulting the main text.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive assessment of our work, recognition of its practical significance for robotics RL, and recommendation for minor revision. We appreciate the detailed summary of the manuscript's contributions and the identification of strengths in the algorithmic construction and experimental validation.

Circularity Check

0 steps flagged

No significant circularity detected in derivation chain

full rationale

The paper derives AWAC by combining standard dynamic programming (for the critic) with advantage-weighted maximum-likelihood policy updates. This construction is presented as a direct response to analyzed offline-to-online transition issues, without any step reducing by definition to fitted parameters or prior self-citations. The central claim rests on explicit algorithmic definitions and empirical validation across simulated and real-robot tasks, remaining self-contained against external benchmarks. No load-bearing uniqueness theorems, ansatzes smuggled via citation, or self-definitional reductions appear in the provided derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Based solely on the abstract, no specific free parameters, axioms, or invented entities are detailed. The method builds on standard RL concepts like actor-critic and dynamic programming.

pith-pipeline@v0.9.0 · 5607 in / 1108 out tokens · 118101 ms · 2026-05-13T03:28:40.431855+00:00 · methodology

discussion (0)

Forward citations

Cited by 34 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

D4RL: Datasets for Deep Data-Driven Reinforcement Learning
cs.LG 2020-04 accept novelty 8.0

D4RL supplies new offline RL benchmarks and datasets from expert and mixed sources to expose weaknesses in existing algorithms and standardize evaluation.
Distributionally Robust Multi-Task Reinforcement Learning via Adaptive Task Sampling
cs.LG 2026-05 unverdicted novelty 7.0

DRATS derives a minimax objective from a feasibility formulation of MTRL to adaptively sample tasks with the largest return gaps, leading to better worst-task performance on MetaWorld benchmarks.
Aligning Flow Map Policies with Optimal Q-Guidance
cs.LG 2026-05 unverdicted novelty 7.0

Flow map policies enable fast one-step inference for flow-based RL policies, and FMQ provides an optimal closed-form Q-guided target for offline-to-online adaptation under trust-region constraints, achieving SOTA performance.
Sample-Mean Anchored Thompson Sampling for Offline-to-Online Learning with Distribution Shift
cs.LG 2026-05 unverdicted novelty 7.0

Anchor-TS defines arm indices as the median of an online posterior sample, a hybrid posterior sample, and the online sample mean to correct distribution-shift bias and safely accelerate online learning with offline data.
Sample-Mean Anchored Thompson Sampling for Offline-to-Online Learning with Distribution Shift
cs.LG 2026-05 unverdicted novelty 7.0

Anchor-TS corrects bias from distribution shift in offline-to-online bandits by taking the median of an online posterior sample, a hybrid posterior sample, and the online sample mean.
SOPE: Stabilizing Off-Policy Evaluation for Online RL with Prior Data
cs.LG 2026-05 unverdicted novelty 7.0

SOPE uses an actor-aligned OPE signal on a held-out validation split to dynamically stop offline stabilization phases in online RL, improving performance up to 45.6% and cutting TFLOPs up to 22x on 25 Minari tasks.
Reference-Sampled Boltzmann Projection for KL-Regularized RLVR: Target-Matched Weighted SFT, Finite One-Shot Gaps, and Policy Mirror Descent
cs.LG 2026-05 unverdicted novelty 7.0

Reference-sampled weighted SFT with prompt-normalized Boltzmann weights induces the same policy as fixed-reference KL-regularized RLVR, with BOLT as the estimator and a finite one-shot error decomposition separating c...
WOMBET: World Model-based Experience Transfer for Robust and Sample-efficient Reinforcement Learning
cs.LG 2026-04 unverdicted novelty 7.0

WOMBET generates reliable prior data with world-model uncertainty penalization and transfers it to target tasks via adaptive offline-online sampling, yielding better sample efficiency than baselines.
Diffusion Policies as an Expressive Policy Class for Offline Reinforcement Learning
cs.LG 2022-08 unverdicted novelty 7.0

Diffusion-QL uses conditional diffusion models as expressive policies in offline RL by coupling behavior cloning with Q-value maximization, achieving SOTA on most D4RL tasks.
ROAD: Adaptive Data Mixing for Offline-to-Online Reinforcement Learning via Bi-Level Optimization
cs.LG 2026-05 unverdicted novelty 6.0

ROAD formulates data mixing as a bi-level optimization problem solved via multi-armed bandit to adaptively balance offline priors and online updates in RL.
Q-Flow: Stable and Expressive Reinforcement Learning with Flow-Based Policy
cs.LG 2026-05 unverdicted novelty 6.0

Q-Flow enables stable optimization of expressive flow-based policies in RL by propagating terminal values along deterministic flow dynamics to intermediate states for gradient updates without solver unrolling.
Discrete Flow Matching for Offline-to-Online Reinforcement Learning
cs.LG 2026-05 unverdicted novelty 6.0

DRIFT enables stable offline-to-online fine-tuning of CTMC policies in discrete RL via advantage-weighted discrete flow matching, path-space regularization, and candidate-set approximation.
RankQ: Offline-to-Online Reinforcement Learning via Self-Supervised Action Ranking
cs.AI 2026-05 unverdicted novelty 6.0

RankQ adds a self-supervised ranking loss to Q-learning to learn structured action orderings, yielding competitive or better performance than prior methods on D4RL benchmarks and large gains in vision-based robot fine-tuning.
ACSAC: Adaptive Chunk Size Actor-Critic with Causal Transformer Q-Network
cs.LG 2026-05 unverdicted novelty 6.0

ACSAC adaptively selects action chunk sizes via a causal Transformer Q-network in actor-critic RL, proves the Bellman operator is a contraction, and reports state-of-the-art results on long-horizon manipulation tasks.
Drifting Field Policy: A One-Step Generative Policy via Wasserstein Gradient Flow
cs.LG 2026-05 unverdicted novelty 6.0

DFP is a one-step generative policy using Wasserstein gradient flow on a drifting model backbone, with a top-K behavior cloning surrogate, that reaches SOTA on Robomimic and OGBench manipulation tasks.
Market-Alignment Risk in Pricing Agents: Trace Diagnostics and Trace-Prior RL under Hidden Competitor State
cs.AI 2026-05 unverdicted novelty 6.0

In a hotel revenue-management simulator, standard RL agents game scalar RevPAR rewards under hidden competitor states, but Trace-Prior RL matches both revenue metrics and price distributions by training a stochastic p...
Adaptive Q-Chunking for Offline-to-Online Reinforcement Learning
cs.LG 2026-05 unverdicted novelty 6.0

Adaptive Q-Chunking selects optimal action chunk sizes at each state via normalized advantage comparisons to outperform fixed chunk sizes in offline-to-online RL on robot benchmarks.
Adaptive Policy Selection and Fine-Tuning under Interaction Budgets for Offline-to-Online Reinforcement Learning
cs.LG 2026-05 unverdicted novelty 6.0

An adaptive UCB-based policy selection and fine-tuning strategy improves performance over standard O2O-RL baselines under interaction budgets.
AdamO: A Collapse-Suppressed Optimizer for Offline RL
cs.LG 2026-05 unverdicted novelty 6.0

AdamO modifies Adam with an orthogonality correction to ensure the spectral radius of the TD update operator stays below one, providing a theoretical stability guarantee for offline RL.
QHyer: Q-conditioned Hybrid Attention-mamba Transformer for Offline Goal-conditioned RL
cs.LG 2026-05 unverdicted novelty 6.0

QHyer replaces return-to-go with a state-conditioned Q-estimator and adds a gated hybrid attention-mamba backbone to achieve state-of-the-art performance in offline goal-conditioned RL on both Markovian and non-Markov...
QHyer: Q-conditioned Hybrid Attention-mamba Transformer for Offline Goal-conditioned RL
cs.LG 2026-05 unverdicted novelty 6.0

QHyer achieves state-of-the-art results in offline goal-conditioned RL by replacing return-to-go with a state-conditioned Q-estimator and introducing a gated hybrid attention-mamba backbone for content-adaptive histor...
Learning while Deploying: Fleet-Scale Reinforcement Learning for Generalist Robot Policies
cs.RO 2026-05 unverdicted novelty 6.0

Fleet-scale RL framework improves a single generalist VLA policy from deployment data to 95% average success on eight real-world manipulation tasks with 16 dual-arm robots.
When Policies Cannot Be Retrained: A Unified Closed-Form View of Post-Training Steering in Offline Reinforcement Learning
cs.LG 2026-04 unverdicted novelty 6.0

For diagonal-Gaussian frozen actors, PoE with alpha equals KL adaptation with beta = alpha/(1-alpha); empirically, composition shows an actor-competence ceiling with 4/5/3 HELP/FROZEN/HURT split on D4RL and zero succe...
Fisher Decorator: Refining Flow Policy via a Local Transport Map
cs.LG 2026-04 unverdicted novelty 6.0

Fisher Decorator refines flow policies in offline RL via a local transport map and Fisher-matrix quadratic approximation of the KL constraint, yielding controllable error near the optimum and SOTA benchmark results.
Beyond Importance Sampling: Rejection-Gated Policy Optimization
cs.LG 2026-04 unverdicted novelty 6.0

RGPO replaces importance sampling with a smooth [0,1] acceptance gate in policy gradients, unifying TRPO/PPO/REINFORCE, bounding variance for heavy-tailed ratios, and showing gains in online RLHF experiments.
When Missing Becomes Structure: Intent-Preserving Policy Completion from Financial KOL Discourse
cs.LG 2026-04 unverdicted novelty 6.0

KICL completes execution decisions in KOL financial discourse using offline RL, achieving top returns and Sharpe ratios with no unsupported trades or direction changes on YouTube and X data from 2022-2025.
MoRI: Mixture of RL and IL Experts for Long-Horizon Manipulation Tasks
cs.RO 2026-04 unverdicted novelty 6.0

MoRI dynamically mixes RL and IL experts with variance-based switching and IL regularization to reach 97.5% success in four real-world robotic tasks while cutting human intervention by 85.8%.
Value-Guidance MeanFlow for Offline Multi-Agent Reinforcement Learning
cs.LG 2026-04 unverdicted novelty 6.0

VGM²P achieves SOTA-comparable performance in offline MARL via value-guided conditional behavior cloning with MeanFlow, enabling efficient single-step action generation insensitive to regularization coefficients.
PriPG-RL: Privileged Planner-Guided Reinforcement Learning for Partially Observable Systems with Anytime-Feasible MPC
cs.LG 2026-04 unverdicted novelty 6.0

PriPG-RL trains RL policies for POMDPs by distilling knowledge from a privileged anytime-feasible MPC planner into a P2P-SAC policy, improving sample efficiency and performance in partially observable robotic navigation.
Training Diffusion Models with Reinforcement Learning
cs.LG 2023-05 unverdicted novelty 6.0

DDPO uses policy gradients on the denoising process to optimize diffusion models for arbitrary rewards like human feedback or compressibility.
IDQL: Implicit Q-Learning as an Actor-Critic Method with Diffusion Policies
cs.LG 2023-04 conditional novelty 6.0

IDQL generalizes IQL into an actor-critic framework and uses diffusion policies for robust policy extraction, outperforming prior offline RL methods.
Behavioral Mode Discovery for Fine-tuning Multimodal Generative Policies
cs.LG 2026-05 unverdicted novelty 5.0

Unsupervised behavioral mode discovery combined with mutual information rewards enables RL fine-tuning of multimodal generative policies that achieves higher success rates without losing action diversity.
XQCfD: Accelerating Fast Actor-Critic Algorithms with Prior Data and Prior Policies
cs.LG 2026-05 unverdicted novelty 5.0

XQCfD accelerates actor-critic RL by using prior data, pretrained policies, and stationary architectures to achieve state-of-the-art results on Adroit, Robomimic, and MimicGen manipulation benchmarks with low update-t...
Offline Reinforcement Learning: Tutorial, Review, and Perspectives on Open Problems
cs.LG 2020-05 unverdicted novelty 2.0

Offline RL promises to extract high-utility policies from static datasets but faces fundamental challenges that current methods only partially address.

Reference graph

Works this paper leans on

68 extracted references · 68 canonical work pages · cited by 32 Pith papers

[1]

Apprenticeship learning via inverse reinforcement learning

Pieter Abbeel and Andrew Y Ng. Apprenticeship learning via inverse reinforcement learning. In International Conference on Machine Learning (ICML) , pp. 1, 2004

work page 2004
[2]

Maximum a Posteriori Policy Optimisation

Abbas Abdolmaleki, Jost Tobias Springenberg, Yuval Tassa, Remi Munos, Nicolas Heess, and Martin Ried- miller. Maximum a Posteriori Policy Optimisation. In International Conference on Learning Representations (ICLR), pp. 1–19, 2018

work page 2018
[3]

An Optimistic Perspective on Ofﬂine Reinforce- ment Learning

Rishabh Agarwal, Dale Schuurmans, and Mohammad Norouzi. An Optimistic Perspective on Ofﬂine Reinforce- ment Learning. In International Conference on Machine Learning (ICML), 2019

work page 2019
[4]

ROBEL: Robotics Benchmarks for Learning with Low- Cost Robots

Michael Ahn, Henry Zhu, Kristian Hartikainen, Hugo Ponte, Abhishek Gupta, Sergey Levine, and Vikash Kumar. ROBEL: Robotics Benchmarks for Learning with Low- Cost Robots. In Conference on Robot Learning (CoRL) . arXiv, 2019

work page 2019
[5]

Robot Learning From Demonstration

Christopher G Atkeson and Stefan Schaal. Robot Learning From Demonstration. In International Conference on Machine Learning (ICML) , 1997

work page 1997
[6]

Compatible value gradients for reinforcement learning of continuous deep policies

David Balduzzi and Muhammad Ghifary. Compatible value gradients for reinforcement learning of continuous deep policies. CoRR, abs/1509.03005, 2015

work page arXiv 2015
[7]

Bentivegna, Gordon Cheng, and Christopher G

Darrin C. Bentivegna, Gordon Cheng, and Christopher G. Atkeson. Learning from observation and from practice using behavioral primitives. In Paolo Dario and Raja Chatila (eds.), Robotics Research, The Eleventh Inter- national Symposium, ISRR, October 19-22, 2003, Siena, Italy, volume 15 of Springer Tracts in Advanced Robotics, pp. 551–560. Springer, 2003. ...

work page doi:10.1007/11008941 2003
[8]

Sutton, Mohammad Ghavamzadeh, and Mark Lee

Shalabh Bhatnagar, Richard S. Sutton, Mohammad Ghavamzadeh, and Mark Lee. Natural actor-critic algorithms. Autom., 45(11):2471–2482, 2009. doi: 10.1016/j.automatica.2009.07.008

work page doi:10.1016/j.automatica.2009.07.008 2009
[9]

Thomas Degris, Martha White, and Richard S. Sutton. Off-Policy Actor-Critic. In International Conference on Machine Learning (ICML) , 2012

work page 2012
[10]

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Association for Compuational Linguistics (ACL) , 2019

work page 2019
[11]

P3O: Policy-on Policy-off Policy Optimization

Rasool Fakoor, Pratik Chaudhari, and Alexander J Smola. P3O: Policy-on Policy-off Policy Optimization. In Conference on Uncertainty in Artiﬁcial Intelligence (UAI) , 2019

work page 2019
[12]

D4RL: Datasets for Deep Data-Driven Reinforcement Learning

Justin Fu, Aviral Kumar, Oﬁr Nachum, George Tucker, and Sergey Levine. D4RL: Datasets for Deep Data-Driven Reinforcement Learning. 2020

work page 2020
[13]

Addressing Function Approximation Error in Actor-Critic Methods

Scott Fujimoto, Herke van Hoof, and David Meger. Addressing Function Approximation Error in Actor-Critic Methods. International Conference on Machine Learning (ICML), 2018

work page 2018
[14]

Off- Policy Deep Reinforcement Learning without Exploration

Scott Fujimoto, David Meger, and Doina Precup. Off- Policy Deep Reinforcement Learning without Exploration. In International Conference on Machine Learning (ICML), 2019

work page 2019
[15]

Reinforcement learning from imper- fect demonstrations

Yang Gao, Huazhe Xu, Ji Lin, Fisher Yu, Sergey Levine, and Trevor Darrell. Reinforcement learning from imper- fect demonstrations. CoRR, abs/1802.05313, 2018

work page arXiv 2018
[16]

Relay Policy Learning: Solv- ing Long-Horizon Tasks via Imitation and Reinforcement Learning

Abhishek Gupta, Vikash Kumar, Corey Lynch, Sergey Levine, and Karol Hausman. Relay Policy Learning: Solv- ing Long-Horizon Tasks via Imitation and Reinforcement Learning. In Conference on Robot Learning (CoRL) , 2019

work page 2019
[17]

Reset-Free Reinforcement Learning via Multi- Task Learning: Learning Dexterous Manipulation Be- haviors without Human Intervention

Abhishek Gupta, Justin Yu, Tony Zhao, Vikash Kumar, Kelvin Xu, Thomas Devlin, Aaron Rovinsky, and Sergey Levine. Reset-Free Reinforcement Learning via Multi- Task Learning: Learning Dexterous Manipulation Be- haviors without Human Intervention. In International Conference on Robotics and Automation (ICRA) , 2021

work page 2021
[18]

Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor

Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor. In International Conference on Machine Learning , 2018

work page 2018
[19]

Consistent On-Line Off-Policy Evaluation

Assaf Hallak and Shie Mannor. Consistent On-Line Off-Policy Evaluation. In International Conference on Machine Learning (ICML) , 2017

work page 2017
[20]

Off-policy Model-based Learning under Un- known Factored Dynamics

Assaf Hallak, Francois Schnitzler, Timothy Mann, and Shie Mannor. Off-policy Model-based Learning under Un- known Factored Dynamics. In International Conference on Machine Learning (ICML) , 2015

work page 2015
[21]

Generalized Emphatic Temporal Difference Learning: Bias-Variance Analysis

Assaf Hallak, Aviv Tamar, Rémi Munos, and Shie Mannor. Generalized Emphatic Temporal Difference Learning: Bias-Variance Analysis. In Association for the Advancement of Artiﬁcial Intelligence (AAAI) , 2016

work page 2016
[22]

Learning Attractor Landscapes for Learning Motor Prim- itives

Auke Jan Ijspeert, Jun Nakanishi, and Stefan Schaal. Learning Attractor Landscapes for Learning Motor Prim- itives. In Advances in Neural Information Processing Systems (NIPS), pp. 1547–1554, 2002. ISBN 1049-5258

work page 2002
[23]

Natasha Jaques, Asma Ghandeharioun, Judy Hanwen Shen, Craig Ferguson, Àgata Lapedriza, Noah Jones, Shixiang Gu, and Rosalind W. Picard. Way off-policy batch deep reinforcement learning of implicit human preferences in dialog. CoRR, abs/1907.00456, 2019

work page Pith review arXiv 1907
[24]

Doubly Robust Off-policy Value Evaluation for Reinforcement Learning

Nan Jiang and Lihong Li. Doubly Robust Off-policy Value Evaluation for Reinforcement Learning. In International Conference on Machine Learning (ICML) , 2016

work page 2016
[25]

Learning from Limited Demonstrations

Beomjoon Kim, Amir-Massoud Farahmand, Joelle Pineau, and Doina Precup. Learning from Limited Demonstrations. In Advances in Neural Information Processing Systems (NIPS), 2013

work page 2013
[26]

Jens Kober and J. Peter. Policy search for motor primitives in robotics. In Advances in Neural Information Processing Systems (NIPS), volume 97, 2008

work page 2008
[27]

Actor-Critic Algo- rithms

Vijay R Konda and John N Tsitsiklis. Actor-Critic Algo- rithms. In Advances in Neural Information Processing Systems (NeurIPS), 2000

work page 2000
[28]

Cald- well

Petar Kormushev, Sylvain Calinon, and Darwin G. Cald- well. Robot motor skill coordination with em-based reinforcement learning. In 2010 IEEE/RSJ International Conference on Intelligent Robots and Systems, October 18-22, 2010, Taipei, Taiwan, pp. 3232–3237. IEEE, 2010. doi: 10.1109/IROS.2010.5649089

work page doi:10.1109/iros.2010.5649089 2010
[29]

Imagenet classiﬁcation with deep convolutional neural networks

Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classiﬁcation with deep convolutional neural networks. In Advances in Neural Information Processing Systems (NIPS), pp. 1097–1105, 2012

work page 2012
[30]

Stabilizing Off-Policy Q-Learning via Bootstrap- ping Error Reduction

Aviral Kumar, Justin Fu, George Tucker, and Sergey Levine. Stabilizing Off-Policy Q-Learning via Bootstrap- ping Error Reduction. In Neural Information Processing Systems (NeurIPS), 2019

work page 2019
[31]

Conservative Q-Learning for Ofﬂine Reinforce- ment Learning

Aviral Kumar, Aurick Zhou, George Tucker, and Sergey Levine. Conservative Q-Learning for Ofﬂine Reinforce- ment Learning. In Advances in Neural Information Processing Systems (NeurIPS) , 2020

work page 2020
[32]

Riedmiller

Sascha Lange, Thomas Gabel, and Martin A. Riedmiller. Batch reinforcement learning. In Marco Wiering and Mar- tijn van Otterlo (eds.), Reinforcement Learning, volume 12 of Adaptation, Learning, and Optimization , pp. 45–73. Springer, 2012. doi: 10.1007/978-3-642-27645-3\_2

work page doi:10.1007/978-3-642-27645-3 2012
[33]

Ofﬂine Reinforcement Learning: Tutorial, Review, and Perspectives on Open Problems

Sergey Levine, Aviral Kumar, George Tucker, and Justin Fu. Ofﬂine Reinforcement Learning: Tutorial, Review, and Perspectives on Open Problems. Technical report, 2020

work page 2020
[34]

Continuous control with deep reinforcement learning

Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. In International Conference on Learning Representations (ICLR) , 2016. ISBN 0-7803- 3213-X. doi: 10.1613/jair.301

work page doi:10.1613/jair.301 2016
[35]

Asynchronous Methods for Deep Reinforcement Learning

V olodymyr Mnih, Adrià Puigdomènech Badia, Mehdi Mirza, Alex Graves, Tim Harley, Timothy P Lillicrap, David Silver, and Koray Kavukcuoglu. Asynchronous Methods for Deep Reinforcement Learning. In Interna- tional Conference on Machine Learning (ICML) , 2016

work page 2016
[36]

Learning to select and generalize striking movements in robot table tennis

Katharina Mülling, Jens Kober, Oliver Kroemer, and Jan Peters. Learning to select and generalize striking movements in robot table tennis. Int. J. Robotics Res. , 32(3):263–279, 2013. doi: 10.1177/0278364912472380

work page doi:10.1177/0278364912472380 2013
[37]

DualDICE: Behavior-Agnostic Estimation of Discounted Stationary Distribution Corrections

Oﬁr Nachum, Yinlam Chow, Bo Dai, and Lihong Li. DualDICE: Behavior-Agnostic Estimation of Discounted Stationary Distribution Corrections. In Advances in Neural Information Processing Systems (NeurIPS) , 2019

work page 2019
[38]

Combining Self-Supervised Learning and Imitation for Vision-Based Rope Manipulation

Ashvin Nair, Dian Chen, Pulkit Agrawal, Phillip Isola, Pieter Abbeel, Jitendra Malik, Sergey Levine, Dian Chen, Phillip Isola, Pieter Abbeel, Jitendra Malik, and Sergey Levine. Combining Self-Supervised Learning and Imitation for Vision-Based Rope Manipulation. In IEEE International Conference on Robotics and Au- tomation (ICRA) , 2017. ISBN 9781509046331...

work page doi:10.1109/icra.2017.7989247 2017
[39]

Overcoming Explo- ration in Reinforcement Learning with Demonstrations

Ashvin Nair, Bob Mcgrew, Marcin Andrychowicz, Woj- ciech Zaremba, and Pieter Abbeel. Overcoming Explo- ration in Reinforcement Learning with Demonstrations. In IEEE International Conference on Robotics and Automation (ICRA), 2018

work page 2018
[40]

Fitted Q-iteration by Advantage Weighted Regression

Gerhard Neumann and Jan Peters. Fitted Q-iteration by Advantage Weighted Regression. In Advances in Neural Information Processing Systems (NeurIPS) , 2008

work page 2008
[41]

Advantage-Weighted Regression: Simple and Scalable Off-Policy Reinforcement Learning

Xue Bin Peng, Aviral Kumar, Grace Zhang, and Sergey Levine. Advantage-Weighted Regression: Simple and Scalable Off-Policy Reinforcement Learning. 2019

work page 2019
[42]

Reinforcement Learning by Reward-weighted Regression for Operational Space Con- trol

Jan Peters and Stefan Schaal. Reinforcement Learning by Reward-weighted Regression for Operational Space Con- trol. In International Conference on Machine Learning , 2007

work page 2007
[43]

Natural actor-critic

Jan Peters and Stefan Schaal. Natural actor-critic. Neurocomputing, 71(7-9):1180–1190, 2008. doi: 10.1016/ j.neucom.2007.11.026

work page 2008
[44]

Nicolás, The bar derived category of a curved dg algebra, Journal of Pure and Applied Algebra 212 (2008) 2633–2659

Jan Peters and Stefan Schaal. Reinforcement learning of motor skills with policy gradients. Neural Networks, 21(4):682–697, 2008. ISSN 08936080. doi: 10.1016/j. neunet.2008.02.003

work page doi:10.1016/j 2008
[45]

Rel- ative Entropy Policy Search

Jan Peters, Katharina Mülling, and Yasemin Altün. Rel- ative Entropy Policy Search. In AAAI Conference on Artiﬁcial Intelligence, pp. 1607–1612, 2010

work page 2010
[46]

Learning Complex Dexterous Manipulation with Deep Reinforcement Learning and Demonstrations

Aravind Rajeswaran, Vikash Kumar, Abhishek Gupta, John Schulman, Emanuel Todorov, and Sergey Levine. Learning Complex Dexterous Manipulation with Deep Reinforcement Learning and Demonstrations. In Robotics: Science and Systems , 2018

work page 2018
[47]

Lifelong Generative Modeling

Jason Ramapuram, Magda Gregorova, and Alexandros Kalousis. Lifelong Generative Modeling. Neurocomputing, 2017

work page 2017
[48]

Learning from demonstration

Stefan Schaal. Learning from demonstration. In Advances in Neural Information Processing Systems (NeurIPS) , number 9, pp. 1040–1046, 1997. ISBN 1558604863. doi: 10.1016/j.robot.2004.03.001

work page doi:10.1016/j.robot.2004.03.001 1997
[49]

Siegel, Jost Tobias Springenberg, Felix Berkenkamp, Abbas Abdolmaleki, Michael Neunert, Thomas Lampe, Roland Hafner, Nicolas Heess, and Martin Riedmiller

Noah Y . Siegel, Jost Tobias Springenberg, Felix Berkenkamp, Abbas Abdolmaleki, Michael Neunert, Thomas Lampe, Roland Hafner, Nicolas Heess, and Martin Riedmiller. Keep doing what worked: Behavioral modelling priors for ofﬂine reinforcement learning, 2020

work page 2020
[50]

Reinforcement Learning: An Introduction

Richard S Sutton and Andrew G Barto. Reinforcement Learning: An Introduction . 1998

work page 1998
[51]

A Generalized Path Integral Control Approach to Re- inforcement Learning

Evangelos A Theodorou, Jonas Buchli, and Stefan Schaal. A Generalized Path Integral Control Approach to Re- inforcement Learning. Journal of Machine Learning Research (JMLR), 11:3137–3181, 2010

work page 2010
[52]

Thomas and Emma Brunskill

Philip S. Thomas and Emma Brunskill. Data-efﬁcient off-policy policy evaluation for reinforcement learning. In Maria-Florina Balcan and Kilian Q. Weinberger (eds.), Proceedings of the 33nd International Conference on Machine Learning, ICML 2016, New York City, NY, USA, June 19-24, 2016 , volume 48 of JMLR Workshop and Conference Proceedings, pp. 2139–2148...

work page 2016
[53]

Thomas Unterthiner, Sjoerd van Steenkiste, Karol Kurach, Raphaël Marinier, Marcin Michalski, and Sylvain Gelly

Emanuel Todorov, Tom Erez, and Yuval Tassa. MuJoCo: A physics engine for model-based control. In IEEE/RSJ International Conference on Intelligent Robots and Sys- tems (IROS), pp. 5026–5033, 2012. ISBN 9781467317375. doi: 10.1109/IROS.2012.6386109

work page doi:10.1109/iros.2012.6386109 2012
[54]

Leverag- ing Demonstrations for Deep Reinforcement Learning on Robotics Problems with Sparse Rewards

Matej Ve ˇcerík, Todd Hester, Jonathan Scholz, Fumin Wang, Olivier Pietquin, Bilal Piot, Nicolas Heess, Thomas Rothörl, Thomas Lampe, and Martin Riedmiller. Leverag- ing Demonstrations for Deep Reinforcement Learning on Robotics Problems with Sparse Rewards. CoRR, abs/1707.0, 2017

work page 2017
[55]

Exponentially Weighted Imitation Learn- ing for Batched Historical Data

Qing Wang, Jiechao Xiong, Lei Han, Peng Sun, Han Liu, and Tong Zhang. Exponentially Weighted Imitation Learn- ing for Batched Historical Data. In Neural Information Processing Systems (NeurIPS) , 2018

work page 2018
[56]

Critic Regularized Regression

Ziyu Wang, Alexander Novikov, Konrad Zołna, Jost To- bias Springenberg, Scott Reed, Bobak Shahriari, Noah Siegel, Josh Merel, Caglar Gulcehre, Nicolas Heess, and Nando De Freitas. Critic Regularized Regression. 2020

work page 2020
[57]

Taherkhani, A

Pawel Wawrzynski. Real-time reinforcement learning by sequential actor-critics and experience replay. Neural Networks, 22(10):1484–1497, 2009. doi: 10.1016/j.neunet. 2009.05.011

work page doi:10.1016/j.neunet 2009
[58]

Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning

Ronald J Williams. Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning. Machine Learning, pp. 229–256, 1992

work page 1992
[59]

Behavior Regularized Ofﬂine Reinforcement Learning

Yifan Wu, George Tucker, and Oﬁr Nachum. Behavior Regularized Ofﬂine Reinforcement Learning. 2020

work page 2020
[60]

Shaping rewards for reinforcement learning with imperfect demonstrations using generative models, 2020

Yuchen Wu, Melissa Moziﬁan, and Florian Shkurti. Shaping rewards for reinforcement learning with imperfect demonstrations using generative models, 2020

work page 2020
[61]

GenDICE: Generalized Ofﬂine Estimation of Stationary Values

Ruiyi Zhang, Bo Dai, Lihong Li, and Dale Schuurmans. GenDICE: Generalized Ofﬂine Estimation of Stationary Values. In International Conference on Learning Repre- sentations (ICLR), 2020

work page 2020
[62]

Generalized off-policy actor-critic

Shangtong Zhang, Wendelin Boehmer, and Shimon White- son. Generalized off-policy actor-critic. In H. Wallach, H. Larochelle, A. Beygelzimer, F. dÁlché-Buc, E. Fox, and R. Garnett (eds.), Advances in Neural Information Processing Systems 32, pp. 2001–2011. Curran Associates, Inc., 2019

work page 2001
[63]

Watch, try, learn: Meta-learning from demonstrations and reward

Allan Zhou, Eric Jang, Daniel Kappler, Alexander Her- zog, Mohi Khansari, Paul Wohlhart, Yunfei Bai, Mrinal Kalakrishnan, Sergey Levine, and Chelsea Finn. Watch, try, learn: Meta-learning from demonstrations and reward. CoRR, abs/1906.03352, 2019

work page arXiv 1906
[64]

Dexterous Manipulation with Deep Reinforcement Learning: Efﬁcient, General, and Low-Cost

Henry Zhu, Abhishek Gupta, Aravind Rajeswaran, Sergey Levine, and Vikash Kumar. Dexterous Manipulation with Deep Reinforcement Learning: Efﬁcient, General, and Low-Cost. In Proceedings - IEEE International Con- ference on Robotics and Automation , volume 2019-May, pp. 3651–3657. Institute of Electrical and Electronics Engineers Inc., 2019

work page 2019
[65]

Maximum Entropy Inverse Reinforcement Learning

Brian D Ziebart, Andrew Maas, J Andrew Bagnell, and Anind K Dey. Maximum Entropy Inverse Reinforcement Learning. In AAAI Conference on Artiﬁcial Intelligence , pp. 1433–1438, 2008. ISBN 9781577353683 (ISBN). APPENDIX A. Algorithm Derivation Details The full optimization problem we solve, given the previous off-policy advantage estimate Aπk and buffer dist...

work page 2008
[66]

Dexterous Manipulation Environments: These environ- ments are modiﬁed from those proposed by Rajeswaran et al. [46]. a) pen-binary-v0.: The task is to spin a pen into a given orientation. The action dimension is 24 and the observation dimension is 45. Let the position and orientation of the pen be denoted byxp andxo respectively, and the desired position ...

work page
[67]

The task is to push a puck to a goal position in a 40cm x 20cm, and the reward function is the negative distance between the puck and goal position

Sawyer Manipulation Environment: a) SawyerPush-v0.: This environment is included in the Multiworld library. The task is to push a puck to a goal position in a 40cm x 20cm, and the reward function is the negative distance between the puck and goal position. When using this environment, we use hindsight experience replay for goal- conditioned reinforcement ...

work page
[68]

For Gym benchmarks we report average return, and expert data is collected by a trained SAC policy

Off-Policy Data Performance: The performances of the expert data, behavior cloning (BC) on the expert data (1), and BC on the combined expert+BC data (2) are included in Table III. For Gym benchmarks we report average return, and expert data is collected by a trained SAC policy. For dextrous manipulation tasks we report the success rate, and the expert da...

work page 2040