pith. machine review for the scientific record. sign in

arxiv: 2006.09359 · v6 · submitted 2020-06-16 · 💻 cs.LG · cs.RO· stat.ML

Recognition: 2 theorem links

· Lean Theorem

AWAC: Accelerating Online Reinforcement Learning with Offline Datasets

Authors on Pith no claims yet

Pith reviewed 2026-05-13 03:28 UTC · model grok-4.3

classification 💻 cs.LG cs.ROstat.ML
keywords reinforcement learningoffline dataonline fine-tuningactor-criticrobotic manipulationpolicy optimizationadvantage weighting
0
0 comments X

The pith

AWAC combines offline data with online reinforcement learning to accelerate policy improvement for robotic control.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Standard reinforcement learning demands extensive active exploration to learn control policies, which limits its use in real settings like robotics where data collection is costly. The paper claims that previously collected datasets, whether expert demonstrations or merely useful sub-optimal trajectories, can serve as a bootstrap that reduces the online samples required. AWAC achieves this by estimating advantages through dynamic programming on the mixed data and then performing maximum-likelihood policy updates weighted by those advantages. This framework supports a smooth shift from offline initialization to online fine-tuning, enabling the agent to exceed the quality of the original data. If the approach holds, it makes a range of manipulation skills trainable within practical time limits on both simulated and physical robots.

Core claim

The paper proposes advantage weighted actor critic (AWAC), an algorithm that first uses sample-efficient dynamic programming to compute action advantages from a combination of offline data and online experience, then updates the policy via maximum-likelihood estimation weighted by those advantages. This simple combination lets the method leverage large prior datasets to mitigate exploration difficulties while still permitting further improvement during online training. Experiments show the resulting policies reach high performance on dexterous manipulation with a real multi-fingered hand, drawer opening with a robotic arm, and valve rotation, reducing the online interaction time needed to a

What carries the argument

Advantage weighted actor critic (AWAC), which estimates advantages via dynamic programming on offline and online data then applies them as weights in maximum-likelihood policy updates.

If this is right

  • Prior data, expert or sub-optimal, provides an initial policy that reduces the exploration burden in subsequent online training.
  • The agent can continue to improve beyond the performance level present in the offline dataset.
  • The same method works across both simulated environments and physical robot hardware for manipulation skills.
  • Transition difficulties between offline pre-training and online fine-tuning are avoided by the advantage-weighted updates.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Hybrid offline-online pipelines become viable for other domains where batch data is cheap but live interaction is expensive.
  • Large public or previously recorded datasets could serve as starting points for new robot tasks before targeted online refinement.
  • The weighting mechanism might be portable to other actor-critic or policy-gradient methods to stabilize data mixing.

Load-bearing premise

Offline data supplies transitions whose advantages remain accurate and useful guides for policy updates even after online experience begins to dominate the dataset.

What would settle it

If AWAC requires the same number of online steps as a standard online RL baseline to reach equivalent performance on the real multi-fingered hand or valve tasks, the claim of accelerated learning from offline data would not hold.

read the original abstract

Reinforcement learning (RL) provides an appealing formalism for learning control policies from experience. However, the classic active formulation of RL necessitates a lengthy active exploration process for each behavior, making it difficult to apply in real-world settings such as robotic control. If we can instead allow RL algorithms to effectively use previously collected data to aid the online learning process, such applications could be made substantially more practical: the prior data would provide a starting point that mitigates challenges due to exploration and sample complexity, while the online training enables the agent to perfect the desired skill. Such prior data could either constitute expert demonstrations or sub-optimal prior data that illustrates potentially useful transitions. While a number of prior methods have either used optimal demonstrations to bootstrap RL, or have used sub-optimal data to train purely offline, it remains exceptionally difficult to train a policy with offline data and actually continue to improve it further with online RL. In this paper we analyze why this problem is so challenging, and propose an algorithm that combines sample efficient dynamic programming with maximum likelihood policy updates, providing a simple and effective framework that is able to leverage large amounts of offline data and then quickly perform online fine-tuning of RL policies. We show that our method, advantage weighted actor critic (AWAC), enables rapid learning of skills with a combination of prior demonstration data and online experience. We demonstrate these benefits on simulated and real-world robotics domains, including dexterous manipulation with a real multi-fingered hand, drawer opening with a robotic arm, and rotating a valve. Our results show that incorporating prior data can reduce the time required to learn a range of robotic skills to practical time-scales.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. The manuscript analyzes the difficulties of transitioning from offline RL (using expert or sub-optimal demonstration data) to continued online improvement, then proposes the Advantage-Weighted Actor-Critic (AWAC) algorithm. AWAC combines sample-efficient dynamic programming for the critic with advantage-weighted maximum-likelihood policy updates that directly mitigate distribution shift and value overestimation. The central claim is that this enables rapid online fine-tuning of policies when seeded with large amounts of prior data, demonstrated on simulated robotics tasks and real-world domains including dexterous manipulation with a multi-fingered hand, drawer opening, and valve rotation.

Significance. If the reported results hold, the work has clear practical significance for robotics RL: prior data (even sub-optimal) can be leveraged to reduce the exploration burden and sample complexity that currently limit real-world deployment. The construction is a strength—it targets the precise failure modes (distribution shift, overestimation) that typically prevent offline-to-online progress without introducing circularity or parameter-free claims. The inclusion of real-robot experiments with comparisons to relevant baselines provides direct, falsifiable support for the claim that offline data can accelerate online learning to practical timescales.

minor comments (3)
  1. [Abstract] Abstract: the claim of results on 'simulated and real robotics domains' is not accompanied by any quantitative metrics, error bars, or baseline comparisons. While the full text supplies these, adding a concise summary of key performance numbers to the abstract would improve accessibility.
  2. [Method] The notation for the advantage-weighted policy update (likely in §3 or §4) could be clarified by explicitly stating the objective as an expectation under the offline data distribution rather than leaving the weighting implicit in the text description.
  3. [Experiments] Figure captions for the real-robot experiments should include the number of trials and whether error bars represent standard error or standard deviation to allow readers to assess statistical reliability without consulting the main text.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive assessment of our work, recognition of its practical significance for robotics RL, and recommendation for minor revision. We appreciate the detailed summary of the manuscript's contributions and the identification of strengths in the algorithmic construction and experimental validation.

Circularity Check

0 steps flagged

No significant circularity detected in derivation chain

full rationale

The paper derives AWAC by combining standard dynamic programming (for the critic) with advantage-weighted maximum-likelihood policy updates. This construction is presented as a direct response to analyzed offline-to-online transition issues, without any step reducing by definition to fitted parameters or prior self-citations. The central claim rests on explicit algorithmic definitions and empirical validation across simulated and real-robot tasks, remaining self-contained against external benchmarks. No load-bearing uniqueness theorems, ansatzes smuggled via citation, or self-definitional reductions appear in the provided derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Based solely on the abstract, no specific free parameters, axioms, or invented entities are detailed. The method builds on standard RL concepts like actor-critic and dynamic programming.

pith-pipeline@v0.9.0 · 5607 in / 1108 out tokens · 118101 ms · 2026-05-13T03:28:40.431855+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 34 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. D4RL: Datasets for Deep Data-Driven Reinforcement Learning

    cs.LG 2020-04 accept novelty 8.0

    D4RL supplies new offline RL benchmarks and datasets from expert and mixed sources to expose weaknesses in existing algorithms and standardize evaluation.

  2. Distributionally Robust Multi-Task Reinforcement Learning via Adaptive Task Sampling

    cs.LG 2026-05 unverdicted novelty 7.0

    DRATS derives a minimax objective from a feasibility formulation of MTRL to adaptively sample tasks with the largest return gaps, leading to better worst-task performance on MetaWorld benchmarks.

  3. Aligning Flow Map Policies with Optimal Q-Guidance

    cs.LG 2026-05 unverdicted novelty 7.0

    Flow map policies enable fast one-step inference for flow-based RL policies, and FMQ provides an optimal closed-form Q-guided target for offline-to-online adaptation under trust-region constraints, achieving SOTA performance.

  4. Sample-Mean Anchored Thompson Sampling for Offline-to-Online Learning with Distribution Shift

    cs.LG 2026-05 unverdicted novelty 7.0

    Anchor-TS defines arm indices as the median of an online posterior sample, a hybrid posterior sample, and the online sample mean to correct distribution-shift bias and safely accelerate online learning with offline data.

  5. Sample-Mean Anchored Thompson Sampling for Offline-to-Online Learning with Distribution Shift

    cs.LG 2026-05 unverdicted novelty 7.0

    Anchor-TS corrects bias from distribution shift in offline-to-online bandits by taking the median of an online posterior sample, a hybrid posterior sample, and the online sample mean.

  6. SOPE: Stabilizing Off-Policy Evaluation for Online RL with Prior Data

    cs.LG 2026-05 unverdicted novelty 7.0

    SOPE uses an actor-aligned OPE signal on a held-out validation split to dynamically stop offline stabilization phases in online RL, improving performance up to 45.6% and cutting TFLOPs up to 22x on 25 Minari tasks.

  7. Reference-Sampled Boltzmann Projection for KL-Regularized RLVR: Target-Matched Weighted SFT, Finite One-Shot Gaps, and Policy Mirror Descent

    cs.LG 2026-05 unverdicted novelty 7.0

    Reference-sampled weighted SFT with prompt-normalized Boltzmann weights induces the same policy as fixed-reference KL-regularized RLVR, with BOLT as the estimator and a finite one-shot error decomposition separating c...

  8. WOMBET: World Model-based Experience Transfer for Robust and Sample-efficient Reinforcement Learning

    cs.LG 2026-04 unverdicted novelty 7.0

    WOMBET generates reliable prior data with world-model uncertainty penalization and transfers it to target tasks via adaptive offline-online sampling, yielding better sample efficiency than baselines.

  9. Diffusion Policies as an Expressive Policy Class for Offline Reinforcement Learning

    cs.LG 2022-08 unverdicted novelty 7.0

    Diffusion-QL uses conditional diffusion models as expressive policies in offline RL by coupling behavior cloning with Q-value maximization, achieving SOTA on most D4RL tasks.

  10. ROAD: Adaptive Data Mixing for Offline-to-Online Reinforcement Learning via Bi-Level Optimization

    cs.LG 2026-05 unverdicted novelty 6.0

    ROAD formulates data mixing as a bi-level optimization problem solved via multi-armed bandit to adaptively balance offline priors and online updates in RL.

  11. Q-Flow: Stable and Expressive Reinforcement Learning with Flow-Based Policy

    cs.LG 2026-05 unverdicted novelty 6.0

    Q-Flow enables stable optimization of expressive flow-based policies in RL by propagating terminal values along deterministic flow dynamics to intermediate states for gradient updates without solver unrolling.

  12. Discrete Flow Matching for Offline-to-Online Reinforcement Learning

    cs.LG 2026-05 unverdicted novelty 6.0

    DRIFT enables stable offline-to-online fine-tuning of CTMC policies in discrete RL via advantage-weighted discrete flow matching, path-space regularization, and candidate-set approximation.

  13. RankQ: Offline-to-Online Reinforcement Learning via Self-Supervised Action Ranking

    cs.AI 2026-05 unverdicted novelty 6.0

    RankQ adds a self-supervised ranking loss to Q-learning to learn structured action orderings, yielding competitive or better performance than prior methods on D4RL benchmarks and large gains in vision-based robot fine-tuning.

  14. ACSAC: Adaptive Chunk Size Actor-Critic with Causal Transformer Q-Network

    cs.LG 2026-05 unverdicted novelty 6.0

    ACSAC adaptively selects action chunk sizes via a causal Transformer Q-network in actor-critic RL, proves the Bellman operator is a contraction, and reports state-of-the-art results on long-horizon manipulation tasks.

  15. Drifting Field Policy: A One-Step Generative Policy via Wasserstein Gradient Flow

    cs.LG 2026-05 unverdicted novelty 6.0

    DFP is a one-step generative policy using Wasserstein gradient flow on a drifting model backbone, with a top-K behavior cloning surrogate, that reaches SOTA on Robomimic and OGBench manipulation tasks.

  16. Market-Alignment Risk in Pricing Agents: Trace Diagnostics and Trace-Prior RL under Hidden Competitor State

    cs.AI 2026-05 unverdicted novelty 6.0

    In a hotel revenue-management simulator, standard RL agents game scalar RevPAR rewards under hidden competitor states, but Trace-Prior RL matches both revenue metrics and price distributions by training a stochastic p...

  17. Adaptive Q-Chunking for Offline-to-Online Reinforcement Learning

    cs.LG 2026-05 unverdicted novelty 6.0

    Adaptive Q-Chunking selects optimal action chunk sizes at each state via normalized advantage comparisons to outperform fixed chunk sizes in offline-to-online RL on robot benchmarks.

  18. Adaptive Policy Selection and Fine-Tuning under Interaction Budgets for Offline-to-Online Reinforcement Learning

    cs.LG 2026-05 unverdicted novelty 6.0

    An adaptive UCB-based policy selection and fine-tuning strategy improves performance over standard O2O-RL baselines under interaction budgets.

  19. AdamO: A Collapse-Suppressed Optimizer for Offline RL

    cs.LG 2026-05 unverdicted novelty 6.0

    AdamO modifies Adam with an orthogonality correction to ensure the spectral radius of the TD update operator stays below one, providing a theoretical stability guarantee for offline RL.

  20. QHyer: Q-conditioned Hybrid Attention-mamba Transformer for Offline Goal-conditioned RL

    cs.LG 2026-05 unverdicted novelty 6.0

    QHyer replaces return-to-go with a state-conditioned Q-estimator and adds a gated hybrid attention-mamba backbone to achieve state-of-the-art performance in offline goal-conditioned RL on both Markovian and non-Markov...

  21. QHyer: Q-conditioned Hybrid Attention-mamba Transformer for Offline Goal-conditioned RL

    cs.LG 2026-05 unverdicted novelty 6.0

    QHyer achieves state-of-the-art results in offline goal-conditioned RL by replacing return-to-go with a state-conditioned Q-estimator and introducing a gated hybrid attention-mamba backbone for content-adaptive histor...

  22. Learning while Deploying: Fleet-Scale Reinforcement Learning for Generalist Robot Policies

    cs.RO 2026-05 unverdicted novelty 6.0

    Fleet-scale RL framework improves a single generalist VLA policy from deployment data to 95% average success on eight real-world manipulation tasks with 16 dual-arm robots.

  23. When Policies Cannot Be Retrained: A Unified Closed-Form View of Post-Training Steering in Offline Reinforcement Learning

    cs.LG 2026-04 unverdicted novelty 6.0

    For diagonal-Gaussian frozen actors, PoE with alpha equals KL adaptation with beta = alpha/(1-alpha); empirically, composition shows an actor-competence ceiling with 4/5/3 HELP/FROZEN/HURT split on D4RL and zero succe...

  24. Fisher Decorator: Refining Flow Policy via a Local Transport Map

    cs.LG 2026-04 unverdicted novelty 6.0

    Fisher Decorator refines flow policies in offline RL via a local transport map and Fisher-matrix quadratic approximation of the KL constraint, yielding controllable error near the optimum and SOTA benchmark results.

  25. Beyond Importance Sampling: Rejection-Gated Policy Optimization

    cs.LG 2026-04 unverdicted novelty 6.0

    RGPO replaces importance sampling with a smooth [0,1] acceptance gate in policy gradients, unifying TRPO/PPO/REINFORCE, bounding variance for heavy-tailed ratios, and showing gains in online RLHF experiments.

  26. When Missing Becomes Structure: Intent-Preserving Policy Completion from Financial KOL Discourse

    cs.LG 2026-04 unverdicted novelty 6.0

    KICL completes execution decisions in KOL financial discourse using offline RL, achieving top returns and Sharpe ratios with no unsupported trades or direction changes on YouTube and X data from 2022-2025.

  27. MoRI: Mixture of RL and IL Experts for Long-Horizon Manipulation Tasks

    cs.RO 2026-04 unverdicted novelty 6.0

    MoRI dynamically mixes RL and IL experts with variance-based switching and IL regularization to reach 97.5% success in four real-world robotic tasks while cutting human intervention by 85.8%.

  28. Value-Guidance MeanFlow for Offline Multi-Agent Reinforcement Learning

    cs.LG 2026-04 unverdicted novelty 6.0

    VGM²P achieves SOTA-comparable performance in offline MARL via value-guided conditional behavior cloning with MeanFlow, enabling efficient single-step action generation insensitive to regularization coefficients.

  29. PriPG-RL: Privileged Planner-Guided Reinforcement Learning for Partially Observable Systems with Anytime-Feasible MPC

    cs.LG 2026-04 unverdicted novelty 6.0

    PriPG-RL trains RL policies for POMDPs by distilling knowledge from a privileged anytime-feasible MPC planner into a P2P-SAC policy, improving sample efficiency and performance in partially observable robotic navigation.

  30. Training Diffusion Models with Reinforcement Learning

    cs.LG 2023-05 unverdicted novelty 6.0

    DDPO uses policy gradients on the denoising process to optimize diffusion models for arbitrary rewards like human feedback or compressibility.

  31. IDQL: Implicit Q-Learning as an Actor-Critic Method with Diffusion Policies

    cs.LG 2023-04 conditional novelty 6.0

    IDQL generalizes IQL into an actor-critic framework and uses diffusion policies for robust policy extraction, outperforming prior offline RL methods.

  32. Behavioral Mode Discovery for Fine-tuning Multimodal Generative Policies

    cs.LG 2026-05 unverdicted novelty 5.0

    Unsupervised behavioral mode discovery combined with mutual information rewards enables RL fine-tuning of multimodal generative policies that achieves higher success rates without losing action diversity.

  33. XQCfD: Accelerating Fast Actor-Critic Algorithms with Prior Data and Prior Policies

    cs.LG 2026-05 unverdicted novelty 5.0

    XQCfD accelerates actor-critic RL by using prior data, pretrained policies, and stationary architectures to achieve state-of-the-art results on Adroit, Robomimic, and MimicGen manipulation benchmarks with low update-t...

  34. Offline Reinforcement Learning: Tutorial, Review, and Perspectives on Open Problems

    cs.LG 2020-05 unverdicted novelty 2.0

    Offline RL promises to extract high-utility policies from static datasets but faces fundamental challenges that current methods only partially address.

Reference graph

Works this paper leans on

68 extracted references · 68 canonical work pages · cited by 32 Pith papers

  1. [1]

    Apprenticeship learning via inverse reinforcement learning

    Pieter Abbeel and Andrew Y Ng. Apprenticeship learning via inverse reinforcement learning. In International Conference on Machine Learning (ICML) , pp. 1, 2004

  2. [2]

    Maximum a Posteriori Policy Optimisation

    Abbas Abdolmaleki, Jost Tobias Springenberg, Yuval Tassa, Remi Munos, Nicolas Heess, and Martin Ried- miller. Maximum a Posteriori Policy Optimisation. In International Conference on Learning Representations (ICLR), pp. 1–19, 2018

  3. [3]

    An Optimistic Perspective on Offline Reinforce- ment Learning

    Rishabh Agarwal, Dale Schuurmans, and Mohammad Norouzi. An Optimistic Perspective on Offline Reinforce- ment Learning. In International Conference on Machine Learning (ICML), 2019

  4. [4]

    ROBEL: Robotics Benchmarks for Learning with Low- Cost Robots

    Michael Ahn, Henry Zhu, Kristian Hartikainen, Hugo Ponte, Abhishek Gupta, Sergey Levine, and Vikash Kumar. ROBEL: Robotics Benchmarks for Learning with Low- Cost Robots. In Conference on Robot Learning (CoRL) . arXiv, 2019

  5. [5]

    Robot Learning From Demonstration

    Christopher G Atkeson and Stefan Schaal. Robot Learning From Demonstration. In International Conference on Machine Learning (ICML) , 1997

  6. [6]

    Compatible value gradients for reinforcement learning of continuous deep policies

    David Balduzzi and Muhammad Ghifary. Compatible value gradients for reinforcement learning of continuous deep policies. CoRR, abs/1509.03005, 2015

  7. [7]

    Bentivegna, Gordon Cheng, and Christopher G

    Darrin C. Bentivegna, Gordon Cheng, and Christopher G. Atkeson. Learning from observation and from practice using behavioral primitives. In Paolo Dario and Raja Chatila (eds.), Robotics Research, The Eleventh Inter- national Symposium, ISRR, October 19-22, 2003, Siena, Italy, volume 15 of Springer Tracts in Advanced Robotics, pp. 551–560. Springer, 2003. ...

  8. [8]

    Sutton, Mohammad Ghavamzadeh, and Mark Lee

    Shalabh Bhatnagar, Richard S. Sutton, Mohammad Ghavamzadeh, and Mark Lee. Natural actor-critic algorithms. Autom., 45(11):2471–2482, 2009. doi: 10.1016/j.automatica.2009.07.008

  9. [9]

    Thomas Degris, Martha White, and Richard S. Sutton. Off-Policy Actor-Critic. In International Conference on Machine Learning (ICML) , 2012

  10. [10]

    BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Association for Compuational Linguistics (ACL) , 2019

  11. [11]

    P3O: Policy-on Policy-off Policy Optimization

    Rasool Fakoor, Pratik Chaudhari, and Alexander J Smola. P3O: Policy-on Policy-off Policy Optimization. In Conference on Uncertainty in Artificial Intelligence (UAI) , 2019

  12. [12]

    D4RL: Datasets for Deep Data-Driven Reinforcement Learning

    Justin Fu, Aviral Kumar, Ofir Nachum, George Tucker, and Sergey Levine. D4RL: Datasets for Deep Data-Driven Reinforcement Learning. 2020

  13. [13]

    Addressing Function Approximation Error in Actor-Critic Methods

    Scott Fujimoto, Herke van Hoof, and David Meger. Addressing Function Approximation Error in Actor-Critic Methods. International Conference on Machine Learning (ICML), 2018

  14. [14]

    Off- Policy Deep Reinforcement Learning without Exploration

    Scott Fujimoto, David Meger, and Doina Precup. Off- Policy Deep Reinforcement Learning without Exploration. In International Conference on Machine Learning (ICML), 2019

  15. [15]

    Reinforcement learning from imper- fect demonstrations

    Yang Gao, Huazhe Xu, Ji Lin, Fisher Yu, Sergey Levine, and Trevor Darrell. Reinforcement learning from imper- fect demonstrations. CoRR, abs/1802.05313, 2018

  16. [16]

    Relay Policy Learning: Solv- ing Long-Horizon Tasks via Imitation and Reinforcement Learning

    Abhishek Gupta, Vikash Kumar, Corey Lynch, Sergey Levine, and Karol Hausman. Relay Policy Learning: Solv- ing Long-Horizon Tasks via Imitation and Reinforcement Learning. In Conference on Robot Learning (CoRL) , 2019

  17. [17]

    Reset-Free Reinforcement Learning via Multi- Task Learning: Learning Dexterous Manipulation Be- haviors without Human Intervention

    Abhishek Gupta, Justin Yu, Tony Zhao, Vikash Kumar, Kelvin Xu, Thomas Devlin, Aaron Rovinsky, and Sergey Levine. Reset-Free Reinforcement Learning via Multi- Task Learning: Learning Dexterous Manipulation Be- haviors without Human Intervention. In International Conference on Robotics and Automation (ICRA) , 2021

  18. [18]

    Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor

    Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor. In International Conference on Machine Learning , 2018

  19. [19]

    Consistent On-Line Off-Policy Evaluation

    Assaf Hallak and Shie Mannor. Consistent On-Line Off-Policy Evaluation. In International Conference on Machine Learning (ICML) , 2017

  20. [20]

    Off-policy Model-based Learning under Un- known Factored Dynamics

    Assaf Hallak, Francois Schnitzler, Timothy Mann, and Shie Mannor. Off-policy Model-based Learning under Un- known Factored Dynamics. In International Conference on Machine Learning (ICML) , 2015

  21. [21]

    Generalized Emphatic Temporal Difference Learning: Bias-Variance Analysis

    Assaf Hallak, Aviv Tamar, Rémi Munos, and Shie Mannor. Generalized Emphatic Temporal Difference Learning: Bias-Variance Analysis. In Association for the Advancement of Artificial Intelligence (AAAI) , 2016

  22. [22]

    Learning Attractor Landscapes for Learning Motor Prim- itives

    Auke Jan Ijspeert, Jun Nakanishi, and Stefan Schaal. Learning Attractor Landscapes for Learning Motor Prim- itives. In Advances in Neural Information Processing Systems (NIPS), pp. 1547–1554, 2002. ISBN 1049-5258

  23. [23]

    Natasha Jaques, Asma Ghandeharioun, Judy Hanwen Shen, Craig Ferguson, Àgata Lapedriza, Noah Jones, Shixiang Gu, and Rosalind W. Picard. Way off-policy batch deep reinforcement learning of implicit human preferences in dialog. CoRR, abs/1907.00456, 2019

  24. [24]

    Doubly Robust Off-policy Value Evaluation for Reinforcement Learning

    Nan Jiang and Lihong Li. Doubly Robust Off-policy Value Evaluation for Reinforcement Learning. In International Conference on Machine Learning (ICML) , 2016

  25. [25]

    Learning from Limited Demonstrations

    Beomjoon Kim, Amir-Massoud Farahmand, Joelle Pineau, and Doina Precup. Learning from Limited Demonstrations. In Advances in Neural Information Processing Systems (NIPS), 2013

  26. [26]

    Jens Kober and J. Peter. Policy search for motor primitives in robotics. In Advances in Neural Information Processing Systems (NIPS), volume 97, 2008

  27. [27]

    Actor-Critic Algo- rithms

    Vijay R Konda and John N Tsitsiklis. Actor-Critic Algo- rithms. In Advances in Neural Information Processing Systems (NeurIPS), 2000

  28. [28]

    Cald- well

    Petar Kormushev, Sylvain Calinon, and Darwin G. Cald- well. Robot motor skill coordination with em-based reinforcement learning. In 2010 IEEE/RSJ International Conference on Intelligent Robots and Systems, October 18-22, 2010, Taipei, Taiwan, pp. 3232–3237. IEEE, 2010. doi: 10.1109/IROS.2010.5649089

  29. [29]

    Imagenet classification with deep convolutional neural networks

    Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (NIPS), pp. 1097–1105, 2012

  30. [30]

    Stabilizing Off-Policy Q-Learning via Bootstrap- ping Error Reduction

    Aviral Kumar, Justin Fu, George Tucker, and Sergey Levine. Stabilizing Off-Policy Q-Learning via Bootstrap- ping Error Reduction. In Neural Information Processing Systems (NeurIPS), 2019

  31. [31]

    Conservative Q-Learning for Offline Reinforce- ment Learning

    Aviral Kumar, Aurick Zhou, George Tucker, and Sergey Levine. Conservative Q-Learning for Offline Reinforce- ment Learning. In Advances in Neural Information Processing Systems (NeurIPS) , 2020

  32. [32]

    Riedmiller

    Sascha Lange, Thomas Gabel, and Martin A. Riedmiller. Batch reinforcement learning. In Marco Wiering and Mar- tijn van Otterlo (eds.), Reinforcement Learning, volume 12 of Adaptation, Learning, and Optimization , pp. 45–73. Springer, 2012. doi: 10.1007/978-3-642-27645-3\_2

  33. [33]

    Offline Reinforcement Learning: Tutorial, Review, and Perspectives on Open Problems

    Sergey Levine, Aviral Kumar, George Tucker, and Justin Fu. Offline Reinforcement Learning: Tutorial, Review, and Perspectives on Open Problems. Technical report, 2020

  34. [34]

    Continuous control with deep reinforcement learning

    Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. In International Conference on Learning Representations (ICLR) , 2016. ISBN 0-7803- 3213-X. doi: 10.1613/jair.301

  35. [35]

    Asynchronous Methods for Deep Reinforcement Learning

    V olodymyr Mnih, Adrià Puigdomènech Badia, Mehdi Mirza, Alex Graves, Tim Harley, Timothy P Lillicrap, David Silver, and Koray Kavukcuoglu. Asynchronous Methods for Deep Reinforcement Learning. In Interna- tional Conference on Machine Learning (ICML) , 2016

  36. [36]

    Learning to select and generalize striking movements in robot table tennis

    Katharina Mülling, Jens Kober, Oliver Kroemer, and Jan Peters. Learning to select and generalize striking movements in robot table tennis. Int. J. Robotics Res. , 32(3):263–279, 2013. doi: 10.1177/0278364912472380

  37. [37]

    DualDICE: Behavior-Agnostic Estimation of Discounted Stationary Distribution Corrections

    Ofir Nachum, Yinlam Chow, Bo Dai, and Lihong Li. DualDICE: Behavior-Agnostic Estimation of Discounted Stationary Distribution Corrections. In Advances in Neural Information Processing Systems (NeurIPS) , 2019

  38. [38]

    Combining Self-Supervised Learning and Imitation for Vision-Based Rope Manipulation

    Ashvin Nair, Dian Chen, Pulkit Agrawal, Phillip Isola, Pieter Abbeel, Jitendra Malik, Sergey Levine, Dian Chen, Phillip Isola, Pieter Abbeel, Jitendra Malik, and Sergey Levine. Combining Self-Supervised Learning and Imitation for Vision-Based Rope Manipulation. In IEEE International Conference on Robotics and Au- tomation (ICRA) , 2017. ISBN 9781509046331...

  39. [39]

    Overcoming Explo- ration in Reinforcement Learning with Demonstrations

    Ashvin Nair, Bob Mcgrew, Marcin Andrychowicz, Woj- ciech Zaremba, and Pieter Abbeel. Overcoming Explo- ration in Reinforcement Learning with Demonstrations. In IEEE International Conference on Robotics and Automation (ICRA), 2018

  40. [40]

    Fitted Q-iteration by Advantage Weighted Regression

    Gerhard Neumann and Jan Peters. Fitted Q-iteration by Advantage Weighted Regression. In Advances in Neural Information Processing Systems (NeurIPS) , 2008

  41. [41]

    Advantage-Weighted Regression: Simple and Scalable Off-Policy Reinforcement Learning

    Xue Bin Peng, Aviral Kumar, Grace Zhang, and Sergey Levine. Advantage-Weighted Regression: Simple and Scalable Off-Policy Reinforcement Learning. 2019

  42. [42]

    Reinforcement Learning by Reward-weighted Regression for Operational Space Con- trol

    Jan Peters and Stefan Schaal. Reinforcement Learning by Reward-weighted Regression for Operational Space Con- trol. In International Conference on Machine Learning , 2007

  43. [43]

    Natural actor-critic

    Jan Peters and Stefan Schaal. Natural actor-critic. Neurocomputing, 71(7-9):1180–1190, 2008. doi: 10.1016/ j.neucom.2007.11.026

  44. [44]

    Nicolás, The bar derived category of a curved dg algebra, Journal of Pure and Applied Algebra 212 (2008) 2633–2659

    Jan Peters and Stefan Schaal. Reinforcement learning of motor skills with policy gradients. Neural Networks, 21(4):682–697, 2008. ISSN 08936080. doi: 10.1016/j. neunet.2008.02.003

  45. [45]

    Rel- ative Entropy Policy Search

    Jan Peters, Katharina Mülling, and Yasemin Altün. Rel- ative Entropy Policy Search. In AAAI Conference on Artificial Intelligence, pp. 1607–1612, 2010

  46. [46]

    Learning Complex Dexterous Manipulation with Deep Reinforcement Learning and Demonstrations

    Aravind Rajeswaran, Vikash Kumar, Abhishek Gupta, John Schulman, Emanuel Todorov, and Sergey Levine. Learning Complex Dexterous Manipulation with Deep Reinforcement Learning and Demonstrations. In Robotics: Science and Systems , 2018

  47. [47]

    Lifelong Generative Modeling

    Jason Ramapuram, Magda Gregorova, and Alexandros Kalousis. Lifelong Generative Modeling. Neurocomputing, 2017

  48. [48]

    Learning from demonstration

    Stefan Schaal. Learning from demonstration. In Advances in Neural Information Processing Systems (NeurIPS) , number 9, pp. 1040–1046, 1997. ISBN 1558604863. doi: 10.1016/j.robot.2004.03.001

  49. [49]

    Siegel, Jost Tobias Springenberg, Felix Berkenkamp, Abbas Abdolmaleki, Michael Neunert, Thomas Lampe, Roland Hafner, Nicolas Heess, and Martin Riedmiller

    Noah Y . Siegel, Jost Tobias Springenberg, Felix Berkenkamp, Abbas Abdolmaleki, Michael Neunert, Thomas Lampe, Roland Hafner, Nicolas Heess, and Martin Riedmiller. Keep doing what worked: Behavioral modelling priors for offline reinforcement learning, 2020

  50. [50]

    Reinforcement Learning: An Introduction

    Richard S Sutton and Andrew G Barto. Reinforcement Learning: An Introduction . 1998

  51. [51]

    A Generalized Path Integral Control Approach to Re- inforcement Learning

    Evangelos A Theodorou, Jonas Buchli, and Stefan Schaal. A Generalized Path Integral Control Approach to Re- inforcement Learning. Journal of Machine Learning Research (JMLR), 11:3137–3181, 2010

  52. [52]

    Thomas and Emma Brunskill

    Philip S. Thomas and Emma Brunskill. Data-efficient off-policy policy evaluation for reinforcement learning. In Maria-Florina Balcan and Kilian Q. Weinberger (eds.), Proceedings of the 33nd International Conference on Machine Learning, ICML 2016, New York City, NY, USA, June 19-24, 2016 , volume 48 of JMLR Workshop and Conference Proceedings, pp. 2139–2148...

  53. [53]

    Thomas Unterthiner, Sjoerd van Steenkiste, Karol Kurach, Raphaël Marinier, Marcin Michalski, and Sylvain Gelly

    Emanuel Todorov, Tom Erez, and Yuval Tassa. MuJoCo: A physics engine for model-based control. In IEEE/RSJ International Conference on Intelligent Robots and Sys- tems (IROS), pp. 5026–5033, 2012. ISBN 9781467317375. doi: 10.1109/IROS.2012.6386109

  54. [54]

    Leverag- ing Demonstrations for Deep Reinforcement Learning on Robotics Problems with Sparse Rewards

    Matej Ve ˇcerík, Todd Hester, Jonathan Scholz, Fumin Wang, Olivier Pietquin, Bilal Piot, Nicolas Heess, Thomas Rothörl, Thomas Lampe, and Martin Riedmiller. Leverag- ing Demonstrations for Deep Reinforcement Learning on Robotics Problems with Sparse Rewards. CoRR, abs/1707.0, 2017

  55. [55]

    Exponentially Weighted Imitation Learn- ing for Batched Historical Data

    Qing Wang, Jiechao Xiong, Lei Han, Peng Sun, Han Liu, and Tong Zhang. Exponentially Weighted Imitation Learn- ing for Batched Historical Data. In Neural Information Processing Systems (NeurIPS) , 2018

  56. [56]

    Critic Regularized Regression

    Ziyu Wang, Alexander Novikov, Konrad Zołna, Jost To- bias Springenberg, Scott Reed, Bobak Shahriari, Noah Siegel, Josh Merel, Caglar Gulcehre, Nicolas Heess, and Nando De Freitas. Critic Regularized Regression. 2020

  57. [57]

    Taherkhani, A

    Pawel Wawrzynski. Real-time reinforcement learning by sequential actor-critics and experience replay. Neural Networks, 22(10):1484–1497, 2009. doi: 10.1016/j.neunet. 2009.05.011

  58. [58]

    Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning

    Ronald J Williams. Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning. Machine Learning, pp. 229–256, 1992

  59. [59]

    Behavior Regularized Offline Reinforcement Learning

    Yifan Wu, George Tucker, and Ofir Nachum. Behavior Regularized Offline Reinforcement Learning. 2020

  60. [60]

    Shaping rewards for reinforcement learning with imperfect demonstrations using generative models, 2020

    Yuchen Wu, Melissa Mozifian, and Florian Shkurti. Shaping rewards for reinforcement learning with imperfect demonstrations using generative models, 2020

  61. [61]

    GenDICE: Generalized Offline Estimation of Stationary Values

    Ruiyi Zhang, Bo Dai, Lihong Li, and Dale Schuurmans. GenDICE: Generalized Offline Estimation of Stationary Values. In International Conference on Learning Repre- sentations (ICLR), 2020

  62. [62]

    Generalized off-policy actor-critic

    Shangtong Zhang, Wendelin Boehmer, and Shimon White- son. Generalized off-policy actor-critic. In H. Wallach, H. Larochelle, A. Beygelzimer, F. dÁlché-Buc, E. Fox, and R. Garnett (eds.), Advances in Neural Information Processing Systems 32, pp. 2001–2011. Curran Associates, Inc., 2019

  63. [63]

    Watch, try, learn: Meta-learning from demonstrations and reward

    Allan Zhou, Eric Jang, Daniel Kappler, Alexander Her- zog, Mohi Khansari, Paul Wohlhart, Yunfei Bai, Mrinal Kalakrishnan, Sergey Levine, and Chelsea Finn. Watch, try, learn: Meta-learning from demonstrations and reward. CoRR, abs/1906.03352, 2019

  64. [64]

    Dexterous Manipulation with Deep Reinforcement Learning: Efficient, General, and Low-Cost

    Henry Zhu, Abhishek Gupta, Aravind Rajeswaran, Sergey Levine, and Vikash Kumar. Dexterous Manipulation with Deep Reinforcement Learning: Efficient, General, and Low-Cost. In Proceedings - IEEE International Con- ference on Robotics and Automation , volume 2019-May, pp. 3651–3657. Institute of Electrical and Electronics Engineers Inc., 2019

  65. [65]

    Maximum Entropy Inverse Reinforcement Learning

    Brian D Ziebart, Andrew Maas, J Andrew Bagnell, and Anind K Dey. Maximum Entropy Inverse Reinforcement Learning. In AAAI Conference on Artificial Intelligence , pp. 1433–1438, 2008. ISBN 9781577353683 (ISBN). APPENDIX A. Algorithm Derivation Details The full optimization problem we solve, given the previous off-policy advantage estimate Aπk and buffer dist...

  66. [66]

    Dexterous Manipulation Environments: These environ- ments are modified from those proposed by Rajeswaran et al. [46]. a) pen-binary-v0.: The task is to spin a pen into a given orientation. The action dimension is 24 and the observation dimension is 45. Let the position and orientation of the pen be denoted byxp andxo respectively, and the desired position ...

  67. [67]

    The task is to push a puck to a goal position in a 40cm x 20cm, and the reward function is the negative distance between the puck and goal position

    Sawyer Manipulation Environment: a) SawyerPush-v0.: This environment is included in the Multiworld library. The task is to push a puck to a goal position in a 40cm x 20cm, and the reward function is the negative distance between the puck and goal position. When using this environment, we use hindsight experience replay for goal- conditioned reinforcement ...

  68. [68]

    For Gym benchmarks we report average return, and expert data is collected by a trained SAC policy

    Off-Policy Data Performance: The performances of the expert data, behavior cloning (BC) on the expert data (1), and BC on the combined expert+BC data (2) are included in Table III. For Gym benchmarks we report average return, and expert data is collected by a trained SAC policy. For dextrous manipulation tasks we report the success rate, and the expert da...