pith. machine review for the scientific record.

arxiv: 2605.10734 · v1 · submitted 2026-05-11 · 💻 cs.LG

Recognition: 2 theorem links


XQCfD: Accelerating Fast Actor-Critic Algorithms with Prior Data and Prior Policies


Pith reviewed 2026-05-12 04:52 UTC · model grok-4.3

classification 💻 cs.LG
keywords reinforcement learning · actor critic · demonstrations · robotic manipulation · sample efficiency · pretrained policies · stationary networks · sparse rewards

The pith

XQCfD uses stationary networks and augmented buffers to retain and improve upon pretrained policies in actor-critic learning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Existing actor-critic methods tend to overwrite useful pretrained policies once they start learning from new experience mixed with demonstrations. XQCfD addresses this by extending the XQC algorithm with augmented replay buffers that combine prior and online data, and by employing stationary policy networks that keep higher entropy in their predictions for states outside the current training distribution. This design supports continued policy improvement without relying on ensembles or a high update-to-data ratio. The result is state-of-the-art performance on several robotic manipulation benchmarks that feature sparse rewards. A sympathetic reader would see this as evidence that careful architectural choices can make better use of expensive demonstration data in real-world settings.
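
To make the buffer mechanics concrete, here is a minimal sketch of the kind of augmented replay buffer the summary describes. The 50/50 demonstration/online split and the class interface are illustrative assumptions (RLPD-style symmetric sampling), not XQCfD's exact recipe.

```python
import random

class AugmentedReplayBuffer:
    """Mixes fixed demonstration transitions with growing online experience."""

    def __init__(self, demo_transitions):
        self.demo = list(demo_transitions)  # prior data, never overwritten
        self.online = []                    # filled during environment interaction

    def add(self, transition):
        self.online.append(transition)

    def sample(self, batch_size, demo_fraction=0.5):
        # Draw a fixed fraction of each batch from the demonstrations so
        # prior data keeps shaping updates even as online data accumulates.
        n_demo = int(batch_size * demo_fraction)
        pool = self.online if self.online else self.demo  # early on, online may be empty
        batch = random.choices(self.demo, k=n_demo)
        batch += random.choices(pool, k=batch_size - n_demo)
        return batch
```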

Core claim

The central discovery is that a stationary policy architecture combined with augmented replay buffers allows the XQC actor-critic to avoid rapidly unlearning strong initial policies from demonstrations. Instead, the higher entropy predictions enable effective policy improvement on out-of-distribution states, producing state-of-the-art results across complex manipulation tasks on the Adroit, Robomimic, and MimicGen benchmarks with a low update-to-data ratio and no ensemble networks.

What carries the argument

Stationary policy network architecture that generates higher-entropy predictions out of distribution to support ongoing improvement from pretrained policies.
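
Reference [20] below ("Periodic activation functions induce stationarity") suggests one plausible reading of "stationary": a random-feature layer with sin/cos activations whose induced function prior is shift-invariant. A hedged sketch, assuming (without confirmation from the paper) that the HetStat MLP uses something like this construction; the layer widths and lengthscale are illustrative.

```python
import torch
import torch.nn as nn

class PeriodicFeatures(nn.Module):
    """Fixed random projection followed by periodic activations.

    Periodic activations induce stationarity in the induced function prior
    (cf. random Fourier features); settings here are illustrative only.
    """

    def __init__(self, in_dim, n_features, lengthscale=1.0):
        super().__init__()
        w = torch.randn(in_dim, n_features) / lengthscale
        self.register_buffer("w", w)  # fixed, not trained

    def forward(self, x):
        proj = x @ self.w
        # The sin/cos pair makes the feature map shift-invariant.
        return torch.cat([torch.sin(proj), torch.cos(proj)], dim=-1)

# Hypothetical policy head: 32-dim state, 7-dim action (mean and log-std).
policy_head = nn.Sequential(
    PeriodicFeatures(in_dim=32, n_features=128),
    nn.Linear(2 * 128, 2 * 7),
)
```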

If this is right

  • Pretrained policies can be retained and refined using demonstration data without special stabilization techniques beyond the stationary design.
  • Robotic agents achieve higher sample efficiency in sparse-reward settings by mixing prior data with new interactions.
  • Performance gains occur without increasing the update-to-data ratio or adding network ensembles (a sketch of where that ratio lives follows this list).
  • The method applies directly to popular benchmarks for dexterous manipulation.
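
The update-to-data (UTD) ratio the bullets refer to is simply the number of gradient updates per environment step. A generic off-policy training loop, not the paper's code, showing where that knob sits; the `env`, `agent`, and `buffer` interfaces (old-style gym step signature) are assumptions for illustration.

```python
def train(env, agent, buffer, total_steps, utd_ratio=1):
    """Generic off-policy loop; a low UTD ratio means few updates per step."""
    obs = env.reset()
    for _ in range(total_steps):
        action = agent.act(obs)
        next_obs, reward, done, _ = env.step(action)
        buffer.add((obs, action, reward, next_obs, done))
        obs = env.reset() if done else next_obs
        for _ in range(utd_ratio):  # XQCfD reportedly keeps this low
            agent.update(buffer.sample(batch_size=256))
```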

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Other fast actor-critic variants might benefit from similar stationary designs to preserve prior knowledge.
  • This approach could lower the barrier to using demonstration data in online reinforcement learning by reducing the risk of catastrophic forgetting.
  • Future work might test whether the higher entropy property holds across different task distributions beyond manipulation.

Load-bearing premise

The assumption that standard network architectures inherently lose high-entropy predictions out of distribution, and that making them stationary fixes this without introducing new instabilities.
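
The premise is testable with a simple probe: compare the policy's predictive entropy on held-out in-distribution states against out-of-distribution states. A sketch assuming a diagonal-Gaussian policy whose forward pass returns a mean and log standard deviation; the interface is hypothetical.

```python
import math
import torch

def gaussian_entropy(log_std):
    # Differential entropy of a diagonal Gaussian, summed over action dims:
    # H = sum_i (0.5 * log(2 * pi * e) + log sigma_i)
    return (0.5 * math.log(2 * math.pi * math.e) + log_std).sum(-1)

def entropy_gap(policy, in_dist_states, ood_states):
    """Mean OOD entropy minus mean in-distribution entropy.

    Under the paper's premise, a stationary architecture should show a
    clearly positive gap while a standard MLP's gap shrinks or inverts.
    `policy(states) -> (mean, log_std)` is an assumed interface.
    """
    _, log_std_in = policy(in_dist_states)
    _, log_std_ood = policy(ood_states)
    return gaussian_entropy(log_std_ood).mean() - gaussian_entropy(log_std_in).mean()
```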

What would settle it

Running the method on the Adroit benchmark and finding that the learned policy entropy matches that of non-stationary networks on out-of-distribution states, or that final performance does not exceed prior actor-critic baselines.

Figures

Figures reproduced from arXiv: 2605.10734 by Danica Kragic, Daniel Palenicek, Florian Vogt, Ingmar Posner, Jan Peters, Joe Watson.

Figure 1. Therefore, during RL finetuning in out-of-distribution states, Equation 1 may encourage the …

Figure 2. Performance of XQCfD and baselines on Adroit over 10 seeds, showing the IQM and 10th and 90th percentile stratified bootstrap confidence intervals. BC pretraining performance is shown before the vertical dashed line. Notably, in two of the tasks, BC achieves expert performance. The benefit of improving upon BC is clear in this scenario, and baselines then require 100K–1M interactions to recover BC-level pe…

Figure 3. Performance results on Robomimic over 10 seeds, showing the IQM and 10th and 90th …

Figure 4. Performance results of XQCfD and baselines on MimicGen over 10 seeds, showing the IQM and 10th and 90th percentile stratified bootstrap confidence intervals. BC pretraining performance is shown before the vertical dashed line. OD refers to adding expert offline data. Compared to Robomimic, these tasks are slightly more complex due to a broader initial state distribution and longer task horizon. For this su…

Figure 5. An empirical analysis of the optimization landscape of the actor with and without the …

Figure 6. Ablation studies on XQCfD, replacing the HetStat MLP with a standard MLP and replacing KL regularization with standard entropy regularization, for the Adroit (top), Robomimic (middle) and MimicGen (bottom) environments over 10 seeds, showing the IQM and 10th and 90th percentile stratified bootstrap confidence intervals.

Figure 7. Sensitivity study of XQCfD's temperature α that controls KL regularization against the BC policy. Evaluated over one task each for Adroit, Robomimic and MimicGen over 10 seeds, showing the IQM and 10th and 90th percentile stratified bootstrap confidence intervals. Lower temperatures result in a larger performance drop on the transition from BC to RL due to unlearning, but lower temperatures also facilitate grea…

Figure 8. Sensitivity study on XQCfD, varying the number of expert demonstrations for the Adroit environments over 10 seeds, showing the IQM and 10th and 90th percentile stratified bootstrap confidence intervals. Panels: Lift, PickPlaceCan, NutAssemblySquare; axes: Environment Steps (1M) vs. Success; legend: N=200, N=100, N…

Figure 9. Sensitivity study on XQCfD, varying the number of expert demonstrations for the Robomimic environments over 10 seeds, showing the IQM and 10th and 90th percentile stratified bootstrap confidence intervals. Panels: StackThree D0, Coffee D0, HammerCleanup D0, …; axes: Environment Steps (1M) vs. Success.

Figure 10. Sensitivity study on XQCfD, varying the number of expert demonstrations for the MimicGen environments over 10 seeds, showing the IQM and 10th and 90th percentile stratified bootstrap confidence intervals.
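
Figures 6 and 7 ablate KL regularization against the BC policy with temperature α. For orientation, here is a single-sample estimator of one common form of such an objective, E[α · KL(π‖π_BC) − Q]; this is an illustrative shape, not necessarily XQCfD's exact loss.

```python
import torch

def actor_loss(q_values, log_prob_pi, log_prob_bc, alpha):
    """KL-to-BC regularized actor objective (sketch).

    log_prob_pi - log_prob_bc is a single-sample estimate of
    KL(pi || pi_BC) at actions drawn from pi; alpha trades off staying
    near the BC policy against maximizing the critic's Q.
    """
    kl_estimate = log_prob_pi - log_prob_bc
    return (alpha * kl_estimate - q_values).mean()
```
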
read the original abstract

For reinforcement learning in the real world, online exploration is expensive. A common practice in robotic reinforcement learning is to incorporate additional data to improve sample efficiency. Expert demonstration data is often crucial for solving hard exploration tasks with sparse rewards. While prior data is used to augment experience and pretrain models, we show that the design of existing algorithms fails to achieve the sample efficiency that is possible in this setting, due to a failure to use pretrained policies effectively. We propose XQCfD, which extends the sample-efficient XQC actor-critic to learn from demonstrations using augmented replay buffers, pretrained policies, and stationary policy architectures designed to avoid rapidly unlearning the strong initial policy, as prior works do. We show our stationary network architecture enables policy improvement out-of-distribution better than standard network architectures due to its higher-entropy predictions. XQCfD achieves state-of-the-art performance across a range of complex manipulation tasks with sparse rewards from the popular Adroit, Robomimic and MimicGen benchmarks, notably with a low update-to-data ratio and no ensemble networks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes XQCfD, an extension of the XQC actor-critic algorithm that incorporates expert demonstration data via augmented replay buffers, pretrained policies, and stationary policy network architectures. The central claim is that this design prevents rapid unlearning of strong initial policies (unlike prior actor-critic methods), enables better out-of-distribution policy improvement through higher-entropy predictions, and achieves state-of-the-art sample-efficient performance on sparse-reward robotic manipulation tasks from the Adroit, Robomimic, and MimicGen benchmarks, notably without ensembles and at low update-to-data ratios.

Significance. If the empirical results hold under rigorous verification, the work could meaningfully advance sample-efficient robotic RL by showing how to better leverage prior data and policies. The emphasis on stationary architectures for preserving entropy in OOD regions offers a practical design insight that may reduce reliance on ensembles or high update frequencies in demonstration-augmented settings.

major comments (2)
  1. [Abstract] The central attribution of SOTA gains to the stationary architecture's higher-entropy OOD predictions is load-bearing for the contribution, yet the provided description contains no reference to specific ablation studies, entropy measurements, or controlled comparisons against non-stationary baselines that would isolate this mechanism from the effects of augmented buffers and pretrained policies.
  2. [Abstract] The claim that existing actor-critic designs inherently fail to retain and improve upon strong pretrained policies requires explicit evidence from head-to-head experiments; without reported metrics on policy retention (e.g., performance degradation curves or KL divergence to the initial policy) on the same Adroit/MimicGen tasks, it is difficult to assess whether the stationary design is necessary or merely sufficient.
minor comments (2)
  1. [Abstract] The abstract is written as a single unbroken paragraph with multiple run-on clauses, reducing readability; breaking it into 2–3 sentences would improve clarity.
  2. [Abstract] The acronym XQCfD is introduced without an explicit expansion on first use; expanding acronyms on first use is standard practice for algorithmic papers.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on strengthening the abstract and supporting claims with clearer evidence. We address each major comment below and will incorporate revisions to improve the manuscript.

read point-by-point responses
  1. Referee: [Abstract] The central attribution of SOTA gains to the stationary architecture's higher-entropy OOD predictions is load-bearing for the contribution, yet the provided description contains no reference to specific ablation studies, entropy measurements, or controlled comparisons against non-stationary baselines that would isolate this mechanism from the effects of augmented buffers and pretrained policies.

    Authors: We agree the abstract is too concise on this point. The full manuscript includes ablation studies in Section 5.2 that isolate the stationary architecture by comparing variants with and without it (while holding buffers and pretraining fixed), plus entropy measurements in Figure 6 and OOD policy improvement analysis in Section 4.3. We will revise the abstract to explicitly reference these controlled comparisons and measurements. revision: yes

  2. Referee: [Abstract] The claim that existing actor-critic designs inherently fail to retain and improve upon strong pretrained policies requires explicit evidence from head-to-head experiments; without reported metrics on policy retention (e.g., performance degradation curves or KL divergence to the initial policy) on the same Adroit/MimicGen tasks, it is difficult to assess whether the stationary design is necessary or merely sufficient.

    Authors: Section 4.1 already reports head-to-head results on Adroit and MimicGen showing performance degradation for non-stationary baselines (SAC, TD3) initialized from the same pretrained policies, contrasted with XQCfD's retention and improvement. However, we did not include explicit KL divergence to the initial policy or full degradation curves. We will add these metrics in the revision to directly support the necessity claim. revision: yes
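
The retention metric promised here is easy to state precisely. A sketch for diagonal-Gaussian policies: evaluate KL(π_t‖π_0) on a fixed batch of demonstration states and track it over training. The closed form below is standard; the Gaussian policy parameterization is an assumption.

```python
import torch

def kl_diag_gaussians(mu_p, log_std_p, mu_q, log_std_q):
    """KL(p || q) for diagonal Gaussians, summed over action dimensions.

    With p = current policy pi_t and q = pretrained policy pi_0, a flat
    curve over training indicates retention; a spike at the BC-to-RL
    transition is the unlearning failure mode the paper targets.
    """
    var_p = (2 * log_std_p).exp()
    var_q = (2 * log_std_q).exp()
    return (log_std_q - log_std_p
            + (var_p + (mu_p - mu_q) ** 2) / (2 * var_q) - 0.5).sum(-1)
```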

Circularity Check

0 steps flagged

No significant circularity; algorithmic extension with empirical validation

full rationale

The paper presents XQCfD as an extension of the prior XQC actor-critic algorithm, incorporating augmented replay buffers, pretrained policies, and a stationary network architecture. Claims of improved out-of-distribution policy improvement and SOTA performance on Adroit/Robomimic/MimicGen benchmarks are supported by experimental results rather than any closed-form derivations or predictions. No equations, fitted parameters renamed as predictions, or self-citation chains that reduce the central result to its own inputs appear in the provided abstract or high-level description. The derivation chain is self-contained against external benchmarks and does not exhibit self-definitional, fitted-input, or uniqueness-imported circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Based solely on the abstract, the paper introduces no new mathematical axioms or invented entities; it relies on standard RL assumptions such as the existence of useful demonstration data and the ability of replay buffers to mix distributions. No free parameters are explicitly fitted in the provided text.

pith-pipeline@v0.9.0 · 5486 in / 1216 out tokens · 62733 ms · 2026-05-12T04:52:50.016377+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.


Reference graph

Works this paper leans on

45 extracted references · 45 canonical work pages · 1 internal anchor

  1. [1]

    Maximum a posteriori policy optimisation

    Abbas Abdolmaleki, Jost Tobias Springenberg, Yuval Tassa, Remi Munos, Nicolas Heess, and Martin Riedmiller. Maximum a posteriori policy optimisation. In International Conference on Learning Representations (ICLR), 2018

  2. [2]

    Efficient online reinforcement learning with offline data

    Philip J Ball, Laura Smith, Ilya Kostrikov, and Sergey Levine. Efficient online reinforcement learning with offline data. In International Conference on Machine Learning (ICML), 2023

  3. [3]

    A distributional perspective on reinforcement learning

    Marc G. Bellemare, Will Dabney, and Rémi Munos. A distributional perspective on reinforcement learning. In International Conference on Machine Learning (ICML), 2017

  4. [4]

    CrossQ: Batch normalization in deep reinforcement learning for greater sample efficiency and simplicity

    Aditya Bhatt, Daniel Palenicek, Boris Belousov, Max Argus, Artemij Amiranashvili, Thomas Brox, and Jan Peters. CrossQ: Batch normalization in deep reinforcement learning for greater sample efficiency and simplicity. International Conference on Learning Representations (ICLR), 2024

  5. [5]

    On-robot reinforcement learning with goal-contrastive rewards

    Ondrej Biza, Thomas Weng, Lingfeng Sun, Karl Schmeckpeper, Tarik Kelestemur, Yecheng Jason Ma, Robert Platt, Jan-Willem van de Meent, and Lawson LS Wong. On-robot reinforcement learning with goal-contrastive rewards. In IEEE International Conference on Robotics and Automation (ICRA), 2025

  6. [6]

    Randomized Ensembled Double Q-Learning: Learning fast without a model

    Xinyue Chen, Che Wang, Zijian Zhou, and Keith W Ross. Randomized Ensembled Double Q-Learning: Learning fast without a model. In International Conference on Learning Representations (ICLR), 2021

  7. [7]

    Diffusion policy: Visuomotor policy learning via action diffusion

    Cheng Chi, Zhenjia Xu, Siyuan Feng, Eric Cousineau, Yilun Du, Benjamin Burchfiel, Russ Tedrake, and Shuran Song. Diffusion policy: Visuomotor policy learning via action diffusion. The International Journal of Robotics Research (IJRR), 2025

  8. [8]

    Asymptotic evaluation of certain Markov process expectations for large time

    Monroe D Donsker and SR Srinivasa Varadhan. Asymptotic evaluation of certain Markov process expectations for large time. Communications on Pure and Applied Mathematics, 1983

  9. [9]

    An investigation into neural net optimization via Hessian eigenvalue density

    Behrooz Ghorbani, Shankar Krishnan, and Ying Xiao. An investigation into neural net optimization via Hessian eigenvalue density. In International Conference on Machine Learning (ICML), 2019

  10. [10]

    Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor

    Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In International Conference on Machine Learning (ICML), 2018

  11. [11]

    TD-MPC2: Scalable, robust world models for continuous control

    Nicklas Hansen, Hao Su, and Xiaolong Wang. TD-MPC2: Scalable, robust world models for continuous control. In International Conference on Learning Representations (ICLR), 2024

  12. [12]

    Imitation bootstrapped reinforcement learning

    Hengyuan Hu, Suvir Mirchandani, and Dorsa Sadigh. Imitation bootstrapped reinforcement learning. In Robotics: Science and Systems (RSS), 2024

  13. [13]

    Batch normalization: Accelerating deep network training by reducing internal covariate shift

    Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning (ICML), 2015

  14. [14]

    Policy search for motor primitives in robotics

    Jens Kober and Jan Peters. Policy search for motor primitives in robotics. Advances in Neural Information Processing Systems (NeurIPS), 2008

  15. [15]

    Offline-to-online reinforcement learning via balanced replay and pessimistic Q-ensemble

    Seunghyun Lee, Younggyo Seo, Kimin Lee, Pieter Abbeel, and Jinwoo Shin. Offline-to-online reinforcement learning via balanced replay and pessimistic Q-ensemble. In Conference on Robot Learning (CoRL), 2022

  16. [16]

    End-to-end training of deep visuomotor policies

    Sergey Levine, Chelsea Finn, Trevor Darrell, and Pieter Abbeel. End-to-end training of deep visuomotor policies. Journal of Machine Learning Research (JMLR), 2016

  17. [17]

    Normalization and effective learning rates in reinforcement learning

    Clare Lyle, Zeyu Zheng, Khimya Khetarpal, James Martens, Hado van Hasselt, Razvan Pascanu, and Will Dabney. Normalization and effective learning rates in reinforcement learning. In Advances in Neural Information Processing Systems (NeurIPS), 2024

  18. [18]

    What matters in learning from offline human demonstrations for robot manipulation

    Ajay Mandlekar, Danfei Xu, Josiah Wong, Soroush Nasiriany, Chen Wang, Rohun Kulkarni, Li Fei-Fei, Silvio Savarese, Yuke Zhu, and Roberto Martín-Martín. What matters in learning from offline human demonstrations for robot manipulation. In Conference on Robot Learning (CoRL), 2021

  19. [19]

    Mimicgen: A data generation system for scalable robot learning using human demonstrations

    Ajay Mandlekar, Soroush Nasiriany, Bowen Wen, Iretiayo Akinola, Yashraj Narang, Linxi Fan, Yuke Zhu, and Dieter Fox. Mimicgen: A data generation system for scalable robot learning using human demonstrations. In Conference on Robot Learning (CoRL), 2023

  20. [20]

    Periodic activation functions induce stationarity

    Lassi Meronen, Martin Trapp, and Arno Solin. Periodic activation functions induce stationarity. In Advances in Neural Information Processing Systems (NeurIPS), 2021

  21. [21]

    AWAC: Accelerating Online Reinforcement Learning with Offline Datasets

    Ashvin Nair, Abhishek Gupta, Murtaza Dalal, and Sergey Levine. AWAC: Accelerating online reinforcement learning with offline datasets. arXiv preprint arXiv:2006.09359, 2020

  22. [22]

    Cal-QL: Calibrated offline RL pre-training for efficient online fine-tuning

    Mitsuhiko Nakamoto, Simon Zhai, Anikait Singh, Max Sobol Mark, Yi Ma, Chelsea Finn, Aviral Kumar, and Sergey Levine. Cal-QL: Calibrated offline RL pre-training for efficient online fine-tuning. Advances in Neural Information Processing Systems (NeurIPS), 2023

  23. [23]

    What matters for adversarial imitation learning?

    Manu Orsini, Anton Raichuk, Léonard Hussenot, Damien Vincent, Robert Dadashi, Sertan Girgin, Matthieu Geist, Olivier Bachem, Olivier Pietquin, and Marcin Andrychowicz. What matters for adversarial imitation learning? In Advances in Neural Information Processing Systems (NeurIPS), 2021

  24. [24]

    Scaling off-policy reinforcement learning with batch and weight normalization

    Daniel Palenicek, Florian Vogt, Joe Watson, and Jan Peters. Scaling off-policy reinforcement learning with batch and weight normalization. Advances in Neural Information Processing Systems (NeurIPS), 2025

  25. [25]

    XQC: Well-conditioned optimization accelerates deep reinforcement learning

    Daniel Palenicek, Florian Vogt, Joe Watson, Ingmar Posner, and Jan Peters. XQC: Well-conditioned optimization accelerates deep reinforcement learning. International Conference on Learning Representations (ICLR), 2026

  26. [26]

    Policy gradient methods for robotics

    Jan Peters and Stefan Schaal. Policy gradient methods for robotics. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2006

  27. [27]

    Information theoretic methods in statistics and computer science: Lecture 1 — f-divergences, 2020

    Yury Polyanskiy. Information theoretic methods in statistics and computer science: Lecture 1 — f-divergences, 2020

  28. [28]

    Efficient training of artificial neural networks for autonomous navigation

    D. A. Pomerleau. Efficient training of artificial neural networks for autonomous navigation. Neural Computation, 1991

  29. [29]

    Random features for large-scale kernel machines

    Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines. Advances in Neural Information Processing Systems (NeurIPS), 2007

  30. [30]

    Learning complex dexterous manipulation with deep reinforcement learning and demonstrations

    Aravind Rajeswaran, Vikash Kumar, Abhishek Gupta, Giulia Vezzani, John Schulman, Emanuel Todorov, and Sergey Levine. Learning complex dexterous manipulation with deep reinforcement learning and demonstrations. Robotics: Science and Systems (RSS), 2018

  31. [31]

    Gaussian Processes for Machine Learning

    Carl E. Rasmussen and Chris Williams. Gaussian Processes for Machine Learning. MIT Press, 2006

  32. [32]

    On stochastic optimal control and reinforcement learning by approximate inference

    Konrad Rawlik, Marc Toussaint, and Sethu Vijayakumar. On stochastic optimal control and reinforcement learning by approximate inference. In Robotics: Science and Systems (RSS), 2013

  33. [33]

    On pathologies in KL-regularized reinforcement learning from expert demonstrations

    Tim GJ Rudner, Cong Lu, Michael A Osborne, Yarin Gal, and Yee Teh. On pathologies in KL-regularized reinforcement learning from expert demonstrations. Advances in Neural Information Processing Systems (NeurIPS), 2021

  34. [34]

    How does batch normalization help optimization?

    Shibani Santurkar, Dimitris Tsipras, Andrew Ilyas, and Aleksander Madry. How does batch normalization help optimization? Advances in Neural Information Processing Systems (NeurIPS), 2018

  35. [35]

    Temporal difference learning of position evaluation in the game of Go

    Nicol Schraudolph, Peter Dayan, and Terrence J Sejnowski. Temporal difference learning of position evaluation in the game of Go. Advances in Neural Information Processing Systems (NeurIPS), 1993

  36. [36]

    Keep doing what worked: Behavior modelling priors for offline reinforcement learning

    Noah Siegel, Jost Tobias Springenberg, Felix Berkenkamp, Abbas Abdolmaleki, Michael Neunert, Thomas Lampe, Roland Hafner, Nicolas Heess, and Martin Riedmiller. Keep doing what worked: Behavior modelling priors for offline reinforcement learning. In International Conference on Learning Representations (ICLR), 2020

  37. [37]

    Mastering the game of Go with deep neural networks and tree search

    David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 2016

  38. [38]

    Faithful heteroscedastic regression with neural networks

    Andrew Stirn, Hans-Hermann Wessels, Megan Schertzer, Laura Pereira, Neville E. Sanjana, and David A. Knowles. Faithful heteroscedastic regression with neural networks. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2023

  39. [39]

    Reinforcement Learning: An Introduction

    Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. The MIT Press, Cambridge, MA, 1998

  40. [40]

    L2 regularization versus batch and weight normalization

    Twan Van Laarhoven. L2 regularization versus batch and weight normalization. arXiv preprint arXiv:1706.05350, 2017

  41. [41]

    Leveraging Demonstrations for Deep Reinforcement Learning on Robotics Problems with Sparse Rewards

    Mel Vecerik, Todd Hester, Jonathan Scholz, Fumin Wang, Olivier Pietquin, Bilal Piot, Nicolas Heess, Thomas Rothörl, Thomas Lampe, and Martin Riedmiller. Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards. arXiv preprint arXiv:1707.08817, 2017

  42. [42]

    Neural linear models with functional Gaussian process priors

    Joe Watson, Jihao Andreas Lin, Pascal Klink, and Jan Peters. Neural linear models with functional Gaussian process priors. In Third Symposium on Advances in Approximate Bayesian Inference

  43. [43]

    Coherent soft imitation learning

    Joe Watson, Sandy H. Huang, and Nicolas Heess. Coherent soft imitation learning. In Advances in Neural Information Processing Systems (NeurIPS), 2023

  44. [44]

    Adaptive behavior cloning regularization for stable offline-to-online reinforcement learning

    Yi Zhao, Rinu Boney, Alexander Ilin, Juho Kannala, and Joni Pajarinen. Adaptive behavior cloning regularization for stable offline-to-online reinforcement learning. Offline Reinforcement Learning Workshop at Neural Information Processing Systems, 2021
