Behavior-Consistent Deep Reinforcement Learning

Benjamin Eysenbach; Claas Voelcker; Eric Eaton; Liv G. d'Aliberti; Marcel Hussing

arxiv: 2605.21214 · v2 · pith:N4OHBJPZnew · submitted 2026-05-20 · 💻 cs.LG · cs.AI

Marcel Hussing, Liv G. d'Aliberti, Claas Voelcker, Benjamin Eysenbach, Eric Eaton

Pith reviewed 2026-05-22 10:02 UTC · model grok-4.3

keywords: reinforcement learning, behavior consistency, policy divergence, maximum entropy RL, Q-value disagreement, variance reduction, continuous control, KL divergence

The pith

Selecting temperature proportional to Q-function disagreement bounds pairwise KL divergence between Boltzmann policies.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tries to establish that reinforcement learning can produce policies that are both high-performing and behaviorally consistent across independent training runs. It proves that for Boltzmann policies in maximum-entropy RL, setting the temperature in proportion to Q-function disagreement directly limits the KL divergence between policies from different runs. A sympathetic reader would care because high cross-run variance makes RL results unreliable and hard to deploy. The work introduces Q-value Expectile Disagreement as a practical state-dependent schedule that uses double-critic disagreement inside one run as a proxy for true cross-run disagreement, showing large reductions in divergence on continuous-control tasks.

Core claim

For Boltzmann policies, choosing the temperature proportional to Q-function disagreement bounds the pairwise KL divergence between the induced policies. Q-value Expectile Disagreement (QED) is a state-dependent temperature schedule that uses double-critic disagreement as a single-run proxy for cross-run disagreement, yielding policies that are high-performing and distributionally similar across training runs.
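The mechanism behind this claim can be checked numerically with a standard softmax inequality. If two Q-vectors differ by at most ε in every action, the log-probabilities of the corresponding Boltzmann policies at temperature τ differ by at most 2ε/τ, so their KL divergence is at most 2ε/τ; choosing τ = kε turns this into 2/k, whatever the size of the disagreement. The sketch below assumes a discrete action set, made-up Q-values, and an illustrative constant `k`; the paper's own bound and constants may differ.

```python
import math

def boltzmann(q_values, temperature):
    """Softmax policy pi(a) proportional to exp(Q(a) / temperature)."""
    scaled = [q / temperature for q in q_values]
    m = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(x - m) for x in scaled]
    total = sum(weights)
    return [w / total for w in weights]

def kl(p, q):
    """KL(p || q) for two discrete distributions with full support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

# Hypothetical Q-values for one state from two independent runs.
q_run_a = [1.0, 0.2, -0.5, 0.7]
q_run_b = [0.6, 0.5, -0.1, 0.9]

# Sup-norm disagreement between the two runs at this state.
eps = max(abs(a - b) for a, b in zip(q_run_a, q_run_b))

k = 2.0          # illustrative proportionality constant (assumption)
tau = k * eps    # temperature proportional to the disagreement
pi_a = boltzmann(q_run_a, tau)
pi_b = boltzmann(q_run_b, tau)

kl_ab = kl(pi_a, pi_b)
bound = 2.0 * eps / tau  # equals 2 / k, independent of eps
print(f"KL = {kl_ab:.4f}, bound = {bound:.4f}")
```

Rescaling both Q-vectors leaves `bound` unchanged, because the temperature grows with the disagreement; that is the sense in which a disagreement-proportional temperature caps the divergence.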

What carries the argument

Q-value Expectile Disagreement (QED), a state-dependent temperature schedule in maximum-entropy RL that anchors runs to a common prior by modulating entropy according to double-critic disagreement.
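The summary here does not give QED's formula, so the following is only one way a disagreement-driven, state-dependent temperature could look. Everything specific is an assumption made for illustration: the function names, taking an upper expectile of the absolute gap between two critics over sampled actions, the scale `k`, the expectile `level`, and the floor `tau_min`.

```python
def expectile(values, level, iterations=100):
    """Sample expectile: minimizer of the asymmetric squared loss at `level` in (0, 1)."""
    m = sum(values) / len(values)  # level = 0.5 recovers the mean
    for _ in range(iterations):
        # Points above the current estimate get weight `level`, the rest 1 - level.
        weights = [level if v > m else 1.0 - level for v in values]
        m = sum(w * v for w, v in zip(weights, values)) / sum(weights)
    return m

def state_temperature(q1_samples, q2_samples, k=0.2, level=0.9, tau_min=1e-3):
    """Hypothetical schedule: tau(s) = max(tau_min, k * expectile of |Q1 - Q2|)."""
    gaps = [abs(a - b) for a, b in zip(q1_samples, q2_samples)]
    return max(tau_min, k * expectile(gaps, level))

# Two critics evaluated on a few sampled actions at two hypothetical states.
tau_agree = state_temperature([1.00, 0.90, 1.10], [1.02, 0.91, 1.08])
tau_disagree = state_temperature([1.00, 0.90, 1.10], [0.40, 1.50, 0.70])
print(tau_agree, tau_disagree)
```

Under these assumptions, states where the critics agree keep a low temperature (near-greedy behavior), while states with large critic disagreement receive more entropy.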

If this is right

Across-run policy divergence drops by two orders of magnitude on 18 continuous-control tasks.
Return variance falls substantially while performance is preserved.
Naive entropy increases that impair optimization are avoided through the disagreement-based schedule.
The KL bound holds specifically for Boltzmann policies when temperature scales with Q-disagreement.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Consistent policies could reduce the need for extensive seed averaging in practical RL deployments.
The disagreement proxy might extend to controlling other sources of training stochasticity beyond entropy.
Reproducibility metrics in RL benchmarks could incorporate distributional similarity as a standard requirement.
The approach might be tested in settings with discrete actions or non-Boltzmann policy classes to check generality.

Load-bearing premise

Double-critic disagreement measured inside a single training run accurately reflects the Q-function disagreement that would arise between independent runs started from different random seeds.

What would settle it

Run multiple independent agents on the same task with distinct seeds, compute the actual cross-run Q-function disagreement between them, and check whether this value matches the double-critic disagreement observed within any one of those runs.
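A minimal sketch of that check, assuming each run exposes its two critics' values on a shared set of probe state-action pairs. The numbers are invented, the helper names are hypothetical, and mean absolute difference is just one possible disagreement measure.

```python
from itertools import combinations

def mean_abs_diff(xs, ys):
    """Average absolute gap between two equally long lists of Q-values."""
    return sum(abs(x - y) for x, y in zip(xs, ys)) / len(xs)

def intra_run_disagreement(run):
    """run = (critic1_values, critic2_values) on the shared probe pairs."""
    q1, q2 = run
    return mean_abs_diff(q1, q2)

def cross_run_disagreement(runs):
    """Mean pairwise gap between run-level Q estimates (average of both critics)."""
    means = [[(a + b) / 2 for a, b in zip(q1, q2)] for q1, q2 in runs]
    pairs = list(combinations(means, 2))
    return sum(mean_abs_diff(x, y) for x, y in pairs) / len(pairs)

# Made-up critic outputs for three seeds on three probe state-action pairs.
runs = [
    ([1.00, 0.50, 0.20], [1.10, 0.45, 0.25]),
    ([1.40, 0.10, 0.60], [1.35, 0.20, 0.55]),
    ([0.70, 0.80, 0.00], [0.80, 0.75, 0.10]),
]
intra = [intra_run_disagreement(r) for r in runs]
cross = cross_run_disagreement(runs)
ratio = cross / (sum(intra) / len(intra))
print(f"intra-run: {intra}, cross-run: {cross:.3f}, ratio: {ratio:.2f}")
```

A ratio well above 1 on real runs would indicate that the single-run proxy understates cross-run disagreement; a ratio that stays roughly constant across states and tasks would suggest the proxy is usable up to a scale factor.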

Figures

Figures reproduced from arXiv: 2605.21214 by Benjamin Eysenbach, Claas Voelcker, Eric Eaton, Liv G. d'Aliberti, Marcel Hussing.

**Figure 1.** QED makes independently trained policies visibly behavior-consistent. Visualization of policies from three different training runs on the cheetah_run task, comparing (left) traditional entropy autotuning (Haarnoja et al., 2018b) against (right) our approach (QED). Color shade denotes the mean pairwise L2 distance between state vectors at each timestep: blue is low, red is high.

**Figure 2.** High entropy can amplify off-policy extrapolation error. We repeat the toy MDP, but prefill the replay buffer with actions from only part of the action space, leaving one reward mode outside the data support. The learned Q-functions exhibit extrapolation error, and the policy accentuates this problem as it predicts value outside the support, particularly at larger α values.

**Figure 3.** QED reduces inter-run policy divergence while preserving returns. (a) Final normalized return vs pairwise symmetric KL across independent training runs on the 18-task dm_control suite. Lower KL indicates more behaviorally consistent policies. Applying QED to both SAC-LN and MAD-SAC decreases pairwise KL by about two orders of magnitude, while retaining comparable normalized return. (b) Width of the 95% boo…

**Figure 4.** QED produces more consistent rollout-level behavior across independently trained policies. Measuring pairwise L2 action distances across policies at each step, we find that QED reduces cumulative action distance. Action distance: We take the trained MAD-SAC policies and roll them out for 20 evaluation trajectories of 100 steps. In each task, we compute the pairwise (L2) action distance between all polic…

**Figure 5.** QED reduces training-time variance and policy divergence on a high-variance control task. Return over training steps. QED improves performance and reduces dispersion across seeds. Finally, to demonstrate the power of our approach, we highlight a challenging task in the dm_control suite: the hopper_hop task. Across various state-of-the-art algorithms, the return variance on this task is very high and repo…

**Figure 6.** Early double-critic disagreement predicts cross-seed …

**Figure 7.** Per-task learning curves for SAC-LN and SAC-QED on the 18-task dm_control suite. Episode reward over environment steps for the SAC-LN baseline and QED variants with k ∈ {0.1, 0.2, 0.4}, matching the SAC-LN conditions used in the aggregate results in Figure 3a. QED generally preserves strong performance on easier tasks while introducing a performance-consistency trade-off on harder locomotion tasks, especia…

**Figure 8.** Per-task learning curves for MAD-SAC and MAD-SAC-QED on the 18-task dm_control suite. Episode reward over environment steps for the MAD-SAC baseline and QED variants with k ∈ {0.2, 0.3, 0.4}, matching the MAD-SAC conditions used in the aggregate results in Figure 3a. Compared with SAC-LN, MAD-SAC is more robust to the additional entropy induced by QED, and QED often preserves or improves learning while red…

**Figure 9.** Per-task inter-run policy divergence for SAC-LN and SAC-QED on the 18-task dm_control suite. Pairwise symmetric pre-tanh policy KL over environment steps for the SAC-LN baseline and QED variants with k ∈ {0.1, 0.2, 0.4}, matching the SAC-LN conditions used in the aggregate results in Figure 3a. Across most tasks, QED substantially lowers inter-run policy divergence relative to standard target-entropy tunin…

**Figure 10.** Per-task inter-run policy divergence for MAD-SAC and MAD-SAC-QED on the 18-task dm_control suite. Pairwise symmetric pre-tanh policy KL over environment steps for the MAD-SAC baseline and QED variants with k ∈ {0.2, 0.3, 0.4}, matching the MAD-SAC conditions used in the aggregate results in Figure 3a. QED consistently suppresses the growth of cross-seed policy divergence, showing that the behavioral-consi…
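The rollout-level measure named in the Figure 4 caption (pairwise L2 action distance across policies at each step, accumulated over the rollout) could be computed roughly as follows. The caption is truncated, so how rollouts are aligned across policies is not known; this sketch assumes one action vector per policy per step, and the function names are hypothetical.

```python
import math
from itertools import combinations

def pairwise_action_distance(actions_per_policy):
    """Mean L2 distance over all pairs of policies' action vectors at one step."""
    pairs = list(combinations(actions_per_policy, 2))
    return sum(math.dist(a, b) for a, b in pairs) / len(pairs)

def cumulative_action_distance(rollouts):
    """rollouts[i][t] is the action of policy i at step t; returns the running sum."""
    horizon = min(len(r) for r in rollouts)
    running, curve = 0.0, []
    for t in range(horizon):
        running += pairwise_action_distance([r[t] for r in rollouts])
        curve.append(running)
    return curve

# Two hypothetical policies, two steps, 2-D actions.
rollouts = [
    [(0.0, 0.0), (0.1, 0.0)],
    [(3.0, 4.0), (0.1, 0.0)],
]
curve = cumulative_action_distance(rollouts)
print(curve)
```

A flatter curve means the policies choose more similar actions step by step, which is the behavior the caption attributes to QED.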

Original abstract

Reinforcement learning (RL) often exhibits high variance across training runs, leading to unreliable performance and posing a major challenge to deployment in real-world domains. In this work, we address the challenge of cross-run policy divergence by formalizing the problem of behavior-consistent RL, where the objective is to obtain policies that are both high-performing and distributionally similar across training runs. Our key observation is that maximum-entropy RL provides a direct mechanism for controlling behavioral divergence by anchoring runs to a common (uniform) prior. We prove that, for Boltzmann policies, choosing the temperature proportional to $Q$-function disagreement bounds the pairwise KL divergence between the induced policies. However, we also show that naively increasing entropy might impair policy optimization while amplifying off-policy error. Building upon these observations, we propose $Q$-value Expectile Disagreement (QED), a state-dependent temperature schedule that uses double-critic disagreement as a single-run proxy for cross-run disagreement. Empirically, we demonstrate that across 18 continuous-control tasks, QED reduces across-run divergence by two orders of magnitude without sacrificing performance, resulting in a considerable reduction in return variance at modest sample-efficiency costs.
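The phrase "anchoring runs to a common (uniform) prior" can be made concrete with a standard identity. The block below is a one-step view over a finite action set $A$ with uniform distribution $U$, which is a simplification of the continuous-control setting in the paper; the notation is chosen here for illustration and is not taken from the paper.

```latex
% One-step view over a finite action set A, with U(a) = 1/|A|.
\begin{align*}
\mathcal{H}\big(\pi(\cdot \mid s)\big)
  &= -\sum_{a} \pi(a \mid s) \log \pi(a \mid s)
   = \log |A| - \mathrm{KL}\big(\pi(\cdot \mid s) \,\|\, U\big), \\
\mathbb{E}_{a \sim \pi}\big[Q(s,a)\big] + \tau\, \mathcal{H}\big(\pi(\cdot \mid s)\big)
  &= \mathbb{E}_{a \sim \pi}\big[Q(s,a)\big]
     - \tau\, \mathrm{KL}\big(\pi(\cdot \mid s) \,\|\, U\big) + \tau \log |A|, \\
\pi_\tau(a \mid s)
  &= \frac{\exp\big(Q(s,a)/\tau\big)}{\sum_{b} \exp\big(Q(s,b)/\tau\big)}
  \;\xrightarrow{\ \tau \to \infty\ }\; U(a).
\end{align*}
```

Maximizing the entropy-regularized objective is therefore the same as penalizing the KL divergence to the uniform prior with weight $\tau$. Its maximizer is the Boltzmann policy $\pi_\tau$, and a larger $\tau$ pulls every run toward the same distribution $U$, which is why the temperature acts as a handle on cross-run divergence.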

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

QED ties temperature to critic disagreement for more consistent Boltzmann policies across seeds, with a clean KL bound and big empirical gains, but the single-run proxy is the shaky part.

The paper gives a concrete way to reduce policy divergence across random seeds in continuous-control RL by tying the entropy temperature to Q-function disagreement. They start from max-ent RL and prove that for Boltzmann policies, setting the temperature proportional to the disagreement between Q-values bounds the KL divergence between policies from different runs. Then they introduce QED, which uses the disagreement between a pair of critics inside a single training run as a proxy for that cross-run disagreement. On 18 tasks this cuts the divergence by two orders of magnitude while keeping returns competitive.

What stands out is the clean theoretical step linking temperature choice directly to a KL bound; that is not a routine extension of prior max-ent work. The empirical scale of the consistency gain is also notable.

The soft spot is the proxy itself. The stress-test note is right to flag that two critics trained on the same replay buffer and seed will likely disagree less than two fully independent runs would. The paper treats the intra-run disagreement as a stand-in, but without direct measurement of how well it tracks the true inter-run Q-spread, the schedule could be under-calibrated. If the experiments include an ablation that compares the proxy to actual multi-seed variance, that would help; otherwise the justification rests more on the outcome than on the mechanism being faithful.

This is aimed at researchers who care about repeatable behavior in RL rather than just average performance. The combination of a provable bound and strong empirical consistency makes it worth sending out for review, even if the proxy needs tighter validation in revision.

Referee Report

2 major / 2 minor

Summary. The paper formalizes behavior-consistent RL to reduce cross-run policy divergence. It proves that for Boltzmann policies, setting the temperature τ(s) proportional to Q-function disagreement bounds the pairwise KL divergence between induced policies. It proposes QED, a state-dependent temperature schedule using double-critic disagreement as a single-run proxy for cross-run Q-disagreement. On 18 continuous-control tasks, QED reduces across-run divergence by two orders of magnitude without sacrificing performance, at modest sample-efficiency cost.

Significance. If the proof is tight and the intra-run proxy faithfully approximates cross-run Q-variance, the result would be significant for improving reproducibility and deployment reliability in RL. The formal link between temperature and KL control, combined with the large empirical reduction in return variance, addresses a practical pain point. The work also highlights trade-offs with entropy regularization and off-policy error.

major comments (2)

[Proof of temperature-KL relationship (§3)] The central proof (abstract and §3) shows that τ(s) ∝ disagreement bounds KL(π_i || π_j) only when the disagreement term equals the actual Q-variance across independent runs. QED instead uses double-critic disagreement within a single run (same seed, replay buffer, and optimization trajectory). This shared trajectory likely produces systematically smaller disagreement than true cross-run variance, so the resulting τ(s) may be too small to enforce the claimed bound. Please add a derivation or empirical test (e.g., comparing intra-run vs. multi-seed disagreement) showing the proxy remains sufficient.
[Experimental results (§5)] Table 1 and the QED ablation (likely §5) report large divergence reductions, but the manuscript does not detail the exact policy-divergence metric, whether statistical tests were applied across the 18 tasks, or an ablation isolating the expectile choice. Without these, it is difficult to confirm that the two-order-of-magnitude claim is robust rather than an artifact of the proxy or task selection.

minor comments (2)

[QED definition (§4)] The notation for the state-dependent temperature schedule and the expectile parameter could be clarified with an explicit equation in §4.
[Figures] Figure 2 (or equivalent) showing KL curves would benefit from error bars across seeds to visualize the claimed variance reduction.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their insightful comments, which help improve the clarity and rigor of our work on behavior-consistent RL. We address each major comment below, proposing specific revisions to the manuscript.

Referee: [Proof of temperature-KL relationship (§3)] The central proof (abstract and §3) shows that τ(s) ∝ disagreement bounds KL(π_i || π_j) only when the disagreement term equals the actual Q-variance across independent runs. QED instead uses double-critic disagreement within a single run (same seed, replay buffer, and optimization trajectory). This shared trajectory likely produces systematically smaller disagreement than true cross-run variance, so the resulting τ(s) may be too small to enforce the claimed bound. Please add a derivation or empirical test (e.g., comparing intra-run vs. multi-seed disagreement) showing the proxy remains sufficient.

Authors: We agree that the theoretical bound in Section 3 applies when the disagreement exactly matches the cross-run Q-variance. QED uses double-critic disagreement as a proxy, which, as the referee notes, may be smaller due to shared optimization trajectories. To strengthen the connection, we will add an empirical comparison in the appendix showing intra-run vs. cross-run disagreement levels across several tasks. This analysis will illustrate that the proxy, while conservative, still leads to effective KL bounding in practice as evidenced by the empirical results. We will also clarify in the text that the bound is for the true disagreement and QED is a practical surrogate.
revision: yes
Referee: [Experimental results (§5)] Table 1 and the QED ablation (likely §5) report large divergence reductions, but the manuscript does not detail the exact policy-divergence metric, whether statistical tests were applied across the 18 tasks, or an ablation isolating the expectile choice. Without these, it is difficult to confirm that the two-order-of-magnitude claim is robust rather than an artifact of the proxy or task selection.

Authors: We will revise the experimental section to explicitly define the policy-divergence metric as the mean pairwise KL divergence between policies trained with different seeds, evaluated on a common set of states. We will also report statistical tests (such as Wilcoxon signed-rank tests) to confirm the significance of the divergence reductions across the 18 tasks. For the expectile choice, we will expand the ablation studies to include a dedicated analysis isolating the impact of the expectile parameter by comparing QED with variants using different expectile values and fixed-temperature baselines. These changes will provide stronger evidence for the robustness of the reported improvements.
revision: yes
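As an illustration of the metric described in this response, the sketch below computes a mean pairwise symmetric KL between policies from different seeds on a common set of states. It assumes diagonal Gaussian (pre-tanh) policies, in line with the "pre-tanh policy KL" wording in the figure captions, and takes the symmetric KL as the sum of both directions; both choices, along with the function names and the numbers, are assumptions rather than the paper's definitions.

```python
import math
from itertools import combinations

def gaussian_kl(mu_p, std_p, mu_q, std_q):
    """KL(N(mu_p, std_p^2) || N(mu_q, std_q^2)) for diagonal Gaussians."""
    return sum(
        math.log(sq / sp) + (sp ** 2 + (mp - mq) ** 2) / (2 * sq ** 2) - 0.5
        for mp, sp, mq, sq in zip(mu_p, std_p, mu_q, std_q)
    )

def symmetric_kl(p, q):
    """Sum of both KL directions; p and q are (mean, std) tuples of lists."""
    return gaussian_kl(*p, *q) + gaussian_kl(*q, *p)

def mean_pairwise_symmetric_kl(policies_per_state):
    """policies_per_state[s][i] = (mean, std) of seed i's policy at state s."""
    total, count = 0.0, 0
    for per_seed in policies_per_state:
        for p, q in combinations(per_seed, 2):
            total += symmetric_kl(p, q)
            count += 1
    return total / count

# Three hypothetical seeds evaluated on two shared states, 1-D actions.
policies_per_state = [
    [([0.0], [1.0]), ([1.0], [1.0]), ([0.5], [1.0])],
    [([0.2], [0.5]), ([0.2], [0.5]), ([0.2], [0.5])],
]
divergence = mean_pairwise_symmetric_kl(policies_per_state)
print(f"mean pairwise symmetric KL: {divergence:.4f}")
```

Evaluating every seed on the same states matters here: if each policy were scored only on its own visited states, differences in state distribution would be mixed into the divergence estimate.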

Circularity Check

0 steps flagged

No significant circularity in derivation chain

The paper first states a mathematical proof that setting Boltzmann temperature proportional to Q-function disagreement bounds pairwise KL divergence between induced policies. This is presented as a first-principles derivation rather than a definitional equivalence or fitted input. The QED method then adopts double-critic disagreement within a single run as a practical proxy for cross-run disagreement, which is an engineering approximation justified by the subsequent empirical results rather than by construction. No load-bearing self-citations, ansatz smuggling, or renaming of known results are indicated in the provided text. The central empirical claim of two-order-of-magnitude reduction in across-run divergence on 18 tasks rests on observed outcomes against external benchmarks and is not forced by the inputs or definitions.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The work rests on standard maximum-entropy RL assumptions plus the new modeling choice that single-run critic disagreement approximates cross-run disagreement.

axioms (2)

domain assumption: Policies are Boltzmann distributions over actions given the current Q-function. Invoked for the KL-divergence bound stated in the abstract.
ad hoc to paper: Double-critic disagreement is a sufficient statistic for cross-run Q-disagreement. Central modeling step that enables the single-run QED schedule.



work page


