Behavior-Consistent Deep Reinforcement Learning
Pith reviewed 2026-05-22 10:02 UTC · model grok-4.3
The pith
Selecting temperature proportional to Q-function disagreement bounds pairwise KL divergence between Boltzmann policies.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
For Boltzmann policies, choosing the temperature proportional to Q-function disagreement bounds the pairwise KL divergence between the induced policies. Q-value Expectile Disagreement (QED) is a state-dependent temperature schedule that uses double-critic disagreement as a single-run proxy for cross-run disagreement, yielding policies that are high-performing and distributionally similar across training runs.
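The mechanism behind this claim can be checked numerically on a toy problem. The sketch below is not the paper's proof. It assumes a discrete action set and uses the elementary bound KL ≤ 2ε/τ, where ε is the largest absolute gap between two Q-vectors, so that choosing τ = c·ε gives KL ≤ 2/c. The constant 2, the function names, and the random Q-values are illustrative assumptions; the paper's own bound may take a different form.

```python
import math
import random

def log_boltzmann(q_values, tau):
    """Log-probabilities of the policy pi(a) proportional to exp(Q(a) / tau)."""
    scaled = [q / tau for q in q_values]
    m = max(scaled)
    log_z = m + math.log(sum(math.exp(s - m) for s in scaled))
    return [s - log_z for s in scaled]

def kl_from_logs(log_p, log_q):
    """KL(p || q) for two discrete distributions given as log-probabilities."""
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))

def check_bound(num_actions=8, c=4.0, trials=1000, seed=0):
    """Return the largest observed KL divided by the bound 2 / c.

    Assumption: two 'runs' are simulated by perturbing a random Q-vector.
    """
    rng = random.Random(seed)
    worst_ratio = 0.0
    for _ in range(trials):
        q1 = [rng.uniform(-5.0, 5.0) for _ in range(num_actions)]
        q2 = [q + rng.uniform(-1.0, 1.0) for q in q1]
        eps = max(abs(a - b) for a, b in zip(q1, q2))  # Q-function disagreement
        if eps == 0.0:
            continue  # identical Q-vectors give identical policies
        tau = c * eps  # temperature proportional to disagreement
        div = kl_from_logs(log_boltzmann(q1, tau), log_boltzmann(q2, tau))
        worst_ratio = max(worst_ratio, div / (2.0 / c))
    return worst_ratio
```

A returned value at or below 1 means no sampled pair exceeded the bound. In this toy setting, raising `c` tightens the divergence guarantee and pushes both policies toward uniform, which is the trade-off the abstract warns about.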
What carries the argument
Q-value Expectile Disagreement (QED), a state-dependent temperature schedule in maximum-entropy RL that anchors runs to a common prior by modulating entropy according to double-critic disagreement.
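One way to picture such a schedule is sketched below. The exact QED formula is not given on this page, so everything here is an assumption: the helper names `expectile` and `temperature`, the use of absolute critic gaps over a set of sampled actions, the `scale` and `tau_min` parameters, and the default expectile level. It only illustrates the general idea of a per-state temperature that grows with double-critic disagreement.

```python
def expectile(values, level=0.7, iters=50):
    """Return the `level`-expectile of `values` by fixed-point iteration.

    The expectile m minimizes sum_i |level - 1[x_i < m]| * (x_i - m)^2,
    with 0 < level < 1. At level 0.5 it equals the mean.
    """
    m = sum(values) / len(values)
    for _ in range(iters):
        weights = [level if x > m else 1.0 - level for x in values]
        m = sum(w * x for w, x in zip(weights, values)) / sum(weights)
    return m

def temperature(q1_values, q2_values, scale=1.0, tau_min=1e-3, level=0.7):
    """Hypothetical per-state temperature from two critics.

    q1_values, q2_values: Q-estimates of the two critics for the same
    state, evaluated on the same sampled actions (an assumption).
    """
    gaps = [abs(a - b) for a, b in zip(q1_values, q2_values)]
    return max(tau_min, scale * expectile(gaps, level))
```

When the critics agree, the temperature falls to `tau_min` and the policy can stay sharp. Where they disagree, the temperature rises and the policy is pulled toward the uniform prior.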
If this is right
- Across-run policy divergence drops by two orders of magnitude on 18 continuous-control tasks.
- Return variance falls substantially while performance is preserved.
- Naive entropy increases that impair optimization are avoided through the disagreement-based schedule.
- The KL bound holds specifically for Boltzmann policies when temperature scales with Q-disagreement.
Where Pith is reading between the lines
- Consistent policies could reduce the need for extensive seed averaging in practical RL deployments.
- The disagreement proxy might extend to controlling other sources of training stochasticity beyond entropy.
- Reproducibility metrics in RL benchmarks could incorporate distributional similarity as a standard requirement.
- The approach might be tested in settings with discrete actions or non-Boltzmann policy classes to check generality.
Load-bearing premise
Double-critic disagreement measured inside a single training run accurately reflects the Q-function disagreement that would arise between independent runs started from different random seeds.
What would settle it
Run multiple independent agents on the same task with distinct seeds, compute the actual cross-run Q-function disagreement between them, and check whether this value matches the double-critic disagreement observed within any one of those runs.
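A minimal sketch of that comparison follows, assuming each run exposes two critics and that all critics are evaluated on the same ordered list of state-action pairs. The function names, the mean-absolute-gap statistic, and the choice to average the two critics within a run before comparing runs are illustrative assumptions, not the paper's protocol.

```python
import itertools

def mean_abs_gap(qa, qb):
    """Mean absolute difference between two aligned lists of Q-values."""
    return sum(abs(a - b) for a, b in zip(qa, qb)) / len(qa)

def proxy_report(runs):
    """Compare intra-run and cross-run Q-disagreement.

    runs: list of (q1, q2) pairs, one per seed (at least two seeds).
    Each q is a list of Q-values on the same ordered state-action pairs.
    """
    # Double-critic disagreement inside each run (the single-run proxy).
    intra = [mean_abs_gap(q1, q2) for q1, q2 in runs]
    # Per-run Q-estimate: average of that run's two critics (an assumption).
    run_means = [[(a + b) / 2.0 for a, b in zip(q1, q2)] for q1, q2 in runs]
    # Disagreement between every pair of independent runs.
    cross = [mean_abs_gap(x, y)
             for x, y in itertools.combinations(run_means, 2)]
    return {
        "intra_run": sum(intra) / len(intra),
        "cross_run": sum(cross) / len(cross),
    }
```

If `intra_run` sits well below `cross_run` on real training runs, the proxy underestimates the quantity the bound needs, which is the concern raised in the referee report below.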
Original abstract
Reinforcement learning (RL) often exhibits high variance across training runs, leading to unreliable performance and posing a major challenge to deployment in real-world domains. In this work, we address the challenge of cross-run policy divergence by formalizing the problem of behavior-consistent RL, where the objective is to obtain policies that are both high-performing and distributionally similar across training runs. Our key observation is that maximum-entropy RL provides a direct mechanism for controlling behavioral divergence by anchoring runs to a common (uniform) prior. We prove that, for Boltzmann policies, choosing the temperature proportional to $Q$-function disagreement bounds the pairwise KL divergence between the induced policies. However, we also show that naïvely increasing entropy might impair policy optimization while amplifying off-policy error. Building upon these observations, we propose $Q$-value Expectile Disagreement (QED), a state-dependent temperature schedule that uses double-critic disagreement as a single-run proxy for cross-run disagreement. Empirically, we demonstrate that across 18 continuous-control tasks, QED reduces across-run divergence by two orders of magnitude without sacrificing performance, resulting in a considerable reduction in return variance at modest sample-efficiency costs.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper formalizes behavior-consistent RL to reduce cross-run policy divergence. It proves that for Boltzmann policies, setting the temperature τ(s) proportional to Q-function disagreement bounds the pairwise KL divergence between induced policies. It proposes QED, a state-dependent temperature schedule using double-critic disagreement as a single-run proxy for cross-run Q-disagreement. On 18 continuous-control tasks, QED reduces across-run divergence by two orders of magnitude without sacrificing performance, at modest sample-efficiency cost.
Significance. If the proof is tight and the intra-run proxy faithfully approximates cross-run Q-variance, the result would be significant for improving reproducibility and deployment reliability in RL. The formal link between temperature and KL control, combined with the large empirical reduction in return variance, addresses a practical pain point. The work also highlights trade-offs with entropy regularization and off-policy error.
major comments (2)
- [Proof of temperature-KL relationship (§3)] The central proof (abstract and §3) shows that τ(s) ∝ disagreement bounds KL(π_i || π_j) only when the disagreement term equals the actual Q-variance across independent runs. QED instead uses double-critic disagreement within a single run (same seed, replay buffer, and optimization trajectory). This shared trajectory likely produces systematically smaller disagreement than true cross-run variance, so the resulting τ(s) may be too small to enforce the claimed bound. Please add a derivation or empirical test (e.g., comparing intra-run vs. multi-seed disagreement) showing the proxy remains sufficient.
- [Experimental results (§5)] Table 1 and the QED ablation (likely §5) report large divergence reductions, but the manuscript does not detail the exact policy-divergence metric, whether statistical tests were applied across the 18 tasks, or an ablation isolating the expectile choice. Without these, it is difficult to confirm that the two-order-of-magnitude claim is robust rather than an artifact of the proxy or task selection.
minor comments (2)
- [QED definition (§4)] The notation for the state-dependent temperature schedule and the expectile parameter could be clarified with an explicit equation in §4.
- [Figures] Figure 2 (or equivalent) showing KL curves would benefit from error bars across seeds to visualize the claimed variance reduction.
Simulated Author's Rebuttal
We thank the referee for their insightful comments, which help improve the clarity and rigor of our work on behavior-consistent RL. We address each major comment below, proposing specific revisions to the manuscript.
- Referee: [Proof of temperature-KL relationship (§3)] The central proof (abstract and §3) shows that τ(s) ∝ disagreement bounds KL(π_i || π_j) only when the disagreement term equals the actual Q-variance across independent runs. QED instead uses double-critic disagreement within a single run (same seed, replay buffer, and optimization trajectory). This shared trajectory likely produces systematically smaller disagreement than true cross-run variance, so the resulting τ(s) may be too small to enforce the claimed bound. Please add a derivation or empirical test (e.g., comparing intra-run vs. multi-seed disagreement) showing the proxy remains sufficient.
  Authors: We agree that the theoretical bound in Section 3 applies when the disagreement exactly matches the cross-run Q-variance. QED uses double-critic disagreement as a proxy, which, as the referee notes, may be smaller due to shared optimization trajectories. To strengthen the connection, we will add an empirical comparison in the appendix showing intra-run vs. cross-run disagreement levels across several tasks. This analysis will illustrate that the proxy, while conservative, still leads to effective KL bounding in practice as evidenced by the empirical results. We will also clarify in the text that the bound is for the true disagreement and QED is a practical surrogate.
  Revision: yes
- Referee: [Experimental results (§5)] Table 1 and the QED ablation (likely §5) report large divergence reductions, but the manuscript does not detail the exact policy-divergence metric, whether statistical tests were applied across the 18 tasks, or an ablation isolating the expectile choice. Without these, it is difficult to confirm that the two-order-of-magnitude claim is robust rather than an artifact of the proxy or task selection.
  Authors: We will revise the experimental section to explicitly define the policy-divergence metric as the mean pairwise KL divergence between policies trained with different seeds, evaluated on a common set of states. We will also report statistical tests (such as Wilcoxon signed-rank tests) to confirm the significance of the divergence reductions across the 18 tasks. For the expectile choice, we will expand the ablation studies to include a dedicated analysis isolating the impact of the expectile parameter by comparing QED with variants using different expectile values and fixed-temperature baselines. These changes will provide stronger evidence for the robustness of the reported improvements.
  Revision: yes
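The metric named in the second response, mean pairwise KL divergence between seeds on a common set of states, can be sketched as follows. The rebuttal does not specify the policy class, so this example assumes diagonal Gaussian policies, a common choice in continuous control, and averages over both orderings of each pair because KL is asymmetric. The function names and the callable interface for a policy are hypothetical.

```python
import itertools
import math

def gaussian_kl(mu_p, std_p, mu_q, std_q):
    """KL(N(mu_p, diag(std_p^2)) || N(mu_q, diag(std_q^2))), closed form."""
    total = 0.0
    for mp, sp, mq, sq in zip(mu_p, std_p, mu_q, std_q):
        total += (math.log(sq / sp)
                  + (sp ** 2 + (mp - mq) ** 2) / (2.0 * sq ** 2)
                  - 0.5)
    return total

def mean_pairwise_kl(policies, states):
    """Average KL over all ordered pairs of policies and all given states.

    policies: callables mapping a state to (mean list, std list);
    one callable per training seed (hypothetical interface).
    states: the common evaluation states shared by every pair.
    """
    total, count = 0.0, 0
    for pi, pj in itertools.permutations(policies, 2):
        for s in states:
            total += gaussian_kl(*pi(s), *pj(s))
            count += 1
    return total / count
```

Fixing the evaluation states across all pairs matters: if each pair were scored on states visited by its own policies, differences in state distribution would be mixed into the divergence estimate.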
Circularity Check
No significant circularity in derivation chain
The paper first states a mathematical proof that setting Boltzmann temperature proportional to Q-function disagreement bounds pairwise KL divergence between induced policies. This is presented as a first-principles derivation rather than a definitional equivalence or fitted input. The QED method then adopts double-critic disagreement within a single run as a practical proxy for cross-run disagreement, which is an engineering approximation justified by the subsequent empirical results rather than by construction. No load-bearing self-citations, ansatz smuggling, or renaming of known results are indicated in the provided text. The central empirical claim of two-order-of-magnitude reduction in across-run divergence on 18 tasks rests on observed outcomes against external benchmarks and is not forced by the inputs or definitions.
Axiom & Free-Parameter Ledger
axioms (2)
- Domain assumption: Policies are Boltzmann distributions over actions given the current Q-function.
- Ad hoc to paper: Double-critic disagreement is a sufficient statistic for cross-run Q-disagreement.