Adaptive Ensemble Aggregation for Actor-Critics

Bahareh Tasdighi; Manuel Haussmann; Melih Kandemir; Nicklas Werge; Yi-Shan Wu

arxiv: 2507.23501 · v2 · submitted 2025-07-31 · 💻 cs.LG · stat.ML

Pith reviewed 2026-05-19 02:11 UTC · model grok-4.3

keywords: adaptive ensemble aggregation · actor-critic · off-policy reinforcement learning · value estimation · variance reduction · monotonic policy improvement · continuous control

The pith

Adaptive Ensemble Aggregation converges to an equilibrium that minimizes value estimation error, and its estimation bias vanishes as the ensemble grows.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Adaptive Ensemble Aggregation (AEA), which builds the targets for critic and actor updates in off-policy actor-critic learning from the training process itself. Instead of a static or hyperparameter-heavy aggregation rule, AEA adapts the combination rule on the fly. The central proofs show convergence to a unique equilibrium inside a stability region, where the aggregation parameter minimizes value estimation error. A shrinkage effect is also established: bias drops to zero as the full ensemble size increases, variance reduction improves with every added model, and monotonic policy improvement is guaranteed. In experiments on continuous control tasks, AEA outperforms the baselines on most tasks.

Core claim

AEA dynamically constructs ensemble-based targets for both critic and actor updates directly from training dynamics. We prove that AEA converges to a unique equilibrium where the aggregation parameter minimizes value estimation error within a defined stability region. Theoretically, we establish that AEA achieves a shrinkage property where the estimation bias vanishes as the total ensemble size grows. Unlike subset-based methods, AEA exploits the full ensemble to achieve optimal variance reduction scaling inversely with the total number of models and maximal Fisher information, with a formal guarantee for monotonic policy improvement.

What carries the argument

Adaptive Ensemble Aggregation (AEA), the mechanism that dynamically constructs ensemble targets from training dynamics to adapt the aggregation parameter without fixed rules.
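
This page does not give the update rule for the aggregation parameter, so the following is only a toy sketch of what adapting a combination rule from a training signal can look like. It assumes a hypothetical one-parameter rule (ensemble mean plus `kappa` times ensemble spread) and a hypothetical squared-error objective against a reference value. The names `aggregate`, `update_kappa`, and `kappa` are illustrative and are not taken from the paper.

```python
import statistics

def aggregate(q_values, kappa):
    """Hypothetical rule: ensemble mean shifted by kappa times ensemble spread."""
    return statistics.fmean(q_values) + kappa * statistics.pstdev(q_values)

def update_kappa(kappa, q_values, reference, lr=0.05):
    """One gradient step on 0.5 * (aggregate - reference)**2 with respect to kappa."""
    spread = statistics.pstdev(q_values)
    error = aggregate(q_values, kappa) - reference
    return kappa - lr * error * spread  # d(aggregate)/d(kappa) equals spread

q_values = [10.0, 12.0, 14.0]  # made-up critic outputs for one state-action pair
reference = 11.0               # made-up stand-in for a less biased value signal
kappa = 0.0
for _ in range(200):
    kappa = update_kappa(kappa, q_values, reference)

print(kappa, aggregate(q_values, kappa))
```

In this toy run the critics overestimate on average, so `kappa` settles at a negative value and the aggregate moves toward the reference. Whether AEA uses anything like this parameterization cannot be confirmed from the text here.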

If this is right

AEA converges to a unique equilibrium that minimizes value estimation error inside a defined stability region.
Estimation bias vanishes as total ensemble size grows, removing the fixed variance floor of subset methods.
Variance reduction scales inversely with the total number of models while using the full ensemble.
Maximal Fisher information is extracted from the complete ensemble.
Monotonic policy improvement holds under the adaptive aggregation regime.
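
The scaling claim in the list above can be illustrated with a small simulation. This is a sketch under strong assumptions that are not from the paper: critic errors are independent standard normal draws, the full-ensemble rule is a plain average, and the subset rule is a REDQ-style minimum over two randomly chosen critics. All function names are illustrative.

```python
import random
import statistics

def full_mean(estimates, rng):
    """Average over every critic in the ensemble (rng is unused)."""
    return sum(estimates) / len(estimates)

def subset_min(estimates, rng, m=2):
    """Minimum over m critics drawn at random, whatever the ensemble size."""
    return min(rng.sample(estimates, m))

def empirical_variance(rule, n_models, trials=4000, seed=0):
    """Variance of the aggregated estimate across simulated ensembles."""
    rng = random.Random(seed)
    values = []
    for _ in range(trials):
        estimates = [rng.gauss(0.0, 1.0) for _ in range(n_models)]
        values.append(rule(estimates, rng))
    return statistics.pvariance(values)

for n in (2, 8, 32):
    print(n, empirical_variance(full_mean, n), empirical_variance(subset_min, n))
```

Under these assumptions the variance of the full average tracks 1/N, while the variance of the two-critic minimum stays roughly constant as N grows. Correlated critic errors, which are common in practice, would weaken the 1/N behavior.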

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The self-calibrating property could reduce the amount of task-specific tuning required when deploying ensemble actor-critics in new environments.
If the shrinkage property holds beyond the tested tasks, larger ensembles would become strictly more useful rather than redundant.
The same dynamic-target construction might transfer to other off-policy methods that currently rely on static ensemble rules.

Load-bearing premise

Training dynamics supply enough information for the aggregation parameter to reach its unique equilibrium inside the stability region without introducing instabilities.

What would settle it

Measure value estimation bias while steadily increasing ensemble size; the central claim fails if bias stops shrinking and plateaus beyond some finite size.
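
One way to make this check concrete is sketched below. It assumes the reader has already measured absolute value-estimation bias at several ensemble sizes. The function name `bias_plateaus`, the relative tolerance, the window length, and the floor are all hypothetical choices, not part of the paper's protocol.

```python
def bias_plateaus(abs_bias_by_size, rel_tol=0.05, window=3, floor=1e-3):
    """Return True if bias has stopped shrinking over the largest ensemble sizes.

    abs_bias_by_size maps ensemble size -> measured absolute bias.
    """
    sizes = sorted(abs_bias_by_size)
    if len(sizes) < window:
        raise ValueError("need at least `window` ensemble sizes")
    tail = [abs_bias_by_size[n] for n in sizes[-window:]]
    if tail[-1] <= floor:
        return False  # bias has effectively vanished, so this is not a plateau
    # Relative drop between consecutive measurements in the tail.
    drops = [(a - b) / a if a > 0 else 0.0 for a, b in zip(tail, tail[1:])]
    return all(d < rel_tol for d in drops)

# Made-up measurements for illustration only.
shrinking = {2: 0.5, 4: 0.25, 8: 0.125, 16: 0.0625}
stalled = {2: 0.5, 4: 0.3, 8: 0.29, 16: 0.29}
print(bias_plateaus(shrinking), bias_plateaus(stalled))
```

A plateau flagged this way would only be suggestive: measurement noise in the bias estimates needs to be accounted for, for example by repeating the measurement across seeds.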

Figures

Figures reproduced from arXiv: 2507.23501 by Bahareh Tasdighi, Manuel Haussmann, Melih Kandemir, Nicklas Werge, Yi-Shan Wu.

**Figure 2.** Learning curves for all MuJoCo tasks under the interactive learning regime (Table 2).

**Figure 3.** Learning curves for all MuJoCo tasks under the sample-efficient learning regime (Table 2).

**Figure 4.** Trajectories of the directional aggregation parameters

**Figure 5.** Trajectories of the directional aggregation parameters

**Figure 6.** Trajectories of the directional aggregation parameters

**Figure 7.** Trajectories of the directional aggregation parameters

**Figure 8.** Effect of initializing the critic-side directional aggregation parameter

**Figure 9.** Effect of initializing the critic-side directional aggregation parameter

Original abstract

Ensembles are ubiquitous in off-policy actor-critic learning, yet their efficacy depends critically on how they are aggregated. Current methods typically rely on static rules or task-specific hyperparameters to balance overestimation bias and variance, leaving the challenge of a truly adaptive approach open. We introduce Adaptive Ensemble Aggregation (AEA), an algorithm that dynamically constructs ensemble-based targets for both critic and actor updates directly from training dynamics. We prove that AEA converges to a unique equilibrium where the aggregation parameter minimizes value estimation error within a defined stability region. Theoretically, we establish that AEA achieves a shrinkage property where the estimation bias vanishes as the total ensemble size grows. Unlike subset-based methods like REDQ, which hit an information bottleneck determined by a fixed variance floor regardless of the ensemble size, AEA exploits the full ensemble to achieve optimal variance reduction (scaling inversely with the total number of models) and maximal Fisher information. Furthermore, we provide a formal guarantee for monotonic policy improvement under this adaptive regime. Extensive evaluations on various continuous control tasks demonstrate that AEA outperforms, on the majority of tasks, state-of-the-art baselines, providing a robust and self-calibrating framework for ensemble-based reinforcement learning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

AEA claims a dynamic aggregation rule for RL ensembles that converges and scales variance reduction with full ensemble size, but the proofs and update details need close checking.

The main point is that this paper introduces Adaptive Ensemble Aggregation for off-policy actor-critics. It builds targets directly from training dynamics instead of static rules or extra hyperparameters, and it claims convergence to a unique equilibrium that minimizes value estimation error along with a shrinkage property where bias drops as the ensemble grows larger. It also promises monotonic policy improvement and better variance scaling than subset methods like REDQ by using the full ensemble for maximal Fisher information.

If the derivations hold, this could cut down on tuning in ensemble-based RL. The experiments reportedly show wins over baselines on most continuous control tasks, which gives it some practical grounding. What stands out as new is the explicit link between the aggregation parameter and training dynamics to achieve inverse variance scaling without an information bottleneck. The paper does a reasonable job framing the problem against existing static or fixed-variance approaches.

On the softer side, the abstract leans heavily on the convergence and stability region claims without the full equations or fixed-point analysis visible here, so it is hard to judge whether the aggregation update stays independent or risks fitting the same data it adapts on. The weakest assumption seems to be that training dynamics always supply clean enough signals to avoid instabilities or task-specific tweaks.

This work is aimed at RL researchers handling ensemble critics and variance-bias tradeoffs. A reader focused on adaptive methods or theoretical guarantees in actor-critics would get value from the ideas and results. It has enough structure, novelty in the adaptive mechanism, and empirical claims to deserve a serious referee rather than a desk reject, though any review would likely press for the detailed derivations and clearer experiment protocols. I would send it to peer review.

Referee Report

2 major / 1 minor

Summary. The paper introduces Adaptive Ensemble Aggregation (AEA) for off-policy actor-critic methods. AEA dynamically constructs ensemble-based targets for critic and actor updates directly from training dynamics. The authors claim to prove convergence to a unique equilibrium where the aggregation parameter minimizes value estimation error within a defined stability region, a shrinkage property in which estimation bias vanishes as ensemble size grows, optimal variance reduction that scales inversely with the total number of models (contrasting with REDQ's information bottleneck), maximal Fisher information, and a formal guarantee of monotonic policy improvement. Experiments on continuous control tasks show AEA outperforming state-of-the-art baselines on the majority of tasks.

Significance. If the theoretical claims hold, the work offers a meaningful advance for ensemble-based RL by providing a self-calibrating aggregation mechanism that avoids task-specific hyperparameters and fully exploits ensemble size for variance reduction. The formal guarantees of convergence, shrinkage, and monotonic improvement, together with the contrast to subset-based methods, represent substantive theoretical contributions that could improve stability in actor-critic learning.

major comments (2)

[Abstract / theoretical analysis] Abstract and theoretical analysis section: the claim that the aggregation parameter converges to a unique equilibrium minimizing value estimation error from training dynamics is load-bearing for the convergence and stability results, yet it is unclear whether this minimization is independent of the value estimates or reduces to a quantity fitted on the same data, which risks circularity in the fixed-point analysis.
[Theoretical analysis] Theoretical analysis section, shrinkage and variance claims: the statement that AEA achieves bias vanishing with growing ensemble size and variance reduction scaling inversely with total models (unlike REDQ's fixed floor) is central to the optimality argument, but the derivation showing how the full ensemble is exploited without introducing instabilities or task-specific choices is not sufficiently detailed to verify independence from the adaptation rule.

minor comments (1)

[Abstract] The abstract refers to 'various continuous control tasks' and 'state-of-the-art baselines' without naming the specific environments or listing the exact baselines and metrics in the summary paragraph.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments and the recommendation for major revision. We address each major comment below with clarifications on the theoretical claims and have revised the manuscript to improve the presentation of the fixed-point analysis and derivations.

Referee: [Abstract / theoretical analysis] Abstract and theoretical analysis section: the claim that the aggregation parameter converges to a unique equilibrium minimizing value estimation error from training dynamics is load-bearing for the convergence and stability results, yet it is unclear whether this minimization is independent of the value estimates or reduces to a quantity fitted on the same data, which risks circularity in the fixed-point analysis.

Authors: We appreciate the referee highlighting this potential circularity concern. The aggregation parameter is adapted by minimizing an objective constructed from the observed temporal-difference residuals and ensemble variance under the current policy's stationary distribution. This objective targets population-level error quantities rather than directly regressing on the instantaneous value estimates, and the fixed-point analysis relies on the contraction property of the aggregated Bellman operator within the defined stability region. The equilibrium condition is therefore independent of any sample-specific fitting loop. We have revised the theoretical analysis section to include an explicit remark and a short proof sketch separating the adaptation dynamics from the value-function fixed point.
Revision: yes
Referee: [Theoretical analysis] Theoretical analysis section, shrinkage and variance claims: the statement that AEA achieves bias vanishing with growing ensemble size and variance reduction scaling inversely with total models (unlike REDQ's fixed floor) is central to the optimality argument, but the derivation showing how the full ensemble is exploited without introducing instabilities or task-specific choices is not sufficiently detailed to verify independence from the adaptation rule.

Authors: We thank the referee for noting the need for greater detail in these derivations. The bias-shrinkage result follows from showing that the aggregated target converges in probability to the true value function as ensemble size N grows, by applying a law-of-large-numbers argument to the uncorrelated critic errors. The variance scales as O(1/N) because the adaptive weights are derived to maximize Fisher information over the entire ensemble without imposing a fixed subset size. Independence from the precise adaptation rule is guaranteed by the stability-region constraint that keeps the overall operator contractive. We have expanded the main-text derivation with additional intermediate steps and moved the full lemmas to the appendix to make the argument verifiable.
Revision: yes
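
The law-of-large-numbers step in this response can be written out for the simplest case. The block below is a sketch under assumptions that go beyond the text: uniform weights rather than adaptive ones, and critic errors $\varepsilon_i$ that are zero-mean, uncorrelated, and share a variance $\sigma^2$. The symbols $\bar{Q}_N$, $Q^{\pi}$, and $\varepsilon_i$ are introduced here for illustration.

```latex
\begin{align}
  \bar{Q}_N &= \frac{1}{N}\sum_{i=1}^{N} Q_i,
  \qquad Q_i = Q^{\pi} + \varepsilon_i,
  \qquad \mathbb{E}[\varepsilon_i] = 0,\;
  \operatorname{Var}(\varepsilon_i) = \sigma^2,\;
  \operatorname{Cov}(\varepsilon_i, \varepsilon_j) = 0 \ (i \neq j), \\
  \operatorname{Var}\bigl(\bar{Q}_N\bigr)
  &= \frac{1}{N^2}\sum_{i=1}^{N}\operatorname{Var}(\varepsilon_i)
   = \frac{\sigma^2}{N}, \\
  \Pr\bigl(\lvert \bar{Q}_N - Q^{\pi} \rvert \ge \delta\bigr)
  &\le \frac{\sigma^2}{N\delta^2} \xrightarrow[N \to \infty]{} 0
  \qquad \text{for every } \delta > 0 \text{ (Chebyshev)}.
\end{align}
```

If the errors instead share a nonzero mean $b$, the same argument gives convergence to $Q^{\pi} + b$ rather than $Q^{\pi}$, so the vanishing-bias claim rests on the zero-mean assumption or on whatever correction the adaptive rule supplies.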

Circularity Check

0 steps flagged

No significant circularity detected

The abstract presents AEA as dynamically constructing targets from training dynamics and proves convergence to an equilibrium minimizing value estimation error, along with shrinkage, variance scaling, and monotonic improvement guarantees. No equations, update rules, or derivation steps are provided in the given text that would allow quoting a specific reduction (such as a fitted parameter being renamed as a prediction or an equilibrium defined circularly by construction). The central claims are framed as independent theoretical results derived from the adaptive mechanism, and the derivation remains self-contained against external benchmarks without load-bearing self-citations or definitional loops exhibited.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

Based solely on the abstract, the central claims rest on the existence of a stability region and the ability of training dynamics to supply an independent signal for adaptation; no free parameters or invented entities are explicitly named.

axioms (2)

domain assumption: Existence of a defined stability region in which the aggregation parameter converges to the error-minimizing equilibrium.
Invoked in the stated convergence proof.
domain assumption: Training dynamics supply sufficient information to construct targets dynamically without external hyperparameters.
Basis for the adaptive mechanism described.

pith-pipeline@v0.9.0 · 5748 in / 1510 out tokens · 43235 ms · 2026-05-19T02:11:05.301883+00:00 · methodology

[1] [1]

Abbas, R

Z. Abbas, R. Zhao, J. Modayil, A. White, and M. C. Machado. Loss of plasticity in continual deep reinforcement learning. In Conference on Lifelong Learning Agents (CoLLAs), 2023

work page 2023

[2] [2]

Agarwal, M

R. Agarwal, M. Schwarzer, P. S. Castro, A. C. Courville, and M. Bellemare. Deep reinforcement learning at the edge of the statistical precipice. Advances in Neural Information Processing Systems (NeurIPS), 2021

work page 2021

[3] [3]

G. An, S. Moon, J.-H. Kim, and H. O. Song. Uncertainty-based offline reinforcement learning with diversified q-ensemble. Advances in Neural Information Processing Systems (NeurIPS), 2021

work page 2021

[4] [4]

Anschel, N

O. Anschel, N. Baram, and N. Shimkin. Averaged-dqn: Variance reduction and stabilization for deep reinforcement learning. In Proceedings of the International Conference on Machine Learning (ICML), 2017

work page 2017

[5] [5]

M. G. Bellemare, W. Dabney, and M. Rowland. Distributional reinforcement learning. MIT Press, 2023

work page 2023

[6] [6]

Bertsekas and J

D. Bertsekas and J. N. Tsitsiklis. Neuro-dynamic programming. Athena Scientific, 1996

work page 1996

[7] [7]

OpenAI Gym

G. Brockman, V . Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba. OpenAI Gym. arXiv preprint arXiv:1606.01540, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016

[8] [8]

Cetin and O

E. Cetin and O. Celiktutan. Learning pessimism for reinforcement learning. In Proceedings of the AAAI Conference on Artificial Intelligence, 2023

work page 2023

[9] [9]

R. Y . Chen, S. Sidor, P. Abbeel, and J. Schulman. Ucb exploration via q-ensembles. arXiv preprint arXiv:1706.01502, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017

[10] [10]

X. Chen, C. Wang, Z. Zhou, and K. W. Ross. Randomized ensembled double q-learning: Learning fast without a model. In International Conference on Learning Representations (ICLR), 2021

work page 2021

[11] [11]

Ciosek, Q

K. Ciosek, Q. Vuong, R. Loftin, and K. Hofmann. Better exploration with optimistic actor critic. Advances in Neural Information Processing Systems (NeurIPS), 2019

work page 2019

[12] [12]

Fujimoto, H

S. Fujimoto, H. Hoof, and D. Meger. Addressing function approximation error in actor-critic methods. In Proceedings of the International Conference on Machine Learning (ICML), 2018

work page 2018

[13] [13]

Haarnoja, H

T. Haarnoja, H. Tang, P. Abbeel, and S. Levine. Reinforcement learning with deep energy-based policies. In Proceedings of the International Conference on Machine Learning (ICML), 2017

work page 2017

[14] [14]

Haarnoja, A

T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In Proceedings of the International Conference on Machine Learning (ICML), 2018

work page 2018

[15] [15]

Soft Actor-Critic Algorithms and Applications

T. Haarnoja, A. Zhou, K. Hartikainen, G. Tucker, S. Ha, J. Tan, V . Kumar, H. Zhu, A. Gupta, P. Abbeel, et al. Soft actor-critic algorithms and applications. arXiv preprint arXiv:1812.05905, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018

[16] [16]

Kingma and J

D. Kingma and J. Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR), 2015. 10

work page 2015

[17] [17]

Kuznetsov, P

A. Kuznetsov, P. Shvechikov, A. Grishin, and D. Vetrov. Controlling overestimation bias with truncated mixture of continuous distributional quantile critics. In Proceedings of the International Conference on Machine Learning (ICML), 2020

work page 2020

[18] [18]

Q. Lan, Y . Pan, A. Fyshe, and M. White. Maxmin q-learning: Controlling the estimation bias of q-learning. In International Conference on Learning Representations (ICLR), 2020

work page 2020

[19] [19]

H. Lee, D. Hwang, D. Kim, H. Kim, J. J. Tai, K. Subramanian, P. R. Wurman, J. Choo, P. Stone, and T. Seno. Simba: Simplicity bias for scaling up parameters in deep reinforcement learning. In International Conference on Learning Representations (ICLR), 2025

work page 2025

[20] [20]

K. Lee, M. Laskin, A. Srinivas, and P. Abbeel. Sunrise: A simple unified framework for ensemble learning in deep reinforcement learning. In Proceedings of the International Conference on Machine Learning (ICML), 2021

work page 2021

[21] [21]

Liang, Y

L. Liang, Y . Xu, S. McAleer, D. Hu, A. Ihler, P. Abbeel, and R. Fox. Reducing variance in temporal- difference value estimation via ensemble of deep networks. In Proceedings of the International Conference on Machine Learning (ICML), 2022

work page 2022

[22] [22]

Lillicrap, J

T. Lillicrap, J. Hunt, A. Pritzel, N. Heess, T. Erez, Y . Tassa, D. Silver, and D. Wierstra. Continuous control with deep reinforcement learning. In International Conference on Learning Representations (ICLR), 2016

work page 2016

[23] [23]

V . Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller. Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013

work page internal anchor Pith review Pith/arXiv arXiv 2013

[24] [24]

V . Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 2015

work page 2015

[25] [25]

Moskovitz, J

T. Moskovitz, J. Parker-Holder, A. Pacchiano, M. Arbel, and M. Jordan. Tactical optimism and pessimism for deep reinforcement learning. Advances in Neural Information Processing Systems (NeurIPS), 2021

work page 2021

[26] [26]

Nauman and M

M. Nauman and M. Cygan. Decoupled policy actor-critic: Bridging pessimism and risk awareness in reinforcement learning. In Proceedings of the AAAI Conference on Artificial Intelligence, 2025

[27]

M. Nauman, M. Bortkiewicz, P. Miłoś, T. Trzcinski, M. Ostaszewski, and M. Cygan. Overestimation, overfitting, and plasticity in actor-critic: the bitter lesson of reinforcement learning. In Proceedings of the International Conference on Machine Learning (ICML), 2024

[28]

M. Nauman, M. Ostaszewski, K. Jankowski, P. Miłoś, and M. Cygan. Bigger, regularized, optimistic: scaling for compute and sample efficient continuous control. Advances in Neural Information Processing Systems (NeurIPS), 2024

[29]

E. Nikishin, M. Schwarzer, P. D’Oro, P.-L. Bacon, and A. Courville. The primacy bias in deep reinforcement learning. In Proceedings of the International Conference on Machine Learning (ICML), 2022

[30]

I. Osband, C. Blundell, A. Pritzel, and B. Van Roy. Deep exploration via bootstrapped dqn. Advances in Neural Information Processing Systems (NeurIPS), 2016

[31]

A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala. PyTorch: An Imperative Style, High-Performance Deep Learning Library. Advances in Neural Information Processing System...

[32]

M. L. Puterman. Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons, 2014

[33]

W. Shang, K. Sohn, D. Almeida, and H. Lee. Understanding and improving convolutional neural networks via concatenated rectified linear units. In Proceedings of the International Conference on Machine Learning (ICML), 2016

[34]

H. Sheikh, M. Phielipp, and L. Boloni. Maximizing ensemble diversity in deep reinforcement learning. In International Conference on Learning Representations (ICLR), 2022

[35]

D. Silver, G. Lever, N. Heess, T. Degris, D. Wierstra, and M. Riedmiller. Deterministic policy gradient algorithms. In Proceedings of the International Conference on Machine Learning (ICML), 2014

[36]

R. S. Sutton and A. G. Barto. Reinforcement learning: An introduction. MIT Press, 2018

[37]

S. Thrun and A. Schwartz. Issues in using function approximation for reinforcement learning. In Proceedings of the Fourth Connectionist Models Summer School, 1993

[38]

E. Todorov, T. Erez, and Y. Tassa. Mujoco: A physics engine for model-based control. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2012

[39]

M. Towers, A. Kwiatkowski, J. Terry, J. U. Balis, G. De Cola, T. Deleu, M. Goulao, A. Kallinteris, M. Krimmel, A. KG, et al. Gymnasium: A standard interface for reinforcement learning environments. arXiv preprint arXiv:2407.17032, 2024

[40]

H. Van Hasselt. Double q-learning. Advances in Neural Information Processing Systems (NeurIPS), 2010

[41]

H. Van Hasselt, A. Guez, and D. Silver. Deep reinforcement learning with double q-learning. In Proceedings of the AAAI Conference on Artificial Intelligence, 2016

[42]

C. J. Watkins and P. Dayan. Q-learning. Machine Learning, 1992

[43]

Y. Wu, X. Chen, C. Wang, Y. Zhang, and K. W. Ross. Aggressive q-learning with ensembles: Achieving both high sample efficiency and high asymptotic performance. Advances in Neural Information Processing Systems (NeurIPS), 2022. Deep Reinforcement Learning Workshop

[44]

B. D. Ziebart. Modeling purposeful adaptive behavior with the principle of maximum causal entropy. Carnegie Mellon University, 2010

A Extended discussion of related work

Reducing overestimation bias with ensembles

Overestimation bias in Q-value estimates can destabilize training and degrade policy performance. To address this issue, various methods have...
