arxiv: 2507.23501 · v2 · submitted 2025-07-31 · 💻 cs.LG · stat.ML

Adaptive Ensemble Aggregation for Actor-Critics

Pith reviewed 2026-05-19 02:11 UTC · model grok-4.3

classification 💻 cs.LG stat.ML
keywords adaptive ensemble aggregation · actor-critic · off-policy reinforcement learning · value estimation · variance reduction · monotonic policy improvement · continuous control

The pith

Adaptive Ensemble Aggregation converges to an equilibrium that minimizes value estimation error, and its estimation bias vanishes as the ensemble grows.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Adaptive Ensemble Aggregation (AEA), which builds the targets for critic and actor updates in off-policy actor-critic learning dynamically, from the training process itself. It replaces static or hyperparameter-heavy aggregation rules with one that adapts on the fly. The central proofs show convergence to a unique equilibrium inside a stability region, at which the aggregation parameter minimizes value estimation error. A shrinkage property is also established: estimation bias vanishes as the total ensemble size grows, and variance reduction scales inversely with the total number of models. The paper further gives a guarantee of monotonic policy improvement. Experiments on continuous control tasks show AEA outperforming baselines on the majority of tasks.

Core claim

AEA dynamically constructs ensemble-based targets for both critic and actor updates directly from training dynamics. We prove that AEA converges to a unique equilibrium where the aggregation parameter minimizes value estimation error within a defined stability region. Theoretically, we establish that AEA achieves a shrinkage property where the estimation bias vanishes as the total ensemble size grows. Unlike subset-based methods, AEA exploits the full ensemble to achieve optimal variance reduction scaling inversely with the total number of models and maximal Fisher information, with a formal guarantee for monotonic policy improvement.

What carries the argument

Adaptive Ensemble Aggregation (AEA), the mechanism that dynamically constructs ensemble targets from training dynamics to adapt the aggregation parameter without fixed rules.
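
As a rough illustration of what an adaptive aggregation parameter can look like, the sketch below shifts the ensemble mean by a scalar multiple of the ensemble spread and nudges that scalar by gradient descent on a squared error. This is not the paper's update rule, which this page does not reproduce. The names `aggregate`, `adapt_kappa`, `kappa`, and `reference` are hypothetical, and `reference` stands in for whatever error signal the training dynamics actually supply.

```python
from statistics import fmean, pstdev


def aggregate(q_values, kappa):
    """Hypothetical aggregation rule: ensemble mean plus kappa times ensemble spread.

    kappa = 0 gives the plain mean; negative kappa is pessimistic,
    positive kappa is optimistic. Assumed form, not taken from the paper.
    """
    return fmean(q_values) + kappa * pstdev(q_values)


def adapt_kappa(kappa, q_values, reference, lr=0.1):
    """One gradient step on 0.5 * (aggregate - reference)**2 with respect to kappa."""
    error = aggregate(q_values, kappa) - reference
    # d(aggregate)/d(kappa) is the ensemble spread.
    return kappa - lr * error * pstdev(q_values)


# Two toy critic estimates for the same state-action pair.
q = [1.0, 3.0]

# For two critics, the population spread is half their gap,
# so kappa = -1 reproduces the minimum over the pair.
pessimistic = aggregate(q, -1.0)

# Adapt kappa toward a stand-in reference value of 1.5.
kappa = 0.0
for _ in range(200):
    kappa = adapt_kappa(kappa, q, reference=1.5)
```

With two critics, `kappa = -1` recovers the familiar minimum over the pair, so a fixed rule is one point on the line along which an adaptive rule is free to move.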

If this is right

  • AEA converges to a unique equilibrium that minimizes value estimation error inside a defined stability region.
  • Estimation bias vanishes as total ensemble size grows.
  • Variance reduction scales inversely with the total number of models because the full ensemble is used, avoiding the fixed variance floor that subset-based methods such as REDQ run into.
  • Maximal Fisher information is extracted from the complete ensemble.
  • Monotonic policy improvement holds under the adaptive aggregation regime.
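
The contrast between full-ensemble and subset-based variance can be checked with a toy simulation. The sketch below assumes independent, zero-mean Gaussian critic errors with unit variance, which real critics trained on shared data will not satisfy exactly, and it uses a plain equal-weight average as a stand-in for AEA's aggregate. The subset estimator takes the minimum over a random pair, in the style of subset-based methods. The function name `simulate` and all parameters are hypothetical.

```python
import random
from statistics import fmean, pvariance


def simulate(n_models, subset_size, trials=4000, sigma=1.0, seed=0):
    """Return (variance of full-ensemble mean, variance of subset minimum).

    Assumes i.i.d. Gaussian critic errors; this is an idealization.
    """
    rng = random.Random(seed)
    full, subset = [], []
    for _ in range(trials):
        errors = [rng.gauss(0.0, sigma) for _ in range(n_models)]
        # Full ensemble: average every critic's error.
        full.append(fmean(errors))
        # Subset rule: minimum over a random subset of fixed size.
        subset.append(min(rng.sample(errors, subset_size)))
    return pvariance(full), pvariance(subset)


full_10, subset_10 = simulate(n_models=10, subset_size=2)
full_40, subset_40 = simulate(n_models=40, subset_size=2)
```

Under these assumptions the full-ensemble variance falls roughly as `sigma**2 / n_models`, while the subset variance depends only on `subset_size` and so does not improve when more models are added.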

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The self-calibrating property could reduce the amount of task-specific tuning required when deploying ensemble actor-critics in new environments.
  • If the shrinkage property holds beyond the tested tasks, larger ensembles would become strictly more useful rather than redundant.
  • The same dynamic-target construction might transfer to other off-policy methods that currently rely on static ensemble rules.

Load-bearing premise

Training dynamics supply enough information for the aggregation parameter to reach its unique equilibrium inside the stability region without introducing instabilities.

What would settle it

Measure value estimation bias while steadily increasing ensemble size; the central claim fails if bias stops shrinking and plateaus beyond some finite size.
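
One way to operationalize this check is to fit the slope of log absolute bias against log ensemble size over the largest ensembles and flag a plateau when that slope is close to zero. The sketch below is a minimal version of that idea. The helper names, the `tail` window, the `slope_tol` threshold, and the two bias series are all made up for illustration and are not measurements from the paper.

```python
import math


def log_log_slope(sizes, abs_biases):
    """Least-squares slope of log|bias| against log(ensemble size).

    Requires strictly positive sizes and biases.
    """
    xs = [math.log(n) for n in sizes]
    ys = [math.log(b) for b in abs_biases]
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    num = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    den = sum((x - x_bar) ** 2 for x in xs)
    return num / den


def plateaus(sizes, abs_biases, tail=3, slope_tol=-0.1):
    """Flag a plateau when the slope over the last `tail` points exceeds slope_tol."""
    return log_log_slope(sizes[-tail:], abs_biases[-tail:]) > slope_tol


# Synthetic series, for illustration only.
sizes = [2, 4, 8, 16, 32]
shrinking = [0.5 / n for n in sizes]          # bias keeps falling like 1/N
floored = [max(0.5 / n, 0.1) for n in sizes]  # bias hits a floor at 0.1

shrinking_flag = plateaus(sizes, shrinking)
floored_flag = plateaus(sizes, floored)
```

In practice the bias estimates are noisy, so a real test would need repeated seeds and a confidence interval on the slope rather than a single threshold.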

Figures

Figures reproduced from arXiv: 2507.23501 by Bahareh Tasdighi, Manuel Haussmann, Melih Kandemir, Nicklas Werge, Yi-Shan Wu.

Figure 1: Trajectories of the directional aggregation parameters
Figure 2: Learning curves for all MuJoCo tasks under the interactive learning regime (Table 2).
Figure 3: Learning curves for all MuJoCo tasks under the sample-efficient learning regime (Table 2).
Figure 4: Trajectories of the directional aggregation parameters
Figure 5: Trajectories of the directional aggregation parameters
Figure 6: Trajectories of the directional aggregation parameters
Figure 7: Trajectories of the directional aggregation parameters
Figure 8: Effect of initializing the critic-side directional aggregation parameter
Figure 9: Effect of initializing the critic-side directional aggregation parameter

Original abstract

Ensembles are ubiquitous in off-policy actor-critic learning, yet their efficacy depends critically on how they are aggregated. Current methods typically rely on static rules or task-specific hyperparameters to balance overestimation bias and variance, leaving the challenge of a truly adaptive approach open. We introduce Adaptive Ensemble Aggregation (AEA), an algorithm that dynamically constructs ensemble-based targets for both critic and actor updates directly from training dynamics. We prove that AEA converges to a unique equilibrium where the aggregation parameter minimizes value estimation error within a defined stability region. Theoretically, we establish that AEA achieves a shrinkage property where the estimation bias vanishes as the total ensemble size grows. Unlike subset-based methods like REDQ, which hit an information bottleneck determined by a fixed variance floor regardless of the ensemble size, AEA exploits the full ensemble to achieve optimal variance reduction (scaling inversely with the total number of models) and maximal Fisher information. Furthermore, we provide a formal guarantee for monotonic policy improvement under this adaptive regime. Extensive evaluations on various continuous control tasks demonstrate that AEA outperforms, on the majority of tasks, state-of-the-art baselines, providing a robust and self-calibrating framework for ensemble-based reinforcement learning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces Adaptive Ensemble Aggregation (AEA) for off-policy actor-critic methods. AEA dynamically constructs ensemble-based targets for critic and actor updates directly from training dynamics. The authors claim to prove convergence to a unique equilibrium where the aggregation parameter minimizes value estimation error within a defined stability region, a shrinkage property in which estimation bias vanishes as ensemble size grows, optimal variance reduction that scales inversely with the total number of models (contrasting with REDQ's information bottleneck), maximal Fisher information, and a formal guarantee of monotonic policy improvement. Experiments on continuous control tasks show AEA outperforming state-of-the-art baselines on the majority of tasks.

Significance. If the theoretical claims hold, the work offers a meaningful advance for ensemble-based RL by providing a self-calibrating aggregation mechanism that avoids task-specific hyperparameters and fully exploits ensemble size for variance reduction. The formal guarantees of convergence, shrinkage, and monotonic improvement, together with the contrast to subset-based methods, represent substantive theoretical contributions that could improve stability in actor-critic learning.

major comments (2)
  1. [Abstract / theoretical analysis] Abstract and theoretical analysis section: the claim that the aggregation parameter converges to a unique equilibrium minimizing value estimation error from training dynamics is load-bearing for the convergence and stability results, yet it is unclear whether this minimization is independent of the value estimates or reduces to a quantity fitted on the same data, which risks circularity in the fixed-point analysis.
  2. [Theoretical analysis] Theoretical analysis section, shrinkage and variance claims: the statement that AEA achieves bias vanishing with growing ensemble size and variance reduction scaling inversely with total models (unlike REDQ's fixed floor) is central to the optimality argument, but the derivation showing how the full ensemble is exploited without introducing instabilities or task-specific choices is not sufficiently detailed to verify independence from the adaptation rule.
minor comments (1)
  1. [Abstract] The abstract refers to 'various continuous control tasks' and 'state-of-the-art baselines' without naming the specific environments or listing the exact baselines and metrics in the summary paragraph.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments and the recommendation for major revision. We address each major comment below with clarifications on the theoretical claims and have revised the manuscript to improve the presentation of the fixed-point analysis and derivations.

Point-by-point responses
  1. Referee: [Abstract / theoretical analysis] Abstract and theoretical analysis section: the claim that the aggregation parameter converges to a unique equilibrium minimizing value estimation error from training dynamics is load-bearing for the convergence and stability results, yet it is unclear whether this minimization is independent of the value estimates or reduces to a quantity fitted on the same data, which risks circularity in the fixed-point analysis.

    Authors: We appreciate the referee highlighting this potential circularity concern. The aggregation parameter is adapted by minimizing an objective constructed from the observed temporal-difference residuals and ensemble variance under the current policy's stationary distribution. This objective targets population-level error quantities rather than directly regressing on the instantaneous value estimates, and the fixed-point analysis relies on the contraction property of the aggregated Bellman operator within the defined stability region. The equilibrium condition is therefore independent of any sample-specific fitting loop. We have revised the theoretical analysis section to include an explicit remark and a short proof sketch separating the adaptation dynamics from the value-function fixed point. revision: yes

  2. Referee: [Theoretical analysis] Theoretical analysis section, shrinkage and variance claims: the statement that AEA achieves bias vanishing with growing ensemble size and variance reduction scaling inversely with total models (unlike REDQ's fixed floor) is central to the optimality argument, but the derivation showing how the full ensemble is exploited without introducing instabilities or task-specific choices is not sufficiently detailed to verify independence from the adaptation rule.

    Authors: We thank the referee for noting the need for greater detail in these derivations. The bias-shrinkage result follows from showing that the aggregated target converges in probability to the true value function as ensemble size N grows, by applying a law-of-large-numbers argument to the uncorrelated critic errors. The variance scales as O(1/N) because the adaptive weights are derived to maximize Fisher information over the entire ensemble without imposing a fixed subset size. Independence from the precise adaptation rule is guaranteed by the stability-region constraint that keeps the overall operator contractive. We have expanded the main-text derivation with additional intermediate steps and moved the full lemmas to the appendix to make the argument verifiable. revision: yes
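
The O(1/N) scaling invoked in this response can be written out in one line, and doing so shows how much rests on the word "uncorrelated". The derivation below is a standard textbook identity, not the paper's lemma. It assumes equal weights and zero-mean critic errors \varepsilon_i with common variance \sigma^2, and the common pairwise correlation \rho is introduced here for illustration only.

```latex
% Uncorrelated critic errors: variance decays as 1/N.
\[
\operatorname{Var}\!\left(\frac{1}{N}\sum_{i=1}^{N}\varepsilon_i\right)
= \frac{1}{N^{2}}\sum_{i=1}^{N}\operatorname{Var}(\varepsilon_i)
= \frac{\sigma^{2}}{N}.
\]
% Common pairwise correlation \rho: a floor of \rho\sigma^2 remains.
\[
\operatorname{Var}\!\left(\frac{1}{N}\sum_{i=1}^{N}\varepsilon_i\right)
= \frac{1}{N^{2}}\Bigl(N\sigma^{2} + N(N-1)\,\rho\,\sigma^{2}\Bigr)
= \frac{\sigma^{2}}{N} + \frac{N-1}{N}\,\rho\,\sigma^{2}
\;\longrightarrow\; \rho\,\sigma^{2}
\quad\text{as } N\to\infty.
\]
```

If critics share replay data and targets, some positive correlation is plausible, in which case the 1/N rate would hold only up to the residual term in the second line.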

Circularity Check

0 steps flagged

No significant circularity detected

Full rationale

The abstract presents AEA as dynamically constructing targets from training dynamics and proves convergence to an equilibrium minimizing value estimation error, along with shrinkage, variance scaling, and monotonic improvement guarantees. No equations, update rules, or derivation steps are provided in the given text that would allow quoting a specific reduction (such as a fitted parameter being renamed as a prediction or an equilibrium defined circularly by construction). The central claims are framed as independent theoretical results derived from the adaptive mechanism, and the derivation remains self-contained against external benchmarks without load-bearing self-citations or definitional loops exhibited.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

Based solely on the abstract, the central claims rest on the existence of a stability region and the ability of training dynamics to supply an independent signal for adaptation; no free parameters or invented entities are explicitly named.

axioms (2)
  • domain assumption: Existence of a defined stability region in which the aggregation parameter converges to the error-minimizing equilibrium
    Invoked in the stated convergence proof.
  • domain assumption: Training dynamics supply sufficient information to construct targets dynamically without external hyperparameters
    Basis for the adaptive mechanism described.

pith-pipeline@v0.9.0 · 5748 in / 1510 out tokens · 43235 ms · 2026-05-19T02:11:05.301883+00:00 · methodology

