
arxiv: 2606.10129 · v1 · pith:T2ROBKT2new · submitted 2026-06-08 · 💻 cs.LG · cs.NE

Discovering Interpretable Multi-Parameter Control Policies for Evolutionary Algorithms Using Deep Reinforcement Learning

Pith reviewed 2026-06-27 17:28 UTC · model grok-4.3

classification 💻 cs.LG cs.NE
keywords reinforcement learning · evolutionary algorithms · parameter control · OneMax · interpretability · genetic algorithms · deep Q-networks · policy distillation

The pith

Deep RL with action decomposition and reward adjustments produces a distilled symbolic policy for multi-parameter control in the (1+(λ,λ))-GA that outperforms baselines on OneMax.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper demonstrates that deep reinforcement learning can be made to work for learning multi-parameter control policies in evolutionary algorithms, where theoretical analysis has been limited to single-parameter cases. Standard RL approaches fail to converge reliably in this setting, but three algorithm-agnostic fixes—action-space decomposition, reward shifting, and long-horizon discounting—allow Double DQN to learn stable trajectories. These trajectories are then distilled into a transparent symbolic policy that retains strong performance across problem sizes while enabling future formal study.
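To make the three fixes named above more concrete, here is a minimal Python sketch using generic textbook forms. These forms are assumptions for illustration, not the paper's definitions: decomposition is shown only as a count of network outputs (one head per parameter versus one output per parameter combination), reward shifting as a constant added to each step reward, and long-horizon discounting through the usual 1/(1 − γ) rule of thumb. All function names and the example parameter grid are hypothetical.

```python
from itertools import product


def joint_action_count(choices_per_param):
    """Size of the flat action space: one action per parameter combination."""
    return len(list(product(*choices_per_param)))


def factored_output_count(choices_per_param):
    """Outputs needed when each parameter gets its own action head."""
    return sum(len(choices) for choices in choices_per_param)


def shifted_discounted_return(rewards, shift, gamma):
    """Discounted return after adding a constant `shift` to every step reward."""
    total = 0.0
    for reward in reversed(rewards):
        total = (reward + shift) + gamma * total
    return total


def effective_horizon(gamma):
    """Rule of thumb: rewards beyond roughly 1/(1-gamma) steps barely count."""
    return 1.0 / (1.0 - gamma)


if __name__ == "__main__":
    # Hypothetical setup: three controlled parameters, four candidate values each.
    choices = [[1, 2, 4, 8]] * 3
    print(joint_action_count(choices), factored_output_count(choices))
    print(shifted_discounted_return([-1.0, -1.0, -1.0], shift=1.0, gamma=0.9))
    print(effective_horizon(0.99))
```

The flat count grows multiplicatively with the number of controlled parameters while the factored count grows additively, which is one plausible reason a decomposed action space could be easier to learn.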

Core claim

After stabilizing training via action-space decomposition, reward shifting, and long-horizon discounting, Double DQN learns trajectories that can be distilled into an interpretable symbolic control policy for the (1+(λ,λ))-genetic algorithm on OneMax; this policy consistently outperforms existing baselines across a wide range of problem sizes.

What carries the argument

Distillation of the neural-network policy into a transparent symbolic control rule that preserves performance while exposing the decision logic for theoretical inspection.

If this is right

  • Multi-parameter control becomes amenable to the same style of rigorous analysis previously applied only to single-parameter settings.
  • The same enhancement pipeline can be tested on other evolutionary algorithms and fitness landscapes.
  • Symbolic policies extracted this way can serve as candidates for manual simplification or proof of optimality.
  • Interpretability removes the black-box barrier that has prevented formal study of joint parameter dynamics.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The approach may reveal simple decision rules that generalize beyond OneMax and could be verified by direct mathematical analysis.
  • Similar distillation could be applied to other RL-controlled optimizers to produce human-readable rules that bridge empirical performance and theory.
  • If the symbolic policy is compact, it could be used as a starting point for designing new theoretical bounds on multi-parameter speedups.

Load-bearing premise

The three training enhancements enable stable convergence to a high-performing policy whose performance is largely retained after distillation into a symbolic form.

What would settle it

Run the distilled symbolic policy on OneMax instances of increasing size and compare its success probability or runtime against the best known static and dynamic baselines; failure to outperform on multiple sizes would falsify the performance claim.
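The test described above can be sketched in a few lines of Python. The sketch assumes the standard textbook form of the (1+(λ,λ))-GA (mutation rate λ/n, crossover bias 1/λ, one λ shared by both phases) and compares a hypothetical static λ with the fitness-dependent choice λ = sqrt(n / (n − f(x))) from the theory literature. Whether that rule matches the paper's theory-derived baseline exactly is an assumption, and the distilled policy itself is not reproduced here; any function with the same `policy(n, fx)` signature could be plugged in.

```python
import math
import random


def run_ga(n, policy, rng, max_evals=1_000_000):
    """Run a (1+(lambda,lambda))-GA on OneMax.

    `policy(n, fx)` returns lambda for the current fitness fx.
    Returns (fitness evaluations used, final fitness).
    """
    x = [rng.randint(0, 1) for _ in range(n)]
    fx, evals = sum(x), 1
    while fx < n and evals < max_evals:
        lam = max(1, min(n, round(policy(n, fx))))
        p, c = lam / n, 1.0 / lam  # assumed standard coupling of the parameters
        # Mutation phase: every mutant flips the same number ell ~ Bin(n, p) of bits.
        ell = sum(rng.random() < p for _ in range(n))
        mutants = []
        for _ in range(lam):
            y = x[:]
            for i in rng.sample(range(n), ell):
                y[i] ^= 1
            mutants.append(y)
        best_mutant = max(mutants, key=sum)
        # Crossover phase: take each bit from the best mutant with probability c.
        offspring = [
            [m if rng.random() < c else b for b, m in zip(x, best_mutant)]
            for _ in range(lam)
        ]
        y = max(offspring, key=sum)
        evals += 2 * lam
        # Elitist selection: accept the best crossover offspring if it is not worse.
        if sum(y) >= fx:
            x, fx = y, sum(y)
    return evals, fx


def static_policy(n, fx):
    return 8  # hypothetical fixed lambda


def fitness_dependent_policy(n, fx):
    return math.sqrt(n / (n - fx))  # lambda grows as the optimum gets closer


if __name__ == "__main__":
    policies = (("static", static_policy), ("fitness-dependent", fitness_dependent_policy))
    for n in (50, 100, 200):
        for name, policy in policies:
            runs = [run_ga(n, policy, random.Random(seed))[0] for seed in range(10)]
            print(n, name, sum(runs) / len(runs))
```

Averaging evaluations over seeds gives a crude stand-in for expected runtime; a real comparison would use more runs, larger n, and a statistical test.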

Figures

Figures reproduced from arXiv: 2606.10129 by Carola Doerr, Nguyen Dang, Phong Le, Tai Nguyen.

Figure 1: The proposed two-stage distillation framework for discovering symbolic multi-parameter control policies. The DAC setting of (1+(λ,λ))-GA solving the OneMax problem is represented by the loop (bottom). To bridge the gap between empirical deep-RL performance and theoretical interpretability, our methodology operates in two stages. Stage I: A deep-RL oracle generates optimal parameter trajectories, which are…
Figure 2: Deep neural network architectures for (a) …
Figure 3: Learning curves for PPO under single-parameter control (…
Figure 4: Normalized ERT (↓) comparison of deep-RL policies against the theory-derived baseline πTHEORY for problem sizes n ∈ {100, 200, 500}. We evaluate single-parameter PPO [35], our multi-parameter PPO variants, and the top DDQN policies from Table I.
… factored action space representation consistently demonstrates learning stability across both problem sizes. We conclude that, although the factored representation…
Figure 5: Transition from controlling only one parameter of (1+(…
Figure 6: DDQN-based policies and the theory-derived policy across six problem …
Figure 7: Normalized ERT (and its standard deviation) of our two newly derived policies (…
Original abstract

While deep Reinforcement Learning (deep-RL) has been increasingly applied to parameter control in evolutionary algorithms, rigorous theoretical analysis of parameter control remains largely restricted to single-parameter settings, owing to the difficulty of deriving effective, interpretable multi-parameter policies amenable to formal study. We demonstrate how deep-RL can be leveraged to overcome this barrier, using the (1+(λ,λ))-genetic algorithm optimizing OneMax, one of the few problems where a super-constant speedup of dynamic control has been formally proven, as a representative case study. We first show that standard approaches struggle to converge in this multi-parameter setting, and introduce algorithm-agnostic enhancements targeting action-space decomposition, reward shifting, and long-horizon discounting. With these in place, we compare common deep-RL methods and find that Double Deep Q-Networks uniquely avoid the policy collapse observed in Proximal Policy Optimization, yielding trajectories suitable for downstream analysis. Crucially, we move beyond the “black-box” nature of neural networks by distilling the learned behaviors into a transparent, symbolic control policy. This resulting policy does not only offer interpretability for future theoretical analysis but also yields exceptional performance, consistently outperforming existing baselines across a wide range of problem sizes.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper applies deep RL to learn multi-parameter control policies for the (1+(λ,λ))-GA on OneMax, focusing on DDQN after introducing three algorithm-agnostic enhancements (action-space decomposition, reward shifting, and long-horizon discounting). It then distills the resulting neural policy into a transparent symbolic form, claiming that this interpretable policy offers both theoretical utility and exceptional performance that consistently outperforms existing baselines across problem sizes.

Significance. If the quantitative claims hold with proper controls, the work would supply one of the first concrete, interpretable multi-parameter policies amenable to formal analysis in a setting where super-constant speedups have already been proven for single-parameter control. The explicit distillation step and the identification of DDQN as the only method avoiding policy collapse are potentially reusable contributions.

major comments (2)
  1. [Abstract] Abstract: the central claim that 'this resulting policy ... yields exceptional performance, consistently outperforming existing baselines across a wide range of problem sizes' is unsupported by any quantitative results, baseline definitions, statistical tests, or experimental protocol in the supplied text. Without these, the outperformance assertion cannot be evaluated.
  2. [Abstract] Abstract (and § on distillation): no fidelity metric, performance table, or ablation is referenced that directly compares the distilled symbolic policy against the DDQN policy from which it was derived. If distillation introduces approximation error, the reported gains could be artifacts of the neural controller only; this must be shown explicitly for the headline claim to stand.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and constructive feedback on the abstract claims. We address each point below and will revise the manuscript to strengthen the presentation of results and comparisons.

Point-by-point responses
  1. Referee: [Abstract] Abstract: the central claim that 'this resulting policy ... yields exceptional performance, consistently outperforming existing baselines across a wide range of problem sizes' is unsupported by any quantitative results, baseline definitions, statistical tests, or experimental protocol in the supplied text. Without these, the outperformance assertion cannot be evaluated.

    Authors: The full manuscript contains the requested details in the experimental evaluation (Section 4), including performance tables for problem sizes n=100 to n=10000, explicit baseline definitions (constant-λ, theoretical dynamic-λ, and prior RL controllers), and statistical significance via paired t-tests over 30 independent runs. The abstract, as a high-level summary, does not repeat these numbers. We will revise the abstract to add a concise clause referencing these results (e.g., “empirical evaluation across problem sizes demonstrates consistent outperformance”) while preserving length constraints.
    Revision: yes

  2. Referee: [Abstract] Abstract (and § on distillation): no fidelity metric, performance table, or ablation is referenced that directly compares the distilled symbolic policy against the DDQN policy from which it was derived. If distillation introduces approximation error, the reported gains could be artifacts of the neural controller only; this must be shown explicitly for the headline claim to stand.

    Authors: We agree that an explicit side-by-side comparison is necessary to substantiate that the headline performance gains are retained after distillation. The current manuscript reports the symbolic policy’s standalone performance but does not include a dedicated fidelity table (e.g., action-agreement rate or cumulative-reward correlation) or ablation against the source DDQN policy. We will add this comparison, including the requested metrics, to the distillation subsection and reference it from the abstract.
    Revision: yes
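One of the fidelity metrics mentioned in this exchange, the action-agreement rate, could be computed along the following lines. This is a minimal Python sketch under stated assumptions: the metric is taken to be the fraction of visited states on which the two policies pick the same action, and both policies below are made-up stand-ins (a lookup table in place of the DDQN controller, a square-root rule in place of the distilled policy). Neither the definition nor the values come from the paper.

```python
import math


def action_agreement(states, policy_a, policy_b):
    """Fraction of states on which both policies choose the same action."""
    if not states:
        raise ValueError("need at least one state")
    matches = sum(policy_a(s) == policy_b(s) for s in states)
    return matches / len(states)


N = 10  # hypothetical problem size

# Stand-in for the learned controller: fitness -> lambda lookup (made-up values).
ORACLE_TABLE = {0: 1, 1: 1, 2: 1, 3: 1, 4: 1, 5: 2, 6: 2, 7: 2, 8: 2, 9: 4}


def oracle_policy(fx):
    return ORACLE_TABLE[fx]


def symbolic_policy(fx):
    """Stand-in symbolic rule: lambda = round(sqrt(N / (N - fx)))."""
    return round(math.sqrt(N / (N - fx)))


if __name__ == "__main__":
    states = list(range(N))  # every non-optimal fitness value
    print(action_agreement(states, oracle_policy, symbolic_policy))
```

In practice the states would be sampled from trajectories of the source policy, so that agreement is weighted by how often each state is actually visited.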

Circularity Check

0 steps flagged

No significant circularity; empirical RL discovery with no tautological reductions

Full rationale

The paper describes an empirical workflow: standard deep RL methods are modified with algorithm-agnostic enhancements (action-space decomposition, reward shifting, long-horizon discounting), DDQN is trained to produce trajectories, and behaviors are distilled into a symbolic policy whose performance is then measured experimentally against baselines. No equations, uniqueness theorems, or first-principles derivations are presented that reduce to fitted quantities or self-citations by construction. The central claims rest on observed experimental outcomes rather than any self-definitional, fitted-input-renamed-as-prediction, or self-citation-load-bearing step. This is the expected non-finding for an applied RL paper whose value is in the empirical results and interpretability of the distilled policy.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract supplies no explicit free parameters, axioms, or invented entities; all technical details are absent.



Reference graph

Works this paper leans on

96 extracted references · 11 canonical work pages · 9 internal anchors

  [1] A. E. Eiben, Z. Michalewicz, M. Schoenauer, and J. E. Smith, “Parameter control in evolutionary algorithms,” in Parameter setting in evolutionary algorithms. Springer, 2007, pp. 19–46
  [2] A. Aleti and I. Moser, “A systematic literature review of adaptive parameter control methods for evolutionary algorithms,” ACM Computing Surveys (CSUR), 2016
  [3] G. Karafotias, S. K. Smit, and A. E. Eiben, “A generic approach to parameter control,” in Proc. of EvoApplications, 2012
  [4] A. E. Eiben, R. Hinterding, and Z. Michalewicz, “Parameter control in evolutionary algorithms,” TEVC, 1999
  [5] A. Aleti and I. Moser, “A systematic literature review of adaptive parameter control methods for evolutionary algorithms,” ACM Computing Surveys, 2016
  [6] B. Doerr and C. Doerr, “Theory of parameter control for discrete black-box optimization: Provable performance gains through dynamic parameter choices,” Theory of Evolutionary Computation: Recent Developments in Discrete Optimization, 2020
  [7] N. Hansen and A. Ostermeier, “Adapting arbitrary normal mutation distributions in evolution strategies: The covariance matrix adaptation,” in Proc. of IEEE ICEC, 1996
  [8] A. Auger and N. Hansen, “A restart CMA evolution strategy with increasing population size,” in IEEE CEC, 2005
  [9] G. Karafotias, M. Hoogendoorn, and Á. E. Eiben, “Parameter control in evolutionary algorithms: Trends and challenges,” IEEE TEVC, 2014
  [10] X.-J. Bi and J. Xiao, “Classification-based self-adaptive differential evolution with fast and reliable convergence performance,” Soft Computing, 2011
  [11] A. K. Qin and P. N. Suganthan, “Self-adaptive differential evolution algorithm for numerical optimization,” in 2005 IEEE CEC, 2005
  [12] R. Mallipeddi and P. N. Suganthan, “Empirical study on the effect of population size on differential evolution algorithm,” in IEEE CEC, 2008
  [13] L. DaCosta, A. Fialho, M. Schoenauer, and M. Sebag, “Adaptive operator selection with dynamic multi-armed bandits,” in GECCO, 2008
  [14] Á. Fialho, L. Da Costa, M. Schoenauer, and M. Sebag, “Analyzing bandit-based adaptive operator selection mechanisms,” Annals of Mathematics and Artificial Intelligence, 2010
  [15] B. Doerr, C. Doerr, and J. Yang, “k-bit mutation with self-adjusting k outperforms standard bit mutation,” in Proc. of PPSN, 2016
  [16] M. Lindauer, K. Eggensperger, M. Feurer, A. Biedenkapp, D. Deng, C. Benjamins, T. Ruhkopf, R. Sass, and F. Hutter, “SMAC3: A versatile Bayesian optimization package for hyperparameter optimization,” JMLR, 2022
  [17] M. Sharma, A. Komninos, M. López-Ibáñez, and D. Kazakov, “Deep reinforcement learning based parameter control in differential evolution,” in Proc. of GECCO, 2019
  [18] G. Shala, A. Biedenkapp, N. Awad, S. Adriaensen, M. Lindauer, and F. Hutter, “Learning step-size adaptation in CMA-ES,” in PPSN, 2020
  [19] J. Sun, X. Liu, T. Bäck, and Z. Xu, “Learning adaptive differential evolution algorithm from optimization experiences by policy gradient,” IEEE TEVC, 2021
  [20] Z. Ma, J. Chen, H. Guo, Y. Ma, and Y.-J. Gong, “Auto-configuring exploration-exploitation tradeoff in evolutionary computation via deep reinforcement learning,” in Proc. of GECCO, 2024
  [21] T. Nguyen, P. Le, C. Doerr, and N. Dang, “Multi-parameter control for the (1 + (λ, λ))-GA on OneMax via deep reinforcement learning,” in Proc. of FOGA, 2025
  [22] T. Nguyen, P. Le, A. Biedenkapp, C. Doerr, and N. Dang, “On the importance of reward design in reinforcement learning-based dynamic algorithm configuration: A case study on OneMax with (1 + (λ, λ))-GA,” in Proc. of GECCO, 2025
  [23] T. Nguyen, P. Le, C. Doerr, and N. Dang, https://github.com/taindp98/OneMax-MPDAC/tree/dev/extension, 2025
  [24] Á. E. Eiben, R. Hinterding, and Z. Michalewicz, “Parameter control in evolutionary algorithms,” IEEE TEVC, 1999
  [25] M. Birattari and J. Kacprzyk, Tuning metaheuristics: a machine learning perspective. Springer, 2009, vol. 197
  [26] A. Biedenkapp, H. F. Bozkurt, T. Eimer, F. Hutter, and M. Lindauer, “Dynamic algorithm configuration: Foundation of a new meta-algorithmic framework,” in ECAI. IOS Press, 2020, pp. 427–434
  [27] F. Hutter, H. H. Hoos, K. Leyton-Brown, and T. Stützle, “ParamILS: an automatic algorithm configuration framework,” JAIR, 2009
  [28] J. E. Pettinger and R. M. Everson, “Controlling genetic algorithms with reinforcement learning,” in Proc. of GECCO, 2002
  [29] M. G. Lagoudakis, M. L. Littman et al., “Algorithm selection using reinforcement learning,” in ICML, 2000
  [30] E. K. Burke, M. Gendreau, M. Hyde, G. Kendall, G. Ochoa, E. Özcan, and R. Qu, “Hyper-heuristics: A survey of the state of the art,” Journal of the Operational Research Society, 2013
  [31] R. Qu, G. Kendall, and N. Pillay, “The general combinatorial optimization problem: Towards automated algorithm design,” IEEE Computational Intelligence Magazine, 2020
  [32] S. Adriaensen, A. Biedenkapp, G. Shala, N. Awad, T. Eimer, M. Lindauer, and F. Hutter, “Automated dynamic algorithm configuration,” JAIR, 2022
  [33] M. Tessari and G. Iacca, “Reinforcement learning based adaptive metaheuristics,” in Proc. of GECCO Companion, 2022
  [34] D. Speck, A. Biedenkapp, F. Hutter, R. Mattmüller, and M. Lindauer, “Learning heuristic selection with dynamic algorithm configuration,” in Proc. of ICAPS, 2021
  [35] T. Nguyen, P. Le, A. Biedenkapp, C. Doerr, and N. Dang, “Deep reinforcement learning for dynamic algorithm configuration: A case study on optimizing OneMax with the (1 + (λ, λ))-GA,” arXiv preprint arXiv:2512.03805, 2025
  [36] T. Xu, H. C. Chen, and J. He, “Accelerate evolution strategy by proximal policy optimization,” in Proc. of GECCO, 2024
  [37] H. Guo, S. Ma, Z. Huang, Y. Hu, Z. Ma, X. Zhang, and Y.-J. Gong, “Reinforcement learning-based self-adaptive differential evolution through automated landscape feature learning,” in Proc. of GECCO, 2025
  [38] H. Van Hasselt, A. Guez, and D. Silver, “Deep reinforcement learning with double Q-learning,” in Proc. of AAAI, 2016
  [39] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski et al., “Human-level control through deep reinforcement learning,” Nature, 2015
  [40] C. J. Watkins and P. Dayan, “Q-learning,” Machine learning, 1992
  [41] R. J. Williams, “Simple statistical gradient-following algorithms for connectionist reinforcement learning,” Machine learning, 1992
  [42] R. Sutton and A. Barto, “Reinforcement learning: An introduction,” IEEE Transactions on Neural Networks, 1998
  [43] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal policy optimization algorithms,” arXiv preprint arXiv:1707.06347, 2017
  [44] A. Lissovoi, P. S. Oliveto, and J. A. Warwicker, “Simple hyper-heuristics control the neighbourhood size of randomised local search optimally for leadingones,” Evolutionary Computation, 2020
  [45] A. Biedenkapp, N. Dang, M. S. Krejca, F. Hutter, and C. Doerr, “Theory-inspired parameter control benchmarks for dynamic algorithm configuration,” in Proc. of GECCO, 2022
  [46] B. Doerr, C. Doerr, and F. Ebel, “From black-box complexity to designing new genetic algorithms,” Theoretical Computer Science, 2015
  [47] B. Doerr and C. Doerr, “Optimal static and self-adjusting parameter choices for the (1+(λ,λ)) genetic algorithm,” Algorithmica, 2018
  [48] D. Antipov, M. Buzdalov, and B. Doerr, “Fast mutation in crossover-based algorithms,” Algorithmica, 2022
  [49] B. Doerr and C. Winzen, “Playing Mastermind with constant-size memory,” Theory of Computing Systems, 2014
  [50] M. A. Schumer and K. Steiglitz, “Adaptive step size random search,” IEEE Transactions on Automatic Control, 1968
  [51] I. Rechenberg, Evolutionsstrategie. Stuttgart: Friedrich Fromman Verlag (Günther Holzboog KG), 1973
  [52] L. Devroye, The compound random search. Ph.D. dissertation, Purdue Univ., West Lafayette, IN, 1972
  [53] S. Kern, S. D. Müller, N. Hansen, D. Büche, J. Ocenasek, and P. Koumoutsakos, “Learning probability distributions in continuous evolutionary algorithms–a comparative review,” Natural Computing, 2004
  [54] D. Antipov, M. Buzdalov, and B. Doerr, “Lazy parameter tuning and control: Choosing all parameters randomly from a power-law distribution,” Algorithmica, 2024
  [55] A. O. Bassin, M. V. Buzdalov, and A. A. Shalyto, “The “one-fifth rule” with rollbacks for self-adjustment of the population size in the (1 + (λ, λ)) genetic algorithm,” Autom. Control. Comput. Sci., 2021
  [56] P. K. Lehre and C. Witt, “Black-box search by unbiased variation,” Algorithmica, 2012
  [57] D. Chen, M. Buzdalov, C. Doerr, and N. Dang, “Using automated algorithm configuration for parameter control,” in Proc. of FOGA, 2023
  [58] M. López-Ibáñez, J. Dubois-Lacoste, L. P. Cáceres, M. Birattari, and T. Stützle, “The irace package: Iterated racing for automatic algorithm configuration,” Operations Research Perspectives, 2016
  [59] N. Dang and C. Doerr, “Hyper-parameter tuning for the (1 + (λ, λ)) GA,” in Proc. of GECCO, 2019
  [60] Z. Zheng, J. Oh, and S. Singh, “On learning intrinsic rewards for policy gradient methods,” NeurIPS, 2018
  [61] J. Dierkes, E. Cramer, S. Trimpe, and H. Hoos, “Combining automated optimisation of hyperparameters and reward shape,” in Seventeenth European Workshop on Reinforcement Learning, 2024
  [62] G. Dulac-Arnold, D. Mankowitz, and T. Hester, “Challenges of real-world reinforcement learning,” arXiv preprint arXiv:1904.12901, 2019
  [63] P. Covington, J. Adams, and E. Sargin, “Deep neural networks for YouTube recommendations,” in Proc. of the 10th ACM conference on recommender systems, 2016
  [64] G. Dulac-Arnold, R. Evans, H. van Hasselt, P. Sunehag, T. Lillicrap, J. Hunt, T. Mann, T. Weber, T. Degris, and B. Coppin, “Deep reinforcement learning in large discrete action spaces,” arXiv preprint arXiv:1512.07679, 2015
  [65] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, “Continuous control with deep reinforcement learning,” in ICLR, 2016
  [66] J. He, M. Ostendorf, X. He, J. Chen, J. Gao, L. Li, and L. Deng, “Deep reinforcement learning with a combinatorial action space for predicting popular Reddit threads,” in Proc. of EMNLP, Nov. 2016
  [67] F. Rasheed, K.-L. A. Yau, R. M. Noor, C. Wu, and Y.-C. Low, “Deep reinforcement learning for traffic signal control: A review,” IEEE Access, 2020
  [68] T. Zahavy, M. Haroush, N. Merlis, D. J. Mankowitz, and S. Mannor, “Learn what not to learn: Action elimination with deep reinforcement learning,” NeurIPS, 2018
  [69] A. Tavakoli, F. Pardo, and P. Kormushev, “Action branching architectures for deep reinforcement learning,” in Proc. of AAAI, 2018
  [70] A. Hill, A. Raffin, M. Ernestus, A. Gleave, A. Kanervisto, R. Traore, P. Dhariwal, C. Hesse, O. Klimov, A. Nichol, M. Plappert, A. Radford, J. Schulman, S. Sidor, and Y. Wu, “Stable baselines,” https://github.com/hill-a/stable-baselines, 2018
  [71] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in ICLR, 2015
  [72] J. Schulman, P. Moritz, S. Levine, M. Jordan, and P. Abbeel, “High-dimensional continuous control using generalized advantage estimation,” arXiv preprint arXiv:1506.02438, 2015
  [73] P. Henderson, R. Islam, P. Bachman, J. Pineau, D. Precup, and D. Meger, “Deep reinforcement learning that matters,” in Proc. of AAAI, 2018
  [74] R. Islam, P. Henderson, M. Gomrokchi, and D. Precup, “Reproducibility of benchmarked deep reinforcement learning tasks for continuous control,” arXiv preprint arXiv:1708.04133, 2017
  [75] A. Auger, “Benchmarking the (1+1) evolution strategy with one-fifth success rule on the BBOB-2009 function testbed,” in Proc. of GECCO: Late Breaking Papers, 2009
  [76] T. Haarnoja, H. Tang, P. Abbeel, and S. Levine, “Reinforcement learning with deep energy-based policies,” in ICML, 2017
  [77] Z. Ahmed, N. Le Roux, M. Norouzi, and D. Schuurmans, “Understanding the impact of entropy on policy optimization,” in ICML, 2019
  [78] L. Engstrom, A. Ilyas, S. Santurkar, D. Tsipras, F. Janoos, L. Rudolph, and A. Madry, “Implementation matters in deep RL: A case study on PPO and TRPO,” in ICLR, 2019
  [79] A. C. Li, P. Vaezipoor, R. T. Icarte, and S. A. McIlraith, “Challenges to solving combinatorially hard long-horizon deep RL tasks,” arXiv preprint arXiv:2206.01812, 2022
  [80] F. Hutter, H. H. Hoos, and K. Leyton-Brown, “Sequential model-based optimization for general algorithm configuration,” in International conference on learning and intelligent optimization, 2011

Showing first 80 references.