Replacing Tunable Parameters in Weather and Climate Models with State-Dependent Functions using Reinforcement Learning

arxiv: 2601.04268 · v2 · submitted 2026-01-07 · 💻 cs.LG · physics.ao-ph

Pritthijit Nath, Sebastian Schemm, Henry Moss, Peter Haynes, Emily Shuckburgh, Mark J. Webb

Pith reviewed 2026-05-16 16:30 UTC · model grok-4.3

keywords reinforcement learning · parametrization · climate models · weather models · state-dependent functions · online learning · idealized testbeds · energy balance model

The pith

Reinforcement learning learns state-dependent corrections to replace fixed tunable parameters in idealized weather and climate models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper demonstrates a reinforcement learning framework that updates parametrization components online as functions of the evolving model state rather than relying on fixed coefficients tuned offline. It tests this approach across three idealized setups: a simple climate bias correction, a radiative-convective equilibrium model, and a zonal-mean energy balance model, comparing single-agent and federated multi-agent configurations. Several RL algorithms, particularly TQC, DDPG, and TD3, produce stable convergence and lower area-weighted RMSE than static baselines, with the resulting adjustments reducing meridional temperature biases and vertical profile errors in physically interpretable ways. If the gains hold beyond these testbeds, the method supplies a scalable route to online adaptation inside numerical models that could shrink persistent structural biases.
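
The contrast between a fixed coefficient and a state-dependent one can be illustrated with a toy zero-dimensional energy balance model. This is a minimal sketch under stated assumptions, not the paper's setup: the constants, the function names, and the `tanh` rule standing in for a learned policy are all hypothetical and chosen only for illustration.

```python
import math

S0 = 1365.2           # solar constant, W m^-2 (illustrative value)
ALBEDO = 0.3          # planetary albedo (illustrative value)
HEAT_CAP = 4.0e8      # column heat capacity, J m^-2 K^-1 (illustrative value)
DT = 86400.0 * 30.0   # time step of 30 days, in seconds


def fixed_params(temp_c):
    """Static tuning: the OLR coefficients ignore the model state."""
    return 210.0, 2.0  # A in W m^-2, B in W m^-2 K^-1


def state_dependent_params(temp_c):
    """Hypothetical stand-in for a learned policy: A varies with temperature."""
    a = 210.0 + 0.5 * math.tanh((temp_c - 15.0) / 5.0)
    return a, 2.0


def integrate(param_fn, temp_c=0.0, n_steps=600):
    """Forward-Euler steps of C dT/dt = S0 (1 - albedo) / 4 - (A + B T)."""
    absorbed = S0 * (1.0 - ALBEDO) / 4.0
    for _ in range(n_steps):
        a, b = param_fn(temp_c)      # parameters are queried at every step
        olr = a + b * temp_c         # linearised outgoing longwave radiation
        temp_c += DT * (absorbed - olr) / HEAT_CAP
    return temp_c
```

Calling `integrate(fixed_params)` and `integrate(state_dependent_params)` runs the same model with the two kinds of parameter; the only change is that the second function reads the current temperature before returning its coefficients.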

Core claim

Reinforcement learning agents can learn skilful state-dependent parametrization components in idealized settings, with TQC, DDPG and TD3 showing highest skill and stable convergence; single-agent control outperforms static tuning in the EBM especially in tropical and mid-latitude bands, while federated multi-agent DDPG with frequent aggregation yields the lowest area-weighted RMSE, and the learned corrections prove physically meaningful by modulating radiative parameters, lapse rates and heating increments to limit drift.

What carries the argument

Reinforcement learning agents that map current model state to updates of parametrization parameters, trained to minimize diagnostics such as area-weighted RMSE in single-agent and federated multi-agent settings.
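
As a rough illustration of the diagnostic named above, the sketch below computes an area-weighted RMSE over a zonal-mean temperature profile. It assumes cosine-of-latitude weights, which is a common convention; the paper's exact weighting and reward definition are not given on this page, and the function name is hypothetical.

```python
import numpy as np


def area_weighted_rmse(model_temp, obs_temp, lat_deg):
    """RMSE over latitude with cos(latitude) area weights (assumed convention)."""
    weights = np.cos(np.deg2rad(np.asarray(lat_deg, dtype=float)))
    weights = weights / weights.sum()   # normalise so the weights sum to 1
    sq_err = (np.asarray(model_temp, dtype=float) - np.asarray(obs_temp, dtype=float)) ** 2
    return float(np.sqrt(np.sum(weights * sq_err)))
```

One plausible use, again an assumption, is to return the negative of this value as the per-step reward, so that an agent maximising return minimises the weighted error. With these weights an error near the poles counts for less than the same error in the tropics.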

If this is right

State-dependent corrections reduce meridional temperature biases more effectively than fixed parameters in the energy-balance model.
Federated multi-agent RL enables specialized regional control and faster convergence than a single global agent.
The learned corrections modify lapse rates and heating increments in ways that match observed vertical and meridional error patterns.
Online learning replaces offline tuning of fixed coefficients, potentially allowing models to adapt during integration.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the approach scales, models could continuously adjust sub-grid schemes in response to changing large-scale conditions such as altered circulation regimes.
Multi-agent federated training may offer a practical route for regional specialization without separate offline tuning campaigns for each domain.
The same RL machinery could be tested on additional parametrization targets such as cloud microphysics or boundary-layer mixing once stability issues are addressed.

Load-bearing premise

Performance improvements seen in the three simple idealized testbeds will carry over to full-complexity weather and climate models without introducing numerical instabilities or unphysical drift.

What would settle it

Apply the trained RL policies inside a full global climate model and check whether area-weighted RMSE decreases while long-term temperature and energy drifts remain within acceptable bounds and no new instabilities appear.

Figures

Figures reproduced from arXiv: 2601.04268 by Emily Shuckburgh, Henry Moss, Mark J. Webb, Peter Haynes, Pritthijit Nath, Sebastian Schemm.

**Figure 1.** ebm-v0 (left) and ebm-v1 (right) single-agent setup. In ebm-v1, the global agent observes the full zonal-mean temperature profile and outputs latitude-dependent OLR parameters {Aϕ, Bϕ} as well as independent ones {α0, α2, D}. Loss from observations is computed over all 96 latitudes.

**Figure 2.** ebm-v2 multi-agent ensemble with FedRL. Agents operate on assigned regions with local rewards while receiving the global profile as input. Periodic aggregation every K episodes synchronises policy weights across n agents.

**Figure 3.** Schematics for GCM-like ebm-v3. FedRL agents observe local temperature slices only (unlike ebm-v2) and optimise local rewards. Local rewards are computed over regions for each agent. Policy weights are aggregated every K episodes via Flower for FedRL and also contribute towards a global non-local policy. {α0, α2, D} are averaged over all n sub-processes.

**Figure 4.** Developmental progression of RL algorithms, evaluated in this …

**Figure 5.** Schematic of the experimental workflow for the single agent climateRL experiments (e.g., …

**Figure 6.** Pipeline for the ebm-v1/2/3 experiments. The process begins with configuring the Budyko–Sellers EBM in either single-agent (ebm-v1) or spatially decomposed multi-agent forms (ebm-v2, ebm-v3) using two (a2) or six (a6) regions. Agents are trained with one of three RL algorithms (DDPG, TD3, TQC) under FedRL coordination schemes fed05, fed10, or nofed. In multi-agent settings, policies are periodically aggreg…

**Figure 7.** Training curves across 10 seeds for SCBC environments (…

**Figure 8.** Heating increment dynamics for scbc-v0-optim-L-60k. Top panels: temperature evolution compared to the observed temperature (321.75 K). Bottom panels: normalised heating increments applied by the RL agent. Shaded regions denote ±1.96 standard deviation (95%). Black dashed line indicates the SCBC model with bias correction as reference. The reference SCBC model stabilised around 358 K driven by the cancellatio…

**Figure 9.** Training curves for RCE environments (rce-v0, rce17-v0, rce17-v1) under the optim-L-10k configuration. Episodic returns (each episode = 500 steps) are plotted on a log scale. Shaded regions denote ±1.96 standard deviation (95% confidence intervals). Threshold values are mentioned in …

**Figure 10.** Mean absolute temperature error (°C) at 100 hPa, 200 hPa, and 1000 hPa for the optim-L-10k configuration. White horizontal bars with a cross indicate the best-performing seed for each algorithm. Error bars represent 95% confidence intervals over 10 seeds. The zero error for the vanilla climlab model at 1000 hPa in rce-v0 is with a constant lapse rate set at 6.5 (unlike MALR in rce17-v0 and MALR with water…

**Figure 11.** Top: Temperature trajectories at 100 hPa, 200 hPa, and 1000 hPa for TQC in …

**Figure 12.** Training curves for single-agent EBM environments (…

**Figure 13.** areaWRMSE of zonal mean temperatures across six latitude bands for three experiments under …

**Figure 14.** Area-weighted zonal mean temperature bias averaged with 95% spreads over 10 seeds across six latitude …

**Figure 15.** Temperature trajectories (top) and normalised actions (bottom) for parameters …

**Figure 16.** Episodic return curves (log-scaled) with 95% CI spreads over 10 seeds for three RL algorithms: TQC, …

**Figure 17.** Comparison of zonal skill achieved by DDPG under FedRL coordination in …

**Figure 18.** Comparison of global policy zonal skill achieved by DDPG under FedRL coordination in …
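
The periodic aggregation mentioned in the captions of Figures 2 and 3 (policy weights synchronised every K episodes) can be sketched as plain equal-weight averaging. This is an illustrative assumption: the paper's actual combination rule is not stated on this page, the sketch does not use the Flower API, and the function names and dictionary layout are hypothetical.

```python
import numpy as np


def federated_average(agent_weights):
    """Average each named parameter array across agents with equal weights."""
    names = agent_weights[0].keys()
    return {name: np.mean([w[name] for w in agent_weights], axis=0) for name in names}


def maybe_aggregate(agent_weights, episode, k):
    """Synchronise all agents every k episodes; otherwise return them unchanged."""
    if episode % k != 0:
        return agent_weights
    avg = federated_average(agent_weights)
    # Every agent receives its own copy of the averaged parameters.
    return [{name: value.copy() for name, value in avg.items()} for _ in agent_weights]
```

Between aggregation points each agent keeps training on its own region, so a smaller `k` trades regional specialisation for more frequent sharing.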

Original abstract

Weather and climate models rely on parametrisations to represent unresolved sub-grid processes. Traditional schemes rely on fixed coefficients that are weakly constrained and tuned offline, contributing to persistent biases that limit their ability to adapt to underlying physics. This study presents a framework that learns components of parametrisation schemes online as a function of the evolving model state using reinforcement learning (RL) and evaluates RL-driven parameter updates across idealised testbeds spanning a simple climate bias correction (SCBC), a radiative-convective equilibrium (RCE), and a zonal mean energy balance model (EBM) with single-agent and federated multi-agent settings. Across nine RL algorithms, Truncated Quantile Critics (TQC), Deep Deterministic Policy Gradient (DDPG), and Twin Delayed DDPG (TD3) achieved the highest skill and stable convergence, with performance assessed against a static baseline using area-weighted RMSE, temperature and pressure-level diagnostics. For the EBM, single-agent RL outperformed static parameter tuning with the strongest gains in tropical and mid-latitude bands, while federated RL on multi-agent setups enabled specialised control and faster convergence, with a six-agent DDPG configuration using frequent aggregation yielding the lowest area-weighted RMSE across the tropics and mid-latitudes. The learnt corrections were also physically meaningful as agents modulated EBM radiative parameters to reduce meridional biases, adjusted RCE lapse rates to match vertical temperature errors, and stabilised heating increments to limit drift. Overall, results show that RL can learn skilful state-dependent parametrisation components in idealised settings, offering a scalable pathway for online learning within numerical models and a starting point for evaluation in weather and climate models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

RL can learn state-dependent parametrizations that improve on static tuning in simple climate models, but the work stops at idealized cases. The new element is putting RL agents inside the model loop so they output adjustments to parametrization coefficients based on the current state, rather than using fixed values tuned offline. They run this on three idealized setups: a simple climate bias correction scheme, radiative-convective equilibrium, and a zonal energy balance model. Single-agent and federated multi-agent versions are tested, with TQC, DDPG, and TD3 beating a static baseline on area-weighted RMSE plus some temperature and pressure diagnostics. The adjustments also look physically sensible, such as reducing meridional biases or stabilizing heating rates.

The paper does a reasonable job showing that the approach is feasible and that the learned changes are not just random. The multi-agent federated setup for quicker convergence is a useful detail. The main limitation is that everything stays inside these simplified testbeds. No results appear from a full GCM or even a more complex model, so it is still an open question whether the method will scale without causing numerical problems or unphysical drift. The abstract also gives few numbers on the size of the gains or training details, which makes it harder to judge robustness.

This paper is for climate modelers interested in online adaptation of sub-grid schemes and for people working on RL for scientific control problems. A reader looking for proof-of-concept ideas on state-dependent parametrizations would pick up some practical pointers, though they would want the full methods section for any follow-up work. It deserves peer review because the core framework is new enough and the idealized results are coherent enough to benefit from expert feedback on scaling and stability.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes replacing fixed tunable parameters in weather and climate model parametrizations with state-dependent functions learned online via reinforcement learning. It evaluates nine RL algorithms across three idealized testbeds (SCBC, RCE, EBM) in single-agent and federated multi-agent configurations, reporting that TQC, DDPG, and TD3 outperform a static baseline on area-weighted RMSE and selected physical diagnostics, with the learned adjustments described as physically interpretable.

Significance. If the results hold under more complete reporting, the work provides a proof-of-concept that RL can produce skilful state-dependent parametrization components in controlled idealized settings. This offers a potential pathway for online adaptive tuning that could reduce persistent model biases more dynamically than offline calibration. Credit is due for the multi-algorithm comparison, the inclusion of both single- and multi-agent setups, and the emphasis on physical interpretability of the corrections.

major comments (2)

[Abstract] The central claim that TQC, DDPG, and TD3 achieve the highest skill rests on qualitative statements of outperformance; no numerical RMSE values, effect sizes, error bars, or statistical significance tests versus the static baseline are supplied, leaving the magnitude and reliability of the reported gains difficult to assess.

[Methods/Results] Training details: Full training specifications (episode counts, learning rates, network architectures, discount factors, and convergence criteria) are not provided for the nine algorithms or the federated aggregation protocol; without these, reproducibility and the extent to which performance depends on standard RL hyperparameter tuning cannot be evaluated.

minor comments (1)

[EBM experiments] The description of the EBM multi-agent setup would benefit from an explicit diagram or pseudocode showing how local agents interact with the global state and how aggregation occurs.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive review and for recognizing the potential of our RL framework for learning state-dependent parametrizations in idealized climate models. We address each major comment below and have revised the manuscript to strengthen the presentation of results and ensure reproducibility.

Referee: [Abstract] The central claim that TQC, DDPG, and TD3 achieve the highest skill rests on qualitative statements of outperformance; no numerical RMSE values, effect sizes, error bars, or statistical significance tests versus the static baseline are supplied, leaving the magnitude and reliability of the reported gains difficult to assess.

Authors: We agree that quantitative metrics are necessary to substantiate the claims of outperformance. In the revised manuscript we have updated the abstract to report specific area-weighted RMSE values for TQC, DDPG and TD3 versus the static baseline on each testbed, together with relative improvements (expressed as percentages), standard deviations across multiple random seeds, and p-values from paired statistical tests confirming significance of the gains. Corresponding numerical tables and error-bar plots have also been added to the Results section.

Revision: yes

Referee: [Methods/Results] Training details: Full training specifications (episode counts, learning rates, network architectures, discount factors, and convergence criteria) are not provided for the nine algorithms or the federated aggregation protocol; without these, reproducibility and the extent to which performance depends on standard RL hyperparameter tuning cannot be evaluated.

Authors: We acknowledge that complete hyperparameter specifications are essential for reproducibility. The revised manuscript now contains an expanded Methods section and a supplementary table that lists, for all nine algorithms: episode counts, learning rates, network architectures (layer sizes and activations), discount factors, and convergence criteria. The federated aggregation protocol is described in detail, including aggregation frequency, number of agents, and model-combination rules. We have also added a brief sensitivity analysis showing that the ranking of the top algorithms remains stable across modest hyperparameter variations.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

The paper reports empirical results from training RL agents (TQC, DDPG, TD3) on trajectories from three idealized testbeds (SCBC, RCE, EBM) and compares them to a static baseline via area-weighted RMSE and physical diagnostics. No algebraic derivation, self-definitional equations, or fitted-input-renamed-as-prediction steps are present. Performance claims rest on observed training outcomes rather than reduction to pre-specified constants or self-citation chains. Standard RL hyperparameter tuning exists but is external to the central claim and does not force the reported skill gains by construction.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

Abstract supplies limited technical detail; main unstated premises concern the adequacy of the three idealized testbeds and the transferability of learned policies.

free parameters (1)

RL hyperparameters (learning rate, discount factor, network sizes)
Standard RL tuning knobs that affect convergence and final policy quality but are not enumerated in the abstract.

axioms (1)

Domain assumption: Idealized testbeds (SCBC, RCE, EBM) capture the essential dynamics needed to evaluate state-dependent parametrizations for real models.
Invoked when claiming the approach offers a scalable pathway.

Reference graph

Works this paper leans on

59 extracted references · 59 canonical work pages

[1] N., 1969: The predictability of a flow which possesses many scales of motion. doi: 10.1111/j.2153-3490.1969.tb00466.x. https://onlinelibrary.wiley.com/doi/abs/10.1111/j.2153-3490.1969.tb00466.x (accessed 6 April 2025).

Sellers WD (1969) A Global Climatic Model Based on the Energy Balance of the Earth-Atmosphere System. Journal of Applied Meteorology and Climatology 8(3), 392–400. issn: 1520-0450. https://journals...

[2] doi: 10.1016/B978-0-443-15754-7.00073-0. Available at http://arxiv.org/abs/2403.05175 (accessed 14 December 2025).

Appendix A. Additional Methods

A.1. RL Algorithm Summaries

Table A.1: Four point summaries of different RL algorithms used

Truncated Quantile Critics (TQC) (Kuznetsov et al. 2020)
1. Off-policy actor-critic algorithm that builds on SAC with a distributional critic using quantile regression.
2. Models the full distribution of returns $Z(s, a)$ as quantiles $\{\tau_i\}$, capturing uncertainty and reducing bias.
3. Discards the top $k$ quantiles before computing target values to avoid overestimation from outlier returns.
4. Provides a smooth and robust training signal for distributional value estimation by minimising the quantile Huber loss $L_\tau = \frac{1}{N} \sum_{i=1}^{N} \rho_\kappa(\tau_i - y)$, where $\tau_i$ is the predicted $i$-th quantile, $y$ is the target return, $\rho_\kappa$ denotes the Huber loss with threshold $\kappa$, and $N$ is the number of quantile estimates used in training.

Action Value Gradient (AVG) (Vasan et al. 2024)
1. ...
2. The actor minimises an entropy-augmented objective $L_{\text{actor}} = \alpha \log \pi(a|s) - Q(s, a)$, where $a \sim \tanh(\mathcal{N}(\mu_\theta(s), \sigma_\theta(s)))$. This promotes both high-return and high-entropy policies by penalising confident low-reward actions.
3. The critic is updated using TD learning with scaled errors $\delta = \frac{r + \gamma V(s') - Q(s, a)}{\hat{\sigma}_\delta}$, where $\hat{\sigma}_\delta$ is a running estimate of $\mathrm{std}(\delta)$. Normalising TD errors ensures stable critic gradients even in the presence of large reward magnitudes or early-stage noise.
4. Supports robust learning in continuous control by incorporating tanh-based action squashing, TD error scaling, and optional clipping of actions to respect environment bounds.

REINFORCE (Williams 1992)
1. On-policy Monte Carlo algorithm that performs updates only at the end of full trajectori...
2. Uses a stochastic policy $\pi_\theta(a|s)$ to generate actions and samples episodes from the environment.
3. Optimises the expected return via the score function (log-derivative) estimator $\nabla_\theta J(\theta) = \mathbb{E}[\nabla_\theta \log \pi_\theta(\tau) R(\tau)]$.
4. Suffers from high variance and slow convergence, often mitigated by introducing baselines (advantage estimates) or reward normalisation.

Deterministic Policy Gradient (DPG) (Silver, Lever et al. 2014)
1. On-policy actor-critic algorithm where the actor uses a deterministic policy $\mu_\theta(s)$, suited to continuous action spaces.
2. Critic is trained using TD learning to estimate the Q-function, while actor update uses gradient chain rule over Q.
3. Gradient of the objective is $\nabla_\theta J(\theta) = \mathbb{E}_{s \sim \rho^\mu}[\nabla_\theta \mu_\theta(s) \nabla_a Q^\mu(s, a)|_{a = \mu_\theta(s)}]$.
4. Lacks target networks, making the learning process less stable and more sensitive to hyperparameters.

Deep Deterministic Policy Gradient (DDPG) (Lillicrap et al. 2019)
1. Off-policy actor-critic algorithm that extends DPG using deep function approximators and additional stabilisation mechanisms.
2. Introduces experience replay and target networks with soft target updates to decorrelate samples and improve training stability.
3. Actor and critic networks are both updated similar to DPG.
4. Overestimation bias and sensitivity to exploration noise often limit performance unless mitigated by design changes (e.g. TD3).

Twin Delayed DDPG (TD3) (Fujimoto et al. 2018)
1. Off-policy actor-critic method designed to reduce the overestimation bias observed in DDPG.
2. Uses a double critic architecture where the minimum of two Q-value estimates is used for critic updates.
3. Actor is updated less frequently than the critics, and target networks are softly updated to reduce update variance.
4. Injects temporally correlated Gaussian noise into the target actions to promote exploration in continuous action spaces.

Trust Region Policy Optimisation (TRPO) (Schulman, Levine et al. 2015)
1. On-policy stochastic actor-critic algorithm with a theoretical guarantee of...
2. Formulates a constrained optimisation problem using a KL-divergence bound to control step sizes and prevent policy collapse.
3. Uses advantage-weighted surrogate objectives with a linear approximation and solves the constraint via conjugate gradient.
4. Highly stable but computationally expensive due to second-order updates and line search in large parameter spaces.

Proximal Policy Optimisation (PPO) (Schulman, Wolski et al. 2017)
1. On-policy stochastic actor-critic algorithm designed as a first-order alternative to TRPO with comparable stability.
2. Uses a clipped surrogate objective to prevent excessively large policy updates: $L^{\text{CLIP}}(\theta) = \mathbb{E}\left[\min\left(r_t(\theta) \hat{A}_t, \mathrm{clip}(r_t(\theta), 1 - \epsilon, 1 + \epsilon) \hat{A}_t\right)\right]$, where $r_t(\theta) = \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{\text{old}}}(a_t|s_t)}$ and $\hat{A}_t$ is an estimator of the advantage function.
3. Leverages Generalised Advantage Estimation (GAE) to reduce variance of the policy gradient with tunable bias–variance trade-off.
4. Efficient, robust and widely used in practice due to its balance of ease of implementation and empirical performance.

Soft Actor-Critic (SAC) (Haarnoja et al. 2018)
1. Off-policy actor-critic method for continuous control that augments the reward with an entropy term: $J(\pi) = \sum_t \mathbb{E}_{(s_t, a_t) \sim \rho_\pi}[r(s_t, a_t) + \alpha \mathcal{H}(\pi(\cdot|s_t))]$, where $\mathcal{H}$ is the entropy and $\alpha$ is a temperatu...
2. Encourages exploration by optimising for both reward and entropy, using a stochastic policy $\pi(a|s)$.
3. Maintains two Q-networks and uses the minimum value to reduce overestimation bias, with delayed soft target updates.
4. Features automatic entropy tuning, high sample efficiency, and strong empirical performance on continuous control benchmarks.

A.2. RL Algorithm Pseudocodes

A.2.1. REINFORCE

Algorithm 1 REINFORCE. 1: Input: Gy...
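
The truncation step in the TQC summary (discarding the top k quantiles before forming target values) can be sketched as below. This is a simplified illustration, not the published algorithm in full: it pools the quantiles of all critics, drops a single global count `drop_top_k`, and omits the entropy term of the target. The function name and argument layout are hypothetical.

```python
import numpy as np


def truncated_quantile_target(critic_quantiles, drop_top_k, reward, gamma):
    """Pool quantiles from all critics, drop the largest ones, form TD targets."""
    pooled = np.sort(np.concatenate(critic_quantiles))   # ascending order
    kept = pooled[: len(pooled) - drop_top_k]            # discard the top-k quantiles
    return reward + gamma * kept                         # one target per kept quantile
```

Dropping the largest pooled quantiles lowers the mean of the target distribution, which is how the truncation counteracts overestimation from outlier returns.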
Figure B.1: Training curves across 10 seeds for SCBC environments (v0, v1, v2) across all twelve tuning configurations (scbc-v0/1/2-optim-L-60k reproduced for easy reference). Episodic returns are plotted on a log scale. Shaded regions denote ±1.96 standard deviation (95% confidence intervals). Threshold values are mentioned in Table 3. [Figure omitted. Panels include scbc-v1-optim-L, scbc-v2-optim-L, scbc-v0/1/2-optim-L-60k, scbc-v0/1/2-homo-64L and scbc-v0/1/2-homo-64L-60k; axes: # Steps (0 to 60K) against Episodic Return.]

B.1.2. RCE Environment ...

Figure B.2: Training curves for RCE environments (rce-v0, rce17-v0, rce17-v1) across all twelve tuning configurations (rce-v0/rce17-v0/1-optim-L-10k reproduced for easy reference). Episodic returns are plotted on a log scale. Shaded regions denote ±1.96 standard deviation (95% confidence intervals). Threshold values are mentioned in Table 3. [Figure omitted. Panels include rce17-v0-optim-L, rce17-v1-optim-L, rce-v0/rce17-v0/rce17-v1-optim-L-10k, rce-v0/rce17-v0/rce17-v1-homo-64L and rce-v0/rce17-v0/rce17-v1-homo-64L-10k; axes: # Steps (2K to 10K) against Episodic Return.]

Figure B.3: Episodic return curves (log-scaled) with 95% spreads over 10 seeds for the top-3 RL algorithms across eight single-agent EBM configurations (ebm-v0/1-optim-L-20k reproduced for easy reference). Threshold values are mentioned in Table 3. [Figure omitted. Panels include ebm-v1-optim-L, ebm-v0/1-optim-L-20k, ebm-v0/1-homo-64L and ebm-v0/1-homo-64L-20k; axes: # Steps (0 to 20K) against Episodic Return.]

B.2. Multi-agent RL

B.2.1. FedRL Skill Metrics

(a) DDPG: climlab, fed05, fed10, nofed; ebm-v1 Mean±St...

[1] [1]

N., 1969: The predictability of a flow which possesses many scales of motion

doi: 10.1111/j.2153-3490.1969.tb00466.x. https://onlinelibrary.wiley.com/doi/abs/10.1111/j.2153-3490.1969.tb00466.x (accessed 6 April 2025). Sellers WD(1969) A Global Climatic Model Based on the Energy Balance of the Earth-Atmosphere System. en, 6 April8(3), Section: Journal of Applied Meteorology and Climatology, 392–400.issn: 1520-0450. https://journals...

work page doi:10.1111/j.2153-3490.1969.tb00466.x 1969

[2] [2]

Available at http://arxiv.org/abs/2403.05175 (accessed 14 December 2025)

doi: 10.1016/B978-0-443-15754-7.00073-0. Available at http://arxiv.org/abs/2403.05175 (accessed 14 December 2025). Appendix A. Additional Methods A.1. RL Algorithm Summaries 43 Algorithm Properties Truncated Quantile Critics (TQC) (Kuznetsov et al. 2020)1. Off-policy actor-critic algorithm that builds on SAC with a distributional critic using quantile regression

work page doi:10.1016/b978-0-443-15754-7.00073-0 2025

[3] [3]

Models the full distribution of returnsZ(s, a) as quantiles{τi}, capturing uncertainty and reducing bias

work page

[4] [4]

Discards the topk quantiles before computing target values to avoid overestimation from outlier returns

work page

[5] [5]

Action Value Gradient (AVG) (Vasan et al

Provides a smooth and robust training signal for distributional value estimation by minimising the quantile Huber loss: Lτ = 1 N NX i=1 ρκ(τi −y) where τi is the predictedi-th quantile,y is the target return,ρκ denotes the Huber loss with thresholdκ, andNis the number of quantile estimates used in training. Action Value Gradient (AVG) (Vasan et al. 2024) ...

work page 2024

2. The actor minimises an entropy-augmented objective $L_{\text{actor}} = \alpha \log \pi(a|s) - Q(s, a)$, where $a \sim \tanh(\mathcal{N}(\mu_\theta(s), \sigma_\theta(s)))$. This promotes both high-return and high-entropy policies by penalising confident low-reward actions.
3. The critic is updated using TD learning with scaled errors $\delta = \frac{r + \gamma V(s') - Q(s, a)}{\hat{\sigma}_\delta}$, where $\hat{\sigma}_\delta$ is a running estimate of $\mathrm{std}(\delta)$. Normalising TD errors ensures stable critic gradients even in the presence of large reward magnitudes or early-stage noise.
4. Supports robust learning in continuous control by incorporating tanh-based action squashing, TD error scaling, and optional clipping of actions to respect environment bounds.

Table A.1: Four point summaries (contd.)

REINFORCE (Williams 1992)
1. On-policy Monte Carlo algorithm that performs updates only at the end of full trajectori...

2. Uses a stochastic policy $\pi_\theta(a|s)$ to generate actions and samples episodes from the environment.
3. Optimises the expected return via the score function (log-derivative) estimator $\nabla_\theta J(\theta) = \mathbb{E}[\nabla_\theta \log \pi_\theta(\tau) R(\tau)]$.
4. Suffers from high variance and slow convergence, often mitigated by introducing baselines (advantage estimates) or reward normalisation.

Deterministic Policy Gradient (DPG) (Silver, Lever et al. 2014)
1. On-policy actor-critic algorithm where the actor uses a deterministic policy $\mu_\theta(s)$, suited to continuous action spaces.

2. Critic is trained using TD learning to estimate the Q-function, while the actor update uses the gradient chain rule over Q.
3. Gradient of the objective is $\nabla_\theta J(\theta) = \mathbb{E}_{s \sim \rho^\mu}\left[\nabla_\theta \mu_\theta(s) \nabla_a Q^\mu(s, a)|_{a=\mu_\theta(s)}\right]$.
4. Lacks target networks, making the learning process less stable and more sensitive to hyperparameters.

Deep Deterministic Policy Gradient (DDPG) (Lillicrap et al. 2019)
1. Off-policy actor-critic algorithm that extends DPG using deep function approximators and additional stabilisation mechanisms.

2. Introduces experience replay and target networks with soft target updates to decorrelate samples and improve training stability.
3. Actor and critic networks are both updated similarly to DPG.
4. Overestimation bias and sensitivity to exploration noise often limit performance unless mitigated by design changes (e.g. TD3).

Twin Delayed DDPG (TD3) (Fujimoto et al. 2018)
1. Off-policy actor-critic method designed to reduce the overestimation bias observed in DDPG.

2. Uses a double critic architecture where the minimum of two Q-value estimates is used for critic updates.
3. Actor is updated less frequently than the critics, and target networks are softly updated to reduce update variance.
4. Injects temporally correlated Gaussian noise into the target actions to promote exploration in continuous action spaces.

Table A.1: Four point summaries of different RL algorithms used

Trust Region Policy Optimisation (TRPO) (Schulman, Levine et al. 2015)
1. On-policy stochastic actor-critic algorithm with a theoretical guarantee of...

2. Formulates a constrained optimisation problem using a KL-divergence bound to control step sizes and prevent policy collapse.
3. Uses advantage-weighted surrogate objectives with a linear approximation and solves the constraint via conjugate gradient.
4. Highly stable but computationally expensive due to second-order updates and line search in large parameter spaces.

Proximal Policy Optimisation (PPO) (Schulman, Wolski et al. 2017)
1. On-policy stochastic actor-critic algorithm designed as a first-order alternative to TRPO with comparable stability.

2. Uses a clipped surrogate objective to prevent excessively large policy updates: $L^{\text{CLIP}}(\theta) = \mathbb{E}\left[\min\left(r_t(\theta)\hat{A}_t,\ \mathrm{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon)\hat{A}_t\right)\right]$, where $r_t(\theta) = \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{\text{old}}}(a_t|s_t)}$ and $\hat{A}_t$ is an estimator of the advantage function.
3. Leverages Generalised Advantage Estimation (GAE) to reduce variance of the policy gradient with a tunable bias-variance trade-off.
4. Efficient, robust and widely used in practice due to its balance of ease of implementation and empirical performance.

Soft Actor-Critic (SAC) (Haarnoja et al. 2018)
1. Off-policy actor-critic method for continuous control that augments the reward with an entropy term: $J(\pi) = \sum_t \mathbb{E}_{(s_t, a_t) \sim \rho_\pi}\left[r(s_t, a_t) + \alpha H(\pi(\cdot|s_t))\right]$, where $H$ is the entropy and $\alpha$ is a temperatu...

2. Encourages exploration by optimising for both reward and entropy, using a stochastic policy $\pi(a|s)$.
3. Maintains two Q-networks and uses the minimum value to reduce overestimation bias, with delayed soft target updates.
4. Features automatic entropy tuning, high sample efficiency, and strong empirical performance on continuous control benchmarks.

Table A.1: Four point summaries (contd.)

A.2. RL Algorithm Pseudocodes

A.2.1. REINFORCE

Algorithm 1 REINFORCE
1: Input: Gy...
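
As a rough illustration of the score-function estimator listed for REINFORCE in Table A.1, the sketch below applies it to a hypothetical two-armed bandit with a softmax policy, where each episode is a single action. It is not the paper's Algorithm 1; the bandit, the function names (`softmax`, `grad_log_softmax`, `reinforce_gradient`, `train`), the learning rate and the batch size are all assumptions made for this example.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def grad_log_softmax(logits, action):
    """Gradient of log pi(action) w.r.t. the logits: one_hot(action) - pi."""
    probs = softmax(logits)
    return [(1.0 if j == action else 0.0) - p for j, p in enumerate(probs)]


def reinforce_gradient(logits, episodes):
    """Monte Carlo estimate of E[grad log pi(tau) * R(tau)].

    `episodes` is a list of (action, return) pairs; each pair is one
    single-step trajectory in this toy setting.
    """
    grad = [0.0] * len(logits)
    for action, ret in episodes:
        g = grad_log_softmax(logits, action)
        for j in range(len(grad)):
            grad[j] += g[j] * ret / len(episodes)
    return grad


def train(arm_rewards, steps=200, batch=32, lr=0.5, seed=0):
    """Gradient ascent on the expected return of a toy bandit (illustrative only)."""
    rng = random.Random(seed)
    logits = [0.0] * len(arm_rewards)
    for _ in range(steps):
        probs = softmax(logits)
        episodes = []
        for _ in range(batch):
            a = rng.choices(range(len(probs)), weights=probs)[0]
            episodes.append((a, arm_rewards[a]))
        grad = reinforce_gradient(logits, episodes)
        logits = [t + lr * g for t, g in zip(logits, grad)]
    return softmax(logits)


# Arm 1 pays 1.0 and arm 0 pays 0.0, so the policy should shift towards arm 1.
final_probs = train([0.0, 1.0])
```

No baseline is subtracted here, so the estimate carries the high variance noted in the table; subtracting the batch-mean return from each `ret` would be one common way to reduce it.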

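The clipped surrogate objective in the PPO entry of Table A.1 can be evaluated directly from log-probabilities and advantage estimates. The following is a minimal sketch under stated assumptions: the function name `clipped_surrogate`, the list-based inputs and the default `eps=0.2` are illustrative choices, not values taken from the paper.

```python
import math


def clipped_surrogate(logp_new, logp_old, advantages, eps=0.2):
    """Mean of min(r * A, clip(r, 1 - eps, 1 + eps) * A).

    r = exp(logp_new - logp_old) is the probability ratio between the
    updated and the old policy for each sampled action.
    """
    total = 0.0
    for ln, lo, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(ln - lo)
        clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)


# Ratio 1.5 with a positive advantage: the gain is capped at (1 + eps) * A.
capped = clipped_surrogate([math.log(1.5)], [0.0], [1.0])
# Ratio 1.5 with a negative advantage: the unclipped (more pessimistic) term is kept.
uncapped = clipped_surrogate([math.log(1.5)], [0.0], [-1.0])
```

Taking the minimum means the objective never rewards moving the ratio outside the clip range, while still penalising moves that make a bad action more likely.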

[Figure B.1 panels: episodic return against # Steps (0 to 60K), log-scaled y-axis, with a THRESH entry in the legend. Recovered panel titles: scbc-v1-optim-L, scbc-v2-optim-L, scbc-v0-optim-L-60k, scbc-v1-optim-L-60k, scbc-v2-optim-L-60k, scbc-v0-homo-64L, scbc-v1-homo-64L, scbc-v2-homo-64L, scbc-v0-homo-64L-60k, scbc-v1-homo-64L-60k, scbc-v2-homo-64L-60k.]

Figure B.1: Training curves across 10 seeds for SCBC environments (v0, v1, v2) across all twelve tuning configurations (scbc-v0/1/2-optim-L-60k reproduced for easy reference). Episodic returns are plotted on a log scale. Shaded regions denote ±1.96 standard deviation (95% confidence intervals). Threshold values are mentioned in Table 3.

B.1.2. RCE Environment
...

[Figure B.2 panels: episodic return against # Steps (2K to 10K), log-scaled y-axis, with a THRESH entry in the legend. Recovered panel titles: rce17-v0-optim-L, rce17-v1-optim-L, rce-v0-optim-L-10k, rce17-v0-optim-L-10k, rce17-v1-optim-L-10k, rce-v0-homo-64L, rce17-v0-homo-64L, rce17-v1-homo-64L, rce-v0-homo-64L-10k, rce17-v0-homo-64L-10k, rce17-v1-homo-64L-10k.]

Figure B.2: Training curves for RCE environments (rce-v0, rce17-v0, rce17-v1) across all twelve tuning configurations (rce-v0/rce17-v0/1-optim-L-10k reproduced for easy reference). Episodic returns are plotted on a log scale. Shaded regions denote ±1.96 standard deviation (95% confidence intervals). Threshold values are mentioned in Table 3.
...
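
The shaded regions in the training-curve captions are described as ±1.96 standard deviations around the mean. A minimal sketch of how such a band could be computed across seeds is given below; the function name `shaded_band`, the nested-list input layout and the use of the sample (n - 1) standard deviation are assumptions, since the plotting code itself is not shown here.

```python
import statistics


def shaded_band(returns_by_seed, z=1.96):
    """Per-step mean and (mean - z*std, mean + z*std) across seeds.

    `returns_by_seed[i][t]` is the episodic return of seed i at step t.
    """
    n_steps = len(returns_by_seed[0])
    mean, lower, upper = [], [], []
    for t in range(n_steps):
        values = [run[t] for run in returns_by_seed]
        m = statistics.fmean(values)
        s = statistics.stdev(values)  # sample standard deviation (n - 1)
        mean.append(m)
        lower.append(m - z * s)
        upper.append(m + z * s)
    return mean, lower, upper


# Two hypothetical seeds, two evaluation steps.
mean, lower, upper = shaded_band([[1.0, 2.0], [3.0, 2.0]])
```

On a log-scaled axis a lower bound at or below zero cannot be drawn, so plotting code would typically need to clip it; how the figures handle this is not stated.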

[Figure B.3 panels: episodic return against # Steps (0 to 20K), log-scaled y-axis, with a THRESH entry in the legend. Recovered panel titles: ebm-v1-optim-L, ebm-v0-optim-L-20k, ebm-v1-optim-L-20k, ebm-v0-homo-64L, ebm-v1-homo-64L, ebm-v0-homo-64L-20k, ebm-v1-homo-64L-20k.]

Figure B.3: Episodic return curves (log-scaled) with 95% spreads over 10 seeds for the top-3 RL algorithms across eight single-agent EBM configurations (ebm-v0/1-optim-L-20k reproduced for easy reference). Threshold values are mentioned in Table 3.

B.2. Multi-agent RL

B.2.1. FedRL Skill Metrics

(a) DDPG climlab fed05 fed10 nofed ebm-v1 Mean±St...