pith. machine review for the scientific record.

arxiv: 2604.22795 · v2 · submitted 2026-04-13 · 📡 eess.SY · cs.LG · cs.SY

Recognition: unknown

Load constrained wind farm flow control through multi-objective multi-agent reinforcement learning

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 15:46 UTC · model grok-4.3

classification 📡 eess.SY · cs.LG · cs.SY
keywords wind farm flow control · multi-agent reinforcement learning · wake steering · damage equivalent loads · load constrained control · soft actor-critic · wind energy optimization

The pith

Multi-agent reinforcement learning lets wind farm turbines steer wakes for higher total power while keeping load increases below set thresholds.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a multi-agent reinforcement learning method to manage wake steering in wind farms so that overall electricity production rises without pushing structural fatigue on downstream machines beyond chosen limits. It pairs independent soft actor-critic agents with a fast surrogate that turns local wind measurements into real-time estimates of damage equivalent loads and folds those estimates into each agent's reward signal. Training occurs inside a dynamic wake simulation where agents must respect load-increase caps of 10, 20 or 30 percent above a baseline controller. If the approach holds, wind farms could run higher-yield wake strategies without incurring the extra maintenance costs that usually accompany aggressive steering. The work shows that the learned policies do increase power while deliberately avoiding the most load-intensive actions.
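
The paper's exact reward shaping is not given in this summary, so the following is a minimal sketch of one plausible form: a normalized farm-power term minus a hinge penalty that activates when the surrogate's DEL estimate exceeds the Δ_max ceiling. The hinge form, normalization, and penalty weight are assumptions for illustration, not the paper's formulation.

```python
def shaped_reward(farm_power, baseline_power, del_estimate, baseline_del,
                  delta_max=0.10, penalty_weight=1.0):
    """Power-gain term minus a penalty that activates only when the
    surrogate-estimated DEL exceeds (1 + delta_max) times the baseline DEL."""
    power_term = (farm_power - baseline_power) / baseline_power
    load_ceiling = (1.0 + delta_max) * baseline_del
    # Hinge penalty: zero inside the allowed band, linear beyond it.
    load_excess = max(0.0, (del_estimate - load_ceiling) / baseline_del)
    return power_term - penalty_weight * load_excess

# e.g. a 2% farm power gain with the estimated DEL 5% over the 10% ceiling:
r = shaped_reward(10.2e6, 10.0e6, 1.15e3, 1.0e3)  # -> 0.02 - 0.05 = -0.03
```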

Core claim

Turbine-specific agents, trained with an independent soft actor-critic architecture and a shaped reward that combines power output with real-time damage equivalent load estimates from a local-inflow sector-averaged surrogate, learn collaborative policies that raise total farm power while staying inside prescribed load-increase bounds under non-stationary wake conditions.

What carries the argument

An Independent Soft Actor-Critic multi-agent setup whose reward function folds in real-time damage equivalent load estimates from a data-driven sector-averaged surrogate model, letting agents trade power gains against load limits.
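
A minimal sketch of what such a surrogate could look like, assuming offline training pairs of sector-averaged inflow features and solver-computed DELs. The feature set, the gradient-boosting model family, and the synthetic data are illustrative assumptions, not the paper's choices.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic stand-in for offline training data: sector-averaged inflow
# features paired with DELs that would come from the flow solver.
rng = np.random.default_rng(0)
n = 2000
X = np.column_stack([
    rng.uniform(4, 25, n),       # sector-averaged wind speed [m/s]
    rng.uniform(0.02, 0.20, n),  # sector-averaged turbulence intensity [-]
    rng.uniform(-30, 30, n),     # yaw offset [deg]
])
# Placeholder target, not a real load model.
y = 1e3 * X[:, 0] ** 1.5 * (1 + 5 * X[:, 1]) * (1 + 0.01 * np.abs(X[:, 2]))

surrogate = GradientBoostingRegressor().fit(X, y)

def estimate_del(wind_speed, turb_intensity, yaw_deg):
    """Real-time DEL estimate from local, sector-averaged measurements."""
    return float(surrogate.predict([[wind_speed, turb_intensity, yaw_deg]])[0])
```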

If this is right

  • Agents learn to retreat from high-load wake-steering actions while still capturing net power gains.
  • The same framework can be retrained for different allowed load-increase levels of 10, 20 or 30 percent.
  • Real-time load estimates are generated locally from inflow sector data without requiring full-farm sensors.
  • The method runs inside a dynamic wake meandering simulation that captures changing wake positions over time (a skeleton of the resulting independent-agent training loop follows this list).
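
A skeleton of the independent-learning pattern with one agent per turbine, each acting on a local observation and updating from its own shaped reward. The environment interface is a hypothetical stand-in for WindGym, not its actual API, and the placeholder agent draws random yaw offsets where SAC policy and critic updates would sit.

```python
import numpy as np

class TurbineAgent:
    """Placeholder for a soft actor-critic learner; it samples yaw offsets
    uniformly where the SAC policy would act."""
    def __init__(self, yaw_limit=30.0):
        self.yaw_limit = yaw_limit

    def act(self, obs):
        return float(np.random.uniform(-self.yaw_limit, self.yaw_limit))

    def update(self, obs, action, reward, next_obs):
        pass  # SAC critic, actor, and entropy-temperature updates go here

class ToyFarmEnv:
    """Hypothetical stand-in for a WindGym-like farm environment returning
    per-turbine local observations and per-turbine shaped rewards."""
    def __init__(self, n_turbines=3):
        self.n = n_turbines

    def reset(self):
        return [np.zeros(3) for _ in range(self.n)]

    def step(self, actions):
        obs = [np.random.randn(3) for _ in range(self.n)]
        rewards = [-abs(a) / 30.0 for a in actions]  # placeholder rewards
        return obs, rewards

env = ToyFarmEnv()
agents = [TurbineAgent() for _ in range(env.n)]  # one agent per turbine
obs = env.reset()
for _ in range(10):  # training-loop skeleton
    actions = [agent.act(o) for agent, o in zip(agents, obs)]
    next_obs, rewards = env.step(actions)
    # Independent learning: each agent updates from its own transition only.
    for agent, o, a, r, no in zip(agents, obs, actions, rewards, next_obs):
        agent.update(o, a, r, no)
    obs = next_obs
```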

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same reward-shaping idea could be tested on farms with different turbine spacings or in regions with strong wind shear to see how quickly policies adapt.
  • Combining the learned agents with occasional direct load measurements might reduce reliance on the surrogate and improve safety margins.
  • Scaling the number of agents to very large farms would show whether local communication among nearby turbines is needed for stable coordination.

Load-bearing premise

The surrogate model must deliver damage equivalent load estimates accurate enough to be used directly in the reward without causing the agents to learn policies that violate real load limits.
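
For orientation, a damage equivalent load condenses a varying load history into a single constant-amplitude cycle range via rainflow counting and a Palmgren-Miner sum. A minimal sketch, assuming the third-party rainflow package and illustrative values for the Wöhler exponent m and the reference cycle count; the paper's settings are not given here.

```python
import numpy as np
import rainflow  # third-party package (pip install rainflow)

def damage_equivalent_load(load_series, m=10.0, n_ref=1e7):
    """DEL = (sum_i n_i * S_i**m / N_ref) ** (1/m) over rainflow cycles."""
    cycles = rainflow.count_cycles(load_series)  # [(range, count), ...]
    damage = sum(count * rng ** m for rng, count in cycles)
    return (damage / n_ref) ** (1.0 / m)

# Example on a noisy sinusoid standing in for a blade-root flap-wise moment.
t = np.linspace(0.0, 600.0, 6000)
signal = 5e3 * np.sin(0.5 * t) + 500.0 * np.random.randn(t.size)
print(damage_equivalent_load(signal))
```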

What would settle it

Deploy the trained policies on an operating wind farm and record measured turbine loads and power output to verify that load increases remain below the target thresholds while total power rises.

Figures

Figures reproduced from arXiv:2604.22795 by Iasonas Tsaklis, Marcus Binder Nilsen, Nikolay Dimitrov, Pierre-Elouan Réthoré, Teodor Åstrand, Tuhfe Göçmen.

Figure 1: Overview of the utilised MARL setup using WindGym.
Figure 2: Agent training progress for six random seeds, each running in […]
Figure 3: Time series of farm-level power production for the MARL-controlled case compared […]
Figure 4: a) Time series of turbine-specific DEL for the blade root flap-wise moment, computed […]
Figure 5: Yaw angles commanded by learned agent policies during evaluation.
Original abstract

This study presents a multi-agent reinforcement learning (MARL) framework for load-constrained wind farm flow control (WFFC). While wake steering can enhance total wind farm power, it often introduces increased structural loads on downstream turbines. To address this, we integrate an Independent Soft Actor-Critic (I-SAC) architecture with a data-driven, local inflow sector-averaged surrogate model to provide real-time estimates of Damage Equivalent Loads (DELs). By incorporating these estimates into a shaped reward function, turbine-specific agents are trained to maximize power production while adhering to specific load-increase thresholds ($\Delta_{max}$) of 10%, 20%, and 30% relative to a baseline controller. The framework is implemented within the WindGym environment using the DYNAMIKS flow solver with Dynamic Wake Meandering (DWM) model to capture non-stationary wake physics. Results indicate that the MARL agents successfully learn collaborative policies that prioritise power gain while actively retreating from high-DEL control strategies.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes a multi-agent reinforcement learning (MARL) framework based on Independent Soft Actor-Critic (I-SAC) for load-constrained wind farm flow control. A data-driven surrogate model provides real-time estimates of Damage Equivalent Loads (DELs) from local inflow sector averages; these estimates are inserted into a shaped reward function so that agents maximize total power while respecting author-specified load-increase thresholds Δ_max of 10 %, 20 %, and 30 % relative to a baseline controller. The framework is implemented in the WindGym environment using the DYNAMIKS solver with the Dynamic Wake Meandering (DWM) model. The central claim is that the trained agents learn collaborative yaw-steering policies that increase power production while actively retreating from high-DEL actions.

Significance. If the surrogate proves sufficiently accurate and the reported policies generalize, the work would offer a practical route to deploy wake-steering strategies without compromising turbine lifetime. The combination of MARL coordination with an online load surrogate inside the reward loop is a technically relevant contribution to wind-farm control. The use of the DWM model for non-stationary wake physics is also a positive modeling choice.

major comments (3)
  1. [Abstract and Results] The claim that agents 'successfully learn collaborative policies that prioritise power gain while actively retreating from high-DEL control strategies' is not supported by quantitative metrics (power gains, DEL statistics, or baseline comparisons) in the abstract, nor by tables or figures reporting these values for each Δ_max. Without such data the central empirical claim cannot be evaluated.
  2. [Surrogate Model] The data-driven, sector-averaged local-inflow surrogate is inserted directly into the reward loop, yet no closed-loop error statistics (bias, variance, or worst-case under-prediction) are reported between surrogate DELs and full DWM solver outputs evaluated at the yaw angles chosen by the learned policies. Because the skeptical concern is that the surrogate may systematically understate loads for non-stationary wake-steering actions, this validation is load-bearing for the claim that true DELs respect the stated Δ_max thresholds (a sketch of such error statistics follows the minor comments).
  3. [Results] No ablation is presented in which the surrogate is replaced by direct DEL computation from DYNAMIKS inside the reward during training or evaluation. Such an ablation would directly test whether the learned policies remain feasible when the reward uses the true physics rather than the approximation.
minor comments (2)
  1. [Methodology] The precise definition of the baseline controller against which Δ_max is measured should be stated explicitly (e.g., greedy individual control or a fixed yaw schedule).
  2. [Implementation] Training hyperparameters for the I-SAC agents (learning rates, entropy coefficient schedule, replay buffer size) are not listed; these details are needed for reproducibility.
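
To make major comment 2 concrete, a minimal sketch of the requested closed-loop error statistics, assuming paired arrays of surrogate estimates and solver-recomputed DELs at policy-selected yaw angles. Every number below is a placeholder, not data from the paper.

```python
import numpy as np

# Placeholder data: solver-recomputed DELs vs. surrogate estimates at the
# yaw angles selected by the trained policies.
rng = np.random.default_rng(1)
solver_del = rng.uniform(1.0e3, 2.0e3, 500)
surrogate_del = solver_del * (1.0 + 0.03 * rng.standard_normal(500))

errors = surrogate_del - solver_del
bias = errors.mean()
std = errors.std()
# Safety-critical direction: how far the surrogate can under-report the load.
worst_under_prediction = (solver_del - surrogate_del).max()

baseline_del = 1.4e3  # hypothetical baseline-controller DEL
delta_max = 0.10
feasible_fraction = np.mean(solver_del <= (1.0 + delta_max) * baseline_del)

print(f"bias={bias:.1f}  std={std:.1f}  "
      f"worst under-prediction={worst_under_prediction:.1f}  "
      f"feasible fraction={feasible_fraction:.2f}")
```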

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed comments, which highlight important aspects for strengthening the presentation and validation of our work. We address each major comment below and indicate the revisions we will undertake.

Point-by-point responses
  1. Referee: [Abstract and Results] The claim that agents 'successfully learn collaborative policies that prioritise power gain while actively retreating from high-DEL control strategies' is not supported by quantitative metrics (power gains, DEL statistics, or baseline comparisons) in the abstract, nor by tables or figures reporting these values for each Δ_max. Without such data the central empirical claim cannot be evaluated.

    Authors: We agree that the abstract would be strengthened by explicit quantitative support and that a summary table would improve clarity. The results section contains figures that illustrate power production and load behavior for the three Δ_max thresholds, but we acknowledge the absence of a consolidated table with numerical metrics and baseline comparisons. In the revised manuscript we will update the abstract to report key quantitative outcomes (power gains and DEL adherence for each Δ_max) and add a table in the results section that tabulates average power increase, DEL statistics, and comparisons to the baseline controller. revision: yes

  2. Referee: [Surrogate Model] The data-driven, sector-averaged local-inflow surrogate is inserted directly into the reward loop, yet no closed-loop error statistics (bias, variance, or worst-case under-prediction) are reported between surrogate DELs and full DWM solver outputs evaluated at the yaw angles chosen by the learned policies. Because the skeptical concern is that the surrogate may systematically understate loads for non-stationary wake-steering actions, this validation is load-bearing for the claim that true DELs respect the stated Δ_max thresholds.

    Authors: We recognize the importance of closed-loop validation at the operating points selected by the learned policies. The surrogate was trained and tested on DWM-generated data, but we did not report error statistics specifically for the yaw angles arising from the trained agents. In the revised manuscript we will add an analysis (in the results or an appendix) that evaluates the surrogate against full DWM DEL computations for the final policies, reporting bias, variance, and worst-case under-prediction to confirm that the true loads remain within the stated Δ_max limits. revision: yes

  3. Referee: [Results] No ablation is presented in which the surrogate is replaced by direct DEL computation from DYNAMIKS inside the reward during training or evaluation. Such an ablation would directly test whether the learned policies remain feasible when the reward uses the true physics rather than the approximation.

    Authors: We agree that such an ablation would provide valuable insight into the effect of the surrogate approximation. However, embedding direct DEL evaluation from the full DWM solver inside the reward loop during training incurs prohibitive computational cost, as each step would require a complete non-stationary simulation rather than an instantaneous surrogate query. We will add a discussion of this computational trade-off in the revised paper and, where resources permit, perform a limited ablation during policy evaluation (rather than full retraining) to assess feasibility with true DELs. revision: partial

Circularity Check

0 steps flagged

No load-bearing circularity; surrogate and thresholds are independent of final policy outcomes

full rationale

The paper's core result is an empirical outcome of running I-SAC training inside the WindGym/DYNAMIKS simulator. The DEL surrogate is trained on separate data and inserted into a shaped reward whose Δ_max thresholds are chosen by the authors rather than fitted to the learned policy. No derivation step reduces a claimed prediction to a quantity defined by the same fit, and no self-citation chain is invoked to justify uniqueness or force the result. The training loop is self-contained but not tautological.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 0 invented entities

The framework rests on the accuracy of a data-driven surrogate that maps sector-averaged inflow to DEL and on the fidelity of the DWM wake model inside the simulator; no new physical entities are postulated.

free parameters (1)
  • Δ_max load-increase thresholds
    Explicit ceilings (10 %, 20 %, 30 %) chosen by the authors to define acceptable policy behavior.
axioms (2)
  • domain assumption The sector-averaged local inflow surrogate produces DEL estimates accurate enough for real-time reward shaping
    Invoked when the surrogate output is inserted into the shaped reward function.
  • domain assumption The DWM model inside WindGym captures the non-stationary wake physics relevant to load and power trade-offs
    Used to generate the training environment for all agents.

pith-pipeline@v0.9.0 · 5509 in / 1534 out tokens · 30453 ms · 2026-05-10T15:46:23.314141+00:00 · methodology


Reference graph

Works this paper leans on

15 extracted references · 2 canonical work pages · 1 internal anchor

  1. [1] Thomsen K and Sørensen P 1999 Fatigue loads for wind turbines operating in wakes. Journal of Wind Engineering and Industrial Aerodynamics 80 121–136. ISSN 0167-6105
  2. [2] Debusscher C M J, Göçmen T and Andersen S J 2022 Probabilistic surrogates for flow control using combined control strategies. Journal of Physics: Conference Series 2265 032110
  3. [3] Padullaparthi V R, Nagarathinam S, Vasan A, Menon V and Sudarsanam D 2022 FALCON: farm level control for wind turbines using multi-agent deep reinforcement learning. Renewable Energy 181 445–456. ISSN 0960-1481
  4. [4] Damiani R, Dana S, Annoni J, Fleming P, Roadman J, van Dam J and Dykes K 2018 Assessment of wind turbine component loads under yaw-offset conditions. Wind Energy Science 3 173–189. ISSN 2366-7443. Publisher: Copernicus GmbH
  5. [5] Göçmen T, Liew J, Kadoche E, Dimitrov N, Riva R, Andersen S J, Lio A W, Quick J, Réthoré P E and Dykes K 2025 Data-driven wind farm flow control and challenges towards field implementation: a review. Renewable and Sustainable Energy Reviews 216 115605. ISSN 1364-0321
  6. [6] Monroc C B, Bušić A, Dubuc D and Zhu J 2025 WFCRL: a multi-agent reinforcement learning benchmark for wind farm control. Preprint 2501.13592
  7. [7] Kadoche E, Gourvénec S, Pallud M and Levent T 2023 MARLYC: multi-agent reinforcement learning yaw control. Renewable Energy 217 119129. ISSN 0960-1481
  8. [8] Sutton R and Barto A 1998 Reinforcement Learning: An Introduction. A Bradford Book (MIT Press). ISBN 9780262193986
  9. [9] 2025 WindGym: Reinforcement Learning Environment for Wind Farm Control. https://github.com/DTUWindEnergy/windgym, version accessed on 7 May 2026
  10. [10] DTU Wind 2023 DYNAMIKS. https://gitlab.windenergy.dtu.dk/DYNAMIKS/dynamiks
  11. [11] Huang S, Dossa R F J, Ye C, Braga J, Chakraborty D, Mehta K and Araújo J G 2022 CleanRL: high-quality single-file implementations of deep reinforcement learning algorithms. Journal of Machine Learning Research 23 1–18
  12. [12] Haarnoja T, Zhou A, Abbeel P and Levine S 2018 Soft actor-critic: off-policy maximum entropy deep reinforcement learning with a stochastic actor. Preprint 1801.01290
  13. [13] Larsen T J and Hansen A M 2019 How 2 HAWC2, the User's Manual. DTU Wind Energy, Roskilde, Denmark, version 12.7. URL https://www.hawc2.dk
  14. [14] Vad A, Guilloré A, Anand A, Pettas V, Shah A H, Lizarraga-Saenz I, Aparicio-Sanchez M, Eguinoa I, Conti Gost N, Tsaklis I, Frère A, Hermans K W, Kamau J K, Dimitrov N, Göçmen T and Bottasso C L 2026 Modeling wind farm response: a modular, integrated, and multi-stakeholder approach. Wind Energy Science Discussions 2026 1–49
  15. [15] Meyers J, Bottasso C, Dykes K, Fleming P, Gebraad P, Giebel G, Göçmen T and van Wingerden J W 2022 Wind farm flow control: prospects and challenges. Wind Energy Science 7 2271–2306