pith. machine review for the scientific record.

arxiv: 2604.24549 · v1 · submitted 2026-04-27 · 💻 cs.LG · cs.AI

Recognition: unknown

GradMAP: Gradient-Based Multi-Agent Proximal Learning for Grid-Edge Flexibility

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 04:14 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords multi-agent proximal learning · decentralized policy · three-phase power flow · implicit differentiation · grid-edge control · neural network policies · IEEE 123-bus test feeder

The pith

GradMAP learns fully decentralized neural policies for 1,000 grid-edge agents by back-propagating exact three-phase AC power-flow violations through implicit differentiation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces GradMAP to coordinate large populations of grid-edge devices such as batteries, heat pumps, and generators while keeping all decisions local and communication-free at deployment time. Each agent runs its own neural network that receives only its local observation and outputs an action; the networks are trained independently with no parameter sharing. During offline training a differentiable three-phase AC power-flow model is embedded inside a primal-dual loop so that network constraint violations can be differentiated exactly and sent back to update every policy. A proximal surrogate operating directly in action space re-uses the expensive environment gradients inside a trust region, producing a reported 3-5x training speed-up over prior gradient-based and multi-agent reinforcement-learning methods on the IEEE 123-bus feeder.

Core claim

GradMAP embeds a differentiable three-phase AC power-flow model inside a primal-dual learning loop and uses implicit differentiation to propagate exact load-flow constraint violations to the parameters of independent neural policies; a proximal surrogate defined in action space then accelerates training inside a trust region, enabling 1,000 agents to reach low-violation decentralized policies in 15 minutes on a single GPU.

What carries the argument

The primal-dual loop with implicit differentiation of the three-phase AC power-flow model, paired with a proximal surrogate operating in action space rather than probability space.
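The load-bearing step here is implicit differentiation through a converged solver. As an illustrative toy (the cubic residual and all function names below are invented stand-ins for the paper's three-phase power-flow model), the implicit function theorem turns a Newton solve F(x, a) = 0 into exact sensitivities dx/da of the solved network state with respect to agent actions:

```python
import numpy as np

# Toy residual F(x, a) = 0 standing in for the power-flow equations:
# x is the solved "network state", a is the agents' action vector.
def F(x, a):
    return x**3 + x - a          # element-wise toy nonlinearity

def dF_dx(x, a):
    return np.diag(3 * x**2 + 1) # Jacobian w.r.t. state (diagonal here)

def dF_da(x, a):
    return -np.eye(len(a))       # Jacobian w.r.t. actions

def solve_state(a, iters=50):
    """Newton's method: the forward 'power-flow solve'."""
    x = np.zeros_like(a)
    for _ in range(iters):
        x = x - np.linalg.solve(dF_dx(x, a), F(x, a))
    return x

def implicit_grad(a):
    """Implicit function theorem: dx/da = -(dF/dx)^{-1} dF/da.
    No differentiation through the Newton iterations themselves."""
    x = solve_state(a)
    return -np.linalg.solve(dF_dx(x, a), dF_da(x, a))

a = np.array([0.5, 2.0])
x = solve_state(a)
J = implicit_grad(a)
# Sanity check against a finite-difference perturbation of the solver
eps = 1e-6
fd = (solve_state(a + np.array([eps, 0.0])) - x) / eps
assert np.allclose(J[:, 0], fd, atol=1e-4)
```

The same identity is what lets constraint violations evaluated at the solved state be back-propagated exactly to policy parameters, provided dF/dx stays invertible — the assumption the referee probes below.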

If this is right

  • Policies require no inter-agent communication or central coordinator once deployed.
  • Training finishes in 15 minutes on commodity hardware and scales to at least 1,000 heterogeneous devices.
  • Out-of-sample tests show operating costs and constraint violations that are among the lowest relative to self-supervised and multi-agent RL baselines.
  • The same differentiable power-flow embedding can be reused across different network topologies without retraining the differentiation pipeline.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The method could be tested on real-time hardware-in-the-loop simulators to check whether the 15-minute training time holds under measurement noise and model mismatch.
  • Because each policy is completely local, the same framework might extend to other networked control problems whose physics admit differentiable simulators, such as traffic or water networks.
  • Removing the need for communication infrastructure could lower the capital cost of large-scale demand-response programs, an implication left implicit in the case studies.

Load-bearing premise

That an accurate differentiable three-phase AC power-flow model can be embedded in the training loop so implicit differentiation transmits precise constraint violations to the policy updates.

What would settle it

Deploy the learned policies on the same IEEE 123-bus feeder with realistic load and renewable profiles never seen in training and measure whether three-phase voltage and line-flow violations remain below the levels reported in the paper.
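That settling test reduces to a simple out-of-sample metric. A minimal sketch — the 0.95–1.05 p.u. voltage band is a common ANSI-style assumption for illustration, not a limit quoted from the paper:

```python
import numpy as np

def violation_rate(v_mag_pu, v_min=0.95, v_max=1.05):
    """Fraction of bus-phase voltage magnitudes (per unit) outside the
    assumed operating band; this is the quantity to compare against the
    paper's reported violation levels on unseen load/renewable profiles."""
    v = np.asarray(v_mag_pu, dtype=float)
    return float(np.mean((v < v_min) | (v > v_max)))

# Example: one voltage sample per bus-phase on a held-out day
assert violation_rate([1.00, 0.96, 1.04, 1.01]) == 0.0
assert violation_rate([0.90, 1.00, 1.10, 1.00]) == 0.5
```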

Figures

Figures reproduced from arXiv: 2604.24549 by Hongtai Zeng, Thomas Morstyn, Yihong Zhou.

Figure 1. Architecture diagram of the proposed GradMA and GradMAP
Figure 3. Five-day visualisation of the datasets used in the IEEE123
Figure 2. IEEE123 feeder topology used in the main case study
Figure 4. All 1,000 agents’ normalised power on the held-out test day 18. Positive values represent demand while negative values represent generation.
Figure 5. Zoomed-in illustration for Fig. 4, where we randomly selected 2 agents
Figure 7. Out-of-sample benchmark comparison for the 1,000-agent case
Figure 8. Scaling comparison across three different numbers of agents
read the original abstract

Coordinating large populations of grid-edge devices requires learning methods that remain fully decentralised in deployment while still respecting three-phase AC distribution-network physics. This paper proposes gradient-based multi-agent proximal learning (GradMAP) to address this challenge. GradMAP trains independent neural-network policies for each agent without any parameter sharing, and each agent uses only its own local observation for online decision-making without communication. During offline training, GradMAP embeds a differentiable three-phase AC power-flow model in a primal-dual learning loop and uses implicit differentiation to propagate exact network-constraint violations to update the policy parameters. To speed up training, GradMAP reuses expensive environment gradients through a proximal surrogate within a trust region defined in the more direct policy-output (action) space, instead of the probability distribution space used in other works, such as PPO. In case studies with 1,000 agents managing batteries, heat pumps, and controllable generators on the IEEE 123-bus feeder, GradMAP learns decentralised policies that minimise three-phase AC load-flow constraint violations within 15 minutes of training on a single workstation-class NVIDIA RTX PRO 5000 Blackwell 48GB GPU. This is a 3--5x training speed-up over gradient-based self-supervised learning benchmarks and substantially better training efficiency than multi-agent reinforcement-learning benchmarks. In out-of-sample tests, GradMAP also delivers among the lowest operating cost and constraint violations.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes GradMAP, a gradient-based multi-agent proximal learning method for decentralized control of grid-edge devices. Independent neural-network policies are trained for each of up to 1000 agents using only local observations and no communication. A differentiable three-phase unbalanced AC power-flow model is embedded in a primal-dual loop; implicit differentiation propagates network constraint violations to policy updates. Training is accelerated by a proximal surrogate operating in action space within a trust region. On the IEEE 123-bus feeder, the method reports 15-minute training on a single GPU (3-5x faster than gradient-based self-supervised baselines and better than MARL), with competitive out-of-sample operating cost and constraint-violation performance.

Significance. If the implicit gradients remain exact, GradMAP would provide a scalable, physics-informed route to fully decentralized DER coordination that respects three-phase AC physics without inter-agent communication. The reported training times on a 1000-agent instance and the action-space proximal surrogate constitute concrete efficiency gains over existing self-supervised and RL approaches, with potential practical value for distribution-system operators managing high DER penetration.

major comments (2)
  1. [primal-dual learning loop and implicit differentiation] Implicit differentiation through the three-phase AC power-flow solver (description of the primal-dual loop and differentiable load-flow embedding): the central claim that exact constraint violations are propagated to independent policy parameters assumes the Newton-Raphson Jacobian remains invertible and the solution unique throughout training. Under high DER penetration on the IEEE 123-bus feeder, singularities or multiple solutions are known to occur; the manuscript provides no conditioning analysis, uniqueness verification, or fallback mechanism. If this assumption fails, the reported 15-minute convergence and out-of-sample constraint satisfaction no longer follow from the stated mechanism.
  2. [case studies and results] Experimental results (case-study section and performance tables): the abstract and results report concrete speed-up factors, training times, and out-of-sample metrics, yet supply no statistical details (standard deviations over seeds), exact baseline hyper-parameters, or implementation verification that the implicit-differentiation step matches the claimed gradients. This weakens the 3-5x speedup and “among the lowest” performance claims.
minor comments (2)
  1. [method] Notation for the proximal trust-region radius and action-space surrogate should be introduced in a dedicated table or appendix to avoid ambiguity when comparing to PPO-style methods.
  2. [figures] Figure captions for the IEEE 123-bus results should explicitly state the number of random seeds and whether error bars represent standard deviation or inter-quartile range.
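For concreteness, the action-space trust region whose notation minor comment 1 asks to see defined could look like the following sketch (function and parameter names are hypothetical, and the paper's exact surrogate may differ):

```python
import numpy as np

def proximal_action_step(action, env_grad, lr=0.1, radius=0.05):
    """One hypothetical proximal update in action space: descend the
    cached environment gradient, then project the step onto an L2
    trust region around the current action. This is the analogue of
    PPO's probability-ratio clipping, applied to policy outputs."""
    step = -lr * env_grad
    norm = np.linalg.norm(step)
    if norm > radius:              # trust-region projection
        step *= radius / norm
    return action + step
```

Unlike PPO's clipping in probability-distribution space, the constraint here bounds the change of the policy output itself, which is what permits reusing expensive environment gradients across several inner updates.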

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. We address each major comment point by point below and indicate the revisions we will make.

read point-by-point responses
  1. Referee: [primal-dual learning loop and implicit differentiation] Implicit differentiation through the three-phase AC power-flow solver (description of the primal-dual loop and differentiable load-flow embedding): the central claim that exact constraint violations are propagated to independent policy parameters assumes the Newton-Raphson Jacobian remains invertible and the solution unique throughout training. Under high DER penetration on the IEEE 123-bus feeder, singularities or multiple solutions are known to occur; the manuscript provides no conditioning analysis, uniqueness verification, or fallback mechanism. If this assumption fails, the reported 15-minute convergence and out-of-sample constraint satisfaction no longer follow from the stated mechanism.

    Authors: We thank the referee for highlighting this important assumption. The method relies on the standard invertibility of the Newton-Raphson Jacobian at converged power-flow solutions, which holds in the operating regimes of our experiments. We agree that explicit verification is absent. In the revised manuscript we will add a subsection (and associated appendix table) reporting the observed condition numbers of the Jacobian throughout training on the IEEE 123-bus feeder, confirming that no singularities were encountered and that the primal-dual loop includes an automatic step-size reduction on solver non-convergence. This will directly support the validity of the implicit-gradient propagation and the reported training times. revision: yes

  2. Referee: [case studies and results] Experimental results (case-study section and performance tables): the abstract and results report concrete speed-up factors, training times, and out-of-sample metrics, yet supply no statistical details (standard deviations over seeds), exact baseline hyper-parameters, or implementation verification that the implicit-differentiation step matches the claimed gradients. This weakens the 3-5x speedup and “among the lowest” performance claims.

    Authors: We agree that additional statistical detail and verification would strengthen the claims. In the revision we will (i) report means and standard deviations over five random seeds for all training-time and out-of-sample metrics, (ii) provide a supplementary table with the exact hyper-parameters used for every baseline, and (iii) add a small-scale verification experiment demonstrating that the implicit-differentiation gradients match finite-difference approximations to within numerical tolerance. These additions will substantiate the reported speed-ups and performance comparisons. revision: yes
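The conditioning audit promised in response 1 is straightforward to instrument. A hedged sketch — the threshold and function name are illustrative, not taken from the manuscript:

```python
import numpy as np

def check_jacobian_conditioning(J, max_cond=1e8):
    """Report the Newton-Raphson Jacobian's condition number and flag
    near-singular solves, so a primal-dual loop could shrink its step
    (the fallback the rebuttal describes) before gradients degrade."""
    kappa = np.linalg.cond(J)
    return {"cond": float(kappa), "ok": bool(np.isfinite(kappa) and kappa < max_cond)}

# Well-conditioned vs. near-singular toy Jacobians
J_good = np.array([[2.0, 0.1], [0.1, 1.5]])
J_bad = np.array([[1.0, 1.0], [1.0, 1.0 + 1e-12]])
assert check_jacobian_conditioning(J_good)["ok"]
assert not check_jacobian_conditioning(J_bad)["ok"]
```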

Circularity Check

0 steps flagged

No circularity: GradMAP applies standard implicit differentiation and proximal optimization to decentralised policy learning without reducing claims to self-defined quantities.

full rationale

The paper's core derivation embeds a differentiable three-phase AC power-flow model inside a primal-dual loop, applies implicit differentiation to obtain constraint gradients for independent neural policies, and accelerates training via a proximal surrogate in action space rather than policy-distribution space. These steps invoke established numerical techniques (implicit differentiation through Newton solves, trust-region proximal updates) whose validity is treated as external rather than redefined inside the paper. Empirical results on the IEEE 123-bus feeder with 1,000 agents are presented as out-of-sample validation against separate gradient-based self-supervised and multi-agent RL benchmarks; no performance metric is shown to be a direct algebraic rearrangement or statistical fit of quantities defined only by the method's own fitted parameters or prior self-citations. The derivation chain therefore remains non-circular and externally falsifiable.

Axiom & Free-Parameter Ledger

1 free parameters · 2 axioms · 0 invented entities

The central claim rests on the differentiability of the embedded AC power-flow model and standard assumptions of policy-gradient methods; no new physical entities are postulated.

free parameters (1)
  • proximal trust-region radius
    Hyperparameter defining the size of the action-space trust region used for the surrogate; its value is not reported in the abstract but is required for the claimed speed-up.
axioms (2)
  • domain assumption The three-phase AC power-flow equations are differentiable with respect to device set-points
    Invoked to justify implicit differentiation through the network model in the primal-dual loop.
  • domain assumption Independent local policies can achieve network-wide constraint satisfaction without inter-agent communication at deployment time
    Core premise enabling the fully decentralized claim.

pith-pipeline@v0.9.0 · 5554 in / 1688 out tokens · 68220 ms · 2026-05-08T04:14:44.206081+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

40 extracted references · 6 canonical work pages · 2 internal anchors

  1. [1]

    Digitalisation and Energy,

    The International Energy Agency (IEA), “Digitalisation and Energy,” Technical Report, 2017. [Online]. Available: www.iea.org/reports/digitalisation-and-energy. Accessed: 2025-05-22

  2. [2]

    Distributed robust dynamic economic dispatch of integrated transmission and distribution systems,

    Z. Chen, C. Guo, S. Dong, Y. Ding, and H. Mao, “Distributed robust dynamic economic dispatch of integrated transmission and distribution systems,” IEEE Transactions on Industry Applications, 2021

  3. [3]

    Distributionally robust joint chance-constrained dispatch for integrated transmission-distribution systems via distributed optimization,

    J. Zhai et al., “Distributionally robust joint chance-constrained dispatch for integrated transmission-distribution systems via distributed optimization,” IEEE Transactions on Smart Grid, 2022

  4. [4]

    Aggregated feasible active power region for distributed energy resources with a distributionally robust joint probabilistic guarantee,

    Y. Zhou, C. Essayeh, and T. Morstyn, “Aggregated feasible active power region for distributed energy resources with a distributionally robust joint probabilistic guarantee,” IEEE Transactions on Power Systems, 2025

  5. [5]

    Coordinated vertical provision of flexibility from distribution systems,

    H. Früh et al., “Coordinated vertical provision of flexibility from distribution systems,” IEEE Transactions on Power Systems, 2023

  6. [6]

    Non-iterative solution for coordinated optimal dispatch via equivalent projection—part i: Theory,

    Z. Tan, Z. Yan, H. Zhong, and Q. Xia, “Non-iterative solution for coordinated optimal dispatch via equivalent projection—part i: Theory,” IEEE Transactions on Power Systems, vol. 39, no. 1, pp. 890–898, 2024

  7. [7]

    LSTM-based energy management for electric vehicle charging in commercial-building prosumers,

    H. Zhou et al., “LSTM-based energy management for electric vehicle charging in commercial-building prosumers,” Journal of Modern Power Systems and Clean Energy, vol. 9, no. 5, pp. 1205–1216, 2021

  8. [8]

    Online optimal power scheduling of a microgrid via imitation learning,

    S. Gao, C. Xiang, M. Yu, K. T. Tan, and T. H. Lee, “Online optimal power scheduling of a microgrid via imitation learning,” IEEE Transactions on Smart Grid, vol. 13, no. 2, pp. 861–876, 2021

  9. [9]

    R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, 2nd ed. Cambridge, MA: MIT Press, 2015

  10. [10]

    Deep reinforcement learning-based energy storage arbitrage with accurate lithium-ion battery degradation model,

    J. Cao et al., “Deep reinforcement learning-based energy storage arbitrage with accurate lithium-ion battery degradation model,” IEEE Transactions on Smart Grid, 2020

  11. [11]

    Reinforcement learning for electric vehicle applications in power systems: A critical review,

    D. Qiu, Y. Wang, W. Hua, and G. Strbac, “Reinforcement learning for electric vehicle applications in power systems: A critical review,” Renewable and Sustainable Energy Reviews, vol. 173, p. 113052, 2023

  12. [12]

    Scalable multi-agent reinforcement learning for distributed control of residential energy flexibility,

    F. Charbonnier, T. Morstyn, and M. D. McCulloch, “Scalable multi-agent reinforcement learning for distributed control of residential energy flexibility,” Applied Energy, vol. 314, p. 118825, 2022

  13. [13]

    Centralised rehearsal of decentralised cooperation: Multi-agent reinforcement learning for the scalable coordination of residential energy flexibility,

    F. Charbonnier et al., “Centralised rehearsal of decentralised cooperation: Multi-agent reinforcement learning for the scalable coordination of residential energy flexibility,” Applied Energy, 2025

  14. [14]

    Network topology optimisation (NIA2 NESO087),

    National Energy System Operator, “Network topology optimisation (NIA2 NESO087),” Tech. Rep., Dec. 2025. [Online]. Available: https://www.neso.energy/document/373681/download

  15. [15]

    Fifty years of power systems optimization,

    A. Kaya, A. J. Conejo, and S. Rebennack, “Fifty years of power systems optimization,” European Journal of Operational Research, 2025

  16. [16]

    Dynamic relationship between KKT saddle solutions and optimal solutions in AC OPF problems,

    H.-D. Chiang, Z.-Y. Wang, and L. Zeng, “Dynamic relationship between KKT saddle solutions and optimal solutions in AC OPF problems,” IEEE Transactions on Power Systems, vol. 39, no. 1, pp. 1637–1646, 2023

  17. [17]

    Model-augmented actor-critic: Backpropagating through paths,

    I. Clavera, V. Fu, and P. Abbeel, “Model-augmented actor-critic: Backpropagating through paths,” arXiv preprint arXiv:2005.08068, 2020

  18. [18]

    Deep differentiable reinforcement learning for generic energy storage day-ahead bidding with transformer neural networks,

    J. Liu, H. Guo, and Q. Chen, “Deep differentiable reinforcement learning for generic energy storage day-ahead bidding with transformer neural networks,” in 2024 PESGM. IEEE, 2024, pp. 1–5

  19. [19]

    Physics-guided safe policy learning with enhanced perception for real-time dynamic security constrained optimal power flow,

    Y. Ye et al., “Physics-guided safe policy learning with enhanced perception for real-time dynamic security constrained optimal power flow,” Journal of Modern Power Systems and Clean Energy, 2025

  20. [20]

    PODS: Policy optimization via differentiable simulation,

    M. A. Z. Mora, M. Peychev, S. Ha, M. Vechev, and S. Coros, “PODS: Policy optimization via differentiable simulation,” in International Conference on Machine Learning. PMLR, 2021, pp. 7805–7817

  21. [21]

    Gradient informed proximal policy optimization,

    S. Son, L. Zheng, R. Sullivan, Y.-L. Qiao, and M. Lin, “Gradient informed proximal policy optimization,” Advances in Neural Information Processing Systems, vol. 36, pp. 8788–8814, 2023

  22. [22]

    Reparameterization proximal policy optimization,

    H. Zhong, X. Wang, Z. Li, and L. Huang, “Reparameterization proximal policy optimization,” arXiv preprint arXiv:2508.06214, 2025

  23. [23]

    Proximal Policy Optimization Algorithms

    J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal policy optimization algorithms,” arXiv preprint arXiv:1707.06347, 2017

  24. [24]

    Self-supervised learning for large-scale preventive security constrained DC optimal power flow,

    S. Park and P. Van Hentenryck, “Self-supervised learning for large-scale preventive security constrained DC optimal power flow,” IEEE Transactions on Power Systems, vol. 40, no. 3, pp. 2205–2216, 2024

  25. [25]

    Neural risk limiting dispatch in power networks: Formulation and generalization guarantees,

    G. Chen and J. Qin, “Neural risk limiting dispatch in power networks: Formulation and generalization guarantees,” IEEE Transactions on Power Systems, 2025

  26. [26]

    A hard-constrained NN learning framework for rapidly restoring AC-OPF from DC-OPF,

    K. Chen, B. Knueven, and W. Jones, “A hard-constrained NN learning framework for rapidly restoring AC-OPF from DC-OPF,” arXiv preprint arXiv:2602.06255, 2026

  27. [27]

    Self-supervised learning of parametric approximation for security-constrained DC-OPF,

    A. Anrrango, A. Quisaguano, G. E. Constante-Flores, and C. Li, “Self-supervised learning of parametric approximation for security-constrained DC-OPF,” arXiv preprint arXiv:2601.13486, 2026

  28. [28]

    Model-free self-supervised learning for dispatching distributed energy resources,

    G. Chen, J. Qin, and H. Zhang, “Model-free self-supervised learning for dispatching distributed energy resources,” IEEE Transactions on Smart Grid, vol. 16, no. 2, pp. 1287–1300, 2024

  29. [29]

    The surprising effectiveness of PPO in cooperative multi-agent games,

    C. Yu et al., “The surprising effectiveness of PPO in cooperative multi-agent games,” Advances in Neural Information Processing Systems, vol. 35, pp. 24611–24624, 2022

  30. [30]

    Assessing the energy content of system frequency and electric vehicle charging efficiency for ancillary service provision,

    A. Thingvad, C. Ziras, J. Hu, and M. Marinelli, “Assessing the energy content of system frequency and electric vehicle charging efficiency for ancillary service provision,” in UPEC. IEEE, 2017, pp. 1–6

  31. [31]

    Analysis and modeling of cycle aging of a commercial LiFePO4/graphite cell,

    M. Naumann, F. B. Spingler, and A. Jossen, “Analysis and modeling of cycle aging of a commercial LiFePO4/graphite cell,” Journal of Power Sources, vol. 451, p. 227666, 2020

  32. [32]

    Aggregate flexibility of thermostatically controlled loads,

    H. Hao, B. M. Sanandaji, K. Poolla, and T. L. Vincent, “Aggregate flexibility of thermostatically controlled loads,” IEEE Transactions on Power Systems, vol. 30, no. 1, pp. 189–198, 2014

  33. [33]

    Linear power-flow models in multi-phase distribution networks,

    A. Bernstein and E. Dall’Anese, “Linear power-flow models in multi-phase distribution networks,” in ISGT-Europe. IEEE, 2017

  34. [34]

    PV-integration strategies for low voltage networks,

    C. Heinrich, P. Fortenbacher, A. Fuchs, and G. Andersson, “PV-integration strategies for low voltage networks,” in ENERGYCON, 2016

  35. [35]

    Comprehensive modeling of three-phase distribution systems via the bus admittance matrix,

    M. Bazrafshan and N. Gatsis, “Comprehensive modeling of three-phase distribution systems via the bus admittance matrix,” IEEE Transactions on Power Systems, vol. 33, no. 2, pp. 2015–2029, 2018

  36. [36]

    Analytic considerations and design basis for the IEEE distribution test feeders,

    K. P. Schneider et al., “Analytic considerations and design basis for the IEEE distribution test feeders,” IEEE Transactions on Power Systems, 2017

  37. [37]

    Dataset (TC1a): basic profiling of domestic smart meter customers,

    R. Wardle et al., “Dataset (TC1a): basic profiling of domestic smart meter customers,” Accessed: Sep. 27, 2020

  38. [38]

    Renewables.ninja

    S. Pfenninger and I. Staffell, “Renewables.ninja.” [Online]. Available: https://www.renewables.ninja/

  39. [39]

    Agile historical data portal,

    Octopus Energy, “Agile historical data portal,” https://agile.octopushome.net/historical-data, 2026, accessed: 2026-03-26

  40. [40]

    Estimating or Propagating Gradients Through Stochastic Neurons for Conditional Computation

    Y. Bengio, N. Léonard, and A. Courville, “Estimating or propagating gradients through stochastic neurons for conditional computation,” arXiv preprint arXiv:1308.3432, 2013