arXiv: 2604.20586 · v1 · submitted 2026-04-22 · 💻 cs.LG · cs.SY · eess.SY


A Hierarchical MARL-Based Approach for Coordinated Retail P2P Trading and Wholesale Market Participation of DERs

Patrick Wilk, Ethan Cantor, Yikui Liu, Jie Li


Pith reviewed 2026-05-10 00:30 UTC · model grok-4.3

classification 💻 cs.LG · cs.SY · eess.SY

keywords: distributed energy resources · peer-to-peer trading · multi-agent reinforcement learning · wholesale markets · Stackelberg game · demand-side participation


The pith

A hierarchical MARL approach lets individual prosumers trade energy in P2P retail auctions and aggregates them for wholesale market participation, coordinated by a Stackelberg game.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a market framework for distributed energy resources that uses multi-agent reinforcement learning at two levels. At the lower level, prosumers learn policies to participate directly in peer-to-peer retail auctions. At the higher level, these prosumers are aggregated so their combined actions can engage wholesale markets. A Stackelberg game then coordinates the two layers to improve overall market outcomes and grid flexibility as electrification and DER adoption increase.

Core claim

The central claim is that a hierarchical multi-agent deep reinforcement learning structure enables prosumers to handle retail P2P trading and wholesale participation, with the layers coordinated through a Stackelberg game to deliver enhanced market performance compared with uncoordinated approaches.

What carries the argument

The hierarchical MARL structure in which lower-level agents learn P2P retail policies and are aggregated for wholesale engagement, with a Stackelberg game serving as the coordination mechanism between the levels.
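As a rough illustration of the leader-follower ordering that a Stackelberg game imposes, the sketch below has an aggregator (leader) choose a purchase price while anticipating how prosumers (followers) respond. Everything in it is an assumption made for illustration: the quadratic follower cost, the parameter names `a` and `b`, and the grid search all stand in for the learned policies the paper describes, and none of it is taken from the paper.

```python
from typing import List, Tuple


def follower_best_response(price: float, a: float, b: float) -> float:
    """Quantity a prosumer sells to the aggregator at a given price.

    Assumes a hypothetical quadratic cost c(q) = a*q + 0.5*b*q**2, so the
    profit price*q - c(q) is maximised at q = (price - a) / b, floored at zero.
    """
    return max(0.0, (price - a) / b)


def leader_best_price(wholesale_price: float,
                      followers: List[Tuple[float, float]],
                      grid_steps: int = 200) -> Tuple[float, float]:
    """Aggregator picks the purchase price that maximises its own margin.

    The leader moves first but evaluates each candidate price by plugging in
    the followers' best responses, which is the defining Stackelberg step.
    A grid search over [0, wholesale_price] replaces any learned policy.
    """
    best_price, best_profit = 0.0, 0.0
    for i in range(grid_steps + 1):
        price = wholesale_price * i / grid_steps
        total_q = sum(follower_best_response(price, a, b) for a, b in followers)
        profit = (wholesale_price - price) * total_q  # resale margin
        if profit > best_profit:
            best_price, best_profit = price, profit
    return best_price, best_profit
```

Calling `leader_best_price(50.0, [(10.0, 2.0)])` evaluates one follower against a hypothetical wholesale price of 50. In a learning-based setup, the closed-form response and the grid search would each be replaced by a trained policy.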

If this is right

Prosumers develop autonomous policies for P2P retail auctions without central control.
Aggregated prosumers can participate more effectively in wholesale markets than isolated ones.
The Stackelberg layer improves the combined retail-wholesale performance of the DER framework.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

If the learned policies transfer across different market rules, the same hierarchy could support retail markets in multiple regions.
Extending the framework to include battery storage dynamics or network constraints would test whether the coordination layer remains effective at scale.

Load-bearing premise

The agents will converge on stable, effective trading policies under realistic market conditions and the Stackelberg coordinator will produce measurable gains, even though reward functions, state spaces, and convergence properties are not specified.

What would settle it

Simulation runs in which the full hierarchical MARL-plus-Stackelberg system produces no improvement, or produces worse results, in market efficiency or DER participation metrics than a flat MARL setup or a non-learning benchmark would falsify the claimed coordination benefit.
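The comparison described above can be sketched as a simple check over per-run metrics. This is a minimal sketch under stated assumptions: the function name, the system labels, and the "higher is better" metric convention are all hypothetical, and a real study would use a proper statistical test rather than a comparison of means.

```python
from statistics import mean
from typing import Dict, Sequence


def coordination_benefit_falsified(results: Dict[str, Sequence[float]],
                                   proposed: str = "hierarchical",
                                   tolerance: float = 0.0) -> bool:
    """Return True if the proposed system fails to beat at least one baseline.

    `results` maps a system label to its per-run metric values (higher is
    assumed better). The claim counts as falsified when the proposed system's
    mean does not exceed some baseline's mean by more than `tolerance`.
    """
    proposed_mean = mean(results[proposed])
    baseline_means = [mean(values)
                      for name, values in results.items()
                      if name != proposed]
    return any(proposed_mean <= base + tolerance for base in baseline_means)
```

The labels passed in (for example a flat MARL setup and a non-learning benchmark) and the metric itself would come from the simulation study; nothing here supplies or implies actual results.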

Figures

Figures reproduced from arXiv: 2604.20586 by Ethan Cantor, Jie Li, Patrick Wilk, Yikui Liu.

**Figure 1.** Proposed Wholesale and P2P Market Participation Framework of DERs. $O^*_k = \{o^*_{1,k}, o^*_{2,k}, \ldots, o^*_{n_o,k}\}$ with $p_{o^*_{1,k}} \le p_{o^*_{2,k}} \le \cdots \le p_{o^*_{n_o,k}}$ (6). Here, $n_b = |N_B|$ and $n_o = |N_O|$ represent the number of bids and offers in the current time interval. The superscript (*) denotes the sorted lists, and $k$ indexes the round of the matching process. $b^*_{i,k}$ is the $i$-th highest-priced bid and $o^*_{j,k}$ is the $j$-th lowes…

**Figure 2.** SEAM-LESS Stackelberg Game Framework. The proposed SEAM-LESS framework employs PPO for the aggregator and LSD-MADDPG for prosumers, leveraging their suitability for a non-iterative, hierarchical DRL approach in a Stackelberg game structure. In practical wholesale markets, bids are submitted once per period (e.g., hourly) and remain fixed, aligning with RL’s single-action policies that ensure timely, com…

**Figure 3.** Time-series of P2P participants energy overlaid with hourly wholesale price forecasts in Case I

**Figure 4.** SEAM-LESS settling price evolution of Case I

**Figure 5.** SEAM-LESS settling price evolution of Case II. Figures 4 (for Case I) and 5 (for Case II) illustrate the settling price evolution of the aggregator and P2P market participants, collectively demonstrating how market power emerges when one side’s quantity is limited yet bounded by the aggregator’s fallback. In Case I, two buyers’ total demand exceeds two sellers’ total supply, forcing buyers to compete for …
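The sorted bid and offer lists mentioned in the Figure 1 caption (bids ordered from highest price down, offers from lowest price up) can be sketched as follows. The `Order` type, its field names, and the function name are hypothetical. The matching rule applied in each round is not recoverable from the truncated caption, so the sketch stops at the sorting step.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Order:
    """One bid or offer in a trading interval (hypothetical fields)."""
    agent: str
    price: float
    quantity: float


def sort_order_book(bids: List[Order],
                    offers: List[Order]) -> Tuple[List[Order], List[Order]]:
    """Return (sorted_bids, sorted_offers) for one matching round.

    Bids are ordered by descending price, so index 0 is the highest-priced
    bid; offers are ordered by ascending price, so index 0 is the
    lowest-priced offer. Python's sort is stable, so ties keep input order.
    """
    sorted_bids = sorted(bids, key=lambda order: order.price, reverse=True)
    sorted_offers = sorted(offers, key=lambda order: order.price)
    return sorted_bids, sorted_offers
```

Sorting both sides this way places the most willing buyer and the cheapest seller at the head of their respective lists, which is the usual starting point for a double-auction style matching round.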

Original abstract

The ongoing shift towards decentralization of the electric energy sector, driven by the growing electrification across end-use sectors, and widespread adoption of distributed energy resources (DERs), necessitates their active participation in the electricity markets to support grid operations. Furthermore, with bi-directional energy and communication flows becoming standard, intelligent, easy-to-deploy, resource-conservative demand-side participation is expected to play a critical role in securing power grid operational flexibility and market efficiency. This work proposes a market engagement framework that leverages a hierarchical multi-agent deep reinforcement learning (MARL) approach to enable individual prosumers to participate in peer-to-peer retail auctions and further aggregate these intelligent prosumers to facilitate effective DER participation in wholesale markets. Ultimately, a Stackelberg game is proposed to coordinate this hierarchical MARL-based DER market participation framework toward enhanced market performance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This is a high-level proposal for hierarchical MARL plus Stackelberg coordination of DERs in P2P retail and wholesale markets, but it contains no reward functions, algorithms, simulations, or results.


The paper outlines a framework where individual prosumers use multi-agent deep RL to trade in peer-to-peer retail auctions, then get aggregated for wholesale market bids, with a Stackelberg layer on top to align the levels. The motivation around rising DERs and the need for flexible market participation is reasonable and timely for the energy systems community. The hierarchical structure they sketch could be a sensible way to handle scale differences between retail and wholesale, and the choice to combine MARL with a game-theoretic coordinator is a natural extension of existing work in the area. That framing is the main thing the paper contributes at this stage.

The problem is that none of the actual technical content is present. There are no definitions of the state or action spaces, no reward functions that would drive the learning, no network architectures, no convergence arguments, and no simulation setup or baseline comparisons. Without those pieces the claim that the approach will produce stable policies and measurable market gains stays untested. The abstract and the rest of the manuscript read as a design sketch rather than a completed piece of research.

Readers already working on MARL for energy markets might find the high-level architecture useful as a prompt for their own experiments, but there is nothing here that can be directly used, reproduced, or cited. For a serious referee process this would need the missing methods and results sections filled in first. I would not bring it to a reading group in its current form and would not cite it.

Referee Report

1 major / 0 minor

Summary. The manuscript proposes a hierarchical multi-agent deep reinforcement learning (MARL) framework to enable individual prosumers to participate in peer-to-peer (P2P) retail energy auctions, with aggregation of these prosumers to facilitate DER participation in wholesale markets, coordinated via a Stackelberg game to achieve enhanced market performance.

Significance. If fully implemented and validated with concrete algorithms and empirical results, the proposed framework could contribute to improved coordination of DERs across retail and wholesale markets by combining decentralized MARL-based trading with hierarchical game-theoretic coordination. This has potential relevance for grid flexibility and market efficiency under high DER penetration. However, the manuscript contains no technical details or evidence, so its significance cannot be assessed.

major comments (1)

Abstract: The central claim that the hierarchical MARL approach plus Stackelberg coordination will enable stable prosumer P2P trading and yield measurable wholesale-market performance gains is unsupported, as the manuscript supplies no reward functions, state or action space definitions, network architectures, convergence arguments, simulation setups, or any quantitative results or baseline comparisons.

Simulated Authors' Rebuttal

1 response · 0 unresolved

We thank the referee for the detailed review and for highlighting the need for concrete technical and empirical support. We agree that the current manuscript is primarily a conceptual framework proposal and lacks the implementation specifics and results necessary to substantiate the performance claims. Below we respond point-by-point to the major comment and outline the revisions we will make.


Referee: Abstract: The central claim that the hierarchical MARL approach plus Stackelberg coordination will enable stable prosumer P2P trading and yield measurable wholesale-market performance gains is unsupported, as the manuscript supplies no reward functions, state or action space definitions, network architectures, convergence arguments, simulation setups, or any quantitative results or baseline comparisons.

Authors: We fully agree that the abstract's claims require supporting technical details and evidence, which are absent from the current version. The manuscript presents a high-level market engagement framework rather than a fully implemented and validated algorithm. In the revised manuscript we will add: (i) explicit definitions of the state and action spaces for the prosumers and aggregator agents, (ii) the reward functions used in the hierarchical MARL setup, (iii) the neural network architectures and training procedures, (iv) a formal description of the Stackelberg coordination mechanism including leader-follower equilibrium conditions, and (v) a dedicated simulation section with quantitative results, convergence analysis, and comparisons against relevant baselines (e.g., non-coordinated MARL and centralized optimization). These additions will directly address the lack of evidence for stable P2P trading and wholesale-market gains.

Revision: yes

Circularity Check

0 steps flagged

No derivation chain or load-bearing reductions present


The paper proposes a hierarchical MARL framework plus Stackelberg coordination for DER market participation but supplies no equations, reward functions, state/action spaces, convergence arguments, or fitted parameters. The abstract and text contain only high-level design statements with no self-definitional loops, fitted inputs renamed as predictions, or self-citation chains that reduce the central claim to its own inputs. All performance assertions remain unverified proposals rather than derived results, so no circularity exists.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract contains no explicit parameters, axioms, or invented entities; all technical details are absent.


