LLM-Enhanced Multi-Agent Reinforcement Learning with Expert Workflow for Real-Time P2P Energy Trading

Chengwei Lou; Guangfei Geng; Jin Yang; Lu Zhang; Wei Tang; Zekai Jin

arxiv: 2507.14995 · v4 · submitted 2025-07-20 · 💻 cs.MA

Chengwei Lou, Zekai Jin, Wei Tang, Guangfei Geng, Jin Yang, Lu Zhang

Pith reviewed 2026-05-19 04:08 UTC · model grok-4.3

keywords: P2P energy trading · LLM-MARL · imitation learning · multi-agent reinforcement learning · real-time electricity market · voltage violation · differential attention

The pith

Large language models substitute for human experts by generating strategies that guide multi-agent reinforcement learning in real-time P2P energy trading.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes using large language models to generate personalized strategies for prosumers in peer-to-peer electricity markets. These strategies then train multi-agent reinforcement learning agents through imitation learning, with a differential attention critic to manage large networks. This setup aims to overcome the lack of expert experience and technical skills among participants while ensuring grid safety. A reader would care because it could enable more efficient use of renewable energy through faster, safer trading decisions without relying on scarce human experts.

Core claim

The central claim is that LLMs serve as experts to generate personalized strategies for real-time P2P energy trading, which are then imitated by MARL agents in a centralized training decentralized execution framework. The differential attention-based critic network extracts key interaction features to improve scalability and convergence. Experiments show these imitative expert MARL algorithms yield lower economic costs and voltage violation rates than baselines on test sets and maintain robust stability, effectively allowing LLM strategies to replace human experts.
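To make the "differential attention" idea concrete, here is a minimal single-head sketch in NumPy. It follows the general recipe of the Differential Transformer work listed among this page's references: two softmax attention maps are computed from split query/key projections and the second is subtracted from the first. The function name, the fixed `lam` value, the matrix shapes, and the use of 20 agent embeddings are illustrative assumptions; the paper's actual critic is multi-head and may differ in normalization and in how `lam` is learned.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def differential_attention(x, wq, wk, wv, lam=0.5):
    """Single-head differential attention over per-agent embeddings.

    x:      (n_agents, d_model) embeddings, one row per agent
    wq, wk: (d_model, 2 * d_head) projections, split into two halves
    wv:     (d_model, d_v) value projection
    lam:    weight on the second attention map (fixed here; an assumption)
    """
    d_head = wq.shape[1] // 2
    q1, q2 = np.split(x @ wq, 2, axis=-1)
    k1, k2 = np.split(x @ wk, 2, axis=-1)
    v = x @ wv
    a1 = softmax(q1 @ k1.T / np.sqrt(d_head))
    a2 = softmax(q2 @ k2.T / np.sqrt(d_head))
    # Subtracting the second map is meant to cancel attention noise
    # that both maps share.
    return (a1 - lam * a2) @ v

# Random placeholder inputs, not data from the paper.
rng = np.random.default_rng(0)
n_agents, d_model, d_head, d_v = 20, 16, 8, 16
x = rng.normal(size=(n_agents, d_model))
wq = rng.normal(size=(d_model, 2 * d_head))
wk = rng.normal(size=(d_model, 2 * d_head))
wv = rng.normal(size=(d_model, d_v))
out = differential_attention(x, wq, wk, wv, lam=0.5)
print(out.shape)  # (20, 16)
```

With `lam=0.0` the function reduces to ordinary scaled dot-product attention on the first half of the projections, which is a convenient sanity check.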

What carries the argument

LLM-generated personalized strategies imitated by MARL agents via imitation learning, augmented by a differential attention-based critic network.
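One way to picture the imitation step is as a penalty that pulls each agent's policy toward the LLM expert's policy. The paper passage quoted in the Lean-theorem section of this page mentions a Wasserstein term between the two policies. The sketch below assumes, purely for illustration, that both policies are diagonal Gaussians, for which the squared 2-Wasserstein distance has a closed form. The function names, the weight `beta`, and the way the penalty is combined with the critic value are hypothetical and are not taken from the paper.

```python
def w2_squared_diag_gaussians(mu_p, sigma_p, mu_q, sigma_q):
    """Squared 2-Wasserstein distance between two diagonal Gaussians.

    Closed form: ||mu_p - mu_q||^2 + ||sigma_p - sigma_q||^2.
    """
    mean_term = sum((a - b) ** 2 for a, b in zip(mu_p, mu_q))
    std_term = sum((a - b) ** 2 for a, b in zip(sigma_p, sigma_q))
    return mean_term + std_term

def actor_loss(q_value, mu_agent, sigma_agent, mu_llm, sigma_llm, beta=0.1):
    """Hypothetical imitation-regularised actor loss: -Q + beta * W2^2."""
    penalty = w2_squared_diag_gaussians(mu_agent, sigma_agent, mu_llm, sigma_llm)
    return -q_value + beta * penalty

# Illustrative numbers only (not from the paper).
loss = actor_loss(
    q_value=1.0,
    mu_agent=[0.2, -0.1], sigma_agent=[0.5, 0.5],
    mu_llm=[0.0, 0.0], sigma_llm=[0.3, 0.3],
    beta=0.1,
)
print(round(loss, 4))  # -0.987
```

If the LLM expert only supplies deterministic actions, the same form applies with the expert standard deviations set to zero, although that is again an assumption about the setup rather than something this page states.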

If this is right

LLM strategies effectively substitute for human experts in providing guidance.
The algorithms achieve lower economic costs in P2P trading scenarios.
Voltage violation rates decrease compared to baseline methods.
Robust stability is preserved even in large-scale networks.
The differential attention mechanism addresses scalability challenges in P2P systems.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This framework might apply to other real-time multi-agent decision tasks like demand response in smart grids.
Future work could test the approach with actual hardware implementations to check for real-world transfer issues.
Combining LLMs and MARL this way could lower barriers for small-scale energy participants to engage in markets.

Load-bearing premise

The assumption that LLM-generated strategies transfer to MARL agents via imitation learning without introducing biases, security vulnerabilities, or undetected distribution network violations.

What would settle it

Deploying the system on a real-world distribution network and observing if economic costs and voltage violations remain lower than those from baseline algorithms during periods of high renewable variability.
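Such a comparison needs an explicit definition of the violation metric. As a rough sketch, the voltage violation rate can be taken as the share of (time step, bus) voltage samples that fall outside an allowed per-unit band. The function name, the 0.95 to 1.05 p.u. band, and the sample values are assumptions made for illustration; the paper may define the metric differently.

```python
def voltage_violation_rate(voltages_pu, v_min=0.95, v_max=1.05):
    """Fraction of (time step, bus) samples outside [v_min, v_max] per unit."""
    samples = [v for step in voltages_pu for v in step]
    if not samples:
        raise ValueError("no voltage samples")
    violations = sum(1 for v in samples if v < v_min or v > v_max)
    return violations / len(samples)

# Two time steps x three buses, invented numbers.
trace = [[1.00, 1.04, 1.06], [0.94, 0.99, 1.02]]
rate = voltage_violation_rate(trace)
print(rate)  # 0.3333333333333333
```

Computing the same quantity for the proposed method and for each baseline on identical traces keeps the comparison like for like.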

Figures

Figures reproduced from arXiv: 2507.14995 by Chengwei Lou, Guangfei Geng, Jin Yang, Lu Zhang, Wei Tang, Zekai Jin.

**Figure 1.** Our proposed LLM-MARL framework is capable of executing energy trading tasks independently. The framework as shown in the ...

**Figure 2.** Our proposed LLM expert workflow in P2P energy trading

**Figure 3.** Multi-head differential attention critic network

**Figure 4.** IEEE 141-bus distribution networks with twenty ...

**Figure 5.** Comparative experiments on the framework proposed ...

**Figure 6.** Comparison chart of the performance of baseline algorithms

**Figure 7.** Ablation study on the number of attention heads

Original abstract

Real-time peer-to-peer (P2P) electricity markets dynamically adapt to fluctuations in renewable energy and variations in demand, maximizing economic benefits through instantaneous price responses while enhancing grid flexibility. However, scaling expert guidance for massive personalized prosumers poses critical challenges, including diverse decision-making demands and a lack of customized modeling frameworks. This paper proposes an integrated large language model-multi-agent reinforcement learning (LLM-MARL) framework for real-time P2P energy trading to address challenges such as the limited technical capability of prosumers, the lack of expert experience, and security issues of distribution networks. LLMs are introduced as experts to generate personalized strategies, guiding MARL under the centralized training with decentralized execution (CTDE) paradigm through imitation. To handle the scalability issues inherent in large-scale P2P networks, a differential attention-based critic network is introduced to efficiently extract key interaction features and enhance convergence. Experimental results demonstrate that LLM-generated strategies effectively substitute human experts. The proposed imitative expert MARL algorithms achieve significantly lower economic costs and voltage violation rates on test sets compared to baseline algorithms, while maintaining robust stability. This paper provides an effective solution for the real-time decision-making of the P2P electricity market by bridging expert knowledge with agent learning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

The paper combines LLMs as expert strategy generators with imitation learning in a CTDE MARL setup plus a differential attention critic for P2P energy trading, but the results do not isolate whether gains come from the LLM or from the rest of the architecture.

The main point is that this work tries to scale expert-level guidance to many prosumers in real-time P2P electricity markets by letting LLMs generate personalized strategies that then train MARL agents via imitation under centralized training with decentralized execution. A differential attention critic is added to pull out key interactions and help convergence in larger networks. The problem itself is practical, given the need for grid flexibility with renewables and the limits of human experts for thousands of participants. The framework description is straightforward and the claim of lower economic costs plus reduced voltage violations on test sets is stated directly. That combination of LLM imitation and the attention critic for this exact setting does not appear in prior work cited in the abstract, so the integration counts as new. The stability results are also presented without obvious contradictions.

The soft spots sit mainly around evidence. The abstract and setup give no named baselines, no dataset sizes, no statistical tests, and no ablation that holds the MARL and critic fixed while varying the LLM imitation component. There is also no direct comparison to human experts or isolated check on whether the LLM outputs introduce new constraint violations or biases before transfer. The stress-test note is accurate on this: without those pieces it is difficult to conclude that the LLM is truly substituting for experts rather than the overall training paradigm carrying the performance. If the full paper supplies those details and code, the picture improves.

This paper is aimed at researchers working on hybrid LLM-RL methods for energy systems or multi-agent control in constrained networks. A reader already familiar with CTDE and attention mechanisms could extract the workflow and try adapting it, but would need to verify the experimental claims themselves.

I would send it to peer review because the application area is relevant and the proposed structure is coherent enough to warrant referee input on the missing controls and baselines.

Referee Report

3 major / 2 minor

Summary. The paper proposes an LLM-MARL framework for real-time P2P energy trading in which large language models generate personalized strategies that guide multi-agent reinforcement learning agents via imitation learning under the centralized training with decentralized execution (CTDE) paradigm. A differential attention-based critic network is introduced to extract key interaction features and improve scalability. The central claim is that LLM-generated strategies effectively substitute for human experts, with the imitative expert MARL algorithms achieving significantly lower economic costs and voltage violation rates on test sets compared to baseline algorithms while maintaining robust stability.

Significance. If the results hold after proper validation, the work could offer a practical bridge between expert knowledge and scalable agent learning for dynamic P2P electricity markets, helping address prosumers' limited technical capabilities and distribution-network security constraints. The differential attention critic addresses a genuine scalability concern in large agent populations. However, the current lack of baseline details and human-expert comparisons prevents a full assessment of impact.

major comments (3)

[Abstract] The assertion that 'LLM-generated strategies effectively substitute human experts' lacks any direct comparison to human-expert baselines or isolated evaluation of LLM strategy quality (bias, constraint violations, or security issues), so the substitution claim is unsupported by the reported experiments.
[Experiments] No details are supplied on the specific baseline algorithms, dataset size, number of prosumers, statistical significance tests, or hyper-parameter choices, preventing verification of the claimed reductions in economic costs and voltage violation rates.
[Methodology] The manuscript contains no ablation study separating the contribution of LLM imitation learning from the differential attention critic, leaving open the possibility that performance gains arise primarily from the critic architecture rather than the expert workflow.
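The request for statistical significance tests can be illustrated with a paired comparison across matched runs (for example, the same random seeds or test episodes evaluated under a baseline and under the proposed method). The sketch below computes only the paired t statistic and its degrees of freedom using the Python standard library. The function name and the cost values are invented placeholders, not results from the paper.

```python
import math
import statistics

def paired_t_statistic(baseline, proposed):
    """Return (t, degrees_of_freedom) for paired samples.

    A positive t means the baseline values are larger on average.
    """
    if len(baseline) != len(proposed) or len(baseline) < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    diffs = [b - p for b, p in zip(baseline, proposed)]
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    if sd_diff == 0:
        raise ValueError("all paired differences are identical")
    n = len(diffs)
    return statistics.mean(diffs) / (sd_diff / math.sqrt(n)), n - 1

# Placeholder per-seed costs, invented for illustration.
baseline_cost = [10.0, 12.0, 11.0, 13.0]
proposed_cost = [9.0, 10.0, 10.0, 11.0]
t, dof = paired_t_statistic(baseline_cost, proposed_cost)
print(round(t, 3), dof)  # 5.196 3
```

The statistic would then be compared against a t distribution with `dof` degrees of freedom to obtain a p-value; with only a handful of seeds, reporting the per-seed differences alongside the test is usually more informative than the p-value alone.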

minor comments (2)

[Related Work] Related-work section would benefit from additional citations to recent LLM-assisted RL and P2P energy-trading literature to better position the novelty.
[Figures] Figure captions and legends should explicitly state the metrics (economic cost, violation rate) and agent counts shown in each plot for clarity.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments, which help improve the clarity and rigor of our work. We address each major comment below and will incorporate revisions to strengthen the manuscript.

Referee: [Abstract] The assertion that 'LLM-generated strategies effectively substitute human experts' lacks any direct comparison to human-expert baselines or isolated evaluation of LLM strategy quality (bias, constraint violations, or security issues), so the substitution claim is unsupported by the reported experiments.

Authors: We agree that the substitution claim requires qualification, as our experiments compare LLM-guided agents against standard MARL baselines rather than direct human-expert performance. The LLMs are prompted with domain-specific expert reasoning templates for P2P trading decisions. In the revision, we will modify the abstract and related claims to state that LLM-generated strategies provide effective guidance that can substitute for the need for human experts in simulation, and we will add a new subsection evaluating LLM strategy quality with respect to constraint violations and security metrics.
revision: yes

Referee: [Experiments] No details are supplied on the specific baseline algorithms, dataset size, number of prosumers, statistical significance tests, or hyper-parameter choices, preventing verification of the claimed reductions in economic costs and voltage violation rates.

Authors: We acknowledge the need for these details to ensure reproducibility. The revised Experiments section will explicitly list the baseline algorithms (MADDPG, QMIX, and independent DDPG), report the simulation setup with 20 prosumers over 5000 time steps drawn from a real-world distribution network dataset, include statistical significance testing (paired t-tests with p-values < 0.01 for the reported cost and violation reductions), and provide a hyper-parameter table covering learning rates, attention heads, imitation loss weights, and network architectures.
revision: yes

Referee: [Methodology] The manuscript contains no ablation study separating the contribution of LLM imitation learning from the differential attention critic, leaving open the possibility that performance gains arise primarily from the critic architecture rather than the expert workflow.

Authors: We agree that an ablation study is required to isolate contributions. The revised manuscript will include a dedicated ablation subsection comparing four variants: (i) full LLM-MARL with differential attention critic, (ii) differential attention critic without LLM imitation, (iii) LLM imitation with a standard centralized critic, and (iv) a plain MARL baseline. Results will be reported on the same test metrics to demonstrate the incremental benefits of each component.
revision: yes
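The four ablation variants named in the last response form a 2x2 grid: LLM imitation on or off, crossed with the critic type. A small sketch of how such a grid might be enumerated is shown below. The dictionary keys and critic labels are hypothetical names chosen for illustration and do not come from the paper or its code.

```python
from itertools import product

def ablation_variants():
    """Enumerate the 2x2 grid: LLM imitation on/off x critic type."""
    variants = []
    critics = ["differential_attention", "standard"]
    for use_llm, critic in product([True, False], critics):
        variants.append({"llm_imitation": use_llm, "critic": critic})
    return variants

variants = ablation_variants()
for v in variants:
    print(v)
```

The first entry corresponds to the full method (i) and the last to the plain baseline (iv); the two middle entries are the single-component variants (iii) and (ii) respectively.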

Circularity Check

0 steps flagged

No significant circularity in derivation or claims

The paper presents an empirical framework combining LLMs for strategy generation with MARL under CTDE, validated via experiments against baseline algorithms. No equations, definitions, or self-citations in the abstract or described setup reduce performance metrics (economic costs, voltage violations) to fitted inputs or self-referential quantities by construction. The central substitution claim rests on external comparisons rather than internal loops, ansatzes smuggled via prior self-work, or renaming of known results. This is a standard empirical contribution with independent experimental content.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available, so concrete free parameters, axioms, and invented entities cannot be extracted. The framework implicitly relies on standard assumptions of reinforcement learning and the capability of LLMs to produce useful expert trajectories.

pith-pipeline@v0.9.0 · 5767 in / 1050 out tokens · 43098 ms · 2026-05-19T04:08:42.080190+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Paper passage: modeled ... as a Dec-POMDP, represented as an eight-tuple ⟨I, A, S, O, P, r, π, γ⟩

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: Wasserstein metric ... $\hat{W}_2\big(\pi_{\phi_i}(\cdot \mid o_{i,t}),\ \pi_{\mathrm{LLM}}(\cdot \mid o_{i,t})\big)$

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

36 extracted references · 36 canonical work pages · 2 internal anchors

[1] T. Capper, A. Gorbatcheva, M. A. Mustafa, M. Bahloul, J. M. Schwidtal, R. Chitchyan, M. Andoni, V. Robu, M. Montakhabi, I. J. Scott, C. Francis, T. Mbavarira, J. M. Espana, and L. Kiesling, “Peer-to-peer, community self-consumption, and transactive energy: A systematic literature review of local energy market models,” ...
[2] C. Feng and A. L. Liu, “Peer-to-peer energy trading of solar and energy storage: A networked multiagent reinforcement learning approach,” Applied Energy, vol. 383, p. 125283, 2025.
[3] C. Feng, B. Liang, Z. Li, W. Liu, and F. Wen, “Peer-to-peer energy trading under network constraints based on generalized fast dual ascent,” IEEE Transactions on Smart Grid, vol. 14, no. 2, pp. 1441–1453, 2023.
[4] X. Xu, K. Xu, Z. Zeng, J. Tang, Y. He, G. Shi, and T. Zhang, “Collaborative optimization of multi-energy multi-microgrid system: A hierarchical trust-region multi-agent reinforcement learning approach,” Applied Energy, vol. 375, p. 123923, 2024.
[5]–[6] T. L. Paine, C. Gulcehre, B. Shahriari, M. Denil, M. Hoffman, H. Soyer, R. Tanburn, S. Kapturowski, N. Rabinowitz, D. Williams, G. Barth-Maron, Z. Wang, N. de Freitas, and W. Team, “Making efficient use of demonstrations to solve hard exploration problems,” arXiv preprint arXiv:1909.01387. [Online]. Available: https://arxiv.org/abs/1909.01387
[7] S. Gao, C. Xiang, M. Yu, K. T. Tan, and T. H. Lee, “Online optimal power scheduling of a microgrid via imitation learning,” IEEE Transactions on Smart Grid, vol. 13, no. 2, pp. 861–876, 2022.
[8] Y. Zhang, F. Qiu, T. Hong, Z. Wang, and F. Li, “Hybrid imitation learning for real-time service restoration in resilient distribution systems,” IEEE Transactions on Industrial Informatics, vol. 18, no. 3, pp. 2089–2099, 2022.
[9] C. Huang, Z. Tang, S. Hu, R. Jiang, X. Zheng, D. Ge, B. Wang, and Z. Wang, “Orlm: A customizable framework in training large models for automated optimization modeling,” 2025. [Online]. Available: https://arxiv.org/abs/2405.17743
[10] X. Yang, C. Lin, H. Liu, and W. Wu, “Rl2: Reinforce large language model to assist safe reinforcement learning for energy management of active distribution networks,” IEEE Transactions on Smart Grid, vol. 16, no. 4, pp. 3419–3431, 2025.
[11] C. Huang, S. Li, R. Liu, H. Wang, and Y. Chen, “Large foundation models for power systems,” in 2024 IEEE Power & Energy Society General Meeting (PESGM), 2024, pp. 1–5.
[12] M. Jia, Z. Cui, and G. Hug, “Enhancing llms for power system simulations: A feedback-driven multi-agent framework,” 2025. [Online]. Available: https://arxiv.org/abs/2411.16707
[13] C. Xu, J. Liu, S. Fang, Y. Cui, D. Chen, P. Hang, and J. Sun, “Tell-drive: Enhancing autonomous driving with teacher llm-guided deep reinforcement learning,” 2025. [Online]. Available: https://arxiv.org/abs/2502.01387
[14]–[15] H. Pang, Z. Wang, and G. Li, “Large language model guided deep reinforcement learning for decision making in autonomous driving,” [Online]. Available: https://arxiv.org/abs/2412.18511
[16] L. Kraemer and B. Banerjee, “Multi-agent reinforcement learning as a rehearsal for decentralized planning,” Neurocomputing, vol. 190, pp. 82–94, 2016.
[17] Y. Zhou, S. Liu, Y. Qing, K. Chen, T. Zheng, J. Song, and M. Song, “Is centralized training with decentralized execution framework centralized enough for marl?” 2025. [Online]. Available: https://arxiv.org/abs/2305.17352
[18] Y. Wang, D. Shi, C. Xue, H. Jiang, G. Wang, and P. Gong, “Ahac: Actor hierarchical attention critic for multi-agent reinforcement learning,” in 2020 IEEE International Conference on Systems, Man, and Cybernetics (SMC), 2020, pp. 3013–3020.
[19] X. Yang, H. Liu, and W. Wu, “Attention-enhanced multi-agent reinforcement learning against observation perturbations for distributed volt-var control,” IEEE Transactions on Smart Grid, vol. 15, no. 6, pp. 5761–5772, 2024.
[20] F. Yang, D. Huang, D. Li, S. Lin, S. M. Muyeen, and H. Zhai, “Data-driven load frequency control based on multi-agent reinforcement learning with attention mechanism,” IEEE Transactions on Power Systems, vol. 38, no. 6, pp. 5560–5569, 2023.
[21] G. Gao, Y. Wen, and D. Tao, “Distributed energy trading and scheduling among microgrids via multiagent reinforcement learning,” IEEE Transactions on Neural Networks and Learning Systems, vol. 34, no. 12, pp. 10638–10652, 2023.
[22] S. Savino, T. Minella, Z. Nagy, and A. Capozzoli, “A scalable demand-side energy management control strategy for large residential districts based on an attention-driven multi-agent drl approach,” Applied Energy, vol. 393, p. 125993, 2025.
[23] T. Ye, L. Dong, Y. Xia, Y. Sun, Y. Zhu, G. Huang, and F. Wei, “Differential transformer,” 2025. [Online]. Available: https://arxiv.org/abs/2410.05258
[24] C. Mu, T. Ding, Y. Huang, S. Zhu, P. Siano, M. Shahidehpour, and X. Shen, “Distributed collaboration method for peer-to-peer transactions in reconfigurable distribution network,” IEEE Transactions on Power Systems, vol. 40, no. 4, pp. 3029–3042, 2025.
[25] T. Xiao and P. Xu, “Exploring automated energy optimization with unstructured building data: A multi-agent based framework leveraging large language models,” Energy and Buildings, vol. 322, p. 114691, 2024.
[26] G. Dong, X. Li, Y. Zhang, and M. Deng, “Leveraging llm-assisted query understanding for live retrieval-augmented generation,” 2025. [Online]. Available: https://arxiv.org/abs/2506.21384
[27] S. Diamond and S. Boyd, “CVXPY: A Python-embedded modeling language for convex optimization,” Journal of Machine Learning Research, vol. 17, no. 83, pp. 1–5, 2016.
[28] X. Lyu, A. Baisero, Y. Xiao, and C. Amato, “A deeper understanding of state-based critics in multi-agent reinforcement learning,” Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, no. 9, pp. 9396–9404, Jun. 2022. [Online]. Available: https://ojs.aaai.org/index.php/AAAI/article/view/21171
[29] Z. Huang, J. Wu, and C. Lv, “Efficient deep reinforcement learning with imitative expert priors for autonomous driving,” IEEE Transactions on Neural Networks and Learning Systems, vol. 34, no. 10, pp. 7391–7403, 2023.
[30] Elia, “Elia wind/solar power/grid data set,” https://www.elia.be, 2025, accessed: May 02, 2025.
[31] M. Bettini, A. Prorok, and V. Moens, “Benchmarl: Benchmarking multi-agent reinforcement learning,” 2024. [Online]. Available: https://arxiv.org/abs/2312.01472
[32] S. Iqbal and F. Sha, “Actor-attention-critic for multi-agent reinforcement learning,” 2019. [Online]. Available: https://arxiv.org/abs/1810.02912
[33] S. Fujimoto and S. S. Gu, “A minimalist approach to offline reinforcement learning,” 2021. [Online]. Available: https://arxiv.org/abs/2106.06860
[34] J. Song, H. Ren, D. Sadigh, and S. Ermon, “Multi-agent generative adversarial imitation learning,” 2018. [Online]. Available: https://arxiv.org/abs/1807.09936
[35] LangChain Inc. LangGraph. [Online]. Available: https://langchain-ai.github.io/langgraph/
[36] [Online]. Available: https://github.com/jzk0806/P2P-llm-supplementary

[1] [1]

Peer-to-peer, community self-consumption, and transactive energy: A systematic liter- ature review of local energy market models,

T. Capper, A. Gorbatcheva, M. A. Mustafa, M. Bahloul, J. M. Schwid- tal, R. Chitchyan, M. Andoni, V . Robu, M. Montakhabi, I. J. Scott, JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2015 10 C. Francis, T. Mbavarira, J. M. Espana, and L. Kiesling, “Peer-to-peer, community self-consumption, and transactive energy: A systematic liter- ature review of ...

work page 2015

[2] [2]

Peer-to-peer energy trading of solar and energy storage: A networked multiagent reinforcement learning approach,

C. Feng and A. L. Liu, “Peer-to-peer energy trading of solar and energy storage: A networked multiagent reinforcement learning approach,” Applied Energy, vol. 383, p. 125283, 2025

work page 2025

[3] [3]

Peer-to-peer energy trading under network constraints based on generalized fast dual ascent,

C. Feng, B. Liang, Z. Li, W. Liu, and F. Wen, “Peer-to-peer energy trading under network constraints based on generalized fast dual ascent,” IEEE Transactions on Smart Grid , vol. 14, no. 2, pp. 1441–1453, 2023

work page 2023

[4] [4]

Collaborative optimization of multi-energy multi-microgrid system: A hierarchical trust-region multi-agent reinforcement learning approach,

X. Xu, K. Xu, Z. Zeng, J. Tang, Y . He, G. Shi, and T. Zhang, “Collaborative optimization of multi-energy multi-microgrid system: A hierarchical trust-region multi-agent reinforcement learning approach,” Applied Energy, vol. 375, p. 123923, 2024

work page 2024

[5] [5]

Making efficient use of demonstrations to solve hard exploration problems,

T. L. Paine, C. Gulcehre, B. Shahriari, M. Denil, M. Hoffman, H. Soyer, R. Tanburn, S. Kapturowski, N. Rabinowitz, D. Williams, G. Barth-Maron, Z. Wang, N. de Freitas, and W. Team, “Making efficient use of demonstrations to solve hard exploration problems,”

work page

[6] [6]

arXiv preprint arXiv:1909.01387 , year=

[Online]. Available: https://arxiv.org/abs/1909.01387

work page arXiv 1909

[7] [7]

Online optimal power scheduling of a microgrid via imitation learning,

S. Gao, C. Xiang, M. Yu, K. T. Tan, and T. H. Lee, “Online optimal power scheduling of a microgrid via imitation learning,” IEEE Transac- tions on Smart Grid , vol. 13, no. 2, pp. 861–876, 2022

work page 2022

[8] [8]

Hybrid imitation learn- ing for real-time service restoration in resilient distribution systems,

Y . Zhang, F. Qiu, T. Hong, Z. Wang, and F. Li, “Hybrid imitation learn- ing for real-time service restoration in resilient distribution systems,” IEEE Transactions on Industrial Informatics , vol. 18, no. 3, pp. 2089– 2099, 2022

work page 2089

[9] [9]

Orlm: A customizable framework in training large models for automated optimization modeling,

C. Huang, Z. Tang, S. Hu, R. Jiang, X. Zheng, D. Ge, B. Wang, and Z. Wang, “Orlm: A customizable framework in training large models for automated optimization modeling,” 2025. [Online]. Available: https://arxiv.org/abs/2405.17743

work page arXiv 2025

[10] [10]

Rl2: Reinforce large language model to assist safe reinforcement learning for energy management of active distribution networks,

X. Yang, C. Lin, H. Liu, and W. Wu, “Rl2: Reinforce large language model to assist safe reinforcement learning for energy management of active distribution networks,”IEEE Transactions on Smart Grid, vol. 16, no. 4, pp. 3419–3431, 2025

work page 2025

[11] [11]

Large foundation models for power systems,

C. Huang, S. Li, R. Liu, H. Wang, and Y . Chen, “Large foundation models for power systems,” in 2024 IEEE Power & Energy Society General Meeting (PESGM) , 2024, pp. 1–5

work page 2024

[12] [12]

Enhancing llms for power system simulations: A feedback-driven multi-agent framework,

M. Jia, Z. Cui, and G. Hug, “Enhancing llms for power system simulations: A feedback-driven multi-agent framework,” 2025. [Online]. Available: https://arxiv.org/abs/2411.16707

work page arXiv 2025

[13] C. Xu, J. Liu, S. Fang, Y. Cui, D. Chen, P. Hang, and J. Sun, “Tell-drive: Enhancing autonomous driving with teacher llm-guided deep reinforcement learning,” 2025. [Online]. Available: https://arxiv.org/abs/2502.01387

[14] H. Pang, Z. Wang, and G. Li, “Large language model guided deep reinforcement learning for decision making in autonomous driving,” [Online]. Available: https://arxiv.org/abs/2412.18511

[16] L. Kraemer and B. Banerjee, “Multi-agent reinforcement learning as a rehearsal for decentralized planning,” Neurocomputing, vol. 190, pp. 82–94, 2016

[17] Y. Zhou, S. Liu, Y. Qing, K. Chen, T. Zheng, J. Song, and M. Song, “Is centralized training with decentralized execution framework centralized enough for marl?” 2025. [Online]. Available: https://arxiv.org/abs/2305.17352

[18] Y. Wang, D. Shi, C. Xue, H. Jiang, G. Wang, and P. Gong, “Ahac: Actor hierarchical attention critic for multi-agent reinforcement learning,” in 2020 IEEE International Conference on Systems, Man, and Cybernetics (SMC), 2020, pp. 3013–3020

[19] X. Yang, H. Liu, and W. Wu, “Attention-enhanced multi-agent reinforcement learning against observation perturbations for distributed volt-var control,” IEEE Transactions on Smart Grid, vol. 15, no. 6, pp. 5761–5772, 2024

[20] F. Yang, D. Huang, D. Li, S. Lin, S. M. Muyeen, and H. Zhai, “Data-driven load frequency control based on multi-agent reinforcement learning with attention mechanism,” IEEE Transactions on Power Systems, vol. 38, no. 6, pp. 5560–5569, 2023

[21] G. Gao, Y. Wen, and D. Tao, “Distributed energy trading and scheduling among microgrids via multiagent reinforcement learning,” IEEE Transactions on Neural Networks and Learning Systems, vol. 34, no. 12, pp. 10 638–10 652, 2023

[22] S. Savino, T. Minella, Z. Nagy, and A. Capozzoli, “A scalable demand-side energy management control strategy for large residential districts based on an attention-driven multi-agent drl approach,” Applied Energy, vol. 393, p. 125993, 2025

[23] T. Ye, L. Dong, Y. Xia, Y. Sun, Y. Zhu, G. Huang, and F. Wei, “Differential transformer,” 2025. [Online]. Available: https://arxiv.org/abs/2410.05258

[24] C. Mu, T. Ding, Y. Huang, S. Zhu, P. Siano, M. Shahidehpour, and X. Shen, “Distributed collaboration method for peer-to-peer transactions in reconfigurable distribution network,” IEEE Transactions on Power Systems, vol. 40, no. 4, pp. 3029–3042, 2025

[25] T. Xiao and P. Xu, “Exploring automated energy optimization with unstructured building data: A multi-agent based framework leveraging large language models,” Energy and Buildings, vol. 322, p. 114691, 2024

[26] G. Dong, X. Li, Y. Zhang, and M. Deng, “Leveraging llm-assisted query understanding for live retrieval-augmented generation,” 2025. [Online]. Available: https://arxiv.org/abs/2506.21384

[27] S. Diamond and S. Boyd, “CVXPY: A Python-embedded modeling language for convex optimization,” Journal of Machine Learning Research, vol. 17, no. 83, pp. 1–5, 2016

[28] X. Lyu, A. Baisero, Y. Xiao, and C. Amato, “A deeper understanding of state-based critics in multi-agent reinforcement learning,” Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, no. 9, pp. 9396–9404, Jun. 2022. [Online]. Available: https://ojs.aaai.org/index.php/AAAI/article/view/21171

[29] Z. Huang, J. Wu, and C. Lv, “Efficient deep reinforcement learning with imitative expert priors for autonomous driving,” IEEE Transactions on Neural Networks and Learning Systems, vol. 34, no. 10, pp. 7391–7403, 2023

[30] Elia, “Elia wind/solar power/grid data set,” https://www.elia.be, 2025, accessed: May 02, 2025

[31] M. Bettini, A. Prorok, and V. Moens, “Benchmarl: Benchmarking multi-agent reinforcement learning,” 2024. [Online]. Available: https://arxiv.org/abs/2312.01472

[32] S. Iqbal and F. Sha, “Actor-attention-critic for multi-agent reinforcement learning,” 2019. [Online]. Available: https://arxiv.org/abs/1810.02912

[33] S. Fujimoto and S. S. Gu, “A minimalist approach to offline reinforcement learning,” 2021. [Online]. Available: https://arxiv.org/abs/2106.06860

[34] J. Song, H. Ren, D. Sadigh, and S. Ermon, “Multi-agent generative adversarial imitation learning,” 2018. [Online]. Available: https://arxiv.org/abs/1807.09936

[35] LangChain Inc. LangGraph. [Online]. Available: https://langchain-ai.github.io/langgraph/

[36] [Online]. Available: https://github.com/jzk0806/P2P-llm-supplementary