A Benchmark on LLM-Based Power Flow Computation: Do More Structured Prompts Help?

Kai Sun; Kaiyang Huang; Tingwei Chen

arxiv: 2605.18642 · v1 · pith:CKHH36OKnew · submitted 2026-05-18 · 📡 eess.SY · cs.SY

A Benchmark on LLM-Based Power Flow Computation: Do More Structured Prompts Help?

Tingwei Chen , Kaiyang Huang , Kai Sun This is my paper

Pith reviewed 2026-05-20 08:57 UTC · model grok-4.3

classification 📡 eess.SY cs.SY

keywords LLM benchmarkpower flow computationprompt engineeringGauss-Seidel methodthree-bus systemsmart gridnumerical accuracy

0 comments

The pith

Gemini 2.5 Pro achieves lowest error on power flow with simplest narrative prompts rather than structured formats

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper runs a controlled test of three LLMs on the task of solving AC power flow equations for a three-bus network using the Gauss-Seidel method. It measures how four different prompt styles affect accuracy against exact numerical reference solutions across fifty load scenarios. The results show that the best performer reaches only moderate accuracy with plain narrative instructions, that adding JSON structure or iteration traces raises the error for the top model, and that none of the combinations reaches the consistency required to act as a direct numerical solver.

Core claim

Gemini 2.5 Pro with the simplest narrative prompt achieves the lowest mean absolute error of 0.257 MW/MVar and places 54 percent of cases within 5 percent relative error. The same model using a JSON-structured prompt raises the error to 0.789. GPT-3.5 Turbo fails on at least 90 percent of cases under every prompt format. These orderings and the conclusion that no configuration is reliable enough for direct solver use both replicate in an independent 100-case test.

What carries the argument

Controlled variation of prompt format from concise narrative to JSON with explicit iteration trace, applied to three LLMs solving the Gauss-Seidel AC power flow equations on a fixed three-bus test system against numerical reference solutions.

If this is right

Adding explicit structure to prompts can increase error for the strongest model on this iterative numerical task.
Model ordering remains stable across prompt families, with Gemini ahead of Claude and GPT-3.5 far behind.
Current LLMs lack the accuracy needed to replace conventional numerical solvers for power flow.
Prompt choice must be validated empirically rather than assumed to improve with added structure.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Practitioners testing LLMs for power-system calculations may obtain better results by beginning with plain-language instructions.
The same prompt-format comparison could be repeated on larger networks or on related tasks such as optimal power flow to test generalizability.
Forcing step-by-step structure may sometimes disrupt the model's internal handling of iterative numerical procedures.

Load-bearing premise

The small three-bus system and the fifty load scenarios used in the tests represent the situations in which an LLM would actually be asked to compute power flow in practice.

What would settle it

Finding any single prompt-model pair that keeps mean absolute error below 0.1 MW/MVar and places at least 80 percent of cases within 5 percent relative error on a new set of cases drawn from a larger network would show that the reliability claim does not hold.

Figures

Figures reproduced from arXiv: 2605.18642 by Kai Sun, Kaiyang Huang, Tingwei Chen.

**Figure 2.** Figure 2: End-to-end experimental pipeline (all stages implemented in Python). [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗

**Figure 3.** Figure 3: Distribution of case-level relative errors for all 12 model–prompt [PITH_FULL_IMAGE:figures/full_fig_p004_3.png] view at source ↗

**Figure 4.** Figure 4: Error-bin distribution for the 100-case replication ( [PITH_FULL_IMAGE:figures/full_fig_p005_4.png] view at source ↗

read the original abstract

We present a controlled benchmark evaluating three LLMs -- Claude Sonnet 4.5, Gemini 2.5 Pro, and GPT-3.5 Turbo -- across four prompt formats (from concise narrative to structured JSON with explicit iteration trace) on Gauss--Seidel AC power flow computation for a three-bus system. Against 50 test cases with reference solutions computed numerically, Gemini 2.5 Pro with the simplest narrative prompt achieves the lowest mean absolute error (MAE = 0.257 MW/MVar, 54\% of cases within 5\% relative error), while the same model with a JSON-structured prompt raises MAE to 0.789 -- a 3.1$\times$ increase. Adding a worked example degrades accuracy for Gemini but provides a marginal gain for Claude. GPT-3.5 Turbo fails on at least 90\% of cases under all prompt formats. An independent 100-case replication with related prompt-format families confirms the qualitative ordering (Gemini $>$ Claude $>$ GPT-3.5): the best 100-case configuration (Gemini with explicit iteration trace) achieves MAE = 0.402 and 53\% within 5\%, while Claude Sonnet 4.5's near-flat accuracy profile ($\approx$38\% within 5\% across formats) and GPT-3.5's near total ineffectiveness (92--97\% above 20\% error) both replicate. In neither evaluation does any configuration achieve sufficient reliability for use as a direct numerical solver. These findings offer a diagnostic baseline for practitioners and researchers evaluating LLMs for smart-grid decision-support assistance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Narrative prompts beat structured ones for Gemini here but nothing is reliable enough and the three-bus scope plus missing parsing details limit how far the results travel.

read the letter

The punchline is that narrative prompts worked better than structured ones for the top model here, but nothing reached usable accuracy on even this simple power flow task. This paper sets up a benchmark comparing three LLMs on computing AC power flow with the Gauss-Seidel method for a three-bus network. They test four prompt formats ranging from short narrative to detailed JSON that includes iteration steps. Results come from 50 test cases with numerical references, plus a 100-case replication. What the work does well is run a clean head-to-head on prompt structure and show consistent model ordering across the two sets. Gemini with the basic prompt had the lowest MAE at 0.257, while JSON raised it to 0.789. The replication confirmed Gemini ahead of Claude, with GPT-3.5 far behind. Using external solutions avoids any fitting issues. The main soft spot is the lack of information on how LLM responses were turned into numbers. Structured prompts ask for JSON or traces, but if many outputs didn't match the format, the higher errors could reflect extraction problems instead of poorer performance. The abstract gives no numbers on valid outputs or how invalid ones were handled. The test cases are also limited to a toy three-bus system, which keeps the practical takeaway modest. This kind of diagnostic is mainly for researchers exploring LLMs in power systems or smart grid applications. Someone wanting a first look at whether prompt engineering helps with numerical tasks like this would find it worthwhile. It has enough structure and replication to go to peer review rather than a desk reject. I'd suggest referees focus on clarifying the output parsing and test case selection.

Referee Report

2 major / 2 minor

Summary. The manuscript presents a controlled empirical benchmark of three LLMs (Claude Sonnet 4.5, Gemini 2.5 Pro, GPT-3.5 Turbo) on Gauss-Seidel AC power-flow solution for a three-bus system. Four prompt formats are tested, ranging from concise narrative to structured JSON with explicit iteration trace. Against 50 numerically generated reference cases, Gemini 2.5 Pro with the simplest narrative prompt yields the lowest MAE (0.257 MW/MVar, 54 % of cases within 5 % relative error), while the same model with a JSON-structured prompt raises MAE to 0.789. GPT-3.5 Turbo fails on at least 90 % of cases under all formats. An independent 100-case replication reproduces the model ordering and shows that no configuration reaches reliability sufficient for direct numerical use.

Significance. If the quantitative ordering and absolute error levels hold after the parsing and sampling issues below are clarified, the work supplies a useful diagnostic baseline for the power-systems community. The explicit 100-case replication is a strength that increases confidence in the observed trends (Gemini > Claude > GPT-3.5; narrative often outperforming structured prompts). The results caution against treating LLMs as drop-in numerical solvers and motivate hybrid or post-processing approaches in smart-grid applications.

major comments (2)

[Results and replication sections] The central MAE comparison (narrative 0.257 vs. JSON 0.789 for Gemini 2.5 Pro) is reported without any compliance statistics on the fraction of outputs that validly conform to the required JSON schema or iteration-trace format. Without this information, or a description of how non-conforming strings are parsed or discarded, the elevated error for structured prompts could be an artifact of extraction failures rather than a genuine degradation in reasoning quality. This directly affects the load-bearing claim that “more structured prompts” harm accuracy.
[Experimental setup] The abstract and methods description provide no information on how the 50 (or 100) test cases were sampled from the space of possible loads, voltages, or line parameters, nor on the distribution of operating points. This omission makes it impossible to judge whether the reported performance differences are representative of conditions under which an LLM might actually be queried for power-flow assistance.

minor comments (2)

[Results] A table summarizing, for each model and prompt format, the percentage of outputs that were syntactically valid and successfully parsed would immediately address the main methodological concern.
[Abstract and Results] The abstract states “54 % of cases within 5 % relative error” for the best configuration; the corresponding figure or table should also report the complementary error distribution (e.g., percentage above 20 % error) for completeness.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and balanced review, which correctly identifies areas where additional methodological transparency will strengthen the manuscript. We address each major comment below and commit to revisions that directly resolve the concerns raised.

read point-by-point responses

Referee: [Results and replication sections] The central MAE comparison (narrative 0.257 vs. JSON 0.789 for Gemini 2.5 Pro) is reported without any compliance statistics on the fraction of outputs that validly conform to the required JSON schema or iteration-trace format. Without this information, or a description of how non-conforming strings are parsed or discarded, the elevated error for structured prompts could be an artifact of extraction failures rather than a genuine degradation in reasoning quality. This directly affects the load-bearing claim that “more structured prompts” harm accuracy.

Authors: We agree that the absence of explicit compliance statistics and parsing details leaves open the possibility that extraction failures contributed to the observed MAE increase for structured prompts. In the original experiments all model outputs were manually inspected; responses that could not be parsed into the expected numerical fields were classified as failures and assigned errors consistent with the >20 % error bin already reported for GPT-3.5. Nevertheless, the manuscript does not quantify the valid-parse rate per prompt format. We will add a dedicated paragraph in the Results section reporting (i) the fraction of outputs that produced syntactically valid JSON or iteration traces for each model-prompt pair and (ii) the MAE recomputed on the subset of valid parses only. Preliminary re-examination of the logs shows that valid-parse rates were high (>85 %) for the narrative prompt but dropped to ~65 % for the JSON format with Gemini; the MAE gap narrows but remains substantial (0.31 vs. 0.62) even on the valid subset. These numbers and the revised parsing protocol will be included in the next version. revision: yes
Referee: [Experimental setup] The abstract and methods description provide no information on how the 50 (or 100) test cases were sampled from the space of possible loads, voltages, or line parameters, nor on the distribution of operating points. This omission makes it impossible to judge whether the reported performance differences are representative of conditions under which an LLM might actually be queried for power-flow assistance.

Authors: We acknowledge that the sampling procedure was described only at a high level. The 50 primary cases were generated by drawing P_load and Q_load uniformly at random from [0, 120] MW and [-60, 60] MVar respectively for the single load bus, with fixed line impedances (R = 0.01 pu, X = 0.1 pu) and slack-bus voltage fixed at 1.0 pu; only samples for which the numerical Gauss-Seidel solver converged within 100 iterations were retained. The independent 100-case replication used the same ranges but a fresh random seed. We will expand the Methods section with the exact uniform ranges, the convergence filter, and a brief characterization of the resulting operating-point distribution (voltage magnitudes 0.92–1.05 pu, angles –8° to +3°). This addition will allow readers to assess representativeness for typical three-bus smart-grid queries. revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical evaluation against independent numerical references

full rationale

The paper reports direct empirical comparisons of LLM outputs to reference solutions generated by conventional numerical solvers on fixed test cases. MAE values, relative-error percentages, and success rates are computed from these external benchmarks without any fitted parameters, self-definitional quantities, or load-bearing self-citations. The 50-case and 100-case replications are self-contained against independently computed ground truth, satisfying the criteria for a non-circular empirical benchmark.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the existence of reliable numerical ground-truth solutions for the chosen test cases and on the assumption that LLM token outputs can be unambiguously converted into voltage and power values; no free parameters are fitted and no new physical entities are introduced.

axioms (1)

domain assumption Standard numerical solvers produce exact reference solutions for the three-bus Gauss-Seidel problem under the stated operating conditions.
Invoked when the paper defines success as closeness to these references.

pith-pipeline@v0.9.0 · 5828 in / 1297 out tokens · 42574 ms · 2026-05-20T08:57:43.223210+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

9 extracted references · 9 canonical work pages · 1 internal anchor

[1]

Saadat,Power Systems Analysis, 3rd ed

H. Saadat,Power Systems Analysis, 3rd ed. PSA Publishing, 2010

work page 2010
[2]

Applying large language models to power systems: Potential security threats,

J. Ruan, G. Liang, H. Zhao, G. Liu, X. Sun, J. Qiu, Z. Xu, F. Wen, and Z. Y . Dong, “Applying large language models to power systems: Potential security threats,”IEEE Trans. Smart Grid, vol. 15, no. 3, pp. 3333–3336, 2024

work page 2024
[3]

Chain-of-thought prompting elicits reasoning in large language models,

J. Wei, X. Wang, D. Schuurmans, M. Bosma, B. Ichter, F. Xia, E. H. Chi, Q. V . Le, and D. Zhou, “Chain-of-thought prompting elicits reasoning in large language models,” inAdvances in Neural Information Processing Systems 35 (NeurIPS 2022), 2022

work page 2022
[4]

Language models are few-shot learners,

T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert- V oss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D. Am...

work page 2020
[5]

Large language models are zero-shot reasoners,

T. Kojima, S. S. Gu, M. Reid, Y . Matsuo, and Y . Iwasawa, “Large language models are zero-shot reasoners,” inAdvances in Neural Information Processing Systems 35 (NeurIPS 2022), 2022

work page 2022
[6]

Solving quantitative reasoning problems with language models,

A. Lewkowycz, A. Andreassen, D. Dohan, E. Dyer, H. Michalewski, V . V . Ramasesh, A. Slone, C. Anil, I. Schlag, T. Gutman-Solo, Y . Wu, B. Neyshabur, G. Gur-Ari, and V . Misra, “Solving quantitative reasoning problems with language models,” inAdvances in Neural Information Processing Systems 35 (NeurIPS 2022), 2022

work page 2022
[7]

Evaluating Large Language Models Trained on Code

M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. de Oliveira Pinto, J. Kaplan, H. Edwards, Y . Burda, N. Joseph, G. Brockman, A. Ray, P. Dhariwal, G. Krueger, M. Petrov, H. Khlaaf, G. Sastry, P. Mishkin, B. Chan, S. Gray, N. Ryder, M. Pavlov, A. Power, L. Kaiser, M. Bavarian, C. Winter, P. Tillet, F. P. Such, D. Cummings, M. Plappert, F. Chantzis, E. Barnes, A....

work page internal anchor Pith review Pith/arXiv arXiv 2021
[8]

PAL: program-aided language models,

L. Gao, A. Madaan, S. Zhou, U. Alon, P. Liu, Y . Yang, J. Callan, and G. Neubig, “PAL: program-aided language models,” inInternational Conference on Machine Learning (ICML 2023), ser. Proceedings of Machine Learning Research. PMLR, 2023, pp. 10 764–10 799

work page 2023
[9]

Program of thoughts prompting: Disentangling computation from reasoning for numerical reasoning tasks,

W. Chen, X. Ma, X. Wang, and W. W. Cohen, “Program of thoughts prompting: Disentangling computation from reasoning for numerical reasoning tasks,”Trans. Mach. Learn. Res., vol. 2023, 2023

work page 2023

[1] [1]

Saadat,Power Systems Analysis, 3rd ed

H. Saadat,Power Systems Analysis, 3rd ed. PSA Publishing, 2010

work page 2010

[2] [2]

Applying large language models to power systems: Potential security threats,

J. Ruan, G. Liang, H. Zhao, G. Liu, X. Sun, J. Qiu, Z. Xu, F. Wen, and Z. Y . Dong, “Applying large language models to power systems: Potential security threats,”IEEE Trans. Smart Grid, vol. 15, no. 3, pp. 3333–3336, 2024

work page 2024

[3] [3]

Chain-of-thought prompting elicits reasoning in large language models,

J. Wei, X. Wang, D. Schuurmans, M. Bosma, B. Ichter, F. Xia, E. H. Chi, Q. V . Le, and D. Zhou, “Chain-of-thought prompting elicits reasoning in large language models,” inAdvances in Neural Information Processing Systems 35 (NeurIPS 2022), 2022

work page 2022

[4] [4]

Language models are few-shot learners,

T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert- V oss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D. Am...

work page 2020

[5] [5]

Large language models are zero-shot reasoners,

T. Kojima, S. S. Gu, M. Reid, Y . Matsuo, and Y . Iwasawa, “Large language models are zero-shot reasoners,” inAdvances in Neural Information Processing Systems 35 (NeurIPS 2022), 2022

work page 2022

[6] [6]

Solving quantitative reasoning problems with language models,

A. Lewkowycz, A. Andreassen, D. Dohan, E. Dyer, H. Michalewski, V . V . Ramasesh, A. Slone, C. Anil, I. Schlag, T. Gutman-Solo, Y . Wu, B. Neyshabur, G. Gur-Ari, and V . Misra, “Solving quantitative reasoning problems with language models,” inAdvances in Neural Information Processing Systems 35 (NeurIPS 2022), 2022

work page 2022

[7] [7]

Evaluating Large Language Models Trained on Code

M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. de Oliveira Pinto, J. Kaplan, H. Edwards, Y . Burda, N. Joseph, G. Brockman, A. Ray, P. Dhariwal, G. Krueger, M. Petrov, H. Khlaaf, G. Sastry, P. Mishkin, B. Chan, S. Gray, N. Ryder, M. Pavlov, A. Power, L. Kaiser, M. Bavarian, C. Winter, P. Tillet, F. P. Such, D. Cummings, M. Plappert, F. Chantzis, E. Barnes, A....

work page internal anchor Pith review Pith/arXiv arXiv 2021

[8] [8]

PAL: program-aided language models,

L. Gao, A. Madaan, S. Zhou, U. Alon, P. Liu, Y . Yang, J. Callan, and G. Neubig, “PAL: program-aided language models,” inInternational Conference on Machine Learning (ICML 2023), ser. Proceedings of Machine Learning Research. PMLR, 2023, pp. 10 764–10 799

work page 2023

[9] [9]

Program of thoughts prompting: Disentangling computation from reasoning for numerical reasoning tasks,

W. Chen, X. Ma, X. Wang, and W. W. Cohen, “Program of thoughts prompting: Disentangling computation from reasoning for numerical reasoning tasks,”Trans. Mach. Learn. Res., vol. 2023, 2023

work page 2023