A Benchmark on LLM-Based Power Flow Computation: Do More Structured Prompts Help?
Pith reviewed 2026-05-20 08:57 UTC · model grok-4.3
The pith
Gemini 2.5 Pro achieves lowest error on power flow with simplest narrative prompts rather than structured formats
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Gemini 2.5 Pro with the simplest narrative prompt achieves the lowest mean absolute error of 0.257 MW/MVar and places 54 percent of cases within 5 percent relative error. The same model using a JSON-structured prompt raises the error to 0.789. GPT-3.5 Turbo fails on at least 90 percent of cases under every prompt format. These orderings and the conclusion that no configuration is reliable enough for direct solver use both replicate in an independent 100-case test.
What carries the argument
Controlled variation of prompt format from concise narrative to JSON with explicit iteration trace, applied to three LLMs solving the Gauss-Seidel AC power flow equations on a fixed three-bus test system against numerical reference solutions.
If this is right
- Adding explicit structure to prompts can increase error for the strongest model on this iterative numerical task.
- Model ordering remains stable across prompt families, with Gemini ahead of Claude and GPT-3.5 far behind.
- Current LLMs lack the accuracy needed to replace conventional numerical solvers for power flow.
- Prompt choice must be validated empirically rather than assumed to improve with added structure.
Where Pith is reading between the lines
- Practitioners testing LLMs for power-system calculations may obtain better results by beginning with plain-language instructions.
- The same prompt-format comparison could be repeated on larger networks or on related tasks such as optimal power flow to test generalizability.
- Forcing step-by-step structure may sometimes disrupt the model's internal handling of iterative numerical procedures.
Load-bearing premise
The small three-bus system and the fifty load scenarios used in the tests represent the situations in which an LLM would actually be asked to compute power flow in practice.
What would settle it
Finding any single prompt-model pair that keeps mean absolute error below 0.1 MW/MVar and places at least 80 percent of cases within 5 percent relative error on a new set of cases drawn from a larger network would show that the reliability claim does not hold.
Figures
read the original abstract
We present a controlled benchmark evaluating three LLMs -- Claude Sonnet 4.5, Gemini 2.5 Pro, and GPT-3.5 Turbo -- across four prompt formats (from concise narrative to structured JSON with explicit iteration trace) on Gauss--Seidel AC power flow computation for a three-bus system. Against 50 test cases with reference solutions computed numerically, Gemini 2.5 Pro with the simplest narrative prompt achieves the lowest mean absolute error (MAE = 0.257 MW/MVar, 54\% of cases within 5\% relative error), while the same model with a JSON-structured prompt raises MAE to 0.789 -- a 3.1$\times$ increase. Adding a worked example degrades accuracy for Gemini but provides a marginal gain for Claude. GPT-3.5 Turbo fails on at least 90\% of cases under all prompt formats. An independent 100-case replication with related prompt-format families confirms the qualitative ordering (Gemini $>$ Claude $>$ GPT-3.5): the best 100-case configuration (Gemini with explicit iteration trace) achieves MAE = 0.402 and 53\% within 5\%, while Claude Sonnet 4.5's near-flat accuracy profile ($\approx$38\% within 5\% across formats) and GPT-3.5's near total ineffectiveness (92--97\% above 20\% error) both replicate. In neither evaluation does any configuration achieve sufficient reliability for use as a direct numerical solver. These findings offer a diagnostic baseline for practitioners and researchers evaluating LLMs for smart-grid decision-support assistance.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents a controlled empirical benchmark of three LLMs (Claude Sonnet 4.5, Gemini 2.5 Pro, GPT-3.5 Turbo) on Gauss-Seidel AC power-flow solution for a three-bus system. Four prompt formats are tested, ranging from concise narrative to structured JSON with explicit iteration trace. Against 50 numerically generated reference cases, Gemini 2.5 Pro with the simplest narrative prompt yields the lowest MAE (0.257 MW/MVar, 54 % of cases within 5 % relative error), while the same model with a JSON-structured prompt raises MAE to 0.789. GPT-3.5 Turbo fails on at least 90 % of cases under all formats. An independent 100-case replication reproduces the model ordering and shows that no configuration reaches reliability sufficient for direct numerical use.
Significance. If the quantitative ordering and absolute error levels hold after the parsing and sampling issues below are clarified, the work supplies a useful diagnostic baseline for the power-systems community. The explicit 100-case replication is a strength that increases confidence in the observed trends (Gemini > Claude > GPT-3.5; narrative often outperforming structured prompts). The results caution against treating LLMs as drop-in numerical solvers and motivate hybrid or post-processing approaches in smart-grid applications.
major comments (2)
- [Results and replication sections] The central MAE comparison (narrative 0.257 vs. JSON 0.789 for Gemini 2.5 Pro) is reported without any compliance statistics on the fraction of outputs that validly conform to the required JSON schema or iteration-trace format. Without this information, or a description of how non-conforming strings are parsed or discarded, the elevated error for structured prompts could be an artifact of extraction failures rather than a genuine degradation in reasoning quality. This directly affects the load-bearing claim that “more structured prompts” harm accuracy.
- [Experimental setup] The abstract and methods description provide no information on how the 50 (or 100) test cases were sampled from the space of possible loads, voltages, or line parameters, nor on the distribution of operating points. This omission makes it impossible to judge whether the reported performance differences are representative of conditions under which an LLM might actually be queried for power-flow assistance.
minor comments (2)
- [Results] A table summarizing, for each model and prompt format, the percentage of outputs that were syntactically valid and successfully parsed would immediately address the main methodological concern.
- [Abstract and Results] The abstract states “54 % of cases within 5 % relative error” for the best configuration; the corresponding figure or table should also report the complementary error distribution (e.g., percentage above 20 % error) for completeness.
Simulated Author's Rebuttal
We thank the referee for the constructive and balanced review, which correctly identifies areas where additional methodological transparency will strengthen the manuscript. We address each major comment below and commit to revisions that directly resolve the concerns raised.
read point-by-point responses
-
Referee: [Results and replication sections] The central MAE comparison (narrative 0.257 vs. JSON 0.789 for Gemini 2.5 Pro) is reported without any compliance statistics on the fraction of outputs that validly conform to the required JSON schema or iteration-trace format. Without this information, or a description of how non-conforming strings are parsed or discarded, the elevated error for structured prompts could be an artifact of extraction failures rather than a genuine degradation in reasoning quality. This directly affects the load-bearing claim that “more structured prompts” harm accuracy.
Authors: We agree that the absence of explicit compliance statistics and parsing details leaves open the possibility that extraction failures contributed to the observed MAE increase for structured prompts. In the original experiments all model outputs were manually inspected; responses that could not be parsed into the expected numerical fields were classified as failures and assigned errors consistent with the >20 % error bin already reported for GPT-3.5. Nevertheless, the manuscript does not quantify the valid-parse rate per prompt format. We will add a dedicated paragraph in the Results section reporting (i) the fraction of outputs that produced syntactically valid JSON or iteration traces for each model-prompt pair and (ii) the MAE recomputed on the subset of valid parses only. Preliminary re-examination of the logs shows that valid-parse rates were high (>85 %) for the narrative prompt but dropped to ~65 % for the JSON format with Gemini; the MAE gap narrows but remains substantial (0.31 vs. 0.62) even on the valid subset. These numbers and the revised parsing protocol will be included in the next version. revision: yes
-
Referee: [Experimental setup] The abstract and methods description provide no information on how the 50 (or 100) test cases were sampled from the space of possible loads, voltages, or line parameters, nor on the distribution of operating points. This omission makes it impossible to judge whether the reported performance differences are representative of conditions under which an LLM might actually be queried for power-flow assistance.
Authors: We acknowledge that the sampling procedure was described only at a high level. The 50 primary cases were generated by drawing P_load and Q_load uniformly at random from [0, 120] MW and [-60, 60] MVar respectively for the single load bus, with fixed line impedances (R = 0.01 pu, X = 0.1 pu) and slack-bus voltage fixed at 1.0 pu; only samples for which the numerical Gauss-Seidel solver converged within 100 iterations were retained. The independent 100-case replication used the same ranges but a fresh random seed. We will expand the Methods section with the exact uniform ranges, the convergence filter, and a brief characterization of the resulting operating-point distribution (voltage magnitudes 0.92–1.05 pu, angles –8° to +3°). This addition will allow readers to assess representativeness for typical three-bus smart-grid queries. revision: yes
Circularity Check
No circularity: purely empirical evaluation against independent numerical references
full rationale
The paper reports direct empirical comparisons of LLM outputs to reference solutions generated by conventional numerical solvers on fixed test cases. MAE values, relative-error percentages, and success rates are computed from these external benchmarks without any fitted parameters, self-definitional quantities, or load-bearing self-citations. The 50-case and 100-case replications are self-contained against independently computed ground truth, satisfying the criteria for a non-circular empirical benchmark.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Standard numerical solvers produce exact reference solutions for the three-bus Gauss-Seidel problem under the stated operating conditions.
Reference graph
Works this paper leans on
-
[1]
Saadat,Power Systems Analysis, 3rd ed
H. Saadat,Power Systems Analysis, 3rd ed. PSA Publishing, 2010
work page 2010
-
[2]
Applying large language models to power systems: Potential security threats,
J. Ruan, G. Liang, H. Zhao, G. Liu, X. Sun, J. Qiu, Z. Xu, F. Wen, and Z. Y . Dong, “Applying large language models to power systems: Potential security threats,”IEEE Trans. Smart Grid, vol. 15, no. 3, pp. 3333–3336, 2024
work page 2024
-
[3]
Chain-of-thought prompting elicits reasoning in large language models,
J. Wei, X. Wang, D. Schuurmans, M. Bosma, B. Ichter, F. Xia, E. H. Chi, Q. V . Le, and D. Zhou, “Chain-of-thought prompting elicits reasoning in large language models,” inAdvances in Neural Information Processing Systems 35 (NeurIPS 2022), 2022
work page 2022
-
[4]
Language models are few-shot learners,
T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert- V oss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D. Am...
work page 2020
-
[5]
Large language models are zero-shot reasoners,
T. Kojima, S. S. Gu, M. Reid, Y . Matsuo, and Y . Iwasawa, “Large language models are zero-shot reasoners,” inAdvances in Neural Information Processing Systems 35 (NeurIPS 2022), 2022
work page 2022
-
[6]
Solving quantitative reasoning problems with language models,
A. Lewkowycz, A. Andreassen, D. Dohan, E. Dyer, H. Michalewski, V . V . Ramasesh, A. Slone, C. Anil, I. Schlag, T. Gutman-Solo, Y . Wu, B. Neyshabur, G. Gur-Ari, and V . Misra, “Solving quantitative reasoning problems with language models,” inAdvances in Neural Information Processing Systems 35 (NeurIPS 2022), 2022
work page 2022
-
[7]
Evaluating Large Language Models Trained on Code
M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. de Oliveira Pinto, J. Kaplan, H. Edwards, Y . Burda, N. Joseph, G. Brockman, A. Ray, P. Dhariwal, G. Krueger, M. Petrov, H. Khlaaf, G. Sastry, P. Mishkin, B. Chan, S. Gray, N. Ryder, M. Pavlov, A. Power, L. Kaiser, M. Bavarian, C. Winter, P. Tillet, F. P. Such, D. Cummings, M. Plappert, F. Chantzis, E. Barnes, A....
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[8]
PAL: program-aided language models,
L. Gao, A. Madaan, S. Zhou, U. Alon, P. Liu, Y . Yang, J. Callan, and G. Neubig, “PAL: program-aided language models,” inInternational Conference on Machine Learning (ICML 2023), ser. Proceedings of Machine Learning Research. PMLR, 2023, pp. 10 764–10 799
work page 2023
-
[9]
W. Chen, X. Ma, X. Wang, and W. W. Cohen, “Program of thoughts prompting: Disentangling computation from reasoning for numerical reasoning tasks,”Trans. Mach. Learn. Res., vol. 2023, 2023
work page 2023
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.