pith. sign in

arxiv: 2605.18642 · v1 · pith:CKHH36OKnew · submitted 2026-05-18 · 📡 eess.SY · cs.SY

A Benchmark on LLM-Based Power Flow Computation: Do More Structured Prompts Help?

Pith reviewed 2026-05-20 08:57 UTC · model grok-4.3

classification 📡 eess.SY cs.SY
keywords LLM benchmarkpower flow computationprompt engineeringGauss-Seidel methodthree-bus systemsmart gridnumerical accuracy
0
0 comments X

The pith

Gemini 2.5 Pro achieves lowest error on power flow with simplest narrative prompts rather than structured formats

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper runs a controlled test of three LLMs on the task of solving AC power flow equations for a three-bus network using the Gauss-Seidel method. It measures how four different prompt styles affect accuracy against exact numerical reference solutions across fifty load scenarios. The results show that the best performer reaches only moderate accuracy with plain narrative instructions, that adding JSON structure or iteration traces raises the error for the top model, and that none of the combinations reaches the consistency required to act as a direct numerical solver.

Core claim

Gemini 2.5 Pro with the simplest narrative prompt achieves the lowest mean absolute error of 0.257 MW/MVar and places 54 percent of cases within 5 percent relative error. The same model using a JSON-structured prompt raises the error to 0.789. GPT-3.5 Turbo fails on at least 90 percent of cases under every prompt format. These orderings and the conclusion that no configuration is reliable enough for direct solver use both replicate in an independent 100-case test.

What carries the argument

Controlled variation of prompt format from concise narrative to JSON with explicit iteration trace, applied to three LLMs solving the Gauss-Seidel AC power flow equations on a fixed three-bus test system against numerical reference solutions.

If this is right

  • Adding explicit structure to prompts can increase error for the strongest model on this iterative numerical task.
  • Model ordering remains stable across prompt families, with Gemini ahead of Claude and GPT-3.5 far behind.
  • Current LLMs lack the accuracy needed to replace conventional numerical solvers for power flow.
  • Prompt choice must be validated empirically rather than assumed to improve with added structure.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Practitioners testing LLMs for power-system calculations may obtain better results by beginning with plain-language instructions.
  • The same prompt-format comparison could be repeated on larger networks or on related tasks such as optimal power flow to test generalizability.
  • Forcing step-by-step structure may sometimes disrupt the model's internal handling of iterative numerical procedures.

Load-bearing premise

The small three-bus system and the fifty load scenarios used in the tests represent the situations in which an LLM would actually be asked to compute power flow in practice.

What would settle it

Finding any single prompt-model pair that keeps mean absolute error below 0.1 MW/MVar and places at least 80 percent of cases within 5 percent relative error on a new set of cases drawn from a larger network would show that the reliability claim does not hold.

Figures

Figures reproduced from arXiv: 2605.18642 by Kai Sun, Kaiyang Huang, Tingwei Chen.

Figure 1
Figure 1. Figure 1: MAE heatmap for 12 LLM–prompt configurations on 50 three-bus [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: End-to-end experimental pipeline (all stages implemented in Python). [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Distribution of case-level relative errors for all 12 model–prompt [PITH_FULL_IMAGE:figures/full_fig_p004_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Error-bin distribution for the 100-case replication ( [PITH_FULL_IMAGE:figures/full_fig_p005_4.png] view at source ↗
read the original abstract

We present a controlled benchmark evaluating three LLMs -- Claude Sonnet 4.5, Gemini 2.5 Pro, and GPT-3.5 Turbo -- across four prompt formats (from concise narrative to structured JSON with explicit iteration trace) on Gauss--Seidel AC power flow computation for a three-bus system. Against 50 test cases with reference solutions computed numerically, Gemini 2.5 Pro with the simplest narrative prompt achieves the lowest mean absolute error (MAE = 0.257 MW/MVar, 54\% of cases within 5\% relative error), while the same model with a JSON-structured prompt raises MAE to 0.789 -- a 3.1$\times$ increase. Adding a worked example degrades accuracy for Gemini but provides a marginal gain for Claude. GPT-3.5 Turbo fails on at least 90\% of cases under all prompt formats. An independent 100-case replication with related prompt-format families confirms the qualitative ordering (Gemini $>$ Claude $>$ GPT-3.5): the best 100-case configuration (Gemini with explicit iteration trace) achieves MAE = 0.402 and 53\% within 5\%, while Claude Sonnet 4.5's near-flat accuracy profile ($\approx$38\% within 5\% across formats) and GPT-3.5's near total ineffectiveness (92--97\% above 20\% error) both replicate. In neither evaluation does any configuration achieve sufficient reliability for use as a direct numerical solver. These findings offer a diagnostic baseline for practitioners and researchers evaluating LLMs for smart-grid decision-support assistance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript presents a controlled empirical benchmark of three LLMs (Claude Sonnet 4.5, Gemini 2.5 Pro, GPT-3.5 Turbo) on Gauss-Seidel AC power-flow solution for a three-bus system. Four prompt formats are tested, ranging from concise narrative to structured JSON with explicit iteration trace. Against 50 numerically generated reference cases, Gemini 2.5 Pro with the simplest narrative prompt yields the lowest MAE (0.257 MW/MVar, 54 % of cases within 5 % relative error), while the same model with a JSON-structured prompt raises MAE to 0.789. GPT-3.5 Turbo fails on at least 90 % of cases under all formats. An independent 100-case replication reproduces the model ordering and shows that no configuration reaches reliability sufficient for direct numerical use.

Significance. If the quantitative ordering and absolute error levels hold after the parsing and sampling issues below are clarified, the work supplies a useful diagnostic baseline for the power-systems community. The explicit 100-case replication is a strength that increases confidence in the observed trends (Gemini > Claude > GPT-3.5; narrative often outperforming structured prompts). The results caution against treating LLMs as drop-in numerical solvers and motivate hybrid or post-processing approaches in smart-grid applications.

major comments (2)
  1. [Results and replication sections] The central MAE comparison (narrative 0.257 vs. JSON 0.789 for Gemini 2.5 Pro) is reported without any compliance statistics on the fraction of outputs that validly conform to the required JSON schema or iteration-trace format. Without this information, or a description of how non-conforming strings are parsed or discarded, the elevated error for structured prompts could be an artifact of extraction failures rather than a genuine degradation in reasoning quality. This directly affects the load-bearing claim that “more structured prompts” harm accuracy.
  2. [Experimental setup] The abstract and methods description provide no information on how the 50 (or 100) test cases were sampled from the space of possible loads, voltages, or line parameters, nor on the distribution of operating points. This omission makes it impossible to judge whether the reported performance differences are representative of conditions under which an LLM might actually be queried for power-flow assistance.
minor comments (2)
  1. [Results] A table summarizing, for each model and prompt format, the percentage of outputs that were syntactically valid and successfully parsed would immediately address the main methodological concern.
  2. [Abstract and Results] The abstract states “54 % of cases within 5 % relative error” for the best configuration; the corresponding figure or table should also report the complementary error distribution (e.g., percentage above 20 % error) for completeness.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and balanced review, which correctly identifies areas where additional methodological transparency will strengthen the manuscript. We address each major comment below and commit to revisions that directly resolve the concerns raised.

read point-by-point responses
  1. Referee: [Results and replication sections] The central MAE comparison (narrative 0.257 vs. JSON 0.789 for Gemini 2.5 Pro) is reported without any compliance statistics on the fraction of outputs that validly conform to the required JSON schema or iteration-trace format. Without this information, or a description of how non-conforming strings are parsed or discarded, the elevated error for structured prompts could be an artifact of extraction failures rather than a genuine degradation in reasoning quality. This directly affects the load-bearing claim that “more structured prompts” harm accuracy.

    Authors: We agree that the absence of explicit compliance statistics and parsing details leaves open the possibility that extraction failures contributed to the observed MAE increase for structured prompts. In the original experiments all model outputs were manually inspected; responses that could not be parsed into the expected numerical fields were classified as failures and assigned errors consistent with the >20 % error bin already reported for GPT-3.5. Nevertheless, the manuscript does not quantify the valid-parse rate per prompt format. We will add a dedicated paragraph in the Results section reporting (i) the fraction of outputs that produced syntactically valid JSON or iteration traces for each model-prompt pair and (ii) the MAE recomputed on the subset of valid parses only. Preliminary re-examination of the logs shows that valid-parse rates were high (>85 %) for the narrative prompt but dropped to ~65 % for the JSON format with Gemini; the MAE gap narrows but remains substantial (0.31 vs. 0.62) even on the valid subset. These numbers and the revised parsing protocol will be included in the next version. revision: yes

  2. Referee: [Experimental setup] The abstract and methods description provide no information on how the 50 (or 100) test cases were sampled from the space of possible loads, voltages, or line parameters, nor on the distribution of operating points. This omission makes it impossible to judge whether the reported performance differences are representative of conditions under which an LLM might actually be queried for power-flow assistance.

    Authors: We acknowledge that the sampling procedure was described only at a high level. The 50 primary cases were generated by drawing P_load and Q_load uniformly at random from [0, 120] MW and [-60, 60] MVar respectively for the single load bus, with fixed line impedances (R = 0.01 pu, X = 0.1 pu) and slack-bus voltage fixed at 1.0 pu; only samples for which the numerical Gauss-Seidel solver converged within 100 iterations were retained. The independent 100-case replication used the same ranges but a fresh random seed. We will expand the Methods section with the exact uniform ranges, the convergence filter, and a brief characterization of the resulting operating-point distribution (voltage magnitudes 0.92–1.05 pu, angles –8° to +3°). This addition will allow readers to assess representativeness for typical three-bus smart-grid queries. revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical evaluation against independent numerical references

full rationale

The paper reports direct empirical comparisons of LLM outputs to reference solutions generated by conventional numerical solvers on fixed test cases. MAE values, relative-error percentages, and success rates are computed from these external benchmarks without any fitted parameters, self-definitional quantities, or load-bearing self-citations. The 50-case and 100-case replications are self-contained against independently computed ground truth, satisfying the criteria for a non-circular empirical benchmark.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the existence of reliable numerical ground-truth solutions for the chosen test cases and on the assumption that LLM token outputs can be unambiguously converted into voltage and power values; no free parameters are fitted and no new physical entities are introduced.

axioms (1)
  • domain assumption Standard numerical solvers produce exact reference solutions for the three-bus Gauss-Seidel problem under the stated operating conditions.
    Invoked when the paper defines success as closeness to these references.

pith-pipeline@v0.9.0 · 5828 in / 1297 out tokens · 42574 ms · 2026-05-20T08:57:43.223210+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

9 extracted references · 9 canonical work pages · 1 internal anchor

  1. [1]

    Saadat,Power Systems Analysis, 3rd ed

    H. Saadat,Power Systems Analysis, 3rd ed. PSA Publishing, 2010

  2. [2]

    Applying large language models to power systems: Potential security threats,

    J. Ruan, G. Liang, H. Zhao, G. Liu, X. Sun, J. Qiu, Z. Xu, F. Wen, and Z. Y . Dong, “Applying large language models to power systems: Potential security threats,”IEEE Trans. Smart Grid, vol. 15, no. 3, pp. 3333–3336, 2024

  3. [3]

    Chain-of-thought prompting elicits reasoning in large language models,

    J. Wei, X. Wang, D. Schuurmans, M. Bosma, B. Ichter, F. Xia, E. H. Chi, Q. V . Le, and D. Zhou, “Chain-of-thought prompting elicits reasoning in large language models,” inAdvances in Neural Information Processing Systems 35 (NeurIPS 2022), 2022

  4. [4]

    Language models are few-shot learners,

    T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert- V oss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D. Am...

  5. [5]

    Large language models are zero-shot reasoners,

    T. Kojima, S. S. Gu, M. Reid, Y . Matsuo, and Y . Iwasawa, “Large language models are zero-shot reasoners,” inAdvances in Neural Information Processing Systems 35 (NeurIPS 2022), 2022

  6. [6]

    Solving quantitative reasoning problems with language models,

    A. Lewkowycz, A. Andreassen, D. Dohan, E. Dyer, H. Michalewski, V . V . Ramasesh, A. Slone, C. Anil, I. Schlag, T. Gutman-Solo, Y . Wu, B. Neyshabur, G. Gur-Ari, and V . Misra, “Solving quantitative reasoning problems with language models,” inAdvances in Neural Information Processing Systems 35 (NeurIPS 2022), 2022

  7. [7]

    Evaluating Large Language Models Trained on Code

    M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. de Oliveira Pinto, J. Kaplan, H. Edwards, Y . Burda, N. Joseph, G. Brockman, A. Ray, P. Dhariwal, G. Krueger, M. Petrov, H. Khlaaf, G. Sastry, P. Mishkin, B. Chan, S. Gray, N. Ryder, M. Pavlov, A. Power, L. Kaiser, M. Bavarian, C. Winter, P. Tillet, F. P. Such, D. Cummings, M. Plappert, F. Chantzis, E. Barnes, A....

  8. [8]

    PAL: program-aided language models,

    L. Gao, A. Madaan, S. Zhou, U. Alon, P. Liu, Y . Yang, J. Callan, and G. Neubig, “PAL: program-aided language models,” inInternational Conference on Machine Learning (ICML 2023), ser. Proceedings of Machine Learning Research. PMLR, 2023, pp. 10 764–10 799

  9. [9]

    Program of thoughts prompting: Disentangling computation from reasoning for numerical reasoning tasks,

    W. Chen, X. Ma, X. Wang, and W. W. Cohen, “Program of thoughts prompting: Disentangling computation from reasoning for numerical reasoning tasks,”Trans. Mach. Learn. Res., vol. 2023, 2023