pith. machine review for the scientific record.

arxiv: 2604.09995 · v1 · submitted 2026-04-11 · 📡 eess.SY · cs.AI · cs.SY

Recognition: unknown

Agentic Application in Power Grid Static Analysis: Automatic Code Generation and Error Correction

Authors on Pith · no claims yet

Pith reviewed 2026-05-10 16:33 UTC · model grok-4.3

classification 📡 eess.SY · cs.AI · cs.SY
keywords LLM agent · MATPOWER · power grid static analysis · automatic code generation · error correction · natural language to code · vector database

The pith

An LLM agent converts natural language descriptions into reliable MATPOWER code for power grid static analysis.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper describes a system that takes ordinary language queries about power grid problems and produces executable scripts for the MATPOWER analysis tool. It constructs a searchable database from the software's manuals and applies three successive layers of verification to catch and repair generation errors before the code runs. The goal is to let engineers and analysts obtain correct results for static studies without writing or debugging code by hand. If the approach holds, routine grid calculations become accessible through simple text prompts while maintaining consistency across different query styles.

Core claim

The framework uses an LLM agent to translate natural language into MATPOWER scripts, supported by a vector database built from the tool's documentation via DeepSeek-OCR. Reliability comes from a three-tier error-correction process: a static pre-check for syntax and structure, a dynamic feedback loop that runs the code and incorporates runtime errors, and a semantic validator that confirms the output matches the intended analysis. Execution occurs asynchronously through the Model Context Protocol in MATLAB, allowing automatic debugging. Tests show the system reaches 82.38 percent code fidelity and removes hallucinations even on complex tasks.
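The three-tier process described above can be sketched in miniature. Everything below — the function names, the `repair` callback standing in for an LLM fix-up step, and the `judge` standing in for the independent validator — is hypothetical scaffolding to show the control flow, not the paper's implementation:

```python
"""Sketch of the three-tier check: static pre-check, dynamic execution
feedback, semantic validation. All names and heuristics are hypothetical."""
from dataclasses import dataclass

@dataclass
class CheckResult:
    ok: bool
    message: str = ""

def static_precheck(code: str) -> CheckResult:
    # Tier 1: cheap structural checks before any execution.
    if "runpf" not in code and "runopf" not in code:
        return CheckResult(False, "no MATPOWER solver call found")
    if code.count("(") != code.count(")"):
        return CheckResult(False, "unbalanced parentheses")
    return CheckResult(True)

def dynamic_feedback(code: str, run, repair, max_iters: int = 3):
    # Tier 2: execute, feed runtime errors back to the generator, retry.
    for _ in range(max_iters):
        ok, output = run(code)
        if ok:
            return code, output
        code = repair(code, output)   # LLM repair step in the paper's design
    raise RuntimeError("dynamic feedback loop exhausted its iteration budget")

def semantic_validate(request: str, code: str, judge) -> CheckResult:
    # Tier 3: an independent judge compares the request against the final code.
    verdict = judge(request, code)    # e.g. returns "pass" / "minor" / "critical"
    return CheckResult(verdict != "critical", verdict)
```

The point of the layering is that each tier catches a different failure class: tier 1 rejects malformed code without spending an execution, tier 2 handles errors only visible at runtime, and tier 3 catches code that runs fine but answers the wrong question.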

What carries the argument

The three-tier error-correction system of static pre-check, dynamic feedback loop, and semantic validator, backed by a vector database of MATPOWER manuals.
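The retrieval side of that argument can be illustrated with a toy. The paper builds its index with DeepSeek-OCR and Faiss; the stand-in below uses a hash-based bag-of-words embedding and brute-force cosine search purely to show the contract (chunked manual text in, top-k relevant chunks out). The sample chunks and all names are illustrative, not taken from the paper:

```python
"""Minimal stand-in for the manual-derived retrieval step. The paper uses
DeepSeek-OCR + Faiss; this toy embedding and brute-force search only
illustrate the interface. All names and chunks are hypothetical."""
import numpy as np

def embed(text: str, dim: int = 64) -> np.ndarray:
    # Toy embedding: hash each token into a bucket of a fixed-size vector.
    v = np.zeros(dim)
    for tok in text.lower().split():
        v[hash(tok) % dim] += 1.0
    n = np.linalg.norm(v)
    return v / n if n else v

class ManualIndex:
    def __init__(self, chunks):
        self.chunks = chunks
        self.matrix = np.stack([embed(c) for c in chunks])

    def search(self, query: str, k: int = 2):
        # Cosine similarity (vectors are unit-normalized), highest first.
        scores = self.matrix @ embed(query)
        top = np.argsort(-scores)[:k]
        return [self.chunks[i] for i in top]

chunks = [
    "runpf(mpc) runs an AC power flow on case struct mpc.",
    "runopf(mpc) solves the AC optimal power flow.",
    "case9 is a 9-bus test case bundled with MATPOWER.",
]
index = ManualIndex(chunks)
```

In the paper's pipeline the retrieved chunks are injected into the generation prompt, which is what grounds the LLM in the tool's actual API rather than its pretraining guesses.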

If this is right

  • Users can request grid static studies in plain English and receive working code without manual programming.
  • The error-correction layers allow the system to handle complex tasks while keeping code output accurate.
  • Automatic debugging runs asynchronously in MATLAB, reducing the need for human intervention after code generation.
  • Hallucinations in script creation are suppressed across a range of input phrasings and problem types.
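The asynchronous execution claim has a simple operational shape. The paper's MCP server is not public, so the sketch below shows only the execute-and-capture contract with a plain subprocess; `cmd` is whatever launches the target interpreter (in the paper's setting, a MATLAB batch invocation), and the function name is hypothetical:

```python
"""Hedged stand-in for asynchronous script execution. The paper routes this
through an MCP server into MATLAB; here a generic subprocess shows the
contract: launch without blocking, capture output, surface errors."""
import asyncio

async def execute_async(cmd, timeout: float = 60.0):
    # Launch the interpreter without blocking the agent's event loop.
    proc = await asyncio.create_subprocess_exec(
        *cmd,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.PIPE,
    )
    try:
        out, err = await asyncio.wait_for(proc.communicate(), timeout)
    except asyncio.TimeoutError:
        proc.kill()
        raise
    # Return code and stderr are what feed the dynamic error-correction loop.
    return proc.returncode, out.decode(), err.decode()
```

The captured stderr is the raw material for the "automatic debugging" bullet above: a nonzero return code plus the error text goes back to the generator as context for a repair attempt.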

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same structure of manual-derived database plus staged verification could apply to other power system simulators that have detailed documentation.
  • Connecting the agent to live sensor feeds might permit it to suggest analysis scripts that adapt to current grid conditions.
  • Non-specialists in utilities could perform detailed static studies more quickly if the interface stays limited to text prompts.

Load-bearing premise

The three layers of checks together with the vector database will reliably fix code errors for any natural language input without creating fresh mistakes or missing real-world grid edge cases.

What would settle it

A collection of natural language queries describing unusual grid configurations or rare analysis requests where the output script either fails to execute in MATLAB or yields analysis results that contradict known reference solutions.
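Operationally, "contradict known reference solutions" reduces to a numeric comparison against published values within a tolerance. A minimal checker, with field names chosen informally (they are not MATPOWER identifiers):

```python
"""Hedged sketch of a reference check: compare key quantities from a
generated script's output against published solutions within a tolerance.
Quantity names are illustrative, not MATPOWER field names."""

def matches_reference(result: dict, reference: dict, tol: float = 1e-3) -> bool:
    # result/reference map quantity names (e.g. a bus voltage magnitude,
    # total generation) to numbers; any deviation beyond tol is a failure.
    if result.keys() != reference.keys():
        return False
    return all(abs(result[k] - reference[k]) <= tol for k in reference)
```

A settling test set would pair each unusual query with such a reference dictionary, so that "executes successfully" and "produces the right numbers" are scored separately.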

Figures

Figures reproduced from arXiv: 2604.09995 by Qinjuan Wang, Shan Yang, Yongli Zhu.

Figure 2. Model Context Protocol (MCP) introduction.
Figure 3. Example of DeepSeek-OCR processing.
Figure 4. MCP architecture.
Figure 5. Static pre-check.
Figure 6. Output visualization.
Figure 7. Overall system accuracy (GCA) across different con…
Figure 8. CSGF performance grouped by task complexity.
read the original abstract

This paper introduces an LLM agent that automates power grid static analysis by converting natural language into MATPOWER scripts. The framework utilizes DeepSeek-OCR to build an enhanced vector database from MATPOWER manuals. To ensure reliability, it devises a three-tier error-correction system: a static pre-check, a dynamic feedback loop, and a semantic validator. Operating via the Model Context Protocol, the tool enables asynchronous execution and automatic debugging in MATLAB. Experimental results demonstrate that the system achieves an 82.38% accuracy in code fidelity, effectively eliminating hallucinations even in complex analysis tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes an LLM-based agentic framework that translates natural language queries into MATPOWER scripts for power grid static analysis. It constructs an enhanced vector database from MATPOWER manuals using DeepSeek-OCR and incorporates a three-tier error-correction pipeline (static pre-check, dynamic feedback loop, and semantic validator) operating through the Model Context Protocol for asynchronous MATLAB execution and debugging. The central empirical claim is that the system achieves 82.38% code fidelity and effectively eliminates hallucinations even in complex analysis tasks.

Significance. If the experimental results can be substantiated, the work could meaningfully advance automation in power systems engineering by enabling reliable natural-language interfaces to domain-specific simulation tools. The emphasis on retrieval-augmented generation combined with layered error correction directly targets the hallucination problem in LLM code generation, which is a practical strength. Credit is given for the focus on asynchronous execution and automatic debugging in a real engineering environment. However, the current lack of evaluation details limits assessment of generalizability to varied grid scenarios and real-world data.

major comments (2)
  1. [Experimental Results] The headline claim of 82.38% code fidelity is presented without the test-set size, input diversity, or held-out status; the precise definition and scoring rule for 'fidelity' (syntax, execution success, numerical match on sample data); baseline comparisons; error distributions; or ablation results isolating the contributions of the static pre-check, dynamic loop, and semantic validator. This information is load-bearing for the assertion that hallucinations are effectively eliminated rather than merely reduced on a narrow test suite.
  2. [Framework description] The manuscript does not specify how the dynamic feedback loop bounds its iteration count, prevents the introduction of new errors, or guarantees termination, nor does it detail the semantic validator's criteria or its integration with the vector database for edge cases in grid data. These omissions directly affect the reliability claim for complex analysis tasks.
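One way to meet the termination and regression objection — not something the paper specifies — is to bound iterations and only ever accept a repair that scores better than the current candidate, so the loop cannot return worse code than it started with. The scoring function and names below are hypothetical:

```python
"""Hedged sketch of a feedback loop with guaranteed termination and no
regression: a fixed iteration budget, plus accept-only-improvements using
an external score (e.g. how many static checks pass). Names hypothetical."""

def bounded_repair(code, run, repair, score, max_iters: int = 3):
    best, best_score = code, score(code)
    for _ in range(max_iters):
        ok, err = run(best)
        if ok:
            return best                      # terminated by success
        candidate = repair(best, err)        # LLM repair attempt
        if score(candidate) > best_score:    # accept only strict improvements
            best, best_score = candidate, score(candidate)
        else:
            break                            # no progress: stop early
    return best                              # terminated by budget or stall
```

Termination follows from the fixed budget and the early stop; the no-regression property follows from never replacing `best` with a lower-scoring candidate.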
minor comments (2)
  1. [Abstract] The phrasing 'effectively eliminating hallucinations' is stronger than the 82.38% figure supports; consider qualifying the language to reflect reduction rather than elimination.
  2. [Methods] The manuscript would benefit from explicit statements of the underlying LLM model(s) used for generation (beyond DeepSeek-OCR) and any licensing or reproducibility details for the vector database construction.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. The comments correctly identify areas where additional detail is needed to substantiate the experimental claims and ensure the framework description supports the reliability assertions. We will revise the manuscript to address both points.

read point-by-point responses
  1. Referee: [Experimental Results] The headline claim of 82.38% code fidelity is presented without the test-set size, input diversity, or held-out status; the precise definition and scoring rule for 'fidelity' (syntax, execution success, numerical match on sample data); baseline comparisons; error distributions; or ablation results isolating the contributions of the static pre-check, dynamic loop, and semantic validator. This information is load-bearing for the assertion that hallucinations are effectively eliminated rather than merely reduced on a narrow test suite.

    Authors: We agree that the Experimental Results section requires substantial expansion to support the central claims. In the revised manuscript we will add the test-set size and confirm it consists of held-out queries, describe the diversity of inputs (covering basic to complex grid analysis tasks), provide the exact definition and scoring procedure for code fidelity, include baseline comparisons against non-agentic LLM prompting, report error distributions, and present ablation results that isolate the contribution of each tier of the error-correction pipeline. These additions will allow readers to assess whether hallucinations are effectively eliminated. revision: yes

  2. Referee: [Framework description] The manuscript does not specify how the dynamic feedback loop bounds its iteration count, prevents the introduction of new errors, or guarantees termination, nor does it detail the semantic validator's criteria or its integration with the vector database for edge cases in grid data. These omissions directly affect the reliability claim for complex analysis tasks.

    Authors: We concur that these implementation details are necessary for reproducibility and for evaluating the reliability claims. In the revised manuscript we will add a dedicated subsection that specifies the iteration bound and termination conditions for the dynamic feedback loop, the mechanism used to avoid introducing new errors, the precise criteria and similarity thresholds employed by the semantic validator, and how the validator integrates with the vector database to handle edge cases such as atypical grid configurations. revision: yes
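The paper quantifies semantic deviations as "Critical" or "Minor", with Critical-level deviations forcibly deemed failures. That verdict handling can be sketched as follows; the independent LLM judge that assigns the severity is stubbed out, and all names are hypothetical:

```python
"""Hedged sketch of Critical/Minor verdict handling by the semantic
validator: only Critical blocks the result, Minor is surfaced as a warning.
The LLM judge that produces the severity is out of scope here."""
from enum import Enum

class Severity(Enum):
    PASS = "pass"
    MINOR = "minor"
    CRITICAL = "critical"

def apply_verdict(severity: Severity, code: str):
    if severity is Severity.CRITICAL:
        # Critical semantic deviation: the script answers the wrong question.
        raise ValueError("critical semantic deviation: regenerate the script")
    # Minor deviations are surfaced but do not block execution.
    warning = "minor semantic deviation noted" if severity is Severity.MINOR else None
    return code, warning
```

Splitting severity this way lets the pipeline tolerate contradictions in the user's own request (Minor) while still hard-failing code that is logically inconsistent with the request (Critical).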

Circularity Check

0 steps flagged

No derivation chain present; accuracy is experimental reporting only

full rationale

The paper describes an LLM agent framework for NL-to-MATPOWER code generation using a vector DB and three-tier error correction, then reports an experimental 82.38% code fidelity result. No equations, first-principles derivations, predictions, or uniqueness theorems are claimed anywhere in the provided text. The central claim is a measured accuracy figure from (unspecified) experiments rather than any quantity derived from or fitted to its own inputs. No self-citations, ansatzes, or renamings appear that could create circularity. Absence of test-set details or ablations is a reproducibility/validity concern, not a circularity reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the unstated assumption that LLMs guided by retrieval and feedback loops can produce domain-correct code for power systems; no free parameters are explicitly fitted in the abstract, and no new physical entities are introduced.

axioms (1)
  • domain assumption LLMs can be made reliable for code generation in specialized domains through retrieval-augmented generation and iterative correction without domain-specific fine-tuning.
    Invoked implicitly in the description of the agent and error-correction system.

pith-pipeline@v0.9.0 · 5396 in / 1417 out tokens · 31637 ms · 2026-05-10T16:33:26.172394+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

13 extracted references · 5 canonical work pages · 2 internal anchors

  1. R. D. Zimmerman, C. E. Murillo-Sanchez, and R. J. Thomas, "MATPOWER: Steady-State Operations, Planning and Analysis Tools for Power Systems Research and Education," IEEE Transactions on Power Systems, vol. 26, no. 1, pp. 12–19, Feb. 2011.

  2. L. Deng, X. Ren, C. Ni, M. Liang, D. Lo, and Z. Liu, "Enhancing Project-Specific Code Completion by Inferring Internal API Information," IEEE Transactions on Software Engineering, vol. 51, no. 9, pp. 2566–2582, Sept. 2025, doi: 10.1109/TSE.2025.3592823.

  3. X. Wang, Y. Chen, L. Yuan, Y. Zhang, Y. Li, H. Peng, and H. Ji, "Executable code actions elicit better LLM agents," in Proceedings of the 41st International Conference on Machine Learning (ICML'24), Vienna, Austria, 2024, Article no. 2054.

  4. Y. Du, F. Wei, and H. Zhang, "AnyTool: self-reflective, hierarchical agents for large-scale API calls," in Proceedings of the 41st International Conference on Machine Learning (ICML'24), Vienna, Austria, 2024, Article no. 470.

  5. M. Qiang, Z. Wang, S. Li, and G. Zhou, "Exploring Knowledge Filtering for Retrieval-Augmented Question Answering," IEEE Transactions on Audio, Speech and Language Processing, vol. 34, pp. 1049–1060, 2026, doi: 10.1109/TASLPRO.2026.3658957.

  6. M. Jia, Z. Cui, and G. Hug, "Enabling Large Language Models to Perform Power System Simulations with Previously Unseen Tools: A Case of Daline," CoRR, vol. abs/2406.17215, 2024, doi: 10.48550/ARXIV.2406.17215.

  7. H. Wei, Y. Sun, and Y. Li, "DeepSeek-OCR: Contexts Optical Compression," arXiv preprint arXiv:2501.18234, 2025. [Online]. Available: https://arxiv.org/abs/2501.18234

  8. LangChain open-source framework. [Online]. Available: https://github.com/langchain-ai/langchain

  9. DeepSeek-AI, "DeepSeek LLM: Scaling Open-Source Language Models with Longtermism," arXiv preprint arXiv:2401.02954, 2024. [Online]. Available: https://github.com/deepseek-ai/DeepSeek-LLM

  10. Model Context Protocol. [Online]. Available: https://modelcontextprotocol.io

  11. R. D. Zimmerman and C. E. Murillo-Sanchez, MATPOWER User's Manual, Version 8.1, 2025. [Online]. Available: https://matpower.org/docs/MATPOWER-manual-8.1.pdf

  12. Faiss: a library for efficient similarity search and clustering of dense vectors. [Online]. Available: https://github.com/facebookresearch/faiss

  13. Chainlit: build Python LLM apps in minutes. [Online]. Available: https://github.com/Chainlit/chainlit