PF$\Delta$: A Benchmark Dataset for Power Flow under Load, Generation, and Topology Variations

Alvaro Carbonero; Ana K. Rivera; Anvita Bhagavathula; Priya Donti

arxiv: 2510.22048 · v4 · pith:FX5OJPAAnew · submitted 2025-10-24 · 💻 cs.LG

Pith reviewed 2026-05-18 04:05 UTC · model grok-4.3

keywords: power flow, benchmark dataset, machine learning, graph neural networks, contingency analysis, power systems, voltage stability, grid operations

The pith

The PFΔ benchmark provides 859,800 power flow instances to test solvers and ML methods under load, generation, topology, and contingency variations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces PFΔ to fill a gap in power flow benchmarks, which have not accounted for the real-world variability introduced by renewables and extreme weather. It contains 859,800 solved instances across six bus system sizes, covering N, N-1, and N-2 contingency scenarios as well as close-to-infeasible cases near voltage stability limits. The authors evaluate traditional solvers and GNN-based methods on this dataset and identify where those methods struggle. This matters to readers because it can support the development of faster tools for grid operations and security analysis.
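To make the N, N-1, and N-2 terminology concrete, here is a minimal sketch of how outage sets can be enumerated for a small network. It is an illustration only, not the authors' generation pipeline: the 4-bus grid, the function names `enumerate_outages` and `stays_connected`, and the choice to screen out outages that split the grid into islands are all assumptions made for this example.

```python
from itertools import combinations


def enumerate_outages(branches, k):
    """Return every set of k simultaneous branch outages (the N-k cases)."""
    return [set(combo) for combo in combinations(branches, k)]


def stays_connected(buses, branches, outage):
    """Check by graph traversal that the grid is still one island after the outage."""
    remaining = [b for b in branches if b not in outage]
    neighbors = {bus: set() for bus in buses}
    for u, v in remaining:
        neighbors[u].add(v)
        neighbors[v].add(u)
    seen, stack = set(), [buses[0]]
    while stack:
        bus = stack.pop()
        if bus in seen:
            continue
        seen.add(bus)
        stack.extend(neighbors[bus] - seen)
    return len(seen) == len(buses)


# Hypothetical 4-bus toy grid: a ring plus one diagonal (not a PFΔ test case).
buses = [1, 2, 3, 4]
branches = [(1, 2), (2, 3), (3, 4), (4, 1), (1, 3)]

n1_cases = enumerate_outages(branches, 1)
n2_cases = enumerate_outages(branches, 2)
# Assumed screening step: keep only N-2 outages that leave a single island.
n2_connected = [c for c in n2_cases if stays_connected(buses, branches, c)]

print(len(n1_cases), len(n2_cases), len(n2_connected))
```

For this toy grid the script reports 5 single outages and 10 double outages, 8 of which leave the grid connected. The number of N-2 cases grows quadratically with the branch count, which is one reason repeated power flow evaluation becomes a bottleneck in contingency analysis.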

Core claim

PFΔ is a benchmark dataset for power flow that captures diverse variations in load, generation, and topology, spanning six system sizes, three contingency types, and near-infeasible points, allowing identification of limitations in current solving approaches.

What carries the argument

The PFΔ dataset itself, built by generating systematic variations in load, generation, topology, and including contingency scenarios and stability boundary cases.

If this is right

Evaluations can guide improvements in traditional power flow algorithms for challenging cases.
GNN methods can be refined to better handle topology changes and contingencies.
The dataset enables systematic assessment of ML approaches for speeding up contingency analysis.
Future work can target the open problems highlighted for more robust grid simulation tools.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This benchmark may standardize testing for power system ML models in a way that accelerates progress in the field.
It could be extended to include more complex dynamics or uncertainty models from climate data.
Adoption might lead to hybrid methods combining solvers and learning for better real-time performance.

Load-bearing premise

The synthetic variations and chosen scenarios are representative enough of real-world power system conditions to serve as a useful benchmark.

What would settle it

The premise would be undermined if tests on actual grid operational data yielded different difficulty rankings for the methods than those observed on PFΔ.

Figures

Figures reproduced from arXiv: 2510.22048 by Alvaro Carbonero, Ana K. Rivera, Anvita Bhagavathula, Priya Donti.

**Figure 2.** Data generation process for a single data sample within …

**Figure 3.** Experimental results for all selected tasks.

**Figure 4.** Results for Task 3.1 showcasing the Power Balance Loss (PBL) on a combined feasible …

Original abstract

Power flow (PF) calculations are the backbone of real-time grid operations, across workflows such as contingency analysis (where repeated PF evaluations assess grid security under outages) and topology optimization (which involves PF-based searches over combinatorially large action spaces). Running these calculations at operational timescales or across large evaluation spaces remains a major computational bottleneck. Additionally, growing uncertainty in power system operations from the integration of renewables and climate-induced extreme weather also calls for tools that can accurately and efficiently simulate a wide range of scenarios and operating conditions. Machine learning methods offer a potential speedup over traditional solvers, but their performance has not been systematically assessed on benchmarks that capture real-world variability. This paper introduces PF$\Delta$, a benchmark dataset for power flow that captures diverse variations in load, generation, and topology. PF$\Delta$ contains 859,800 solved power flow instances spanning six different bus system sizes, capturing three types of contingency scenarios (N, N-1, and N-2), and including close-to-infeasible cases near steady-state voltage stability limits. We evaluate traditional solvers and GNN-based methods, highlighting key areas where existing approaches struggle, and identifying open problems for future research. Our dataset is available at https://huggingface.co/datasets/pfdelta/pfdelta/tree/main and our code with data generation scripts and model implementations is at https://github.com/MOSSLab-MIT/pfdelta.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

PFΔ is a straightforward benchmark release with a large set of solved power flow cases including N-2 contingencies and near-infeasible points, plus public code that makes it easy to use.

PFΔ stands out as a benchmark dataset with a large number of power flow solutions that include N-2 contingencies and near-infeasible cases. The paper generates nearly 860,000 solved instances across six bus system sizes, covering normal operation, single contingencies, and double contingencies, plus points close to the edge of voltage stability. The authors make the data available on Hugging Face and provide the generation code on GitHub, so others can check the results or generate more data if needed. The comparisons between classical solvers and graph neural network models point to specific weaknesses on the ML side, which is useful for moving the field forward.

On the downside, the variations come from synthetic changes to standard test systems. That is fine for a starting benchmark, but real power grids have more complex correlations and uncertainties from weather and renewable sources. The paper assumes these controlled perturbations are representative enough, which is reasonable but worth noting as a limitation. The exact parameter ranges and the method for detecting infeasibility could be described in more detail to strengthen the work.

Readers who are building or testing machine learning surrogates for grid operations will find this relevant, since it gives them a shared set of problems to compare against. The combination of scale, contingency coverage, and open resources makes it solid enough that I would recommend sending it to peer review.

Referee Report

0 major / 3 minor

Summary. The paper introduces PFΔ, a benchmark dataset containing 859,800 solved power-flow instances across six bus-system sizes. The dataset incorporates controlled synthetic variations in load, generation, and topology, three contingency types (N, N-1, N-2), and operating points near steady-state voltage stability limits. It reports evaluations of conventional solvers and GNN-based methods on these instances and releases both the dataset and the generation scripts.

Significance. If the generation pipeline is fully reproducible, the public release of this large, documented collection of solved instances with explicit near-limit and contingency cases supplies a concrete, verifiable testbed for ML methods targeting power-flow bottlenecks in contingency analysis and topology optimization. The accompanying code and scripts constitute a clear strength that supports independent verification and extension.

minor comments (3)

[§3] Data Generation: the ranges and sampling distributions used for load and generation perturbations are not stated in sufficient numerical detail; providing the exact intervals or distributions would allow the reported instance counts and near-infeasibility statistics to be reproduced.
[Table 1] Table 1 or an equivalent summary table: the breakdown of instances by bus-system size, contingency type, and feasibility status should be presented explicitly so that readers can immediately verify the claimed total (859,800) and the proportion of close-to-infeasible cases.
[Evaluation] Evaluation section: the precise definition of “close-to-infeasible” (e.g., voltage magnitude or loading margin thresholds) and the infeasibility detection criterion used by the underlying solver should be stated in one place to avoid ambiguity when comparing solver and GNN performance.
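As an illustration of what a loading-margin definition of “close-to-infeasible” could look like, the sketch below uses the textbook two-bus case: a slack bus at voltage E feeding a unity-power-factor load P through a lossless line of reactance X. In that simplified setting the load voltage V satisfies V^4 - E^2 V^2 + (PX)^2 = 0, so a solution exists only when P <= E^2 / (2X). The function names and the two-bus simplification are assumptions for this example; the paper's own criterion may be defined differently.

```python
import math


def max_deliverable_power(x, e=1.0):
    """Largest unity-power-factor load a lossless line of reactance x can serve (p.u.)."""
    return e**2 / (2.0 * x)


def load_voltage(p, x, e=1.0):
    """High-voltage solution of V^4 - e^2 V^2 + (p x)^2 = 0, or None if infeasible."""
    disc = e**4 - 4.0 * (p * x) ** 2
    if disc < 0.0:
        return None  # no real solution: the load exceeds the stability limit
    return math.sqrt((e**2 + math.sqrt(disc)) / 2.0)


def loading_margin(p, x, e=1.0):
    """Fraction of headroom left before the limit (0 means exactly at the limit)."""
    return 1.0 - p / max_deliverable_power(x, e)


# Hypothetical operating points on a line with x = 0.1 p.u.
for p in (1.0, 4.0, 4.9, 6.0):
    print(p, load_voltage(p, 0.1), loading_margin(p, 0.1))
```

A threshold on `loading_margin` (for example, flagging points below some small value) would be one way to state “close-to-infeasible” in one place; any such threshold here is hypothetical and not taken from the paper.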

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive summary, recognition of the dataset's significance for ML methods in power systems, and recommendation for minor revision. The assessment of reproducibility and utility for contingency analysis and topology optimization aligns with our goals. No specific major comments were listed in the report.

Circularity Check

0 steps flagged

No significant circularity; empirical dataset contribution is self-contained

The paper's core contribution is the creation and public release of the PFΔ benchmark dataset consisting of 859,800 solved power-flow instances generated from standard test cases via controlled synthetic perturbations in load, generation, and topology, along with N-1/N-2 contingencies and near-limit points. No derivation chain, first-principles predictions, or fitted parameters are claimed; evaluations of solvers and GNN methods are empirical and independently verifiable. The generation pipeline relies on established power-flow solvers whose outputs can be reproduced externally. No self-citation load-bearing steps, self-definitional reductions, or ansatz smuggling are present. This is a standard honest finding for a dataset/benchmark paper.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on standard power-system modeling assumptions and the premise that the chosen synthetic variations adequately sample the space of realistic operating conditions.

axioms (1)

[standard math] Standard power flow equations are solved by conventional numerical methods to produce the labeled instances.
Invoked throughout the dataset construction process described in the abstract.
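For readers unfamiliar with the equations this axiom refers to, the sketch below evaluates the standard complex-power balance $S_i = V_i \left(\sum_j Y_{ij} V_j\right)^*$ on a toy network and reports the mismatch against specified injections. It is a generic textbook formulation under simplifying assumptions (series impedances only, no shunts or transformers); the helper names and the two-bus data are hypothetical, and the Power Balance Loss reported in the paper may be defined differently.

```python
import cmath


def build_ybus(n_bus, lines):
    """Bus admittance matrix from (from_bus, to_bus, series_impedance) tuples.

    Buses are zero-indexed; shunts and transformers are ignored in this sketch.
    """
    ybus = [[0j] * n_bus for _ in range(n_bus)]
    for i, j, z in lines:
        y = 1 / z
        ybus[i][i] += y
        ybus[j][j] += y
        ybus[i][j] -= y
        ybus[j][i] -= y
    return ybus


def injections(ybus, v):
    """Complex power injected at each bus: S_i = V_i * conj(sum_j Y_ij V_j)."""
    n = len(v)
    return [v[i] * sum(ybus[i][j] * v[j] for j in range(n)).conjugate()
            for i in range(n)]


def power_balance_residual(ybus, v, s_spec):
    """Per-bus mismatch between specified and computed injections."""
    return [spec - calc for spec, calc in zip(s_spec, injections(ybus, v))]


# Hypothetical two-bus example in per unit (not taken from the dataset).
ybus = build_ybus(2, [(0, 1, 0.01 + 0.1j)])
v = [1 + 0j, cmath.rect(0.98, -0.05)]  # magnitude and angle (radians) per bus
s = injections(ybus, v)
residual = power_balance_residual(ybus, v, s)
print(s, residual)
```

A voltage profile solves the power flow problem when this residual is zero at every bus (up to a tolerance), which makes the mismatch a natural check on both solver outputs and learned predictions.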

pith-pipeline@v0.9.0 · 5796 in / 1324 out tokens · 75763 ms · 2026-05-18T04:05:22.957619+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Passage: "We introduce PFΔ, a benchmark dataset for evaluating ML approaches to power flow across variations in load distributions, generator profiles, grid sizes, and N–1/N–2 topological perturbations."

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Passage: "Power flow (PF) calculations are the backbone of real-time grid operations... solving the nonlinear, implicit system of equations comprising (1)–(2)"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.