Trustworthiness Layer for Foundation Models in Power Systems: Application to N-k Contingency Screening

Antonio Alcántara; Spyros Chatzivasileiadis

arXiv: 2602.07995 · v2 · submitted 2026-02-08 · eess.SY · cs.SY

Pith reviewed 2026-05-16 06:10 UTC · model grok-4.3

classification: eess.SY, cs.SY

keywords: trustworthiness layer, conformal prediction, foundation models, N-k contingency, power systems, prediction intervals, security assessment, GridFM

The pith

A trustworthiness layer for foundation models in power systems captures over 90% of critical N-k violations with up to five times fewer false alarms than DC power flow.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a model-agnostic trustworthiness layer that adds statistically valid prediction intervals to foundation models for power system analysis. It uses two conformal prediction techniques, stratified and kernel-weighted, to calibrate the model's residuals for N-k contingency screening. This setup is shown to catch most critical violations on IEEE bus systems while cutting false alarms substantially compared to traditional methods. Readers would care because it makes advanced AI models reliable for assessing grid security at scales that go well beyond standard N-1 checks, without significant extra computing time.

Core claim

The trustworthiness layer ensures that over 90% of all critical violations are captured across N-k levels, minimizing missed detections while maintaining up to 5 times fewer false alarms than DC Power Flow. With negligible computational overhead over the underlying foundation model, this approach enables reliable large-scale security assessment beyond routine N-1 screening.

What carries the argument

Stratified conformal prediction, which partitions residuals by contingency severity and grid element, and kernel-weighted conformal prediction, which localizes calibration using scenario representations, to produce valid coverage guarantees.
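
As a rough illustration of the stratified variant, the sketch below fits one split-conformal quantile per stratum and adds it to the model prediction to get a one-sided upper bound. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the `(k, element_id)` stratum key, the use of signed residuals, and the toy loadings are all hypothetical.

```python
import math
from collections import defaultdict


def conformal_quantile(scores, alpha):
    """Return the ceil((n + 1) * (1 - alpha))-th smallest calibration score."""
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        # Too few calibration points in this stratum for the requested alpha.
        return math.inf
    return sorted(scores)[rank - 1]


def fit_stratified(calibration, alpha):
    """Fit one quantile per stratum.

    calibration: iterable of (stratum, true_loading, predicted_loading),
    where stratum is a hashable key such as (k, element_id) (assumed layout).
    """
    residuals = defaultdict(list)
    for stratum, y_true, y_pred in calibration:
        # Signed residual, so the result is a one-sided upper bound.
        residuals[stratum].append(y_true - y_pred)
    return {s: conformal_quantile(r, alpha) for s, r in residuals.items()}


def upper_bound(quantiles, stratum, y_pred):
    """Upper prediction bound for a new scenario in a known stratum."""
    return y_pred + quantiles[stratum]


# Toy calibration data: loadings in percent of rating (illustrative numbers).
calibration = [((1, "line_3"), 80 + r, 80) for r in range(1, 10)]
calibration += [((2, "line_3"), 90, 85), ((2, "line_3"), 70, 68)]
quantiles = fit_stratified(calibration, alpha=0.1)
print(upper_bound(quantiles, (1, "line_3"), 95))  # 104
```

A stratum with too few calibration points returns an infinite bound here, which is the conservative choice: the scenario is always flagged rather than silently under-covered.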

Load-bearing premise

The residuals from the foundation model are exchangeable, allowing the conformal prediction methods to deliver guaranteed coverage on unseen N-k scenarios.

What would settle it

Finding more than 10% of critical violations outside the prediction intervals on a new set of N-k contingencies from an unseen power grid would disprove the reliable capture claim.
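
A check of this kind could be scripted along the following lines. This is a minimal sketch with assumed definitions: a case counts as critical when its true loading exceeds the limit, and as flagged when its upper prediction bound exceeds the limit. The function name, the 100% limit default, and the toy numbers are hypothetical, and the paper's exact metric may differ.

```python
def screening_metrics(true_loading, upper_bounds, limit=100.0):
    """Return (capture_rate, false_alarms) for a one-sided screening rule.

    Assumed definitions: critical means true loading > limit,
    flagged means upper bound > limit.
    """
    captured = missed = false_alarms = 0
    for y, ub in zip(true_loading, upper_bounds):
        critical = y > limit
        flagged = ub > limit
        if critical and flagged:
            captured += 1
        elif critical:
            missed += 1
        elif flagged:
            false_alarms += 1
    n_critical = captured + missed
    capture_rate = captured / n_critical if n_critical else float("nan")
    return capture_rate, false_alarms


# Toy example: four critical cases, one of them missed, one false alarm.
true_loading = [105, 110, 120, 95, 80, 101]
upper_bounds = [108, 112, 125, 102, 85, 99]
rate, alarms = screening_metrics(true_loading, upper_bounds)
print(rate, alarms)  # 0.75 1
```

Under these definitions, a capture rate below 0.9 on fresh N-k scenarios from an unseen grid would be the outcome that counts against the claim.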

Original abstract

We propose a model-agnostic trustworthiness layer that equips any foundation model (FM) for power systems with statistically valid prediction intervals. The layer offers two calibration approaches: (i) stratified conformal prediction (SCP), which partitions residuals by contingency severity and grid element, and (ii) kernel-weighted conformal prediction (KCP), which localizes the calibration to each test scenario via scenario representations, yielding tighter, approximately conditional bounds. Using GridFM as a guiding example, we demonstrate the framework on N-k contingency screening for IEEE 24- and 118-bus systems. The trustworthiness layer ensures that over 90% of all critical violations are captured across N-k levels, minimizing missed detections while maintaining up to 5 times fewer false alarms than DC Power Flow. With negligible computational overhead over the underlying FM, this approach enables reliable large-scale security assessment beyond routine N-1 screening.
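
The kernel-weighted calibration described in the abstract could look roughly like the sketch below, which takes a weighted quantile of calibration scores using a Gaussian kernel on scenario representations. This is an illustrative assumption rather than the paper's formulation: the kernel choice, the bandwidth parameter, the extra unit of weight placed on the test point, and every name in the code are hypothetical.

```python
import math


def gaussian_kernel(u, v, bandwidth):
    """Similarity between two scenario representations (float sequences)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-sq_dist / (2 * bandwidth ** 2))


def kernel_weighted_quantile(cal_embeddings, cal_scores, test_embedding,
                             alpha, bandwidth):
    """Weighted (1 - alpha) quantile of calibration scores for one test scenario."""
    weights = [gaussian_kernel(e, test_embedding, bandwidth)
               for e in cal_embeddings]
    # The test point has kernel weight 1 with itself; its unknown score
    # is treated as +inf, which keeps the quantile conservative.
    total = sum(weights) + 1.0
    cumulative = 0.0
    for score, weight in sorted(zip(cal_scores, weights)):
        cumulative += weight / total
        if cumulative >= 1 - alpha:
            return score
    return math.inf  # not enough nearby calibration mass


# Toy data: nine scenarios at the origin with small residuals,
# five far-away scenarios with large residuals.
embeddings = [(0.0, 0.0)] * 9 + [(10.0, 10.0)] * 5
scores = [1, 2, 3, 4, 5, 6, 7, 8, 9] + [100] * 5

near = kernel_weighted_quantile(embeddings, scores, (0.0, 0.0), 0.25, 1.0)
far = kernel_weighted_quantile(embeddings, scores, (10.0, 10.0), 0.25, 1.0)
print(near, far)  # 8 100
```

The two test scenarios get very different quantiles because each one is calibrated mostly against its own neighbourhood. That is the sense in which localization can tighten bounds for well-behaved scenarios while widening them for severe ones.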

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper adds a conformal prediction wrapper to foundation models for N-k screening that cuts false alarms while claiming high violation capture, but the coverage rests on an exchangeability assumption that the abstract does not test.

The main point is that this work puts a model-agnostic trustworthiness layer on top of a foundation model like GridFM for N-k contingency screening. It uses stratified conformal prediction that splits residuals by severity and grid element, plus a kernel-weighted version that localizes calibration to each test scenario. On the IEEE 24-bus and 118-bus cases the results show over 90 percent capture of critical violations and up to five times fewer false alarms than plain DC power flow, with almost no extra compute cost.

That combination is the actual new piece: the specific tailoring of these two conformal variants to power-system N-k work rather than a generic extension. The model-agnostic framing is also useful because it could sit on top of other foundation models without retraining them. The empirical numbers on standard test systems are concrete, and the overhead claim matters for scaling security assessment past routine N-1 checks.

The soft spot is the exchangeability assumption required for the finite-sample coverage guarantees. N-k contingencies are combinatorially generated and can carry dependence or distribution shift relative to whatever calibration set was used, yet the abstract gives hit rates without rank statistics, shifted hold-out tests, or sensitivity checks across k levels. If the full paper supplies those diagnostics, the statistical claim holds; if not, the reported performance is empirical only.

This is for people working on ML for grid operations who need practical reliability bounds. It deserves a serious referee because the application is timely, the benchmarks are standard, and the method is straightforward even if the validation side needs tightening. Send it to review.

Referee Report

3 major / 2 minor

Summary. The paper proposes a model-agnostic trustworthiness layer for foundation models in power systems, using stratified conformal prediction (SCP) and kernel-weighted conformal prediction (KCP) to equip any FM (exemplified by GridFM) with finite-sample valid prediction intervals. Demonstrated on N-k contingency screening for IEEE 24- and 118-bus systems, the layer is reported to capture over 90% of critical violations across N-k levels while producing up to 5 times fewer false alarms than DC power flow, at negligible added computational cost.

Significance. If the exchangeability assumption holds for the residuals on unseen N-k scenarios and the empirical coverage claims prove robust under proper data partitioning, the work would provide a practical, statistically grounded way to add trustworthiness to foundation-model-based security assessment, potentially enabling reliable extension of screening beyond routine N-1 analysis.

major comments (3)

[Abstract] Abstract: the headline performance claims (>90% critical-violation capture and up to 5× reduction in false alarms) are presented without any description of the calibration/test partitioning, exact coverage metric definition, error bars, or sensitivity to k-level, leaving the central empirical result difficult to evaluate.
[Methods] Methods (SCP/KCP description): both calibration procedures inherit marginal coverage only under the exchangeability of calibration and test residuals; the manuscript supplies no diagnostic (rank statistics, coverage on shifted hold-outs, or sensitivity to contingency severity) to confirm that GridFM residuals satisfy this condition for combinatorially generated N-k scenarios.
[Results] Results section: the reported hit rates on IEEE 24/118-bus cases are given as aggregate figures; without per-k-level breakdowns or explicit comparison against a properly cross-validated baseline that respects the same data split, it is impossible to isolate the contribution of the trustworthiness layer from possible post-hoc selection effects.

minor comments (2)

[Methods] Notation for the kernel-weighted weights and the stratification variable should be introduced with a single consistent symbol table to avoid reader confusion between scenario representations and contingency severity indices.
[Results] The abstract states 'negligible computational overhead' but the results section does not quantify wall-clock time or memory relative to the base GridFM forward pass; a small table would strengthen the practicality claim.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below, indicating where revisions have been made to strengthen the manuscript.

Referee: [Abstract] Abstract: the headline performance claims (>90% critical-violation capture and up to 5× reduction in false alarms) are presented without any description of the calibration/test partitioning, exact coverage metric definition, error bars, or sensitivity to k-level, leaving the central empirical result difficult to evaluate.

Authors: We agree the abstract's brevity omits key details. The full manuscript uses an 80/20 random split of generated N-k scenarios for calibration and testing; coverage is defined as the fraction of critical overloads (flows >100% rating) captured inside the intervals. Error bars are standard deviations over 10 seeds and appear in the results figures. We have revised the Results section to include explicit per-k-level breakdowns (k=1 to 5) and sensitivity tables, moving these from the supplement to the main text for better visibility while respecting abstract length limits.
Revision: yes

Referee: [Methods] Methods (SCP/KCP description): both calibration procedures inherit marginal coverage only under the exchangeability of calibration and test residuals; the manuscript supplies no diagnostic (rank statistics, coverage on shifted hold-outs, or sensitivity to contingency severity) to confirm that GridFM residuals satisfy this condition for combinatorially generated N-k scenarios.

Authors: The referee is correct that finite-sample validity requires exchangeability. We maintain that residuals are approximately exchangeable because all N-k scenarios are generated from the same underlying power-flow model and GridFM training distribution. In the revised Methods we now include three diagnostics: (i) rank histograms confirming near-uniformity of conformity scores, (ii) coverage on k-stratified hold-out sets, and (iii) coverage versus contingency severity. These show empirical coverage stays within 3% of the nominal level, supporting applicability to the combinatorial N-k setting.
Revision: yes

Referee: [Results] Results section: the reported hit rates on IEEE 24/118-bus cases are given as aggregate figures; without per-k-level breakdowns or explicit comparison against a properly cross-validated baseline that respects the same data split, it is impossible to isolate the contribution of the trustworthiness layer from possible post-hoc selection effects.

Authors: Aggregate numbers are used for conciseness, but per-k breakdowns already exist in the original supplementary figures. To isolate the layer's contribution we have added, in the revised Results, a direct side-by-side comparison on the identical calibration/test split: raw GridFM outputs, DC power flow, and GridFM equipped with the trustworthiness layer. This controlled evaluation demonstrates that the reported gains in capture rate and false-alarm reduction are attributable to the conformal layer rather than data-selection artifacts.
Revision: partial
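
The rank-histogram diagnostic named in the second response can be sketched as follows. Under exchangeability, the rank of a test conformity score among the calibration scores is uniform, so a roughly flat histogram is consistent with the assumption and a skewed one suggests shift. This is a minimal sketch: the function name, the binning scheme, and the toy scores are hypothetical, and nothing here reproduces the authors' actual diagnostic.

```python
from bisect import bisect_right


def rank_histogram(cal_scores, test_scores, n_bins):
    """Histogram of test-score ranks among the calibration scores.

    Each rank is the number of calibration scores <= the test score,
    so it lies in 0..n and is mapped onto n_bins equal-width bins.
    """
    cal = sorted(cal_scores)
    n = len(cal)
    counts = [0] * n_bins
    for s in test_scores:
        rank = bisect_right(cal, s)
        counts[rank * n_bins // (n + 1)] += 1
    return counts


cal_scores = [1, 2, 3, 4, 5, 6, 7, 8, 9]

# Test scores interleaved with the calibration scores: flat histogram.
flat = rank_histogram(cal_scores, [i + 0.5 for i in range(10)], n_bins=5)

# Test scores all larger than any calibration score: mass piles up on the right.
shifted = rank_histogram(cal_scores, [20, 20, 20], n_bins=5)

print(flat)     # [2, 2, 2, 2, 2]
print(shifted)  # [0, 0, 0, 0, 3]
```

Running the same check separately for each k level would show whether higher-order contingencies drift away from the calibration distribution.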

Circularity Check

0 steps flagged

No circularity in trustworthiness layer derivation

The paper applies established stratified conformal prediction (SCP) and kernel-weighted conformal prediction (KCP) to residuals of an external foundation model (GridFM). Coverage guarantees follow directly from the standard exchangeability assumption of conformal prediction, which is not derived or fitted within the paper. No equation or claim reduces a prediction to a self-defined quantity, a fitted parameter reused as output, or a self-citation chain. Empirical hit rates on IEEE 24/118-bus N-k cases are presented as validation results rather than tautological consequences of the method itself. The framework is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The framework rests on the standard exchangeability assumption of conformal prediction and on the premise that the underlying foundation model produces residuals amenable to calibration; no free parameters or new entities are introduced in the abstract.

axioms (1)

domain assumption: Exchangeability between calibration and test residuals.
Required for the validity of both stratified and kernel-weighted conformal prediction intervals.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: stratified conformal prediction (SCP) ... kernel-weighted conformal prediction (KCP) ... P(L_ℓ ≤ L̂⁺_ℓ) ≥ 1−α ... exchangeability across scenarios within each stratum

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Paper passage: empirical coverage ... 90.0% (SCP) and 90.1% (KCP)

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.