WARP: A Benchmark for Primal-Dual Warm-Starting of Interior-Point Solvers

Dhruv Suri; Helgi Hilmarsson; Shourya Bose

arXiv: 2605.05728 · v1 · submitted 2026-05-07 · 💻 cs.LG · cs.AI · cs.SY · eess.SY · math.OC

Pith reviewed 2026-05-08 14:58 UTC · model grok-4.3

keywords: warm-start · interior-point methods · optimal power flow · machine learning · benchmark · primal-dual · IPOPT · graph neural networks

The pith

Primal-only machine learning predictions fail to speed up interior-point solvers for power flow once the correct default start is used, but full primal-dual predictions cut iterations by 76 percent.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Prior machine learning attempts to warm-start interior-point solvers for AC optimal power flow reported large iteration savings by comparing against a flat start, but the solver's actual default midpoint between variable bounds already sits near the central path. Against that corrected baseline, methods that supply only a predicted primal solution produce no reduction in iterations and can even slow convergence because accurate primal guesses push the solver away from the barrier path. Oracle trials show that supplying the complete state of primal variables, dual multipliers, slacks, and barrier parameter drops iterations from 23 to 3. The WARP model predicts this full state on a graph that encodes the heterogeneous constraints and handles changes in network topology without retraining, delivering a 76 percent iteration reduction on the same problems.

Core claim

The paper claims that interior-point methods exhibit a geometric anticorrelation in which primal prediction accuracy harms convergence speed, so that only the complete primal-dual-barrier state (x*, λ*, z*, μ*) is structurally capable of large iteration reductions; primal-only warm-starts are therefore ineffective against the solver default, while the released WARP encode-process-decode network on the constraint graph achieves a 76 percent reduction and accommodates N-1 topology variations.
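
For orientation, the state named in the core claim can be read against the standard perturbed KKT system of a log-barrier method. The block below is a sketch under assumed notation (equality constraints $c(x) = 0$ with Jacobian $J$, bounds $l \le x \le u$, and one common sign convention for the bound multipliers); the paper's own formulation may differ in detail.

```latex
\begin{aligned}
\nabla f(x) + J(x)^{\top}\lambda - z_l + z_u &= 0, \\
c(x) &= 0, \\
(x_i - l_i)\, z_{l,i} &= \mu \quad \text{for all } i, \\
(u_i - x_i)\, z_{u,i} &= \mu \quad \text{for all } i, \qquad z_l,\, z_u > 0 .
\end{aligned}
```

A point lies on the central path when all four conditions hold for the current $\mu$. Supplying $x^*$ alone fixes only the primal part: wherever a bound is nearly active, $x_i - l_i$ is tiny, so the matching multiplier would have to be about $\mu / (x_i - l_i)$, which a default dual initialisation does not supply. This is one way to read the claim that accurate primal guesses can push the solver away from the barrier path.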

What carries the argument

The full interior-point state consisting of primal solution, dual multipliers, slack variables, and barrier parameter, predicted by a topology-conditioned encode-process-decode interaction network on the heterogeneous constraint graph.

If this is right

Primal-only warm-start methods cannot reduce interior-point iterations below the solver's default midpoint start in AC-OPF problems.
Only predictors that also supply dual and barrier information can approach the 85 percent iteration reduction observed in the oracle experiments.
Evaluation protocols for warm-start research must adopt the solver default rather than flat starts as the reference point.
Graph-based models can deliver warm-starts that adapt to N-1 contingency topology changes without retraining.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The observed anticorrelation between primal accuracy and convergence speed may reflect a general property of barrier methods rather than a quirk of AC-OPF.
Benchmark corrections of the kind introduced here are likely needed wherever machine learning is applied to warm-starting of interior-point or other path-following solvers.
The same full-state prediction approach could be tested on other interior-point implementations beyond IPOPT to check whether the 76 percent reduction generalizes.

Load-bearing premise

That the variable-bound midpoint is the solver's actual default starting point and remains near-optimal for log-barrier centrality across the tested AC-OPF instances and IPOPT configuration.
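
The centrality half of this premise can be checked in isolation. The sketch below, in Python, shows that for box bounds alone the midpoint maximises the sum of log-barrier terms, so it is the analytic centre of the box. The bounds are hypothetical, and the sketch ignores the equality constraints and objective of a real AC-OPF problem; it says nothing about whether the midpoint is in fact IPOPT's default start, which remains the paper's premise.

```python
import math

def log_barrier(x, lower, upper):
    """Sum of log-barrier terms for box bounds l < x < u (larger = better centred)."""
    return sum(math.log(xi - lo) + math.log(hi - xi)
               for xi, lo, hi in zip(x, lower, upper))

def midpoint(lower, upper):
    """Variable-bound midpoint (l + u) / 2, component by component."""
    return [(lo + hi) / 2.0 for lo, hi in zip(lower, upper)]

# Hypothetical bounds chosen for illustration; not taken from any AC-OPF case.
lower = [0.0, -2.0, 10.0]
upper = [4.0, 1.0, 30.0]

mid = midpoint(lower, upper)
near_bound = [3.99, -1.99, 20.0]  # strictly feasible, but hugging two bounds

centred_value = log_barrier(mid, lower, upper)
skewed_value = log_barrier(near_bound, lower, upper)
print(mid, centred_value > skewed_value)
```

Each term pair log(x - l) + log(u - x) is concave and symmetric about the midpoint, which is why any move away from it lowers the barrier value.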

What would settle it

A controlled experiment in which a primal-only warm-start method produces fewer iterations than the midpoint baseline on a fresh set of AC-OPF cases run with the same IPOPT settings would falsify the central claim.

Figures

Figures reproduced from arXiv: 2605.05728 by Dhruv Suri, Helgi Hilmarsson, Shourya Bose.

**Figure 1.** WARP architecture. Left: The encode-process-decode pipeline maps load demands and grid topology to the full interior-point state $(\hat{x}, \hat{\lambda}, \hat{z}, \hat{\mu})$, which warm-starts IPOPT. Right: Detail of one interaction network block ($K = 15$ total, unshared weights). Graph construction: Three node types (bus, generator, load) and four directed edge types (AC line, transformer, generator-bus, load-bus). Node features co…

Original abstract

Solving AC Optimal Power Flow (AC-OPF) is of central importance in electricity market operations, where interior-point methods (IPMs) such as IPOPT are the standard solvers. A growing body of work uses machine learning to predict primal warm-start iterates, reporting iteration reductions of 30-46\%. We show that these reported gains rest on an inappropriate evaluation baseline: prior methods benchmark against the flat start $V_m = 1, V_a = 0$, whereas the solver's actual default - the variable-bound midpoint $(l+u)/2$ - is near-optimal for log-barrier centrality. Against this corrected baseline, no primal-only warm-start method reduces solver iterations. We trace the failure to a geometric property of interior-point methods: primal prediction accuracy is anticorrelated with convergence speed, and providing the ground-truth optimal solution $x^*$ without dual variables causes the solver to diverge. Oracle experiments establish that the complete primal-dual-barrier state $(x^*, \lambda^*, z^*, \mu^*)$ reduces IPOPT iterations from 23 to 3 - an 85\% reduction that is structurally inaccessible to primal-only methods. To enable rigorous evaluation of warm-start methods on this task, we release a benchmark suite comprising dual-labeled AC-OPF datasets with IPOPT-extracted solutions, a corrected evaluation protocol, and WARP - a topology-conditioned encode-process-decode interaction network that predicts the full interior-point state $(\hat{x}, \hat{\lambda}, \hat{z}, \hat{\mu})$ on the heterogeneous constraint graph. WARP achieves a 76\% reduction in IPOPT iterations while natively accommodating N-1 contingency topology variations without retraining.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

Private letter to a colleague.

The paper shows that primal-only warm-starts for AC-OPF lose their reported gains against the solver's actual default start, and supplies a dual-labeled benchmark plus a full-state graph model that delivers real iteration cuts.

The main thing to know is that earlier machine learning papers on warm-starting interior-point methods for AC optimal power flow were using the wrong comparison point. They measured against a flat start of voltage magnitude 1 and angle 0, but the solver default is the midpoint between variable bounds, which already sits near the center of the log-barrier. Once that baseline is fixed, the claimed 30-46% iteration savings from primal-only predictions disappear entirely. Oracle runs confirm that only the full primal-dual-barrier state produces big speedups, dropping iterations from 23 to 3.
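
The headline percentages are simple ratios of iteration counts. The Python sketch below shows the arithmetic; the function name is illustrative. Note that the rounded counts 23 and 3 give roughly 87 percent, not the quoted 85 percent. One possible explanation is that the quoted figure was computed from unrounded per-instance means, but that is an assumption and not something the source states.

```python
def iteration_reduction(baseline_iters, warm_iters):
    """Fractional reduction in iterations relative to the baseline start."""
    if baseline_iters <= 0:
        raise ValueError("baseline_iters must be positive")
    return (baseline_iters - warm_iters) / baseline_iters

# Rounded oracle counts quoted in the text: 23 iterations down to 3.
oracle = iteration_reduction(23, 3)
print(f"oracle reduction from rounded counts: {oracle:.1%}")  # 87.0%
```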

Referee Report

3 major / 2 minor

Summary. The paper argues that prior machine learning warm-start methods for interior-point solvers on AC Optimal Power Flow (AC-OPF) problems have used an inappropriate flat-start baseline (V_m=1, V_a=0) instead of the solver's default variable-bound midpoint (l+u)/2, which is near-optimal for log-barrier centrality. Against this corrected baseline, no primal-only warm-start reduces iterations, and providing only the primal optimum x* can cause divergence due to an anticorrelation between primal accuracy and convergence speed. Oracle experiments show that the full primal-dual-barrier state reduces IPOPT iterations from 23 to 3 (85% reduction). The authors release dual-labeled AC-OPF datasets, a corrected evaluation protocol, and WARP, a topology-conditioned encode-process-decode graph network that predicts the full state (x, λ, z, μ) and achieves a 76% iteration reduction while handling N-1 contingencies without retraining.

Significance. If the empirical findings hold, the work would correct a methodological flaw in ML-for-optimization research on warm-starting, establish that dual and barrier predictions are structurally necessary for IPM acceleration, and supply a reusable benchmark with dual-labeled data that enables rigorous comparison. The oracle results and topology-handling capability of WARP are particularly notable strengths that could influence solver design beyond AC-OPF.

major comments (3)

[§4 and §5.1] §4 (Evaluation Protocol) and §5.1 (Baseline Comparison): The central claim that no primal-only method reduces iterations rests on (l+u)/2 being both IPOPT's actual default initialization and near-optimal for centrality. This equivalence is asserted but not directly verified against IPOPT source, options (e.g., warm_start_init_point), or bound heuristics; if the solver's internal start differs, the dismissal of prior primal-only methods and the necessity of full-state prediction do not follow.
[§5.2] §5.2 (Anticorrelation Analysis): The reported anticorrelation between primal prediction accuracy and solver convergence speed is observed only on the tested AC-OPF instances under the midpoint baseline; without statistical tests (e.g., correlation coefficients or cross-instance validation) or experiments on other problem classes, this geometric property cannot yet be treated as general.
[§6.3] §6.3 (Oracle Experiments): The reduction from 23 to 3 iterations when supplying the complete (x*, λ*, z*, μ*) state is load-bearing for the argument that primal-only methods are structurally limited, yet the exact IPOPT configuration, barrier update schedule, and handling of μ* are not specified, making reproduction and generalization difficult.

minor comments (2)

[Figures/Tables] Figure 2 and Table 1: Axis labels and captions should explicitly state the IPOPT version, tolerance settings, and whether iteration counts include the initial factorization.
[Notation] Notation: The symbols for the barrier parameter and dual variables are introduced clearly but should be summarized in a single table for quick reference when comparing WARP predictions to ground truth.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments, which have helped us improve the clarity and rigor of the manuscript. We address each major comment below.

Referee: [§4 and §5.1] §4 (Evaluation Protocol) and §5.1 (Baseline Comparison): The central claim that no primal-only method reduces iterations rests on (l+u)/2 being both IPOPT's actual default initialization and near-optimal for centrality. This equivalence is asserted but not directly verified against IPOPT source, options (e.g., warm_start_init_point), or bound heuristics; if the solver's internal start differs, the dismissal of prior primal-only methods and the necessity of full-state prediction do not follow.

Authors: We appreciate this observation. Upon re-examination of the IPOPT source code (version 3.14.4), the default initialization in the IpIpoptApplication class indeed uses the midpoint of the variable bounds when no initial point is provided and warm_start_init_point is set to 'no'. We will add this verification, including relevant code excerpts and option settings, to Section 4 in the revised manuscript to substantiate the baseline choice. This does not alter our conclusions but strengthens the presentation. revision: yes
Referee: [§5.2] §5.2 (Anticorrelation Analysis): The reported anticorrelation between primal prediction accuracy and solver convergence speed is observed only on the tested AC-OPF instances under the midpoint baseline; without statistical tests (e.g., correlation coefficients or cross-instance validation) or experiments on other problem classes, this geometric property cannot yet be treated as general.

Authors: The referee is correct that we have not included formal statistical tests in the current version. In the revision, we will compute and report Pearson correlation coefficients with p-values for the anticorrelation between primal error and iteration count across all test cases. We will also include a short theoretical explanation linking this to the log-barrier centrality condition. While the paper focuses on AC-OPF and does not claim generality to all IPM problems, we will clarify this scope and note that similar behavior has been observed in related literature on IPMs. No experiments on other classes are added as they fall outside the paper's scope. revision: partial
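
The test promised in this response can be sketched as follows. This is a minimal Python illustration, not the authors' analysis. The data are synthetic placeholders, and the p-value comes from a permutation test rather than the parametric test that a library routine such as `scipy.stats.pearsonr` would report; the swap keeps the sketch dependency-free.

```python
import math
import random

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def permutation_p_value(xs, ys, n_perm=2000, seed=0):
    """Two-sided p-value: share of shuffles with |r| at least the observed |r|."""
    rng = random.Random(seed)
    observed = abs(pearson_r(xs, ys))
    shuffled = list(ys)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if abs(pearson_r(xs, shuffled)) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)

# Synthetic placeholder data, NOT results from the paper:
# smaller primal error paired with more iterations (the claimed anticorrelation).
primal_error = [0.02, 0.05, 0.08, 0.12, 0.20, 0.30, 0.45, 0.60]
iterations = [31, 29, 30, 27, 26, 25, 24, 23]

r = pearson_r(primal_error, iterations)
p = permutation_p_value(primal_error, iterations)
print(f"r = {r:.3f}, permutation p = {p:.4f}")
```

A negative r here means that lower primal error goes with higher iteration counts, which is the direction the paper reports.
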
Referee: [§6.3] §6.3 (Oracle Experiments): The reduction from 23 to 3 iterations when supplying the complete (x*, λ*, z*, μ*) state is load-bearing for the argument that primal-only methods are structurally limited, yet the exact IPOPT configuration, barrier update schedule, and handling of μ* are not specified, making reproduction and generalization difficult.

Authors: We agree that additional details are necessary for reproducibility. In the revised manuscript and the accompanying code repository, we will provide the complete IPOPT configuration used for the oracle experiments, including the barrier parameter update strategy (mu_strategy = 'adaptive'), initial mu value, tolerance settings, and how the predicted μ* is incorporated (via the mu_init option). A reproduction script will be added to the benchmark suite. revision: yes
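
A configuration of the kind described in this response might look like the Python sketch below. The option names are real IPOPT options, but every value except the strategy named in the rebuttal and the label-extraction tolerance is a placeholder, and `RecordingProblem` is a hypothetical stand-in so the sketch runs without cyipopt installed.

```python
# Option names are IPOPT options; values marked hypothetical are NOT the paper's settings.
ORACLE_OPTIONS = {
    "tol": 1e-4,                         # tolerance quoted for label extraction
    "mu_strategy": "adaptive",           # as stated in the rebuttal
    "warm_start_init_point": "yes",      # accept user-supplied primal and dual values
    "mu_init": 1e-6,                     # hypothetical stand-in for a predicted mu*
    "warm_start_bound_push": 1e-9,       # hypothetical
    "warm_start_mult_bound_push": 1e-9,  # hypothetical
}

def apply_options(problem, options):
    """Apply options to any object exposing add_option(name, value)."""
    for name, value in options.items():
        problem.add_option(name, value)
    return problem

class RecordingProblem:
    """Hypothetical stand-in for a solver problem object; it only records options."""
    def __init__(self):
        self.options = {}

    def add_option(self, name, value):
        self.options[name] = value

demo = apply_options(RecordingProblem(), ORACLE_OPTIONS)
print(demo.options["mu_strategy"], demo.options["mu_init"])
```

With cyipopt, the same loop would target a `cyipopt.Problem`, and the dual values would be passed alongside the primal point through the `lagrange`, `zl`, and `zu` arguments of `solve`. One point worth checking against the IPOPT option reference: `mu_init` is documented as relevant to the monotone barrier strategy, so whether it takes effect under `mu_strategy = 'adaptive'` should be verified before relying on it.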

Circularity Check

0 steps flagged

No significant circularity; claims rest on external empirical benchmarks

The paper's argument chain consists of empirical comparisons: prior primal-only warm-starts are tested against the variable-bound midpoint baseline using IPOPT runs on AC-OPF instances, oracle experiments measure iteration reductions from the full primal-dual state, and WARP (a trained encode-process-decode network) is evaluated on held-out instances for a 76% reduction. No step reduces by construction to fitted parameters, self-citations, or ansatzes; performance metrics derive from independent solver executions rather than internal redefinitions or renamings. The model training uses data but the central claims (baseline correction, anticorrelation observation, and WARP gains) are falsifiable against external runs and do not collapse to the inputs.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claims rest on the assumption that the midpoint start is the solver default and on standard neural-network training assumptions; no new physical entities are postulated.

free parameters (1)

WARP network weights
Parameters of the encode-process-decode graph network are fitted to the dual-labeled AC-OPF training data.

axioms (1)

Domain assumption: the variable-bound midpoint (l+u)/2 is the solver's actual default start and near-optimal for log-barrier centrality.
Invoked to establish the corrected baseline against which prior methods are re-evaluated.

pith-pipeline@v0.9.0 · 5631 in / 1351 out tokens · 52629 ms · 2026-05-08T14:58:27.345229+00:00 · methodology

Reference graph

Works this paper leans on

22 extracted references · 22 canonical work pages

[1]

Load the OPFDataset HeteroData graph containing bus, generator, load, and branch data

work page
[2]

Construct the cyipopt NLP problem with exact Hessian and sparse Jacobian structure

work page
[3]

Initialise IPOPT at the midpoint $(l+u)/2$ with default dual initialisation

work page
[4]

Run IPOPT to convergence (tolerance $10^{-4}$)

work page
[5]

Extract the full converged state: $D_i = (x_i^*, \lambda_i^*, z_{l,i}^*, z_{u,i}^*, \mu_i^*, f(x_i^*))$

work page
[6]

E.2 Convergence statistics Table 18: Dual label extraction statistics for case118

Save as a PyTorch tensor file: data/duals/case118/{split}/duals_{idx:06d}.pt.

E.2 Convergence statistics

Table 18: Dual label extraction statistics for case118.

| Split | Instances | Converged | Rate | Mean time (s) | Total time (h) |
|---|---|---|---|---|---|
| Train | 5,000 | 5,000 | 100% | 2.5 | 3.5 |
| Validation | 500 | 500 | 100% | 2.5 | 0.35 |
| Test | 50 | 50 | 100% | 2.5 | 0.035 |

E.3 Dual variable distributions

The extracted d...

work page
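
The extraction steps in entries [1] to [6] end with one record per instance. The Python sketch below shows one way such a record and its file path could be represented. The class and function names are hypothetical, only the path pattern is taken from the text, and the complementarity check assumes the bound-multiplier convention $(x_i - l_i) z_{l,i} = \mu$ and $(u_i - x_i) z_{u,i} = \mu$.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath
from typing import List

@dataclass
class DualLabel:
    """Hypothetical container for one converged state D_i."""
    x: List[float]     # primal solution x*
    lam: List[float]   # constraint multipliers lambda*
    z_l: List[float]   # lower-bound multipliers z_l*
    z_u: List[float]   # upper-bound multipliers z_u*
    mu: float          # final barrier parameter mu*
    objective: float   # f(x*)

def label_path(split: str, idx: int, case: str = "case118") -> PurePosixPath:
    """Path pattern quoted in the text: data/duals/<case>/<split>/duals_<idx>.pt."""
    return PurePosixPath("data/duals") / case / split / f"duals_{idx:06d}.pt"

def complementarity_gap(label: DualLabel, lower, upper) -> float:
    """Largest deviation of (x-l)*z_l and (u-x)*z_u from mu (assumed convention)."""
    gaps = []
    for xi, lo, hi, zl, zu in zip(label.x, lower, upper, label.z_l, label.z_u):
        gaps.append(abs((xi - lo) * zl - label.mu))
        gaps.append(abs((hi - xi) * zu - label.mu))
    return max(gaps)

# Toy one-variable record, invented for illustration only.
toy = DualLabel(x=[1.0], lam=[0.0], z_l=[1e-3], z_u=[5e-4], mu=1e-3, objective=0.0)
print(label_path("train", 42), complementarity_gap(toy, [0.0], [3.0]))
```

A check like `complementarity_gap` is a cheap sanity test on extracted labels: a record far from zero gap is not a consistent point for the stated barrier parameter.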
[7]

Direct concatenation: $(P_i^d, Q_i^d)$ are appended to the feature vector of the bus node at which the load is connected, and to the feature vectors of all generators connected to that bus

work page
[8]

Global load skip: the sum of all loads $\sum_i (P_i^d, Q_i^d)$ is passed through a small MLP and concatenated to the generator decoder input, providing a global demand signal. Without load injection, the model outputs near-constant predictions (pred std ∼0.01–0.20 versus true std ∼0.6–1.0, correlation ≈0 for all variables), as the static graph features carry n...

work page
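
The two load-injection mechanisms in entries [7] and [8] can be sketched in plain Python. This is an illustration only: the function name and data layout are hypothetical, buses without a load are assumed to receive zeros, multiple loads at one bus are assumed to be summed, and the small MLP applied to the global sum is omitted.

```python
def inject_loads(bus_feats, gen_feats, gen_bus, loads):
    """Append per-bus load to bus and generator features; return the global load sum.

    bus_feats: {bus_id: [features]}, gen_feats: {gen_id: [features]},
    gen_bus: {gen_id: bus_id}, loads: [(bus_id, Pd, Qd), ...].
    """
    # Aggregate load per bus (assumption: zero where no load is attached).
    per_bus = {b: [0.0, 0.0] for b in bus_feats}
    for bus_id, pd, qd in loads:
        per_bus[bus_id][0] += pd
        per_bus[bus_id][1] += qd

    # Direct concatenation: bus nodes and the generators attached to them.
    new_bus = {b: feats + per_bus[b] for b, feats in bus_feats.items()}
    new_gen = {g: feats + per_bus[gen_bus[g]] for g, feats in gen_feats.items()}

    # Global load skip: total demand, which the text feeds through a small MLP.
    total = [sum(pd for _, pd, _ in loads), sum(qd for _, _, qd in loads)]
    return new_bus, new_gen, total

# Toy three-bus example, invented for illustration.
bus_feats = {1: [1.0], 2: [1.0], 3: [1.0]}
gen_feats = {"g1": [2.0]}
gen_bus = {"g1": 2}
loads = [(1, 0.5, 0.25), (2, 1.0, 0.5)]

new_bus, new_gen, total = inject_loads(bus_feats, gen_feats, gen_bus, loads)
print(new_bus[1], new_gen["g1"], total)
```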
[9]

Adding edge updates (Exp E) reduced loss to 0.45—the first time any GNN variant dropped below 1.0

Edge updates broke the 1.0 loss floor. The original node-only GNN plateaued at val loss 1.0 regardless of training configuration. Adding edge updates (Exp E) reduced loss to 0.45, the first time any GNN variant dropped below 1.0. This suggests that edge features carry information critical for dual prediction that node-only message passing cannot capture

work page
[10]

The binding mask helps the model allocate capacity to the sparse but critical binding multipliers; two-stage decoding conditions dual prediction on predicted primals

Loss strategies provided modest iteration gains. Binding-mask loss and two-stage decoding each independently reduced iterations from 7.0 to 6.7, but neither reduced validation loss substantially. The binding mask helps the model allocate capacity to the sparse but critical binding multipliers; two-stage decoding conditions dual prediction on predicted primals

work page
[11]

This is the largest single-modification gain in the entire ablation

Removing node residuals was the decisive change. Val loss dropped from 0.45 to 0.09 (5×), and IPOPT iterations from 7.0 to 5.4, from a single architectural modification. This is the largest single-modification gain in the entire ablation

work page
[12]

Combining all three (best_combo, 500 epochs) did not push below 5.4, indicating an architectural ceiling for this model family on case118

Further refinements hit a ceiling at 5.3. Per-node bias (1,268 additional parameters) and two-stage decoding each independently reached 5.3 iterations. Combining all three (best_combo, 500 epochs) did not push below 5.4, indicating an architectural ceiling for this model family on case118

work page
[13]

$H = 256$ (∼25M params) achieved 6.6 iterations, worse than $H = 128$

Wider models do not help. $H = 256$ (∼25M params) achieved 6.6 iterations, worse than $H = 128$. The additional capacity introduces optimisation difficulty without improving representational quality at this problem scale

work page
[14]

bus"], "batch

Physics loss was counterproductive. Adding an AC power balance violation loss (Exp E2) increased val loss from 0.45 to 1.10 and worsened iterations from 7.0 to 7.2. The physics loss conflicts with the per-variable normalisation: the power balance residual operates in physical units, creating a scale mismatch with the normalised MSE. 21 H Independent CANOS ...

work page 2025
[15]

The noise prediction task is harder than direct regression. The diffusion model must learn $\epsilon(x_t, t)$ at every noise level, a strictly harder mapping than direct $x_0$ prediction

work page
[16]

DDIM sampling introduces cumulative error. Each of the 50 denoising steps contributes a small approximation error that compounds

work page
[17]

The KKT scoring proxy is approximate. A full KKT residual computation (requiring Jacobian evaluation) would be more accurate but also more expensive

work page
[18]

bus","ac_line

Case118 is effectively unimodal. Each load scenario maps to a single well-separated optimum. Multi-sample diversity provides no benefit when the solution mapping is deterministic. 5. $K = 5$ is worse than $K = 1$. The scoring function may select atypical samples with low complementarity proxy but poor overall KKT satisfaction, suggesting the proxy metric is not...

work page 2020
[19]

developed gauge-map projections for problems with linear constraints, while Liang et al

work page
[20]

More recently, Chen et al

proposed homeomorphic projections for non-convex feasible regions. More recently, Chen et al. [2024] trained networks to predict feasible dual solutions, recovering associated primals via the stationarity condition. Our objective differs from these approaches: we seek to reduce solver iterations while retaining the feasibility and optimality guarantees of...

work page 2024
[21]

Sambharya et al

introduced the idea of learning optimiser update rules. Sambharya et al. [2024] learned warm-starts for fixed-point splitting methods on QPs by differentiating through unrolled solver iterations. Briden et al. [2024] proposed Lagrangian-informed losses for warm-starting trajectory optimisation under an SQP solver. Graph neural networks for physical simul...

work page 2024
[22]

Liu et al

provided an open-source reimplementation with physics-informed branch flow derivations. Liu et al. [2022] developed topology-aware GNNs with physics-based feasibility regularisation, demonstrating adaptivity to topological perturbations. We adopt the same architectural family but extend it to predict the full primal-dual-barrier state—a task that these pr...

work page 2022
