pith. sign in

arxiv: 2604.23979 · v2 · submitted 2026-04-27 · 💻 cs.DC

SDSL-Solver: Scalable Distributed Sparse Linear Solvers for Large-Scale Interior Point Methods

Pith reviewed 2026-05-08 01:43 UTC · model grok-4.3

classification 💻 cs.DC
keywords sparse linear solversdistributed computinginterior point methodsKrylov subspace methodspreconditionersSchur complementscalable optimization
0
0 comments X

The pith

SDSL-Solver delivers distributed Krylov-based solvers that achieve up to 97 times speedup over single-node direct methods for large interior point problems.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents SDSL-Solver, a framework that tackles the dominant bottleneck in interior point methods by replacing direct sparse solvers with scalable distributed Krylov subspace methods equipped with custom preconditioners. It introduces two complementary parallel strategies, Block Jacobi for diagonally dominant cases and Bordered Block Diagonal for general or ill-conditioned matrices, plus a reuse strategy that spreads preconditioner costs across successive IPM steps. Tests on clusters with matrices up to five million dimensions show consistent speedups against both distributed PETSc and single-node PARDISO. If the approach generalizes, it would let practitioners solve substantially larger optimization models on ordinary multi-node hardware without prohibitive memory growth or fill-in.

Core claim

SDSL-Solver employs Krylov subspace methods combined with numerics-based sparse filtering and diagonal correction to produce high-quality preconditioners; it supplies Block Jacobi for diagonally dominant matrices and Bordered Block Diagonal (BBD) using Schur complement techniques for general or ill-conditioned matrices, together with preconditioner reuse across IPM iterations, yielding average speedups of 6.23 times and 7.77 times over PETSc on four nodes and 97.54 times and 5.85 times over single-node PARDISO.

What carries the argument

Krylov subspace methods with numerics-based sparse filtering and diagonal correction for preconditioners, realized through Block Jacobi and Bordered Block Diagonal (BBD) distributed methods plus Schur complement coupling and preconditioner reuse across iterations.

If this is right

  • Optimization problems with millions of variables become solvable on modest multi-node clusters rather than single high-memory machines.
  • The fraction of IPM runtime spent on linear solves, previously over 70 percent, drops enough to allow more iterations or tighter tolerances within the same wall-clock budget.
  • Preconditioner reuse across successive IPM steps reduces repeated construction costs, making the approach attractive for long sequences of related solves.
  • Direct solvers limited by fill-in and memory can be replaced for many sparse symmetric systems without sacrificing numerical quality.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same filtering and correction techniques for preconditioners could transfer to other repeated sparse linear solves arising in time-stepping or nonlinear solvers.
  • Scaling the BBD Schur-complement approach to dozens of nodes may require additional communication-reduction layers not explored in the four-node experiments.
  • Integration with GPU-accelerated sparse kernels could amplify the observed speedups for the dominant matrix-vector and triangular-solve phases.

Load-bearing premise

The benchmark problems and matrix characteristics are representative of real-world large-scale IPM instances, and the reported speedups persist after all communication and preconditioner-construction overheads.

What would settle it

Execution of SDSL-Solver on a fresh collection of large-scale IPM problems drawn from a different application area that shows either loss of the claimed speedups or failure to converge within the reported iteration budgets.

Figures

Figures reproduced from arXiv: 2604.23979 by Fan Zhang, Guangming Tan, Shaofeng Yang, Xin He, Yingying Cheng, Yunting Wang.

Figure 1
Figure 1. Figure 1: Workflow of an interior point method. The sparse view at source ↗
Figure 2
Figure 2. Figure 2: Architecture of SDSL-Solver. The framework receives the reduced linear system from the IPMs system and selects view at source ↗
Figure 3
Figure 3. Figure 3: Strong scaling of SDSL-Solver on X86 cluster. The view at source ↗
read the original abstract

The solution of sparse linear systems constitutes the dominant computational bottleneck in interior point methods (IPMs), frequently consuming over 70% of the total solution time. As optimization problems scale to millions of variables, direct solvers encounter prohibitive fill-in, excessive memory consumption, and limited parallel scalability. We present SDSL-Solver, a scalable distributed sparse linear solver framework designed for IPMs. SDSL-Solver employs Krylov subspace methods, combined with numerics-based sparse filtering and diagonal correction techniques that produce high-quality preconditioners. To accommodate diverse problem characteristics, SDSL-Solver offers two complementary distributed parallel methods: Block Jacobi for diagonally dominant matrices, and Bordered Block Diagonal (BBD) for general or ill-conditioned matrices requiring globally coupled preconditioning via Schur complement techniques. A preconditioner reuse strategy further amortizes construction costs across consecutive IPMs iterations. We evaluate SDSL-Solver on benchmark problems with matrix dimensions ranging from tens of thousands to over five million on multi-node clusters equipped with X86 processors. The experimental results show that under the Block Jacobi and BBD distributed methods, SDSL-Solver on a four-node configuration achieves average speedups of 6.23 times and 7.77 times, respectively, compared to PETSc running on the same number of nodes. Relative to the single-node PARDISO, the average speedups reach 97.54 times and 5.85 times, respectively.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces SDSL-Solver, a distributed framework for the sparse linear systems that dominate interior-point methods. It combines Krylov subspace iteration with numerics-based sparse filtering and diagonal correction to form preconditioners, and supplies two complementary distributed schemes: Block Jacobi for diagonally dominant matrices and Bordered Block Diagonal (BBD) with Schur-complement coupling for general or ill-conditioned cases. A preconditioner-reuse strategy amortizes construction cost across IPM iterations. On matrices up to five million dimensions the paper reports average speedups of 6.23× and 7.77× versus PETSc on four nodes and 97.54× and 5.85× versus single-node PARDISO.

Significance. If the reported speedups are reproducible and the methods continue to perform well at larger node counts, SDSL-Solver would materially accelerate large-scale IPM solvers in optimization, machine learning, and scientific computing. The dual-method design and reuse strategy directly address the dominant linear-algebra bottleneck and the memory/fill-in limits of direct solvers. The current four-node evaluation, however, leaves the central scalability claim unverified.

major comments (2)
  1. [Experimental Evaluation] The title and abstract assert a “scalable distributed” solver, yet all quantitative results are confined to a four-node X86 cluster. No strong-scaling or weak-scaling curves, no results on 8–32 nodes, and no communication-versus-computation breakdown are supplied for either the Block Jacobi or BBD method. Because both schemes rely on inter-node communication (diagonal-block exchanges in Jacobi, Schur-complement coupling in BBD), the absence of scaling data makes it impossible to confirm that preconditioner quality and reuse continue to dominate overhead at the scales implied by the paper.
  2. [Abstract and Experimental Evaluation] The 97.54× speedup quoted for Block Jacobi is measured against single-node PARDISO. This baseline mixes a distributed multi-node solver with a single-node direct solver; a fairer comparison would require either a multi-threaded or distributed PARDISO run on the same four nodes or an explicit statement that no such parallel PARDISO configuration was feasible.
minor comments (2)
  1. [Abstract] The abstract states that matrices range “from tens of thousands to over five million” but does not indicate how many matrices fall into each size class or whether the reported averages are dominated by the largest instances.
  2. [Method Description] The description of the numerics-based filtering and diagonal-correction procedures would benefit from a short pseudocode or explicit algorithmic listing so that the preconditioner construction can be reproduced.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive comments, which help clarify the scope of our experimental claims. We agree that the current four-node results do not fully substantiate broad scalability assertions and that baseline comparisons should be presented more carefully. We respond to each major comment below and indicate the revisions we will incorporate.

read point-by-point responses
  1. Referee: [Experimental Evaluation] The title and abstract assert a “scalable distributed” solver, yet all quantitative results are confined to a four-node X86 cluster. No strong-scaling or weak-scaling curves, no results on 8–32 nodes, and no communication-versus-computation breakdown are supplied for either the Block Jacobi or BBD method. Because both schemes rely on inter-node communication (diagonal-block exchanges in Jacobi, Schur-complement coupling in BBD), the absence of scaling data makes it impossible to confirm that preconditioner quality and reuse continue to dominate overhead at the scales implied by the paper.

    Authors: We agree that additional scaling data would strengthen the central claim. Our evaluation was performed on the largest cluster available to us at submission time (four nodes), and the reported speedups versus PETSc on the same hardware demonstrate practical gains from the preconditioning and reuse strategy. However, we lack empirical strong- or weak-scaling curves beyond four nodes. We will therefore make a partial revision: change the title to remove the word “Scalable,” revise the abstract and introduction to describe the framework as “distributed” with “demonstrated multi-node performance,” and add a short subsection discussing the communication patterns (block exchanges and Schur coupling) together with a qualitative analysis of expected overhead growth. These changes will accurately bound the current evidence while preserving the core algorithmic contributions. revision: partial

  2. Referee: [Abstract and Experimental Evaluation] The 97.54× speedup quoted for Block Jacobi is measured against single-node PARDISO. This baseline mixes a distributed multi-node solver with a single-node direct solver; a fairer comparison would require either a multi-threaded or distributed PARDISO run on the same four nodes or an explicit statement that no such parallel PARDISO configuration was feasible.

    Authors: We accept that the single-node PARDISO baseline is not the most direct parallel comparison. The 97.54× figure was intended to highlight the advantage of the distributed iterative approach for problems whose fill-in and memory requirements exceed typical single-node direct-solver limits. To address the concern, we will revise the abstract and results section to (a) add a multi-threaded PARDISO baseline on the same four-node hardware where threading is supported, and (b) explicitly note that the original single-node figure illustrates the benefit versus a sequential direct solver rather than a fully parallel one. If threading data cannot be obtained, we will qualify the comparison in the text. This constitutes a clear revision to the presentation of results. revision: yes

standing simulated objections not resolved
  • Absence of strong/weak scaling experiments on 8–32 nodes, which we cannot supply without additional cluster access.

Circularity Check

0 steps flagged

No circularity; claims rest on direct experimental wall-clock measurements against external solvers

full rationale

The paper introduces algorithmic components (Krylov methods, numerics-based filtering, diagonal correction, Block Jacobi and BBD preconditioners, reuse strategy) and validates them solely via measured runtimes on benchmark matrices (tens of thousands to >5M dimensions) run on a 4-node X86 cluster. Speedups (6.23×/7.77× vs PETSc on same nodes; 97.54×/5.85× vs single-node PARDISO) are obtained by direct comparison, not by any derivation, fitted parameter, or equation that reduces to its own inputs. No self-definitional steps, no predictions called from fits, and no load-bearing self-citations appear in the provided text. The four-node limit and lack of strong-scaling data are questions of experimental coverage, not circularity in a derivation chain.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The work relies on standard numerical linear algebra assumptions and does not introduce new free parameters, axioms, or invented entities beyond conventional Krylov convergence theory and distributed-memory communication models.

axioms (1)
  • standard math Krylov subspace methods converge for the preconditioned systems arising in IPMs
    Invoked implicitly when claiming effective iteration counts after filtering and diagonal correction.

pith-pipeline@v0.9.0 · 5572 in / 1243 out tokens · 60106 ms · 2026-05-08T01:43:56.665424+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

18 extracted references · 18 canonical work pages

  1. [1]

    Adams, et al

    Satish Balay, Shrirang Abhyankar, Mark F. Adams, et al. 2019.PETSc Users Manual. Technical Report ANL-95/11 - Revision 3.12. Argonne National Laboratory

  2. [2]

    Timothy A. Davis. 2006.Direct Methods for Sparse Linear Systems. SIAM

  3. [3]

    I. S. Duff and J. Koster. 2001. On Algorithms for Permuting Large Entries to the Diagonal of a Sparse Matrix.Siam Journal on Matrix Analysis & Applications22, 4 (2001), 973–996

  4. [4]

    Eisenstat, Howard C

    Stanley C. Eisenstat, Howard C. Elman, and Martin H. Schultz. 1983. Variational Iterative Methods for Nonsymmetric Systems of Linear Equations.SIAM J. Numer. Anal.20, 2 (1983), 345–357. doi:10.1137/0720023

  5. [5]

    Jacek Gondzio. 2012. Interior point methods 25 years later.European Journal of Operational Research218, 3 (2012), 587–601

  6. [6]

    Hestenes and Eduard Stiefel

    Magnus R. Hestenes and Eduard Stiefel. 1952. Methods of conjugate gradients for solving linear systems.J. Res. Nat. Bur. Standards49, 6 (1952), 409–436

  7. [7]

    HSL. n.d.. HSL_MC64: Permute and scale a sparse unsymmetric or rectangular matrix to put large entries on the diagonal. Harwell Subroutine Library. https: //www.hsl.rl.ac.uk/specs/hsl_mc64.pdf swMATH ID: 13139

  8. [8]

    George Karypis, Kirk Schloegel, and Vipin Kumar. 1997. Parmetis: Parallel graph partitioning and sparse matrix ordering library. (1997)

  9. [9]

    Haihao Lu, Jinwen Yang, Haodong Hu, Qi Huangfu, Jinsong Liu, Tianhao Liu, Yinyu Ye, Chuwen Zhang, and Dongdong Ge. 2023. cuPDLP-C: A strengthened implementation of cuPDLP for linear programming by C language.arXiv preprint arXiv:2312.14832(2023)

  10. [10]

    Hans Mittelmann. 2026. LPfeas Benchmark (find PD feasible point). Online. https://plato.asu.edu/ftp/lpfeas.html

  11. [11]

    Jorge Nocedal and Stephen J. Wright. 2006.Numerical Optimization(2nd ed.). Springer

  12. [12]

    Paige and Michael A

    Christopher C. Paige and Michael A. Saunders. 1975. Solution of sparse indefinite systems of linear equations.SIAM J. Numer. Anal.12, 4 (1975), 617–629. doi:10. 1137/0712047

  13. [13]

    2003.Iterative Methods for Sparse Linear Systems(2nd ed.)

    Yousef Saad. 2003.Iterative Methods for Sparse Linear Systems(2nd ed.). SIAM

  14. [14]

    Yousef Saad and Martin H. Schultz. 1986. GMRES: A generalized minimal residual algorithm for solving nonsymmetric linear systems.SIAM J. Sci. Statist. Comput. 7, 3 (1986), 856–869

  15. [15]

    Olaf Schenk and Klaus Gärtner. 2004. Solving unsymmetric sparse systems of linear equations with PARDISO.Future Generation Computer Systems20, 3 (2004), 475–487

  16. [16]

    2023.Algorithms for sparse linear systems

    Jennifer Scott and Miroslav Tůma. 2023.Algorithms for sparse linear systems. Springer

  17. [17]

    van der Vorst

    Henk A. van der Vorst. 1992. Bi-CGSTAB: A fast and smoothly converging variant of Bi-CG for the solution of nonsymmetric linear systems.SIAM J. Sci. Statist. Comput.13, 2 (1992), 631–644

  18. [18]

    Stephen J. Wright. 1997.Primal-dual interior-point methods. SIAM. 10