SDSL-Solver: Scalable Distributed Sparse Linear Solvers for Large-Scale Interior Point Methods

Fan Zhang; Guangming Tan; Shaofeng Yang; Xin He; Yingying Cheng; Yunting Wang

arxiv: 2604.23979 · v2 · submitted 2026-04-27 · 💻 cs.DC

SDSL-Solver: Scalable Distributed Sparse Linear Solvers for Large-Scale Interior Point Methods

Shaofeng Yang , Yunting Wang , Yingying Cheng , Fan Zhang , Xin He , Guangming Tan This is my paper

Pith reviewed 2026-05-08 01:43 UTC · model grok-4.3

classification 💻 cs.DC

keywords sparse linear solversdistributed computinginterior point methodsKrylov subspace methodspreconditionersSchur complementscalable optimization

0 comments

The pith

SDSL-Solver delivers distributed Krylov-based solvers that achieve up to 97 times speedup over single-node direct methods for large interior point problems.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents SDSL-Solver, a framework that tackles the dominant bottleneck in interior point methods by replacing direct sparse solvers with scalable distributed Krylov subspace methods equipped with custom preconditioners. It introduces two complementary parallel strategies, Block Jacobi for diagonally dominant cases and Bordered Block Diagonal for general or ill-conditioned matrices, plus a reuse strategy that spreads preconditioner costs across successive IPM steps. Tests on clusters with matrices up to five million dimensions show consistent speedups against both distributed PETSc and single-node PARDISO. If the approach generalizes, it would let practitioners solve substantially larger optimization models on ordinary multi-node hardware without prohibitive memory growth or fill-in.

Core claim

SDSL-Solver employs Krylov subspace methods combined with numerics-based sparse filtering and diagonal correction to produce high-quality preconditioners; it supplies Block Jacobi for diagonally dominant matrices and Bordered Block Diagonal (BBD) using Schur complement techniques for general or ill-conditioned matrices, together with preconditioner reuse across IPM iterations, yielding average speedups of 6.23 times and 7.77 times over PETSc on four nodes and 97.54 times and 5.85 times over single-node PARDISO.

What carries the argument

Krylov subspace methods with numerics-based sparse filtering and diagonal correction for preconditioners, realized through Block Jacobi and Bordered Block Diagonal (BBD) distributed methods plus Schur complement coupling and preconditioner reuse across iterations.

If this is right

Optimization problems with millions of variables become solvable on modest multi-node clusters rather than single high-memory machines.
The fraction of IPM runtime spent on linear solves, previously over 70 percent, drops enough to allow more iterations or tighter tolerances within the same wall-clock budget.
Preconditioner reuse across successive IPM steps reduces repeated construction costs, making the approach attractive for long sequences of related solves.
Direct solvers limited by fill-in and memory can be replaced for many sparse symmetric systems without sacrificing numerical quality.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same filtering and correction techniques for preconditioners could transfer to other repeated sparse linear solves arising in time-stepping or nonlinear solvers.
Scaling the BBD Schur-complement approach to dozens of nodes may require additional communication-reduction layers not explored in the four-node experiments.
Integration with GPU-accelerated sparse kernels could amplify the observed speedups for the dominant matrix-vector and triangular-solve phases.

Load-bearing premise

The benchmark problems and matrix characteristics are representative of real-world large-scale IPM instances, and the reported speedups persist after all communication and preconditioner-construction overheads.

What would settle it

Execution of SDSL-Solver on a fresh collection of large-scale IPM problems drawn from a different application area that shows either loss of the claimed speedups or failure to converge within the reported iteration budgets.

Figures

Figures reproduced from arXiv: 2604.23979 by Fan Zhang, Guangming Tan, Shaofeng Yang, Xin He, Yingying Cheng, Yunting Wang.

**Figure 1.** Figure 1: Workflow of an interior point method. The sparse view at source ↗

**Figure 2.** Figure 2: Architecture of SDSL-Solver. The framework receives the reduced linear system from the IPMs system and selects view at source ↗

**Figure 3.** Figure 3: Strong scaling of SDSL-Solver on X86 cluster. The view at source ↗

read the original abstract

The solution of sparse linear systems constitutes the dominant computational bottleneck in interior point methods (IPMs), frequently consuming over 70% of the total solution time. As optimization problems scale to millions of variables, direct solvers encounter prohibitive fill-in, excessive memory consumption, and limited parallel scalability. We present SDSL-Solver, a scalable distributed sparse linear solver framework designed for IPMs. SDSL-Solver employs Krylov subspace methods, combined with numerics-based sparse filtering and diagonal correction techniques that produce high-quality preconditioners. To accommodate diverse problem characteristics, SDSL-Solver offers two complementary distributed parallel methods: Block Jacobi for diagonally dominant matrices, and Bordered Block Diagonal (BBD) for general or ill-conditioned matrices requiring globally coupled preconditioning via Schur complement techniques. A preconditioner reuse strategy further amortizes construction costs across consecutive IPMs iterations. We evaluate SDSL-Solver on benchmark problems with matrix dimensions ranging from tens of thousands to over five million on multi-node clusters equipped with X86 processors. The experimental results show that under the Block Jacobi and BBD distributed methods, SDSL-Solver on a four-node configuration achieves average speedups of 6.23 times and 7.77 times, respectively, compared to PETSc running on the same number of nodes. Relative to the single-node PARDISO, the average speedups reach 97.54 times and 5.85 times, respectively.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

SDSL-Solver reports practical speedups on four nodes but supplies no scaling data to support the broader claims.

read the letter

The paper's main contribution is a packaged solver that combines Krylov methods with numerics-based filtering, diagonal correction, and preconditioner reuse, then wraps them in two distributed modes (Block Jacobi and BBD with Schur) aimed at IPM linear systems. It reports concrete wall-clock gains—roughly 6x and 8x versus PETSc on four nodes, and larger factors versus single-node PARDISO—for matrices up to five million dimensions. Those numbers come from direct runs on X86 hardware and include the reuse strategy that should cut repeated construction costs across IPM iterations. That engineering package is the part worth noticing; it gives practitioners a ready option for distributed sparse solves without inventing new theory.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces SDSL-Solver, a distributed framework for the sparse linear systems that dominate interior-point methods. It combines Krylov subspace iteration with numerics-based sparse filtering and diagonal correction to form preconditioners, and supplies two complementary distributed schemes: Block Jacobi for diagonally dominant matrices and Bordered Block Diagonal (BBD) with Schur-complement coupling for general or ill-conditioned cases. A preconditioner-reuse strategy amortizes construction cost across IPM iterations. On matrices up to five million dimensions the paper reports average speedups of 6.23× and 7.77× versus PETSc on four nodes and 97.54× and 5.85× versus single-node PARDISO.

Significance. If the reported speedups are reproducible and the methods continue to perform well at larger node counts, SDSL-Solver would materially accelerate large-scale IPM solvers in optimization, machine learning, and scientific computing. The dual-method design and reuse strategy directly address the dominant linear-algebra bottleneck and the memory/fill-in limits of direct solvers. The current four-node evaluation, however, leaves the central scalability claim unverified.

major comments (2)

[Experimental Evaluation] The title and abstract assert a “scalable distributed” solver, yet all quantitative results are confined to a four-node X86 cluster. No strong-scaling or weak-scaling curves, no results on 8–32 nodes, and no communication-versus-computation breakdown are supplied for either the Block Jacobi or BBD method. Because both schemes rely on inter-node communication (diagonal-block exchanges in Jacobi, Schur-complement coupling in BBD), the absence of scaling data makes it impossible to confirm that preconditioner quality and reuse continue to dominate overhead at the scales implied by the paper.
[Abstract and Experimental Evaluation] The 97.54× speedup quoted for Block Jacobi is measured against single-node PARDISO. This baseline mixes a distributed multi-node solver with a single-node direct solver; a fairer comparison would require either a multi-threaded or distributed PARDISO run on the same four nodes or an explicit statement that no such parallel PARDISO configuration was feasible.

minor comments (2)

[Abstract] The abstract states that matrices range “from tens of thousands to over five million” but does not indicate how many matrices fall into each size class or whether the reported averages are dominated by the largest instances.
[Method Description] The description of the numerics-based filtering and diagonal-correction procedures would benefit from a short pseudocode or explicit algorithmic listing so that the preconditioner construction can be reproduced.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive comments, which help clarify the scope of our experimental claims. We agree that the current four-node results do not fully substantiate broad scalability assertions and that baseline comparisons should be presented more carefully. We respond to each major comment below and indicate the revisions we will incorporate.

read point-by-point responses

Referee: [Experimental Evaluation] The title and abstract assert a “scalable distributed” solver, yet all quantitative results are confined to a four-node X86 cluster. No strong-scaling or weak-scaling curves, no results on 8–32 nodes, and no communication-versus-computation breakdown are supplied for either the Block Jacobi or BBD method. Because both schemes rely on inter-node communication (diagonal-block exchanges in Jacobi, Schur-complement coupling in BBD), the absence of scaling data makes it impossible to confirm that preconditioner quality and reuse continue to dominate overhead at the scales implied by the paper.

Authors: We agree that additional scaling data would strengthen the central claim. Our evaluation was performed on the largest cluster available to us at submission time (four nodes), and the reported speedups versus PETSc on the same hardware demonstrate practical gains from the preconditioning and reuse strategy. However, we lack empirical strong- or weak-scaling curves beyond four nodes. We will therefore make a partial revision: change the title to remove the word “Scalable,” revise the abstract and introduction to describe the framework as “distributed” with “demonstrated multi-node performance,” and add a short subsection discussing the communication patterns (block exchanges and Schur coupling) together with a qualitative analysis of expected overhead growth. These changes will accurately bound the current evidence while preserving the core algorithmic contributions. revision: partial
Referee: [Abstract and Experimental Evaluation] The 97.54× speedup quoted for Block Jacobi is measured against single-node PARDISO. This baseline mixes a distributed multi-node solver with a single-node direct solver; a fairer comparison would require either a multi-threaded or distributed PARDISO run on the same four nodes or an explicit statement that no such parallel PARDISO configuration was feasible.

Authors: We accept that the single-node PARDISO baseline is not the most direct parallel comparison. The 97.54× figure was intended to highlight the advantage of the distributed iterative approach for problems whose fill-in and memory requirements exceed typical single-node direct-solver limits. To address the concern, we will revise the abstract and results section to (a) add a multi-threaded PARDISO baseline on the same four-node hardware where threading is supported, and (b) explicitly note that the original single-node figure illustrates the benefit versus a sequential direct solver rather than a fully parallel one. If threading data cannot be obtained, we will qualify the comparison in the text. This constitutes a clear revision to the presentation of results. revision: yes

standing simulated objections not resolved

Absence of strong/weak scaling experiments on 8–32 nodes, which we cannot supply without additional cluster access.

Circularity Check

0 steps flagged

No circularity; claims rest on direct experimental wall-clock measurements against external solvers

full rationale

The paper introduces algorithmic components (Krylov methods, numerics-based filtering, diagonal correction, Block Jacobi and BBD preconditioners, reuse strategy) and validates them solely via measured runtimes on benchmark matrices (tens of thousands to >5M dimensions) run on a 4-node X86 cluster. Speedups (6.23×/7.77× vs PETSc on same nodes; 97.54×/5.85× vs single-node PARDISO) are obtained by direct comparison, not by any derivation, fitted parameter, or equation that reduces to its own inputs. No self-definitional steps, no predictions called from fits, and no load-bearing self-citations appear in the provided text. The four-node limit and lack of strong-scaling data are questions of experimental coverage, not circularity in a derivation chain.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The work relies on standard numerical linear algebra assumptions and does not introduce new free parameters, axioms, or invented entities beyond conventional Krylov convergence theory and distributed-memory communication models.

axioms (1)

standard math Krylov subspace methods converge for the preconditioned systems arising in IPMs
Invoked implicitly when claiming effective iteration counts after filtering and diagonal correction.

pith-pipeline@v0.9.0 · 5572 in / 1243 out tokens · 60106 ms · 2026-05-08T01:43:56.665424+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

18 extracted references · 18 canonical work pages

[1]

Adams, et al

Satish Balay, Shrirang Abhyankar, Mark F. Adams, et al. 2019.PETSc Users Manual. Technical Report ANL-95/11 - Revision 3.12. Argonne National Laboratory

work page 2019
[2]

Timothy A. Davis. 2006.Direct Methods for Sparse Linear Systems. SIAM

work page 2006
[3]

I. S. Duff and J. Koster. 2001. On Algorithms for Permuting Large Entries to the Diagonal of a Sparse Matrix.Siam Journal on Matrix Analysis & Applications22, 4 (2001), 973–996

work page 2001
[4]

Eisenstat, Howard C

Stanley C. Eisenstat, Howard C. Elman, and Martin H. Schultz. 1983. Variational Iterative Methods for Nonsymmetric Systems of Linear Equations.SIAM J. Numer. Anal.20, 2 (1983), 345–357. doi:10.1137/0720023

work page doi:10.1137/0720023 1983
[5]

Jacek Gondzio. 2012. Interior point methods 25 years later.European Journal of Operational Research218, 3 (2012), 587–601

work page 2012
[6]

Hestenes and Eduard Stiefel

Magnus R. Hestenes and Eduard Stiefel. 1952. Methods of conjugate gradients for solving linear systems.J. Res. Nat. Bur. Standards49, 6 (1952), 409–436

work page 1952
[7]

HSL. n.d.. HSL_MC64: Permute and scale a sparse unsymmetric or rectangular matrix to put large entries on the diagonal. Harwell Subroutine Library. https: //www.hsl.rl.ac.uk/specs/hsl_mc64.pdf swMATH ID: 13139

work page
[8]

George Karypis, Kirk Schloegel, and Vipin Kumar. 1997. Parmetis: Parallel graph partitioning and sparse matrix ordering library. (1997)

work page 1997
[9]

Haihao Lu, Jinwen Yang, Haodong Hu, Qi Huangfu, Jinsong Liu, Tianhao Liu, Yinyu Ye, Chuwen Zhang, and Dongdong Ge. 2023. cuPDLP-C: A strengthened implementation of cuPDLP for linear programming by C language.arXiv preprint arXiv:2312.14832(2023)

work page arXiv 2023
[10]

Hans Mittelmann. 2026. LPfeas Benchmark (find PD feasible point). Online. https://plato.asu.edu/ftp/lpfeas.html

work page 2026
[11]

Jorge Nocedal and Stephen J. Wright. 2006.Numerical Optimization(2nd ed.). Springer

work page 2006
[12]

Paige and Michael A

Christopher C. Paige and Michael A. Saunders. 1975. Solution of sparse indefinite systems of linear equations.SIAM J. Numer. Anal.12, 4 (1975), 617–629. doi:10. 1137/0712047

work page 1975
[13]

2003.Iterative Methods for Sparse Linear Systems(2nd ed.)

Yousef Saad. 2003.Iterative Methods for Sparse Linear Systems(2nd ed.). SIAM

work page 2003
[14]

Yousef Saad and Martin H. Schultz. 1986. GMRES: A generalized minimal residual algorithm for solving nonsymmetric linear systems.SIAM J. Sci. Statist. Comput. 7, 3 (1986), 856–869

work page 1986
[15]

Olaf Schenk and Klaus Gärtner. 2004. Solving unsymmetric sparse systems of linear equations with PARDISO.Future Generation Computer Systems20, 3 (2004), 475–487

work page 2004
[16]

2023.Algorithms for sparse linear systems

Jennifer Scott and Miroslav Tůma. 2023.Algorithms for sparse linear systems. Springer

work page 2023
[17]

van der Vorst

Henk A. van der Vorst. 1992. Bi-CGSTAB: A fast and smoothly converging variant of Bi-CG for the solution of nonsymmetric linear systems.SIAM J. Sci. Statist. Comput.13, 2 (1992), 631–644

work page 1992
[18]

Stephen J. Wright. 1997.Primal-dual interior-point methods. SIAM. 10

work page 1997

[1] [1]

Adams, et al

Satish Balay, Shrirang Abhyankar, Mark F. Adams, et al. 2019.PETSc Users Manual. Technical Report ANL-95/11 - Revision 3.12. Argonne National Laboratory

work page 2019

[2] [2]

Timothy A. Davis. 2006.Direct Methods for Sparse Linear Systems. SIAM

work page 2006

[3] [3]

I. S. Duff and J. Koster. 2001. On Algorithms for Permuting Large Entries to the Diagonal of a Sparse Matrix.Siam Journal on Matrix Analysis & Applications22, 4 (2001), 973–996

work page 2001

[4] [4]

Eisenstat, Howard C

Stanley C. Eisenstat, Howard C. Elman, and Martin H. Schultz. 1983. Variational Iterative Methods for Nonsymmetric Systems of Linear Equations.SIAM J. Numer. Anal.20, 2 (1983), 345–357. doi:10.1137/0720023

work page doi:10.1137/0720023 1983

[5] [5]

Jacek Gondzio. 2012. Interior point methods 25 years later.European Journal of Operational Research218, 3 (2012), 587–601

work page 2012

[6] [6]

Hestenes and Eduard Stiefel

Magnus R. Hestenes and Eduard Stiefel. 1952. Methods of conjugate gradients for solving linear systems.J. Res. Nat. Bur. Standards49, 6 (1952), 409–436

work page 1952

[7] [7]

HSL. n.d.. HSL_MC64: Permute and scale a sparse unsymmetric or rectangular matrix to put large entries on the diagonal. Harwell Subroutine Library. https: //www.hsl.rl.ac.uk/specs/hsl_mc64.pdf swMATH ID: 13139

work page

[8] [8]

George Karypis, Kirk Schloegel, and Vipin Kumar. 1997. Parmetis: Parallel graph partitioning and sparse matrix ordering library. (1997)

work page 1997

[9] [9]

Haihao Lu, Jinwen Yang, Haodong Hu, Qi Huangfu, Jinsong Liu, Tianhao Liu, Yinyu Ye, Chuwen Zhang, and Dongdong Ge. 2023. cuPDLP-C: A strengthened implementation of cuPDLP for linear programming by C language.arXiv preprint arXiv:2312.14832(2023)

work page arXiv 2023

[10] [10]

Hans Mittelmann. 2026. LPfeas Benchmark (find PD feasible point). Online. https://plato.asu.edu/ftp/lpfeas.html

work page 2026

[11] [11]

Jorge Nocedal and Stephen J. Wright. 2006.Numerical Optimization(2nd ed.). Springer

work page 2006

[12] [12]

Paige and Michael A

Christopher C. Paige and Michael A. Saunders. 1975. Solution of sparse indefinite systems of linear equations.SIAM J. Numer. Anal.12, 4 (1975), 617–629. doi:10. 1137/0712047

work page 1975

[13] [13]

2003.Iterative Methods for Sparse Linear Systems(2nd ed.)

Yousef Saad. 2003.Iterative Methods for Sparse Linear Systems(2nd ed.). SIAM

work page 2003

[14] [14]

Yousef Saad and Martin H. Schultz. 1986. GMRES: A generalized minimal residual algorithm for solving nonsymmetric linear systems.SIAM J. Sci. Statist. Comput. 7, 3 (1986), 856–869

work page 1986

[15] [15]

Olaf Schenk and Klaus Gärtner. 2004. Solving unsymmetric sparse systems of linear equations with PARDISO.Future Generation Computer Systems20, 3 (2004), 475–487

work page 2004

[16] [16]

2023.Algorithms for sparse linear systems

Jennifer Scott and Miroslav Tůma. 2023.Algorithms for sparse linear systems. Springer

work page 2023

[17] [17]

van der Vorst

Henk A. van der Vorst. 1992. Bi-CGSTAB: A fast and smoothly converging variant of Bi-CG for the solution of nonsymmetric linear systems.SIAM J. Sci. Statist. Comput.13, 2 (1992), 631–644

work page 1992

[18] [18]

Stephen J. Wright. 1997.Primal-dual interior-point methods. SIAM. 10

work page 1997