SDSL-Solver: Scalable Distributed Sparse Linear Solvers for Large-Scale Interior Point Methods
Pith reviewed 2026-05-08 01:43 UTC · model grok-4.3
The pith
SDSL-Solver delivers distributed Krylov-based solvers that achieve up to 97 times speedup over single-node direct methods for large interior point problems.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
SDSL-Solver employs Krylov subspace methods combined with numerics-based sparse filtering and diagonal correction to produce high-quality preconditioners; it supplies Block Jacobi for diagonally dominant matrices and Bordered Block Diagonal (BBD) using Schur complement techniques for general or ill-conditioned matrices, together with preconditioner reuse across IPM iterations, yielding average speedups of 6.23 times and 7.77 times over PETSc on four nodes and 97.54 times and 5.85 times over single-node PARDISO.
What carries the argument
Krylov subspace methods with numerics-based sparse filtering and diagonal correction for preconditioners, realized through Block Jacobi and Bordered Block Diagonal (BBD) distributed methods plus Schur complement coupling and preconditioner reuse across iterations.
If this is right
- Optimization problems with millions of variables become solvable on modest multi-node clusters rather than single high-memory machines.
- The fraction of IPM runtime spent on linear solves, previously over 70 percent, drops enough to allow more iterations or tighter tolerances within the same wall-clock budget.
- Preconditioner reuse across successive IPM steps reduces repeated construction costs, making the approach attractive for long sequences of related solves.
- Direct solvers limited by fill-in and memory can be replaced for many sparse symmetric systems without sacrificing numerical quality.
Where Pith is reading between the lines
- The same filtering and correction techniques for preconditioners could transfer to other repeated sparse linear solves arising in time-stepping or nonlinear solvers.
- Scaling the BBD Schur-complement approach to dozens of nodes may require additional communication-reduction layers not explored in the four-node experiments.
- Integration with GPU-accelerated sparse kernels could amplify the observed speedups for the dominant matrix-vector and triangular-solve phases.
Load-bearing premise
The benchmark problems and matrix characteristics are representative of real-world large-scale IPM instances, and the reported speedups persist after all communication and preconditioner-construction overheads.
What would settle it
Execution of SDSL-Solver on a fresh collection of large-scale IPM problems drawn from a different application area that shows either loss of the claimed speedups or failure to converge within the reported iteration budgets.
Figures
read the original abstract
The solution of sparse linear systems constitutes the dominant computational bottleneck in interior point methods (IPMs), frequently consuming over 70% of the total solution time. As optimization problems scale to millions of variables, direct solvers encounter prohibitive fill-in, excessive memory consumption, and limited parallel scalability. We present SDSL-Solver, a scalable distributed sparse linear solver framework designed for IPMs. SDSL-Solver employs Krylov subspace methods, combined with numerics-based sparse filtering and diagonal correction techniques that produce high-quality preconditioners. To accommodate diverse problem characteristics, SDSL-Solver offers two complementary distributed parallel methods: Block Jacobi for diagonally dominant matrices, and Bordered Block Diagonal (BBD) for general or ill-conditioned matrices requiring globally coupled preconditioning via Schur complement techniques. A preconditioner reuse strategy further amortizes construction costs across consecutive IPMs iterations. We evaluate SDSL-Solver on benchmark problems with matrix dimensions ranging from tens of thousands to over five million on multi-node clusters equipped with X86 processors. The experimental results show that under the Block Jacobi and BBD distributed methods, SDSL-Solver on a four-node configuration achieves average speedups of 6.23 times and 7.77 times, respectively, compared to PETSc running on the same number of nodes. Relative to the single-node PARDISO, the average speedups reach 97.54 times and 5.85 times, respectively.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces SDSL-Solver, a distributed framework for the sparse linear systems that dominate interior-point methods. It combines Krylov subspace iteration with numerics-based sparse filtering and diagonal correction to form preconditioners, and supplies two complementary distributed schemes: Block Jacobi for diagonally dominant matrices and Bordered Block Diagonal (BBD) with Schur-complement coupling for general or ill-conditioned cases. A preconditioner-reuse strategy amortizes construction cost across IPM iterations. On matrices up to five million dimensions the paper reports average speedups of 6.23× and 7.77× versus PETSc on four nodes and 97.54× and 5.85× versus single-node PARDISO.
Significance. If the reported speedups are reproducible and the methods continue to perform well at larger node counts, SDSL-Solver would materially accelerate large-scale IPM solvers in optimization, machine learning, and scientific computing. The dual-method design and reuse strategy directly address the dominant linear-algebra bottleneck and the memory/fill-in limits of direct solvers. The current four-node evaluation, however, leaves the central scalability claim unverified.
major comments (2)
- [Experimental Evaluation] The title and abstract assert a “scalable distributed” solver, yet all quantitative results are confined to a four-node X86 cluster. No strong-scaling or weak-scaling curves, no results on 8–32 nodes, and no communication-versus-computation breakdown are supplied for either the Block Jacobi or BBD method. Because both schemes rely on inter-node communication (diagonal-block exchanges in Jacobi, Schur-complement coupling in BBD), the absence of scaling data makes it impossible to confirm that preconditioner quality and reuse continue to dominate overhead at the scales implied by the paper.
- [Abstract and Experimental Evaluation] The 97.54× speedup quoted for Block Jacobi is measured against single-node PARDISO. This baseline mixes a distributed multi-node solver with a single-node direct solver; a fairer comparison would require either a multi-threaded or distributed PARDISO run on the same four nodes or an explicit statement that no such parallel PARDISO configuration was feasible.
minor comments (2)
- [Abstract] The abstract states that matrices range “from tens of thousands to over five million” but does not indicate how many matrices fall into each size class or whether the reported averages are dominated by the largest instances.
- [Method Description] The description of the numerics-based filtering and diagonal-correction procedures would benefit from a short pseudocode or explicit algorithmic listing so that the preconditioner construction can be reproduced.
Simulated Author's Rebuttal
We thank the referee for the constructive comments, which help clarify the scope of our experimental claims. We agree that the current four-node results do not fully substantiate broad scalability assertions and that baseline comparisons should be presented more carefully. We respond to each major comment below and indicate the revisions we will incorporate.
read point-by-point responses
-
Referee: [Experimental Evaluation] The title and abstract assert a “scalable distributed” solver, yet all quantitative results are confined to a four-node X86 cluster. No strong-scaling or weak-scaling curves, no results on 8–32 nodes, and no communication-versus-computation breakdown are supplied for either the Block Jacobi or BBD method. Because both schemes rely on inter-node communication (diagonal-block exchanges in Jacobi, Schur-complement coupling in BBD), the absence of scaling data makes it impossible to confirm that preconditioner quality and reuse continue to dominate overhead at the scales implied by the paper.
Authors: We agree that additional scaling data would strengthen the central claim. Our evaluation was performed on the largest cluster available to us at submission time (four nodes), and the reported speedups versus PETSc on the same hardware demonstrate practical gains from the preconditioning and reuse strategy. However, we lack empirical strong- or weak-scaling curves beyond four nodes. We will therefore make a partial revision: change the title to remove the word “Scalable,” revise the abstract and introduction to describe the framework as “distributed” with “demonstrated multi-node performance,” and add a short subsection discussing the communication patterns (block exchanges and Schur coupling) together with a qualitative analysis of expected overhead growth. These changes will accurately bound the current evidence while preserving the core algorithmic contributions. revision: partial
-
Referee: [Abstract and Experimental Evaluation] The 97.54× speedup quoted for Block Jacobi is measured against single-node PARDISO. This baseline mixes a distributed multi-node solver with a single-node direct solver; a fairer comparison would require either a multi-threaded or distributed PARDISO run on the same four nodes or an explicit statement that no such parallel PARDISO configuration was feasible.
Authors: We accept that the single-node PARDISO baseline is not the most direct parallel comparison. The 97.54× figure was intended to highlight the advantage of the distributed iterative approach for problems whose fill-in and memory requirements exceed typical single-node direct-solver limits. To address the concern, we will revise the abstract and results section to (a) add a multi-threaded PARDISO baseline on the same four-node hardware where threading is supported, and (b) explicitly note that the original single-node figure illustrates the benefit versus a sequential direct solver rather than a fully parallel one. If threading data cannot be obtained, we will qualify the comparison in the text. This constitutes a clear revision to the presentation of results. revision: yes
- Absence of strong/weak scaling experiments on 8–32 nodes, which we cannot supply without additional cluster access.
Circularity Check
No circularity; claims rest on direct experimental wall-clock measurements against external solvers
full rationale
The paper introduces algorithmic components (Krylov methods, numerics-based filtering, diagonal correction, Block Jacobi and BBD preconditioners, reuse strategy) and validates them solely via measured runtimes on benchmark matrices (tens of thousands to >5M dimensions) run on a 4-node X86 cluster. Speedups (6.23×/7.77× vs PETSc on same nodes; 97.54×/5.85× vs single-node PARDISO) are obtained by direct comparison, not by any derivation, fitted parameter, or equation that reduces to its own inputs. No self-definitional steps, no predictions called from fits, and no load-bearing self-citations appear in the provided text. The four-node limit and lack of strong-scaling data are questions of experimental coverage, not circularity in a derivation chain.
Axiom & Free-Parameter Ledger
axioms (1)
- standard math Krylov subspace methods converge for the preconditioned systems arising in IPMs
Reference graph
Works this paper leans on
-
[1]
Satish Balay, Shrirang Abhyankar, Mark F. Adams, et al. 2019.PETSc Users Manual. Technical Report ANL-95/11 - Revision 3.12. Argonne National Laboratory
work page 2019
-
[2]
Timothy A. Davis. 2006.Direct Methods for Sparse Linear Systems. SIAM
work page 2006
-
[3]
I. S. Duff and J. Koster. 2001. On Algorithms for Permuting Large Entries to the Diagonal of a Sparse Matrix.Siam Journal on Matrix Analysis & Applications22, 4 (2001), 973–996
work page 2001
-
[4]
Stanley C. Eisenstat, Howard C. Elman, and Martin H. Schultz. 1983. Variational Iterative Methods for Nonsymmetric Systems of Linear Equations.SIAM J. Numer. Anal.20, 2 (1983), 345–357. doi:10.1137/0720023
-
[5]
Jacek Gondzio. 2012. Interior point methods 25 years later.European Journal of Operational Research218, 3 (2012), 587–601
work page 2012
-
[6]
Magnus R. Hestenes and Eduard Stiefel. 1952. Methods of conjugate gradients for solving linear systems.J. Res. Nat. Bur. Standards49, 6 (1952), 409–436
work page 1952
-
[7]
HSL. n.d.. HSL_MC64: Permute and scale a sparse unsymmetric or rectangular matrix to put large entries on the diagonal. Harwell Subroutine Library. https: //www.hsl.rl.ac.uk/specs/hsl_mc64.pdf swMATH ID: 13139
-
[8]
George Karypis, Kirk Schloegel, and Vipin Kumar. 1997. Parmetis: Parallel graph partitioning and sparse matrix ordering library. (1997)
work page 1997
- [9]
-
[10]
Hans Mittelmann. 2026. LPfeas Benchmark (find PD feasible point). Online. https://plato.asu.edu/ftp/lpfeas.html
work page 2026
-
[11]
Jorge Nocedal and Stephen J. Wright. 2006.Numerical Optimization(2nd ed.). Springer
work page 2006
-
[12]
Christopher C. Paige and Michael A. Saunders. 1975. Solution of sparse indefinite systems of linear equations.SIAM J. Numer. Anal.12, 4 (1975), 617–629. doi:10. 1137/0712047
work page 1975
-
[13]
2003.Iterative Methods for Sparse Linear Systems(2nd ed.)
Yousef Saad. 2003.Iterative Methods for Sparse Linear Systems(2nd ed.). SIAM
work page 2003
-
[14]
Yousef Saad and Martin H. Schultz. 1986. GMRES: A generalized minimal residual algorithm for solving nonsymmetric linear systems.SIAM J. Sci. Statist. Comput. 7, 3 (1986), 856–869
work page 1986
-
[15]
Olaf Schenk and Klaus Gärtner. 2004. Solving unsymmetric sparse systems of linear equations with PARDISO.Future Generation Computer Systems20, 3 (2004), 475–487
work page 2004
-
[16]
2023.Algorithms for sparse linear systems
Jennifer Scott and Miroslav Tůma. 2023.Algorithms for sparse linear systems. Springer
work page 2023
-
[17]
Henk A. van der Vorst. 1992. Bi-CGSTAB: A fast and smoothly converging variant of Bi-CG for the solution of nonsymmetric linear systems.SIAM J. Sci. Statist. Comput.13, 2 (1992), 631–644
work page 1992
-
[18]
Stephen J. Wright. 1997.Primal-dual interior-point methods. SIAM. 10
work page 1997
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.