torch-sla: Differentiable Sparse Linear Algebra with Adjoint Solvers and Sparse Tensor Parallelism for PyTorch
Pith reviewed 2026-05-16 12:34 UTC · model grok-4.3
The pith
torch-sla supplies one autograd-aware interface for sparse direct, iterative, nonlinear, and eigenvalue solvers across five backends, with built-in batching and multi-GPU support.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
torch-sla exposes a single autograd-aware API for direct, iterative, nonlinear, and eigenvalue solvers across five interchangeable backends with automatic dispatch, batched solves, and distributed multi-GPU execution via domain decomposition with halo exchange, enabled by an O(1)-graph adjoint differentiation framework.
What carries the argument
The O(1)-graph adjoint differentiation framework, which keeps the autograd graph size independent of the number of solver steps, together with an autograd-compatible halo-exchange layer for distributed execution.
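For orientation, the textbook adjoint identities for a linear solve are sketched below. This is the standard result, not a derivation taken from the paper; the symbols $A$, $x$, $b$, $\lambda$, and the scalar loss $\mathcal{L}$ are notation assumed here for illustration.

```latex
% Forward problem and scalar loss
A x = b, \qquad \mathcal{L} = \mathcal{L}(x)

% Adjoint problem: one extra solve with the transposed matrix
A^{\top} \lambda = \frac{\partial \mathcal{L}}{\partial x}

% Gradients with respect to the right-hand side and the matrix entries
\frac{\partial \mathcal{L}}{\partial b} = \lambda, \qquad
\frac{\partial \mathcal{L}}{\partial A_{ij}} = -\lambda_i \, x_j
```

Because the backward pass needs only $x$ and one transposed solve, nothing from the solver's internal iterations has to be recorded, which is one plausible reading of the "O(1)-graph" label. For a sparse $A$, only the entries $(i, j)$ in the sparsity pattern would need to be evaluated.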
If this is right
- Neural networks can embed sparse direct or iterative solves as differentiable layers without custom autograd rules.
- Batched solves over shared or distinct sparsity patterns become available in one API call on both CPU and GPU.
- Domain-decomposition parallelism with halo exchange scales sparse solves to multiple GPUs while preserving differentiability.
- Automatic backend selection removes the need for separate CPU and GPU code paths in scientific ML pipelines.
- Nonlinear and eigenvalue solvers now carry gradients, extending differentiable programming to a wider range of physics and optimization problems.
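The halo-exchange idea in the list above can be illustrated with a single-process toy. The sketch below is not torch-sla code: every function name is hypothetical, the "ranks" are plain Python lists, and the operator is a 1D Laplacian stencil chosen only for simplicity. It shows that a partitioned matrix-vector product with ghost values matches the global one, and that the backward pass of the exchange is its transpose (ghost gradients are routed back to the owning partition and accumulated).

```python
# Toy, single-process illustration of halo exchange (all names hypothetical).

def global_matvec(x):
    """y = A x for the 1D Laplacian stencil [-1, 2, -1] with zero boundaries."""
    n = len(x)
    y = []
    for i in range(n):
        left = x[i - 1] if i > 0 else 0.0
        right = x[i + 1] if i < n - 1 else 0.0
        y.append(2.0 * x[i] - left - right)
    return y

def halo_exchange(parts):
    """Return (left_ghost, right_ghost) for each partition; 0.0 at physical boundaries."""
    ghosts = []
    for p in range(len(parts)):
        left = parts[p - 1][-1] if p > 0 else 0.0
        right = parts[p + 1][0] if p < len(parts) - 1 else 0.0
        ghosts.append((left, right))
    return ghosts

def halo_exchange_adjoint(ghost_grads, sizes):
    """Transpose of halo_exchange: send each ghost gradient back to its owner and accumulate."""
    grads = [[0.0] * s for s in sizes]
    for p, (g_left, g_right) in enumerate(ghost_grads):
        if p > 0:
            grads[p - 1][-1] += g_left
        if p < len(sizes) - 1:
            grads[p + 1][0] += g_right
    return grads

def local_matvec(owned, ghost):
    """Apply the stencil to the owned entries using the received ghost values."""
    left_ghost, right_ghost = ghost
    padded = [left_ghost] + owned + [right_ghost]
    return [2.0 * padded[i] - padded[i - 1] - padded[i + 1]
            for i in range(1, len(padded) - 1)]

# Forward check: the partitioned product equals the global product.
x = [1.0, 2.0, 4.0, 7.0, 11.0, 16.0]
parts = [x[:3], x[3:]]
ghosts = halo_exchange(parts)
distributed = [v for owned, ghost in zip(parts, ghosts)
               for v in local_matvec(owned, ghost)]

# Adjoint check (dot-product test): <E x, g> == <x, E^T g>.
ghost_grads = [(0.5, -1.5), (2.0, 0.25)]
lhs = sum(gl * dl + gr * dr
          for (gl, gr), (dl, dr) in zip(ghosts, ghost_grads))
back = halo_exchange_adjoint(ghost_grads, [len(p) for p in parts])
rhs = sum(xi * gi for owned, grad in zip(parts, back)
          for xi, gi in zip(owned, grad))
```

In a real multi-GPU setting the list indexing would be replaced by point-to-point communication, but the forward/transpose pairing is the property an autograd-compatible exchange layer has to preserve.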
Where Pith is reading between the lines
- The same adjoint technique could be ported to other frameworks to give them comparable sparse-solver support.
- Real-world timing on large-scale scientific datasets would quantify whether the constant-graph overhead remains negligible in practice.
- The library's design suggests a path toward fully automatic differentiation of domain-decomposition codes beyond linear algebra.
- Adding support for additional sparse formats or preconditioners would be a direct next step that preserves the existing API.
Load-bearing premise
The adjoint differentiation framework and halo-exchange layer produce accurate gradients for every supported solver type and backend without instability or prohibitive overhead.
What would settle it
Run a gradient check that compares library gradients against finite differences on a distributed batched nonlinear solve; a mismatch beyond the expected finite-difference truncation and floating-point error would show the central claim is false.
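A minimal version of such a check, reduced to a single small linear solve, might look like the following. It is a sketch under stated assumptions: it does not call torch-sla, all function names are hypothetical, the system is a hand-picked diagonally dominant tridiagonal matrix, and the loss is 0.5 * ||x||^2. Adjoint gradients are compared against central finite differences.

```python
# Stand-alone gradient check for an adjoint-differentiated solve (names hypothetical).

def solve_tridiag(lower, diag, upper, rhs):
    """Thomas algorithm, no pivoting. lower[i] = A[i+1][i], upper[i] = A[i][i+1]."""
    n = len(diag)
    c = [0.0] * n  # modified upper diagonal
    d = [0.0] * n  # modified right-hand side
    c[0] = upper[0] / diag[0] if n > 1 else 0.0
    d[0] = rhs[0] / diag[0]
    for i in range(1, n):
        denom = diag[i] - lower[i - 1] * c[i - 1]
        if i < n - 1:
            c[i] = upper[i] / denom
        d[i] = (rhs[i] - lower[i - 1] * d[i - 1]) / denom
    x = [0.0] * n
    x[-1] = d[-1]
    for i in range(n - 2, -1, -1):
        x[i] = d[i] - c[i] * x[i + 1]
    return x

def loss(lower, diag, upper, rhs):
    """L = 0.5 * ||x||^2 where A x = rhs."""
    x = solve_tridiag(lower, diag, upper, rhs)
    return 0.5 * sum(v * v for v in x)

def adjoint_grads(lower, diag, upper, rhs):
    """Gradients of the loss w.r.t. rhs and diag via one transposed solve."""
    x = solve_tridiag(lower, diag, upper, rhs)
    # A^T lam = dL/dx = x; transposing swaps the two off-diagonals.
    lam = solve_tridiag(upper, diag, lower, x)
    grad_rhs = lam                                   # dL/db = lam
    grad_diag = [-l * xi for l, xi in zip(lam, x)]   # dL/dA_ii = -lam_i * x_i
    return grad_rhs, grad_diag

def fd_grad(f, vec, eps=1e-6):
    """Central finite differences of a scalar function f at vec."""
    grads = []
    for i in range(len(vec)):
        plus = list(vec)
        minus = list(vec)
        plus[i] += eps
        minus[i] -= eps
        grads.append((f(plus) - f(minus)) / (2 * eps))
    return grads

# Nonsymmetric, diagonally dominant test system.
lower = [-1.0, -0.5, -2.0]
diag = [4.0, 5.0, 6.0, 5.0]
upper = [-0.3, -1.2, -0.7]
rhs = [1.0, 2.0, 3.0, 4.0]

grad_rhs, grad_diag = adjoint_grads(lower, diag, upper, rhs)
fd_rhs = fd_grad(lambda v: loss(lower, diag, upper, v), rhs)
fd_diag = fd_grad(lambda v: loss(lower, v, upper, rhs), diag)

max_err = max(abs(a - b)
              for a, b in zip(grad_rhs + grad_diag, fd_rhs + fd_diag))
print("max |adjoint - finite difference| =", max_err)
```

The matrix is deliberately nonsymmetric so that a missing transpose in the adjoint solve would show up as a large error. Extending the check to nonlinear, batched, or distributed solves would follow the same pattern with the library's own solve call in place of `solve_tridiag`.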
Original abstract
Differentiable sparse linear algebra is foundational for scientific machine learning, yet PyTorch lacks a unified library for it: torch.sparse provides only low-level kernels and a non-differentiable, CPU-only spsolve, and torch.linalg is dense-only. We present torch-sla, an open-source library that fills this gap. It exposes a single autograd-aware API for direct, iterative, nonlinear, and eigenvalue solvers across five interchangeable backends -- SciPy and Eigen on CPU, cuDSS, CuPy, and a PyTorch-native iterative solver on GPU -- with automatic dispatch by device and problem size. The library further supports batched solves over shared or distinct sparsity patterns and distributed multi-GPU execution via domain decomposition with halo exchange. These capabilities are made scalable by an O(1)-graph adjoint differentiation framework and an autograd-compatible distributed halo-exchange layer. The library is available at https://www.torchsla.com/.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces torch-sla, an open-source PyTorch library providing a unified autograd-aware API for sparse linear algebra. It supports direct, iterative, nonlinear, and eigenvalue solvers across five interchangeable backends (SciPy/Eigen on CPU, cuDSS/CuPy/PyTorch-native on GPU) with automatic dispatch, batched solves over shared or distinct sparsity patterns, and distributed multi-GPU execution via domain decomposition with halo exchange. Scalability is achieved through an O(1)-graph adjoint differentiation framework and an autograd-compatible halo-exchange layer.
Significance. If the adjoint framework and halo-exchange layer are correctly implemented and validated, the library would address a clear gap in PyTorch for differentiable sparse computations, enabling new applications in scientific machine learning such as physics-informed neural networks and large-scale optimization. The multi-backend design and distributed support are practical strengths; the open-source release further increases potential impact.
Major comments (2)
- [Abstract] The claim that the O(1)-graph adjoint differentiation framework correctly computes gradients for iterative, nonlinear, and eigenvalue solvers (including under distributed halo exchange) is load-bearing for the central contribution, yet no derivation, implementation sketch, finite-difference verification, or per-solver stability analysis is supplied.
- [Abstract] No benchmarks, gradient-accuracy tests, or numerical results are presented to substantiate claims of numerical stability, O(1) memory scaling, or correct behavior when swapping backends or moving to distributed execution; this absence prevents assessment of whether the autograd-compatible layers introduce instability or excessive overhead.
Minor comments (1)
- The abstract states the library URL but provides no concrete API signatures, usage examples, or installation instructions, which would help readers evaluate the claimed single-API design.
Simulated Author's Rebuttal
We thank the referee for the constructive comments on our manuscript. We address each major comment below and will revise the paper to incorporate the requested supporting material.
Point-by-point responses
- Referee: [Abstract] The claim that the O(1)-graph adjoint differentiation framework correctly computes gradients for iterative, nonlinear, and eigenvalue solvers (including under distributed halo exchange) is load-bearing for the central contribution, yet no derivation, implementation sketch, finite-difference verification, or per-solver stability analysis is supplied.
  Authors: We agree that the central claim requires explicit support. In the revised manuscript we will add a dedicated subsection deriving the O(1)-graph adjoint method for each solver class, including implementation sketches, finite-difference verification experiments, and a per-solver stability discussion that explicitly treats the distributed halo-exchange case. These additions will be placed in the methods and experiments sections.
  Revision: yes
- Referee: [Abstract] No benchmarks, gradient-accuracy tests, or numerical results are presented to substantiate claims of numerical stability, O(1) memory scaling, or correct behavior when swapping backends or moving to distributed execution; this absence prevents assessment of whether the autograd-compatible layers introduce instability or excessive overhead.
  Authors: We acknowledge the lack of empirical validation in the current version. The revised manuscript will include a new experiments section containing runtime and memory benchmarks confirming O(1) scaling, gradient-accuracy comparisons against finite differences for all solver types, numerical stability results, and cross-backend plus distributed-execution tests. These will quantify any overhead introduced by the autograd layers.
  Revision: yes
Circularity Check
No circularity: engineering implementation of existing solvers
Full rationale
This is a software library paper describing an API and implementation for differentiable sparse solvers in PyTorch. It introduces no mathematical derivations, fitted parameters, uniqueness theorems, or ansatzes. The O(1)-graph adjoint framework and halo-exchange layer are presented as engineering contributions whose correctness is asserted via implementation rather than derived from prior results within the paper. No load-bearing step reduces to a self-citation or to its own inputs by construction; the work is self-contained as code that wraps and extends standard backends.