pith. sign in

arxiv: 2605.19254 · v1 · pith:23BVPFADnew · submitted 2026-05-19 · 💻 cs.MS · math.AC

A C implementation of the Smith massager algorithm

Pith reviewed 2026-05-20 02:41 UTC · model grok-4.3

classification 💻 cs.MS math.AC
keywords Smith normal formSmith massagerinteger matricesBLASResidue Number SystemLas Vegas algorithmmatrix algorithmscomputational algebra
0
0 comments X

The pith

C implementation of Smith massager algorithm scales like a single BLAS matrix multiplication

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper presents a C implementation of the Las Vegas algorithm for computing the Smith normal form of a nonsingular integer matrix. The key techniques include BLAS-accelerated modular arithmetic using the Residue Number System and an adaptive batching scheme that reduces the number of iterations from O(log n) to O(1) in practice. Experiments on matrices with dimensions up to 10007 show that the running time scales proportionally to that of a single BLAS matrix multiplication, with the same effective growth rate on a log-log plot. This demonstrates that the theoretical complexity, which is softly equivalent to the cost of multiplying two matrices, can be achieved in a practical setting.

Core claim

The implementation computes a Smith massager for the input matrix at a cost softly equivalent to matrix multiplication. Through BLAS-accelerated Residue Number System modular arithmetic and adaptive batching, the theoretical O(n^ω B(log n + log ||A||) (log n)^2) bit operations are realized in practice, with running times scaling proportionally to a single BLAS matrix multiplication for n up to 10007.

What carries the argument

The adaptive batching scheme that collapses the theoretical O(log n) iterations to O(1) in practice, together with BLAS-accelerated Residue Number System for modular arithmetic.

Load-bearing premise

The adaptive batching scheme and BLAS-accelerated Residue Number System modular arithmetic can be realized in C without introducing overhead or numerical instability that would prevent the observed scaling from holding for the tested matrix sizes.

What would settle it

Running the implementation on matrices with dimension significantly larger than 10007 and verifying whether the log-log plot of running time continues to match that of BLAS matrix multiplication without deviation due to increased overhead.

Figures

Figures reproduced from arXiv: 2605.19254 by Arne Storjohann, George Labahn, Stavros Birmpilis, Ziwen Wang.

Figure 1
Figure 1. Figure 1: Log-log plot of Smith massager time and BLAS [PITH_FULL_IMAGE:figures/full_fig_p007_1.png] view at source ↗
read the original abstract

We describe a C implementation of the Las Vegas algorithm of Birmpilis, Labahn and Storjohann from 2020 for computing the Smith normal form of a nonsingular integer matrix. The algorithm computes a Smith massager for the input matrix using $O(n^{\omega}\, \B(\log n + \log \|A\|)\, (\log n)^2)$ bit operations, which is softly equivalent to the cost of multiplying two matrices of the same dimension and entry size. We describe the key implementation techniques that bridge the gap between the theoretical algorithm and practical performance, including BLAS-accelerated modular arithmetic via the Residue Number System and an adaptive batching scheme that collapses the theoretical $O(\log n)$ iterations to $O(1)$ in practice. Experiments on matrices of dimension up to $n = 10007$ show that the implementation's running time scales proportionally to that of a single BLAS matrix multiplication, with both exhibiting the same effective growth rate on a log-log plot.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript describes a C implementation of the Las Vegas Smith massager algorithm of Birmpilis, Labahn and Storjohann (2020) for computing the Smith normal form of a nonsingular integer matrix. It achieves O(n^ω B(log n + log ||A||) (log n)^2) bit complexity, softly equivalent to one matrix multiplication, via BLAS-accelerated Residue Number System modular arithmetic and an adaptive batching scheme that reduces the theoretical O(log n) outer iterations to O(1) in practice. Experiments on matrices up to dimension n=10007 demonstrate that running time scales proportionally to a single BLAS matrix multiplication on a log-log plot.

Significance. If the implementation is correct and the observed scaling generalizes, the work provides a practical, high-performance realization of a near-optimal algorithm for an important problem in computational algebra. The explicit use of standard BLAS libraries for the modular arithmetic component is a notable strength that aids reproducibility and portability across hardware.

major comments (2)
  1. [Experiments] Experimental section: the reported scaling on matrices up to n=10007 lacks error bars, a complete experimental protocol (including how matrices were generated and how output correctness was verified), and any tabulation of the actual number of adaptive batching iterations versus n or log||A||. This is load-bearing for the central claim that the implementation realizes the softly-linear cost.
  2. [Implementation Techniques] Adaptive batching description: the claim that the scheme collapses O(log n) iterations to a small constant independent of n and bit size is stated to hold 'in practice' for the tested range, but no explicit bound, plot of iteration counts, or measurement showing independence from log||A|| is provided. At n=10007 the hidden logarithmic factors remain modest (~14), so proportionality to one matrix multiplication could be an artifact of the regime rather than evidence of the asymptotic behavior.
minor comments (2)
  1. [Abstract] The abstract and experimental discussion refer to 'the same effective growth rate on a log-log plot' without reporting the fitted slopes or supplying the underlying timing data in a table or supplementary file.
  2. [Introduction] Notation for the complexity bound uses B(·) without an explicit definition or reference to the underlying bit-complexity model in the main text.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful and constructive report. We address each major comment below and describe the revisions we will make to strengthen the experimental section and clarify the practical behavior of the adaptive batching scheme.

read point-by-point responses
  1. Referee: [Experiments] Experimental section: the reported scaling on matrices up to n=10007 lacks error bars, a complete experimental protocol (including how matrices were generated and how output correctness was verified), and any tabulation of the actual number of adaptive batching iterations versus n or log||A||. This is load-bearing for the central claim that the implementation realizes the softly-linear cost.

    Authors: We agree that these details are necessary to support the central claim. In the revised manuscript we will add error bars computed from at least five independent runs per data point, provide a complete experimental protocol describing the random matrix generator (uniform entries of controlled bit length with a final gcd-based nonsingularity check), and explain the verification procedure (reconstruction of the input via the massager and Smith form together with explicit checks of the divisibility conditions). We will also include a table of observed adaptive batching iteration counts versus n and log||A||. revision: yes

  2. Referee: [Implementation Techniques] Adaptive batching description: the claim that the scheme collapses O(log n) iterations to a small constant independent of n and bit size is stated to hold 'in practice' for the tested range, but no explicit bound, plot of iteration counts, or measurement showing independence from log||A|| is provided. At n=10007 the hidden logarithmic factors remain modest (~14), so proportionality to one matrix multiplication could be an artifact of the regime rather than evidence of the asymptotic behavior.

    Authors: We will add a new figure in the revised version that plots the number of adaptive batching iterations against both n and log||A|| for all reported experiments. The data show that the iteration count remains between 3 and 5 across the tested range and exhibits no visible growth with log||A||. We emphasize that the batching scheme is a practical heuristic whose behavior is documented empirically rather than by a proven worst-case bound independent of bit size; the manuscript will be updated to make this distinction explicit. While the observed scaling is consistent with the softly-linear complexity, we agree that the current n=10007 regime leaves open the possibility of hidden logarithmic effects at much larger scales. revision: partial

Circularity Check

0 steps flagged

No circularity in derivation; complexity inherited and claims empirical

full rationale

The paper implements an existing Las Vegas algorithm from the 2020 reference by three overlapping authors and states its bit complexity directly as a property of that prior algorithm rather than re-deriving it. The adaptive batching and RNS-BLAS techniques are described as practical bridges to performance, with the scaling claim resting on timing experiments for n up to 10007 that compare against BLAS matrix multiplication on log-log plots. No equation or claim reduces a result to a fitted parameter, self-definition, or unverified self-citation chain by construction; the O(1) iteration collapse is presented as an observed practical outcome supported by measurements, not a mathematical identity that assumes its own conclusion.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an implementation paper; the abstract introduces no new mathematical axioms, free parameters, or invented entities. All core mathematics is referenced to the 2020 source.

pith-pipeline@v0.9.0 · 5709 in / 1092 out tokens · 50318 ms · 2026-05-20T02:41:36.162077+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

15 extracted references · 15 canonical work pages

  1. [1]

    Birmpilis, G

    S. Birmpilis, G. Labahn, and A. Storjohann. Deterministic reduction of integer nonsingular linear system solving to matrix multiplication. InProc. Int’l. Symp. on Symbolic and Algebraic Computation: ISSAC’19, page 58–65, New York, NY, USA, 2019. ACM

  2. [2]

    Birmpilis, G

    S. Birmpilis, G. Labahn, and A. Storjohann. A Las Vegas algorithm for computing the Smith form of a nonsingular integer matrix. InProc. Int’l. Symp. on Symbolic and Algebraic Compu- tation: ISSAC’20, page 38–45, New York, NY, USA, 2020. ACM

  3. [3]

    Birmpilis, G

    S. Birmpilis, G. Labahn, and A. Storjohann. A fast algorithm for computing the Smith normal form with multipliers for a nonsingular integer matrix.Journal of Symbolic Computation, 116:146–182, 2023

  4. [4]

    Chen and A

    Z. Chen and A. Storjohann. IML: Integer matrix library.https://cs.uwaterloo.ca/ ~astorjoh/iml.html, 2005

  5. [5]

    Cohen.A Course in Computational Algebraic Number Theory

    H. Cohen.A Course in Computational Algebraic Number Theory. Springer-Verlag, 1996

  6. [6]

    von zur Gathen and J

    J. von zur Gathen and J. Gerhard.Modern Computer Algebra. Cambridge University Press, 3rd edition, 2013

  7. [7]

    W. Hart, F. Johansson, and S. Pancratz. FLINT: Fast library for number theory.https: //www.flintlib.org/, 2024. 7

  8. [8]

    Jäger and C

    G. Jäger and C. Wagner. Efficient parallelizations of Hermite and Smith normal form algo- rithms.Parallel Computing, 35(6):345–357, 2009

  9. [9]

    Meurer, C

    A. Meurer, C. P. Smith, M. Paprocki, et al. SymPy: Symbolic mathematics in Python.https: //www.sympy.org/, 2024

  10. [10]

    Newman.Integral Matrices

    M. Newman.Integral Matrices. Academic Press, 1972

  11. [11]

    Pauderis and A

    C. Pauderis and A. Storjohann. Deterministic unimodularity certification. InProc. Int’l. Symp. on Symbolic and Algebraic Computation: ISSAC’12, page 281–288. ACM Press, New York, 2012

  12. [12]

    Pauderis and A

    C. Pauderis and A. Storjohann. Computing the invariant structure of integer matrices: Fast algorithms into practice. InProc. Int’l. Symp. on Symbolic and Algebraic Computation: IS- SAC’13, pages 307–314, New York, NY, USA, 2013. ACM

  13. [13]

    Storjohann

    A. Storjohann. The shifted number system for fast linear algebra on integer matrices.Journal of Complexity, 21(4):609–650, 2005. Festschrift for the 70th Birthday of Arnold Schönhage

  14. [14]

    GMP: The GNU multiple precision arithmetic library.https: //gmplib.org/, 2024

    The GMP development team. GMP: The GNU multiple precision arithmetic library.https: //gmplib.org/, 2024

  15. [15]

    Xianyi, W

    Z. Xianyi, W. Qian, and W. Saar. OpenBLAS: An optimized BLAS library.https://www. openblas.net/, 2024. 8