A C implementation of the Smith massager algorithm

Arne Storjohann; George Labahn; Stavros Birmpilis; Ziwen Wang

arxiv: 2605.19254 · v1 · pith:23BVPFADnew · submitted 2026-05-19 · 💻 cs.MS · math.AC

A C implementation of the Smith massager algorithm

Ziwen Wang , Stavros Birmpilis , George Labahn , Arne Storjohann This is my paper

Pith reviewed 2026-05-20 02:41 UTC · model grok-4.3

classification 💻 cs.MS math.AC

keywords Smith normal formSmith massagerinteger matricesBLASResidue Number SystemLas Vegas algorithmmatrix algorithmscomputational algebra

0 comments

The pith

C implementation of Smith massager algorithm scales like a single BLAS matrix multiplication

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper presents a C implementation of the Las Vegas algorithm for computing the Smith normal form of a nonsingular integer matrix. The key techniques include BLAS-accelerated modular arithmetic using the Residue Number System and an adaptive batching scheme that reduces the number of iterations from O(log n) to O(1) in practice. Experiments on matrices with dimensions up to 10007 show that the running time scales proportionally to that of a single BLAS matrix multiplication, with the same effective growth rate on a log-log plot. This demonstrates that the theoretical complexity, which is softly equivalent to the cost of multiplying two matrices, can be achieved in a practical setting.

Core claim

The implementation computes a Smith massager for the input matrix at a cost softly equivalent to matrix multiplication. Through BLAS-accelerated Residue Number System modular arithmetic and adaptive batching, the theoretical O(n^ω B(log n + log ||A||) (log n)^2) bit operations are realized in practice, with running times scaling proportionally to a single BLAS matrix multiplication for n up to 10007.

What carries the argument

The adaptive batching scheme that collapses the theoretical O(log n) iterations to O(1) in practice, together with BLAS-accelerated Residue Number System for modular arithmetic.

Load-bearing premise

The adaptive batching scheme and BLAS-accelerated Residue Number System modular arithmetic can be realized in C without introducing overhead or numerical instability that would prevent the observed scaling from holding for the tested matrix sizes.

What would settle it

Running the implementation on matrices with dimension significantly larger than 10007 and verifying whether the log-log plot of running time continues to match that of BLAS matrix multiplication without deviation due to increased overhead.

Figures

Figures reproduced from arXiv: 2605.19254 by Arne Storjohann, George Labahn, Stavros Birmpilis, Ziwen Wang.

read the original abstract

We describe a C implementation of the Las Vegas algorithm of Birmpilis, Labahn and Storjohann from 2020 for computing the Smith normal form of a nonsingular integer matrix. The algorithm computes a Smith massager for the input matrix using $O(n^{\omega}\, \B(\log n + \log \|A\|)\, (\log n)^2)$ bit operations, which is softly equivalent to the cost of multiplying two matrices of the same dimension and entry size. We describe the key implementation techniques that bridge the gap between the theoretical algorithm and practical performance, including BLAS-accelerated modular arithmetic via the Residue Number System and an adaptive batching scheme that collapses the theoretical $O(\log n)$ iterations to $O(1)$ in practice. Experiments on matrices of dimension up to $n = 10007$ show that the implementation's running time scales proportionally to that of a single BLAS matrix multiplication, with both exhibiting the same effective growth rate on a log-log plot.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

This is a practical C implementation of the 2020 Smith normal form algorithm that shows timing scaling like one BLAS matrix multiply up to n=10007, with useful engineering details but limited evidence on whether the batching stays constant beyond the tested range.

read the letter

The main thing here is a C implementation of the 2020 Smith normal form algorithm that achieves running times proportional to a single BLAS matrix multiplication for matrices up to dimension 10007. They do a solid job on the engineering. The BLAS-accelerated residue number system for modular arithmetic and the adaptive batching to reduce iterations are the practical contributions that make the theoretical cost achievable in code. The log-log plot comparison is a clear way to show the scaling matches. This kind of work helps close the gap between asymptotic analysis and what actually runs fast on current hardware. The soft spots are minor but worth noting. The adaptive batching is shown to work in the tested range, but since log n is still modest at 10007, it is not yet proven that the iteration count stays constant for much bigger problems or higher bit sizes. The abstract also leaves out specifics on how they verified that the computed forms are correct and on any variation in the timing measurements. This paper is for people in computational algebra who need fast tools for Smith normal form on integer matrices. A reader working on exact linear algebra software or looking for ways to implement similar algorithms will find the implementation techniques and performance data useful. It has enough new practical content on top of the prior theory to deserve a serious referee. I would recommend sending it to peer review, perhaps with a request for more details on verification and iteration counts.

Referee Report

2 major / 2 minor

Summary. The manuscript describes a C implementation of the Las Vegas Smith massager algorithm of Birmpilis, Labahn and Storjohann (2020) for computing the Smith normal form of a nonsingular integer matrix. It achieves O(n^ω B(log n + log ||A||) (log n)^2) bit complexity, softly equivalent to one matrix multiplication, via BLAS-accelerated Residue Number System modular arithmetic and an adaptive batching scheme that reduces the theoretical O(log n) outer iterations to O(1) in practice. Experiments on matrices up to dimension n=10007 demonstrate that running time scales proportionally to a single BLAS matrix multiplication on a log-log plot.

Significance. If the implementation is correct and the observed scaling generalizes, the work provides a practical, high-performance realization of a near-optimal algorithm for an important problem in computational algebra. The explicit use of standard BLAS libraries for the modular arithmetic component is a notable strength that aids reproducibility and portability across hardware.

major comments (2)

[Experiments] Experimental section: the reported scaling on matrices up to n=10007 lacks error bars, a complete experimental protocol (including how matrices were generated and how output correctness was verified), and any tabulation of the actual number of adaptive batching iterations versus n or log||A||. This is load-bearing for the central claim that the implementation realizes the softly-linear cost.
[Implementation Techniques] Adaptive batching description: the claim that the scheme collapses O(log n) iterations to a small constant independent of n and bit size is stated to hold 'in practice' for the tested range, but no explicit bound, plot of iteration counts, or measurement showing independence from log||A|| is provided. At n=10007 the hidden logarithmic factors remain modest (~14), so proportionality to one matrix multiplication could be an artifact of the regime rather than evidence of the asymptotic behavior.

minor comments (2)

[Abstract] The abstract and experimental discussion refer to 'the same effective growth rate on a log-log plot' without reporting the fitted slopes or supplying the underlying timing data in a table or supplementary file.
[Introduction] Notation for the complexity bound uses B(·) without an explicit definition or reference to the underlying bit-complexity model in the main text.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful and constructive report. We address each major comment below and describe the revisions we will make to strengthen the experimental section and clarify the practical behavior of the adaptive batching scheme.

read point-by-point responses

Referee: [Experiments] Experimental section: the reported scaling on matrices up to n=10007 lacks error bars, a complete experimental protocol (including how matrices were generated and how output correctness was verified), and any tabulation of the actual number of adaptive batching iterations versus n or log||A||. This is load-bearing for the central claim that the implementation realizes the softly-linear cost.

Authors: We agree that these details are necessary to support the central claim. In the revised manuscript we will add error bars computed from at least five independent runs per data point, provide a complete experimental protocol describing the random matrix generator (uniform entries of controlled bit length with a final gcd-based nonsingularity check), and explain the verification procedure (reconstruction of the input via the massager and Smith form together with explicit checks of the divisibility conditions). We will also include a table of observed adaptive batching iteration counts versus n and log||A||. revision: yes
Referee: [Implementation Techniques] Adaptive batching description: the claim that the scheme collapses O(log n) iterations to a small constant independent of n and bit size is stated to hold 'in practice' for the tested range, but no explicit bound, plot of iteration counts, or measurement showing independence from log||A|| is provided. At n=10007 the hidden logarithmic factors remain modest (~14), so proportionality to one matrix multiplication could be an artifact of the regime rather than evidence of the asymptotic behavior.

Authors: We will add a new figure in the revised version that plots the number of adaptive batching iterations against both n and log||A|| for all reported experiments. The data show that the iteration count remains between 3 and 5 across the tested range and exhibits no visible growth with log||A||. We emphasize that the batching scheme is a practical heuristic whose behavior is documented empirically rather than by a proven worst-case bound independent of bit size; the manuscript will be updated to make this distinction explicit. While the observed scaling is consistent with the softly-linear complexity, we agree that the current n=10007 regime leaves open the possibility of hidden logarithmic effects at much larger scales. revision: partial

Circularity Check

0 steps flagged

No circularity in derivation; complexity inherited and claims empirical

full rationale

The paper implements an existing Las Vegas algorithm from the 2020 reference by three overlapping authors and states its bit complexity directly as a property of that prior algorithm rather than re-deriving it. The adaptive batching and RNS-BLAS techniques are described as practical bridges to performance, with the scaling claim resting on timing experiments for n up to 10007 that compare against BLAS matrix multiplication on log-log plots. No equation or claim reduces a result to a fitted parameter, self-definition, or unverified self-citation chain by construction; the O(1) iteration collapse is presented as an observed practical outcome supported by measurements, not a mathematical identity that assumes its own conclusion.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an implementation paper; the abstract introduces no new mathematical axioms, free parameters, or invented entities. All core mathematics is referenced to the 2020 source.

pith-pipeline@v0.9.0 · 5709 in / 1092 out tokens · 50318 ms · 2026-05-20T02:41:36.162077+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

We describe a C implementation of the Las Vegas algorithm of Birmpilis, Labahn and Storjohann from 2020 for computing the Smith normal form of a nonsingular integer matrix.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

15 extracted references · 15 canonical work pages

[1]

Birmpilis, G

S. Birmpilis, G. Labahn, and A. Storjohann. Deterministic reduction of integer nonsingular linear system solving to matrix multiplication. InProc. Int’l. Symp. on Symbolic and Algebraic Computation: ISSAC’19, page 58–65, New York, NY, USA, 2019. ACM

work page 2019
[2]

Birmpilis, G

S. Birmpilis, G. Labahn, and A. Storjohann. A Las Vegas algorithm for computing the Smith form of a nonsingular integer matrix. InProc. Int’l. Symp. on Symbolic and Algebraic Compu- tation: ISSAC’20, page 38–45, New York, NY, USA, 2020. ACM

work page 2020
[3]

Birmpilis, G

S. Birmpilis, G. Labahn, and A. Storjohann. A fast algorithm for computing the Smith normal form with multipliers for a nonsingular integer matrix.Journal of Symbolic Computation, 116:146–182, 2023

work page 2023
[4]

Chen and A

Z. Chen and A. Storjohann. IML: Integer matrix library.https://cs.uwaterloo.ca/ ~astorjoh/iml.html, 2005

work page 2005
[5]

Cohen.A Course in Computational Algebraic Number Theory

H. Cohen.A Course in Computational Algebraic Number Theory. Springer-Verlag, 1996

work page 1996
[6]

von zur Gathen and J

J. von zur Gathen and J. Gerhard.Modern Computer Algebra. Cambridge University Press, 3rd edition, 2013

work page 2013
[7]

W. Hart, F. Johansson, and S. Pancratz. FLINT: Fast library for number theory.https: //www.flintlib.org/, 2024. 7

work page 2024
[8]

Jäger and C

G. Jäger and C. Wagner. Efficient parallelizations of Hermite and Smith normal form algo- rithms.Parallel Computing, 35(6):345–357, 2009

work page 2009
[9]

Meurer, C

A. Meurer, C. P. Smith, M. Paprocki, et al. SymPy: Symbolic mathematics in Python.https: //www.sympy.org/, 2024

work page 2024
[10]

Newman.Integral Matrices

M. Newman.Integral Matrices. Academic Press, 1972

work page 1972
[11]

Pauderis and A

C. Pauderis and A. Storjohann. Deterministic unimodularity certification. InProc. Int’l. Symp. on Symbolic and Algebraic Computation: ISSAC’12, page 281–288. ACM Press, New York, 2012

work page 2012
[12]

Pauderis and A

C. Pauderis and A. Storjohann. Computing the invariant structure of integer matrices: Fast algorithms into practice. InProc. Int’l. Symp. on Symbolic and Algebraic Computation: IS- SAC’13, pages 307–314, New York, NY, USA, 2013. ACM

work page 2013
[13]

Storjohann

A. Storjohann. The shifted number system for fast linear algebra on integer matrices.Journal of Complexity, 21(4):609–650, 2005. Festschrift for the 70th Birthday of Arnold Schönhage

work page 2005
[14]

GMP: The GNU multiple precision arithmetic library.https: //gmplib.org/, 2024

The GMP development team. GMP: The GNU multiple precision arithmetic library.https: //gmplib.org/, 2024

work page 2024
[15]

Xianyi, W

Z. Xianyi, W. Qian, and W. Saar. OpenBLAS: An optimized BLAS library.https://www. openblas.net/, 2024. 8

work page 2024

[1] [1]

Birmpilis, G

S. Birmpilis, G. Labahn, and A. Storjohann. Deterministic reduction of integer nonsingular linear system solving to matrix multiplication. InProc. Int’l. Symp. on Symbolic and Algebraic Computation: ISSAC’19, page 58–65, New York, NY, USA, 2019. ACM

work page 2019

[2] [2]

Birmpilis, G

S. Birmpilis, G. Labahn, and A. Storjohann. A Las Vegas algorithm for computing the Smith form of a nonsingular integer matrix. InProc. Int’l. Symp. on Symbolic and Algebraic Compu- tation: ISSAC’20, page 38–45, New York, NY, USA, 2020. ACM

work page 2020

[3] [3]

Birmpilis, G

S. Birmpilis, G. Labahn, and A. Storjohann. A fast algorithm for computing the Smith normal form with multipliers for a nonsingular integer matrix.Journal of Symbolic Computation, 116:146–182, 2023

work page 2023

[4] [4]

Chen and A

Z. Chen and A. Storjohann. IML: Integer matrix library.https://cs.uwaterloo.ca/ ~astorjoh/iml.html, 2005

work page 2005

[5] [5]

Cohen.A Course in Computational Algebraic Number Theory

H. Cohen.A Course in Computational Algebraic Number Theory. Springer-Verlag, 1996

work page 1996

[6] [6]

von zur Gathen and J

J. von zur Gathen and J. Gerhard.Modern Computer Algebra. Cambridge University Press, 3rd edition, 2013

work page 2013

[7] [7]

W. Hart, F. Johansson, and S. Pancratz. FLINT: Fast library for number theory.https: //www.flintlib.org/, 2024. 7

work page 2024

[8] [8]

Jäger and C

G. Jäger and C. Wagner. Efficient parallelizations of Hermite and Smith normal form algo- rithms.Parallel Computing, 35(6):345–357, 2009

work page 2009

[9] [9]

Meurer, C

A. Meurer, C. P. Smith, M. Paprocki, et al. SymPy: Symbolic mathematics in Python.https: //www.sympy.org/, 2024

work page 2024

[10] [10]

Newman.Integral Matrices

M. Newman.Integral Matrices. Academic Press, 1972

work page 1972

[11] [11]

Pauderis and A

C. Pauderis and A. Storjohann. Deterministic unimodularity certification. InProc. Int’l. Symp. on Symbolic and Algebraic Computation: ISSAC’12, page 281–288. ACM Press, New York, 2012

work page 2012

[12] [12]

Pauderis and A

C. Pauderis and A. Storjohann. Computing the invariant structure of integer matrices: Fast algorithms into practice. InProc. Int’l. Symp. on Symbolic and Algebraic Computation: IS- SAC’13, pages 307–314, New York, NY, USA, 2013. ACM

work page 2013

[13] [13]

Storjohann

A. Storjohann. The shifted number system for fast linear algebra on integer matrices.Journal of Complexity, 21(4):609–650, 2005. Festschrift for the 70th Birthday of Arnold Schönhage

work page 2005

[14] [14]

GMP: The GNU multiple precision arithmetic library.https: //gmplib.org/, 2024

The GMP development team. GMP: The GNU multiple precision arithmetic library.https: //gmplib.org/, 2024

work page 2024

[15] [15]

Xianyi, W

Z. Xianyi, W. Qian, and W. Saar. OpenBLAS: An optimized BLAS library.https://www. openblas.net/, 2024. 8

work page 2024