Recursive vectorized computation of the vector $p$-norm

Vedran Novakovi\'c

arxiv: 2509.06220 · v3 · submitted 2025-09-07 · 🧮 math.NA · cs.NA

Recursive vectorized computation of the vector p-norm

Vedran Novakovi\'c This is my paper

Pith reviewed 2026-05-18 17:50 UTC · model grok-4.3

classification 🧮 math.NA cs.NA

keywords p-normFrobenius normhypot functionrecursive algorithmvectorizationnumerical accuracyBLAS DNRM2bitwise reproducibility

0 comments

The pith

Recursive hypot-based algorithms compute the vector p-norm more accurately than BLAS DNRM2 in many cases.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces recursive algorithms for the Frobenius norm that accumulate terms using the hypot function to reduce rounding errors in floating-point arithmetic. These scalar methods are then vectorized with Intel instructions for performance matching standard libraries and further parallelized with OpenCilk. A direct modification extends the approach to the general vector p-norm. Accuracy comparisons show that the new methods often achieve significantly tighter relative error bounds than the BLAS routine DNRM2. Certain scalar variants guarantee bitwise reproducibility independent of execution order.

Core claim

Recursive algorithms built around repeated calls to hypot produce relative accuracy bounds for the Frobenius norm that are often substantially tighter than those of DNRM2, while vectorization and OpenCilk parallelization deliver comparable runtime performance; the same framework extends to the p-norm and yields unconditional bitwise reproducibility in the scalar case.

What carries the argument

Recursive accumulation via the hypot function, which safely computes the Euclidean length of two numbers to avoid overflow and underflow during norm summation.

If this is right

Norm computations in linear algebra routines can achieve higher floating-point accuracy without extra cost.
Vectorized and parallel versions deliver performance on par with established BLAS implementations.
Scalar versions produce bitwise-identical results regardless of summation order.
The same recursive structure applies directly to any vector p-norm for p at least one.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar recursive hypot patterns could reduce error accumulation in other summation-heavy kernels such as dot products or matrix norms.
Library maintainers might adopt these methods to improve default numerical quality in scientific computing codes.
Reproducibility properties could simplify debugging of parallel numerical applications when exact match across runs is required.

Load-bearing premise

The accuracy improvement relies on the specific floating-point implementation of hypot and the error model assumed for the recursive summation.

What would settle it

Direct comparison of computed norms against a high-precision reference on the same test vectors, repeated across different compilers and hardware to check whether the claimed accuracy edge holds.

Figures

Figures reproduced from arXiv: 2509.06220 by Vedran Novakovi\'c.

**Figure 1.** Figure 1: The observed relative errors (33) for L and C in both precisions. gain in performance. For extremely large n a thread-based parallelization will help, but even then, single-threaded vectorized subrecursions should run faster than (but with a similar accuracy as) sequential scalar ones, as demonstrated in the following. Listing 3 is an implementation of (5) for x = D and p = 8, similar to [8, Algorithm 2.1… view at source ↗

**Figure 2.** Figure 2: Slowdown (38) of the scalar algorithms versus [PITH_FULL_IMAGE:figures/full_fig_p015_2.png] view at source ↗

**Figure 3.** Figure 3: Slowdown (38) of many recursive algorithms versus [PITH_FULL_IMAGE:figures/full_fig_p015_3.png] view at source ↗

**Figure 4.** Figure 4: Speedup (38) of the recursive vectorized algorithms versus [PITH_FULL_IMAGE:figures/full_fig_p016_4.png] view at source ↗

**Figure 5.** Figure 5: Speedup (38) of some recursive vectorized algorithms versus [PITH_FULL_IMAGE:figures/full_fig_p016_5.png] view at source ↗

read the original abstract

Recursive algorithms for computing the Frobenius norm of a real array are proposed, based on hypot, a hypotenuse function. Comparing their relative accuracy bounds with those of the BLAS routine DNRM2 it is shown that the proposed algorithms could in many cases be significantly more accurate. The scalar recursive algorithms are vectorized with the Intel's vector instructions to achieve performance comparable to DNRM2, and are further parallelized with OpenCilk. Some scalar algorithms are unconditionally bitwise reproducible, while the reproducibility of the vector ones depends on the vector width. A modification of the proposed algorithms to compute the vector $p$-norm is also presented.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes recursive algorithms for computing the Frobenius norm (vector 2-norm) of real arrays using the hypot function. It derives and compares relative accuracy bounds against the BLAS DNRM2 routine, claiming the proposed methods can be significantly more accurate in many cases. Scalar versions are vectorized via Intel instructions for performance comparable to DNRM2 and further parallelized with OpenCilk; some scalar algorithms are unconditionally bitwise reproducible while vector versions depend on width. A modification for the general vector p-norm is also presented.

Significance. If the derived relative error bounds are correctly obtained and strictly tighter than DNRM2's in many cases, the work could improve numerical stability for norm computations, a core primitive in linear algebra and scientific computing. The vectorization, parallelization, and reproducibility features add practical value for high-performance and reproducible numerical software.

major comments (2)

[Accuracy bounds derivation (likely §3–4)] The central accuracy claim rests on the comparison of relative error bounds for the recursive hypot accumulation versus DNRM2. The error analysis must explicitly derive the bound under a portable floating-point model for hypot (relative error O(ε) per call) and show how recursion yields a bound independent of or better than the typical O(nε) growth for scaled summation; without the explicit expressions or propagation analysis, the assertion that the algorithms 'could in many cases be significantly more accurate' cannot be assessed.
[Comparison section (likely §5)] The comparison to DNRM2 requires concrete cases or parameter ranges (e.g., vector length n, condition number, or overflow regimes) where the proposed bound is provably tighter; the dependence on hardware-specific hypot behavior and vectorized rounding must be addressed to establish portability of the accuracy advantage.

minor comments (2)

Provide pseudocode or explicit algorithmic listings for the scalar recursive, vectorized, and p-norm variants to improve clarity and reproducibility.
Clarify the exact definition of the vector p-norm modification and how it reduces to the Frobenius case when p=2.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thorough review and valuable feedback on our manuscript. We address the major comments point by point below, providing clarifications and committing to revisions where appropriate to strengthen the presentation of our accuracy analysis and comparisons.

read point-by-point responses

Referee: [Accuracy bounds derivation (likely §3–4)] The central accuracy claim rests on the comparison of relative error bounds for the recursive hypot accumulation versus DNRM2. The error analysis must explicitly derive the bound under a portable floating-point model for hypot (relative error O(ε) per call) and show how recursion yields a bound independent of or better than the typical O(nε) growth for scaled summation; without the explicit expressions or propagation analysis, the assertion that the algorithms 'could in many cases be significantly more accurate' cannot be assessed.

Authors: We appreciate this comment, as it helps us improve the clarity of our error analysis. The manuscript in Sections 3 and 4 does derive the relative error bounds assuming a standard model where each hypot invocation incurs a relative error of at most γ = O(ε), with ε the unit roundoff. The recursive structure allows the error to be bounded independently of the vector length n in the absence of overflow, which is a key advantage over the O(nε) bound typical for DNRM2's scaled summation approach. To make this more explicit, we will include a detailed step-by-step propagation of the error terms through the recursion in the revised version, including the explicit expressions for the accumulated bound. revision: yes
Referee: [Comparison section (likely §5)] The comparison to DNRM2 requires concrete cases or parameter ranges (e.g., vector length n, condition number, or overflow regimes) where the proposed bound is provably tighter; the dependence on hardware-specific hypot behavior and vectorized rounding must be addressed to establish portability of the accuracy advantage.

Authors: We agree that providing concrete cases will better illustrate the practical benefits. While Section 5 includes numerical experiments comparing accuracy, we will expand it with specific examples, such as for large n (e.g., n=10^6), varying condition numbers, and regimes prone to overflow/underflow, demonstrating where the proposed bound is strictly tighter. For portability, our analysis relies on a portable floating-point model for hypot rather than hardware specifics; however, we acknowledge that actual implementations may vary. We will add a discussion on how vectorized rounding affects reproducibility and accuracy, and include results from different hardware to support the portability of the accuracy advantage. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation relies on algorithmic construction and standard FP error analysis

full rationale

The paper proposes recursive hypot-based algorithms for vector p-norms, derives relative accuracy bounds via standard floating-point error models for hypot and accumulation, and compares them directly to DNRM2. No steps reduce by definition to fitted inputs, self-citations, or ansatzes; bounds are obtained from explicit error propagation assumptions independent of the target result. The work is self-contained against external benchmarks (BLAS) with no load-bearing self-referential elements.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The work rests on standard properties of the hypot function and floating-point arithmetic models; no free parameters, invented entities, or ad-hoc axioms are identifiable from the abstract.

axioms (1)

standard math The hypot function provides a numerically stable way to compute sqrt(x^2 + y^2) without overflow or underflow.
Invoked as the foundation for the recursive norm algorithms.

pith-pipeline@v0.9.0 · 5628 in / 1167 out tokens · 41366 ms · 2026-05-18T17:50:36.465802+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

11 extracted references · 11 canonical work pages

[1]

Anderson, Z

E. Anderson, Z. Bai, C. Bischof, S. Blackford, J. Demmel, J. Dongarra, J. Du Croz, A. Greenbaum, S. Hammarling, A. McKenney and D. Sorensen,LAPACK Users’ Guide. Society for Industrial and Applied Mathematics, 3 rd edn., (1999)

work page 1999
[2]

J. L. Blue, A portable Fortran program to find the Euclidean norm of a vector,ACM Trans. Math. Software4(1) (1978) 15–23

work page 1978
[3]

Anderson, Algorithm 978: Safe scaling in the Level 1 BLAS,ACM Trans

E. Anderson, Algorithm 978: Safe scaling in the Level 1 BLAS,ACM Trans. Math. Software44(1) (2017) art. no. 12

work page 2017
[4]

Novakovi´ c, Arithmetical enhancements of the Kogbetliantz method for the SVD of order two,Numer

V. Novakovi´ c, Arithmetical enhancements of the Kogbetliantz method for the SVD of order two,Numer. Algorithms(2025) online athttps://doi.org/10.1007/ s11075-025-02035-7

work page 2025
[5]

Sibidanov, P

A. Sibidanov, P. Zimmermann and S. Glondu, The CORE-MATH project,29 th IEEE Symposium on Computer Arithmetic (ARITH)(2022) 26–34

work page 2022
[6]

C. F. Borges, Algorithm 1014: An improved algorithm for hypot(x,y),ACM Trans. Math. Softw.47(1) (2020) art. no. 9

work page 2020
[7]

Shibata and F

N. Shibata and F. Petrogalli, SLEEF: A portable vectorized library of C standard mathematical functions,IEEE Trans. Parallel Distrib. Syst.31(6) (2020) 1316–1327

work page 2020
[8]

Novakovi´ c, Vectorization of a thread-parallel Jacobi singular value decomposition method,SIAM J

V. Novakovi´ c, Vectorization of a thread-parallel Jacobi singular value decomposition method,SIAM J. Sci. Comput.45(3) (2023) C73–C100

work page 2023
[9]

T. B. Schardl and I.-T. A. Lee, OpenCilk: A modular and extensible software in- frastructure for fast task-parallel code,28 th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming (PPoPP)(2023) 189–203

work page 2023
[10]

Fousse, G

L. Fousse, G. Hanrot, V. Lef` evre, P. P´ elissier and P. Zimmermann, MPFR: A multiple- precision binary floating-point library with correct rounding,ACM Trans. Math. Softw.33(2) (2007) art. no. 13

work page 2007
[11]

org/wp-content/uploads/OpenMP-API-Specification-6-0.pdf, (2024)

OpenMP ARB,OpenMP Application Programming Interface.https://www.openmp. org/wp-content/uploads/OpenMP-API-Specification-6-0.pdf, (2024). fSee the MFBDA project’s web page athttps://web.math.pmf.unizg.hr/mfbda

work page 2024

[1] [1]

Anderson, Z

E. Anderson, Z. Bai, C. Bischof, S. Blackford, J. Demmel, J. Dongarra, J. Du Croz, A. Greenbaum, S. Hammarling, A. McKenney and D. Sorensen,LAPACK Users’ Guide. Society for Industrial and Applied Mathematics, 3 rd edn., (1999)

work page 1999

[2] [2]

J. L. Blue, A portable Fortran program to find the Euclidean norm of a vector,ACM Trans. Math. Software4(1) (1978) 15–23

work page 1978

[3] [3]

Anderson, Algorithm 978: Safe scaling in the Level 1 BLAS,ACM Trans

E. Anderson, Algorithm 978: Safe scaling in the Level 1 BLAS,ACM Trans. Math. Software44(1) (2017) art. no. 12

work page 2017

[4] [4]

Novakovi´ c, Arithmetical enhancements of the Kogbetliantz method for the SVD of order two,Numer

V. Novakovi´ c, Arithmetical enhancements of the Kogbetliantz method for the SVD of order two,Numer. Algorithms(2025) online athttps://doi.org/10.1007/ s11075-025-02035-7

work page 2025

[5] [5]

Sibidanov, P

A. Sibidanov, P. Zimmermann and S. Glondu, The CORE-MATH project,29 th IEEE Symposium on Computer Arithmetic (ARITH)(2022) 26–34

work page 2022

[6] [6]

C. F. Borges, Algorithm 1014: An improved algorithm for hypot(x,y),ACM Trans. Math. Softw.47(1) (2020) art. no. 9

work page 2020

[7] [7]

Shibata and F

N. Shibata and F. Petrogalli, SLEEF: A portable vectorized library of C standard mathematical functions,IEEE Trans. Parallel Distrib. Syst.31(6) (2020) 1316–1327

work page 2020

[8] [8]

Novakovi´ c, Vectorization of a thread-parallel Jacobi singular value decomposition method,SIAM J

V. Novakovi´ c, Vectorization of a thread-parallel Jacobi singular value decomposition method,SIAM J. Sci. Comput.45(3) (2023) C73–C100

work page 2023

[9] [9]

T. B. Schardl and I.-T. A. Lee, OpenCilk: A modular and extensible software in- frastructure for fast task-parallel code,28 th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming (PPoPP)(2023) 189–203

work page 2023

[10] [10]

Fousse, G

L. Fousse, G. Hanrot, V. Lef` evre, P. P´ elissier and P. Zimmermann, MPFR: A multiple- precision binary floating-point library with correct rounding,ACM Trans. Math. Softw.33(2) (2007) art. no. 13

work page 2007

[11] [11]

org/wp-content/uploads/OpenMP-API-Specification-6-0.pdf, (2024)

OpenMP ARB,OpenMP Application Programming Interface.https://www.openmp. org/wp-content/uploads/OpenMP-API-Specification-6-0.pdf, (2024). fSee the MFBDA project’s web page athttps://web.math.pmf.unizg.hr/mfbda

work page 2024