pith. sign in

arxiv: 2604.26019 · v1 · submitted 2026-04-28 · 💻 cs.DS

Converting an Integer to a Decimal String in Under Two Nanoseconds

Pith reviewed 2026-05-07 14:54 UTC · model grok-4.3

classification 💻 cs.DS
keywords integer to string conversionSIMD optimizationdecimal string generationperformance optimizationmultiply-add instructionsdual-variant algorithminteger formatting
0
0 comments X

The pith

A SIMD algorithm converts integers to decimal strings without lookup tables, running 1.4-2x faster than prior methods.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a SIMD-based method that uses integer multiply-add instructions on recent AMD and Intel processors to convert binary integers to variable-length decimal strings. It computes multiple quotients and remainders in parallel and removes reliance on lookup tables. A dual-variant design dynamically selects between a branch-heavy variant for uniform digit lengths and a branch-light variant for mixed inputs. If correct, this yields consistent speedups of 1.4-2 times over the nearest competitor and 2-4 times over the C++ standard library across all integer sizes.

Core claim

Their single-core algorithm eliminates lookup tables by leveraging integer multiply-add SIMD instructions to compute quotients and remainders in parallel, while a dynamic selector switches between branch-heavy and branch-light variants to match the input distribution; this produces 1.4-2x speed over the closest prior method and 2-4x over std::to_chars on tested workloads.

What carries the argument

SIMD integer multiply-add instructions that compute multiple quotients and remainders in parallel, paired with dynamic selection between branch-heavy and branch-light code paths.

If this is right

  • The method works across the full range of integer sizes without tables.
  • Performance holds for both homogeneous and heterogeneous digit-length inputs via adaptive selection.
  • Single-core throughput exceeds all tested alternatives under the stated hardware conditions.
  • No recursive division or small lookup tables are required.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar parallel quotient techniques could apply to other bases or to floating-point formatting.
  • Library implementers might adopt the dual-variant selector to improve performance on modern CPUs.
  • On older hardware without the required instructions, the gains would disappear and standard methods would be preferable.
  • The approach highlights how hardware multiply-add units can replace software tables for fundamental conversions.

Load-bearing premise

The speed claims assume recent AMD and Intel processors provide the needed integer multiply-add SIMD instructions and that the dynamic variant selector works well for the tested input distributions.

What would settle it

Measure conversion times on a processor lacking the specific multiply-add SIMD instructions and check whether the new method falls below the performance of std::to_chars or the closest competitor.

read the original abstract

Converting binary integers to variable-length decimal strings is a fundamental operation in computing. Conventional fast approaches rely on recursive division and small lookup tables. We propose a SIMD-based algorithm that leverages integer multiply-add instructions available on recent AMD and Intel processors. Our method eliminates lookup tables entirely and computes multiple quotients and remainders in parallel. Additionally, we introduce a dual-variant design with dynamic selection that adapts to input characteristics: a branch-heavy variant optimized for homogeneous digit-length distributions and a branch-light variant for heterogeneous datasets. Our single-core algorithm consistently outperforms all competing methods across the full range of integer sizes, running 1.4-2x faster than the closest competitor and 2-4x faster than the C++ standard library function std::to_chars across tested workloads.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes a SIMD-based algorithm for converting binary integers to variable-length decimal strings. It leverages integer multiply-add instructions on recent AMD and Intel processors to compute multiple quotients and remainders in parallel without lookup tables. The method includes a dual-variant design with dynamic selection between a branch-heavy variant (for homogeneous digit-length inputs) and a branch-light variant (for heterogeneous inputs), claiming consistent single-core speedups of 1.4-2x over the closest competitor and 2-4x over std::to_chars across tested workloads and the full range of integer sizes.

Significance. If the performance claims are substantiated with rigorous benchmarks, the work would be a meaningful practical contribution to high-performance integer formatting, relevant to serialization, logging, and database workloads. The elimination of lookup tables via SIMD parallel arithmetic and the adaptive dual-variant strategy are explicit strengths that could influence optimized library implementations on x86 hardware.

major comments (2)
  1. Abstract: the central claim of 'consistent outperformance ... across the full range of integer sizes' and specific speedup factors (1.4-2x, 2-4x) is load-bearing for the paper's contribution, yet the abstract provides no description of benchmark methodology, input distributions, number of trials, hardware details beyond 'recent AMD and Intel', error handling, or verification that speedups hold without post-hoc selection of workloads.
  2. Abstract (algorithm description): the dynamic selection between branch-heavy and branch-light variants is presented as key to the results, but no concrete selection predicate, decision latency, or validation against misprediction costs on heterogeneous inputs is given; this is the least secure link because the headline numbers require that the selector correctly and cheaply identifies the faster path for the tested distributions.
minor comments (1)
  1. Abstract: the specific SIMD multiply-add instructions (e.g., which variants of VPMADD or equivalent) are not named, which would aid reproducibility even at the high-level description.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive review and positive assessment of the contribution. We address each major comment below, providing clarifications drawn from the manuscript and indicating where revisions will be made to strengthen the presentation.

read point-by-point responses
  1. Referee: Abstract: the central claim of 'consistent outperformance ... across the full range of integer sizes' and specific speedup factors (1.4-2x, 2-4x) is load-bearing for the paper's contribution, yet the abstract provides no description of benchmark methodology, input distributions, number of trials, hardware details beyond 'recent AMD and Intel', error handling, or verification that speedups hold without post-hoc selection of workloads.

    Authors: We agree that the abstract would be strengthened by a concise reference to the experimental context supporting the performance claims. In the revised version we will append a short clause to the abstract noting that the reported speedups were measured on recent Intel and AMD processors across homogeneous and heterogeneous digit-length distributions, averaged over large numbers of trials, and verified against reference implementations with no post-hoc workload selection. Complete benchmark methodology, hardware specifications, error handling, and statistical details remain in the Experiments section. This change directly addresses the concern while respecting abstract length limits. revision: yes

  2. Referee: Abstract (algorithm description): the dynamic selection between branch-heavy and branch-light variants is presented as key to the results, but no concrete selection predicate, decision latency, or validation against misprediction costs on heterogeneous inputs is given; this is the least secure link because the headline numbers require that the selector correctly and cheaply identifies the faster path for the tested distributions.

    Authors: The manuscript already specifies the selection predicate, its decision logic, and measured latency in the algorithm description section. We acknowledge that these particulars are absent from the abstract. We will revise the abstract to include a brief statement that the dual-variant selector uses a low-overhead heuristic based on input characteristics. In addition, we will expand the Experiments section with explicit branch-misprediction measurements and performance results on heterogeneous inputs to confirm that the selector reliably routes to the faster path. These additions will make the adaptive strategy's role in the reported speedups fully transparent. revision: partial

Circularity Check

0 steps flagged

No circularity detected; algorithmic derivation is self-contained

full rationale

The paper introduces a new SIMD algorithm leveraging specific multiply-add instructions and a dual-variant adaptive design for integer-to-decimal conversion. No equations, parameters, or claims reduce by construction to fitted inputs, self-definitions, or load-bearing self-citations. Performance results are presented as direct empirical measurements on hardware, not as predictions derived from the method itself. The derivation chain stands independently without any of the enumerated circular patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim depends on hardware support for specific SIMD instructions and on the input distributions matching the adaptive logic; these are standard domain assumptions rather than new postulates.

axioms (1)
  • domain assumption Recent AMD and Intel processors provide integer multiply-add SIMD instructions usable for parallel division-by-10 operations.
    The algorithm is built directly on these instructions as described in the abstract.

pith-pipeline@v0.9.0 · 5424 in / 1189 out tokens · 66339 ms · 2026-05-07T14:54:13.460181+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

26 extracted references · 4 canonical work pages

  1. [1]

    T ranscoding unicode characters with AVX-512 instructions

    Clausecker R, Lemire D. T ranscoding unicode characters with AVX-512 instructions. Software: Practice and Experience 2023;53(12):2430–2462

  2. [2]

    Base64 encoding and decoding at almost the speed of a memory copy

    Muła W, Lemire D. Base64 encoding and decoding at almost the speed of a memory copy. Software: Practice and Experience 2020;50(2):89–97

  3. [3]

    The C Programming Language

    Kernighan BW, Ritchie DM. The C Programming Language. 2nd ed. Englewood Cliffs, NJ: Prentice-Hall; 1988

  4. [4]

    Lemire D, Counting the digits of 64-bit integers; 2025.https://lemire.me/blog/2025/01/07/counting-the-digits- of-64-bit-integers/, accessed: 14 January 2026. 23

  5. [5]

    Video recording available athttps://archive.org/ details/AndreiAlexandrescu-Three-Optimization-Tips

    Alexandrescu A, Three Optimization Tips for C++ Code; 2012. Video recording available athttps://archive.org/ details/AndreiAlexandrescu-Three-Optimization-Tips. Associated blog post:https://www.facebook.com/notes/ 10158791579037200/Accessed: 14 January 2026

  6. [6]

    Fog A, Instruction tables: Lists of instruction latencies, throughputs and micro-operation breakdowns for Intel, AMD and VIA CPUs; 2025.https://www.agner.org/optimize/instruction_tables.pdf, accessed: 14 January 2026

  7. [7]

    A combinatoric division algorithm for fixed-integer divisors

    Jacobsohn DH. A combinatoric division algorithm for fixed-integer divisors. IEEE T rans on Comput 1973;100(6):608– 610

  8. [8]

    A Fast Division T echnique for Constant Divisors

    Artzy E, Hinds JA, Saal HJ. A Fast Division T echnique for Constant Divisors. Commun ACM 1976 Feb;19(2):98–101

  9. [9]

    Fast Constant Division Routines

    Li SYR. Fast Constant Division Routines. IEEE T rans Comput 1985 Sep;34(9):866–869

  10. [10]

    Integer Multiplication and Division on the HP Precision Architecture

    Magenheimer DJ, Peters L, Pettis K, Zuras D. Integer Multiplication and Division on the HP Precision Architecture. SIGARCH Comput Archit News 1987 Oct;15(5):90–99.http://doi.acm.org/10.1145/36177.36189

  11. [11]

    N-Bit Unsigned Division via N-Bit Multiply-Add

    Robison AD. N-Bit Unsigned Division via N-Bit Multiply-Add. In: Proceedings of the 17th IEEE Symposium on Computer Arithmetic ARITH ’05, Washington, DC, USA: IEEE Computer Society; 2005. p. 131–139

  12. [12]

    Division by Invariant Integers Using Multiplication

    Granlund T, Montgomery PL. Division by Invariant Integers Using Multiplication. SIGPLAN Not 1994 Jun;29(6):61–72

  13. [13]

    Efficient Algorithms for Integer Division by Constants Using Multiplication

    Cavagnino D, Werbrouck AE. Efficient Algorithms for Integer Division by Constants Using Multiplication. Comput J 2008 Jul;51(4):470–480

  14. [14]

    Hacker’s Delight

    Warren HS Jr. Hacker’s Delight. 2nd ed. Boston: Addison-Wesley; 2013

  15. [15]

    Integer division by constants: optimal bounds

    Lemire D, Bartlett C, Kaser O. Integer division by constants: optimal bounds. Heliyon 2021;7(6)

  16. [16]

    SourceForge; 2007.https://sourceforge.net/projects/ itoa/, GPL-2.0 licensed project for fast integer-to-string conversion, Accessed: 14 January 2026

    Chateauneu R, Good old Integer T o ASCII conversion (itoa). SourceForge; 2007.https://sourceforge.net/projects/ itoa/, GPL-2.0 licensed project for fast integer-to-string conversion, Accessed: 14 January 2026

  17. [17]

    optimized itoa function

    Henriksen IE, Answer to “optimized itoa function”; 2013.https://stackoverflow.com/a/15461080/5183410, answer embeds code that attributes the optimization to Inge Eivind Henriksen (2013) and the original itoa to James Clark/groff. Accessed: 14 January 2026

  18. [18]

    GitHub repository

    fmtlib contributors, format-benchmark:u2985907implementation; 2020.https://github.com/fmtlib/format- benchmark/blob/master/src/u2985907.h, accessed: 14 January 2026. GitHub repository

  19. [19]

    C++ performance challenge: integer to std::string conversion

    Hopman C, Answer to “C++ performance challenge: integer to std::string conversion”; 2010.https://stackoverflow. com/a/4364057, StackOverflow answer describing the ‘hopman_fun’ (arithmetic) and ‘hopman_fast’ (104 lookup) variants

  20. [20]

    Message posted tocomp.lang.asm.x86news- group

    Mathisen T, Fast integer to ASCII conversion code; 1999.https://groups.google.com/g/comp.lang.asm.x86/c/ rBzgsQdkq8c/m/JCv8EkPgCS4J, usenet post, Accessed: 14 January 2026. Message posted tocomp.lang.asm.x86news- group

  21. [21]

    Message posted tocomp.lang.asm.x86newsgroup

    Mathisen T, Converting numbers to text; 2015.https://comp.lang.asm.x86.narkive.com/hIuZMYCO/converting- numbers-to-text#post2, usenet post, Accessed: 14 January 2026. Message posted tocomp.lang.asm.x86newsgroup

  22. [22]

    Khuong P , How to print integers really fast (with Open Source AppNexus code!); 2017.https://pvk.ca/Blog/2017/12/ 22/appnexus-common-framework-its-out-also-how-to-print-integers-faster/, accessed: 14 January 2026

  23. [23]

    html, accessed: 14 January 2026

    Muła W, SSE: conversion integers to decimal representation; 2011.http://0x80.pl/notesen/2011-10-21-sse-itoa. html, accessed: 14 January 2026

  24. [24]

    optimized itoa function

    Usericecreamsword, Answer to “optimized itoa function”; 2015.https://stackoverflow.com/a/32818030/5183410, StackOverflow answer describing a SIMD variant of Mathisen’s algorithm, Accessed: 14 January 2026. 24

  25. [25]

    Ry ¯u: Fast Float-to-String Conversion

    Adams U. Ry ¯u: Fast Float-to-String Conversion. In: Proceedings of the 39th ACM SIGPLAN Conference on Programming Language Design and Implementation PLDI 2018, New Y ork, NY, USA: Association for Computing Machinery; 2018. p. 270–282

  26. [26]

    Faster remainder by direct computation: Applications to compilers and software libraries

    Lemire D, Kaser O, Kurz N. Faster remainder by direct computation: Applications to compilers and software libraries. Software: Practice and Experience 2019;49(6):953–970. 25 A|LIST OF ACRONYMS Acronym Full form ASCII American Standard Code for Information Interchange AVX Advanced Vector Extensions AVX-512 Advanced Vector Extensions 512-bit AVX-512BW AVX-5...