Optimization of 32-bit Unsigned Division by Constants on 64-bit Targets

Shigeo Mitsunari; Takashi Hoshino

arxiv: 2604.07902 · v1 · submitted 2026-04-09 · 💻 cs.PL · cs.AR

Optimization of 32-bit Unsigned Division by Constants on 64-bit Targets

Shigeo Mitsunari , Takashi Hoshino This is my paper

Pith reviewed 2026-05-10 18:05 UTC · model grok-4.3

classification 💻 cs.PL cs.AR

keywords unsigned divisionconstant divisioncompiler optimization64-bit architecturemultiplicative inversecode generationperformance measurement

0 comments

The pith

A revised method for replacing 32-bit unsigned division by constants with multiplication and shifts runs faster on 64-bit processors.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that the common technique for turning constant division into a short sequence of multiplies and shifts, when applied to 32-bit numbers, produces code that underuses the 64-bit registers present on modern CPUs. By recalculating the multiplier and the accompanying shifts to operate on the full width of those registers, the generated instructions take fewer cycles while still returning the exact quotient for every input. This matters because constant division occurs in many inner loops of compiled programs, so any reduction in its cost directly lowers overall runtime on current hardware. The authors demonstrate the change by modifying compiler code generation and timing the results on two different 64-bit processor families.

Core claim

The central claim is that an adjusted algorithm for choosing the magic multiplier and shift counts for 32-bit unsigned division by a constant produces shorter or lower-latency instruction sequences on 64-bit targets than the sequences produced by prior methods tuned for 32-bit targets, while preserving exact mathematical equivalence for all inputs.

What carries the argument

A revised selection rule for the magic multiplier and the post-multiplication shift-and-adjust steps that incorporates 64-bit multiplication to reduce total operation cost.

If this is right

Compilers targeting 64-bit CPUs can emit lower-cost code for any source-level division by a constant.
Workloads that perform frequent constant divisions will finish in less time without altering program semantics.
The change preserves identical numerical results to true division for all inputs and all divisors.
The technique can be added to existing compiler backends for 64-bit instruction sets.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same width-exploitation idea could be applied to constant modulo or to signed division to obtain similar gains.
On architectures that support even wider registers the approach might be generalized to 64-bit operands using 128-bit intermediates.
The performance benefit will be largest in code that repeatedly divides by the same small set of constants.

Load-bearing premise

That the measured speedups in a controlled microbenchmark will appear in real compiled programs and that the new instruction sequences compute the correct division result for every possible 32-bit input.

What would settle it

Comparing the output of the optimized code against a reference division implementation for every value from 0 to 2^32-1 across a set of constant divisors, or timing representative full applications before and after the compiler change.

read the original abstract

Granlund and Montgomery proposed an optimization method for unsigned integer division by constants [3]. Their method (called the GM method in this paper) was further improved in part by works such as [1] and [7], and is now adopted by major compilers including GCC, Clang, Microsoft Compiler, and Apple Clang. However, for example, for x/7, the generated code is designed for 32-bit CPUs and therefore does not fully exploit 64-bit capabilities. This paper proposes an optimization method for 32-bit unsigned division by constants targeting 64-bit CPUs. We implemented patches for LLVM/GCC and achieved speedups of 1.67x on Intel Xeon w9-3495X (Sapphire Rapids) and 1.98x on Apple M4 (Apple M-series SoC) in the microbenchmark described later. The LLVM patch has already been merged into llvm:main [6], demonstrating the practical applicability of the proposed method.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes an optimization method for 32-bit unsigned division by constants targeting 64-bit CPUs, extending the Granlund-Montgomery approach to better exploit 64-bit registers and instructions. The authors implement patches for LLVM and GCC, report speedups of 1.67x on Intel Xeon w9-3495X and 1.98x on Apple M4 in a described microbenchmark, and note that the LLVM change has been merged upstream.

Significance. If the method is proven correct for all inputs and the reported speedups generalize beyond the microbenchmark, the work could improve code generation for division in compilers on modern 64-bit hardware. The concrete compiler patches and upstream LLVM merge are notable strengths demonstrating practical applicability.

major comments (2)

[Evaluation] The microbenchmark construction and divisor selection criteria (described in the evaluation section) are not shown to exercise the same instruction sequences and latency bottlenecks that occur in hot loops of real programs, nor to use a representative set of divisors that survive constant propagation; this undermines the claim that the 64-bit method is practically superior to GM.
[Method] The claim that the method produces identical results to true division for every input and every constant divisor is load-bearing for the contribution but lacks an explicit proof, exhaustive test cases, or formal argument in the manuscript.

minor comments (1)

[Abstract] The abstract refers to 'the microbenchmark described later' without a one-sentence summary of its structure, which would aid readers.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. The upstream LLVM merge provides strong evidence of the method's practicality. We respond point-by-point to the major comments below and outline revisions to address the concerns.

read point-by-point responses

Referee: [Evaluation] The microbenchmark construction and divisor selection criteria (described in the evaluation section) are not shown to exercise the same instruction sequences and latency bottlenecks that occur in hot loops of real programs, nor to use a representative set of divisors that survive constant propagation; this undermines the claim that the 64-bit method is practically superior to GM.

Authors: The microbenchmark isolates the exact instruction sequences (64-bit mul, shifts, and adds) produced by the optimizer for 32-bit unsigned division, which are the primary latency contributors in hot loops after constant propagation. The selected divisors (e.g., 3, 5, 7, 9, 10, 11, 13, ...) are those for which the optimization applies and that commonly survive constant folding in real code. While broader whole-program benchmarks would strengthen the case, the measured speedups (1.67x on Xeon, 1.98x on M4) and the fact that the LLVM patch was reviewed and merged upstream already indicate practical superiority over the 32-bit GM sequences. In the revision we will add a paragraph in the evaluation section justifying the divisor selection with reference to typical compiler behavior and common constants. revision: partial
Referee: [Method] The claim that the method produces identical results to true division for every input and every constant divisor is load-bearing for the contribution but lacks an explicit proof, exhaustive test cases, or formal argument in the manuscript.

Authors: We agree an explicit argument is warranted. The method extends the Granlund-Montgomery construction by computing a 64-bit magic multiplier M and shift s such that the 32-bit quotient is obtained via a single 64-bit multiply followed by a right shift; the defining inequalities of the original GM proof (M * d <= 2^32 + 2^{32-s} and related bounds) continue to hold when the intermediate computations use 64-bit registers. We will insert a short formal subsection deriving these conditions from the 32-bit case and stating that the 64-bit operations preserve them. We will also report the results of our exhaustive verification (all 2^32 inputs for a representative set of divisors) and note that the LLVM patch underwent community review including correctness checks before merging. revision: yes

Circularity Check

0 steps flagged

No significant circularity; algorithmic adaptation is independently implemented and benchmarked

full rationale

The paper describes an adaptation of the Granlund-Montgomery method for 64-bit targets, provides an implementation via LLVM/GCC patches, and reports measured speedups from a microbenchmark. No equations, fitted parameters, or predictions are present that reduce by construction to the same inputs or to self-citations. The cited prior works [3,1,7] are external, and the LLVM merge citation [6] serves only as evidence of acceptance rather than a load-bearing premise for the performance claims. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper is an engineering optimization with no mathematical axioms, free parameters, or invented entities; it relies on standard compiler techniques and hardware assumptions.

pith-pipeline@v0.9.0 · 5466 in / 1020 out tokens · 51086 ms · 2026-05-10T18:05:41.211039+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

7 extracted references · 7 canonical work pages

[1]

Cavagnino and A

D. Cavagnino and A. E. Werbrouck. Efficient algorithms for integer division by constants using multipli- cation.The Computer Journal, 51(4):470–480, 2008

work page 2008
[2]

A. Fog. Instruction tables, 9 2025.https://www.agner.org/optimize/instruction_tables.pdf

work page 2025
[3]

Granlund and P

T. Granlund and P. Montgomery. Division by invariant integers using multiplication.ACM SIGPLAN Notices, 29(6):61–72, 1994

work page 1994
[4]

Lemire, C

D. Lemire, C. Bartlett, and O. Kaser. Integer division by constants: Optimal bounds.CoRR, abs/2012.12369, 2020

work page arXiv 2012
[5]

M. Shigeo. optimize-udiv32-on-64bit, 2026. https://github.com/herumi/gcc/commits/optimize-udiv32-on-64bit/. 6

work page 2026
[6]

M. Shigeo. [SelectionDAG] Optimize 32-bit udiv with 33-bit magic constants on 64-bit targets, 2026. https://github.com/llvm/llvm-project/pull/181288

work page 2026
[7]

H. S. Warren, Jr.Hacker’s Delight, Second Edition. Pearson Education, 2013. 7

work page 2013

[1] [1]

Cavagnino and A

D. Cavagnino and A. E. Werbrouck. Efficient algorithms for integer division by constants using multipli- cation.The Computer Journal, 51(4):470–480, 2008

work page 2008

[2] [2]

A. Fog. Instruction tables, 9 2025.https://www.agner.org/optimize/instruction_tables.pdf

work page 2025

[3] [3]

Granlund and P

T. Granlund and P. Montgomery. Division by invariant integers using multiplication.ACM SIGPLAN Notices, 29(6):61–72, 1994

work page 1994

[4] [4]

Lemire, C

D. Lemire, C. Bartlett, and O. Kaser. Integer division by constants: Optimal bounds.CoRR, abs/2012.12369, 2020

work page arXiv 2012

[5] [5]

M. Shigeo. optimize-udiv32-on-64bit, 2026. https://github.com/herumi/gcc/commits/optimize-udiv32-on-64bit/. 6

work page 2026

[6] [6]

M. Shigeo. [SelectionDAG] Optimize 32-bit udiv with 33-bit magic constants on 64-bit targets, 2026. https://github.com/llvm/llvm-project/pull/181288

work page 2026

[7] [7]

H. S. Warren, Jr.Hacker’s Delight, Second Edition. Pearson Education, 2013. 7

work page 2013