Optimization of 32-bit Unsigned Division by Constants on 64-bit Targets
Pith reviewed 2026-05-10 18:05 UTC · model grok-4.3
The pith
A revised method for replacing 32-bit unsigned division by constants with multiplication and shifts runs faster on 64-bit processors.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that an adjusted algorithm for choosing the magic multiplier and shift counts for 32-bit unsigned division by a constant produces shorter or lower-latency instruction sequences on 64-bit targets than the sequences produced by prior methods tuned for 32-bit targets, while preserving exact mathematical equivalence for all inputs.
What carries the argument
A revised selection rule for the magic multiplier and the post-multiplication shift-and-adjust steps that incorporates 64-bit multiplication to reduce total operation cost.
If this is right
- Compilers targeting 64-bit CPUs can emit lower-cost code for any source-level division by a constant.
- Workloads that perform frequent constant divisions will finish in less time without altering program semantics.
- The change preserves identical numerical results to true division for all inputs and all divisors.
- The technique can be added to existing compiler backends for 64-bit instruction sets.
Where Pith is reading between the lines
- The same width-exploitation idea could be applied to constant modulo or to signed division to obtain similar gains.
- On architectures that support even wider registers the approach might be generalized to 64-bit operands using 128-bit intermediates.
- The performance benefit will be largest in code that repeatedly divides by the same small set of constants.
Load-bearing premise
That the measured speedups in a controlled microbenchmark will appear in real compiled programs and that the new instruction sequences compute the correct division result for every possible 32-bit input.
What would settle it
Comparing the output of the optimized code against a reference division implementation for every value from 0 to 2^32-1 across a set of constant divisors, or timing representative full applications before and after the compiler change.
read the original abstract
Granlund and Montgomery proposed an optimization method for unsigned integer division by constants [3]. Their method (called the GM method in this paper) was further improved in part by works such as [1] and [7], and is now adopted by major compilers including GCC, Clang, Microsoft Compiler, and Apple Clang. However, for example, for x/7, the generated code is designed for 32-bit CPUs and therefore does not fully exploit 64-bit capabilities. This paper proposes an optimization method for 32-bit unsigned division by constants targeting 64-bit CPUs. We implemented patches for LLVM/GCC and achieved speedups of 1.67x on Intel Xeon w9-3495X (Sapphire Rapids) and 1.98x on Apple M4 (Apple M-series SoC) in the microbenchmark described later. The LLVM patch has already been merged into llvm:main [6], demonstrating the practical applicability of the proposed method.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes an optimization method for 32-bit unsigned division by constants targeting 64-bit CPUs, extending the Granlund-Montgomery approach to better exploit 64-bit registers and instructions. The authors implement patches for LLVM and GCC, report speedups of 1.67x on Intel Xeon w9-3495X and 1.98x on Apple M4 in a described microbenchmark, and note that the LLVM change has been merged upstream.
Significance. If the method is proven correct for all inputs and the reported speedups generalize beyond the microbenchmark, the work could improve code generation for division in compilers on modern 64-bit hardware. The concrete compiler patches and upstream LLVM merge are notable strengths demonstrating practical applicability.
major comments (2)
- [Evaluation] The microbenchmark construction and divisor selection criteria (described in the evaluation section) are not shown to exercise the same instruction sequences and latency bottlenecks that occur in hot loops of real programs, nor to use a representative set of divisors that survive constant propagation; this undermines the claim that the 64-bit method is practically superior to GM.
- [Method] The claim that the method produces identical results to true division for every input and every constant divisor is load-bearing for the contribution but lacks an explicit proof, exhaustive test cases, or formal argument in the manuscript.
minor comments (1)
- [Abstract] The abstract refers to 'the microbenchmark described later' without a one-sentence summary of its structure, which would aid readers.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. The upstream LLVM merge provides strong evidence of the method's practicality. We respond point-by-point to the major comments below and outline revisions to address the concerns.
read point-by-point responses
-
Referee: [Evaluation] The microbenchmark construction and divisor selection criteria (described in the evaluation section) are not shown to exercise the same instruction sequences and latency bottlenecks that occur in hot loops of real programs, nor to use a representative set of divisors that survive constant propagation; this undermines the claim that the 64-bit method is practically superior to GM.
Authors: The microbenchmark isolates the exact instruction sequences (64-bit mul, shifts, and adds) produced by the optimizer for 32-bit unsigned division, which are the primary latency contributors in hot loops after constant propagation. The selected divisors (e.g., 3, 5, 7, 9, 10, 11, 13, ...) are those for which the optimization applies and that commonly survive constant folding in real code. While broader whole-program benchmarks would strengthen the case, the measured speedups (1.67x on Xeon, 1.98x on M4) and the fact that the LLVM patch was reviewed and merged upstream already indicate practical superiority over the 32-bit GM sequences. In the revision we will add a paragraph in the evaluation section justifying the divisor selection with reference to typical compiler behavior and common constants. revision: partial
-
Referee: [Method] The claim that the method produces identical results to true division for every input and every constant divisor is load-bearing for the contribution but lacks an explicit proof, exhaustive test cases, or formal argument in the manuscript.
Authors: We agree an explicit argument is warranted. The method extends the Granlund-Montgomery construction by computing a 64-bit magic multiplier M and shift s such that the 32-bit quotient is obtained via a single 64-bit multiply followed by a right shift; the defining inequalities of the original GM proof (M * d <= 2^32 + 2^{32-s} and related bounds) continue to hold when the intermediate computations use 64-bit registers. We will insert a short formal subsection deriving these conditions from the 32-bit case and stating that the 64-bit operations preserve them. We will also report the results of our exhaustive verification (all 2^32 inputs for a representative set of divisors) and note that the LLVM patch underwent community review including correctness checks before merging. revision: yes
Circularity Check
No significant circularity; algorithmic adaptation is independently implemented and benchmarked
full rationale
The paper describes an adaptation of the Granlund-Montgomery method for 64-bit targets, provides an implementation via LLVM/GCC patches, and reports measured speedups from a microbenchmark. No equations, fitted parameters, or predictions are present that reduce by construction to the same inputs or to self-citations. The cited prior works [3,1,7] are external, and the LLVM merge citation [6] serves only as evidence of acceptance rather than a load-bearing premise for the performance claims. The derivation chain is therefore self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
D. Cavagnino and A. E. Werbrouck. Efficient algorithms for integer division by constants using multipli- cation.The Computer Journal, 51(4):470–480, 2008
work page 2008
-
[2]
A. Fog. Instruction tables, 9 2025.https://www.agner.org/optimize/instruction_tables.pdf
work page 2025
-
[3]
T. Granlund and P. Montgomery. Division by invariant integers using multiplication.ACM SIGPLAN Notices, 29(6):61–72, 1994
work page 1994
- [4]
-
[5]
M. Shigeo. optimize-udiv32-on-64bit, 2026. https://github.com/herumi/gcc/commits/optimize-udiv32-on-64bit/. 6
work page 2026
-
[6]
M. Shigeo. [SelectionDAG] Optimize 32-bit udiv with 33-bit magic constants on 64-bit targets, 2026. https://github.com/llvm/llvm-project/pull/181288
work page 2026
-
[7]
H. S. Warren, Jr.Hacker’s Delight, Second Edition. Pearson Education, 2013. 7
work page 2013
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.