Improved Scaling for Fast Mode of Ozaki Scheme II

Daisuke Takahashi; Shota Kawakami

arXiv: 2606.29129 · v1 · pith:NST57IJ2new · submitted 2026-06-28 · 💻 cs.MS · cs.DC · cs.NA · math.NA


Pith reviewed 2026-06-30 02:34 UTC · model grok-4.3


keywords: Ozaki scheme · matrix multiplication · Chinese remainder theorem · scaling formula · fast mode · high-precision emulation · scale invariance · Cauchy-Schwarz inequality


The pith

A revised scaling formula derived via Cauchy-Schwarz from the CRT uniqueness condition makes Ozaki Scheme II fast mode scale-invariant by construction.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Ozaki scheme II emulates high-precision matrix multiplication by converting inputs to integer matrices via scaling and then applying the Chinese remainder theorem to low-precision operations. The original fast-mode scaling formula, which relies on the Cauchy-Schwarz inequality, lacks scale invariance: scaling the input matrices by any constant alters the effective bit width of the resulting integers and can cause accuracy loss or outright CRT recovery failure. The paper derives a new scaling formula directly from the CRT uniqueness condition that remains unchanged under input scaling and always satisfies the uniqueness requirement. This revision adds no extra arithmetic or checks compared with the original fast mode. On an NVIDIA GH200 GPU the new fast mode reaches accuracy levels comparable to the slower accurate mode while preserving the original fast-mode throughput.
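
The pipeline described above (scale to integers, multiply modulo several small moduli, recombine with the CRT) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the four moduli, the name `crt_matmul`, and the use of NumPy `int64` arithmetic on a CPU are assumptions made here, and the inputs are small integer matrices standing in for already-scaled data.

```python
import numpy as np
from math import prod

# Hypothetical pairwise-coprime moduli (251 is prime, 253 = 11*23,
# 255 = 3*5*17, 256 = 2**8). They are illustrative only; the moduli
# used in the paper are not listed in this summary.
MODULI = (251, 253, 255, 256)
P = prod(MODULI)  # product of all moduli


def crt_matmul(A_int, B_int, moduli=MODULI):
    """Recover A_int @ B_int from per-modulus residues.

    The result is exact only if every entry of the true product lies
    strictly inside (-P/2, P/2); otherwise the value wraps around.
    """
    A_int = np.asarray(A_int, dtype=np.int64)
    B_int = np.asarray(B_int, dtype=np.int64)
    P = prod(moduli)
    acc = np.zeros((A_int.shape[0], B_int.shape[1]), dtype=np.int64)
    for p in moduli:
        # Low-precision stage: residues lie in [0, p), so products stay small.
        Cp = (np.mod(A_int, p) @ np.mod(B_int, p)) % p
        Mp = P // p
        weight = (Mp * pow(Mp, -1, p)) % P  # CRT basis element for modulus p
        acc = (acc + Cp * weight) % P       # fits in int64 for this toy modulus set
    # Map [0, P) to the symmetric range so negative entries are recovered.
    return np.where(acc > P // 2, acc - P, acc)


rng = np.random.default_rng(0)
A = rng.integers(-1000, 1000, size=(4, 6))
B = rng.integers(-1000, 1000, size=(6, 3))
ok = np.array_equal(crt_matmul(A, B), A @ B)  # entries are far below P/2

big = np.array([[50_000]])
wrapped = crt_matmul(big, big)[0, 0]          # true product 2.5e9 exceeds P/2
print(ok, wrapped == 2_500_000_000)
```

The second case shows what a CRT recovery failure looks like in this toy setting: once an entry of the integer product leaves the symmetric range around zero, the recombined value is off by a multiple of P. Keeping the scaled integers small enough to avoid this is the job of the scaling step.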

Core claim

What carries the argument

The revised scaling formula obtained by applying the Cauchy-Schwarz inequality directly to the CRT uniqueness condition, ensuring scale invariance and unconditional satisfaction of the uniqueness bound.
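
The mechanism can be written out in generic form. The following is a sketch under stated assumptions rather than the paper's explicit formula, which this summary does not reproduce: A' and B' denote the integer matrices after scaling, a'_i is row i of A', b'_j is column j of B', P is the product of the moduli, and the power-of-two exponent e_i and the target T are hypothetical names introduced here.

```latex
\begin{align*}
\bigl|(A'B')_{ij}\bigr|
  &= \Bigl|\sum_{h=1}^{k} a'_{ih}\, b'_{hj}\Bigr|
   \le \|a'_i\|_2 \, \|b'_j\|_2
   && \text{(Cauchy--Schwarz)} \\
\|a'_i\|_2 \le T,\quad \|b'_j\|_2 \le T,\quad T = \sqrt{P/2}
  &\;\Longrightarrow\; \bigl|(A'B')_{ij}\bigr| \le P/2
   && \text{(threshold in Figure 1)} \\
a'_i = \operatorname{trunc}\bigl(2^{e_i} a_i\bigr),\quad
e_i = \Bigl\lfloor \log_2 \frac{T}{\|a_i\|_2} \Bigr\rfloor
  &\;\Longrightarrow\; \|a'_i\|_2 \le 2^{e_i}\,\|a_i\|_2 \le T \\
a_i \mapsto 2^{s} a_i \ (s \in \mathbb{Z})
  &\;\Longrightarrow\; e_i \mapsto e_i - s,\quad a'_i \text{ unchanged}
\end{align*}
```

The last line is the scale-invariance property in this toy construction: a power-of-two rescaling of the input is absorbed by the exponent, so the integer matrix and its effective bit width stay the same.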

If this is right

Fast mode now achieves accuracy comparable to accurate mode while retaining its original throughput on GPU hardware.
The accuracy-throughput trade-off of Ozaki scheme II improves because the method removes the accuracy limitation of fast mode without incurring the throughput cost of accurate mode.
No runtime adjustments or extra matrix multiplications are required when input matrices have different magnitude ranges.
The scheme can be applied directly to arbitrary input matrices without prior normalization steps that would otherwise be needed to avoid CRT failure.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same scale-invariance defect may exist in other CRT-based high-precision emulation techniques that rely on Cauchy-Schwarz bounds, suggesting analogous fixes could be derived for them.
Because the formula is parameter-free and derived from the uniqueness condition alone, it may extend without modification to higher-precision integer formats or to matrix multiplications performed on different low-precision hardware.
Applications that feed matrices of unknown or varying dynamic range into Ozaki multiplication can now rely on the fast path without separate magnitude analysis.

Load-bearing premise

The derivation via Cauchy-Schwarz inequality applied to the CRT uniqueness condition produces a scaling formula that remains valid and sufficient for all possible input matrix values and scalings without requiring additional runtime checks or adjustments.

What would settle it

Execute the revised fast-mode scaling on the same matrix pair multiplied by constants of widely different magnitudes (for example 1, 10, 100, and 1000) and confirm that the recovered product accuracy stays comparable to accurate mode and that CRT recovery never fails, whereas the original formula produces failures or accuracy loss for some of those scalings.
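
A scaled-down version of such a sweep can be sketched as below. Everything in it is an assumption made for illustration: `invariant_scale` is a generic norm-driven power-of-two scaling, not the paper's revised formula; `naive_scale` is a fixed-exponent contrast case, not the original fast-mode formula; and P is the product of four hypothetical small moduli rather than the N = 20 moduli used in the figures. The inputs are all-ones matrices multiplied by α = 2^s, as in Figure 1.

```python
import numpy as np

# Hypothetical modulus product (251 * 253 * 255 * 256), not the paper's moduli.
P = 251 * 253 * 255 * 256


def invariant_scale(A, B, P):
    """Illustrative power-of-two scaling driven by row/column 2-norms.

    Assumption (not the paper's formula): choose exponents so that each
    scaled row of A and each scaled column of B has 2-norm at most
    sqrt(P/2). By Cauchy-Schwarz, every entry of the integer product is
    then at most P/2 in magnitude. Rows and columns must be nonzero.
    """
    T = np.sqrt(P / 2.0)
    ea = np.floor(np.log2(T / np.linalg.norm(A, axis=1)))  # per-row exponents
    eb = np.floor(np.log2(T / np.linalg.norm(B, axis=0)))  # per-column exponents
    # Truncation toward zero never increases magnitudes, so the bound survives.
    A_int = np.trunc(A * np.exp2(ea)[:, None]).astype(np.int64)
    B_int = np.trunc(B * np.exp2(eb)[None, :]).astype(np.int64)
    return A_int, B_int


def naive_scale(A, B, shift=12):
    """Contrast case: one fixed exponent that ignores the input magnitude."""
    A_int = np.trunc(A * 2.0**shift).astype(np.int64)
    B_int = np.trunc(B * 2.0**shift).astype(np.int64)
    return A_int, B_int


k = 64
ones = np.ones((k, k))
rows = []
for s in (-13, 0, 1, 10):
    alpha = 2.0**s
    Ai, Bi = invariant_scale(alpha * ones, alpha * ones, P)
    An, Bn = naive_scale(alpha * ones, alpha * ones)
    inv_max = int(np.abs(Ai @ Bi).max())
    naive_max = int(np.abs(An @ Bn).max())
    rows.append((s, inv_max, naive_max))
    print(f"s={s:4d}  invariant max={inv_max}  naive max={naive_max}  P/2={P // 2}")
```

In this toy setting the norm-driven exponents produce the same integer matrices for every s, so the largest product entry stays below P/2 throughout. The fixed exponent truncates the inputs to zero at s = -13 and exceeds P/2 from s = 1 onward, which mirrors the two failure modes described for a non-invariant formula (accuracy loss and CRT recovery failure). It does not reproduce the paper's measurements.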

Figures

Figures reproduced from arXiv: 2606.29129 by Daisuke Takahashi, Shota Kawakami.

**Figure 1.** (A′B′)_ij in fast mode versus scalar α = 2^s, where A = αÂ and B = αB̂ with â_ij = b̂_ij = 1 (all-ones matrices), and N = 20 moduli. The horizontal black line marks the CRT recovery threshold P/2. The dark-shaded region indicates where the CRT uniqueness condition is violated ((A′B′)_ij > P/2), and the light-shaded region indicates where it is satisfied. Thin and thick lines correspond to k = 1024 and …

**Figure 2.** Maximum relative error of DGEMM with respect to double-double precision versus scalar α = 2^s for cuBLAS, OS II-fast, and OS II-accu, where A and B are the input matrices given by A = αÂ and B = αB̂, with Â and B̂ random matrices generated via (15). Each column corresponds to a different value of ϕ controlling the spread of element magnitudes. Matrix dimensions are m = n = k ∈ {1024, 16384} and N = 20 mod…

**Figure 3.** Maximum relative error of DGEMM with respect to double-double precision versus scalar α = 2^s for cuBLAS, OS II-fast, and OS II-accu, where A and B are the input matrices given by A = αÂ and B = αB̂, with â_ij = b̂_ij = 1 (all-ones matrices), m = n = k ∈ {1024, 16384}, and N = 20 moduli. cuBLAS achieves a maximum relative error of exactly 0 in all cases. maximum relative error reaches approximately 1 in the…

**Figure 4.** Maximum relative error with respect to double-double precision of DGEMM (top) and SGEMM (bottom) versus number of moduli N for cuBLAS, OS II-fast, OS II-accu, and OS II-prop, using random matrices with m = n = k and varying ϕ. Thin and thick lines correspond to m = n = k = 1024 and m = n = k = 16384, respectively. ϕ ∈ {3, 4}, the maximum relative error of OS II-fast exceeds that of cuBLAS. In contrast, OS …

**Figure 5.** Throughput (TFLOPS) of DGEMM for cuBLAS, OS II-fast, OS II-accu, and OS II-prop versus number of moduli N for random matrices with ϕ = 0.5 (top) and ϕ = 4 (bottom), for varying m = n = k.

**Figure 6.** Throughput (TFLOPS) of SGEMM for cuBLAS, OS II-fast, OS II-accu, and OS II-prop versus number of moduli N for random matrices with ϕ = 0.5 (top) and ϕ = 1.5 (bottom), for varying m = n = k. represented after scaling; consequently, the zero fractions of the two methods become comparable. With a small number of moduli, however, the representable range is narrower for OS II-fast due to its smaller scaling fac…

**Figure 7.** Accuracy–throughput trade-off for DGEMM (top, ϕ ∈ {0.5, 1, 2, 3, 4}) and SGEMM (bottom, ϕ ∈ {0, 0.5, 1, 1.5}) at m = n = k = 16384. The horizontal axis shows throughput in TFLOPS and the vertical axis shows maximum relative error with respect to double-double precision. Each curve is traced from right to left as the number of moduli N increases. The black dot marks the throughput and accuracy of cuBLAS; th…

Original abstract

Ozaki scheme II emulates high-precision matrix multiplication using low-precision integer matrix operations based on the Chinese remainder theorem (CRT). It first scales the high-precision matrices to convert them into integer matrices. For this scaling step, Ozaki scheme II provides two modes: accurate mode, which uses INT8 matrix multiplication to estimate scaling factors, and fast mode, which applies the Cauchy-Schwarz inequality at lower computational cost. We show that the existing formula lacks scale invariance; multiplying the input matrices by a constant changes the effective bit width of the integer matrices in the scaling step, causing accuracy degradation or CRT recovery failure. To address this, we propose a revised scaling formula derived from the CRT uniqueness condition via the Cauchy-Schwarz inequality. The proposed formula is scale-invariant by construction, guarantees that the CRT uniqueness condition is always satisfied, and introduces no additional overhead over the original fast mode. Experiments on an NVIDIA GH200 GPU show that the proposed method achieves accuracy comparable to that of accurate mode while maintaining throughput comparable to that of fast mode. In the accuracy-throughput trade-off, the proposed method overcomes the accuracy limitation of fast mode and the throughput constraint of accurate mode, offering a superior accuracy and performance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper fixes a scale-invariance defect in Ozaki scheme II fast mode by deriving a new scaling formula from the CRT uniqueness condition via the Cauchy-Schwarz inequality; the formula adds no overhead and, in GPU experiments, reaches accuracy comparable to accurate mode.


The main point is that the existing fast-mode scaling in Ozaki scheme II is not scale-invariant: scaling the input matrices changes the effective bit width and can break the CRT uniqueness bound or hurt accuracy. The authors derive a revised formula directly from the CRT uniqueness condition via Cauchy-Schwarz, which by construction stays invariant and keeps the summed products inside the uniqueness limit for any scaling.

This is new. The defect itself had not been reported in the prior literature they cite, and the derivation avoids fitted parameters or extra runtime checks. The experiments on an NVIDIA GH200 show accuracy matching the accurate mode while throughput stays comparable to the original fast mode, which is the practical payoff.

The work is clean on its own terms. The argument is internally consistent, the inequality application is straightforward, and no hidden assumptions about input ranges appear to undermine the guarantee. The only minor softness is that the experiments are confined to one GPU architecture; broader hardware coverage would strengthen the throughput claim but is not required for the core result.

This paper is for specialists in numerical linear algebra who emulate higher precision on low-precision hardware, especially those already using or extending Ozaki schemes. A reader who needs the accuracy-throughput trade-off improved will get direct value from the formula and the numbers.

It deserves a serious referee. The fix is targeted, the math is reproducible from the given steps, and the experiments back the claim without circularity.

Referee Report

2 major / 1 minor

Summary. The manuscript identifies that the existing fast-mode scaling formula in Ozaki scheme II lacks scale invariance: multiplying input matrices by a constant alters the effective bit width of the resulting integer matrices, which can degrade accuracy or cause CRT recovery failure. It proposes a revised scaling formula obtained by applying the Cauchy-Schwarz inequality directly to the CRT uniqueness condition. The new formula is asserted to be scale-invariant by construction, to guarantee that the CRT uniqueness condition is always satisfied for any input scaling, and to incur no additional overhead. GPU experiments on an NVIDIA GH200 are reported to show accuracy comparable to accurate mode while preserving throughput comparable to fast mode, thereby improving the accuracy-throughput trade-off.

Significance. If the derivation is complete and the experimental claims hold without post-hoc adjustments, the result would strengthen the practical applicability of Ozaki scheme II for high-precision matrix multiplication on GPUs. The scale-invariant guarantee addresses a concrete robustness issue in the fast mode, and the absence of extra operations preserves the performance advantage. Reproducible GPU timing and accuracy data would constitute a useful contribution to mathematical software for mixed-precision linear algebra.

major comments (2)

[Abstract / proposed formula] Abstract and description of the proposed formula: the manuscript states that the revised formula is derived from the CRT uniqueness condition via the Cauchy-Schwarz inequality and is scale-invariant by construction, yet the full algebraic steps that produce the explicit formula from the uniqueness bound are not supplied. Without these steps it is impossible to confirm that the inequality application yields a bound that remains valid and sufficient for arbitrary matrix entries and scalings, which is the load-bearing claim.
[Experiments] Experiments section: the manuscript reports accuracy comparable to accurate mode and throughput comparable to fast mode on an NVIDIA GH200, but provides neither an error analysis nor the rules used to select or exclude input matrices and scalings. This omission prevents verification that the guarantee holds without post-hoc adjustments across the full range of inputs, directly affecting the weakest assumption identified in the review.

minor comments (1)

[Abstract] The abstract would benefit from stating the explicit mathematical expression of the revised scaling formula rather than only describing its properties.

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the opportunity to respond to the referee's report. We appreciate the positive assessment of the significance of our work on improving the scale invariance of the fast mode in Ozaki scheme II. Below, we provide point-by-point responses to the major comments. We will revise the manuscript to address the concerns raised.


Referee: [Abstract / proposed formula] Abstract and description of the proposed formula: the manuscript states that the revised formula is derived from the CRT uniqueness condition via the Cauchy-Schwarz inequality and is scale-invariant by construction, yet the full algebraic steps that produce the explicit formula from the uniqueness bound are not supplied. Without these steps it is impossible to confirm that the inequality application yields a bound that remains valid and sufficient for arbitrary matrix entries and scalings, which is the load-bearing claim.

Authors: We agree that the full algebraic derivation steps are not explicitly provided in the current manuscript. In the revised version, we will add a new section detailing the complete derivation: starting from the CRT uniqueness condition, applying the Cauchy-Schwarz inequality to bound the product of the scaled matrices, and arriving at the scale-invariant formula. This will demonstrate that the resulting bound is valid and sufficient for arbitrary matrix entries and input scalings, as the inequality holds independently of the scaling factor. revision: yes
Referee: [Experiments] Experiments section: the manuscript reports accuracy comparable to accurate mode and throughput comparable to fast mode on an NVIDIA GH200, but provides neither an error analysis nor the rules used to select or exclude input matrices and scalings. This omission prevents verification that the guarantee holds without post-hoc adjustments across the full range of inputs, directly affecting the weakest assumption identified in the review.

Authors: We acknowledge that the manuscript lacks a formal error analysis and explicit rules for input matrix selection. The experiments used randomly generated matrices with entries drawn from uniform distributions and tested multiple scaling factors to illustrate the issue with the original formula and the robustness of the new one. In the revision, we will include an error analysis deriving the expected error from the theoretical bound and specify the input generation procedure, including the ranges and number of trials. This will enable independent verification of the claims. revision: yes

Circularity Check

0 steps flagged

No significant circularity identified


The paper derives a revised scaling formula by applying the Cauchy-Schwarz inequality directly to the CRT uniqueness condition, producing a scale-invariant expression that satisfies the condition by construction. This is a standard first-principles derivation from the stated mathematical constraint rather than a fit, self-definition, or self-citation chain. No load-bearing step reduces to its own inputs by construction, and the abstract and reader's summary describe an independent guarantee without fitted parameters or renamed empirical patterns. The derivation is self-contained against the external CRT condition.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the standard properties of the Chinese Remainder Theorem for unique recovery and the validity of applying the Cauchy-Schwarz inequality to bound matrix entries for the uniqueness condition.

axioms (1)

Domain assumption: The CRT uniqueness condition must hold for the integer matrices produced by the scaling step in order to recover the correct high-precision result.
Invoked as the target condition that the new scaling formula is constructed to satisfy.
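
For reference, the standard CRT fact behind this assumption can be stated as follows. The notation (p_ℓ for the pairwise coprime moduli, N for their count, M_ℓ and r_ℓ for the cofactors and residues) is chosen here for illustration, and the open interval is a conservative choice; whether the paper's bound is strict is not shown in this summary.

```latex
\begin{align*}
P &= \prod_{\ell=1}^{N} p_\ell, \qquad M_\ell = P / p_\ell, \qquad r_\ell = x \bmod p_\ell, \\
x &\equiv \sum_{\ell=1}^{N} r_\ell \, M_\ell \,\bigl(M_\ell^{-1} \bmod p_\ell\bigr) \pmod{P}, \\
-\tfrac{P}{2} &< x < \tfrac{P}{2}
  \;\Longrightarrow\; x \text{ is the only integer in that range with residues } r_1, \dots, r_N .
\end{align*}
```

Applied entrywise to the integer product, this is why the scaling step has to keep every entry within P/2 of zero: two integers that agree modulo every p_ℓ differ by a multiple of P, so only one of them can lie in an interval of length P.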

pith-pipeline@v0.9.1-grok · 5747 in / 1388 out tokens · 51120 ms · 2026-06-30T02:34:46.264924+00:00 · methodology


