Correcting Stochastic Update Bias in Preconditioned Language Model Optimizers

Ash Lewis; Dhruv Atreja; George Hurn-Maloney; Henrijs Princis; Henry Fawcett; Julia White; Kelton Zhang; Matthew Thomas; Nikhil Nayak; Urchade Zaratiana

arxiv: 2605.20756 · v1 · pith:IPUKQ7MFnew · submitted 2026-05-20 · 💻 cs.LG · cs.AI · math.OC · stat.ML

Pith reviewed 2026-05-21 06:35 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · math.OC · stat.ML

keywords bias correction · preconditioned optimizers · AdamW · Sophia · Shampoo · language model pretraining · stochastic optimization · finite sample bias

The pith

Correcting two finite-sample biases in stochastic preconditioned updates improves language model optimizer performance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Preconditioned optimizers such as AdamW rely on stochastic estimates of gradients and preconditioners from minibatches, but these estimates introduce two types of bias that the paper identifies and corrects. The first is coupling bias when gradient and preconditioner come from the same data, and the second is bias in the nonlinear inversion of the preconditioner. By using independent microbatches for cross-fitting and subtracting the leading bias term estimated from variability, the method yields lower held-out loss without harming downstream tasks. This matters because even small improvements in optimization efficiency can scale to better models at lower cost. The reported gains on small models suggest the effect is measurable and consistent across diagonal and matrix preconditioners.
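The inversion bias described above can be made concrete with a second-order Taylor expansion. What follows is a sketch for a single scalar (diagonal) preconditioner entry, not the paper's own derivation. The symbols $v$, $\hat v$, $K$, and $s^2$ are notation assumed here: $\hat v$ is an unbiased average of $K$ independent microbatch estimates of $v > 0$, and $s^2$ is their sample variance, so that $s^2/K$ estimates $\operatorname{Var}(\hat v)$.

```latex
\begin{align*}
f(\hat v) &= f(v) + f'(v)\,(\hat v - v) + \tfrac12 f''(v)\,(\hat v - v)^2 + R, \\
\mathbb{E}[f(\hat v)] &= f(v)
  + \underbrace{\tfrac12 f''(v)\,\operatorname{Var}(\hat v)}_{\text{leading bias, } O(1/K)}
  + \underbrace{\mathbb{E}[R]}_{\text{higher order}}, \\
f(v) = v^{-1}: &\quad \tfrac12 f''(v)\,\operatorname{Var}(\hat v) = \frac{\operatorname{Var}(\hat v)}{v^{3}}, \\
f(v) = v^{-1/2}: &\quad \tfrac12 f''(v)\,\operatorname{Var}(\hat v) = \frac{3}{8}\,\frac{\operatorname{Var}(\hat v)}{v^{5/2}}, \\
\widehat{v^{-1/2}}_{\text{corr}} &= \hat v^{-1/2} - \frac{3}{8}\,\frac{s^2/K}{\hat v^{5/2}}.
\end{align*}
```

The linear term drops out because $\mathbb{E}[\hat v - v] = 0$. Both leading terms are positive, so under these assumptions the plug-in inverse and inverse-root overestimate their targets, which is why the correction is a subtraction. The paper's estimator may differ in detail, for example in how it handles moving averages or damping.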

Core claim

The paper claims that preconditioned optimizers suffer from gradient-preconditioner coupling bias and bias in the inverse or inverse-root of the preconditioner due to finite-sample stochasticity, and that a framework combining cross-fitted preconditioning from independent microbatches and variance-corrected inversion via delta-method adjustment removes these biases, leading to better pretraining loss on models like Qwen2.5-0.5B.

What carries the argument

The bias-correction framework using cross-fitted preconditioning from independent microbatches and variance-corrected inversion that subtracts the leading delta-method bias term estimated from microbatch variability.

If this is right

Bias correction reduces held-out pretraining loss by 0.15 nats for AdamW, 0.07 nats for Sophia, and 0.11 nats for Shampoo on Qwen2.5-0.5B.
Effects on mixed-quality pretraining and downstream instruction tuning remain neutral to positive.
The framework applies to diagonal moment, diagonal curvature, and matrix preconditioning methods.
The single-batch implementation preserves training efficiency while making updates closer to population preconditioned descent.
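The two ingredients listed above can be sketched for a diagonal preconditioner in plain Python. This is an illustrative sketch under stated assumptions, not the paper's implementation: the function names `corrected_inv_root` and `cross_fitted_direction` are hypothetical, the even/odd split into groups is one arbitrary choice of partition, squared microbatch gradients stand in for the second-moment estimate, and the cap on the correction is a safeguard added here.

```python
from statistics import mean, variance


def corrected_inv_root(v_samples, eps=1e-8):
    """Plug-in v**-0.5 minus the estimated first-order delta-method bias.

    v_samples: per-microbatch estimates of one preconditioner entry (>= 2 values).
    """
    k = len(v_samples)
    v_hat = mean(v_samples) + eps
    var_of_mean = variance(v_samples) / k  # estimated Var(v_hat)
    plug_in = v_hat ** -0.5
    correction = 0.375 * var_of_mean * v_hat ** -2.5  # (3/8) Var / v^(5/2)
    # Assumption (not from the paper): cap the correction so that a noisy
    # variance estimate cannot shrink the scale by more than half.
    correction = min(correction, 0.5 * plug_in)
    return plug_in - correction


def cross_fitted_direction(microbatch_grads, eps=1e-8):
    """Update direction with numerator and preconditioner from disjoint groups.

    microbatch_grads: list of K >= 4 gradient vectors (lists of floats).
    """
    if len(microbatch_grads) < 4:
        raise ValueError("need at least 4 microbatches (2 per group)")
    group_a = microbatch_grads[0::2]  # used only for the gradient numerator
    group_b = microbatch_grads[1::2]  # used only for the preconditioner
    dim = len(microbatch_grads[0])
    direction = []
    for j in range(dim):
        g_hat = mean([g[j] for g in group_a])
        v_samples = [g[j] ** 2 for g in group_b]
        direction.append(g_hat * corrected_inv_root(v_samples, eps))
    return direction
```

Both groups come from the same minibatch, so no extra data is consumed. A variant could swap the roles of the two groups and average the results to use every microbatch in both roles.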

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Larger models might see compounded gains from the correction over many training steps where small per-step improvements accumulate.
Similar bias-correction ideas could apply to other nonlinear operations in stochastic optimization beyond preconditioner inversion.
The method suggests treating optimizer stochasticity as containing fixable systematic error rather than pure irreducible noise.
Adaptive versions could vary correction strength based on observed microbatch variability during training.

Load-bearing premise

The leading delta-method bias term for the inverse or inverse-root of the preconditioner can be accurately estimated from microbatch variability and subtracted without introducing new higher-order errors that dominate at the batch sizes used in practice.
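One way to probe this premise on a toy problem is a small Monte Carlo experiment. The sketch below is an assumption-laden illustration, not an experiment from the paper: it takes $v = 1$ as the second moment of a standard normal, estimates it from `k` squared draws, and compares the bias of the plug-in inverse-root with the bias after subtracting the estimated first-order term. The function name `inv_root_bias` is hypothetical.

```python
import random


def inv_root_bias(k, trials=20000, seed=0):
    """Return (plug-in bias, corrected bias) for estimating v**-0.5 with v = 1."""
    rng = random.Random(seed)
    plug_sum = 0.0
    corr_sum = 0.0
    for _ in range(trials):
        # k independent one-sample estimates of v = E[z^2] = 1.
        samples = [rng.gauss(0.0, 1.0) ** 2 for _ in range(k)]
        v_hat = sum(samples) / k
        s2 = sum((x - v_hat) ** 2 for x in samples) / (k - 1)
        plug_in = v_hat ** -0.5
        plug_sum += plug_in
        # Subtract the estimated leading term (3/8) * Var(v_hat) / v^(5/2).
        corr_sum += plug_in - 0.375 * (s2 / k) * v_hat ** -2.5
    return plug_sum / trials - 1.0, corr_sum / trials - 1.0
```

Calling `inv_root_bias(16)` and `inv_root_bias(4)` side by side gives a rough sense of how much bias survives the correction as the number of samples shrinks. In this toy setting the residual should be a larger fraction of the original bias at `k = 4` than at `k = 16`, which is the regime the premise is concerned with.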

What would settle it

A direct comparison on held-out pretraining loss where the bias-corrected versions of AdamW, Sophia, or Shampoo fail to reduce loss by the reported amounts of 0.07 to 0.15 nats would show the correction does not deliver the claimed improvement.

Figures

Figures reproduced from arXiv: 2605.20756 by Ash Lewis, Dhruv Atreja, George Hurn-Maloney, Henrijs Princis, Henry Fawcett, Julia White, Kelton Zhang, Matthew Thomas, Nikhil Nayak, Urchade Zaratiana.

**Figure 2.** Sophia pretraining loss curves for Qwen2.5-0.5B trained from random initialization. Thin …

**Figure 3.** Shampoo pretraining loss curves for Qwen2.5-0.5B trained from random initialization with …

Original abstract

Preconditioned optimizers are central to language model training, but their stochastic update rules are usually treated as direct approximations to population preconditioned descent. We show that this view misses two finite-sample biases. First, the gradient and preconditioner are typically estimated from the same minibatch, introducing gradient-preconditioner coupling bias. Second, even when the preconditioner estimate is unbiased, its inverse or inverse-root is generally biased because inversion is nonlinear. We propose a single-batch bias-correction framework that addresses both effects: cross-fitted preconditioning estimates the numerator and preconditioner from independent microbatch groups, while variance-corrected inversion uses microbatch variability to subtract the leading delta-method bias term. The framework applies to diagonal moment, diagonal curvature, and matrix preconditioning methods, instantiated in AdamW, Sophia, and Shampoo. Bias correction reduces held-out pretraining loss on Qwen2.5-0.5B by 0.15, 0.07, and 0.11 nats, respectively; the effects on mixed-quality pretraining and downstream instruction tuning are consistently neutral-to-positive. Together, these results establish bias correction as a practical mechanism for reducing finite-sample update bias and improving the performance of preconditioned optimizers.
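The coupling bias named in the abstract is, for a scalar update, the covariance between the gradient estimate and the inverse-root scale, since $\mathbb{E}[\hat g\,\hat s] - \mathbb{E}[\hat g]\,\mathbb{E}[\hat s] = \operatorname{Cov}(\hat g, \hat s)$. The following toy simulation is a sketch under assumptions chosen here (Gaussian per-example gradients with mean `mu` and standard deviation `sigma`, a second-moment preconditioner, and the hypothetical function name `coupling_covariances`); it is not taken from the paper.

```python
import random


def _cov(xs, ys):
    """Sample covariance of two equal-length lists."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)


def coupling_covariances(n=4, mu=1.0, sigma=1.0, trials=20000, seed=0):
    """Return (same-batch covariance, cross-fitted covariance).

    The same-batch scale reuses the gradients that form g_hat; the
    cross-fitted scale is computed from an independent group of size n.
    """
    rng = random.Random(seed)
    g_hats, same_scales, cross_scales = [], [], []
    for _ in range(trials):
        group_a = [rng.gauss(mu, sigma) for _ in range(n)]
        group_b = [rng.gauss(mu, sigma) for _ in range(n)]
        g_hats.append(sum(group_a) / n)
        same_scales.append((sum(x * x for x in group_a) / n) ** -0.5)
        cross_scales.append((sum(x * x for x in group_b) / n) ** -0.5)
    return _cov(g_hats, same_scales), _cov(g_hats, cross_scales)
```

With a nonzero mean gradient, a larger same-batch gradient estimate tends to come with a larger second-moment estimate and hence a smaller scale, so the same-batch covariance should be negative. The cross-fitted covariance should be close to zero because the two groups are independent by construction.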

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

This paper flags two finite-sample biases in preconditioned optimizers and proposes microbatch cross-fitting plus a delta-method inversion correction, with modest loss gains on Qwen2.5-0.5B.

The main point is that preconditioned optimizers like AdamW have two built-in finite-sample biases that this paper tries to fix: one from using the same batch for both gradient and preconditioner, and another from the bias introduced when you invert or take the root of the preconditioner estimate. They address the first with cross-fitted microbatch separation, so the preconditioner comes from one group and the gradient from another. For the second they use a variance-corrected inversion based on a delta-method expansion to subtract the leading bias term. This is applied to AdamW, Sophia, and Shampoo.

What they do well is show consistent small improvements in pretraining loss on Qwen2.5-0.5B: 0.15 nats for AdamW, 0.07 for Sophia, 0.11 for Shampoo. The effects carry over neutrally or positively to mixed pretraining and instruction tuning. The approach is low-overhead and general across diagonal and matrix preconditioners.

The soft spots are that the loss reductions are modest, and without detailed ablations or error bars it is hard to be sure the gains trace back to bias removal rather than the microbatch partitioning acting as a regularizer. The central assumption that the first-order delta-method term dominates and can be subtracted cleanly needs checking against higher-order effects at practical scales. If the preconditioner estimates vary a lot, the correction might not be as clean as hoped.

This is useful for people running large-scale pretraining who are looking for small efficiency edges in the optimizer. Readers working on optimizer variants would find the bias breakdown helpful. The work has a solid empirical component and a well-defined proposal, so it deserves a serious referee. I would send it to peer review with requests for more ablation studies and validation of the approximation accuracy.

Referee Report

2 major / 2 minor

Summary. The manuscript identifies two finite-sample biases in preconditioned optimizers (AdamW, Sophia, Shampoo) for language model training: (1) coupling bias between gradient and preconditioner estimates drawn from the same minibatch, and (2) bias in the inverse or inverse-root of the preconditioner arising from nonlinearity even when the preconditioner itself is unbiased. It proposes a single-batch correction framework using cross-fitted preconditioning from independent microbatch groups to remove coupling and variance-corrected inversion that subtracts the leading delta-method bias term estimated from microbatch variability. Experiments report held-out pretraining loss reductions of 0.15, 0.07, and 0.11 nats on Qwen2.5-0.5B, with neutral-to-positive effects on mixed-quality pretraining and downstream instruction tuning.

Significance. If the corrections accurately isolate and remove the claimed finite-sample effects, the work supplies a practical, low-overhead enhancement to widely deployed preconditioned optimizers. Concrete loss deltas on a public model together with applicability to both diagonal and matrix preconditioners constitute measurable strengths; the derivations remain parameter-free beyond the microbatch partitioning choice.

major comments (2)

[Variance-corrected inversion] Variance-corrected inversion section: the leading delta-method term is subtracted to correct bias in the inverse or inverse-root, yet no analysis, bounds, or ablation is supplied showing that quadratic and higher-order terms in the Taylor expansion remain smaller than the subtracted bias at the microbatch sizes and preconditioner variability used for Qwen2.5-0.5B pretraining. If these terms dominate, the observed 0.07–0.15 nat reductions could arise from incidental regularization rather than accurate bias removal.
[Experimental results] Experimental results on Qwen2.5-0.5B: the reported loss deltas of 0.15, 0.07, and 0.11 nats are presented without error bars, standard deviations across seeds, or statistical significance tests. This omission leaves open whether the improvements exceed training stochasticity and undermines the claim that the two biases are the dominant finite-sample effects.

minor comments (2)

[Methods] A short pseudocode or diagram clarifying the microbatch partitioning for cross-fitting would improve reproducibility of the single-batch implementation.
[Variance-corrected inversion] Notation for the delta-method expansion could be made more explicit by labeling the order of each term in the relevant equation.

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the detailed review. We appreciate the opportunity to clarify and strengthen our manuscript. Below we respond to each major comment.

Referee: [Variance-corrected inversion] Variance-corrected inversion section: the leading delta-method term is subtracted to correct bias in the inverse or inverse-root, yet no analysis, bounds, or ablation is supplied showing that quadratic and higher-order terms in the Taylor expansion remain smaller than the subtracted bias at the microbatch sizes and preconditioner variability used for Qwen2.5-0.5B pretraining. If these terms dominate, the observed 0.07–0.15 nat reductions could arise from incidental regularization rather than accurate bias removal.

Authors: We thank the referee for highlighting this important point. While the delta method provides the leading bias term, we recognize that a formal analysis of the remainder terms would be valuable. In the revised manuscript, we will include an appendix deriving bounds on the higher-order terms under assumptions on the preconditioner variability, and provide empirical ablations showing that the correction remains effective even when microbatch sizes are varied. This will help confirm that the observed improvements stem from bias correction rather than regularization effects.

revision: yes

Referee: [Experimental results] Experimental results on Qwen2.5-0.5B: the reported loss deltas of 0.15, 0.07, and 0.11 nats are presented without error bars, standard deviations across seeds, or statistical significance tests. This omission leaves open whether the improvements exceed training stochasticity and undermines the claim that the two biases are the dominant finite-sample effects.

Authors: We agree that reporting variability across runs is essential for robust claims. The current manuscript presents single-run results for computational reasons, but we will rerun the experiments with at least three independent seeds and include error bars, standard deviations, and p-values from paired t-tests or similar in the revised version. Preliminary checks suggest the improvements are consistent across seeds, but we will document this explicitly.

revision: yes
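The per-seed comparison promised in this response could be summarized along the following lines. This is a minimal sketch, assuming one baseline run and one bias-corrected run per seed, paired by seed; the function name `paired_summary` and its return keys are hypothetical, and no values from the paper are used. It returns the paired t statistic and degrees of freedom, leaving the p-value lookup to a statistics table or library.

```python
import math
from statistics import mean, stdev


def paired_summary(baseline_losses, corrected_losses):
    """Summarize per-seed loss reductions (baseline minus corrected).

    Both inputs are held-out losses ordered by seed, so that entry i of
    each list comes from the same seed.
    """
    n = len(baseline_losses)
    if n != len(corrected_losses) or n < 2:
        raise ValueError("need two equal-length lists with at least 2 seeds")
    diffs = [b - c for b, c in zip(baseline_losses, corrected_losses)]
    d_mean = mean(diffs)
    d_std = stdev(diffs)  # sample standard deviation across seeds
    std_err = d_std / math.sqrt(n)
    if std_err > 0:
        t_stat = d_mean / std_err
    else:
        # Identical reductions on every seed: the statistic is unbounded
        # unless the mean reduction is exactly zero.
        t_stat = math.copysign(math.inf, d_mean) if d_mean != 0 else 0.0
    return {"mean_reduction": d_mean, "std": d_std, "t": t_stat, "dof": n - 1}
```

Pairing by seed removes variation shared by both runs (data order, initialization) from the comparison, which matters when only a handful of seeds are available.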

Circularity Check

0 steps flagged

No circularity: framework defined from independent microbatch statistics and first-order expansion; results are held-out empirical measurements

full rationale

The paper defines cross-fitted preconditioning via independent microbatch groups and variance-corrected inversion via the leading delta-method term subtracted from observed microbatch variability. These constructions are explicit functions of the data splits and Taylor expansion; they do not presuppose the final loss reduction. The reported improvements (0.15/0.07/0.11 nats on held-out pretraining loss for AdamW/Sophia/Shampoo) are measured on separate validation data rather than being fitted parameters or tautological re-expressions of the correction itself. No self-citation chain is invoked to justify the central premise, and the derivation remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The framework rests on standard statistical approximations rather than new postulates. No free parameters are introduced beyond conventional optimizer hyperparameters. No new physical or mathematical entities are invented.

axioms (2)

[standard math] The delta method provides a first-order accurate approximation to the bias of the inverse or inverse-root of a sample covariance or curvature matrix when the sample size is moderate.
Invoked to derive the variance-corrected inversion term.
[domain assumption] Microbatch statistics computed on disjoint partitions of the same minibatch are sufficiently independent for bias estimation.
Required for the cross-fitting construction to remove gradient-preconditioner coupling.

pith-pipeline@v0.9.0 · 5794 in / 1444 out tokens · 47661 ms · 2026-05-21T06:35:24.566623+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Paper passage: "variance-corrected inversion uses microbatch variability to subtract the leading delta-method bias term"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
