
arxiv: 2605.20756 · v1 · pith:IPUKQ7MFnew · submitted 2026-05-20 · 💻 cs.LG · cs.AI · math.OC · stat.ML

Correcting Stochastic Update Bias in Preconditioned Language Model Optimizers

Pith reviewed 2026-05-21 06:35 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · math.OC · stat.ML
keywords bias correction · preconditioned optimizers · AdamW · Sophia · Shampoo · language model pretraining · stochastic optimization · finite sample bias

The pith

Correcting two finite-sample biases in stochastic preconditioned updates improves language model optimizer performance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Preconditioned optimizers such as AdamW rely on stochastic estimates of gradients and preconditioners from minibatches, but these estimates introduce two types of bias that the paper identifies and corrects. The first is coupling bias when gradient and preconditioner come from the same data, and the second is bias in the nonlinear inversion of the preconditioner. By using independent microbatches for cross-fitting and subtracting the leading bias term estimated from variability, the method yields lower held-out loss without harming downstream tasks. This matters because even small improvements in optimization efficiency can scale to better models at lower cost. The reported gains on small models suggest the effect is measurable and consistent across diagonal and matrix preconditioners.
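The two biases can be separated in a short scalar identity. This is an illustrative sketch, not the paper's own notation: the symbols \hat g (minibatch gradient estimate), \hat v (minibatch preconditioner estimate), and f (the inversion map, e.g. f(x) = x^{-1} or f(x) = x^{-1/2}) are assumptions introduced here, with \mathbb{E}[\hat g] = g and \mathbb{E}[\hat v] = v.

```latex
\begin{align*}
\mathbb{E}\bigl[\hat g\, f(\hat v)\bigr] - g\, f(v)
  &= \underbrace{\operatorname{Cov}\bigl(\hat g,\, f(\hat v)\bigr)}_{\text{coupling bias}}
   + \underbrace{g\,\bigl(\mathbb{E}[f(\hat v)] - f(v)\bigr)}_{\text{inversion bias}}
\end{align*}
```

Under these assumptions, the covariance term vanishes when \hat g and \hat v are computed from independent data, which is what cross-fitting aims for. The second term is nonzero for nonlinear f even when \hat v is unbiased, which is what the variance correction targets.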

Core claim

The paper claims that finite-sample stochasticity gives preconditioned optimizers two biases: gradient-preconditioner coupling bias, and bias in the inverse or inverse-root of the preconditioner. It further claims that a framework combining cross-fitted preconditioning from independent microbatches with variance-corrected inversion via a delta-method adjustment removes these biases, leading to better pretraining loss on models like Qwen2.5-0.5B.

What carries the argument

The bias-correction framework using cross-fitted preconditioning from independent microbatches and variance-corrected inversion that subtracts the leading delta-method bias term estimated from microbatch variability.
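A minimal numerical sketch of variance-corrected inversion for a diagonal inverse-root follows. It is not the paper's implementation: the function name `variance_corrected_inv_sqrt`, the `floor` guard, and the toy setup (squared Gaussian draws as per-microbatch preconditioner estimates) are all assumptions made for illustration. The coefficient 3/8 is the standard second-order Taylor coefficient for f(x) = x^{-1/2}.

```python
import numpy as np

def variance_corrected_inv_sqrt(micro_stats, floor=1e-12):
    """Hypothetical helper: inverse-root of a diagonal preconditioner.

    micro_stats has shape (m, d): m per-microbatch estimates of d positive
    preconditioner entries. Returns (corrected, plug_in).
    """
    m = micro_stats.shape[0]
    v_hat = np.maximum(micro_stats.mean(axis=0), floor)
    # Variance of the averaged estimate, estimated from microbatch spread.
    var_of_mean = micro_stats.var(axis=0, ddof=1) / m
    plug_in = v_hat ** -0.5
    # Leading delta-method bias of x**-0.5: 0.5 * f''(v) * Var = (3/8) v**-2.5 * Var.
    leading_bias = 0.375 * v_hat ** -2.5 * var_of_mean
    return plug_in - leading_bias, plug_in

# Toy check (assumed setup): each column is one independent trial.
rng = np.random.default_rng(0)
true_v, m, trials = 4.0, 16, 200_000
micro_stats = rng.normal(0.0, np.sqrt(true_v), size=(m, trials)) ** 2
corrected, plug_in = variance_corrected_inv_sqrt(micro_stats)
target = true_v ** -0.5
print("plug-in relative bias:  ", plug_in.mean() / target - 1.0)
print("corrected relative bias:", corrected.mean() / target - 1.0)
```

In this toy setting the plug-in estimate overshoots the true inverse-root on average, and subtracting the leading term brings the average closer to the target. Individual corrected entries can become small or negative when microbatch variability is large, so a practical version would likely need a clamp; whether and how the paper handles that is not stated here.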

If this is right

  • Bias correction reduces held-out pretraining loss by 0.15 nats for AdamW, 0.07 nats for Sophia, and 0.11 nats for Shampoo on Qwen2.5-0.5B.
  • Effects on mixed-quality pretraining and downstream instruction tuning remain neutral to positive.
  • The framework applies to diagonal moment, diagonal curvature, and matrix preconditioning methods.
  • The single-batch implementation preserves training efficiency while making updates closer to population preconditioned descent.
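The cross-fitting idea behind these points can be sketched for an Adam-style diagonal preconditioner. This is a hedged illustration only: the function `cross_fitted_update`, the two-group split, the fold averaging, and the use of mean squared microbatch gradients as the second-moment estimate are assumptions, and momentum, moving averages, and weight decay are omitted.

```python
import numpy as np

def cross_fitted_update(micro_grads, eps=1e-8, rng=None):
    """Hypothetical cross-fitted direction for a diagonal preconditioner.

    micro_grads has shape (m, d): one gradient per microbatch of a single
    minibatch. The numerator and the preconditioner of each fold come from
    disjoint microbatch groups; the two folds are then averaged.
    """
    m = micro_grads.shape[0]
    if m < 2:
        raise ValueError("need at least two microbatches to cross-fit")
    idx = np.arange(m) if rng is None else rng.permutation(m)
    group_a, group_b = np.array_split(idx, 2)

    def fold(num_idx, pre_idx):
        numerator = micro_grads[num_idx].mean(axis=0)
        second_moment = (micro_grads[pre_idx] ** 2).mean(axis=0)
        return numerator / (np.sqrt(second_moment) + eps)

    # Swap roles so every microbatch contributes to both estimates overall.
    return 0.5 * (fold(group_a, group_b) + fold(group_b, group_a))

# Usage: four microbatches, one parameter, deterministic split.
grads = np.array([[1.0], [1.0], [3.0], [3.0]])
direction = cross_fitted_update(grads, eps=0.0)
```

Swapping the roles of the two groups and averaging keeps every microbatch in use, so the sketch stays within a single minibatch rather than requiring extra data.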

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Larger models might see compounded gains, since small per-step improvements accumulate over many training steps.
  • Similar bias-correction ideas could apply to other nonlinear operations in stochastic optimization beyond preconditioner inversion.
  • The method suggests treating optimizer stochasticity as containing fixable systematic error rather than pure irreducible noise.
  • Adaptive versions could vary correction strength based on observed microbatch variability during training.

Load-bearing premise

The leading delta-method bias term for the inverse or inverse-root of the preconditioner can be accurately estimated from microbatch variability and subtracted without introducing new higher-order errors that dominate at the batch sizes used in practice.

What would settle it

A direct comparison on held-out pretraining loss where the bias-corrected versions of AdamW, Sophia, or Shampoo fail to reduce loss by the reported amounts of 0.07 to 0.15 nats would show the correction does not deliver the claimed improvement.

Figures

Figures reproduced from arXiv: 2605.20756 by Ash Lewis, Dhruv Atreja, George Hurn-Maloney, Henrijs Princis, Henry Fawcett, Julia White, Kelton Zhang, Matthew Thomas, Nikhil Nayak, Urchade Zaratiana.

Figure 1: AdamW pretraining comparisons for Qwen2.5-0.5B trained from random initialization.
Figure 2: Sophia pretraining loss curves for Qwen2.5-0.5B trained from random initialization. Thin …
Figure 3: Shampoo pretraining loss curves for Qwen2.5-0.5B trained from random initialization with …
Original abstract

Preconditioned optimizers are central to language model training, but their stochastic update rules are usually treated as direct approximations to population preconditioned descent. We show that this view misses two finite-sample biases. First, the gradient and preconditioner are typically estimated from the same minibatch, introducing gradient-preconditioner coupling bias. Second, even when the preconditioner estimate is unbiased, its inverse or inverse-root is generally biased because inversion is nonlinear. We propose a single-batch bias-correction framework that addresses both effects: cross-fitted preconditioning estimates the numerator and preconditioner from independent microbatch groups, while variance-corrected inversion uses microbatch variability to subtract the leading delta-method bias term. The framework applies to diagonal moment, diagonal curvature, and matrix preconditioning methods, instantiated in AdamW, Sophia, and Shampoo. Bias correction reduces held-out pretraining loss on Qwen2.5-0.5B by $0.15$, $0.07$, and $0.11$ nats, respectively; the effects on mixed-quality pretraining and downstream instruction tuning are consistently neutral-to-positive. Together, these results establish bias correction as a practical mechanism for reducing finite-sample update bias and improving the performance of preconditioned optimizers.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript identifies two finite-sample biases in preconditioned optimizers (AdamW, Sophia, Shampoo) for language model training: (1) coupling bias between gradient and preconditioner estimates drawn from the same minibatch, and (2) bias in the inverse or inverse-root of the preconditioner arising from nonlinearity even when the preconditioner itself is unbiased. It proposes a single-batch correction framework using cross-fitted preconditioning from independent microbatch groups to remove coupling and variance-corrected inversion that subtracts the leading delta-method bias term estimated from microbatch variability. Experiments report held-out pretraining loss reductions of 0.15, 0.07, and 0.11 nats on Qwen2.5-0.5B, with neutral-to-positive effects on mixed-quality pretraining and downstream instruction tuning.

Significance. If the corrections accurately isolate and remove the claimed finite-sample effects, the work supplies a practical, low-overhead enhancement to widely deployed preconditioned optimizers. Concrete loss deltas on a public model together with applicability to both diagonal and matrix preconditioners constitute measurable strengths; the derivations remain parameter-free beyond the microbatch partitioning choice.

major comments (2)
  1. [Variance-corrected inversion] Variance-corrected inversion section: the leading delta-method term is subtracted to correct bias in the inverse or inverse-root, yet no analysis, bounds, or ablation is supplied showing that quadratic and higher-order terms in the Taylor expansion remain smaller than the subtracted bias at the microbatch sizes and preconditioner variability used for Qwen2.5-0.5B pretraining. If these terms dominate, the observed 0.07–0.15 nat reductions could arise from incidental regularization rather than accurate bias removal.
  2. [Experimental results] Experimental results on Qwen2.5-0.5B: the reported loss deltas of 0.15, 0.07, and 0.11 nats are presented without error bars, standard deviations across seeds, or statistical significance tests. This omission leaves open whether the improvements exceed training stochasticity and undermines the claim that the two biases are the dominant finite-sample effects.
minor comments (2)
  1. [Methods] A short pseudocode or diagram clarifying the microbatch partitioning for cross-fitting would improve reproducibility of the single-batch implementation.
  2. [Variance-corrected inversion] Notation for the delta-method expansion could be made more explicit by labeling the order of each term in the relevant equation.
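The concern about higher-order terms can be made concrete with an order-labeled scalar expansion. This is a textbook-style sketch rather than the manuscript's equation: it assumes \hat v is the mean of m i.i.d. microbatch estimates with mean v, variance \sigma^2, and finite higher moments, and that f is smooth near v.

```latex
\begin{align*}
\mathbb{E}[f(\hat v)]
  &= f(v)
   + \underbrace{\tfrac{1}{2} f''(v)\,\frac{\sigma^2}{m}}_{O(m^{-1}):\ \text{subtracted term}}
   + \underbrace{\tfrac{1}{6} f'''(v)\,\mathbb{E}\bigl[(\hat v - v)^3\bigr]
   + \tfrac{1}{24} f''''(v)\,\mathbb{E}\bigl[(\hat v - v)^4\bigr]}_{O(m^{-2}):\ \text{not corrected}}
   + \cdots \\[4pt]
f(x) = x^{-1}:\quad
  & \tfrac{1}{2} f''(v)\,\frac{\sigma^2}{m} = \frac{\sigma^2}{m\,v^{3}},
\qquad
f(x) = x^{-1/2}:\quad
  \tfrac{1}{2} f''(v)\,\frac{\sigma^2}{m} = \frac{3\,\sigma^2}{8\,m\,v^{5/2}}
\end{align*}
```

The first-order term drops out because \mathbb{E}[\hat v - v] = 0. The uncorrected terms shrink faster in m than the subtracted one, but their constants grow with the relative variability \sigma / v, so with few microbatches or noisy preconditioner entries they need not be negligible.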

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the detailed review. We appreciate the opportunity to clarify and strengthen our manuscript. Below we respond to each major comment.

  1. Referee: [Variance-corrected inversion] Variance-corrected inversion section: the leading delta-method term is subtracted to correct bias in the inverse or inverse-root, yet no analysis, bounds, or ablation is supplied showing that quadratic and higher-order terms in the Taylor expansion remain smaller than the subtracted bias at the microbatch sizes and preconditioner variability used for Qwen2.5-0.5B pretraining. If these terms dominate, the observed 0.07–0.15 nat reductions could arise from incidental regularization rather than accurate bias removal.

    Authors: We thank the referee for highlighting this important point. While the delta method provides the leading bias term, we recognize that a formal analysis of the remainder terms would be valuable. In the revised manuscript, we will include an appendix deriving bounds on the higher-order terms under assumptions on the preconditioner variability, and provide empirical ablations showing that the correction remains effective even when microbatch sizes are varied. This will help confirm that the observed improvements stem from bias correction rather than regularization effects. revision: yes

  2. Referee: [Experimental results] Experimental results on Qwen2.5-0.5B: the reported loss deltas of 0.15, 0.07, and 0.11 nats are presented without error bars, standard deviations across seeds, or statistical significance tests. This omission leaves open whether the improvements exceed training stochasticity and undermines the claim that the two biases are the dominant finite-sample effects.

    Authors: We agree that reporting variability across runs is essential for robust claims. The current manuscript presents single-run results for computational reasons, but we will rerun the experiments with at least three independent seeds and include error bars, standard deviations, and p-values from paired t-tests or similar in the revised version. Preliminary checks suggest the improvements are consistent across seeds, but we will document this explicitly. revision: yes

Circularity Check

0 steps flagged

No circularity: framework defined from independent microbatch statistics and first-order expansion; results are held-out empirical measurements


The paper defines cross-fitted preconditioning via independent microbatch groups and variance-corrected inversion via the leading delta-method term subtracted from observed microbatch variability. These constructions are explicit functions of the data splits and Taylor expansion; they do not presuppose the final loss reduction. The reported improvements (0.15/0.07/0.11 nats on held-out pretraining loss for AdamW/Sophia/Shampoo) are measured on separate validation data rather than being fitted parameters or tautological re-expressions of the correction itself. No self-citation chain is invoked to justify the central premise, and the derivation remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The framework rests on standard statistical approximations rather than new postulates. No free parameters are introduced beyond conventional optimizer hyperparameters. No new physical or mathematical entities are invented.

axioms (2)
  • [standard math] The delta method provides a first-order accurate approximation to the bias of the inverse or inverse-root of a sample covariance or curvature matrix when the sample size is moderate.
    Invoked to derive the variance-corrected inversion term.
  • [domain assumption] Microbatch statistics computed on disjoint partitions of the same minibatch are sufficiently independent for bias estimation.
    Required for the cross-fitting construction to remove gradient-preconditioner coupling.

pith-pipeline@v0.9.0 · 5794 in / 1444 out tokens · 47661 ms · 2026-05-21T06:35:24.566623+00:00 · methodology


