Correcting Stochastic Update Bias in Preconditioned Language Model Optimizers
Pith reviewed 2026-05-21 06:35 UTC · model grok-4.3
The pith
Correcting two finite-sample biases in stochastic preconditioned updates improves language model optimizer performance.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that finite-sample stochasticity introduces two biases into preconditioned optimizers: gradient-preconditioner coupling bias, and bias in the inverse or inverse-root of the preconditioner. It proposes a framework that combines cross-fitted preconditioning from independent microbatches with variance-corrected inversion via a delta-method adjustment, and reports that removing these biases improves pretraining loss on models such as Qwen2.5-0.5B.
What carries the argument
The bias-correction framework using cross-fitted preconditioning from independent microbatches and variance-corrected inversion that subtracts the leading delta-method bias term estimated from microbatch variability.
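The two ingredients can be sketched for a single parameter coordinate as follows. This is a minimal illustration, not the paper's implementation: the even/odd microbatch split, the use of the mean squared microbatch gradient as the preconditioner estimate, the `eps` placement, and the clamp at zero are all assumptions made for the sketch, and `cross_fitted_update` is a hypothetical name.

```python
import statistics


def cross_fitted_update(micro_grads, eps=1e-8):
    """Return (uncorrected, corrected) update directions for one coordinate.

    micro_grads: per-microbatch gradient values for a single parameter.
    All design choices here are illustrative assumptions, not the paper's.
    """
    if len(micro_grads) < 4:
        raise ValueError("need at least 4 microbatches (2 per group)")

    # Cross-fitting: disjoint groups feed the numerator and the preconditioner.
    num_group = micro_grads[0::2]
    pre_group = micro_grads[1::2]

    g_bar = statistics.fmean(num_group)

    # Preconditioner estimate: mean squared gradient over the second group.
    squares = [g * g for g in pre_group]
    v_hat = statistics.fmean(squares) + eps
    # Estimated variance of v_hat, taken from microbatch variability.
    var_v_hat = statistics.variance(squares) / len(squares)

    # Plain inverse root, then subtract the leading delta-method bias term
    # (1/2) * f''(v) * Var(v_hat) with f(v) = v ** -0.5.
    plain = v_hat ** -0.5
    corrected = plain - 0.375 * v_hat ** -2.5 * var_v_hat
    # Hypothetical safeguard: never let the correction flip the update sign.
    corrected = max(corrected, 0.0)

    return g_bar * plain, g_bar * corrected
```

When the squared gradients in the preconditioner group agree exactly, the estimated variance is zero and the two directions coincide; the correction only shrinks the step when the preconditioner estimate is noisy.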
If this is right
- Bias correction reduces held-out pretraining loss by 0.15 nats for AdamW, 0.07 nats for Sophia, and 0.11 nats for Shampoo on Qwen2.5-0.5B.
- Effects on mixed-quality pretraining and downstream instruction tuning remain neutral to positive.
- The framework applies to diagonal moment, diagonal curvature, and matrix preconditioning methods.
- The single-batch implementation preserves training efficiency while making updates closer to population preconditioned descent.
Where Pith is reading between the lines
- Larger models might see compounded gains from the correction over many training steps where small per-step improvements accumulate.
- Similar bias-correction ideas could apply to other nonlinear operations in stochastic optimization beyond preconditioner inversion.
- The method suggests treating optimizer stochasticity as containing fixable systematic error rather than pure irreducible noise.
- Adaptive versions could vary correction strength based on observed microbatch variability during training.
Load-bearing premise
The leading delta-method bias term for the inverse or inverse-root of the preconditioner can be accurately estimated from microbatch variability and subtracted without introducing new higher-order errors that dominate at the batch sizes used in practice.
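The premise can be made concrete with the standard second-order Taylor (delta-method) expansion. The notation below is an assumption of this note rather than the paper's: $v$ is the population preconditioner entry, $\hat v$ an unbiased estimate of it, and $m$ the number of microbatches used to form $\hat v$.

```latex
% Unbiasedness removes the first-order term f'(v) E[\hat v - v].
\mathbb{E}\big[f(\hat v)\big]
  = f(v) + \tfrac{1}{2}\, f''(v)\, \mathrm{Var}(\hat v) + R_m

% Inverse: f(v) = v^{-1}, so f''(v) = 2 v^{-3}
\mathbb{E}\big[\hat v^{-1}\big] - v^{-1}
  \approx \frac{\mathrm{Var}(\hat v)}{v^{3}}

% Inverse root: f(v) = v^{-1/2}, so f''(v) = \tfrac{3}{4} v^{-5/2}
\mathbb{E}\big[\hat v^{-1/2}\big] - v^{-1/2}
  \approx \tfrac{3}{8}\, v^{-5/2}\, \mathrm{Var}(\hat v)
```

If $\hat v$ is an average over $m$ microbatches with per-microbatch variance $\sigma^2$, then $\mathrm{Var}(\hat v) = \sigma^2 / m$, so the leading bias is $O(1/m)$, while under the usual regularity conditions the remainder $R_m$ is $O(1/m^2)$. The premise amounts to requiring that $R_m$, together with the error from plugging $\hat v$ in for $v$ in the correction, stays small relative to the leading term at the values of $m$ used in practice.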
What would settle it
A direct comparison on held-out pretraining loss where the bias-corrected versions of AdamW, Sophia, or Shampoo fail to reduce loss by the reported amounts of 0.07 to 0.15 nats would show the correction does not deliver the claimed improvement.
Original abstract
Preconditioned optimizers are central to language model training, but their stochastic update rules are usually treated as direct approximations to population preconditioned descent. We show that this view misses two finite-sample biases. First, the gradient and preconditioner are typically estimated from the same minibatch, introducing gradient--preconditioner coupling bias. Second, even when the preconditioner estimate is unbiased, its inverse or inverse-root is generally biased because inversion is nonlinear. We propose a single-batch bias-correction framework that addresses both effects: cross-fitted preconditioning estimates the numerator and preconditioner from independent microbatch groups, while variance-corrected inversion uses microbatch variability to subtract the leading delta-method bias term. The framework applies to diagonal moment, diagonal curvature, and matrix preconditioning methods, instantiated in AdamW, Sophia, and Shampoo. Bias correction reduces held-out pretraining loss on Qwen2.5-0.5B by $0.15$, $0.07$, and $0.11$ nats, respectively; the effects on mixed-quality pretraining and downstream instruction tuning are consistently neutral-to-positive. Together, these results establish bias correction as a practical mechanism for reducing finite-sample update bias and improving the performance of preconditioned optimizers.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript identifies two finite-sample biases in preconditioned optimizers (AdamW, Sophia, Shampoo) for language model training: (1) coupling bias between gradient and preconditioner estimates drawn from the same minibatch, and (2) bias in the inverse or inverse-root of the preconditioner arising from nonlinearity even when the preconditioner itself is unbiased. It proposes a single-batch correction framework using cross-fitted preconditioning from independent microbatch groups to remove coupling and variance-corrected inversion that subtracts the leading delta-method bias term estimated from microbatch variability. Experiments report held-out pretraining loss reductions of 0.15, 0.07, and 0.11 nats on Qwen2.5-0.5B, with neutral-to-positive effects on mixed-quality pretraining and downstream instruction tuning.
Significance. If the corrections accurately isolate and remove the claimed finite-sample effects, the work supplies a practical, low-overhead enhancement to widely deployed preconditioned optimizers. Concrete loss deltas on a public model together with applicability to both diagonal and matrix preconditioners constitute measurable strengths; the derivations remain parameter-free beyond the microbatch partitioning choice.
major comments (2)
- [Variance-corrected inversion] Variance-corrected inversion section: the leading delta-method term is subtracted to correct bias in the inverse or inverse-root, yet no analysis, bounds, or ablation is supplied showing that quadratic and higher-order terms in the Taylor expansion remain smaller than the subtracted bias at the microbatch sizes and preconditioner variability used for Qwen2.5-0.5B pretraining. If these terms dominate, the observed 0.07–0.15 nat reductions could arise from incidental regularization rather than accurate bias removal.
- [Experimental results] Experimental results on Qwen2.5-0.5B: the reported loss deltas of 0.15, 0.07, and 0.11 nats are presented without error bars, standard deviations across seeds, or statistical significance tests. This omission leaves open whether the improvements exceed training stochasticity and undermines the claim that the two biases are the dominant finite-sample effects.
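The first major comment can be probed numerically on a toy problem. The sketch below is a hypothetical check, not an experiment from the paper: it estimates by Monte Carlo the bias of $1/\hat v$ before and after subtracting the plug-in first-order term, for a user-supplied sampler. The function names, the uniform sampler, and the trial count are assumptions chosen for illustration.

```python
import random


def inverse_bias_check(sampler, true_mean, m, trials=20000, seed=0):
    """Estimate the bias of 1/v_hat with and without first-order correction.

    sampler(rng) draws one positive per-microbatch statistic; v_hat is the
    mean of m draws. Returns (raw_bias, residual_bias_after_correction).
    """
    rng = random.Random(seed)
    raw_sum = 0.0
    corrected_sum = 0.0
    for _ in range(trials):
        xs = [sampler(rng) for _ in range(m)]
        v_hat = sum(xs) / m
        # Unbiased sample variance of the draws, scaled to Var(v_hat).
        sample_var = sum((x - v_hat) ** 2 for x in xs) / (m - 1)
        var_v_hat = sample_var / m
        inv = 1.0 / v_hat
        raw_sum += inv
        # Plug-in delta-method correction for f(v) = 1/v.
        corrected_sum += inv - var_v_hat / v_hat ** 3
    target = 1.0 / true_mean
    return raw_sum / trials - target, corrected_sum / trials - target


def uniform_sampler(rng):
    # Benign toy distribution: mean 1.0, bounded away from zero.
    return rng.uniform(0.5, 1.5)
```

Calling `inverse_bias_check(uniform_sampler, 1.0, m=4)` compares the two biases in a low-variability regime. Swapping in a heavier-tailed sampler or a smaller `m` is one way to see when the residual stops being small relative to the raw bias, which is the regime the referee asks the authors to rule out.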
minor comments (2)
- [Methods] A short pseudocode or diagram clarifying the microbatch partitioning for cross-fitting would improve reproducibility of the single-batch implementation.
- [Variance-corrected inversion] Notation for the delta-method expansion could be made more explicit by labeling the order of each term in the relevant equation.
Simulated Author's Rebuttal
Thank you for the detailed review. We appreciate the opportunity to clarify and strengthen our manuscript. Below we respond to each major comment.
- Referee: [Variance-corrected inversion] Variance-corrected inversion section: the leading delta-method term is subtracted to correct bias in the inverse or inverse-root, yet no analysis, bounds, or ablation is supplied showing that quadratic and higher-order terms in the Taylor expansion remain smaller than the subtracted bias at the microbatch sizes and preconditioner variability used for Qwen2.5-0.5B pretraining. If these terms dominate, the observed 0.07–0.15 nat reductions could arise from incidental regularization rather than accurate bias removal.
  Authors: We thank the referee for highlighting this important point. While the delta method provides the leading bias term, we recognize that a formal analysis of the remainder terms would be valuable. In the revised manuscript, we will include an appendix deriving bounds on the higher-order terms under assumptions on the preconditioner variability, and provide empirical ablations showing that the correction remains effective even when microbatch sizes are varied. This will help confirm that the observed improvements stem from bias correction rather than regularization effects.
  Revision: yes
- Referee: [Experimental results] Experimental results on Qwen2.5-0.5B: the reported loss deltas of 0.15, 0.07, and 0.11 nats are presented without error bars, standard deviations across seeds, or statistical significance tests. This omission leaves open whether the improvements exceed training stochasticity and undermines the claim that the two biases are the dominant finite-sample effects.
  Authors: We agree that reporting variability across runs is essential for robust claims. The current manuscript presents single-run results for computational reasons, but we will rerun the experiments with at least three independent seeds and include error bars, standard deviations, and p-values from paired t-tests or similar in the revised version. Preliminary checks suggest the improvements are consistent across seeds, but we will document this explicitly.
  Revision: yes
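The seed-level comparison promised in this response could be computed along the following lines. This is a sketch using only the Python standard library; `paired_t_statistic` is a hypothetical helper, and it assumes each baseline run is paired with a corrected run that shares its seed.

```python
import math
import statistics


def paired_t_statistic(baseline_losses, corrected_losses):
    """Paired t statistic for per-seed held-out losses.

    Returns (mean_difference, t_statistic, degrees_of_freedom), where the
    difference is baseline minus corrected, so positive means improvement.
    """
    if len(baseline_losses) != len(corrected_losses):
        raise ValueError("each baseline run needs a matching corrected run")
    diffs = [b - c for b, c in zip(baseline_losses, corrected_losses)]
    n = len(diffs)
    if n < 2:
        raise ValueError("need at least two seeds")
    mean_diff = statistics.fmean(diffs)
    sd = statistics.stdev(diffs)
    if sd == 0.0:
        raise ValueError("differences have zero variance; t is undefined")
    t_stat = mean_diff / (sd / math.sqrt(n))
    return mean_diff, t_stat, n - 1
```

The p-value then follows from a Student t distribution with the returned degrees of freedom; `scipy.stats.ttest_rel` performs the same test directly. With only three seeds the test has two degrees of freedom, so even a consistent improvement needs a large t statistic to reach conventional significance.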
Circularity Check
No circularity: framework defined from independent microbatch statistics and first-order expansion; results are held-out empirical measurements
The paper defines cross-fitted preconditioning via independent microbatch groups and variance-corrected inversion via the leading delta-method term subtracted from observed microbatch variability. These constructions are explicit functions of the data splits and Taylor expansion; they do not presuppose the final loss reduction. The reported improvements (0.15/0.07/0.11 nats on held-out pretraining loss for AdamW/Sophia/Shampoo) are measured on separate validation data rather than being fitted parameters or tautological re-expressions of the correction itself. No self-citation chain is invoked to justify the central premise, and the derivation remains self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (2)
- standard math: The delta method provides a first-order accurate approximation to the bias of the inverse or inverse-root of a sample covariance or curvature matrix when the sample size is moderate.
- domain assumption: Microbatch statistics computed on disjoint partitions of the same minibatch are sufficiently independent for bias estimation.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "variance-corrected inversion uses microbatch variability to subtract the leading delta-method bias term"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.