FOAM: Blocked State Folding for Memory-Efficient LLM Training
Pith reviewed 2026-05-17 00:43 UTC · model grok-4.3
The pith
FOAM compresses Adam optimizer states via block-wise gradient means and residual corrections, matching vanilla Adam's convergence rate while cutting optimizer-state memory overhead by up to 90 percent.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
FOAM folds optimizer states by replacing per-parameter moment estimates with block-wise gradient means and recovers lost detail through an explicit residual correction. Under standard non-convex optimization settings the method attains the same convergence rate as vanilla Adam. Experiments confirm that the approach eliminates up to 90 percent of optimizer-state memory, accelerates convergence in wall-clock time, and remains compatible with other memory-saving techniques while matching or exceeding the performance of both full-rank and prior compressed baselines.
What carries the argument
Block-wise gradient mean with residual correction: the mechanism that approximates Adam's first and second moments by averaging gradients inside each parameter block and adding a residual term that compensates for the induced approximation error, keeping the update direction faithful.
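The folding step can be illustrated with a minimal NumPy sketch. It assumes a flat gradient vector whose length is divisible by the block size 2^l; the helper names `fold`, `unfold`, and `residual` are hypothetical and are not taken from the FOAM code base. The sketch only shows the mean-plus-residual decomposition of a single vector, not the full optimizer update.

```python
import numpy as np

def fold(x: np.ndarray, level: int) -> np.ndarray:
    """Replace each group of 2**level consecutive elements with its mean."""
    block = 2 ** level
    # Assumption of this sketch: the length is a multiple of the block size.
    assert x.size % block == 0
    return x.reshape(-1, block).mean(axis=1)

def unfold(x_folded: np.ndarray, level: int) -> np.ndarray:
    """Broadcast every block mean back to the full length."""
    return np.repeat(x_folded, 2 ** level)

def residual(x: np.ndarray, level: int) -> np.ndarray:
    """Detail lost by folding: the original minus its block-mean expansion."""
    return x - unfold(fold(x, level), level)

# Toy gradient, blocks of size 2 (level = 1).
g = np.array([1.0, 3.0, 2.0, 6.0, -1.0, 1.0, 0.0, 4.0])
level = 1

g_folded = fold(g, level)        # one value per block: [2., 4., 0., 2.]
r = residual(g, level)           # per-element deviation from the block mean
reconstructed = unfold(g_folded, level) + r

print(g_folded)
print(np.allclose(reconstructed, g))  # the decomposition is exact for g itself
```

By construction the residual averages to zero inside every block, so folding it again yields all zeros. How FOAM carries the residual through the moment updates is specified in the paper, not in this sketch.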
If this is right
- Optimizer-state memory becomes a much smaller fraction of total training footprint, allowing larger batch sizes or model scales on fixed hardware.
- Training dynamics remain governed by the same non-convex rate guarantees as Adam, so hyper-parameter schedules transfer with little adjustment.
- The method stacks directly with other memory reducers such as low-rank adapters or quantization for additive savings.
- Wall-clock throughput rises because reduced memory traffic and fewer state updates free up compute resources.
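The first bullet can be made concrete with back-of-the-envelope arithmetic. The sketch below assumes that only the two folded moment buffers persist between steps, that each block of 2^l values is stored as one number, and that states are held in 4-byte floats; none of these storage details is confirmed by this page, and the function name is hypothetical.

```python
def optimizer_state_bytes(num_params: int, level: int = 0, bytes_per_value: int = 4) -> int:
    """Bytes for Adam-style moment buffers when each 2**level block stores one value.

    level = 0 corresponds to unfolded (vanilla) Adam states.
    Assumption: two buffers (first and second moment), nothing else persisted.
    """
    return (2 * num_params * bytes_per_value) // (2 ** level)

n = 1_000_000_000  # illustrative parameter count, not a model from the paper
full = optimizer_state_bytes(n)

for level in (1, 2, 3, 4):
    folded = optimizer_state_bytes(n, level)
    saving = 1.0 - folded / full
    print(f"level={level}: {folded / 1e9:.2f} GB, saving {saving:.1%}")
```

Under these assumptions the saving is 1 - 1/2^l, so a saving near 90 percent would fall between block sizes 8 and 16. The block size actually used to obtain the reported figure is not stated here.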
Where Pith is reading between the lines
- The same folding pattern could be applied to other adaptive methods that maintain per-parameter statistics, such as RMSprop or Lion.
- Adaptive block sizing per layer or attention head might further reduce memory while controlling approximation error.
- In distributed settings the compressed states could lower communication volume when synchronizing optimizer buffers across nodes.
Load-bearing premise
The combination of block-wise averaging and residual correction must preserve enough gradient information that the standard Adam convergence proof continues to apply without new bias or variance terms that would change the rate.
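One common way to write the rate this premise refers to is shown below. The stationarity measure and the constant C are assumptions of this sketch; the paper's theorem may be stated in a different but equivalent form.

```latex
\min_{1 \le t \le T} \; \mathbb{E}\,\bigl\|\nabla f(W_t)\bigr\|^2 \;\le\; \frac{C}{\sqrt{T}}
```

Read this way, the premise requires that folding and residual correction change only the constant C, not the dependence on T.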
What would settle it
Measure the number of steps or wall-clock time required for FOAM and vanilla Adam to reach the same validation loss on a fixed benchmark model; if the curves diverge beyond statistical noise while using identical learning rates and batch sizes, the equivalence claim is refuted.
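A minimal sketch of the steps-to-target comparison follows. The function name and the two loss curves are hypothetical placeholders, not measurements from the paper. A real test would use logged validation losses from runs that share learning rate and batch size.

```python
from typing import Optional, Sequence

def steps_to_target(losses: Sequence[float], target: float) -> Optional[int]:
    """Return the first 1-indexed step whose loss is at or below target, else None."""
    for step, loss in enumerate(losses, start=1):
        if loss <= target:
            return step
    return None

# Placeholder curves for illustration only (NOT experimental results).
adam_curve = [3.2, 2.9, 2.7, 2.5, 2.4]
foam_curve = [3.2, 2.8, 2.6, 2.5, 2.4]
target_loss = 2.5

adam_steps = steps_to_target(adam_curve, target_loss)
foam_steps = steps_to_target(foam_curve, target_loss)
print(f"Adam: {adam_steps} steps, FOAM: {foam_steps} steps")
```

Repeating the comparison over several seeds and several target losses gives the spread needed to tell a real gap from statistical noise.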
Original abstract
Large language models (LLMs) have demonstrated remarkable performance due to their large parameter counts and extensive training data. However, their scale leads to significant memory bottlenecks during training, especially when using memory-intensive optimizers like Adam. Existing memory-efficient approaches often rely on techniques such as singular value decomposition (SVD), projections, or weight freezing, which can introduce substantial computational overhead, require additional memory for projections, or degrade model performance. In this paper, we propose Folded Optimizer with Approximate Moment (FOAM), a method that compresses optimizer states by computing block-wise gradient means and incorporates a residual correction to recover lost information. Theoretically, FOAM achieves convergence rates equivalent to vanilla Adam under standard non-convex optimization settings. Empirically, FOAM eliminates up to 90% of the memory overhead of optimizer states and accelerates convergence. Furthermore, FOAM is compatible with other memory-efficient optimizers, delivering performance and throughput that match or surpass both full-rank and existing memory-efficient baselines. Code is available at https://github.com/zqOuO/FOAM.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes FOAM, a method that compresses Adam optimizer states for LLM training via block-wise gradient means plus a residual correction term. It claims this yields convergence rates equivalent to vanilla Adam in standard non-convex settings, eliminates up to 90% of optimizer-state memory overhead, accelerates convergence, and remains compatible with other memory-efficient optimizers.
Significance. If the equivalence claim is rigorously supported, FOAM would provide a practical route to lower memory footprints during large-model training while retaining Adam-level guarantees and throughput. The reported compatibility with existing methods and availability of code are positive indicators of utility.
major comments (2)
- [Theoretical analysis section (around the statement of equivalence to Adam)] The central theoretical claim (convergence equivalence to Adam) rests on the assertion that block-wise means plus residual correction preserve the first- and second-moment dynamics without introducing non-vanishing bias or extra variance terms that would invalidate the standard O(1/sqrt(T)) rate. No derivation sketch, bias bound, or variance analysis is supplied to confirm this; the residual correction is described only at a high level.
- [§4] §4 (experiments): reported memory reductions and convergence speed-ups are presented without error bars, number of independent runs, or explicit data-exclusion rules. This makes it impossible to judge whether the 90% memory saving and faster convergence are statistically reliable or sensitive to particular hyper-parameter choices.
minor comments (2)
- [Abstract] Abstract: the phrase 'accelerates convergence' should be qualified by the baseline and the metric (steps to target loss, wall-clock time, etc.).
- [Method section] Notation for the residual correction term is introduced without an explicit equation number or clear definition of the block partitioning; readers must infer the exact update rule.
Simulated Author's Rebuttal
We thank the referee for their constructive comments and for recognizing the potential utility of FOAM. We address each major comment below and will revise the manuscript to strengthen the presentation of both the theoretical analysis and the experimental results.
- Referee: [Theoretical analysis section (around the statement of equivalence to Adam)] The central theoretical claim (convergence equivalence to Adam) rests on the assertion that block-wise means plus residual correction preserve the first- and second-moment dynamics without introducing non-vanishing bias or extra variance terms that would invalidate the standard O(1/sqrt(T)) rate. No derivation sketch, bias bound, or variance analysis is supplied to confirm this; the residual correction is described only at a high level.
  Authors: We agree that a more explicit derivation would improve clarity. The current manuscript states the equivalence under standard non-convex assumptions but presents the residual correction at a high level in the main text. In the revision we will add a concise proof sketch to the theoretical analysis section that (i) shows the block-wise mean operator introduces a bias term whose expectation vanishes under the standard bounded-gradient assumption, (ii) bounds the additional variance introduced by the residual correction, and (iii) demonstrates that these terms do not alter the O(1/sqrt(T)) rate obtained by vanilla Adam. The full proof will remain in the appendix for completeness.
  Revision: yes
- Referee: [§4] §4 (experiments): reported memory reductions and convergence speed-ups are presented without error bars, number of independent runs, or explicit data-exclusion rules. This makes it impossible to judge whether the 90% memory saving and faster convergence are statistically reliable or sensitive to particular hyper-parameter choices.
  Authors: We acknowledge that the experimental section would benefit from additional statistical reporting. In the revised manuscript we will (i) report all key metrics with error bars computed over at least three independent random seeds, (ii) explicitly state the number of runs and the random-seed protocol, (iii) describe any data-exclusion or outlier-handling rules, and (iv) add a short paragraph discussing sensitivity to the block-size hyper-parameter.
  Revision: yes
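The seed-level reporting promised in the rebuttal could look like the sketch below, which uses only the Python standard library. The function name and the per-seed values are hypothetical placeholders, not results from the paper.

```python
import statistics
from typing import Sequence, Tuple

def summarize_runs(values: Sequence[float]) -> Tuple[float, float]:
    """Return (mean, sample standard deviation) of one metric across seeds."""
    # statistics.stdev needs at least two values.
    assert len(values) >= 2
    return statistics.mean(values), statistics.stdev(values)

# Placeholder final-loss values for three seeds (NOT experimental results).
final_loss_per_seed = [2.0, 2.5, 3.0]

mean, std = summarize_runs(final_loss_per_seed)
print(f"final loss: {mean:.2f} ± {std:.2f} over {len(final_loss_per_seed)} seeds")
```

With only three seeds the sample standard deviation is itself noisy, so stating the run count next to every error bar matters as much as the bar.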
Circularity Check
No significant circularity detected in convergence claim or method definition
The paper states that FOAM achieves convergence rates equivalent to vanilla Adam under standard non-convex settings by using block-wise gradient means plus residual correction. This is presented as an approximation whose error is absorbed into existing proof constants rather than being defined in terms of the target rate itself. No self-citation is invoked as a uniqueness theorem or load-bearing premise, no fitted parameter is renamed as a prediction, and no ansatz is smuggled via prior work. The derivation chain remains self-contained against the external Adam baseline with stated assumptions that do not include the FOAM result by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Standard assumptions for non-convex stochastic optimization (smoothness, bounded variance, etc.)
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel
  Tag: unclear (relation between the paper passage and the cited Recognition theorem).
  Paper passage: FOAM replaces each group of 2^l consecutive elements with their mean value... M_t = \tilde{M}_t E(l) + R_t, V_t = \tilde{V}_t E(l) + R_t^2
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.