
arxiv: 2512.07112 · v2 · pith:FUFDSRD3new · submitted 2025-12-08 · 💻 cs.LG · cs.AI

FOAM: Blocked State Folding for Memory-Efficient LLM Training

Pith reviewed 2026-05-17 00:43 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords memory-efficient optimizers · LLM training · Adam optimizer · optimizer state compression · block-wise approximation · residual correction · non-convex convergence · training memory reduction

The pith

FOAM compresses Adam optimizer states via block-wise gradient means and residual corrections to match full convergence while cutting memory overhead by up to 90 percent.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents FOAM as a way to train large language models with far less memory when using memory-heavy optimizers like Adam. It replaces the full first and second moment buffers with averages computed over blocks of parameters and adds a residual term that restores the information discarded by the averaging. The authors show that this construction preserves the standard convergence rate of Adam under typical non-convex stochastic optimization assumptions. The practical result is that the dominant memory cost of optimizer states drops sharply without requiring extra projection matrices or freezing weights. If the claim holds, practitioners can train larger models or longer schedules on the same hardware while keeping training dynamics and final quality comparable to the uncompressed baseline.

Core claim

FOAM folds optimizer states by replacing per-parameter moment estimates with block-wise gradient means and recovers lost detail through an explicit residual correction. Under standard non-convex optimization settings the method attains the same convergence rate as vanilla Adam. Experiments confirm that the approach eliminates up to 90 percent of optimizer-state memory, accelerates convergence in wall-clock time, and remains compatible with other memory-saving techniques while matching or exceeding the performance of both full-rank and prior compressed baselines.

What carries the argument

Block-wise gradient mean with residual correction: the mechanism that approximates Adam's first and second moments by averaging gradients inside each parameter block and subtracting the induced approximation error to keep the update direction faithful.
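
To make the folding idea concrete, here is a minimal NumPy sketch of a block-wise mean and its residual. It is an illustration under stated assumptions, not the authors' implementation: the function names `fold`, `unfold`, and `residual` are hypothetical, and the block size of 2^l contiguous entries along the last axis is an assumption inferred from the paper's "fold level l" wording. How FOAM feeds these quantities into its moment updates is not shown here.

```python
import numpy as np

def fold(grad: np.ndarray, level: int) -> np.ndarray:
    """Average each run of 2**level consecutive entries along the last axis.

    Assumption: blocks are contiguous and have size 2**level.
    """
    block = 2 ** level
    if grad.shape[-1] % block != 0:
        raise ValueError("last dimension must be divisible by 2**level")
    return grad.reshape(*grad.shape[:-1], -1, block).mean(axis=-1)

def unfold(folded: np.ndarray, level: int) -> np.ndarray:
    """Broadcast each block mean back to its 2**level positions."""
    return np.repeat(folded, 2 ** level, axis=-1)

def residual(grad: np.ndarray, level: int) -> np.ndarray:
    """Part of the gradient that the block means discard."""
    return grad - unfold(fold(grad, level), level)

rng = np.random.default_rng(0)
g = rng.standard_normal((4, 16))   # toy gradient, not real training data
g_folded = fold(g, level=2)        # 4x fewer entries than g
r = residual(g, level=2)

# The block means plus the residual reconstruct the gradient exactly,
# and the residual averages to zero inside every block.
print(g_folded.shape)
print(np.allclose(unfold(g_folded, 2) + r, g))
print(np.allclose(fold(r, 2), 0.0))
```

In this sketch only `g_folded` is small enough to be worth keeping across steps; the residual has the full gradient shape and is recomputed from the current gradient.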

If this is right

  • Optimizer-state memory becomes a much smaller fraction of total training footprint, allowing larger batch sizes or model scales on fixed hardware.
  • Training dynamics remain governed by the same non-convex rate guarantees as Adam, so hyper-parameter schedules transfer with little adjustment.
  • The method stacks directly with other memory reducers such as low-rank adapters or quantization for additive savings.
  • Wall-clock throughput rises because reduced memory traffic and fewer state updates free up compute resources.
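
The memory point in the first bullet can be checked with back-of-the-envelope arithmetic. The sketch below assumes that Adam stores two moment values per parameter, that folding at level l stores one value per block of 2^l parameters for each moment, and that every parameter is folded. These are simplifying assumptions for illustration; the helper names are hypothetical and the numbers are arithmetic, not measurements from the paper.

```python
def adam_state_bytes(n_params: int, bytes_per_value: int = 2) -> int:
    """Two moment buffers, one value per parameter each (BF16 = 2 bytes)."""
    return 2 * n_params * bytes_per_value

def folded_state_bytes(n_params: int, level: int, bytes_per_value: int = 2) -> int:
    """Two moment buffers with one value per block of 2**level parameters."""
    block = 2 ** level
    n_folded = -(-n_params // block)  # ceiling division
    return 2 * n_folded * bytes_per_value

def saving_fraction(n_params: int, level: int) -> float:
    """Fraction of optimizer-state memory removed by folding."""
    return 1.0 - folded_state_bytes(n_params, level) / adam_state_bytes(n_params)

n = 7_000_000_000  # a 7B-parameter model
print(f"Adam states: {adam_state_bytes(n) / 1e9:.1f} GB")
for level in range(1, 5):
    gb = folded_state_bytes(n, level) / 1e9
    print(f"level {level}: {gb:.2f} GB, saving {saving_fraction(n, level):.1%}")
```

Under these assumptions the saving is 1 - 1/2^l of the optimizer state. Real setups that keep unfolded Adam states for some modules would save less.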

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same folding pattern could be applied to other adaptive methods that maintain per-parameter statistics, such as RMSprop or Lion.
  • Adaptive block sizing per layer or attention head might further reduce memory while controlling approximation error.
  • In distributed settings the compressed states could lower communication volume when synchronizing optimizer buffers across nodes.

Load-bearing premise

The combination of block-wise averaging and residual correction must preserve enough gradient information that the standard Adam convergence proof continues to apply without new bias or variance terms that would change the rate.
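
One way to see what "enough gradient information" means is the following decomposition. It is a sketch under assumptions that are not taken from the paper: blocks have size b = 2^l, P denotes the operator that replaces every entry by the mean of its block, and the i.i.d. model in the last line is a simplification for intuition only.

```latex
% P is symmetric and idempotent, so it is an orthogonal projection.
\[
  g = P g + r, \qquad r = (I - P)\, g, \qquad \langle P g,\, r \rangle = 0,
\]
\[
  \lVert g \rVert^2 = \lVert P g \rVert^2 + \lVert r \rVert^2 .
\]
% If the entries of g are i.i.d. with zero mean and variance \sigma^2,
% each block mean has variance \sigma^2 / b, hence
\[
  \frac{\mathbb{E}\lVert r \rVert^2}{\mathbb{E}\lVert g \rVert^2}
  = 1 - \frac{1}{b} = 1 - \frac{1}{2^{l}} .
\]
```

Under this simplified model the block means alone carry only a 1/2^l share of the gradient energy, which is why the premise depends on the residual correction rather than on the averaging step by itself.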

What would settle it

Measure the number of steps or wall-clock time required for FOAM and vanilla Adam to reach the same validation loss on a fixed benchmark model; if the curves diverge beyond statistical noise while using identical learning rates and batch sizes, the equivalence claim is refuted.
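
The comparison described above can be scripted once validation-loss curves are logged. The sketch below is an assumption-laden illustration: `steps_to_target` and `gap_exceeds_noise` are hypothetical helpers, the noise rule (gap larger than k times the summed standard deviations) is a crude heuristic rather than a formal test, and the numbers in the usage lines are synthetic placeholders, not results from the paper.

```python
import statistics
from typing import Optional, Sequence

def steps_to_target(losses: Sequence[float], target: float,
                    eval_every: int = 1) -> Optional[int]:
    """First training step at which validation loss is <= target, else None."""
    for i, loss in enumerate(losses):
        if loss <= target:
            return (i + 1) * eval_every
    return None

def gap_exceeds_noise(steps_a: Sequence[float], steps_b: Sequence[float],
                      k: float = 2.0) -> bool:
    """Crude check: is the mean gap larger than k times the seed-to-seed spread?"""
    gap = abs(statistics.mean(steps_a) - statistics.mean(steps_b))
    noise = statistics.stdev(steps_a) + statistics.stdev(steps_b)
    return gap > k * noise

# Synthetic placeholder curves, evaluated every 100 steps.
curve = [3.2, 2.9, 2.6, 2.4, 2.3]
print(steps_to_target(curve, target=2.5, eval_every=100))

# Synthetic per-seed steps-to-target for two optimizers (three seeds each).
adam_steps = [400, 420, 380]
foam_steps = [390, 410, 400]
print(gap_exceeds_noise(adam_steps, foam_steps))
```

Both optimizers would need identical learning rates, batch sizes, and evaluation intervals for the comparison to be meaningful, as the paragraph above requires.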

Figures

Figures reproduced from arXiv: 2512.07112 by Dongsheng Li, Jiahuan Wang, Ping Luo, Tao Sun, Ziqing Wen.

Figure 1
Figure 1: Overview of FOAM optimizer with a fold level of l. […] due to its training efficiency. However, Adam introduces significant memory overhead, consuming twice the model size for storing optimizer states, which makes LLM pre-training and fine-tuning not only compute-intensive but also memory-bound. For instance, even with extremely small training batch sizes, pre-training a 7B model in BF16 still requires at leas…
Figure 2
Figure 2: FOAM performance preview on LLM pre-training. Figure (a) and (b): Perplexity learning curves for pre-training LLaMA-350M and LLaMA-1.3B on C4. FOAM demonstrates superior validation perplexity compared with other baselines. Figure (c): optimizer memory footprint for pre-training LLaMA models. FOAM achieves an approximate 50% reduction in overall training memory consumption, and FOAM-Mini further pushes the …
Figure 3
Figure 3: Additional Investigation of FOAM. (a) Impact of the FOAM level l: FOAM exhibits strong robustness across varying memory constraints. (b) Extended training of LLaMA-130M on 39B tokens. (c) Integration of FOAM with Adam-Mini and MUON. […] (Hoffmann et al., 2022). As shown in Figure 3b, FOAM remains stable and continues to improve, demonstrating strong robustness even under extreme training durations. Integration…
Figure 4
Figure 4: Validation PPL with or without residual. Impact of the Residual. We conduct an ablation study on the residual term in Eq. (6) by pre-training the LLaMA-60M and 130M models with and without R_t. In addition, we measure the cosine similarity between the update magnitudes produced by FOAM and those of Adam, comparing versions with and without the residual term (Appendix …
Figure 5
Figure 5: Cosine Similarities between the Update Matrices of FOAM with or without Residual and Adam. We report the average similarity across all modules within each layer. As observed, the update matrices including the residual term exhibit a higher cosine similarity with Adam's updates compared to those without the residual. Specifically, for the setting l = 3, FOAM updates maintain a cosine similarity greater than…
Figure 6
Figure 6: PPL learning curves of pre-training LLaMA-60M and 130M on C4. C. Additional Experiment Results. C.1. GPT-2 and DeBERTa Experiments. In this section, we test the performance of the proposed FOAM on GPT-2 (Radford et al., 2019) and DeBERTa (He et al., 2021) models. For each model, we adjust the learning rate within the range {5e-4, 1e-3, 2.5e-3, 5e-3, 1e-2}, keeping the memory-efficient scaling factors uncha…
Figure 7
Figure 7: Pre-training GPT-2 and DeBERTa models on C4. (a): Pre-training GPT-2-base model. (b): Pre-training DeBERTa-base model. FOAM continues to achieve leading performance on these models. [Plot residue: x-axis Training Iterations, y-axis Validation Perplexity, curves for alpha = 0.1, 0.3, 0.5, 0.7; panel (a) 60M model pre-training; second panel truncated …]
Figure 8
Figure 8: Study the effects of α in Algorithm 1. As observed, FOAM is not sensitive to the choice of α in our tests. Concretely, most current memory-efficient optimizers (Zhao et al., 2024; Jordan et al., 2024; Chen et al., 2024; Zhu et al., 2025) use a hybrid optimizer setup, employing vanilla Adam for modules like Embeddings and LayerNorm, while applying compressed-state optimization to Attention and MLP modules, w…
Figure 9
Figure 9: Comparing FOAM with Adam employing block-wise learning rates. Block-wise Adam converges more rapidly and yields a lower final validation loss than the uniform-rate variant. FOAM's performance remains on par with that of block-wise Adam. D. Future Works. Several directions remain open for further exploration: • Due to limitations in computational resources, the maximum model size we used to validate the eff…
Figure 10
Figure 10: The variation of residual energy ratio throughout training with different fold level l. We report the average energy ratio across all modules within each layer. It can be observed that the energy ratio of the residual increases as l grows, with the values concentrating around 1 − 1/2^l. This implies that the residual R_t captures most of the energy from the original gradient, highlighting the necessity of i…
Figure 11
Figure 11: Ablation study of R_t^2 on the second-order moment. The results show that incorporating residuals into the second-order moment leads to a more stable decrease in the training curve. In contrast, without residuals, the training exhibits a faster initial decrease but experiences a rise in loss during later stages. E. Benchmark Details. In this work, we evaluate our methods using several widely adopted benchm…
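
The residual energy ratio mentioned in the Figure 10 caption can be reproduced on synthetic data. The sketch below uses an i.i.d. Gaussian vector as a stand-in for a gradient and assumes contiguous blocks of size 2^l; both choices are assumptions for illustration, and `residual_energy_ratio` is a hypothetical helper, not code from the paper's repository.

```python
import numpy as np

def residual_energy_ratio(grad: np.ndarray, level: int) -> float:
    """Share of squared norm left in the residual after removing block means.

    Assumption: grad.size is divisible by 2**level and blocks are contiguous.
    """
    block = 2 ** level
    blocks = grad.reshape(-1, block)
    means = blocks.mean(axis=1, keepdims=True)
    resid = blocks - means
    return float((resid ** 2).sum() / (blocks ** 2).sum())

rng = np.random.default_rng(0)
g = rng.standard_normal(1 << 16)  # synthetic stand-in for a gradient

for level in range(1, 5):
    measured = residual_energy_ratio(g, level)
    reference = 1.0 - 1.0 / 2 ** level
    print(f"l={level}: measured {measured:.3f}, 1 - 1/2^l = {reference:.3f}")
```

For i.i.d. entries the measured ratio should sit close to 1 - 1/2^l. Real gradients are structured, so their ratios need not match this idealized value exactly.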
Original abstract

Large language models (LLMs) have demonstrated remarkable performance due to their large parameter counts and extensive training data. However, their scale leads to significant memory bottlenecks during training, especially when using memory-intensive optimizers like Adam. Existing memory-efficient approaches often rely on techniques such as singular value decomposition (SVD), projections, or weight freezing, which can introduce substantial computational overhead, require additional memory for projections, or degrade model performance. In this paper, we propose Folded Optimizer with Approximate Moment (FOAM), a method that compresses optimizer states by computing block-wise gradient means and incorporates a residual correction to recover lost information. Theoretically, FOAM achieves convergence rates equivalent to vanilla Adam under standard non-convex optimization settings. Empirically, FOAM eliminates up to 90% of the memory overhead of optimizer states and accelerates convergence. Furthermore, FOAM is compatible with other memory-efficient optimizers, delivering performance and throughput that match or surpass both full-rank and existing memory-efficient baselines. Code is available at https://github.com/zqOuO/FOAM.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes FOAM, a method that compresses Adam optimizer states for LLM training via block-wise gradient means plus a residual correction term. It claims this yields convergence rates equivalent to vanilla Adam in standard non-convex settings, eliminates up to 90% of optimizer-state memory overhead, accelerates convergence, and remains compatible with other memory-efficient optimizers.

Significance. If the equivalence claim is rigorously supported, FOAM would provide a practical route to lower memory footprints during large-model training while retaining Adam-level guarantees and throughput. The reported compatibility with existing methods and availability of code are positive indicators of utility.

major comments (2)
  1. [Theoretical analysis section (around the statement of equivalence to Adam)] The central theoretical claim (convergence equivalence to Adam) rests on the assertion that block-wise means plus residual correction preserve the first- and second-moment dynamics without introducing non-vanishing bias or extra variance terms that would invalidate the standard O(1/sqrt(T)) rate. No derivation sketch, bias bound, or variance analysis is supplied to confirm this; the residual correction is described only at a high level.
  2. [§4] §4 (experiments): reported memory reductions and convergence speed-ups are presented without error bars, number of independent runs, or explicit data-exclusion rules. This makes it impossible to judge whether the 90% memory saving and faster convergence are statistically reliable or sensitive to particular hyper-parameter choices.
minor comments (2)
  1. [Abstract] Abstract: the phrase 'accelerates convergence' should be qualified by the baseline and the metric (steps to target loss, wall-clock time, etc.).
  2. [Method section] Notation for the residual correction term is introduced without an explicit equation number or clear definition of the block partitioning; readers must infer the exact update rule.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments and for recognizing the potential utility of FOAM. We address each major comment below and will revise the manuscript to strengthen the presentation of both the theoretical analysis and the experimental results.

Point-by-point responses
  1. Referee: [Theoretical analysis section (around the statement of equivalence to Adam)] The central theoretical claim (convergence equivalence to Adam) rests on the assertion that block-wise means plus residual correction preserve the first- and second-moment dynamics without introducing non-vanishing bias or extra variance terms that would invalidate the standard O(1/sqrt(T)) rate. No derivation sketch, bias bound, or variance analysis is supplied to confirm this; the residual correction is described only at a high level.

    Authors: We agree that a more explicit derivation would improve clarity. The current manuscript states the equivalence under standard non-convex assumptions but presents the residual correction at a high level in the main text. In the revision we will add a concise proof sketch to the theoretical analysis section that (i) shows the block-wise mean operator introduces a bias term whose expectation vanishes under the standard bounded-gradient assumption, (ii) bounds the additional variance introduced by the residual correction, and (iii) demonstrates that these terms do not alter the O(1/sqrt(T)) rate obtained by vanilla Adam. The full proof will remain in the appendix for completeness. revision: yes

  2. Referee: [§4] §4 (experiments): reported memory reductions and convergence speed-ups are presented without error bars, number of independent runs, or explicit data-exclusion rules. This makes it impossible to judge whether the 90% memory saving and faster convergence are statistically reliable or sensitive to particular hyper-parameter choices.

    Authors: We acknowledge that the experimental section would benefit from additional statistical reporting. In the revised manuscript we will (i) report all key metrics with error bars computed over at least three independent random seeds, (ii) explicitly state the number of runs and the random-seed protocol, (iii) describe any data-exclusion or outlier-handling rules, and (iv) add a short paragraph discussing sensitivity to the block-size hyper-parameter. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected in convergence claim or method definition

full rationale

The paper states that FOAM achieves convergence rates equivalent to vanilla Adam under standard non-convex settings by using block-wise gradient means plus residual correction. This is presented as an approximation whose error is absorbed into existing proof constants rather than being defined in terms of the target rate itself. No self-citation is invoked as a uniqueness theorem or load-bearing premise, no fitted parameter is renamed as a prediction, and no ansatz is smuggled via prior work. The derivation chain remains self-contained against the external Adam baseline with stated assumptions that do not include the FOAM result by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The method relies on standard non-convex optimization assumptions for its convergence guarantee and introduces block partitioning and residual correction as the main algorithmic additions; no explicit free parameters or new entities are named in the abstract.

axioms (1)
  • domain assumption: Standard assumptions for non-convex stochastic optimization (smoothness, bounded variance, etc.)
    Invoked to claim that FOAM inherits Adam's convergence rate.

pith-pipeline@v0.9.0 · 5492 in / 1189 out tokens · 44322 ms · 2026-05-17T00:43:16.536911+00:00 · methodology


