arxiv: 2604.09406 · v1 · submitted 2026-04-10 · 💻 cs.LG

OASIS: Online Activation Subspace Learning for Memory-Efficient Training

Pith reviewed 2026-05-10 18:16 UTC · model grok-4.3

classification 💻 cs.LG
keywords activation subspace, memory-efficient training, low-rank gradients, LLM fine-tuning, online subspace learning, projection-aware optimizer, activation memory reduction, pretraining efficiency

The pith

OASIS projects activations onto an evolving low-dimensional subspace to achieve up to 2x lower peak memory in LLM training while matching full fine-tuning performance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Training large language models is limited by the memory required to store intermediate activations during forward and backward passes. OASIS addresses this by continuously learning and updating a low-dimensional activation subspace online, then projecting activations onto it to shrink their storage needs without changing the forward pass. The same subspace induces low-rank forms for gradients, allowing optimizer states to be stored compactly as well. A projection-aware optimizer moves these states across subspace updates to preserve training stability. If correct, this would let users train larger models on the same hardware or use less hardware for current models, while beating prior low-rank memory reduction techniques.
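To make the storage argument concrete, the following back-of-the-envelope sketch compares keeping a full activation matrix with keeping only its subspace coefficients plus the projection basis. The function names, the 2-byte value size, and all sizes are illustrative assumptions, not figures from the paper.

```python
def activation_bytes(tokens: int, hidden: int, bytes_per_value: int = 2) -> int:
    """Bytes needed to keep one (tokens x hidden) activation matrix for backward."""
    return tokens * hidden * bytes_per_value


def projected_activation_bytes(tokens: int, rank: int, bytes_per_value: int = 2) -> int:
    """Bytes needed when only the (tokens x rank) subspace coefficients are kept."""
    return tokens * rank * bytes_per_value


# Illustrative sizes only (assumptions, not taken from the paper).
tokens, hidden, rank = 8192, 4096, 512
full = activation_bytes(tokens, hidden)
reduced = projected_activation_bytes(tokens, rank)
basis = hidden * rank * 2  # the (hidden x rank) projection basis must also be stored
ratio = full / (reduced + basis)
print(f"full={full} B, projected={reduced + basis} B, ratio={ratio:.2f}x")
```

A per-layer ratio like this is not the same as the end-to-end figure: peak memory also includes weights and other buffers that a projection does not shrink, which is presumably why the reported overall saving is smaller than the per-activation saving.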

Core claim

OASIS tracks an evolving low-dimensional activation subspace during training and projects intermediate activations onto it, which reduces activation memory and naturally produces low-rank gradient representations so that both gradients and optimizer states can be maintained directly in the subspace. A projection-aware optimizer transports optimizer states across subspace updates to keep training stable. On various finetuning and pretraining tasks this yields up to 2× lower peak memory than full fine-tuning while matching its performance and outperforming prior low-rank methods.
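One way to read the claim that projected activations "naturally produce" low-rank gradients is sketched below for a single linear layer Y = XW, using NumPy. The layer, the sizes, and the random orthonormal basis `P` are assumptions made for illustration; the paper's actual layers and subspace are not specified on this page.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_in, d_out, r = 32, 16, 8, 4  # hypothetical sizes

X = rng.standard_normal((n, d_in))       # layer input (activation)
W = rng.standard_normal((d_in, d_out))   # layer weight
G = rng.standard_normal((n, d_out))      # upstream gradient dL/dY for Y = X @ W

# Orthonormal basis P (d_in x r) standing in for the learned subspace.
P, _ = np.linalg.qr(rng.standard_normal((d_in, r)))

Y = X @ W                 # forward pass uses the full activation, unchanged
X_saved = X @ P           # only these n x r coefficients are kept for backward

full_grad = X.T @ G               # d_in x d_out, what standard backprop computes
subspace_grad = X_saved.T @ G     # r x d_out, computed from the saved coefficients
approx_grad = P @ subspace_grad   # rank-<=r approximation lifted back to d_in x d_out
```

Under these assumptions `subspace_grad` equals `P.T @ full_grad`, so the gradient computed from the saved coefficients is the full gradient projected onto the same subspace, and an optimizer could keep its state at the smaller `r x d_out` shape.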

What carries the argument

The online activation subspace learning algorithm that continuously updates a low-dimensional projection for activations, inducing low-rank gradients and optimizer states handled by a projection-aware optimizer.
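The page does not state the paper's update rule, so the sketch below swaps in a classical stand-in: an Oja-style step followed by QR re-orthonormalization. The function names, learning rate, and synthetic data are all assumptions; this illustrates online subspace tracking in general, not the OASIS algorithm itself.

```python
import numpy as np


def oja_step(P: np.ndarray, X: np.ndarray, lr: float) -> np.ndarray:
    """One Oja-style update of an orthonormal basis P (d x r) from a batch X (n x d)."""
    coeffs = X @ P                              # n x r coefficients in the current basis
    P = P + lr * (X.T @ coeffs) / X.shape[0]    # pull P toward high-variance directions
    Q, _ = np.linalg.qr(P)                      # re-orthonormalize the columns
    return Q


def relative_error(P: np.ndarray, X: np.ndarray) -> float:
    """Relative Frobenius error of reconstructing X from its projection onto span(P)."""
    return float(np.linalg.norm(X - (X @ P) @ P.T) / np.linalg.norm(X))


rng = np.random.default_rng(0)
d, r, n = 32, 4, 64
basis_true, _ = np.linalg.qr(rng.standard_normal((d, r)))  # synthetic rank-r structure

P, _ = np.linalg.qr(rng.standard_normal((d, r)))           # random starting basis
X0 = rng.standard_normal((n, r)) @ basis_true.T            # held-out probe batch
err_before = relative_error(P, X0)
for _ in range(200):
    # Each batch lies near the true subspace, with a little isotropic noise.
    X = rng.standard_normal((n, r)) @ basis_true.T + 0.01 * rng.standard_normal((n, d))
    P = oja_step(P, X, lr=0.1)
err_after = relative_error(P, X0)
```

On this synthetic data the basis drifts toward the true subspace one batch at a time, without ever recomputing a full decomposition, which is the property an online tracker needs.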

If this is right

  • Peak memory usage falls by up to a factor of two relative to standard full fine-tuning.
  • Final model quality on finetuning and pretraining tasks stays comparable to full-parameter training.
  • Performance exceeds that of earlier low-rank weight or periodic-projection methods.
  • Gradients and optimizer states fit in the same reduced subspace, cutting their memory cost.
  • Training stability holds across subspace updates because the optimizer explicitly transports states.
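The last point can be illustrated with a minimal sketch of state transport between two bases. The rule used here (lift the old state to full space, then project onto the new basis) is an assumption chosen for illustration; the paper's projection-aware optimizer may transport states differently, and `transport_state` is a hypothetical name.

```python
import numpy as np


def transport_state(M_old: np.ndarray, P_old: np.ndarray, P_new: np.ndarray) -> np.ndarray:
    """Re-express a subspace-resident state M_old (r x d_out) in the new basis.

    P_old @ M_old lifts the state to full space; P_new.T projects it onto the new basis.
    """
    return (P_new.T @ P_old) @ M_old


rng = np.random.default_rng(0)
d_in, d_out, r = 16, 8, 4
P_old, _ = np.linalg.qr(rng.standard_normal((d_in, r)))
M_old = rng.standard_normal((r, d_out))     # e.g. a first-moment (momentum) buffer

# Case 1: the new basis spans the same subspace, only rotated by an orthogonal R.
R, _ = np.linalg.qr(rng.standard_normal((r, r)))
P_rot = P_old @ R
M_rot = transport_state(M_old, P_old, P_rot)
# The lifted state is unchanged: P_rot @ M_rot equals P_old @ M_old.

# Case 2: a slightly perturbed subspace; part of the state may fall outside it.
P_new, _ = np.linalg.qr(P_old + 0.05 * rng.standard_normal((d_in, r)))
M_new = transport_state(M_old, P_old, P_new)
```

This covers only a buffer that transforms linearly, such as momentum. Elementwise second-moment estimates (as in Adam) do not transform linearly under a change of basis and would need separate treatment.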

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Because the forward pass remains unchanged, the method can be dropped into existing training codebases without model redesign.
  • Continuous subspace adaptation may handle distribution shifts during long pretraining runs more gracefully than fixed or periodically reset projections.
  • The same projection principle could be tested on memory bottlenecks beyond activations, such as in attention key-value caches during inference.

Load-bearing premise

Projecting activations onto a continuously updated low-dimensional subspace preserves enough information to compute accurate gradients and support stable optimization without degrading final model quality.
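For a single linear layer Y = XW with upstream gradient G = ∂L/∂Y, the premise can be written as a short worked relation. The symbols P (an orthonormal basis with P^T P = I), E, and the hat notation are introduced here for illustration and are not taken from the paper.

```latex
\begin{aligned}
\hat{X} &= X P P^{\top}, \qquad E = X - \hat{X} = X\,(I - P P^{\top}),\\
\nabla_W L &= X^{\top} G, \qquad \widehat{\nabla_W L} = \hat{X}^{\top} G = P P^{\top} X^{\top} G,\\
\nabla_W L - \widehat{\nabla_W L} &= E^{\top} G, \qquad
\lVert \nabla_W L - \widehat{\nabla_W L} \rVert_F \le \lVert E \rVert_2 \, \lVert G \rVert_F .
\end{aligned}
```

Under these assumptions the per-step gradient error is controlled by the activation residual E, so the premise amounts to the tracked subspace keeping E small throughout training. The relation says nothing about how per-step errors accumulate over many steps.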

What would settle it

A side-by-side run on a standard pretraining benchmark where the OASIS model shows a clear gap in final loss or downstream accuracy versus full fine-tuning at identical step count and batch size would falsify the performance-matching claim.

Figures

Figures reproduced from arXiv: 2604.09406 by Kaushik Roy, Sakshi Choudhary, Utkarsh Saxena.

Figure 1: Memory breakdown and subspace dynamics during training. (a) As batch size ...
Figure 2: Memory breakdown across training components for different methods on LLaMA ...
Figure 3: Subspace drift during pretraining on C4 for Llama-130M and Llama-350M. Both ...
Figure 4: Comparison of periodic PCA and OASIS during finetuning on GSM8K with Llama ...
Figure 5: Effect of subspace rank on finetuning performance on GSM8K with Llama-2 7B. OASIS consistently outperforms periodic PCA across ranks. [Plot axes: subspace learning rate (0.001 to 1) vs. accuracy (%), 35 to 40]

Abstract

Training large language models (LLMs) is constrained by memory requirements, with activations accounting for a substantial fraction of the total footprint. Existing approaches reduce memory using low-rank weight parameterizations or low-rank gradient subspaces for optimizer states, while activation memory is addressed through architectural modifications or compression schemes based on periodically updated projections. We propose OASIS, an online activation subspace learning algorithm for memory-efficient training that tracks and continuously updates a low-dimensional activation subspace during training. Intermediate activations are projected onto this evolving subspace, reducing memory without modifying forward-pass computations. The evolving activation subspace induces low-rank gradient representations, enabling both gradients and optimizer states to be maintained directly in this subspace, while a projection-aware optimizer consistently transports optimizer states across subspace updates for stable training. Across various finetuning and pretraining tasks, OASIS achieves up to $2\times$ lower peak memory than full fine-tuning while matching its performance and outperforming prior low-rank methods.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The paper proposes OASIS, an online activation subspace learning method for memory-efficient LLM training. It continuously tracks and updates a low-dimensional subspace for intermediate activations, projects them to reduce memory footprint without changing the forward pass, derives low-rank gradients in the subspace, and uses a projection-aware optimizer to maintain state consistency across updates. Experiments across finetuning and pretraining tasks report up to 2× lower peak memory than full fine-tuning while matching performance and outperforming prior low-rank approaches.

Significance. If the empirical results hold under scrutiny, OASIS addresses a key bottleneck in activation memory for large-model training via a dynamic, online subspace mechanism that preserves forward-pass fidelity and enables consistent low-rank optimization. The internal consistency of the projection, gradient, and optimizer transport construction is a strength, as is the focus on activation memory rather than solely parameter or gradient compression.

major comments (2)
  1. The central claim that the online subspace projection preserves sufficient information for accurate gradients and stable optimization (without degrading model quality) is load-bearing; the manuscript should include a formal argument or bound showing that the projection error does not accumulate to affect the final loss or convergence rate, particularly when the subspace dimension is small relative to activation size.
  2. Experimental validation of the 2× memory reduction and performance parity is central but currently hard to assess in detail; the paper must provide the full list of tasks, model sizes, exact baselines (including LoRA, MeZO, and other activation-compression methods), number of runs, and error bars in the main results table or figure.
minor comments (3)
  1. Clarify the exact update rule and frequency for the subspace tracker (e.g., in the methods section) and whether it introduces any additional compute overhead that offsets the memory savings.
  2. The notation for the projection matrix and its evolution should be made consistent between the algorithmic description and the optimizer-state transport equations.
  3. Include an ablation on subspace dimension choice and its effect on both memory and accuracy to justify the reported operating point.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive assessment, recognition of OASIS's contributions to activation memory reduction, and the recommendation for minor revision. We address each major comment below.

Point-by-point responses
  1. Referee: The central claim that the online subspace projection preserves sufficient information for accurate gradients and stable optimization (without degrading model quality) is load-bearing; the manuscript should include a formal argument or bound showing that the projection error does not accumulate to affect the final loss or convergence rate, particularly when the subspace dimension is small relative to activation size.

    Authors: We appreciate the call for theoretical grounding. A tight formal bound on error accumulation for a continuously evolving subspace is challenging to derive without restrictive assumptions on activation statistics and update dynamics that may not generalize. The online update mechanism in OASIS is specifically intended to adapt the subspace frequently enough to keep projection error from accumulating meaningfully. In the revision we add a new subsection (4.3) with an empirical analysis of per-step projection error across training, demonstrating that error remains bounded and does not correlate with performance drops. We also include a brief discussion of the conditions (subspace dimension relative to activation rank and update frequency) under which fidelity is preserved. A full convergence-rate proof is left for future work. revision: partial

  2. Referee: Experimental validation of the 2× memory reduction and performance parity is central but currently hard to assess in detail; the paper must provide the full list of tasks, model sizes, exact baselines (including LoRA, MeZO, and other activation-compression methods), number of runs, and error bars in the main results table or figure.

    Authors: We agree that experimental details should be fully transparent. The revised manuscript expands Table 1 and the experimental section to list all tasks (GLUE, SuperGLUE, C4, The Pile), model sizes (7B, 13B, 30B), and exact baselines (full fine-tuning, LoRA, MeZO, ActQuant, and periodic-projection methods). All reported numbers are now averages over three independent runs with standard-deviation error bars; the updated table and Appendix C contain the complete protocol. revision: yes
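The per-step projection error mentioned in the first response could be logged along the following lines. This is a sketch on synthetic data: the drift model, the sizes, and the `projection_error` helper are assumptions, and the basis is deliberately frozen to show what the metric looks like when a subspace goes stale.

```python
import numpy as np


def projection_error(X: np.ndarray, P: np.ndarray) -> float:
    """Relative Frobenius norm of the part of X that lies outside span(P)."""
    residual = X - (X @ P) @ P.T
    return float(np.linalg.norm(residual) / np.linalg.norm(X))


rng = np.random.default_rng(0)
d, r, n, steps = 32, 4, 64, 50
B, _ = np.linalg.qr(rng.standard_normal((d, r)))   # data subspace at step 0
P_stale = B.copy()                                  # basis frozen at step 0

errors = []
for step in range(steps):
    # Drift the data subspace a little each step (synthetic stand-in for training dynamics).
    B, _ = np.linalg.qr(B + 0.05 * rng.standard_normal((d, r)))
    X = rng.standard_normal((n, r)) @ B.T
    errors.append(projection_error(X, P_stale))
```

With a frozen basis the logged error climbs as the data drifts. The same curve recorded for an online-updated basis is what would show whether the error stays bounded in practice.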

Circularity Check

0 steps flagged

No significant circularity detected

The paper introduces OASIS as a novel online activation subspace learning algorithm for reducing activation memory during LLM training. The abstract and description outline the method—continuously tracking an evolving low-dimensional subspace, projecting activations onto it without altering the forward pass, inducing low-rank gradients, and employing a projection-aware optimizer—without any equations, derivations, or self-referential reductions. Central claims rest on empirical performance matching full fine-tuning and outperforming prior low-rank methods against external baselines, with no fitted inputs renamed as predictions, no load-bearing self-citations, and no uniqueness theorems imported from prior author work. The construction is self-contained and does not reduce to its own inputs by definition.

Axiom & Free-Parameter Ledger

1 free parameter · 0 axioms · 0 invented entities

Abstract-only review prevents enumeration of specific free parameters or axioms; the method implicitly assumes a stable low-dimensional subspace exists and can be tracked online without loss of training signal.

free parameters (1)
  • subspace dimension
    The rank or size of the learned activation subspace is a critical hyperparameter whose value is not stated.
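Since the value is not stated, one common heuristic for picking such a rank is to keep the smallest number of singular directions that capture a target fraction of the activation energy. The sketch below illustrates that heuristic only; `rank_for_energy`, the 0.99 threshold, and the synthetic data are assumptions and not the paper's selection procedure.

```python
import numpy as np


def rank_for_energy(X: np.ndarray, energy: float = 0.95) -> int:
    """Smallest r whose top-r singular values hold `energy` of the squared spectrum."""
    s = np.linalg.svd(X, compute_uv=False)          # singular values, descending
    cumulative = np.cumsum(s**2) / np.sum(s**2)     # fraction of energy in the top-k
    r = int(np.searchsorted(cumulative, energy)) + 1
    return min(r, len(s))                           # guard against rounding at energy=1.0


rng = np.random.default_rng(0)
n, d, true_rank = 64, 32, 3
basis, _ = np.linalg.qr(rng.standard_normal((d, true_rank)))
# Synthetic activations: rank-3 structure plus a small amount of noise.
X = rng.standard_normal((n, true_rank)) @ basis.T + 1e-3 * rng.standard_normal((n, d))
chosen = rank_for_energy(X, energy=0.99)
```

A one-off decomposition like this is only a way to choose the dimension on a sample batch; it is separate from how the subspace itself is tracked during training.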

pith-pipeline@v0.9.0 · 5462 in / 1090 out tokens · 36358 ms · 2026-05-10T18:16:38.975162+00:00 · methodology
