hub Canonical reference

GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection

Jiawei Zhao, Zhenyu Zhang, Beidi Chen, Zhangyang Wang, Anima Anandkumar, Yuandong Tian · 2024 · cs.LG · arXiv 2403.03507

Canonical reference. 100% of citing Pith papers cite this work as background.

27 Pith papers citing it

Background 100% of classified citations

open full Pith review browse 27 citing papers arXiv PDF

abstract

Training Large Language Models (LLMs) presents significant memory challenges, predominantly due to the growing size of weights and optimizer states. Common memory-reduction approaches, such as low-rank adaptation (LoRA), add a trainable low-rank matrix to the frozen pre-trained weight in each layer, reducing trainable parameters and optimizer states. However, such approaches typically underperform training with full-rank weights in both pre-training and fine-tuning stages since they limit the parameter search to a low-rank subspace and alter the training dynamics, and further, may require full-rank warm start. In this work, we propose Gradient Low-Rank Projection (GaLore), a training strategy that allows full-parameter learning but is more memory-efficient than common low-rank adaptation methods such as LoRA. Our approach reduces memory usage by up to 65.5% in optimizer states while maintaining both efficiency and performance for pre-training on LLaMA 1B and 7B architectures with C4 dataset with up to 19.7B tokens, and on fine-tuning RoBERTa on GLUE tasks. Our 8-bit GaLore further reduces optimizer memory by up to 82.5% and total training memory by 63.3%, compared to a BF16 baseline. Notably, we demonstrate, for the first time, the feasibility of pre-training a 7B model on consumer GPUs with 24GB memory (e.g., NVIDIA RTX 4090) without model parallel, checkpointing, or offloading strategies.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 7

citation-polarity summary

background 7

representative citing papers

Gradient Clipping Beyond Vector Norms: A Spectral Approach for Matrix-Valued Parameters

cs.LG · 2026-05-12 · unverdicted · novelty 7.0

Spectral clipping of leading singular values in gradient matrices stabilizes SGD for non-convex problems with heavy-tailed noise and achieves the optimal convergence rate O(K^{(2-2α)/(3α-2)}).

Intrinsic Muon: Spectral Optimization on Riemannian Matrix Manifolds

cs.LG · 2026-05-10 · unverdicted · novelty 7.0

Intrinsic Muon provides closed-form linear maximization oracles on multiple Riemannian matrix manifolds for unitarily invariant norms, with convergence rates depending only on manifold dimension or rank.

Muon with Nesterov Momentum: Heavy-Tailed Noise and (Randomized) Inexact Polar Decomposition

math.OC · 2026-05-07 · unverdicted · novelty 7.0

Muon with Nesterov momentum and inexact polar decomposition achieves optimal convergence rates of O(ε^(-(3α-2)/(α-1))) under heavy-tailed noise for ε-stationary points in non-convex settings.

BROS: Bias-Corrected Randomized Subspaces for Memory-Efficient Single-Loop Bilevel Optimization

cs.LG · 2026-05-11 · unverdicted · novelty 6.0 · 2 refs

BROS achieves memory-efficient single-loop stochastic bilevel optimization with O(ε^{-2}) sample complexity by performing updates in randomized subspaces and using Rademacher bi-probe correction for unbiased estimation.

Pro-KLShampoo: Projected KL-Shampoo with Whitening Recovered by Orthogonalization

cs.LG · 2026-05-07 · unverdicted · novelty 6.0

Pro-KLShampoo projects KL-Shampoo preconditioners to a spike-and-flat parametric form on an r-dimensional subspace and recovers the full algebraic preconditioner via orthogonalization, outperforming KL-Shampoo on GPT-2 and LLaMA pre-training scales.

AdamO: A Collapse-Suppressed Optimizer for Offline RL

cs.LG · 2026-05-03 · unverdicted · novelty 6.0

AdamO modifies Adam with an orthogonality correction to ensure the spectral radius of the TD update operator stays below one, providing a theoretical stability guarantee for offline RL.

Muon$^2$: Boosting Muon via Adaptive Second-Moment Preconditioning

cs.LG · 2026-04-11 · unverdicted · novelty 6.0

Muon² adds adaptive second-moment preconditioning to Muon, improving spectrum conditioning for faster orthogonalization, outperforming Muon on GPT and LLaMA pre-training from 60M to 1.3B parameters while cutting Newton-Schulz iterations by 40%.

STQuant: Spatio-Temporal Adaptive Framework for Optimizer Quantization in Large Multimodal Model Training

cs.LG · 2026-04-08 · unverdicted · novelty 6.0

STQuant dynamically allocates quantization bits for optimizer states in multimodal model training, reducing memory by 84.4% to an average 5.1 bits while preserving quality on GPT-2 and ViT.

Scalable Variational Bayesian Fine-Tuning of LLMs via Orthogonalized Low-Rank Adapters

cs.LG · 2026-04-03 · unverdicted · novelty 6.0

PoLAR-VBLL combines orthogonalized low-rank adapters with variational Bayesian last-layer inference to enable scalable, well-calibrated uncertainty quantification in fine-tuned LLMs.

Spectral Compact Training: Pre-Training Large Language Models via Permanent Truncated SVD and Stiefel QR Retraction

cs.LG · 2026-04-01 · conditional · novelty 6.0

SCT pre-trains LLMs by keeping weights as compact SVD factors with Stiefel QR retraction, delivering up to 199x memory reduction per layer and allowing 70B-parameter training on a Steam Deck.

BOOST: BOttleneck-Optimized Scalable Training Framework for Low-Rank Large Language Models

cs.LG · 2025-12-13 · unverdicted · novelty 6.0

BOOST delivers 1.46-2.27x end-to-end speedups for low-rank bottleneck LLMs by redesigning tensor parallelism around the bottleneck structure plus supporting optimizations.

CR-Net: Scaling Parameter-Efficient Training with Cross-Layer Low-Rank Structure

cs.LG · 2025-09-23 · unverdicted · novelty 6.0

CR-Net uses cross-layer low-rank residuals in a dual-path network plus specialized recomputation to outperform prior low-rank methods on 60M-7B model pre-training while using less compute and memory.

Low-rank surrogate modeling and stochastic zero-order optimization for training of neural networks with black-box layers

cs.LG · 2025-09-18 · unverdicted · novelty 6.0

A framework combining stochastic zeroth-order optimization and dynamic low-rank surrogate modeling with an implicit projector-splitting integrator enables end-to-end training of hybrid neural networks containing black-box physical layers and reaches near-digital accuracy on vision, audio, and text任务

Memory-Efficient Differentially Private Training with Gradient Random Projection

cs.LG · 2025-06-18 · conditional · novelty 6.0

DP-GRAPE reduces memory in differentially private neural network training by using random Gaussian projections on gradients instead of SVD, achieving comparable privacy-utility tradeoffs to DP-SGD and scaling to 6.7B parameter models.

PrefixMemory-Tuning: Modernizing Prefix-Tuning by Decoupling the Prefix from Attention

cs.CL · 2025-06-16 · unverdicted · novelty 6.0

PrefixMemory-Tuning decouples the prefix from attention to overcome performance limits of traditional prefix-tuning and reaches competitive results with modern PEFT methods on LLM adaptation benchmarks.

GWT: Scalable Optimizer State Compression for Large Language Model Training

cs.LG · 2025-01-13 · unverdicted · novelty 6.0

GWT projects gradients into wavelet subspaces to compress optimizer states for memory-efficient LLM training while claiming performance parity with full-rank updates.

Proof-of-Learning with Incentive Security

cs.CR · 2024-04-13 · unverdicted · novelty 6.0

The paper introduces an incentive-secure Proof-of-Learning protocol for blockchain consensus that claims provable security against two attacks, reduced computational overhead, and guarantees even with untrusted problem providers and verifiers.

Pion: A Spectrum-Preserving Optimizer via Orthogonal Equivalence Transformation

cs.LG · 2026-05-12 · unverdicted · novelty 5.0

Pion is an optimizer that preserves the singular values of weight matrices in LLM training by applying orthogonal equivalence transformations.

Rethinking Local Learning: A Cheaper and Faster Recipe for LLM Post-Training

cs.CL · 2026-05-06 · unverdicted · novelty 5.0 · 2 refs

LoPT achieves competitive task performance in LLM post-training by limiting task gradients to the upper model half and training the lower half with local feature reconstruction.

ELAS: Efficient Pre-Training of Low-Rank Large Language Models via 2:4 Activation Sparsity

cs.LG · 2026-05-05 · unverdicted · novelty 5.0

ELAS pre-trains low-rank LLMs by applying 2:4 activation sparsity after squared ReLU to cut memory and accelerate training with minimal performance loss.

HTMuon: Improving Muon via Heavy-Tailed Spectral Correction

cs.LG · 2026-03-10 · unverdicted · novelty 5.0

HTMuon modifies Muon to produce heavier-tailed updates and weight spectra via HT-SR theory, yielding up to 0.98 lower perplexity on LLaMA pretraining and serving as a plug-in for other Muon variants.

On the Convergence Analysis of Muon

stat.ML · 2025-05-29 · unverdicted · novelty 5.0

Convergence analysis shows Muon outperforms gradient descent by exploiting low-rank structure in neural network Hessians.

Agentic Driving Coach: Robustness and Determinism of Agentic AI-Powered Human-in-the-Loop Cyber-Physical Systems

cs.AI · 2026-04-13 · unverdicted · novelty 4.0

A Lingua Franca reactor-based method is proposed to address nondeterminism in agentic AI for human-in-the-loop cyber-physical systems such as driving coaches.

MUON+: Towards More Effective Muon via One Additional Normalization Step for LLM Pre-training

cs.LG · 2026-02-25 · unverdicted · novelty 4.0

Muon+ adds one normalization step after polar orthogonalization in the Muon optimizer, yielding lower training and validation perplexity and faster pre-training across 60M-7B models.

citing papers explorer

Showing 27 of 27 citing papers.

Gradient Clipping Beyond Vector Norms: A Spectral Approach for Matrix-Valued Parameters cs.LG · 2026-05-12 · unverdicted · none · ref 70 · internal anchor
Spectral clipping of leading singular values in gradient matrices stabilizes SGD for non-convex problems with heavy-tailed noise and achieves the optimal convergence rate O(K^{(2-2α)/(3α-2)}).
Intrinsic Muon: Spectral Optimization on Riemannian Matrix Manifolds cs.LG · 2026-05-10 · unverdicted · none · ref 71 · internal anchor
Intrinsic Muon provides closed-form linear maximization oracles on multiple Riemannian matrix manifolds for unitarily invariant norms, with convergence rates depending only on manifold dimension or rank.
Muon with Nesterov Momentum: Heavy-Tailed Noise and (Randomized) Inexact Polar Decomposition math.OC · 2026-05-07 · unverdicted · none · ref 55 · internal anchor
Muon with Nesterov momentum and inexact polar decomposition achieves optimal convergence rates of O(ε^(-(3α-2)/(α-1))) under heavy-tailed noise for ε-stationary points in non-convex settings.
BROS: Bias-Corrected Randomized Subspaces for Memory-Efficient Single-Loop Bilevel Optimization cs.LG · 2026-05-11 · unverdicted · none · ref 67 · 2 links · internal anchor
BROS achieves memory-efficient single-loop stochastic bilevel optimization with O(ε^{-2}) sample complexity by performing updates in randomized subspaces and using Rademacher bi-probe correction for unbiased estimation.
Pro-KLShampoo: Projected KL-Shampoo with Whitening Recovered by Orthogonalization cs.LG · 2026-05-07 · unverdicted · none · ref 20 · internal anchor
Pro-KLShampoo projects KL-Shampoo preconditioners to a spike-and-flat parametric form on an r-dimensional subspace and recovers the full algebraic preconditioner via orthogonalization, outperforming KL-Shampoo on GPT-2 and LLaMA pre-training scales.
AdamO: A Collapse-Suppressed Optimizer for Offline RL cs.LG · 2026-05-03 · unverdicted · none · ref 89 · internal anchor
AdamO modifies Adam with an orthogonality correction to ensure the spectral radius of the TD update operator stays below one, providing a theoretical stability guarantee for offline RL.
Muon$^2$: Boosting Muon via Adaptive Second-Moment Preconditioning cs.LG · 2026-04-11 · unverdicted · none · ref 2 · internal anchor
Muon² adds adaptive second-moment preconditioning to Muon, improving spectrum conditioning for faster orthogonalization, outperforming Muon on GPT and LLaMA pre-training from 60M to 1.3B parameters while cutting Newton-Schulz iterations by 40%.
STQuant: Spatio-Temporal Adaptive Framework for Optimizer Quantization in Large Multimodal Model Training cs.LG · 2026-04-08 · unverdicted · none · ref 36 · internal anchor
STQuant dynamically allocates quantization bits for optimizer states in multimodal model training, reducing memory by 84.4% to an average 5.1 bits while preserving quality on GPT-2 and ViT.
Scalable Variational Bayesian Fine-Tuning of LLMs via Orthogonalized Low-Rank Adapters cs.LG · 2026-04-03 · unverdicted · none · ref 18 · internal anchor
PoLAR-VBLL combines orthogonalized low-rank adapters with variational Bayesian last-layer inference to enable scalable, well-calibrated uncertainty quantification in fine-tuned LLMs.
Spectral Compact Training: Pre-Training Large Language Models via Permanent Truncated SVD and Stiefel QR Retraction cs.LG · 2026-04-01 · conditional · none · ref 9 · internal anchor
SCT pre-trains LLMs by keeping weights as compact SVD factors with Stiefel QR retraction, delivering up to 199x memory reduction per layer and allowing 70B-parameter training on a Steam Deck.
BOOST: BOttleneck-Optimized Scalable Training Framework for Low-Rank Large Language Models cs.LG · 2025-12-13 · unverdicted · none · ref 28 · internal anchor
BOOST delivers 1.46-2.27x end-to-end speedups for low-rank bottleneck LLMs by redesigning tensor parallelism around the bottleneck structure plus supporting optimizations.
CR-Net: Scaling Parameter-Efficient Training with Cross-Layer Low-Rank Structure cs.LG · 2025-09-23 · unverdicted · none · ref 64 · internal anchor
CR-Net uses cross-layer low-rank residuals in a dual-path network plus specialized recomputation to outperform prior low-rank methods on 60M-7B model pre-training while using less compute and memory.
Low-rank surrogate modeling and stochastic zero-order optimization for training of neural networks with black-box layers cs.LG · 2025-09-18 · unverdicted · none · ref 21 · internal anchor
A framework combining stochastic zeroth-order optimization and dynamic low-rank surrogate modeling with an implicit projector-splitting integrator enables end-to-end training of hybrid neural networks containing black-box physical layers and reaches near-digital accuracy on vision, audio, and text任务
Memory-Efficient Differentially Private Training with Gradient Random Projection cs.LG · 2025-06-18 · conditional · none · ref 20 · internal anchor
DP-GRAPE reduces memory in differentially private neural network training by using random Gaussian projections on gradients instead of SVD, achieving comparable privacy-utility tradeoffs to DP-SGD and scaling to 6.7B parameter models.
PrefixMemory-Tuning: Modernizing Prefix-Tuning by Decoupling the Prefix from Attention cs.CL · 2025-06-16 · unverdicted · none · ref 41 · internal anchor
PrefixMemory-Tuning decouples the prefix from attention to overcome performance limits of traditional prefix-tuning and reaches competitive results with modern PEFT methods on LLM adaptation benchmarks.
GWT: Scalable Optimizer State Compression for Large Language Model Training cs.LG · 2025-01-13 · unverdicted · none · ref 54 · internal anchor
GWT projects gradients into wavelet subspaces to compress optimizer states for memory-efficient LLM training while claiming performance parity with full-rank updates.
Proof-of-Learning with Incentive Security cs.CR · 2024-04-13 · unverdicted · none · ref 70 · internal anchor
The paper introduces an incentive-secure Proof-of-Learning protocol for blockchain consensus that claims provable security against two attacks, reduced computational overhead, and guarantees even with untrusted problem providers and verifiers.
Pion: A Spectrum-Preserving Optimizer via Orthogonal Equivalence Transformation cs.LG · 2026-05-12 · unverdicted · none · ref 92 · internal anchor
Pion is an optimizer that preserves the singular values of weight matrices in LLM training by applying orthogonal equivalence transformations.
Rethinking Local Learning: A Cheaper and Faster Recipe for LLM Post-Training cs.CL · 2026-05-06 · unverdicted · none · ref 33 · 2 links · internal anchor
LoPT achieves competitive task performance in LLM post-training by limiting task gradients to the upper model half and training the lower half with local feature reconstruction.
ELAS: Efficient Pre-Training of Low-Rank Large Language Models via 2:4 Activation Sparsity cs.LG · 2026-05-05 · unverdicted · none · ref 15 · internal anchor
ELAS pre-trains low-rank LLMs by applying 2:4 activation sparsity after squared ReLU to cut memory and accelerate training with minimal performance loss.
HTMuon: Improving Muon via Heavy-Tailed Spectral Correction cs.LG · 2026-03-10 · unverdicted · none · ref 34 · internal anchor
HTMuon modifies Muon to produce heavier-tailed updates and weight spectra via HT-SR theory, yielding up to 0.98 lower perplexity on LLaMA pretraining and serving as a plug-in for other Muon variants.
On the Convergence Analysis of Muon stat.ML · 2025-05-29 · unverdicted · none · ref 28 · internal anchor
Convergence analysis shows Muon outperforms gradient descent by exploiting low-rank structure in neural network Hessians.
Agentic Driving Coach: Robustness and Determinism of Agentic AI-Powered Human-in-the-Loop Cyber-Physical Systems cs.AI · 2026-04-13 · unverdicted · none · ref 42 · internal anchor
A Lingua Franca reactor-based method is proposed to address nondeterminism in agentic AI for human-in-the-loop cyber-physical systems such as driving coaches.
MUON+: Towards More Effective Muon via One Additional Normalization Step for LLM Pre-training cs.LG · 2026-02-25 · unverdicted · none · ref 40 · internal anchor
Muon+ adds one normalization step after polar orthogonalization in the Muon optimizer, yielding lower training and validation perplexity and faster pre-training across 60M-7B models.
AdaFRUGAL: Adaptive Memory-Efficient Training with Dynamic Control cs.LG · 2025-12-27 · unverdicted · none · ref 4 · internal anchor
AdaFRUGAL automates FRUGAL's static hyperparameters with linear decay on subspace ratio and loss-aware update frequency, delivering competitive accuracy with lower memory and faster training on C4, VietVault, and GLUE.
Parameter-Efficient Fine-Tuning for Large Models: A Comprehensive Survey cs.LG · 2024-03-21 · accept · none · ref 139 · internal anchor
A comprehensive survey of PEFT algorithms for large models, covering their performance, overhead, applications, and real-world system implementations.
Small Language Models (SLMs) Can Still Pack a Punch: A survey (updated 2026) cs.CL · 2025-01-03 · unverdicted · none · ref 152 · internal anchor
A literature survey of Small Language Models (1-8B parameters) that can perform comparably or better than larger models, covering general-purpose and task-specific approaches plus creation techniques.

GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer