BROS achieves memory-efficient single-loop stochastic bilevel optimization with O(ε^{-2}) sample complexity by performing updates in randomized subspaces and using Rademacher bi-probe correction for unbiased estimation.
Subspace optimization for large language models with convergence guarantees
5 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
fields
cs.LG 5roles
background 1polarities
background 1representative citing papers
Pro-KLShampoo projects KL-Shampoo preconditioners to a spike-and-flat parametric form on an r-dimensional subspace and recovers the full algebraic preconditioner via orthogonalization, outperforming KL-Shampoo on GPT-2 and LLaMA pre-training scales.
CR-Net uses cross-layer low-rank residuals in a dual-path network plus specialized recomputation to outperform prior low-rank methods on 60M-7B model pre-training while using less compute and memory.
DP-GRAPE reduces memory in differentially private neural network training by using random Gaussian projections on gradients instead of SVD, achieving comparable privacy-utility tradeoffs to DP-SGD and scaling to 6.7B parameter models.
GWT projects gradients into wavelet subspaces to compress optimizer states for memory-efficient LLM training while claiming performance parity with full-rank updates.
citing papers explorer
-
BROS: Bias-Corrected Randomized Subspaces for Memory-Efficient Single-Loop Bilevel Optimization
BROS achieves memory-efficient single-loop stochastic bilevel optimization with O(ε^{-2}) sample complexity by performing updates in randomized subspaces and using Rademacher bi-probe correction for unbiased estimation.
-
Pro-KLShampoo: Projected KL-Shampoo with Whitening Recovered by Orthogonalization
Pro-KLShampoo projects KL-Shampoo preconditioners to a spike-and-flat parametric form on an r-dimensional subspace and recovers the full algebraic preconditioner via orthogonalization, outperforming KL-Shampoo on GPT-2 and LLaMA pre-training scales.
-
CR-Net: Scaling Parameter-Efficient Training with Cross-Layer Low-Rank Structure
CR-Net uses cross-layer low-rank residuals in a dual-path network plus specialized recomputation to outperform prior low-rank methods on 60M-7B model pre-training while using less compute and memory.
-
Memory-Efficient Differentially Private Training with Gradient Random Projection
DP-GRAPE reduces memory in differentially private neural network training by using random Gaussian projections on gradients instead of SVD, achieving comparable privacy-utility tradeoffs to DP-SGD and scaling to 6.7B parameter models.
-
GWT: Scalable Optimizer State Compression for Large Language Model Training
GWT projects gradients into wavelet subspaces to compress optimizer states for memory-efficient LLM training while claiming performance parity with full-rank updates.