Recognition: 2 theorem links · Lean Theorem
GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection
Pith reviewed 2026-05-16 23:47 UTC · model grok-4.3
The pith
GaLore periodically projects full gradients onto low-rank subspaces, cutting optimizer-state memory by up to 65.5% while still training every parameter of large language models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Gradient Low-Rank Projection reduces optimizer-state memory by up to 65.5% by projecting gradients onto low-rank bases that are recomputed periodically; learning remains full-parameter rather than restricted to a frozen low-rank subspace, and the resulting models match the quality of conventional training on both C4 pre-training and GLUE fine-tuning.
What carries the argument
Periodic low-rank projection of gradients: each gradient matrix is projected onto a low-rank basis obtained by SVD, the optimizer moments are kept in that compact subspace, and the update is projected back onto the full weight matrix, so every parameter continues to be trained.
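To make the mechanism concrete, here is a minimal PyTorch sketch of a projected Adam step of this kind. It is an illustration under stated assumptions (a single 2-D weight matrix treated as a plain tensor, plain Adam moments, SVD refresh every `update_interval` steps); the class and argument names are invented here, and the authors' implementation additionally handles shape-dependent projection sides and per-layer scaling.

```python
# Illustrative sketch of gradient low-rank projection with Adam-style moments.
# Assumptions: `weight` is a plain (m, n) tensor with no autograd bookkeeping,
# rank <= min(m, n), and the basis is refreshed from the current gradient.
import torch

class GaLoreAdamSketch:
    def __init__(self, weight, rank=128, update_interval=200,
                 lr=1e-3, betas=(0.9, 0.999), eps=1e-8):
        self.weight = weight                      # (m, n) parameter matrix
        self.rank = rank
        self.update_interval = update_interval
        self.lr, self.betas, self.eps = lr, betas, eps
        self.proj = None                          # (m, r) orthonormal basis
        self.step_count = 0
        # Adam moments live in the projected (r, n) space: this is the saving.
        self.m = torch.zeros(rank, weight.shape[1], dtype=weight.dtype)
        self.v = torch.zeros(rank, weight.shape[1], dtype=weight.dtype)

    def step(self, grad):
        # Periodically refresh the low-rank basis from the current gradient.
        if self.proj is None or self.step_count % self.update_interval == 0:
            U, _, _ = torch.linalg.svd(grad, full_matrices=False)
            self.proj = U[:, :self.rank]          # top-r left singular vectors
        self.step_count += 1

        # Project the full gradient into the r-dimensional subspace.
        low_rank_grad = self.proj.T @ grad        # (r, n)

        # Ordinary Adam statistics, computed on the small projected tensor.
        b1, b2 = self.betas
        self.m = b1 * self.m + (1 - b1) * low_rank_grad
        self.v = b2 * self.v + (1 - b2) * low_rank_grad.pow(2)
        m_hat = self.m / (1 - b1 ** self.step_count)
        v_hat = self.v / (1 - b2 ** self.step_count)
        update = m_hat / (v_hat.sqrt() + self.eps)

        # Project the update back and apply it to the full weight matrix,
        # so every parameter is still touched at every step.
        self.weight -= self.lr * (self.proj @ update)
```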
If this is right
- Pre-training a 7B model becomes feasible on a single 24GB GPU without model parallelism, checkpointing, or offloading (a back-of-envelope sketch of where the memory savings come from follows this list).
- An 8-bit version further reduces optimizer memory by up to 82.5% and total training memory by 63.3%.
- Performance stays comparable to full-rank training across both pre-training and fine-tuning regimes.
- No full-rank warm-start is required, unlike some low-rank adaptation approaches.
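Hedged back-of-envelope arithmetic, not taken from the paper: assuming Adam stores two full moment tensors per weight matrix while a GaLore-style optimizer stores moments only for the projected (r, n) factor plus the (m, r) basis, the per-matrix saving at a hypothetical 7B-class layer shape is large. The 65.5% and 82.5% figures above are the paper's end-to-end measurements and are smaller than this per-matrix number because not every tensor (embeddings, norms, non-2-D states) is projected.

```python
# Back-of-envelope optimizer-state accounting. Layer shape and rank below are
# hypothetical illustrations, not the paper's exact configuration.
def adam_state_elems(m, n):
    return 2 * m * n                  # first and second moments, full size

def galore_state_elems(m, n, r):
    # Moments for the projected (r, n) gradient plus the stored (m, r) basis.
    return 2 * r * n + m * r

m, n, r = 4096, 11008, 512            # e.g. one MLP weight in a 7B-class model
full = adam_state_elems(m, n)
low = galore_state_elems(m, n, r)
print(f"full Adam state : {full:,} elements")
print(f"projected state : {low:,} elements")
print(f"per-matrix cut  : {1 - low / full:.1%}")
```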
Where Pith is reading between the lines
- The same projection idea could be applied to other first-order optimizers beyond Adam.
- Lower memory use may allow larger batch sizes or longer context lengths on the same hardware.
- Periodic recomputation opens the possibility of making the projection interval itself adaptive during training.
Load-bearing premise
That recomputing the low-rank bases at intervals keeps the projected gradients close enough to the original ones for the optimizer to reach models of comparable quality.
What would settle it
Run identical 7B pre-training on C4 with both GaLore and a full-memory baseline and compare final validation perplexity or downstream GLUE scores; a large gap would falsify the claim.
Original abstract
Training Large Language Models (LLMs) presents significant memory challenges, predominantly due to the growing size of weights and optimizer states. Common memory-reduction approaches, such as low-rank adaptation (LoRA), add a trainable low-rank matrix to the frozen pre-trained weight in each layer, reducing trainable parameters and optimizer states. However, such approaches typically underperform training with full-rank weights in both pre-training and fine-tuning stages since they limit the parameter search to a low-rank subspace and alter the training dynamics, and further, may require full-rank warm start. In this work, we propose Gradient Low-Rank Projection (GaLore), a training strategy that allows full-parameter learning but is more memory-efficient than common low-rank adaptation methods such as LoRA. Our approach reduces memory usage by up to 65.5% in optimizer states while maintaining both efficiency and performance for pre-training on LLaMA 1B and 7B architectures with C4 dataset with up to 19.7B tokens, and on fine-tuning RoBERTa on GLUE tasks. Our 8-bit GaLore further reduces optimizer memory by up to 82.5% and total training memory by 63.3%, compared to a BF16 baseline. Notably, we demonstrate, for the first time, the feasibility of pre-training a 7B model on consumer GPUs with 24GB memory (e.g., NVIDIA RTX 4090) without model parallel, checkpointing, or offloading strategies.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces GaLore, a training method that projects gradients onto a low-rank subspace via periodic SVD-based basis updates, enabling full-parameter optimization of LLMs with substantially reduced optimizer memory (up to 65.5% savings). It reports performance parity with standard Adam on LLaMA 1B/7B pre-training using up to 19.7B C4 tokens and on RoBERTa fine-tuning for GLUE, including the first claimed demonstration of 7B pre-training on a single 24GB consumer GPU without parallelism, checkpointing, or offloading. An 8-bit variant further reduces memory.
Significance. If the empirical results hold under rigorous controls, the work has clear significance for lowering hardware barriers to LLM pre-training. Demonstrating viable 7B-scale training on consumer GPUs directly addresses a practical bottleneck and could accelerate research in resource-limited settings. The approach's retention of full-parameter learning distinguishes it from adaptation methods like LoRA.
major comments (3)
- [§3] §3 (Method), Algorithm 1: The central mechanism relies on updating low-rank bases every T steps via SVD of the gradient matrix. No analysis or bounds are given on how quickly the gradient subspace evolves during 19.7B-token pre-training; if drift exceeds the update interval, the projected direction introduces accumulating bias relative to full gradients, undermining the claim of comparable optimization dynamics.
- [§4.1] §4.1 (Pre-training experiments): The reported performance parity for LLaMA 7B lacks ablations on the free parameters r (projection rank) and T (update frequency). Without these, it is impossible to determine whether the chosen values are robust or were tuned post-hoc to match baseline quality.
- [Table 2] Table 2 (Memory and performance): The 65.5% optimizer memory reduction and GLUE results are presented without explicit confirmation that all baselines (including 8-bit Adam) used identical learning-rate schedules, batch sizes, and warm-up protocols; any mismatch would invalidate the cross-method comparison.
minor comments (2)
- [Abstract] Abstract: The phrase 'for the first time' for 7B single-GPU training should be qualified with a short citation or discussion of prior single-GPU attempts.
- [§4.3] §4.3 (8-bit variant): Clarify the interaction between 8-bit quantization and the low-rank projection step, including any additional error introduced (a toy sketch of one plausible interaction follows this list).
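One plausible reading of that interaction, sketched here as an assumption rather than as the paper's actual 8-bit scheme: the already-projected optimizer moments are stored block-wise in int8 with per-block scales, so the quantizer's rounding error adds on top of whatever bias the projection itself introduces.

```python
# Toy block-wise int8 quantization of a projected optimizer moment.
# Shapes, block size, and the moment tensor itself are hypothetical.
import torch

def quantize_blockwise(x, block=256):
    """Symmetric per-block int8 quantization of a tensor."""
    flat = x.flatten()
    pad = (-flat.numel()) % block
    flat = torch.cat([flat, flat.new_zeros(pad)]).view(-1, block)
    scale = flat.abs().amax(dim=1, keepdim=True).clamp_min(1e-12) / 127.0
    q = torch.round(flat / scale).to(torch.int8)
    return q, scale, x.shape, pad

def dequantize_blockwise(q, scale, shape, pad):
    flat = (q.float() * scale).flatten()
    if pad:
        flat = flat[:-pad]
    return flat.view(shape)

# A stand-in for the projected second moment of shape (r, n): the printed
# value is the extra error introduced by quantization alone.
v_low_rank = torch.rand(128, 4096)
q, s, shape, pad = quantize_blockwise(v_low_rank)
v_restored = dequantize_blockwise(q, s, shape, pad)
print("max abs rounding error:", (v_low_rank - v_restored).abs().max().item())
```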
Simulated Author's Rebuttal
We thank the referee for their thoughtful comments on our manuscript. We address each of the major comments below and outline the revisions we plan to make.
Point-by-point responses
-
Referee: [§3] §3 (Method), Algorithm 1: The central mechanism relies on updating low-rank bases every T steps via SVD of the gradient matrix. No analysis or bounds are given on how quickly the gradient subspace evolves during 19.7B-token pre-training; if drift exceeds the update interval, the projected direction introduces accumulating bias relative to full gradients, undermining the claim of comparable optimization dynamics.
Authors: We agree that providing analysis on the evolution of the gradient subspace would be valuable. While we do not provide theoretical bounds in the current manuscript, our experiments show that GaLore achieves performance comparable to full-rank training, indicating that the periodic updates with the selected T effectively capture the relevant subspace without significant bias accumulation. In the revised version, we will include an empirical analysis of the subspace drift, such as measuring the angle between successive low-rank bases over the course of training (sketched after these responses), to better justify the update frequency. revision: partial
-
Referee: [§4.1] §4.1 (Pre-training experiments): The reported performance parity for LLaMA 7B lacks ablations on the free parameters r (projection rank) and T (update frequency). Without these, it is impossible to determine whether the chosen values are robust or were tuned post-hoc to match baseline quality.
Authors: We appreciate this suggestion. The values of r and T were chosen based on preliminary experiments to balance memory savings and performance, but we acknowledge the need for more comprehensive ablations. In the revision, we will add ablation studies varying r and T for the LLaMA 7B pre-training, demonstrating the robustness of the results within reasonable ranges of these hyperparameters. revision: yes
-
Referee: [Table 2] Table 2 (Memory and performance): The 65.5% optimizer memory reduction and GLUE results are presented without explicit confirmation that all baselines (including 8-bit Adam) used identical learning-rate schedules, batch sizes, and warm-up protocols; any mismatch would invalidate the cross-method comparison.
Authors: We confirm that all methods, including the 8-bit Adam baseline, were trained using identical hyperparameters: the same learning rate schedule, batch size, and warm-up protocol as detailed in Section 4. To make this explicit, we will add a clarifying statement in the caption of Table 2 and in the experimental setup section of the revised manuscript. revision: yes
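The drift diagnostic promised in the first response can be made concrete. The sketch below is illustrative rather than taken from the paper: it computes principal angles between two successive projection bases, a standard measure of how far the gradient subspace moved between refreshes; the random bases at the end are placeholders for bases logged during an actual training run.

```python
# Principal angles between the subspaces spanned by two successive low-rank
# bases. Each basis is assumed to be an orthonormal (m, r) matrix, e.g. the
# top-r left singular vectors of the gradient at two refresh steps.
import torch

def principal_angles(P_prev, P_new):
    """Return principal angles (radians) between span(P_prev) and span(P_new)."""
    # Singular values of P_prev^T P_new are the cosines of the principal angles.
    cosines = torch.linalg.svdvals(P_prev.T @ P_new).clamp(-1.0, 1.0)
    return torch.arccos(cosines)

# Hypothetical usage: small angles mean the gradient subspace drifts slowly,
# so a refresh interval T of that length introduces little projection bias.
m, r = 1024, 64
P_prev, _ = torch.linalg.qr(torch.randn(m, r))   # placeholder basis at step t
P_new, _ = torch.linalg.qr(torch.randn(m, r))    # placeholder basis at step t+T
print(principal_angles(P_prev, P_new).rad2deg())
```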
Circularity Check
No significant circularity in GaLore derivation
Full rationale
The paper proposes GaLore as an algorithmic modification to the optimizer: gradients are projected onto a low-rank subspace obtained via periodic SVD of the gradient matrix, with bases updated every T steps. Memory savings (up to 65.5% in optimizer states) are direct measurements of reduced state sizes under BF16/8-bit quantization, not quantities derived from fitted constants or self-referential equations. Performance equivalence to full Adam is shown via empirical pre-training on LLaMA 1B/7B with C4 (19.7B tokens) and fine-tuning on GLUE; no derivation step reduces to a self-citation chain, ansatz smuggled via prior work, or renaming of known results. The central claim rests on measured quantities and experimental validation rather than any load-bearing self-definition or fitted-input prediction.
Axiom & Free-Parameter Ledger
free parameters (1)
- projection rank r
axioms (1)
- Domain assumption: gradients admit a low-rank approximation that preserves sufficient directional information for effective Adam-style updates when the basis is refreshed periodically (a quick empirical check of this premise is sketched below).
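One quick empirical check of this assumption, sketched here rather than drawn from the paper: measure what fraction of a gradient matrix's energy (squared Frobenius norm) survives projection onto its top-r left singular vectors; values close to 1 at modest r support the premise, while values far below 1 would strain it.

```python
# Fraction of gradient energy captured by a rank-r projection. The synthetic
# gradient below (near-low-rank plus noise) is a hypothetical stand-in for a
# gradient matrix logged during training.
import torch

def captured_energy(grad, r):
    U, _, _ = torch.linalg.svd(grad, full_matrices=False)
    P = U[:, :r]                          # rank-r projection basis
    projected = P @ (P.T @ grad)
    return (projected.norm() / grad.norm()).item() ** 2

m, n, r = 1024, 4096, 64
low_rank_part = torch.randn(m, r) @ torch.randn(r, n)
grad = low_rank_part + 0.05 * torch.randn(m, n)
print(f"energy captured at rank {r}: {captured_energy(grad, r):.3f}")
```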
Forward citations
Cited by 20 Pith papers
-
Gradient Clipping Beyond Vector Norms: A Spectral Approach for Matrix-Valued Parameters
Spectral clipping of leading singular values in gradient matrices stabilizes SGD for non-convex problems with heavy-tailed noise and achieves the optimal convergence rate O(K^{(2-2α)/(3α-2)}).
-
Intrinsic Muon: Spectral Optimization on Riemannian Matrix Manifolds
Intrinsic Muon provides closed-form linear maximization oracles on multiple Riemannian matrix manifolds for unitarily invariant norms, with convergence rates depending only on manifold dimension or rank.
-
Muon with Nesterov Momentum: Heavy-Tailed Noise and (Randomized) Inexact Polar Decomposition
Muon with Nesterov momentum and inexact polar decomposition achieves optimal convergence rates of O(ε^(-(3α-2)/(α-1))) under heavy-tailed noise for ε-stationary points in non-convex settings.
-
BROS: Bias-Corrected Randomized Subspaces for Memory-Efficient Single-Loop Bilevel Optimization
BROS achieves memory-efficient single-loop stochastic bilevel optimization with O(ε^{-2}) sample complexity by performing updates in randomized subspaces and using Rademacher bi-probe correction for unbiased estimation.
-
BROS: Bias-Corrected Randomized Subspaces for Memory-Efficient Single-Loop Bilevel Optimization
BROS achieves the same O(ε^{-2}) sample complexity as exact single-loop SBO methods while cutting peak memory by up to 44.9% through randomized subspaces and bias-corrected Hessian estimation.
-
Pro-KLShampoo: Projected KL-Shampoo with Whitening Recovered by Orthogonalization
Pro-KLShampoo projects KL-Shampoo preconditioners to a spike-and-flat parametric form on an r-dimensional subspace and recovers the full algebraic preconditioner via orthogonalization, outperforming KL-Shampoo on GPT-...
-
AdamO: A Collapse-Suppressed Optimizer for Offline RL
AdamO modifies Adam with an orthogonality correction to ensure the spectral radius of the TD update operator stays below one, providing a theoretical stability guarantee for offline RL.
-
Muon$^2$: Boosting Muon via Adaptive Second-Moment Preconditioning
Muon² adds adaptive second-moment preconditioning to Muon, improving spectrum conditioning for faster orthogonalization, outperforming Muon on GPT and LLaMA pre-training from 60M to 1.3B parameters while cutting Newto...
-
STQuant: Spatio-Temporal Adaptive Framework for Optimizer Quantization in Large Multimodal Model Training
STQuant dynamically allocates quantization bits for optimizer states in multimodal model training, reducing memory by 84.4% to an average 5.1 bits while preserving quality on GPT-2 and ViT.
-
Scalable Variational Bayesian Fine-Tuning of LLMs via Orthogonalized Low-Rank Adapters
PoLAR-VBLL combines orthogonalized low-rank adapters with variational Bayesian last-layer inference to enable scalable, well-calibrated uncertainty quantification in fine-tuned LLMs.
-
Spectral Compact Training: Pre-Training Large Language Models via Permanent Truncated SVD and Stiefel QR Retraction
SCT pre-trains LLMs by keeping weights as compact SVD factors with Stiefel QR retraction, delivering up to 199x memory reduction per layer and allowing 70B-parameter training on a Steam Deck.
-
BOOST: BOttleneck-Optimized Scalable Training Framework for Low-Rank Large Language Models
BOOST delivers 1.46-2.27x end-to-end speedups for low-rank bottleneck LLMs by redesigning tensor parallelism around the bottleneck structure plus supporting optimizations.
-
Pion: A Spectrum-Preserving Optimizer via Orthogonal Equivalence Transformation
Pion is an optimizer that preserves the singular values of weight matrices in LLM training by applying orthogonal equivalence transformations.
-
Rethinking Local Learning: A Cheaper and Faster Recipe for LLM Post-Training
LoPT achieves competitive task performance in LLM post-training by limiting task gradients to the upper model half and training the lower half with local feature reconstruction.
-
Rethinking Local Learning: A Cheaper and Faster Recipe for LLM Post-Training
LoPT delivers competitive LLM post-training results by training only the top half on the task objective and using feature reconstruction to update the bottom half.
-
ELAS: Efficient Pre-Training of Low-Rank Large Language Models via 2:4 Activation Sparsity
ELAS pre-trains low-rank LLMs by applying 2:4 activation sparsity after squared ReLU to cut memory and accelerate training with minimal performance loss.
-
Agentic Driving Coach: Robustness and Determinism of Agentic AI-Powered Human-in-the-Loop Cyber-Physical Systems
A Lingua Franca reactor-based method is proposed to address nondeterminism in agentic AI for human-in-the-loop cyber-physical systems such as driving coaches.
-
MUON+: Towards More Effective Muon via One Additional Normalization Step for LLM Pre-training
Muon+ adds one normalization step after polar orthogonalization in the Muon optimizer, yielding lower training and validation perplexity and faster pre-training across 60M-7B models.
-
AdaFRUGAL: Adaptive Memory-Efficient Training with Dynamic Control
AdaFRUGAL automates FRUGAL's static hyperparameters with linear decay on subspace ratio and loss-aware update frequency, delivering competitive accuracy with lower memory and faster training on C4, VietVault, and GLUE.
-
Parameter-Efficient Fine-Tuning for Large Models: A Comprehensive Survey
A comprehensive survey of PEFT algorithms for large models, covering their performance, overhead, applications, and real-world system implementations.
Reference graph
Works this paper leans on
-
[1]
Memory efficient adaptive optimization
Anil, R., Gupta, V., Koren, T., and Singer, Y. Memory efficient adaptive optimization. Advances in Neural Information Processing Systems, 2019
work page 2019
-
[2]
Belle: Be everyone's large language model engine
BELLEGroup. Belle: Be everyone's large language model engine. https://github.com/LianjiaTech/BELLE, 2023
work page 2023
-
[3]
Continual learning in low-rank orthogonal subspaces
Chaudhry, A., Khan, N., Dokania, P., and Torr, P. Continual learning in low-rank orthogonal subspaces. Advances in Neural Information Processing Systems, 2020
work page 2020
-
[4]
Non-Convex Projected Gradient Descent for Generalized Low-Rank Tensor Regression
Chen, H., Raskutti, G., and Yuan, M. Non-Convex Projected Gradient Descent for Generalized Low-Rank Tensor Regression. Journal of Machine Learning Research, 2019
work page 2019
-
[5]
Training Deep Nets with Sublinear Memory Cost
Chen, T., Xu, B., Zhang, C., and Guestrin, C. Training Deep Nets with Sublinear Memory Cost. ArXiv preprint arXiv:1604.06174, 2016
work page · arXiv 2016
-
[6]
Chen, Y. and Wainwright, M. J. Fast low-rank estimation by projected gradient descent: General statistical and algorithmic guarantees. ArXiv preprint arXiv:1509.03025, 2015
work page · arXiv 2015
-
[7]
PaLM: Scaling Language Modeling with Pathways
Chowdhery, A., Narang, S., Devlin, J., Bosma, M., Mishra, G., Roberts, A., Barham, P., Chung, H. W., Sutton, C., Gehrmann, S., et al. PaLM: Scaling language modeling with pathways. Journal of Machine Learning Research, 2023
work page 2023
-
[8]
Cosson, R., Jadbabaie, A., Makur, A., Reisizadeh, A., and Shah, D. Low-Rank Gradient Descent. IEEE Open Journal of Control Systems, 2023
work page 2023
-
[9]
8-bit optimizers via block-wise quantization
Dettmers, T., Lewis, M., Shleifer, S., and Zettlemoyer, L. 8-bit optimizers via block-wise quantization. In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022 . OpenReview.net, 2022
work page 2022
-
[10]
Qlora: Efficient finetuning of quantized llms
Dettmers, T., Pagnoni, A., Holtzman, A., and Zettlemoyer, L. Qlora: Efficient finetuning of quantized llms. Advances in Neural Information Processing Systems, 2024
work page 2024
-
[11]
Delta Tuning: A Comprehensive Study of Parameter Efficient Methods for Pre-trained Language Models
Ding, N., Qin, Y., Yang, G., Wei, F., Yang, Z., Su, Y., Hu, S., Chen, Y., Chan, C.-M., Chen, W., Yi, J., Zhao, W., Wang, X., Liu, Z., Zheng, H.-T., Chen, J., Liu, Y., Tang, J., Li, J., and Sun, M. Delta Tuning: A Comprehensive Study of Parameter Efficient Methods for Pre-trained Language Models. ArXiv preprint arXiv:2203.06904, 2022
-
[12]
An image is worth 16x16 words: Transformers for image recognition at scale
Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., and Houlsby, N. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations, 2021
work page 2021
-
[13]
Low-rank gradient approximation for memory-efficient on-device training of deep neural network
Gooneratne, M., Sim, K. C., Zadrazil, P., Kabel, A., Beaufays, F., and Motta, G. Low-rank gradient approximation for memory-efficient on-device training of deep neural network. In 2020 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2020, Barcelona, Spain, May 4-8, 2020. IEEE, 2020
work page 2020
-
[14]
Gradient Descent Happens in a Tiny Subspace
Gur-Ari, G., Roberts, D. A., and Dyer, E. Gradient Descent Happens in a Tiny Subspace. ArXiv preprint arXiv:1812.04754, 2018
work page · arXiv 2018
-
[15]
Flora: Low-Rank Adapters Are Secretly Gradient Compressors
Hao, Y., Cao, Y., and Mou, L. Flora: Low-Rank Adapters Are Secretly Gradient Compressors. ArXiv preprint arXiv:2402.03293, 2024
-
[16]
Denoising diffusion probabilistic models
Ho, J., Jain, A., and Abbeel, P. Denoising diffusion probabilistic models. Advances in neural information processing systems, 2020
work page 2020
-
[17]
LoRA: Low-Rank Adaptation of Large Language Models
Hu, E. J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., and Chen, W. LoRA: Low-rank adaptation of large language models. In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022. OpenReview.net, 2022
work page 2022
-
[18]
Huang, S., Hoskins, B. D., Daniels, M. W., Stiles, M. D., and Adam, G. C. Low-Rank Gradient Descent for Memory-Efficient Training of Deep In-Memory Arrays. ACM Journal on Emerging Technologies in Computing Systems, 2023
work page 2023
-
[19]
Exploring Low Rank Training of Deep Neural Networks
Kamalakara, S. R., Locatelli, A., Venkitesh, B., Ba, J., Gal, Y., and Gomez, A. N. Exploring Low Rank Training of Deep Neural Networks. ArXiv preprint arXiv:2209.13569, 2022
-
[20]
Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings , 2015
work page 2015
-
[21]
OpenAssistant Conversations - Democratizing Large Language Model Alignment
Köpf, A., Kilcher, Y., von Rütte, D., Anagnostidis, S., Tam, Z. R., Stevens, K., Barhoum, A., Nguyen, D., Stanley, O., Nagyfi, R., et al. OpenAssistant conversations - democratizing large language model alignment. Advances in Neural Information Processing Systems, 2024
work page 2024
-
[22]
How many degrees of freedom do we need to train deep networks: a loss landscape perspective
Larsen, B. W., Fort, S., Becker, N., and Ganguli, S. How many degrees of freedom do we need to train deep networks: a loss landscape perspective. In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022. OpenReview.net, 2022
work page 2022
-
[23]
Lee, Y. and Choi, S. Gradient-based meta-learning with learned layerwise metric and subspace. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018. PMLR, 2018
work page 2018
-
[24]
Memory efficient optimizers with 4-bit states
Li, B., Chen, J., and Zhu, J. Memory efficient optimizers with 4-bit states. Advances in Neural Information Processing Systems, 2024
work page 2024
-
[25]
ReLoRA: High-rank training through low-rank updates
Lialin, V., Muckatira, S., Shivagunde, N., and Rumshisky, A. ReLoRA: High-rank training through low-rank updates. In The Twelfth International Conference on Learning Representations, 2024
work page 2024
-
[26]
Dynamic Mini-batch SGD for Elastic Distributed Training: Learning in the Limbo of Resources
Lin, H., Zhang, H., Ma, Y., He, T., Zhang, Z., Zha, S., and Li, M. Dynamic mini-batch sgd for elastic distributed training: Learning in the limbo of resources. arXiv preprint arXiv:1904.12043, 2019
work page · arXiv 2019
-
[27]
Loshchilov, I. and Hutter, F. Decoupled weight decay regularization. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019 . OpenReview.net, 2019
work page 2019
-
[28]
AdaLomo: Low-memory Optimization with Adaptive Learning Rate
Lv, K., Yan, H., Guo, Q., Lv, H., and Qiu, X. AdaLomo: Low-memory Optimization with Adaptive Learning Rate. ArXiv preprint arXiv:2310.10195, 2023a
-
[29]
Full Parameter Fine-tuning for Large Language Models with Limited Resources
Lv, K., Yang, Y., Liu, T., Gao, Q., Guo, Q., and Qiu, X. Full Parameter Fine-tuning for Large Language Models with Limited Resources. ArXiv preprint arXiv:2306.09782, 2023b
-
[30]
Error Feedback Can Accurately Compress Preconditioners
Modoranu, I.-V., Kalinov, A., Kurtic, E., Frantar, E., and Alistarh, D. Error Feedback Can Accurately Compress Preconditioners. ArXiv preprint arXiv:2306.06098, 2023
-
[31]
Scaling Language Models: Methods, Analysis & Insights from Training Gopher
Rae, J. W., Borgeaud, S., Cai, T., Millican, K., Hoffmann, J., Song, F., Aslanides, J., Henderson, S., Ring, R., Young, S., et al. Scaling language models: Methods, analysis & insights from training gopher. arXiv preprint arXiv:2112.11446, 2021
work page · arXiv 2021
-
[32]
Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., and Liu, P. J. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res., 2020
work page 2020
-
[33]
Zero: Memory optimizations toward training trillion parameter models
Rajbhandari, S., Rasley, J., Ruwase, O., and He, Y. Zero: Memory optimizations toward training trillion parameter models. In SC20: International Conference for High Performance Computing, Networking, Storage and Analysis, 2020
work page 2020
-
[34]
SQuAD: 100,000+ questions for machine comprehension of text
Rajpurkar, P., Zhang, J., Lopyrev, K., and Liang, P. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2016
work page 2016
-
[35]
Tied-LoRA: Enhancing parameter efficiency of LoRA with weight tying
Renduchintala, A., Konuk, T., and Kuchaiev, O. Tied-LoRA: Enhancing parameter efficiency of LoRA with weight tying. ArXiv preprint arXiv:2311.09578, 2023
-
[36]
GLU Variants Improve Transformer
Shazeer, N. Glu variants improve transformer. arXiv preprint arXiv:2002.05202, 2020
work page · arXiv 2020
-
[37]
Shazeer, N. and Stern, M. Adafactor: Adaptive learning rates with sublinear memory cost. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018. PMLR, 2018
work page 2018
-
[38]
Sheng, Y., Cao, S., Li, D., Hooper, C., Lee, N., Yang, S., Chou, C., Zhu, B., Zheng, L., Keutzer, K., Gonzalez, J. E., and Stoica, I. S-LoRA: Serving Thousands of Concurrent LoRA Adapters. ArXiv preprint arXiv:2311.03285, 2023
-
[39]
Gemma: Open Models Based on Gemini Research and Technology
Team, G., Mesnard, T., Hardin, C., Dadashi, R., Bhupatiraju, S., Pathak, S., Sifre, L., Rivière, M., Kale, M. S., Love, J., et al. Gemma: Open models based on gemini research and technology. arXiv preprint arXiv:2403.08295, 2024
work page · arXiv 2024
-
[40]
Understanding self-supervised learning with dual deep networks
Tian, Y., Yu, L., Chen, X., and Ganguli, S. Understanding self-supervised learning with dual deep networks. ArXiv preprint arXiv:2010.00578, 2020
-
[41]
Tian, Y., Wang, Y., Zhang, Z., Chen, B., and Du, S. S. JoMA: Demystifying multilayer transformers via joint dynamics of MLP and attention. In The Twelfth International Conference on Learning Representations, 2024
work page 2024
-
[42]
Llama 2: Open Foundation and Fine-Tuned Chat Models
Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023
work page · arXiv 2023
-
[43]
Vogels, T., Karimireddy, S. P., and Jaggi, M. Practical low-rank communication compression in decentralized deep learning. Advances in Neural Information Processing Systems, 2020
work page 2020
-
[44]
Wang, A., Singh, A., Michael, J., Hill, F., Levy, O., and Bowman, S. R. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019 . OpenReview.net, 2019
work page 2019
-
[45]
Atomo: Communication-efficient learning via atomic sparsification
Wang, H., Sievert, S., Liu, S., Charles, Z., Papailiopoulos, D., and Wright, S. Atomo: Communication-efficient learning via atomic sparsification. Advances in neural information processing systems, 31, 2018
work page 2018
-
[46]
Cuttlefish: Low-rank model training without all the tuning
Wang, H., Agarwal, S., Tanaka, Y., Xing, E., Papailiopoulos, D., et al. Cuttlefish: Low-rank model training without all the tuning. Proceedings of Machine Learning and Systems, 2023a
work page 2023
-
[47]
MultiLoRA: Democratizing LoRA for Better Multi-Task Learning
Wang, Y., Lin, Y., Zeng, X., and Zhang, G. MultiLoRA: Democratizing LoRA for Better Multi-Task Learning. ArXiv preprint arXiv:2311.11501, 2023b
-
[48]
Stable and low-precision training for large-scale vision-language models
Wortsman, M., Dettmers, T., Zettlemoyer, L., Morcos, A., Farhadi, A., and Schmidt, L. Stable and low-precision training for large-scale vision-language models. Advances in Neural Information Processing Systems, 2023
work page 2023
-
[49]
Chain of LoRA: Efficient Fine-tuning of Language Models via Residual Learning
Xia, W., Qin, C., and Hazan, E. Chain of LoRA: Efficient Fine-tuning of Language Models via Residual Learning. ArXiv preprint arXiv:2401.04151, 2024
-
[50]
Yang, G., Simon, J. B., and Bernstein, J. A spectral condition for feature learning. arXiv preprint arXiv:2310.17813, 2023
-
[51]
Zhai, X., Kolesnikov, A., Houlsby, N., and Beyer, L. Scaling Vision Transformers. In 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2022
work page 2022
-
[52]
Zhang, B. and Sennrich, R. Root mean square layer normalization. Advances in Neural Information Processing Systems, 32, 2019
work page 2019
-
[53]
LoRA-FA: Efficient and Effective Low Rank Representation Fine-tuning
Zhang, L., Zhang, L., Shi, S., Chu, X., and Li, B. LoRA-FA: Memory-efficient low-rank adaptation for large language models fine-tuning. arXiv preprint arXiv:2308.03303, 2023
work page · arXiv 2023
-
[54]
Zhao, J., Schaefer, F. T., and Anandkumar, A. Zero initialization: Initializing neural networks with only zeros and ones. Transactions on Machine Learning Research, 2022
work page 2022
-
[55]
Inrank: Incremental low-rank learning
Zhao, J., Zhang, Y., Chen, B., Schäfer, F., and Anandkumar, A. Inrank: Incremental low-rank learning. arXiv preprint arXiv:2306.11250, 2023