{"work":{"id":"c34c4978-e4f0-49cb-98d8-47f58142e848","openalex_id":null,"doi":null,"arxiv_id":"2403.03507","raw_key":null,"title":"GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection","authors":null,"authors_text":"Jiawei Zhao, Zhenyu Zhang, Beidi Chen, Zhangyang Wang, Anima Anandkumar, Yuandong Tian","year":2024,"venue":"cs.LG","abstract":"Training Large Language Models (LLMs) presents significant memory challenges, predominantly due to the growing size of weights and optimizer states. Common memory-reduction approaches, such as low-rank adaptation (LoRA), add a trainable low-rank matrix to the frozen pre-trained weight in each layer, reducing trainable parameters and optimizer states. However, such approaches typically underperform training with full-rank weights in both pre-training and fine-tuning stages since they limit the parameter search to a low-rank subspace and alter the training dynamics, and further, may require full-rank warm start. In this work, we propose Gradient Low-Rank Projection (GaLore), a training strategy that allows full-parameter learning but is more memory-efficient than common low-rank adaptation methods such as LoRA. Our approach reduces memory usage by up to 65.5% in optimizer states while maintaining both efficiency and performance for pre-training on LLaMA 1B and 7B architectures with C4 dataset with up to 19.7B tokens, and on fine-tuning RoBERTa on GLUE tasks. Our 8-bit GaLore further reduces optimizer memory by up to 82.5% and total training memory by 63.3%, compared to a BF16 baseline. Notably, we demonstrate, for the first time, the feasibility of pre-training a 7B model on consumer GPUs with 24GB memory (e.g., NVIDIA RTX 4090) without model parallel, checkpointing, or offloading strategies.","external_url":"https://arxiv.org/abs/2403.03507","cited_by_count":null,"metadata_source":"pith","metadata_fetched_at":"2026-05-25T06:55:26.401397+00:00","pith_arxiv_id":"2403.03507","created_at":"2026-05-10T23:50:57.118065+00:00","updated_at":"2026-05-25T06:55:26.401397+00:00","title_quality_ok":true,"display_title":"GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection","render_title":"GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection"},"hub":{"state":{"work_id":"c34c4978-e4f0-49cb-98d8-47f58142e848","tier":"hub","tier_reason":"10+ Pith inbound or 1,000+ external citations","pith_inbound_count":27,"external_cited_by_count":null,"distinct_field_count":6,"first_pith_cited_at":"2024-03-21T17:55:50+00:00","last_pith_cited_at":"2026-05-12T17:59:34+00:00","author_build_status":"not_needed","summary_status":"needed","contexts_status":"needed","graph_status":"needed","ask_index_status":"not_needed","reader_status":"not_needed","recognition_status":"not_needed","updated_at":"2026-06-05T05:48:34.947252+00:00","tier_text":"hub"},"tier":"hub","role_counts":[{"context_role":"background","n":7}],"polarity_counts":[{"context_polarity":"background","n":7}],"runs":{},"summary":{},"graph":{},"authors":[]}}