pith. machine review for the scientific record.

arxiv: 2605.13652 · v1 · submitted 2026-05-13 · 💻 cs.LG · cs.AI · cs.CL

Recognition: 2 theorem links

· Lean Theorem

Beyond Perplexity: A Geometric and Spectral Study of Low-Rank Pre-Training

Authors on Pith: no claims yet

Pith reviewed 2026-05-14 19:16 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CL
keywords low-rank pre-training · loss landscape · spectral analysis · language model optimization · activation similarity · generalization · memory-efficient training · downstream performance
0 comments

The pith

Low-rank pre-training methods reach loss basins geometrically distinct from those of full-rank training, even at matched perplexity.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether low-rank pre-training approaches for language models reach the same solutions as full-rank training when validation perplexity matches. It evaluates five low-rank methods against full-rank training at three model sizes using 16 metrics that cover one-dimensional loss landscapes in random and principal directions, checkpoint interpolation paths, spectral properties of weights and updates, and layer-wise activation similarity. The measurements show that low-rank solutions occupy different basins, with full-rank training sharper along random directions but less sharp along the top principal component. Activation patterns in later layers also diverge as training proceeds. Perplexity alone does not reliably predict downstream performance, whereas the added geometric and spectral signals improve that prediction.

Core claim

Low-rank pre-training methods are not equivalent to full-rank training, nor to one another, even when validation perplexity is close. Full-rank training settles into a sharper basin than low-rank methods along random directions, while the reverse holds for the top-1 PCA direction. Each method converges to a geometrically distinct basin. Low-rank activations diverge from full-rank in later layers as training progresses, with one method tracking full-rank most closely. Validation perplexity does not translate to downstream performance at every scale.

What carries the argument

Sixteen metrics across four dimensions: one-dimensional loss landscapes along random and top-K PCA directions, one-dimensional interpolation between checkpoints, spectral structure of weights and learned updates, and activation similarity to full-rank training.
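
To make the first of these families concrete, the sketch below slices the loss along random directions around a checkpoint and reports a simple endpoint-based sharpness proxy. It is an illustration, not the paper's implementation; the per-tensor direction rescaling, the alpha grid, and the sharpness definition are assumptions.

```python
# Minimal sketch of the random-direction loss slice L(alpha) - L(0) and a
# crude sharpness proxy. Assumptions not taken from the paper: per-tensor
# rescaling of the random direction to match weight norms, the alpha grid,
# and sharpness defined as the mean centered loss at the grid endpoints.
import torch

@torch.no_grad()
def loss_slice(model, loss_fn, batch, alphas, n_directions=10, seed=0):
    """Centered loss profile averaged over random directions around a checkpoint."""
    torch.manual_seed(seed)
    base = [p.detach().clone() for p in model.parameters()]
    base_loss = loss_fn(model, batch).item()
    profile = torch.zeros(len(alphas))
    for _ in range(n_directions):
        direction = [torch.randn_like(p) for p in base]
        direction = [d * (p.norm() / (d.norm() + 1e-12)) for d, p in zip(direction, base)]
        for i, alpha in enumerate(alphas):
            for p, p0, d in zip(model.parameters(), base, direction):
                p.copy_(p0 + alpha * d)
            profile[i] += loss_fn(model, batch).item() - base_loss
    for p, p0 in zip(model.parameters(), base):   # restore the checkpoint
        p.copy_(p0)
    profile /= n_directions
    sharpness = 0.5 * (profile[0] + profile[-1]).item()   # average rise at the endpoints
    return profile, sharpness

# Example call (loss_fn is whatever maps (model, batch) to a scalar loss tensor):
# profile, sharpness = loss_slice(model, loss_fn, batch, torch.linspace(-0.002, 0.002, 11))
```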

If this is right

  • Low-rank methods cannot be substituted for one another on the basis of perplexity alone.
  • Downstream performance can differ across methods even when validation perplexity matches.
  • Geometric and spectral metrics give stronger signals for generalization than perplexity by itself.
  • Later-layer activation divergence implies that representation learning changes under rank constraints.
  • Method choice may need to be scale-dependent to approximate full-rank basin geometry.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The basin differences could produce varying robustness to data shifts or adversarial examples not captured by standard benchmarks.
  • Practitioners might monitor landscape sharpness during training to decide when a low-rank run is close enough to full-rank behavior (one concrete check is sketched after this list).
  • Hybrid update rules that periodically inject full-rank corrections could be tested to retain efficiency while steering toward preferred basins.
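
One way to act on the monitoring suggestion above is the paper's own checkpoint-interpolation metric: evaluate the loss along the straight line between a low-rank checkpoint and a full-rank reference and watch the barrier. The sketch below is a hedged illustration, not the authors' code; the grid size and the generic `loss_fn(model, batch)` callable are assumptions.

```python
# Minimal sketch of a checkpoint-interpolation check against a full-rank
# reference: loss along the straight line between two state dicts, and the
# barrier above the linear baseline of the endpoint losses.
import torch

@torch.no_grad()
def interpolation_barrier(model, state_lowrank, state_fullrank, loss_fn, batch, steps=11):
    ts = torch.linspace(0.0, 1.0, steps)
    losses = []
    for t in ts:
        mixed = {k: (1 - t) * state_lowrank[k] + t * state_fullrank[k]
                 for k in state_lowrank}
        model.load_state_dict(mixed)          # assumes both runs share one architecture
        losses.append(loss_fn(model, batch).item())
    losses = torch.tensor(losses)
    baseline = (1 - ts) * losses[0] + ts * losses[-1]
    return (losses - baseline).max().item()   # near 0: same basin; large: separate valleys
```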

Load-bearing premise

The sixteen chosen metrics on loss landscapes, spectra, and activations are sufficient to detect meaningful differences in solution quality and generalization behavior.

What would settle it

A controlled run in which low-rank and full-rank models reach identical downstream task scores and exhibit matching loss curvature along every tested direction at matched perplexity would falsify the central claim.

Figures

Figures reproduced from arXiv: 2605.13652 by Anna Rumshisky, Namrata Shivagunde, Sherin Muckatira, Vijeta Deshpande.

Figure 1
Figure 1. 1-D loss landscape: (a) random direction, (b) top-1 PCA direction. GaLore, CoLA, and ReLoRA converge to a sharper basin than full-rank. GaLore has a relatively smaller σ1 at every scale (∼3 throughout training) yet still produces moderate-to-high sharpness (∼0.005–0.007): the loss elevates substantially, indicating a very steep loss landscape along its leading direction. A similar pattern is seen in CoLA and ReL… view at source ↗
Figure 2
Figure 2. 1-D interpolation: (a) CCBH, (b) IMBH. … and ReLoRA exhibit relatively low mutual barriers. SLTrain, by contrast, shows substantially higher barriers against all other low-rank methods, placing it in a distinctly separate valley. At 130M and 350M, the full-rank versus low-rank barriers decrease, and the low-rank versus low-rank barriers increase. Fira and ReLoRA retain the smallest mutual barriers; GaLore occupies an intermed… view at source ↗
Figure 3
Figure 3. Rank and spectral metrics at 350M. …the deviation, and Fira benefits from this more than any other method. Per-layer dynamics (Row 3): for both Fira and CoLA, L2 distance grows in the later layers as training progresses, with the final layer reducing drift relative to full-rank. CoLA is directionally off at every layer (cos ≈ 0), while Fira preserves angular alignment. Linear CKA deviates most in the middle… view at source ↗
Figure 4
Figure 4. Activation deviation with full-rank baseline. view at source ↗
Figure 5
Figure 5. Zero-shot downstream performance.
Predictor                       LOSO Pearson   LOMO Pearson   R² (in-sample)
val loss only                   0.873          0.864          0.841
geometry only (8 feats)         0.498          0.431          0.558
val loss + geometry (9 feats)   0.913          0.895          0.907
view at source ↗
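
For readers who want the flavor of this comparison, the sketch below runs the same kind of leave-one-group-out linear regression on placeholder data. It assumes LOSO means leave-one-scale-out, and the feature layout and values are invented for illustration, not taken from the paper.

```python
# Minimal sketch of the predictor comparison in the Figure 5 table, assuming
# LOSO means leave-one-scale-out. The feature matrix, the 6-method x 3-scale
# layout, and the random placeholder values are illustrative, not the paper's data.
import numpy as np
from scipy.stats import pearsonr
from sklearn.linear_model import LinearRegression

def logo_pearson(X, y, groups):
    """Leave-one-group-out predictions, scored by Pearson r over all held-out points."""
    preds = np.empty_like(y, dtype=float)
    for g in np.unique(groups):
        held_out = groups == g
        reg = LinearRegression().fit(X[~held_out], y[~held_out])
        preds[held_out] = reg.predict(X[held_out])
    return pearsonr(preds, y)[0]

rng = np.random.default_rng(0)
X = rng.normal(size=(18, 9))       # column 0: val loss; columns 1-8: geometry/spectral feats
y = rng.normal(size=18)            # downstream accuracy per (method, scale) run (placeholder)
scale = np.repeat([0, 1, 2], 6)    # 3 scales x 6 runs each

print("val loss only      :", logo_pearson(X[:, :1], y, scale))
print("val loss + geometry:", logo_pearson(X, y, scale))
```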
Figure 6
Figure 6. 1-D loss landscape at 60M parameters. Top: centered loss profile L(α) − L(0) averaged over 100 random directions, at five training checkpoints (1k, 3k, 5k, 8k, 10k). Bottom-left: average sharpness with respect to training step. Bottom-right: average direction variance. ReLoRA is omitted from the plot as it makes it harder to view other methods. The plot including ReLoRA is given in Figure 7. view at source ↗
Figure 7
Figure 7. 1-D loss landscape at 60M parameters for all methods. Top: centered loss profile L(α) − L(0) averaged over 100 random directions, at five training checkpoints (1k, 3k, 5k, 8k, 10k). Bottom-left: average expected sharpness with respect to training step. Bottom-right: average direction variance. view at source ↗
Figure 8
Figure 8. 1-D loss landscape at 130M parameters. Same layout as Figure 6. view at source ↗
Figure 9
Figure 9. 1-D loss landscape at 130M parameters for all methods. view at source ↗
Figure 10
Figure 10. 1-D loss landscape at 350M parameters. Same layout as Figure 6, with checkpoints at training steps 6k, 12k, 24k, 48k, and 60k. view at source ↗
Figure 11
Figure 11. 1-D loss landscape at 350M parameters for all methods. view at source ↗
Figure 12
Figure 12. 1-D loss landscape along top-k PCA directions at 60M parameters for k ∈ {1, 5, 10, 20}. Top four rows: centered loss profile at five training checkpoints. Bottom: expected sharpness (left column of summary panels) and across-component direction variance (right column) as a function of training step, per k. view at source ↗
Figure 13
Figure 13. 1-D loss landscape along top-k PCA directions at 130M parameters for k ∈ {1, 5, 10, 20}. Layout identical to Figure 12. view at source ↗
Figure 14
Figure 14. 1-D loss landscape along top-k PCA directions at 350M parameters for k ∈ {1, 5, 10, 20}. Layout identical to Figure 12. view at source ↗
Figure 15
Figure 15. Rank and spectral metrics for 60M: effective rank, stable rank, spectral gap, and the number of singular values above 0.1, tracked over training steps. view at source ↗
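
The panel titles above correspond to standard spectral summaries of a weight matrix. A hedged sketch under common conventions (Roy-Vetterli effective rank, Frobenius-over-spectral stable rank, relative top-two spectral gap); the paper's exact definitions may differ.

```python
# Minimal sketch of the rank/spectral quantities named in Figures 3, 15, and 16.
import torch

def spectral_metrics(W: torch.Tensor, threshold: float = 0.1):
    s = torch.linalg.svdvals(W.float())                          # singular values, descending
    p = s / s.sum()
    return {
        "effective_rank": torch.exp(-(p * torch.log(p + 1e-12)).sum()).item(),
        "stable_rank": (s.pow(2).sum() / s[0].pow(2)).item(),    # ||W||_F^2 / sigma_1^2
        "spectral_gap": ((s[0] - s[1]) / s[0]).item(),           # relative top-two gap
        "num_sigma_above_threshold": int((s > threshold).sum()),
    }

# Example on a random 768x768 matrix standing in for one weight matrix:
print(spectral_metrics(torch.randn(768, 768) / 768 ** 0.5))
```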
Figure 16
Figure 16. Rank and spectral metrics for 130M. view at source ↗
Figure 17
Figure 17. Activation L2 distance, layer-wise. view at source ↗
Figure 18
Figure 18. Activation linear CKA similarity. view at source ↗
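
Linear CKA here follows Kornblith et al. (reference [11]). A minimal sketch of how a layer-wise CKA comparison against the full-rank baseline might be computed; the token-flattening step and the shapes below are assumptions.

```python
# Minimal sketch of linear CKA between activation matrices from the same inputs.
import torch

def linear_cka(X: torch.Tensor, Y: torch.Tensor) -> float:
    """X, Y: (n_examples, dim) activations of two models on the same inputs."""
    X = X - X.mean(dim=0, keepdim=True)
    Y = Y - Y.mean(dim=0, keepdim=True)
    hsic = (Y.T @ X).norm() ** 2
    return (hsic / ((X.T @ X).norm() * (Y.T @ Y).norm())).item()

acts_full = torch.randn(8, 128, 768).reshape(-1, 768)        # full-rank layer activations
acts_low = acts_full + 0.1 * torch.randn_like(acts_full)     # hypothetical low-rank run
print(linear_cka(acts_low, acts_full))                       # near 1.0 when representations agree
```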
Figure 19
Figure 19. Activation cosine similarity, layer-wise. view at source ↗
read the original abstract

Pre-training large language models is dominated by the memory cost of storing full-rank weights, gradients, and optimizer states. Low-rank pre-training has emerged to address this, and the space of methods has grown rapidly. A central question remains open: do low-rank methods produce models that generalize comparably to full-rank training, or does the rank constraint fundamentally alter the solutions reached? Existing comparisons rely almost entirely on validation perplexity from single-seed runs, often carried forward from prior literature. Yet perplexity is a poor proxy for solution quality; two methods can match on perplexity while converging to different loss landscape regions and internal representations. We close this gap by characterizing the solutions found by five low-rank pre-training methods, GaLore and Fira (memory-efficient optimizers), CoLA and SLTrain (architecture reparameterizations), and ReLoRA (adapter-style updates with periodic resets), against full-rank training at three model scales (60M, 130M, 350M). We evaluate each along 16 metrics across four dimensions: 1-D loss landscape along random/top-K PCA directions, 1-D interpolation between checkpoints, spectral structure of the weights and learned updates, and activation similarity to full-rank training. We show that low-rank methods are not equivalent to full-rank training, nor to one another, even when validation perplexity is close. Full-rank training settles into a sharper basin than low-rank methods along random directions, while the reverse holds for the top-1 PCA direction. Each method converges to a geometrically distinct basin. Low-rank activations diverge from full-rank in later layers as training progresses, with GaLore tracking full-rank most closely. Further, validation perplexity does not translate to downstream performance at every scale. Adding geometric and spectral metrics improves the prediction.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper claims that low-rank pre-training methods (GaLore, Fira, CoLA, SLTrain, ReLoRA) for LLMs at 60M–350M scales reach solutions that are geometrically and spectrally distinct from full-rank training and from each other, even at comparable validation perplexity. Using 16 metrics on 1-D loss landscapes (random and top-K PCA directions), checkpoint interpolation, weight/update spectra, and activation cosine similarities, it shows full-rank training occupies sharper basins along random directions while low-rank methods are sharper along the top-1 PCA direction, with later-layer activation divergence and distinct spectral structures. It further asserts that perplexity alone fails to predict downstream performance at every scale, but incorporating the geometric/spectral metrics improves such predictions.

Significance. If the reported distinctions are robust, the work advances evaluation of memory-efficient LLM training by demonstrating that perplexity matching does not imply solution equivalence. The multi-scale, multi-metric analysis could guide method selection and motivate new low-rank designs that target specific landscape regions. Strengths include the breadth of metrics and explicit comparison to full-rank baselines across methods.

major comments (3)
  1. [§4] Experimental Setup and Results: No error bars, number of random seeds, or statistical significance tests are reported for the 16 metrics or for the observed differences in basin sharpness and activation divergence. This is load-bearing for the non-equivalence claim, since run-to-run variability could explain the qualitative distinctions.
  2. [§5] Downstream Prediction: The claim that 'adding geometric and spectral metrics improves the prediction' of downstream performance lacks quantitative support such as R² increments, regression coefficients, effect sizes, or controls for model scale. Without these, the assertion that the metrics capture generalization-relevant properties beyond perplexity remains unsubstantiated.
  3. [§3.1–3.2] Loss Landscape: The 1-D slices show full-rank settling into a sharper basin along random directions and the reverse for the top-1 PCA direction, but no curvature quantification (e.g., second-derivative estimates) or robustness checks across seeds are provided, leaving the geometric distinction descriptive rather than rigorously established.
minor comments (2)
  1. [Figures] Figure captions and axis labels for the 1-D loss plots and spectra should explicitly state the number of samples or interpolation points used.
  2. [§2] Notation for the 16 metrics is introduced without a consolidated table; a summary table listing each metric, its formula, and the dimension it probes would improve clarity.

Simulated Author's Rebuttal

3 responses · 0 unresolved

Thank you for the thoughtful review. We will revise the manuscript to address the concerns regarding statistical rigor and quantitative support for the claims. Below we respond point-by-point.

read point-by-point responses
  1. Referee: §4 (Experimental Setup and Results): No error bars, number of random seeds, or statistical significance tests are reported for the 16 metrics or the observed differences in basin sharpness and activation divergence. This is load-bearing for the non-equivalence claim, as run-to-run variability could explain the qualitative distinctions.

    Authors: We agree that reporting variability is important. The primary experiments used single seeds per method and scale due to the high computational cost, particularly at 350M parameters. However, we ran the 60M and 130M models with 3 seeds and observed consistent trends. In the revised version, we will include error bars for the smaller scales and add a discussion of variability. For the 350M scale, we will note the limitation and rely on cross-scale consistency. revision: partial

  2. Referee: §5 (Downstream Prediction): The claim that 'adding geometric and spectral metrics improves the prediction' of downstream performance lacks any quantitative support such as R² increments, regression coefficients, effect sizes, or controls for model scale. Without these, the assertion that the metrics capture generalization-relevant properties beyond perplexity remains unsubstantiated.

    Authors: We will strengthen this section by adding quantitative regression analysis. Specifically, we will report R² values for linear regressions predicting downstream performance using perplexity alone versus perplexity plus the geometric/spectral metrics, including controls for model scale. This will provide the requested effect sizes and demonstrate the improvement. revision: yes

  3. Referee: §3.1–3.2 (Loss Landscape): The 1-D slices claim full-rank settles into a 'sharper basin' along random directions and the reverse for top-1 PCA, but no curvature quantification (e.g., second-derivative estimates) or robustness checks across seeds are provided, leaving the geometric distinction descriptive rather than rigorously established.

    Authors: The 1-D slices serve as visual aids, with the distinctions corroborated by the full set of 16 metrics including checkpoint interpolation and spectral analyses. We will add approximate curvature estimates (e.g., finite-difference second derivatives along the directions) for the 60M and 130M models in the revision. Robustness across seeds will be addressed as noted in the response to §4. revision: partial
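
For reference, the proposed finite-difference curvature check is simple to sketch; the step size and the caller-supplied direction below are illustrative assumptions rather than the authors' settings.

```python
# Minimal sketch of a central second difference of the loss along a fixed direction.
import torch

@torch.no_grad()
def directional_curvature(model, loss_fn, batch, direction, h=1e-3):
    base = [p.detach().clone() for p in model.parameters()]

    def eval_at(scale):
        for p, p0, d in zip(model.parameters(), base, direction):
            p.copy_(p0 + scale * d)
        return loss_fn(model, batch).item()

    l_plus, l_zero, l_minus = eval_at(+h), eval_at(0.0), eval_at(-h)
    for p, p0 in zip(model.parameters(), base):   # restore the checkpoint
        p.copy_(p0)
    # L''(0) ~ (L(+h) - 2 L(0) + L(-h)) / h^2
    return (l_plus - 2.0 * l_zero + l_minus) / h ** 2
```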

Circularity Check

0 steps flagged

No circularity: purely empirical comparison with independent metrics

full rationale

The paper conducts a direct empirical study comparing five low-rank pre-training methods to full-rank baselines across three model scales using 16 independent metrics (1-D loss landscapes, checkpoint interpolation, weight/update spectra, and activation similarities). No mathematical derivations, first-principles predictions, fitted parameters renamed as predictions, or load-bearing self-citations are present. All claims rest on experimental observations against external full-rank controls rather than any self-referential reduction or ansatz smuggling. The analysis is self-contained with no steps that collapse to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical comparative study of existing low-rank pre-training methods using new evaluation metrics. No free parameters, axioms, or invented entities are introduced in the central claim.

pith-pipeline@v0.9.0 · 5646 in / 1261 out tokens · 28470 ms · 2026-05-14T19:16:23.841652+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

    Relation between the paper passage and the cited Recognition theorem.

    We show that low-rank methods are not equivalent to full-rank training, nor to one another, even when validation perplexity is close. Full-rank training settles into a sharper basin than low-rank methods along random directions, while the reverse holds for the top-1 PCA direction.

  • IndisputableMonolith/Foundation/BranchSelection.lean · branch_selection · unclear

    Relation between the paper passage and the cited Recognition theorem.

    Each method converges to a geometrically distinct basin. Low-rank activations diverge from full-rank in later layers as training progresses

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

30 extracted references · 30 canonical work pages · 1 internal anchor

  1. [1]

A modern look at the relationship between sharpness and generalization. arXiv preprint arXiv:2302.07011, 2023

Maksym Andriushchenko, Francesco Croce, Maximilian Müller, Matthias Hein, and Nicolas Flammarion. A modern look at the relationship between sharpness and generalization. arXiv preprint arXiv:2302.07011, 2023

  2. [2]

Understanding pre-training and fine-tuning from loss landscape perspectives. arXiv preprint arXiv:2505.17646, 2025

Huanran Chen, Yinpeng Dong, Zeming Wei, Yao Huang, Yichi Zhang, Hang Su, and Jun Zhu. Unveiling the basin-like loss landscape in large language models. arXiv preprint arXiv:2505.17646, 2025

  3. [3]

Fira: Can we achieve full-rank training of llms under low-rank constraint? arXiv, abs/2410.01623, 2024

Xi Chen, Kaituo Feng, Changsheng Li, Xunhao Lai, Xiangyu Yue, Ye Yuan, and Guoren Wang. Fira: Can we achieve full-rank training of llms under low-rank constraint? arXiv, abs/2410.01623, 2024

  4. [4]

    Linear mode connectivity and the lottery ticket hypothesis

Jonathan Frankle, Gintare Karolina Dziugaite, Daniel Roy, and Michael Carbin. Linear mode connectivity and the lottery ticket hypothesis. In International Conference on Machine Learning, pages 3259–3269. PMLR, 2020

  5. [5]

    The language model evaluation harness, 07 2024

    Leo Gao, Jonathan Tow, Baber Abbasi, Stella Biderman, Sid Black, Anthony DiPofi, Charles Foster, Laurence Golding, Jeffrey Hsu, Alain Le Noac’h, Haonan Li, Kyle McDonell, Niklas Muennighoff, Chris Ociepa, Jason Phang, Laria Reynolds, Hailey Schoelkopf, Aviya Skowron, Lintang Sutawika, Eric Tang, Anish Thite, Ben Wang, Kevin Wang, and Andy Zou. The languag...

  6. [6]

SLTrain: a sparse plus low-rank approach for parameter and memory efficient pretraining. arXiv preprint arXiv:2406.02214, 2024

Andi Han, Jiaxiang Li, Wei Huang, Mingyi Hong, Akiko Takeda, Pratik Jawanpuria, and Bamdev Mishra. SLTrain: a sparse plus low-rank approach for parameter and memory efficient pretraining. arXiv preprint arXiv:2406.02214, 2024

  7. [7]

Flora: Low-rank adapters are secretly gradient compressors. arXiv preprint arXiv:2402.03293, 2024

Yongchang Hao, Yanshuai Cao, and Lili Mou. Flora: Low-rank adapters are secretly gradient compressors. arXiv preprint arXiv:2402.03293, 2024

  8. [8]

    Galore-mini: Low rank gradient learning with fewer learning rates

Weihao Huang, Zhenyu Zhang, Yushun Zhang, Zhi-Quan Luo, Ruoyu Sun, and Zhangyang Wang. Galore-mini: Low rank gradient learning with fewer learning rates. In NeurIPS 2024 Workshop on Fine-Tuning in Modern Machine Learning: Principles and Scalability, 2024

  9. [9]

    From galore to welore: How low-rank weights non-uniformly emerge from low-rank gradients

Ajay Kumar Jaiswal, Lu Yin, Zhenyu Zhang, Shiwei Liu, Jiawei Zhao, Yuandong Tian, and Zhangyang Wang. From galore to welore: How low-rank weights non-uniformly emerge from low-rank gradients

  10. [10]

    On the maximum hessian eigenvalue and generalization

Simran Kaur, Jeremy Cohen, and Zachary Chase Lipton. On the maximum hessian eigenvalue and generalization. In Proceedings on, pages 51–65. PMLR, 2023

  11. [11]

    Similarity of neural network representations revisited

Simon Kornblith, Mohammad Norouzi, Honglak Lee, and Geoffrey Hinton. Similarity of neural network representations revisited. In International Conference on Machine Learning (ICML), 2019

  12. [12]

    Visualizing the loss landscape of neural nets

Hao Li, Zheng Xu, Gavin Taylor, Christoph Studer, and Tom Goldstein. Visualizing the loss landscape of neural nets. In Advances in Neural Information Processing Systems (NeurIPS), 2018

  13. [13]

Lost: Low-rank and sparse pre-training for large language models. arXiv preprint arXiv:2508.02668, 2025

Jiaxi Li, Lu Yin, Li Shen, Jinjin Xu, Liwu Xu, Tianjin Huang, Wenwu Wang, Shiwei Liu, and Xilu Wang. Lost: Low-rank and sparse pre-training for large language models. arXiv preprint arXiv:2508.02668, 2025

  14. [14]

Flat-lora: Low-rank adaptation over a flat loss landscape. arXiv preprint arXiv:2409.14396, 2024

Tao Li, Zhengbao He, Yujun Li, Yasheng Wang, Lifeng Shang, and Xiaolin Huang. Flat-lora: Low-rank adaptation over a flat loss landscape. arXiv preprint arXiv:2409.14396, 2024

  15. [15]

    ReLoRA: High-Rank Training Through Low-Rank Updates

Vladislav Lialin, Sherin Muckatira, Namrata Shivagunde, and Anna Rumshisky. ReLoRA: High-Rank Training Through Low-Rank Updates. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net, 2024

  16. [16]

    Same pre-training loss, better downstream: Implicit bias matters for language models

Hong Liu, Sang Michael Xie, Zhiyuan Li, and Tengyu Ma. Same pre-training loss, better downstream: Implicit bias matters for language models. In International Conference on Machine Learning, 2022

  17. [17]

    On the optimization landscape of low rank adaptation methods for large language models

Xu-Hui Liu, Yali Du, Jun Wang, and Yang Yu. On the optimization landscape of low rank adaptation methods for large language models. In The Thirteenth International Conference on Learning Representations, 2025

  18. [18]

    Cola: Compute-efficient pre-training of llms via low-rank activation

Ziyue Liu, Ruijie Zhang, Zhengyang Wang, Mingsong Yan, Zi Yang, Paul D Hovland, Bogdan Nicolae, Franck Cappello, Sui Tang, and Zheng Zhang. Cola: Compute-efficient pre-training of llms via low-rank activation. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 4627–4645, 2025

  19. [19]

LoQT: Low Rank Adapters for Quantized Training. arXiv preprint arXiv:2405.16528, 2024

Sebastian Loeschcke, Mads Toftrup, Michael J Kastoryano, Serge Belongie, and Vésteinn Snæbjarnarson. LoQT: Low Rank Adapters for Quantized Training. arXiv preprint arXiv:2405.16528, 2024

  20. [20]

Velora: Memory efficient training using rank-1 sub-token projections. Advances in Neural Information Processing Systems, 37:42292–42310, 2024

Roy Miles, Pradyumna Reddy, Ismail Elezi, and Jiankang Deng. Velora: Memory efficient training using rank-1 sub-token projections. Advances in Neural Information Processing Systems, 37:42292–42310, 2024

  21. [21]

Grass: Compute efficient low-memory llm training with structured sparse gradients. arXiv preprint arXiv:2406.17660, 2024

Aashiq Muhamed, Oscar Li, David Woodruff, Mona Diab, and Virginia Smith. Grass: Compute efficient low-memory llm training with structured sparse gradients. arXiv preprint arXiv:2406.17660, 2024

  22. [22]

Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67, 2020

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67, 2020

  23. [23]

Namrata Shivagunde, Mayank Kulkarni, Giannis Karamanolakis, Jack G. M. FitzGerald, Yannick Versley, Saleh Soltan, Volkan Cevher, Jianhua Lu, and Anna Rumshisky. Approximations may be all you need: Towards pre-training llms with low-rank decomposition and optimizers. 2024

  24. [24]

    Galore 2: Large-scale llm pre-training by gradient low-rank projection

    DiJia Su, Andrew Gu, Jane Xu, Yuan Tian, and Jiawei Zhao. Galore 2: Large-scale llm pre-training by gradient low-rank projection. 2025

  25. [25]

Understanding warmup-stable-decay learning rates: A river valley loss landscape perspective. arXiv preprint arXiv:2410.05192, 2024

Kaiyue Wen, Zhiyuan Li, Jason Wang, David Hall, Percy Liang, and Tengyu Ma. Understanding warmup-stable-decay learning rates: A river valley loss landscape perspective. arXiv preprint arXiv:2410.05192, 2024

  26. [26]

    Coap: Memory-efficient training with correlation-aware gradient projection

Jinqi Xiao, Shen Sang, Tiancheng Zhi, Jing Liu, Qing Yan, Linjie Luo, and Bo Yuan. Coap: Memory-efficient training with correlation-aware gradient projection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 30116–30126, 2025

  27. [27]

    Q-galore: Quantized galore with int4 projection and layer-adaptive low-rank gradients

    Zhenyu Zhang, Ajay Jaiswal, Lu Yin, Shiwei Liu, Jiawei Zhao, Yuandong Tian, and Zhangyang Wang. Q-galore: Quantized galore with int4 projection and layer-adaptive low-rank gradients. arXiv preprint arXiv:2407.08296, 2024

  28. [28]

    GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection

Jiawei Zhao, Zhenyu Zhang, Beidi Chen, Zhangyang Wang, Anima Anandkumar, and Yuandong Tian. GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection. In Forty-first International Conference on Machine Learning, ICML 2024, Vienna, Austria, July 21-27, 2024. OpenReview.net, 2024

  29. [29]

Switchlora: Switched low-rank adaptation can learn full-rank information. arXiv preprint arXiv:2406.06564, 2024

Kaiye Zhou, Shucheng Wang, and Jun Xu. Switchlora: Switched low-rank adaptation can learn full-rank information. arXiv preprint arXiv:2406.06564, 2024

  30. [30]

    Demystifying Mergeability: Interpretable Properties to Predict Model Merging Success

Lorenzo Zhou, Bo Zhao, Runpeng Yu, and Emanuele Rodolà. Demystifying mergeability: Interpretable properties to predict model merging success. arXiv preprint arXiv:2601.22285, 2026