Beyond Perplexity: A Geometric and Spectral Study of Low-Rank Pre-Training
Recognition: 2 theorem links · Lean Theorem
Pith reviewed 2026-05-14 19:16 UTC · model grok-4.3
The pith
Low-rank pre-training methods reach loss basins that are geometrically distinct from those of full-rank training, even at matched perplexity.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Low-rank pre-training methods are not equivalent to full-rank training, nor to one another, even when validation perplexity is close. Full-rank training settles into a sharper basin than low-rank methods along random directions, while the reverse holds for the top-1 PCA direction. Each method converges to a geometrically distinct basin. Low-rank activations diverge from full-rank in later layers as training progresses, with one method tracking full-rank most closely. Validation perplexity does not translate to downstream performance at every scale.
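To make the activation-divergence part of this claim concrete, here is a minimal sketch of a per-layer cosine-similarity probe between a low-rank run and a full-rank reference on the same batch. The layer count, token-level pooling, and the random stand-in activations are assumptions for illustration, not the paper's exact protocol.

```python
# Sketch: per-layer mean cosine similarity between a low-rank model's activations
# and a full-rank reference on the same batch. Assumes activations are captured
# as (n_tokens, d_hidden) arrays per layer; the paper's exact pooling may differ.
import numpy as np

def layerwise_cosine_similarity(acts_low, acts_full):
    """acts_low, acts_full: lists of (n_tokens, d_hidden) arrays, one per layer."""
    sims = []
    for a, b in zip(acts_low, acts_full):
        a = a / (np.linalg.norm(a, axis=1, keepdims=True) + 1e-8)
        b = b / (np.linalg.norm(b, axis=1, keepdims=True) + 1e-8)
        sims.append(float((a * b).sum(axis=1).mean()))  # mean over tokens
    return sims

# usage with random stand-in activations whose divergence grows with depth
rng = np.random.default_rng(0)
acts_full = [rng.normal(size=(128, 64)) for _ in range(12)]
acts_low = [a + 0.1 * i * rng.normal(size=a.shape) for i, a in enumerate(acts_full)]
print(layerwise_cosine_similarity(acts_low, acts_full))  # drops toward later layers
```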
What carries the argument
Sixteen metrics across four dimensions: one-dimensional loss landscapes along random and top-K PCA directions, one-dimensional interpolation between checkpoints, spectral structure of weights and learned updates, and activation similarity to full-rank training.
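As an illustration of the first of these dimensions, the sketch below traces a 1-D loss slice along a random direction rescaled to the weight norms, in the spirit of Li et al. [12]. The toy model, data, and the per-tensor (rather than per-filter) normalization are assumptions, not the paper's configuration.

```python
# Sketch of a 1-D loss slice along a random, weight-norm-rescaled direction.
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 1))  # stand-in network
x, y = torch.randn(256, 16), torch.randn(256, 1)
loss_fn = nn.MSELoss()

base = [p.detach().clone() for p in model.parameters()]
direction = [torch.randn_like(p) for p in base]
direction = [d * (p.norm() / (d.norm() + 1e-10)) for d, p in zip(direction, base)]

def loss_at(alpha):
    # move the model to theta = theta* + alpha * d and evaluate the loss
    with torch.no_grad():
        for p, p0, d in zip(model.parameters(), base, direction):
            p.copy_(p0 + alpha * d)
        return loss_fn(model(x), y).item()

alphas = torch.linspace(-1.0, 1.0, 21)
profile = [loss_at(a.item()) for a in alphas]
print(list(zip(alphas.tolist(), profile)))  # a sharper basin rises faster away from alpha = 0
```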
If this is right
- Low-rank methods cannot be substituted for one another on the basis of perplexity alone.
- Downstream performance can differ across methods even when validation perplexity matches.
- Geometric and spectral metrics give stronger signals for generalization than perplexity by itself.
- Later-layer activation divergence implies that representation learning changes under rank constraints.
- Method choice may need to be scale-dependent to approximate full-rank basin geometry.
Where Pith is reading between the lines
- The basin differences could produce varying robustness to data shifts or adversarial examples not captured by standard benchmarks.
- Practitioners might monitor landscape sharpness during training to decide when a low-rank run is close enough to full-rank behavior.
- Hybrid update rules that periodically inject full-rank corrections could be tested to retain efficiency while steering toward preferred basins.
Load-bearing premise
The sixteen chosen metrics on loss landscapes, spectra, and activations are sufficient to detect meaningful differences in solution quality and generalization behavior.
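For intuition about the spectral part of this premise, here is a minimal sketch of an entropy-based effective rank of a learned update delta_W = W_final - W_init. This particular effective-rank definition and the synthetic matrices are assumptions; the paper's appendix defines its own spectral metrics.

```python
# Sketch: entropy-based effective rank (exponential of the singular-value entropy).
import numpy as np

def effective_rank(delta_w):
    s = np.linalg.svd(delta_w, compute_uv=False)
    p = s / s.sum()
    return float(np.exp(-(p * np.log(p + 1e-12)).sum()))

rng = np.random.default_rng(0)
low_rank_update = rng.normal(size=(256, 8)) @ rng.normal(size=(8, 256))  # rank-8 by construction
dense_update = 0.1 * rng.normal(size=(256, 256))                         # full-rank noise
print(effective_rank(low_rank_update))  # close to 8
print(effective_rank(dense_update))     # far larger
```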
What would settle it
A controlled run in which low-rank and full-rank models reach identical downstream task scores and exhibit matching loss curvature along every tested direction at matched perplexity would falsify the central claim.
Original abstract
Pre-training large language models is dominated by the memory cost of storing full-rank weights, gradients, and optimizer states. Low-rank pre-training has emerged to address this, and the space of methods has grown rapidly. A central question remains open: do low-rank methods produce models that generalize comparably to full-rank training, or does the rank constraint fundamentally alter the solutions reached? Existing comparisons rely almost entirely on validation perplexity from single-seed runs, often carried forward from prior literature. Yet perplexity is a poor proxy for solution quality; two methods can match on perplexity while converging to different loss landscape regions and internal representations. We close this gap by characterizing the solutions found by five low-rank pre-training methods, GaLore and Fira (memory-efficient optimizers), CoLA and SLTrain (architecture reparameterizations), and ReLoRA (adapter-style updates with periodic resets), against full-rank training at three model scales (60M, 130M, 350M). We evaluate each along 16 metrics across four dimensions: 1-D loss landscape along random/top-K PCA directions, 1-D interpolation between checkpoints, spectral structure of the weights and learned updates, and activation similarity to full-rank training. We show that low-rank methods are not equivalent to full-rank training, nor to one another, even when validation perplexity is close. Full-rank training settles into a sharper basin than low-rank methods along random directions, while the reverse holds for the top-1 PCA direction. Each method converges to a geometrically distinct basin. Low-rank activations diverge from full-rank in later layers as training progresses, with GaLore tracking full-rank most closely. Further, validation perplexity does not translate to downstream performance at every scale. Adding geometric and spectral metrics improves the prediction.
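For concreteness, the sketch below illustrates the 1-D checkpoint-interpolation probe the abstract describes: evaluating the loss along the straight line between two parameter states (cf. linear mode connectivity [4]). The toy model, data, and step count are placeholders, not the paper's setup.

```python
# Sketch: loss along theta(t) = (1 - t) * theta_A + t * theta_B between two checkpoints.
import torch
import torch.nn as nn

def interpolation_profile(model, state_a, state_b, x, y, loss_fn, steps=11):
    losses = []
    for t in torch.linspace(0.0, 1.0, steps):
        blended = {k: (1 - t) * state_a[k] + t * state_b[k] for k in state_a}
        model.load_state_dict(blended)
        with torch.no_grad():
            losses.append(loss_fn(model(x), y).item())
    return losses  # a barrier in the middle suggests the endpoints lie in different basins

torch.manual_seed(0)
model = nn.Linear(8, 1)
x, y = torch.randn(64, 8), torch.randn(64, 1)
state_a = {k: v.clone() for k, v in model.state_dict().items()}
state_b = {k: v + torch.randn_like(v) for k, v in state_a.items()}
print(interpolation_profile(model, state_a, state_b, x, y, nn.MSELoss()))
```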
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that low-rank pre-training methods (GaLore, Fira, CoLA, SLTrain, ReLoRA) for LLMs at 60M–350M scales reach solutions that are geometrically and spectrally distinct from full-rank training and from each other, even at comparable validation perplexity. Using 16 metrics on 1-D loss landscapes (random and top-K PCA directions), checkpoint interpolation, weight/update spectra, and activation cosine similarities, it shows full-rank training occupies sharper basins along random directions while low-rank methods are sharper along the top-1 PCA direction, with later-layer activation divergence and distinct spectral structures. It further asserts that perplexity alone fails to predict downstream performance at every scale, but incorporating the geometric/spectral metrics improves such predictions.
Significance. If the reported distinctions are robust, the work advances evaluation of memory-efficient LLM training by demonstrating that perplexity matching does not imply solution equivalence. The multi-scale, multi-metric analysis could guide method selection and motivate new low-rank designs that target specific landscape regions. Strengths include the breadth of metrics and explicit comparison to full-rank baselines across methods.
major comments (3)
- §4 (Experimental Setup and Results): No error bars, number of random seeds, or statistical significance tests are reported for the 16 metrics or the observed differences in basin sharpness and activation divergence. This is load-bearing for the non-equivalence claim, as run-to-run variability could explain the qualitative distinctions.
- §5 (Downstream Prediction): The claim that 'adding geometric and spectral metrics improves the prediction' of downstream performance lacks any quantitative support such as R² increments, regression coefficients, effect sizes, or controls for model scale. Without these, the assertion that the metrics capture generalization-relevant properties beyond perplexity remains unsubstantiated.
- §3.1–3.2 (Loss Landscape): The 1-D slices claim full-rank settles into a 'sharper basin' along random directions and the reverse for top-1 PCA, but no curvature quantification (e.g., second-derivative estimates) or robustness checks across seeds are provided, leaving the geometric distinction descriptive rather than rigorously established.
minor comments (2)
- [Figures] Figure captions and axis labels for the 1-D loss plots and spectra should explicitly state the number of samples or interpolation points used.
- [§2] Notation for the 16 metrics is introduced without a consolidated table; a summary table listing each metric, its formula, and the dimension it probes would improve clarity.
Simulated Author's Rebuttal
Thank you for the thoughtful review. We will revise the manuscript to address the concerns regarding statistical rigor and quantitative support for the claims. Below we respond point-by-point.
Point-by-point responses
- Referee: §4 (Experimental Setup and Results): No error bars, number of random seeds, or statistical significance tests are reported for the 16 metrics or the observed differences in basin sharpness and activation divergence. This is load-bearing for the non-equivalence claim, as run-to-run variability could explain the qualitative distinctions.
  Authors: We agree that reporting variability is important. The primary experiments used single seeds per method and scale due to the high computational cost, particularly at 350M parameters. However, we ran the 60M and 130M models with 3 seeds and observed consistent trends. In the revised version, we will include error bars for the smaller scales and add a discussion of variability. For the 350M scale, we will note the limitation and rely on cross-scale consistency. (revision: partial)
- Referee: §5 (Downstream Prediction): The claim that 'adding geometric and spectral metrics improves the prediction' of downstream performance lacks any quantitative support such as R² increments, regression coefficients, effect sizes, or controls for model scale. Without these, the assertion that the metrics capture generalization-relevant properties beyond perplexity remains unsubstantiated.
  Authors: We will strengthen this section by adding quantitative regression analysis. Specifically, we will report R² values for linear regressions predicting downstream performance using perplexity alone versus perplexity plus the geometric/spectral metrics, including controls for model scale. This will provide the requested effect sizes and demonstrate the improvement. (revision: yes)
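A minimal sketch of the kind of comparison proposed here, on synthetic stand-in data: R² for predicting a downstream score from perplexity alone versus perplexity plus geometric/spectral features. The feature names and the in-sample R² are illustrative assumptions, not the authors' analysis.

```python
# Sketch: R^2 with perplexity alone vs. perplexity plus extra features (synthetic data).
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n_runs = 30
perplexity = rng.uniform(15, 30, size=(n_runs, 1))
sharpness = rng.uniform(0, 1, size=(n_runs, 1))   # e.g. random-direction curvature
eff_rank = rng.uniform(1, 64, size=(n_runs, 1))   # e.g. of the learned update
downstream = (0.9 - 0.01 * perplexity + 0.05 * sharpness + 0.001 * eff_rank
              + rng.normal(0, 0.01, size=(n_runs, 1)))

r2_ppl = LinearRegression().fit(perplexity, downstream).score(perplexity, downstream)
features = np.hstack([perplexity, sharpness, eff_rank])
r2_all = LinearRegression().fit(features, downstream).score(features, downstream)
print(f"R^2 perplexity only: {r2_ppl:.3f}; with extra features: {r2_all:.3f}")
```

A cross-validated or held-out R², with model scale as a covariate, would be the more defensible form of this comparison.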
- Referee: §3.1–3.2 (Loss Landscape): The 1-D slices claim full-rank settles into a 'sharper basin' along random directions and the reverse for top-1 PCA, but no curvature quantification (e.g., second-derivative estimates) or robustness checks across seeds are provided, leaving the geometric distinction descriptive rather than rigorously established.
  Authors: The 1-D slices serve as visual aids, with the distinctions corroborated by the full set of 16 metrics including checkpoint interpolation and spectral analyses. We will add approximate curvature estimates (e.g., finite-difference second derivatives along the directions) for the 60M and 130M models in the revision. Robustness across seeds will be addressed as noted in the response to §4. (revision: partial)
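A minimal sketch of the finite-difference curvature estimate mentioned in this response. The quadratic stand-in for the loss slice is an assumption used only to check the formula; a real run would evaluate the model's validation loss at the three displaced parameter settings.

```python
# Sketch: f''(0) ~ (L(theta + h*d) - 2*L(theta) + L(theta - h*d)) / h^2 along a direction d.
def directional_curvature(loss_along, h=1e-2):
    """loss_along(alpha) returns the loss at theta + alpha * d."""
    return (loss_along(h) - 2.0 * loss_along(0.0) + loss_along(-h)) / h ** 2

quadratic_slice = lambda alpha: 0.5 * 3.0 * alpha ** 2 + 1.2  # curvature 3 by construction
print(directional_curvature(quadratic_slice))  # ~3.0
```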
Circularity Check
No circularity: purely empirical comparison with independent metrics
Full rationale
The paper conducts a direct empirical study comparing five low-rank pre-training methods to full-rank baselines across three model scales using 16 independent metrics (1-D loss landscapes, checkpoint interpolation, weight/update spectra, and activation similarities). No mathematical derivations, first-principles predictions, fitted parameters renamed as predictions, or load-bearing self-citations are present. All claims rest on experimental observations against external full-rank controls rather than any self-referential reduction or ansatz smuggling. The analysis is self-contained with no steps that collapse to inputs by construction.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
  The relation between the paper passage and the cited Recognition theorem is unclear.
  Passage: "We show that low-rank methods are not equivalent to full-rank training, nor to one another, even when validation perplexity is close. Full-rank training settles into a sharper basin than low-rank methods along random directions, while the reverse holds for the top-1 PCA direction."
- IndisputableMonolith/Foundation/BranchSelection.lean · branch_selection · unclear
  The relation between the paper passage and the cited Recognition theorem is unclear.
  Passage: "Each method converges to a geometrically distinct basin. Low-rank activations diverge from full-rank in later layers as training progresses."
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] Maksym Andriushchenko, Francesco Croce, Maximilian Müller, Matthias Hein, and Nicolas Flammarion. A modern look at the relationship between sharpness and generalization. arXiv preprint arXiv:2302.07011, 2023.
- [2] Huanran Chen, Yinpeng Dong, Zeming Wei, Yao Huang, Yichi Zhang, Hang Su, and Jun Zhu. Unveiling the basin-like loss landscape in large language models. arXiv preprint arXiv:2505.17646, 2025.
- [3] Xi Chen, Kaituo Feng, Changsheng Li, Xunhao Lai, Xiangyu Yue, Ye Yuan, and Guoren Wang. Fira: Can we achieve full-rank training of LLMs under low-rank constraint? arXiv, abs/2410.01623, 2024.
- [4] Jonathan Frankle, Gintare Karolina Dziugaite, Daniel Roy, and Michael Carbin. Linear mode connectivity and the lottery ticket hypothesis. In International Conference on Machine Learning, pages 3259–3269. PMLR, 2020.
- [5] Leo Gao, Jonathan Tow, Baber Abbasi, Stella Biderman, Sid Black, Anthony DiPofi, Charles Foster, Laurence Golding, Jeffrey Hsu, Alain Le Noac'h, Haonan Li, Kyle McDonell, Niklas Muennighoff, Chris Ociepa, Jason Phang, Laria Reynolds, Hailey Schoelkopf, Aviya Skowron, Lintang Sutawika, Eric Tang, Anish Thite, Ben Wang, Kevin Wang, and Andy Zou. The language model evaluation harness, 07 2024.
- [6] Andi Han, Jiaxiang Li, Wei Huang, Mingyi Hong, Akiko Takeda, Pratik Jawanpuria, and Bamdev Mishra. SLTrain: a sparse plus low-rank approach for parameter and memory efficient pretraining. arXiv preprint arXiv:2406.02214, 2024.
- [7] Yongchang Hao, Yanshuai Cao, and Lili Mou. Flora: Low-rank adapters are secretly gradient compressors. arXiv preprint arXiv:2402.03293, 2024.
- [8] Weihao Huang, Zhenyu Zhang, Yushun Zhang, Zhi-Quan Luo, Ruoyu Sun, and Zhangyang Wang. GaLore-mini: Low rank gradient learning with fewer learning rates. In NeurIPS 2024 Workshop on Fine-Tuning in Modern Machine Learning: Principles and Scalability, 2024.
- [9] Ajay Kumar Jaiswal, Lu Yin, Zhenyu Zhang, Shiwei Liu, Jiawei Zhao, Yuandong Tian, and Zhangyang Wang. From GaLore to WeLore: How low-rank weights non-uniformly emerge from low-rank gradients.
- [10] Simran Kaur, Jeremy Cohen, and Zachary Chase Lipton. On the maximum Hessian eigenvalue and generalization. In Proceedings on, pages 51–65. PMLR, 2023.
- [11] Simon Kornblith, Mohammad Norouzi, Honglak Lee, and Geoffrey Hinton. Similarity of neural network representations revisited. In International Conference on Machine Learning (ICML), 2019.
- [12] Hao Li, Zheng Xu, Gavin Taylor, Christoph Studer, and Tom Goldstein. Visualizing the loss landscape of neural nets. In Advances in Neural Information Processing Systems (NeurIPS), 2018.
- [13] Jiaxi Li, Lu Yin, Li Shen, Jinjin Xu, Liwu Xu, Tianjin Huang, Wenwu Wang, Shiwei Liu, and Xilu Wang. LoST: Low-rank and sparse pre-training for large language models. arXiv preprint arXiv:2508.02668, 2025.
- [14] Tao Li, Zhengbao He, Yujun Li, Yasheng Wang, Lifeng Shang, and Xiaolin Huang. Flat-LoRA: Low-rank adaptation over a flat loss landscape. arXiv preprint arXiv:2409.14396, 2024.
- [15] Vladislav Lialin, Sherin Muckatira, Namrata Shivagunde, and Anna Rumshisky. ReLoRA: High-Rank Training Through Low-Rank Updates. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7–11, 2024. OpenReview.net, 2024.
- [16] Hong Liu, Sang Michael Xie, Zhiyuan Li, and Tengyu Ma. Same pre-training loss, better downstream: Implicit bias matters for language models. In International Conference on Machine Learning, 2022.
- [17] Xu-Hui Liu, Yali Du, Jun Wang, and Yang Yu. On the optimization landscape of low rank adaptation methods for large language models. In The Thirteenth International Conference on Learning Representations, 2025.
- [18] Ziyue Liu, Ruijie Zhang, Zhengyang Wang, Mingsong Yan, Zi Yang, Paul D Hovland, Bogdan Nicolae, Franck Cappello, Sui Tang, and Zheng Zhang. CoLA: Compute-efficient pre-training of LLMs via low-rank activation. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 4627–4645, 2025.
- [19] Sebastian Loeschcke, Mads Toftrup, Michael J Kastoryano, Serge Belongie, and Vésteinn Snæbjarnarson. LoQT: Low Rank Adapters for Quantized Training. arXiv preprint arXiv:2405.16528, 2024.
- [20] Roy Miles, Pradyumna Reddy, Ismail Elezi, and Jiankang Deng. VeLoRA: Memory efficient training using rank-1 sub-token projections. Advances in Neural Information Processing Systems, 37:42292–42310, 2024.
- [21] Aashiq Muhamed, Oscar Li, David Woodruff, Mona Diab, and Virginia Smith. Grass: Compute efficient low-memory LLM training with structured sparse gradients. arXiv preprint arXiv:2406.17660, 2024.
- [22] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67, 2020.
- [23] Namrata Shivagunde, Mayank Kulkarni, Giannis Karamanolakis, Jack G. M. FitzGerald, Yannick Versley, Saleh Soltan, Volkan Cevher, Jianhua Lu, and Anna Rumshisky. Approximations may be all you need: Towards pre-training LLMs with low-rank decomposition and optimizers. 2024.
- [24] DiJia Su, Andrew Gu, Jane Xu, Yuan Tian, and Jiawei Zhao. GaLore 2: Large-scale LLM pre-training by gradient low-rank projection. 2025.
- [25] Kaiyue Wen, Zhiyuan Li, Jason Wang, David Hall, Percy Liang, and Tengyu Ma. Understanding warmup-stable-decay learning rates: A river valley loss landscape perspective. arXiv preprint arXiv:2410.05192, 2024.
- [26] Jinqi Xiao, Shen Sang, Tiancheng Zhi, Jing Liu, Qing Yan, Linjie Luo, and Bo Yuan. COAP: Memory-efficient training with correlation-aware gradient projection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 30116–30126, 2025.
- [27] Zhenyu Zhang, Ajay Jaiswal, Lu Yin, Shiwei Liu, Jiawei Zhao, Yuandong Tian, and Zhangyang Wang. Q-GaLore: Quantized GaLore with INT4 projection and layer-adaptive low-rank gradients. arXiv preprint arXiv:2407.08296, 2024.
- [28] Jiawei Zhao, Zhenyu Zhang, Beidi Chen, Zhangyang Wang, Anima Anandkumar, and Yuandong Tian. GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection. In Forty-first International Conference on Machine Learning, ICML 2024, Vienna, Austria, July 21–27, 2024. OpenReview.net, 2024.
- [29] Kaiye Zhou, Shucheng Wang, and Jun Xu. SwitchLoRA: Switched low-rank adaptation can learn full-rank information. arXiv preprint arXiv:2406.06564, 2024.
- [30] Lorenzo Zhou, Bo Zhao, Runpeng Yu, and Emanuele Rodolà. Demystifying mergeability: Interpretable properties to predict model merging success. arXiv preprint arXiv:2601.22285, 2026.