SLTrain: a sparse plus low-rank approach for parameter and memory efficient pretraining
2 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
baseline 1
citation-polarity summary
fields
cs.LG 2
verdicts
UNVERDICTED 2
roles
baseline 1
polarities
baseline 1
representative citing papers
CR-Net uses cross-layer low-rank residuals in a dual-path network plus specialized recomputation to outperform prior low-rank methods on 60M-7B model pre-training while using less compute and memory.
citing papers explorer
- Beyond Perplexity: A Geometric and Spectral Study of Low-Rank Pre-Training
  Low-rank pre-training methods converge to geometrically and spectrally distinct basins and show diverging activations compared to full-rank training at 60M-350M scales.
- CR-Net: Scaling Parameter-Efficient Training with Cross-Layer Low-Rank Structure
  CR-Net uses cross-layer low-rank residuals in a dual-path network plus specialized recomputation to outperform prior low-rank methods on 60M-7B model pre-training while using less compute and memory.