Gradient descent happens in a tiny subspace. arXiv preprint arXiv:1812.04754

URL: https://arxiv · 2018 · cs.LG · arXiv 1812.04754

19 Pith papers cite this work. Polarity classification is still indexing.

abstract

We show that in a variety of large-scale deep learning scenarios the gradient dynamically converges to a very small subspace after a short period of training. The subspace is spanned by a few top eigenvectors of the Hessian (equal to the number of classes in the dataset), and is mostly preserved over long periods of training. A simple argument then suggests that gradient descent may happen mostly in this subspace. We give an example of this effect in a solvable model of classification, and we comment on possible implications for optimization and learning.
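The claim in the abstract can be probed numerically. The following is a minimal NumPy sketch, not the paper's code: it assumes a small softmax-regression model on synthetic Gaussian clusters as a stand-in for the networks studied, and the function names (`top_subspace_overlap`, `softmax_grad_and_hessian`, `demo`) are hypothetical. It measures what fraction of the squared gradient norm lies in the span of the top k Hessian eigenvectors, with k set to the number of classes as the abstract suggests.

```python
import numpy as np

def top_subspace_overlap(grad, hessian, k):
    """Fraction of the squared gradient norm inside the span of the k
    Hessian eigenvectors with the largest eigenvalues."""
    n = hessian.shape[0]
    _, eigvecs = np.linalg.eigh(hessian)   # eigenvalues come back ascending
    top = eigvecs[:, n - k:]               # last k columns = top-k eigenvectors
    coeffs = top.T @ grad                  # gradient coordinates in that basis
    return float(coeffs @ coeffs / (grad @ grad))

def softmax_grad_and_hessian(W, X, Y):
    """Gradient and Hessian of the mean cross-entropy of softmax regression,
    both in the row-major flattening of W (shape d x c)."""
    n, d = X.shape
    c = W.shape[1]
    logits = X @ W
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    P = np.exp(logits)
    P /= P.sum(axis=1, keepdims=True)
    grad = (X.T @ (P - Y) / n).ravel()
    # Per-sample class block: diag(p) - p p^T, shape (n, c, c)
    S = P[:, :, None] * np.eye(c) - P[:, :, None] * P[:, None, :]
    # Hessian entry [(i, j), (k, l)] = mean over samples of x_i x_k S_jl
    hess = np.einsum("ni,nk,njl->ijkl", X, X, S) / n
    return grad, hess.reshape(d * c, d * c)

def demo(steps=200, lr=0.5, seed=0):
    """Train with plain gradient descent (assumed toy setup), then return
    the overlap of the gradient with the top-c Hessian eigenspace."""
    rng = np.random.default_rng(seed)
    n, d, c = 300, 10, 3
    labels = rng.integers(0, c, size=n)
    centers = rng.normal(size=(c, d))
    X = centers[labels] + rng.normal(size=(n, d))
    Y = np.eye(c)[labels]
    W = np.zeros((d, c))
    for _ in range(steps):
        grad, _ = softmax_grad_and_hessian(W, X, Y)
        W -= lr * grad.reshape(d, c)
    grad, hess = softmax_grad_and_hessian(W, X, Y)
    return top_subspace_overlap(grad, hess, k=c)

if __name__ == "__main__":
    print("overlap with top-k Hessian subspace:", demo())
```

The value returned by `demo()` depends on the synthetic data, learning rate, and step count, so no expected number is quoted here. An overlap near 1 would be consistent with the abstract's claim, but this toy model is only an illustration and does not reproduce the paper's experiments.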

citation-role summary

background: 2 · method: 1

citation-polarity summary

background: 2 · use method: 1

representative citing papers

Scaling Laws for Neural Language Models

cs.LG · 2020-01-23 · unverdicted · novelty 8.0

Empirical power-law scaling governs language model loss versus model size, data size, and compute, enabling optimal allocation of training compute.

AMUSE: Anytime Muon with Stable Gradient Evaluation

cs.LG · 2026-05-21 · unverdicted · novelty 7.0

AMUSE is a new optimizer integrating Muon orthogonalization with Schedule-Free averaging via adaptive interpolation for schedule-free anytime training that improves Pareto frontiers on vision and LLM tasks.

Depth, Not Data: An Analysis of Hessian Spectral Bifurcation

cs.LG · 2026-01-31 · unverdicted · novelty 7.0

Deep linear networks with balanced data covariance exhibit Hessian spectral bifurcation whose dominant-to-bulk eigenvalue ratio scales linearly with depth.

GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection

cs.LG · 2024-03-06 · conditional · novelty 7.0

GaLore performs full-parameter LLM training with up to 65.5% less optimizer memory by projecting gradients onto a low-rank subspace at each step, matching full-rank performance on LLaMA pre-training and RoBERTa fine-tuning.

The Lifecycle of the Spectral Edge: From Gradient Learning to Weight-Decay Compression

cs.LG · 2026-04-08 · unverdicted · novelty 7.0

The spectral edge transitions from a gradient-driven functional direction before grokking to a perturbation-flat, ablation-critical compression axis at grokking, forming three universality classes predicted by a gap flow equation.

Spectral Edge Dynamics: An Analytical-Empirical Study of Phase Transitions in Neural Network Training

cs.LG · 2026-03-30 · unverdicted · novelty 6.0

Spectral gaps in the Gram matrix of parameter updates control phase transitions such as grokking in neural network training.

Scaling Laws for Transfer

cs.LG · 2021-02-02 · unverdicted · novelty 6.0

Effective data transferred from pre-training to fine-tuning is described by a power law in model parameter count and fine-tuning dataset size, acting like a multiplier on the fine-tuning data.

Dr. Post-Training: A Data Regularization Perspective on LLM Post-Training

cs.LG · 2026-05-08 · unverdicted · novelty 6.0

Dr. Post-Training reframes general data as a data-induced regularizer for LLM post-training updates, yielding a family of methods that outperform data-selection baselines on SFT, RLHF, and RLVR tasks.

Pro-KLShampoo: Projected KL-Shampoo with Whitening Recovered by Orthogonalization

cs.LG · 2026-05-07 · unverdicted · novelty 6.0

Pro-KLShampoo projects KL-Shampoo preconditioners to a spike-and-flat parametric form on an r-dimensional subspace and recovers the full algebraic preconditioner via orthogonalization, outperforming KL-Shampoo on GPT-2 and LLaMA pre-training scales.

Spectral Lens: Activation and Gradient Spectra as Diagnostics of LLM Optimization

stat.ML · 2026-05-07 · unverdicted · novelty 6.0

Spectral analysis of activations and gradients provides new diagnostics that link batch size to representation geometry, early covariance tails to token efficiency, and spectral shifts to learning dynamics in decoder-only LLMs, backed by a mechanistic model.

TLoRA: Task-aware Low Rank Adaptation of Large Language Models

cs.CL · 2026-04-20 · unverdicted · novelty 6.0

TLoRA jointly optimizes LoRA initialization via task-data SVD and sensitivity-driven rank allocation, delivering stronger results than standard LoRA across NLU, reasoning, math, code, and chat tasks while using fewer trainable parameters.

Exploiting Correlations in Federated Learning: Opportunities and Practical Limitations

cs.IT · 2026-04-16 · unverdicted · novelty 6.0

A correlation-based taxonomy unifies existing FL compression methods; experiments show that correlation strengths vary by task and architecture, and adaptive mode-switching designs are proposed to exploit this.

Grokking as Dimensional Phase Transition in Neural Networks

cs.LG · 2026-04-06 · unverdicted · novelty 6.0

Grokking occurs as the effective dimensionality of the gradient field transitions from sub-diffusive to super-diffusive at the onset of generalization, exhibiting self-organized criticality.

Language Models (Mostly) Know What They Know

cs.CL · 2022-07-11 · unverdicted · novelty 6.0

Language models show good calibration when asked to estimate the probability that their own answers are correct, with performance improving as models get larger.

A General Language Assistant as a Laboratory for Alignment

cs.CL · 2021-12-01 · conditional · novelty 6.0

Ranked preference modeling outperforms imitation learning for language model alignment and scales more favorably with model size.

On the Convergence Analysis of Muon

stat.ML · 2025-05-29 · unverdicted · novelty 5.0

Convergence analysis shows Muon outperforms gradient descent by exploiting low-rank structure in neural network Hessians.

DBLP: Phase-Aware Bounded-Loss Transport for Burst-Resilient Distributed ML Training

cs.LG · 2026-05-03 · unverdicted · novelty 5.0

DBLP is a training-phase-aware bounded-loss transport protocol that reduces end-to-end distributed ML training time by 24.4% on average (up to 33.9%) and achieves up to 5.88x communication speedup during microbursts while maintaining comparable test accuracy.

Step-Video-T2V Technical Report: The Practice, Challenges, and Future of Video Foundation Model

cs.CV · 2025-02-14 · unverdicted · novelty 4.0

Step-Video-T2V describes a 30B-parameter text-to-video model with custom Video-VAE, 3D DiT, flow matching, and Video-DPO that claims state-of-the-art results on a new internal benchmark.

Hessian based analysis of SGD for Deep Nets: Dynamics and Generalization

cs.LG · 2019-07-24 · unverdicted · novelty 4.0

Provides Hessian-based theoretical characterizations of SGD dynamics and a scale-invariant generalization bound for deep nets, backed by experiments on synthetic data, MNIST, and CIFAR-10.

citing papers explorer

Showing 19 of 19 citing papers.

Scaling Laws for Neural Language Models
cs.LG · 2020-01-23 · unverdicted · none · ref 4
Empirical power-law scaling governs language model loss versus model size, data size, and compute, enabling optimal allocation of training compute.

AMUSE: Anytime Muon with Stable Gradient Evaluation
cs.LG · 2026-05-21 · unverdicted · none · ref 2 · internal anchor
AMUSE is a new optimizer integrating Muon orthogonalization with Schedule-Free averaging via adaptive interpolation for schedule-free anytime training that improves Pareto frontiers on vision and LLM tasks.

Depth, Not Data: An Analysis of Hessian Spectral Bifurcation
cs.LG · 2026-01-31 · unverdicted · none · ref 9 · internal anchor
Deep linear networks with balanced data covariance exhibit Hessian spectral bifurcation whose dominant-to-bulk eigenvalue ratio scales linearly with depth.

GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection
cs.LG · 2024-03-06 · conditional · none · ref 14 · internal anchor
GaLore performs full-parameter LLM training with up to 65.5% less optimizer memory by projecting gradients onto a low-rank subspace at each step, matching full-rank performance on LLaMA pre-training and RoBERTa fine-tuning.

The Lifecycle of the Spectral Edge: From Gradient Learning to Weight-Decay Compression
cs.LG · 2026-04-08 · unverdicted · none · ref 3
The spectral edge transitions from a gradient-driven functional direction before grokking to a perturbation-flat, ablation-critical compression axis at grokking, forming three universality classes predicted by a gap flow equation.

Spectral Edge Dynamics: An Analytical-Empirical Study of Phase Transitions in Neural Network Training
cs.LG · 2026-03-30 · unverdicted · none · ref 17 · internal anchor
Spectral gaps in the Gram matrix of parameter updates control phase transitions such as grokking in neural network training.

Scaling Laws for Transfer
cs.LG · 2021-02-02 · unverdicted · none · ref 134 · internal anchor
Effective data transferred from pre-training to fine-tuning is described by a power law in model parameter count and fine-tuning dataset size, acting like a multiplier on the fine-tuning data.

Dr. Post-Training: A Data Regularization Perspective on LLM Post-Training
cs.LG · 2026-05-08 · unverdicted · none · ref 34
Dr. Post-Training reframes general data as a data-induced regularizer for LLM post-training updates, yielding a family of methods that outperform data-selection baselines on SFT, RLHF, and RLVR tasks.

Pro-KLShampoo: Projected KL-Shampoo with Whitening Recovered by Orthogonalization
cs.LG · 2026-05-07 · unverdicted · none · ref 6
Pro-KLShampoo projects KL-Shampoo preconditioners to a spike-and-flat parametric form on an r-dimensional subspace and recovers the full algebraic preconditioner via orthogonalization, outperforming KL-Shampoo on GPT-2 and LLaMA pre-training scales.

Spectral Lens: Activation and Gradient Spectra as Diagnostics of LLM Optimization
stat.ML · 2026-05-07 · unverdicted · none · ref 16
Spectral analysis of activations and gradients provides new diagnostics that link batch size to representation geometry, early covariance tails to token efficiency, and spectral shifts to learning dynamics in decoder-only LLMs, backed by a mechanistic model.

TLoRA: Task-aware Low Rank Adaptation of Large Language Models
cs.CL · 2026-04-20 · unverdicted · none · ref 14
TLoRA jointly optimizes LoRA initialization via task-data SVD and sensitivity-driven rank allocation, delivering stronger results than standard LoRA across NLU, reasoning, math, code, and chat tasks while using fewer trainable parameters.

Exploiting Correlations in Federated Learning: Opportunities and Practical Limitations
cs.IT · 2026-04-16 · unverdicted · none · ref 29
A correlation-based taxonomy unifies existing FL compression methods; experiments show that correlation strengths vary by task and architecture, and adaptive mode-switching designs are proposed to exploit this.

Grokking as Dimensional Phase Transition in Neural Networks
cs.LG · 2026-04-06 · unverdicted · none · ref 22
Grokking occurs as the effective dimensionality of the gradient field transitions from sub-diffusive to super-diffusive at the onset of generalization, exhibiting self-organized criticality.

Language Models (Mostly) Know What They Know
cs.CL · 2022-07-11 · unverdicted · none · ref 253
Language models show good calibration when asked to estimate the probability that their own answers are correct, with performance improving as models get larger.

A General Language Assistant as a Laboratory for Alignment
cs.CL · 2021-12-01 · conditional · none · ref 176
Ranked preference modeling outperforms imitation learning for language model alignment and scales more favorably with model size.

On the Convergence Analysis of Muon
stat.ML · 2025-05-29 · unverdicted · none · ref 9 · internal anchor
Convergence analysis shows Muon outperforms gradient descent by exploiting low-rank structure in neural network Hessians.

DBLP: Phase-Aware Bounded-Loss Transport for Burst-Resilient Distributed ML Training
cs.LG · 2026-05-03 · unverdicted · none · ref 32
DBLP is a training-phase-aware bounded-loss transport protocol that reduces end-to-end distributed ML training time by 24.4% on average (up to 33.9%) and achieves up to 5.88x communication speedup during microbursts while maintaining comparable test accuracy.

Step-Video-T2V Technical Report: The Practice, Challenges, and Future of Video Foundation Model
cs.CV · 2025-02-14 · unverdicted · none · ref 215 · internal anchor
Step-Video-T2V describes a 30B-parameter text-to-video model with custom Video-VAE, 3D DiT, flow matching, and Video-DPO that claims state-of-the-art results on a new internal benchmark.

Hessian based analysis of SGD for Deep Nets: Dynamics and Generalization
cs.LG · 2019-07-24 · unverdicted · none · ref 23 · internal anchor
Provides Hessian-based theoretical characterizations of SGD dynamics and a scale-invariant generalization bound for deep nets, backed by experiments on synthetic data, MNIST, and CIFAR-10.
