Large loss spikes in SGD are polynomially likely and serve as the dominant mechanism for escaping sharp minima toward flatter solutions in the NTK regime.
Super-convergence: Very fast training of neural networks using large learning rates
2 Pith papers cite this work. Polarity classification is still indexing.
2
Pith papers citing it
fields
cs.LG 2years
2026 2verdicts
UNVERDICTED 2representative citing papers
FINCH is a loss-adaptive learning-rate schedule that reduces forgetting by 93% on average during LLM fine-tuning while matching standard task performance across several benchmarks.
citing papers explorer
-
Large Spikes in Stochastic Gradient Descent: A Large-Deviations View
Large loss spikes in SGD are polynomially likely and serve as the dominant mechanism for escaping sharp minima toward flatter solutions in the NTK regime.
-
Fine-Tuning Without Forgetting via Loss-Adaptive Learning Rates
FINCH is a loss-adaptive learning-rate schedule that reduces forgetting by 93% on average during LLM fine-tuning while matching standard task performance across several benchmarks.