On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima
Canonical reference. 100% of citing Pith papers cite this work as background.
abstract
The stochastic gradient descent (SGD) method and its variants are algorithms of choice for many Deep Learning tasks. These methods operate in a small-batch regime wherein a fraction of the training data, say $32$-$512$ data points, is sampled to compute an approximation to the gradient. It has been observed in practice that when using a larger batch there is a degradation in the quality of the model, as measured by its ability to generalize. We investigate the cause for this generalization drop in the large-batch regime and present numerical evidence that supports the view that large-batch methods tend to converge to sharp minimizers of the training and testing functions - and as is well known, sharp minima lead to poorer generalization. In contrast, small-batch methods consistently converge to flat minimizers, and our experiments support a commonly held view that this is due to the inherent noise in the gradient estimation. We discuss several strategies to attempt to help large-batch methods eliminate this generalization gap.
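The gradient-noise argument in the abstract can be illustrated with a small sketch. This is not code from the paper: the linear least-squares problem, the helper names (`minibatch_gradient`, `full_gradient`, `gradient_noise`), and all sizes are assumptions chosen for illustration. It only shows that a mini-batch gradient is a noisy estimate of the full gradient and that the noise shrinks as the batch grows; it does not measure sharpness or generalization.

```python
import numpy as np


def full_gradient(w, X, y):
    """Gradient of the mean squared error over the whole training set."""
    residual = X @ w - y
    return 2.0 * X.T @ residual / len(X)


def minibatch_gradient(w, X, y, batch_size, rng):
    """Same gradient, estimated on a random mini-batch drawn without replacement."""
    idx = rng.choice(len(X), size=batch_size, replace=False)
    residual = X[idx] @ w - y[idx]
    return 2.0 * X[idx].T @ residual / batch_size


def gradient_noise(w, X, y, batch_size, trials=200, seed=0):
    """Mean squared distance between mini-batch and full gradients."""
    rng = np.random.default_rng(seed)
    g_full = full_gradient(w, X, y)
    squared_errors = [
        np.sum((minibatch_gradient(w, X, y, batch_size, rng) - g_full) ** 2)
        for _ in range(trials)
    ]
    return float(np.mean(squared_errors))


# Hypothetical toy data: 4096 points, 10 features, noisy linear targets.
data_rng = np.random.default_rng(0)
X = data_rng.normal(size=(4096, 10))
y = X @ data_rng.normal(size=10) + 0.5 * data_rng.normal(size=4096)
w = np.zeros(10)

# Batch sizes taken from the range quoted in the abstract (32 to 512).
noise_small = gradient_noise(w, X, y, batch_size=32)
noise_large = gradient_noise(w, X, y, batch_size=512)
print(f"batch 32:  {noise_small:.4f}")
print(f"batch 512: {noise_large:.4f}")
```

In this toy setting the batch of 32 gives a visibly noisier gradient estimate than the batch of 512. The abstract's claim is that this kind of noise is what steers small-batch methods toward flat minimizers.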
citation-role summary
roles: background 8
citation-polarity summary
polarities: background 8
representative citing papers
Effective noise scale non-monotonically governs model merging success with an optimum, unifying effects of learning rate, weight decay, batch size, and augmentation on the loss landscape.
One training example via RLVR boosts LLM math reasoning from 17.6% to 35.7% average across six benchmarks.
Derives explicit step-size conditions ensuring the metastability behavior of discrete SGD under heavy-tailed noise approximates its continuous SDE limit.
TabICL scales in-context learning to large tabular data via column-then-row attention for row embeddings followed by a transformer, matching TabPFNv2 speed and performance while outperforming it and CatBoost on datasets over 10K samples.
citing papers explorer
- Fix the Loss, Not the Radius: Rethinking the Adversarial Perturbation of Sharpness-Aware Minimization
  LE-SAM inverts SAM by fixing the loss budget instead of the parameter-space radius, yielding better generalization across benchmarks.
- Flatness and Gradient Alignment Are Both Necessary: Spectral-Aware Gradient-Aligned Exploration for Multi-Distribution Learning
  Excess risk decomposes into independent alignment (trace of inverse average Hessian times gradient covariance) and curvature terms, so both flatness and gradient alignment are required; SAGE achieves this and sets new SOTA on DomainBed.
- Characterizing and Correcting Effective Target Shift in Online Learning
  Online kernel regression equals offline regression with shifted targets; correcting the targets lets online learning match offline performance and outperform true targets in continual image classification.
- ConquerNet: Convolution-Smoothed Quantile ReLU Neural Networks with Minimax Guarantees
  ConquerNet smooths quantile ReLU networks with convolution for easier training and establishes minimax-optimal nonasymptotic risk bounds over Besov function classes.
- Estimating Implicit Regularization in Deep Learning
  Gradient matching empirically recovers implicit regularization effects such as l2 penalties from early stopping and dropout in neural networks.
- When Flat Minima Fail: Characterizing INT4 Quantization Collapse After FP32 Convergence
  FP32-converged language models enter a post-convergence phase where INT4 quantization error explodes while FP32 perplexity remains stable, with onset tied to fine convergence rather than learning rate decay.
- Stochastic Modified Equations for Stochastic Gradient Descent in Infinite-Dimensional Hilbert Spaces
  SGD dynamics in Hilbert spaces are approximated by an SDE with cylindrical noise, with the weak error between discrete and continuous versions shown to be second order in the step size.
- FLORA: A deep learning approach to predict forest attributes from heterogeneous LiDAR data
  FLORA is an octree-based deep learning framework with auxiliary data fusion that predicts forest attributes from heterogeneous LiDAR, achieving rRMSE of 12.3% for dominant height and 39% for total volume on 32k French NFI plots.
- Characterizing Optimizer-Dependent Training Dynamics Through Hessian Eigenvector Displacement and Localization
  Hessian eigenvector displacement and inverse participation ratio metrics show SGD stabilizing leading curvature directions while Adam causes more reorganization and parameter localization in MLP training.
- Closed-Form Steepest Descent Direction toward Flat Minima: Reducing Upper Bounds on the Loss Hessian Eigenspectrum in Neural Networks
  Derives closed-form gradient of WS upper bound on Hessian max eigenvalue for 3-layer cross-entropy NNs and proposes HSR regularization to steer toward flat minima.
- MONA: Muon Optimizer with Nesterov Acceleration for Scalable Language Model Training
  MONA integrates Nesterov acceleration into Muon's orthogonalization framework, reporting better convergence than Muon and AdamW on MoE models up to 68B parameters trained on 1T tokens and SOTA fine-tuning results.
- Batch Normalization Amplifies Memorization and Privacy Risks
  Batch normalization amplifies memorization of outlier samples in deep neural networks, directly increasing susceptibility to membership inference attacks.
- A Sharper Picture of Generalization in Transformers
  PAC-Bayes applied to low-sharpness flat minima yields non-vacuous generalization bounds for boolean functions whose Fourier spectra are sparse and low-degree, with parameters estimable by property testing.
- Spectral Lens: Activation and Gradient Spectra as Diagnostics of LLM Optimization
  Spectral analysis of activations and gradients provides new diagnostics that link batch size to representation geometry, early covariance tails to token efficiency, and spectral shifts to learning dynamics in decoder-only LLMs, backed by a mechanistic model.
- Feature Starvation as Geometric Instability in Sparse Autoencoders
  Adaptive elastic net SAEs (AEN-SAEs) mitigate feature starvation in SAEs by combining ℓ2 structural stability with adaptive ℓ1 reweighting, producing a Lipschitz-continuous sparse coding map that recovers global feature support under mild assumptions.
- Sharpness-Aware Pretraining Mitigates Catastrophic Forgetting
  Sharpness-aware pretraining and related flat-minima interventions reduce catastrophic forgetting by up to 80% after post-training across 20M-150M models and by 31-40% at 1B scale.
- Generalization at the Edge of Stability
  Training at the edge of stability causes neural network optimizers to converge on fractal attractors whose effective dimension, measured via a new sharpness dimension from the Hessian spectrum, bounds generalization error in a way not captured by prior trace or norm measures.
- Lorentz Framework for Semantic Segmentation
  A Lorentz-model hyperbolic framework for semantic segmentation that integrates with Euclidean networks, provides free uncertainty maps, and is validated on ADE20K, COCO-Stuff, Pascal-VOC and Cityscapes using DeepLabV3, SegFormer, Mask2Former and MaskFormer.
- Robust Policy Optimization to Prevent Catastrophic Forgetting
  FRPO applies a max-min robust optimization over KL-bounded policy neighborhoods during RLHF to reduce catastrophic forgetting of safety and accuracy under subsequent SFT or RL fine-tuning.
- Demystifying Data Organization for Enhanced LLM Training
  Four guidelines for data organization and two new ordering methods (STR and SAW) improve LLM training stability and performance across scales when reusing pre-computed scores.
- Comparing Classical Simulation and Sample-Based Learning of Quantum Systems: Learning the Hardness of Quantum Systems from Samples
  Empirical study finds neural-network learning difficulty (via Hessian eigenvalue and random subspace optimization) correlates with classical simulation hardness parameterized by MPS bond dimension and T-gate count.
- Improving Generalization by Permutation Routing Across Model Copies
  Replicating models and routing their local losses via permutations from a mixing kernel Q enables structured message sharing that improves generalization.
- From Flat Facts to Sharp Hallucinations: Detecting Stubborn Errors via Gradient Sensitivity
  EPGS detects high-confidence factual errors in LLMs by using embedding perturbations to measure gradient sensitivity as a proxy for sharp versus flat minima.
- Sampling Parallelism for Fast and Efficient Bayesian Learning
  Sampling parallelism distributes Bayesian sample evaluations across GPUs for near-perfect scaling, lower memory use, and faster convergence via per-GPU data augmentations, outperforming pure data parallelism in diversity.
- Why Invariance is Not Enough for Biomedical Domain Generalization and How to Fix It
  MaskGen improves domain generalization for biomedical image segmentation by using source intensities plus domain-stable foundation model representations with minimal added complexity.
- Spectral methods: crucial for machine learning, natural for quantum computers?
  Quantum computers may enable more natural manipulation of Fourier spectra in ML models via the Quantum Fourier Transform, potentially leading to resource-efficient spectral methods.
- Intelligence Inertia: Physical Isomorphism and Applications
  Intelligence Inertia models the computational resistance to structural change in neural networks via a heuristic relativistic analogy, yielding a J-shaped cost curve that diverges from classical approximations.
- Learning Sparse Compositional Functions with Norm-Constrained Neural Networks
  Derives approximation rates and excess risk bounds for Frobenius norm-constrained DNNs learning sparse compositional functions on DAGs, applicable to multi-index models and binary trees while avoiding the curse of dimensionality.
- Revitalizing the Beginning: Avoiding Storage Dependency for Model Merging in Continual Learning
  The paper proposes Trajectory Regularized Merging (TRM) to enable storage-free model merging in continual learning by optimizing in an augmented trajectory subspace with task alignment, prediction consistency, and gradient responsiveness objectives, claiming SOTA results.
- A Gesture-Based Visual Learning Model for Acoustophoretic Interactions using a Swarm of AcoustoBots
  OpenCLIP-based gesture classification with linear probing controls AcoustoBot swarms at 87.8% accuracy and 3.95 s latency in controlled tests.
- There Will Be a Scientific Theory of Deep Learning
  A mechanics of the learning process is emerging in deep learning theory, characterized by dynamics, coarse statistics, and falsifiable predictions across idealized settings, limits, laws, hyperparameters, and universal behaviors.