SMA-DP-SGD augments DP-SGD with a spectral memory-aware fractional branch built from prior privatized updates, improving accuracy on CIFAR and MNIST while preserving conditional differential privacy.
5 Pith papers cite this work.
fields: cs.LG (5)
years: 2026 (5)
roles: background (1)
polarities: background (1)
representative citing papers
Weight decay controls distinct learning regimes in grokking transformers on modular arithmetic, tracked by new cheap attention-based diagnostics with empirical critical value and exponent fits.
Random Matrix Theory detects overfitting via growing Correlation Traps in weight spectra during the anti-grokking phase of neural network training.
DyT improves validation loss by 27% at 64M params/1M tokens but worsens it by 19% at 118M tokens, with saturation levels predicting the sign of the effect.
citing papers explorer
- Weight Decay Regimes in Grokking Transformers: Cheap Online Diagnostics