SMA-DP-SGD augments DP-SGD with a spectral memory-aware fractional branch from prior privatized updates to improve accuracy on CIFAR and MNIST while preserving conditional differential privacy.
Martin, Tongsu (Serena) Peng, and Michael W
6 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
fields
cs.LG 6years
2026 6roles
background 1polarities
background 1representative citing papers
Weight decay controls distinct learning regimes in grokking transformers on modular arithmetic, tracked by new cheap attention-based diagnostics with empirical critical value and exponent fits.
Random Matrix Theory detects overfitting via growing Correlation Traps in weight spectra during the anti-grokking phase of neural network training.
Marchenko-Pastur random-matrix pruning of DNNs yields theoretical certificates for accuracy preservation under small fine-tuning and empirical ImageNet results with 50-60% MAC reduction and sub-2pp accuracy drops on ViT and CNN models.
DyT improves validation loss 27% at 64M params/1M tokens but worsens it 19% at 118M tokens, with saturation levels predicting the sign of the effect.