pith. sign in

Rotational equilibrium: How weight decay balances learning across neural networks

3 Pith papers cite this work. Polarity classification is still indexing.

3 Pith papers citing it

citation-role summary

background 1

citation-polarity summary

fields

cs.LG 2 cs.CV 1

years

2026 3

roles

background 1

polarities

background 1

representative citing papers

Does Weight Decay Enhance Training Stability?

cs.LG · 2026-05-15 · conditional · novelty 6.0

Weight decay slows progressive sharpening at the edge of stability, inducing damped oscillations in CNNs and a phase transition to sub-2/η sharpness in MLPs driven by parameter-sharpness gradient alignment, yielding more stable NTK dynamics.

Demystifying Manifold Constraints in LLM Pre-training

cs.LG · 2026-05-06 · unverdicted · novelty 6.0

Manifold constraints via the new MACRO optimizer independently bound activation scales and enforce rotational equilibrium in LLM pre-training, subsuming RMS normalization and decoupled weight decay while delivering competitive performance with convergence guarantees.

citing papers explorer

Showing 3 of 3 citing papers.

  • Does Weight Decay Enhance Training Stability? cs.LG · 2026-05-15 · conditional · none · ref 30

    Weight decay slows progressive sharpening at the edge of stability, inducing damped oscillations in CNNs and a phase transition to sub-2/η sharpness in MLPs driven by parameter-sharpness gradient alignment, yielding more stable NTK dynamics.

  • Demystifying Manifold Constraints in LLM Pre-training cs.LG · 2026-05-06 · unverdicted · none · ref 16

    Manifold constraints via the new MACRO optimizer independently bound activation scales and enforce rotational equilibrium in LLM pre-training, subsuming RMS normalization and decoupled weight decay while delivering competitive performance with convergence guarantees.

  • Neural Network Optimization Reimagined: Decoupled Techniques for Scratch and Fine-Tuning cs.CV · 2026-04-21 · unverdicted · none · ref 17

    DualOpt decouples optimization by using real-time layer-wise weight decay for scratch training and weight rollback for fine-tuning to improve convergence, generalization, and reduce knowledge forgetting.