arXiv preprint arXiv:2206.05794

· 2022 · arXiv 2206.05794

6 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

The Implicit Bias of Depth: From Neural Collapse to Softmax Codes

cs.LG · 2026-05-21 · unverdicted · novelty 7.0

Depth induces an implicit low-rank bias in deep unconstrained feature models trained with unregularized multiclass cross-entropy, promoting softmax codes over neural collapse via more efficient norm propagation.

Evolutionary Search for Automated Design of Uncertainty Quantification Methods

cs.CL · 2026-04-03 · unverdicted · novelty 7.0

LLM-driven evolutionary search discovers unsupervised UQ methods as Python programs that improve ROC-AUC by up to 6.7% over manual baselines on atomic claim verification across 9 datasets with OOD generalization.

Deciphering Two Training Clocks in Grokking via Deep Linear Network Theory with Conditional ReLU Reduction

cs.LG · 2026-06-04 · unverdicted · novelty 6.0

Deep linear network theory derives logarithmic decay for cross-entropy loss under gap-growth conditions versus polynomial closure for Schatten-regularized structural energy under late-time KL tails, separating fitting from simplification; conditional reductions extend this to ReLU MLPs with fixed ac

Weight Decay Regimes in Grokking Transformers: Cheap Online Diagnostics

cs.LG · 2026-05-19 · conditional · novelty 6.0

Weight decay controls distinct learning regimes in grokking transformers on modular arithmetic, tracked by new cheap attention-based diagnostics with empirical critical value and exponent fits.

Does Weight Decay Enhance Training Stability?

cs.LG · 2026-05-15 · conditional · novelty 6.0

Weight decay slows progressive sharpening at the edge of stability, inducing damped oscillations in CNNs and a phase transition to sub-2/η sharpness in MLPs driven by parameter-sharpness gradient alignment, yielding more stable NTK dynamics.

Reversible Foundations: Training a 120B Sparse MoE through State-Preserving Scaling

cs.LG · 2026-06-05 · unverdicted · novelty 4.0

A 120B sparse MoE model with 460 experts was trained on one 8-GPU node to loss 1.78 using reversible recurrence and state-preserving scaling from a 1.78B dense seed, with 5.93B active parameters.

citing papers explorer

The Implicit Bias of Depth: From Neural Collapse to Softmax Codes · cs.LG · 2026-05-21 · unverdicted · none · ref 137
Evolutionary Search for Automated Design of Uncertainty Quantification Methods · cs.CL · 2026-04-03 · unverdicted · none · ref 2
Deciphering Two Training Clocks in Grokking via Deep Linear Network Theory with Conditional ReLU Reduction · cs.LG · 2026-06-04 · unverdicted · none · ref 25
Reversible Foundations: Training a 120B Sparse MoE through State-Preserving Scaling · cs.LG · 2026-06-05 · unverdicted · none · ref 40
