Canonical reference

Peter Holderrieth, Yilun Xu, and Tommi Jaakkola

Sepp Hochreiter, Jürgen Schmidhuber · 1997 · Neural Computation · DOI 10.1162/neco.1997.9.1.1 · arXiv gov/9117894

Canonical reference. 100% of citing Pith papers cite this work as background.

13 Pith papers citing it

479 external citations · Crossref

Background 100% of classified citations

citation-role summary

background 5

citation-polarity summary

background 5

representative citing papers

Are Flat Minima an Illusion?

cs.LG · 2026-03-24 · unverdicted · novelty 8.0

Flat minima are illusory; generalization is driven by weakness, a reparameterization-invariant measure of compatible completions that predicts performance better than sharpness on MNIST and Fashion-MNIST.

Navigating Potholes with Geometry-Aware Sharpness Minimization

cs.LG · 2026-05-15 · unverdicted · novelty 7.0

LLQR+SAM pairs a slow learned geometry preconditioner with fast SAM perturbations to amplify escape from locally sharp 'potholes' while stabilizing flat basins, producing consistent gains over SAM and LLQR alone.

How to Scale Mixture-of-Experts: From muP to the Maximally Scale-Stable Parameterization

cs.LG · 2026-05-13 · unverdicted · novelty 7.0

The authors derive a Maximally Scale-Stable Parameterization (MSSP) for MoE models that achieves robust learning-rate transfer and monotonic performance gains with scale across co-scaling regimes of width, experts, and sparsity.

On the Generalization of Knowledge Distillation: An Information-Theoretic View

cs.IT · 2026-05-13 · unverdicted · novelty 7.0 · 2 refs

Derives upper and lower generalization bounds for the student relative to the teacher using a new distillation divergence, plus a loss-sharpness-aware bound and a bias-variance-rank decomposition in the linear Gaussian case.

Estimating Implicit Regularization in Deep Learning

stat.ML · 2026-05-06 · unverdicted · novelty 7.0

Gradient matching empirically recovers implicit regularization effects such as l2 penalties from early stopping and dropout in neural networks.

When Flat Minima Fail: Characterizing INT4 Quantization Collapse After FP32 Convergence

cs.LG · 2026-04-16 · conditional · novelty 7.0

FP32-converged language models enter a post-convergence phase where INT4 quantization error explodes while FP32 perplexity remains stable, with onset tied to fine convergence rather than learning rate decay.

Product-Stability: Provable Convergence for Gradient Descent on the Edge of Stability

cs.LG · 2026-04-03 · unverdicted · novelty 7.0

For losses with product-stable minima, gradient descent on l(xy) converges provably at the edge of stability, with bifurcation diagrams characterizing the resulting stable oscillations and sharpness.

Feature Starvation as Geometric Instability in Sparse Autoencoders

cs.LG · 2026-05-06 · unverdicted · novelty 6.0

Adaptive elastic net SAEs (AEN-SAEs) mitigate feature starvation in SAEs by combining ℓ2 structural stability with adaptive ℓ1 reweighting, producing a Lipschitz-continuous sparse coding map that recovers global feature support under mild assumptions.

Sharpness-Aware Pretraining Mitigates Catastrophic Forgetting

cs.LG · 2026-05-04 · unverdicted · novelty 6.0

Sharpness-aware pretraining and related flat-minima interventions reduce catastrophic forgetting by up to 80% after post-training across 20M-150M models and by 31-40% at 1B scale.

The Role of Symmetry in Optimizing Overparameterized Networks

cs.LG · 2026-04-28 · unverdicted · novelty 6.0 · 2 refs

Overparameterization adds symmetries that precondition the Hessian for better minima and increase the probability mass of global minima near typical initializations.

From Flat Facts to Sharp Hallucinations: Detecting Stubborn Errors via Gradient Sensitivity

cs.LG · 2026-05-01 · unverdicted · novelty 5.0

EPGS detects high-confidence factual errors in LLMs by using embedding perturbations to measure gradient sensitivity as a proxy for sharp versus flat minima.

Wolkowicz-Styan Upper Bound on the Hessian Eigenspectrum for Cross-Entropy Loss in Nonlinear Smooth Neural Networks

cs.LG · 2026-04-11 · unverdicted · novelty 5.0

A closed-form upper bound on the maximum Hessian eigenvalue of cross-entropy loss is derived for smooth nonlinear neural networks.

A Physics-Inspired Optimizer: Velocity Regularized Adam

cs.LG · 2025-05-19 · unverdicted · novelty 5.0

VRAdam hybridizes Adam's per-parameter adaptation with a physics-inspired velocity regularizer to stabilize training at the edge of stability, delivering better empirical performance than AdamW and O(ln(N)/sqrt(N)) convergence bounds under mild assumptions.

citing papers explorer

Showing 13 of 13 citing papers.

Are Flat Minima an Illusion? · cs.LG · 2026-03-24 · unverdicted · none · ref 5
Flat minima are illusory; generalization is driven by weakness, a reparameterization-invariant measure of compatible completions that predicts performance better than sharpness on MNIST and Fashion-MNIST.
Navigating Potholes with Geometry-Aware Sharpness Minimization · cs.LG · 2026-05-15 · unverdicted · none · ref 6
LLQR+SAM pairs a slow learned geometry preconditioner with fast SAM perturbations to amplify escape from locally sharp 'potholes' while stabilizing flat basins, producing consistent gains over SAM and LLQR alone.
How to Scale Mixture-of-Experts: From muP to the Maximally Scale-Stable Parameterization · cs.LG · 2026-05-13 · unverdicted · none · ref 134
The authors derive a Maximally Scale-Stable Parameterization (MSSP) for MoE models that achieves robust learning-rate transfer and monotonic performance gains with scale across co-scaling regimes of width, experts, and sparsity.
On the Generalization of Knowledge Distillation: An Information-Theoretic View · cs.IT · 2026-05-13 · unverdicted · none · ref 14 · 2 links
Derives upper and lower generalization bounds for the student relative to the teacher using a new distillation divergence, plus a loss-sharpness-aware bound and a bias-variance-rank decomposition in the linear Gaussian case.
Estimating Implicit Regularization in Deep Learning · stat.ML · 2026-05-06 · unverdicted · none · ref 17
Gradient matching empirically recovers implicit regularization effects such as l2 penalties from early stopping and dropout in neural networks.
When Flat Minima Fail: Characterizing INT4 Quantization Collapse After FP32 Convergence · cs.LG · 2026-04-16 · conditional · none · ref 8
FP32-converged language models enter a post-convergence phase where INT4 quantization error explodes while FP32 perplexity remains stable, with onset tied to fine convergence rather than learning rate decay.
Product-Stability: Provable Convergence for Gradient Descent on the Edge of Stability · cs.LG · 2026-04-03 · unverdicted · none · ref 1
For losses with product-stable minima, gradient descent on l(xy) converges provably at the edge of stability, with bifurcation diagrams characterizing the resulting stable oscillations and sharpness.
Feature Starvation as Geometric Instability in Sparse Autoencoders · cs.LG · 2026-05-06 · unverdicted · none · ref 19
Adaptive elastic net SAEs (AEN-SAEs) mitigate feature starvation in SAEs by combining ℓ2 structural stability with adaptive ℓ1 reweighting, producing a Lipschitz-continuous sparse coding map that recovers global feature support under mild assumptions.
Sharpness-Aware Pretraining Mitigates Catastrophic Forgetting · cs.LG · 2026-05-04 · unverdicted · none · ref 52
Sharpness-aware pretraining and related flat-minima interventions reduce catastrophic forgetting by up to 80% after post-training across 20M-150M models and by 31-40% at 1B scale.
The Role of Symmetry in Optimizing Overparameterized Networks · cs.LG · 2026-04-28 · unverdicted · none · ref 20 · 2 links
Overparameterization adds symmetries that precondition the Hessian for better minima and increase the probability mass of global minima near typical initializations.
From Flat Facts to Sharp Hallucinations: Detecting Stubborn Errors via Gradient Sensitivity · cs.LG · 2026-05-01 · unverdicted · none · ref 2
EPGS detects high-confidence factual errors in LLMs by using embedding perturbations to measure gradient sensitivity as a proxy for sharp versus flat minima.
Wolkowicz-Styan Upper Bound on the Hessian Eigenspectrum for Cross-Entropy Loss in Nonlinear Smooth Neural Networks · cs.LG · 2026-04-11 · unverdicted · none · ref 6
A closed-form upper bound on the maximum Hessian eigenvalue of cross-entropy loss is derived for smooth nonlinear neural networks.
A Physics-Inspired Optimizer: Velocity Regularized Adam · cs.LG · 2025-05-19 · unverdicted · none · ref 11
VRAdam hybridizes Adam's per-parameter adaptation with a physics-inspired velocity regularizer to stabilize training at the edge of stability, delivering better empirical performance than AdamW and O(ln(N)/sqrt(N)) convergence bounds under mild assumptions.
