SPAM: Spike-Aware Adam with Momentum Reset for Stable LLM Training. arXiv preprint arXiv:2501.06842

Tianjin Huang, Ziquan Zhu, Gaojie Jin, Lu Liu, Zhangyang Wang, Shiwei Liu · 2025 · arXiv 2501.06842

4 Pith papers cite this work. Polarity classification is still indexing.


representative citing papers

Why Low-Precision Transformer Training Fails: An Analysis on Flash Attention

cs.LG · 2025-10-05 · unverdicted · novelty 7.0

Low-precision Flash Attention fails due to similar low-rank attention representations combined with biased rounding errors that accumulate and corrupt weight updates; a minimal fix to reduce rounding bias stabilizes training.

Breaking the Bubble: Asynchronous Pipeline Parallel Training with Bounded Weight Inconsistency

cs.LG · 2026-06-05 · unverdicted · novelty 6.0

PACI enables bubble-free asynchronous pipeline training by bounding version drift via local gradient accumulation, matching synchronous stability with higher throughput and no extra memory.

GNMR: Runtime Stability Control for Low-Precision Large Language Model Training

cs.LG · 2026-05-30 · unverdicted · novelty 5.0

GNMR is a gradient-norm-based controller that maps local stability signals to budgeted recovery actions to stabilize low-precision LLM training while preserving quality.

Outer-Momentum Restarting in High-Dimensional Two-Phase Optimization

cs.LG · 2026-05-27 · unverdicted · novelty 5.0

Periodic outer-momentum restarts in two-phase optimizers exploit phase cancellation in a linearized NTK model to widen stable learning-rate and momentum ranges in language-model pretraining.

citing papers explorer


Why Low-Precision Transformer Training Fails: An Analysis on Flash Attention

cs.LG · 2025-10-05 · unverdicted · none · ref 10

Breaking the Bubble: Asynchronous Pipeline Parallel Training with Bounded Weight Inconsistency

cs.LG · 2026-06-05 · unverdicted · none · ref 11

GNMR: Runtime Stability Control for Low-Precision Large Language Model Training

cs.LG · 2026-05-30 · unverdicted · none · ref 21

Outer-Momentum Restarting in High-Dimensional Two-Phase Optimization

cs.LG · 2026-05-27 · unverdicted · none · ref 14
