A theory on Adam instability in large-scale machine learning.arXiv preprint arXiv:2304.09871,

Igor Molybog, Peter Albert, Moya Chen, Zachary DeVito, David Esiobu, Naman Goyal, Punit Singh Koura, Sharan Narang, Andrew Poulton, Ruan Silva, et al · 2023 · arXiv 2304.09871

9 Pith papers cite this work. Polarity classification is still indexing.

9 Pith papers citing it

read on arXiv browse 9 citing papers

citation-role summary

background 3 method 1

citation-polarity summary

background 3 use method 1

representative citing papers

Why Low-Precision Transformer Training Fails: An Analysis on Flash Attention

cs.LG · 2025-10-05 · unverdicted · novelty 7.0

Low-precision Flash Attention fails due to similar low-rank attention representations combined with biased rounding errors that accumulate and corrupt weight updates; a minimal fix to reduce rounding bias stabilizes training.

MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention

cs.CL · 2025-06-16 · unverdicted · novelty 6.0

MiniMax-M1 is a 456B parameter hybrid-attention MoE model trained with CISPO RL that achieves performance comparable or superior to DeepSeek-R1 and Qwen3-235B on reasoning and software engineering tasks while training in three weeks on 512 GPUs.

GWT: Scalable Optimizer State Compression for Large Language Model Training

cs.LG · 2025-01-13 · unverdicted · novelty 6.0

GWT projects gradients into wavelet subspaces to compress optimizer states for memory-efficient LLM training while claiming performance parity with full-rank updates.

Emerging Properties in Unified Multimodal Pretraining

cs.CV · 2025-05-20 · unverdicted · novelty 5.0

BAGEL is a unified decoder-only model that develops emerging complex multimodal reasoning abilities after pretraining on large-scale interleaved data and outperforms prior open-source unified models.

Open-Sora: Democratizing Efficient Video Production for All

cs.CV · 2024-12-29 · unverdicted · novelty 5.0

Open-Sora releases an open-source video generation model based on a Spatial-Temporal Diffusion Transformer that decouples spatial and temporal attention, supporting text-to-video, image-to-video, and text-to-image tasks with claimed high fidelity.

PowLU: An Activation Function for Stable Pre-Training of LLMs

cs.CL · 2026-05-25 · unverdicted · novelty 4.0

PowLU replaces SwiGLU with a rational-power activation to reduce outlier amplification and numerical instability during large-scale LLM pre-training while matching performance.

Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

cs.CL · 2025-07-07 · unverdicted · novelty 4.0

Gemini 2.5 Pro and Flash models are presented as achieving frontier performance in reasoning, coding, and long-context multimodal tasks while spanning a cost-capability Pareto curve.

Navigating LLM Valley: From AdamW to Memory-Efficient and Matrix-Based Optimizers

cs.LG · 2026-05-09 · unverdicted · novelty 3.0

This survey organizes LLM optimizer literature into categories and argues the field is shifting toward rigorous, multi-factor comparisons of convergence, memory, stability, and complexity.

Grokking or Glitching? How Low-Precision Drives Slingshot Loss Spikes

cs.LG · 2026-05-07 · 2 refs

citing papers explorer

Showing 8 of 8 citing papers after filters.

Why Low-Precision Transformer Training Fails: An Analysis on Flash Attention cs.LG · 2025-10-05 · unverdicted · none · ref 17
Low-precision Flash Attention fails due to similar low-rank attention representations combined with biased rounding errors that accumulate and corrupt weight updates; a minimal fix to reduce rounding bias stabilizes training.
MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention cs.CL · 2025-06-16 · unverdicted · none · ref 23
MiniMax-M1 is a 456B parameter hybrid-attention MoE model trained with CISPO RL that achieves performance comparable or superior to DeepSeek-R1 and Qwen3-235B on reasoning and software engineering tasks while training in three weeks on 512 GPUs.
GWT: Scalable Optimizer State Compression for Large Language Model Training cs.LG · 2025-01-13 · unverdicted · none · ref 30
GWT projects gradients into wavelet subspaces to compress optimizer states for memory-efficient LLM training while claiming performance parity with full-rank updates.
Emerging Properties in Unified Multimodal Pretraining cs.CV · 2025-05-20 · unverdicted · none · ref 52
BAGEL is a unified decoder-only model that develops emerging complex multimodal reasoning abilities after pretraining on large-scale interleaved data and outperforms prior open-source unified models.
Open-Sora: Democratizing Efficient Video Production for All cs.CV · 2024-12-29 · unverdicted · none · ref 21
Open-Sora releases an open-source video generation model based on a Spatial-Temporal Diffusion Transformer that decouples spatial and temporal attention, supporting text-to-video, image-to-video, and text-to-image tasks with claimed high fidelity.
PowLU: An Activation Function for Stable Pre-Training of LLMs cs.CL · 2026-05-25 · unverdicted · none · ref 15
PowLU replaces SwiGLU with a rational-power activation to reduce outlier amplification and numerical instability during large-scale LLM pre-training while matching performance.
Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities cs.CL · 2025-07-07 · unverdicted · none · ref 55
Gemini 2.5 Pro and Flash models are presented as achieving frontier performance in reasoning, coding, and long-context multimodal tasks while spanning a cost-capability Pareto curve.
Navigating LLM Valley: From AdamW to Memory-Efficient and Matrix-Based Optimizers cs.LG · 2026-05-09 · unverdicted · none · ref 25
This survey organizes LLM optimizer literature into categories and argues the field is shifting toward rigorous, multi-factor comparisons of convergence, memory, stability, and complexity.

A theory on Adam instability in large-scale machine learning.arXiv preprint arXiv:2304.09871,

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer