Low-precision Flash Attention fails due to similar low-rank attention representations combined with biased rounding errors that accumulate and corrupt weight updates; a minimal fix to reduce rounding bias stabilizes training.
Low-precision training of large language models: Methods, challenges, and opportunities
6 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
verdicts
UNVERDICTED 6roles
background 1polarities
background 1representative citing papers
LC-QAT achieves data-efficient 2-bit weight-only QAT for LLMs by representing quantized weights as a learned affine transform over discrete vectors, supporting end-to-end optimization from a high-quality PTQ start.
MXFP4 quantization error decomposes into scale bias, deadzone truncation, and grid noise; mode-targeted corrections recover BF16 accuracy within 0.7% on Qwen2.5-3B and exceed it by 1.0% on Qwen3-30B-A3B.
StoSignSGD resolves SignSGD divergence on non-smooth objectives via structural stochasticity, matching optimal convex rates and improving non-convex bounds while delivering 1.44-2.14x speedups in FP8 LLM pretraining.
Proposes High-Precision Scoring (HPS) and Tie-aware Retrieval Metrics (TRM) to reduce tie-induced instability in low-precision retrieval evaluation.
PowLU replaces SwiGLU with a rational-power activation to reduce outlier amplification and numerical instability during large-scale LLM pre-training while matching performance.
citing papers explorer
No citing papers match the current filters.