Parallel stochastic gradient algorithms for large-scale matrix completion
2 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
roles: background 1
citation-polarity summary
polarities: background 1
representative citing papers
FlashAttention reduces GPU high-bandwidth memory accesses in self-attention via tiling, delivering exact attention with lower IO complexity, 2-3x wall-clock speedups on models like GPT-2, and the ability to train on sequences up to 64K long.
citing papers explorer
- High Probability Guarantees for Random Reshuffling
  High-probability ergodic and last-iterate complexity guarantees for random reshuffling SGD on smooth nonconvex optimization that match best in-expectation bounds up to logarithmic factors without extra assumptions.
- FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
  FlashAttention reduces GPU high-bandwidth memory accesses in self-attention via tiling, delivering exact attention with lower IO complexity, 2-3x wall-clock speedups on models like GPT-2, and the ability to train on sequences up to 64K long.