SignSGD provably beats SGD by a factor of d under sparse noise via matched ℓ1-norm upper and lower bounds, with an equivalent result for Muon on matrices, and this predicts faster GPT-2 pretraining.
BERT: Pre-training of deep bidirectional transformers for language understanding
2 Pith papers cite this work. Polarity classification is still indexing.
2
Pith papers citing it
fields
cs.LG 2years
2026 2verdicts
UNVERDICTED 2representative citing papers
FedFrozen improves stability in heterogeneous federated Transformer training by warming up the full model then freezing the attention kernel (query/key) while optimizing the value block under a fixed kernel.
citing papers explorer
-
When and Why SignSGD Outperforms SGD: A Theoretical Study Based on $\ell_1$-norm Lower Bounds
SignSGD provably beats SGD by a factor of d under sparse noise via matched ℓ1-norm upper and lower bounds, with an equivalent result for Muon on matrices, and this predicts faster GPT-2 pretraining.
-
FedFrozen: Two-Stage Federated Optimization via Attention Kernel Freezing
FedFrozen improves stability in heterogeneous federated Transformer training by warming up the full model then freezing the attention kernel (query/key) while optimizing the value block under a fixed kernel.