LAMB optimizer trains BERT with batch size 32868, reducing training time to 76 minutes on TPUv3 Pod without performance loss.
Then LARS convergence rate can be written in the following manner: (E[∥∇f(xa)∥)2≤ O ((f(x1)− f(x∗))L∞ T ψL ψ2g +∥σ∥2 T ψ2 σ ψ2g )
1 Pith paper cite this work. Polarity classification is still indexing.
1
Pith paper citing it
fields
cs.LG 1years
2019 1verdicts
CONDITIONAL 1representative citing papers
citing papers explorer
-
Large Batch Optimization for Deep Learning: Training BERT in 76 minutes
LAMB optimizer trains BERT with batch size 32868, reducing training time to 76 minutes on TPUv3 Pod without performance loss.