TaperNorm, EMA rates, and scale loss. For any layer that is tapered, we keep the gate at g = 1 during learning-rate warmup

Warmup uses 5% of total steps for every run · 2025

1 Pith paper cites this work. Polarity classification is still indexing.


representative citing papers

Gated Normalization Removal and Scale Anchoring in Pre-Norm Transformers

cs.LG · 2026-02-11 · unverdicted · novelty 6.0

TaperNorm gradually removes internal normalization in pre-norm transformers via learned gates that reach zero, revealing final norm as a scale anchor and enabling up to 1.18x faster KV-cached decoding with small loss increases.

citing papers explorer

Showing 1 of 1 citing paper.

Gated Normalization Removal and Scale Anchoring in Pre-Norm Transformers cs.LG · 2026-02-11 · unverdicted · none · ref 14
