Both λQ and λK remain below one, so the locality constants are not absorbing an uncontrolled blow-up

Setting λQ λK RP/∥X∥_F^2 Control 0 · 2024

1 Pith paper cites this work. Polarity classification is still indexing.


representative citing papers

Learning Less Is More: Premature Upper-Layer Attention Specialization Hurts Language Model Pretraining

cs.CL · 2026-05-11 · unverdicted · novelty 7.0

Temporarily reducing the learning rate on upper-layer query and key projections during early GPT pretraining prevents premature attention specialization and improves model performance.

citing papers explorer

Showing 1 of 1 citing paper.

Learning Less Is More: Premature Upper-Layer Attention Specialization Hurts Language Model Pretraining

cs.CL · 2026-05-11 · unverdicted · none · ref 21

Temporarily reducing the learning rate on upper-layer query and key projections during early GPT pretraining prevents premature attention specialization and improves model performance.
