On layer normalization in the transformer architecture

Ruibin Xiong, Yunchang Yang, Di He, Kai Zheng, Shuxin Zheng, Chen Xing, Huishuai Zhang, Yanyan Lan, Liwei Wang, Tieyan Liu · 2020

4 Pith papers cite this work. Polarity classification is still indexing.

4 Pith papers citing it

browse 4 citing papers

representative citing papers

NEST: Nested Event Stream Transformer for Sequences of Multisets

cs.LG · 2026-01-31 · unverdicted · novelty 7.0

NEST is a nested transformer for sequences of multisets that uses masked set modeling to learn improved set-level representations from hierarchical event streams like EHRs.

Rethinking Cross-Layer Information Routing in Diffusion Transformers

cs.CV · 2026-05-20 · conditional · novelty 6.0

DAR replaces residual addition in DiTs with learnable timestep-adaptive non-incremental aggregation of sublayer outputs, improving FID by 2.11 on ImageNet 256x256 and accelerating convergence by 8.75x.

HRM-Text: Efficient Pretraining Beyond Scaling

cs.CL · 2026-05-20 · unverdicted · novelty 6.0

A 1B-parameter hierarchical recurrent model pretrained on 40B instruction-response tokens achieves 60.7% MMLU and strong results on ARC-C, DROP, GSM8K, and MATH while using 100-900x fewer tokens than standard baselines.

Attention Drift: What Autoregressive Speculative Decoding Models Learn

cs.LG · 2026-05-11 · unverdicted · novelty 6.0

Drafter models in speculative decoding suffer progressive attention drift caused by monotonically growing hidden-state magnitudes along the residual path; post-norm plus per-state RMSNorm reduces this drift and improves acceptance length up to 2x on perturbed templates and 1.18x on long-context data

citing papers explorer

Showing 4 of 4 citing papers.

NEST: Nested Event Stream Transformer for Sequences of Multisets cs.LG · 2026-01-31 · unverdicted · none · ref 39
NEST is a nested transformer for sequences of multisets that uses masked set modeling to learn improved set-level representations from hierarchical event streams like EHRs.
Rethinking Cross-Layer Information Routing in Diffusion Transformers cs.CV · 2026-05-20 · conditional · none · ref 61
DAR replaces residual addition in DiTs with learnable timestep-adaptive non-incremental aggregation of sublayer outputs, improving FID by 2.11 on ImageNet 256x256 and accelerating convergence by 8.75x.
HRM-Text: Efficient Pretraining Beyond Scaling cs.CL · 2026-05-20 · unverdicted · none · ref 9
A 1B-parameter hierarchical recurrent model pretrained on 40B instruction-response tokens achieves 60.7% MMLU and strong results on ARC-C, DROP, GSM8K, and MATH while using 100-900x fewer tokens than standard baselines.
Attention Drift: What Autoregressive Speculative Decoding Models Learn cs.LG · 2026-05-11 · unverdicted · none · ref 21
Drafter models in speculative decoding suffer progressive attention drift caused by monotonically growing hidden-state magnitudes along the residual path; post-norm plus per-state RMSNorm reduces this drift and improves acceptance length up to 2x on perturbed templates and 1.18x on long-context data

On layer normalization in the transformer architecture

fields

years

verdicts

representative citing papers

citing papers explorer