On layer normalization in the transformer architecture

2020 · arXiv 2002.04745

10 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background: 1 · method: 1

citation-polarity summary

support: 1 · use method: 1

representative citing papers

Stability and Generalization in Looped Transformers

cs.LG · 2026-04-16 · unverdicted · novelty 8.0

Looped transformers with recall and outer normalization produce reachable, input-dependent fixed points with stable gradients, enabling generalization, while those without recall cannot; a new internal recall variant performs competitively or better.

Every Feedforward Neural Network Definable in an o-Minimal Structure Has Finite Sample Complexity

stat.ML · 2026-05-08 · unverdicted · novelty 7.0

Every fixed finite feedforward neural network definable in an o-minimal structure has finite sample complexity in the agnostic PAC setting.

Graph Transformers and Stabilized Reinforcement Learning for Large-Scale Dynamic Routing Modulation and Spectrum Allocation in Elastic Optical Networks

cs.NI · 2026-05-03 · unverdicted · novelty 7.0 · 2 refs

A graph transformer with RL stabilizations is the first to exceed benchmarks for dynamic RMSA, supporting up to 13% more traffic load on networks up to 143 nodes.

Longformer: The Long-Document Transformer

cs.CL · 2020-04-10 · accept · novelty 7.0

Longformer uses local windowed attention plus task-specific global attention to achieve linear scaling and state-of-the-art results on long-document language modeling, QA, and summarization after pretraining.

A Geometric Analysis of Sign-Magnitude Asymmetry in a ReLU + RMSNorm Block under Ternary Quantization

cs.LG · 2026-05-18 · unverdicted · novelty 6.0

Sign-flip perturbations produce π/(π-2) ≈ 2.75 times more transverse output energy than equal-norm sign-preserving perturbations in a ReLU + RMSNorm block because ReLU creates directional asymmetry that RMSNorm's transverse projection exposes.

Gated Normalization Removal and Scale Anchoring in Pre-Norm Transformers

cs.LG · 2026-02-11 · unverdicted · novelty 6.0

TaperNorm gradually removes internal normalization in pre-norm transformers via learned gates that reach zero, revealing final norm as a scale anchor and enabling up to 1.18x faster KV-cached decoding with small loss increases.

Predicting the thermodynamics in the chromosphere from the translation of SDO data into the IRIS$^{2}$ inversion results using a visual transformer model

astro-ph.SR · 2026-04-23 · unverdicted · novelty 5.0

A visual transformer model trained on IRIS inversions predicts chromospheric temperature and density from SDO data with correlations around 0.8 on 80% of test cases.

Attention Residuals

cs.CL · 2026-03-16 · unverdicted · novelty 5.0

Attention Residuals replaces fixed residual summation with input-dependent softmax attention over preceding layers, and a blocked variant is shown to improve uniformity and downstream performance in a 48B-parameter model pre-trained on 1.4T tokens.

Multi-Gate Residuals

cs.LG · 2026-05-22 · unverdicted · novelty 3.0

Multi-Gate Residuals stabilizes activation scales in deep residual networks via multi-stream gating and attention pooling without added communication overhead.

LLMOrbit: A Circular Taxonomy of Large Language Models -From Scaling Walls to Agentic AI Systems

cs.LG · 2026-01-20 · unverdicted · novelty 3.0

A survey taxonomy of LLMs identifies three scaling crises and six efficiency paradigms while tracing the shift from generation to tool-using agents.

citing papers explorer

Showing 10 of 10 citing papers.

Stability and Generalization in Looped Transformers · cs.LG · 2026-04-16 · unverdicted · none · ref 22
Every Feedforward Neural Network Definable in an o-Minimal Structure Has Finite Sample Complexity · stat.ML · 2026-05-08 · unverdicted · none · ref 92
Graph Transformers and Stabilized Reinforcement Learning for Large-Scale Dynamic Routing Modulation and Spectrum Allocation in Elastic Optical Networks · cs.NI · 2026-05-03 · unverdicted · none · ref 10 · 2 links
Longformer: The Long-Document Transformer · cs.CL · 2020-04-10 · accept · none · ref 127
A Geometric Analysis of Sign-Magnitude Asymmetry in a ReLU + RMSNorm Block under Ternary Quantization · cs.LG · 2026-05-18 · unverdicted · none · ref 12
Gated Normalization Removal and Scale Anchoring in Pre-Norm Transformers · cs.LG · 2026-02-11 · unverdicted · none · ref 10
Predicting the thermodynamics in the chromosphere from the translation of SDO data into the IRIS$^{2}$ inversion results using a visual transformer model · astro-ph.SR · 2026-04-23 · unverdicted · none · ref 24
Attention Residuals · cs.CL · 2026-03-16 · unverdicted · none · ref 61
Multi-Gate Residuals · cs.LG · 2026-05-22 · unverdicted · none · ref 19
LLMOrbit: A Circular Taxonomy of Large Language Models -From Scaling Walls to Agentic AI Systems · cs.LG · 2026-01-20 · unverdicted · none · ref 167
