On layer normalization in the transformer architecture

4 Pith papers cite this work. Polarity classification is still indexing.

Citing papers by year: 2026 (4)

representative citing papers

HRM-Text: Efficient Pretraining Beyond Scaling

cs.CL · 2026-05-20 · unverdicted · novelty 6.0

A 1B-parameter hierarchical recurrent model pretrained on 40B instruction-response tokens achieves 60.7% MMLU and strong results on ARC-C, DROP, GSM8K, and MATH while using 100-900x fewer tokens than standard baselines.

Attention Drift: What Autoregressive Speculative Decoding Models Learn

cs.LG · 2026-05-11 · unverdicted · novelty 6.0

Drafter models in speculative decoding suffer progressive attention drift caused by monotonically growing hidden-state magnitudes along the residual path; post-norm plus per-state RMSNorm reduces this drift and improves acceptance length by up to 2x on perturbed templates and 1.18x on long-context data.

citing papers explorer

  • NEST: Nested Event Stream Transformer for Sequences of Multisets

    cs.LG · 2026-01-31 · unverdicted · none · ref 39

    NEST is a nested transformer for sequences of multisets that uses masked set modeling to learn improved set-level representations from hierarchical event streams such as EHRs.

  • Rethinking Cross-Layer Information Routing in Diffusion Transformers

    cs.CV · 2026-05-20 · conditional · none · ref 61

    DAR replaces residual addition in DiTs with learnable timestep-adaptive non-incremental aggregation of sublayer outputs, improving FID by 2.11 on ImageNet 256x256 and accelerating convergence by 8.75x.

  • HRM-Text: Efficient Pretraining Beyond Scaling

    cs.CL · 2026-05-20 · unverdicted · none · ref 9

    A 1B-parameter hierarchical recurrent model pretrained on 40B instruction-response tokens achieves 60.7% MMLU and strong results on ARC-C, DROP, GSM8K, and MATH while using 100-900x fewer tokens than standard baselines.

  • Attention Drift: What Autoregressive Speculative Decoding Models Learn

    cs.LG · 2026-05-11 · unverdicted · none · ref 21

    Drafter models in speculative decoding suffer progressive attention drift caused by monotonically growing hidden-state magnitudes along the residual path; post-norm plus per-state RMSNorm reduces this drift and improves acceptance length by up to 2x on perturbed templates and 1.18x on long-context data.