Layer Normalization

Jamie Ryan Kiros, Jimmy Lei Ba · 2016 · stat.ML · arXiv 1607.06450

Mixed citation behavior. Most common role is background (58%).

343 Pith papers citing it

Background 58% of classified citations

abstract

Training state-of-the-art, deep neural networks is computationally expensive. One way to reduce the training time is to normalize the activities of the neurons. A recently introduced technique called batch normalization uses the distribution of the summed input to a neuron over a mini-batch of training cases to compute a mean and variance which are then used to normalize the summed input to that neuron on each training case. This significantly reduces the training time in feed-forward neural networks. However, the effect of batch normalization is dependent on the mini-batch size and it is not obvious how to apply it to recurrent neural networks. In this paper, we transpose batch normalization into layer normalization by computing the mean and variance used for normalization from all of the summed inputs to the neurons in a layer on a single training case. Like batch normalization, we also give each neuron its own adaptive bias and gain which are applied after the normalization but before the non-linearity. Unlike batch normalization, layer normalization performs exactly the same computation at training and test times. It is also straightforward to apply to recurrent neural networks by computing the normalization statistics separately at each time step. Layer normalization is very effective at stabilizing the hidden state dynamics in recurrent networks. Empirically, we show that layer normalization can substantially reduce the training time compared with previously published techniques.
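The computation the abstract describes can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the function name `layer_norm`, the argument names, and the small `eps` constant added for numerical stability are assumptions that do not appear in the abstract. The mean and variance are taken over all summed inputs of one layer for a single training case, and a per-neuron gain and bias are applied after normalization (the non-linearity would follow).

```python
import math


def layer_norm(summed_inputs, gain, bias, eps=1e-5):
    """Normalize the summed inputs of one layer for a single training case.

    `eps` is an assumed stabilizer, not something the abstract specifies.
    """
    n = len(summed_inputs)
    # Statistics are computed across the neurons of the layer, not across a mini-batch.
    mean = sum(summed_inputs) / n
    variance = sum((a - mean) ** 2 for a in summed_inputs) / n
    std = math.sqrt(variance + eps)
    # Each neuron has its own gain and bias, applied after normalization.
    return [g * (a - mean) / std + b for a, g, b in zip(summed_inputs, gain, bias)]


# Hypothetical summed inputs for a four-neuron layer on one training case.
case = [1.0, 2.0, 3.0, 4.0]
unit_gain = [1.0] * len(case)
zero_bias = [0.0] * len(case)

normalized = layer_norm(case, unit_gain, zero_bias)
# Rescaling every summed input by the same factor leaves the result nearly unchanged.
rescaled = layer_norm([2.0 * a for a in case], unit_gain, zero_bias)
```

Because the function takes only one training case, nothing in it depends on the mini-batch, which is why the same computation applies at training and test time. In a recurrent network it would be called once per time step on that step's summed inputs.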

citation-role summary

background 44 · method 23 · baseline 2 · other 2

citation-polarity summary

background 41 · use method 23 · unclear 5 · baseline 2

authors

Geoffrey E. Hinton, Jamie Ryan Kiros, Jimmy Lei Ba

representative citing papers

CanViT: Toward Active-Vision Foundation Models

cs.CV · 2026-03-23 · conditional · novelty 8.0

CanViT is the first task- and policy-agnostic AVFM pretrained via passive-to-active dense latent distillation on 13.2M scenes and 1B random glimpses, achieving 38.5% ADE20K mIoU in one glimpse and 84.5% ImageNet-1k top-1 after fine-tuning.

Mamba: Linear-Time Sequence Modeling with Selective State Spaces

cs.LG · 2023-12-01 · unverdicted · novelty 8.0

Mamba is a linear-time sequence model using input-dependent selective SSMs that achieves SOTA results across modalities and matches twice-larger Transformers on language modeling with 5x higher inference throughput.

What learning algorithm is in-context learning? Investigations with linear models

cs.LG · 2022-11-28 · accept · novelty 8.0

Transformers performing in-context learning implicitly implement gradient descent, ridge regression, and least-squares predictors for linear models, with behavior shifting based on model depth, width, and data noise.

Masked Autoencoders Are Scalable Vision Learners

cs.CV · 2021-11-11 · accept · novelty 8.0

Masked autoencoders with asymmetric encoder-decoder and 75% masking ratio enable scalable self-supervised pre-training of vision transformers, achieving 87.8% ImageNet-1K accuracy with ViT-Huge using only unlabeled data.

Decision Transformer: Reinforcement Learning via Sequence Modeling

cs.LG · 2021-06-02 · accept · novelty 8.0

Decision Transformer casts RL as autoregressive sequence modeling conditioned on desired returns, past states and actions, matching or exceeding offline RL baselines on Atari, Gym and Key-to-Door tasks.

The Pile: An 800GB Dataset of Diverse Text for Language Modeling

cs.CL · 2020-12-31 · conditional · novelty 8.0

The Pile is a newly constructed 825 GiB dataset from 22 diverse sources that enables language models to achieve better performance on academic, professional, and cross-domain tasks than models trained on Common Crawl variants.

Reformer: The Efficient Transformer

cs.LG · 2020-01-13 · accept · novelty 8.0

Reformer matches standard Transformer accuracy on long sequences while using far less memory and running faster via LSH attention and reversible residual layers.

MixTTA: Low-Rank Cross-Channel Mixing for Reliable Test-Time Adaptation

cs.LG · 2026-06-26 · unverdicted · novelty 7.0

MixTTA equips normalization layers with low-rank cross-channel transformations plus decoupling and spectral projections to improve test-time adaptation under distribution shifts.

SurGe: Improved Surface Geometry in Point Maps

cs.CV · 2026-05-29 · unverdicted · novelty 7.0

SurGe improves local surface geometry in feedforward point maps via gradient matching loss and Neighborhood Attention Decoder, topping average rank on eight zero-shot monocular geometry benchmarks for global AbsRel while boosting local metrics.

Cognitive Fatigue in Autoregressive Transformers: Formalization and Measurement

cs.CL · 2026-05-29 · unverdicted · novelty 7.0

Autoregressive transformers exhibit measurable cognitive fatigue during extended generation, quantified by the Fatigue Index that predicts degradation (AUROC 0.95) and repetition (rho 0.94).

Attention-based optimizer for symmetry finding

quant-ph · 2026-05-28 · unverdicted · novelty 7.0

A Set-Transformer architecture with self-attention encodes Pauli-string correlations, optimizes via commutation objective, and finds symmetries with near-deterministic success on physical models like Ising and Toric code.

Classical State Preparation for Variational Quantum Algorithms via Reinforcement Learning

quant-ph · 2026-05-22 · unverdicted · novelty 7.0

CRiSP uses neural-guided MCTS and curriculum learning to insert Clifford prefixes before parameterized rotations in VQAs, yielding mean 3.17x and max 45x gains in energy accuracy on 22-qubit QAOA benchmarks versus prior Clifford initializers.

Thermo-VL: Extending Vision-Language Models to Thermal Infrared Perception

cs.CV · 2026-05-21 · unverdicted · novelty 7.0

Thermo-VL augments a frozen Molmo-7B VLM with a trainable thermal encoder and prompt-conditioned dual-attention fusion to improve cross-spectrum visual reasoning.

Compositional Transduction with Latent Analogies for Offline Goal-Conditioned Reinforcement Learning

cs.LG · 2026-05-20 · unverdicted · novelty 7.0

Proposes latent analogies and analogy transduction to enable compositional generalization to unseen goal-context pairs in offline GCRL, outperforming trajectory-stitching baselines on manipulation tasks.

Riemannian Networks over Full-Rank Correlation Matrices

cs.LG · 2026-05-18 · unverdicted · novelty 7.0

Riemannian networks are introduced for the full-rank correlation matrix manifold by extending MLR, FC, and convolutional layers to five geometries with backpropagation methods for two, showing effectiveness over SPD and Grassmannian baselines.

Randomized Advantage Transformation (RAT): Computing Natural Policy Gradients via Direct Backpropagation

cs.LG · 2026-05-18 · unverdicted · novelty 7.0

RAT reformulates regularized natural policy gradients as vanilla gradients with a transformed advantage, computed efficiently via randomized block Kaczmarz iterations on on-policy data.

Nested-GPT for variable-multiplicity parton showers: A case study in the resummation of non-global logarithms

hep-ph · 2026-05-18 · unverdicted · novelty 7.0 · 2 refs

Nested-GPT is an autoregressive Transformer surrogate that generates variable-multiplicity parton showers while enforcing ordered Markovian branching and matches reference Monte Carlo results for leading-log non-global logarithm resummation in the large-Nc limit.

Domain Transfer Becomes Identifiable via a Single Alignment

cs.LG · 2026-05-18 · unverdicted · novelty 7.0

Domain transfer becomes identifiable from marginals plus one anchor under Jacobian sparsity, enabled by a randomized masked finite-difference regularizer.

Bug or Feature$^2$: Weight Drift, Activation Sparsity and Spikes

cs.LG · 2026-05-17 · accept · novelty 7.0 · 2 refs

The paper proves negative weight drift at initialization under MSE or cross-entropy with asymmetric activations, links it to up to 90% sparsity in GPT-nano, maps the sparsity-accuracy cliff across 79 configurations, and shows clipped ReLU² and GELU² improve validation loss.

FishBack: Pullback Fisher Geometry for Optimal Activation Steering in Transformers

cs.LG · 2026-05-17 · unverdicted · novelty 7.0

FishBack derives a closed-form minimum-distortion steering direction from the pullback Fisher metric of the softmax layer, outperforming Euclidean baselines on GPT-2 verb-morphology tasks with lower off-target KL divergence.

ChangeFlow -- Latent Rectified Flow for Change Detection in Remote Sensing

cs.CV · 2026-05-14 · unverdicted · novelty 7.0

ChangeFlow reformulates remote sensing change detection as latent rectified-flow mask synthesis, reaching 80.4% average F1 across four benchmarks with 1.3-point gain and sampling-based ensembling.

Training-Free Generative Sampling via Moment-Matched Score Smoothing

stat.ML · 2026-05-14 · unverdicted · novelty 7.0

MM-SOLD is a training-free particle sampler whose large-particle limit converges to a moment-matched Gibbs distribution obtained by exponentially tilting a score-smoothed target.

Trajectory-Agnostic Asteroid Detection in TESS with Deep Learning

astro-ph.EP · 2026-05-12 · unverdicted · novelty 7.0

A W-Net deep learning model detects asteroids in TESS data independently of trajectory by rotating training image cubes and using adaptive normalization for data scaling.

QAP-Router: Tackling Qubit Routing as Dynamic Quadratic Assignment with Reinforcement Learning

quant-ph · 2026-05-12 · unverdicted · novelty 7.0

QAP-Router models qubit routing as dynamic QAP and applies RL with a solution-aware Transformer to cut CNOT counts by 12-30% versus industry compilers on real circuit benchmarks.

citing papers explorer

Showing 23 of 23 citing papers after filters.

Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

cs.LG · 2019-10-23 · unverdicted · none · ref 3 · internal anchor

T5 casts all NLP tasks as text-to-text generation, systematically explores pre-training choices, and reaches strong performance on summarization, QA, classification and other tasks via large-scale training on the Colossal Clean Crawled Corpus.

Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism

cs.CL · 2019-09-17 · unverdicted · none · ref 2 · internal anchor

Intra-layer model parallelism in PyTorch enables training of 8.3B-parameter transformers, achieving SOTA perplexity of 10.8 on WikiText103 and 66.5% accuracy on LAMBADA.

Switchable Normalization for Learning-to-Normalize Deep Representation

cs.CV · 2019-07-22 · unverdicted · none · ref 3 · internal anchor

Switchable Normalization learns per-layer weights to combine channel, layer, and minibatch normalizers, claiming robustness to batch size and better results than fixed normalizers on ImageNet, COCO, CityScapes, ADE20K, MegaFace, and Kinetics.

A Self-Attentive model for Knowledge Tracing

cs.LG · 2019-07-16 · unverdicted · none · ref 6 · internal anchor

SAKT uses self-attention to focus on relevant prior KCs for performance prediction and reports 4.43% average AUC improvement over DKT and DKVMN on real datasets.

Augmenting Self-attention with Persistent Memory

cs.LG · 2019-07-02 · unverdicted · none · ref 23 · internal anchor

Augmenting self-attention with persistent memory vectors allows removal of feed-forward layers from Transformers without degrading performance on character and word level language modeling benchmarks.

Localizing Unseen Activities in Video via Image Query

cs.CV · 2019-06-28 · unverdicted · none · ref 2 · internal anchor

Introduces Image-Based Activity Localization task for unseen activities, a self-attention interaction localizer using region self-attention and local transformer, and the ActivityIBAL dataset from ActivityNet.

RUSLAN: Russian Spoken Language Corpus for Speech Synthesis

eess.AS · 2019-06-26 · unverdicted · none · ref 9 · internal anchor

RUSLAN is a 31-hour single-speaker Russian speech corpus for TTS containing 22200 annotated samples, with a baseline end-to-end model scoring 4.05 naturalness and 3.78 intelligibility on MOS tests.

Deep Modular Co-Attention Networks for Visual Question Answering

cs.CV · 2019-06-25 · conditional · none · ref 3 · internal anchor

MCAN stacks modular co-attention layers to reach 70.63% accuracy on VQA-v2 test-dev, outperforming prior state-of-the-art models.

Generating Long Sequences with Sparse Transformers

cs.LG · 2019-04-23 · unverdicted · none · ref 2 · internal anchor

Sparse Transformers factorize attention to handle sequences tens of thousands long, achieving new SOTA density modeling on Enwik8, CIFAR-10, and ImageNet-64.

Compressive Transformers for Long-Range Sequence Modelling

cs.LG · 2019-11-13 · unverdicted · none · ref 71 · internal anchor

Compressive Transformer sets new records on WikiText-103 (17.1 ppl) and Enwik8 (0.97 bpc) via memory compression and introduces the PG-19 long-range language benchmark.

CTRL: A Conditional Transformer Language Model for Controllable Generation

cs.CL · 2019-09-11 · unverdicted · none · ref 3 · internal anchor

CTRL is a large conditional transformer language model that uses naturally occurring control codes to steer text generation style and content.

DropAttention: A Regularization Method for Fully-Connected Self-Attention Networks

cs.CL · 2019-07-25 · unverdicted · none · ref 1 · internal anchor

DropAttention regularizes attention weights in fully-connected self-attention networks to reduce overfitting and improve performance.

Construct Dynamic Graphs for Hand Gesture Recognition via Spatial-Temporal Attention

cs.CV · 2019-07-20 · unverdicted · none · ref 16 · internal anchor

DG-STA builds dynamic graphs from hand skeletons, applies spatial-temporal self-attention to learn features, and uses a mask to cut cost by 99%, outperforming prior methods on DHG-14/28 and SHREC'17.

R-Transformer: Recurrent Neural Network Enhanced Transformer

cs.LG · 2019-07-12 · unverdicted · none · ref 2 · internal anchor

R-Transformer integrates RNNs with multi-head attention to model local and global sequence dependencies without position embeddings and reports large-margin gains over prior methods on diverse tasks.

QUOTIENT: Two-Party Secure Neural Network Training and Prediction

cs.CR · 2019-07-08 · unverdicted · none · ref 37 · internal anchor

QUOTIENT achieves 50X faster WAN training time and 6% higher absolute accuracy for secure two-party DNN training by jointly optimizing a discretized training algorithm with a tailored secure protocol.

A Bi-directional Transformer for Musical Chord Recognition

cs.SD · 2019-07-05 · unverdicted · none · ref 8 · internal anchor

A bi-directional Transformer achieves competitive chord recognition by using self-attention to capture long-term dependencies in audio in a single training phase.

Root Mean Square Layer Normalization

cs.LG · 2019-10-16 · conditional · none · ref 3 · internal anchor

RMSNorm delivers re-scaling invariance and comparable accuracy to LayerNorm while cutting computation by skipping mean subtraction, yielding 7-64% runtime reductions across tested models.

Disentangled Makeup Transfer with Generative Adversarial Network

cs.CV · 2019-07-02 · unverdicted · none · ref 1 · internal anchor

DMT uses identity and makeup encoders in a GAN to enable controllable makeup transfer from references and sampling of new styles from a prior distribution.

ARMIN: Towards a More Efficient and Light-weight Recurrent Memory Network

cs.LG · 2019-06-28 · unverdicted · none · ref 1 · internal anchor

ARMIN introduces auto-addressing via hidden states and a novel RNN cell to produce a lighter recurrent memory network with lower overhead than existing MANNs or vanilla LSTMs.

Single-bit-per-weight deep convolutional neural networks without batch-normalization layers for embedded systems

cs.LG · 2019-07-16 · unverdicted · none · ref 27 · internal anchor

Experiments show that shifted-ReLU layers can replace batch-normalization in single-bit-weight wide residual networks on CIFAR-10/100 and ImageNet without consistent accuracy penalty.

Mean Spectral Normalization of Deep Neural Networks for Embedded Automation

cs.LG · 2019-07-09 · unverdicted · none · ref 11 · internal anchor

Proposes MSN reparameterization to address mean-drift in SN, claiming ~16% faster inference than BN with fewer parameters on CNNs and GANs.

AMI-Net+: A Novel Multi-Instance Neural Network for Medical Diagnosis from Incomplete and Imbalanced Data

cs.LG · 2019-07-03 · unverdicted · none · ref 33 · internal anchor

AMI-Net+ extends AMI-Net by swapping cross-entropy for focal loss and adding self-adaptive instance-level pooling, then reports better performance than baselines on two real medical datasets.

Multilingual Bottleneck Features for Query by Example Spoken Term Detection

cs.CL · 2019-06-30 · unverdicted · none · ref 34 · internal anchor

Multilingual bottleneck features extracted with residual networks outperform feedforward versions for QbE-STD on QUESST 2014 when trained on GlobalPhone.
