Group Normalization

Yuxin Wu, Kaiming He


classification cs.CV cs.LG

keywords batch, normalization, group, sizes, when, detection, error, segmentation


Batch Normalization (BN) is a milestone technique in the development of deep learning, enabling various networks to train. However, normalizing along the batch dimension introduces problems: BN's error increases rapidly as the batch size becomes smaller, because the batch statistics are estimated inaccurately. This limits BN's usage for training larger models and for transferring features to computer vision tasks including detection, segmentation, and video, which require small batches because of memory constraints. In this paper, we present Group Normalization (GN) as a simple alternative to BN. GN divides the channels into groups and computes the mean and variance within each group for normalization. GN's computation is independent of batch size, and its accuracy is stable across a wide range of batch sizes. On ResNet-50 trained on ImageNet, GN has 10.6% lower error than its BN counterpart when using a batch size of 2; when using typical batch sizes, GN performs comparably to BN and outperforms other normalization variants. Moreover, GN can be naturally transferred from pre-training to fine-tuning. GN can outperform its BN-based counterparts for object detection and segmentation in COCO, and for video classification in Kinetics, showing that GN can effectively replace the powerful BN in a variety of tasks. GN can be easily implemented in a few lines of code in modern libraries.
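The grouping step described in the abstract can be sketched in plain Python. This is an illustrative sketch, not the authors' implementation: the function names `group_norm` and `group_norm_batch`, the list-of-lists data layout (each sample is a list of channels, each channel a flat list of values), and the `eps` default are all assumptions made here, and the learnable per-channel scale and shift are omitted for brevity.

```python
import math


def group_norm(sample, num_groups, eps=1e-5):
    """Normalize one sample (hypothetical layout: list of channels,
    each channel a flat list of floats) group by group."""
    num_channels = len(sample)
    if num_channels % num_groups != 0:
        raise ValueError("number of channels must be divisible by num_groups")
    channels_per_group = num_channels // num_groups

    normalized = []
    for g in range(num_groups):
        # Channels belonging to this group.
        start = g * channels_per_group
        channels = sample[start:start + channels_per_group]

        # Mean and (biased) variance over every value in the group.
        values = [v for channel in channels for v in channel]
        mean = sum(values) / len(values)
        var = sum((v - mean) ** 2 for v in values) / len(values)
        scale = 1.0 / math.sqrt(var + eps)

        # Normalize each channel with the statistics of its group.
        for channel in channels:
            normalized.append([(v - mean) * scale for v in channel])
    return normalized


def group_norm_batch(batch, num_groups, eps=1e-5):
    """Apply group_norm to each sample independently; no statistic
    is shared across the batch dimension."""
    return [group_norm(sample, num_groups, eps) for sample in batch]
```

Because each sample is normalized using only its own values, the output for a given sample is the same whether it is processed alone or alongside other samples, which is the batch-size independence the abstract refers to. In practice a tensor library would express the same computation with a reshape and a reduction rather than explicit loops.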



Forward citations

Cited by 10 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Determining star formation histories and age-metallicity relations with convolutional neural networks
astro-ph.GA 2026-05 unverdicted novelty 7.0

A CNN with attention and shared latent space recovers SFHs and metallicities from spectro-photometric data with ~0.12 dex age and ~0.03 dex metallicity dispersion while running thousands of times faster than full spec...

Generative diffusion models for spatiotemporal influenza forecasting
cs.LG 2026-04 unverdicted novelty 7.0

Influpaint uses generative diffusion models on image-encoded influenza data to produce realistic and diverse epidemic trajectories that match leading ensemble methods in accuracy.

VitaminP: cross-modal learning enables whole-cell segmentation from routine histology
cs.CV 2026-04 unverdicted novelty 7.0

VitaminP uses paired H&E-mIF data to train a model that transfers molecular boundary information, enabling accurate whole-cell segmentation directly from routine H&E histology across 34 cancer types.

Stable Long-Horizon Neural ODE Reduced-Order Models via Learned Feedback for Biological Growth and Remodeling
cs.CE 2026-04 conditional novelty 7.0

A closed-loop Neural ODE ROM with CNN-based feedback from growth features stabilizes long-horizon predictions of tissue growth and remodeling, reaching 90.3% clinical tolerance versus 43.7% for open-loop baselines.

Diffusion Models Beat GANs on Image Synthesis
cs.LG 2021-05 accept novelty 7.0

Diffusion models with architecture improvements and classifier guidance achieve superior FID scores to GANs on unconditional and conditional ImageNet image synthesis.

LoKA: Low-precision Kernel Applications for Recommendation Models At Scale
cs.LG 2026-05 unverdicted novelty 6.0

LoKA enables practical FP8 use in numerically sensitive large recommendation models via online profiling of activations, reusable model modifications for stability, and dynamic kernel dispatching.

LoKA: Low-precision Kernel Applications for Recommendation Models At Scale
cs.LG 2026-05 unverdicted novelty 6.0

LoKA enables practical FP8 use in numerically sensitive large recommendation models via profiling, model adaptations, and runtime kernel orchestration.

StereoPolicy: Improving Robotic Manipulation Policies via Stereo Perception
cs.RO 2026-05 unverdicted novelty 6.0

StereoPolicy fuses stereo image pairs via a Stereo Transformer on pretrained 2D encoders to boost robotic manipulation policies, showing gains over monocular, RGB-D, point cloud, and multi-view methods in simulations ...

AE-ViT: Stable Long-Horizon Parametric Partial Differential Equations Modeling
cs.LG 2026-04 unverdicted novelty 6.0

AE-ViT combines a convolutional autoencoder with a latent-space transformer and multi-stage parameter plus coordinate injection to deliver stable long-horizon predictions for parametric PDEs, cutting relative rollout ...

ASTRAFier: A Novel and Scalable Transformer-based Stellar Variability Classifier
astro-ph.IM 2026-04 unverdicted novelty 5.0

ASTRAFier is a Transformer-BiLSTM-CNN model that classifies stellar variability from light curves, reporting 94.26% accuracy on Kepler data and 88.22% on TESS, then applied to 2.8 million TESS curves to release a catalog.