Big Transfer (BiT): General Visual Representation Learning

Alexander Kolesnikov; Jessica Yung; Joan Puigcerver; Lucas Beyer; Neil Houlsby; Sylvain Gelly; Xiaohua Zhai

arxiv: 1912.11370 · v3 · pith:UDS5LKDSnew · submitted 2019-12-24 · 💻 cs.CV · cs.LG

Big Transfer (BiT): General Visual Representation Learning

Alexander Kolesnikov , Lucas Beyer , Xiaohua Zhai , Joan Puigcerver , Jessica Yung , Sylvain Gelly , Neil Houlsby This is my paper

classification 💻 cs.CV cs.LG

keywords transferclassdatasetsexamplestaskcifar-10componentsilsvrc-2012

0 comments

read the original abstract

Transfer of pre-trained representations improves sample efficiency and simplifies hyperparameter tuning when training deep neural networks for vision. We revisit the paradigm of pre-training on large supervised datasets and fine-tuning the model on a target task. We scale up pre-training, and propose a simple recipe that we call Big Transfer (BiT). By combining a few carefully selected components, and transferring using a simple heuristic, we achieve strong performance on over 20 datasets. BiT performs well across a surprisingly wide range of data regimes -- from 1 example per class to 1M total examples. BiT achieves 87.5% top-1 accuracy on ILSVRC-2012, 99.4% on CIFAR-10, and 76.3% on the 19 task Visual Task Adaptation Benchmark (VTAB). On small datasets, BiT attains 76.8% on ILSVRC-2012 with 10 examples per class, and 97.0% on CIFAR-10 with 10 examples per class. We conduct detailed analysis of the main components that lead to high transfer performance.

This paper has not been read by Pith yet.

discussion (0)

Forward citations

Cited by 6 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Silhouette Loss: Differentiable Global Structure Learning for Deep Representations
cs.LG 2026-03 unverdicted novelty 8.0

Soft Silhouette Loss offers a batch-global differentiable metric to promote intra-class compactness and inter-class separation in learned representations, boosting performance when hybridized with cross-entropy and co...
Swin Transformer: Hierarchical Vision Transformer using Shifted Windows
cs.CV 2021-03 accept novelty 8.0

Swin Transformer reaches 87.3% ImageNet accuracy and sets new records on COCO detection and ADE20K segmentation by replacing global self-attention with shifted-window local attention inside a hierarchical pyramid.
Improving Neural Network Training by Decoupling the Magnitude and Direction of Weight Vectors
cs.LG 2026-06 unverdicted novelty 6.0

MD Decoupling factorizes weights into fixed-norm directions and learnable per-row/column magnitudes updated at independent rates, improving Adam and Muon training stability and scale transfer without weight decay or warmup.
DexHoldem: Playing Texas Hold'em with Dexterous Embodied System
cs.RO 2026-05 unverdicted novelty 6.0

DexHoldem is a new benchmark providing 1,470 teleoperated demonstrations across 14 manipulation primitives, plus standardized tests for dexterous policy execution and agentic perception in a physical Texas Hold'em setting.
Florence: A New Foundation Model for Computer Vision
cs.CV 2021-11 unverdicted novelty 6.0

Florence is a new vision foundation model that learns universal visual-language representations from web-scale data and reports state-of-the-art results on 44 benchmarks including 83.74% zero-shot ImageNet top-1 accuracy.
A Large-Scale Study on the Accuracy vs Cost Trade-offs of Training and Evaluation Settings in Fine-Grained Image Recognition
cs.CV 2026-05 unverdicted novelty 5.0

Large-scale experiments demonstrate that data-aware augmentations applied only during training allow fine-grained image models to reach high accuracy without using discriminative crops at inference, lowering costs.