Recognition: 2 theorem links
Masked Autoencoders Are Scalable Vision Learners
Pith reviewed 2026-05-16 06:49 UTC · model grok-4.3
The pith
Masked autoencoders learn scalable vision features by reconstructing heavily masked image patches.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Masked autoencoders are scalable self-supervised learners for computer vision. The approach masks random patches of the input image and reconstructs the missing pixels. It is based on an asymmetric encoder-decoder architecture, with an encoder that operates only on the visible subset of patches without mask tokens, along with a lightweight decoder that reconstructs the original image from the latent representation and mask tokens. Masking a high proportion of the input image, such as 75 percent, yields a nontrivial and meaningful self-supervisory task. Coupling these designs enables efficient training of large models that generalize well, for example a vanilla ViT-Huge model achieving 87.8 percent top-1 accuracy, the best among methods that use only ImageNet-1K data.
What carries the argument
Asymmetric encoder-decoder where the encoder processes only visible patches and the lightweight decoder reconstructs the full image from latent features plus mask tokens.
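A minimal sketch of that asymmetric flow, in PyTorch style. This is an illustration under stated assumptions, not the paper's implementation: `patch_embed`, `encoder`, `decoder`, and the learned `mask_token` (assumed shape (1, 1, D)) are stand-ins for the ViT components, and positional embeddings, the class token, and the encoder-to-decoder projection are omitted. The argsort-over-noise trick is one common way to sample patches without replacement.

```python
import torch

def random_masking(x, mask_ratio=0.75):
    """Keep a random subset of patch tokens per sample.

    x: (B, N, D) patch embeddings. Returns the visible tokens,
    a binary mask in original patch order (1 = removed), and the
    permutation needed to restore that order later.
    """
    B, N, D = x.shape
    len_keep = int(N * (1 - mask_ratio))

    noise = torch.rand(B, N, device=x.device)        # one score per patch
    ids_shuffle = torch.argsort(noise, dim=1)        # random permutation
    ids_restore = torch.argsort(ids_shuffle, dim=1)  # its inverse

    ids_keep = ids_shuffle[:, :len_keep]
    x_visible = torch.gather(
        x, 1, ids_keep.unsqueeze(-1).expand(-1, -1, D))

    mask = torch.ones(B, N, device=x.device)
    mask[:, :len_keep] = 0
    mask = torch.gather(mask, 1, ids_restore)        # back to patch order
    return x_visible, mask, ids_restore

def mae_forward(imgs, patch_embed, encoder, decoder, mask_token,
                mask_ratio=0.75):
    """Asymmetric flow: encode visible patches only, decode all."""
    tokens = patch_embed(imgs)                       # (B, N, D)
    visible, mask, ids_restore = random_masking(tokens, mask_ratio)

    latent = encoder(visible)                        # never sees mask tokens

    # Append learned mask tokens, then unshuffle so every patch
    # position is filled before the lightweight decoder runs.
    B, N = mask.shape
    n_masked = N - latent.shape[1]
    dec_in = torch.cat(
        [latent, mask_token.expand(B, n_masked, -1)], dim=1)
    dec_in = torch.gather(
        dec_in, 1,
        ids_restore.unsqueeze(-1).expand(-1, -1, dec_in.shape[-1]))
    pred = decoder(dec_in)                           # per-patch pixels
    return pred, mask
```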
If this is right
- Training accelerates by 3x or more while accuracy improves.
- A vanilla ViT-Huge reaches 87.8 percent top-1 accuracy on ImageNet-1K using no outside data.
- Transfer performance on downstream tasks exceeds supervised pre-training.
- The method exhibits promising scaling behavior as model size grows.
- High masking ratios produce meaningful self-supervision that supports large models.
Where Pith is reading between the lines
- The same masking-plus-reconstruction pattern could apply directly to video or audio by hiding patches across time or frequency.
- The training efficiency opens the door to pre-training on image collections far larger than ImageNet without labels.
- Reconstruction objectives may serve as a drop-in replacement for contrastive losses when scaling vision transformers.
- Hybrid versions that combine this decoder with contrastive heads could be tested on the same architectures.
Load-bearing premise
Masking a high proportion of the input creates a nontrivial self-supervisory task whose difficulty drives useful feature learning rather than trivial solutions.
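One way to make this concrete: the reconstruction loss is computed only on masked patches, so the masking ratio directly sets how much of the image must be inferred from how little. A hedged sketch continuing the code above; the `patchify` helper (flattening ground-truth patches to match `pred`) is an assumed hypothetical, and the per-patch pixel normalization is the variant the paper reports as helpful.

```python
def mae_loss(pred, imgs, mask, patchify, eps=1e-6):
    """Mean squared error, averaged over masked patches only.

    pred: (B, N, patch_dim) predicted pixels per patch.
    mask: (B, N), 1 where the patch was masked out.
    """
    target = patchify(imgs)                          # (B, N, patch_dim)

    # Per-patch pixel normalization (the variant the paper
    # reports as improving representation quality).
    mean = target.mean(dim=-1, keepdim=True)
    var = target.var(dim=-1, keepdim=True)
    target = (target - mean) / (var + eps) ** 0.5

    loss = ((pred - target) ** 2).mean(dim=-1)       # per-patch MSE
    return (loss * mask).sum() / mask.sum()          # visible patches excluded
```

At a 75% ratio the model must predict three quarters of the image from one quarter of it, which is what rules out trivial interpolation solutions.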
What would settle it
A ViT-Huge model trained with this 75-percent masking recipe on ImageNet-1K falling below 87.8 percent top-1 accuracy, or a lower masking ratio matching or exceeding that accuracy, would break the claim.
Original abstract
This paper shows that masked autoencoders (MAE) are scalable self-supervised learners for computer vision. Our MAE approach is simple: we mask random patches of the input image and reconstruct the missing pixels. It is based on two core designs. First, we develop an asymmetric encoder-decoder architecture, with an encoder that operates only on the visible subset of patches (without mask tokens), along with a lightweight decoder that reconstructs the original image from the latent representation and mask tokens. Second, we find that masking a high proportion of the input image, e.g., 75%, yields a nontrivial and meaningful self-supervisory task. Coupling these two designs enables us to train large models efficiently and effectively: we accelerate training (by 3x or more) and improve accuracy. Our scalable approach allows for learning high-capacity models that generalize well: e.g., a vanilla ViT-Huge model achieves the best accuracy (87.8%) among methods that use only ImageNet-1K data. Transfer performance in downstream tasks outperforms supervised pre-training and shows promising scaling behavior.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces masked autoencoders (MAE) as a scalable self-supervised pre-training approach for vision. It masks a high fraction (e.g., 75%) of random image patches and reconstructs the missing pixels via an asymmetric encoder-decoder: the encoder processes only the visible patches (no mask tokens), while a lightweight decoder reconstructs the full image from the latent representation plus mask tokens. This design enables efficient training of large ViT models; a vanilla ViT-Huge achieves 87.8% top-1 accuracy on ImageNet-1K using only ImageNet-1K data and shows strong transfer gains over supervised pre-training.
Significance. If the empirical results hold, the work is significant because it demonstrates that a simple, high-masking-ratio reconstruction task combined with an asymmetric architecture can scale self-supervised learning to high-capacity vision models, yielding both 3x+ training acceleration and state-of-the-art ImageNet-1K accuracy among ImageNet-only methods. The extensive ablations on masking ratio and decoder depth, together with downstream transfer experiments, provide direct support for the central scalability claim.
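The acceleration claim is plausible on dimensional grounds alone: at a 75% masking ratio the encoder sees a quarter of the tokens, which cuts per-token (MLP) FLOPs roughly 4x and quadratic self-attention FLOPs roughly 16x. A back-of-envelope sketch, assuming the standard 224x224 input with 16x16 patches:

```python
num_patches = (224 // 16) ** 2     # 196 tokens for a standard ViT input
visible = int(num_patches * 0.25)  # 49 tokens reach the encoder at 75% masking

mlp_saving = num_patches / visible            # ~4x fewer per-token FLOPs
attn_saving = (num_patches / visible) ** 2    # ~16x fewer attention FLOPs
print(f"MLP ~{mlp_saving:.0f}x, attention ~{attn_saving:.0f}x")
```

The measured 3x wall-clock speedup is smaller than these raw FLOP ratios because the decoder, data loading, and other non-encoder overheads add back cost.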
minor comments (2)
- [Abstract] The statement that 87.8% is the 'best accuracy among methods that use only ImageNet-1K data' would be strengthened by an explicit footnote or table reference listing the exact competing methods and their scores.
- [Section 4.2] The description of the masking ratio ablation would benefit from a brief statement of the reconstruction loss behavior at 75% versus lower ratios to make the 'nontrivial task' claim more concrete.
Simulated Author's Rebuttal
We thank the referee for the positive and insightful review, as well as the recommendation to accept the manuscript. The referee's summary accurately captures our core contributions regarding the asymmetric encoder-decoder design and high masking ratio in masked autoencoders for scalable self-supervised pre-training of vision transformers.
Circularity Check
No significant circularity; empirical method is self-contained
full rationale
The paper presents an empirical self-supervised method (asymmetric encoder-decoder with 75% random patch masking) whose core designs are stated directly as architectural choices and training procedures. All reported results, including the 87.8% ImageNet-1K accuracy for ViT-Huge, are obtained from end-to-end training and evaluation on fixed public benchmarks. No central quantity is defined in terms of itself, no fitted parameter is relabeled as a prediction, and no load-bearing premise reduces to a self-citation chain; the nontriviality of the masking task is tested via ablations rather than assumed by construction.
Axiom & Free-Parameter Ledger
free parameters (1)
- masking ratio = 75%
axioms (1)
- standard math Vision Transformer patch embedding and self-attention from Dosovitskiy et al. 2020
Lean theorems connected to this paper
- LawOfExistence.defect_zero_iff_one (tagged: echoes)
ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
Passage: "we mask random patches of the input image and reconstruct the missing pixels... masking a high proportion of the input image, e.g., 75%, yields a nontrivial and meaningful self-supervisory task"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 22 Pith papers
- Mask World Model: Predicting What Matters for Robust Robot Policy Learning
  Mask World Model predicts semantic mask dynamics with video diffusion and integrates it with a diffusion policy head, outperforming RGB world models on LIBERO and RLBench while showing better real-world generalization...
- Representing 3D Faces with Learnable B-Spline Volumes
  CUBE encodes 3D faces via a grid of learned high-dimensional B-spline features that map parametrically to a base shape plus MLP-refined displacements, enabling dense correspondence and state-of-the-art registration fr...
- Learning to Discover at Test Time
  TTT-Discover applies test-time RL to set new state-of-the-art results on math inequalities, GPU kernels, algorithm contests, and single-cell denoising using an open model and public code.
- Massive Activations in Large Language Models
  Massive activations are constant large values in LLMs that function as indispensable bias terms and concentrate attention probabilities on specific tokens.
- A Time Series is Worth 64 Words: Long-term Forecasting with Transformers
  PatchTST uses subseries patching and channel-independent Transformers to deliver significantly better long-term multivariate time series forecasting and strong self-supervised transfer performance.
- Earth System Foundation Model (ESFM): A unified framework for heterogeneous data integration and forecasting
  ESFM is a single open foundation model that unifies heterogeneous Earth data sources and forecasts missing regions while preserving inter-variable physical relationships.
- Beyond Independent Frames: Latent Attention Masked Autoencoders for Multi-View Echocardiography
  LAMAE adds latent-space attention to masked autoencoders so multi-view echocardiography videos can exchange information across frames and views, yielding representations that transfer from adult to pediatric hearts an...
- Self-supervised Pretraining of Cell Segmentation Models
  DINOCell achieves a SEG score of 0.784 on LIVECell by self-supervised domain adaptation of DINOv2, improving 10.42% over SAM-based models and showing strong zero-shot transfer.
- Multi-modal, multi-scale representation learning for satellite imagery analysis just needs a good ALiBi
  Scale-ALiBi adds a spatial-scale bias to ALiBi attention, enabling effective representation learning across high- and low-resolution optical and SAR satellite images.
- Zero-shot World Models Are Developmentally Efficient Learners
  A zero-shot visual world model trained on one child's experience achieves broad competence on physical understanding benchmarks while matching developmental behavioral patterns.
- Physics-Informed Transformer for Real-Time High-Fidelity Topology Optimization
  A transformer model with self-attention and auxiliary physics losses learns a direct non-iterative mapping from loads and fields to manufacturable optimized topologies.
- C$^2$FG: Control Classifier-Free Guidance via Score Discrepancy Analysis
  C²FG provides a time-dependent exponential decay control for classifier-free guidance based on theoretical upper bounds on conditional-unconditional score discrepancies in diffusion processes.
- Foundation Model-Driven Semantic Change Detection in Remote Sensing Imagery
  PerASCD sets new state-of-the-art Sek scores on SECOND and LandsatSCD datasets by using a modular cascaded gated decoder on PerA foundation model features plus a new consistency loss.
- LLaVA-Video: Video Instruction Tuning With Synthetic Data
  LLaVA-Video-178K is a new synthetic video instruction dataset that, when combined with existing data to train LLaVA-Video, produces strong results on video understanding benchmarks.
- Revisiting Feature Prediction for Learning Visual Representations from Video
  V-JEPA models trained only on feature prediction from 2 million public videos achieve 81.9% on Kinetics-400, 72.2% on Something-Something-v2, and 77.9% on ImageNet-1K using frozen ViT-H/16 backbones.
- Demystifying CLIP Data
  MetaCLIP curates balanced 400M-pair subsets from CommonCrawl that outperform CLIP data, reaching 70.8% zero-shot ImageNet accuracy on ViT-B versus CLIP's 68.3%.
- CoCa: Contrastive Captioners are Image-Text Foundation Models
  CoCa unifies contrastive and generative pretraining in one image-text model to reach 86.3% zero-shot ImageNet accuracy and new state-of-the-art results on multiple downstream benchmarks.
- PANC: Prior-Aware Normalized Cut via Anchor-Augmented Token Graphs
  PANC augments Normalized Cut with anchor-augmented token graphs using priors to steer spectral partitions, yielding mIoU gains of 2.3-8.7% over baselines on DUTS-TE, DUT-OMRON, and CrackForest.
- Now You See That: Learning End-to-End Humanoid Locomotion from Raw Pixels
  An end-to-end policy learns robust humanoid locomotion directly from noisy depth images via high-fidelity sensor simulation, vision-aware distillation from privileged maps, and terrain-specific multi-critic reward shaping.
- Galactica: A Large Language Model for Science
  Galactica, a science-specialized LLM, reports higher scores than GPT-3, Chinchilla, and PaLM on LaTeX knowledge, mathematical reasoning, and medical QA benchmarks while outperforming general models on BIG-bench.
- Using Deep Learning Models Pretrained by Self-Supervised Learning for Protein Localization
  DINO-based ViT models pretrained on HPA FOV achieve macro F1 of 0.822 zero-shot and 0.860 after fine-tuning for protein localization on OpenCell, demonstrating effective transfer from SSL pretraining.
- AMO-ENE: Attention-based Multi-Omics Fusion Model for Outcome Prediction in Extra Nodal Extension and HPV-associated Oropharyngeal Cancer
  An attention-based fusion model combining semi-supervised CT segmentation, radiomics, and clinical features predicts metastatic recurrence, overall survival, and disease-free survival in HPV+ oropharyngeal cancer with...
Reference graph
Works this paper leans on
- [1] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. Layer normalization. arXiv:1607.06450, 2016.
- [2] Hangbo Bao, Li Dong, and Furu Wei. BEiT: BERT pre-training of image transformers. arXiv:2106.08254, 2021. Accessed in June 2021.
- [3] Suzanna Becker and Geoffrey E. Hinton. Self-organizing neural network that discovers surfaces in random-dot stereograms. Nature, 1992.
- [4] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D. Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott G... Language models are few-shot learners. 2020.
- [5] Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. In ICCV, 2021.
- [6] Mark Chen, Alec Radford, Rewon Child, Jeffrey Wu, Heewoo Jun, David Luan, and Ilya Sutskever. Generative pretraining from pixels. In ICML, 2020.
- [7] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. In ICML, 2020.
- [8] Xinlei Chen and Kaiming He. Exploring simple Siamese representation learning. In CVPR, 2021.
- [9] Xinlei Chen, Saining Xie, and Kaiming He. An empirical study of training self-supervised Vision Transformers. In ICCV, 2021.
- [10] Kevin Clark, Minh-Thang Luong, Quoc V. Le, and Christopher D. Manning. ELECTRA: Pre-training text encoders as discriminators rather than generators. In ICLR, 2020.
- [11] Corinna Cortes and Vladimir Vapnik. Support-vector networks. Machine Learning, 1995.
- [12] Ekin D. Cubuk, Barret Zoph, Jonathon Shlens, and Quoc V. Le. RandAugment: Practical automated data augmentation with a reduced search space. In CVPR Workshops, 2020.
- [13] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.
- [14] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL, 2019.
- [15] Carl Doersch, Abhinav Gupta, and Alexei A. Efros. Unsupervised visual representation learning by context prediction. In ICCV, 2015.
- [16] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR, 2021.
- [17] Spyros Gidaris, Praveer Singh, and Nikos Komodakis. Unsupervised representation learning by predicting image rotations. In ICLR, 2018.
- [18] Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In AISTATS, 2010.
- [19] Priya Goyal, Mathilde Caron, Benjamin Lefaudeux, Min Xu, Pengchao Wang, Vivek Pai, Mannat Singh, Vitaliy Liptchinsky, Ishan Misra, Armand Joulin, and Piotr Bojanowski. Self-supervised pretraining of visual features in the wild. arXiv:2103.01988, 2021.
- [20] Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, large minibatch SGD: Training ImageNet in 1 hour. arXiv:1706.02677, 2017.
- [21] Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Guo, Mohammad Gheshlaghi Azar, Bilal Piot, Koray Kavukcuoglu, Remi Munos, and Michal Valko. Bootstrap your own latent - a new approach to self-supervised learning. In NeurIPS, 2020.
- [22] Raia Hadsell, Sumit Chopra, and Yann LeCun. Dimensionality reduction by learning an invariant mapping. In CVPR, 2006.
- [23] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. In CVPR, 2020.
- [24] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask R-CNN. In ICCV, 2017.
- [25] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
- [26] Dan Hendrycks, Steven Basart, Norman Mu, Saurav Kadavath, Frank Wang, Evan Dorundo, Rahul Desai, Tyler Zhu, Samyak Parajuli, Mike Guo, et al. The many faces of robustness: A critical analysis of out-of-distribution generalization. In ICCV, 2021.
- [27] Dan Hendrycks and Thomas Dietterich. Benchmarking neural network robustness to common corruptions and perturbations. In ICLR, 2019.
- [28] Dan Hendrycks, Kevin Zhao, Steven Basart, Jacob Steinhardt, and Dawn Song. Natural adversarial examples. In CVPR, 2021.
- [29] Geoffrey E. Hinton and Richard S. Zemel. Autoencoders, minimum description length, and Helmholtz free energy. In NeurIPS, 1994.
- [30] Gao Huang, Yu Sun, Zhuang Liu, Daniel Sedra, and Kilian Q. Weinberger. Deep networks with stochastic depth. In ECCV, 2016.
- [31] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015.
- [32] Insoo Kim, Seungju Han, Ji-won Baek, Seong-Jin Park, Jae-Joon Han, and Jinwoo Shin. Quality-agnostic image recognition via invertible decoder. In CVPR, 2021.
- [33] Alex Krizhevsky, Ilya Sutskever, and Geoff Hinton. ImageNet classification with deep convolutional neural networks. In NeurIPS, 2012.
- [34] Yann LeCun, Bernhard Boser, John S. Denker, Donnie Henderson, Richard E. Howard, Wayne Hubbard, and Lawrence D. Jackel. Backpropagation applied to handwritten zip code recognition. Neural Computation, 1989.
- [35] Yanghao Li, Saining Xie, Xinlei Chen, Piotr Dollár, Kaiming He, and Ross Girshick. Benchmarking detection transfer learning with vision transformers. In preparation, 2021.
- [36] Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. In CVPR, 2017.
- [37] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014.
- [38] Ilya Loshchilov and Frank Hutter. SGDR: Stochastic gradient descent with warm restarts. In ICLR, 2017.
- [39] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In ICLR, 2019.
- [40] Dhruv Mahajan, Ross Girshick, Vignesh Ramanathan, Kaiming He, Manohar Paluri, Yixuan Li, Ashwin Bharambe, and Laurens van der Maaten. Exploring the limits of weakly supervised pre-training. In ECCV, 2018.
- [41] Xiaofeng Mao, Gege Qi, Yuefeng Chen, Xiaodan Li, Ranjie Duan, Shaokai Ye, Yuan He, and Hui Xue. Towards robust vision transformer. arXiv:2105.07926, 2021.
- [42] Mehdi Noroozi and Paolo Favaro. Unsupervised learning of visual representations by solving jigsaw puzzles. In ECCV, 2016.
- [43] Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. arXiv:1807.03748, 2018.
- [44] Aaron van den Oord, Oriol Vinyals, and Koray Kavukcuoglu. Neural discrete representation learning. In NeurIPS, 2017.
- [45] Deepak Pathak, Ross Girshick, Piotr Dollár, Trevor Darrell, and Bharath Hariharan. Learning features by watching objects move. In CVPR, 2017.
- [46] Deepak Pathak, Philipp Krahenbuhl, Jeff Donahue, Trevor Darrell, and Alexei A. Efros. Context encoders: Feature learning by inpainting. In CVPR, 2016.
- [47] Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Improving language understanding by generative pre-training. 2018.
- [48] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. 2019.
- [49] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. JMLR, 2020.
- [50] Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-shot text-to-image generation. In ICML, 2021.
- [51] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
- [52] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In CVPR, 2016.
- [53] Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Hervé Jégou. Training data-efficient image transformers & distillation through attention. In ICML, 2021.
- [54] Hugo Touvron, Alexandre Sablayrolles, Matthijs Douze, Matthieu Cord, and Hervé Jégou. Grafit: Learning fine-grained image representations with coarse labels. In ICCV, 2021.
- [55] Hugo Touvron, Andrea Vedaldi, Matthijs Douze, and Hervé Jégou. Fixing the train-test resolution discrepancy. arXiv:1906.06423, 2019.
- [56] Grant Van Horn, Oisin Mac Aodha, Yang Song, Yin Cui, Chen Sun, Alex Shepard, Hartwig Adam, Pietro Perona, and Serge Belongie. The iNaturalist species classification and detection dataset. In CVPR, 2018.
- [57] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NeurIPS, 2017.
- [58] Pascal Vincent, Hugo Larochelle, Yoshua Bengio, and Pierre-Antoine Manzagol. Extracting and composing robust features with denoising autoencoders. In ICML, 2008.
- [59] Pascal Vincent, Hugo Larochelle, Isabelle Lajoie, Yoshua Bengio, Pierre-Antoine Manzagol, and Léon Bottou. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. JMLR, 2010.
- [60] Haohan Wang, Songwei Ge, Zachary Lipton, and Eric P. Xing. Learning robust global representations by penalizing local predictive power. In NeurIPS, 2019.
- [61] Xiaolong Wang and Abhinav Gupta. Unsupervised learning of visual representations using videos. In ICCV, 2015.
- [62] Zhirong Wu, Yuanjun Xiong, Stella Yu, and Dahua Lin. Unsupervised feature learning via non-parametric instance discrimination. In CVPR, 2018.
- [63] Tete Xiao, Yingcheng Liu, Bolei Zhou, Yuning Jiang, and Jian Sun. Unified perceptual parsing for scene understanding. In ECCV, 2018.
- [64] Tete Xiao, Mannat Singh, Eric Mintun, Trevor Darrell, Piotr Dollár, and Ross Girshick. Early convolutions help transformers see better. In NeurIPS, 2021.
- [65] Jason Yosinski, Jeff Clune, Yoshua Bengio, and Hod Lipson. How transferable are features in deep neural networks? In NeurIPS, 2014.
- [66] Yang You, Igor Gitman, and Boris Ginsburg. Large batch training of convolutional networks. arXiv:1708.03888, 2017.
- [67] Li Yuan, Qibin Hou, Zihang Jiang, Jiashi Feng, and Shuicheng Yan. VOLO: Vision outlooker for visual recognition. arXiv:2106.13112, 2021.
- [68] Sangdoo Yun, Dongyoon Han, Seong Joon Oh, Sanghyuk Chun, Junsuk Choe, and Youngjoon Yoo. CutMix: Regularization strategy to train strong classifiers with localizable features. In ICCV, 2019.
- [69] Hongyi Zhang, Moustapha Cisse, Yann N. Dauphin, and David Lopez-Paz. mixup: Beyond empirical risk minimization. In ICLR, 2018.
- [70] Richard Zhang, Phillip Isola, and Alexei A. Efros. Colorful image colorization. In ECCV, 2016.
- [71] Bolei Zhou, Agata Lapedriza, Jianxiong Xiao, Antonio Torralba, and Aude Oliva. Learning deep features for scene recognition using Places database. In NeurIPS, 2014.
- [72] Bolei Zhou, Hang Zhao, Xavier Puig, Tete Xiao, Sanja Fidler, Adela Barriuso, and Antonio Torralba. Semantic understanding of scenes through the ADE20K dataset. IJCV, 2019.