Recognition: 2 theorem links
Masked Autoencoders Are Scalable Vision Learners
Pith reviewed 2026-05-16 06:49 UTC · model grok-4.3
The pith
Masked autoencoders learn scalable vision features by reconstructing heavily masked image patches.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Masked autoencoders are scalable self-supervised learners for computer vision. The approach masks random patches of the input image and reconstructs the missing pixels. It is based on an asymmetric encoder-decoder architecture, with an encoder that operates only on the visible subset of patches without mask tokens, along with a lightweight decoder that reconstructs the original image from the latent representation and mask tokens. Masking a high proportion of the input image, such as 75 percent, yields a nontrivial and meaningful self-supervisory task. Coupling these designs enables efficient training of large models that generalize well, for example a vanilla ViT-Huge model achieving 87.8 percent top-1 accuracy, the best among methods that use only ImageNet-1K data.
What carries the argument
Asymmetric encoder-decoder where the encoder processes only visible patches and the lightweight decoder reconstructs the full image from latent features plus mask tokens.
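A minimal sketch of that asymmetric flow, in PyTorch style. This is an illustration under stated assumptions, not the paper's implementation: `patch_embed`, `encoder`, `decoder`, and the learned `mask_token` (assumed shape (1, 1, D)) are stand-ins for the ViT components, and positional embeddings, the class token, and the encoder-to-decoder projection are omitted. The argsort-over-noise trick is one common way to sample patches without replacement.

```python
import torch

def random_masking(x, mask_ratio=0.75):
    """Keep a random subset of patch tokens per sample.

    x: (B, N, D) patch embeddings. Returns the visible tokens,
    a binary mask in original patch order (1 = removed), and the
    permutation needed to restore that order later.
    """
    B, N, D = x.shape
    len_keep = int(N * (1 - mask_ratio))

    noise = torch.rand(B, N, device=x.device)        # one score per patch
    ids_shuffle = torch.argsort(noise, dim=1)        # random permutation
    ids_restore = torch.argsort(ids_shuffle, dim=1)  # its inverse

    ids_keep = ids_shuffle[:, :len_keep]
    x_visible = torch.gather(
        x, 1, ids_keep.unsqueeze(-1).expand(-1, -1, D))

    mask = torch.ones(B, N, device=x.device)
    mask[:, :len_keep] = 0
    mask = torch.gather(mask, 1, ids_restore)        # back to patch order
    return x_visible, mask, ids_restore

def mae_forward(imgs, patch_embed, encoder, decoder, mask_token,
                mask_ratio=0.75):
    """Asymmetric flow: encode visible patches only, decode all."""
    tokens = patch_embed(imgs)                       # (B, N, D)
    visible, mask, ids_restore = random_masking(tokens, mask_ratio)

    latent = encoder(visible)                        # never sees mask tokens

    # Append learned mask tokens, then unshuffle so every patch
    # position is filled before the lightweight decoder runs.
    B, N = mask.shape
    n_masked = N - latent.shape[1]
    dec_in = torch.cat(
        [latent, mask_token.expand(B, n_masked, -1)], dim=1)
    dec_in = torch.gather(
        dec_in, 1,
        ids_restore.unsqueeze(-1).expand(-1, -1, dec_in.shape[-1]))
    pred = decoder(dec_in)                           # per-patch pixels
    return pred, mask
```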
If this is right
- Training accelerates by 3x or more while accuracy improves.
- A vanilla ViT-Huge reaches 87.8 percent top-1 accuracy on ImageNet-1K using no outside data.
- Transfer performance on downstream tasks exceeds supervised pre-training.
- The method exhibits promising scaling behavior as model size grows.
- High masking ratios produce meaningful self-supervision that supports large models.
Where Pith is reading between the lines
- The same masking-plus-reconstruction pattern could apply directly to video or audio by hiding patches across time or frequency.
- The training efficiency opens the door to pre-training on image collections far larger than ImageNet without labels.
- Reconstruction objectives may serve as a drop-in replacement for contrastive losses when scaling vision transformers.
- Hybrid versions that combine this decoder with contrastive heads could be tested on the same architectures.
Load-bearing premise
Masking a high proportion of the input creates a nontrivial self-supervisory task whose difficulty drives useful feature learning rather than trivial solutions.
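One way to make this concrete: the reconstruction loss is computed only on masked patches, so the masking ratio directly sets how much of the image must be inferred from how little. A hedged sketch continuing the code above; the `patchify` helper (flattening ground-truth patches to match `pred`) is an assumed hypothetical, and the per-patch pixel normalization is the variant the paper reports as helpful.

```python
def mae_loss(pred, imgs, mask, patchify, eps=1e-6):
    """Mean squared error, averaged over masked patches only.

    pred: (B, N, patch_dim) predicted pixels per patch.
    mask: (B, N), 1 where the patch was masked out.
    """
    target = patchify(imgs)                          # (B, N, patch_dim)

    # Per-patch pixel normalization (the variant the paper
    # reports as improving representation quality).
    mean = target.mean(dim=-1, keepdim=True)
    var = target.var(dim=-1, keepdim=True)
    target = (target - mean) / (var + eps) ** 0.5

    loss = ((pred - target) ** 2).mean(dim=-1)       # per-patch MSE
    return (loss * mask).sum() / mask.sum()          # visible patches excluded
```

At a 75% ratio the model must predict three quarters of the image from one quarter of it, which is what rules out trivial interpolation solutions.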
What would settle it
A ViT-Huge model trained with this 75-percent masking recipe on ImageNet-1K falling below 87.8 percent top-1 accuracy, or a lower masking ratio matching or exceeding that accuracy, would break the claim.
Original abstract
This paper shows that masked autoencoders (MAE) are scalable self-supervised learners for computer vision. Our MAE approach is simple: we mask random patches of the input image and reconstruct the missing pixels. It is based on two core designs. First, we develop an asymmetric encoder-decoder architecture, with an encoder that operates only on the visible subset of patches (without mask tokens), along with a lightweight decoder that reconstructs the original image from the latent representation and mask tokens. Second, we find that masking a high proportion of the input image, e.g., 75%, yields a nontrivial and meaningful self-supervisory task. Coupling these two designs enables us to train large models efficiently and effectively: we accelerate training (by 3x or more) and improve accuracy. Our scalable approach allows for learning high-capacity models that generalize well: e.g., a vanilla ViT-Huge model achieves the best accuracy (87.8%) among methods that use only ImageNet-1K data. Transfer performance in downstream tasks outperforms supervised pre-training and shows promising scaling behavior.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces masked autoencoders (MAE) as a scalable self-supervised pre-training approach for vision. It masks a high fraction (e.g., 75%) of random image patches and reconstructs the missing pixels via an asymmetric encoder-decoder: the encoder processes only the visible patches (no mask tokens), while a lightweight decoder reconstructs the full image from the latent representation plus mask tokens. This design enables efficient training of large ViT models; a vanilla ViT-Huge achieves 87.8% top-1 accuracy on ImageNet-1K using only ImageNet-1K data and shows strong transfer gains over supervised pre-training.
Significance. If the empirical results hold, the work is significant because it demonstrates that a simple, high-masking-ratio reconstruction task combined with an asymmetric architecture can scale self-supervised learning to high-capacity vision models, yielding both 3x+ training acceleration and state-of-the-art ImageNet-1K accuracy among ImageNet-only methods. The extensive ablations on masking ratio and decoder depth, together with downstream transfer experiments, provide direct support for the central scalability claim.
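The acceleration claim is plausible on dimensional grounds alone: at a 75% masking ratio the encoder sees a quarter of the tokens, which cuts per-token (MLP) FLOPs roughly 4x and quadratic self-attention FLOPs roughly 16x. A back-of-envelope sketch, assuming the standard 224x224 input with 16x16 patches:

```python
num_patches = (224 // 16) ** 2     # 196 tokens for a standard ViT input
visible = int(num_patches * 0.25)  # 49 tokens reach the encoder at 75% masking

mlp_saving = num_patches / visible            # ~4x fewer per-token FLOPs
attn_saving = (num_patches / visible) ** 2    # ~16x fewer attention FLOPs
print(f"MLP ~{mlp_saving:.0f}x, attention ~{attn_saving:.0f}x")
```

The measured 3x wall-clock speedup is smaller than these raw FLOP ratios because the decoder, data loading, and other non-encoder overheads add back cost.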
minor comments (2)
- [Abstract] The statement that 87.8% is the 'best accuracy among methods that use only ImageNet-1K data' would be strengthened by an explicit footnote or table reference listing the exact competing methods and their scores.
- [Section 4.2] The description of the masking ratio ablation would benefit from a brief statement of the reconstruction loss behavior at 75% versus lower ratios to make the 'nontrivial task' claim more concrete.
Simulated Author's Rebuttal
We thank the referee for the positive and insightful review, as well as the recommendation to accept the manuscript. The referee's summary accurately captures our core contributions regarding the asymmetric encoder-decoder design and high masking ratio in masked autoencoders for scalable self-supervised pre-training of vision transformers.
Circularity Check
No significant circularity; empirical method is self-contained
full rationale
The paper presents an empirical self-supervised method (asymmetric encoder-decoder with 75% random patch masking) whose core designs are stated directly as architectural choices and training procedures. All reported results, including the 87.8% ImageNet-1K accuracy for ViT-Huge, are obtained from end-to-end training and evaluation on fixed public benchmarks. No central quantity is defined in terms of itself, no fitted parameter is relabeled as a prediction, and no load-bearing premise reduces to a self-citation chain; the nontriviality of the masking task is tested via ablations rather than assumed by construction.
Axiom & Free-Parameter Ledger
free parameters (1)
- masking ratio = 75%
axioms (1)
- standard math Vision Transformer patch embedding and self-attention from Dosovitskiy et al. 2020
Lean theorems connected to this paper
- LawOfExistence.defect_zero_iff_one (tagged: echoes)
ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
Passage: "we mask random patches of the input image and reconstruct the missing pixels... masking a high proportion of the input image, e.g., 75%, yields a nontrivial and meaningful self-supervisory task"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 22 Pith papers
- Mask World Model: Predicting What Matters for Robust Robot Policy Learning
  Mask World Model predicts semantic mask dynamics with video diffusion and integrates it with a diffusion policy head, outperforming RGB world models on LIBERO and RLBench while showing better real-world generalization...
- Representing 3D Faces with Learnable B-Spline Volumes
  CUBE encodes 3D faces via a grid of learned high-dimensional B-spline features that map parametrically to a base shape plus MLP-refined displacements, enabling dense correspondence and state-of-the-art registration fr...
- Learning to Discover at Test Time
  TTT-Discover applies test-time RL to set new state-of-the-art results on math inequalities, GPU kernels, algorithm contests, and single-cell denoising using an open model and public code.
- Massive Activations in Large Language Models
  Massive activations are constant large values in LLMs that function as indispensable bias terms and concentrate attention probabilities on specific tokens.
- A Time Series is Worth 64 Words: Long-term Forecasting with Transformers
  PatchTST uses subseries patching and channel-independent Transformers to deliver significantly better long-term multivariate time series forecasting and strong self-supervised transfer performance.
- Earth System Foundation Model (ESFM): A unified framework for heterogeneous data integration and forecasting
  ESFM is a single open foundation model that unifies heterogeneous Earth data sources and forecasts missing regions while preserving inter-variable physical relationships.
- Beyond Independent Frames: Latent Attention Masked Autoencoders for Multi-View Echocardiography
  LAMAE adds latent-space attention to masked autoencoders so multi-view echocardiography videos can exchange information across frames and views, yielding representations that transfer from adult to pediatric hearts an...
- Self-supervised Pretraining of Cell Segmentation Models
  DINOCell achieves a SEG score of 0.784 on LIVECell by self-supervised domain adaptation of DINOv2, improving 10.42% over SAM-based models and showing strong zero-shot transfer.
- Multi-modal, multi-scale representation learning for satellite imagery analysis just needs a good ALiBi
  Scale-ALiBi adds a spatial-scale bias to ALiBi attention, enabling effective representation learning across high- and low-resolution optical and SAR satellite images.
- Zero-shot World Models Are Developmentally Efficient Learners
  A zero-shot visual world model trained on one child's experience achieves broad competence on physical understanding benchmarks while matching developmental behavioral patterns.
- Physics-Informed Transformer for Real-Time High-Fidelity Topology Optimization
  A transformer model with self-attention and auxiliary physics losses learns a direct non-iterative mapping from loads and fields to manufacturable optimized topologies.
- C$^2$FG: Control Classifier-Free Guidance via Score Discrepancy Analysis
  C²FG provides a time-dependent exponential decay control for classifier-free guidance based on theoretical upper bounds on conditional-unconditional score discrepancies in diffusion processes.
- Foundation Model-Driven Semantic Change Detection in Remote Sensing Imagery
  PerASCD sets new state-of-the-art Sek scores on SECOND and LandsatSCD datasets by using a modular cascaded gated decoder on PerA foundation model features plus a new consistency loss.
- LLaVA-Video: Video Instruction Tuning With Synthetic Data
  LLaVA-Video-178K is a new synthetic video instruction dataset that, when combined with existing data to train LLaVA-Video, produces strong results on video understanding benchmarks.
- Revisiting Feature Prediction for Learning Visual Representations from Video
  V-JEPA models trained only on feature prediction from 2 million public videos achieve 81.9% on Kinetics-400, 72.2% on Something-Something-v2, and 77.9% on ImageNet-1K using frozen ViT-H/16 backbones.
- Demystifying CLIP Data
  MetaCLIP curates balanced 400M-pair subsets from CommonCrawl that outperform CLIP data, reaching 70.8% zero-shot ImageNet accuracy on ViT-B versus CLIP's 68.3%.
- CoCa: Contrastive Captioners are Image-Text Foundation Models
  CoCa unifies contrastive and generative pretraining in one image-text model to reach 86.3% zero-shot ImageNet accuracy and new state-of-the-art results on multiple downstream benchmarks.
- PANC: Prior-Aware Normalized Cut via Anchor-Augmented Token Graphs
  PANC augments Normalized Cut with anchor-augmented token graphs using priors to steer spectral partitions, yielding mIoU gains of 2.3-8.7% over baselines on DUTS-TE, DUT-OMRON, and CrackForest.
- Now You See That: Learning End-to-End Humanoid Locomotion from Raw Pixels
  An end-to-end policy learns robust humanoid locomotion directly from noisy depth images via high-fidelity sensor simulation, vision-aware distillation from privileged maps, and terrain-specific multi-critic reward shaping.
- Galactica: A Large Language Model for Science
  Galactica, a science-specialized LLM, reports higher scores than GPT-3, Chinchilla, and PaLM on LaTeX knowledge, mathematical reasoning, and medical QA benchmarks while outperforming general models on BIG-bench.
- Using Deep Learning Models Pretrained by Self-Supervised Learning for Protein Localization
  DINO-based ViT models pretrained on HPA FOV achieve macro F1 of 0.822 zero-shot and 0.860 after fine-tuning for protein localization on OpenCell, demonstrating effective transfer from SSL pretraining.
- AMO-ENE: Attention-based Multi-Omics Fusion Model for Outcome Prediction in Extra Nodal Extension and HPV-associated Oropharyngeal Cancer
  An attention-based fusion model combining semi-supervised CT segmentation, radiomics, and clinical features predicts metastatic recurrence, overall survival, and disease-free survival in HPV+ oropharyngeal cancer with...
Reference graph
Works this paper leans on
- [1] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. Layer normalization. arXiv:1607.06450, 2016.
- [2] Hangbo Bao, Li Dong, and Furu Wei. BEiT: BERT pre-training of image transformers. arXiv:2106.08254, 2021. Accessed in June 2021.
- [3] Suzanna Becker and Geoffrey E. Hinton. Self-organizing neural network that discovers surfaces in random-dot stereograms. Nature, 1992.
- [4] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D. Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott G... Language models are few-shot learners. 2020.
- [5] Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. In ICCV, 2021.
- [6] Mark Chen, Alec Radford, Rewon Child, Jeffrey Wu, Heewoo Jun, David Luan, and Ilya Sutskever. Generative pretraining from pixels. In ICML, 2020.
- [7] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. In ICML, 2020.
- [8] Xinlei Chen and Kaiming He. Exploring simple Siamese representation learning. In CVPR, 2021.
- [9] Xinlei Chen, Saining Xie, and Kaiming He. An empirical study of training self-supervised Vision Transformers. In ICCV, 2021.
- [10] Kevin Clark, Minh-Thang Luong, Quoc V. Le, and Christopher D. Manning. ELECTRA: Pre-training text encoders as discriminators rather than generators. In ICLR, 2020.
- [11] Corinna Cortes and Vladimir Vapnik. Support-vector networks. Machine Learning, 1995.
- [12] Ekin D. Cubuk, Barret Zoph, Jonathon Shlens, and Quoc V. Le. RandAugment: Practical automated data augmentation with a reduced search space. In CVPR Workshops, 2020.
- [13] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.
- [14] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL, 2019.
- [15] Carl Doersch, Abhinav Gupta, and Alexei A. Efros. Unsupervised visual representation learning by context prediction. In ICCV, 2015.
- [16] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR, 2021.
- [17] Spyros Gidaris, Praveer Singh, and Nikos Komodakis. Unsupervised representation learning by predicting image rotations. In ICLR, 2018.
- [18] Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In AISTATS, 2010.
- [19] Priya Goyal, Mathilde Caron, Benjamin Lefaudeux, Min Xu, Pengchao Wang, Vivek Pai, Mannat Singh, Vitaliy Liptchinsky, Ishan Misra, Armand Joulin, and Piotr Bojanowski. Self-supervised pretraining of visual features in the wild. arXiv:2103.01988, 2021.
- [20] Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, large minibatch SGD: Training ImageNet in 1 hour. arXiv:1706.02677, 2017.
- [21] Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Guo, Mohammad Gheshlaghi Azar, Bilal Piot, Koray Kavukcuoglu, Remi Munos, and Michal Valko. Bootstrap your own latent - a new approach to self-supervised learning. In NeurIPS, 2020.
- [22] Raia Hadsell, Sumit Chopra, and Yann LeCun. Dimensionality reduction by learning an invariant mapping. In CVPR, 2006.
- [23] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. In CVPR, 2020.
- [24] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask R-CNN. In ICCV, 2017.
- [25] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
- [26] Dan Hendrycks, Steven Basart, Norman Mu, Saurav Kadavath, Frank Wang, Evan Dorundo, Rahul Desai, Tyler Zhu, Samyak Parajuli, Mike Guo, et al. The many faces of robustness: A critical analysis of out-of-distribution generalization. In ICCV, 2021.
- [27] Dan Hendrycks and Thomas Dietterich. Benchmarking neural network robustness to common corruptions and perturbations. In ICLR, 2019.
- [28] Dan Hendrycks, Kevin Zhao, Steven Basart, Jacob Steinhardt, and Dawn Song. Natural adversarial examples. In CVPR, 2021.
- [29] Geoffrey E. Hinton and Richard S. Zemel. Autoencoders, minimum description length, and Helmholtz free energy. In NeurIPS, 1994.
- [30] Gao Huang, Yu Sun, Zhuang Liu, Daniel Sedra, and Kilian Q. Weinberger. Deep networks with stochastic depth. In ECCV, 2016.
- [31] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015.
- [32] Insoo Kim, Seungju Han, Ji-won Baek, Seong-Jin Park, Jae-Joon Han, and Jinwoo Shin. Quality-agnostic image recognition via invertible decoder. In CVPR, 2021.
- [33] Alex Krizhevsky, Ilya Sutskever, and Geoff Hinton. ImageNet classification with deep convolutional neural networks. In NeurIPS, 2012.
- [34] Yann LeCun, Bernhard Boser, John S. Denker, Donnie Henderson, Richard E. Howard, Wayne Hubbard, and Lawrence D. Jackel. Backpropagation applied to handwritten zip code recognition. Neural Computation, 1989.
- [35] Yanghao Li, Saining Xie, Xinlei Chen, Piotr Dollár, Kaiming He, and Ross Girshick. Benchmarking detection transfer learning with vision transformers. In preparation, 2021.
- [36] Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. In CVPR, 2017.
- [37] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014.
- [38] Ilya Loshchilov and Frank Hutter. SGDR: Stochastic gradient descent with warm restarts. In ICLR, 2017.
- [39] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In ICLR, 2019.
- [40] Dhruv Mahajan, Ross Girshick, Vignesh Ramanathan, Kaiming He, Manohar Paluri, Yixuan Li, Ashwin Bharambe, and Laurens van der Maaten. Exploring the limits of weakly supervised pre-training. In ECCV, 2018.
- [41] Xiaofeng Mao, Gege Qi, Yuefeng Chen, Xiaodan Li, Ranjie Duan, Shaokai Ye, Yuan He, and Hui Xue. Towards robust vision transformer. arXiv:2105.07926, 2021.
- [42] Mehdi Noroozi and Paolo Favaro. Unsupervised learning of visual representations by solving jigsaw puzzles. In ECCV, 2016.
- [43] Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. arXiv:1807.03748, 2018.
- [44] Aaron van den Oord, Oriol Vinyals, and Koray Kavukcuoglu. Neural discrete representation learning. In NeurIPS, 2017.
- [45] Deepak Pathak, Ross Girshick, Piotr Dollár, Trevor Darrell, and Bharath Hariharan. Learning features by watching objects move. In CVPR, 2017.
- [46] Deepak Pathak, Philipp Krahenbuhl, Jeff Donahue, Trevor Darrell, and Alexei A. Efros. Context encoders: Feature learning by inpainting. In CVPR, 2016.
- [47] Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Improving language understanding by generative pre-training. 2018.
- [48] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. 2019.
- [49] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. JMLR, 2020.
- [50] Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-shot text-to-image generation. In ICML, 2021.
- [51] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
- [52] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In CVPR, 2016.
- [53] Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Hervé Jégou. Training data-efficient image transformers & distillation through attention. In ICML, 2021.
- [54] Hugo Touvron, Alexandre Sablayrolles, Matthijs Douze, Matthieu Cord, and Hervé Jégou. Grafit: Learning fine-grained image representations with coarse labels. In ICCV, 2021.
- [55] Hugo Touvron, Andrea Vedaldi, Matthijs Douze, and Hervé Jégou. Fixing the train-test resolution discrepancy. arXiv:1906.06423, 2019.
- [56] Grant Van Horn, Oisin Mac Aodha, Yang Song, Yin Cui, Chen Sun, Alex Shepard, Hartwig Adam, Pietro Perona, and Serge Belongie. The iNaturalist species classification and detection dataset. In CVPR, 2018.
- [57] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NeurIPS, 2017.
- [58] Pascal Vincent, Hugo Larochelle, Yoshua Bengio, and Pierre-Antoine Manzagol. Extracting and composing robust features with denoising autoencoders. In ICML, 2008.
- [59] Pascal Vincent, Hugo Larochelle, Isabelle Lajoie, Yoshua Bengio, Pierre-Antoine Manzagol, and Léon Bottou. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. JMLR, 2010.
- [60] Haohan Wang, Songwei Ge, Zachary Lipton, and Eric P. Xing. Learning robust global representations by penalizing local predictive power. In NeurIPS, 2019.
- [61] Xiaolong Wang and Abhinav Gupta. Unsupervised learning of visual representations using videos. In ICCV, 2015.
- [62] Zhirong Wu, Yuanjun Xiong, Stella Yu, and Dahua Lin. Unsupervised feature learning via non-parametric instance discrimination. In CVPR, 2018.
- [63] Tete Xiao, Yingcheng Liu, Bolei Zhou, Yuning Jiang, and Jian Sun. Unified perceptual parsing for scene understanding. In ECCV, 2018.
- [64] Tete Xiao, Mannat Singh, Eric Mintun, Trevor Darrell, Piotr Dollár, and Ross Girshick. Early convolutions help transformers see better. In NeurIPS, 2021.
- [65] Jason Yosinski, Jeff Clune, Yoshua Bengio, and Hod Lipson. How transferable are features in deep neural networks? In NeurIPS, 2014.
- [66] Yang You, Igor Gitman, and Boris Ginsburg. Large batch training of convolutional networks. arXiv:1708.03888, 2017.
- [67] Li Yuan, Qibin Hou, Zihang Jiang, Jiashi Feng, and Shuicheng Yan. VOLO: Vision outlooker for visual recognition. arXiv:2106.13112, 2021.
- [68] Sangdoo Yun, Dongyoon Han, Seong Joon Oh, Sanghyuk Chun, Junsuk Choe, and Youngjoon Yoo. CutMix: Regularization strategy to train strong classifiers with localizable features. In ICCV, 2019.
- [69] Hongyi Zhang, Moustapha Cisse, Yann N. Dauphin, and David Lopez-Paz. mixup: Beyond empirical risk minimization. In ICLR, 2018.
- [70] Richard Zhang, Phillip Isola, and Alexei A. Efros. Colorful image colorization. In ECCV, 2016.
- [71] Bolei Zhou, Agata Lapedriza, Jianxiong Xiao, Antonio Torralba, and Aude Oliva. Learning deep features for scene recognition using Places database. In NeurIPS, 2014.
- [72] Bolei Zhou, Hang Zhao, Xavier Puig, Tete Xiao, Sanja Fidler, Adela Barriuso, and Antonio Torralba. Semantic understanding of scenes through the ADE20K dataset. IJCV, 2019.