BEiT: BERT Pre-Training of Image Transformers
Pith reviewed 2026-05-13 11:44 UTC · model grok-4.3
The pith
BEiT pre-trains vision transformers by recovering discrete visual tokens from masked image patches, reaching 83.2% top-1 accuracy on ImageNet-1K with a base-size model.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
BEiT pre-trains a vision transformer encoder by feeding it corrupted images consisting of visible patches plus mask tokens, then requiring it to reconstruct the discrete visual tokens that a separate tokenizer assigned to the original full image. The same encoder weights are later fine-tuned directly on downstream tasks without further architectural changes.
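The corruption step described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the names `corrupt_patches`, `MASK`, and `mask_ratio`, the uniform random sampling, the 0.4 default, and the toy token ids are all assumptions made for the example.

```python
import random

MASK = "[MASK]"  # stand-in for the learnable mask embedding (hypothetical name)


def corrupt_patches(patches, visual_tokens, mask_ratio=0.4, seed=0):
    """Replace a random subset of patches with MASK and collect the targets.

    patches        per-patch inputs, one per grid cell
    visual_tokens  integer token ids assigned by the tokenizer to the
                   uncorrupted image, aligned one-to-one with patches
    Returns (corrupted, targets); targets maps each masked position to the
    visual token the encoder should recover there.
    """
    if len(patches) != len(visual_tokens):
        raise ValueError("patches and visual_tokens must align one-to-one")
    rng = random.Random(seed)
    num_masked = max(1, round(mask_ratio * len(patches)))
    masked_positions = sorted(rng.sample(range(len(patches)), num_masked))
    masked_set = set(masked_positions)
    # Visible patches pass through unchanged; masked ones become MASK.
    corrupted = [MASK if i in masked_set else p for i, p in enumerate(patches)]
    # Targets come from the tokenizer's view of the original image.
    targets = {i: visual_tokens[i] for i in masked_positions}
    return corrupted, targets


# Toy example: a 4x4 patch grid with made-up token ids.
patches = [f"patch_{i}" for i in range(16)]
visual_tokens = [(7 * i) % 8192 for i in range(16)]
corrupted, targets = corrupt_patches(patches, visual_tokens)
```

The point of the sketch is that the targets never come from the corrupted input: they are read off the tokenizer's output for the full image, so the encoder has to infer them from the visible context.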
What carries the argument
A masked image modeling objective: the encoder must recover the discrete visual tokens of a randomly masked set of image patches.
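One common way to score such an objective is the average negative log-likelihood of the true visual token at each masked position, with visible positions contributing nothing. The sketch below assumes that form; `masked_token_loss` and its dict-based inputs are hypothetical names chosen for readability, and a real implementation would operate on tensors.

```python
import math


def masked_token_loss(logits, targets):
    """Average negative log-likelihood of the true token at masked positions.

    logits   dict: position -> list of unnormalized scores over the vocabulary
    targets  dict: masked position -> true visual token id
    Positions absent from targets (visible patches) are ignored.
    """
    if not targets:
        raise ValueError("need at least one masked position")
    total = 0.0
    for pos, token_id in targets.items():
        scores = logits[pos]
        peak = max(scores)  # subtract the max for a stable log-softmax
        log_norm = peak + math.log(sum(math.exp(s - peak) for s in scores))
        total += log_norm - scores[token_id]
    return total / len(targets)


vocab_size = 4  # tiny illustrative vocabulary
targets = {0: 0, 2: 2}
uniform = {0: [0.0] * vocab_size, 2: [0.0] * vocab_size}
confident = {0: [10.0, 0.0, 0.0, 0.0], 2: [0.0, 0.0, 10.0, 0.0]}
loss_uniform = masked_token_loss(uniform, targets)      # equals log(vocab_size)
loss_confident = masked_token_loss(confident, targets)  # close to zero
```

A model that guesses uniformly pays log(vocabulary size) per masked patch, which gives a simple baseline for reading pre-training loss curves.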
If this is right
- Vision transformers can reach competitive ImageNet accuracy using only ImageNet-1K for pre-training instead of larger labeled collections.
- The same transformer backbone works for both the masked pre-training stage and subsequent fine-tuning on classification or segmentation.
- Larger models benefit more from this pre-training, as shown by the jump from base to large size on the same data.
- Semantic segmentation performance improves when the encoder has first learned to predict visual tokens from masked patches.
Where Pith is reading between the lines
- Better tokenizers could raise the upper bound on what the masked modeling signal can teach the transformer.
- The same masked-token recipe might transfer to video or audio by swapping in an appropriate tokenizer for those domains.
- Combining the token-prediction loss with other self-supervised objectives could produce even stronger starting weights for fine-tuning.
Load-bearing premise
The separate tokenizer must generate discrete visual tokens that carry rich semantic content rather than collapsing to low-level patterns.
What would settle it
A BEiT model fine-tuned on ImageNet-1K classification that matches or falls below the accuracy of an identically sized DeiT model trained from scratch would show the pre-training step added no value.
Original abstract
We introduce a self-supervised vision representation model BEiT, which stands for Bidirectional Encoder representation from Image Transformers. Following BERT developed in the natural language processing area, we propose a masked image modeling task to pretrain vision Transformers. Specifically, each image has two views in our pre-training, i.e., image patches (such as 16x16 pixels), and visual tokens (i.e., discrete tokens). We first "tokenize" the original image into visual tokens. Then we randomly mask some image patches and feed them into the backbone Transformer. The pre-training objective is to recover the original visual tokens based on the corrupted image patches. After pre-training BEiT, we directly fine-tune the model parameters on downstream tasks by appending task layers upon the pretrained encoder. Experimental results on image classification and semantic segmentation show that our model achieves competitive results with previous pre-training methods. For example, base-size BEiT achieves 83.2% top-1 accuracy on ImageNet-1K, significantly outperforming from-scratch DeiT training (81.8%) with the same setup. Moreover, large-size BEiT obtains 86.3% only using ImageNet-1K, even outperforming ViT-L with supervised pre-training on ImageNet-22K (85.2%). The code and pretrained models are available at https://aka.ms/beit.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces BEiT, a self-supervised pre-training method for vision Transformers that adapts the BERT masked modeling paradigm. Each image is tokenized into discrete visual tokens via a separately trained dVAE; random patches are masked and the Transformer is trained to recover the original visual tokens from the corrupted input. After pre-training, the encoder is fine-tuned on downstream tasks. Key empirical claims are that base-size BEiT reaches 83.2% top-1 accuracy on ImageNet-1K (outperforming from-scratch DeiT at 81.8%) and large-size BEiT reaches 86.3% using only ImageNet-1K data, exceeding supervised ViT-L pre-trained on ImageNet-22K (85.2%).
Significance. If the central assumption about the tokenizer holds, the work shows that a BERT-style masked token prediction objective can be transferred to vision Transformers and yields competitive or superior ImageNet performance with substantially less supervised data than prior supervised pre-training. The public release of code and models is a positive contribution to reproducibility.
major comments (3)
- §3.2 (Tokenizer and pre-training objective): the masked modeling loss is defined over discrete visual tokens produced by a separately trained dVAE; no ablation is reported on codebook size, tokenizer training data, or alternative tokenizers. This leaves open whether the reported gains (e.g., +1.4% over DeiT) are driven by the MIM objective itself or by tokenizer-specific properties.
- Table 1 (ImageNet-1K results): the headline accuracies (83.2% base, 86.3% large) are given as single-point estimates without error bars, standard deviations, or the number of independent runs, making it impossible to judge whether the improvement over DeiT is statistically reliable.
- §4.2 (Large-model comparison): the claim that large BEiT (86.3%) outperforms ViT-L supervised on ImageNet-22K (85.2%) requires explicit confirmation that fine-tuning protocols, data augmentations, and optimizer settings are identical; any mismatch would undermine the cross-pretraining comparison.
minor comments (2)
- Abstract: the phrase 'competitive results with previous pre-training methods' is vague; listing the main baselines (DeiT, ViT, etc.) would improve clarity.
- §2.1: notation for 'visual tokens' versus standard ViT patch embeddings is introduced without a clear notational distinction, which can confuse readers familiar with the ViT paper.
Simulated Author's Rebuttal
We appreciate the referee's thorough review and valuable suggestions. We have revised the manuscript to address the major comments and provide additional details and experiments where appropriate.
Point-by-point responses
-
Referee: §3.2 (Tokenizer and pre-training objective): the masked modeling loss is defined over discrete visual tokens produced by a separately trained dVAE; no ablation is reported on codebook size, tokenizer training data, or alternative tokenizers. This leaves open whether the reported gains (e.g., +1.4% over DeiT) are driven by the MIM objective itself or by tokenizer-specific properties.
Authors: We thank the referee for highlighting this aspect. The dVAE is trained on ImageNet-1K following the original dVAE paper, and serves as a fixed discretization step. To address the concern, we have performed additional ablations on codebook size (1024, 2048, 4096, 8192) and included the results in the revised Section 3.2. The ImageNet accuracy varies by at most 0.4% across these sizes, supporting that the MIM objective is the key contributor to the performance gains over DeiT. We have also added a discussion on why dVAE was chosen over other tokenization methods.
Revision: yes
-
Referee: Table 1 (ImageNet-1K results): the headline accuracies (83.2% base, 86.3% large) are given as single-point estimates without error bars, standard deviations, or the number of independent runs, making it impossible to judge whether the improvement over DeiT is statistically reliable.
Authors: We agree that reporting statistical reliability is important for such claims. In the revised manuscript, we have updated Table 1 to include the mean accuracy and standard deviation computed over three independent runs with different random seeds. For the base model, BEiT achieves 83.2% ± 0.15%, compared to DeiT's 81.8% ± 0.20%. The improvement is consistent across runs.
Revision: yes
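To make "statistically reliable" concrete, the figures quoted in this simulated response can be plugged into Welch's t statistic. The sketch below assumes the ± values are sample standard deviations over three runs per model, as the response states; `welch_t` is a hypothetical helper, and the result is only as trustworthy as those simulated numbers.

```python
import math


def welch_t(mean_a, std_a, n_a, mean_b, std_b, n_b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    var_a = std_a ** 2 / n_a  # squared standard error of mean A
    var_b = std_b ** 2 / n_b  # squared standard error of mean B
    t = (mean_a - mean_b) / math.sqrt(var_a + var_b)
    dof = (var_a + var_b) ** 2 / (
        var_a ** 2 / (n_a - 1) + var_b ** 2 / (n_b - 1)
    )
    return t, dof


# BEiT base vs. from-scratch DeiT, three runs each (values from the response).
t_stat, dof = welch_t(83.2, 0.15, 3, 81.8, 0.20, 3)
```

With these inputs the statistic comes out near 9.7 on roughly 3.7 degrees of freedom, so the 1.4-point gap is large relative to the run-to-run spread. Three runs per arm is still a small sample, which is worth keeping in mind when reading the comparison.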
-
Referee: §4.2 (Large-model comparison): the claim that large BEiT (86.3%) outperforms ViT-L supervised on ImageNet-22K (85.2%) requires explicit confirmation that fine-tuning protocols, data augmentations, and optimizer settings are identical; any mismatch would undermine the cross-pretraining comparison.
Authors: We confirm that the fine-tuning protocol for BEiT-Large is exactly the same as that used for the supervised ViT-Large in the original ViT work, including identical data augmentations (RandAugment, Mixup, CutMix), optimizer (AdamW with the same hyperparameters), learning rate schedule, and number of epochs. We have added an explicit statement and a reference to the exact settings from Dosovitskiy et al. in the revised Section 4.2 to clarify this.
Revision: yes
Circularity Check
BEiT pre-training objective is independently defined and externally validated
Full rationale
The paper defines its masked image modeling task as recovering discrete visual tokens produced by a separately trained tokenizer, with the objective stated independently of any downstream metrics. Reported gains (e.g., 83.2% base BEiT vs. 81.8% DeiT on ImageNet-1K) are empirical results from fine-tuning on standard held-out benchmarks, not reductions of the claimed performance to the pre-training inputs by construction. No self-definitional loops, fitted parameters renamed as predictions, or load-bearing self-citations appear in the described pipeline; the tokenizer is an external component whose quality is not derived from BEiT equations.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: A separately trained tokenizer produces discrete visual tokens that are a suitable prediction target for masked image modeling.
invented entities (1)
-
visual tokens
no independent evidence
Lean theorems connected to this paper
-
Cost.FunctionalEquation washburn_uniqueness_aczel (unclear)
unclear: Relation between the paper passage and the cited Recognition theorem.
base-size BEiT achieves 83.2% top-1 accuracy on ImageNet-1K, significantly outperforming from-scratch DeiT training (81.8%)
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 52 Pith papers
-
Masked Autoencoders Are Scalable Vision Learners
Masked autoencoders with asymmetric encoder-decoder and 75% masking ratio enable scalable self-supervised pre-training of vision transformers, achieving 87.8% ImageNet-1K accuracy with ViT-Huge using only unlabeled data.
-
TCP-SSM: Efficient Vision State Space Models with Token-Conditioned Poles
TCP-SSM conditions stable poles on visual tokens to explicitly control memory decay and oscillation in SSMs, cutting computation up to 44% while matching or exceeding accuracy on classification, segmentation, and detection.
-
Rethink MAE with Linear Time-Invariant Dynamics
Token order in frozen visual representations is exploitable via SSM-based LTI probes, revealing pre-training-dependent heterogeneity that fixed pooling misses.
-
VFM$^{4}$SDG: Unveiling the Power of VFMs for Single-Domain Generalized Object Detection
VFM4SDG is a dual-prior framework that distills cross-domain stable relations from VFMs into DETR encoders and injects semantic-contextual priors into decoder queries to reduce missed detections in single-domain gener...
-
OVS-DINO: Open-Vocabulary Segmentation via Structure-Aligned SAM-DINO with Language Guidance
OVS-DINO structurally aligns DINO with SAM to revitalize attenuated boundary features, achieving SOTA gains of 2.1% average and 6.3% on Cityscapes in weakly-supervised open-vocabulary segmentation.
-
Unifying Contrastive and Generative Objectives for Visual Understanding and Text-to-Image Generation
DREAM introduces Masking Warmup and Semantically Aligned Decoding to let a single encoder handle both contrastive alignment and masked generation, yielding gains over CLIP and FLUID on understanding and generation benchmarks.
-
Recurrent Video Masked Autoencoders
RVM uses recurrent computation inside a masked autoencoder to learn video representations that match or exceed prior video and image models on classification, tracking, and dense spatial tasks with up to 30x better pa...
-
Adversarial Video Promotion Against Text-to-Video Retrieval
Pioneers ViPro, the first attack to adversarially promote videos in text-to-video retrieval, using Modal Refinement to improve black-box transferability across multiple targets.
-
The ART of Composition: Attention-Regularized Training for Compositional Visual Grounding
CompART adds a composition loss on decomposed captions to regularize attention sums and improves multi-object grounding plus VQA across four VLM types and six benchmarks.
-
Segment Anything
A promptable model trained on 1B masks achieves competitive zero-shot segmentation performance across tasks and is released publicly with its dataset.
-
ZoeDepth: Zero-shot Transfer by Combining Relative and Metric Depth
ZoeDepth combines relative depth pre-training on many datasets with metric depth fine-tuning and automatic head routing to achieve strong zero-shot generalization while preserving metric scale.
-
iBOT: Image BERT Pre-Training with Online Tokenizer
iBOT achieves 82.3% linear probing accuracy and 87.8% fine-tuning accuracy on ImageNet-1K using masked image modeling with a jointly trained online tokenizer.
-
Temporal Aware Pruning for Efficient Diffusion-based Video Generation
TAPE introduces temporal-aware token pruning for diffusion-based video generation, using frame smoothing, layer reselection, and timestep budgets to achieve speedups while maintaining visual fidelity and coherence.
-
AOI-SSL: Self-Supervised Framework for Efficient Segmentation of Wire-bonded Semiconductors In Optical Inspection
AOI-SSL combines small-domain self-supervised pre-training of vision transformers with in-context patch retrieval to reduce labeled data needs and enable fast adaptation for semiconductor wire-bond segmentation.
-
MUSE: Resolving Manifold Misalignment in Visual Tokenization via Topological Orthogonality
MUSE decouples reconstruction and semantic learning in visual tokenization via topological orthogonality, yielding SOTA generation quality and improved semantic performance over its teacher model.
-
Adaptive Texture-aware Masking for Self-Supervised Learning in 3D Dental CBCT Analysis
ATMask adaptively masks high inter-slice texture variation regions in 3D CBCT volumes during self-supervised pretraining, enabling more data-efficient learning than random masking on dental tasks with a contributed 63...
-
MAEPose: Self-Supervised Spatiotemporal Learning for Human Pose Estimation on mmWave Video
MAEPose is a masked autoencoder that learns spatiotemporal representations from unlabeled mmWave radar videos to estimate human poses, outperforming baselines by up to 22.1% in MPJPE.
-
BrainDINO: A Brain MRI Foundation Model for Generalizable Clinical Representation Learning
BrainDINO delivers a single self-supervised brain MRI representation that generalizes to tumor segmentation, disease classification, brain age estimation, and other tasks without volumetric pretraining or full fine-tuning.
-
VFM$^{4}$SDG: Unveiling the Power of VFMs for Single-Domain Generalized Object Detection
VFM⁴SDG uses a frozen vision foundation model to inject cross-domain stability priors into both the encoding and decoding stages of object detectors, reducing missed detections in unseen environments.
-
Image Generators are Generalist Vision Learners
Image generation pretraining produces generalist vision models that reframe perception tasks as image synthesis and reach SOTA results on segmentation, depth estimation, and other 2D/3D tasks.
-
Image Generators are Generalist Vision Learners
Image generation pretraining builds generalist vision models that reach SOTA on 2D and 3D perception tasks by reframing them as RGB image outputs.
-
When Modalities Remember: Continual Learning for Multimodal Knowledge Graphs
MRCKG combines a multimodal-structural curriculum, cross-modal preservation, and contrastive replay to let multimodal knowledge graphs learn new entities and relations over time without catastrophic forgetting.
-
Rapidly deploying on-device eye tracking by distilling visual foundation models
DistillGaze reduces median gaze error by 58.62% on a 2000+ participant dataset by distilling foundation models into a 256K-parameter on-device model using synthetic labeled data and unlabeled real data.
-
Generative Event Pretraining with Foundation Model Alignment
GEP transfers semantic knowledge from image foundation models to event data via alignment and generative pretraining on mixed sequences to create transferable event-based visual models.
-
AaSP: Aliasing-aware Self-Supervised Pre-Training for Audio Spectrogram Transformers
AaSP learns aliasing-stable audio representations by augmenting patch tokens with adaptive subband features from alias-prone bands and using teacher-student masked modeling plus multi-mask contrastive regularization, ...
-
Tight Clusters Make Specialized Experts
Introduces Adaptive Clustering router for MoE models that scales features to identify tight expert clusters, yielding faster convergence, robustness to corruption, and performance gains.
-
YOLOv12: Attention-Centric Real-Time Object Detectors
YOLOv12 is a new attention-based real-time object detector that reports higher accuracy than YOLOv10, YOLOv11, and RT-DETR variants at comparable or better speed and efficiency.
-
LIMO: Less is More for Reasoning
LIMO achieves 63.3% on AIME24 and 95.6% on MATH500 via supervised fine-tuning on roughly 1% of the data used by prior models, supporting the claim that minimal strategic examples suffice when pre-training has already ...
-
Video Prediction Policy: A Generalist Robot Policy with Predictive Visual Representations
Video Prediction Policy conditions robot action learning on future-frame predictions inside fine-tuned video diffusion models, yielding 18.6% relative gains on Calvin ABC-D and 31.6% higher real-world success rates.
-
Mixture-of-Transformers: A Sparse and Scalable Architecture for Multi-Modal Foundation Models
MoT decouples non-embedding parameters by modality in transformers to match dense multi-modal performance with roughly one-third to one-half the FLOPs.
-
Chameleon: Mixed-Modal Early-Fusion Foundation Models
Chameleon is an early-fusion token model that handles mixed image-text sequences for understanding and generation, achieving competitive or superior performance to larger models like Llama-2, Mixtral, and Gemini-Pro o...
-
Revisiting Feature Prediction for Learning Visual Representations from Video
V-JEPA models trained only on feature prediction from 2 million public videos achieve 81.9% on Kinetics-400, 72.2% on Something-Something-v2, and 77.9% on ImageNet-1K using frozen ViT-H/16 backbones.
-
Demystifying CLIP Data
MetaCLIP curates balanced 400M-pair subsets from CommonCrawl that outperform CLIP data, reaching 70.8% zero-shot ImageNet accuracy on ViT-B versus CLIP's 68.3%.
-
EVA-CLIP: Improved Training Techniques for CLIP at Scale
EVA-CLIP delivers improved CLIP training recipes that yield 82.0% zero-shot ImageNet-1K accuracy for a 5B-parameter model after only 9 billion samples.
-
CoCa: Contrastive Captioners are Image-Text Foundation Models
CoCa unifies contrastive and generative pretraining in one image-text model to reach 86.3% zero-shot ImageNet accuracy and new state-of-the-art results on multiple downstream benchmarks.
-
Vector-quantized Image Modeling with Improved VQGAN
Improved ViT-VQGAN enables autoregressive Transformer pretraining on ImageNet tokens to reach IS 175.1 and FID 4.17 for generation plus 73.2% linear-probe accuracy, beating prior iGPT models.
-
Temporal Aware Pruning for Efficient Diffusion-based Video Generation
TAPE applies temporal-aware token pruning with smoothing, reselection, and timestep scheduling to speed up video diffusion models while preserving visual fidelity and coherence.
-
ShellfishNet: A Domain-Specific Benchmark for Visual Recognition of Marine Molluscs
ShellfishNet is a new benchmark of 8,691 images across 32 mollusc taxa for evaluating vision models on real-world underwater ecological monitoring tasks including robustness to degradation.
-
Sapiens2
Sapiens2 improves pretraining, data scale, and architecture over its predecessor to set new state-of-the-art results on human pose estimation, body-part segmentation, normal estimation, and new tasks like pointmap and...
-
Stylistic-STORM (ST-STORM) : Perceiving the Semantic Nature of Appearance
ST-STORM introduces a dual-branch SSL framework that disentangles semantic content from stylistic appearance using gated latent streams, JEPA for content invariance, and adversarial constraints for style capture.
-
PRAGMA: Revolut Foundation Model
PRAGMA pre-trains a Transformer on heterogeneous banking events with a tailored self-supervised masked objective, yielding embeddings that support strong downstream performance on credit scoring, fraud detection, and ...
-
Unifying VLM-Guided Flow Matching and Spectral Anomaly Detection for Interpretable Veterinary Diagnosis
New public dataset and VLM-guided flow matching segmentation combined with random matrix theory anomaly detection for interpretable canine pneumothorax diagnosis.
-
Expectation and Acoustic Neural Network Representations Enhance Music Identification from Brain Activity
Separating acoustic and expectation ANN representations as teacher targets improves EEG music identification beyond baselines and seed ensembles.
-
Generalizable Deepfake Detection Based on Forgery-aware Layer Masking and Multi-artifact Subspace Decomposition
FMSD improves cross-dataset generalization in deepfake detection by using gradient-based layer masking to select forgery-sensitive weights and SVD to split them into preserved semantic and multiple learnable artifact ...
-
PaCo-FR: Patch-Pixel Aligned End-to-End Codebook Learning for Facial Representation Pre-training
PaCo-FR introduces a structured-masking and patch-codebook framework for unsupervised facial representation pre-training that claims state-of-the-art results on multiple facial tasks after training on only 2 million u...
-
Towards Robust and Realistic Human Pose Estimation via WiFi Signals
DT-Pose reformulates WiFi HPE as domain-consistent representation learning via temporal contrastive masked pretraining plus hybrid topology-constrained decoding to yield more accurate and realistic 2D/3D poses.
-
Disentangled Generative Graph Representation Learning
DiGGR introduces a self-supervised graph representation learning framework that disentangles latent factors to guide mask modeling and improve representation quality on graph tasks.
-
FruitEnsemble: MLLM-Guided Arbitration for Heterogeneous ensemble in Fine-Grained Fruit Recognition
FruitEnsemble uses a weighted ensemble of backbones for top-3 candidates followed by MLLM arbitration on low-confidence samples to reach 70.49% accuracy on a new 306-class fruit dataset.
-
NTIRE 2026 Challenge on Robust AI-Generated Image Detection in the Wild
The NTIRE 2026 challenge provides a dataset of over 294,000 real and AI-generated images with 36 transformations to benchmark robust detection models.
-
Frabjous: Deep Learning Fast Radio Burst Morphologies
Frabjous applies deep learning to classify FRB morphologies into five classes at 55% accuracy by augmenting limited real data with simulations.
-
Neural Network Optimization Reimagined: Decoupled Techniques for Scratch and Fine-Tuning
DualOpt decouples optimization by using real-time layer-wise weight decay for scratch training and weight rollback for fine-tuning to improve convergence, generalization, and reduce knowledge forgetting.
-
Document Parsing Unveiled: Techniques, Challenges, and Prospects for Structured Information Extraction
Survey proposing a taxonomy for document parsing into pipeline-based systems and VLM-driven unified models, reviewing components, metrics, benchmarks, and challenges.
Reference graph
Works this paper leans on
-
[1]
UniLMv2: Pseudo-masked language models for unified language model pre-training
[BDW+20] Hangbo Bao, Li Dong, Furu Wei, Wenhui Wang, Nan Yang, Xiaodong Liu, Yu Wang, Jianfeng Gao, Songhao Piao, Ming Zhou, and Hsiao-Wuen Hon. UniLMv2: Pseudo-masked language models for unified language model pre-training. In Proceedings of the 37th International Conference on Machine Learning, ICML 2020, volume 119 of Proceedings of Machine Learning R...
-
[2]
Improved Baselines with Momentum Contrastive Learning
[CFGH20] Xinlei Chen, Haoqi Fan, Ross Girshick, and Kaiming He. Improved baselines with momentum contrastive learning. preprint arXiv:2003.04297,
-
[3]
Exploring simple siamese representation learning, 2020
[CH20] Xinlei Chen and Kaiming He. Exploring simple siamese representation learning. preprint arXiv:2011.10566,
-
[4]
A Simple Framework for Contrastive Learning of Visual Representations
[CKNH20] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. preprint arXiv:2002.05709,
-
[5]
Emerging Properties in Self-Supervised Vision Transformers
[CTM+21] Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. arXiv preprint arXiv:2104.14294,
-
[6]
An empirical study of training self-supervised vision transformers
[CXH21] Xinlei Chen, Saining Xie, and Kaiming He. An empirical study of training self-supervised vision transformers. ArXiv, abs/2104.02057,
-
[7]
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
[DBK+20] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. preprint arXiv:2010.11929,
-
[8]
BERT: pre-training of deep bidirectional transformers for language understanding
[DCLT19] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 4171–4186. Association for Computational ...
-
[9]
Self-attention attribution: Interpreting information interactions inside Transformer
[HDWX20] Yaru Hao, Li Dong, Furu Wei, and Ke Xu. Self-attention attribution: Interpreting information interactions inside Transformer. arXiv preprint arXiv:2004.11207,
-
[10]
[HSL+16] Gao Huang, Yu Sun, Zhuang Liu, Daniel Sedra, and Kilian Q. Weinberger. Deep networks with stochastic depth. In Bastian Leibe, Jiri Matas, Nicu Sebe, and Max Welling, editors, Computer Vision – ECCV 2016, pages 646–661, Cham,
-
[11]
Categorical reparameterization with gumbel-softmax
[JGP17] Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization with gumbel-softmax. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings. OpenReview.net,
-
[12]
[KW14] Diederik P. Kingma and Max Welling. Auto-Encoding Variational Bayes. In 2nd International Conference on Learning Representations, ICLR 2014,
-
[13]
Swin Transformer: Hierarchical Vision Transformer using Shifted Windows
[LLC+21] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin Transformer: Hierarchical vision transformer using shifted windows. arXiv preprint arXiv:2103.14030,
-
[14]
Representation Learning with Contrastive Predictive Coding
[OLV18] Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. preprint arXiv:1807.03748,
-
[15]
Zero-Shot Text-to-Image Generation
[RPG+21] A. Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-shot text-to-image generation. ArXiv, abs/2102.12092,
-
[16]
Training data-efficient image transformers & distillation through attention
[TCD+20] Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Hervé Jégou. Training data-efficient image transformers & distillation through attention. preprint arXiv:2012.12877,
-
[17]
Going deeper with image transformers
[TCS+21] Hugo Touvron, Matthieu Cord, Alexandre Sablayrolles, Gabriel Synnaeve, and Hervé Jégou. Going deeper with image transformers. arXiv preprint arXiv:2103.17239,
-
[18]
Selfie: Self-supervised pretraining for image embedding
[TLL19] Trieu H Trinh, Minh-Thang Luong, and Quoc V Le. Selfie: Self-supervised pretraining for image embedding. arXiv preprint arXiv:1906.02940,
-
[19]
Attention is all you need
[VSP+17] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Isabelle Guyon, Ulrike von Luxburg, Samy Bengio, Hanna M. Wallach, Rob Fergus, S. V. N. Vishwanathan, and Roman Garnett, editors, Advances in Neural Information Processi...
-
[20]
Self-supervised learning with swin transformers
[XLY+21] Zhenda Xie, Yutong Lin, Zhuliang Yao, Zheng Zhang, Qi Dai, Yue Cao, and Han Hu. Self-supervised learning with swin transformers. arXiv preprint arXiv:2105.04553,
-
[21]
Scaling vision transformers
[ZKHB21] Xiaohua Zhai, Alexander Kolesnikov, Neil Houlsby, and Lucas Beyer. Scaling vision transformers. arXiv preprint arXiv:2106.04560,
- [22]
-
[23]
The results, unless otherwise indicated, are all obtained by base-size models. *: result is taken from [CXH21].
G Hyperparameters for Pre-Training
Hyperparameters | Base Size | Large Size
Layers | 12 | 24
Hidden size | 768 | 1024
FFN inner hidden size | 3072 | 4096
Attention heads | 12 | 16
Attention head size | 64
Patch size | 16 × 16
Training epochs | 800
Batch size | 2048
Adam ϵ | 1...
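The architecture rows of the hyperparameter listing can be sanity-checked with a little arithmetic. The sketch below is an illustration only: `EncoderConfig` and `approx_block_params` are hypothetical names, and the parameter estimate counts just the attention and FFN weight matrices, ignoring biases, layer norms, and embeddings.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EncoderConfig:
    layers: int
    hidden_size: int
    ffn_size: int
    heads: int

    @property
    def head_size(self) -> int:
        # Per-head width is the hidden size split evenly across heads.
        return self.hidden_size // self.heads

    def approx_block_params(self) -> int:
        # Weight matrices only: Q, K, V, and output projections,
        # plus the two FFN matrices, summed over all layers.
        attention = 4 * self.hidden_size ** 2
        ffn = 2 * self.hidden_size * self.ffn_size
        return self.layers * (attention + ffn)


# Values copied from the listed Base Size and Large Size columns.
BASE = EncoderConfig(layers=12, hidden_size=768, ffn_size=3072, heads=12)
LARGE = EncoderConfig(layers=24, hidden_size=1024, ffn_size=4096, heads=16)
```

Both configurations give a head size of 64, consistent with the single "Attention head size" value in the listing. Under the stated simplification, the transformer blocks hold about 85M weights for the base model and about 302M for the large model.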