arxiv: 2111.07832 · v3 · submitted 2021-11-15 · 💻 cs.CV

Recognition: 3 theorem links

· Lean Theorem

iBOT: Image BERT Pre-Training with Online Tokenizer

Jinghao Zhou , Chen Wei , Huiyu Wang , Wei Shen , Cihang Xie , Alan Yuille , Tao Kong

Authors on Pith no claims yet

Pith reviewed 2026-05-14 02:05 UTC · model grok-4.3

classification 💻 cs.CV

keywords self-supervised learningmasked image modelingonline tokenizerself-distillationvision transformersImageNet pre-trainingdense prediction tasks

0 comments

The pith

iBOT uses a jointly learned online tokenizer for masked image modeling to reach 82.3 percent linear probing accuracy on ImageNet-1K.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents iBOT as a self-supervised pre-training method that adapts masked language modeling ideas to images by predicting masked patches with semantically meaningful tokens. It performs self-distillation where the teacher network serves as an online tokenizer that generates tokens during training, while also distilling class tokens to capture global semantics. This setup avoids a separate pre-training stage for the tokenizer and produces representations with strong linear probing and fine-tuning results on ImageNet plus gains on dense tasks like detection and segmentation. The approach also yields local semantic patterns that improve robustness to image corruptions.

Core claim

iBOT performs masked prediction on image patches by treating the teacher network as an online tokenizer through self-distillation on masked patch tokens, combined with self-distillation on the class token to acquire visual semantics. The tokenizer is learned jointly with the masked image modeling objective, removing the need for a multi-stage pipeline that pre-trains the tokenizer beforehand.

What carries the argument

The online tokenizer, obtained from the teacher network via self-distillation, that supplies semantically meaningful visual tokens for the masked prediction task.

If this is right

The representations support 82.3 percent linear probing accuracy and 87.8 percent fine-tuning accuracy on ImageNet-1K.
Models obtain leading results on object detection, instance segmentation, and semantic segmentation.
Local semantic patterns improve robustness to common image corruptions.
The single-stage training pipeline simplifies self-supervised pre-training of vision transformers.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Joint online tokenization may reduce the engineering overhead when adapting masked modeling to new image domains or modalities.
The emergence of local semantics could be tested by measuring how well the learned tokens align with human-annotated object parts.
Scaling the same self-distillation recipe to video or multi-view data might yield analogous gains in temporal or 3D tasks.

Load-bearing premise

Self-distillation with an online tokenizer produces semantically meaningful visual tokens without any prior pre-training of a separate tokenizer.

What would settle it

Train an ablation of iBOT that replaces the online tokenizer with fixed random tokens or a frozen pre-trained tokenizer and check whether linear probing accuracy on ImageNet-1K falls substantially below 82.3 percent.

read the original abstract

The success of language Transformers is primarily attributed to the pretext task of masked language modeling (MLM), where texts are first tokenized into semantically meaningful pieces. In this work, we study masked image modeling (MIM) and indicate the advantages and challenges of using a semantically meaningful visual tokenizer. We present a self-supervised framework iBOT that can perform masked prediction with an online tokenizer. Specifically, we perform self-distillation on masked patch tokens and take the teacher network as the online tokenizer, along with self-distillation on the class token to acquire visual semantics. The online tokenizer is jointly learnable with the MIM objective and dispenses with a multi-stage training pipeline where the tokenizer needs to be pre-trained beforehand. We show the prominence of iBOT by achieving an 82.3% linear probing accuracy and an 87.8% fine-tuning accuracy evaluated on ImageNet-1K. Beyond the state-of-the-art image classification results, we underline emerging local semantic patterns, which helps the models to obtain strong robustness against common corruptions and achieve leading results on dense downstream tasks, eg., object detection, instance segmentation, and semantic segmentation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

iBOT gets competitive ImageNet numbers by training the visual tokenizer jointly through self-distillation on masked patches instead of a separate pre-training stage.

read the letter

iBOT's main contribution is showing that you can learn the tokenizer on the fly with the main model. They run self-distillation on masked patch tokens so the teacher network doubles as the online tokenizer, plus class-token distillation for semantics. This removes the usual multi-stage pipeline where the tokenizer gets pre-trained first on its own data or objective. The reported numbers are 82.3% linear probing and 87.8% fine-tuning on ImageNet-1K, with additional gains on robustness to corruptions and on dense tasks like detection and segmentation. Those downstream results are the part that feels most useful, because they suggest the learned tokens carry local semantic structure rather than just low-level cues. The approach sits cleanly on top of prior MIM work like BEiT, and the citation pattern looks appropriate. The soft spot is the stability of the patch-token branch. Self-distillation on masked patches can collapse or produce low-entropy assignments unless the right centering, sharpening, or stop-gradient tricks are balanced correctly, and the abstract gives no specifics on those choices for the tokenizer head. If the full paper has ablations that rule out trivial solutions and show diverse token usage, the concern shrinks; otherwise it stays the main thing to check. The experimental setup follows standard ImageNet protocols, so the numbers are at least comparable to recent baselines. This paper is aimed at people already working on masked image modeling and self-supervised ViTs. A reader who wants a simpler training recipe for visual tokens will find the joint-learning angle practical. It deserves a serious referee because the empirical results are competitive and the simplification is concrete enough to test.

Referee Report

2 major / 2 minor

Summary. The paper introduces iBOT, a self-supervised framework for masked image modeling (MIM) that employs an online tokenizer obtained via self-distillation on masked patch tokens together with class-token distillation. The tokenizer is trained jointly with the MIM objective, avoiding any separate pre-training stage. The authors report 82.3% linear-probing and 87.8% fine-tuning top-1 accuracy on ImageNet-1K, plus improved robustness to corruptions and strong results on object detection, instance segmentation, and semantic segmentation.

Significance. If the joint-training procedure indeed yields non-collapsed, semantically meaningful visual tokens, the method would remove a multi-stage pipeline that has been standard in prior MIM work, thereby simplifying self-supervised pre-training of vision transformers while maintaining or exceeding state-of-the-art accuracy on both classification and dense-prediction benchmarks.

major comments (2)

[Method] The central claim that the jointly trained teacher functions as a semantically meaningful tokenizer rests on the assumption that self-distillation on masked patches does not collapse. The manuscript does not explicitly describe the stabilizers (centering, sharpening, stop-gradient schedules, or multi-crop) applied specifically to the patch-token branch; without these details the MIM objective could reduce to trivial reconstruction. This issue is load-bearing for the headline result.
[Experiments] Table 1 and the experimental protocol section report 82.3% linear probing and 87.8% fine-tuning on ImageNet-1K, yet provide no ablation isolating the contribution of the online tokenizer versus standard self-distillation or the effect of removing any collapse-prevention terms. Such an ablation is required to substantiate that the online tokenizer is the decisive factor.

minor comments (2)

[Method] Notation for the teacher update (momentum coefficient, stop-gradient) is introduced without a compact equation; adding a single displayed equation would improve clarity.
[Experiments] The robustness and dense-task results are presented without error bars or multiple random seeds; reporting standard deviations would strengthen the claims.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will incorporate clarifications and new experiments into the revised manuscript.

read point-by-point responses

Referee: [Method] The central claim that the jointly trained teacher functions as a semantically meaningful tokenizer rests on the assumption that self-distillation on masked patches does not collapse. The manuscript does not explicitly describe the stabilizers (centering, sharpening, stop-gradient schedules, or multi-crop) applied specifically to the patch-token branch; without these details the MIM objective could reduce to trivial reconstruction. This issue is load-bearing for the headline result.

Authors: We agree that the stabilizers must be described explicitly for the patch-token branch. iBOT applies centering, sharpening, stop-gradient, and multi-crop augmentation to the masked patch-token self-distillation (following the DINO formulation but applied only to visible patches after masking). We will revise the method section to detail these components, their schedules, and how they are isolated to the patch branch, ensuring the online tokenizer remains non-collapsed and semantically meaningful. revision: yes
Referee: [Experiments] Table 1 and the experimental protocol section report 82.3% linear probing and 87.8% fine-tuning on ImageNet-1K, yet provide no ablation isolating the contribution of the online tokenizer versus standard self-distillation or the effect of removing any collapse-prevention terms. Such an ablation is required to substantiate that the online tokenizer is the decisive factor.

Authors: We acknowledge that dedicated ablations would strengthen the claims. In the revision we will add experiments comparing (i) full iBOT against a standard self-distillation baseline without the online tokenizer, and (ii) variants that disable individual collapse-prevention terms (centering, sharpening). These results will be reported alongside the existing numbers to isolate the tokenizer's contribution. revision: yes

Circularity Check

0 steps flagged

No significant circularity in iBOT's empirical self-supervised framework

full rationale

The paper presents iBOT as a joint-training method for masked image modeling with an online tokenizer obtained via self-distillation on patch and class tokens. All central claims (82.3% linear probing, 87.8% fine-tuning on ImageNet-1K, robustness and dense-task gains) are supported by experimental results rather than any mathematical derivation chain. No load-bearing step reduces a prediction to a fitted input by construction, invokes a self-citation uniqueness theorem, or renames a known result. The online tokenizer is defined procedurally and validated empirically; its non-collapse is an empirical question addressed by the reported ablations, not assumed by definition. This is the normal non-circular outcome for an empirical method paper.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The central claim relies on the effectiveness of the proposed online tokenizer mechanism, which is an assumption from the domain of self-supervised learning.

free parameters (1)

model hyperparameters
Training involves many hyperparameters not detailed in abstract.

axioms (1)

domain assumption Masked prediction with self-distillation leads to good visual representations
Fundamental to the iBOT framework.

pith-pipeline@v0.9.0 · 5511 in / 1146 out tokens · 54941 ms · 2026-05-14T02:05:18.808809+00:00 · methodology

discussion (0)

Forward citations

Cited by 21 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

A satellite foundation model for improved wealth monitoring
cs.CY 2026-04 unverdicted novelty 7.0

Tempov is a self-supervised satellite foundation model that predicts wealth levels and decadal changes at high resolution across Africa from Landsat imagery, outperforming baselines even with limited labels and genera...
OVS-DINO: Open-Vocabulary Segmentation via Structure-Aligned SAM-DINO with Language Guidance
cs.CV 2026-04 unverdicted novelty 7.0

OVS-DINO structurally aligns DINO with SAM to revitalize attenuated boundary features, achieving SOTA gains of 2.1% average and 6.3% on Cityscapes in weakly-supervised open-vocabulary segmentation.
What Matters for Diffusion-Friendly Latent Manifold? Prior-Aligned Autoencoders for Latent Diffusion
cs.CV 2026-05 unverdicted novelty 6.0

Prior-Aligned AutoEncoders shape latent manifolds with spatial coherence, local continuity, and global semantics to improve latent diffusion, achieving SOTA gFID 1.03 on ImageNet 256x256 with up to 13x faster convergence.
BrainDINO: A Brain MRI Foundation Model for Generalizable Clinical Representation Learning
cs.LG 2026-04 unverdicted novelty 6.0

BrainDINO delivers a single self-supervised brain MRI representation that generalizes to tumor segmentation, disease classification, brain age estimation, and other tasks without volumetric pretraining or full fine-tuning.
VFM$^{4}$SDG: Unveiling the Power of VFMs for Single-Domain Generalized Object Detection
cs.CV 2026-04 unverdicted novelty 6.0

VFM⁴SDG uses a frozen vision foundation model to inject cross-domain stability priors into both the encoding and decoding stages of object detectors, reducing missed detections in unseen environments.
Image Generators are Generalist Vision Learners
cs.CV 2026-04 unverdicted novelty 6.0

Image generation pretraining produces generalist vision models that reframe perception tasks as image synthesis and reach SOTA results on segmentation, depth estimation, and other 2D/3D tasks.
Image Generators are Generalist Vision Learners
cs.CV 2026-04 conditional novelty 6.0

Image generation pretraining builds generalist vision models that reach SOTA on 2D and 3D perception tasks by reframing them as RGB image outputs.
OFlow: Injecting Object-Aware Temporal Flow Matching for Robust Robotic Manipulation
cs.RO 2026-04 unverdicted novelty 6.0

OFlow unifies temporal foresight and object-aware reasoning inside a shared latent space via flow matching to improve VLA robustness in robotic manipulation under distribution shifts.
PolarMAE: Efficient Fetal Ultrasound Pre-training via Semantic Screening and Polar-Guided Masking
cs.CV 2026-04 unverdicted novelty 6.0

PolarMAE is a new unsupervised pre-training method for fetal ultrasound that uses progressive visual-semantic screening, acoustic-bounded constraints, and polar-texture masking to reach state-of-the-art performance on...
Towards Brain MRI Foundation Models for the Clinic: Findings from the FOMO25 Challenge
cs.CV 2026-04 conditional novelty 6.0

Self-supervised pretraining on 60K clinical-style brain MRIs improves out-of-domain generalization on classification, segmentation, and regression tasks, with hybrid objectives and small models showing strong results.
Generative Data-engine Foundation Model for Universal Few-shot 2D Vascular Image Segmentation
eess.IV 2026-04 unverdicted novelty 6.0

UniVG synthesizes diverse vascular images via compositional learning and few-shot adaptation to reach fully-supervised segmentation performance on 11 tasks across 5 modalities using only 5 labeled examples each.
Self-supervised Pretraining of Cell Segmentation Models
cs.CV 2026-04 unverdicted novelty 6.0

DINOCell achieves a SEG score of 0.784 on LIVECell by self-supervised domain adaptation of DINOv2, improving 10.42% over SAM-based models and showing strong zero-shot transfer.
Semantic Noise Reduction via Teacher-Guided Dual-Path Audio-Visual Representation Learning
cs.SD 2026-04 unverdicted novelty 6.0

TG-DP decouples reconstruction and alignment objectives into separate paths with teacher guidance on visibility patterns, yielding SOTA zero-shot audio-video retrieval gains on AudioSet.
TC-AE: Unlocking Token Capacity for Deep Compression Autoencoders
cs.CV 2026-04 unverdicted novelty 6.0

TC-AE improves reconstruction and generative performance in deep compression by decomposing token-to-latent compression into two stages and using joint self-supervised training.
Smart Transfer: Leveraging Vision Foundation Model for Rapid Building Damage Mapping with Post-Earthquake VHR Imagery
cs.CV 2026-04 unverdicted novelty 6.0

Smart Transfer adapts vision foundation models using pixel-wise clustering and distance-penalized triplet loss for rapid cross-region building damage mapping after earthquakes.
Rapidly deploying on-device eye tracking by distilling visual foundation models
cs.CV 2026-04 unverdicted novelty 6.0

DistillGaze reduces median gaze error by 58.62% on a 2000+ participant dataset by distilling foundation models into a 256K-parameter on-device model using synthetic labeled data and unlabeled real data.
Revisiting Feature Prediction for Learning Visual Representations from Video
cs.CV 2024-02 conditional novelty 6.0

V-JEPA models trained only on feature prediction from 2 million public videos achieve 81.9% on Kinetics-400, 72.2% on Something-Something-v2, and 77.9% on ImageNet-1K using frozen ViT-H/16 backbones.
Beyond ViT Tokens: Masked-Diffusion Pretrained Convolutional Pathology Foundation Model for Cell-Level Dense Prediction
cs.CV 2026-05 unverdicted novelty 5.0

A masked-diffusion pretrained convolutional model outperforms ViT pathology foundation models on cell-level dense prediction tasks in histology.
Sapiens2
cs.CV 2026-04 unverdicted novelty 5.0

Sapiens2 improves pretraining, data scale, and architecture over its predecessor to set new state-of-the-art results on human pose estimation, body-part segmentation, normal estimation, and new tasks like pointmap and...
LychSim: A Controllable and Interactive Simulation Framework for Vision Research
cs.CV 2026-05 unverdicted novelty 4.0

LychSim introduces a controllable simulation platform on Unreal Engine 5 with Python API, procedural generation, and LLM integration for vision research tasks.
iDocV2: Leveraging Self-Supervision and Open-Set Detection for Improving Pattern Spotting in Historical Documents
cs.CV 2026-04 unverdicted novelty 4.0

iDocV2 reaches 0.612 precision on small non-square pattern queries in historical documents while running 10 times faster than state-of-the-art dense-based approaches.

Reference graph

Works this paper leans on

23 extracted references · 23 canonical work pages · cited by 20 Pith papers · 2 internal anchors

[1]

Self-supervised classiﬁcation network

Elad Amrani and Alex Bronstein. Self-supervised classiﬁcation network. arXiv preprint arXiv:2103.10994,

work page arXiv
[2]

SiT: Self-supervised vision transformer

Sara Atito, Muhammad Awais, and Josef Kittler. SiT: Self-supervised vision transformer. arXiv preprint arXiv:2104.03602,

work page arXiv
[3]

BEiT: BERT Pre-Training of Image Transformers

Hangbo Bao, Li Dong, and Furu Wei. BEiT: BERT pre-training of image transformers. arXiv preprint arXiv:2106.08254,

work page internal anchor Pith review Pith/arXiv arXiv
[4]

Mask R-CNN

10 Published as a conference paper at ICLR 2022 Kaiming He, Georgia Gkioxari, Piotr Doll´ar, and Ross Girshick. Mask R-CNN. In ICCV,

work page 2022
[5]

Efﬁcient self-supervised vision transformers for representation learning

Chunyuan Li, Jianwei Yang, Pengchuan Zhang, Mei Gao, Bin Xiao, Xiyang Dai, Lu Yuan, and Jianfeng Gao. Efﬁcient self-supervised vision transformers for representation learning. arXiv preprint arXiv:2106.09785, 2021a. Zhaowen Li, Zhiyang Chen, Fan Yang, Wei Li, Yousong Zhu, Chaoyang Zhao, Rui Deng, Liwei Wu, Rui Zhao, Ming Tang, et al. MST: Masked self-supe...

work page arXiv
[6]

ICCV, 2021.https://arxiv.or g/abs/2103.14030 35 Supplementary Material S1

Xiao Liu, Fanjin Zhang, Zhenyu Hou, Li Mian, Zhaoyu Wang, Jing Zhang, and Jie Tang. Self- supervised learning: Generative or contrastive. TKDE, 2021a. Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. arXiv preprint arXiv:2103.14030, 2021b. I...

work page arXiv
[7]

Intriguing properties of vision transformers

Muzammal Naseer, Kanchana Ranasinghe, Salman Khan, Munawar Hayat, Fahad Shahbaz Khan, and Ming-Hsuan Yang. Intriguing properties of vision transformers. arXiv preprint arXiv:2105.10497,

work page arXiv
[8]

Vimpac: Video pre-training via masked token prediction and contrastive learning

11 Published as a conference paper at ICLR 2022 Hao Tan, Jie Lei, Thomas Wolf, and Mohit Bansal. Vimpac: Video pre-training via masked token prediction and contrastive learning. arXiv preprint arXiv:2106.11250,

work page arXiv 2022
[9]

Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation

Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. Google’s neural machine trans- lation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144,

work page internal anchor Pith review Pith/arXiv arXiv
[10]

Self- supervised learning with swin transformers

Zhenda Xie, Yutong Lin, Zhuliang Yao, Zheng Zhang, Qi Dai, Yue Cao, and Han Hu. Self- supervised learning with swin transformers. arXiv preprint arXiv:2105.04553, 2021a. Zhenda Xie, Yutong Lin, Zheng Zhang, Yue Cao, Stephen Lin, and Han Hu. Propagate yourself: Ex- ploring pixel-level consistency for unsupervised visual representation learning. In CVPR, 20...

work page arXiv
[11]

Self-supervised visual representations learning by contrastive mask prediction

Yucheng Zhao, Guangting Wang, Chong Luo, Wenjun Zeng, and Zheng-Jun Zha. Self-supervised visual representations learning by contrastive mask prediction. arXiv preprint arXiv:2108.07954,

work page arXiv
[12]

12 Published as a conference paper at ICLR 2022 A P SEUDOCODE Algorithm 1: iBOT PyTorch-like Pseudocode w/o multi-crop augmentation Input: gs,gt ; // student and teacher network C,C′ ; // center on [CLS] token and patch tokens τs,τ t ; // temperature on [CLS] token for student and teacher network τ′ s,τ′ t ; // temperature on patch tokens for student and ...

work page 2022
[13]

𝑥!𝑥" !𝑥!!𝑥

The most intuitive ideas are to compute as (b) or (c). In (b), MIM is only performed on global crops. This pipeline is unstable during training, and we observe a dip in the NMI training curve. We hypothesize that it can be caused by the distribution mismatch of masked global crops and non-masked local crops. To alleviate this, a straightforward solution i...

work page 2022
[14]

We observe the latter practice performs sightly better since it is more ﬂexible in task composition and data in a batch is mutually independent

is chosen, MIM is performed for both of the two global crops. We observe the latter practice performs sightly better since it is more ﬂexible in task composition and data in a batch is mutually independent. Range of Scales in Multi-Crop. We further study the performance with different local and global scale. Following DINO (Caron et al., 2021), we conduct...

work page 2021
[15]

rely heavily on multi-crop augmentation during pre-training. Except for several speciﬁc self-supervised methods (Grill et al., 2020), multi-crop works well on most of the self-supervised methods and consistently yields performance gain (Caron et al., 2021). While a more fair comparison with our methods without multi-crop augmentation can be conducted, we ...

work page 2020
[16]

We empirically ﬁnd that ﬁne-tuning protocol used in BEiT consistently yields better ﬁne-tuning results and greatly reduces the training epochs. By default, we use a layerwise decay of 0.75 with a training epoch of 200 for ViT-S/16, a layerwise decay of 0.65 with a training epoch of 100 for ViT-B/16, and a layerwise decay of 0.75 with a training epoch of 5...

work page 2022
[17]

17 Published as a conference paper at ICLR 2022 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 Information Loss (%) 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 ImageNet Val Acc@1 R

We hypothe- size that the data distribution plays a more crucial rule under evaluation protocols based on frozen features, such that models pre-trained with smaller ImageNet-1K dataset consistently achieve better results. 17 Published as a conference paper at ICLR 2022 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 Information Loss (%) 0.0 0.1 0.2 0.3 0.4 0....

work page 2022
[18]

Adding a variance upon the ﬁxed value can also consistently bring a performance gain, which can be explained as stronger data augmentation

We observe that the performance is not sensitive to variant prediction ratios between0.05 and 0.4. Adding a variance upon the ﬁxed value can also consistently bring a performance gain, which can be explained as stronger data augmentation. The teacher output of non-masked images is now pulled together with the student output of masked images with different...

work page 2022
[19]

iBOT pre-trained with 800 epochs brings a 0.9% improvement over previous state-of-the-art method

pre-trained with800 epochs in less than 100 epochs. iBOT pre-trained with 800 epochs brings a 0.9% improvement over previous state-of-the-art method. Time and Memory Requirements. BEiT is trained with a non-contrastive objective and without multi-crop augmentation, thus it consumes only a memory of 5.6G and takes 90.1h for 800 epochs. Comparing iBOT and D...

work page 2021
[20]

We ﬁnd patch clustering has slightly better performance in all three protocols compared to MPP, suggesting the beneﬁts brought by visual semantics

and patch clustering rely purely on ofﬂine statistics without the extra stage of online training. We ﬁnd patch clustering has slightly better performance in all three protocols compared to MPP, suggesting the beneﬁts brought by visual semantics. While BEiT has poor k-NN and linear probing accuracy, a good ﬁne-tuning result also suggests relatively low req...

work page 2021
[21]

For DINO, we directly use the projection head for [CLS] token and generate a 65536-d probability distribution for each patch token

For BEiT, the DALL-E encoder generates a discrete number for each patch token. For DINO, we directly use the projection head for [CLS] token and generate a 65536-d probability distribution for each patch token. The index with the highest probability is assigned for the token. Pattern Layout for [CLS] Token. We here also provide additional visualization of...

work page 2020
[22]

G.3 S PARSE CORRESPONDENCE . We consider a sparse correspondence task where the overlapped patches from two augmented views of one image, or patches from two images labeled as one class, are required to be matched. The correlation is sparse since at most 14× 14 matched pairs can be extracted with a ViT-S/16 model. We visualize 12 correspondences with the ...

work page 2022
[23]

In the second column, iBOT can match different parts of two instances from the same class (e.g., tiles and windows of two cars) despite their huge differences in texture or color

We observe empirically that iBOT perform well for two views drawn from one image, nearly matched the majority of correspondence correctly. In the second column, iBOT can match different parts of two instances from the same class (e.g., tiles and windows of two cars) despite their huge differences in texture or color. We observe the DINO also has comparabl...

work page 2022