A Unified View of Masked Image Modeling

Furu Wei; Hangbo Bao; Li Dong; Qixiang Ye; Zhiliang Peng

arxiv: 2210.10615 · v1 · pith:NM6F4SRHnew · submitted 2022-10-19 · 💻 cs.CV

Zhiliang Peng, Li Dong, Hangbo Bao, Qixiang Ye, Furu Wei

keywords: image, masked, maskdistill, modeling, semantic, unified, view, methods

Masked image modeling has demonstrated great potential to eliminate the label-hungry problem of training large-scale vision Transformers, achieving impressive performance on various downstream tasks. In this work, we propose a unified view of masked image modeling after revisiting existing methods. Under the unified view, we introduce a simple yet effective method, termed MaskDistill, which reconstructs normalized semantic features from teacher models at the masked positions, conditioning on corrupted input images. Experimental results on image classification and semantic segmentation show that MaskDistill achieves performance comparable or superior to state-of-the-art methods. When using the huge vision Transformer and pretraining for 300 epochs, MaskDistill obtains 88.3% fine-tuning top-1 accuracy on ImageNet-1k (224 size) and 58.8% semantic segmentation mIoU on ADE20k (512 size). The code and pretrained models will be available at https://aka.ms/unimim.
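The objective described in the abstract (predict normalized teacher features at masked positions only) can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not say which normalization or loss is used, so the per-patch standardization in `normalize` and the mean squared error in `masked_distill_loss` are assumptions, and both function names are hypothetical. The teacher and student networks are replaced by plain lists of per-patch feature vectors.

```python
import math
import random


def normalize(feature, eps=1e-6):
    """Standardize one patch feature to zero mean and unit variance.

    Assumption: the abstract says only "normalized"; this choice is illustrative.
    """
    mean = sum(feature) / len(feature)
    var = sum((x - mean) ** 2 for x in feature) / len(feature)
    return [(x - mean) / math.sqrt(var + eps) for x in feature]


def masked_distill_loss(student_pred, teacher_feat, mask):
    """Mean squared error (an assumed loss) between student predictions and
    normalized teacher features, averaged over masked patches only.

    student_pred, teacher_feat: lists of per-patch feature vectors, same shape.
    mask: list of bools, True where the patch was masked in the student's input.
    """
    total, count = 0.0, 0
    for pred, target, is_masked in zip(student_pred, teacher_feat, mask):
        if not is_masked:
            continue  # visible patches contribute nothing to the loss
        target = normalize(target)
        total += sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)
        count += 1
    return total / count if count else 0.0


# Toy usage: 4 patches with 3-dimensional features; patches 1 and 3 are masked.
random.seed(0)
teacher = [[random.gauss(0, 1) for _ in range(3)] for _ in range(4)]
mask = [False, True, False, True]

# A student that reproduces the normalized teacher features exactly.
perfect_student = [normalize(f) for f in teacher]
loss = masked_distill_loss(perfect_student, teacher, mask)
```

In this toy setup the loss depends only on the student's outputs at masked patches, so changing a prediction at a visible patch leaves it unchanged. In the setting the abstract describes, the student would see the corrupted image while the teacher features serve as targets; that data flow is not modeled here.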

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

ExPLoRe: Expert Patch-Level Loss Routing for Multi-Objective Masked Image Modeling
cs.CV 2026-06 unverdicted novelty 6.0

ExPLoRe turns MoE dispatch weights into per-patch loss coefficients for multi-objective masked image modeling, reporting gains on ImageNet-1K and ADE20K transfer.

Frabjous: Deep Learning Fast Radio Burst Morphologies
astro-ph.IM 2025-07 unverdicted novelty 4.0

Frabjous applies deep learning to classify FRB morphologies into five classes at 55% accuracy by augmenting limited real data with simulations.