Masked autoencoders are scalable vision learners

Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, Ross Girshick · 2022

4 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

InstAP: Instance-Aware Vision-Language Pre-Train for Spatial-Temporal Understanding

cs.CV · 2026-04-09 · unverdicted · novelty 7.0

InstAP introduces instance-aware pre-training with a new dual-granularity dataset InstVL that improves both fine-grained instance retrieval and global video understanding over standard VLP baselines.

UniRefiner: Teaching Pre-trained ViTs to Self-Dispose Dross via Contrastive Register

cs.CV · 2026-05-19 · unverdicted · novelty 6.0

UniRefiner uses contrastive registers and a dual alignment objective to remove three categories of spurious tokens from pre-trained ViTs, yielding up to 9.4% mIoU gains on ADE20K and 22% zero-shot segmentation improvements.

GlowGS: Generative Semantic Feature Learning for 3D Gaussian Splatting in Nighttime Glow Scenes

cs.CV · 2026-05-22 · unverdicted · novelty 5.0

GlowGS improves 3D Gaussian Splatting in nighttime glow scenes via semantic feature generation from diffusion models and novel-view semantic learning with vision foundation models.

Sparsity-Aware Voxel Attention and Foreground Modulation for 3D Semantic Scene Completion

cs.CV · 2026-04-07 · unverdicted · novelty 5.0

VoxSAMNet introduces sparsity-aware deformable attention via a dummy node and foreground modulation with dropout plus text-guided filtering to reach new state-of-the-art mIoU of 18.2% on SemanticKITTI and 20.2% on SSCBench-KITTI-360 for monocular 3D scene completion.

citing papers explorer

Showing 4 of 4 citing papers.

InstAP: Instance-Aware Vision-Language Pre-Train for Spatial-Temporal Understanding · cs.CV · 2026-04-09 · unverdicted · none · ref 20
UniRefiner: Teaching Pre-trained ViTs to Self-Dispose Dross via Contrastive Register · cs.CV · 2026-05-19 · unverdicted · none · ref 13
GlowGS: Generative Semantic Feature Learning for 3D Gaussian Splatting in Nighttime Glow Scenes · cs.CV · 2026-05-22 · unverdicted · none · ref 18
Sparsity-Aware Voxel Attention and Foreground Modulation for 3D Semantic Scene Completion · cs.CV · 2026-04-07 · unverdicted · none · ref 8
