iBOT: Image BERT Pre-Training with Online Tokenizer

Jinghao Zhou, Chen Wei, Huiyu Wang, Wei Shen, Cihang Xie, Alan Yuille · 2021 · cs.CV · arXiv 2111.07832

27 Pith papers cite this work. Polarity classification is still indexing.

abstract

The success of language Transformers is primarily attributed to the pretext task of masked language modeling (MLM), where texts are first tokenized into semantically meaningful pieces. In this work, we study masked image modeling (MIM) and indicate the advantages and challenges of using a semantically meaningful visual tokenizer. We present a self-supervised framework iBOT that can perform masked prediction with an online tokenizer. Specifically, we perform self-distillation on masked patch tokens and take the teacher network as the online tokenizer, along with self-distillation on the class token to acquire visual semantics. The online tokenizer is jointly learnable with the MIM objective and dispenses with a multi-stage training pipeline where the tokenizer needs to be pre-trained beforehand. We show the prominence of iBOT by achieving an 82.3% linear probing accuracy and an 87.8% fine-tuning accuracy evaluated on ImageNet-1K. Beyond the state-of-the-art image classification results, we underline emerging local semantic patterns, which help the models to obtain strong robustness against common corruptions and achieve leading results on dense downstream tasks, e.g., object detection, instance segmentation, and semantic segmentation.
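
The mechanism the abstract describes (a teacher network serving as an online tokenizer for masked patch tokens, plus self-distillation on the class token) can be sketched in a few lines. What follows is a minimal, framework-free Python illustration under stated assumptions, not the authors' implementation: the function names, the temperature values, the momentum value, and the exponential-moving-average teacher update are hypothetical choices borrowed from common self-distillation practice and are not given in the abstract. The networks themselves and any centering of teacher outputs are left out.

```python
import math


def softmax(logits, temperature):
    """Temperature-scaled softmax over one token's logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(target_probs, pred_probs, eps=1e-12):
    """Cross-entropy H(target, pred) between two discrete distributions."""
    return -sum(t * math.log(p + eps) for t, p in zip(target_probs, pred_probs))


def masked_patch_distillation_loss(teacher_patch_logits, student_patch_logits,
                                   mask, teacher_temp=0.04, student_temp=0.1):
    """Average distillation loss over masked patch positions only.

    The teacher sees the unmasked image and its soft outputs act as the
    "tokens" the student must predict at masked positions.
    Temperatures are assumed values, not taken from the abstract.
    """
    losses = []
    for t_logits, s_logits, is_masked in zip(
            teacher_patch_logits, student_patch_logits, mask):
        if not is_masked:
            continue  # visible patches contribute nothing to this term
        target = softmax(t_logits, teacher_temp)
        pred = softmax(s_logits, student_temp)
        losses.append(cross_entropy(target, pred))
    return sum(losses) / len(losses) if losses else 0.0


def class_token_distillation_loss(teacher_cls_logits, student_cls_logits,
                                  teacher_temp=0.04, student_temp=0.1):
    """Self-distillation on the class token (same form, single token)."""
    target = softmax(teacher_cls_logits, teacher_temp)
    pred = softmax(student_cls_logits, student_temp)
    return cross_entropy(target, pred)


def ema_update(teacher_params, student_params, momentum=0.996):
    """Assumed teacher update: exponential moving average of the student."""
    return [momentum * t + (1.0 - momentum) * s
            for t, s in zip(teacher_params, student_params)]


if __name__ == "__main__":
    # Toy example: 3 patches, 4-way output space, patches 0 and 2 masked.
    teacher_patches = [[2.0, 0.1, -1.0, 0.0], [0.0, 1.5, 0.2, -0.3], [0.3, 0.0, 2.2, -1.0]]
    student_patches = [[1.0, 0.2, -0.5, 0.1], [0.1, 1.0, 0.0, 0.0], [0.0, 0.1, 1.0, -0.2]]
    mask = [True, False, True]
    total = (masked_patch_distillation_loss(teacher_patches, student_patches, mask)
             + class_token_distillation_loss([1.0, 0.0, -1.0, 0.5], [0.8, 0.1, -0.6, 0.2]))
    print("combined loss:", total)
```

Because the targets come from the teacher's own outputs rather than from a separately pre-trained tokenizer, a single training loop of this shape is enough, which is the point the abstract makes about avoiding a multi-stage pipeline.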

citation-role summary

background 3

citation-polarity summary

background 3

representative citing papers

CanViT: Toward Active-Vision Foundation Models

cs.CV · 2026-03-23 · conditional · novelty 8.0

CanViT is the first task- and policy-agnostic AVFM pretrained via passive-to-active dense latent distillation on 13.2M scenes and 1B random glimpses, achieving 38.5% ADE20K mIoU in one glimpse and 84.5% ImageNet-1k top-1 after fine-tuning.

A satellite foundation model for improved wealth monitoring

cs.CY · 2026-04-25 · unverdicted · novelty 7.0

Tempov is a self-supervised satellite foundation model that predicts wealth levels and decadal changes at high resolution across Africa from Landsat imagery, outperforming baselines even with limited labels and generalizing temporally.

VFM$^{4}$SDG: Unveiling the Power of VFMs for Single-Domain Generalized Object Detection

cs.CV · 2026-04-23 · unverdicted · novelty 7.0 · 2 refs

VFM4SDG is a dual-prior framework that distills cross-domain stable relations from VFMs into DETR encoders and injects semantic-contextual priors into decoder queries to reduce missed detections in single-domain generalized object detection.

OVS-DINO: Open-Vocabulary Segmentation via Structure-Aligned SAM-DINO with Language Guidance

cs.CV · 2026-04-09 · unverdicted · novelty 7.0

OVS-DINO structurally aligns DINO with SAM to revitalize attenuated boundary features, achieving SOTA gains of 2.1% average and 6.3% on Cityscapes in weakly-supervised open-vocabulary segmentation.

DualTrack: Sensorless 3D Ultrasound needs Local and Global Context

cs.CV · 2025-09-11 · unverdicted · novelty 7.0

DualTrack uses decoupled local spatiotemporal and global anatomical encoders with a fusion module to estimate probe trajectories from 2D ultrasound sequences, achieving sub-5 mm average reconstruction error on public benchmarks.

UniRefiner: Teaching Pre-trained ViTs to Self-Dispose Dross via Contrastive Register

cs.CV · 2026-05-19 · unverdicted · novelty 6.0

UniRefiner uses contrastive registers and a dual alignment objective to remove three categories of spurious tokens from pre-trained ViTs, yielding up to 9.4% mIoU gains on ADE20K and 22% zero-shot segmentation improvements.

What Matters for Diffusion-Friendly Latent Manifold? Prior-Aligned Autoencoders for Latent Diffusion

cs.CV · 2026-05-08 · unverdicted · novelty 6.0

Prior-Aligned AutoEncoders shape latent manifolds with spatial coherence, local continuity, and global semantics to improve latent diffusion, achieving SOTA gFID 1.03 on ImageNet 256x256 with up to 13x faster convergence.

Image Generators are Generalist Vision Learners

cs.CV · 2026-04-22 · conditional · novelty 6.0 · 2 refs

Image generation pretraining builds generalist vision models that reach SOTA on 2D and 3D perception tasks by reframing them as RGB image outputs.

OFlow: Injecting Object-Aware Temporal Flow Matching for Robust Robotic Manipulation

cs.RO · 2026-04-20 · unverdicted · novelty 6.0

OFlow unifies temporal foresight and object-aware reasoning inside a shared latent space via flow matching to improve VLA robustness in robotic manipulation under distribution shifts.

PolarMAE: Efficient Fetal Ultrasound Pre-training via Semantic Screening and Polar-Guided Masking

cs.CV · 2026-04-17 · unverdicted · novelty 6.0

PolarMAE is a new unsupervised pre-training method for fetal ultrasound that uses progressive visual-semantic screening, acoustic-bounded constraints, and polar-texture masking to reach state-of-the-art performance on downstream interpretation tasks.

Generative Data-engine Foundation Model for Universal Few-shot 2D Vascular Image Segmentation

eess.IV · 2026-04-12 · unverdicted · novelty 6.0

UniVG synthesizes diverse vascular images via compositional learning and few-shot adaptation to reach fully-supervised segmentation performance on 11 tasks across 5 modalities using only 5 labeled examples each.

Self-supervised Pretraining of Cell Segmentation Models

cs.CV · 2026-04-12 · unverdicted · novelty 6.0

DINOCell achieves a SEG score of 0.784 on LIVECell by self-supervised domain adaptation of DINOv2, improving 10.42% over SAM-based models and showing strong zero-shot transfer.

Semantic Noise Reduction via Teacher-Guided Dual-Path Audio-Visual Representation Learning

cs.SD · 2026-04-09 · unverdicted · novelty 6.0

TG-DP decouples reconstruction and alignment objectives into separate paths with teacher guidance on visibility patterns, yielding SOTA zero-shot audio-video retrieval gains on AudioSet.

TC-AE: Unlocking Token Capacity for Deep Compression Autoencoders

cs.CV · 2026-04-08 · unverdicted · novelty 6.0

TC-AE improves reconstruction and generative performance in deep compression by decomposing token-to-latent compression into two stages and using joint self-supervised training.

Smart Transfer: Leveraging Vision Foundation Model for Rapid Building Damage Mapping with Post-Earthquake VHR Imagery

cs.CV · 2026-04-03 · unverdicted · novelty 6.0

Smart Transfer adapts vision foundation models using pixel-wise clustering and distance-penalized triplet loss for rapid cross-region building damage mapping after earthquakes.

Rapidly deploying on-device eye tracking by distilling visual foundation models

cs.CV · 2026-04-02 · unverdicted · novelty 6.0

DistillGaze reduces median gaze error by 58.62% on a 2000+ participant dataset by distilling foundation models into a 256K-parameter on-device model using synthetic labeled data and unlabeled real data.

MePo: Meta Post-Refinement for Rehearsal-Free General Continual Learning

cs.AI · 2026-02-08 · unverdicted · novelty 6.0

MePo refines pretrained backbones via meta-learning on constructed pseudo tasks and initializes a meta covariance matrix to enable robust second-order alignment, yielding 12-15% gains on CIFAR-100, ImageNet-R and CUB-200 in rehearsal-free GCL settings.

Contrastive Heliophysical Image Pretraining for Solar Dynamics Observatory Records

cs.CV · 2025-11-28 · unverdicted · novelty 6.0

SolarCHIP contrastively pretrains CNN and Vision Transformer backbones on SDO AIA-HMI data with multi-granularity objectives, achieving SOTA on cross-modal translation and flare classification especially in low-resource settings.

UNIV: Unified Foundation Model for Infrared and Visible Modalities

cs.CV · 2025-09-19 · unverdicted · novelty 6.0

UNIV introduces Patch Cross-modal Contrastive Learning (PCCL) to build a unified semantic feature space for infrared and visible modalities, supported by the new MVIP dataset of 98,992 aligned pairs, with reported gains on infrared segmentation and detection tasks.

Revisiting Feature Prediction for Learning Visual Representations from Video

cs.CV · 2024-02-15 · conditional · novelty 6.0

V-JEPA models trained only on feature prediction from 2 million public videos achieve 81.9% on Kinetics-400, 72.2% on Something-Something-v2, and 77.9% on ImageNet-1K using frozen ViT-H/16 backbones.

Universal CT Representations from Anatomy to Disease Phenotype through Agglomerative Pretraining

cs.CV · 2026-05-21 · unverdicted · novelty 5.0 · 2 refs

FlexiCT provides CT foundation models via agglomerative pretraining on 266227 volumes from 56 datasets that match or exceed task-specific models on five task families while organizing embeddings along tumor-stage gradients.

Beyond ViT Tokens: Masked-Diffusion Pretrained Convolutional Pathology Foundation Model for Cell-Level Dense Prediction

cs.CV · 2026-05-08 · unverdicted · novelty 5.0

A masked-diffusion pretrained convolutional model outperforms ViT pathology foundation models on cell-level dense prediction tasks in histology.

Sapiens2

cs.CV · 2026-04-23 · unverdicted · novelty 5.0

Sapiens2 improves pretraining, data scale, and architecture over its predecessor to set new state-of-the-art results on human pose estimation, body-part segmentation, normal estimation, and new tasks like pointmap and albedo estimation.

Towards Brain MRI Foundation Models for the Clinic: Findings from the FOMO25 Challenge

cs.CV · 2026-04-13 · accept · novelty 5.0 · 2 refs

Self-supervised pretraining on large unlabeled clinical brain MRI data improves generalization to out-of-domain clinical tasks over supervised in-domain training, with task-specific optimal objectives and limited benefits from model scaling.

citing papers explorer

Showing 27 of 27 citing papers.

CanViT: Toward Active-Vision Foundation Models

cs.CV · 2026-03-23 · conditional · none · ref 60 · internal anchor

CanViT is the first task- and policy-agnostic AVFM pretrained via passive-to-active dense latent distillation on 13.2M scenes and 1B random glimpses, achieving 38.5% ADE20K mIoU in one glimpse and 84.5% ImageNet-1k top-1 after fine-tuning.

A satellite foundation model for improved wealth monitoring

cs.CY · 2026-04-25 · unverdicted · none · ref 41 · internal anchor

Tempov is a self-supervised satellite foundation model that predicts wealth levels and decadal changes at high resolution across Africa from Landsat imagery, outperforming baselines even with limited labels and generalizing temporally.

VFM$^{4}$SDG: Unveiling the Power of VFMs for Single-Domain Generalized Object Detection

cs.CV · 2026-04-23 · unverdicted · none · ref 44 · 2 links · internal anchor

VFM4SDG is a dual-prior framework that distills cross-domain stable relations from VFMs into DETR encoders and injects semantic-contextual priors into decoder queries to reduce missed detections in single-domain generalized object detection.

OVS-DINO: Open-Vocabulary Segmentation via Structure-Aligned SAM-DINO with Language Guidance

cs.CV · 2026-04-09 · unverdicted · none · ref 59 · internal anchor

OVS-DINO structurally aligns DINO with SAM to revitalize attenuated boundary features, achieving SOTA gains of 2.1% average and 6.3% on Cityscapes in weakly-supervised open-vocabulary segmentation.

DualTrack: Sensorless 3D Ultrasound needs Local and Global Context

cs.CV · 2025-09-11 · unverdicted · none · ref 19 · internal anchor

DualTrack uses decoupled local spatiotemporal and global anatomical encoders with a fusion module to estimate probe trajectories from 2D ultrasound sequences, achieving sub-5 mm average reconstruction error on public benchmarks.

UniRefiner: Teaching Pre-trained ViTs to Self-Dispose Dross via Contrastive Register

cs.CV · 2026-05-19 · unverdicted · none · ref 44 · internal anchor

UniRefiner uses contrastive registers and a dual alignment objective to remove three categories of spurious tokens from pre-trained ViTs, yielding up to 9.4% mIoU gains on ADE20K and 22% zero-shot segmentation improvements.

What Matters for Diffusion-Friendly Latent Manifold? Prior-Aligned Autoencoders for Latent Diffusion

cs.CV · 2026-05-08 · unverdicted · none · ref 110 · internal anchor

Prior-Aligned AutoEncoders shape latent manifolds with spatial coherence, local continuity, and global semantics to improve latent diffusion, achieving SOTA gFID 1.03 on ImageNet 256x256 with up to 13x faster convergence.

Image Generators are Generalist Vision Learners

cs.CV · 2026-04-22 · conditional · none · ref 32 · 2 links · internal anchor

Image generation pretraining builds generalist vision models that reach SOTA on 2D and 3D perception tasks by reframing them as RGB image outputs.

OFlow: Injecting Object-Aware Temporal Flow Matching for Robust Robotic Manipulation

cs.RO · 2026-04-20 · unverdicted · none · ref 65 · internal anchor

OFlow unifies temporal foresight and object-aware reasoning inside a shared latent space via flow matching to improve VLA robustness in robotic manipulation under distribution shifts.

PolarMAE: Efficient Fetal Ultrasound Pre-training via Semantic Screening and Polar-Guided Masking

cs.CV · 2026-04-17 · unverdicted · none · ref 50 · internal anchor

PolarMAE is a new unsupervised pre-training method for fetal ultrasound that uses progressive visual-semantic screening, acoustic-bounded constraints, and polar-texture masking to reach state-of-the-art performance on downstream interpretation tasks.

Generative Data-engine Foundation Model for Universal Few-shot 2D Vascular Image Segmentation

eess.IV · 2026-04-12 · unverdicted · none · ref 3 · internal anchor

UniVG synthesizes diverse vascular images via compositional learning and few-shot adaptation to reach fully-supervised segmentation performance on 11 tasks across 5 modalities using only 5 labeled examples each.

Self-supervised Pretraining of Cell Segmentation Models

cs.CV · 2026-04-12 · unverdicted · none · ref 51 · internal anchor

DINOCell achieves a SEG score of 0.784 on LIVECell by self-supervised domain adaptation of DINOv2, improving 10.42% over SAM-based models and showing strong zero-shot transfer.

Semantic Noise Reduction via Teacher-Guided Dual-Path Audio-Visual Representation Learning

cs.SD · 2026-04-09 · unverdicted · none · ref 36 · internal anchor

TG-DP decouples reconstruction and alignment objectives into separate paths with teacher guidance on visibility patterns, yielding SOTA zero-shot audio-video retrieval gains on AudioSet.

TC-AE: Unlocking Token Capacity for Deep Compression Autoencoders

cs.CV · 2026-04-08 · unverdicted · none · ref 15 · internal anchor

TC-AE improves reconstruction and generative performance in deep compression by decomposing token-to-latent compression into two stages and using joint self-supervised training.

Smart Transfer: Leveraging Vision Foundation Model for Rapid Building Damage Mapping with Post-Earthquake VHR Imagery

cs.CV · 2026-04-03 · unverdicted · none · ref 6 · internal anchor

Smart Transfer adapts vision foundation models using pixel-wise clustering and distance-penalized triplet loss for rapid cross-region building damage mapping after earthquakes.

Rapidly deploying on-device eye tracking by distilling visual foundation models

cs.CV · 2026-04-02 · unverdicted · none · ref 39 · internal anchor

DistillGaze reduces median gaze error by 58.62% on a 2000+ participant dataset by distilling foundation models into a 256K-parameter on-device model using synthetic labeled data and unlabeled real data.

MePo: Meta Post-Refinement for Rehearsal-Free General Continual Learning

cs.AI · 2026-02-08 · unverdicted · none · ref 9 · internal anchor

MePo refines pretrained backbones via meta-learning on constructed pseudo tasks and initializes a meta covariance matrix to enable robust second-order alignment, yielding 12-15% gains on CIFAR-100, ImageNet-R and CUB-200 in rehearsal-free GCL settings.

Contrastive Heliophysical Image Pretraining for Solar Dynamics Observatory Records

cs.CV · 2025-11-28 · unverdicted · none · ref 52 · internal anchor

SolarCHIP contrastively pretrains CNN and Vision Transformer backbones on SDO AIA-HMI data with multi-granularity objectives, achieving SOTA on cross-modal translation and flare classification especially in low-resource settings.

UNIV: Unified Foundation Model for Infrared and Visible Modalities

cs.CV · 2025-09-19 · unverdicted · none · ref 45 · internal anchor

UNIV introduces Patch Cross-modal Contrastive Learning (PCCL) to build a unified semantic feature space for infrared and visible modalities, supported by the new MVIP dataset of 98,992 aligned pairs, with reported gains on infrared segmentation and detection tasks.

Revisiting Feature Prediction for Learning Visual Representations from Video

cs.CV · 2024-02-15 · conditional · none · ref 297 · internal anchor

V-JEPA models trained only on feature prediction from 2 million public videos achieve 81.9% on Kinetics-400, 72.2% on Something-Something-v2, and 77.9% on ImageNet-1K using frozen ViT-H/16 backbones.

Universal CT Representations from Anatomy to Disease Phenotype through Agglomerative Pretraining

cs.CV · 2026-05-21 · unverdicted · none · ref 18 · 2 links · internal anchor

FlexiCT provides CT foundation models via agglomerative pretraining on 266227 volumes from 56 datasets that match or exceed task-specific models on five task families while organizing embeddings along tumor-stage gradients.

Beyond ViT Tokens: Masked-Diffusion Pretrained Convolutional Pathology Foundation Model for Cell-Level Dense Prediction

cs.CV · 2026-05-08 · unverdicted · none · ref 2 · internal anchor

A masked-diffusion pretrained convolutional model outperforms ViT pathology foundation models on cell-level dense prediction tasks in histology.

Sapiens2

cs.CV · 2026-04-23 · unverdicted · none · ref 28 · internal anchor

Sapiens2 improves pretraining, data scale, and architecture over its predecessor to set new state-of-the-art results on human pose estimation, body-part segmentation, normal estimation, and new tasks like pointmap and albedo estimation.

Towards Brain MRI Foundation Models for the Clinic: Findings from the FOMO25 Challenge

cs.CV · 2026-04-13 · accept · none · ref 33 · 2 links · internal anchor

Self-supervised pretraining on large unlabeled clinical brain MRI data improves generalization to out-of-domain clinical tasks over supervised in-domain training, with task-specific optimal objectives and limited benefits from model scaling.

LychSim: A Controllable and Interactive Simulation Framework for Vision Research

cs.CV · 2026-05-12 · unverdicted · none · ref 66 · internal anchor

LychSim introduces a controllable simulation platform on Unreal Engine 5 with Python API, procedural generation, and LLM integration for vision research tasks.

iDocV2: Leveraging Self-Supervision and Open-Set Detection for Improving Pattern Spotting in Historical Documents

cs.CV · 2026-04-17 · unverdicted · none · ref 3 · internal anchor

iDocV2 reaches 0.612 precision on small non-square pattern queries in historical documents while running 10 times faster than state-of-the-art dense-based approaches.

BrainDINO: A Brain MRI Foundation Model for Generalizable Clinical Representation Learning

cs.LG · 2026-04-30 · unreviewed · ref 76 · internal anchor
