iBOT: Image BERT Pre-Training with Online Tokenizer

27 Pith papers cite this work. Polarity classification is still indexing.

abstract

The success of language Transformers is primarily attributed to the pretext task of masked language modeling (MLM), where texts are first tokenized into semantically meaningful pieces. In this work, we study masked image modeling (MIM) and indicate the advantages and challenges of using a semantically meaningful visual tokenizer. We present a self-supervised framework iBOT that can perform masked prediction with an online tokenizer. Specifically, we perform self-distillation on masked patch tokens and take the teacher network as the online tokenizer, along with self-distillation on the class token to acquire visual semantics. The online tokenizer is jointly learnable with the MIM objective and dispenses with a multi-stage training pipeline where the tokenizer needs to be pre-trained beforehand. We show the prominence of iBOT by achieving an 82.3% linear probing accuracy and an 87.8% fine-tuning accuracy evaluated on ImageNet-1K. Beyond the state-of-the-art image classification results, we underline emerging local semantic patterns, which help the models to obtain strong robustness against common corruptions and achieve leading results on dense downstream tasks, e.g., object detection, instance segmentation, and semantic segmentation.
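
The masked-patch self-distillation described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The function names, the temperature values, and the momentum (EMA) teacher update are assumptions introduced here for illustration; the abstract itself only states that the teacher acts as the online tokenizer and is learned jointly with the MIM objective. The sketch covers the patch-token term only; the class-token term would apply the same cross-entropy to class-token outputs.

```python
import math

def log_softmax(logits, temperature):
    """Numerically stable log-softmax of temperature-scaled logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    log_sum = peak + math.log(sum(math.exp(s - peak) for s in scaled))
    return [s - log_sum for s in scaled]

def masked_patch_distillation_loss(teacher_logits, student_logits, mask,
                                   teacher_temp=0.04, student_temp=0.1):
    """Cross-entropy between teacher and student patch distributions,
    averaged over masked positions only.

    teacher_logits: per-patch logits from the teacher on the unmasked image
    student_logits: per-patch logits from the student on the masked image
    mask:           per-patch booleans, True where the student's input was masked
    The temperature defaults are illustrative assumptions.
    """
    losses = []
    for t_logits, s_logits, is_masked in zip(teacher_logits, student_logits, mask):
        if not is_masked:
            continue  # only masked patches contribute to this term
        # The teacher output plays the role of a soft "token" target.
        teacher_probs = [math.exp(v) for v in log_softmax(t_logits, teacher_temp)]
        student_log_probs = log_softmax(s_logits, student_temp)
        losses.append(-sum(p * lp for p, lp in zip(teacher_probs, student_log_probs)))
    return sum(losses) / len(losses) if losses else 0.0

def ema_update(teacher_params, student_params, momentum=0.996):
    """Assumed momentum update: the teacher tracks a moving average of the student."""
    return [momentum * t + (1.0 - momentum) * s
            for t, s in zip(teacher_params, student_params)]
```

Because the target distribution comes from the teacher rather than from a separately pre-trained tokenizer, a single training loop can in principle update both networks, which is the property the abstract refers to as dispensing with a multi-stage pipeline.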

citation-role summary

background 3

representative citing papers

CanViT: Toward Active-Vision Foundation Models

cs.CV · 2026-03-23 · conditional · novelty 8.0

CanViT is the first task- and policy-agnostic AVFM pretrained via passive-to-active dense latent distillation on 13.2M scenes and 1B random glimpses, achieving 38.5% ADE20K mIoU in one glimpse and 84.5% ImageNet-1k top-1 after fine-tuning.

A satellite foundation model for improved wealth monitoring

cs.CY · 2026-04-25 · unverdicted · novelty 7.0

Tempov is a self-supervised satellite foundation model that predicts wealth levels and decadal changes at high resolution across Africa from Landsat imagery, outperforming baselines even with limited labels and generalizing temporally.

DualTrack: Sensorless 3D Ultrasound needs Local and Global Context

cs.CV · 2025-09-11 · unverdicted · novelty 7.0

DualTrack uses decoupled local spatiotemporal and global anatomical encoders with a fusion module to estimate probe trajectories from 2D ultrasound sequences, achieving sub-5 mm average reconstruction error on public benchmarks.

Image Generators are Generalist Vision Learners

cs.CV · 2026-04-22 · conditional · novelty 6.0 · 2 refs

Image generation pretraining builds generalist vision models that reach SOTA on 2D and 3D perception tasks by reframing them as RGB image outputs.

Self-supervised Pretraining of Cell Segmentation Models

cs.CV · 2026-04-12 · unverdicted · novelty 6.0

DINOCell achieves a SEG score of 0.784 on LIVECell by self-supervised domain adaptation of DINOv2, improving 10.42% over SAM-based models and showing strong zero-shot transfer.

MePo: Meta Post-Refinement for Rehearsal-Free General Continual Learning

cs.AI · 2026-02-08 · unverdicted · novelty 6.0

MePo refines pretrained backbones via meta-learning on constructed pseudo tasks and initializes a meta covariance matrix to enable robust second-order alignment, yielding 12-15% gains on CIFAR-100, ImageNet-R and CUB-200 in rehearsal-free GCL settings.

UNIV: Unified Foundation Model for Infrared and Visible Modalities

cs.CV · 2025-09-19 · unverdicted · novelty 6.0

UNIV introduces Patch Cross-modal Contrastive Learning (PCCL) to build a unified semantic feature space for infrared and visible modalities, supported by the new MVIP dataset of 98,992 aligned pairs, with reported gains on infrared segmentation and detection tasks.

Sapiens2

cs.CV · 2026-04-23 · unverdicted · novelty 5.0

Sapiens2 improves pretraining, data scale, and architecture over its predecessor to set new state-of-the-art results on human pose estimation, body-part segmentation, normal estimation, and new tasks like pointmap and albedo estimation.
