iBOT: Image BERT Pre-Training with Online Tokenizer
27 Pith papers cite this work. Polarity classification is still indexing.
abstract
The success of language Transformers is primarily attributed to the pretext task of masked language modeling (MLM), where texts are first tokenized into semantically meaningful pieces. In this work, we study masked image modeling (MIM) and indicate the advantages and challenges of using a semantically meaningful visual tokenizer. We present a self-supervised framework iBOT that can perform masked prediction with an online tokenizer. Specifically, we perform self-distillation on masked patch tokens and take the teacher network as the online tokenizer, along with self-distillation on the class token to acquire visual semantics. The online tokenizer is jointly learnable with the MIM objective and dispenses with a multi-stage training pipeline where the tokenizer needs to be pre-trained beforehand. We show the prominence of iBOT by achieving an 82.3% linear probing accuracy and an 87.8% fine-tuning accuracy evaluated on ImageNet-1K. Beyond the state-of-the-art image classification results, we underline emerging local semantic patterns, which help the models to obtain strong robustness against common corruptions and achieve leading results on dense downstream tasks, e.g., object detection, instance segmentation, and semantic segmentation.
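The masked-patch self-distillation described in the abstract can be illustrated with a small, dependency-free sketch. This is not the authors' implementation: the function names, the temperature values, and the momentum-style teacher update are assumptions made for illustration, and details such as teacher-output centering and the class-token loss are omitted. It only shows the core idea that the teacher's distribution over each masked patch serves as the "token" the student must predict.

```python
import math

def log_softmax(logits, temperature):
    """Numerically stable temperature-scaled log-softmax for one token."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in scaled))
    return [x - log_norm for x in scaled]

def masked_patch_distillation_loss(student_logits, teacher_logits, mask,
                                   student_temp=0.1, teacher_temp=0.04):
    """Mean cross-entropy between teacher and student patch distributions,
    restricted to patches that were masked in the student's input.
    Temperatures are illustrative assumptions, not values from the source."""
    losses = []
    for s, t, is_masked in zip(student_logits, teacher_logits, mask):
        if not is_masked:
            continue  # only masked patches contribute to the MIM term
        # Teacher output plays the role of the online tokenizer's target.
        target = [math.exp(v) for v in log_softmax(t, teacher_temp)]
        log_pred = log_softmax(s, student_temp)
        losses.append(-sum(p * lp for p, lp in zip(target, log_pred)))
    return sum(losses) / len(losses) if losses else 0.0

def ema_update(teacher_params, student_params, momentum=0.996):
    """Assumed momentum (EMA) update that lets the teacher track the student."""
    return [momentum * t + (1.0 - momentum) * s
            for t, s in zip(teacher_params, student_params)]

# Usage: one masked patch, three prototype dimensions.
teacher = [[5.0, 0.0, 0.0]]
agreeing_student = [[5.0, 0.0, 0.0]]
disagreeing_student = [[0.0, 5.0, 0.0]]
print(masked_patch_distillation_loss(agreeing_student, teacher, [True]))
print(masked_patch_distillation_loss(disagreeing_student, teacher, [True]))
```

In this toy setting the student that agrees with the teacher on the masked patch receives a much smaller loss than the one that disagrees, and unmasked patches contribute nothing.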
hub tools
citation-role summary
citation-polarity summary
roles
background 3
polarities
background 3
citing papers explorer
- CanViT: Toward Active-Vision Foundation Models
  CanViT is the first task- and policy-agnostic AVFM pretrained via passive-to-active dense latent distillation on 13.2M scenes and 1B random glimpses, achieving 38.5% ADE20K mIoU in one glimpse and 84.5% ImageNet-1k top-1 after fine-tuning.
- A satellite foundation model for improved wealth monitoring
  Tempov is a self-supervised satellite foundation model that predicts wealth levels and decadal changes at high resolution across Africa from Landsat imagery, outperforming baselines even with limited labels and generalizing temporally.
- VFM$^{4}$SDG: Unveiling the Power of VFMs for Single-Domain Generalized Object Detection
  VFM4SDG is a dual-prior framework that distills cross-domain stable relations from VFMs into DETR encoders and injects semantic-contextual priors into decoder queries to reduce missed detections in single-domain generalized object detection.
- OVS-DINO: Open-Vocabulary Segmentation via Structure-Aligned SAM-DINO with Language Guidance
  OVS-DINO structurally aligns DINO with SAM to revitalize attenuated boundary features, achieving SOTA gains of 2.1% average and 6.3% on Cityscapes in weakly-supervised open-vocabulary segmentation.
- DualTrack: Sensorless 3D Ultrasound needs Local and Global Context
  DualTrack uses decoupled local spatiotemporal and global anatomical encoders with a fusion module to estimate probe trajectories from 2D ultrasound sequences, achieving sub-5 mm average reconstruction error on public benchmarks.
- UniRefiner: Teaching Pre-trained ViTs to Self-Dispose Dross via Contrastive Register
  UniRefiner uses contrastive registers and a dual alignment objective to remove three categories of spurious tokens from pre-trained ViTs, yielding up to 9.4% mIoU gains on ADE20K and 22% zero-shot segmentation improvements.
- What Matters for Diffusion-Friendly Latent Manifold? Prior-Aligned Autoencoders for Latent Diffusion
  Prior-Aligned AutoEncoders shape latent manifolds with spatial coherence, local continuity, and global semantics to improve latent diffusion, achieving SOTA gFID 1.03 on ImageNet 256x256 with up to 13x faster convergence.
- Image Generators are Generalist Vision Learners
  Image generation pretraining builds generalist vision models that reach SOTA on 2D and 3D perception tasks by reframing them as RGB image outputs.
- OFlow: Injecting Object-Aware Temporal Flow Matching for Robust Robotic Manipulation
  OFlow unifies temporal foresight and object-aware reasoning inside a shared latent space via flow matching to improve VLA robustness in robotic manipulation under distribution shifts.
- PolarMAE: Efficient Fetal Ultrasound Pre-training via Semantic Screening and Polar-Guided Masking
  PolarMAE is a new unsupervised pre-training method for fetal ultrasound that uses progressive visual-semantic screening, acoustic-bounded constraints, and polar-texture masking to reach state-of-the-art performance on downstream interpretation tasks.
- Generative Data-engine Foundation Model for Universal Few-shot 2D Vascular Image Segmentation
  UniVG synthesizes diverse vascular images via compositional learning and few-shot adaptation to reach fully-supervised segmentation performance on 11 tasks across 5 modalities using only 5 labeled examples each.
- Self-supervised Pretraining of Cell Segmentation Models
  DINOCell achieves a SEG score of 0.784 on LIVECell by self-supervised domain adaptation of DINOv2, improving 10.42% over SAM-based models and showing strong zero-shot transfer.
- Semantic Noise Reduction via Teacher-Guided Dual-Path Audio-Visual Representation Learning
  TG-DP decouples reconstruction and alignment objectives into separate paths with teacher guidance on visibility patterns, yielding SOTA zero-shot audio-video retrieval gains on AudioSet.
- TC-AE: Unlocking Token Capacity for Deep Compression Autoencoders
  TC-AE improves reconstruction and generative performance in deep compression by decomposing token-to-latent compression into two stages and using joint self-supervised training.
- Smart Transfer: Leveraging Vision Foundation Model for Rapid Building Damage Mapping with Post-Earthquake VHR Imagery
  Smart Transfer adapts vision foundation models using pixel-wise clustering and distance-penalized triplet loss for rapid cross-region building damage mapping after earthquakes.
- Rapidly deploying on-device eye tracking by distilling visual foundation models
  DistillGaze reduces median gaze error by 58.62% on a 2000+ participant dataset by distilling foundation models into a 256K-parameter on-device model using synthetic labeled data and unlabeled real data.
- MePo: Meta Post-Refinement for Rehearsal-Free General Continual Learning
  MePo refines pretrained backbones via meta-learning on constructed pseudo tasks and initializes a meta covariance matrix to enable robust second-order alignment, yielding 12-15% gains on CIFAR-100, ImageNet-R and CUB-200 in rehearsal-free GCL settings.
- Contrastive Heliophysical Image Pretraining for Solar Dynamics Observatory Records
  SolarCHIP contrastively pretrains CNN and Vision Transformer backbones on SDO AIA-HMI data with multi-granularity objectives, achieving SOTA on cross-modal translation and flare classification especially in low-resource settings.
- UNIV: Unified Foundation Model for Infrared and Visible Modalities
  UNIV introduces Patch Cross-modal Contrastive Learning (PCCL) to build a unified semantic feature space for infrared and visible modalities, supported by the new MVIP dataset of 98,992 aligned pairs, with reported gains on infrared segmentation and detection tasks.
- Revisiting Feature Prediction for Learning Visual Representations from Video
  V-JEPA models trained only on feature prediction from 2 million public videos achieve 81.9% on Kinetics-400, 72.2% on Something-Something-v2, and 77.9% on ImageNet-1K using frozen ViT-H/16 backbones.
- Universal CT Representations from Anatomy to Disease Phenotype through Agglomerative Pretraining
  FlexiCT provides CT foundation models via agglomerative pretraining on 266227 volumes from 56 datasets that match or exceed task-specific models on five task families while organizing embeddings along tumor-stage gradients.
- Beyond ViT Tokens: Masked-Diffusion Pretrained Convolutional Pathology Foundation Model for Cell-Level Dense Prediction
  A masked-diffusion pretrained convolutional model outperforms ViT pathology foundation models on cell-level dense prediction tasks in histology.
- Sapiens2
  Sapiens2 improves pretraining, data scale, and architecture over its predecessor to set new state-of-the-art results on human pose estimation, body-part segmentation, normal estimation, and new tasks like pointmap and albedo estimation.
- Towards Brain MRI Foundation Models for the Clinic: Findings from the FOMO25 Challenge
  Self-supervised pretraining on large unlabeled clinical brain MRI data improves generalization to out-of-domain clinical tasks over supervised in-domain training, with task-specific optimal objectives and limited benefits from model scaling.
- LychSim: A Controllable and Interactive Simulation Framework for Vision Research
  LychSim introduces a controllable simulation platform on Unreal Engine 5 with Python API, procedural generation, and LLM integration for vision research tasks.
- iDocV2: Leveraging Self-Supervision and Open-Set Detection for Improving Pattern Spotting in Historical Documents
  iDocV2 reaches 0.612 precision on small non-square pattern queries in historical documents while running 10 times faster than state-of-the-art dense-based approaches.
- BrainDINO: A Brain MRI Foundation Model for Generalizable Clinical Representation Learning