Masked autoencoders with asymmetric encoder-decoder and 75% masking ratio enable scalable self-supervised pre-training of vision transformers, achieving 87.8% ImageNet-1K accuracy with ViT-Huge using only unlabeled data.
hub Mixed citations
BEiT: BERT Pre-Training of Image Transformers
Mixed citation behavior. Most common role is background (42%).
abstract
We introduce a self-supervised vision representation model BEiT, which stands for Bidirectional Encoder representation from Image Transformers. Following BERT developed in the natural language processing area, we propose a masked image modeling task to pretrain vision Transformers. Specifically, each image has two views in our pre-training, i.e, image patches (such as 16x16 pixels), and visual tokens (i.e., discrete tokens). We first "tokenize" the original image into visual tokens. Then we randomly mask some image patches and fed them into the backbone Transformer. The pre-training objective is to recover the original visual tokens based on the corrupted image patches. After pre-training BEiT, we directly fine-tune the model parameters on downstream tasks by appending task layers upon the pretrained encoder. Experimental results on image classification and semantic segmentation show that our model achieves competitive results with previous pre-training methods. For example, base-size BEiT achieves 83.2% top-1 accuracy on ImageNet-1K, significantly outperforming from-scratch DeiT training (81.8%) with the same setup. Moreover, large-size BEiT obtains 86.3% only using ImageNet-1K, even outperforming ViT-L with supervised pre-training on ImageNet-22K (85.2%). The code and pretrained models are available at https://aka.ms/beit.
hub tools
citation-role summary
citation-polarity summary
representative citing papers
The paper constructs the VIBE benchmark and evaluates six visual in-context learning models on 14 datasets, 12 tasks, and 106 combinations under a unified one-shot protocol, revealing limitations and failure modes.
Domain-incremental video learning that permits forgetting through per-domain LoRA adapters and recovers the matching adapter at inference via test-time training on a self-supervised MAE reconstruction head.
Scaling laws hold logarithmically for model size in autoregressive jet generation, with next-token loss correlating to physical metrics via sliced Wasserstein distance, but show weaker scaling for dataset size and compute due to rapid saturation.
TCP-SSM conditions stable poles on visual tokens to explicitly control memory decay and oscillation in SSMs, cutting computation up to 44% while matching or exceeding accuracy on classification, segmentation, and detection.
Token order in frozen visual representations is exploitable via SSM-based LTI probes, revealing pre-training-dependent heterogeneity that fixed pooling misses.
VFM4SDG is a dual-prior framework that distills cross-domain stable relations from VFMs into DETR encoders and injects semantic-contextual priors into decoder queries to reduce missed detections in single-domain generalized object detection.
OVS-DINO structurally aligns DINO with SAM to revitalize attenuated boundary features, achieving SOTA gains of 2.1% average and 6.3% on Cityscapes in weakly-supervised open-vocabulary segmentation.
DREAM introduces Masking Warmup and Semantically Aligned Decoding to let a single encoder handle both contrastive alignment and masked generation, yielding gains over CLIP and FLUID on understanding and generation benchmarks.
RVM uses recurrent computation inside a masked autoencoder to learn video representations that match or exceed prior video and image models on classification, tracking, and dense spatial tasks with up to 30x better parameter efficiency.
Pioneers ViPro, the first attack to adversarially promote videos in text-to-video retrieval, using Modal Refinement to improve black-box transferability across multiple targets.
CompART adds a composition loss on decomposed captions to regularize attention sums and improves multi-object grounding plus VQA across four VLM types and six benchmarks.
A promptable model trained on 1B masks achieves competitive zero-shot segmentation performance across tasks and is released publicly with its dataset.
ZoeDepth combines relative depth pre-training on many datasets with metric depth fine-tuning and automatic head routing to achieve strong zero-shot generalization while preserving metric scale.
iBOT achieves 82.3% linear probing accuracy and 87.8% fine-tuning accuracy on ImageNet-1K using masked image modeling with a jointly trained online tokenizer.
ER-JEPA applies hierarchical Joint-Embedding Predictive Architecture to ECG time series and reports SOTA performance on the ST-MEM benchmark after pretraining on ~180k recordings.
MFASSL adds mirror-paired views, a lightweight Mirror-Fusion Attention module, and reflection-consistency losses to improve SSL on bilateral data with ~2.7% extra parameters.
HT-Bench is a large egocentric vision-plus-full-hand-tactile benchmark with four evaluation tasks; the proposed HandTouch encoder improves Recall@5 from 74.65% to 85.23%, reduces inpainting RMSE from 0.022 to 0.010, and raises OOD cIoU from 0.628 to 0.705.
CAIP learns action-aligned visual representations via contrastive pre-training on human hand keypoints from egocentric video, outperforming DINOv2, SigLIP, MVP, and R3M with >30% gains on real dexterous manipulation tasks.
Qwen-RobotWorld is a language-conditioned video world model using Double-Stream MMDiT, an 8.6M-frame embodied corpus, and progressive curriculum training that ranks first on EWMBench and DreamGen Bench.
HYDRA-X presents the first unified multimodal model using a single ViT for holistic image-video tokenization, with ablations on attention and compression plus a latent-level editing improvement.
SPADE is a split-and-delay embedding technique for multi-feature autoregressive transformers that achieves competitive performance on high-granularity calorimeter shower simulation.
AOI-SSL combines small-domain self-supervised pre-training of vision transformers with in-context patch retrieval to reduce labeled data needs and enable fast adaptation for semiconductor wire-bond segmentation.
MUSE decouples reconstruction and semantic learning in visual tokenization via topological orthogonality, yielding SOTA generation quality and improved semantic performance over its teacher model.
citing papers explorer
-
Quo Vadis, Visual In-Context Learning? A Unified Benchmark Across Domains and Tasks
The paper constructs the VIBE benchmark and evaluates six visual in-context learning models on 14 datasets, 12 tasks, and 106 combinations under a unified one-shot protocol, revealing limitations and failure modes.
-
Remembering by Reconstructing: Domain Incremental Learning With Test-Time Training on Video Streams
Domain-incremental video learning that permits forgetting through per-domain LoRA adapters and recovers the matching adapter at inference via test-time training on a self-supervised MAE reconstruction head.
-
Neural Scaling Laws for Jet Generation
Scaling laws hold logarithmically for model size in autoregressive jet generation, with next-token loss correlating to physical metrics via sliced Wasserstein distance, but show weaker scaling for dataset size and compute due to rapid saturation.
-
TCP-SSM: Efficient Vision State Space Models with Token-Conditioned Poles
TCP-SSM conditions stable poles on visual tokens to explicitly control memory decay and oscillation in SSMs, cutting computation up to 44% while matching or exceeding accuracy on classification, segmentation, and detection.
-
Rethink MAE with Linear Time-Invariant Dynamics
Token order in frozen visual representations is exploitable via SSM-based LTI probes, revealing pre-training-dependent heterogeneity that fixed pooling misses.
-
VFM$^{4}$SDG: Unveiling the Power of VFMs for Single-Domain Generalized Object Detection
VFM4SDG is a dual-prior framework that distills cross-domain stable relations from VFMs into DETR encoders and injects semantic-contextual priors into decoder queries to reduce missed detections in single-domain generalized object detection.
-
OVS-DINO: Open-Vocabulary Segmentation via Structure-Aligned SAM-DINO with Language Guidance
OVS-DINO structurally aligns DINO with SAM to revitalize attenuated boundary features, achieving SOTA gains of 2.1% average and 6.3% on Cityscapes in weakly-supervised open-vocabulary segmentation.
-
Unifying Contrastive and Generative Objectives for Visual Understanding and Text-to-Image Generation
DREAM introduces Masking Warmup and Semantically Aligned Decoding to let a single encoder handle both contrastive alignment and masked generation, yielding gains over CLIP and FLUID on understanding and generation benchmarks.
-
Recurrent Video Masked Autoencoders
RVM uses recurrent computation inside a masked autoencoder to learn video representations that match or exceed prior video and image models on classification, tracking, and dense spatial tasks with up to 30x better parameter efficiency.
-
Adversarial Video Promotion Against Text-to-Video Retrieval
Pioneers ViPro, the first attack to adversarially promote videos in text-to-video retrieval, using Modal Refinement to improve black-box transferability across multiple targets.
-
The ART of Composition: Attention-Regularized Training for Compositional Visual Grounding
CompART adds a composition loss on decomposed captions to regularize attention sums and improves multi-object grounding plus VQA across four VLM types and six benchmarks.
-
Segment Anything
A promptable model trained on 1B masks achieves competitive zero-shot segmentation performance across tasks and is released publicly with its dataset.
-
iBOT: Image BERT Pre-Training with Online Tokenizer
iBOT achieves 82.3% linear probing accuracy and 87.8% fine-tuning accuracy on ImageNet-1K using masked image modeling with a jointly trained online tokenizer.
-
A Lightweight Self-Supervised Learning Framework for Multivariate Time Series using Hierarchical-JEPA on ECG Data
ER-JEPA applies hierarchical Joint-Embedding Predictive Architecture to ECG time series and reports SOTA performance on the ST-MEM benchmark after pretraining on ~180k recordings.
-
Mirror-Fusion Attention for Reflection-Aware Self-Supervised Representation Learning
MFASSL adds mirror-paired views, a lightweight Mirror-Fusion Attention module, and reflection-consistency losses to improve SSL on bilateral data with ~2.7% extra parameters.
-
Contrastive Action-Image Pre-training for Visuomotor Control
CAIP learns action-aligned visual representations via contrastive pre-training on human hand keypoints from egocentric video, outperforming DINOv2, SigLIP, MVP, and R3M with >30% gains on real dexterous manipulation tasks.
-
Qwen-RobotWorld Technical Report: Unifying Embodied World Modeling through Language-Conditioned Video Generation
Qwen-RobotWorld is a language-conditioned video world model using Double-Stream MMDiT, an 8.6M-frame embodied corpus, and progressive curriculum training that ranks first on EWMBench and DreamGen Bench.
-
HYDRA-X: Native Unified Multimodal Models with Holistic Visual Tokenizers
HYDRA-X presents the first unified multimodal model using a single ViT for holistic image-video tokenization, with ablations on attention and compression plus a latent-level editing improvement.
-
SPADE: Split-and-Delay Embeddings for Autoregressive High-Granularity Calorimeter Simulation
SPADE is a split-and-delay embedding technique for multi-feature autoregressive transformers that achieves competitive performance on high-granularity calorimeter shower simulation.
-
AOI-SSL: Self-Supervised Framework for Efficient Segmentation of Wire-bonded Semiconductors In Optical Inspection
AOI-SSL combines small-domain self-supervised pre-training of vision transformers with in-context patch retrieval to reduce labeled data needs and enable fast adaptation for semiconductor wire-bond segmentation.
-
MUSE: Resolving Manifold Misalignment in Visual Tokenization via Topological Orthogonality
MUSE decouples reconstruction and semantic learning in visual tokenization via topological orthogonality, yielding SOTA generation quality and improved semantic performance over its teacher model.
-
Adaptive Texture-aware Masking for Self-Supervised Learning in 3D Dental CBCT Analysis
ATMask adaptively masks high inter-slice texture variation regions in 3D CBCT volumes during self-supervised pretraining, enabling more data-efficient learning than random masking on dental tasks with a contributed 6314-scan dataset.
-
When Modalities Remember: Continual Learning for Multimodal Knowledge Graphs
MRCKG combines a multimodal-structural curriculum, cross-modal preservation, and contrastive replay to let multimodal knowledge graphs learn new entities and relations over time without catastrophic forgetting.
-
Rapidly deploying on-device eye tracking by distilling visual foundation models
DistillGaze reduces median gaze error by 58.62% on a 2000+ participant dataset by distilling foundation models into a 256K-parameter on-device model using synthetic labeled data and unlabeled real data.
-
Generative Event Pretraining with Foundation Model Alignment
GEP transfers semantic knowledge from image foundation models to event data via alignment and generative pretraining on mixed sequences to create transferable event-based visual models.
-
AaSP: Aliasing-aware Self-Supervised Pre-Training for Audio Spectrogram Transformers
AaSP learns aliasing-stable audio representations by augmenting patch tokens with adaptive subband features from alias-prone bands and using teacher-student masked modeling plus multi-mask contrastive regularization, reaching SOTA on AS-20K, ESC-50, and NSynth under fine-tuning.
-
Tight Clusters Make Specialized Experts
Introduces Adaptive Clustering router for MoE models that scales features to identify tight expert clusters, yielding faster convergence, robustness to corruption, and performance gains.
-
YOLOv12: Attention-Centric Real-Time Object Detectors
YOLOv12 is a new attention-based real-time object detector that reports higher accuracy than YOLOv10, YOLOv11, and RT-DETR variants at comparable or better speed and efficiency.
-
LIMO: Less is More for Reasoning
LIMO achieves 63.3% on AIME24 and 95.6% on MATH500 via supervised fine-tuning on roughly 1% of the data used by prior models, supporting the claim that minimal strategic examples suffice when pre-training has already encoded domain knowledge.
-
Video Prediction Policy: A Generalist Robot Policy with Predictive Visual Representations
Video Prediction Policy conditions robot action learning on future-frame predictions inside fine-tuned video diffusion models, yielding 18.6% relative gains on Calvin ABC-D and 31.6% higher real-world success rates.
-
Chameleon: Mixed-Modal Early-Fusion Foundation Models
Chameleon is an early-fusion token model that handles mixed image-text sequences for understanding and generation, achieving competitive or superior performance to larger models like Llama-2, Mixtral, and Gemini-Pro on captioning, VQA, text, and image tasks.
-
Zeus: Towards Tuning-Free Foundation Model for Time Series Analysis
Zeus proposes a multi-scale Transformer with point-wise tokenization and Multi-Objective Temporal Masking to enable tuning-free performance on forecasting, interpolation, and other time series tasks.
-
DRDN: Decoupled Representation Dynamic Network for From-Scratch ViT Class-Incremental Learning
DRDN reports 77.19% average accuracy on CIFAR100-B0 (10 steps) in from-scratch ViT class-incremental learning by routing MIM reconstruction through the backbone and using per-layer task token expansion.
-
Cross-Modal Masked Compositional Concept Modeling for Enhancing Visio-Linguistic Compositionality
MACCO applies cross-modal masked reconstruction of compositional concepts with inter- and intra-modal auxiliary objectives to improve visio-linguistic compositionality in VLMs.
-
NSVQ: Mitigating Codebook Collapse by Stabilizing Encoder Drift in Vector Quantization
NSVQ mitigates codebook collapse in large-codebook VQ by addressing encoder drift via non-stationary loss, replacement, and staged freezing, improving rFID from 2.39 to 2.10 on ImageNet-1k while achieving 100% utilization.
-
Modeling Complex Behaviors: Multi-Personality Composition and Dynamic Switching in Vision-Language Models
The work establishes an evaluation framework for personality induction and switching in MLLMs, reporting improved captioning but impaired VQA performance plus balancing and residual effects during multi-trait and dynamic conditions.
-
ClinReadNet: A clinical reading-inspired network for low-dose abdominal CT image quality assessment
ClinReadNet introduces SOQN, (S)W-MTMSA, and HRPS loss to achieve SOTA no-reference IQA on LDCTIQAG2023 with PLCC 0.9507, SROCC 0.9554, KROCC 0.8629.
-
EEGDancer: Dynamic Emotion Latent Space Masked Modeling with Reinforcement Learning for EEG Continuous Emotion Prediction
EEGDancer integrates VQ-VAE latent space learning, masked Transformer modeling, and SAC reinforcement learning to improve continuous EEG emotion prediction over prior methods on SEED datasets.
-
HQ-JEPA: Hybrid Quantum Joint-Embedding Predictive Architecture for Cross-Modal Remote Sensing Representation Learning
HQ-JEPA combines JEPA-style predictive self-supervision with cross-modal alignment and a SWAP-test-based quantum fidelity loss for learning representations from paired remote sensing imagery, reporting competitive results on GeoBench tasks.
-
SIGMA: Bridging Structural and Distributional Gaps for Vision Foundation Model Adaptation
SIGMA proposes a lightweight PEFT adapter consisting of scale-adaptive fusion and semantic modulation to bridge structural and distributional gaps when adapting vision foundation models to dense tasks.
-
Temporal Aware Pruning for Efficient Diffusion-based Video Generation
TAPE applies temporal-aware token pruning with smoothing, reselection, and timestep scheduling to speed up video diffusion models while preserving visual fidelity and coherence.
-
ShellfishNet: A Domain-Specific Benchmark for Visual Recognition of Marine Molluscs
ShellfishNet is a new benchmark of 8,691 images across 32 mollusc taxa for evaluating vision models on real-world underwater ecological monitoring tasks including robustness to degradation.
-
MAEPose: Self-Supervised Spatiotemporal Learning for Human Pose Estimation on mmWave Video
MAEPose is a self-supervised masked autoencoder operating on mmWave spectrogram videos that learns motion representations and decodes multi-frame poses, outperforming baselines by up to 22.1% MPJPE.
-
BrainDINO: A Brain MRI Foundation Model for Generalizable Clinical Representation Learning
BrainDINO, trained via self-distillation on millions of unlabeled axial brain MRI slices, yields a unified representation that equals or exceeds baselines across diverse neuroimaging tasks when used with a frozen encoder and lightweight heads.
-
Sapiens2
Sapiens2 improves pretraining, data scale, and architecture over its predecessor to set new state-of-the-art results on human pose estimation, body-part segmentation, normal estimation, and new tasks like pointmap and albedo estimation.
-
Stylistic-STORM (ST-STORM) : Perceiving the Semantic Nature of Appearance
ST-STORM introduces a dual-branch SSL framework that disentangles semantic content from stylistic appearance using gated latent streams, JEPA for content invariance, and adversarial constraints for style capture.
-
PRAGMA: Revolut Foundation Model
PRAGMA pre-trains a Transformer on heterogeneous banking events with a tailored self-supervised masked objective, yielding embeddings that support strong downstream performance on credit scoring, fraud detection, and lifetime value prediction using linear heads or light fine-tuning.
-
Unifying VLM-Guided Flow Matching and Spectral Anomaly Detection for Interpretable Veterinary Diagnosis
New public dataset and VLM-guided flow matching segmentation combined with random matrix theory anomaly detection for interpretable canine pneumothorax diagnosis.
-
Expectation and Acoustic Neural Network Representations Enhance Music Identification from Brain Activity
Separating acoustic and expectation ANN representations as teacher targets improves EEG music identification beyond baselines and seed ensembles.
-
Generalizable Deepfake Detection Based on Forgery-aware Layer Masking and Multi-artifact Subspace Decomposition
FMSD improves cross-dataset generalization in deepfake detection by using gradient-based layer masking to select forgery-sensitive weights and SVD to split them into preserved semantic and multiple learnable artifact subspaces with orthogonality constraints.