
BEiT: BERT Pre-Training of Image Transformers

Hangbo Bao, Li Dong, Songhao Piao, Furu Wei · 2021 · cs.CV · arXiv 2106.08254

Mixed citation behavior. Most common role is background (42%).

53 Pith papers citing it



abstract

We introduce a self-supervised vision representation model BEiT, which stands for Bidirectional Encoder representation from Image Transformers. Following BERT developed in the natural language processing area, we propose a masked image modeling task to pretrain vision Transformers. Specifically, each image has two views in our pre-training, i.e., image patches (such as 16x16 pixels), and visual tokens (i.e., discrete tokens). We first "tokenize" the original image into visual tokens. Then we randomly mask some image patches and feed them into the backbone Transformer. The pre-training objective is to recover the original visual tokens based on the corrupted image patches. After pre-training BEiT, we directly fine-tune the model parameters on downstream tasks by appending task layers upon the pretrained encoder. Experimental results on image classification and semantic segmentation show that our model achieves competitive results with previous pre-training methods. For example, base-size BEiT achieves 83.2% top-1 accuracy on ImageNet-1K, significantly outperforming from-scratch DeiT training (81.8%) with the same setup. Moreover, large-size BEiT obtains 86.3% only using ImageNet-1K, even outperforming ViT-L with supervised pre-training on ImageNet-22K (85.2%). The code and pretrained models are available at https://aka.ms/beit.
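
The masked image modeling objective described in the abstract can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the image is a single-channel 64x64 grid, the "tokenizer" and the "Transformer" are replaced by stand-ins (random token ids and uniform logits), and the vocabulary size `VOCAB` and the 40% mask ratio are illustrative values that the abstract does not specify. All function names are hypothetical.

```python
import math
import random


def patchify(image, patch):
    """Split an H x W image (list of rows) into non-overlapping patch x patch blocks."""
    h, w = len(image), len(image[0])
    assert h % patch == 0 and w % patch == 0, "image must tile evenly"
    patches = []
    for top in range(0, h, patch):
        for left in range(0, w, patch):
            block = [image[r][left:left + patch] for r in range(top, top + patch)]
            patches.append(block)
    return patches


def sample_mask(num_patches, ratio, rng):
    """Choose round(ratio * num_patches) patch positions uniformly at random."""
    k = round(ratio * num_patches)
    return sorted(rng.sample(range(num_patches), k))


def masked_token_loss(logits, token_ids, masked_positions):
    """Mean cross-entropy between predicted and true visual tokens, masked positions only."""
    total = 0.0
    for pos in masked_positions:
        row = logits[pos]
        m = max(row)  # subtract the max for a numerically stable log-sum-exp
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[token_ids[pos]]
    return total / len(masked_positions)


rng = random.Random(0)
image = [[rng.random() for _ in range(64)] for _ in range(64)]

# View 1: image patches of 16x16 pixels, as in the abstract.
patches = patchify(image, 16)

# View 2: one discrete visual token per patch.
# ASSUMPTION: random ids stand in for the output of a real image tokenizer.
VOCAB = 8  # hypothetical vocabulary size, chosen only to keep the example small
token_ids = [rng.randrange(VOCAB) for _ in patches]

# Corrupt the input by masking a random subset of patches (ratio is an assumption).
masked = sample_mask(len(patches), 0.4, rng)

# ASSUMPTION: uniform logits stand in for the backbone Transformer's predictions.
logits = [[0.0] * VOCAB for _ in patches]

loss = masked_token_loss(logits, token_ids, masked)
print(f"{len(patches)} patches, {len(masked)} masked, loss = {loss:.4f}")
```

With uniform logits the loss equals the log of the vocabulary size, which is the value an untrained predictor would start from; pre-training would drive it down by learning to recover the masked tokens from the visible patches.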


citation-role summary

background: 7 · method: 5

citation-polarity summary

background: 5 · use method: 4 · unclear: 2 · extend: 1

representative citing papers

Masked Autoencoders Are Scalable Vision Learners

cs.CV · 2021-11-11 · accept · novelty 8.0

Masked autoencoders with asymmetric encoder-decoder and 75% masking ratio enable scalable self-supervised pre-training of vision transformers, achieving 87.8% ImageNet-1K accuracy with ViT-Huge using only unlabeled data.

Remembering by Reconstructing: Domain Incremental Learning With Test-Time Training on Video Streams

cs.CV · 2026-05-29 · unverdicted · novelty 7.0

Domain-incremental video learning that permits forgetting through per-domain LoRA adapters and recovers the matching adapter at inference via test-time training on a self-supervised MAE reconstruction head.

Neural Scaling Laws for Jet Generation

hep-ph · 2026-05-27 · unverdicted · novelty 7.0

Scaling laws hold logarithmically for model size in autoregressive jet generation, with next-token loss correlating to physical metrics via sliced Wasserstein distance, but show weaker scaling for dataset size and compute due to rapid saturation.

TCP-SSM: Efficient Vision State Space Models with Token-Conditioned Poles

cs.CV · 2026-05-12 · unverdicted · novelty 7.0

TCP-SSM conditions stable poles on visual tokens to explicitly control memory decay and oscillation in SSMs, cutting computation up to 44% while matching or exceeding accuracy on classification, segmentation, and detection.

Rethink MAE with Linear Time-Invariant Dynamics

cs.CV · 2026-04-29 · unverdicted · novelty 7.0

Token order in frozen visual representations is exploitable via SSM-based LTI probes, revealing pre-training-dependent heterogeneity that fixed pooling misses.

VFM$^{4}$SDG: Unveiling the Power of VFMs for Single-Domain Generalized Object Detection

cs.CV · 2026-04-23 · unverdicted · novelty 7.0 · 2 refs

VFM4SDG is a dual-prior framework that distills cross-domain stable relations from VFMs into DETR encoders and injects semantic-contextual priors into decoder queries to reduce missed detections in single-domain generalized object detection.

OVS-DINO: Open-Vocabulary Segmentation via Structure-Aligned SAM-DINO with Language Guidance

cs.CV · 2026-04-09 · unverdicted · novelty 7.0

OVS-DINO structurally aligns DINO with SAM to revitalize attenuated boundary features, achieving SOTA gains of 2.1% average and 6.3% on Cityscapes in weakly-supervised open-vocabulary segmentation.

Unifying Contrastive and Generative Objectives for Visual Understanding and Text-to-Image Generation

cs.CV · 2026-03-03 · unverdicted · novelty 7.0

DREAM introduces Masking Warmup and Semantically Aligned Decoding to let a single encoder handle both contrastive alignment and masked generation, yielding gains over CLIP and FLUID on understanding and generation benchmarks.

Recurrent Video Masked Autoencoders

cs.CV · 2025-12-15 · unverdicted · novelty 7.0

RVM uses recurrent computation inside a masked autoencoder to learn video representations that match or exceed prior video and image models on classification, tracking, and dense spatial tasks with up to 30x better parameter efficiency.

Adversarial Video Promotion Against Text-to-Video Retrieval

cs.CV · 2025-08-09 · unverdicted · novelty 7.0

Pioneers ViPro, the first attack to adversarially promote videos in text-to-video retrieval, using Modal Refinement to improve black-box transferability across multiple targets.

The ART of Composition: Attention-Regularized Training for Compositional Visual Grounding

cs.CV · 2024-12-11 · unverdicted · novelty 7.0

CompART adds a composition loss on decomposed captions to regularize attention sums and improves multi-object grounding plus VQA across four VLM types and six benchmarks.

Segment Anything

cs.CV · 2023-04-05 · unverdicted · novelty 7.0

A promptable model trained on 1B masks achieves competitive zero-shot segmentation performance across tasks and is released publicly with its dataset.

ZoeDepth: Zero-shot Transfer by Combining Relative and Metric Depth

cs.CV · 2023-02-23 · accept · novelty 7.0

ZoeDepth combines relative depth pre-training on many datasets with metric depth fine-tuning and automatic head routing to achieve strong zero-shot generalization while preserving metric scale.

iBOT: Image BERT Pre-Training with Online Tokenizer

cs.CV · 2021-11-15 · unverdicted · novelty 7.0

iBOT achieves 82.3% linear probing accuracy and 87.8% fine-tuning accuracy on ImageNet-1K using masked image modeling with a jointly trained online tokenizer.

AOI-SSL: Self-Supervised Framework for Efficient Segmentation of Wire-bonded Semiconductors In Optical Inspection

cs.CV · 2026-05-12 · unverdicted · novelty 6.0

AOI-SSL combines small-domain self-supervised pre-training of vision transformers with in-context patch retrieval to reduce labeled data needs and enable fast adaptation for semiconductor wire-bond segmentation.

MUSE: Resolving Manifold Misalignment in Visual Tokenization via Topological Orthogonality

cs.CV · 2026-05-07 · unverdicted · novelty 6.0

MUSE decouples reconstruction and semantic learning in visual tokenization via topological orthogonality, yielding SOTA generation quality and improved semantic performance over its teacher model.

Adaptive Texture-aware Masking for Self-Supervised Learning in 3D Dental CBCT Analysis

cs.CV · 2026-05-03 · unverdicted · novelty 6.0

ATMask adaptively masks high inter-slice texture variation regions in 3D CBCT volumes during self-supervised pretraining, enabling more data-efficient learning than random masking on dental tasks with a contributed 6314-scan dataset.

When Modalities Remember: Continual Learning for Multimodal Knowledge Graphs

cs.CL · 2026-04-03 · unverdicted · novelty 6.0

MRCKG combines a multimodal-structural curriculum, cross-modal preservation, and contrastive replay to let multimodal knowledge graphs learn new entities and relations over time without catastrophic forgetting.

Rapidly deploying on-device eye tracking by distilling visual foundation models

cs.CV · 2026-04-02 · unverdicted · novelty 6.0

DistillGaze reduces median gaze error by 58.62% on a 2000+ participant dataset by distilling foundation models into a 256K-parameter on-device model using synthetic labeled data and unlabeled real data.

Generative Event Pretraining with Foundation Model Alignment

cs.CV · 2026-03-24 · unverdicted · novelty 6.0

GEP transfers semantic knowledge from image foundation models to event data via alignment and generative pretraining on mixed sequences to create transferable event-based visual models.

AaSP: Aliasing-aware Self-Supervised Pre-Training for Audio Spectrogram Transformers

cs.SD · 2025-12-03 · unverdicted · novelty 6.0

AaSP learns aliasing-stable audio representations by augmenting patch tokens with adaptive subband features from alias-prone bands and using teacher-student masked modeling plus multi-mask contrastive regularization, reaching SOTA on AS-20K, ESC-50, and NSynth under fine-tuning.

Tight Clusters Make Specialized Experts

cs.LG · 2025-02-21 · unverdicted · novelty 6.0

Introduces Adaptive Clustering router for MoE models that scales features to identify tight expert clusters, yielding faster convergence, robustness to corruption, and performance gains.

YOLOv12: Attention-Centric Real-Time Object Detectors

cs.CV · 2025-02-18 · unverdicted · novelty 6.0

YOLOv12 is a new attention-based real-time object detector that reports higher accuracy than YOLOv10, YOLOv11, and RT-DETR variants at comparable or better speed and efficiency.

LIMO: Less is More for Reasoning

cs.CL · 2025-02-05 · unverdicted · novelty 6.0

LIMO achieves 63.3% on AIME24 and 95.6% on MATH500 via supervised fine-tuning on roughly 1% of the data used by prior models, supporting the claim that minimal strategic examples suffice when pre-training has already encoded domain knowledge.

citing papers explorer

Showing 40 of 40 citing papers after filters.

Masked Autoencoders Are Scalable Vision Learners cs.CV · 2021-11-11 · accept · none · ref 2 · internal anchor
Masked autoencoders with asymmetric encoder-decoder and 75% masking ratio enable scalable self-supervised pre-training of vision transformers, achieving 87.8% ImageNet-1K accuracy with ViT-Huge using only unlabeled data.
Remembering by Reconstructing: Domain Incremental Learning With Test-Time Training on Video Streams cs.CV · 2026-05-29 · unverdicted · none · ref 1 · internal anchor
Domain-incremental video learning that permits forgetting through per-domain LoRA adapters and recovers the matching adapter at inference via test-time training on a self-supervised MAE reconstruction head.
TCP-SSM: Efficient Vision State Space Models with Token-Conditioned Poles cs.CV · 2026-05-12 · unverdicted · none · ref 1 · internal anchor
TCP-SSM conditions stable poles on visual tokens to explicitly control memory decay and oscillation in SSMs, cutting computation up to 44% while matching or exceeding accuracy on classification, segmentation, and detection.
Rethink MAE with Linear Time-Invariant Dynamics cs.CV · 2026-04-29 · unverdicted · none · ref 1 · internal anchor
Token order in frozen visual representations is exploitable via SSM-based LTI probes, revealing pre-training-dependent heterogeneity that fixed pooling misses.
VFM$^{4}$SDG: Unveiling the Power of VFMs for Single-Domain Generalized Object Detection cs.CV · 2026-04-23 · unverdicted · none · ref 41 · 2 links · internal anchor
VFM4SDG is a dual-prior framework that distills cross-domain stable relations from VFMs into DETR encoders and injects semantic-contextual priors into decoder queries to reduce missed detections in single-domain generalized object detection.
OVS-DINO: Open-Vocabulary Segmentation via Structure-Aligned SAM-DINO with Language Guidance cs.CV · 2026-04-09 · unverdicted · none · ref 1 · internal anchor
OVS-DINO structurally aligns DINO with SAM to revitalize attenuated boundary features, achieving SOTA gains of 2.1% average and 6.3% on Cityscapes in weakly-supervised open-vocabulary segmentation.
Unifying Contrastive and Generative Objectives for Visual Understanding and Text-to-Image Generation cs.CV · 2026-03-03 · unverdicted · none · ref 1 · internal anchor
DREAM introduces Masking Warmup and Semantically Aligned Decoding to let a single encoder handle both contrastive alignment and masked generation, yielding gains over CLIP and FLUID on understanding and generation benchmarks.
Recurrent Video Masked Autoencoders cs.CV · 2025-12-15 · unverdicted · none · ref 6 · internal anchor
RVM uses recurrent computation inside a masked autoencoder to learn video representations that match or exceed prior video and image models on classification, tracking, and dense spatial tasks with up to 30x better parameter efficiency.
Adversarial Video Promotion Against Text-to-Video Retrieval cs.CV · 2025-08-09 · unverdicted · none · ref 3 · internal anchor
Pioneers ViPro, the first attack to adversarially promote videos in text-to-video retrieval, using Modal Refinement to improve black-box transferability across multiple targets.
The ART of Composition: Attention-Regularized Training for Compositional Visual Grounding cs.CV · 2024-12-11 · unverdicted · none · ref 4 · internal anchor
CompART adds a composition loss on decomposed captions to regularize attention sums and improves multi-object grounding plus VQA across four VLM types and six benchmarks.
Segment Anything cs.CV · 2023-04-05 · unverdicted · none · ref 5 · internal anchor
A promptable model trained on 1B masks achieves competitive zero-shot segmentation performance across tasks and is released publicly with its dataset.
ZoeDepth: Zero-shot Transfer by Combining Relative and Metric Depth cs.CV · 2023-02-23 · accept · none · ref 3 · internal anchor
ZoeDepth combines relative depth pre-training on many datasets with metric depth fine-tuning and automatic head routing to achieve strong zero-shot generalization while preserving metric scale.
iBOT: Image BERT Pre-Training with Online Tokenizer cs.CV · 2021-11-15 · unverdicted · none · ref 3 · internal anchor
iBOT achieves 82.3% linear probing accuracy and 87.8% fine-tuning accuracy on ImageNet-1K using masked image modeling with a jointly trained online tokenizer.
AOI-SSL: Self-Supervised Framework for Efficient Segmentation of Wire-bonded Semiconductors In Optical Inspection cs.CV · 2026-05-12 · unverdicted · none · ref 3 · internal anchor
AOI-SSL combines small-domain self-supervised pre-training of vision transformers with in-context patch retrieval to reduce labeled data needs and enable fast adaptation for semiconductor wire-bond segmentation.
MUSE: Resolving Manifold Misalignment in Visual Tokenization via Topological Orthogonality cs.CV · 2026-05-07 · unverdicted · none · ref 86 · internal anchor
MUSE decouples reconstruction and semantic learning in visual tokenization via topological orthogonality, yielding SOTA generation quality and improved semantic performance over its teacher model.
Adaptive Texture-aware Masking for Self-Supervised Learning in 3D Dental CBCT Analysis cs.CV · 2026-05-03 · unverdicted · none · ref 1 · internal anchor
ATMask adaptively masks high inter-slice texture variation regions in 3D CBCT volumes during self-supervised pretraining, enabling more data-efficient learning than random masking on dental tasks with a contributed 6314-scan dataset.
Rapidly deploying on-device eye tracking by distilling visual foundation models cs.CV · 2026-04-02 · unverdicted · none · ref 38 · internal anchor
DistillGaze reduces median gaze error by 58.62% on a 2000+ participant dataset by distilling foundation models into a 256K-parameter on-device model using synthetic labeled data and unlabeled real data.
Generative Event Pretraining with Foundation Model Alignment cs.CV · 2026-03-24 · unverdicted · none · ref 1 · internal anchor
GEP transfers semantic knowledge from image foundation models to event data via alignment and generative pretraining on mixed sequences to create transferable event-based visual models.
YOLOv12: Attention-Centric Real-Time Object Detectors cs.CV · 2025-02-18 · unverdicted · none · ref 1 · internal anchor
YOLOv12 is a new attention-based real-time object detector that reports higher accuracy than YOLOv10, YOLOv11, and RT-DETR variants at comparable or better speed and efficiency.
Video Prediction Policy: A Generalist Robot Policy with Predictive Visual Representations cs.CV · 2024-12-19 · unverdicted · none · ref 79 · internal anchor
Video Prediction Policy conditions robot action learning on future-frame predictions inside fine-tuned video diffusion models, yielding 18.6% relative gains on Calvin ABC-D and 31.6% higher real-world success rates.
Revisiting Feature Prediction for Learning Visual Representations from Video cs.CV · 2024-02-15 · conditional · none · ref 224 · internal anchor
V-JEPA models trained only on feature prediction from 2 million public videos achieve 81.9% on Kinetics-400, 72.2% on Something-Something-v2, and 77.9% on ImageNet-1K using frozen ViT-H/16 backbones.
Demystifying CLIP Data cs.CV · 2023-09-28 · accept · none · ref 85 · internal anchor
MetaCLIP curates balanced 400M-pair subsets from CommonCrawl that outperform CLIP data, reaching 70.8% zero-shot ImageNet accuracy on ViT-B versus CLIP's 68.3%.
EVA-CLIP: Improved Training Techniques for CLIP at Scale cs.CV · 2023-03-27 · conditional · none · ref 2 · internal anchor
EVA-CLIP delivers improved CLIP training recipes that yield 82.0% zero-shot ImageNet-1K accuracy for a 5B-parameter model after only 9 billion samples.
CoCa: Contrastive Captioners are Image-Text Foundation Models cs.CV · 2022-05-04 · accept · none · ref 22 · internal anchor
CoCa unifies contrastive and generative pretraining in one image-text model to reach 86.3% zero-shot ImageNet accuracy and new state-of-the-art results on multiple downstream benchmarks.
Vector-quantized Image Modeling with Improved VQGAN cs.CV · 2021-10-09 · accept · none · ref 4 · internal anchor
Improved ViT-VQGAN enables autoregressive Transformer pretraining on ImageNet tokens to reach IS 175.1 and FID 4.17 for generation plus 73.2% linear-probe accuracy, beating prior iGPT models.
HQ-JEPA: Hybrid Quantum Joint-Embedding Predictive Architecture for Cross-Modal Remote Sensing Representation Learning cs.CV · 2026-05-29 · unverdicted · none · ref 4 · internal anchor
HQ-JEPA combines JEPA-style predictive self-supervision with cross-modal alignment and a SWAP-test-based quantum fidelity loss for learning representations from paired remote sensing imagery, reporting competitive results on GeoBench tasks.
SIGMA: Bridging Structural and Distributional Gaps for Vision Foundation Model Adaptation cs.CV · 2026-05-27 · unverdicted · none · ref 2 · internal anchor
SIGMA proposes a lightweight PEFT adapter consisting of scale-adaptive fusion and semantic modulation to bridge structural and distributional gaps when adapting vision foundation models to dense tasks.
Temporal Aware Pruning for Efficient Diffusion-based Video Generation cs.CV · 2026-05-18 · unverdicted · none · ref 117 · 2 links · internal anchor
TAPE applies temporal-aware token pruning with smoothing, reselection, and timestep scheduling to speed up video diffusion models while preserving visual fidelity and coherence.
ShellfishNet: A Domain-Specific Benchmark for Visual Recognition of Marine Molluscs cs.CV · 2026-05-08 · unverdicted · none · ref 5 · internal anchor
ShellfishNet is a new benchmark of 8,691 images across 32 mollusc taxa for evaluating vision models on real-world underwater ecological monitoring tasks including robustness to degradation.
Sapiens2 cs.CV · 2026-04-23 · unverdicted · none · ref 5 · internal anchor
Sapiens2 improves pretraining, data scale, and architecture over its predecessor to set new state-of-the-art results on human pose estimation, body-part segmentation, normal estimation, and new tasks like pointmap and albedo estimation.
Stylistic-STORM (ST-STORM) : Perceiving the Semantic Nature of Appearance cs.CV · 2026-04-17 · unverdicted · none · ref 19 · internal anchor
ST-STORM introduces a dual-branch SSL framework that disentangles semantic content from stylistic appearance using gated latent streams, JEPA for content invariance, and adversarial constraints for style capture.
Unifying VLM-Guided Flow Matching and Spectral Anomaly Detection for Interpretable Veterinary Diagnosis cs.CV · 2026-04-07 · unverdicted · none · ref 46 · internal anchor
New public dataset and VLM-guided flow matching segmentation combined with random matrix theory anomaly detection for interpretable canine pneumothorax diagnosis.
Generalizable Deepfake Detection Based on Forgery-aware Layer Masking and Multi-artifact Subspace Decomposition cs.CV · 2026-01-03 · unverdicted · none · ref 61 · internal anchor
FMSD improves cross-dataset generalization in deepfake detection by using gradient-based layer masking to select forgery-sensitive weights and SVD to split them into preserved semantic and multiple learnable artifact subspaces with orthogonality constraints.
PaCo-FR: Patch-Pixel Aligned End-to-End Codebook Learning for Facial Representation Pre-training cs.CV · 2025-08-13 · unverdicted · none · ref 2 · internal anchor
PaCo-FR introduces a structured-masking and patch-codebook framework for unsupervised facial representation pre-training that claims state-of-the-art results on multiple facial tasks after training on only 2 million unlabeled images.
Towards Robust and Realistic Human Pose Estimation via WiFi Signals cs.CV · 2025-01-16 · unverdicted · none · ref 1 · internal anchor
DT-Pose reformulates WiFi HPE as domain-consistent representation learning via temporal contrastive masked pretraining plus hybrid topology-constrained decoding to yield more accurate and realistic 2D/3D poses.
FruitEnsemble: MLLM-Guided Arbitration for Heterogeneous ensemble in Fine-Grained Fruit Recognition cs.CV · 2026-05-20 · unverdicted · none · ref 2 · internal anchor
FruitEnsemble uses a weighted ensemble of backbones for top-3 candidates followed by MLLM arbitration on low-confidence samples to reach 70.49% accuracy on a new 306-class fruit dataset.
NTIRE 2026 Challenge on Robust AI-Generated Image Detection in the Wild cs.CV · 2026-04-13 · unverdicted · none · ref 5 · internal anchor
The NTIRE 2026 challenge provides a dataset of over 294,000 real and AI-generated images with 36 transformations to benchmark robust detection models.
Neural Network Optimization Reimagined: Decoupled Techniques for Scratch and Fine-Tuning cs.CV · 2026-04-21 · unverdicted · none · ref 8 · internal anchor
DualOpt decouples optimization by using real-time layer-wise weight decay for scratch training and weight rollback for fine-tuning to improve convergence, generalization, and reduce knowledge forgetting.
MAEPose: Self-Supervised Spatiotemporal Learning for Human Pose Estimation on mmWave Video cs.CV · 2026-04-30 · unreviewed · ref 1 · internal anchor
Image Generators are Generalist Vision Learners cs.CV · 2026-04-22 · unreviewed · ref 1 · 2 links · internal anchor
