pith.

hub Mixed citations

BEiT: BERT Pre-Training of Image Transformers

Mixed citation behavior. Most common role is background (42%).

53 Pith papers citing it
Background 42% of classified citations
abstract

We introduce a self-supervised vision representation model BEiT, which stands for Bidirectional Encoder representation from Image Transformers. Following BERT developed in the natural language processing area, we propose a masked image modeling task to pretrain vision Transformers. Specifically, each image has two views in our pre-training, i.e., image patches (such as 16x16 pixels), and visual tokens (i.e., discrete tokens). We first "tokenize" the original image into visual tokens. Then we randomly mask some image patches and feed them into the backbone Transformer. The pre-training objective is to recover the original visual tokens based on the corrupted image patches. After pre-training BEiT, we directly fine-tune the model parameters on downstream tasks by appending task layers upon the pretrained encoder. Experimental results on image classification and semantic segmentation show that our model achieves competitive results with previous pre-training methods. For example, base-size BEiT achieves 83.2% top-1 accuracy on ImageNet-1K, significantly outperforming from-scratch DeiT training (81.8%) with the same setup. Moreover, large-size BEiT obtains 86.3% only using ImageNet-1K, even outperforming ViT-L with supervised pre-training on ImageNet-22K (85.2%). The code and pretrained models are available at https://aka.ms/beit.
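The masked image modeling objective described in the abstract can be illustrated with a small, dependency-free sketch. It is not the authors' implementation: the function names, the 224-pixel image size, the 0.4 mask ratio, the toy vocabulary of 8 visual tokens, and uniform random masking are all assumptions chosen for illustration. Only the 16x16 patch size and the idea of predicting visual tokens at masked patches come from the abstract.

```python
import math
import random


def num_patches(image_size, patch_size):
    """Number of non-overlapping square patches in a square image."""
    side = image_size // patch_size
    return side * side


def random_mask(n_patches, mask_ratio, rng):
    """Mark round(mask_ratio * n_patches) positions as masked.

    Uniform sampling is an assumption made for this sketch.
    """
    n_masked = round(mask_ratio * n_patches)
    masked = set(rng.sample(range(n_patches), n_masked))
    return [i in masked for i in range(n_patches)]


def masked_token_loss(logits, visual_tokens, mask):
    """Mean cross-entropy between predictions and visual tokens,
    computed over masked positions only."""
    total, count = 0.0, 0
    for row, target, is_masked in zip(logits, visual_tokens, mask):
        if not is_masked:
            continue
        m = max(row)  # subtract the max for a stable log-sum-exp
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[target]
        count += 1
    return total / count if count else 0.0


# Toy setup (all values hypothetical except the 16-pixel patch size).
rng = random.Random(0)
vocab_size = 8
n = num_patches(224, 16)
tokens = [rng.randrange(vocab_size) for _ in range(n)]  # stand-in for tokenizer output
mask = random_mask(n, 0.4, rng)
# Stand-in for the Transformer's output: uniform logits at every position.
logits = [[0.0] * vocab_size for _ in range(n)]
loss = masked_token_loss(logits, tokens, mask)
```

With uniform logits the loss equals `log(vocab_size)`, the value a model that has learned nothing would produce. A real pipeline would replace `tokens` with the output of an image tokenizer and `logits` with the backbone Transformer's predictions on the corrupted patches.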

hub tools

citation-role summary

background 7 · method 5

citation-polarity summary

representative citing papers

Masked Autoencoders Are Scalable Vision Learners

cs.CV · 2021-11-11 · accept · novelty 8.0

Masked autoencoders with asymmetric encoder-decoder and 75% masking ratio enable scalable self-supervised pre-training of vision transformers, achieving 87.8% ImageNet-1K accuracy with ViT-Huge using only unlabeled data.

Neural Scaling Laws for Jet Generation

hep-ph · 2026-05-27 · unverdicted · novelty 7.0

Scaling laws hold logarithmically for model size in autoregressive jet generation, with next-token loss correlating to physical metrics via sliced Wasserstein distance, but show weaker scaling for dataset size and compute due to rapid saturation.

Rethink MAE with Linear Time-Invariant Dynamics

cs.CV · 2026-04-29 · unverdicted · novelty 7.0

Token order in frozen visual representations is exploitable via SSM-based LTI probes, revealing pre-training-dependent heterogeneity that fixed pooling misses.

Recurrent Video Masked Autoencoders

cs.CV · 2025-12-15 · unverdicted · novelty 7.0

RVM uses recurrent computation inside a masked autoencoder to learn video representations that match or exceed prior video and image models on classification, tracking, and dense spatial tasks with up to 30x better parameter efficiency.

Segment Anything

cs.CV · 2023-04-05 · unverdicted · novelty 7.0

A promptable model trained on 1B masks achieves competitive zero-shot segmentation performance across tasks and is released publicly with its dataset.

iBOT: Image BERT Pre-Training with Online Tokenizer

cs.CV · 2021-11-15 · unverdicted · novelty 7.0

iBOT achieves 82.3% linear probing accuracy and 87.8% fine-tuning accuracy on ImageNet-1K using masked image modeling with a jointly trained online tokenizer.

Tight Clusters Make Specialized Experts

cs.LG · 2025-02-21 · unverdicted · novelty 6.0

Introduces Adaptive Clustering router for MoE models that scales features to identify tight expert clusters, yielding faster convergence, robustness to corruption, and performance gains.

YOLOv12: Attention-Centric Real-Time Object Detectors

cs.CV · 2025-02-18 · unverdicted · novelty 6.0

YOLOv12 is a new attention-based real-time object detector that reports higher accuracy than YOLOv10, YOLOv11, and RT-DETR variants at comparable or better speed and efficiency.

LIMO: Less is More for Reasoning

cs.CL · 2025-02-05 · unverdicted · novelty 6.0

LIMO achieves 63.3% on AIME24 and 95.6% on MATH500 via supervised fine-tuning on roughly 1% of the data used by prior models, supporting the claim that minimal strategic examples suffice when pre-training has already encoded domain knowledge.

citing papers explorer

Showing 9 of 9 citing papers after filters.

  • Recurrent Video Masked Autoencoders cs.CV · 2025-12-15 · unverdicted · none · ref 6 · internal anchor

    RVM uses recurrent computation inside a masked autoencoder to learn video representations that match or exceed prior video and image models on classification, tracking, and dense spatial tasks with up to 30x better parameter efficiency.

  • Adversarial Video Promotion Against Text-to-Video Retrieval cs.CV · 2025-08-09 · unverdicted · none · ref 3 · internal anchor

    Pioneers ViPro, the first attack to adversarially promote videos in text-to-video retrieval, using Modal Refinement to improve black-box transferability across multiple targets.

  • AaSP: Aliasing-aware Self-Supervised Pre-Training for Audio Spectrogram Transformers cs.SD · 2025-12-03 · unverdicted · none · ref 40 · internal anchor

    AaSP learns aliasing-stable audio representations by augmenting patch tokens with adaptive subband features from alias-prone bands and using teacher-student masked modeling plus multi-mask contrastive regularization, reaching SOTA on AS-20K, ESC-50, and NSynth under fine-tuning.

  • Tight Clusters Make Specialized Experts cs.LG · 2025-02-21 · unverdicted · none · ref 3 · internal anchor

    Introduces Adaptive Clustering router for MoE models that scales features to identify tight expert clusters, yielding faster convergence, robustness to corruption, and performance gains.

  • YOLOv12: Attention-Centric Real-Time Object Detectors cs.CV · 2025-02-18 · unverdicted · none · ref 1 · internal anchor

    YOLOv12 is a new attention-based real-time object detector that reports higher accuracy than YOLOv10, YOLOv11, and RT-DETR variants at comparable or better speed and efficiency.

  • LIMO: Less is More for Reasoning cs.CL · 2025-02-05 · unverdicted · none · ref 155 · internal anchor

    LIMO achieves 63.3% on AIME24 and 95.6% on MATH500 via supervised fine-tuning on roughly 1% of the data used by prior models, supporting the claim that minimal strategic examples suffice when pre-training has already encoded domain knowledge.

  • PaCo-FR: Patch-Pixel Aligned End-to-End Codebook Learning for Facial Representation Pre-training cs.CV · 2025-08-13 · unverdicted · none · ref 2 · internal anchor

    PaCo-FR introduces a structured-masking and patch-codebook framework for unsupervised facial representation pre-training that claims state-of-the-art results on multiple facial tasks after training on only 2 million unlabeled images.

  • Towards Robust and Realistic Human Pose Estimation via WiFi Signals cs.CV · 2025-01-16 · unverdicted · none · ref 1 · internal anchor

    DT-Pose reformulates WiFi HPE as domain-consistent representation learning via temporal contrastive masked pretraining plus hybrid topology-constrained decoding to yield more accurate and realistic 2D/3D poses.

  • Frabjous: Deep Learning Fast Radio Burst Morphologies astro-ph.IM · 2025-07-20 · unverdicted · none · ref 7 · internal anchor

    Frabjous applies deep learning to classify FRB morphologies into five classes at 55% accuracy by augmenting limited real data with simulations.