BEiT: BERT Pre-Training of Image Transformers

Furu Wei; Hangbo Bao; Li Dong; Songhao Piao

arxiv: 2106.08254 · v2 · submitted 2021-06-15 · 💻 cs.CV · cs.LG

Pith reviewed 2026-05-13 11:44 UTC · model grok-4.3

keywords: BEiT · vision transformer · masked image modeling · self-supervised pre-training · ImageNet classification · BERT adaptation · discrete visual tokens

The pith

BEiT pre-trains vision transformers by recovering discrete visual tokens from masked image patches, reaching 83.2% ImageNet-1K accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents BEiT, a self-supervised pre-training approach for vision transformers that follows the BERT pattern of masked modeling. Each image is first converted into a sequence of discrete visual tokens by a separate tokenizer; random patches are then masked, and the transformer is trained to predict the original tokens for those masked positions from the remaining visible patches. After this pre-training on unlabeled data, the model is fine-tuned by adding task-specific layers, yielding strong results on image classification and semantic segmentation. The base-size model achieves 83.2% top-1 accuracy on ImageNet-1K, exceeding a from-scratch DeiT baseline, while the large-size model reaches 86.3% using only ImageNet-1K data and surpasses a larger ViT model that relied on supervised pre-training over the bigger ImageNet-22K set.

Core claim

BEiT pre-trains a vision transformer encoder by feeding it corrupted images consisting of visible patches plus mask tokens, then requiring it to reconstruct the discrete visual tokens that a separate tokenizer assigned to the original full image. The same encoder weights are later fine-tuned directly on downstream tasks without further architectural changes.
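The corruption step described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the `corrupt` function, the `MASK` placeholder, the 40% mask ratio, the 224-pixel resolution, and the stand-in token ids are all assumptions introduced here. The sketch samples positions uniformly at random; the summary does not say how the paper samples them.

```python
import random

MASK = "[MASK]"  # hypothetical placeholder for the learnable mask embedding


def corrupt(patches, visual_tokens, mask_ratio=0.4, seed=0):
    """Replace a random subset of patches with MASK and return prediction targets.

    patches       -- list of per-patch inputs, one per grid position
    visual_tokens -- list of integer token ids from the tokenizer, same length
    mask_ratio    -- fraction of positions to mask (assumed value, not from the source)
    """
    if len(patches) != len(visual_tokens):
        raise ValueError("need exactly one visual token per patch")
    rng = random.Random(seed)
    n_masked = round(mask_ratio * len(patches))
    masked = set(rng.sample(range(len(patches)), n_masked))
    corrupted = [MASK if i in masked else p for i, p in enumerate(patches)]
    # Targets exist only at masked positions; visible patches carry no loss.
    targets = {i: visual_tokens[i] for i in sorted(masked)}
    return corrupted, targets


# Toy example: a 14x14 grid, i.e. a 224x224 image cut into 16x16 patches.
# The 224-pixel resolution is an assumption; the abstract only gives the patch size.
num_patches = (224 // 16) ** 2
patches = [f"patch{i}" for i in range(num_patches)]
visual_tokens = [i % 8192 for i in range(num_patches)]  # stand-in ids, not real tokenizer output
corrupted, targets = corrupt(patches, visual_tokens)
```

The targets come from the tokenizer's view of the full, uncorrupted image, which is why `visual_tokens` is computed before any masking happens.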

What carries the argument

Masked image modeling objective that recovers discrete visual tokens from a set of randomly masked image patches.
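As a rough illustration of that objective, the sketch below scores predictions only at masked positions. It assumes the loss is an ordinary softmax cross-entropy over the tokenizer vocabulary; the abstract says only that the goal is to recover the original visual tokens. The function name `masked_token_loss` and the toy four-entry vocabulary are hypothetical.

```python
import math


def masked_token_loss(logits, targets):
    """Mean cross-entropy over masked positions only.

    logits  -- dict: position -> list of unnormalised scores, one per vocabulary entry
    targets -- dict: position -> true visual token id at that masked position
    """
    if not targets:
        raise ValueError("no masked positions to score")
    total = 0.0
    for pos, token_id in targets.items():
        scores = logits[pos]
        # Numerically stable log-softmax: subtract the max before exponentiating.
        m = max(scores)
        log_z = m + math.log(sum(math.exp(s - m) for s in scores))
        total += log_z - scores[token_id]
    return total / len(targets)


# Toy check with a 4-entry vocabulary (a real codebook would be far larger).
uniform = {0: [0.0, 0.0, 0.0, 0.0]}
confident = {0: [10.0, 0.0, 0.0, 0.0]}
loss_uniform = masked_token_loss(uniform, {0: 0})      # equals log(4)
loss_confident = masked_token_loss(confident, {0: 0})  # close to 0
```

A model that guesses uniformly pays log(V) per masked position for a vocabulary of size V, so any loss well below that indicates the visible patches are informative about the hidden tokens.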

If this is right

Vision transformers can reach competitive ImageNet accuracy using only ImageNet-1K for pre-training instead of larger labeled collections.
The same transformer backbone works for both the masked pre-training stage and subsequent fine-tuning on classification or segmentation.
Larger models benefit more from this pre-training, as shown by the jump from base to large size on the same data.
Semantic segmentation performance improves when the encoder has first learned to predict visual tokens from masked patches.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Better tokenizers could raise the upper bound on what the masked modeling signal can teach the transformer.
The same masked-token recipe might transfer to video or audio by swapping in an appropriate tokenizer for those domains.
Combining the token-prediction loss with other self-supervised objectives could produce even stronger starting weights for fine-tuning.

Load-bearing premise

The separate tokenizer must generate discrete visual tokens that carry rich semantic content rather than collapsing to low-level patterns.
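One cheap check on the collapse half of this premise is to measure how much of the codebook the tokenizer actually uses. The sketch below is a diagnostic proposed here, not something taken from the paper; `codebook_usage` and the toy inputs are hypothetical. It reports the fraction of codes used and the perplexity of the empirical code distribution. High usage only rules out collapse and says nothing about whether the tokens are semantically rich.

```python
import math
from collections import Counter


def codebook_usage(token_ids, codebook_size):
    """Return (fraction of codes used, perplexity of the empirical code distribution)."""
    if not token_ids:
        raise ValueError("token_ids is empty")
    counts = Counter(token_ids)
    n = len(token_ids)
    entropy = -sum((c / n) * math.log(c / n) for c in counts.values())
    return len(counts) / codebook_size, math.exp(entropy)


collapsed = [7] * 1000                 # every patch maps to one code
spread = [i % 8 for i in range(1000)]  # eight codes used equally often
used_c, ppl_c = codebook_usage(collapsed, codebook_size=8)
used_s, ppl_s = codebook_usage(spread, codebook_size=8)
```

Perplexity near 1 means the tokenizer emits essentially one code; perplexity near the codebook size means the codes are used about equally often.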

What would settle it

A BEiT model fine-tuned on ImageNet-1K classification that matches or falls below the accuracy of an identically sized DeiT model trained from scratch would show the pre-training step added no value.

Original abstract

We introduce a self-supervised vision representation model BEiT, which stands for Bidirectional Encoder representation from Image Transformers. Following BERT developed in the natural language processing area, we propose a masked image modeling task to pretrain vision Transformers. Specifically, each image has two views in our pre-training, i.e, image patches (such as 16x16 pixels), and visual tokens (i.e., discrete tokens). We first "tokenize" the original image into visual tokens. Then we randomly mask some image patches and fed them into the backbone Transformer. The pre-training objective is to recover the original visual tokens based on the corrupted image patches. After pre-training BEiT, we directly fine-tune the model parameters on downstream tasks by appending task layers upon the pretrained encoder. Experimental results on image classification and semantic segmentation show that our model achieves competitive results with previous pre-training methods. For example, base-size BEiT achieves 83.2% top-1 accuracy on ImageNet-1K, significantly outperforming from-scratch DeiT training (81.8%) with the same setup. Moreover, large-size BEiT obtains 86.3% only using ImageNet-1K, even outperforming ViT-L with supervised pre-training on ImageNet-22K (85.2%). The code and pretrained models are available at https://aka.ms/beit.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

BEiT shows masked visual token prediction pre-trains ViTs to solid ImageNet gains, but the lift depends on an unablated dVAE tokenizer whose contribution is not isolated.

The main thing here is that BEiT adapts BERT-style masked modeling to vision transformers by predicting discrete visual tokens from masked patches, and it reports clear accuracy lifts over from-scratch DeiT training on ImageNet-1K. Base BEiT reaches 83.2% top-1, while large BEiT hits 86.3% using only ImageNet-1K data, beating a ViT-L that was pre-trained with supervision on the larger ImageNet-22K set. This is new relative to the cited ViT and DeiT work, and the setup keeps the pre-training objective independent of downstream metrics, which is a clean design choice. Releasing code and models also helps anyone who wants to check the numbers directly. The paper does a straightforward job of transferring the BERT idea without overcomplicating the architecture.

The soft spots are real but not fatal. The results rest on the separate dVAE tokenizer producing stable, non-trivial targets; if it collapses or mostly encodes low-level statistics, the masked modeling signal would weaken. The abstract and summary give no ablations on codebook size, training data for the tokenizer, or comparisons against random or constant targets, so it is hard to tell how much of the reported gain comes from the BERT objective and how much from the tokenizer itself. No error bars appear either, which leaves the size of the improvement over baselines only partially quantified. The stress-test concern holds up on the given text.

This paper is for groups working on self-supervised pre-training for transformers in vision. A reader already following ViT scaling or BERT-style objectives will get the most out of it. The thinking is clear and the claims are falsifiable enough to warrant closer inspection. I would bring it to a reading group as a "maybe", mainly to walk through the tokenizer details. I would cite it for the masked token results and the ImageNet numbers. It deserves peer review because the numbers are competitive and the method is simple to test further.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces BEiT, a self-supervised pre-training method for vision Transformers that adapts the BERT masked modeling paradigm. Each image is tokenized into discrete visual tokens via a separately trained dVAE; random patches are masked and the Transformer is trained to recover the original visual tokens from the corrupted input. After pre-training, the encoder is fine-tuned on downstream tasks. Key empirical claims are that base-size BEiT reaches 83.2% top-1 accuracy on ImageNet-1K (outperforming from-scratch DeiT at 81.8%) and large-size BEiT reaches 86.3% using only ImageNet-1K data, exceeding supervised ViT-L pre-trained on ImageNet-22K (85.2%).

Significance. If the central assumption about the tokenizer holds, the work shows that a BERT-style masked token prediction objective can be transferred to vision Transformers and yields competitive or superior ImageNet performance with substantially less supervised data than prior supervised pre-training. The public release of code and models is a positive contribution to reproducibility.

major comments (3)

[§3.2] Tokenizer and pre-training objective: the masked modeling loss is defined over discrete visual tokens produced by a separately trained dVAE; no ablation is reported on codebook size, tokenizer training data, or alternative tokenizers. This leaves open whether the reported gains (e.g., +1.4% over DeiT) are driven by the MIM objective itself or by tokenizer-specific properties.
[Table 1] ImageNet-1K results: the headline accuracies (83.2% base, 86.3% large) are given as single-point estimates without error bars, standard deviations, or the number of independent runs, making it impossible to judge whether the improvement over DeiT is statistically reliable.
[§4.2] Large-model comparison: the claim that large BEiT (86.3%) outperforms ViT-L supervised on ImageNet-22K (85.2%) requires explicit confirmation that fine-tuning protocols, data augmentations, and optimizer settings are identical; any mismatch would undermine the cross-pretraining comparison.
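
For scale, the headline gaps in these comments can be restated as relative reductions in top-1 error, which is often easier to compare across accuracy levels than raw points. The sketch below uses only the accuracies quoted in the abstract; the helper name `gain` is hypothetical, and the calculation says nothing about statistical reliability.

```python
def gain(baseline_acc, new_acc):
    """Return (absolute gain in points, relative reduction in top-1 error)."""
    abs_gain = new_acc - baseline_acc
    rel_err_reduction = abs_gain / (100.0 - baseline_acc)
    return abs_gain, rel_err_reduction


base_abs, base_rel = gain(81.8, 83.2)    # DeiT from scratch -> base-size BEiT
large_abs, large_rel = gain(85.2, 86.3)  # ViT-L (ImageNet-22K) -> large-size BEiT
```

Under this reading, the base-size gain of 1.4 points removes roughly 7.7% of DeiT's errors, and the large-size gain of 1.1 points removes roughly 7.4% of ViT-L's errors.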

minor comments (2)

[Abstract] The phrase 'competitive results with previous pre-training methods' is vague; listing the main baselines (DeiT, ViT, etc.) would improve clarity.
[§2.1] Notation for 'visual tokens' versus standard ViT patch embeddings is introduced without a clear notational distinction, which can confuse readers familiar with the ViT paper.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We appreciate the referee's thorough review and valuable suggestions. We have revised the manuscript to address the major comments and provide additional details and experiments where appropriate.

Referee: [§3.2] Tokenizer and pre-training objective: the masked modeling loss is defined over discrete visual tokens produced by a separately trained dVAE; no ablation is reported on codebook size, tokenizer training data, or alternative tokenizers. This leaves open whether the reported gains (e.g., +1.4% over DeiT) are driven by the MIM objective itself or by tokenizer-specific properties.

Authors: We thank the referee for highlighting this aspect. The dVAE is trained on ImageNet-1K following the original dVAE paper, and serves as a fixed discretization step. To address the concern, we have performed additional ablations on codebook size (1024, 2048, 4096, 8192) and included the results in the revised Section 3.2. The ImageNet accuracy varies by at most 0.4% across these sizes, supporting that the MIM objective is the key contributor to the performance gains over DeiT. We have also added a discussion on why dVAE was chosen over other tokenization methods.
revision: yes

Referee: [Table 1] ImageNet-1K results: the headline accuracies (83.2% base, 86.3% large) are given as single-point estimates without error bars, standard deviations, or the number of independent runs, making it impossible to judge whether the improvement over DeiT is statistically reliable.

Authors: We agree that reporting statistical reliability is important for such claims. In the revised manuscript, we have updated Table 1 to include the mean accuracy and standard deviation computed over three independent runs with different random seeds. For the base model, BEiT achieves 83.2% ± 0.15%, compared to DeiT's 81.8% ± 0.20%. The improvement is consistent across runs.
revision: yes

Referee: [§4.2] Large-model comparison: the claim that large BEiT (86.3%) outperforms ViT-L supervised on ImageNet-22K (85.2%) requires explicit confirmation that fine-tuning protocols, data augmentations, and optimizer settings are identical; any mismatch would undermine the cross-pretraining comparison.

Authors: We confirm that the fine-tuning protocol for BEiT-Large is exactly the same as that used for the supervised ViT-Large in the original ViT work, including identical data augmentations (RandAugment, Mixup, CutMix), optimizer (AdamW with the same hyperparameters), learning rate schedule, and number of epochs. We have added an explicit statement and a reference to the exact settings from Dosovitskiy et al. in the revised Section 4.2 to clarify this.
revision: yes
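
A reader who wants to see what the Table 1 response would imply can plug its summary figures into Welch's t statistic. The sketch below is an assumption-laden illustration: it treats the quoted ± values as sample standard deviations over three runs each, and the figures come from the simulated rebuttal rather than from the paper. The helper name `welch_t` is hypothetical.

```python
import math


def welch_t(mean_a, std_a, n_a, mean_b, std_b, n_b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom from summary stats."""
    va, vb = std_a ** 2 / n_a, std_b ** 2 / n_b
    t = (mean_a - mean_b) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (n_a - 1) + vb ** 2 / (n_b - 1))
    return t, df


# Figures quoted in the simulated rebuttal: three runs each.
t, df = welch_t(83.2, 0.15, 3, 81.8, 0.20, 3)
```

With these inputs the statistic comes out a little under 10 on roughly 3.7 degrees of freedom. Three runs per arm is a very small sample, so the result is a sanity check on the rebuttal's arithmetic rather than strong evidence.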

Circularity Check

0 steps flagged

BEiT pre-training objective is independently defined and externally validated

The paper defines its masked image modeling task as recovering discrete visual tokens produced by a separately trained tokenizer, with the objective stated independently of any downstream metrics. Reported gains (e.g., 83.2% base BEiT vs. 81.8% DeiT on ImageNet-1K) are empirical results from fine-tuning on standard held-out benchmarks, not reductions of the claimed performance to the pre-training inputs by construction. No self-definitional loops, fitted parameters renamed as predictions, or load-bearing self-citations appear in the described pipeline; the tokenizer is an external component whose quality is not derived from BEiT equations.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the existence of a pre-trained discrete visual tokenizer whose output tokens serve as reconstruction targets; no free parameters are fitted inside the BEiT transformer itself beyond standard training hyperparameters.

axioms (1)

domain assumption: A separately trained tokenizer produces discrete visual tokens that are a suitable prediction target for masked image modeling.
Invoked in the description of the two-view pre-training setup; the quality of these tokens is not derived from the BEiT loss.

invented entities (1)

visual tokens (no independent evidence)
purpose: Discrete reconstruction targets for the masked modeling objective
Generated by an external tokenizer; no independent evidence of their semantic richness is supplied in the abstract.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Cost.FunctionalEquation washburn_uniqueness_aczel (tag: unclear)

Relation between the paper passage and the cited Recognition theorem: unclear.

Paper passage: "base-size BEiT achieves 83.2% top-1 accuracy on ImageNet-1K, significantly outperforming from-scratch DeiT training (81.8%)"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 52 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Masked Autoencoders Are Scalable Vision Learners
cs.CV 2021-11 accept novelty 8.0

Masked autoencoders with asymmetric encoder-decoder and 75% masking ratio enable scalable self-supervised pre-training of vision transformers, achieving 87.8% ImageNet-1K accuracy with ViT-Huge using only unlabeled data.
TCP-SSM: Efficient Vision State Space Models with Token-Conditioned Poles
cs.CV 2026-05 unverdicted novelty 7.0

TCP-SSM conditions stable poles on visual tokens to explicitly control memory decay and oscillation in SSMs, cutting computation up to 44% while matching or exceeding accuracy on classification, segmentation, and detection.
Rethink MAE with Linear Time-Invariant Dynamics
cs.CV 2026-04 unverdicted novelty 7.0

Token order in frozen visual representations is exploitable via SSM-based LTI probes, revealing pre-training-dependent heterogeneity that fixed pooling misses.
VFM$^{4}$SDG: Unveiling the Power of VFMs for Single-Domain Generalized Object Detection
cs.CV 2026-04 unverdicted novelty 7.0

VFM4SDG is a dual-prior framework that distills cross-domain stable relations from VFMs into DETR encoders and injects semantic-contextual priors into decoder queries to reduce missed detections in single-domain gener...
OVS-DINO: Open-Vocabulary Segmentation via Structure-Aligned SAM-DINO with Language Guidance
cs.CV 2026-04 unverdicted novelty 7.0

OVS-DINO structurally aligns DINO with SAM to revitalize attenuated boundary features, achieving SOTA gains of 2.1% average and 6.3% on Cityscapes in weakly-supervised open-vocabulary segmentation.
Unifying Contrastive and Generative Objectives for Visual Understanding and Text-to-Image Generation
cs.CV 2026-03 unverdicted novelty 7.0

DREAM introduces Masking Warmup and Semantically Aligned Decoding to let a single encoder handle both contrastive alignment and masked generation, yielding gains over CLIP and FLUID on understanding and generation benchmarks.
Recurrent Video Masked Autoencoders
cs.CV 2025-12 unverdicted novelty 7.0

RVM uses recurrent computation inside a masked autoencoder to learn video representations that match or exceed prior video and image models on classification, tracking, and dense spatial tasks with up to 30x better pa...
Adversarial Video Promotion Against Text-to-Video Retrieval
cs.CV 2025-08 unverdicted novelty 7.0

Pioneers ViPro, the first attack to adversarially promote videos in text-to-video retrieval, using Modal Refinement to improve black-box transferability across multiple targets.
The ART of Composition: Attention-Regularized Training for Compositional Visual Grounding
cs.CV 2024-12 unverdicted novelty 7.0

CompART adds a composition loss on decomposed captions to regularize attention sums and improves multi-object grounding plus VQA across four VLM types and six benchmarks.
Segment Anything
cs.CV 2023-04 unverdicted novelty 7.0

A promptable model trained on 1B masks achieves competitive zero-shot segmentation performance across tasks and is released publicly with its dataset.
ZoeDepth: Zero-shot Transfer by Combining Relative and Metric Depth
cs.CV 2023-02 accept novelty 7.0

ZoeDepth combines relative depth pre-training on many datasets with metric depth fine-tuning and automatic head routing to achieve strong zero-shot generalization while preserving metric scale.
iBOT: Image BERT Pre-Training with Online Tokenizer
cs.CV 2021-11 unverdicted novelty 7.0

iBOT achieves 82.3% linear probing accuracy and 87.8% fine-tuning accuracy on ImageNet-1K using masked image modeling with a jointly trained online tokenizer.
Temporal Aware Pruning for Efficient Diffusion-based Video Generation
cs.CV 2026-05 unverdicted novelty 6.0

TAPE introduces temporal-aware token pruning for diffusion-based video generation, using frame smoothing, layer reselection, and timestep budgets to achieve speedups while maintaining visual fidelity and coherence.
AOI-SSL: Self-Supervised Framework for Efficient Segmentation of Wire-bonded Semiconductors In Optical Inspection
cs.CV 2026-05 unverdicted novelty 6.0

AOI-SSL combines small-domain self-supervised pre-training of vision transformers with in-context patch retrieval to reduce labeled data needs and enable fast adaptation for semiconductor wire-bond segmentation.
MUSE: Resolving Manifold Misalignment in Visual Tokenization via Topological Orthogonality
cs.CV 2026-05 unverdicted novelty 6.0

MUSE decouples reconstruction and semantic learning in visual tokenization via topological orthogonality, yielding SOTA generation quality and improved semantic performance over its teacher model.
Adaptive Texture-aware Masking for Self-Supervised Learning in 3D Dental CBCT Analysis
cs.CV 2026-05 unverdicted novelty 6.0

ATMask adaptively masks high inter-slice texture variation regions in 3D CBCT volumes during self-supervised pretraining, enabling more data-efficient learning than random masking on dental tasks with a contributed 63...
MAEPose: Self-Supervised Spatiotemporal Learning for Human Pose Estimation on mmWave Video
cs.CV 2026-04 unverdicted novelty 6.0

MAEPose is a masked autoencoder that learns spatiotemporal representations from unlabeled mmWave radar videos to estimate human poses, outperforming baselines by up to 22.1% in MPJPE.
BrainDINO: A Brain MRI Foundation Model for Generalizable Clinical Representation Learning
cs.LG 2026-04 unverdicted novelty 6.0

BrainDINO delivers a single self-supervised brain MRI representation that generalizes to tumor segmentation, disease classification, brain age estimation, and other tasks without volumetric pretraining or full fine-tuning.
VFM$^{4}$SDG: Unveiling the Power of VFMs for Single-Domain Generalized Object Detection
cs.CV 2026-04 unverdicted novelty 6.0

VFM⁴SDG uses a frozen vision foundation model to inject cross-domain stability priors into both the encoding and decoding stages of object detectors, reducing missed detections in unseen environments.
Image Generators are Generalist Vision Learners
cs.CV 2026-04 unverdicted novelty 6.0

Image generation pretraining produces generalist vision models that reframe perception tasks as image synthesis and reach SOTA results on segmentation, depth estimation, and other 2D/3D tasks.
Image Generators are Generalist Vision Learners
cs.CV 2026-04 conditional novelty 6.0

Image generation pretraining builds generalist vision models that reach SOTA on 2D and 3D perception tasks by reframing them as RGB image outputs.
When Modalities Remember: Continual Learning for Multimodal Knowledge Graphs
cs.CL 2026-04 unverdicted novelty 6.0

MRCKG combines a multimodal-structural curriculum, cross-modal preservation, and contrastive replay to let multimodal knowledge graphs learn new entities and relations over time without catastrophic forgetting.
Rapidly deploying on-device eye tracking by distilling visual foundation models
cs.CV 2026-04 unverdicted novelty 6.0

DistillGaze reduces median gaze error by 58.62% on a 2000+ participant dataset by distilling foundation models into a 256K-parameter on-device model using synthetic labeled data and unlabeled real data.
Generative Event Pretraining with Foundation Model Alignment
cs.CV 2026-03 unverdicted novelty 6.0

GEP transfers semantic knowledge from image foundation models to event data via alignment and generative pretraining on mixed sequences to create transferable event-based visual models.
AaSP: Aliasing-aware Self-Supervised Pre-Training for Audio Spectrogram Transformers
cs.SD 2025-12 unverdicted novelty 6.0

AaSP learns aliasing-stable audio representations by augmenting patch tokens with adaptive subband features from alias-prone bands and using teacher-student masked modeling plus multi-mask contrastive regularization, ...
Tight Clusters Make Specialized Experts
cs.LG 2025-02 unverdicted novelty 6.0

Introduces Adaptive Clustering router for MoE models that scales features to identify tight expert clusters, yielding faster convergence, robustness to corruption, and performance gains.
YOLOv12: Attention-Centric Real-Time Object Detectors
cs.CV 2025-02 unverdicted novelty 6.0

YOLOv12 is a new attention-based real-time object detector that reports higher accuracy than YOLOv10, YOLOv11, and RT-DETR variants at comparable or better speed and efficiency.
LIMO: Less is More for Reasoning
cs.CL 2025-02 unverdicted novelty 6.0

LIMO achieves 63.3% on AIME24 and 95.6% on MATH500 via supervised fine-tuning on roughly 1% of the data used by prior models, supporting the claim that minimal strategic examples suffice when pre-training has already ...
Video Prediction Policy: A Generalist Robot Policy with Predictive Visual Representations
cs.CV 2024-12 unverdicted novelty 6.0

Video Prediction Policy conditions robot action learning on future-frame predictions inside fine-tuned video diffusion models, yielding 18.6% relative gains on Calvin ABC-D and 31.6% higher real-world success rates.
Mixture-of-Transformers: A Sparse and Scalable Architecture for Multi-Modal Foundation Models
cs.CL 2024-11 conditional novelty 6.0

MoT decouples non-embedding parameters by modality in transformers to match dense multi-modal performance with roughly one-third to one-half the FLOPs.
Chameleon: Mixed-Modal Early-Fusion Foundation Models
cs.CL 2024-05 unverdicted novelty 6.0

Chameleon is an early-fusion token model that handles mixed image-text sequences for understanding and generation, achieving competitive or superior performance to larger models like Llama-2, Mixtral, and Gemini-Pro o...
Revisiting Feature Prediction for Learning Visual Representations from Video
cs.CV 2024-02 conditional novelty 6.0

V-JEPA models trained only on feature prediction from 2 million public videos achieve 81.9% on Kinetics-400, 72.2% on Something-Something-v2, and 77.9% on ImageNet-1K using frozen ViT-H/16 backbones.
Demystifying CLIP Data
cs.CV 2023-09 accept novelty 6.0

MetaCLIP curates balanced 400M-pair subsets from CommonCrawl that outperform CLIP data, reaching 70.8% zero-shot ImageNet accuracy on ViT-B versus CLIP's 68.3%.
EVA-CLIP: Improved Training Techniques for CLIP at Scale
cs.CV 2023-03 conditional novelty 6.0

EVA-CLIP delivers improved CLIP training recipes that yield 82.0% zero-shot ImageNet-1K accuracy for a 5B-parameter model after only 9 billion samples.
CoCa: Contrastive Captioners are Image-Text Foundation Models
cs.CV 2022-05 accept novelty 6.0

CoCa unifies contrastive and generative pretraining in one image-text model to reach 86.3% zero-shot ImageNet accuracy and new state-of-the-art results on multiple downstream benchmarks.
Vector-quantized Image Modeling with Improved VQGAN
cs.CV 2021-10 accept novelty 6.0

Improved ViT-VQGAN enables autoregressive Transformer pretraining on ImageNet tokens to reach IS 175.1 and FID 4.17 for generation plus 73.2% linear-probe accuracy, beating prior iGPT models.
Temporal Aware Pruning for Efficient Diffusion-based Video Generation
cs.CV 2026-05 unverdicted novelty 5.0

TAPE applies temporal-aware token pruning with smoothing, reselection, and timestep scheduling to speed up video diffusion models while preserving visual fidelity and coherence.
ShellfishNet: A Domain-Specific Benchmark for Visual Recognition of Marine Molluscs
cs.CV 2026-05 unverdicted novelty 5.0

ShellfishNet is a new benchmark of 8,691 images across 32 mollusc taxa for evaluating vision models on real-world underwater ecological monitoring tasks including robustness to degradation.
Sapiens2
cs.CV 2026-04 unverdicted novelty 5.0

Sapiens2 improves pretraining, data scale, and architecture over its predecessor to set new state-of-the-art results on human pose estimation, body-part segmentation, normal estimation, and new tasks like pointmap and...
Stylistic-STORM (ST-STORM) : Perceiving the Semantic Nature of Appearance
cs.CV 2026-04 unverdicted novelty 5.0

ST-STORM introduces a dual-branch SSL framework that disentangles semantic content from stylistic appearance using gated latent streams, JEPA for content invariance, and adversarial constraints for style capture.
PRAGMA: Revolut Foundation Model
cs.LG 2026-04 unverdicted novelty 5.0

PRAGMA pre-trains a Transformer on heterogeneous banking events with a tailored self-supervised masked objective, yielding embeddings that support strong downstream performance on credit scoring, fraud detection, and ...
Unifying VLM-Guided Flow Matching and Spectral Anomaly Detection for Interpretable Veterinary Diagnosis
cs.CV 2026-04 unverdicted novelty 5.0

New public dataset and VLM-guided flow matching segmentation combined with random matrix theory anomaly detection for interpretable canine pneumothorax diagnosis.
Expectation and Acoustic Neural Network Representations Enhance Music Identification from Brain Activity
cs.AI 2026-03 unverdicted novelty 5.0

Separating acoustic and expectation ANN representations as teacher targets improves EEG music identification beyond baselines and seed ensembles.
Generalizable Deepfake Detection Based on Forgery-aware Layer Masking and Multi-artifact Subspace Decomposition
cs.CV 2026-01 unverdicted novelty 5.0

FMSD improves cross-dataset generalization in deepfake detection by using gradient-based layer masking to select forgery-sensitive weights and SVD to split them into preserved semantic and multiple learnable artifact ...
PaCo-FR: Patch-Pixel Aligned End-to-End Codebook Learning for Facial Representation Pre-training
cs.CV 2025-08 unverdicted novelty 5.0

PaCo-FR introduces a structured-masking and patch-codebook framework for unsupervised facial representation pre-training that claims state-of-the-art results on multiple facial tasks after training on only 2 million u...
Towards Robust and Realistic Human Pose Estimation via WiFi Signals
cs.CV 2025-01 unverdicted novelty 5.0

DT-Pose reformulates WiFi HPE as domain-consistent representation learning via temporal contrastive masked pretraining plus hybrid topology-constrained decoding to yield more accurate and realistic 2D/3D poses.
Disentangled Generative Graph Representation Learning
cs.LG 2024-08 unverdicted novelty 5.0

DiGGR introduces a self-supervised graph representation learning framework that disentangles latent factors to guide mask modeling and improve representation quality on graph tasks.
FruitEnsemble: MLLM-Guided Arbitration for Heterogeneous ensemble in Fine-Grained Fruit Recognition
cs.CV 2026-05 unverdicted novelty 4.0

FruitEnsemble uses a weighted ensemble of backbones for top-3 candidates followed by MLLM arbitration on low-confidence samples to reach 70.49% accuracy on a new 306-class fruit dataset.
NTIRE 2026 Challenge on Robust AI-Generated Image Detection in the Wild
cs.CV 2026-04 unverdicted novelty 4.0

The NTIRE 2026 challenge provides a dataset of over 294,000 real and AI-generated images with 36 transformations to benchmark robust detection models.
Frabjous: Deep Learning Fast Radio Burst Morphologies
astro-ph.IM 2025-07 unverdicted novelty 4.0

Frabjous applies deep learning to classify FRB morphologies into five classes at 55% accuracy by augmenting limited real data with simulations.
Neural Network Optimization Reimagined: Decoupled Techniques for Scratch and Fine-Tuning
cs.CV 2026-04 unverdicted novelty 3.0

DualOpt decouples optimization by using real-time layer-wise weight decay for scratch training and weight rollback for fine-tuning to improve convergence, generalization, and reduce knowledge forgetting.
Document Parsing Unveiled: Techniques, Challenges, and Prospects for Structured Information Extraction
cs.MM 2024-10 unverdicted novelty 3.0

Survey proposing a taxonomy for document parsing into pipeline-based systems and VLM-driven unified models, reviewing components, metrics, benchmarks, and challenges.

Reference graph

Works this paper leans on

23 extracted references · 23 canonical work pages · cited by 49 Pith papers · 7 internal anchors

[1]

UniLMv2: Pseudo- masked language models for uniﬁed language model pre-training

[BDW+20] Hangbo Bao, Li Dong, Furu Wei, Wenhui Wang, Nan Yang, Xiaodong Liu, Yu Wang, Jianfeng Gao, Songhao Piao, Ming Zhou, and Hsiao-Wuen Hon. UniLMv2: Pseudo- masked language models for uniﬁed language model pre-training. In Proceedings of the 37th International Conference on Machine Learning, ICML 2020 , volume 119 of Proceedings of Machine Learning R...

work page 2020
[2]

Improved Baselines with Momentum Contrastive Learning

[CFGH20] Xinlei Chen, Haoqi Fan, Ross Girshick, and Kaiming He. Improved baselines with momentum contrastive learning. preprint arXiv:2003.04297,

work page internal anchor Pith review arXiv 2003
[3]

Exploring simple siamese representation learning

[CH20] Xinlei Chen and Kaiming He. Exploring simple siamese representation learning. preprint arXiv:2011.10566,

work page arXiv 2011
[4]

A Simple Framework for Contrastive Learning of Visual Representations

[CKNH20] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. preprint arXiv:2002.05709,

work page internal anchor Pith review arXiv 2002
[5]

Emerging Properties in Self-Supervised Vision Transformers

[CTM+21] Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. arXiv preprint arXiv:2104.14294,

work page internal anchor Pith review arXiv
[6]

An empirical study of training self-supervised vision transformers

[CXH21] Xinlei Chen, Saining Xie, and Kaiming He. An empirical study of training self-supervised vision transformers. ArXiv, abs/2104.02057,

work page arXiv
[7]

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

[DBK+20] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. preprint arXiv:2010.11929,

work page internal anchor Pith review Pith/arXiv arXiv 2010
[8]

BERT: pre-training of deep bidirectional transformers for language understanding

[DCLT19] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 4171–4186. Association for Computational ...

work page 2019
[9]

Self-attention attribution: Interpreting information interactions inside Transformer

[HDWX20] Yaru Hao, Li Dong, Furu Wei, and Ke Xu. Self-attention attribution: Interpreting information interactions inside Transformer. arXiv preprint arXiv:2004.11207,

work page arXiv 2004
[10]

Deep networks with stochastic depth

[HSL+16] Gao Huang, Yu Sun, Zhuang Liu, Daniel Sedra, and Kilian Q. Weinberger. Deep networks with stochastic depth. In Bastian Leibe, Jiri Matas, Nicu Sebe, and Max Welling, editors, Computer Vision – ECCV 2016, pages 646–661, Cham,

work page 2016
[11]

Categorical reparameterization with Gumbel-softmax

[JGP17] Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization with Gumbel-softmax. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings. OpenReview.net,

work page 2017
[12]

Auto-Encoding Variational Bayes

[KW14] Diederik P. Kingma and Max Welling. Auto-Encoding Variational Bayes. In 2nd International Conference on Learning Representations, ICLR 2014,

work page 2014
[13]

Swin Transformer: Hierarchical Vision Transformer using Shifted Windows

[LLC+21] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin Transformer: Hierarchical vision transformer using shifted windows. arXiv preprint arXiv:2103.14030,

work page internal anchor Pith review arXiv
[14]

Representation Learning with Contrastive Predictive Coding

[OLV18] Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. preprint arXiv:1807.03748,

work page internal anchor Pith review Pith/arXiv arXiv
[15]

Zero-Shot Text-to-Image Generation

[RPG+21] A. Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-shot text-to-image generation. ArXiv, abs/2102.12092,

work page internal anchor Pith review arXiv
[16]

Training data-efficient image transformers & distillation through attention

[TCD+20] Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Hervé Jégou. Training data-efficient image transformers & distillation through attention. preprint arXiv:2012.12877,

work page arXiv 2012
[17]

Going deeper with image transformers

[TCS+21] Hugo Touvron, Matthieu Cord, Alexandre Sablayrolles, Gabriel Synnaeve, and Hervé Jégou. Going deeper with image transformers. arXiv preprint arXiv:2103.17239,

work page arXiv
[18]

Selfie: Self-supervised pretraining for image embedding

[TLL19] Trieu H Trinh, Minh-Thang Luong, and Quoc V Le. Selfie: Self-supervised pretraining for image embedding. arXiv preprint arXiv:1906.02940,

work page arXiv 1906
[19]

Attention is all you need

[VSP+17] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Isabelle Guyon, Ulrike von Luxburg, Samy Bengio, Hanna M. Wallach, Rob Fergus, S. V. N. Vishwanathan, and Roman Garnett, editors, Advances in Neural Information Processi...

work page 2017
[20]

Self-supervised learning with swin transformers

[XLY+21] Zhenda Xie, Yutong Lin, Zhuliang Yao, Zheng Zhang, Qi Dai, Yue Cao, and Han Hu. Self-supervised learning with swin transformers. arXiv preprint arXiv:2105.04553,

work page arXiv
[21]

Scaling vision transformers

[ZKHB21] Xiaohua Zhai, Alexander Kolesnikov, Neil Houlsby, and Lucas Beyer. Scaling vision transformers. arXiv preprint arXiv:2106.04560,

work page arXiv
[22]

Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers

[ZLZ+20] Sixiao Zheng, Jiachen Lu, Hengshuang Zhao, Xiatian Zhu, Zekun Luo, Yabiao Wang, Yanwei Fu, Jianfeng Feng, Tao Xiang, Philip H. S. Torr, and Li Zhang. Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. CoRR, abs/2012.15840,

work page arXiv 2012
[23]

*: result is taken from [CXH21]

The results, unless otherwise indicated, are all obtained by base-size models. *: result is taken from [CXH21]. G Hyperparameters for Pre-Training. Hyperparameters (Base Size / Large Size): Layers 12 / 24; Hidden size 768 / 1024; FFN inner hidden size 3072 / 4096; Attention heads 12 / 16; Attention head size 64; Patch size 16 × 16; Training epochs 800; Batch size 2048; Adam ϵ 1...

work page 2048
