
BEiT: BERT Pre-Training of Image Transformers

Hangbo Bao, Li Dong, Songhao Piao, Furu Wei · 2021 · cs.CV · arXiv 2106.08254

Mixed citation behavior. Most common role is background (42%).

70 Pith papers citing it

Background 42% of classified citations


abstract

We introduce a self-supervised vision representation model BEiT, which stands for Bidirectional Encoder representation from Image Transformers. Following BERT developed in the natural language processing area, we propose a masked image modeling task to pretrain vision Transformers. Specifically, each image has two views in our pre-training, i.e., image patches (such as 16x16 pixels), and visual tokens (i.e., discrete tokens). We first "tokenize" the original image into visual tokens. Then we randomly mask some image patches and feed them into the backbone Transformer. The pre-training objective is to recover the original visual tokens based on the corrupted image patches. After pre-training BEiT, we directly fine-tune the model parameters on downstream tasks by appending task layers upon the pretrained encoder. Experimental results on image classification and semantic segmentation show that our model achieves competitive results with previous pre-training methods. For example, base-size BEiT achieves 83.2% top-1 accuracy on ImageNet-1K, significantly outperforming from-scratch DeiT training (81.8%) with the same setup. Moreover, large-size BEiT obtains 86.3% only using ImageNet-1K, even outperforming ViT-L with supervised pre-training on ImageNet-22K (85.2%). The code and pretrained models are available at https://aka.ms/beit.
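The masked image modeling objective described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes a 224x224 input (so 14x14 = 196 patches of 16x16 pixels), a toy vocabulary of 8 visual tokens, a 40% masking ratio, and uniform random masking. The tokenizer and the backbone Transformer are replaced by stand-in values, and the names mask_patches and mim_loss are hypothetical.

```python
import math
import random


def mask_patches(num_patches, mask_ratio, rng):
    """Pick a random subset of patch indices to mask (uniform sampling, an assumption)."""
    num_masked = int(round(num_patches * mask_ratio))
    return sorted(rng.sample(range(num_patches), num_masked))


def log_softmax(logits):
    """Numerically stable log-softmax over one list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def mim_loss(logits_per_patch, visual_tokens, masked_idx):
    """Mean cross-entropy between predicted and true visual tokens,
    computed over the masked positions only."""
    total = 0.0
    for i in masked_idx:
        total -= log_softmax(logits_per_patch[i])[visual_tokens[i]]
    return total / len(masked_idx)


rng = random.Random(0)
num_patches = 196   # assumption: 224x224 image split into 16x16-pixel patches
vocab_size = 8      # assumption: toy visual-token vocabulary
mask_ratio = 0.4    # assumption: illustrative masking ratio

# Stand-in for the image tokenizer: one discrete token id per patch.
visual_tokens = [rng.randrange(vocab_size) for _ in range(num_patches)]
masked_idx = mask_patches(num_patches, mask_ratio, rng)

# Stand-in for the backbone output: uniform logits (an untrained predictor).
uniform_logits = [[0.0] * vocab_size for _ in range(num_patches)]
uniform_loss = mim_loss(uniform_logits, visual_tokens, masked_idx)

# A predictor that puts nearly all mass on the correct token drives the loss toward 0.
confident_logits = [
    [50.0 if t == visual_tokens[i] else 0.0 for t in range(vocab_size)]
    for i in range(num_patches)
]
confident_loss = mim_loss(confident_logits, visual_tokens, masked_idx)

print(len(masked_idx), uniform_loss, confident_loss)
```

With uniform logits the loss equals the log of the vocabulary size, which is the baseline a trained model has to beat. Restricting the loss to masked positions reflects the abstract's statement that the objective is to recover the original visual tokens from the corrupted patches.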


citation-role summary

background 7 · method 5

citation-polarity summary

background 5 · use method 4 · unclear 2 · extend 1

representative citing papers

Masked Autoencoders Are Scalable Vision Learners

cs.CV · 2021-11-11 · accept · novelty 8.0

Masked autoencoders with asymmetric encoder-decoder and 75% masking ratio enable scalable self-supervised pre-training of vision transformers, achieving 87.8% ImageNet-1K accuracy with ViT-Huge using only unlabeled data.

Quo Vadis, Visual In-Context Learning? A Unified Benchmark Across Domains and Tasks

cs.CV · 2026-06-09 · unverdicted · novelty 7.0

The paper constructs the VIBE benchmark and evaluates six visual in-context learning models on 14 datasets, 12 tasks, and 106 combinations under a unified one-shot protocol, revealing limitations and failure modes.

Remembering by Reconstructing: Domain Incremental Learning With Test-Time Training on Video Streams

cs.CV · 2026-05-29 · unverdicted · novelty 7.0

Domain-incremental video learning that permits forgetting through per-domain LoRA adapters and recovers the matching adapter at inference via test-time training on a self-supervised MAE reconstruction head.

Neural Scaling Laws for Jet Generation

hep-ph · 2026-05-27 · unverdicted · novelty 7.0

Scaling laws hold logarithmically for model size in autoregressive jet generation, with next-token loss correlating to physical metrics via sliced Wasserstein distance, but show weaker scaling for dataset size and compute due to rapid saturation.

TCP-SSM: Efficient Vision State Space Models with Token-Conditioned Poles

cs.CV · 2026-05-12 · unverdicted · novelty 7.0

TCP-SSM conditions stable poles on visual tokens to explicitly control memory decay and oscillation in SSMs, cutting computation up to 44% while matching or exceeding accuracy on classification, segmentation, and detection.

Rethink MAE with Linear Time-Invariant Dynamics

cs.CV · 2026-04-29 · unverdicted · novelty 7.0

Token order in frozen visual representations is exploitable via SSM-based LTI probes, revealing pre-training-dependent heterogeneity that fixed pooling misses.

VFM$^{4}$SDG: Unveiling the Power of VFMs for Single-Domain Generalized Object Detection

cs.CV · 2026-04-23 · unverdicted · novelty 7.0 · 2 refs

VFM4SDG is a dual-prior framework that distills cross-domain stable relations from VFMs into DETR encoders and injects semantic-contextual priors into decoder queries to reduce missed detections in single-domain generalized object detection.

OVS-DINO: Open-Vocabulary Segmentation via Structure-Aligned SAM-DINO with Language Guidance

cs.CV · 2026-04-09 · unverdicted · novelty 7.0

OVS-DINO structurally aligns DINO with SAM to revitalize attenuated boundary features, achieving SOTA gains of 2.1% average and 6.3% on Cityscapes in weakly-supervised open-vocabulary segmentation.

Unifying Contrastive and Generative Objectives for Visual Understanding and Text-to-Image Generation

cs.CV · 2026-03-03 · unverdicted · novelty 7.0

DREAM introduces Masking Warmup and Semantically Aligned Decoding to let a single encoder handle both contrastive alignment and masked generation, yielding gains over CLIP and FLUID on understanding and generation benchmarks.

Recurrent Video Masked Autoencoders

cs.CV · 2025-12-15 · unverdicted · novelty 7.0

RVM uses recurrent computation inside a masked autoencoder to learn video representations that match or exceed prior video and image models on classification, tracking, and dense spatial tasks with up to 30x better parameter efficiency.

Adversarial Video Promotion Against Text-to-Video Retrieval

cs.CV · 2025-08-09 · unverdicted · novelty 7.0

Pioneers ViPro, the first attack to adversarially promote videos in text-to-video retrieval, using Modal Refinement to improve black-box transferability across multiple targets.

The ART of Composition: Attention-Regularized Training for Compositional Visual Grounding

cs.CV · 2024-12-11 · unverdicted · novelty 7.0

CompART adds a composition loss on decomposed captions to regularize attention sums and improves multi-object grounding plus VQA across four VLM types and six benchmarks.

Segment Anything

cs.CV · 2023-04-05 · unverdicted · novelty 7.0

A promptable model trained on 1B masks achieves competitive zero-shot segmentation performance across tasks and is released publicly with its dataset.

ZoeDepth: Zero-shot Transfer by Combining Relative and Metric Depth

cs.CV · 2023-02-23 · accept · novelty 7.0

ZoeDepth combines relative depth pre-training on many datasets with metric depth fine-tuning and automatic head routing to achieve strong zero-shot generalization while preserving metric scale.

iBOT: Image BERT Pre-Training with Online Tokenizer

cs.CV · 2021-11-15 · unverdicted · novelty 7.0

iBOT achieves 82.3% linear probing accuracy and 87.8% fine-tuning accuracy on ImageNet-1K using masked image modeling with a jointly trained online tokenizer.

A Lightweight Self-Supervised Learning Framework for Multivariate Time Series using Hierarchical-JEPA on ECG Data

cs.LG · 2026-07-01 · unverdicted · novelty 6.0

ER-JEPA applies hierarchical Joint-Embedding Predictive Architecture to ECG time series and reports SOTA performance on the ST-MEM benchmark after pretraining on ~180k recordings.

Mirror-Fusion Attention for Reflection-Aware Self-Supervised Representation Learning

cs.CV · 2026-07-01 · unverdicted · novelty 6.0

MFASSL adds mirror-paired views, a lightweight Mirror-Fusion Attention module, and reflection-consistency losses to improve SSL on bilateral data with ~2.7% extra parameters.

HT-Bench: Benchmarking and Learning Dexterous Full-Hand Tactile Representations with Egocentric Vision

cs.RO · 2026-06-17 · conditional · novelty 6.0

HT-Bench is a large egocentric vision-plus-full-hand-tactile benchmark with four evaluation tasks; the proposed HandTouch encoder improves Recall@5 from 74.65% to 85.23%, reduces inpainting RMSE from 0.022 to 0.010, and raises OOD cIoU from 0.628 to 0.705.

Contrastive Action-Image Pre-training for Visuomotor Control

cs.RO · 2026-06-15 · unverdicted · novelty 6.0

CAIP learns action-aligned visual representations via contrastive pre-training on human hand keypoints from egocentric video, outperforming DINOv2, SigLIP, MVP, and R3M with >30% gains on real dexterous manipulation tasks.

Qwen-RobotWorld Technical Report: Unifying Embodied World Modeling through Language-Conditioned Video Generation

cs.CV · 2026-06-15 · unverdicted · novelty 6.0

Qwen-RobotWorld is a language-conditioned video world model using Double-Stream MMDiT, an 8.6M-frame embodied corpus, and progressive curriculum training that ranks first on EWMBench and DreamGen Bench.

HYDRA-X: Native Unified Multimodal Models with Holistic Visual Tokenizers

cs.CV · 2026-06-11 · unverdicted · novelty 6.0

HYDRA-X presents the first unified multimodal model using a single ViT for holistic image-video tokenization, with ablations on attention and compression plus a latent-level editing improvement.

SPADE: Split-and-Delay Embeddings for Autoregressive High-Granularity Calorimeter Simulation

physics.ins-det · 2026-06-09 · unverdicted · novelty 6.0

SPADE is a split-and-delay embedding technique for multi-feature autoregressive transformers that achieves competitive performance on high-granularity calorimeter shower simulation.

AOI-SSL: Self-Supervised Framework for Efficient Segmentation of Wire-bonded Semiconductors In Optical Inspection

cs.CV · 2026-05-12 · unverdicted · novelty 6.0

AOI-SSL combines small-domain self-supervised pre-training of vision transformers with in-context patch retrieval to reduce labeled data needs and enable fast adaptation for semiconductor wire-bond segmentation.

MUSE: Resolving Manifold Misalignment in Visual Tokenization via Topological Orthogonality

cs.CV · 2026-05-07 · unverdicted · novelty 6.0

MUSE decouples reconstruction and semantic learning in visual tokenization via topological orthogonality, yielding SOTA generation quality and improved semantic performance over its teacher model.

citing papers explorer

Showing 7 of 7 citing papers after filters.

A Lightweight Self-Supervised Learning Framework for Multivariate Time Series using Hierarchical-JEPA on ECG Data

cs.LG · 2026-07-01 · unverdicted · none · ref 5 · internal anchor

ER-JEPA applies hierarchical Joint-Embedding Predictive Architecture to ECG time series and reports SOTA performance on the ST-MEM benchmark after pretraining on ~180k recordings.

Tight Clusters Make Specialized Experts

cs.LG · 2025-02-21 · unverdicted · none · ref 3 · internal anchor

Introduces Adaptive Clustering router for MoE models that scales features to identify tight expert clusters, yielding faster convergence, robustness to corruption, and performance gains.

Zeus: Towards Tuning-Free Foundation Model for Time Series Analysis

cs.LG · 2026-07-02 · unverdicted · none · ref 72 · internal anchor

Zeus proposes a multi-scale Transformer with point-wise tokenization and Multi-Objective Temporal Masking to enable tuning-free performance on forecasting, interpolation, and other time series tasks.

BrainDINO: A Brain MRI Foundation Model for Generalizable Clinical Representation Learning

cs.LG · 2026-04-30 · unverdicted · none · ref 23 · internal anchor

BrainDINO, trained via self-distillation on millions of unlabeled axial brain MRI slices, yields a unified representation that equals or exceeds baselines across diverse neuroimaging tasks when used with a frozen encoder and lightweight heads.

PRAGMA: Revolut Foundation Model

cs.LG · 2026-04-09 · unverdicted · none · ref 1 · internal anchor

PRAGMA pre-trains a Transformer on heterogeneous banking events with a tailored self-supervised masked objective, yielding embeddings that support strong downstream performance on credit scoring, fraud detection, and lifetime value prediction using linear heads or light fine-tuning.

Disentangled Generative Graph Representation Learning

cs.LG · 2024-08-24 · unverdicted · none · ref 2 · internal anchor

DiGGR introduces a self-supervised graph representation learning framework that disentangles latent factors to guide mask modeling and improve representation quality on graph tasks.

Understanding the Robustness of Distributed Self-Supervised Learning Frameworks Against Non-IID Data

cs.LG · 2026-07-02 · unverdicted · none · ref 23 · internal anchor

Abstract-only report: theoretical comparison finds MIM more robust than CL to non-IID data in D-SSL and robustness scales with connectivity; MAR loss proposed as practical application.
