hub Mixed citations

Improved Baselines with Momentum Contrastive Learning

Xinlei Chen, Haoqi Fan, Ross Girshick, Kaiming He · 2020 · cs.CV · arXiv 2003.04297

Mixed citation behavior. Most common role is background (57%).

40 Pith papers citing it

Background 57% of classified citations

abstract

Contrastive unsupervised learning has recently shown encouraging progress, e.g., in Momentum Contrast (MoCo) and SimCLR. In this note, we verify the effectiveness of two of SimCLR's design improvements by implementing them in the MoCo framework. With simple modifications to MoCo---namely, using an MLP projection head and more data augmentation---we establish stronger baselines that outperform SimCLR and do not require large training batches. We hope this will make state-of-the-art unsupervised learning research more accessible. Code will be made public.
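
The abstract names an MLP projection head as one of the two modifications to MoCo. As a rough illustration only (not the authors' released code), the sketch below shows in plain Python how such a head, a contrastive loss over one positive and several negative keys, and a momentum update of the key encoder might fit together. All function names, the tiny dimensions, and the temperature and momentum values are assumptions chosen for illustration; the abstract does not specify them.

```python
import math


def linear(x, weights, bias):
    """Dense layer; weights holds one row per output unit."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def mlp_head(x, params):
    """Two-layer MLP projection head: linear -> ReLU -> linear."""
    hidden = [max(0.0, h) for h in linear(x, params["w1"], params["b1"])]
    return linear(hidden, params["w2"], params["b2"])


def l2_normalize(v):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(a * a for a in v)) or 1.0
    return [a / norm for a in v]


def info_nce(query, positive_key, negative_keys, temperature=0.2):
    """Cross-entropy over similarities, with the positive key as the target.

    The temperature default is an illustrative assumption.
    """
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    logits = [dot(query, positive_key) / temperature]
    logits += [dot(query, k) / temperature for k in negative_keys]
    peak = max(logits)  # subtract the max for numerical stability
    log_sum = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_sum - logits[0]


def momentum_update(key_params, query_params, m=0.999):
    """Move key-encoder parameters slowly toward the query encoder's.

    The momentum default is an illustrative assumption.
    """
    for name, q_val in query_params.items():
        k_val = key_params[name]
        if q_val and isinstance(q_val[0], list):  # weight matrix
            key_params[name] = [
                [m * k + (1.0 - m) * q for k, q in zip(k_row, q_row)]
                for k_row, q_row in zip(k_val, q_val)
            ]
        else:  # bias vector
            key_params[name] = [m * k + (1.0 - m) * q
                                for k, q in zip(k_val, q_val)]


# Hypothetical toy head: 2-d backbone feature -> 1 hidden unit -> 1-d output.
toy_head = {"w1": [[1.0, -1.0]], "b1": [0.0], "w2": [[2.0]], "b2": [1.0]}
toy_query = l2_normalize(mlp_head([3.0, 1.0], toy_head))
```

In this sketch the negatives are passed in as a plain list, which stands in for MoCo's queue of previously encoded keys; because negatives come from that queue rather than from the current batch, the approach does not depend on large training batches, as the abstract notes.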

citation-role summary

background: 5 · baseline: 1 · method: 1

citation-polarity summary

background: 4 · baseline: 1 · unclear: 1 · use method: 1

representative citing papers

Learning to (Learn at Test Time): RNNs with Expressive Hidden States

cs.LG · 2024-07-05 · conditional · novelty 8.0

TTT layers treat the hidden state as a trainable model updated at test time, allowing linear-complexity sequence models to scale perplexity reduction with context length unlike Mamba.

Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution

cs.CL · 2023-09-28 · unverdicted · novelty 8.0

Promptbreeder evolves both task prompts and the mutation prompts that improve them using LLMs, outperforming Chain-of-Thought and Plan-and-Solve on arithmetic and commonsense reasoning benchmarks.

Emerging Properties in Self-Supervised Vision Transformers

cs.CV · 2021-04-29 · conditional · novelty 8.0

Self-supervised ViTs show emergent semantic segmentation and 78.3% k-NN accuracy on ImageNet; DINO reaches 80.1% linear evaluation with ViT-Base.

Targeted Downstream-Agnostic Attack

cs.CV · 2026-05-19 · unverdicted · novelty 7.0

Introduces Targeted Downstream-Agnostic Attack (TDAA) that uses a threat image as feature anchor and example-specific perturbations to achieve targeted attacks on unknown downstream tasks from pre-trained encoders.

SeBA: Semi-supervised few-shot learning via Separated-at-Birth Alignment for tabular data

cs.LG · 2026-05-08 · unverdicted · novelty 7.0

SeBA is a joint-embedding framework that separates tabular data into two complementary views and aligns one view's representations to the nearest-neighbor structure of the other, improving feature-label relationships and achieving SOTA results in most benchmarks without relying on augmentations.

Attention Transfer Is Not Universally Effective for Vision Transformers

cs.CV · 2026-05-08 · accept · novelty 7.0

Attention transfer from ViT teachers succeeds for only 7 of 11 families and fails for the rest because of architectural mismatch between teacher and student.

TinySSL: Distilled Self-Supervised Pretraining for Sub-Megabyte MCU Models

cs.CV · 2026-05-07 · conditional · novelty 7.0

CA-DSSL enables effective self-supervised pretraining for 396K-parameter MCU backbones, reaching 62.7% linear-probe accuracy on CIFAR-100 and 94% of supervised performance while fitting in 378 KB INT8.

Generative Texture Filtering

cs.CV · 2026-04-21 · unverdicted · novelty 7.0

A two-stage fine-tuning strategy on pre-trained generative models enables effective texture filtering that outperforms prior methods on challenging cases.

CBEN -- A Multimodal Machine Learning Dataset for Cloud Robust Remote Sensing Image Understanding

cs.CV · 2026-02-13 · accept · novelty 7.0

CBEN provides paired optical-radar images with cloud occlusion, revealing 23-33 point AP drops in clear-sky trained models and 17-29 point relative gains when models are trained on cloudy data.

Joint Embedding Variational Bayes

cs.LG · 2026-02-05 · unverdicted · novelty 7.0

VJE is a new variational non-contrastive SSL method that models target embeddings with a directional-radial Student-t distribution to enable structured uncertainty estimation directly in the learned representation space.

LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale

cs.LG · 2022-08-15 · conditional · novelty 7.0

LLM.int8() performs 8-bit inference for transformers up to 175B parameters with no accuracy loss by combining vector-wise quantization for most features with 16-bit mixed-precision handling of systematic outlier dimensions.

BEiT: BERT Pre-Training of Image Transformers

cs.CV · 2021-06-15 · conditional · novelty 7.0

BEiT pre-trains vision transformers via masked image modeling on visual tokens and reaches 83.2% ImageNet top-1 accuracy for the base model and 86.3% for the large model using only ImageNet-1K data.

VICReg: Variance-Invariance-Covariance Regularization for Self-Supervised Learning

cs.CV · 2021-05-11 · accept · novelty 7.0

VICReg prevents collapse in self-supervised image embeddings via explicit variance, invariance, and covariance regularization and matches state-of-the-art downstream performance.

Vision Foundation Models as Generalist Tokenizers for Image Generation

cs.CV · 2026-05-18 · unverdicted · novelty 6.0

VFMTok builds a generalist image tokenizer on frozen VFMs using adaptive quantization and semantic alignment, delivering gFID 1.36 for autoregressive and 1.25 for continuous generation on ImageNet with 3x faster convergence.

ArmSSL: Adversarial Robust Black-Box Watermarking for Self-Supervised Learning Pre-trained Encoders

cs.CR · 2026-04-24 · unverdicted · novelty 6.0

ArmSSL is a black-box verifiable and adversarially robust watermarking framework for SSL pre-trained encoders using paired discrepancy enlargement, latent entanglement, distribution alignment, and reference-guided tuning.

Image Generators are Generalist Vision Learners

cs.CV · 2026-04-22 · conditional · novelty 6.0 · 2 refs

Image generation pretraining builds generalist vision models that reach SOTA on 2D and 3D perception tasks by reframing them as RGB image outputs.

Beyond Binary Contrast: Modeling Continuous Skeleton Action Spaces with Transitional Anchors

cs.CV · 2026-04-20 · unverdicted · novelty 6.0

TranCLR models continuous skeleton action spaces with transitional anchors and multi-level manifold calibration, yielding smoother and more accurate representations than binary contrastive methods.

Shape: A Self-Supervised 3D Geometry Foundation Model for Industrial CAD Analysis

cs.CV · 2026-04-19 · unverdicted · novelty 6.0

A 10.9M-parameter self-supervised model pretrained on 61k CAD meshes achieves R²=0.729 reconstruction and 98.1% top-1 retrieval on held-out data via masked normalized geometry reconstruction and multi-resolution contrastive learning.

Boosting Visual Instruction Tuning with Self-Supervised Guidance

cs.CV · 2026-04-14 · unverdicted · novelty 6.0

Mixing 3-10% of visually grounded self-supervised instructions into visual instruction tuning consistently boosts MLLM performance on vision-centric benchmarks.

Probing Intrinsic Medical Task Relationships: A Contrastive Learning Perspective

cs.CV · 2026-04-07 · unverdicted · novelty 6.0

TaCo contrastively embeds semantic, generative, and transformation tasks from medical imaging into a joint space to reveal which tasks cluster, blend, or remain distinct.

Text-Phase Synergy Network with Dual Priors for Unsupervised Cross-Domain Image Retrieval

cs.CV · 2026-03-13 · unverdicted · novelty 6.0

TPSNet combines CLIP text prompts and phase features as dual priors to deliver better semantic supervision and domain alignment than pseudo-label clustering in unsupervised cross-domain image retrieval.

Vision Transformers Need More Than Registers

cs.CV · 2026-02-25 · unverdicted · novelty 6.0

ViTs exhibit lazy aggregation by relying on irrelevant background patches for global semantics, and selectively integrating patch features into the CLS token reduces this effect and improves results across label-, text-, and self-supervision.

LandSegmenter: Towards a Flexible Foundation Model for Land Use and Land Cover Mapping

cs.CV · 2025-11-11 · unverdicted · novelty 6.0

LandSegmenter creates a task-specific foundation model for LULC mapping using weak labels from existing products, an RS adapter, text encoder, and confidence-guided fusion to achieve competitive zero-shot performance across modalities and taxonomies.

CoUn: Empowering Machine Unlearning via Contrastive Learning

cs.LG · 2025-09-19 · unverdicted · novelty 6.0

CoUn emulates retrained-model behavior on forget data by using contrastive learning on retain data to adjust semantic representations while preserving retain clusters via supervised learning, outperforming prior MU methods in experiments.

citing papers explorer

Showing 40 of 40 citing papers.

Learning to (Learn at Test Time): RNNs with Expressive Hidden States cs.LG · 2024-07-05 · conditional · none · ref 13 · internal anchor
TTT layers treat the hidden state as a trainable model updated at test time, allowing linear-complexity sequence models to scale perplexity reduction with context length unlike Mamba.
Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution cs.CL · 2023-09-28 · unverdicted · none · ref 72 · internal anchor
Promptbreeder evolves both task prompts and the mutation prompts that improve them using LLMs, outperforming Chain-of-Thought and Plan-and-Solve on arithmetic and commonsense reasoning benchmarks.
Emerging Properties in Self-Supervised Vision Transformers cs.CV · 2021-04-29 · conditional · none · ref 15 · internal anchor
Self-supervised ViTs show emergent semantic segmentation and 78.3% k-NN accuracy on ImageNet; DINO reaches 80.1% linear evaluation with ViT-Base.
Targeted Downstream-Agnostic Attack cs.CV · 2026-05-19 · unverdicted · none · ref 11 · internal anchor
Introduces Targeted Downstream-Agnostic Attack (TDAA) that uses a threat image as feature anchor and example-specific perturbations to achieve targeted attacks on unknown downstream tasks from pre-trained encoders.
SeBA: Semi-supervised few-shot learning via Separated-at-Birth Alignment for tabular data cs.LG · 2026-05-08 · unverdicted · none · ref 237 · internal anchor
SeBA is a joint-embedding framework that separates tabular data into two complementary views and aligns one view's representations to the nearest-neighbor structure of the other, improving feature-label relationships and achieving SOTA results in most benchmarks without relying on augmentations.
Attention Transfer Is Not Universally Effective for Vision Transformers cs.CV · 2026-05-08 · accept · none · ref 30 · internal anchor
Attention transfer from ViT teachers succeeds for only 7 of 11 families and fails for the rest because of architectural mismatch between teacher and student.
TinySSL: Distilled Self-Supervised Pretraining for Sub-Megabyte MCU Models cs.CV · 2026-05-07 · conditional · none · ref 5 · internal anchor
CA-DSSL enables effective self-supervised pretraining for 396K-parameter MCU backbones, reaching 62.7% linear-probe accuracy on CIFAR-100 and 94% of supervised performance while fitting in 378 KB INT8.
Generative Texture Filtering cs.CV · 2026-04-21 · unverdicted · none · ref 28 · internal anchor
A two-stage fine-tuning strategy on pre-trained generative models enables effective texture filtering that outperforms prior methods on challenging cases.
CBEN -- A Multimodal Machine Learning Dataset for Cloud Robust Remote Sensing Image Understanding cs.CV · 2026-02-13 · accept · none · ref 39 · internal anchor
CBEN provides paired optical-radar images with cloud occlusion, revealing 23-33 point AP drops in clear-sky trained models and 17-29 point relative gains when models are trained on cloudy data.
Joint Embedding Variational Bayes cs.LG · 2026-02-05 · unverdicted · none · ref 3 · internal anchor
VJE is a new variational non-contrastive SSL method that models target embeddings with a directional-radial Student-t distribution to enable structured uncertainty estimation directly in the learned representation space.
LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale cs.LG · 2022-08-15 · conditional · none · ref 59 · internal anchor
LLM.int8() performs 8-bit inference for transformers up to 175B parameters with no accuracy loss by combining vector-wise quantization for most features with 16-bit mixed-precision handling of systematic outlier dimensions.
BEiT: BERT Pre-Training of Image Transformers cs.CV · 2021-06-15 · conditional · none · ref 2 · internal anchor
BEiT pre-trains vision transformers via masked image modeling on visual tokens and reaches 83.2% ImageNet top-1 accuracy for the base model and 86.3% for the large model using only ImageNet-1K data.
VICReg: Variance-Invariance-Covariance Regularization for Self-Supervised Learning cs.CV · 2021-05-11 · accept · none · ref 99 · internal anchor
VICReg prevents collapse in self-supervised image embeddings via explicit variance, invariance, and covariance regularization and matches state-of-the-art downstream performance.
Vision Foundation Models as Generalist Tokenizers for Image Generation cs.CV · 2026-05-18 · unverdicted · none · ref 11 · internal anchor
VFMTok builds a generalist image tokenizer on frozen VFMs using adaptive quantization and semantic alignment, delivering gFID 1.36 for autoregressive and 1.25 for continuous generation on ImageNet with 3x faster convergence.
ArmSSL: Adversarial Robust Black-Box Watermarking for Self-Supervised Learning Pre-trained Encoders cs.CR · 2026-04-24 · unverdicted · none · ref 15 · internal anchor
ArmSSL is a black-box verifiable and adversarially robust watermarking framework for SSL pre-trained encoders using paired discrepancy enlargement, latent entanglement, distribution alignment, and reference-guided tuning.
Image Generators are Generalist Vision Learners cs.CV · 2026-04-22 · conditional · none · ref 9 · 2 links · internal anchor
Image generation pretraining builds generalist vision models that reach SOTA on 2D and 3D perception tasks by reframing them as RGB image outputs.
Beyond Binary Contrast: Modeling Continuous Skeleton Action Spaces with Transitional Anchors cs.CV · 2026-04-20 · unverdicted · none · ref 5 · internal anchor
TranCLR models continuous skeleton action spaces with transitional anchors and multi-level manifold calibration, yielding smoother and more accurate representations than binary contrastive methods.
Shape: A Self-Supervised 3D Geometry Foundation Model for Industrial CAD Analysis cs.CV · 2026-04-19 · unverdicted · none · ref 16 · internal anchor
A 10.9M-parameter self-supervised model pretrained on 61k CAD meshes achieves R²=0.729 reconstruction and 98.1% top-1 retrieval on held-out data via masked normalized geometry reconstruction and multi-resolution contrastive learning.
Boosting Visual Instruction Tuning with Self-Supervised Guidance cs.CV · 2026-04-14 · unverdicted · none · ref 18 · internal anchor
Mixing 3-10% of visually grounded self-supervised instructions into visual instruction tuning consistently boosts MLLM performance on vision-centric benchmarks.
Probing Intrinsic Medical Task Relationships: A Contrastive Learning Perspective cs.CV · 2026-04-07 · unverdicted · none · ref 16 · internal anchor
TaCo contrastively embeds semantic, generative, and transformation tasks from medical imaging into a joint space to reveal which tasks cluster, blend, or remain distinct.
Text-Phase Synergy Network with Dual Priors for Unsupervised Cross-Domain Image Retrieval cs.CV · 2026-03-13 · unverdicted · none · ref 6 · internal anchor
TPSNet combines CLIP text prompts and phase features as dual priors to deliver better semantic supervision and domain alignment than pseudo-label clustering in unsupervised cross-domain image retrieval.
Vision Transformers Need More Than Registers cs.CV · 2026-02-25 · unverdicted · none · ref 2 · internal anchor
ViTs exhibit lazy aggregation by relying on irrelevant background patches for global semantics, and selectively integrating patch features into the CLS token reduces this effect and improves results across label-, text-, and self-supervision.
LandSegmenter: Towards a Flexible Foundation Model for Land Use and Land Cover Mapping cs.CV · 2025-11-11 · unverdicted · none · ref 2 · internal anchor
LandSegmenter creates a task-specific foundation model for LULC mapping using weak labels from existing products, an RS adapter, text encoder, and confidence-guided fusion to achieve competitive zero-shot performance across modalities and taxonomies.
CoUn: Empowering Machine Unlearning via Contrastive Learning cs.LG · 2025-09-19 · unverdicted · none · ref 30 · internal anchor
CoUn emulates retrained-model behavior on forget data by using contrastive learning on retain data to adjust semantic representations while preserving retain clusters via supervised learning, outperforming prior MU methods in experiments.
Franca: Nested Matryoshka Clustering for Scalable Visual Representation Learning cs.CV · 2025-07-18 · conditional · none · ref 27 · internal anchor
Franca introduces nested Matryoshka clustering and positional disentanglement in a transparent SSL pipeline to deliver open-source vision models competitive with closed proprietary systems.
Representation Alignment for Generation: Training Diffusion Transformers Is Easier Than You Think cs.CV · 2024-10-09 · unverdicted · none · ref 124 · internal anchor
Aligning noisy hidden states in diffusion transformers to clean features from pretrained visual encoders speeds up training over 17x and reaches FID 1.42.
animal2vec and MeerKAT: A self-supervised transformer for rare-event raw audio input and a large-scale reference dataset for bioacoustics cs.SD · 2024-06-03 · unverdicted · none · ref 109 · internal anchor
Introduces animal2vec, a self-supervised transformer for sparse bioacoustic audio, and the MeerKAT meerkat vocalization dataset, claiming outperformance over baselines including in few-shot settings.
Revisiting Feature Prediction for Learning Visual Representations from Video cs.CV · 2024-02-15 · conditional · none · ref 59 · internal anchor
V-JEPA models trained only on feature prediction from 2 million public videos achieve 81.9% on Kinetics-400, 72.2% on Something-Something-v2, and 77.9% on ImageNet-1K using frozen ViT-H/16 backbones.
Vision Transformers Need Registers cs.CV · 2023-09-28 · unverdicted · none · ref 142 · internal anchor
Adding register tokens to Vision Transformers eliminates high-norm background artifacts and raises state-of-the-art performance on dense visual prediction tasks.
Vector-quantized Image Modeling with Improved VQGAN cs.CV · 2021-10-09 · accept · none · ref 13 · internal anchor
Improved ViT-VQGAN enables autoregressive Transformer pretraining on ImageNet tokens to reach IS 175.1 and FID 4.17 for generation plus 73.2% linear-probe accuracy, beating prior iGPT models.
Temporal Aware Pruning for Efficient Diffusion-based Video Generation cs.CV · 2026-05-18 · unverdicted · none · ref 36 · 2 links · internal anchor
TAPE applies temporal-aware token pruning with smoothing, reselection, and timestep scheduling to speed up video diffusion models while preserving visual fidelity and coherence.
Beyond Instance-Level Self-Supervision in 3D Multi-Modal Medical Imaging cs.CV · 2026-05-14 · unverdicted · none · ref 93 · internal anchor
A self-supervised approach uses consistent spatial relationships of anatomical structures across patients to improve 3D multi-modal medical image representations, yielding modest gains on segmentation and classification tasks.
Information theoretic underpinning of self-supervised learning by clustering cs.LG · 2026-05-12 · unverdicted · none · ref 97 · internal anchor
SSL clustering is derived as KL-divergence optimization where a teacher-distribution constraint normalizes via inverse cluster priors and simplifies to batch centering by Jensen's inequality.
ShellfishNet: A Domain-Specific Benchmark for Visual Recognition of Marine Molluscs cs.CV · 2026-05-08 · unverdicted · none · ref 15 · internal anchor
ShellfishNet is a new benchmark of 8,691 images across 32 mollusc taxa for evaluating vision models on real-world underwater ecological monitoring tasks including robustness to degradation.
Towards Brain MRI Foundation Models for the Clinic: Findings from the FOMO25 Challenge cs.CV · 2026-04-13 · accept · none · ref 44 · 2 links · internal anchor
Self-supervised pretraining on large unlabeled clinical brain MRI data improves generalization to out-of-domain clinical tasks over supervised in-domain training, with task-specific optimal objectives and limited benefits from model scaling.
SMFormer: Empowering Self-supervised Stereo Matching via Foundation Models and Data Augmentation cs.CV · 2026-04-11 · unverdicted · none · ref 52 · internal anchor
SMFormer achieves state-of-the-art self-supervised stereo matching by using vision foundation models for disturbance-resistant features and data augmentation to enforce output consistency, rivaling or exceeding some supervised methods on benchmarks including Booster.
PaCo-FR: Patch-Pixel Aligned End-to-End Codebook Learning for Facial Representation Pre-training cs.CV · 2025-08-13 · unverdicted · none · ref 11 · internal anchor
PaCo-FR introduces a structured-masking and patch-codebook framework for unsupervised facial representation pre-training that claims state-of-the-art results on multiple facial tasks after training on only 2 million unlabeled images.
On the Generalizability of Foundation Models for Crop Type Mapping cs.CV · 2024-09-14 · unverdicted · none · ref 37 · internal anchor
Sentinel-2-specific foundation models outperform ImageNet on multi-continent crop mapping, with 100 labels achieving high overall accuracy but 900 required to address class imbalance.
Improved Baselines with Visual Instruction Tuning cs.CV · 2023-10-05 · conditional · none · ref 10 · internal anchor
Simple changes to LLaVA using CLIP-ViT-L-336px, an MLP connector, and academic VQA data yield state-of-the-art results on 11 benchmarks with only 1.2M public examples and one-day training on 8 A100 GPUs.
BrainDINO: A Brain MRI Foundation Model for Generalizable Clinical Representation Learning cs.LG · 2026-04-30 · unreviewed · ref 19 · internal anchor
