How to train your ViT? Data, Augmentation, and Regularization in Vision Transformers

Andreas Steiner, Alexander Kolesnikov, Xiaohua Zhai, Ross Wightman, Jakob Uszkoreit, Lucas Beyer · 2021 · arXiv 2106.10270

6 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 2

citation-polarity summary

background 2

representative citing papers

Enjoy Your Layer Normalization with the Computational Efficiency of RMSNorm

cs.LG · 2026-05-14 · conditional · novelty 7.0

A framework to identify and convert foldable layer normalizations to RMSNorm for exact equivalence and faster inference in deep neural networks.

$\pi_0$: A Vision-Language-Action Flow Model for General Robot Control

cs.LG · 2024-10-31 · unverdicted · novelty 6.0

π₀ is a vision-language-action flow model trained on diverse multi-platform robot data that supports zero-shot task performance, language instruction following, and efficient fine-tuning for dexterous tasks.

Demystifying CLIP Data

cs.CV · 2023-09-28 · accept · novelty 6.0

MetaCLIP curates balanced 400M-pair subsets from CommonCrawl that outperform CLIP data, reaching 70.8% zero-shot ImageNet accuracy on ViT-B versus CLIP's 68.3%.

Sigmoid Loss for Language Image Pre-Training

cs.CV · 2023-03-27 · conditional · novelty 6.0

SigLIP replaces softmax-based contrastive loss with a simple pairwise sigmoid loss for vision-language pre-training, decoupling batch size from normalization and reaching strong zero-shot performance with limited compute.

Decision-Aware Attention Propagation for Vision Transformer Explainability

cs.CV · 2026-04-20 · unverdicted · novelty 5.0

DAP improves ViT attribution maps by injecting decision-relevant gradients into attention propagation, producing more class-sensitive and faithful explanations than standard attention rollout.

Parameter-Efficient Fine-Tuning for Large Models: A Comprehensive Survey

cs.LG · 2024-03-21 · accept · novelty 4.0

A comprehensive survey of PEFT algorithms for large models, covering their performance, overhead, applications, and real-world system implementations.

citing papers explorer

Enjoy Your Layer Normalization with the Computational Efficiency of RMSNorm · cs.LG · 2026-05-14 · conditional · none · ref 57
$\pi_0$: A Vision-Language-Action Flow Model for General Robot Control · cs.LG · 2024-10-31 · unverdicted · none · ref 47
Demystifying CLIP Data · cs.CV · 2023-09-28 · accept · none · ref 80
Sigmoid Loss for Language Image Pre-Training · cs.CV · 2023-03-27 · conditional · none · ref 42
Decision-Aware Attention Propagation for Vision Transformer Explainability · cs.CV · 2026-04-20 · unverdicted · none · ref 23
Parameter-Efficient Fine-Tuning for Large Models: A Comprehensive Survey · cs.LG · 2024-03-21 · accept · none · ref 185
