pith. machine review for the scientific record.

arxiv: 2602.22394 · v2 · submitted 2026-02-25 · 💻 cs.CV

Recognition: unknown

Vision Transformers Need More Than Registers

keywords: artifacts, vits, across, analysis, behavior, downstream, global, shortcuts

Vision Transformers (ViTs), when pre-trained on large-scale data, provide general-purpose representations for diverse downstream tasks. However, artifacts in ViTs are widely observed across different supervision paradigms and downstream tasks, and their fundamental mechanisms have yet to be sufficiently elucidated. In this paper, through systematic analysis, we conclude that these artifacts originate from a lazy aggregation behavior: the ViT uses semantically irrelevant background patches as shortcuts to represent global semantics, driven by global attention and coarse-grained semantic supervision. Our solution selectively integrates patch features into the CLS token, reducing the influence of background-dominated shortcuts and consistently improving performance across 12 benchmarks under label-, text-, and self-supervision. We hope this work offers a new perspective on ViT behavior.
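
The abstract says only that patch features are selectively integrated into the CLS token; it does not say how patches are chosen or how they are combined. As a rough illustration of the general idea, and not the paper's method, the sketch below assumes a hypothetical per-patch `relevance` score is already available, keeps the `k` highest-scoring patches, pools them with softmax weights, and adds the result to the CLS vector. The function name `selective_aggregate`, the `relevance` input, the top-k rule, and the residual addition are all assumptions made for this example.

```python
import math


def selective_aggregate(cls_token, patch_tokens, relevance, k):
    """Mix the k most relevant patch features into the CLS vector.

    relevance holds one hypothetical score per patch (higher means more
    foreground-like). How such scores would be obtained is an assumption;
    the abstract does not specify it.
    """
    if not 0 < k <= len(patch_tokens):
        raise ValueError("k must be between 1 and the number of patches")

    # Indices of the k highest-scoring patches (ties keep original order).
    top = sorted(range(len(patch_tokens)),
                 key=lambda i: relevance[i], reverse=True)[:k]

    # Softmax over the selected scores, shifted by the max for stability.
    peak = max(relevance[i] for i in top)
    exps = [math.exp(relevance[i] - peak) for i in top]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Weighted sum of the selected patch features, added to CLS.
    dim = len(cls_token)
    pooled = [sum(w * patch_tokens[i][d] for w, i in zip(weights, top))
              for d in range(dim)]
    return [c + p for c, p in zip(cls_token, pooled)]


# Toy example: the third patch plays the role of a low-relevance
# background patch and is excluded from the aggregate.
cls_out = selective_aggregate(
    cls_token=[0.0, 0.0],
    patch_tokens=[[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]],
    relevance=[2.0, 2.0, -10.0],
    k=2,
)
```

With these toy inputs the two selected patches get equal weight, so the large-norm third patch contributes nothing to the updated CLS vector. This is the kind of background suppression the abstract describes, under the assumptions stated above.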

This paper has not been read by Pith yet.


Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. LookWhen? Fast Video Recognition by Learning When, Where, and What to Compute

    cs.CV 2026-05 conditional novelty 7.0

    LookWhen factorizes video recognition into learning when, where, and what to compute via uniqueness-based token selection and dual-teacher distillation, achieving better accuracy-FLOPs trade-offs than baselines on mul...

  2. SARA: Semantically Adaptive Relational Alignment for Video Diffusion Models

    cs.CV 2026-05 unverdicted novelty 6.0

    SARA improves text alignment and motion quality in video diffusion models by routing token-relation distillation supervision to semantically salient pairs using a Stage-1 aligner trained with SAM masks and InfoNCE.

  3. Uncovering and Shaping the Latent Representation of 3D Scene Topology in Vision-Language Models

    cs.CV 2026-05 unverdicted novelty 6.0

    VLMs possess a latent 3D scene topology subspace corresponding to Laplacian eigenmaps that can be causally shaped via Dirichlet energy regularization to improve spatial task performance by up to 12.1%.