arxiv: 2212.06727 · v1 · pith:QU3LUUWRnew · submitted 2022-12-13 · 💻 cs.CV

What do Vision Transformers Learn? A Visual Exploration

classification 💻 cs.CV
keywords vits, features, information, layers, transformers, vision, architecture, convolutional

Vision transformers (ViTs) are quickly becoming the de-facto architecture for computer vision, yet we understand very little about why they work and what they learn. While existing studies visually analyze the mechanisms of convolutional neural networks, an analogous exploration of ViTs remains challenging. In this paper, we first address the obstacles to performing visualizations on ViTs. Assisted by these solutions, we observe that neurons in ViTs trained with language model supervision (e.g., CLIP) are activated by semantic concepts rather than visual features. We also explore the underlying differences between ViTs and CNNs, and we find that transformers detect image background features, just like their convolutional counterparts, but their predictions depend far less on high-frequency information. On the other hand, both architecture types behave similarly in the way features progress from abstract patterns in early layers to concrete objects in late layers. In addition, we show that ViTs maintain spatial information in all layers except the final layer. In contrast to previous works, we show that the last layer most likely discards the spatial information and behaves as a learned global pooling operation. Finally, we conduct large-scale visualizations on a wide range of ViT variants, including DeiT, CoaT, ConViT, PiT, Swin, and Twin, to validate the effectiveness of our method.

Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. LESSViT: Robust Hyperspectral Representation Learning under Spectral Configuration Shift

    cs.CV 2026-05 unverdicted novelty 6.0

    LESSViT introduces a low-rank efficient spatial-spectral attention mechanism and a hyperspectral masked autoencoder to improve generalization across spectral configuration shifts in hyperspectral imagery.

  2. A Single Neuron Is Sufficient to Bypass Safety Alignment in Large Language Models

    cs.CL 2026-05 unverdicted novelty 6.0

    Suppressing one refusal neuron or amplifying one concept neuron bypasses safety alignment in LLMs from 1.7B to 70B parameters without training or prompt engineering.

  3. Realiz3D: 3D Generation Made Photorealistic via Domain-Aware Learning

    cs.GR 2026-03 conditional novelty 6.0

    Realiz3D decouples visual domain from 3D controls in diffusion models via domain-aware residual adapters to enable photorealistic controllable generation.

  4. Textual Supervision Enhances Geospatial Representations in Vision-Language Models

    cs.CV 2026-06 unverdicted novelty 3.0

    Textual supervision enhances geospatial representations in vision-language models relative to vision-only models, shown via evaluations on image clusters of varying localizability.