Scaling Vision Transformers to 22 Billion Parameters

Alexander Kolesnikov; Alexey Gritsenko; Andreas Steiner; Anurag Arnab; Aravindh Mahendran; Avital Oliver; Basil Mustafa; Carlos Riquelme; Cristina Vasconcelos; Daniel Keysers

arxiv: 2302.05442 · v1 · pith:VRCOCLKDnew · submitted 2023-02-10 · 💻 cs.CV · cs.AI · cs.LG

Mostafa Dehghani, Josip Djolonga, Basil Mustafa, Piotr Padlewski, Jonathan Heek, Justin Gilmer, Andreas Steiner, Mathilde Caron, Robert Geirhos, Ibrahim Alabdulmohsin, Rodolphe Jenatton, Lucas Beyer, Michael Tschannen, Anurag Arnab, Xiao Wang, Carlos Riquelme, Matthias Minderer, Joan Puigcerver, Utku Evci, Manoj Kumar, Sjoerd van Steenkiste, Gamaleldin F. Elsayed, Aravindh Mahendran, Fisher Yu, Avital Oliver, Fantine Huot, Jasmijn Bastings, Mark Patrick Collier, Alexey Gritsenko, Vighnesh Birodkar, Cristina Vasconcelos, Yi Tay, Thomas Mensink, Alexander Kolesnikov, Filip Pavetić, Dustin Tran, Thomas Kipf, Mario Lučić, Xiaohua Zhai, Daniel Keysers, Jeremiah Harmsen, Neil Houlsby

classification 💻 cs.CV cs.AI cs.LG

keywords parameters, scaling, transformers, vision, vit-22b, demonstrates, improved, language

The scaling of Transformers has driven breakthrough capabilities for language models. At present, the largest large language models (LLMs) contain upwards of 100B parameters. Vision Transformers (ViT) have introduced the same architecture to image and video modelling, but these have not yet been successfully scaled to nearly the same degree; the largest dense ViT contains 4B parameters (Chen et al., 2022). We present a recipe for highly efficient and stable training of a 22B-parameter ViT (ViT-22B) and perform a wide variety of experiments on the resulting model. When evaluated on downstream tasks (often with a lightweight linear model on frozen features), ViT-22B demonstrates increasing performance with scale. We further observe other interesting benefits of scale, including an improved tradeoff between fairness and performance, state-of-the-art alignment to human visual perception in terms of shape/texture bias, and improved robustness. ViT-22B demonstrates the potential for "LLM-like" scaling in vision, and provides key steps towards getting there.

This paper has not been read by Pith yet.

Forward citations

Cited by 17 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Size Doesn't Matter: Cosine-Scored Sparse Autoencoders
cs.LG 2026-06 unverdicted novelty 7.0

Cosine-scored SAEs with a learned direction-magnitude blend learn more concept-aligned features than standard inner-product SAEs at matched reconstruction quality.
Improving Neural Network Training by Decoupling the Magnitude and Direction of Weight Vectors
cs.LG 2026-06 unverdicted novelty 6.0

MD Decoupling factorizes weights into fixed-norm directions and learnable per-row/column magnitudes updated at independent rates, improving Adam and Muon training stability and scale transfer without weight decay or warmup.
Scaling Laws for Neural-Network Quantum States
cond-mat.dis-nn 2026-06 unverdicted novelty 6.0

Transformer wave functions for the J1-J2 Heisenberg model exhibit size-independent power-law decay of V-score with compute, with the exponent decreasing as frustration increases.
Most Transformer Modifications Still Do Not Transfer at 1-3B: A 2020-2026 Update to Narang et al. (2021) with Downstream Evaluation and a Noise Floor
cs.LG 2026-05 conditional novelty 6.0

Empirical update to prior work shows most of 20 recent Transformer modifications do not transfer at 1-3B scales when measured with downstream CLIMB-12 tasks, multi-seed noise floor, and cross-scale stability.
A Readiness-Driven Runtime for Pipeline-Parallel Training under Runtime Variability
cs.DC 2026-05 unverdicted novelty 6.0

RRFP introduces a readiness-driven runtime for pipeline parallelism that uses schedules as hints and ready-set arbitration to improve utilization under runtime variability, reporting up to 2.77x speedup on multimodal ...
Scaling Data-Constrained Language Models
cs.CL 2023-05 conditional novelty 6.0

Repeating training data up to 4 epochs yields negligible loss increase versus unique data for fixed compute, and a new scaling law accounts for the decaying value of repeated tokens and excess parameters.
MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action
cs.CV 2023-03 unverdicted novelty 6.0

MM-REACT uses textual prompts to let ChatGPT collaborate with external vision experts for zero-shot multimodal reasoning and action on advanced visual tasks.
PaLM-E: An Embodied Multimodal Language Model
cs.LG 2023-03 conditional novelty 6.0

PaLM-E is a single 562B-parameter multimodal model that performs embodied reasoning tasks like robotic manipulation planning and visual question answering by interleaving vision, state, and text inputs with positive t...
Unsupervised Semantic Segmentation Facilitates Model Understanding
cs.CV 2026-05 unverdicted novelty 5.0

A visualization protocol based on unsupervised semantic segmentation reveals positional biases, scaling behaviors, and boundary artifacts across self-supervised vision transformer models.
Unsupervised Semantic Segmentation Facilitates Model Understanding
cs.CV 2026-05 unverdicted novelty 5.0

A visualization protocol using unsupervised semantic segmentation outputs reveals positional biases, scaling behaviors, and boundary artifacts in self-supervised ViTs and distinguishes them from locality bias.
CG-MLLM: Captioning and Generating 3D content via Multi-modal Large Language Models
cs.CV 2026-01 unverdicted novelty 5.0

CG-MLLM is a multimodal LLM using a Mixture-of-Transformer architecture with separate TokenAR and BlockAR components integrated with a pre-trained vision-language backbone and 3D VAE to enable 3D captioning and high-f...
Scalable Object Detection in the Car Interior With Vision Foundation Models
cs.CV 2025-08 unverdicted novelty 5.0

ODAL framework distributes vision foundation models across on-board and cloud for car interior object detection, with fine-tuned LLaVA 1.5 7B reaching 89% ODAL score, 71% improvement, and outperforming GPT-4o while re...
VectraYX-Nano: A 42M-Parameter Spanish Cybersecurity Language Model with Curriculum Learning and Native Tool Use
cs.CL 2026-05 unverdicted novelty 4.0

VectraYX-Nano is a 42M-parameter Spanish cybersecurity LLM trained with curriculum learning and native MCP tool use, achieving 0.78 conversational gate and improved tool selection with denser data.
VectraYX-Nano: A 42M-Parameter Spanish Cybersecurity Language Model with Curriculum Learning and Native Tool Use
cs.CL 2026-05 unverdicted novelty 4.0

A 42M-parameter Spanish cybersecurity LLM trained with curriculum learning and MCP tool use, reporting benchmark scores from corpus ablations and SFT rebalancing.
VectraYX-Nano: A 42M-Parameter Spanish Cybersecurity Language Model with Curriculum Learning and Native Tool Use
cs.CL 2026-05 unverdicted novelty 4.0

Trains a 42M-parameter Spanish cybersecurity LLM from scratch with curriculum phases and achieves 0.23 tool-selection accuracy after SFT mixture rebalancing to 1:21 tool-use ratio.
Multimodal Alignment and Preference Optimization for Zero-Shot Conditional RNA Generation
q-bio.BM 2026-05 unverdicted novelty 4.0

Moirain models use multimodal SFT and DPO to generate novel RNA sequences with superior protein binding affinities in a zero-shot conditional setting.
LLMOrbit: A Circular Taxonomy of Large Language Models -From Scaling Walls to Agentic AI Systems
cs.LG 2026-01 unverdicted novelty 3.0

A survey taxonomy of LLMs identifies three scaling crises and six efficiency paradigms while tracing the shift from generation to tool-using agents.