arxiv: 2512.01643 · v2 · submitted 2025-12-01 · 💻 cs.CV

ViT³: Unlocking Test-Time Training in Vision

Dongchen Han, Yining Li, Tianyu Li, Zixuan Cao, Ziming Wang, Jun Song, Yu Cheng, Bo Zheng, Gao Huang

Pith reviewed 2026-05-17 02:56 UTC · model grok-4.3

classification 💻 cs.CV

keywords test-time training · vision transformers · linear complexity · image classification · object detection · semantic segmentation · efficient modeling

The pith

ViT³ adapts test-time training to vision by distilling six design insights into a linear-complexity model that competes with Mamba and approaches vision transformers.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper performs a series of experiments to understand how to build effective test-time training systems for images and other visual data. It extracts six practical guidelines for the inner learning module and its training procedure at test time. These guidelines are used to create the ViT³ architecture, which runs with linear complexity and can be parallelized. Readers would find this useful because it turns an abstract reformulation of attention into concrete, working rules for visual tasks like classifying images, generating them, detecting objects, and segmenting scenes. If the findings hold, they suggest that online learning during inference can serve as a viable alternative to both full attention and other linear approximations in computer vision.

Core claim

The authors show that a systematic empirical exploration of test-time training choices for vision leads to six design insights. Applying these insights produces ViT³, a model that reformulates the attention mechanism as constructing and training an inner model from key-value pairs at test time. This results in linear complexity with parallel computation and performance that matches or surpasses linear-complexity models such as Mamba while reducing the difference from optimized vision transformers across multiple visual benchmarks.

What carries the argument

The inner model constructed and trained online from key-value pairs during test-time inference, guided by the six distilled design principles.
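
As a rough illustration of the mechanism named above, the sketch below fits a linear inner model to the key-value pairs with one full-batch gradient step and then applies it to the queries. The squared-error loss, the zero initialization, the 1/N gradient scaling, and the function name `ttt_linear_layer` are assumptions made for this example, not the paper's design; the learning rate of 1.0 echoes the value quoted elsewhere on this page. Under these assumptions the output coincides with unnormalized linear attention, and no N×N matrix is ever formed.

```python
import numpy as np

def ttt_linear_layer(Q, K, V, lr=1.0):
    """One full-batch inner gradient step on a linear inner model, then apply it to Q."""
    N, d = K.shape
    W = np.zeros((d, V.shape[1]))   # inner model weights, zero init (assumption)
    residual = K @ W - V            # prediction error on the key-value pairs
    grad = K.T @ residual / N       # gradient of (1/2N) * ||K W - V||_F^2 w.r.t. W
    W = W - lr * grad               # single inner training step
    return Q @ W                    # trained inner model applied to the queries

rng = np.random.default_rng(0)
N, d = 64, 8
Q, K, V = (rng.standard_normal((N, d)) for _ in range(3))

out = ttt_linear_layer(Q, K, V)
linear_attn = Q @ (K.T @ V) / N     # unnormalized linear attention
print(out.shape, np.allclose(out, linear_attn))
```

Each matrix product here costs O(N d²), which is where the linear scaling in sequence length comes from. A richer inner model or a different inner loss would change the update rule but not this overall pattern of "train on (K, V), then evaluate on Q".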

If this is right

ViT³ achieves linear computational complexity for visual sequence modeling.
It performs competitively with Mamba and linear attention on image classification, generation, object detection, and semantic segmentation.
The architecture supports parallelizable computation.
Future visual TTT models can use the six insights as starting design principles.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The six insights may help in adapting TTT to video sequences or other temporal visual data.
Hybrid designs combining ViT³ elements with existing transformer optimizations could further close the performance gap.
Testing the insights at much larger model scales would clarify their robustness.

Load-bearing premise

The six distilled design insights will generalize beyond the tested tasks and datasets and remain effective when the inner training is scaled up.

What would settle it

A clear falsifier would be if ViT³ shows substantially worse performance than Mamba or linear attention on a standard visual benchmark not used in the study, or if removing one of the key insights causes no drop in results.

Figures

Figures reproduced from arXiv: 2512.01643 by Bo Zheng, Dongchen Han, Gao Huang, Jun Song, Tianyu Li, Yining Li, Yu Cheng, Ziming Wang, Zixuan Cao.

**Figure 1.** Illustration of Softmax attention [58], linear attention [31], and Test-Time Training (TTT) module [55]. (a) Softmax attention can be viewed as building a two-layer MLP that directly uses the uncompressed keys K and values V, where the hidden width equals the sequence length N and the nonlinearity is Softmax. While effective, this N-width MLP leads to $O(N^2)$ costs when applied to the queries $Q \in \mathbb{R}^{N \times d}$. …

**Figure 2.** Illustration of the TTT model building block. TTT …

**Figure 3.** Results of TTT models with inner modules of …

**Figure 4.** Comparisons between DeiT and ViT³ in (a) FPS on RTX3090, and (b) per-image GPU memory usage.

5.4. Image Generation. We further benchmark TTT method on the class-conditional image generation task using the ImageNet-1K dataset. Specifically, we replace the Softmax attention module in DiT [42] with our ViT³ block, yielding the DiT³ model family. Following standard protocol, we report FID on 50 000 validation …
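
The two-layer-MLP reading of Softmax attention in the Figure 1 caption can be checked numerically. The sketch below is illustrative only: the function names are hypothetical, and the 1/√d scaling is the standard convention rather than something the caption states. It writes attention once in the usual form and once as an MLP whose first-layer weights come from K and whose second-layer weights are V, and shows that the hidden activations have shape (N, N).

```python
import numpy as np

def softmax(x):
    """Row-wise softmax with max-subtraction for numerical stability."""
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def softmax_attention(Q, K, V):
    """Standard scaled dot-product attention."""
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

def attention_as_mlp(Q, K, V):
    """The same computation written as a two-layer MLP built from K and V."""
    W1 = K.T / np.sqrt(K.shape[-1])   # first layer, shape (d, N): hidden width is N
    W2 = V                            # second layer, shape (N, d)
    hidden = softmax(Q @ W1)          # (N, N) activations: the quadratic term
    return hidden @ W2, hidden

rng = np.random.default_rng(0)
N, d = 16, 4
Q, K, V = (rng.standard_normal((N, d)) for _ in range(3))

out, hidden = attention_as_mlp(Q, K, V)
print(hidden.shape, np.allclose(out, softmax_attention(Q, K, V)))
```

Because the hidden layer has one unit per key, both compute and memory grow quadratically with N when the queries are the same sequence, which is the cost that linear attention and TTT aim to avoid by compressing K and V into a fixed-size inner model.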

Original abstract

Test-Time Training (TTT) has recently emerged as a promising direction for efficient sequence modeling. TTT reformulates attention operation as an online learning problem, constructing a compact inner model from key-value pairs at test time. This reformulation opens a rich and flexible design space while achieving linear computational complexity. However, crafting a powerful visual TTT design remains challenging: fundamental choices for the inner module and inner training lack comprehensive understanding and practical guidelines. To bridge this critical gap, in this paper, we present a systematic empirical study of TTT designs for visual sequence modeling. From a series of experiments and analyses, we distill six practical insights that establish design principles for effective visual TTT and illuminate paths for future improvement. These findings culminate in the Vision Test-Time Training (ViT$^3$) model, a pure TTT architecture that achieves linear complexity and parallelizable computation. We evaluate ViT$^3$ across diverse visual tasks, including image classification, image generation, object detection, and semantic segmentation. Results show that ViT$^3$ consistently matches or outperforms advanced linear-complexity models (e.g., Mamba and linear attention variants) and effectively narrows the gap to highly optimized vision Transformers. We hope this study and the ViT$^3$ baseline can facilitate future work on visual TTT models. Code: github.com/LeapLabTHU/ViTTT.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. The manuscript conducts a systematic empirical study of Test-Time Training (TTT) designs for visual sequence modeling. From a series of experiments the authors distill six practical design insights, which they use to construct ViT³, a pure linear-complexity TTT architecture with parallelizable computation. ViT³ is evaluated on image classification, image generation, object detection, and semantic segmentation, where it is reported to match or outperform advanced linear-complexity baselines (Mamba, linear-attention variants) and to narrow the gap to highly optimized vision Transformers.

Significance. If the empirical results prove reproducible under matched capacity and training regimes, the work supplies concrete, actionable guidelines for visual TTT and a competitive open-source baseline. The code release and the breadth of tasks (classification, generation, detection, segmentation) are strengths that increase the study's potential utility for the community.

minor comments (3)

The abstract states that six design insights are distilled but does not enumerate them; a concise bullet list in the abstract or a dedicated early section would improve scannability.
The experimental sections should explicitly report error bars, number of runs, and data-selection criteria for all quantitative tables; their absence makes it difficult to assess the stability of the claimed outperformance.
Notation for the inner-model update rule and the distinction between training-time and test-time parameters could be clarified with a small diagram or pseudocode block in the method section.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive summary of our manuscript, the recognition of its significance for the community, and the recommendation for minor revision. We are pleased that the empirical study, the six design insights, the breadth of evaluated tasks, and the open-source code release are viewed as strengths.

Circularity Check

0 steps flagged

No significant circularity

This is an empirical study that runs ablation experiments on TTT inner-module and training choices, distills six design insights from the observed results, and assembles ViT³ accordingly. All performance claims are tied directly to the reported numbers on classification, generation, detection, and segmentation benchmarks rather than to any closed-form derivation or self-referential equation. No load-bearing step reduces by construction to a fitted parameter or prior self-citation; the work therefore remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review prevents exhaustive enumeration; the work appears to rest on standard supervised-learning assumptions and hyper-parameter choices rather than new invented entities or unstated axioms.

pith-pipeline@v0.9.0 · 5567 in / 1057 out tokens · 29525 ms · 2026-05-17T02:56:09.319903+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Paper passage: We distill six practical insights... loss functions L for which ∂²L/∂V̂∂V vanishes are not suitable... single epoch of full-batch inner training... large inner learning rate (1.0)... increasing inner model capacity... deep inner models suffer from optimization difficulties... convolutional architectures are particularly appropriate

IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · reality_from_one_distinction · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Paper passage: TTT reformulates attention operation as an online learning problem, constructing a compact inner model from key-value pairs at test time... linear computational complexity
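
The first quoted passage states that losses with vanishing ∂²L/∂V̂∂V are unsuitable but does not say why. One plausible reading, offered here as a sketch rather than as the paper's own derivation, is that the mixed derivative measures whether the inner gradient signal depends on the values at all. The symbols f_W, g, and h below are introduced for illustration only.

```latex
% Inner model prediction and inner gradient (chain rule)
\hat{V} = f_W(K), \qquad
\frac{\partial \mathcal{L}}{\partial W}
  = \left(\frac{\partial \hat{V}}{\partial W}\right)^{\!\top}
    \frac{\partial \mathcal{L}}{\partial \hat{V}}

% Squared error: the gradient signal depends on V
\mathcal{L} = \tfrac{1}{2}\lVert \hat{V} - V \rVert^2
\;\Rightarrow\;
\frac{\partial \mathcal{L}}{\partial \hat{V}} = \hat{V} - V, \qquad
\frac{\partial^2 \mathcal{L}}{\partial \hat{V}\,\partial V} = -I \neq 0

% Separable loss: the gradient signal ignores V
\mathcal{L} = g(\hat{V}) + h(V)
\;\Rightarrow\;
\frac{\partial \mathcal{L}}{\partial \hat{V}} = \nabla g(\hat{V}), \qquad
\frac{\partial^2 \mathcal{L}}{\partial \hat{V}\,\partial V} = 0
```

In the separable case the update to W is the same whatever V is, so the inner model could not store any information about the values it is meant to compress.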

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Test-Time Training with KV Binding Is Secretly Linear Attention
cs.LG · 2026-02 · conditional · novelty 8.0

Test-time training with KV binding reduces to learned linear attention.

Mixture-of-Top-k Attention: Efficient Attention via Scalable Fast Weights
cs.LG · 2026-02 · unverdicted · novelty 7.0

MiTA makes attention scalable by gathering query-aware top-k key-value pairs through landmarks as deformable routed experts and compressing the N-width fast-weight MLP into a shared narrower expert.

Linearizing Vision Transformer with Test-Time Training
cs.CV · 2026-05 · unverdicted · novelty 6.0

Using Test-Time Training's structural match to Softmax attention plus key normalization and locality modules allows inheriting pretrained weights and fine-tuning Stable Diffusion 3.5 in one hour to match quality while...

Reference graph

Works this paper leans on

85 extracted references · 85 canonical work pages · cited by 3 Pith papers · 6 internal anchors

[1] Ali Behrouz, Peilin Zhong, and Vahab Mirrokni. Titans: Learning to memorize at test time. arXiv preprint arXiv:2501.00663, 2024.
[2] Ali Behrouz, Zeman Li, Praneeth Kacham, Majid Daliri, Yuan Deng, Peilin Zhong, Meisam Razaviyayn, and Vahab Mirrokni. Atlas: Learning to optimally memorize the context at test time. arXiv preprint arXiv:2505.23735, 2025.
[3] Aydar Bulatov, Yury Kuratov, and Mikhail Burtsev. Recurrent memory transformer. In NeurIPS, 2022.
[4] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. In ECCV, 2020.
[5] Xingyu Chen, Yue Chen, Yuliang Xiu, Andreas Geiger, and Anpei Chen. TTT3R: 3D reconstruction as test-time training. arXiv preprint arXiv:2509.26645, 2025.
[6] Krzysztof Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea Gane, Tamas Sarlos, Peter Hawkins, Jared Davis, Afroz Mohiuddin, Lukasz Kaiser, et al. Rethinking attention with performers. In ICLR, 2021.
[7] Xiangxiang Chu, Zhi Tian, Bo Zhang, Xinlong Wang, and Chunhua Shen. Conditional positional encodings for vision transformers. In ICLR, 2023.
[8] Ekin D Cubuk, Barret Zoph, Jonathon Shlens, and Quoc V Le. Randaugment: Practical automated data augmentation with a reduced search space. In CVPRW, 2020.
[9] Karan Dalal, Daniel Koceja, Jiarui Xu, Yue Zhao, Shihao Han, Ka Chun Cheung, Jan Kautz, Yejin Choi, Yu Sun, and Xiaolong Wang. One-minute video generation with test-time training. In CVPR, 2025.
[10] Olivier Delalleau and Yoshua Bengio. Shallow vs. deep sum-product networks. In NeurIPS, 2011.
[11] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In CVPR, 2009.
[12] Xiaoyi Dong, Jianmin Bao, Dongdong Chen, Weiming Zhang, Nenghai Yu, Lu Yuan, Dong Chen, and Baining Guo. Cswin transformer: A general vision transformer backbone with cross-shaped windows. In CVPR, 2022.
[13] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR, 2021.
[14] Jiawei Du, Daquan Zhou, Jiashi Feng, Vincent Tan, and Joey Tianyi Zhou. Sharpness-aware training for free. In NeurIPS, 2022.
[15] Ronen Eldan and Ohad Shamir. The power of depth for feedforward neural networks. In Conference on learning theory,
[16] Qihang Fan, Huaibo Huang, Mingrui Chen, Hongmin Liu, and Ran He. Rmt: Retentive networks meet vision transformers. In CVPR, 2024.
[17] Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In ICML, 2017.
[18] Albert Gu and Tri Dao. Mamba: Linear-time sequence modeling with selective state spaces. arXiv preprint arXiv:2312.00752, 2023.
[19] Jianyuan Guo, Kai Han, Han Wu, Yehui Tang, Xinghao Chen, Yunhe Wang, and Chang Xu. Cmt: Convolutional neural networks meet vision transformers. In CVPR, 2022.
[20] Dongchen Han, Xuran Pan, Yizeng Han, Shiji Song, and Gao Huang. Flatten transformer: Vision transformer using focused linear attention. In ICCV, 2023.
[21] Dongchen Han, Yifan Pu, Zhuofan Xia, Yizeng Han, Xuran Pan, Xiu Li, Jiwen Lu, Shiji Song, and Gao Huang. Bridging the divide: Reconsidering softmax and linear attention. In NeurIPS, 2024.
[22] Dongchen Han, Ziyi Wang, Zhuofan Xia, Yizeng Han, Yifan Pu, Chunjiang Ge, Jun Song, Shiji Song, Bo Zheng, and Gao Huang. Demystify mamba in vision: A linear attention perspective. In NeurIPS, 2024.
[23] Dongchen Han, Tianzhu Ye, Yizeng Han, Zhuofan Xia, Shiji Song, and Gao Huang. Agent attention: On the integration of softmax and linear attention. In ECCV, 2024.
[24] Ali Hassani, Steven Walton, Jiachen Li, Shen Li, and Humphrey Shi. Neighborhood attention transformer. In CVPR, 2023.
[25] Ali Hatamizadeh, Greg Heinrich, Hongxu Yin, Andrew Tao, Jose M Alvarez, Jan Kautz, and Pavlo Molchanov. Fastervit: Fast vision transformers with hierarchical attention. In ICLR,
[26] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR,
[27] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask r-cnn. In ICCV, 2017.
[28] Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. Mobilenets: Efficient convolutional neural networks for mobile vision applications. In CVPR, 2017.
[29] Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger. Densely connected convolutional networks. In CVPR, 2017.
[30] Tao Huang, Xiaohuan Pei, Shan You, Fei Wang, Chen Qian, and Chang Xu. Localmamba: Visual state space model with windowed selective scan. In ECCVW, 2024.
[31] Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, and François Fleuret. Transformers are rnns: Fast autoregressive transformers with linear attention. In ICML, 2020.
[32] Shufan Li, Harkanwar Singh, and Aditya Grover. Mamba-nd: Selective state space modeling for multi-dimensional data. In ECCV, 2024.
[33] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In ECCV, 2014.
[34] Yue Liu, Yunjie Tian, Yuzhong Zhao, Hongtian Yu, Lingxi Xie, Yaowei Wang, Qixiang Ye, Jianbin Jiao, and Yunfan Liu. Vmamba: Visual state space model. In NeurIPS, 2024.
[35] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In ICCV, 2021.
[36] Zhuang Liu, Hanzi Mao, Chao-Yuan Wu, Christoph Feichtenhofer, Trevor Darrell, and Saining Xie. A convnet for the 2020s. In CVPR, 2022.
[37] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In ICLR, 2018.
[38] Jiachen Lu, Junge Zhang, Xiatian Zhu, Jianfeng Feng, Tao Xiang, and Li Zhang. Softmax-free linear transformers. IJCV, 2024.
[39] Luke Metz, Niru Maheswaranathan, Brian Cheung, and Jascha Sohl-Dickstein. Meta-learning update rules for unsupervised representation learning. In ICLR, 2019.
[40] Guido F Montufar, Razvan Pascanu, Kyunghyun Cho, and Yoshua Bengio. On the number of linear regions of deep neural networks. In NeurIPS, 2014.
[41] Niklas Muennighoff, Zitong Yang, Weijia Shi, Xiang Lisa Li, Li Fei-Fei, Hannaneh Hajishirzi, Luke Zettlemoyer, Percy Liang, Emmanuel Candès, and Tatsunori B Hashimoto. s1: Simple test-time scaling. In EMNLP, 2025.
[42] William Peebles and Saining Xie. Scalable diffusion models with transformers. In ICCV, 2023.
[43] Xiaohuan Pei, Tao Huang, and Chang Xu. Efficientvmamba: Atrous selective scan for light weight visual mamba. In AAAI, 2025.
[44] Ben Poole, Subhaneil Lahiri, Maithra Raghu, Jascha Sohl-Dickstein, and Surya Ganguli. Exponential expressivity in deep neural networks through transient chaos. In NeurIPS,
[45] Zhen Qin, Weixuan Sun, Hui Deng, Dongxu Li, Yunshen Wei, Baohong Lv, Junjie Yan, Lingpeng Kong, and Yiran Zhong. cosformer: Rethinking softmax in attention. In ICLR, 2022.
[46] Maithra Raghu, Ben Poole, Jon Kleinberg, Surya Ganguli, and Jascha Sohl-Dickstein. On the expressive power of deep neural networks. In ICML, 2017.
[47] Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. Mobilenetv2: Inverted residuals and linear bottlenecks. In CVPR, 2018.
[48] Nikunj Saunshi, Nishanth Dikkala, Zhiyuan Li, Sanjiv Kumar, and Sashank J Reddi. Reasoning with latent thoughts: On the power of looped transformers. In ICLR, 2025.
[49] Imanol Schlag, Kazuki Irie, and Jürgen Schmidhuber. Linear transformers are secretly fast weight programmers. In ICML,
[50] Noam Shazeer. Glu variants improve transformer. arXiv preprint arXiv:2002.05202, 2020.
[51] Dai Shi. Transnext: Robust foveal visual perception for vision transformers. In CVPR, 2024.
[52] Chenyang Si, Weihao Yu, Pan Zhou, Yichen Zhou, Xinchao Wang, and Shuicheng Yan. Inception transformer. In NeurIPS, 2022.
[53] Charlie Snell, Jaehoon Lee, Kelvin Xu, and Aviral Kumar. Scaling llm test-time compute optimally can be more effective than scaling model parameters. arXiv preprint arXiv:2408.03314, 2024.
[54] Weixuan Sun, Zhen Qin, Hui Deng, Jianyuan Wang, Yi Zhang, Kaihao Zhang, Nick Barnes, Stan Birchfield, Lingpeng Kong, and Yiran Zhong. Vicinity vision transformer. TPAMI, 2023.
[55] Yu Sun, Xinhao Li, Karan Dalal, Jiarui Xu, Arjun Vikram, Genghan Zhang, Yann Dubois, Xinlei Chen, Xiaolong Wang, Sanmi Koyejo, et al. Learning to (learn at test time): Rnns with expressive hidden states. In ICML, 2025.
[56] Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Hervé Jégou. Training data-efficient image transformers & distillation through attention. In ICML, 2021.
[57] Hugo Touvron, Matthieu Cord, Alexandre Sablayrolles, Gabriel Synnaeve, and Hervé Jégou. Going deeper with image transformers. In ICCV, 2021.
[58] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NeurIPS, 2017.
[59] Hongyu Wang, Shuming Ma, Li Dong, Shaohan Huang, Dongdong Zhang, and Furu Wei. Deepnet: Scaling transformers to 1,000 layers. TPAMI, 2024.
[60] Wenhai Wang, Enze Xie, Xiang Li, Deng-Ping Fan, Kaitao Song, Ding Liang, Tong Lu, Ping Luo, and Ling Shao. Pyramid vision transformer: A versatile backbone for dense prediction without convolutions. In ICCV, 2021.
[61] Wenhai Wang, Jifeng Dai, Zhe Chen, Zhenhang Huang, Zhiqi Li, Xizhou Zhu, Xiaowei Hu, Tong Lu, Lewei Lu, Hongsheng Li, et al. Internimage: Exploring large-scale vision foundation models with deformable convolutions. In CVPR, 2023.
[62] Zhuofan Xia, Xuran Pan, Shiji Song, Li Erran Li, and Gao Huang. Vision transformer with deformable attention. In CVPR, 2022.
[63] Zhuofan Xia, Xuran Pan, Shiji Song, Li Erran Li, and Gao Huang. Dat++: Spatially dynamic vision transformer with deformable attention. arXiv preprint arXiv:2309.01430,
[64] Zhuofan Xia, Dongchen Han, Yizeng Han, Xuran Pan, Shiji Song, and Gao Huang. Gsva: Generalized segmentation via multimodal large language models. In CVPR, 2024.
[65] Tete Xiao, Yingcheng Liu, Bolei Zhou, Yuning Jiang, and Jian Sun. Unified perceptual parsing for scene understanding. In ECCV, 2018.
[66] Yicheng Xiao, Lin Song, Jiangshan Wang, Siyu Song, Yixiao Ge, Xiu Li, Ying Shan, et al. Mambatree: Tree topology is all you need in state space model. In NeurIPS, 2024.
[67] Enze Xie, Wenhai Wang, Zhiding Yu, Anima Anandkumar, Jose M Alvarez, and Ping Luo. Segformer: Simple and efficient design for semantic segmentation with transformers. In NeurIPS, 2021.
[68] Yunyang Xiong, Zhanpeng Zeng, Rudrasis Chakraborty, Mingxing Tan, Glenn Fung, Yin Li, and Vikas Singh. Nyströmformer: A nyström-based algorithm for approximating self-attention. In AAAI, 2021.
[69] Jianwei Yang, Chunyuan Li, Pengchuan Zhang, Xiyang Dai, Bin Xiao, Lu Yuan, and Jianfeng Gao. Focal self-attention for local-global interactions in vision transformers. In NeurIPS, 2021.
[70] Songlin Yang, Bailin Wang, Yikang Shen, Rameswar Panda, and Yoon Kim. Gated linear attention transformers with hardware-efficient training. In ICML, 2024.
[71] Songlin Yang, Bailin Wang, Yu Zhang, Yikang Shen, and Yoon Kim. Parallelizing linear transformers with the delta rule over sequence length. In NeurIPS, 2024.
[72] Weihao Yu and Xinchao Wang. Mambaout: Do we really need mamba for vision? In CVPR, 2025.
[73] Weihao Yu, Pan Zhou, Shuicheng Yan, and Xinchao Wang. Inceptionnext: When inception meets convnext. In CVPR,
[74] Sangdoo Yun, Dongyoon Han, Seong Joon Oh, Sanghyuk Chun, Junsuk Choe, and Youngjoon Yoo. Cutmix: Regularization strategy to train strong classifiers with localizable features. In ICCV, 2019.
[75] Hongyi Zhang, Moustapha Cisse, Yann N. Dauphin, and David Lopez-Paz. mixup: Beyond empirical risk minimization. In ICLR, 2018.
[76] Qiming Zhang, Jing Zhang, Yufei Xu, and Dacheng Tao. Vision transformer with quadrangle attention. TPAMI, 2024.
[77] Tianyuan Zhang, Sai Bi, Yicong Hong, Kai Zhang, Fujun Luan, Songlin Yang, Kalyan Sunkavalli, William T Freeman, and Hao Tan. Test-time training done right. arXiv preprint arXiv:2505.23884, 2025.
[78] Zhun Zhong, Liang Zheng, Guoliang Kang, Shaozi Li, and Yi Yang. Random erasing data augmentation. In AAAI, 2020.
[79] Bolei Zhou, Hang Zhao, Xavier Puig, Tete Xiao, Sanja Fidler, Adela Barriuso, and Antonio Torralba. Semantic understanding of scenes through the ade20k dataset. IJCV, 2019.
[80] Lei Zhu, Xinjiang Wang, Zhanghan Ke, Wayne Zhang, and Rynson WH Lau. Biformer: Vision transformer with bi-level routing attention. In CVPR, 2023.

Showing first 80 references.