pith. machine review for the scientific record.

arxiv: 2512.01643 · v2 · submitted 2025-12-01 · 💻 cs.CV

ViT³: Unlocking Test-Time Training in Vision

Pith reviewed 2026-05-17 02:56 UTC · model grok-4.3

classification 💻 cs.CV
keywords test-time training · vision transformers · linear complexity · image classification · object detection · semantic segmentation · efficient modeling

The pith

ViT³ adapts test-time training to vision by distilling six design insights into a linear-complexity model that competes with Mamba and approaches vision transformers.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper performs a series of experiments to understand how to build effective test-time training systems for images and other visual data. It extracts six practical guidelines for the inner learning module and its training procedure at test time. These guidelines are used to create the ViT³ architecture, which runs with linear complexity and can be parallelized. Readers would find this useful because it turns an abstract reformulation of attention into concrete, working rules for visual tasks like classifying images, generating them, detecting objects, and segmenting scenes. If the findings hold, they suggest that online learning during inference can serve as a viable alternative to both full attention and other linear approximations in computer vision.

Core claim

The authors show that a systematic empirical exploration of test-time training choices for vision leads to six design insights. Applying these insights produces ViT³, a model that reformulates the attention mechanism as constructing and training an inner model from key-value pairs at test time. This results in linear complexity with parallel computation and performance that matches or surpasses linear-complexity models such as Mamba while reducing the difference from optimized vision transformers across multiple visual benchmarks.

What carries the argument

The inner model constructed and trained online from key-value pairs during test-time inference, guided by the six distilled design principles.
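As a rough illustration of what "constructing and training an inner model from key-value pairs" can mean, the sketch below takes the simplest possible case. It assumes a linear inner model with zero-initialized weights, a squared-error inner loss, and a single full-batch gradient step; these choices and all function names are illustrative assumptions, not the ViT³ design. Under these assumptions the updated inner model, applied to the queries, reproduces unnormalized linear attention.

```python
# Minimal sketch, NOT the paper's implementation.
# Assumptions: linear inner model W (d x d_v), zero init, squared-error inner loss,
# one full-batch gradient step. All names here are hypothetical.

def matmul(a, b):
    """Plain list-of-lists matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def ttt_linear_one_step(keys, values, queries, lr=1.0):
    """keys: N x d, values: N x d_v, queries: M x d. Returns M x d_v outputs."""
    d, d_v = len(keys[0]), len(values[0])
    w = [[0.0] * d_v for _ in range(d)]            # inner weights, zero init
    pred = matmul(keys, w)                          # V_hat = K W
    resid = [[p - v for p, v in zip(pr, vr)] for pr, vr in zip(pred, values)]
    grad = matmul(transpose(keys), resid)           # dL/dW for L = 0.5 * ||K W - V||^2
    w = [[wi - lr * gi for wi, gi in zip(wr, gr)] for wr, gr in zip(w, grad)]
    return matmul(queries, w)                       # apply the trained inner model

def linear_attention(keys, values, queries):
    """Unnormalized linear attention Q (K^T V); no N x N matrix is formed."""
    return matmul(queries, matmul(transpose(keys), values))
```

Calling both functions on the same small `keys`, `values`, and `queries` with `lr=1.0` should give identical outputs. Richer inner models or losses would break this equivalence, which is the design space the paper studies.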

If this is right

  • ViT³ achieves linear computational complexity for visual sequence modeling.
  • It performs competitively with Mamba and linear attention on image classification, generation, object detection, and semantic segmentation.
  • The architecture supports parallelizable computation.
  • Future visual TTT models can use the six insights as starting design principles.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The six insights may help in adapting TTT to video sequences or other temporal visual data.
  • Hybrid designs combining ViT³ elements with existing transformer optimizations could further close the performance gap.
  • Testing the insights at much larger model scales would clarify their robustness.

Load-bearing premise

The six distilled design insights will generalize beyond the tested tasks and datasets and remain effective when the inner training is scaled up.

What would settle it

A clear falsifier would be if ViT³ shows substantially worse performance than Mamba or linear attention on a standard visual benchmark not used in the study, or if removing one of the key insights causes no drop in results.

Figures

Figures reproduced from arXiv: 2512.01643 by Bo Zheng, Dongchen Han, Gao Huang, Jun Song, Tianyu Li, Yining Li, Yu Cheng, Ziming Wang, Zixuan Cao.

Figure 1. Illustration of Softmax attention [58], linear attention [31], and Test-Time Training (TTT) module [55]. (a) Softmax attention can be viewed as building a two-layer MLP that directly uses the uncompressed keys K and values V, where the hidden width equals the sequence length N and the nonlinearity is Softmax. While effective, this N-width MLP leads to O(N²) costs when applied to the queries Q ∈ ℝ^(N×d). …
Figure 2. Illustration of the TTT model building block. TTT …
Figure 3. Results of TTT models with inner modules of …
Figure 4. Comparisons between DeiT and ViT³ in (a) FPS on RTX3090, and (b) per-image GPU memory usage.
5.4. Image Generation. We further benchmark TTT method on the class-conditional image generation task using the ImageNet-1K dataset. Specifically, we replace the Softmax attention module in DiT [42] with our ViT³ block, yielding the DiT³ model family. Following standard protocol, we report FID on 50 000 validation …
Original abstract

Test-Time Training (TTT) has recently emerged as a promising direction for efficient sequence modeling. TTT reformulates attention operation as an online learning problem, constructing a compact inner model from key-value pairs at test time. This reformulation opens a rich and flexible design space while achieving linear computational complexity. However, crafting a powerful visual TTT design remains challenging: fundamental choices for the inner module and inner training lack comprehensive understanding and practical guidelines. To bridge this critical gap, in this paper, we present a systematic empirical study of TTT designs for visual sequence modeling. From a series of experiments and analyses, we distill six practical insights that establish design principles for effective visual TTT and illuminate paths for future improvement. These findings culminate in the Vision Test-Time Training (ViT³) model, a pure TTT architecture that achieves linear complexity and parallelizable computation. We evaluate ViT³ across diverse visual tasks, including image classification, image generation, object detection, and semantic segmentation. Results show that ViT³ consistently matches or outperforms advanced linear-complexity models (e.g., Mamba and linear attention variants) and effectively narrows the gap to highly optimized vision Transformers. We hope this study and the ViT³ baseline can facilitate future work on visual TTT models. Code: github.com/LeapLabTHU/ViTTT.
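The Figure 1 caption's reading of Softmax attention as a two-layer MLP of hidden width N, and the abstract's linear-complexity claim, can be made concrete with a small sketch. The 1/sqrt(d) scaling is the usual attention convention rather than something stated on this page, and the multiply counts assume a hypothetical d × d linear inner model; both are assumptions for illustration only.

```python
import math

# Minimal sketch under stated assumptions; function names are hypothetical.

def softmax(xs):
    m = max(xs)                                   # subtract max for numerical stability
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def attention_as_mlp(query, keys, values):
    """One query through a two-layer MLP whose weights are K and V.

    Layer 1 has N hidden units (one per key); the nonlinearity is softmax;
    layer 2 mixes the N value rows with the hidden activations.
    """
    scale = 1.0 / math.sqrt(len(query))           # assumed standard scaling
    hidden = softmax([scale * sum(q * k for q, k in zip(query, key)) for key in keys])
    d_v = len(values[0])
    return [sum(h * v[j] for h, v in zip(hidden, values)) for j in range(d_v)]

def mults_softmax(n, d):
    """Multiplies for N queries through the N-width MLP: two layers of N*d each."""
    return 2 * n * n * d

def mults_compact(n, d):
    """Multiplies for a hypothetical d x d linear inner model:
    build K^T V (N*d*d), then apply it to N queries (N*d*d)."""
    return 2 * n * d * d
```

The ratio `mults_softmax(n, d) / mults_compact(n, d)` is N/d under these counts, so the compact inner model only pays off once the token count exceeds the head dimension, which is the regime of high-resolution images.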

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. The manuscript conducts a systematic empirical study of Test-Time Training (TTT) designs for visual sequence modeling. From a series of experiments the authors distill six practical design insights, which they use to construct ViT³, a pure linear-complexity TTT architecture with parallelizable computation. ViT³ is evaluated on image classification, image generation, object detection, and semantic segmentation, where it is reported to match or outperform advanced linear-complexity baselines (Mamba, linear-attention variants) and to narrow the gap to highly optimized vision Transformers.

Significance. If the empirical results prove reproducible under matched capacity and training regimes, the work supplies concrete, actionable guidelines for visual TTT and supplies a competitive open-source baseline. The release of code and the breadth of tasks (classification, generation, detection, segmentation) are strengths that increase the potential utility of the study for the community.

minor comments (3)
  1. The abstract states that six design insights are distilled but does not enumerate them; a concise bullet list in the abstract or a dedicated early section would improve scannability.
  2. The experimental sections should explicitly report error bars, number of runs, and data-selection criteria for all quantitative tables; their absence makes it difficult to assess the stability of the claimed outperformance.
  3. Notation for the inner-model update rule and the distinction between training-time and test-time parameters could be clarified with a small diagram or pseudocode block in the method section.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive summary of our manuscript, the recognition of its significance for the community, and the recommendation for minor revision. We are pleased that the empirical study, the six design insights, the breadth of evaluated tasks, and the open-source code release are viewed as strengths.

Circularity Check

0 steps flagged

No significant circularity

full rationale

This is an empirical study that runs ablation experiments on TTT inner-module and training choices, distills six design insights from the observed results, and assembles ViT³ accordingly. All performance claims are tied directly to the reported numbers on classification, generation, detection, and segmentation benchmarks rather than to any closed-form derivation or self-referential equation. No load-bearing step reduces by construction to a fitted parameter or prior self-citation; the work therefore remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review prevents exhaustive enumeration; the work appears to rest on standard supervised-learning assumptions and hyper-parameter choices rather than new invented entities or unstated axioms.

pith-pipeline@v0.9.0 · 5567 in / 1057 out tokens · 29525 ms · 2026-05-17T02:56:09.319903+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

    Relation between the paper passage and the cited Recognition theorem.

    We distill six practical insights... loss functions L for which ∂²L/∂V̂∂V vanishes are not suitable... single epoch of full-batch inner training... large inner learning rate (1.0)... increasing inner model capacity... deep inner models suffer from optimization difficulties... convolutional architectures are particularly appropriate

  • IndisputableMonolith/Foundation/ArithmeticFromLogic.lean reality_from_one_distinction unclear

    Relation between the paper passage and the cited Recognition theorem.

    TTT reformulates attention operation as an online learning problem, constructing a compact inner model from key-value pairs at test time... linear computational complexity
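The first quoted passage above mentions inner losses whose mixed derivative ∂²L/∂V̂∂V vanishes. One way to read why such losses would be unsuitable is the short derivation below. The notation (inner model f_W, predictions V̂ = f_W(K), step size η, a single gradient step) is assumed here for illustration and may differ from the paper's.

```latex
% Assumed notation: \hat V = f_W(K), inner loss L(\hat V, V), step size \eta.
\begin{align}
W' &= W - \eta \, \frac{\partial L}{\partial W}
    = W - \eta \left( \frac{\partial \hat V}{\partial W} \right)^{\!\top}
      \frac{\partial L}{\partial \hat V}, \\
\frac{\partial W'}{\partial V}
   &= - \eta \left( \frac{\partial \hat V}{\partial W} \right)^{\!\top}
      \frac{\partial^2 L}{\partial \hat V \, \partial V}.
\end{align}
% Example: L = \tfrac12 \lVert \hat V - V \rVert^2 gives
% \partial^2 L / \partial \hat V \partial V = -I, which is nonzero.
```

Since ∂V̂/∂W does not depend on V, a vanishing mixed derivative means the updated weights W' have no first-order dependence on the values, so no gradient reaches V through the inner update. Squared error avoids this; an absolute-error loss, whose gradient sign(V̂ − V) has zero derivative in V almost everywhere, would not.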

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Test-Time Training with KV Binding Is Secretly Linear Attention

    cs.LG 2026-02 conditional novelty 8.0

    Test-time training with KV binding reduces to learned linear attention.

  2. Mixture-of-Top-k Attention: Efficient Attention via Scalable Fast Weights

    cs.LG 2026-02 unverdicted novelty 7.0

    MiTA makes attention scalable by gathering query-aware top-k key-value pairs through landmarks as deformable routed experts and compressing the N-width fast-weight MLP into a shared narrower expert.

  3. Linearizing Vision Transformer with Test-Time Training

    cs.CV 2026-05 unverdicted novelty 6.0

    Using Test-Time Training's structural match to Softmax attention plus key normalization and locality modules allows inheriting pretrained weights and fine-tuning Stable Diffusion 3.5 in one hour to match quality while...

Reference graph

Works this paper leans on

85 extracted references · 85 canonical work pages · cited by 3 Pith papers · 6 internal anchors

  1. [1]

    Titans: Learning to Memorize at Test Time

    Ali Behrouz, Peilin Zhong, and Vahab Mirrokni. Titans: Learning to memorize at test time. arXiv preprint arXiv:2501.00663, 2024.

  2. [2]

    Atlas: Learning to optimally memorize the context at test time

    Ali Behrouz, Zeman Li, Praneeth Kacham, Majid Daliri, Yuan Deng, Peilin Zhong, Meisam Razaviyayn, and Vahab Mirrokni. Atlas: Learning to optimally memorize the context at test time. arXiv preprint arXiv:2505.23735, 2025.

  3. [3]

    Recurrent memory transformer

    Aydar Bulatov, Yury Kuratov, and Mikhail Burtsev. Recurrent memory transformer. In NeurIPS, 2022.

  4. [4]

    End-to-end object detection with transformers

    Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. In ECCV, 2020.

  5. [5]

    TTT3R: 3D Reconstruction as Test-Time Training

    Xingyu Chen, Yue Chen, Yuliang Xiu, Andreas Geiger, and Anpei Chen. Ttt3r: 3d reconstruction as test-time training. arXiv preprint arXiv:2509.26645, 2025.

  6. [6]

    Rethinking attention with performers

    Krzysztof Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea Gane, Tamas Sarlos, Peter Hawkins, Jared Davis, Afroz Mohiuddin, Lukasz Kaiser, et al. Rethinking attention with performers. In ICLR, 2021.

  7. [7]

    Conditional positional encodings for vision transformers

    Xiangxiang Chu, Zhi Tian, Bo Zhang, Xinlong Wang, and Chunhua Shen. Conditional positional encodings for vision transformers. In ICLR, 2023.

  8. [8]

    Randaugment: Practical automated data augmentation with a reduced search space

    Ekin D Cubuk, Barret Zoph, Jonathon Shlens, and Quoc V Le. Randaugment: Practical automated data augmentation with a reduced search space. In CVPRW, 2020.

  9. [9]

    One-minute video generation with test-time training

    Karan Dalal, Daniel Koceja, Jiarui Xu, Yue Zhao, Shihao Han, Ka Chun Cheung, Jan Kautz, Yejin Choi, Yu Sun, and Xiaolong Wang. One-minute video generation with test-time training. In CVPR, 2025.

  10. [10]

    Shallow vs. deep sum-product networks

    Olivier Delalleau and Yoshua Bengio. Shallow vs. deep sum-product networks. In NeurIPS, 2011.

  11. [11]

    Imagenet: A large-scale hierarchical image database

    Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In CVPR, 2009.

  12. [12]

    Cswin transformer: A general vision transformer backbone with cross-shaped windows

    Xiaoyi Dong, Jianmin Bao, Dongdong Chen, Weiming Zhang, Nenghai Yu, Lu Yuan, Dong Chen, and Baining Guo. Cswin transformer: A general vision transformer backbone with cross-shaped windows. In CVPR, 2022.

  13. [13]

    An image is worth 16x16 words: Transformers for image recognition at scale

    Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR, 2021.

  14. [14]

    Sharpness-aware training for free

    Jiawei Du, Daquan Zhou, Jiashi Feng, Vincent Tan, and Joey Tianyi Zhou. Sharpness-aware training for free. In NeurIPS, 2022.

  15. [15]

    The power of depth for feedforward neural networks

    Ronen Eldan and Ohad Shamir. The power of depth for feedforward neural networks. In Conference on learning theory,

  16. [16]

    Rmt: Retentive networks meet vision transformers

    Qihang Fan, Huaibo Huang, Mingrui Chen, Hongmin Liu, and Ran He. Rmt: Retentive networks meet vision transformers. In CVPR, 2024.

  17. [17]

    Model-agnostic meta-learning for fast adaptation of deep networks

    Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In ICML, 2017.

  18. [18]

    Mamba: Linear-Time Sequence Modeling with Selective State Spaces

    Albert Gu and Tri Dao. Mamba: Linear-time sequence modeling with selective state spaces. arXiv preprint arXiv:2312.00752, 2023.

  19. [19]

    Cmt: Convolutional neural networks meet vision transformers

    Jianyuan Guo, Kai Han, Han Wu, Yehui Tang, Xinghao Chen, Yunhe Wang, and Chang Xu. Cmt: Convolutional neural networks meet vision transformers. In CVPR, 2022.

  20. [20]

    Flatten transformer: Vision transformer using focused linear attention

    Dongchen Han, Xuran Pan, Yizeng Han, Shiji Song, and Gao Huang. Flatten transformer: Vision transformer using focused linear attention. In ICCV, 2023.

  21. [21]

    Bridging the divide: Reconsidering softmax and linear attention

    Dongchen Han, Yifan Pu, Zhuofan Xia, Yizeng Han, Xuran Pan, Xiu Li, Jiwen Lu, Shiji Song, and Gao Huang. Bridging the divide: Reconsidering softmax and linear attention. In NeurIPS, 2024.

  22. [22]

    Demystify mamba in vision: A linear attention perspective

    Dongchen Han, Ziyi Wang, Zhuofan Xia, Yizeng Han, Yifan Pu, Chunjiang Ge, Jun Song, Shiji Song, Bo Zheng, and Gao Huang. Demystify mamba in vision: A linear attention perspective. In NeurIPS, 2024.

  23. [23]

    Agent attention: On the integration of softmax and linear attention

    Dongchen Han, Tianzhu Ye, Yizeng Han, Zhuofan Xia, Shiji Song, and Gao Huang. Agent attention: On the integration of softmax and linear attention. In ECCV, 2024.

  24. [24]

    Neighborhood attention transformer

    Ali Hassani, Steven Walton, Jiachen Li, Shen Li, and Humphrey Shi. Neighborhood attention transformer. In CVPR, 2023.

  25. [25]

    Fastervit: Fast vision transformers with hierarchical attention

    Ali Hatamizadeh, Greg Heinrich, Hongxu Yin, Andrew Tao, Jose M Alvarez, Jan Kautz, and Pavlo Molchanov. Fastervit: Fast vision transformers with hierarchical attention. In ICLR,

  26. [26]

    Deep residual learning for image recognition

    Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR,

  27. [27]

    Mask r-cnn

    Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask r-cnn. In ICCV, 2017.

  28. [28]

    Mobilenets: Efficient convolutional neural networks for mobile vision applications

    Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. Mobilenets: Efficient convolutional neural networks for mobile vision applications. In CVPR, 2017.

  29. [29]

    Densely connected convolutional networks

    Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger. Densely connected convolutional networks. In CVPR, 2017.

  30. [30]

    Localmamba: Visual state space model with windowed selective scan

    Tao Huang, Xiaohuan Pei, Shan You, Fei Wang, Chen Qian, and Chang Xu. Localmamba: Visual state space model with windowed selective scan. In ECCVW, 2024.

  31. [31]

    Transformers are rnns: Fast autoregressive transformers with linear attention

    Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, and François Fleuret. Transformers are rnns: Fast autoregressive transformers with linear attention. In ICML, 2020.

  32. [32]

    Mamba-nd: Selective state space modeling for multi-dimensional data

    Shufan Li, Harkanwar Singh, and Aditya Grover. Mamba-nd: Selective state space modeling for multi-dimensional data. In ECCV, 2024.

  33. [33]

    Microsoft coco: Common objects in context

    Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In ECCV, 2014.

  34. [34]

    Vmamba: Visual state space model

    Yue Liu, Yunjie Tian, Yuzhong Zhao, Hongtian Yu, Lingxi Xie, Yaowei Wang, Qixiang Ye, Jianbin Jiao, and Yunfan Liu. Vmamba: Visual state space model. In NeurIPS, 2024.

  35. [35]

    Swin transformer: Hierarchical vision transformer using shifted windows

    Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In ICCV, 2021.

  36. [36]

    A convnet for the 2020s

    Zhuang Liu, Hanzi Mao, Chao-Yuan Wu, Christoph Feichtenhofer, Trevor Darrell, and Saining Xie. A convnet for the 2020s. In CVPR, 2022.

  37. [37]

    Decoupled weight decay regularization

    Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In ICLR, 2018.

  38. [38]

    Softmax-free linear transformers

    Jiachen Lu, Junge Zhang, Xiatian Zhu, Jianfeng Feng, Tao Xiang, and Li Zhang. Softmax-free linear transformers. IJCV, 2024.

  39. [39]

    Meta-learning update rules for unsupervised representation learning

    Luke Metz, Niru Maheswaranathan, Brian Cheung, and Jascha Sohl-Dickstein. Meta-learning update rules for unsupervised representation learning. In ICLR, 2019.

  40. [40]

    On the number of linear regions of deep neural networks

    Guido F Montufar, Razvan Pascanu, Kyunghyun Cho, and Yoshua Bengio. On the number of linear regions of deep neural networks. In NeurIPS, 2014.

  41. [41]

    s1: Simple test-time scaling

    Niklas Muennighoff, Zitong Yang, Weijia Shi, Xiang Lisa Li, Li Fei-Fei, Hannaneh Hajishirzi, Luke Zettlemoyer, Percy Liang, Emmanuel Candès, and Tatsunori B Hashimoto. s1: Simple test-time scaling. In EMNLP, 2025.

  42. [42]

    Scalable diffusion models with transformers

    William Peebles and Saining Xie. Scalable diffusion models with transformers. In ICCV, 2023.

  43. [43]

    Efficientvmamba: Atrous selective scan for light weight visual mamba

    Xiaohuan Pei, Tao Huang, and Chang Xu. Efficientvmamba: Atrous selective scan for light weight visual mamba. In AAAI, 2025.

  44. [44]

    Exponential expressivity in deep neural networks through transient chaos

    Ben Poole, Subhaneil Lahiri, Maithra Raghu, Jascha Sohl-Dickstein, and Surya Ganguli. Exponential expressivity in deep neural networks through transient chaos. In NeurIPS,

  45. [45]

    cosformer: Rethinking softmax in attention

    Zhen Qin, Weixuan Sun, Hui Deng, Dongxu Li, Yunshen Wei, Baohong Lv, Junjie Yan, Lingpeng Kong, and Yiran Zhong. cosformer: Rethinking softmax in attention. In ICLR, 2022.

  46. [46]

    On the expressive power of deep neural networks

    Maithra Raghu, Ben Poole, Jon Kleinberg, Surya Ganguli, and Jascha Sohl-Dickstein. On the expressive power of deep neural networks. In ICML, 2017.

  47. [47]

    Mobilenetv2: Inverted residuals and linear bottlenecks

    Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. Mobilenetv2: Inverted residuals and linear bottlenecks. In CVPR, 2018.

  48. [48]

    Reasoning with latent thoughts: On the power of looped transformers

    Nikunj Saunshi, Nishanth Dikkala, Zhiyuan Li, Sanjiv Kumar, and Sashank J Reddi. Reasoning with latent thoughts: On the power of looped transformers. In ICLR, 2025.

  49. [49]

    Linear transformers are secretly fast weight programmers

    Imanol Schlag, Kazuki Irie, and Jürgen Schmidhuber. Linear transformers are secretly fast weight programmers. In ICML,

  50. [50]

    GLU Variants Improve Transformer

    Noam Shazeer. Glu variants improve transformer. arXiv preprint arXiv:2002.05202, 2020.

  51. [51]

    Transnext: Robust foveal visual perception for vision transformers

    Dai Shi. Transnext: Robust foveal visual perception for vision transformers. In CVPR, 2024.

  52. [52]

    Inception transformer

    Chenyang Si, Weihao Yu, Pan Zhou, Yichen Zhou, Xinchao Wang, and Shuicheng Yan. Inception transformer. In NeurIPS, 2022.

  53. [53]

    Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters

    Charlie Snell, Jaehoon Lee, Kelvin Xu, and Aviral Kumar. Scaling llm test-time compute optimally can be more effective than scaling model parameters. arXiv preprint arXiv:2408.03314, 2024.

  54. [54]

    Vicinity vision transformer

    Weixuan Sun, Zhen Qin, Hui Deng, Jianyuan Wang, Yi Zhang, Kaihao Zhang, Nick Barnes, Stan Birchfield, Lingpeng Kong, and Yiran Zhong. Vicinity vision transformer. TPAMI, 2023.

  55. [55]

    Learning to (learn at test time): Rnns with expressive hidden states

    Yu Sun, Xinhao Li, Karan Dalal, Jiarui Xu, Arjun Vikram, Genghan Zhang, Yann Dubois, Xinlei Chen, Xiaolong Wang, Sanmi Koyejo, et al. Learning to (learn at test time): Rnns with expressive hidden states. In ICML, 2025.

  56. [56]

    Training data-efficient image transformers & distillation through attention

    Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Hervé Jégou. Training data-efficient image transformers & distillation through attention. In ICML, 2021.

  57. [57]

    Going deeper with image transformers

    Hugo Touvron, Matthieu Cord, Alexandre Sablayrolles, Gabriel Synnaeve, and Hervé Jégou. Going deeper with image transformers. In ICCV, 2021.

  58. [58]

    Attention is all you need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NeurIPS, 2017.

  59. [59]

    Deepnet: Scaling transformers to 1,000 layers

    Hongyu Wang, Shuming Ma, Li Dong, Shaohan Huang, Dongdong Zhang, and Furu Wei. Deepnet: Scaling transformers to 1,000 layers. TPAMI, 2024.

  60. [60]

    Pyramid vision transformer: A versatile backbone for dense prediction without convolutions

    Wenhai Wang, Enze Xie, Xiang Li, Deng-Ping Fan, Kaitao Song, Ding Liang, Tong Lu, Ping Luo, and Ling Shao. Pyramid vision transformer: A versatile backbone for dense prediction without convolutions. In ICCV, 2021.

  61. [61]

    Internimage: Exploring large-scale vision foundation models with deformable convolutions

    Wenhai Wang, Jifeng Dai, Zhe Chen, Zhenhang Huang, Zhiqi Li, Xizhou Zhu, Xiaowei Hu, Tong Lu, Lewei Lu, Hongsheng Li, et al. Internimage: Exploring large-scale vision foundation models with deformable convolutions. In CVPR, 2023.

  62. [62]

    Vision transformer with deformable attention

    Zhuofan Xia, Xuran Pan, Shiji Song, Li Erran Li, and Gao Huang. Vision transformer with deformable attention. In CVPR, 2022.

  63. [63]

    Dat++: Spatially dynamic vision transformer with deformable attention

    Zhuofan Xia, Xuran Pan, Shiji Song, Li Erran Li, and Gao Huang. Dat++: Spatially dynamic vision transformer with deformable attention. arXiv preprint arXiv:2309.01430,

  64. [64]

    Gsva: Generalized segmentation via multimodal large language models

    Zhuofan Xia, Dongchen Han, Yizeng Han, Xuran Pan, Shiji Song, and Gao Huang. Gsva: Generalized segmentation via multimodal large language models. In CVPR, 2024.

  65. [65]

    Unified perceptual parsing for scene understanding

    Tete Xiao, Yingcheng Liu, Bolei Zhou, Yuning Jiang, and Jian Sun. Unified perceptual parsing for scene understanding. In ECCV, 2018.

  66. [66]

    Mambatree: Tree topology is all you need in state space model

    Yicheng Xiao, Lin Song, Jiangshan Wang, Siyu Song, Yixiao Ge, Xiu Li, Ying Shan, et al. Mambatree: Tree topology is all you need in state space model. In NeurIPS, 2024.

  67. [67]

    Segformer: Simple and efficient design for semantic segmentation with transformers

    Enze Xie, Wenhai Wang, Zhiding Yu, Anima Anandkumar, Jose M Alvarez, and Ping Luo. Segformer: Simple and efficient design for semantic segmentation with transformers. In NeurIPS, 2021.

  68. [68]

    Nyströmformer: A Nyström-based algorithm for approximating self-attention

    Yunyang Xiong, Zhanpeng Zeng, Rudrasis Chakraborty, Mingxing Tan, Glenn Fung, Yin Li, and Vikas Singh. Nyströmformer: A Nyström-based algorithm for approximating self-attention. In AAAI, 2021.

  69. [69]

    Focal self-attention for local-global interactions in vision transformers

    Jianwei Yang, Chunyuan Li, Pengchuan Zhang, Xiyang Dai, Bin Xiao, Lu Yuan, and Jianfeng Gao. Focal self-attention for local-global interactions in vision transformers. In NeurIPS, 2021.

  70. [70]

    Gated linear attention transformers with hardware-efficient training

    Songlin Yang, Bailin Wang, Yikang Shen, Rameswar Panda, and Yoon Kim. Gated linear attention transformers with hardware-efficient training. In ICML, 2024.

  71. [71]

    Parallelizing linear transformers with the delta rule over sequence length

    Songlin Yang, Bailin Wang, Yu Zhang, Yikang Shen, and Yoon Kim. Parallelizing linear transformers with the delta rule over sequence length. In NeurIPS, 2024.

  72. [72]

    Mambaout: Do we really need mamba for vision?

    Weihao Yu and Xinchao Wang. Mambaout: Do we really need mamba for vision? In CVPR, 2025.

  73. [73]

    Inceptionnext: When inception meets convnext

    Weihao Yu, Pan Zhou, Shuicheng Yan, and Xinchao Wang. Inceptionnext: When inception meets convnext. In CVPR,

  74. [74]

    Cutmix: Regularization strategy to train strong classifiers with localizable features

    Sangdoo Yun, Dongyoon Han, Seong Joon Oh, Sanghyuk Chun, Junsuk Choe, and Youngjoon Yoo. Cutmix: Regularization strategy to train strong classifiers with localizable features. In ICCV, 2019.

  75. [75]

    mixup: Beyond empirical risk minimization

    Hongyi Zhang, Moustapha Cisse, Yann N. Dauphin, and David Lopez-Paz. mixup: Beyond empirical risk minimization. In ICLR, 2018.

  76. [76]

    Vision transformer with quadrangle attention

    Qiming Zhang, Jing Zhang, Yufei Xu, and Dacheng Tao. Vision transformer with quadrangle attention. TPAMI, 2024.

  77. [77]

    Test-Time Training Done Right

    Tianyuan Zhang, Sai Bi, Yicong Hong, Kai Zhang, Fujun Luan, Songlin Yang, Kalyan Sunkavalli, William T Freeman, and Hao Tan. Test-time training done right. arXiv preprint arXiv:2505.23884, 2025.

  78. [78]

    Random erasing data augmentation

    Zhun Zhong, Liang Zheng, Guoliang Kang, Shaozi Li, and Yi Yang. Random erasing data augmentation. In AAAI, 2020.

  79. [79]

    Semantic understanding of scenes through the ade20k dataset

    Bolei Zhou, Hang Zhao, Xavier Puig, Tete Xiao, Sanja Fidler, Adela Barriuso, and Antonio Torralba. Semantic understanding of scenes through the ade20k dataset. IJCV, 2019.

  80. [80]

    Biformer: Vision transformer with bi-level routing attention

    Lei Zhu, Xinjiang Wang, Zhanghan Ke, Wayne Zhang, and Rynson WH Lau. Biformer: Vision transformer with bi-level routing attention. In CVPR, 2023.

Showing first 80 references.