ViT³: Unlocking Test-Time Training in Vision
Pith reviewed 2026-05-17 02:56 UTC · model grok-4.3
The pith
ViT³ adapts test-time training to vision by distilling six design insights into a linear-complexity model that competes with Mamba and approaches vision transformers.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors show that a systematic empirical exploration of test-time training choices for vision leads to six design insights. Applying these insights produces ViT³, a model that reformulates the attention mechanism as constructing and training an inner model from key-value pairs at test time. The result is linear complexity with parallelizable computation, and performance that matches or surpasses linear-complexity models such as Mamba while narrowing the gap to highly optimized vision transformers across multiple visual benchmarks.
What carries the argument
The inner model constructed and trained online from key-value pairs during test-time inference, guided by the six distilled design principles.
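The mechanism described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes a linear inner model, a squared-error inner loss, zero initialization, and a single full-batch gradient step. The function names (`matmul`, `transpose`, `ttt_linear_layer`) and the toy inputs are hypothetical.

```python
# Illustrative sketch of a TTT-style layer (assumptions: linear inner model,
# squared-error inner loss, zero-initialized weights, one full-batch step).

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    """Transpose a matrix given as a list of rows."""
    return [list(col) for col in zip(*m)]

def ttt_linear_layer(queries, keys, values, inner_lr=1.0):
    """Train the inner model on (key, value) pairs, then apply it to the queries."""
    n, d_k, d_v = len(keys), len(keys[0]), len(values[0])
    w = [[0.0] * d_v for _ in range(d_k)]            # inner weights W0 = 0
    preds = matmul(keys, w)                          # V_hat = K W0
    resid = [[p - v for p, v in zip(pr, vr)] for pr, vr in zip(preds, values)]
    grad = matmul(transpose(keys), resid)            # n * dL/dW for L = ||K W - V||^2 / (2n)
    w = [[wij - inner_lr * g / n for wij, g in zip(wr, gr)]
         for wr, gr in zip(w, grad)]                 # one gradient step: W1
    return matmul(queries, w)                        # output = Q W1

# Toy example with two tokens and two channels.
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[2.0, 0.0], [0.0, 4.0]]
queries = [[1.0, 1.0]]
out = ttt_linear_layer(queries, keys, values)
print(out)  # [[1.0, 2.0]]
```

Under these particular assumptions the updated weights equal `inner_lr / n` times KᵀV, so the output has the same form as unnormalized linear attention. Richer inner models or losses would not reduce so simply.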
If this is right
- ViT³ achieves linear computational complexity for visual sequence modeling.
- It performs competitively with Mamba and linear attention on image classification, generation, object detection, and semantic segmentation.
- The architecture supports parallelizable computation.
- Future visual TTT models can use the six insights as starting design principles.
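To make the linear-complexity point concrete, the following rough sketch compares how cost grows with token count. It assumes a linear inner model with a `dim` x `dim` weight matrix and counts only the dominant multiply-adds, ignoring projections, softmax, and constants. The function names and the channel width of 64 are hypothetical choices for illustration.

```python
# Rough multiply-add counts (illustrative; constants and minor terms are ignored).

def softmax_attention_cost(n_tokens, dim):
    """Pairwise scores Q K^T plus the weighted sum over V: quadratic in n_tokens."""
    return 2 * n_tokens * n_tokens * dim

def linear_state_cost(n_tokens, dim):
    """Build a dim x dim state from K and V, then apply it to Q: linear in n_tokens."""
    return 2 * n_tokens * dim * dim

# Token counts for 14x14, 28x28, and 56x56 grids.
for n in (196, 784, 3136):
    print(n, softmax_attention_cost(n, 64), linear_state_cost(n, 64))
```

Doubling the token count doubles the second estimate but quadruples the first, which is why the gap widens at higher resolutions.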
Where Pith is reading between the lines
- The six insights may help in adapting TTT to video sequences or other temporal visual data.
- Hybrid designs combining ViT³ elements with existing transformer optimizations could further close the performance gap.
- Testing the insights at much larger model scales would clarify their robustness.
Load-bearing premise
The six distilled design insights will generalize beyond the tested tasks and datasets and remain effective when the inner training is scaled up.
What would settle it
A clear falsifier would be if ViT³ shows substantially worse performance than Mamba or linear attention on a standard visual benchmark not used in the study, or if removing one of the key insights causes no drop in results.
Original abstract
Test-Time Training (TTT) has recently emerged as a promising direction for efficient sequence modeling. TTT reformulates attention operation as an online learning problem, constructing a compact inner model from key-value pairs at test time. This reformulation opens a rich and flexible design space while achieving linear computational complexity. However, crafting a powerful visual TTT design remains challenging: fundamental choices for the inner module and inner training lack comprehensive understanding and practical guidelines. To bridge this critical gap, in this paper, we present a systematic empirical study of TTT designs for visual sequence modeling. From a series of experiments and analyses, we distill six practical insights that establish design principles for effective visual TTT and illuminate paths for future improvement. These findings culminate in the Vision Test-Time Training (ViT$^3$) model, a pure TTT architecture that achieves linear complexity and parallelizable computation. We evaluate ViT$^3$ across diverse visual tasks, including image classification, image generation, object detection, and semantic segmentation. Results show that ViT$^3$ consistently matches or outperforms advanced linear-complexity models (e.g., Mamba and linear attention variants) and effectively narrows the gap to highly optimized vision Transformers. We hope this study and the ViT$^3$ baseline can facilitate future work on visual TTT models. Code: github.com/LeapLabTHU/ViTTT.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript conducts a systematic empirical study of Test-Time Training (TTT) designs for visual sequence modeling. From a series of experiments the authors distill six practical design insights, which they use to construct ViT³, a pure linear-complexity TTT architecture with parallelizable computation. ViT³ is evaluated on image classification, image generation, object detection, and semantic segmentation, where it is reported to match or outperform advanced linear-complexity baselines (Mamba, linear-attention variants) and to narrow the gap to highly optimized vision Transformers.
Significance. If the empirical results prove reproducible under matched capacity and training regimes, the work supplies concrete, actionable guidelines for visual TTT and supplies a competitive open-source baseline. The release of code and the breadth of tasks (classification, generation, detection, segmentation) are strengths that increase the potential utility of the study for the community.
minor comments (3)
- The abstract states that six design insights are distilled but does not enumerate them; a concise bullet list in the abstract or a dedicated early section would improve scannability.
- The experimental sections should explicitly report error bars, number of runs, and data-selection criteria for all quantitative tables; their absence makes it difficult to assess the stability of the claimed outperformance.
- Notation for the inner-model update rule and the distinction between training-time and test-time parameters could be clarified with a small diagram or pseudocode block in the method section.
Simulated Author's Rebuttal
We thank the referee for the positive summary of our manuscript, the recognition of its significance for the community, and the recommendation for minor revision. We are pleased that the empirical study, the six design insights, the breadth of evaluated tasks, and the open-source code release are viewed as strengths.
Circularity Check
No significant circularity
This is an empirical study that runs ablation experiments on TTT inner-module and training choices, distills six design insights from the observed results, and assembles ViT³ accordingly. All performance claims are tied directly to the reported numbers on classification, generation, detection, and segmentation benchmarks rather than to any closed-form derivation or self-referential equation. No load-bearing step reduces by construction to a fitted parameter or prior self-citation; the work therefore remains self-contained against external benchmarks.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
  unclear: Relation between the paper passage and the cited Recognition theorem.
  Passage: We distill six practical insights... loss functions L for which ∂²L/∂V̂∂V vanishes are not suitable... single epoch of full-batch inner training... large inner learning rate (1.0)... increasing inner model capacity... deep inner models suffer from optimization difficulties... convolutional architectures are particularly appropriate
- IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · reality_from_one_distinction
  unclear: Relation between the paper passage and the cited Recognition theorem.
  Passage: TTT reformulates attention operation as an online learning problem, constructing a compact inner model from key-value pairs at test time... linear computational complexity
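One way to read the quoted condition on the inner loss is sketched below. It assumes the inner model $f_W$ is trained by gradient descent on $\mathcal{L}(\hat V, V)$ with $\hat V = f_W(K)$; the symbols $f_W$, $\eta$, $W_0$, and $W_1$ are introduced here for illustration and do not come from the quoted passage.

```latex
\begin{aligned}
W_1 &= W_0 - \eta \,\frac{\partial \mathcal{L}}{\partial W}\Big|_{W_0},
\qquad
\frac{\partial \mathcal{L}}{\partial W}
  = \Big(\frac{\partial \hat V}{\partial W}\Big)^{\!\top}
    \frac{\partial \mathcal{L}}{\partial \hat V}, \\
\frac{\partial^2 \mathcal{L}}{\partial \hat V\,\partial V} = 0
  \;&\Longrightarrow\;
  \frac{\partial \mathcal{L}}{\partial \hat V}\ \text{does not depend on } V
  \;\Longrightarrow\;
  W_1\ \text{does not depend on } V .
\end{aligned}
```

Under these assumptions the updated inner model would carry no information about the values, which is consistent with the passage calling such losses unsuitable. By contrast, the squared error $\tfrac{1}{2}\lVert \hat V - V \rVert^2$ has mixed derivative $-I$ and so avoids this degenerate case.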
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 3 Pith papers
- Test-Time Training with KV Binding Is Secretly Linear Attention
  Test-time training with KV binding reduces to learned linear attention.
- Mixture-of-Top-k Attention: Efficient Attention via Scalable Fast Weights
  MiTA makes attention scalable by gathering query-aware top-k key-value pairs through landmarks as deformable routed experts and compressing the N-width fast-weight MLP into a shared narrower expert.
- Linearizing Vision Transformer with Test-Time Training
  Using Test-Time Training's structural match to Softmax attention plus key normalization and locality modules allows inheriting pretrained weights and fine-tuning Stable Diffusion 3.5 in one hour to match quality while...