pith. machine review for the scientific record.

arxiv: 2604.20392 · v1 · submitted 2026-04-22 · 💻 cs.CV

Recognition: unknown

Self-supervised pretraining for an iterative image size agnostic vision transformer

Authors on Pith · no claims yet

Pith reviewed 2026-05-10 00:13 UTC · model grok-4.3

classification 💻 cs.CV
keywords self-supervised learning · vision transformer · resolution agnostic · foveal transformer · DINO self-distillation · ImageNet pretraining · iterative processing

The pith

A sequential-to-global self-distillation framework pretrains iterative foveal-inspired vision transformers while keeping computational cost fixed across any input resolution.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows how to adapt DINO-style self-distillation into a sequential-to-global objective so that an iterative, foveal-inspired transformer can be pretrained at large scale. Standard vision transformers grow expensive with bigger images, so foundational models stay limited to low resolution; this approach removes that limit by processing a fixed-size context of multi-zoom patches repeatedly. An efficient integral-image method extracts the patches, allowing the model to maintain constant compute while still learning useful representations. The result is competitive accuracy on ImageNet-1K classification and downstream tasks without sacrificing the resolution-agnostic property built into the architecture.
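The integral-image step is the workhorse here: a summed-area table [13] lets any axis-aligned box be averaged in four lookups, so a fixed-size patch can be pooled from an arbitrarily large crop at a cost that depends only on the patch size, not the crop area. Below is a minimal NumPy sketch of that idea; the function names, the 16×16 patch size, and the crop coordinates are illustrative assumptions, not the paper's actual extraction code.

    import numpy as np

    def integral_image(img):
        # Summed-area table [13], zero-padded so box sums need no bounds checks:
        # ii[y, x] = sum of img[:y, :x].
        ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1), dtype=np.float64)
        ii[1:, 1:] = img.cumsum(axis=0).cumsum(axis=1)
        return ii

    def box_mean(ii, y0, x0, y1, x1):
        # Mean of img[y0:y1, x0:x1] from four table lookups: O(1) in box size.
        s = ii[y1, x1] - ii[y0, x1] - ii[y1, x0] + ii[y0, x0]
        return s / ((y1 - y0) * (x1 - x0))

    def extract_patch(ii, y0, x0, y1, x1, p=16):
        # Box-filter an arbitrary crop down to a fixed p x p patch; assumes the
        # crop is at least p pixels on a side. Cost scales with p^2, not crop area.
        ys = np.linspace(y0, y1, p + 1).round().astype(int)
        xs = np.linspace(x0, x1, p + 1).round().astype(int)
        return np.array([[box_mean(ii, ys[i], xs[j], ys[i + 1], xs[j + 1])
                          for j in range(p)] for i in range(p)])

    img = np.random.rand(2048, 2048)                  # stand-in high-resolution image
    ii = integral_image(img)                          # one-time O(H*W) conversion
    coarse = extract_patch(ii, 0, 0, 2048, 2048)      # whole-image zoom level
    fovea = extract_patch(ii, 960, 960, 1088, 1088)   # fine central crop

The one-time cumsum is the only cost that grows with image size, which matches the latency behavior described for Figure 2 below.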

Core claim

We introduce a novel sequential-to-global SSL framework based on DINO's self-distillation objective. Supported by an efficient integral-image patch extraction method, our approach enables large-scale pretraining for image-size agnostic vision encoders. We achieve competitive performance on ImageNet-1K and downstream classification tasks, maintaining a constant computational budget regardless of input resolution.

What carries the argument

The sequential-to-global self-distillation objective applied to the iterative foveal-inspired transformer that processes fixed-size multi-zoom patch contexts without backpropagation through time.
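Figure 3's caption spells out the mechanics, and they can be stated compactly in code. Here is a hedged PyTorch sketch of the objective as described there; `student.init_state`, `student.step`, `student.head`, and the per-step `crops` are hypothetical stand-ins for the paper's modules, and the temperatures and step count are illustrative defaults borrowed from DINO-style setups.

    import torch
    import torch.nn.functional as F

    def sequential_to_global_loss(student, teacher, image, crops,
                                  n_steps=8, tau_s=0.1, tau_t=0.04):
        # Teacher embeds the full image once: a static, global target
        # (in DINO the teacher is an EMA copy of the student, frozen here).
        with torch.no_grad():
            target = F.softmax(teacher(image) / tau_t, dim=-1)

        state = student.init_state(image.shape[0])  # learned initial state tokens
        loss = 0.0
        for t in range(n_steps):
            # detach() is the stop-gradient: each step is trained in isolation,
            # so there is no backpropagation through time.
            state = student.step(state.detach(), crops[t])
            logits = student.head(state)
            loss = loss + torch.sum(-target * F.log_softmax(logits / tau_s, dim=-1),
                                    dim=-1).mean()
        return loss / n_steps

The per-step loss against one global teacher embedding is what makes the objective "sequential-to-global": every partial glimpse is pulled toward the representation of the whole image.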

If this is right

  • The pretrained encoder can process images of arbitrary size while using the same number of operations as on small images (a back-of-envelope check follows this list).
  • The model reaches competitive top-1 accuracy on ImageNet-1K classification after large-scale pretraining.
  • Downstream classification tasks inherit the same resolution independence and constant compute property.
  • The iterative architecture becomes usable as a foundational backbone rather than only for supervised tasks.
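The constant-compute point is easy to sanity-check with arithmetic. A rough Python sketch, using the standard 2·n²·d estimate for self-attention FLOPs per layer; the constants are illustrative, not measurements from the paper.

    def vit_tokens(res, patch=16):
        return (res // patch) ** 2

    def attn_flops(res, dim=768, patch=16):
        # Rough per-layer self-attention cost: QK^T plus AV, each n^2 * d.
        n = vit_tokens(res, patch)
        return 2 * n * n * dim

    for res in (224, 448, 896):
        print(res, vit_tokens(res), f"{attn_flops(res):.1e}")
    # 224 -> 196 tokens, ~5.9e7 FLOPs per layer; each resolution doubling
    # quadruples the tokens and multiplies attention cost by ~16. A fixed-size
    # foveal context keeps both numbers constant, the claimed O(1) footprint.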

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same pretraining recipe could be tested on other recurrent-style or iterative vision backbones to check whether self-distillation generalizes beyond the foveal design.
  • Because compute stays fixed, the model might be applied to very high-resolution inputs such as medical scans or satellite imagery without retraining or hardware scaling.
  • The absence of backpropagation through time during the iterative steps may simplify scaling to longer sequences or temporal data.

Load-bearing premise

The sequential-to-global SSL framework based on DINO self-distillation can successfully pretrain the iterative foveal-inspired transformer without training instabilities or loss of the resolution-agnostic property.

What would settle it

Training the model with the new framework and then measuring whether classification accuracy on ImageNet-1K drops below standard DINO baselines when input resolution is increased while holding total compute constant.
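That test is cheap to express. Below is a hypothetical evaluation harness for the sweep described above; `make_loader(res)` is an assumed helper yielding ImageNet-1K validation batches at a given resolution, and the model is assumed to run its fixed internal step budget regardless of input size.

    import torch

    @torch.no_grad()
    def accuracy_vs_resolution(model, make_loader,
                               resolutions=(224, 448, 896, 1792)):
        # Sweep input resolution while the model's per-image compute stays
        # fixed; compare the resulting top-1 curve against a standard DINO
        # baseline evaluated at its native resolution.
        model.eval()
        results = {}
        for res in resolutions:
            correct = total = 0
            for images, labels in make_loader(res):
                preds = model(images).argmax(dim=-1)
                correct += (preds == labels).sum().item()
                total += labels.numel()
            results[res] = correct / total
        return results

If top-1 accuracy at high resolution stays at or above the fixed-resolution DINO baseline while FLOPs stay flat, the load-bearing premise above holds; a collapse at large inputs would falsify it.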

Figures

Figures reproduced from arXiv: 2604.20392 by Danda Pani Paudel, Luc Van Gool, Nedyalko Prisadnikov, Yuqian Fu.

Figure 1
Figure 1. Computational Efficiency and Scaling. (Left & Middle) Latency and GFLOPs of a standard ViT versus our foveal model (evaluated sequentially over 8 steps with a learned gaze policy). While standard ViT compute scales quadratically with the number of patches, our dynamic model maintains a strictly O(1) footprint. (Right) ImageNet-1K Top-1 accuracy across extreme evaluation resolutions. All models are pretrain… view at source ↗
Figure 2
Figure 2. Foveal Extraction Latency. Median latency for extracting 1,000 patches (10 steps of 100 patches) from a single image, simulating a high-throughput training workload, including running the models. Compared to the supervised baseline [44], our integral-image approach reduces overhead. The increase of latency with image size reflects the one-time cost of converting the image into an integral image. … view at source ↗
Figure 3
Figure 3. Main architecture for sequential-to-global pretraining. The student network iteratively updates its state tokens, computing the distillation loss at each step against a static teacher embedding. Stop-gradients prevent backpropagation through time. … view at source ↗
Figure 4
Figure 4. Combining multi-zoom patches with a focused grid of patches for an informative context for the transformer. a) The input image with the bounding boxes of the multi-zoom crops (blue) and the foveal grid crops (red) overlaid; b) the extracted grid patches; c) the multi-zoom patches. Note that the patches are allowed to span outside the image area and are filled with zero padding. … view at source ↗
read the original abstract

Vision Transformers (ViTs) dominate self-supervised learning (SSL). While they have proven highly effective for large-scale pretraining, they are computationally inefficient and scale poorly with image size. Consequently, foundational models like DINO are constrained to low-resolution processing. A recent foveal-inspired transformer achieves resolution agnosticism by iteratively processing a fixed-size context of multi-zoom patches. This model demonstrated promising results via supervised learning, utilizing a sequential, recurrent-like process without backpropagation through time. To unlock its potential as a foundational backbone, we introduce a novel sequential-to-global SSL framework based on DINO's self-distillation objective. Supported by an efficient integral-image patch extraction method, our approach enables large-scale pretraining for image-size agnostic vision encoders. We achieve competitive performance on ImageNet-1K and downstream classification tasks, maintaining a constant computational budget regardless of input resolution.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper claims to develop a self-supervised pretraining method for a foveal-inspired iterative vision transformer that is agnostic to image size. By using a sequential-to-global SSL framework derived from DINO self-distillation and an efficient integral-image patch extraction, it enables large-scale pretraining with constant computational cost. Competitive results are reported on ImageNet-1K and downstream classification tasks.

Significance. If the results hold, this would be a significant advance in making Vision Transformers more practical for self-supervised learning at scale. The ability to maintain constant compute across resolutions addresses a key limitation of standard ViTs, which scale poorly with image size. This could enable broader adoption of such models in applications with variable input resolutions and contribute to more efficient foundational vision models.

major comments (1)
  1. [Abstract] The abstract asserts 'competitive performance on ImageNet-1K and downstream classification tasks' without providing any quantitative metrics, comparisons to baselines like standard DINO, ablation studies on the sequential-to-global framework, training details, or error analysis. Since the central claim depends on these empirical results to validate the SSL approach for the iterative transformer, this omission is load-bearing and prevents verification of the soundness of the method.
minor comments (1)
  1. The abstract introduces several technical terms (e.g., 'sequential-to-global SSL framework', 'integral-image patch extraction') without brief explanations or references, which may hinder immediate understanding for readers unfamiliar with the prior foveal-inspired transformer work.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the positive evaluation of our work's significance and for the constructive feedback. We address the major comment point by point below.

read point-by-point responses
  1. Referee: [Abstract] The abstract asserts 'competitive performance on ImageNet-1K and downstream classification tasks' without providing any quantitative metrics, comparisons to baselines like standard DINO, ablation studies on the sequential-to-global framework, training details, or error analysis. Since the central claim depends on these empirical results to validate the SSL approach for the iterative transformer, this omission is load-bearing and prevents verification of the soundness of the method.

    Authors: We agree that the abstract would be strengthened by including key quantitative metrics to better support the central claims upfront. The full manuscript (Sections 4 and 5) provides these details, including ImageNet-1K performance numbers, direct comparisons to standard DINO, ablations on the sequential-to-global framework, training hyperparameters, and analysis. We will revise the abstract to incorporate specific performance metrics along with brief references to the comparisons, ablations, and training details reported in the paper. revision: yes

Circularity Check

0 steps flagged

No significant circularity in the claimed approach

full rationale

The paper describes an empirical self-supervised pretraining method that adapts DINO self-distillation to an iterative foveal-inspired transformer via a sequential-to-global framework and integral-image patch extraction. All central claims (competitive ImageNet-1K performance, constant compute across resolutions) rest on experimental validation rather than any derivation, prediction, or first-principles result that reduces to its own inputs by construction. No equations, fitted parameters renamed as predictions, self-definitional loops, or load-bearing self-citations appear in the abstract or described method; the approach is self-contained against external benchmarks and does not invoke uniqueness theorems or ansatzes from prior author work.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Abstract-only review limits visibility into exact hyperparameters or training choices; the primary unstated premise is that DINO-style self-distillation transfers to sequential iterative processing without additional regularization or architectural changes.

axioms (1)
  • domain assumption DINO self-distillation objective can be adapted to sequential recurrent-like processing without backpropagation through time.
    The paper relies on this transfer to enable large-scale pretraining of the iterative model.

pith-pipeline@v0.9.0 · 5456 in / 1156 out tokens · 54509 ms · 2026-05-10T00:13:48.902017+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

63 extracted references · 21 canonical work pages · 10 internal anchors

  1. [1] Asano, Y.M., Rupprecht, C., Vedaldi, A.: Self-labelling via simultaneous clustering and representation learning. arXiv preprint arXiv:1911.05371 (2019)
  2. [2] Assran, M., Duval, Q., Misra, I., Bojanowski, P., Vincent, P., Rabbat, M., LeCun, Y., Ballas, N.: Self-supervised learning from images with a joint-embedding predictive architecture. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 15619–15629 (2023)
  3. [3] Bai, H., Zhou, Y., Wu, Y., Chan, C.M., Wen, P., Pan, K., Han, S., Guo, Y.: Glance-or-gaze: Incentivizing LMMs to adaptively focus search via reinforcement learning. arXiv preprint arXiv:2601.13942 (2026)
  4. [4] Balestriero, R., LeCun, Y.: LeJEPA: Provable and scalable self-supervised learning without the heuristics. arXiv preprint arXiv:2511.08544 (2025)
  5. [5] Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020)
  6. [6] Bulatov, A., Kuratov, Y., Burtsev, M.: Recurrent memory transformer. Advances in Neural Information Processing Systems 35, 11079–11091 (2022)
  7. [7] Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., Zagoruyko, S.: End-to-end object detection with transformers. In: European Conference on Computer Vision. pp. 213–229. Springer (2020)
  8. [8] Caron, M., Misra, I., Mairal, J., Goyal, P., Bojanowski, P., Joulin, A.: Unsupervised learning of visual features by contrasting cluster assignments. Advances in Neural Information Processing Systems 33, 9912–9924 (2020)
  9. [9] Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 9650–9660 (2021)
  10. [10] Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning. pp. 1597–1607. PMLR (2020)
  11. [11] Chen, X., He, K.: Exploring simple siamese representation learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 15750–15758 (2021)
  12. [12] Chowdhury, A., Paul, D., Mai, Z., Gu, J., Zhang, Z., Mehrab, K.S., Campolongo, E.G., Rubenstein, D., Stewart, C.V., Karpatne, A., et al.: Prompt-CAM: Making vision transformers interpretable for fine-grained analysis. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 4375–4385 (2025)
  13. [13] Crow, F.C.: Summed-area tables for texture mapping. In: Proceedings of the 11th Annual Conference on Computer Graphics and Interactive Techniques (1984). https://api.semanticscholar.org/CorpusID:2210332
  14. [14] Cuturi, M.: Sinkhorn distances: Lightspeed computation of optimal transport. Advances in Neural Information Processing Systems 26 (2013)
  15. [15] Darcet, T., Oquab, M., Mairal, J., Bojanowski, P.: Vision transformers need registers. arXiv preprint arXiv:2309.16588 (2023)
  16. [16] Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition. pp. 248–255. IEEE (2009)
  17. [17] Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: Pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). pp. 4171–4186 (2019)
  18. [18] Doersch, C., Gupta, A., Efros, A.A.: Unsupervised visual representation learning by context prediction. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 1422–1430 (2015)
  19. [19] Dosovitskiy, A.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)
  20. [20] Fu, Y., Wang, R., Ren, B., Sun, G., Gong, B., Fu, Y., Paudel, D.P., Huang, X., Van Gool, L.: ObjectRelator: Enabling cross-view object relation understanding across ego-centric and exo-centric perspectives. In: ICCV (2025)
  21. [21] Gehrig, M., Scaramuzza, D.: Recurrent vision transformers for object detection with event cameras. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 13884–13893 (2023)
  22. [22] Gidaris, S., Singh, P., Komodakis, N.: Unsupervised representation learning by predicting image rotations. arXiv preprint arXiv:1803.07728 (2018)
  23. [23] Goodale, M.A., Milner, A.D.: Separate visual pathways for perception and action. Trends in Neurosciences 15(1), 20–25 (1992)
  24. [24] Grill, J.B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020)
  25. [25] He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 16000–16009 (2022)
  26. [26] He, K., Fan, H., Wu, Y., Xie, S., Girshick, R.: Momentum contrast for unsupervised visual representation learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 9729–9738 (2020)
  27. [27] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997)
  28. [28] Jiang, Y., Guo, Z., Rezazadegan Tavakoli, H., Leiva, L.A., Oulasvirta, A.: EyeFormer: Predicting personalized scanpaths with transformer-guided reinforcement learning. In: Proceedings of the 37th Annual ACM Symposium on User Interface Software and Technology. pp. 1–15 (2024)
  29. [29] Jonnalagadda, A., Wang, W.Y., Manjunath, B., Eckstein, M.P.: FoveaTer: Foveated transformer for image classification. arXiv preprint arXiv:2105.14173 (2021)
  30. [30] Kerr, J., Hari, K., Weber, E., Kim, C.M., Yi, B., Bonnen, T., Goldberg, K., Kanazawa, A.: Eye, robot: Learning to look to act with a BC-RL perception-action loop. arXiv preprint arXiv:2506.10968 (2025)
  31. [31] LeCun, Y., et al.: A path towards autonomous machine intelligence, version 0.9.2, 2022-06-27. Open Review 62(1), 1–62 (2022)
  32. [32] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692 (2019)
  33. [33] Loshchilov, I., Hutter, F.: SGDR: Stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983 (2016)
  34. [34] Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781 (2013)
  35. [35] Mnih, V., Heess, N., Graves, A., Kavukcuoglu, K.: Recurrent models of visual attention. Advances in Neural Information Processing Systems 27 (2014)
  36. [36] Mondal, S., Yang, Z., Ahn, S., Samaras, D., Zelinsky, G., Hoai, M.: Gazeformer: Scalable, effective and fast prediction of goal-directed human attention. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 1441–1450 (2023)
  37. [37] Nilsback, M.E., Zisserman, A.: Automated flower classification over a large number of classes. In: Indian Conference on Computer Vision, Graphics and Image Processing (Dec 2008)
  38. [38] Noroozi, M., Favaro, P.: Unsupervised learning of visual representations by solving jigsaw puzzles. In: European Conference on Computer Vision. pp. 69–84. Springer (2016)
  39. [39] Noton, D., Stark, L.: Scanpaths in eye movements during pattern perception. Science 171(3968), 308–311 (1971)
  40. [40] Oord, A.v.d., Li, Y., Vinyals, O.: Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748 (2018)
  41. [41] Oquab, M., Darcet, T., Moutakanni, T., Vo, H., Szafraniec, M., Khalidov, V., Fernandez, P., Haziza, D., Massa, F., El-Nouby, A., et al.: DINOv2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193 (2023)
  42. [42] Pathak, D., Krahenbuhl, P., Donahue, J., Darrell, T., Efros, A.A.: Context encoders: Feature learning by inpainting. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 2536–2544 (2016)
  43. [43] Pennington, J., Socher, R., Manning, C.D.: GloVe: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). pp. 1532–1543 (2014)
  44. [44] Prisadnikov, N., Paudel, D.P., Fu, Y., Van Gool, L.: Vision encoders should be image size agnostic and task driven. arXiv preprint arXiv:2508.16317 (2025)
  45. [45] Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning. pp. 8748–8763. PMLR (2021)
  46. [46] Radford, A., Narasimhan, K., Salimans, T., Sutskever, I., et al.: Improving language understanding by generative pre-training (2018)
  47. [47] Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., Liu, P.J.: Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21(140), 1–67 (2020)
  48. [48] Sablayrolles, A., Douze, M., Schmid, C., Jégou, H.: Spreading vectors for similarity search. arXiv preprint arXiv:1806.03198 (2018)
  49. [49] Schmidhuber, J., Huber, R.: Learning to generate artificial fovea trajectories for target detection. International Journal of Neural Systems 2(01n02), 125–134 (1991)
  50. [50] Shao, Z., Wang, P., Zhu, Q., Xu, R., Song, J., Bi, X., Zhang, H., Zhang, M., Li, Y., Wu, Y., et al.: DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300 (2024)
  51. [51] Siméoni, O., Vo, H.V., Seitzer, M., Baldassarre, F., Oquab, M., Jose, C., Khalidov, V., Szafraniec, M., Yi, S., Ramamonjisoa, M., et al.: DINOv3. arXiv preprint arXiv:2508.10104 (2025)
  52. [52] Touvron, H., Vedaldi, A., Douze, M., Jégou, H.: Fixing the train-test resolution discrepancy. Advances in Neural Information Processing Systems 32 (2019)
  53. [53] Traub, M., Butz, M.V.: Looking locally: Object-centric vision transformers as foundation models for efficient segmentation. arXiv preprint arXiv:2502.02763 (2025)
  54. [54] Tschannen, M., Gritsenko, A., Wang, X., Naeem, M.F., Alabdulmohsin, I., Parthasarathy, N., Evans, T., Beyer, L., Xia, Y., Mustafa, B., et al.: SigLIP 2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense features. arXiv preprint arXiv:2502.14786 (2025)
  55. [55] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017)
  56. [56] Viola, P.A., Jones, M.J.: Rapid object detection using a boosted cascade of simple features. In: 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2001). pp. I:511–518. IEEE Computer Society (2001). https://doi.org/10.1109/CVPR.2001.990517
  57. [57] Wah, C., Branson, S., Welinder, P., Perona, P., Belongie, S., et al.: The Caltech-UCSD Birds-200-2011 dataset. Tech. rep. CNS-TR-2011-001, California Institute of Technology (2011)
  58. [58] Werbos, P.: Backpropagation through time: what it does and how to do it. Proceedings of the IEEE 78(10), 1550–1560 (1990). https://doi.org/10.1109/5.58337
  59. [59] Williams, R.J.: Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning 8, 229–256 (1992)
  60. [60] Wu, Q., Lan, Z., Qian, K., Gu, J., Geramifard, A., Yu, Z.: Memformer: A memory-augmented transformer for sequence modeling. In: Findings of the Association for Computational Linguistics: AACL-IJCNLP 2022. pp. 308–318 (2022)
  61. [61] Yang, Z., Mondal, S., Ahn, S., Xue, R., Zelinsky, G., Hoai, M., Samaras, D.: Unifying top-down and bottom-up scanpath prediction using transformers. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 1683–1693 (2024)
  62. [62] Zhai, X., Mustafa, B., Kolesnikov, A., Beyer, L.: Sigmoid loss for language image pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 11975–11986 (2023)
  63. [63] Zhang, R., Isola, P., Efros, A.A.: Colorful image colorization. In: European Conference on Computer Vision. pp. 649–666. Springer (2016)