pith. machine review for the scientific record.

arxiv: 2604.20392 · v1 · submitted 2026-04-22 · 💻 cs.CV

Recognition: unknown

Self-supervised pretraining for an iterative image size agnostic vision transformer

Authors on Pith · no claims yet

Pith reviewed 2026-05-10 00:13 UTC · model grok-4.3

classification 💻 cs.CV
keywords self-supervised learning · vision transformer · resolution agnostic · foveal transformer · DINO self-distillation · ImageNet pretraining · iterative processing

The pith

A sequential-to-global self-distillation framework pretrains iterative foveal-inspired vision transformers while keeping computational cost fixed across any input resolution.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows how to adapt DINO-style self-distillation into a sequential-to-global objective so that an iterative, foveal-inspired transformer can be pretrained at large scale. Standard vision transformers grow expensive with bigger images, so foundational models stay limited to low resolution; this approach removes that limit by processing a fixed-size context of multi-zoom patches repeatedly. An efficient integral-image method extracts the patches, allowing the model to maintain constant compute while still learning useful representations. The result is competitive accuracy on ImageNet-1K classification and downstream tasks without sacrificing the resolution-agnostic property built into the architecture.
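The integral-image step is the workhorse here: a summed-area table [13] lets any axis-aligned box be averaged in four lookups, so a fixed-size patch can be pooled from an arbitrarily large crop at a cost that depends only on the patch size, not the crop area. Below is a minimal NumPy sketch of that idea; the function names, the 16×16 patch size, and the crop coordinates are illustrative assumptions, not the paper's actual extraction code.

    import numpy as np

    def integral_image(img):
        # Summed-area table [13], zero-padded so box sums need no bounds checks:
        # ii[y, x] = sum of img[:y, :x].
        ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1), dtype=np.float64)
        ii[1:, 1:] = img.cumsum(axis=0).cumsum(axis=1)
        return ii

    def box_mean(ii, y0, x0, y1, x1):
        # Mean of img[y0:y1, x0:x1] from four table lookups: O(1) in box size.
        s = ii[y1, x1] - ii[y0, x1] - ii[y1, x0] + ii[y0, x0]
        return s / ((y1 - y0) * (x1 - x0))

    def extract_patch(ii, y0, x0, y1, x1, p=16):
        # Box-filter an arbitrary crop down to a fixed p x p patch; assumes the
        # crop is at least p pixels on a side. Cost scales with p^2, not crop area.
        ys = np.linspace(y0, y1, p + 1).round().astype(int)
        xs = np.linspace(x0, x1, p + 1).round().astype(int)
        return np.array([[box_mean(ii, ys[i], xs[j], ys[i + 1], xs[j + 1])
                          for j in range(p)] for i in range(p)])

    img = np.random.rand(2048, 2048)                  # stand-in high-resolution image
    ii = integral_image(img)                          # one-time O(H*W) conversion
    coarse = extract_patch(ii, 0, 0, 2048, 2048)      # whole-image zoom level
    fovea = extract_patch(ii, 960, 960, 1088, 1088)   # fine central crop

The one-time cumsum is the only cost that grows with image size, which matches the latency behavior described for Figure 2 below.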

Core claim

We introduce a novel sequential-to-global SSL framework based on DINO's self-distillation objective. Supported by an efficient integral-image patch extraction method, our approach enables large-scale pretraining for image-size agnostic vision encoders. We achieve competitive performance on ImageNet-1K and downstream classification tasks, maintaining a constant computational budget regardless of input resolution.

What carries the argument

The sequential-to-global self-distillation objective applied to the iterative foveal-inspired transformer that processes fixed-size multi-zoom patch contexts without backpropagation through time.
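Figure 3's caption spells out the mechanics, and they can be stated compactly in code. Here is a hedged PyTorch sketch of the objective as described there; `student.init_state`, `student.step`, `student.head`, and the per-step `crops` are hypothetical stand-ins for the paper's modules, and the temperatures and step count are illustrative defaults borrowed from DINO-style setups.

    import torch
    import torch.nn.functional as F

    def sequential_to_global_loss(student, teacher, image, crops,
                                  n_steps=8, tau_s=0.1, tau_t=0.04):
        # Teacher embeds the full image once: a static, global target
        # (in DINO the teacher is an EMA copy of the student, frozen here).
        with torch.no_grad():
            target = F.softmax(teacher(image) / tau_t, dim=-1)

        state = student.init_state(image.shape[0])  # learned initial state tokens
        loss = 0.0
        for t in range(n_steps):
            # detach() is the stop-gradient: each step is trained in isolation,
            # so there is no backpropagation through time.
            state = student.step(state.detach(), crops[t])
            logits = student.head(state)
            loss = loss + torch.sum(-target * F.log_softmax(logits / tau_s, dim=-1),
                                    dim=-1).mean()
        return loss / n_steps

The per-step loss against one global teacher embedding is what makes the objective "sequential-to-global": every partial glimpse is pulled toward the representation of the whole image.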

If this is right

  • The pretrained encoder can process images of arbitrary size while using the same number of operations as on small images (a back-of-envelope check follows this list).
  • The model reaches competitive top-1 accuracy on ImageNet-1K classification after large-scale pretraining.
  • Downstream classification tasks inherit the same resolution independence and constant compute property.
  • The iterative architecture becomes usable as a foundational backbone rather than only for supervised tasks.
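The constant-compute point is easy to sanity-check with arithmetic. A rough Python sketch, using the standard 2·n²·d estimate for self-attention FLOPs per layer; the constants are illustrative, not measurements from the paper.

    def vit_tokens(res, patch=16):
        return (res // patch) ** 2

    def attn_flops(res, dim=768, patch=16):
        # Rough per-layer self-attention cost: QK^T plus AV, each n^2 * d.
        n = vit_tokens(res, patch)
        return 2 * n * n * dim

    for res in (224, 448, 896):
        print(res, vit_tokens(res), f"{attn_flops(res):.1e}")
    # 224 -> 196 tokens, ~5.9e7 FLOPs per layer; each resolution doubling
    # quadruples the tokens and multiplies attention cost by ~16. A fixed-size
    # foveal context keeps both numbers constant, the claimed O(1) footprint.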

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same pretraining recipe could be tested on other recurrent-style or iterative vision backbones to check whether self-distillation generalizes beyond the foveal design.
  • Because compute stays fixed, the model might be applied to very high-resolution inputs such as medical scans or satellite imagery without retraining or hardware scaling.
  • The absence of backpropagation through time during the iterative steps may simplify scaling to longer sequences or temporal data.

Load-bearing premise

The sequential-to-global SSL framework based on DINO self-distillation can successfully pretrain the iterative foveal-inspired transformer without training instabilities or loss of the resolution-agnostic property.

What would settle it

Training the model with the new framework and then measuring whether classification accuracy on ImageNet-1K drops below standard DINO baselines when input resolution is increased while holding total compute constant.
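That test is cheap to express. Below is a hypothetical evaluation harness for the sweep described above; `make_loader(res)` is an assumed helper yielding ImageNet-1K validation batches at a given resolution, and the model is assumed to run its fixed internal step budget regardless of input size.

    import torch

    @torch.no_grad()
    def accuracy_vs_resolution(model, make_loader,
                               resolutions=(224, 448, 896, 1792)):
        # Sweep input resolution while the model's per-image compute stays
        # fixed; compare the resulting top-1 curve against a standard DINO
        # baseline evaluated at its native resolution.
        model.eval()
        results = {}
        for res in resolutions:
            correct = total = 0
            for images, labels in make_loader(res):
                preds = model(images).argmax(dim=-1)
                correct += (preds == labels).sum().item()
                total += labels.numel()
            results[res] = correct / total
        return results

If top-1 accuracy at high resolution stays at or above the fixed-resolution DINO baseline while FLOPs stay flat, the load-bearing premise above holds; a collapse at large inputs would falsify it.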

Figures

Figures reproduced from arXiv: 2604.20392 by Danda Pani Paudel, Luc Van Gool, Nedyalko Prisadnikov, Yuqian Fu.

Figure 1
Figure 1. Computational Efficiency and Scaling. (Left & Middle) Latency and GFLOPs of a standard ViT versus our foveal model (evaluated sequentially over 8 steps with a learned gaze policy). While standard ViT compute scales quadratically with the number of patches, our dynamic model maintains a strictly O(1) footprint. (Right) ImageNet-1K Top-1 accuracy across extreme evaluation resolutions. All models are pretrain… view at source ↗
Figure 2
Figure 2. Foveal Extraction Latency. Median latency for extracting 1,000 patches (10 steps of 100 patches) from a single image, simulating a high-throughput training workload, including running the models. Compared to the supervised baseline [44], our integral-image approach reduces overhead. The increase of latency with image size reflects the one-time cost of converting the image into an integral image. … view at source ↗
Figure 3
Figure 3. Main architecture for sequential-to-global pretraining. The student network iteratively updates its state tokens, computing the distillation loss at each step against a static teacher embedding. Stop-gradients prevent backpropagation through time. … view at source ↗
Figure 4
Figure 4. Combining multi-zoom patches with a focused grid of patches for an informative context for the transformer. a) The input image with the bounding boxes of the multi-zoom crops (blue) and the foveal grid crops (red) overlaid; b) the extracted grid patches; c) the multi-zoom patches. Note that the patches are allowed to span outside the image area and are filled with zero padding. … view at source ↗
read the original abstract

Vision Transformers (ViTs) dominate self-supervised learning (SSL). While they have proven highly effective for large-scale pretraining, they are computationally inefficient and scale poorly with image size. Consequently, foundational models like DINO are constrained to low-resolution processing. A recent foveal-inspired transformer achieves resolution agnosticism by iteratively processing a fixed-size context of multi-zoom patches. This model demonstrated promising results via supervised learning, utilizing a sequential, recurrent-like process without backpropagation through time. To unlock its potential as a foundational backbone, we introduce a novel sequential-to-global SSL framework based on DINO's self-distillation objective. Supported by an efficient integral-image patch extraction method, our approach enables large-scale pretraining for image-size agnostic vision encoders. We achieve competitive performance on ImageNet-1K and downstream classification tasks, maintaining a constant computational budget regardless of input resolution.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper claims to develop a self-supervised pretraining method for a foveal-inspired iterative vision transformer that is agnostic to image size. By using a sequential-to-global SSL framework derived from DINO self-distillation and an efficient integral-image patch extraction, it enables large-scale pretraining with constant computational cost. Competitive results are reported on ImageNet-1K and downstream classification tasks.

Significance. If the results hold, this would be a significant advance in making Vision Transformers more practical for self-supervised learning at scale. The ability to maintain constant compute across resolutions addresses a key limitation of standard ViTs, which scale poorly with image size. This could enable broader adoption of such models in applications with variable input resolutions and contribute to more efficient foundational vision models.

major comments (1)
  1. [Abstract] The abstract asserts 'competitive performance on ImageNet-1K and downstream classification tasks' without providing any quantitative metrics, comparisons to baselines like standard DINO, ablation studies on the sequential-to-global framework, training details, or error analysis. Since the central claim depends on these empirical results to validate the SSL approach for the iterative transformer, this omission is load-bearing and prevents verification of the soundness of the method.
minor comments (1)
  1. The abstract introduces several technical terms (e.g., 'sequential-to-global SSL framework', 'integral-image patch extraction') without brief explanations or references, which may hinder immediate understanding for readers unfamiliar with the prior foveal-inspired transformer work.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the positive evaluation of our work's significance and for the constructive feedback. We address the major comment point by point below.

read point-by-point responses
  1. Referee: [Abstract] The abstract asserts 'competitive performance on ImageNet-1K and downstream classification tasks' without providing any quantitative metrics, comparisons to baselines like standard DINO, ablation studies on the sequential-to-global framework, training details, or error analysis. Since the central claim depends on these empirical results to validate the SSL approach for the iterative transformer, this omission is load-bearing and prevents verification of the soundness of the method.

    Authors: We agree that the abstract would be strengthened by including key quantitative metrics to better support the central claims upfront. The full manuscript (Sections 4 and 5) provides these details, including ImageNet-1K performance numbers, direct comparisons to standard DINO, ablations on the sequential-to-global framework, training hyperparameters, and analysis. We will revise the abstract to incorporate specific performance metrics along with brief references to the comparisons, ablations, and training details reported in the paper. revision: yes

Circularity Check

0 steps flagged

No significant circularity in the claimed approach

full rationale

The paper describes an empirical self-supervised pretraining method that adapts DINO self-distillation to an iterative foveal-inspired transformer via a sequential-to-global framework and integral-image patch extraction. All central claims (competitive ImageNet-1K performance, constant compute across resolutions) rest on experimental validation rather than any derivation, prediction, or first-principles result that reduces to its own inputs by construction. No equations, fitted parameters renamed as predictions, self-definitional loops, or load-bearing self-citations appear in the abstract or described method; the approach is self-contained against external benchmarks and does not invoke uniqueness theorems or ansatzes from prior author work.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Abstract-only review limits visibility into exact hyperparameters or training choices; the primary unstated premise is that DINO-style self-distillation transfers to sequential iterative processing without additional regularization or architectural changes.

axioms (1)
  • domain assumption DINO self-distillation objective can be adapted to sequential recurrent-like processing without backpropagation through time.
    The paper relies on this transfer to enable large-scale pretraining of the iterative model.

pith-pipeline@v0.9.0 · 5456 in / 1156 out tokens · 54509 ms · 2026-05-10T00:13:48.902017+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

63 extracted references · 21 canonical work pages · 10 internal anchors

  1. [1] Asano, Y.M., Rupprecht, C., Vedaldi, A.: Self-labelling via simultaneous clustering and representation learning. arXiv preprint arXiv:1911.05371 (2019)
  2. [2] Assran, M., Duval, Q., Misra, I., Bojanowski, P., Vincent, P., Rabbat, M., LeCun, Y., Ballas, N.: Self-supervised learning from images with a joint-embedding predictive architecture. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 15619–15629 (2023)
  3. [3] Bai, H., Zhou, Y., Wu, Y., Chan, C.M., Wen, P., Pan, K., Han, S., Guo, Y.: Glance-or-gaze: Incentivizing LMMs to adaptively focus search via reinforcement learning. arXiv preprint arXiv:2601.13942 (2026)
  4. [4] Balestriero, R., LeCun, Y.: LeJEPA: Provable and scalable self-supervised learning without the heuristics. arXiv preprint arXiv:2511.08544 (2025)
  5. [5] Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020)
  6. [6] Bulatov, A., Kuratov, Y., Burtsev, M.: Recurrent memory transformer. Advances in Neural Information Processing Systems 35, 11079–11091 (2022)
  7. [7] Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., Zagoruyko, S.: End-to-end object detection with transformers. In: European Conference on Computer Vision. pp. 213–229. Springer (2020)
  8. [8] Caron, M., Misra, I., Mairal, J., Goyal, P., Bojanowski, P., Joulin, A.: Unsupervised learning of visual features by contrasting cluster assignments. Advances in Neural Information Processing Systems 33, 9912–9924 (2020)
  9. [9] Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 9650–9660 (2021)
  10. [10] Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning. pp. 1597–1607. PMLR (2020)
  11. [11] Chen, X., He, K.: Exploring simple siamese representation learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 15750–15758 (2021)
  12. [12] Chowdhury, A., Paul, D., Mai, Z., Gu, J., Zhang, Z., Mehrab, K.S., Campolongo, E.G., Rubenstein, D., Stewart, C.V., Karpatne, A., et al.: Prompt-CAM: Making vision transformers interpretable for fine-grained analysis. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 4375–4385 (2025)
  13. [13] Crow, F.C.: Summed-area tables for texture mapping. In: Proceedings of the 11th Annual Conference on Computer Graphics and Interactive Techniques (1984). https://api.semanticscholar.org/CorpusID:2210332
  14. [14] Cuturi, M.: Sinkhorn distances: Lightspeed computation of optimal transport. Advances in Neural Information Processing Systems 26 (2013)
  15. [15] Darcet, T., Oquab, M., Mairal, J., Bojanowski, P.: Vision transformers need registers. arXiv preprint arXiv:2309.16588 (2023)
  16. [16] Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition. pp. 248–255. IEEE (2009)
  17. [17] Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: Pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). pp. 4171–4186 (2019)
  18. [18] Doersch, C., Gupta, A., Efros, A.A.: Unsupervised visual representation learning by context prediction. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 1422–1430 (2015)
  19. [19] Dosovitskiy, A.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)
  20. [20] Fu, Y., Wang, R., Ren, B., Sun, G., Gong, B., Fu, Y., Paudel, D.P., Huang, X., Van Gool, L.: ObjectRelator: Enabling cross-view object relation understanding across ego-centric and exo-centric perspectives. In: ICCV (2025)
  21. [21] Gehrig, M., Scaramuzza, D.: Recurrent vision transformers for object detection with event cameras. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 13884–13893 (2023)
  22. [22] Gidaris, S., Singh, P., Komodakis, N.: Unsupervised representation learning by predicting image rotations. arXiv preprint arXiv:1803.07728 (2018)
  23. [23] Goodale, M.A., Milner, A.D.: Separate visual pathways for perception and action. Trends in Neurosciences 15(1), 20–25 (1992)
  24. [24] Grill, J.B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020)
  25. [25] He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 16000–16009 (2022)
  26. [26] He, K., Fan, H., Wu, Y., Xie, S., Girshick, R.: Momentum contrast for unsupervised visual representation learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 9729–9738 (2020)
  27. [27] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997)
  28. [28] Jiang, Y., Guo, Z., Rezazadegan Tavakoli, H., Leiva, L.A., Oulasvirta, A.: EyeFormer: Predicting personalized scanpaths with transformer-guided reinforcement learning. In: Proceedings of the 37th Annual ACM Symposium on User Interface Software and Technology. pp. 1–15 (2024)
  29. [29] Jonnalagadda, A., Wang, W.Y., Manjunath, B., Eckstein, M.P.: FoveaTer: Foveated transformer for image classification. arXiv preprint arXiv:2105.14173 (2021)
  30. [30] Kerr, J., Hari, K., Weber, E., Kim, C.M., Yi, B., Bonnen, T., Goldberg, K., Kanazawa, A.: Eye, robot: Learning to look to act with a BC-RL perception-action loop. arXiv preprint arXiv:2506.10968 (2025)
  31. [31] LeCun, Y., et al.: A path towards autonomous machine intelligence, version 0.9.2, 2022-06-27. Open Review 62(1), 1–62 (2022)
  32. [32] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692 (2019)
  33. [33] Loshchilov, I., Hutter, F.: SGDR: Stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983 (2016)
  34. [34] Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781 (2013)
  35. [35] Mnih, V., Heess, N., Graves, A., Kavukcuoglu, K.: Recurrent models of visual attention. Advances in Neural Information Processing Systems 27 (2014)
  36. [36] Mondal, S., Yang, Z., Ahn, S., Samaras, D., Zelinsky, G., Hoai, M.: Gazeformer: Scalable, effective and fast prediction of goal-directed human attention. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 1441–1450 (2023)
  37. [37] Nilsback, M.E., Zisserman, A.: Automated flower classification over a large number of classes. In: Indian Conference on Computer Vision, Graphics and Image Processing (Dec 2008)
  38. [38] Noroozi, M., Favaro, P.: Unsupervised learning of visual representations by solving jigsaw puzzles. In: European Conference on Computer Vision. pp. 69–84. Springer (2016)
  39. [39] Noton, D., Stark, L.: Scanpaths in eye movements during pattern perception. Science 171(3968), 308–311 (1971)
  40. [40] Oord, A.v.d., Li, Y., Vinyals, O.: Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748 (2018)
  41. [41] Oquab, M., Darcet, T., Moutakanni, T., Vo, H., Szafraniec, M., Khalidov, V., Fernandez, P., Haziza, D., Massa, F., El-Nouby, A., et al.: DINOv2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193 (2023)
  42. [42] Pathak, D., Krahenbuhl, P., Donahue, J., Darrell, T., Efros, A.A.: Context encoders: Feature learning by inpainting. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 2536–2544 (2016)
  43. [43] Pennington, J., Socher, R., Manning, C.D.: GloVe: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). pp. 1532–1543 (2014)
  44. [44] Prisadnikov, N., Paudel, D.P., Fu, Y., Van Gool, L.: Vision encoders should be image size agnostic and task driven. arXiv preprint arXiv:2508.16317 (2025)
  45. [45] Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning. pp. 8748–8763. PMLR (2021)
  46. [46] Radford, A., Narasimhan, K., Salimans, T., Sutskever, I., et al.: Improving language understanding by generative pre-training (2018)
  47. [47] Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., Liu, P.J.: Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21(140), 1–67 (2020)
  48. [48] Sablayrolles, A., Douze, M., Schmid, C., Jégou, H.: Spreading vectors for similarity search. arXiv preprint arXiv:1806.03198 (2018)
  49. [49] Schmidhuber, J., Huber, R.: Learning to generate artificial fovea trajectories for target detection. International Journal of Neural Systems 2(01n02), 125–134 (1991)
  50. [50] Shao, Z., Wang, P., Zhu, Q., Xu, R., Song, J., Bi, X., Zhang, H., Zhang, M., Li, Y., Wu, Y., et al.: DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300 (2024)
  51. [51] Siméoni, O., Vo, H.V., Seitzer, M., Baldassarre, F., Oquab, M., Jose, C., Khalidov, V., Szafraniec, M., Yi, S., Ramamonjisoa, M., et al.: DINOv3. arXiv preprint arXiv:2508.10104 (2025)
  52. [52] Touvron, H., Vedaldi, A., Douze, M., Jégou, H.: Fixing the train-test resolution discrepancy. Advances in Neural Information Processing Systems 32 (2019)
  53. [53] Traub, M., Butz, M.V.: Looking locally: Object-centric vision transformers as foundation models for efficient segmentation. arXiv preprint arXiv:2502.02763 (2025)
  54. [54] Tschannen, M., Gritsenko, A., Wang, X., Naeem, M.F., Alabdulmohsin, I., Parthasarathy, N., Evans, T., Beyer, L., Xia, Y., Mustafa, B., et al.: SigLIP 2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense features. arXiv preprint arXiv:2502.14786 (2025)
  55. [55] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017)
  56. [56] Viola, P.A., Jones, M.J.: Rapid object detection using a boosted cascade of simple features. In: 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2001). pp. I:511–518. IEEE Computer Society (2001). https://doi.org/10.1109/CVPR.2001.990517
  57. [57] Wah, C., Branson, S., Welinder, P., Perona, P., Belongie, S., et al.: The Caltech-UCSD Birds-200-2011 dataset. Tech. rep. CNS-TR-2011-001, California Institute of Technology (2011)
  58. [58] Werbos, P.: Backpropagation through time: what it does and how to do it. Proceedings of the IEEE 78(10), 1550–1560 (1990). https://doi.org/10.1109/5.58337
  59. [59] Williams, R.J.: Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning 8, 229–256 (1992)
  60. [60] Wu, Q., Lan, Z., Qian, K., Gu, J., Geramifard, A., Yu, Z.: Memformer: A memory-augmented transformer for sequence modeling. In: Findings of the Association for Computational Linguistics: AACL-IJCNLP 2022. pp. 308–318 (2022)
  61. [61] Yang, Z., Mondal, S., Ahn, S., Xue, R., Zelinsky, G., Hoai, M., Samaras, D.: Unifying top-down and bottom-up scanpath prediction using transformers. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 1683–1693 (2024)
  62. [62] Zhai, X., Mustafa, B., Kolesnikov, A., Beyer, L.: Sigmoid loss for language image pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 11975–11986 (2023)
  63. [63] Zhang, R., Isola, P., Efros, A.A.: Colorful image colorization. In: European Conference on Computer Vision. pp. 649–666. Springer (2016)