
arxiv: 2603.22570 · v2 · pith:4HK7HXV5new · submitted 2026-03-23 · 💻 cs.CV

CanViT: Toward Active-Vision Foundation Models

Pith reviewed 2026-05-21 10:27 UTC · model grok-4.3

classification 💻 cs.CV
keywords active vision · foundation model · Vision Transformer · glimpses · scene reconstruction · segmentation · ImageNet · ADE20K

The pith

CanViT is a task- and policy-agnostic Vision Transformer that builds scene representations from sequential low-resolution glimpses via a retinotopic backbone and spatiotopic canvas.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents CanViT as the first active-vision foundation model that operates without task-specific pretraining or policy supervision. It couples a standard Vision Transformer backbone that processes individual low-resolution glimpses with a separate high-capacity canvas that maintains a scene-wide latent workspace. Binding between the two occurs through scene-relative rotary position embeddings and a new asymmetric cross-attention layer called Canvas Attention. The model is pretrained by reconstructing full-scene DINOv3 embeddings from random sequences of glimpses that vary in location, zoom, and length. Once pretrained, a frozen CanViT-B already exceeds the best prior active-vision model on ADE20K segmentation while using 20x fewer inference FLOPs, and after fine-tuning it sets a new active-vision state of the art on ImageNet-1k classification.
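
To make the binding idea concrete, here is a minimal NumPy sketch of an axial 2-D rotary embedding driven by scene coordinates rather than glimpse-local patch indices. It is an illustration under stated assumptions, not the paper's implementation: the channel split, the frequency base (`base=100.0`), and the function names `rope_1d` and `rope_2d` are all hypothetical. The property it shows is generic to rotary embeddings: the query-key dot product depends only on the offset between two scene positions, so a backbone token and a canvas token can be compared regardless of where the glimpse was taken.

```python
import numpy as np

def rope_1d(x, pos, base=100.0):
    """Rotate consecutive feature pairs of x by angles pos * freq.

    x:   (n_tokens, dim) array, dim must be even
    pos: (n_tokens,) scene-relative coordinate of each token
    base is a hypothetical frequency base, not a value from the paper.
    """
    half = x.shape[-1] // 2
    freqs = base ** (-np.arange(half) / half)   # (half,) geometric frequency ladder
    ang = pos[:, None] * freqs[None, :]         # (n_tokens, half) rotation angles
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, 0::2], x[:, 1::2]             # the two members of each pair
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

def rope_2d(x, xy, base=100.0):
    """Axial 2-D RoPE: first half of the channels encodes x, second half encodes y.

    x:  (n_tokens, dim) array, dim must be divisible by 4
    xy: (n_tokens, 2) scene coordinates, e.g. in [-1, +1]^2
    """
    d = x.shape[-1] // 2
    return np.concatenate(
        [rope_1d(x[:, :d], xy[:, 0], base), rope_1d(x[:, d:], xy[:, 1], base)],
        axis=-1,
    )
```

Shifting a query position and a key position by the same amount leaves `rope_2d(q, p_q) @ rope_2d(k, p_k).T` unchanged, which is the relative-position behavior the sketch is meant to show.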

Core claim

By decoupling thinking (inside the retinotopic backbone) from memory (inside the canvas) and pretraining via policy-agnostic dense latent distillation from DINOv3, CanViT produces representations that transfer directly to active-vision benchmarks. A frozen CanViT-B reaches 38.5 percent mIoU on ADE20K from one low-resolution glimpse, and further glimpses raise this to 45.9 percent mIoU; after fine-tuning, CanViT-B reaches 84.5 percent top-1 on ImageNet-1k.

What carries the argument

The canvas, a high-capacity spatiotopic latent workspace that receives asymmetric cross-attention from the retinotopic backbone and stores scene-wide embeddings without self-attention or feed-forward layers on the canvas side.
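
The following NumPy sketch illustrates one way a Read and a Write between a backbone and a canvas could look when the canvas has no self-attention or feed-forward layers of its own. It is a hedged illustration, not the paper's Canvas Attention: it is single-head, omits the scene-relative position embedding, and the placement and shapes of the projection matrices (`r_q`, `r_k`, `r_v`, `w_q`, `w_k`, `w_v`) are assumptions. What exactly makes the paper's layer asymmetric is not specified in the text available here.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(queries, memory, wq, wk, wv):
    """Single-head cross-attention: each `queries` token attends over `memory` tokens."""
    q, k, v = queries @ wq, memory @ wk, memory @ wv
    scores = q @ k.T / np.sqrt(q.shape[-1])   # (n_queries, n_memory)
    return softmax(scores) @ v

def init_params(rng, d_b, d_c, d_h):
    """Hypothetical projection shapes for backbone width d_b and canvas width d_c."""
    def w(m, n):
        return rng.normal(scale=m ** -0.5, size=(m, n))
    return {
        "r_q": w(d_b, d_h), "r_k": w(d_c, d_h), "r_v": w(d_c, d_b),  # Read
        "w_q": w(d_c, d_h), "w_k": w(d_b, d_h), "w_v": w(d_b, d_c),  # Write
    }

def canvas_round_trip(backbone, canvas, p):
    """One Read followed by one Write (assumed ordering).

    backbone: (n_b, d_b) glimpse tokens; canvas: (n_c, d_c) scene tokens.
    """
    # Read: backbone tokens query the canvas and add the result to their own stream.
    backbone = backbone + cross_attend(backbone, canvas, p["r_q"], p["r_k"], p["r_v"])
    # Write: canvas tokens query the backbone; the result is a residual canvas update.
    canvas = canvas + cross_attend(canvas, backbone, p["w_q"], p["w_k"], p["w_v"])
    return backbone, canvas
```

In this sketch canvas tokens never attend to one another, so both attention score matrices have size n_b × n_c and the cost grows linearly with the number of canvas tokens.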

If this is right

  • Additional glimpses raise ADE20K mIoU from 38.5 percent to 45.9 percent without retraining.
  • The same pretrained weights set a new active-vision state of the art of 84.5 percent top-1 on ImageNet-1k after fine-tuning.
  • The model generalizes to longer rollouts, larger scenes, and policies different from those seen in pretraining.
  • Inference cost remains low because canvas-side self-attention and fully-connected layers are eliminated.
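
A back-of-the-envelope calculation shows why dropping canvas-side self-attention matters for the last point. The sketch below counts only the two attention matmuls (QK^T and the weighted sum over values), at 2 FLOPs per multiply-add, and ignores projections, softmax, and feed-forward layers. The token counts and width are assumptions chosen for illustration (64 backbone tokens, as would result from a 128 px glimpse at a hypothetical patch size of 16, a 32×32 canvas, and width 768); the paper's own FLOP formulas are not reproduced here.

```python
def attn_matmul_flops(n_q, n_k, dim):
    """FLOPs of QK^T plus attention-weighted values, 2 FLOPs per multiply-add."""
    return 4 * n_q * n_k * dim

def compare(n_backbone, canvas_grid, dim):
    """Return (cross-attention FLOPs, canvas self-attention FLOPs, ratio)."""
    n_canvas = canvas_grid ** 2
    cross = attn_matmul_flops(n_backbone, n_canvas, dim)   # backbone <-> canvas
    self_attn = attn_matmul_flops(n_canvas, n_canvas, dim) # canvas <-> canvas
    return cross, self_attn, self_attn / cross

cross, self_attn, ratio = compare(n_backbone=64, canvas_grid=32, dim=768)
print(cross, self_attn, ratio)  # 201326592 3221225472 16.0
```

Under these assumptions the ratio equals n_canvas / n_backbone, so the saving from skipping canvas self-attention grows with canvas resolution while the cross-attention cost stays linear in the number of canvas tokens.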

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The separation of retinotopic and spatiotopic representations could be tested on robotic navigation tasks where camera motion must be planned on the fly.
  • Because pretraining uses only unlabeled images, the same pipeline could be applied to video streams or egocentric datasets without new annotations.
  • Scaling the canvas size while keeping the backbone fixed might further close the remaining gap to passive foundation models.
  • The approach suggests that active-vision models may no longer require separate policy networks if the memory binding is sufficiently general.

Load-bearing premise

Reconstructing scene-wide DINOv3 embeddings from sequences of randomized low-resolution glimpses produces task- and policy-agnostic representations that transfer to downstream active-vision tasks.
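
The premise can be made concrete with a small sketch of the two ingredients it names: a randomized rollout of viewpoints and a reconstruction loss against teacher embeddings. The loss components (patch MSE and CLS MSE) follow the labels in the Figure 6 caption; everything else is an assumption for illustration, including the uniform sampling distributions, the constraint that each glimpse stays inside the scene, the equal loss weighting, and the names `sample_rollout` and `distillation_loss`.

```python
import numpy as np

def sample_rollout(rng, max_len=8, min_scale=0.1):
    """Draw a random-length sequence of viewpoints (x, y, s).

    x, y are centers in [-1, +1]; s in (0, 1] is the scale (zoom level).
    max_len and min_scale are hypothetical values, not taken from the paper.
    """
    t = int(rng.integers(1, max_len + 1))            # random rollout length
    s = rng.uniform(min_scale, 1.0, size=t)
    # Assumption: keep each glimpse inside the scene, i.e. |center| <= 1 - s.
    x = rng.uniform(-1.0, 1.0, size=t) * (1.0 - s)
    y = rng.uniform(-1.0, 1.0, size=t) * (1.0 - s)
    return np.stack([x, y, s], axis=-1)              # (t, 3)

def distillation_loss(pred_patches, pred_cls, teacher_patches, teacher_cls, cls_weight=1.0):
    """Patch MSE plus weighted CLS MSE against frozen teacher embeddings."""
    patch_mse = float(np.mean((pred_patches - teacher_patches) ** 2))
    cls_mse = float(np.mean((pred_cls - teacher_cls) ** 2))
    return patch_mse + cls_weight * cls_mse, patch_mse, cls_mse
```

Because the viewpoints are drawn independently of any task or learned policy, nothing in this objective ties the learned representation to a particular way of looking at the scene, which is the sense of "policy-agnostic" the premise relies on.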

What would settle it

Training a comparable architecture on the same data volume but with fixed full-resolution inputs instead of randomized glimpses, then measuring whether it matches or outperforms the glimpse-based CanViT on a held-out active-vision rollout benchmark.

Figures

Figures reproduced from arXiv: 2603.22570 by Audrey Durand, B. Suresh Krishna, Sabrina Du, Yohaï-Eliel Berreby.

Figure 1: A CanViT rollout. We consider a high-resolution scene (A). At each timestep t, CanViT ingests a 128² px glimpse (B, 1st row), a crop extracted at a viewpoint with center (x_t, y_t) ∈ [−1, +1]² and scale (zoom level) s_t ∈ (0, 1]. This updates a scene-wide latent representation, the canvas, with which CanViT integrates broad context and fine detail from variable-scale glimpses, extrapolates to unobserved r…
Figure 2: CanViT architecture diagram. We adopt a dual-stream structure, equipping a ViT backbone (purple, left-hand columns), which processes localized glimpses (blue), with a canvas (red, right-hand columns), a fine-grained scene-wide spatio-semantic memory. At each timestep t, a glimpse is extracted from a viewpoint v_t = (x_t, y_t, s_t), patchified, and processed through the backbone, alongside a recurrent CLS to…
Figure 3: Left: A Canvas Attention round-trip (one Read and one Write) with a zoomed-out, full …
Figure 4: Linear-probing benchmark results (frozen CanViT-B). (A) Accuracy–efficiency comparison with prior active models on ADE20K segmentation. (B) Effect of viewing policy and canvas resolution (ADE20K segmentation). (C) Effect of viewing policy (ImageNet-1k classification).
Figure 5: Canvas updates within and across glimpses. CanViT sequentially performs multiple Canvas Write Attention operations per glimpse, with each producing a residual that then updates the canvas …
Figure 6: Ablation pretraining loss curves (ImageNet-21k). 2×2 grid: rows = policy (R-IID Policy, F-IID Policy), columns = loss component (Patch MSE, CLS MSE). Bold line: EMA-smoothed (α = 0.01) over logged per-batch values; faint overlay: pre-EMA values. 11 + 1 variants (baseline + 11 ablations). See also …
Figure 7: Analytical FLOP scaling. All curves computed from the formulas in Section G. (A) FLOPs for a single inference step vs. output resolution (canvas grid² tokens for CanViT, image grid² for DINOv3). CanViT's backbone processes a fixed-size glimpse; Canvas Attention cost grows with output resolution. (B) Total FLOPs vs. number of glimpses at fixed output resolution. CanViT cost is linear in T; AME and AdaGlimp…
Figure 8: Real-time latency: CanViT-B vs DINOv3 ViT-B/16. Minimum per-forward-pass latency (B = 1, with device sync) across scene side lengths. CPU: best of 1–32 thread configurations; CUDA: torch.compile, fp32 and AMP bf16. Faint dots: all raw timings. Solid markers: minimum. Full breakdown in …
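
The viewpoint parameterization in the Figure 1 caption (center in [−1, +1]², scale in (0, 1], 128² px glimpse) can be sketched as a crop-and-resample step. This is a minimal illustration under assumptions the caption does not settle: that scale 1 at center (0, 0) covers the whole scene, that out-of-bounds boxes are clipped, and that nearest-neighbor resampling is used. The names `crop_box` and `extract_glimpse` are hypothetical.

```python
import numpy as np

def crop_box(view, height, width):
    """Map a viewpoint (x, y, s) to a pixel box (top, left, bottom, right).

    Assumption: the glimpse spans a fraction s of each scene side,
    centered at (x, y) in normalized [-1, +1] coordinates.
    """
    x, y, s = view
    cx = (x + 1.0) / 2.0 * width
    cy = (y + 1.0) / 2.0 * height
    half_w, half_h = s * width / 2.0, s * height / 2.0
    left, right = int(round(cx - half_w)), int(round(cx + half_w))
    top, bottom = int(round(cy - half_h)), int(round(cy + half_h))
    # Clip to the scene; the box is assumed to overlap it.
    return max(top, 0), max(left, 0), min(bottom, height), min(right, width)

def extract_glimpse(scene, view, size=128):
    """Crop the scene at `view` and resample to size x size (nearest neighbor)."""
    top, left, bottom, right = crop_box(view, scene.shape[0], scene.shape[1])
    crop = scene[top:bottom, left:right]
    h, w = crop.shape[:2]
    rows = np.minimum(((np.arange(size) + 0.5) * h / size).astype(int), h - 1)
    cols = np.minimum(((np.arange(size) + 0.5) * w / size).astype(int), w - 1)
    return crop[rows][:, cols]

scene = np.zeros((512, 512, 3), dtype=np.uint8)
print(crop_box((0.5, -0.5, 0.25), 512, 512))            # (64, 320, 192, 448)
print(extract_glimpse(scene, (0.0, 0.0, 1.0)).shape)    # (128, 128, 3)
```

Every glimpse has the same pixel size regardless of scale, so a zoomed-out view trades detail for coverage and a zoomed-in view does the opposite.
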
Original abstract

Active computer vision promises efficient, biologically plausible perception through sequential, localized glimpses, but lacks scalable general-purpose architectures and pretraining pipelines, leaving Active-Vision Foundation Models (AVFMs) underexplored. We introduce CanViT, the first task- and policy-agnostic AVFM. CanViT uses scene-relative RoPE to bind a retinotopic Vision Transformer backbone and a spatiotopic scene-wide latent workspace, the canvas. Efficient interaction with this high-capacity working memory is supported by Canvas Attention, a novel asymmetric cross-attention mechanism. We decouple thinking (backbone-level) and memory (canvas-level), eliminating canvas-side self-attention and fully-connected layers to achieve fast sequential inference and scalability to high output resolutions. We propose a label-free active vision pretraining scheme, policy-agnostic passive-to-active dense latent distillation: reconstructing scene-wide DINOv3 embeddings from sequences of low-resolution glimpses with randomized locations, zoom levels, and lengths. We pretrain CanViT-B from a random initialization on 13.2 million ImageNet-21k scenes--an order of magnitude more than previous active models--and 1 billion random glimpses, in 166 hours on a single H100. On ADE20K segmentation, a frozen CanViT-B achieves 38.5% mIoU in a single low-resolution glimpse, outperforming the best active model's 27.6% with 20x fewer inference FLOPs as well as its FLOP- or input-matched DINOv3 teacher. Given additional glimpses, CanViT-B reaches 45.9% ADE20K mIoU. On ImageNet-1k classification, CanViT-B also sets a new active-vision state of the art, with 84.5% top-1 accuracy after fine-tuning. CanViT generalizes to longer rollouts, larger scenes, and new policies. Our work narrows the wide gap between passive and active computer vision, demonstrating the potential of task- and policy-agnostic AVFM pretraining.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces CanViT as the first task- and policy-agnostic Active-Vision Foundation Model. It combines a retinotopic Vision Transformer backbone with a spatiotopic canvas workspace using scene-relative RoPE and a novel asymmetric Canvas Attention mechanism that decouples backbone thinking from canvas memory to enable efficient sequential inference. Pretraining is performed label-free via policy-agnostic passive-to-active dense latent distillation: reconstructing full-scene DINOv3 embeddings from sequences of randomized low-resolution glimpses (varying location, zoom, and length) on 13.2 million ImageNet-21k scenes with 1 billion glimpses. The paper reports that a frozen CanViT-B achieves 38.5% mIoU on ADE20K segmentation from a single low-resolution glimpse, outperforming both the best prior active model (27.6%, with CanViT-B using 20x fewer inference FLOPs) and its FLOP- or input-matched DINOv3 teacher. It reaches 45.9% mIoU with additional glimpses and attains 84.5% top-1 accuracy on ImageNet-1k after fine-tuning, while generalizing to longer rollouts, larger scenes, and new policies.

Significance. If the results hold, the work is significant for demonstrating a scalable, label-free pretraining pipeline that produces transferable representations for active vision without task-specific supervision. The reported efficiency (20x fewer inference FLOPs) and scale (an order of magnitude more pretraining data than prior active models), together with the architectural decoupling of backbone and canvas, provide concrete evidence that AVFMs can narrow the performance gap with passive models. The reported generalization to new policies and the evaluation on standard benchmarks are particular strengths.

major comments (2)
  1. Pretraining description and generalization claims: the central assertion that randomized-glimpse distillation yields task- and policy-agnostic latents is load-bearing for the ADE20K and ImageNet results. The pretraining distribution is purely random, yet downstream benchmarks use structured or learned policies; without ablations that isolate randomization versus structured policies (or report performance under realistic active rollouts), the 38.5% single-glimpse mIoU and the claim of generalization to new policies remain only moderately supported.
  2. Experiments section, ADE20K results: the reported 38.5% mIoU (single glimpse) and 45.9% mIoU (multiple glimpses) are the primary quantitative support for the AVFM claim. However, the absence of error bars, statistical significance tests, exact data splits, and ablations on glimpse count or policy variation leaves the performance advantage over the 27.6% baseline and the DINOv3 teacher only moderately substantiated.
minor comments (2)
  1. The definition and implementation details of Canvas Attention (asymmetric cross-attention) would benefit from an explicit equation or pseudocode block to clarify how it eliminates canvas-side self-attention and FC layers.
  2. Figure captions and the main text should consistently report the exact resolution and number of glimpses used for each reported mIoU number to improve reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive evaluation of our work's significance and for the recommendation of major revision. We address each of the major comments in detail below and have revised the manuscript accordingly to strengthen the presentation of our results and claims.

Point-by-point responses
  1. Referee: Pretraining description and generalization claims: the central assertion that randomized-glimpse distillation yields task- and policy-agnostic latents is load-bearing for the ADE20K and ImageNet results. The pretraining distribution is purely random, yet downstream benchmarks use structured or learned policies; without ablations that isolate randomization versus structured policies (or report performance under realistic active rollouts), the 38.5% single-glimpse mIoU and the claim of generalization to new policies remain only moderately supported.

    Authors: We agree that isolating the effect of randomization in pretraining would provide stronger evidence for the policy-agnostic claim. The manuscript already demonstrates generalization to new policies through experiments on longer rollouts and different scene sizes, but to directly address this, we will include in the revision an ablation where we pretrain with a non-random policy and compare downstream performance. We will also report results using a realistic active policy for inference on ADE20K, such as one that prioritizes high-uncertainty regions based on the current canvas state. This will better substantiate the 38.5% mIoU result and the generalization claims. revision: yes

  2. Referee: Experiments section, ADE20K results: the reported 38.5% mIoU (single glimpse) and 45.9% mIoU (multiple glimpses) are the primary quantitative support for the AVFM claim. However, the absence of error bars, statistical significance tests, exact data splits, and ablations on glimpse count or policy variation leaves the performance advantage over the 27.6% baseline and the DINOv3 teacher only moderately substantiated.

    Authors: We acknowledge these omissions in the original submission. In the revised manuscript, we will add error bars from multiple training runs, specify the exact train/validation splits used for ADE20K, and include ablations on the number of glimpses (showing performance curves for 1 to 10 glimpses) as well as different inference policies. We will also perform and report statistical significance tests comparing CanViT to the baselines. These changes will make the performance advantages more rigorously substantiated. revision: yes
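
For readers who want to see what the promised error bars would rest on, here is a minimal sketch of how mIoU is commonly computed from a confusion matrix and how a mean and sample standard deviation across runs could be reported. It is a generic illustration, not the paper's evaluation code: the `ignore_index=255` convention and the function names are assumptions, and no numbers from the paper are reproduced.

```python
import numpy as np

def confusion_matrix(pred, target, num_classes, ignore_index=255):
    """Rows are ground-truth classes, columns are predicted classes."""
    mask = target != ignore_index                      # drop unlabeled pixels
    idx = target[mask].astype(np.int64) * num_classes + pred[mask].astype(np.int64)
    return np.bincount(idx, minlength=num_classes ** 2).reshape(num_classes, num_classes)

def mean_iou(conf):
    """Mean over classes of intersection / union, skipping classes with empty union."""
    inter = np.diag(conf).astype(float)
    union = conf.sum(axis=0) + conf.sum(axis=1) - inter
    valid = union > 0
    return float(np.mean(inter[valid] / union[valid]))

def mean_and_std(values):
    """Mean and sample standard deviation (ddof=1) across independent runs."""
    v = np.asarray(values, dtype=float)
    return float(v.mean()), float(v.std(ddof=1))

# Toy 2x2 label map: class 0 has IoU 1/2, class 1 has IoU 2/3.
target = np.array([[0, 0], [1, 1]])
pred = np.array([[0, 1], [1, 1]])
print(mean_iou(confusion_matrix(pred, target, num_classes=2)))
```

Accumulating one confusion matrix over the whole validation set before calling `mean_iou` is the usual dataset-level convention; averaging per-image IoU instead gives a different number, so a revision that reports error bars would need to state which one it uses.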

Circularity Check

0 steps flagged

CanViT architecture and random-glimpse distillation evaluated on external benchmarks without definitional reduction

full rationale

The paper introduces a new retinotopic backbone with scene-relative RoPE and Canvas Attention, then pretrains via reconstruction of external DINOv3 scene embeddings from randomized low-resolution glimpses. Downstream results (ADE20K mIoU, ImageNet top-1) are measured directly against independent baselines and the DINOv3 teacher on standard datasets; no equations or self-citations are shown that force the reported transfer performance to equal the pretraining inputs or fitted parameters by construction. The policy-agnostic claim is an empirical hypothesis tested via generalization to new policies and rollouts rather than a tautology.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 2 invented entities

Based on abstract only; ledger reflects explicitly introduced components. No free parameters are named. Axioms are standard transformer assumptions. Invented entities are the novel architectural pieces.

axioms (1)
  • [standard math] Vision Transformer backbone assumptions hold when adapted to retinotopic glimpses.
    Model builds directly on ViT-style processing for local views.
invented entities (2)
  • Canvas (no independent evidence)
    purpose: spatiotopic scene-wide latent workspace for high-capacity working memory
    Core memory component decoupled from the backbone for fast sequential inference.
  • Canvas Attention (no independent evidence)
    purpose: asymmetric cross-attention mechanism for efficient interaction with the canvas
    Novel mechanism that eliminates canvas-side self-attention and FC layers.

pith-pipeline@v0.9.0 · 5925 in / 1554 out tokens · 53072 ms · 2026-05-21T10:27:44.831879+00:00 · methodology


