CanViT: Toward Active-Vision Foundation Models

Audrey Durand; B. Suresh Krishna; Sabrina Du; Yohaï-Eliel Berreby

arxiv: 2603.22570 · v2 · pith:4HK7HXV5new · submitted 2026-03-23 · 💻 cs.CV

Yohaï-Eliel Berreby, Sabrina Du, Audrey Durand, B. Suresh Krishna

Pith reviewed 2026-05-21 10:27 UTC · model grok-4.3

classification 💻 cs.CV

keywords: active vision · foundation model · Vision Transformer · glimpses · scene reconstruction · segmentation · ImageNet · ADE20K

The pith

CanViT is a task- and policy-agnostic Vision Transformer that builds scene representations from sequential low-resolution glimpses via a retinotopic backbone and spatiotopic canvas.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents CanViT as the first active-vision foundation model that is both task- and policy-agnostic: it needs neither task-specific pretraining nor policy supervision. It couples a Vision Transformer backbone that processes individual low-resolution glimpses with a separate high-capacity canvas that maintains a scene-wide latent workspace. The two are bound through scene-relative rotary position embeddings and a new asymmetric cross-attention layer called Canvas Attention. The model is pretrained by reconstructing full-scene DINOv3 embeddings from random sequences of glimpses that vary in location, zoom, and length. Once pretrained, a frozen CanViT-B already exceeds prior active-vision models on ADE20K segmentation while using 20x fewer inference FLOPs, and after fine-tuning it sets a new active-vision state of the art on ImageNet-1k classification.
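To make "scene-relative" positions concrete, here is a minimal NumPy sketch. It is not taken from the paper: the function names, the frequency schedule (`base`), and the convention that a glimpse with center (x, y) and scale s covers [x − s, x + s] × [y − s, y + s] inside a scene spanning [−1, +1]² are all assumptions made for illustration. The idea shown is only that glimpse patches are assigned coordinates in the scene's frame before a standard rotary embedding is applied, so that attention scores depend on relative scene position.

```python
import numpy as np

def patch_centers(x, y, s, grid):
    """Scene coordinates of the patch centers of a grid x grid glimpse.

    Assumption: the scene spans [-1, 1]^2 and the glimpse covers
    [x - s, x + s] x [y - s, y + s] (so s = 1 at the center is the full scene).
    """
    local = (np.arange(grid) + 0.5) / grid * 2.0 - 1.0   # cell centers in [-1, 1]
    xs = x + s * local
    ys = y + s * local
    yy, xx = np.meshgrid(ys, xs, indexing="ij")          # row-major patch order
    return np.stack([xx.ravel(), yy.ravel()], axis=-1)   # (grid * grid, 2)

def rope_rotate(feats, pos, base=100.0):
    """Rotate feature pairs by angles proportional to 2-D position.

    feats: (n, d) float array with d divisible by 4; the first half of the
    channels encodes x, the second half encodes y (hypothetical layout).
    """
    n, d = feats.shape
    half = d // 2
    freqs = base ** (-np.arange(half // 2) / (half // 2))
    out = np.empty_like(feats, dtype=float)
    for axis in range(2):
        ang = pos[:, axis:axis + 1] * freqs[None, :] * np.pi
        block = feats[:, axis * half:(axis + 1) * half]
        a, b = block[:, 0::2], block[:, 1::2]
        rot = np.empty_like(block, dtype=float)
        rot[:, 0::2] = a * np.cos(ang) - b * np.sin(ang)
        rot[:, 1::2] = a * np.sin(ang) + b * np.cos(ang)
        out[:, axis * half:(axis + 1) * half] = rot
    return out
```

Because each channel pair is rotated by an angle linear in position, the dot product between a rotated query and a rotated key depends only on the difference of their scene positions. That property is what would let tokens from differently placed and zoomed glimpses address the same canvas locations.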

Core claim

By decoupling thinking inside the retinotopic backbone from memory inside the canvas and pretraining via policy-agnostic dense latent distillation from DINOv3, CanViT produces representations that transfer directly to active-vision benchmarks; a frozen CanViT-B reaches 38.5 percent mIoU on ADE20K from one low-resolution glimpse and 84.5 percent top-1 on ImageNet-1k after fine-tuning, while further glimpses raise ADE20K performance to 45.9 percent mIoU.

What carries the argument

The canvas, a high-capacity spatiotopic latent workspace that receives asymmetric cross-attention from the retinotopic backbone and stores scene-wide embeddings without self-attention or feed-forward layers on the canvas side.
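The Read/Write pattern described here can be sketched as follows. This is a hedged illustration, not the paper's Canvas Attention: it uses a single head, full learned projections on both sides, and omits normalization and position embeddings, all of which are simplifying assumptions. The names `cross_attend` and `canvas_round_trip` are hypothetical. What the sketch does reflect from the summary is that the canvas is touched only through cross-attention, with no canvas-side self-attention or feed-forward layer.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(queries, context, wq, wk, wv):
    """Single-head cross-attention: each query token attends to all context tokens."""
    q, k, v = queries @ wq, context @ wk, context @ wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def canvas_round_trip(backbone, canvas, params):
    """One Read followed by one Write (illustrative ordering).

    backbone: (n_glimpse_tokens, d_b), canvas: (n_canvas_tokens, d_c).
    params["read"] and params["write"] are (wq, wk, wv) tuples.
    """
    # Read: backbone tokens query the canvas and receive a residual update.
    backbone = backbone + cross_attend(backbone, canvas, *params["read"])
    # Write: canvas tokens query the backbone; the result is added as a residual.
    # Canvas tokens never attend to each other and pass through no MLP.
    canvas = canvas + cross_attend(canvas, backbone, *params["write"])
    return backbone, canvas
```

In this layout the per-step cost on the canvas side is one cross-attention against a fixed number of glimpse tokens, so it grows linearly with the number of canvas tokens rather than quadratically.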

If this is right

Additional glimpses raise ADE20K mIoU from 38.5 percent to 45.9 percent without retraining.
The same pretrained weights set a new active-vision state of the art of 84.5 percent top-1 on ImageNet-1k after fine-tuning.
The model generalizes to longer rollouts, larger scenes, and policies different from those seen in pretraining.
Inference cost remains low because canvas-side self-attention and fully-connected layers are eliminated.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The separation of retinotopic and spatiotopic representations could be tested on robotic navigation tasks where camera motion must be planned on the fly.
Because pretraining uses only unlabeled images, the same pipeline could be applied to video streams or egocentric datasets without new annotations.
Scaling the canvas size while keeping the backbone fixed might further close the remaining gap to passive foundation models.
The approach suggests that active-vision models may no longer require separate policy networks if the memory binding is sufficiently general.

Load-bearing premise

Reconstructing scene-wide DINOv3 embeddings from sequences of randomized low-resolution glimpses produces task- and policy-agnostic representations that transfer to downstream active-vision tasks.
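A rough sketch of what this premise implies for a training step: sample a random-length sequence of viewpoints, then score the model's reconstruction against the teacher's scene-wide embeddings. The sampler bounds (`max_len`, `min_scale`), the constraint that crops stay inside the scene, and the equal weighting of the patch and CLS terms are assumptions for illustration. The split into a patch MSE and a CLS MSE follows the loss components named in the Figure 6 caption.

```python
import numpy as np

def sample_viewpoints(rng, max_len=8, min_scale=0.1):
    """Random-length sequence of (x, y, s) viewpoints.

    Assumption: the scene spans [-1, 1]^2 and a crop covers [c - s, c + s]
    on each axis, so centers are drawn from [-(1 - s), 1 - s] to stay in bounds.
    """
    length = int(rng.integers(1, max_len + 1))
    s = rng.uniform(min_scale, 1.0, size=length)
    x = rng.uniform(-(1.0 - s), 1.0 - s)
    y = rng.uniform(-(1.0 - s), 1.0 - s)
    return np.stack([x, y, s], axis=-1)              # (length, 3)

def distillation_loss(pred_patches, pred_cls, teacher_patches, teacher_cls):
    """Patch MSE plus CLS MSE against the teacher (equal weights assumed)."""
    patch_mse = np.mean((pred_patches - teacher_patches) ** 2)
    cls_mse = np.mean((pred_cls - teacher_cls) ** 2)
    return float(patch_mse + cls_mse)
```

Nothing in this objective refers to a downstream task or to a viewing policy, which is the sense in which the pretraining is described as task- and policy-agnostic. Whether the resulting representations transfer is the empirical question the premise rests on.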

What would settle it

Train a comparable architecture on the same data volume but with fixed full-resolution inputs instead of randomized glimpses, then measure whether it matches or exceeds the glimpse-pretrained CanViT on a held-out active-vision rollout benchmark.

Figures

Figures reproduced from arXiv: 2603.22570 by Audrey Durand, B. Suresh Krishna, Sabrina Du, Yohaï-Eliel Berreby.

**Figure 1.** A CanViT rollout. We consider a high-resolution scene (A). At each timestep 𝑡, CanViT ingests a 128² px glimpse (B, 1st row), a crop extracted at a viewpoint with center (𝑥ₜ, 𝑦ₜ) ∈ [−1, +1]² and scale (zoom level) 𝑠ₜ ∈ (0, 1]. This updates a scene-wide latent representation, the canvas, with which CanViT integrates broad context and fine detail from variable-scale glimpses, extrapolates to unobserved r…

**Figure 2.** CanViT architecture diagram. We adopt a dual-stream structure, equipping a ViT backbone (purple, left-hand columns), which processes localized glimpses (blue), with a canvas (red, right-hand columns), a fine-grained scene-wide spatio-semantic memory. At each timestep 𝑡, a glimpse is extracted from a viewpoint 𝒗ₜ = (𝑥ₜ, 𝑦ₜ, 𝑠ₜ), patchified, and processed through the backbone, alongside a recurrent CLS to…

**Figure 3.** Left: A Canvas Attention round-trip (one Read and one Write) with a zoomed-out, full …

**Figure 4.** Linear-probing benchmark results (frozen CanViT-B). (A) Accuracy–efficiency comparison with prior active models on ADE20K segmentation. (B) Effect of viewing policy and canvas resolution (ADE20K segmentation). (C) Effect of viewing policy (ImageNet-1k classification).

**Figure 5.** Canvas updates within and across glimpses. CanViT sequentially performs multiple Canvas Write Attention operations per glimpse, with each producing a residual that then updates the canvas (…

**Figure 6.** Ablation pretraining loss curves (ImageNet-21k). 2×2 grid: rows = policy (R-IID Policy, F-IID Policy), columns = loss component (Patch MSE, CLS MSE). Bold line: EMA-smoothed (𝛼 = 0.01) over logged per-batch values; faint overlay: pre-EMA values. 11 + 1 variants (baseline + 11 ablations). See also …

**Figure 7.** Analytical FLOP scaling. All curves computed from the formulas in Section G. (A) FLOPs for a single inference step vs. output resolution (canvas grid² tokens for CanViT, image grid² for DINOv3). CanViT's backbone processes a fixed-size glimpse; Canvas Attention cost grows with output resolution. (B) Total FLOPs vs. number of glimpses at fixed output resolution. CanViT cost is linear in 𝑇; AME and AdaGlimp…

**Figure 8.** Real-time latency: CanViT-B vs DINOv3 ViT-B/16. Minimum per-forward-pass latency (𝐵 = 1, with device sync) across scene side lengths. CPU: best of 1–32 thread configurations; CUDA: torch.compile, fp32 and AMP bf16. Faint dots: all raw timings. Solid markers: minimum. Full breakdown in …
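The viewpoint parameterization in the Figure 1 caption (center in [−1, +1]², scale in (0, 1], 128² px output) can be turned into a small cropping routine. The sketch below is an assumption-laden illustration: nearest-neighbor resampling, clipping at the image border, the axis orientation (y = −1 at the top row), and the convention that the crop half-width equals the scale are choices made here, not details confirmed by the paper. `extract_glimpse` is a hypothetical name.

```python
import numpy as np

def extract_glimpse(scene, x, y, s, out=128):
    """Crop a square window from `scene` and resample it to out x out pixels.

    scene: (H, W, C) array. x, y in [-1, 1] give the window center,
    s in (0, 1] gives its half-width in scene-relative units (assumed).
    """
    h, w = scene.shape[:2]
    local = (np.arange(out) + 0.5) / out * 2.0 - 1.0       # sample centers in [-1, 1]
    u = np.clip(x + s * local, -1.0, 1.0)                   # scene-relative columns
    v = np.clip(y + s * local, -1.0, 1.0)                   # scene-relative rows
    cols = np.clip(((u + 1.0) / 2.0 * w).astype(int), 0, w - 1)
    rows = np.clip(((v + 1.0) / 2.0 * h).astype(int), 0, h - 1)
    return scene[rows[:, None], cols[None, :]]              # nearest-neighbor lookup
```

With a 256×256 scene, `extract_glimpse(scene, 0.0, 0.0, 1.0)` subsamples the whole scene by a factor of two, while `extract_glimpse(scene, 0.0, 0.0, 0.5)` returns the central 128×128 region at native resolution. The output size is the same in both cases, so the backbone cost does not depend on zoom.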

Original abstract

Active computer vision promises efficient, biologically plausible perception through sequential, localized glimpses, but lacks scalable general-purpose architectures and pretraining pipelines, leaving Active-Vision Foundation Models (AVFMs) underexplored. We introduce CanViT, the first task- and policy-agnostic AVFM. CanViT uses scene-relative RoPE to bind a retinotopic Vision Transformer backbone and a spatiotopic scene-wide latent workspace, the canvas. Efficient interaction with this high-capacity working memory is supported by Canvas Attention, a novel asymmetric cross-attention mechanism. We decouple thinking (backbone-level) and memory (canvas-level), eliminating canvas-side self-attention and fully-connected layers to achieve fast sequential inference and scalability to high output resolutions. We propose a label-free active vision pretraining scheme, policy-agnostic passive-to-active dense latent distillation: reconstructing scene-wide DINOv3 embeddings from sequences of low-resolution glimpses with randomized locations, zoom levels, and lengths. We pretrain CanViT-B from a random initialization on 13.2 million ImageNet-21k scenes--an order of magnitude more than previous active models--and 1 billion random glimpses, in 166 hours on a single H100. On ADE20K segmentation, a frozen CanViT-B achieves 38.5% mIoU in a single low-resolution glimpse, outperforming the best active model's 27.6% with 20x fewer inference FLOPs as well as its FLOP- or input-matched DINOv3 teacher. Given additional glimpses, CanViT-B reaches 45.9% ADE20K mIoU. On ImageNet-1k classification, CanViT-B also sets a new active-vision state of the art, with 84.5% top-1 accuracy after fine-tuning. CanViT generalizes to longer rollouts, larger scenes, and new policies. Our work narrows the wide gap between passive and active computer vision, demonstrating the potential of task- and policy-agnostic AVFM pretraining.
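The scale figures in the abstract can be put in per-unit terms with some quick arithmetic. The inputs below are the abstract's own numbers; the derived quantities are averages only, since the abstract does not say how glimpses are distributed across scenes or how much of the 166 hours is spent outside the forward and backward passes.

```python
scenes = 13.2e6        # ImageNet-21k scenes used for pretraining
glimpses = 1e9         # random glimpses seen during pretraining
hours = 166            # wall-clock hours on a single H100

glimpses_per_scene = glimpses / scenes             # about 75.8 on average
glimpses_per_second = glimpses / (hours * 3600)    # about 1673 on average

# Absolute mIoU gaps on ADE20K, in percentage points.
gain_over_prior_active = round(38.5 - 27.6, 1)     # 10.9
gain_from_more_glimpses = round(45.9 - 38.5, 1)    # 7.4

print(f"{glimpses_per_scene:.1f} glimpses per scene")
print(f"{glimpses_per_second:.0f} glimpses per second")
print(gain_over_prior_active, gain_from_more_glimpses)
```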

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

CanViT scales active vision with a canvas workspace and random-glimpse distillation, delivering clear efficiency gains on benchmarks, though the policy-agnostic transfer still needs tighter checks.

CanViT introduces a retinotopic backbone tied to a scene-wide canvas latent space through scene-relative RoPE and an asymmetric Canvas Attention mechanism. This setup decouples the thinking part from the memory part to keep sequential inference fast and scalable to higher resolutions without heavy canvas-side computation. The pretraining is a label-free passive-to-active distillation that reconstructs full DINOv3 scene embeddings from sequences of randomized low-resolution glimpses varying in location, zoom, and length. They run this on 13.2 million ImageNet-21k scenes with a billion glimpses, which the abstract puts at an order of magnitude more scenes than previous active models, and they train the base model in 166 hours on one H100.

The reported numbers are the strongest part: a frozen CanViT-B reaches 38.5% mIoU on ADE20K from a single low-res glimpse, beating the prior active best of 27.6% at 20x lower inference FLOPs and also outperforming its FLOP- or input-matched DINOv3 teacher. Extra glimpses lift it to 45.9%, and fine-tuning gives a new active-vision SOTA of 84.5% top-1 on ImageNet-1k. The paper also states it generalizes to longer rollouts, bigger scenes, and new policies.

The soft spot is the link between random-glimpse pretraining and policy-agnostic behavior. Real active vision benchmarks often use structured or learned policies rather than pure randomization, so it is worth checking whether the canvas representations overfit to the random sampling distribution. The abstract claims generalization to new policies, which addresses part of the concern, but explicit ablations isolating randomization effects would make the claim more solid. No error bars or split details appear in the summary, so those need verification in the full experiments.

This work is aimed at groups building efficient sequential perception for robotics or real-time systems. It has enough concrete architecture, scale, and benchmark movement to deserve a serious referee.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces CanViT as the first task- and policy-agnostic Active-Vision Foundation Model. It combines a retinotopic Vision Transformer backbone with a spatiotopic canvas workspace using scene-relative RoPE and a novel asymmetric Canvas Attention mechanism that decouples backbone thinking from canvas memory to enable efficient sequential inference. Pretraining is performed label-free via policy-agnostic passive-to-active dense latent distillation: reconstructing full-scene DINOv3 embeddings from sequences of randomized low-resolution glimpses (varying location, zoom, and length) on 13.2 million ImageNet-21k scenes with 1 billion glimpses. The paper reports that a frozen CanViT-B achieves 38.5% mIoU on ADE20K segmentation from a single low-resolution glimpse (outperforming prior active models at 27.6% with 20x fewer FLOPs and its DINOv3 teacher), reaches 45.9% mIoU with additional glimpses, and attains 84.5% top-1 accuracy on ImageNet-1k after fine-tuning, while generalizing to longer rollouts, larger scenes, and new policies.

Significance. If the results hold, the work is significant for demonstrating a scalable, label-free pretraining pipeline that produces transferable representations for active vision without task-specific supervision. The reported efficiency (20x fewer inference FLOPs) and scale (an order of magnitude more pretraining data than prior active models), together with the architectural decoupling of backbone and canvas, provide concrete evidence that AVFMs can narrow the performance gap with passive models while remaining biologically plausible. The reported generalization to new policies and the evaluation on standard public benchmarks are particular strengths.

major comments (2)

Pretraining description and generalization claims: the central assertion that randomized-glimpse distillation yields task- and policy-agnostic latents is load-bearing for the ADE20K and ImageNet results. The pretraining distribution is purely random, yet downstream benchmarks use structured or learned policies; without ablations that isolate randomization versus structured policies (or report performance under realistic active rollouts), the 38.5% single-glimpse mIoU and the claim of generalization to new policies remain only moderately supported.
Experiments section, ADE20K results: the reported 38.5% mIoU (single glimpse) and 45.9% mIoU (multiple glimpses) are the primary quantitative support for the AVFM claim. However, the absence of error bars, statistical significance tests, exact data splits, and ablations on glimpse count or policy variation leaves the performance advantage over the 27.6% baseline and the DINOv3 teacher only moderately substantiated.
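For readers less familiar with the metric at issue in these comments, mean intersection-over-union can be computed from a confusion matrix as sketched below. This is a generic illustration of the metric, not the paper's evaluation code. The `ignore_index=255` default is a common convention assumed here, and benchmark numbers are normally obtained by accumulating the confusion matrix over the whole validation set rather than averaging per-image scores.

```python
import numpy as np

def mean_iou(pred, target, num_classes, ignore_index=255):
    """Mean IoU over classes that appear in the prediction or the target.

    pred, target: integer label arrays of the same shape, with pred values
    in [0, num_classes). Pixels labeled `ignore_index` in target are skipped.
    """
    pred = np.asarray(pred).ravel().astype(np.int64)
    target = np.asarray(target).ravel().astype(np.int64)
    valid = target != ignore_index
    pred, target = pred[valid], target[valid]
    # conf[i, j] counts pixels whose true class is i and predicted class is j.
    conf = np.bincount(target * num_classes + pred,
                       minlength=num_classes ** 2).reshape(num_classes, num_classes)
    inter = np.diag(conf)
    union = conf.sum(axis=0) + conf.sum(axis=1) - inter
    present = union > 0
    return float((inter[present] / union[present]).mean())
```

For example, with targets `[0, 0, 1, 1]` and predictions `[0, 1, 1, 1]`, class 0 has IoU 1/2 and class 1 has IoU 2/3, so the mean is 7/12.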

minor comments (2)

The definition and implementation details of Canvas Attention (asymmetric cross-attention) would benefit from an explicit equation or pseudocode block to clarify how it eliminates canvas-side self-attention and FC layers.
Figure captions and the main text should consistently report the exact resolution and number of glimpses used for each reported mIoU number to improve reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive evaluation of our work's significance and for the recommendation of major revision. We address each of the major comments in detail below and have revised the manuscript accordingly to strengthen the presentation of our results and claims.

Referee: Pretraining description and generalization claims: the central assertion that randomized-glimpse distillation yields task- and policy-agnostic latents is load-bearing for the ADE20K and ImageNet results. The pretraining distribution is purely random, yet downstream benchmarks use structured or learned policies; without ablations that isolate randomization versus structured policies (or report performance under realistic active rollouts), the 38.5% single-glimpse mIoU and the claim of generalization to new policies remain only moderately supported.

Authors: We agree that isolating the effect of randomization in pretraining would provide stronger evidence for the policy-agnostic claim. The manuscript already demonstrates generalization to new policies through experiments on longer rollouts and different scene sizes, but to directly address this, we will include in the revision an ablation where we pretrain with a non-random policy and compare downstream performance. We will also report results using a realistic active policy for inference on ADE20K, such as one that prioritizes high-uncertainty regions based on the current canvas state. This will better substantiate the 38.5% mIoU result and the generalization claims.

Revision: yes

Referee: Experiments section, ADE20K results: the reported 38.5% mIoU (single glimpse) and 45.9% mIoU (multiple glimpses) are the primary quantitative support for the AVFM claim. However, the absence of error bars, statistical significance tests, exact data splits, and ablations on glimpse count or policy variation leaves the performance advantage over the 27.6% baseline and the DINOv3 teacher only moderately substantiated.

Authors: We acknowledge these omissions in the original submission. In the revised manuscript, we will add error bars from multiple training runs, specify the exact train/validation splits used for ADE20K, and include ablations on the number of glimpses (showing performance curves for 1 to 10 glimpses) as well as different inference policies. We will also perform and report statistical significance tests comparing CanViT to the baselines. These changes will make the performance advantages more rigorously substantiated.

Revision: yes
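One way such a significance test could look is a paired bootstrap over evaluation images, sketched below. This is only an illustration of the general technique and not a procedure described in the paper or the rebuttal. It assumes per-image scores are available for both models on the same images, which is a simplification for mIoU because that metric is usually computed from a dataset-level confusion matrix. The function name and defaults are hypothetical.

```python
import numpy as np

def paired_bootstrap_ci(scores_a, scores_b, n_boot=2000, alpha=0.05, seed=0):
    """Bootstrap confidence interval for the mean paired difference a - b.

    scores_a, scores_b: per-image scores of two models on the same images.
    Returns (mean_difference, lower, upper) for a (1 - alpha) interval.
    """
    rng = np.random.default_rng(seed)
    diff = np.asarray(scores_a, dtype=float) - np.asarray(scores_b, dtype=float)
    # Resample image indices with replacement, keeping each pair together.
    idx = rng.integers(0, diff.size, size=(n_boot, diff.size))
    means = diff[idx].mean(axis=1)
    lo, hi = np.quantile(means, [alpha / 2.0, 1.0 - alpha / 2.0])
    return float(diff.mean()), float(lo), float(hi)
```

If the resulting interval excludes zero, the difference between the two models is unlikely to be an artifact of which images happened to be in the evaluation set. Variation across training seeds is a separate source of uncertainty and would need the multiple runs the authors mention.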

Circularity Check

0 steps flagged

CanViT architecture and random-glimpse distillation evaluated on external benchmarks without definitional reduction

The paper introduces a new retinotopic backbone with scene-relative RoPE and Canvas Attention, then pretrains via reconstruction of external DINOv3 scene embeddings from randomized low-resolution glimpses. Downstream results (ADE20K mIoU, ImageNet top-1) are measured directly against independent baselines and the DINOv3 teacher on standard datasets; no equations or self-citations are shown that force the reported transfer performance to equal the pretraining inputs or fitted parameters by construction. The policy-agnostic claim is an empirical hypothesis tested via generalization to new policies and rollouts rather than a tautology.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

Based on abstract only; ledger reflects explicitly introduced components. No free parameters are named. Axioms are standard transformer assumptions. Invented entities are the novel architectural pieces.

axioms (1)

[standard math] Vision Transformer backbone assumptions hold when adapted to retinotopic glimpses.
Model builds directly on ViT-style processing for local views.

invented entities (2)

Canvas (no independent evidence)
purpose: spatiotopic scene-wide latent workspace for high-capacity working memory
Core memory component decoupled from the backbone for fast sequential inference.

Canvas Attention (no independent evidence)
purpose: asymmetric cross-attention mechanism for efficient interaction with the canvas
Novel mechanism that eliminates canvas-side self-attention and FC layers.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: "policy-agnostic passive-to-active dense latent distillation: reconstructing scene-wide DINOv3 embeddings from sequences of low-resolution glimpses with randomized locations, zoom levels, and lengths"

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Paper passage: "CanViT uses scene-relative RoPE to bind a retinotopic Vision Transformer backbone and a spatiotopic scene-wide latent workspace, the canvas"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

64 extracted references · 64 canonical work pages · 8 internal anchors

[1]

Yamins, D. L. K. et al. Performance-Optimized Hierarchical Models Predict Neural Responses in Higher Visual Cortex. Proceedings of the National Academy of Sciences 111, 8619–8624 (2014)

work page 2014
[2]

Yamins, D. L. K. & DiCarlo, J. J. Using Goal-Driven Deep Learning Models to Understand Sensory Cortex. Nature Neuroscience 19, 356–365 (2016)

work page 2016
[3]

Schrimpf, M. et al. Brain-Score: Which Artificial Neural Network for Object Recognition Is Most Brain- Like?. 407007 (2018) doi:10.1101/407007

work page doi:10.1101/407007 2018
[4]

Zhuang, C. et al. Unsupervised Neural Network Models of the Ventral Visual Stream. Proceedings of the National Academy of Sciences 118, e2014196118 (2021)

work page 2021
[5]

& Richards, B

Bakhtiari, S., Mineault, P., Lillicrap, T., Pack, C. & Richards, B. The Functional Specialization of Visual Cortex Emerges from Training Parallel Pathways with Self-Supervised Predictive Learning. in Advances in Neural Information Processing Systems vol. 34 25164–25178 (Curran Associates, Inc., 2021)

work page 2021
[6]

J., Jones, E

Mehrer, J., Spoerer, C. J., Jones, E. C., Kriegeskorte, N. & Kietzmann, T. C. An Ecologically Motivated Image Dataset for Deep Learning Yields Better Models of Human Vision. Proceedings of the National Academy of Sciences 118, e2011417118 (2021)

work page 2021
[7]

Raugel, J. et al. Disentangling the Factors of Convergence between Brains and Computer Vision Models. (2025) doi:10.48550/arXiv.2508.18226

work page doi:10.48550/arxiv.2508.18226 2025
[8]

Yarbus, A. L. Eye Movements and Vision . (Springer US, Boston, MA, 1967). doi:10.1007/978-1-4899-5379-7

work page doi:10.1007/978-1-4899-5379-7 1967
[9]

& Rothkopf, C

Hoppe, D. & Rothkopf, C. A. Multi-Step Planning of Eye Movements in Visual Search. Scientific Reports 9, 144 (2019)

work page 2019
[10]

Baddeley, A. D. & Hitch, G. Working Memory. vol. 8 47–89 (1974)

work page 1974
[11]

Persistence of Visual Memory for Scenes

Melcher, D. Persistence of Visual Memory for Scenes. Nature 412, 401 (2001)

work page 2001
[12]

Rao, R. P. N. & Ballard, D. H. Predictive Coding in the Visual Cortex: A Functional Interpretation of Some Extra-Classical Receptive-Field Effects. Nature Neuroscience 2, 79–87 (1999)

work page 1999
[13]

Gilbert, C. D. & Li, W. Top-down Influences on Visual Processing. Nature Reviews Neuroscience 14, 350– 363 (2013)

work page 2013
[14]

Kar, K., Kubilius, J., Schmidt, K., Issa, E. B. & DiCarlo, J. J. Evidence That Recurrent Circuits Are Critical to the Ventral Stream's Execution of Core Object Recognition Behavior. Nature Neuroscience 22, 974– 983 (2019)

work page 2019
[15]

& Kavukcuoglu, K

Mnih, V., Heess, N., Graves, A. & Kavukcuoglu, K. Recurrent Models of Visual Attention. in Advances in Neural Information Processing Systems vol. 27 (Curran Associates, Inc., 2014)

work page 2014
[16]

& Kavukcuoglu, K

Ba, J., Mnih, V. & Kavukcuoglu, K. Multiple Object Recognition with Visual Attention. in 3rd Interna- tional Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings (eds. Bengio, Y. & LeCun, Y.) (2015)

work page 2015
[17]

& Cai, J

Ablavatski, A., Lu, S. & Cai, J. Enriched deep recurrent visual attention model for multiple object recog - nition. in 2017 IEEE Winter Conference on Applications of Computer Vision (WACV) 971–978 (2017)

work page 2017
[18]

Elsayed, G., Kornblith, S. & Le, Q. V. Saccader: Improving Accuracy of Hard Attention Models for Vision. in Advances in Neural Information Processing Systems vol. 32 (Curran Associates, Inc., 2019)

work page 2019
[19]

Wang, Y. et al. Glance and Focus: A Dynamic Approach to Reducing Spatial Redundancy in Image Classi- fication. in Proceedings of the 34th International Conference on Neural Information Processing Systems 2432–2444 (Curran Associates Inc., Red Hook, NY, USA, 2020)

work page 2020
[20]

& Memon, N

Papadopoulos, A., Korus, P. & Memon, N. Hard-Attention for Scalable Image Classification. in Advances in Neural Information Processing Systems vol. 34 14694–14707 (Curran Associates, Inc., 2021)

work page 2021
[21]

& Qiu, Q

Liu, J., Bu, Y., Tso, D. & Qiu, Q. Improved Efficiency Based on Learned Saccade and Continuous Scene Reconstruction From Foveated Visual Sampling. in The Twelfth International Conference on Learning Representations (2023)

work page 2023
[22]

& Jazayeri, M

Li, J., Watters, N., Sohn, H. & Jazayeri, M. Modeling Human Eye Movements with Neural Networks in a Maze-Solving Task. in Proceedings of The 1st Gaze Meets ML Workshop 98–112 (PMLR, 2023). 10

work page 2023
[23]

& Trzcinski, T

Pardyl, A., Rypesc, G., Kurzejamski, G., Zielinski, B. & Trzcinski, T. Active Visual Exploration Based on Attention-Map Entropy. in Proceedings of the Thirty-Second International Joint Conference on Artifi- cial Intelligence, IJCAI 2023, 19th-25th August 2023, Macao, SAR, China 1303–1311 (ijcai.org, 2023). doi:10.24963/IJCAI.2023/145

work page doi:10.24963/ijcai.2023/145 2023
[24]

Pardyl, A. et al. AdaGlimpse: Active Visual Exploration with Arbitrary Glimpse Position and Scale. in Computer Vision – ECCV 2024 (eds. Leonardis, A. et al.) 112–129 (Springer Nature Switzerland, Cham, 2025). doi:10.1007/978-3-031-72664-4_7

work page doi:10.1007/978-3-031-72664-4_7 2024
[25]

Wang, Y. et al. Emulating Human-like Adaptive Vision for Efficient and Flexible Machine Visual Percep- tion. Nature Machine Intelligence 7, 1804–1822 (2025)

work page 2025
[26]

& Bashivan, P

Pourrahimi, M. & Bashivan, P. Emergent Brain-like Representations in a Goal-Directed Neural Network Model of Visual Search. 2025.06.06.658387 (2025) doi:10.1101/2025.06.06.658387

work page doi:10.1101/2025.06.06.658387 2025
[27]

Zhou, B. et al. Scene Parsing Through ADE20K Dataset. in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 633–641 (2017)

work page 2017
[28]

Siméoni, O. et al. DINOv3. (2025) doi:10.48550/arXiv.2508.10104

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2508.10104 2025
[29]

He, K. et al. Masked Autoencoders Are Scalable Vision Learners. in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 16000–16009 (2022)

work page 2022
[30]

Oquab, M. et al. DINOv2: Learning Robust Visual Features without Supervision. Transactions on Machine Learning Research (2024)

work page 2024
[31]

Zhang, Y., Ma, X., Bai, Y., Wang, H. & Fu, Y. Accessing Vision Foundation Models via ImageNet-1K. in The Thirteenth International Conference on Learning Representations (2025)

work page 2025
[32]

Lee, J. et al. Set Transformer: A Framework for Attention-based Permutation-Invariant Neural Networks. in Proceedings of the 36th International Conference on Machine Learning 3744–3753 (PMLR, 2019)

work page 2019
[33]

Jaegle, A. et al. Perceiver: General Perception with Iterative Attention. in Proceedings of the 38th Interna- tional Conference on Machine Learning 4651–4664 (PMLR, 2021)

work page 2021
[34]

Jaegle, A. et al. Perceiver IO: A General Architecture for Structured Inputs & Outputs. in International Conference on Learning Representations (2021)

work page 2021
[35]

Jabri, A., Fleet, D. J. & Chen, T. Scalable Adaptive Computation for Iterative Generation. in Proceedings of the 40th International Conference on Machine Learning 14569–14589 (PMLR, 2023)

work page 2023
[36]

Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling

Chung, J., Gulcehre, C., Cho, K. & Bengio, Y. Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling. (2014) doi:10.48550/arXiv.1412.3555

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.1412.3555 2014
[37]

& Schmidhuber, J

Hochreiter, S. & Schmidhuber, J. Long Short-Term Memory. Neural Comput. 9, 1735–1780 (1997)

work page 1997
[38]

& Kaiser, L

Dehghani, M., Gouws, S., Vinyals, O., Uszkoreit, J. & Kaiser, L. Universal Transformers. in International Conference on Learning Representations (2018)

work page 2018
[39]

Yang, L., Lee, K., Nowak, R. D. & Papailiopoulos, D. Looped Transformers Are Better at Learning Learning Algorithms. in The Twelfth International Conference on Learning Representations (2023)

work page 2023
[40]

& Reddi, S

Saunshi, N., Dikkala, N., Li, Z., Kumar, S. & Reddi, S. J. Reasoning with Latent Thoughts: On the Power of Looped Transformers. in The Thirteenth International Conference on Learning Representations (2024)

work page 2024
[41]

Wang, G. et al. Hierarchical Reasoning Model. (2025) doi:10.48550/arXiv.2506.21734

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2506.21734 2025
[42]

Less is More: Recursive Reasoning with Tiny Networks

Jolicoeur-Martineau, A. Less Is More: Recursive Reasoning with Tiny Networks. (2025) doi: 10.48550/ arXiv.2510.04871

work page internal anchor Pith review Pith/arXiv arXiv 2025
[43]

Adaptive Computation Time for Recurrent Neural Networks

Graves, A. Adaptive Computation Time for Recurrent Neural Networks. (2017) doi: 10.48550/ arXiv.1603.08983

work page internal anchor Pith review Pith/arXiv arXiv 2017
[44]

& Blundell, C

Banino, A., Balaguer, J. & Blundell, C. PonderNet: Learning to Ponder. in 8th ICML Workshop on Automated Machine Learning (AutoML) (2021)

work page 2021
[45]

Hao, S. et al. Training Large Language Models to Reason in a Continuous Latent Space. in Second Conference on Language Modeling (2025)

work page 2025
[46]

Geiping, J. et al. Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach. (2025) doi:10.48550/arXiv.2502.05171. 11

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2502.05171 2025
[47]

& Bojanowski, P

Darcet, T., Oquab, M., Mairal, J. & Bojanowski, P. Vision Transformers Need Registers. in The Twelfth International Conference on Learning Representations (2023)

work page 2023
[48]

Dosovitskiy, A. et al. An Image Is Worth 16x16 Words: Transformers for Image Recognition at Scale. in International Conference on Learning Representations (2020)

work page 2020
[49]

Tolman, E. C. Cognitive Maps in Rats and Men. Psychological Review 55, 189–208 (1948)

work page 1948
[50]

Su, J. et al. RoFormer: Enhanced Transformer with Rotary Position Embedding. Neurocomputing 568, 127063 (2024)

work page 2024
[51]

& Yun, S

Heo, B., Park, S., Han, D. & Yun, S. Rotary Position Embedding for Vision Transformer. in Computer Vision – ECCV 2024 (eds. Leonardis, A. et al.) 289–305 (Springer Nature Switzerland, Cham, 2025). doi:10.1007/978-3-031-72684-2_17

work page doi:10.1007/978-3-031-72684-2_17 2024
[52]

Ansel, J. et al. PyTorch 2: Faster Machine Learning Through Dynamic Python Bytecode Transformation and Graph Compilation. in Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2 vol. 2 929–947 (Association for Computing Machinery, New York, NY, USA, 2024)

work page 2024
[53]

Bradbury, J. et al. JAX: composable transformations of Python+NumPy programs. http://github.com/jax- ml/jax (2018)

work page 2018
[54]

Heek, J. et al. Flax: A neural network library and ecosystem for JAX. http://github.com/google/flax (2024)

work page 2024
[55]

Tancik, M. et al. Fourier Features Let Networks Learn High Frequency Functions in Low Dimensional Domains. in Advances in Neural Information Processing Systems vol. 33 7537–7547 (Curran Associates, Inc., 2020)

[56]

Ba, J. L., Kiros, J. R. & Hinton, G. E. Layer Normalization. (2016) doi:10.48550/arXiv.1607.06450

[57]

Russakovsky, O. et al. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision 115, 211–252 (2015)

[58]

Ridnik, T., Ben-Baruch, E., Noy, A. & Zelnik-Manor, L. ImageNet-21K Pretraining for the Masses. (2021) doi:10.48550/arXiv.2104.10972

[59]

Zheng, S. et al. Rethinking Semantic Segmentation From a Sequence-to-Sequence Perspective With Transformers. in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 6881–6890 (2021)

[60]

Zhou, J. et al. iBOT: Image BERT Pre-Training with Online Tokenizer. (2022) doi:10.48550/arXiv.2111.07832

[61]

Assran, M. et al. Self-Supervised Learning From Images With a Joint-Embedding Predictive Architecture. in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 15619–15629 (2023)

[62]

Touvron, H., Cord, M., Sablayrolles, A., Synnaeve, G. & Jégou, H. Going Deeper With Image Transformers. in Proceedings of the IEEE/CVF International Conference on Computer Vision 32–42 (2021)

[63]

Beyer, L., Hénaff, O. J., Kolesnikov, A., Zhai, X. & Oord, A. van den. Are We Done with ImageNet? (2020) doi:10.48550/arXiv.2006.07159

A Interpretability

PCA Visualizations. To visualize grids of high-dimensional glimpse or canvas patch tokens as RGB images (Figure 1, Figure 2, Figure 3, Figure 5), we adopt a similar approach to that of DINOv3 [28], by...

[64]

patches from glimpses 0 through $t$). The decoder processes all $N_{\text{patches}} + 1 = 129$ tokens at every step: 128 patch positions (visible embeddings from the encoder, learnable mask tokens for unseen positions) plus one CLS token. “Head” denotes the per-patch prediction head ($N_{\text{patches}} \cdot \text{Linear}(d_{\text{dec}}, p^2 \cdot \text{num\_classes})$). $C_{\text{AME}}(T) = \sum_{t=0}^{T-1} [\text{PatchEmbed}(g \cdot (t +$ ...


[1]

Yamins, D. L. K. et al. Performance-Optimized Hierarchical Models Predict Neural Responses in Higher Visual Cortex. Proceedings of the National Academy of Sciences 111, 8619–8624 (2014)

[2]

Yamins, D. L. K. & DiCarlo, J. J. Using Goal-Driven Deep Learning Models to Understand Sensory Cortex. Nature Neuroscience 19, 356–365 (2016)

[3]

Schrimpf, M. et al. Brain-Score: Which Artificial Neural Network for Object Recognition Is Most Brain-Like? 407007 (2018) doi:10.1101/407007

[4]

Zhuang, C. et al. Unsupervised Neural Network Models of the Ventral Visual Stream. Proceedings of the National Academy of Sciences 118, e2014196118 (2021)

[5]

Bakhtiari, S., Mineault, P., Lillicrap, T., Pack, C. & Richards, B. The Functional Specialization of Visual Cortex Emerges from Training Parallel Pathways with Self-Supervised Predictive Learning. in Advances in Neural Information Processing Systems vol. 34 25164–25178 (Curran Associates, Inc., 2021)

[6]

Mehrer, J., Spoerer, C. J., Jones, E. C., Kriegeskorte, N. & Kietzmann, T. C. An Ecologically Motivated Image Dataset for Deep Learning Yields Better Models of Human Vision. Proceedings of the National Academy of Sciences 118, e2011417118 (2021)

[7]

Raugel, J. et al. Disentangling the Factors of Convergence between Brains and Computer Vision Models. (2025) doi:10.48550/arXiv.2508.18226

[8]

Yarbus, A. L. Eye Movements and Vision. (Springer US, Boston, MA, 1967). doi:10.1007/978-1-4899-5379-7

[9]

Hoppe, D. & Rothkopf, C. A. Multi-Step Planning of Eye Movements in Visual Search. Scientific Reports 9, 144 (2019)

[10]

Baddeley, A. D. & Hitch, G. Working Memory. vol. 8 47–89 (1974)

[11]

Melcher, D. Persistence of Visual Memory for Scenes. Nature 412, 401 (2001)

[12]

Rao, R. P. N. & Ballard, D. H. Predictive Coding in the Visual Cortex: A Functional Interpretation of Some Extra-Classical Receptive-Field Effects. Nature Neuroscience 2, 79–87 (1999)

[13]

Gilbert, C. D. & Li, W. Top-down Influences on Visual Processing. Nature Reviews Neuroscience 14, 350–363 (2013)

[14]

Kar, K., Kubilius, J., Schmidt, K., Issa, E. B. & DiCarlo, J. J. Evidence That Recurrent Circuits Are Critical to the Ventral Stream's Execution of Core Object Recognition Behavior. Nature Neuroscience 22, 974–983 (2019)

[15]

Mnih, V., Heess, N., Graves, A. & Kavukcuoglu, K. Recurrent Models of Visual Attention. in Advances in Neural Information Processing Systems vol. 27 (Curran Associates, Inc., 2014)

[16]

Ba, J., Mnih, V. & Kavukcuoglu, K. Multiple Object Recognition with Visual Attention. in 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings (eds. Bengio, Y. & LeCun, Y.) (2015)

[17]

Ablavatski, A., Lu, S. & Cai, J. Enriched deep recurrent visual attention model for multiple object recognition. in 2017 IEEE Winter Conference on Applications of Computer Vision (WACV) 971–978 (2017)

[18]

Elsayed, G., Kornblith, S. & Le, Q. V. Saccader: Improving Accuracy of Hard Attention Models for Vision. in Advances in Neural Information Processing Systems vol. 32 (Curran Associates, Inc., 2019)

[19]

Wang, Y. et al. Glance and Focus: A Dynamic Approach to Reducing Spatial Redundancy in Image Classification. in Proceedings of the 34th International Conference on Neural Information Processing Systems 2432–2444 (Curran Associates Inc., Red Hook, NY, USA, 2020)

[20]

Papadopoulos, A., Korus, P. & Memon, N. Hard-Attention for Scalable Image Classification. in Advances in Neural Information Processing Systems vol. 34 14694–14707 (Curran Associates, Inc., 2021)

[21]

Liu, J., Bu, Y., Tso, D. & Qiu, Q. Improved Efficiency Based on Learned Saccade and Continuous Scene Reconstruction From Foveated Visual Sampling. in The Twelfth International Conference on Learning Representations (2023)

[22]

Li, J., Watters, N., Sohn, H. & Jazayeri, M. Modeling Human Eye Movements with Neural Networks in a Maze-Solving Task. in Proceedings of The 1st Gaze Meets ML Workshop 98–112 (PMLR, 2023)

[23]

Pardyl, A., Rypesc, G., Kurzejamski, G., Zielinski, B. & Trzcinski, T. Active Visual Exploration Based on Attention-Map Entropy. in Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence, IJCAI 2023, 19th-25th August 2023, Macao, SAR, China 1303–1311 (ijcai.org, 2023). doi:10.24963/IJCAI.2023/145

[24]

Pardyl, A. et al. AdaGlimpse: Active Visual Exploration with Arbitrary Glimpse Position and Scale. in Computer Vision – ECCV 2024 (eds. Leonardis, A. et al.) 112–129 (Springer Nature Switzerland, Cham, 2025). doi:10.1007/978-3-031-72664-4_7

[25]

Wang, Y. et al. Emulating Human-like Adaptive Vision for Efficient and Flexible Machine Visual Perception. Nature Machine Intelligence 7, 1804–1822 (2025)

[26]

Pourrahimi, M. & Bashivan, P. Emergent Brain-like Representations in a Goal-Directed Neural Network Model of Visual Search. 2025.06.06.658387 (2025) doi:10.1101/2025.06.06.658387

[27]

Zhou, B. et al. Scene Parsing Through ADE20K Dataset. in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 633–641 (2017)

[28]

Siméoni, O. et al. DINOv3. (2025) doi:10.48550/arXiv.2508.10104

[29]

He, K. et al. Masked Autoencoders Are Scalable Vision Learners. in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 16000–16009 (2022)

[30]

Oquab, M. et al. DINOv2: Learning Robust Visual Features without Supervision. Transactions on Machine Learning Research (2024)

[31]

Zhang, Y., Ma, X., Bai, Y., Wang, H. & Fu, Y. Accessing Vision Foundation Models via ImageNet-1K. in The Thirteenth International Conference on Learning Representations (2025)

[32]

Lee, J. et al. Set Transformer: A Framework for Attention-based Permutation-Invariant Neural Networks. in Proceedings of the 36th International Conference on Machine Learning 3744–3753 (PMLR, 2019)

[33]

Jaegle, A. et al. Perceiver: General Perception with Iterative Attention. in Proceedings of the 38th International Conference on Machine Learning 4651–4664 (PMLR, 2021)

[34]

Jaegle, A. et al. Perceiver IO: A General Architecture for Structured Inputs & Outputs. in International Conference on Learning Representations (2021)

[35]

Jabri, A., Fleet, D. J. & Chen, T. Scalable Adaptive Computation for Iterative Generation. in Proceedings of the 40th International Conference on Machine Learning 14569–14589 (PMLR, 2023)

[36]

Chung, J., Gulcehre, C., Cho, K. & Bengio, Y. Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling. (2014) doi:10.48550/arXiv.1412.3555

[37]

Hochreiter, S. & Schmidhuber, J. Long Short-Term Memory. Neural Comput. 9, 1735–1780 (1997)

[38]

Dehghani, M., Gouws, S., Vinyals, O., Uszkoreit, J. & Kaiser, L. Universal Transformers. in International Conference on Learning Representations (2018)

[39]

Yang, L., Lee, K., Nowak, R. D. & Papailiopoulos, D. Looped Transformers Are Better at Learning Learning Algorithms. in The Twelfth International Conference on Learning Representations (2023)

[40]

Saunshi, N., Dikkala, N., Li, Z., Kumar, S. & Reddi, S. J. Reasoning with Latent Thoughts: On the Power of Looped Transformers. in The Thirteenth International Conference on Learning Representations (2024)

[41]

Wang, G. et al. Hierarchical Reasoning Model. (2025) doi:10.48550/arXiv.2506.21734

[42]

Jolicoeur-Martineau, A. Less Is More: Recursive Reasoning with Tiny Networks. (2025) doi:10.48550/arXiv.2510.04871

[43]

Graves, A. Adaptive Computation Time for Recurrent Neural Networks. (2017) doi:10.48550/arXiv.1603.08983

[44]

Banino, A., Balaguer, J. & Blundell, C. PonderNet: Learning to Ponder. in 8th ICML Workshop on Automated Machine Learning (AutoML) (2021)

[45]

Hao, S. et al. Training Large Language Models to Reason in a Continuous Latent Space. in Second Conference on Language Modeling (2025)
