CanViT: Toward Active-Vision Foundation Models
Pith reviewed 2026-05-21 10:27 UTC · model grok-4.3
The pith
CanViT is a task- and policy-agnostic Vision Transformer that builds scene representations from sequential low-resolution glimpses via a retinotopic backbone and spatiotopic canvas.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By decoupling thinking inside the retinotopic backbone from memory inside the canvas, and by pretraining via policy-agnostic dense latent distillation from DINOv3, CanViT produces representations that transfer directly to active-vision benchmarks. A frozen CanViT-B reaches 38.5 percent mIoU on ADE20K from one low-resolution glimpse, rising to 45.9 percent with further glimpses; after fine-tuning, CanViT-B reaches 84.5 percent top-1 on ImageNet-1k.
What carries the argument
The canvas, a high-capacity spatiotopic latent workspace that receives asymmetric cross-attention from the retinotopic backbone and stores scene-wide embeddings without self-attention or feed-forward layers on the canvas side.
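The page gives no equation for Canvas Attention, so the following is only a minimal sketch of one possible reading: the backbone reads from the canvas by cross-attention, the canvas is updated by cross-attention over backbone tokens, and no self-attention or feed-forward layer ever runs on the canvas side. The names `cross_attend` and `canvas_step`, the identity projections, and the single-head form are all assumptions made for illustration, not details from the paper.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, keys_values):
    """Single-head attention in which each query token reads from keys_values.

    Learned projections are omitted (identity); this is an assumption of the
    sketch, not something stated in the source.
    """
    dim = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim)
                  for k in keys_values]
        weights = softmax(scores)
        out.append([sum(w * kv[d] for w, kv in zip(weights, keys_values))
                    for d in range(dim)])
    return out


def add(xs, ys):
    """Element-wise residual addition of two token lists."""
    return [[a + b for a, b in zip(x, y)] for x, y in zip(xs, ys)]


def canvas_step(backbone_tokens, canvas_tokens):
    """One hypothetical read/write exchange between backbone and canvas."""
    # Read: backbone tokens query the canvas.
    backbone_tokens = add(backbone_tokens,
                          cross_attend(backbone_tokens, canvas_tokens))
    # Backbone self-attention and MLP would run here; omitted in this sketch.
    # Write: canvas tokens query the backbone, with a residual update only.
    # No self-attention or feed-forward layer touches the canvas.
    canvas_tokens = add(canvas_tokens,
                        cross_attend(canvas_tokens, backbone_tokens))
    return backbone_tokens, canvas_tokens


backbone = [[1.0, 0.0], [0.0, 1.0]]      # 2 glimpse tokens
canvas = [[0.0, 0.0] for _ in range(4)]  # 4 canvas cells, initially empty
new_backbone, new_canvas = canvas_step(backbone, canvas)
```

In this toy setup the canvas can hold more tokens than the backbone processes, and the only per-step cost on the canvas side is the cross-attention itself, which is the property the efficiency argument relies on.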
If this is right
- Additional glimpses raise ADE20K mIoU from 38.5 percent to 45.9 percent without retraining.
- The same pretrained weights set a new active-vision state of the art of 84.5 percent top-1 on ImageNet-1k after fine-tuning.
- The model generalizes to longer rollouts, larger scenes, and policies different from those seen in pretraining.
- Inference cost remains low because canvas-side self-attention and fully-connected layers are eliminated.
Where Pith is reading between the lines
- The separation of retinotopic and spatiotopic representations could be tested on robotic navigation tasks where camera motion must be planned on the fly.
- Because pretraining uses only unlabeled images, the same pipeline could be applied to video streams or egocentric datasets without new annotations.
- Scaling the canvas size while keeping the backbone fixed might further close the remaining gap to passive foundation models.
- The approach suggests that active-vision models may no longer require separate policy networks if the memory binding is sufficiently general.
Load-bearing premise
Reconstructing scene-wide DINOv3 embeddings from sequences of randomized low-resolution glimpses produces task- and policy-agnostic representations that transfer to downstream active-vision tasks.
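To make the premise concrete, here is a minimal sketch of what sampling a randomized glimpse sequence could look like. The source states only that locations, zoom levels, and sequence lengths are randomized; the uniform distributions, the scale range, the maximum length, and the function names `sample_glimpse`, `sample_rollout`, and `crop_box` are hypothetical choices for illustration.

```python
import random


def sample_glimpse(rng, min_scale=0.25, max_scale=1.0):
    """Return (cx, cy, scale) in normalized scene coordinates in [0, 1].

    scale is the glimpse side length as a fraction of the scene side.
    The ranges are assumptions, not values from the paper.
    """
    scale = rng.uniform(min_scale, max_scale)
    half = scale / 2.0
    # Keep the whole glimpse window inside the scene.
    cx = rng.uniform(half, 1.0 - half)
    cy = rng.uniform(half, 1.0 - half)
    return cx, cy, scale


def sample_rollout(rng, min_len=1, max_len=8):
    """Sample a glimpse sequence of random length (bounds are assumptions)."""
    length = rng.randint(min_len, max_len)
    return [sample_glimpse(rng) for _ in range(length)]


def crop_box(glimpse, scene_h, scene_w):
    """Convert a normalized glimpse to a pixel box (x0, y0, x1, y1)."""
    cx, cy, scale = glimpse
    half = scale / 2.0
    return ((cx - half) * scene_w, (cy - half) * scene_h,
            (cx + half) * scene_w, (cy + half) * scene_h)


rng = random.Random(0)
rollout = sample_rollout(rng)
boxes = [crop_box(g, scene_h=512, scene_w=512) for g in rollout]
```

Each box would then be resized to the same low input resolution before entering the backbone, so a larger `scale` means a coarser view of a wider region.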
What would settle it
Training a comparable architecture on the same data volume but with fixed full-resolution inputs instead of randomized glimpses, then measuring whether it matches or outperforms the glimpse-based CanViT on a held-out active-vision rollout benchmark.
Original abstract
Active computer vision promises efficient, biologically plausible perception through sequential, localized glimpses, but lacks scalable general-purpose architectures and pretraining pipelines, leaving Active-Vision Foundation Models (AVFMs) underexplored. We introduce CanViT, the first task- and policy-agnostic AVFM. CanViT uses scene-relative RoPE to bind a retinotopic Vision Transformer backbone and a spatiotopic scene-wide latent workspace, the canvas. Efficient interaction with this high-capacity working memory is supported by Canvas Attention, a novel asymmetric cross-attention mechanism. We decouple thinking (backbone-level) and memory (canvas-level), eliminating canvas-side self-attention and fully-connected layers to achieve fast sequential inference and scalability to high output resolutions. We propose a label-free active vision pretraining scheme, policy-agnostic passive-to-active dense latent distillation: reconstructing scene-wide DINOv3 embeddings from sequences of low-resolution glimpses with randomized locations, zoom levels, and lengths. We pretrain CanViT-B from a random initialization on 13.2 million ImageNet-21k scenes--an order of magnitude more than previous active models--and 1 billion random glimpses, in 166 hours on a single H100. On ADE20K segmentation, a frozen CanViT-B achieves 38.5% mIoU in a single low-resolution glimpse, outperforming the best active model's 27.6% with 20x fewer inference FLOPs as well as its FLOP- or input-matched DINOv3 teacher. Given additional glimpses, CanViT-B reaches 45.9% ADE20K mIoU. On ImageNet-1k classification, CanViT-B also sets a new active-vision state of the art, with 84.5% top-1 accuracy after fine-tuning. CanViT generalizes to longer rollouts, larger scenes, and new policies. Our work narrows the wide gap between passive and active computer vision, demonstrating the potential of task- and policy-agnostic AVFM pretraining.
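The headline figures in the abstract imply a few derived quantities that are easy to check. The short calculation below uses only numbers quoted in the abstract; the derived ratios are back-of-the-envelope values, and the throughput figure assumes the full 166 hours were spent processing those glimpses.

```python
scenes = 13.2e6   # ImageNet-21k scenes used for pretraining
glimpses = 1e9    # random glimpses seen during pretraining
hours = 166       # wall-clock pretraining time on one H100

glimpses_per_scene = glimpses / scenes
glimpses_per_second = glimpses / (hours * 3600)

gain_vs_best_active = 38.5 - 27.6      # single-glimpse ADE20K mIoU, points
gain_from_more_glimpses = 45.9 - 38.5  # additional glimpses, points

print(f"{glimpses_per_scene:.1f} glimpses per scene")       # 75.8
print(f"{glimpses_per_second:.0f} glimpses per second")     # 1673
print(f"{gain_vs_best_active:.1f} points over best active")  # 10.9
print(f"{gain_from_more_glimpses:.1f} points from rollouts")  # 7.4
```

The glimpses-per-scene value is a plain ratio; it does not by itself say how long individual rollouts were or how many passes were made over the data.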
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces CanViT as the first task- and policy-agnostic Active-Vision Foundation Model. It combines a retinotopic Vision Transformer backbone with a spatiotopic canvas workspace using scene-relative RoPE and a novel asymmetric Canvas Attention mechanism that decouples backbone thinking from canvas memory to enable efficient sequential inference. Pretraining is performed label-free via policy-agnostic passive-to-active dense latent distillation: reconstructing full-scene DINOv3 embeddings from sequences of randomized low-resolution glimpses (varying location, zoom, and length) on 13.2 million ImageNet-21k scenes with 1 billion glimpses. The paper reports that a frozen CanViT-B achieves 38.5% mIoU on ADE20K segmentation from a single low-resolution glimpse (outperforming the best prior active model's 27.6% with 20x fewer inference FLOPs, as well as its FLOP- or input-matched DINOv3 teacher), reaches 45.9% mIoU with additional glimpses, and attains 84.5% top-1 accuracy on ImageNet-1k after fine-tuning, while generalizing to longer rollouts, larger scenes, and new policies.
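The binding role of scene-relative RoPE can be illustrated with a small sketch. It assumes that each glimpse patch is assigned the scene coordinates of its center and that rotary embeddings are applied per axis; the functions `patch_centers`, `rope_2d`, and `score`, the 4-dimensional vectors, and the frequency value are hypothetical and not taken from the paper. What the sketch does show reliably is the standard rotary property: the attention score depends only on the relative offset between the two positions.

```python
import math


def patch_centers(cx, cy, scale, grid):
    """Scene coordinates of the patch centers of a grid x grid glimpse."""
    centers = []
    for row in range(grid):
        for col in range(grid):
            x = cx + scale * ((col + 0.5) / grid - 0.5)
            y = cy + scale * ((row + 0.5) / grid - 0.5)
            centers.append((x, y))
    return centers


def rotate_pair(a, b, angle):
    """Rotate the 2-D point (a, b) by angle radians."""
    c, s = math.cos(angle), math.sin(angle)
    return a * c - b * s, a * s + b * c


def rope_2d(vec, pos, freq=10.0):
    """Rotate a 4-dim vector: first pair by x, second pair by y.

    A single frequency is used to keep the sketch short (an assumption).
    """
    x, y = pos
    a, b = rotate_pair(vec[0], vec[1], freq * x)
    c, d = rotate_pair(vec[2], vec[3], freq * y)
    return [a, b, c, d]


def score(q, q_pos, k, k_pos):
    """Dot-product attention score after rotary position encoding."""
    rq, rk = rope_2d(q, q_pos), rope_2d(k, k_pos)
    return sum(u * v for u, v in zip(rq, rk))


q, k = [1.0, 2.0, 3.0, 4.0], [0.5, -1.0, 2.0, 0.3]
s1 = score(q, (0.2, 0.3), k, (0.6, 0.1))
s2 = score(q, (0.3, 0.5), k, (0.7, 0.3))  # both positions shifted equally
centers = patch_centers(0.5, 0.5, 1.0, 2)
```

Because `s1` and `s2` agree, a backbone token and a canvas cell that sit at the same scene location interact the same way wherever the glimpse happens to fall, which is one way to read the "binding" described in the summary.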
Significance. If the results hold, the work is significant for demonstrating a scalable, label-free pretraining pipeline that produces transferable representations for active vision without task-specific supervision. The reported efficiency (20x fewer inference FLOPs) and scale (an order of magnitude more pretraining data than prior active models), together with the architectural decoupling of backbone and canvas, provide concrete evidence that AVFMs can narrow the performance gap with passive models while remaining biologically plausible. The reported generalization to new policies and the benchmark numbers on standard datasets are particular strengths.
major comments (2)
- Pretraining description and generalization claims: the central assertion that randomized-glimpse distillation yields task- and policy-agnostic latents is load-bearing for the ADE20K and ImageNet results. The pretraining distribution is purely random, yet downstream benchmarks use structured or learned policies; without ablations that isolate randomization versus structured policies (or report performance under realistic active rollouts), the 38.5% single-glimpse mIoU and the claim of generalization to new policies remain only moderately supported.
- Experiments section, ADE20K results: the reported 38.5% mIoU (single glimpse) and 45.9% mIoU (multiple glimpses) are the primary quantitative support for the AVFM claim. However, the absence of error bars, statistical significance tests, exact data splits, and ablations on glimpse count or policy variation leaves the performance advantage over the 27.6% baseline and the DINOv3 teacher only moderately substantiated.
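Since the comments above turn on mIoU figures, a minimal sketch of the metric may help fix what is being compared. It computes per-class intersection over union on flat label lists and averages over classes that occur; the function name `mean_iou` and the `ignore_index=255` convention are assumptions, and benchmark implementations typically accumulate counts over the whole validation set rather than per image.

```python
def mean_iou(pred, target, num_classes, ignore_index=255):
    """Mean intersection-over-union over classes present in pred or target.

    pred and target are flat lists of integer class labels of equal length.
    Pixels whose target equals ignore_index are skipped (assumed convention).
    """
    intersections = [0] * num_classes
    unions = [0] * num_classes
    for p, t in zip(pred, target):
        if t == ignore_index:
            continue
        if p == t:
            intersections[t] += 1
            unions[t] += 1
        else:
            # A mismatch enlarges the union of both classes involved.
            unions[t] += 1
            unions[p] += 1
    ious = [i / u for i, u in zip(intersections, unions) if u > 0]
    return sum(ious) / len(ious) if ious else float("nan")


# Toy example with two classes: IoU is 1/2 for class 0 and 2/3 for class 1.
toy_miou = mean_iou(pred=[0, 0, 1, 1], target=[0, 1, 1, 1], num_classes=2)
```

Reporting this value together with the glimpse count and input resolution used to produce `pred` would address the reproducibility concern directly.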
minor comments (2)
- The definition and implementation details of Canvas Attention (asymmetric cross-attention) would benefit from an explicit equation or pseudocode block to clarify how it eliminates canvas-side self-attention and FC layers.
- Figure captions and the main text should consistently report the exact resolution and number of glimpses used for each reported mIoU number to improve reproducibility.
Simulated Author's Rebuttal
We thank the referee for the positive evaluation of our work's significance and for the recommendation of major revision. We address each of the major comments in detail below and have revised the manuscript accordingly to strengthen the presentation of our results and claims.
Point-by-point responses
- Referee: Pretraining description and generalization claims: the central assertion that randomized-glimpse distillation yields task- and policy-agnostic latents is load-bearing for the ADE20K and ImageNet results. The pretraining distribution is purely random, yet downstream benchmarks use structured or learned policies; without ablations that isolate randomization versus structured policies (or report performance under realistic active rollouts), the 38.5% single-glimpse mIoU and the claim of generalization to new policies remain only moderately supported.
  Authors: We agree that isolating the effect of randomization in pretraining would provide stronger evidence for the policy-agnostic claim. The manuscript already demonstrates generalization to new policies through experiments on longer rollouts and different scene sizes, but to directly address this, we will include in the revision an ablation where we pretrain with a non-random policy and compare downstream performance. We will also report results using a realistic active policy for inference on ADE20K, such as one that prioritizes high-uncertainty regions based on the current canvas state. This will better substantiate the 38.5% mIoU result and the generalization claims.
  Revision: yes
- Referee: Experiments section, ADE20K results: the reported 38.5% mIoU (single glimpse) and 45.9% mIoU (multiple glimpses) are the primary quantitative support for the AVFM claim. However, the absence of error bars, statistical significance tests, exact data splits, and ablations on glimpse count or policy variation leaves the performance advantage over the 27.6% baseline and the DINOv3 teacher only moderately substantiated.
  Authors: We acknowledge these omissions in the original submission. In the revised manuscript, we will add error bars from multiple training runs, specify the exact train/validation splits used for ADE20K, and include ablations on the number of glimpses (showing performance curves for 1 to 10 glimpses) as well as different inference policies. We will also perform and report statistical significance tests comparing CanViT to the baselines. These changes will make the performance advantages more rigorously substantiated.
  Revision: yes
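The error bars and significance tests promised in this response could take many forms; the sketch below shows one simple option, under stated assumptions. It reports mean and sample standard deviation across runs and a paired bootstrap confidence interval on the per-image score difference between two models. The function names and the choice of a paired bootstrap are illustrative assumptions, and the numbers in the usage lines are placeholders, not results from the paper.

```python
import random
import statistics


def mean_and_std(scores):
    """Mean and sample standard deviation across independent runs."""
    return statistics.mean(scores), statistics.stdev(scores)


def paired_bootstrap_ci(a, b, rng, resamples=2000, alpha=0.05):
    """Percentile bootstrap interval for mean(a - b) on paired scores.

    a and b hold per-image scores of two models on the same images.
    """
    diffs = [x - y for x, y in zip(a, b)]
    means = []
    for _ in range(resamples):
        sample = [rng.choice(diffs) for _ in diffs]
        means.append(sum(sample) / len(sample))
    means.sort()
    lower = means[int((alpha / 2) * resamples)]
    upper = means[int((1 - alpha / 2) * resamples) - 1]
    return lower, upper


# Placeholder values only, chosen to make the arithmetic easy to follow.
run_mean, run_std = mean_and_std([1.0, 2.0, 3.0])
ci = paired_bootstrap_ci([2.0, 3.0, 4.0], [1.0, 2.0, 3.0], random.Random(0))
```

An interval that excludes zero would support a claimed advantage; with only a handful of training runs, the across-run standard deviation is a coarse estimate and should be read as such.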
Circularity Check
CanViT architecture and random-glimpse distillation evaluated on external benchmarks without definitional reduction
Full rationale
The paper introduces a new retinotopic backbone with scene-relative RoPE and Canvas Attention, then pretrains via reconstruction of external DINOv3 scene embeddings from randomized low-resolution glimpses. Downstream results (ADE20K mIoU, ImageNet top-1) are measured directly against independent baselines and the DINOv3 teacher on standard datasets; no equations or self-citations are shown that force the reported transfer performance to equal the pretraining inputs or fitted parameters by construction. The policy-agnostic claim is an empirical hypothesis tested via generalization to new policies and rollouts rather than a tautology.
Axiom & Free-Parameter Ledger
axioms (1)
- standard math: Vision Transformer backbone assumptions hold when adapted to retinotopic glimpses.
invented entities (2)
- Canvas: no independent evidence
- Canvas Attention: no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: policy-agnostic passive-to-active dense latent distillation: reconstructing scene-wide DINOv3 embeddings from sequences of low-resolution glimpses with randomized locations, zoom levels, and lengths
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: CanViT uses scene-relative RoPE to bind a retinotopic Vision Transformer backbone and a spatiotopic scene-wide latent workspace, the canvas
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.