arxiv: 2511.05168 · v2 · submitted 2025-11-07 · 💻 cs.CV · cs.LG

Another BRIXEL in the Wall: Towards Cheaper Dense Features

Pith reviewed 2026-05-18 00:14 UTC · model grok-4.3

keywords dense features · knowledge distillation · vision transformers · DINO · feature upsampling · efficient inference

The pith

BRIXEL trains vision models to generate higher-resolution dense features from lower-resolution inputs through simple knowledge distillation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents BRIXEL as a straightforward distillation method that addresses the high computational cost of producing fine-grained dense features in vision foundation models like DINOv3. Obtaining such features normally requires very high-resolution inputs, and because transformer self-attention scales quadratically with the number of image tokens, this is expensive and limits practical use. BRIXEL has a student model learn to reproduce its own feature maps at elevated resolutions even when fed lower-resolution images. According to the paper, this yields large gains on downstream dense tasks at fixed resolution and extends to other feature extractors. The result points toward more efficient dense vision pipelines without sacrificing detail.

Core claim

BRIXEL is a knowledge distillation procedure in which a student model is trained to match the higher-resolution feature maps that would be produced by processing the same scene at greater input resolution. The model can then deliver finer dense features from cheaper lower-resolution inputs, and it outperforms the corresponding DINOv3 baselines on downstream tasks when input resolution is held fixed.

What carries the argument

BRIXEL, a self-distillation loop that teaches the student to reconstruct elevated-resolution dense feature maps directly from lower-resolution inputs.

If this is right

  • Outperforms baseline DINOv3 models by large margins on downstream dense tasks at fixed input resolution.
  • Delivers substantial gains when applied to other recent dense-feature extractors beyond the DINO family.
  • Reduces the quadratic compute burden of high-resolution transformer inference while preserving feature quality.
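
To make the quadratic compute argument concrete, here is a minimal sketch of how token count and pairwise attention work grow with input side length. The patch size of 16 is an assumption for illustration, and the count ignores class or register tokens as well as the parts of a transformer that scale linearly with the number of tokens.

```python
def num_tokens(side: int, patch: int = 16) -> int:
    """Patch tokens for a square image; patch=16 is an assumed patch size."""
    return (side // patch) ** 2


def attention_pairs(side: int, patch: int = 16) -> int:
    """Token pairs scored by one self-attention layer, which grows as N^2."""
    n = num_tokens(side, patch)
    return n * n


def attention_cost_ratio(high_side: int, low_side: int, patch: int = 16) -> float:
    """How many times more attention pairs the larger input needs."""
    return attention_pairs(high_side, patch) / attention_pairs(low_side, patch)


for side in (256, 512, 1024):
    print(side, num_tokens(side), attention_pairs(side))
print(attention_cost_ratio(1024, 512))
```

Under these assumptions, halving the side length divides the token count by 4 and the attention pairs by 16, which is the saving a lower-resolution student input is meant to capture.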

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The method may allow dense feature models to run in real time on edge devices by lowering required input size.
  • It could improve scale consistency in applications that combine features from multiple resolutions.
  • Direct measurement of feature-map fidelity to a high-resolution teacher on held-out images would test whether information is truly preserved.
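
One way the fidelity check in the last bullet could be set up is a mean per-location cosine similarity between student and teacher feature maps. This is a sketch under stated assumptions: the metric, the function names, and the nested-list layout `[H][W][C]` are illustrative choices, not something the paper is known to use.

```python
import math


def cosine(u, v):
    """Cosine similarity of two feature vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def mean_cosine_fidelity(student, teacher):
    """Average cosine similarity over all spatial locations.

    Both inputs are feature maps as nested lists [H][W][C] of equal shape.
    """
    sims = [
        cosine(s_vec, t_vec)
        for s_row, t_row in zip(student, teacher)
        for s_vec, t_vec in zip(s_row, t_row)
    ]
    return sum(sims) / len(sims)


# Toy 2x2 maps with 2 channels; the student disagrees at one location.
teacher = [[[1.0, 0.0], [0.0, 1.0]],
           [[1.0, 1.0], [2.0, 0.0]]]
student = [[[1.0, 0.0], [1.0, 0.0]],
           [[1.0, 1.0], [2.0, 0.0]]]
print(mean_cosine_fidelity(teacher, teacher))
print(mean_cosine_fidelity(student, teacher))
```

A score near 1 on held-out images would indicate that the student's maps point in the same directions as the teacher's; it would not by itself show that downstream accuracy is preserved.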

Load-bearing premise

A student model can learn to reproduce accurate higher-resolution feature maps from lower-resolution inputs without introducing artifacts or dropping task-relevant information.

What would settle it

The claim would be undermined if models trained with BRIXEL showed lower accuracy than high-resolution baselines on a dense prediction benchmark such as semantic segmentation or keypoint detection, or if their upsampled feature maps contained visible interpolation artifacts not present in the teacher's.

Figures

Figures reproduced from arXiv: 2511.05168 by Alexander Lappe, Martin A. Giese.

Figure 1: Recent dense feature extractors are able to operate at very high resolution, albeit at great computational cost. We propose …
Figure 2: An overview of BRIXEL. The teacher and student network share both architecture and weights, which are all frozen. During …
Figure 3: Qualitative evaluation of the proposed method. The second and third column display the dense feature maps of DINOv3 when …
Figure 4: We compare the computational cost of generating dense …
Figure 5: Feature maps of the ViT-B model fine-tuned and evaluated at an image size of 480x480. Best viewed on screen using zoom.
Figure 6: Feature maps computed using SigLIP 2 as a backbone instead of DINOv3.
Figure 7: We evaluate the fine-tuned ViT-B BRIXEL model on semantic segmentation on ADE20k at a variety of input image sizes.
Figure 8: Feature maps of the fine-tuned ViT-B model for different input sizes. Different RGB maps across image sizes are due to the PCA …

Original abstract

Vision foundation models achieve strong performance on both global and locally dense downstream tasks. Pretrained on large images, the recent DINOv3 model family is able to produce very fine-grained dense feature maps, enabling state-of-the-art performance. However, computing these feature maps requires the input image to be available at very high resolution, as well as large amounts of compute due to the squared complexity of the transformer architecture. To address these issues, we propose BRIXEL, a simple knowledge distillation approach that has the student learn to reproduce its own feature maps at higher resolution. Despite its simplicity, BRIXEL outperforms the baseline DINOv3 models by large margins on downstream tasks when the resolution is kept fixed. We also apply BRIXEL to other recent dense-feature extractors and show that it yields substantial performance gains across model families. Code and model weights are available at https://github.com/alexanderlappe/BRIXEL.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces BRIXEL, a knowledge-distillation procedure in which a student vision transformer is trained to reproduce higher-resolution feature maps from lower-resolution inputs, using the model's own outputs as targets. The central claim is that this simple objective yields large performance gains over the original DINOv3 checkpoints (and other dense-feature extractors) on downstream dense-prediction tasks when input resolution is held fixed, while reducing compute; code and weights are released.

Significance. If the gains are shown to arise specifically from the high-resolution matching loss rather than from additional optimization steps, BRIXEL would offer a practical route to cheaper dense features for segmentation, detection, and related tasks. The empirical focus, cross-model applicability, and public release of code strengthen the potential impact, though the result remains sensitive to experimental controls on training budget.

major comments (2)
  1. [Experiments] Experiments section: the reported comparisons to DINOv3 baselines do not state whether the baseline checkpoints received an equivalent number of additional training epochs, data passes, or regularization as the BRIXEL student models. If the baselines are the original pretrained weights without continued training, the large-margin improvements on dense tasks could be explained by the extra optimization stage rather than the BRIXEL distillation objective itself.
  2. [Method] Method section (around the distillation loss definition): it is unclear whether the teacher feature maps at target resolution are obtained by feeding the teacher the original high-resolution image or by upsampling lower-resolution features; this choice directly affects whether the student is learning genuine high-frequency information or merely an interpolation artifact, which is load-bearing for the claim that BRIXEL reproduces task-relevant high-resolution structure.
minor comments (2)
  1. [Abstract] Abstract: quantitative effect sizes (e.g., mIoU deltas or AP improvements) and the exact downstream tasks are not stated, making the phrase 'large margins' difficult to evaluate without consulting the tables.
  2. [Figures/Tables] Figure captions and tables: error bars or standard deviations across runs are not visible in the provided excerpts; adding them would strengthen the reliability of the cross-model claims.
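
The error bars requested in the second minor comment could be reported as a mean and sample standard deviation over repeated runs. The sketch below uses only the Python standard library; the helper name and the scores are placeholders for illustration and are not results from the paper.

```python
import statistics


def summarize_runs(scores):
    """Return (mean, sample standard deviation) over per-run scores."""
    mean = statistics.mean(scores)
    std = statistics.stdev(scores) if len(scores) > 1 else 0.0
    return mean, std


# Made-up numbers for illustration only, NOT results from the paper.
placeholder_scores = [40.0, 41.0, 42.0]
mean, std = summarize_runs(placeholder_scores)
print(f"{mean:.1f} ± {std:.1f}")
```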

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our work. We address each major comment in detail below, providing clarifications and committing to revisions that directly respond to the concerns raised.

  1. Referee: [Experiments] Experiments section: the reported comparisons to DINOv3 baselines do not state whether the baseline checkpoints received an equivalent number of additional training epochs, data passes, or regularization as the BRIXEL student models. If the baselines are the original pretrained weights without continued training, the large-margin improvements on dense tasks could be explained by the extra optimization stage rather than the BRIXEL distillation objective itself.

    Authors: We agree this control is important for isolating the contribution of the distillation objective. The baselines reported in the manuscript are the publicly released DINOv3 checkpoints evaluated at the fixed lower resolution with no additional training. To address the concern, we will add a control experiment in the revised manuscript: we will continue training the original DINOv3 model for the same number of epochs and on the same data using its standard self-supervised objective, then evaluate the resulting checkpoint on the downstream dense tasks. This will allow direct comparison to BRIXEL and demonstrate that the observed gains arise from the high-resolution feature matching rather than additional optimization alone. revision: yes

  2. Referee: [Method] Method section (around the distillation loss definition): it is unclear whether the teacher feature maps at target resolution are obtained by feeding the teacher the original high-resolution image or by upsampling lower-resolution features; this choice directly affects whether the student is learning genuine high-frequency information or merely an interpolation artifact, which is load-bearing for the claim that BRIXEL reproduces task-relevant high-resolution structure.

    Authors: We thank the referee for highlighting this ambiguity. In BRIXEL the target feature maps are generated by feeding the original high-resolution images to the teacher model (the unmodified foundation model). The student receives only the corresponding low-resolution inputs and is trained to match the teacher's high-resolution outputs via the distillation loss. This is not an upsampling of low-resolution features; the student must therefore recover genuine high-frequency structure present in the teacher's high-resolution computation. We will revise the method section to state this procedure explicitly, add a clarifying sentence in the loss definition, and include a simple diagram illustrating the high-resolution teacher path versus the low-resolution student path. revision: yes
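
The two paths described in this response can be sketched with toy stand-ins. Everything here is an assumption for illustration: the "backbone" is just a patch mean, the student's upsampling step is an untrained nearest-neighbour placeholder for whatever trainable component the method uses, and mean squared error stands in for the unspecified distillation loss.

```python
def average_pool(image, factor):
    """Average-pool a 2D image (list of rows) by an integer factor."""
    h, w = len(image), len(image[0])
    return [
        [
            sum(image[y * factor + dy][x * factor + dx]
                for dy in range(factor) for dx in range(factor)) / (factor * factor)
            for x in range(w // factor)
        ]
        for y in range(h // factor)
    ]


def toy_backbone(image, patch=2):
    """Toy stand-in for a frozen backbone: one scalar feature per patch."""
    return average_pool(image, patch)


def nearest_upsample(grid, factor):
    """Untrained placeholder for the student's route to the target grid size."""
    return [
        [grid[y // factor][x // factor] for x in range(len(grid[0]) * factor)]
        for y in range(len(grid) * factor)
    ]


def mse(a, b):
    """Mean squared error between two equally shaped 2D grids."""
    diffs = [(x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb)]
    return sum(diffs) / len(diffs)


high_res = [[float(x) for x in range(8)] for _ in range(8)]  # 8x8 ramp image

target = toy_backbone(high_res)            # teacher path: high-res in, 4x4 grid out
low_res = average_pool(high_res, 2)        # student input: 4x4 image
native = toy_backbone(low_res)             # student's native grid: 2x2
prediction = nearest_upsample(native, 2)   # brought to the 4x4 target grid
loss = mse(prediction, target)             # gap a trained student would shrink
print(len(target), len(native), loss)
```

The nonzero loss of the untrained placeholder reflects detail that is present in the teacher's high-resolution computation but absent from naive upsampling of low-resolution features, which is the distinction the referee asked about.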

Circularity Check

0 steps flagged

No significant circularity in empirical distillation method

The paper introduces BRIXEL as an empirical knowledge-distillation training procedure in which a student network is optimized to match higher-resolution feature maps produced by a teacher (itself or DINOv3). All performance claims are supported by downstream-task experiments at fixed input resolution rather than any closed-form derivation, uniqueness theorem, or parameter fit that reduces to the input data by construction. No equations, self-citations, or ansatzes are presented that would make the reported gains tautological; the method remains self-contained as a standard training recipe whose validity is assessed externally via benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The approach relies on standard knowledge distillation assumptions and likely includes hyperparameters for loss weighting or temperature that are not detailed in the abstract; no new physical entities or ad-hoc axioms are introduced beyond the distillation objective itself.

pith-pipeline@v0.9.0 · 5459 in / 1065 out tokens · 27430 ms · 2026-05-18T00:14:26.455813+00:00 · methodology
