Another BRIXEL in the Wall: Towards Cheaper Dense Features
Pith reviewed 2026-05-18 00:14 UTC · model grok-4.3
The pith
BRIXEL trains vision models to generate higher-resolution dense features from lower-resolution inputs through simple knowledge distillation.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
BRIXEL is a knowledge distillation procedure in which a student model is trained to match the higher-resolution feature maps that would be produced by processing the same scene at greater input resolution. The student can then deliver finer dense features from cheaper lower-resolution inputs, and it outperforms the corresponding DINOv3 baselines on downstream tasks when the input resolution is held fixed.
What carries the argument
BRIXEL, a self-distillation loop that teaches the student to reconstruct elevated-resolution dense feature maps directly from lower-resolution inputs.
If this is right
- Outperforms baseline DINOv3 models by large margins on downstream dense tasks at fixed input resolution.
- Delivers substantial gains when applied to other recent dense-feature extractors beyond the DINO family.
- Reduces the quadratic compute burden of high-resolution transformer inference while preserving feature quality.
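The quadratic-cost point can be made concrete with a little arithmetic. The sketch below assumes a plain vision transformer with square patches of side 16 and global self-attention, and it counts only pairwise token interactions per attention layer. The function names and example resolutions are illustrative and are not taken from the paper.

```python
def num_tokens(height: int, width: int, patch: int = 16) -> int:
    """Number of patch tokens for an image; patch size 16 is an assumption."""
    if height % patch or width % patch:
        raise ValueError("image sides must be multiples of the patch size")
    return (height // patch) * (width // patch)


def attention_pairs(height: int, width: int, patch: int = 16) -> int:
    """Token-pair count per attention layer, which grows as tokens squared."""
    n = num_tokens(height, width, patch)
    return n * n


low = attention_pairs(256, 256)
high = attention_pairs(512, 512)
print(num_tokens(256, 256), num_tokens(512, 512))  # 256 1024
print(high // low)  # 16
```

Under these assumptions, doubling the side length quadruples the token count and multiplies the pairwise attention work by sixteen, which is the burden a lower-resolution student path would avoid.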
Where Pith is reading between the lines
- The method may allow dense feature models to run in real time on edge devices by lowering required input size.
- It could improve scale consistency in applications that combine features from multiple resolutions.
- Direct measurement of feature-map fidelity to a high-resolution teacher on held-out images would test whether information is truly preserved.
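One way the fidelity measurement suggested above could be set up is as a per-location cosine similarity between student and teacher feature maps, averaged over the grid. The sketch below is a minimal illustration on nested lists; the metric choice, the function names, and the toy values are assumptions and do not come from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mean_fidelity(student, teacher):
    """Average per-location cosine similarity between two H x W x C maps."""
    same_rows = len(student) == len(teacher)
    if not same_rows or any(len(s) != len(t) for s, t in zip(student, teacher)):
        raise ValueError("feature maps must share the same spatial grid")
    scores = [
        cosine(s, t)
        for s_row, t_row in zip(student, teacher)
        for s, t in zip(s_row, t_row)
    ]
    return sum(scores) / len(scores)


# Toy 1 x 2 grid with 2 channels; values are placeholders, not paper data.
teacher = [[[1.0, 0.0], [0.0, 1.0]]]
student = [[[1.0, 0.0], [1.0, 0.0]]]
print(mean_fidelity(student, teacher))  # 0.5
```

On held-out images, a score close to 1 at every location would indicate that the student's map tracks the teacher's, while low scores concentrated near object boundaries would point to lost fine detail.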
Load-bearing premise
A student model can learn to reproduce accurate higher-resolution feature maps from lower-resolution inputs without introducing artifacts or dropping task-relevant information.
What would settle it
The claim would be called into question if models trained with BRIXEL showed lower accuracy than high-resolution baselines on a dense prediction benchmark such as semantic segmentation or keypoint detection, or if their upsampled feature maps contained visible interpolation artifacts not present in the teacher's.
Original abstract
Vision foundation models achieve strong performance on both global and locally dense downstream tasks. Pretrained on large images, the recent DINOv3 model family is able to produce very fine-grained dense feature maps, enabling state-of-the-art performance. However, computing these feature maps requires the input image to be available at very high resolution, as well as large amounts of compute due to the squared complexity of the transformer architecture. To address these issues, we propose BRIXEL, a simple knowledge distillation approach that has the student learn to reproduce its own feature maps at higher resolution. Despite its simplicity, BRIXEL outperforms the baseline DINOv3 models by large margins on downstream tasks when the resolution is kept fixed. We also apply BRIXEL to other recent dense-feature extractors and show that it yields substantial performance gains across model families. Code and model weights are available at https://github.com/alexanderlappe/BRIXEL.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces BRIXEL, a knowledge-distillation procedure in which a student vision transformer is trained to reproduce higher-resolution feature maps from lower-resolution inputs, using the model's own outputs as targets. The central claim is that this simple objective yields large performance gains over the original DINOv3 checkpoints (and other dense-feature extractors) on downstream dense-prediction tasks when input resolution is held fixed, while reducing compute; code and weights are released.
Significance. If the gains are shown to arise specifically from the high-resolution matching loss rather than from additional optimization steps, BRIXEL would offer a practical route to cheaper dense features for segmentation, detection, and related tasks. The empirical focus, cross-model applicability, and public release of code strengthen the potential impact, though the result remains sensitive to experimental controls on training budget.
major comments (2)
- [Experiments] Experiments section: the reported comparisons to DINOv3 baselines do not state whether the baseline checkpoints received an equivalent number of additional training epochs, data passes, or regularization as the BRIXEL student models. If the baselines are the original pretrained weights without continued training, the large-margin improvements on dense tasks could be explained by the extra optimization stage rather than the BRIXEL distillation objective itself.
- [Method] Method section (around the distillation loss definition): it is unclear whether the teacher feature maps at target resolution are obtained by feeding the teacher the original high-resolution image or by upsampling lower-resolution features; this choice directly affects whether the student is learning genuine high-frequency information or merely an interpolation artifact, which is load-bearing for the claim that BRIXEL reproduces task-relevant high-resolution structure.
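The distinction raised in the Method comment can be illustrated on a toy grid. The sketch below downsamples a checkerboard by block averaging and then upsamples it again by nearest-neighbour repetition, showing that the round trip cannot restore the fine pattern. It is an analogy on raw values, not the paper's pipeline, and it assumes the grid sides are multiples of the pooling factor; all function names are hypothetical.

```python
def avg_pool(img, k):
    """Downsample a 2D grid by averaging non-overlapping k x k blocks."""
    h, w = len(img), len(img[0])
    return [
        [
            sum(img[i + di][j + dj] for di in range(k) for dj in range(k)) / (k * k)
            for j in range(0, w, k)
        ]
        for i in range(0, h, k)
    ]


def upsample_nearest(img, k):
    """Upsample a 2D grid by repeating each value k times along both axes."""
    return [[v for v in row for _ in range(k)] for row in img for _ in range(k)]


def mean_abs_error(a, b):
    """Mean absolute difference between two grids of equal shape."""
    diffs = [abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb)]
    return sum(diffs) / len(diffs)


# 4 x 4 checkerboard: the highest spatial frequency the grid can hold.
board = [[float((i + j) % 2) for j in range(4)] for i in range(4)]
round_trip = upsample_nearest(avg_pool(board, 2), 2)
print(round_trip[0])  # [0.5, 0.5, 0.5, 0.5]
print(mean_abs_error(board, round_trip))  # 0.5
```

A target built this way carries no information beyond the low-resolution signal, which is why the source of the teacher's feature maps matters for the claim.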
minor comments (2)
- [Abstract] Abstract: quantitative effect sizes (e.g., mIoU deltas or AP improvements) and the exact downstream tasks are not stated, making the phrase 'large margins' difficult to evaluate without consulting the tables.
- [Figures/Tables] Figure captions and tables: error bars or standard deviations across runs are not visible in the provided excerpts; adding them would strengthen the reliability of the cross-model claims.
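As a small sketch of the reporting the referee asks for, the snippet below summarizes a metric over several runs as mean and sample standard deviation using Python's standard library. The scores are placeholders for hypothetical seeds and are not results from the paper; the helper name is illustrative.

```python
import statistics


def summarize(scores):
    """Return (mean, sample standard deviation) for a list of run scores."""
    mean = statistics.mean(scores)
    std = statistics.stdev(scores) if len(scores) > 1 else 0.0
    return mean, std


# Placeholder scores for three hypothetical seeds; not results from the paper.
runs = [40.0, 42.0, 44.0]
mean, std = summarize(runs)
print(f"{mean:.1f} +/- {std:.1f}")  # 42.0 +/- 2.0
```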
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our work. We address each major comment in detail below, providing clarifications and committing to revisions that directly respond to the concerns raised.
- Referee: [Experiments] Experiments section: the reported comparisons to DINOv3 baselines do not state whether the baseline checkpoints received an equivalent number of additional training epochs, data passes, or regularization as the BRIXEL student models. If the baselines are the original pretrained weights without continued training, the large-margin improvements on dense tasks could be explained by the extra optimization stage rather than the BRIXEL distillation objective itself.
  Authors: We agree this control is important for isolating the contribution of the distillation objective. The baselines reported in the manuscript are the publicly released DINOv3 checkpoints evaluated at the fixed lower resolution with no additional training. To address the concern, we will add a control experiment in the revised manuscript: we will continue training the original DINOv3 model for the same number of epochs and on the same data using its standard self-supervised objective, then evaluate the resulting checkpoint on the downstream dense tasks. This will allow direct comparison to BRIXEL and demonstrate that the observed gains arise from the high-resolution feature matching rather than additional optimization alone.
  Revision: yes
- Referee: [Method] Method section (around the distillation loss definition): it is unclear whether the teacher feature maps at target resolution are obtained by feeding the teacher the original high-resolution image or by upsampling lower-resolution features; this choice directly affects whether the student is learning genuine high-frequency information or merely an interpolation artifact, which is load-bearing for the claim that BRIXEL reproduces task-relevant high-resolution structure.
  Authors: We thank the referee for highlighting this ambiguity. In BRIXEL the target feature maps are generated by feeding the original high-resolution images to the teacher model (the unmodified foundation model). The student receives only the corresponding low-resolution inputs and is trained to match the teacher's high-resolution outputs via the distillation loss. This is not an upsampling of low-resolution features; the student must therefore recover genuine high-frequency structure present in the teacher's high-resolution computation. We will revise the method section to state this procedure explicitly, add a clarifying sentence in the loss definition, and include a simple diagram illustrating the high-resolution teacher path versus the low-resolution student path.
  Revision: yes
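The two paths described in the rebuttal can be sketched as a shape check plus a feature-matching term. The snippet below assumes a patch size of 16, a 512-pixel teacher input, and a 256-pixel student input, and it implements only a plain mean absolute (L1) difference between feature maps; any further loss terms, and the way the student produces the larger grid, are left out because they are not specified here. All names and values are hypothetical.

```python
def feature_grid(height, width, patch=16):
    """Spatial size of the feature map; patch size 16 is an assumption."""
    return height // patch, width // patch


def l1_feature_loss(teacher_map, student_map):
    """Mean absolute difference between two H x W x C feature maps."""
    t_shape = (len(teacher_map), len(teacher_map[0]), len(teacher_map[0][0]))
    s_shape = (len(student_map), len(student_map[0]), len(student_map[0][0]))
    if t_shape != s_shape:
        raise ValueError(f"shape mismatch: teacher {t_shape}, student {s_shape}")
    total = sum(
        abs(t - s)
        for t_row, s_row in zip(teacher_map, student_map)
        for t_vec, s_vec in zip(t_row, s_row)
        for t, s in zip(t_vec, s_vec)
    )
    return total / (t_shape[0] * t_shape[1] * t_shape[2])


# Teacher path: high-resolution image. Student path: low-resolution image,
# whose native grid is smaller than the target grid it is asked to match.
print(feature_grid(512, 512))  # (32, 32)
print(feature_grid(256, 256))  # (16, 16)

# Toy 1 x 2 x 2 maps; values are placeholders, not paper data.
teacher = [[[1.0, 2.0], [3.0, 4.0]]]
student = [[[1.0, 1.0], [3.0, 2.0]]]
print(l1_feature_loss(teacher, student))  # 0.75
```

The shape check makes the key requirement explicit: the loss is only defined once the student's output has been brought to the teacher's grid size.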
Circularity Check
No significant circularity in empirical distillation method
The paper introduces BRIXEL as an empirical knowledge-distillation training procedure in which a student network is optimized to match higher-resolution feature maps produced by a teacher (itself or DINOv3). All performance claims are supported by downstream-task experiments at fixed input resolution rather than any closed-form derivation, uniqueness theorem, or parameter fit that reduces to the input data by construction. No equations, self-citations, or ansatzes are presented that would make the reported gains tautological; the method remains self-contained as a standard training recipe whose validity is assessed externally via benchmarks.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Passage: "BRIXEL... simple knowledge distillation approach that has the student learn to reproduce its own feature maps at higher resolution... $L_1(\theta) := \mathbb{E}\left[\lVert T(x) - S_\theta(x^{-}) \rVert_1\right] + \lambda_{\text{edge}} L_{\text{edge}} + \lambda_{\text{spectral}} L_{\text{spectral}}$"
- IndisputableMonolith/Foundation/AlexanderDuality.lean, alexander_duality_circle_linking (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Passage: "We evaluate... on semantic segmentation... monocular depth estimation... across 42 model comparisons"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
A. Appendix
Figure 7. We evaluate the fine-tuned ViT-B BRIXEL model on semantic segmentation on ADE20k at a variety of input image sizes. BRIXEL outperforms the DINOv3 baseline at all image sizes, sho...