UniRefiner: Teaching Pre-trained ViTs to Self-Dispose Dross via Contrastive Register
Pith reviewed 2026-05-20 06:44 UTC · model grok-4.3
The pith
Pre-trained vision transformers can learn to isolate and discard spurious tokens lacking location-aligned semantics through contrastive registers.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
UniRefiner teaches pre-trained ViTs to self-dispose of spurious tokens by deploying contrastive registers that isolate artifacts through a dual contrastive objective: one term aligns regular image tokens with filtered clean tokens to preserve semantics, while the second aligns register tokens with detected spurious tokens to absorb the unwanted signals.
What carries the argument
Contrastive registers paired with a dual alignment objective that isolates spurious tokens without degrading retained semantic content.
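The dual alignment objective can be illustrated with a small sketch. This is a minimal example under stated assumptions, not the paper's implementation: it assumes an InfoNCE-style loss over cosine similarities, a one-to-one pairing between each token and its target, and a scalar `weight` balancing the two terms. The names `info_nce` and `dual_objective`, the temperature value, and the pairing rule are all hypothetical. The source states only that one term aligns image tokens with filtered clean tokens and the other aligns register tokens with detected spurious tokens.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)

def info_nce(queries, targets, temperature=0.1):
    """Mean InfoNCE loss (assumed form): queries[i] should be similar to
    targets[i] and dissimilar from every other target."""
    total = 0.0
    for i, q in enumerate(queries):
        logits = [cosine(q, t) / temperature for t in targets]
        peak = max(logits)  # subtract the max for a stable log-sum-exp
        log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_z - logits[i]
    return total / len(queries)

def dual_objective(image_tokens, clean_targets,
                   register_tokens, spurious_targets, weight=1.0):
    """Sum of the two alignment terms; `weight` is a hypothetical knob."""
    semantic_term = info_nce(image_tokens, clean_targets)      # preserve semantics
    absorb_term = info_nce(register_tokens, spurious_targets)  # absorb artifacts
    return semantic_term + weight * absorb_term
```

With perfectly matched pairs each term approaches zero, and a term grows when a token resembles another token's target more than its own.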
If this is right
- Large pre-trained ViTs such as EVA-CLIP-8B reach 51.9 percent mIoU on ADE20K after refinement, exceeding specialized models like DINOv2.
- Zero-shot segmentation accuracy rises by as much as 22 percent across multiple datasets.
- The same few-epoch procedure works on diverse ViT scales, including InternViT-6B and EVA-CLIP-8B.
- Existing foundation models gain usable spatial capability without full retraining from scratch.
Where Pith is reading between the lines
- The register mechanism could be inserted into other transformer architectures that process sequential data beyond images.
- If the spurious-token definition proves robust, similar contrastive isolation might reduce compute by pruning tokens earlier in the forward pass.
- The method suggests a general post-training stage that any large vision model could undergo to improve downstream dense tasks.
Load-bearing premise
Any token that does not encode location-aligned semantics can be treated as a removable spurious artifact and safely separated from the rest of the representation.
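One way to make this premise testable is to fix a detection rule before any training takes place. The sketch below is purely illustrative and is not the paper's criterion: it assumes tokens sit on an H x W grid and flags a token when its mean cosine similarity to its 4-connected neighbours drops below a threshold. The function name `initial_spurious_mask` and the threshold value are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)

def initial_spurious_mask(grid, threshold=0.5):
    """grid: H x W nested list of token feature vectors.
    Returns an H x W nested list of booleans, True where a token's mean
    similarity to its 4-connected neighbours is below `threshold`."""
    height, width = len(grid), len(grid[0])
    mask = [[False] * width for _ in range(height)]
    for r in range(height):
        for c in range(width):
            sims = [
                cosine(grid[r][c], grid[nr][nc])
                for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1))
                if 0 <= nr < height and 0 <= nc < width
            ]
            # A token with no neighbours (1 x 1 grid) is never flagged.
            mask[r][c] = bool(sims) and sum(sims) / len(sims) < threshold
    return mask
```

A local-smoothness rule like this would also flag legitimate small objects, which is exactly the kind of failure the premise has to rule out. Because the mask is computed before any loss is applied, it can be inspected on its own.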
What would settle it
Fine-tune a model with UniRefiner and compare it with the original. The claim fails if zero-shot or supervised segmentation accuracy on ADE20K does not rise, or if the retained tokens lose semantic fidelity relative to the original model.
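The before/after comparison rests on mean intersection-over-union (mIoU). The following sketch shows the metric on flat label lists. It is a simplification: benchmark implementations usually accumulate intersections and unions over the whole dataset and handle an ignore label, and the function name `mean_iou` is chosen here for illustration.

```python
def mean_iou(pred, target, num_classes):
    """Mean IoU over classes that appear in pred or target.
    pred and target are equal-length lists of integer class labels."""
    ious = []
    for cls in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == cls and t == cls)
        union = sum(1 for p, t in zip(pred, target) if p == cls or t == cls)
        if union > 0:  # skip classes absent from both prediction and target
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0
```

Running the same function on predictions from the original and the refined model, with the same labels, gives the two numbers that the comparison needs.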
Original abstract
Representation learning with Vision Transformers (ViTs) has advanced rapidly, yet the utility of large-scale models in spatially sensitive tasks is hindered by spurious tokens. Prior efforts to mitigate this have been limited, often defining these artifacts narrowly, for example, as simple high-norm outliers. We argue that this scope is insufficient. For dense prediction tasks, we posit that any token failing to encode location-aligned semantics should be treated as a spurious artifact. This broader definition reveals a more complex problem, leading us to systematically categorize and characterize three fundamental types of spurious tokens that corrupt spatial representations. Based on this comprehensive diagnosis, we propose UniRefiner, a universal refinement framework that teaches pre-trained ViTs to self-dispose of these artifacts. UniRefiner uses contrastive registers to explicitly isolate and redistribute spurious tokens via a dual objective: (i) it aligns image tokens with filtered regular tokens to preserve semantics, and (ii) it aligns register tokens with detected spurious tokens to capture the spurious signals. Our method requires only a few epochs of fine-tuning on ~5k images to refine diverse ViTs, including massive models like EVA-CLIP-8B and InternViT-6B. Experiments demonstrate consistent and significant improvements: notably, the refined EVA-CLIP-8B achieves 51.9% mIoU on ADE20K (+9.4%), surpassing specialized vision models like DINOv2 (49.1%), while zero-shot segmentation accuracy improves by up to 22%. UniRefiner unlocks the latent spatial potential of existing large-scale foundation models, paving the way for their broader application.
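The headline numbers imply a baseline that the abstract does not state directly. The arithmetic below works it out under the assumption that "+9.4%" means absolute percentage points of mIoU; the abstract does not say whether the gain is absolute or relative, so the relative reading is computed alongside for contrast. Variable names are illustrative.

```python
refined_miou = 51.9   # EVA-CLIP-8B after refinement, from the abstract
gain = 9.4            # reported improvement, from the abstract
dinov2_miou = 49.1    # DINOv2 reference, from the abstract

# Reading 1 (assumed): the gain is in absolute mIoU points.
baseline_miou = round(refined_miou - gain, 1)               # 42.5
margin_over_dinov2 = round(refined_miou - dinov2_miou, 1)   # 2.8

# Reading 2: the gain is relative to the baseline.
relative_reading_baseline = round(refined_miou / (1 + gain / 100), 1)  # 47.4
```

Under the absolute reading the unrefined model trails DINOv2 by 6.6 points and leads it by 2.8 after refinement. Under the relative reading the baseline would already sit within about two points of DINOv2.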
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes UniRefiner, a universal refinement framework for pre-trained Vision Transformers that identifies and isolates spurious tokens—defined broadly as any token failing to encode location-aligned semantics—via contrastive register tokens. It categorizes three types of such artifacts and applies a dual contrastive objective: aligning image tokens to filtered regular tokens to preserve semantics while aligning register tokens to detected spurious tokens to capture artifacts. The approach requires only a few epochs of fine-tuning on ~5k images and is demonstrated on models including EVA-CLIP-8B and InternViT-6B, reporting gains such as 51.9% mIoU on ADE20K (+9.4%) and up to 22% improvement in zero-shot segmentation accuracy.
Significance. If validated, the result would be significant for adapting large-scale ViT foundation models to dense prediction tasks with minimal additional data and compute. The broader definition of spurious tokens and the use of contrastive registers to explicitly isolate them without degrading retained semantics could extend the utility of models like EVA-CLIP-8B beyond their original training objectives. The reported outperformance over specialized models such as DINOv2 on ADE20K highlights potential practical impact, though this hinges on confirming the mechanism is not reducible to generic fine-tuning.
major comments (2)
- [Method] Method section: The initial detection rule for spurious tokens is not specified independently of the dual contrastive objective. The abstract refers to 'detected spurious tokens' and a location-alignment criterion, but without an explicit, pre-objective criterion or initialization procedure, the construction risks circularity—the loss may simply reinforce an arbitrary partitioning of tokens rather than isolating true artifacts. This directly affects whether the +9.4% mIoU and 22% zero-shot gains can be attributed to the proposed mechanism.
- [Experiments] Experiments section: The reported numerical gains (e.g., EVA-CLIP-8B at 51.9% mIoU on ADE20K and zero-shot improvements) provide no details on controls, exact baseline implementations, number of random seeds, statistical significance testing, or ablation of the dual objective components. Without these, it remains unclear whether the improvements exceed what would be obtained from standard fine-tuning on the same ~5k images.
minor comments (2)
- [Abstract] Abstract: The three fundamental types of spurious tokens are mentioned but not enumerated or briefly characterized; adding one sentence listing them would improve readability for readers unfamiliar with the diagnosis.
- [Method] Notation: The distinction between 'regular tokens,' 'filtered regular tokens,' and 'image tokens' is used without an explicit diagram or equation defining their relationships in the dual objective; a small schematic in §3 would clarify the flow.
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive comments. We address each major point below, clarifying the initialization of spurious token detection and committing to expanded experimental details and controls in the revision.
- Referee: [Method] Method section: The initial detection rule for spurious tokens is not specified independently of the dual contrastive objective. The abstract refers to 'detected spurious tokens' and a location-alignment criterion, but without an explicit, pre-objective criterion or initialization procedure, the construction risks circularity: the loss may simply reinforce an arbitrary partitioning of tokens rather than isolating true artifacts. This directly affects whether the +9.4% mIoU and 22% zero-shot gains can be attributed to the proposed mechanism.
  Authors: We agree that an explicit, pre-objective initialization rule should be stated independently to avoid any appearance of circularity. The location-alignment criterion is defined prior to the contrastive objective as a semantic consistency check across spatial positions (detailed in Section 3.2 of the manuscript). In the revised version we will add a dedicated paragraph and pseudocode that first applies this criterion to produce an initial mask, after which the dual contrastive loss refines the three artifact categories. This separation ensures the partitioning is not solely driven by the loss.
  Revision: yes
- Referee: [Experiments] Experiments section: The reported numerical gains (e.g., EVA-CLIP-8B at 51.9% mIoU on ADE20K and zero-shot improvements) provide no details on controls, exact baseline implementations, number of random seeds, statistical significance testing, or ablation of the dual objective components. Without these, it remains unclear whether the improvements exceed what would be obtained from standard fine-tuning on the same ~5k images.
  Authors: We acknowledge the need for these controls to isolate the contribution of the proposed mechanism. In the revision we will add: (i) precise descriptions of all baseline fine-tuning setups, (ii) results averaged over five random seeds with standard deviations, (iii) paired statistical significance tests (Wilcoxon signed-rank) on the reported metrics, and (iv) a full ablation table that removes each term of the dual objective in turn. These additions will demonstrate that the observed gains exceed those from generic fine-tuning on the same data.
  Revision: yes
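The promised seed-level reporting can be sketched with the standard library alone. The example below computes the mean and standard deviation of paired differences and an exact two-sided Wilcoxon signed-rank p-value by enumerating sign patterns. It assumes no zero differences and no tied magnitudes. The function name `wilcoxon_signed_rank` and the numbers in the example are placeholders, not results from the paper.

```python
from itertools import product
from statistics import mean, stdev

def wilcoxon_signed_rank(diffs):
    """Exact two-sided Wilcoxon signed-rank test for a small sample of
    paired differences (assumes no zeros and no tied magnitudes).
    Returns (W+, p-value)."""
    n = len(diffs)
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0] * n
    for rank, idx in enumerate(order, start=1):
        ranks[idx] = rank
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    # Under the null hypothesis every sign pattern is equally likely,
    # so enumerate all 2^n patterns over ranks 1..n.
    centre = n * (n + 1) / 4
    null_sums = [
        sum(r for r, s in zip(range(1, n + 1), signs) if s)
        for signs in product((0, 1), repeat=n)
    ]
    observed = abs(w_plus - centre)
    extreme = sum(1 for w in null_sums if abs(w - centre) >= observed)
    return w_plus, extreme / len(null_sums)

# Placeholder scores for five seeds; NOT numbers from the paper.
baseline = [1.0, 2.0, 3.0, 4.0, 5.0]
refined = [2.5, 4.0, 5.5, 7.0, 8.5]
diffs = [r - b for r, b in zip(refined, baseline)]
print(mean(diffs), stdev(diffs), wilcoxon_signed_rank(diffs))
```

One consequence worth noting: with only five pairs there are 32 sign patterns, so the smallest attainable two-sided p-value is 2/32 = 0.0625. If the pairing is over five seeds alone, the test cannot reach the conventional 0.05 level. Pairing across datasets or models as well would give it more room.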
Circularity Check
No significant circularity; derivation introduces independent contrastive registers and dual objective
The paper posits a broader definition of spurious tokens as those failing location-aligned semantics and proposes UniRefiner with new register tokens plus a dual contrastive objective (align image tokens to filtered regular tokens; align registers to detected spurious tokens). This is implemented via fine-tuning on ~5k images rather than reducing to prior fitted quantities or self-citations by construction. No equations or steps are shown where a 'prediction' or result is equivalent to its inputs (e.g., no fitted parameter renamed as prediction, no uniqueness theorem imported from authors' prior work, no ansatz smuggled via citation). The central mechanism relies on explicit new components and training, making the reported gains (e.g., +9.4% mIoU) attributable to the proposed refinement rather than definitional equivalence. This is the common case of a self-contained proposal against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Any token failing to encode location-aligned semantics should be treated as a spurious artifact.
invented entities (1)
- Contrastive registers: no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel [unclear]
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Passage: UniRefiner uses contrastive registers to explicitly isolate and redistribute spurious tokens via a dual objective: (i) it aligns image tokens with filtered regular tokens to preserve semantics, and (ii) it aligns register tokens with detected spurious tokens to capture the spurious signals.
- IndisputableMonolith/Foundation/RealityFromDistinction.lean: reality_from_one_distinction [unclear]
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Passage: We systematically identify and characterize three fundamental types of spurious tokens in ViTs that impair spatial representation quality.
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.