arxiv: 2605.19622 · v1 · pith:AGAZIG4Jnew · submitted 2026-05-19 · 💻 cs.CV

UniRefiner: Teaching Pre-trained ViTs to Self-Dispose Dross via Contrastive Register

Pith reviewed 2026-05-20 06:44 UTC · model grok-4.3

classification 💻 cs.CV
keywords: Vision Transformer · Spurious tokens · Contrastive registers · Semantic segmentation · Model refinement · Dense prediction · Token isolation

The pith

Pre-trained vision transformers can learn to isolate and discard spurious tokens lacking location-aligned semantics through contrastive registers.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that existing definitions of spurious tokens in ViTs are too narrow for dense prediction, so it broadens the definition to any token that fails to encode location-aligned semantics and identifies three fundamental types of such artifacts. It then introduces UniRefiner as a lightweight fine-tuning method that equips pre-trained ViTs with contrastive registers to explicitly separate these artifacts from useful tokens. A dual objective keeps semantic content intact in retained tokens while routing spurious signals into the registers. The approach requires only a few epochs on roughly 5,000 images yet yields large gains on segmentation benchmarks for models up to 8 billion parameters.

Core claim

UniRefiner teaches pre-trained ViTs to self-dispose of spurious tokens by deploying contrastive registers that isolate artifacts through a dual contrastive objective: one term aligns regular image tokens with filtered clean tokens to preserve semantics, while the second aligns register tokens with detected spurious tokens to absorb the unwanted signals.
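
As a rough illustration of the dual objective described above, the sketch below writes each term as an InfoNCE-style loss over cosine similarities. Everything in it is an assumption made for illustration: the function names (`info_nce`, `dual_objective`), the temperature, the equal weighting of the two terms, and the pairing of each token with the target at the same index are not taken from the paper, whose exact loss may differ.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors given as lists."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(queries, targets, temperature=0.1):
    """Mean InfoNCE loss; targets[i] is treated as the positive for queries[i]."""
    total = 0.0
    for i, query in enumerate(queries):
        logits = [cosine(query, t) / temperature for t in targets]
        peak = max(logits)  # subtract the max for a stable log-sum-exp
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denominator - logits[i]
    return total / len(queries)


def dual_objective(image_tokens, clean_targets, register_tokens,
                   spurious_targets, weight=1.0):
    """Hypothetical combination of the two alignment terms."""
    # Term (i): image tokens should match the filtered clean tokens.
    semantic_term = info_nce(image_tokens, clean_targets)
    # Term (ii): register tokens should match the detected spurious tokens.
    register_term = info_nce(register_tokens, spurious_targets)
    return semantic_term + weight * register_term


# Toy 2-D tokens: matched pairs versus deliberately swapped targets.
tokens = [[1.0, 0.0], [0.0, 1.0]]
swapped = [[0.0, 1.0], [1.0, 0.0]]
matched_loss = info_nce(tokens, tokens)
swapped_loss = info_nce(tokens, swapped)
```

On the toy tokens, the matched pairs give a loss close to zero while the swapped targets give a large loss, which is the behavior either term relies on.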

What carries the argument

Contrastive registers paired with a dual alignment objective that isolates spurious tokens without degrading retained semantic content.

If this is right

  • Large pre-trained ViTs such as EVA-CLIP-8B reach 51.9 percent mIoU on ADE20K after refinement (a reported gain of 9.4), exceeding specialized models such as DINOv2 (49.1 percent).
  • Zero-shot segmentation accuracy rises by as much as 22 percent across multiple datasets.
  • The same few-epoch procedure works on diverse ViT scales, including InternViT-6B and EVA-CLIP-8B.
  • Existing foundation models gain usable spatial capability without full retraining from scratch.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The register mechanism could be inserted into other transformer architectures that process sequential data beyond images.
  • If the spurious-token definition proves robust, similar contrastive isolation might reduce compute by pruning tokens earlier in the forward pass.
  • The method suggests a general post-training stage that any large vision model could undergo to improve downstream dense tasks.

Load-bearing premise

Any token that does not encode location-aligned semantics can be treated as a removable spurious artifact and safely separated from the rest of the representation.

What would settle it

Fine-tune a model with UniRefiner and compare it with the original. The claim would be undermined if zero-shot or supervised segmentation accuracy on ADE20K fails to rise, or if the retained tokens lose semantic fidelity relative to the original model.
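
The supervised half of this check comes down to comparing mIoU before and after refinement. A minimal sketch of the metric on flattened label lists follows; the function name, the `ignore_index` default of 255, and the toy labels are illustrative assumptions, not the paper's evaluation code.

```python
def mean_iou(pred, target, num_classes, ignore_index=255):
    """Mean intersection-over-union over the classes that occur in pred or target.

    pred and target are flat lists of integer class ids of equal length;
    predictions are assumed to lie in range(num_classes).
    """
    intersection = [0] * num_classes
    union = [0] * num_classes
    for p, t in zip(pred, target):
        if t == ignore_index:
            continue  # unlabeled pixels do not count
        if p == t:
            intersection[t] += 1
            union[t] += 1
        else:
            # A wrong pixel enlarges the union of both classes involved.
            union[t] += 1
            union[p] += 1
    ious = [i / u for i, u in zip(intersection, union) if u > 0]
    return sum(ious) / len(ious) if ious else float("nan")


# Toy labels standing in for "original" and "refined" model predictions.
target = [0, 1, 1, 1]
before = mean_iou([0, 0, 1, 1], target, num_classes=2)
after = mean_iou([0, 1, 1, 1], target, num_classes=2)
gain = after - before
```

A non-positive `gain` on the real benchmark, computed over the full validation set rather than per image, would be the failure signal described above.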

Figures

Figures reproduced from arXiv: 2605.19622 by Congpei Qiu, Tong Zhang, Wei Ke, Yanhao Wu, Zhaoyu Hu, Zhuotao Tian.

Figure 1
(a) Visualization of the similarity matrix between language prompts and visual tokens. Warmer colors (red) indicate higher cosine …
Figure 2
Statistics of spurious token ratios in different ViT models. We use 1k images sampled from CC3M [26] to calculate the spurious token ratios, including FP, GP, and AH tokens.
Figure 4
Overview of different spurious token categories. We sample a source image and a random reference image to illustrate the characteristics of different spurious tokens, highlighting three key categories: Fixed Pattern Tokens: exhibit high cosine similarity with tokens from irrelevant images; Global Proxy Tokens: exhibit high cosine similarity with other tokens within the same feature map but vary across diff…
Figure 5
Overview of UniRefiner. UniRefiner employs the Spurious Token Filtering pipeline to identify both regular and spurious tokens (Sec 4.2). Then, we add Gaussian-noise patches as register bias to the input image to obtain image and register tokens, which are respectively aligned with filtered regular and spurious tokens via Contrastive Register distillation (Sec 4.3). As distillation progresses, the learned r…
Figure 6
Depth estimation comparison on NYUv2. Predicted depth maps from linear probing on frozen backbone features, shown before and after UniRefiner refinement (suffix “-R”). UniRefiner produces smoother depth maps with fewer spurious artifacts and improved boundary preservation.
Figure 7
Heatmap visualization of cosine similarity between image and text embeddings at high resolution. (left) We compare vanilla and UniRefiner-refined EVA-CLIP-8B, with text requiring both localization and world knowledge; (right) we visualize the refined model on a complex visual scene. We upsample the input image of both visualizations to a high resolution of 1792×1792, leading to feature maps of 128×128 toke…
Figure 9
PCA visualization comparison. We compare the final-layer tokens of vanilla and refined SigLIP2-So400m and EVA-CLIP-8B. After refinement, register tokens on the image boundary absorb spurious tokens and redistribute them to the periphery, separating them from regular image tokens.
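
The Figure 4 caption characterizes Fixed Pattern tokens by their high cosine similarity with tokens from an unrelated image. A minimal sketch of that one test is given below. The function name `flag_fixed_pattern`, the 0.9 threshold, and the max-over-reference rule are hypothetical choices, and the paper's actual filtering pipeline (Sec 4.2) may work differently.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors given as lists."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def flag_fixed_pattern(source_tokens, reference_tokens, threshold=0.9):
    """Indices of source tokens that closely match some token of an unrelated image.

    The threshold is an illustrative assumption, not a value from the paper.
    """
    flagged = []
    for index, token in enumerate(source_tokens):
        best_match = max(cosine(token, ref) for ref in reference_tokens)
        if best_match > threshold:
            flagged.append(index)
    return flagged


# Toy features: token 0 nearly duplicates a token of the reference image.
source = [[1.0, 0.0], [0.0, 1.0]]
reference = [[1.0, 0.01], [-1.0, 0.0]]
suspect_indices = flag_fixed_pattern(source, reference)
```

In this toy case only the first source token is flagged, since the second has no close counterpart in the reference image.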
Original abstract

Representation learning with Vision Transformers (ViTs) has advanced rapidly, yet the utility of large-scale models in spatially sensitive tasks is hindered by spurious tokens. Prior efforts to mitigate this have been limited, often defining these artifacts narrowly, for example, as simple high-norm outliers. We argue that this scope is insufficient. For dense prediction tasks, we posit that any token failing to encode location-aligned semantics should be treated as a spurious artifact. This broader definition reveals a more complex problem, leading us to systematically categorize and characterize three fundamental types of spurious tokens that corrupt spatial representations. Based on this comprehensive diagnosis, we propose UniRefiner, a universal refinement framework that teaches pre-trained ViTs to self-dispose of these artifacts. UniRefiner uses contrastive registers to explicitly isolate and redistribute spurious tokens via a dual objective: (i) it aligns image tokens with filtered regular tokens to preserve semantics, and (ii) it aligns register tokens with detected spurious tokens to capture the spurious signals. Our method requires only a few epochs of fine-tuning on ~5k images to refine diverse ViTs, including massive models like EVA-CLIP-8B and InternViT-6B. Experiments demonstrate consistent and significant improvements: notably, the refined EVA-CLIP-8B achieves 51.9% mIoU on ADE20K (+9.4%), surpassing specialized vision models like DINOv2 (49.1%), while zero-shot segmentation accuracy improves by up to 22%. UniRefiner unlocks the latent spatial potential of existing large-scale foundation models, paving the way for their broader application.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes UniRefiner, a universal refinement framework for pre-trained Vision Transformers that identifies and isolates spurious tokens—defined broadly as any token failing to encode location-aligned semantics—via contrastive register tokens. It categorizes three types of such artifacts and applies a dual contrastive objective: aligning image tokens to filtered regular tokens to preserve semantics while aligning register tokens to detected spurious tokens to capture artifacts. The approach requires only a few epochs of fine-tuning on ~5k images and is demonstrated on models including EVA-CLIP-8B and InternViT-6B, reporting gains such as 51.9% mIoU on ADE20K (+9.4%) and up to 22% improvement in zero-shot segmentation accuracy.

Significance. If validated, the result would be significant for adapting large-scale ViT foundation models to dense prediction tasks with minimal additional data and compute. The broader definition of spurious tokens and the use of contrastive registers to explicitly isolate them without degrading retained semantics could extend the utility of models like EVA-CLIP-8B beyond their original training objectives. The reported outperformance over specialized models such as DINOv2 on ADE20K highlights potential practical impact, though this hinges on confirming the mechanism is not reducible to generic fine-tuning.

major comments (2)
  1. [Method] Method section: The initial detection rule for spurious tokens is not specified independently of the dual contrastive objective. The abstract refers to 'detected spurious tokens' and a location-alignment criterion, but without an explicit, pre-objective criterion or initialization procedure, the construction risks circularity—the loss may simply reinforce an arbitrary partitioning of tokens rather than isolating true artifacts. This directly affects whether the +9.4% mIoU and 22% zero-shot gains can be attributed to the proposed mechanism.
  2. [Experiments] Experiments section: The reported numerical gains (e.g., EVA-CLIP-8B at 51.9% mIoU on ADE20K and zero-shot improvements) provide no details on controls, exact baseline implementations, number of random seeds, statistical significance testing, or ablation of the dual objective components. Without these, it remains unclear whether the improvements exceed what would be obtained from standard fine-tuning on the same ~5k images.
minor comments (2)
  1. [Abstract] Abstract: The three fundamental types of spurious tokens are mentioned but not enumerated or briefly characterized; adding one sentence listing them would improve readability for readers unfamiliar with the diagnosis.
  2. [Method] Notation: The distinction between 'regular tokens,' 'filtered regular tokens,' and 'image tokens' is used without an explicit diagram or equation defining their relationships in the dual objective; a small schematic in §3 would clarify the flow.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments. We address each major point below, clarifying the initialization of spurious token detection and committing to expanded experimental details and controls in the revision.

Point-by-point responses
  1. Referee: [Method] Method section: The initial detection rule for spurious tokens is not specified independently of the dual contrastive objective. The abstract refers to 'detected spurious tokens' and a location-alignment criterion, but without an explicit, pre-objective criterion or initialization procedure, the construction risks circularity—the loss may simply reinforce an arbitrary partitioning of tokens rather than isolating true artifacts. This directly affects whether the +9.4% mIoU and 22% zero-shot gains can be attributed to the proposed mechanism.

    Authors: We agree that an explicit, pre-objective initialization rule should be stated independently to avoid any appearance of circularity. The location-alignment criterion is defined prior to the contrastive objective as a semantic consistency check across spatial positions (detailed in Sec 4.2 of the manuscript, the Spurious Token Filtering pipeline). In the revised version we will add a dedicated paragraph and pseudocode that first applies this criterion to produce an initial mask, after which the dual contrastive loss refines the three artifact categories. This separation ensures the partitioning is not solely driven by the loss. revision: yes

  2. Referee: [Experiments] Experiments section: The reported numerical gains (e.g., EVA-CLIP-8B at 51.9% mIoU on ADE20K and zero-shot improvements) provide no details on controls, exact baseline implementations, number of random seeds, statistical significance testing, or ablation of the dual objective components. Without these, it remains unclear whether the improvements exceed what would be obtained from standard fine-tuning on the same ~5k images.

    Authors: We acknowledge the need for these controls to isolate the contribution of the proposed mechanism. In the revision we will add: (i) precise descriptions of all baseline fine-tuning setups, (ii) results averaged over five random seeds with standard deviations, (iii) paired statistical significance tests (Wilcoxon signed-rank) on the reported metrics, and (iv) a full ablation table that removes each term of the dual objective in turn. These additions will demonstrate that the observed gains exceed those from generic fine-tuning on the same data. revision: yes
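
One point worth checking about item (iii): if the paired test is run on the five seeds of a single configuration, an exact two-sided Wilcoxon signed-rank test cannot return a p-value below 2/2^5 = 0.0625, so it cannot reach the conventional 0.05 level on its own. The sketch below enumerates the null distribution to show this. The function name and the per-seed differences are hypothetical, and the code assumes no zero or tied absolute differences.

```python
from itertools import product


def wilcoxon_exact_p(differences):
    """Exact two-sided Wilcoxon signed-rank p-value.

    Assumes every difference is non-zero and all absolute values are distinct.
    """
    n = len(differences)
    order = sorted(range(n), key=lambda i: abs(differences[i]))
    ranks = [0] * n
    for rank, index in enumerate(order, start=1):
        ranks[index] = rank
    w_plus = sum(r for r, d in zip(ranks, differences) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, differences) if d < 0)
    observed = min(w_plus, w_minus)

    total_rank = n * (n + 1) // 2
    extreme = 0
    # Under the null, each rank is positive or negative with equal probability.
    for signs in product((0, 1), repeat=n):
        plus = sum(rank for rank, s in zip(range(1, n + 1), signs) if s)
        if min(plus, total_rank - plus) <= observed:
            extreme += 1
    return extreme / 2 ** n


# Hypothetical per-seed mIoU gains (refined minus baseline), all positive.
best_case_p = wilcoxon_exact_p([0.8, 1.1, 0.9, 1.3, 1.0])
```

Even when all five seeds improve, `best_case_p` stays above 0.05, so the revision would need more pairs (for example, seeds across several datasets) or a different test to support a significance claim at that level.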

Circularity Check

0 steps flagged

No significant circularity; derivation introduces independent contrastive registers and dual objective

full rationale

The paper posits a broader definition of spurious tokens as those failing location-aligned semantics and proposes UniRefiner with new register tokens plus a dual contrastive objective (align image tokens to filtered regular tokens; align registers to detected spurious tokens). This is implemented via fine-tuning on ~5k images rather than reducing to prior fitted quantities or self-citations by construction. No equations or steps are shown where a 'prediction' or result is equivalent to its inputs (e.g., no fitted parameter renamed as prediction, no uniqueness theorem imported from authors' prior work, no ansatz smuggled via citation). The central mechanism relies on explicit new components and training, making the reported gains (e.g., +9.4% mIoU) attributable to the proposed refinement rather than definitional equivalence. This is the common case of a self-contained proposal against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

Review performed on abstract only; the central claim rests on the new definition of spurious tokens and the effectiveness of the contrastive register mechanism. No explicit free parameters or invented entities beyond the registers are stated.

axioms (1)
  • Domain assumption: Any token failing to encode location-aligned semantics should be treated as a spurious artifact.
    This broadened definition is presented as the starting point for categorizing the three token types.
invented entities (1)
  • Contrastive registers (no independent evidence)
    purpose: To isolate and capture spurious token signals during the dual alignment objective.
    New component introduced by the method to handle the broader spurious-token problem.

pith-pipeline@v0.9.0 · 5840 in / 1403 out tokens · 52178 ms · 2026-05-20T06:44:40.832292+00:00 · methodology

Reference graph

Works this paper leans on

45 extracted references · 45 canonical work pages · 12 internal anchors

  1. [1]

    Self-supervised learning from images with a joint-embedding predictive architecture

    Mahmoud Assran, Quentin Duval, Ishan Misra, Piotr Bojanowski, Pascal Vincent, Michael Rabbat, Yann LeCun, and Nicolas Ballas. Self-supervised learning from images with a joint-embedding predictive architecture. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15619–15629, 2023.

  2. [2]

    Towards in-context scene understanding

    Ivana Balažević, David Steiner, Nikhil Parthasarathy, Relja Arandjelović, and Olivier J. Hénaff. Towards in-context scene understanding. In Advances in Neural Information Processing Systems, 2023.

  3. [3]

    Coco-stuff: Thing and stuff classes in context

    Holger Caesar, Jasper Uijlings, and Vittorio Ferrari. Coco-stuff: Thing and stuff classes in context. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1209–1218, 2018.

  4. [4]

    Emerging properties in self-supervised vision transformers

    Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. In Proceedings of the IEEE/CVF international conference on computer vision, pages 9650–9660, 2021.

  5. [5]

    Vision transformers with self-distilled registers

    Yinjie Chen, Zipeng Yan, Chong Zhou, Bo Dai, and Andrew F Luo. Vision transformers with self-distilled registers. arXiv preprint arXiv:2505.21501, 2025.

  6. [6]

    Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks

    Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 24185–24198, 2024.

  7. [7]

    The cityscapes dataset for semantic urban scene understanding

    Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3213–3223, 2016.

  8. [8]

    Vision Transformers Need Registers

    Timothée Darcet, Maxime Oquab, Julien Mairal, and Piotr Bojanowski. Vision transformers need registers. arXiv preprint arXiv:2309.16588, 2023.

  9. [9]

    Maskclip: Masked self-distillation advances contrastive language-image pretraining

    Xiaoyi Dong, Jianmin Bao, Yinglin Zheng, Ting Zhang, Dongdong Chen, Hao Yang, Ming Zeng, Weiming Zhang, Lu Yuan, Dong Chen, et al. Maskclip: Masked self-distillation advances contrastive language-image pretraining. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10995–11005, 2023.

  10. [10]

    Learning to prompt for open-vocabulary object detection with vision-language model

    Yu Du, Fangyun Wei, Zihe Zhang, Miaojing Shi, Yue Gao, and Guoqi Li. Learning to prompt for open-vocabulary object detection with vision-language model. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 14084–14093, 2022.

  11. [11]

    The pascal visual object classes challenge: A retrospective

    Mark Everingham, SM Ali Eslami, Luc Van Gool, Christopher KI Williams, John Winn, and Andrew Zisserman. The pascal visual object classes challenge: A retrospective. International journal of computer vision, 111(1):98–136, 2015.

  12. [12]

    Mask r-cnn

    Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask r-cnn. In Proceedings of the IEEE international conference on computer vision, pages 2961–2969, 2017.

  13. [13]

    Masked autoencoders are scalable vision learners

    Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 16000–16009, 2022.

  14. [14]

    LRM: Large Reconstruction Model for Single Image to 3D

    Yicong Hong, Kai Zhang, Jiuxiang Gu, Sai Bi, Yang Zhou, Difan Liu, Feng Liu, Kalyan Sunkavalli, Trung Bui, and Hao Tan. Lrm: Large reconstruction model for single image to 3d. arXiv preprint arXiv:2311.04400, 2023.

  15. [15]

    LoRA: Low-rank adaptation of large language models

    Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. In International Conference on Learning Representations, 2022.

  16. [16]

    Scaling up visual and vision-language representation learning with noisy text supervision

    Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. In International conference on machine learning, pages 4904–4916. PMLR,

  17. [17]

    Cribo: Self-supervised learning via cross-image object-level bootstrapping

    Tim Lebailly, Thomas Stegmüller, Behzad Bozorgtabar, Jean-Philippe Thiran, and Tinne Tuytelaars. Cribo: Self-supervised learning via cross-image object-level bootstrapping. In International Conference on Learning Representations, 2024.

  18. [18]

    Visual instruction tuning

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in neural information processing systems, 36:34892–34916, 2023.

  19. [19]

    Tips: Text-image pretraining with spatial awareness

    Kevis-Kokitsi Maninis, Kaifeng Chen, Soham Ghosh, Arjun Karpur, Koert Chen, Ye Xia, Bingyi Cao, Daniel Salz, Guangxing Han, Jan Dlabal, et al. Tips: Text-image pretraining with spatial awareness. arXiv preprint arXiv:2410.16512, 2024.

  20. [20]

    The role of context for object detection and semantic segmentation in the wild

    Roozbeh Mottaghi, Xianjie Chen, Xiaobai Liu, Nam-Gyu Cho, Seong-Whan Lee, Sanja Fidler, Raquel Urtasun, and Alan Yuille. The role of context for object detection and semantic segmentation in the wild. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 891–898, 2014.

  21. [21]

    Representation Learning with Contrastive Predictive Coding

    Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018.

  22. [22]

    DINOv2: Learning Robust Visual Features without Supervision

    Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023.

  23. [23]

    Valentinos Pariza, Mohammadreza Salehi, Gertjan Burghouts, Francesco Locatello, and Yuki M. Asano. Near, far: Patch-ordering enhances vision foundation models’ scene understanding. In International Conference on Learning Representations, 2025.

  24. [24]

    Refining clip’s spatial awareness: A visual-centric perspective

    Congpei Qiu, Yanhao Wu, Wei Ke, Xiuxiu Bai, and Tong Zhang. Refining clip’s spatial awareness: A visual-centric perspective. arXiv preprint arXiv:2504.02328, 2025.

  25. [25]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.

  26. [26]

    Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning

    Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2556–2565, 2018.

  27. [27]

    Indoor segmentation and support inference from rgbd images

    Nathan Silberman, Derek Hoiem, Pushmeet Kohli, and Rob Fergus. Indoor segmentation and support inference from rgbd images. In European conference on computer vision, pages 746–760. Springer, 2012.

  28. [28]

    DINOv3

    Oriane Siméoni, Huy V Vo, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Michaël Ramamonjisoa, et al. Dinov3. arXiv preprint arXiv:2508.10104, 2025.

  29. [29]

    Massive Activations in Large Language Models

    Mingjie Sun, Xinlei Chen, J Zico Kolter, and Zhuang Liu. Massive activations in large language models. arXiv preprint arXiv:2402.17762, 2024.

  30. [30]

    EVA-CLIP: Improved Training Techniques for CLIP at Scale

    Quan Sun, Yuxin Fang, Ledell Wu, Xinlong Wang, and Yue Cao. Eva-clip: Improved training techniques for clip at scale. arXiv preprint arXiv:2303.15389, 2023.

  31. [31]

    SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features

    Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muhammad Ferjad Naeem, Ibrahim Alabdulmohsin, Nikhil Parthasarathy, Talfan Evans, Lucas Beyer, Ye Xia, Basil Mustafa, et al. Siglip 2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense features. arXiv preprint arXiv:2502.14786, 2025.

  32. [32]

    Sinder: Repairing the singular defects of dinov2

    Haoqi Wang, Tong Zhang, and Mathieu Salzmann. Sinder: Repairing the singular defects of dinov2. In European Conference on Computer Vision, pages 20–35. Springer, 2024.

  33. [33]

    Understanding contrastive representation learning through alignment and uniformity on the hypersphere

    Tongzhou Wang and Phillip Isola. Understanding contrastive representation learning through alignment and uniformity on the hypersphere. In International conference on machine learning, pages 9929–9939. PMLR, 2020.

  34. [34]

    Dino is also a semantic guider: Exploiting class-aware affinity for weakly supervised semantic segmentation

    Yuanchen Wu, Xiaoqiang Li, Jide Li, Kequan Yang, Pinpin Zhu, and Shaohua Zhang. Dino is also a semantic guider: Exploiting class-aware affinity for weakly supervised semantic segmentation. In Proceedings of the 32nd ACM International Conference on Multimedia, pages 1389–1397, 2024.

  35. [35]

    Structured 3d latents for scalable and versatile 3d generation

    Jianfeng Xiang, Zelong Lv, Sicheng Xu, Yu Deng, Ruicheng Wang, Bowen Zhang, Dong Chen, Xin Tong, and Jiaolong Yang. Structured 3d latents for scalable and versatile 3d generation. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 21469–21480, 2025.

  36. [36]

    Simmim: A simple framework for masked image modeling

    Zhenda Xie, Zheng Zhang, Yue Cao, Yutong Lin, Jianmin Bao, Zhuliang Yao, Qi Dai, and Han Hu. Simmim: A simple framework for masked image modeling. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9653–9663, 2022.

  37. [37]

    Denoising vision transformers

    Jiawei Yang, Katie Z Luo, Jiefeng Li, Congyue Deng, Leonidas Guibas, Dilip Krishnan, Kilian Q Weinberger, Yonglong Tian, and Yue Wang. Denoising vision transformers. In European Conference on Computer Vision, pages 453–

  38. [38]

    Springer, 2024.

  39. [39]

    Representation Alignment for Generation: Training Diffusion Transformers Is Easier Than You Think

    Sihyun Yu, Sangkyung Kwak, Huiwon Jang, Jongheon Jeong, Jonathan Huang, Jinwoo Shin, and Saining Xie. Representation alignment for generation: Training diffusion transformers is easier than you think. arXiv preprint arXiv:2410.06940, 2024.

  40. [40]

    Sigmoid loss for language image pre-training

    Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. In Proceedings of the IEEE/CVF international conference on computer vision, pages 11975–11986, 2023.

  41. [41]

    Diffusion Transformers with Representation Autoencoders

    Boyang Zheng, Nanye Ma, Shengbang Tong, and Saining Xie. Diffusion transformers with representation autoencoders. arXiv preprint arXiv:2510.11690, 2025.

  42. [42]

    Scene parsing through ade20k dataset

    Bolei Zhou, Hang Zhao, Xavier Puig, Sanja Fidler, Adela Barriuso, and Antonio Torralba. Scene parsing through ade20k dataset. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 633–641,

  43. [43]

    Extract free dense labels from clip

    Chong Zhou, Chen Change Loy, and Bo Dai. Extract free dense labels from clip. In European conference on computer vision, pages 696–712. Springer, 2022.

  44. [44]

    iBOT: Image BERT Pre-Training with Online Tokenizer

    Jinghao Zhou, Chen Wei, Huiyu Wang, Wei Shen, Cihang Xie, Alan Yuille, and Tao Kong. ibot: Image bert pre-training with online tokenizer. arXiv preprint arXiv:2111.07832,

  45. [45]

    InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

    Jinguo Zhu, Weiyun Wang, Zhe Chen, Zhaoyang Liu, Shenglong Ye, Lixin Gu, Hao Tian, Yuchen Duan, Weijie Su, Jie Shao, et al. Internvl3: Exploring advanced training and test-time recipes for open-source multimodal models. arXiv preprint arXiv:2504.10479, 2025.