pith. machine review for the scientific record.

arxiv: 2512.20157 · v2 · submitted 2025-12-23 · 💻 cs.CV

Recognition: no theorem link

SigLino: Efficient Multi-Teacher Distillation for Agglomerative Vision Foundation Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 20:19 UTC · model grok-4.3

classification 💻 cs.CV
keywords multi-teacher distillation · vision foundation models · knowledge distillation · mixture of experts · grounding vision-language models · token-balanced batching · hierarchical data sampling

The pith

Distilling SigLIP2 and DINOv3 simultaneously into dense and MoE students produces vision representations that initialize stronger early-fusion Grounding-VLMs than training from scratch.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper demonstrates that multi-teacher distillation can be made computationally efficient by combining two strong vision teachers into smaller student models. An asymmetric relation-knowledge distillation loss is shown to keep the geometric structure of each teacher intact while transferring knowledge. Supporting techniques include token-balanced batching to handle mixed resolutions and hierarchical clustering to sample a 200-million-image corpus more effectively than random selection. The resulting SigLino checkpoints, when used to initialize early-fusion Grounding-VLMs, outperform equivalent models trained from random weights on the same data. This approach suggests a route to unified visual representations at lower overall training cost.

Core claim

SigLino models are created by simultaneously distilling SigLIP2 and DINOv3 into dense and Mixture-of-Experts students using an asymmetric relation-knowledge distillation loss, token-balanced batching, and hierarchical data sampling on the OpenLVD200M corpus; the resulting representations transfer to early-fusion Grounding-VLMs and outperform models trained from scratch.
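
A minimal sketch of the kind of two-teacher distillation step this claim describes, assuming frozen SigLIP2 and DINOv3 encoders, a shared student backbone with one projection head per teacher, and plain cosine feature matching; the paper's exact heads, loss weighting, and MoE routing are not reproduced here.

```python
# Minimal sketch of a joint two-teacher distillation step (assumed structure,
# not the paper's exact recipe): a shared student backbone feeds per-teacher
# projection heads, each matched to its frozen teacher.
import torch
import torch.nn.functional as F

def distill_step(student, head_siglip, head_dino,
                 teacher_siglip, teacher_dino, images, optimizer):
    with torch.no_grad():                       # teachers stay frozen
        t_sig = teacher_siglip(images)          # [B, N, D_sig] patch features
        t_dino = teacher_dino(images)           # [B, N, D_dino]

    z = student(images)                         # shared student features [B, N, D]
    s_sig = head_siglip(z)                      # projection into SigLIP2 space
    s_dino = head_dino(z)                       # projection into DINOv3 space

    # One feature-matching term per teacher; the paper additionally uses a
    # relational term (sketched further below) and asymmetric balancing.
    loss = (1 - F.cosine_similarity(s_sig, t_sig, dim=-1)).mean() \
         + (1 - F.cosine_similarity(s_dino, t_dino, dim=-1)).mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```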

What carries the argument

The Asymmetric Relation-Knowledge Distillation loss, which maintains separate geometric relations from each teacher during joint distillation into the student.

Load-bearing premise

The asymmetric loss can preserve the distinct geometric properties of both teachers at once without one distorting the other inside the shared student representation.
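
To make the premise concrete, here is a hedged sketch of a relational distillation term in the spirit of relational knowledge distillation [23]: the student's pairwise-similarity structure is matched to each teacher separately, so neither teacher's geometry is forced through the other's. The cosine relation, the smooth-L1 matching, and the per-teacher weights are illustrative assumptions, not the paper's exact definition of the asymmetric loss.

```python
import torch
import torch.nn.functional as F

def relation_matrix(feats):
    """Pairwise cosine-similarity matrix over a batch of embeddings [B, D]."""
    feats = F.normalize(feats, dim=-1)
    return feats @ feats.t()                    # [B, B]

def relational_kd_loss(student_feats, teacher_feats_list, weights):
    """Sum of per-teacher relation-matching terms with (asymmetric) weights."""
    s_rel = relation_matrix(student_feats)
    loss = student_feats.new_zeros(())
    for w, t_feats in zip(weights, teacher_feats_list):
        with torch.no_grad():
            t_rel = relation_matrix(t_feats)    # each teacher's own geometry
        loss = loss + w * F.smooth_l1_loss(s_rel, t_rel)
    return loss
```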

What would settle it

Train a SigLino-MoE student with the proposed loss on OpenLVD200M and measure its downstream grounding accuracy against an identical-architecture model trained from scratch on the same corpus; if the distilled version shows no gain, the central transfer claim does not hold.

Figures

Figures reproduced from arXiv: 2512.20157 by Ankit Singh, Hakim Hacid, Hilde Kuehne, Ngoc Dung Huynh, Phúc H. Lê Khac, Sanath Narayan, Sofian Chaybouti, Wamiq Reyaz Para, Yasser Dahou.

Figure 1. AMoE vision foundation model: A Mixture-of-Experts student is distilled from multiple frozen vision teachers as shown in the multi-teacher distillation stage (on the left). The input image is fed to both teachers (SigLIP2 and DINOv3) and the student to obtain respective patch and global representation embeddings. Additional register tokens are employed in the student model, similar to DINOv3. The student e…
Figure 2. Token-balanced batching: Packing multiple native-resolution images per sequence up to a fixed token budget and applying …
Figure 3. Linear CKA alignments between MoE experts and …
Figure 4. We visualize PCA projections of global features, patches, and DINOv3 registers (0 and 1): original data (Col 1), synthetic …
Figure 5. Impact of Asymmetric Relational Knowledge Distillation (ARKD) on training dynamics.
Figure 6. Impact of positional encoding on unseen resolutions. We compare feature map consistency across resolutions (…
Figure 7. PCA-maps of learned representations: the original im…
Figure 8. Concept hierarchy captured by the 4-level clustering. Each column represents a high-level semantic cluster (Level 4, grey…
original abstract

Vision foundation models trained via multi-teacher distillation offer a promising path toward unified visual representations, yet the learning dynamics and data efficiency of such approaches remain underexplored. In this paper, we systematically study multi-teacher distillation for vision foundation models and identify key factors that enable training at lower computational cost. We introduce SigLino, an efficient family of agglomerative vision foundation models that distill knowledge from SigLIP2 and DINOv3 simultaneously into Dense and Mixture-of-Experts students. We show that (1) our Asymmetric Relation-Knowledge Distillation loss preserves the geometric properties of each teacher while enabling effective knowledge transfer, (2) token-balanced batching that packs varying-resolution images into sequences with uniform token budgets stabilizes representation learning across resolutions without sacrificing performance, (3) hierarchical clustering and sampling of training data, typically reserved for self-supervised learning, substantially improves sample efficiency over random sampling for multi-teacher distillation, and (4) the resulting representations transfer effectively to early-fusion Grounding-VLMs, outperforming models trained from scratch. By combining these findings, we curate OpenLVD200M, a 200M-image corpus that demonstrates superior efficiency for multi-teacher distillation. Instantiated in a Mixture-of-Experts, our SigLino-MoE initializes an early-fusion Grounding-VLM that replaces the conventional ViT->LLM stack, demonstrating improved performance compared to a model trained from scratch. We release OpenLVD200M and five distilled checkpoints comprising MoE and dense variants.
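
The abstract's token-balanced batching can be illustrated with a simple greedy packer: native-resolution images are grouped into sequences whose total patch-token count stays under a fixed budget. The patch size, budget, and first-fit policy below are assumptions for illustration; the paper's attention masking and sampler details are not shown.

```python
def pack_by_token_budget(image_sizes, patch=16, budget=4096):
    """image_sizes: list of (height, width); returns lists of image indices,
    one list per packed sequence (illustrative greedy packing only)."""
    sequences, current, used = [], [], 0
    for idx, (h, w) in enumerate(image_sizes):
        tokens = (h // patch) * (w // patch)    # patch tokens for this image
        if used + tokens > budget and current:  # close the sequence when full
            sequences.append(current)
            current, used = [], 0
        current.append(idx)
        used += tokens
    if current:
        sequences.append(current)
    return sequences

# Example: a 512x512 image (1024 tokens) and a 384x640 image (960 tokens)
# share one sequence; a 1024x768 image (3072 tokens) starts a new one.
print(pack_by_token_budget([(512, 512), (384, 640), (1024, 768)]))  # [[0, 1], [2]]
```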

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces SigLino, an efficient family of agglomerative vision foundation models obtained by distilling knowledge simultaneously from SigLIP2 and DINOv3 into dense and Mixture-of-Experts student architectures. It proposes an Asymmetric Relation-Knowledge Distillation loss, token-balanced batching for multi-resolution inputs, and hierarchical clustering/sampling of training data to improve sample efficiency. The authors release the OpenLVD200M corpus (200M images) and five checkpoints, and demonstrate that the resulting representations initialize early-fusion Grounding-VLMs that outperform models trained from scratch on transfer tasks.

Significance. If the empirical claims are substantiated, the work is significant for advancing practical multi-teacher distillation of vision foundation models at reduced computational cost. The combination of the proposed loss, batching, and data-sampling strategies, together with the public release of OpenLVD200M and the distilled checkpoints, provides concrete tools and data for the community. The downstream transfer results to early-fusion Grounding-VLMs, if robust, would indicate a viable path toward unified visual representations without training large models from scratch.

major comments (2)
  1. [Abstract and §3 (Method)] Abstract, claim (1) and the corresponding method section describing the Asymmetric Relation-Knowledge Distillation loss: the assertion that this loss simultaneously preserves the geometric properties of both SigLIP2 (contrastive alignment) and DINOv3 (self-supervised patch geometry) lacks direct supporting diagnostics. No CKA similarity, Procrustes alignment, or per-teacher retrieval metrics are reported to demonstrate that the student retains high-fidelity structure from each teacher rather than converging to a compromised embedding.
  2. [§5 (Experiments)] Experimental evaluation section (transfer results to Grounding-VLMs): the claim that SigLino-MoE initializes superior early-fusion models compared with from-scratch training is presented without visible baseline details, error bars, number of runs, or ablation isolating the contribution of the asymmetric loss versus the data-sampling strategy. This makes it difficult to verify that the reported gains are attributable to the distillation rather than other factors.
minor comments (2)
  1. [Abstract and §3] Ensure consistent definition of acronyms (VLM, MoE, etc.) on first appearance and clarify the exact form of the asymmetric loss (weighting coefficients, relation vs. knowledge terms) with an equation reference.
  2. [Figures and Tables] Figure captions and tables should explicitly state the evaluation metrics and dataset splits used for the Grounding-VLM transfer experiments.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments. We agree that additional diagnostics and experimental details will strengthen the manuscript and have revised the paper accordingly to address both major points.

point-by-point responses
  1. Referee: [Abstract and §3 (Method)] Abstract, claim (1) and the corresponding method section describing the Asymmetric Relation-Knowledge Distillation loss: the assertion that this loss simultaneously preserves the geometric properties of both SigLIP2 (contrastive alignment) and DINOv3 (self-supervised patch geometry) lacks direct supporting diagnostics. No CKA similarity, Procrustes alignment, or per-teacher retrieval metrics are reported to demonstrate that the student retains high-fidelity structure from each teacher rather than converging to a compromised embedding.

    Authors: We agree that direct diagnostics are valuable for substantiating the claim. In the revised manuscript we have added CKA similarity matrices computed between the student embeddings and each teacher separately, Procrustes alignment distances, and per-teacher zero-shot retrieval metrics on ImageNet and COCO. These results appear in a new subsection of §3 and in the supplementary material; they show that the student retains high fidelity to both teachers rather than collapsing to an intermediate representation (a minimal sketch of the linear CKA diagnostic appears after these responses). revision: yes

  2. Referee: [§5 (Experiments)] Experimental evaluation section (transfer results to Grounding-VLMs): the claim that SigLino-MoE initializes superior early-fusion models compared with from-scratch training is presented without visible baseline details, error bars, number of runs, or ablation isolating the contribution of the asymmetric loss versus the data-sampling strategy. This makes it difficult to verify that the reported gains are attributable to the distillation rather than other factors.

    Authors: We acknowledge the need for greater experimental rigor. The revised §5 now reports mean and standard deviation over three independent runs with different random seeds, explicitly lists all baseline hyperparameters and training schedules for the from-scratch models, and includes a new ablation table that isolates the asymmetric loss from the hierarchical sampling and token-balanced batching. These additions allow readers to attribute performance gains to the distillation components. revision: yes
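
For reference, the linear CKA diagnostic invoked above (Kornblith et al. [17]) can be computed as in the sketch below; the variable names and evaluation protocol are illustrative assumptions, not taken from the paper.

```python
import torch

def linear_cka(x, y):
    """Linear CKA between feature matrices x: [n, d1] and y: [n, d2]
    extracted for the same n inputs; returns a scalar in [0, 1]."""
    x = x - x.mean(dim=0, keepdim=True)         # center each feature dimension
    y = y - y.mean(dim=0, keepdim=True)
    hsic = torch.norm(y.t() @ x) ** 2           # ||Y^T X||_F^2
    return (hsic / (torch.norm(x.t() @ x) * torch.norm(y.t() @ y))).item()

# Usage sketch: compute linear_cka(student_feats, siglip_feats) and
# linear_cka(student_feats, dino_feats) on a held-out set; high values for
# both would support the claim that neither teacher's geometry is compromised.
```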

Circularity Check

0 steps flagged

No circularity: the empirical distillation claims are validated on independent held-out benchmarks

full rationale

The paper reports empirical findings on multi-teacher distillation using an asymmetric loss, token-balanced batching, and hierarchical data sampling. All claims (preservation of teacher geometries, transfer to Grounding-VLMs) are validated via held-out benchmarks and comparisons to scratch-trained baselines rather than any derivation that reduces to fitted parameters or self-citations by construction. No equations, uniqueness theorems, or ansatzes are invoked that loop back to the paper's own definitions or prior self-work.

Axiom & Free-Parameter Ledger

2 free parameters · 0 axioms · 0 invented entities

Standard supervised and self-supervised training assumptions plus hyperparameter choices for loss balancing and clustering; no new invented entities or unstated axioms beyond typical deep-learning practice.

free parameters (2)
  • loss balancing coefficients
    Weights controlling contribution of each teacher in the asymmetric distillation loss are chosen during training.
  • clustering hyperparameters
    Number of clusters and sampling ratios in hierarchical data selection are tuned for the 200M corpus.
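
As a rough illustration of hierarchy-aware, balanced sampling of this kind (in the spirit of [32]), a two-level k-means with equal per-leaf quotas looks like the sketch below; the depth, cluster counts, and quotas are placeholders for the tuned free parameters listed above, not the paper's values (the paper uses a 4-level hierarchy over a 200M-image corpus).

```python
import numpy as np
from sklearn.cluster import KMeans

def balanced_hierarchical_sample(embeddings, n_coarse=8, n_fine=4,
                                 per_leaf=10, seed=0):
    """Two-level k-means over image embeddings, then an equal sample quota
    per leaf cluster; returns indices of the selected (balanced) subset."""
    rng = np.random.default_rng(seed)
    coarse = KMeans(n_clusters=n_coarse, n_init=5,
                    random_state=seed).fit_predict(embeddings)
    selected = []
    for c in range(n_coarse):
        idx = np.flatnonzero(coarse == c)       # members of coarse cluster c
        if len(idx) == 0:
            continue
        k = min(n_fine, len(idx))
        fine = KMeans(n_clusters=k, n_init=5,
                      random_state=seed).fit_predict(embeddings[idx])
        for f in range(k):
            leaf = idx[fine == f]               # indices in this leaf cluster
            take = min(per_leaf, len(leaf))
            selected.extend(rng.choice(leaf, size=take, replace=False))
    return np.asarray(selected)
```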

pith-pipeline@v0.9.0 · 5619 in / 1149 out tokens · 20186 ms · 2026-05-16T20:19:58.481502+00:00 · methodology


Reference graph

Works this paper leans on

53 extracted references · 53 canonical work pages · 8 internal anchors

  1. [1]

    Qwen2.5-VL Technical Report

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-VL technical report. arXiv preprint arXiv:2502.13923, 2025.

  2. [2]

    Food-101 – Mining discriminative components with random forests

    Lukas Bossard, Matthieu Guillaumin, and Luc Van Gool. Food-101 – mining discriminative components with random forests. In European Conference on Computer Vision, pages 446–461. Springer, 2014.

  3. [3]

    DearKD: Data-efficient early knowledge distillation for vision transformers

    Xianing Chen, Qiong Cao, Yujie Zhong, Jing Zhang, Shenghua Gao, and Dacheng Tao. DearKD: Data-efficient early knowledge distillation for vision transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12052–12062, 2022.

  4. [4]

    Describing textures in the wild

    Mircea Cimpoi, Subhransu Maji, Iasonas Kokkinos, Sammy Mohamed, and Andrea Vedaldi. Describing textures in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3606–3613, 2014.

  5. [5]

    The Cityscapes dataset for semantic urban scene understanding

    Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The Cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3213–3223, 2016.

  6. [6]

    Vision Transformers Need Registers

    Timothée Darcet, Maxime Oquab, Julien Mairal, and Piotr Bojanowski. Vision transformers need registers. arXiv preprint arXiv:2309.16588, 2023.

  7. [7]

    Patch n' Pack: NaViT, a vision transformer for any aspect ratio and resolution

    Mostafa Dehghani, Basil Mustafa, Josip Djolonga, Jonathan Heek, Matthias Minderer, Mathilde Caron, Andreas Steiner, Joan Puigcerver, Robert Geirhos, Ibrahim M Alabdulmohsin, et al. Patch n' Pack: NaViT, a vision transformer for any aspect ratio and resolution. Advances in Neural Information Processing Systems, 36:2252–2274, 2023.

  8. [8]

    Flex Attention: A programming model for generating optimized attention kernels

    Juechu Dong, Boyuan Feng, Driss Guessous, Yanbo Liang, and Horace He. Flex attention: A programming model for generating optimized attention kernels. arXiv preprint arXiv:2412.05496, 2024.

  9. [9]

    The Pascal Visual Object Classes (VOC) challenge

    Mark Everingham, Luc Van Gool, Christopher K. I. Williams, John Winn, and Andrew Zisserman. The Pascal visual object classes (VOC) challenge. International Journal of Computer Vision, 88(2):303–338, 2010.

  10. [10]

    Data Filtering Networks

    Alex Fang, Albin Madappally Jose, Amit Jain, Ludwig Schmidt, Alexander Toshev, and Vaishaal Shankar. Data filtering networks. arXiv preprint arXiv:2309.17425, 2023.

  11. [11]

    Learning generative visual models from few training examples: An incremental Bayesian approach tested on 101 object categories

    Li Fei-Fei, Rob Fergus, and Pietro Perona. Learning generative visual models from few training examples: An incremental Bayesian approach tested on 101 object categories. In 2004 Conference on Computer Vision and Pattern Recognition Workshop, pages 178–178. IEEE, 2004.

  12. [12]

    Learning efficient vision transformers via fine-grained manifold distillation

    Zhiwei Hao, Jianyuan Guo, Ding Jia, Kai Han, Yehui Tang, Chao Zhang, Han Hu, and Yunhe Wang. Learning efficient vision transformers via fine-grained manifold distillation. Advances in Neural Information Processing Systems, 35:9164–9175, 2022.

  13. [13]

    RADIOv2.5: Improved baselines for agglomerative vision foundation models

    Greg Heinrich, Mike Ranzinger, Hongxu Yin, Yao Lu, Jan Kautz, Andrew Tao, Bryan Catanzaro, and Pavlo Molchanov. RADIOv2.5: Improved baselines for agglomerative vision foundation models. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 22487–22497.

  14. [14]

    DINOv2 meets text: A unified framework for image- and pixel-level vision-language alignment

    Cijo Jose, Théo Moutakanni, Dahyun Kang, Federico Baldassarre, Timothée Darcet, Hu Xu, Daniel Li, Marc Szafraniec, Michaël Ramamonjisoa, Maxime Oquab, et al. DINOv2 meets text: A unified framework for image- and pixel-level vision-language alignment. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 24905–24916, 2025.

  15. [15]

    ReferItGame: Referring to objects in photographs of natural scenes

    Sahar Kazemzadeh, Vicente Ordonez, Mark Matten, and Tamara Berg. ReferItGame: Referring to objects in photographs of natural scenes. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 787–798, 2014.

  16. [16]

    Segment Anything

    Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C. Berg, Wan-Yen Lo, et al. Segment anything. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4015–4026, 2023.

  17. [17]

    Similarity of neural network representations revisited

    Simon Kornblith, Mohammad Norouzi, Honglak Lee, and Geoffrey Hinton. Similarity of neural network representations revisited. In International Conference on Machine Learning, pages 3519–3529. PMLR, 2019.

  18. [18]

    Microsoft COCO: Common objects in context

    Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, pages 740–755. Springer, 2014.

  19. [19]

    DeepSeek-VL: Towards Real-World Vision-Language Understanding

    Haoyu Lu, Wen Liu, Bo Zhang, Bingxuan Wang, Kai Dong, Bo Liu, Jingxiang Sun, Tongzheng Ren, Zhuoshu Li, Hao Yang, et al. DeepSeek-VL: Towards real-world vision-language understanding. arXiv preprint arXiv:2403.05525.

  20. [20]

    Swiss Army Knife: Synergizing biases in knowledge from vision foundation models for multi-task learning

    Yuxiang Lu, Shengcao Cao, and Yu-Xiong Wang. Swiss army knife: Synergizing biases in knowledge from vision foundation models for multi-task learning. arXiv preprint arXiv:2410.14633, 2024.

  21. [21]

    Fine-Grained Visual Classification of Aircraft

    Subhransu Maji, Esa Rahtu, Juho Kannala, Matthew Blaschko, and Andrea Vedaldi. Fine-grained visual classification of aircraft. arXiv preprint arXiv:1306.5151, 2013.

  22. [22]

    Automated flower classification over a large number of classes

    Maria-Elena Nilsback and Andrew Zisserman. Automated flower classification over a large number of classes. In 2008 Sixth Indian Conference on Computer Vision, Graphics & Image Processing, pages 722–729. IEEE, 2008.

  23. [23]

    Relational knowledge distillation

    Wonpyo Park, Dongju Kim, Yan Lu, and Minsu Cho. Relational knowledge distillation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3967–3976, 2019.

  24. [24]

    AM-RADIO: Agglomerative vision foundation model – reduce all domains into one

    Mike Ranzinger, Greg Heinrich, Jan Kautz, and Pavlo Molchanov. AM-RADIO: Agglomerative vision foundation model – reduce all domains into one. arXiv preprint arXiv:2312.06709, 2023.

  25. [25]

    PHI-S: Distribution balancing for label-free multi-teacher distillation

    Mike Ranzinger, Jon Barker, Greg Heinrich, Pavlo Molchanov, Bryan Catanzaro, and Andrew Tao. PHI-S: Distribution balancing for label-free multi-teacher distillation. arXiv preprint arXiv:2410.01680, 2024.

  26. [26]

    ImageNet large scale visual recognition challenge

    Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.

  27. [27]

    UNIC: Universal classification models via multi-teacher distillation

    Mert Bulent Sariyildiz, Philippe Weinzaepfel, Thomas Lucas, Diane Larlus, and Yannis Kalantidis. UNIC: Universal classification models via multi-teacher distillation. arXiv preprint arXiv:2408.05088, 2024.

  28. [28]

    LAION-5B: An open large-scale dataset for training next generation image-text models

    Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. LAION-5B: An open large-scale dataset for training next generation image-text models. Advances in Neural Information Processing Systems, 35:25278–25294, 2022.

  29. [29]

    Theia: Distilling diverse vision foundation models for robot learning

    Jinghuan Shang, Karl Schmeckpeper, Brandon B. May, Maria Vittoria Minniti, Tarik Kelestemur, David Watkins, and Laura Herlant. Theia: Distilling diverse vision foundation models for robot learning. arXiv preprint arXiv:2407.20179, 2024.

  30. [30]

    DINOv3

    Oriane Siméoni, Huy V. Vo, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Michaël Ramamonjisoa, et al. DINOv3. arXiv preprint arXiv:2508.10104, 2025.

  31. [31]

    SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features

    Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muhammad Ferjad Naeem, Ibrahim Alabdulmohsin, Nikhil Parthasarathy, Talfan Evans, Lucas Beyer, Ye Xia, Basil Mustafa, et al. SigLIP 2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense features. arXiv preprint arXiv:2502.14786, 2025.

  32. [32]

    Automatic data curation for self-supervised learning: A clustering-based approach

    Huy V. Vo, Vasil Khalidov, Timothée Darcet, Théo Moutakanni, Nikita Smetanin, Marc Szafraniec, Hugo Touvron, Camille Couprie, Maxime Oquab, Armand Joulin, et al. Automatic data curation for self-supervised learning: A clustering-based approach. arXiv preprint arXiv:2405.15613, 2024.

  33. [33]

    The Caltech-UCSD Birds-200-2011 dataset

    Catherine Wah, Steve Branson, Peter Welinder, Pietro Perona, and Serge Belongie. The Caltech-UCSD Birds-200-2011 dataset. 2011.

  34. [34]

    SAM-CLIP: Merging vision foundation models towards semantic and spatial understanding

    Haoxiang Wang, Pavan Kumar Anasosalu Vasu, Fartash Faghri, Raviteja Vemulapalli, Mehrdad Farajtabar, Sachin Mehta, Mohammad Rastegari, Oncel Tuzel, and Hadi Pouransari. SAM-CLIP: Merging vision foundation models towards semantic and spatial understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3635–364…

  35. [35]

    Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

    Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-VL: Enhancing vision-language model's perception of the world at any resolution. arXiv preprint arXiv:2409.12191, 2024.

  36. [36]

    TinyViT: Fast pretraining distillation for small vision transformers

    Kan Wu, Jinnian Zhang, Houwen Peng, Mengchen Liu, Bin Xiao, Jianlong Fu, and Lu Yuan. TinyViT: Fast pretraining distillation for small vision transformers. In European Conference on Computer Vision, pages 68–85. Springer, 2022.

  37. [37]

    On n-dimensional rotary positional embeddings

    Jerry Xiong. On n-dimensional rotary positional embeddings, 2025.

  38. [38]

    Qwen3 Technical Report

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025.

  39. [39]

    CLIP-KD: An empirical study of CLIP model distillation

    Chuanguang Yang, Zhulin An, Libo Huang, Junyu Bi, Xinqiang Yu, Han Yang, Boyu Diao, and Yongjun Xu. CLIP-KD: An empirical study of CLIP model distillation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15952–15962, 2024.

  40. [40]

    From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions

    Peter Young, Alice Lai, Micah Hodosh, and Julia Hockenmaier. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Transactions of the Association for Computational Linguistics, 2:67–78, 2014.

  41. [41]

    Modeling context in referring expressions

    Licheng Yu, Patrick Poirson, Shan Yang, Alexander C. Berg, and Tamara L. Berg. Modeling context in referring expressions. In European Conference on Computer Vision, pages 69–85. Springer, 2016.

  42. [42]

    LiT: Zero-shot transfer with locked-image text tuning

    Xiaohua Zhai, Xiao Wang, Basil Mustafa, Andreas Steiner, Daniel Keysers, Alexander Kolesnikov, and Lucas Beyer. LiT: Zero-shot transfer with locked-image text tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18123–18133, 2022.

  43. [43]

    Sigmoid loss for language image pre-training

    Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 11975–11986, 2023.

  44. [44]

    MiniViT: Compressing vision transformers with weight multiplexing

    Jinnian Zhang, Houwen Peng, Kan Wu, Mengchen Liu, Bin Xiao, Jianlong Fu, and Lu Yuan. MiniViT: Compressing vision transformers with weight multiplexing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12145–12154, 2022.

  45. [45]

    Scene parsing through ADE20K dataset

    Bolei Zhou, Hang Zhao, Xavier Puig, Sanja Fidler, Adela Barriuso, and Antonio Torralba. Scene parsing through ADE20K dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 633–641.

  46. [46]

    AMoE: Agglomerative Mixture-of-Experts Vision Foundation Model (Supplementary Material)

  47. [47]

    Analysis of PHI-S Transformation on Registers

    We apply PHI-S [25] to evenly distribute the statistical influence of diverse channels and teacher representations. PHI-S operates by rotating the feature space via an invertible transform, composed of PCA whitening and a Hadamard rotation, such that the variance is distributed uniformly across all chann…

  48. [48]

    Impact of Asymmetric Relational Knowledge Distillation (ARKD)

    As introduced in the main text (Section 3.2), we propose Asymmetric Relational Knowledge Distillation (ARKD) to enforce pairwise geometric consistency in the student embedding space. Here, we provide an empirical analysis of its effect on training dynamics. Figure 5 visualizes the evoluti…

  49. [49]

    Positional Encoding Analysis

    We investigate the impact of the Rotary Positional Embedding (RoPE) strategy on the student's ability to generalize to unseen high resolutions. Specifically, we compare the standard Axial RoPE against normalizing the input coordinates based on the image aspect ratio (mapping coordinates roughly to [−1, 1]) rather than usin…

  50. [50]

    Qualitative Analysis of Distilled Representations

    We provide a qualitative comparison of the distilled student features against the teacher baselines in Figure 7. Figure 4 visualizes PCA projections of glo…

  51. [51]

    Training Implementation Details

    We train our 18-layer MoE student model (d = 768, 28 experts, top-k = 6) on 4 nodes with 8×A100 GPUs each. We use the AdamW optimizer with β1 = 0.9, β2 = 0.999, and ε = 10^-15. The learning rate follows a linear decay schedule from 10^-3 to 10^-4 after a 500-step warmup, with weight decay set to 0.02. We summarize the pseudo-code of th…

  52. [52]

    Detailed Ablation Benchmarks

    We provide the full per-dataset results for our ablations. Table 8 and Table 11 detail the comparison between our curated OpenLVD200M dataset and random subsampling, highlighting the consistent gains across fine-grained classification and retrieval tasks. Similarly, Table 9 and Table 10 present the full breakdown of the ARKD…

  53. [53]

    Details on OpenLVD200M Curation

    As outlined in §3, we construct OpenLVD200M using the hierarchical clustering and sampling pipeline proposed by [32] to mitigate the long-tail biases inherent in web-scraped data. Figure 8 visually demonstrates the semantic structure captured by this process. The hierarchy orga…