pith. machine review for the scientific record.

arxiv: 2512.20157 · v2 · submitted 2025-12-23 · 💻 cs.CV

Recognition: no theorem link

SigLino: Efficient Multi-Teacher Distillation for Agglomerative Vision Foundation Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 20:19 UTC · model grok-4.3

classification 💻 cs.CV
keywords multi-teacher distillation · vision foundation models · knowledge distillation · mixture of experts · grounding vision-language models · token-balanced batching · hierarchical data sampling

The pith

Distilling SigLIP2 and DINOv3 simultaneously into dense and MoE students produces vision representations that initialize stronger early-fusion Grounding-VLMs than training from scratch.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper demonstrates that multi-teacher distillation can be made computationally efficient by combining two strong vision teachers into smaller student models. An asymmetric relation-knowledge distillation loss is shown to keep the geometric structure of each teacher intact while transferring knowledge. Supporting techniques include token-balanced batching to handle mixed resolutions and hierarchical clustering to sample a 200-million-image corpus more effectively than random selection. The resulting SigLino checkpoints, when used to initialize early-fusion Grounding-VLMs, outperform equivalent models trained from random weights on the same data. This approach suggests a route to unified visual representations at lower overall training cost.

Core claim

SigLino models are created by simultaneously distilling SigLIP2 and DINOv3 into dense and Mixture-of-Experts students using an asymmetric relation-knowledge distillation loss, token-balanced batching, and hierarchical data sampling on the OpenLVD200M corpus; the resulting representations transfer to early-fusion Grounding-VLMs and outperform models trained from scratch.
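
A minimal sketch of the kind of two-teacher distillation step this claim describes, assuming frozen SigLIP2 and DINOv3 encoders, a shared student backbone with one projection head per teacher, and plain cosine feature matching; the paper's exact heads, loss weighting, and MoE routing are not reproduced here.

```python
# Minimal sketch of a joint two-teacher distillation step (assumed structure,
# not the paper's exact recipe): a shared student backbone feeds per-teacher
# projection heads, each matched to its frozen teacher.
import torch
import torch.nn.functional as F

def distill_step(student, head_siglip, head_dino,
                 teacher_siglip, teacher_dino, images, optimizer):
    with torch.no_grad():                       # teachers stay frozen
        t_sig = teacher_siglip(images)          # [B, N, D_sig] patch features
        t_dino = teacher_dino(images)           # [B, N, D_dino]

    z = student(images)                         # shared student features [B, N, D]
    s_sig = head_siglip(z)                      # projection into SigLIP2 space
    s_dino = head_dino(z)                       # projection into DINOv3 space

    # One feature-matching term per teacher; the paper additionally uses a
    # relational term (sketched further below) and asymmetric balancing.
    loss = (1 - F.cosine_similarity(s_sig, t_sig, dim=-1)).mean() \
         + (1 - F.cosine_similarity(s_dino, t_dino, dim=-1)).mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```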

What carries the argument

The Asymmetric Relation-Knowledge Distillation loss, which maintains separate geometric relations from each teacher during joint distillation into the student.

Load-bearing premise

The asymmetric loss can preserve the distinct geometric properties of both teachers at once without one distorting the other inside the shared student representation.
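
To make the premise concrete, here is a hedged sketch of a relational distillation term in the spirit of relational knowledge distillation [23]: the student's pairwise-similarity structure is matched to each teacher separately, so neither teacher's geometry is forced through the other's. The cosine relation, the smooth-L1 matching, and the per-teacher weights are illustrative assumptions, not the paper's exact definition of the asymmetric loss.

```python
import torch
import torch.nn.functional as F

def relation_matrix(feats):
    """Pairwise cosine-similarity matrix over a batch of embeddings [B, D]."""
    feats = F.normalize(feats, dim=-1)
    return feats @ feats.t()                    # [B, B]

def relational_kd_loss(student_feats, teacher_feats_list, weights):
    """Sum of per-teacher relation-matching terms with (asymmetric) weights."""
    s_rel = relation_matrix(student_feats)
    loss = student_feats.new_zeros(())
    for w, t_feats in zip(weights, teacher_feats_list):
        with torch.no_grad():
            t_rel = relation_matrix(t_feats)    # each teacher's own geometry
        loss = loss + w * F.smooth_l1_loss(s_rel, t_rel)
    return loss
```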

What would settle it

Train a SigLino-MoE student with the proposed loss on OpenLVD200M and measure its downstream grounding accuracy against an identical-architecture model trained from scratch on the same corpus; if the distilled version shows no gain, the central transfer claim does not hold.

Figures

Figures reproduced from arXiv: 2512.20157 by Ankit Singh, Hakim Hacid, Hilde Kuehne, Ngoc Dung Huynh, Phúc H. Lê Khac, Sanath Narayan, Sofian Chaybouti, Wamiq Reyaz Para, Yasser Dahou.

Figure 1. AMoE vision foundation model: A Mixture-of-Experts student is distilled from multiple frozen vision teachers as shown in the multi-teacher distillation stage (on the left). The input image is fed to both teachers (SigLIP2 and DINOv3) and the student to obtain respective patch and global representation embeddings. Additional register tokens are employed in the student model, similar to DINOv3. The student e…
Figure 2. Token-balanced batching: Packing multiple native-resolution images per sequence up to a fixed token budget and applying …
Figure 3. Linear CKA alignments between MoE experts and …
Figure 4. We visualize PCA projections of global features, patches, and DINOv3 registers (0 and 1): original data (Col 1), synthetic …
Figure 5. Impact of Asymmetric Relational Knowledge Distillation (ARKD) on training dynamics.
Figure 6. Impact of positional encoding on unseen resolutions. We compare feature map consistency across resolutions (…
Figure 7. PCA-maps of learned representations: the original im…
Figure 8. Concept hierarchy captured by the 4-level clustering. Each column represents a high-level semantic cluster (Level 4, grey…
original abstract

Vision foundation models trained via multi-teacher distillation offer a promising path toward unified visual representations, yet the learning dynamics and data efficiency of such approaches remain underexplored. In this paper, we systematically study multi-teacher distillation for vision foundation models and identify key factors that enable training at lower computational cost. We introduce SigLino, an efficient family of agglomerative vision foundation models that distill knowledge from SigLIP2 and DINOv3 simultaneously into Dense and Mixture-of-Experts students. We show that (1) our Asymmetric Relation-Knowledge Distillation loss preserves the geometric properties of each teacher while enabling effective knowledge transfer, (2) token-balanced batching that packs varying-resolution images into sequences with uniform token budgets stabilizes representation learning across resolutions without sacrificing performance, (3) hierarchical clustering and sampling of training data, typically reserved for self-supervised learning, substantially improves sample efficiency over random sampling for multi-teacher distillation, and (4) the resulting representations transfer effectively to early-fusion Grounding-VLMs, outperforming models trained from scratch. By combining these findings, we curate OpenLVD200M, a 200M-image corpus that demonstrates superior efficiency for multi-teacher distillation. Instantiated in a Mixture-of-Experts, our SigLino-MoE initializes an early-fusion Grounding-VLM that replaces the conventional ViT->LLM stack, demonstrating improved performance compared to a model trained from scratch. We release OpenLVD200M and five distilled checkpoints comprising MoE and dense variants.
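
The abstract's token-balanced batching can be illustrated with a simple greedy packer: native-resolution images are grouped into sequences whose total patch-token count stays under a fixed budget. The patch size, budget, and first-fit policy below are assumptions for illustration; the paper's attention masking and sampler details are not shown.

```python
def pack_by_token_budget(image_sizes, patch=16, budget=4096):
    """image_sizes: list of (height, width); returns lists of image indices,
    one list per packed sequence (illustrative greedy packing only)."""
    sequences, current, used = [], [], 0
    for idx, (h, w) in enumerate(image_sizes):
        tokens = (h // patch) * (w // patch)    # patch tokens for this image
        if used + tokens > budget and current:  # close the sequence when full
            sequences.append(current)
            current, used = [], 0
        current.append(idx)
        used += tokens
    if current:
        sequences.append(current)
    return sequences

# Example: a 512x512 image (1024 tokens) and a 384x640 image (960 tokens)
# share one sequence; a 1024x768 image (3072 tokens) starts a new one.
print(pack_by_token_budget([(512, 512), (384, 640), (1024, 768)]))  # [[0, 1], [2]]
```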

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces SigLino, an efficient family of agglomerative vision foundation models obtained by distilling knowledge simultaneously from SigLIP2 and DINOv3 into dense and Mixture-of-Experts student architectures. It proposes an Asymmetric Relation-Knowledge Distillation loss, token-balanced batching for multi-resolution inputs, and hierarchical clustering/sampling of training data to improve sample efficiency. The authors release the OpenLVD200M corpus (200M images) and five checkpoints, and demonstrate that the resulting representations initialize early-fusion Grounding-VLMs that outperform models trained from scratch on transfer tasks.

Significance. If the empirical claims are substantiated, the work is significant for advancing practical multi-teacher distillation of vision foundation models at reduced computational cost. The combination of the proposed loss, batching, and data-sampling strategies, together with the public release of OpenLVD200M and the distilled checkpoints, provides concrete tools and data for the community. The downstream transfer results to early-fusion Grounding-VLMs, if robust, would indicate a viable path toward unified visual representations without training large models from scratch.

major comments (2)
  1. [Abstract and §3 (Method)] Abstract, claim (1) and the corresponding method section describing the Asymmetric Relation-Knowledge Distillation loss: the assertion that this loss simultaneously preserves the geometric properties of both SigLIP2 (contrastive alignment) and DINOv3 (self-supervised patch geometry) lacks direct supporting diagnostics. No CKA similarity, Procrustes alignment, or per-teacher retrieval metrics are reported to demonstrate that the student retains high-fidelity structure from each teacher rather than converging to a compromised embedding.
  2. [§5 (Experiments)] Experimental evaluation section (transfer results to Grounding-VLMs): the claim that SigLino-MoE initializes superior early-fusion models compared with from-scratch training is presented without visible baseline details, error bars, number of runs, or ablation isolating the contribution of the asymmetric loss versus the data-sampling strategy. This makes it difficult to verify that the reported gains are attributable to the distillation rather than other factors.
minor comments (2)
  1. [Abstract and §3] Ensure consistent definition of acronyms (VLM, MoE, etc.) on first appearance and clarify the exact form of the asymmetric loss (weighting coefficients, relation vs. knowledge terms) with an equation reference.
  2. [Figures and Tables] Figure captions and tables should explicitly state the evaluation metrics and dataset splits used for the Grounding-VLM transfer experiments.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments. We agree that additional diagnostics and experimental details will strengthen the manuscript and have revised the paper accordingly to address both major points.

point-by-point responses
  1. Referee: [Abstract and §3 (Method)] Abstract, claim (1) and the corresponding method section describing the Asymmetric Relation-Knowledge Distillation loss: the assertion that this loss simultaneously preserves the geometric properties of both SigLIP2 (contrastive alignment) and DINOv3 (self-supervised patch geometry) lacks direct supporting diagnostics. No CKA similarity, Procrustes alignment, or per-teacher retrieval metrics are reported to demonstrate that the student retains high-fidelity structure from each teacher rather than converging to a compromised embedding.

    Authors: We agree that direct diagnostics are valuable for substantiating the claim. In the revised manuscript we have added CKA similarity matrices computed between the student embeddings and each teacher separately, Procrustes alignment distances, and per-teacher zero-shot retrieval metrics on ImageNet and COCO. These results appear in a new subsection of §3 and in the supplementary material; they show that the student retains high fidelity to both teachers rather than collapsing to an intermediate representation (a minimal sketch of the linear CKA diagnostic appears after these responses). revision: yes

  2. Referee: [§5 (Experiments)] Experimental evaluation section (transfer results to Grounding-VLMs): the claim that SigLino-MoE initializes superior early-fusion models compared with from-scratch training is presented without visible baseline details, error bars, number of runs, or ablation isolating the contribution of the asymmetric loss versus the data-sampling strategy. This makes it difficult to verify that the reported gains are attributable to the distillation rather than other factors.

    Authors: We acknowledge the need for greater experimental rigor. The revised §5 now reports mean and standard deviation over three independent runs with different random seeds, explicitly lists all baseline hyperparameters and training schedules for the from-scratch models, and includes a new ablation table that isolates the asymmetric loss from the hierarchical sampling and token-balanced batching. These additions allow readers to attribute performance gains to the distillation components. revision: yes
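
For reference, the linear CKA diagnostic invoked above (Kornblith et al. [17]) can be computed as in the sketch below; the variable names and evaluation protocol are illustrative assumptions, not taken from the paper.

```python
import torch

def linear_cka(x, y):
    """Linear CKA between feature matrices x: [n, d1] and y: [n, d2]
    extracted for the same n inputs; returns a scalar in [0, 1]."""
    x = x - x.mean(dim=0, keepdim=True)         # center each feature dimension
    y = y - y.mean(dim=0, keepdim=True)
    hsic = torch.norm(y.t() @ x) ** 2           # ||Y^T X||_F^2
    return (hsic / (torch.norm(x.t() @ x) * torch.norm(y.t() @ y))).item()

# Usage sketch: compute linear_cka(student_feats, siglip_feats) and
# linear_cka(student_feats, dino_feats) on a held-out set; high values for
# both would support the claim that neither teacher's geometry is compromised.
```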

Circularity Check

0 steps flagged

No circularity: the empirical distillation claims are validated on independent held-out benchmarks

full rationale

The paper reports empirical findings on multi-teacher distillation using an asymmetric loss, token-balanced batching, and hierarchical data sampling. All claims (preservation of teacher geometries, transfer to Grounding-VLMs) are validated via held-out benchmarks and comparisons to scratch-trained baselines rather than any derivation that reduces to fitted parameters or self-citations by construction. No equations, uniqueness theorems, or ansatzes are invoked that loop back to the paper's own definitions or prior self-work.

Axiom & Free-Parameter Ledger

2 free parameters · 0 axioms · 0 invented entities

Standard supervised and self-supervised training assumptions plus hyperparameter choices for loss balancing and clustering; no new invented entities or unstated axioms beyond typical deep-learning practice.

free parameters (2)
  • loss balancing coefficients
    Weights controlling contribution of each teacher in the asymmetric distillation loss are chosen during training.
  • clustering hyperparameters
    Number of clusters and sampling ratios in hierarchical data selection are tuned for the 200M corpus.
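
As a rough illustration of hierarchy-aware, balanced sampling of this kind (in the spirit of [32]), a two-level k-means with equal per-leaf quotas looks like the sketch below; the depth, cluster counts, and quotas are placeholders for the tuned free parameters listed above, not the paper's values (the paper uses a 4-level hierarchy over a 200M-image corpus).

```python
import numpy as np
from sklearn.cluster import KMeans

def balanced_hierarchical_sample(embeddings, n_coarse=8, n_fine=4,
                                 per_leaf=10, seed=0):
    """Two-level k-means over image embeddings, then an equal sample quota
    per leaf cluster; returns indices of the selected (balanced) subset."""
    rng = np.random.default_rng(seed)
    coarse = KMeans(n_clusters=n_coarse, n_init=5,
                    random_state=seed).fit_predict(embeddings)
    selected = []
    for c in range(n_coarse):
        idx = np.flatnonzero(coarse == c)       # members of coarse cluster c
        if len(idx) == 0:
            continue
        k = min(n_fine, len(idx))
        fine = KMeans(n_clusters=k, n_init=5,
                      random_state=seed).fit_predict(embeddings[idx])
        for f in range(k):
            leaf = idx[fine == f]               # indices in this leaf cluster
            take = min(per_leaf, len(leaf))
            selected.extend(rng.choice(leaf, size=take, replace=False))
    return np.asarray(selected)
```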

pith-pipeline@v0.9.0 · 5619 in / 1149 out tokens · 20186 ms · 2026-05-16T20:19:58.481502+00:00 · methodology


Reference graph

Works this paper leans on

53 extracted references · 53 canonical work pages · 8 internal anchors

  1. [1]

    Qwen2.5-VL Technical Report

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-VL technical report. arXiv preprint arXiv:2502.13923, 2025.

  2. [2]

    Food-101 – Mining discriminative components with random forests

    Lukas Bossard, Matthieu Guillaumin, and Luc Van Gool. Food-101 – mining discriminative components with random forests. In European Conference on Computer Vision, pages 446–461. Springer, 2014.

  3. [3]

    DearKD: Data-efficient early knowledge distillation for vision transformers

    Xianing Chen, Qiong Cao, Yujie Zhong, Jing Zhang, Shenghua Gao, and Dacheng Tao. DearKD: Data-efficient early knowledge distillation for vision transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12052–12062, 2022.

  4. [4]

    Describing textures in the wild

    Mircea Cimpoi, Subhransu Maji, Iasonas Kokkinos, Sammy Mohamed, and Andrea Vedaldi. Describing textures in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3606–3613, 2014.

  5. [5]

    The Cityscapes dataset for semantic urban scene understanding

    Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The Cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3213–3223, 2016.

  6. [6]

    Vision Transformers Need Registers

    Timothée Darcet, Maxime Oquab, Julien Mairal, and Piotr Bojanowski. Vision transformers need registers. arXiv preprint arXiv:2309.16588, 2023.

  7. [7]

    Patch n' Pack: NaViT, a vision transformer for any aspect ratio and resolution

    Mostafa Dehghani, Basil Mustafa, Josip Djolonga, Jonathan Heek, Matthias Minderer, Mathilde Caron, Andreas Steiner, Joan Puigcerver, Robert Geirhos, Ibrahim M Alabdulmohsin, et al. Patch n' Pack: NaViT, a vision transformer for any aspect ratio and resolution. Advances in Neural Information Processing Systems, 36:2252–2274, 2023.

  8. [8]

    Flex Attention: A programming model for generating optimized attention kernels

    Juechu Dong, Boyuan Feng, Driss Guessous, Yanbo Liang, and Horace He. Flex attention: A programming model for generating optimized attention kernels. arXiv preprint arXiv:2412.05496, 2024.

  9. [9]

    The Pascal Visual Object Classes (VOC) challenge

    Mark Everingham, Luc Van Gool, Christopher K. I. Williams, John Winn, and Andrew Zisserman. The Pascal visual object classes (VOC) challenge. International Journal of Computer Vision, 88(2):303–338, 2010.

  10. [10]

    Data Filtering Networks

    Alex Fang, Albin Madappally Jose, Amit Jain, Ludwig Schmidt, Alexander Toshev, and Vaishaal Shankar. Data filtering networks. arXiv preprint arXiv:2309.17425, 2023.

  11. [11]

    Learning generative visual models from few training examples: An incremental Bayesian approach tested on 101 object categories

    Li Fei-Fei, Rob Fergus, and Pietro Perona. Learning generative visual models from few training examples: An incremental Bayesian approach tested on 101 object categories. In 2004 Conference on Computer Vision and Pattern Recognition Workshop, pages 178–178. IEEE, 2004.

  12. [12]

    Learning efficient vision transformers via fine-grained manifold distillation

    Zhiwei Hao, Jianyuan Guo, Ding Jia, Kai Han, Yehui Tang, Chao Zhang, Han Hu, and Yunhe Wang. Learning efficient vision transformers via fine-grained manifold distillation. Advances in Neural Information Processing Systems, 35:9164–9175, 2022.

  13. [13]

    RADIOv2.5: Improved baselines for agglomerative vision foundation models

    Greg Heinrich, Mike Ranzinger, Hongxu Yin, Yao Lu, Jan Kautz, Andrew Tao, Bryan Catanzaro, and Pavlo Molchanov. RADIOv2.5: Improved baselines for agglomerative vision foundation models. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 22487–22497.

  14. [14]

    DINOv2 meets text: A unified framework for image- and pixel-level vision-language alignment

    Cijo Jose, Théo Moutakanni, Dahyun Kang, Federico Baldassarre, Timothée Darcet, Hu Xu, Daniel Li, Marc Szafraniec, Michaël Ramamonjisoa, Maxime Oquab, et al. DINOv2 meets text: A unified framework for image- and pixel-level vision-language alignment. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 24905–24916, 2025.

  15. [15]

    ReferItGame: Referring to objects in photographs of natural scenes

    Sahar Kazemzadeh, Vicente Ordonez, Mark Matten, and Tamara Berg. ReferItGame: Referring to objects in photographs of natural scenes. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 787–798, 2014.

  16. [16]

    Segment Anything

    Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C. Berg, Wan-Yen Lo, et al. Segment anything. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4015–4026, 2023.

  17. [17]

    Similarity of neural network representations revisited

    Simon Kornblith, Mohammad Norouzi, Honglak Lee, and Geoffrey Hinton. Similarity of neural network representations revisited. In International Conference on Machine Learning, pages 3519–3529. PMLR, 2019.

  18. [18]

    Microsoft COCO: Common objects in context

    Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, pages 740–755. Springer, 2014.

  19. [19]

    DeepSeek-VL: Towards Real-World Vision-Language Understanding

    Haoyu Lu, Wen Liu, Bo Zhang, Bingxuan Wang, Kai Dong, Bo Liu, Jingxiang Sun, Tongzheng Ren, Zhuoshu Li, Hao Yang, et al. DeepSeek-VL: Towards real-world vision-language understanding. arXiv preprint arXiv:2403.05525.

  20. [20]

    Swiss Army Knife: Synergizing biases in knowledge from vision foundation models for multi-task learning

    Yuxiang Lu, Shengcao Cao, and Yu-Xiong Wang. Swiss army knife: Synergizing biases in knowledge from vision foundation models for multi-task learning. arXiv preprint arXiv:2410.14633, 2024.

  21. [21]

    Fine-Grained Visual Classification of Aircraft

    Subhransu Maji, Esa Rahtu, Juho Kannala, Matthew Blaschko, and Andrea Vedaldi. Fine-grained visual classification of aircraft. arXiv preprint arXiv:1306.5151, 2013.

  22. [22]

    Automated flower classification over a large number of classes

    Maria-Elena Nilsback and Andrew Zisserman. Automated flower classification over a large number of classes. In 2008 Sixth Indian Conference on Computer Vision, Graphics & Image Processing, pages 722–729. IEEE, 2008.

  23. [23]

    Relational knowledge distillation

    Wonpyo Park, Dongju Kim, Yan Lu, and Minsu Cho. Relational knowledge distillation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3967–3976, 2019.

  24. [24]

    AM-RADIO: Agglomerative vision foundation model – reduce all domains into one

    Mike Ranzinger, Greg Heinrich, Jan Kautz, and Pavlo Molchanov. AM-RADIO: Agglomerative vision foundation model – reduce all domains into one. arXiv preprint arXiv:2312.06709, 2023.

  25. [25]

    PHI-S: Distribution balancing for label-free multi-teacher distillation

    Mike Ranzinger, Jon Barker, Greg Heinrich, Pavlo Molchanov, Bryan Catanzaro, and Andrew Tao. PHI-S: Distribution balancing for label-free multi-teacher distillation. arXiv preprint arXiv:2410.01680, 2024.

  26. [26]

    ImageNet large scale visual recognition challenge

    Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.

  27. [27]

    UNIC: Universal classification models via multi-teacher distillation

    Mert Bulent Sariyildiz, Philippe Weinzaepfel, Thomas Lucas, Diane Larlus, and Yannis Kalantidis. UNIC: Universal classification models via multi-teacher distillation. arXiv preprint arXiv:2408.05088, 2024.

  28. [28]

    LAION-5B: An open large-scale dataset for training next generation image-text models

    Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. LAION-5B: An open large-scale dataset for training next generation image-text models. Advances in Neural Information Processing Systems, 35:25278–25294, 2022.

  29. [29]

    Theia: Distilling diverse vision foundation models for robot learning

    Jinghuan Shang, Karl Schmeckpeper, Brandon B. May, Maria Vittoria Minniti, Tarik Kelestemur, David Watkins, and Laura Herlant. Theia: Distilling diverse vision foundation models for robot learning. arXiv preprint arXiv:2407.20179, 2024.

  30. [30]

    DINOv3

    Oriane Siméoni, Huy V. Vo, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Michaël Ramamonjisoa, et al. DINOv3. arXiv preprint arXiv:2508.10104, 2025.

  31. [31]

    SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features

    Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muhammad Ferjad Naeem, Ibrahim Alabdulmohsin, Nikhil Parthasarathy, Talfan Evans, Lucas Beyer, Ye Xia, Basil Mustafa, et al. SigLIP 2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense features. arXiv preprint arXiv:2502.14786, 2025.

  32. [32]

    Automatic data curation for self-supervised learning: A clustering-based approach

    Huy V. Vo, Vasil Khalidov, Timothée Darcet, Théo Moutakanni, Nikita Smetanin, Marc Szafraniec, Hugo Touvron, Camille Couprie, Maxime Oquab, Armand Joulin, et al. Automatic data curation for self-supervised learning: A clustering-based approach. arXiv preprint arXiv:2405.15613, 2024.

  33. [33]

    The Caltech-UCSD Birds-200-2011 dataset

    Catherine Wah, Steve Branson, Peter Welinder, Pietro Perona, and Serge Belongie. The Caltech-UCSD Birds-200-2011 dataset. 2011.

  34. [34]

    SAM-CLIP: Merging vision foundation models towards semantic and spatial understanding

    Haoxiang Wang, Pavan Kumar Anasosalu Vasu, Fartash Faghri, Raviteja Vemulapalli, Mehrdad Farajtabar, Sachin Mehta, Mohammad Rastegari, Oncel Tuzel, and Hadi Pouransari. SAM-CLIP: Merging vision foundation models towards semantic and spatial understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3635–364…

  35. [35]

    Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

    Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-VL: Enhancing vision-language model's perception of the world at any resolution. arXiv preprint arXiv:2409.12191, 2024.

  36. [36]

    TinyViT: Fast pretraining distillation for small vision transformers

    Kan Wu, Jinnian Zhang, Houwen Peng, Mengchen Liu, Bin Xiao, Jianlong Fu, and Lu Yuan. TinyViT: Fast pretraining distillation for small vision transformers. In European Conference on Computer Vision, pages 68–85. Springer, 2022.

  37. [37]

    On n-dimensional rotary positional embeddings

    Jerry Xiong. On n-dimensional rotary positional embeddings, 2025.

  38. [38]

    Qwen3 Technical Report

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025.

  39. [39]

    CLIP-KD: An empirical study of CLIP model distillation

    Chuanguang Yang, Zhulin An, Libo Huang, Junyu Bi, Xinqiang Yu, Han Yang, Boyu Diao, and Yongjun Xu. CLIP-KD: An empirical study of CLIP model distillation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15952–15962, 2024.

  40. [40]

    From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions

    Peter Young, Alice Lai, Micah Hodosh, and Julia Hockenmaier. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Transactions of the Association for Computational Linguistics, 2:67–78, 2014.

  41. [41]

    Modeling context in referring expressions

    Licheng Yu, Patrick Poirson, Shan Yang, Alexander C. Berg, and Tamara L. Berg. Modeling context in referring expressions. In European Conference on Computer Vision, pages 69–85. Springer, 2016.

  42. [42]

    LiT: Zero-shot transfer with locked-image text tuning

    Xiaohua Zhai, Xiao Wang, Basil Mustafa, Andreas Steiner, Daniel Keysers, Alexander Kolesnikov, and Lucas Beyer. LiT: Zero-shot transfer with locked-image text tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18123–18133, 2022.

  43. [43]

    Sigmoid loss for language image pre-training

    Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 11975–11986, 2023.

  44. [44]

    MiniViT: Compressing vision transformers with weight multiplexing

    Jinnian Zhang, Houwen Peng, Kan Wu, Mengchen Liu, Bin Xiao, Jianlong Fu, and Lu Yuan. MiniViT: Compressing vision transformers with weight multiplexing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12145–12154, 2022.

  45. [45]

    Scene parsing through ADE20K dataset

    Bolei Zhou, Hang Zhao, Xavier Puig, Sanja Fidler, Adela Barriuso, and Antonio Torralba. Scene parsing through ADE20K dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 633–641.

  46. [46]

    AMoE: Agglomerative Mixture-of-Experts Vision Foundation Model (Supplementary Material)

  47. [47]

    Analysis of PHI-S Transformation on Registers

    We apply PHI-S [25] to evenly distribute the statistical influence of diverse channels and teacher representations. PHI-S operates by rotating the feature space via an invertible transform, composed of PCA whitening and a Hadamard rotation, such that the variance is distributed uniformly across all chann…

  48. [48]

    Impact of Asymmetric Relational Knowledge Distillation (ARKD)

    As introduced in the main text (Section 3.2), we propose Asymmetric Relational Knowledge Distillation (ARKD) to enforce pairwise geometric consistency in the student embedding space. Here, we provide an empirical analysis of its effect on training dynamics. Figure 5 visualizes the evoluti…

  49. [49]

    Positional Encoding Analysis

    We investigate the impact of the Rotary Positional Embedding (RoPE) strategy on the student's ability to generalize to unseen high resolutions. Specifically, we compare the standard Axial RoPE against normalizing the input coordinates based on the image aspect ratio (mapping coordinates roughly to [−1, 1]) rather than usin…

  50. [50]

    Qualitative Analysis of Distilled Representations

    We provide a qualitative comparison of the distilled student features against the teacher baselines in Figure 7. Figure 4 visualizes PCA projections of glo…

  51. [51]

    Training Implementation Details

    We train our 18-layer MoE student model (d = 768, 28 experts, top-k = 6) on 4 nodes with 8×A100 GPUs each. We use the AdamW optimizer with β1 = 0.9, β2 = 0.999, and ε = 10^-15. The learning rate follows a linear decay schedule from 10^-3 to 10^-4 after a 500-step warmup, with weight decay set to 0.02. We summarize the pseudo-code of th…

  52. [52]

    Detailed Ablation Benchmarks

    We provide the full per-dataset results for our ablations. Table 8 and Table 11 detail the comparison between our curated OpenLVD200M dataset and random subsampling, highlighting the consistent gains across fine-grained classification and retrieval tasks. Similarly, Table 9 and Table 10 present the full breakdown of the ARKD…

  53. [53]

    Details on OpenLVD200M Curation

    As outlined in §3, we construct OpenLVD200M using the hierarchical clustering and sampling pipeline proposed by [32] to mitigate the long-tail biases inherent in web-scraped data. Figure 8 visually demonstrates the semantic structure captured by this process. The hierarchy orga…