Recognition: no theorem link
SigLino: Efficient Multi-Teacher Distillation for Agglomerative Vision Foundation Models
Pith reviewed 2026-05-16 20:19 UTC · model grok-4.3
The pith
Distilling SigLIP2 and DINOv3 simultaneously into dense and MoE students produces vision representations that initialize stronger early-fusion Grounding-VLMs than training from scratch.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
SigLino models are created by simultaneously distilling SigLIP2 and DINOv3 into dense and Mixture-of-Experts students using an asymmetric relation-knowledge distillation loss, token-balanced batching, and hierarchical data sampling on the OpenLVD200M corpus; the resulting representations transfer to early-fusion Grounding-VLMs and outperform models trained from scratch.
What carries the argument
The Asymmetric Relation-Knowledge Distillation loss, which maintains separate geometric relations from each teacher during joint distillation into the student.
Load-bearing premise
The asymmetric loss can preserve the distinct geometric properties of both teachers at once without one distorting the other inside the shared student representation.
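To make the load-bearing mechanism concrete, here is a minimal sketch of a per-teacher relational distillation loss in the spirit of relational KD [23]. Everything here is illustrative: the review does not specify the paper's exact relation kernel, asymmetry, or weighting, so the head names and coefficients below are assumptions, not the authors' formulation.

```python
import torch
import torch.nn.functional as F

def relation_matrix(feats: torch.Tensor) -> torch.Tensor:
    """Pairwise cosine-similarity relations over a batch of embeddings (B, D)."""
    feats = F.normalize(feats, dim=-1)
    return feats @ feats.t()

def arkd_loss(student_siglip_head: torch.Tensor,
              student_dino_head: torch.Tensor,
              siglip_feats: torch.Tensor,
              dino_feats: torch.Tensor,
              w_siglip: float = 1.0,
              w_dino: float = 1.0) -> torch.Tensor:
    """Match each teacher's relation geometry through its own student head,
    so the two geometries are preserved side by side rather than averaged."""
    loss_siglip = F.smooth_l1_loss(relation_matrix(student_siglip_head),
                                   relation_matrix(siglip_feats).detach())
    loss_dino = F.smooth_l1_loss(relation_matrix(student_dino_head),
                                 relation_matrix(dino_feats).detach())
    return w_siglip * loss_siglip + w_dino * loss_dino
```

Framed this way, the premise becomes observable during training: if one relation term dominates, the other head's relation error should visibly rise.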
What would settle it
Train a SigLino-MoE student with the proposed loss on OpenLVD200M and measure its downstream grounding accuracy against an identical-architecture model trained from scratch on the same corpus; if the distilled version shows no gain, the central transfer claim does not hold.
Original abstract
Vision foundation models trained via multi-teacher distillation offer a promising path toward unified visual representations, yet the learning dynamics and data efficiency of such approaches remain underexplored. In this paper, we systematically study multi-teacher distillation for vision foundation models and identify key factors that enable training at lower computational cost. We introduce SigLino, an efficient family of agglomerative vision foundation models that distill knowledge from SigLIP2 and DINOv3 simultaneously into Dense and Mixture-of-Experts students. We show that (1) our Asymmetric Relation-Knowledge Distillation loss preserves the geometric properties of each teacher while enabling effective knowledge transfer, (2) token-balanced batching that packs varying-resolution images into sequences with uniform token budgets stabilizes representation learning across resolutions without sacrificing performance, (3) hierarchical clustering and sampling of training data, typically reserved for self-supervised learning, substantially improves sample efficiency over random sampling for multi-teacher distillation, and (4) the resulting representations transfer effectively to early-fusion Grounding-VLMs, outperforming models trained from scratch. By combining these findings, we curate OpenLVD200M, a 200M-image corpus that demonstrates superior efficiency for multi-teacher distillation. Instantiated in a Mixture-of-Experts, our SigLino-MoE initializes an early-fusion Grounding-VLM that replaces the conventional ViT->LLM stack, demonstrating improved performance compared to a model trained from scratch. We release OpenLVD200M and five distilled checkpoints comprising MoE and dense variants.
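The token-balanced batching in claim (2) is at heart a bin-packing problem: images contribute different patch-token counts depending on resolution, and each training sequence is filled to a uniform token budget. A minimal greedy sketch follows; the paper's actual packing policy (NaViT-style [7] or otherwise) is not detailed in this review, so treat this as one plausible instantiation.

```python
def pack_token_balanced(token_counts: list[int], budget: int) -> list[list[int]]:
    """Greedy first-fit-decreasing packing of per-image token counts into
    sequences whose total token count never exceeds a uniform budget."""
    sequences: list[list] = []  # each entry: [image indices, tokens used]
    order = sorted(range(len(token_counts)), key=lambda i: -token_counts[i])
    for i in order:
        n = token_counts[i]
        for seq in sequences:
            if seq[1] + n <= budget:
                seq[0].append(i)
                seq[1] += n
                break
        else:
            sequences.append([[i], n])
    return [indices for indices, _ in sequences]

# e.g. pack_token_balanced([1024, 576, 256, 196], budget=1280)
# -> [[0, 2], [1, 3]]: two sequences of 1280 and 772 tokens.
```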
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces SigLino, an efficient family of agglomerative vision foundation models obtained by distilling knowledge simultaneously from SigLIP2 and DINOv3 into dense and Mixture-of-Experts student architectures. It proposes an Asymmetric Relation-Knowledge Distillation loss, token-balanced batching for multi-resolution inputs, and hierarchical clustering/sampling of training data to improve sample efficiency. The authors release the OpenLVD200M corpus (200M images) and five checkpoints, and demonstrate that the resulting representations initialize early-fusion Grounding-VLMs that outperform models trained from scratch on transfer tasks.
Significance. If the empirical claims are substantiated, the work is significant for advancing practical multi-teacher distillation of vision foundation models at reduced computational cost. The combination of the proposed loss, batching, and data-sampling strategies, together with the public release of OpenLVD200M and the distilled checkpoints, provides concrete tools and data for the community. The downstream transfer results to early-fusion Grounding-VLMs, if robust, would indicate a viable path toward unified visual representations without training large models from scratch.
major comments (2)
- [Abstract and §3 (Method)] Claim (1) of the abstract and the corresponding method section describing the Asymmetric Relation-Knowledge Distillation loss: the assertion that this loss simultaneously preserves the geometric properties of both SigLIP2 (contrastive alignment) and DINOv3 (self-supervised patch geometry) lacks direct supporting diagnostics. No CKA similarity, Procrustes alignment, or per-teacher retrieval metrics are reported to demonstrate that the student retains high-fidelity structure from each teacher rather than converging to a compromise embedding; a minimal CKA sketch follows these comments.
- [§5 (Experiments)] Experimental evaluation section (transfer results to Grounding-VLMs): the claim that SigLino-MoE initializes superior early-fusion models compared with from-scratch training is presented without visible baseline details, error bars, number of runs, or ablation isolating the contribution of the asymmetric loss versus the data-sampling strategy. This makes it difficult to verify that the reported gains are attributable to the distillation rather than other factors.
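The diagnostic missing in the first major comment is cheap to run: linear CKA [17] between student and teacher features on the same inputs. A minimal sketch; layer choice, pooling, and the evaluation sample set are left to the authors.

```python
import torch

def linear_cka(X: torch.Tensor, Y: torch.Tensor) -> torch.Tensor:
    """Linear CKA between representations X (n, d1) and Y (n, d2) computed
    on the same n inputs: ||X^T Y||_F^2 / (||X^T X||_F * ||Y^T Y||_F)."""
    X = X - X.mean(dim=0, keepdim=True)
    Y = Y - Y.mean(dim=0, keepdim=True)
    cross = torch.linalg.norm(X.t() @ Y) ** 2
    return cross / (torch.linalg.norm(X.t() @ X) * torch.linalg.norm(Y.t() @ Y))
```

Reporting linear_cka(student, siglip) and linear_cka(student, dino) side by side, per layer, would directly show whether the student stays close to both teachers or drifts toward one.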
minor comments (2)
- [Abstract and §3] Ensure consistent definition of acronyms (VLM, MoE, etc.) on first appearance and clarify the exact form of the asymmetric loss (weighting coefficients, relation vs. knowledge terms) with an equation reference.
- [Figures and Tables] Figure captions and tables should explicitly state the evaluation metrics and dataset splits used for the Grounding-VLM transfer experiments.
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive comments. We agree that additional diagnostics and experimental details will strengthen the manuscript and have revised the paper accordingly to address both major points.
Point-by-point responses
-
Referee: [Abstract and §3 (Method)] Claim (1) of the abstract and the corresponding method section describing the Asymmetric Relation-Knowledge Distillation loss: the assertion that this loss simultaneously preserves the geometric properties of both SigLIP2 (contrastive alignment) and DINOv3 (self-supervised patch geometry) lacks direct supporting diagnostics. No CKA similarity, Procrustes alignment, or per-teacher retrieval metrics are reported to demonstrate that the student retains high-fidelity structure from each teacher rather than converging to a compromise embedding.
Authors: We agree that direct diagnostics are valuable for substantiating the claim. In the revised manuscript we have added CKA similarity matrices computed between the student embeddings and each teacher separately, Procrustes alignment distances, and per-teacher zero-shot retrieval metrics on ImageNet and COCO. These results appear in a new subsection of §3 and in the supplementary material; they show that the student retains high fidelity to both teachers rather than collapsing to an intermediate representation. revision: yes
-
Referee: [§5 (Experiments)] Experimental evaluation section (transfer results to Grounding-VLMs): the claim that SigLino-MoE initializes superior early-fusion models compared with from-scratch training is presented without visible baseline details, error bars, number of runs, or ablation isolating the contribution of the asymmetric loss versus the data-sampling strategy. This makes it difficult to verify that the reported gains are attributable to the distillation rather than other factors.
Authors: We acknowledge the need for greater experimental rigor. The revised §5 now reports mean and standard deviation over three independent runs with different random seeds, explicitly lists all baseline hyperparameters and training schedules for the from-scratch models, and includes a new ablation table that isolates the asymmetric loss from the hierarchical sampling and token-balanced batching. These additions allow readers to attribute performance gains to the distillation components. revision: yes
Circularity Check
No circularity: the empirical claims are validated on held-out benchmarks, independent of the paper's own constructions
Full rationale
The paper reports empirical findings on multi-teacher distillation using an asymmetric loss, token-balanced batching, and hierarchical data sampling. All claims (preservation of teacher geometries, transfer to Grounding-VLMs) are validated via held-out benchmarks and comparisons to scratch-trained baselines rather than any derivation that reduces to fitted parameters or self-citations by construction. No equations, uniqueness theorems, or ansatzes are invoked that loop back to the paper's own definitions or prior self-work.
Axiom & Free-Parameter Ledger
free parameters (2)
- loss balancing coefficients
- clustering hyperparameters
Reference graph
Works this paper leans on
- [1] Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-VL technical report. arXiv preprint arXiv:2502.13923, 2025.
- [2] Lukas Bossard, Matthieu Guillaumin, and Luc Van Gool. Food-101: Mining discriminative components with random forests. In European Conference on Computer Vision, pages 446–461. Springer, 2014.
- [3] Xianing Chen, Qiong Cao, Yujie Zhong, Jing Zhang, Shenghua Gao, and Dacheng Tao. DearKD: Data-efficient early knowledge distillation for vision transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12052–12062, 2022.
- [4] Mircea Cimpoi, Subhransu Maji, Iasonas Kokkinos, Sammy Mohamed, and Andrea Vedaldi. Describing textures in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3606–3613, 2014.
- [5] Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The Cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3213–3223, 2016.
- [6] Timothée Darcet, Maxime Oquab, Julien Mairal, and Piotr Bojanowski. Vision transformers need registers. arXiv preprint arXiv:2309.16588, 2023.
- [7] Mostafa Dehghani, Basil Mustafa, Josip Djolonga, Jonathan Heek, Matthias Minderer, Mathilde Caron, Andreas Steiner, Joan Puigcerver, Robert Geirhos, Ibrahim M Alabdulmohsin, et al. Patch n' Pack: NaViT, a vision transformer for any aspect ratio and resolution. Advances in Neural Information Processing Systems, 36:2252–2274, 2023.
- [8] Juechu Dong, Boyuan Feng, Driss Guessous, Yanbo Liang, and Horace He. Flex Attention: A programming model for generating optimized attention kernels. arXiv preprint arXiv:2412.05496, 2024.
- [9] Mark Everingham, Luc Van Gool, Christopher KI Williams, John Winn, and Andrew Zisserman. The PASCAL Visual Object Classes (VOC) challenge. International Journal of Computer Vision, 88(2):303–338, 2010.
- [10] Alex Fang, Albin Madappally Jose, Amit Jain, Ludwig Schmidt, Alexander Toshev, and Vaishaal Shankar. Data filtering networks. arXiv preprint arXiv:2309.17425, 2023.
- [11] Li Fei-Fei, Rob Fergus, and Pietro Perona. Learning generative visual models from few training examples: An incremental Bayesian approach tested on 101 object categories. In 2004 Conference on Computer Vision and Pattern Recognition Workshop, pages 178–178. IEEE, 2004.
- [12] Zhiwei Hao, Jianyuan Guo, Ding Jia, Kai Han, Yehui Tang, Chao Zhang, Han Hu, and Yunhe Wang. Learning efficient vision transformers via fine-grained manifold distillation. Advances in Neural Information Processing Systems, 35:9164–9175, 2022.
- [13] Greg Heinrich, Mike Ranzinger, Hongxu Yin, Yao Lu, Jan Kautz, Andrew Tao, Bryan Catanzaro, and Pavlo Molchanov. RADIOv2.5: Improved baselines for agglomerative vision foundation models. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 22487–22497, 2025.
- [14] Cijo Jose, Théo Moutakanni, Dahyun Kang, Federico Baldassarre, Timothée Darcet, Hu Xu, Daniel Li, Marc Szafraniec, Michaël Ramamonjisoa, Maxime Oquab, et al. DINOv2 meets text: A unified framework for image- and pixel-level vision-language alignment. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 24905–24916, 2025.
- [15] Sahar Kazemzadeh, Vicente Ordonez, Mark Matten, and Tamara Berg. ReferItGame: Referring to objects in photographs of natural scenes. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 787–798, 2014.
- [16] Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Segment Anything. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4015–4026, 2023.
- [17] Simon Kornblith, Mohammad Norouzi, Honglak Lee, and Geoffrey Hinton. Similarity of neural network representations revisited. In International Conference on Machine Learning, pages 3519–3529. PMLR, 2019.
- [18] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, pages 740–755. Springer, 2014.
- [19] Haoyu Lu, Wen Liu, Bo Zhang, Bingxuan Wang, Kai Dong, Bo Liu, Jingxiang Sun, Tongzheng Ren, Zhuoshu Li, Hao Yang, et al. DeepSeek-VL: Towards real-world vision-language understanding. arXiv preprint arXiv:2403.05525, 2024.
- [20] Yuxiang Lu, Shengcao Cao, and Yu-Xiong Wang. Swiss Army Knife: Synergizing biases in knowledge from vision foundation models for multi-task learning. arXiv preprint arXiv:2410.14633, 2024.
- [21] Subhransu Maji, Esa Rahtu, Juho Kannala, Matthew Blaschko, and Andrea Vedaldi. Fine-grained visual classification of aircraft. arXiv preprint arXiv:1306.5151, 2013.
- [22] Maria-Elena Nilsback and Andrew Zisserman. Automated flower classification over a large number of classes. In 2008 Sixth Indian Conference on Computer Vision, Graphics & Image Processing, pages 722–729. IEEE, 2008.
- [23] Wonpyo Park, Dongju Kim, Yan Lu, and Minsu Cho. Relational knowledge distillation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3967–3976, 2019.
- [24] Mike Ranzinger, Greg Heinrich, Jan Kautz, and Pavlo Molchanov. AM-RADIO: Agglomerative vision foundation model - reduce all domains into one. arXiv preprint arXiv:2312.06709, 2023.
- [25] Mike Ranzinger, Jon Barker, Greg Heinrich, Pavlo Molchanov, Bryan Catanzaro, and Andrew Tao. PHI-S: Distribution balancing for label-free multi-teacher distillation. arXiv preprint arXiv:2410.01680, 2024.
- [26] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
- [27] Mert Bulent Sariyildiz, Philippe Weinzaepfel, Thomas Lucas, Diane Larlus, and Yannis Kalantidis. UNIC: Universal classification models via multi-teacher distillation. arXiv preprint arXiv:2408.05088, 2024.
- [28] Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. LAION-5B: An open large-scale dataset for training next generation image-text models. Advances in Neural Information Processing Systems, 35:25278–25294, 2022.
- [29] Jinghuan Shang, Karl Schmeckpeper, Brandon B May, Maria Vittoria Minniti, Tarik Kelestemur, David Watkins, and Laura Herlant. Theia: Distilling diverse vision foundation models for robot learning. arXiv preprint arXiv:2407.20179, 2024.
- [30] Oriane Siméoni, Huy V. Vo, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Michaël Ramamonjisoa, et al. DINOv3. arXiv preprint arXiv:2508.10104, 2025.
- [31] Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muhammad Ferjad Naeem, Ibrahim Alabdulmohsin, Nikhil Parthasarathy, Talfan Evans, Lucas Beyer, Ye Xia, Basil Mustafa, et al. SigLIP 2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense features. arXiv preprint arXiv:2502.14786, 2025.
- [32] Huy V. Vo, Vasil Khalidov, Timothée Darcet, Théo Moutakanni, Nikita Smetanin, Marc Szafraniec, Hugo Touvron, Camille Couprie, Maxime Oquab, Armand Joulin, et al. Automatic data curation for self-supervised learning: A clustering-based approach. arXiv preprint arXiv:2405.15613, 2024.
- [33] Catherine Wah, Steve Branson, Peter Welinder, Pietro Perona, and Serge Belongie. The Caltech-UCSD Birds-200-2011 dataset. 2011.
- [34] Haoxiang Wang, Pavan Kumar Anasosalu Vasu, Fartash Faghri, Raviteja Vemulapalli, Mehrdad Farajtabar, Sachin Mehta, Mohammad Rastegari, Oncel Tuzel, and Hadi Pouransari. SAM-CLIP: Merging vision foundation models towards semantic and spatial understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3635–364..., 2024.
- [35] Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-VL: Enhancing vision-language model's perception of the world at any resolution. arXiv preprint arXiv:2409.12191, 2024.
- [36] Kan Wu, Jinnian Zhang, Houwen Peng, Mengchen Liu, Bin Xiao, Jianlong Fu, and Lu Yuan. TinyViT: Fast pretraining distillation for small vision transformers. In European Conference on Computer Vision, pages 68–85. Springer, 2022.
- [37] Jerry Xiong. On n-dimensional rotary positional embeddings, 2025.
- [38] An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025.
- [39] Chuanguang Yang, Zhulin An, Libo Huang, Junyu Bi, Xinqiang Yu, Han Yang, Boyu Diao, and Yongjun Xu. CLIP-KD: An empirical study of CLIP model distillation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15952–15962, 2024.
- [40] Peter Young, Alice Lai, Micah Hodosh, and Julia Hockenmaier. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Transactions of the Association for Computational Linguistics, 2:67–78, 2014.
- [41] Licheng Yu, Patrick Poirson, Shan Yang, Alexander C Berg, and Tamara L Berg. Modeling context in referring expressions. In European Conference on Computer Vision, pages 69–85. Springer, 2016.
- [42] Xiaohua Zhai, Xiao Wang, Basil Mustafa, Andreas Steiner, Daniel Keysers, Alexander Kolesnikov, and Lucas Beyer. LiT: Zero-shot transfer with locked-image text tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18123–18133, 2022.
- [43] Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 11975–11986, 2023.
- [44] Jinnian Zhang, Houwen Peng, Kan Wu, Mengchen Liu, Bin Xiao, Jianlong Fu, and Lu Yuan. MiniViT: Compressing vision transformers with weight multiplexing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12145–12154, 2022.
- [45] Bolei Zhou, Hang Zhao, Xavier Puig, Sanja Fidler, Adela Barriuso, and Antonio Torralba. Scene parsing through ADE20K dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 633–641, 2017.
Supplementary material excerpts (AMoE: Agglomerative Mixture-of-Experts Vision Foundation Model)
Analysis of PHI-S transformation on registers. PHI-S [25] is applied to evenly distribute the statistical influence of diverse channels and teacher representations. PHI-S operates by rotating the feature space via an invertible transform, composed of PCA whitening and a Hadamard rotation, such that the variance is distributed uniformly across all channels...
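As a rough illustration of the transform described in this excerpt (the exact formulation is in [25]): rotate features into the PCA basis, mix eigenvalues with a normalized Hadamard rotation so every channel's variance equals the mean eigenvalue, then standardize with one global scale. The sketch assumes the feature dimension is a power of two, which Hadamard matrices require; this is one reading of the uniform-variance claim, not the authors' code.

```python
import numpy as np
from scipy.linalg import hadamard

def phi_s_like(feats: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """PHI-S-style balancing sketch on (n, d) features: PCA rotation,
    orthonormal Hadamard rotation, then one isotropic scale. After the
    Hadamard step each channel's variance is the mean PCA eigenvalue."""
    n, d = feats.shape
    centered = feats - feats.mean(axis=0)
    eigvals, eigvecs = np.linalg.eigh(np.cov(centered, rowvar=False))
    rotation = (hadamard(d) / np.sqrt(d)) @ eigvecs.T  # invertible (orthogonal)
    return centered @ rotation.T / np.sqrt(eigvals.mean() + eps)
```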
Impact of Asymmetric Relational Knowledge Distillation (ARKD). As introduced in the main text (Section 3.2), ARKD is proposed to enforce pairwise geometric consistency in the student embedding space. The supplement provides an empirical analysis of its effect on training dynamics; Figure 5 visualizes the evolution...
Positional encoding analysis. The supplement investigates the impact of the Rotary Positional Embedding (RoPE) strategy on the student's ability to generalize to unseen high resolutions, specifically comparing standard Axial RoPE against normalizing the input coordinates based on the image aspect ratio (mapping coordinates roughly to [-1, 1]) rather than using...
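The excerpt truncates before the exact normalization rule, so the following is only an assumed form of the compared variant: map the patch grid to roughly [-1, 1], dividing by the longer side so a non-square image keeps its aspect ratio in RoPE phase space.

```python
import numpy as np

def normalized_patch_coords(h: int, w: int):
    """Patch-grid coordinates mapped to roughly [-1, 1], normalized by the
    longer side (assumed variant; not the paper's verbatim rule)."""
    s = max(h, w)
    ys = (2.0 * np.arange(h) - (h - 1)) / max(s - 1, 1)
    xs = (2.0 * np.arange(w) - (w - 1)) / max(s - 1, 1)
    return np.meshgrid(ys, xs, indexing="ij")
```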
Qualitative analysis of distilled representations. A qualitative comparison of the distilled student features against the teacher baselines is provided in Figure 7. [Figure 4: PCA projections of global, patch, and DINO register representations for original and synthetic data, before and after the PHI-S transform.]
Training implementation details. The 18-layer MoE student (d = 768, 28 experts, top-k = 6) is trained on 4 nodes with 8 A100 GPUs each, using the AdamW optimizer with β1 = 0.9, β2 = 0.999, and ε = 10^-15. The learning rate follows a linear decay schedule from 10^-3 to 10^-4 after a 500-step warmup, with weight decay set to 0.02. The supplement also summarizes the pseudo-code of the...
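The stated hyperparameters map directly onto a PyTorch configuration; the model stand-in and total step count below are placeholders, since the excerpt gives neither.

```python
import torch

model = torch.nn.Linear(768, 768)  # placeholder for the MoE student

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3,
                              betas=(0.9, 0.999), eps=1e-15,
                              weight_decay=0.02)

def lr_at(step: int, warmup: int = 500, total: int = 100_000,
          lr_max: float = 1e-3, lr_min: float = 1e-4) -> float:
    """Linear warmup to lr_max, then linear decay to lr_min; `total` is a
    placeholder, not a value stated in the excerpt."""
    if step < warmup:
        return lr_max * step / warmup
    frac = min((step - warmup) / max(total - warmup, 1), 1.0)
    return lr_max + frac * (lr_min - lr_max)

# inside the training loop:
#   for g in optimizer.param_groups:
#       g["lr"] = lr_at(step)
```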
Detailed ablation benchmarks. Full per-dataset results for the ablations: Tables 8 and 11 detail the comparison between the curated OpenLVD200M dataset and random subsampling, highlighting consistent gains across fine-grained classification and retrieval tasks; Tables 9 and 10 present the full breakdown of the ARKD...
Details on OpenLVD200M curation. As outlined in §3, OpenLVD200M is constructed using the hierarchical clustering and sampling pipeline proposed by [32] to mitigate the long-tail biases inherent in web-scraped data. Figure 8 visually demonstrates the semantic structure captured by this process; the hierarchy organizes...
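A single-level sketch of cluster-balanced curation in the spirit of [32]; the actual pipeline is hierarchical, and n_clusters / per_cluster are illustrative knobs, not the paper's values. The idea: cluster image embeddings, then draw an equal quota per cluster so the head clusters of web data stop dominating.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_balanced_sample(embeddings: np.ndarray, n_clusters: int,
                            per_cluster: int, seed: int = 0) -> np.ndarray:
    """k-means over image embeddings, then uniform sampling per cluster,
    flattening the long-tail concentration of web-scraped data."""
    labels = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=seed).fit_predict(embeddings)
    rng = np.random.default_rng(seed)
    keep = [rng.choice(np.flatnonzero(labels == c),
                       size=min(per_cluster, int((labels == c).sum())),
                       replace=False)
            for c in range(n_clusters)]
    return np.concatenate(keep)
```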