Switch-KD: Visual-Switch Knowledge Distillation for Vision-Language Models
Pith reviewed 2026-05-10 11:19 UTC · model grok-4.3
The pith
Switch-KD lets a 0.5B vision-language model distill multimodal knowledge from a 3B teacher by switching visual outputs into the language pathway.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Switch-KD is a visual-switch distillation framework that unifies vision-language knowledge transfer within a shared text-probability space. It consists of Visual-Switch Distillation, which routes the student's visual outputs into the teacher's language pathway to build cross-modal probabilistic references, and the Dynamic Bi-directional Logits Difference loss, which adaptively aligns informative probability regions while preserving distributional structure through bidirectional supervision. Applied to a 0.5B student distilling from a 3B teacher, the method yields an average gain of 3.6 points across ten multimodal benchmarks with no changes to the student's architecture.
What carries the argument
Visual-Switch Distillation, which switches the student's visual outputs into the teacher's language pathway to construct cross-modal probabilistic references for implicit visual knowledge transfer.
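The routing step can be sketched in a few lines. This is a minimal sketch, assuming each "language pathway" reduces to a projection from hidden states to vocabulary logits; the matrices and names below are illustrative stand-ins, not the paper's code.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
hidden, vocab = 8, 32
teacher_head = rng.normal(size=(hidden, vocab))  # teacher's language pathway (frozen)
student_head = rng.normal(size=(hidden, vocab))  # student's own language head

# Student visual outputs: hidden states produced for 4 visual tokens.
student_visual = rng.normal(size=(4, hidden))

# Visual switch: route the student's visual outputs through the *teacher's*
# language pathway, yielding cross-modal references in text-probability space.
reference_probs = softmax(student_visual @ teacher_head)

# The student's own predictions for the same visual tokens.
student_probs = softmax(student_visual @ student_head)

# Distillation signal: KL(reference || student), one value per visual token.
kl = np.sum(reference_probs * (np.log(reference_probs) - np.log(student_probs)), axis=-1)
print(round(float(kl.mean()), 3))
```

Because both distributions live over the same vocabulary, the visual tokens receive supervision in the same space as the language tokens, which is the unification the core claim rests on.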
If this is right
- A 0.5B-parameter vision-language model can absorb multimodal knowledge from a 3B teacher and improve by 3.6 points on average across ten benchmarks.
- Multimodal knowledge transfers consistently when placed inside a single text-probability space instead of being supervised separately by modality.
- The student model requires no architectural modifications to receive the performance gain.
- The bidirectional loss keeps the original probability distributions of both teacher and student while still aligning the most useful regions.
- Knowledge distillation becomes viable for shrinking large vision-language models without sacrificing their fused multimodal capabilities.
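The bidirectional-loss bullet can be made concrete with a toy bi-directional logits-difference loss. The paper's exact DBiLD formulation, and the schedule that makes it "dynamic", is not reproduced here; the top-k size, the KL directions, and the fixed weights are illustrative assumptions.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def kl(p, q, eps=1e-9):
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))

def logit_diff_kl(lead, follow, k=4):
    """KL between softmaxed pairwise differences of the leader's top-k logits.

    Both vectors are indexed by the *leader's* top-k positions, so the
    comparison focuses on the region the leader finds informative while
    depending only on logit differences, not absolute scales."""
    idx = np.argsort(lead)[::-1][:k]
    a, b = lead[idx], follow[idx]
    da = (a[:, None] - a[None, :]).ravel()  # leader's pairwise differences
    db = (b[:, None] - b[None, :]).ravel()  # follower's, at the same positions
    return kl(softmax(da), softmax(db))

def dbild(teacher_logits, student_logits, k=4, w_t=0.5, w_s=0.5):
    t_led = logit_diff_kl(teacher_logits, student_logits, k)  # teacher-led direction
    s_led = logit_diff_kl(student_logits, teacher_logits, k)  # student-led direction
    return w_t * t_led + w_s * s_led

rng = np.random.default_rng(1)
t = rng.normal(size=32)
loss = dbild(t, t)
print(round(loss, 6))  # → 0.0 (identical logits give zero loss)
```

Working on logit differences rather than raw probabilities is what lets the loss align relative rankings while leaving each model's overall distributional shape intact.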
Where Pith is reading between the lines
- The same routing idea could be tested on student-teacher pairs that differ more sharply in architecture to see how far the cross-modal reference construction reaches.
- Resource-limited settings such as mobile or edge deployment of vision-language models would become more practical if the observed gains hold when the teacher-student size gap widens.
- Researchers could check whether removing the dynamic part of the bidirectional loss still produces most of the improvement or whether the adaptivity is essential for avoiding misalignment.
Load-bearing premise
Routing the student's visual outputs into the teacher's language pathway creates consistent cross-modal probabilistic references that transfer multimodal knowledge without introducing misalignment or losing critical visual information.
What would settle it
Re-run the distillation experiment on the same models and benchmarks but replace the visual-switch step with separate per-modality supervision and check whether the 3.6 point average gain disappears.
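The decisive comparison reduces to averaging per-benchmark deltas under the two supervision regimes. The harness below is hypothetical, and every score in it is a placeholder, not a number from the paper.

```python
def avg_gain(distilled, baseline):
    """Average per-benchmark improvement of a distilled student over its
    undistilled baseline, for matched lists of benchmark scores."""
    assert len(distilled) == len(baseline)
    return sum(d - b for d, b in zip(distilled, baseline)) / len(distilled)

baseline      = [50.0, 61.2, 44.8]  # undistilled student (placeholder scores)
visual_switch = [54.1, 64.9, 48.0]  # full Switch-KD run
per_modality  = [51.0, 61.8, 45.5]  # ablation: separate per-modality supervision

# The core claim survives only if the gain collapses without the switch:
print(avg_gain(visual_switch, baseline))
print(avg_gain(per_modality, baseline))
```

If the two averages came out comparable on the real benchmarks, the gain could not be attributed to the visual-switch mechanism.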
Original abstract
Vision-Language Models (VLMs) have shown remarkable capabilities in joint vision-language understanding, but their large scale poses significant challenges for deployment in resource-constrained scenarios. Knowledge Distillation (KD) offers a viable way to improve model capabilities without increasing model size or data requirements, making deployment more efficient. However, applying KD to VLMs is challenged by modality-specific supervision: although multimodal knowledge in VLMs is fused within the language space, current methods supervise each modality separately without explicitly addressing multimodal alignment, leading to inconsistent multimodal knowledge transfer. To address this, we propose Switch-KD, a visual-switch distillation framework that unifies vision-language knowledge transfer within a shared text-probability space. Switch-KD comprises two key components: (1) Visual-Switch Distillation, which switches the student's visual outputs into the teacher's language pathway to construct cross-modal probabilistic references for implicit visual knowledge transfer; and (2) Dynamic Bi-directional Logits Difference (DBiLD) loss, which adaptively aligns informative probability regions while preserving the distributional structures of teacher and student through bidirectional supervision. Guided by Switch-KD, a 0.5B TinyLLaVA effectively distills rich multimodal knowledge from its 3B teacher, yielding an average improvement of 3.6 points across 10 multimodal benchmarks without any architectural modification.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Switch-KD, a knowledge distillation framework for vision-language models that unifies multimodal transfer in a shared text-probability space. It consists of Visual-Switch Distillation, which routes the student's visual token outputs into the teacher's language pathway to generate cross-modal probabilistic references, and the Dynamic Bi-directional Logits Difference (DBiLD) loss, which adaptively aligns informative regions while preserving distributional structure. The central empirical claim is that this enables a 0.5B TinyLLaVA student to distill from a 3B teacher, yielding an average 3.6-point gain across 10 multimodal benchmarks with no architectural changes.
Significance. If the reported gains prove robust, Switch-KD would offer a practical route to improving small VLMs by transferring fused multimodal knowledge without increasing model size or requiring new data, addressing a key deployment bottleneck. The approach's emphasis on modality alignment in probability space is a clear conceptual contribution over separate-modality KD baselines.
Major comments (2)
- [Abstract / Experimental results] The claimed 3.6-point average improvement across 10 benchmarks is presented without error bars, standard deviations, or statistical significance tests. This makes it impossible to determine whether the gains exceed run-to-run variance or depend on specific hyperparameter choices, directly undermining attribution to the visual-switch and DBiLD components.
- [Method / DBiLD loss definition] The framework assumes that routing student visual outputs into the teacher's language pathway produces semantically consistent cross-modal references, yet no derivation, bound, or ablation shows that bidirectional logit differences recover lost visual information or prevent misalignment when visual embeddings are substituted for language tokens. If this assumption fails, the performance gains cannot be confidently attributed to unified multimodal transfer.
Minor comments (2)
- [Abstract] The abstract states gains occur 'without any architectural modification,' but the manuscript should explicitly confirm that the student and teacher share the same tokenizer and projection layers to avoid hidden interface changes.
- [Tables and figures] Figure captions and table footnotes should include the exact list of 10 benchmarks and the precise evaluation protocol (e.g., zero-shot vs. few-shot) to allow direct replication.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive feedback on our paper. We have carefully considered the major comments and provide the following point-by-point responses. We believe these clarifications and planned revisions will strengthen the manuscript.
Point-by-point responses
-
Referee: [Abstract / Experimental results] The claimed 3.6-point average improvement across 10 benchmarks is presented without error bars, standard deviations, or statistical significance tests. This makes it impossible to determine whether the gains exceed run-to-run variance or depend on specific hyperparameter choices, directly undermining attribution to the visual-switch and DBiLD components.
Authors: We agree with the referee that the absence of error bars, standard deviations, and statistical significance tests in the reported 3.6-point average improvement limits the ability to assess robustness against run-to-run variance. In the revised manuscript, we will include results from multiple training runs with mean and standard deviation, as well as appropriate statistical tests to validate the significance of the gains. This will more convincingly attribute the improvements to the Visual-Switch Distillation and DBiLD loss. Revision: yes.
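One standard form the promised check could take is a paired t-statistic over per-benchmark score differences between method and baseline. This is a sketch under placeholder scores; a real analysis would pool multiple seeds per condition and report means with standard deviations.

```python
import math

def paired_t(xs, ys):
    """Paired t-statistic: t = mean(d) / (sd(d) / sqrt(n))
    for per-benchmark differences d = x - y."""
    d = [x - y for x, y in zip(xs, ys)]
    n = len(d)
    mean = sum(d) / n
    var = sum((v - mean) ** 2 for v in d) / (n - 1)  # sample variance
    return mean / math.sqrt(var / n)

method   = [54.1, 64.9, 48.0, 71.3, 59.5]  # per-benchmark scores, one seed (placeholders)
baseline = [50.2, 61.4, 44.9, 68.0, 55.8]
t = paired_t(method, baseline)
print(round(t, 2))  # compare against a t-distribution with n-1 = 4 dof
```

A large t with consistent per-benchmark signs is exactly what would distinguish a real 3.6-point gain from run-to-run noise.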
-
Referee: [Method / DBiLD loss definition] The framework assumes that routing student visual outputs into the teacher's language pathway produces semantically consistent cross-modal references, yet no derivation, bound, or ablation shows that bidirectional logit differences recover lost visual information or prevent misalignment when visual embeddings are substituted for language tokens. If this assumption fails, the performance gains cannot be confidently attributed to unified multimodal transfer.
Authors: We acknowledge that the manuscript does not provide a theoretical derivation or bound demonstrating that the bidirectional logit differences recover visual information or prevent misalignment. The approach relies on the empirical observation that operating in the shared text-probability space allows for consistent cross-modal transfer. To address this concern, we will add a dedicated ablation study in the revised version that isolates the effect of the visual switching and DBiLD components on alignment, along with further discussion in the method section on the rationale for semantic consistency. Revision: yes.
Circularity Check
No significant circularity; Switch-KD components are independently defined
Full rationale
The paper defines Visual-Switch Distillation and DBiLD loss as novel constructs that route student visual outputs into the teacher's language pathway and apply bidirectional logit alignment. These are presented as new mechanisms rather than reductions of fitted parameters or prior results. The reported 3.6-point benchmark gains are empirical outcomes of applying the method, not predictions forced by construction or self-citation chains. No equations or claims reduce the central result to its own inputs; the framework remains self-contained against external benchmarks.