pith. machine review for the scientific record.

arxiv: 2604.14629 · v1 · submitted 2026-04-16 · 💻 cs.CV


Switch-KD: Visual-Switch Knowledge Distillation for Vision-Language Models


Pith reviewed 2026-05-10 11:19 UTC · model grok-4.3

classification 💻 cs.CV
keywords knowledge distillation · vision-language models · multimodal transfer · model compression · visual switch · cross-modal alignment · logits difference loss

The pith

Switch-KD lets a 0.5B vision-language model distill multimodal knowledge from a 3B teacher by switching visual outputs into the language pathway.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to fix inconsistent multimodal knowledge transfer during distillation of vision-language models. Existing methods supervise vision and language separately even though the models fuse knowledge inside the language space, which breaks alignment during transfer. Switch-KD instead routes the student's visual outputs directly into the teacher's language pathway and adds a bidirectional loss that aligns key probability regions while keeping both models' distributions intact. If this works, smaller models can inherit richer multimodal abilities from larger ones without growing in size or needing more data. That would make capable vision-language systems practical in settings where compute and memory are limited.

Core claim

Switch-KD is a visual-switch distillation framework that unifies vision-language knowledge transfer within a shared text-probability space. It consists of Visual-Switch Distillation, which routes the student's visual outputs into the teacher's language pathway to build cross-modal probabilistic references, and the Dynamic Bi-directional Logits Difference loss, which adaptively aligns informative probability regions while preserving distributional structures through bidirectional supervision. When applied to a 0.5B student model and a 3B teacher, the method produces an average 3.6-point gain across ten multimodal benchmarks with no changes to the student's architecture.

What carries the argument

Visual-Switch Distillation, which switches the student's visual outputs into the teacher's language pathway to construct cross-modal probabilistic references for implicit visual knowledge transfer.
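
To make the routing concrete, here is a minimal PyTorch-style sketch of how a visual-switch pass could be wired, assuming HuggingFace-style modules and a shared tokenizer between teacher and student. Every name here (vision_encoder, vision_projector, to_teacher_dim, language_model) is a hypothetical stand-in rather than the paper's actual interface.

```python
import torch
import torch.nn.functional as F

def visual_switch_logits(student, teacher, images, text_ids):
    """Hypothetical sketch: run the teacher's language pathway on the student's
    visual tokens to obtain cross-modal reference logits in text-probability space."""
    # Student encodes the image and projects visual features into its LM embedding space.
    vis_tokens = student.vision_projector(student.vision_encoder(images))   # (B, Nv, D_s)

    # Assumed adapter mapping student visual embeddings to the teacher's hidden size.
    vis_tokens = student.to_teacher_dim(vis_tokens)                         # (B, Nv, D_t)

    # Teacher embeds the text prompt as usual, then consumes the switched visual tokens.
    txt_embeds = teacher.language_model.embed_tokens(text_ids)              # (B, Nt, D_t)
    fused = torch.cat([vis_tokens, txt_embeds], dim=1)
    return teacher.language_model(inputs_embeds=fused).logits               # (B, Nv+Nt, V)

def visual_switch_kd_loss(student_logits, ref_logits, tau=1.0):
    """Standard KD term aligning the student with the visual-switch reference,
    assuming teacher and student share a vocabulary."""
    # Reference treated as a fixed target here; the paper may instead backpropagate
    # through the frozen teacher to the student's visual projector.
    p_ref = F.softmax(ref_logits.detach() / tau, dim=-1)
    log_p_stu = F.log_softmax(student_logits / tau, dim=-1)
    return F.kl_div(log_p_stu, p_ref, reduction="batchmean") * tau ** 2
```

If teacher and student do not share a tokenizer or hidden size, the adapter and the shared-vocabulary assumption above are exactly the interface points the referee's first minor comment asks the authors to make explicit.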

If this is right

  • A 0.5B parameter vision-language model can absorb multimodal knowledge from a 3B teacher and improve 3.6 points on average across ten benchmarks.
  • Multimodal knowledge transfers consistently when placed inside a single text-probability space instead of being supervised separately by modality.
  • The student model requires no architectural modifications to receive the performance gain.
  • The bidirectional loss keeps the original probability distributions of both teacher and student while still aligning the most useful regions (a minimal sketch of such a loss follows this list).
  • Knowledge distillation becomes viable for shrinking large vision-language models without sacrificing their fused multimodal capabilities.
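
A sketch of the bidirectional loss referenced in the list above, put together from the description here and the Figure 2 caption (knee points k_t and k_s, two truncated regions, supervision in both directions). The knee detector, the use of pairwise logit differences, and the MSE matching are illustrative assumptions, not the paper's definition.

```python
import torch
import torch.nn.functional as F

def knee_index(sorted_logits: torch.Tensor) -> int:
    """Kneedle-style knee detection on one logit vector sorted in descending order:
    the index farthest below the chord joining the first and last values."""
    v = sorted_logits.numel()
    x = torch.linspace(0.0, 1.0, v, device=sorted_logits.device)
    y = (sorted_logits - sorted_logits[-1]) / (sorted_logits[0] - sorted_logits[-1] + 1e-8)
    chord = 1.0 - x                                  # straight line from (0, 1) down to (1, 0)
    return int(torch.argmax(chord - y))

def dbild_loss(t_logits: torch.Tensor, s_logits: torch.Tensor, tau: float = 2.0) -> torch.Tensor:
    """Sketch of a dynamic bi-directional logits-difference loss for one token position.
    Teacher-led direction: match pairwise logit differences on the teacher's top-k_t ids;
    student-led direction: the reverse, on the student's top-k_s ids. Matching differences
    rather than raw probabilities is what leaves each model's overall distribution intact."""
    t_logits = t_logits.detach()                     # only the student receives gradients
    t_sorted, t_idx = t_logits.sort(descending=True)
    s_sorted, s_idx = s_logits.sort(descending=True)
    k_t = max(2, knee_index(t_sorted))               # dynamic truncation points
    k_s = max(2, knee_index(s_sorted))

    def diff_matrix(x: torch.Tensor) -> torch.Tensor:
        return (x.unsqueeze(0) - x.unsqueeze(1)) / tau   # pairwise logit differences

    # Teacher-led term: both models evaluated on the teacher's most informative region.
    loss_t2s = F.mse_loss(diff_matrix(s_logits[t_idx[:k_t]]), diff_matrix(t_logits[t_idx[:k_t]]))
    # Student-led term: the reverse direction, on the student's most informative region.
    loss_s2t = F.mse_loss(diff_matrix(s_logits[s_idx[:k_s]]), diff_matrix(t_logits[s_idx[:k_s]]))
    return loss_t2s + loss_s2t
```

In practice this would be evaluated per response-token position and averaged; the single-vector form and the unbatched knee detection are kept only for readability.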

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same routing idea could be tested on student-teacher pairs that differ more sharply in architecture to see how far the cross-modal reference construction reaches.
  • Resource-limited settings such as mobile or edge deployment of vision-language models would become more practical if the observed gains hold when the teacher-student size gap widens.
  • Researchers could check whether removing the dynamic part of the bidirectional loss still produces most of the improvement or whether the adaptivity is essential for avoiding misalignment.

Load-bearing premise

Routing the student's visual outputs into the teacher's language pathway creates consistent cross-modal probabilistic references that transfer multimodal knowledge without introducing misalignment or losing critical visual information.

What would settle it

Re-run the distillation experiment on the same models and benchmarks but replace the visual-switch step with separate per-modality supervision and check whether the 3.6 point average gain disappears.
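
A sketch of what that control condition could look like in code, toggling between the unified visual-switch objective and separate per-modality supervision. It reuses the hypothetical visual_switch_logits helper from the earlier sketch; the visual_features attribute and the exact form of the per-modality baseline are likewise assumptions.

```python
import torch.nn.functional as F

def distill_step(student, teacher, batch, use_visual_switch: bool = True):
    """Hypothetical ablation harness: one training step under either supervision scheme."""
    s_out = student(batch["images"], batch["text_ids"])   # assumed to expose .logits / .visual_features
    t_out = teacher(batch["images"], batch["text_ids"])

    if use_visual_switch:
        # Unified supervision in the shared text-probability space
        # (student and reference logits assumed to have matching shapes).
        ref = visual_switch_logits(student, teacher, batch["images"], batch["text_ids"])
        loss = F.kl_div(F.log_softmax(s_out.logits, dim=-1),
                        F.softmax(ref.detach(), dim=-1), reduction="batchmean")
    else:
        # Control condition: vision and language supervised separately
        # (dimension-matching projection between feature spaces omitted).
        vis_loss = F.mse_loss(s_out.visual_features, t_out.visual_features.detach())
        txt_loss = F.kl_div(F.log_softmax(s_out.logits, dim=-1),
                            F.softmax(t_out.logits.detach(), dim=-1), reduction="batchmean")
        loss = vis_loss + txt_loss
    return loss
```

The settling check is then whether the mean score over the same ten benchmarks, averaged over seeds, still separates the two conditions by roughly the reported 3.6 points.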

Figures

Figures reproduced from arXiv: 2604.14629 by Haoyi Sun, Lifu Mu, Ning Mao, Qian Wang, Tao Wei, Wei Chen, Wen Zheng, Xiaoxiao Wang.

Figure 1. Radar chart comparing the performance of Switch-KD.
Figure 2. Overview of the proposed Switch-KD framework, consisting of two components: (a) Visual-Switch Distillation (left), where the student's visual outputs are switched into the teacher's language pathway to obtain visual-switch logits for implicit multimodal knowledge transfer; and (b) DBiLD loss (right), which first detects knee points k_t and k_s in the respective logits distributions, then constructs two set…
Figure 3. Compares attention maps from the teacher, an SFT baseline, two distillation methods, and our Switch-KD. The teacher focuses on semantically critical regions (e.g., the intersection between a wooden bridge and distant mountains), demonstrating strong visual–semantic understanding. While the SFT baseline approximates the teacher's overall attention distribution, it fails to match the fine-grained semantic …
Original abstract

Vision-Language Models (VLMs) have shown remarkable capabilities in joint vision-language understanding, but their large scale poses significant challenges for deployment in resource-constrained scenarios. Knowledge Distillation (KD) offers a viable way to improve model capabilities without increasing model size or data requirements, making deployment more efficient. However, applying KD to VLMs is challenged by modality-specific supervision: although multimodal knowledge in VLMs is fused within the language space, current methods supervise each modality separately without explicitly addressing multimodal alignment, leading to inconsistent multimodal knowledge transfer. To address this, we propose Switch-KD, a visual-switch distillation framework that unifies vision-language knowledge transfer within a shared text-probability space. Switch-KD comprises two key components: (1) Visual-Switch Distillation, which switches the student's visual outputs into the teacher's language pathway to construct cross-modal probabilistic references for implicit visual knowledge transfer; and (2) Dynamic Bi-directional Logits Difference (DBiLD) loss, which adaptively aligns informative probability regions while preserving the distributional structures of teacher and student through bidirectional supervision. Guided by Switch-KD, a 0.5B TinyLLaVA effectively distills rich multimodal knowledge from its 3B teacher, yielding an average improvement of 3.6 points across 10 multimodal benchmarks without any architectural modification.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces Switch-KD, a knowledge distillation framework for vision-language models that unifies multimodal transfer in a shared text-probability space. It consists of Visual-Switch Distillation, which routes the student's visual token outputs into the teacher's language pathway to generate cross-modal probabilistic references, and the Dynamic Bi-directional Logits Difference (DBiLD) loss, which adaptively aligns informative regions while preserving distributional structure. The central empirical claim is that this enables a 0.5B TinyLLaVA student to distill from a 3B teacher, yielding an average 3.6-point gain across 10 multimodal benchmarks with no architectural changes.

Significance. If the reported gains prove robust, Switch-KD would offer a practical route to improving small VLMs by transferring fused multimodal knowledge without increasing model size or requiring new data, addressing a key deployment bottleneck. The approach's emphasis on modality alignment in probability space is a clear conceptual contribution over separate-modality KD baselines.

major comments (2)
  1. [Abstract / Experimental results] Abstract and experimental results section: the claimed 3.6-point average improvement across 10 benchmarks is presented without error bars, standard deviations, or statistical significance tests. This makes it impossible to determine whether the gains exceed run-to-run variance or depend on specific hyperparameter choices, directly undermining attribution to the visual-switch and DBiLD components. (A sketch of such a multi-seed check follows the minor comments.)
  2. [Method / DBiLD loss definition] Section describing Visual-Switch Distillation and DBiLD: the framework assumes that routing student visual outputs into the teacher's language pathway produces semantically consistent cross-modal references, yet no derivation, bound, or ablation is provided showing that bidirectional logit differences recover lost visual information or prevent misalignment when visual embeddings are substituted for language tokens. If this assumption does not hold, the performance gains cannot be confidently attributed to unified multimodal transfer.
minor comments (2)
  1. [Abstract] The abstract states gains occur 'without any architectural modification,' but the manuscript should explicitly confirm that the student and teacher share the same tokenizer and projection layers to avoid hidden interface changes.
  2. [Tables and figures] Figure captions and table footnotes should include the exact list of 10 benchmarks and the precise evaluation protocol (e.g., zero-shot vs. few-shot) to allow direct replication.
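
On the first major comment, this is the shape of the robustness check being requested: mean and spread of the benchmark gain over independent seeds plus a paired significance test. The data layout and helper name are hypothetical, and nothing here reproduces the paper's numbers.

```python
import numpy as np
from scipy import stats

def gain_with_uncertainty(baseline_runs, switchkd_runs):
    """Hypothetical check for the 3.6-point claim. Each argument has shape
    (n_seeds, n_benchmarks): per-benchmark scores from independent training runs."""
    baseline_runs = np.asarray(baseline_runs, dtype=float)
    switchkd_runs = np.asarray(switchkd_runs, dtype=float)

    # Per-seed average gain across the ten benchmarks, then its mean and spread over seeds.
    per_seed_gain = switchkd_runs.mean(axis=1) - baseline_runs.mean(axis=1)
    mean_gain = per_seed_gain.mean()
    std_gain = per_seed_gain.std(ddof=1)            # requires at least two seeds

    # Paired test: each seed contributes one matched (Switch-KD, baseline) average.
    t_stat, p_value = stats.ttest_rel(switchkd_runs.mean(axis=1), baseline_runs.mean(axis=1))
    return {"mean_gain": mean_gain, "std_gain": std_gain, "t": float(t_stat), "p": float(p_value)}
```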

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback on our paper. We have carefully considered the major comments and provide the following point-by-point responses. We believe these clarifications and planned revisions will strengthen the manuscript.

Point-by-point responses
  1. Referee: [Abstract / Experimental results] Abstract and experimental results section: the claimed 3.6-point average improvement across 10 benchmarks is presented without error bars, standard deviations, or statistical significance tests. This makes it impossible to determine whether the gains exceed run-to-run variance or depend on specific hyperparameter choices, directly undermining attribution to the visual-switch and DBiLD components.

    Authors: We agree with the referee that the absence of error bars, standard deviations, and statistical significance tests in the reported 3.6-point average improvement limits the ability to assess robustness against run-to-run variance. In the revised manuscript, we will include results from multiple training runs with mean and standard deviation, as well as appropriate statistical tests to validate the significance of the gains. This will more convincingly attribute the improvements to the Visual-Switch Distillation and DBiLD loss. revision: yes

  2. Referee: [Method / DBiLD loss definition] Section describing Visual-Switch Distillation and DBiLD: the framework assumes that routing student visual outputs into the teacher's language pathway produces semantically consistent cross-modal references, yet no derivation, bound, or ablation is provided showing that bidirectional logit differences recover lost visual information or prevent misalignment when visual embeddings are substituted for language tokens. If this assumption does not hold, the performance gains cannot be confidently attributed to unified multimodal transfer.

    Authors: We acknowledge that the manuscript does not provide a theoretical derivation or bound demonstrating that the bidirectional logit differences recover visual information or prevent misalignment. The approach relies on the empirical observation that operating in the shared text-probability space allows for consistent cross-modal transfer. To address this concern, we will add a dedicated ablation study in the revised version that isolates the effect of the visual switching and DBiLD components on alignment, along with further discussion in the method section on the rationale for semantic consistency. revision: yes

Circularity Check

0 steps flagged

No significant circularity; Switch-KD components are independently defined

Full rationale

The paper defines Visual-Switch Distillation and DBiLD loss as novel constructs that route student visual outputs into the teacher's language pathway and apply bidirectional logit alignment. These are presented as new mechanisms rather than reductions of fitted parameters or prior results. The reported 3.6-point benchmark gains are empirical outcomes of applying the method, not predictions forced by construction or self-citation chains. No equations or claims reduce the central result to its own inputs; the framework remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review prevents identification of specific free parameters or axioms; the DBiLD loss is described as adaptive but no explicit fitted values or background assumptions are stated.

pith-pipeline@v0.9.0 · 5551 in / 1194 out tokens · 36638 ms · 2026-05-10T11:19:33.321876+00:00 · methodology

