Online In-Context Distillation for Low-Resource Vision Language Models
Pith reviewed 2026-05-18 05:32 UTC · model grok-4.3
The pith
Small vision-language models can close much of the performance gap with larger ones by distilling knowledge through sparse in-context demonstrations at inference time.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We propose an online In-Context Distillation (ICD) method in which a small VLM collaborates with a stronger teacher model at inference time, distilling its knowledge via sparse demonstrations to efficiently bridge the gap between them. The method rests on an analysis identifying the model scales and choices for which vision-language in-context learning is feasible and shows ICL's advantage over fine-tuning under constrained compute budgets. Enhancements include cross-modal demonstration selection, teacher test-time scaling to reduce noise, and student uncertainty conditioning to populate a demonstration pool while minimizing teacher queries.
What carries the argument
Online In-Context Distillation (ICD), which dynamically builds a pool of teacher demonstrations through cross-modal selection and student uncertainty conditioning to transfer knowledge at inference time with minimal queries.
If this is right
- Small models achieve performance boosts of up to 33 percent.
- Only 4 percent teacher annotations suffice for the gains.
- The adapted small model matches the teacher's zero-shot performance.
- The approach outperforms fine-tuning when compute budgets are tightly constrained.
- Vision-language ICL is feasible only for certain model scales and pairings identified in the analysis.
Where Pith is reading between the lines
- The method could support on-device adaptation of vision models when only occasional access to a remote teacher is available.
- Similar uncertainty-driven selection might reduce labeling costs in other multimodal tasks such as visual question answering.
- Extending the demonstration pool dynamically could enable continuous improvement without repeated full teacher passes.
Load-bearing premise
Cross-modal demonstration selection combined with uncertainty conditioning reliably picks useful teacher examples without introducing new biases or noise.
What would settle it
On a standard vision-language benchmark, measure whether small models lose the reported performance gains when teacher annotations are capped at 4 percent of the data.
Figures
read the original abstract
As the field continues its push for ever more resources, this work turns the spotlight on a critical question: how can vision-language models (VLMs) be adapted to thrive in low-resource, budget-constrained settings? While large VLMs offer strong performance, they are impractical to deploy in such settings. Small VLMs, on the other hand, are efficient but typically require costly fine-tuning to close the performance gap with larger models in the deployment domain. Inspired by the in-context learning framework, we propose an online In-Context Distillation (ICD) method, in which a small VLM collaborates with a stronger teacher model at inference time, distilling its knowledge via sparse demonstrations to efficiently bridge the gap between them. Our method is built on an in-depth analysis that identifies the scale and the choice of models for which vision-language ICL is currently feasible, and demonstrates the advantage of ICL over fine-tuning under constrained compute budgets. We enhance our method with a novel cross-modal demonstration selection strategy, teacher test-time scaling to reduce noise, and student uncertainty conditioning to dynamically populate a demonstration pool and minimize teacher queries. Our ICD method significantly boosts the performance of small models (up to 33%) using scarce teacher annotations (as low as 4%), and competes with the teacher's zero-shot performance.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Online In-Context Distillation (ICD) for adapting small vision-language models in low-resource settings. A small VLM collaborates with a stronger teacher at inference time, distilling knowledge via sparse demonstrations selected through cross-modal strategies, augmented by teacher test-time scaling and student uncertainty conditioning to minimize queries. An in-depth analysis identifies scales where vision-language ICL is feasible and shows ICL advantages over fine-tuning under compute constraints. The method claims up to 33% performance gains for small models using as little as 4% teacher annotations while competing with the teacher's zero-shot performance.
Significance. If the reported gains are reproducible under proper controls, the work could meaningfully advance efficient VLM deployment in budget-limited scenarios by reducing reliance on fine-tuning and large-scale annotations. The feasibility analysis for vision-language ICL and the dynamic demonstration pool mechanism are potentially useful contributions to in-context adaptation techniques.
major comments (2)
- The abstract and experimental results report up to 33% gains and 4% annotation usage but provide no details on experimental controls, statistical significance, or exact data splits. This directly affects verifiability of the central performance claims.
- The description of cross-modal demonstration selection and student uncertainty conditioning (central to minimizing teacher queries) does not include ablations or analysis demonstrating that these components avoid introducing systematic bias or noise relative to random or baseline selection.
minor comments (2)
- Clarify the exact formulation of the demonstration pool update rule and how uncertainty scores are thresholded or combined with cross-modal similarity.
- Add a table or figure summarizing the model scales and tasks for which ICL was found feasible, to make the analysis section easier to reference.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. The comments highlight important aspects for improving verifiability and analytical rigor, which we will address in the revision. We respond to each major comment below.
read point-by-point responses
-
Referee: The abstract and experimental results report up to 33% gains and 4% annotation usage but provide no details on experimental controls, statistical significance, or exact data splits. This directly affects verifiability of the central performance claims.
Authors: We agree that the current version lacks sufficient detail on these elements, which limits reproducibility. In the revised manuscript we will add a dedicated experimental setup subsection specifying the exact train/test splits for each dataset, the number of random seeds used, reporting of mean and standard deviation to establish statistical significance, and explicit controls for baseline implementations. These additions will directly support the reported gains and annotation usage figures. revision: yes
-
Referee: The description of cross-modal demonstration selection and student uncertainty conditioning (central to minimizing teacher queries) does not include ablations or analysis demonstrating that these components avoid introducing systematic bias or noise relative to random or baseline selection.
Authors: We concur that ablations are needed to substantiate the benefits of the cross-modal selection and uncertainty conditioning mechanisms. The revised paper will incorporate new ablation experiments that compare these strategies against random selection and other baselines, along with quantitative analysis of query efficiency, performance variance, and any observable biases or noise. This will clarify their contribution to minimizing teacher queries without introducing systematic drawbacks. revision: yes
Circularity Check
No significant circularity: novel method components and empirical claims remain independent of self-definition or fitted inputs.
full rationale
The paper proposes an online In-Context Distillation (ICD) approach that introduces distinct new elements—cross-modal demonstration selection, teacher test-time scaling, and student uncertainty conditioning—to dynamically manage a demonstration pool and minimize teacher queries. These enhancements are presented as original strategies built on an in-depth feasibility analysis of vision-language ICL, rather than as reductions to prior self-citations, ansatzes smuggled via citation, or parameters fitted to the target performance metrics. Performance claims (up to 33% boost with as low as 4% annotations, competing with teacher zero-shot) are framed as experimental outcomes under constrained budgets, not as predictions forced by construction from the method's own definitions or inputs. No equations or load-bearing steps in the abstract or described chain equate the result to its inputs by self-definition or renaming. The derivation is therefore self-contained with independent empirical content.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
We enhance our method with a novel cross-modal demonstration selection strategy, teacher test-time scaling to reduce noise, and student uncertainty conditioning to dynamically populate a demonstration pool and minimize teacher queries.
-
IndisputableMonolith/Foundation/AlexanderDuality.leanalexander_duality_circle_linking unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
Our findings show that its effectiveness depends heavily on the model’s inherent capabilities... tiny VLMs (N≤4B) struggle to benefit from ICL
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 1 Pith paper
-
HypEHR: Hyperbolic Modeling of Electronic Health Records for Efficient Question Answering
HypEHR is a hyperbolic embedding model for EHR data that uses Lorentzian geometry and hierarchy-aware pretraining to answer clinical questions nearly as well as large language models but with much smaller size.
Reference graph
Works this paper leans on
-
[1]
Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774,
work page internal anchor Pith review Pith/arXiv arXiv
-
[2]
What learning algorithm is in-context learning? Investigations with linear models
Ekin Akyürek, Dale Schuurmans, Jacob Andreas, Tengyu Ma, and Denny Zhou. What learning algo- rithm is in-context learning? investigations with linear models.arXiv preprint arXiv:2211.15661,
work page internal anchor Pith review Pith/arXiv arXiv
-
[3]
OpenFlamingo: An Open-Source Framework for Training Large Autoregressive Vision-Language Models
Anas Awadalla, Irena Gao, Josh Gardner, Jack Hessel, Yusuf Hanafy, Wanrong Zhu, Kalyani Marathe, Yonatan Bitton, Samir Gadre, Shiori Sagawa, et al. Openflamingo: An open-source framework for training large autoregressive vision-language models.arXiv preprint arXiv:2308.01390,
work page internal anchor Pith review Pith/arXiv arXiv
-
[4]
Large Language Monkeys: Scaling Inference Compute with Repeated Sampling
Bradley Brown, Jordan Juravsky, Ryan Ehrlich, Ronald Clark, Quoc V Le, Christopher Ré, and Azalia Mirhoseini. Large language monkeys: Scaling inference compute with repeated sampling. arXiv preprint arXiv:2407.21787,
work page internal anchor Pith review Pith/arXiv arXiv
-
[5]
Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners.Advances in neural information processing systems, 33:1877–1901,
work page 1901
-
[6]
URLhttps://arxiv.org/abs/2311.18021. Shuo Chen, Zhen Han, Bailan He, Jianzhe Liu, Mark Buckley, Yao Qin, Philip Torr, V olker Tresp, and Jindong Gu. Can multimodal large language models truly perform multimodal in-context learning? In2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pp. 6000–6010. IEEE, 2025a. Shuo Chen, Jianzhe L...
-
[7]
Is gpt-3 a good data annotator?arXiv preprint arXiv:2212.10450,
Bosheng Ding, Chengwei Qin, Linlin Liu, Yew Ken Chia, Shafiq Joty, Boyang Li, and Lidong Bing. Is gpt-3 a good data annotator?arXiv preprint arXiv:2212.10450,
-
[8]
A Survey on In-context Learning
Qingxiu Dong, Lei Li, Damai Dai, Ce Zheng, Jingyuan Ma, Rui Li, Heming Xia, Jingjing Xu, Zhiyong Wu, Tianyu Liu, et al. A survey on in-context learning.arXiv preprint arXiv:2301.00234,
work page internal anchor Pith review Pith/arXiv arXiv
-
[9]
Do llms estimate uncertainty well in instruction-following?arXiv preprint arXiv:2410.14582,
Juyeon Heo, Miao Xiong, Christina Heinze-Deml, and Jaya Narain. Do llms estimate uncertainty well in instruction-following?arXiv preprint arXiv:2410.14582,
-
[10]
Distilling the Knowledge in a Neural Network
Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network.arXiv preprint arXiv:1503.02531,
work page internal anchor Pith review Pith/arXiv arXiv
-
[11]
arXiv preprint arXiv:2305.02301 , year=
Cheng-Yu Hsieh, Chun-Liang Li, Chih-Kuan Yeh, Hootan Nakhost, Yasuhisa Fujii, Alexander Ratner, Ranjay Krishna, Chen-Yu Lee, and Tomas Pfister. Distilling step-by-step! outperforming larger language models with less training data and smaller model sizes.arXiv preprint arXiv:2305.02301,
-
[12]
Ananya Kumar, Aditi Raghunathan, Robbie Jones, Tengyu Ma, and Percy Liang. Fine-tuning can distort pretrained features and underperform out-of-distribution.arXiv preprint arXiv:2202.10054,
-
[13]
LLaVA-OneVision: Easy Visual Task Transfer
Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, et al. Llava-onevision: Easy visual task transfer.arXiv preprint arXiv:2408.03326, 2024a. Li Li, Jiawei Peng, Huiyi Chen, Chongyang Gao, and Xu Yang. How to configure good in-context sequence for visual question answering. InProceedings of...
work page internal anchor Pith review Pith/arXiv arXiv doi:10.18653/v1/2023 2023
-
[14]
Yinpeng Liu, Jiawei Liu, Xiang Shi, Qikai Cheng, Yong Huang, and Wei Lu. Let’s learn step by step: Enhancing in-context learning ability with curriculum learning.arXiv preprint arXiv:2402.10738,
-
[15]
Yao Lu, Max Bartolo, Alastair Moore, Sebastian Riedel, and Pontus Stenetorp. Fantastically ordered prompts and where to find them: Overcoming few-shot prompt order sensitivity.arXiv preprint arXiv:2104.08786,
-
[16]
Niklas Muennighoff, Zitong Yang, Weijia Shi, Xiang Lisa Li, Li Fei-Fei, Hannaneh Hajishirzi, Luke Zettlemoyer, Percy Liang, Emmanuel Candès, and Tatsunori Hashimoto. s1: Simple test-time scaling.arXiv preprint arXiv:2501.19393,
work page internal anchor Pith review Pith/arXiv arXiv
-
[17]
Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters
Charlie Snell, Jaehoon Lee, Kelvin Xu, and Aviral Kumar. Scaling llm test-time compute optimally can be more effective than scaling model parameters.arXiv preprint arXiv:2408.03314,
work page internal anchor Pith review Pith/arXiv arXiv
-
[18]
doi: 10.1016/j.neunet.2012.02.016
ISSN 0893-6080. doi: 10.1016/j.neunet.2012.02.016. URL http://www.sciencedirect.com/science/ article/pii/S0893608012000457. Quan Sun, Yufeng Cui, Xiaosong Zhang, Fan Zhang, Qiying Yu, Yueze Wang, Yongming Rao, Jingjing Liu, Tiejun Huang, and Xinlong Wang. Generative multimodal models are in-context learners. InProceedings of the IEEE/CVF Conference on Com...
-
[19]
URL https://doi.org/10.1109/TIP.2018
doi: 10.1109/TIP.2018.2866698. URL https://doi.org/10.1109/TIP.2018. 2866698. 13 Under review Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: a family of highly capable multimodal models.arXiv preprint arXiv:2312.11805,
-
[20]
The Caltech- UCSD birds-200-2011 dataset
Catherine Wah, Steve Branson, Peter Welinder, Pietro Perona, and Serge Belongie. The Caltech- UCSD birds-200-2011 dataset. Technical Report CNS-TR-2011-001, California Institute of Technology,
work page 2011
-
[21]
Xinlong Wang, Wen Wang, Yue Cao, Chunhua Shen, and Tiejun Huang
URL https://arxiv.org/abs/2210.07795. Xinlong Wang, Wen Wang, Yue Cao, Chunhua Shen, and Tiejun Huang. Images speak in images: A generalist painter for in-context visual learning. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6830–6839, 2023a. Xinru Wang, Hannah Kim, Sajjadur Rahman, Kushan Mitra, and Zhengjie Mi...
-
[22]
Xinyi Wang, Wanrong Zhu, and William Yang Wang. Large language models are implicitly topic models: Explaining and finding good demonstrations for in-context learning.arXiv preprint arXiv:2301.11916, 1:15, 2023b. Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, et a...
-
[23]
Yuchen Yang, Houqiang Li, Yanfeng Wang, and Yu Wang. Improving the reliability of large language models by leveraging uncertainty-aware in-context learning.arXiv preprint arXiv:2310.04782,
-
[24]
PMC-VQA: Visual Instruction Tuning for Medical Visual Question Answering
Xiaoman Zhang, Chaoyi Wu, Ziheng Zhao, Weixiong Lin, Ya Zhang, Yanfeng Wang, and Weidi Xie. Pmc-vqa: Visual instruction tuning for medical visual question answering.arXiv preprint arXiv:2305.10415, 2023a. Yuanhan Zhang, Kaiyang Zhou, and Ziwei Liu. What makes good examples for visual in-context learn- ing? In A. Oh, T. Naumann, A. Globerson, K. Saenko, M....
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[25]
Tinyllava: A framework of small-scale large multimodal models, 2024a
14 Under review Baichuan Zhou, Ying Hu, Xi Weng, Junlong Jia, Jie Luo, Xien Liu, Ji Wu, and Lei Huang. Tinyllava: A framework of small-scale large multimodal models, 2024a. URL https://arxiv.org/ abs/2402.14289. Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei Liu. Learning to prompt for vision- language models.International Journal of Computer Vis...
-
[26]
Visual in-context learning for large vision-language models, 2024b
Yucheng Zhou, Xiang Li, Qianning Wang, and Jianbing Shen. Visual in-context learning for large vision-language models, 2024b. URLhttps://arxiv.org/abs/2402.11574. Yongshuo Zong, Ondrej Bohdal, and Timothy Hospedales. Vl-icl bench: The devil in the details of multimodal in-context learning.arXiv preprint arXiv:2403.13164,
-
[27]
achieves68.7%with SigLIP vs.56.2%with CLIP on CUB. CodeWe developed our method, analysis, and online pipeline based on the implementation of VL-ICLZong et al. (2024) benchmark. A.4.2 COMPUTATIONCOST In this section, we discuss the computation cost of our proposed framework in two aspects. First, we discuss the computation at inference time that is related...
work page 2024
-
[28]
This is also the Calculate Results task in Sec. 3.2, which requires the model to output the final results. Other sub-tasks are presented as follows. • Text Reasoning: the model is required to infer the operator and calculate the results with purely text input, as shown in Fig. 11 • Image OCR: the model is required to do text recognition to read the expres...
work page 2024
-
[29]
Our experiments confirm that 3-shot ICL is the most effective choice in general and with more demonstrations (i.e., 5 shots), the performance instead degrades. We attribute this phenomenon to the extended input length with more demonstrations and the potential noise in the demonstrations. A.11 DISCUSSION OFDATASETS A.11.1 DESCRIPTION OFDATASETS FOREVALUAT...
work page 2012
-
[30]
We leave further investigation of ICL on these challenging datasets as future work
As the dataset is generic and diverse, it is more challenging to find semantically meaningful demonstrations that would help the model better understand the task. We leave further investigation of ICL on these challenging datasets as future work. A.12 DIFFERENTSIZE OFUNLABELEDSET FORICD In the main experiments, we consider annotating all the training samp...
work page 2014
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.