A Human-in-the-Loop Framework for Efficient Prompt Selection in Microscopy Vision-Language Models
Pith reviewed 2026-05-21 07:01 UTC · model grok-4.3
The pith
Targeted selection of images for expert verification lets a vision-language model reach 100% test accuracy with as few as 20 annotated images on average.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By modeling prompt-set construction as a target-driven active learning problem and applying three complementary selection criteria, the framework prioritizes unlabeled microscopy images for expert verification. This produces compact prompt sets that allow the vision-language model to classify remaining images with high accuracy using far fewer verified exemplars than random selection.
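The target-driven loop described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `build_prompt_set`, `score`, `evaluate`, and `batch_size` are hypothetical names, the acquisition score stands in for any of the three criteria, and the expert verification step is reduced to a comment.

```python
from typing import Callable, Sequence


def build_prompt_set(
    pool: Sequence[str],
    score: Callable[[str, list[str]], float],
    evaluate: Callable[[list[str]], float],
    target: float = 1.0,
    batch_size: int = 5,
) -> list[str]:
    """Grow a verified prompt set until `evaluate` meets `target` or the pool is empty."""
    remaining = list(pool)
    prompt_set: list[str] = []
    while remaining and evaluate(prompt_set) < target:
        # Rank the still-unverified images by the acquisition score, highest first.
        remaining.sort(key=lambda img: score(img, prompt_set), reverse=True)
        batch, remaining = remaining[:batch_size], remaining[batch_size:]
        # Assumption: an expert verifies and lightly edits the drafted captions here.
        prompt_set.extend(batch)
    return prompt_set


# Toy stand-ins (hypothetical): the score is the image index, and the
# "accuracy" grows linearly until 20 exemplars have been verified.
pool = [f"img_{i}" for i in range(30)]
toy_score = lambda img, chosen: int(img.split("_")[1])
toy_evaluate = lambda chosen: min(1.0, len(chosen) / 20)
selected = build_prompt_set(pool, toy_score, toy_evaluate, target=1.0, batch_size=5)
print(len(selected), selected[:3])
```

Stopping on the performance target rather than on a fixed budget is what makes the loop target-driven: the number of verified exemplars is an output of the procedure, not an input.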
What carries the argument
Three complementary selection criteria used to prioritize which images experts should verify and edit for building the prompt set.
Load-bearing premise
The three selection criteria effectively identify images whose verification produces prompt sets that generalize well to the full dataset.
What would settle it
A direct comparison on multiple microscopy datasets that measures how many verified images each approach needs to reach 100% accuracy, for the proposed criteria versus random selection; the claim fails if the criteria do not reduce that count.
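The test above amounts to comparing two numbers per dataset: the annotation count at which each selection strategy first reaches the target. A hedged sketch of that bookkeeping follows; the function names and the accuracy curves are hypothetical placeholders, not results from the paper.

```python
from typing import Optional, Sequence


def annotations_to_target(accuracy_curve: Sequence[float], target: float = 1.0) -> Optional[int]:
    """Smallest number of annotations whose accuracy meets `target`.

    accuracy_curve[k] is assumed to hold the test accuracy after k + 1
    verified images. Returns None if the target is never reached.
    """
    for k, acc in enumerate(accuracy_curve, start=1):
        if acc >= target:
            return k
    return None


def criterion_beats_random(
    criterion_curve: Sequence[float],
    random_curve: Sequence[float],
    target: float = 1.0,
) -> bool:
    """True if the criterion reaches the target with strictly fewer annotations."""
    n_crit = annotations_to_target(criterion_curve, target)
    n_rand = annotations_to_target(random_curve, target)
    if n_crit is None:
        return False  # the criterion never reached the target
    return n_rand is None or n_crit < n_rand


# Made-up curves for illustration only.
criterion_curve = [0.6, 0.8, 1.0, 1.0]
random_curve = [0.5, 0.6, 0.7, 0.9]
print(criterion_beats_random(criterion_curve, random_curve))
```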
Original abstract
Deep-learning pipelines for microscopy image classification often require expensive, labor- and time-intensive expert annotation to produce high-quality ground truth for training. Recent work has shown that prompt tuning of vision-language models (VLMs) can reduce manual annotation by constructing a small prompt set of expert-verified image-caption exemplars that is reused as few-shot context to classify all remaining images at inference time. To further reduce effort, the VLM can draft captions for candidate exemplars, which experts then verify and lightly edit instead of writing text de novo. However, two practical questions remain unaddressed: (1) which unlabeled images should be prioritized for verification, and (2) how many verified exemplars are needed to reach a performance target. In this work, we address these questions by formulating prompt-set construction as a target-driven active learning problem that prioritizes which images to annotate. We study three complementary selection criteria under strict low-resource constraints with small unlabeled pools. Experiments show that our methods reach the target performance with substantially fewer expert-verified images than random selection, achieving 100% test accuracy with as few as 20 annotated images on average. More broadly, our human-in-the-loop framework demonstrates a human-centered use of generative AI in biomedical image analysis, where experts remain actively involved in verifying and refining model output while significantly reducing annotation cost. Code and data will be publicly available.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents a human-in-the-loop framework for constructing prompt sets in vision-language models for microscopy image classification. It models prompt selection as a target-driven active learning problem and evaluates three complementary selection criteria under low-resource constraints with small unlabeled pools. The central empirical claim is that these criteria allow reaching 100% test accuracy with substantially fewer expert-verified images than random selection, specifically as few as 20 annotated images on average.
Significance. If the results hold, this framework offers a practical way to reduce expert annotation effort in biomedical microscopy analysis by combining VLM-generated captions with targeted expert verification. The emphasis on human involvement while leveraging generative AI is a positive contribution to the field. The public release of code and data would further strengthen reproducibility.
major comments (1)
- Experiments section: The reported result of 100% test accuracy with an average of 20 annotated images is load-bearing for the central claim, yet the manuscript provides no pool-size ablations, variance estimates across repeated samplings, or explicit comparison of how the three selection criteria perform relative to random ordering when the unlabeled pool is small. Without these, it remains unclear whether the observed advantage is due to criterion quality or chance ordering under the very low-resource constraints emphasized in the paper.
minor comments (1)
- Abstract: The statement that 'Code and data will be publicly available' should include a specific repository link or DOI to support the reproducibility claim.
Simulated Author's Rebuttal
We thank the referee for their constructive and positive review of our manuscript. We address the single major comment below and will revise the paper to incorporate additional experimental details that strengthen the central claims.
point-by-point responses
- Referee: Experiments section: The reported result of 100% test accuracy with an average of 20 annotated images is load-bearing for the central claim, yet the manuscript provides no pool-size ablations, variance estimates across repeated samplings, or explicit comparison of how the three selection criteria perform relative to random ordering when the unlabeled pool is small. Without these, it remains unclear whether the observed advantage is due to criterion quality or chance ordering under the very low-resource constraints emphasized in the paper.
  Authors: We agree that these details would improve the robustness of the reported results. In the revised manuscript we will add pool-size ablations that vary the size of the unlabeled pool while keeping the target performance fixed, allowing readers to see how the advantage scales. We will also report performance averaged over five independent runs with different random seeds, including standard deviations, to quantify sampling variability. Finally, we will include an explicit side-by-side comparison (new table and figure) of the three selection criteria versus random ordering specifically for small pool sizes (up to 50 images), demonstrating that the performance gap persists consistently rather than arising from a single fortunate ordering. These additions will be placed in the Experiments section and will directly address the concern about low-resource constraints.
  Revision: yes
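The variance reporting promised in the rebuttal (five seeded runs, mean and standard deviation) could look like the sketch below. The helper name `summarize_runs` and the per-seed counts are hypothetical placeholders, and the sample standard deviation (n - 1 denominator) is an assumption, since the rebuttal does not say which estimator will be used.

```python
import statistics
from typing import Sequence, Tuple


def summarize_runs(annotation_counts: Sequence[int]) -> Tuple[float, float]:
    """Mean and sample standard deviation of annotations-to-target across runs."""
    mean = statistics.mean(annotation_counts)
    # statistics.stdev uses the n - 1 denominator and needs at least two runs.
    std = statistics.stdev(annotation_counts)
    return mean, std


# Placeholder counts for five seeds; these are not values from the paper.
counts_per_seed = [12, 15, 11, 14, 13]
mean_count, std_count = summarize_runs(counts_per_seed)
print(f"{mean_count:.1f} +/- {std_count:.2f} annotations to reach target")
```

With only five runs the standard deviation is itself a noisy estimate, so reporting the individual per-seed counts alongside the summary would let readers judge the spread directly.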
Circularity Check
Empirical active-learning framework with no circular derivation
full rationale
The paper formulates prompt-set construction as a target-driven active learning problem and evaluates three selection criteria through experiments on microscopy datasets, reporting that the methods reach target accuracy with fewer expert-verified images than random selection. All central claims rest on direct experimental comparisons against an external baseline rather than any mathematical derivation, fitted parameter renamed as prediction, or self-citation chain. No equations, ansatzes, or uniqueness theorems are invoked that reduce to the paper's own inputs by construction. The work is therefore self-contained against its stated benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Active learning selection criteria can identify the most informative images for building effective prompt sets in low-resource settings.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel (unclear)
  unclear: relation between the paper passage and the cited Recognition theorem.
  Paper passage: "We study three complementary selection criteria under strict low-resource constraints with small unlabeled pools... uncertainty-guided acquisition using stochastic decoding, complexity-aware uncertainty acquisition... density-tree boundary sampling"
- IndisputableMonolith/Foundation/RealityFromDistinction.lean: reality_from_one_distinction (unclear)
  unclear: relation between the paper passage and the cited Recognition theorem.
  Paper passage: "formulating prompt-set construction as a target-driven active learning problem"
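The first criterion named in the quoted passage, uncertainty-guided acquisition using stochastic decoding, is commonly realized by sampling several answers per image at a nonzero temperature and scoring their disagreement. The sketch below uses vote entropy for that purpose. It is an assumption about one plausible form, since this page does not state the paper's exact scoring rule; `vote_entropy`, `rank_by_uncertainty`, and the sampled labels are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence


def vote_entropy(sampled_labels: Sequence[str]) -> float:
    """Shannon entropy (in nats) of the empirical label distribution over samples."""
    counts = Counter(sampled_labels)
    total = len(sampled_labels)
    return -sum((c / total) * math.log(c / total) for c in counts.values())


def rank_by_uncertainty(samples: Dict[str, List[str]]) -> List[str]:
    """Order image ids from most to least uncertain."""
    return sorted(samples, key=lambda img: vote_entropy(samples[img]), reverse=True)


# Hypothetical labels from repeated stochastic decoding of the same image.
samples = {
    "img_a": ["stained", "stained", "stained", "stained"],
    "img_b": ["stained", "unstained", "stained", "unstained"],
    "img_c": ["stained", "stained", "stained", "unstained"],
}
print(rank_by_uncertainty(samples))
```

Images whose sampled answers all agree score zero and fall to the bottom of the ranking, so expert effort goes first to the images on which the model contradicts itself.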
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.