pith. sign in

arxiv: 2605.20495 · v1 · pith:FV7XKH2Hnew · submitted 2026-05-19 · 💻 cs.CV

A Human-in-the-Loop Framework for Efficient Prompt Selection in Microscopy Vision-Language Models

Pith reviewed 2026-05-21 07:01 UTC · model grok-4.3

classification 💻 cs.CV
keywords imagesannotationexemplarspromptselectionexpert-verifiedexpertsframework
0
0 comments X

The pith

Targeted selection of images for expert verification lets vision-language models reach 100% accuracy with only 20 annotations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tries to reduce the expensive expert annotation needed for training deep learning models on microscopy images. Instead of full annotation, it uses a vision-language model to draft captions for a small number of images, which experts verify and edit lightly. These verified image-caption pairs then serve as few-shot prompts to classify all other images. To decide which images to send to experts, the authors frame the problem as active learning and test three selection criteria on small pools of unlabeled data. If effective, this keeps human experts central to the process while slashing the number of images they must handle.

Core claim

By modeling prompt-set construction as a target-driven active learning problem and applying three complementary selection criteria, the framework prioritizes unlabeled microscopy images for expert verification. This produces compact prompt sets that allow the vision-language model to classify remaining images with high accuracy using far fewer verified exemplars than random selection.

What carries the argument

Three complementary selection criteria used to prioritize which images experts should verify and edit for building the prompt set.

Load-bearing premise

The three selection criteria effectively identify images whose verification produces prompt sets that generalize well to the full dataset.

What would settle it

A direct comparison on multiple microscopy datasets where the number of images needed to reach 100% accuracy is measured for the proposed criteria versus random selection; failure occurs if the criteria do not reduce the count.

read the original abstract

Deep-learning pipelines for microscopy image classification often require expensive, labor- and time-intensive expert annotation to produce high-quality ground truth for training. Recent work has shown that prompt tuning of vision-language models (VLMs) can reduce manual annotation by constructing a small prompt set of expert-verified image-caption exemplars that is reused as few-shot context to classify all remaining images at inference time. To further reduce effort, the VLM can draft captions for candidate exemplars, which experts then verify and lightly edit instead of writing text de novo. However, two practical questions remain unaddressed: (1) which unlabeled images should be prioritized for verification, and (2) how many verified exemplars are needed to reach a performance target. In this work, we address these questions by formulating prompt-set construction as a target-driven active learning problem that prioritizes which images to annotate. We study three complementary selection criteria under strict low-resource constraints with small unlabeled pools. Experiments show that our methods reach the target performance with substantially fewer expert-verified images than random selection, achieving 100% test accuracy with as few as 20 annotated images on average. More broadly, our human-in-the-loop framework demonstrates a human-centered use of generative AI in biomedical image analysis, where experts remain actively involved in verifying and refining model output while significantly reducing annotation cost. Code and data will be publicly available.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The manuscript presents a human-in-the-loop framework for constructing prompt sets in vision-language models for microscopy image classification. It models prompt selection as a target-driven active learning problem and evaluates three complementary selection criteria under low-resource constraints with small unlabeled pools. The central empirical claim is that these criteria allow reaching 100% test accuracy with substantially fewer expert-verified images than random selection, specifically as few as 20 annotated images on average.

Significance. If the results hold, this framework offers a practical way to reduce expert annotation effort in biomedical microscopy analysis by combining VLM-generated captions with targeted expert verification. The emphasis on human involvement while leveraging generative AI is a positive contribution to the field. The public release of code and data would further strengthen reproducibility.

major comments (1)
  1. Experiments section: The reported result of 100% test accuracy with an average of 20 annotated images is load-bearing for the central claim, yet the manuscript provides no pool-size ablations, variance estimates across repeated samplings, or explicit comparison of how the three selection criteria perform relative to random ordering when the unlabeled pool is small. Without these, it remains unclear whether the observed advantage is due to criterion quality or chance ordering under the very low-resource constraints emphasized in the paper.
minor comments (1)
  1. Abstract: The statement that 'Code and data will be publicly available' should include a specific repository link or DOI to support the reproducibility claim.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for their constructive and positive review of our manuscript. We address the single major comment below and will revise the paper to incorporate additional experimental details that strengthen the central claims.

read point-by-point responses
  1. Referee: Experiments section: The reported result of 100% test accuracy with an average of 20 annotated images is load-bearing for the central claim, yet the manuscript provides no pool-size ablations, variance estimates across repeated samplings, or explicit comparison of how the three selection criteria perform relative to random ordering when the unlabeled pool is small. Without these, it remains unclear whether the observed advantage is due to criterion quality or chance ordering under the very low-resource constraints emphasized in the paper.

    Authors: We agree that these details would improve the robustness of the reported results. In the revised manuscript we will add pool-size ablations that vary the size of the unlabeled pool while keeping the target performance fixed, allowing readers to see how the advantage scales. We will also report performance averaged over five independent runs with different random seeds, including standard deviations, to quantify sampling variability. Finally, we will include an explicit side-by-side comparison (new table and figure) of the three selection criteria versus random ordering specifically for small pool sizes (up to 50 images), demonstrating that the performance gap persists consistently rather than arising from a single fortunate ordering. These additions will be placed in the Experiments section and will directly address the concern about low-resource constraints. revision: yes

Circularity Check

0 steps flagged

Empirical active-learning framework with no circular derivation

full rationale

The paper formulates prompt-set construction as a target-driven active learning problem and evaluates three selection criteria through experiments on microscopy datasets, reporting that the methods reach target accuracy with fewer expert-verified images than random selection. All central claims rest on direct experimental comparisons against an external baseline rather than any mathematical derivation, fitted parameter renamed as prediction, or self-citation chain. No equations, ansatzes, or uniqueness theorems are invoked that reduce to the paper's own inputs by construction. The work is therefore self-contained against its stated benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Based on abstract, the framework relies on standard assumptions from active learning and VLM prompt tuning without introducing new free parameters, axioms beyond domain norms, or invented entities.

axioms (1)
  • domain assumption Active learning selection criteria can identify the most informative images for building effective prompt sets in low-resource settings
    Central to prioritizing verification effort and achieving performance with fewer annotations.

pith-pipeline@v0.9.0 · 5795 in / 1244 out tokens · 73181 ms · 2026-05-21T07:01:11.130777+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

34 extracted references · 34 canonical work pages · 3 internal anchors

  1. [1]

    Accessed 2026- 02-22

    tiktoken: Fast bpe tokeniser for openai models.https: //github.com/openai/tiktoken. Accessed 2026- 02-22. 7

  2. [2]

    Ash, Chicheng Zhang, Akshay Krishnamurthy, John Langford, and Alekh Agarwal

    Jordan T. Ash, Chicheng Zhang, Akshay Krishnamurthy, John Langford, and Alekh Agarwal. Deep batch active learn- ing by diverse, uncertain gradient lower bounds. InInterna- tional Conference on Learning Representations, 2020. 2

  3. [3]

    Mar- gin based active learning

    Maria-Florina Balcan, Andrei Broder, and Tong Zhang. Mar- gin based active learning. InProceedings of the 20th Annual Conference on Learning Theory, page 35–50, Berlin, Hei- delberg, 2007. Springer-Verlag. 2

  4. [4]

    Active prompt learning in vision language models

    Jihwan Bang, Sumyeong Ahn, and Jae-Gil Lee. Active prompt learning in vision language models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pat- tern Recognition, pages 27004–27014, 2024. 2

  5. [5]

    Why can gpt learn in-context? language models implicitly perform gradient descent as meta-optimizers.arXiv preprint arXiv:2212.10559, 2022

    Damai Dai, Yutao Sun, Li Dong, Yaru Hao, Shuming Ma, Zhifang Sui, and Furu Wei. Why can gpt learn in-context? language models implicitly perform gradient descent as meta-optimizers.arXiv preprint arXiv:2212.10559, 2022. 1

  6. [6]

    Hall, and Peter R

    Palak Dave, Yaroslav Kolinko, Hunter Morera, Kurtis Allen, Saeed Alahmari, Dmitry Goldgof, Lawrence O. Hall, and Peter R. Mouton. MIMO U-Net: efficient cell segmenta- tion and counting in microscopy image sequences. InSociety of Photo-Optical Instrumentation Engineers (SPIE) Confer- ence Series, 2023. 1

  7. [7]

    Active prompting with chain-of- thought for large language models

    Shizhe Diao, Pengcheng Wang, Yong Lin, Rui Pan, Xiang Liu, and Tong Zhang. Active prompting with chain-of- thought for large language models. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1330–1350. As- sociation for Computational Linguistics, 2024. 2, 3, 4

  8. [8]

    LMMs for histopathology: zero- and few-shot patch classifi- cation with GPT and Gemini models

    Caleb Heinzman, Huazhang Guo, Mai He, and Ye Duan. LMMs for histopathology: zero- and few-shot patch classifi- cation with GPT and Gemini models. InNinth International Conference on Advances in Image Processing (ICAIP 2025), page 140170T, 2026. 2, 3, 6

  9. [9]

    Entropy- based active learning for object recognition

    Alex Holub, Pietro Perona, and Michael C Burl. Entropy- based active learning for object recognition. InIEEE com- puter society conference on computer vision and pattern recognition workshops, pages 1–8. IEEE, 2008. 2, 4

  10. [10]

    Mouton, Yaroslav Kolinko, Lawrence O

    Abhiram Kandiyana, Peter R. Mouton, Yaroslav Kolinko, Lawrence O. Hall, and Dmitry Goldgof. Active prompt tuning enables gpt-4o to do efficient classification of mi- croscopy images. In2025 IEEE 22nd International Sym- posium on Biomedical Imaging (ISBI), pages 01–05, 2025. 2, 3, 6

  11. [11]

    Mind your outliers! investigat- ing the negative impact of outliers on active learning for visual question answering

    Siddharth Karamcheti, Ranjay Krishna, Li Fei-Fei, and Christopher D Manning. Mind your outliers! investigat- ing the negative impact of outliers on active learning for visual question answering. InProceedings of the 59th An- nual Meeting of the Association for Computational Linguis- tics and the 11th International Joint Conference on Natural Language Proc...

  12. [12]

    Lewis and William A

    David D. Lewis and William A. Gale. A sequential algorithm for training text classifiers. InProceedings of SIGIR, 1994. 2

  13. [13]

    Dual-stream multiple instance learning network for whole slide image classifica- tion with self-supervised contrastive learning

    Bin Li, Yin Li, and Kevin W Eliceiri. Dual-stream multiple instance learning network for whole slide image classifica- tion with self-supervised contrastive learning. InProceed- ings of the IEEE/CVF conference on computer vision and pattern recognition, pages 14318–14328, 2021. 1

  14. [14]

    Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang

    Nelson F. Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. Lost in the middle: How language models use long contexts.Trans- actions of the Association for Computational Linguistics, 12: 157–173, 2024. 7

  15. [15]

    Morera, P

    H. Morera, P. Dave, S. Alahmari, Y . Kolinko, L.O. Hall, D. Goldgof, and P.R. Mouton. Mimo yolo - a multiple input multiple output model for automatic cell counting. In2023 IEEE 36th International Symposium on Computer- Based Medical Systems (CBMS), pages 827–831, 2023. 1

  16. [16]

    Hall, et al

    Hunter Morera, Palak Dave, Yaroslav Kolinko, Saeed Alah- mari, Aidan Anderson, Grant Denham, Chloe Davis, Juan Riano, Dmitry Goldgof, Lawrence O. Hall, et al. A novel deep learning-based method for automatic stereology of mi- croglia cells from low magnification images.Neurotoxicol- ogy and Teratology, 102:107336, 2024. 1

  17. [17]

    Mouton.Unbiased Stereology: A Concise Guide

    Peter R. Mouton.Unbiased Stereology: A Concise Guide. Johns Hopkins University Press, 2011. 1

  18. [18]

    John Wiley & Sons, 2014

    Peter R Mouton.Neurostereology: unbiased stereology of neural systems. John Wiley & Sons, 2014. 1

  19. [19]

    Daisuke Ono, Dennis W Dickson, and Shunsuke Koga. Eval- uating the efficacy of few-shot learning for GPT-4Vision in neurodegenerative disease histopathology: A comparative analysis with convolutional neural network model.Neu- ropathol Appl Neurobiol, 50(4):e12997, 2024. 2, 3, 6

  20. [20]

    Learning transferable visual models from natural language supervi- sion

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervi- sion. InInternational conference on machine learning, pages 8748–8763. PMLR, 2021. 8

  21. [21]

    Residual prompt tuning: Improving prompt tuning with residual reparameterization

    Anastasiia Razdaibiedina, Yuning Mao, Madian Khabsa, Mike Lewis, Rui Hou, Jimmy Ba, and Amjad Almahairi. Residual prompt tuning: Improving prompt tuning with residual reparameterization. InFindings of the Associa- tion for Computational Linguistics: ACL 2023, pages 6740– 6757, 2023. 2

  22. [22]

    Active learning for vision- language models

    Bardia Safaei and Vishal M Patel. Active learning for vision- language models. In2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 4902–4912. IEEE, 2025. 2, 5

  23. [23]

    MedGemma Technical Report

    Andrew Sellergren, Sahar Kazemzadeh, Tiam Jaroen- sri, Atilla Kiraly, Madeleine Traverse, Timo Kohlberger, Shawn Xu, Fayaz Jamil, C ´ıan Hughes, Charles Lau, et al. Medgemma technical report.arXiv preprint arXiv:2507.05201, 2025. 8

  24. [24]

    Active learning for convolu- tional neural networks: A core-set approach

    Ozan Sener and Silvio Savarese. Active learning for convolu- tional neural networks: A core-set approach. InInternational Conference on Learning Representations, 2018. 2, 5

  25. [25]

    Active learning literature survey

    Burr Settles. Active learning literature survey. Technical Report 1648, University of Wisconsin–Madison, 2009. 2

  26. [26]

    Silverman.Density Estimation for Statistics and Data Analysis

    Bernard W. Silverman.Density Estimation for Statistics and Data Analysis. Chapman & Hall/CRC, 1 edition, 1998. 5

  27. [27]

    Support vector machine ac- tive learning with applications to text classification.Journal of Machine Learning Research, 2:45–66, 2001

    Simon Tong and Daphne Koller. Support vector machine ac- tive learning with applications to text classification.Journal of Machine Learning Research, 2:45–66, 2001. 2

  28. [28]

    Minilmv2: Multi-head self-attention relation dis- tillation for compressing pretrained transformers

    Wenhui Wang, Hangbo Bao, Shaohan Huang, Li Dong, and Furu Wei. Minilmv2: Multi-head self-attention relation dis- tillation for compressing pretrained transformers. InFind- ings of the Association for Computational Linguistics: ACL- IJCNLP 2021, pages 2140–2151, 2021. 7

  29. [29]

    Self-Consistency Improves Chain of Thought Reasoning in Language Models

    Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reason- ing in language models.arXiv preprint arXiv:2203.11171,

  30. [30]

    Chain-of-thought prompting elicits reasoning in large language models

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc V Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models. InAdvances in Neural Information Processing Systems, pages 24824–24837. Curran Associates, Inc., 2022. 3

  31. [31]

    BiomedCLIP: a multimodal biomedical foundation model pretrained from fifteen million scientific image-text pairs

    Sheng Zhang, Yanbo Xu, Naoto Usuyama, Hanwen Xu, Jaspreet Bagga, Robert Tinn, Sam Preston, Rajesh Rao, Mu Wei, Naveen Valluri, et al. Biomedclip: a multimodal biomedical foundation model pretrained from fifteen million scientific image-text pairs.arXiv preprint arXiv:2303.00915,

  32. [32]

    What makes good examples for visual in-context learning?Advances in Neural Information Processing Systems, 36, 2024

    Yuanhan Zhang, Kaiyang Zhou, and Ziwei Liu. What makes good examples for visual in-context learning?Advances in Neural Information Processing Systems, 36, 2024. 2

  33. [33]

    Conditional prompt learning for vision-language mod- els

    Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei Liu. Conditional prompt learning for vision-language mod- els. InProceedings of the IEEE/CVF conference on com- puter vision and pattern recognition, pages 16816–16825,

  34. [34]

    Learning to prompt for vision-language models.In- ternational journal of computer vision, 130(9):2337–2348,

    Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei Liu. Learning to prompt for vision-language models.In- ternational journal of computer vision, 130(9):2337–2348,