Geometry-Aware Uncertainty Coresets for Robust Visual In-Context Learning in Histopathology
Pith reviewed 2026-05-20 10:37 UTC · model grok-4.3
The pith
A training-free coreset method selects small sets of image-text pairs that make vision-language models more accurate, better calibrated, and more robust to rephrased prompts on pathology images.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
GAUC selects coresets by jointly optimising three terms inside the fixed pre-trained multimodal embedding space: a Maximum Mean Discrepancy term that enforces distributional fidelity between the coreset and the full dataset; an Effective Mutual Information Difference regulariser that bounds degradation under prompt paraphrases by using the model's joint vision-text alignment; and a predictive-variance penalty that suppresses overconfident outputs. The resulting coresets improve accuracy, calibration, and prompt robustness for in-context learning on CRC-100K and MHIST across multiple VLM architectures, without any parameter updates.
What carries the argument
The GAUC coreset selector, which jointly optimises three objectives (MMD distributional fidelity, Effective Mutual Information Difference for paraphrase robustness, and predictive-variance penalty) directly in the pre-trained multimodal embedding space to produce compact, geometry-aware example sets for in-context learning.
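A minimal sketch of what a selector of this kind might look like. It assumes a Gaussian kernel, a greedy forward search, and a single precomputed per-example `penalty` array standing in for the EMID and predictive-variance terms. None of these choices is taken from the paper, and `greedy_coreset`, `lam_mmd`, `lam_pen`, and `gamma` are hypothetical names used only for illustration.

```python
import numpy as np

def rbf_kernel(a, b, gamma=1.0):
    """Gaussian kernel matrix k(a_i, b_j) = exp(-gamma * ||a_i - b_j||^2)."""
    sq = (a ** 2).sum(1)[:, None] + (b ** 2).sum(1)[None, :] - 2.0 * a @ b.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def mmd2(K, idx):
    """Biased squared MMD between the subset `idx` and the full set,
    computed from a precomputed full-dataset kernel matrix K."""
    idx = np.asarray(idx)
    within_subset = K[np.ix_(idx, idx)].mean()
    subset_vs_full = K[idx].mean()
    within_full = K.mean()
    return within_subset - 2.0 * subset_vs_full + within_full

def greedy_coreset(emb, penalty, size, lam_mmd=1.0, lam_pen=1.0, gamma=1.0):
    """Greedily add the example that most lowers
    lam_mmd * MMD^2 + lam_pen * mean(penalty over the subset).

    `penalty` is an assumed stand-in for per-example EMID and
    predictive-variance scores; the paper's actual terms may differ.
    """
    K = rbf_kernel(emb, emb, gamma)
    chosen = []
    for _ in range(size):
        best, best_score = None, np.inf
        for j in range(len(emb)):
            if j in chosen:
                continue
            cand = chosen + [j]
            score = lam_mmd * mmd2(K, cand) + lam_pen * penalty[cand].mean()
            if score < best_score:
                best, best_score = j, score
        chosen.append(best)
    return chosen
```

On embeddings drawn from two well-separated clusters, a selector like this tends to pick examples from both clusters, because a subset drawn from only one leaves a large MMD to the full set. The exhaustive inner loop re-evaluates every candidate at each step, which is fine for illustration but would need incremental updates on a dataset the size of CRC-100K.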
If this is right
- Selected coresets raise diagnostic accuracy on CRC-100K and MHIST without any model updates.
- The same coresets improve output calibration and reduce sensitivity to how queries are phrased.
- Gains hold across several open-source vision-language model architectures.
- All benefits come from geometry-aware selection in embedding space rather than parameter changes or large-scale distillation.
Where Pith is reading between the lines
- Similar joint objectives could be applied to select examples for in-context learning in other medical imaging tasks where labeled data is limited.
- The emphasis on embedding geometry suggests that preserving alignment between vision and text features is central to stable performance in domain-specific ICL.
- The method may lower the cost of prompt engineering by making performance less dependent on exact wording.
- One could test whether the same selection criteria improve few-shot performance when the underlying model is a vision-only encoder rather than a multimodal VLM.
Load-bearing premise
The three objectives of distributional fidelity, paraphrase robustness via mutual information, and predictive variance can be jointly optimised in the fixed pre-trained multimodal embedding space to produce coresets that reliably improve downstream in-context learning performance on held-out pathology queries.
What would settle it
An experiment on held-out images from CRC-100K or MHIST in which coresets chosen by GAUC fail to show higher accuracy, better calibration, or greater stability under prompt paraphrases than query-dependent nearest-neighbour retrieval or random selection baselines.
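The calibration and paraphrase-stability parts of such an experiment could be scored as sketched below. The sketch assumes expected calibration error with equal-width confidence bins and a simple all-paraphrases-agree rate. The page does not state which metrics the paper uses, so both functions and their names are illustrative only.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE: the sample-weighted mean of
    |average confidence - accuracy| over non-empty bins."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        # A confidence of exactly 1.0 goes into the last bin.
        b = min(int(conf * n_bins), n_bins - 1)
        bins[b].append((conf, hit))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(h for _, h in members) / len(members)
        ece += len(members) / n * abs(avg_conf - accuracy)
    return ece

def paraphrase_consistency(predictions_per_query):
    """Fraction of queries whose predicted label is identical
    across every paraphrase of the prompt."""
    stable = sum(1 for preds in predictions_per_query if len(set(preds)) == 1)
    return stable / len(predictions_per_query)
```

Running both functions on the same held-out queries for GAUC, nearest-neighbour retrieval, and random selection would give the three-way comparison the paragraph above describes.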
Original abstract
Vision-language models (VLMs) can couple visual perception with open-ended clinical reasoning, making them attractive for computational histopathology. However, fine-tuning billions of parameters on scarce, expert-annotated pathology data is prohibitive, while in-context learning (ICL), which conditions the VLM on demonstrative image-text pairs without parameter updates, suffers from high sensitivity to which examples are selected and how the query is phrased, producing unreliable diagnostics. Existing selection strategies rely on query-dependent nearest-neighbour retrieval that ignores global data structure, require costly parameter updates, or disregard the joint vision-text embedding geometry of VLMs. We propose GAUC, a training-free coreset selection method operating directly in the pre-trained multimodal embedding space. GAUC jointly optimises three objectives: (1) a Maximum Mean Discrepancy term enforcing distributional fidelity between coreset and full dataset, (2) an Effective Mutual Information Difference regulariser bounding performance degradation under prompt paraphrases by exploiting the VLM's joint vision-text alignment, and (3) a predictive-variance penalty suppressing overconfident, unstable outputs. On CRC-100K and MHIST across multiple open-source VLM architectures, GAUC consistently improves accuracy, calibration, and prompt robustness over recent ICL selection methods and dataset-distillation baselines, all without a single gradient update.
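One way to write the three-term objective described in the abstract is the weighted sum below. The additive form and the fixed coreset size m are assumptions; the weights λ1, λ2, λ3 follow the naming used in the simulated rebuttal further down this page. EMID(S) and Var(S) are left abstract because their definitions are not given here. The MMD estimator shown is the standard biased kernel estimate, with z denoting an embedding in the frozen multimodal space and k a kernel of the reader's choosing.

```latex
\[
\mathcal{L}(S) \;=\; \lambda_1\,\widehat{\mathrm{MMD}}^2(S, D)
\;+\; \lambda_2\,\mathrm{EMID}(S)
\;+\; \lambda_3\,\mathrm{Var}(S),
\qquad S \subset D,\ |S| = m
\]
\[
\widehat{\mathrm{MMD}}^2(S, D) \;=\;
\frac{1}{|S|^2}\sum_{s, s' \in S} k(z_s, z_{s'})
\;-\; \frac{2}{|S|\,|D|}\sum_{s \in S}\sum_{x \in D} k(z_s, z_x)
\;+\; \frac{1}{|D|^2}\sum_{x, x' \in D} k(z_x, z_{x'})
\]
```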
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes GAUC, a training-free coreset selection algorithm for visual in-context learning with vision-language models in histopathology. Operating in the frozen multimodal embedding space, GAUC jointly optimizes a Maximum Mean Discrepancy term for distributional fidelity to the full dataset, an Effective Mutual Information Difference regularizer for robustness to prompt paraphrases, and a predictive-variance penalty to suppress unstable outputs. The central claim is that this geometry-aware selection yields consistent gains in accuracy, calibration, and prompt robustness on CRC-100K and MHIST across multiple open-source VLMs, outperforming recent ICL selection and dataset-distillation baselines without any gradient updates or fine-tuning.
Significance. If the empirical results hold after addressing the points below, the work would be a useful contribution to reliable deployment of VLMs in computational pathology. The training-free nature and direct use of joint vision-text geometry address practical constraints of scarce expert annotations and prohibitive fine-tuning costs. Credit is due for the explicit multi-objective formulation that combines global distributional matching with local uncertainty and paraphrase sensitivity, and for evaluating across multiple VLM architectures on two pathology datasets.
major comments (2)
- [§4 (Experiments)] §4 (Experiments) and the associated ablation tables: the manuscript must include a sensitivity analysis or ablation on the balancing weights among the MMD, EMID, and variance terms. The axiom ledger identifies these weights as free parameters; without showing that the joint optimum is not dominated by any single term and that downstream ICL gains on held-out CRC-100K/MHIST queries remain stable, the claim that the three objectives can be reliably co-optimized in the fixed embedding space is not yet load-bearing.
- [Table 2 / §5.2] Table 2 (or equivalent main-results table) and §5.2: reported accuracy and calibration improvements lack error bars, statistical significance tests, or multiple random seeds. Given that the skeptic note highlights potential misalignment between global MMD and local uncertainty/paraphrase terms, the absence of these controls leaves open whether the observed gains over nearest-neighbor and distillation baselines are reproducible or attributable to post-hoc choices.
minor comments (2)
- [§3.2] The definition of Effective Mutual Information Difference (EMID) in §3.2 uses notation that could be clarified with an explicit equation relating it to the VLM's joint vision-text alignment; a short derivation or pseudocode would help readers verify it does not reduce to a fitted quantity on the target task.
- [Figure 3] Figure 3 (qualitative coreset visualizations) would benefit from side-by-side comparison with the nearest-neighbor baseline to illustrate how the geometry-aware selection differs in embedding space.
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive comments on our work. We address each major comment in detail below and have made revisions to the manuscript to incorporate the suggested improvements.
-
Referee: [§4 (Experiments)] §4 (Experiments) and the associated ablation tables: the manuscript must include a sensitivity analysis or ablation on the balancing weights among the MMD, EMID, and variance terms. The axiom ledger identifies these weights as free parameters; without showing that the joint optimum is not dominated by any single term and that downstream ICL gains on held-out CRC-100K/MHIST queries remain stable, the claim that the three objectives can be reliably co-optimized in the fixed embedding space is not yet load-bearing.
Authors: We appreciate the referee's emphasis on validating the multi-objective optimization. The manuscript currently uses fixed weights determined through limited validation experiments. To address this, we have added a new sensitivity analysis subsection in §4. It includes ablations in which we vary the weights λ1 for the MMD term, λ2 for the EMID term, and λ3 for the variance penalty, both individually and in combination. Results show that ICL performance on CRC-100K and MHIST remains stable for weights in the range [0.1, 10], with the full joint objective yielding the best or near-best results and no single term dominating. We believe this strengthens the claim of reliable co-optimization in the embedding space. revision: yes
-
Referee: [Table 2 / §5.2] Table 2 (or equivalent main-results table) and §5.2: reported accuracy and calibration improvements lack error bars, statistical significance tests, or multiple random seeds. Given that the skeptic note highlights potential misalignment between global MMD and local uncertainty/paraphrase terms, the absence of these controls leaves open whether the observed gains over nearest-neighbor and distillation baselines are reproducible or attributable to post-hoc choices.
Authors: We concur that statistical controls are essential for reproducibility. The initial results were reported from single runs due to the computational cost of VLM inference on large datasets. In the revised manuscript, we have conducted experiments with 5 different random seeds for the coreset selection process and query evaluation. Table 2 now includes mean values with standard error bars. We have also added statistical significance testing using Wilcoxon signed-rank tests or t-tests as appropriate, showing that GAUC's improvements are statistically significant (p<0.01) over the baselines. This mitigates concerns regarding post-hoc choices and potential term misalignments by demonstrating consistent performance across seeds. revision: yes
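The two revisions promised above, a weight sweep and multi-seed reporting, could be organised as in this sketch. It assumes a three-value grid spanning the stated [0.1, 10] range and five seeds. `evaluate` is a hypothetical callback that would run coreset selection and ICL evaluation for one weight triple and one seed and return a scalar score; nothing here reproduces the authors' actual protocol.

```python
import itertools
import math
import statistics

def mean_and_standard_error(values):
    """Sample mean and standard error of the mean (0.0 for a single value)."""
    mean = statistics.mean(values)
    if len(values) < 2:
        return mean, 0.0
    return mean, statistics.stdev(values) / math.sqrt(len(values))

def sweep_weights(evaluate, grid=(0.1, 1.0, 10.0), seeds=range(5)):
    """Score every (lambda1, lambda2, lambda3) triple on every seed.

    Returns {(l1, l2, l3): (mean score, standard error)}.
    """
    results = {}
    for lams in itertools.product(grid, repeat=3):
        scores = [evaluate(lams, seed) for seed in seeds]
        results[lams] = mean_and_standard_error(scores)
    return results
```

A flat profile of mean scores across the 27 triples, with standard errors comparable to the gaps between them, would support the stability claim. A sharp peak at one corner of the grid would suggest that a single term dominates.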
Circularity Check
No significant circularity; method is a new joint optimization over independent objectives
The paper introduces GAUC as a training-free coreset selector that jointly optimizes three distinct terms (MMD for distributional match, Effective Mutual Information Difference for paraphrase robustness, and predictive-variance penalty) directly inside a frozen multimodal embedding space. No equations or claims reduce any of these objectives to a fitted parameter that is then renamed as a prediction on the same data. No load-bearing self-citation chain is invoked to justify uniqueness or to forbid alternatives; the central claim rests on empirical gains versus baselines on held-out CRC-100K and MHIST queries. The derivation therefore remains self-contained and externally falsifiable.
Axiom & Free-Parameter Ledger
free parameters (1)
- balancing weights among MMD, EMID, and variance terms
axioms (1)
- domain assumption: The joint vision-text embedding space of a pre-trained VLM preserves geometry relevant to histopathology classification and prompt robustness.