Wasserstein-Aligned Localisation for VLM-Based Distributional OOD Detection in Medical Imaging
Pith reviewed 2026-05-08 16:36 UTC · model grok-4.3
The pith
Wasserstein distances let vision-language models pick moderately similar normal anatomy references to localize anomalies in medical images.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
WALDO reformulates zero-shot localisation as comparative inference and supplies three components: entropy-weighted Sliced Wasserstein distances for selecting anatomically aware references from DINOv2 patch distributions of normal anatomy, Goldilocks zone sampling that exploits the non-monotonic relationship between reference similarity and localisation accuracy, and self-consistency aggregation via weighted non-maximum suppression. The authors theoretically link the Goldilocks effect to distributional divergence and show that moderate-similarity references minimise a bias-variance trade-off in VLM visual reasoning. On NOVA brain MRI, WALDO with Qwen2.5-VL-72B reaches 43.5 percent mAP@30, a 19 percent relative improvement over zero-shot baselines, with statistically significant gains also reported for GPT-4o and Qwen3-VL-32B.
What carries the argument
WALDO framework that selects reference images via entropy-weighted Sliced Wasserstein distances from normal-anatomy DINOv2 distributions and samples them inside the moderate-similarity Goldilocks zone for comparative VLM reasoning.
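The reference-selection machinery can be sketched in a few lines. The paper's exact entropy-weighting formula is not visible in this material, so the snippet below is a minimal sketch under stated assumptions: `entropy_weights` and `sliced_wasserstein` are illustrative names, plain feature vectors stand in for DINOv2 patch embeddings, the slicing is a small Monte-Carlo approximation, and the two point clouds are assumed equal-sized for simplicity.

```python
import math
import random

def entropy_weights(patches):
    """Shannon entropy of each patch's softmax-normalised feature vector.

    Assumed form of the entropy weighting (the paper's exact formula is
    not shown here): patches with more uniform activations get larger
    weights. Returns weights normalised to sum to 1.
    """
    weights = []
    for p in patches:
        exps = [math.exp(v) for v in p]
        z = sum(exps)
        probs = [e / z for e in exps]
        weights.append(-sum(q * math.log(q) for q in probs))
    total = sum(weights)
    return [w / total for w in weights]

def sliced_wasserstein(x, y, n_proj=64, seed=0):
    """Monte-Carlo Sliced Wasserstein-1 distance between two point clouds.

    x, y: equal-sized lists of equal-length feature vectors. Each random
    projection reduces the clouds to 1-D, where W1 is the mean absolute
    difference of sorted projections.
    """
    rng = random.Random(seed)
    d = len(x[0])
    total = 0.0
    for _ in range(n_proj):
        # random direction on the unit sphere
        theta = [rng.gauss(0, 1) for _ in range(d)]
        norm = math.sqrt(sum(t * t for t in theta))
        theta = [t / norm for t in theta]
        px = sorted(sum(a * b for a, b in zip(p, theta)) for p in x)
        py = sorted(sum(a * b for a, b in zip(p, theta)) for p in y)
        total += sum(abs(a - b) for a, b in zip(px, py)) / len(px)
    return total / n_proj
```

In the framework described above, the entropy weights would presumably reweight patches before (or per-slice contributions within) the distance; here the two pieces are shown separately.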
If this is right
- VLMs can localise rare pathologies more accurately without task-specific training or fine-tuning.
- Reference selection that minimises distributional divergence improves the bias-variance trade-off in comparative visual reasoning.
- The method produces statistically significant gains across multiple VLM architectures on the same benchmark.
- Self-consistency aggregation through weighted non-maximum suppression further stabilises localisation outputs.
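The aggregation step in the last bullet can be made concrete. "Weighted non-maximum suppression" admits several formulations; the sketch below assumes one common variant that fuses each cluster of overlapping candidate boxes (e.g. from repeated VLM samples) into a score-weighted average box, which may differ from the paper's exact rule.

```python
def iou(a, b):
    """Intersection-over-union of boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

def weighted_nms(boxes, scores, iou_thr=0.5):
    """Cluster overlapping boxes and fuse each cluster into a
    score-weighted average box; the fused score is the cluster's total
    weight, so boxes agreed on by many samples rank highest."""
    order = sorted(range(len(boxes)), key=lambda i: -scores[i])
    used = [False] * len(boxes)
    fused = []
    for i in order:
        if used[i]:
            continue
        cluster = [j for j in order
                   if not used[j] and iou(boxes[i], boxes[j]) >= iou_thr]
        for j in cluster:
            used[j] = True
        w = sum(scores[j] for j in cluster)
        avg = tuple(sum(scores[j] * boxes[j][k] for j in cluster) / w
                    for k in range(4))
        fused.append((avg, w))
    return fused
```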
Where Pith is reading between the lines
- The same reference-selection principle could be tested on CT or ultrasound if comparable patch embeddings of normal anatomy are available.
- Clinical deployment would require building large, curated banks of normal-anatomy images indexed by Wasserstein distance for each imaging modality.
- If the Goldilocks zone proves stable across sites, the approach could reduce the need for site-specific model retraining in radiology AI.
Load-bearing premise
References drawn from DINOv2 patch distributions of normal anatomy and chosen for moderate similarity supply a reliable bias-variance trade-off for VLM comparative reasoning.
What would settle it
A controlled experiment on a new medical imaging dataset in which the non-monotonic relationship between reference similarity and localisation accuracy disappears or reverses would falsify the Goldilocks mechanism.
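Operationally, Goldilocks zone sampling reduces to keeping references from a middle band of the distance ranking. A minimal sketch, assuming a hypothetical 30th–70th percentile band (the paper's tuned band is not visible here):

```python
def goldilocks_sample(distances, k, lo_q=0.3, hi_q=0.7):
    """Pick up to k reference indices whose distance to the query falls
    in a moderate band.

    distances: distance from the query to each candidate reference
    (e.g. entropy-weighted Sliced Wasserstein values). The band quantiles
    lo_q/hi_q are illustrative: references that are too similar add
    little contrast (bias), references that are too dissimilar add
    noise (variance).
    """
    order = sorted(range(len(distances)), key=lambda i: distances[i])
    lo = int(lo_q * len(order))
    hi = max(lo + 1, int(hi_q * len(order)))
    return order[lo:hi][:k]
```

The falsification test proposed above amounts to sweeping `lo_q`/`hi_q` across the ranking on a new dataset and checking whether accuracy still peaks in the middle band.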
Original abstract
Zero-shot anomaly localisation via vision-language models (VLMs) offers a compelling approach for rare pathology detection, yet its performance is fundamentally limited by the absence of healthy anatomical context. We reformulate zero-shot localisation as a comparative inference problem in which anomalies are identified through structured comparison against reference distributions of normal anatomy. We introduce WALDO, a training-free framework grounded in optimal transport theory that enables comparative reasoning through: (i) entropy-weighted Sliced Wasserstein distances for anatomically-aware reference selection from DINOv2 patch distributions, (ii) Goldilocks zone sampling exploiting the non-monotonic relationship between reference similarity and localisation accuracy, and (iii) self-consistency aggregation via weighted non-maximum suppression. We theoretically analyse the Goldilocks effect through distributional divergence, and show that references with moderate similarity minimize a bias-variance trade-off in comparative visual reasoning. On the NOVA brain MRI benchmark, WALDO with Qwen2.5-VL-72B achieves $43.5_{\pm1.6}\%$ mAP@30 (95\% CI: [40.4, 46.7]), representing a 19\% relative improvement over zero-shot baselines. Cross-model evaluation shows consistent gains: GPT-4o achieves $32.0_{\pm6.5}\%$ and Qwen3-VL-32B achieves $32.0_{\pm6.6}\%$ mAP@30. Paired McNemar tests confirm statistical significance ($p<0.01$). Source code is available at https://github.com/bkainz/WALDO_MICCAI26_demo .
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces WALDO, a training-free framework for zero-shot anomaly localisation in medical imaging using vision-language models (VLMs). It reformulates the task as comparative inference against reference distributions of normal anatomy, employing entropy-weighted Sliced Wasserstein distances on DINOv2 patch features for reference selection, a 'Goldilocks zone' sampling based on moderate similarity to balance bias-variance, and self-consistency aggregation. The paper reports a 19% relative improvement in mAP@30 on the NOVA brain MRI benchmark with Qwen2.5-VL-72B, with consistent gains across other VLMs and statistical significance via McNemar tests.
Significance. If the central claims hold, this work offers a significant advance in training-free OOD detection for medical imaging by grounding VLM-based localisation in optimal transport theory. The cross-model consistency and reported statistical tests suggest robustness, and the availability of source code is a strength. It addresses a key limitation of zero-shot VLMs by providing anatomical context without training.
major comments (2)
- [Abstract] The central performance claim (19% relative mAP gain on NOVA) rests on the Goldilocks effect: moderate-similarity references (selected via entropy-weighted Sliced Wasserstein on DINOv2 patches) minimize bias-variance in VLM comparative reasoning. The abstract states this is supported by 'theoretical analysis through distributional divergence' and a 'non-monotonic relationship', yet no equations, proofs, or figures demonstrating the accuracy-vs-similarity curve are visible in the provided material; the result reduces to a single benchmark point.
- [Abstract] The three components (entropy-weighted Sliced Wasserstein reference selection, Goldilocks zone sampling, self-consistency aggregation) are described at a high level, but exact implementation details, hyperparameter choices for the 'moderate similarity' threshold, and any post-hoc reference selection criteria are not visible. This makes it impossible to verify whether the reported gains are attributable to the proposed mechanism or to standard reference matching.
minor comments (2)
- The abstract reports 95% CIs and McNemar p<0.01 but does not specify the number of runs or exact test setup; clarify this in the results section for reproducibility.
- Consider adding an ablation isolating the contribution of the entropy-weighting in the Wasserstein distance versus unweighted selection.
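For reference, the exact (binomial) form of McNemar's test on paired per-case hit/miss outcomes is short enough to state; this is the textbook formulation, not necessarily the paper's precise setup.

```python
from math import comb

def mcnemar_exact(b, c):
    """Exact two-sided McNemar test on discordant pairs.

    b: cases where method A localises correctly and baseline B fails;
    c: the reverse. Concordant pairs carry no information; under H0
    each discordant pair is a fair coin flip, so the p-value is a
    two-sided binomial tail probability.
    """
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

For example, 8 discordant wins for the method against 1 for the baseline gives p = 0.039, just under the conventional 0.05 level.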
Simulated Author's Rebuttal
We thank the referee for their positive assessment of the significance of our work and for the constructive feedback. We address each major comment below and outline the revisions that will be made to improve clarity and verifiability.
Point-by-point responses
-
Referee: [Abstract] The central performance claim (19% relative mAP gain on NOVA) rests on the Goldilocks effect: moderate-similarity references (selected via entropy-weighted Sliced Wasserstein on DINOv2 patches) minimize bias-variance in VLM comparative reasoning. The abstract states this is supported by 'theoretical analysis through distributional divergence' and a 'non-monotonic relationship', yet no equations, proofs, or figures demonstrating the accuracy-vs-similarity curve are visible in the provided material; the result reduces to a single benchmark point.
Authors: We acknowledge that the abstract summarizes the theoretical analysis at a high level without embedding the supporting equations or empirical curve. The full manuscript contains the distributional divergence analysis in Section 3.2, but to make the non-monotonic relationship explicit and directly tied to the reported gains, we will add a dedicated figure in the revised version plotting localisation accuracy versus reference similarity on the NOVA benchmark. We will also include the key equations formalizing the bias-variance trade-off in the main text. These additions will ensure the central claim is substantiated beyond the single benchmark point. revision: yes
-
Referee: [Abstract] The three components (entropy-weighted Sliced Wasserstein reference selection, Goldilocks zone sampling, self-consistency aggregation) are described at a high level, but exact implementation details, hyperparameter choices for the 'moderate similarity' threshold, and any post-hoc reference selection criteria are not visible. This makes it impossible to verify whether the reported gains are attributable to the proposed mechanism or to standard reference matching.
Authors: We agree that greater specificity is required for reproducibility and to distinguish our method from generic reference matching. In the revised manuscript we will expand the Methods section with the exact formula for the entropy-weighted Sliced Wasserstein distance, the precise similarity thresholds (and selection procedure) defining the Goldilocks zone, the number of references and weighting scheme used in self-consistency aggregation, and any post-hoc filtering criteria. These details, currently available in the released code, will be integrated into the paper with pseudocode to allow direct verification that the performance improvements arise from the proposed components. revision: yes
Circularity Check
No significant circularity; derivation relies on external models and standard OT theory
full rationale
The paper's core claims rest on a training-free pipeline that applies entropy-weighted Sliced Wasserstein distances to DINOv2 patch features for reference selection, followed by Goldilocks-zone sampling justified by a distributional-divergence analysis of bias-variance trade-off, and self-consistency via weighted NMS. None of these steps reduce by construction to fitted parameters renamed as predictions, self-definitional loops, or load-bearing self-citations. The reported mAP gains are empirical outcomes on the external NOVA benchmark, cross-validated across independent VLMs (Qwen2.5-VL-72B, GPT-4o, Qwen3-VL-32B) with statistical tests; the theoretical Goldilocks analysis is presented as a general property of comparative VLM reasoning rather than an input-derived tautology. The framework is therefore self-contained against external benchmarks and does not exhibit any of the enumerated circularity patterns.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: DINOv2 patch distributions from normal anatomy serve as suitable reference distributions for comparative anomaly detection.
- domain assumption: A non-monotonic relationship exists between reference similarity and localisation accuracy that can be exploited by Goldilocks zone sampling.
Reference graph
Works this paper leans on
- [1] Arjovsky, M., Chintala, S., Bottou, L.: Wasserstein generative adversarial networks. In: ICML. pp. 214–223. PMLR (2017)
- [2] Bai, S., Chen, K., Liu, X., Wang, P., et al.: Qwen3-VL technical report. arXiv preprint arXiv:2511.21631 (2025)
- [3] Baugh, M., Tan, J., Müller, J.P., Dombrowski, M., Batten, J., Kainz, B.: Many tasks make light work: Learning to localise medical anomalies from multiple synthetic tasks. In: MICCAI. pp. 162–172. Springer (2023)
- [4] Baugh, M., Tan, J., Vlontzos, A., Müller, J.P., Kainz, B.: nnOOD: A framework for benchmarking self-supervised anomaly localisation methods. In: International Workshop on Uncertainty for Safe Utilization of Machine Learning in Medical Imaging. pp. 103–112. Springer (2022)
- [5] Bercea, C.I., Li, J., Raffler, P., Riedel, E.O., Schmitzer, L., Kurz, A., Bitzer, F., Roßmüller, P., Canisius, J., Beyrle, M.L., et al.: NOVA: A benchmark for anomaly localization and clinical reasoning in brain MRI. arXiv preprint arXiv:2505.14064 (2025), NeurIPS 2025
- [6] Bergmann, P., Fauser, M., Sattlegger, D., Steger, C.: Uninformed students: Student-teacher anomaly detection with discriminative latent embeddings. In: CVPR. pp. 4183–4192 (2020)
- [7] Bonneel, N., Rabin, J., Peyré, G., Pfister, H.: Sliced and Radon Wasserstein barycenters of measures. Journal of Mathematical Imaging and Vision 51(1), 22–45 (2015)
- [8] Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. NeurIPS'20 33, 1877–1901 (2020)
- [9] Courty, N., Flamary, R., Tuia, D., Rakotomamonjy, A.: Optimal transport for domain adaptation. IEEE TPAMI 39(9), 1853–1865 (2017)
- [10] Defard, T., Setkov, A., Loesch, A., Audigier, R.: PaDiM: a patch distribution modeling framework for anomaly detection and localization. In: ICPR'21. pp. 475–489. Springer (2021)
- [11] Dong, Q., Li, L., Dai, D., Zheng, C., Ma, J., Li, R., Xia, H., Xu, J., Wu, Z., Chang, B., et al.: A survey on in-context learning. In: EMNLP'24 - arXiv:2301.00234. pp. 1107–1128 (2024)
- [12] Efron, B., Tibshirani, R.: Bootstrap methods for standard errors, confidence intervals, and other measures of statistical accuracy. Statistical Science pp. 54–75 (1986)
- [13] Gemini Team: Gemini: A family of highly capable multimodal models. arXiv:2312.11805 (2023)
- [14] Kulesza, A., Taskar, B., et al.: Determinantal point processes for machine learning. Foundations and Trends in Machine Learning 5(2–3), 123–286 (2012)
- [15] Litjens, G., Kooi, T., Bejnordi, B.E., Setio, A.A.A., Ciompi, F., Ghafoorian, M., Van Der Laak, J.A., Van Ginneken, B., Sánchez, C.I.: A survey on deep learning in medical image analysis. Medical Image Analysis 42, 60–88 (2017)
- [16] McNemar, Q.: Note on the sampling error of the difference between correlated proportions. Psychometrika 12, 153–157 (1947)
- [17] Naval Marimont, S., Siomos, V., Baugh, M., Tzelepis, C., Kainz, B., Tarroni, G.: Ensembled cold-diffusion restorations for unsupervised anomaly detection. In: MICCAI 2024. LNCS 15011, pp. 243–253. Springer (2024)
- [18] Nguyen, H.Q., Lam, K., Le, L.T., Pham, H.H., Tran, D.Q., Nguyen, D.B., Le, D.D., Pham, C.M., Tong, H.T., Dinh, D.H., et al.: VinDr-CXR: An open dataset of chest X-rays with radiologist's annotations. Scientific Data 9(1), 429 (2022)
- [19] OpenAI: GPT-4o: Multimodal intelligence at scale. OpenAI Technical Report (2024)
- [20] Oquab, M., Darcet, T., Moutakanni, T., Vo, H., Szafraniec, M., Khalidov, V., Fernandez, P., Haziza, D., Massa, F., El-Nouby, A., et al.: DINOv2: Learning robust visual features without supervision. TMLR - arXiv:2304.07193 (2024)
- [21] Peyré, G., Cuturi, M., et al.: Computational optimal transport: With applications to data science. Foundations and Trends in Machine Learning 11(5-6), 355–607 (2019)
- [22] Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: ICML. pp. 8748–8763. PMLR (2021)
- [23] Roth, K., Pemula, L., Zepeda, J., Schölkopf, B., Brox, T., Gehler, P.: Towards total recall in industrial anomaly detection. In: CVPR. pp. 14318–14328 (2022)
- [24] Schlegl, T., Seeböck, P., Waldstein, S.M., Langs, G., Schmidt-Erfurth, U.: f-AnoGAN: Fast unsupervised anomaly detection with generative adversarial networks. Medical Image Analysis 54, 30–44 (2019)
- [25] Schlüter, H.M., Tan, J., Hou, B., Kainz, B.: Natural synthetic anomalies for self-supervised anomaly detection and localization. In: ECCV. pp. 474–489. Springer (2022)
- [26] Siméoni, O., Vo, H.V., Seitzer, M., Baldassarre, F., Oquab, M., et al.: DINOv3: Self-supervised learning with Gram anchoring. arXiv preprint arXiv:2508.10104 (2025)
- [27] Tan, J., Hou, B., Batten, J., Qiu, H., Kainz, B., et al.: Detecting outliers with foreign patch interpolation. Machine Learning for Biomedical Imaging 1(April 2022 issue), 1–27 (2022), arXiv:2011.04197
- [28] Wang, P., Bai, S., et al.: Qwen2-VL: Enhancing vision-language model's perception of the world at any resolution. arXiv:2409.12191 (2024)
- [29] Wang, X., Wei, J., Schuurmans, D., et al.: Self-consistency improves chain of thought reasoning in language models. In: ICLR (2023), arXiv:2203.11171
- [30] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Xia, F., Chi, E., Le, Q.V., Zhou, D., et al.: Chain-of-thought prompting elicits reasoning in large language models. NeurIPS'22 35, 24824–24837 (2022)
- [31] Zimmerer, D., Full, P.M., Isensee, F., Jäger, P., Adler, T., Petersen, J., Köhler, G., Ross, T., Reinke, A., Kascenas, A., et al.: MOOD 2020: A public benchmark for out-of-distribution detection and localization on medical images. IEEE Transactions on Medical Imaging 41(10), 2728–2738 (2022)