Wasserstein-Aligned Localisation for VLM-Based Distributional OOD Detection in Medical Imaging
Pith reviewed 2026-05-08 16:36 UTC · model grok-4.3
The pith
Wasserstein distances let vision-language models pick moderately similar normal anatomy references to localize anomalies in medical images.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
WALDO reformulates zero-shot localisation as comparative inference and supplies three components: entropy-weighted Sliced Wasserstein distances for selecting anatomically aware references from DINOv2 patch distributions of normal anatomy, Goldilocks zone sampling that exploits the non-monotonic relationship between reference similarity and localisation accuracy, and self-consistency aggregation via weighted non-maximum suppression. The authors theoretically link the Goldilocks effect to distributional divergence and show that moderate-similarity references minimise a bias-variance trade-off in VLM visual reasoning. On NOVA brain MRI, WALDO with Qwen2.5-VL-72B reaches 43.5 percent mAP@30, a 19 percent relative improvement over zero-shot baselines, with statistically significant gains also reported for GPT-4o and Qwen3-VL-32B.
What carries the argument
WALDO framework that selects reference images via entropy-weighted Sliced Wasserstein distances from normal-anatomy DINOv2 distributions and samples them inside the moderate-similarity Goldilocks zone for comparative VLM reasoning.
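The reference-selection machinery can be sketched in a few lines. The paper's exact entropy-weighting formula is not visible in this material, so the snippet below is a minimal sketch under stated assumptions: `entropy_weights` and `sliced_wasserstein` are illustrative names, plain feature vectors stand in for DINOv2 patch embeddings, the slicing is a small Monte-Carlo approximation, and the two point clouds are assumed equal-sized for simplicity.

```python
import math
import random

def entropy_weights(patches):
    """Shannon entropy of each patch's softmax-normalised feature vector.

    Assumed form of the entropy weighting (the paper's exact formula is
    not shown here): patches with more uniform activations get larger
    weights. Returns weights normalised to sum to 1.
    """
    weights = []
    for p in patches:
        exps = [math.exp(v) for v in p]
        z = sum(exps)
        probs = [e / z for e in exps]
        weights.append(-sum(q * math.log(q) for q in probs))
    total = sum(weights)
    return [w / total for w in weights]

def sliced_wasserstein(x, y, n_proj=64, seed=0):
    """Monte-Carlo Sliced Wasserstein-1 distance between two point clouds.

    x, y: equal-sized lists of equal-length feature vectors. Each random
    projection reduces the clouds to 1-D, where W1 is the mean absolute
    difference of sorted projections.
    """
    rng = random.Random(seed)
    d = len(x[0])
    total = 0.0
    for _ in range(n_proj):
        # random direction on the unit sphere
        theta = [rng.gauss(0, 1) for _ in range(d)]
        norm = math.sqrt(sum(t * t for t in theta))
        theta = [t / norm for t in theta]
        px = sorted(sum(a * b for a, b in zip(p, theta)) for p in x)
        py = sorted(sum(a * b for a, b in zip(p, theta)) for p in y)
        total += sum(abs(a - b) for a, b in zip(px, py)) / len(px)
    return total / n_proj
```

In the framework described above, the entropy weights would presumably reweight patches before (or per-slice contributions within) the distance; here the two pieces are shown separately.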
If this is right
- VLMs can localise rare pathologies more accurately without task-specific training or fine-tuning.
- Reference selection that minimises distributional divergence improves the bias-variance trade-off in comparative visual reasoning.
- The method produces statistically significant gains across multiple VLM architectures on the same benchmark.
- Self-consistency aggregation through weighted non-maximum suppression further stabilises localisation outputs.
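The aggregation step in the last bullet can be made concrete. "Weighted non-maximum suppression" admits several formulations; the sketch below assumes one common variant that fuses each cluster of overlapping candidate boxes (e.g. from repeated VLM samples) into a score-weighted average box, which may differ from the paper's exact rule.

```python
def iou(a, b):
    """Intersection-over-union of boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

def weighted_nms(boxes, scores, iou_thr=0.5):
    """Cluster overlapping boxes and fuse each cluster into a
    score-weighted average box; the fused score is the cluster's total
    weight, so boxes agreed on by many samples rank highest."""
    order = sorted(range(len(boxes)), key=lambda i: -scores[i])
    used = [False] * len(boxes)
    fused = []
    for i in order:
        if used[i]:
            continue
        cluster = [j for j in order
                   if not used[j] and iou(boxes[i], boxes[j]) >= iou_thr]
        for j in cluster:
            used[j] = True
        w = sum(scores[j] for j in cluster)
        avg = tuple(sum(scores[j] * boxes[j][k] for j in cluster) / w
                    for k in range(4))
        fused.append((avg, w))
    return fused
```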
Where Pith is reading between the lines
- The same reference-selection principle could be tested on CT or ultrasound if comparable patch embeddings of normal anatomy are available.
- Clinical deployment would require building large, curated banks of normal-anatomy images indexed by Wasserstein distance for each imaging modality.
- If the Goldilocks zone proves stable across sites, the approach could reduce the need for site-specific model retraining in radiology AI.
Load-bearing premise
References drawn from DINOv2 patch distributions of normal anatomy and chosen for moderate similarity supply a reliable bias-variance trade-off for VLM comparative reasoning.
What would settle it
A controlled experiment on a new medical imaging dataset in which the non-monotonic relationship between reference similarity and localisation accuracy disappears or reverses would falsify the Goldilocks mechanism.
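Operationally, Goldilocks zone sampling reduces to keeping references from a middle band of the distance ranking. A minimal sketch, assuming a hypothetical 30th–70th percentile band (the paper's tuned band is not visible here):

```python
def goldilocks_sample(distances, k, lo_q=0.3, hi_q=0.7):
    """Pick up to k reference indices whose distance to the query falls
    in a moderate band.

    distances: distance from the query to each candidate reference
    (e.g. entropy-weighted Sliced Wasserstein values). The band quantiles
    lo_q/hi_q are illustrative: references that are too similar add
    little contrast (bias), references that are too dissimilar add
    noise (variance).
    """
    order = sorted(range(len(distances)), key=lambda i: distances[i])
    lo = int(lo_q * len(order))
    hi = max(lo + 1, int(hi_q * len(order)))
    return order[lo:hi][:k]
```

The falsification test proposed above amounts to sweeping `lo_q`/`hi_q` across the ranking on a new dataset and checking whether accuracy still peaks in the middle band.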
Original abstract
Zero-shot anomaly localisation via vision-language models (VLMs) offers a compelling approach for rare pathology detection, yet its performance is fundamentally limited by the absence of healthy anatomical context. We reformulate zero-shot localisation as a comparative inference problem in which anomalies are identified through structured comparison against reference distributions of normal anatomy. We introduce WALDO, a training-free framework grounded in optimal transport theory that enables comparative reasoning through: (i) entropy-weighted Sliced Wasserstein distances for anatomically-aware reference selection from DINOv2 patch distributions, (ii) Goldilocks zone sampling exploiting the non-monotonic relationship between reference similarity and localisation accuracy, and (iii) self-consistency aggregation via weighted non-maximum suppression. We theoretically analyse the Goldilocks effect through distributional divergence, and show that references with moderate similarity minimize a bias-variance trade-off in comparative visual reasoning. On the NOVA brain MRI benchmark, WALDO with Qwen2.5-VL-72B achieves $43.5_{\pm1.6}\%$ mAP@30 (95\% CI: [40.4, 46.7]), representing a 19\% relative improvement over zero-shot baselines. Cross-model evaluation shows consistent gains: GPT-4o achieves $32.0_{\pm6.5}\%$ and Qwen3-VL-32B achieves $32.0_{\pm6.6}\%$ mAP@30. Paired McNemar tests confirm statistical significance ($p<0.01$). Source code is available at https://github.com/bkainz/WALDO_MICCAI26_demo .
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces WALDO, a training-free framework for zero-shot anomaly localisation in medical imaging using vision-language models (VLMs). It reformulates the task as comparative inference against reference distributions of normal anatomy, employing entropy-weighted Sliced Wasserstein distances on DINOv2 patch features for reference selection, a 'Goldilocks zone' sampling based on moderate similarity to balance bias-variance, and self-consistency aggregation. The paper reports a 19% relative improvement in mAP@30 on the NOVA brain MRI benchmark with Qwen2.5-VL-72B, with consistent gains across other VLMs and statistical significance via McNemar tests.
Significance. If the central claims hold, this work offers a significant advance in training-free OOD detection for medical imaging by grounding VLM-based localisation in optimal transport theory. The cross-model consistency and reported statistical tests suggest robustness, and the availability of source code is a strength. It addresses a key limitation of zero-shot VLMs by providing anatomical context without training.
major comments (2)
- [Abstract] The central performance claim (19% relative mAP gain on NOVA) rests on the Goldilocks effect: moderate-similarity references (selected via entropy-weighted Sliced Wasserstein on DINOv2 patches) minimize bias-variance in VLM comparative reasoning. The abstract states this is supported by 'theoretical analysis through distributional divergence' and a 'non-monotonic relationship', yet no equations, proofs, or figures demonstrating the accuracy-vs-similarity curve are visible in the provided material; the result reduces to a single benchmark point.
- [Abstract] The three components (entropy-weighted Sliced Wasserstein reference selection, Goldilocks zone sampling, self-consistency aggregation) are described at a high level, but exact implementation details, hyperparameter choices for the 'moderate similarity' threshold, and any post-hoc reference selection criteria are not visible. This makes it impossible to verify whether the reported gains are attributable to the proposed mechanism or to standard reference matching.
minor comments (2)
- The abstract reports 95% CIs and McNemar p<0.01 but does not specify the number of runs or exact test setup; clarify this in the results section for reproducibility.
- Consider adding an ablation isolating the contribution of the entropy-weighting in the Wasserstein distance versus unweighted selection.
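For reference, the exact (binomial) form of McNemar's test on paired per-case hit/miss outcomes is short enough to state; this is the textbook formulation, not necessarily the paper's precise setup.

```python
from math import comb

def mcnemar_exact(b, c):
    """Exact two-sided McNemar test on discordant pairs.

    b: cases where method A localises correctly and baseline B fails;
    c: the reverse. Concordant pairs carry no information; under H0
    each discordant pair is a fair coin flip, so the p-value is a
    two-sided binomial tail probability.
    """
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

For example, 8 discordant wins for the method against 1 for the baseline gives p = 0.039, just under the conventional 0.05 level.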
Simulated Author's Rebuttal
We thank the referee for their positive assessment of the significance of our work and for the constructive feedback. We address each major comment below and outline the revisions that will be made to improve clarity and verifiability.
Point-by-point responses
-
Referee: [Abstract] The central performance claim (19% relative mAP gain on NOVA) rests on the Goldilocks effect: moderate-similarity references (selected via entropy-weighted Sliced Wasserstein on DINOv2 patches) minimize bias-variance in VLM comparative reasoning. The abstract states this is supported by 'theoretical analysis through distributional divergence' and a 'non-monotonic relationship', yet no equations, proofs, or figures demonstrating the accuracy-vs-similarity curve are visible in the provided material; the result reduces to a single benchmark point.
Authors: We acknowledge that the abstract summarizes the theoretical analysis at a high level without embedding the supporting equations or empirical curve. The full manuscript contains the distributional divergence analysis in Section 3.2, but to make the non-monotonic relationship explicit and directly tied to the reported gains, we will add a dedicated figure in the revised version plotting localisation accuracy versus reference similarity on the NOVA benchmark. We will also include the key equations formalizing the bias-variance trade-off in the main text. These additions will ensure the central claim is substantiated beyond the single benchmark point. revision: yes
-
Referee: [Abstract] The three components (entropy-weighted Sliced Wasserstein reference selection, Goldilocks zone sampling, self-consistency aggregation) are described at a high level, but exact implementation details, hyperparameter choices for the 'moderate similarity' threshold, and any post-hoc reference selection criteria are not visible. This makes it impossible to verify whether the reported gains are attributable to the proposed mechanism or to standard reference matching.
Authors: We agree that greater specificity is required for reproducibility and to distinguish our method from generic reference matching. In the revised manuscript we will expand the Methods section with the exact formula for the entropy-weighted Sliced Wasserstein distance, the precise similarity thresholds (and selection procedure) defining the Goldilocks zone, the number of references and weighting scheme used in self-consistency aggregation, and any post-hoc filtering criteria. These details, currently available in the released code, will be integrated into the paper with pseudocode to allow direct verification that the performance improvements arise from the proposed components. revision: yes
Circularity Check
No significant circularity; derivation relies on external models and standard OT theory
full rationale
The paper's core claims rest on a training-free pipeline that applies entropy-weighted Sliced Wasserstein distances to DINOv2 patch features for reference selection, followed by Goldilocks-zone sampling justified by a distributional-divergence analysis of bias-variance trade-off, and self-consistency via weighted NMS. None of these steps reduce by construction to fitted parameters renamed as predictions, self-definitional loops, or load-bearing self-citations. The reported mAP gains are empirical outcomes on the external NOVA benchmark, cross-validated across independent VLMs (Qwen2.5-VL-72B, GPT-4o, Qwen3-VL-32B) with statistical tests; the theoretical Goldilocks analysis is presented as a general property of comparative VLM reasoning rather than an input-derived tautology. The framework is therefore self-contained against external benchmarks and does not exhibit any of the enumerated circularity patterns.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: DINOv2 patch distributions from normal anatomy serve as suitable reference distributions for comparative anomaly detection.
- domain assumption: A non-monotonic relationship exists between reference similarity and localisation accuracy that can be exploited by Goldilocks zone sampling.
Reference graph
Works this paper leans on
- [1] Arjovsky, M., Chintala, S., Bottou, L.: Wasserstein generative adversarial networks. In: ICML. pp. 214–223. PMLR (2017)
- [2] Bai, S., Chen, K., Liu, X., Wang, P., et al.: Qwen3-VL technical report. arXiv preprint arXiv:2511.21631 (2025)
- [3] Baugh, M., Tan, J., Müller, J.P., Dombrowski, M., Batten, J., Kainz, B.: Many tasks make light work: Learning to localise medical anomalies from multiple synthetic tasks. In: MICCAI. pp. 162–172. Springer (2023)
- [4] Baugh, M., Tan, J., Vlontzos, A., Müller, J.P., Kainz, B.: nnOOD: A framework for benchmarking self-supervised anomaly localisation methods. In: International Workshop on Uncertainty for Safe Utilization of Machine Learning in Medical Imaging. pp. 103–112. Springer (2022)
- [5] Bercea, C.I., Li, J., Raffler, P., Riedel, E.O., Schmitzer, L., Kurz, A., Bitzer, F., Roßmüller, P., Canisius, J., Beyrle, M.L., et al.: NOVA: A benchmark for anomaly localization and clinical reasoning in brain MRI. arXiv preprint arXiv:2505.14064 (2025), NeurIPS 2025
- [6] Bergmann, P., Fauser, M., Sattlegger, D., Steger, C.: Uninformed students: Student-teacher anomaly detection with discriminative latent embeddings. In: CVPR. pp. 4183–4192 (2020)
- [7] Bonneel, N., Rabin, J., Peyré, G., Pfister, H.: Sliced and Radon Wasserstein barycenters of measures. Journal of Mathematical Imaging and Vision 51(1), 22–45 (2015)
- [8] Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. NeurIPS'20 33, 1877–1901 (2020)
- [9] Courty, N., Flamary, R., Tuia, D., Rakotomamonjy, A.: Optimal transport for domain adaptation. IEEE TPAMI 39(9), 1853–1865 (2017)
- [10] Defard, T., Setkov, A., Loesch, A., Audigier, R.: PaDiM: a patch distribution modeling framework for anomaly detection and localization. In: ICPR'21. pp. 475–489. Springer (2021)
- [11] Dong, Q., Li, L., Dai, D., Zheng, C., Ma, J., Li, R., Xia, H., Xu, J., Wu, Z., Chang, B., et al.: A survey on in-context learning. In: EMNLP'24 - arXiv:2301.00234. pp. 1107–1128 (2024)
- [12] Efron, B., Tibshirani, R.: Bootstrap methods for standard errors, confidence intervals, and other measures of statistical accuracy. Statistical Science pp. 54–75 (1986)
- [13] Gemini Team: Gemini: A family of highly capable multimodal models. arXiv:2312.11805 (2023)
- [14] Kulesza, A., Taskar, B., et al.: Determinantal point processes for machine learning. Foundations and Trends in Machine Learning 5(2–3), 123–286 (2012)
- [15] Litjens, G., Kooi, T., Bejnordi, B.E., Setio, A.A.A., Ciompi, F., Ghafoorian, M., Van Der Laak, J.A., Van Ginneken, B., Sánchez, C.I.: A survey on deep learning in medical image analysis. Medical Image Analysis 42, 60–88 (2017)
- [16] McNemar, Q.: Note on the sampling error of the difference between correlated proportions. Psychometrika 12, 153–157 (1947)
- [17] Naval Marimont, S., Siomos, V., Baugh, M., Tzelepis, C., Kainz, B., Tarroni, G.: Ensembled cold-diffusion restorations for unsupervised anomaly detection. In: MICCAI 2024. LNCS 15011, pp. 243–253. Springer (2024)
- [18] Nguyen, H.Q., Lam, K., Le, L.T., Pham, H.H., Tran, D.Q., Nguyen, D.B., Le, D.D., Pham, C.M., Tong, H.T., Dinh, D.H., et al.: VinDr-CXR: An open dataset of chest X-rays with radiologist's annotations. Scientific Data 9(1), 429 (2022)
- [19] OpenAI: GPT-4o: Multimodal intelligence at scale. OpenAI Technical Report (2024)
- [20] Oquab, M., Darcet, T., Moutakanni, T., Vo, H., Szafraniec, M., Khalidov, V., Fernandez, P., Haziza, D., Massa, F., El-Nouby, A., et al.: DINOv2: Learning robust visual features without supervision. TMLR - arXiv:2304.07193 (2024)
- [21] Peyré, G., Cuturi, M., et al.: Computational optimal transport: With applications to data science. Foundations and Trends in Machine Learning 11(5-6), 355–607 (2019)
- [22] Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: ICML. pp. 8748–8763. PMLR (2021)
- [23] Roth, K., Pemula, L., Zepeda, J., Schölkopf, B., Brox, T., Gehler, P.: Towards total recall in industrial anomaly detection. In: CVPR. pp. 14318–14328 (2022)
- [24] Schlegl, T., Seeböck, P., Waldstein, S.M., Langs, G., Schmidt-Erfurth, U.: f-AnoGAN: Fast unsupervised anomaly detection with generative adversarial networks. Medical Image Analysis 54, 30–44 (2019)
- [25] Schlüter, H.M., Tan, J., Hou, B., Kainz, B.: Natural synthetic anomalies for self-supervised anomaly detection and localization. In: ECCV. pp. 474–489. Springer (2022)
- [26] Siméoni, O., Vo, H.V., Seitzer, M., Baldassarre, F., Oquab, M., et al.: DINOv3: Self-supervised learning with Gram anchoring. arXiv preprint arXiv:2508.10104 (2025)
- [27] Tan, J., Hou, B., Batten, J., Qiu, H., Kainz, B., et al.: Detecting outliers with foreign patch interpolation. Machine Learning for Biomedical Imaging 1(April 2022 issue), 1–27 (2022), arXiv:2011.04197
- [28] Wang, P., Bai, S., et al.: Qwen2-VL: Enhancing vision-language model's perception of the world at any resolution. arXiv:2409.12191 (2024)
- [29] Wang, X., Wei, J., Schuurmans, D., et al.: Self-consistency improves chain of thought reasoning in language models. In: ICLR (2023), arXiv:2203.11171
- [30] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Xia, F., Chi, E., Le, Q.V., Zhou, D., et al.: Chain-of-thought prompting elicits reasoning in large language models. NeurIPS'22 35, 24824–24837 (2022)
- [31] Zimmerer, D., Full, P.M., Isensee, F., Jäger, P., Adler, T., Petersen, J., Köhler, G., Ross, T., Reinke, A., Kascenas, A., et al.: MOOD 2020: A public benchmark for out-of-distribution detection and localization on medical images. IEEE Transactions on Medical Imaging 41(10), 2728–2738 (2022)