arXiv: 2606.22649 · v2 · pith:EX7A3VIPnew · submitted 2026-06-21 · 💻 cs.CV · cs.LG · eess.IV

MaRS: Robust Out-of-Distribution Detection via Mahalanobis Residual Scoring

Pith reviewed 2026-06-29 01:21 UTC · model grok-4.3

classification 💻 cs.CV · cs.LG · eess.IV
keywords out-of-distribution detection · Mahalanobis distance · autoencoder · reconstruction residuals · medical imaging · distribution shift · post-hoc detector

The pith

Mahalanobis distance on autoencoder residuals detects out-of-distribution medical images more reliably than L2 norms.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that reconstruction-based OOD detection underperforms not because autoencoders reconstruct poorly, but because L2 norms on residuals ignore the directional variance in those errors. MaRS instead fits a lightweight autoencoder to in-distribution embeddings and scores new residuals with Mahalanobis distance to produce variance-aware OOD scores. This post-hoc method is tested across three imaging modalities, multiple distribution shifts, and various model families. A sympathetic reader would care because it aims to make frozen foundation models safer for clinical deployment without retraining or labels.

Core claim

The limitation of reconstruction-based methods in latent space does not stem from poor reconstruction quality, but from how reconstruction errors are scored. Standard L2 residual norms collapse the anisotropic residual structure, thereby suppressing informative deviations. MaRS addresses this by learning an in-distribution manifold using a lightweight autoencoder and measuring deviation via a Mahalanobis distance on reconstruction residuals, yielding variance-aware OOD scores that outperform confidence-, distance-, and reconstruction-based baselines.
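
One way to make the "collapse" concrete is to write both scores in the eigenbasis of the ID residual covariance. The notation here is an assumption for illustration, not taken from the paper: Σ is the covariance of ID residuals with eigenpairs (λ_i, u_i), and ID residuals are assumed zero-mean for simplicity.

```latex
\begin{aligned}
\Sigma &= \sum_{i=1}^{d} \lambda_i \, u_i u_i^\top, \\
s_{L_2}(r) &= \lVert r \rVert_2^2 = \sum_{i=1}^{d} \bigl(u_i^\top r\bigr)^2, \\
s_{\mathrm{Maha}}(r) &= r^\top \Sigma^{-1} r = \sum_{i=1}^{d} \frac{\bigl(u_i^\top r\bigr)^2}{\lambda_i}.
\end{aligned}
```

Under L2 every direction carries weight 1, so a deviation along a low-variance direction (small λ_i) is swamped by ordinary ID variation along high-variance directions. Dividing by λ_i measures each deviation relative to what is typical in that direction, which is the sense in which the score is variance-aware.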

What carries the argument

Mahalanobis Residual Scoring, which computes a Mahalanobis distance on the reconstruction residuals of a lightweight autoencoder trained on in-distribution data.

If this is right

  • MaRS outperforms established confidence-, distance-, and reconstruction-based baselines across three imaging modalities and multiple distribution shifts.
  • The detector remains fully post-hoc and lightweight while applying to different model families and scales.
  • Label-free OOD detection becomes feasible by fitting the autoencoder only on in-distribution data.
  • Variance-aware scoring on residuals revives the utility of reconstruction-based detectors in latent space.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Other reconstruction-based OOD methods could be upgraded simply by swapping their scoring function to Mahalanobis without retraining larger models.
  • The same residual structure may exist in non-medical high-dimensional embeddings, suggesting tests on natural images or text.
  • MaRS could be combined with embedding-distance detectors like kNN to create hybrid post-hoc systems.

Load-bearing premise

The performance gap in reconstruction-based OOD detection arises from L2 scoring collapsing anisotropic residual structure rather than from reconstruction quality itself, and a lightweight autoencoder suffices to learn the manifold.

What would settle it

Two controlled experiments would settle it. First, keep the same lightweight autoencoder and replace Mahalanobis scoring with L2: if the performance gap to MaRS disappears, scoring is not the bottleneck and the premise fails. Second, improve reconstruction error alone while keeping L2 scoring: if that still fails to match MaRS, the premise holds.
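
The scoring-swap control can be sketched on synthetic residuals. Everything below is an illustrative assumption rather than the paper's data or code: the autoencoder is abstracted away by simulating residuals directly, the per-direction scales in `stds` are made up, and the OOD shift is placed in the four lowest-variance directions to mimic the pattern the paper reports. Both scorers see exactly the same residuals, which plays the role of holding the autoencoder fixed.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16
# Hypothetical anisotropic residual scales: a few wide directions, many narrow ones.
stds = np.geomspace(5.0, 0.05, d)


def sample_residuals(n, shift=None):
    r = rng.normal(size=(n, d)) * stds
    return r if shift is None else r + shift


# Assumed OOD effect: a small offset in the four lowest-variance directions only.
shift = np.zeros(d)
shift[-4:] = 0.3

id_fit = sample_residuals(2000)        # used to estimate ID residual statistics
id_test = sample_residuals(1000)
ood_test = sample_residuals(1000, shift)

mean = id_fit.mean(axis=0)
precision = np.linalg.inv(np.cov(id_fit, rowvar=False))


def l2_score(r):
    return (r ** 2).sum(axis=1)


def mahalanobis_score(r):
    c = r - mean
    # Quadratic form c^T Sigma^{-1} c, one value per row.
    return np.einsum("ij,jk,ik->i", c, precision, c)


def auroc(neg, pos):
    # Probability that a random OOD score exceeds a random ID score (ties count half).
    diff = pos[:, None] - neg[None, :]
    return float((diff > 0).mean() + 0.5 * (diff == 0).mean())


auroc_l2 = auroc(l2_score(id_test), l2_score(ood_test))
auroc_maha = auroc(mahalanobis_score(id_test), mahalanobis_score(ood_test))
print(f"L2 AUROC: {auroc_l2:.3f}  Mahalanobis AUROC: {auroc_maha:.3f}")
```

In this toy setting the shift is tiny in Euclidean terms but large relative to the variance of the directions it lives in, so the two scorers should separate sharply. Whether real backbone residuals behave this way is the empirical question the paper addresses.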

Figures

Figures reproduced from arXiv: 2606.22649 by Christian Ledig, Francesco Di Salvo, Sebastian Doerrich.

Figure 1: Left: Given a frozen backbone Φ producing features z, an autoencoder (E, D) learns a projection onto the in-distribution (ID) manifold $\mathcal{M}_{\mathrm{ID}}$, and the reconstruction residual $r = z - \hat{z}$ captures deviations from it. Right: Principal components of the ID residual covariance reveal that out-of-distribution (OOD) deviations concentrate in low-variance directions, which are amplified by variance-aware Mahalanobis …
Figure 2: Analysis of residual space for MIDOG+FNAC. Left: CDF of eigenvalues of the ID residual covariance, indicating a highly anisotropic residual distribution. Right: Mean squared residual deviation $\mathbb{E}_x[(u_i^\top r)^2]$ of ID and OOD samples along each principal component $u_i$. OOD residuals exhibit larger and more persistent deviations precisely in the low-variance directions, motivating variance-aware Mahalanobis …
Figure 3: Left: AUROC on chest X-ray and dermatoscopy for ViT and DINOv3 backbones at small (S) and base (B) scales. MaRS consistently outperforms Mahalanobis++ and is less sensitive to increased feature dimensionality. Right: Average AUROC of the top three methods under pre- and post-normalization. While normalization improves latent-space baselines, it suppresses variance anisotropy and degrades MaRS, confirming …
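
The two quantities plotted in Figure 2 (the cumulative eigenvalue spectrum and the per-component mean squared projection) can be computed roughly as follows. This is a sketch on synthetic residuals, not the paper's analysis code: `residual_spectrum` is a hypothetical helper name, the data are simulated, and centering by the ID residual mean before projecting is an assumption.

```python
import numpy as np


def residual_spectrum(id_res, ood_res):
    """Eigen-spectrum of the ID residual covariance and per-component energies."""
    mean = id_res.mean(axis=0)
    cov = np.cov(id_res, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)      # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]           # reorder to descending variance
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    cum_var = np.cumsum(eigvals) / eigvals.sum()
    # Mean squared projection onto each principal direction u_i.
    id_energy = (((id_res - mean) @ eigvecs) ** 2).mean(axis=0)
    ood_energy = (((ood_res - mean) @ eigvecs) ** 2).mean(axis=0)
    return eigvals, cum_var, id_energy, ood_energy


rng = np.random.default_rng(0)
stds = np.geomspace(5.0, 0.05, 8)               # made-up anisotropic scales
id_res = rng.normal(size=(5000, 8)) * stds
# Simulated OOD: same noise plus an offset in the two narrowest directions.
ood_res = id_res[:1000] + np.where(np.arange(8) >= 6, 0.5, 0.0)

eigvals, cum_var, id_energy, ood_energy = residual_spectrum(id_res, ood_res)
ratio = ood_energy / id_energy                  # >> 1 where OOD deviates from ID
print(np.round(cum_var, 3))
print(np.round(ratio, 2))
```

A steeply saturating `cum_var` indicates anisotropy, and a `ratio` that grows toward the last components corresponds to the "deviations in low-variance directions" pattern described in the caption.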
Abstract

Foundation models provide highly descriptive representations for medical images, yet their reliability degrades under distribution shifts arising from changes in patients, devices, or acquisition conditions. Reliable out-of-distribution (OOD) detection is therefore essential for safe deployment. Recent post-hoc detectors efficiently exploit frozen embeddings (e.g., kNN), whereas reconstruction-based OOD detection in latent feature space has seen limited adoption due to inconsistent performance. In this work, we show that the limitation of reconstruction-based methods in latent space does not stem from poor reconstruction quality, but from how reconstruction errors are scored. Standard L2 residual norms collapse the anisotropic residual structure, thereby suppressing informative deviations. To address this limitation, we introduce MaRS (Mahalanobis Residual Scoring), a label-free OOD detector that learns an in-distribution manifold using a lightweight autoencoder and measures deviation via a Mahalanobis distance on reconstruction residuals, yielding variance-aware OOD scores. Across three imaging modalities, multiple types of distribution shift, and different model families and scales, MaRS outperforms established confidence-, distance-, and reconstruction-based baselines, while remaining fully post-hoc and lightweight. The code is available at https://github.com/francescodisalvo05/mars.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces MaRS, a post-hoc OOD detector for frozen foundation-model embeddings in medical imaging. A lightweight autoencoder is trained on in-distribution features to approximate the ID manifold; OOD scores are then obtained by applying Mahalanobis distance to the reconstruction residuals rather than L2 norms, on the grounds that L2 collapses informative anisotropic structure. The authors claim that this yields consistent gains over confidence-, distance-, and reconstruction-based baselines across three modalities, multiple distribution shifts, and different model families/scales, while remaining fully post-hoc and computationally light.

Significance. If the empirical superiority is robust and the causal attribution to scoring (rather than manifold quality) is substantiated, the work would usefully rehabilitate reconstruction-based OOD detection for safety-critical medical applications. The method is simple, label-free, and compatible with existing foundation models, which could make it a practical addition to the post-hoc detector toolbox.

major comments (3)
  1. [Abstract, §4] The central claim of consistent outperformance is stated without any quantitative numbers, baseline specifications, or statistical tests visible in the provided material; the full manuscript must supply these to support the headline result.
  2. [§3.2, §5.1] The argument that reconstruction-based methods underperform because of L2 scoring (rather than inadequate manifold approximation) is load-bearing, yet no reconstruction-error metrics (MSE, cosine similarity) on held-out ID data are reported, nor is there an ablation that holds the autoencoder fixed while swapping L2 versus Mahalanobis scoring.
  3. [§5.2, Table 3 (or equivalent)] Without an explicit comparison of the lightweight AE against a higher-capacity reconstructor while keeping the Mahalanobis scorer fixed, it remains unclear whether the reported gains isolate the effect of variance-aware scoring or simply reflect a different (still-limited) deviation signal.
minor comments (2)
  1. [§3.1] Notation in §3.1: explicitly define the residual vector r and the sample covariance estimator used for the Mahalanobis matrix; state whether any regularization or shrinkage is applied.
  2. [Figures] Figure captions and legends should indicate the exact number of ID/OOD samples and the precise shift types shown in each panel.
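
The first minor comment matters in practice because a plain sample covariance becomes singular when there are fewer residual samples than dimensions. A minimal sketch of one common remedy, linear shrinkage toward a scaled identity, is shown below. Whether the paper uses this or any other regularizer is not stated in the material here; `shrunk_covariance` and the value `alpha=0.1` are hypothetical choices for illustration.

```python
import numpy as np


def shrunk_covariance(residuals, alpha=0.1):
    """Convex blend of the sample covariance with a scaled-identity target."""
    sample_cov = np.cov(residuals, rowvar=False)
    d = sample_cov.shape[0]
    target = (np.trace(sample_cov) / d) * np.eye(d)   # preserves total variance
    return (1.0 - alpha) * sample_cov + alpha * target


rng = np.random.default_rng(0)
few_residuals = rng.normal(size=(20, 50))   # 20 samples in 50 dimensions

sample_cov = np.cov(few_residuals, rowvar=False)
shrunk_cov = shrunk_covariance(few_residuals, alpha=0.1)

# Rank is at most n - 1 = 19, so the sample covariance cannot be inverted.
print(np.linalg.matrix_rank(sample_cov))
# After shrinkage every eigenvalue is strictly positive, so inversion is safe.
print(np.linalg.eigvalsh(shrunk_cov).min())
```

A data-driven shrinkage intensity, such as the Ledoit-Wolf estimator available as `sklearn.covariance.LedoitWolf`, is a common alternative to a hand-picked `alpha`.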

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The comments highlight important aspects that will strengthen the manuscript. We address each major comment point-by-point below, with plans to incorporate revisions where appropriate.

  1. Referee: [Abstract, §4] The central claim of consistent outperformance is stated without any quantitative numbers, baseline specifications, or statistical tests visible in the provided material; the full manuscript must supply these to support the headline result.

    Authors: We agree that the abstract and §4 require explicit quantitative support. The revised version will include specific performance metrics (e.g., AUROC and AUPR values) for MaRS and all baselines across the reported modalities and shifts. We will also specify the exact baselines used and include statistical significance tests (paired t-tests or Wilcoxon signed-rank tests with p-values) to substantiate the claims of consistent outperformance.
    Revision: yes

  2. Referee: [§3.2, §5.1] The argument that reconstruction-based methods underperform because of L2 scoring (rather than inadequate manifold approximation) is load-bearing, yet no reconstruction-error metrics (MSE, cosine similarity) on held-out ID data are reported, nor is there an ablation that holds the autoencoder fixed while swapping L2 versus Mahalanobis scoring.

    Authors: This point is well-taken and directly addresses a core claim. We will add reconstruction-error metrics (MSE and cosine similarity) computed on held-out in-distribution data to demonstrate that the autoencoder achieves high-fidelity reconstruction. We will also include a dedicated ablation that fixes the trained autoencoder and directly compares L2 residual norms against Mahalanobis scoring on the residuals, isolating the contribution of the variance-aware scoring.
    Revision: yes

  3. Referee: [§5.2, Table 3 (or equivalent)] Without an explicit comparison of the lightweight AE against a higher-capacity reconstructor while keeping the Mahalanobis scorer fixed, it remains unclear whether the reported gains isolate the effect of variance-aware scoring or simply reflect a different (still-limited) deviation signal.

    Authors: We acknowledge the value of this control experiment. In the revision we will add a comparison in which the Mahalanobis scorer is held fixed while the autoencoder capacity is increased (e.g., deeper or wider architecture), allowing us to verify that the performance gains are attributable to the scoring method rather than differences in manifold approximation quality.
    Revision: yes
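
The reconstruction-fidelity metrics named in the second response (MSE and cosine similarity between features z and reconstructions ẑ) could be computed along these lines. This is a sketch with a hypothetical helper name and toy inputs, not the authors' evaluation code.

```python
import numpy as np


def reconstruction_metrics(z, z_hat):
    """Mean squared error and mean row-wise cosine similarity."""
    mse = float(np.mean((z - z_hat) ** 2))
    dots = np.sum(z * z_hat, axis=1)
    norms = np.linalg.norm(z, axis=1) * np.linalg.norm(z_hat, axis=1)
    cosine = float(np.mean(dots / np.maximum(norms, 1e-12)))  # guard against zero norm
    return {"mse": mse, "cosine": cosine}


# Toy example: the second reconstruction has the right direction but wrong length.
z = np.array([[1.0, 0.0], [0.0, 2.0]])
z_hat = np.array([[1.0, 0.0], [0.0, 1.0]])
metrics = reconstruction_metrics(z, z_hat)
print(metrics)
```

The toy input shows why reporting both is useful: cosine similarity ignores magnitude errors that MSE picks up, so the two metrics can disagree about how faithful a reconstruction is.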

Circularity Check

0 steps flagged

No circularity; empirical method applies Mahalanobis to AE residuals without reducing claims to fitted inputs by construction.

The paper defines MaRS as training a lightweight autoencoder on in-distribution features, computing residuals, estimating their covariance from ID data, and scoring via Mahalanobis distance. This is a straightforward post-hoc procedure whose performance is assessed through direct comparisons to baselines on held-out shifts. No equation or claim equates a derived quantity to its own fitting procedure, invokes self-citations as load-bearing uniqueness theorems, or renames an empirical pattern as a novel derivation. The central assertion that L2 scoring (rather than reconstruction quality) is the bottleneck is presented as a hypothesis tested by the method's results, not presupposed by the equations themselves.
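
The four steps described above (fit a reconstructor on ID features, compute residuals, estimate their covariance, score by Mahalanobis distance) can be sketched end to end. Several things here are assumptions made loudly for illustration: the features are synthetic, a PCA projection stands in for the lightweight autoencoder (it is not the paper's architecture), the dimensions `d` and `k`, the ridge term, and the 95% threshold are arbitrary, and `mars_like_score` is a hypothetical name.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 32, 8                       # feature and bottleneck sizes (hypothetical)

# Synthetic stand-in for frozen-backbone features: ID data lie near a k-dim subspace.
basis = np.linalg.qr(rng.normal(size=(d, k)))[0]


def sample_id(n):
    on_manifold = 3.0 * (rng.normal(size=(n, k)) @ basis.T)
    return on_manifold + 0.1 * rng.normal(size=(n, d))


z_train, z_val, z_test_id = sample_id(3000), sample_id(1000), sample_id(1000)
z_test_ood = sample_id(1000) + 0.3 * rng.normal(size=(1000, d))  # extra off-manifold noise

# 1) Reconstructor: PCA projection used as a linear stand-in for the autoencoder.
mu_z = z_train.mean(axis=0)
_, _, vt = np.linalg.svd(z_train - mu_z, full_matrices=False)
components = vt[:k]


def reconstruct(z):
    return (z - mu_z) @ components.T @ components + mu_z


# 2) Residual statistics estimated on held-out ID data.
r_val = z_val - reconstruct(z_val)
mu_r = r_val.mean(axis=0)
# Ridge term: residuals of a rank-k projection have a rank-deficient covariance.
cov_r = np.cov(r_val, rowvar=False) + 1e-6 * np.eye(d)
prec_r = np.linalg.inv(cov_r)


# 3) Mahalanobis distance on the reconstruction residual.
def mars_like_score(z):
    c = z - reconstruct(z) - mu_r
    return np.einsum("ij,jk,ik->i", c, prec_r, c)


# 4) Flag samples above the 95th percentile of ID validation scores.
threshold = np.quantile(mars_like_score(z_val), 0.95)
flag_id = mars_like_score(z_test_id) > threshold
flag_ood = mars_like_score(z_test_ood) > threshold
print(f"flagged ID: {flag_id.mean():.2%}  flagged OOD: {flag_ood.mean():.2%}")
```

Nothing in this procedure uses labels or touches the backbone, which is what makes the detector post-hoc. Swapping the PCA step for a trained nonlinear autoencoder changes only step 1.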

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The claim rests on training a lightweight autoencoder on in-distribution data and the domain assumption that residuals are anisotropic enough for Mahalanobis to add value over L2; no new entities are postulated.

free parameters (1)
  • lightweight autoencoder weights
    Weights are fitted to in-distribution embeddings from the frozen backbone to learn the manifold; this is a standard fitted component of any autoencoder approach.
axioms (1)
  • domain assumption: Reconstruction residuals exhibit anisotropic structure that L2 norms suppress but Mahalanobis distance captures
    This premise is stated directly in the abstract as the reason standard reconstruction methods underperform.

pith-pipeline@v0.9.1-grok · 5750 in / 1141 out tokens · 43956 ms · 2026-06-29T01:21:34.722692+00:00 · methodology
