Metrics reloaded: Recommendations for image analysis validation

Abdel A. Taha; Adrian Galdran; A. Emre Kavur; Alan Karthikesalingam; Aleksei Tiulpin; Alexandros Karargyris; Amin Madani; Anna Kreshuk; Anne L. Martel; Annette Kopp-Schneider

arxiv: 2206.01653 · v8 · pith:TXA72PKYnew · submitted 2022-06-03 · 💻 cs.CV

Lena Maier-Hein, Annika Reinke, Patrick Godau, Minu D. Tizabi, Florian Buettner, Evangelia Christodoulou, Ben Glocker, Fabian Isensee

Jens Kleesiek, Michal Kozubek, Mauricio Reyes, Michael A. Riegler, Manuel Wiesenfarth, A. Emre Kavur, Carole H. Sudre, Michael Baumgartner, Matthias Eisenmann, Doreen Heckmann-Nötzel, Tim Rädsch, Laura Acion, Michela Antonelli, Tal Arbel, Spyridon Bakas, Arriel Benis, Matthew Blaschko, M. Jorge Cardoso, Veronika Cheplygina, Beth A. Cimini, Gary S. Collins, Keyvan Farahani, Luciana Ferrer, Adrian Galdran, Bram van Ginneken, Robert Haase, Daniel A. Hashimoto, Michael M. Hoffman, Merel Huisman, Pierre Jannin, Charles E. Kahn, Dagmar Kainmueller, Bernhard Kainz, Alexandros Karargyris, Alan Karthikesalingam, Hannes Kenngott, Florian Kofler, Annette Kopp-Schneider, Anna Kreshuk, Tahsin Kurc, Bennett A. Landman, Geert Litjens, Amin Madani, Klaus Maier-Hein, Anne L. Martel, Peter Mattson, Erik Meijering, Bjoern Menze, Karel G.M. Moons, Henning Müller, Brennan Nichyporuk, Felix Nickel, Jens Petersen, Nasir Rajpoot, Nicola Rieke, Julio Saez-Rodriguez, Clara I. Sánchez, Shravya Shetty, Maarten van Smeden, Ronald M. Summers, Abdel A. Taha, Aleksei Tiulpin, Sotirios A. Tsaftaris, Ben Van Calster, Gaël Varoquaux, Paul F. Jäger

classification 💻 cs.CV

keywords: metrics, image, reloaded, validation, analysis, framework, problem, across

Increasing evidence shows that flaws in machine learning (ML) algorithm validation are an underestimated global problem. Particularly in automatic biomedical image analysis, chosen performance metrics often do not reflect the domain interest, thus failing to adequately measure scientific progress and hindering translation of ML techniques into practice. To overcome this, our large international expert consortium created Metrics Reloaded, a comprehensive framework guiding researchers in the problem-aware selection of metrics. Following the convergence of ML methodology across application domains, Metrics Reloaded fosters the convergence of validation methodology. The framework was developed in a multi-stage Delphi process and is based on the novel concept of a problem fingerprint - a structured representation of the given problem that captures all aspects that are relevant for metric selection, from the domain interest to the properties of the target structure(s), data set and algorithm output. Based on the problem fingerprint, users are guided through the process of choosing and applying appropriate validation metrics while being made aware of potential pitfalls. Metrics Reloaded targets image analysis problems that can be interpreted as a classification task at image, object or pixel level, namely image-level classification, object detection, semantic segmentation, and instance segmentation tasks. To improve the user experience, we implemented the framework in the Metrics Reloaded online tool, which also provides a point of access to explore weaknesses, strengths and specific recommendations for the most common validation metrics. The broad applicability of our framework across domains is demonstrated by an instantiation for various biological and medical image analysis use cases.
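To make the idea of a problem fingerprint more concrete, the following is a minimal sketch of how such a structured representation could be held in code. It is not the Metrics Reloaded implementation or the online tool's data model. The four problem categories and the four property groups (domain interest, target structure, data set, algorithm output) are taken from the abstract; the class names, field names, and the example property `small_structures` are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum


class ProblemCategory(Enum):
    """The four problem categories named in the abstract."""
    IMAGE_LEVEL_CLASSIFICATION = "image-level classification"
    OBJECT_DETECTION = "object detection"
    SEMANTIC_SEGMENTATION = "semantic segmentation"
    INSTANCE_SEGMENTATION = "instance segmentation"


@dataclass
class ProblemFingerprint:
    """Hypothetical container for the properties relevant to metric selection.

    Each dict holds free-form property names and values; the real
    framework defines its own fixed set of fingerprint items.
    """
    category: ProblemCategory
    domain_interest: dict = field(default_factory=dict)
    target_structure: dict = field(default_factory=dict)
    data_set: dict = field(default_factory=dict)
    algorithm_output: dict = field(default_factory=dict)

    def filled_groups(self) -> list:
        """Return the names of property groups that contain at least one entry."""
        groups = {
            "domain_interest": self.domain_interest,
            "target_structure": self.target_structure,
            "data_set": self.data_set,
            "algorithm_output": self.algorithm_output,
        }
        return sorted(name for name, props in groups.items() if props)


# Illustrative usage with a made-up property name.
fingerprint = ProblemFingerprint(
    category=ProblemCategory.SEMANTIC_SEGMENTATION,
    target_structure={"small_structures": True},
)
print(fingerprint.category.value, fingerprint.filled_groups())
```

A fingerprint of this kind only records the problem description; the mapping from fingerprint to recommended metrics is the substance of the framework and is deliberately not reproduced here.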

Forward citations

Cited by 7 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

U-Mamba: Enhancing Long-range Dependency for Biomedical Image Segmentation
eess.IV · 2024-01 · unverdicted · novelty 7.0

U-Mamba is a hybrid CNN-SSM architecture that outperforms prior CNN and Transformer networks on biomedical image segmentation tasks by efficiently modeling long-range dependencies.

MONAI: An open-source framework for deep learning in healthcare
cs.LG · 2022-11 · accept · novelty 6.0

MONAI is a community-supported PyTorch framework that extends deep learning to medical data with domain-specific architectures, transforms, and deployment tools.

ClinReadNet: A clinical reading-inspired network for low-dose abdominal CT image quality assessment
cs.CV · 2026-06 · unverdicted · novelty 5.0

ClinReadNet introduces SOQN, (S)W-MTMSA, and HRPS loss to achieve SOTA no-reference IQA on LDCTIQAG2023 with PLCC 0.9507, SROCC 0.9554, KROCC 0.8629.

OSS: Open Suturing Skills Vision-Based Assessment Challenge 2024-2025
cs.CV · 2026-05 · accept · novelty 5.0

The OSS Challenge provides benchmarks showing spatiotemporal video models excel at open suturing skill classification and OSATS scoring but struggle with keypoint tracking under occlusion.

U-SEG: Uncertainty in SEGmentation -- A systematic multi-variable exploration
cs.CV · 2026-05 · unverdicted · novelty 5.0

Systematic multi-variable experiments show panoptic segmentation yields poorer uncertainty quality than semantic, with high variance across datasets and backbones, limited value from time-series samples, calibration g...

The autoPET3 Challenge: Automated Lesion Segmentation in Whole-Body PET/CT – Multitracer Multicenter Generalization
cs.CV · 2026-05 · unverdicted · novelty 5.0

The autoPET3 challenge finds that leading AI models reach a mean Dice score of 0.66 for multitracer PET/CT lesion segmentation, with compositional generalization to unseen tracer-center pairs remaining an open problem...

The autoPET3 Challenge: Automated Lesion Segmentation in Whole-Body PET/CT – Multitracer Multicenter Generalization
cs.CV · 2026-05 · unverdicted · novelty 4.0

The autoPET3 challenge finds good in-domain lesion segmentation performance in multitracer PET/CT but identifies compositional generalization to unseen tracer-center combinations as an open problem driven by volume ov...