Evaluation of Anatomical Shape Priors in Deep Learning-Based Cardiac Multi-Compartment Segmentation

Franz Thaler; Martin Urschler; Michael Hudler

arxiv: 2605.15707 · v1 · pith:6U4WBOVOnew · submitted 2026-05-15 · 📡 eess.IV · cs.CV

Pith reviewed 2026-05-19 19:13 UTC · model grok-4.3

classification 📡 eess.IV cs.CV

keywords: cardiac segmentation, shape priors, 3D U-Net, CT imaging, multi-compartment segmentation, anatomical constraints, deep learning

The pith

A standard 3D U-Net remains a strong baseline for cardiac CT segmentation while lightweight explicit shape priors deliver only marginal and often negative effects.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether adding simple explicit anatomical shape priors can improve multi-compartment heart segmentation in CT images by enforcing anatomical plausibility that standard CNNs lack. It implements the priors as shape-aware loss terms and as spatial label distribution heatmaps that guide a 3D U-Net, then evaluates both versions against an unmodified U-Net on the MM-WHS CT and WHS++ datasets. Across all runs the plain U-Net matches or exceeds the prior-augmented models, with the added constraints producing inconsistent or detrimental changes. This outcome matters because it indicates that ordinary networks already absorb substantial anatomical regularities from the training data alone. The authors therefore conclude that future progress will need richer learned priors rather than lightweight handcrafted ones.
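The paper's exact shape-aware loss terms are not reproduced on this page. As a minimal sketch of one plausible form, assuming a spatial label prior of shape (C, D, H, W) is available, a shape-aware loss can be written as a soft multi-class Dice term plus a weighted penalty on predicted probability placed where the prior considers a class implausible. The names `prior_penalty` and `shape_aware_loss`, the threshold `tau`, and the weight `lam` are hypothetical, and NumPy stands in for a training framework.

```python
import numpy as np


def soft_dice_loss(probs, onehot, eps=1e-6):
    """Mean soft Dice loss over classes; both arrays have shape (C, D, H, W)."""
    axes = (1, 2, 3)
    inter = np.sum(probs * onehot, axis=axes)
    denom = np.sum(probs, axis=axes) + np.sum(onehot, axis=axes)
    dice = (2.0 * inter + eps) / (denom + eps)
    return float(1.0 - dice.mean())


def prior_penalty(probs, prior, tau=0.01):
    """Hypothetical shape term: mean predicted probability assigned to
    (class, voxel) pairs whose training-set prior is below tau."""
    implausible = prior < tau
    return float(np.mean(probs * implausible))


def shape_aware_loss(probs, onehot, prior, lam=0.1):
    """Soft Dice plus a lam-weighted prior penalty (lam is an assumed weight)."""
    return soft_dice_loss(probs, onehot) + lam * prior_penalty(probs, prior)
```

Setting `lam=0` recovers the plain soft Dice objective of an unmodified U-Net, so the weight directly controls how strongly the prior is enforced.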

Core claim

Across experiments on the MM-WHS CT and WHS++ datasets, a standard 3D U-Net served as a robust baseline for whole-heart multi-compartment segmentation, while implementations of lightweight explicit shape priors as shape-aware losses and label distribution heatmaps produced at best marginal and inconsistent improvements that frequently reduced performance.

What carries the argument

Lightweight explicit anatomical shape priors implemented via shape-aware losses and spatial label distribution heatmaps in U-Net models.
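The abstract states that the priors are based on statistics derived from the training data. A minimal sketch of one such statistic is a per-voxel label frequency map, assuming the training label maps have already been brought into a common reference frame (for example by a Procrustes-style alignment, which this sketch does not perform). The function name and the smoothing count `alpha` are hypothetical.

```python
import numpy as np


def label_distribution_heatmap(label_maps, num_classes, alpha=1.0):
    """Per-voxel label frequencies over pre-aligned training label maps.

    label_maps: integer array (N, D, H, W), assumed registered to a common frame.
    alpha: hypothetical Laplace-smoothing count added to every class.
    Returns a (C, D, H, W) array that sums to 1 over the class axis at each voxel.
    """
    label_maps = np.asarray(label_maps)
    n = label_maps.shape[0]
    # Count, for every class, how many training subjects carry that label per voxel.
    counts = np.stack([(label_maps == c).sum(axis=0) for c in range(num_classes)])
    return (counts + alpha) / (n + alpha * num_classes)
```

With `alpha > 0`, no class receives exactly zero probability at any voxel, which keeps ratio- or log-based uses of the heatmap finite.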

If this is right

The unmodified 3D U-Net already encodes substantial implicit anatomical regularities from the training data.
Handcrafted shape priors are not reliably beneficial and can degrade segmentation quality.
Future accuracy gains will require more expressive learned priors rather than simple handcrafted constraints.
The observed performance differences are attributable to the priors rather than to implementation or dataset artifacts.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Standard CNNs may learn sufficient anatomical constraints implicitly for many medical segmentation tasks without manual priors.
Researchers should benchmark unmodified baselines thoroughly before introducing explicit prior terms.
The same evaluation approach could be applied to other organs and imaging modalities where shape priors are routinely added.

Load-bearing premise

The tested shape-aware losses and heatmap-guided U-Net variants are assumed to be representative of what lightweight explicit anatomical shape priors can achieve, and the observed differences are assumed to stem from the priors themselves.

What would settle it

A controlled experiment showing that an alternative implementation of lightweight explicit priors produces consistent, statistically significant gains in Dice score or surface distance over the unmodified 3D U-Net on the same datasets would falsify the central claim.
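Settling the claim this way needs per-case scores for both models on the same test cases and a paired test on their differences. The sketch below is a hedged illustration, not the paper's evaluation code: it computes hard per-label Dice scores and an exact two-sided sign-flip permutation test on the mean paired difference (a Wilcoxon signed-rank test is a common alternative). Function names and the empty-label convention are assumptions, and surface-distance metrics are omitted.

```python
from itertools import product

import numpy as np


def dice_per_class(pred, gt, num_classes, background=0):
    """Hard Dice score for every foreground label of two integer label volumes."""
    scores = {}
    for c in range(num_classes):
        if c == background:
            continue
        p, g = pred == c, gt == c
        denom = int(p.sum() + g.sum())
        # Assumed convention: a label absent from both volumes scores 1.0.
        scores[c] = 1.0 if denom == 0 else 2.0 * np.logical_and(p, g).sum() / denom
    return scores


def paired_permutation_pvalue(a, b):
    """Exact two-sided sign-flip test on the mean of paired differences a - b.

    Under the null hypothesis each per-case difference is equally likely to
    have either sign, so all 2**n sign patterns are enumerated.
    """
    if len(a) != len(b):
        raise ValueError("a and b must hold paired per-case scores")
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    observed = abs(sum(diffs)) / n
    extreme = 0
    for signs in product((1, -1), repeat=n):
        stat = abs(sum(s * d for s, d in zip(signs, diffs))) / n
        if stat >= observed - 1e-12:  # tolerance for floating-point ties
            extreme += 1
    return extreme / 2 ** n
```

Because the exact test enumerates all 2^n sign patterns, it is only practical for small test sets (roughly n ≤ 20); larger sets would sample random sign flips instead.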

Figures

Figures reproduced from arXiv: 2605.15707 by Franz Thaler, Martin Urschler, Michael Hudler.

**Figure 2.** Representative qualitative comparison on WHS++ (subject 2014, coronal slice 76).

Original abstract

Whole-heart multi-compartment CT segmentation is clinically important, but standard CNNs do not explicitly enforce anatomical plausibility. Based on statistics derived from the training data, we evaluate whether lightweight explicit shape priors, implemented as shape-aware losses and spatial label distribution heatmap-guided U-Net variants, improve 3D cardiac segmentation on MM-WHS CT and WHS++. Across all experiments, a standard 3D U-Net surprisingly remained a very strong baseline, with handcrafted priors yielding at best marginal and inconsistent changes and often degrading performance. These results suggest that the baseline already captures substantial implicit anatomical regularities and that future gains will likely require more expressive learned priors rather than simple handcrafted anatomical shape constraints.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it. The pith above is the substance; this section is the friction.

Desk Editor's Note (private letter to a colleague)

A plain 3D U-Net holds up well against added handcrafted shape priors in cardiac CT segmentation, with the priors giving marginal or negative returns.

The main point here is that a plain 3D U-Net performs as well as or better than versions with added handcrafted shape priors for cardiac CT segmentation. The priors, whether applied through shape-aware losses or heatmap guidance, give only small or inconsistent gains and sometimes hurt results. The work sets up a direct comparison on the MM-WHS CT and WHS++ datasets. It derives the priors from training-data statistics and tests them in a lightweight way, which is a reasonable approach for checking whether explicit constraints help. This kind of negative result is useful because it pushes the field toward more sophisticated learned priors rather than simple handcrafted ones, and the authors are clear that the baseline already picks up a lot of anatomical structure implicitly.

On the soft spots, hyperparameter tuning is a fair concern. If the shape-prior variants did not get as thorough a search over their own settings (loss weights, guidance parameters) as the baseline got over learning rate and optimizer, the observed drops in performance could reflect uneven tuning rather than the priors themselves. The abstract does not describe a joint sweep or an ablation on prior strength, so the full paper needs to show that the comparisons are apples-to-apples. The summary also gives no specific numbers, error bars, or statistical tests, which makes it hard to gauge how reliable the differences are, although the abstract states that the baseline stayed strong across all experiments.

This paper is mainly for people working on medical image segmentation, especially cardiac CT. A reader looking for evidence on whether explicit priors are worth the effort will get value from the empirical check. It does not introduce a new method, but the evaluation is honest and points to a practical takeaway.

I would recommend sending it for peer review. The experiments address a real question with standard tools, and even with the tuning concern, the work deserves referee input to verify the details and perhaps strengthen the reporting.

Referee Report

2 major / 2 minor

Summary. The manuscript evaluates lightweight explicit anatomical shape priors—implemented as shape-aware losses and spatial label distribution heatmap-guided U-Net variants—for 3D multi-compartment cardiac CT segmentation on the MM-WHS and WHS++ datasets. The central empirical claim is that a standard 3D U-Net remains a strong baseline across experiments, while the handcrafted priors produce at best marginal and inconsistent gains and frequently degrade performance, implying that the baseline already encodes substantial implicit anatomical regularities and that future progress will require more expressive learned priors.

Significance. If the quantitative results hold under rigorous verification, the work would be significant for medical image segmentation research by providing evidence against the routine addition of simple handcrafted shape constraints to CNNs. It could redirect efforts toward learned or data-driven priors and has direct relevance to clinical cardiac imaging pipelines that require anatomically plausible multi-structure outputs.

major comments (2)

[Abstract and Results] Abstract and experimental results: the claim that priors yield 'marginal and inconsistent changes' and 'often degrading performance' is presented without any reported Dice scores, Hausdorff distances, standard deviations, or statistical tests across compartments and datasets. This absence prevents assessment of effect sizes and undermines the load-bearing conclusion that the baseline is already sufficient.
[Methods / Experimental Setup] Experimental protocol: it is unclear whether shape-prior variants (e.g., loss-weight schedules for shape-aware terms or heatmap scaling) received hyperparameter optimization equivalent in scope to the baseline 3D U-Net (learning-rate sweeps, optimizer choices). Without a joint ablation study or explicit statement of equal tuning effort, observed degradations cannot be confidently attributed to the priors rather than implementation imbalance.

minor comments (2)

[Methods] Clarify the precise mathematical definition and normalization of the spatial label distribution heatmaps and their exact fusion mechanism within the U-Net decoder.
[Results] Add a table or supplementary material listing all quantitative metrics (mean and std) for every compartment, dataset, and method variant to support the narrative claims.
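The minor comment asks how the heatmaps are normalized and fused. Without the paper's definitions, the hedged sketch below shows two candidate normalizations, per-voxel (class probabilities sum to 1) versus per-class maximum scaling, and the simplest fusion option: stacking the prior channels with the CT volume at the network input. This is not the decoder-level fusion the comment refers to; the function names and `mode` strings are hypothetical.

```python
import numpy as np


def normalize_heatmap(counts, mode="per_voxel", eps=1e-8):
    """Two candidate normalizations for a (C, D, H, W) heatmap of label counts.

    per_voxel: values sum to 1 over the class axis at each voxel.
    per_class: each class channel is scaled so that its maximum is 1.
    """
    counts = np.asarray(counts, dtype=float)
    if mode == "per_voxel":
        return counts / (counts.sum(axis=0, keepdims=True) + eps)
    if mode == "per_class":
        peak = counts.reshape(counts.shape[0], -1).max(axis=1)
        return counts / (peak[:, None, None, None] + eps)
    raise ValueError(f"unknown mode: {mode}")


def fuse_as_input_channels(image, heatmap):
    """Input-level fusion: stack the CT volume (D, H, W) with C prior channels."""
    return np.concatenate([image[None], heatmap], axis=0)
```

Decoder-level fusion would instead resample the heatmap to each decoder resolution and concatenate it with the feature maps there; which variant the paper uses is exactly what the comment asks the authors to state.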

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments on our manuscript. We address each major point below, providing clarifications and indicating where revisions will be made to strengthen the presentation of our results and methods.

Referee: [Abstract and Results] Abstract and experimental results: the claim that priors yield 'marginal and inconsistent changes' and 'often degrading performance' is presented without any reported Dice scores, Hausdorff distances, standard deviations, or statistical tests across compartments and datasets. This absence prevents assessment of effect sizes and undermines the load-bearing conclusion that the baseline is already sufficient.

Authors: We agree that incorporating specific quantitative metrics into the abstract would improve readability and allow readers to immediately assess effect sizes. The full results section already includes comprehensive tables with per-compartment Dice scores, Hausdorff distances, and standard deviations for both the MM-WHS CT and WHS++ datasets. To directly address this concern, we will revise the abstract to report key representative values (e.g., mean Dice for the baseline 3D U-Net versus the strongest prior variant) and will add paired statistical tests (with p-values) to the results section in the revised manuscript. revision: yes
Referee: [Methods / Experimental Setup] Experimental protocol: it is unclear whether shape-prior variants (e.g., loss-weight schedules for shape-aware terms or heatmap scaling) received hyperparameter optimization equivalent in scope to the baseline 3D U-Net (learning-rate sweeps, optimizer choices). Without a joint ablation study or explicit statement of equal tuning effort, observed degradations cannot be confidently attributed to the priors rather than implementation imbalance.

Authors: All shape-prior variants underwent hyperparameter optimization of equivalent scope to the baseline, including learning-rate sweeps, optimizer selection, and specific tuning of loss weights and heatmap scaling factors under the same 5-fold cross-validation protocol. To eliminate ambiguity, we will add an explicit subsection to the Methods describing the search ranges and selected hyperparameters for each method. An exhaustive joint ablation across every combination is computationally prohibitive given our resources, but the documented equal-effort tuning supports attributing performance differences to the priors themselves. revision: yes
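The rebuttal describes equal-scope tuning under 5-fold cross-validation. A minimal sketch of what such a protocol could look like, assuming subject-level folds and a hypothetical grid: subject IDs are split into k disjoint validation folds, every method gets the same learning-rate grid, and the prior variants additionally sweep their own weight `lam`. The grid values and names are illustrative assumptions, not the authors' settings.

```python
def kfold_splits(subject_ids, k=5):
    """Yield (train_ids, val_ids) for k folds; each subject is validated exactly once."""
    ids = list(subject_ids)
    folds = [ids[i::k] for i in range(k)]
    for i in range(k):
        train = [s for j, fold in enumerate(folds) if j != i for s in fold]
        yield train, folds[i]


# Hypothetical learning-rate grid shared by all methods.
SHARED_LR = [1e-4, 3e-4, 1e-3]
# Hypothetical prior weights swept only by the shape-prior variants.
PRIOR_WEIGHTS = [0.01, 0.1, 1.0]


def configs(method):
    """Candidate configurations; the baseline has no prior weight to tune."""
    weights = [0.0] if method == "baseline" else PRIOR_WEIGHTS
    return [{"lr": lr, "lam": w} for lr in SHARED_LR for w in weights]
```

Splitting by subject rather than by slice keeps all voxels of one scan in the same fold, so validation scores are not inflated by near-duplicate data.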

Circularity Check

0 steps flagged

No circularity: purely empirical evaluation with no derivations

This paper is a self-contained empirical evaluation of segmentation methods on public cardiac CT datasets (MM-WHS and WHS++). It reports direct experimental comparisons between a baseline 3D U-Net and variants incorporating handcrafted shape priors, with all performance claims grounded in measured Dice scores and other metrics rather than any mathematical derivation chain. There are no equations, fitted predictions, self-definitional constructs, or load-bearing self-citations that reduce claims to inputs by construction. The central finding—that explicit priors yield marginal or negative gains—is presented as an observation from the experiments, not as a derived result.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim depends on the assumption that the chosen datasets are representative benchmarks and that the implemented priors fairly test the value of explicit shape constraints; no free parameters or invented entities are described.

axioms (1)

domain assumption: The MM-WHS CT and WHS++ datasets provide suitable test cases for evaluating anatomical plausibility in whole-heart multi-compartment segmentation.
These datasets are used to draw the conclusion that standard U-Nets capture implicit regularities.

pith-pipeline@v0.9.0 · 5646 in / 1219 out tokens · 58089 ms · 2026-05-19T19:13:39.884787+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Relation between the paper passage and the cited Recognition theorem.

Across all experiments, a standard 3D U-Net surprisingly remained a very strong baseline, with handcrafted priors yielding at best marginal and inconsistent changes and often degrading performance.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

16 extracted references · 16 canonical work pages

[1] Xiahai Zhuang, Lei Li, Christian Payer, Darko Štern, Martin Urschler, Mattias P Heinrich, Julien Oster, Chunliang Wang, Örjan Smedby, Cheng Bian, Xin Yang, Pheng-Ann Heng, Aliasghar Mortazi, Ulas Bagci, et al. Evaluation of algorithms for multi-modality whole heart segmentation: An open-access grand challenge. Medical Image Analysis, 58:101537, December 2019.

[2] Elena Zappon, Luca Azzolin, Matthias A F Gsell, Franz Thaler, Anton J Prassl, Robert Arnold, Karli Gillette, Mohammadreza Kariman, Martin Manninger, Daniel Scherr, Aurel Neic, Martin Urschler, Christoph M Augustin, Edward J Vigmond, and Gernot Plank. An efficient end-to-end computational framework for the generation of ECG calibrated volumetric models of ... (2025)

[3] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436–444, May 2015.

[4] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015, pages 234–241. Springer International Publishing, 2015.

[5] Özgün Çiçek, Ahmed Abdulkadir, Soeren S Lienkamp, Thomas Brox, and Olaf Ronneberger. 3D U-net: Learning dense volumetric segmentation from sparse annotation. In Medical Image Computing and Computer-Assisted Intervention – MICCAI 2016, Lecture Notes in Computer Science, pages 424–432. Springer International Publishing, Cham, 2016.

[6] Christian Payer, Darko Štern, Horst Bischof, and Martin Urschler. Multi-label whole heart segmentation using CNNs and anatomical label configurations. In Statistical Atlases and Computational Models of the Heart. ACDC and MMWHS Challenges, pages 190–198. Springer International Publishing, 2018.

[7] Tim F Cootes, Chris J Taylor, David H Cooper, and Jim Graham. Active shape models-their training and application. Computer Vision and Image Understanding, 61(1):38–59, January 1995.

[8] Tobias Heimann and Hans-Peter Meinzer. Statistical shape models for 3D medical image segmentation: a review. Medical Image Analysis, 13(4):543–563, August 2009.

[9] Simon Bohlender, Ilkay Oksuz, and Anirban Mukhopadhyay. A survey on shape-constraint deep learning for medical image segmentation. IEEE Reviews in Biomedical Engineering, 16:225–240, January 2023.

[10] Peter H Schoenemann. A generalized solution of the orthogonal procrustes problem. Psychometrika, 31:1–10, 1966.

[11] Andrew L Maas, Awni Y Hannun, and Andrew Y Ng. Rectifier nonlinearities improve neural network acoustic models. In International Conference on Machine Learning (ICML), volume 30, 2013.

[12] Carole H Sudre, Wenqi Li, Tom Vercauteren, Sebastien Ourselin, and M Jorge Cardoso. Generalised dice overlap as a deep learning loss function for highly unbalanced segmentations. In Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support - DLMIA ML-CDS 2017, volume 10553 of Lecture Notes in Computer Science, pages 24...

[13] Michael Hudler. Evaluation of Shape Models for Deep Learning Based Cardiac Image Segmentation. Master's Thesis, Graz University of Technology, Graz, Austria, 2026.

[14] Franz Thaler, Darko Štern, Gernot Plank, and Martin Urschler. Augmentation-based domain generalization and joint training from multiple source domains for whole heart segmentation. In Comprehensive Analysis and Computing of Real-World Medical Images. CARE 2024, volume 15548 of Lecture Notes in Computer Science, pages 168–179. Springer Nature Switzerland, C...

[15] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In Advances in Neural Information Processing Systems (NeurIPS), volume 33, pages 6840–6851, 2020.

[16] Arnela Hadzic, Lea Bogensperger, Andrea Berghold, and Martin Urschler. Flow matching-based data synthesis for robust anatomical landmark localization. IEEE Journal of Biomedical and Health Informatics, early access, 2025.
