Challenges in Deep Learning-Based Small Organ Segmentation: A Benchmarking Perspective for Medical Research with Limited Datasets

Andrei-Alexandru Popa; Andrei Constantinescu; Elisa A. Liehn; Liang Zhong; Madhulika Tripathy; Phongsakon Mark Konrad; Serkan Ayvaz; Yaser Sabzehmeidani

arXiv: 2509.05892 · v2 · submitted 2025-09-07 · cs.CV · cs.AI

Pith reviewed 2026-05-18 17:31 UTC · model grok-4.3

keywords: medical image segmentation · distribution shift · low-data regimes · model benchmarking · foundation models · histopathology · carotid artery · uncertainty evaluation

The pith

Model rankings from small medical image datasets shift sharply under distribution shift and largely reflect statistical noise instead of algorithmic superiority.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper tests ten deep learning segmentation models on only nine cardiovascular histology images for carotid artery structures. When the same models are evaluated on an independent set of 153 images drawn from a different distribution, foundation models keep their accuracy while classical architectures lose performance and overall rankings rearrange. Training subsets of varying size from the second dataset produce entirely different ranking orders, showing that which model appears best depends on the specific data collection. Bootstrap resampling of the results yields heavily overlapping confidence intervals for the leading models, indicating that most observed gaps come from sampling variation rather than genuine model differences. The work therefore argues that conventional single-dataset benchmarking cannot reliably guide model selection in low-data clinical settings.

Core claim

Evaluation on an independent generalization dataset (N=153) under distribution shift reveals that foundation models maintain performance while classical architectures fail, and that rankings change substantially between in-distribution and out-of-distribution settings. Training on the second dataset at varying sample sizes reveals dataset-specific ranking hierarchies, confirming that model rankings are not generalizable across datasets. Despite rigorous Bayesian hyperparameter optimization, model performance remains highly sensitive to data splits. The bootstrap analysis reveals substantially overlapping confidence intervals among top models, with differences driven more by statistical noise than algorithmic superiority.

What carries the argument

Independent out-of-distribution evaluation on a 153-image generalization set paired with bootstrap resampling to quantify uncertainty in performance rankings.
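To make the bootstrap step concrete, here is a minimal sketch of a percentile bootstrap confidence interval for a model's mean score, plus a check for whether two intervals overlap. It is an illustration under stated assumptions, not the authors' procedure: the function names `bootstrap_ci` and `intervals_overlap`, the resample count, and the per-image scores are all hypothetical.

```python
import random
import statistics

def bootstrap_ci(scores, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of per-image scores."""
    rng = random.Random(seed)
    n = len(scores)
    means = []
    for _ in range(n_resamples):
        # Resample n scores with replacement and record the mean.
        sample = [scores[rng.randrange(n)] for _ in range(n)]
        means.append(statistics.fmean(sample))
    means.sort()
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = int((1 - alpha / 2) * n_resamples) - 1
    return means[lo_idx], means[hi_idx]

def intervals_overlap(a, b):
    """True if closed intervals a=(lo, hi) and b=(lo, hi) share any point."""
    return a[0] <= b[1] and b[0] <= a[1]

# Hypothetical per-image Dice scores (made-up numbers, not from the paper).
model_a = [0.81, 0.62, 0.90, 0.74, 0.55, 0.88, 0.79, 0.67, 0.83]
model_b = [0.78, 0.70, 0.85, 0.69, 0.61, 0.91, 0.72, 0.73, 0.82]

ci_a = bootstrap_ci(model_a)
ci_b = bootstrap_ci(model_b)
print("model_a CI:", ci_a)
print("model_b CI:", ci_b)
print("overlap:", intervals_overlap(ci_a, ci_b))
```

When two intervals overlap as heavily as they do for samples this small, the gap between point estimates gives little basis for preferring one model.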

If this is right

Model rankings obtained on one small training set do not predict performance when the data distribution changes.
Foundation models exhibit greater stability than classical architectures when tested under distribution shift in low-data medical segmentation.
Bootstrap confidence intervals for top models overlap substantially, so point-estimate differences are often attributable to statistical noise.
Performance remains sensitive to data splits even after hyperparameter optimization, undermining single-split comparisons.
Standard benchmarking practices in low-data clinical research require uncertainty-aware methods to avoid overinterpreting rankings.
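One way to put a number on "rankings rearrange" is a rank correlation between the in-distribution and out-of-distribution orderings. The sketch below computes Kendall's tau (the tau-a variant, assuming no tied scores) in plain Python. The paper does not state that it uses this statistic; the model names and scores are hypothetical placeholders.

```python
from itertools import combinations

def ranks(scores):
    """Map model name -> rank (1 = best) from a dict of name -> score."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {name: i + 1 for i, name in enumerate(ordered)}

def kendall_tau(rank_x, rank_y):
    """Kendall's tau-a for two rankings of the same items (no ties)."""
    concordant = discordant = 0
    for a, b in combinations(rank_x, 2):
        s = (rank_x[a] - rank_x[b]) * (rank_y[a] - rank_y[b])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    n_pairs = len(rank_x) * (len(rank_x) - 1) // 2
    return (concordant - discordant) / n_pairs

# Hypothetical mean scores (made-up numbers, not from the paper).
in_dist = {"model_a": 0.82, "model_b": 0.80, "model_c": 0.77, "model_d": 0.70}
out_dist = {"model_a": 0.55, "model_b": 0.74, "model_c": 0.60, "model_d": 0.72}

tau = kendall_tau(ranks(in_dist), ranks(out_dist))
print(f"Kendall tau between settings: {tau:.3f}")
```

A tau near 1 means the ordering carried over, while values near 0 or below mean the in-distribution ranking says little about the out-of-distribution one.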

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Early-stage studies with limited data should routinely include at least one out-of-distribution test set before committing to a particular architecture.
Reporting performance differences with confidence intervals would help distinguish meaningful gains from random variation in small-sample settings.
The observed sensitivity to dataset identity suggests that meta-analyses pooling results across different small medical collections may need additional controls for distribution effects.
These patterns raise the possibility that many published comparisons in small-N medical imaging papers rest on rankings that would reorder under modest changes in image acquisition.

Load-bearing premise

The nine-image training set and the 153-image generalization set are assumed to provide a representative test of distribution shift and model behavior in low-data clinical scenarios.

What would settle it

A follow-up experiment using a different small training collection and a fresh independent test collection that produces stable model rankings with non-overlapping bootstrap confidence intervals among the top models would contradict the central claim.

Figures

Figures reproduced from arXiv: 2509.05892 by Andrei-Alexandru Popa, Andrei Constantinescu, Elisa A. Liehn, Liang Zhong, Madhulika Tripathy, Phongsakon Mark Konrad, Serkan Ayvaz, Yaser Sabzehmeidani.

**Figure 2.** Example of data preparation. (A) Original histological image. (B) Expert line-art […]

**Figure 3.** The critical impact of hyperparameter selection. All twelve segmentation results […]

**Figure 4.** The five-layer XAI framework. (A) A model’s prediction is dissected into (1) […]

**Figure 5.** Model Rank Stability in 3-Fold vs. LOOCV. The plots track the performance […]

**Figure 6.** Bootstrap 95% Confidence Intervals for Macro-Dice Score. The plots show the […]

**Figure 7.** Critical Difference Analysis of Model Ranks. This analysis compares average […]

**Figure 8.** The relationship between practical significance (Cohen’s d) and statistical […]

**Figure 9.** Qualitative Comparison of Model Predictions and Explainable AI (XAI) Outputs, […]

**Figure 10.** Quantitative XAI Stability Analysis. (A) The mean uncertainty map, averaged […]
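The Figure 6 caption refers to a macro-Dice score. For readers unfamiliar with the metric, the sketch below assumes the common definition: the unweighted mean of per-class Dice coefficients, skipping classes that appear in neither the prediction nor the reference. The paper's exact definition may differ, and the label arrays are hypothetical.

```python
def dice_per_class(pred, target, cls):
    """Dice coefficient for one class over flat label sequences."""
    inter = sum(1 for p, t in zip(pred, target) if p == cls and t == cls)
    pred_n = sum(1 for p in pred if p == cls)
    target_n = sum(1 for t in target if t == cls)
    if pred_n + target_n == 0:
        return None  # class absent from both prediction and reference
    return 2 * inter / (pred_n + target_n)

def macro_dice(pred, target, classes):
    """Unweighted mean of per-class Dice, ignoring absent classes."""
    values = []
    for cls in classes:
        d = dice_per_class(pred, target, cls)
        if d is not None:
            values.append(d)
    return sum(values) / len(values)

# Hypothetical flattened label maps with three classes (0, 1, 2).
pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 0]

score = macro_dice(pred, target, classes=[0, 1, 2])
print(f"macro-Dice: {score:.4f}")
```

Because every class carries equal weight, a small structure that is segmented poorly pulls the macro average down as much as a large one would.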

Original abstract

Accurate segmentation of carotid artery structures in histopathological images is vital for cardiovascular disease research. This study systematically evaluates ten deep learning segmentation models including classical architectures, modern CNNs, a Vision Transformer, and foundation models, on a limited dataset of nine cardiovascular histology images. We conducted ablation studies on data augmentation, input resolution, and random seed stability to quantify sources of variance. Evaluation on an independent generalization dataset ($N=153$) under distribution shift reveals that foundation models maintain performance while classical architectures fail, and that rankings change substantially between in-distribution and out-of-distribution settings. Training on the second dataset at varying sample sizes reveals dataset-specific ranking hierarchies confirming that model rankings are not generalizable across datasets. Despite rigorous Bayesian hyperparameter optimization, model performance remains highly sensitive to data splits. The bootstrap analysis reveals substantially overlapping confidence intervals among top models, with differences driven more by statistical noise than algorithmic superiority. This instability exposes limitations of standard benchmarking in low-data clinical settings and challenges assumptions that performance rankings reflect clinical utility. We advocate for uncertainty-aware evaluation in low-data clinical research scenarios from two perspectives. First, the scenario is not niche and is rather widely spread; and second, it enables pursuing or discontinuing research tracks with limited datasets from incipient stages of observations.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

With only nine training images, this benchmarking study finds model rankings flip under distribution shift and that bootstrap intervals overlap enough to question whether top models are truly better.

The main thing to know is that training on nine carotid histology images leads to unstable rankings when tested on a 153-image set from a different distribution. Foundation models keep their performance while classical ones drop, and the bootstrap analysis shows wide overlapping intervals among the leaders, pointing to noise as the bigger driver than model choice. They also show that retraining on the second dataset at different sizes produces its own ranking order, so no single hierarchy holds across datasets.

Referee Report

2 major / 2 minor

Summary. The manuscript evaluates ten deep learning segmentation models (classical CNNs, modern variants, a Vision Transformer, and foundation models) for carotid artery structures in histopathological images using a limited training set of nine images. Through ablations on augmentation, resolution, and seeds, plus Bayesian hyperparameter optimization, it reports high sensitivity to data splits. On an independent 153-image generalization set under distribution shift, foundation models maintain performance while classical architectures degrade, rankings change substantially, and bootstrap analysis shows overlapping confidence intervals among top models, indicating differences are driven by statistical noise rather than superiority. The work concludes that standard benchmarking has limitations in low-data clinical settings and advocates uncertainty-aware evaluation.

Significance. If the empirical findings hold, the paper makes a useful contribution by providing concrete evidence of ranking instability and model-type divergences under distribution shift in a small-N medical segmentation task. The inclusion of bootstrap analysis, ablations, and cross-dataset training at varying sample sizes is a strength that supports the call for more cautious interpretation of benchmarks; this could encourage the field to adopt uncertainty quantification as standard practice when datasets are limited, rather than relying on point estimates or leaderboards.

major comments (2)

The claim that observed instabilities and foundation-model robustness reflect general limitations of standard benchmarking (rather than artifacts of this specific pair) is load-bearing, yet the manuscript does not quantify the distribution shift between the nine-image training set and the 153-image generalization set (e.g., via metrics on staining, scanner, resolution, or cohort differences). Without such quantification, the generalization argument risks over-extrapolation from a single, potentially extreme shift instance.
With a training set of only N=9, the bootstrap confidence intervals and ablation results have limited statistical power; the manuscript should explicitly address how the small sample size affects the reliability of the overlapping-CI conclusion and the dataset-specific ranking hierarchies reported when retraining on the second dataset at varying sizes.
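The statistical-power concern can be illustrated with simple counting. The sketch below assumes resampling is done over nine items (the manuscript may instead resample over folds, test images, or runs). It computes how many distinct bootstrap resamples exist for a set of that size, and how likely a given image is to be left out of any one resample. The function names are hypothetical.

```python
import math

def distinct_bootstrap_samples(n):
    """Number of distinct resamples (multisets) of size n drawn from n items."""
    return math.comb(2 * n - 1, n)

def prob_image_left_out(n):
    """Probability that a given item is absent from one bootstrap resample."""
    return (1 - 1 / n) ** n

n_items = 9  # assumed resampling unit count; matches the nine-image set
print(distinct_bootstrap_samples(n_items))  # 24310
print(round(prob_image_left_out(n_items), 3))
```

With so few distinct resamples, and roughly a third of them missing any particular image, a single unusual image can move the interval noticeably. That is consistent with the referee's request to discuss reliability explicitly.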

minor comments (2)

The abstract would be strengthened by reporting error bars or confidence intervals alongside performance numbers and by stating exact model names, hyperparameter ranges, and data-availability status.
Clarify the precise definition of 'foundation models' versus other categories and ensure all ten models are listed with references in the methods section for reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments, which have prompted us to strengthen the manuscript's claims and discussion. We address each major comment point by point below, indicating where revisions will be made.

Referee: The claim that observed instabilities and foundation-model robustness reflect general limitations of standard benchmarking (rather than artifacts of this specific pair) is load-bearing, yet the manuscript does not quantify the distribution shift between the nine-image training set and the 153-image generalization set (e.g., via metrics on staining, scanner, resolution, or cohort differences). Without such quantification, the generalization argument risks over-extrapolation from a single, potentially extreme shift instance.

Authors: We acknowledge that explicit quantification of the distribution shift would better support the generalization of our findings beyond this specific dataset pair. The manuscript describes the shift qualitatively through differences in staining, resolution, and acquisition protocols, but does not provide numerical metrics. In the revised manuscript, we will add a dedicated subsection quantifying the shift using metrics such as channel-wise intensity histograms, mean/variance differences in RGB space, and resolution comparisons between the N=9 training set and the N=153 generalization set. This addition will help contextualize the observed performance changes and reduce the risk of over-extrapolation.

Revision: yes

Referee: With a training set of only N=9, the bootstrap confidence intervals and ablation results have limited statistical power; the manuscript should explicitly address how the small sample size affects the reliability of the overlapping-CI conclusion and the dataset-specific ranking hierarchies reported when retraining on the second dataset at varying sizes.

Authors: The small training set size (N=9) is central to the paper's focus on low-data challenges, and the overlapping bootstrap confidence intervals are presented precisely to illustrate that apparent model differences often fall within statistical noise. We agree that the limited statistical power should be discussed more explicitly to avoid misinterpretation of the results. In the revised discussion section, we will add text addressing how the small N constrains the reliability of the overlapping-CI findings and the stability of ranking hierarchies observed when retraining on the generalization dataset at varying sample sizes. This will frame our conclusions as highlighting the inherent difficulties in such regimes rather than definitive claims.

Revision: yes
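The shift metrics named in the first response (channel-wise intensity histograms and mean/variance differences in RGB space) could be computed along the following lines. This is a rough sketch, not the authors' planned analysis: the helper names, the 8-bin histogram, the L1 distance, and the pixel values are all assumptions chosen for illustration.

```python
import statistics

def channel_stats(pixels):
    """Per-channel (mean, population variance) for a list of (r, g, b) pixels."""
    return [(statistics.fmean(ch), statistics.pvariance(ch)) for ch in zip(*pixels)]

def channel_histogram(pixels, channel, bins=8):
    """Normalized histogram of one 8-bit channel."""
    counts = [0] * bins
    for px in pixels:
        counts[min(px[channel] * bins // 256, bins - 1)] += 1
    return [c / len(pixels) for c in counts]

def histogram_l1(h1, h2):
    """L1 distance between two normalized histograms (0 = identical, 2 = disjoint)."""
    return sum(abs(a - b) for a, b in zip(h1, h2))

# Hypothetical pixels pooled from each dataset (made-up values).
train_pixels = [(200, 120, 180), (210, 130, 190), (190, 110, 170)]
shift_pixels = [(150, 90, 200), (160, 100, 210), (140, 80, 190)]

red_distance = histogram_l1(
    channel_histogram(train_pixels, channel=0),
    channel_histogram(shift_pixels, channel=0),
)
print("train stats:", channel_stats(train_pixels))
print("shift stats:", channel_stats(shift_pixels))
print("red-channel histogram L1:", red_distance)
```

In practice the pixels would be pooled or sampled from the full image sets, and reporting the distance per channel makes it easier to see whether a shift is driven by staining in one color band.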

Circularity Check

0 steps flagged

No circularity in empirical benchmarking study

This is a purely empirical benchmarking paper with no mathematical derivations, equations, or self-referential definitions. All claims rest on direct experimental comparisons of ten segmentation models trained on a nine-image dataset, evaluated on an independent 153-image generalization set under distribution shift, plus ablations on augmentation/resolution/seed and bootstrap confidence intervals. No fitted parameters are renamed as predictions, no uniqueness theorems or ansatzes are imported via self-citation, and no result reduces to its inputs by construction. The central findings (ranking changes, overlapping CIs, dataset-specific hierarchies) are externally falsifiable via replication on the reported splits and therefore self-contained.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claims rest on standard assumptions in deep learning for medical imaging rather than new theoretical entities or derivations; the main dependencies are the representativeness of the small dataset and the validity of the chosen evaluation metrics under distribution shift.

free parameters (1)

Bayesian-optimized hyperparameters
Each of the ten models underwent Bayesian hyperparameter optimization, introducing fitted values that affect reported performance.

axioms (1)

Domain assumption: Deep learning segmentation models trained on small histology datasets can be meaningfully compared via standard metrics and ablations.
Invoked throughout the benchmarking and generalization experiments described in the abstract.
