CTSCAN: Evaluation Leakage in Chest CT Segmentation and a Reproducible Patient-Disjoint Benchmark

Anton Ivchenko

arXiv: 2604.15561 · v1 · submitted 2026-04-16 · eess.IV · cs.CV

Pith reviewed 2026-05-10 09:09 UTC · model grok-4.3

classification: eess.IV · cs.CV

keywords: chest CT · image segmentation · data leakage · patient disjoint · benchmark · Dice score · reproducibility · medical imaging

The pith

Mixing slices from the same patient across train and test inflates chest CT segmentation scores: removing the reuse cuts foreground Dice by 69 percent.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that common splitting methods for chest CT scans let slices from one patient appear in both the training and test sets. This reuse lets models pick up patient-specific patterns instead of general anatomy, producing accuracy numbers that do not hold for new patients. To measure the real effect, the author built the CTSCAN benchmark from multiple sources with strict patient separation in every split. Experiments using the same model and training schedule found that patient-disjoint evaluation cuts foreground Dice by more than two-thirds (from 0.6665 to 0.2066) compared with the slice-mixed approach. A reader should care because segmentation models guide clinical decisions, and overstated results can hide the gap between lab performance and actual usefulness.

Core claim

The paper establishes that slice-mixed protocols in chest CT segmentation induce near-complete patient reuse across partitions, yielding foreground Dice of 0.6665 and IoU of 0.5031, whereas patient-disjoint protocols yield only 0.2066 Dice and 0.1181 IoU. Removing patient reuse therefore reduces foreground Dice by 0.4599 absolute, or 69 percent relative, showing that the mixed protocol substantially inflates reported performance.
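The arithmetic behind the absolute and relative figures can be checked directly from the four reported scores. The snippet below is only a worked calculation; the helper name `reduction` is illustrative and not taken from the CTSCAN code.

```python
# Mean scores quoted in the core claim.
mixed = {"dice": 0.6665, "iou": 0.5031}
disjoint = {"dice": 0.2066, "iou": 0.1181}


def reduction(before: float, after: float) -> tuple[float, float]:
    """Return (absolute drop, drop relative to the 'before' score)."""
    absolute = before - after
    return absolute, absolute / before


for metric in ("dice", "iou"):
    absolute, relative = reduction(mixed[metric], disjoint[metric])
    print(f"{metric}: {absolute:.4f} absolute, {relative:.1%} relative")
```

Note that the 69 percent figure is measured against the slice-mixed score. Measured the other way, the slice-mixed Dice is roughly 3.2 times the patient-disjoint Dice.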

What carries the argument

The patient-disjoint split protocol inside the CTSCAN benchmark, which supplies deterministic manifests that keep every slice from one case inside a single train, validation, or test partition.
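A minimal sketch of how such a manifest could be produced is shown here, assuming each slice is tagged with a case identifier. The function name `case_disjoint_manifest`, the 70/15/15 fractions, and the toy file paths are all hypothetical; the paper's actual manifest format is not described on this page.

```python
import random
from collections import defaultdict


def case_disjoint_manifest(slices, seed=0, fractions=(0.7, 0.15, 0.15)):
    """Assign whole cases to train/val/test so every slice follows its case.

    `slices` is an iterable of (case_id, slice_path) pairs.
    Returns {split_name: {case_id: [slice_path, ...]}}.
    """
    by_case = defaultdict(list)
    for case_id, path in slices:
        by_case[case_id].append(path)

    cases = sorted(by_case)             # sort first so input order is irrelevant
    random.Random(seed).shuffle(cases)  # seeded shuffle keeps the split repeatable

    n_train = round(fractions[0] * len(cases))
    n_val = round(fractions[1] * len(cases))
    groups = {
        "train": cases[:n_train],
        "val": cases[n_train:n_train + n_val],
        "test": cases[n_train + n_val:],
    }
    return {
        split: {case: sorted(by_case[case]) for case in ids}
        for split, ids in groups.items()
    }


# Toy data: 10 hypothetical cases with 5 slices each.
toy_slices = [
    (f"case_{i:02d}", f"case_{i:02d}/slice_{j:03d}.png")
    for i in range(10)
    for j in range(5)
]
manifest = case_disjoint_manifest(toy_slices, seed=0)
for split, cases in manifest.items():
    print(split, sorted(cases))
```

Splitting at the case level before touching any slice is what makes the partitions disjoint by construction; sorting before the seeded shuffle is one simple way to get the same manifest on every run.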

If this is right

Many previously reported chest CT segmentation results from slice-mixed setups represent upper bounds that do not reflect generalization to new patients.
Future studies should adopt patient-disjoint evaluation to produce comparable and realistic performance numbers.
The supplied deterministic manifests and scripted multi-seed sweep allow any new model to be tested under the same leakage-controlled conditions.
Models must improve their handling of unseen patients to close the observed gap between mixed and disjoint scores.
The benchmark's explicit weak-supervision controls let researchers separate leakage effects from supervision choices.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same patient-reuse problem likely distorts results in other medical imaging tasks such as abdominal CT or MRI segmentation.
Adding patient-specific normalization layers or domain-adaptation steps could be tested to see whether they shrink the performance gap between mixed and disjoint regimes.
The large drop indicates that current models may be capturing individual anatomy rather than universal structures, which could be checked by inspecting learned features across patients.
Extending the benchmark to additional independent sources would provide a stronger test of whether the leakage effect is consistent across data collections.

Load-bearing premise

The performance difference between the two protocols stems mainly from the removal of patient-level data leakage rather than from differences in data distribution across sources or other experimental variables.

What would settle it

Re-running the identical multi-seed protocol on a single homogeneous chest CT dataset that is split once in the mixed style and once in the patient-disjoint style, and checking whether the Dice gap remains near 0.46.

Figures

Figures reproduced from arXiv: 2604.15561 by Anton Ivchenko.

**Figure 1.** Protocol comparison for the shared FPN control. Left: mean foreground Dice and IoU across seeds under the ...

**Figure 2.** Leakage effect across architecture families. Both the shared FPN and the shared U-Net controls lose ...

**Figure 3.** Multi-seed training and validation curves for the shared FPN control. Top: foreground Dice. Bottom: loss.

**Figure 4.** Sampler and loss ablations under slice-mixed and patient-disjoint evaluation. Training tweaks move the ...

**Figure 5.** Per-class and per-source breakdown under the shared FPN control. The same split change that inflates the ...

Original abstract

Reported chest CT segmentation performance can be strongly inflated when train and test partitions mix slices from the same study. We present CTSCAN, a reproducible multi-source chest CT benchmark and research stack designed to measure what survives under patient-disjoint evaluation. The current four-class artifact aggregates 89 cases from PleThora, MedSeg SIRM, and LongCIU, and we show that the original slice-PNG workflow induces near-complete case reuse across train, validation, and test. Using the playground environment, we run a multi-seed protocol sweep with the same FPN plus EfficientNet-B0 control configuration under slice-mixed and case-disjoint evaluation. Across 3 seeds and 12 epochs per seed, the slice-mixed protocol reaches 0.6665 foreground Dice and 0.5031 foreground IoU, whereas the case-disjoint protocol reaches 0.2066 Dice and 0.1181 IoU. Removing patient reuse therefore reduces foreground Dice by 0.4599 absolute (69.00% relative) and foreground IoU by 0.3850 absolute (76.52% relative). CTSCAN packages the corrected benchmark with deterministic split manifests, explicit weak-supervision controls, a scripted multi-seed protocol sweep, and reproducible figure generation, providing a reusable basis for patient-disjoint chest CT evaluation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims that standard slice-mixed train/test splits in chest CT segmentation induce severe evaluation leakage through intra-patient slice reuse, inflating reported performance. It introduces the CTSCAN benchmark aggregating 89 cases from PleThora, MedSeg SIRM, and LongCIU, and demonstrates via controlled experiments with FPN + EfficientNet-B0 that slice-mixed protocols achieve 0.6665 foreground Dice / 0.5031 IoU while case-disjoint protocols achieve only 0.2066 Dice / 0.1181 IoU (absolute drops of 0.46 Dice and 0.385 IoU). The work supplies deterministic patient-disjoint split manifests, weak-supervision controls, a scripted multi-seed (3 seeds, 12 epochs) protocol, and reproducible figure generation.

Significance. If the performance gap is shown to arise primarily from leakage removal, the result would be significant for medical image segmentation evaluation practices, quantifying the inflation that occurs under common splitting workflows and supplying a reusable, patient-disjoint multi-source benchmark. The explicit provision of deterministic manifests, multi-seed sweep code, and scripted figure generation constitutes a concrete strength that supports independent verification and extension.

major comments (2)

[case-disjoint protocol description (§3–4)] In the description of the case-disjoint protocol (abstract and §3–4), the manuscript does not report source proportions (PleThora, MedSeg SIRM, LongCIU) within each train/validation/test split nor confirm that patient assignment was stratified by source. Consequently the 0.4599 absolute Dice drop may partly reflect domain shift arising from unbalanced source mixtures rather than leakage removal alone; the central claim that patient-disjoint evaluation accounts for the entire gap therefore requires additional evidence such as per-split source histograms or a source-stratified ablation.
[§4] §4 (multi-seed protocol sweep): the foreground-only Dice and IoU metrics are reported without the corresponding background or per-class values, making it impossible to assess whether the large gap is driven by foreground class imbalance or by overall segmentation quality; this detail is load-bearing for interpreting the 69% relative reduction.
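The first major comment asks for per-split source proportions and a source-stratified assignment. One way this could look is sketched below, assuming a mapping from case identifier to source name is available. The function names, the 70/15/15 fractions, and the toy source sizes are hypothetical and do not reflect the real per-source case counts, which this page does not report.

```python
import random
from collections import Counter, defaultdict


def stratified_case_split(case_sources, seed=0, fractions=(0.7, 0.15, 0.15)):
    """Split cases into train/val/test separately within each source.

    `case_sources` maps case_id -> source name. Returns case_id -> split.
    """
    by_source = defaultdict(list)
    for case_id, source in case_sources.items():
        by_source[source].append(case_id)

    rng = random.Random(seed)
    assignment = {}
    for source in sorted(by_source):
        cases = sorted(by_source[source])
        rng.shuffle(cases)
        n_train = round(fractions[0] * len(cases))
        n_val = round(fractions[1] * len(cases))
        for i, case_id in enumerate(cases):
            if i < n_train:
                assignment[case_id] = "train"
            elif i < n_train + n_val:
                assignment[case_id] = "val"
            else:
                assignment[case_id] = "test"
    return assignment


def source_histogram(case_sources, assignment):
    """Count cases per (split, source) pair."""
    return Counter((assignment[c], case_sources[c]) for c in case_sources)


# Toy data with made-up source sizes (not the real CTSCAN composition).
toy_sources = {}
for name, n_cases in (("source_a", 20), ("source_b", 10), ("source_c", 10)):
    for i in range(n_cases):
        toy_sources[f"{name}_{i:02d}"] = name

toy_assignment = stratified_case_split(toy_sources, seed=0)
histogram = source_histogram(toy_sources, toy_assignment)
for key in sorted(histogram):
    print(key, histogram[key])
```

Because each case receives exactly one split label, the assignment stays patient-disjoint while keeping every source represented in every partition, which is the property a source-stratified ablation would need.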

minor comments (1)

[abstract] The abstract states “near-complete case reuse” but does not quantify the exact fraction of overlapping cases or slices; a brief table or sentence with these numbers would improve clarity.
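The overlap the referee asks about can be measured in a few lines once a split is given as lists of (case, slice) pairs. The sketch below uses synthetic data (20 hypothetical cases with 40 slices each) and a plain slice-level shuffle as a stand-in for the slice-mixed workflow; it does not reproduce the paper's actual numbers.

```python
import random


def slice_mixed_split(slices, seed=0, fractions=(0.7, 0.15, 0.15)):
    """Shuffle individual slices, ignoring which case they came from."""
    items = sorted(slices)
    random.Random(seed).shuffle(items)
    n_train = round(fractions[0] * len(items))
    n_val = round(fractions[1] * len(items))
    return {
        "train": items[:n_train],
        "val": items[n_train:n_train + n_val],
        "test": items[n_train + n_val:],
    }


def case_reuse(split):
    """Fraction of test cases that also contribute slices to training."""
    train_cases = {case for case, _ in split["train"]}
    test_cases = {case for case, _ in split["test"]}
    return len(test_cases & train_cases) / len(test_cases)


# Synthetic data: 20 hypothetical cases, 40 slices per case.
synthetic = [(f"case_{i:02d}", j) for i in range(20) for j in range(40)]
mixed_split = slice_mixed_split(synthetic, seed=0)
print(f"test cases also seen in training: {case_reuse(mixed_split):.0%}")
```

With many slices per case, a slice-level shuffle almost inevitably places part of every test case in the training set, which is the mechanism behind the "near-complete case reuse" wording.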

Simulated Author's Rebuttal

2 responses · 1 unresolved

Thank you for the constructive feedback on our manuscript. We appreciate the opportunity to clarify and strengthen our presentation of the CTSCAN benchmark. Below we respond point-by-point to the major comments.

Referee: [case-disjoint protocol description (§3–4)] In the description of the case-disjoint protocol (abstract and §3–4), the manuscript does not report source proportions (PleThora, MedSeg SIRM, LongCIU) within each train/validation/test split nor confirm that patient assignment was stratified by source. Consequently the 0.4599 absolute Dice drop may partly reflect domain shift arising from unbalanced source mixtures rather than leakage removal alone; the central claim that patient-disjoint evaluation accounts for the entire gap therefore requires additional evidence such as per-split source histograms or a source-stratified ablation.

Authors: We agree that the source composition of the splits should be explicitly reported to allow readers to evaluate potential domain shift. In the revised manuscript, we will add per-split source proportion histograms and details on how patients were assigned to splits. While the sources are all chest CT volumes with comparable imaging characteristics, we acknowledge that a source-stratified ablation would provide stronger evidence isolating the effect of patient-disjoint evaluation. However, conducting such an ablation would require additional experiments beyond the current multi-seed protocol and is not feasible within the revision timeline. We believe the primary driver remains the removal of intra-patient slice leakage, as evidenced by the near-complete case reuse in the original slice-mixed workflow. revision: partial

Referee: [§4] §4 (multi-seed protocol sweep): the foreground-only Dice and IoU metrics are reported without the corresponding background or per-class values, making it impossible to assess whether the large gap is driven by foreground class imbalance or by overall segmentation quality; this detail is load-bearing for interpreting the 69% relative reduction.

Authors: We will include the background Dice and IoU metrics as well as per-class values for all four classes in the updated §4 and associated tables. This will demonstrate that the performance degradation under case-disjoint evaluation is not limited to the foreground but reflects a broader reduction in segmentation quality across classes. revision: yes
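A per-class breakdown of the kind promised here could be computed as in the sketch below. It assumes flat sequences of integer labels for a four-class task and uses the standard definitions Dice = 2TP / (2TP + FP + FN) and IoU = TP / (TP + FP + FN). Which label is background and how the paper aggregates classes into "foreground" scores are not stated on this page, so the function reports each class separately.

```python
def per_class_scores(pred, target, num_classes=4):
    """Dice and IoU for each class from two equal-length label sequences."""
    scores = {}
    for c in range(num_classes):
        pairs = list(zip(pred, target))
        tp = sum(p == c and t == c for p, t in pairs)
        fp = sum(p == c and t != c for p, t in pairs)
        fn = sum(p != c and t == c for p, t in pairs)
        if tp + fp + fn == 0:
            scores[c] = None  # class absent from both prediction and target
            continue
        scores[c] = {
            "dice": 2 * tp / (2 * tp + fp + fn),
            "iou": tp / (tp + fp + fn),
        }
    return scores


# Toy labels for eight pixels (hypothetical values).
toy_target = [0, 0, 1, 1, 2, 2, 3, 3]
toy_pred = [0, 0, 1, 2, 2, 2, 3, 0]
toy_scores = per_class_scores(toy_pred, toy_target)
for label, values in toy_scores.items():
    print(label, values)
```

Reporting every class alongside the aggregate makes it visible whether a drop is concentrated in small structures or spread across the whole segmentation.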

standing simulated objections not resolved

A complete source-stratified ablation to fully disentangle leakage effects from any residual domain shift between sources.

Circularity Check

0 steps flagged

No circularity: results are direct experimental outputs

The paper reports empirical performance metrics from controlled runs of the same FPN+EfficientNet-B0 model under two explicit data-partitioning protocols (slice-mixed vs. case-disjoint) on the same multi-source dataset. The Dice and IoU deltas are measured outcomes of those runs, not quantities derived from equations, fitted parameters, or self-citations that reduce to the inputs by construction. No load-bearing mathematical steps, uniqueness theorems, or ansatzes appear in the presented claims; the benchmark is self-contained as a reproducible experimental comparison.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim relies on empirical comparison rather than new theoretical constructs or fitted parameters beyond standard ML training.

axioms (1)

domain assumption: A standard deep learning model (FPN + EfficientNet-B0) and training protocol represent typical chest CT segmentation setups. Used as the control configuration in the experiments.
