CTSCAN: Evaluation Leakage in Chest CT Segmentation and a Reproducible Patient-Disjoint Benchmark
Pith reviewed 2026-05-10 09:09 UTC · model grok-4.3
The pith
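Mixing slices from the same patient across train and test inflates reported chest CT segmentation scores: 69 percent of the slice-mixed foreground Dice disappears under patient-disjoint evaluation.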
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper establishes that slice-mixed protocols in chest CT segmentation induce near-complete patient reuse across partitions, yielding foreground Dice of 0.6665 and IoU of 0.5031, whereas patient-disjoint protocols yield only 0.2066 Dice and 0.1181 IoU. Removing patient reuse therefore reduces foreground Dice by 0.4599 absolute, or 69 percent relative, showing that the mixed protocol substantially inflates reported performance.
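The arithmetic behind the headline numbers can be checked directly. The sketch below uses only the four scores reported in the abstract; the helper name `drop` is illustrative and not part of the CTSCAN code.

```python
# Scores as reported in the abstract (means over 3 seeds).
mixed = {"dice": 0.6665, "iou": 0.5031}
disjoint = {"dice": 0.2066, "iou": 0.1181}


def drop(mixed_score, disjoint_score):
    """Return (absolute drop, drop relative to the slice-mixed score)."""
    absolute = mixed_score - disjoint_score
    return absolute, absolute / mixed_score


for name in ("dice", "iou"):
    absolute, relative = drop(mixed[name], disjoint[name])
    ratio = mixed[name] / disjoint[name]
    print(f"{name}: {absolute:.4f} absolute, {relative:.1%} relative, "
          f"mixed/disjoint ratio {ratio:.2f}")
```

Note that the 69 percent figure is measured against the slice-mixed score. Measured against the patient-disjoint score instead, the slice-mixed Dice is roughly 3.2 times higher.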
What carries the argument
The patient-disjoint split protocol inside the CTSCAN benchmark, which supplies deterministic manifests that keep every slice from one case inside a single train, validation, or test partition.
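A minimal sketch of how a deterministic, patient-disjoint manifest could be built is shown below. It is not the CTSCAN implementation: the hash-based assignment, the salt string, the split fractions, and the function names are all assumptions chosen for illustration. Hashing the case ID rather than drawing from a random generator makes the assignment independent of input order, at the cost of only approximately hitting the requested fractions.

```python
import hashlib
import json


def assign_partition(case_id, val_frac=0.15, test_frac=0.15, salt="example-salt"):
    """Map a case ID to a partition with a stable hash (hypothetical scheme)."""
    digest = hashlib.sha256(f"{salt}:{case_id}".encode()).hexdigest()
    u = int(digest[:8], 16) / 0x100000000  # value in [0, 1)
    if u < test_frac:
        return "test"
    if u < test_frac + val_frac:
        return "val"
    return "train"


def build_manifest(slices):
    """slices: iterable of (case_id, slice_path) pairs.

    Every slice follows its case, so no case spans two partitions.
    """
    manifest = {"train": [], "val": [], "test": []}
    for case_id, path in slices:
        manifest[assign_partition(case_id)].append(path)
    return {name: sorted(paths) for name, paths in manifest.items()}


# Toy input: 20 made-up cases with 5 slices each.
slices = [(f"case_{c:03d}", f"case_{c:03d}/slice_{s:03d}.png")
          for c in range(20) for s in range(5)]
manifest = build_manifest(slices)
print(json.dumps({name: len(paths) for name, paths in manifest.items()}))
```

Writing the resulting dictionary to a JSON file would give a manifest that any later run can load instead of re-splitting.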
If this is right
- Many previously reported chest CT segmentation results from slice-mixed setups represent upper bounds that do not reflect generalization to new patients.
- Future studies should adopt patient-disjoint evaluation to produce comparable and realistic performance numbers.
- The supplied deterministic manifests and scripted multi-seed sweep allow any new model to be tested under the same leakage-controlled conditions.
- Models must improve their handling of unseen patients to close the observed gap between mixed and disjoint scores.
- The benchmark's explicit weak-supervision controls let researchers separate leakage effects from supervision choices.
Where Pith is reading between the lines
- The same patient-reuse problem likely distorts results in other medical imaging tasks such as abdominal CT or MRI segmentation.
- Adding patient-specific normalization layers or domain-adaptation steps could be tested to see whether they shrink the performance gap between mixed and disjoint regimes.
- The large drop indicates that current models may be capturing individual anatomy rather than universal structures, which could be checked by inspecting learned features across patients.
- Extending the benchmark to additional independent sources would provide a stronger test of whether the leakage effect is consistent across data collections.
Load-bearing premise
The performance difference between the two protocols stems mainly from the removal of patient-level data leakage rather than from differences in data distribution across sources or other experimental variables.
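One way to probe this premise is to split cases within each source and then tabulate how many cases from each source land in each partition. The sketch below assumes a mapping from case ID to source name is available; the function names, fractions, and the placeholder sources `source_a` to `source_c` with made-up counts are hypothetical, since the review does not report the per-source case counts.

```python
import random
from collections import Counter, defaultdict


def stratified_case_split(case_sources, fractions=(0.7, 0.15, 0.15), seed=0):
    """case_sources: dict case_id -> source. Returns dict case_id -> partition.

    Cases are shuffled and cut separately inside each source, so every
    partition receives roughly the same source mixture.
    """
    by_source = defaultdict(list)
    for case_id, source in sorted(case_sources.items()):
        by_source[source].append(case_id)
    rng = random.Random(seed)
    assignment = {}
    for source in sorted(by_source):
        cases = by_source[source]
        rng.shuffle(cases)
        n_train = round(fractions[0] * len(cases))
        n_val = round(fractions[1] * len(cases))
        for i, case_id in enumerate(cases):
            if i < n_train:
                assignment[case_id] = "train"
            elif i < n_train + n_val:
                assignment[case_id] = "val"
            else:
                assignment[case_id] = "test"
    return assignment


def source_histogram(assignment, case_sources):
    """Count cases per source inside each partition."""
    hist = defaultdict(Counter)
    for case_id, partition in assignment.items():
        hist[partition][case_sources[case_id]] += 1
    return {partition: dict(counts) for partition, counts in hist.items()}


# Placeholder data: three made-up sources with 20, 10, and 10 cases.
case_sources = {f"a_{i}": "source_a" for i in range(20)}
case_sources.update({f"b_{i}": "source_b" for i in range(10)})
case_sources.update({f"c_{i}": "source_c" for i in range(10)})
assignment = stratified_case_split(case_sources)
print(source_histogram(assignment, case_sources))
```

If the histogram for an unstratified split shows a source concentrated in one partition, part of the gap could plausibly come from domain shift rather than leakage alone.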
What would settle it
Re-running the identical multi-seed protocol on a single homogeneous chest CT dataset that is split once in the mixed style and once in the patient-disjoint style, and checking whether the Dice gap remains near 0.46.
Original abstract
Reported chest CT segmentation performance can be strongly inflated when train and test partitions mix slices from the same study. We present CTSCAN, a reproducible multi-source chest CT benchmark and research stack designed to measure what survives under patient-disjoint evaluation. The current four-class artifact aggregates 89 cases from PleThora, MedSeg SIRM, and LongCIU, and we show that the original slice-PNG workflow induces near-complete case reuse across train, validation, and test. Using the playground environment, we run a multi-seed protocol sweep with the same FPN plus EfficientNet-B0 control configuration under slice-mixed and case-disjoint evaluation. Across 3 seeds and 12 epochs per seed, the slice-mixed protocol reaches 0.6665 foreground Dice and 0.5031 foreground IoU, whereas the case-disjoint protocol reaches 0.2066 Dice and 0.1181 IoU. Removing patient reuse therefore reduces foreground Dice by 0.4599 absolute (69.00% relative) and foreground IoU by 0.3850 absolute (76.52% relative). CTSCAN packages the corrected benchmark with deterministic split manifests, explicit weak-supervision controls, a scripted multi-seed protocol sweep, and reproducible figure generation, providing a reusable basis for patient-disjoint chest CT evaluation.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that standard slice-mixed train/test splits in chest CT segmentation induce severe evaluation leakage through intra-patient slice reuse, inflating reported performance. It introduces the CTSCAN benchmark aggregating 89 cases from PleThora, MedSeg SIRM, and LongCIU, and demonstrates via controlled experiments with FPN + EfficientNet-B0 that slice-mixed protocols achieve 0.6665 foreground Dice / 0.5031 IoU while case-disjoint protocols achieve only 0.2066 Dice / 0.1181 IoU (absolute drops of 0.46 Dice and 0.385 IoU). The work supplies deterministic patient-disjoint split manifests, weak-supervision controls, a scripted multi-seed (3 seeds, 12 epochs) protocol, and reproducible figure generation.
Significance. If the performance gap is shown to arise primarily from leakage removal, the result would be significant for medical image segmentation evaluation practices, quantifying the inflation that occurs under common splitting workflows and supplying a reusable, patient-disjoint multi-source benchmark. The explicit provision of deterministic manifests, multi-seed sweep code, and scripted figure generation constitutes a concrete strength that supports independent verification and extension.
major comments (2)
- [case-disjoint protocol description (§3–4)] In the description of the case-disjoint protocol (abstract and §3–4), the manuscript does not report source proportions (PleThora, MedSeg SIRM, LongCIU) within each train/validation/test split nor confirm that patient assignment was stratified by source. Consequently the 0.4599 absolute Dice drop may partly reflect domain shift arising from unbalanced source mixtures rather than leakage removal alone; the central claim that patient-disjoint evaluation accounts for the entire gap therefore requires additional evidence such as per-split source histograms or a source-stratified ablation.
- [§4] §4 (multi-seed protocol sweep): the foreground-only Dice and IoU metrics are reported without the corresponding background or per-class values, making it impossible to assess whether the large gap is driven by foreground class imbalance or by overall segmentation quality; this detail is load-bearing for interpreting the 69 % relative reduction.
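To make the second major comment concrete, the sketch below computes Dice and IoU per class and then a foreground average. The paper's exact metric definition is not given in this review, so two assumptions are made here and should be checked against the original: class 0 is treated as background, and the foreground score is the unweighted mean over the remaining classes. The code is plain Python on flattened label sequences to stay dependency-free; a real pipeline would more likely use array operations.

```python
def per_class_scores(pred, target, num_classes=4):
    """pred, target: equal-length flat sequences of integer class labels.

    Returns {class: (dice, iou)}; a class absent from both maps gets None.
    """
    scores = {}
    for c in range(num_classes):
        tp = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        n_pred = sum(1 for p in pred if p == c)
        n_true = sum(1 for t in target if t == c)
        union = n_pred + n_true - tp
        if union == 0:
            scores[c] = None  # undefined: class appears in neither map
            continue
        scores[c] = (2 * tp / (n_pred + n_true), tp / union)
    return scores


def foreground_mean(scores, background=0):
    """Unweighted mean of (dice, iou) over defined non-background classes."""
    vals = [s for c, s in scores.items() if c != background and s is not None]
    if not vals:
        return None
    dice = sum(d for d, _ in vals) / len(vals)
    iou = sum(i for _, i in vals) / len(vals)
    return dice, iou


# Toy 8-pixel example with four classes (0 = assumed background).
target = [0, 0, 0, 0, 1, 1, 2, 3]
pred = [0, 0, 0, 1, 1, 1, 2, 2]
scores = per_class_scores(pred, target)
print(scores)
print(foreground_mean(scores))
```

In the toy example the background class scores well while class 3 is missed entirely, which is the kind of per-class detail a foreground-only average hides.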
minor comments (1)
- [abstract] The abstract states “near-complete case reuse” but does not quantify the exact fraction of overlapping cases or slices; a brief table or sentence with these numbers would improve clarity.
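The overlap the minor comment asks for is straightforward to quantify once each slice carries its case ID. The sketch below builds a slice-level shuffle split on made-up data and reports the fraction of test cases, and of test slices, whose case also appears in train. The function names, split fractions, and toy dataset are assumptions for illustration, not the paper's workflow.

```python
import random


def slice_mixed_split(slices, fractions=(0.7, 0.15, 0.15), seed=0):
    """Shuffle individual slices, ignoring which case each one belongs to."""
    shuffled = sorted(slices)
    random.Random(seed).shuffle(shuffled)
    n_train = round(fractions[0] * len(shuffled))
    n_val = round(fractions[1] * len(shuffled))
    return {
        "train": shuffled[:n_train],
        "val": shuffled[n_train:n_train + n_val],
        "test": shuffled[n_train + n_val:],
    }


def case_reuse(split, held_out="test", reference="train"):
    """Return (fraction of held-out cases, fraction of held-out slices)
    whose case also occurs in the reference partition."""
    ref_cases = {case for case, _ in split[reference]}
    held_cases = {case for case, _ in split[held_out]}
    reused_slices = [s for s in split[held_out] if s[0] in ref_cases]
    return (len(held_cases & ref_cases) / len(held_cases),
            len(reused_slices) / len(split[held_out]))


# Toy data: 10 made-up cases with 30 slices each, as (case_id, slice_index).
slices = [(f"case_{c:02d}", s) for c in range(10) for s in range(30)]
split = slice_mixed_split(slices)
print(case_reuse(split))
```

Running the same `case_reuse` check on a patient-disjoint split should return zero for both fractions, which makes it usable as a sanity test for any manifest.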
Simulated Author's Rebuttal
Thank you for the constructive feedback on our manuscript. We appreciate the opportunity to clarify and strengthen our presentation of the CTSCAN benchmark. Below we respond point-by-point to the major comments.
- Referee: [case-disjoint protocol description (§3–4)] In the description of the case-disjoint protocol (abstract and §3–4), the manuscript does not report source proportions (PleThora, MedSeg SIRM, LongCIU) within each train/validation/test split nor confirm that patient assignment was stratified by source. Consequently the 0.4599 absolute Dice drop may partly reflect domain shift arising from unbalanced source mixtures rather than leakage removal alone; the central claim that patient-disjoint evaluation accounts for the entire gap therefore requires additional evidence such as per-split source histograms or a source-stratified ablation.
  Authors: We agree that the source composition of the splits should be explicitly reported to allow readers to evaluate potential domain shift. In the revised manuscript, we will add per-split source proportion histograms and details on how patients were assigned to splits. While the sources are all chest CT volumes with comparable imaging characteristics, we acknowledge that a source-stratified ablation would provide stronger evidence isolating the effect of patient-disjoint evaluation. However, conducting such an ablation would require additional experiments beyond the current multi-seed protocol and is not feasible within the revision timeline. We believe the primary driver remains the removal of intra-patient slice leakage, as evidenced by the near-complete case reuse in the original slice-mixed workflow.
  revision: partial
- Referee: [§4] §4 (multi-seed protocol sweep): the foreground-only Dice and IoU metrics are reported without the corresponding background or per-class values, making it impossible to assess whether the large gap is driven by foreground class imbalance or by overall segmentation quality; this detail is load-bearing for interpreting the 69 % relative reduction.
  Authors: We will include the background Dice and IoU metrics as well as per-class values for all four classes in the updated §4 and associated tables. This will demonstrate that the performance degradation under case-disjoint evaluation is not limited to the foreground but reflects a broader reduction in segmentation quality across classes.
  revision: yes
- Left unresolved by the rebuttal: a complete source-stratified ablation to fully disentangle leakage effects from any residual domain shift between sources.
Circularity Check
No circularity: results are direct experimental outputs
full rationale
The paper reports empirical performance metrics from controlled runs of the same FPN+EfficientNet-B0 model under two explicit data-partitioning protocols (slice-mixed vs. case-disjoint) on the same multi-source dataset. The Dice and IoU deltas are measured outcomes of those runs, not quantities derived from equations, fitted parameters, or self-citations that reduce to the inputs by construction. No load-bearing mathematical steps, uniqueness theorems, or ansatzes appear in the presented claims; the benchmark is self-contained as a reproducible experimental comparison.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: A standard deep learning model (FPN + EfficientNet-B0) and training protocol represent typical chest CT segmentation setups.