arxiv: 2511.04334 · v2 · submitted 2025-11-06 · 💻 cs.CV · cs.LG

Submanifold Sparse Convolutional Networks for Automated 3D Segmentation of Kidneys and Kidney Tumours in Computed Tomography

Saúl Alonso-Monsalve, Leigh H. Whitehead, Adam Aurisano, Lorena Escudero Sanchez

Pith reviewed 2026-05-18 00:58 UTC · model grok-4.3

keywords sparse convolutional networks · 3D medical segmentation · kidney tumor · CT imaging · KiTS23 · two-stage pipeline · submanifold sparse CNN · volumetric analysis

The pith

A two-stage sparse convolutional network matches or exceeds patch-based baselines for kidney tumor segmentation in CT while cutting VRAM use by up to 75 percent and inference time by up to 60 percent.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a two-stage 3D segmentation approach that first applies a low-resolution submanifold sparse network to locate a region of interest and then runs a high-resolution sparse network inside the cropped volume. This design lets the method process native high-resolution CT scans without the memory cost of dense convolutions or the accuracy trade-offs of heavy patching and downsampling. A sympathetic reader would care because accurate automated tumor delineation supports precision oncology and quantitative analysis, yet current dense networks often demand too much hardware for routine clinical volumes. On the KiTS23 dataset the method reaches Dice scores of 95.8 percent for kidneys plus masses, 85.7 percent for tumors plus cysts, and 80.3 percent for tumors alone. Direct comparisons on identical cross-validation folds show the sparse pipeline is competitive with or slightly ahead of a patch-based nnU-Net while using markedly less memory and time.

Core claim

The central claim is that submanifold sparse convolutional networks arranged in a two-stage pipeline produce Dice similarity coefficients of 95.8 percent for kidneys plus masses, 85.7 percent for tumours plus cysts, and 80.3 percent for tumours alone on the KiTS23 renal cancer CT dataset, results that are competitive with top challenge entries and comparable to or slightly higher than a patch-based nnU-Net baseline, all while achieving up to 60 percent shorter inference time and up to 75 percent lower VRAM usage than an equivalent dense implementation across tested CPU and GPU hardware.
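As a reminder of what the three percentages measure, the following is a minimal sketch of the Dice similarity coefficient evaluated over nested label groups. It assumes KiTS-style integer labels (1 = kidney, 2 = tumour, 3 = cyst), which this page does not state; the names `TARGETS`, `dice`, and `hierarchical_dice` are illustrative and not taken from the paper.

```python
from typing import Dict, Sequence, Set

# Assumed label convention (not stated on this page): 1 = kidney, 2 = tumour, 3 = cyst.
TARGETS: Dict[str, Set[int]] = {
    "kidneys + masses": {1, 2, 3},
    "tumours + cysts": {2, 3},
    "tumours": {2},
}


def dice(pred: Sequence[int], truth: Sequence[int], labels: Set[int]) -> float:
    """Dice = 2|P ∩ T| / (|P| + |T|) for the binary mask 'voxel label is in labels'."""
    if len(pred) != len(truth):
        raise ValueError("prediction and ground truth must have the same number of voxels")
    p = [v in labels for v in pred]
    t = [v in labels for v in truth]
    intersection = sum(1 for a, b in zip(p, t) if a and b)
    denom = sum(p) + sum(t)
    # Convention chosen for this sketch: two empty masks count as perfect agreement.
    return 1.0 if denom == 0 else 2.0 * intersection / denom


def hierarchical_dice(pred: Sequence[int], truth: Sequence[int]) -> Dict[str, float]:
    """One Dice score per nested target, computed on flattened label volumes."""
    return {name: dice(pred, truth, labels) for name, labels in TARGETS.items()}


# Toy flattened volumes of 8 voxels each.
truth = [0, 1, 1, 2, 2, 3, 0, 0]
pred = [0, 1, 1, 2, 3, 3, 1, 0]
scores = hierarchical_dice(pred, truth)
```

In the toy case the prediction confuses one tumour voxel with cyst, so the "tumours + cysts" score is unaffected while the "tumours" score drops. This is why the tumour-only figure is typically the hardest of the three to push up.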

What carries the argument

The two-stage sparse segmentation pipeline in which a low-resolution submanifold sparse network first identifies a region of interest and a subsequent high-resolution submanifold sparse network refines the segmentation inside the cropped ROI.
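The hand-off between the two stages can be sketched as a bounding-box computation. The example below is a simplified illustration under stated assumptions, not the authors' implementation: the dilation margin, the integer scale factor between resolutions, and the function names `bounding_box` and `dilate_and_upscale` are all hypothetical.

```python
from typing import Iterable, Tuple

Coord = Tuple[int, int, int]
Box = Tuple[Coord, Coord]  # (inclusive lower corner, exclusive upper corner)


def bounding_box(voxels: Iterable[Coord]) -> Box:
    """Tightest axis-aligned box around the stage-1 foreground voxels."""
    pts = list(voxels)
    if not pts:
        raise ValueError("stage 1 found no foreground; no ROI can be built")
    lo = tuple(min(p[a] for p in pts) for a in range(3))
    hi = tuple(max(p[a] for p in pts) + 1 for a in range(3))
    return lo, hi


def dilate_and_upscale(box: Box, margin: int, scale: int, full_shape: Coord) -> Box:
    """Pad the low-res box by `margin` voxels, map it to high-res indices, clip to the volume."""
    lo, hi = box
    new_lo = tuple(max(0, (lo[a] - margin) * scale) for a in range(3))
    new_hi = tuple(min(full_shape[a], (hi[a] + margin) * scale) for a in range(3))
    return new_lo, new_hi


# Hypothetical stage-1 output: three foreground voxels on the low-resolution grid.
low_res_foreground = [(4, 5, 6), (5, 5, 6), (5, 6, 7)]
roi = dilate_and_upscale(
    bounding_box(low_res_foreground),
    margin=1,                 # placeholder safety margin, in low-res voxels
    scale=4,                  # placeholder ratio between high-res and low-res grids
    full_shape=(64, 64, 32),  # placeholder shape of the high-resolution scan
)
```

Stage 2 would then run only on the high-resolution voxels inside `roi`. The margin is the one knob that trades compute against the risk of clipping a peripheral lesion.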

If this is right

Native high-resolution 3D processing of entire CT volumes becomes feasible without downsampling or patch-based workarounds.
Segmentation accuracy stays at or above the level of a standard patch-based nnU-Net on the same KiTS23 cross-validation folds.
VRAM consumption drops by as much as 75 percent and inference time by as much as 60 percent relative to a dense version of the same architecture.
The method outperforms a zero-shot foundation model on small heterogeneous lesions while still localizing kidneys reliably.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same sparse two-stage pattern could be applied to other large-volume 3D medical segmentation tasks where memory limits currently force downsampling.
Efficiency gains might allow on-premise or edge deployment of high-resolution models in settings without high-end GPUs.
Combining the ROI localization stage with multi-modal inputs or uncertainty estimates could further improve robustness on variable lesion sizes.

Load-bearing premise

The low-resolution first stage always produces a region of interest that fully contains every kidney and tumor voxel, including small or peripherally located lesions.

What would settle it

A single test volume in which the stage-one ROI excludes part of a tumor, causing the stage-two network to output an incomplete or zero segmentation for that lesion.
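A check of this kind is easy to state in code. The sketch below computes, for one case, the fraction of ground-truth voxels that fall inside a stage-1 ROI; any value below 1.0 would be the counterexample described above. The half-open box convention and the name `roi_coverage` are assumptions made for illustration.

```python
from typing import Iterable, Tuple

Coord = Tuple[int, int, int]
Box = Tuple[Coord, Coord]  # (inclusive lower corner, exclusive upper corner)


def roi_coverage(truth_voxels: Iterable[Coord], roi: Box) -> float:
    """Fraction of ground-truth positive voxels lying inside the half-open ROI box."""
    lo, hi = roi
    pts = list(truth_voxels)
    if not pts:
        return 1.0  # nothing to cover, so nothing can be missed
    inside = sum(1 for p in pts if all(lo[a] <= p[a] < hi[a] for a in range(3)))
    return inside / len(pts)


# Toy case: two tumour voxels inside the box, two just outside it.
roi = ((10, 10, 10), (20, 20, 20))
tumour_voxels = [(12, 12, 12), (19, 19, 19), (20, 15, 15), (9, 10, 10)]
coverage = roi_coverage(tumour_voxels, roi)
roi_clips_tumour = coverage < 1.0
```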

Figures

Figures reproduced from arXiv: 2511.04334 by Adam Aurisano, Leigh H. Whitehead, Lorena Escudero Sanchez, Saúl Alonso-Monsalve.

**Figure 1.** Overview of the two-stage segmentation framework utilising Sparse Submanifold Convolutional Networks (SSCNs). In Stage 1 (ROI finder), a low-resolution sparsified image is processed by a sparse 3D U-Net to identify a region of interest (ROI). Such ROIs are then dilated to be conservative and ensure all relevant structures are included, and passed to the next stage and used to crop the original high-resolut…

**Figure 2.** Sparsification: the cumulative fraction of voxels removed by applying a minimum (top) and maximum (bottom) threshold to the voxel intensity in Hounsfield Units (HU), shown for the kidney and masses (red) and all other voxels (blue). The green arrows show the chosen regions rejected in the sparsification process. To determine the most effective sparsification strategy in the first stage (low-resolution ROI …

**Figure 3.** The proposed 3D sparse U-Net architecture. The network follows a hierarchical encoder-decoder structure with progressively increasing feature dimensions in the encoder and corresponding feature reduction in the decoder. Downsampling and upsampling operations are performed with convolution and transposed convolution layers, respectively, while skip connections are implemented via element-wise summation to m…

**Figure 4.** Validation losses for Stage 1 (top) and Stage 2 (bottom). Each column represents the accumulated Dice loss across all deep supervision steps for different outputs: kidneys + masses (left), tumour + cyst (middle), and tumour only (right). Different colours indicate different folds.

**Figure 5.** From left to right: (first column) 2D slice from the original high-resolution scan showing the ground truth segmentations; (second column) predicted segmentations from Stage 1 on the low-resolution version of the scan; (third column) predicted segmentations from Stage 2 on the high-resolution version of the scan. Each row shows a single 2D slice selected from a different case (patient).

Original abstract

Accurate delineation of kidney tumours in Computed Tomography (CT) is essential for downstream quantitative analysis and precision oncology, but manual segmentation is a specialised task, time-consuming and difficult to scale. Automated 3D segmentation remains challenging because CT scans are large volumetric images, making high-resolution dense convolutional networks computationally expensive and often dependent on downsampling or patch-based inference. We propose a two-stage 3D segmentation methodology based on voxel sparsification and submanifold sparse convolutional networks (SSCNs). Stage 1 uses a low-resolution sparse network to identify a region of interest (ROI); Stage 2 applies a high-resolution sparse network for refined segmentation within the cropped ROI. This enables native high-resolution 3D processing while reducing memory use and inference time. We evaluate the method on the KiTS23 renal cancer CT dataset using 5-fold cross-validation. Our method achieved Dice similarity coefficients of 95.8% for kidneys + masses, 85.7% for tumours + cysts, and 80.3% for tumours alone, competitive with top KiTS23 approaches. In direct comparisons on the same cross-validation folds, the proposed sparse method achieves tumour + cyst and tumour-only Dice scores comparable to, and slightly higher than, a patch-based nnU-Net baseline, while consistently requiring less VRAM and shorter inference time across the tested hardware. Across the tested GPUs, our sparse model is markedly faster than both nnU-Net and the zero-shot zoom-out/zoom-in foundation model SegVol, which localises kidneys well but underperforms on small heterogeneous lesions. Compared to an equivalent dense implementation of the same architecture, the proposed sparse approach achieves up to a 60% reduction in inference time and up to a 75% reduction in VRAM usage across both CPU and the GPU configurations tested.
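The voxel sparsification step mentioned in the abstract can be illustrated with an intensity window. The sketch below uses the threshold range of (-53.4, 283.2) HU that a paper passage quoted further down this page gives; whether the bounds are inclusive, and exactly where in the pipeline this window is applied, are not specified here, so the strict inequalities and the name `sparsify` are assumptions.

```python
from typing import Dict, List, Tuple

Coord = Tuple[int, int, int]

# Window quoted from the paper elsewhere on this page; strictness of the bounds is assumed.
HU_MIN, HU_MAX = -53.4, 283.2


def sparsify(volume: List[List[List[float]]]) -> Dict[Coord, float]:
    """Keep only voxels whose HU value lies inside the window, as {(z, y, x): hu}."""
    active: Dict[Coord, float] = {}
    for z, plane in enumerate(volume):
        for y, row in enumerate(plane):
            for x, hu in enumerate(row):
                if HU_MIN < hu < HU_MAX:
                    active[(z, y, x)] = hu
    return active


# Toy 1 x 2 x 3 volume: air, soft tissue, contrast-enhanced tissue, bone, and two edge values.
volume = [[[-1000.0, 30.0, 120.0],
           [400.0, -53.4, 283.0]]]
active = sparsify(volume)
total_voxels = sum(len(row) for plane in volume for row in plane)
removed_fraction = 1.0 - len(active) / total_voxels
```

The coordinate-to-value dictionary is the form a sparse network consumes: air and dense bone never enter the computation, which is where the memory saving over a dense grid comes from.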

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper proposes a two-stage 3D segmentation pipeline for kidneys and kidney tumors in CT using submanifold sparse convolutional networks (SSCNs). Stage 1 employs a low-resolution sparse network to localize an ROI; Stage 2 applies a high-resolution sparse network for refinement inside the cropped ROI. On the KiTS23 dataset with 5-fold cross-validation, it reports Dice scores of 95.8% (kidneys + masses), 85.7% (tumours + cysts), and 80.3% (tumours alone), claiming these are competitive with or slightly superior to a patch-based nnU-Net baseline. Against an equivalent dense implementation of the same architecture it reports up to 60% shorter inference time and up to 75% lower VRAM usage, and on the tested GPUs it reports faster inference than both nnU-Net and SegVol.

Significance. If the ROI-coverage assumption holds, the work supplies concrete empirical support for efficiency gains in high-resolution 3D medical segmentation through direct side-by-side measurements of Dice, runtime, and VRAM on identical cross-validation folds. The explicit multi-target Dice reporting and hardware-specific comparisons constitute a strength that would be useful for practitioners facing memory constraints on large volumetric CT data.

major comments (1)

[Methods (two-stage pipeline) and Results (Dice reporting)] The central efficiency and accuracy claims rest on the premise that the low-resolution stage-1 sparse network produces an ROI containing every kidney and tumor voxel (including small or peripheral lesions). No stage-1 recall, missed-lesion count, or per-case ROI coverage statistics are reported in the methods or results, so the final Dice scores cannot be interpreted as guaranteed full-volume performance. This assumption is load-bearing for the two-stage design.
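The missed-lesion count requested in this comment could be computed along the following lines. This is a hedged sketch, not the paper's evaluation code: it treats each 6-connected component of ground-truth tumour voxels as one lesion and counts the lesions that are not fully contained in the stage-1 ROI. The connectivity choice and the function names are assumptions.

```python
from collections import deque
from typing import Iterable, List, Set, Tuple

Coord = Tuple[int, int, int]
Box = Tuple[Coord, Coord]  # (inclusive lower corner, exclusive upper corner)

# Face neighbours only (6-connectivity); an assumed definition of "one lesion".
NEIGHBOURS = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]


def connected_lesions(voxels: Iterable[Coord]) -> List[Set[Coord]]:
    """Group tumour voxels into 6-connected components via breadth-first search."""
    remaining = set(voxels)
    lesions: List[Set[Coord]] = []
    while remaining:
        seed = remaining.pop()
        component, queue = {seed}, deque([seed])
        while queue:
            x, y, z = queue.popleft()
            for dx, dy, dz in NEIGHBOURS:
                n = (x + dx, y + dy, z + dz)
                if n in remaining:
                    remaining.remove(n)
                    component.add(n)
                    queue.append(n)
        lesions.append(component)
    return lesions


def count_clipped_lesions(voxels: Iterable[Coord], roi: Box) -> int:
    """Number of lesions with at least one voxel outside the half-open ROI box."""
    lo, hi = roi

    def inside(p: Coord) -> bool:
        return all(lo[a] <= p[a] < hi[a] for a in range(3))

    return sum(1 for lesion in connected_lesions(voxels) if not all(inside(p) for p in lesion))


# Toy case: one lesion inside the ROI, one straddling its border, one fully outside.
roi = ((0, 0, 0), (10, 10, 10))
tumour = [(5, 5, 5), (5, 5, 6), (9, 9, 9), (10, 9, 9), (20, 20, 20)]
n_lesions = len(connected_lesions(tumour))
n_clipped = count_clipped_lesions(tumour, roi)
```

Reporting `n_clipped` per case alongside the final Dice scores would show directly how often the two-stage design loses a lesion before stage 2 sees it.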

minor comments (1)

[Abstract and Methods] The abstract and results would benefit from a brief statement of the exact voxel sparsity thresholds or submanifold convolution parameters used in each stage for reproducibility.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive and detailed review of our manuscript. The comment raises an important consideration for the two-stage pipeline, which we address below. We have made revisions to incorporate additional supporting analysis as suggested.

Referee: [Methods (two-stage pipeline) and Results (Dice reporting)] The central efficiency and accuracy claims rest on the premise that the low-resolution stage-1 sparse network produces an ROI containing every kidney and tumor voxel (including small or peripheral lesions). No stage-1 recall, missed-lesion count, or per-case ROI coverage statistics are reported in the methods or results, so the final Dice scores cannot be interpreted as guaranteed full-volume performance. This assumption is load-bearing for the two-stage design.

Authors: We agree that the central claims depend on the stage-1 network producing an ROI that encompasses all kidney and tumor voxels. The manuscript does not currently report stage-1 recall or per-case ROI coverage statistics, which limits the ability to fully interpret the Dice scores as guaranteed full-volume results. To rectify this, we will add in the revised manuscript a new subsection under Methods describing how we evaluate stage-1 ROI coverage, along with quantitative results in the Results section, including recall rates and any instances of missed lesions. This will allow readers to assess the validity of the assumption directly.

revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical evaluation only

The paper proposes a practical two-stage sparse CNN architecture for 3D CT segmentation and reports Dice scores from 5-fold cross-validation on KiTS23, with direct runtime/VRAM comparisons to nnU-Net and SegVol. No mathematical derivations, parameter fits redefined as predictions, uniqueness theorems, or self-citation chains appear in the described method or results. All performance claims rest on external measurement against held-out data and independent baselines rather than any reduction to the method's own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the empirical performance of standard 3D U-Net-style sparse convolutions and the assumption that stage-1 ROI cropping preserves all positive voxels. No new physical or mathematical axioms are introduced.

axioms (1)

domain assumption: Submanifold sparse convolutions preserve segmentation accuracy when applied to medical CT volumes at native resolution.
Invoked when claiming that the sparse architecture can replace dense convolutions without loss of Dice score.
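For readers unfamiliar with the operation this assumption concerns, here is a minimal single-channel sketch of a submanifold sparse convolution. It is a toy illustration, not the library the authors used: an output is computed only at sites that are already active in the input, and inactive neighbours contribute zero, so the set of active sites does not grow from layer to layer. The dictionary representation and the name `submanifold_conv3d` are assumptions.

```python
from itertools import product
from typing import Dict, Tuple

Coord = Tuple[int, int, int]


def submanifold_conv3d(active: Dict[Coord, float], kernel: Dict[Coord, float]) -> Dict[Coord, float]:
    """Apply `kernel` (offset -> weight) at active sites only; the output keeps the input's active set."""
    out: Dict[Coord, float] = {}
    for (z, y, x) in active:
        total = 0.0
        for (dz, dy, dx), weight in kernel.items():
            # Inactive neighbours are treated as zero and never become active.
            total += weight * active.get((z + dz, y + dy, x + dx), 0.0)
        out[(z, y, x)] = total
    return out


# 3 x 3 x 3 box kernel with unit weights, purely for illustration.
box_kernel = {offset: 1.0 for offset in product((-1, 0, 1), repeat=3)}
active = {(0, 0, 0): 1.0, (0, 0, 1): 2.0, (5, 5, 5): 4.0}
out = submanifold_conv3d(active, box_kernel)
```

An ordinary convolution over the same input would also write non-zero values at every empty site within one voxel of an active one, so the active set would dilate with each layer. Keeping it fixed is what lets memory scale with the number of retained voxels rather than with the full volume.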

pith-pipeline@v0.9.0 · 5886 in / 1332 out tokens · 39334 ms · 2026-05-18T00:58:30.686920+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Passage: voxel sparsification... 0.5th and 99.5th percentiles of the HU values... threshold range of (-53.4, 283.2) HU... retains approximately 99% of segmentation voxels while removing 76.8% of background voxels

IndisputableMonolith/Foundation/DimensionForcing.lean · alexander_duality_circle_linking
Relation between the paper passage and the cited Recognition theorem: unclear
Passage: two-stage... Stage 1 uses a low-resolution sparse network to identify a region of interest (ROI); Stage 2 applies a high-resolution sparse network

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

SEMIR: Semantic Minor-Induced Representation Learning on Graphs for Visual Segmentation
cs.CV 2026-05 unverdicted novelty 7.0

SEMIR replaces dense voxel computation with a learned topology-preserving graph minor that supports exact decoding and GNN-based inference for small-structure segmentation in large medical images.
