pith. sign in

arxiv: 2512.08216 · v3 · submitted 2025-12-09 · 📡 eess.IV · cs.CV· cs.LG

Tumor-anchored deep feature random forests for out-of-distribution detection in lung cancer segmentation

Pith reviewed 2026-05-17 00:02 UTC · model grok-4.3

classification 📡 eess.IV cs.CVcs.LG
keywords out-of-distribution detectionlung tumor segmentationrandom forestCT imagingdeep featurespost-hoc detectormedical image analysis
0
0 comments X

The pith

A random forest trained on deep features anchored to predicted tumor regions detects out-of-distribution lung CT scans at over 93 percent AUROC with only 40 labeled examples.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents RF-Deep, a post-hoc detector that extracts hierarchical features from an existing lung tumor segmentation network and aggregates them only from regions around the model's own predicted tumor mask. These aggregated descriptors train a random forest classifier using a small set of in-distribution and out-of-distribution scans. The approach matters because segmentation models can produce confidently wrong outlines on unfamiliar CT inputs such as pulmonary embolism or COVID cases, creating clinical risk. By reusing features already computed inside the backbone and requiring minimal extra labels, RF-Deep adds a lightweight safety filter that works across different network depths and pretraining schemes.

Core claim

RF-Deep repurposes hierarchical activations from a pretrained-then-finetuned segmentation backbone by collecting them from multiple regions-of-interest centered on the model's predicted tumor regions, then trains a random forest on those descriptors with as few as 20 in-distribution and 20 OOD scans. On 2232 CT volumes this yields AUROC above 93 on near-OOD sets (pulmonary embolism, COVID-negative) and above 99 on far-OOD sets (kidney cancer, healthy pancreas), with transfer to blinded sets (COVID-positive, breast cancer) above 94 under ensemble use.

What carries the argument

Tumor-anchored feature aggregation that pools hierarchical deep activations from regions-of-interest centered on the segmentation model's own predicted tumor mask before random-forest classification.

If this is right

  • Existing segmentation pipelines can insert a lightweight post-hoc filter that rejects or flags scans before erroneous tumor outlines are produced.
  • Only twenty OOD examples suffice to train the detector, allowing adaptation to new scanner sites or protocols with modest labeling effort.
  • The same detector works across segmentation backbones of different depths and pretraining strategies without retraining the main model.
  • Ensemble versions of the detector maintain high performance on completely unseen clinical validation sets such as COVID-positive and breast-cancer CT scans.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Anchoring features to the predicted tumor rather than the entire image may reduce irrelevant anatomical noise and focus detection on the region that matters most for treatment planning.
  • Even if the segmentation backbone itself is unreliable on OOD inputs, the mismatch in its own internal features around the predicted tumor can still provide a usable detection signal.
  • The method could be combined with uncertainty maps already produced by the segmentor to create a stronger two-stage safety check.
  • In hospital deployment the random forest would need periodic retraining on new scanner data to prevent performance drift from protocol changes.

Load-bearing premise

That deep features taken only from regions around the model's predicted tumor contain enough mismatch signal for a random forest to separate in-distribution from out-of-distribution scans when trained on just forty labeled examples.

What would settle it

A test collection of near-OOD lung CT scans in which the segmentation model produces accurate tumor outlines yet RF-Deep assigns low out-of-distribution scores, or in which the detector incorrectly flags many routine in-distribution scans.

Figures

Figures reproduced from arXiv: 2512.08216 by Aneesh Rangnekar, Harini Veeraraghavan.

Figure 1
Figure 1. Figure 1: Uncertainty maps for in-distribution (ID) and out-of-distribution (OOD) scans. (a) ID lung cancer [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: RF-Deep workflow for scan-level OOD detection. Panels (a–c) depict feature extraction from the [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Pretraining strategies performance and robustness evaluation on ID test set. (a) Segmentation [PITH_FULL_IMAGE:figures/full_fig_p008_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: t-SNE projected embeddings showing dataset-wise separability of (a) deep features and (b) ra [PITH_FULL_IMAGE:figures/full_fig_p010_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: SHAP feature importance for RF-Deep across OOD datasets using the SMIT-pretrained segmen [PITH_FULL_IMAGE:figures/full_fig_p011_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: SHAP feature importance for RF-Radiomics across OOD datasets using segmentations from the [PITH_FULL_IMAGE:figures/full_fig_p011_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Representative MaxLogit visualizations for (a) segmentation predictions, (b) spatial heatmaps, and [PITH_FULL_IMAGE:figures/full_fig_p012_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Ablation studies showing the sensitivity of RF-Deep for OOD detection across four datasets. Point [PITH_FULL_IMAGE:figures/full_fig_p014_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Segmentation performance across backbones under imaging variations (scanner type, contrast, [PITH_FULL_IMAGE:figures/full_fig_p024_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Ablation of Mahalanobis distance-based variants for OOD detection using AUROC over 100 [PITH_FULL_IMAGE:figures/full_fig_p024_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: t-SNE projected embeddings showing dataset-wise separability of (a) deep features and (b) [PITH_FULL_IMAGE:figures/full_fig_p027_11.png] view at source ↗
read the original abstract

Accurate segmentation of lung tumors from 3D computed tomography (CT) scans is essential for automated treatment planning and response assessment. Despite self-supervised pretraining on numerous datasets, state-of-the-art transformer backbones remain susceptible to out-of-distribution (OOD) inputs, often producing confidently incorrect segmentations with potential for risk in clinical deployment. Hence, we introduce RF-Deep, a lightweight post-hoc random forests-based framework that leverages deep features trained with limited outlier exposure, requiring as few as 40 labeled scans (20 in-distribution and 20 OOD), to improve scan-level OOD detection. RF-Deep repurposes the hierarchical features from the pretrained-then-finetuned segmentation backbones, aggregating features from multiple regions-of-interest anchored to predicted tumor regions to capture OOD likelihood. We evaluated RF-Deep on 2,232 CT volumes spanning near-OOD (pulmonary embolism, COVID-19 negative) and far-OOD (kidney cancer, healthy pancreas) datasets. RF-Deep achieved AUROC >~93 on the challenging near-OOD datasets, where it outperformed the next best method by 4--7 percentage points, and produced near-perfect detection (AUROC >~99) on far-OOD datasets. The approach also showed transferability to two blinded validation datasets under the ensemble configuration (COVID-19 positive and breast cancer; AUROC >~94). RF-Deep maintained consistent performance across backbones of different depths and pretraining strategies, demonstrating applicability of post-hoc detectors as a safety filter for clinical deployment of tumor segmentation pipelines.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces RF-Deep, a lightweight post-hoc framework that trains random forests on hierarchical deep features extracted from a pretrained segmentation backbone and aggregated only from regions-of-interest anchored to the model's own predicted tumor masks. Using as few as 40 labeled scans (20 ID + 20 OOD), the method is evaluated for scan-level OOD detection on 2,232 CT volumes spanning near-OOD (pulmonary embolism, COVID-negative) and far-OOD (kidney cancer, healthy pancreas) datasets, with additional transfer tests on blinded sets. It reports AUROC >~93 on near-OOD (outperforming the next-best baseline by 4-7 points) and >~99 on far-OOD, while remaining consistent across different backbone depths and pretraining strategies.

Significance. If the reported AUROC gains are reproducible and the tumor-anchored aggregation is shown to be robust rather than an artifact of the small training set, the work would provide a practical, low-data safety filter for clinical deployment of lung-tumor segmentation models. The approach's post-hoc nature and limited supervision requirement are attractive for real-world use where full retraining is costly.

major comments (3)
  1. [Abstract / Methods] Abstract and Methods: The central performance claim rests on feature aggregation from ROIs defined by the segmentation model's predicted tumor mask. On near-OOD inputs the model is already known to produce confidently wrong segmentations; if the predicted mask is empty, tiny, or mislocalized, the aggregated features are drawn from background or artifact regions. No statistics on predicted-mask volume, overlap with ground truth, or failure rate on the OOD test sets are reported, nor is an ablation against full-volume or random-region aggregation provided. This leaves open whether the anchoring contributes signal or merely enables memorization of spurious correlations from the 40-scan training set.
  2. [Results] Results: AUROC values are reported without error bars, confidence intervals, or statistical significance tests for the 4-7 point gains over baselines. With only 20 OOD training scans and no detail on how the OOD labels were obtained or how the baseline detectors were re-implemented, it is difficult to judge whether the improvements are stable or sensitive to the particular choice of 20 OOD examples.
  3. [Experiments] Experiments: The manuscript states that RF-Deep maintains consistent performance across backbones of different depths and pretraining strategies, yet provides no quantitative table or figure showing per-backbone AUROC on the same OOD splits. Without these numbers it is impossible to verify the claimed robustness.
minor comments (2)
  1. [Abstract] The abstract uses the approximate symbol '~' for AUROC thresholds; exact values and the number of runs should be stated in the main text or tables.
  2. [Methods] Clarify the exact random-forest hyperparameters (number of trees, maximum depth, feature sampling) and whether they were tuned on a validation split or fixed a priori.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the insightful comments, which have helped us identify areas to strengthen our manuscript. We address each major comment below and will make the necessary revisions to improve clarity and robustness of the presented results.

read point-by-point responses
  1. Referee: [Abstract / Methods] Abstract and Methods: The central performance claim rests on feature aggregation from ROIs defined by the segmentation model's predicted tumor mask. On near-OOD inputs the model is already known to produce confidently wrong segmentations; if the predicted mask is empty, tiny, or mislocalized, the aggregated features are drawn from background or artifact regions. No statistics on predicted-mask volume, overlap with ground truth, or failure rate on the OOD test sets are reported, nor is an ablation against full-volume or random-region aggregation provided. This leaves open whether the anchoring contributes signal or merely enables memorization of spurious correlations from the 40-scan training set.

    Authors: We agree that this is a critical point to address for validating the tumor-anchored approach. In the revised version, we will report statistics on the predicted tumor mask volumes (e.g., mean and distribution of mask sizes), Dice overlap with ground truth where available for ID cases, and failure rates (e.g., empty mask percentage) on both ID and OOD test sets. Furthermore, we will include an ablation study comparing our tumor-anchored aggregation to full-volume feature aggregation and random-region sampling. This will demonstrate that the anchoring provides meaningful signal beyond potential spurious correlations in the small training set. revision: yes

  2. Referee: [Results] Results: AUROC values are reported without error bars, confidence intervals, or statistical significance tests for the 4-7 point gains over baselines. With only 20 OOD training scans and no detail on how the OOD labels were obtained or how the baseline detectors were re-implemented, it is difficult to judge whether the improvements are stable or sensitive to the particular choice of 20 OOD examples.

    Authors: We acknowledge the need for more rigorous statistical reporting. The revised manuscript will include error bars (standard deviation over multiple random seeds or cross-validation splits) and 95% confidence intervals for all AUROC values. We will also conduct and report statistical significance tests, such as DeLong's test for comparing AUROCs. Additionally, we will expand the methods section to detail how the OOD labels were sourced from the respective public datasets and provide specifics on the re-implementation of baseline methods to facilitate reproducibility and assessment of stability. revision: yes

  3. Referee: [Experiments] Experiments: The manuscript states that RF-Deep maintains consistent performance across backbones of different depths and pretraining strategies, yet provides no quantitative table or figure showing per-backbone AUROC on the same OOD splits. Without these numbers it is impossible to verify the claimed robustness.

    Authors: We apologize for not including these details in the original submission. In the revision, we will add a table (or supplementary figure) presenting the AUROC values for each backbone variant (different depths and pretraining strategies) on the identical OOD evaluation splits. This will quantitatively support the consistency claim and allow readers to assess the robustness across architectures. revision: yes

Circularity Check

0 steps flagged

No significant circularity in empirical RF-Deep method

full rationale

The paper presents an empirical post-hoc framework in which a random forest is trained on hierarchical features extracted from a segmentation backbone and aggregated from regions anchored to the model's predicted tumor masks, using an external set of 20 ID + 20 OOD labeled scans. No equations, uniqueness theorems, or self-citation chains are invoked that reduce the final AUROC or OOD score to a quantity defined by the same fitted parameters or by construction. Performance is measured on held-out external near-OOD and far-OOD volumes, rendering the central claim independent of any internal definitional loop.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption that segmentation backbone features carry OOD signal when restricted to predicted tumor regions and that a modest number of outlier examples suffice to train a generalizable detector. No free parameters are explicitly named in the abstract; random forest hyperparameters are implicit but not reported. No new physical entities are postulated.

free parameters (1)
  • Random forest hyperparameters (trees, depth, feature sampling)
    Standard random forest tuning parameters required to achieve the reported AUROC but not specified in the abstract.
axioms (1)
  • domain assumption Deep features from a segmentation backbone contain information usable for distinguishing in-distribution from out-of-distribution CT volumes when aggregated around predicted tumor locations.
    Invoked when the method repurposes the pretrained-then-finetuned backbone features for the OOD task.

pith-pipeline@v0.9.0 · 5593 in / 1518 out tokens · 65661 ms · 2026-05-17T00:02:28.753404+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

2 extracted references · 2 canonical work pages

  1. [1]

    Brennan Nichyporuk, Jillian Cardinell, Justin Szeto, Raghav Mehta, Jean-Pierre Falet, Douglas L

    Springer Nature Switzerland. Brennan Nichyporuk, Jillian Cardinell, Justin Szeto, Raghav Mehta, Jean-Pierre Falet, Douglas L. Arnold, Sotirios A. Tsaftaris, and Tal Arbel. Rethinking generalization: The impact of annotation style on medical image segmentation.Machine Learning for Biomedical Imaging, 2022. Maxime Oquab, Timothée Darcet, Theo Moutakanni, Hu...

  2. [2]

    Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al

    IEEE. Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. Pytorch: An imperative style, high-performance deep learning library.Advances in neural information processing systems, 2019. Walter HL Pinaya, Petru-Daniel Tudosiu, Robert Gray, Geraint Rees, Paras...