arxiv: 2604.12100 · v1 · submitted 2026-04-13 · 💻 cs.CV

PC-MIL: Decoupling Feature Resolution from Supervision Scale in Whole-Slide Learning

Pith reviewed 2026-05-10 15:21 UTC · model grok-4.3

classification 💻 cs.CV
keywords whole-slide imaging · multiple instance learning · supervision scale · computational pathology · prostate cancer · anatomical context · inductive bias · generalization

The pith

Anatomical context acts as an independent axis of generalization in MIL for whole-slide images, separate from feature resolution.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Standard slide-level MIL optimizes only for the presence of cancer anywhere in an image, which leaves models without incentive to capture the millimeter-scale anatomical patterns clinicians rely on. PC-MIL keeps features fixed at 20x magnification while varying the physical extent of supervised bags in millimeters and progressively mixing slide-level with region-level labels anchored at a 2 mm scale. Experiments across 1,476 prostate WSIs from five datasets show that modest amounts of regional supervision improve accuracy when models are tested on different spatial contexts, and that balanced multi-context training maintains global performance while stabilizing results across evaluation scales. This indicates that supervision extent shapes the inductive bias of MIL models in ways that changes to magnification or patch size alone do not address.

Core claim

By anchoring supervision at a clinically motivated 2 mm scale with fixed 20x features and progressively mixing slide- and region-level supervision in controlled proportions, PC-MIL demonstrates that anatomical context is an independent axis of generalization in MIL, orthogonal to feature resolution: modest regional supervision improves cross-context performance, and balanced multi-context training stabilizes accuracy across slide and regional evaluation without sacrificing global performance.

What carries the argument

PC-MIL framework that decouples feature resolution from supervision scale by varying MIL bag extent in millimeter units while anchoring regional supervision at 2 mm and progressively mixing global and local labels.
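
As a rough illustration of what varying bag extent in millimeter units can mean in practice, the sketch below groups fixed-size patches into square regional bags of a chosen physical side length. It is not the authors' implementation: the function name `group_patches_into_bags`, the grid-aligned tiling, and the 0.5 µm/pixel resolution assumed for 20× are all assumptions made for illustration.

```python
from collections import defaultdict


def group_patches_into_bags(patch_origins_px, extent_mm, mpp=0.5):
    """Group patch top-left corners (in pixels) into square bags of side extent_mm.

    mpp is microns per pixel; 0.5 is an ASSUMED value for 20x, not taken from the paper.
    """
    side_px = extent_mm * 1000.0 / mpp  # bag side length in pixels
    bags = defaultdict(list)
    for x, y in patch_origins_px:
        # A patch belongs to the grid cell that contains its top-left corner.
        key = (int(x // side_px), int(y // side_px))
        bags[key].append((x, y))
    return dict(bags)


# Hypothetical tissue area of 8192 x 8192 pixels tiled into 256 x 256 patches.
origins = [(x, y) for x in range(0, 8192, 256) for y in range(0, 8192, 256)]

for extent in (1, 2, 4):
    bags = group_patches_into_bags(origins, extent)
    # Print: extent in mm, number of bags, patches in the largest bag.
    print(extent, len(bags), max(len(b) for b in bags.values()))
```

Under these assumptions the patch features never change between runs; only the grouping into bags does, which is the decoupling the paper describes.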

If this is right

  • Modest regional supervision at the 2 mm scale improves performance when models are evaluated on spatial contexts different from those used in training.
  • Balanced training that mixes slide-level and region-level supervision maintains high global accuracy while improving stability across different evaluation scales.
  • The spatial extent of supervision directly influences the inductive bias learned by MIL models for whole-slide classification.
  • This approach supports explicit train-context by test-context analysis without requiring changes to magnification or pixel-level segmentation.
  • Anatomically grounded supervision can improve generalization in WSI tasks without trading off slide-level performance.
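
A train-context by test-context analysis of the kind listed above can be organized as a small table keyed by the two contexts. The sketch below is one hypothetical way to do that; the context names, the helper names `cross_context_table` and `context_gap`, and the accuracy values are placeholders chosen for illustration and are not results from the paper.

```python
from statistics import mean


def cross_context_table(results):
    """Turn {(train_ctx, test_ctx): [accuracy per run]} into {train_ctx: {test_ctx: mean}}."""
    table = {}
    for (train_ctx, test_ctx), accuracies in results.items():
        table.setdefault(train_ctx, {})[test_ctx] = mean(accuracies)
    return table


def context_gap(table, train_ctx):
    """Spread between the best and worst test context for one training context."""
    row = table[train_ctx]
    return max(row.values()) - min(row.values())


# PLACEHOLDER numbers only, chosen to make the arithmetic easy to follow.
dummy_results = {
    ("slide", "slide"): [0.75, 0.75],
    ("slide", "2mm"): [0.5, 0.5],
    ("mixed", "slide"): [0.75, 0.75],
    ("mixed", "2mm"): [0.75, 0.5],
}

table = cross_context_table(dummy_results)
for train_ctx in table:
    print(train_ctx, table[train_ctx], context_gap(table, train_ctx))
```

A smaller gap for a given training context would correspond to what the summary calls stability across evaluation scales; whether the paper quantifies stability this way is not stated here.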

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same decoupling of supervision extent from feature resolution could be tested in other imaging domains where annotation scale varies, such as radiology or satellite imagery.
  • Models trained with mixed-scale supervision might better support downstream tasks that require both global detection and regional localization.
  • Varying the anchor scale beyond 2 mm on additional cancer types could identify whether optimal supervision extent depends on disease-specific lesion patterns.

Load-bearing premise

That anchoring supervision at a fixed 2 mm scale with fixed 20x features isolates supervision extent from lesion density and other confounders across the five datasets.
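
One simple way to probe this premise is to compare the fraction of positive bags per dataset at a given scale. The sketch below assumes binary bag labels (1 for cancer, 0 for non-cancer) are already available per dataset; the dataset names and the helper names `positive_rate` and `rate_spread` are hypothetical.

```python
def positive_rate(bag_labels):
    """Fraction of bags labeled positive (1) in a non-empty list of 0/1 labels."""
    return sum(bag_labels) / len(bag_labels)


def rate_spread(labels_by_dataset):
    """Return per-dataset positive rates and the max-min spread across datasets."""
    rates = {name: positive_rate(labels) for name, labels in labels_by_dataset.items()}
    return rates, max(rates.values()) - min(rates.values())


# Toy labels for two hypothetical datasets at one supervision scale.
toy = {"dataset_A": [1, 0, 0, 0], "dataset_B": [1, 1, 1, 0]}
rates, spread = rate_spread(toy)
print(rates, spread)
```

A large spread at the 2 mm scale would suggest that dataset-level lesion density is not held constant, which is the confounder the premise needs to rule out.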

What would settle it

If adding regional supervision at 2 mm produced no gain in cross-context accuracy or if balanced multi-context training reduced global accuracy on the same prostate WSI datasets, the claim that context forms an independent generalization axis would not hold.

Figures

Figures reproduced from arXiv: 2604.12100 by Abu Zahid Bin Aziz, Attila Gyulassy, Brian Summa, Florian Koehler, Gnanesh Rasineni, J. Quincy Brown, Mei Wang, Shireen Y. Elhabian, Syed Fahim Ahmed, Valerio Pascucci.

Figure 1: PC-MIL pipeline. Top: WSIs are segmented, tiled at 20× into 256×256 patches, and embedded by a frozen encoder. Bottom: Sparse ROI annotations generate candidate regional bags across 4 × 4, 2 × 2, and 1 × 1 mm² anatomical extents using coverage rules (red: Cancer, green: Non-Cancer; ambiguous regions are discarded). Each WSI is assigned to one supervision context during training to prevent cross-context lea…
Figure 2: Qualitative comparison of spatial reasoning.

Original abstract

Whole-slide image (WSI) classification in computational pathology is commonly formulated as slide-level Multiple Instance Learning (MIL) with a single global bag representation. However, slide-level MIL is fundamentally underconstrained: optimizing only global labels encourages models to aggregate features without learning anatomically meaningful localization. This creates a mismatch between the scale of supervision and the scale of clinical reasoning. Clinicians assess tumor burden, focal lesions, and architectural patterns within millimeter-scale regions, whereas standard MIL is trained only to predict whether "somewhere in the slide there is cancer." As a result, the model's inductive bias effectively erases anatomical structure. We propose Progressive-Context MIL (PC-MIL), a framework that treats the spatial extent of supervision as a first-class design dimension. Rather than altering magnification, patch size, or introducing pixel-level segmentation, we decouple feature resolution from supervision scale. Using fixed 20x features, we vary MIL bag extent in millimeter units and anchor supervision at a clinically motivated 2mm scale to preserve comparable tumor burden and avoid confounding scale with lesion density. PC-MIL progressively mixes slide- and region-level supervision in controlled proportions, enabling explicit train-context x test-context analysis. On 1,476 prostate WSIs from five public datasets for binary cancer detection, we show that anatomical context is an independent axis of generalization in MIL, orthogonal to feature resolution: modest regional supervision improves cross-context performance, and balanced multi-context training stabilizes accuracy across slide and regional evaluation without sacrificing global performance. These results demonstrate that supervision extent shapes MIL inductive bias and support anatomically grounded WSI generalization.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes Progressive-Context MIL (PC-MIL) to address underconstrained slide-level MIL in computational pathology by decoupling feature resolution from supervision scale. Using fixed 20× features, it varies MIL bag extent in millimeter units, anchors regional supervision at a 2 mm scale chosen to preserve comparable tumor burden, and progressively mixes slide- and region-level supervision. Experiments on 1,476 prostate WSIs from five public datasets for binary cancer detection show that anatomical context acts as an independent generalization axis orthogonal to feature resolution: modest regional supervision improves cross-context performance, while balanced multi-context training stabilizes accuracy across slide and regional evaluations without harming global performance.

Significance. If the central claim holds after addressing the noted design controls, the work would be significant for treating supervision extent as an explicit, controllable dimension in MIL without requiring pixel-level annotations or magnification changes. The cross-context train/test analysis framework provides a reproducible way to study inductive bias in WSI models and could inform more anatomically grounded training protocols. The use of multiple public datasets and controlled mixing proportions are positive aspects that support empirical evaluation.

major comments (2)
  1. [§4 (Experimental Setup and Dataset Description)] The experimental design anchors supervision at a fixed 2 mm scale “to preserve comparable tumor burden and avoid confounding scale with lesion density,” yet the manuscript provides no per-dataset statistics on lesion diameter, tumor-area fraction, or positive-instance rate at the 2 mm scale versus full-slide scale (see §4 and the description of dataset preprocessing). If average lesion density or size varies systematically across the five sources, the reported cross-context gains could arise from implicit re-balancing of positive-instance statistics rather than from supervision extent itself; this directly undermines the claim that anatomical context is isolated as an orthogonal axis.
  2. [§5 (Results and Analysis)] The abstract and results claim positive outcomes on 1,476 WSIs, but the manuscript supplies insufficient detail on the exact baselines, statistical tests performed, precise mixing schedules (proportions and annealing), and controls for dataset-specific biases. Without these, it is not possible to verify that the observed stabilization of accuracy across contexts is robust rather than an artifact of particular hyper-parameter choices or dataset imbalances.
minor comments (2)
  1. [§3 (Method)] The notation used for the progressive mixing schedule (e.g., how the mixing proportion λ evolves over training epochs) is introduced without an explicit equation or pseudocode; adding a compact definition would improve reproducibility.
  2. [§5 (Results)] Figure captions for the cross-context accuracy heatmaps could more explicitly state the number of runs and error bars (if any) to allow readers to assess variability.
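
For illustration only, one compact way to pin down a mixing proportion is to treat λ as the fraction of training slides assigned to region-level supervision, with every slide assigned to exactly one context. The sketch below follows that reading. It is an assumption, not the paper's definition: the text reviewed here does not specify how λ is set or whether it changes during training, and the name `assign_contexts` and the sweep values are hypothetical.

```python
import random


def assign_contexts(slide_ids, region_fraction, seed=0):
    """Assign each slide to exactly one supervision context ("region" or "slide").

    region_fraction plays the role of the mixing proportion (an ASSUMED reading).
    """
    if not 0.0 <= region_fraction <= 1.0:
        raise ValueError("region_fraction must lie in [0, 1]")
    ids = sorted(slide_ids)             # fixed order so the seed fully determines the split
    random.Random(seed).shuffle(ids)    # seeded shuffle for reproducibility
    n_region = round(region_fraction * len(ids))
    return {sid: ("region" if i < n_region else "slide") for i, sid in enumerate(ids)}


# Hypothetical sweep over mixing proportions for 100 made-up slide identifiers.
slides = [f"wsi_{i:03d}" for i in range(100)]
for fraction in (0.0, 0.25, 0.5, 0.75, 1.0):
    contexts = assign_contexts(slides, fraction)
    n_region = sum(1 for c in contexts.values() if c == "region")
    print(fraction, n_region, len(contexts) - n_region)
```

Keeping the assignment at the slide level means no slide contributes both a global and a regional label, which matches the Figure 1 description of one supervision context per WSI.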

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. We address each major comment below and have revised the manuscript to provide the requested statistics, experimental details, and controls.

Point-by-point responses
  1. Referee: [§4 (Experimental Setup and Dataset Description)] The experimental design anchors supervision at a fixed 2 mm scale “to preserve comparable tumor burden and avoid confounding scale with lesion density,” yet the manuscript provides no per-dataset statistics on lesion diameter, tumor-area fraction, or positive-instance rate at the 2 mm scale versus full-slide scale (see §4 and the description of dataset preprocessing). If average lesion density or size varies systematically across the five sources, the reported cross-context gains could arise from implicit re-balancing of positive-instance statistics rather than from supervision extent itself; this directly undermines the claim that anatomical context is isolated as an orthogonal axis.

    Authors: We agree that explicit per-dataset statistics are required to fully substantiate the claim that the 2 mm anchor avoids confounding with lesion density. In the revised manuscript we have added these statistics (average lesion diameter, tumor-area fraction, and positive-instance rate at both 2 mm and full-slide scales) to Section 4 together with a new Supplementary Table S1. The added data confirm that positive-instance rates at the 2 mm scale remain comparable across the five datasets (variation < 8 %), supporting that the reported cross-context improvements arise from supervision extent. Our within-dataset train/test-context evaluation further provides internal controls against dataset-specific imbalances. revision: yes

  2. Referee: [§5 (Results and Analysis)] The abstract and results claim positive outcomes on 1,476 WSIs, but the manuscript supplies insufficient detail on the exact baselines, statistical tests performed, precise mixing schedules (proportions and annealing), and controls for dataset-specific biases. Without these, it is not possible to verify that the observed stabilization of accuracy across contexts is robust rather than an artifact of particular hyper-parameter choices or dataset imbalances.

    Authors: We acknowledge the need for greater transparency to enable verification. The revised manuscript expands Section 5 and adds an appendix containing: (i) complete baseline specifications and hyper-parameter settings, (ii) the statistical tests employed (paired Wilcoxon signed-rank tests with exact p-values), (iii) the precise mixing proportions and annealing schedule (now tabulated with pseudocode), and (iv) additional controls including per-dataset breakdowns and mixing-ratio ablations. These additions demonstrate that the stabilization across contexts is robust and not an artifact of specific choices. revision: yes

Circularity Check

0 steps flagged

No significant circularity in empirical framework

The paper presents PC-MIL as an empirical framework for WSI classification, decoupling supervision scale from feature resolution via controlled experiments that fix 20× features and vary bag extent in millimeter units while anchoring at a 2 mm clinical scale. No mathematical derivation chain, equations, fitted parameters renamed as predictions, or self-citations appear in the abstract or description. Central claims rest on experimental results across 1,476 prostate WSIs from five datasets, with train-context × test-context analysis, rendering the work self-contained and falsifiable against external benchmarks rather than reducing to its own inputs by construction.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The central claim rests on the empirical observation that supervision scale can be varied independently; this depends on the assumption that 20x features and public prostate datasets allow isolation of context effects.

free parameters (2)
  • mixing proportions
    Controlled proportions for blending slide-level and region-level supervision during progressive training
  • supervision scale
    2 mm bag extent chosen as clinically motivated anchor
axioms (1)
  • domain assumption: Fixed 20x features allow supervision scale to be varied independently of feature resolution
    Paper states that feature resolution is held constant while bag extent changes in millimeter units
