pith. sign in

arxiv: 2601.10073 · v2 · submitted 2026-01-15 · 💻 cs.CV · cs.AI

ReaMIL: Reasoning- and Evidence-Aware Multiple Instance Learning for Whole-Slide Histopathology

Pith reviewed 2026-05-16 13:38 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords multiple instance learningwhole-slide imageshistopathologytile selectionevidence efficiencybudgeted sufficiencycancer subtypingTCGA
0
0 comments X

The pith

ReaMIL adds a selection head to MIL models so that only a small number of tiles suffice for accurate whole-slide cancer subtyping while keeping baseline AUC.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces ReaMIL, which augments a standard multiple-instance-learning backbone with a lightweight selection head that produces soft per-tile gates. This head is trained by a budgeted-sufficiency objective: a hinge loss that forces the true-class probability to stay above a chosen threshold using only the gated tiles, while a sparsity term limits how many tiles are kept. The result is small, spatially compact evidence sets that match or slightly exceed baseline AUC on TCGA-NSCLC, TCGA-BRCA, and PANDA without any extra supervision. On NSCLC the method reports AUC 0.983 together with a mean minimal sufficient K of roughly 8 tiles at a 0.90 threshold and an area-under-K-curve of 0.864, showing that rises sharply once a handful of tiles are retained. The approach therefore supplies both high accuracy and quantitative evidence-efficiency diagnostics that standard MIL pipelines lack.

Core claim

ReaMIL trains a soft selection head on top of a MIL backbone with a budgeted-sufficiency hinge loss that enforces the true-class probability to be at least τ when only the selected tiles are used, subject to a sparsity budget on the number of tiles; this produces small, spatially compact evidence sets that preserve baseline AUC and yield slide-level overlays without additional supervision.

What carries the argument

Budgeted-sufficiency hinge loss with a sparsity constraint on the number of selected tiles, which enforces class-probability sufficiency while promoting compactness and spatial coherence.

If this is right

  • Matches or slightly improves AUC across TCGA-NSCLC, TCGA-BRCA, and PANDA while returning mean minimal sufficient K of about 8 tiles at τ = 0.90 on NSCLC.
  • Produces quantitative diagnostics (MSK, AUKC, contiguity) that measure how quickly class rises with added tiles.
  • Yields slide-level overlays of the selected tiles without requiring extra labels.
  • Integrates directly into existing MIL training pipelines with no change to the backbone architecture.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same selection mechanism could be applied to other large-image domains where only a few patches carry the decisive signal.
  • Compact evidence sets may lower the cost of human review in clinical workflows by directing attention to a handful of regions.
  • The AUKC metric offers a way to compare evidence efficiency across different MIL variants on the same data.
  • Spatially compact selections might reduce the need for post-hoc explanation methods in digital pathology.

Load-bearing premise

Enforcing class probability above a threshold with a limited number of tiles under a sparsity budget will automatically select diagnostically meaningful and spatially compact tiles without degrading overall accuracy.

What would settle it

A controlled experiment in which enforcing the small selection budget on a held-out WSI cohort causes AUC to fall below the baseline MIL model, or in which selected tiles are judged non-diagnostic by blinded pathologists.

Figures

Figures reproduced from arXiv: 2601.10073 by Hwiyoung Kim, Hyun Do Jung, Jungwon Choi.

Figure 1
Figure 1. Figure 1: Overview of ReaMIL. Frozen UNI2-h features and patch coordinates are extracted from each WSI and mapped to tokens with [PITH_FULL_IMAGE:figures/full_fig_p003_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: K-curve on NSCLC (test set). True-class probability [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Evidence visualization on TCGA-NSCLC. Left: LUSC (squamous cell carcinoma) case with relatively compact evidence clusters over squamous tumor nests. Right: LUAD (adenocarcinoma) case with more diffuse selection over gland-forming tumor regions. Each panel shows selected tile locations (green boxes) and the corresponding top-K zoomed patches. For visualization, we show zoomed-in regions (left: 8192 × 8192; … view at source ↗
read the original abstract

We introduce ReaMIL (Reasoning- and Evidence-Aware MIL), a multiple instance learning approach for whole-slide histopathology that adds a light selection head to a strong MIL backbone. The head produces soft per-tile gates and is trained with a budgeted-sufficiency objective: a hinge loss that enforces the true-class probability to be $\geq \tau$ using only the kept evidence, under a sparsity budget on the number of selected tiles. The budgeted-sufficiency objective yields small, spatially compact evidence sets without sacrificing baseline performance. Across TCGA-NSCLC (LUAD vs. LUSC), TCGA-BRCA (IDC vs. Others), and PANDA, ReaMIL matches or slightly improves baseline AUC and provides quantitative evidence-efficiency diagnostics. On NSCLC, it attains AUC 0.983 with a mean minimal sufficient K (MSK) $\approx 8.2$ tiles at $\tau = 0.90$ and AUKC $\approx 0.864$, showing that class confidence rises sharply and stabilizes once a small set of tiles is kept. The method requires no extra supervision, integrates seamlessly with standard MIL training, and naturally yields slide-level overlays. We report accuracy alongside MSK, AUKC, and contiguity for rigorous evaluation of model behavior on WSIs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 3 minor

Summary. The manuscript introduces ReaMIL, a multiple instance learning framework for whole-slide histopathology images that incorporates a selection head producing soft per-tile gates. These gates are optimized via a budgeted-sufficiency hinge loss to maintain a true-class probability of at least τ while respecting a sparsity budget on selected tiles. The method is evaluated on TCGA-NSCLC, TCGA-BRCA, and PANDA datasets, claiming to match or exceed baseline AUC while providing compact evidence sets, with specific metrics such as AUC 0.983, mean MSK of approximately 8.2 tiles at τ=0.90, and AUKC of 0.864 on NSCLC.

Significance. If the selected evidence sets prove clinically relevant, ReaMIL offers a practical advance in interpretable MIL for computational pathology by quantifying evidence efficiency via MSK, AUKC, and contiguity without requiring tile-level labels or extra supervision. The seamless integration with existing MIL backbones and the production of slide-level overlays are notable strengths that could support more trustworthy model deployment in diagnostic workflows.

major comments (3)
  1. [§4] §4 (Experiments): The reported AUC values and slight improvements over baselines are presented without details on baseline re-implementations, hyperparameter search protocols, or statistical significance testing (e.g., DeLong tests or bootstrap confidence intervals for AUC differences), which undermines verification of the central performance claims given the dependence on user-chosen τ and sparsity budget.
  2. [§3.2] §3.2 (Budgeted-sufficiency objective): The hinge loss enforces P(y|kept tiles) ≥ τ under the sparsity constraint but supplies no signal ensuring the kept tiles align with histopathologically diagnostic regions; without tile-level annotations or post-hoc expert review, high AUC + low MSK can arise from any small correlating set, including non-causal or artifactual tiles.
  3. [§4.3] §4.3 (Ablations and diagnostics): Sensitivity of MSK, AUKC, and contiguity to the free parameters τ and sparsity budget is only partially explored; a fuller ablation table showing performance degradation or evidence quality changes across a range of these values is needed to support the claim of robust evidence-aware behavior.
minor comments (3)
  1. [Abstract] Abstract: The phrase 'rigorous evaluation of model behavior' should explicitly name the baselines (e.g., standard attention-based MIL) and the exact statistical measures used.
  2. [Figure 3] Figure 3 (slide overlays): The caption should clarify how contiguity is quantitatively defined and computed from the soft gates.
  3. [§3] Notation: The definition of the soft per-tile gates and their relation to the final slide-level prediction should be stated more explicitly in the method section to avoid ambiguity with standard MIL pooling.

Simulated Author's Rebuttal

3 responses · 0 unresolved

Thank you for the constructive feedback on our manuscript. We have carefully considered each major comment and provide point-by-point responses below. Where appropriate, we will revise the manuscript to incorporate the suggested improvements.

read point-by-point responses
  1. Referee: [§4] §4 (Experiments): The reported AUC values and slight improvements over baselines are presented without details on baseline re-implementations, hyperparameter search protocols, or statistical significance testing (e.g., DeLong tests or bootstrap confidence intervals for AUC differences), which undermines verification of the central performance claims given the dependence on user-chosen τ and sparsity budget.

    Authors: We agree that additional details are necessary to allow full verification of the results. In the revised manuscript, we will include a dedicated subsection describing the re-implementation of baselines, the hyperparameter search procedure (including ranges and selection criteria on a validation split), and statistical comparisons using bootstrap resampling to compute confidence intervals for AUC differences. revision: yes

  2. Referee: [§3.2] §3.2 (Budgeted-sufficiency objective): The hinge loss enforces P(y|kept tiles) ≥ τ under the sparsity constraint but supplies no signal ensuring the kept tiles align with histopathologically diagnostic regions; without tile-level annotations or post-hoc expert review, high AUC + low MSK can arise from any small correlating set, including non-causal or artifactual tiles.

    Authors: The budgeted-sufficiency loss is intentionally formulated to identify minimal sets sufficient for the model's prediction rather than to enforce alignment with expert-defined diagnostic regions, as the latter would require additional supervision which the method aims to avoid. We acknowledge this as a limitation and will expand the discussion section to clarify that the selected tiles are sufficient for high-confidence predictions but may not correspond to causal histopathological features. We will also include more qualitative visualizations to allow readers to assess relevance. revision: partial

  3. Referee: [§4.3] §4.3 (Ablations and diagnostics): Sensitivity of MSK, AUKC, and contiguity to the free parameters τ and sparsity budget is only partially explored; a fuller ablation table showing performance degradation or evidence quality changes across a range of these values is needed to support the claim of robust evidence-aware behavior.

    Authors: We will extend the ablation experiments to cover a broader range of τ values (e.g., 0.75 to 0.95) and sparsity budgets, presenting a comprehensive table that reports AUC, MSK, AUKC, and contiguity metrics for each combination to demonstrate robustness. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical results follow from defined objective on held-out data

full rationale

The paper defines a budgeted-sufficiency hinge loss that enforces true-class probability >= τ using selected tiles under a sparsity budget. Reported metrics (AUC 0.983, MSK ≈8.2 at τ=0.90, AUKC≈0.864) are computed on held-out TCGA and PANDA test sets after standard training. These quantities are not algebraically equivalent to the chosen hyperparameters τ and sparsity budget by construction; they are independent empirical outcomes. No self-citations, uniqueness theorems, or ansatzes are load-bearing in the core claims. The method is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axioms · 1 invented entities

The central claim rests on a new training objective and a lightweight head whose behavior is controlled by two explicit hyperparameters; no new physical entities are postulated.

free parameters (2)
  • tau = 0.90
    Probability threshold enforced by the hinge loss on the kept evidence.
  • sparsity budget
    Upper limit on the number of tiles the selection head may keep.
axioms (1)
  • domain assumption A lightweight selection head can be trained jointly with any strong MIL backbone without architectural conflict.
    The paper states the head integrates seamlessly with standard MIL training.
invented entities (1)
  • soft per-tile gates no independent evidence
    purpose: Produce differentiable selection probabilities for evidence tiles.
    New model component introduced to implement the budgeted selection.

pith-pipeline@v0.9.0 · 5534 in / 1339 out tokens · 51699 ms · 2026-05-16T13:38:53.459348+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

18 extracted references · 18 canonical work pages

  1. [1]

    Artifi- cial intelligence for diagnosis and gleason grading of prostate cancer: the panda challenge.Nature Medicine, 28(1):154– 163, 2022

    Wouter Bulten, Kimmo Kartasalo, Po-Hsuan Cameron Chen, Peter Str ¨om, Hans Pinckaers, Kunal Nagpal, Yuannan Cai, David F Steiner, Hester Van Boven, Robert Vink, et al. Artifi- cial intelligence for diagnosis and gleason grading of prostate cancer: the panda challenge.Nature Medicine, 28(1):154– 163, 2022. 2

  2. [2]

    Hanna, Luke Geneslaw, Andrew Miraflor, Vitor Werneck Krauss Silva, Klaus J

    Gabriele Campanella, Matthew G. Hanna, Luke Geneslaw, Andrew Miraflor, Vitor Werneck Krauss Silva, Klaus J. Busam, Edi Brogi, Victor E. Reuter, David S. Klimstra, and Thomas J. Fuchs. Clinical-grade computational pathology using weakly supervised deep learning on whole slide im- ages.Nature Medicine, 25(8):1301–1309, 2019. 1

  3. [3]

    Tomsett, R

    Supriyo Chakraborty, Richard J. Tomsett, R. Raghaven- dra, Daniel Harborne, M. Alzantot, F. Cerutti, M. Sri- vastava, A. Preece, S. Julier, R. Rao, T. Kelley, Dave Braines, M. Sensoy, C. Willis, and Prudhvi K. Gurram. Interpretability of deep learning models: A survey of re- sults.2017 IEEE SmartWorld, Ubiquitous Intelligence & Computing, Advanced & Trust...

  4. [4]

    Towards a general-purpose foundation model for com- putational pathology.Nature Medicine, 30:154–165, 2024

    Richard J Chen, Tong Ding, Ming Y Lu, Drew F K Williamson, Guillaume Jaume, Bowen Chen, Andrew Zhang, Daniel Shao, Andrew H Song, Muhammad Shaban, et al. Towards a general-purpose foundation model for com- putational pathology.Nature Medicine, 30:154–165, 2024. 1, 2

  5. [5]

    Scaling vision transform- ers to 22 billion parameters.arXiv preprint, 2022

    Zhengying Chen, Xiao Wang, Alexander Kolesnikov, Dirk Weissenborn, Xiaojuan Qi, Jakob Verbeek, Neil Houlsby, Xiaohua Zhai, and Lucas Beyer. Scaling vision transform- ers to 22 billion parameters.arXiv preprint, 2022. 2

  6. [6]

    Overcoming data scarcity in biomedical imaging with a foundational multi-task model

    Ozan Ciga and Anne L Martel. Overcoming data scarcity in biomedical imaging with a foundational multi-task model. arXiv preprint arXiv:2010.07964, 2022. 1

  7. [7]

    Neofytos Dimitriou, Ognjen Arandjelovic, and P. Caie. Deep learning for whole slide image analysis: An overview.Fron- tiers in Medicine, 2019. 1

  8. [8]

    Edwards, Mauricio Oberti, Ratna R

    Nathan J. Edwards, Mauricio Oberti, Ratna R. Thangudu, Shuang Cai, Peter B. McGarvey, Shine Jacob, Subha Mad- havan, and Karen A. Ketchum. The CPTAC Data Portal: A Resource for Cancer Proteomics Research.Journal of Pro- teome Research, 14(6):2707–2713, 2015. 2

  9. [9]

    Selective classification for deep neural networks

    Yonatan Geifman and Ran El-Yaniv. Selective classification for deep neural networks. InAdvances in Neural Information Processing Systems, 2017. 2

  10. [10]

    Tomczak, and Max Welling

    Maximilian Ilse, Jakub M. Tomczak, and Max Welling. Attention-based deep multiple instance learning. InProceed- ings of the 35th International Conference on Machine Learn- ing, pages 2127–2136. PMLR, 2018. 1, 2

  11. [11]

    Categorical repa- rameterization with Gumbel-softmax

    Eric Jang, Shixiang Gu, and Ben Poole. Categorical repa- rameterization with Gumbel-softmax. InInternational Con- ference on Learning Representations, 2017. 3

  12. [12]

    Litjens, P ´eter B ´andi, Babak Ehteshami Bejnordi, Oscar G

    G. Litjens, P ´eter B ´andi, Babak Ehteshami Bejnordi, Oscar G. F. Geessink, M. Balkenhol, P. Bult, A. Halilovic, M. Hermsen, Rob van de Loo, R. V ogels, Quirine F. Manson, N. Stathonikos, A. Baidoshvili, Paul van Diest, C. Wauters, Marcory van Dijk, and Jeroen van der Laak. 1399 h&e- stained sentinel lymph node sections of breast cancer pa- tients: the c...

  13. [13]

    Lu, Drew F

    Ming Y . Lu, Drew F. K. Williamson, Tiffany Y . Liu, Richard J. Chen, Matteo Barbieri, and Faisal Mahmood. Data-efficient and weakly supervised computational pathol- ogy on whole-slide images.Nature Biomedical Engineering, 5(6):555–570, 2021. 1, 2

  14. [14]

    The concrete distribution: A continuous relaxation of discrete random variables

    Chris J Maddison, Andriy Mnih, and Yee Whye Teh. The concrete distribution: A continuous relaxation of discrete random variables. InInternational Conference on Learning Representations, 2017. 3

  15. [15]

    Learning to deceive with attention-based explanations

    Danish Pruthi, Mansi Gupta, Bhuwan Dhingra, Graham Neubig, and Zachary Chase Lipton. Learning to deceive with attention-based explanations. InProceedings of the 58th An- nual Meeting of the Association for Computational Linguis- tics, pages 4782–4793, Online, 2020. 1

  16. [16]

    Is attention interpretable? InProceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2931–2951, 2019

    Sofia Serrano and Noah A Smith. Is attention interpretable? InProceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2931–2951, 2019. 1, 2

  17. [17]

    Transmil: Transformer based correlated multiple instance learning for whole slide image classification

    Zhuchen Shao, Hao Bian, Yang Chen, Yifeng Wang, Jian Zhang, Xiangyang Ji, and Yongbing Zhang. Transmil: Transformer based correlated multiple instance learning for whole slide image classification. InAdvances in Neural In- formation Processing Systems, pages 2136–2147, 2021. 1, 2

  18. [18]

    The cancer genome atlas pan-cancer analysis project.Nature Genetics, 45(10):1113–1120, 2013

    John N Weinstein, Eric A Collisson, Gordon B Mills, Kenna R Shaw, Brad A Ozenberger, Kyle Ellrott, Ilya Shmulevich, Chris Sander, and Joshua M Stuart. The cancer genome atlas pan-cancer analysis project.Nature Genetics, 45(10):1113–1120, 2013. 2