pith · machine review for the scientific record

arxiv: 2604.23706 · v1 · submitted 2026-04-26 · 💻 cs.CV

Recognition: unknown

Weakly Supervised Multicenter Nancy Index Scoring in Ulcerative Colitis Using Foundation Models


Pith reviewed 2026-05-08 06:28 UTC · model grok-4.3

classification 💻 cs.CV
keywords: ulcerative colitis · Nancy histological index · weakly supervised learning · multiple instance learning · foundation models · computational pathology · whole slide images

The pith

A weakly supervised model using foundation models predicts the full five-grade Nancy histological index for ulcerative colitis biopsies from only slide- and case-level labels.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows how multiple instance learning can automate Nancy index scoring in ulcerative colitis without needing expensive region-by-region annotations. It trains on labels available at the slide or case level and uses foundation model encoders to process whole-slide H&E images from three separate hospitals. The method produces predictions for neutrophilic activity, low versus high groupings, and the complete five-grade scale. A reader would care because current manual grading is slow and variable, so reliable automation could support consistent assessment in trials and daily care across different medical centers.

Core claim

Weakly supervised multiple instance learning that leverages foundation model representations can learn to predict the complete five-grade Nancy histological index, along with related endpoints such as neutrophilic activity, when trained only on slide-level and case-level labels in a realistic multicenter cohort of colon biopsies.

What carries the argument

Weakly supervised multiple instance learning that aggregates patch embeddings from foundation model encoders into slide-level predictions, with ensembling to improve five-grade accuracy.
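
The aggregation step can be pictured concretely. The paper does not publish its exact aggregator here, but the cited attention-based MIL family (Ilse et al. [15]) pools a bag of patch embeddings into one slide-level vector via learned attention weights. A minimal NumPy sketch, with toy sizes and random parameters standing in for trained ones:

```python
import numpy as np

def attention_mil_pool(patch_embeddings, w_att, v_att):
    """Attention-based MIL pooling over one slide's bag of patches.
    Shapes and parameters are illustrative, not the paper's configuration."""
    # patch_embeddings: (n_patches, d) frozen foundation-model features
    hidden = np.tanh(patch_embeddings @ w_att)    # (n_patches, h)
    scores = hidden @ v_att                       # (n_patches,) raw attention
    weights = np.exp(scores - scores.max())
    weights = weights / weights.sum()             # softmax attention weights
    slide_embedding = weights @ patch_embeddings  # (d,) weighted patch average
    return slide_embedding, weights

rng = np.random.default_rng(0)
patches = rng.normal(size=(6, 8))  # 6 patches, 8-dim embeddings (toy sizes)
w = rng.normal(size=(8, 4))
v = rng.normal(size=4)
slide_vec, attn = attention_mil_pool(patches, w, v)
# slide_vec feeds a slide-level classification head; attn supports
# the attention-overlay interpretability shown in Figure 3
```

In practice the slide embedding would be passed to a small classifier head trained on the slide- or case-level NHI label, which is what makes the setup weakly supervised.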

If this is right

  • Histologic assessment of ulcerative colitis can proceed with far less annotation effort than region-level labeling requires.
  • The same pipeline yields both the full Nancy index and clinically useful groupings such as neutrophilic activity.
  • Performance holds across data from multiple hospitals, supporting deployment in varied clinical environments.
  • A simple ensembling step improves accuracy on the five-grade task relative to a hierarchical gating approach.
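
The last bullet contrasts two decision rules whose exact forms the summary does not specify. A hedged sketch of one plausible reading: the ensemble averages per-grade probabilities across models and takes the argmax, while the hierarchical gating baseline first makes a binary low/high decision and then routes to a subgroup head. Both functions and the threshold are assumptions for illustration:

```python
import numpy as np

def ensemble_predict(prob_list):
    """Assumed ensembling rule: average per-grade probabilities from
    several models, then argmax over the five NHI grades (0-4)."""
    mean_probs = np.mean(prob_list, axis=0)
    return int(np.argmax(mean_probs)), mean_probs

def hierarchical_gate(p_high, probs_low, probs_high, threshold=0.5):
    """Assumed hierarchical gating baseline: a binary low/high gate
    routes the slide to a low-grade (0-1) or high-grade (2-4) head."""
    if p_high >= threshold:
        return 2 + int(np.argmax(probs_high))  # grades 2..4
    return int(np.argmax(probs_low))           # grades 0..1

# Three hypothetical model outputs over grades 0..4
probs = [np.array([0.10, 0.20, 0.40, 0.20, 0.10]),
         np.array([0.05, 0.15, 0.50, 0.20, 0.10]),
         np.array([0.10, 0.10, 0.45, 0.25, 0.10])]
grade, _ = ensemble_predict(probs)  # grade 2
```

The structural difference matters: a gating error at the low/high split is unrecoverable downstream, whereas averaging lets confident models outvote an uncertain one, which is one reason a flat ensemble can beat the hierarchy on the five-grade task.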

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same weak-supervision pattern could be tested on other inflammatory bowel disease scoring systems or on biopsies from additional organs.
  • Larger or pathology-specific foundation models might further reduce the performance gap to fully supervised methods.
  • Integration into existing digital pathology systems could make standardized Nancy scoring available at the point of care.

Load-bearing premise

Slide-level and case-level NHI labels alone contain enough signal for the model to learn to predict fine-grained histological features such as neutrophilic activity.

What would settle it

Testing the trained model on an independent set of biopsies from a fourth hospital that uses different staining protocols or scanners and finding that five-grade NHI accuracy falls substantially below the reported multicenter level.

Figures

Figures reproduced from arXiv:2604.23706 by Adam Kukučka, Ondřej Fabián, Tomáš Brázdil, Vít Musil.

Figure 1
Figure 1: Overview of the proposed pipeline, illustrating preprocessing (tissue detection, quality control, tiling), inference with a frozen foundation-model encoder and MIL aggregation, and post-processing to obtain the final NHI grade.
Figure 2
Figure 2: Confusion matrices for five-grade NHI prediction: IKEM, TUH, and KNL (top row), the whole final test set across all centers (Overall), and reference results from PathAI, a prior multicenter AI system for UC histology [9]. The colorbar indicates normalized frequencies.
Figure 3
Figure 3: Qualitative interpretability for a representative region: attention-weight overlay (left) and instance-first classification overlay (right), visualized in xOpat [32].
original abstract

Histologic assessment of ulcerative colitis (UC) activity is an important endpoint in clinical trials and routine care, but manual grading with indices such as the Nancy histological index (NHI) is time-consuming and prone to observer variability. While computational pathology methods can automate scoring, many approaches depend on dense region-level annotations, which are costly to obtain, particularly in heterogeneous, multicenter cohorts. We propose a weakly supervised multiple instance learning (MIL) approach for whole-slide images that learns from case- and slide-level NHI labels, leveraging foundation models. Our method targets clinically relevant endpoints, including neutrophilic activity and derived Nancy-low/high groupings, enabling full five-grade NHI prediction. On a multicenter dataset of H&E-stained colon biopsies from three hospitals (2019-2025), we evaluate multiple foundation model encoders and aggregation strategies. We find that foundation model choice and resolution substantially affect performance, with Virchow2 providing the most consistent gains, and that a simple ensembling rule improves five-grade NHI prediction compared to a hierarchical gating baseline. Overall, our results demonstrate that weakly supervised MIL with modern foundation-model representations can provide robust, interpretable UC histology activity assessment in realistic multicenter settings.
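
The abstract's "Nancy-low/high groupings" are not defined here (the referee flags this below as well). A common convention in UC histology treats NHI grades 0-1, which lack an acute inflammatory infiltrate, as remission-like "low", and grades 2-4, which show neutrophilic activity or ulceration, as "high". A minimal sketch under that assumed convention:

```python
def nancy_low_high(grade: int) -> str:
    """Collapse the five-grade NHI (0-4) to a binary grouping.
    Assumed convention: grades 0-1 are 'low' (no acute inflammatory
    infiltrate), grades 2-4 are 'high' (neutrophilic activity or
    ulceration). The paper's exact split may differ."""
    if grade not in range(5):
        raise ValueError("NHI grade must be 0-4")
    return "low" if grade <= 1 else "high"

groupings = [nancy_low_high(g) for g in range(5)]
# ['low', 'low', 'high', 'high', 'high']
```

Such a derived binary endpoint is typically easier to learn than the full five-grade scale and maps onto the clinical question of histologic remission versus active disease.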

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes a weakly supervised multiple instance learning (MIL) approach for whole-slide H&E images of ulcerative colitis biopsies that trains on slide- and case-level Nancy histological index (NHI) labels, leverages foundation-model encoders, and targets full five-grade NHI prediction along with neutrophilic activity and derived low/high groupings. It evaluates multiple encoders and aggregation strategies on a three-center multicenter cohort (2019-2025) and reports that encoder choice (e.g., Virchow2) and a simple ensembling rule improve performance over a hierarchical gating baseline.

Significance. If the quantitative results and interpretability analyses hold, the work would be significant for computational pathology: it shows that modern foundation models combined with standard MIL can deliver clinically relevant UC histology scoring without dense region-level annotations, which is especially valuable in heterogeneous multicenter settings where annotation cost and observer variability are high.

major comments (2)
  1. [Abstract] Abstract: the claim that foundation-model choice and ensembling 'substantially affect performance' and 'improve five-grade NHI prediction' is load-bearing for the robustness conclusion, yet the abstract supplies no numerical metrics, error bars, data-split details, or baseline numbers, making independent verification of the multicenter claims impossible.
  2. [Results] Results (or equivalent evaluation section): the central claim that the MIL aggregator recovers the specific histological patterns (neutrophil density, distribution, intensity) defining each NHI grade rests on the assumption that slide- and case-level labels alone suffice; without reported evidence that attention/instance scores correlate with neutrophilic activity (rather than center-specific staining or scanner artifacts), the interpretability and cross-center robustness assertions cannot be assessed.
minor comments (1)
  1. [Abstract] Abstract: the phrase 'Nancy-low/high groupings' is used without a brief definition or reference to how the five-grade NHI is collapsed, which may reduce accessibility for readers outside the UC histology community.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment below and have made revisions to improve clarity and support for our claims.

point-by-point responses
  1. Referee: [Abstract] Abstract: the claim that foundation-model choice and ensembling 'substantially affect performance' and 'improve five-grade NHI prediction' is load-bearing for the robustness conclusion, yet the abstract supplies no numerical metrics, error bars, data-split details, or baseline numbers, making independent verification of the multicenter claims impossible.

    Authors: We agree that the abstract should provide sufficient quantitative detail to substantiate the claims. In the revised manuscript, we have updated the abstract to include key performance metrics (e.g., accuracy and macro-F1 for five-grade NHI prediction with Virchow2 and ensembling versus the hierarchical gating baseline), along with a brief description of the patient-level multicenter data splits and the use of multiple random seeds for error bars. These additions enable independent verification while preserving the abstract's conciseness. revision: yes

  2. Referee: [Results] Results (or equivalent evaluation section): the central claim that the MIL aggregator recovers the specific histological patterns (neutrophil density, distribution, intensity) defining each NHI grade rests on the assumption that slide- and case-level labels alone suffice; without reported evidence that attention/instance scores correlate with neutrophilic activity (rather than center-specific staining or scanner artifacts), the interpretability and cross-center robustness assertions cannot be assessed.

    Authors: This is a fair critique of the interpretability claims. The original manuscript provided qualitative attention visualizations but did not include quantitative correlation with neutrophilic features or explicit controls for center-specific artifacts. We have added a new paragraph in the Results section with per-center performance breakdowns to demonstrate robustness, plus expanded qualitative examples of attention maps overlaid on slides where high-attention regions align with pathologist-identified neutrophilic areas. A full quantitative correlation study would require dense pixel-level neutrophil annotations beyond the slide-level labels available in our dataset; we have therefore noted this as a limitation and outlined it as future work. revision: partial

Circularity Check

0 steps flagged

No circularity; standard MIL evaluation on external labels

full rationale

The paper describes a weakly supervised MIL pipeline that ingests pre-trained foundation-model embeddings of whole-slide images and aggregates them to predict slide- and case-level NHI grades. All reported metrics are computed against held-out human-assigned NHI labels from three independent centers; no equation, aggregation rule, or performance number is obtained by fitting a parameter to the target quantity and then re-labeling it a prediction. No self-citation is invoked to establish uniqueness or to forbid alternative architectures. The derivation chain therefore remains self-contained and externally falsifiable.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The approach rests on standard assumptions from computational pathology; no invented entities or heavily fitted parameters are described in the abstract.

free parameters (2)
  • Foundation model encoder selection
    Virchow2 and alternatives evaluated and chosen based on empirical performance on the target task.
  • MIL aggregation strategy
    Multiple strategies tested, with ensembling rule selected for five-grade prediction.
axioms (1)
  • domain assumption Multiple instance learning can learn accurate slide-level predictions from bag-level labels alone when using rich patch embeddings from foundation models.
    Core premise enabling the weakly supervised setup without dense annotations.

pith-pipeline@v0.9.0 · 5538 in / 1252 out tokens · 51077 ms · 2026-05-08T06:28:10.234352+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

32 extracted references · 28 canonical work pages

  1. Marchal-Bressenot, A., et al.: Development and validation of the Nancy histological index for UC. Gut 66(1), 43–49 (2017). https://doi.org/10.1136/gutjnl-2015-310187
  2. Geboes, K., et al.: A reproducible grading scale for histological assessment of inflammation in ulcerative colitis. Gut 47(3), 404–409 (2000). https://doi.org/10.1136/gut.47.3.404
  3. Puga-Tejada, M., et al.: Artificial intelligence–enabled histology exhibits comparable accuracy to pathologists in assessing histological remission in ulcerative colitis: a systematic review, meta-analysis, and meta-regression. Journal of Crohn's and Colitis 19(1), jjae198 (2025). https://doi.org/10.1093/ecco-jcc/jjae198
  4. Mosli, M.H., et al.: Development and validation of a histological index for UC. Gut 66(1), 50–58 (2017). https://doi.org/10.1136/gutjnl-2015-310393
  5. Ohara, J., et al.: Automated neutrophil quantification and histological score estimation in ulcerative colitis. Clinical Gastroenterology and Hepatology 23(5), 846–854.e7 (2025). https://doi.org/10.1016/j.cgh.2024.06.040
  6. Villanacci, V., et al.: OP15 A new simplified histology artificial intelligence system for accurate assessment of remission in ulcerative colitis. Journal of Crohn's and Colitis 16(Supplement 1), i015–i017 (2022). https://doi.org/10.1093/ecco-jcc/jjab232.014
  7. Iacucci, M., et al.: Artificial intelligence enabled histological prediction of remission or activity and clinical outcomes in ulcerative colitis. Gastroenterology 164(7), 1180–1188.e2 (2023). https://doi.org/10.1053/j.gastro.2023.02.031
  8. Furlanello, C., et al.: The development of artificial intelligence in the histological diagnosis of inflammatory bowel disease (IBD-AI). Digestive and Liver Disease 57(1), 184–189 (2025). https://doi.org/10.1016/j.dld.2024.05.033
  9. Najdawi, F., et al.: Artificial intelligence enables quantitative assessment of ulcerative colitis histology. Modern Pathology 36(6) (2023). https://doi.org/10.1016/j.modpat.2023.100124
  10. Rubin, D.T., et al.: Deployment of an artificial intelligence histology tool to aid qualitative assessment of histopathology using the Nancy histopathology index in ulcerative colitis. Inflammatory Bowel Diseases 31(6), 1630–1636 (2024). https://doi.org/10.1093/ibd/izae204
  11. Peyrin-Biroulet, L., et al.: An artificial intelligence-driven scoring system to measure histological disease activity in ulcerative colitis. United European Gastroenterology Journal 12(8), 1028–1033 (2024). https://doi.org/10.1002/ueg2.12562
  12. Vande Casteele, N., et al.: Utilizing deep learning to analyze whole slide images of colonic biopsies for associations between eosinophil density and clinicopathologic features in active ulcerative colitis. Inflammatory Bowel Diseases 28(4), 539–546 (2021). https://doi.org/10.1093/ibd/izab122
  13. Bossuyt, P., et al.: Automatic, computer-aided determination of endoscopic and histological inflammation in patients with mild to moderate ulcerative colitis based on red density. Gut 69(10), 1778–1786 (2020). https://doi.org/10.1136/gutjnl-2019-320056
  14. Mokhtari, R., et al.: Interpretable histopathology-based prediction of disease relevant features in inflammatory bowel disease biopsies using weakly-supervised deep learning (2023). https://arxiv.org/abs/2303.12095
  15. Ilse, M., Tomczak, J.M., Welling, M.: Attention-based deep multiple instance learning (2018). https://arxiv.org/abs/1802.04712
  16. Campanella, G., et al.: Clinical-grade computational pathology using weakly supervised deep learning on whole slide images. Nature Medicine 25(8), 1301–1309 (2019). https://doi.org/10.1038/s41591-019-0508-1
  17. Shao, Z., et al.: TransMIL: Transformer based correlated multiple instance learning for whole slide image classification (2021). https://arxiv.org/abs/2106.00908
  18. Li, B., Li, Y., Eliceiri, K.W.: Dual-stream multiple instance learning network for whole slide image classification with self-supervised contrastive learning (2021). https://arxiv.org/abs/2011.08939
  19. Zhou, Y., et al.: Iterative multiple instance learning for weakly annotated whole slide image classification. Physics in Medicine & Biology 68(15), 155007 (2023). https://doi.org/10.1088/1361-6560/acde3f
  20. Seenivasan, L., et al.: Global-reasoned multi-task learning model for surgical scene understanding. IEEE Robotics and Automation Letters 7(2), 3858–3865 (2022). https://doi.org/10.1109/LRA.2022.3146544
  21. Tavolara, T.E., Gurcan, M.N., Niazi, M.K.K.: Contrastive multiple instance learning: An unsupervised framework for learning slide-level representations of whole slide histopathology images without labels. Cancers 14(23) (2022). https://doi.org/10.3390/cancers14235778
  22. Mammadov, A., et al.: Self-supervision enhances instance-based multiple instance learning methods in digital pathology: A benchmark study (2025). https://arxiv.org/abs/2505.01109
  23. Zhang, Y., et al.: From patches to WSIs: A systematic review of deep multiple instance learning in computational pathology. Information Fusion 119, 103027 (2025). https://doi.org/10.1016/j.inffus.2025.103027
  24. Saeed, A., Ismail, M.A., Ghanem, N.M.: Colorectal cancer classification using weakly annotated whole slide images: Multiple instance learning optimization study. Computers in Biology and Medicine 186, 109649 (2025). https://doi.org/10.1016/j.compbiomed.2024.109649
  25. Bilal, M., et al.: Foundation models in computational pathology: A review of challenges, opportunities, and impact (2025). https://arxiv.org/abs/2502.08333
  26. Khan, W., et al.: A comprehensive survey of foundation models in medicine (2025). https://arxiv.org/abs/2406.10729
  27. Xu, H., et al.: A whole-slide foundation model for digital pathology from real-world data. Nature 630(8015), 181–188 (2024). https://doi.org/10.1038/s41586-024-07441-w
  28. Chen, R.J., et al.: Towards a general-purpose foundation model for computational pathology. Nature Medicine 30(3), 850–862 (2024). https://doi.org/10.1038/s41591-024-02857-3
  29. Vorontsov, E., et al.: A foundation model for clinical-grade computational pathology and rare cancers detection. Nature Medicine 30(10), 2924–2935 (2024). https://doi.org/10.1038/s41591-024-03141-0
  30. Zimmermann, E., et al.: Virchow2: Scaling self-supervised mixed magnification models in pathology (2024). https://arxiv.org/abs/2408.00738
  31. Bilal, M., et al.: Benchmarking pathology foundation models for predicting microsatellite instability in colorectal cancer histopathology. Computerized Medical Imaging and Graphics 127, 102680 (2026). https://doi.org/10.1016/j.compmedimag.2025.102680
  32. Horák, J., et al.: xOpat: eXplainable Open Pathology Analysis Tool. Computer Graphics Forum 42(3), 63–73 (2023). https://doi.org/10.1111/cgf.14812