pith · machine review for the scientific record

arxiv: 2604.23706 · v1 · submitted 2026-04-26 · 💻 cs.CV

Recognition: unknown

Weakly Supervised Multicenter Nancy Index Scoring in Ulcerative Colitis Using Foundation Models


Pith reviewed 2026-05-08 06:28 UTC · model grok-4.3

classification 💻 cs.CV
keywords: ulcerative colitis · Nancy histological index · weakly supervised learning · multiple instance learning · foundation models · computational pathology · whole slide images

The pith

A weakly supervised model using foundation models predicts the full five-grade Nancy histological index for ulcerative colitis biopsies from only slide- and case-level labels.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows how multiple instance learning can automate Nancy index scoring in ulcerative colitis without needing expensive region-by-region annotations. It trains on labels available at the slide or case level and uses foundation model encoders to process whole-slide H&E images from three separate hospitals. The method produces predictions for neutrophilic activity, low versus high groupings, and the complete five-grade scale. A reader would care because current manual grading is slow and variable, so reliable automation could support consistent assessment in trials and daily care across different medical centers.

Core claim

Weakly supervised multiple instance learning that leverages foundation model representations can learn to predict the complete five-grade Nancy histological index, along with related endpoints such as neutrophilic activity, when trained only on slide-level and case-level labels in a realistic multicenter cohort of colon biopsies.

What carries the argument

Weakly supervised multiple instance learning that aggregates patch embeddings from foundation model encoders into slide-level predictions, with ensembling to improve five-grade accuracy.
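
The aggregation step can be pictured concretely. The paper does not publish its exact aggregator here, but the cited attention-based MIL family (Ilse et al. [15]) pools a bag of patch embeddings into one slide-level vector via learned attention weights. A minimal NumPy sketch, with toy sizes and random parameters standing in for trained ones:

```python
import numpy as np

def attention_mil_pool(patch_embeddings, w_att, v_att):
    """Attention-based MIL pooling over one slide's bag of patches.
    Shapes and parameters are illustrative, not the paper's configuration."""
    # patch_embeddings: (n_patches, d) frozen foundation-model features
    hidden = np.tanh(patch_embeddings @ w_att)    # (n_patches, h)
    scores = hidden @ v_att                       # (n_patches,) raw attention
    weights = np.exp(scores - scores.max())
    weights = weights / weights.sum()             # softmax attention weights
    slide_embedding = weights @ patch_embeddings  # (d,) weighted patch average
    return slide_embedding, weights

rng = np.random.default_rng(0)
patches = rng.normal(size=(6, 8))  # 6 patches, 8-dim embeddings (toy sizes)
w = rng.normal(size=(8, 4))
v = rng.normal(size=4)
slide_vec, attn = attention_mil_pool(patches, w, v)
# slide_vec feeds a slide-level classification head; attn supports
# the attention-overlay interpretability shown in Figure 3
```

In practice the slide embedding would be passed to a small classifier head trained on the slide- or case-level NHI label, which is what makes the setup weakly supervised.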

If this is right

  • Histologic assessment of ulcerative colitis can proceed with far less annotation effort than region-level labeling requires.
  • The same pipeline yields both the full Nancy index and clinically useful groupings such as neutrophilic activity.
  • Performance holds across data from multiple hospitals, supporting deployment in varied clinical environments.
  • A simple ensembling step improves accuracy on the five-grade task relative to a hierarchical gating approach.
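
The last bullet contrasts two decision rules whose exact forms the summary does not specify. A hedged sketch of one plausible reading: the ensemble averages per-grade probabilities across models and takes the argmax, while the hierarchical gating baseline first makes a binary low/high decision and then routes to a subgroup head. Both functions and the threshold are assumptions for illustration:

```python
import numpy as np

def ensemble_predict(prob_list):
    """Assumed ensembling rule: average per-grade probabilities from
    several models, then argmax over the five NHI grades (0-4)."""
    mean_probs = np.mean(prob_list, axis=0)
    return int(np.argmax(mean_probs)), mean_probs

def hierarchical_gate(p_high, probs_low, probs_high, threshold=0.5):
    """Assumed hierarchical gating baseline: a binary low/high gate
    routes the slide to a low-grade (0-1) or high-grade (2-4) head."""
    if p_high >= threshold:
        return 2 + int(np.argmax(probs_high))  # grades 2..4
    return int(np.argmax(probs_low))           # grades 0..1

# Three hypothetical model outputs over grades 0..4
probs = [np.array([0.10, 0.20, 0.40, 0.20, 0.10]),
         np.array([0.05, 0.15, 0.50, 0.20, 0.10]),
         np.array([0.10, 0.10, 0.45, 0.25, 0.10])]
grade, _ = ensemble_predict(probs)  # grade 2
```

The structural difference matters: a gating error at the low/high split is unrecoverable downstream, whereas averaging lets confident models outvote an uncertain one, which is one reason a flat ensemble can beat the hierarchy on the five-grade task.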

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same weak-supervision pattern could be tested on other inflammatory bowel disease scoring systems or on biopsies from additional organs.
  • Larger or pathology-specific foundation models might further reduce the performance gap to fully supervised methods.
  • Integration into existing digital pathology systems could make standardized Nancy scoring available at the point of care.

Load-bearing premise

Slide-level and case-level NHI labels alone contain enough signal for the model to learn to predict fine-grained histological features such as neutrophilic activity.

What would settle it

Testing the trained model on an independent set of biopsies from a fourth hospital that uses different staining protocols or scanners and finding that five-grade NHI accuracy falls substantially below the reported multicenter level.

Figures

Figures reproduced from arXiv:2604.23706 by Adam Kukučka, Ondřej Fabián, Tomáš Brázdil, Vít Musil.

Figure 1
Figure 1: Overview of the proposed pipeline, illustrating preprocessing (tissue detection, quality control, tiling), inference with a frozen foundation-model encoder and MIL aggregation, and post-processing to obtain the final NHI grade.
Figure 2
Figure 2: Confusion matrices for five-grade NHI prediction: IKEM, TUH, and KNL (top row), the whole final test set across all centers (Overall), and reference results from PathAI, a prior multicenter AI system for UC histology [9]. The colorbar indicates normalized frequencies.
Figure 3
Figure 3: Qualitative interpretability for a representative region: attention-weight overlay (left) and instance-first classification overlay (right), visualized in xOpat [32].
original abstract

Histologic assessment of ulcerative colitis (UC) activity is an important endpoint in clinical trials and routine care, but manual grading with indices such as the Nancy histological index (NHI) is time-consuming and prone to observer variability. While computational pathology methods can automate scoring, many approaches depend on dense region-level annotations, which are costly to obtain, particularly in heterogeneous, multicenter cohorts. We propose a weakly supervised multiple instance learning (MIL) approach for whole-slide images that learns from case- and slide-level NHI labels, leveraging foundation models. Our method targets clinically relevant endpoints, including neutrophilic activity and derived Nancy-low/high groupings, enabling full five-grade NHI prediction. On a multicenter dataset of H&E-stained colon biopsies from three hospitals (2019-2025), we evaluate multiple foundation model encoders and aggregation strategies. We find that foundation model choice and resolution substantially affect performance, with Virchow2 providing the most consistent gains, and that a simple ensembling rule improves five-grade NHI prediction compared to a hierarchical gating baseline. Overall, our results demonstrate that weakly supervised MIL with modern foundation-model representations can provide robust, interpretable UC histology activity assessment in realistic multicenter settings.
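
The abstract's "Nancy-low/high groupings" are not defined here (the referee flags this below as well). A common convention in UC histology treats NHI grades 0-1, which lack an acute inflammatory infiltrate, as remission-like "low", and grades 2-4, which show neutrophilic activity or ulceration, as "high". A minimal sketch under that assumed convention:

```python
def nancy_low_high(grade: int) -> str:
    """Collapse the five-grade NHI (0-4) to a binary grouping.
    Assumed convention: grades 0-1 are 'low' (no acute inflammatory
    infiltrate), grades 2-4 are 'high' (neutrophilic activity or
    ulceration). The paper's exact split may differ."""
    if grade not in range(5):
        raise ValueError("NHI grade must be 0-4")
    return "low" if grade <= 1 else "high"

groupings = [nancy_low_high(g) for g in range(5)]
# ['low', 'low', 'high', 'high', 'high']
```

Such a derived binary endpoint is typically easier to learn than the full five-grade scale and maps onto the clinical question of histologic remission versus active disease.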

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes a weakly supervised multiple instance learning (MIL) approach for whole-slide H&E images of ulcerative colitis biopsies that trains on slide- and case-level Nancy histological index (NHI) labels, leverages foundation-model encoders, and targets full five-grade NHI prediction along with neutrophilic activity and derived low/high groupings. It evaluates multiple encoders and aggregation strategies on a three-center multicenter cohort (2019-2025) and reports that encoder choice (e.g., Virchow2) and a simple ensembling rule improve performance over a hierarchical gating baseline.

Significance. If the quantitative results and interpretability analyses hold, the work would be significant for computational pathology: it shows that modern foundation models combined with standard MIL can deliver clinically relevant UC histology scoring without dense region-level annotations, which is especially valuable in heterogeneous multicenter settings where annotation cost and observer variability are high.

major comments (2)
  1. [Abstract] Abstract: the claim that foundation-model choice and ensembling 'substantially affect performance' and 'improve five-grade NHI prediction' is load-bearing for the robustness conclusion, yet the abstract supplies no numerical metrics, error bars, data-split details, or baseline numbers, making independent verification of the multicenter claims impossible.
  2. [Results] Results (or equivalent evaluation section): the central claim that the MIL aggregator recovers the specific histological patterns (neutrophil density, distribution, intensity) defining each NHI grade rests on the assumption that slide- and case-level labels alone suffice; without reported evidence that attention/instance scores correlate with neutrophilic activity (rather than center-specific staining or scanner artifacts), the interpretability and cross-center robustness assertions cannot be assessed.
minor comments (1)
  1. [Abstract] Abstract: the phrase 'Nancy-low/high groupings' is used without a brief definition or reference to how the five-grade NHI is collapsed, which may reduce accessibility for readers outside the UC histology community.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment below and have made revisions to improve clarity and support for our claims.

point-by-point responses
  1. Referee: [Abstract] Abstract: the claim that foundation-model choice and ensembling 'substantially affect performance' and 'improve five-grade NHI prediction' is load-bearing for the robustness conclusion, yet the abstract supplies no numerical metrics, error bars, data-split details, or baseline numbers, making independent verification of the multicenter claims impossible.

    Authors: We agree that the abstract should provide sufficient quantitative detail to substantiate the claims. In the revised manuscript, we have updated the abstract to include key performance metrics (e.g., accuracy and macro-F1 for five-grade NHI prediction with Virchow2 and ensembling versus the hierarchical gating baseline), along with a brief description of the patient-level multicenter data splits and the use of multiple random seeds for error bars. These additions enable independent verification while preserving the abstract's conciseness. revision: yes

  2. Referee: [Results] Results (or equivalent evaluation section): the central claim that the MIL aggregator recovers the specific histological patterns (neutrophil density, distribution, intensity) defining each NHI grade rests on the assumption that slide- and case-level labels alone suffice; without reported evidence that attention/instance scores correlate with neutrophilic activity (rather than center-specific staining or scanner artifacts), the interpretability and cross-center robustness assertions cannot be assessed.

    Authors: This is a fair critique of the interpretability claims. The original manuscript provided qualitative attention visualizations but did not include quantitative correlation with neutrophilic features or explicit controls for center-specific artifacts. We have added a new paragraph in the Results section with per-center performance breakdowns to demonstrate robustness, plus expanded qualitative examples of attention maps overlaid on slides where high-attention regions align with pathologist-identified neutrophilic areas. A full quantitative correlation study would require dense pixel-level neutrophil annotations beyond the slide-level labels available in our dataset; we have therefore noted this as a limitation and outlined it as future work. revision: partial

Circularity Check

0 steps flagged

No circularity; standard MIL evaluation on external labels

full rationale

The paper describes a weakly supervised MIL pipeline that ingests pre-trained foundation-model embeddings of whole-slide images and aggregates them to predict slide- and case-level NHI grades. All reported metrics are computed against held-out human-assigned NHI labels from three independent centers; no equation, aggregation rule, or performance number is obtained by fitting a parameter to the target quantity and then re-labeling it a prediction. No self-citation is invoked to establish uniqueness or to forbid alternative architectures. The derivation chain therefore remains self-contained and externally falsifiable.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The approach rests on standard assumptions from computational pathology; no invented entities or heavily fitted parameters are described in the abstract.

free parameters (2)
  • Foundation model encoder selection
    Virchow2 and alternatives evaluated and chosen based on empirical performance on the target task.
  • MIL aggregation strategy
    Multiple strategies tested, with ensembling rule selected for five-grade prediction.
axioms (1)
  • domain assumption Multiple instance learning can learn accurate slide-level predictions from bag-level labels alone when using rich patch embeddings from foundation models.
    Core premise enabling the weakly supervised setup without dense annotations.

pith-pipeline@v0.9.0 · 5538 in / 1252 out tokens · 51077 ms · 2026-05-08T06:28:10.234352+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

32 extracted references · 28 canonical work pages

  1. Marchal-Bressenot, A., et al.: Development and validation of the Nancy histological index for UC. Gut 66(1), 43–49 (2017). https://doi.org/10.1136/gutjnl-2015-310187
  2. Geboes, K., et al.: A reproducible grading scale for histological assessment of inflammation in ulcerative colitis. Gut 47(3), 404–409 (2000). https://doi.org/10.1136/gut.47.3.404
  3. Puga-Tejada, M., et al.: Artificial intelligence–enabled histology exhibits comparable accuracy to pathologists in assessing histological remission in ulcerative colitis: a systematic review, meta-analysis, and meta-regression. Journal of Crohn's and Colitis 19(1), jjae198 (2025). https://doi.org/10.1093/ecco-jcc/jjae198
  4. Mosli, M.H., et al.: Development and validation of a histological index for UC. Gut 66(1), 50–58 (2017). https://doi.org/10.1136/gutjnl-2015-310393
  5. Ohara, J., et al.: Automated neutrophil quantification and histological score estimation in ulcerative colitis. Clinical Gastroenterology and Hepatology 23(5), 846–854.e7 (2025). https://doi.org/10.1016/j.cgh.2024.06.040
  6. Villanacci, V., et al.: OP15 A new simplified histology artificial intelligence system for accurate assessment of remission in ulcerative colitis. Journal of Crohn's and Colitis 16(Supplement 1), i015–i017 (2022). https://doi.org/10.1093/ecco-jcc/jjab232.014
  7. Iacucci, M., et al.: Artificial intelligence enabled histological prediction of remission or activity and clinical outcomes in ulcerative colitis. Gastroenterology 164(7), 1180–1188.e2 (2023). https://doi.org/10.1053/j.gastro.2023.02.031
  8. Furlanello, C., et al.: The development of artificial intelligence in the histological diagnosis of inflammatory bowel disease (IBD-AI). Digestive and Liver Disease 57(1), 184–189 (2025). https://doi.org/10.1016/j.dld.2024.05.033
  9. Najdawi, F., et al.: Artificial intelligence enables quantitative assessment of ulcerative colitis histology. Modern Pathology 36(6) (2023). https://doi.org/10.1016/j.modpat.2023.100124
  10. Rubin, D.T., et al.: Deployment of an artificial intelligence histology tool to aid qualitative assessment of histopathology using the Nancy histopathology index in ulcerative colitis. Inflammatory Bowel Diseases 31(6), 1630–1636 (2024). https://doi.org/10.1093/ibd/izae204
  11. Peyrin-Biroulet, L., et al.: An artificial intelligence-driven scoring system to measure histological disease activity in ulcerative colitis. United European Gastroenterology Journal 12(8), 1028–1033 (2024). https://doi.org/10.1002/ueg2.12562
  12. Vande Casteele, N., et al.: Utilizing deep learning to analyze whole slide images of colonic biopsies for associations between eosinophil density and clinicopathologic features in active ulcerative colitis. Inflammatory Bowel Diseases 28(4), 539–546 (2021). https://doi.org/10.1093/ibd/izab122
  13. Bossuyt, P., et al.: Automatic, computer-aided determination of endoscopic and histological inflammation in patients with mild to moderate ulcerative colitis based on red density. Gut 69(10), 1778–1786 (2020). https://doi.org/10.1136/gutjnl-2019-320056
  14. Mokhtari, R., et al.: Interpretable histopathology-based prediction of disease relevant features in inflammatory bowel disease biopsies using weakly-supervised deep learning (2023). https://arxiv.org/abs/2303.12095
  15. Ilse, M., Tomczak, J.M., Welling, M.: Attention-based deep multiple instance learning (2018). https://arxiv.org/abs/1802.04712
  16. Campanella, G., et al.: Clinical-grade computational pathology using weakly supervised deep learning on whole slide images. Nature Medicine 25(8), 1301–1309 (2019). https://doi.org/10.1038/s41591-019-0508-1
  17. Shao, Z., et al.: TransMIL: Transformer based correlated multiple instance learning for whole slide image classification (2021). https://arxiv.org/abs/2106.00908
  18. Li, B., Li, Y., Eliceiri, K.W.: Dual-stream multiple instance learning network for whole slide image classification with self-supervised contrastive learning (2021). https://arxiv.org/abs/2011.08939
  19. Zhou, Y., et al.: Iterative multiple instance learning for weakly annotated whole slide image classification. Physics in Medicine & Biology 68(15), 155007 (2023). https://doi.org/10.1088/1361-6560/acde3f
  20. Seenivasan, L., et al.: Global-reasoned multi-task learning model for surgical scene understanding. IEEE Robotics and Automation Letters 7(2), 3858–3865 (2022). https://doi.org/10.1109/LRA.2022.3146544
  21. Tavolara, T.E., Gurcan, M.N., Niazi, M.K.K.: Contrastive multiple instance learning: An unsupervised framework for learning slide-level representations of whole slide histopathology images without labels. Cancers 14(23) (2022). https://doi.org/10.3390/cancers14235778
  22. Mammadov, A., et al.: Self-supervision enhances instance-based multiple instance learning methods in digital pathology: A benchmark study (2025). https://arxiv.org/abs/2505.01109
  23. Zhang, Y., et al.: From patches to WSIs: A systematic review of deep multiple instance learning in computational pathology. Information Fusion 119, 103027 (2025). https://doi.org/10.1016/j.inffus.2025.103027
  24. Saeed, A., Ismail, M.A., Ghanem, N.M.: Colorectal cancer classification using weakly annotated whole slide images: Multiple instance learning optimization study. Computers in Biology and Medicine 186, 109649 (2025). https://doi.org/10.1016/j.compbiomed.2024.109649
  25. Bilal, M., et al.: Foundation models in computational pathology: A review of challenges, opportunities, and impact (2025). https://arxiv.org/abs/2502.08333
  26. Khan, W., et al.: A comprehensive survey of foundation models in medicine (2025). https://arxiv.org/abs/2406.10729
  27. Xu, H., et al.: A whole-slide foundation model for digital pathology from real-world data. Nature 630(8015), 181–188 (2024). https://doi.org/10.1038/s41586-024-07441-w
  28. Chen, R.J., et al.: Towards a general-purpose foundation model for computational pathology. Nature Medicine 30(3), 850–862 (2024). https://doi.org/10.1038/s41591-024-02857-3
  29. Vorontsov, E., et al.: A foundation model for clinical-grade computational pathology and rare cancers detection. Nature Medicine 30(10), 2924–2935 (2024). https://doi.org/10.1038/s41591-024-03141-0
  30. Zimmermann, E., et al.: Virchow2: Scaling self-supervised mixed magnification models in pathology (2024). https://arxiv.org/abs/2408.00738
  31. Bilal, M., et al.: Benchmarking pathology foundation models for predicting microsatellite instability in colorectal cancer histopathology. Computerized Medical Imaging and Graphics 127, 102680 (2026). https://doi.org/10.1016/j.compmedimag.2025.102680
  32. Horák, J., et al.: xOpat: eXplainable Open Pathology Analysis Tool. Computer Graphics Forum 42(3), 63–73 (2023). https://doi.org/10.1111/cgf.14812