
arXiv: 2604.07298 · v1 · submitted 2026-04-08 · 💻 cs.CV · cs.AI · eess.IV

Region-Graph Optimal Transport Routing for Mixture-of-Experts Whole-Slide Image Classification

Pith reviewed 2026-05-10 18:14 UTC · model grok-4.3

keywords whole-slide image classification · mixture of experts · optimal transport · multiple instance learning · computational pathology · graph regularization · entropic optimal transport

The pith

ROAM uses capacity-constrained optimal transport and graph regularisation to balance expert routing for whole-slide image classification.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Standard multiple-instance learning routes every patch through the same pathway, while unconstrained mixture-of-experts models can collapse to one or two dominant experts. ROAM first compresses patches into spatial region tokens that respect tissue neighbourhoods, then solves region-to-expert assignment with entropic optimal transport under explicit per-slide capacity marginals. Sinkhorn iterations are further regularised by diffusion across the region graph so that neighbouring regions tend to select the same experts. Balanced utilisation is thus enforced by construction rather than by auxiliary losses. The method is evaluated on four WSI benchmarks with frozen foundation-model patch embeddings and reports performance competitive with strong MIL and MoE baselines, plus an external AUC of 0.845 ± 0.019 on TCGA-CPTAC NSCLC slides.

Core claim

ROAM formulates region-to-expert assignment as entropic optimal transport with explicit per-slide capacity marginals solved by Sinkhorn iterations, with additional graph regularisation over the spatial region graph, thereby enforcing balanced expert utilisation by construction while aligning routing with local tissue neighbourhoods.
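
As a point of reference, the standard entropic optimal transport problem behind this claim can be written as follows. The notation is an assumption made here for illustration (cost matrix C over M regions and E experts, region masses a, expert capacities b, regularisation strength ε); the paper's own symbols may differ.

```latex
\begin{aligned}
P^\star &= \arg\min_{P \in U(a,b)} \; \langle P, C \rangle - \varepsilon H(P),
\qquad H(P) = -\sum_{m=1}^{M}\sum_{e=1}^{E} P_{me}\,(\log P_{me} - 1), \\
U(a,b) &= \bigl\{ P \in \mathbb{R}_{\ge 0}^{M \times E} : P\,\mathbf{1}_E = a,\;\; P^{\top}\mathbf{1}_M = b \bigr\}, \\
P^\star &= \operatorname{diag}(u)\, K \operatorname{diag}(v), \qquad K = \exp(-C/\varepsilon), \\
u &\leftarrow a \oslash (K v), \qquad v \leftarrow b \oslash (K^{\top} u).
\end{aligned}
```

The column constraint fixes the total mass each expert receives, which is the sense in which balance holds "by construction" for any feasible plan.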

What carries the argument

Graph-regularised Sinkhorn iterations for capacity-constrained entropic optimal transport on spatial region tokens.
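
To make the capacity-constrained part concrete, here is a minimal NumPy sketch of plain Sinkhorn scaling with uniform expert capacities, compared against per-region softmax routing on the same costs. It is an illustration under assumptions, not the authors' implementation: the function name `sinkhorn`, the random costs, the uniform marginals and the value of `eps` are all chosen here for the example, and the graph term is left out.

```python
import numpy as np

def sinkhorn(cost, a, b, eps=0.5, n_iters=200):
    """Entropic OT by Sinkhorn scaling. cost: (M, E); a: (M,) row masses; b: (E,) capacities."""
    K = np.exp(-cost / eps)          # Gibbs kernel
    u = np.ones_like(a)
    v = np.ones_like(b)
    for _ in range(n_iters):
        u = a / (K @ v)              # match row marginals (region masses)
        v = b / (K.T @ u)            # match column marginals (expert capacities)
    return u[:, None] * K * v[None, :]

# Toy setup (all values are illustrative assumptions).
rng = np.random.default_rng(0)
M, E = 64, 4
cost = rng.random((M, E))
cost[:, 0] -= 0.5                    # make expert 0 systematically cheaper
a = np.full(M, 1.0 / M)              # each region carries equal mass
b = np.full(E, 1.0 / E)              # each expert gets equal capacity

plan = sinkhorn(cost, a, b)

# Unconstrained alternative: softmax over experts for each region.
soft = np.exp(-cost / 0.5)
soft /= soft.sum(axis=1, keepdims=True)
soft_usage = soft.mean(axis=0)       # share of routing mass per expert

print("OT expert usage:     ", plan.sum(axis=0))
print("softmax expert usage:", soft_usage)
```

With these toy costs the softmax router favours the cheap expert, whereas the transport plan's column sums equal the prescribed capacities regardless of the costs.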

If this is right

  • Balanced expert utilisation occurs without any auxiliary load-balancing losses.
  • Routing decisions respect local tissue neighbourhoods through spatial binning and graph diffusion.
  • Performance remains competitive with strong MIL and MoE baselines across four WSI benchmarks.
  • External generalisation reaches 0.845 AUC on TCGA-CPTAC NSCLC data.
  • Region-token compression reduces the number of routing decisions while retaining neighbourhood structure.
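
The region-token compression in the last point can be sketched as grid binning of patch coordinates followed by pooling. The sketch below assumes square bins and mean pooling; the paper only states that patches are pooled into spatially binned region tokens, so `region_tokens`, `bin_size` and the choice of the mean are hypothetical.

```python
import numpy as np

def region_tokens(coords, feats, bin_size):
    """Mean-pool patch embeddings that fall into the same square spatial bin.

    coords: (N, 2) integer patch coordinates; feats: (N, d) patch embeddings.
    Returns the occupied grid cells (R, 2) and one pooled token per cell (R, d).
    """
    bins = np.floor_divide(coords, bin_size)                 # grid cell of each patch
    cells, inverse = np.unique(bins, axis=0, return_inverse=True)
    inverse = inverse.reshape(-1)                            # flatten for any NumPy version
    sums = np.zeros((len(cells), feats.shape[1]))
    np.add.at(sums, inverse, feats)                          # accumulate features per cell
    counts = np.bincount(inverse, minlength=len(cells))
    return cells, sums / counts[:, None]

# Four patches, two of which share each 256-pixel bin (toy values).
coords = np.array([[0, 0], [10, 10], [300, 0], [310, 20]])
feats = np.array([[1.0, 0.0], [3.0, 0.0], [0.0, 2.0], [0.0, 4.0]])
cells, tokens = region_tokens(coords, feats, bin_size=256)
print(cells)
print(tokens)
```

Four patches become two routing decisions here; on a real slide the reduction depends on the bin size and tissue coverage.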

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same optimal-transport routing could be tested on other spatially structured MIL tasks such as remote-sensing scene classification.
  • Graph diffusion may implicitly encode larger-scale pathological patterns that single-region decisions miss.
  • Entropic optimal transport offers a differentiable alternative to softmax routing that may scale to larger expert pools.

Load-bearing premise

Compressing patches into spatially binned region tokens plus graph regularisation preserves enough instance-level information to justify the added routing machinery and to improve downstream classification.

What would settle it

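An ablation on the same four WSI benchmarks and the external NSCLC set. If removing the capacity marginals or the graph-regularisation term produces markedly imbalanced expert utilisation or lower AUC, the claim is supported; if neither changes, the added routing machinery is not doing the work attributed to it.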

Figures

Figures reproduced from arXiv: 2604.07298 by Bart Wanders, Colleen Knoth, Ephraim Tsalik, Jiuliu Lu, Julian Knight, Xin Tian.

Figure 1: Frozen patch embeddings are pooled into M spatial region tokens. A routing GNN on the region graph parameterises region-to-expert costs, and Sinkhorn optimal transport routes region mass to E experts under per-slide capacity constraints (with optional graph smoothing). Each expert performs gated-attention pooling over its routed regions, and expert outputs are fused for slide-level prediction.
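
The per-expert pooling step named in this caption can be illustrated with the usual gated-attention form (tanh branch times sigmoid gate, then a softmax over instances). The sketch below is an assumption-laden illustration: the parameter names `V`, `U`, `w` and the way routing mass enters (as a log-prior on the attention logits) are choices made here, since the caption does not say how routed mass and attention are combined.

```python
import numpy as np

def gated_attention_pool(H, V, U, w, mass=None):
    """Gated-attention pooling over region tokens.

    H: (M, d) region tokens; V, U: (d, k) projections; w: (k,) scoring vector.
    mass: optional (M,) routing mass for this expert (hypothetical log-prior on attention).
    Returns the pooled vector (d,) and attention weights (M,).
    """
    gate = np.tanh(H @ V) * (1.0 / (1.0 + np.exp(-(H @ U))))   # (M, k) gated features
    logits = gate @ w                                           # (M,) attention scores
    if mass is not None:
        logits = logits + np.log(mass + 1e-12)                  # down-weight unrouted regions
    att = np.exp(logits - logits.max())                         # stable softmax
    att /= att.sum()
    return att @ H, att

rng = np.random.default_rng(0)
H = rng.normal(size=(5, 8))
V = rng.normal(size=(8, 4))
U = rng.normal(size=(8, 4))
w = rng.normal(size=4)

pooled, att = gated_attention_pool(H, V, U, w)
mass = np.array([0.0, 0.25, 0.25, 0.25, 0.25])                  # region 0 not routed here
pooled_routed, att_routed = gated_attention_pool(H, V, U, w, mass)
print(att.round(3), att_routed.round(3))
```

A region with zero routed mass receives effectively no attention from that expert, which is one simple way to make each expert pool only over its own territory.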
Figure 2: Qualitative routing visualisation on CPTAC (NSCLC). Two CPTAC external-test slides (top: LUAD; bottom: LUSC); numbers denote P(LUSC). Left: ROAM dominant expert per region (territories). Middle/Right: ABMIL and CLAM-SB attention heatmaps (normalised for visualisation).
Original abstract

Multiple Instance Learning (MIL) is the dominant framework for gigapixel whole-slide image (WSI) classification in computational pathology. However, current MIL aggregators route all instances through a shared pathway, constraining their capacity to specialise across the pathological heterogeneity inherent in each slide. Mixture-of-Experts (MoE) methods offer a natural remedy by partitioning instances across specialised expert subnetworks; yet unconstrained softmax routing may yield highly imbalanced utilisation, where one or a few experts absorb most routing mass, collapsing the mixture back to a near-single-pathway solution. To address these limitations, we propose ROAM (Region-graph OptimAl-transport Mixture-of-experts), a spatially aware MoE-MIL aggregator that routes region tokens to expert poolers via capacity-constrained entropic optimal transport, promoting balanced expert utilisation by construction. ROAM operates on spatial region tokens, obtained by compressing dense patch bags into spatially binned units that align routing with local tissue neighbourhoods, and introduces two key mechanisms: (i) region-to-expert assignment formulated as entropic optimal transport (Sinkhorn) with explicit per-slide capacity marginals, enforcing balanced expert utilisation without auxiliary load-balancing losses; and (ii) graph-regularised Sinkhorn iterations that diffuse routing assignments over the spatial region graph, encouraging neighbouring regions to coherently route to the same experts. Evaluated on four WSI benchmarks with frozen foundation-model patch embeddings, ROAM achieves performance competitive against strong MIL and MoE baselines, and on NSCLC generalisation (TCGA-CPTAC) reaches external AUC 0.845 ± 0.019.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript introduces ROAM, a spatially aware MoE-MIL aggregator for gigapixel WSI classification. Patch embeddings are compressed into spatially binned region tokens; these are routed to a pool of expert poolers via capacity-constrained entropic optimal transport (Sinkhorn) whose per-slide marginals are intended to enforce balanced expert utilisation by construction. Graph-regularised Sinkhorn iterations are added to diffuse assignments over the region adjacency graph. The paper reports competitive performance against MIL and MoE baselines on four WSI benchmarks and an external AUC of 0.845 ± 0.019 on TCGA-CPTAC NSCLC generalisation.

Significance. If the routing mechanism truly preserves the prescribed marginals while adding spatial coherence, ROAM supplies a principled, auxiliary-loss-free alternative to unconstrained softmax routing in pathology MoE models. The explicit use of capacity marginals and the external validation set are concrete strengths that would support broader adoption of OT-based routing in computational pathology.

major comments (1)
  1. [Method (graph-regularised Sinkhorn iterations)] The central claim that balanced expert utilisation is guaranteed 'by construction' rests on the transport plan satisfying the per-slide capacity marginals after graph regularisation. The manuscript must demonstrate (via pseudocode, convergence argument, or explicit reprojection step) that the graph-diffusion operation does not violate these marginals; otherwise the stated advantage over standard softmax routing is not established.
minor comments (2)
  1. [Abstract and Results] The abstract claims competitive performance and reports an external AUC but specifies neither the baseline implementations, the number of experts, nor any statistical tests; these must be detailed in the results section, with tables showing per-expert utilisation statistics.
  2. [Results (external validation)] Clarify whether the reported ±0.019 on the external AUC is standard deviation across runs, cross-validation folds, or bootstrap; add this to the evaluation protocol description.
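
The distinction raised in the second minor comment matters because the two readings of "±" measure different things. The sketch below contrasts a bootstrap standard deviation over test slides (one trained model) with a standard deviation across independently trained runs. Everything in it is illustrative: the helper names, the toy labels and scores, and the per-run AUC values are made up for the example and are not the paper's data.

```python
import numpy as np

def auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count half."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos, neg = scores[y_true == 1], scores[y_true == 0]
    diff = pos[:, None] - neg[None, :]
    return ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size

def bootstrap_auc_std(y_true, scores, n_boot=1000, seed=0):
    """Std of AUC over resampled test sets: sampling uncertainty for a single model."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    rng = np.random.default_rng(seed)
    n = len(y_true)
    stats = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)
        if y_true[idx].min() == y_true[idx].max():
            continue                      # only one class drawn; AUC undefined
        stats.append(auc(y_true[idx], scores[idx]))
    return float(np.std(stats, ddof=1))

def across_runs_std(run_aucs):
    """Std of AUC across separately trained models: training variability."""
    return float(np.std(run_aucs, ddof=1))

y = [0, 0, 1, 1]                          # toy labels
s = [0.1, 0.4, 0.35, 0.8]                 # toy scores
point = auc(y, s)
boot_std = bootstrap_auc_std(y, s)
run_std = across_runs_std([0.82, 0.85, 0.86])   # made-up per-run AUCs
print(point, boot_std, run_std)
```

The first number reflects how much the estimate depends on which slides ended up in the test set, the second how much it depends on seeds or folds during training, so a reported ± is only interpretable once the protocol says which one it is.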

Simulated Author's Rebuttal

1 response · 0 unresolved

Thank you for the thorough review of our manuscript. We respond to the major comment point-by-point below and will incorporate the necessary clarifications in the revised version.

Point-by-point responses
  1. Referee: [Method (graph-regularised Sinkhorn iterations)] The central claim that balanced expert utilisation is guaranteed 'by construction' rests on the transport plan satisfying the per-slide capacity marginals after graph regularisation. The manuscript must demonstrate (via pseudocode, convergence argument, or explicit reprojection step) that the graph-diffusion operation does not violate these marginals; otherwise the stated advantage over standard softmax routing is not established.

    Authors: We thank the referee for highlighting this important point. The graph-regularised Sinkhorn procedure is constructed so that diffusion occurs as a convex combination within each iteration, after which the standard Sinkhorn row/column scaling steps are applied to restore the prescribed marginals. In the revision we will add explicit pseudocode (new Algorithm 1) together with a short convergence argument showing that the final transport plan satisfies the per-slide capacity marginals upon termination. This will rigorously establish the claimed advantage over unconstrained softmax routing. We will also include empirical expert-utilisation histograms on the validation sets to corroborate the theoretical guarantee.

    Revision: yes
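
The procedure described in this response (diffusion as a convex combination, followed by row/column scaling) can be sketched as below. This is a hypothetical reading, not the promised Algorithm 1: the function name, the mixing weight `lam`, the row-stochastic diffusion matrix `W`, the chain-shaped toy graph and the extra diffusion-free scaling passes at the end are all assumptions made for the illustration.

```python
import numpy as np

def graph_regularised_sinkhorn(cost, W, a, b, eps=0.5, lam=0.3, n_iters=50, n_project=200):
    """Diffuse the plan over the region graph, then rescale to the marginals.

    cost: (M, E) region-to-expert costs; W: (M, M) row-stochastic diffusion matrix;
    a: (M,) region masses; b: (E,) expert capacities, with a.sum() == b.sum().
    """
    P = np.exp(-cost / eps)                        # Gibbs kernel as the initial plan
    for _ in range(n_iters):
        P = (1.0 - lam) * P + lam * (W @ P)        # convex combination with neighbour average
        P *= (a / P.sum(axis=1))[:, None]          # row scaling: region masses
        P *= (b / P.sum(axis=0))[None, :]          # column scaling: expert capacities
    for _ in range(n_project):                     # plain scaling passes, no diffusion
        P *= (a / P.sum(axis=1))[:, None]
        P *= (b / P.sum(axis=0))[None, :]
    return P

# Toy problem: 30 regions on a chain, 3 experts (illustrative values only).
M, E = 30, 3
rng = np.random.default_rng(0)
cost = rng.random((M, E))
A = np.eye(M) + np.eye(M, k=1) + np.eye(M, k=-1)   # chain adjacency with self-loops
W = A / A.sum(axis=1, keepdims=True)               # row-stochastic diffusion matrix
a = np.full(M, 1.0 / M)
b = np.full(E, 1.0 / E)

P = graph_regularised_sinkhorn(cost, W, a, b)
print(np.abs(P.sum(axis=0) - b).max(), np.abs(P.sum(axis=1) - a).max())
```

In this sketch the final scaling passes restore both marginals to numerical precision, so expert usage stays balanced. The returned plan is, however, the entropic solution for the diffused kernel rather than for the original costs, which is the kind of detail a convergence argument would need to spell out.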

Circularity Check

0 steps flagged

No circularity: OT marginal enforcement is standard and independent

full rationale

The paper presents region-to-expert routing as capacity-constrained entropic OT (Sinkhorn) with explicit per-slide marginals that enforce balanced utilisation by the definition of a transport plan. This is a direct application of the standard OT property that any feasible plan satisfies the supplied marginals; it does not reduce the claimed performance or mechanism to a fitted hyperparameter or self-referential quantity. Graph regularisation is described as an additional diffusion step over the region graph, but the abstract gives no indication that it is implemented in a manner that violates the marginal constraints or that the balance claim is derived from the regularisation itself. No load-bearing self-citations, uniqueness theorems, or ansatzes are invoked for the central routing formulation, and results are reported via external benchmarks rather than internal consistency checks. The derivation chain therefore remains self-contained against external validation.
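
Whether a diffusion step keeps the plan feasible can be checked with a short calculation. This is a sketch under assumptions not taken from the paper: diffusion is modelled as a convex combination with a matrix W over regions, and λ in [0, 1] is a hypothetical mixing weight.

```latex
\begin{aligned}
\tilde P &= (1-\lambda)\,P + \lambda\, W P,
\qquad P\,\mathbf{1}_E = a, \quad P^{\top}\mathbf{1}_M = b, \\
\tilde P\,\mathbf{1}_E &= (1-\lambda)\,a + \lambda\, W a, \\
\tilde P^{\top}\mathbf{1}_M &= (1-\lambda)\,b + \lambda\, P^{\top} W^{\top}\mathbf{1}_M .
\end{aligned}
```

Row marginals are preserved when W a = a, and the capacity marginals are preserved when W^T 1 = 1, for instance with a doubly stochastic W and uniform a. A W that is only row-stochastic generally shifts the column sums, in which case balance depends on a subsequent rescaling step.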

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 0 invented entities

The method rests on standard optimal-transport solvers and graph construction from spatial bins; no new entities are postulated and the only potential free parameters are the per-slide capacity marginals whose exact setting is not detailed in the abstract.

free parameters (1)
  • per-slide capacity marginals
    Explicit marginals are used to enforce balanced expert utilisation; their numerical values or selection rule are not specified in the abstract.
axioms (2)
  • [standard math] Sinkhorn iterations converge to the entropic OT solution for the given marginals
    Invoked to compute the region-to-expert assignment matrix.
  • [domain assumption] Spatially neighbouring regions share similar routing preferences
    Basis for the graph-regularisation term that diffuses assignments over the region graph.
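
The region graph behind the second axiom can be illustrated by connecting occupied grid cells that touch. The sketch assumes 4-neighbour connectivity and a random-walk normalisation with self-loops; both are choices made here, since the abstract does not specify how the graph or its diffusion operator is built, and the function names are hypothetical.

```python
import numpy as np

def grid_adjacency(cells):
    """4-neighbour adjacency between occupied grid cells. cells: (M, 2) integer array."""
    index = {tuple(c): i for i, c in enumerate(cells.tolist())}
    adj = np.zeros((len(cells), len(cells)))
    for (x, y), i in index.items():
        for dx, dy in ((1, 0), (0, 1)):            # right and down; symmetry covers the rest
            j = index.get((x + dx, y + dy))
            if j is not None:
                adj[i, j] = adj[j, i] = 1.0
    return adj

def diffusion_operator(adj):
    """Row-stochastic random-walk matrix; self-loops keep isolated regions well defined."""
    A = adj + np.eye(len(adj))
    return A / A.sum(axis=1, keepdims=True)

cells = np.array([[0, 0], [1, 0], [0, 1], [5, 5]])  # last cell is an isolated tissue fragment
adj = grid_adjacency(cells)
W = diffusion_operator(adj)
print(adj)
print(W.round(2))
```

Isolated fragments keep their own routing under this operator, which shows where the neighbour-similarity assumption has nothing to act on.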

pith-pipeline@v0.9.0 · 5607 in / 1400 out tokens · 64030 ms · 2026-05-10T18:14:45.436005+00:00 · methodology


