Region-Graph Optimal Transport Routing for Mixture-of-Experts Whole-Slide Image Classification
Pith reviewed 2026-05-10 18:14 UTC · model grok-4.3
The pith
ROAM uses capacity-constrained optimal transport and graph regularization to balance expert routing for whole-slide image classification.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
ROAM formulates region-to-expert assignment as entropic optimal transport with explicit per-slide capacity marginals, solved by Sinkhorn iterations, and adds graph regularisation over the spatial region graph. The marginals enforce balanced expert utilisation by construction, while the graph term aligns routing with local tissue neighbourhoods.
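For orientation, the standard entropic optimal-transport problem that this claim builds on can be written as below. The notation is an assumption of this note, not the paper's: n region tokens, m experts, a cost matrix C, a row marginal a (mass per region), a column marginal b (per-slide capacity per expert), and a temperature ε. The graph-regularisation term is left out because its exact form is not given on this page.

```latex
\begin{aligned}
P^\star &= \arg\min_{P \in U(a,b)} \; \langle P, C \rangle - \varepsilon H(P), \\
U(a,b) &= \{ P \in \mathbb{R}_{\ge 0}^{n \times m} : P \mathbf{1}_m = a,\; P^\top \mathbf{1}_n = b \}, \\
H(P) &= -\sum_{i=1}^{n} \sum_{k=1}^{m} P_{ik} \left( \log P_{ik} - 1 \right), \\
P^\star &= \operatorname{diag}(u)\, K \operatorname{diag}(v), \qquad K_{ik} = \exp(-C_{ik} / \varepsilon), \\
u &\leftarrow a \oslash (K v), \qquad v \leftarrow b \oslash (K^\top u).
\end{aligned}
```

Any plan in U(a, b) has column sums equal to b, which is the sense in which a capacity marginal fixes expert load; the last line is the alternating Sinkhorn scaling that approaches such a plan.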
What carries the argument
Graph-regularised Sinkhorn iterations for capacity-constrained entropic optimal transport on spatial region tokens.
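A minimal sketch of the plain (not graph-regularised) Sinkhorn step may help make the capacity marginals concrete. It assumes uniform region mass and equal expert capacities; the function name `sinkhorn`, the toy cost matrix, and the values of `eps` and `n_iters` are illustrative choices, not taken from the paper. It uses plain Python lists for readability, whereas a practical version would use batched tensor operations.

```python
import math

def sinkhorn(cost, row_marg, col_marg, eps=0.5, n_iters=200):
    """Entropic OT by alternating row/column scaling; returns the plan as nested lists."""
    n, m = len(cost), len(cost[0])
    # Gibbs kernel: low cost -> high affinity.
    kernel = [[math.exp(-cost[i][k] / eps) for k in range(m)] for i in range(n)]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iters):
        # Row scaling: each region ends up carrying mass row_marg[i].
        u = [row_marg[i] / sum(kernel[i][k] * v[k] for k in range(m)) for i in range(n)]
        # Column scaling: each expert ends up receiving mass col_marg[k].
        v = [col_marg[k] / sum(kernel[i][k] * u[i] for i in range(n)) for k in range(m)]
    return [[u[i] * kernel[i][k] * v[k] for k in range(m)] for i in range(n)]

# Toy slide: 6 regions, 3 experts; every region is cheapest to send to expert 0.
cost = [[0.1 * i, 1.0, 1.2] for i in range(6)]
row_marg = [1.0 / 6] * 6   # assumed: equal mass per region
col_marg = [1.0 / 3] * 3   # assumed: equal capacity per expert

plan = sinkhorn(cost, row_marg, col_marg)
loads = [sum(row[k] for row in plan) for k in range(3)]
print("expert loads:", loads)
```

Even though all regions prefer the same expert in this toy case, the column marginal forces each expert to receive the same total mass, which is the behaviour the "by construction" claim relies on.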
If this is right
- Balanced expert utilisation occurs without any auxiliary load-balancing losses.
- Routing decisions respect local tissue neighborhoods through spatial binning and graph diffusion.
- Performance remains competitive with strong MIL and MoE baselines across four WSI benchmarks.
- External generalisation reaches 0.845 AUC on TCGA-CPTAC NSCLC data.
- Region-token compression reduces the number of routing decisions while retaining neighbourhood structure.
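The region-token compression and the spatial region graph mentioned in the list above can be sketched as follows. This is only one plausible reading: it assumes square bins, mean-pooling of patch embeddings within a bin, and a 4-neighbour grid graph over occupied bins. The names `bin_patches` and `grid_edges` and the bin size are hypothetical.

```python
from collections import defaultdict

def bin_patches(coords, feats, bin_size):
    """Mean-pool patch embeddings whose (x, y) coordinates fall in the same square bin."""
    groups = defaultdict(list)
    for (x, y), f in zip(coords, feats):
        groups[(x // bin_size, y // bin_size)].append(f)
    keys = sorted(groups)
    tokens = [[sum(col) / len(groups[key]) for col in zip(*groups[key])] for key in keys]
    return keys, tokens

def grid_edges(keys):
    """4-neighbour adjacency between occupied bins, as index pairs (i, j) with i < j."""
    index = {key: i for i, key in enumerate(keys)}
    edges = []
    for (bx, by), i in index.items():
        # Looking only right and down visits each undirected edge once.
        for nb in ((bx + 1, by), (bx, by + 1)):
            if nb in index:
                edges.append((i, index[nb]))
    return edges

# Toy bag: 5 patches with 2-dimensional embeddings, binned on a 256-pixel grid.
coords = [(0, 0), (10, 5), (300, 0), (310, 20), (0, 300)]
feats = [[1.0, 0.0], [3.0, 2.0], [0.0, 4.0], [2.0, 6.0], [5.0, 5.0]]

keys, tokens = bin_patches(coords, feats, bin_size=256)
edges = grid_edges(keys)
print(keys)    # occupied bins
print(tokens)  # one pooled token per bin
print(edges)   # region graph over the tokens
```

Five patches become three region tokens here, so the router makes three decisions instead of five while the edge list keeps track of which regions are spatial neighbours.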
Where Pith is reading between the lines
- The same optimal-transport routing could be tested on other spatially structured MIL tasks such as remote-sensing scene classification.
- Graph diffusion may implicitly encode larger-scale pathological patterns that single-region decisions miss.
- Entropic optimal transport offers a differentiable alternative to softmax routing that may scale to larger expert pools.
Load-bearing premise
Compressing patches into spatially binned region tokens plus graph regularization preserves enough instance-level information to justify the added routing machinery and to improve downstream classification.
What would settle it
Ablating the capacity marginals or the graph-regularisation term: if the claim holds, doing so should produce either markedly imbalanced expert utilisation or lower AUC on the same four WSI benchmarks and the external NSCLC set.
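To make "markedly imbalanced" measurable in such an ablation, one option is to compare the largest expert load with the uniform share. The sketch below applies this to unconstrained row-wise softmax routing. The metric `load_imbalance`, the helper names, and the toy scores are assumptions for illustration; the paper may report utilisation differently.

```python
import math

def softmax_routing(scores):
    """Row-wise softmax: each region spreads unit mass over experts independently."""
    plan = []
    for row in scores:
        mx = max(row)  # subtract the max for numerical stability
        exps = [math.exp(s - mx) for s in row]
        z = sum(exps)
        plan.append([e / z for e in exps])
    return plan

def expert_loads(plan):
    """Fraction of the total routing mass received by each expert."""
    total = sum(sum(row) for row in plan)
    return [sum(row[k] for row in plan) / total for k in range(len(plan[0]))]

def load_imbalance(loads):
    """Largest load divided by the uniform share 1/m; 1.0 means perfectly balanced."""
    return max(loads) * len(loads)

# Toy case: 8 regions that all score expert 0 highest.
scores = [[4.0, 0.0, 0.0] for _ in range(8)]
softmax_plan = softmax_routing(scores)
balanced_plan = [[1.0 / 3] * 3 for _ in range(8)]  # what equal capacities would enforce

print("softmax imbalance:", load_imbalance(expert_loads(softmax_plan)))
print("balanced imbalance:", load_imbalance(expert_loads(balanced_plan)))
```

With three experts the metric ranges from 1.0 (balanced) to 3.0 (one expert takes everything), so a value near 3.0 under softmax routing would correspond to the collapse the abstract describes.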
Original abstract
Multiple Instance Learning (MIL) is the dominant framework for gigapixel whole-slide image (WSI) classification in computational pathology. However, current MIL aggregators route all instances through a shared pathway, constraining their capacity to specialise across the pathological heterogeneity inherent in each slide. Mixture-of-Experts (MoE) methods offer a natural remedy by partitioning instances across specialised expert subnetworks; yet unconstrained softmax routing may yield highly imbalanced utilisation, where one or a few experts absorb most routing mass, collapsing the mixture back to a near-single-pathway solution. To address these limitations, we propose ROAM (Region-graph OptimAl-transport Mixture-of-experts), a spatially aware MoE-MIL aggregator that routes region tokens to expert poolers via capacity-constrained entropic optimal transport, promoting balanced expert utilisation by construction. ROAM operates on spatial region tokens, obtained by compressing dense patch bags into spatially binned units that align routing with local tissue neighbourhoods, and introduces two key mechanisms: (i) region-to-expert assignment formulated as entropic optimal transport (Sinkhorn) with explicit per-slide capacity marginals, enforcing balanced expert utilisation without auxiliary load-balancing losses; and (ii) graph-regularised Sinkhorn iterations that diffuse routing assignments over the spatial region graph, encouraging neighbouring regions to coherently route to the same experts. Evaluated on four WSI benchmarks with frozen foundation-model patch embeddings, ROAM achieves performance competitive against strong MIL and MoE baselines, and on NSCLC generalisation (TCGA-CPTAC) reaches external AUC 0.845 ± 0.019.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces ROAM, a spatially aware MoE-MIL aggregator for gigapixel WSI classification. Patch embeddings are compressed into spatially binned region tokens; these are routed to a pool of expert poolers via capacity-constrained entropic optimal transport (Sinkhorn) whose per-slide marginals are intended to enforce balanced expert utilisation by construction. Graph-regularised Sinkhorn iterations are added to diffuse assignments over the region adjacency graph. The paper reports competitive performance against MIL and MoE baselines on four WSI benchmarks and an external AUC of 0.845 ± 0.019 on TCGA-CPTAC NSCLC generalisation.
Significance. If the routing mechanism truly preserves the prescribed marginals while adding spatial coherence, ROAM supplies a principled, auxiliary-loss-free alternative to unconstrained softmax routing in pathology MoE models. The explicit use of capacity marginals and the external validation set are concrete strengths that would support broader adoption of OT-based routing in computational pathology.
major comments (1)
- [Method (graph-regularised Sinkhorn iterations)] The central claim that balanced expert utilisation is guaranteed 'by construction' rests on the transport plan satisfying the per-slide capacity marginals after graph regularisation. The manuscript must demonstrate (via pseudocode, convergence argument, or explicit reprojection step) that the graph-diffusion operation does not violate these marginals; otherwise the stated advantage over standard softmax routing is not established.
minor comments (2)
- [Abstract and Results] The abstract claims competitive performance and reports an external AUC, but gives no baseline implementation details, number of experts, or statistical tests; the results section must supply these, along with tables of per-expert utilisation statistics.
- [Results (external validation)] Clarify whether the reported ±0.019 on the external AUC is standard deviation across runs, cross-validation folds, or bootstrap; add this to the evaluation protocol description.
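The distinction raised in the last comment matters because the same "±" can come from different procedures that measure different things. The sketch below contrasts a standard deviation across training runs with a bootstrap standard deviation on one fixed test set. All numbers are toy values invented for illustration and none come from the paper; the helper names are hypothetical.

```python
import random
import statistics

def auc(labels, scores):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def bootstrap_std(labels, scores, n_boot=500, seed=0):
    """Standard deviation of AUC over bootstrap resamples of one fixed test set."""
    rng = random.Random(seed)
    n = len(labels)
    aucs = []
    while len(aucs) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if 0 < sum(ys) < n:  # AUC needs both classes in the resample
            aucs.append(auc(ys, [scores[i] for i in idx]))
    return statistics.stdev(aucs)

# Toy test set and toy per-run AUCs (hypothetical, for illustration only).
labels = [0, 0, 1, 1, 0, 1, 0, 1]
scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.6, 0.9]
run_aucs = [0.70, 0.74, 0.68, 0.72, 0.71]

print("std across runs:", statistics.stdev(run_aucs))
print("bootstrap std on one test set:", bootstrap_std(labels, scores))
```

The first figure reflects sensitivity to seeds or folds, the second reflects test-set sampling noise, so the protocol description needs to say which one the reported interval is.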
Simulated Author's Rebuttal
Thank you for the thorough review of our manuscript. We respond to the major comment point-by-point below and will incorporate the necessary clarifications in the revised version.
Referee: [Method (graph-regularised Sinkhorn iterations)] The central claim that balanced expert utilisation is guaranteed 'by construction' rests on the transport plan satisfying the per-slide capacity marginals after graph regularisation. The manuscript must demonstrate (via pseudocode, convergence argument, or explicit reprojection step) that the graph-diffusion operation does not violate these marginals; otherwise the stated advantage over standard softmax routing is not established.
Authors: We thank the referee for highlighting this important point. The graph-regularised Sinkhorn procedure is constructed so that diffusion occurs as a convex combination within each iteration, after which the standard Sinkhorn row/column scaling steps are applied to restore the prescribed marginals. In the revision we will add explicit pseudocode (new Algorithm 1) together with a short convergence argument showing that the final transport plan satisfies the per-slide capacity marginals upon termination. This will rigorously establish the claimed advantage over unconstrained softmax routing. We will also include empirical expert-utilisation histograms on the validation sets to corroborate the theoretical guarantee.
Revision: yes
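One possible reading of the procedure described in this response is sketched below: in each iteration the plan is mixed with the average of neighbouring regions' rows (a convex combination), then rescaled by rows and by columns. This is not the promised Algorithm 1, which is not shown here. The function names, the mixing weight `alpha`, the chain graph, and the toy costs are all assumptions.

```python
import math

def graph_sinkhorn(cost, neighbours, row_marg, col_marg, eps=0.5, alpha=0.3, n_iters=100):
    """Sinkhorn-style scaling with a neighbour-averaging (diffusion) step in each iteration."""
    n, m = len(cost), len(cost[0])
    plan = [[math.exp(-cost[i][k] / eps) for k in range(m)] for i in range(n)]
    for _ in range(n_iters):
        # Diffusion: convex combination of each row with the mean of its neighbours' rows.
        diffused = []
        for i in range(n):
            nbrs = neighbours[i]
            if nbrs:
                mean = [sum(plan[j][k] for j in nbrs) / len(nbrs) for k in range(m)]
                diffused.append([(1 - alpha) * plan[i][k] + alpha * mean[k] for k in range(m)])
            else:
                diffused.append(list(plan[i]))
        # Row scaling towards the per-region mass.
        plan = [[p * row_marg[i] / sum(row) for p in row] for i, row in enumerate(diffused)]
        # Column scaling towards the per-expert capacity (the last step of every iteration).
        col_sums = [sum(plan[i][k] for i in range(n)) for k in range(m)]
        plan = [[plan[i][k] * col_marg[k] / col_sums[k] for k in range(m)] for i in range(n)]
    return plan

def marginal_residuals(plan, row_marg, col_marg):
    """Largest absolute deviation from the row and column marginals."""
    row_err = max(abs(sum(row) - a) for row, a in zip(plan, row_marg))
    col_err = max(abs(sum(row[k] for row in plan) - b) for k, b in enumerate(col_marg))
    return row_err, col_err

# Toy slide: 4 regions on a chain graph, 2 experts with equal capacity.
cost = [[0.0, 1.0], [0.2, 1.0], [1.0, 0.1], [1.0, 0.0]]
neighbours = [[1], [0, 2], [1, 3], [2]]
row_marg = [0.25] * 4
col_marg = [0.5, 0.5]

plan = graph_sinkhorn(cost, neighbours, row_marg, col_marg)
row_err, col_err = marginal_residuals(plan, row_marg, col_marg)
print("row residual:", row_err, "column residual:", col_err)
```

In this sketch the column (capacity) marginals hold to floating-point precision only because every iteration ends on a column-scaling step, while the row residual has to be checked numerically. That asymmetry is the kind of detail the referee asks the pseudocode and convergence argument to pin down.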
Circularity Check
No circularity: OT marginal enforcement is standard and independent
The paper presents region-to-expert routing as capacity-constrained entropic OT (Sinkhorn) with explicit per-slide marginals that enforce balanced utilisation by the definition of a transport plan. This is a direct application of the standard OT property that any feasible plan satisfies the supplied marginals; it does not reduce the claimed performance or mechanism to a fitted hyperparameter or self-referential quantity. Graph regularisation is described as an additional diffusion step over the region graph, but the abstract gives no indication that it is implemented in a manner that violates the marginal constraints or that the balance claim is derived from the regularisation itself. No load-bearing self-citations, uniqueness theorems, or ansatzes are invoked for the central routing formulation, and results are reported via external benchmarks rather than internal consistency checks. The derivation chain therefore remains self-contained against external validation.
Axiom & Free-Parameter Ledger
free parameters (1)
- per-slide capacity marginals
axioms (2)
- [standard math] Sinkhorn iterations converge to the entropic OT solution for the given marginals
- [domain assumption] Spatially neighbouring regions share similar routing preferences
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, theorem washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: region-to-expert assignment formulated as entropic optimal transport (Sinkhorn) with explicit per-slide capacity marginals, enforcing balanced expert utilisation without auxiliary load-balancing losses
- IndisputableMonolith/Foundation/AlexanderDuality.lean, theorem alexander_duality_circle_linking (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: graph-regularised Sinkhorn iterations that diffuse routing assignments over the spatial region graph
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.