Region-Graph Optimal Transport Routing for Mixture-of-Experts Whole-Slide Image Classification

Bart Wanders; Colleen Knoth; Ephraim Tsalik; Jiuliu Lu; Julian Knight; Xin Tian

arXiv: 2604.07298 · v1 · submitted 2026-04-08 · cs.CV · cs.AI · eess.IV

Xin Tian, Jiuliu Lu, Ephraim Tsalik, Bart Wanders, Colleen Knoth, Julian Knight

Pith reviewed 2026-05-10 18:14 UTC · model grok-4.3

classification: cs.CV, cs.AI, eess.IV

keywords: whole-slide image classification, mixture of experts, optimal transport, multiple instance learning, computational pathology, graph regularization, entropic optimal transport

The pith

ROAM uses capacity-constrained optimal transport and graph regularization to balance expert routing for whole-slide image classification.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Standard multiple-instance learning routes every patch through the same pathway, while unconstrained mixture-of-experts models often collapse to one or two dominant experts. ROAM first compresses patches into spatial region tokens that respect tissue neighborhoods, then solves region-to-expert assignment with entropic optimal transport under explicit capacity marginals per slide. Sinkhorn iterations are further regularized by diffusion across the region graph so that neighboring regions tend to select the same experts. The result is balanced utilization enforced by construction rather than by auxiliary losses. The method is evaluated on four WSI benchmarks with frozen embeddings and reports competitive accuracy plus an external AUC of 0.845 on TCGA-CPTAC NSCLC slides.

Core claim

ROAM formulates region-to-expert assignment as entropic optimal transport with explicit per-slide capacity marginals solved by Sinkhorn iterations, with additional graph regularization over the spatial region graph, thereby enforcing balanced expert utilization by construction while aligning routing with local tissue neighborhoods.
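For readers who want the claim in symbols, one standard way to write capacity-constrained entropic optimal transport is sketched below. The symbols are assumptions chosen for illustration and may not match the paper's notation: C is the M × E region-to-expert cost matrix, a the per-region mass, b the per-slide expert capacities, and ε the entropic regularisation strength. The graph-regularisation term is left out because its exact form is not given in the material summarised here.

```latex
\begin{aligned}
P^\star &= \operatorname*{arg\,min}_{P \in U(a,b)} \; \langle P, C \rangle - \varepsilon H(P), \\
U(a,b) &= \left\{ P \in \mathbb{R}_{\ge 0}^{M \times E} \;:\; P\,\mathbf{1}_E = a,\;\; P^{\top}\mathbf{1}_M = b \right\}, \\
H(P) &= -\sum_{i=1}^{M}\sum_{e=1}^{E} P_{ie}\left(\log P_{ie} - 1\right), \\
K &= \exp(-C/\varepsilon), \qquad
u \leftarrow a \oslash (K v), \qquad
v \leftarrow b \oslash (K^{\top} u), \qquad
P = \operatorname{diag}(u)\, K \operatorname{diag}(v).
\end{aligned}
```

Every feasible P has column sums equal to b, so expert e receives exactly b_e of the slide's routing mass. That is the sense in which balance holds by construction rather than through a penalty term.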

What carries the argument

Graph-regularised Sinkhorn iterations for capacity-constrained entropic optimal transport on spatial region tokens.
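As a reference point for the machinery named above, here is a minimal pure-Python sketch of plain Sinkhorn scaling with capacity marginals, without the graph term. It is not the authors' implementation: the function name `sinkhorn`, the toy costs, and the values of `eps` and `n_iters` are all assumptions made for illustration.

```python
import math

def sinkhorn(cost, a, b, eps=0.1, n_iters=200):
    """Entropic OT via Sinkhorn scaling.

    cost: M x E list of lists (region-to-expert costs)
    a: length-M region masses (sum to 1)
    b: length-E expert capacities (sum to 1)
    Returns the M x E transport plan P = diag(u) K diag(v).
    """
    M, E = len(a), len(b)
    K = [[math.exp(-cost[i][j] / eps) for j in range(E)] for i in range(M)]
    u = [1.0] * M
    v = [1.0] * E
    for _ in range(n_iters):
        # Row scaling: make each region's outgoing mass equal a[i].
        u = [a[i] / sum(K[i][j] * v[j] for j in range(E)) for i in range(M)]
        # Column scaling: make each expert's incoming mass equal b[j].
        v = [b[j] / sum(K[i][j] * u[i] for i in range(M)) for j in range(E)]
    return [[u[i] * K[i][j] * v[j] for j in range(E)] for i in range(M)]

# Toy slide: 4 regions, 2 experts; every region prefers expert 0 on cost alone.
cost = [[0.1, 0.9], [0.2, 0.8], [0.3, 0.7], [0.4, 0.6]]
a = [0.25] * 4
b = [0.5, 0.5]
plan = sinkhorn(cost, a, b)
expert_load = [sum(row[j] for row in plan) for j in range(2)]
```

Although every region in the toy example prefers expert 0, the column scaling forces each expert to receive half of the mass, so the regions with the weakest preference are pushed towards expert 1.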

If this is right

Balanced expert utilisation occurs without any auxiliary load-balancing losses.
Routing decisions respect local tissue neighborhoods through spatial binning and graph diffusion.
Performance remains competitive with strong MIL and MoE baselines across four WSI benchmarks.
External generalisation reaches 0.845 AUC on TCGA-CPTAC NSCLC data.
Region-token compression reduces the number of routing decisions while retaining neighbourhood structure.
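The region-token compression mentioned in the list above can be pictured with a small sketch. It assumes square grid bins, mean pooling, and a 4-neighbour region graph; the paper's actual binning rule, pooling operator, and graph construction are not stated here, and `bin_patches` and `bin_size` are hypothetical names.

```python
from collections import defaultdict

def bin_patches(coords, feats, bin_size):
    """Mean-pool patch features into grid-cell region tokens.

    coords: list of (x, y) patch positions in pixels
    feats: list of equal-length feature vectors, one per patch
    bin_size: side length of a square bin in pixels (hypothetical parameter)
    Returns (keys, tokens, edges): sorted bin indices, one mean vector per
    occupied bin, and each 4-neighbour edge between occupied bins listed once.
    """
    groups = defaultdict(list)
    for (x, y), f in zip(coords, feats):
        groups[(int(x // bin_size), int(y // bin_size))].append(f)
    keys = sorted(groups)
    tokens = [[sum(col) / len(groups[k]) for col in zip(*groups[k])] for k in keys]
    index = {k: i for i, k in enumerate(keys)}
    edges = []
    for (bx, by), i in index.items():
        # Looking only right and down records every undirected edge once.
        for nb in ((bx + 1, by), (bx, by + 1)):
            if nb in index:
                edges.append((i, index[nb]))
    return keys, tokens, edges

coords = [(0, 0), (100, 50), (300, 0), (310, 20), (0, 300)]
feats = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0], [0.0, 4.0], [5.0, 5.0]]
keys, tokens, edges = bin_patches(coords, feats, bin_size=256)
```

Five patches collapse to three region tokens here, so the router makes three decisions instead of five while the edge list still records which regions touch.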

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same optimal-transport routing could be tested on other spatially structured MIL tasks such as remote-sensing scene classification.
Graph diffusion may implicitly encode larger-scale pathological patterns that single-region decisions miss.
Entropic optimal transport offers a differentiable alternative to softmax routing that may scale to larger expert pools.

Load-bearing premise

Compressing patches into spatially binned region tokens plus graph regularization preserves enough instance-level information to justify the added routing machinery and to improve downstream classification.

What would settle it

Removing the capacity marginals or the graph-regularization term produces either markedly imbalanced expert utilization or lower AUC on the same four WSI benchmarks and external NSCLC set.
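An ablation of this kind needs a number for "imbalanced". One possible choice, offered as an assumption rather than the paper's metric, is the normalised entropy of the per-expert load. The sketch below computes it for unconstrained softmax routing on hypothetical logits.

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    s = sum(exps)
    return [e / s for e in exps]

def expert_load(routing):
    """Fraction of total routing mass received by each expert."""
    E = len(routing[0])
    total = sum(sum(r) for r in routing)
    return [sum(r[j] for r in routing) / total for j in range(E)]

def load_entropy(load):
    """Normalised entropy in [0, 1]; 1 means perfectly even utilisation."""
    h = -sum(p * math.log(p) for p in load if p > 0)
    return h / math.log(len(load))

# Hypothetical router logits for 4 regions and 3 experts; expert 0 dominates.
logits = [[4.0, 0.0, 0.0], [3.5, 0.5, 0.0], [4.2, 0.1, 0.3], [3.8, 0.0, 0.2]]
soft = [softmax(r) for r in logits]
soft_load = expert_load(soft)
uniform_load = [1 / 3] * 3
```

With these logits expert 0 absorbs more than 90% of the mass and the normalised entropy falls well below 1. A plan whose column sums match uniform capacities would score 1 on the same measure.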

Figures

Figures reproduced from arXiv: 2604.07298 by Bart Wanders, Colleen Knoth, Ephraim Tsalik, Jiuliu Lu, Julian Knight, Xin Tian.

**Figure 1.** Frozen patch embeddings are pooled into M spatial region tokens. A routing GNN on the region graph parameterises region-to-expert costs, and Sinkhorn optimal transport routes region mass to E experts under per-slide capacity constraints (with optional graph smoothing). Each expert performs gated-attention pooling over its routed regions, and expert outputs are fused for slide-level prediction.

**Figure 2.** Qualitative routing visualisation on CPTAC (NSCLC). Two CPTAC external test slides (top: LUAD; bottom: LUSC); numbers denote P(LUSC). Left: ROAM dominant expert per region (territories). Middle/Right: ABMIL and CLAM-SB attention heatmaps (normalised for visualisation).
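The expert pooling step named in the Figure 1 caption can be sketched with the usual gated-attention form from attention-based MIL, where each region is scored by a tanh branch gated by a sigmoid branch. The weights below are hand-picked rather than learned, and how ROAM combines routing mass with these attention weights is not specified in the caption, so that part is left out.

```python
import math

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def gated_attention_pool(H, V, U, w):
    """Gated-attention pooling over the rows of H.

    H: list of K region feature vectors (dimension D)
    V, U: L x D weight matrices for the tanh and sigmoid branches
    w: length-L scoring vector
    Returns (attention weights, pooled D-dimensional vector).
    """
    scores = []
    for h in H:
        t = [math.tanh(z) for z in matvec(V, h)]
        g = [1.0 / (1.0 + math.exp(-z)) for z in matvec(U, h)]
        scores.append(sum(wi * ti * gi for wi, ti, gi in zip(w, t, g)))
    # Softmax over regions turns scores into attention weights.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    attn = [e / total for e in exps]
    pooled = [sum(a * h[d] for a, h in zip(attn, H)) for d in range(len(H[0]))]
    return attn, pooled

# Toy expert with hand-picked (not learned) weights: D = 2, L = 2.
H = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0]]
U = [[0.5, 0.5], [0.5, 0.5]]
w = [1.0, 1.0]
attn, pooled = gated_attention_pool(H, V, U, w)
```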

Original abstract

Multiple Instance Learning (MIL) is the dominant framework for gigapixel whole-slide image (WSI) classification in computational pathology. However, current MIL aggregators route all instances through a shared pathway, constraining their capacity to specialise across the pathological heterogeneity inherent in each slide. Mixture-of-Experts (MoE) methods offer a natural remedy by partitioning instances across specialised expert subnetworks; yet unconstrained softmax routing may yield highly imbalanced utilisation, where one or a few experts absorb most routing mass, collapsing the mixture back to a near-single-pathway solution. To address these limitations, we propose ROAM (Region-graph OptimAl-transport Mixture-of-experts), a spatially aware MoE-MIL aggregator that routes region tokens to expert poolers via capacity-constrained entropic optimal transport, promoting balanced expert utilisation by construction. ROAM operates on spatial region tokens, obtained by compressing dense patch bags into spatially binned units that align routing with local tissue neighbourhoods, and introduces two key mechanisms: (i) region-to-expert assignment formulated as entropic optimal transport (Sinkhorn) with explicit per-slide capacity marginals, enforcing balanced expert utilisation without auxiliary load-balancing losses; and (ii) graph-regularised Sinkhorn iterations that diffuse routing assignments over the spatial region graph, encouraging neighbouring regions to coherently route to the same experts. Evaluated on four WSI benchmarks with frozen foundation-model patch embeddings, ROAM achieves performance competitive against strong MIL and MoE baselines, and on NSCLC generalisation (TCGA-CPTAC) reaches external AUC 0.845 ± 0.019.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

ROAM gives a clean OT-based fix for expert collapse in MoE-MIL on WSIs via per-slide marginals and graph diffusion, but the regularization step risks breaking the balance guarantee and the experiments lack the controls needed to confirm the mechanism works.

The core contribution is a region-graph OT router that assigns spatially binned tokens to experts using Sinkhorn with explicit per-slide capacity marginals, plus graph-regularised iterations to encourage local coherence. This is a direct attempt to enforce balanced utilisation without the usual auxiliary losses, and the abstract reports competitive numbers plus a solid external AUC of 0.845 on TCGA-CPTAC NSCLC data. That external result is the strongest piece of evidence they provide for generalisation under frozen foundation-model embeddings. The framing of the problem (standard MIL shares one pathway across all instances, and unconstrained MoE tends to collapse onto a few experts) is accurate, and the proposed fix stays within standard OT machinery, which is a plus for reproducibility. The region token construction also aligns routing with tissue neighbourhoods, which makes sense for pathology.

On the downside, the abstract and available details give no ablations that isolate the OT marginals from the graph regularisation or from the binning step itself. Without those, it is difficult to know whether the reported gains trace to the claimed mechanism. The stress-test concern about marginal violation is real: if the graph diffusion is applied by modifying the cost matrix or by post-hoc steps that are not re-projected onto the prescribed marginals, the transport plan can drift and the “balanced by construction” claim weakens. The paper would need to show that the final row and column sums stay close to the targets after regularisation. Implementation specifics on how the Sinkhorn iterations are altered are also thin.

This work is aimed at people already doing MIL or MoE in computational pathology who want a spatially aware aggregator. It is not broad enough or novel enough to change practice outside that niche, but the technical idea is coherent enough that a serious referee should see it. I would send it to review with a request for the missing ablations and a marginal-preservation check.
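The marginal-preservation check requested in the letter is cheap to state. A minimal sketch, assuming the routing plan is available as an M × E matrix with known target marginals (the function name and toy matrices are illustrative, not taken from the paper):

```python
def marginal_violation(plan, a, b):
    """Largest absolute gap between the plan's marginals and their targets.

    plan: M x E routing matrix; a: target row sums; b: target column sums.
    Returns (max row gap, max column gap).
    """
    row_gap = max(abs(sum(row) - ai) for row, ai in zip(plan, a))
    col_gap = max(abs(sum(row[j] for row in plan) - bj) for j, bj in enumerate(b))
    return row_gap, col_gap

a = [0.5, 0.5]
b = [0.5, 0.5]
feasible = [[0.4, 0.1], [0.1, 0.4]]
# A post-hoc smoothing step that is not re-projected can shift column mass.
drifted = [[0.45, 0.05], [0.25, 0.25]]

ok_rows, ok_cols = marginal_violation(feasible, a, b)
bad_rows, bad_cols = marginal_violation(drifted, a, b)
```

In the drifted example the row sums are still correct, but one expert now receives 0.7 of the mass instead of 0.5. Reporting the column gap per slide would show directly whether the capacity constraint survives the graph step.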

Referee Report

1 major / 2 minor

Summary. The manuscript introduces ROAM, a spatially aware MoE-MIL aggregator for gigapixel WSI classification. Patch embeddings are compressed into spatially binned region tokens; these are routed to a pool of expert poolers via capacity-constrained entropic optimal transport (Sinkhorn) whose per-slide marginals are intended to enforce balanced expert utilisation by construction. Graph-regularised Sinkhorn iterations are added to diffuse assignments over the region adjacency graph. The paper reports competitive performance against MIL and MoE baselines on four WSI benchmarks and an external AUC of 0.845 ± 0.019 on TCGA-CPTAC NSCLC generalisation.

Significance. If the routing mechanism truly preserves the prescribed marginals while adding spatial coherence, ROAM supplies a principled, auxiliary-loss-free alternative to unconstrained softmax routing in pathology MoE models. The explicit use of capacity marginals and the external validation set are concrete strengths that would support broader adoption of OT-based routing in computational pathology.

major comments (1)

[Method (graph-regularised Sinkhorn iterations)] The central claim that balanced expert utilisation is guaranteed 'by construction' rests on the transport plan satisfying the per-slide capacity marginals after graph regularisation. The manuscript must demonstrate (via pseudocode, convergence argument, or explicit reprojection step) that the graph-diffusion operation does not violate these marginals; otherwise the stated advantage over standard softmax routing is not established.

minor comments (2)

[Abstract and Results] The abstract reports competitive performance and an external AUC but supplies none of the baseline implementations, the number of experts, or statistical tests; these must be detailed in the results section with tables showing per-expert utilisation statistics.
[Results (external validation)] Clarify whether the reported ±0.019 on the external AUC is standard deviation across runs, cross-validation folds, or bootstrap; add this to the evaluation protocol description.

Simulated Authors' Rebuttal

1 response · 0 unresolved

Thank you for the thorough review of our manuscript. We respond to the major comment point-by-point below and will incorporate the necessary clarifications in the revised version.

Referee: [Method (graph-regularised Sinkhorn iterations)] The central claim that balanced expert utilisation is guaranteed 'by construction' rests on the transport plan satisfying the per-slide capacity marginals after graph regularisation. The manuscript must demonstrate (via pseudocode, convergence argument, or explicit reprojection step) that the graph-diffusion operation does not violate these marginals; otherwise the stated advantage over standard softmax routing is not established.

Authors: We thank the referee for highlighting this important point. The graph-regularised Sinkhorn procedure is constructed so that diffusion occurs as a convex combination within each iteration, after which the standard Sinkhorn row/column scaling steps are applied to restore the prescribed marginals. In the revision we will add explicit pseudocode (new Algorithm 1) together with a short convergence argument showing that the final transport plan satisfies the per-slide capacity marginals upon termination. This will rigorously establish the claimed advantage over unconstrained softmax routing. We will also include empirical expert-utilisation histograms on the validation sets to corroborate the theoretical guarantee.

Revision: yes
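One way to read the procedure described in the rebuttal is sketched below: each iteration blends every region's row with the mean of its neighbours' rows, then applies row and column scaling. This is an interpretation, not the promised Algorithm 1. The diffusion weight `lam`, the neighbour-mean operator, and the closing diffusion-free passes (`n_project`) are all assumptions.

```python
import math

def graph_sinkhorn(cost, a, b, neighbours, eps=0.5, lam=0.3,
                   n_iters=50, n_project=50):
    """Sinkhorn-style scaling with a neighbour-averaging step per iteration.

    neighbours[i] lists the region indices adjacent to region i.
    lam is the diffusion weight in the convex combination (hypothetical).
    """
    M, E = len(a), len(b)
    P = [[math.exp(-cost[i][j] / eps) for j in range(E)] for i in range(M)]

    def rescale(P):
        # Row scaling to region masses a, then column scaling to capacities b.
        P = [[p * a[i] / sum(P[i]) for p in P[i]] for i in range(M)]
        col = [sum(P[i][j] for i in range(M)) for j in range(E)]
        return [[P[i][j] * b[j] / col[j] for j in range(E)] for i in range(M)]

    for _ in range(n_iters):
        # Diffusion: blend each region's row with the mean of its neighbours' rows.
        mixed = []
        for i in range(M):
            nb = neighbours[i]
            if nb:
                avg = [sum(P[k][j] for k in nb) / len(nb) for j in range(E)]
            else:
                avg = P[i]
            mixed.append([(1 - lam) * P[i][j] + lam * avg[j] for j in range(E)])
        P = rescale(mixed)
    for _ in range(n_project):
        # Plain scaling passes with no diffusion, to pull both marginals back.
        P = rescale(P)
    return P

# Four regions on a chain 0-1-2-3, two experts with equal capacity.
cost = [[0.1, 0.9], [0.2, 0.8], [0.8, 0.2], [0.9, 0.1]]
neighbours = [[1], [0, 2], [1, 3], [2]]
a, b = [0.25] * 4, [0.5, 0.5]
P = graph_sinkhorn(cost, a, b, neighbours)
col_sums = [sum(row[j] for row in P) for j in range(2)]
row_sums = [sum(row) for row in P]
```

Because the last operation is always a column scaling, the capacity marginals hold to floating-point precision, and the closing passes bring the row sums back as well. The resulting plan is generally not the minimiser of the original entropic objective, which is why a convergence argument of the kind the referee requests still matters.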

Circularity Check

0 steps flagged

No circularity: OT marginal enforcement is standard and independent

The paper presents region-to-expert routing as capacity-constrained entropic OT (Sinkhorn) with explicit per-slide marginals that enforce balanced utilisation by the definition of a transport plan. This is a direct application of the standard OT property that any feasible plan satisfies the supplied marginals; it does not reduce the claimed performance or mechanism to a fitted hyperparameter or self-referential quantity. Graph regularisation is described as an additional diffusion step over the region graph, but the abstract gives no indication that it is implemented in a manner that violates the marginal constraints or that the balance claim is derived from the regularisation itself. No load-bearing self-citations, uniqueness theorems, or ansatzes are invoked for the central routing formulation, and results are reported via external benchmarks rather than internal consistency checks. The derivation chain therefore remains self-contained against external validation.

Axiom & Free-Parameter Ledger

1 free parameters · 2 axioms · 0 invented entities

The method rests on standard optimal-transport solvers and graph construction from spatial bins; no new entities are postulated and the only potential free parameters are the per-slide capacity marginals whose exact setting is not detailed in the abstract.

free parameters (1)

per-slide capacity marginals
Explicit marginals are used to enforce balanced expert utilisation; their numerical values or selection rule are not specified in the abstract.

axioms (2)

standard math: Sinkhorn iterations converge to the entropic OT solution for the given marginals
Invoked to compute the region-to-expert assignment matrix.
domain assumption: Spatially neighbouring regions share similar routing preferences
Basis for the graph-regularisation term that diffuses assignments over the region graph.
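The domain assumption can be checked on a given routing plan by measuring how much adjacent regions disagree. The sketch below uses a Dirichlet-style energy over the region graph. This particular measure is an illustrative choice, not something the paper is stated to report.

```python
def routing_dirichlet_energy(plan, edges):
    """Sum of squared differences between routing distributions of adjacent regions.

    plan: M x E routing matrix; each row is normalised to a distribution first.
    edges: list of (i, j) index pairs from the region graph.
    """
    dist = [[p / sum(row) for p in row] for row in plan]
    return sum(
        sum((x - y) ** 2 for x, y in zip(dist[i], dist[j])) for i, j in edges
    )

# Chain graph 0-1-2-3; both plans send exactly half of the mass to each expert.
edges = [(0, 1), (1, 2), (2, 3)]
coherent = [[0.2, 0.05], [0.2, 0.05], [0.05, 0.2], [0.05, 0.2]]
scattered = [[0.2, 0.05], [0.05, 0.2], [0.2, 0.05], [0.05, 0.2]]
e_coherent = routing_dirichlet_energy(coherent, edges)
e_scattered = routing_dirichlet_energy(scattered, edges)
```

The two toy plans are equally balanced across experts, yet the scattered one has three times the energy of the coherent one. Balance and spatial coherence are therefore separate properties, and the graph term targets only the second.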

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: "region-to-expert assignment formulated as entropic optimal transport (Sinkhorn) with explicit per-slide capacity marginals, enforcing balanced expert utilisation without auxiliary load-balancing losses"

IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking · unclear
Paper passage: "graph-regularised Sinkhorn iterations that diffuse routing assignments over the spatial region graph"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

