HDMoE: A Hierarchical Decoupling-Fusion Mixture-of-Experts Framework for Multimodal Cancer Survival Prediction
Pith reviewed 2026-05-21 06:02 UTC · model grok-4.3
The pith
A two-level mixture-of-experts model with random feature reorganization removes redundant multimodal information and captures fine-grained intra- and inter-modality interactions to improve cancer survival prediction.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
HDMoE uses shared and routed experts in a first-level MoE to remove redundant information and extract fine-grained, modality-specific features. A second-level MoE then performs fine-grained inter-modality feature decoupling. Random feature reorganization modules placed after each MoE level fuse intra- and inter-modality features, which the authors credit with capturing finer-grained relationships and improving survival prediction on a private liver cancer dataset and three TCGA datasets.
What carries the argument
Two-level Mixture-of-Experts (MoE) structure with Random Feature Reorganization (RFR) modules that hierarchically decouple redundant modality information and fuse local intra- and inter-modality interactions.
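To make the shared/routed split concrete, here is a minimal pure-Python sketch of a single MoE layer in which shared experts always run and a softmax gate selects the top-k routed experts for each input. It is an illustration under stated assumptions, not the authors' implementation: the class name `SharedRoutedMoE`, the linear experts, the top-k softmax gate, and all sizes are hypothetical, because the abstract does not specify the expert or router design.

```python
import math
import random


def linear(weights, x):
    """Apply a weight matrix (a list of rows) to a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def make_matrix(rng, rows, cols):
    """Small random weights; a stand-in for trained parameters."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


class SharedRoutedMoE:
    """Hypothetical layer: shared experts always fire, routed experts are top-k gated."""

    def __init__(self, dim, n_shared=1, n_routed=4, top_k=2, seed=0):
        rng = random.Random(seed)
        self.shared = [make_matrix(rng, dim, dim) for _ in range(n_shared)]
        self.routed = [make_matrix(rng, dim, dim) for _ in range(n_routed)]
        self.gate = make_matrix(rng, n_routed, dim)  # one gate row per routed expert
        self.top_k = top_k

    def route(self, x):
        """Return {expert_index: weight} for the top-k routed experts."""
        probs = softmax(linear(self.gate, x))
        ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
        top = ranked[: self.top_k]
        norm = sum(probs[i] for i in top)
        return {i: probs[i] / norm for i in top}  # renormalise over the selected experts

    def forward(self, x):
        out = [0.0] * len(x)
        for weights in self.shared:  # shared path: applied to every input
            out = [o + y for o, y in zip(out, linear(weights, x))]
        for i, g in self.route(x).items():  # routed path: only the selected experts
            out = [o + g * y for o, y in zip(out, linear(self.routed[i], x))]
        return out
```

In a two-level arrangement, one such layer could be applied within each modality and another across modalities; how HDMoE actually wires the two levels together is defined by the paper and its released code, not by this sketch.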
If this is right
- Redundant modality information is stripped before decoupling, leading to cleaner feature separation.
- Fine-grained specific features are extracted within each modality rather than treating features uniformly.
- Local intra- and inter-modality interactions are explicitly modeled through the RFR fusion steps.
- Overall survival prediction accuracy increases on both private liver cancer and public TCGA multimodal cohorts.
Where Pith is reading between the lines
- The same hierarchical decoupling pattern could be tested on other multimodal medical tasks such as treatment-response prediction or disease subtyping.
- If the RFR modules prove robust, they might serve as drop-in replacements for standard fusion layers in non-medical multimodal settings like video-text or sensor fusion.
- If the routed experts are sparsely activated, scaling their number or adding dynamic routing could add capacity at little extra per-sample compute; whether the reported accuracy gains persist would need to be tested.
- External validation on more diverse patient populations would clarify whether the observed improvements generalize beyond the current training distributions.
Load-bearing premise
The hierarchical MoE and RFR modules will consistently reduce redundancy and model fine-grained relationships better than existing methods without overfitting or producing dataset-specific artifacts.
What would settle it
Failure of HDMoE to outperform prior decoupling-fusion baselines on a fresh, independent multimodal cancer dataset with different imaging and genomic characteristics would falsify the central effectiveness claim.
Original abstract
Multimodal survival prediction, a crucial yet challenging task, demands the integration of multimodal medical data (e.g., Whole Slide Images (WSIs) and Genomic Profiles) to achieve accurate prognostic modeling. Given the inherent heterogeneity across modalities, the feature decoupling-fusion paradigm has emerged as a dominant approach. However, these methods have the following shortcomings: (1) fail to reduce the redundant information of modality features before decoupling, which negatively affects the feature decoupling and fusion effect; (2) lack the ability to model the fine-grained relationships of the features and capture the local information interactions between intra- and inter-modality features. To address these issues, we propose a Hierarchical Decoupling-Fusion Mixture-of-Experts (HDMoE) framework with two levels of MoE and Random Feature Reorganization (RFR) modules. In the first-level MoE, shared experts and routed experts are employed to remove redundant information and extract fine-grained specific features within each modality, while the second-level MoE facilitates fine-grained inter-modality feature decoupling. Besides, we design two RFR modules following each level of MoE to finely fuse intra- and inter-modality features, which can help the model capture more fine-grained relationships between modalities. Extensive experimental results on our private Liver Cancer (LC) and three TCGA public datasets confirm the effectiveness of our proposed method. Codes are available at https://github.com/ZJUMAI/HDMoE.
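The abstract does not spell out how an RFR module reorganizes features, so the following is only one plausible reading, assumed purely for illustration: pool the feature vectors from two sources, shuffle them at random, split them into mixed groups, and fuse each group. The function names, the round-robin grouping, and the mean-pool fusion are all hypothetical and may differ from what the paper and its code do.

```python
import random


def random_feature_reorganization(features_a, features_b, n_groups, seed=None):
    """Pool two lists of feature vectors, shuffle them, and deal them into n_groups.

    Each item keeps a source tag ("a" or "b") so the mixing can be inspected.
    This is an assumed mechanism, not the paper's definition of RFR.
    """
    rng = random.Random(seed)
    pooled = [("a", f) for f in features_a] + [("b", f) for f in features_b]
    rng.shuffle(pooled)
    # Round-robin split: group i takes items i, i + n_groups, i + 2 * n_groups, ...
    return [pooled[i::n_groups] for i in range(n_groups)]


def mean_pool(group):
    """Fuse one reorganized group by averaging its vectors (a placeholder fusion)."""
    vectors = [f for _, f in group]
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]
```

Under this reading, each fused vector mixes local information from both sources, which is the kind of intra- and inter-modality interaction the abstract attributes to RFR.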
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces HDMoE, a hierarchical decoupling-fusion Mixture-of-Experts framework for multimodal cancer survival prediction from WSIs and genomic profiles. It uses two levels of MoE (shared/routed experts at level 1 for intra-modality redundancy reduction and fine-grained feature extraction; level 2 for inter-modality decoupling) plus RFR modules after each level to capture local intra- and inter-modality interactions, addressing shortcomings of prior decoupling-fusion methods. Effectiveness is asserted via experiments on a private Liver Cancer (LC) dataset and three TCGA public datasets.
Significance. If the claims hold, the two-level MoE plus RFR design could provide a principled way to reduce modality redundancy and model fine-grained interactions in heterogeneous medical data, potentially improving survival prediction accuracy over existing fusion baselines. The availability of code is a positive for reproducibility.
major comments (3)
- [Experiments] Experimental section: the manuscript reports superior performance on the private LC and TCGA datasets but provides no information on data splits, patient counts, censoring rates, cross-validation procedure, or statistical tests. Without these, it is impossible to determine whether reported gains reflect the hierarchical structure or dataset-specific artifacts and extra capacity.
- [Method] §3 (Method): the central mechanistic claim—that level-1 MoE removes redundant modality information and level-2 MoE plus RFR captures localized intra-/inter-modality interactions—lacks supporting diagnostics such as feature mutual information before/after each stage or attention visualizations. Absent these, gains could be explained by increased expressivity rather than the asserted decoupling-fusion benefits.
- [Ablation Studies] Table or results section: no ablation studies isolating the contribution of the two-level hierarchy versus a single-level MoE or standard fusion baselines are described, undermining the claim that the specific architecture is responsible for improvements.
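One inexpensive way to probe the redundancy claim raised in the Method comment is to compare how similar feature vectors are before and after a stage. The sketch below computes the mean absolute pairwise cosine similarity in pure Python. The function names are hypothetical, and cosine similarity is only a rough proxy for redundancy; it is not the mutual-information estimate the referee mentions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def mean_pairwise_cosine(features):
    """Average |cosine| over all distinct pairs; higher suggests more redundancy."""
    n = len(features)
    if n < 2:
        raise ValueError("need at least two feature vectors")
    total = 0.0
    count = 0
    for i in range(n):
        for j in range(i + 1, n):
            total += abs(cosine(features[i], features[j]))
            count += 1
    return total / count
```

A drop in this score between the input and output of the first-level MoE would be consistent with redundancy reduction, although it would not by itself rule out the capacity explanation.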
minor comments (2)
- [Method] Notation for the RFR module and expert routing could be clarified with explicit equations showing how reorganization occurs after each MoE level.
- [Abstract] The abstract should include concrete metrics (e.g., C-index deltas) and name the three TCGA cohorts to allow quick assessment of scope.
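For readers unfamiliar with the metric named in the last comment, the concordance index (C-index) is the fraction of comparable patient pairs that a model ranks in the correct order. Below is a minimal sketch of Harrell's C-index, assuming higher risk scores should correspond to shorter survival. The function name and argument layout are illustrative, and pairs with tied survival times are simply skipped.

```python
def concordance_index(times, events, risks):
    """Harrell's C-index for right-censored survival data.

    times  : observed follow-up time per patient
    events : 1 if the event was observed, 0 if the patient was censored
    risks  : predicted risk score per patient (higher = expected earlier event)
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a pair is only comparable if the earlier time is an observed event
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0  # earlier event has the higher predicted risk
                elif risks[i] == risks[j]:
                    concordant += 0.5  # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable
```

A C-index delta between two models is then the difference of their scores computed on the same patients.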
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback. The comments highlight important areas for improving reproducibility, mechanistic support, and validation of our architectural contributions. We address each point below and will revise the manuscript to incorporate the suggested additions.
Point-by-point responses
- Referee: [Experiments] Experimental section: the manuscript reports superior performance on the private LC and TCGA datasets but provides no information on data splits, patient counts, censoring rates, cross-validation procedure, or statistical tests. Without these, it is impossible to determine whether reported gains reflect the hierarchical structure or dataset-specific artifacts and extra capacity.
  Authors: We agree that these details are essential for proper evaluation and reproducibility. In the revised manuscript, we will add a comprehensive Experimental Setup subsection specifying patient counts for the private LC dataset and each TCGA cohort, censoring rates, the stratified 5-fold cross-validation procedure, train/validation/test splits, and statistical significance testing with reported p-values (e.g., paired t-tests on per-fold C-index and log-rank tests on risk-stratified survival curves).
  Revision: yes
- Referee: [Method] §3 (Method): the central mechanistic claim, that level-1 MoE removes redundant modality information and level-2 MoE plus RFR captures localized intra-/inter-modality interactions, lacks supporting diagnostics such as feature mutual information before/after each stage or attention visualizations. Absent these, gains could be explained by increased expressivity rather than the asserted decoupling-fusion benefits.
  Authors: We acknowledge the need for direct evidence supporting the mechanistic claims. We will add attention visualizations from the MoE experts and RFR modules to the revised main paper or supplementary material. We will also include quantitative diagnostics such as pairwise feature similarity (cosine) and estimated mutual information before and after each hierarchical stage to demonstrate redundancy reduction and fine-grained interaction capture. These additions will help distinguish the benefits of the proposed design from general capacity increases.
  Revision: yes
- Referee: [Ablation Studies] Table or results section: no ablation studies isolating the contribution of the two-level hierarchy versus a single-level MoE or standard fusion baselines are described, undermining the claim that the specific architecture is responsible for improvements.
  Authors: We agree that targeted ablations are necessary to substantiate the value of the two-level hierarchy. We will introduce a new ablation table comparing the full HDMoE against (i) single-level MoE variants, (ii) the model without RFR modules, and (iii) standard fusion baselines (early concatenation, late fusion, and attention-based fusion). Performance deltas on the LC and TCGA datasets will be reported to isolate the contribution of each component.
  Revision: yes
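The evaluation protocol promised in the first response (stratified 5-fold cross-validation plus paired t-tests) can be sketched in a few lines. The version below is a minimal pure-Python illustration: it stratifies folds by censoring status and computes a paired t statistic over per-fold scores. The function names are hypothetical, stratifying on censoring status is an assumption about what "stratified" means here, and the p-value lookup is left to a statistics library.

```python
import math
import random


def stratified_folds(events, n_folds=5, seed=0):
    """Assign patient indices to folds, balancing censored (0) and observed (1) cases."""
    rng = random.Random(seed)
    folds = [[] for _ in range(n_folds)]
    for status in (0, 1):
        indices = [i for i, e in enumerate(events) if e == status]
        rng.shuffle(indices)
        for k, i in enumerate(indices):
            folds[k % n_folds].append(i)  # deal each stratum round-robin across folds
    return folds


def paired_t_statistic(scores_a, scores_b):
    """Paired t statistic and degrees of freedom for per-fold scores of two models."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    if n < 2:
        raise ValueError("need at least two paired scores")
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)  # sample variance of differences
    if var == 0.0:
        raise ValueError("differences have zero variance")
    return mean / math.sqrt(var / n), n - 1
```

With only five folds the test has four degrees of freedom, so small per-fold gains can easily fail to reach significance. That is one reason the referee's request for the full protocol matters.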
Circularity Check
No circularity: empirical validation of a novel architectural design with no derivation chain
full rationale
The paper introduces HDMoE as a new hierarchical framework combining two levels of Mixture-of-Experts with Random Feature Reorganization modules to address shortcomings in prior decoupling-fusion methods for multimodal survival prediction. No equations, derivations, or first-principles results are presented that could reduce any claimed prediction or benefit to fitted parameters or self-referential inputs by construction. Effectiveness is asserted via experimental results on external private LC and public TCGA datasets rather than any internal mathematical reduction. No self-citation load-bearing steps, uniqueness theorems, or ansatzes imported from prior author work appear in the provided text; the central claims rest on the proposed design's empirical performance, which remains independently falsifiable on held-out data.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: feature decoupling-fusion is a dominant paradigm for multimodal survival prediction but has specific shortcomings in redundancy reduction and fine-grained modeling.