Plug-and-Play Logit Fusion for Heterogeneous Pathology Foundation Models
Pith reviewed 2026-05-10 17:47 UTC · model grok-4.3
The pith
LogitProd fuses logits from any set of pathology foundation models using learned sample weights to match or exceed the best single model without retraining encoders.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Treating independently trained heterogeneous pathology foundation models as fixed experts and learning sample-adaptive weights for a weighted product of their logits yields a combined predictor whose training objective value is guaranteed to be no worse than that of the best expert, and this construction delivers measurable gains across diverse pathology benchmarks without encoder modification.
What carries the argument
LogitProd, a post-encoder fusion that multiplies logits after weighting them by learned sample-specific scalars derived from the models' outputs.
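One way to read the weighted product is sketched below. It is an illustration under assumptions, not the paper's implementation: it assumes each expert's logits are turned into class probabilities with a softmax, that the fused probability is proportional to the product of expert probabilities raised to non-negative weights, and that the weights are simply passed in rather than learned. The names `log_softmax` and `fuse_weighted_product` are hypothetical.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of class logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]

def fuse_weighted_product(expert_logits, weights):
    """Fuse K frozen experts for one sample (hypothetical helper).

    expert_logits: K lists of C class logits, one list per expert.
    weights: K non-negative weights for this sample (assumed given here).
    Returns fused class probabilities, p(c) proportional to prod_k p_k(c)**w_k.
    """
    num_classes = len(expert_logits[0])
    # A weighted product of probabilities is a weighted sum of log-probabilities.
    fused_log = [0.0] * num_classes
    for logits, w in zip(expert_logits, weights):
        for c, lp in enumerate(log_softmax(logits)):
            fused_log[c] += w * lp
    # Renormalize so the fused scores form a probability distribution.
    return [math.exp(v) for v in log_softmax(fused_log)]

# Illustrative logits only; these are not values from the paper.
experts = [[2.0, 0.5, -1.0], [0.1, 1.2, 0.3]]
print(fuse_weighted_product(experts, [0.7, 0.3]))
# A one-hot weight vector reproduces the selected expert's probabilities.
print(fuse_weighted_product(experts, [1.0, 0.0]))
```

Working in log space avoids underflow when probabilities are small, and it makes the single-expert case easy to see: a weight of 1 on one expert and 0 on the rest leaves only that expert's log-probabilities.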
If this is right
- Any collection of existing pathology models can be combined into a stronger predictor without retraining the backbones or aligning their feature spaces.
- Training cost for the fusion step stays about twelve times lower than that of feature-fusion methods while still producing higher average accuracy.
- The performance guarantee ensures the combined model never falls below the best expert under the training objective.
- Exhaustive per-task model selection becomes unnecessary because a single fusion step can upgrade performance across many endpoints.
Where Pith is reading between the lines
- The same logit-weighting idea could be tested on models trained for different imaging modalities to check whether the guarantee survives domain shifts.
- If the method extends to continuous outputs, it might reduce the need for separate survival or regression models in clinical pipelines.
- Applying the fusion to an expanding library of models would allow incremental gains without repeated full-model validation.
Load-bearing premise
Logits from separately trained heterogeneous models contain compatible information that can be combined directly by learned weights without any feature alignment or encoder updates.
What would settle it
If the fused model underperforms the strongest single expert on more than a few of the 22 benchmarks or if the gains vanish when all models are trained on identical data, the central claim would not hold.
Original abstract
Pathology foundation models (FMs) have become central to computational histopathology, offering strong transfer performance across a wide range of diagnostic and prognostic tasks. The rapid proliferation of pathology foundation models creates a model-selection bottleneck: no single model is uniformly best, yet exhaustively adapting and validating many candidates for each downstream endpoint is prohibitively expensive. We address this challenge with a lightweight and novel model fusion strategy, LogitProd, which treats independently trained FM-based predictors as fixed experts and learns sample-adaptive fusion weights over their slide-level outputs. The fusion operates purely on logits, requiring no encoder retraining and no feature-space alignment across heterogeneous backbones. We further provide a theoretical analysis showing that the optimal weighted product fusion is guaranteed to perform at least as well as the best individual expert under the training objective. We systematically evaluate LogitProd on 22 benchmarks spanning WSI-level classification, tile-level classification, gene mutation prediction, and discrete-time survival modeling. LogitProd ranks first on 20/22 tasks and improves the average performance across all tasks by ~3% over the strongest single expert. LogitProd enables practitioners to upgrade heterogeneous FM-based pipelines in a plug-and-play manner, achieving multi-expert gains with ~12x lower training cost than feature-fusion alternatives.
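The abstract does not say how the sample-adaptive weights are produced. As a purely hypothetical illustration, the sketch below uses a linear gate that maps the concatenated expert logits of one sample to one weight per expert. The gate form, the softmax normalization (weights summing to 1), and the names `gate_weights`, `gate_matrix`, and `gate_bias` are all assumptions made for this example.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def gate_weights(expert_logits, gate_matrix, gate_bias):
    """Hypothetical linear gate: concatenated expert logits of one sample
    are mapped to one non-negative weight per expert (weights sum to 1)."""
    features = [v for logits in expert_logits for v in logits]  # length K*C
    scores = [
        sum(w * x for w, x in zip(row, features)) + b
        for row, b in zip(gate_matrix, gate_bias)
    ]
    return softmax(scores)

# Illustrative logits only: K=2 experts, C=3 classes.
experts = [[2.0, 0.5, -1.0], [0.1, 1.2, 0.3]]
zero_gate = [[0.0] * 6 for _ in range(2)]
print(gate_weights(experts, zero_gate, [0.0, 0.0]))  # all-zero gate: equal weights
print(gate_weights(experts, zero_gate, [5.0, 0.0]))  # bias favouring expert 0
```

In this sketch only the gate parameters would be trained; the expert logits are inputs computed once, which is consistent with the claim that the encoders stay frozen.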
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces LogitProd, a plug-and-play fusion method for heterogeneous pathology foundation models. It learns sample-adaptive weights to perform weighted product fusion directly on slide-level logits from fixed, independently trained experts, without feature alignment or encoder retraining. A theoretical analysis claims that the optimal weighted product fusion is guaranteed to perform at least as well as the best individual expert under the training objective. Systematic experiments on 22 benchmarks (WSI classification, tile classification, gene mutation prediction, and survival modeling) show LogitProd ranking first on 20/22 tasks with an average ~3% gain over the strongest single expert and ~12x lower training cost than feature-fusion baselines.
Significance. If the theoretical guarantee holds for the learned weights and the empirical gains prove robust, the work provides a practical, low-cost solution to the model-selection bottleneck created by proliferating pathology FMs. The plug-and-play design and non-inferiority guarantee would allow practitioners to combine existing models without expensive adaptation, representing a meaningful engineering contribution to computational histopathology.
major comments (2)
- [theoretical analysis] Theoretical analysis section: the guarantee that optimal weighted product fusion matches or exceeds the best expert appears to hold by construction when weights can recover any single expert. However, the manuscript must demonstrate (via bound, convergence argument, or ablation) that the sample-adaptive learner reaches a configuration sufficiently close to this optimum when logits from heterogeneous, unaligned models differ in scale, variance, and calibration. This is load-bearing for the central claim that the implemented LogitProd inherits the theoretical guarantee.
- [experimental evaluation] Experimental evaluation (22 benchmarks): the reported ~3% average improvement and 20/22 first-place ranking require explicit reporting of per-task metrics with standard deviations, number of runs, and statistical tests (e.g., paired t-tests or Wilcoxon) against the strongest single expert. Without these, it is unclear whether the gains are consistent or sensitive to the choice of fusion-weight optimizer and temperature scaling (or lack thereof).
minor comments (2)
- [abstract] The abstract states 'no feature-space alignment' yet the method operates on logits; a brief note on whether any implicit per-expert normalization is applied would improve clarity.
- Table or figure reporting the 22 benchmarks should include the exact metric used for each task (AUROC, C-index, etc.) to allow direct comparison with prior work.
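The question about implicit per-expert normalization can be made concrete with temperature scaling, which the report mentions as one option. The sketch below is illustrative only: the two experts and their logits are invented, and nothing here is claimed about what LogitProd actually applies. It shows that experts can agree on the class ranking while differing widely in logit scale, and that dividing by a temperature T changes how peaked the resulting probabilities are.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Softmax of logits / T; T > 1 flattens, T < 1 sharpens the distribution."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [v / temperature for v in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Two hypothetical experts with the same class ranking but different logit scale.
confident = [8.0, 2.0, -4.0]
cautious = [0.8, 0.2, -0.4]

print(softmax_with_temperature(confident))        # very peaked
print(softmax_with_temperature(cautious))         # much flatter
# Dividing the confident expert's logits by T=10 puts both on the same scale.
print(softmax_with_temperature(confident, 10.0))
```

Without some such rescaling, an expert with large-magnitude logits can dominate a fusion rule regardless of how accurate it is, which is why the reporting of normalization choices matters.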
Simulated Author's Rebuttal
We sincerely thank the referee for the constructive and detailed feedback. We address each major comment point-by-point below, outlining clarifications and planned revisions to strengthen the manuscript.
- Referee: Theoretical analysis section: the guarantee that optimal weighted product fusion matches or exceeds the best expert appears to hold by construction when weights can recover any single expert. However, the manuscript must demonstrate (via bound, convergence argument, or ablation) that the sample-adaptive learner reaches a configuration sufficiently close to this optimum when logits from heterogeneous, unaligned models differ in scale, variance, and calibration. This is load-bearing for the central claim that the implemented LogitProd inherits the theoretical guarantee.
  Authors: We thank the referee for this important observation. The theoretical guarantee applies to the optimal weights, which can recover any single expert by construction (setting its weight to 1 and others to 0). For the learned sample-adaptive weights, the optimization directly targets the fused likelihood, and empirical results show strong performance. To address the concern for heterogeneous logits, we will add to the revised manuscript: (i) an ablation comparing LogitProd to an oracle optimum (weights solved post-hoc on validation data), (ii) histograms of learned weight distributions demonstrating preference for stronger experts, and (iii) discussion of implicit handling of scale via the product formulation and optional temperature scaling. A formal convergence bound is challenging due to non-convexity, but the added empirical evidence will support that the learner approaches the guarantee in practice.
  Revision: yes
- Referee: Experimental evaluation (22 benchmarks): the reported ~3% average improvement and 20/22 first-place ranking require explicit reporting of per-task metrics with standard deviations, number of runs, and statistical tests (e.g., paired t-tests or Wilcoxon) against the strongest single expert. Without these, it is unclear whether the gains are consistent or sensitive to the choice of fusion-weight optimizer and temperature scaling (or lack thereof).
  Authors: We agree that greater statistical detail is needed to substantiate the empirical claims. The current results reflect single-run evaluations per task (due to the scale of the 22 benchmarks). In the revision, we will expand the experimental section to include: per-task metrics with mean and standard deviation over 5 independent runs (varying optimizer seeds), explicit reporting of temperature scaling (default 1.0, with sensitivity analysis), and paired Wilcoxon signed-rank tests against the strongest single expert for each task. These will be added to the main results table and supplementary material to demonstrate consistency and robustness.
  Revision: yes
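The paired test proposed in the rebuttal can be sketched in a few lines. The function below is a minimal, self-contained exact two-sided Wilcoxon signed-rank test written for illustration; the name `wilcoxon_signed_rank_exact` and the scores are hypothetical and are not results from the paper. In practice a library routine such as `scipy.stats.wilcoxon` would normally be used instead.

```python
def wilcoxon_signed_rank_exact(fused, baseline):
    """Exact two-sided Wilcoxon signed-rank test for paired scores.

    Zero differences are dropped; tied |differences| get average ranks.
    Returns (w_plus, p_value). Intended for small n (a handful of runs or tasks).
    """
    diffs = [a - b for a, b in zip(fused, baseline) if a != b]
    n = len(diffs)
    if n == 0:
        return 0.0, 1.0
    # Average ranks of |d|, stored doubled so that they stay integers.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks2 = [0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        for k in range(i, j + 1):
            ranks2[order[k]] = (i + 1) + (j + 1)  # 2 * average rank
        i = j + 1
    total2 = sum(ranks2)
    w_plus2 = sum(r for r, d in zip(ranks2, diffs) if d > 0)
    # Null distribution: each rank joins W+ independently with probability 1/2.
    # counts[s] = number of sign assignments whose doubled W+ equals s.
    counts = [0] * (total2 + 1)
    counts[0] = 1
    for r in ranks2:
        for s in range(total2, r - 1, -1):
            counts[s] += counts[s - r]
    extreme = min(w_plus2, total2 - w_plus2)
    tail = sum(counts[: extreme + 1])
    p_value = min(1.0, 2.0 * tail / 2 ** n)
    return w_plus2 / 2.0, p_value

# Made-up scores for 5 paired runs (fused model vs. strongest single expert).
fused_scores = [0.91, 0.88, 0.93, 0.90, 0.92]
expert_scores = [0.89, 0.87, 0.90, 0.86, 0.915]
print(wilcoxon_signed_rank_exact(fused_scores, expert_scores))
```

One design point worth noting: with only five pairs, the smallest two-sided exact p-value this test can return is 2/2^5 = 0.0625, reached when the fused model wins every run. A per-task test over 5 runs therefore cannot fall below 0.05, whereas a test across the 22 tasks has far more resolution.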
Circularity Check
Optimal weighted product fusion guarantee reduces to single-expert recovery by construction
specific steps
- self-definitional [theoretical analysis (abstract)]
  "We further provide a theoretical analysis showing that the optimal weighted product fusion is guaranteed to perform at least as well as the best individual expert under the training objective."
  The guarantee holds by construction: the weighted product fusion includes the case of using only the best expert (via weight assignment of 1 to the best and 0 to others), so the optimal fusion performance is at least as good as the best expert by the definition of optimality, without additional mathematical content.
full rationale
The paper's central theoretical claim states that optimal weighted product fusion is guaranteed to perform at least as well as the best individual expert under the training objective. This follows directly from the definition of the fusion operator, which can recover any single expert by assigning full weight to one model and zero to others. The result is therefore equivalent to the input assumption that individual experts are available, rather than a non-trivial derivation. Empirical results on 22 benchmarks remain independent, but the load-bearing guarantee itself is self-definitional. No self-citations, ansatzes, or fitted predictions are invoked in the abstract description.
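The by-construction argument fits in one line. The notation is assumed for this sketch and does not come from the paper: $f_w$ is the fused predictor with weights $w$, $\mathcal{W}$ is the set of admissible weights, $e_k$ is the one-hot vector selecting expert $k$, $f_k$ is expert $k$ alone, and $\mathcal{L}$ is the training objective.

```latex
\[
  e_k \in \mathcal{W} \ \text{and}\ f_{e_k} = f_k \ \text{for all } k
  \quad\Longrightarrow\quad
  \min_{w \in \mathcal{W}} \mathcal{L}(f_w)
  \;\le\; \min_{k} \mathcal{L}(f_{e_k})
  \;=\; \min_{k} \mathcal{L}(f_k).
\]
```

The inequality only says that a minimum over a larger set cannot exceed a minimum over a subset. It concerns the exact minimizer of the training objective, so on its own it says nothing about the weights a non-convex optimizer actually finds, nor about held-out performance.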
Axiom & Free-Parameter Ledger
free parameters (1)
- sample-adaptive fusion weights
axioms (1)
- domain assumption: Independently trained foundation models can be treated as fixed experts with no need for encoder retraining or feature alignment.