arxiv: 2604.07779 · v2 · submitted 2026-04-09 · 💻 cs.CV

Plug-and-Play Logit Fusion for Heterogeneous Pathology Foundation Models

Pith reviewed 2026-05-10 17:47 UTC · model grok-4.3

classification 💻 cs.CV
keywords pathology foundation models · logit fusion · model ensemble · computational histopathology · plug-and-play fusion · heterogeneous models · slide-level prediction

The pith

LogitProd fuses the logits of any set of pathology foundation models with learned sample-adaptive weights, aiming to match or exceed the best single model without retraining any encoder.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Pathology foundation models differ in performance across tasks, yet choosing or adapting one for every new diagnostic or prognostic endpoint is costly. The paper presents a fusion approach that treats each model as a fixed expert and learns to combine their slide-level logits through sample-adaptive weights. This operates entirely after the encoders, requires no feature alignment, and carries a theoretical guarantee that the optimal fusion will perform at least as well as the strongest expert under the training loss. On 22 benchmarks covering classification, mutation prediction, and survival modeling, the fused predictor ranks first on 20 tasks and lifts average performance by roughly three percent over the strongest individual model while using far less training compute than feature-level alternatives.

Core claim

Treating independently trained heterogeneous pathology foundation models as fixed experts and learning sample-adaptive weights for a weighted product of their logits yields a combined predictor whose training objective value is guaranteed to be no worse than that of the best expert, and this construction delivers measurable gains across diverse pathology benchmarks without encoder modification.

What carries the argument

LogitProd, a post-encoder fusion that combines the experts' logits through a weighted product, with sample-specific weights learned from the experts' own outputs.
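
One common reading of "weighted product fusion" is a product of experts in probability space: each expert's class distribution is raised to its weight, the results are multiplied, and the product is renormalized. In log space this is a weighted sum of log-probabilities. The sketch below assumes that reading. The function names, the softmax normalization of each expert, and the sum-to-one weights are illustrative assumptions, not the paper's confirmed formulation. It handles a single sample and takes the weights as given rather than learning them.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def weighted_product_fusion(expert_logits, weights):
    """Fuse experts as p(y) proportional to prod_k p_k(y) ** w_k.

    expert_logits: list of K lists, each holding C class logits for one sample.
    weights: K non-negative per-sample weights (assumed here to sum to 1).
    Returns the fused class probabilities as a list of C floats.
    """
    num_classes = len(expert_logits[0])
    fused_log = [0.0] * num_classes
    for logits, w in zip(expert_logits, weights):
        # Weighted sum of log-probabilities equals the log of the weighted product.
        for c, log_p in enumerate(log_softmax(logits)):
            fused_log[c] += w * log_p
    # Renormalize so the fused scores form a probability distribution.
    return [math.exp(v) for v in log_softmax(fused_log)]
```

Under these assumptions, putting all the weight on one expert returns that expert's own distribution unchanged, which is the property the non-inferiority argument relies on.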

If this is right

  • Any collection of existing pathology models can be combined into a stronger predictor without retraining the backbones or aligning their feature spaces.
  • Training the fusion step costs roughly 12× less than feature-fusion alternatives while still delivering multi-expert gains.
  • The guarantee ensures that the optimally weighted fusion never falls below the best expert under the training objective.
  • Exhaustive per-task model selection becomes unnecessary because a single fusion step can upgrade performance across many endpoints.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same logit-weighting idea could be tested on models trained for different imaging modalities to check whether the guarantee survives domain shifts.
  • If the method extends to continuous outputs, it might reduce the need for separate survival or regression models in clinical pipelines.
  • Applying the fusion to an expanding library of models would allow incremental gains without repeated full-model validation.

Load-bearing premise

Logits from separately trained heterogeneous models contain compatible information that can be combined directly by learned weights without any feature alignment or encoder updates.

What would settle it

If the fused model underperforms the strongest single expert on more than a few of the 22 benchmarks or if the gains vanish when all models are trained on identical data, the central claim would not hold.

Figures

Figures reproduced from arXiv: 2604.07779 by Anqi Li, Beidi Zhao, Gang Wang, Gexin Huang, Xiaoxiao Li, Yusheng Tan, Zu-Hua Gao.

Figure 1: Overview of LogitProd. Frozen FM experts output logits; LogitProd derives confidence/entropy/disagreement cues to predict sample-adaptive weights and fuses experts via weighted product fusion, enabling efficient multi-task prediction without re-encoding or feature alignment.
Figure 2: Evaluation across 22 pathology tasks. a–f, Gene mutation prediction (mAUC): a, mean across five genes; b–f, per-gene performance with prevalence. g–i, TIL classification (AUC) across six datasets. j, WSI-level tumour diagnosis (AUC). k, Breast carcinoma subtyping (AUC). l–q, C-index distributions across six TCGA cohorts for all FM-based experts and LogitProd. Box plots summarize cross-validation folds. r, …
Original abstract

Pathology foundation models (FMs) have become central to computational histopathology, offering strong transfer performance across a wide range of diagnostic and prognostic tasks. The rapid proliferation of pathology foundation models creates a model-selection bottleneck: no single model is uniformly best, yet exhaustively adapting and validating many candidates for each downstream endpoint is prohibitively expensive. We address this challenge with a lightweight and novel model fusion strategy, LogitProd, which treats independently trained FM-based predictors as fixed experts and learns sample-adaptive fusion weights over their slide-level outputs. The fusion operates purely on logits, requiring no encoder retraining and no feature-space alignment across heterogeneous backbones. We further provide a theoretical analysis showing that the optimal weighted product fusion is guaranteed to perform at least as well as the best individual expert under the training objective. We systematically evaluate LogitProd on 22 benchmarks spanning WSI-level classification, tile-level classification, gene mutation prediction, and discrete-time survival modeling. LogitProd ranks first on 20/22 tasks and improves the average performance across all tasks by ~3% over the strongest single expert. LogitProd enables practitioners to upgrade heterogeneous FM-based pipelines in a plug-and-play manner, achieving multi-expert gains with ~12× lower training cost than feature-fusion alternatives.
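
The Figure 1 caption says the fusion weights are predicted from confidence, entropy, and disagreement cues derived from the expert logits. The sketch below shows one plausible way to compute such cues for a single sample. The exact definitions are assumptions made for illustration (maximum class probability, Shannon entropy, and total-variation distance to the ensemble mean), and the name `reliability_cues` is hypothetical; the paper may define these quantities differently.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reliability_cues(expert_logits):
    """Return one (confidence, entropy, disagreement) tuple per expert.

    expert_logits: list of K lists, each holding C class logits for one sample.
    """
    probs = [softmax(logits) for logits in expert_logits]
    num_experts = len(probs)
    num_classes = len(probs[0])
    # Ensemble mean distribution, used as the reference for disagreement.
    mean_p = [sum(p[c] for p in probs) / num_experts for c in range(num_classes)]
    cues = []
    for p in probs:
        confidence = max(p)
        entropy = -sum(q * math.log(q) for q in p if q > 0.0)
        # Assumed disagreement measure: total-variation distance to the mean.
        disagreement = 0.5 * sum(abs(q - mq) for q, mq in zip(p, mean_p))
        cues.append((confidence, entropy, disagreement))
    return cues
```

A small gating network could map these per-expert cues to sample-adaptive weights. Because the cues depend only on logits, no patch embeddings or encoder passes are needed at fusion time.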

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces LogitProd, a plug-and-play fusion method for heterogeneous pathology foundation models. It learns sample-adaptive weights to perform weighted product fusion directly on slide-level logits from fixed, independently trained experts, without feature alignment or encoder retraining. A theoretical analysis claims that the optimal weighted product fusion is guaranteed to perform at least as well as the best individual expert under the training objective. Systematic experiments on 22 benchmarks (WSI classification, tile classification, gene mutation prediction, and survival modeling) show LogitProd ranking first on 20/22 tasks with an average ~3% gain over the strongest single expert and ~12x lower training cost than feature-fusion baselines.

Significance. If the theoretical guarantee holds for the learned weights and the empirical gains prove robust, the work provides a practical, low-cost solution to the model-selection bottleneck created by proliferating pathology FMs. The plug-and-play design and non-inferiority guarantee would allow practitioners to combine existing models without expensive adaptation, representing a meaningful engineering contribution to computational histopathology.

major comments (2)
  1. [theoretical analysis] Theoretical analysis section: the guarantee that optimal weighted product fusion matches or exceeds the best expert appears to hold by construction when weights can recover any single expert. However, the manuscript must demonstrate (via bound, convergence argument, or ablation) that the sample-adaptive learner reaches a configuration sufficiently close to this optimum when logits from heterogeneous, unaligned models differ in scale, variance, and calibration. This is load-bearing for the central claim that the implemented LogitProd inherits the theoretical guarantee.
  2. [experimental evaluation] Experimental evaluation (22 benchmarks): the reported ~3% average improvement and 20/22 first-place ranking require explicit reporting of per-task metrics with standard deviations, number of runs, and statistical tests (e.g., paired t-tests or Wilcoxon) against the strongest single expert. Without these, it is unclear whether the gains are consistent or sensitive to the choice of fusion-weight optimizer and temperature scaling (or lack thereof).
minor comments (2)
  1. [abstract] The abstract states 'no feature-space alignment' yet the method operates on logits; a brief note on whether any implicit per-expert normalization is applied would improve clarity.
  2. Table or figure reporting the 22 benchmarks should include the exact metric used for each task (AUROC, C-index, etc.) to allow direct comparison with prior work.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We sincerely thank the referee for the constructive and detailed feedback. We address each major comment point-by-point below, outlining clarifications and planned revisions to strengthen the manuscript.

Point-by-point responses
  1. Referee: Theoretical analysis section: the guarantee that optimal weighted product fusion matches or exceeds the best expert appears to hold by construction when weights can recover any single expert. However, the manuscript must demonstrate (via bound, convergence argument, or ablation) that the sample-adaptive learner reaches a configuration sufficiently close to this optimum when logits from heterogeneous, unaligned models differ in scale, variance, and calibration. This is load-bearing for the central claim that the implemented LogitProd inherits the theoretical guarantee.

    Authors: We thank the referee for this important observation. The theoretical guarantee applies to the optimal weights, which can recover any single expert by construction (setting its weight to 1 and others to 0). For the learned sample-adaptive weights, the optimization directly targets the fused likelihood, and empirical results show strong performance. To address the concern for heterogeneous logits, we will add to the revised manuscript: (i) an ablation comparing LogitProd to an oracle optimum (weights solved post-hoc on validation data), (ii) histograms of learned weight distributions demonstrating preference for stronger experts, and (iii) discussion of implicit handling of scale via the product formulation and optional temperature scaling. A formal convergence bound is challenging due to non-convexity, but the added empirical evidence will support that the learner approaches the guarantee in practice. revision: yes

  2. Referee: Experimental evaluation (22 benchmarks): the reported ~3% average improvement and 20/22 first-place ranking require explicit reporting of per-task metrics with standard deviations, number of runs, and statistical tests (e.g., paired t-tests or Wilcoxon) against the strongest single expert. Without these, it is unclear whether the gains are consistent or sensitive to the choice of fusion-weight optimizer and temperature scaling (or lack thereof).

    Authors: We agree that greater statistical detail is needed to substantiate the empirical claims. The current results reflect single-run evaluations per task (due to the scale of the 22 benchmarks). In the revision, we will expand the experimental section to include: per-task metrics with mean and standard deviation over 5 independent runs (varying optimizer seeds), explicit reporting of temperature scaling (default 1.0, with sensitivity analysis), and paired Wilcoxon signed-rank tests against the strongest single expert for each task. These will be added to the main results table and supplementary material to demonstrate consistency and robustness. revision: yes
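
The paired test proposed in the response can be sketched without external libraries. The function below is a hypothetical helper, not code from the paper: it computes an exact two-sided Wilcoxon signed-rank p-value by enumerating every sign assignment, which is only practical for small samples such as a handful of runs.

```python
from itertools import product


def wilcoxon_signed_rank_exact(x, y):
    """Exact two-sided Wilcoxon signed-rank test for small paired samples.

    Zero differences are dropped; tied absolute differences get average ranks.
    Returns (w_plus, p_value), where w_plus is the sum of positive ranks.
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]
    n = len(diffs)
    if n == 0:
        return 0.0, 1.0
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    expected = sum(ranks) / 2
    observed_dev = abs(w_plus - expected)
    # Under the null hypothesis each difference is equally likely to be
    # positive or negative, so enumerate all 2**n sign assignments.
    extreme = 0
    for signs in product((0, 1), repeat=n):
        w = sum(r for r, s in zip(ranks, signs) if s)
        if abs(w - expected) >= observed_dev - 1e-12:
            extreme += 1
    return w_plus, extreme / 2 ** n
```

With five paired runs that all favour one method, the exact two-sided p-value is 2/32 = 0.0625, so five runs per task cannot reach p < 0.05 under this test. More runs, or a test pooled across tasks, would be needed for conventional significance.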

Circularity Check

1 step flagged

Optimal weighted product fusion guarantee reduces to single-expert recovery by construction

specific steps
  1. self-definitional [theoretical analysis (abstract)]
    "We further provide a theoretical analysis showing that the optimal weighted product fusion is guaranteed to perform at least as well as the best individual expert under the training objective."

    The guarantee holds by construction: the weighted product fusion includes the case of using only the best expert (via weight assignment of 1 to the best and 0 to others), so the optimal fusion performance is at least as good as the best expert by the definition of optimality, without additional mathematical content.

full rationale

The paper's central theoretical claim states that optimal weighted product fusion is guaranteed to perform at least as well as the best individual expert under the training objective. This follows directly from the definition of the fusion operator, which can recover any single expert by assigning full weight to one model and zero to others. The result is therefore equivalent to the input assumption that individual experts are available, rather than a non-trivial derivation. Empirical results on 22 benchmarks remain independent, but the load-bearing guarantee itself is self-definitional. No self-citations, ansatzes, or fitted predictions are invoked in the abstract description.
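
The by-construction argument can be written out in a few lines. The notation is assumed for illustration and is not taken from the paper: $K$ experts with predictive distributions $p_k(y \mid x)$, weights $w$ on the probability simplex $\Delta^{K-1}$, $\mathcal{L}(w)$ the training loss of the fused predictor, and $\mathcal{L}_j$ the training loss of expert $j$ alone.

```latex
% Weighted product fusion (assumed form), with the convention p^0 = 1:
p_w(y \mid x)
  = \frac{\prod_{k=1}^{K} p_k(y \mid x)^{w_k}}
         {\sum_{y'} \prod_{k=1}^{K} p_k(y' \mid x)^{w_k}},
  \qquad w \in \Delta^{K-1}.

% A one-hot weight vector e_j recovers expert j exactly, because the
% numerator reduces to p_j(y | x) and the denominator sums to 1:
w = e_j
  \;\Longrightarrow\; p_{e_j}(y \mid x) = p_j(y \mid x)
  \;\Longrightarrow\; \mathcal{L}(e_j) = \mathcal{L}_j.

% Every e_j is feasible, so the optimum over the simplex cannot be worse
% than the best single expert:
\min_{w \in \Delta^{K-1}} \mathcal{L}(w)
  \;\le\; \min_{j} \mathcal{L}(e_j)
  \;=\; \min_{j} \mathcal{L}_j.
```

The same inequality carries over to sample-adaptive weights $w(x)$ whenever the class of weight functions contains the constant maps $x \mapsto e_j$. It bounds only the optimum of the training objective and says nothing about whether a particular learned $w(x)$ attains it, which is the gap the referee's first major comment targets.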

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The method rests on treating pre-trained models as fixed experts whose logits can be fused directly, plus learned sample-adaptive weights as the main free parameters; no new entities are postulated.

free parameters (1)
  • sample-adaptive fusion weights
    Weights learned over the logits of each expert model for each sample during the fusion training phase.
axioms (1)
  • domain assumption: Independently trained foundation models can be treated as fixed experts, with no need for encoder retraining or feature alignment.
    Core premise enabling the plug-and-play logit-only fusion.

pith-pipeline@v0.9.0 · 5553 in / 1366 out tokens · 46756 ms · 2026-05-10T17:47:11.137451+00:00 · methodology
