MMAP: A Multi-Magnification and Prototype-Aware Architecture for Predicting Spatial Gene Expression
Pith reviewed 2026-05-21 20:32 UTC · model grok-4.3
The pith
MMAP predicts spatial gene expression from H&E slides using multi-magnification features and prototype embeddings.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
MMAP addresses two limitations of prior methods at once: insufficient granularity in local feature extraction and inadequate coverage of global spatial context. It combines multi-magnification patch representations, which capture fine-grained histological detail, with a set of latent prototype embeddings that serve as compact representations of slide-level information. The authors report that this design consistently outperforms prior methods on MAE, MSE, and PCC.
What carries the argument
Multi-magnification patch representations for local histological detail combined with learned latent prototype embeddings that compactly encode slide-level global context.
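As a rough illustration of the mechanism named above, the following Python sketch concatenates features from two magnifications and lets the fused vector attend over a small set of prototype vectors using scaled dot-product attention. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names `fuse` and `attend_to_prototypes`, the concatenation-based fusion, the toy vectors, and the omission of learned query/key/value projections are all illustrative choices that this page does not confirm.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse(local_features):
    """Concatenate per-magnification feature vectors (assumed fusion rule)."""
    return [x for feat in local_features for x in feat]


def attend_to_prototypes(patch_feature, prototypes):
    """Scaled dot-product attention of one patch feature over prototypes.

    Returns (attention weights, context vector). Learned projections are
    deliberately left out to keep the sketch short.
    """
    d = len(patch_feature)
    scores = [
        sum(q * k for q, k in zip(patch_feature, proto)) / math.sqrt(d)
        for proto in prototypes
    ]
    weights = softmax(scores)
    context = [
        sum(w * proto[i] for w, proto in zip(weights, prototypes))
        for i in range(d)
    ]
    return weights, context


# Toy inputs (hypothetical values, not taken from the paper).
low_mag = [1.0, 0.0]
high_mag = [0.0, 1.0]
fused = fuse([low_mag, high_mag])
prototypes = [
    [1.0, 0.0, 0.0, 1.0],
    [0.0, 1.0, 1.0, 0.0],
    [0.5, 0.5, 0.5, 0.5],
]
weights, context = attend_to_prototypes(fused, prototypes)
```

In this toy setup the first prototype matches the fused feature most closely, so it receives the largest attention weight, and the context vector is the weighted average of the three prototypes.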
If this is right
- Lower mean absolute error and mean squared error in gene expression regression from image patches.
- Higher Pearson correlation coefficients between predicted and measured expression values.
- More effective use of both fine local tissue details and overall slide context in a single model.
- Reduced reliance on direct spatial transcriptomics measurements for mapping gene activity across tissue.
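For reference, the three metrics named in this list can be computed as in the short Python sketch below. The definitions are the standard ones. How the paper aggregates them (for example per gene across spots, or per slide) is not stated on this page, so the flat list inputs and the function names are assumptions made for illustration.

```python
import math


def mae(pred, true):
    """Mean absolute error."""
    return sum(abs(p - t) for p, t in zip(pred, true)) / len(true)


def mse(pred, true):
    """Mean squared error."""
    return sum((p - t) ** 2 for p, t in zip(pred, true)) / len(true)


def pcc(pred, true):
    """Pearson correlation coefficient; NaN if either input is constant."""
    n = len(true)
    mean_p = sum(pred) / n
    mean_t = sum(true) / n
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(pred, true))
    var_p = sum((p - mean_p) ** 2 for p in pred)
    var_t = sum((t - mean_t) ** 2 for t in true)
    if var_p == 0 or var_t == 0:
        return float("nan")
    return cov / math.sqrt(var_p * var_t)


# Toy expression values for one gene over four spots (made-up numbers).
measured = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 2.5, 3.5, 4.5]
```

Note that a constant offset, as in the toy data, leaves PCC at 1 while MAE and MSE are non-zero, which is one reason the error metrics and the correlation are usually reported together.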
Where Pith is reading between the lines
- The learned prototypes could be examined post-training to surface recurring histological patterns linked to specific gene programs.
- The same multi-scale plus prototype design might transfer to predicting other molecular readouts such as protein levels from routine slides.
- Testing the architecture on tissues with known spatial heterogeneity would reveal whether the global prototypes scale to complex microenvironments.
Load-bearing premise
That multi-magnification local features plus a fixed collection of learned prototype embeddings are enough to overcome the modality gap between histological images and molecular signals without extra biological priors or constraints.
What would settle it
Finding a new paired dataset of whole-slide images and spatial transcriptomics measurements where MMAP shows no improvement or worse results than current baselines on MAE, MSE, or PCC would falsify the claim of consistent superiority.
Original abstract
Spatial Transcriptomics (ST) enables the measurement of gene expression while preserving spatial information, offering critical insights into tissue architecture and disease pathology. Recent developments have explored the use of hematoxylin and eosin (H&E)-stained whole-slide images (WSIs) to predict transcriptome-wide gene expression profiles through deep neural networks. This task is commonly framed as a regression problem, where each input corresponds to a localized image patch extracted from the WSI. However, predicting spatial gene expression from histological images remains a challenging problem due to the significant modality gap between visual features and molecular signals. Recent studies have attempted to incorporate both local and global information into predictive models. Nevertheless, existing methods still suffer from two key limitations: (1) insufficient granularity in local feature extraction, and (2) inadequate coverage of global spatial context. In this work, we propose a novel framework, MMAP (Multi-MAgnification and Prototype-enhanced architecture), that addresses both challenges simultaneously. To enhance local feature granularity, MMAP leverages multi-magnification patch representations that capture fine-grained histological details. To improve global contextual understanding, it learns a set of latent prototype embeddings that serve as compact representations of slide-level information. Extensive experimental results demonstrate that MMAP consistently outperforms all existing state-of-the-art methods across multiple evaluation metrics, including Mean Absolute Error (MAE), Mean Squared Error (MSE), and Pearson Correlation Coefficient (PCC).
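One common way to obtain multi-magnification inputs for a single spot is to take concentric crops of increasing field of view around the spot centre and resize each to the same pixel size, so that larger crops behave like lower magnifications. The Python sketch below computes such crop boxes. It is only an assumption about how this could be done: the abstract does not specify the magnifications, patch sizes, or extraction procedure, and the name `concentric_crops` and the values 224 and (1, 2, 4) are hypothetical.

```python
def concentric_crops(center_x, center_y, base_size, scales):
    """Return one (left, top, right, bottom) box per scale, centred on a spot.

    base_size is the side length in pixels at the finest scale; each scale
    multiplies the field of view. Clipping to slide bounds is omitted.
    """
    boxes = []
    for scale in scales:
        half = base_size * scale // 2
        boxes.append(
            (center_x - half, center_y - half, center_x + half, center_y + half)
        )
    return boxes


# Hypothetical spot at pixel (1000, 1000) with three fields of view.
boxes = concentric_crops(1000, 1000, base_size=224, scales=(1, 2, 4))
```

Each box would then be read from the whole-slide image and resized to a common input size before feature extraction.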
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes MMAP, a Multi-Magnification and Prototype-Aware Architecture for predicting spatial gene expression from H&E-stained whole-slide images. It addresses the modality gap by extracting multi-magnification local patches for fine-grained histological features and learning a fixed set of latent prototype embeddings to capture slide-level global context. The central claim is that this design yields consistent outperformance over prior state-of-the-art methods on regression metrics including MAE, MSE, and PCC.
Significance. If the reported gains prove robust and generalizable, the work could advance computational approaches to spatial transcriptomics by offering a practical way to infer transcriptome-wide profiles directly from routine histological images, with downstream utility in pathology and disease modeling.
major comments (2)
- [Section 4] Section 4 (Experiments): the central empirical claim of consistent outperformance requires explicit reporting of the datasets (number of slides, spots per slide, genes predicted), the exact baselines and their re-implementation details, and statistical testing (e.g., paired t-tests or Wilcoxon with correction). Without these, the superiority on MAE/MSE/PCC cannot be evaluated.
- [Section 3.2] Section 3.2 (Prototype Module): the claim that learned prototypes overcome the modality gap rests on the assumption that a fixed set of embeddings suffices without biological priors; an ablation removing the prototype component (or varying its count) is needed to confirm this is load-bearing rather than incidental to other architectural choices.
minor comments (2)
- [Figure 2] Figure 2: the diagram of multi-magnification feature fusion should include explicit tensor shapes and the precise aggregation operation (concatenation, attention, etc.) to aid reproducibility.
- [Related Work] Related Work: ensure coverage of recent prototype-based or multi-scale methods in spatial transcriptomics prediction published after 2023.
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive comments. We address each major point below and have revised the manuscript to incorporate additional details and experiments as requested.
- Referee: [Section 4] Section 4 (Experiments): the central empirical claim of consistent outperformance requires explicit reporting of the datasets (number of slides, spots per slide, genes predicted), the exact baselines and their re-implementation details, and statistical testing (e.g., paired t-tests or Wilcoxon with correction). Without these, the superiority on MAE/MSE/PCC cannot be evaluated.
  Authors: We agree that greater transparency in the experimental protocol is necessary. In the revised manuscript, Section 4 now includes a dedicated table summarizing dataset characteristics (number of slides, spots per slide, and genes predicted per dataset). We have also expanded the description of baseline re-implementations, specifying that we followed the original papers' architectures and training protocols using publicly available code where possible, with all hyperparameters documented in the supplementary material. Finally, we added paired t-tests with Bonferroni correction across the three metrics to establish statistical significance of the reported improvements.
  Revision: yes
- Referee: [Section 3.2] Section 3.2 (Prototype Module): the claim that learned prototypes overcome the modality gap rests on the assumption that a fixed set of embeddings suffices without biological priors; an ablation removing the prototype component (or varying its count) is needed to confirm this is load-bearing rather than incidental to other architectural choices.
  Authors: We concur that an ablation study is required to substantiate the contribution of the prototype module. The revised manuscript now reports results from an ablation in which the prototype module is removed entirely, as well as experiments varying the number of prototypes (8, 16, and 32). These results are presented in a new table in Section 4 and confirm that the prototype component contributes measurably to performance, independent of the multi-magnification features.
  Revision: yes
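To make the significance testing discussed in the first response concrete, the Python sketch below runs a paired comparison and applies a Bonferroni correction across metrics. Note the substitution: instead of the paired t-test mentioned in the rebuttal, it uses an exact paired sign-flip permutation test, chosen here only because it needs nothing beyond the standard library. The function names and all numbers are made up for illustration and are not results from the paper.

```python
from itertools import product


def paired_signflip_pvalue(a, b):
    """Exact two-sided sign-flip permutation p-value for paired samples.

    Enumerates all 2**n sign assignments of the paired differences, so it
    is only practical for a small number of pairs.
    """
    diffs = [x - y for x, y in zip(a, b)]
    observed = abs(sum(diffs))
    count = 0
    total = 0
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        stat = abs(sum(s * d for s, d in zip(signs, diffs)))
        if stat >= observed - 1e-12:  # tolerance for float round-off
            count += 1
    return count / total


def bonferroni(p_values):
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


# Made-up per-slide MAE values for a baseline and a candidate model.
baseline_mae = [0.50, 0.52, 0.48, 0.55, 0.51, 0.49, 0.53, 0.50]
model_mae = [0.45, 0.47, 0.43, 0.50, 0.46, 0.44, 0.48, 0.45]

p_mae = paired_signflip_pvalue(baseline_mae, model_mae)
# Correct across the three metrics; the other two p-values are placeholders.
adjusted = bonferroni([p_mae, 0.04, 0.5])
```

With eight pairs that all favour the candidate, only the all-positive and all-negative sign patterns reach the observed statistic, so the raw p-value is 2/256 before correction.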
Circularity Check
No significant circularity
The paper presents an empirical deep-learning architecture (MMAP) for spatial gene expression prediction from H&E images, relying on multi-magnification patches and learned prototype embeddings. No mathematical derivation chain, first-principles equations, or fitted-parameter predictions are described in the provided text; performance claims rest on experimental comparisons (MAE, MSE, PCC) rather than any reduction of outputs to inputs by construction. The work is self-contained as a methodological proposal with external benchmarks, exhibiting none of the enumerated circularity patterns.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel
  Tag: unclear (relation between the paper passage and the cited Recognition theorem)
  Paper passage: To enhance local feature granularity, MMAP leverages multi-magnification patch representations... learns a set of latent prototype embeddings... K-means clustering on the set of fused embeddings... cross-attention mechanism
- IndisputableMonolith/Foundation/ArithmeticFromLogic.lean, embed_strictMono_of_one_lt
  Tag: unclear (relation between the paper passage and the cited Recognition theorem)
  Paper passage: We propose a novel framework, MMAP (Multi-MAgnification and Prototype-enhanced architecture)... adaptive retrieval... L = 0.5K
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.