arxiv: 2605.18055 · v1 · pith:5N3CTOM6new · submitted 2026-05-18 · 💻 cs.LG · cs.AI

FLAG: Foundation model representation with Latent diffusion Alignment via Graph for spatial gene expression prediction

Pith reviewed 2026-05-20 12:03 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords spatial gene expression · diffusion models · graph neural networks · foundation models · structural correlation · H&E images · Gene Dimension Curse

The pith

FLAG uses graph encoding and foundation model alignment in a diffusion framework to predict spatial gene expression while preserving biological structures.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces FLAG to predict spatial gene expression from H&E images by treating the task as structured distribution modeling rather than isolated predictions. It identifies the Gene Dimension Curse as the reason current joint modeling approaches fail in high dimensions and addresses it with a spatial graph encoder and Gene Foundation Model alignment. This allows the model to maintain both gene coordination and spatial distribution patterns. A sympathetic reader would care because accurate spatial gene maps from routine images could enable large-scale molecular profiling without expensive sequencing.

Core claim

FLAG redefines spatial gene expression prediction as structured distribution modeling using latent diffusion. It overcomes the Gene Dimension Curse through a spatial graph encoder that ensures topological consistency and Gene Foundation Model alignment that maintains gene-gene fidelity during generation. This results in significantly enhanced structural fidelity on new metrics like Gene Structural Correlation and Spatial Structural Correlation, while remaining competitive on standard PCC and MSE measures.

What carries the argument

The spatial graph encoder combined with Gene Foundation Model alignment in the latent diffusion process, which enforces topological consistency and gene-gene fidelity to solve the Gene Dimension Curse.

If this is right

  • Models can now capture gene coordination relationships that pointwise methods miss.
  • New structural metrics GSC and SSC provide better evaluation of biological fidelity.
  • Large-scale molecular profiling becomes feasible from routine H&E stained slides.
  • The approach maintains accuracy on traditional metrics like PCC and MSE.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar graph and alignment techniques might improve other high-dimensional prediction tasks in genomics.
  • Applying this to different tissue types could test if the Gene Dimension Curse is general.
  • This structured modeling could lead to better understanding of spatial biology in disease contexts.

Load-bearing premise

That joint modeling of gene expression and spatial interactions necessarily fails in high-dimensional spaces and that the graph encoder plus foundation model alignment will restore the relationships without creating new inconsistencies.

What would settle it

Run FLAG on a held-out dataset. If the improvements in GSC and SSC disappear while PCC/MSE remain similar, that would falsify the claim that the graph encoder and GFM alignment solve the curse for structural fidelity.

Figures

Figures reproduced from arXiv: 2605.18055 by Penglei Wang, Qi Si, Xin Guo, Xuyang Liu, Yifeng Jiao, Yuan Cheng, Yuan Qi, Yushuai Wu.

Figure 1. Ablation study on edge attributes (HEST-1K (Jaume et al., 2024) HER2ST Dataset). Comparison of PCC performance using different edge construction strategies: Image Similarity only, Image Similarity + Distance, and with additional Spot-Spot Gene PCC.

Figure 2. Failure of joint node-edge diffusion. (a) Empirical performance on HER2ST as gene dimensionality G increases. Joint node-edge diffusion performs well at small G but collapses beyond a critical dimensionality, while node-only diffusion degrades more smoothly. (b) Mechanism analysis. As G increases, simulated correlation distribution concentrate sharply, causing the consistency manifold to become increasing…

Figure 3. The FLAG Framework Architecture. Left: H&E tiles are encoded by a pathology foundation model and assembled into a spot-wise graph, which a graph encoder aggregates into spatial context embeddings Hspatial. Right: a conditional diffusion transformer denoises noisy gene expression Xt under this spatial context, while an intermediate-layer alignment constrains hidden states to match embeddings from a pretrain…

Figure 4. Gene Dimension Curse on HER2ST. We compare the PCC of FLAG against baseline diffusion strategies across varying gene panel sizes (G).

Figure 5. Recovery of Gene Regulatory Networks. The co-expression matrices for genes in the intersection of the Estrogen Response Early pathway and the top-200 HMHVG panel.

Figure 6. GT vs Predicts Moran’s I scatter. The predicted Moran’s I against the Ground Truth for top-32 spatially variable genes.

Figure 7. Evaluation on spatial relationships. (a) Single-gene Spatial Pattern Recovery. The spatial expression map of the representative marker gene ERBB2. (b) Spatial Domain Identification via Clustering. An unsupervised method to cluster spatial gene expression.

Figure 8. Dual-Mode Spatial Graph Encoder. Left (Mode 1): In the Graph Diffusion setting, the encoder operates dynamically, where both nodes Xt and edges Et are noisy latent variables updated at each timestep. The attention mechanism is modulated by the evolving edge features. Right (Mode 2): In the FLAG framework, the encoder functions as a spatial feature extractor. The dynamic edge evolution is suppressed (At → 0…

Figure 9. Training Dynamics on HER2ST Dataset HMHVG-200 genes panel. The curves depict the validation PCC over training epochs.

Figure 10. Qualitative Comparison of Gene Regulatory Networks on KIDNEY Dataset. (a) EMT Pathway. (b) Hypoxia Pathway.

Figure 11. Spatial Expression Recovery on KIDNEY Dataset. Visual comparison of three marker genes representing distinct spatial textures: PODXL, LRP2, and VIM.
read the original abstract

Predicting spatial gene expression from routine H&E enables large-scale molecular profiling, yet current models treat this as isolated pointwise tasks, thereby overlooking essential biological structures like gene coordination and spatial distribution. To preserve these relationships, we introduce FLAG, a diffusion-based framework that redefines this task as structured distribution modeling. At the same time, we identify the critical Gene Dimension Curse, where joint modeling gene expression and their spatial interactions fail in high-dimensional spaces, and FLAG solves this challenge by integrating a spatial graph encoder for topological consistency and utilizing Gene Foundation Model (GFM) alignment for gene-gene fidelity in the generation process. To rigorously assess model performance, we propose a set of novel structural evaluation metrics, including Gene Structural Correlation (GSC) and Spatial Structural Correlation (SSC). Our experiments demonstrate that FLAG is highly competitive in traditional accuracy (PCC/MSE) while achieving significantly enhanced structural fidelity in capturing both gene-gene and gene-spatial relationships. The code is available at https://github.com/darkflash03/FLAG.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces FLAG, a latent diffusion model for predicting spatial gene expression from H&E images. It identifies a 'Gene Dimension Curse' that purportedly causes joint modeling of gene expression and spatial interactions to fail in high dimensions, and addresses it by combining a spatial graph encoder for topological consistency with alignment to a Gene Foundation Model (GFM) for gene-gene fidelity. Novel metrics Gene Structural Correlation (GSC) and Spatial Structural Correlation (SSC) are proposed to assess structural properties, with experiments claiming competitive PCC/MSE performance alongside significantly improved structural fidelity in capturing gene-gene and gene-spatial relationships.

Significance. If the empirical support and necessity of the proposed components hold, the work could advance spatial transcriptomics by better preserving biological structures such as coordinated gene expression and spatial topology. The combination of graph encoding with foundation-model alignment is a reasonable direction, and the public code release aids reproducibility.

major comments (3)
  1. [§1 and §3] The Gene Dimension Curse is introduced as the core motivation for why standard joint modeling fails at high gene counts, yet no formal definition, scaling analysis (e.g., mutual-information bounds or gradient-variance scaling with dimension), or controlled ablation that isolates gene dimensionality while fixing model size and data volume is provided. Without this, it remains unclear whether the graph encoder and GFM alignment solve a unique dimensionality-driven problem or simply supply useful inductive biases.
  2. [§4 (Metrics)] GSC and SSC are defined to quantify the very structural properties (gene-gene and gene-spatial relationships) that the model is explicitly trained to preserve. It is not shown that these metrics are independent of the training objectives or that they would not be improved by any model that adds similar graph or alignment biases; this risks circular evaluation of the central claim.
  3. [Results section] The abstract asserts competitive PCC/MSE and significantly better structural fidelity, but the provided summary supplies no quantitative tables, error bars, dataset sizes, baseline comparisons, or ablation studies isolating the graph encoder and GFM components. Full results must demonstrate that these additions are load-bearing for the reported gains.
minor comments (2)
  1. [Throughout] Ensure all acronyms (GFM, GSC, SSC) are defined on first use and used consistently.
  2. [Figures] Figure captions should explicitly state which panels report PCC/MSE versus GSC/SSC to avoid reader confusion.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below, clarifying our approach and indicating planned revisions to improve the manuscript.

read point-by-point responses
  1. Referee: [§1 and §3] The Gene Dimension Curse is introduced as the core motivation for why standard joint modeling fails at high gene counts, yet no formal definition, scaling analysis (e.g., mutual-information bounds or gradient-variance scaling with dimension), or controlled ablation that isolates gene dimensionality while fixing model size and data volume is provided. Without this, it remains unclear whether the graph encoder and GFM alignment solve a unique dimensionality-driven problem or simply supply useful inductive biases.

    Authors: We agree that a more formal treatment of the Gene Dimension Curse would strengthen the motivation. In the revised manuscript, we will add a subsection with a scaling analysis drawing on mutual-information bounds between high-dimensional gene expressions and spatial coordinates, along with controlled ablations that vary gene count while holding model size and data volume fixed. These additions will better isolate the dimensionality-specific challenges addressed by the graph encoder and GFM alignment. revision: yes

  2. Referee: [§4 (Metrics)] GSC and SSC are defined to quantify the very structural properties (gene-gene and gene-spatial relationships) that the model is explicitly trained to preserve. It is not shown that these metrics are independent of the training objectives or that they would not be improved by any model that adds similar graph or alignment biases; this risks circular evaluation of the central claim.

    Authors: We acknowledge the risk of circularity in evaluation. Although GSC and SSC target the structural relationships our components aim to preserve, they are computed post-hoc on generated samples and are not directly optimized by the training loss. In revision we will add comparative baselines that incorporate graph or alignment biases through alternative mechanisms, demonstrating that our specific integration produces measurably higher structural fidelity on these metrics. revision: partial

  3. Referee: [Results section] The abstract asserts competitive PCC/MSE and significantly better structural fidelity, but the provided summary supplies no quantitative tables, error bars, dataset sizes, baseline comparisons, or ablation studies isolating the graph encoder and GFM components. Full results must demonstrate that these additions are load-bearing for the reported gains.

    Authors: The full manuscript already contains tables reporting PCC, MSE, GSC, and SSC values with error bars across multiple datasets and runs, together with ablation studies that isolate the graph encoder and GFM alignment. We will revise the presentation to make these quantitative results and ablations more prominent in the main text and ensure the abstract claims are explicitly tied to the reported numbers. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected in the derivation chain

full rationale

The paper identifies the Gene Dimension Curse as a challenge for joint modeling in high dimensions and introduces FLAG with a spatial graph encoder and GFM alignment to enforce topological consistency and gene-gene fidelity. It also proposes independent structural metrics GSC and SSC for evaluation alongside standard PCC/MSE. No equations, definitions, or self-citations are exhibited that reduce the claimed curse, the necessity of the added components, or the reported fidelity improvements to the model's own inputs or fitted quantities by construction. The central claims rest on the proposed architecture and experimental comparisons rather than tautological redefinitions or load-bearing self-references, rendering the derivation self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

Review is based solely on the abstract; specific free parameters, axioms, and invented entities cannot be audited in detail. The paper introduces the Gene Dimension Curse as a named challenge and relies on the domain assumption that biological structures can be preserved via graph and alignment mechanisms.

axioms (1)
  • domain assumption: Gene expression patterns and spatial distributions exhibit coordinated biological structures that are worth preserving in predictions.
    Invoked to justify moving from pointwise to structured distribution modeling.
invented entities (1)
  • Gene Dimension Curse (no independent evidence)
    purpose: To name the failure mode of joint high-dimensional modeling of gene expression and spatial interactions.
    Introduced in the abstract as the key challenge that FLAG is designed to solve.

pith-pipeline@v0.9.0 · 5734 in / 1450 out tokens · 57952 ms · 2026-05-20T12:03:03.507339+00:00 · methodology

Reference graph

Works this paper leans on

14 extracted references · 14 canonical work pages

  1. [1]

    and Hütter, J.-C

    Curran Associates, Inc., 2020. Hu, J., Li, X., Coleman, K., Schroeder, A., Ma, N., Irwin, D. J., Lee, E. B., Shinohara, R. T., and Li, M. SpaGCN: Integrating gene expression, spatial location and histology to identify spatial domains and spatially variable genes by graph convolutional network. Nature Methods, 18(11): 1342–1351, 2021. Huang, T., Liu, T., Ba...

  2. [2]

    Common genes present across all slides are identified first

    Filtering: We strictly use only the training slides to calculate statistics. Common genes present across all slides are identified first. 2. Ranking: For each gene $g$, we compute the mean expression $\mu_g$ and standard deviation $\sigma_g$ across all training spots.

  3. [3]

    Intersection: We rank genes by $\mu_g$ and $\sigma_g$ in descending order. The final panel $S$ is the intersection of the top-$K_{\text{search}}$ genes from both lists: $$S = \{g \mid \mathrm{Rank}(\mu_g) \le K_{\text{search}}\} \cap \{g \mid \mathrm{Rank}(\sigma_g) \le K_{\text{search}}\}. \tag{17}$$ $K_{\text{search}}$ is adjusted dynamically to yield exact target sizes of $G \in \{50, 100, 200, 400, 800\}$ for the gene dimensional analysis. For all standard benchmarking experime...
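The filtering, ranking, and intersection steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: the function names `descending_rank` and `select_panel` are hypothetical, and the linear scan over $K_{\text{search}}$ is an assumption, since the excerpt only says the value is "adjusted dynamically". The intersection can grow by two genes in one step, so the first panel that reaches the target may overshoot it by one; how the paper resolves that case is not stated here.

```python
import numpy as np

def descending_rank(values):
    """Return 1-based ranks, where rank 1 is the largest value."""
    order = np.argsort(-values, kind="stable")
    ranks = np.empty(len(values), dtype=int)
    ranks[order] = np.arange(1, len(values) + 1)
    return ranks

def select_panel(expr, gene_names, target_size):
    """Intersect the top-K genes by mean and by standard deviation.

    expr: (n_spots, n_genes) array built from training slides only.
    Returns (panel, k_search) for the smallest K whose intersection
    holds at least target_size genes (scan strategy is an assumption).
    """
    rank_mu = descending_rank(expr.mean(axis=0))
    rank_sigma = descending_rank(expr.std(axis=0))
    for k_search in range(target_size, len(gene_names) + 1):
        keep = (rank_mu <= k_search) & (rank_sigma <= k_search)
        if keep.sum() >= target_size:
            panel = [g for g, flag in zip(gene_names, keep) if flag]
            return panel, k_search
    raise ValueError("target_size exceeds the number of genes")

# Toy data: two spots per gene placed at (mean - std) and (mean + std),
# so each column has exactly the stated mean and standard deviation.
means = np.array([5.0, 4.0, 3.0, 2.0, 1.0])
stds = np.array([1.0, 5.0, 4.0, 2.0, 3.0])
expr = np.stack([means - stds, means + stds])
panel, k_search = select_panel(expr, ["A", "B", "C", "D", "E"], target_size=2)
print(panel, k_search)  # ['B', 'C'] 3
```

In the toy data, gene A has the highest mean but the lowest variance, so it never enters a small panel; the intersection rule keeps only genes that are both highly expressed and highly variable.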

  4. [4]

    Adaptive Layer Normalization (AdaLN). First, we fuse the time embedding $t_{\text{emb}}$ with pooled representations of the conditions to form a global context vector $z$. This vector regresses the scale and shift parameters for normalization: $$z = \mathrm{MLP}_{\text{fuse}}([t_{\text{emb}}, \mathrm{Pool}(C_v), \mathrm{Pool}(C_e)]) \tag{23}$$ $$\hat{H}_x = \mathrm{AdaLN}(H_x, z) = (1 + \gamma_x(z)) \odot \mathrm{LN}(H_x) + \beta_x(z) \tag{24}$$ $\hat{H}_e = \mathrm{AdaLN}(H_e, z) = (1 + \gamma_e(z)) \odot$ L...

  5. [5]

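    Let $Q$, $K$, $V$ be projections of $\hat{H}_x$

    Joint Structure Learning (Edge-Modulated Attention). We compute the attention scores $S \in \mathbb{R}^{N \times N \times H}$ by interacting node queries/keys with edge-based gating. Let $Q$, $K$, $V$ be projections of $\hat{H}_x$. The attention topology is computed as: $$S_{ij} = \left(\frac{q_i k_j^{T}}{\sqrt{d}}\right) \odot \underbrace{\left(1 + \mathrm{Linear}(\hat{H}_{e,ij}) + \alpha\,\mathrm{Linear}(C_{e,ij})\right)}_{\text{Structural Gating}} + \underbrace{\mathrm{Linear}(\hat{H}_{e,ij}) + \gamma\,\mathrm{Linear}(C_{e,ij})}_{\text{Structural Bias}}$$ ...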

  6. [6]

    Dual-Stream Updates. The structural information $S$ is bifurcated to update nodes and edges: • Node Update. The structural attention matrix $S$ acts as standard attention weights to aggregate value vectors $V$: $$H^{\text{attn}}_x = H^{(l)}_x + \mathrm{Lin}_{\text{out}}\left(\mathrm{Softmax}(S) \cdot V\right). \tag{27}$$ • Edge Update. The raw score matrix $S$ is also directly projected to update the edge features, ensuring that edge repr...
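The edge-modulated score and the node update (Eq. 27) can be sketched as follows. This is a single-head illustration under stated assumptions, not the paper's implementation: each `Linear` is modeled as a bias-free projection of the edge feature vector to one scalar per node pair, the parameter names in `params` are hypothetical, and the gating term is read as a parenthesized multiplier on the scaled dot product. The edge update is omitted because its formula is truncated in the excerpt.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def edge_modulated_attention(H_x, H_x_hat, H_e_hat, C_e, params, alpha=1.0, gamma=1.0):
    """Single-head sketch of the gated, biased attention score and node update.

    H_x:      (N, d)      node states before normalization (residual path)
    H_x_hat:  (N, d)      normalized node states used for Q, K, V
    H_e_hat:  (N, N, d_e) normalized edge states
    C_e:      (N, N, d_c) edge conditions
    """
    Q = H_x_hat @ params["W_q"]
    K = H_x_hat @ params["W_k"]
    V = H_x_hat @ params["W_v"]
    scores = Q @ K.T / np.sqrt(Q.shape[-1])                        # (N, N)
    # Structural gating: multiplicative modulation of the dot-product score.
    gate = 1.0 + H_e_hat @ params["w_gate_h"] + alpha * (C_e @ params["w_gate_c"])
    # Structural bias: additive term from the same edge inputs.
    bias = H_e_hat @ params["w_bias_h"] + gamma * (C_e @ params["w_bias_c"])
    S = scores * gate + bias
    # Node update: residual plus output projection of the aggregated values.
    H_attn = H_x + (softmax(S, axis=-1) @ V) @ params["W_out"]
    return S, H_attn

rng = np.random.default_rng(0)
N, d, d_e, d_c = 4, 8, 3, 2
params = {
    "W_q": rng.normal(size=(d, d)), "W_k": rng.normal(size=(d, d)),
    "W_v": rng.normal(size=(d, d)), "W_out": rng.normal(size=(d, d)),
    "w_gate_h": rng.normal(size=d_e), "w_gate_c": rng.normal(size=d_c),
    "w_bias_h": rng.normal(size=d_e), "w_bias_c": rng.normal(size=d_c),
}
H_x = rng.normal(size=(N, d))
H_x_hat = rng.normal(size=(N, d))
H_e_hat = rng.normal(size=(N, N, d_e))
C_e = rng.normal(size=(N, N, d_c))
S, H_attn = edge_modulated_attention(H_x, H_x_hat, H_e_hat, C_e, params)
print(S.shape, H_attn.shape)
```

When all four edge projections are zero, the gate is 1 and the bias is 0, so the score reduces to ordinary scaled dot-product attention. That makes the edge terms easy to ablate.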

  7. [7]

    Gated Feed-Forward Networks. Finally, both streams undergo point-wise processing via Gated-GELU networks. Unlike standard FFNs, GEGLU projects inputs into a gating stream and a value stream: $$\mathrm{FFN}(h) = W_2 \cdot \left(\mathrm{GELU}(W_{\text{gate}} h) \odot (W_{\text{val}} h)\right) \tag{29}$$ The block output is then computed with residual connections: $$H^{(l+1)}_x = H^{\text{attn}}_x + \mathrm{FFN}_{\text{node}}(\mathrm{AdaLN}(H^{\text{attn}}_x, z)) \tag{30}$$ $H^{(l+1)}_e = H$ ...
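The AdaLN modulation (Eq. 24) and the GEGLU residual update for the node stream (Eqs. 29 and 30) can be sketched together. This is an illustrative sketch under assumptions: $\gamma_x(z)$ and $\beta_x(z)$ are modeled as plain linear maps of $z$, GELU uses the common tanh approximation, and vectors are rows (so $W h$ is written `h @ W`). The parameter names are hypothetical.

```python
import numpy as np

def layer_norm(h, eps=1e-5):
    """Normalize each row to zero mean and unit variance (no learned affine)."""
    mu = h.mean(axis=-1, keepdims=True)
    var = h.var(axis=-1, keepdims=True)
    return (h - mu) / np.sqrt(var + eps)

def ada_ln(h, z, W_gamma, W_beta):
    """AdaLN(h, z) = (1 + gamma(z)) * LN(h) + beta(z), with linear gamma and beta."""
    gamma = z @ W_gamma
    beta = z @ W_beta
    return (1.0 + gamma) * layer_norm(h) + beta

def gelu(x):
    """Tanh approximation of GELU."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def geglu_ffn(h, W_gate, W_val, W_2):
    """FFN(h) = W_2 (GELU(W_gate h) * (W_val h)), in row-vector form."""
    return (gelu(h @ W_gate) * (h @ W_val)) @ W_2

def node_block_output(H_attn, z, p):
    """Residual update of the node stream after attention."""
    normed = ada_ln(H_attn, z, p["W_gamma"], p["W_beta"])
    return H_attn + geglu_ffn(normed, p["W_gate"], p["W_val"], p["W_2"])

rng = np.random.default_rng(0)
N, d, d_z, d_ff = 4, 8, 6, 16
p = {
    "W_gamma": rng.normal(size=(d_z, d)) * 0.1,
    "W_beta": rng.normal(size=(d_z, d)) * 0.1,
    "W_gate": rng.normal(size=(d, d_ff)),
    "W_val": rng.normal(size=(d, d_ff)),
    "W_2": rng.normal(size=(d_ff, d)),
}
H_attn = rng.normal(size=(N, d))
z = rng.normal(size=d_z)
out = node_block_output(H_attn, z, p)
print(out.shape)
```

Writing the scale as `1 + gamma` means that zero-initialized modulation weights leave the layer as plain layer normalization, which is a common reason for this parameterization.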

  8. [8]

    Denoise (Tweedie’s Formula): We estimate the clean data $\hat{X}_0$ and $\hat{A}_0$ from the current noisy states and predicted scores: $$\hat{X}_0 = X_t + \sigma(t)^2 s^X_\theta, \quad \hat{A}_0 = A_t + \sigma(t)^2 s^A_\theta \tag{32}$$ 2. Empirical Correlation: We compute the PCC of the estimated node expression within the batch: $$P_{\text{pred}} = \mathrm{Corr}(\hat{X}_0) = \frac{(\hat{X}_0 - \mu)(\hat{X}_0 - \mu)^{T}}{\sigma_x \sigma_x^{T} + \epsilon} \tag{33}$$

  9. [9]

    Loss Computation: We minimize the L1 distance between the explicitly predicted edge $\hat{A}_0$ and the implicit node correlation $P_{\text{pred}}$, masking out diagonal self-loops: $$\mathcal{L}_{\text{cons}} = \frac{1}{N(N-1)} \sum_{i \neq j} \left| \hat{A}_{0,ij} - P_{\text{pred},ij} \right| \tag{34}$$ E. Analysis of the Gene Dimension Curse. We provide a simplified analysis to explain why jointly denoising node expressions and functional edges becom...
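The three steps of the consistency term (denoise, correlate, compare) can be sketched in NumPy. This is a minimal illustration, not the authors' code: the function names are hypothetical, rows of the expression matrix are taken to be spots (nodes) and columns genes, and the covariance is divided by the gene count so that the result is a Pearson correlation, a normalization that Eq. 33 leaves implicit.

```python
import numpy as np

def tweedie_denoise(x_t, score, sigma_t):
    """Estimate the clean sample: x0_hat = x_t + sigma(t)^2 * score."""
    return x_t + sigma_t**2 * score

def empirical_correlation(X0_hat, eps=1e-8):
    """Pairwise Pearson correlation between rows (spots) of X0_hat."""
    centered = X0_hat - X0_hat.mean(axis=1, keepdims=True)
    cov = centered @ centered.T / X0_hat.shape[1]
    std = X0_hat.std(axis=1)
    return cov / (np.outer(std, std) + eps)

def consistency_loss(A0_hat, P_pred):
    """Mean absolute difference over off-diagonal entries only."""
    n = A0_hat.shape[0]
    off_diag = ~np.eye(n, dtype=bool)
    return np.abs(A0_hat - P_pred)[off_diag].sum() / (n * (n - 1))

# Toy example: three spots, three genes.
X0_hat = np.array([[1.0, 2.0, 3.0],
                   [2.0, 4.0, 6.0],   # perfectly correlated with spot 0
                   [3.0, 2.0, 1.0]])  # perfectly anti-correlated with spot 0
P_pred = empirical_correlation(X0_hat)
A0_hat = P_pred + 0.5                 # an edge estimate that is off by 0.5 everywhere
loss = consistency_loss(A0_hat, P_pred)
print(np.round(P_pred, 3), loss)
```

Because the diagonal is masked, the loss ignores the trivial self-correlation of each spot and penalizes only disagreement between predicted edges and the correlations implied by the predicted expression.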

  10. [10]

    We initialize a zero-filled embedding matrix $E_i \in \mathbb{R}^{|S| \times D_{\text{GFM}}}$, where $|S|$ denotes the number of genes in the target panel

  11. [11]

    For each position $k$ in the output sequence, we decode the token $t_{\pi(k)}$ back to its Ensembl ID

  12. [12]

    Genes not present in the top-ranked context of spot i remain as zero vectors and are masked during the alignment loss calculation

    If this ID corresponds to the $j$-th gene in our target panel $S$ (i.e., $g_j \in S$), we populate the matrix: $E_{i,j} \leftarrow H^{\text{GFM}}_k$. Genes not present in the top-ranked context of spot $i$ remain as zero vectors and are masked during the alignment loss calculation. F.2.2. scGPT: Value-Binned Contextualization. Preprocessing & Input Construction. We utilize the scGPT-human checkp...
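The scatter-and-mask procedure described above can be sketched as follows. This is an illustration under assumptions, not the paper's pipeline: the token-to-Ensembl lookup is a plain dictionary with placeholder IDs, the function names are hypothetical, and the masked mean squared error at the end is a stand-in, since the form of the alignment loss is not given in this excerpt.

```python
import numpy as np

def build_alignment_targets(token_ids, token_embeddings, token_to_ensembl, panel):
    """Scatter per-token GFM embeddings into a panel-ordered matrix for one spot.

    token_ids:        tokens in the model's output sequence (length L)
    token_embeddings: (L, D) hidden states, one per sequence position
    token_to_ensembl: dict mapping token id to Ensembl ID (hypothetical lookup)
    panel:            Ensembl IDs of the target gene panel, in order
    Returns E of shape (len(panel), D) and a boolean mask of populated rows.
    """
    panel_index = {gene: j for j, gene in enumerate(panel)}
    E = np.zeros((len(panel), token_embeddings.shape[1]))
    mask = np.zeros(len(panel), dtype=bool)
    for k, tok in enumerate(token_ids):
        j = panel_index.get(token_to_ensembl.get(tok))
        if j is not None:
            E[j] = token_embeddings[k]
            mask[j] = True
    return E, mask

def masked_alignment_loss(hidden, E, mask):
    """Stand-in loss (assumption): mean squared error over populated genes only."""
    if not mask.any():
        return 0.0
    diff = hidden[mask] - E[mask]
    return float((diff**2).mean())

# Placeholder IDs, not real Ensembl identifiers.
panel = ["ENSG_A", "ENSG_B", "ENSG_C"]
token_ids = [10, 11]
token_to_ensembl = {10: "ENSG_C", 11: "ENSG_X"}   # ENSG_X is outside the panel
token_embeddings = np.array([[1.0, 2.0], [3.0, 4.0]])
E, mask = build_alignment_targets(token_ids, token_embeddings, token_to_ensembl, panel)
print(E, mask)
```

Returning the mask alongside the matrix keeps genes absent from a spot's context from pulling the hidden states toward zero vectors.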

  13. [13]

    2. w/ scGPT: We utilize scGPT embeddings as the prior

    No GFM: The gene embeddings are randomly initialized and learned from scratch without any pre-trained biological knowledge. 2. w/ scGPT: We utilize scGPT embeddings as the prior

  14. [14]

    Note that since CellPLM produces cell-level embeddings, we apply average pooling along the gene dimension to align its output with our target gene embedding space

    w/ CellPLM: We employ CellPLM as the prior. Note that since CellPLM produces cell-level embeddings, we apply average pooling along the gene dimension to align its output with our target gene embedding space. 4. w/ Geneformer (Ours): Our default setting using Geneformer token embeddings. Table 6. Ablation of Foundation Model Priors (w/o Graph Backbone). Perfor...