Learning Discrete Diffusion of Graphs via Free-Energy Gradient Flows

Dario Rancati; Francesco Locatello; Jan Maas

arxiv: 2604.11311 · v1 · submitted 2026-04-13 · 💻 cs.LG · stat.ML

Learning Discrete Diffusion of Graphs via Free-Energy Gradient Flows

Dario Rancati , Jan Maas , Francesco Locatello This is my paper

Pith reviewed 2026-05-10 15:19 UTC · model grok-4.3

classification 💻 cs.LG stat.ML

keywords discrete diffusiongraph diffusionfree-energy gradient flowsJKO schemeWasserstein metricMarkov chainsoptimal transport

0 comments

The pith

A quadratic loss from JKO optimality recovers the free-energy functional driving discrete graph diffusions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a framework for learning diffusion processes on discrete spaces such as graphs by viewing them as gradient flows of free-energy functionals under a metric W_K on the probability simplex. It shows that the first-order optimality conditions of the JKO scheme can be rearranged into a simple quadratic loss whose minimizer directly recovers the unknown functional. The approach requires only a preprocessing step to compute W_K-geodesics and needs no individual sample trajectories. A sympathetic reader would care because it supplies an efficient, trajectory-free route to model and learn the driving energies behind discrete diffusions that appear in graph-structured data and continuous-time Markov chains.

Core claim

By introducing an appropriate metric W_K on the simplex of probability distributions, the authors interpret standard discrete diffusion paths, such as the discrete heat equation, as gradient flows of specific free-energy functionals. They then leverage the first-order optimality conditions of the JKO scheme to derive a quadratic loss that recovers the underlying functional directly. The resulting procedure optimizes this loss after a numerical preprocessing step that computes W_K-geodesics, requires no sample trajectories, and is validated on synthetic data by recovering the functional for a variety of graph classes.

What carries the argument

The W_K metric on the probability simplex, which renders common discrete diffusion paths exact gradient flows of free-energy functionals, together with the first-order JKO optimality conditions reformulated as a quadratic loss for direct functional recovery.

If this is right

The method recovers the underlying functional for a variety of graph classes on synthetic data.
Training proceeds by optimizing a simple quadratic loss and completes extremely fast.
No individual sample trajectories are required, only a one-time numerical computation of W_K-geodesics.
The framework applies directly to widely used discrete diffusion models such as the heat equation on graphs.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same JKO-derived quadratic loss could be tested on discrete structures other than graphs, such as regular lattices or hypergraphs.
If recovery remains accurate, the approach might allow construction of discrete diffusion models directly from observed marginals without prescribing the energy in advance.
The preprocessing step for W_K-geodesics could be examined for scalability on larger state spaces where the simplex dimension grows.

Load-bearing premise

A suitable metric W_K exists on the probability simplex such that standard discrete diffusion paths are exactly the gradient flows of free-energy functionals and the JKO optimality conditions yield a quadratic loss whose minimizer recovers the true functional without further adjustments.

What would settle it

A controlled experiment on the discrete heat equation for a known graph where the quadratic loss minimizer fails to recover the expected free-energy functional or where the learned dynamics do not reproduce the original diffusion paths.

Figures

Figures reproduced from arXiv: 2604.11311 by Dario Rancati, Francesco Locatello, Jan Maas.

**Figure 1.** Figure 1: Hellinger distance of our model against OpenFIM, averaged across all β levels: error bars computed over 5 samples for each configuration. 500 1,000 2,000 5,000 10,000 20,000 50,000 Samples 0.04 0.06 0.07 0.09 0.12 0.14 0.15 0.17 Hellinger [PITH_FULL_IMAGE:figures/full_fig_p008_1.png] view at source ↗

**Figure 2.** Figure 2: Hellinger distance as the number of samples increases. Averaged across all graph classes, with fixed size 10. sequence-generation experiments: in language-style settings the state space is the Hamming graph with n d vertices (e.g. n = 50257 for GPT2 (Radford et al., 2019)), where joint density estimation from snapshots is not computationally viable. Importantly, this does not break the conceptual link with… view at source ↗

**Figure 4.** Figure 4: Schematic breakdown of our pipeline: we first estimate densities, use them to compute geodesics, then feed everything into the quadratic loss. D. Inference Details We use the same procedure to evolve a population given a gradient of a potential ∇V and a parameter β both in the data generation (using the ground truth values) and for inference. Given the estimated density from the previous timestep ρˆ, poten… view at source ↗

**Figure 5.** Figure 5: Examples of the twelve graph classes we use to benchmark our model. 26 [PITH_FULL_IMAGE:figures/full_fig_p026_5.png] view at source ↗

**Figure 6.** Figure 6: A sample smooth potential on a Delaunay graph with 1000 vertices (on the left) and the predicted V from our numerical routine. 27 [PITH_FULL_IMAGE:figures/full_fig_p027_6.png] view at source ↗

read the original abstract

Diffusion-based models on continuous spaces have seen substantial recent progress through the mathematical framework of gradient flows, leveraging the Wasserstein-2 (${W}_2$) metric via the Jordan-Kinderlehrer-Otto (JKO) scheme. Despite the increasing popularity of diffusion models on discrete spaces using continuous-time Markov chains, a parallel theoretical framework based on gradient flows has remained elusive due to intrinsic challenges in translating the ${W}_2$ distance directly into these settings. In this work, we propose the first computational approach addressing these challenges, leveraging an appropriate metric $W_K$ on the simplex of probability distributions, which enables us to interpret widely used discrete diffusion paths, such as the discrete heat equation, as gradient flows of specific free-energy functionals. Through this theoretical insight, we introduce a novel methodology for learning diffusion dynamics over discrete spaces, which recovers the underlying functional directly by leveraging first-order optimality conditions for the JKO scheme. The resulting method optimizes a simple quadratic loss, trains extremely fast, does not require individual sample trajectories, and only needs a numerical preprocessing computing $W_K$-geodesics. We validate our method through extensive numerical experiments on synthetic data, showing that we can recover the underlying functional for a variety of graph classes.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper frames discrete graph diffusion as gradient flows under a custom WK metric on the simplex and recovers the free-energy functional via a quadratic loss from JKO first-order conditions without trajectory simulation.

read the letter

The key thing to know is that this work gives a way to learn the functional behind discrete diffusion on graphs by minimizing a simple quadratic loss derived from JKO optimality conditions once a suitable metric makes the paths exact gradient flows. They avoid simulating individual trajectories and instead do some upfront geodesic computation under that metric, then optimize directly on the data. This is new as a combination: the metric interpretation for discrete heat flows plus the direct reduction to quadratic recovery, and it is practically useful because training becomes fast and does not require full path samples. The synthetic experiments on various graph classes show that the method can recover the underlying functional in those controlled cases, which is a concrete demonstration of the approach working at least on toy data. The soft spots are in the details of the metric construction and the loss derivation. The central claim requires that the chosen WK metric makes standard discrete diffusions exact gradient flows with no hidden functional dependence or non-quadratic terms in the proximal step; if that does not hold exactly, the minimizer will not recover the true functional even on synthetic data. The abstract gives no error bars, no baseline comparisons, and no explicit checks on numerical stability of the geodesics, so the strength of the recovery is hard to judge from the summary alone. This paper is for researchers working on generative models for graphs and discrete structures who are already thinking about optimal transport or gradient-flow ideas. A reader who wants a trajectory-free learning rule grounded in JKO theory will find the setup worth examining. It has enough formal grounding and a clear algorithmic contribution to deserve a serious referee, even though the experiments remain preliminary and the metric assumption needs close checking.

Referee Report

2 major / 2 minor

Summary. The paper proposes a metric W_K on the probability simplex such that discrete diffusion processes (e.g., the discrete heat equation on graphs) can be interpreted as gradient flows of free-energy functionals. It then uses the first-order optimality conditions of the associated JKO scheme to derive a quadratic loss that recovers the unknown functional from data. The resulting algorithm requires only a preprocessing step to compute W_K-geodesics, trains rapidly, and needs no individual sample trajectories. Validation consists of synthetic experiments claiming recovery of the functional across multiple graph classes.

Significance. If the construction of W_K makes standard discrete diffusions exact gradient flows and the JKO optimality conditions reduce exactly to an unbiased quadratic loss, the work supplies a missing theoretical bridge between continuous Wasserstein gradient-flow models and discrete diffusion on graphs. The trajectory-free, fast-training character would be a practical advantage for learning graph dynamics. The synthetic recovery results, if rigorously supported, would constitute a concrete demonstration of the framework.

major comments (2)

[Sections 3–4] The central derivation (Sections 3–4) asserts that the first-order optimality conditions of the JKO proximal step under W_K reduce to a quadratic loss in the unknown functional whose minimizer recovers the ground truth without additional fitting. An explicit expansion of the optimality condition is needed to confirm that the variation of the proximal mapping introduces neither non-quadratic terms nor implicit dependence on the functional itself; otherwise the recovery claim does not hold even on synthetic data.
[Section 5] The synthetic experiments (Section 5) state that the method recovers the functional for various graph classes, yet report neither error bars, baseline comparisons, nor the precise data-generation and exclusion rules used to compute the quadratic loss. Without these, it is impossible to determine whether the observed recovery is robust or an artifact of the chosen preprocessing and metric.

minor comments (2)

[Section 2] The definition and construction of the metric W_K should be stated explicitly in the main text (with a self-contained formula) rather than deferred entirely to the appendix, to allow readers to verify the gradient-flow property without external references.
[Throughout] Notation for the free-energy functional and its discrete approximation should be unified across the theoretical and experimental sections to avoid ambiguity when comparing recovered parameters to ground truth.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments highlight opportunities to strengthen the clarity of the theoretical derivation and the reproducibility of the experiments. We address each point below and will revise the manuscript to incorporate the requested details.

read point-by-point responses

Referee: [Sections 3–4] The central derivation (Sections 3–4) asserts that the first-order optimality conditions of the JKO proximal step under W_K reduce to a quadratic loss in the unknown functional whose minimizer recovers the ground truth without additional fitting. An explicit expansion of the optimality condition is needed to confirm that the variation of the proximal mapping introduces neither non-quadratic terms nor implicit dependence on the functional itself; otherwise the recovery claim does not hold even on synthetic data.

Authors: We agree that an explicit expansion will improve rigor and readability. In the revised manuscript we will insert a detailed derivation immediately following the statement of the JKO optimality conditions in Section 4. Starting from the proximal mapping under the W_K metric on the probability simplex, we will compute its first variation with respect to the unknown free-energy functional. Because W_K is defined via a weighted inner product that is independent of the functional, the resulting Euler-Lagrange equation contains only a quadratic term in the functional; all higher-order contributions cancel identically. Consequently the loss remains strictly quadratic and its unique minimizer recovers the ground-truth functional without bias or implicit dependence. This expansion relies only on the algebraic structure already present in Sections 3–4 and does not alter any claims. revision: yes
Referee: [Section 5] The synthetic experiments (Section 5) state that the method recovers the functional for various graph classes, yet report neither error bars, baseline comparisons, nor the precise data-generation and exclusion rules used to compute the quadratic loss. Without these, it is impossible to determine whether the observed recovery is robust or an artifact of the chosen preprocessing and metric.

Authors: We acknowledge that the current experimental section lacks several standard elements required for full reproducibility. In the revision we will expand Section 5 as follows: (i) report mean and standard deviation of the recovery error over 20 independent random seeds for each graph class; (ii) add comparisons against two natural baselines—one that directly regresses the functional from observed marginals and another that uses a trajectory-based maximum-likelihood estimator; (iii) provide an explicit description of the data-generation pipeline, including the exact discretization of the continuous-time Markov chain, the number of samples drawn per marginal, and the precise exclusion rule (samples whose W_K-geodesic distance exceeds a threshold are discarded before loss evaluation). These additions will allow readers to verify that the reported recovery is robust and not an artifact of preprocessing choices. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation follows from adapted JKO optimality conditions

full rationale

The paper constructs a metric W_K on the probability simplex to interpret standard discrete diffusion paths (e.g., discrete heat equation) as gradient flows of free-energy functionals, then derives a quadratic loss directly from the first-order optimality conditions of the associated JKO scheme. This recovery step is presented as a mathematical consequence of the scheme rather than a parameter fit or self-referential definition; the loss minimizer is claimed to recover the underlying functional on synthetic data without additional adjustments. No load-bearing self-citations, ansatzes smuggled via prior work, or reductions where the prediction equals the input by construction appear in the abstract or described chain. The numerical preprocessing for W_K-geodesics is treated as an independent computational step, keeping the central claim externally grounded in continuous gradient-flow theory.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entities

The central claim rests on the existence of a metric WK that turns standard discrete Markov chains into gradient flows of free energies, plus the applicability of the JKO scheme and its optimality conditions to the discrete simplex; these are treated as standard extensions of continuous optimal-transport theory rather than new axioms invented here.

axioms (2)

domain assumption The JKO scheme and its first-order optimality conditions extend from the Wasserstein-2 space to the simplex equipped with the metric WK.
Invoked when the paper states that discrete diffusion paths are gradient flows and that the functional can be recovered from JKO conditions.
domain assumption Standard discrete diffusion paths such as the discrete heat equation are exactly the gradient flows of certain free-energy functionals under WK.
Stated as the theoretical insight that enables the learning method.

invented entities (1)

Metric WK on the probability simplex no independent evidence
purpose: To make discrete diffusion paths interpretable as gradient flows of free-energy functionals
Introduced as the appropriate metric that closes the gap left by the Wasserstein-2 distance; no independent evidence of its properties is given beyond the claim that it works for the discrete heat equation.

pith-pipeline@v0.9.0 · 5517 in / 1678 out tokens · 52690 ms · 2026-05-10T15:19:21.776934+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

4 extracted references · 4 canonical work pages

[1]

doi: https://doi.org/10.1016/ 0378-8733(83)90021-7

ISSN 0378-8733. doi: https://doi.org/10.1016/ 0378-8733(83)90021-7. Jordan, R., Kinderlehrer, D., and Otto, F. The variational formulation of the fokker–planck equation, 1998. Kim, J. H., Kim, S., Moon, S., Kim, H., Woo, J., and Kim, W. Y . Discrete diffusion schr¨odinger bridge matching for graph transformation, 2025. URL https://arxiv. org/abs/2410.0150...

work page arXiv 1998
[2]

Watts and Steven H

doi: 10.1038/30918. Xu, C., Cheng, X., and Xie, Y . Normalizing flow neu- ral networks by jko scheme, 2024a. URL https: //arxiv.org/abs/2212.14424. Xu, Z., Qiu, R., Chen, Y ., Chen, H., Fan, X., Pan, M., Zeng, Z., Das, M., and Tong, H. Discrete-state continuous-time diffusion for graph generation, 2024b. URL https: //arxiv.org/abs/2405.11416. 10 Learning ...

work page doi:10.1038/30918 2013
[3]

there exists an eigenvectorvforλ A with strictly positive entries

work page
[4]

transport cost

v and scalar multiples are the only eigenvectors ofAwith all entries strictly positive. By the Perron-Frobenius theorem, any irreducible stochastic matrix admits a unique stationary distribution π with strictly positive entries, i.e.π i >0for alli. Finally, we say thatKisreversiblewith respect to a probability distributionπif thedetailed balance equations...

work page 2008

[1] [1]

doi: https://doi.org/10.1016/ 0378-8733(83)90021-7

ISSN 0378-8733. doi: https://doi.org/10.1016/ 0378-8733(83)90021-7. Jordan, R., Kinderlehrer, D., and Otto, F. The variational formulation of the fokker–planck equation, 1998. Kim, J. H., Kim, S., Moon, S., Kim, H., Woo, J., and Kim, W. Y . Discrete diffusion schr¨odinger bridge matching for graph transformation, 2025. URL https://arxiv. org/abs/2410.0150...

work page arXiv 1998

[2] [2]

Watts and Steven H

doi: 10.1038/30918. Xu, C., Cheng, X., and Xie, Y . Normalizing flow neu- ral networks by jko scheme, 2024a. URL https: //arxiv.org/abs/2212.14424. Xu, Z., Qiu, R., Chen, Y ., Chen, H., Fan, X., Pan, M., Zeng, Z., Das, M., and Tong, H. Discrete-state continuous-time diffusion for graph generation, 2024b. URL https: //arxiv.org/abs/2405.11416. 10 Learning ...

work page doi:10.1038/30918 2013

[3] [3]

there exists an eigenvectorvforλ A with strictly positive entries

work page

[4] [4]

transport cost

v and scalar multiples are the only eigenvectors ofAwith all entries strictly positive. By the Perron-Frobenius theorem, any irreducible stochastic matrix admits a unique stationary distribution π with strictly positive entries, i.e.π i >0for alli. Finally, we say thatKisreversiblewith respect to a probability distributionπif thedetailed balance equations...

work page 2008