Fast Wasserstein rates for estimating probability distributions of probabilistic graphical models
Pith reviewed 2026-05-22 12:42 UTC · model grok-4.3
The pith
The Wasserstein rate for estimating graphical model distributions is set by local node-parent dimensions under kernel smoothness.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Using i.i.d. data to estimate a high-dimensional distribution in Wasserstein distance suffers from the curse of dimensionality. For distributions arising from probabilistic graphical models on a fixed directed acyclic graph, this curse can be overcome when the transition kernels in the corresponding disintegration satisfy smoothness conditions. The resulting estimation rate is then governed by the local dimensions of nodes together with their parents, with the precise rate depending on the strength of the smoothness assumption.
What carries the argument
Smoothness conditions on the transition kernels of the graph disintegration, which link the local node and parent dimensions to the global Wasserstein estimation rate.
If this is right
- The rate improves beyond the unstructured high-dimensional case when smoothness holds.
- Under bidirectional Total-Variation-Lipschitz conditions the rate is sharp.
- Distributions with positive Lipschitz density are covered as a special case.
- Only quantified structural knowledge via the kernels makes the graph helpful.
Where Pith is reading between the lines
- This suggests that accurate knowledge of the graph and kernel properties is crucial for efficient estimation in applications like causal modeling.
- Similar local dimension benefits might appear in other distance-based estimation tasks for structured data.
- Testing the rates on concrete examples such as Bayesian networks with known conditional densities would validate the local structure effect.
Load-bearing premise
Smoothness conditions hold for the transition kernels corresponding to the nodes in the directed acyclic graph.
What would settle it
A numerical experiment estimating a distribution from a graphical model with non-smooth kernels and observing the rate degrade to the high-dimensional case would disprove the local structure governing the rate.
Figures
read the original abstract
Using i.i.d. data to estimate a high-dimensional distribution in Wasserstein distance is a fundamental instance of the curse of dimensionality. We explore how structural knowledge about the data-generating process which gives rise to the distribution can be used to overcome this curse. More precisely, we work with the set of distributions of probabilistic graphical models for a given directed acyclic graph. It turns out that this knowledge is only helpful if it can be quantified, which we formalize via smoothness conditions on the transition kernels in the disintegration corresponding to the graph. In this case, we prove that the rate of estimation is governed by the local structure of the graph, more precisely by dimensions corresponding to single nodes together with their parent nodes. The precise rate depends on the exact notion of smoothness assumed for the kernels, where either weak (Wasserstein-Lipschitz) or strong (bidirectional Total-Variation-Lipschitz) conditions lead to different results. We prove sharpness under the strong condition and show that this condition covers, as a special case, distributions having a positive Lipschitz density.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that for distributions of probabilistic graphical models on a fixed DAG, the minimax rate of estimation in Wasserstein distance from i.i.d. samples is governed by the local dimensions (each node together with its parents) once the transition kernels satisfy either weak Wasserstein-Lipschitz or strong bidirectional TV-Lipschitz smoothness. Upper bounds are derived for both regimes via disintegration into local conditionals; sharpness is proved under the strong condition, which also covers positive Lipschitz densities as a special case. No dependence on global dimension or graph depth appears in the final rates.
Significance. If the derivations hold, the work provides a clean theoretical mechanism for overcoming the curse of dimensionality in distribution estimation by combining graphical structure with quantifiable local smoothness. The reduction to per-node covering numbers or entropy integrals, together with explicit sharpness under the strong condition, is a substantive contribution to nonparametric statistics and optimal transport. The results are directly relevant to structured high-dimensional data and causal models where local smoothness is plausible.
major comments (2)
- [§3.2] §3.2, the statement of the main upper bound (presumably Theorem 3.1): the global Wasserstein bound is obtained by summing local discrepancies; the argument must confirm that the triangle inequality or coupling constants remain uniform across nodes and do not accumulate with graph depth or number of nodes, as this is load-bearing for the claim of purely local rates.
- [Theorem 4.1] Theorem 4.1 (sharpness under strong condition): the lower-bound construction packs local kernels while keeping the joint in the PGM class; it should be verified that the same packing does not inadvertently violate the bidirectional TV-Lipschitz assumption for some nodes, which would weaken the sharpness statement.
minor comments (3)
- [Introduction] The abstract and introduction use 'bidirectional Total-Variation-Lipschitz' without an immediate equation reference; adding the precise definition (e.g., Eq. (2) or (3)) at first use would improve readability.
- [Preliminaries] Notation for the disintegration and the local Wasserstein distance on each conditional should be introduced once and used consistently; occasional reuse of P for both joint and conditional measures is mildly confusing.
- [Introduction] Figure 1 (if present) or the illustrative DAG examples would benefit from explicit labeling of the local dimensions d_v + |pa(v)| next to each node.
Simulated Author's Rebuttal
We thank the referee for the careful reading, positive assessment, and constructive comments on our manuscript. We address each major comment point by point below.
read point-by-point responses
-
Referee: [§3.2] §3.2, the statement of the main upper bound (presumably Theorem 3.1): the global Wasserstein bound is obtained by summing local discrepancies; the argument must confirm that the triangle inequality or coupling constants remain uniform across nodes and do not accumulate with graph depth or number of nodes, as this is load-bearing for the claim of purely local rates.
Authors: In the proof of Theorem 3.1 we disintegrate the joint distributions according to the fixed DAG and construct a global coupling as the product of local couplings between the conditional kernels. The Wasserstein distance between the joints is then bounded by the sum of the local Wasserstein distances; the triangle inequality is applied along the topological order with multiplicative constants identically equal to one. Because the underlying metric is additive across coordinates and each local coupling is chosen independently, no depth- or size-dependent factors arise. The uniformity is stated after Equation (3.2) and proved in the appendix; we will add a short clarifying sentence in the main-text proof sketch to make this explicit. revision: partial
-
Referee: [Theorem 4.1] Theorem 4.1 (sharpness under strong condition): the lower-bound construction packs local kernels while keeping the joint in the PGM class; it should be verified that the same packing does not inadvertently violate the bidirectional TV-Lipschitz assumption for some nodes, which would weaken the sharpness statement.
Authors: The packing used for the lower bound perturbs each local transition kernel inside a small ball that preserves the bidirectional TV-Lipschitz constant of the original kernel. Because the perturbations are chosen to be Lipschitz with respect to the same constant and the graph is fixed, every constructed conditional satisfies the strong smoothness assumption uniformly across nodes. This verification appears in the proof of Theorem 4.1 (Section 4) and the accompanying supplementary calculations; the resulting family remains inside the model class. revision: no
Circularity Check
Derivation self-contained from first-principles smoothness assumptions
full rationale
The paper derives minimax Wasserstein rates for distributions of probabilistic graphical models on a fixed DAG by assuming either Wasserstein-Lipschitz or bidirectional TV-Lipschitz conditions on the transition kernels in the disintegration. The joint estimation problem reduces to separate control of each conditional via standard entropy integrals or covering numbers, with the global rate then depending only on the local dimensions (node plus parents). This follows from general statistical theory on empirical processes and does not reduce to any fitted parameter, self-definition, or load-bearing self-citation chain. Sharpness under the strong condition is shown by explicit lower bounds that match the upper bounds without circularity. The argument is independent of the target result and externally falsifiable via the stated Lipschitz assumptions.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Transition kernels admit either Wasserstein-Lipschitz or bidirectional Total-Variation-Lipschitz smoothness
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/AlexanderDuality.leanalexander_duality_circle_linking unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
rate of estimation is governed by the local structure of the graph, more precisely by dimensions corresponding to single nodes together with their parent nodes
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
smoothness conditions on the transition kernels in the disintegration
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
B. Acciaio and S. Hou. Convergence of adapted empirical measures on r d.The Annals of Applied Probability, 34(5):4799–4835, 2024. 5For the following, we note that whileνM is supported on the midpoints of the cells, we naturally extend its kernels in a constant fashion toX. That is,νM(dxk:K |x 1:k−1) =ν M(dxk:K |c 1:k−1(x1:k−1)). 33
work page 2024
-
[2]
M. Azadkia and S. Chatterjee. A simple measure of conditional dependence.The Annals of Statistics, 49(6):3070–3102, 2021
work page 2021
-
[3]
J. Backhoff, D. Bartl, M. Beiglböck, and J. Wiesel. Estimating processes in adapted wasser- stein distance.The Annals of Applied Probability, 32(1):529–550, 2022
work page 2022
-
[4]
D. Bartl and S. Mendelson. Structure preservation via the wasserstein distance.Journal of Functional Analysis, 288(7):110810, 2025
work page 2025
-
[5]
P. W. Battaglia, J. B. Hamrick, V. Bapst, A. Sanchez-Gonzalez, V. Zambaldi, M. Malinowski, A. Tacchetti, D. Raposo, A. Santoro, R. Faulkner, et al. Relational inductive biases, deep learning, and graph networks.arXiv preprint arXiv:1806.01261, 2018
work page internal anchor Pith review Pith/arXiv arXiv 2018
-
[6]
C. Bénézet, Z. Cheng, and S. Jaimungal. Learning conditional distributions on continuous spaces.arXiv preprint arXiv:2406.09375, 2024
-
[7]
D. P. Bertsekas and S. E. Shreve.Stochastic Optimal Control: The Discrete Time Case, volume 139 ofMathematics in Science and Engineering. Academic Press, New York, 1978
work page 1978
-
[8]
P. Cheridito and S. Eckstein. Optimal transport and wasserstein distances for causal models. Bernoulli, 31(2):1351–1376, 2025
work page 2025
- [9]
-
[10]
C. Chow and C. Liu. Approximating discrete probability distributions with dependence trees. IEEE transactions on Information Theory, 14(3):462–467, 1968
work page 1968
-
[11]
T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein.Introduction to algorithms. MIT press, 3rd edition, 2009
work page 2009
-
[12]
R. M. Dudley. The speed of mean glivenko-cantelli convergence.The Annals of Mathematical Statistics, 40(1):40–50, 1969
work page 1969
-
[13]
I. Ebert-Uphoff and Y. Deng. Causal discovery for climate research using graphical models. Journal of Climate, 25(17):5648–5665, 2012
work page 2012
-
[14]
P. M. Faller, L. C. Vankadara, A. A. Mastakouri, F. Locatello, and D. Janzing. Self- compatibility: Evaluating causal discovery without ground truth. InInternational Conference on Artificial Intelligence and Statistics, pages 4132–4140. PMLR, 2024
work page 2024
- [15]
-
[16]
N. Fournier and A. Guillin. On the rate of convergence in wasserstein distance of the empirical measure.Probability theory and related fields, 162(3):707–738, 2015
work page 2015
-
[17]
A. Genevay, L. Chizat, F. Bach, M. Cuturi, and G. Peyré. Sample complexity of sinkhorn divergences. InThe 22nd international conference on artificial intelligence and statistics, pages 1574–1583. PMLR, 2019
work page 2019
-
[18]
Z. Goldfeld, K. Greenewald, J. Niles-Weed, and Y. Polyanskiy. Convergence of smoothed em- pirical measures with applications to entropy estimation.IEEE Transactions on Information Theory, 66(7):4368–4391, 2020
work page 2020
-
[19]
S. Graf and H. Luschgy.Foundations of quantization for probability distributions. Springer Science & Business Media, 2000
work page 2000
-
[20]
L. Gresele, J. Von Kügelgen, J. Kübler, E. Kirschbaum, B. Schölkopf, and D. Janzing. Causal inference through the structural causal marginal problem. InInternational conference on machine learning, pages 7793–7824. PMLR, 2022
work page 2022
- [21]
- [22]
-
[23]
B. R. Kloeckner. Empirical measures: regularity is a counter-curse to dimensionality.ESAIM: Probability and Statistics, 24:408–434, 2020
work page 2020
-
[24]
D. Koller and N. Friedman.Probabilistic graphical models: principles and techniques. MIT press, 2009. 34
work page 2009
-
[25]
Z. M. Laubach, E. J. Murray, K. L. Hoke, R. J. Safran, and W. Perng. A biologist’s guide to model selection and causal inference.Proceedings of the Royal Society B, 288(1943):20202815, 2021
work page 1943
-
[26]
T. Liang. How well can generative adversarial networks learn densities: A nonparametric view.arXiv preprint arXiv:1712.08244, 2017
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[27]
S. Ling and C. Xing.Coding theory: a first course. Cambridge university press, 2004
work page 2004
- [28]
-
[29]
B. G. Marcot and T. D. Penman. Advances in bayesian network modelling: Integration of modelling technologies.Environmental modelling & software, 111:386–393, 2019
work page 2019
- [30]
-
[31]
E. Nichani, A. Damian, and J. D. Lee. How transformers learn causal structure with gradient descent.arXiv preprint arXiv:2402.14735, 2024
-
[32]
S. Nietert, Z. Goldfeld, R. Sadhu, and K. Kato. Statistical, robustness, and computational guarantees for sliced wasserstein distances.Advances in Neural Information Processing Sys- tems, 35:28179–28193, 2022
work page 2022
-
[33]
Minimaxestimationofsmoothdensitiesinwassersteindistance
J.Niles-WeedandQ.Berthet. Minimaxestimationofsmoothdensitiesinwassersteindistance. The Annals of Statistics, 50(3):1519–1540, 2022
work page 2022
- [34]
- [35]
- [36]
- [37]
-
[38]
R. D. Shah and J. Peters. The hardness of conditional independence testing and the gener- alised covariance measure. 2020
work page 2020
- [39]
-
[40]
A. B. Tsybakov.Introduction to Nonparametric Estimation. Springer New York, NY, 2009
work page 2009
- [41]
-
[42]
Vershynin.High-dimensional probability: An introduction with applications in data science, volume 47
R. Vershynin.High-dimensional probability: An introduction with applications in data science, volume 47. Cambridge university press, 2018
work page 2018
-
[43]
Villani.Optimal transport: old and new, volume 338
C. Villani.Optimal transport: old and new, volume 338. Springer, 2008
work page 2008
-
[44]
M. Zečević, D. S. Dhami, P. Veličković, and K. Kersting. Relating graph neural networks to structural causal models.arXiv preprint arXiv:2109.04173, 2021. 35
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.