Fast Wasserstein rates for estimating probability distributions of probabilistic graphical models

Daniel Bartl; Stephan Eckstein

arxiv: 2510.09270 · v2 · pith:BGE4FDNJnew · submitted 2025-10-10 · 🧮 math.ST · math.PR· stat.TH

Fast Wasserstein rates for estimating probability distributions of probabilistic graphical models

Daniel Bartl , Stephan Eckstein This is my paper

Pith reviewed 2026-05-22 12:42 UTC · model grok-4.3

classification 🧮 math.ST math.PRstat.TH

keywords wasserstein distancegraphical modelsestimation ratesdirected acyclic graphtransition kernelssmoothnesscurse of dimensionalitylocal dimension

0 comments

The pith

The Wasserstein rate for estimating graphical model distributions is set by local node-parent dimensions under kernel smoothness.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that knowing the directed acyclic graph structure of a probabilistic graphical model allows faster rates for estimating its probability distribution in Wasserstein distance. This improvement requires quantifying the structure through smoothness conditions on the transition kernels. Under these conditions the rate is determined by the dimensions of single nodes and their parents, avoiding the full curse of dimensionality. Sympathetic readers care because this shows how domain structure can make high-dimensional estimation feasible with fewer samples.

Core claim

Using i.i.d. data to estimate a high-dimensional distribution in Wasserstein distance suffers from the curse of dimensionality. For distributions arising from probabilistic graphical models on a fixed directed acyclic graph, this curse can be overcome when the transition kernels in the corresponding disintegration satisfy smoothness conditions. The resulting estimation rate is then governed by the local dimensions of nodes together with their parents, with the precise rate depending on the strength of the smoothness assumption.

What carries the argument

Smoothness conditions on the transition kernels of the graph disintegration, which link the local node and parent dimensions to the global Wasserstein estimation rate.

If this is right

The rate improves beyond the unstructured high-dimensional case when smoothness holds.
Under bidirectional Total-Variation-Lipschitz conditions the rate is sharp.
Distributions with positive Lipschitz density are covered as a special case.
Only quantified structural knowledge via the kernels makes the graph helpful.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This suggests that accurate knowledge of the graph and kernel properties is crucial for efficient estimation in applications like causal modeling.
Similar local dimension benefits might appear in other distance-based estimation tasks for structured data.
Testing the rates on concrete examples such as Bayesian networks with known conditional densities would validate the local structure effect.

Load-bearing premise

Smoothness conditions hold for the transition kernels corresponding to the nodes in the directed acyclic graph.

What would settle it

A numerical experiment estimating a distribution from a graphical model with non-smooth kernels and observing the rate degrade to the high-dimensional case would disprove the local structure governing the rate.

Figures

Figures reproduced from arXiv: 2510.09270 by Daniel Bartl, Stephan Eckstein.

**Figure 1.** Figure 1: Exemplification of the constants occurring in Lemma 2.4. The red numbers indicate the number of outgoing paths of different lengths (e.g., (2, 1) below node 3 indicates that there are 2 paths of length 1, and 1 path of length 2 outgoing). The green numbers indicate how the constants for the cost change in the backward induction of the proof of Lemma 2.4. At the end of the backward induction (bottom right),… view at source ↗

**Figure 2.** Figure 2: Visualization of estimators for a simple graph 1 → 2 with X1 = X2 = [0, 1] with a partition of each interval into three subsets. We see the support of µˆ (on the left), the support of µˆ A (middle) and the support of µˆ bA (right). Hereby, blue crosses are the initial data points, green are the new data points which are added by making the kernels constant in the direction from first to second coordinate, … view at source ↗

read the original abstract

Using i.i.d. data to estimate a high-dimensional distribution in Wasserstein distance is a fundamental instance of the curse of dimensionality. We explore how structural knowledge about the data-generating process which gives rise to the distribution can be used to overcome this curse. More precisely, we work with the set of distributions of probabilistic graphical models for a given directed acyclic graph. It turns out that this knowledge is only helpful if it can be quantified, which we formalize via smoothness conditions on the transition kernels in the disintegration corresponding to the graph. In this case, we prove that the rate of estimation is governed by the local structure of the graph, more precisely by dimensions corresponding to single nodes together with their parent nodes. The precise rate depends on the exact notion of smoothness assumed for the kernels, where either weak (Wasserstein-Lipschitz) or strong (bidirectional Total-Variation-Lipschitz) conditions lead to different results. We prove sharpness under the strong condition and show that this condition covers, as a special case, distributions having a positive Lipschitz density.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper shows Wasserstein estimation rates for PGM distributions depend on local node-plus-parent dimensions once the transition kernels meet one of two Lipschitz smoothness conditions.

read the letter

The key point is that this work derives rates for Wasserstein estimation of distributions generated by a fixed DAG that are governed by the dimensions of single nodes and their parents, not the global dimension, provided the kernels in the disintegration satisfy either Wasserstein-Lipschitz or bidirectional TV-Lipschitz smoothness. The strong condition yields sharp rates, and the weak one still produces a local rate, though possibly slower. Positive Lipschitz densities fall out as a special case under the strong condition. This is the main new piece relative to earlier distribution estimation results. The argument uses the graph structure to break the joint problem into separate conditional estimations and then carries local entropy numbers through to the global Wasserstein bound. That reduction is clean and avoids hidden dependence on graph depth or uniformity constants. The stress-test confirms the central claim holds under the stated assumptions with no circularity or post-hoc fitting. The proofs appear first-principles and the citation pattern against standard high-dimensional benchmarks looks appropriate. One minor soft spot is that the weak condition may not improve much over unstructured rates on dense graphs, but the paper states this explicitly rather than overstating it. Full verification of the entropy integral steps would still be useful in review, yet nothing in the outline suggests a load-bearing gap. This paper is for statisticians working on nonparametric rates in structured high-dimensional settings and for people who care about Wasserstein metrics in graphical models. A reader already thinking about local dimension effects or kernel smoothness will get concrete value from the local-rate statements. It deserves peer review because the result is precise enough to be checked and the contribution is focused enough to matter in the literature.

Referee Report

2 major / 3 minor

Summary. The paper claims that for distributions of probabilistic graphical models on a fixed DAG, the minimax rate of estimation in Wasserstein distance from i.i.d. samples is governed by the local dimensions (each node together with its parents) once the transition kernels satisfy either weak Wasserstein-Lipschitz or strong bidirectional TV-Lipschitz smoothness. Upper bounds are derived for both regimes via disintegration into local conditionals; sharpness is proved under the strong condition, which also covers positive Lipschitz densities as a special case. No dependence on global dimension or graph depth appears in the final rates.

Significance. If the derivations hold, the work provides a clean theoretical mechanism for overcoming the curse of dimensionality in distribution estimation by combining graphical structure with quantifiable local smoothness. The reduction to per-node covering numbers or entropy integrals, together with explicit sharpness under the strong condition, is a substantive contribution to nonparametric statistics and optimal transport. The results are directly relevant to structured high-dimensional data and causal models where local smoothness is plausible.

major comments (2)

[§3.2] §3.2, the statement of the main upper bound (presumably Theorem 3.1): the global Wasserstein bound is obtained by summing local discrepancies; the argument must confirm that the triangle inequality or coupling constants remain uniform across nodes and do not accumulate with graph depth or number of nodes, as this is load-bearing for the claim of purely local rates.
[Theorem 4.1] Theorem 4.1 (sharpness under strong condition): the lower-bound construction packs local kernels while keeping the joint in the PGM class; it should be verified that the same packing does not inadvertently violate the bidirectional TV-Lipschitz assumption for some nodes, which would weaken the sharpness statement.

minor comments (3)

[Introduction] The abstract and introduction use 'bidirectional Total-Variation-Lipschitz' without an immediate equation reference; adding the precise definition (e.g., Eq. (2) or (3)) at first use would improve readability.
[Preliminaries] Notation for the disintegration and the local Wasserstein distance on each conditional should be introduced once and used consistently; occasional reuse of P for both joint and conditional measures is mildly confusing.
[Introduction] Figure 1 (if present) or the illustrative DAG examples would benefit from explicit labeling of the local dimensions d_v + |pa(v)| next to each node.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading, positive assessment, and constructive comments on our manuscript. We address each major comment point by point below.

read point-by-point responses

Referee: [§3.2] §3.2, the statement of the main upper bound (presumably Theorem 3.1): the global Wasserstein bound is obtained by summing local discrepancies; the argument must confirm that the triangle inequality or coupling constants remain uniform across nodes and do not accumulate with graph depth or number of nodes, as this is load-bearing for the claim of purely local rates.

Authors: In the proof of Theorem 3.1 we disintegrate the joint distributions according to the fixed DAG and construct a global coupling as the product of local couplings between the conditional kernels. The Wasserstein distance between the joints is then bounded by the sum of the local Wasserstein distances; the triangle inequality is applied along the topological order with multiplicative constants identically equal to one. Because the underlying metric is additive across coordinates and each local coupling is chosen independently, no depth- or size-dependent factors arise. The uniformity is stated after Equation (3.2) and proved in the appendix; we will add a short clarifying sentence in the main-text proof sketch to make this explicit. revision: partial
Referee: [Theorem 4.1] Theorem 4.1 (sharpness under strong condition): the lower-bound construction packs local kernels while keeping the joint in the PGM class; it should be verified that the same packing does not inadvertently violate the bidirectional TV-Lipschitz assumption for some nodes, which would weaken the sharpness statement.

Authors: The packing used for the lower bound perturbs each local transition kernel inside a small ball that preserves the bidirectional TV-Lipschitz constant of the original kernel. Because the perturbations are chosen to be Lipschitz with respect to the same constant and the graph is fixed, every constructed conditional satisfies the strong smoothness assumption uniformly across nodes. This verification appears in the proof of Theorem 4.1 (Section 4) and the accompanying supplementary calculations; the resulting family remains inside the model class. revision: no

Circularity Check

0 steps flagged

Derivation self-contained from first-principles smoothness assumptions

full rationale

The paper derives minimax Wasserstein rates for distributions of probabilistic graphical models on a fixed DAG by assuming either Wasserstein-Lipschitz or bidirectional TV-Lipschitz conditions on the transition kernels in the disintegration. The joint estimation problem reduces to separate control of each conditional via standard entropy integrals or covering numbers, with the global rate then depending only on the local dimensions (node plus parents). This follows from general statistical theory on empirical processes and does not reduce to any fitted parameter, self-definition, or load-bearing self-citation chain. Sharpness under the strong condition is shown by explicit lower bounds that match the upper bounds without circularity. The argument is independent of the target result and externally falsifiable via the stated Lipschitz assumptions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on domain assumptions about kernel smoothness and standard disintegration of the joint distribution along the DAG; no free parameters or new invented entities are introduced.

axioms (1)

domain assumption Transition kernels admit either Wasserstein-Lipschitz or bidirectional Total-Variation-Lipschitz smoothness
This quantitative smoothness is invoked to obtain the improved rates governed by local dimensions.

pith-pipeline@v0.9.0 · 5711 in / 1171 out tokens · 41613 ms · 2026-05-22T12:42:39.356646+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AlexanderDuality.lean alexander_duality_circle_linking unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

rate of estimation is governed by the local structure of the graph, more precisely by dimensions corresponding to single nodes together with their parent nodes
IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

smoothness conditions on the transition kernels in the disintegration

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

44 extracted references · 44 canonical work pages · 2 internal anchors

[1]

Acciaio and S

B. Acciaio and S. Hou. Convergence of adapted empirical measures on r d.The Annals of Applied Probability, 34(5):4799–4835, 2024. 5For the following, we note that whileνM is supported on the midpoints of the cells, we naturally extend its kernels in a constant fashion toX. That is,νM(dxk:K |x 1:k−1) =ν M(dxk:K |c 1:k−1(x1:k−1)). 33

work page 2024
[2]

Azadkia and S

M. Azadkia and S. Chatterjee. A simple measure of conditional dependence.The Annals of Statistics, 49(6):3070–3102, 2021

work page 2021
[3]

Backhoff, D

J. Backhoff, D. Bartl, M. Beiglböck, and J. Wiesel. Estimating processes in adapted wasser- stein distance.The Annals of Applied Probability, 32(1):529–550, 2022

work page 2022
[4]

Bartl and S

D. Bartl and S. Mendelson. Structure preservation via the wasserstein distance.Journal of Functional Analysis, 288(7):110810, 2025

work page 2025
[5]

P. W. Battaglia, J. B. Hamrick, V. Bapst, A. Sanchez-Gonzalez, V. Zambaldi, M. Malinowski, A. Tacchetti, D. Raposo, A. Santoro, R. Faulkner, et al. Relational inductive biases, deep learning, and graph networks.arXiv preprint arXiv:1806.01261, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[6]

Bénézet, Z

C. Bénézet, Z. Cheng, and S. Jaimungal. Learning conditional distributions on continuous spaces.arXiv preprint arXiv:2406.09375, 2024

work page arXiv 2024
[7]

D. P. Bertsekas and S. E. Shreve.Stochastic Optimal Control: The Discrete Time Case, volume 139 ofMathematics in Science and Engineering. Academic Press, New York, 1978

work page 1978
[8]

Cheridito and S

P. Cheridito and S. Eckstein. Optimal transport and wasserstein distances for causal models. Bernoulli, 31(2):1351–1376, 2025

work page 2025
[9]

Chewi, J

S. Chewi, J. Niles-Weed, and P. Rigollet. Statistical optimal transport.arXiv preprint arXiv:2407.18163, 2024

work page arXiv 2024
[10]

Chow and C

C. Chow and C. Liu. Approximating discrete probability distributions with dependence trees. IEEE transactions on Information Theory, 14(3):462–467, 1968

work page 1968
[11]

T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein.Introduction to algorithms. MIT press, 3rd edition, 2009

work page 2009
[12]

R. M. Dudley. The speed of mean glivenko-cantelli convergence.The Annals of Mathematical Statistics, 40(1):40–50, 1969

work page 1969
[13]

Ebert-Uphoff and Y

I. Ebert-Uphoff and Y. Deng. Causal discovery for climate research using graphical models. Journal of Climate, 25(17):5648–5665, 2012

work page 2012
[14]

P. M. Faller, L. C. Vankadara, A. A. Mastakouri, F. Locatello, and D. Janzing. Self- compatibility: Evaluating causal discovery without ground truth. InInternational Conference on Artificial Intelligence and Statistics, pages 4132–4140. PMLR, 2024

work page 2024
[15]

Fournier

N. Fournier. Convergence of the empirical measure in expected wasserstein distance: non- asymptotic explicit bounds in rd.ESAIM: Probability and Statistics, 27:749–775, 2023

work page 2023
[16]

Fournier and A

N. Fournier and A. Guillin. On the rate of convergence in wasserstein distance of the empirical measure.Probability theory and related fields, 162(3):707–738, 2015

work page 2015
[17]

Genevay, L

A. Genevay, L. Chizat, F. Bach, M. Cuturi, and G. Peyré. Sample complexity of sinkhorn divergences. InThe 22nd international conference on artificial intelligence and statistics, pages 1574–1583. PMLR, 2019

work page 2019
[18]

Goldfeld, K

Z. Goldfeld, K. Greenewald, J. Niles-Weed, and Y. Polyanskiy. Convergence of smoothed em- pirical measures with applications to entropy estimation.IEEE Transactions on Information Theory, 66(7):4368–4391, 2020

work page 2020
[19]

Graf and H

S. Graf and H. Luschgy.Foundations of quantization for probability distributions. Springer Science & Business Media, 2000

work page 2000
[20]

Gresele, J

L. Gresele, J. Von Kügelgen, J. Kübler, E. Kirschbaum, B. Schölkopf, and D. Janzing. Causal inference through the structural causal marginal problem. InInternational conference on machine learning, pages 7793–7824. PMLR, 2022

work page 2022
[21]

Györfi, A

L. Györfi, A. Kontorovich, and R. Weiss. Tree density estimation.IEEE Transactions on Information Theory, 69(2):1168–1176, 2022

work page 2022
[22]

Heckerman

D. Heckerman. A tutorial on learning with bayesian networks.Learning in graphical models, pages 301–354, 1998

work page 1998
[23]

B. R. Kloeckner. Empirical measures: regularity is a counter-curse to dimensionality.ESAIM: Probability and Statistics, 24:408–434, 2020

work page 2020
[24]

Koller and N

D. Koller and N. Friedman.Probabilistic graphical models: principles and techniques. MIT press, 2009. 34

work page 2009
[25]

Z. M. Laubach, E. J. Murray, K. L. Hoke, R. J. Safran, and W. Perng. A biologist’s guide to model selection and causal inference.Proceedings of the Royal Society B, 288(1943):20202815, 2021

work page 1943
[26]

T. Liang. How well can generative adversarial networks learn densities: A nonparametric view.arXiv preprint arXiv:1712.08244, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[27]

Ling and C

S. Ling and C. Xing.Coding theory: a first course. Cambridge university press, 2004

work page 2004
[28]

Manole, S

T. Manole, S. Balakrishnan, and L. Wasserman. Minimax confidence intervals for the sliced wasserstein distance.Electronic Journal of Statistics, 16(1):2252–2345, 2022

work page 2022
[29]

B. G. Marcot and T. D. Penman. Advances in bayesian network modelling: Integration of modelling technologies.Environmental modelling & software, 111:386–393, 2019

work page 2019
[30]

Neykov, S

M. Neykov, S. Balakrishnan, and L. Wasserman. Minimax optimal conditional independence testing.The Annals of Statistics, 49(4):2151–2177, 2021

work page 2021
[31]

Nichani, A

E. Nichani, A. Damian, and J. D. Lee. How transformers learn causal structure with gradient descent.arXiv preprint arXiv:2402.14735, 2024

work page arXiv 2024
[32]

Nietert, Z

S. Nietert, Z. Goldfeld, R. Sadhu, and K. Kato. Statistical, robustness, and computational guarantees for sliced wasserstein distances.Advances in Neural Information Processing Sys- tems, 35:28179–28193, 2022

work page 2022
[33]

Minimaxestimationofsmoothdensitiesinwassersteindistance

J.Niles-WeedandQ.Berthet. Minimaxestimationofsmoothdensitiesinwassersteindistance. The Annals of Statistics, 50(3):1519–1540, 2022

work page 2022
[34]

Pearl.Causality

J. Pearl.Causality. Cambridge university press, 2009

work page 2009
[35]

Peters, D

J. Peters, D. Janzing, and B. Schölkopf.Elements of causal inference: foundations and learning algorithms. The MIT Press, 2017

work page 2017
[36]

N. Sani, A. A. Mastakouri, and D. Janzing. Bounding probabilities of causation through the causal marginal problem.arXiv preprint arXiv:2304.02023, 2023

work page arXiv 2023
[37]

Schölkopf

B. Schölkopf. Causality for machine learning. InProbabilistic and causal inference: The works of Judea Pearl, pages 765–804. 2022

work page 2022
[38]

R. D. Shah and J. Peters. The hardness of conditional independence testing and the gener- alised covariance measure. 2020

work page 2020
[39]

Singh, A

S. Singh, A. Uppal, B. Li, C.-L. Li, M. Zaheer, and B. Póczos. Nonparametric density estimation under adversarial losses.Advances in Neural Information Processing Systems, 31, 2018

work page 2018
[40]

A. B. Tsybakov.Introduction to Nonparametric Estimation. Springer New York, NY, 2009

work page 2009
[41]

R. A. Vandermeulen, W. M. Tai, and B. Aragam. Breaking the curse of dimensionality in structured density estimation.arXiv preprint arXiv:2410.07685, 2024

work page arXiv 2024
[42]

Vershynin.High-dimensional probability: An introduction with applications in data science, volume 47

R. Vershynin.High-dimensional probability: An introduction with applications in data science, volume 47. Cambridge university press, 2018

work page 2018
[43]

Villani.Optimal transport: old and new, volume 338

C. Villani.Optimal transport: old and new, volume 338. Springer, 2008

work page 2008
[44]

Zečević, D

M. Zečević, D. S. Dhami, P. Veličković, and K. Kersting. Relating graph neural networks to structural causal models.arXiv preprint arXiv:2109.04173, 2021. 35

work page arXiv 2021

[1] [1]

Acciaio and S

B. Acciaio and S. Hou. Convergence of adapted empirical measures on r d.The Annals of Applied Probability, 34(5):4799–4835, 2024. 5For the following, we note that whileνM is supported on the midpoints of the cells, we naturally extend its kernels in a constant fashion toX. That is,νM(dxk:K |x 1:k−1) =ν M(dxk:K |c 1:k−1(x1:k−1)). 33

work page 2024

[2] [2]

Azadkia and S

M. Azadkia and S. Chatterjee. A simple measure of conditional dependence.The Annals of Statistics, 49(6):3070–3102, 2021

work page 2021

[3] [3]

Backhoff, D

J. Backhoff, D. Bartl, M. Beiglböck, and J. Wiesel. Estimating processes in adapted wasser- stein distance.The Annals of Applied Probability, 32(1):529–550, 2022

work page 2022

[4] [4]

Bartl and S

D. Bartl and S. Mendelson. Structure preservation via the wasserstein distance.Journal of Functional Analysis, 288(7):110810, 2025

work page 2025

[5] [5]

P. W. Battaglia, J. B. Hamrick, V. Bapst, A. Sanchez-Gonzalez, V. Zambaldi, M. Malinowski, A. Tacchetti, D. Raposo, A. Santoro, R. Faulkner, et al. Relational inductive biases, deep learning, and graph networks.arXiv preprint arXiv:1806.01261, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018

[6] [6]

Bénézet, Z

C. Bénézet, Z. Cheng, and S. Jaimungal. Learning conditional distributions on continuous spaces.arXiv preprint arXiv:2406.09375, 2024

work page arXiv 2024

[7] [7]

D. P. Bertsekas and S. E. Shreve.Stochastic Optimal Control: The Discrete Time Case, volume 139 ofMathematics in Science and Engineering. Academic Press, New York, 1978

work page 1978

[8] [8]

Cheridito and S

P. Cheridito and S. Eckstein. Optimal transport and wasserstein distances for causal models. Bernoulli, 31(2):1351–1376, 2025

work page 2025

[9] [9]

Chewi, J

S. Chewi, J. Niles-Weed, and P. Rigollet. Statistical optimal transport.arXiv preprint arXiv:2407.18163, 2024

work page arXiv 2024

[10] [10]

Chow and C

C. Chow and C. Liu. Approximating discrete probability distributions with dependence trees. IEEE transactions on Information Theory, 14(3):462–467, 1968

work page 1968

[11] [11]

T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein.Introduction to algorithms. MIT press, 3rd edition, 2009

work page 2009

[12] [12]

R. M. Dudley. The speed of mean glivenko-cantelli convergence.The Annals of Mathematical Statistics, 40(1):40–50, 1969

work page 1969

[13] [13]

Ebert-Uphoff and Y

I. Ebert-Uphoff and Y. Deng. Causal discovery for climate research using graphical models. Journal of Climate, 25(17):5648–5665, 2012

work page 2012

[14] [14]

P. M. Faller, L. C. Vankadara, A. A. Mastakouri, F. Locatello, and D. Janzing. Self- compatibility: Evaluating causal discovery without ground truth. InInternational Conference on Artificial Intelligence and Statistics, pages 4132–4140. PMLR, 2024

work page 2024

[15] [15]

Fournier

N. Fournier. Convergence of the empirical measure in expected wasserstein distance: non- asymptotic explicit bounds in rd.ESAIM: Probability and Statistics, 27:749–775, 2023

work page 2023

[16] [16]

Fournier and A

N. Fournier and A. Guillin. On the rate of convergence in wasserstein distance of the empirical measure.Probability theory and related fields, 162(3):707–738, 2015

work page 2015

[17] [17]

Genevay, L

A. Genevay, L. Chizat, F. Bach, M. Cuturi, and G. Peyré. Sample complexity of sinkhorn divergences. InThe 22nd international conference on artificial intelligence and statistics, pages 1574–1583. PMLR, 2019

work page 2019

[18] [18]

Goldfeld, K

Z. Goldfeld, K. Greenewald, J. Niles-Weed, and Y. Polyanskiy. Convergence of smoothed em- pirical measures with applications to entropy estimation.IEEE Transactions on Information Theory, 66(7):4368–4391, 2020

work page 2020

[19] [19]

Graf and H

S. Graf and H. Luschgy.Foundations of quantization for probability distributions. Springer Science & Business Media, 2000

work page 2000

[20] [20]

Gresele, J

L. Gresele, J. Von Kügelgen, J. Kübler, E. Kirschbaum, B. Schölkopf, and D. Janzing. Causal inference through the structural causal marginal problem. InInternational conference on machine learning, pages 7793–7824. PMLR, 2022

work page 2022

[21] [21]

Györfi, A

L. Györfi, A. Kontorovich, and R. Weiss. Tree density estimation.IEEE Transactions on Information Theory, 69(2):1168–1176, 2022

work page 2022

[22] [22]

Heckerman

D. Heckerman. A tutorial on learning with bayesian networks.Learning in graphical models, pages 301–354, 1998

work page 1998

[23] [23]

B. R. Kloeckner. Empirical measures: regularity is a counter-curse to dimensionality.ESAIM: Probability and Statistics, 24:408–434, 2020

work page 2020

[24] [24]

Koller and N

D. Koller and N. Friedman.Probabilistic graphical models: principles and techniques. MIT press, 2009. 34

work page 2009

[25] [25]

Z. M. Laubach, E. J. Murray, K. L. Hoke, R. J. Safran, and W. Perng. A biologist’s guide to model selection and causal inference.Proceedings of the Royal Society B, 288(1943):20202815, 2021

work page 1943

[26] [26]

T. Liang. How well can generative adversarial networks learn densities: A nonparametric view.arXiv preprint arXiv:1712.08244, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017

[27] [27]

Ling and C

S. Ling and C. Xing.Coding theory: a first course. Cambridge university press, 2004

work page 2004

[28] [28]

Manole, S

T. Manole, S. Balakrishnan, and L. Wasserman. Minimax confidence intervals for the sliced wasserstein distance.Electronic Journal of Statistics, 16(1):2252–2345, 2022

work page 2022

[29] [29]

B. G. Marcot and T. D. Penman. Advances in bayesian network modelling: Integration of modelling technologies.Environmental modelling & software, 111:386–393, 2019

work page 2019

[30] [30]

Neykov, S

M. Neykov, S. Balakrishnan, and L. Wasserman. Minimax optimal conditional independence testing.The Annals of Statistics, 49(4):2151–2177, 2021

work page 2021

[31] [31]

Nichani, A

E. Nichani, A. Damian, and J. D. Lee. How transformers learn causal structure with gradient descent.arXiv preprint arXiv:2402.14735, 2024

work page arXiv 2024

[32] [32]

Nietert, Z

S. Nietert, Z. Goldfeld, R. Sadhu, and K. Kato. Statistical, robustness, and computational guarantees for sliced wasserstein distances.Advances in Neural Information Processing Sys- tems, 35:28179–28193, 2022

work page 2022

[33] [33]

Minimaxestimationofsmoothdensitiesinwassersteindistance

J.Niles-WeedandQ.Berthet. Minimaxestimationofsmoothdensitiesinwassersteindistance. The Annals of Statistics, 50(3):1519–1540, 2022

work page 2022

[34] [34]

Pearl.Causality

J. Pearl.Causality. Cambridge university press, 2009

work page 2009

[35] [35]

Peters, D

J. Peters, D. Janzing, and B. Schölkopf.Elements of causal inference: foundations and learning algorithms. The MIT Press, 2017

work page 2017

[36] [36]

N. Sani, A. A. Mastakouri, and D. Janzing. Bounding probabilities of causation through the causal marginal problem.arXiv preprint arXiv:2304.02023, 2023

work page arXiv 2023

[37] [37]

Schölkopf

B. Schölkopf. Causality for machine learning. InProbabilistic and causal inference: The works of Judea Pearl, pages 765–804. 2022

work page 2022

[38] [38]

R. D. Shah and J. Peters. The hardness of conditional independence testing and the gener- alised covariance measure. 2020

work page 2020

[39] [39]

Singh, A

S. Singh, A. Uppal, B. Li, C.-L. Li, M. Zaheer, and B. Póczos. Nonparametric density estimation under adversarial losses.Advances in Neural Information Processing Systems, 31, 2018

work page 2018

[40] [40]

A. B. Tsybakov.Introduction to Nonparametric Estimation. Springer New York, NY, 2009

work page 2009

[41] [41]

R. A. Vandermeulen, W. M. Tai, and B. Aragam. Breaking the curse of dimensionality in structured density estimation.arXiv preprint arXiv:2410.07685, 2024

work page arXiv 2024

[42] [42]

Vershynin.High-dimensional probability: An introduction with applications in data science, volume 47

R. Vershynin.High-dimensional probability: An introduction with applications in data science, volume 47. Cambridge university press, 2018

work page 2018

[43] [43]

Villani.Optimal transport: old and new, volume 338

C. Villani.Optimal transport: old and new, volume 338. Springer, 2008

work page 2008

[44] [44]

Zečević, D

M. Zečević, D. S. Dhami, P. Veličković, and K. Kersting. Relating graph neural networks to structural causal models.arXiv preprint arXiv:2109.04173, 2021. 35

work page arXiv 2021