pith. sign in

arxiv: 2605.15996 · v1 · pith:OLZKI52Bnew · submitted 2026-05-15 · 📊 stat.ML · cs.LG· math.ST· stat.TH

Testing properties of trees in graphical models with covariance queries

Pith reviewed 2026-05-19 19:12 UTC · model grok-4.3

classification 📊 stat.ML cs.LGmath.STstat.TH
keywords graphical modelstree structurecovariance queriesproperty testingquery complexityhigh-dimensional statisticsstructural propertiesrandomized algorithms
0
0 comments X

The pith

Global structural properties of trees in graphical models can be tested with sub-quadratic covariance queries.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper considers testing properties of tree graphs that underlie high-dimensional graphical models when only covariance queries are available. It establishes that full reconstruction of the tree is costly in query terms, yet randomized tests can efficiently verify several global properties using a sub-quadratic number of queries. Explicit procedures and bounds are given for testing the number of leaves, the maximum degree, the typical distance, and the diameter, with the complexity depending on chosen thresholds and error tolerances. A sympathetic reader would care because this separates the cost of learning the full structure from the cost of checking whether the structure satisfies basic global conditions that often matter in applications.

Core claim

The main results show that while reconstructing the entire tree may be costly, certain global structural properties can be tested efficiently. In particular, randomized tests for global structural properties use a sub-quadratic number of queries. Testing procedures are developed for the number of leaves, the maximum degree, the typical distance, and the diameter of the tree, each with explicit query complexity bounds that depend on the target threshold and tolerance parameters.

What carries the argument

The covariance query model that returns the covariance between any chosen pair of observed variables, which is used to design randomized tests that decide whether a tree property meets a given threshold.

If this is right

  • The number of leaves can be tested to within a given additive tolerance using fewer than quadratic queries.
  • The maximum degree of the tree admits a sub-quadratic query test with explicit dependence on the degree threshold.
  • Typical distance and diameter each admit sub-quadratic randomized tests controlled by tolerance parameters.
  • Property testing remains cheaper than full tree reconstruction whenever only these global features are needed.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same query model might support efficient tests for other global invariants such as the number of internal nodes once similar concentration arguments are developed.
  • In high-dimensional data pipelines, these tests could serve as cheap sanity checks before committing to expensive structure-learning algorithms.
  • The separation between reconstruction cost and testing cost suggests analogous savings may exist for other sparse graphical models beyond trees.

Load-bearing premise

The underlying graph is exactly a tree and the covariance query model applies directly to the testing procedures.

What would settle it

A family of trees where every algorithm that correctly tests one of the listed properties (number of leaves, maximum degree, typical distance, or diameter) requires quadratic or more covariance queries in the worst case would falsify the sub-quadratic claim.

read the original abstract

We consider the problem of testing properties of graphs underlying high-dimensional graphical models. We adopt the model of covariance queries introduced by Lugosi, Truszkowski, Velona, and Zwiernik (2021). We study the case when the underlying graph is a tree. The main results of the paper show that, while reconstructing the entire tree may be costly, certain global structural properties can be tested efficiently. In particular, we design randomized tests for global structural properties that use a sub-quadratic number of queries. We develop testing procedures for several fundamental properties, including the number of leaves, the maximum degree, the typical distance, and the diameter of the tree. For each property, we obtain explicit query complexity bounds that depend on the target threshold and tolerance parameters.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The paper studies testing global structural properties of tree-structured graphical models under the covariance query model of Lugosi et al. (2021). While full tree reconstruction can be query-expensive, the authors design randomized testers for the number of leaves, maximum degree, typical distance, and diameter that succeed with a sub-quadratic number of queries. Explicit query-complexity bounds are stated in terms of the target threshold and additive tolerance parameters.

Significance. If the stated bounds and correctness proofs hold, the work is significant for separating the query cost of global property testing from full reconstruction in tree graphical models. The explicit dependence on threshold and tolerance, together with the focus on fundamental tree invariants, supplies concrete, falsifiable complexity statements that can guide both theory and algorithm design in high-dimensional covariance-based inference.

major comments (2)
  1. [§3] §3 (number-of-leaves tester): the sub-quadratic bound is derived by sampling pairs and thresholding their covariances to estimate the fraction of leaves; the variance analysis must explicitly track dependence on the minimum edge covariance (or correlation strength) to ensure the claimed O(n^{3/2} poly(log n, 1/ε)) scaling remains uniform over all valid tree models.
  2. [§5] §5 (diameter tester): the reduction from diameter estimation to typical-distance queries via random sampling of root-to-leaf paths assumes that the covariance oracle returns exact values; any additive noise model or finite-precision arithmetic would inflate the stated query bound by an extra 1/δ factor that is not currently accounted for in the tolerance analysis.
minor comments (3)
  1. [Abstract] The abstract and introduction should state the precise polynomial dependence on n (number of nodes) rather than only saying “sub-quadratic.”
  2. [§2] Notation for the covariance query oracle (Definition 2.1) should be cross-referenced each time a new tester is introduced so that the reader can immediately see which pairs are queried.
  3. [§6] A short table summarizing the four query bounds side-by-side (with their dependence on threshold τ and tolerance ε) would improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading, positive assessment of significance, and recommendation for minor revision. We address each major comment below.

read point-by-point responses
  1. Referee: [§3] §3 (number-of-leaves tester): the sub-quadratic bound is derived by sampling pairs and thresholding their covariances to estimate the fraction of leaves; the variance analysis must explicitly track dependence on the minimum edge covariance (or correlation strength) to ensure the claimed O(n^{3/2} poly(log n, 1/ε)) scaling remains uniform over all valid tree models.

    Authors: We thank the referee for highlighting this point. The variance of the empirical fraction of leaf pairs indeed depends on the minimum edge covariance ρ_min, since covariances between non-adjacent leaves are bounded by ρ_min raised to the path length. Our current proof absorbs this into the logarithmic factors under the implicit assumption that ρ_min is bounded below by a positive constant (standard to avoid degenerate models). To make the dependence explicit and ensure the bound holds uniformly, we will revise Theorem 3.1 and its proof to include an explicit multiplicative factor of O(1/ρ_min²) in the query complexity. This change preserves the sub-quadratic scaling for any fixed ρ_min > 0 while addressing the uniformity concern. revision: yes

  2. Referee: [§5] §5 (diameter tester): the reduction from diameter estimation to typical-distance queries via random sampling of root-to-leaf paths assumes that the covariance oracle returns exact values; any additive noise model or finite-precision arithmetic would inflate the stated query bound by an extra 1/δ factor that is not currently accounted for in the tolerance analysis.

    Authors: The referee is correct that the analysis relies on the exact covariance oracle model introduced by Lugosi et al. (2021). Under this model, the oracle returns precise values, so no additional 1/δ factor appears. We agree that an additive δ-noise model would require an extra O(1/δ²) factor in the sample size to control the error in typical-distance estimates. Since the manuscript is situated in the exact-query setting, we will add a short remark in Section 5 noting this modeling assumption and the implication for noisy extensions, without altering the stated bounds. revision: partial

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained

full rationale

The paper adopts the covariance query model as an external assumption from the 2021 Lugosi et al. reference and restricts to the exact tree case to derive randomized testers for properties such as number of leaves, maximum degree, typical distance, and diameter. These testers are constructed via sampling of covariance pairs to extract path-length statistics, yielding explicit sub-quadratic query bounds that depend directly on the stated threshold and tolerance parameters. No step equates a derived quantity to a fitted input by construction, invokes a self-citation as the sole justification for a uniqueness claim, or renames an empirical pattern as a new result. The central claims follow from the model definitions and sampling arguments without reducing to the inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claims rest on the domain assumption that the graph is a tree and on the covariance query model from prior work; no new free parameters or invented entities are introduced.

axioms (2)
  • domain assumption The underlying graph is a tree.
    Explicitly stated as the case studied in the abstract.
  • domain assumption Covariance queries follow the model from Lugosi et al. (2021).
    The paper adopts this model for all testing procedures.

pith-pipeline@v0.9.0 · 5670 in / 1183 out tokens · 33381 ms · 2026-05-19T19:12:53.204897+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

16 extracted references · 16 canonical work pages

  1. [1]

    Boucheron, G

    St ´ephane Boucheron, G ´abor Lugosi, and Pascal Massart.Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford University Press, 2013. ISBN 9780199535255. doi: 10.1093/acprof:oso/9780199535255.001.0001

  2. [2]

    The recovery of trees from measures of dissimilarity.Mathematics in the archae- ological and historical sciences, 1971

    Peter Buneman. The recovery of trees from measures of dissimilarity.Mathematics in the archae- ological and historical sciences, 1971

  3. [3]

    Estimating sparse precision matrix: Optimal rates of convergence and adaptive estimation.The Annals of Statistics, 44(2):455–488, 2016

    T Tony Cai, Weidong Liu, Harrison H Zhou, et al. Estimating sparse precision matrix: Optimal rates of convergence and adaptive estimation.The Annals of Statistics, 44(2):455–488, 2016

  4. [4]

    Active learning algo- rithms for graphical model selection

    Gautamd Dasarathy, Aarti Singh, Maria-Florina Balcan, and Jong H Park. Active learning algo- rithms for graphical model selection. InArtificial Intelligence and Statistics, pages 1356–1364, 2016

  5. [5]

    Property testing in graphical models: testing small separation numbers.Information and Inference: A Journal of the IMA, 2026, to appear

    Luc Devroye, G ´abor Lugosi, and Piotr Zwiernik. Property testing in graphical models: testing small separation numbers.Information and Inference: A Journal of the IMA, 2026, to appear

  6. [6]

    Learning latent tree models with small query complexity.Bernoulli, 2026, to appear

    Luc Devroye, G ´abor Lugosi, and Piotr Zwiernik. Learning latent tree models with small query complexity.Bernoulli, 2026, to appear

  7. [7]

    Structure learning in graphical modeling.Annual Review of Statistics and Its Application, 4:365–393, 2017

    Mathias Drton and Marloes H Maathuis. Structure learning in graphical modeling.Annual Review of Statistics and Its Application, 4:365–393, 2017

  8. [8]

    Probability inequalities for sums of bounded random variables.Journal of the American Statistical Association, 58(301):13–30, 1963

    Wassily Hoeffding. Probability inequalities for sums of bounded random variables.Journal of the American Statistical Association, 58(301):13–30, 1963

  9. [9]

    Reconstructing evolutionary trees from dna and protein sequences: paralinear distances.Proceedings of the National Academy of Sciences, 91(4):1455–1459, 1994

    James A Lake. Reconstructing evolutionary trees from dna and protein sequences: paralinear distances.Proceedings of the National Academy of Sciences, 91(4):1455–1459, 1994

  10. [10]

    Clarendon Press, 1996

    Steffen L Lauritzen.Graphical models, volume 17. Clarendon Press, 1996

  11. [11]

    Learning partial corre- lation graphs and graphical models by covariance queries.Journal of Machine Learning Research, 22(203):1–41, 2021

    G ´abor Lugosi, Jakub Truszkowski, Vasiliki Velona, and Piotr Zwiernik. Learning partial corre- lation graphs and graphical models by covariance queries.Journal of Machine Learning Research, 22(203):1–41, 2021

  12. [12]

    Combinatorial inference for graphical models.The Annals of Statistics, 47(2):795–827, 2019

    Matey Neykov, Junwei Lu, and Han Liu. Combinatorial inference for graphical models.The Annals of Statistics, 47(2):795–827, 2019

  13. [13]

    Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1988

    Judea Pearl.Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1988. ISBN 1558604790

  14. [14]

    Oxford University Press on Demand, 2003

    Charles Semple and Mike Steel.Phylogenetics, volume 24. Oxford University Press on Demand, 2003

  15. [15]

    Hierarchical latent class models for cluster analysis.Journal of Machine Learning Research, 5(6):697–723, 2004

    Nevin L Zhang. Hierarchical latent class models for cluster analysis.Journal of Machine Learning Research, 5(6):697–723, 2004

  16. [16]

    Latent tree models

    Piotr Zwiernik. Latent tree models. InHandbook of graphical models, pages 265–288. CRC Press, 2018. 19