pith. sign in

arxiv: 1907.10176 · v1 · pith:PFJATAUAnew · submitted 2019-07-23 · 🧮 math.ST · stat.ME· stat.TH

Graph inference with clustering and false discovery rate control

Pith reviewed 2026-05-24 16:42 UTC · model grok-4.3

classification 🧮 math.ST stat.MEstat.TH
keywords stochastic block modelfalse discovery ratetrue discovery ratevariational expectation-maximizationnode clusteringgraph inferencenetwork reconstruction
0
0 comments X

The pith

In a noisy stochastic block model, reliable node clustering enables FDR control for edge inference together with near-optimal true discovery rate as graph size grows.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a noisy version of the stochastic block model in which edges are observed with noise but communities determine connection probabilities. Parameters and node clusters are estimated via a variational expectation-maximization algorithm, after which edges are selected by a procedure that controls the false discovery rate while targeting the highest possible true discovery rate. The central theoretical result states that, whenever the clustering step succeeds, the FDR is controlled and the TDR is optimal up to remainder terms that vanish with growing graph size. Numerical experiments indicate that the method beats standard FDR procedures that ignore community structure and that the guarantees remain largely intact even when the data depart from the assumed model.

Core claim

Provided that the VEM algorithm provides reliable parameter estimates and clustering, the procedure controls the FDR while satisfying an optimal TDR property, up to remainder terms that become small when the size of the graph grows.

What carries the argument

The noisy stochastic block model together with the two-stage procedure that first runs variational EM for clustering and then applies an FDR-controlling threshold to edge p-values derived from the fitted model.

If this is right

  • The method controls the false discovery rate at the nominal level while attaining the highest achievable true discovery rate, up to vanishing remainders.
  • It outperforms classical FDR-controlling procedures that treat edges as independent and ignore the block structure.
  • The FDR and TDR properties remain essentially valid in simulations even when the data-generating process differs from the assumed noisy stochastic block model.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same two-stage logic could be applied to other consistent community-detection algorithms provided they produce accurate cluster labels.
  • The approach may improve edge recovery in any domain where nodes naturally fall into groups that govern connection probabilities, such as protein-interaction or social-contact networks.
  • Because the remainder terms are shown to vanish only asymptotically, finite-sample behavior on moderately sized graphs would still need separate verification.

Load-bearing premise

The variational expectation-maximization algorithm must deliver reliable estimates of the model parameters and the node clusters.

What would settle it

A sequence of graphs of increasing size on which the variational EM step recovers the clusters accurately yet the realized proportion of false edges among declared edges stays above the nominal FDR level by an amount that does not vanish.

read the original abstract

In this paper, a noisy version of the stochastic block model (NSBM) is introduced and we investigate the three following statistical inferences in this model: estimation of the model parameters, clustering of the nodes and identification of the underlying graph. While the two first inferences are done by using a variational expectation-maximization (VEM) algorithm, the graph inference is done by controlling the false discovery rate (FDR), that is, the average proportion of errors among the edges declared significant, and by maximizing the true discovery rate (TDR), that is, the average proportion of edges declared significant among the true edges. Provided that the VEM algorithm provides reliable parameter estimates and clustering, we theoretically show that our procedure does control the FDR while satisfying an optimal TDR property, up to remainder terms that become small when the size of the graph grows. Numerical experiments show that our method outperforms the classical FDR controlling methods that ignore the underlying SBM topology. In addition, these simulations demonstrate that the FDR/TDR properties of our method are robust to model mis-specification, that is, are essentially maintained outside our model.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces the noisy stochastic block model (NSBM) and develops procedures for parameter estimation and node clustering via a variational EM (VEM) algorithm, followed by an FDR-controlling method for inferring the underlying graph edges that also targets high TDR. It states a theoretical result establishing FDR control at a nominal level together with an asymptotically optimal TDR property, conditional on VEM consistency, with remainder terms vanishing as graph size grows; numerical experiments are used to illustrate outperformance relative to topology-ignoring FDR procedures and robustness under misspecification.

Significance. Conditional on the VEM consistency hypothesis, the work supplies a concrete mechanism for leveraging block structure to improve edge detection power while retaining FDR guarantees. The explicit separation of the clustering step from the FDR step, together with the reported robustness checks, constitutes a useful contribution to structured multiple-testing problems on networks.

major comments (2)
  1. [§4] §4 (FDR/TDR theorem): the claimed control is stated to hold up to remainder terms that vanish with n, yet the manuscript supplies neither an explicit order for these remainders nor the precise VEM consistency rates (e.g., in total variation or parameter error) that would make the vanishing statement verifiable; this is load-bearing for the asymptotic claim.
  2. [Simulation section] Simulation section (Tables 2–4): reported FDR/TDR values are given as point estimates without standard errors or replication counts, so it is impossible to assess whether the observed superiority over BH and the robustness claims are statistically distinguishable from noise.
minor comments (2)
  1. [Model section] Notation for the NSBM noise parameter is introduced without an immediate comparison table to the classical SBM; adding one line of contrast would prevent reader confusion.
  2. [Abstract] The abstract asserts optimality of the TDR “up to remainder terms”; a parenthetical remark on the rate (e.g., o(1)) would improve readability without lengthening the abstract.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive evaluation and the detailed comments. We address each major comment below.

read point-by-point responses
  1. Referee: [§4] §4 (FDR/TDR theorem): the claimed control is stated to hold up to remainder terms that vanish with n, yet the manuscript supplies neither an explicit order for these remainders nor the precise VEM consistency rates (e.g., in total variation or parameter error) that would make the vanishing statement verifiable; this is load-bearing for the asymptotic claim.

    Authors: We agree that the asymptotic claim would be strengthened by greater precision on the remainder terms. The result in Section 4 is conditional on VEM consistency and shows that the remainders vanish as n → ∞ without deriving explicit rates. In the revised manuscript we will add a remark that invokes standard consistency rates available in the VEM-SBM literature (typically o_p(1) in total variation or parameter error under mild separation conditions on the block probabilities), which are already sufficient for the o(1) vanishing of the remainders. This addition will make the statement verifiable without altering the theorem statement itself. revision: yes

  2. Referee: [Simulation section] Simulation section (Tables 2–4): reported FDR/TDR values are given as point estimates without standard errors or replication counts, so it is impossible to assess whether the observed superiority over BH and the robustness claims are statistically distinguishable from noise.

    Authors: We accept this criticism. The Monte Carlo experiments underlying Tables 2–4 were performed with 100 independent replications, but neither the replication count nor standard errors were reported. In the revision we will state the number of replications explicitly and augment the tables with standard errors for the FDR and TDR entries, allowing readers to gauge the statistical significance of the observed differences. revision: yes

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper's FDR/TDR guarantee is explicitly conditional on the external assumption that the VEM algorithm supplies reliable parameter estimates and clustering; this prerequisite is stated outright rather than derived from the same fitted quantities. Under that hypothesis the control and near-optimality (with vanishing remainders) are asserted. No equation reduces a claimed prediction to a fitted input by construction, no load-bearing self-citation chain appears, and the derivation remains independent of the target result once the VEM hypothesis is granted. The result is therefore self-contained against its stated external benchmark.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 1 invented entities

Abstract introduces NSBM as the modeling framework and conditions all guarantees on VEM reliability; no explicit free parameters or invented entities beyond the model itself are quantified.

free parameters (1)
  • number of blocks
    Model parameter whose value must be chosen or estimated; appears in any SBM-based procedure.
axioms (1)
  • domain assumption VEM algorithm yields reliable estimates and clustering when applied to NSBM data
    Explicit precondition stated for the FDR/TDR theorem.
invented entities (1)
  • noisy stochastic block model (NSBM) no independent evidence
    purpose: Generative model for graphs with latent communities plus observation noise
    New model introduced in the paper; no independent evidence supplied in abstract.

pith-pipeline@v0.9.0 · 5728 in / 1271 out tokens · 21229 ms · 2026-05-24T16:42:03.849120+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

16 extracted references · 16 canonical work pages · 1 internal anchor

  1. [1]

    S., Matias, C., and Rhodes, J

    Allman, E. S., Matias, C., and Rhodes, J. A. (2011). Parameter identifiability in a class of random graph mixture models. Journal of Statistical Planning and Inference , 141(5):1719 –

  2. [2]

    Banerjee, O., El Ghaoui, L., and d’Aspremont, A. (2008). Model selection through sparse maximum likelihood estimation for multivariate Gaussian or binary data. J. Mach. Learn. Res., 9:485–516. Benjamini, Y. and Hochberg, Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. Roy. Statist. Soc. Ser. B , 5...

  3. [3]

    Biernacki, C., Celeux, G., and Govaert, G. (2000). Assessing a mixture model for clustering with the integrated completed likelihood. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(7):719–725. Brault, V., Keribin, C., and Mariadassou, M. (2017). Consistency and Asymptotic Normality of Latent Blocks Model Estimators. working paper or pr...

  4. [4]

    Wu, C. F. J. (1983). On the convergence properties of the em algorithm. Ann. Statist., 11(1):95–

  5. [5]

    Main lemmas for Section 5 Lemma 9.1

    Supplementary material 9.1. Main lemmas for Section 5 Lemma 9.1. Let Assumption 5.1 be true and consider anyθ∈ Θ with the corresponding quantities t1,q,ℓ(θ),t 2,q,ℓ(θ), q,ℓ ∈{ 1,...,Q }2, t1(θ) = minq,ℓt1,q,ℓ(θ) and t2(θ) = maxq,ℓt2,q,ℓ(θ). Then the function t↦→Qθ(θ,t ) is increasing on [t1(θ),t 2(θ)], continuous on (t1(θ), 1], satisfies Qθ(θ,t ) = 0 for t...

  6. [6]

    Now, since Fθ(θ,t′)> 0, this entails f(t′)≥f(t). Also, if f(t′) = f(t), the inequalities above are all equalities and we have Eθ   ∑ (i,j)∈A ℓi,j1{t<ℓ i,j≤t′}   =t Eθ   ∑ (i,j)∈A 1{t<ℓ i,j≤t′}   and thus Eθ   ∑ (i,j)∈A (ℓi,j−t)1{t<ℓ i,j≤t′}   = 0 which gives (ℓi,j−t)1{t<ℓ i,j≤t′} = 0 Pθ-a.s. for all (i,j ), which is impossible by Assumption 5....

  7. [7]

    (56) Moreover, for t∈Tθ0(K) with|t−Tθ0(α)|≤ ε, we have |F1,θ0(θ0,t )−F1,θ0(θ0,Tθ0(α))|≤W T,q1(ε)

    such that for all ε≤ e, letting η(ε) = 8κ−1(Wθ0,q(ε) + 3Q2ε), we have minK≤ α−η(ε) and α +η(ε)≤ maxK and for any θ,θ′∈ Θ with∥θ−θ0∥∞≤ε and∥θ′−θ0∥∞≤ε, we have sup t∈Tθ0(K) |F0,θ′(θ,t )−F0,θ0(θ0,t )|∨| F1,θ′(θ,t )−F1,θ0(θ0,t )|∨| Fθ′(θ,t )−Fθ0(θ0,t )| ≤ 2Wθ0,q(ε) + 6Q2ε; (54) sup t∈Tθ0(K) ⏐⏐Qθ′(θ,t )−Qθ0(θ0,t ) ⏐⏐≤η(ε); (55) Tθ0(minK)≤Tθ0(α−η(ε))≤Tθ(α)≤Tθ0(...

  8. [8]

    Similarly, the latter bound is also valid for |F1,θ′(θ,t )−F1,θ0(θ0,t )|

    ⏐⏐⏐⏐ ≤ sup q,ℓ |q1(t,q,ℓ ;θ′,θ )−q1(t,q,ℓ ;θ0,θ 0)| + ∑ q,ℓ |πq(θ′)πℓ(θ′)wq,ℓ(θ′)−πq(θ0)πℓ(θ0)wq,ℓ(θ0)| ≤W θ0,q(ε) + 3Q2ε. Similarly, the latter bound is also valid for |F1,θ′(θ,t )−F1,θ0(θ0,t )|. Since Fθ′(θ,t ) =F0,θ′(θ,t ) + F1,θ′(θ,t ), this proves (54). The proof of (57) follows similarly. Let us now establish (55). First, by (54), and since Fθ0(θ0, ...

  9. [9]

    2014/07/30 file: RRV2019_arxiv.tex date: July 25, 2019 /Graph inference with FDR control 37 t1,q,ℓ(θ0) = 0 if and only if σq,ℓ≥σ0 and t2,q,ℓ(θ0) = 1 if and only if σq,ℓ≤σ0

    The regularity assumption holds with, for all q,ℓ, imsart-generic ver. 2014/07/30 file: RRV2019_arxiv.tex date: July 25, 2019 /Graph inference with FDR control 37 t1,q,ℓ(θ0) = 0 if and only if σq,ℓ≥σ0 and t2,q,ℓ(θ0) = 1 if and only if σq,ℓ≤σ0. An illustration is given in Figure

  10. [10]

    For this, we should however avoid the non regular behavior of t↦→Qθ0(θ0,t ) occurring at the boundary point{t1,q,ℓ(θ0)}q,ℓ∪{t2,q,ℓ(θ0)}q,ℓ. This is possible by considering a compact set K containingα and such thatK⊂ (α∗,α∗) where α∗ =α∗(θ0) is given by α∗(θ0) =Qθ0(θ0,t′ 2(θ0))∈ (α∗(θ),π0] (59) t′ 2(θ0) = min{t1,q,ℓ(θ0),t 2,q,ℓ(θ0), for q,ℓ s.t. t1,q,ℓ(θ0)...

  11. [11]

    2014/07/30 file: RRV2019_arxiv.tex date: July 25, 2019 /Graph inference with FDR control 39 9.3.3

    imsart-generic ver. 2014/07/30 file: RRV2019_arxiv.tex date: July 25, 2019 /Graph inference with FDR control 39 9.3.3. Study of the q-value functional By Lemma 9.10, and expression (65), we get that, for all q,ℓ∈{ 1,...,Q }, the function (t,θ,θ′)∈ [0, 1]× Θ2↦→qδ(t,q,ℓ ;θ′,θ ) is continuous on [0, 1]× Θ2. By the explicit expressions of the previous section...

  12. [12]

    is differentiable in t =Tθ0(α) when Tθ0(α) /∈{t1,t 2}. 9.3.4. Studying α∗ Let θ = (π,w,σ 0,µ,σ )∈ Θ. Recall α∗(θ) = lim t→t1(θ)+ {Qθ(θ,t )} = lim t→t1(θ)+    1 1 + ∑ q,ℓπqπℓwq,ℓq1(t,q,ℓ;θ,θ)∑ q,ℓπqπℓ(1−wq,ℓ)q0(t,q,ℓ;θ,θ)   . We prove in this section the following result. Lemma 9.9. For all θ = (π,w,σ 0,µ,σ )∈ Θ, we have α∗(θ) = 0 if and only if there...

  13. [13]

    imsart-generic ver

    Hence, we have for t close enough to 0, min q,ℓ:σq,ℓ≥σ0 { wq,ℓ 1−wq,ℓ q1(t,q,ℓ ;θ,θ ) q0(t,q,ℓ ;θ,θ ) } ≤ ∑ q,ℓπqπℓwq,ℓq1(t,q,ℓ ;θ,θ )∑ q,ℓπqπℓ(1−wq,ℓ)q0(t,q,ℓ ;θ,θ ) ≤ max q,ℓ:σq,ℓ≥σ0 { wq,ℓ 1−wq,ℓ q1(t,q,ℓ ;θ,θ ) q0(t,q,ℓ ;θ,θ ) } . imsart-generic ver. 2014/07/30 file: RRV2019_arxiv.tex date: July 25, 2019 /Graph inference with FDR control 41 We know by...

  14. [14]

    qδ(t,q,ℓ ;θ,θ ) = Φδ ( b− (b2− 4ac)1/2 + 2|a| ) + 1− Φδ ( b + (b2− 4ac)1/2 + 2|a| ) = 1− Φ ( −b + 2|a|µ(δ) + (b2− 4ac)1/2 + 2|a|σ(δ) ) + 1− Φ ( b− 2|a|µ(δ) + (b2− 4ac)1/2 + 2|a|σ(δ) ) = 1− Φ ( −|(b− 2|a|µ(δ))| + (b2− 4ac)1/2 + 2|a|σ(δ) ) + 1− Φ ( |(b− 2|a|µ(δ))| + (b2− 4ac)1/2 + 2|a|σ(δ) ) Now use for all z >0, for x→∞ , 1− Φ(−z +x) 1− Φ(z +x) ∼ φ(−z +x) ...

  15. [15]

    Summing up, we obtain (68) forε≤e(θ0,α,Q ) = min(q,ℓ)/∈C{e1(q,ℓ,θ 0,α )∧e2(q,ℓ,θ 0,α )}∧ min(q,ℓ)∈Ce3(q,ℓ,θ 0,α )

    Sinceθ∈ Θ↦→b(q,ℓ,θ ) is continuous, forε smaller than some positive numbere3(q,ℓ,θ 0,α ), we have|b(q,ℓ,θ )−b(q,ℓ,θ 0)|≤| b(q,ℓ,θ 0)|/2. Summing up, we obtain (68) forε≤e(θ0,α,Q ) = min(q,ℓ)/∈C{e1(q,ℓ,θ 0,α )∧e2(q,ℓ,θ 0,α )}∧ min(q,ℓ)∈Ce3(q,ℓ,θ 0,α ). We have for all δ∈{ 0, 1}, θ,θ′∈ Θ with ∥θ−θ0∥∞≤ ε and∥θ′−θ0∥∞≤ ε, for all q,ℓ ∈ {1,...,Q }, for all t∈Tθ...

  16. [16]

    imsart-generic ver

    The continuity follows. imsart-generic ver. 2014/07/30 file: RRV2019_arxiv.tex date: July 25, 2019