Statistical inference for subgraph counts and clustering coefficient using network sampling in a sparse Stochastic Block Model framework

Anirban Mandal; Arindam Chatterjee

arxiv: 2407.19191 · v2 · submitted 2024-07-27 · 🧮 math.ST · stat.ME· stat.TH

Statistical inference for subgraph counts and clustering coefficient using network sampling in a sparse Stochastic Block Model framework

Anirban Mandal , Arindam Chatterjee This is my paper

Pith reviewed 2026-05-23 23:18 UTC · model grok-4.3

classification 🧮 math.ST stat.MEstat.TH

keywords stochastic block modelnetwork samplingsubgraph countsclustering coefficientlimit lawssparse networksasymptotic inferenceBernoulli sampling

0 comments

The pith

Estimates of subgraph counts and clustering coefficients follow Gaussian or Poisson limits under sparse SBM node sampling.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper establishes limit laws for estimates of subgraph counts and the clustering coefficient that arise when nodes are sampled from a large network generated by a sparse Stochastic Block Model. It introduces a sparse Bernoulli sampling scheme in which each node's selection probability decreases to zero with growing population size and analyzes both induced and ego-centric network formation. Quantitative bounds on the rate of normal approximation are derived in a joint model-design asymptotic framework, showing dependence on model sparsity, sampling sparsity, edge density, and the minimum vertex cover size of the target subgraph. The work proves that the conditions for a Gaussian limit are also necessary and identifies sharp transitions to Poisson limits for strictly balanced subgraphs once sparsity crosses explicit thresholds.

Core claim

In the joint asymptotic regime where both the Stochastic Block Model sparsity and the node sampling probability decrease with population size, the paper establishes that subgraph count estimators admit a Gaussian limit when certain conditions on sparsity, edge density, and subgraph structure hold, with the rate of convergence bounded explicitly. These conditions are necessary as well as sufficient. For strictly balanced subgraphs, the limit switches sharply to a Poisson law as sparsity crosses a threshold. Full characterizations of the possible limits are given for induced sampling and partial ones for ego-centric sampling, and the same limits apply to the estimated clustering coefficient.

What carries the argument

Joint model-design asymptotic framework for sparse Stochastic Block Model combined with sparse Bernoulli node sampling, which fixes the form of the limiting distribution for the estimators.

If this is right

If model sparsity remains below the identified threshold, inference quality stays unaffected even as population size increases.
The ego-centric approach maintains accurate inference at higher sparsity levels than the induced approach.
Clustering coefficient estimates inherit the same Gaussian and Poisson limiting behaviors.
Approximation accuracy improves with higher edge density and larger minimum vertex cover size of the target subgraph.
Violating the sparsity conditions necessarily produces non-Gaussian limits.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Sampling probabilities could be chosen just below the transition thresholds to preserve Gaussian-based inference in practice.
Real-data analyses might first classify the observed sparsity regime to select between normal or Poisson confidence intervals.
Analogous transitions could appear in other sparse network models that share the same subgraph balance properties.
The framework might extend to time-varying networks by treating time as an extra asymptotic dimension.

Load-bearing premise

The population network must be generated exactly according to a sparse Stochastic Block Model and nodes must be sampled by the sparse Bernoulli scheme whose selection probability tends to zero.

What would settle it

Simulate networks from a non-SBM generative process or hold the sampling probability fixed as population size grows, then examine whether the empirical distribution of the estimated subgraph counts deviates from the predicted Gaussian or Poisson form in the stated sparsity regimes.

read the original abstract

This article develops limit laws for network sampling based estimates of subgraph counts and clustering coefficient of a large population network, and uses them for predictive inference. A model based approach is used, where the population network is assumed to be generated from a sparse Stochastic Block Model (SBM). To quantify the effects of node sampling under resource constraints, a sparse Bernoulli node sampling scheme is introduced, where the node selection probability decays to zero as the population size increases. Both induced and ego-centric network formation approaches are explored. Quantitative bounds on the speed of normal approximation for estimated subgraph counts are obtained in a joint model and design based asymptotic framework. These bounds show that inference accuracy depends on model sparsity, sampling sparsity, and features like edge density and minimum vertex cover size of the target subgraph. We find that the ego-centric approach can handle higher sparsity levels in both the model and sampling scheme, compared to the induced approach. We also show that if model sparsity remains below a threshold, inference quality is unaffected; beyond it, the quality degrades rapidly. The sufficient conditions for obtaining a Gaussian limit law also turn out to be necessary. For strictly balanced target subgraphs, we obtain sharp transitions from Gaussian to Poisson based limit laws, as sparsity levels increase. A complete description of limit laws for estimated subgraph counts is given for the induced case, with a near-complete one for the ego-centric case. These results also yield Gaussian and Poisson limit laws for the estimated clustering coefficient. Simulations support the theory across sparsity levels, and the proposed methodology is applied to a real data set.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

This paper derives limit laws for subgraph counts and clustering coefficients under sparse SBM plus sparse Bernoulli node sampling, with necessity results and Gaussian-to-Poisson transitions.

read the letter

The core contribution here is a set of limit theorems for estimated subgraph counts when both the underlying network follows a sparse stochastic block model and nodes are sampled with inclusion probabilities that decay to zero. They introduce a sparse Bernoulli sampling scheme and work in a joint model-design asymptotic regime, which lets them track how sparsity in the model and in the sampling interact. The quantitative bounds on normal approximation speed, the necessity of the Gaussian conditions, and the sharp transitions to Poisson limits for strictly balanced subgraphs are the parts that look new relative to earlier work on denser regimes. They also treat both induced and ego-centric sampling and note that the ego-centric version tolerates higher sparsity before the limits break down. Simulations are reported to back the theory, and they apply the methods to a real dataset. The framework is internally consistent on its own terms and gives a fairly complete picture for the induced case. The main modeling assumption is that the population network is generated exactly from the sparse SBM; that is a strong but standard choice in this literature, and the results are conditional on it holding. The ego-centric limits are described as near-complete rather than fully spelled out. Practical performance will still depend on how well the SBM matches the data at hand, which the real-data section would need to address in detail. This is the kind of targeted theoretical work that belongs in a specialized network statistics journal. A serious editor should send it to referees; the claims are precise enough and the technical program looks coherent enough to warrant review even if revisions are needed on the proofs or the data example.

Referee Report

0 major / 2 minor

Summary. The paper develops limit laws for network sampling based estimates of subgraph counts and clustering coefficient under a sparse Stochastic Block Model for the population network combined with a sparse Bernoulli node sampling scheme (selection probability decaying to zero). It derives quantitative bounds on the speed of normal approximation in a joint model-design asymptotic framework, shows that sufficient conditions for Gaussian limits are also necessary, identifies sharp Gaussian-to-Poisson transitions for strictly balanced target subgraphs as sparsity increases, and gives a complete description of limit laws for the induced sampling case (near-complete for ego-centric). The ego-centric approach tolerates higher sparsity; simulations support the theory and a real-data application is included.

Significance. If the derivations hold, the work provides a rigorous joint asymptotic theory for subgraph inference under sparsity constraints that are realistic for large networks, including explicit dependence on model sparsity, sampling sparsity, edge density, and subgraph features such as minimum vertex cover size. The necessity results, sharp transition thresholds, and complete limit-law catalog (Gaussian/Poisson) are notable strengths; the simulation validation across regimes and the distinction between induced and ego-centric sampling add practical value. Machine-checked proofs are not mentioned, but the parameter-free character of the limit descriptions (under the stated SBM and sampling assumptions) is a positive feature.

minor comments (2)

[Abstract] Abstract: the statement that 'the sufficient conditions for obtaining a Gaussian limit law also turn out to be necessary' is strong; a brief indication of the proof strategy (e.g., moment-method or Stein-equation approach) would help readers assess the scope immediately.
The manuscript introduces an 'invented entity' (sparse Bernoulli node sampling) whose precise definition and relation to standard sampling designs should be highlighted early to avoid confusion with existing Bernoulli or induced sampling schemes.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their detailed summary of the manuscript and for recognizing the contributions regarding limit laws, necessity results, sharp Gaussian-to-Poisson transitions, and the comparison between induced and ego-centric sampling. The recommendation is 'uncertain,' yet the report contains no specific major comments or points requiring clarification. We remain available to address any additional questions the referee may wish to raise.

Circularity Check

0 steps flagged

No significant circularity; derivation self-contained under stated assumptions

full rationale

The paper derives limit laws, normal approximation bounds, and Gaussian-to-Poisson transitions for subgraph counts and clustering coefficients directly from the sparse SBM population model combined with the introduced sparse Bernoulli node sampling scheme. These are joint model-and-design asymptotics whose sufficient conditions are shown necessary; the results do not reduce by the paper's own equations to any fitted parameter renamed as a prediction, nor rely on self-citation chains or ansatzes smuggled from prior work. The abstract and strongest claims present an internally consistent theoretical program resting on the modeling assumptions without circular reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entities

The claims rest on the SBM generative assumption and the newly introduced sampling scheme; no explicit free parameters are fitted in the abstract description, and no invented entities beyond the sampling scheme are postulated.

axioms (2)

domain assumption The population network is generated from a sparse Stochastic Block Model.
Explicitly stated as the model for the population network in the abstract.
domain assumption Node selection follows a sparse Bernoulli scheme whose probability decays to zero with population size.
Introduced to quantify effects of sampling under resource constraints.

invented entities (1)

Sparse Bernoulli node sampling scheme no independent evidence
purpose: Model node selection probability that decays to zero as population size increases to capture resource constraints.
Newly introduced in the paper; no independent evidence outside the model is provided.

pith-pipeline@v0.9.0 · 5818 in / 1490 out tokens · 26598 ms · 2026-05-23T23:18:19.576577+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Quantitative bounds on the speed of normal approximation for estimated subgraph counts... sharp transitions from Gaussian to Poisson based limit laws as sparsity levels increase
IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

mildly sparse SBM framework, where edge probabilities decay to zero at rate N^−β

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.