Statistical inference for subgraph counts and clustering coefficient using network sampling in a sparse Stochastic Block Model framework
Pith reviewed 2026-05-23 23:18 UTC · model grok-4.3
The pith
Estimates of subgraph counts and clustering coefficients follow Gaussian or Poisson limits under sparse SBM node sampling.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
In the joint asymptotic regime where both the Stochastic Block Model sparsity and the node sampling probability decrease with population size, the paper establishes that subgraph count estimators admit a Gaussian limit when certain conditions on sparsity, edge density, and subgraph structure hold, with the rate of convergence bounded explicitly. These conditions are necessary as well as sufficient. For strictly balanced subgraphs, the limit switches sharply to a Poisson law as sparsity crosses a threshold. Full characterizations of the possible limits are given for induced sampling and partial ones for ego-centric sampling, and the same limits apply to the estimated clustering coefficient.
What carries the argument
Joint model-design asymptotic framework for sparse Stochastic Block Model combined with sparse Bernoulli node sampling, which fixes the form of the limiting distribution for the estimators.
If this is right
- If model sparsity remains below the identified threshold, inference quality stays unaffected even as population size increases.
- The ego-centric approach maintains accurate inference at higher sparsity levels than the induced approach.
- Clustering coefficient estimates inherit the same Gaussian and Poisson limiting behaviors.
- Approximation accuracy improves with higher edge density and larger minimum vertex cover size of the target subgraph.
- Violating the sparsity conditions necessarily produces non-Gaussian limits.
Where Pith is reading between the lines
- Sampling probabilities could be chosen just below the transition thresholds to preserve Gaussian-based inference in practice.
- Real-data analyses might first classify the observed sparsity regime to select between normal or Poisson confidence intervals.
- Analogous transitions could appear in other sparse network models that share the same subgraph balance properties.
- The framework might extend to time-varying networks by treating time as an extra asymptotic dimension.
Load-bearing premise
The population network must be generated exactly according to a sparse Stochastic Block Model and nodes must be sampled by the sparse Bernoulli scheme whose selection probability tends to zero.
What would settle it
Simulate networks from a non-SBM generative process or hold the sampling probability fixed as population size grows, then examine whether the empirical distribution of the estimated subgraph counts deviates from the predicted Gaussian or Poisson form in the stated sparsity regimes.
read the original abstract
This article develops limit laws for network sampling based estimates of subgraph counts and clustering coefficient of a large population network, and uses them for predictive inference. A model based approach is used, where the population network is assumed to be generated from a sparse Stochastic Block Model (SBM). To quantify the effects of node sampling under resource constraints, a sparse Bernoulli node sampling scheme is introduced, where the node selection probability decays to zero as the population size increases. Both induced and ego-centric network formation approaches are explored. Quantitative bounds on the speed of normal approximation for estimated subgraph counts are obtained in a joint model and design based asymptotic framework. These bounds show that inference accuracy depends on model sparsity, sampling sparsity, and features like edge density and minimum vertex cover size of the target subgraph. We find that the ego-centric approach can handle higher sparsity levels in both the model and sampling scheme, compared to the induced approach. We also show that if model sparsity remains below a threshold, inference quality is unaffected; beyond it, the quality degrades rapidly. The sufficient conditions for obtaining a Gaussian limit law also turn out to be necessary. For strictly balanced target subgraphs, we obtain sharp transitions from Gaussian to Poisson based limit laws, as sparsity levels increase. A complete description of limit laws for estimated subgraph counts is given for the induced case, with a near-complete one for the ego-centric case. These results also yield Gaussian and Poisson limit laws for the estimated clustering coefficient. Simulations support the theory across sparsity levels, and the proposed methodology is applied to a real data set.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper develops limit laws for network sampling based estimates of subgraph counts and clustering coefficient under a sparse Stochastic Block Model for the population network combined with a sparse Bernoulli node sampling scheme (selection probability decaying to zero). It derives quantitative bounds on the speed of normal approximation in a joint model-design asymptotic framework, shows that sufficient conditions for Gaussian limits are also necessary, identifies sharp Gaussian-to-Poisson transitions for strictly balanced target subgraphs as sparsity increases, and gives a complete description of limit laws for the induced sampling case (near-complete for ego-centric). The ego-centric approach tolerates higher sparsity; simulations support the theory and a real-data application is included.
Significance. If the derivations hold, the work provides a rigorous joint asymptotic theory for subgraph inference under sparsity constraints that are realistic for large networks, including explicit dependence on model sparsity, sampling sparsity, edge density, and subgraph features such as minimum vertex cover size. The necessity results, sharp transition thresholds, and complete limit-law catalog (Gaussian/Poisson) are notable strengths; the simulation validation across regimes and the distinction between induced and ego-centric sampling add practical value. Machine-checked proofs are not mentioned, but the parameter-free character of the limit descriptions (under the stated SBM and sampling assumptions) is a positive feature.
minor comments (2)
- [Abstract] Abstract: the statement that 'the sufficient conditions for obtaining a Gaussian limit law also turn out to be necessary' is strong; a brief indication of the proof strategy (e.g., moment-method or Stein-equation approach) would help readers assess the scope immediately.
- The manuscript introduces an 'invented entity' (sparse Bernoulli node sampling) whose precise definition and relation to standard sampling designs should be highlighted early to avoid confusion with existing Bernoulli or induced sampling schemes.
Simulated Author's Rebuttal
We thank the referee for their detailed summary of the manuscript and for recognizing the contributions regarding limit laws, necessity results, sharp Gaussian-to-Poisson transitions, and the comparison between induced and ego-centric sampling. The recommendation is 'uncertain,' yet the report contains no specific major comments or points requiring clarification. We remain available to address any additional questions the referee may wish to raise.
Circularity Check
No significant circularity; derivation self-contained under stated assumptions
full rationale
The paper derives limit laws, normal approximation bounds, and Gaussian-to-Poisson transitions for subgraph counts and clustering coefficients directly from the sparse SBM population model combined with the introduced sparse Bernoulli node sampling scheme. These are joint model-and-design asymptotics whose sufficient conditions are shown necessary; the results do not reduce by the paper's own equations to any fitted parameter renamed as a prediction, nor rely on self-citation chains or ansatzes smuggled from prior work. The abstract and strongest claims present an internally consistent theoretical program resting on the modeling assumptions without circular reduction.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption The population network is generated from a sparse Stochastic Block Model.
- domain assumption Node selection follows a sparse Bernoulli scheme whose probability decays to zero with population size.
invented entities (1)
-
Sparse Bernoulli node sampling scheme
no independent evidence
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
Quantitative bounds on the speed of normal approximation for estimated subgraph counts... sharp transitions from Gaussian to Poisson based limit laws as sparsity levels increase
-
IndisputableMonolith/Foundation/RealityFromDistinction.leanreality_from_one_distinction unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
mildly sparse SBM framework, where edge probabilities decay to zero at rate N^−β
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.