An Efficient Streaming Algorithm for Approximating Graphlet Distributions
Pith reviewed 2026-05-07 15:17 UTC · model grok-4.3
The pith
A streaming algorithm approximates k-graphlet distributions in O(1/c) passes using Õ(n^{1+c}) memory for any fixed c>0.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We give a streaming algorithm that, for any fixed c>0, makes O(1/c) passes using Õ(n^{1+c}) memory to sample k-graphlets. We use this sampler to approximate the distribution of k-graphlets. By a known lower bound, the construction is optimal in memory usage up to an Õ(n^c) factor.
What carries the argument
The multi-pass streaming sampling procedure that produces unbiased k-graphlet samples while achieving a tunable tradeoff between number of passes and memory.
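The paper's multi-pass sampler is not reproduced here, but the single-pass reservoir-sampling primitive that streaming samplers of this kind typically build on can be sketched as follows (a minimal illustration; the function name and interface are our own, not the paper's):

```python
import random

def reservoir_sample_edges(edge_stream, k):
    """Maintain a uniform random sample of k edges in one pass over the stream.

    Illustrative only: the paper's multi-pass k-graphlet sampler is more
    involved; this shows the basic memory-bounded primitive.
    """
    reservoir = []
    for i, edge in enumerate(edge_stream):
        if i < k:
            reservoir.append(edge)  # fill the reservoir with the first k edges
        else:
            j = random.randint(0, i)  # edge i+1 enters with probability k/(i+1)
            if j < k:
                reservoir[j] = edge
    return reservoir
```

Each edge seen so far ends up in the reservoir with equal probability k/i after i edges, which is why such primitives give unbiased samples under bounded memory.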
If this is right
- Graphlet distribution approximation becomes feasible for graphs that exceed main memory capacity.
- The pass count can be driven to any small constant while memory remains sub-quadratic in n.
- External-memory analysis of large networks incurs only a constant number of sequential scans.
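To see the pass/memory tradeoff concretely, a toy calculation under the stated bounds (dropping all hidden O(·) and Õ(·) factors, so the numbers are only indicative, not the paper's exact costs):

```python
import math

def tradeoff(n, c):
    """Indicative pass count and memory footprint from the claimed bounds.

    Constants and polylog factors are dropped: only the effect of the
    exponent c on the passes-vs-memory balance is shown.
    """
    passes = math.ceil(1 / c)   # O(1/c) passes
    memory = n ** (1 + c)       # ~n^{1+c} memory units
    return passes, memory
```

For n = 10^6, c = 0.5 gives 2 passes and about 10^9 memory units, while c = 0.25 gives 4 passes and about 3.2 × 10^7: halving c doubles the passes but shrinks memory polynomially.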
Where Pith is reading between the lines
- The same sampling trade-off might extend to counting or sampling other fixed-size induced subgraphs in streams.
- On graphs of moderate density the reduction in passes could yield large practical speedups compared with the logarithmic-pass baseline.
- Adjusting the exponent 1+c to depend explicitly on k would be a natural next parameter to optimize.
Load-bearing premise
The sampling procedure produces unbiased estimates sufficient for distribution approximation, and the prior lower bound applies directly to this procedure.
What would settle it
A family of graphs on which the algorithm either requires more than O(1/c) passes, uses more than Õ(n^{1+c}) memory, or yields distribution estimates whose total variation distance to the true distribution exceeds the claimed bound.
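Total variation distance, the metric in which such a refutation would be measured, is straightforward to compute for empirical graphlet distributions (a minimal sketch; the dictionary representation of a distribution over graphlet types is our own choice):

```python
def total_variation(p, q):
    """Total variation distance between two discrete distributions.

    Distributions are given as dicts mapping graphlet type -> probability;
    missing keys are treated as probability 0.
    """
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(x, 0.0) - q.get(x, 0.0)) for x in support)
```

A counterexample family would exhibit estimates whose total variation distance to the true distribution stays above the claimed error bound.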
Original abstract
In recent years, the problem of computing the frequencies of the induced $k$-vertex subgraphs of a graph, or \emph{$k$-graphlets}, has become central. One approach for this problem is to sample $k$-graphlets randomly. Classic algorithms for $k$-graphlet sampling require loading the entire graph into main memory, making them impractical for massive graphs. To bypass this limitation, Bourreau et al. (NeurIPS 2024) introduced a \emph{streaming} algorithm that through nontrivial techniques makes only $O(\log n)$ passes using $O(n \log n)$ memory. In this work we break their $O(\log n)$-pass bound by giving an algorithm that, for any fixed $c>0$, makes $O(1/c)$ passes using $\tilde O(n^{1+c})$ memory. As a consequence of their lower bound, our algorithm is optimal up to a factor of $\tilde{O}(n^c)$ in the memory usage. We use this sampling algorithm to obtain an efficient method of approximating $k$-graphlet distributions. Experiments on real-world and synthetic graphs show that our algorithm is always at least as good as the one of Bourreau et al., and outperforms it by orders of magnitude on mildly dense graphs.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents a new streaming algorithm for sampling k-graphlets that, for any fixed c>0, uses O(1/c) passes and Õ(n^{1+c}) memory. This improves on Bourreau et al.'s O(log n)-pass, O(n log n)-memory algorithm. The new sampler is used to approximate k-graphlet distributions, with a claim of near-optimality (up to Õ(n^c) in memory) via the prior lower bound, and experiments on real/synthetic graphs showing it matches or outperforms the baseline, especially on mildly dense graphs.
Significance. If the sampling procedure is unbiased and the lower bound applies in the constant-pass regime, the result gives a practical constant-pass tradeoff for graphlet sampling on massive graphs and tightens the pass-memory frontier for this problem. The experimental comparison adds evidence of utility, but the primary value is the theoretical improvement in passes.
major comments (2)
- [Abstract] Abstract: The optimality claim ('our algorithm is optimal up to a factor of Õ(n^c) in the memory usage') rests on Bourreau et al.'s lower bound carrying over unchanged to the O(1/c)-pass regime. The abstract provides no argument showing that the lower-bound construction (originally stated for O(log n) passes) rules out better constant-pass algorithms or that the new multi-pass sampling procedure inherits the same hardness.
- [Abstract] Abstract: The claim that the algorithm yields an 'efficient method of approximating k-graphlet distributions' requires that the produced samples give unbiased (or sufficiently low-bias) estimates of the induced k-graphlet frequencies. No proof sketch is supplied showing that the new sampling distribution matches the uniform distribution over k-graphlets needed for unbiased frequency estimation.
minor comments (1)
- [Abstract] Abstract: The range of k for which the algorithm and bounds hold is not stated; this should be clarified in the introduction or theorem statements.
Simulated Author's Rebuttal
We thank the referee for the careful reading and for recognizing the practical and theoretical value of our constant-pass streaming algorithm. We address each major comment below and will revise the manuscript accordingly.
Point-by-point responses
-
Referee: [Abstract] Abstract: The optimality claim ('our algorithm is optimal up to a factor of Õ(n^c) in the memory usage') rests on Bourreau et al.'s lower bound carrying over unchanged to the O(1/c)-pass regime. The abstract provides no argument showing that the lower-bound construction (originally stated for O(log n) passes) rules out better constant-pass algorithms or that the new multi-pass sampling procedure inherits the same hardness.
Authors: We agree that the abstract would be strengthened by an explicit justification. The lower bound in Bourreau et al. is obtained via a communication-complexity reduction on a family of hard instances whose communication cost cannot be reduced below Ω(n) bits even with any constant number of passes; their O(log n) upper bound arises only from their particular recursive sampling technique. Our O(1/c)-pass algorithm matches this bound up to the Õ(n^c) factor for any fixed c>0. We will add a concise sentence to the abstract and a short paragraph in the introduction (or a new remark in Section 2) that cites the relevant lemma from Bourreau et al. and explains why the hardness persists in the constant-pass regime. revision: yes
-
Referee: [Abstract] Abstract: The claim that the algorithm yields an 'efficient method of approximating k-graphlet distributions' requires that the produced samples give unbiased (or sufficiently low-bias) estimates of the induced k-graphlet frequencies. No proof sketch is supplied showing that the new sampling distribution matches the uniform distribution over k-graphlets needed for unbiased frequency estimation.
Authors: The referee correctly notes that the abstract omits a proof sketch. In the full manuscript, Section 3 presents the multi-pass sampler and proves (via induction on the number of passes and careful accounting of the reservoir sampling probabilities) that each k-graphlet is sampled with probability (1±ε)/|G_k| where G_k is the set of all k-graphlets; Section 4 then shows that the resulting Monte-Carlo estimator for the graphlet frequency vector is unbiased up to the (1+ε) multiplicative error and gives variance bounds that yield the stated approximation guarantees. We will insert a one-sentence outline of this argument into the abstract and add a short proof sketch (or a pointer to the relevant theorem) in the introduction. revision: yes
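In the uniform-sampling limit the estimator described in the rebuttal reduces to empirical frequencies over the sampled graphlet types. A minimal sketch (our own illustration, not the paper's Section 4 estimator; variance bounds are omitted):

```python
from collections import Counter

def estimate_graphlet_frequencies(samples):
    """Empirical frequency of each graphlet type among the samples.

    If each k-graphlet is drawn with probability within (1±eps) of uniform,
    these frequencies estimate the true distribution up to the same
    multiplicative error.
    """
    counts = Counter(samples)
    total = len(samples)
    return {g: c / total for g, c in counts.items()}
```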
Circularity Check
No circularity; new algorithm and optimality rest on external lower bound
Full rationale
The paper introduces a new multi-pass streaming sampler for k-graphlet distributions that improves the pass complexity of Bourreau et al. (NeurIPS 2024) from O(log n) to O(1/c). Optimality up to Õ(n^c) memory is explicitly derived from the lower bound in that external prior work rather than from any self-referential construction, fitted parameter, or self-citation chain. No equations reduce a claimed prediction to its own inputs by definition, no ansatz is smuggled via self-citation, and the sampling procedure is presented as a novel construction whose unbiasedness is asserted independently of the lower-bound citation. The derivation chain is therefore self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: The input graph is presented as a stream of edges in the standard one-way or multi-pass streaming model.
Reference graph
Works this paper leans on
-
[1]
Mixing time bounds for graphlet random walks
Matteo Agostini, Marco Bressan, and Shahrzad Haddadan. Mixing time bounds for graphlet random walks. Information Processing Letters, 152:105851, 2019.
2019
-
[2]
Graph sample and hold: a framework for big-graph analytics
Nesreen K. Ahmed, Nick G. Duffield, Jennifer Neville, and Ramana Rao Kompella. Graph sample and hold: a framework for big-graph analytics. In KDD, pages 1446–1455. ACM, 2014.
2014
-
[3]
Densest subgraph in streaming and MapReduce
Bahman Bahmani, Ravi Kumar, and Sergei Vassilvitskii. Densest subgraph in streaming and MapReduce. Proceedings of the VLDB Endowment, 5(5):454–465, January 2012.
2012
-
[4]
Reductions in streaming algorithms, with an application to counting triangles in graphs
Ziv Bar-Yossef, Ravi Kumar, and D. Sivakumar. Reductions in streaming algorithms, with an application to counting triangles in graphs. In SODA, pages 623–632. ACM/SIAM, 2002.
2002
-
[5]
Efficient semi-streaming algorithms for local triangle counting in massive graphs
Luca Becchetti, Paolo Boldi, Carlos Castillo, and Aristides Gionis. Efficient semi-streaming algorithms for local triangle counting in massive graphs. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 16–24, Las Vegas, Nevada, USA, August 2008. ACM.
2008
-
[6]
Counting subgraphs in degenerate graphs
Suman K. Bera, Lior Gishboliner, Yevgeny Levanzov, C. Seshadhri, and Asaf Shapira. Counting subgraphs in degenerate graphs. J. ACM, 69(3):23:1–23:21, 2022.
2022
-
[7]
Guise: Uniform sampling of graphlets for large graph analysis
Mansurul A. Bhuiyan, Mahmudur Rahman, Mahmuda Rahman, and Mohammad Al Hasan. Guise: Uniform sampling of graphlets for large graph analysis. In Proc. of IEEE ICDM 2012, pages 91–100, 2012.
2012
-
[8]
Efficient streaming algorithms for graphlet sampling
Yann Bourreau, Marco Bressan, T.-H. Hubert Chan, Qipeng Kuang, and Mauro Sozio. Efficient streaming algorithms for graphlet sampling. In NeurIPS, 2024.
2024
-
[9]
Efficient and near-optimal algorithms for sampling connected subgraphs
Marco Bressan. Efficient and near-optimal algorithms for sampling connected subgraphs. In Proc. of ACM STOC, pages 1132–1143, 2021.
2021
-
[10]
Efficient and near-optimal algorithms for sampling connected subgraphs, 2021
Marco Bressan. Efficient and near-optimal algorithms for sampling connected subgraphs, 2021.
2021
-
[11]
Faster algorithms for counting subgraphs in sparse graphs
Marco Bressan. Faster algorithms for counting subgraphs in sparse graphs. Algorithmica, 2021.
2021
-
[12]
Efficient and near-optimal algorithms for sampling small connected subgraphs
Marco Bressan. Efficient and near-optimal algorithms for sampling small connected subgraphs. ACM Trans. Algorithms, 19(3), June 2023.
2023
-
[13]
Counting graphlets: Space vs time
Marco Bressan, Flavio Chierichetti, Ravi Kumar, Stefano Leucci, and Alessandro Panconesi. Counting graphlets: Space vs time. In Proc. of ACM WSDM, pages 557–566, 2017.
2017
-
[14]
Motif counting beyond five nodes
Marco Bressan, Flavio Chierichetti, Ravi Kumar, Stefano Leucci, and Alessandro Panconesi. Motif counting beyond five nodes. ACM Trans. Knowl. Discov. Data, 12(4), April 2018.
2018
-
[15]
Motivo: Fast motif counting via succinct color coding and adaptive sampling
Marco Bressan, Stefano Leucci, and Alessandro Panconesi. Motivo: Fast motif counting via succinct color coding and adaptive sampling. Proc. VLDB Endow., 12(11):1651–1663, July 2019.
2019
-
[16]
Faster motif counting via succinct color coding and adaptive sampling
Marco Bressan, Stefano Leucci, and Alessandro Panconesi. Faster motif counting via succinct color coding and adaptive sampling. ACM Trans. Knowl. Discov. Data, 15(6), May 2021.
2021
-
[17]
Measuring user influence in Twitter: The million follower fallacy
Meeyoung Cha, Hamed Haddadi, Fabrício Benevenuto, and P. Krishna Gummadi. Measuring user influence in Twitter: The million follower fallacy. In ICWSM. The AAAI Press, 2010.
2010
-
[18]
Strong computational lower bounds via parameterized complexity
Jianer Chen, Xiuzhen Huang, Iyad A. Kanj, and Ge Xia. Strong computational lower bounds via parameterized complexity. Journal of Computer and System Sciences, 72(8):1346–1367, 2006.
2006
-
[19]
A general framework for estimating graphlet statistics via random walk
Xiaowei Chen, Yongkun Li, Pinghui Wang, and John C. S. Lui. A general framework for estimating graphlet statistics via random walk. Proc. VLDB Endow., 10(3):253–264, November 2016.
2016
-
[20]
Arboricity and subgraph listing algorithms
Norishige Chiba and Takao Nishizeki. Arboricity and subgraph listing algorithms. SIAM J. Comput., 14(1):210–223, 1985.
1985
-
[21]
On graph problems in a semi-streaming model
Joan Feigenbaum, Sampath Kannan, Andrew McGregor, Siddharth Suri, and Jian Zhang. On graph problems in a semi-streaming model. Theoretical Computer Science, 348(2-3):207–216, December 2005.
2005
-
[22]
Improved parallel algorithms for density-based network clustering
Mohsen Ghaffari, Silvio Lattanzi, and Slobodan Mitrovic. Improved parallel algorithms for density-based network clustering. In ICML, volume 97 of Proceedings of Machine Learning Research, pages 2201–2210. PMLR, 2019.
2019
-
[23]
An efficient framework for clustered federated learning
Avishek Ghosh, Jichan Chung, Dong Yin, and Kannan Ramchandran. An efficient framework for clustered federated learning. IEEE Trans. Inf. Theory, 68(12):8076–8091, 2022.
2022
-
[24]
Waddling random walk: Fast and accurate mining of motif statistics in large graphs
Guyue Han and Harish Sethu. Waddling random walk: Fast and accurate mining of motif statistics in large graphs. In Proc. of IEEE ICDM, pages 181–190, 2016.
2016
-
[25]
A generalization of sampling without replacement from a finite universe
Daniel G. Horvitz and Donovan J. Thompson. A generalization of sampling without replacement from a finite universe. Journal of the American Statistical Association, 47(260):663–685, 1952.
1952
-
[26]
The parameterised complexity of counting connected subgraphs and graph motifs
Mark Jerrum and Kitty Meeks. The parameterised complexity of counting connected subgraphs and graph motifs. J. Comput. Syst. Sci., 81(4):702–716, 2015.
2015
-
[27]
A space-efficient streaming algorithm for estimating transitivity and triangle counts using the birthday paradox
Madhav Jha, C. Seshadhri, and Ali Pinar. A space-efficient streaming algorithm for estimating transitivity and triangle counts using the birthday paradox. ACM Trans. Knowl. Discov. Data, 9(3):15:1–15:21, 2015.
2015
-
[28]
Counting arbitrary subgraphs in data streams
Daniel M. Kane, Kurt Mehlhorn, Thomas Sauerwald, and He Sun. Counting arbitrary subgraphs in data streams. In ICALP (2), volume 7392 of Lecture Notes in Computer Science, pages 598–609. Springer, 2012.
2012
-
[29]
KONECT: the Koblenz network collection
Jérôme Kunegis. KONECT: the Koblenz network collection. In WWW (Companion Volume), pages 1343–1350. International World Wide Web Conferences Steering Committee / ACM, 2013.
2013
-
[30]
What is Twitter, a social network or a news media?
Haewoon Kwak, Changhyun Lee, Hosung Park, and Sue B. Moon. What is Twitter, a social network or a news media? In WWW, pages 591–600. ACM, 2010.
2010
-
[31]
Memory-efficient and accurate sampling for counting local triangles in graph streams: From simple to multigraphs
Yongsub Lim, Minsoo Jung, and U. Kang. Memory-efficient and accurate sampling for counting local triangles in graph streams: From simple to multigraphs. ACM Transactions on Knowledge Discovery from Data, 12(1):1–28, February 2018.
2018
-
[32]
Improved mixing time for k-subgraph sampling
Ryuta Matsuno and Aristides Gionis. Improved mixing time for k-subgraph sampling. In Proc. of SIAM SDM, pages 568–576, 2020.
2020
-
[33]
Smallest-last ordering and clustering and graph coloring algorithms
David W. Matula and Leland L. Beck. Smallest-last ordering and clustering and graph coloring algorithms. J. ACM, 30(3):417–427, July 1983.
1983
-
[34]
Network motifs: Simple building blocks of complex networks
R. Milo, S. Shen-Orr, S. Itzkovitz, N. Kashtan, D. Chklovskii, and U. Alon. Network motifs: Simple building blocks of complex networks. Science, 298(5594):824–827, 2002.
2002
-
[35]
Estimating graphlet statistics via lifting
Kirill Paramonov, Dmitry Shemetov, and James Sharpnack. Estimating graphlet statistics via lifting. In Proc. of ACM KDD, pages 587–595, 2019.
2019
-
[36]
Counting and sampling triangles from a graph stream
A. Pavan, Kanat Tangwongsan, Srikanta Tirthapura, and Kun-Lung Wu. Counting and sampling triangles from a graph stream. Proceedings of the VLDB Endowment, 6(14):1870–1881, September 2013.
2013
-
[37]
Motif-matching based subgraph-level attentional convolutional network for graph classification
Hao Peng, Jianxin Li, Qiran Gong, Yuanxin Ning, Senzhang Wang, and Lifang He. Motif-matching based subgraph-level attentional convolutional network for graph classification. Proc. of AAAI, 34(04):5387–5394, April 2020.
2020
-
[38]
Finding network motifs using MCMC sampling
Tanay Kumar Saha and Mohammad Al Hasan. Finding network motifs using MCMC sampling. In Proc. of CompleNet, pages 13–24, 2015.
2015
-
[39]
Streaming algorithms for k-core decomposition
Ahmet Erdem Sarıyüce, Buğra Gedik, Gabriela Jacques-Silva, Kun-Lung Wu, and Ümit V. Çatalyürek. Streaming algorithms for k-core decomposition. Proceedings of the VLDB Endowment, 6(6):433–444, April 2013.
2013
-
[40]
Efficient graphlet kernels for large graph comparison
Nino Shervashidze, S. V. N. Vishwanathan, Tobias Petri, Kurt Mehlhorn, and Karsten M. Borgwardt. Efficient graphlet kernels for large graph comparison. In AISTATS, volume 5 of JMLR Proceedings, pages 488–495. JMLR.org, 2009.
2009
-
[41]
Triest: Counting local and global triangles in fully dynamic streams with fixed memory size
Lorenzo De Stefani, Alessandro Epasto, Matteo Riondato, and Eli Upfal. Triest: Counting local and global triangles in fully dynamic streams with fixed memory size. ACM TKDD, 11(4):43, 2017.
2017
-
[42]
Tiered sampling: An efficient method for approximate counting sparse motifs in massive graph streams
Lorenzo De Stefani, Erisa Terolli, and Eli Upfal. Tiered sampling: An efficient method for approximate counting sparse motifs in massive graph streams. In IEEE BigData, pages 776–. IEEE Computer Society, 2017.
2017
-
[43]
Gl2vec: Learning feature representation using graphlets for directed networks
Kun Tu, Jian Li, Don Towsley, Dave Braines, and Liam D. Turner. Gl2vec: Learning feature representation using graphlets for directed networks. In Proc. of IEEE/ACM ASONAM, pages 216–221, 2019.
2019
-
[44]
Efficiently estimating motif statistics of large networks
Pinghui Wang, John C. S. Lui, Bruno Ribeiro, Don Towsley, Junzhou Zhao, and Xiaohong Guan. Efficiently estimating motif statistics of large networks. ACM TKDD, 9(2):8:1–8:27, 2014.
2014

A Additional Figures
Figures 8 and 9 show the figures omitted from the experiments in Section 5.
[Figure 8: Empirical distribution of ε_v under different c, using ApproxDD-ES. Figure 9: number of passes (left: preprocessing; right: sampling). Plot data not recoverable from the extraction.]