Counting Butterflies over Streaming Bipartite Graphs with Duplicate Edges
Pith reviewed 2026-05-23 07:19 UTC · model grok-4.3
The pith
DEABC uses bucket-based priority sampling to estimate butterfly counts in streaming bipartite graphs with duplicate edges while using less memory.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
DEABC introduces bucket-based priority sampling that handles duplicate edges by organizing them into buckets, allowing accurate butterfly estimation with reduced memory compared to priority queue methods, while maintaining unbiased estimates and bounded variance as proven rigorously.
What carries the argument
Bucket-based priority sampling, which buckets edges to manage duplicates and samples priorities to estimate butterfly counts without storing all data.
If this is right
- Real-time applications like fraud detection can process larger streams with duplicate edges.
- The approach achieves higher throughput and lower memory than FABLE.
- Provides theoretical guarantees that previous methods lacked in this setting.
Where Pith is reading between the lines
- This sampling technique could extend to counting other motifs in streaming graphs with duplicates.
- May apply to non-bipartite graphs if adapted for general subgraphs.
Load-bearing premise
That the bucket-based priority sampling produces unbiased estimates and the stated variance bounds even when duplicate edges appear in arbitrary patterns and frequencies.
What would settle it
Apply DEABC to a synthetic stream with a known exact butterfly count and controlled duplicate rates, then check if the estimates fall within the proven variance bounds on average.
Figures
read the original abstract
Bipartite graphs are commonly used to model relationships between two distinct entities in real-world applications, such as user-product interactions, user-movie ratings and collaborations between authors and publications. A butterfly (a 2x2 bi-clique) is a critical substructure in bipartite graphs, playing a significant role in tasks like community detection, fraud detection, and link prediction. As more real-world data is presented in a streaming format, efficiently counting butterflies in streaming bipartite graphs has become increasingly important. However, most existing algorithms typically assume that duplicate edges are absent, which is hard to hold in real-world graph streams, as a result, they tend to sample edges that appear multiple times, leading to inaccurate results. The only algorithm designed to handle duplicate edges is FABLE, but it suffers from significant limitations, including high variance, substantial time complexity, and memory inefficiency due to its reliance on a priority queue. To overcome these limitations, we introduce DEABC (Duplicate-Edge-Aware Butterfly Counting), an innovative method that uses bucket-based priority sampling to accurately estimate the number of butterflies, accounting for duplicate edges. Compared to existing methods, DEABC significantly reduces memory usage by storing only the essential sampled edge data while maintaining high accuracy. We provide rigorous proofs of the unbiasedness and variance bounds for DEABC, ensuring they achieve high accuracy. We compare DEABC with state-of-the-art algorithms on real-world streaming bipartite graphs. The results show that our DEABC outperforms existing methods in memory efficiency and accuracy, while also achieving significantly higher throughput.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces DEABC, a bucket-based priority sampling algorithm for estimating the number of butterflies (2x2 bicliques) in streaming bipartite graphs that may contain duplicate edges. It claims to overcome limitations of prior work (especially FABLE) by storing only essential sampled edge data, achieving lower memory usage while maintaining high accuracy, and supplies proofs of unbiasedness together with variance bounds; experimental results on real-world streams are reported to show superior memory efficiency, accuracy, and throughput.
Significance. If the unbiasedness and variance claims are substantiated, the result would be a practical advance for streaming graph analytics in applications such as fraud detection and community detection, where duplicate edges are common. The explicit handling of duplicates via bucket-based sampling and the provision of formal proofs (if complete) constitute a clear strength over heuristic approaches; the reported experimental superiority on real streams further supports utility.
major comments (2)
- [Section 3] Sampling procedure (Section 3, around the definition of bucket-based priority sampling): the central unbiasedness claim for arbitrary duplicate-edge streams requires an explicit derivation of the inclusion probability that accounts for every occurrence of a repeated edge; the manuscript asserts rigorous proofs but does not display the adjusted sampling probability or the expectation calculation (e.g., for a two-edge duplicate stream), which is load-bearing for the claim that DEABC overcomes FABLE's variance issues.
- [Section 4] Variance analysis (Section 4, variance-bound theorem): the stated variance bounds are presented without the intermediate steps linking the bucket priority mechanism to the final variance expression; without these steps it is impossible to confirm that the bounds remain valid once multiplicity is incorporated, undermining the accuracy guarantees.
minor comments (2)
- [Abstract] The abstract and introduction would benefit from a short statement of the precise time and space complexity of DEABC relative to FABLE.
- [Experiments] Figure captions and experimental tables should include error bars or standard deviations for the accuracy metrics to allow direct comparison of variance claims.
Simulated Author's Rebuttal
We thank the referee for the constructive comments, which highlight areas where the presentation of our proofs can be strengthened. We address each major comment below and agree to expand the derivations in the revised manuscript.
read point-by-point responses
-
Referee: [Section 3] Sampling procedure (Section 3, around the definition of bucket-based priority sampling): the central unbiasedness claim for arbitrary duplicate-edge streams requires an explicit derivation of the inclusion probability that accounts for every occurrence of a repeated edge; the manuscript asserts rigorous proofs but does not display the adjusted sampling probability or the expectation calculation (e.g., for a two-edge duplicate stream), which is load-bearing for the claim that DEABC overcomes FABLE's variance issues.
Authors: We agree that the main text does not display the full step-by-step derivation of the inclusion probability under arbitrary multiplicities. In the revision we will add an explicit calculation of the per-edge inclusion probability that sums over all occurrences of a repeated edge, together with the expectation computation for a two-edge duplicate stream. This will directly link the bucket-based priority sampling rule to the unbiased estimator and clarify the variance reduction relative to FABLE. revision: yes
-
Referee: [Section 4] Variance analysis (Section 4, variance-bound theorem): the stated variance bounds are presented without the intermediate steps linking the bucket priority mechanism to the final variance expression; without these steps it is impossible to confirm that the bounds remain valid once multiplicity is incorporated, undermining the accuracy guarantees.
Authors: We acknowledge that the intermediate algebraic steps connecting the bucket-priority sampling probabilities to the final variance bound are omitted from the current text. The revision will insert these steps, showing how the multiplicity-adjusted inclusion probabilities propagate through the variance expression and confirming that the stated bounds continue to hold. revision: yes
Circularity Check
No circularity; derivation relies on standard sampling proofs without self-referential reduction
full rationale
The provided abstract and context present DEABC as using bucket-based priority sampling with claimed rigorous proofs of unbiasedness and variance bounds for duplicate-edge streams. No equations, fitted parameters, or self-citations are exhibited that reduce any estimator or prediction to its inputs by construction. The approach is described as extending existing streaming methods to handle duplicates, with no evidence of self-definitional steps, fitted inputs renamed as predictions, or load-bearing self-citations. The derivation chain is therefore self-contained against external sampling theory benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Bucket-based priority sampling yields unbiased butterfly count estimates in the presence of duplicate edges
Forward citations
Cited by 1 Pith paper
-
Counting Small Balanced (p,q)-bicliques in Signed Bipartite Graphs
BBVP, a vertex-based pruning algorithm, counts balanced (p,q)-bicliques in signed bipartite graphs with an average 636x speedup over the adapted SBCList++ baseline for p=q=3 on large real-world data.
Reference graph
Works this paper leans on
-
[1]
Rectangle counting in large bipartite graphs,
J. Wang, A. W.-C. Fu, and J. Cheng, “Rectangle counting in large bipartite graphs,” in 2014 IEEE International Congress on Big Data . IEEE, 2014, pp. 17–24
work page 2014
-
[2]
Measuring and modeling bipartite graphs with community structure,
S. Aksoy, T. G. Kolda, and A. Pinar, “Measuring and modeling bipartite graphs with community structure,” J. Complex Networks , vol. 5, no. 4, pp. 581–603, 2017. [Online]. Available: https: //doi.org/10.1093/comnet/cnx001
-
[3]
Cycles and clustering in bipartite networks,
P. G. Lind, M. C. Gonzalez, and H. J. Herrmann, “Cycles and clustering in bipartite networks,” Physical Review E—Statistical, Nonlinear, and Soft Matter Physics , vol. 72, no. 5, p. 056127, 2005
work page 2005
-
[4]
Small worlds among interlocking direc- tors: Network structure and distance in bipartite graphs,
G. Robins and M. Alexander, “Small worlds among interlocking direc- tors: Network structure and distance in bipartite graphs,” Computational & Mathematical Organization Theory , vol. 10, pp. 69–94, 2004
work page 2004
-
[5]
Clustering coefficient and community structure of bipartite networks,
P. Zhang, J. Wang, X. Li, M. Li, Z. Di, and Y . Fan, “Clustering coefficient and community structure of bipartite networks,” Physica A: Statistical Mechanics and its Applications , vol. 387, no. 27, pp. 6869– 6875, 2008
work page 2008
-
[6]
F. Zhang, S. Qi, Q. Liu, M. Mao, and A. Zeng, “Alleviating the data sparsity problem of recommender systems by clustering nodes in bipartite networks,” Expert Systems with Applications , vol. 149, p. 113346, 2020
work page 2020
-
[7]
Bipartite graph partitioning and data clustering,
H. Zha, X. He, C. H. Q. Ding, M. Gu, and H. D. Simon, “Bipartite graph partitioning and data clustering,” in Proceedings of the 2001 ACM CIKM International Conference on Information and Knowledge Management, Atlanta, Georgia, USA, November 5-10, 2001 . ACM, 2001, pp. 25–32
work page 2001
-
[8]
Optimal bipartite network clustering,
Z. Zhou and A. A. Amini, “Optimal bipartite network clustering,” Journal of Machine Learning Research, vol. 21, no. 40, pp. 1–68, 2020
work page 2020
-
[9]
Efficient bitruss decomposition for large-scale bipartite graphs,
K. Wang, X. Lin, L. Qin, W. Zhang, and Y . Zhang, “Efficient bitruss decomposition for large-scale bipartite graphs,” in 2020 IEEE 36th International Conference on Data Engineering (ICDE) . IEEE, 2020, pp. 661–672
work page 2020
-
[10]
Community detection in bipartite network: a modified coarsening approach,
A. Valejo, V . Ferreira, M. C. de Oliveira, and A. de Andrade Lopes, “Community detection in bipartite network: a modified coarsening approach,” in Information Management and Big Data: 4th Annual International Symposium, SIMBig 2017, Lima, Peru, September 4-6, 2017, Revised Selected Papers 4 . Springer, 2018, pp. 123–136
work page 2017
-
[11]
Searching personalized k- wing in bipartite graphs,
A. Abidi, L. Chen, R. Zhou, and C. Liu, “Searching personalized k- wing in bipartite graphs,” in 2024 IEEE 40th International Conference on Data Engineering (ICDE) . IEEE, 2024, pp. 5727–5728
work page 2024
-
[12]
Butterfly- core community search over labeled graphs,
Z. Dong, X. Huang, G. Yuan, H. Zhu, and H. Xiong, “Butterfly- core community search over labeled graphs,” arXiv preprint arXiv:2105.08628, 2021
-
[13]
Distributed approaches to butterfly analysis on large dynamic bipartite graphs,
T. Weng, X. Zhou, K. Li, K.-L. Tan, and K. Li, “Distributed approaches to butterfly analysis on large dynamic bipartite graphs,” IEEE Transac- tions on Parallel and Distributed Systems , vol. 34, no. 2, pp. 431–445, 2022
work page 2022
-
[14]
Butterfly counting and bitruss de- composition on uncertain bipartite graphs,
A. Zhou, Y . Wang, and L. Chen, “Butterfly counting and bitruss de- composition on uncertain bipartite graphs,” The VLDB Journal, vol. 32, no. 5, pp. 1013–1036, 2023
work page 2023
-
[15]
Butterfly counting in bipartite networks,
S.-V . Sanei-Mehri, A. E. Sariyuce, and S. Tirthapura, “Butterfly counting in bipartite networks,” in Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining , 2018, pp. 2150–2159
work page 2018
-
[16]
Efficient load-balanced butterfly counting on gpu,
Q. Xu, F. Zhang, Z. Yao, L. Lu, X. Du, D. Deng, and B. He, “Efficient load-balanced butterfly counting on gpu,” Proceedings of the VLDB Endowment, vol. 15, no. 11, pp. 2450–2462, 2022
work page 2022
-
[17]
Y . Fang, K. Wang, X. Lin, and W. Zhang, “Cohesive subgraph search over big heterogeneous information networks: Applications, challenges, and solutions,” in Proceedings of the 2021 International Conference on Management of Data , 2021, pp. 2829–2838
work page 2021
-
[18]
I/o-efficient butterfly counting at scale,
Z. Wang, L. Lai, Y . Liu, B. Shui, C. Tian, and S. Zhong, “I/o-efficient butterfly counting at scale,” Proceedings of the ACM on Management of Data, vol. 1, no. 1, pp. 1–27, 2023
work page 2023
-
[19]
N. Geetha and G. Dheepa, “A hybrid deep learning and modified butterfly optimization based feature selection for transaction credit card fraud detection,” Journal of Positive School Psychology , vol. 6, no. 7, pp. 5328–5345, 2022
work page 2022
-
[20]
P. G. Samanvita, A. Balivada, S. K. Satapathy, B. M. R. Kumar, and M. Bhargavi, “Fraud detection: Comparison of traditional methods, hybrid methods, monarch butterfly optimization, and temporal convo- lutional networks,” in 2024 3rd International Conference on Applied Artificial Intelligence and Computing (ICAAIC) . IEEE, 2024, pp. 110– 116
work page 2024
-
[21]
Vertex priority based butterfly counting for large-scale bipartite networks
K. Wang, X. Lin, L. Qin, W. Zhang, and Y . Zhang, “Vertex priority based butterfly counting for large-scale bipartite networks.” PVLDB, 2019
work page 2019
-
[22]
Parallel algorithms for butterfly computations,
J. Shi and J. Shun, “Parallel algorithms for butterfly computations,” in Massive Graph Analytics. Chapman and Hall/CRC, 2022, pp. 287–330
work page 2022
-
[23]
Families of butterfly counting algorithms for bipartite graphs,
J. A. Acosta, T. M. Low, and D. N. Parikh, “Families of butterfly counting algorithms for bipartite graphs,” in 2022 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW) . IEEE, 2022, pp. 304–313
work page 2022
-
[24]
Parallelization of butterfly counting on hierarchical memory,
Z. Wang, L. Lai, Y . Liu, B. Shui, C. Tian, and S. Zhong, “Parallelization of butterfly counting on hierarchical memory,” The VLDB Journal , pp. 1–32, 2024. JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2021 12
work page 2024
-
[25]
Y . Xia, F. Zhang, Q. Xu, M. Zhang, Z. Yao, L. Lu, X. Du, D. Deng, B. He, and S. Ma, “Gpu-based butterfly counting,” The VLDB Journal , vol. 33, no. 5, pp. 1543–1567, 2024
work page 2024
-
[26]
Toward accurate butterfly counting with edge privacy preserving in bipartite networks,
M. Wang, H. Jiang, P. Peng, Y . Li, and W. Huang, “Toward accurate butterfly counting with edge privacy preserving in bipartite networks,” in IEEE INFOCOM 2024-IEEE Conference on Computer Communications. IEEE, 2024, pp. 2289–2298
work page 2024
-
[27]
Butterfly counting over bipartite graphs with local differential privacy,
Y . He, K. Wang, W. Zhang, X. Lin, W. Ni, and Y . Zhang, “Butterfly counting over bipartite graphs with local differential privacy,” in 2024 IEEE 40th International Conference on Data Engineering (ICDE) . IEEE, 2024, pp. 2351–2364
work page 2024
-
[28]
Fleet: Butterfly estimation from a bipartite graph stream,
S.-V . Sanei-Mehri, Y . Zhang, A. E. Sariy ¨uce, and S. Tirthapura, “Fleet: Butterfly estimation from a bipartite graph stream,” in Proceedings of the 28th ACM International Conference on Information and Knowledge Management, 2019, pp. 1201–1210
work page 2019
-
[29]
Approximately counting butterflies in large bipartite graph streams,
R. Li, P. Wang, P. Jia, X. Zhang, J. Zhao, J. Tao, Y . Yuan, and X. Guan, “Approximately counting butterflies in large bipartite graph streams,” IEEE Transactions on Knowledge and Data Engineering, vol. 34, no. 12, pp. 5621–5635, 2021
work page 2021
-
[30]
sgrapp: Butterfly approximation in streaming graphs,
A. Sheshbolouki and M. T. ¨Ozsu, “sgrapp: Butterfly approximation in streaming graphs,” ACM Trans. Knowl. Discov. Data, vol. 16, no. 4, pp. 76:1–76:43, 2022. [Online]. Available: https://doi.org/10.1145/3495011
-
[31]
Counting butterflies in fully dynamic bipartite graph streams,
S. Papadias, Z. Kaoudi, V . Pandey, J.-A. Quian ´e-Ruiz, and V . Markl, “Counting butterflies in fully dynamic bipartite graph streams,” in 2024 IEEE 40th International Conference on Data Engineering (ICDE) . IEEE, 2024, pp. 2917–2930
work page 2024
-
[32]
Scalable approximate butterfly and bi-triangle counting for large bipartite networks,
F. Zhang, D. Chen, S. Wang, Y . Yang, and J. Gan, “Scalable approximate butterfly and bi-triangle counting for large bipartite networks,” Proc. ACM Manag. Data , vol. 1, no. 4, pp. 259:1–259:26,
-
[33]
Available: https://doi.org/10.1145/3626753
[Online]. Available: https://doi.org/10.1145/3626753
-
[34]
Sampling algorithms for butterfly counting on temporal bipartite graphs,
J. Pu, Y . Wang, Y . Li, and X. Zhou, “Sampling algorithms for butterfly counting on temporal bipartite graphs,” arXiv preprint arXiv:2310.11886, 2023
-
[35]
Scalable distributed stream join processing,
Q. Lin, B. C. Ooi, Z. Wang, and C. Yu, “Scalable distributed stream join processing,” in Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data , 2015, pp. 811–825
work page 2015
-
[36]
P. Wang, Y . Qi, Y . Sun, X. Zhang, J. Tao, and X. Guan, “Approximately counting triangles in large graph streams including edge duplicates with a fixed memory usage,” Proceedings of the VLDB Endowment , vol. 11, no. 2, pp. 162–175, 2017
work page 2017
-
[37]
Furl: Fixed-memory and uncertainty reducing local triangle counting for multigraph streams,
M. Jung, Y . Lim, S. Lee, and U. Kang, “Furl: Fixed-memory and uncertainty reducing local triangle counting for multigraph streams,” Data Mining and Knowledge Discovery , vol. 33, pp. 1225–1253, 2019
work page 2019
-
[38]
An adaptive query execution system for data integration,
Z. G. Ives, D. Florescu, M. Friedman, A. Levy, and D. S. Weld, “An adaptive query execution system for data integration,” ACM SIGMOD Record, vol. 28, no. 2, pp. 299–310, 1999
work page 1999
-
[39]
Adaptive duplicate detection using learnable string similarity measures,
M. Bilenko and R. J. Mooney, “Adaptive duplicate detection using learnable string similarity measures,” in Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining, 2003, pp. 39–48
work page 2003
-
[40]
What’s going on? learning communication rules in edge networks,
S. Kandula, R. Chandra, and D. Katabi, “What’s going on? learning communication rules in edge networks,” in Proceedings of the ACM SIGCOMM 2008 conference on Data communication , 2008, pp. 87–98
work page 2008
-
[41]
The impact of duplicate orders on demand estimation and capacity investment,
M. Armony and E. L. Plambeck, “The impact of duplicate orders on demand estimation and capacity investment,” Management science, vol. 51, no. 10, pp. 1505–1518, 2005
work page 2005
-
[42]
Exploring duplicate orders in a single-manufacturer multi-distributor supply chain,
S. V . Betancur, G. U. Casta ˜no, and P. Gonc ¸alves, “Exploring duplicate orders in a single-manufacturer multi-distributor supply chain,” in Pro- ceedings of System Dynamics Conference , 2014
work page 2014
-
[43]
Fable: Approximate butterfly counting in bipartite graph stream with duplicate edges,
G. Sun, Y . Zhao, and Y . Li, “Fable: Approximate butterfly counting in bipartite graph stream with duplicate edges,” in Proceedings of the 33rd ACM International Conference on Information and Knowledge Management, 2024, pp. 2158–2167
work page 2024
-
[44]
Counting distinct elements in a data stream,
Z. Bar-Yossef, T. Jayram, R. Kumar, D. Sivakumar, and L. Trevisan, “Counting distinct elements in a data stream,” in Randomization and Ap- proximation Techniques, 6th International Workshop, RANDOM 2002, Cambridge, MA, USA, September 13-15, 2002, Proceedings . Springer, 2002, pp. 1–10
work page 2002
-
[45]
On synopses for distinct-value estimation under multiset operations,
K. Beyer, P. J. Haas, B. Reinwald, Y . Sismanis, and R. Gemulla, “On synopses for distinct-value estimation under multiset operations,” in Proceedings of the 2007 ACM SIGMOD international conference on Management of data , 2007, pp. 199–210
work page 2007
-
[46]
Hyperloglog: the analysis of a near-optimal cardinality estimation algorithm,
P. Flajolet, ´E. Fusy, O. Gandouet, and F. Meunier, “Hyperloglog: the analysis of a near-optimal cardinality estimation algorithm,” Discrete mathematics & theoretical computer science , no. Proceedings, 2007
work page 2007
-
[47]
Probabilistic counting algorithms for data base applications,
P. Flajolet and G. N. Martin, “Probabilistic counting algorithms for data base applications,” Journal of computer and system sciences , vol. 31, no. 2, pp. 182–209, 1985
work page 1985
-
[48]
Streamed approximate counting of distinct elements: Beating optimal batch methods,
D. Ting, “Streamed approximate counting of distinct elements: Beating optimal batch methods,” in Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining , 2014, pp. 442–451
work page 2014
-
[49]
Efficient temporal butterfly counting and enumeration on temporal bipartite graphs,
X. Cai, X. Ke, K. Wang, L. Chen, T. Zhang, Q. Liu, and Y . Gao, “Efficient temporal butterfly counting and enumeration on temporal bipartite graphs,” Proceedings of the VLDB Endowment , vol. 17, no. 4, pp. 657–670, 2023
work page 2023
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.