Recognition: unknown
Generating Synthetic Citation Networks with Communities
Pith reviewed 2026-05-07 14:13 UTC · model grok-4.3
The pith
The Citation Seeder generates realistic synthetic citation networks using up to four orders of magnitude fewer parameters than leading alternatives.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The Citation Seeder algorithm iteratively constructs directed graphs by adding nodes and directing edges according to a Price-Pareto citation process. When evaluated on seven real citation networks using twenty-six metrics, it produces results competitive with the best-performing baselines while using up to four orders of magnitude fewer parameters. The paper further establishes that reversing edges in static generators breaks cycles and induces realistic flow, and that exogenous mesoscopic similarities between generated and real networks matter more for realism than endogenous ones; high-parameter models fail because they memorize planted community statistics rather than capturing general,,
What carries the argument
The Citation Seeder algorithm, an iterative generator that adds nodes and edges according to the Price-Pareto preferential attachment model with a small number of interpretable parameters and linear runtime.
If this is right
- Reversing edge directions in static community generators produces more citation-like acyclic structures and improves performance of the degree-corrected stochastic block model.
- Exogenous mesoscopic similarities are more important than endogenous ones when judging how realistic a generated network is.
- Models with many free parameters overfit by memorizing the statistics of planted communities and therefore fail to produce realistic overall network structure.
- The Citation Seeder supplies an interpretable framework that can both explain observed citation growth and forecast future network evolution.
Where Pith is reading between the lines
- The small number of interpretable parameters could be estimated from early citation data to forecast how a new paper or patent will accumulate citations over time.
- The same iterative construction might be adapted to generate synthetic patent or software-dependency networks that also exhibit near-acyclic flow.
- Improved low-parameter benchmarks could make community-detection methods more robust when tested on networks whose community structure evolves rather than remaining fixed.
- The distinction between endogenous and exogenous similarities suggests that future generators should prioritize matching global growth statistics over exact reproduction of any single planted partition.
Load-bearing premise
The chosen twenty-six metrics on seven real networks are enough to judge whether a synthetic citation network is realistic, and that the ground-truth communities are fixed external features rather than arising from the network's own growth rules.
What would settle it
Fit the Citation Seeder parameters to the first half of a real citation network's history, generate forward trajectories, and check whether the predicted citation counts and community evolution match the actual second half of the same network.
Figures
read the original abstract
Generating realistic synthetic citation, patent, or component dependency networks is essential for benchmarking community detection, graph visualisation, and network data mining algorithms. We present the first systematic comparison of generators of directed graphs that are nearly acyclic and have a ground-truth community structure. We evaluate 12 methods across 7 real citation networks and 26 metrics. We propose the practice of reversing directions of edges in static generators to break cycles and induce a citation-like flow, which significantly improves the performance of a degree-corrected Stochastic Block Model. Our novel methodological approach to evaluating community detection benchmarks distinguishes between endogenous and exogenous mesoscopic similarities, with the latter proving more important. This distinction reveals that high-parameter models suffer from overfitting by memorising planted community statistics which lead to their failing to produce realistic networks. Finally, we introduce the Citation Seeder (CS) algorithm, an iterative generator grounded in the Price-Pareto model of citation networks, with interpretable parameters and O(N+E) runtime. CS achieves competitive results against the best-performing baselines while using up to four orders of magnitude fewer parameters and providing a clean framework for explaining and predicting a network's future growth.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents the first systematic comparison of 12 generators for directed, nearly acyclic graphs with ground-truth community structure. It evaluates them across 7 real citation networks using 26 metrics, proposes reversing edge directions in static generators (improving degree-corrected SBM performance), distinguishes endogenous from exogenous mesoscopic similarities (finding the latter more important), critiques high-parameter models for overfitting planted communities, and introduces the Citation Seeder (CS) algorithm based on the Price-Pareto model. CS is claimed to achieve competitive results with up to four orders of magnitude fewer parameters, O(N+E) runtime, and a framework for explaining and predicting network growth.
Significance. If the results hold, the work supplies a parsimonious, interpretable generative model for citation networks that avoids the overfitting issues identified in high-parameter baselines. The endogenous/exogenous distinction offers a useful methodological lens for designing community-detection benchmarks, and the low-parameter count plus linear runtime would make CS practical for large-scale synthetic data generation in network science.
major comments (2)
- [Abstract / Evaluation] Abstract and evaluation description: The central claim that CS supplies a framework for predicting a network's future growth is not load-bearingly supported by the described evaluation. The 26 metrics are applied to static snapshots of 7 networks; temporal validation (e.g., citation-age distributions, attachment-rate evolution across time slices, or hold-out prediction of future edges) is required to test whether the iterative Price-Pareto process reproduces observed growth dynamics rather than merely matching static statistics.
- [Abstract] Abstract: The assertion that exogenous mesoscopic similarities are more important than endogenous ones, and that this explains why high-parameter models fail, needs explicit quantitative linkage to the 26 metrics. Which specific metrics demonstrate the overfitting of planted community statistics, and how does CS avoid this while still producing realistic networks?
minor comments (1)
- The abstract refers to '26 metrics' without enumerating or categorizing them (structural, community, temporal, etc.). A concise table or appendix listing the metrics and their groupings would improve reproducibility and allow readers to assess coverage of growth-related properties.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed report. The comments highlight important aspects of our evaluation and claims that we address point by point below. We propose targeted revisions to strengthen the manuscript without altering its core contributions.
read point-by-point responses
-
Referee: [Abstract / Evaluation] Abstract and evaluation description: The central claim that CS supplies a framework for predicting a network's future growth is not load-bearingly supported by the described evaluation. The 26 metrics are applied to static snapshots of 7 networks; temporal validation (e.g., citation-age distributions, attachment-rate evolution across time slices, or hold-out prediction of future edges) is required to test whether the iterative Price-Pareto process reproduces observed growth dynamics rather than merely matching static statistics.
Authors: We agree that the predictive capability of the CS framework would benefit from explicit temporal validation to fully substantiate the claim in the abstract. The Price-Pareto model is iterative by design, with parameters that directly control attachment rates and growth, enabling forward simulation in principle. However, the current experiments evaluate static match to real networks. We will add a new subsection with hold-out experiments: training CS parameters on early time slices of the citation networks and evaluating prediction of later edges via metrics such as precision@K for future citations and evolution of in-degree distributions. This will be reported alongside the existing 26 metrics. revision: yes
-
Referee: [Abstract] Abstract: The assertion that exogenous mesoscopic similarities are more important than endogenous ones, and that this explains why high-parameter models fail, needs explicit quantitative linkage to the 26 metrics. Which specific metrics demonstrate the overfitting of planted community statistics, and how does CS avoid this while still producing realistic networks?
Authors: The distinction between endogenous and exogenous mesoscopic structure is operationalized by comparing models that directly optimize or plant community statistics (e.g., degree-corrected SBM variants) against those that generate communities as an emergent outcome of the growth process (CS and certain baselines). Overfitting is evidenced by high-parameter models achieving strong scores on community-aware metrics such as modularity, conductance, and normalized mutual information with the planted labels, yet performing poorly on global structural metrics including degree distribution, clustering coefficient, and acyclicity measures. CS avoids this by using only a handful of interpretable parameters that govern preferential attachment and community seeding without directly fitting mesoscopic statistics. We will insert a new paragraph and reference table in the evaluation section that explicitly maps subsets of the 26 metrics to the endogenous/exogenous distinction and the overfitting diagnosis. revision: partial
Circularity Check
No significant circularity: CS derivation grounded in external Price-Pareto model with independent benchmarks
full rationale
The paper grounds the Citation Seeder (CS) in the established Price-Pareto citation model and evaluates generated networks against 7 independent real citation networks using 26 metrics. This provides external validation rather than reducing claims to self-definition, fitted inputs renamed as predictions, or load-bearing self-citations. The distinction between endogenous/exogenous structure is a methodological contribution evaluated on held-out real data, and the predictive growth framework follows directly from the iterative generative process without tautological reduction. No equations or steps in the provided text exhibit the enumerated circular patterns.
Axiom & Free-Parameter Ledger
free parameters (1)
- parameters of CS
axioms (1)
- domain assumption Citation networks follow the Price-Pareto model
Reference graph
Works this paper leans on
-
[1]
Community detection and stochastic block models: Recent developments
Abbe, E., 2018. Community detection and stochastic block models: Recent developments. Journal of Machine Learning Research 18, 1–86
2018
-
[2]
Achieving the KS threshold in the general Stochastic Block Model with linearized acyclic belief propagation, in: Proc
Abbé, E., Sandon, C., 2016. Achieving the KS threshold in the general Stochastic Block Model with linearized acyclic belief propagation, in: Proc. Neural Information Processing Systems (NIPS’16), pp. 1334–1342
2016
-
[3]
Statistical mechanics of complex networks
Albert, R., Barabási, A.L., 2002. Statistical mechanics of complex networks. Reviews of Modern Physics 74, 47–97. doi:10.1103/RevModPhys.74.47
-
[4]
Bertoli-Barsotti, L., Gagolewski, M., Siudem, G., Żogała Siudem, B., 2024. Equivalence of inequality indices in the three-dimensional model of informetric impact. Journal of Informetrics 18, 101566. doi:10.1016/j.joi.2024.101566
-
[5]
Competition and multiscaling in evolving networks
Bianconi, G., Barabási, A.L., 2001. Competition and multiscaling in evolving networks. Europhysics Letters 54, 436. doi:10.1209/epl/i2001-00260-6
-
[6]
Deep Gaussian embedding of graphs: Unsupervised inductive learning via ranking, in: International Conference on Learning Representations (ICLR’18)
Bojchevski, A., Günnemann, S., 2018. Deep Gaussian embedding of graphs: Unsupervised inductive learning via ranking, in: International Conference on Learning Representations (ICLR’18)
2018
-
[7]
A probabilistic proof of an asymptotic formula for the number of labelled regular graphs
Bollobás, B., 1980. A probabilistic proof of an asymptotic formula for the number of labelled regular graphs. European Journal of Combinatorics 1, 311–316. doi:10.1016/S0195-6698(80)80030-8
-
[8]
Graph generators
Bonifati, A., Holubová, I., Prat-Pèrez, A., Sakr, S., 2020. Graph generators. ACM Computing Surveys 53, 1–30. doi:10. 1145/3379445
2020
-
[9]
Brisson, L., Bothorel, C., Duminy, N., 2025. DynBenchmark: Customizable ground truths to benchmark community detection and tracking in temporal networks, in: Proc. France’s International Conference on Complex Systems (FRCCS 2025), pp. 74–85. doi:10.1007/978-3-032-00206-8_8
-
[10]
UnPACD: Unified patent and citation dataset.https://github.com/lukaszbrzozowski/unpacd
Brzozowski, L., 2026. UnPACD: Unified patent and citation dataset.https://github.com/lukaszbrzozowski/unpacd
2026
-
[11]
The Price-Pareto growth model of networks with community structure
Brzozowski, L., Gagolewski, M., Siudem, G., Żogała Siudem, B., 2026. The Price-Pareto growth model of networks with community structure doi:10.48550/arxiv.2510.13392. under review (preprint). Brzozowski, Gagolewski, Siudem:Preprint; Last updated on April 29, 2026Page 21 of 23 Generating Synthetic Citation Networks with Communities
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2510.13392 2026
-
[12]
Recovering asymmetric communities in the Stochastic Block Model
Caltagirone, F., Lelarge, M., Miolane, L., 2017. Recovering asymmetric communities in the Stochastic Block Model. IEEE Transactions on Network Science and Engineering 5, 237–246. doi:10.1109/tnse.2017.2758201
-
[13]
Generating directed graphs with dual attention and asymmetric encoding, in: The Fourteenth International Conference on Learning Representations
Carballo-Castro, A., Madeira, M., QIN, Y., Thanou, D., Frossard, P., 2026. Generating directed graphs with dual attention and asymmetric encoding, in: The Fourteenth International Conference on Learning Representations
2026
-
[14]
Relational topic models for document networks, in: van Dyk, D., Welling, M
Chang, J., Blei, D., 2009. Relational topic models for document networks, in: van Dyk, D., Welling, M. (Eds.), Proceedings of the Twelfth International Conference on Artificial Intelligence and Statistics, pp. 81–88
2009
-
[15]
Community detection in subspace of attribute
Chen, H., Yu, Z., Yang, Q., Shao, J., 2022. Community detection in subspace of attribute. Information Sciences 602, 220–235. doi:10.1016/j.ins.2022.04.047
-
[16]
Decelle, A., Krząkała, F., Moore, C., Zdeborová, L., 2011. Asymptotic analysis of the Stochastic Block Model for modular networks and its algorithmic applications. Physical Review E 84. doi:10.1103/physreve.84.066106
-
[17]
Random graph modeling: A survey of the concepts
Drobyshevskiy, M., Turdakov, D., 2019. Random graph modeling: A survey of the concepts. ACM Computing Surveys 52, 131. doi:10.1145/3369782
-
[18]
A fast and effective heuristic for the feedback arc set problem
Eades, P., Lin, X., Smyth, W., 1993. A fast and effective heuristic for the feedback arc set problem. Information Processing Letters 47, 319–323. doi:10.1016/0020-0190(93)90079-O
-
[19]
The use of ranks to avoid the assumption of normality implicit in the analysis of variance
Friedman, M., 1937. The use of ranks to avoid the assumption of normality implicit in the analysis of variance. Journal of the American Statistical Association 32, 675–701
1937
-
[20]
Gagolewski, M., 2022. A framework for benchmarking clustering algorithms. SoftwareX 20, 101270. doi:10.1016/j.softx. 2022.101270
-
[21]
Are cluster validity measures (in)valid? Information Sciences 581, 620–636
Gagolewski, M., Bartoszuk, M., Cena, A., 2021. Are cluster validity measures (in)valid? Information Sciences 581, 620–636. doi:10.1016/j.ins.2021.10.004
-
[22]
doi: https://doi.org/10.1016/0378-8733(83)90021-7
Holland, P.W., Laskey, K.B., Leinhardt, S., 1983. Stochastic blockmodels: First steps. Social Networks 5, 109–137. doi:10.1016/0378-8733(83)90021-7
-
[23]
The aging effect in evolving scientific citation networks
Hu, F., Ma, L., Zhan, X.X., Zhou, Y., Liu, C., Zhao, H., Zhang, Z.K., 2021. The aging effect in evolving scientific citation networks. Scientometrics 126, 4297–4309. doi:10.1007/s11192-021-03929-8
-
[24]
Open Graph Benchmark: Datasets for machine learning on graphs
Hu, W., Fey, M., Zitnik, M., Dong, Y., Ren, H., Liu, B., Catasta, M., Leskovec, J., 2020. Open Graph Benchmark: Datasets for machine learning on graphs. Advances in Neural Information Processing Systems 33, 22118–22133
2020
-
[25]
Cluster analysis: A modern statistical review
Jaeger, A., Banks, D., 2023. Clusteranalysis: Amodernstatisticalreview. WileyInterdisciplinaryReviews: Computational Statistics 15, e1597. doi:10.1002/wics.1597
-
[26]
Random graph models for directed acyclic networks
Karrer, B., Newman, M.E.J., 2009. Random graph models for directed acyclic networks. Physical Review E 80, 046110. doi:10.1103/PhysRevE.80.046110
-
[27]
Stochastic blockmodels with a growing number of classes
Karrer, B., Newman, M.E.J., 2011. Stochastic blockmodels with a growing number of classes. Physical Review E 83, 016107. doi:10.1103/PhysRevE.83.016107
-
[28]
Benchmark graphs for testing community detection algorithms , volume =
Lancichinetti, A., Fortunato, S., Radicchi, F., 2008. Benchmark graphs for testing community detection algorithms. Physical Review E 78. doi:10.1103/physreve.78.046110
-
[29]
A review of Stochastic Block Models and extensions for graph clustering
Lee, C., Wilkinson, D., 2019. A review of Stochastic Block Models and extensions for graph clustering. Applied Network Science 4. doi:10.1007/s41109-019-0232-2
-
[30]
Bayesian testing for exogenous partition structures in Stochastic Block Models
Legramanti, S., Rigon, T., Durante, D., 2022. Bayesian testing for exogenous partition structures in Stochastic Block Models. Sankhya A 84, 108–126. doi:10.1007/s13171-020-00231-2
-
[31]
SNAPDatasets: Stanfordlargenetworkdatasetcollection.http://snap.stanford.edu/data
Leskovec, J., Krevl, A., 2014. SNAPDatasets: Stanfordlargenetworkdatasetcollection.http://snap.stanford.edu/data
2014
-
[32]
Dirichlet graph variational autoencoder, in: Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., Lin, H
Li, J., Yu, J., Li, J., Zhang, H., Zhao, K., Rong, Y., Cheng, H., Huang, J., 2020. Dirichlet graph variational autoencoder, in: Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., Lin, H. (Eds.), Advances in Neural Information Processing Systems, pp. 5274–5283
2020
-
[33]
Hygen: Generating random graphs with hyperbolic communities
Metzler, S., Miettinen, P., 2019. Hygen: Generating random graphs with hyperbolic communities. Applied Network Science 4. doi:10.1007/s41109-019-0166-8
-
[34]
The computer science and physics of community detection: Landscapes, phase transitions, and hardness
Moore, C., 2017. The computer science and physics of community detection: Landscapes, phase transitions, and hardness. Bulletin of the EATCS 121
2017
-
[35]
Nanumyan, V., Gote, C., Schweitzer, F., 2020. Multilayer network approach to modeling authorship influence on citation dynamics in physics journals. Physical Review E 102. doi:10.1103/physreve.102.032303
-
[36]
ma-CODE: A multi-phase approach on community detection in evolving networks
Nath, K., Shanmugam, R., Varadaranjan, V., 2021. ma-CODE: A multi-phase approach on community detection in evolving networks. Information Sciences 569, 326–343. doi:10.1016/j.ins.2021.02.068
-
[37]
Newman, M., 2018. Networks. Oxford University Press. doi:10.1093/oso/9780198805090.001.0001
-
[38]
The first-mover advantage in scientific publication
Newman, M.E.J., 2009. The first-mover advantage in scientific publication. EPL (Europhysics Letters) 86, 68001–68001. doi:10.1209/0295-5075/86/68001
-
[39]
Price, D., 1965. Networks of scientific papers. Science 149, 510–515. doi:10.1126/science.149.3683.510
-
[40]
Rosvall, M., Axelsson, D., Bergstrom, C.T., 2009. The map equation. The European Physical Journal Special Topics 178, 13–23. doi:10.1140/epjst/e2010-01179-1
-
[41]
Collective classification in network data
Sen, P., Namata, G.M., Bilgic, M., Getoor, L., Gallagher, B., Eliassi-Rad, T., 2008. Collective classification in network data. AI Magazine 29, 93–106
2008
-
[42]
Three dimensions of scientific impact
Siudem, G., Żogała Siudem, B., Cena, A., Gagolewski, M., 2020. Three dimensions of scientific impact. Proceedings of the National Academy of Sciences of the United States of America (PNAS) 117, 13896–13900. doi:10.1073/pnas.2001064117
-
[43]
Community detection in directed acyclic graphs
Speidel, L., Takaguchi, T., Masuda, N., 2015. Community detection in directed acyclic graphs. The European Physical Journal B 88. doi:10.1140/epjb/e2015-60226-y
-
[44]
Sun, J., Ajwani, D., Nicholson, P.K., Sala, A., Parthasarathy, S., 2017. Breaking cycles in noisy hierarchies, in: Proceedings Brzozowski, Gagolewski, Siudem:Preprint; Last updated on April 29, 2026Page 22 of 23 Generating Synthetic Citation Networks with Communities of the 2017 ACM on Web Science Conference, Association for Computing Machinery, New York,...
-
[45]
Arnetminer: Extraction and mining of academic social networks, in: KDD’08, pp
Tang, J., Zhang, J., Yao, L., Li, J., Zhang, L., Su, Z., 2008. Arnetminer: Extraction and mining of academic social networks, in: KDD’08, pp. 990–998
2008
-
[46]
Scientific Reports9(1), 5233 (2019) https://doi
Traag, V., Waltman, L., vanEck, N.J., 2019. FromLouvaintoLeiden: Guaranteeingwell-connectedcommunities. Scientific Reports 9, 5233. doi:10.1038/s41598-019-41695-z
-
[47]
Validation of cluster analysis results on validation data: A systematic framework
Ullmann, T., Hennig, C., Boulesteix, A.L., 2022. Validation of cluster analysis results on validation data: A systematic framework. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 12, e1444. doi:10.1002/widm.1444
-
[48]
A white paper on good research practices in benchmarking: The case of cluster analysis
van Mechelen, I., Boulesteix, A.L., Dangl, R., et al., 2023. A white paper on good research practices in benchmarking: The case of cluster analysis. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 13, e1511. doi:10. 1002/widm.1511
2023
-
[49]
Quantifying long-term scientific impact
Wang, D., Song, C., Barabási, A.L., 2013. Quantifying long-term scientific impact. Science 342, 127–132
2013
-
[50]
Detectability of macroscopic structures in directed asymmetric Stochastic Block Model
Wiliński, M., Mazzarisi, P., Tantari, D., Lillo, F., 2019. Detectability of macroscopic structures in directed asymmetric Stochastic Block Model. Physical Review E 99. doi:10.1103/physreve.99.042310
-
[51]
Community detection based on modularity and k-plexes
Zhu, J., Chen, B., Zeng, Y., 2020. Community detection based on modularity and k-plexes. Information Sciences 513, 127–142. doi:10.1016/j.ins.2019.10.076. Brzozowski, Gagolewski, Siudem:Preprint; Last updated on April 29, 2026Page 23 of 23
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.