Deforking the World of Code: A Project-Provenance Map that Recovers Cross-Forge Fork Families that Platform Graphs Cannot See
Pith reviewed 2026-06-30 02:01 UTC · model grok-4.3
The pith
Shared-commit clustering yields a deforking map that recovers cross-forge fork families platform graphs miss.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We release a curated deforking map for the World of Code (WoC) version V2604: p2PFull, which collapses every raw repository p into the deforked project P to which it belongs, built from the global shared-commit relation (51.79M shared-commit groups) via a hub-node star encoding and parallel Louvain clustering, plus capped variants (cap250/cap500) that bound mega-cluster size. The naive shared-history union over-merges: the project graph welds unrelated software into giant clusters (largest uncapped cluster 861,948 repositories, bridged by shared-commit groups as large as 267,200), for the same structural reason author-identity graphs do. A cheap size cap removes the boilerplate-hub bridges.
What carries the argument
The p2PFull deforking map, which collapses repositories into projects via shared-commit groups encoded as hub-node stars and clustered with parallel Louvain, plus size-capped variants.
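As a rough illustration of the encoding named above, the sketch below turns each shared-commit group into a star around a synthetic hub node and then takes connected components with union-find. It is a minimal sketch under assumptions: the input format, the `hub:` naming, and the function names are hypothetical, and union-find stands in for the parallel Louvain step, so the result corresponds to the naive shared-history union rather than to the released p2PFull map.

```python
from collections import defaultdict


def star_edges(groups):
    """Encode each shared-commit group as a star: one hub node linked to every member.

    `groups` maps a group id to the repositories sharing that commit history
    (hypothetical input format). A group of k repositories costs k edges,
    versus k*(k-1)/2 for a full clique.
    """
    edges = []
    for gid, repos in groups.items():
        hub = f"hub:{gid}"  # synthetic node, not a real repository
        edges.extend((hub, repo) for repo in repos)
    return edges


def connected_components(edges):
    """Union-find over the star graph; a stand-in for Louvain (the naive union)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in edges:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb

    clusters = defaultdict(set)
    for node in list(parent):
        if not node.startswith("hub:"):  # report repositories only
            clusters[find(node)].add(node)
    return sorted(sorted(c) for c in clusters.values())


# Toy data: g1 and g2 overlap on gitlab.com/b/x, so they merge into one project.
groups = {
    "g1": ["github.com/a/x", "gitlab.com/b/x"],
    "g2": ["gitlab.com/b/x", "codeberg.org/c/x"],
    "g3": ["github.com/d/y", "github.com/e/y"],
}
projects = connected_components(star_edges(groups))

# Edge counts for the largest bridging group size quoted in the core claim.
k = 267_200
clique_edge_count = k * (k - 1) // 2  # 35,697,786,400 pairwise edges
star_edge_count = k                   # 267,200 hub edges
```

The edge counts at the end suggest why a star encoding is attractive at this scale: a single group of 267,200 repositories would need roughly 35.7 billion pairwise edges as a clique but only 267,200 as a star.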
If this is right
- Popularity and spread metrics computed on raw repositories are inflated by orders of magnitude and must be recomputed on the deforked project level.
- The map recovers fork families that span multiple forges (5.41%) and families whose fork root is not on GitHub (1.51%).
- An exclusion list of 134.1 million children and a detached-fork inventory of 455,550 hard-detached edges become available for reuse.
- Project mega-clusters arise for the same structural reason as author-identity mega-clusters, but here a size cap suffices and the structural-bridge cut used for authors is not applied.
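The first consequence in this list can be made concrete with a small sketch: count the distinct deforked projects in which a commit appears instead of the raw repositories. The mapping name `p2P`, the toy repository names, and the helper functions are assumptions for illustration; the file layout of the released map is not reproduced here.

```python
def raw_spread(repos):
    """Number of raw repositories containing a commit."""
    return len(set(repos))


def deforked_spread(repos, p2P):
    """Number of distinct deforked projects containing a commit.

    Repositories missing from the map are treated as their own project
    (an assumption of this sketch).
    """
    return len({p2P.get(repo, repo) for repo in repos})


# Hypothetical map: three forks collapse into project "P1".
p2P = {
    "github.com/u1/lib": "P1",
    "github.com/u2/lib": "P1",
    "gitlab.com/u3/lib": "P1",
    "github.com/v/app": "P2",
}
commit_repos = [
    "github.com/u1/lib",
    "github.com/u2/lib",
    "gitlab.com/u3/lib",
    "github.com/v/app",
    "github.com/w/unmapped",
]

raw = raw_spread(commit_repos)                 # 5 repositories
deforked = deforked_spread(commit_repos, p2P)  # 3 projects: P1, P2, and the unmapped repo
```

In this toy case the raw count overstates spread by a factor of 5/3; the inflation reported for real data is orders of magnitude larger.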
Where Pith is reading between the lines
- The same clustering pipeline could be rerun on future WoC snapshots to track how fork families evolve over time.
- Integration with author-identity resolution would allow joint analysis of both project and contributor provenance.
- The structural-bridge diagnostic, run here but deliberately not applied, could be applied in later work to test whether residual clusters split further.
Load-bearing premise
That the clusters remaining after size capping represent genuine vendored history rather than unrelated projects merged by boilerplate.
What would settle it
A sample audit of residual clusters after the cap250 or cap500 cut to determine whether they contain unrelated projects or only vendored copies.
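One way such an audit set might be drawn is sketched below: a reproducible random sample of the larger post-cap clusters, which a reviewer would then inspect by hand. The size threshold, sample size, seed, and cluster ids are all hypothetical choices for this sketch, not values from the paper.

```python
import random


def sample_clusters_for_audit(cluster_sizes, min_size, n, seed=0):
    """Draw a reproducible random sample of clusters with at least `min_size` repos."""
    eligible = sorted(cid for cid, size in cluster_sizes.items() if size >= min_size)
    rng = random.Random(seed)  # fixed seed so the audit set can be regenerated
    return rng.sample(eligible, min(n, len(eligible)))


# Hypothetical post-cap cluster sizes keyed by project id.
cluster_sizes = {"P1": 3, "P2": 180, "P3": 249, "P4": 12, "P5": 95, "P6": 240}
audit_set = sample_clusters_for_audit(cluster_sizes, min_size=90, n=2, seed=42)
```

Sorting the eligible ids before sampling keeps the draw independent of dictionary ordering, so the same seed yields the same audit set across runs.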
Original abstract
Forks share git history, so a commit surfaces in many repositories and any spread- or popularity-based measure over raw repositories is inflated by orders of magnitude. We release a curated deforking map for the World of Code (WoC) version V2604: p2PFull, which collapses every raw repository p into the deforked project P to which it belongs, built from the global shared-commit relation (51.79M shared-commit groups) via a hub-node star encoding and parallel Louvain clustering, plus capped variants (cap250/cap500) that bound mega-cluster size. The naive shared-history union over-merges: the project graph welds unrelated software into giant clusters (largest uncapped cluster 861,948 repositories, bridged by shared-commit groups as large as 267,200), for the same structural reason author-identity graphs do. A cheap size cap removes the boilerplate-hub bridges; a structural-bridge diagnostic, the cut that dissolved the analogous author mega-cluster, run here but deliberately not applied, shows the post-cap residual is genuine vendored history, robust to the cut, so we leave it intact. We validate the map against GitHub's declared fork graph reconstructed from GHArchive ForkEvents, finding 99.01% edge agreement conditional on both repositories being in WoC. Disagreements fall into two classes: a completeness byproduct (edges GitHub asserts but WoC has not ingested) and the central contribution, WoC-only fork families that GitHub's platform graph cannot represent, including 5.41% multi-forge families and 1.51% whose fork root is not on GitHub. We additionally release a refreshed fork-exclusion list (134.1M children, 3.4x the GHTorrent-era 39.5M) and a detached-fork inventory (455,550 hard-detached edges; 240,441 genuine independent origins). All artifacts are a self-contained, independently hosted replication package keyed to the WoC V2604 collection.
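The conditional edge-agreement figure in the abstract can be read as the computation sketched here. This is a minimal sketch, assuming that "in WoC" means "present in the `p2P` map" and that an edge agrees when both endpoints map to the same deforked project; the function name and the toy edges are hypothetical, and the paper's actual pipeline may differ.

```python
def edge_agreement(fork_edges, p2P):
    """Classify GitHub-declared fork edges against a deforking map.

    Returns (agree, disagree, not_ingested). The agreement rate is taken over
    edges whose two endpoints are both in the map, mirroring the
    "conditional on both repositories being in WoC" wording.
    """
    agree = disagree = not_ingested = 0
    for child, parent in fork_edges:
        if child not in p2P or parent not in p2P:
            not_ingested += 1  # completeness byproduct, excluded from the rate
        elif p2P[child] == p2P[parent]:
            agree += 1
        else:
            disagree += 1
    return agree, disagree, not_ingested


# Toy data: (child, parent) pairs as a ForkEvent-style edge list.
p2P = {"a/x": "P1", "b/x": "P1", "c/x": "P1", "d/y": "P2"}
fork_edges = [("b/x", "a/x"), ("c/x", "a/x"), ("d/y", "a/x"), ("e/z", "a/x")]

agree, disagree, not_ingested = edge_agreement(fork_edges, p2P)
rate = agree / (agree + disagree)  # 2 of 3 edges with both endpoints present

# Growth of the fork-exclusion list quoted in the abstract (millions of children).
growth = round(134.1 / 39.5, 1)  # 3.4
```

Note that this rate can only test edges GitHub declares; it says nothing about families that exist only in the shared-commit relation, which is the point the referee report raises.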
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims to release a curated deforking map (p2PFull and capped variants) for WoC V2604 that collapses raw repositories into projects via the global shared-commit relation (51.79M groups), hub-node star encoding, and parallel Louvain clustering; it reports 99.01% edge agreement with GitHub's declared fork graph (conditional on both repos in WoC), identifies WoC-only cross-forge families, and releases an updated fork-exclusion list plus detached-fork inventory, all as a self-contained replication package.
Significance. If the map is sound, the release would improve accuracy of popularity, spread, and other metrics in software engineering research by correcting fork-induced inflation; the provision of independently hosted, keyed replication artifacts and the scale of the fork-exclusion list (134.1M children) constitute a concrete contribution to the field.
major comments (2)
- Abstract: the 99.01% agreement validates only GitHub-declared edges already present in WoC and provides no independent test of the central claim (5.41% multi-forge families and 1.51% non-GitHub fork roots); without external grounding for these WoC-only families the support for the map's added value remains incomplete.
- Abstract: the claim that post-cap residual clusters are 'genuine vendored history' and 'robust to the cut' rests on a structural-bridge diagnostic that was run but deliberately not applied; no quantitative outcome of that diagnostic (e.g., number or size of clusters that would be split) is reported, leaving the load-bearing assumption about residual over-merging untested.
minor comments (1)
- Abstract: the largest uncapped cluster (861,948 repositories) is bridged by shared-commit groups 'as large as 267,200'; clarify whether the bridge size refers to the number of repositories or commits and how these figures relate to the hub-node encoding.
Simulated Author's Rebuttal
We thank the referee for the careful review and constructive comments on validation and reporting. We respond to each major comment below.
Point-by-point responses
- Referee (Abstract): the 99.01% agreement validates only GitHub-declared edges already present in WoC and provides no independent test of the central claim (5.41% multi-forge families and 1.51% non-GitHub fork roots); without external grounding for these WoC-only families the support for the map's added value remains incomplete.
  Authors: The 99.01% agreement is computed exclusively on GitHub-declared edges present in WoC, as stated. This provides strong validation of the method wherever external ground truth exists. The central claims (5.41% multi-forge families and 1.51% non-GitHub roots) concern families invisible to GitHub's platform graph by definition; no GitHub-derived test can validate them. We report a full breakdown of disagreements, separating completeness artifacts from these WoC-only families. While an independent external ground truth for the WoC-only cases would be desirable, constructing it requires a separate multi-platform study far beyond the scope of this data-release paper. The high fidelity on all observable cases supplies the necessary empirical support for the procedure's application to the unobservable cases.
  Revision: no
- Referee (Abstract): the claim that post-cap residual clusters are 'genuine vendored history' and 'robust to the cut' rests on a structural-bridge diagnostic that was run but deliberately not applied; no quantitative outcome of that diagnostic (e.g., number or size of clusters that would be split) is reported, leaving the load-bearing assumption about residual over-merging untested.
  Authors: The referee correctly notes that quantitative results from the structural-bridge diagnostic are not reported. The diagnostic (identical to the cut that resolved author mega-clusters) was executed on the capped clusters; it would have split only a modest number of clusters whose resulting components aligned with known vendoring patterns rather than erroneous merges. We will add the specific quantitative outcomes (number of clusters affected, pre- and post-cut sizes) to the revised manuscript so that the decision to retain the residuals is fully documented and testable.
  Revision: yes
Circularity Check
Deforking map validity for residual clusters justified via self-cited structural-bridge diagnostic
specific steps
- Self-citation, load-bearing [Abstract]: "a structural-bridge diagnostic, the cut that dissolved the analogous author mega-cluster, run here but deliberately not applied, shows the post-cap residual is genuine vendored history, robust to the cut, so we leave it intact."
  The assertion that residual clusters after size capping are genuine vendored history (the key assumption enabling the p2PFull map to be treated as accurate for new cross-forge families) is justified solely by the structural-bridge diagnostic from the author's prior work on author mega-clusters. No independent evidence is supplied for this application, making the central claim about the map's correctness dependent on that self-referential justification.
full rationale
The paper constructs the deforking map from shared-commit data via standard clustering and size-capping, with external GitHub validation at 99.01% agreement on overlapping edges. The load-bearing assumption that post-cap residuals represent genuine vendored history (rather than over-merged projects) is justified only by reference to the structural-bridge diagnostic from the author's prior author-identity work, which was run but not applied. This creates moderate self-citation dependence for the central claim about cross-forge families, but the derivation itself is data-driven and partially externally benchmarked, so the result does not reduce to its inputs by construction.
Axiom & Free-Parameter Ledger
free parameters (1)
- cluster size cap = 250/500
axioms (1)
- domain assumption: Repositories sharing commits belong to the same project
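The interaction between the free parameter and the axiom in this ledger can be sketched on toy data. How the cap is applied is not specified in this summary, so the sketch assumes one plausible reading: shared-commit groups with more than `cap` member repositories are treated as boilerplate hubs and skipped before clustering. Plain union-find again replaces Louvain, and all names and data are hypothetical.

```python
from collections import defaultdict


def cluster(groups, cap=None):
    """Union repositories that co-occur in a shared-commit group.

    Assumption of this sketch: groups with more than `cap` members are
    skipped as boilerplate hubs. `cap=None` gives the naive union.
    """
    parent = {}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for repos in groups.values():
        for repo in repos:
            parent.setdefault(repo, repo)  # keep every repo, even if its group is skipped
        if not repos or (cap is not None and len(repos) > cap):
            continue
        root = find(repos[0])
        for repo in repos[1:]:
            parent[find(repo)] = root

    clusters = defaultdict(set)
    for repo in list(parent):
        clusters[find(repo)].add(repo)
    return sorted(sorted(c) for c in clusters.values())


# Two real fork families plus one oversized group of unrelated repositories
# that happen to share a boilerplate commit.
groups = {
    "fork_family_a": ["a1", "a2"],
    "fork_family_b": ["b1", "b2"],
    "boilerplate": ["a1", "b1", "c1", "d1"],
}

uncapped = cluster(groups)        # one welded cluster of six repositories
capped = cluster(groups, cap=3)   # the two families plus two singletons
```

Under the axiom alone, the boilerplate group welds everything together; with the assumed cap, the bridge is dropped and the two fork families survive intact.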
Reference graph
Works this paper leans on
- [1] Georgios Gousios. 2013. The GHTorrent Dataset and Tool Suite. In Proceedings of the 10th Working Conference on Mining Software Repositories (MSR). 233–236. doi:10.1109/MSR.2013.6624034
- [2] Georgios Gousios and Diomidis Spinellis. 2012. GHTorrent: GitHub’s Data from a Firehose. In Proceedings of the 9th IEEE Working Conference on Mining Software Repositories (MSR). 12–21. doi:10.1109/MSR.2012.6224294
- [3] Eirini Kalliamvakou, Georgios Gousios, Kelly Blincoe, Leif Singer, Daniel M. German, and Daniela Damian. 2014. The Promises and Perils of Mining GitHub. In Proceedings of the 11th Working Conference on Mining Software Repositories (MSR). 92–101. doi:10.1145/2597073.2597074
- [4] Yuxing Ma, Chris Bogart, Sadika Amreen, Russell Zaretzki, and Audris Mockus. 2019. World of Code: An Infrastructure for Mining the Universe of Open Source VCS Data. In Proceedings of the 16th International Conference on Mining Software Repositories (MSR). 143–154. doi:10.1109/MSR.2019.00031
- [5] Yuxing Ma, Tapajit Dey, Chris Bogart, Sadika Amreen, Marat Valiev, Adam Tutko, David Kennard, Russell Zaretzki, and Audris Mockus. 2021. World of Code: Enabling a Research Workflow for Mining and Analyzing the Universe of Open Source VCS Data. Empirical Software Engineering 26, 2 (2021), 22. doi:10.1007/s10664-020-09905-9
- [6] Audris Mockus. 2026. Scaling Author Identity Disambiguation to the World of Code: A Methodology. Companion methodology log; full experimental record (Exps. 1–18).
- [7] Audris Mockus, Diomidis Spinellis, Zoe Kotti, and Gabriel John Dusing. 2020. A Complete Set of Related Git Repositories Identified via Community Detection Approaches Based on Shared Commits. In Proceedings of the 17th International Conference on Mining Software Repositories (MSR). 513–517. doi:10.1145/3379597.3387499
- [8] Diomidis Spinellis. 2017. A Repository of Unix History and Evolution. Empirical Software Engineering 22, 3 (2017), 1372–1404. doi:10.1007/s10664-016-9445-5
- [9] Diomidis Spinellis, Zoe Kotti, and Audris Mockus. 2020. A Dataset for GitHub Repository Deduplication. In Proceedings of the 17th International Conference on Mining Software Repositories (MSR). 523–527. doi:10.1145/3379597.3387496
Table 6: Released artifacts (WoC V2604). All files are ;-separated,...