
arxiv: 2606.29550 · v1 · pith:RGETLFTZnew · submitted 2026-06-28 · 💻 cs.SE

Deforking the World of Code: A Project-Provenance Map that Recovers Cross-Forge Fork Families that Platform Graphs Cannot See

Pith reviewed 2026-06-30 02:01 UTC · model grok-4.3

classification 💻 cs.SE
keywords deforking · shared commits · git repositories · project provenance · fork families · cross-forge · world of code

The pith

Shared-commit clustering yields a deforking map that recovers cross-forge fork families platform graphs miss.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper constructs and releases a map that groups raw git repositories into deforked projects by treating shared commits as evidence of fork relations. This corrects the inflation that occurs when popularity or spread metrics count each fork separately, since forks inherit the full commit history of their origins. The construction starts from 51.79 million shared-commit groups, encodes them as hub-node stars, and applies parallel Louvain clustering, with optional size caps at 250 or 500 repositories to prevent giant over-merged clusters. Validation against GitHub's declared fork graph shows 99.01 percent edge agreement for repositories present in both collections, while the remaining differences consist of families that span multiple forges or originate outside GitHub. The authors also supply an updated fork-exclusion list of 134.1 million children and an inventory of 455,550 detached forks.
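
The hub-node star encoding described above can be illustrated with a minimal sketch. It assumes toy data and hypothetical repository names, and it swaps in plain connected components (union-find) for the paper's parallel Louvain step, so it shows only the naive shared-history union, not the released clustering.

```python
from collections import defaultdict

def star_edges(groups):
    """Encode each shared-commit group as a star: one hub node linked to every member repo."""
    edges = []
    for gid, repos in groups.items():
        hub = ("hub", gid)  # synthetic hub node, one per group
        for repo in repos:
            edges.append((hub, ("repo", repo)))
    return edges

def components(edges):
    """Connected components via union-find (a stand-in for Louvain, not the paper's method)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in edges:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb

    clusters = defaultdict(set)
    for node in list(parent):
        if node[0] == "repo":  # report repositories only, not hub nodes
            clusters[find(node)].add(node[1])
    return sorted(sorted(c) for c in clusters.values())

# Hypothetical shared-commit groups: group id -> repositories containing those commits.
groups = {
    "g1": ["gh/alice/app", "gh/bob/app"],
    "g2": ["gh/bob/app", "gl/carol/app"],
    "g3": ["gh/dave/tool", "gh/erin/tool"],
}
projects = components(star_edges(groups))
print(projects)
```

Groups g1 and g2 overlap in one repository, so their members fall into a single project even though no one group contains all three.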

Core claim

We release a curated deforking map for the World of Code (WoC) version V2604: p2PFull, which collapses every raw repository p into the deforked project P to which it belongs, built from the global shared-commit relation (51.79M shared-commit groups) via a hub-node star encoding and parallel Louvain clustering, plus capped variants (cap250/cap500) that bound mega-cluster size. The naive shared-history union over-merges: the project graph welds unrelated software into giant clusters (largest uncapped cluster 861,948 repositories, bridged by shared-commit groups as large as 267,200), for the same structural reason author-identity graphs do. A cheap size cap removes the boilerplate-hub bridges.

What carries the argument

The p2PFull deforking map, which collapses repositories into projects via shared-commit groups encoded as hub-node stars and clustered with parallel Louvain, plus size-capped variants.

If this is right

  • Popularity and spread metrics computed on raw repositories are inflated by orders of magnitude and must be recomputed on the deforked project level.
  • The map recovers fork families that cross forges (5.41 percent) and families whose root lies outside GitHub (1.51 percent).
  • An exclusion list of 134.1 million children and a detached-fork inventory of 455,550 hard-detached edges become available for reuse.
  • Project mega-clusters arise for the same structural reason as author-identity mega-clusters, but here a size cap is enough to remove the boilerplate-hub bridges.
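
To make the first consequence concrete, here is a small sketch of recomputing a spread metric at the project level. The commit ids, repository names, and the `repo_to_project` dictionary are hypothetical stand-ins for a map such as p2PFull; the released file format is not assumed.

```python
def spread(commit_to_repos, repo_to_project):
    """Return (raw repository count, deforked project count) for each commit."""
    out = {}
    for commit, repos in commit_to_repos.items():
        # Assumption: a repository missing from the map counts as its own project.
        projects = {repo_to_project.get(r, r) for r in repos}
        out[commit] = (len(set(repos)), len(projects))
    return out

# Hypothetical inputs.
commit_to_repos = {
    "c1": ["alice/app", "bob/app", "carol/app", "dave/tool"],
    "c2": ["dave/tool"],
}
repo_to_project = {
    "alice/app": "P1",
    "bob/app": "P1",
    "carol/app": "P1",
    "dave/tool": "P2",
}
result = spread(commit_to_repos, repo_to_project)
print(result)
```

In this toy case commit c1 appears in four raw repositories but only two projects, which is the kind of inflation the map is meant to correct.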

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same clustering pipeline could be rerun on future WoC snapshots to track how fork families evolve over time.
  • Integration with author-identity resolution would allow joint analysis of both project and contributor provenance.
  • The structural-bridge diagnostic left unused here could be applied in later work to test whether residual clusters split further.

Load-bearing premise

That the clusters remaining after size capping represent genuine vendored history rather than unrelated projects merged by boilerplate.

What would settle it

A sample audit of residual clusters after the cap250 or cap500 cut to determine whether they contain unrelated projects or only vendored copies.
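
One way such an audit sample could be drawn is sketched below. The size buckets, the per-bucket quota, and the cluster ids are illustrative assumptions, not a protocol from the paper; the fixed seed only makes the draw repeatable.

```python
import random

def size_bucket(n):
    """Assign a cluster size to a coarse bucket (bucket edges are an arbitrary choice)."""
    if n < 10:
        return "2-9"
    if n < 100:
        return "10-99"
    return "100+"

def audit_sample(clusters, per_bucket=2, seed=0):
    """Draw a fixed-seed sample of multi-repository clusters, stratified by size."""
    buckets = {"2-9": [], "10-99": [], "100+": []}
    for cid in sorted(clusters):
        n = len(clusters[cid])
        if n >= 2:  # singletons cannot be over-merged
            buckets[size_bucket(n)].append(cid)
    rng = random.Random(seed)
    return {
        name: sorted(rng.sample(ids, min(per_bucket, len(ids))))
        for name, ids in buckets.items()
    }

# Hypothetical clusters: project id -> member repositories.
clusters = {
    "P1": ["r%d" % i for i in range(3)],
    "P2": ["r%d" % i for i in range(5)],
    "P3": ["r%d" % i for i in range(12)],
    "P4": ["r0"],
    "P5": ["r%d" % i for i in range(150)],
    "P6": ["r%d" % i for i in range(4)],
}
sample = audit_sample(clusters)
print(sample)
```

Each sampled cluster would then be inspected by hand to judge whether its members are vendored copies or unrelated projects.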

Original abstract

Forks share git history, so a commit surfaces in many repositories and any spread- or popularity-based measure over raw repositories is inflated by orders of magnitude. We release a curated deforking map for the World of Code (WoC) version V2604: p2PFull, which collapses every raw repository p into the deforked project P to which it belongs, built from the global shared-commit relation (51.79M shared-commit groups) via a hub-node star encoding and parallel Louvain clustering, plus capped variants (cap250/cap500) that bound mega-cluster size. The naive shared-history union over-merges: the project graph welds unrelated software into giant clusters (largest uncapped cluster 861,948 repositories, bridged by shared-commit groups as large as 267,200), for the same structural reason author-identity graphs do. A cheap size cap removes the boilerplate-hub bridges; a structural-bridge diagnostic, the cut that dissolved the analogous author mega-cluster, run here but deliberately not applied, shows the post-cap residual is genuine vendored history, robust to the cut, so we leave it intact. We validate the map against GitHub's declared fork graph reconstructed from GHArchive ForkEvents, finding 99.01% edge agreement conditional on both repositories being in WoC. Disagreements fall into two classes: a completeness byproduct (edges GitHub asserts but WoC has not ingested) and the central contribution, WoC-only fork families that GitHub's platform graph cannot represent, including 5.41% multi-forge families and 1.51% whose fork root is not on GitHub. We additionally release a refreshed fork-exclusion list (134.1M children, 3.4x the GHTorrent-era 39.5M) and a detached-fork inventory (455,550 hard-detached edges; 240,441 genuine independent origins). All artifacts are a self-contained, independently hosted replication package keyed to the WoC V2604 collection.
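
The conditional edge-agreement figure in the abstract can be read as the following computation, sketched here on toy data. The function name, the edge tuples, and the rule for handling repositories missing from the map are assumptions made for illustration.

```python
def edge_agreement(declared_edges, woc_repos, repo_to_project):
    """Share of declared fork edges, with both endpoints in WoC, that fall in one project."""
    eligible = [(c, p) for c, p in declared_edges if c in woc_repos and p in woc_repos]
    agree = sum(
        repo_to_project.get(c, c) == repo_to_project.get(p, p) for c, p in eligible
    )
    skipped = len(declared_edges) - len(eligible)  # edges with an endpoint outside WoC
    rate = agree / len(eligible) if eligible else float("nan")
    return rate, skipped

# Hypothetical declared fork edges as (child, parent) pairs.
declared_edges = [
    ("bob/app", "alice/app"),
    ("carol/app", "alice/app"),
    ("dave/app", "alice/app"),   # dave/app has not been ingested
    ("erin/app", "alice/app"),   # ingested, but mapped to a different project
]
woc_repos = {"alice/app", "bob/app", "carol/app", "erin/app"}
repo_to_project = {
    "alice/app": "P1",
    "bob/app": "P1",
    "carol/app": "P1",
    "erin/app": "P9",
}
rate, skipped = edge_agreement(declared_edges, woc_repos, repo_to_project)
print(round(rate, 4), skipped)
```

Edges with a missing endpoint are excluded from the denominator rather than counted as disagreements, which matches the "conditional on both repositories being in WoC" wording.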

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims to release a curated deforking map (p2PFull and capped variants) for WoC V2604 that collapses raw repositories into projects via the global shared-commit relation (51.79M groups), hub-node star encoding, and parallel Louvain clustering; it reports 99.01% edge agreement with GitHub's declared fork graph (conditional on both repos in WoC), identifies WoC-only cross-forge families, and releases an updated fork-exclusion list plus detached-fork inventory, all as a self-contained replication package.

Significance. If the map is sound, the release would improve accuracy of popularity, spread, and other metrics in software engineering research by correcting fork-induced inflation; the provision of independently hosted, keyed replication artifacts and the scale of the fork-exclusion list (134.1M children) constitute a concrete contribution to the field.

major comments (2)
  1. Abstract: the 99.01% agreement validates only GitHub-declared edges already present in WoC and provides no independent test of the central claim (5.41% multi-forge families and 1.51% non-GitHub fork roots); without external grounding for these WoC-only families the support for the map's added value remains incomplete.
  2. Abstract: the claim that post-cap residual clusters are 'genuine vendored history' and 'robust to the cut' rests on a structural-bridge diagnostic that was run but deliberately not applied; no quantitative outcome of that diagnostic (e.g., number or size of clusters that would be split) is reported, leaving the load-bearing assumption about residual over-merging untested.
minor comments (1)
  1. Abstract: the largest uncapped cluster (861,948 repositories) is bridged by shared-commit groups 'as large as 267,200'; clarify whether the bridge size refers to the number of repositories or commits and how these figures relate to the hub-node encoding.
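
A quick calculation shows why the encoding matters at this scale. It assumes, as one possible reading of the figure the referee asks about, that 267,200 counts the repositories in a single shared-commit group; under a different reading the numbers would change.

```python
def clique_edges(n):
    """Edges needed to connect n repositories pairwise."""
    return n * (n - 1) // 2

def star_edge_count(n):
    """Edges needed to connect n repositories to one hub node."""
    return n

n = 267_200  # assumed to be a repository count (see lead-in)
print(clique_edges(n), star_edge_count(n))
```

Under that assumption a pairwise encoding of the one group would need roughly 3.6e10 edges, while the star needs one edge per member.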

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful review and constructive comments on validation and reporting. We respond to each major comment below.

Point-by-point responses
  1. Referee: [—] Abstract: the 99.01% agreement validates only GitHub-declared edges already present in WoC and provides no independent test of the central claim (5.41% multi-forge families and 1.51% non-GitHub fork roots); without external grounding for these WoC-only families the support for the map's added value remains incomplete.

    Authors: The 99.01% agreement is computed exclusively on GitHub-declared edges present in WoC, as stated. This provides strong validation of the method wherever external ground truth exists. The central claims (5.41% multi-forge families and 1.51% non-GitHub roots) concern families invisible to GitHub's platform graph by definition; no GitHub-derived test can validate them. We report a full breakdown of disagreements, separating completeness artifacts from these WoC-only families. While an independent external ground truth for the WoC-only cases would be desirable, constructing it requires a separate multi-platform study far beyond the scope of this data-release paper. The high fidelity on all observable cases supplies the necessary empirical support for the procedure's application to the unobservable cases. revision: no

  2. Referee: [—] Abstract: the claim that post-cap residual clusters are 'genuine vendored history' and 'robust to the cut' rests on a structural-bridge diagnostic that was run but deliberately not applied; no quantitative outcome of that diagnostic (e.g., number or size of clusters that would be split) is reported, leaving the load-bearing assumption about residual over-merging untested.

    Authors: The referee correctly notes that quantitative results from the structural-bridge diagnostic are not reported. The diagnostic (identical to the cut that resolved author mega-clusters) was executed on the capped clusters; it would have split only a modest number of clusters whose resulting components aligned with known vendoring patterns rather than erroneous merges. We will add the specific quantitative outcomes (number of clusters affected, pre- and post-cut sizes) to the revised manuscript so that the decision to retain the residuals is fully documented and testable. revision: yes
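
For the WoC-only families discussed in the first response, the two reported classes can be expressed as simple predicates. The `(forge, name)` representation and the way a root is supplied are hypothetical; how the paper identifies a family's root is not described in the abstract.

```python
def classify_family(repos, root):
    """Flag a fork family as multi-forge and/or rooted outside GitHub.

    repos: list of (forge, name) pairs in one family.
    root:  the (forge, name) pair taken as the fork root (supplied by the caller).
    """
    forges = {forge for forge, _ in repos}
    return {
        "multi_forge": len(forges) > 1,
        "root_off_github": root[0] != "github.com",
    }

# Hypothetical family spanning two forges, rooted on a non-GitHub host.
family = [
    ("gitlab.com", "carol/app"),
    ("github.com", "alice/app"),
    ("github.com", "bob/app"),
]
flags = classify_family(family, root=("gitlab.com", "carol/app"))
print(flags)
```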

Circularity Check

1 step flagged

Deforking map validity for residual clusters justified via self-cited structural-bridge diagnostic

specific steps
  1. self-citation, load-bearing [Abstract]
    "a structural-bridge diagnostic, the cut that dissolved the analogous author mega-cluster, run here but deliberately not applied, shows the post-cap residual is genuine vendored history, robust to the cut, so we leave it intact."

    The assertion that residual clusters after size capping are genuine vendored history (the key assumption enabling the p2PFull map to be treated as accurate for new cross-forge families) is justified solely by the structural-bridge diagnostic from the author's prior work on author mega-clusters. No independent evidence is supplied for this application, making the central claim about the map's correctness dependent on that self-referential justification.

full rationale

The paper constructs the deforking map from shared-commit data via standard clustering and size-capping, with external GitHub validation at 99.01% agreement on overlapping edges. The load-bearing assumption that post-cap residuals represent genuine vendored history (rather than over-merged projects) is justified only by reference to the structural-bridge diagnostic from the author's prior author-identity work, which was run but not applied. This creates moderate self-citation dependence for the central claim about cross-forge families, but the derivation itself is data-driven and partially externally benchmarked, so the result does not reduce to its inputs by construction.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

Based on abstract only; method relies on domain assumption that shared commits define projects and introduces size-cap parameters to control over-merging.

free parameters (1)
  • cluster size cap = 250/500
    Bounds mega-cluster size to remove boilerplate-hub bridges
axioms (1)
  • domain assumption: Repositories sharing commits belong to the same project
    Core assumption used to build shared-commit groups and the project map
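
The size-cap parameter can be sketched under one plausible reading: shared-commit groups with more than `cap` member repositories are treated as boilerplate hubs and dropped before clustering. This is an assumption; the abstract does not say whether the cap is applied to groups or to clusters, and the names below are illustrative.

```python
def apply_cap(groups, cap):
    """Split groups into those kept (at most `cap` members) and those dropped as likely hubs."""
    kept = {gid: repos for gid, repos in groups.items() if len(set(repos)) <= cap}
    dropped = sorted(gid for gid in groups if gid not in kept)
    return kept, dropped

# Hypothetical groups; a toy cap of 3 stands in for the paper's 250 or 500.
groups = {
    "g1": ["a/app", "b/app"],
    "g2": ["c/tool", "d/tool", "e/tool"],
    "license-boilerplate": ["a/app", "c/tool", "x/one", "y/two", "z/three"],
}
kept, dropped = apply_cap(groups, cap=3)
print(sorted(kept), dropped)
```

With the oversized group removed, the two small groups no longer share a bridge and would cluster as separate projects.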

pith-pipeline@v0.9.1-grok · 5912 in / 1252 out tokens · 55825 ms · 2026-06-30T02:01:40.302420+00:00 · methodology


Reference graph

Works this paper leans on

10 extracted references · 8 canonical work pages

  1. [1]

    Georgios Gousios. 2013. The GHTorrent Dataset and Tool Suite. In Proceedings of the 10th Working Conference on Mining Software Repositories (MSR). 233–236. doi:10.1109/MSR.2013.6624034

  2. [2]

    Georgios Gousios and Diomidis Spinellis. 2012. GHTorrent: GitHub’s Data from a Firehose. In Proceedings of the 9th IEEE Working Conference on Mining Software Repositories (MSR). 12–21. doi:10.1109/MSR.2012.6224294

  3. [3]

    Eirini Kalliamvakou, Georgios Gousios, Kelly Blincoe, Leif Singer, Daniel M. German, and Daniela Damian. 2014. The Promises and Perils of Mining GitHub. In Proceedings of the 11th Working Conference on Mining Software Repositories (MSR). 92–101. doi:10.1145/2597073.2597074

  4. [4]

    Yuxing Ma, Chris Bogart, Sadika Amreen, Russell Zaretzki, and Audris Mockus. 2019. World of Code: An Infrastructure for Mining the Universe of Open Source VCS Data. In Proceedings of the 16th International Conference on Mining Software Repositories (MSR). 143–154. doi:10.1109/MSR.2019.00031

  6. [6]

    Yuxing Ma, Tapajit Dey, Chris Bogart, Sadika Amreen, Marat Valiev, Adam Tutko, David Kennard, Russell Zaretzki, and Audris Mockus. 2021. World of Code: Enabling a Research Workflow for Mining and Analyzing the Universe of Open Source VCS Data. Empirical Software Engineering 26, 2 (2021), 22. doi:10.1007/s10664-020-09905-9

  7. [7]

    Audris Mockus. 2026. Scaling Author Identity Disambiguation to the World of Code: A Methodology. Companion methodology log; full experimental record (Exps. 1–18)

  8. [8]

    Audris Mockus, Diomidis Spinellis, Zoe Kotti, and Gabriel John Dusing. 2020. A Complete Set of Related Git Repositories Identified via Community Detection Approaches Based on Shared Commits. In Proceedings of the 17th International Conference on Mining Software Repositories (MSR). 513–517. doi:10.1145/3379597.3387499

  9. [9]

    Diomidis Spinellis. 2017. A Repository of Unix History and Evolution. Empirical Software Engineering 22, 3 (2017), 1372–1404. doi:10.1007/s10664-016-9445-5

  10. [10]

    Diomidis Spinellis, Zoe Kotti, and Audris Mockus. 2020. A Dataset for GitHub Repository Deduplication. In Proceedings of the 17th International Conference on Mining Software Repositories (MSR). 523–527. doi:10.1145/3379597.3387496

Conference’17, July 2017, Washington, DC, USA · Audris Mockus

Table 6: Released artifacts (WoC V2604). All files are ;-separated,...