The Impact of Generative AI on Collaborative Open-Source Software Development: Evidence from GitHub Copilot

Ashish Agarwal; Fangchen Song; Wen Wen

arXiv: 2410.02091 · v3 · pith:FQGW2U65new · submitted 2024-10-02 · 💻 cs.SE · cs.AI · cs.HC · econ.GN · q-fin.EC

Pith reviewed 2026-05-23 19:43 UTC · model grok-4.3

keywords: generative AI, GitHub Copilot, open-source software, developer productivity, coordination time, code contributions, OSS collaboration

The pith

GitHub Copilot increases open-source code contributions by 5.9% while raising coordination time by 8%.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines how GitHub Copilot, a generative AI pair programmer, affects collaborative open-source software development. It reports that Copilot use raises project-level code contributions by 5.9 percent through a 3.4 percent rise in developer participation and a 2.1 percent gain in individual productivity. At the same time, it increases coordination time by 8 percent due to more code discussions. The net result remains positive for timely code merges. Impacts differ between core and peripheral developers, with the latter seeing smaller contribution gains and larger coordination costs.

Core claim

Using GitHub's proprietary Copilot usage data combined with public OSS project data, the study shows that Copilot use increases project-level code contributions by 5.9%. This gain is driven by a 3.4% rise in developer coding participation and a 2.1% increase in individual productivity. However, Copilot use also leads to an increase in coordination time by 8% due to more code discussions. This reveals an important tradeoff: While AI expands who can contribute and how much they contribute, it slows coordination in collective development efforts. Despite this tension, the combined effect of these two competing forces remains positive, indicating a net gain in overall project-level timely merge.
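As a rough consistency check on the decomposition above, the following minimal sketch compounds the two component gains. It assumes, purely for illustration, that project-level contributions equal the number of participating developers times output per developer, so that percentage gains multiply; this is not stated as the paper's model, and the function name `compound_gain` is hypothetical.

```python
def compound_gain(*pct_gains: float) -> float:
    """Compound several percentage gains multiplicatively.

    Each argument is a gain in percent (3.4 means +3.4%). The return
    value is the combined gain in percent.
    """
    factor = 1.0
    for gain in pct_gains:
        factor *= 1.0 + gain / 100.0
    return (factor - 1.0) * 100.0


# Figures quoted in the summary above.
participation_gain = 3.4   # rise in developer coding participation, percent
productivity_gain = 2.1    # rise in individual productivity, percent
reported_total = 5.9       # reported project-level contribution gain, percent

implied_total = compound_gain(participation_gain, productivity_gain)
gap = reported_total - implied_total
print(f"implied: {implied_total:.2f}%  reported: {reported_total:.1f}%  gap: {gap:.2f} pp")
```

Under this assumption the two components imply a gain of about 5.57%, slightly below the reported 5.9%. A gap of that size is unsurprising if the three figures come from separately estimated models, so it should not be read as an inconsistency in the paper.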

What carries the argument

GitHub Copilot usage data linked to OSS project metrics, isolating effects on contributions, participation, productivity, and coordination time.

Load-bearing premise

Differences in Copilot usage across projects and developers can be used to causally identify effects on contributions and coordination without significant confounding from project characteristics or developer self-selection.

What would settle it

A randomized controlled trial that assigns Copilot access to some developers or projects and measures resulting changes in code contributions and discussion volume would falsify the causal claims if no effects are observed.
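The randomized design described above can be illustrated with a small simulation. The sketch below uses entirely synthetic data: the sample size, the outcome distribution, and the built-in effect of 0.06 on a logged outcome are assumptions chosen for illustration, not values from the paper. The helper names `simulate_trial` and `permutation_test` are hypothetical. Access is assigned at random, and a permutation test asks whether the observed difference in means could plausibly arise with no effect at all.

```python
import random
import statistics


def permutation_test(treated, control, n_perm=2000, seed=0):
    """Two-sided permutation p-value for a difference in means."""
    rng = random.Random(seed)
    observed = statistics.fmean(treated) - statistics.fmean(control)
    pooled = list(treated) + list(control)
    n_treated = len(treated)
    extreme = 0
    for _ in range(n_perm):
        # Re-label units at random, as if treatment had no effect.
        rng.shuffle(pooled)
        diff = (statistics.fmean(pooled[:n_treated])
                - statistics.fmean(pooled[n_treated:]))
        if abs(diff) >= abs(observed) - 1e-12:
            extreme += 1
    # Add-one correction keeps the p-value strictly positive.
    return observed, (extreme + 1) / (n_perm + 1)


def simulate_trial(n_units=600, effect=0.06, seed=42):
    """Randomly assign half of the synthetic units to treatment."""
    rng = random.Random(seed)
    # Assumed baseline: logged contributions per unit, synthetic.
    baseline = [rng.gauss(3.0, 0.5) for _ in range(n_units)]
    ids = list(range(n_units))
    rng.shuffle(ids)
    treated_ids = set(ids[: n_units // 2])
    treated = [baseline[i] + effect for i in treated_ids]
    control = [baseline[i] for i in range(n_units) if i not in treated_ids]
    return treated, control


treated, control = simulate_trial()
diff, p_value = permutation_test(treated, control)
print(f"difference in means: {diff:.3f}, permutation p-value: {p_value:.3f}")
```

Because assignment is random, project characteristics and developer self-selection cannot drive the difference in means, which is what makes such a trial decisive in a way that observational comparisons are not.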

Figures

Figures reproduced from arXiv: 2410.02091 by Ashish Agarwal, Fangchen Song, Wen Wen.

Original abstract

Generative artificial intelligence (AI) facilitates content production and enhances ideation capabilities, which can significantly influence developer productivity and participation in software development. To explore its impact on collaborative open-source software (OSS) development, we investigate the role of GitHub Copilot, a generative AI pair programmer, in OSS development where multiple distributed developers voluntarily collaborate. Using GitHub's proprietary Copilot usage data, combined with public OSS project data obtained from GitHub, we find that Copilot use increases project-level code contributions by 5.9%. This gain is driven by a 3.4% rise in developer coding participation and a 2.1% increase in individual productivity. However, Copilot use also leads to an increase in coordination time by 8% due to more code discussions. This reveals an important tradeoff: While AI expands who can contribute and how much they contribute, it slows coordination in collective development efforts. Despite this tension, the combined effect of these two competing forces remains positive, indicating a net gain in overall project-level timely merge of code contributions from using AI pair programmers. Interestingly, we also find the effects differ across developer roles. Peripheral developers show relatively smaller increases in project-level code contributions and experience larger increases in coordination time than core developers. In summary, our study underscores the dual role of AI pair programmers in affecting project-level code contributions and coordination time in OSS development. Our findings on the differential effects between core and peripheral developers also provide important implications for the structure of OSS communities in the long run.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This paper uses real Copilot usage data to quantify a 5.9% increase in code contributions and an 8% rise in coordination time in OSS projects, but the causal claims hinge on untested assumptions about adoption being exogenous.

The paper's core contribution is bringing proprietary GitHub Copilot usage logs together with public project data to estimate effects on contributions, participation, productivity, and coordination time. The 5.9% project-level contribution bump, split between participation and individual output, plus the 8% coordination increase and the net positive on timely merges, are specific numbers that earlier Copilot studies did not report for collaborative OSS settings. The split between core and peripheral developers is also new ground. That combination of data access and dual focus on gains and costs is the part worth paying attention to.

Referee Report

3 major / 2 minor

Summary. The paper uses proprietary GitHub Copilot usage data merged with public OSS project data to estimate the effects of Copilot on collaborative development. It claims causal increases of 5.9% in project-level code contributions (via 3.4% higher developer participation and 2.1% higher individual productivity), an 8% rise in coordination time from more discussions, a net positive effect on timely merges, and heterogeneous effects with smaller contribution gains and larger coordination costs for peripheral versus core developers.

Significance. If the causal identification is credible, the results would provide rare large-scale evidence on the productivity-coordination tradeoff induced by generative AI tools in voluntary collaborative settings, with implications for OSS community structure. The linkage of proprietary usage logs to public GitHub metrics is a clear strength that enables project-level analysis not feasible with public data alone.

major comments (3)

[Abstract] Abstract and Methods (inferred from absence of detail): the 5.9%, 3.4%, 2.1%, and 8% effects are presented as causal impacts of Copilot, yet no identification strategy, fixed effects, matching procedure, or instrument is described to address endogeneity from project characteristics, developer self-selection, or reverse causality. This assumption is load-bearing for all headline claims.
[Results] Results section (inferred): the claim that the combined effect of higher contributions and higher coordination time remains net positive for timely merges is stated without showing the explicit aggregation or weighting used to reach that conclusion, leaving the net-gain interpretation unsupported by reported quantities.
[Heterogeneity] Heterogeneity analysis: the differential effects for peripheral versus core developers are reported, but without robustness checks that interact the treatment with time-varying project activity or developer tenure, the role-based differences cannot be distinguished from selection.

minor comments (2)

[Abstract] Abstract: the time window and sample construction (number of projects, developers, and Copilot adoption dates) are not stated, making it difficult to assess external validity.
[Abstract] Notation: the distinction between 'project-level code contributions' and 'individual productivity' is used without an explicit equation or variable definition in the summary, which could be clarified with a short measurement appendix.

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive comments. We address each major comment below and commit to revisions that improve clarity and robustness.

Referee: [Abstract] Abstract and Methods (inferred from absence of detail): the 5.9%, 3.4%, 2.1%, and 8% effects are presented as causal impacts of Copilot, yet no identification strategy, fixed effects, matching procedure, or instrument is described to address endogeneity from project characteristics, developer self-selection, or reverse causality. This assumption is load-bearing for all headline claims.

Authors: The full manuscript's Methods section specifies a staggered difference-in-differences design that exploits variation in Copilot rollout timing across projects, augmented by project fixed effects, time fixed effects, and controls for developer characteristics to address endogeneity, self-selection, and reverse causality. We will revise the abstract to concisely reference this identification strategy and expand the methods description for greater prominence. revision: yes
Referee: [Results] Results section (inferred): the claim that the combined effect of higher contributions and higher coordination time remains net positive for timely merges is stated without showing the explicit aggregation or weighting used to reach that conclusion, leaving the net-gain interpretation unsupported by reported quantities.

Authors: We agree the aggregation procedure requires explicit documentation. We will revise the Results section to include the precise weighting or combination method used to derive the net positive effect on timely merges from the contribution and coordination estimates. revision: yes
Referee: [Heterogeneity] Heterogeneity analysis: the differential effects for peripheral versus core developers are reported, but without robustness checks that interact the treatment with time-varying project activity or developer tenure, the role-based differences cannot be distinguished from selection.

Authors: We will add robustness checks that interact the treatment indicator with time-varying project activity metrics and developer tenure to better isolate role-based heterogeneity from selection effects. revision: yes
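
The staggered difference-in-differences design with project and time fixed effects, named in the first response above, can be sketched on synthetic data. This is a minimal illustration under stated assumptions, not the authors' estimator: the panel is balanced, the true effect is constant and set to log(1.059) only to echo the headline 5.9% figure (assuming a logged outcome), adoption periods are drawn at random, and the developer-level controls mentioned in the response are omitted. The names `twfe_did` and `simulate_panel` are hypothetical.

```python
import math
import random


def twfe_did(y, d):
    """Two-way fixed effects estimate on a balanced panel.

    y[i][t] is the outcome and d[i][t] the 0/1 treatment indicator for
    unit i in period t. Subtracting unit means and period means (and
    adding back the grand mean) removes both sets of fixed effects; the
    coefficient is then a one-regressor least-squares slope.
    """
    n, t_len = len(y), len(y[0])

    def demean(x):
        row = [sum(r) / t_len for r in x]
        col = [sum(x[i][t] for i in range(n)) / n for t in range(t_len)]
        grand = sum(row) / n
        return [[x[i][t] - row[i] - col[t] + grand for t in range(t_len)]
                for i in range(n)]

    y_dm, d_dm = demean(y), demean(d)
    num = sum(d_dm[i][t] * y_dm[i][t] for i in range(n) for t in range(t_len))
    den = sum(d_dm[i][t] ** 2 for i in range(n) for t in range(t_len))
    return num / den


def simulate_panel(n_units=200, n_periods=12, effect=math.log(1.059),
                   noise_sd=0.05, seed=1):
    """Synthetic projects with staggered adoption (assumed, not real data)."""
    rng = random.Random(seed)
    y, d = [], []
    for _ in range(n_units):
        unit_fe = rng.gauss(3.0, 1.0)             # project fixed effect
        adopt = rng.choice([4, 6, 8, None])       # None means never adopts
        d_row = [1.0 if adopt is not None and t >= adopt else 0.0
                 for t in range(n_periods)]
        # 0.02 * t is a common time trend shared by all projects.
        y_row = [unit_fe + 0.02 * t + effect * d_row[t]
                 + rng.gauss(0.0, noise_sd)
                 for t in range(n_periods)]
        y.append(y_row)
        d.append(d_row)
    return y, d


y, d = simulate_panel()
beta = twfe_did(y, d)
print(f"estimated coefficient: {beta:.4f} (true {math.log(1.059):.4f})")
print(f"implied percent change: {100 * (math.exp(beta) - 1):.2f}%")
```

The fixed effects absorb anything constant within a project and anything common to all projects in a period, so the estimate relies only on within-project changes around adoption. With staggered adoption and effects that vary across adoption cohorts or over time, this simple estimator can be biased, so the sketch illustrates the fixed-effects logic rather than a recommended specification.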

Circularity Check

0 steps flagged

No circularity; empirical estimates from data analysis with no self-referential derivations

The paper reports econometric estimates of Copilot's effects on OSS contributions and coordination using proprietary usage data merged with public GitHub project data. No mathematical derivations, first-principles predictions, or fitted parameters are presented as outputs that reduce to the inputs by construction. The identification strategy relies on variation in Copilot adoption, but this is an external assumption about exogeneity rather than a self-definitional or self-citation chain that forces the reported percentages. The analysis is self-contained as standard empirical work without the enumerated circular patterns.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 0 invented entities

The central claims rest on econometric assumptions about causality and data quality that are standard but unverified in the abstract; free parameters are the fitted effect sizes from regressions.

free parameters (1)

Copilot effect coefficients
Percentage changes (5.9%, 3.4%, 2.1%, 8%) are estimated via statistical models fitted to project and developer data.

axioms (2)

domain assumption: Copilot usage data accurately reflects actual tool adoption without measurement error
Required for interpreting usage as the treatment variable in the analysis.
domain assumption: Project and developer fixed effects or controls sufficiently address confounding factors
Typical assumption in panel data studies for identifying causal effects.

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

AI Policy, Disclosure, and Human in the Loop: How Are Contribution Guidelines Adapting to GenAI?
cs.SE · 2026-05 · unverdicted · novelty 6.0

An empirical analysis of 1,000 GitHub repositories finds 118 AI policies where 78% allow GenAI contributions, 51% require disclosure, and 74% mandate human oversight.

Bridging Generation and Training: A Systematic Review of Quality Issues in LLMs for Code
cs.SE · 2026-05 · accept · novelty 6.0

A review of 114 studies creates taxonomies for code and data quality issues, formalizes 18 propagation mechanisms from training data defects to LLM-generated code defects, and synthesizes detection and mitigation techniques.

Engineering Students' Usage and Perceptions of GitHub Copilot in Open-Source Projects
cs.SE · 2026-04 · unverdicted · novelty 5.0

Students primarily used Copilot chat and code generation features during open-source contributions, with usage patterns varying significantly by gender, programming skill, and AI experience.
