arxiv: 2509.20353 · v2 · submitted 2025-09-24 · 💻 cs.SE

Developer Productivity With and Without GitHub Copilot: A Longitudinal Mixed-Methods Case Study

Pith reviewed 2026-05-18 14:10 UTC · model grok-4.3

classification 💻 cs.SE
keywords GitHub Copilot · developer productivity · commit activity · mixed-methods study · longitudinal analysis · AI coding assistants · perceived productivity · agile development

The pith

GitHub Copilot adoption produced no statistically significant rise in commit-based activity metrics among developers who used the tool.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The study tracks real developer activity through commits in a large public-sector organization over two years. It compares 25 users who adopted GitHub Copilot against 14 who did not, looking at activity both before and after the tool became available. Copilot users turned out to be more active than non-users even before they started using the tool. After adoption, their commit rates did not increase in a way that reached statistical significance, though small upward shifts appeared. At the same time, the same developers described feeling more productive in surveys and interviews, pointing to a mismatch between what the numbers show and what people experience.

Core claim

In this longitudinal mixed-methods case study, individuals who adopted GitHub Copilot showed no statistically significant changes in commit-based activity metrics after the tool's introduction. Minor increases were observed, yet these did not alter the pre-existing pattern in which Copilot users had been consistently more active than non-users. The analysis of 26,317 commits across 703 repositories, combined with survey responses and 13 interviews, therefore reveals a discrepancy between changes in objective activity measures and the subjective sense of productivity gains.
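
To make "no statistically significant change" concrete, the following is a minimal Python sketch of one non-parametric paired test: a sign-flip permutation test on each developer's post-minus-pre difference. The summary above does not name the test used in the paper, so the choice of test, the function name `paired_sign_flip_pvalue`, and the sample numbers are illustrative assumptions only.

```python
import random

def paired_sign_flip_pvalue(pre, post, n_perm=5000, seed=0):
    """Two-sided Monte Carlo p-value for the paired change (assumed test).

    Under the null hypothesis of no change, each developer's difference
    (post - pre) is equally likely to carry either sign, so signs are
    flipped at random and we count how often the flipped total is at
    least as extreme as the observed one. Comparing totals is equivalent
    to comparing means because the number of developers is fixed.
    """
    if len(pre) != len(post):
        raise ValueError("pre and post must have the same length")
    diffs = [b - a for a, b in zip(pre, post)]
    observed = abs(sum(diffs))
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_perm):
        flipped = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(flipped) >= observed - 1e-12:  # tolerance for float rounding
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_perm + 1)

# Hypothetical weekly commit rates for five developers (not data from the paper).
pre = [3.0, 5.5, 2.0, 4.0, 6.5]
post = [3.5, 5.0, 2.5, 4.5, 6.0]
p_value = paired_sign_flip_pvalue(pre, post)
```

A large p-value here would only mean that the observed change is compatible with random sign flips. With 25 users, such a test has limited power, so a non-significant result should not be read as evidence of zero effect.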

What carries the argument

Pre- and post-adoption comparison of commit-based activity metrics (such as commit counts and related indicators) across Git repositories, set against self-reported productivity from surveys and interviews.
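
As an illustration of the kind of metric this comparison rests on, the sketch below computes one developer's average weekly commit rate in equal windows before and after an adoption date. The helper names `weekly_rate` and `pre_post_rates`, the default window length of 182 days, and the sample dates are assumptions made for the example; the paper's exact metric definitions are not given in this summary.

```python
from datetime import date, timedelta

def weekly_rate(commit_dates, start, end):
    """Commits per week inside the half-open window [start, end)."""
    weeks = (end - start).days / 7
    if weeks <= 0:
        raise ValueError("window must have positive length")
    count = sum(1 for d in commit_dates if start <= d < end)
    return count / weeks

def pre_post_rates(commit_dates, adoption, window_days=182):
    """Return (pre_rate, post_rate) around one developer's adoption date.

    window_days is an assumed parameter, not a value taken from the paper.
    """
    window = timedelta(days=window_days)
    pre = weekly_rate(commit_dates, adoption - window, adoption)
    post = weekly_rate(commit_dates, adoption, adoption + window)
    return pre, post

# Hypothetical commit dates for one developer (not data from the paper).
dates = [date(2023, 12, 5), date(2023, 12, 20), date(2024, 1, 2),
         date(2024, 1, 10), date(2024, 1, 15), date(2024, 1, 28),
         date(2024, 2, 15)]
pre_rate, post_rate = pre_post_rates(dates, date(2024, 1, 1), window_days=28)
# pre_rate is 0.5 commits/week (2 commits in 4 weeks); post_rate is 1.0 (4 in 4).
```

Using equal-length windows on both sides keeps the two rates comparable; the commit on 2024-02-15 falls outside the 28-day post window and is ignored.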

If this is right

  • Objective commit metrics may miss productivity effects that developers themselves notice when using generative AI coding tools.
  • Developers who voluntarily adopt such tools already differ in baseline activity levels from those who do not.
  • Perceived productivity gains can occur without corresponding rises in the volume of commits produced.
  • In large organizations, introducing AI assistants does not automatically produce measurable increases in repository activity.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Organizations evaluating these tools may need to supplement commit counts with measures such as task completion time or code review effort.
  • Flat activity metrics could indicate that the tool reduces the work required per commit rather than raising the number of commits.
  • The observed discrepancy invites direct tests of whether other productivity indicators, like defect rates or feature velocity, move in line with user reports.

Load-bearing premise

Commit counts and similar activity metrics serve as a valid proxy for changes in developer productivity caused by the tool, and pre-existing differences between user groups can be adequately controlled.

What would settle it

A re-analysis of the commit data that finds statistically significant activity increases for Copilot users after tighter controls for individual experience or role, or new interviews in which users report no perceived productivity improvement.

Figures

Figures reproduced from arXiv: 2509.20353 by Astri Barbala, Elias Goldmann Brandtzæg, Nils Brede Moe, Viggo Tellefsen Wivestad, Viktoria Stray.

Figure 1: Self-reported roles of the 39 employees whose GitHub activity was analyzed. … were discovered by a mix of statistical and manual inspection: Duplicates: Some commits were present across multiple repos and were removed. Outliers: Some commits contained an extreme amount of code insertions or deletions. Manual inspection revealed that this typically was due to non-human-generated code (e.g., someone adding or …
Figure 2: Time series showing the average weekly commit activity among GitHub users and non-users. The top …
Figure 3: Average weekly commit contributions for non-users and Copilot users for the periods before and after …
Figure 4: Correlations between change in commits and perceived productivity. Y-axis represents the 5-point …
Abstract

This study investigates the real-world impact of the generative AI (GenAI) tool GitHub Copilot on developer activity and perceived productivity. We conducted a mixed-methods case study in NAV IT, a large public sector agile organization. We analyzed 26,317 unique non-merge commits from 703 of NAV IT's GitHub repositories over a two-year period, focusing on commit-based activity metrics from 25 Copilot users and 14 non-users. The analysis was complemented by survey responses on their roles and perceived productivity, as well as 13 interviews. Our analysis of activity metrics revealed that individuals who used Copilot were consistently more active than non-users, even prior to Copilot's introduction. We did not find any statistically significant changes in commit-based activity for Copilot users after they adopted the tool, although minor increases were observed. This suggests a discrepancy between changes in commit-based metrics and the subjective experience of productivity.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. This paper presents a longitudinal mixed-methods case study investigating the impact of GitHub Copilot on developer activity and perceived productivity within NAV IT, a large public sector agile organization. It analyzes 26,317 unique non-merge commits from 703 GitHub repositories over two years, comparing commit-based activity metrics between 25 Copilot users and 14 non-users, supplemented by survey responses on roles and perceived productivity as well as 13 interviews. The central findings are that Copilot users showed consistently higher activity levels than non-users even prior to adoption, with no statistically significant changes in commit-based metrics after adoption (though minor increases were observed), suggesting a discrepancy between objective activity metrics and subjective productivity perceptions.

Significance. If the results hold after addressing methodological details, the study offers a valuable real-world contribution to empirical software engineering by providing longitudinal evidence from an enterprise setting on the effects of generative AI coding tools. Its strengths are the mixed-methods design, the large commit dataset, and the attention to pre-adoption differences; it also highlights the limitations of commit counts as productivity proxies and the value of combining them with subjective data.

major comments (2)
  1. [Results] Results section (around the pre/post adoption analysis): the claim of no statistically significant changes relies on within-user comparisons, but the manuscript does not provide full details on how pre-existing activity differences between groups were controlled (e.g., via propensity matching, regression covariates, or difference-in-differences); this is load-bearing for the central no-effect finding given the noted baseline disparities.
  2. [Methods] Methods section (data collection and sample description): exact exclusion rules for commits or users, and any sample matching procedures between the 25 Copilot users and 14 non-users, are not fully specified; without these, the robustness of the activity metric comparisons and generalizability cannot be fully assessed.
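
One of the designs named in major comment 1, difference-in-differences, can be sketched in a few lines of Python. This is a rough illustration under stated assumptions, not the paper's analysis: the function name `diff_in_diff` and the numbers are hypothetical, and the estimate is only meaningful if users and non-users would have followed parallel trends without the tool. Non-users also have no adoption date, so a shared calendar cutoff would have to be chosen for them.

```python
from statistics import mean

def diff_in_diff(treated_pre, treated_post, control_pre, control_post):
    """Change in the treated group minus change in the control group."""
    treated_change = mean(treated_post) - mean(treated_pre)
    control_change = mean(control_post) - mean(control_pre)
    return treated_change - control_change

# Hypothetical weekly commit rates (not data from the paper).
users_pre, users_post = [4.0, 6.0], [5.0, 7.0]
non_pre, non_post = [2.0, 2.0], [2.5, 2.5]
estimate = diff_in_diff(users_pre, users_post, non_pre, non_post)
# estimate is 0.5: users rose by 1.0 commit/week, non-users by 0.5.
```

Subtracting the control group's change removes organization-wide shifts (for example seasonal slowdowns) that a pure within-user comparison would attribute to the tool.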
minor comments (2)
  1. [Abstract] The abstract mentions survey responses but does not report the number of respondents or response rate; adding this would improve clarity on the subjective data component.
  2. [Results] Figure or table presenting the activity metrics over time could benefit from explicit labeling of the adoption date and confidence intervals for the statistical tests to aid interpretation of the minor increases observed.
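
The confidence intervals requested in minor comment 2 could be produced in several ways. The sketch below uses a percentile bootstrap on per-developer changes; this method, the function name `bootstrap_mean_ci`, and the sample values are assumptions for illustration and are not taken from the paper.

```python
import random
from statistics import mean

def bootstrap_mean_ci(diffs, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean of paired differences."""
    if not diffs:
        raise ValueError("diffs must not be empty")
    rng = random.Random(seed)
    n = len(diffs)
    # Resample developers with replacement and record each resample's mean.
    means = sorted(mean(rng.choices(diffs, k=n)) for _ in range(n_boot))
    lo = means[int((alpha / 2) * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

# Hypothetical post-minus-pre changes in weekly commits (not data from the paper).
changes = [0.5, -0.5, 0.5, 0.5, -0.5, 1.0, 0.0]
low, high = bootstrap_mean_ci(changes)
```

An interval that spans zero would be consistent with the reported "minor increases" that do not reach significance, while also showing how large an effect the data cannot rule out.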

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which help clarify key aspects of our methodology and analysis. We address each major comment below and have revised the manuscript accordingly to improve transparency and robustness.

  1. Referee: [Results] Results section (around the pre/post adoption analysis): the claim of no statistically significant changes relies on within-user comparisons, but the manuscript does not provide full details on how pre-existing activity differences between groups were controlled (e.g., via propensity matching, regression covariates, or difference-in-differences); this is load-bearing for the central no-effect finding given the noted baseline disparities.

    Authors: We appreciate this observation. Our primary analysis for the no-effect claim used within-subject pre/post comparisons on the 25 Copilot users (e.g., comparing each individual's commit activity in the 6 months before vs. after adoption using non-parametric tests). The non-user group was included descriptively to document baseline differences between adopters and non-adopters, not as a matched control for causal inference. No propensity score matching, difference-in-differences, or regression covariates were applied, as the study is observational with a small sample and focuses on individual change trajectories rather than between-group effects. We have added a new paragraph in the Results section explicitly describing the statistical procedures (including test statistics and p-values), confirming the absence of additional controls, and discussing why a more complex design was not feasible given the data constraints. This addresses the load-bearing concern by making the analytical choices transparent. revision: yes

  2. Referee: [Methods] Methods section (data collection and sample description): exact exclusion rules for commits or users, and any sample matching procedures between the 25 Copilot users and 14 non-users, are not fully specified; without these, the robustness of the activity metric comparisons and generalizability cannot be fully assessed.

    Authors: We agree that greater specificity is required. The revised Methods section now details: (a) exclusion criteria for the 26,317 commits (removal of merge commits, bot-generated commits, and those with zero changed lines); (b) user inclusion based on Copilot license activation records cross-referenced with survey responses and GitHub activity, resulting in the final 25 users and 14 non-users; and (c) explicit statement that no formal sample matching (e.g., propensity or exact matching) was performed; the groups reflect natural variation in tool adoption within the organization. We have also expanded the Limitations section to discuss implications for generalizability. These changes allow readers to evaluate the comparisons directly. revision: yes
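
The exclusion rules listed in response 2, together with the duplicate removal mentioned in the Figure 1 excerpt, can be sketched as a simple filter. The `Commit` record, its field names, and the `[bot]` author-suffix heuristic are assumptions for this example; the rebuttal does not say how bot commits were actually identified.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Commit:
    sha: str
    repo: str
    author: str
    parent_count: int  # more than one parent means a merge commit
    insertions: int
    deletions: int

def filter_commits(commits):
    """Keep unique, human-authored, non-merge commits that change lines."""
    seen = set()
    kept = []
    for c in commits:
        if c.parent_count > 1:                 # merge commit
            continue
        if c.author.endswith("[bot]"):         # assumed bot heuristic
            continue
        if c.insertions + c.deletions == 0:    # zero changed lines
            continue
        if c.sha in seen:                      # same commit in several repos
            continue
        seen.add(c.sha)
        kept.append(c)
    return kept

# Hypothetical commits (not data from the paper).
sample = [
    Commit("a1", "repo-x", "alice", 1, 10, 2),
    Commit("a1", "repo-y", "alice", 1, 10, 2),          # duplicate sha
    Commit("b2", "repo-x", "bob", 2, 5, 5),             # merge
    Commit("c3", "repo-x", "dependabot[bot]", 1, 1, 1), # bot
    Commit("d4", "repo-y", "carol", 1, 0, 3),
    Commit("e5", "repo-y", "carol", 1, 0, 0),           # no changed lines
]
kept = filter_commits(sample)
```

Writing the rules as explicit, ordered checks makes it easy to report how many commits each rule removes, which is the level of detail the referee asks for.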

Circularity Check

0 steps flagged

No circularity: purely empirical observational study with independent data analysis

This is a longitudinal mixed-methods case study relying on direct analysis of 26,317 commits, survey responses, and interviews from a real organization. No equations, fitted parameters, uniqueness theorems, or self-citations are used to derive the central claim of no statistically significant change in commit activity post-Copilot adoption. The pre/post comparison and acknowledgment of baseline differences are grounded in the collected data itself rather than reducing to any prior fitted quantity or self-referential definition. The result is self-contained against external benchmarks and does not exhibit any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that commit-based activity metrics are appropriate proxies for productivity impact and that the user and non-user groups are comparable after accounting for baseline differences.

axioms (1)
  • domain assumption Commit counts and related activity metrics are valid proxies for developer productivity changes
    The study interprets absence of significant change in these metrics as evidence against productivity gains from Copilot.

pith-pipeline@v0.9.0 · 5714 in / 1180 out tokens · 39550 ms · 2026-05-18T14:10:20.151053+00:00 · methodology


Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. A survey of generative AI adoption and perceived productivity among scientists who program

    cs.SE 2025-12 unverdicted novelty 6.0

    Survey of 868 scientific programmers shows generative AI adoption is highest among the inexperienced, who prefer conversational tools, and perceived productivity correlates most with volume of accepted generated code ...

  2. The Impact of AI Coding Assistants on Software Engineering: A Longitudinal Study

    cs.SE 2026-05 unverdicted novelty 5.0

    Longitudinal surveys show AI coding assistants reduce time on code writing but increase supervisory verification tasks, with stable productivity perceptions yet rising reports of worsened developer experience.

  3. Dive into Claude Code: The Design Space of Today's and Future AI Agent Systems

    cs.SE 2026-04 unverdicted novelty 5.0

    Claude Code centers on a model-tool while-loop surrounded by permission systems, context compaction, extensibility hooks, subagent delegation, and session storage; the same design questions yield different answers in ...

  4. Engineering Students' Usage and Perceptions of GitHub Copilot in Open-Source Projects

    cs.SE 2026-04 unverdicted novelty 5.0

    Students primarily used Copilot chat and code generation features during open-source contributions, with usage patterns varying significantly by gender, programming skill, and AI experience.
