Developer Productivity With and Without GitHub Copilot: A Longitudinal Mixed-Methods Case Study
Pith reviewed 2026-05-18 14:10 UTC · model grok-4.3
The pith
GitHub Copilot adoption produced no statistically significant rise in commit-based activity metrics among developers who used the tool.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
In this longitudinal mixed-methods case study, individuals who adopted GitHub Copilot showed no statistically significant changes in commit-based activity metrics after the tool's introduction. Minor increases were observed, yet these did not alter the pre-existing pattern in which Copilot users had been consistently more active than non-users. The analysis of 26,317 commits across 703 repositories, combined with survey responses and 13 interviews, therefore reveals a discrepancy between changes in objective activity measures and the subjective sense of productivity gains.
What carries the argument
Pre- and post-adoption comparison of commit-based activity metrics (such as commit counts and related indicators) across Git repositories, set against self-reported productivity from surveys and interviews.
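The pre/post comparison described above can be sketched in a few lines. This is a minimal illustration, assuming each commit is available as an (author, date) pair and each adopter has a known adoption date; the function name `pre_post_counts` and all sample data are hypothetical and are not taken from the paper.

```python
from collections import Counter
from datetime import date


def pre_post_counts(commits, adoption_dates):
    """Count each adopter's commits before and on/after their adoption date.

    commits: iterable of (author, commit_date) pairs.
    adoption_dates: dict mapping author -> adoption date (hypothetical input).
    Authors without an adoption date are skipped.
    """
    pre, post = Counter(), Counter()
    for author, day in commits:
        adopted = adoption_dates.get(author)
        if adopted is None:
            continue  # not an adopter, so there is no pre/post split
        if day < adopted:
            pre[author] += 1
        else:
            post[author] += 1
    return {author: (pre[author], post[author]) for author in adoption_dates}


# Made-up records, for illustration only.
commits = [
    ("alice", date(2023, 1, 10)),
    ("alice", date(2023, 3, 2)),
    ("alice", date(2023, 6, 20)),
    ("bob", date(2023, 2, 14)),
    ("carol", date(2023, 4, 1)),  # no adoption date: ignored
]
adoption_dates = {"alice": date(2023, 4, 1), "bob": date(2023, 2, 1)}

counts = pre_post_counts(commits, adoption_dates)
print(counts)  # {'alice': (2, 1), 'bob': (0, 1)}
```

Raw counts like these would normally be restricted to equal-length windows on either side of the adoption date before any comparison is made.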
If this is right
- Objective commit metrics may miss productivity effects that developers themselves notice when using generative AI coding tools.
- Developers who voluntarily adopt such tools already differ in baseline activity levels from those who do not.
- Perceived productivity gains can occur without corresponding rises in the volume of commits produced.
- In large organizations, introducing AI assistants does not automatically produce measurable increases in repository activity.
Where Pith is reading between the lines
- Organizations evaluating these tools may need to supplement commit counts with measures such as task completion time or code review effort.
- Flat activity metrics could indicate that the tool reduces the work required per commit rather than raising the number of commits.
- The observed discrepancy invites direct tests of whether other productivity indicators, like defect rates or feature velocity, move in line with user reports.
Load-bearing premise
Commit counts and similar activity metrics serve as a valid proxy for changes in developer productivity caused by the tool, and pre-existing differences between user groups can be adequately controlled.
What would settle it
A re-analysis of the commit data that finds statistically significant activity increases for Copilot users after tighter controls for individual experience or role, or new interviews in which users report no perceived productivity improvement.
Original abstract
This study investigates the real-world impact of the generative AI (GenAI) tool GitHub Copilot on developer activity and perceived productivity. We conducted a mixed-methods case study in NAV IT, a large public sector agile organization. We analyzed 26,317 unique non-merge commits from 703 of NAV IT's GitHub repositories over a two-year period, focusing on commit-based activity metrics from 25 Copilot users and 14 non-users. The analysis was complemented by survey responses on their roles and perceived productivity, as well as 13 interviews. Our analysis of activity metrics revealed that individuals who used Copilot were consistently more active than non-users, even prior to Copilot's introduction. We did not find any statistically significant changes in commit-based activity for Copilot users after they adopted the tool, although minor increases were observed. This suggests a discrepancy between changes in commit-based metrics and the subjective experience of productivity.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. This paper presents a longitudinal mixed-methods case study investigating the impact of GitHub Copilot on developer activity and perceived productivity within NAV IT, a large public sector agile organization. It analyzes 26,317 unique non-merge commits from 703 GitHub repositories over two years, comparing commit-based activity metrics between 25 Copilot users and 14 non-users, supplemented by survey responses on roles and perceived productivity as well as 13 interviews. The central findings are that Copilot users showed consistently higher activity levels than non-users even prior to adoption, with no statistically significant changes in commit-based metrics after adoption (though minor increases were observed), suggesting a discrepancy between objective activity metrics and subjective productivity perceptions.
Significance. If the results hold after addressing methodological details, the study offers a valuable real-world contribution to empirical software engineering by providing longitudinal evidence from an enterprise setting on the effects of generative AI coding tools. Its strengths are the mixed-methods design, the large commit dataset, and the attention to pre-adoption differences. It also highlights the limitations of commit counts as productivity proxies and the value of combining them with subjective data.
Major comments (2)
- [Results] Results section (around the pre/post adoption analysis): the claim of no statistically significant changes relies on within-user comparisons, but the manuscript does not provide full details on how pre-existing activity differences between groups were controlled (e.g., via propensity matching, regression covariates, or difference-in-differences); this is load-bearing for the central no-effect finding given the noted baseline disparities.
- [Methods] Methods section (data collection and sample description): exact exclusion rules for commits or users, and any sample matching procedures between the 25 Copilot users and 14 non-users, are not fully specified; without these, the robustness of the activity metric comparisons and generalizability cannot be fully assessed.
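For illustration only: the first comment names difference-in-differences as one possible control, and the authors' reply states that it was not applied. A minimal sketch of the estimator, assuming per-developer commit counts for each group and period, looks like this. The function name `diff_in_diff` and every number are hypothetical and are not drawn from the study.

```python
from statistics import mean


def diff_in_diff(treated_pre, treated_post, control_pre, control_post):
    """Change in the treated group's mean minus change in the control group's mean."""
    treated_change = mean(treated_post) - mean(treated_pre)
    control_change = mean(control_post) - mean(control_pre)
    return treated_change - control_change


# Toy commit counts per developer, invented for this example.
users_pre, users_post = [20, 30, 25], [24, 33, 27]
non_users_pre, non_users_post = [10, 12, 14], [11, 13, 15]

estimate = diff_in_diff(users_pre, users_post, non_users_pre, non_users_post)
print(estimate)  # 2
```

The estimate is only interpretable causally if both groups would have followed parallel trends without the tool, which is hard to defend when adopters were already more active at baseline.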
Minor comments (2)
- [Abstract] The abstract mentions survey responses but does not report the number of respondents or response rate; adding this would improve clarity on the subjective data component.
- [Results] The figure or table presenting the activity metrics over time would benefit from an explicit label for the adoption date and from confidence intervals for the statistical tests, both of which would aid interpretation of the minor increases observed.
Simulated Author's Rebuttal
We thank the referee for their constructive comments, which help clarify key aspects of our methodology and analysis. We address each major comment below and have revised the manuscript accordingly to improve transparency and robustness.
Point-by-point responses
- Referee: [Results] Results section (around the pre/post adoption analysis): the claim of no statistically significant changes relies on within-user comparisons, but the manuscript does not provide full details on how pre-existing activity differences between groups were controlled (e.g., via propensity matching, regression covariates, or difference-in-differences); this is load-bearing for the central no-effect finding given the noted baseline disparities.
Authors: We appreciate this observation. Our primary analysis for the no-effect claim used within-subject pre/post comparisons on the 25 Copilot users (e.g., comparing each individual's commit activity in the 6 months before vs. after adoption using non-parametric tests). The non-user group was included descriptively to document baseline differences between adopters and non-adopters, not as a matched control for causal inference. No propensity score matching, difference-in-differences, or regression covariates were applied, as the study is observational with a small sample and focuses on individual change trajectories rather than between-group effects. We have added a new paragraph in the Results section explicitly describing the statistical procedures (including test statistics and p-values), confirming the absence of additional controls, and discussing why a more complex design was not feasible given the data constraints. This addresses the load-bearing concern by making the analytical choices transparent.
Revision: yes
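The reply mentions non-parametric tests on paired pre/post activity but does not name a specific test. As one plausible instance, here is a minimal sketch of an exact two-sided Wilcoxon signed-rank test written with the standard library only. The choice of test, the function name `wilcoxon_signed_rank_exact`, and the sample counts are assumptions made for illustration, not the authors' procedure or data.

```python
from itertools import product


def wilcoxon_signed_rank_exact(pre, post):
    """Exact two-sided Wilcoxon signed-rank test for paired samples.

    Returns (w_plus, p_value). Zero differences are dropped and tied absolute
    differences share their average rank. All 2**n sign patterns are
    enumerated, so this is only practical for small n.
    """
    diffs = [b - a for a, b in zip(pre, post) if b != a]
    n = len(diffs)
    if n == 0:
        return 0.0, 1.0
    # Rank the absolute differences, averaging ranks across ties.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        average_rank = (i + j) / 2 + 1  # mean of 1-based positions i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    expected = sum(ranks) / 2
    observed_dev = abs(w_plus - expected)
    # Under the null hypothesis each rank is positive or negative with equal odds.
    extreme = 0
    for signs in product((0, 1), repeat=n):
        w = sum(r for r, s in zip(ranks, signs) if s)
        if abs(w - expected) >= observed_dev - 1e-12:
            extreme += 1
    return w_plus, extreme / 2 ** n


# Invented commit counts for six developers, before and after adoption.
pre = [12, 30, 8, 21, 15, 40]
post = [14, 29, 11, 25, 15, 46]

w_plus, p_value = wilcoxon_signed_rank_exact(pre, post)
print(w_plus, p_value)  # 14.0 0.125
```

In this toy sample four of five non-zero differences are increases, yet the exact p-value is 0.125, which shows how small samples can yield visible increases without statistical significance.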
- Referee: [Methods] Methods section (data collection and sample description): exact exclusion rules for commits or users, and any sample matching procedures between the 25 Copilot users and 14 non-users, are not fully specified; without these, the robustness of the activity metric comparisons and generalizability cannot be fully assessed.
Authors: We agree that greater specificity is required. The revised Methods section now details: (a) the exclusion criteria behind the 26,317 commits (removal of merge commits, bot-generated commits, and those with zero changed lines); (b) user inclusion based on Copilot license activation records cross-referenced with survey responses and GitHub activity, resulting in the final 25 users and 14 non-users; and (c) an explicit statement that no formal sample matching (e.g., propensity or exact matching) was performed, so the groups reflect natural variation in tool adoption within the organization. We have also expanded the Limitations section to discuss implications for generalizability. These changes allow readers to evaluate the comparisons directly.
Revision: yes
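The exclusion criteria listed in (a) can be sketched as a simple filter. This is an illustration under stated assumptions: the `Commit` record, the bot-detection heuristic, and the sample data are hypothetical, since the reply does not say how bots were identified or how commits were represented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Commit:
    sha: str
    author: str
    parent_count: int   # two or more parents marks a merge commit
    lines_changed: int  # added plus deleted lines


# Hypothetical bot heuristic; the source does not describe bot detection.
BOT_SUFFIXES = ("[bot]",)
BOT_NAMES = {"dependabot", "renovate"}


def is_bot(author):
    name = author.lower()
    return name.endswith(BOT_SUFFIXES) or name in BOT_NAMES


def filter_commits(commits):
    """Keep unique, non-merge, non-bot commits that change at least one line."""
    seen, kept = set(), []
    for commit in commits:
        if commit.sha in seen:
            continue  # duplicate of a commit already examined
        seen.add(commit.sha)
        if commit.parent_count >= 2 or is_bot(commit.author):
            continue
        if commit.lines_changed <= 0:
            continue
        kept.append(commit)
    return kept


# Invented records, for illustration only.
raw = [
    Commit("a1", "alice", 1, 42),
    Commit("b2", "alice", 2, 10),           # merge commit
    Commit("c3", "dependabot[bot]", 1, 5),  # bot-generated
    Commit("d4", "bob", 1, 0),              # zero changed lines
    Commit("a1", "alice", 1, 42),           # duplicate sha
    Commit("e5", "bob", 1, 7),
]

kept_shas = [c.sha for c in filter_commits(raw)]
print(kept_shas)  # ['a1', 'e5']
```

Deduplicating by hash reflects the abstract's reference to unique commits; a commit that appears in several repositories or forks would otherwise be counted more than once.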
Circularity Check
No circularity: purely empirical observational study with independent data analysis
This is a longitudinal mixed-methods case study relying on direct analysis of 26,317 commits, survey responses, and interviews from a real organization. No equations, fitted parameters, uniqueness theorems, or self-citations are used to derive the central claim of no statistically significant change in commit activity post-Copilot adoption. The pre/post comparison and acknowledgment of baseline differences are grounded in the collected data itself rather than reducing to any prior fitted quantity or self-referential definition. The result is self-contained against external benchmarks and does not exhibit any of the enumerated circularity patterns.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: Commit counts and related activity metrics are valid proxies for developer productivity changes.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean, reality_from_one_distinction (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "We did not find any statistically significant changes in commit-based activity for Copilot users after they adopted the tool, although minor increases were observed."
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "Our analysis of activity metrics revealed that individuals who used Copilot were consistently more active than non-users, even prior to Copilot's introduction."
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 4 Pith papers
- A survey of generative AI adoption and perceived productivity among scientists who program
  Survey of 868 scientific programmers shows generative AI adoption is highest among the inexperienced, who prefer conversational tools, and perceived productivity correlates most with volume of accepted generated code ...
- The Impact of AI Coding Assistants on Software Engineering: A Longitudinal Study
  Longitudinal surveys show AI coding assistants reduce time on code writing but increase supervisory verification tasks, with stable productivity perceptions yet rising reports of worsened developer experience.
- Dive into Claude Code: The Design Space of Today's and Future AI Agent Systems
  Claude Code centers on a model-tool while-loop surrounded by permission systems, context compaction, extensibility hooks, subagent delegation, and session storage; the same design questions yield different answers in ...
- Engineering Students' Usage and Perceptions of GitHub Copilot in Open-Source Projects
  Students primarily used Copilot chat and code generation features during open-source contributions, with usage patterns varying significantly by gender, programming skill, and AI experience.