Beyond Simpson's Paradox: A Cascade of Confounders in AI Agent Pull-Request Co-Authorship

Haoran Yu; Lifei Liu; Pin Qian; Su Wang; Xiaochong Jiang; Yihang Chen

arxiv: 2606.22711 · v1 · pith:BFSDZE36new · submitted 2026-06-21 · 💻 cs.SE · cs.AI

Pith reviewed 2026-06-26 09:32 UTC · model grok-4.3

keywords: AI coding agents · pull requests · co-authorship · Simpson's paradox · confounders · merge rates · selection bias · repository fixed effects

The pith

No AI coding agent retains a clear co-authorship effect on PR merge rates once repository selection and PR structure are controlled.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper analyzes 33,596 pull requests from five AI coding agents and shows that an initial pooled finding of lower merge rates with human co-authorship is a Simpson's paradox caused by differences in which agents use co-authorship. Stratifying by agent reverses the pattern for some agents, but further controls for which repository the PR comes from, the number of commits, and whether the PR has multiple commits remove any remaining effect. A sympathetic reader cares because the result shows that observational associations between co-authorship and merge success are largely artefacts of how agents are chosen and how PRs are structured rather than evidence of a causal benefit. The work therefore cautions against reporting unstratified statistics when evaluating AI tools in software development.

Core claim

Pooled across agents, PRs with a Co-Authored-By trailer merge less often than autonomous ones, yet this reverses when stratified by agent identity because Codex dominates the data with high merge rates but rare co-authorship. Within-repository fixed effects eliminate Devin's initial gap, a commit-count control halves Copilot's remaining gap, and restricting to multi-commit PRs reduces Copilot's within-repo effect to a non-significant level. No agent shows a clear co-authorship effect after both repository selection and PR structure are controlled.
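The reversal described above can be reproduced with a few lines of arithmetic. The sketch below uses synthetic counts chosen only to illustrate the mechanism; they are not the paper's data, and the agent names `agent_A` and `agent_B` are hypothetical. One agent merges often but rarely co-authors, the other merges rarely but co-authors often.

```python
# Synthetic (merged, total) counts per agent and group.
# These numbers are illustrative assumptions, NOT taken from the AIDev dataset.
counts = {
    # High merge rate, co-authorship is rare.
    "agent_A": {"coauthored": (90, 100), "autonomous": (800, 1000)},
    # Low merge rate, co-authorship is common.
    "agent_B": {"coauthored": (300, 900), "autonomous": (20, 100)},
}


def rate(merged, total):
    """Merge rate for one cell."""
    return merged / total


def gap(cells):
    """Co-authored merge rate minus autonomous merge rate."""
    return rate(*cells["coauthored"]) - rate(*cells["autonomous"])


def pooled(all_counts):
    """Sum merged/total counts over agents, ignoring agent identity."""
    totals = {"coauthored": [0, 0], "autonomous": [0, 0]}
    for cells in all_counts.values():
        for group, (merged, total) in cells.items():
            totals[group][0] += merged
            totals[group][1] += total
    return {group: tuple(pair) for group, pair in totals.items()}


within_gaps = {agent: gap(cells) for agent, cells in counts.items()}
pooled_gap = gap(pooled(counts))

for agent, value in within_gaps.items():
    print(f"{agent}: within-agent gap = {value:+.3f}")  # positive for both agents
print(f"pooled gap = {pooled_gap:+.3f}")  # negative despite the positive within-agent gaps
```

Both within-agent gaps are positive while the pooled gap is negative, because the co-authored pool is dominated by the low-merging agent and the autonomous pool by the high-merging one.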

What carries the argument

The cascade of sequential controls beginning with agent stratification to resolve Simpson's paradox, then within-repository fixed effects, commit-count stratification, and restriction to multi-commit PRs.
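One way to read "within-repository fixed effects" is as the within estimator of a linear probability model: demean both the co-authorship flag and the merge outcome inside each repository, then take the slope. The sketch below is an illustration under that assumption; the digest does not state the paper's exact specification, and the repository names and counts are synthetic.

```python
from collections import defaultdict


def within_repo_gap(rows):
    """rows: iterable of (repo, coauthored, merged) with 0/1 flags.

    Returns the slope of merged on coauthored after demeaning both within
    each repository (one-regressor fixed-effects 'within' estimator).
    """
    by_repo = defaultdict(list)
    for repo, x, y in rows:
        by_repo[repo].append((x, y))
    num = den = 0.0
    for obs in by_repo.values():
        mean_x = sum(x for x, _ in obs) / len(obs)
        mean_y = sum(y for _, y in obs) / len(obs)
        for x, y in obs:
            num += (x - mean_x) * (y - mean_y)
            den += (x - mean_x) ** 2
    if den == 0:
        # Repositories with only one kind of PR contribute nothing.
        raise ValueError("no repository mixes co-authored and autonomous PRs")
    return num / den


def naive_gap(rows):
    """Cross-sectional gap that ignores which repository a PR belongs to."""
    co = [y for _, x, y in rows if x == 1]
    au = [y for _, x, y in rows if x == 0]
    return sum(co) / len(co) - sum(au) / len(au)


def make_rows(repo, coauthored, merged, total):
    """Build `total` synthetic PRs for one cell, `merged` of them merged."""
    return [(repo, coauthored, 1 if i < merged else 0) for i in range(total)]


# Synthetic data (an assumption for illustration): an easy-going repository
# that merges 90% of PRs and attracts co-authorship, and a strict one that
# merges 30% and mostly sees autonomous PRs. Co-authorship has no effect
# inside either repository.
rows = (
    make_rows("easy", 1, 18, 20) + make_rows("easy", 0, 9, 10)
    + make_rows("strict", 1, 3, 10) + make_rows("strict", 0, 6, 20)
)

print(f"naive gap       = {naive_gap(rows):+.3f}")
print(f"within-repo gap = {within_repo_gap(rows):+.3f}")
```

In this toy setup the cross-sectional gap is clearly positive, yet the within-repository estimate is zero: the whole apparent effect comes from where co-authored PRs are submitted.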

If this is right

Agent-pooled statistics without stratification produce misleading conclusions about co-authorship.
Cross-sectional associations between co-authorship and merge rates are largely selection and PR-structure artefacts.
Researchers evaluating AI coding tools must apply layered controls to isolate any genuine causal signal.
Claims of human-AI collaboration benefits based on observational PR data require caution.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same layered confounding pattern could appear in other observational studies that compare AI-assisted versus unaided code changes.
Replicating the controls on metrics other than merge rate, such as review time or defect rates, would test whether the pattern is outcome-specific.
If unmeasured factors like repository popularity or agent deployment timing correlate with both co-authorship and outcomes, they could still bias results even after the reported controls.

Load-bearing premise

That the sequence of within-repo fixed effects, commit-count stratification, and multi-commit PR restriction fully captures selection biases so that no important unmeasured confounders remain that could restore a co-authorship effect.
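The last two layers of the sequence, commit-count stratification and the multi-commit restriction, can be sketched as follows. This is a minimal illustration with synthetic PRs; the two bins ("1" and "2+"), the size-weighted average, and the field names `commits`, `coauthored` and `merged` are assumptions, not the paper's documented procedure.

```python
from collections import defaultdict


def stratum_gaps(prs, key):
    """Return {stratum: (gap, n)} for strata that contain both groups.

    prs: list of dicts with 'commits' (int), 'coauthored' (bool), 'merged' (bool).
    key: function mapping a PR to its stratum label.
    """
    cells = defaultdict(lambda: {True: [0, 0], False: [0, 0]})
    for pr in prs:
        cell = cells[key(pr)][pr["coauthored"]]
        cell[0] += pr["merged"]  # bool counts as 0 or 1
        cell[1] += 1
    gaps = {}
    for stratum, groups in cells.items():
        (m1, n1), (m0, n0) = groups[True], groups[False]
        if n1 and n0:
            gaps[stratum] = (m1 / n1 - m0 / n0, n1 + n0)
    return gaps


def weighted_gap(gaps):
    """Average the per-stratum gaps, weighting each stratum by its size."""
    total = sum(n for _, n in gaps.values())
    return sum(g * n for g, n in gaps.values()) / total


def make(commits, coauthored, merged, total):
    """Build `total` synthetic PRs for one cell, `merged` of them merged."""
    return [
        {"commits": commits, "coauthored": coauthored, "merged": i < merged}
        for i in range(total)
    ]


# Synthetic data: multi-commit PRs merge more often AND are more often
# co-authored, but co-authorship changes nothing within either bin.
prs = (
    make(1, False, 16, 40) + make(1, True, 2, 5)
    + make(3, False, 8, 10) + make(3, True, 36, 45)
)

unadjusted = stratum_gaps(prs, key=lambda pr: "all")["all"][0]
by_commits = stratum_gaps(prs, key=lambda pr: "1" if pr["commits"] == 1 else "2+")
multi_only = [pr for pr in prs if pr["commits"] >= 2]
restricted = stratum_gaps(multi_only, key=lambda pr: "all")["all"][0]

print(f"unadjusted gap         = {unadjusted:+.3f}")
print(f"commit-stratified gap  = {weighted_gap(by_commits):+.3f}")
print(f"multi-commit-only gap  = {restricted:+.3f}")
```

The unadjusted gap is large, while both the stratified and the restricted estimates vanish, which is the pattern a PR-structure confounder would produce.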

What would settle it

A new dataset in which at least one agent shows a statistically significant positive co-authorship effect on merge probability after applying within-repository fixed effects, commit-count stratification, and restriction to multi-commit PRs.

Original abstract

Pooled across five AI coding agents, pull requests (PRs) with a human Co-Authored-By trailer merge less often than purely-autonomous ones (53.8% vs. 79.8%) -- yet this aggregate finding is a textbook Simpson's Paradox. Stratifying 33,596 PRs from the AIDev dataset by agent identity reverses the conclusion: Copilot and Devin show large positive within-agent gaps (+41.2 and +33.5 pp, both p<0.001), while Cursor, Claude Code, and Codex show small effects whose cross-sectional 95% CIs span zero. The paradox is driven entirely by agent composition: Codex, which dominates 64.9% of the dataset, achieves high merge rates while rarely using co-authorship. But Simpson's Paradox is only the first layer of a cascade of confounders: within-repo controls eliminate Devin's gap (+33.5 to +1.6 pp, p=0.73); a commit-count control further halves Copilot's within-repo gap (+36.2 to +24.4 pp); restricted to multi-commit PRs, the Copilot within-repo effect dissolves to +4.8 pp (p=0.59). No agent retains a clear co-authorship effect once both repository selection and PR structure are controlled. Our findings caution against reporting agent-pooled statistics without stratification and demonstrate that cross-sectional co-authorship associations are largely selection and PR-structure artefacts rather than evidence of a causal benefit.
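The role of agent composition can be made explicit with the law of total probability. Writing M for "merged", C for the co-authorship flag and A for the agent (notation introduced here for illustration, not taken from the paper), the pooled merge rate in each group is a mixture of per-agent rates:

```latex
\[
P(M \mid C = c) \;=\; \sum_{a} P(M \mid C = c,\, A = a)\; P(A = a \mid C = c),
\qquad c \in \{0, 1\}.
\]
```

The mixture weights P(A = a | C = c) differ between the co-authored (c = 1) and autonomous (c = 0) groups. The pooled gap P(M | C = 1) - P(M | C = 0) can therefore be negative even when no within-agent gap P(M | C = 1, A = a) - P(M | C = 0, A = a) is negative, if the autonomous group is weighted toward an agent with a high merge rate.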

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper shows that co-authorship effects for these AI agents on PR merge rates disappear after repo fixed effects and PR structure controls, but the observational setup leaves open whether unmeasured factors could restore some association.

The main thing to take from this is that the apparent co-authorship advantage for agents like Copilot and Devin goes away once you stratify by repository and then add controls for commit count and multi-commit PRs.

The paper does a clear job laying out the cascade with numbers at each layer. It starts with the pooled Simpson's paradox (53.8% vs. 79.8% merge rates), shows the reversal when splitting by agent, then reports how within-repo fixed effects drop Devin's gap from +33.5 to +1.6 pp, a commit-count control cuts Copilot's within-repo gap from +36.2 to +24.4 pp, and the multi-commit restriction brings the remaining effect to +4.8 pp with p=0.59. The step-by-step reporting with p-values and CIs makes the argument easy to trace on the 33,596-PR AIDev dataset.

The soft spot is that these sequential controls may not capture everything. Repository and PR-structure adjustments are reasonable, but the paper does not report balance checks on other observables or robustness to alternatives like code complexity or within-repo timing. Residual confounding could still exist, which means the conclusion that associations are "artefacts" rests on the chosen controls being adequate.

This is relevant for software engineering researchers who evaluate AI coding agents on PR data. It gives a concrete example of why pooled stats can mislead and why stratification matters in this domain. A reader working on agent benchmarks would find the caution useful.

I would send it for peer review. The empirical demonstration is transparent enough to deserve referee scrutiny on the methods and remaining confounders.

Referee Report

1 major / 0 minor

Summary. The paper analyzes 33,596 PRs from the AIDev dataset across five AI coding agents. It reports a pooled Simpson's paradox in which co-authored PRs merge at lower rates (53.8% vs. 79.8%) than autonomous ones, but agent-stratified results reverse this for Copilot (+41.2 pp) and Devin (+33.5 pp). Within-repo fixed effects eliminate Devin's gap (+33.5 to +1.6 pp), commit-count stratification halves Copilot's within-repo gap (+36.2 to +24.4 pp), and restriction to multi-commit PRs reduces it further to +4.8 pp (p=0.59). The central claim is that no agent retains a statistically clear co-authorship effect once repository selection and PR structure are controlled, implying that cross-sectional associations are selection artifacts rather than evidence of causal benefit.

Significance. If the sequential controls are adequate, the work provides a clear methodological caution for observational studies of AI agents in software engineering by showing how pooled statistics can mislead and how stratification by agent, repository, and PR features dissolves apparent effects. The explicit reporting of numerical reversals, p-values, and 95% CIs after each control layer is a strength that makes the cascade of adjustments transparent and reproducible in principle. This contributes to better practice in empirical SE research on tool effectiveness.

major comments (1)

[Results (description of cascade controls)] The central claim that controls for repository selection and PR structure fully eliminate co-authorship effects rests on the sequential adjustments (within-repo fixed effects, commit-count stratification, multi-commit restriction) capturing all relevant biases. However, the manuscript does not report balance checks on observables such as code complexity or within-repo timing of agent use, nor robustness to additional controls; if these restore non-zero gaps for any agent, the conclusion that associations are 'largely selection and PR-structure artefacts' would not hold. A concrete test is to add such covariates and re-estimate the within-agent gaps.
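A balance check of the kind requested here is often reported as a standardized mean difference (SMD) between co-authored and autonomous PRs on each covariate. The sketch below is a generic illustration, not something the paper reports; the covariate name `days_since_first_agent_pr` and its values are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def standardized_mean_difference(treated, control):
    """(mean_t - mean_c) / sqrt((var_t + var_c) / 2), using sample variances."""
    pooled_sd = sqrt((variance(treated) + variance(control)) / 2)
    if pooled_sd == 0:
        # Both groups are constant: balanced only if the constants match.
        return 0.0 if mean(treated) == mean(control) else float("inf")
    return (mean(treated) - mean(control)) / pooled_sd


# Hypothetical covariate values for co-authored vs. autonomous PRs
# (for example, days_since_first_agent_pr within a repository).
coauthored_values = [2, 4, 6, 8]
autonomous_values = [1, 3, 5, 7]

smd = standardized_mean_difference(coauthored_values, autonomous_values)
print(f"SMD = {smd:.3f}")
```

A common rule of thumb treats an absolute SMD above roughly 0.1 as a sign of imbalance worth adjusting for, although the threshold is a convention rather than a formal test.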

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback and for recognizing the methodological contribution of demonstrating how sequential stratification dissolves apparent effects. We respond to the major comment below.

Referee: [Results (description of cascade controls)] The central claim that controls for repository selection and PR structure fully eliminate co-authorship effects rests on the sequential adjustments (within-repo fixed effects, commit-count stratification, multi-commit restriction) capturing all relevant biases. However, the manuscript does not report balance checks on observables such as code complexity or within-repo timing of agent use, nor robustness to additional controls; if these restore non-zero gaps for any agent, the conclusion that associations are 'largely selection and PR-structure artefacts' would not hold. A concrete test is to add such covariates and re-estimate the within-agent gaps.

Authors: We agree that explicit balance checks on additional observables would be informative. However, the AIDev dataset does not contain measures of code complexity (such as cyclomatic complexity or changed lines of code beyond commit counts), so such checks cannot be performed with the available data. Commit count is already used as a proxy for PR structure and is correlated with complexity. Within-repo fixed effects control for time-invariant repository characteristics, though they do not model dynamic timing of agent adoption. The observed attenuation to statistical insignificance after the reported controls supports our conclusion that the associations are largely artifacts. We will revise the manuscript to add an explicit limitations discussion acknowledging the absence of these additional covariates and the rationale for the chosen stratification layers. revision: partial

Circularity Check

0 steps flagged

No circularity: purely observational stratification with no derivations or fitted predictions

The paper reports empirical merge-rate differences from direct stratification of the AIDev dataset (33,596 PRs) under successive controls (agent identity, within-repo fixed effects, commit-count bins, multi-commit restriction). No equations, models, or predictions appear; the central claim is that effects disappear under these controls, presented as a descriptive finding rather than a derived result. No self-citations, ansatzes, or renamings of known results are invoked as load-bearing steps. The analysis is self-contained against the reported data and does not reduce any quantity to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The analysis rests on standard observational-data assumptions rather than new parameters or entities; no free parameters are fitted beyond the implicit choice of control variables.

axioms (1)

[standard math] Standard assumptions for difference-in-proportions tests and confidence intervals (independence within strata, correct specification of within-repo controls).
Invoked implicitly by the reported p-values and 95% CIs after each stratification step.
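For readers who want to see what such a test involves, the sketch below implements one standard choice: a two-sided two-proportion z-test with a pooled standard error, plus a Wald confidence interval for the difference. The digest does not say which test the paper uses, so this is an assumption, and the counts are synthetic.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_test(m1, n1, m0, n0, level=0.95):
    """Return (difference, two-sided p-value, (ci_low, ci_high)).

    m1/n1: merged and total PRs in the co-authored group.
    m0/n0: merged and total PRs in the autonomous group.
    Assumes independent observations within each group.
    """
    p1, p0 = m1 / n1, m0 / n0
    diff = p1 - p0

    # z-test: standard error under the null of equal proportions (pooled).
    pooled = (m1 + m0) / (n1 + n0)
    se_null = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n0))
    if se_null == 0:
        raise ValueError("pooled proportion is 0 or 1; the z-test is undefined")
    z = diff / se_null
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))

    # Wald interval: unpooled standard error of the difference.
    se = sqrt(p1 * (1 - p1) / n1 + p0 * (1 - p0) / n0)
    crit = NormalDist().inv_cdf(0.5 + level / 2)
    return diff, p_value, (diff - crit * se, diff + crit * se)


# Synthetic counts: 60 of 100 co-authored PRs merged vs. 55 of 100 autonomous.
diff, p_value, (ci_low, ci_high) = two_proportion_test(60, 100, 55, 100)
print(f"gap = {diff * 100:+.1f} pp, p = {p_value:.2f}, "
      f"95% CI = [{ci_low * 100:+.1f}, {ci_high * 100:+.1f}] pp")
```

With these counts the interval spans zero, which is the sense in which a gap is described as not statistically clear. The independence assumption is the fragile part: PRs from the same repository are unlikely to be independent, which is one reason the within-repo controls matter.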

