arxiv: 2512.19644 · v3 · submitted 2025-12-22 · 💻 cs.SE · cs.HC

A survey of generative AI adoption and perceived productivity among scientists who program

Pith reviewed 2026-05-16 20:17 UTC · model grok-4.3

keywords: generative AI, scientific programming, perceived productivity, code acceptance, survey, development practices, programmer experience, ChatGPT

The pith

The volume of AI-generated code scientists accept at once is the strongest predictor of their perceived productivity gains.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

A survey of 868 scientists who program shows that generative AI adoption is highest among students and less experienced users, and that respondents overall favor general conversational tools like ChatGPT over specialized developer options. Perceived productivity rises with the number of lines of generated code typically accepted in one interaction, and this association is stronger among those with limited programming experience or infrequent use of practices such as testing, code review, and version control. The patterns suggest users gauge tool value mainly by output volume rather than by validation or integration quality. These results matter because programming underpins modern scientific work, so how researchers adopt and evaluate AI assistance can shape both the speed and reliability of research outputs.

Core claim

A survey of 868 scientific programmers finds that adoption of generative AI for coding is highest among students and less experienced programmers, and that respondents strongly prefer conversational interfaces over developer-specific tools. Both inexperience and limited use of formal development practices are associated with greater perceived productivity, though these factors interact. The strongest predictor of perceived productivity is the number of lines of generated code typically accepted at once, indicating that scientific programmers may assess tool value primarily through generation volume rather than through subsequent validation or integration efforts.

What carries the argument

The association between perceived productivity and the typical number of lines of AI-generated code accepted per interaction, with interactions between programmer experience and use of development practices.
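
As a rough illustration only (the paper's actual model specification is not reproduced on this page), an association of this kind could be written as a linear model with an interaction term. All symbols are assumptions introduced here: y is perceived productivity, L the typical number of lines accepted at once, E an indicator of inexperience, and P an indicator of limited use of development practices.

```latex
\begin{aligned}
y_i &= \beta_0 + \beta_1 L_i + \beta_2 E_i + \beta_3 P_i + \beta_4 \,(E_i \times P_i) + \varepsilon_i \\
\mathbb{E}[y_i \mid E_i = 1, P_i, L_i] - \mathbb{E}[y_i \mid E_i = 0, P_i, L_i] &= \beta_2 + \beta_4 P_i
\end{aligned}
```

Under this sketch, the difference associated with inexperience is not a single number: it shifts by \beta_4 depending on P, which is what it means for the two factors to interact.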

If this is right

  • Inexperienced programmers report the largest perceived gains, suggesting AI tools may lower entry barriers into scientific coding.
  • The interaction between experience and practices implies that adopting testing or version control can moderate perceived productivity differences.
  • Emphasis on code generation volume over validation may increase later debugging costs in research projects.
  • Field variation in adoption indicates that recommendations for AI tools should account for domain-specific workflows.
  • Preference for general-purpose interfaces over specialized ones highlights usability as a primary driver of tool choice.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If high acceptance volumes correlate with lower code quality, scientific projects risk accumulating technical debt from unverified AI contributions.
  • Tool interfaces could be redesigned to encourage review of smaller segments, potentially aligning perceived and actual productivity more closely.
  • Educational programs for scientific computing might add explicit training on validating AI-generated code to offset the observed associations with inexperience.
  • Longitudinal tracking of research outputs would test whether volume-based acceptance leads to faster discovery or hidden delays in verification.

Load-bearing premise

Self-reported perceived productivity accurately reflects actual productivity gains and is not driven by unmeasured confounders such as field-specific norms or individual motivation.

What would settle it

A controlled study that objectively tracks code correctness, debugging time, and overall project completion rates for AI-assisted versus non-assisted scientific tasks, then compares those measures directly to participants' self-reported productivity scores.
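
The final comparison step could look something like the sketch below, assuming such a study had produced one objective measure and one self-reported score per participant. The function names and the sample values are hypothetical and are not data from the paper. Spearman rank correlation is used here as one reasonable choice for an ordinal Likert score, not as a method the authors report.

```python
from typing import Sequence


def average_ranks(values: Sequence[float]) -> list[float]:
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical example values, for illustration only.
self_reported_score = [5, 4, 4, 3, 2, 5]               # Likert item, 1 to 5
debugging_hours = [6.0, 3.5, 4.0, 2.0, 2.5, 7.0]       # objective measure

rho = spearman(self_reported_score, debugging_hours)
print(f"Spearman rho between self-report and debugging time: {rho:.2f}")
```

A coefficient near zero, or one with the wrong sign for a given objective measure, would indicate that self-reported productivity does not track that measure.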

Original abstract

Programming is essential to modern scientific research, yet most scientists report inadequate training for the software development their work demands. Generative AI tools capable of code generation may support scientific programmers, but user studies indicate risks of over-reliance, particularly among inexperienced users. We surveyed 868 scientists who program, examining adoption patterns, tool preferences, and factors associated with perceived productivity. Adoption is highest among students and less experienced programmers, with variation across fields. Scientific programmers overwhelmingly prefer general-purpose conversational interfaces like ChatGPT over developer-specific tools. Both inexperience and limited use of development practices (like testing, code review, and version control) are associated with greater perceived productivity -- but these factors interact, suggesting formal practices may partially compensate for inexperience. The strongest predictor of perceived productivity is the number of lines of generated code typically accepted at once. These findings suggest scientific programmers using generative AI may gauge productivity by code generation rather than validation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript reports findings from an online survey of 868 scientists who engage in programming as part of their research. It describes adoption rates of generative AI code generation tools, noting higher adoption among students and less experienced programmers, and a strong preference for general-purpose conversational interfaces such as ChatGPT. The analysis identifies associations between perceived productivity gains and factors including inexperience, limited adherence to software development practices (e.g., testing, code review, version control), and the typical number of lines of AI-generated code accepted per interaction. The strongest predictor is reported as the volume of code accepted at once, with interactions suggesting that formal practices may mitigate risks associated with inexperience. The authors conclude that scientific programmers may be assessing productivity primarily through code generation volume rather than through validation or integration processes.

Significance. If the reported associations are robust, the survey provides timely empirical data on how generative AI is being integrated into scientific workflows, highlighting potential disparities in adoption and perceived benefits across experience levels. This could inform training programs and tool design for scientific computing. However, the reliance on unvalidated self-reported measures means the significance is primarily descriptive of perceptions rather than causal impacts on actual productivity.

major comments (3)
  1. Abstract and Results: The identification of the number of lines of generated code accepted at once as the strongest predictor of perceived productivity rests on regression models using a single unvalidated self-reported Likert item as the outcome; no correlation with external proxies (commit velocity, bug rates, or project completion times) is reported, which is load-bearing for the central claim that this metric reflects genuine productivity differences rather than reporting bias.
  2. Methods: The cross-sectional survey design reports associations between inexperience, limited development practices, and higher perceived productivity without documented controls for confounders such as motivation, field norms, or self-selection into the sample; this omission directly affects the interpretability of the interaction terms highlighted in the abstract.
  3. Results: The claim that formal development practices partially compensate for inexperience is derived from interaction coefficients on subjective responses, yet the manuscript provides no sensitivity analyses or robustness checks against overestimation by inexperienced respondents, undermining the load-bearing interpretation offered in the discussion.
minor comments (2)
  1. Abstract: The sample size (868) is stated; add any response rate information to give readers immediate context on potential non-response bias.
  2. Tables/Figures: Ensure all regression output tables report exact coefficients, standard errors, confidence intervals, and p-values for the key predictors and interaction terms rather than summary statements alone.

Simulated Author's Rebuttal

3 responses · 1 unresolved

We thank the referee for their constructive comments on our survey of generative AI adoption among scientific programmers. The feedback underscores key limitations of self-reported, cross-sectional data, which we address point by point below. We propose targeted revisions to enhance transparency while noting constraints imposed by the original survey design.

Point-by-point responses
  1. Referee: Abstract and Results: The identification of the number of lines of generated code accepted at once as the strongest predictor of perceived productivity rests on regression models using a single unvalidated self-reported Likert item as the outcome; no correlation with external proxies (commit velocity, bug rates, or project completion times) is reported, which is load-bearing for the central claim that this metric reflects genuine productivity differences rather than reporting bias.

    Authors: We agree that the productivity outcome relies on a single unvalidated self-reported Likert item and that no external proxies were collected. The survey focused on perceptions and did not include objective measures such as commit velocity or bug rates. We will revise the abstract, results, and discussion to explicitly frame all findings as relating to perceived productivity, discuss potential reporting biases, and add a dedicated limitations subsection on this issue. revision: partial

  2. Referee: Methods: The cross-sectional survey design reports associations between inexperience, limited development practices, and higher perceived productivity without documented controls for confounders such as motivation, field norms, or self-selection into the sample; this omission directly affects the interpretability of the interaction terms highlighted in the abstract.

    Authors: The cross-sectional design inherently limits causal claims and full confounder control. We will expand the methods section to detail the demographic and field controls that were included in the regressions and revise the discussion to explicitly note the absence of controls for motivation or self-selection as a limitation on interpreting the interaction terms. revision: partial

  3. Referee: Results: The claim that formal development practices partially compensate for inexperience is derived from interaction coefficients on subjective responses, yet the manuscript provides no sensitivity analyses or robustness checks against overestimation by inexperienced respondents, undermining the load-bearing interpretation offered in the discussion.

    Authors: We will add sensitivity analyses and robustness checks to the results section, including stratification by experience level and alternative model specifications to evaluate potential overestimation. These additions will be reported to support the interpretation of the interaction effects. revision: yes
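
A stratified check of the kind promised in the third response might be sketched as follows. This is an assumption about what such an analysis could look like, not the authors' code: the function names, the stratum labels, and the rows are hypothetical, and treating a Likert item as a continuous outcome is a simplification made to keep the example short.

```python
from collections import defaultdict
from typing import Iterable, Sequence


def ols_slope(x: Sequence[float], y: Sequence[float]) -> float:
    """Slope of the least-squares line of y on x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    if sxx == 0:
        raise ValueError("x has no variation within this stratum")
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return sxy / sxx


def slopes_by_stratum(
    rows: Iterable[tuple[str, float, float]]
) -> dict[str, float]:
    """Fit a separate slope of outcome on predictor within each stratum."""
    xs: dict[str, list[float]] = defaultdict(list)
    ys: dict[str, list[float]] = defaultdict(list)
    for stratum, predictor, outcome in rows:
        xs[stratum].append(predictor)
        ys[stratum].append(outcome)
    return {s: ols_slope(xs[s], ys[s]) for s in xs}


# Hypothetical rows: (experience stratum, lines accepted, perceived productivity).
survey_rows = [
    ("novice", 10, 2), ("novice", 50, 3), ("novice", 100, 4), ("novice", 200, 5),
    ("experienced", 10, 3), ("experienced", 50, 3), ("experienced", 100, 4),
    ("experienced", 200, 4),
]

for stratum, slope in slopes_by_stratum(survey_rows).items():
    print(f"{stratum}: slope of perceived productivity on lines accepted = {slope:.4f}")
```

If the slope differs sharply between strata, the pooled association is being driven by one group, which is the overestimation concern the referee raises.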

standing simulated objections not resolved
  • We cannot add correlations with external productivity proxies (e.g., commit velocity or bug rates) because the survey did not collect such objective data.

Circularity Check

0 steps flagged

No circularity: observational survey reports direct associations without derivations or self-referential reductions

Full rationale

The paper is a cross-sectional survey of 868 respondents analyzing adoption patterns and associations with a single self-reported perceived productivity item via regression. No equations, fitted parameters renamed as predictions, self-citations used as load-bearing uniqueness theorems, or ansatzes appear in the derivation chain. All reported predictors (e.g., lines of code accepted, inexperience, development practices) and their interactions are computed directly from the survey responses; the analysis does not reduce any claimed result to its own inputs by construction. This is the standard non-circular outcome for descriptive survey work.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Claims rest on the assumption that self-reported survey responses from 868 participants validly capture behaviors and perceptions, with no free parameters or invented entities introduced.

axioms (1)
  • domain assumption: Self-reported survey responses accurately reflect actual tool usage, experience levels, and perceived productivity.
    Standard assumption in survey research but unverified by objective measures in the reported findings.


