
arxiv: 2607.02337 · v1 · pith:B2RVE3X3new · submitted 2026-07-02 · 💻 cs.SE

Developers' Experience with Generative AI Beyond Productivity Assessment -- Insights from an Empirical Mixed-Methods Field Study

Pith reviewed 2026-07-03 08:37 UTC · model grok-4.3

classification 💻 cs.SE
keywords generative AI · coding assistants · developer experience · mixed-methods study · task efficiency · perceived workload · interaction types

The pith

Developers experience reduced efficiency gains when combining in-code suggestions and chat prompts in a single task with generative AI tools.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper investigates professional developers' experiences with generative AI coding assistants in their actual work environments using a mixed-methods approach that combines controlled sessions with natural work periods. It establishes that while each interaction type—in-code suggestions or chat-based prompting—boosts efficiency and lowers workload on its own, using both together in one task reduces those advantages. Developers report satisfaction with the tools especially for structured and repetitive tasks, and the study itself increased their mindful use of AI. The findings highlight how task characteristics should determine the choice of interaction mode to optimize outcomes.

Core claim

In a mixed-methods field study, developers showed satisfaction with generative AI for monotonous, repetitive, and structured tasks and perceived efficiency and productivity gains. In-code suggestions and chat-based prompting each improved task efficiency and reduced perceived workload independently, but combining them within a single task diminished the benefits. A rule-of-thumb is proposed for selecting an interaction type based on task characteristics. During development-heavy tasks, perceived cognitive load arises from AI interaction while perceived productivity depends on AI output quality. Participation in the study positively influenced developers' awareness and intentional use of GenAI tools.

What carries the argument

The differentiated effects of in-code suggestions versus chat-based prompting interaction types, with diminishing benefits observed specifically from their combination within one task.

If this is right

  • Task characteristics should guide selection of a single interaction type to maximize efficiency and workload reduction.
  • In development-heavy tasks, perceived cognitive load stems from the act of interacting with AI rather than task demands alone.
  • Perceived productivity in development-heavy tasks hinges on the quality of AI-generated output.
  • Real-world mixed-methods designs combining controlled and natural periods capture experiences more accurately than purely lab-based studies.
  • Involvement in such studies can increase developers' awareness and lead to more intentional tool usage afterward.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Tool interfaces could be redesigned to minimize easy switching between modes mid-task, reducing the risk of unintended combinations.
  • The rule-of-thumb for mode selection might be tested for transferability across different programming domains or team structures.
  • Reflective participation in studies could serve as a lightweight intervention to improve how teams adopt and use AI coding tools.

Load-bearing premise

The study design assumes that combining controlled sessions with natural work periods provides an accurate representation of developers' genuine experiences and preferences without the study itself significantly altering behavior or introducing bias in self-reports.

What would settle it

A follow-up experiment that measures objective task completion time and error rates for developers randomly assigned to use only one interaction type versus both types on identical tasks, with participants unaware they are being compared on interaction mode.
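A minimal sketch of how such a follow-up could be analyzed, assuming a two-group design. It covers random assignment to a single-mode or combined-mode condition and a permutation test on the difference in mean task completion time. The function names, condition labels, and all numbers are hypothetical placeholders. None of them come from the paper, which collected no objective timing data.

```python
import random
from statistics import mean


def assign_conditions(participant_ids, conditions=("single_mode", "combined_mode"), seed=0):
    """Randomly assign participants to conditions in near-equal group sizes."""
    rng = random.Random(seed)
    shuffled = list(participant_ids)
    rng.shuffle(shuffled)
    # Round-robin over the shuffled order keeps the groups balanced.
    return {pid: conditions[i % len(conditions)] for i, pid in enumerate(shuffled)}


def permutation_test(group_a, group_b, n_permutations=10_000, seed=0):
    """Two-sided permutation test on the difference in group means."""
    rng = random.Random(seed)
    observed = mean(group_a) - mean(group_b)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = mean(pooled[:n_a]) - mean(pooled[n_a:])
        if abs(diff) >= abs(observed):
            extreme += 1
    # The add-one correction keeps the estimated p-value strictly positive.
    p_value = (extreme + 1) / (n_permutations + 1)
    return observed, p_value


# Hypothetical placeholder data: completion times in minutes (NOT study results).
assignment = assign_conditions(range(10))
single_mode_times = [31, 28, 35, 30, 27]
combined_mode_times = [36, 33, 40, 29, 38]
difference, p = permutation_test(single_mode_times, combined_mode_times)
print(f"mean difference = {difference:.1f} min, p = {p:.3f}")
```

A permutation test is used here because it makes no distributional assumptions, which suits the small samples typical of field studies. The same function could be applied to error counts instead of completion times.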

Original abstract

With the growing adoption of AI-powered coding assistants, organizations and developers are increasingly seeking to optimize their interaction with these tools. Prior research has largely focused on output quality and productivity gains, with limited attention paid to developers' well-being and interaction experiences. This paper presents a developer-centered empirical mixed-methods study to investigate how professional developers engage with Generative AI (GenAI) in their natural work environment. Controlled data collection sessions are combined with natural work periods. Results show that developers are generally satisfied with GenAI, particularly for monotonous, repetitive, and structured tasks, and report perceived efficiency and productivity gains. Copilot interaction type preferences differ by task type and complexity: While both in-code suggestions and chat-based prompting independently improve task efficiency and reduce perceived workload, combining these interaction types within a single task diminishes benefits. We propose a rule-of-thumb for selecting an interaction type based on task characteristics. During development-heavy tasks, results indicate that perceived cognitive load arises from AI interaction, while perceived productivity depends on AI output quality. Participation in this study positively influenced developers' awareness and intentional use of GenAI tools. These findings demonstrate the value of real-world, mixed-methods study designs to understand GenAI tools and developers' experiences with them.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper reports results from a mixed-methods empirical field study of professional developers using generative AI coding assistants. Controlled sessions are combined with natural work periods to examine interaction experiences, satisfaction, efficiency, and workload. Key claims include general satisfaction especially for repetitive tasks, independent efficiency gains from in-code suggestions and chat-based prompting, but diminished benefits when both are combined in one task; a rule-of-thumb for selecting interaction type by task characteristics is proposed. Participation itself increased awareness and intentional tool use.

Significance. If the central claims on interaction-type effects hold, the study adds value by shifting focus from productivity metrics alone to well-being, preferences, and real-world usage patterns. The mixed-methods design in natural settings is a methodological strength for ecological validity in software engineering research on GenAI tools.

major comments (2)
  1. [Abstract] Abstract: the claim that combining in-code suggestions and chat-based prompting 'diminishes benefits' is load-bearing for the rule-of-thumb proposal yet rests entirely on self-reported efficiency and workload data; no objective task-time or error-rate measures are referenced, leaving the finding vulnerable to the study-induced awareness effect explicitly noted in the abstract.
  2. [Methods] Methods/Results (study design): the weakest assumption—that the mixed controlled-plus-natural protocol accurately captures genuine experiences without systematic alteration of behavior—is not addressed despite the abstract stating that participation increased awareness and intentional use; this directly risks demand characteristics or Hawthorne effects confounding the single-mode vs. combined-mode comparisons.
minor comments (2)
  1. [Abstract] Abstract and methods sections should report sample size, response rates, and details of the statistical analysis used to support efficiency and workload claims.
  2. [Discussion] The rule-of-thumb is presented without explicit criteria or examples tied to the collected data; a table or figure illustrating the mapping from task characteristics to recommended interaction type would improve clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their detailed and insightful comments on our manuscript. We address each of the major comments below, proposing revisions to strengthen the paper where the concerns are valid.

Point-by-point responses
  1. Referee: [Abstract] Abstract: the claim that combining in-code suggestions and chat-based prompting 'diminishes benefits' is load-bearing for the rule-of-thumb proposal yet rests entirely on self-reported efficiency and workload data; no objective task-time or error-rate measures are referenced, leaving the finding vulnerable to the study-induced awareness effect explicitly noted in the abstract.

    Authors: We acknowledge that our key finding regarding the diminished benefits of combining interaction modes relies on self-reported efficiency and workload data, as the study emphasizes developers' subjective experiences rather than objective performance metrics. No task completion times or error rates were collected, which limits the ability to corroborate the self-reports. We will revise the abstract to qualify this claim and add a discussion of the potential study-induced awareness effect as a limitation, thereby tempering the rule-of-thumb proposal accordingly.

    revision: partial

  2. Referee: [Methods] Methods/Results (study design): the weakest assumption—that the mixed controlled-plus-natural protocol accurately captures genuine experiences without systematic alteration of behavior—is not addressed despite the abstract stating that participation increased awareness and intentional use; this directly risks demand characteristics or Hawthorne effects confounding the single-mode vs. combined-mode comparisons.

    Authors: The referee correctly identifies a potential issue with our study design. Although we report that participation increased awareness, we did not explicitly address the risk of demand characteristics or Hawthorne effects in the methods or limitations sections. This could indeed confound comparisons between interaction modes. We will add a new subsection in the limitations discussing these threats to validity and how the mixed controlled-natural protocol was intended to balance ecological validity with some control, while noting that complete elimination of such effects is challenging in field studies.

    revision: yes

Circularity Check

0 steps flagged

Empirical study with no mathematical derivations or self-referential predictions

Full rationale

This is a mixed-methods empirical field study whose claims rest on data collected from controlled sessions and natural work periods (interviews, observations, self-reports). No equations, fitted parameters, predictions derived from models, or uniqueness theorems appear in the paper. All load-bearing steps are direct inferences from the gathered evidence rather than reductions to inputs by construction, self-citation chains, or ansatzes. The design is self-contained against external benchmarks in the sense that findings are presented as observations from the study sample, not as derived theorems.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

As an empirical study, the claims rest on the validity of participant self-reports and the generalizability of the sample. No free parameters or new entities are introduced.

axioms (1)
  • domain assumption: Standard assumptions in mixed-methods research regarding the triangulation of quantitative and qualitative data.
    Invoked implicitly in the study design described in the abstract.

pith-pipeline@v0.9.1-grok · 5764 in / 1214 out tokens · 32044 ms · 2026-07-03T08:37:58.158500+00:00 · methodology

