To Copilot and Beyond: 22 AI Systems Developers Want Built
Pith reviewed 2026-05-10 17:47 UTC · model grok-4.3
The pith
Developers want AI to absorb assembly tasks around coding while keeping the core craft under their own control.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Developers wanted AI to absorb the assembly work surrounding their craft, never the craft itself. That boundary tracks where they locate professional identity, suggesting that the value of AI tooling may lie as much in where and how precisely it stops as in what it does. The survey data reveal a right-shift burden: developers need quality signals moved earlier in the workflow to match accelerating code generation, while demanding authority scoping, provenance tracking, uncertainty signaling, and least-privilege access in every system.
What carries the argument
Bounded delegation: the explicit pattern in which developers delegate surrounding assembly tasks to AI but retain authority over the core coding craft and professional judgment.
If this is right
- AI systems must embed quality signals earlier in the workflow to keep pace with faster code generation.
- Every desired system requires explicit authority scoping, provenance tracking, uncertainty signaling, and least-privilege access.
- The boundary of acceptable delegation is set by developers' sense of professional identity rather than by technical feasibility.
- Tool value depends as much on precise stopping points as on the tasks the AI performs.
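The constraint vocabulary in these bullets can be made concrete. As a minimal sketch (the task names and the policy class are invented for illustration, not taken from the paper), least-privilege means the delegation set starts empty, and authority scoping means the AI may act only on tasks the developer has explicitly granted, with craft tasks ungrantable by construction:

```python
from dataclasses import dataclass, field

# Hypothetical task taxonomy: "assembly" tasks are delegable,
# "craft" tasks stay with the developer. Names are illustrative only.
ASSEMBLY_TASKS = {"generate_boilerplate", "update_changelog", "draft_tests"}
CRAFT_TASKS = {"design_api", "choose_architecture", "write_core_logic"}

@dataclass
class DelegationPolicy:
    # Least-privilege: nothing is delegated until explicitly granted.
    granted: set = field(default_factory=set)

    def grant(self, task: str) -> None:
        if task in CRAFT_TASKS:
            raise PermissionError(f"{task!r} is craft work; it stays with the developer")
        self.granted.add(task)

    def may_run(self, task: str) -> bool:
        # Authority scoping: the agent acts only within the granted set.
        return task in self.granted

policy = DelegationPolicy()
policy.grant("draft_tests")
print(policy.may_run("draft_tests"))      # True
print(policy.may_run("write_core_logic")) # False
```

The point of the sketch is the asymmetry: assembly tasks are opt-in, while craft tasks cannot be opted in at all, which is where the bounded-delegation boundary lives.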
Where Pith is reading between the lines
- The same bounded-delegation logic may appear in other knowledge-work domains where identity is tied to judgment rather than output volume.
- Interfaces for these 22 systems will need persistent visual or textual markers that make the delegation boundary immediately legible to the user.
- As base models improve, the assembly-versus-craft distinction itself may shift, requiring periodic re-mapping of acceptable delegation limits.
Load-bearing premise
Self-reported desires from 860 Microsoft developers, processed through the described thematic analysis, accurately capture generalizable needs whose constraints will remain stable as AI capabilities advance.
What would settle it
A survey of developers outside Microsoft, or a repeat survey after a major jump in AI code-generation ability, would falsify the bounded-delegation claim if it showed widespread willingness to let AI perform core creative coding tasks.
read the original abstract
Developers spend roughly one-tenth of their workday writing code, yet most AI tooling targets that fraction. This paper asks what should be built for the rest. We surveyed 860 Microsoft developers to understand where they want AI support, and where they want it to stay out. Using a human-in-the-loop, multi-model council-based thematic analysis, we identify 22 AI systems that developers want built across five task categories. For each, we describe the problem it solves, what makes it hard to build, and the constraints developers place on its behavior. Our findings point to a growing right-shift burden in AI-assisted development: developers wanted systems that embed quality signals earlier in their workflow to keep pace with accelerating code generation, while enforcing explicit authority scoping, provenance, uncertainty signaling, and least-privilege access throughout. This tension reveals a pattern we call "bounded delegation": developers wanted AI to absorb the assembly work surrounding their craft, never the craft itself. That boundary tracks where they locate professional identity, suggesting that the value of AI tooling may lie as much in where and how precisely it stops as in what it does.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper reports a survey of 860 Microsoft developers and applies human-in-the-loop, multi-model council-based thematic analysis to identify 22 desired AI systems across five task categories. It describes the problems each system would solve, implementation challenges, and developer-imposed constraints (authority scoping, provenance, uncertainty signaling, least-privilege access), then interprets the results as evidence of a 'bounded delegation' pattern in which developers want AI to handle surrounding assembly work but not core professional craft.
Significance. If the taxonomy and bounded-delegation pattern hold beyond the sampled population, the work would be significant for AI-for-SE research by shifting attention from code-generation tools to broader workflow support, earlier quality-signal embedding, and explicit boundary mechanisms. The empirical grounding in developer self-reports and the explicit enumeration of constraints provide concrete design guidance that could influence both research prototypes and commercial tooling.
major comments (3)
- [Abstract] Abstract and implied Methods section: the description of the survey and thematic analysis states the sample size (860) and the use of a human-in-the-loop multi-model council but supplies no information on question design, response rate, inter-rater reliability, or any validation steps against external populations; without these details the 22-system taxonomy and the bounded-delegation interpretation rest on unexamined methodological choices.
- [Abstract] Abstract and Results: the central claim that the 22 systems and 'bounded delegation' pattern reflect general developer needs is undercut by the exclusive Microsoft sample; Microsoft-specific tooling, workflows, and culture may systematically shape the reported desires, so the taxonomy and the stability of the listed constraints (authority scoping, provenance, uncertainty signaling) cannot be assumed to transfer without additional evidence or explicit limitation statements.
- [Discussion] Discussion of bounded delegation: the interpretation that developers locate professional identity at the boundary between assembly work and craft relies on cross-sectional self-reports; the manuscript provides no longitudinal data or robustness checks to support the claim that these boundaries will remain stable as AI capabilities advance.
minor comments (1)
- [Abstract] Abstract: the five task categories are referenced but not enumerated; listing them would improve immediate readability.
Simulated Author's Rebuttal
We thank the referee for the constructive comments, which help clarify the scope and limitations of our work. We address each major point below and indicate planned revisions to the manuscript.
read point-by-point responses
Referee: [Abstract] Abstract and implied Methods section: the description of the survey and thematic analysis states the sample size (860) and the use of a human-in-the-loop multi-model council but supplies no information on question design, response rate, inter-rater reliability, or any validation steps against external populations; without these details the 22-system taxonomy and the bounded-delegation interpretation rest on unexamined methodological choices.
Authors: We agree that greater methodological transparency is needed. The full manuscript contains a Methods section, but we will expand it in revision to detail the survey question design (including exact prompts and branching logic), the achieved response rate, the implementation of the multi-model council (including how disagreements were resolved), any quantitative inter-rater reliability metrics, and steps taken to validate themes against external developer populations or prior literature. These additions will allow readers to evaluate the taxonomy and interpretation more rigorously. revision: yes
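For readers unfamiliar with the metric class this response promises to report, chance-corrected agreement statistics such as Cohen's kappa are the usual choice, though the manuscript does not say which one it uses. A minimal sketch with illustrative labels only, not the study's data:

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two raters over the same items."""
    assert len(rater_a) == len(rater_b) and rater_a
    n = len(rater_a)
    # Observed agreement: fraction of items both raters labeled identically.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement: chance overlap given each rater's label frequencies.
    ca, cb = Counter(rater_a), Counter(rater_b)
    expected = sum((ca[l] / n) * (cb[l] / n) for l in set(ca) | set(cb))
    return (observed - expected) / (1 - expected)

# Illustrative coding decisions, not data from the study.
a = ["assembly", "assembly", "craft", "assembly", "craft", "assembly"]
b = ["assembly", "craft", "craft", "assembly", "craft", "assembly"]
print(round(cohens_kappa(a, b), 3))  # 0.667
```

Here the raters agree on 5 of 6 items (0.833 observed), chance agreement is 0.5 given their label frequencies, and kappa corrects the former by the latter.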
Referee: [Abstract] Abstract and Results: the central claim that the 22 systems and 'bounded delegation' pattern reflect general developer needs is undercut by the exclusive Microsoft sample; Microsoft-specific tooling, workflows, and culture may systematically shape the reported desires, so the taxonomy and the stability of the listed constraints (authority scoping, provenance, uncertainty signaling) cannot be assumed to transfer without additional evidence or explicit limitation statements.
Authors: We accept this as a valid limitation. Although Microsoft employs developers across many product areas and geographies, the sample is organizationally bounded. In the revised manuscript we will add an explicit Limitations subsection and strengthen the Discussion to state that the taxonomy and constraint patterns are scoped to this population, note potential influences from internal tooling and culture, and refrain from claiming broad generalizability. We will retain the bounded-delegation framing as an observation within the sampled context rather than a universal claim. revision: partial
Referee: [Discussion] Discussion of bounded delegation: the interpretation that developers locate professional identity at the boundary between assembly work and craft relies on cross-sectional self-reports; the manuscript provides no longitudinal data or robustness checks to support the claim that these boundaries will remain stable as AI capabilities advance.
Authors: The bounded-delegation pattern is presented as an interpretive synthesis of the cross-sectional self-reports we collected, not as a longitudinal prediction. We will revise the Discussion to make this distinction explicit, acknowledge the absence of longitudinal or robustness data, and frame the finding as a snapshot of current developer preferences and identity boundaries. We will also add a forward-looking paragraph suggesting that future studies could track whether these boundaries shift with AI progress, while preserving the value of the present evidence for immediate design implications. revision: partial
Circularity Check
No circularity: empirical survey and thematic analysis with no derivations or self-referential reductions
full rationale
The paper reports results from a survey of 860 Microsoft developers followed by human-in-the-loop multi-model thematic analysis to surface 22 desired AI systems and the 'bounded delegation' pattern. No equations, fitted parameters, or derivation chains exist that could reduce outputs to inputs by construction. All claims rest directly on the collected self-reported responses and the coding process applied to them; the named pattern is an interpretive label for observed themes rather than a renamed or fitted input. No load-bearing self-citations or uniqueness theorems are invoked. The work is therefore self-contained as a descriptive empirical study.
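The council mechanism this rationale leans on is not specified in detail anywhere on this page. One plausible minimal form, sketched here purely as an assumption, is majority voting across model coders with human escalation on disagreement:

```python
from collections import Counter

def council_code(response: str, coders, quorum: float = 2 / 3):
    """Ask each model coder for a theme label; escalate low-consensus items.

    Hypothetical sketch of a human-in-the-loop, multi-model council step;
    the paper's actual protocol may differ.
    """
    votes = Counter(coder(response) for coder in coders)
    label, count = votes.most_common(1)[0]
    if count / len(coders) >= quorum:
        return label, "auto"          # council agrees strongly enough
    return None, "escalate-to-human"  # human-in-the-loop resolves

# Toy stand-in coders; real ones would call different LLMs.
coders = [lambda r: "assembly", lambda r: "assembly", lambda r: "craft"]
print(council_code("automate my changelog updates", coders))
# ('assembly', 'auto')
```

The design choice worth noting is that disagreement is surfaced rather than averaged away: items the council cannot settle go to a human coder, which is what keeps the loop human-in-the-loop.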
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Self-reported preferences collected via survey accurately reflect developers' true desires for AI system behavior and boundaries.
- domain assumption: The human-in-the-loop multi-model council thematic analysis produces unbiased and complete categorization of responses into the 22 systems.
Reference graph
Works this paper leans on
- [1] [n. d.]. Supplemental Package. https://cabird.github.io/22-systems-devs-want/
- [2] Sadia Afroz, Zixuan Feng, Katie Kimura, Bianca Trinkenreich, Igor Steinmacher, and Anita Sarma. 2025. Developer Productivity with GenAI. arXiv preprint arXiv:2510.24265 (2025).
- [3] Blake A Allan, Cassondra Batz-Barbarich, Haley M Sterling, and Louis Tay. 2019. Outcomes of meaningful work: A meta-analysis. Journal of Management Studies 56, 3 (2019), 500–528.
- [4] David Autor. 2022. The labor market impacts of technological change: From unbridled enthusiasm to qualified optimism to vast uncertainty. Technical Report. National Bureau of Economic Research.
- [5] Catherine Bailey, Ruth Yeoman, Adrian Madden, Marc Thompson, and Gary Kerridge. 2019. A review of the empirical literature on meaningful work: Progress and research agenda. Human Resource Development Review 18, 1 (2019), 83–113.
- [6] Sebastian Baltes, Marc Cheong, and Christoph Treude. 2026. "An Endless Stream of AI Slop": The Growing Burden of AI-Assisted Software Development. arXiv preprint arXiv:2603.27249 (2026).
- [7] Leonardo Banh, Florian Holldack, and Gero Strobel. 2025. Copiloting the future: How generative AI transforms Software Engineering. Information and Software Technology 183 (2025), 107751.
- [8] Christian Bird, Denae Ford, Thomas Zimmermann, Nicole Forsgren, Eirini Kalliamvakou, Travis Lowdermilk, and Idan Gazit. 2022. Taking Flight with Copilot: Early insights and opportunities of AI-powered pair-programming tools. Queue 20, 6 (2022), 35–57.
- [9] Rishi Bommasani, Drew A Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, et al. 2021. On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258 (2021).
- [10] Virginia Braun and Victoria Clarke. 2006. Using thematic analysis in psychology. Qualitative Research in Psychology 3, 2 (2006), 77–101.
- [11] Virginia Braun and Victoria Clarke. 2022. Conceptual and design thinking for thematic analysis. Qualitative Psychology 9, 1 (2022), 3.
- [12] Erik Brynjolfsson. 2022. The Turing trap: The promise & peril of human-like artificial intelligence. Daedalus 151, 2 (2022), 272–287.
- [13] Jenna Butler, Jina Suh, Sankeerti Haniyur, and Constance Hadley. 2025. Dear Diary: A randomized controlled trial of Generative AI coding tools in the workplace. In 2025 IEEE/ACM 47th International Conference on Software Engineering: Software Engineering in Practice (ICSE-SEIP). IEEE, 319–329.
- [14] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde De Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. 2021. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374 (2021).
- [15] Ruijia Cheng, Ruotong Wang, Thomas Zimmermann, and Denae Ford. 2023. "It would work for me too": How Online Communities Shape Software Developers' Trust in AI-Powered Code Generation Tools. ACM Transactions on Interactive Intelligent Systems (2023).
- [16]
- [17] Rudrajit Choudhuri, Bianca Trinkenreich, Rahul Pandita, Eirini Kalliamvakou, Igor Steinmacher, Marco Gerosa, Christopher Sanchez, and Anita Sarma. 2025. What Guides Our Choices? Modeling Developers' Trust and Behavioral Intentions Towards GenAI. In 2025 IEEE/ACM 47th International Conference on Software Engineering (ICSE). IEEE, 1691–1703.
- [18] Rudrajit Choudhuri, Bianca Trinkenreich, Rahul Pandita, Eirini Kalliamvakou, Igor Steinmacher, Marco Gerosa, Christopher Sanchez, and Anita Sarma. 2025. What Needs Attention? Prioritizing Drivers of Developers' Trust and Adoption of Generative AI. arXiv preprint arXiv:2505.17418 (2025).
- [19] John W Creswell and Cheryl N Poth. 2016. Qualitative Inquiry and Research Design: Choosing Among Five Approaches. Sage Publications.
- [20]
- [21] Bent Flyvbjerg. 2006. Five misunderstandings about case-study research. Qualitative Inquiry 12, 2 (2006), 219–245.
- [22] Natasa Gisev, J Simon Bell, and Timothy F Chen. 2013. Interrater agreement and interrater reliability: key concepts, approaches, and applications. Research in Social and Administrative Pharmacy 9, 3 (2013), 330–338.
- [23] Kilem L Gwet. 2014. Handbook of Inter-Rater Reliability: The Definitive Guide to Measuring the Extent of Agreement Among Raters. Advanced Analytics, LLC.
- [24] J Richard Hackman and Greg R Oldham. 1976. Motivation through the design of work: Test of a theory. Organizational Behavior and Human Performance 16, 2 (1976), 250–279.
- [25] Brittany Johnson, Christian Bird, Denae Ford, Ebtesam Al Haque, Nicole Forsgren, and Thomas Zimmermann. [n. d.]. Facilitating Trust in AI-assisted Software Tools. ACM Transactions on Software Engineering and Methodology ([n. d.]).
- [26] Brittany Johnson, Christian Bird, Denae Ford, Nicole Forsgren, and Thomas Zimmermann. 2023. Make Your Tools Sparkle with Trust: The PICSE Framework for Trust in Software Tools. In 2023 IEEE/ACM 45th International Conference on Software Engineering: Software Engineering in Practice (ICSE-SEIP). IEEE, 409–419.
- [27] Eirini Kalliamvakou. 2024. A developer's second brain: Reducing complexity through partnership with AI.
- [28] Mansi Khemka and Brian Houck. 2024. Toward Effective AI Support for Developers: A survey of desires and concerns. Commun. ACM 67, 11 (2024), 42–49.
- [29]
- [30] Adam Kuper. 2004. The Social Science Encyclopedia. Routledge.
- [31] Stefano Lambiase, Gemma Catolino, Fabio Palomba, Filomena Ferrucci, and Daniel Russo. 2025. Exploring Individual Factors in the Adoption of LLMs for Specific Software Engineering Tasks. arXiv preprint arXiv:2504.02553 (2025).
- [32] Richard S Lazarus. 1991. Emotion and Adaptation. Oxford University Press.
- [33] Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, Michihiro Yasunaga, Yian Zhang, Deepak Narayanan, Yuhuai Wu, Ananya Kumar, et al. 2022. Holistic evaluation of language models. arXiv preprint arXiv:2211.09110 (2022).
- [35] Yue Liu, Ratnadira Widyasari, Yanjie Zhao, Ivana Clairine Irsan, and David Lo. 2026. Debt Behind the AI Boom: A Large-Scale Empirical Study of AI-Generated Code in the Wild. arXiv preprint arXiv:2603.28592 (2026).
- [37] André N Meyer, Earl T Barr, Christian Bird, and Thomas Zimmermann. 2019. Today was a good day: The daily life of software developers. IEEE Transactions on Software Engineering 47, 5 (2019), 863–880.
- [38] Courtney Miller, Rudrajit Choudhuri, Mara Ulloa, Sankeerti Haniyur, Robert DeLine, Margaret-Anne Storey, Emerson Murphy-Hill, Christian Bird, and Jenna L Butler. 2025. "Maybe We Need Some More Examples:" Individual and Team Drivers of Developer GenAI Tool Use. arXiv preprint arXiv:2507.21280 (2025).
- [39] Kate Niederhoffer, Gabriella Rosen Kellerman, Angela Lee, Alex Liebscher, Kristina Rapuano, and Jeffrey T Hancock. 2025. AI-generated "workslop" is destroying productivity. Harvard Business Review (2025).
- [40] Ike Obi, Jenna Butler, Sankeerti Haniyur, Brian Hassan, Margaret-Anne Storey, and Brendan Murphy. 2025. Identifying factors contributing to "bad days" for software developers: A mixed-methods study. In 2025 IEEE/ACM 47th International Conference on Software Engineering: Software Engineering in Practice (ICSE-SEIP). IEEE, 1–11.
- [41] Hammond Pearce, Baleegh Ahmad, Benjamin Tan, Brendan Dolan-Gavitt, and Ramesh Karri. 2022. Asleep at the keyboard? Assessing the security of GitHub Copilot's code contributions. In 2022 IEEE Symposium on Security and Privacy (SP). IEEE, 754–768.
- [42] Hammond Pearce, Baleegh Ahmad, Benjamin Tan, Brendan Dolan-Gavitt, and Ramesh Karri. 2025. Asleep at the keyboard? Assessing the security of GitHub Copilot's code contributions. Commun. ACM 68, 2 (2025), 96–105.
- [43] Guilherme Vaz Pereira, Victoria Jackson, Rafael Prikladnicki, André van der Hoek, Luciane Fortes, Carolina Araújo, André Coelho, Ligia Chelli, and Diego Ramos. 2025. Exploring GenAI in Software Development: Insights from a Case Study in a Large Brazilian Company. In 2025 IEEE/ACM 47th International Conference on Software Engineering: Software Engineering in Practice (ICSE-SEIP). IEEE, 330–341.
- [45] Teade Punter, Marcus Ciolkowski, Bernd Freimut, and Isabel John. 2003. Conducting on-line surveys in software engineering. In 2003 International Symposium on Empirical Software Engineering (ISESE 2003). IEEE, 80–88.
- [46] Ira J Roseman and Craig A Smith. 2001. Appraisal theory. Appraisal Processes in Emotion: Theory, Methods, Research (2001), 3–19.
- [47] Daniel Russo. 2024. Navigating the complexity of generative AI adoption in software engineering. ACM Transactions on Software Engineering and Methodology (2024).
- [48] Hope Schroeder, Marianne Aubin Le Quéré, Casey Randazzo, David Mimno, and Sarita Schoenebeck. 2025. Large language models in qualitative research: uses, tensions, and intentions. In Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems. 1–17.
- [49] Ben Shneiderman. 2020. Human-centered artificial intelligence: Reliable, safe & trustworthy. International Journal of Human–Computer Interaction 36, 6 (2020), 495–504.
- [50] Margaret-Anne Storey, Thomas Zimmermann, Christian Bird, Jacek Czerwonka, Brendan Murphy, and Eirini Kalliamvakou. 2019. Towards a theory of software developer job satisfaction and perceived productivity. IEEE Transactions on Software Engineering 47, 10 (2019), 2125–2142.
- [51] Robert H Tai, Lillian R Bentley, Xin Xia, Jason M Sitt, Sarah C Fankhauser, Ana M Chicas-Mosier, and Barnas G Monteith. 2024. An examination of the use of large language models to aid analysis of textual data. International Journal of Qualitative Methods 23 (2024), 16094069241231168.
- [52] Eric Lansdown Trist and Kenneth W Bamforth. 1951. Some social and psychological consequences of the longwall method of coal-getting: An examination of the psychological situation and defences of a work group in relation to the social structure and technological content of the work system. Human Relations 4, 1 (1951), 3–38.
- [53]
- [54] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. 2022. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems 35 (2022), 24824–24837.
- [55] Shuai Wu, Xue Li, Yanna Feng, Yufang Li, and Zhijun Wang. 2026. Council Mode: Mitigating Hallucination and Bias in LLMs via Multi-Agent Consensus. arXiv preprint arXiv:2604.02923 (2026).
- [56]
- [57] Albert Ziegler, Eirini Kalliamvakou, X Alice Li, Andrew Rice, Devon Rifkin, Shawn Simister, Ganesh Sittampalam, and Edward Aftandilian. 2024. Measuring GitHub Copilot's Impact on Productivity. Commun. ACM (2024).