Human oversight of agentic systems in practice: Examining the oversight work, challenges, and heuristics of developers using software agents
Pith reviewed 2026-06-28 04:56 UTC · model grok-4.3
The pith
Developers oversee software agents through at least four forms of work, including proactive steps taken before agents act.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Through interviews with 17 experienced developers, the paper finds at least four forms of emergent oversight work: a priori control, co-planning, real-time monitoring, and post hoc review. Oversight work is not only reactive and retrospective, as portrayed in existing research, but also preventative and proactive. The study also documents situated challenges, such as difficulty reviewing agent-generated code, and the heuristics developers adopt in response, such as using test results as guarantees for code correctness.
What carries the argument
Four forms of emergent oversight work (a priori control, co-planning, real-time monitoring, post hoc review) that developers perform when using software agents.
Load-bearing premise
The patterns identified from interviews with 17 developers represent general practices that hold for developers using software agents beyond this sample.
What would settle it
A larger study of developers that finds oversight consists only of post hoc review with no instances of a priori control or co-planning would show the four forms are not general.
Original abstract
Autonomous software agents hold promise to increase developer productivity but make mistakes and exhibit novel failure modes, making human oversight central to successful human-agent collaboration. Existing research on agent oversight is largely conceptual; normative frameworks exist, but how users actually oversee agents is less known. In this paper, we bridge this gap by providing early empirical anchors for the theoretical discourse on agent oversight. Drawing on interviews with 17 experienced developers, we conduct an exploratory inquiry examining what forms of emergent oversight work developers perform, when, and how. We also document the oversight challenges developers face and the strategies they have started using to address them. We found at least four forms of emergent oversight work: a priori control, co-planning, real-time monitoring, and post hoc review. We show that oversight work is not only reactive and retrospective, as portrayed in existing research, but also preventative and proactive. We describe situated oversight challenges (e.g., difficulty reviewing agent-generated code) and outline heuristics developers adopt to address such challenges (e.g., using test results as guarantees for code correctness). We conclude with high-level takeaways, future research directions, implications for the human-centered design of software agents and for software engineering practice, and limitations of our research.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper reports an exploratory qualitative interview study with 17 experienced developers using software agents. It identifies four forms of emergent oversight work (a priori control, co-planning, real-time monitoring, and post hoc review), argues that oversight is preventative and proactive in addition to reactive, and documents situated challenges (e.g., reviewing agent-generated code) along with developer heuristics (e.g., using test results as guarantees). The work positions these findings as empirical anchors for theoretical discussions on human-agent collaboration in software engineering.
Significance. If the patterns hold beyond the sample, the study supplies concrete empirical observations that can ground normative frameworks on agent oversight, shift emphasis from purely retrospective review to proactive strategies, and inform both the design of agent interfaces and software engineering practices around human-agent teams.
major comments (2)
- [Methods] Methods section (study design and data collection): The manuscript states it draws on interviews with 17 experienced developers but provides no details on sampling strategy, recruitment criteria, participant diversity (company size, agent types, experience levels), interview protocol, or how themes were derived and validated. This directly undermines support for the central claim that the four forms represent general patterns of oversight work that extend beyond the sample and existing research portrayals.
- [Findings and Discussion] Findings and Discussion: The generalization that oversight 'is not only reactive and retrospective, as portrayed in existing research, but also preventative and proactive' and the identification of 'at least four forms' as emergent patterns rest on the 17 interviews being sufficient for transferable insights, yet no evidence of theoretical saturation, negative case analysis, or limitations on generalizability is reported. This is load-bearing for the paper's contrast with prior work.
minor comments (2)
- [Abstract] Abstract: The phrase 'we found at least four forms' should be clarified in the body to indicate whether these are presented as exhaustive categories or illustrative examples from the data.
- [Conclusion] The manuscript should include a dedicated limitations subsection that explicitly addresses the small sample size and exploratory nature when discussing implications for design and practice.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback. We address each major comment below and indicate where revisions will be made to improve methodological transparency and clarify the scope of our claims.
- Referee: [Methods] Methods section (study design and data collection): The manuscript states it draws on interviews with 17 experienced developers but provides no details on sampling strategy, recruitment criteria, participant diversity (company size, agent types, experience levels), interview protocol, or how themes were derived and validated. This directly undermines support for the central claim that the four forms represent general patterns of oversight work that extend beyond the sample and existing research portrayals.
  Authors: We agree that the methods section in the current manuscript lacks the requested details on sampling strategy, recruitment criteria, participant diversity, interview protocol, and the process of deriving and validating themes. This omission weakens the presentation of our exploratory findings. We will revise the methods section to incorporate these elements from our study records, which will provide better support for the observed patterns.
  revision: yes
- Referee: [Findings and Discussion] Findings and Discussion: The generalization that oversight 'is not only reactive and retrospective, as portrayed in existing research, but also preventative and proactive' and the identification of 'at least four forms' as emergent patterns rest on the 17 interviews being sufficient for transferable insights, yet no evidence of theoretical saturation, negative case analysis, or limitations on generalizability is reported. This is load-bearing for the paper's contrast with prior work.
  Authors: Our study is framed as exploratory rather than a formal grounded theory investigation, so we did not perform or report theoretical saturation or negative case analysis. The four forms are presented as emergent patterns observed in the data ('at least four'), and the contrast with prior work highlights proactive elements seen in our interviews. We will add an expanded limitations section that explicitly addresses sample size, the absence of saturation checks, and boundaries on generalizability and transferability. This will appropriately scope our claims while preserving the empirical observations from the data.
  revision: partial
Circularity Check
No circularity: empirical qualitative study reports new observations
The paper is an exploratory qualitative study drawing on 17 developer interviews to identify forms of oversight work. It makes no mathematical derivations, parameter fits, or load-bearing self-citations that reduce claims to prior inputs by construction. All listed patterns (self-definitional, fitted predictions, uniqueness theorems, ansatz smuggling) are absent; the central claims rest on fresh interview data rather than re-deriving from the paper's own inputs or citations.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Qualitative interview data from a convenience sample of experienced developers can be interpreted to identify general forms of oversight work.