Human oversight of agentic systems in practice: Examining the oversight work, challenges, and heuristics of developers using software agents

Mihaela Vorvoreanu; Samir Passi; Shipi Dhanorkar

arxiv: 2606.05391 · v1 · pith:ARSF75MHnew · submitted 2026-06-03 · 💻 cs.SE · cs.AI

Pith reviewed 2026-06-28 04:56 UTC · model grok-4.3

keywords: human oversight, software agents, agentic systems, developer practices, empirical study, human-AI collaboration, software engineering, oversight challenges

The pith

Developers oversee software agents with four forms of work that include proactive steps before agents act.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines how developers actually oversee autonomous software agents that promise productivity gains yet produce novel errors. Interviews with 17 experienced developers reveal four emergent oversight practices: a priori control, co-planning, real-time monitoring, and post hoc review. This establishes that oversight includes preventative and proactive elements in addition to the reactive and retrospective modes emphasized in earlier conceptual work. The findings matter because they supply empirical detail on real collaboration practices, which can guide the design of agents that better support human involvement throughout the workflow.

Core claim

Through interviews with 17 experienced developers, the paper finds at least four forms of emergent oversight work: a priori control, co-planning, real-time monitoring, and post hoc review. Oversight work is not only reactive and retrospective, as portrayed in existing research, but also preventative and proactive. The study documents situated challenges such as difficulty reviewing agent-generated code and heuristics developers adopt, such as using test results as guarantees for code correctness.

What carries the argument

Four forms of emergent oversight work (a priori control, co-planning, real-time monitoring, post hoc review) that developers perform when using software agents.

Load-bearing premise

The patterns identified from interviews with 17 developers represent general practices that hold for developers using software agents beyond this sample.

What would settle it

A larger study of developers that finds oversight consists only of post hoc review with no instances of a priori control or co-planning would show the four forms are not general.

Figures

Figures reproduced from arXiv: 2606.05391 by Mihaela Vorvoreanu, Samir Passi, Shipi Dhanorkar.

Original abstract

Autonomous software agents hold promise to increase developer productivity but make mistakes and exhibit novel failure modes, making human oversight central to successful human-agent collaboration. Existing research on agent oversight is largely conceptual; normative frameworks exist, but how users actually oversee agents is less known. In this paper, we bridge this gap by providing early empirical anchors for the theoretical discourse on agent oversight. Drawing on interviews with 17 experienced developers, we conduct an exploratory inquiry examining what forms of emergent oversight work developers perform, when, and how. We also document the oversight challenges developers face and the strategies they have started using to address them. We found at least four forms of emergent oversight work: a priori control, co-planning, real-time monitoring, and post hoc review. We show that oversight work is not only reactive and retrospective, as portrayed in existing research, but also preventative and proactive. We describe situated oversight challenges (e.g., difficulty reviewing agent-generated code) and outline heuristics developers adopt to address such challenges (e.g., using test results as guarantees for code correctness). We conclude with high-level takeaways, future research directions, implications for the human-centered design of software agents and for software engineering practice, and limitations of our research.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

This paper gives initial interview-based categories for how developers oversee agents, but the small sample and thin methods reporting limit how far the patterns can be taken.

The core takeaway is that interviews with 17 developers turned up four forms of oversight work: a priori control, co-planning, real-time monitoring, and post hoc review. The authors use this to argue that oversight is often proactive and preventative rather than only reactive, which differs from the conceptual literature they cite.

What stands out as new is the set of concrete categories and the heuristics developers actually use, such as treating test results as a stand-in for code correctness. The paper does a reasonable job of describing situated challenges like reviewing agent-generated code and tying them back to practice. It supplies the kind of early empirical detail that the abstract says is missing from normative frameworks.

The soft spots are clear. The abstract gives almost no information on sampling, interview protocol, or how themes were checked, and 17 interviews is a narrow base for claiming patterns that apply to developers more broadly. The stress-test concern about generalizability lands because the leap from these cases to statements about existing research and oversight work in general rests on unshown details. This is typical for exploratory qualitative work but still leaves the central claims under-supported.

The paper is aimed at researchers studying human-AI collaboration in software engineering who need practice descriptions to build on. It could be useful for people designing oversight tools or running follow-up studies.

It should go to peer review. The topic is timely and the observations are grounded enough to merit referee input on methods and scope, even if revisions will be needed.

Referee Report

2 major / 2 minor

Summary. The paper reports an exploratory qualitative interview study with 17 experienced developers using software agents. It identifies four forms of emergent oversight work (a priori control, co-planning, real-time monitoring, and post hoc review), argues that oversight is preventative and proactive in addition to reactive, and documents situated challenges (e.g., reviewing agent-generated code) along with developer heuristics (e.g., using test results as guarantees). The work positions these findings as empirical anchors for theoretical discussions on human-agent collaboration in software engineering.

Significance. If the patterns hold beyond the sample, the study supplies concrete empirical observations that can ground normative frameworks on agent oversight, shift emphasis from purely retrospective review to proactive strategies, and inform both the design of agent interfaces and software engineering practices around human-agent teams.

major comments (2)

[Methods] Methods section (study design and data collection): The manuscript states it draws on interviews with 17 experienced developers but provides no details on sampling strategy, recruitment criteria, participant diversity (company size, agent types, experience levels), interview protocol, or how themes were derived and validated. This directly undermines support for the central claim that the four forms represent general patterns of oversight work that extend beyond the sample and existing research portrayals.
[Findings and Discussion] Findings and Discussion: The generalization that oversight 'is not only reactive and retrospective, as portrayed in existing research, but also preventative and proactive' and the identification of 'at least four forms' as emergent patterns rest on the 17 interviews being sufficient for transferable insights, yet no evidence of theoretical saturation, negative case analysis, or limitations on generalizability is reported. This is load-bearing for the paper's contrast with prior work.

minor comments (2)

[Abstract] Abstract: The phrase 'we found at least four forms' should be clarified in the body to indicate whether these are presented as exhaustive categories or illustrative examples from the data.
[Conclusion] The manuscript should include a dedicated limitations subsection that explicitly addresses the small sample size and exploratory nature when discussing implications for design and practice.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. We address each major comment below and indicate where revisions will be made to improve methodological transparency and clarify the scope of our claims.

Referee: [Methods] Methods section (study design and data collection): The manuscript states it draws on interviews with 17 experienced developers but provides no details on sampling strategy, recruitment criteria, participant diversity (company size, agent types, experience levels), interview protocol, or how themes were derived and validated. This directly undermines support for the central claim that the four forms represent general patterns of oversight work that extend beyond the sample and existing research portrayals.

Authors: We agree that the methods section in the current manuscript lacks the requested details on sampling strategy, recruitment criteria, participant diversity, interview protocol, and the process of deriving and validating themes. This omission weakens the presentation of our exploratory findings. We will revise the methods section to incorporate these elements from our study records, which will provide better support for the observed patterns.
revision: yes
Referee: [Findings and Discussion] Findings and Discussion: The generalization that oversight 'is not only reactive and retrospective, as portrayed in existing research, but also preventative and proactive' and the identification of 'at least four forms' as emergent patterns rest on the 17 interviews being sufficient for transferable insights, yet no evidence of theoretical saturation, negative case analysis, or limitations on generalizability is reported. This is load-bearing for the paper's contrast with prior work.

Authors: Our study is framed as exploratory rather than a formal grounded theory investigation, so we did not perform or report theoretical saturation or negative case analysis. The four forms are presented as emergent patterns observed in the data ('at least four'), and the contrast with prior work highlights proactive elements seen in our interviews. We will add an expanded limitations section that explicitly addresses sample size, the absence of saturation checks, and boundaries on generalizability and transferability. This will appropriately scope our claims while preserving the empirical observations from the data.
revision: partial

Circularity Check

0 steps flagged

No circularity: empirical qualitative study reports new observations

The paper is an exploratory qualitative study drawing on 17 developer interviews to identify forms of oversight work. It makes no mathematical derivations, parameter fits, or load-bearing self-citations that reduce claims to prior inputs by construction. All listed patterns (self-definitional, fitted predictions, uniqueness theorems, ansatz smuggling) are absent; the central claims rest on fresh interview data rather than re-deriving from the paper's own inputs or citations.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that semi-structured interviews with 17 developers can reliably surface general oversight patterns; no free parameters or invented entities are introduced.

axioms (1)

domain assumption: Qualitative interview data from a convenience sample of experienced developers can be interpreted to identify general forms of oversight work.
Invoked implicitly when generalizing from the 17 interviews to broader claims about developer practices.

Reference graph

Works this paper leans on

159 extracted references · 85 canonical work pages · 2 internal anchors

[1]

Artificial Intelligence Act

E.U. Artificial Intelligence Act. 2024. Article 14: Human Oversight | EU Artificial Intelligence Act. https://artificialintelligenceact.eu/article/14/ [Online; accessed 2025-08-01]. Human Oversight of Agentic Systems in Practice 17

2024
[2]

Toufique Ahmed, Jatin Ganhotra, Rangeet Pan, Avraham Shinnar, Saurabh Sinha, and Martin Hirzel. 2025. Otter: Generating Tests from Issues to Validate SWE Patches. InForty-second International Conference on Machine Learning. https://openreview.net/forum?id=b0jYs6JOZu

2025
[3]

Ali Al-Kaswan, Maliheh Izadi, and Arie van Deursen. 2024. Traces of Memorisation in Large Language Models for Code. InProceedings of the IEEE/ACM 46th International Conference on Software Engineering(Lisbon, Portugal)(ICSE ’24). Association for Computing Machinery, New York, NY, USA, Article 78, 12 pages. doi:10.1145/3597503.3639133

work page doi:10.1145/3597503.3639133 2024
[4]

Anthropic. 2025. Anthropic Economic Index report: Uneven geographic and enterprise AI adoption — anthropic.com. https://www.anthropic.com/ research/anthropic-economic-index-september-2025-report. [Online; accessed 12-01-2026]

2025
[5]

Anthropic. 2025. Disrupting the first reported AI-orchestrated cyber espionage campaign. https://assets.anthropic.com/m/ec212e6566a0d47/ original/Disrupting-the-first-reported-AI-orchestrated-cyber-espionage-campaign.pdf [Online; accessed 2026-01-09]

2025
[6]

Hofman, Brendan Lucier, Chinmay Singh, Markus Mobius, Akshay Nambi, Archana Yadav, Kevin Gao, David M

Gagan Bansal, Wenyue Hua, Zezhou Huang, Adam Fourney, Amanda Swearngin, Will Epperson, Tyler Payne, Jake M. Hofman, Brendan Lucier, Chinmay Singh, Markus Mobius, Akshay Nambi, Archana Yadav, Kevin Gao, David M. Rothschild, Aleksandrs Slivkins, Daniel G. Goldstein, Hussein Mozannar, Nicole Immorlica, Maya Murad, Matthew Vogel, Subbarao Kambhampati, Eric Ho...

arXiv 2025
[7]

Gagan Bansal, Jennifer Wortman Vaughan, Saleema Amershi, Eric Horvitz, Adam Fourney, Hussein Mozannar, Victor Dibia, and Daniel S. Weld
[8]

arXiv:2412.10380 [cs.HC] https://arxiv.org/abs/2412.10380

Challenges in Human-Agent Communication. arXiv:2412.10380 [cs.HC] https://arxiv.org/abs/2412.10380

arXiv
[9]

Joseph R Biden. 2023. Executive order on the safe, secure, and trustworthy development and use of artificial intelligence.Presidential Actions (2023)

2023
[10]

Islem Bouzenia, Premkumar Devanbu, and Michael Pradel. 2025. RepairAgent: An Autonomous, LLM-Based Agent for Program Repair. In Proceedings of the IEEE/ACM 47th International Conference on Software Engineering(Ottawa, Ontario, Canada)(ICSE ’25). IEEE Press, 2188–2200. doi:10.1109/ICSE55347.2025.00157

work page doi:10.1109/icse55347.2025.00157 2025
[11]

Islem Bouzenia and Michael Pradel. 2025. Understanding Software Engineering Agents: A Study of Thought-Action-Result Trajectories. arXiv:2506.18824 [cs.SE] https://arxiv.org/abs/2506.18824

arXiv 2025
[12]

Michelle Brachman, Siya Kunde, Sarah Miller, Ana Fucs, Samantha Dempsey, Jamie Jabbour, and Werner Geyer. 2025. Building Appropriate Mental Models: What Users Know and Want to Know about an Agentic AI Chatbot. InProceedings of the 30th International Conference on Intelligent User Interfaces (IUI ’25). Association for Computing Machinery, New York, NY, USA...

work page doi:10.1145/3708359.3712071 2025
[13]

Zana Buçinca, Maja Barbara Malaya, and Krzysztof Z. Gajos. 2021. To Trust or to Think: Cognitive Forcing Functions Can Reduce Overreliance on AI in AI-assisted Decision-making.Proc. ACM Hum.-Comput. Interact.5, CSCW1, Article 188 (April 2021), 21 pages. doi:10.1145/3449287

work page internal anchor Pith review doi:10.1145/3449287 2021
[14]

Stefan Buijsman and Herman Veluwenkamp. 2023. Spotting When Algorithms Are Wrong.Minds and Machines33, 4 (01 Dec 2023), 541–562. doi:10.1007/s11023-022-09591-0

work page doi:10.1007/s11023-022-09591-0 2023
[15]

Ruisheng Cao, Fangyu Lei, Haoyuan Wu, Jixuan Chen, Yeqiao Fu, Hongcheng Gao, Xinzhuang Xiong, Hanchong Zhang, Wenjing Hu, Yuchen Mao, et al. 2024. Spider2-v: How far are multimodal agents from automating data science and engineering workflows?Advances in Neural Information Processing Systems37 (2024), 107703–107744

2024
[16]

Stephen Casper, Carson Ezell, Charlotte Siegmann, Noam Kolt, Taylor Lynn Curtis, Benjamin Bucknall, Andreas Haupt, Kevin Wei, Jérémy Scheurer, Marius Hobbhahn, Lee Sharkey, Satyapriya Krishna, Marvin Von Hagen, Silas Alberti, Alan Chan, Qinyi Sun, Michael Gerovitch, David Bau, Max Tegmark, David Krueger, and Dylan Hadfield-Menell. 2024. Black-Box Access i...

work page doi:10.1145/3630106.3659037 2024
[17]

Jonker, Jeroen van den Hoven, Deborah Forster, and Reginald L

Luciano Cavalcante Siebert, Maria Luce Lupetti, Evgeni Aizenberg, Niek Beckers, Arkady Zgonnikov, Herman Veluwenkamp, David Abbink, Elisa Giaccardi, Geert-Jan Houben, Catholijn M. Jonker, Jeroen van den Hoven, Deborah Forster, and Reginald L. Lagendijk. 2023. Meaningful human control: actionable properties for AI system development.AI and Ethics3, 1 (01 F...

work page doi:10.1007/s43681-022-00167-3 2023
[18]

Pan, Shuyi Yang, Lakshya A

Mert Cemri, Melissa Z. Pan, Shuyi Yang, Lakshya A. Agrawal, Bhavya Chopra, Rishabh Tiwari, Kurt Keutzer, Aditya Parameswaran, Dan Klein, Kannan Ramchandran, Matei Zaharia, Joseph E. Gonzalez, and Ion Stoica. 2025. Why Do Multi-Agent LLM Systems Fail? arXiv:2503.13657 [cs.AI]

Pith/arXiv arXiv 2025
[19]

Alan Chan, Carson Ezell, Max Kaufmann, Kevin Wei, Lewis Hammond, Herbie Bradley, Emma Bluemke, Nitarshan Rajkumar, David Krueger, Noam Kolt, Lennart Heim, and Markus Anderljung. 2024. Visibility into AI Agents. InProceedings of the 2024 ACM Conference on Fairness, Accountability, and Transparency(Rio de Janeiro, Brazil)(FAccT ’24). Association for Computi...

work page doi:10.1145/3630106.3658948 2024
[20]

Alan Chan, Rebecca Salganik, Alva Markelius, Chris Pang, Nitarshan Rajkumar, Dmitrii Krasheninnikov, Lauro Langosco, Zhonghao He, Yawen Duan, Micah Carroll, Michelle Lin, Alex Mayhew, Katherine Collins, Maryam Molamohammadi, John Burden, Wanru Zhao, Shalaleh Rismani, Konstantinos Voudouris, Umang Bhatt, Adrian Weller, David Krueger, and Tegan Maharaj. 202...

work page doi:10.1145/3593013.3594033 2023
[21]

Chacha Chen, Shi Feng, Amit Sharma, and Chenhao Tan. 2023. Machine Explanations and Human Understanding. InProceedings of the 2023 ACM Conference on Fairness, Accountability, and Transparency(Chicago, IL, USA)(FAccT ’23). Association for Computing Machinery, New York, NY, USA, 1. doi:10.1145/3593013.3593970

work page doi:10.1145/3593013.3593970 2023
[22]

Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott 18 Dhanorkar, Passi & Vorvoreanu Gray, Nick Ryder, Mikhail Pavlov, Alethea Power,...

Pith/arXiv arXiv 2021
[23]

Valerie Chen, Ameet Talwalkar, Robert Brennan, and Graham Neubig. 2025. Code with Me or for Me? How Increasing AI Automation Transforms Developer Workflows. arXiv:2507.08149 [cs.SE] https://arxiv.org/abs/2507.08149

arXiv 2025
[24]

Zhi Chen, Wei Ma, and Lingxiao Jiang. 2025. Beyond Final Code: A Process-Oriented Error Analysis of Software Development Agents in Real-World GitHub Scenarios. arXiv:2503.12374 [cs.SE] https://arxiv.org/abs/2503.12374

Pith/arXiv arXiv 2025
[25]

Runxiang Cheng, Michele Tufano, Jürgen Cito, José Cambronero, Pat Rondon, Renyao Wei, Aaron Sun, and Satish Chandra. 2025. Agentic Bug Reproduction for Effective Automated Program Repair at Google.CoRRabs/2502.01821 (2025). arXiv:2502.01821 doi:10.48550/ARXIV.2502.01821

work page doi:10.48550/arxiv.2502.01821 2025
[26]

2025.Human-in/on-the-Loop Design for Human Controllability

Ria Cheruvu. 2025.Human-in/on-the-Loop Design for Human Controllability. Springer Nature Singapore, Singapore, 1–47. doi:10.1007/978-981-97- 8440-0_75-1

work page doi:10.1007/978-981-97- 2025
[27]

Adele E. Clarke. 2016. Anticipation Work: Abduction, Simplification, Hope. InBoundary Objects and Beyond: Working with Leigh Star. The MIT Press. doi:10.7551/mitpress/10113.003.0007

work page doi:10.7551/mitpress/10113.003.0007 2016
[28]

Eric Corbett and Remi Denton. 2023. Interrogating the T in FAccT. InProceedings of the 2023 ACM Conference on Fairness, Accountability, and Transparency(Chicago, IL, USA)(FAccT ’23). Association for Computing Machinery, New York, NY, USA, 1624–1634. doi:10.1145/3593013.3594104

work page doi:10.1145/3593013.3594104 2023
[29]

Sam Cox. 2025. Code Execution Through Deception: Gemini AI CLI Hijack. https://tracebit.com/blog/code-exec-deception-gemini-ai-cli-hijack [Online; accessed 2026-01-09]

2025
[30]

Lorrie Faith Cranor. 2008. A framework for reasoning about the human in the loop. InProceedings of the 1st Conference on Usability, Psychology, and Security(San Francisco, California)(UPSEC’08). USENIX Association, USA, Article 1, 15 pages

2008
[31]

Gonzalez

Alejandro Cuadron, Dacheng Li, Wenjie Ma, Xingyao Wang, Yichuan Wang, Siyuan Zhuang, Shu Liu, Luis Gaspar Schroeder, Tian Xia, Huanzhi Mao, Nicholas Thumiger, Aditya Desai, Ion Stoica, Ana Klimovic, Graham Neubig, and Joseph E. Gonzalez. 2025. The Danger of Overthinking: Examining the Reasoning-Action Dilemma in Agentic Tasks. arXiv:2502.08235 [cs.AI] htt...

arXiv 2025
[32]

Nigel Daly. 2025. Managerial AI Skill Stacking: A New Professional skillset for the AI-Driven Workplace.A vailable at SSRN 5277762(2025), 11 pages. doi:10.2139/ssrn.5277762

work page doi:10.2139/ssrn.5277762 2025
[33]

Mike D’Arcy, Tom Hope, Larry Birnbaum, and Doug Downey. 2024. MARG: Multi-Agent Review Generation for Scientific Papers. arXiv:2401.04259 [cs.CL] https://arxiv.org/abs/2401.04259

arXiv 2024
[34]

Joseph Donia. 2022. Normative Logics of Algorithmic Accountability. InProceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency(Seoul, Republic of Korea)(FAccT ’22). Association for Computing Machinery, New York, NY, USA, 598. doi:10.1145/3531146.3533123

work page doi:10.1145/3531146.3533123 2022
[35]

Robillard

Ekwa Duala-Ekoko and Martin P. Robillard. 2012. Asking and answering questions about unfamiliar APIs: An exploratory study. In2012 34th International Conference on Software Engineering (ICSE). 266–276. doi:10.1109/ICSE.2012.6227187

work page doi:10.1109/icse.2012.6227187 2012
[36]

Vera Liao, Samir Passi, Mark O

Upol Ehsan, Q. Vera Liao, Samir Passi, Mark O. Riedl, and III Daumé, Hal. 2024. Seamful XAI: Operationalizing Seamful Design in Explainable AI. Proc. ACM Hum.-Comput. Interact.8, CSCW1, Article 119 (April 2024), 29 pages. doi:10.1145/3637396

work page doi:10.1145/3637396 2024
[37]

Upol Ehsan, Samir Passi, Koustuv Saha, Todd McNutt, Mark O Riedl, and Sara Alcorn. 2026. From Future of Work to Future of Workers: Addressing Asymptomatic AI Harms to Foster Dignified Human-AI Interaction. InProceedings of the 2026 CHI Conference on Human Factors in Computing Systems (CHI ’26). Association for Computing Machinery, New York, NY, USA, Artic...

work page doi:10.1145/3772318.3791081 2026
[38]

Hasan Ferit Eniser, Hanliang Zhang, Cristina David, Meng Wang, Maria Christakis, Brandon Paulsen, Joey Dodds, and Daniel Kroening. 2024. Towards Translating Real-World Code with LLMs: A Study of Translating to Rust.CoRR(2024)

2024
[39]

Lena Enqvist. 2023. ‘Human oversight’ in the EU artificial intelligence act: what, when and by whom?Law, Innovation and Technology15, 2 (2023), 508–535. doi:10.1080/17579961.2023.2245683

work page doi:10.1080/17579961.2023.2245683 2023
[40]

Will Epperson, Gagan Bansal, Victor C Dibia, Adam Fourney, Jack Gerrits, Erkang (Eric) Zhu, and Saleema Amershi. 2025. Interactive Debugging and Steering of Multi-Agent AI Systems. InProceedings of the 2025 CHI Conference on Human Factors in Computing Systems (CHI ’25). Association for Computing Machinery, New York, NY, USA, Article 156, 15 pages. doi:10....

work page doi:10.1145/3706598.3713581 2025
[41]

Deng, Zachary C

Michael Feffer, Anusha Sinha, Wesley H. Deng, Zachary C. Lipton, and Hoda Heidari. 2025.Red-Teaming for Generative AI: Silver Bullet or Security Theater?AAAI Press, 421–437

2025
[42]

Joel Frenette. 2023. Ensuring human oversight in high-performance AI systems: A framework for control and accountability.World Journal of Advanced Research and Reviews20, 2 (2023), 1507–1516

2023
[43]

Harold Garfinkel and Harvey Sacks. 1987. On formal structures of practical actions. InEthnomethodological studies of work, Harold Garkinkel (Ed.). Routledge, 165–198. https://doi.org/10.4324/9780203996867

work page doi:10.4324/9780203996867 1987
[44]

Juraj Gottweis, Wei-Hung Weng, Alexander Daryin, Tao Tu, Anil Palepu, Petar Sirkovic, Artiom Myaskovsky, Felix Weissenberger, Keran Rong, Ryutaro Tanno, Khaled Saab, Dan Popovici, Jacob Blum, Fan Zhang, Katherine Chou, Avinatan Hassidim, Burak Gokturk, Amin Vahdat, Pushmeet Kohli, Yossi Matias, Andrew Carroll, Kavita Kulkarni, Nenad Tomasev, Yuan Guan, Vi...

Pith/arXiv arXiv 2025
[45]

Voss, Marine Carpuat, and Hal Daumé III

Navita Goyal, Eleftheria Briakou, Amanda Liu, Connor Baumler, Claire Bonial, Jeffrey Micher, Clare R. Voss, Marine Carpuat, and Hal Daumé III. 2023. What Else Do I Need to Know? The Effect of Background Information on Users’ Reliance on QA Systems. arXiv:2305.14331 [cs.CL] https://arxiv.org/abs/2305.14331

arXiv 2023
[46]

Ben Green. 2022. The flaws of policies requiring human oversight of government algorithms.Computer Law & Security Review45 (2022), 105681. doi:10.1016/j.clsr.2022.105681

work page doi:10.1016/j.clsr.2022.105681 2022
[47]

Ben Green and Yiling Chen. 2019. The Principles and Limits of Algorithm-in-the-Loop Decision Making.Proc. ACM Hum.-Comput. Interact.3, CSCW, Article 50 (Nov. 2019), 24 pages. doi:10.1145/3359152

work page doi:10.1145/3359152 2019
[48]

Chengquan Guo, Xun Liu, Chulin Xie, Andy Zhou, Yi Zeng, Zinan Lin, Dawn Song, and Bo Li. 2024. RedCode: Risky Code Execution and Generation Benchmark for Code Agents. InAdvances in Neural Information Processing Systems, A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang (Eds.), Vol. 37. Curran Associates, Inc., 106190–106236...

work page doi:10.52202/079017-3369 2024
[49]

Hossein Hajipour, Keno Hassler, Thorsten Holz, Lea Schonherr, and Mario Fritz. 2024. CodeLMSec Benchmark: Systematically Evaluating and Finding Security Vulnerabilities in Black-Box Code Language Models . In2024 IEEE Conference on Secure and Trustworthy Machine Learning (SaTML). IEEE Computer Society, Los Alamitos, CA, USA, 684–709. doi:10.1109/SaTML59370...

work page doi:10.1109/satml59370.2024.00040 2024
[50]

Dan Hendrycks, Steven Basart, Saurav Kadavath, Mantas Mazeika, Akul Arora, Ethan Guo, Collin Burns, Samir Puranik, Horace He, Dawn Song, and Jacob Steinhardt. 2021. Measuring Coding Challenge Competence With APPS. InThirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2). https://openreview.net/forum?id=sD93GOzH3i5

2021
[51]

Thomas Henzinger, Mahyar Karimi, Konstantin Kueffner, and Kaushik Mallik. 2023. Runtime Monitoring of Dynamic Fairness Properties. In Proceedings of the 2023 ACM Conference on Fairness, Accountability, and Transparency(Chicago, IL, USA)(FAccT ’23). Association for Computing Machinery, New York, NY, USA, 604–614. doi:10.1145/3593013.3594028

work page doi:10.1145/3593013.3594028 2023
[52]

2009.The ETTO Principle: Efficiency-Thoroughness Trade-Off: Why Things That Go Right Sometimes Go Wrong

Erik Hollnagel. 2009.The ETTO Principle: Efficiency-Thoroughness Trade-Off: Why Things That Go Right Sometimes Go Wrong. CRC press

2009
[53]

Andreas Holzinger, Kurt Zatloukal, and Heimo Müller. 2025. Is human oversight to AI systems still possible?New Biotechnology85 (2025), 59–62. doi:10.1016/j.nbt.2024.12.003

work page doi:10.1016/j.nbt.2024.12.003 2025
[54]

Kosei Horikawa, Hao Li, Yutaro Kashiwa, Bram Adams, Hajimu Iida, and Ahmed E. Hassan. 2025. Agentic Refactoring: An Empirical Study of AI Coding Agents. arXiv:2511.04824 [[cs.SE](http://cs.se/)] https://arxiv.org/abs/2511.04824

arXiv 2025
[55]

Xinyi Hou, Yanjie Zhao, Yue Liu, Zhou Yang, Kailong Wang, Li Li, Xiapu Luo, David Lo, John Grundy, and Haoyu Wang. 2024. Large language models for software engineering: A systematic literature review.ACM Transactions on Software Engineering and Methodology33, 8 (2024), 1–79

2024
[56]

Kung-Hsiang Huang, Akshara Prabhakar, Sidharth Dhawan, Yixin Mao, Huan Wang, Silvio Savarese, Caiming Xiong, Philippe Laban, and Chien-Sheng Wu. 2025. CRMArena: Understanding the Capacity of LLM Agents to Perform Professional CRM Tasks in Realistic Environments. arXiv:2411.02305 [cs.CL] https://arxiv.org/abs/2411.02305

arXiv 2025
[57]

Lei Huang, Weijiang Yu, Weitao Ma, Weihong Zhong, Zhangyin Feng, Haotian Wang, Qianglong Chen, Weihua Peng, Xiaocheng Feng, Bing Qin, and Ting Liu. 2025. A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions.ACM Trans. Inf. Syst.43, 2, Article 42 (Jan. 2025), 55 pages. doi:10.1145/3703155

work page doi:10.1145/3703155 2025
[58]

Ruanqianqian Huang, Avery Reyna, Sorin Lerner, Haijun Xia, and Brian Hempel. 2025. Professional Software Developers Don’t Vibe, They Control: AI Agent Use for Coding in 2025. arXiv:2512.14012 [cs.SE] https://arxiv.org/abs/2512.14012

arXiv 2025
[59]

Srinivasan Iyer, Ioannis Konstas, Alvin Cheung, and Luke Zettlemoyer. 2018. Mapping Language to Code in Programmatic Context. InProceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Ellen Riloff, David Chiang, Julia Hockenmaier, and Jun’ichi Tsujii (Eds.). Association for Computational Linguistics, Brussels, Belgium, 1643...

work page doi:10.18653/v1/d18-1192 2018
[60]

Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik R Narasimhan. 2024. SWE-bench: Can Language Models Resolve Real-world Github Issues?. InThe Twelfth International Conference on Learning Representations. https://openreview.net/forum?id= VTF8yNQM66

2024
[61]

Harmanpreet Kaur, Eytan Adar, Eric Gilbert, and Cliff Lampe. 2022. Sensible AI: Re-imagining Interpretability and Explainability using Sensemaking Theory. InProceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency(Seoul, Republic of Korea)(FAccT ’22). Association for Computing Machinery, New York, NY, USA, 702–714. doi:10.1145/...

work page doi:10.1145/3531146.3533135 2022
[62]

i’m not sure, but

Sunnie S. Y. Kim, Q. Vera Liao, Mihaela Vorvoreanu, Stephanie Ballard, and Jennifer Wortman Vaughan. 2024. "I’m Not Sure, But... ": Examining the Impact of Large Language Models’ Uncertainty Expression on User Reliance and Trust. InProceedings of the 2024 ACM Conference on Fairness, Accountability, and Transparency(Rio de Janeiro, Brazil)(FAccT ’24). Asso...

work page doi:10.1145/3630106.3658941 2024
[63]

Rob Kling. 1980. Social Analyses of Computing: Theoretical Perspectives in Recent Empirical Research.ACM Comput. Surv.12, 1 (March 1980), 61–110. doi:10.1145/356802.356806

[64]

Artur Klingbeil, Cassandra Grützner, and Philipp Schreck. 2024. Trust and reliance on AI — An experimental study on the extent and costs of overreliance on AI. Computers in Human Behavior 160 (2024), 108352. doi:10.1016/j.chb.2024.108352

[65]

Riikka Koulu. 2020. Human control over automation: EU policy and AI ethics. European Journal of Legal Studies 12 (2020), 9–46

[66]

Riikka Koulu. 2020. Proceduralizing control and discretion: Human oversight in artificial intelligence policy. Maastricht Journal of European and Comparative Law 27, 6 (2020), 720–735. doi:10.1177/1023263X20978649

[67]

Victoria Krakovna, Jonathan Uesato, Vladimir Mikulik, Matthew Rahtz, Tom Everitt, Ramana Kumar, Zac Kenton, Jan Leike, and Shane Legg. 2020. Specification gaming: the flip side of AI ingenuity. https://deepmind.google/discover/blog/specification-gaming-the-flip-side-of-ai-ingenuity/ [Online; accessed 2025-09-01]

[68]

Aayush Kumar, Yasharth Bajpai, Sumit Gulwani, Gustavo Soares, and Emerson Murphy-Hill. 2025. Why AI Agents Still Need You: Findings from Developer-Agent Collaborations in the Wild. arXiv:2506.12347 [cs.SE] https://arxiv.org/abs/2506.12347

[69]

Kyriakos Kyriakou and Jahna Otterbacher. 2023. In humans, we trust. Discover Artificial Intelligence 3, 1 (12 Dec 2023), 44. doi:10.1007/s44163-023-00092-2

[70]

Vivian Lai, Chacha Chen, Alison Smith-Renner, Q. Vera Liao, and Chenhao Tan. 2023. Towards a Science of Human-AI Decision Making: An Overview of Design Space in Empirical Human-Subject Studies. In Proceedings of the 2023 ACM Conference on Fairness, Accountability, and Transparency (Chicago, IL, USA) (FAccT ’23). Association for Computing Machinery, New York,... doi:10.1145/3593013.3594087

[71]

Markus Langer, Kevin Baum, and Nadine Schlicker. 2024. Effective Human Oversight of AI-Based Systems: A Signal Detection Perspective on the Detection of Inaccurate and Unfair Outputs. Minds and Machines 35, 1 (05 Nov 2024), 1. doi:10.1007/s11023-024-09701-0

[72]

Thomas D. LaToza, Gina Venolia, and Robert DeLine. 2006. Maintaining mental models: a study of developer work habits. In Proceedings of the 28th International Conference on Software Engineering (Shanghai, China) (ICSE ’06). Association for Computing Machinery, New York, NY, USA, 492–501. doi:10.1145/1134285.1134355

[73]

Min Hun Lee and Chong Jun Chew. 2023. Understanding the Effect of Counterfactual Explanations on Trust and Reliance on AI for Human-AI Collaborative Clinical Decision Making. Proc. ACM Hum.-Comput. Interact. 7, CSCW2, Article 369 (Oct. 2023), 22 pages. doi:10.1145/3610218

[75]

Shanchao Liang, Spandan Garg, and Roshanak Zilouchian Moghaddam. 2025. The SWE-Bench Illusion: When State-of-the-Art LLMs Remember Instead of Reason. arXiv:2506.12286 [cs.AI] https://arxiv.org/abs/2506.12286

[76]

Gabriel Lima, Nina Grgić-Hlača, Jin Keun Jeong, and Meeyoung Cha. 2022. The Conflict Between Explainable and Accountable Decision-Making Algorithms. In Proceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency (Seoul, Republic of Korea) (FAccT ’22). Association for Computing Machinery, New York, NY, USA, 2103–2113. doi:10.1145/3531146.3534628

[77]

Michael R Lyu, Baishakhi Ray, Abhik Roychoudhury, Shin Hwei Tan, and Patanamon Thongtanunam. 2025. Automatic programming: Large language models and beyond. ACM Transactions on Software Engineering and Methodology 34, 5 (2025), 1–33

[78]

Ramesh Manuvinakurike, Emanuel Moss, Elizabeth Anne Watkins, Saurav Sahay, Giuseppe Raffa, and Lama Nachman. 2025. Thoughts without Thinking: Reconsidering the Explanatory Value of Chain-of-Thought Reasoning in LLMs through Agentic Pipelines. arXiv:2505.00875 [cs.AI] https://arxiv.org/abs/2505.00875

[79]

Marvin. [n. d.]. Marvin: The AI-Native Customer Feedback Repository — heymarvin.com. https://heymarvin.com/. [Online; accessed 12-01-2026]

[80]

Tula Masterman, Sandi Besen, Mason Sawtell, and Alex Chao. 2024. The Landscape of Emerging AI Agent Architectures for Reasoning, Planning, and Tool Calling: A Survey. arXiv:2404.11584 [cs.AI] https://arxiv.org/abs/2404.11584

[81]

Cecily Mauran. 2025. https://mashable.com/article/google-gemini-deletes-users-code

