
arxiv: 2605.18747 · v1 · pith:ENJXQKPKnew · submitted 2026-05-18 · 💻 cs.CL · cs.AI

Code as Agent Harness

Pith reviewed 2026-05-20 10:51 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords code as agent harness · LLM agents · agentic AI systems · multi-agent coordination · executable verification · harness mechanisms · stateful agents · AI agent infrastructure

The pith

Code serves as the harness that turns large language models into executable, verifiable, and stateful AI agents.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This survey frames code as the operational substrate for agentic AI systems rather than only a target output. It organizes the literature into three connected layers: the harness interface that links agents to reasoning, action, and environment modeling; the mechanisms of planning, memory, tool use, and feedback-driven control; and the scaling of these elements to multi-agent coordination through shared code artifacts. A sympathetic reader would care because the framing supplies a concrete roadmap for building agents that execute actions, verify their own work, and maintain consistent state over long horizons. The paper reviews methods across coding assistants, GUI automation, embodied agents, scientific discovery, and enterprise workflows while listing open challenges in evaluation, verification, and safety oversight.

Core claim

By centering code as the harness of agentic AI, this survey provides a unified roadmap toward executable, verifiable, and stateful AI agent systems. Code now supports agent reasoning, acting, environment modeling, and execution-based verification. The survey examines the harness interface, mechanisms for long-horizon execution, and scaling to multi-agent settings where shared code artifacts enable coordination and review.

What carries the argument

Code as agent harness: the unified view that centers code as the basis for agent infrastructure, organized across the three layers of interface, mechanisms, and multi-agent scaling.
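As an editorial illustration of the three properties named above (executable, verifiable, stateful), here is a minimal sketch. The `Harness` class, its `step` method, and the example actions are hypothetical names invented for this illustration and are not taken from the paper. Plain `exec` is used only for brevity; it is not a sandbox, and a real harness would need isolation.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Harness:
    """Hypothetical minimal harness: runs code actions, verifies them, keeps state."""
    state: dict = field(default_factory=dict)  # persists across steps
    log: list = field(default_factory=list)    # (action, ok) history

    def step(self, action: str, check: Callable[[dict], bool]) -> bool:
        # Work on a shallow copy so a failed action cannot rebind committed names.
        # (Mutable values could still be changed in place; a real harness would isolate more.)
        trial = dict(self.state)
        try:
            exec(action, {}, trial)   # the action is code, executed against the state
            ok = bool(check(trial))   # execution-based verification of the result
        except Exception:
            ok = False                # runtime errors count as failed verification
        if ok:
            self.state = trial        # commit only verified results
        self.log.append((action, ok))
        return ok


h = Harness()
first = h.step("total = 2 + 3", lambda s: s["total"] == 5)       # runs and verifies
second = h.step("total = total / 0", lambda s: True)             # raises, so rejected
third = h.step("total = total * 2", lambda s: s["total"] == 10)  # builds on committed state
```

The design choice worth noting is that state is committed only after the check passes, so the log records every attempt while the state reflects only verified steps.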

If this is right

  • Applications in GUI/OS automation and embodied agents gain reliability through code-based execution and feedback control.
  • Multi-agent systems achieve consistent shared state and verification via shared code artifacts.
  • Evaluation of agents must move beyond final task success to include verification under incomplete feedback.
  • Harness improvements can be made regression-free while supporting human oversight for safety-critical actions.
  • The same harness structure extends to scientific discovery, personalization, DevOps, and enterprise workflows.
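The second implication, consistent shared state and verification through shared code artifacts, can be sketched as follows. This is an editorial illustration under stated assumptions: `SharedArtifact`, `propose`, and `review_by_execution` are hypothetical names, and the version check is a generic optimistic-concurrency scheme chosen for brevity, not a mechanism attributed to the paper.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class SharedArtifact:
    """Hypothetical shared code artifact with a version counter."""
    source: str
    version: int = 0

    def propose(self, new_source: str, base_version: int,
                review: Callable[[str], bool]) -> bool:
        if base_version != self.version:
            return False              # stale proposal: another agent committed first
        if not review(new_source):
            return False              # reviewer rejected the candidate
        self.source = new_source
        self.version += 1
        return True


def review_by_execution(source: str) -> bool:
    """Hypothetical reviewer: executes the candidate and tests its `add` function."""
    namespace: dict = {}
    try:
        exec(source, namespace)       # not sandboxed; illustration only
        return namespace["add"](2, 3) == 5
    except Exception:
        return False


artifact = SharedArtifact("def add(a, b):\n    return a - b\n")
seen = artifact.version  # two agents both read version 0
a_ok = artifact.propose("def add(a, b):\n    return a + b\n", seen, review_by_execution)
b_ok = artifact.propose("def add(a, b):\n    return b + a\n", seen, review_by_execution)
c_ok = artifact.propose("def add(a, b):\n    return a * b\n", artifact.version,
                        review_by_execution)
```

Here the first proposal is accepted, the second is rejected as stale even though its code is correct, and the third is rejected by the execution-based review.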

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This structure implies that agent benchmarks should incorporate metrics for code executability and state consistency over time.
  • Designers could test whether code-centric harnesses reduce error accumulation in long-horizon tasks compared with purely language-based approaches.
  • The three-layer model suggests straightforward extensions to multimodal environments where code still manages execution and verification.
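One way the first extension could be made concrete is sketched below. This is an assumption-laden editorial example: `score_trajectory`, the two metric names, and the toy trajectory are invented here, and neither the paper nor any existing benchmark is claimed to define metrics this way.

```python
from typing import Callable


def score_trajectory(actions: list[str], invariant: Callable[[dict], bool]) -> dict:
    """Hypothetical metrics: share of actions that execute, and share of steps
    after which a caller-supplied state invariant still holds."""
    state: dict = {}
    ran = 0
    consistent = 0
    for action in actions:
        try:
            exec(action, {}, state)   # not sandboxed; illustration only
            ran += 1
        except Exception:
            pass                      # a failed action leaves the state as it was left
        try:
            consistent += bool(invariant(state))
        except Exception:
            pass                      # an invariant that cannot be evaluated counts as violated
    n = len(actions)
    return {
        "executability": ran / n if n else 0.0,
        "state_consistency": consistent / n if n else 0.0,
    }


trajectory = [
    "balance = 10",
    "balance -= 4",
    "balance -= undefined_fee",  # fails to execute (NameError)
    "balance -= 20",             # executes but drives the balance negative
    "balance += 1",
]
scores = score_trajectory(trajectory, lambda s: s["balance"] >= 0)
```

In this toy trajectory four of five actions execute, while the invariant holds after only three of five steps, which shows how the two measures can diverge from each other and from final task success.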

Load-bearing premise

Organizing the literature on code-based agent systems into the three specific layers of harness interface, mechanisms, and multi-agent scaling captures the essential structure without significant omissions or the need for additional dimensions.

What would settle it

A review that identifies a substantial set of code-enabled agent methods or applications that cannot be placed into any of the three layers would falsify the completeness of the proposed organization.

Original abstract

Recent large language models (LLMs) have demonstrated strong capabilities in understanding and generating code, from competitive programming to repository-level software engineering. In emerging agentic systems, code is no longer only a target output. It increasingly serves as an operational substrate for agent reasoning, acting, environment modeling, and execution-based verification. We frame this shift through the lens of agent harnesses and introduce code as agent harness: a unified view that centers code as the basis for agent infrastructure. To systematically study this perspective, we organize the survey around three connected layers. First, we study the harness interface, where code connects agents to reasoning, action, and environment modeling. Second, we examine harness mechanisms: planning, memory, and tool use for long-horizon execution, together with feedback-driven control and optimization that make harness reliable and adaptive. Third, we discuss scaling the harness from single-agent systems to multi-agent settings, where shared code artifacts support multi-agent coordination, review, and verification. Across these layers, we summarize representative methods and practical applications of code as agent harness, spanning coding assistants, GUI/OS automation, embodied agents, scientific discovery, personalization and recommendation, DevOps, and enterprise workflows. We further outline open challenges for harness engineering, including evaluation beyond final task success, verification under incomplete feedback, regression-free harness improvement, consistent shared state across multiple agents, human oversight for safety-critical actions, and extensions to multimodal environments. By centering code as the harness of agentic AI, this survey provides a unified roadmap toward executable, verifiable, and stateful AI agent systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper surveys recent work on LLMs and agentic systems, framing code not merely as generated output but as an operational 'agent harness' substrate for reasoning, acting, environment modeling, execution-based verification, and state management. It organizes the literature into three layers—harness interface (reasoning/action/environment), harness mechanisms (planning/memory/tool use/feedback/optimization), and multi-agent scaling (coordination/review/verification via shared code)—while summarizing applications across coding assistants, GUI/OS automation, embodied agents, scientific discovery, and enterprise workflows, and listing open challenges such as evaluation beyond task success, verification under incomplete feedback, and safety oversight.

Significance. If the three-layer taxonomy holds as a coherent organizing principle, the survey supplies a useful roadmap that connects LLM code capabilities with agent infrastructure, highlighting executable and verifiable systems. The paper draws on external prior work across domains rather than self-referential results, and its explicit listing of applications and challenges provides a concrete synthesis that could help researchers identify gaps in stateful, multi-agent code harnesses.

major comments (1)
  1. [Introduction / survey organization] Introduction and survey organization: The central claim that the three layers deliver a 'unified view' and 'unified roadmap' rests on the premise that this structure comprehensively captures code-based agent systems. However, the manuscript provides no explicit justification, comparative mapping, or discussion of why alternative dimensions (e.g., safety constraints, evaluation protocols, or regression testing) are subsumed within the layers rather than treated as orthogonal; the challenges section lists several of these topics separately without showing integration.
minor comments (2)
  1. [Abstract] Abstract and early sections: The phrase 'code as agent harness' is introduced as a new framing but would benefit from a concise contrast with related terms such as 'agent frameworks' or 'tool-augmented agents' to clarify novelty for readers.
  2. [Applications sections] Applications summary: When enumerating domains (coding assistants, embodied agents, etc.), a short table or bullet list with one representative citation per domain would improve scannability and allow readers to trace the claimed coverage.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the positive assessment of our survey and the constructive feedback recommending minor revision. We address the major comment point by point below.

Point-by-point responses
  1. Referee: [Introduction / survey organization] Introduction and survey organization: The central claim that the three layers deliver a 'unified view' and 'unified roadmap' rests on the premise that this structure comprehensively captures code-based agent systems. However, the manuscript provides no explicit justification, comparative mapping, or discussion of why alternative dimensions (e.g., safety constraints, evaluation protocols, or regression testing) are subsumed within the layers rather than treated as orthogonal; the challenges section lists several of these topics separately without showing integration.

    Authors: We thank the referee for this observation. Our three-layer taxonomy is motivated by the distinct functional roles code plays as an agent harness: the interface layer captures how code connects agents to reasoning, action, and environment modeling; the mechanisms layer addresses the operational components (planning, memory, tool use, feedback, and optimization) that enable reliable long-horizon execution; and the multi-agent scaling layer examines how shared code artifacts support coordination, review, and verification. This decomposition provides a natural progression from foundational capabilities to complex systems. Dimensions such as safety constraints, evaluation protocols, and regression testing are treated as cross-cutting concerns that appear within the layers (e.g., verification and feedback mechanisms in layer 2, human oversight and consistent state in layer 3) and are synthesized in the challenges section. We acknowledge, however, that the introduction does not explicitly justify this choice, provide a comparative mapping to alternative organizations, or demonstrate integration of the challenges back into the layers. In the revised manuscript we will add a short subsection in the introduction that (1) states the rationale for the taxonomy, (2) briefly contrasts it with orthogonal alternatives, and (3) clarifies how the listed challenges connect to and are addressed across the three layers. This addition will strengthen the claims of a unified view and roadmap.

    Revision: yes

Circularity Check

0 steps flagged

No circularity: survey organizes external literature without self-referential derivations

Full rationale

This paper is a literature survey that introduces a three-layer organizational perspective (harness interface, mechanisms, multi-agent scaling) to structure existing work on code-based agent systems. No equations, predictions, or derivations are present that could reduce to fitted inputs or self-definitions by construction. The framing is explicitly presented as a viewpoint for summarizing representative methods and applications drawn from prior external research, with open challenges listed separately. The central claim of a unified roadmap rests on this organizational synthesis rather than any load-bearing self-citation chain or ansatz that loops back to the paper's own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The survey rests on the domain assumption that code is transitioning from a target output to an operational substrate for agents. It introduces the conceptual entity of an 'agent harness' to structure the discussion. No free parameters appear because the work is not an empirical fit or derivation.

axioms (1)
  • Domain assumption: Code can serve as an effective operational substrate for agent reasoning, acting, environment modeling, and execution-based verification in LLM-based systems.
    This premise underpins the entire 'code as agent harness' perspective and is stated in the abstract as the basis for the unified view.
invented entities (1)
  • Agent harness (no independent evidence)
    Purpose: To provide a unified conceptual basis that centers code as the infrastructure for agent reasoning, action, and verification across single- and multi-agent settings.
    This is a new framing term introduced by the authors to organize the survey; the abstract supplies no independent falsifiable evidence or external validation for the entity itself.

pith-pipeline@v0.9.0 · 5959 in / 1589 out tokens · 71352 ms · 2026-05-20T10:51:36.481212+00:00 · methodology


Reference graph

Works this paper leans on

296 extracted references · 296 canonical work pages · 41 internal anchors

  1. [1]

    Evaluating Large Language Models Trained on Code

    Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde De Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code.arXiv preprint arXiv:2107.03374, 2021

  2. [2]

    Program Synthesis with Large Language Models

    Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, et al. Program synthesis with large language models. arXiv preprint arXiv:2108.07732, 2021

  3. [3]

    CodeGen: An Open Large Language Model for Code with Multi-Turn Program Synthesis

    Erik Nijkamp, Bo Pang, Hiroaki Hayashi, Lifu Tu, Huan Wang, Yingbo Zhou, Silvio Savarese, and Caiming Xiong. Codegen: An open large language model for code with multi-turn program synthesis. arXiv preprint arXiv:2203.13474, 2022

  4. [4]

    Competition-level code generation with alphacode.Science, 378(6624):1092–1097, 2022

    Yujia Li, David Choi, Junyoung Chung, Nate Kushman, Julian Schrittwieser, Rémi Leblond, Tom Eccles, James Keeling, Felix Gimeno, Agustin Dal Lago, et al. Competition-level code generation with alphacode.Science, 378(6624):1092–1097, 2022

  5. [5]

    SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

    Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. Swe-bench: Can language models resolve real-world github issues?arXiv preprint arXiv:2310.06770, 2023

  6. [6]

    Program of Thoughts Prompting: Disentangling Computation from Reasoning for Numerical Reasoning Tasks

    Wenhu Chen, Xueguang Ma, Xinyi Wang, and William W Cohen. Program of thoughts prompt- ing: Disentangling computation from reasoning for numerical reasoning tasks.arXiv preprint arXiv:2211.12588, 2022

  7. [7]

    Pal: Program-aided language models

    Luyu Gao, Aman Madaan, Shuyan Zhou, Uri Alon, Pengfei Liu, Yiming Yang, Jamie Callan, and Graham Neubig. Pal: Program-aided language models. InInternational conference on machine learning, pages 10764–10799. PMLR, 2023

  8. [8]

    Chain of code: Reasoning with a language model-augmented code emulator.arXiv preprint arXiv:2312.04474, 2023

    Chengshu Li, Jacky Liang, Andy Zeng, Xinyun Chen, Karol Hausman, Dorsa Sadigh, Sergey Levine, Li Fei-Fei, Fei Xia, and Brian Ichter. Chain of code: Reasoning with a language model-augmented code emulator.arXiv preprint arXiv:2312.04474, 2023

  9. [9]

    Do As I Can, Not As I Say: Grounding Language in Robotic Affordances

    Michael Ahn, Anthony Brohan, Noah Brown, Yevgen Chebotar, Omar Cortes, Byron David, Chelsea Finn, Chuyuan Fu, Keerthana Gopalakrishnan, Karol Hausman, et al. Do as i can, not as i say: Grounding language in robotic affordances.arXiv preprint arXiv:2204.01691, 2022

  10. [10]

    Code as policies: Language model programs for embodied control

    Jacky Liang, Wenlong Huang, Fei Xia, Peng Xu, Karol Hausman, Brian Ichter, Pete Florence, and Andy Zeng. Code as policies: Language model programs for embodied control. In2023 IEEE International conference on robotics and automation (ICRA), pages 9493–9500. IEEE, 2023

  11. [11]

    Intercode: Standardizing and benchmarking interactive coding with execution feedback.Advances in Neural Information Processing Systems, 36:23826–23854, 2023

    John Yang, Akshara Prabhakar, Karthik Narasimhan, and Shunyu Yao. Intercode: Standardizing and benchmarking interactive coding with execution feedback.Advances in Neural Information Processing Systems, 36:23826–23854, 2023

  12. [12]

    AgentBench: Evaluating LLMs as Agents

    Xiao Liu, Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, Hanyu Lai, Yu Gu, Hangliang Ding, Kaiwen Men, Kejuan Yang, et al. Agentbench: Evaluating llms as agents.arXiv preprint arXiv:2308.03688, 2023

  13. [13]

    Meta-Harness: End-to-End Optimization of Model Harnesses

    Yoonho Lee, Roshen Nair, Qizheng Zhang, Kangwook Lee, Omar Khattab, and Chelsea Finn. Meta- harness: End-to-end optimization of model harnesses.arXiv preprint arXiv:2603.28052, 2026. 67 Code as Agent Harness

  14. [15]

    Effective harnesses for long-running agents

    Justin Young. Effective harnesses for long-running agents. Anthropic Engineer- ing Blog, November 2025. URL https://www.anthropic.com/engineering/ effective-harnesses-for-long-running-agents. Accessed: 2026-05-11

  15. [16]

    Harness engineering: Leveraging codex in an agent-first world.https://openai

    Ryan Lopopolo. Harness engineering: Leveraging codex in an agent-first world.https://openai. com/index/harness-engineering/, 2026. OpenAI Engineering Blog, February 11, 2026. Ac- cessed: 2026-05-10

  16. [17]

    Agentic Context Engineering: Evolving Contexts for Self-Improving Language Models

    Qizheng Zhang, Changran Hu, Shubhangi Upasani, Boyuan Ma, Fenglu Hong, Vamsidhar Kamanuru, Jay Rainton, Chen Wu, Mengmeng Ji, Hanchen Li, et al. Agentic context engineering: Evolving contexts for self-improving language models.arXiv preprint arXiv:2510.04618, 2025

  17. [18]

    GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning

    Lakshya A Agrawal, Shangyin Tan, Dilara Soylu, Noah Ziems, Rishi Khare, Krista Opsahl-Ong, Arnav Singhvi, Herumb Shandilya, Michael J Ryan, Meng Jiang, et al. Gepa: Reflective prompt evolution can outperform reinforcement learning.arXiv preprint arXiv:2507.19457, 2025

  18. [19]

    Toolcoder: Teach code generation models to use api search tools.arXiv preprint arXiv:2305.04032, 2023

    Kechi Zhang, Huangzhao Zhang, Ge Li, Jia Li, Zhuo Li, and Zhi Jin. Toolcoder: Teach code generation models to use api search tools.arXiv preprint arXiv:2305.04032, 2023

  19. [20]

    Teaching code llms to use autocompletion tools in repository-level code generation.ACM Transactions on Software Engineering and Methodology, 34(7):1–27, 2025

    Chong Wang, Jian Zhang, Yebo Feng, Tianlin Li, Weisong Sun, Yang Liu, and Xin Peng. Teaching code llms to use autocompletion tools in repository-level code generation.ACM Transactions on Software Engineering and Methodology, 34(7):1–27, 2025

  20. [21]

    Execution guided line-by-line code generation.arXiv preprint arXiv:2506.10948, 2025

    Boaz Lavon, Shahar Katz, and Lior Wolf. Execution guided line-by-line code generation.arXiv preprint arXiv:2506.10948, 2025

  21. [22]

    Computer Environments Elicit General Agentic Intelligence in LLMs

    Daixuan Cheng, Shaohan Huang, Yuxian Gu, Huatong Song, Guoxin Chen, Li Dong, Wayne Xin Zhao, Ji-Rong Wen, and Furu Wei. Computer environments elicit general agentic intelligence in llms.arXiv preprint arXiv:2601.16206, 2026

  22. [23]

    Feedbackeval: A benchmark for evaluating large language models in feedback-driven code repair tasks.arXiv preprint arXiv:2504.06939, 2025

    Dekun Dai, MingWei Liu, Anji Li, Jialun Cao, Yanlin Wang, Chong Wang, Xin Peng, and Zibin Zheng. Feedbackeval: A benchmark for evaluating large language models in feedback-driven code repair tasks.arXiv preprint arXiv:2504.06939, 2025

  23. [24]

    Harness engineering: Leveraging codex in an agent-first world

    Ryan Lopopolo. Harness engineering: Leveraging codex in an agent-first world. OpenAI Engineering Blog, February 2026. URLhttps://openai.com/index/harness-engineering/. Accessed: 2026-05-11

  24. [25]

    The anatomy of an agent harness

    Vivek Trivedy. The anatomy of an agent harness. https://www.langchain.com/blog/ the-anatomy-of-an-agent-harness, 2026. LangChain blog. Accessed: 2026-05-10

  25. [26]

    Claude code

    Anthropic. Claude code. https://www.anthropic.com/product/claude-code. Accessed: 2026-05-09

  26. [27]

    Introducing Codex

    OpenAI. Introducing Codex. https://openai.com/index/introducing-codex/, May 2025. OpenAI announcement. 68 Code as Agent Harness

  27. [28]

    Improving deep agents with harness engineering

    Vivek Trivedy. Improving deep agents with harness engineering. https://www.langchain. com/blog/improving-deep-agents-with-harness-engineering, 2026. LangChain blog. Accessed: 2026-05-10

  28. [29]

    Satlm: Satisfiability-aided language models using declarative prompting.Advances in Neural Information Processing Systems, 36:45548–45580, 2023

    Xi Ye, Qiaochu Chen, Isil Dillig, and Greg Durrett. Satlm: Satisfiability-aided language models using declarative prompting.Advances in Neural Information Processing Systems, 36:45548–45580, 2023

  29. [30]

    Next: Teaching large language models to reason about code execution.arXiv preprint arXiv:2404.14662, 2024

    Ansong Ni, Miltiadis Allamanis, Arman Cohan, Yinlin Deng, Kensen Shi, Charles Sutton, and Pengcheng Yin. Next: Teaching large language models to reason about code execution.arXiv preprint arXiv:2404.14662, 2024

  30. [31]

    Codeprm: Execution feedback-enhanced process reward model for code generation

    Qingyao Li, Xinyi Dai, Xiangyang Li, Weinan Zhang, Yasheng Wang, Ruiming Tang, and Yong Yu. Codeprm: Execution feedback-enhanced process reward model for code generation. InFindings of the Association for Computational Linguistics: ACL 2025, pages 8169–8182, 2025

  31. [32]

    Voyager: An Open-Ended Embodied Agent with Large Language Models

    Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar. Voyager: An open-ended embodied agent with large language models, 2023. URL https://arxiv. org/abs/2305.16291, 2(11), 2023

  32. [33]

    Robocodex: Multimodal code generation for robotic behavior synthesis.arXiv preprint arXiv:2402.16117, 2024

    Yao Mu, Junting Chen, Qinglong Zhang, Shoufa Chen, Qiaojun Yu, Chongjian Ge, Runjian Chen, Zhixuan Liang, Mengkang Hu, Chaofan Tao, et al. Robocodex: Multimodal code generation for robotic behavior synthesis.arXiv preprint arXiv:2402.16117, 2024

  33. [34]

    Code- bt: A code-driven approach to behavior tree generation for robot tasks planning with large language models

    Siyang Zhang, Bin Li, Jingtao Qi, Xueying Wang, Fu Li, Jianan Wang, En Zhu, and Jinjing Sun. Code- bt: A code-driven approach to behavior tree generation for robot tasks planning with large language models. InProceedings of the Thirty-Fourth International Joint Conference on Artificial Intelligence, pages 8814–8822, 2025

  34. [35]

    Ui-voyager: A self-evolving gui agent learning via failed experience

    Zichuan Lin, Feiyu Liu, Yijun Yang, Jiafei Lyu, Yiming Gao, Yicheng Liu, Zhicong Lu, Yangbin Yu, Mingyu Yang, Junyou Li, et al. Ui-voyager: A self-evolving gui agent learning via failed experience. arXiv preprint arXiv:2603.24533, 2026

  35. [36]

    Hao Tang, Darren Key, and Kevin Ellis. Worldcoder, a model-based llm agent: Building world models by writing code and interacting with the environment.Advances in Neural Information Processing Systems, 37:70148–70212, 2024

  36. [37]

    Cwm: An open-weights llm for research on code generation with world models.arXiv preprint arXiv:2510.02387, 2025

    Jade Copet, Quentin Carbonneaux, Gal Cohen, Jonas Gehring, Jacob Kahn, Jannik Kossen, Felix Kreuk, Emily McMilin, Michel Meyer, Yuxiang Wei, et al. Cwm: An open-weights llm for research on code generation with world models.arXiv preprint arXiv:2510.02387, 2025

  37. [38]

    Code2world: A gui world model via renderable code generation.arXiv preprint arXiv:2602.09856, 2026

    Yuhao Zheng, Li’an Zhong, Yi Wang, Rui Dai, Kaikui Liu, Xiangxiang Chu, Linyuan Lv, Philip Torr, and Kevin Qinghong Lin. Code2world: A gui world model via renderable code generation.arXiv preprint arXiv:2602.09856, 2026

  38. [39]

    Endless terminals: Scaling rl environments for terminal agents.arXiv preprint arXiv:2601.16443, 2026

    Kanishk Gandhi, Shivam Garg, Noah D Goodman, and Dimitris Papailiopoulos. Endless terminals: Scaling rl environments for terminal agents.arXiv preprint arXiv:2601.16443, 2026

  39. [40]

    Self-planning code generation with large language models.ACM Transactions on Software Engineering and Methodology, 33(7):1–30, 2024

    Xue Jiang, Yihong Dong, Lecheng Wang, Zheng Fang, Qiwei Shang, Ge Li, Zhi Jin, and Wenpin Jiao. Self-planning code generation with large language models.ACM Transactions on Software Engineering and Methodology, 33(7):1–30, 2024. 69 Code as Agent Harness

  40. [41]

    A Real-World WebAgent with Planning, Long Context Understanding, and Program Synthesis

    Izzeddin Gur, Hiroki Furuta, Austin Huang, Mustafa Safdari, Yutaka Matsuo, Douglas Eck, and Aleksandra Faust. A real-world webagent with planning, long context understanding, and program synthesis.arXiv preprint arXiv:2307.12856, 2023

  41. [42]

    Codeplan: Repository-level coding using llms and planning.Proceedings of the ACM on Software Engineering, 1(FSE):675–698, 2024

    Ramakrishna Bairi, Atharv Sonwane, Aditya Kanade, Vageesh D C, Arun Iyer, Suresh Parthasarathy, Sriram Rajamani, Balasubramanyan Ashok, and Shashank Shet. Codeplan: Repository-level coding using llms and planning.Proceedings of the ACM on Software Engineering, 1(FSE):675–698, 2024

  42. [43]

    Codetree: Agent-guided tree search for code generation with large language models

    Jierui Li, Hung Le, Yingbo Zhou, Caiming Xiong, Silvio Savarese, and Doyen Sahoo. Codetree: Agent-guided tree search for code generation with large language models. InProceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 3711–3726, 2025

  43. [44]

    Mapcoder: Multi-agent code generationforcompetitiveproblemsolving

    Md Ashraful Islam, Mohammed Eunus Ali, and Md Rizwan Parvez. Mapcoder: Multi-agent code generationforcompetitiveproblemsolving. InProceedingsofthe62ndAnnualMeetingoftheAssociation for Computational Linguistics (Volume 1: Long Papers), pages 4912–4944, 2024

  44. [45]

    Codemem: Architectingreproducible agents via dynamic mcp and procedural memory.arXiv preprint arXiv:2512.15813, 2025

    NishantGaurav, AditAkarsh, TejasRavishankar, andManojBajaj. Codemem: Architectingreproducible agents via dynamic mcp and procedural memory.arXiv preprint arXiv:2512.15813, 2025

  45. [46]

    Autocoderover: Autonomous program improvement

    Yuntong Zhang, Haifeng Ruan, Zhiyu Fan, and Abhik Roychoudhury. Autocoderover: Autonomous program improvement. InProceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis, pages 1592–1604, 2024

  46. [47]

    Repocoder: Repository-level code completion through iterative retrieval and generation

    Fengji Zhang, Bei Chen, Yue Zhang, Jacky Keung, Jin Liu, Daoguang Zan, Yi Mao, Jian-Guang Lou, and Weizhu Chen. Repocoder: Repository-level code completion through iterative retrieval and generation. InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 2471–2484, 2023

  47. [48]

    Memgovern: Enhancing code agents through learning from governed human experiences.arXiv preprint arXiv:2601.06789, 2026

    Qihao Wang, Ziming Cheng, Shuo Zhang, Fan Liu, Rui Xu, Heng Lian, Kunyi Wang, Xiaoming Yu, Jianghao Yin, Sen Hu, et al. Memgovern: Enhancing code agents through learning from governed human experiences.arXiv preprint arXiv:2601.06789, 2026

  48. [49]

    Toolnet: Connecting large language models with massive tools via tool graph.arXiv preprint arXiv:2403.00839, 2024

    Xukun Liu, Zhiyuan Peng, Xiaoyuan Yi, Xing Xie, Lirong Xiang, Yuchen Liu, and Dongkuan Xu. Toolnet: Connecting large language models with massive tools via tool graph.arXiv preprint arXiv:2403.00839, 2024

  49. [50]

    AgentCoder: Multi-Agent-based Code Generation with Iterative Testing and Optimisation

    Dong Huang, Jie M Zhang, Michael Luck, Qingwen Bu, Yuhao Qing, and Heming Cui. Agent- coder: Multi-agent-based code generation with iterative testing and optimisation.arXiv preprint arXiv:2312.13010, 2023

  50. [51]

    Adacoder: Adaptive prompt compression for programmatic visual question answering

    Mahiro Ukai, Shuhei Kurita, Atsushi Hashimoto, Yoshitaka Ushiku, and Nakamasa Inoue. Adacoder: Adaptive prompt compression for programmatic visual question answering. InProceedings of the 32nd ACM International Conference on Multimedia, pages 9234–9243, 2024

  51. [52]

    AutoSafeCoder: Amulti-agentframeworkforsecuring LLM code generation through static analysis and fuzz testing.arXiv preprint arXiv:2409.10737, 2024

    A.Nunez, N.T.Islam, S.K.Jha, andP.Najafirad. AutoSafeCoder: Amulti-agentframeworkforsecuring LLM code generation through static analysis and fuzz testing.arXiv preprint arXiv:2409.10737, 2024

  52. [53]

    Agent harness engineering: A survey, 2026

    Junjie Li, Xi Xiao, Yunbei Zhang, Chen Liu, Lin Zhao, Xiaoying Liao, Yingrui Ji, Janet Wang, Jianyang Gu, Yingqiang Ge, Weijie Xu, Xi Fang, Xiang Xu, Tianchen Zhao, Youngeun Kim, Tianyang Wang, Jihun Hamm, Smita Krishnaswamy, Jun Huan, and Chandan Reddy. Agent harness engineering: A survey, 2026. URLhttps://openreview.net/pdf?id=eONq7FdiHa. 70 Code as Age...

  53. [54]

    Autogen: Enabling next-gen llm applications via multi-agent conversations

    Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Beibin Li, Erkang Zhu, Li Jiang, Xiaoyun Zhang, Shaokun Zhang, Jiale Liu, et al. Autogen: Enabling next-gen llm applications via multi-agent conversations. InFirst conference on language modeling, 2024

  54. [55]

    Metagpt: MetaprogrammingforAmulti-agentcollaborativeframework

    Sirui Hong, Mingchen Zhuge, Jonathan Chen, Xiawu Zheng, Yuheng Cheng, Jinlin Wang, Ceyao Zhang, Zili Wang, Steven Ka Shing Yau, Zijuan Lin, Liyang Zhou, Chenyu Ran, Lingfeng Xiao, Chenglin Wu, andJürgenSchmidhuber. Metagpt: MetaprogrammingforAmulti-agentcollaborativeframework. InThe Twelfth International Conference on Learning Representations, ICLR 2024, ...

  55. [56]

    Y. Dong, X. Jiang, Z. Jin, and G. Li. Self-collaboration code generation via ChatGPT.ACM Transactions on Software Engineering and Methodology, 33(7):1–38, 2024

  56. [57]

    Swe-agent: Agent-computer interfaces enable automated software engineering.Advances in Neural Information Processing Systems, 37:50528–50652, 2024

    John Yang, Carlos E Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press. Swe-agent: Agent-computer interfaces enable automated software engineering.Advances in Neural Information Processing Systems, 37:50528–50652, 2024

  57. [58]

    Xu, Xiangru Tang, Mingchen Zhuge, Jiayi Pan, Yueqi Song, Bowen Li, Jaskirat Singh, Hoang H

    Xingyao Wang, Boxuan Li, Yufan Song, Frank F. Xu, Xiangru Tang, Mingchen Zhuge, Jiayi Pan, Yueqi Song, Bowen Li, Jaskirat Singh, Hoang H. Tran, Fuqiang Li, Ren Ma, Mingzhang Zheng, Bill Qian, Yanjun Shao, Niklas Muennighoff, Yizhe Zhang, Binyuan Hui, Junyang Lin, Robert Brennan, Hao Peng, Heng Ji, and Graham Neubig. OpenHands: An open platform for AI soft...

  58. [59]

    Mind2web: Towards a generalist agent for the web

    Xiang Deng, Yu Gu, Boyuan Zheng, Shijie Chen, Samuel Stevens, Boshi Wang, Huan Sun, and Yu Su. Mind2web: Towards a generalist agent for the web, 2023. URL https://arxiv.org/abs/2306.06070

  59. [60]

    WebArena: A Realistic Web Environment for Building Autonomous Agents

    Shuyan Zhou, Frank F. Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, Uri Alon, and Graham Neubig. Webarena: A realistic web environment for building autonomous agents, 2024. URL https://arxiv.org/abs/2307.13854

  60. [61]

    ChemCrow: Augmenting large-language models with chemistry tools

    Andres M Bran, Sam Cox, Oliver Schilter, Carlo Baldassari, Andrew D White, and Philippe Schwaller. Chemcrow: Augmenting large-language models with chemistry tools, 2023. URL https://arxiv.org/abs/2304.05376

  61. [62]

    Autonomous chemical research with large language models

    Daniil A Boiko, Robert MacKnight, Ben Kline, and Gabe Gomes. Autonomous chemical research with large language models. Nature, 624(7992):570–578, 2023

  62. [63]

    The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery

    Chris Lu, Cong Lu, Robert Tjarko Lange, Jakob Foerster, Jeff Clune, and David Ha. The ai scientist: Towards fully automated open-ended scientific discovery, 2024. URL https://arxiv.org/abs/2408.06292

  63. [64]

    Biomni: A general-purpose biomedical ai agent

    Kexin Huang, Serena Zhang, Hanchen Wang, Yuanhao Qu, Yingzhou Lu, Yusuf Roohani, Ryan Li, Lin Qiu, Gavin Li, Junze Zhang, et al. Biomni: A general-purpose biomedical ai agent. bioRxiv, 2025

  64. [65]

    Chain-of-thought prompting elicits reasoning in large language models

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837, 2022

  65. [66]

    Show your work: Scratchpads for intermediate computation with language models

    Maxwell Nye, Anders Johan Andreassen, Guy Gur-Ari, Henryk Michalewski, Jacob Austin, David Bieber, David Dohan, Aitor Lewkowycz, Maarten Bosma, David Luan, et al. Show your work: Scratchpads for intermediate computation with language models. 2021.

  66. [67]

    Reasoning like program executors

    Xinyu Pi, Qian Liu, Bei Chen, Morteza Ziyadi, Zeqi Lin, Qiang Fu, Yan Gao, Jian-Guang Lou, and Weizhu Chen. Reasoning like program executors. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 761–779, 2022

  67. [68]

    Mathcoder: Seamless code integration in llms for enhanced mathematical reasoning

    Ke Wang, Houxing Ren, Aojun Zhou, Zimu Lu, Sichun Luo, Weikang Shi, Renrui Zhang, Linqi Song, Mingjie Zhan, and Hongsheng Li. Mathcoder: Seamless code integration in llms for enhanced mathematical reasoning. arXiv preprint arXiv:2310.03731, 2023

  68. [69]

    When do program-of-thought works for reasoning?

    Zhen Bi, Ningyu Zhang, Yinuo Jiang, Shumin Deng, Guozhou Zheng, and Huajun Chen. When do program-of-thought works for reasoning? In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 17691–17699, 2024

  69. [70]

    Towards better understanding of program-of-thought reasoning in cross-lingual and multilingual environments

    Patomporn Payoungkhamdee, Pume Tuchinda, Jinheon Baek, Samuel Cahyawijaya, Can Udomcharoenchaikit, Potsawee Manakul, Peerat Limkonchotiwat, Ekapol Chuangsuwanich, and Sarana Nutanong. Towards better understanding of program-of-thought reasoning in cross-lingual and multilingual environments. In Findings of the Association for Computational Linguistics: A...

  70. [71]

    Method-based reasoning for large language models: Extraction, reuse, and continuous improvement

    Hong Su. Method-based reasoning for large language models: Extraction, reuse, and continuous improvement. arXiv preprint arXiv:2508.04289, 2025

  71. [72]

    Code-enabled language models can outperform reasoning models on diverse tasks

    Cedegao E Zhang, Cédric Colas, Gabriel Poesia, Joshua B Tenenbaum, and Jacob Andreas. Code-enabled language models can outperform reasoning models on diverse tasks. arXiv preprint arXiv:2510.20909, 2025

  72. [73]

    CodeIO: Condensing reasoning patterns via code input-output prediction

    Junlong Li, Daya Guo, Dejian Yang, Runxin Xu, Yu Wu, and Junxian He. CodeIO: Condensing reasoning patterns via code input-output prediction. In Aarti Singh, Maryam Fazel, Daniel Hsu, Simon Lacoste-Julien, Felix Berkenkamp, Tegan Maharaj, Kiri Wagstaff, and Jerry Zhu, editors, Proceedings of the 42nd International Conference on Machine Learning, volume 267 ...

  73. [74]

    Graph of thoughts: Solving elaborate problems with large language models

    Maciej Besta, Nils Blach, Ales Kubicek, Robert Gerstenberger, Michal Podstawski, Lukas Gianinazzi, Joanna Gajda, Tomasz Lehmann, Hubert Niewiadomski, Piotr Nyczyk, et al. Graph of thoughts: Solving elaborate problems with large language models. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 17682–17690, 2024

  74. [75]

    Self-verifying reflection helps transformers with cot reasoning

    Zhongwei Yu, Wannian Xia, Xue Yan, Bo Xu, Haifeng Zhang, Yali Du, and Jun Wang. Self-verifying reflection helps transformers with cot reasoning. arXiv preprint arXiv:2510.12157, 2025

  75. [76]

    Ma-lot: Multi-agent lean-based long chain-of-thought reasoning enhances formal theorem proving

    Ruida Wang, Rui Pan, Yuxin Li, Jipeng Zhang, Yizhen Jia, Shizhe Diao, Renjie Pi, Junjie Hu, and Tong Zhang. Ma-lot: Multi-agent lean-based long chain-of-thought reasoning enhances formal theorem proving. arXiv e-prints, pages arXiv–2503, 2025

  76. [77]

    Ssr: Socratic self-refine for large language model reasoning

    Haizhou Shi, Ye Liu, Bo Pang, Zeyu Leo Liu, Hao Wang, Silvio Savarese, Caiming Xiong, Yingbo Zhou, and Semih Yavuz. Ssr: Socratic self-refine for large language model reasoning. arXiv preprint arXiv:2511.10621, 2025

  77. [78]

    Codesteer: Symbolic-augmented language models via code/text guidance

    Yongchao Chen, Yilun Hao, Yueying Liu, Yang Zhang, and Chuchu Fan. Codesteer: Symbolic-augmented language models via code/text guidance. arXiv preprint arXiv:2502.04350, 2025.

  78. [79]

    Code-as-symbolic-planner: Foundation model-based robot planning via symbolic code generation

    Yongchao Chen, Yilun Hao, Yang Zhang, and Chuchu Fan. Code-as-symbolic-planner: Foundation model-based robot planning via symbolic code generation. In 2025 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 19248–19254. IEEE, 2025

  79. [80]

    VisualCoder: Guiding large language models in code execution with fine-grained multimodal chain-of-thought reasoning

    Cuong Le Chi, Chau Truong Vinh Hoang, Phan Nhat Huy, Dung D. Le, Tien N Nguyen, and Nghi D. Q. Bui. VisualCoder: Guiding large language models in code execution with fine-grained multimodal chain-of-thought reasoning. In Luis Chiruzzo, Alan Ritter, and Lu Wang, editors, Findings of the Association for Computational Linguistics: NAACL 2025, pages 6643–6660,...

  80. [81]

    The lean 4 theorem prover and programming language

    Leonardo de Moura and Sebastian Ullrich. The lean 4 theorem prover and programming language. In International Conference on Automated Deduction, pages 625–635. Springer, 2021

Showing first 80 references.