Code as Agent Harness

Bingxuan Li; Cheng Qian; Chenyuan Yang; Dongqi Fu; Dorothy Sun; Dylan Zhang; Gaotang Li; Hanghang Tong; Hong Li; Hong Yan

arxiv: 2605.18747 · v1 · pith:ENJXQKPKnew · submitted 2026-05-18 · 💻 cs.CL · cs.AI


Xuying Ning, Katherine Tieu, Dongqi Fu, Tianxin Wei, Zihao Li, Yuanchen Bei, Jiaru Zou, Mengting Ai, Zhining Liu, Ting-Wei Li, Lingjie Chen, Yanjun Zhao, Ke Yang, Bingxuan Li, Cheng Qian, Gaotang Li, Xiao Lin, Zhichen Zeng, Ruizhong Qiu, Sirui Chen, Yifan Sun, Xiyuan Yang, Ruida Wang, Rui Pan, Chenyuan Yang, Dylan Zhang, Liri Fang, Zikun Cui, Yang Cao, Pan Chen, Dorothy Sun, Ren Chen, Mahesh Srinivasan, Nipun Mathur, Yinglong Xia, Hong Li, Hong Yan, Pan Lu, Lingming Zhang, Tong Zhang, Hanghang Tong, Jingrui He


Pith reviewed 2026-05-20 10:51 UTC · model grok-4.3


keywords code as agent harness · LLM agents · agentic AI systems · multi-agent coordination · executable verification · harness mechanisms · stateful agents · AI agent infrastructure


The pith

Code serves as the harness that turns large language models into executable, verifiable, and stateful AI agents.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This survey frames code as the operational substrate for agentic AI systems rather than only a target output. It organizes the literature into three connected layers: the harness interface that links agents to reasoning, action, and environment modeling; the mechanisms of planning, memory, tool use, and feedback-driven control; and the scaling of these elements to multi-agent coordination through shared code artifacts. A sympathetic reader would care because the framing supplies a concrete roadmap for building agents that execute actions, verify their own work, and maintain consistent state over long horizons. The paper reviews methods across coding assistants, GUI automation, embodied agents, scientific discovery, and enterprise workflows while listing open challenges in evaluation, verification, and safety oversight.

Core claim

By centering code as the harness of agentic AI, this survey provides a unified roadmap toward executable, verifiable, and stateful AI agent systems. Code now supports agent reasoning, acting, environment modeling, and execution-based verification. The survey examines the harness interface, mechanisms for long-horizon execution, and scaling to multi-agent settings where shared code artifacts enable coordination and review.

What carries the argument

Code as agent harness: the unified view that centers code as the basis for agent infrastructure, organized across the three layers of interface, mechanisms, and multi-agent scaling.

If this is right

Applications in GUI/OS automation and embodied agents gain reliability through code-based execution and feedback control.
Multi-agent systems achieve consistent shared state and verification via shared code artifacts.
Evaluation of agents must move beyond final task success to include verification under incomplete feedback.
Harness improvements would need to be regression-free and to preserve human oversight for safety-critical actions, both of which the paper lists as open challenges.
The same harness structure extends to scientific discovery, personalization, DevOps, and enterprise workflows.
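As a rough illustration of the "consistent shared state via shared code artifacts" point, the sketch below shows one way several agents could write to a single code artifact without silently overwriting each other. It is not taken from the paper: the names SharedArtifact, StaleWrite, and commit are hypothetical, and the only verification performed is a syntax check with Python's standard ast module.

```python
import ast
from dataclasses import dataclass


class StaleWrite(Exception):
    """Raised when an agent commits against an outdated version (hypothetical)."""


@dataclass
class SharedArtifact:
    """Hypothetical shared code artifact with optimistic version checks."""
    source: str = ""
    version: int = 0

    def commit(self, base_version: int, new_source: str) -> int:
        # Reject writes based on a version another agent has already replaced.
        if base_version != self.version:
            raise StaleWrite(f"based on v{base_version}, current is v{self.version}")
        # Cheap verification step: the new source must at least parse.
        # ast.parse raises SyntaxError on invalid Python.
        ast.parse(new_source)
        self.source = new_source
        self.version += 1
        return self.version


artifact = SharedArtifact()
v1 = artifact.commit(0, "x = 1\n")      # agent A commits against v0

try:
    artifact.commit(0, "x = 2\n")       # agent B still holds v0 and is rejected
    stale_rejected = False
except StaleWrite:
    stale_rejected = True

try:
    artifact.commit(1, "x = ")          # up to date, but fails the syntax check
    syntax_rejected = False
except SyntaxError:
    syntax_rejected = True
```

A real system would replace the syntax check with tests or review by another agent; the version check is what keeps every agent's view of the artifact consistent.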

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

This structure implies that agent benchmarks should incorporate metrics for code executability and state consistency over time.
Designers could test whether code-centric harnesses reduce error accumulation in long-horizon tasks compared with purely language-based approaches.
The three-layer model suggests straightforward extensions to multimodal environments where code still manages execution and verification.
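The benchmark suggestion above can be made concrete with a small sketch. The two metrics below are illustrative definitions only, not ones the paper proposes: executability_rate is the fraction of steps whose code ran without error, and state_consistency is the fraction of consecutive step pairs where the state a step starts from matches the state the previous step ended with. The Step record and its field names are assumptions.

```python
from typing import Dict, List, NamedTuple


class Step(NamedTuple):
    """One agent step in a recorded trace (hypothetical schema)."""
    executed: bool               # did the generated code run without error?
    state_in: Dict[str, object]  # state the step claims to start from
    state_out: Dict[str, object] # state after the step


def executability_rate(trace: List[Step]) -> float:
    """Fraction of steps whose code executed; 0.0 for an empty trace."""
    if not trace:
        return 0.0
    return sum(step.executed for step in trace) / len(trace)


def state_consistency(trace: List[Step]) -> float:
    """Fraction of consecutive pairs where state_out of one step equals
    state_in of the next; 1.0 when there are no pairs to compare."""
    pairs = list(zip(trace, trace[1:]))
    if not pairs:
        return 1.0
    return sum(prev.state_out == cur.state_in for prev, cur in pairs) / len(pairs)


trace = [
    Step(True, {}, {"a": 1}),
    Step(False, {"a": 1}, {"a": 1}),   # failed to execute, state unchanged
    Step(True, {"a": 2}, {"a": 3}),    # starts from a state nobody produced
]
exec_rate = executability_rate(trace)
consistency = state_consistency(trace)
```

Tracking both numbers per step index, rather than only at the end of a task, is one way to look for error accumulation over long horizons.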

Load-bearing premise

Organizing the literature on code-based agent systems into the three specific layers of harness interface, mechanisms, and multi-agent scaling captures the essential structure without significant omissions or the need for additional dimensions.

What would settle it

A review that identifies a substantial set of code-enabled agent methods or applications that cannot be placed into any of the three layers would falsify the completeness of the proposed organization.

Original abstract

Recent large language models (LLMs) have demonstrated strong capabilities in understanding and generating code, from competitive programming to repository-level software engineering. In emerging agentic systems, code is no longer only a target output. It increasingly serves as an operational substrate for agent reasoning, acting, environment modeling, and execution-based verification. We frame this shift through the lens of agent harnesses and introduce code as agent harness: a unified view that centers code as the basis for agent infrastructure. To systematically study this perspective, we organize the survey around three connected layers. First, we study the harness interface, where code connects agents to reasoning, action, and environment modeling. Second, we examine harness mechanisms: planning, memory, and tool use for long-horizon execution, together with feedback-driven control and optimization that make harness reliable and adaptive. Third, we discuss scaling the harness from single-agent systems to multi-agent settings, where shared code artifacts support multi-agent coordination, review, and verification. Across these layers, we summarize representative methods and practical applications of code as agent harness, spanning coding assistants, GUI/OS automation, embodied agents, scientific discovery, personalization and recommendation, DevOps, and enterprise workflows. We further outline open challenges for harness engineering, including evaluation beyond final task success, verification under incomplete feedback, regression-free harness improvement, consistent shared state across multiple agents, human oversight for safety-critical actions, and extensions to multimodal environments. By centering code as the harness of agentic AI, this survey provides a unified roadmap toward executable, verifiable, and stateful AI agent systems.
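To make the abstract's notion of a harness more tangible, here is a minimal sketch of a feedback-driven loop in which code is the action, execution is the check, and state persists across steps. It is an illustration under stated assumptions, not the survey's design: Harness, StepRecord, run, and the propose callback (standing in for an LLM) are all hypothetical names, and the restricted exec call is not a security sandbox.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class StepRecord:
    code: str
    ok: bool
    feedback: str


@dataclass
class Harness:
    """Hypothetical harness: runs proposed code, verifies it, keeps state."""
    verify: Callable[[Dict[str, object]], bool]
    state: Dict[str, object] = field(default_factory=dict)   # persists across steps
    history: List[StepRecord] = field(default_factory=list)  # memory of attempts

    def step(self, code: str) -> StepRecord:
        scratch = dict(self.state)  # work on a copy so a failed step leaves state intact
        try:
            # Empty builtins keep the example small; this is NOT a real sandbox.
            exec(code, {"__builtins__": {}}, scratch)
            ok = self.verify(scratch)
            feedback = "verified" if ok else "ran, but verification failed"
        except Exception as exc:
            ok, feedback = False, f"{type(exc).__name__}: {exc}"
        if ok:
            self.state = scratch    # commit only verified results
        record = StepRecord(code, ok, feedback)
        self.history.append(record)
        return record


def run(harness: Harness,
        propose: Callable[[Optional[str]], str],
        max_steps: int = 3) -> Optional[StepRecord]:
    """Ask the proposer for code, execute it, and feed errors back until verified."""
    feedback: Optional[str] = None
    for _ in range(max_steps):
        record = harness.step(propose(feedback))
        if record.ok:
            return record
        feedback = record.feedback
    return harness.history[-1] if harness.history else None


def propose(feedback: Optional[str]) -> str:
    # Stand-in for a model: the first attempt is buggy, the retry is fixed.
    return "total = 1 / 0" if feedback is None else "total = 1 + 2"


harness = Harness(verify=lambda s: s.get("total") == 3)
result = run(harness, propose)
```

In this toy run the first attempt raises an error, the error text becomes feedback, and the second attempt passes verification and is committed to the harness state.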

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

A private letter to a colleague.

This survey organizes LLM agent work around code as the central harness in three layers, but the taxonomy's completeness is asserted more than demonstrated.


The paper is a survey that reframes agentic AI by treating code as the operational harness rather than just an output. It breaks the space into three layers: the interface that links agents to reasoning, action, and environment modeling; the mechanisms for planning, memory, tool use, feedback, and optimization; and multi-agent scaling through shared code for coordination and verification. It then maps these to applications in coding assistants, GUI automation, embodied agents, scientific discovery, and enterprise workflows, while listing challenges like evaluation beyond task success and handling incomplete feedback.

Referee Report

1 major / 2 minor

Summary. The paper surveys recent work on LLMs and agentic systems, framing code not merely as generated output but as an operational 'agent harness' substrate for reasoning, acting, environment modeling, execution-based verification, and state management. It organizes the literature into three layers—harness interface (reasoning/action/environment), harness mechanisms (planning/memory/tool use/feedback/optimization), and multi-agent scaling (coordination/review/verification via shared code)—while summarizing applications across coding assistants, GUI/OS automation, embodied agents, scientific discovery, and enterprise workflows, and listing open challenges such as evaluation beyond task success, verification under incomplete feedback, and safety oversight.

Significance. If the three-layer taxonomy holds as a coherent organizing principle, the survey supplies a useful roadmap that connects LLM code capabilities with agent infrastructure, highlighting executable and verifiable systems. The paper draws on external prior work across domains rather than self-referential results, and its explicit listing of applications and challenges provides a concrete synthesis that could help researchers identify gaps in stateful, multi-agent code harnesses.

major comments (1)

[Introduction / survey organization] Introduction and survey organization: The central claim that the three layers deliver a 'unified view' and 'unified roadmap' rests on the premise that this structure comprehensively captures code-based agent systems. However, the manuscript provides no explicit justification, comparative mapping, or discussion of why alternative dimensions (e.g., safety constraints, evaluation protocols, or regression testing) are subsumed within the layers rather than treated as orthogonal; the challenges section lists several of these topics separately without showing integration.

minor comments (2)

[Abstract] Abstract and early sections: The phrase 'code as agent harness' is introduced as a new framing but would benefit from a concise contrast with related terms such as 'agent frameworks' or 'tool-augmented agents' to clarify novelty for readers.
[Applications sections] Applications summary: When enumerating domains (coding assistants, embodied agents, etc.), a short table or bullet list with one representative citation per domain would improve scannability and allow readers to trace the claimed coverage.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the positive assessment of our survey and the constructive feedback recommending minor revision. We address the major comment point by point below.


Referee: [Introduction / survey organization] Introduction and survey organization: The central claim that the three layers deliver a 'unified view' and 'unified roadmap' rests on the premise that this structure comprehensively captures code-based agent systems. However, the manuscript provides no explicit justification, comparative mapping, or discussion of why alternative dimensions (e.g., safety constraints, evaluation protocols, or regression testing) are subsumed within the layers rather than treated as orthogonal; the challenges section lists several of these topics separately without showing integration.

Authors: We thank the referee for this observation. Our three-layer taxonomy is motivated by the distinct functional roles code plays as an agent harness: the interface layer captures how code connects agents to reasoning, action, and environment modeling; the mechanisms layer addresses the operational components (planning, memory, tool use, feedback, and optimization) that enable reliable long-horizon execution; and the multi-agent scaling layer examines how shared code artifacts support coordination, review, and verification. This decomposition provides a natural progression from foundational capabilities to complex systems. Dimensions such as safety constraints, evaluation protocols, and regression testing are treated as cross-cutting concerns that appear within the layers (e.g., verification and feedback mechanisms in layer 2, human oversight and consistent state in layer 3) and are synthesized in the challenges section. We acknowledge, however, that the introduction does not explicitly justify this choice, provide a comparative mapping to alternative organizations, or demonstrate integration of the challenges back into the layers. In the revised manuscript we will add a short subsection in the introduction that (1) states the rationale for the taxonomy, (2) briefly contrasts it with orthogonal alternatives, and (3) clarifies how the listed challenges connect to and are addressed across the three layers. This addition will strengthen the claims of a unified view and roadmap.

revision: yes

Circularity Check

0 steps flagged

No circularity: survey organizes external literature without self-referential derivations


This paper is a literature survey that introduces a three-layer organizational perspective (harness interface, mechanisms, multi-agent scaling) to structure existing work on code-based agent systems. No equations, predictions, or derivations are present that could reduce to fitted inputs or self-definitions by construction. The framing is explicitly presented as a viewpoint for summarizing representative methods and applications drawn from prior external research, with open challenges listed separately. The central claim of a unified roadmap rests on this organizational synthesis rather than any load-bearing self-citation chain or ansatz that loops back to the paper's own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The survey rests on the domain assumption that code is transitioning from a target output to an operational substrate for agents. It introduces the conceptual entity of an 'agent harness' to structure the discussion. No free parameters appear because the work is not an empirical fit or derivation.

axioms (1)

Domain assumption: Code can serve as an effective operational substrate for agent reasoning, acting, environment modeling, and execution-based verification in LLM-based systems.
This premise underpins the entire 'code as agent harness' perspective and is stated in the abstract as the basis for the unified view.

invented entities (1)

Agent harness (no independent evidence)
Purpose: To provide a unified conceptual basis that centers code as the infrastructure for agent reasoning, action, and verification across single- and multi-agent settings.
This is a new framing term introduced by the authors to organize the survey; the abstract supplies no independent falsifiable evidence or external validation for the entity itself.

pith-pipeline@v0.9.0 · 5959 in / 1589 out tokens · 71352 ms · 2026-05-20T10:51:36.481212+00:00 · methodology


Reference graph

Works this paper leans on

296 extracted references · 296 canonical work pages · 41 internal anchors

[1]

Evaluating Large Language Models Trained on Code

Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde De Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code.arXiv preprint arXiv:2107.03374, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[2]

Program Synthesis with Large Language Models

Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, et al. Program synthesis with large language models. arXiv preprint arXiv:2108.07732, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[3]

CodeGen: An Open Large Language Model for Code with Multi-Turn Program Synthesis

Erik Nijkamp, Bo Pang, Hiroaki Hayashi, Lifu Tu, Huan Wang, Yingbo Zhou, Silvio Savarese, and Caiming Xiong. Codegen: An open large language model for code with multi-turn program synthesis. arXiv preprint arXiv:2203.13474, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[4]

Competition-level code generation with alphacode.Science, 378(6624):1092–1097, 2022

Yujia Li, David Choi, Junyoung Chung, Nate Kushman, Julian Schrittwieser, Rémi Leblond, Tom Eccles, James Keeling, Felix Gimeno, Agustin Dal Lago, et al. Competition-level code generation with alphacode.Science, 378(6624):1092–1097, 2022

work page 2022
[5]

SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. Swe-bench: Can language models resolve real-world github issues?arXiv preprint arXiv:2310.06770, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[6]

Program of Thoughts Prompting: Disentangling Computation from Reasoning for Numerical Reasoning Tasks

Wenhu Chen, Xueguang Ma, Xinyi Wang, and William W Cohen. Program of thoughts prompt- ing: Disentangling computation from reasoning for numerical reasoning tasks.arXiv preprint arXiv:2211.12588, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[7]

Pal: Program-aided language models

Luyu Gao, Aman Madaan, Shuyan Zhou, Uri Alon, Pengfei Liu, Yiming Yang, Jamie Callan, and Graham Neubig. Pal: Program-aided language models. InInternational conference on machine learning, pages 10764–10799. PMLR, 2023

work page 2023
[8]

Chain of code: Reasoning with a language model-augmented code emulator.arXiv preprint arXiv:2312.04474, 2023

Chengshu Li, Jacky Liang, Andy Zeng, Xinyun Chen, Karol Hausman, Dorsa Sadigh, Sergey Levine, Li Fei-Fei, Fei Xia, and Brian Ichter. Chain of code: Reasoning with a language model-augmented code emulator.arXiv preprint arXiv:2312.04474, 2023

work page arXiv 2023
[9]

Do As I Can, Not As I Say: Grounding Language in Robotic Affordances

Michael Ahn, Anthony Brohan, Noah Brown, Yevgen Chebotar, Omar Cortes, Byron David, Chelsea Finn, Chuyuan Fu, Keerthana Gopalakrishnan, Karol Hausman, et al. Do as i can, not as i say: Grounding language in robotic affordances.arXiv preprint arXiv:2204.01691, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[10]

Code as policies: Language model programs for embodied control

Jacky Liang, Wenlong Huang, Fei Xia, Peng Xu, Karol Hausman, Brian Ichter, Pete Florence, and Andy Zeng. Code as policies: Language model programs for embodied control. In2023 IEEE International conference on robotics and automation (ICRA), pages 9493–9500. IEEE, 2023

work page 2023
[11]

Intercode: Standardizing and benchmarking interactive coding with execution feedback.Advances in Neural Information Processing Systems, 36:23826–23854, 2023

John Yang, Akshara Prabhakar, Karthik Narasimhan, and Shunyu Yao. Intercode: Standardizing and benchmarking interactive coding with execution feedback.Advances in Neural Information Processing Systems, 36:23826–23854, 2023

work page 2023
[12]

AgentBench: Evaluating LLMs as Agents

Xiao Liu, Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, Hanyu Lai, Yu Gu, Hangliang Ding, Kaiwen Men, Kejuan Yang, et al. Agentbench: Evaluating llms as agents.arXiv preprint arXiv:2308.03688, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[13]

Meta-Harness: End-to-End Optimization of Model Harnesses

Yoonho Lee, Roshen Nair, Qizheng Zhang, Kangwook Lee, Omar Khattab, and Chelsea Finn. Meta- harness: End-to-end optimization of model harnesses.arXiv preprint arXiv:2603.28052, 2026. 67 Code as Agent Harness

work page internal anchor Pith review Pith/arXiv arXiv 2026
[15]

Effective harnesses for long-running agents

Justin Young. Effective harnesses for long-running agents. Anthropic Engineer- ing Blog, November 2025. URL https://www.anthropic.com/engineering/ effective-harnesses-for-long-running-agents. Accessed: 2026-05-11

work page 2025
[16]

Harness engineering: Leveraging codex in an agent-first world.https://openai

Ryan Lopopolo. Harness engineering: Leveraging codex in an agent-first world.https://openai. com/index/harness-engineering/, 2026. OpenAI Engineering Blog, February 11, 2026. Ac- cessed: 2026-05-10

work page 2026
[17]

Agentic Context Engineering: Evolving Contexts for Self-Improving Language Models

Qizheng Zhang, Changran Hu, Shubhangi Upasani, Boyuan Ma, Fenglu Hong, Vamsidhar Kamanuru, Jay Rainton, Chen Wu, Mengmeng Ji, Hanchen Li, et al. Agentic context engineering: Evolving contexts for self-improving language models.arXiv preprint arXiv:2510.04618, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[18]

GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning

Lakshya A Agrawal, Shangyin Tan, Dilara Soylu, Noah Ziems, Rishi Khare, Krista Opsahl-Ong, Arnav Singhvi, Herumb Shandilya, Michael J Ryan, Meng Jiang, et al. Gepa: Reflective prompt evolution can outperform reinforcement learning.arXiv preprint arXiv:2507.19457, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[19]

Toolcoder: Teach code generation models to use api search tools.arXiv preprint arXiv:2305.04032, 2023

Kechi Zhang, Huangzhao Zhang, Ge Li, Jia Li, Zhuo Li, and Zhi Jin. Toolcoder: Teach code generation models to use api search tools.arXiv preprint arXiv:2305.04032, 2023

work page arXiv 2023
[20]

Teaching code llms to use autocompletion tools in repository-level code generation.ACM Transactions on Software Engineering and Methodology, 34(7):1–27, 2025

Chong Wang, Jian Zhang, Yebo Feng, Tianlin Li, Weisong Sun, Yang Liu, and Xin Peng. Teaching code llms to use autocompletion tools in repository-level code generation.ACM Transactions on Software Engineering and Methodology, 34(7):1–27, 2025

work page 2025
[21]

Execution guided line-by-line code generation.arXiv preprint arXiv:2506.10948, 2025

Boaz Lavon, Shahar Katz, and Lior Wolf. Execution guided line-by-line code generation.arXiv preprint arXiv:2506.10948, 2025

work page arXiv 2025
[22]

Computer Environments Elicit General Agentic Intelligence in LLMs

Daixuan Cheng, Shaohan Huang, Yuxian Gu, Huatong Song, Guoxin Chen, Li Dong, Wayne Xin Zhao, Ji-Rong Wen, and Furu Wei. Computer environments elicit general agentic intelligence in llms.arXiv preprint arXiv:2601.16206, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[23]

Feedbackeval: A benchmark for evaluating large language models in feedback-driven code repair tasks.arXiv preprint arXiv:2504.06939, 2025

Dekun Dai, MingWei Liu, Anji Li, Jialun Cao, Yanlin Wang, Chong Wang, Xin Peng, and Zibin Zheng. Feedbackeval: A benchmark for evaluating large language models in feedback-driven code repair tasks.arXiv preprint arXiv:2504.06939, 2025

work page arXiv 2025
[24]

Harness engineering: Leveraging codex in an agent-first world

Ryan Lopopolo. Harness engineering: Leveraging codex in an agent-first world. OpenAI Engineering Blog, February 2026. URLhttps://openai.com/index/harness-engineering/. Accessed: 2026-05-11

work page 2026
[25]

The anatomy of an agent harness

Vivek Trivedy. The anatomy of an agent harness. https://www.langchain.com/blog/ the-anatomy-of-an-agent-harness, 2026. LangChain blog. Accessed: 2026-05-10

work page 2026
[26]

Claude code

Anthropic. Claude code. https://www.anthropic.com/product/claude-code. Accessed: 2026-05-09

work page 2026
[27]

Introducing Codex

OpenAI. Introducing Codex. https://openai.com/index/introducing-codex/, May 2025. OpenAI announcement. 68 Code as Agent Harness

work page 2025
[28]

Improving deep agents with harness engineering

Vivek Trivedy. Improving deep agents with harness engineering. https://www.langchain. com/blog/improving-deep-agents-with-harness-engineering, 2026. LangChain blog. Accessed: 2026-05-10

work page 2026
[29]

Satlm: Satisfiability-aided language models using declarative prompting.Advances in Neural Information Processing Systems, 36:45548–45580, 2023

Xi Ye, Qiaochu Chen, Isil Dillig, and Greg Durrett. Satlm: Satisfiability-aided language models using declarative prompting.Advances in Neural Information Processing Systems, 36:45548–45580, 2023

work page 2023
[30]

Next: Teaching large language models to reason about code execution.arXiv preprint arXiv:2404.14662, 2024

Ansong Ni, Miltiadis Allamanis, Arman Cohan, Yinlin Deng, Kensen Shi, Charles Sutton, and Pengcheng Yin. Next: Teaching large language models to reason about code execution.arXiv preprint arXiv:2404.14662, 2024

work page arXiv 2024
[31]

Codeprm: Execution feedback-enhanced process reward model for code generation

Qingyao Li, Xinyi Dai, Xiangyang Li, Weinan Zhang, Yasheng Wang, Ruiming Tang, and Yong Yu. Codeprm: Execution feedback-enhanced process reward model for code generation. InFindings of the Association for Computational Linguistics: ACL 2025, pages 8169–8182, 2025

work page 2025
[32]

Voyager: An Open-Ended Embodied Agent with Large Language Models

Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar. Voyager: An open-ended embodied agent with large language models, 2023. URL https://arxiv. org/abs/2305.16291, 2(11), 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[33]

Robocodex: Multimodal code generation for robotic behavior synthesis.arXiv preprint arXiv:2402.16117, 2024

Yao Mu, Junting Chen, Qinglong Zhang, Shoufa Chen, Qiaojun Yu, Chongjian Ge, Runjian Chen, Zhixuan Liang, Mengkang Hu, Chaofan Tao, et al. Robocodex: Multimodal code generation for robotic behavior synthesis.arXiv preprint arXiv:2402.16117, 2024

work page arXiv 2024
[34]

Code- bt: A code-driven approach to behavior tree generation for robot tasks planning with large language models

Siyang Zhang, Bin Li, Jingtao Qi, Xueying Wang, Fu Li, Jianan Wang, En Zhu, and Jinjing Sun. Code- bt: A code-driven approach to behavior tree generation for robot tasks planning with large language models. InProceedings of the Thirty-Fourth International Joint Conference on Artificial Intelligence, pages 8814–8822, 2025

work page 2025
[35]

Ui-voyager: A self-evolving gui agent learning via failed experience

Zichuan Lin, Feiyu Liu, Yijun Yang, Jiafei Lyu, Yiming Gao, Yicheng Liu, Zhicong Lu, Yangbin Yu, Mingyu Yang, Junyou Li, et al. Ui-voyager: A self-evolving gui agent learning via failed experience. arXiv preprint arXiv:2603.24533, 2026

work page arXiv 2026
[36]

Hao Tang, Darren Key, and Kevin Ellis. Worldcoder, a model-based llm agent: Building world models by writing code and interacting with the environment.Advances in Neural Information Processing Systems, 37:70148–70212, 2024

work page 2024
[37]

Cwm: An open-weights llm for research on code generation with world models.arXiv preprint arXiv:2510.02387, 2025

Jade Copet, Quentin Carbonneaux, Gal Cohen, Jonas Gehring, Jacob Kahn, Jannik Kossen, Felix Kreuk, Emily McMilin, Michel Meyer, Yuxiang Wei, et al. Cwm: An open-weights llm for research on code generation with world models.arXiv preprint arXiv:2510.02387, 2025

work page arXiv 2025
[38]

Code2world: A gui world model via renderable code generation.arXiv preprint arXiv:2602.09856, 2026

Yuhao Zheng, Li’an Zhong, Yi Wang, Rui Dai, Kaikui Liu, Xiangxiang Chu, Linyuan Lv, Philip Torr, and Kevin Qinghong Lin. Code2world: A gui world model via renderable code generation.arXiv preprint arXiv:2602.09856, 2026

work page arXiv 2026
[39]

Endless terminals: Scaling rl environments for terminal agents.arXiv preprint arXiv:2601.16443, 2026

Kanishk Gandhi, Shivam Garg, Noah D Goodman, and Dimitris Papailiopoulos. Endless terminals: Scaling rl environments for terminal agents.arXiv preprint arXiv:2601.16443, 2026

work page arXiv 2026
[40]

Self-planning code generation with large language models.ACM Transactions on Software Engineering and Methodology, 33(7):1–30, 2024

Xue Jiang, Yihong Dong, Lecheng Wang, Zheng Fang, Qiwei Shang, Ge Li, Zhi Jin, and Wenpin Jiao. Self-planning code generation with large language models.ACM Transactions on Software Engineering and Methodology, 33(7):1–30, 2024. 69 Code as Agent Harness

work page 2024
[41]

A Real-World WebAgent with Planning, Long Context Understanding, and Program Synthesis

Izzeddin Gur, Hiroki Furuta, Austin Huang, Mustafa Safdari, Yutaka Matsuo, Douglas Eck, and Aleksandra Faust. A real-world webagent with planning, long context understanding, and program synthesis.arXiv preprint arXiv:2307.12856, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[42]

Codeplan: Repository-level coding using llms and planning.Proceedings of the ACM on Software Engineering, 1(FSE):675–698, 2024

Ramakrishna Bairi, Atharv Sonwane, Aditya Kanade, Vageesh D C, Arun Iyer, Suresh Parthasarathy, Sriram Rajamani, Balasubramanyan Ashok, and Shashank Shet. Codeplan: Repository-level coding using llms and planning.Proceedings of the ACM on Software Engineering, 1(FSE):675–698, 2024

work page 2024
[43]

Codetree: Agent-guided tree search for code generation with large language models

Jierui Li, Hung Le, Yingbo Zhou, Caiming Xiong, Silvio Savarese, and Doyen Sahoo. Codetree: Agent-guided tree search for code generation with large language models. InProceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 3711–3726, 2025

work page 2025
[44]

Mapcoder: Multi-agent code generationforcompetitiveproblemsolving

Md Ashraful Islam, Mohammed Eunus Ali, and Md Rizwan Parvez. Mapcoder: Multi-agent code generationforcompetitiveproblemsolving. InProceedingsofthe62ndAnnualMeetingoftheAssociation for Computational Linguistics (Volume 1: Long Papers), pages 4912–4944, 2024

work page 2024
[45]

Codemem: Architectingreproducible agents via dynamic mcp and procedural memory.arXiv preprint arXiv:2512.15813, 2025

NishantGaurav, AditAkarsh, TejasRavishankar, andManojBajaj. Codemem: Architectingreproducible agents via dynamic mcp and procedural memory.arXiv preprint arXiv:2512.15813, 2025

work page arXiv 2025
[46]

Autocoderover: Autonomous program improvement

Yuntong Zhang, Haifeng Ruan, Zhiyu Fan, and Abhik Roychoudhury. Autocoderover: Autonomous program improvement. InProceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis, pages 1592–1604, 2024

work page 2024
[47]

Repocoder: Repository-level code completion through iterative retrieval and generation

Fengji Zhang, Bei Chen, Yue Zhang, Jacky Keung, Jin Liu, Daoguang Zan, Yi Mao, Jian-Guang Lou, and Weizhu Chen. Repocoder: Repository-level code completion through iterative retrieval and generation. InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 2471–2484, 2023

work page 2023
[48]

Memgovern: Enhancing code agents through learning from governed human experiences.arXiv preprint arXiv:2601.06789, 2026

Qihao Wang, Ziming Cheng, Shuo Zhang, Fan Liu, Rui Xu, Heng Lian, Kunyi Wang, Xiaoming Yu, Jianghao Yin, Sen Hu, et al. Memgovern: Enhancing code agents through learning from governed human experiences.arXiv preprint arXiv:2601.06789, 2026

work page arXiv 2026
[49]

Toolnet: Connecting large language models with massive tools via tool graph.arXiv preprint arXiv:2403.00839, 2024

Xukun Liu, Zhiyuan Peng, Xiaoyuan Yi, Xing Xie, Lirong Xiang, Yuchen Liu, and Dongkuan Xu. Toolnet: Connecting large language models with massive tools via tool graph.arXiv preprint arXiv:2403.00839, 2024

work page arXiv 2024
[50]

AgentCoder: Multi-Agent-based Code Generation with Iterative Testing and Optimisation

Dong Huang, Jie M Zhang, Michael Luck, Qingwen Bu, Yuhao Qing, and Heming Cui. Agent- coder: Multi-agent-based code generation with iterative testing and optimisation.arXiv preprint arXiv:2312.13010, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[51]

Adacoder: Adaptive prompt compression for programmatic visual question answering

Mahiro Ukai, Shuhei Kurita, Atsushi Hashimoto, Yoshitaka Ushiku, and Nakamasa Inoue. Adacoder: Adaptive prompt compression for programmatic visual question answering. InProceedings of the 32nd ACM International Conference on Multimedia, pages 9234–9243, 2024

work page 2024
[52]

AutoSafeCoder: A multi-agent framework for securing LLM code generation through static analysis and fuzz testing

A. Nunez, N. T. Islam, S. K. Jha, and P. Najafirad. AutoSafeCoder: A multi-agent framework for securing LLM code generation through static analysis and fuzz testing. arXiv preprint arXiv:2409.10737, 2024

[53]

Agent harness engineering: A survey

Junjie Li, Xi Xiao, Yunbei Zhang, Chen Liu, Lin Zhao, Xiaoying Liao, Yingrui Ji, Janet Wang, Jianyang Gu, Yingqiang Ge, Weijie Xu, Xi Fang, Xiang Xu, Tianchen Zhao, Youngeun Kim, Tianyang Wang, Jihun Hamm, Smita Krishnaswamy, Jun Huan, and Chandan Reddy. Agent harness engineering: A survey, 2026. URL https://openreview.net/pdf?id=eONq7FdiHa

[54]

Autogen: Enabling next-gen llm applications via multi-agent conversations

Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Beibin Li, Erkang Zhu, Li Jiang, Xiaoyun Zhang, Shaokun Zhang, Jiale Liu, et al. Autogen: Enabling next-gen llm applications via multi-agent conversations. In First conference on language modeling, 2024

[55]

Metagpt: Meta programming for A multi-agent collaborative framework

Sirui Hong, Mingchen Zhuge, Jonathan Chen, Xiawu Zheng, Yuheng Cheng, Jinlin Wang, Ceyao Zhang, Zili Wang, Steven Ka Shing Yau, Zijuan Lin, Liyang Zhou, Chenyu Ran, Lingfeng Xiao, Chenglin Wu, and Jürgen Schmidhuber. Metagpt: Meta programming for A multi-agent collaborative framework. In The Twelfth International Conference on Learning Representations, ICLR 2024, ...

[56]

Y. Dong, X. Jiang, Z. Jin, and G. Li. Self-collaboration code generation via ChatGPT. ACM Transactions on Software Engineering and Methodology, 33(7):1–38, 2024

[57]

Swe-agent: Agent-computer interfaces enable automated software engineering

John Yang, Carlos E Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press. Swe-agent: Agent-computer interfaces enable automated software engineering. Advances in Neural Information Processing Systems, 37:50528–50652, 2024

[58]

Xingyao Wang, Boxuan Li, Yufan Song, Frank F. Xu, Xiangru Tang, Mingchen Zhuge, Jiayi Pan, Yueqi Song, Bowen Li, Jaskirat Singh, Hoang H. Tran, Fuqiang Li, Ren Ma, Mingzhang Zheng, Bill Qian, Yanjun Shao, Niklas Muennighoff, Yizhe Zhang, Binyuan Hui, Junyang Lin, Robert Brennan, Hao Peng, Heng Ji, and Graham Neubig. OpenHands: An open platform for AI soft...

[59]

Mind2web: Towards a generalist agent for the web

Xiang Deng, Yu Gu, Boyuan Zheng, Shijie Chen, Samuel Stevens, Boshi Wang, Huan Sun, and Yu Su. Mind2web: Towards a generalist agent for the web, 2023. URL https://arxiv.org/abs/2306.06070

[60]

WebArena: A Realistic Web Environment for Building Autonomous Agents

Shuyan Zhou, Frank F. Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, Uri Alon, and Graham Neubig. Webarena: A realistic web environment for building autonomous agents, 2024. URL https://arxiv.org/abs/2307.13854

[61]

ChemCrow: Augmenting large-language models with chemistry tools

Andres M Bran, Sam Cox, Oliver Schilter, Carlo Baldassari, Andrew D White, and Philippe Schwaller. Chemcrow: Augmenting large-language models with chemistry tools, 2023. URL https://arxiv.org/abs/2304.05376

[62]

Autonomous chemical research with large language models

Daniil A Boiko, Robert MacKnight, Ben Kline, and Gabe Gomes. Autonomous chemical research with large language models. Nature, 624(7992):570–578, 2023

[63]

The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery

Chris Lu, Cong Lu, Robert Tjarko Lange, Jakob Foerster, Jeff Clune, and David Ha. The ai scientist: Towards fully automated open-ended scientific discovery, 2024. URL https://arxiv.org/abs/2408.06292

[64]

Biomni: A general-purpose biomedical ai agent

Kexin Huang, Serena Zhang, Hanchen Wang, Yuanhao Qu, Yingzhou Lu, Yusuf Roohani, Ryan Li, Lin Qiu, Gavin Li, Junze Zhang, et al. Biomni: A general-purpose biomedical ai agent. bioRxiv, 2025

[65]

Chain-of-thought prompting elicits reasoning in large language models

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824–24837, 2022

[66]

Show your work: Scratchpads for intermediate computation with language models

Maxwell Nye, Anders Johan Andreassen, Guy Gur-Ari, Henryk Michalewski, Jacob Austin, David Bieber, David Dohan, Aitor Lewkowycz, Maarten Bosma, David Luan, et al. Show your work: Scratchpads for intermediate computation with language models. 2021

[67]

Reasoning like program executors

Xinyu Pi, Qian Liu, Bei Chen, Morteza Ziyadi, Zeqi Lin, Qiang Fu, Yan Gao, Jian-Guang Lou, and Weizhu Chen. Reasoning like program executors. In Proceedings of the 2022 conference on empirical methods in natural language processing, pages 761–779, 2022

[68]

Mathcoder: Seamless code integration in llms for enhanced mathematical reasoning

Ke Wang, Houxing Ren, Aojun Zhou, Zimu Lu, Sichun Luo, Weikang Shi, Renrui Zhang, Linqi Song, Mingjie Zhan, and Hongsheng Li. Mathcoder: Seamless code integration in llms for enhanced mathematical reasoning. arXiv preprint arXiv:2310.03731, 2023

[69]

When do program-of-thought works for reasoning?

Zhen Bi, Ningyu Zhang, Yinuo Jiang, Shumin Deng, Guozhou Zheng, and Huajun Chen. When do program-of-thought works for reasoning? In Proceedings of the AAAI conference on artificial intelligence, volume 38, pages 17691–17699, 2024

[70]

Towards better understanding of program-of-thought reasoning in cross-lingual and multilingual environments

Patomporn Payoungkhamdee, Pume Tuchinda, Jinheon Baek, Samuel Cahyawijaya, Can Udomcharoenchaikit, Potsawee Manakul, Peerat Limkonchotiwat, Ekapol Chuangsuwanich, and Sarana Nutanong. Towards better understanding of program-of-thought reasoning in cross-lingual and multilingual environments. In Findings of the Association for Computational Linguistics: A...

[71]

Method-based reasoning for large language models: Extraction, reuse, and continuous improvement

Hong Su. Method-based reasoning for large language models: Extraction, reuse, and continuous improvement. arXiv preprint arXiv:2508.04289, 2025

[72]

Code-enabled language models can outperform reasoning models on diverse tasks

Cedegao E Zhang, Cédric Colas, Gabriel Poesia, Joshua B Tenenbaum, and Jacob Andreas. Code-enabled language models can outperform reasoning models on diverse tasks. arXiv preprint arXiv:2510.20909, 2025

[73]

CodeIO: Condensing reasoning patterns via code input-output prediction

Junlong Li, Daya Guo, Dejian Yang, Runxin Xu, Yu Wu, and Junxian He. CodeIO: Condensing reasoning patterns via code input-output prediction. In Aarti Singh, Maryam Fazel, Daniel Hsu, Simon Lacoste-Julien, Felix Berkenkamp, Tegan Maharaj, Kiri Wagstaff, and Jerry Zhu, editors, Proceedings of the 42nd International Conference on Machine Learning, volume 267 ...

[74]

Graph of thoughts: Solving elaborate problems with large language models

Maciej Besta, Nils Blach, Ales Kubicek, Robert Gerstenberger, Michal Podstawski, Lukas Gianinazzi, Joanna Gajda, Tomasz Lehmann, Hubert Niewiadomski, Piotr Nyczyk, et al. Graph of thoughts: Solving elaborate problems with large language models. In Proceedings of the AAAI conference on artificial intelligence, volume 38, pages 17682–17690, 2024

[75]

Self-verifying reflection helps transformers with cot reasoning

Zhongwei Yu, Wannian Xia, Xue Yan, Bo Xu, Haifeng Zhang, Yali Du, and Jun Wang. Self-verifying reflection helps transformers with cot reasoning. arXiv preprint arXiv:2510.12157, 2025

[76]

Ma-lot: Multi-agent lean-based long chain-of-thought reasoning enhances formal theorem proving

Ruida Wang, Rui Pan, Yuxin Li, Jipeng Zhang, Yizhen Jia, Shizhe Diao, Renjie Pi, Junjie Hu, and Tong Zhang. Ma-lot: Multi-agent lean-based long chain-of-thought reasoning enhances formal theorem proving. arXiv e-prints, pages arXiv–2503, 2025

[77]

Ssr: Socratic self-refine for large language model reasoning

Haizhou Shi, Ye Liu, Bo Pang, Zeyu Leo Liu, Hao Wang, Silvio Savarese, Caiming Xiong, Yingbo Zhou, and Semih Yavuz. Ssr: Socratic self-refine for large language model reasoning. arXiv preprint arXiv:2511.10621, 2025

[78]

Codesteer: Symbolic-augmented language models via code/text guidance

Yongchao Chen, Yilun Hao, Yueying Liu, Yang Zhang, and Chuchu Fan. Codesteer: Symbolic-augmented language models via code/text guidance. arXiv preprint arXiv:2502.04350, 2025

[79]

Code-as-symbolic-planner: Foundation model-based robot planning via symbolic code generation

Yongchao Chen, Yilun Hao, Yang Zhang, and Chuchu Fan. Code-as-symbolic-planner: Foundation model-based robot planning via symbolic code generation. In 2025 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 19248–19254. IEEE, 2025

[80]

VisualCoder: Guiding large language models in code execution with fine-grained multimodal chain-of-thought reasoning

Cuong Le Chi, Chau Truong Vinh Hoang, Phan Nhat Huy, Dung D. Le, Tien N Nguyen, and Nghi D. Q. Bui. VisualCoder: Guiding large language models in code execution with fine-grained multimodal chain-of-thought reasoning. In Luis Chiruzzo, Alan Ritter, and Lu Wang, editors, Findings of the Association for Computational Linguistics: NAACL 2025, pages 6643–6660,... ISBN 979-8-89176-189-6

[81]

The lean 4 theorem prover and programming language

Leonardo de Moura and Sebastian Ullrich. The lean 4 theorem prover and programming language. In International Conference on Automated Deduction, pages 625–635. Springer, 2021


[1] [1]

Evaluating Large Language Models Trained on Code

Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde De Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021

[2] [2]

Program Synthesis with Large Language Models

Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, et al. Program synthesis with large language models. arXiv preprint arXiv:2108.07732, 2021


[3] [3]

CodeGen: An Open Large Language Model for Code with Multi-Turn Program Synthesis

Erik Nijkamp, Bo Pang, Hiroaki Hayashi, Lifu Tu, Huan Wang, Yingbo Zhou, Silvio Savarese, and Caiming Xiong. Codegen: An open large language model for code with multi-turn program synthesis. arXiv preprint arXiv:2203.13474, 2022


[4] [4]

Competition-level code generation with alphacode

Yujia Li, David Choi, Junyoung Chung, Nate Kushman, Julian Schrittwieser, Rémi Leblond, Tom Eccles, James Keeling, Felix Gimeno, Agustin Dal Lago, et al. Competition-level code generation with alphacode. Science, 378(6624):1092–1097, 2022

[5] [5]

SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. Swe-bench: Can language models resolve real-world github issues? arXiv preprint arXiv:2310.06770, 2023

[6] [6]

Program of Thoughts Prompting: Disentangling Computation from Reasoning for Numerical Reasoning Tasks

Wenhu Chen, Xueguang Ma, Xinyi Wang, and William W Cohen. Program of thoughts prompting: Disentangling computation from reasoning for numerical reasoning tasks. arXiv preprint arXiv:2211.12588, 2022

[7] [7]

Pal: Program-aided language models

Luyu Gao, Aman Madaan, Shuyan Zhou, Uri Alon, Pengfei Liu, Yiming Yang, Jamie Callan, and Graham Neubig. Pal: Program-aided language models. In International conference on machine learning, pages 10764–10799. PMLR, 2023

[8] [8]

Chain of code: Reasoning with a language model-augmented code emulator

Chengshu Li, Jacky Liang, Andy Zeng, Xinyun Chen, Karol Hausman, Dorsa Sadigh, Sergey Levine, Li Fei-Fei, Fei Xia, and Brian Ichter. Chain of code: Reasoning with a language model-augmented code emulator. arXiv preprint arXiv:2312.04474, 2023

[9] [9]

Do As I Can, Not As I Say: Grounding Language in Robotic Affordances

Michael Ahn, Anthony Brohan, Noah Brown, Yevgen Chebotar, Omar Cortes, Byron David, Chelsea Finn, Chuyuan Fu, Keerthana Gopalakrishnan, Karol Hausman, et al. Do as i can, not as i say: Grounding language in robotic affordances. arXiv preprint arXiv:2204.01691, 2022

[10] [10]

Code as policies: Language model programs for embodied control

Jacky Liang, Wenlong Huang, Fei Xia, Peng Xu, Karol Hausman, Brian Ichter, Pete Florence, and Andy Zeng. Code as policies: Language model programs for embodied control. In 2023 IEEE International conference on robotics and automation (ICRA), pages 9493–9500. IEEE, 2023

[11] [11]

Intercode: Standardizing and benchmarking interactive coding with execution feedback

John Yang, Akshara Prabhakar, Karthik Narasimhan, and Shunyu Yao. Intercode: Standardizing and benchmarking interactive coding with execution feedback. Advances in Neural Information Processing Systems, 36:23826–23854, 2023

[12] [12]

AgentBench: Evaluating LLMs as Agents

Xiao Liu, Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, Hanyu Lai, Yu Gu, Hangliang Ding, Kaiwen Men, Kejuan Yang, et al. Agentbench: Evaluating llms as agents. arXiv preprint arXiv:2308.03688, 2023

[13] [13]

Meta-Harness: End-to-End Optimization of Model Harnesses

Yoonho Lee, Roshen Nair, Qizheng Zhang, Kangwook Lee, Omar Khattab, and Chelsea Finn. Meta-harness: End-to-end optimization of model harnesses. arXiv preprint arXiv:2603.28052, 2026

[14] [15]

Effective harnesses for long-running agents

Justin Young. Effective harnesses for long-running agents. Anthropic Engineering Blog, November 2025. URL https://www.anthropic.com/engineering/effective-harnesses-for-long-running-agents. Accessed: 2026-05-11

[15] [16]

Harness engineering: Leveraging codex in an agent-first world

Ryan Lopopolo. Harness engineering: Leveraging codex in an agent-first world. https://openai.com/index/harness-engineering/, 2026. OpenAI Engineering Blog, February 11, 2026. Accessed: 2026-05-10

[16] [17]

Agentic Context Engineering: Evolving Contexts for Self-Improving Language Models

Qizheng Zhang, Changran Hu, Shubhangi Upasani, Boyuan Ma, Fenglu Hong, Vamsidhar Kamanuru, Jay Rainton, Chen Wu, Mengmeng Ji, Hanchen Li, et al. Agentic context engineering: Evolving contexts for self-improving language models. arXiv preprint arXiv:2510.04618, 2025

[17] [18]

GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning

Lakshya A Agrawal, Shangyin Tan, Dilara Soylu, Noah Ziems, Rishi Khare, Krista Opsahl-Ong, Arnav Singhvi, Herumb Shandilya, Michael J Ryan, Meng Jiang, et al. Gepa: Reflective prompt evolution can outperform reinforcement learning. arXiv preprint arXiv:2507.19457, 2025

[18] [19]

Toolcoder: Teach code generation models to use api search tools

Kechi Zhang, Huangzhao Zhang, Ge Li, Jia Li, Zhuo Li, and Zhi Jin. Toolcoder: Teach code generation models to use api search tools. arXiv preprint arXiv:2305.04032, 2023

[19] [20]

Teaching code llms to use autocompletion tools in repository-level code generation

Chong Wang, Jian Zhang, Yebo Feng, Tianlin Li, Weisong Sun, Yang Liu, and Xin Peng. Teaching code llms to use autocompletion tools in repository-level code generation. ACM Transactions on Software Engineering and Methodology, 34(7):1–27, 2025

[20] [21]

Execution guided line-by-line code generation

Boaz Lavon, Shahar Katz, and Lior Wolf. Execution guided line-by-line code generation. arXiv preprint arXiv:2506.10948, 2025

[21] [22]

Computer Environments Elicit General Agentic Intelligence in LLMs

Daixuan Cheng, Shaohan Huang, Yuxian Gu, Huatong Song, Guoxin Chen, Li Dong, Wayne Xin Zhao, Ji-Rong Wen, and Furu Wei. Computer environments elicit general agentic intelligence in llms. arXiv preprint arXiv:2601.16206, 2026

[22] [23]

Feedbackeval: A benchmark for evaluating large language models in feedback-driven code repair tasks

Dekun Dai, MingWei Liu, Anji Li, Jialun Cao, Yanlin Wang, Chong Wang, Xin Peng, and Zibin Zheng. Feedbackeval: A benchmark for evaluating large language models in feedback-driven code repair tasks. arXiv preprint arXiv:2504.06939, 2025

[23] [24]

Harness engineering: Leveraging codex in an agent-first world

Ryan Lopopolo. Harness engineering: Leveraging codex in an agent-first world. OpenAI Engineering Blog, February 2026. URL https://openai.com/index/harness-engineering/. Accessed: 2026-05-11

[24] [25]

The anatomy of an agent harness

Vivek Trivedy. The anatomy of an agent harness. https://www.langchain.com/blog/the-anatomy-of-an-agent-harness, 2026. LangChain blog. Accessed: 2026-05-10

[25] [26]

Claude code

Anthropic. Claude code. https://www.anthropic.com/product/claude-code. Accessed: 2026-05-09


[26] [27]

Introducing Codex

OpenAI. Introducing Codex. https://openai.com/index/introducing-codex/, May 2025. OpenAI announcement

[27] [28]

Improving deep agents with harness engineering

Vivek Trivedy. Improving deep agents with harness engineering. https://www.langchain.com/blog/improving-deep-agents-with-harness-engineering, 2026. LangChain blog. Accessed: 2026-05-10

[28] [29]

Satlm: Satisfiability-aided language models using declarative prompting

Xi Ye, Qiaochu Chen, Isil Dillig, and Greg Durrett. Satlm: Satisfiability-aided language models using declarative prompting. Advances in Neural Information Processing Systems, 36:45548–45580, 2023

[29] [30]

Next: Teaching large language models to reason about code execution

Ansong Ni, Miltiadis Allamanis, Arman Cohan, Yinlin Deng, Kensen Shi, Charles Sutton, and Pengcheng Yin. Next: Teaching large language models to reason about code execution. arXiv preprint arXiv:2404.14662, 2024

[30] [31]

Codeprm: Execution feedback-enhanced process reward model for code generation

Qingyao Li, Xinyi Dai, Xiangyang Li, Weinan Zhang, Yasheng Wang, Ruiming Tang, and Yong Yu. Codeprm: Execution feedback-enhanced process reward model for code generation. In Findings of the Association for Computational Linguistics: ACL 2025, pages 8169–8182, 2025

[31] [32]

Voyager: An Open-Ended Embodied Agent with Large Language Models

Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar. Voyager: An open-ended embodied agent with large language models, 2023. URL https://arxiv.org/abs/2305.16291, 2(11), 2023

[32] [33]

Robocodex: Multimodal code generation for robotic behavior synthesis

Yao Mu, Junting Chen, Qinglong Zhang, Shoufa Chen, Qiaojun Yu, Chongjian Ge, Runjian Chen, Zhixuan Liang, Mengkang Hu, Chaofan Tao, et al. Robocodex: Multimodal code generation for robotic behavior synthesis. arXiv preprint arXiv:2402.16117, 2024

[33] [34]

Code-bt: A code-driven approach to behavior tree generation for robot tasks planning with large language models

Siyang Zhang, Bin Li, Jingtao Qi, Xueying Wang, Fu Li, Jianan Wang, En Zhu, and Jinjing Sun. Code-bt: A code-driven approach to behavior tree generation for robot tasks planning with large language models. In Proceedings of the Thirty-Fourth International Joint Conference on Artificial Intelligence, pages 8814–8822, 2025

[34] [35]

Ui-voyager: A self-evolving gui agent learning via failed experience

Zichuan Lin, Feiyu Liu, Yijun Yang, Jiafei Lyu, Yiming Gao, Yicheng Liu, Zhicong Lu, Yangbin Yu, Mingyu Yang, Junyou Li, et al. Ui-voyager: A self-evolving gui agent learning via failed experience. arXiv preprint arXiv:2603.24533, 2026


[35] [36]

Hao Tang, Darren Key, and Kevin Ellis. Worldcoder, a model-based llm agent: Building world models by writing code and interacting with the environment. Advances in Neural Information Processing Systems, 37:70148–70212, 2024

[36] [37]

Cwm: An open-weights llm for research on code generation with world models

Jade Copet, Quentin Carbonneaux, Gal Cohen, Jonas Gehring, Jacob Kahn, Jannik Kossen, Felix Kreuk, Emily McMilin, Michel Meyer, Yuxiang Wei, et al. Cwm: An open-weights llm for research on code generation with world models. arXiv preprint arXiv:2510.02387, 2025

[37] [38]

Code2world: A gui world model via renderable code generation

Yuhao Zheng, Li’an Zhong, Yi Wang, Rui Dai, Kaikui Liu, Xiangxiang Chu, Linyuan Lv, Philip Torr, and Kevin Qinghong Lin. Code2world: A gui world model via renderable code generation. arXiv preprint arXiv:2602.09856, 2026

[38] [39]

Endless terminals: Scaling rl environments for terminal agents

Kanishk Gandhi, Shivam Garg, Noah D Goodman, and Dimitris Papailiopoulos. Endless terminals: Scaling rl environments for terminal agents. arXiv preprint arXiv:2601.16443, 2026

[39] [40]

Self-planning code generation with large language models

Xue Jiang, Yihong Dong, Lecheng Wang, Zheng Fang, Qiwei Shang, Ge Li, Zhi Jin, and Wenpin Jiao. Self-planning code generation with large language models. ACM Transactions on Software Engineering and Methodology, 33(7):1–30, 2024

[40] [41]

A Real-World WebAgent with Planning, Long Context Understanding, and Program Synthesis

Izzeddin Gur, Hiroki Furuta, Austin Huang, Mustafa Safdari, Yutaka Matsuo, Douglas Eck, and Aleksandra Faust. A real-world webagent with planning, long context understanding, and program synthesis. arXiv preprint arXiv:2307.12856, 2023

[41] [42]

Codeplan: Repository-level coding using llms and planning

Ramakrishna Bairi, Atharv Sonwane, Aditya Kanade, Vageesh D C, Arun Iyer, Suresh Parthasarathy, Sriram Rajamani, Balasubramanyan Ashok, and Shashank Shet. Codeplan: Repository-level coding using llms and planning. Proceedings of the ACM on Software Engineering, 1(FSE):675–698, 2024

[42] [43]

Codetree: Agent-guided tree search for code generation with large language models

Jierui Li, Hung Le, Yingbo Zhou, Caiming Xiong, Silvio Savarese, and Doyen Sahoo. Codetree: Agent-guided tree search for code generation with large language models. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 3711–3726, 2025

[43] [44]

Mapcoder: Multi-agent code generation for competitive problem solving

Md Ashraful Islam, Mohammed Eunus Ali, and Md Rizwan Parvez. Mapcoder: Multi-agent code generation for competitive problem solving. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 4912–4944, 2024

[44] [45]

Codemem: Architecting reproducible agents via dynamic mcp and procedural memory

Nishant Gaurav, Adit Akarsh, Tejas Ravishankar, and Manoj Bajaj. Codemem: Architecting reproducible agents via dynamic mcp and procedural memory. arXiv preprint arXiv:2512.15813, 2025

[45] [46]

Autocoderover: Autonomous program improvement

Yuntong Zhang, Haifeng Ruan, Zhiyu Fan, and Abhik Roychoudhury. Autocoderover: Autonomous program improvement. In Proceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis, pages 1592–1604, 2024

[46] [47]

Repocoder: Repository-level code completion through iterative retrieval and generation

Fengji Zhang, Bei Chen, Yue Zhang, Jacky Keung, Jin Liu, Daoguang Zan, Yi Mao, Jian-Guang Lou, and Weizhu Chen. Repocoder: Repository-level code completion through iterative retrieval and generation. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 2471–2484, 2023

[47] [48]

Memgovern: Enhancing code agents through learning from governed human experiences

Qihao Wang, Ziming Cheng, Shuo Zhang, Fan Liu, Rui Xu, Heng Lian, Kunyi Wang, Xiaoming Yu, Jianghao Yin, Sen Hu, et al. Memgovern: Enhancing code agents through learning from governed human experiences. arXiv preprint arXiv:2601.06789, 2026

[48] [49]

Toolnet: Connecting large language models with massive tools via tool graph

Xukun Liu, Zhiyuan Peng, Xiaoyuan Yi, Xing Xie, Lirong Xiang, Yuchen Liu, and Dongkuan Xu. Toolnet: Connecting large language models with massive tools via tool graph. arXiv preprint arXiv:2403.00839, 2024

[49] [50]

AgentCoder: Multi-Agent-based Code Generation with Iterative Testing and Optimisation

Dong Huang, Jie M Zhang, Michael Luck, Qingwen Bu, Yuhao Qing, and Heming Cui. Agentcoder: Multi-agent-based code generation with iterative testing and optimisation. arXiv preprint arXiv:2312.13010, 2023

[50] [51]

Adacoder: Adaptive prompt compression for programmatic visual question answering

Mahiro Ukai, Shuhei Kurita, Atsushi Hashimoto, Yoshitaka Ushiku, and Nakamasa Inoue. Adacoder: Adaptive prompt compression for programmatic visual question answering. In Proceedings of the 32nd ACM International Conference on Multimedia, pages 9234–9243, 2024

[51] [52]

AutoSafeCoder: A multi-agent framework for securing LLM code generation through static analysis and fuzz testing

A. Nunez, N. T. Islam, S. K. Jha, and P. Najafirad. AutoSafeCoder: A multi-agent framework for securing LLM code generation through static analysis and fuzz testing. arXiv preprint arXiv:2409.10737, 2024

[52] [53]

Agent harness engineering: A survey

Junjie Li, Xi Xiao, Yunbei Zhang, Chen Liu, Lin Zhao, Xiaoying Liao, Yingrui Ji, Janet Wang, Jianyang Gu, Yingqiang Ge, Weijie Xu, Xi Fang, Xiang Xu, Tianchen Zhao, Youngeun Kim, Tianyang Wang, Jihun Hamm, Smita Krishnaswamy, Jun Huan, and Chandan Reddy. Agent harness engineering: A survey, 2026. URL https://openreview.net/pdf?id=eONq7FdiHa

[53] [54]

Autogen: Enabling next-gen llm applications via multi-agent conversations

Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Beibin Li, Erkang Zhu, Li Jiang, Xiaoyun Zhang, Shaokun Zhang, Jiale Liu, et al. Autogen: Enabling next-gen llm applications via multi-agent conversations. In First conference on language modeling, 2024

[54] [55]

Metagpt: Meta programming for A multi-agent collaborative framework

Sirui Hong, Mingchen Zhuge, Jonathan Chen, Xiawu Zheng, Yuheng Cheng, Jinlin Wang, Ceyao Zhang, Zili Wang, Steven Ka Shing Yau, Zijuan Lin, Liyang Zhou, Chenyu Ran, Lingfeng Xiao, Chenglin Wu, and Jürgen Schmidhuber. Metagpt: Meta programming for A multi-agent collaborative framework. In The Twelfth International Conference on Learning Representations, ICLR 2024, ...

[55] [56]

Y. Dong, X. Jiang, Z. Jin, and G. Li. Self-collaboration code generation via ChatGPT. ACM Transactions on Software Engineering and Methodology, 33(7):1–38, 2024

[56] [57]

Swe-agent: Agent-computer interfaces enable automated software engineering

John Yang, Carlos E Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press. Swe-agent: Agent-computer interfaces enable automated software engineering. Advances in Neural Information Processing Systems, 37:50528–50652, 2024

[57] [58]

Xingyao Wang, Boxuan Li, Yufan Song, Frank F. Xu, Xiangru Tang, Mingchen Zhuge, Jiayi Pan, Yueqi Song, Bowen Li, Jaskirat Singh, Hoang H. Tran, Fuqiang Li, Ren Ma, Mingzhang Zheng, Bill Qian, Yanjun Shao, Niklas Muennighoff, Yizhe Zhang, Binyuan Hui, Junyang Lin, Robert Brennan, Hao Peng, Heng Ji, and Graham Neubig. OpenHands: An open platform for AI soft...

[58] [59]

Mind2web: Towards a generalist agent for the web

Xiang Deng, Yu Gu, Boyuan Zheng, Shijie Chen, Samuel Stevens, Boshi Wang, Huan Sun, and Yu Su. Mind2web: Towards a generalist agent for the web, 2023. URL https://arxiv.org/abs/2306.06070

[59] [60]

WebArena: A Realistic Web Environment for Building Autonomous Agents

Shuyan Zhou, Frank F. Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, Uri Alon, and Graham Neubig. Webarena: A realistic web environment for building autonomous agents, 2024. URL https://arxiv.org/abs/2307.13854

[60] [61]

ChemCrow: Augmenting large-language models with chemistry tools

Andres M Bran, Sam Cox, Oliver Schilter, Carlo Baldassari, Andrew D White, and Philippe Schwaller. Chemcrow: Augmenting large-language models with chemistry tools, 2023. URL https://arxiv.org/abs/2304.05376

[61] [62]

Autonomous chemical research with large language models.Nature, 624(7992):570–578, 2023

Daniil A Boiko, Robert MacKnight, Ben Kline, and Gabe Gomes. Autonomous chemical research with large language models.Nature, 624(7992):570–578, 2023

[62] [63]

The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery

Chris Lu, Cong Lu, Robert Tjarko Lange, Jakob Foerster, Jeff Clune, and David Ha. The ai scientist: Towards fully automated open-ended scientific discovery, 2024. URL https://arxiv.org/abs/2408.06292

[63] [64]

Biomni: A general-purpose biomedical ai agent

Kexin Huang, Serena Zhang, Hanchen Wang, Yuanhao Qu, Yingzhou Lu, Yusuf Roohani, Ryan Li, Lin Qiu, Gavin Li, Junze Zhang, et al. Biomni: A general-purpose biomedical ai agent. bioRxiv, 2025

[64] [65]

Chain-of-thought prompting elicits reasoning in large language models

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837, 2022

[65] [66]

Show your work: Scratchpads for intermediate computation with language models

Maxwell Nye, Anders Johan Andreassen, Guy Gur-Ari, Henryk Michalewski, Jacob Austin, David Bieber, David Dohan, Aitor Lewkowycz, Maarten Bosma, David Luan, et al. Show your work: Scratchpads for intermediate computation with language models. 2021

[66] [67]

Reasoning like program executors

Xinyu Pi, Qian Liu, Bei Chen, Morteza Ziyadi, Zeqi Lin, Qiang Fu, Yan Gao, Jian-Guang Lou, and Weizhu Chen. Reasoning like program executors. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 761–779, 2022

[67] [68]

Mathcoder: Seamless code integration in llms for enhanced mathematical reasoning

Ke Wang, Houxing Ren, Aojun Zhou, Zimu Lu, Sichun Luo, Weikang Shi, Renrui Zhang, Linqi Song, Mingjie Zhan, and Hongsheng Li. Mathcoder: Seamless code integration in llms for enhanced mathematical reasoning. arXiv preprint arXiv:2310.03731, 2023

[68] [69]

When do program-of-thought works for reasoning?

Zhen Bi, Ningyu Zhang, Yinuo Jiang, Shumin Deng, Guozhou Zheng, and Huajun Chen. When do program-of-thought works for reasoning? In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 17691–17699, 2024

[69] [70]

Towards better understanding of program-of-thought reasoning in cross-lingual and multilingual environments

Patomporn Payoungkhamdee, Pume Tuchinda, Jinheon Baek, Samuel Cahyawijaya, Can Udomcharoenchaikit, Potsawee Manakul, Peerat Limkonchotiwat, Ekapol Chuangsuwanich, and Sarana Nutanong. Towards better understanding of program-of-thought reasoning in cross-lingual and multilingual environments. In Findings of the Association for Computational Linguistics: A...

[70] [71]

Method-based reasoning for large language models: Extraction, reuse, and continuous improvement

Hong Su. Method-based reasoning for large language models: Extraction, reuse, and continuous improvement. arXiv preprint arXiv:2508.04289, 2025

[71] [72]

Code-enabled language models can outperform reasoning models on diverse tasks

Cedegao E Zhang, Cédric Colas, Gabriel Poesia, Joshua B Tenenbaum, and Jacob Andreas. Code-enabled language models can outperform reasoning models on diverse tasks. arXiv preprint arXiv:2510.20909, 2025

[72] [73]

CodeIO: Condensing reasoning patterns via code input-output prediction

Junlong Li, Daya Guo, Dejian Yang, Runxin Xu, Yu Wu, and Junxian He. CodeIO: Condensing reasoning patterns via code input-output prediction. In Aarti Singh, Maryam Fazel, Daniel Hsu, Simon Lacoste-Julien, Felix Berkenkamp, Tegan Maharaj, Kiri Wagstaff, and Jerry Zhu, editors, Proceedings of the 42nd International Conference on Machine Learning, volume 267 ...

[73] [74]

Graph of thoughts: Solving elaborate problems with large language models

Maciej Besta, Nils Blach, Ales Kubicek, Robert Gerstenberger, Michal Podstawski, Lukas Gianinazzi, Joanna Gajda, Tomasz Lehmann, Hubert Niewiadomski, Piotr Nyczyk, et al. Graph of thoughts: Solving elaborate problems with large language models. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 17682–17690, 2024

[74] [75]

Self-verifying reflection helps transformers with cot reasoning

Zhongwei Yu, Wannian Xia, Xue Yan, Bo Xu, Haifeng Zhang, Yali Du, and Jun Wang. Self-verifying reflection helps transformers with cot reasoning. arXiv preprint arXiv:2510.12157, 2025

[75] [76]

Ma-lot: Multi-agent lean-based long chain-of-thought reasoning enhances formal theorem proving

Ruida Wang, Rui Pan, Yuxin Li, Jipeng Zhang, Yizhen Jia, Shizhe Diao, Renjie Pi, Junjie Hu, and Tong Zhang. Ma-lot: Multi-agent lean-based long chain-of-thought reasoning enhances formal theorem proving. arXiv e-prints, pages arXiv–2503, 2025

[76] [77]

Ssr: Socratic self-refine for large language model reasoning

Haizhou Shi, Ye Liu, Bo Pang, Zeyu Leo Liu, Hao Wang, Silvio Savarese, Caiming Xiong, Yingbo Zhou, and Semih Yavuz. Ssr: Socratic self-refine for large language model reasoning. arXiv preprint arXiv:2511.10621, 2025

[77] [78]

Codesteer: Symbolic-augmented language models via code/text guidance

Yongchao Chen, Yilun Hao, Yueying Liu, Yang Zhang, and Chuchu Fan. Codesteer: Symbolic-augmented language models via code/text guidance. arXiv preprint arXiv:2502.04350, 2025

[78] [79]

Code-as-symbolic-planner: Foundation model-based robot planning via symbolic code generation

Yongchao Chen, Yilun Hao, Yang Zhang, and Chuchu Fan. Code-as-symbolic-planner: Foundation model-based robot planning via symbolic code generation. In 2025 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 19248–19254. IEEE, 2025

[79] [80]

ISBN 979-8-89176-189-6

Cuong Le Chi, Chau Truong Vinh Hoang, Phan Nhat Huy, Dung D. Le, Tien N Nguyen, and Nghi D. Q. Bui. VisualCoder: Guiding large language models in code execution with fine-grained multimodal chain-of-thought reasoning. In Luis Chiruzzo, Alan Ritter, and Lu Wang, editors, Findings of the Association for Computational Linguistics: NAACL 2025, pages 6643–6660,...

[80] [81]

The lean 4 theorem prover and programming language

Leonardo de Moura and Sebastian Ullrich. The lean 4 theorem prover and programming language. In International Conference on Automated Deduction, pages 625–635. Springer, 2021
