arxiv: 2607.00038 · v1 · pith:67QU5YDWnew · submitted 2026-06-28 · 💻 cs.SE

Stop Hand-Holding Your Coding Agent: Engineering the Loops that Replace Step-by-Step Prompting

Pith reviewed 2026-07-02 20:37 UTC · model grok-4.3

classification 💻 cs.SE
keywords loop engineering · coding agents · prompt engineering · verification ladder · loop specification · agent harness · autonomous verification · terminal states

The pith

Loop specifications let coding agents pursue goals autonomously via triggers, goals, verification, stopping rules and memory, without retiring prompt engineering.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper defines a loop specification as a reusable five-part artifact that a human passes to an agent harness so the agent can complete a task on its own. It places this practice as the next stage after prompts, context, and harnesses, while arguing that loops and prompts remain distinct tools for different situations. An analysis of fifty real loops finds that verification steps and named terminal states already appear in most cases, but automated triggers and lasting memory are rarer. The work supplies a taxonomy built around goal types, a five-level verification ladder, and architecture, plus design principles drawn from studies of self-correction and reward hacking. It also notes practical limits such as the ongoing cost of verification and the risk that humans may surrender too much oversight.

Core claim

The central claim is that a loop specification is a bounded, reusable object consisting of a trigger, a goal, a verification step, a stopping rule, and memory that a human supplies to a harness so the agent can pursue the goal without step-by-step prompting. This object differs from both ordinary programming loops and the harness's internal perceive-act-observe cycle. Loop engineering therefore forms a distinct new layer after prompt, context, and harness use, yet the paper maintains that prompt engineering continues as a separate practice for tasks that do not fit the loop form. The fifty-loop corpus shows that verification and terminal-state naming are already common while triggering and memory remain comparatively underdeveloped.

What carries the argument

The loop specification, a five-component reusable artifact (trigger, goal, verification step, stopping rule, memory) handed to a harness to enable autonomous goal pursuit.
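
To make the five components concrete, here is a minimal sketch of how such an artifact might be represented and driven. The names `LoopSpec`, `run_loop`, and `toy_agent`, the field layout, and the terminal-state labels are illustrative assumptions; they are not the paper's format or the API of any real harness.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class LoopSpec:
    """Hypothetical container for the five components named in the paper."""
    trigger: str                    # what starts the loop, e.g. "manual" or a schedule
    goal: str                       # what the agent should achieve
    verify: Callable[[str], bool]   # verification step applied to each attempt
    max_iterations: int             # stopping rule: a hard budget on attempts
    memory: List[str] = field(default_factory=list)  # notes carried across attempts


def run_loop(spec: LoopSpec, agent: Callable[[str, List[str]], str]) -> str:
    """Call the agent until verification passes or the budget runs out.

    Returns a named terminal state: "SUCCESS" or "BUDGET_EXHAUSTED".
    """
    for attempt in range(1, spec.max_iterations + 1):
        output = agent(spec.goal, spec.memory)
        if spec.verify(output):
            spec.memory.append(f"attempt {attempt}: passed")
            return "SUCCESS"
        spec.memory.append(f"attempt {attempt}: failed verification")
    return "BUDGET_EXHAUSTED"


def toy_agent(goal: str, memory: List[str]) -> str:
    """Stand-in agent: succeeds only after memory records an earlier failure."""
    return "tests pass" if memory else "tests fail"


spec = LoopSpec(
    trigger="manual",
    goal="make the test suite pass",
    verify=lambda out: out == "tests pass",
    max_iterations=3,
)
state = run_loop(spec, toy_agent)
print(state, spec.memory)
```

In this toy run the first attempt fails verification, the failure is written to memory, and the second attempt passes, so the loop ends in a named terminal state instead of running until someone interrupts it.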

If this is right

  • A five-level verification ladder classifies how much human checking each loop requires.
  • Seventy percent of the corpus loops already verify inside the autonomous zone of that ladder.
  • Seventy-four percent of the loops explicitly name their terminal states.
  • Automated triggering and durable memory remain comparatively rare in existing loops.
  • Design principles address risks such as reward hacking and fragility of model-as-judge checks.
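
Because the corpus has exactly fifty loops, the reported percentages translate directly into loop counts. The short sketch below does that arithmetic; the helper name `share_to_count` is an assumption made for illustration.

```python
CORPUS_SIZE = 50  # number of loops in the corpus, as reported


def share_to_count(percent: int, total: int = CORPUS_SIZE) -> int:
    """Convert a whole-number percentage into a loop count, rejecting fractions."""
    count, remainder = divmod(percent * total, 100)
    if remainder:
        raise ValueError(f"{percent}% of {total} is not a whole number of loops")
    return count


autonomous = share_to_count(70)        # loops verifying in the autonomous zone
named_terminals = share_to_count(74)   # loops naming their terminal states

print(autonomous, CORPUS_SIZE - autonomous)            # 35 15
print(named_terminals, CORPUS_SIZE - named_terminals)  # 37 13
```

Read this way, the headline figures correspond to 35 and 37 loops, so each loop accounts for two percentage points.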

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Platforms may add native interfaces for loading and running loop specifications directly.
  • Shared libraries of common loop patterns could reduce the cost of defining new ones.
  • The verification burden may restrict loops to high-value or repeated tasks.
  • The same five-part structure could be tested on agents outside coding, such as research or data workflows.

Load-bearing premise

The fifty hand-coded loops represent current practice and the five-component definition plus verification ladder capture the essential distinctions from prompts and harness mechanisms.

What would settle it

A larger survey or set of controlled tests showing that most coding-agent sessions still rely on step-by-step human prompts rather than pre-written loop specifications, or that loop-based runs do not reduce human oversight compared with prompt-based runs.

Figures

Figures reproduced from arXiv: 2607.00038 by Sandeco Macedo.

Figure 1: Loop engineering as a new layer in the progression from prompt to context to harness …
Figure 2: The anatomy of a loop specification. A trigger starts the agent; the agent executes by …
Figure 3: The five-level verification ladder, strongest at the top. Levels 1 and 2 form the …
Figure 4: Dominant verification level across the fifty loops. Half verify deterministically (level …
Figure 5: Trigger (left) and architecture (right) across the fifty loops. Manual triggering and …
Figure 6: Coverage of the four pattern families across the fifty loops. Most loops combine …
Figure 7: A maturity mismatch in current loop practice. The corpus is mature on verification, …
Figure 8: The sandeco-loop authoring pipeline. A task is first triaged; if no feedback would change the next action it is sent back as a scheduled prompt rather than forced into a loop. Otherwise a short interview gathers the design elements, a hardening pass applies the principles and anti-patterns, and the skill emits a single specification document together with the way to actuate it.
Original abstract

In mid-2026 a slogan reorganized how practitioners talk about coding agents: stop prompting your agent, start designing the loop that prompts it. We take this claim seriously and give it a careful treatment. We call the object of the new practice the loop specification: a bounded, reusable artifact, made of a trigger, a goal, a verification step, a stopping rule and a memory, that a human hands to an agent harness (such as Claude Code or Codex) so the agent pursues a goal on its own, in place of step-by-step prompting. We distinguish this external loop specification from two things it is often confused with: an ordinary programming loop, and the internal perceive-act-observe cycle that the harness already provides as plumbing. We position loop engineering as a new layer in the progression from prompt to context to harness to loop, and we argue, against the stronger headlines, that it does not retire prompt engineering; loop and prompt are distinct tools with distinct uses. We offer four contributions: a definition and scope for the discipline; an anatomy and taxonomy of loop specifications organized around trigger, goal type, a five-level verification ladder, architecture, and named terminal states; a descriptive analysis of the Loop Library, a public corpus of fifty real loops that we code by hand; and a set of design principles and anti-patterns grounded in the scientific literature on self-correction, reward hacking and model-as-judge fragility. The corpus shows that practice has matured most where the discipline says it matters: seventy percent of loops verify in the autonomous zone of the ladder and seventy-four percent name their terminal states, while automated triggering and durable memory remain comparatively underdeveloped. We close with the limits the practice must respect, including the verification burden, comprehension debt and the risk of cognitive surrender.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims that loop engineering represents a distinct new layer beyond prompt engineering and harness plumbing for coding agents. It defines a loop specification as a bounded, reusable five-component artifact (trigger, goal, verification step, stopping rule, memory) handed to an agent harness. The work contributes a taxonomy organized around these components plus a five-level verification ladder, architecture, and named terminal states; a hand-coded analysis of a public corpus of fifty real loops, in which 70% verify in the autonomous zone of the ladder and 74% name their terminal states; and design principles drawn from the literature on self-correction, reward hacking, and model-as-judge fragility. It positions the practice as complementary to prompt engineering, not a replacement for it.

Significance. If the taxonomy and distinctions hold, the paper could usefully organize an emerging practitioner practice by identifying underdeveloped areas such as automated triggering and durable memory, while supplying concrete design principles. The descriptive corpus snapshot and explicit separation of external loop specifications from internal harness cycles are potentially helpful for software engineering audiences working with agent harnesses. Strengths include the public corpus and grounding in existing literature on reward hacking; however, the small author-curated sample limits generalizability.

major comments (2)
  1. [Corpus analysis / Loop Library] Corpus construction and analysis section: The reported statistics (70% autonomous verification, 74% named terminal states) rest on the authors' hand-coding of fifty loops, yet the manuscript supplies no details on selection criteria, exclusion rules, inter-rater reliability, or annotation protocol. This directly affects the claim that the five-component definition and verification ladder capture essential, non-arbitrary distinctions rather than post-hoc categories.
  2. [Anatomy and taxonomy] Taxonomy and definition section: The five-component anatomy and five-level verification ladder are presented as capturing distinctions from prompts and harness plumbing, but no systematic comparison against existing prompt-engineering or agent-taxonomy literature is supplied to demonstrate that the categories are reproducible or load-bearing rather than author-interpreted.
minor comments (1)
  1. [Abstract / Discussion] The abstract states the corpus shows practice has 'matured most where the discipline says it matters,' but the manuscript does not clarify whether this alignment was assessed before or after defining the ladder.
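
The inter-rater reliability that the first major comment asks for is commonly reported as Cohen's kappa, which corrects the raw agreement between two coders for the agreement expected by chance. The sketch below computes it from two label lists; the labels and the two example coders are invented for illustration and are not data from the paper.

```python
from collections import Counter
from typing import Sequence


def cohen_kappa(coder_a: Sequence[str], coder_b: Sequence[str]) -> float:
    """Cohen's kappa for two coders who labelled the same items."""
    if len(coder_a) != len(coder_b) or not coder_a:
        raise ValueError("both coders must label the same non-empty item set")
    n = len(coder_a)
    # Observed agreement: fraction of items given identical labels.
    observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    # Chance agreement: product of each coder's marginal label frequencies.
    freq_a, freq_b = Counter(coder_a), Counter(coder_b)
    labels = set(freq_a) | set(freq_b)
    expected = sum(freq_a[label] * freq_b[label] for label in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both coders used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Invented example: two coders classify four loops by verification zone.
coder_a = ["autonomous", "autonomous", "human", "human"]
coder_b = ["autonomous", "human", "human", "human"]
kappa = cohen_kappa(coder_a, coder_b)
print(kappa)  # 0.5
```

Here the coders agree on three of four items (0.75 observed) while chance alone would give 0.5, so kappa is 0.5.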

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and precise feedback. We agree that the manuscript would benefit from greater methodological transparency on the corpus and a more explicit positioning of the taxonomy against prior literature. We address each major comment below and will incorporate the suggested revisions.

Point-by-point responses
  1. Referee: [Corpus analysis / Loop Library] Corpus construction and analysis section: The reported statistics (70% autonomous verification, 74% named terminal states) rest on the authors' hand-coding of fifty loops, yet the manuscript supplies no details on selection criteria, exclusion rules, inter-rater reliability, or annotation protocol. This directly affects the claim that the five-component definition and verification ladder capture essential, non-arbitrary distinctions rather than post-hoc categories.

    Authors: We agree that the current manuscript omits critical details on corpus construction. The fifty loops were drawn from publicly available practitioner examples and open-source agent repositories. In revision we will add a dedicated subsection 'Corpus Construction and Annotation Protocol' that states: the collection sources and cutoff date; explicit inclusion criteria (a loop must specify at least trigger, goal, and verification to qualify); exclusion criteria (incomplete or non-reusable fragments discarded); and the single-author primary coding process with co-author spot checks. We will explicitly note the absence of formal inter-rater reliability metrics as a limitation and the risk of author bias, while stressing that the full annotated corpus is released publicly so others can replicate or extend the coding. These additions will allow readers to evaluate whether the five-component distinctions and verification ladder are load-bearing rather than post-hoc. revision: yes

  2. Referee: [Anatomy and taxonomy] Taxonomy and definition section: The five-component anatomy and five-level verification ladder are presented as capturing distinctions from prompts and harness plumbing, but no systematic comparison against existing prompt-engineering or agent-taxonomy literature is supplied to demonstrate that the categories are reproducible or load-bearing rather than author-interpreted.

    Authors: The taxonomy is derived from the operational definition of an external, reusable loop specification that sits outside both natural-language prompts and the harness's internal perceive-act cycle. We acknowledge, however, that the manuscript does not supply a systematic side-by-side mapping to prior work. In revision we will add a concise comparative subsection (or table) that situates the five components and verification ladder against representative prompt-engineering taxonomies (e.g., CoT, ToT, ReAct-style prompting) and agent-framework categorizations (e.g., memory and tool-use patterns in LangChain/AutoGen). The comparison will highlight that our contribution centers on the bounded, externally specified nature of the loop artifact and the autonomous-verification ladder, thereby clarifying both overlap and differentiation without overstating novelty. We will also tie the existing design-principles section more explicitly to this taxonomy. revision: yes
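
The inclusion and exclusion criteria promised in the first response (a loop must specify at least a trigger, a goal, and a verification step, and non-reusable fragments are discarded) could be applied mechanically, which would make the selection step easier to audit. The following is a sketch under assumed conventions: the dictionary keys, the `reusable` flag, and the sample candidates are hypothetical and do not come from the Loop Library.

```python
from typing import Dict, List

# Components the rebuttal names as the minimum for inclusion.
REQUIRED_COMPONENTS = ("trigger", "goal", "verification")


def qualifies(candidate: Dict[str, object]) -> bool:
    """Apply the stated inclusion rule plus the reusability exclusion rule."""
    has_required = all(
        str(candidate.get(key) or "").strip() for key in REQUIRED_COMPONENTS
    )
    return has_required and bool(candidate.get("reusable", False))


# Hypothetical candidates, for illustration only.
candidates: List[Dict[str, object]] = [
    {"name": "fix-failing-tests", "trigger": "manual", "goal": "green test suite",
     "verification": "run the tests", "reusable": True},
    {"name": "one-off-fragment", "trigger": "manual", "goal": "rename a file",
     "verification": "", "reusable": False},
    {"name": "no-trigger", "goal": "update docs",
     "verification": "link checker", "reusable": True},
]

included = [c["name"] for c in candidates if qualifies(c)]
print(included)  # ['fix-failing-tests']
```

Publishing a filter like this alongside the annotated corpus would let readers check which candidates were excluded and why.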

Circularity Check

0 steps flagged

No circularity: definitional framework and descriptive corpus analysis

Full rationale

The paper advances a conceptual framework by defining loop specifications (trigger, goal, verification step, stopping rule, memory) and a five-level verification ladder, then applies this taxonomy descriptively to a hand-coded corpus of 50 loops. No equations, fitted parameters, predictions, or derivations appear. The central claims rest on explicit definitional distinctions and observed statistics (e.g., 70% autonomous verification) rather than any reduction of outputs to inputs by construction. The corpus is presented as illustrative data selected by the authors, but the paper makes no claim that the taxonomy is derived from or validated against external benchmarks in a way that creates self-referential fitting. Self-citation is absent. This is a standard non-circular proposal of new terminology and taxonomy.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper is entirely conceptual and definitional. It introduces no mathematical free parameters, no unproved background axioms, and no postulated physical or computational entities beyond the named conceptual artifact 'loop specification'.

pith-pipeline@v0.9.1-grok · 5856 in / 1126 out tokens · 27309 ms · 2026-07-02T20:37:56.433867+00:00 · methodology

