AgentCoder: Multi-Agent-based Code Generation with Iterative Testing and Optimisation
Pith reviewed 2026-05-15 03:53 UTC · model grok-4.3
The pith
A multi-agent system divides code generation among three specialized agents, reaching higher accuracy at lower token cost than single models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
AgentCoder assigns distinct roles so the programmer agent generates and revises code based on feedback from the test executor agent, while the test designer agent supplies the test cases that the executor runs. The resulting loop produces code that passes more tests on HumanEval and MBPP than either standalone large language models or other enhancement methods, at substantially lower total token cost.
What carries the argument
The three-agent collaboration loop in which the test executor returns execution outcomes to the programmer agent for iterative code refinement.
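As a concrete illustration, here is a minimal Python sketch of that loop. The `call_llm` helper, the prompt strings, and the representation of tests as executable assert statements are assumptions for illustration, not the paper's actual interfaces; the iteration cap and early-stop condition follow the description given in the rebuttal further down this page.

    # Minimal sketch of an AgentCoder-style loop (illustrative, not the paper's code).
    # `call_llm` is a hypothetical helper that prompts one LLM per role; tests are
    # assumed to be strings of executable assert statements.

    MAX_ITERATIONS = 10  # cap stated in the rebuttal below; early stop on all-pass

    def execute_tests(code: str, tests: list[str]) -> list[tuple[str, str]]:
        """Test executor agent: run each designer-generated test against the
        code and collect (test, error message) pairs for every failure."""
        failures = []
        for test in tests:
            env: dict = {}
            try:
                exec(code + "\n" + test, env)  # test is e.g. "assert f(...) == ..."
            except Exception as exc:
                failures.append((test, f"{type(exc).__name__}: {exc}"))
        return failures

    def agentcoder_loop(task: str) -> str:
        code = call_llm(role="programmer", prompt=task)
        tests = call_llm(role="test_designer", prompt=task)  # list of assert strings
        for _ in range(MAX_ITERATIONS):
            failures = execute_tests(code, tests)
            if not failures:  # every generated test passes: stop refining
                break
            feedback = "\n".join(f"FAILED: {t}\n  {msg}" for t, msg in failures)
            code = call_llm(
                role="programmer",
                prompt=f"{task}\n\nTests:\n" + "\n".join(tests)
                       + f"\n\nExecutor feedback:\n{feedback}\n\nRevise the code.",
            )
        return code

Note that the loop's stopping signal comes entirely from designer-generated tests; the benchmark's hidden tests never enter the loop, which is exactly where the referee's overfitting concern below gets its footing.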
If this is right
- Accuracy on standard code generation benchmarks rises for multiple underlying language models.
- Overall token usage for producing and verifying code falls compared with prior single-model and prompt-based approaches.
- The method outperforms both base models and the twelve enhancement baselines evaluated in the experiments.
- Specialized agent roles allow each component to focus on one part of the task instead of requiring one model to handle writing, testing, and debugging simultaneously.
Where Pith is reading between the lines
- The same role separation might transfer to other structured generation tasks such as writing documentation or fixing bugs.
- Reduced token consumption could make the technique practical for environments with strict compute limits.
- Performance may depend on how well the generated tests cover edge cases that matter for real-world use.
- Adding agents for additional checks like security or efficiency analysis could be a direct next step.
Load-bearing premise
Feedback produced by running the designer-generated tests is assumed to steer the programmer toward code that is correct in general rather than merely passing those particular tests.
What would settle it
Measure whether the final code still passes a fresh collection of human-written tests that were never shown to any of the agents during the generation process.
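A minimal sketch of that experiment, assuming the final programs and held-out human-written tests are available as strings (all names here are hypothetical, not artifacts shipped with the paper):

    # Sketch of the settling experiment: score AgentCoder's final programs
    # against human-written tests that no agent saw during generation.
    # `solutions` maps a problem id to its final code; `held_out_suites`
    # maps it to unseen assert-style tests. Both inputs are hypothetical.

    def passes_held_out(final_code: str, held_out_tests: list[str]) -> bool:
        env: dict = {}
        try:
            exec(final_code, env)
            for test in held_out_tests:  # e.g. "assert has_close_elements([1.0, 2.0], 0.3) is False"
                exec(test, env)
        except Exception:
            return False
        return True

    def held_out_pass_rate(solutions: dict[str, str],
                           held_out_suites: dict[str, list[str]]) -> float:
        """Fraction of problems whose final program passes every unseen test."""
        hits = sum(passes_held_out(solutions[pid], held_out_suites[pid])
                   for pid in solutions)
        return hits / len(solutions)

A gap between this rate and the in-loop pass rate on designer-generated tests would quantify how much of the reported gain is overfitting to the agents' own tests.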
Original abstract
The advancement of natural language processing (NLP) has been significantly boosted by the development of transformer-based large language models (LLMs). These models have revolutionized NLP tasks, particularly in code generation, aiding developers in creating software with enhanced efficiency. Despite their advancements, challenges in balancing code snippet generation with effective test case generation and execution persist. To address these issues, this paper introduces Multi-Agent Assistant Code Generation (AgentCoder), a novel solution comprising a multi-agent framework with specialized agents: the programmer agent, the test designer agent, and the test executor agent. During the coding procedure, the programmer agent will focus on the code generation and refinement based on the test executor agent's feedback. The test designer agent will generate test cases for the generated code, and the test executor agent will run the code with the test cases and write the feedback to the programmer. This collaborative system ensures robust code generation, surpassing the limitations of single-agent models and traditional methodologies. Our extensive experiments on 9 code generation models and 12 enhancement approaches showcase AgentCoder's superior performance over existing code generation models and prompt engineering techniques across various benchmarks. For example, AgentCoder (GPT-4) achieves 96.3% and 91.8% pass@1 in HumanEval and MBPP datasets with an overall token overhead of 56.9K and 66.3K, while state-of-the-art obtains only 90.2% and 78.9% pass@1 with an overall token overhead of 138.2K and 206.5K.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes AgentCoder, a multi-agent code generation framework with three specialized agents: a programmer agent that generates and refines code, a test designer agent that creates test cases, and a test executor agent that runs the code against those tests and provides feedback. The system performs iterative refinement until the code passes the generated tests. Experiments across nine base LLMs and two benchmarks (HumanEval, MBPP) report that AgentCoder with GPT-4 reaches 96.3% and 91.8% pass@1 while using substantially lower token overhead than prior state-of-the-art methods.
Significance. If the central performance claims hold under scrutiny, the work provides concrete evidence that explicit separation of test design, execution, and code refinement within a multi-agent loop can simultaneously raise pass rates and reduce token consumption relative to single-model or prompt-engineering baselines. The efficiency gains are particularly noteworthy given the public benchmarks used.
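For orientation, the pass@k metric behind these numbers is the standard unbiased estimator of Chen et al. (reference [7] below): for each problem, generate n samples, count the c that pass the hidden tests, and average

\[
\text{pass@}k \;=\; \mathbb{E}_{\text{problems}}\!\left[\,1 - \frac{\binom{n-c}{k}}{\binom{n}{k}}\,\right],
\]

so pass@1 reduces to the mean fraction of correct samples. In an iterative pipeline like AgentCoder it is effectively the fraction of problems whose single final program passes the benchmark's hidden tests.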
major comments (2)
- [Experimental Evaluation] Experimental section (results on HumanEval/MBPP): the headline pass@1 figures rest on the unverified assumption that tests produced by the test designer agent provide sufficient coverage of the benchmark's hidden test distribution. No coverage statistics, no comparison of generated versus benchmark test suites, and no ablation that removes the test designer while preserving the executor feedback loop are reported; this leaves open the possibility that reported gains arise from overfitting to incomplete self-generated tests rather than genuine generalization.
- [AgentCoder Framework] Methods description of the iterative loop: the paper does not specify how many iterations are performed, what stopping criterion is used when the executor reports failures, or whether the programmer agent is allowed to see the full generated test suite versus only failure messages. These details are load-bearing for reproducing the exact 96.3%/91.8% numbers and for assessing whether the process introduces new errors.
minor comments (2)
- [Abstract] The abstract states results on '12 enhancement approaches' but the main text should list them explicitly with citations for reproducibility.
- [Results] Token-overhead numbers (56.9K / 66.3K) should be broken down by agent in a table so readers can see the contribution of test generation versus code refinement.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed comments, which help improve the clarity and rigor of our work. We address each major comment below and will incorporate revisions to strengthen the manuscript.
point-by-point responses
- Referee: [Experimental Evaluation] Experimental section (results on HumanEval/MBPP): the headline pass@1 figures rest on the unverified assumption that tests produced by the test designer agent provide sufficient coverage of the benchmark's hidden test distribution. No coverage statistics, no comparison of generated versus benchmark test suites, and no ablation that removes the test designer while preserving the executor feedback loop are reported; this leaves open the possibility that reported gains arise from overfitting to incomplete self-generated tests rather than genuine generalization.
  Authors: We acknowledge that the current manuscript lacks explicit coverage statistics, direct comparisons between generated and benchmark test suites, and an ablation isolating the test designer's contribution while retaining executor feedback. Although pass@1 on the official hidden tests of HumanEval and MBPP remains the community-standard metric and our gains hold across nine base models, we agree these additions would better rule out overfitting concerns. In the revised version we will add: (i) average coverage metrics (statement/branch) for the generated test suites, (ii) a qualitative comparison of test-suite sizes and diversity, and (iii) the requested ablation that removes the test-designer agent while preserving the executor loop. These analyses will be reported in a new subsection of the experimental evaluation (a sketch of such a coverage measurement follows these responses). revision: yes
- Referee: [AgentCoder Framework] Methods description of the iterative loop: the paper does not specify how many iterations are performed, what stopping criterion is used when the executor reports failures, or whether the programmer agent is allowed to see the full generated test suite versus only failure messages. These details are load-bearing for reproducing the exact 96.3%/91.8% numbers and for assessing whether the process introduces new errors.
  Authors: We agree these implementation details are essential for reproducibility. The revised manuscript will explicitly state that the loop runs for a maximum of 10 iterations or terminates early once the executor reports that every generated test passes. The programmer agent is provided with the complete set of test cases produced by the test designer together with the executor's failure messages (including which tests failed and the corresponding error output). We will also insert a concise algorithm box and additional pseudocode in Section 3 to make the control flow unambiguous. revision: yes
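The coverage measurement promised in rebuttal item (i) could look like the following sketch. It uses the real coverage.py package; everything else (the file name, the module name, the test strings) is a hypothetical stand-in for generated artifacts, not the authors' tooling.

    # Sketch of branch-coverage measurement for designer-generated test suites.
    import importlib
    import pathlib

    import coverage  # third-party package: coverage.py

    def branch_coverage(solution_code: str, generated_tests: list[str]) -> float:
        path = pathlib.Path("candidate_solution.py")
        path.write_text(solution_code)  # assumes the working directory is importable
        cov = coverage.Coverage(branch=True, include=[str(path)])
        cov.start()
        module = importlib.import_module("candidate_solution")
        for test in generated_tests:    # e.g. "assert f(...) == ..."
            try:
                exec(test, vars(module))
            except AssertionError:
                pass                    # a failing test still exercises branches
        cov.stop()
        return cov.report()             # total coverage percentage (0-100)

Reporting this number alongside the same measurement for the benchmark's own test suites would make the comparison the referee asks for concrete.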
Circularity Check
No circularity: empirical benchmark results are externally measured
full rationale
The paper presents an empirical multi-agent framework evaluated directly on the public HumanEval and MBPP benchmarks. Pass@1 rates and token overheads are obtained by executing the system against those fixed external test suites rather than being computed from any internally defined quantities, fitted parameters, or self-referential equations. No mathematical derivation chain exists that reduces a claimed result to its own inputs by construction; the central performance numbers are independent measurements. Self-citations, if present, are not load-bearing for the reported results.
Axiom & Free-Parameter Ledger
axioms (2)
- Domain assumption: LLM agents can reliably interpret and execute distinct roles (programmer, test designer, executor) when given natural-language instructions.
- Domain assumption: pass@1 on HumanEval and MBPP is a valid proxy for real-world code correctness.
invented entities (3)
- Programmer agent (no independent evidence)
- Test designer agent (no independent evidence)
- Test executor agent (no independent evidence)
Forward citations
Cited by 20 Pith papers
- FrontierSmith: Synthesizing Open-Ended Coding Problems at Scale
  FrontierSmith automates synthesis of open-ended coding problems from closed-ended seeds and shows measurable gains on two open-ended LLM coding benchmarks.
- ARIADNE: Agentic Reward-Informed Adaptive Decision Exploration via Blackboard-Driven MCTS for Competitive Program Generation
  ARIADNE combines blackboard architecture with MCTS to coordinate strategy, code, test, evaluation, and repair stages, yielding higher Pass@1 scores than prior LLM baselines on APPS, CodeContests, and related benchmarks.
- BIM Information Extraction Through LLM-based Adaptive Exploration
  LLM adaptive exploration via runtime code execution outperforms static query generation for information extraction from heterogeneous BIM models on the new ifc-bench v2 benchmark.
- Social Bias in LLM-Generated Code: Benchmark and Mitigation
  LLMs show up to 60.58% social bias in generated code; a new Fairness Monitor Agent cuts bias by 65.1% and raises functional correctness from 75.80% to 83.97%.
- Evaluating LLM Agents on Automated Software Analysis Tasks
  A custom LLM agent achieves 94% manually verified success on a new benchmark of 35 software analysis setups, outperforming baselines at 77%, but struggles with stage mixing, error localization, and overestimating its ...
- HiL-Bench (Human-in-Loop Benchmark): Do Agents Know When to Ask for Help?
  HiL-Bench shows frontier AI agents fail to ask for help on incomplete tasks, recovering only a fraction of full-information performance, but RL training on Ask-F1 reward improves judgment and transfers across domains.
- ReCodeAgent: A Multi-Agent Workflow for Language-agnostic Translation and Validation of Large-scale Repositories
  ReCodeAgent uses a multi-agent system to translate and validate large code repositories across multiple programming languages, achieving 60.8% higher test pass rates than prior neuro-symbolic and agentic methods on 11...
- An Iterative Test-and-Repair Framework for Competitive Code Generation
  FixAudit improves LLM code generation on competitive programming benchmarks by training a shared model for iterative code-aware test generation and repair, achieving 35%+ gains in Pass@1 over baselines on the same 7B model.
- BACE: LLM-based Code Generation through Bayesian Anchored Co-Evolution of Code and Test Populations
  BACE reformulates LLM code synthesis as Bayesian co-evolution of code and test populations anchored on minimal public examples, achieving superior performance on LiveCodeBench v6.
- Conformal Agent Error Attribution
  A new filtration-based conformal prediction method attributes errors in multi-agent systems by producing contiguous sequence sets with finite-sample coverage guarantees, enabling rollback recovery.
- Tail-aware N-version Machine Learning Models for Reliable API Recommendation
  NvRec profiles multiple API recommendation models on tail-API performance and applies majority voting with reliability filters to raise true accept rates while controlling rejection of uncertain outputs.
- SAFEdit: Does Multi-Agent Decomposition Resolve the Reliability Challenges of Instructed Code Editing?
  SAFEdit reaches 68.6% task success on EditBench code edits by using planner, editor, and verifier agents plus a failure abstraction layer, beating single-model and ReAct baselines.
- No Test Cases, No Problem: Distillation-Driven Code Generation for Scientific Workflows
  MOSAIC generates executable scientific code without I/O test cases by combining student-teacher distillation with a consolidated context window to reduce hallucinations across subproblems.
- You Don't Need Public Tests to Generate Correct Code
  DryRUN lets LLMs create their own test inputs and run internal simulations for self-correcting code generation, matching the performance of test-dependent methods like CodeSIM on LiveCodeBench without public tests or ...
- Explicit Trait Inference for Multi-Agent Coordination
  ETI lets LLM agents infer and track partners' psychological traits (warmth and competence) from histories, cutting payoff loss 45-77% in games and boosting performance 3-29% on MultiAgentBench versus CoT baselines.
- ORBIT: Guided Agentic Orchestration for Autonomous C-to-Rust Transpilation
  ORBIT achieves 100% compilation success and 91.7% test success on 24 mostly large programs from CRUST-Bench by using dependency-aware orchestration and iterative verification, outperforming prior static and baseline tools.
- Bridging the Gap between User Intent and LLM: A Requirement Alignment Approach for Code Generation
  REA-Coder improves LLM code generation by iteratively aligning requirements with model understanding and verifying outputs against the aligned spec.
- Position: agentic AI orchestration should be Bayes-consistent
  Agentic AI orchestration should apply Bayesian principles for belief maintenance, updating from interactions, and utility-based action selection.
- A Survey of Self-Evolving Agents: What, When, How, and Where to Evolve on the Path to Artificial Super Intelligence
  The paper delivers the first systematic review of self-evolving agents, structured around what components evolve, when adaptation occurs, and how it is implemented.
- A Survey on Large Language Models for Code Generation
  A systematic literature review that organizes recent work on LLMs for code generation into a taxonomy covering data curation, model advances, evaluations, ethics, environmental impact, and applications, with benchmark...
Reference graph
Works this paper leans on
- [1] Unified pre-training for program understanding and generation
  Wasi Uddin Ahmad, Saikat Chakraborty, Baishakhi Ray, and Kai-Wei Chang. arXiv:2103.06333, 2021.
- [2] Program synthesis with large language models
  Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie J. Cai, Michael Terry, Quoc V. Le, and Charles Sutton. arXiv:2108.07732, 2021.
- [4] Language models are few-shot learners
  Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. arXiv:2005.14165, 2020.
- [5] Large language models as tool makers
  Tianle Cai, Xuezhi Wang, Tengyu Ma, Xinyun Chen, and Denny Zhou. arXiv:2305.17126, 2023.
- [7] Evaluating large language models trained on code
  Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde, Jared Kaplan, Harrison Edwards, Yura Burda, Nicholas Joseph, Greg Brockman, et al. arXiv:2107.03374, 2021.
- [8] AgentVerse: Facilitating multi-agent collaboration and exploring emergent behaviors in agents
  Weize Chen, Yusheng Su, Jingwei Zuo, Cheng Yang, Chenfei Yuan, Cheng Qian, Chi-Min Chan, Yujia Qin, Ya-Ting Lu, Ruobing Xie, Zhiyuan Liu, Maosong Sun, and Jie Zhou. arXiv:2308.10848, 2023.
- [9] Teaching large language models to self-debug
  Xinyun Chen, Maxwell Lin, Nathanael Schärli, and Denny Zhou. arXiv:2304.05128, 2023.
- [10] CodeScore: Evaluating code generation by learning code execution
  Yihong Dong, Ji Ding, Xue Jiang, Zhuo Li, Ge Li, and Zhi Jin. arXiv:2301.09043, 2023.
- [11] Self-collaboration code generation via ChatGPT
  Yihong Dong, Xue Jiang, Zhi Jin, and Ge Li. arXiv:2304.07590, 2023.
- [12] Human-level play in the game of Diplomacy by combining language models with strategic reasoning
  Meta Fundamental AI Research Diplomacy Team (FAIR), Anton Bakhtin, Noam Brown, Emily Dinan, Gabriele Farina, Colin Flaherty, Daniel Fried, Andrew Goff, Jonathan Gray, Hengyuan Hu, et al. Science, 378(6624):1067–1074, 2022.
- [13] CodeBERT: A pre-trained model for programming and natural languages
  Zhangyin Feng, Daya Guo, Duyu Tang, Nan Duan, Xiaocheng Feng, Ming Gong, Linjun Shou, Bing Qin, Ting Liu, Daxin Jiang, and Ming Zhou. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 1536–1547, 2020.
- [14] InCoder: A generative model for code infilling and synthesis
  Daniel Fried, Armen Aghajanyan, Jessy Lin, Sida I. Wang, Eric Wallace, Freda Shi, Ruiqi Zhong, Wen-tau Yih, Luke Zettlemoyer, and Mike Lewis. arXiv:2204.05999, 2022.
- [16] Reasoning with language model is planning with world model
  Shibo Hao, Yi Gu, Haodi Ma, Joshua Jiahua Hong, Zhen Wang, Daisy Zhe Wang, and Zhiting Hu. arXiv:2305.14992, 2023.
- [17] MetaGPT: Meta programming for a multi-agent collaborative framework
  Sirui Hong, Xiawu Zheng, Jonathan Chen, Yuheng Cheng, Jinlin Wang, Ceyao Zhang, Zili Wang, Steven Ka Shing Yau, Zijuan Lin, Liyang Zhou, et al. arXiv:2308.00352, 2023.
- [18] CodeCoT and beyond: Learning to program and test like a developer
  Dong Huang, Qi Bu, and Heming Cui. arXiv:2308.08784, 2023.
- [19] Self-planning code generation with large language model
  Xue Jiang, Yihong Dong, Lecheng Wang, Qiwei Shang, and Ge Li. arXiv:2303.06689, 2023.
- [21] Competition-level code generation with AlphaCode
  Yujia Li, David H. Choi, Junyoung Chung, Nate Kushman, Julian Schrittwieser, Rémi Leblond, Tom Eccles, James Keeling, Felix Gimeno, Agustin Dal Lago, et al. Science, 378(6624):1092–1097, 2022.
- [22] CAMEL: Communicative agents for "mind" exploration of large language model society
  Guohao Li, Hasan Abed Al Kader Hammoud, Hani Itani, Dmitrii Khizbullin, and Bernard Ghanem. In Thirty-seventh Conference on Neural Information Processing Systems (NeurIPS), 2023.
- [23] Structured chain-of-thought prompting for code generation
  Jia Li, Ge Li, Yongming Li, and Zhi Jin. 2023.
- [24] StarCoder: may the source be with you!
  Raymond Li, Loubna Ben Allal, Yangtian Zi, Niklas Muennighoff, Denis Kocetkov, Chenghao Mou, Marc Marone, Christopher Akiki, Jia Li, Jenny Chim, et al. arXiv:2305.06161, 2023.
- [25] Training socially aligned language models in simulated human society
  Ruibo Liu, Ruixin Yang, Chenyan Jia, Ge Zhang, Denny Zhou, Andrew M. Dai, Diyi Yang, and Soroush Vosoughi. arXiv:2305.16960, 2023.
- [26] Self-Refine: Iterative refinement with self-feedback
  Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, Sean Welleck, Bodhisattwa Prasad Majumder, Shashank Gupta, Amir Yazdanbakhsh, and Peter Clark. arXiv:2303.17651, 2023.
- [27] CodeGen2: Lessons for training LLMs on programming and natural languages
  Erik Nijkamp, Hiroaki Hayashi, Caiming Xiong, Silvio Savarese, and Yingbo Zhou. ICLR, 2023.
- [28] CodeGen: An open large language model for code with multi-turn program synthesis
  Erik Nijkamp, Bo Pang, Hiroaki Hayashi, Lifu Tu, Huan Wang, Yingbo Zhou, Silvio Savarese, and Caiming Xiong. ICLR, 2023.
- [29] GPT-4 technical report
  OpenAI. arXiv:2303.08774, 2023.
- [30] Generative agents: Interactive simulacra of human behavior
  Joon Sung Park, Joseph O'Brien, Carrie Jun Cai, Meredith Ringel Morris, Percy Liang, and Michael S. Bernstein. In Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology, pages 1–22, 2023.
- [31] ChatDev: Communicative agents for software development
  Chen Qian, Xin Cong, Cheng Yang, Weize Chen, Yusheng Su, Juyuan Xu, Zhiyuan Liu, and Maosong Sun. arXiv:2307.07924, 2023.
- [32] Code Llama: Open foundation models for code
  Baptiste Roziere, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Tal Remez, Jérémy Rapin, et al. arXiv:2308.12950, 2023.
- [33] Reflexion: Language agents with verbal reinforcement learning
  Noah Shinn, Federico Cassano, Beck Labash, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. 2023.
- [34] CodeT5: Identifier-aware unified pre-trained encoder-decoder models for code understanding and generation
  Yue Wang, Weishi Wang, Shafiq Joty, and Steven C.H. Hoi. In EMNLP, 2021.
- [35] Intervenor: Prompt the coding ability of large language models with the interactive chain of repairing
  Hanbin Wang, Zhenghao Liu, Shuo Wang, Ganqu Cui, Ning Ding, Zhiyuan Liu, and Ge Yu. arXiv:2311.09868, 2023.
- [36] CodeT5+: Open code large language models for code understanding and generation
  Yue Wang, Hung Le, Akhilesh Deepak Gotmare, Nghi D.Q. Bui, Junnan Li, and Steven C.H. Hoi. arXiv:2305.07922, 2023.
- [37] Chain-of-thought prompting elicits reasoning in large language models
  Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Ed Huai-hsin Chi, Fei Xia, Quoc Le, and Denny Zhou. arXiv:2201.11903, 2022.
- [38] ReAct: Synergizing reasoning and acting in language models
  Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. arXiv:2210.03629, 2022.
- [39] Tree of Thoughts: Deliberate problem solving with large language models
  Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Thomas L. Griffiths, Yuan Cao, and Karthik Narasimhan. arXiv:2305.10601, 2023.
- [40] CERT: Continual pre-training on sketches for library-oriented code generation
  Daoguang Zan, Bei Chen, Dejian Yang, Zeqi Lin, Minsu Kim, Bei Guan, Yongji Wang, Weizhu Chen, and Jian-Guang Lou. In The 2022 International Joint Conference on Artificial Intelligence (IJCAI), 2022.
- [41] ProAgent: Building proactive cooperative AI with large language models
  Ceyao Zhang, Kaijie Yang, Siyi Hu, Zihao Wang, Guanghe Li, Yi Eve Sun, Chen Zhang, Zhaowei Zhang, Anji Liu, Song-Chun Zhu, Xiaojun Chang, Junge Zhang, F. Yin, Yitao Liang, and Yaodong Yang. arXiv:2308.11339, 2023.
- [42] Self-Edit: Fault-aware code editor for code generation
  Kechi Zhang, Zhuo Li, Jia Li, Ge Li, and Zhi Jin. arXiv:2305.04087, 2023.
- [43] CodeGeeX: A pre-trained model for code generation with multilingual evaluations on HumanEval-X
  Qinkai Zheng, Xiao Xia, Xu Zou, Yuxiao Dong, Shanshan Wang, Yufei Xue, Zi-Yuan Wang, Lei Shen, Andi Wang, Yang Li, Teng Su, Zhilin Yang, and Jie Tang. arXiv:2303.17568, 2023.
- [44] Mindstorms in natural language-based societies of mind
  Mingchen Zhuge, Haozhe Liu, Francesco Faccio, Dylan R. Ashley, Róbert Csordás, Anand Gopalakrishnan, Abdullah Hamdi, Hasan Abed Al Kader Hammoud, Vincent Herrmann, Kazuki Irie, et al. arXiv:2305.17066, 2023.