AgentCoder: Multi-Agent-based Code Generation with Iterative Testing and Optimisation
Pith reviewed 2026-05-15 03:53 UTC · model grok-4.3
The pith
A multi-agent system divides code generation among three specialized agents, reaching higher accuracy at lower token cost than single models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
AgentCoder assigns distinct roles so the programmer agent generates and revises code based on feedback from the test executor agent, while the test designer agent supplies the test cases that the executor runs. The resulting loop produces code that passes more tests on HumanEval and MBPP than either standalone large language models or other enhancement methods, at substantially lower total token cost.
What carries the argument
The three-agent collaboration loop in which the test executor returns execution outcomes to the programmer agent for iterative code refinement.
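As a concrete illustration, here is a minimal Python sketch of that loop. The `call_llm` helper, the prompt strings, and the representation of tests as executable assert statements are assumptions for illustration, not the paper's actual interfaces; the iteration cap and early-stop condition follow the description given in the rebuttal further down this page.

    # Minimal sketch of an AgentCoder-style loop (illustrative, not the paper's code).
    # `call_llm` is a hypothetical helper that prompts one LLM per role; tests are
    # assumed to be strings of executable assert statements.

    MAX_ITERATIONS = 10  # cap stated in the rebuttal below; early stop on all-pass

    def execute_tests(code: str, tests: list[str]) -> list[tuple[str, str]]:
        """Test executor agent: run each designer-generated test against the
        code and collect (test, error message) pairs for every failure."""
        failures = []
        for test in tests:
            env: dict = {}
            try:
                exec(code + "\n" + test, env)  # test is e.g. "assert f(...) == ..."
            except Exception as exc:
                failures.append((test, f"{type(exc).__name__}: {exc}"))
        return failures

    def agentcoder_loop(task: str) -> str:
        code = call_llm(role="programmer", prompt=task)
        tests = call_llm(role="test_designer", prompt=task)  # list of assert strings
        for _ in range(MAX_ITERATIONS):
            failures = execute_tests(code, tests)
            if not failures:  # every generated test passes: stop refining
                break
            feedback = "\n".join(f"FAILED: {t}\n  {msg}" for t, msg in failures)
            code = call_llm(
                role="programmer",
                prompt=f"{task}\n\nTests:\n" + "\n".join(tests)
                       + f"\n\nExecutor feedback:\n{feedback}\n\nRevise the code.",
            )
        return code

Note that the loop's stopping signal comes entirely from designer-generated tests; the benchmark's hidden tests never enter the loop, which is exactly where the referee's overfitting concern below gets its footing.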
If this is right
- Accuracy on standard code generation benchmarks rises for multiple underlying language models.
- Overall token usage for producing and verifying code falls compared with prior single-model and prompt-based approaches.
- The method outperforms both base models and the twelve enhancement baselines evaluated in the experiments.
- Specialized agent roles allow each component to focus on one part of the task instead of requiring one model to handle writing, testing, and debugging simultaneously.
Where Pith is reading between the lines
- The same role separation might transfer to other structured generation tasks such as writing documentation or fixing bugs.
- Reduced token consumption could make the technique practical for environments with strict compute limits.
- Performance may depend on how well the generated tests cover edge cases that matter for real-world use.
- Adding agents for additional checks like security or efficiency analysis could be a direct next step.
Load-bearing premise
Feedback produced by running the designer-generated tests is assumed to steer the programmer toward code that is correct in general rather than merely passing those particular tests.
What would settle it
Measure whether the final code still passes a fresh collection of human-written tests that were never shown to any of the agents during the generation process.
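A minimal sketch of that experiment, assuming the final programs and held-out human-written tests are available as strings (all names here are hypothetical, not artifacts shipped with the paper):

    # Sketch of the settling experiment: score AgentCoder's final programs
    # against human-written tests that no agent saw during generation.
    # `solutions` maps a problem id to its final code; `held_out_suites`
    # maps it to unseen assert-style tests. Both inputs are hypothetical.

    def passes_held_out(final_code: str, held_out_tests: list[str]) -> bool:
        env: dict = {}
        try:
            exec(final_code, env)
            for test in held_out_tests:  # e.g. "assert has_close_elements([1.0, 2.0], 0.3) is False"
                exec(test, env)
        except Exception:
            return False
        return True

    def held_out_pass_rate(solutions: dict[str, str],
                           held_out_suites: dict[str, list[str]]) -> float:
        """Fraction of problems whose final program passes every unseen test."""
        hits = sum(passes_held_out(solutions[pid], held_out_suites[pid])
                   for pid in solutions)
        return hits / len(solutions)

A gap between this rate and the in-loop pass rate on designer-generated tests would quantify how much of the reported gain is overfitting to the agents' own tests.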
Original abstract
The advancement of natural language processing (NLP) has been significantly boosted by the development of transformer-based large language models (LLMs). These models have revolutionized NLP tasks, particularly in code generation, aiding developers in creating software with enhanced efficiency. Despite their advancements, challenges in balancing code snippet generation with effective test case generation and execution persist. To address these issues, this paper introduces Multi-Agent Assistant Code Generation (AgentCoder), a novel solution comprising a multi-agent framework with specialized agents: the programmer agent, the test designer agent, and the test executor agent. During the coding procedure, the programmer agent will focus on the code generation and refinement based on the test executor agent's feedback. The test designer agent will generate test cases for the generated code, and the test executor agent will run the code with the test cases and write the feedback to the programmer. This collaborative system ensures robust code generation, surpassing the limitations of single-agent models and traditional methodologies. Our extensive experiments on 9 code generation models and 12 enhancement approaches showcase AgentCoder's superior performance over existing code generation models and prompt engineering techniques across various benchmarks. For example, AgentCoder (GPT-4) achieves 96.3% and 91.8% pass@1 in HumanEval and MBPP datasets with an overall token overhead of 56.9K and 66.3K, while state-of-the-art obtains only 90.2% and 78.9% pass@1 with an overall token overhead of 138.2K and 206.5K.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes AgentCoder, a multi-agent code generation framework with three specialized agents: a programmer agent that generates and refines code, a test designer agent that creates test cases, and a test executor agent that runs the code against those tests and provides feedback. The system performs iterative refinement until the code passes the generated tests. Experiments across nine base LLMs and two benchmarks (HumanEval, MBPP) report that AgentCoder with GPT-4 reaches 96.3% and 91.8% pass@1 while using substantially lower token overhead than prior state-of-the-art methods.
Significance. If the central performance claims hold under scrutiny, the work provides concrete evidence that explicit separation of test design, execution, and code refinement within a multi-agent loop can simultaneously raise pass rates and reduce token consumption relative to single-model or prompt-engineering baselines. The efficiency gains are particularly noteworthy given the public benchmarks used.
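For orientation, the pass@k metric behind these numbers is the standard unbiased estimator of Chen et al. (reference [7] below): for each problem, generate n samples, count the c that pass the hidden tests, and average

\[
\text{pass@}k \;=\; \mathbb{E}_{\text{problems}}\!\left[\,1 - \frac{\binom{n-c}{k}}{\binom{n}{k}}\,\right],
\]

so pass@1 reduces to the mean fraction of correct samples. In an iterative pipeline like AgentCoder it is effectively the fraction of problems whose single final program passes the benchmark's hidden tests.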
major comments (2)
- [Experimental Evaluation] Experimental section (results on HumanEval/MBPP): the headline pass@1 figures rest on the unverified assumption that tests produced by the test designer agent provide sufficient coverage of the benchmark's hidden test distribution. No coverage statistics, no comparison of generated versus benchmark test suites, and no ablation that removes the test designer while preserving the executor feedback loop are reported; this leaves open the possibility that reported gains arise from overfitting to incomplete self-generated tests rather than genuine generalization.
- [AgentCoder Framework] Methods description of the iterative loop: the paper does not specify how many iterations are performed, what stopping criterion is used when the executor reports failures, or whether the programmer agent is allowed to see the full generated test suite versus only failure messages. These details are load-bearing for reproducing the exact 96.3%/91.8% numbers and for assessing whether the process introduces new errors.
minor comments (2)
- [Abstract] The abstract states results on '12 enhancement approaches' but the main text should list them explicitly with citations for reproducibility.
- [Results] Token-overhead numbers (56.9K / 66.3K) should be broken down by agent in a table so readers can see the contribution of test generation versus code refinement.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed comments, which help improve the clarity and rigor of our work. We address each major comment below and will incorporate revisions to strengthen the manuscript.
point-by-point responses
- Referee: [Experimental Evaluation] Experimental section (results on HumanEval/MBPP): the headline pass@1 figures rest on the unverified assumption that tests produced by the test designer agent provide sufficient coverage of the benchmark's hidden test distribution. No coverage statistics, no comparison of generated versus benchmark test suites, and no ablation that removes the test designer while preserving the executor feedback loop are reported; this leaves open the possibility that reported gains arise from overfitting to incomplete self-generated tests rather than genuine generalization.
  Authors: We acknowledge that the current manuscript lacks explicit coverage statistics, direct comparisons between generated and benchmark test suites, and an ablation isolating the test designer's contribution while retaining executor feedback. Although pass@1 on the official hidden tests of HumanEval and MBPP remains the community-standard metric and our gains hold across nine base models, we agree these additions would better rule out overfitting concerns. In the revised version we will add: (i) average coverage metrics (statement/branch) for the generated test suites, (ii) a qualitative comparison of test-suite sizes and diversity, and (iii) the requested ablation that removes the test-designer agent while preserving the executor loop. These analyses will be reported in a new subsection of the experimental evaluation (a sketch of such a coverage measurement follows these responses). revision: yes
- Referee: [AgentCoder Framework] Methods description of the iterative loop: the paper does not specify how many iterations are performed, what stopping criterion is used when the executor reports failures, or whether the programmer agent is allowed to see the full generated test suite versus only failure messages. These details are load-bearing for reproducing the exact 96.3%/91.8% numbers and for assessing whether the process introduces new errors.
  Authors: We agree these implementation details are essential for reproducibility. The revised manuscript will explicitly state that the loop runs for a maximum of 10 iterations or terminates early once the executor reports that every generated test passes. The programmer agent is provided with the complete set of test cases produced by the test designer together with the executor's failure messages (including which tests failed and the corresponding error output). We will also insert a concise algorithm box and additional pseudocode in Section 3 to make the control flow unambiguous. revision: yes
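The coverage measurement promised in rebuttal item (i) could look like the following sketch. It uses the real coverage.py package; everything else (the file name, the module name, the test strings) is a hypothetical stand-in for generated artifacts, not the authors' tooling.

    # Sketch of branch-coverage measurement for designer-generated test suites.
    import importlib
    import pathlib

    import coverage  # third-party package: coverage.py

    def branch_coverage(solution_code: str, generated_tests: list[str]) -> float:
        path = pathlib.Path("candidate_solution.py")
        path.write_text(solution_code)  # assumes the working directory is importable
        cov = coverage.Coverage(branch=True, include=[str(path)])
        cov.start()
        module = importlib.import_module("candidate_solution")
        for test in generated_tests:    # e.g. "assert f(...) == ..."
            try:
                exec(test, vars(module))
            except AssertionError:
                pass                    # a failing test still exercises branches
        cov.stop()
        return cov.report()             # total coverage percentage (0-100)

Reporting this number alongside the same measurement for the benchmark's own test suites would make the comparison the referee asks for concrete.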
Circularity Check
No circularity: empirical benchmark results are externally measured
full rationale
The paper presents an empirical multi-agent framework evaluated directly on the public HumanEval and MBPP benchmarks. Pass@1 rates and token overheads are obtained by executing the system against those fixed external test suites rather than being computed from any internally defined quantities, fitted parameters, or self-referential equations. No mathematical derivation chain exists that reduces a claimed result to its own inputs by construction; the central performance numbers are independent measurements. Self-citations, if present, are not load-bearing for the reported results.
Axiom & Free-Parameter Ledger
axioms (2)
- Domain assumption: LLM agents can reliably interpret and execute distinct roles (programmer, test designer, executor) when given natural-language instructions.
- Domain assumption: pass@1 on HumanEval and MBPP is a valid proxy for real-world code correctness.
invented entities (3)
- Programmer agent (no independent evidence)
- Test designer agent (no independent evidence)
- Test executor agent (no independent evidence)
Forward citations
Cited by 20 Pith papers
- FrontierSmith: Synthesizing Open-Ended Coding Problems at Scale
  FrontierSmith automates synthesis of open-ended coding problems from closed-ended seeds and shows measurable gains on two open-ended LLM coding benchmarks.
- ARIADNE: Agentic Reward-Informed Adaptive Decision Exploration via Blackboard-Driven MCTS for Competitive Program Generation
  ARIADNE combines blackboard architecture with MCTS to coordinate strategy, code, test, evaluation, and repair stages, yielding higher Pass@1 scores than prior LLM baselines on APPS, CodeContests, and related benchmarks.
- BIM Information Extraction Through LLM-based Adaptive Exploration
  LLM adaptive exploration via runtime code execution outperforms static query generation for information extraction from heterogeneous BIM models on the new ifc-bench v2 benchmark.
- Social Bias in LLM-Generated Code: Benchmark and Mitigation
  LLMs show up to 60.58% social bias in generated code; a new Fairness Monitor Agent cuts bias by 65.1% and raises functional correctness from 75.80% to 83.97%.
- Evaluating LLM Agents on Automated Software Analysis Tasks
  A custom LLM agent achieves 94% manually verified success on a new benchmark of 35 software analysis setups, outperforming baselines at 77%, but struggles with stage mixing, error localization, and overestimating its ...
- HiL-Bench (Human-in-Loop Benchmark): Do Agents Know When to Ask for Help?
  HiL-Bench shows frontier AI agents fail to ask for help on incomplete tasks, recovering only a fraction of full-information performance, but RL training on Ask-F1 reward improves judgment and transfers across domains.
- ReCodeAgent: A Multi-Agent Workflow for Language-agnostic Translation and Validation of Large-scale Repositories
  ReCodeAgent uses a multi-agent system to translate and validate large code repositories across multiple programming languages, achieving 60.8% higher test pass rates than prior neuro-symbolic and agentic methods on 11...
- An Iterative Test-and-Repair Framework for Competitive Code Generation
  FixAudit improves LLM code generation on competitive programming benchmarks by training a shared model for iterative code-aware test generation and repair, achieving 35%+ gains in Pass@1 over baselines on the same 7B model.
- BACE: LLM-based Code Generation through Bayesian Anchored Co-Evolution of Code and Test Populations
  BACE reformulates LLM code synthesis as Bayesian co-evolution of code and test populations anchored on minimal public examples, achieving superior performance on LiveCodeBench v6.
- Conformal Agent Error Attribution
  A new filtration-based conformal prediction method attributes errors in multi-agent systems by producing contiguous sequence sets with finite-sample coverage guarantees, enabling rollback recovery.
- Tail-aware N-version Machine Learning Models for Reliable API Recommendation
  NvRec profiles multiple API recommendation models on tail-API performance and applies majority voting with reliability filters to raise true accept rates while controlling rejection of uncertain outputs.
- SAFEdit: Does Multi-Agent Decomposition Resolve the Reliability Challenges of Instructed Code Editing?
  SAFEdit reaches 68.6% task success on EditBench code edits by using planner, editor, and verifier agents plus a failure abstraction layer, beating single-model and ReAct baselines.
- No Test Cases, No Problem: Distillation-Driven Code Generation for Scientific Workflows
  MOSAIC generates executable scientific code without I/O test cases by combining student-teacher distillation with a consolidated context window to reduce hallucinations across subproblems.
- You Don't Need Public Tests to Generate Correct Code
  DryRUN lets LLMs create their own test inputs and run internal simulations for self-correcting code generation, matching the performance of test-dependent methods like CodeSIM on LiveCodeBench without public tests or ...
- Explicit Trait Inference for Multi-Agent Coordination
  ETI lets LLM agents infer and track partners' psychological traits (warmth and competence) from histories, cutting payoff loss 45-77% in games and boosting performance 3-29% on MultiAgentBench versus CoT baselines.
- ORBIT: Guided Agentic Orchestration for Autonomous C-to-Rust Transpilation
  ORBIT achieves 100% compilation success and 91.7% test success on 24 mostly large programs from CRUST-Bench by using dependency-aware orchestration and iterative verification, outperforming prior static and baseline tools.
- Bridging the Gap between User Intent and LLM: A Requirement Alignment Approach for Code Generation
  REA-Coder improves LLM code generation by iteratively aligning requirements with model understanding and verifying outputs against the aligned spec.
- Position: agentic AI orchestration should be Bayes-consistent
  Agentic AI orchestration should apply Bayesian principles for belief maintenance, updating from interactions, and utility-based action selection.
- A Survey of Self-Evolving Agents: What, When, How, and Where to Evolve on the Path to Artificial Super Intelligence
  The paper delivers the first systematic review of self-evolving agents, structured around what components evolve, when adaptation occurs, and how it is implemented.
- A Survey on Large Language Models for Code Generation
  A systematic literature review that organizes recent work on LLMs for code generation into a taxonomy covering data curation, model advances, evaluations, ethics, environmental impact, and applications, with benchmark...
Reference graph
Works this paper leans on
- [1] Unified pre-training for program understanding and generation
  Wasi Uddin Ahmad, Saikat Chakraborty, Baishakhi Ray, and Kai-Wei Chang. arXiv:2103.06333, 2021.
- [2] Program synthesis with large language models
  Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie J. Cai, Michael Terry, Quoc V. Le, and Charles Sutton. arXiv:2108.07732, 2021.
- [4] Language models are few-shot learners
  Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. arXiv:2005.14165, 2020.
- [5] Large language models as tool makers
  Tianle Cai, Xuezhi Wang, Tengyu Ma, Xinyun Chen, and Denny Zhou. arXiv:2305.17126, 2023.
- [7] Evaluating large language models trained on code
  Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde, Jared Kaplan, Harrison Edwards, Yura Burda, Nicholas Joseph, Greg Brockman, et al. arXiv:2107.03374, 2021.
- [8] AgentVerse: Facilitating multi-agent collaboration and exploring emergent behaviors in agents
  Weize Chen, Yusheng Su, Jingwei Zuo, Cheng Yang, Chenfei Yuan, Cheng Qian, Chi-Min Chan, Yujia Qin, Ya-Ting Lu, Ruobing Xie, Zhiyuan Liu, Maosong Sun, and Jie Zhou. arXiv:2308.10848, 2023.
- [9] Teaching large language models to self-debug
  Xinyun Chen, Maxwell Lin, Nathanael Schärli, and Denny Zhou. arXiv:2304.05128, 2023.
- [10] CodeScore: Evaluating code generation by learning code execution
  Yihong Dong, Ji Ding, Xue Jiang, Zhuo Li, Ge Li, and Zhi Jin. arXiv:2301.09043, 2023.
- [11] Self-collaboration code generation via ChatGPT
  Yihong Dong, Xue Jiang, Zhi Jin, and Ge Li. arXiv:2304.07590, 2023.
- [12] Human-level play in the game of Diplomacy by combining language models with strategic reasoning
  Meta Fundamental AI Research Diplomacy Team (FAIR), Anton Bakhtin, Noam Brown, Emily Dinan, Gabriele Farina, Colin Flaherty, Daniel Fried, Andrew Goff, Jonathan Gray, Hengyuan Hu, et al. Science, 378(6624):1067–1074, 2022.
- [13] CodeBERT: A pre-trained model for programming and natural languages
  Zhangyin Feng, Daya Guo, Duyu Tang, Nan Duan, Xiaocheng Feng, Ming Gong, Linjun Shou, Bing Qin, Ting Liu, Daxin Jiang, and Ming Zhou. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 1536–1547, 2020.
- [14] InCoder: A generative model for code infilling and synthesis
  Daniel Fried, Armen Aghajanyan, Jessy Lin, Sida I. Wang, Eric Wallace, Freda Shi, Ruiqi Zhong, Wen-tau Yih, Luke Zettlemoyer, and Mike Lewis. arXiv:2204.05999, 2022.
- [16] Reasoning with language model is planning with world model
  Shibo Hao, Yi Gu, Haodi Ma, Joshua Jiahua Hong, Zhen Wang, Daisy Zhe Wang, and Zhiting Hu. arXiv:2305.14992, 2023.
- [17] MetaGPT: Meta programming for a multi-agent collaborative framework
  Sirui Hong, Xiawu Zheng, Jonathan Chen, Yuheng Cheng, Jinlin Wang, Ceyao Zhang, Zili Wang, Steven Ka Shing Yau, Zijuan Lin, Liyang Zhou, et al. arXiv:2308.00352, 2023.
- [18] CodeCoT and beyond: Learning to program and test like a developer
  Dong Huang, Qi Bu, and Heming Cui. arXiv:2308.08784, 2023.
- [19] Self-planning code generation with large language model
  Xue Jiang, Yihong Dong, Lecheng Wang, Qiwei Shang, and Ge Li. arXiv:2303.06689, 2023.
- [21] Competition-level code generation with AlphaCode
  Yujia Li, David H. Choi, Junyoung Chung, Nate Kushman, Julian Schrittwieser, Rémi Leblond, Tom Eccles, James Keeling, Felix Gimeno, Agustin Dal Lago, et al. Science, 378(6624):1092–1097, 2022.
- [22] CAMEL: Communicative agents for "mind" exploration of large language model society
  Guohao Li, Hasan Abed Al Kader Hammoud, Hani Itani, Dmitrii Khizbullin, and Bernard Ghanem. In Thirty-seventh Conference on Neural Information Processing Systems (NeurIPS), 2023.
- [23] Structured chain-of-thought prompting for code generation
  Jia Li, Ge Li, Yongming Li, and Zhi Jin. 2023.
- [24] StarCoder: may the source be with you!
  Raymond Li, Loubna Ben Allal, Yangtian Zi, Niklas Muennighoff, Denis Kocetkov, Chenghao Mou, Marc Marone, Christopher Akiki, Jia Li, Jenny Chim, et al. arXiv:2305.06161, 2023.
- [25] Training socially aligned language models in simulated human society
  Ruibo Liu, Ruixin Yang, Chenyan Jia, Ge Zhang, Denny Zhou, Andrew M. Dai, Diyi Yang, and Soroush Vosoughi. arXiv:2305.16960, 2023.
- [26] Self-Refine: Iterative refinement with self-feedback
  Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, Sean Welleck, Bodhisattwa Prasad Majumder, Shashank Gupta, Amir Yazdanbakhsh, and Peter Clark. arXiv:2303.17651, 2023.
- [27] CodeGen2: Lessons for training LLMs on programming and natural languages
  Erik Nijkamp, Hiroaki Hayashi, Caiming Xiong, Silvio Savarese, and Yingbo Zhou. ICLR, 2023.
- [28] CodeGen: An open large language model for code with multi-turn program synthesis
  Erik Nijkamp, Bo Pang, Hiroaki Hayashi, Lifu Tu, Huan Wang, Yingbo Zhou, Silvio Savarese, and Caiming Xiong. ICLR, 2023.
- [29] GPT-4 technical report
  OpenAI. arXiv:2303.08774, 2023.
- [30] Generative agents: Interactive simulacra of human behavior
  Joon Sung Park, Joseph O'Brien, Carrie Jun Cai, Meredith Ringel Morris, Percy Liang, and Michael S. Bernstein. In Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology, pages 1–22, 2023.
- [31] ChatDev: Communicative agents for software development
  Chen Qian, Xin Cong, Cheng Yang, Weize Chen, Yusheng Su, Juyuan Xu, Zhiyuan Liu, and Maosong Sun. arXiv:2307.07924, 2023.
- [32] Code Llama: Open foundation models for code
  Baptiste Roziere, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Tal Remez, Jérémy Rapin, et al. arXiv:2308.12950, 2023.
- [33] Reflexion: Language agents with verbal reinforcement learning
  Noah Shinn, Federico Cassano, Beck Labash, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. 2023.
- [34] CodeT5: Identifier-aware unified pre-trained encoder-decoder models for code understanding and generation
  Yue Wang, Weishi Wang, Shafiq Joty, and Steven C.H. Hoi. In EMNLP, 2021.
- [35] Intervenor: Prompt the coding ability of large language models with the interactive chain of repairing
  Hanbin Wang, Zhenghao Liu, Shuo Wang, Ganqu Cui, Ning Ding, Zhiyuan Liu, and Ge Yu. arXiv:2311.09868, 2023.
- [36] CodeT5+: Open code large language models for code understanding and generation
  Yue Wang, Hung Le, Akhilesh Deepak Gotmare, Nghi D.Q. Bui, Junnan Li, and Steven C.H. Hoi. arXiv:2305.07922, 2023.
- [37] Chain-of-thought prompting elicits reasoning in large language models
  Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Ed Huai-hsin Chi, Fei Xia, Quoc Le, and Denny Zhou. arXiv:2201.11903, 2022.
- [38] ReAct: Synergizing reasoning and acting in language models
  Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. arXiv:2210.03629, 2022.
- [39] Tree of Thoughts: Deliberate problem solving with large language models
  Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Thomas L. Griffiths, Yuan Cao, and Karthik Narasimhan. arXiv:2305.10601, 2023.
- [40] CERT: Continual pre-training on sketches for library-oriented code generation
  Daoguang Zan, Bei Chen, Dejian Yang, Zeqi Lin, Minsu Kim, Bei Guan, Yongji Wang, Weizhu Chen, and Jian-Guang Lou. In The 2022 International Joint Conference on Artificial Intelligence (IJCAI), 2022.
- [41] ProAgent: Building proactive cooperative AI with large language models
  Ceyao Zhang, Kaijie Yang, Siyi Hu, Zihao Wang, Guanghe Li, Yi Eve Sun, Chen Zhang, Zhaowei Zhang, Anji Liu, Song-Chun Zhu, Xiaojun Chang, Junge Zhang, F. Yin, Yitao Liang, and Yaodong Yang. arXiv:2308.11339, 2023.
- [42] Self-Edit: Fault-aware code editor for code generation
  Kechi Zhang, Zhuo Li, Jia Li, Ge Li, and Zhi Jin. arXiv:2305.04087, 2023.
- [43] CodeGeeX: A pre-trained model for code generation with multilingual evaluations on HumanEval-X
  Qinkai Zheng, Xiao Xia, Xu Zou, Yuxiao Dong, Shanshan Wang, Yufei Xue, Zi-Yuan Wang, Lei Shen, Andi Wang, Yang Li, Teng Su, Zhilin Yang, and Jie Tang. arXiv:2303.17568, 2023.
- [44] Mindstorms in natural language-based societies of mind
  Mingchen Zhuge, Haozhe Liu, Francesco Faccio, Dylan R. Ashley, Róbert Csordás, Anand Gopalakrishnan, Abdullah Hamdi, Hasan Abed Al Kader Hammoud, Vincent Herrmann, Kazuki Irie, et al. arXiv:2305.17066, 2023.