Solvita: Enhancing Large Language Models for Competitive Programming via Agentic Evolution

Han Li, Jinyu Tian, Rili Feng, Yuqiao Du, Chong Zheng, Chenyu Wang, Chenchen Liu, Shihao Li, Xinping Lei, Yifan Yao, Weihao Xie, Letian Zhu, Jiaheng Liu

arxiv: 2605.15301 · v1 · pith:S4UISSUInew · submitted 2026-05-14 · 💻 cs.AI

Pith reviewed 2026-05-19 16:23 UTC · model grok-4.3

classification 💻 cs.AI

keywords agentic evolution · competitive programming · large language models · multi-agent systems · knowledge networks · reinforcement learning · code generation · agentic framework

The pith

Solvita achieves state-of-the-art results in competitive programming by letting LLM agents continuously learn from past outcomes using updatable knowledge networks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Solvita to fix the stateless limitation in current multi-agent code generation setups that discard experience after each task. It structures problem solving as a closed loop of strategy selection, synthesis, verification, and targeted attacks carried out by four agents, each tied to its own trainable graph-structured knowledge network. Outcome signals from executions are turned into reinforcement learning updates that adjust the networks, so future queries get routed and solved using patterns from earlier successes and failures. A sympathetic reader would care because the approach shows how to make AI coding systems improve over repeated use without retraining the base model weights.

Core claim

Solvita establishes a closed-loop agentic evolution system where Planner, Solver, Oracle, and Hacker agents use trainable graph-structured knowledge networks that update via reinforcement learning from pass/fail verdicts, test certification quality, and adversarial vulnerabilities to accumulate transferable reasoning experience for competitive programming tasks.

What carries the argument

The trainable graph-structured knowledge network paired with each specialized agent, which dynamically routes queries and accumulates experience by treating outcome signals as reinforcement learning updates.
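The closed loop described above can be illustrated with a toy sketch. Everything in it is an assumption made for illustration: `KnowledgeNetwork` is a hypothetical stand-in that keeps one scalar weight per option rather than a real graph, the Solver, Oracle, and Hacker steps are stubs, and the additive update is a placeholder rather than the paper's rule. It only shows how an outcome signal can bias later strategy selection without touching any LLM weights.

```python
import random
from dataclasses import dataclass, field


@dataclass
class KnowledgeNetwork:
    """Hypothetical stand-in for one agent's trainable graph: a weight per option."""
    weights: dict = field(default_factory=dict)
    step: float = 0.5    # assumed update size
    floor: float = 0.05  # keeps every option selectable

    def pick(self, options, rng):
        # Sample an option with probability proportional to its weight.
        w = [self.weights.get(o, 1.0) for o in options]
        return rng.choices(options, weights=w, k=1)[0]

    def update(self, option, reward):
        # Placeholder additive update driven by the outcome signal.
        new = self.weights.get(option, 1.0) + self.step * reward
        self.weights[option] = max(self.floor, new)


def run_episode(problem, nets, rng):
    # Planner: strategy selection routed by its network.
    strategy = nets["planner"].pick(problem["strategies"], rng)
    # Solver: program synthesis (stubbed; the "program" is the strategy name).
    program = strategy
    # Oracle: certified supervision (stubbed as a lookup of the accepted strategy).
    passed = program == problem["accepted_strategy"]
    # Hacker: targeted attack (stubbed; this toy never finds a counterexample).
    hacked = False
    reward = 1.0 if (passed and not hacked) else -1.0
    # Only the Planner's network is updated in this toy.
    nets["planner"].update(strategy, reward)
    return reward


def run(n_episodes=50, seed=0):
    rng = random.Random(seed)
    names = ("planner", "solver", "oracle", "hacker")
    nets = {name: KnowledgeNetwork() for name in names}
    problem = {"strategies": ["greedy", "dp"], "accepted_strategy": "dp"}
    for _ in range(n_episodes):
        run_episode(problem, nets, rng)
    return nets["planner"].weights


if __name__ == "__main__":
    print(run())
```

After repeated episodes on the same toy problem, the weight of the accepted strategy ends above the rejected one, which is the routing effect the review attributes to the knowledge networks.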

If this is right

The agents achieve higher success rates on benchmarks such as CodeContests, APPS, and Codeforces by learning from past outcomes.
The framework outperforms static multi-agent pipelines by maintaining and using accumulated knowledge across problems.
Solvita nearly doubles the accuracy of single-pass LLM baselines through the evolutionary learning process.
Future tasks benefit from dynamic routing based on historical successes and failures without requiring LLM weight updates.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This method could extend to other complex reasoning tasks like mathematical problem solving by similar outcome-based network updates.
Smaller LLMs might achieve competitive performance levels by relying more on the growing knowledge networks.
The approach suggests building persistent agent systems that improve with each new user query over long periods.

Load-bearing premise

That signals from program execution results can be reliably converted into effective reinforcement learning updates for the knowledge networks to enhance future performance.

What would settle it

A controlled test showing that a version of Solvita with disabled network updates performs equally well or better than the full learning version on a series of new competitive programming problems would falsify the benefit of the updates.

Figures

Figures reproduced from arXiv: 2605.15301 by Chenchen Liu, Chenyu Wang, Chong Zheng, Han Li, Jiaheng Liu, Jinyu Tian, Letian Zhu, Rili Feng, Shihao Li, Weihao Xie, Xinping Lei, Yifan Yao, Yuqiao Du.

**Figure 2.** The Solvita architecture and its comparison with existing agent frameworks. Solvita couples an …

**Figure 3.** The three-layer Solver knowledge network. Q nodes (top) store problem descriptions and …

**Figure 4.** Seed-level strategy taxonomy of Oracle and Hacker memory, showing how each agent factorizes …

**Figure 5.** Cost and failure-profile analysis. (a) Average prompt and completion token consumption per …

**Figure 6.** Oracle/Hacker diagnostics and Codeforces evaluation across three backbones (Claude Opus 4.6, …

Abstract

Large language models (LLMs) still struggle with the rigorous reasoning demands of hard competitive programming. While recent multi-agent frameworks attempt to bridge this reliability gap, they remain fundamentally stateless: they rely on static retrieval and discard the valuable problem-solving and debugging experience gained from previous tasks. To address this, we present Solvita, an agentic evolution framework that enables continuous learning without requiring weight updates to the underlying LLM. Solvita reorganizes problem-solving into a closed-loop system of strategy selection, program synthesis, certified supervision, and targeted hacking, executed by four specialized agents: Planner, Solver, Oracle, and Hacker. Crucially, each agent is paired with a trainable, graph-structured knowledge network. As the system operates, outcome signals, such as pass/fail verdicts, test certification quality, and adversarial vulnerabilities discovered by the Hacker, are recast as reinforcement learning updates to these network weights. This allows the agents to dynamically route future queries based on past successes and failures, effectively accumulating transferable reasoning experience over time. Evaluated across CodeContests, APPS, AetherCode, and live Codeforces rounds, Solvita establishes a new state-of-the-art among code-generation agents, outperforming existing multi-agent pipelines and nearly doubling the accuracy of single-pass baselines.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

Solvita adds graph networks to four specialized agents so they can update from pass/fail and hacking signals, but the update rules and generalization are not shown clearly enough to confirm real transferable learning.

Hey, the main thing to know is that Solvita tries to fix the stateless problem in multi-agent code frameworks by giving each of four agents—Planner, Solver, Oracle, and Hacker—its own trainable graph knowledge network. Outcome signals like pass/fail verdicts, certification quality, and adversarial vulnerabilities get turned into reinforcement learning updates on those networks, letting the system route decisions better on future problems without touching the base LLM weights. They evaluate on CodeContests, APPS, AetherCode, and live Codeforces rounds and report beating prior multi-agent pipelines while nearly doubling single-pass accuracy.

Referee Report

2 major / 2 minor

Summary. The manuscript presents Solvita, an agentic evolution framework for improving LLMs on competitive programming tasks. It organizes problem-solving into a closed-loop system involving four agents—Planner, Solver, Oracle, and Hacker—each paired with a trainable graph-structured knowledge network. Outcome signals such as pass/fail verdicts, test certification quality, and adversarial vulnerabilities are recast as reinforcement learning updates to these network weights, enabling dynamic routing and accumulation of transferable reasoning experience without modifying the underlying LLM parameters. The system is evaluated on CodeContests, APPS, AetherCode, and live Codeforces rounds, claiming new state-of-the-art results among code-generation agents that outperform existing multi-agent pipelines and nearly double the accuracy of single-pass baselines.

Significance. If the performance claims and the mechanism for continuous learning hold, this work could significantly advance agentic systems for code generation by addressing the stateless limitation of current multi-agent frameworks. The evaluations on multiple benchmarks including live Codeforces rounds provide falsifiable predictions that strengthen the assessment of practical utility.

major comments (2)

[§3.2] The claim that outcome signals (pass/fail verdicts, certification quality, adversarial vulnerabilities) are recast as RL updates to produce accumulating transferable reasoning experience lacks a concrete reward definition, graph topology, or update rule (e.g., no specification of policy gradient, Q-learning, or how edge weights modulate Planner/Solver choices). This is load-bearing for the central assertion that the graph encodes reusable strategy patterns rather than per-problem heuristics.
[§5, Table 1] The SOTA claims and 'nearly doubling' accuracy improvement are stated without quantitative metrics, error analysis, ablation studies isolating the knowledge-network updates from multi-agent orchestration, or statistical significance tests, undermining verification of whether the described updates actually support the performance claims.

minor comments (2)

[Figure 1] The closed-loop system diagram would benefit from explicit arrows showing information flow between the four agents and the trainable graph-structured knowledge networks.
The paper does not include a limitations section discussing potential failure modes when the graph network fails to generalize across problem distributions.

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the opportunity to respond to the referee's comments. We address each major comment in turn and outline the changes we will make to the manuscript.

Referee: [§3.2] The claim that outcome signals (pass/fail verdicts, certification quality, adversarial vulnerabilities) are recast as RL updates to produce accumulating transferable reasoning experience lacks a concrete reward definition, graph topology, or update rule (e.g., no specification of policy gradient, Q-learning, or how edge weights modulate Planner/Solver choices). This is load-bearing for the central assertion that the graph encodes reusable strategy patterns rather than per-problem heuristics.

Authors: We thank the referee for highlighting this important point. Section 3.2 of the manuscript provides an overview of how outcome signals are used to update the graph-structured knowledge networks via reinforcement learning. However, we agree that additional technical details would clarify the mechanism. In the revised manuscript, we will expand this section to include the specific reward function, which combines the pass/fail verdict with terms for certification quality and vulnerability discovery. We will also specify the graph topology as a directed graph where nodes represent abstract reasoning patterns and edges encode transition probabilities between strategies. The update rule employs a policy gradient method, with edge weights adjusted based on the outcome to favor successful paths. This design ensures that the accumulated experience is transferable across problems, as the patterns are not tied to individual instances but to general problem-solving approaches. We will include a formal description and pseudocode to make this explicit. revision: yes
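The kind of update the simulated authors describe can be sketched as follows. This is a minimal illustration, not the manuscript's rule: it assumes a softmax policy over the logits of one node's outgoing edges and a plain REINFORCE step. The function names, the linear reward form, its coefficients `a`, `b`, `c`, and the choice to penalize a Hacker-discovered vulnerability are all hypothetical.

```python
import math


def softmax(logits):
    # Numerically stable softmax over one node's outgoing edge logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def reward(passed, cert_quality, hacked, a=1.0, b=0.5, c=0.5):
    # Assumed linear mix: verdict term, certification-quality term in [0, 1],
    # and a penalty when the Hacker found a vulnerability.
    verdict = 1.0 if passed else -1.0
    penalty = 1.0 if hacked else 0.0
    return a * verdict + b * cert_quality - c * penalty


def reinforce_step(logits, chosen, r, lr=0.1, baseline=0.0):
    # REINFORCE: d/d logit_j of log pi(chosen) equals 1[j == chosen] - pi_j.
    probs = softmax(logits)
    advantage = r - baseline
    return [
        x + lr * advantage * ((1.0 if j == chosen else 0.0) - p)
        for j, (x, p) in enumerate(zip(logits, probs))
    ]


if __name__ == "__main__":
    logits = [0.0, 0.0, 0.0]
    r = reward(passed=True, cert_quality=1.0, hacked=False)
    print(reinforce_step(logits, chosen=1, r=r))
```

A positive reward raises the logit of the edge that was followed and lowers the others by the same total amount, so the routing distribution shifts toward paths that previously succeeded. Whether such edge-level updates transfer across problems is exactly what the referee asks the authors to demonstrate.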
Referee: [§5, Table 1] The SOTA claims and 'nearly doubling' accuracy improvement are stated without quantitative metrics, error analysis, ablation studies isolating the knowledge-network updates from multi-agent orchestration, or statistical significance tests, undermining verification of whether the described updates actually support the performance claims.

Authors: We appreciate the referee's call for stronger empirical validation. Table 1 in Section 5 presents the performance results on the benchmarks, showing Solvita outperforming baselines and achieving the claimed improvements, including the near-doubling relative to single-pass methods. To address the concerns, we will add in the revision: (1) quantitative metrics with exact percentages and absolute numbers, (2) error analysis discussing failure cases, (3) ablation studies that compare the full system against a variant without the knowledge network updates (keeping the agent orchestration fixed), and (4) statistical significance tests such as paired t-tests or bootstrap confidence intervals on the performance differences. These additions will help isolate the contribution of the RL updates to the knowledge networks. revision: yes
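One of the tests named in this response, a bootstrap confidence interval on paired per-problem outcomes, might look like the sketch below. It assumes each problem yields a 0/1 solved flag for the full system and for the ablated variant without knowledge-network updates. The function name, the percentile method, and the sample inputs are illustrative assumptions, not results from the paper.

```python
import random


def paired_bootstrap_ci(full, ablated, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean per-problem difference (full - ablated)."""
    if len(full) != len(ablated) or not full:
        raise ValueError("need two non-empty, equally long outcome lists")
    rng = random.Random(seed)
    diffs = [f - a for f, a in zip(full, ablated)]
    n = len(diffs)
    means = []
    for _ in range(n_resamples):
        # Resample problems with replacement, keeping each pair together.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(diffs) / n, lo, hi


if __name__ == "__main__":
    # Hypothetical solved flags for six problems; not data from the paper.
    full_system = [1, 0, 1, 1, 0, 1]
    no_updates = [1, 0, 0, 1, 0, 0]
    print(paired_bootstrap_ci(full_system, no_updates))
```

An interval that excludes zero would support attributing the gain to the network updates rather than to the agent orchestration alone. An interval that straddles zero would leave the load-bearing premise unconfirmed.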

Circularity Check

0 steps flagged

No circularity: external outcome signals drive updates without self-referential reduction

The paper describes Solvita as a multi-agent framework (Planner, Solver, Oracle, Hacker) paired with trainable graph-structured knowledge networks. Outcome signals including pass/fail verdicts, test certification quality, and adversarial vulnerabilities are recast as reinforcement learning updates to network weights, enabling dynamic routing and accumulation of transferable reasoning experience. No equations, reward functions, graph topologies, or update rules appear that would reduce the claimed transferable experience or SOTA performance to the inputs by construction. The central mechanism relies on observable external problem-solving results rather than fitted parameters renamed as predictions or self-citations that bear the load of uniqueness. The evaluation on CodeContests, APPS, AetherCode, and live Codeforces rounds provides independent benchmarks, confirming the derivation chain is self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

Abstract-only review provides no explicit free parameters, axioms, or invented entities with supporting evidence; the graph knowledge networks are introduced as core components but lack independent verification details.

invented entities (1)

graph-structured knowledge network (no independent evidence)
purpose: Store and update agent-specific reasoning experience from outcome signals
Introduced as trainable component paired with each agent; no independent evidence or falsifiable prediction supplied in abstract.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Relation between the paper passage and the cited Recognition theorem.
Paper passage: "each agent is paired with a trainable, graph-structured knowledge network... outcome signals... recast as reinforcement learning updates to these network weights"

IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · embed_injective · unclear
Relation between the paper passage and the cited Recognition theorem.
Paper passage: "three-layer heterogeneous directed graph G = (V_Q ∪ V_M ∪ V_S, E_QM ∪ E_MS)... ρ(s|q_new) = ∑ ... w_qm · w_ms"
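
The quoted routing score elides part of its sum, so the sketch below is only one plausible reading and should not be taken as the paper's definition. It assumes a new query is linked to stored Q nodes through a similarity weight (the hypothetical `sim_to_q`), and that a strategy's score sums the products of edge weights over every Q → M → S path. All node names are made up.

```python
def route_scores(sim_to_q, w_qm, w_ms):
    """Score each S node by summing sim * w_qm * w_ms over all Q -> M -> S paths.

    sim_to_q: {q: similarity of the new query to stored Q node q} (assumed input)
    w_qm:     {q: {m: weight of edge q -> m}}
    w_ms:     {m: {s: weight of edge m -> s}}
    """
    scores = {}
    for q, sim in sim_to_q.items():
        for m, wqm in w_qm.get(q, {}).items():
            for s, wms in w_ms.get(m, {}).items():
                scores[s] = scores.get(s, 0.0) + sim * wqm * wms
    return scores


if __name__ == "__main__":
    # Hypothetical toy graph with two Q nodes, two M nodes, two S nodes.
    sim_to_q = {"q1": 1.0, "q2": 0.5}
    w_qm = {"q1": {"m1": 0.5}, "q2": {"m1": 0.5, "m2": 1.0}}
    w_ms = {"m1": {"s_dp": 1.0}, "m2": {"s_greedy": 0.5}}
    print(route_scores(sim_to_q, w_qm, w_ms))
```

Under this reading, raising a single edge weight after a success lifts the score of every strategy reachable through that edge, which is how experience from one problem could influence routing on another.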

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

130 extracted references · 130 canonical work pages · 11 internal anchors

[1]

Competition-level code generation with AlphaCode.Science, 378(6624):1092–1097, 2022

Yujia Li, David Choi, Junyoung Chung, Nate Kushman, Julian Schrittwieser, Rémi Leblond, Tom Eccles, James Keeling, Felix Gimeno, Agustin Dal Lago, et al. Competition-level code generation with AlphaCode.Science, 378(6624):1092–1097, 2022

work page 2022
[2]

Measuring coding challenge competence with APPS

Dan Hendrycks, Steven Basart, Saurav Kadavath, Mantas Mazeika, Akul Arora, Ethan Guo, Collin Burns, Samir Puranik, Horace He, Dawn Song, and Jacob Steinhardt. Measuring coding challenge competence with APPS. In35th Conference on Neural Information Processing Systems (NeurIPS 2021), Track on Datasets and Benchmarks, 2021

work page 2021
[3]

Can language models solve olympiad programming?arXiv preprint arXiv:2404.10952, 2024

Quan Shi, Michael Tang, Karthik Narasimhan, and Shunyu Yao. Can language models solve olympiad programming?arXiv preprint arXiv:2404.10952, 2024

work page arXiv 2024
[4]

LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code

Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida I. Wang, Armando Solar-Lezama, Koushik Sen, and Ion Stoica. LiveCodeBench: Holistic and contamination free evaluation of large language models for code.arXiv preprint arXiv:2403.07974, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[5]

Evaluating Large Language Models Trained on Code

Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Pondé de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code.arXiv preprint arXiv:2107.03374, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[6]

Code generation with AlphaCodium : From prompt engineering to flow engineering

Tal Ridnik, Dedy Kredo, and Itamar Friedman. Code generation with AlphaCodium: From prompt engineering to flow engineering.arXiv preprint arXiv:2401.08500, 2024

work page arXiv 2024
[7]

MapCoder: Multi-agent code generation for competitive problem solving

Md Ashraful Islam, Mohammed Eunus Ali, and Md Rizwan Parvez. MapCoder: Multi-agent code generation for competitive problem solving. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics, 2024

work page 2024
[8]

Retrieval-augmented generation for knowledge-intensive NLP tasks

Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. Retrieval-augmented generation for knowledge-intensive NLP tasks. InAdvances in Neural Information Processing Systems, 2020

work page 2020
[9]

Codecontests+: High-quality test case generation for competitive programming.CoRR, abs/2506.05817, 2025

Zihan Wang, Siyao Liu, Yang Sun, Hongyan Li, and Kai Shen. CodeContests+: High-quality test case generation for competitive programming.arXiv preprint arXiv:2506.05817, 2025

work page arXiv 2025
[10]

Schapire

Lihong Li, Wei Chu, John Langford, and Robert E. Schapire. A contextual-bandit approach to personalized news article recommendation. InProceedings of the 19th International Conference on World Wide Web, pages 661–670, 2010

work page 2010
[11]

Aethercode: Evaluating llms’ ability to win in premier programming competitions, 2025

Zihan Wang, Jiaze Chen, Zhicheng Liu, Markus Mak, Yidi Du, Geonsik Moon, Luoqi Xu, Aaron Tua, Kunshuo Peng, Jiayi Lu, Mingfei Xia, Boqian Zou, et al. Aethercode: Evaluating llms’ ability to win in premier programming competitions, 2025. URLhttps://arxiv.org/abs/2508.16402

work page arXiv 2025
[12]

GPT-4 Technical Report

OpenAI. GPT-4 technical report.arXiv preprint arXiv:2303.08774, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[13]

The Claude 3 model family: Opus, Sonnet, Haiku, 2024

Anthropic. The Claude 3 model family: Opus, Sonnet, Haiku, 2024. URL https://www-cdn. anthropic.com/de8ba9b01c9ab7cbabf5c33b80b7bbc618857627/Model_Card_Claude_3.pdf. An- thropic Model Card. 11

work page 2024
[14]

AgentCoder: Multi-Agent-based Code Generation with Iterative Testing and Optimisation

Dong Huang, Jie M. Zhang, Michael Luck, Qingwen Bu, Yuhao Qing, and Heming Cui. Agent- Coder: Multi-agent code generation with effective testing and self-optimisation.arXiv preprint arXiv:2312.13010, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[15]

Codeelo: Benchmarking competition-level code generation of llms with human- comparable elo ratings.arXiv preprint arXiv:2501.01257, 2025

Shanghaoran Quan, Jiaxi Yang, Bowen Yu, Bo Zheng, Dayiheng Liu, An Yang, Xuancheng Ren, Bofei Gao, Yibo Miao, Yunlong Feng, Zekun Wang, Jian Yang, Zeyu Cui, Yang Fan, Yichang Zhang, Binyuan Hui, and Junyang Lin. CodeElo: Benchmarking competition-level code generation of LLMs with human-comparable Elo ratings.arXiv preprint arXiv:2501.01257, 2025

work page arXiv 2025
[16]

Elo.The Rating of Chessplayers: Past and Present

Arpad E. Elo.The Rating of Chessplayers: Past and Present. Arco Publishing, 1978

work page 1978
[17]

CodeChain: Towards modular code generation through chain of self-revisions with representative sub-modules

Hung Le, Hailin Chen, Amrita Saha, Akash Gokul, Doyen Sahoo, and Shafiq Joty. CodeChain: Towards modular code generation through chain of self-revisions with representative sub-modules. InInternational Conference on Learning Representations, 2024

work page 2024
[18]

Goodman, and Nick Haber

Eric Zelikman, Qian Huang, Gabriel Poesia, Noah D. Goodman, and Nick Haber. Parsel: Algorithmic reasoning with language models by composing decompositions. InAdvances in Neural Information Processing Systems, 2023

work page 2023
[19]

Tenenbaum, and Chuang Gan

Shun Zhang, Zhenfang Chen, Yikang Shen, Mingyu Ding, Joshua B. Tenenbaum, and Chuang Gan. Planning with large language models for code generation. InInternational Conference on Learning Representations, 2023

work page 2023
[20]

LEVER: Learning to verify language-to-code generation with execution

Ansong Ni, Srini Iyer, Dragomir Radev, Ves Stoyanov, Wen-tau Yih, Sida I Wang, and Xi Victoria Lin. LEVER: Learning to verify language-to-code generation with execution. InInternational Conference on Machine Learning, 2023

work page 2023
[21]

Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press

John Yang, Carlos E. Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press. SWE-agent: Agent-computer interfaces enable automated software engineering. In Advances in Neural Information Processing Systems, 2024

work page 2024
[22]

Teaching large language models to self-debug

Xinyun Chen, Maxwell Lin, Nathanael Schärli, and Denny Zhou. Teaching large language models to self-debug. InInternational Conference on Learning Representations, 2024

work page 2024
[23]

Is self-repair a silver bullet for code generation? InInternational Conference on Learning Representations, 2024

Theo X Olausson, Jeevana Priya Inala, Chenglong Wang, Jianfeng Gao, and Armando Solar-Lezama. Is self-repair a silver bullet for code generation? InInternational Conference on Learning Representations, 2024

work page 2024
[24]

MetaGPT: Meta programming for a multi-agent collaborative framework

Sirui Hong, Mingchen Zhuge, Jiaqi Chen, Xiawu Zheng, Yuheng Cheng, Ceyao Zhang, Jinlin Wang, Zili Wang, Steven Ka Shing Yau, Zijuan Lin, et al. MetaGPT: Meta programming for a multi-agent collaborative framework. InInternational Conference on Learning Representations, 2024

work page 2024
[25]

AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation

Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Beibin Li, Erkang Zhu, Li Jiang, Xiaoyun Zhang, Shaokun Zhang, Jiale Liu, Ahmed Awadallah, Ryen W. White, Doug Burger, and Chi Wang. AutoGen: Enabling next-gen LLM applications via multi-agent conversation.arXiv preprint arXiv:2308.08155, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[26]

ChatDev: Communicative agents for software development

Chen Qian, Wei Liu, Hongzhang Liu, Nuo Chen, Yufan Dang, Jiahao Li, Cheng Yang, Weize Chen, Yusheng Su, Xin Cong, Juyuan Xu, Dahai Li, Zhiyuan Liu, and Maosong Sun. ChatDev: Communicative agents for software development. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics, 2024

work page 2024
[27]

Experiential co-learning of software-developing agents

Chen Qian, Yufan Dang, Jiahao Li, Wei Liu, Zihao Xie, Yifei Wang, Weize Chen, Cheng Yang, Xin Cong, Xiaoyin Che, Zhiyuan Liu, and Maosong Sun. Experiential co-learning of software-developing agents. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics, 2024

work page 2024
[28]

Encouraging divergent thinking in large language models through multi-agent debate

Tian Liang, Zhiwei He, Wenxiang Jiao, Xing Wang, Yan Wang, Rui Wang, Yujiu Yang, Shuming Shi, and Zhaopeng Tu. Encouraging divergent thinking in large language models through multi-agent debate. InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, 2024

work page 2024
[29]

Hung Le, Yue Wang, Akhilesh Deepak Gotmare, Silvio Savarese, and Steven C.H. Hoi. CodeRL: Mastering code generation through pretrained models and deep reinforcement learning.Advances in Neural Information Processing Systems, 2022

work page 2022
[30]

STaR: Bootstrapping reasoning with reasoning

Eric Zelikman, Yuhuai Wu, Jesse Mu, and Noah D Goodman. STaR: Bootstrapping reasoning with reasoning. InAdvances in Neural Information Processing Systems, 2022. 12

work page 2022
[31]

Self-taught optimizer (stop): Recur- sively self-improving code generation

Eric Zelikman, Eliana Lorch, Lester Mackey, and Adam Kalai. Self-taught optimizer (stop): Recur- sively self-improving code generation. InProceedings of the Conference on Language Modeling (COLM), 2024

work page 2024
[32]

Griffiths, Yuan Cao, and Karthik Narasimhan

Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Thomas L. Griffiths, Yuan Cao, and Karthik Narasimhan. Tree of thoughts: Deliberate problem solving with large language models. InAdvances in Neural Information Processing Systems, 2023

work page 2023
[33]

Language agent tree search unifies reasoning, acting, and planning in language models

Andy Zhou, Kai Yan, Michal Shlapentokh-Rothman, Haohan Wang, and Yu-Xiong Wang. Language agent tree search unifies reasoning, acting, and planning in language models. InProceedings of the 41st International Conference on Machine Learning, 2024

work page 2024
[34]

Self-refine: Iterative refinement with self-feedback

Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, Shashank Gupta, Bodhisattwa Prasad Majumder, Katherine Hermann, Sean Welleck, Amir Yazdanbakhsh, and Peter Clark. Self-refine: Iterative refinement with self-feedback. InAdvances in Neural Information Processing Sys...

work page 2023
[35]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Daya Guo et al. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[36]

arXiv preprint arXiv:2404.14387 , year=

Zhengwei Tao, Ting-En Lin, Xiancai Chen, Hangyu Li, Yuchuan Wu, Yongbin Li, Zhi Jin, Fei Huang, Dacheng Tao, and Jingren Zhou. A survey on self-evolution of large language models.arXiv preprint arXiv:2404.14387, 2024

work page arXiv 2024
[37]

Voyager: An Open-Ended Embodied Agent with Large Language Models

Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar. Voyager: An open-ended embodied agent with large language models.arXiv preprint arXiv:2305.16291, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[38]

ExpeL: LLM agents are experiential learners

Andrew Zhao, Daniel Huang, Quentin Xu, Matthieu Lin, Yong-Jin Liu, and Gao Huang. ExpeL: LLM agents are experiential learners. InAAAI Conference on Artificial Intelligence, 2024

work page 2024
[39]

Reflexion: Language Agents with Verbal Reinforcement Learning

Noah Shinn, Federico Cassano, Edward Berman, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning.arXiv preprint arXiv:2303.11366, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[40]

MemGPT: Towards LLMs as Operating Systems

Charles Packer, Sarah Wooders, Kevin Lin, Vivian Fang, Shishir G. Patil, Ion Stoica, and Joseph E. Gonzalez. MemGPT: Towards LLMs as operating systems.arXiv preprint arXiv:2310.08560, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[41]

A Survey on the Memory Mechanism of Large Language Model based Agents

Zeyu Zhang, Xiaohe Bo, Chen Ma, Rui Li, Xu Chen, Quanyu Dai, Jieming Zhu, Zhenhua Dong, and Ji-Rong Wen. A survey on the memory mechanism of large language model based agents.arXiv preprint arXiv:2404.13501, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[42]

Heterogeneous graph transformer

Ziniu Hu, Yuxiao Dong, Kuansan Wang, and Yizhou Sun. Heterogeneous graph transformer. In Proceedings of The Web Conference (WWW), 2020

work page 2020
[43]

Graph of thoughts: Solving elaborate problems with large language models

Maciej Besta, Nils Blach, Ales Kubicek, Robert Gerstenberger, Michal Podstawski, Lukas Gianinazzi, Joanna Gajda, Tomasz Lehmann, Hubert Niewiadomski, Piotr Nyczyk, and Torsten Hofler. Graph of thoughts: Solving elaborate problems with large language models. InAAAI Conference on Artificial Intelligence, 2024

work page 2024
[44]

Compiler validation via equivalence modulo inputs

Vu Le, Mehrdad Afshari, and Zhendong Su. Compiler validation via equivalence modulo inputs. In Proceedings of the 35th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), pages 216–226. ACM, 2014. doi: 10.1145/2594291.2594334

work page doi:10.1145/2594291.2594334 2014
[45]

Lahiri, and Siddhartha Sen

Caroline Lemieux, Jeevana Priya Inala, Shuvendu K. Lahiri, and Siddhartha Sen. CODAMOSA: Escaping coverage plateaus in test generation with pre-trained large language models. InProceedings of the 45th International Conference on Software Engineering, 2023

work page 2023
[46]

Large language models are zero-shot fuzzers: Fuzzing deep-learning libraries via large language models

Yinlin Deng, Chunqiu Steven Xia, Haoran Peng, Chenyuan Yang, and Lingming Zhang. Large language models are zero-shot fuzzers: Fuzzing deep-learning libraries via large language models. InProceedings of the 31st ACM SIGSOFT International Symposium on Software Testing and Analysis, 2023

work page 2023
[47]

CodeT: Code generation with generated tests

Bei Chen, Fengji Zhang, Anh Nguyen, Daoguang Zan, Zeqi Lin, Jian-Guang Lou, and Weizhu Chen. CodeT: Code generation with generated tests. InInternational Conference on Learning Representations, 2023

work page 2023
[48]

Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation

Jiawei Liu, Chunqiu Steven Xia, Yuyao Wang, and Lingming Zhang. Is your code generated by ChatGPT really correct? rigorous evaluation of large language models for code generation.arXiv preprint arXiv:2305.01210, 2023. 13

[49]

Max Schäfer, Sarah Nadi, Aryaz Eghbali, and Frank Tip. An empirical evaluation of using large language models for automated unit test generation. IEEE Transactions on Software Engineering, 50. doi: 10.1109/TSE.2023.3334955

[51]

Jingwei Shi, Xinxiang Yin, Jing Huang, Jinman Zhao, and Shengyu Tao. CodeHacker: Automated test case generation for detecting vulnerabilities in competitive programming solutions. arXiv preprint arXiv:2602.20213, 2026.

Appendix A Data Pipeline Configuration

This appendix lists the exact configuration of every step of the filtering pipeline in Section 2.3. ...

First judge whether the failures indicate a localized bug or a systemic/global flaw

Then identify the most likely error type

Choose the better repair mode:
- `patch` if the overall approach is still sound and the issue is localized.
- `full_regen` if the overall approach is likely wrong or patching would keep drifting further.

Rules:
- Base the decision primarily on the objective evidence and current code.
- Treat the auxiliary references as secondary hints only.
- Do not guess...

Run your brute force on each public sample input and verify it produces the expected sample output. If it doesn’t, your understanding of the problem is wrong (or your brute force is buggy) and you’ll be asked to revise

Use your input generator + brute force as a reference to cross-validate the C++ solution we will write next on N random small inputs. Both scripts must:
- Be syntactically valid Python.
- Use ONLY: `sys`, `math`, `itertools`, `collections`, `bisect`, `heapq`, `random` (no other imports).
- Read input from stdin (`sys.stdin.read()`) and write to stdout (`p...

Read the failure evidence carefully. What does the wrong output tell you about where the code diverges?

Trace the failing test case through the code step by step. Track key variables at each step

Identify the exact line or logic block where the value first goes wrong

Name the root cause category: overflow / off-by-one / wrong formula / missing edge case / wrong data structure / TLE / MLE / other.

Phase 2 — Fix hypothesis:
- State one specific fix hypothesis that addresses the root cause.
- If the root cause is a global algorithmic flaw (not a local bug), this should be `full_regen` territory — say so rather than patch...

The SEARCH block must match the current code EXACTLY (including whitespace, indentation)

The SEARCH block must appear EXACTLY ONCE in the code

You can have multiple SEARCH/REPLACE blocks to fix multiple issues

Preserve proper indentation in the REPLACE block

Make minimal, surgical changes - only fix what’s broken

Re-check BOTH time and space complexity before proposing edits

Replace unsafe data structures if the current implementation appears to allocate memory proportional to a dangerous product of input dimensions

Do not preserve an existing approach just because it matches the plan if it is not implementable within the stated limits

Example:
<<<<<<< SEARCH
for (int i = 1; i <= n; i++) { sum += arr[i]; }
=======
for (int i = 0; i < n; i++) { sum += arr[i]; }
>>>>>>> REPLACE

Generate the SEARCH/REPLACE edits now:

Solver — generate_code.regenerate

You are repairing a fa...

Root cause of the errors

Specific fixes needed

Corrected code snippets

Be concise and actionable.

Solver — analyze_feedback.test_failure

You are a competitive programming debugging expert. Analyze the following failures and provide CONCRETE fixes.

## Problem Description
<PROBLEM_DESC>

## Selected Approach
Algorithm: <ALGORITHM>
Steps: <STEPS_TEXT>

## Current Status
Iteration: <ITERATION>
Pass Rate: <P...

Pick the SIMPLEST failure case above. Trace the code execution step-by-step with that input. Track key variables

Identify WHERE and WHY the code produces wrong output

Determine the root cause category: overflow, off-by-one, wrong formula, missing edge case, TLE, etc

Provide SPECIFIC code-level fixes (not vague suggestions).

Return ONLY valid JSON (no markdown, no explanation outside JSON):
{
  "analysis": "<detailed step-by-step trace showing where the bug is>",
  "root_cause": "<one-line root cause>",
  "error_pattern": "<category: overflow/off-by-one/wrong-formula/missing-edge-case/tle/other>",
  "suggested_fixes": [ "<sp...

Analyze why the code fails these specific hack cases

Identify the root cause (e.g. overflow, edge case, logic hole)

Provide a fixed C++ solution.

Return ONLY JSON:
{ "analysis": "<analysis of hack failures>", "suggested_fixes": ["<fix 1>", "<fix 2>"] }

E.5 Oracle: certified test generation

The Oracle prompt set has four sub-prompts (generator, validator, checker, solver) corresponding to the four artifacts that compose a certified test suite (Section 3.5). The generato...

Your code MUST be a complete standalone program with #include, main(), cin/cout

Read input from stdin, write output to stdout, matching the exact I/O format shown in the public tests
