
arxiv: 2605.15301 · v1 · pith:S4UISSUInew · submitted 2026-05-14 · 💻 cs.AI

Solvita: Enhancing Large Language Models for Competitive Programming via Agentic Evolution

Pith reviewed 2026-05-19 16:23 UTC · model grok-4.3

classification 💻 cs.AI
keywords agentic evolution · competitive programming · large language models · multi-agent systems · knowledge networks · reinforcement learning · code generation · agentic framework

The pith

Solvita achieves state-of-the-art results in competitive programming by letting LLM agents continuously learn from past outcomes using updatable knowledge networks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Solvita to fix the stateless limitation in current multi-agent code generation setups that discard experience after each task. It structures problem solving as a closed loop of strategy selection, synthesis, verification, and targeted attacks carried out by four agents, each tied to its own trainable graph-structured knowledge network. Outcome signals from executions are turned into reinforcement learning updates that adjust the networks, so future queries get routed and solved using patterns from earlier successes and failures. A sympathetic reader would care because the approach shows how to make AI coding systems improve over repeated use without retraining the base model weights.

Core claim

Solvita establishes a closed-loop agentic evolution system where Planner, Solver, Oracle, and Hacker agents use trainable graph-structured knowledge networks that update via reinforcement learning from pass/fail verdicts, test certification quality, and adversarial vulnerabilities to accumulate transferable reasoning experience for competitive programming tasks.

What carries the argument

The trainable graph-structured knowledge network paired with each specialized agent, which dynamically routes queries and accumulates experience by treating outcome signals as reinforcement learning updates.
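The abstract does not spell out how an outcome signal becomes a weight update. As a minimal sketch only, assuming the design named in the simulated rebuttal further down this page (a directed graph whose edges are scored by a policy-gradient rule), one REINFORCE-style update over a node's outgoing edges could look as follows. The class `EdgePolicy`, the function `solver_reward`, the learning rate, and the weights `w_cert` and `w_hack` are all hypothetical and not taken from the paper.

```python
import math
import random
from collections import defaultdict


class EdgePolicy:
    """Softmax routing policy over the outgoing edges of each graph node (hypothetical)."""

    def __init__(self, lr=0.5, seed=0):
        self.logits = defaultdict(dict)  # node -> {successor: logit}
        self.lr = lr
        self.rng = random.Random(seed)

    def add_edge(self, src, dst, logit=0.0):
        self.logits[src][dst] = logit

    def probs(self, src):
        # Numerically stable softmax; assumes src has at least one outgoing edge.
        edges = self.logits[src]
        top = max(edges.values())
        exps = {dst: math.exp(w - top) for dst, w in edges.items()}
        total = sum(exps.values())
        return {dst: e / total for dst, e in exps.items()}

    def sample(self, src):
        p = self.probs(src)
        dsts = list(p)
        return self.rng.choices(dsts, weights=[p[d] for d in dsts])[0]

    def update(self, src, chosen, reward, baseline=0.0):
        # REINFORCE for a softmax policy: d log pi(chosen) / d logit_j = 1[j == chosen] - p_j
        p = self.probs(src)
        advantage = reward - baseline
        for dst in p:
            grad = (1.0 if dst == chosen else 0.0) - p[dst]
            self.logits[src][dst] += self.lr * advantage * grad


def solver_reward(passed, cert_quality, hacked, w_cert=0.3, w_hack=0.5):
    """Assumed scalar reward from the Solver's point of view; the weights are illustrative."""
    return (1.0 if passed else 0.0) + w_cert * cert_quality - w_hack * (1.0 if hacked else 0.0)
```

Under these assumptions, a positive reward on a chosen edge raises its routing probability and lowers that of its siblings, which is one way the "dynamic routing" described above could arise without touching LLM weights.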

If this is right

  • The agents achieve higher success rates on benchmarks such as CodeContests, APPS, and Codeforces by learning from past outcomes.
  • The framework outperforms static multi-agent pipelines by maintaining and using accumulated knowledge across problems.
  • Solvita reaches nearly double the accuracy of single-pass LLM baselines.
  • Future tasks benefit from dynamic routing based on historical successes and failures without requiring LLM weight updates.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This method could extend to other complex reasoning tasks like mathematical problem solving by similar outcome-based network updates.
  • Smaller LLMs might achieve competitive performance levels by relying more on the growing knowledge networks.
  • The approach suggests building persistent agent systems that improve with each new user query over long periods.

Load-bearing premise

That signals from program execution results can be reliably converted into effective reinforcement learning updates for the knowledge networks to enhance future performance.

What would settle it

A controlled test showing that a version of Solvita with disabled network updates performs equally well or better than the full learning version on a series of new competitive programming problems would falsify the benefit of the updates.

Figures

Figures reproduced from arXiv: 2605.15301 by Chenchen Liu, Chenyu Wang, Chong Zheng, Han Li, Jiaheng Liu, Jinyu Tian, Letian Zhu, Rili Feng, Shihao Li, Weihao Xie, Xinping Lei, Yifan Yao, Yuqiao Du.

Figure 1: Data pipeline overview. Raw artifacts are collected from multiple competitive programming …
Figure 2: The Solvita architecture and its comparison with existing agent frameworks. Solvita couples an …
Figure 3: The three-layer Solver knowledge network. Q nodes (top) store problem descriptions and …
Figure 4: Seed-level strategy taxonomy of Oracle and Hacker memory, showing how each agent factorizes …
Figure 5: Cost and failure-profile analysis. (a) Average prompt and completion token consumption per …
Figure 6: Oracle/Hacker diagnostics and Codeforces evaluation across three backbones (Claude Opus 4.6, …
Original abstract

Large language models (LLMs) still struggle with the rigorous reasoning demands of hard competitive programming. While recent multi-agent frameworks attempt to bridge this reliability gap, they remain fundamentally stateless: they rely on static retrieval and discard the valuable problem-solving and debugging experience gained from previous tasks. To address this, we present Solvita, an agentic evolution framework that enables continuous learning without requiring weight updates to the underlying LLM. Solvita reorganizes problem-solving into a closed-loop system of strategy selection, program synthesis, certified supervision, and targeted hacking, executed by four specialized agents: Planner, Solver, Oracle, and Hacker. Crucially, each agent is paired with a trainable, graph-structured knowledge network. As the system operates, outcome signals, such as pass/fail verdicts, test certification quality, and adversarial vulnerabilities discovered by the Hacker, are recast as reinforcement learning updates to these network weights. This allows the agents to dynamically route future queries based on past successes and failures, effectively accumulating transferable reasoning experience over time. Evaluated across CodeContests, APPS, AetherCode, and live Codeforces rounds, Solvita establishes a new state-of-the-art among code-generation agents, outperforming existing multi-agent pipelines and nearly doubling the accuracy of single-pass baselines.
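The closed loop named in the abstract (strategy selection, program synthesis, certified supervision, targeted hacking) can be illustrated with a small control-flow sketch. This is not the paper's implementation: `run_closed_loop`, the `Outcome` record, the retry budget, and the four `toy_*` stubs are hypothetical stand-ins, and the agents are plain Python functions so the loop runs without an LLM.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

Program = Callable[[int], int]


@dataclass
class Outcome:
    strategy: str
    passed: bool  # did the program pass the Oracle's tests?
    hacked: bool  # did the Hacker find a breaking input?


def run_closed_loop(problem, strategies, plan, solve, certify, hack, max_rounds=3) -> List[Outcome]:
    """One problem, several rounds: Planner -> Solver -> Oracle -> Hacker."""
    history: List[Outcome] = []
    for _ in range(max_rounds):
        strategy = plan(problem, strategies, history)  # Planner: pick a strategy
        program = solve(problem, strategy)             # Solver: synthesize a program
        tests = certify(problem)                       # Oracle: supply trusted tests
        passed = all(program(x) == y for x, y in tests)
        # Hacker: only attack programs that survived the Oracle's tests.
        counterexample = hack(problem, program) if passed else None
        history.append(Outcome(strategy, passed, counterexample is not None))
        if passed and counterexample is None:
            break
    return history


# Toy stubs for the task "return the absolute value of an integer".
def toy_plan(problem, strategies, history):
    tried = {o.strategy for o in history}
    for s in strategies:
        if s not in tried:
            return s
    return strategies[-1]


def toy_solve(problem, strategy) -> Program:
    return (lambda x: x) if strategy == "identity" else (lambda x: abs(x))


def toy_certify(problem) -> List[Tuple[int, int]]:
    return [(1, 1), (2, 2)]  # deliberately weak tests: no negative inputs


def toy_hack(problem, program) -> Optional[int]:
    for x in (-3, 0, 5):
        if program(x) != abs(x):
            return x  # a breaking input
    return None
```

In this toy run the wrong `identity` program passes the weak tests and is caught only by the Hacker stub, which is the kind of outcome signal the abstract says is fed back into the knowledge networks. Calling `run_closed_loop("absolute value", ["identity", "abs"], toy_plan, toy_solve, toy_certify, toy_hack)` returns the per-round history.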

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript presents Solvita, an agentic evolution framework for improving LLMs on competitive programming tasks. It organizes problem-solving into a closed-loop system involving four agents—Planner, Solver, Oracle, and Hacker—each paired with a trainable graph-structured knowledge network. Outcome signals such as pass/fail verdicts, test certification quality, and adversarial vulnerabilities are recast as reinforcement learning updates to these network weights, enabling dynamic routing and accumulation of transferable reasoning experience without modifying the underlying LLM parameters. The system is evaluated on CodeContests, APPS, AetherCode, and live Codeforces rounds, claiming new state-of-the-art results among code-generation agents that outperform existing multi-agent pipelines and nearly double the accuracy of single-pass baselines.

Significance. If the performance claims and the mechanism for continuous learning hold, this work could significantly advance agentic systems for code generation by addressing the stateless limitation of current multi-agent frameworks. The evaluations on multiple benchmarks including live Codeforces rounds provide falsifiable predictions that strengthen the assessment of practical utility.

major comments (2)
  1. [§3.2] The claim that outcome signals (pass/fail verdicts, certification quality, adversarial vulnerabilities) are recast as RL updates to produce accumulating transferable reasoning experience lacks a concrete reward definition, graph topology, or update rule (e.g., no specification of policy gradient, Q-learning, or how edge weights modulate Planner/Solver choices). This is load-bearing for the central assertion that the graph encodes reusable strategy patterns rather than per-problem heuristics.
  2. [§5, Table 1] The SOTA claims and 'nearly doubling' accuracy improvement are stated without quantitative metrics, error analysis, ablation studies isolating the knowledge-network updates from multi-agent orchestration, or statistical significance tests, undermining verification of whether the described updates actually support the performance claims.
minor comments (2)
  1. [Figure 2] The closed-loop system diagram would benefit from explicit arrows showing information flow between the four agents and the trainable graph-structured knowledge networks.
  2. The paper does not include a limitations section discussing potential failure modes when the graph network fails to generalize across problem distributions.

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the opportunity to respond to the referee's comments. We address each major comment in turn and outline the changes we will make to the manuscript.

  1. Referee: [§3.2] The claim that outcome signals (pass/fail verdicts, certification quality, adversarial vulnerabilities) are recast as RL updates to produce accumulating transferable reasoning experience lacks a concrete reward definition, graph topology, or update rule (e.g., no specification of policy gradient, Q-learning, or how edge weights modulate Planner/Solver choices). This is load-bearing for the central assertion that the graph encodes reusable strategy patterns rather than per-problem heuristics.

    Authors: We thank the referee for highlighting this important point. Section 3.2 of the manuscript provides an overview of how outcome signals are used to update the graph-structured knowledge networks via reinforcement learning. However, we agree that additional technical details would clarify the mechanism. In the revised manuscript, we will expand this section to include the specific reward function, which combines the pass/fail verdict with terms for certification quality and vulnerability discovery. We will also specify the graph topology as a directed graph where nodes represent abstract reasoning patterns and edges encode transition probabilities between strategies. The update rule employs a policy gradient method, with edge weights adjusted based on the outcome to favor successful paths. This design ensures that the accumulated experience is transferable across problems, as the patterns are not tied to individual instances but to general problem-solving approaches. We will include a formal description and pseudocode to make this explicit. revision: yes

  2. Referee: [§5, Table 1] The SOTA claims and 'nearly doubling' accuracy improvement are stated without quantitative metrics, error analysis, ablation studies isolating the knowledge-network updates from multi-agent orchestration, or statistical significance tests, undermining verification of whether the described updates actually support the performance claims.

    Authors: We appreciate the referee's call for stronger empirical validation. Table 1 in Section 5 presents the performance results on the benchmarks, showing Solvita outperforming baselines and achieving the claimed improvements, including the near-doubling relative to single-pass methods. To address the concerns, we will add in the revision: (1) quantitative metrics with exact percentages and absolute numbers, (2) error analysis discussing failure cases, (3) ablation studies that compare the full system against a variant without the knowledge network updates (keeping the agent orchestration fixed), and (4) statistical significance tests such as paired t-tests or bootstrap confidence intervals on the performance differences. These additions will help isolate the contribution of the RL updates to the knowledge networks. revision: yes
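The ablation comparison and bootstrap confidence interval mentioned in the second response can be sketched as a paired bootstrap over per-problem results. This is a generic statistical recipe, not the authors' evaluation code; the function name `paired_bootstrap_ci`, the 0/1 encoding of solved problems, and the default resample count are assumptions made for illustration.

```python
import random


def paired_bootstrap_ci(full, ablated, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean per-problem difference (full - ablated).

    `full` and `ablated` are equal-length lists of 0/1 solve indicators for the
    same problems, e.g. the full system versus a variant with frozen networks.
    """
    if len(full) != len(ablated) or not full:
        raise ValueError("need two non-empty result lists over the same problems")
    rng = random.Random(seed)
    n = len(full)
    diffs = [f - a for f, a in zip(full, ablated)]
    observed = sum(diffs) / n
    means = []
    for _ in range(n_resamples):
        # Resample problems with replacement, keeping each pair together.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lo_idx = max(0, int((alpha / 2) * n_resamples))
    hi_idx = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples) - 1)
    return observed, means[lo_idx], means[hi_idx]
```

If the resulting interval for the difference excludes zero, the gain is unlikely to be resampling noise; an interval that straddles zero would be consistent with the "disabled network updates perform equally well" outcome described under "What would settle it".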

Circularity Check

0 steps flagged

No circularity: external outcome signals drive updates without self-referential reduction


The paper describes Solvita as a multi-agent framework (Planner, Solver, Oracle, Hacker) paired with trainable graph-structured knowledge networks. Outcome signals including pass/fail verdicts, test certification quality, and adversarial vulnerabilities are recast as reinforcement learning updates to network weights, enabling dynamic routing and accumulation of transferable reasoning experience. No equations, reward functions, graph topologies, or update rules appear that would reduce the claimed transferable experience or SOTA performance to the inputs by construction. The central mechanism relies on observable external problem-solving results rather than fitted parameters renamed as predictions or self-citations that bear the load of uniqueness. The evaluation on CodeContests, APPS, AetherCode, and live Codeforces rounds provides independent benchmarks, confirming the derivation chain is self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

Abstract-only review provides no explicit free parameters, axioms, or invented entities with supporting evidence; the graph knowledge networks are introduced as core components but lack independent verification details.

invented entities (1)
  • graph-structured knowledge network · no independent evidence
    purpose: Store and update agent-specific reasoning experience from outcome signals
    Introduced as trainable component paired with each agent; no independent evidence or falsifiable prediction supplied in abstract.

pith-pipeline@v0.9.0 · 5797 in / 1129 out tokens · 59150 ms · 2026-05-19T16:23:24.195639+00:00 · methodology



Reference graph

Works this paper leans on

130 extracted references · 130 canonical work pages · 11 internal anchors

  1. [1]

    Competition-level code generation with AlphaCode.Science, 378(6624):1092–1097, 2022

    Yujia Li, David Choi, Junyoung Chung, Nate Kushman, Julian Schrittwieser, Rémi Leblond, Tom Eccles, James Keeling, Felix Gimeno, Agustin Dal Lago, et al. Competition-level code generation with AlphaCode.Science, 378(6624):1092–1097, 2022

  2. [2]

    Measuring coding challenge competence with APPS

    Dan Hendrycks, Steven Basart, Saurav Kadavath, Mantas Mazeika, Akul Arora, Ethan Guo, Collin Burns, Samir Puranik, Horace He, Dawn Song, and Jacob Steinhardt. Measuring coding challenge competence with APPS. In35th Conference on Neural Information Processing Systems (NeurIPS 2021), Track on Datasets and Benchmarks, 2021

  3. [3]

    Can language models solve olympiad programming?arXiv preprint arXiv:2404.10952, 2024

    Quan Shi, Michael Tang, Karthik Narasimhan, and Shunyu Yao. Can language models solve olympiad programming?arXiv preprint arXiv:2404.10952, 2024

  4. [4]

    LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code

    Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida I. Wang, Armando Solar-Lezama, Koushik Sen, and Ion Stoica. LiveCodeBench: Holistic and contamination free evaluation of large language models for code.arXiv preprint arXiv:2403.07974, 2024

  5. [5]

    Evaluating Large Language Models Trained on Code

    Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Pondé de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code.arXiv preprint arXiv:2107.03374, 2021

  6. [6]

    Code generation with AlphaCodium : From prompt engineering to flow engineering

    Tal Ridnik, Dedy Kredo, and Itamar Friedman. Code generation with AlphaCodium: From prompt engineering to flow engineering.arXiv preprint arXiv:2401.08500, 2024

  7. [7]

    MapCoder: Multi-agent code generation for competitive problem solving

    Md Ashraful Islam, Mohammed Eunus Ali, and Md Rizwan Parvez. MapCoder: Multi-agent code generation for competitive problem solving. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics, 2024

  8. [8]

    Retrieval-augmented generation for knowledge-intensive NLP tasks

    Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. Retrieval-augmented generation for knowledge-intensive NLP tasks. InAdvances in Neural Information Processing Systems, 2020

  9. [9]

    Codecontests+: High-quality test case generation for competitive programming.CoRR, abs/2506.05817, 2025

    Zihan Wang, Siyao Liu, Yang Sun, Hongyan Li, and Kai Shen. CodeContests+: High-quality test case generation for competitive programming.arXiv preprint arXiv:2506.05817, 2025

  10. [10]

    Schapire

    Lihong Li, Wei Chu, John Langford, and Robert E. Schapire. A contextual-bandit approach to personalized news article recommendation. InProceedings of the 19th International Conference on World Wide Web, pages 661–670, 2010

  11. [11]

    Aethercode: Evaluating llms’ ability to win in premier programming competitions, 2025

    Zihan Wang, Jiaze Chen, Zhicheng Liu, Markus Mak, Yidi Du, Geonsik Moon, Luoqi Xu, Aaron Tua, Kunshuo Peng, Jiayi Lu, Mingfei Xia, Boqian Zou, et al. Aethercode: Evaluating llms’ ability to win in premier programming competitions, 2025. URLhttps://arxiv.org/abs/2508.16402

  12. [12]

    GPT-4 Technical Report

    OpenAI. GPT-4 technical report.arXiv preprint arXiv:2303.08774, 2023

  13. [13]

    The Claude 3 model family: Opus, Sonnet, Haiku, 2024

    Anthropic. The Claude 3 model family: Opus, Sonnet, Haiku, 2024. URL https://www-cdn. anthropic.com/de8ba9b01c9ab7cbabf5c33b80b7bbc618857627/Model_Card_Claude_3.pdf. An- thropic Model Card. 11

  14. [14]

    AgentCoder: Multi-Agent-based Code Generation with Iterative Testing and Optimisation

    Dong Huang, Jie M. Zhang, Michael Luck, Qingwen Bu, Yuhao Qing, and Heming Cui. Agent- Coder: Multi-agent code generation with effective testing and self-optimisation.arXiv preprint arXiv:2312.13010, 2024

  15. [15]

    Codeelo: Benchmarking competition-level code generation of llms with human- comparable elo ratings.arXiv preprint arXiv:2501.01257, 2025

    Shanghaoran Quan, Jiaxi Yang, Bowen Yu, Bo Zheng, Dayiheng Liu, An Yang, Xuancheng Ren, Bofei Gao, Yibo Miao, Yunlong Feng, Zekun Wang, Jian Yang, Zeyu Cui, Yang Fan, Yichang Zhang, Binyuan Hui, and Junyang Lin. CodeElo: Benchmarking competition-level code generation of LLMs with human-comparable Elo ratings.arXiv preprint arXiv:2501.01257, 2025

  16. [16]

    Elo.The Rating of Chessplayers: Past and Present

    Arpad E. Elo.The Rating of Chessplayers: Past and Present. Arco Publishing, 1978

  17. [17]

    CodeChain: Towards modular code generation through chain of self-revisions with representative sub-modules

    Hung Le, Hailin Chen, Amrita Saha, Akash Gokul, Doyen Sahoo, and Shafiq Joty. CodeChain: Towards modular code generation through chain of self-revisions with representative sub-modules. InInternational Conference on Learning Representations, 2024

  18. [18]

    Goodman, and Nick Haber

    Eric Zelikman, Qian Huang, Gabriel Poesia, Noah D. Goodman, and Nick Haber. Parsel: Algorithmic reasoning with language models by composing decompositions. InAdvances in Neural Information Processing Systems, 2023

  19. [19]

    Tenenbaum, and Chuang Gan

    Shun Zhang, Zhenfang Chen, Yikang Shen, Mingyu Ding, Joshua B. Tenenbaum, and Chuang Gan. Planning with large language models for code generation. InInternational Conference on Learning Representations, 2023

  20. [20]

    LEVER: Learning to verify language-to-code generation with execution

    Ansong Ni, Srini Iyer, Dragomir Radev, Ves Stoyanov, Wen-tau Yih, Sida I Wang, and Xi Victoria Lin. LEVER: Learning to verify language-to-code generation with execution. InInternational Conference on Machine Learning, 2023

  21. [21]

    Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press

    John Yang, Carlos E. Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press. SWE-agent: Agent-computer interfaces enable automated software engineering. In Advances in Neural Information Processing Systems, 2024

  22. [22]

    Teaching large language models to self-debug

    Xinyun Chen, Maxwell Lin, Nathanael Schärli, and Denny Zhou. Teaching large language models to self-debug. InInternational Conference on Learning Representations, 2024

  23. [23]

    Is self-repair a silver bullet for code generation? InInternational Conference on Learning Representations, 2024

    Theo X Olausson, Jeevana Priya Inala, Chenglong Wang, Jianfeng Gao, and Armando Solar-Lezama. Is self-repair a silver bullet for code generation? InInternational Conference on Learning Representations, 2024

  24. [24]

    MetaGPT: Meta programming for a multi-agent collaborative framework

    Sirui Hong, Mingchen Zhuge, Jiaqi Chen, Xiawu Zheng, Yuheng Cheng, Ceyao Zhang, Jinlin Wang, Zili Wang, Steven Ka Shing Yau, Zijuan Lin, et al. MetaGPT: Meta programming for a multi-agent collaborative framework. InInternational Conference on Learning Representations, 2024

  25. [25]

    AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation

    Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Beibin Li, Erkang Zhu, Li Jiang, Xiaoyun Zhang, Shaokun Zhang, Jiale Liu, Ahmed Awadallah, Ryen W. White, Doug Burger, and Chi Wang. AutoGen: Enabling next-gen LLM applications via multi-agent conversation.arXiv preprint arXiv:2308.08155, 2023

  26. [26]

    ChatDev: Communicative agents for software development

    Chen Qian, Wei Liu, Hongzhang Liu, Nuo Chen, Yufan Dang, Jiahao Li, Cheng Yang, Weize Chen, Yusheng Su, Xin Cong, Juyuan Xu, Dahai Li, Zhiyuan Liu, and Maosong Sun. ChatDev: Communicative agents for software development. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics, 2024

  27. [27]

    Experiential co-learning of software-developing agents

    Chen Qian, Yufan Dang, Jiahao Li, Wei Liu, Zihao Xie, Yifei Wang, Weize Chen, Cheng Yang, Xin Cong, Xiaoyin Che, Zhiyuan Liu, and Maosong Sun. Experiential co-learning of software-developing agents. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics, 2024

  28. [28]

    Encouraging divergent thinking in large language models through multi-agent debate

    Tian Liang, Zhiwei He, Wenxiang Jiao, Xing Wang, Yan Wang, Rui Wang, Yujiu Yang, Shuming Shi, and Zhaopeng Tu. Encouraging divergent thinking in large language models through multi-agent debate. InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, 2024

  29. [29]

    Hung Le, Yue Wang, Akhilesh Deepak Gotmare, Silvio Savarese, and Steven C.H. Hoi. CodeRL: Mastering code generation through pretrained models and deep reinforcement learning.Advances in Neural Information Processing Systems, 2022

  30. [30]

    STaR: Bootstrapping reasoning with reasoning

    Eric Zelikman, Yuhuai Wu, Jesse Mu, and Noah D Goodman. STaR: Bootstrapping reasoning with reasoning. InAdvances in Neural Information Processing Systems, 2022. 12

  31. [31]

    Self-taught optimizer (stop): Recur- sively self-improving code generation

    Eric Zelikman, Eliana Lorch, Lester Mackey, and Adam Kalai. Self-taught optimizer (stop): Recur- sively self-improving code generation. InProceedings of the Conference on Language Modeling (COLM), 2024

  32. [32]

    Griffiths, Yuan Cao, and Karthik Narasimhan

    Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Thomas L. Griffiths, Yuan Cao, and Karthik Narasimhan. Tree of thoughts: Deliberate problem solving with large language models. InAdvances in Neural Information Processing Systems, 2023

  33. [33]

    Language agent tree search unifies reasoning, acting, and planning in language models

    Andy Zhou, Kai Yan, Michal Shlapentokh-Rothman, Haohan Wang, and Yu-Xiong Wang. Language agent tree search unifies reasoning, acting, and planning in language models. InProceedings of the 41st International Conference on Machine Learning, 2024

  34. [34]

    Self-refine: Iterative refinement with self-feedback

    Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, Shashank Gupta, Bodhisattwa Prasad Majumder, Katherine Hermann, Sean Welleck, Amir Yazdanbakhsh, and Peter Clark. Self-refine: Iterative refinement with self-feedback. InAdvances in Neural Information Processing Sys...

  35. [35]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Daya Guo et al. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025

  36. [36]

    arXiv preprint arXiv:2404.14387 , year=

    Zhengwei Tao, Ting-En Lin, Xiancai Chen, Hangyu Li, Yuchuan Wu, Yongbin Li, Zhi Jin, Fei Huang, Dacheng Tao, and Jingren Zhou. A survey on self-evolution of large language models.arXiv preprint arXiv:2404.14387, 2024

  37. [37]

    Voyager: An Open-Ended Embodied Agent with Large Language Models

    Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar. Voyager: An open-ended embodied agent with large language models.arXiv preprint arXiv:2305.16291, 2023

  38. [38]

    ExpeL: LLM agents are experiential learners

    Andrew Zhao, Daniel Huang, Quentin Xu, Matthieu Lin, Yong-Jin Liu, and Gao Huang. ExpeL: LLM agents are experiential learners. InAAAI Conference on Artificial Intelligence, 2024

  39. [39]

    Reflexion: Language Agents with Verbal Reinforcement Learning

    Noah Shinn, Federico Cassano, Edward Berman, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning.arXiv preprint arXiv:2303.11366, 2023

  40. [40]

    MemGPT: Towards LLMs as Operating Systems

    Charles Packer, Sarah Wooders, Kevin Lin, Vivian Fang, Shishir G. Patil, Ion Stoica, and Joseph E. Gonzalez. MemGPT: Towards LLMs as operating systems.arXiv preprint arXiv:2310.08560, 2023

  41. [41]

    A Survey on the Memory Mechanism of Large Language Model based Agents

    Zeyu Zhang, Xiaohe Bo, Chen Ma, Rui Li, Xu Chen, Quanyu Dai, Jieming Zhu, Zhenhua Dong, and Ji-Rong Wen. A survey on the memory mechanism of large language model based agents.arXiv preprint arXiv:2404.13501, 2024

  42. [42]

    Heterogeneous graph transformer

    Ziniu Hu, Yuxiao Dong, Kuansan Wang, and Yizhou Sun. Heterogeneous graph transformer. In Proceedings of The Web Conference (WWW), 2020

  43. [43]

    Graph of thoughts: Solving elaborate problems with large language models

    Maciej Besta, Nils Blach, Ales Kubicek, Robert Gerstenberger, Michal Podstawski, Lukas Gianinazzi, Joanna Gajda, Tomasz Lehmann, Hubert Niewiadomski, Piotr Nyczyk, and Torsten Hofler. Graph of thoughts: Solving elaborate problems with large language models. InAAAI Conference on Artificial Intelligence, 2024

  44. [44]

    Compiler validation via equivalence modulo inputs

    Vu Le, Mehrdad Afshari, and Zhendong Su. Compiler validation via equivalence modulo inputs. In Proceedings of the 35th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), pages 216–226. ACM, 2014. doi: 10.1145/2594291.2594334

  45. [45]

    Lahiri, and Siddhartha Sen

    Caroline Lemieux, Jeevana Priya Inala, Shuvendu K. Lahiri, and Siddhartha Sen. CODAMOSA: Escaping coverage plateaus in test generation with pre-trained large language models. InProceedings of the 45th International Conference on Software Engineering, 2023

  46. [46]

    Large language models are zero-shot fuzzers: Fuzzing deep-learning libraries via large language models

    Yinlin Deng, Chunqiu Steven Xia, Haoran Peng, Chenyuan Yang, and Lingming Zhang. Large language models are zero-shot fuzzers: Fuzzing deep-learning libraries via large language models. InProceedings of the 31st ACM SIGSOFT International Symposium on Software Testing and Analysis, 2023

  47. [47]

    CodeT: Code generation with generated tests

    Bei Chen, Fengji Zhang, Anh Nguyen, Daoguang Zan, Zeqi Lin, Jian-Guang Lou, and Weizhu Chen. CodeT: Code generation with generated tests. InInternational Conference on Learning Representations, 2023

  48. [48]

    Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation

    Jiawei Liu, Chunqiu Steven Xia, Yuyao Wang, and Lingming Zhang. Is your code generated by ChatGPT really correct? rigorous evaluation of large language models for code generation.arXiv preprint arXiv:2305.01210, 2023. 13

  49. [49]

    An empirical evaluation of using large language models for automated unit test generation.IEEE Transactions on Software Engineering, 50,

    Max Schäfer, Sarah Nadi, Aryaz Eghbali, and Frank Tip. An empirical evaluation of using large language models for automated unit test generation.IEEE Transactions on Software Engineering, 50,

  50. [50]

    doi: 10.1109/TSE.2023.3334955

  51. [51]

    canonical_problem

    Jingwei Shi, Xinxiang Yin, Jing Huang, Jinman Zhao, and Shengyu Tao. CodeHacker: Automated test case generation for detecting vulnerabilities in competitive programming solutions.arXiv preprint arXiv:2602.20213, 2026. Appendix A Data Pipeline Configuration This appendix lists the exact configuration of every step of the filtering pipeline in Section 2.3. ...

  52. [52]

    First judge whether the failures indicate a localized bug or a systemic/global flaw

  53. [53]

    Then identify the most likely error type

  54. [54]

    mode":"patch|full_regen

    Choose the better repair mode: - ‘patch‘ if the overall approach is still sound and the issue is localized. - ‘full_regen‘ if the overall approach is likely wrong or patching would keep drifting further. Rules: - Base the decision primarily on the objective evidence and current code. - Treat the auxiliary references as secondary hints only. - Do not guess...

  55. [55]

    If it doesn’t, your understanding of the problem is wrong (or your brute force is buggy) and you’ll be asked to revise

    Run your brute force on each public sample input and verify it produces the expected sample output. If it doesn’t, your understanding of the problem is wrong (or your brute force is buggy) and you’ll be asked to revise

  56. [56]

    brute_force

    Use your input generator + brute force as a reference to cross-validate the C++ solution we will write next on N random small inputs. Both scripts must: - Be syntactically valid Python. - Use ONLY: ‘sys‘, ‘math‘, ‘itertools‘, ‘collections‘, ‘bisect‘, ‘heapq‘, ‘random‘ (no other imports). - Read input from stdin (‘sys.stdin.read()‘) and write to stdout (‘p...

  57. [57]

    Read the failure evidence carefully. What does the wrong output tell you about where the code diverges?

  58. [58]

    Trace the failing test case through the code step by step. Track key variables at each step

  59. [59]

    Identify the exact line or logic block where the value first goes wrong

  60. [60]

    Name the root cause category: overflow / off-by-one / wrong formula / missing edge case / wrong data structure / TLE / MLE / other.

    Phase 2: Fix hypothesis
    - State one specific fix hypothesis that addresses the root cause.
    - If the root cause is a global algorithmic flaw (not a local bug), this should be `full_regen` territory; say so rather than patch...

  61. [61]

    The SEARCH block must match the current code EXACTLY (including whitespace, indentation)

  62. [62]

    The SEARCH block must appear EXACTLY ONCE in the code

  63. [63]

    You can have multiple SEARCH/REPLACE blocks to fix multiple issues

  64. [64]

    Preserve proper indentation in the REPLACE block

  65. [65]

    Make minimal, surgical changes - only fix what’s broken
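    The SEARCH/REPLACE rules above (exact match, exactly one occurrence, multiple blocks allowed) can be enforced mechanically when the edits are applied. The following is a minimal sketch, not the paper's implementation: `parse_blocks` and `apply_blocks` are hypothetical names, and the marker syntax (`<<<<<<< SEARCH`, `=======`, `>>>>>>> REPLACE`, seven angle brackets each) is an assumption about the format the prompt intends.

```python
import re

# Assumed marker syntax; adjust if the actual prompt uses different delimiters.
BLOCK_RE = re.compile(
    r"<<<<<<< SEARCH\n(.*?)\n=======\n(.*?)\n>>>>>>> REPLACE",
    re.DOTALL,
)


def parse_blocks(response: str) -> list[tuple[str, str]]:
    """Extract (search, replace) pairs from a model response."""
    return [(m.group(1), m.group(2)) for m in BLOCK_RE.finditer(response)]


def apply_blocks(code: str, blocks: list[tuple[str, str]]) -> str:
    """Apply blocks in order, rejecting missing or ambiguous SEARCH text."""
    for search, replace in blocks:
        if not search:
            raise ValueError("empty SEARCH block")
        # str.count counts non-overlapping occurrences, which is enough
        # to detect both "not found" (0) and "ambiguous" (2 or more).
        hits = code.count(search)
        if hits != 1:
            raise ValueError(f"SEARCH block matched {hits} times, expected 1")
        code = code.replace(search, replace, 1)
    return code
```

    Rejecting a block that matches zero or several times, rather than guessing, keeps the patch step deterministic; a rejected block can be reported back so the edit is regenerated.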

  66. [66]

    Re-check BOTH time and space complexity before proposing edits

  67. [67]

    Replace unsafe data structures if the current implementation appears to allocate memory proportional to a dangerous product of input dimensions

  68. [68]

    Apply the same design-first principles before rewriting

    Do not preserve an existing approach just because it matches the plan if it is not implementable within the stated limits.

    Example:

        <<<<<<< SEARCH
        for (int i = 1; i <= n; i++) { sum += arr[i]; }
        =======
        for (int i = 0; i < n; i++) { sum += arr[i]; }
        >>>>>>> REPLACE

    Generate the SEARCH/REPLACE edits now:

    Solver: generate_code.regenerate

    You are repairing a fa...

  69. [69]

    Root cause of the errors

  70. [70]

    Specific fixes needed

  71. [71]

    Corrected code snippets

    Be concise and actionable.

    Solver: analyze_feedback.test_failure

    You are a competitive programming debugging expert. Analyze the following failures and provide CONCRETE fixes.

    ## Problem Description
    <PROBLEM_DESC>

    ## Selected Approach
    Algorithm: <ALGORITHM>
    Steps: <STEPS_TEXT>

    ## Current Status
    Iteration: <ITERATION>
    Pass Rate: <P...

  72. [72]

    Pick the SIMPLEST failure case above. Trace the code execution step-by-step with that input. Track key variables

  73. [73]

    Identify WHERE and WHY the code produces wrong output

  74. [74]

    Determine the root cause category: overflow, off-by-one, wrong formula, missing edge case, TLE, etc

  75. [75]

    analysis

    Provide SPECIFIC code-level fixes (not vague suggestions).

    Return ONLY valid JSON (no markdown, no explanation outside JSON):

    {
      "analysis": "<detailed step-by-step trace showing where the bug is>",
      "root_cause": "<one-line root cause>",
      "error_pattern": "<category: overflow/off-by-one/wrong-formula/missing-edge-case/tle/other>",
      "suggested_fixes": [ "<sp...
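    Because the prompt demands JSON only, a caller would typically validate the reply before using it. The sketch below is illustrative and rests on assumptions: `parse_feedback` is a hypothetical helper, only the four keys visible in the truncated schema are checked, and folding an unknown `error_pattern` into `other` is a policy chosen here, not one stated in the source.

```python
import json

# Categories copied from the "error_pattern" field of the prompt.
ERROR_PATTERNS = {
    "overflow", "off-by-one", "wrong-formula",
    "missing-edge-case", "tle", "other",
}
# Only the keys visible in the truncated schema; the full schema may have more.
REQUIRED_KEYS = ("analysis", "root_cause", "error_pattern", "suggested_fixes")


def parse_feedback(raw: str) -> dict:
    """Parse and validate a debugging reply that should be a bare JSON object."""
    data = json.loads(raw)  # raises json.JSONDecodeError (a ValueError) otherwise
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = [key for key in REQUIRED_KEYS if key not in data]
    if missing:
        raise ValueError(f"missing keys: {missing}")
    if not isinstance(data["suggested_fixes"], list):
        raise ValueError("suggested_fixes must be a list")
    if data["error_pattern"] not in ERROR_PATTERNS:
        # Hypothetical policy: treat unrecognized labels as "other".
        data["error_pattern"] = "other"
    return data
```

    A reply wrapped in markdown or prose fails at `json.loads`, which matches the prompt's "no markdown, no explanation outside JSON" requirement.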

  76. [76]

    Analyze why the code fails these specific hack cases

  77. [77]

    Identify the root cause (e.g. overflow, edge case, logic hole)

  78. [78]

    analysis

    Provide a fixed C++ solution.

    Return ONLY JSON:

    { "analysis": "<analysis of hack failures>", "suggested_fixes": ["<fix 1>", "<fix 2>"] }

    E.5 Oracle: certified test generation

    The Oracle prompt set has four sub-prompts (generator, validator, checker, solver) corresponding to the four artifacts that compose a certified test suite (Section 3.5). The generato...

  79. [79]

    Your code MUST be a complete standalone program with #include, main(), cin/cout

  80. [80]

    Read input from stdin, write output to stdout, matching the exact I/O format shown in the public tests
