
arxiv: 2605.18073 · v1 · pith:X27I7VBV · submitted 2026-05-18 · cs.SE · cs.AI

A-ProS: Towards Reliable Autonomous Programming Through Multi-Model Feedback

Pith reviewed 2026-05-20 09:18 UTC · model grok-4.3

classification: cs.SE, cs.AI
keywords: autonomous agents, code generation, multi-model feedback, competitive programming, iterative refinement, LLM debugging, program synthesis

The pith

A-ProS uses multi-model feedback to more than double the number of competitive programming problems an AI agent solves.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces A-ProS, an autonomous agent that generates initial code solutions with one model and then refines them using feedback from specialized debugging models. It tests six generator-critic workflow combinations on 367 problems drawn from ICPC World Finals (2011-2024) and Codeforces problems rated 1200-1800. Results show GPT-5 starting at 39 accepted solutions and reaching 85-90 after three rounds, while GPT-4 moves from 15 to 31-38; overall gains are more than twice those of baseline agent loops. A controlled ablation on 47 problems finds that keeping state across refinement steps adds 8.5-10.6 percentage points and reduces repeated failures by up to 3.5x. A sympathetic reader would care because the work isolates design choices that turn raw model output into reliable end-to-end program synthesis under strict correctness checks.

Core claim

A-ProS combines ChatGPT-based generators with three debugging critics under a 2 x 3 factorial design and shows that GPT-5 workflows improve from 39 initial accepted solutions to 85-90 after three refinement rounds while GPT-4 improves from 15 to 31-38 on 367 ICPC and Codeforces problems, achieving over 2x greater gains than baseline agent loops through persistent context and multi-model feedback.

What carries the argument

A hybrid multi-model feedback framework: one model generates solutions, separate specialized models debug them, and state persists across refinement rounds.
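
The division of labour described here can be sketched as a short loop. This is a minimal sketch under stated assumptions, not the authors' implementation: the names `Attempt`, `refine`, `generate`, `judge`, and `critique` are hypothetical, the model and judge calls are replaced by toy stubs, and the verdict string `"ACCEPTED"` is an assumed convention. The one property it aims to show is that every round sees the full history of earlier attempts and critic feedback.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Attempt:
    code: str
    verdict: str
    feedback: str = ""  # critic's diagnosis, filled in only after a failure


def refine(
    problem: str,
    generate: Callable[[str, List[Attempt]], str],
    judge: Callable[[str], str],
    critique: Callable[[str, str, str], str],
    max_rounds: int = 3,
) -> Tuple[Optional[str], List[Attempt]]:
    """Run Itr0 plus up to `max_rounds` feedback rounds, keeping all history."""
    history: List[Attempt] = []
    for _ in range(max_rounds + 1):
        code = generate(problem, history)  # generator sees every prior attempt
        verdict = judge(code)
        attempt = Attempt(code, verdict)
        history.append(attempt)
        if verdict == "ACCEPTED":
            return code, history
        # A separate critic model diagnoses the failure for the next round.
        attempt.feedback = critique(problem, code, verdict)
    return None, history


# Toy stubs standing in for real model and judge calls (illustration only).
def toy_generate(problem: str, history: List[Attempt]) -> str:
    return f"solution_v{len(history)}"


def toy_judge(code: str) -> str:
    return "ACCEPTED" if code == "solution_v2" else "WRONG_ANSWER"


def toy_critique(problem: str, code: str, verdict: str) -> str:
    return f"{code} got {verdict}; check edge cases"


accepted, history = refine("toy problem", toy_generate, toy_judge, toy_critique)
print(accepted, len(history))
```

One plausible reading of a stateless variant, in this sketch, is to call `generate` with only the most recent attempt instead of the whole `history`; the page does not spell out how the paper defines it.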

Load-bearing premise

The performance gains come from the multi-model feedback and stateful refinement design rather than differences in base model strength or the particular problems chosen.

What would settle it

An experiment that applies the same base generator model with only self-debugging and no separate critics on the identical 367 problems, then checks whether the success rate still rises to the reported levels.

Figures

Figures reproduced from arXiv: 2605.18073 by Anika Tabassum, Md. Fahim Arefin, Md Sifat Hossain, Tarannum Shaila Zaman, Tariqul Islam.

Figure 1: Overview of the A-ProS workflow.
Figure 2: Two raw outputs generated by the solution generator (GPT-4) during Itr0 (initial zero-shot attempt) and Itr2 (after two feedback iterations). Each raw response includes system-generated metadata, such as the attempt number, timestamp, model identifier, and context flag, followed by the complete C++ implementation. Figure 2a presents Itr0, which reflects the model’s initial zero-shot understandin…
Figure 3: Feedback from Llama-3.3-70B after first failed attempt (Problem 2043-C).
Figure 4: Itr𝑘 cumulative acceptance across all six workflow combinations. Each group shows improvement from Itr0 (zero-shot) to Itr3 (after 3 feedback iterations).
Figure 5: Codeforces evaluation results.

Workflow Combination    ΔItr0  ΔItr1  ΔItr2  ΔItr3  Unsolved  Avg Attempts  Verif. Cost
GPT-5 + Codestral         42     14     10      7       127          1.75          8.7
GPT-5 + Llama-3.3         44     15     10      8       123          1.77          8.2
GPT-5 + DeepSeek-R1       41     20     12      9       118          1.87          7.6
GPT-4 + Codestral         18     10      8      6       158          2.05         17.1
GPT-4 + Llama-3.3         20     11      9      7       153          2.06         15.1
GPT-4 + DeepSeek-R1       19     14     11      8       148          2.15         13.5
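
The rows of the Figure 5 table can be cross-checked with a few lines of arithmetic. The sketch below assumes the ΔItr columns count problems newly accepted at each iteration, which is an interpretation rather than something the caption states. Under that reading, the running sum gives the cumulative accepted count, and each row's final sum plus its Unsolved count should give the size of the Codeforces set.

```python
from itertools import accumulate

# (ΔItr0, ΔItr1, ΔItr2, ΔItr3), Unsolved: values copied from the Figure 5 table.
ROWS = {
    "GPT-5 + Codestral": ((42, 14, 10, 7), 127),
    "GPT-5 + Llama-3.3": ((44, 15, 10, 8), 123),
    "GPT-5 + DeepSeek-R1": ((41, 20, 12, 9), 118),
    "GPT-4 + Codestral": ((18, 10, 8, 6), 158),
    "GPT-4 + Llama-3.3": ((20, 11, 9, 7), 153),
    "GPT-4 + DeepSeek-R1": ((19, 14, 11, 8), 148),
}


def cumulative_accepted(deltas):
    """Running total of accepted problems after Itr0, Itr1, Itr2, Itr3."""
    return list(accumulate(deltas))


implied_totals = set()
for name, (deltas, unsolved) in ROWS.items():
    cumulative = cumulative_accepted(deltas)
    implied_totals.add(cumulative[-1] + unsolved)
    print(f"{name:<20} cumulative={cumulative} unsolved={unsolved}")

print("implied problem-set sizes:", implied_totals)
```

Every row implies the same set size, which suggests the table covers 200 Codeforces problems; the page itself does not state that number.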
Figure 6: Codeforces problems: Improvement from Itr…
Figure 7: Codeforces problems: Verdict type distribution across all six workflows.
Figure 8: ICPC World Finals problems: Waterfall chart showing cu…
Figure 10: Codeforces problems: Solvability score distribution by prob…
Figure 11: ICPC World Finals problems: Itr𝑘 progression showing consistent 30% generator gap across all iterations. Critically, even the weakest GPT-5 workflow (GPT-5 + Codestral, 50.9%) substantially outperforms the strongest GPT-4 workflow (GPT-4 + DeepSeek, 22.8%), demonstrating that superior debugging feedback cannot fully compensate for weaker initial solution generation. This finding has important implications…
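
The percentages in the Figure 11 caption can be reconciled with the headline counts by simple arithmetic. The sketch assumes the ICPC subset holds 167 problems. That figure is inferred (367 total problems minus the 200 Codeforces problems implied by the Figure 5 rows) and is not stated on this page.

```python
# Assumption: inferred ICPC subset size, not stated in the source.
ICPC_PROBLEMS = 367 - 200


def acceptance_rate(accepted: int, total: int = ICPC_PROBLEMS) -> float:
    """Percentage of problems accepted, rounded to one decimal place."""
    return round(100 * accepted / total, 1)


# Final counts quoted in the summary: GPT-5 reaches 85-90, GPT-4 reaches 31-38.
print("GPT-5 low end :", acceptance_rate(85))
print("GPT-4 high end:", acceptance_rate(38))
```

Under that assumption, 85 and 38 accepted problems reproduce the 50.9% and 22.8% quoted in the caption. This suggests, without proving it, that the headline counts (39 to 85-90 and 15 to 31-38) describe the ICPC subset rather than all 367 problems.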
Original abstract

Large Language Models (LLMs) demonstrate strong potential for automated code generation, yet their ability to iteratively refine solutions using execution feedback remains underexplored. Competitive programming offers an ideal testbed for this investigation, as it demands end-to-end algorithmic reasoning, precise implementation under strict computational constraints, and complete functional correctness with rigorous evaluation. In this paper, we present A-ProS, an autonomous AI agent that solves competitive programming problems through a hybrid multi-model feedback framework separating solution generation from specialized debugging. A-ProS combines ChatGPT-based generators (GPT-4 and GPT-5) with three debugging critics: Codestral-2508, Llama-3.3-70B, and DeepSeek-R1, under a 2 x 3 factorial design. We evaluate six workflows on 367 problems from ICPC World Finals (2011-2024) and Codeforces (rated 1200-1800). The results show that GPT-5 workflows improve from 39 initial accepted solutions to 85-90 after three refinement rounds, while GPT-4 improves from 15 to 31-38. A controlled ablation on 47 problems shows that stateful refinement outperforms stateless approaches by 8.5-10.6 percentage points and reduces repeated failures by up to 3.5x. Compared to baseline agent loops, A-ProS achieves over 2x greater gains, highlighting the importance of persistent context and multi-model feedback for reliable autonomous program synthesis.
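
The 2 x 3 factorial design amounts to pairing every generator with every critic. A minimal sketch of that enumeration follows; the model names are taken from the abstract, while the variable names and the "generator + critic" label format are illustrative choices.

```python
from itertools import product

GENERATORS = ("GPT-4", "GPT-5")
CRITICS = ("Codestral-2508", "Llama-3.3-70B", "DeepSeek-R1")

# Full factorial crossing: each generator is paired with each critic once.
workflows = list(product(GENERATORS, CRITICS))

for generator, critic in workflows:
    print(f"{generator} + {critic}")
```

Crossing the two factors fully is what allows generator effects and critic effects to be compared separately across the six workflows.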

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 3 minor

Summary. The manuscript introduces A-ProS, a hybrid multi-model feedback agent for autonomous competitive programming. It pairs GPT-4/GPT-5 generators with three debugging critics (Codestral-2508, Llama-3.3-70B, DeepSeek-R1) under a 2×3 design and evaluates six workflows on 367 ICPC (2011–2024) and Codeforces (1200–1800) problems. GPT-5 workflows rise from 39 to 85–90 accepted solutions after three refinement rounds; GPT-4 rises from 15 to 31–38. A controlled ablation on 47 problems shows stateful refinement outperforming stateless by 8.5–10.6 pp and cutting repeated failures by up to 3.5×, with the claim that A-ProS yields over 2× greater gains than baseline agent loops.

Significance. If the attribution of gains to the multi-model and stateful design holds, the work offers concrete evidence that separating generation from specialized debugging and preserving persistent context improves iterative refinement reliability on algorithmic tasks. The use of standard external benchmarks with direct success counts (rather than self-defined metrics) is a positive feature. The results could guide future LLM agent architectures for code synthesis, provided the ablation evidence is strengthened.

major comments (1)
  1. [Ablation study] Ablation study section: The controlled comparison of stateful vs. stateless refinement (8.5–10.6 pp gain, up to 3.5× fewer repeated failures) is performed on only 47 problems while the main results use 367 problems. The manuscript provides no indication that the 47-problem subset was stratified by difficulty, source, or initial success rate, nor any statistical check of representativeness. Because this ablation is the primary evidence offered for attributing the headline >2× gains specifically to the proposed 2×3 multi-model + stateful design (rather than base-model strength or problem selection), the limited sample size is load-bearing for the central causal claim.
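
To make the sample-size concern concrete, the percentage-point gains can be converted into problem counts. The sketch assumes both arms ran on the same 47 problems and that the reported gain is a difference in acceptance rate; the helper names are hypothetical.

```python
SUBSET_SIZE = 47  # ablation subset size reported in the manuscript


def pp_to_problems(pp: float, n: int = SUBSET_SIZE) -> float:
    """Number of problems corresponding to a gain of `pp` percentage points."""
    return pp * n / 100


def problems_to_pp(k: int, n: int = SUBSET_SIZE) -> float:
    """Percentage points corresponding to `k` extra solved problems."""
    return round(100 * k / n, 1)


for pp in (8.5, 10.6):
    print(f"{pp} pp is about {pp_to_problems(pp):.2f} problems")
for k in (4, 5):
    print(f"{k} problems is {problems_to_pp(k)} pp")
```

Under these assumptions the reported 8.5-10.6 pp range corresponds to roughly four to five additional solved problems out of 47.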
minor comments (3)
  1. [Methods] Methods / Experimental Setup: Exact prompts for the generators and the three critics are not supplied, limiting reproducibility of the six workflows.
  2. [Results] Results: No statistical significance tests, confidence intervals, or variance measures are reported for the reported improvements or ablation deltas.
  3. [Abstract] Abstract and §4: The baseline agent loops used for the “over 2× greater gains” comparison are not explicitly defined or referenced.
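
As an illustration of the kind of interval the second minor comment asks for, the following sketch computes a Wilson score interval for a single acceptance rate. It is one common choice for binomial proportions, not a method the manuscript is known to use here, and the counts in the example call are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def wilson_interval(successes: int, n: int, confidence: float = 0.95):
    """Wilson score interval for a binomial proportion."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 for 95%
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width


# Hypothetical counts: 20 accepted out of 47 problems.
low, high = wilson_interval(20, 47)
print(f"95% interval: {low:.3f} to {high:.3f}")
```

With samples as small as 47 problems the interval is wide, which is the practical reason variance measures matter for the ablation deltas.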

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the positive assessment of our work and for the constructive criticism regarding the ablation study. We address this point in detail below and commit to revisions that will strengthen the manuscript.

Point-by-point responses
  1. Referee: [Ablation study] Ablation study section: The controlled comparison of stateful vs. stateless refinement (8.5–10.6 pp gain, up to 3.5× fewer repeated failures) is performed on only 47 problems while the main results use 367 problems. The manuscript provides no indication that the 47-problem subset was stratified by difficulty, source, or initial success rate, nor any statistical check of representativeness. Because this ablation is the primary evidence offered for attributing the headline >2× gains specifically to the proposed 2×3 multi-model + stateful design (rather than base-model strength or problem selection), the limited sample size is load-bearing for the central causal claim.

    Authors: We appreciate the referee's careful attention to the ablation study and its role in supporting our claims. We acknowledge that the manuscript does not explicitly describe the selection process for the 47-problem subset or provide statistical verification of its representativeness relative to the full 367-problem set. This is a valid concern for the robustness of our causal attribution. In the revised manuscript, we will expand the ablation section to include: a detailed explanation of the subset selection criteria, ensuring coverage across difficulty levels (e.g., Codeforces ratings) and problem sources (ICPC vs. Codeforces); comparative statistics such as mean and distribution of problem difficulties and initial acceptance rates between the subset and the full set; and, where appropriate, statistical significance tests (e.g., paired t-tests or bootstrap confidence intervals) for the observed performance differences. These additions will better substantiate that the gains are attributable to the stateful multi-model design rather than selection bias. We believe this addresses the core of the comment while preserving the controlled experimental design.

    Revision: yes
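
The bootstrap confidence interval mentioned in the rebuttal could look roughly like the sketch below, which resamples problems (not individual runs) so that the pairing between the two arms is preserved. The per-problem outcomes are made up purely for illustration, and `paired_bootstrap_ci` is a hypothetical helper, not code from the paper.

```python
import random


def paired_bootstrap_ci(a, b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for mean(a) - mean(b) over paired 0/1 outcomes."""
    if len(a) != len(b) or not a:
        raise ValueError("a and b must be non-empty and of equal length")
    n = len(a)
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    diffs = []
    for _ in range(n_resamples):
        idx = rng.choices(range(n), k=n)  # resample problems with replacement
        diffs.append(sum(a[i] - b[i] for i in idx) / n)
    diffs.sort()
    lower = diffs[int((alpha / 2) * n_resamples)]
    upper = diffs[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Made-up outcomes for 47 problems: 1 = accepted, 0 = not accepted.
stateful = [1] * 20 + [0] * 27
stateless = [1] * 13 + [0] * 7 + [1] * 2 + [0] * 25

observed = (sum(stateful) - sum(stateless)) / len(stateful)
lower, upper = paired_bootstrap_ci(stateful, stateless)
print(f"observed difference {observed:.3f}, 95% CI {lower:.3f} to {upper:.3f}")
```

Resampling whole problems keeps each stateful result matched with its stateless counterpart, which is what makes the comparison paired.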

Circularity Check

0 steps flagged

No significant circularity; empirical results on external benchmarks

full rationale

The paper reports direct counts of accepted solutions on 367 ICPC and Codeforces problems for GPT-4/GPT-5 workflows under multi-model feedback, plus an ablation on 47 problems comparing stateful vs stateless refinement. These are measured outcomes against fixed external test suites rather than quantities derived from the paper's own definitions or fitted parameters. No equations, self-citations, or ansatzes are invoked in a load-bearing way that reduces the central claims to their inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the representativeness of the chosen contest problems as a testbed for autonomous programming and on the assumption that cross-model feedback provides independent debugging value beyond single-model iteration.

axioms (1)
  • Domain assumption: The selected ICPC and Codeforces problems serve as a valid proxy for real algorithmic programming challenges requiring end-to-end correctness.
    Evaluation and claims depend on these benchmarks being sufficiently representative.


