A-ProS: Towards Reliable Autonomous Programming Through Multi-Model Feedback
Pith reviewed 2026-05-20 09:18 UTC · model grok-4.3
The pith
A-ProS uses multi-model feedback to more than double the number of competitive programming problems its AI agents solve.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
A-ProS combines ChatGPT-based generators with three debugging critics under a 2 x 3 factorial design. On 367 ICPC and Codeforces problems, GPT-5 workflows improve from 39 initially accepted solutions to 85-90 after three refinement rounds, while GPT-4 workflows improve from 15 to 31-38. The paper reports over 2x greater gains than baseline agent loops and attributes them to persistent context and multi-model feedback.
What carries the argument
The hybrid multi-model feedback framework that separates solution generation from specialized debugging by different models under stateful refinement.
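The separation of generation from debugging can be pictured as a loop in which one model writes code, a judge returns a verdict, and a different model critiques the failure. The sketch below is a minimal illustration, not the authors' implementation: the names `Attempt`, `refine`, `generate`, `critique`, and `judge` are hypothetical, the three callables stand in for real model and judge calls, and "stateless" is assumed here to mean that the generator sees only the most recent attempt.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Attempt:
    code: str
    verdict: str
    critique: str = ""


def refine(problem: str,
           generate: Callable[[str, List[Attempt]], str],
           critique: Callable[[str, str, str], str],
           judge: Callable[[str], str],
           max_rounds: int = 3,
           stateful: bool = True) -> Tuple[bool, List[Attempt]]:
    """Run one initial attempt plus up to max_rounds refinement rounds."""
    history: List[Attempt] = []
    for round_idx in range(max_rounds + 1):
        # Assumption: stateful keeps the full history, stateless only the last attempt.
        context = history if stateful else history[-1:]
        code = generate(problem, context)
        verdict = judge(code)
        attempt = Attempt(code, verdict)
        history.append(attempt)
        if verdict == "ACCEPTED":
            return True, history
        if round_idx < max_rounds:
            # A separate critic model explains the failure for the next round.
            attempt.critique = critique(problem, code, verdict)
    return False, history


# Toy stand-ins (hypothetical): the generator succeeds once it has seen two failures.
def toy_generate(problem: str, context: List[Attempt]) -> str:
    return "good" if len(context) >= 2 else f"bad{len(context)}"


def toy_judge(code: str) -> str:
    return "ACCEPTED" if code == "good" else "WRONG_ANSWER"


def toy_critique(problem: str, code: str, verdict: str) -> str:
    return f"{code} failed with {verdict}"


solved_stateful, hist_stateful = refine("p", toy_generate, toy_critique, toy_judge)
solved_stateless, hist_stateless = refine("p", toy_generate, toy_critique, toy_judge,
                                          stateful=False)
print(solved_stateful, len(hist_stateful))    # True 3
print(solved_stateless, len(hist_stateless))  # False 4
```

In this toy setup the stateless run never accumulates the two failures the generator needs, which loosely mirrors the repeated-failure pattern the ablation is meant to measure.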
Load-bearing premise
The performance gains come from the multi-model feedback and stateful refinement design rather than differences in base model strength or the particular problems chosen.
What would settle it
An experiment that applies the same base generator model with only self-debugging and no separate critics on the identical 367 problems, then checks whether the success rate still rises to the reported levels.
Original abstract
Large Language Models (LLMs) demonstrate strong potential for automated code generation, yet their ability to iteratively refine solutions using execution feedback remains underexplored. Competitive programming offers an ideal testbed for this investigation, as it demands end-to-end algorithmic reasoning, precise implementation under strict computational constraints, and complete functional correctness with rigorous evaluation. In this paper, we present A-ProS, an autonomous AI agent that solves competitive programming problems through a hybrid multi-model feedback framework separating solution generation from specialized debugging. A-ProS combines ChatGPT-based generators (GPT-4 and GPT-5) with three debugging critics: Codestral-2508, Llama-3.3-70B, and DeepSeek-R1, under a 2 x 3 factorial design. We evaluate six workflows on 367 problems from ICPC World Finals (2011-2024) and Codeforces (rated 1200-1800). The results show that GPT-5 workflows improve from 39 initial accepted solutions to 85-90 after three refinement rounds, while GPT-4 improves from 15 to 31-38. A controlled ablation on 47 problems shows that stateful refinement outperforms stateless approaches by 8.5-10.6 percentage points and reduces repeated failures by up to 3.5x. Compared to baseline agent loops, A-ProS achieves over 2x greater gains, highlighting the importance of persistent context and multi-model feedback for reliable autonomous program synthesis.
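To make the 2 x 3 design and the headline counts easier to compare, the following sketch enumerates the six generator/critic workflows and converts the reported counts into acceptance rates. It assumes the counts are per workflow out of the full 367-problem set, which the abstract suggests but does not state outright; the variable names are illustrative.

```python
from itertools import product

GENERATORS = ["GPT-4", "GPT-5"]
CRITICS = ["Codestral-2508", "Llama-3.3-70B", "DeepSeek-R1"]

# 2 generators x 3 critics = 6 workflows
workflows = list(product(GENERATORS, CRITICS))

TOTAL_PROBLEMS = 367  # assumption: each reported count is out of all 367 problems


def rate(accepted: int, total: int = TOTAL_PROBLEMS) -> float:
    """Acceptance rate in percent, rounded to one decimal place."""
    return round(100 * accepted / total, 1)


# (initial, lowest final, highest final) accepted counts from the abstract
reported = {"GPT-5": (39, 85, 90), "GPT-4": (15, 31, 38)}

for generator, (initial, low, high) in reported.items():
    print(f"{generator}: {rate(initial)}% -> {rate(low)}%-{rate(high)}%")
```

Under that assumption, even the best workflow leaves roughly three quarters of the problems unsolved after three refinement rounds.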
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces A-ProS, a hybrid multi-model feedback agent for autonomous competitive programming. It pairs GPT-4/GPT-5 generators with three debugging critics (Codestral-2508, Llama-3.3-70B, DeepSeek-R1) under a 2×3 design and evaluates six workflows on 367 ICPC (2011–2024) and Codeforces (1200–1800) problems. GPT-5 workflows rise from 39 to 85–90 accepted solutions after three refinement rounds; GPT-4 rises from 15 to 31–38. A controlled ablation on 47 problems shows stateful refinement outperforming stateless by 8.5–10.6 pp and cutting repeated failures by up to 3.5×, with the claim that A-ProS yields over 2× greater gains than baseline agent loops.
Significance. If the attribution of gains to the multi-model and stateful design holds, the work offers concrete evidence that separating generation from specialized debugging and preserving persistent context improves iterative refinement reliability on algorithmic tasks. The use of standard external benchmarks with direct success counts (rather than self-defined metrics) is a positive feature. The results could guide future LLM agent architectures for code synthesis, provided the ablation evidence is strengthened.
major comments (1)
- [Ablation study] Ablation study section: The controlled comparison of stateful vs. stateless refinement (8.5–10.6 pp gain, up to 3.5× fewer repeated failures) is performed on only 47 problems while the main results use 367 problems. The manuscript provides no indication that the 47-problem subset was stratified by difficulty, source, or initial success rate, nor any statistical check of representativeness. Because this ablation is the primary evidence offered for attributing the headline >2× gains specifically to the proposed 2×3 multi-model + stateful design (rather than base-model strength or problem selection), the limited sample size is load-bearing for the central causal claim.
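The sample-size concern becomes concrete when the percentage-point gap is converted back into problem counts. The short calculation below assumes the 8.5-10.6 pp figures are differences in acceptance rate over all 47 ablation problems; the function name is illustrative.

```python
SUBSET_SIZE = 47  # ablation problems


def pp_to_problems(pp: float, n: int = SUBSET_SIZE) -> float:
    """Convert a percentage-point gap into an equivalent number of problems."""
    return pp / 100 * n


for pp in (8.5, 10.6):
    problems = pp_to_problems(pp)
    print(f"{pp} pp on {SUBSET_SIZE} problems ~ {problems:.2f} problems")

# Cross-check in the other direction: 4/47 and 5/47 expressed in percent
print(round(100 * 4 / SUBSET_SIZE, 1), round(100 * 5 / SUBSET_SIZE, 1))
```

Under that reading, the reported gap corresponds to a difference of about four to five solved problems, so a handful of flipped verdicts could move the estimate noticeably.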
minor comments (3)
- [Methods] Methods / Experimental Setup: Exact prompts for the generators and the three critics are not supplied, limiting reproducibility of the six workflows.
- [Results] Results: No statistical significance tests, confidence intervals, or variance measures are reported for the reported improvements or ablation deltas.
- [Abstract] Abstract and §4: The baseline agent loops used for the “over 2× greater gains” comparison are not explicitly defined or referenced.
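As an illustration of the kind of uncertainty estimate the Results comment asks for, the sketch below computes a Wilson score interval for a single acceptance rate. It is one reasonable choice for a binomial proportion, not necessarily the method the authors use, and it again assumes the count of 39 is out of 367 problems.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# Initial GPT-5 acceptance count from the abstract, assumed to be out of 367
low, high = wilson_interval(39, 367)
print(f"39/367 = {39 / 367:.3f}, 95% Wilson interval = ({low:.3f}, {high:.3f})")
```

Reporting such an interval next to each count would let readers judge whether differences between workflows, such as 85 versus 90 accepted solutions, exceed sampling noise.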
Simulated Author's Rebuttal
We thank the referee for the positive assessment of our work and for the constructive criticism regarding the ablation study. We address this point in detail below and commit to revisions that will strengthen the manuscript.
-
Referee: [Ablation study] Ablation study section: The controlled comparison of stateful vs. stateless refinement (8.5–10.6 pp gain, up to 3.5× fewer repeated failures) is performed on only 47 problems while the main results use 367 problems. The manuscript provides no indication that the 47-problem subset was stratified by difficulty, source, or initial success rate, nor any statistical check of representativeness. Because this ablation is the primary evidence offered for attributing the headline >2× gains specifically to the proposed 2×3 multi-model + stateful design (rather than base-model strength or problem selection), the limited sample size is load-bearing for the central causal claim.
Authors: We appreciate the referee's careful attention to the ablation study and its role in supporting our claims. We acknowledge that the manuscript does not explicitly describe the selection process for the 47-problem subset or provide statistical verification of its representativeness relative to the full 367-problem set. This is a valid concern for the robustness of our causal attribution. In the revised manuscript, we will expand the ablation section to include: a detailed explanation of the subset selection criteria, ensuring coverage across difficulty levels (e.g., Codeforces ratings) and problem sources (ICPC vs. Codeforces); comparative statistics such as mean and distribution of problem difficulties and initial acceptance rates between the subset and the full set; and, where appropriate, statistical significance tests (e.g., paired t-tests or bootstrap confidence intervals) for the observed performance differences. These additions will better substantiate that the gains are attributable to the stateful multi-model design rather than selection bias. We believe this addresses the core of the comment while preserving the controlled experimental design.
revision: yes
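A paired bootstrap is one way the promised confidence intervals could be computed, since both refinement modes are run on the same problems. The sketch below is illustrative only: the per-problem outcome vectors are placeholders invented for the example and are not results from the paper, and the function name and resampling settings are assumptions.

```python
import random
from typing import List, Tuple


def paired_bootstrap_ci(a: List[int], b: List[int], n_boot: int = 2000,
                        seed: int = 0) -> Tuple[float, float, float]:
    """Percentile bootstrap CI (~95%) for mean(a) - mean(b) on paired 0/1 outcomes."""
    assert len(a) == len(b)
    n = len(a)
    rng = random.Random(seed)
    observed = (sum(a) - sum(b)) / n
    diffs = []
    for _ in range(n_boot):
        # Resample problems, keeping each problem's two outcomes together.
        idx = [rng.randrange(n) for _ in range(n)]
        diffs.append(sum(a[i] - b[i] for i in idx) / n)
    diffs.sort()
    low = diffs[int(0.025 * n_boot)]
    high = diffs[int(0.975 * n_boot) - 1]
    return observed, low, high


# Placeholder outcomes for 47 problems (1 = accepted); NOT data from the paper.
stateful_outcomes = [1] * 15 + [0] * 32
stateless_outcomes = [1] * 10 + [0] * 37

observed, low, high = paired_bootstrap_ci(stateful_outcomes, stateless_outcomes)
print(f"difference = {observed:.3f}, 95% CI = ({low:.3f}, {high:.3f})")
```

Resampling whole problems rather than individual runs preserves the pairing, which matters when the same problems are hard for both modes.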
Circularity Check
No significant circularity; empirical results on external benchmarks
The paper reports direct counts of accepted solutions on 367 ICPC and Codeforces problems for GPT-4/GPT-5 workflows under multi-model feedback, plus an ablation on 47 problems comparing stateful vs stateless refinement. These are measured outcomes against fixed external test suites rather than quantities derived from the paper's own definitions or fitted parameters. No equations, self-citations, or ansatzes are invoked in a load-bearing way that reduces the central claims to their inputs by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: The selected ICPC and Codeforces problems serve as a valid proxy for real algorithmic programming challenges requiring end-to-end correctness.