pith. machine review for the scientific record.

arxiv: 2605.08468 · v1 · submitted 2026-05-08 · 💻 cs.CL · cs.AI · cs.LG

Recognition: 2 theorem links


PYTHALAB-MERA: Validation-Grounded Memory, Retrieval, and Acceptance Control for Frozen-LLM Coding Agents

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 01:06 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.LG
keywords frozen LLM · coding agents · validation feedback · episodic memory · eligibility traces · external controller · reinforcement learning · code generation

The pith

An external controller for frozen LLMs raises strict validation success on coding tasks from 0/9 to 8/9 in a hard RL setting.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces PYTHALAB-MERA as a lightweight external controller that supplies validation-grounded episodic memory, adaptive retrieval, and acceptance control to a frozen language model for code generation. The LLM proposes complete source files while the controller selects relevant memory records and AST-derived skills for each prompt, runs fail-fast validation, converts outcomes into bounded shaped rewards, and assigns delayed credit through eligibility traces. The combination is aimed at settings where correctness is earned through execution feedback and bounded repair rather than through a single model output. A sympathetic reader would care because it shows measurable improvement over self-refinement baselines that scored zero in the same constrained setup, without any change to the underlying model weights.

Core claim

PYTHALAB-MERA is a lightweight external controller for local validation-conditioned code generation. The frozen language model proposes complete source files; the controller decides which memory records and AST-derived skills should enter the next prompt, validates each candidate through a fail-fast pipeline, converts validation outcomes into bounded shaped rewards, and propagates delayed credit through TD(lambda)-style eligibility traces. In the measured hard RL setting with three tasks, three repetitions, and a three-attempt budget, PYTHALAB-MERA passed 8/9 strict validations while the self-refinement baseline and the investigated GRACE extension each passed 0/9.

What carries the argument

PYTHALAB-MERA, the external controller that integrates episodic memory selection, AST-derived skill retrieval, fail-fast validation, shaped rewards, and TD(lambda) eligibility traces around a frozen LLM proposer.
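
To make the division of labor concrete, here is a minimal Python sketch of such a loop. Every interface below (`memory.select`, `skills.select`, `validator.run`, the reward shaping) is a hypothetical name reconstructed from the description above; the paper does not publish its implementation.

```python
from dataclasses import dataclass

@dataclass
class Outcome:
    passed: bool
    stage_reached: int  # index of the last fail-fast stage completed
    n_stages: int       # total stages in the validation pipeline

def shaped_reward(outcome: Outcome) -> float:
    """Bounded shaped reward in [-1, 1]: full credit on a strict pass,
    partial credit for progressing further through the pipeline."""
    if outcome.passed:
        return 1.0
    return -1.0 + 2.0 * outcome.stage_reached / outcome.n_stages

def control_loop(llm, memory, skills, validator, task, max_attempts=3):
    """Propose-validate-update cycles around a frozen LLM proposer."""
    for _ in range(max_attempts):
        # The controller, not the model, decides what enters the prompt:
        # episodic records plus AST-derived skills relevant to this task.
        prompt = task.describe() + memory.select(task) + skills.select(task)
        candidate = llm.generate(prompt)    # frozen model: no weight updates
        outcome = validator.run(candidate)  # fail-fast validation pipeline
        memory.update(task, candidate, outcome, shaped_reward(outcome))
        if outcome.passed:                  # strict acceptance gate
            return candidate
    return None                             # attempt budget exhausted
```

The point of the sketch is the separation of concerns: the frozen model only ever sees a prompt and emits a file, while all adaptation (retrieval, validation, reward, acceptance) lives outside it.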

If this is right

  • Persistent episodic memory and AST-derived skill reuse become available for code generation without updating LLM weights.
  • Bounded shaped rewards and eligibility traces allow credit assignment across multiple repair attempts within a fixed budget (see the TD(λ) sketch after this list).
  • The separation of proposal generation by the frozen model from control by the external system supports modular addition of validation gates.
  • Strict fail-fast validation converts directly into acceptance decisions that improve overall success rate in the measured setting.
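
The second bullet is the mechanically least familiar piece, so here is a toy TD(λ) update with accumulating eligibility traces showing how a delayed validation reward reaches earlier repair attempts. The state encoding and the values of alpha, gamma, and lambda are illustrative assumptions, not configurations reported in the paper.

```python
def td_lambda(transitions, V, alpha=0.1, gamma=0.95, lam=0.8):
    """transitions: list of (state, reward, next_state); next_state is None
    at episode end. Accumulating traces spread the final reward backwards."""
    e = {}  # eligibility trace per state
    for s, r, s_next in transitions:
        v_next = V.get(s_next, 0.0) if s_next is not None else 0.0
        delta = r + gamma * v_next - V.get(s, 0.0)  # TD error
        e[s] = e.get(s, 0.0) + 1.0                  # bump trace for current state
        for state in e:
            V[state] = V.get(state, 0.0) + alpha * delta * e[state]
            e[state] = gamma * lam * e[state]       # decay every trace
    return V

# Three repair attempts; only the last passes strict validation (+1 reward).
V = td_lambda([("attempt-1", 0.0, "attempt-2"),
               ("attempt-2", 0.0, "attempt-3"),
               ("attempt-3", 1.0, None)], V={})
print(V)  # earlier attempts receive discounted, trace-weighted credit
```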

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The controller architecture could be tested on iterative validation tasks outside code, such as data pipeline construction where execution feedback is also available.
  • Varying the attempt budget above three while keeping the same tasks would show whether performance gains scale or plateau under looser constraints.
  • Adapting the AST skill extraction and memory selection to additional programming languages would indicate how language-specific the current gains are.

Load-bearing premise

The three chosen coding tasks together with the fail-fast validation pipeline and three-attempt budget are representative of the broader class of validation-grounded code generation problems the system is intended to address.

What would settle it

Re-running the identical three-task, three-repetition, three-attempt protocol on a fourth distinct coding task and observing whether PYTHALAB-MERA maintains an 8/9 or higher success rate while baselines remain near zero would test whether the reported improvement depends on the original task selection.
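
Since the figures report strict pass rates with 95% Wilson intervals [28], the arithmetic behind the 8/9 versus 0/9 comparison is easy to reproduce. A minimal helper using the standard Wilson score formula (this is textbook statistics, not code from the paper):

```python
from math import sqrt

def wilson_interval(successes: int, n: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion (z = 1.96 for 95%)."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return center - half, center + half

print(wilson_interval(8, 9))  # about (0.57, 0.98)
print(wilson_interval(0, 9))  # about (0.00, 0.30): disjoint, but both wide at n = 9
```

The intervals do not overlap, which is why the bounded claim survives at n = 9; their width is why a fourth task would be informative.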

Figures

Figures reproduced from arXiv: 2605.08468 by Mehmet Iscan.

Figure 1: PYTHALAB-MERA validation-grounded control loop. The frozen generator proposes a candidate program, while the external controller selects retrieval evidence, validates the candidate, and converts validation feedback into memory and credit updates. The design follows agent architectures that externalize memory and control rather than embedding all adaptation into the LLM [40,51,65].

Figure 2: Primary phase1c hard RL outcome. Success is reported as strict validator pass rate with 95% Wilson intervals; attempts and wall-clock time are run-level efficiency measures. The figure shows the same bounded conclusion as the table: PYTHALAB-MERA is the only condition with nonzero strict success in this measured setting, and it also uses fewer attempts and less time on average.

Figure 3: Per-task success and residual failure distribution in phase1c. Cell labels show successes out of three repeats; failure bars count only non-passing runs. The profile indicates that PYTHALAB-MERA reduced hard-task failures in this subset, while GRACE incurred more runtime and import failures.

Figure 4: Secondary phase1b RL comparison. This plot is descriptive because each of the 11 tasks was run once per condition; it is included to show that the primary phase1c finding is not the only observed efficiency signal, but it is not treated as the main repeated-trial claim.
read the original abstract

Local LLM-based coding agents increasingly work in settings where correctness is earned through execution feedback, persistent state, and bounded repair, not through a single fluent answer. Static retrieval, long-context prompting, self-refinement, execution-feedback repair, and reinforcement learning over model weights each address part of this setting, but they do not jointly provide validation-grounded episodic memory, adaptive retrieval-action selection, delayed credit assignment, and structural skill reuse around a frozen local model. We introduce PYTHALAB-MERA, a lightweight external controller for local validation-conditioned code generation. The frozen language model proposes complete source files; the controller decides which memory records and AST-derived skills should enter the next prompt, validates each candidate through a fail-fast pipeline, converts validation outcomes into bounded shaped rewards, and propagates delayed credit through TD(lambda)-style eligibility traces. We evaluate the implementation as a local CLI artifact on reinforcement-learning coding tasks with strict validation gates. In the measured hard RL setting with three tasks, three repetitions, and a three-attempt budget, PYTHALAB-MERA passed 8/9 strict validations; the self-refinement baseline and the investigated GRACE extension each passed 0/9. These results support a deliberately bounded claim: in this recorded setting, the external memory-and-retrieval controller improved validation success. They do not establish general-purpose code synthesis, state-of-the-art performance, formal program correctness, or formal safety.

Editorial analysis

A structured set of objections, weighed in public.

A referee report, a simulated author's rebuttal, a circularity audit, and an axiom ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper introduces PYTHALAB-MERA, a lightweight external controller for frozen LLM-based coding agents. The controller manages episodic memory, adaptive retrieval of records and AST-derived skills, fail-fast validation, shaped rewards, and TD(λ)-style eligibility traces for delayed credit assignment. In a specific experimental setting with three reinforcement-learning coding tasks, three repetitions, and a three-attempt budget, the system achieved 8/9 strict validations while self-refinement and GRACE baselines achieved 0/9. The authors explicitly bound the claim to this recorded setting without asserting generality, SOTA performance, or formal correctness.

Significance. If the reported counts are accurate, the work provides a concrete demonstration that an external memory-retrieval-acceptance controller can improve validation success over simple self-refinement in a constrained, validation-grounded code-generation setting. The deliberate bounding of the claim, the use of strict validation gates, and the avoidance of overclaiming are strengths. The very small trial count (nine total) and lack of supporting implementation details, however, limit the result's immediate significance and utility as a reproducible baseline for the broader class of problems the system targets.

major comments (1)
  1. Evaluation section: The central empirical result (8/9 versus 0/9) is reported without any description of the three coding tasks, the concrete criteria or implementation of the fail-fast validation pipeline, the precise configuration of the TD(λ) traces, or the implementation details of the self-refinement and GRACE baselines. These omissions make the numerical outcome impossible to contextualize, replicate, or assess for robustness.
minor comments (2)
  1. The abstract is unusually long and dense; some technical description could be moved to the introduction or methods to improve readability.
  2. Ensure first-use definitions for acronyms such as TD(λ) and AST even if they are standard in the field.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the positive recognition of the paper's bounded claims, strict validation approach, and avoidance of overclaiming. We address the single major comment below and will revise the manuscript to improve reproducibility and contextualization of the results.

read point-by-point responses
  1. Referee: Evaluation section: The central empirical result (8/9 versus 0/9) is reported without any description of the three coding tasks, the concrete criteria or implementation of the fail-fast validation pipeline, the precise configuration of the TD(λ) traces, or the implementation details of the self-refinement and GRACE baselines. These omissions make the numerical outcome impossible to contextualize, replicate, or assess for robustness.

    Authors: We agree that the Evaluation section requires substantially more detail to support replication and assessment of the reported outcome. In the revised manuscript we will expand this section with: (1) explicit descriptions of the three reinforcement-learning coding tasks, including their objectives, input specifications, and the precise strict-validation criteria used for success; (2) the concrete implementation of the fail-fast validation pipeline, including the ordered sequence of checks, error categories, and acceptance thresholds; (3) the exact configuration of the TD(λ) eligibility traces, specifying the λ value, trace-decay schedule, and credit-propagation rules; and (4) implementation details for the self-refinement baseline and the GRACE extension, including prompt templates, iteration budgets, and any modifications made to the original GRACE procedure. These additions will be accompanied by pseudocode or a table summarizing hyperparameters so that the 8/9 versus 0/9 comparison can be fully contextualized and reproduced. revision: yes
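
For readers who want a concrete picture while the promised revision is pending, a fail-fast pipeline of the shape the rebuttal describes might look like the following sketch. The stage names, their ordering, the error categories, and the test command are assumptions for illustration, not the paper's published design.

```python
import ast
import subprocess
import sys

def fail_fast_validate(source: str, test_cmd: list):
    """Run ordered checks, stopping at the first failure; return
    (passed, error_category, detail) for downstream reward shaping."""
    try:
        ast.parse(source)                       # stage 1: syntax
    except SyntaxError as exc:
        return False, "syntax", str(exc)
    try:
        compile(source, "<candidate>", "exec")  # stage 2: compilation
    except Exception as exc:
        return False, "compile", str(exc)
    with open("candidate.py", "w") as fh:       # stage 3: execute the test suite
        fh.write(source)
    proc = subprocess.run(test_cmd, capture_output=True, text=True, timeout=60)
    if proc.returncode != 0:
        return False, "tests", proc.stderr[-500:]
    return True, "passed", ""

# Example invocation (hypothetical test harness):
# fail_fast_validate(code, [sys.executable, "-m", "pytest", "-x"])
```

The error category returned at the failing stage is exactly the kind of signal a shaped-reward scheme can grade: a syntax failure earns less partial credit than a candidate that compiles but fails its tests.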

Circularity Check

0 steps flagged

No significant circularity identified

full rationale

The paper contains no derivation chain, mathematical model, or predictive formalism that could reduce to its inputs. Its central claim is a bounded empirical report of measured validation counts (8/9 successes versus 0/9 for baselines) obtained from a fixed experimental protocol with three tasks, three repetitions, and a three-attempt budget. All described components—memory records, AST-derived skills, fail-fast validation, shaped rewards, and TD(λ) traces—are implementation choices whose correctness is assessed by direct execution outcomes rather than by any self-referential definition, fitted-parameter prediction, or load-bearing self-citation. The abstract explicitly disclaims generality, so no unstated premise is required for the reported counts to hold.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The approach relies on standard reinforcement-learning concepts (eligibility traces, shaped rewards) and prompting techniques (memory retrieval, AST-derived skills) without introducing new mathematical entities or free parameters whose values are fitted in the abstract.

axioms (2)
  • domain assumption: Validation outcomes from a fail-fast pipeline can be converted into bounded shaped rewards suitable for TD(λ) credit assignment.
    Invoked to justify the reward mechanism that drives the controller.
  • domain assumption: AST-derived skills extracted from prior code can be usefully selected and inserted into future prompts (a minimal extraction sketch follows below).
    Core premise of the retrieval component.
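
As an illustration of the second axiom, a minimal version of AST-derived skill extraction in Python might look like this. The paper's actual extraction and selection rules are not published, so this is only one plausible reading of the term.

```python
import ast

def extract_skills(source: str) -> dict:
    """Map each top-level function name to its source text, so snippets
    from previously accepted code can be re-inserted into later prompts."""
    tree = ast.parse(source)
    return {node.name: ast.get_source_segment(source, node)
            for node in tree.body
            if isinstance(node, ast.FunctionDef)}

skills = extract_skills("def add(a, b):\n    return a + b\n")
print(skills["add"])  # a reusable snippet for a future prompt
```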

pith-pipeline@v0.9.0 · 5561 in / 1464 out tokens · 85120 ms · 2026-05-12T01:06:57.113737+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

65 extracted references · 65 canonical work pages · 9 internal anchors

  [1] Tablan, V., Taylor, S., Hurtado, G., Bernhem, K., Uhrenholt, A., Farei, G., & Moilanen, K. (2025). Smarter together: Creating agentic communities of practice through shared experiential learning. arXiv. https://arxiv.org/abs/2511.08301

  [2] Mishra, S., Niroula, S., Yadav, U., Thakur, D., Gyawali, S., & Gaire, S. (2026). SoK: Agentic retrieval-augmented generation (RAG): Taxonomy, architectures, evaluation, and research directions. arXiv. https://arxiv.org/abs/2603.07379

  [3] Shanto, M. H., Asaduzzaman, M., & Ngom, A. (2026). RAG-Reflect: Agentic retrieval-augmented generation with reflections for comment-driven code maintenance on Stack Overflow. arXiv. https://arxiv.org/abs/2604.22217

  [4] Jiang, J., Shen, J., Kim, S., Yoo, K. M., Kim, J., & Kim, S. (2026). ReflexiCoder: Teaching large language models to self-reflect on generated code and self-correct it via reinforcement learning. arXiv. https://arxiv.org/abs/2603.05863

  [5] Wu, L., Pei, Y., Yang, Z., Li, K., Lu, Z., Tan, H., ... & Hao, D. (2026). DebugRepair: Enhancing LLM-based automated program repair via self-directed debugging. arXiv. https://arxiv.org/abs/2604.19305

  [6] Sapkota, R., Roumeliotis, K. I., & Karkee, M. (2025). Vibe coding vs. agentic coding: Fundamentals and practical implications of agentic AI. arXiv. https://arxiv.org/abs/2505.19443

  [7] Xia, C. S., Wei, Y., & Zhang, L. (2023, May). Automated program repair in the era of large pre-trained language models. In 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE) (pp. 1482-1494). IEEE. https://doi.org/10.1109/ICSE48619.2023.00129

  [8] Farzandway, M., & Ghassemi, F. (2025). Automated repair of C programs using large language models. arXiv. https://arxiv.org/abs/2509.01947

  [9] Fan, Z., Gao, X., Mirchev, M., Roychoudhury, A., & Tan, S. H. (2023, May). Automated repair of programs from large language models. In 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE) (pp. 1469-1481). IEEE. https://doi.org/10.1109/ICSE48619.2023.00128

  [10] Narajala, V. S., & Narayan, O. (2025). Securing agentic AI: A comprehensive threat model and mitigation framework for generative AI agents. arXiv. https://arxiv.org/abs/2504.19956

  [11] Huang, Y., Gupta, S., Zhong, Z., Li, K., & Chen, D. (2023, December). Privacy implications of retrieval-based language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (pp. 14887-14902). https://doi.org/10.18653/v1/2023.emnlp-main.921

  [12] Kuang, S., Tian, Z., Lin, K., Tao, C., Wang, S., Bai, H., ... & Chen, J. (2026). REAgent: Requirement-driven LLM agents for software issue resolution. arXiv. https://arxiv.org/abs/2604.06861

  [13] Wang, C., Zhou, Z., Wang, C., Sun, Y., Yang, S., Yuan, Y., ... & Han, Z. (2025, December). End-to-end secure code repair with context-aware anonymization and isolated agent execution. In 2025 IEEE International Conference on Blockchain Technology and Information Security (ICBCTIS) (pp. 1-8). IEEE. https://doi.org/10.1109/ICBCTIS66509.2025.11387695

  [14] Madaan, A., Tandon, N., Gupta, P., Hallinan, S., Gao, L., Wiegreffe, S., ... & Clark, P. (2023). Self-Refine: Iterative refinement with self-feedback. Advances in Neural Information Processing Systems, 36, 46534-46594. https://arxiv.org/abs/2303.17651

  [15] Gehring, J., Zheng, K., Copet, J., Mella, V., Carbonneaux, Q., Cohen, T., & Synnaeve, G. (2024). RLEF: Grounding code LLMs in execution feedback with reinforcement learning. arXiv. https://arxiv.org/abs/2410.02089

  [16] Chen, Y., Sun, Y., Wang, H., Zhang, X., Shen, X., Li, W., & Zhang, W. (2026). Contextual counterfactual credit assignment for multi-agent reinforcement learning in LLM collaboration. arXiv. https://arxiv.org/abs/2603.06859

  [17] Yang, X., Li, W., Sheng, J., Shen, C., Hua, Y., & Wang, X. (2025). Agentic episodic control. arXiv. https://arxiv.org/abs/2506.01442

  [18] Zhang, H., Long, Q., Bao, J., Feng, T., Zhang, W., Yue, H., & Wang, W. (2026). MemSkill: Learning and evolving memory skills for self-evolving agents. arXiv. https://arxiv.org/abs/2602.02474

  [19] Chen, S., Gai, J., Zhou, R., Zhang, J., Zhu, T., Li, J., ... & Teh, Y. W. (2026). SkillCraft: Can LLM agents learn to use tools skillfully? arXiv. https://arxiv.org/abs/2603.00718

  [20] Gonzalez, M. A. A., Hernandez, M. B., Perez, M. A. P., Orozco, B. L., Soto, J. T. C., & Malagon, S. (2025). Do repetitions matter? Strengthening reliability in LLM evaluations. arXiv. https://arxiv.org/abs/2509.24086

  [21] Gallo, R. J., Baiocchi, M., Savage, T. R., & Chen, J. H. (2025). Establishing best practices in large language model research: an application to repeat prompting. Journal of the American Medical Informatics Association, 32(2), 386-390. https://doi.org/10.1093/jamia/ocae294

  [22] Ning, K., Chen, J., Zhang, J., Li, W., Wang, Z., Feng, Y., ... & Zheng, Z. (2026). Defining and detecting the defects of large language model-based autonomous agents. IEEE Transactions on Software Engineering. https://doi.org/10.1109/TSE.2026.3658554

  [23] Chen, Z., Ma, W., & Jiang, L. (2025). Beyond final code: A process-oriented error analysis of software development agents in real-world GitHub scenarios. arXiv. https://arxiv.org/abs/2503.12374

  [24] Barke, S., Goyal, A., Khare, A., Singh, A., Nath, S., & Bansal, C. (2026). AgentRx: Diagnosing AI agent failures from execution trajectories. arXiv. https://arxiv.org/abs/2602.02475

  [25] Zhu, Y., Jin, T., Pruksachatkun, Y., Zhang, A., Liu, S., Cui, S., Kapoor, S., Longpre, S., Meng, K., Weiss, R., Barez, F., Gupta, R., Dhamala, J., Merizian, J., Giulianelli, M., Coppock, H., Ududec, C., Sekhon, J., Steinhardt, J., Kellermann, A., Schwettmann, S., Zaharia, M., Stoica, I., Liang, P., & Kang, D. (2025). Establishing best practices for building r... arXiv. https://arxiv.org/abs/2507.02825

  [26] Li, L., Chu, W., Langford, J., & Schapire, R. E. (2010). A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th International Conference on World Wide Web (pp. 661-670). ACM. https://doi.org/10.1145/1772690.1772758

  [27] Sutton, R. S. (1988). Learning to predict by the methods of temporal differences. Machine Learning, 3, 9-44. https://doi.org/10.1007/BF00115009

  [28] Wilson, E. B. (1927). Probable inference, the law of succession, and statistical inference. Journal of the American Statistical Association, 22(158), 209-212. https://doi.org/10.1080/01621459.1927.10502953

  [29] Gautam, D., Garg, S., Jang, J., Sundaresan, N., & Zilouchian Moghaddam, R. (2025). RefactorBench: Evaluating stateful reasoning in language agents through code. arXiv. https://arxiv.org/abs/2503.07832

  [30] Arimbur, J. J. (2026). How many tries does it take? Iterative self-repair in LLM code generation across model scales and benchmarks. arXiv. https://arxiv.org/abs/2604.10508

  [31] Dai, D., Liu, M., Li, A., Cao, J., Wang, Y., Wang, C., ... & Zheng, Z. (2025). FeedbackEval: A benchmark for evaluating large language models in feedback-driven code repair tasks. arXiv. https://arxiv.org/abs/2504.06939

  [32] Sunil, B. D., Sinha, I., Maheshwari, P., Todmal, S., Mallik, S., & Mishra, S. (2026). Memory poisoning attack and defense on memory-based LLM agents. arXiv. https://arxiv.org/abs/2601.05504

  [33] Srivastava, S. S., & He, H. (2025). MemoryGraft: Persistent compromise of LLM agents via poisoned experience retrieval. arXiv. https://arxiv.org/abs/2512.16962

  [34] Liu, F., Zhang, Y., Luo, J., Dai, J., Chen, T., Yuan, L., Yu, Z., Shi, Y., Li, K., Zhou, C., Chen, H., & Yang, M. (2025). Make agent defeat agent: Automatic detection of taint-style vulnerabilities in LLM-based agents. In 34th USENIX Security Symposium (USENIX Security 25). https://www.usenix.org/conference/usenixsecurity25/presentation/liu-fengyu

  [35] Wang, W., Wang, Y., Joty, S., & Hoi, S. C. (2023, November). RAP-Gen: Retrieval-augmented patch generation with CodeT5 for automatic program repair. In Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (pp. 146-158). https://doi.org/10.1145/3611643.3616256

  [36] Lee, H., & Yang, G. (2025, November). AgentRepair: Multi-agent, AST-anchored, retrieval-augmented program repair for cold-start environments. In 2025 12th International Conference on Dependable Systems and Their Applications (DSA) (pp. 121-132). IEEE. https://doi.org/10.1109/DSA66321.2025.00025

  [37] Chondamrongkul, N., Kyaw, M. P. P., Ko, S. M., Paing, P. P., Swe, M. K. T., & Hongthong, T. (2026). RepoAI: Automated code refactoring through multi-agent LLM orchestration and retrieval-augmented generation. Science of Computer Programming, 253, Article 103477. https://doi.org/10.1016/j.scico.2026.103477

  [38] Bui, N. D. (2026). Building effective AI coding agents for the terminal: Scaffolding, harness, context engineering, and lessons learned. arXiv. https://arxiv.org/abs/2603.05344

  [39] Rombaut, B. (2026). Inside the scaffold: A source-code taxonomy of coding agent architectures. arXiv. https://arxiv.org/abs/2604.03515

  [40] Kim, M. H. (2025). Bridging symbolic control and neural reasoning in LLM agents: The structured cognitive loop. arXiv. https://arxiv.org/abs/2511.17673

  [41] Poon, M., Dai, X., Liu, X., Kong, F., Lui, J. C., & Zuo, J. (2026, March). Online multi-LLM selection via contextual bandits under unstructured context evolution. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 40, No. 29, pp. 24855-24863). https://doi.org/10.1609/aaai.v40i29.39672

  [42] Hong, Z., Zhang, Q., Sun, J., Shang, Z., Kong, M., Wang, X., ... & Dai, Z. (2026). MASPOB: Bandit-based prompt optimization for multi-agent systems with graph neural networks. arXiv. https://arxiv.org/abs/2603.02630

  [43] Rietz, F., Smirnov, O., Karimi, S., & Cao, L. (2026). Prompt tuning decision transformers with structured and scalable bandits. Advances in Neural Information Processing Systems, 38, 58258-58286. https://www.microsoft.com/en-us/research/publication/prompt-tuning-decision-transformers-with-structured-and-scalable-bandits/

  [44] Sloan, M., & Wang, J. (2015, September). Dynamic information retrieval: Theoretical framework and application. In Proceedings of the 2015 International Conference on the Theory of Information Retrieval (pp. 61-70). https://doi.org/10.1145/2808194.2809457

  [45] Yang, A., & Yang, G. H. (2017, October). A contextual bandit approach to dynamic search. In Proceedings of the ACM SIGIR International Conference on Theory of Information Retrieval (pp. 301-304). https://doi.org/10.1145/3121050.3121101

  [46] Zhang, W., Zhu, Y., Lu, Y., Demarne, M., Wang, W., Deng, K., ... & Krishnan, S. (2025, November). FLAIR: Feedback learning for adaptive information retrieval. In Proceedings of the 34th ACM International Conference on Information and Knowledge Management (pp. 6284-6292). https://doi.org/10.1145/3746252.3761553

  [47] Le, H., Wang, Y., Gotmare, A. D., Savarese, S., & Hoi, S. C. H. (2022). CodeRL: Mastering code generation through pretrained models and deep reinforcement learning. Advances in Neural Information Processing Systems, 35, 21314-21328. https://arxiv.org/abs/2207.01780

  [48] Yu, Z., Gu, W., Wang, Y., Jiang, X., Zeng, Z., Wang, J., ... & Zhang, S. (2024). Reasoning through execution: Unifying process and outcome rewards for code generation. arXiv. https://arxiv.org/abs/2412.15118

  [49] Han, B., Ren, Z., Wu, Z., Zhou, Y., & Peng, J. (2022). Off-policy reinforcement learning with delayed rewards. In Proceedings of the 39th International Conference on Machine Learning (PMLR, Vol. 162, pp. 8280-8303). https://proceedings.mlr.press/v162/han22e.html

  [50] Li, B., Sun, Z., Huang, T., Zhang, H., Wan, Y., Li, G., ... & Lyu, C. (2024). IRCoCo: Immediate rewards-guided deep reinforcement learning for code completion. Proceedings of the ACM on Software Engineering, 1(FSE), 182-203. https://doi.org/10.1145/3643735

  [51] Zhang, D., Chen, L., Zhang, S., Xu, H., Zhao, Z., & Yu, K. (2023). Large language models are semi-parametric reinforcement learning agents. Advances in Neural Information Processing Systems, 36, 78227-78239. https://arxiv.org/abs/2306.07929

  [52] Krishnamoorthy, A., Ivatury, K., & Ahmadnia, B. (2025, September). Multi-agent reinforcement learning for interactive code debugging with human feedback and memory. In Proceedings of the 15th International Conference on Recent Advances in Natural Language Processing - Natural Language Processing in the Generative AI Era (pp. 595-603). https://doi.org/10.26615/978-954...

  [53] Chilowicz, M., Duris, E., & Roussel, G. (2009, May). Syntax tree fingerprinting for source code similarity detection. In 2009 IEEE 17th International Conference on Program Comprehension (pp. 243-247). IEEE. https://doi.org/10.1109/ICPC.2009.5090050

  [54] Li, Y., Wang, S., & Nguyen, T. (2021, May). Fault localization with code coverage representation learning. In 2021 IEEE/ACM 43rd International Conference on Software Engineering (ICSE) (pp. 661-673). IEEE. https://doi.org/10.1109/ICSE43902.2021.00067

  [55] Verma, A., Udhayanan, P., Shankar, R. M., Kn, N., & Chakrabarti, S. K. (2021, October). Source-code similarity measurement: Syntax tree fingerprinting for automated evaluation. In Proceedings of the First International Conference on AI-ML Systems (pp. 1-7). https://doi.org/10.1145/3486001.3486228

  [56] Ellis, K., Morales, L., Sablé-Meyer, M., Solar-Lezama, A., & Tenenbaum, J. (2018). Learning libraries of subroutines for neurally guided Bayesian program induction. In Advances in Neural Information Processing Systems 31 (pp. 7816-7826). https://papers.nips.cc/paper/8006-learning-libraries-of-subroutines-for-neurallyguided-bayesian-program-induction

  [57] Xu, P., Wu, G., Chen, X., Yu, T., Xiao, C., Dernoncourt, F., ... & Swaminathan, V. (2026, March). Skill discovery for software scripting automation via offline simulations with LLMs. In Findings of the Association for Computational Linguistics: EACL 2026 (pp. 743-759). https://doi.org/10.18653/v1/2026.findings-eacl.37

  [58] Stengel-Eskin, E., Prasad, A., & Bansal, M. (2024). ReGAL: Refactoring programs to discover generalizable abstractions. arXiv. https://arxiv.org/abs/2401.16467

  [59] Rabin, R., Hostetler, J., McGregor, S., Weir, B., & Judd, N. (2025). SandboxEval: Towards securing test environment for untrusted code. arXiv. https://arxiv.org/abs/2504.00018

  [60] Wang, J., Luo, X., Cao, L., He, H., Huang, H., Xie, J., ... & Cai, Y. (2024). Is your AI-generated code really safe? Evaluating large language models on secure code generation with CodeSecEval. arXiv. https://arxiv.org/abs/2407.02395

  [61] Jiang, H., Chen, Y., Cao, Y., Lee, H. Y., & Tan, R. T. (2025). CodeJudgeBench: Benchmarking LLM-as-a-judge for coding tasks. arXiv. https://arxiv.org/abs/2507.10535

  [62] Zhao, Y., Luo, Z., Tian, Y., Lin, H., Yan, W., Li, A., & Ma, J. (2024). CodeJudge-Eval: Can large language models be good judges in code understanding? arXiv. https://arxiv.org/abs/2408.10718

  [63] Jain, S., Ahmed, U. Z., Sahai, S., & Leong, B. (2025). Beyond consensus: Mitigating the agreeableness bias in LLM judge evaluations. arXiv. https://arxiv.org/abs/2510.11822

  [64] Spiess, C., Gros, D., Pai, K. S., Pradel, M., Rabin, M. R. I., Alipour, A., ... & Ahmed, T. (2025, April). Calibration and correctness of language models for code. In 2025 IEEE/ACM 47th International Conference on Software Engineering (ICSE) (pp. 540-552). IEEE. https://doi.org/10.1109/ICSE55347.2025.00040

  [65] Chen, Z., Chen, D., Jin, R., Liang, Y., Xie, Y., & Sun, H. (2026). Bridging online and offline RL: Contextual bandit learning for multi-turn code generation. arXiv. https://arxiv.org/abs/2602.03806