pith. machine review for the scientific record.

arxiv: 2605.08468 · v1 · submitted 2026-05-08 · 💻 cs.CL · cs.AI · cs.LG

Recognition: 2 theorem links


PYTHALAB-MERA: Validation-Grounded Memory, Retrieval, and Acceptance Control for Frozen-LLM Coding Agents

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 01:06 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.LG
keywords frozen LLM · coding agents · validation feedback · episodic memory · eligibility traces · external controller · reinforcement learning · code generation

The pith

An external controller for frozen LLMs raises strict validation success on coding tasks from 0/9 to 8/9 in a hard RL setting.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces PYTHALAB-MERA as a lightweight external controller that supplies validation-grounded episodic memory, adaptive retrieval, and acceptance control to a frozen language model for code generation. The LLM proposes complete source files while the controller selects relevant memory records and AST-derived skills for each prompt, runs fail-fast validation, converts outcomes into bounded shaped rewards, and assigns delayed credit through eligibility traces. The combination is aimed at settings where correctness is earned through execution feedback and bounded repair rather than through a single model output. A sympathetic reader would care because it shows measurable improvement over self-refinement baselines that scored zero in the same constrained setup, without any change to the underlying model weights.

Core claim

PYTHALAB-MERA is a lightweight external controller for local validation-conditioned code generation. The frozen language model proposes complete source files; the controller decides which memory records and AST-derived skills should enter the next prompt, validates each candidate through a fail-fast pipeline, converts validation outcomes into bounded shaped rewards, and propagates delayed credit through TD(lambda)-style eligibility traces. In the measured hard RL setting with three tasks, three repetitions, and a three-attempt budget, PYTHALAB-MERA passed 8/9 strict validations while the self-refinement baseline and the investigated GRACE extension each passed 0/9.

What carries the argument

PYTHALAB-MERA, the external controller that integrates episodic memory selection, AST-derived skill retrieval, fail-fast validation, shaped rewards, and TD(lambda) eligibility traces around a frozen LLM proposer.
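
To make the division of labor concrete, here is a minimal Python sketch of such a loop. Every interface below (`memory.select`, `skills.select`, `validator.run`, the reward shaping) is a hypothetical name reconstructed from the description above; the paper does not publish its implementation.

```python
from dataclasses import dataclass

@dataclass
class Outcome:
    passed: bool
    stage_reached: int  # index of the last fail-fast stage completed
    n_stages: int       # total stages in the validation pipeline

def shaped_reward(outcome: Outcome) -> float:
    """Bounded shaped reward in [-1, 1]: full credit on a strict pass,
    partial credit for progressing further through the pipeline."""
    if outcome.passed:
        return 1.0
    return -1.0 + 2.0 * outcome.stage_reached / outcome.n_stages

def control_loop(llm, memory, skills, validator, task, max_attempts=3):
    """Propose-validate-update cycles around a frozen LLM proposer."""
    for _ in range(max_attempts):
        # The controller, not the model, decides what enters the prompt:
        # episodic records plus AST-derived skills relevant to this task.
        prompt = task.describe() + memory.select(task) + skills.select(task)
        candidate = llm.generate(prompt)    # frozen model: no weight updates
        outcome = validator.run(candidate)  # fail-fast validation pipeline
        memory.update(task, candidate, outcome, shaped_reward(outcome))
        if outcome.passed:                  # strict acceptance gate
            return candidate
    return None                             # attempt budget exhausted
```

The point of the sketch is the separation of concerns: the frozen model only ever sees a prompt and emits a file, while all adaptation (retrieval, validation, reward, acceptance) lives outside it.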

If this is right

  • Persistent episodic memory and AST-derived skill reuse become available for code generation without updating LLM weights.
  • Bounded shaped rewards and eligibility traces allow credit assignment across multiple repair attempts within a fixed budget (see the TD(λ) sketch after this list).
  • The separation of proposal generation by the frozen model from control by the external system supports modular addition of validation gates.
  • Strict fail-fast validation converts directly into acceptance decisions that improve overall success rate in the measured setting.
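
The second bullet is the mechanically least familiar piece, so here is a toy TD(λ) update with accumulating eligibility traces showing how a delayed validation reward reaches earlier repair attempts. The state encoding and the values of alpha, gamma, and lambda are illustrative assumptions, not configurations reported in the paper.

```python
def td_lambda(transitions, V, alpha=0.1, gamma=0.95, lam=0.8):
    """transitions: list of (state, reward, next_state); next_state is None
    at episode end. Accumulating traces spread the final reward backwards."""
    e = {}  # eligibility trace per state
    for s, r, s_next in transitions:
        v_next = V.get(s_next, 0.0) if s_next is not None else 0.0
        delta = r + gamma * v_next - V.get(s, 0.0)  # TD error
        e[s] = e.get(s, 0.0) + 1.0                  # bump trace for current state
        for state in e:
            V[state] = V.get(state, 0.0) + alpha * delta * e[state]
            e[state] = gamma * lam * e[state]       # decay every trace
    return V

# Three repair attempts; only the last passes strict validation (+1 reward).
V = td_lambda([("attempt-1", 0.0, "attempt-2"),
               ("attempt-2", 0.0, "attempt-3"),
               ("attempt-3", 1.0, None)], V={})
print(V)  # earlier attempts receive discounted, trace-weighted credit
```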

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The controller architecture could be tested on iterative validation tasks outside code, such as data pipeline construction where execution feedback is also available.
  • Varying the attempt budget above three while keeping the same tasks would show whether performance gains scale or plateau under looser constraints.
  • Adapting the AST skill extraction and memory selection to additional programming languages would indicate how language-specific the current gains are.

Load-bearing premise

The three chosen coding tasks together with the fail-fast validation pipeline and three-attempt budget are representative of the broader class of validation-grounded code generation problems the system is intended to address.

What would settle it

Re-running the identical three-task, three-repetition, three-attempt protocol on a fourth distinct coding task and observing whether PYTHALAB-MERA maintains an 8/9 or higher success rate while baselines remain near zero would test whether the reported improvement depends on the original task selection.
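
Since the figures report strict pass rates with 95% Wilson intervals [28], the arithmetic behind the 8/9 versus 0/9 comparison is easy to reproduce. A minimal helper using the standard Wilson score formula (this is textbook statistics, not code from the paper):

```python
from math import sqrt

def wilson_interval(successes: int, n: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion (z = 1.96 for 95%)."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return center - half, center + half

print(wilson_interval(8, 9))  # about (0.57, 0.98)
print(wilson_interval(0, 9))  # about (0.00, 0.30): disjoint, but both wide at n = 9
```

The intervals do not overlap, which is why the bounded claim survives at n = 9; their width is why a fourth task would be informative.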

Figures

Figures reproduced from arXiv: 2605.08468 by Mehmet Iscan.

Figure 1: PYTHALAB-MERA validation-grounded control loop. The frozen generator proposes a candidate program, while the external controller selects retrieval evidence, validates the candidate, and converts validation feedback into memory and credit updates. The design follows agent architectures that externalize memory and control rather than embedding all adaptation into the LLM [40,51,65].

Figure 2: Primary phase1c hard RL outcome. Success is reported as strict validator pass rate with 95% Wilson intervals; attempts and wall-clock time are run-level efficiency measures. The figure shows the same bounded conclusion as the table: PYTHALAB-MERA is the only condition with nonzero strict success in this measured setting, and it also uses fewer attempts and less time on average.

Figure 3: Per-task success and residual failure distribution in phase1c. Cell labels show successes out of three repeats; failure bars count only non-passing runs. The profile indicates that PYTHALAB-MERA reduced hard-task failures in this subset, while GRACE incurred more runtime and import failures.

Figure 4: Secondary phase1b RL comparison. This plot is descriptive because each of the 11 tasks was run once per condition; it is included to show that the primary phase1c finding is not the only observed efficiency signal, but it is not treated as the main repeated-trial claim.
read the original abstract

Local LLM-based coding agents increasingly work in settings where correctness is earned through execution feedback, persistent state, and bounded repair, not through a single fluent answer. Static retrieval, long-context prompting, self-refinement, execution-feedback repair, and reinforcement learning over model weights each address part of this setting, but they do not jointly provide validation-grounded episodic memory, adaptive retrieval-action selection, delayed credit assignment, and structural skill reuse around a frozen local model. We introduce PYTHALAB-MERA, a lightweight external controller for local validation-conditioned code generation. The frozen language model proposes complete source files; the controller decides which memory records and AST-derived skills should enter the next prompt, validates each candidate through a fail-fast pipeline, converts validation outcomes into bounded shaped rewards, and propagates delayed credit through TD(lambda)-style eligibility traces. We evaluate the implementation as a local CLI artifact on reinforcement-learning coding tasks with strict validation gates. In the measured hard RL setting with three tasks, three repetitions, and a three-attempt budget, PYTHALAB-MERA passed 8/9 strict validations; the self-refinement baseline and the investigated GRACE extension each passed 0/9. These results support a deliberately bounded claim: in this recorded setting, the external memory-and-retrieval controller improved validation success. They do not establish general-purpose code synthesis, state-of-the-art performance, formal program correctness, or formal safety.

Editorial analysis

A structured set of objections, weighed in public.

A referee report, a simulated author's rebuttal, a circularity audit, and an axiom ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper introduces PYTHALAB-MERA, a lightweight external controller for frozen LLM-based coding agents. The controller manages episodic memory, adaptive retrieval of records and AST-derived skills, fail-fast validation, shaped rewards, and TD(λ)-style eligibility traces for delayed credit assignment. In a specific experimental setting with three reinforcement-learning coding tasks, three repetitions, and a three-attempt budget, the system achieved 8/9 strict validations while self-refinement and GRACE baselines achieved 0/9. The authors explicitly bound the claim to this recorded setting without asserting generality, SOTA performance, or formal correctness.

Significance. If the reported counts are accurate, the work provides a concrete demonstration that an external memory-retrieval-acceptance controller can improve validation success over simple self-refinement in a constrained, validation-grounded code-generation setting. The deliberate bounding of the claim, the use of strict validation gates, and the avoidance of overclaiming are strengths. The very small trial count (nine total) and lack of supporting implementation details, however, limit the result's immediate significance and utility as a reproducible baseline for the broader class of problems the system targets.

major comments (1)
  1. Evaluation section: The central empirical result (8/9 versus 0/9) is reported without any description of the three coding tasks, the concrete criteria or implementation of the fail-fast validation pipeline, the precise configuration of the TD(λ) traces, or the implementation details of the self-refinement and GRACE baselines. These omissions make the numerical outcome impossible to contextualize, replicate, or assess for robustness.
minor comments (2)
  1. The abstract is unusually long and dense; some technical description could be moved to the introduction or methods to improve readability.
  2. Ensure first-use definitions for acronyms such as TD(λ) and AST even if they are standard in the field.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the positive recognition of the paper's bounded claims, strict validation approach, and avoidance of overclaiming. We address the single major comment below and will revise the manuscript to improve reproducibility and contextualization of the results.

read point-by-point responses
  1. Referee: Evaluation section: The central empirical result (8/9 versus 0/9) is reported without any description of the three coding tasks, the concrete criteria or implementation of the fail-fast validation pipeline, the precise configuration of the TD(λ) traces, or the implementation details of the self-refinement and GRACE baselines. These omissions make the numerical outcome impossible to contextualize, replicate, or assess for robustness.

    Authors: We agree that the Evaluation section requires substantially more detail to support replication and assessment of the reported outcome. In the revised manuscript we will expand this section with: (1) explicit descriptions of the three reinforcement-learning coding tasks, including their objectives, input specifications, and the precise strict-validation criteria used for success; (2) the concrete implementation of the fail-fast validation pipeline, including the ordered sequence of checks, error categories, and acceptance thresholds; (3) the exact configuration of the TD(λ) eligibility traces, specifying the λ value, trace-decay schedule, and credit-propagation rules; and (4) implementation details for the self-refinement baseline and the GRACE extension, including prompt templates, iteration budgets, and any modifications made to the original GRACE procedure. These additions will be accompanied by pseudocode or a table summarizing hyperparameters so that the 8/9 versus 0/9 comparison can be fully contextualized and reproduced. revision: yes
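
For readers who want a concrete picture while the promised revision is pending, a fail-fast pipeline of the shape the rebuttal describes might look like the following sketch. The stage names, their ordering, the error categories, and the test command are assumptions for illustration, not the paper's published design.

```python
import ast
import subprocess
import sys

def fail_fast_validate(source: str, test_cmd: list):
    """Run ordered checks, stopping at the first failure; return
    (passed, error_category, detail) for downstream reward shaping."""
    try:
        ast.parse(source)                       # stage 1: syntax
    except SyntaxError as exc:
        return False, "syntax", str(exc)
    try:
        compile(source, "<candidate>", "exec")  # stage 2: compilation
    except Exception as exc:
        return False, "compile", str(exc)
    with open("candidate.py", "w") as fh:       # stage 3: execute the test suite
        fh.write(source)
    proc = subprocess.run(test_cmd, capture_output=True, text=True, timeout=60)
    if proc.returncode != 0:
        return False, "tests", proc.stderr[-500:]
    return True, "passed", ""

# Example invocation (hypothetical test harness):
# fail_fast_validate(code, [sys.executable, "-m", "pytest", "-x"])
```

The error category returned at the failing stage is exactly the kind of signal a shaped-reward scheme can grade: a syntax failure earns less partial credit than a candidate that compiles but fails its tests.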

Circularity Check

0 steps flagged

No significant circularity identified

full rationale

The paper contains no derivation chain, mathematical model, or predictive formalism that could reduce to its inputs. Its central claim is a bounded empirical report of measured validation counts (8/9 successes versus 0/9 for baselines) obtained from a fixed experimental protocol with three tasks, three repetitions, and a three-attempt budget. All described components—memory records, AST-derived skills, fail-fast validation, shaped rewards, and TD(λ) traces—are implementation choices whose correctness is assessed by direct execution outcomes rather than by any self-referential definition, fitted-parameter prediction, or load-bearing self-citation. The abstract explicitly disclaims generality, so no unstated premise is required for the reported counts to hold.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The approach relies on standard reinforcement-learning concepts (eligibility traces, shaped rewards) and prompting techniques (memory retrieval, AST-derived skills) without introducing new mathematical entities or free parameters whose values are fitted in the abstract.

axioms (2)
  • domain assumption: Validation outcomes from a fail-fast pipeline can be converted into bounded shaped rewards suitable for TD(λ) credit assignment.
    Invoked to justify the reward mechanism that drives the controller.
  • domain assumption: AST-derived skills extracted from prior code can be usefully selected and inserted into future prompts (a minimal extraction sketch follows below).
    Core premise of the retrieval component.
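
As an illustration of the second axiom, a minimal version of AST-derived skill extraction in Python might look like this. The paper's actual extraction and selection rules are not published, so this is only one plausible reading of the term.

```python
import ast

def extract_skills(source: str) -> dict:
    """Map each top-level function name to its source text, so snippets
    from previously accepted code can be re-inserted into later prompts."""
    tree = ast.parse(source)
    return {node.name: ast.get_source_segment(source, node)
            for node in tree.body
            if isinstance(node, ast.FunctionDef)}

skills = extract_skills("def add(a, b):\n    return a + b\n")
print(skills["add"])  # a reusable snippet for a future prompt
```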

pith-pipeline@v0.9.0 · 5561 in / 1464 out tokens · 85120 ms · 2026-05-12T01:06:57.113737+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

65 extracted references · 65 canonical work pages · 9 internal anchors

  [1] Tablan, V., Taylor, S., Hurtado, G., Bernhem, K., Uhrenholt, A., Farei, G., & Moilanen, K. (2025). Smarter together: Creating agentic communities of practice through shared experiential learning. arXiv. https://arxiv.org/abs/2511.08301

  [2] Mishra, S., Niroula, S., Yadav, U., Thakur, D., Gyawali, S., & Gaire, S. (2026). SoK: Agentic retrieval-augmented generation (RAG): Taxonomy, architectures, evaluation, and research directions. arXiv. https://arxiv.org/abs/2603.07379

  [3] Shanto, M. H., Asaduzzaman, M., & Ngom, A. (2026). RAG-Reflect: Agentic retrieval-augmented generation with reflections for comment-driven code maintenance on Stack Overflow. arXiv. https://arxiv.org/abs/2604.22217

  [4] Jiang, J., Shen, J., Kim, S., Yoo, K. M., Kim, J., & Kim, S. (2026). ReflexiCoder: Teaching large language models to self-reflect on generated code and self-correct it via reinforcement learning. arXiv. https://arxiv.org/abs/2603.05863

  [5] Wu, L., Pei, Y., Yang, Z., Li, K., Lu, Z., Tan, H., ... & Hao, D. (2026). DebugRepair: Enhancing LLM-based automated program repair via self-directed debugging. arXiv. https://arxiv.org/abs/2604.19305

  [6] Sapkota, R., Roumeliotis, K. I., & Karkee, M. (2025). Vibe coding vs. agentic coding: Fundamentals and practical implications of agentic AI. arXiv. https://arxiv.org/abs/2505.19443

  [7] Xia, C. S., Wei, Y., & Zhang, L. (2023, May). Automated program repair in the era of large pre-trained language models. In 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE) (pp. 1482-1494). IEEE. https://doi.org/10.1109/ICSE48619.2023.00129

  [8] Farzandway, M., & Ghassemi, F. (2025). Automated repair of C programs using large language models. arXiv. https://arxiv.org/abs/2509.01947

  [9] Fan, Z., Gao, X., Mirchev, M., Roychoudhury, A., & Tan, S. H. (2023, May). Automated repair of programs from large language models. In 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE) (pp. 1469-1481). IEEE. https://doi.org/10.1109/ICSE48619.2023.00128

  [10] Narajala, V. S., & Narayan, O. (2025). Securing agentic AI: A comprehensive threat model and mitigation framework for generative AI agents. arXiv. https://arxiv.org/abs/2504.19956

  [11] Huang, Y., Gupta, S., Zhong, Z., Li, K., & Chen, D. (2023, December). Privacy implications of retrieval-based language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (pp. 14887-14902). https://doi.org/10.18653/v1/2023.emnlp-main.921

  [12] Kuang, S., Tian, Z., Lin, K., Tao, C., Wang, S., Bai, H., ... & Chen, J. (2026). REAgent: Requirement-driven LLM agents for software issue resolution. arXiv. https://arxiv.org/abs/2604.06861

  [13] Wang, C., Zhou, Z., Wang, C., Sun, Y., Yang, S., Yuan, Y., ... & Han, Z. (2025, December). End-to-end secure code repair with context-aware anonymization and isolated agent execution. In 2025 IEEE International Conference on Blockchain Technology and Information Security (ICBCTIS) (pp. 1-8). IEEE. https://doi.org/10.1109/ICBCTIS66509.2025.11387695

  [14] Madaan, A., Tandon, N., Gupta, P., Hallinan, S., Gao, L., Wiegreffe, S., ... & Clark, P. (2023). Self-Refine: Iterative refinement with self-feedback. Advances in Neural Information Processing Systems, 36, 46534-46594. https://arxiv.org/abs/2303.17651

  [15] Gehring, J., Zheng, K., Copet, J., Mella, V., Carbonneaux, Q., Cohen, T., & Synnaeve, G. (2024). RLEF: Grounding code LLMs in execution feedback with reinforcement learning. arXiv. https://arxiv.org/abs/2410.02089

  [16] Chen, Y., Sun, Y., Wang, H., Zhang, X., Shen, X., Li, W., & Zhang, W. (2026). Contextual counterfactual credit assignment for multi-agent reinforcement learning in LLM collaboration. arXiv. https://arxiv.org/abs/2603.06859

  [17] Yang, X., Li, W., Sheng, J., Shen, C., Hua, Y., & Wang, X. (2025). Agentic episodic control. arXiv. https://arxiv.org/abs/2506.01442

  [18] Zhang, H., Long, Q., Bao, J., Feng, T., Zhang, W., Yue, H., & Wang, W. (2026). MemSkill: Learning and evolving memory skills for self-evolving agents. arXiv. https://arxiv.org/abs/2602.02474

  [19] Chen, S., Gai, J., Zhou, R., Zhang, J., Zhu, T., Li, J., ... & Teh, Y. W. (2026). SkillCraft: Can LLM agents learn to use tools skillfully? arXiv. https://arxiv.org/abs/2603.00718

  [20] Gonzalez, M. A. A., Hernandez, M. B., Perez, M. A. P., Orozco, B. L., Soto, J. T. C., & Malagon, S. (2025). Do repetitions matter? Strengthening reliability in LLM evaluations. arXiv. https://arxiv.org/abs/2509.24086

  [21] Gallo, R. J., Baiocchi, M., Savage, T. R., & Chen, J. H. (2025). Establishing best practices in large language model research: an application to repeat prompting. Journal of the American Medical Informatics Association, 32(2), 386-390. https://doi.org/10.1093/jamia/ocae294

  [22] Ning, K., Chen, J., Zhang, J., Li, W., Wang, Z., Feng, Y., ... & Zheng, Z. (2026). Defining and detecting the defects of large language model-based autonomous agents. IEEE Transactions on Software Engineering. https://doi.org/10.1109/TSE.2026.3658554

  [23] Chen, Z., Ma, W., & Jiang, L. (2025). Beyond final code: A process-oriented error analysis of software development agents in real-world GitHub scenarios. arXiv. https://arxiv.org/abs/2503.12374

  [24] Barke, S., Goyal, A., Khare, A., Singh, A., Nath, S., & Bansal, C. (2026). AgentRx: Diagnosing AI agent failures from execution trajectories. arXiv. https://arxiv.org/abs/2602.02475

  [25] Zhu, Y., Jin, T., Pruksachatkun, Y., Zhang, A., Liu, S., Cui, S., Kapoor, S., Longpre, S., Meng, K., Weiss, R., Barez, F., Gupta, R., Dhamala, J., Merizian, J., Giulianelli, M., Coppock, H., Ududec, C., Sekhon, J., Steinhardt, J., Kellermann, A., Schwettmann, S., Zaharia, M., Stoica, I., Liang, P., & Kang, D. (2025). Establishing best practices for building r... arXiv. https://arxiv.org/abs/2507.02825

  [26] Li, L., Chu, W., Langford, J., & Schapire, R. E. (2010). A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th International Conference on World Wide Web (pp. 661-670). ACM. https://doi.org/10.1145/1772690.1772758

  [27] Sutton, R. S. (1988). Learning to predict by the methods of temporal differences. Machine Learning, 3, 9-44. https://doi.org/10.1007/BF00115009

  [28] Wilson, E. B. (1927). Probable inference, the law of succession, and statistical inference. Journal of the American Statistical Association, 22(158), 209-212. https://doi.org/10.1080/01621459.1927.10502953

  [29] Gautam, D., Garg, S., Jang, J., Sundaresan, N., & Zilouchian Moghaddam, R. (2025). RefactorBench: Evaluating stateful reasoning in language agents through code. arXiv. https://arxiv.org/abs/2503.07832

  [30] Arimbur, J. J. (2026). How many tries does it take? Iterative self-repair in LLM code generation across model scales and benchmarks. arXiv. https://arxiv.org/abs/2604.10508

  [31] Dai, D., Liu, M., Li, A., Cao, J., Wang, Y., Wang, C., ... & Zheng, Z. (2025). FeedbackEval: A benchmark for evaluating large language models in feedback-driven code repair tasks. arXiv. https://arxiv.org/abs/2504.06939

  [32] Sunil, B. D., Sinha, I., Maheshwari, P., Todmal, S., Mallik, S., & Mishra, S. (2026). Memory poisoning attack and defense on memory-based LLM agents. arXiv. https://arxiv.org/abs/2601.05504

  [33] Srivastava, S. S., & He, H. (2025). MemoryGraft: Persistent compromise of LLM agents via poisoned experience retrieval. arXiv. https://arxiv.org/abs/2512.16962

  [34] Liu, F., Zhang, Y., Luo, J., Dai, J., Chen, T., Yuan, L., Yu, Z., Shi, Y., Li, K., Zhou, C., Chen, H., & Yang, M. (2025). Make agent defeat agent: Automatic detection of taint-style vulnerabilities in LLM-based agents. In 34th USENIX Security Symposium (USENIX Security 25). https://www.usenix.org/conference/usenixsecurity25/presentation/liu-fengyu

  [35] Wang, W., Wang, Y., Joty, S., & Hoi, S. C. (2023, November). RAP-Gen: Retrieval-augmented patch generation with CodeT5 for automatic program repair. In Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (pp. 146-158). https://doi.org/10.1145/3611643.3616256

  [36] Lee, H., & Yang, G. (2025, November). AgentRepair: Multi-agent, AST-anchored, retrieval-augmented program repair for cold-start environments. In 2025 12th International Conference on Dependable Systems and Their Applications (DSA) (pp. 121-132). IEEE. https://doi.org/10.1109/DSA66321.2025.00025

  [37] Chondamrongkul, N., Kyaw, M. P. P., Ko, S. M., Paing, P. P., Swe, M. K. T., & Hongthong, T. (2026). RepoAI: Automated code refactoring through multi-agent LLM orchestration and retrieval-augmented generation. Science of Computer Programming, 253, Article 103477. https://doi.org/10.1016/j.scico.2026.103477

  [38] Bui, N. D. (2026). Building effective AI coding agents for the terminal: Scaffolding, harness, context engineering, and lessons learned. arXiv. https://arxiv.org/abs/2603.05344

  [39] Rombaut, B. (2026). Inside the scaffold: A source-code taxonomy of coding agent architectures. arXiv. https://arxiv.org/abs/2604.03515

  [40] Kim, M. H. (2025). Bridging symbolic control and neural reasoning in LLM agents: The structured cognitive loop. arXiv. https://arxiv.org/abs/2511.17673

  [41] Poon, M., Dai, X., Liu, X., Kong, F., Lui, J. C., & Zuo, J. (2026, March). Online multi-LLM selection via contextual bandits under unstructured context evolution. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 40, No. 29, pp. 24855-24863). https://doi.org/10.1609/aaai.v40i29.39672

  [42] Hong, Z., Zhang, Q., Sun, J., Shang, Z., Kong, M., Wang, X., ... & Dai, Z. (2026). MASPOB: Bandit-based prompt optimization for multi-agent systems with graph neural networks. arXiv. https://arxiv.org/abs/2603.02630

  [43] Rietz, F., Smirnov, O., Karimi, S., & Cao, L. (2026). Prompt tuning decision transformers with structured and scalable bandits. Advances in Neural Information Processing Systems, 38, 58258-58286. https://www.microsoft.com/en-us/research/publication/prompt-tuning-decision-transformers-with-structured-and-scalable-bandits/

  [44] Sloan, M., & Wang, J. (2015, September). Dynamic information retrieval: Theoretical framework and application. In Proceedings of the 2015 International Conference on the Theory of Information Retrieval (pp. 61-70). https://doi.org/10.1145/2808194.2809457

  [45] Yang, A., & Yang, G. H. (2017, October). A contextual bandit approach to dynamic search. In Proceedings of the ACM SIGIR International Conference on Theory of Information Retrieval (pp. 301-304). https://doi.org/10.1145/3121050.3121101

  [46] Zhang, W., Zhu, Y., Lu, Y., Demarne, M., Wang, W., Deng, K., ... & Krishnan, S. (2025, November). FLAIR: Feedback learning for adaptive information retrieval. In Proceedings of the 34th ACM International Conference on Information and Knowledge Management (pp. 6284-6292). https://doi.org/10.1145/3746252.3761553

  [47] Le, H., Wang, Y., Gotmare, A. D., Savarese, S., & Hoi, S. C. H. (2022). CodeRL: Mastering code generation through pretrained models and deep reinforcement learning. Advances in Neural Information Processing Systems, 35, 21314-21328. https://arxiv.org/abs/2207.01780

  [48] Yu, Z., Gu, W., Wang, Y., Jiang, X., Zeng, Z., Wang, J., ... & Zhang, S. (2024). Reasoning through execution: Unifying process and outcome rewards for code generation. arXiv. https://arxiv.org/abs/2412.15118

  [49] Han, B., Ren, Z., Wu, Z., Zhou, Y., & Peng, J. (2022). Off-policy reinforcement learning with delayed rewards. In Proceedings of the 39th International Conference on Machine Learning (PMLR, Vol. 162, pp. 8280-8303). https://proceedings.mlr.press/v162/han22e.html

  [50] Li, B., Sun, Z., Huang, T., Zhang, H., Wan, Y., Li, G., ... & Lyu, C. (2024). IRCoCo: Immediate rewards-guided deep reinforcement learning for code completion. Proceedings of the ACM on Software Engineering, 1(FSE), 182-203. https://doi.org/10.1145/3643735

  [51] Zhang, D., Chen, L., Zhang, S., Xu, H., Zhao, Z., & Yu, K. (2023). Large language models are semi-parametric reinforcement learning agents. Advances in Neural Information Processing Systems, 36, 78227-78239. https://arxiv.org/abs/2306.07929

  [52] Krishnamoorthy, A., Ivatury, K., & Ahmadnia, B. (2025, September). Multi-agent reinforcement learning for interactive code debugging with human feedback and memory. In Proceedings of the 15th International Conference on Recent Advances in Natural Language Processing - Natural Language Processing in the Generative AI Era (pp. 595-603). https://doi.org/10.26615/978-954...

  [53] Chilowicz, M., Duris, E., & Roussel, G. (2009, May). Syntax tree fingerprinting for source code similarity detection. In 2009 IEEE 17th International Conference on Program Comprehension (pp. 243-247). IEEE. https://doi.org/10.1109/ICPC.2009.5090050

  [54] Li, Y., Wang, S., & Nguyen, T. (2021, May). Fault localization with code coverage representation learning. In 2021 IEEE/ACM 43rd International Conference on Software Engineering (ICSE) (pp. 661-673). IEEE. https://doi.org/10.1109/ICSE43902.2021.00067

  [55] Verma, A., Udhayanan, P., Shankar, R. M., Kn, N., & Chakrabarti, S. K. (2021, October). Source-code similarity measurement: Syntax tree fingerprinting for automated evaluation. In Proceedings of the First International Conference on AI-ML Systems (pp. 1-7). https://doi.org/10.1145/3486001.3486228

  [56] Ellis, K., Morales, L., Sablé-Meyer, M., Solar-Lezama, A., & Tenenbaum, J. (2018). Learning libraries of subroutines for neurally guided Bayesian program induction. In Advances in Neural Information Processing Systems 31 (pp. 7816-7826). https://papers.nips.cc/paper/8006-learning-libraries-of-subroutines-for-neurallyguided-bayesian-program-induction

  [57] Xu, P., Wu, G., Chen, X., Yu, T., Xiao, C., Dernoncourt, F., ... & Swaminathan, V. (2026, March). Skill discovery for software scripting automation via offline simulations with LLMs. In Findings of the Association for Computational Linguistics: EACL 2026 (pp. 743-759). https://doi.org/10.18653/v1/2026.findings-eacl.37

  [58] Stengel-Eskin, E., Prasad, A., & Bansal, M. (2024). ReGAL: Refactoring programs to discover generalizable abstractions. arXiv. https://arxiv.org/abs/2401.16467

  [59] Rabin, R., Hostetler, J., McGregor, S., Weir, B., & Judd, N. (2025). SandboxEval: Towards securing test environment for untrusted code. arXiv. https://arxiv.org/abs/2504.00018

  [60] Wang, J., Luo, X., Cao, L., He, H., Huang, H., Xie, J., ... & Cai, Y. (2024). Is your AI-generated code really safe? Evaluating large language models on secure code generation with CodeSecEval. arXiv. https://arxiv.org/abs/2407.02395

  [61] Jiang, H., Chen, Y., Cao, Y., Lee, H. Y., & Tan, R. T. (2025). CodeJudgeBench: Benchmarking LLM-as-a-judge for coding tasks. arXiv. https://arxiv.org/abs/2507.10535

  [62] Zhao, Y., Luo, Z., Tian, Y., Lin, H., Yan, W., Li, A., & Ma, J. (2024). CodeJudge-Eval: Can large language models be good judges in code understanding? arXiv. https://arxiv.org/abs/2408.10718

  [63] Jain, S., Ahmed, U. Z., Sahai, S., & Leong, B. (2025). Beyond consensus: Mitigating the agreeableness bias in LLM judge evaluations. arXiv. https://arxiv.org/abs/2510.11822

  [64] Spiess, C., Gros, D., Pai, K. S., Pradel, M., Rabin, M. R. I., Alipour, A., ... & Ahmed, T. (2025, April). Calibration and correctness of language models for code. In 2025 IEEE/ACM 47th International Conference on Software Engineering (ICSE) (pp. 540-552). IEEE. https://doi.org/10.1109/ICSE55347.2025.00040

  [65] Chen, Z., Chen, D., Jin, R., Liang, Y., Xie, Y., & Sun, H. (2026). Bridging online and offline RL: Contextual bandit learning for multi-turn code generation. arXiv. https://arxiv.org/abs/2602.03806