Recognition: 2 theorem links · Lean Theorem
PYTHALAB-MERA: Validation-Grounded Memory, Retrieval, and Acceptance Control for Frozen-LLM Coding Agents
Pith reviewed 2026-05-12 01:06 UTC · model grok-4.3
The pith
An external controller for frozen LLMs raises strict validation success on coding tasks from 0/9 to 8/9 in a hard RL setting.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
PYTHALAB-MERA is a lightweight external controller for local validation-conditioned code generation. The frozen language model proposes complete source files; the controller decides which memory records and AST-derived skills should enter the next prompt, validates each candidate through a fail-fast pipeline, converts validation outcomes into bounded shaped rewards, and propagates delayed credit through TD(lambda)-style eligibility traces. In the measured hard RL setting with three tasks, three repetitions, and a three-attempt budget, PYTHALAB-MERA passed 8/9 strict validations while the self-refinement baseline and the investigated GRACE extension each passed 0/9.
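The review does not include code for the described loop. As a minimal sketch, assuming hypothetical `propose` (the frozen LLM, prompt in, source file out) and `validate` (the fail-fast pipeline, returning an accept flag and a bounded reward) interfaces that are not the paper's actual API, the accept-or-retry cycle could look like:

```python
from dataclasses import dataclass

@dataclass
class MemoryRecord:
    prompt_snippet: str
    passed: bool
    score: float = 0.0  # running value estimate, updated elsewhere by TD(lambda)

def control_episode(propose, validate, memory, attempts=3):
    """One bounded episode: select context, propose, validate, accept or retry.

    `propose` and `validate` are assumed interfaces standing in for the frozen
    model and the fail-fast pipeline; the three-attempt budget matches the
    measured setting.
    """
    trajectory = []
    for _ in range(attempts):
        # Retrieval-action selection: highest-value records enter the next prompt.
        context = [r.prompt_snippet for r in
                   sorted(memory, key=lambda r: r.score, reverse=True)[:3]]
        candidate = propose("\n".join(context))
        accepted, reward = validate(candidate)
        trajectory.append((candidate, reward))
        memory.append(MemoryRecord(candidate, accepted, reward))
        if accepted:  # strict gate: stop at the first validated file
            return candidate, trajectory
    return None, trajectory
```

The returned trajectory is what a TD(lambda)-style update would later consume for delayed credit assignment.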
What carries the argument
PYTHALAB-MERA, the external controller that integrates episodic memory selection, AST-derived skill retrieval, fail-fast validation, shaped rewards, and TD(lambda) eligibility traces around a frozen LLM proposer.
If this is right
- Persistent episodic memory and AST-derived skill reuse become available for code generation without updating LLM weights.
- Bounded shaped rewards and eligibility traces allow credit assignment across multiple repair attempts within a fixed budget.
- The separation of proposal generation by the frozen model from control by the external system supports modular addition of validation gates.
- Strict fail-fast validation converts directly into acceptance decisions that improve overall success rate in the measured setting.
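The credit-assignment point above can be made concrete. A minimal sketch, with illustrative stage weights and trace parameters (the review does not give the paper's λ, α, or γ, so the values below are assumptions):

```python
def shaped_reward(stage_reached, accepted):
    """Map a fail-fast validation outcome to a bounded reward in [-1, 1].
    Stage weights are illustrative, not the paper's configuration."""
    if accepted:
        return 1.0
    # Partial credit for getting deeper into the pipeline before failing.
    return max(-1.0, -1.0 + 0.25 * stage_reached)

def td_lambda_update(values, trace, episode, alpha=0.1, gamma=0.9, lam=0.8):
    """Accumulating eligibility traces: a delayed validation reward propagates
    back to every retrieval/memory selection made earlier in the episode."""
    for state, reward, next_state in episode:
        delta = reward + gamma * values.get(next_state, 0.0) - values.get(state, 0.0)
        trace[state] = trace.get(state, 0.0) + 1.0   # accumulate eligibility
        for s in list(trace):
            values[s] = values.get(s, 0.0) + alpha * delta * trace[s]
            trace[s] *= gamma * lam                  # decay eligibility
    return values
```

Run on a two-step episode where only the final attempt is rewarded, the earlier selection still receives a discounted share of the credit, which is the behavior the bullet above describes.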
Where Pith is reading between the lines
- The controller architecture could be tested on iterative validation tasks outside code, such as data pipeline construction where execution feedback is also available.
- Varying the attempt budget above three while keeping the same tasks would show whether performance gains scale or plateau under looser constraints.
- Adapting the AST skill extraction and memory selection to additional programming languages would indicate how language-specific the current gains are.
Load-bearing premise
The three chosen coding tasks together with the fail-fast validation pipeline and three-attempt budget are representative of the broader class of validation-grounded code generation problems the system is intended to address.
What would settle it
Re-running the identical three-task, three-repetition, three-attempt protocol on a fourth distinct coding task and observing whether PYTHALAB-MERA maintains an 8/9 or higher success rate while baselines remain near zero would test whether the reported improvement depends on the original task selection.
Original abstract
Local LLM-based coding agents increasingly work in settings where correctness is earned through execution feedback, persistent state, and bounded repair, not through a single fluent answer. Static retrieval, long-context prompting, self-refinement, execution-feedback repair, and reinforcement learning over model weights each address part of this setting, but they do not jointly provide validation-grounded episodic memory, adaptive retrieval-action selection, delayed credit assignment, and structural skill reuse around a frozen local model. We introduce PYTHALAB-MERA, a lightweight external controller for local validation-conditioned code generation. The frozen language model proposes complete source files; the controller decides which memory records and AST-derived skills should enter the next prompt, validates each candidate through a fail-fast pipeline, converts validation outcomes into bounded shaped rewards, and propagates delayed credit through TD(lambda)-style eligibility traces. We evaluate the implementation as a local CLI artifact on reinforcement-learning coding tasks with strict validation gates. In the measured hard RL setting with three tasks, three repetitions, and a three-attempt budget, PYTHALAB-MERA passed 8/9 strict validations; the self-refinement baseline and the investigated GRACE extension each passed 0/9. These results support a deliberately bounded claim: in this recorded setting, the external memory-and-retrieval controller improved validation success. They do not establish general-purpose code synthesis, state-of-the-art performance, formal program correctness, or formal safety.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces PYTHALAB-MERA, a lightweight external controller for frozen LLM-based coding agents. The controller manages episodic memory, adaptive retrieval of records and AST-derived skills, fail-fast validation, shaped rewards, and TD(λ)-style eligibility traces for delayed credit assignment. In a specific experimental setting with three reinforcement-learning coding tasks, three repetitions, and a three-attempt budget, the system achieved 8/9 strict validations while self-refinement and GRACE baselines achieved 0/9. The authors explicitly bound the claim to this recorded setting without asserting generality, SOTA performance, or formal correctness.
Significance. If the reported counts are accurate, the work provides a concrete demonstration that an external memory-retrieval-acceptance controller can improve validation success over simple self-refinement in a constrained, validation-grounded code-generation setting. The deliberate bounding of the claim, the use of strict validation gates, and the avoidance of overclaiming are strengths. The very small trial count (nine total) and lack of supporting implementation details, however, limit the result's immediate significance and utility as a reproducible baseline for the broader class of problems the system targets.
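The referee's small-sample concern can be quantified with the Wilson score interval; Wilson (1927) appears in the paper's reference list, but the calculation below is this review's illustration, not the paper's analysis:

```python
from math import sqrt

def wilson_interval(successes, n, z=1.96):
    """95% Wilson score interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = (z / denom) * sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return centre - half, centre + half

lo, hi = wilson_interval(8, 9)  # roughly (0.57, 0.98)
```

With nine trials, 8/9 yields an interval of roughly 0.57 to 0.98, and 0/9 yields roughly 0.00 to 0.30: the separation between system and baselines is real, but the uncertainty bands are wide, which supports the referee's call for more detail and more trials.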
major comments (1)
- Evaluation section: The central empirical result (8/9 versus 0/9) is reported without any description of the three coding tasks, the concrete criteria or implementation of the fail-fast validation pipeline, the precise configuration of the TD(λ) traces, or the implementation details of the self-refinement and GRACE baselines. These omissions make the numerical outcome impossible to contextualize, replicate, or assess for robustness.
minor comments (2)
- The abstract is unusually long and dense; some technical description could be moved to the introduction or methods to improve readability.
- Ensure first-use definitions for acronyms such as TD(λ) and AST even if they are standard in the field.
Simulated Author's Rebuttal
We thank the referee for the positive recognition of the paper's bounded claims, strict validation approach, and avoidance of overclaiming. We address the single major comment below and will revise the manuscript to improve reproducibility and contextualization of the results.
read point-by-point responses
Referee: Evaluation section: The central empirical result (8/9 versus 0/9) is reported without any description of the three coding tasks, the concrete criteria or implementation of the fail-fast validation pipeline, the precise configuration of the TD(λ) traces, or the implementation details of the self-refinement and GRACE baselines. These omissions make the numerical outcome impossible to contextualize, replicate, or assess for robustness.
Authors: We agree that the Evaluation section requires substantially more detail to support replication and assessment of the reported outcome. In the revised manuscript we will expand this section with: (1) explicit descriptions of the three reinforcement-learning coding tasks, including their objectives, input specifications, and the precise strict-validation criteria used for success; (2) the concrete implementation of the fail-fast validation pipeline, including the ordered sequence of checks, error categories, and acceptance thresholds; (3) the exact configuration of the TD(λ) eligibility traces, specifying the λ value, trace-decay schedule, and credit-propagation rules; and (4) implementation details for the self-refinement baseline and the GRACE extension, including prompt templates, iteration budgets, and any modifications made to the original GRACE procedure. These additions will be accompanied by pseudocode or a table summarizing hyperparameters so that the 8/9 versus 0/9 comparison can be fully contextualized and reproduced.
Revision: yes
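As one illustration of the ordered fail-fast structure the rebuttal promises to document (the actual gates, error categories, and acceptance thresholds are unspecified here, so the stages below are assumptions), a Python candidate file might pass through checks like:

```python
import ast
import subprocess
import sys
import tempfile

def fail_fast_validate(source, tests=None):
    """Run ordered checks, stopping at the first failure.

    Returns (stage_reached, accepted). The stage order here is an assumed
    example (parse -> compile -> execute -> tests); the paper's real gates
    may differ.
    """
    try:
        ast.parse(source)                        # stage 0: syntax
    except SyntaxError:
        return 0, False
    try:
        compile(source, "<candidate>", "exec")   # stage 1: bytecode
    except Exception:
        return 1, False
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(source)
        path = f.name
    run = subprocess.run([sys.executable, path], capture_output=True, timeout=30)
    if run.returncode != 0:                      # stage 2: clean execution
        return 2, False
    if tests and subprocess.run([sys.executable, tests], capture_output=True,
                                timeout=60).returncode != 0:
        return 3, False                          # stage 3: strict test gate
    return 4, True
```

The stage index doubles as the input to a shaped-reward function, which is how a fail-fast outcome can become a bounded scalar signal.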
Circularity Check
No significant circularity identified
full rationale
The paper contains no derivation chain, mathematical model, or predictive formalism that could reduce to its inputs. Its central claim is a bounded empirical report of measured validation counts (8/9 successes versus 0/9 for baselines) obtained from a fixed experimental protocol with three tasks, three repetitions, and a three-attempt budget. All described components—memory records, AST-derived skills, fail-fast validation, shaped rewards, and TD(λ) traces—are implementation choices whose correctness is assessed by direct execution outcomes rather than by any self-referential definition, fitted-parameter prediction, or load-bearing self-citation. The abstract explicitly disclaims generality, so no unstated premise is required for the reported counts to hold.
Axiom & Free-Parameter Ledger
axioms (2)
- Domain assumption: validation outcomes from a fail-fast pipeline can be converted into bounded shaped rewards suitable for TD(lambda) credit assignment.
- Domain assumption: AST-derived skills extracted from prior code can be usefully selected and inserted into future prompts.
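The second assumption can be sketched with Python's standard `ast` module. The structural fingerprint below is an illustrative choice in the spirit of syntax-tree fingerprinting, not the paper's documented extraction method:

```python
import ast

def extract_skills(source):
    """Pull each top-level function out of a validated file as a reusable
    'skill': a (structural fingerprint, source text) pair. The fingerprint
    ignores identifiers and literals, so structurally similar helpers
    collapse to the same key."""
    tree = ast.parse(source)
    skills = {}
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            fingerprint = " ".join(type(n).__name__ for n in ast.walk(node))
            skills[fingerprint] = ast.get_source_segment(source, node)
    return skills
```

Skill texts selected this way can be spliced back into a later prompt, which is the reuse path the assumption requires.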
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel (echoes)
ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
Matched passage: the terminal acceptance cost is C_acc,t = 1 - A_t and the zero-cost terminal set is X_0(q,W) = {x̂ : C_acc(x̂; q, W) = 0}.
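Rendered cleanly, with t the attempt index and A_t the binary acceptance indicator, the quoted passage reads:

```latex
C^{\mathrm{acc}}_t = 1 - A_t,
\qquad
X_0(q, W) = \{\, \hat{x} \;:\; C_{\mathrm{acc}}(\hat{x};\, q, W) = 0 \,\}
```

On this reading, the zero-cost terminal set is exactly the set of candidate files the validator accepts (A_t = 1), which is why the connection is tagged as a structural echo rather than a formal dependency.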
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1]
- [2]
- [3] Shanto, M. H., Asaduzzaman, M., & Ngom, A. (2026). RAG-Reflect: Agentic Retrieval-Augmented Generation with Reflections for Comment-Driven Code Maintenance on Stack Overflow. arXiv. https://arxiv.org/abs/2604.22217
- [4] Jiang, J., Shen, J., Kim, S., Yoo, K. M., Kim, J., & Kim, S. (2026). ReflexiCoder: Teaching Large Language Models to Self-Reflect on Generated Code and Self-Correct It via Reinforcement Learning. arXiv. https://arxiv.org/abs/2603.05863
- [5] Wu, L., Pei, Y., Yang, Z., Li, K., Lu, Z., Tan, H., ... & Hao, D. (2026). DebugRepair: Enhancing LLM-Based Automated Program Repair via Self-Directed Debugging. arXiv. https://arxiv.org/abs/2604.19305
- [6] Sapkota, R., Roumeliotis, K. I., & Karkee, M. (2025). Vibe coding vs. agentic coding: Fundamentals and practical implications of agentic AI. arXiv. https://arxiv.org/abs/2505.19443
- [7] Xia, C. S., Wei, Y., & Zhang, L. (2023, May). Automated program repair in the era of large pre-trained language models. In 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE) (pp. 1482-1494). IEEE. https://doi.org/10.1109/ICSE48619.2023.00129
- [8]
- [9] Fan, Z., Gao, X., Mirchev, M., Roychoudhury, A., & Tan, S. H. (2023, May). Automated repair of programs from large language models. In 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE) (pp. 1469-1481). IEEE. https://doi.org/10.1109/ICSE48619.2023.00128
- [10] Narajala, V. S., & Narayan, O. (2025). Securing agentic AI: A comprehensive threat model and mitigation framework for generative AI agents. arXiv. https://arxiv.org/abs/2504.19956
- [11] Huang, Y., Gupta, S., Zhong, Z., Li, K., & Chen, D. (2023, December). Privacy implications of retrieval-based language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (pp. 14887-14902). https://doi.org/10.18653/v1/2023.emnlp-main.921
- [12] Kuang, S., Tian, Z., Lin, K., Tao, C., Wang, S., Bai, H., ... & Chen, J. (2026). REAgent: Requirement-Driven LLM Agents for Software Issue Resolution. arXiv. https://arxiv.org/abs/2604.06861
- [13] Wang, C., Zhou, Z., Wang, C., Sun, Y., Yang, S., Yuan, Y., ... & Han, Z. (2025, December). End-to-End Secure Code Repair with Context-Aware Anonymization and Isolated Agent Execution. In 2025 IEEE International Conference on Blockchain Technology and Information Security (ICBCTIS) (pp. 1-8). IEEE. https://doi.org/10.1109/ICBCTIS66509.2025.11387695
- [14] Madaan, A., Tandon, N., Gupta, P., Hallinan, S., Gao, L., Wiegreffe, S., ... & Clark, P. (2023). Self-Refine: Iterative refinement with self-feedback. Advances in Neural Information Processing Systems, 36, 46534-46594. https://arxiv.org/abs/2303.17651
- [15]
- [16] Chen, Y., Sun, Y., Wang, H., Zhang, X., Shen, X., Li, W., & Zhang, W. (2026). Contextual Counterfactual Credit Assignment for Multi-Agent Reinforcement Learning in LLM Collaboration. arXiv. https://arxiv.org/abs/2603.06859
- [17] Yang, X., Li, W., Sheng, J., Shen, C., Hua, Y., & Wang, X. (2025). Agentic Episodic Control. arXiv. https://arxiv.org/abs/2506.01442
- [18] Zhang, H., Long, Q., Bao, J., Feng, T., Zhang, W., Yue, H., & Wang, W. (2026). MemSkill: Learning and Evolving Memory Skills for Self-Evolving Agents. arXiv. https://arxiv.org/abs/2602.02474
- [19] Chen, S., Gai, J., Zhou, R., Zhang, J., Zhu, T., Li, J., ... & Teh, Y. W. (2026). SkillCraft: Can LLM Agents Learn to Use Tools Skillfully?. arXiv. https://arxiv.org/abs/2603.00718
- [20]
- [21] Gallo, R. J., Baiocchi, M., Savage, T. R., & Chen, J. H. (2025). Establishing best practices in large language model research: an application to repeat prompting. Journal of the American Medical Informatics Association, 32(2), 386-390. https://doi.org/10.1093/jamia/ocae294
- [22] Ning, K., Chen, J., Zhang, J., Li, W., Wang, Z., Feng, Y., ... & Zheng, Z. (2026). Defining and Detecting the Defects of Large Language Model-Based Autonomous Agents. IEEE Transactions on Software Engineering. https://doi.org/10.1109/TSE.2026.3658554
- [23] Chen, Z., Ma, W., & Jiang, L. (2025). Beyond Final Code: A Process-Oriented Error Analysis of Software Development Agents in Real-World GitHub Scenarios. arXiv. https://arxiv.org/abs/2503.12374
- [24]
- [25] Zhu, Y., Jin, T., Pruksachatkun, Y., Zhang, A., Liu, S., Cui, S., Kapoor, S., Longpre, S., Meng, K., Weiss, R., Barez, F., Gupta, R., Dhamala, J., Merizian, J., Giulianelli, M., Coppock, H., Ududec, C., Sekhon, J., Steinhardt, J., Kellermann, A., Schwettmann, S., Zaharia, M., Stoica, I., Liang, P., & Kang, D. (2025). Establishing best practices for building r...
- [26] Li, L., Chu, W., Langford, J., & Schapire, R. E. (2010). A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th International Conference on World Wide Web (pp. 661-670). ACM. https://doi.org/10.1145/1772690.1772758
- [27] Sutton, R. S. (1988). Learning to predict by the methods of temporal differences. Machine Learning, 3, 9-44. https://doi.org/10.1007/BF00115009
- [28] Wilson, E. B. (1927). Probable inference, the law of succession, and statistical inference. Journal of the American Statistical Association, 22(158), 209-212. https://doi.org/10.1080/01621459.1927.10502953
- [29]
- [30] Arimbur, J. J. (2026). How Many Tries Does It Take? Iterative Self-Repair in LLM Code Generation Across Model Scales and Benchmarks. arXiv. https://arxiv.org/abs/2604.10508
- [31] Dai, D., Liu, M., Li, A., Cao, J., Wang, Y., Wang, C., ... & Zheng, Z. (2025). FeedbackEval: A benchmark for evaluating large language models in feedback-driven code repair tasks. arXiv. https://arxiv.org/abs/2504.06939
- [32] Sunil, B. D., Sinha, I., Maheshwari, P., Todmal, S., Mallik, S., & Mishra, S. (2026). Memory poisoning attack and defense on memory based LLM-agents. arXiv. https://arxiv.org/abs/2601.05504
- [33] Srivastava, S. S., & He, H. (2025). MemoryGraft: Persistent compromise of LLM agents via poisoned experience retrieval. arXiv. https://arxiv.org/abs/2512.16962
- [34] Liu, F., Zhang, Y., Luo, J., Dai, J., Chen, T., Yuan, L., Yu, Z., Shi, Y., Li, K., Zhou, C., Chen, H., & Yang, M. (2025). Make agent defeat agent: Automatic detection of taint-style vulnerabilities in LLM-based agents. In 34th USENIX Security Symposium (USENIX Security 25). https://www.usenix.org/conference/usenixsecurity25/presentation/liu-fengyu
- [35] Wang, W., Wang, Y., Joty, S., & Hoi, S. C. (2023, November). RAP-Gen: Retrieval-augmented patch generation with CodeT5 for automatic program repair. In Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (pp. 146-158). https://doi.org/10.1145/3611643.3616256
- [36] Lee, H., & Yang, G. (2025, November). AgentRepair: Multi-Agent, AST-Anchored, Retrieval-Augmented Program Repair for Cold-Start Environments. In 2025 12th International Conference on Dependable Systems and Their Applications (DSA) (pp. 121-132). IEEE. https://doi.org/10.1109/DSA66321.2025.00025
- [37] Chondamrongkul, N., Kyaw, M. P. P., Ko, S. M., Paing, P. P., Swe, M. K. T., & Hongthong, T. (2026). RepoAI: Automated code refactoring through multi-agent LLM orchestration and retrieval-augmented generation. Science of Computer Programming, 253, Article 103477. https://doi.org/10.1016/j.scico.2026.103477
- [38]
- [39] Rombaut, B. (2026). Inside the Scaffold: A Source-Code Taxonomy of Coding Agent Architectures. arXiv. https://arxiv.org/abs/2604.03515
- [40] Kim, M. H. (2025). Bridging Symbolic Control and Neural Reasoning in LLM Agents: The Structured Cognitive Loop. arXiv. https://arxiv.org/abs/2511.17673
- [41] Poon, M., Dai, X., Liu, X., Kong, F., Lui, J. C., & Zuo, J. (2026, March). Online multi-LLM selection via contextual bandits under unstructured context evolution. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 40, No. 29, pp. 24855-24863). https://doi.org/10.1609/aaai.v40i29.39672
- [42]
- [43] Rietz, F., Smirnov, O., Karimi, S., & Cao, L. (2026). Prompt Tuning Decision Transformers with Structured and Scalable Bandits. Advances in Neural Information Processing Systems, 38, 58258-58286. https://www.microsoft.com/en-us/research/publication/prompt-tuning-decision-transformers-with-structured-and-scalable-bandits/
- [44] Sloan, M., & Wang, J. (2015, September). Dynamic information retrieval: Theoretical framework and application. In Proceedings of the 2015 International Conference on the Theory of Information Retrieval (pp. 61-70). https://doi.org/10.1145/2808194.2809457
- [45] Yang, A., & Yang, G. H. (2017, October). A contextual bandit approach to dynamic search. In Proceedings of the ACM SIGIR International Conference on Theory of Information Retrieval (pp. 301-304). https://doi.org/10.1145/3121050.3121101
- [46] Zhang, W., Zhu, Y., Lu, Y., Demarne, M., Wang, W., Deng, K., ... & Krishnan, S. (2025, November). FLAIR: Feedback learning for adaptive information retrieval. In Proceedings of the 34th ACM International Conference on Information and Knowledge Management (pp. 6284-6292). https://doi.org/10.1145/3746252.3761553
- [47] Le, H., Wang, Y., Gotmare, A. D., Savarese, S., & Hoi, S. C. H. (2022). CodeRL: Mastering code generation through pretrained models and deep reinforcement learning. Advances in Neural Information Processing Systems, 35, 21314-21328. https://arxiv.org/abs/2207.01780
- [48] Yu, Z., Gu, W., Wang, Y., Jiang, X., Zeng, Z., Wang, J., ... & Zhang, S. (2024). Outcome-Refining Process Supervision for Code Generation. arXiv. https://arxiv.org/abs/2412.15118
- [49] Han, B., Ren, Z., Wu, Z., Zhou, Y., & Peng, J. (2022). Off-policy reinforcement learning with delayed rewards. In Proceedings of the 39th International Conference on Machine Learning (PMLR, Vol. 162, pp. 8280-8303). https://proceedings.mlr.press/v162/han22e.html
- [50] Li, B., Sun, Z., Huang, T., Zhang, H., Wan, Y., Li, G., ... & Lyu, C. (2024). IRCoCo: Immediate rewards-guided deep reinforcement learning for code completion. Proceedings of the ACM on Software Engineering, 1(FSE), 182-203. https://doi.org/10.1145/3643735
- [51] Zhang, D., Chen, L., Zhang, S., Xu, H., Zhao, Z., & Yu, K. (2023). Large language models are semi-parametric reinforcement learning agents. Advances in Neural Information Processing Systems, 36, 78227-78239. https://arxiv.org/abs/2306.07929
- [52] Krishnamoorthy, A., Ivatury, K., & Ahmadnia, B. (2025, September). Multi-Agent Reinforcement Learning for Interactive Code Debugging with Human Feedback and Memory. In Proceedings of the 15th International Conference on Recent Advances in Natural Language Processing - Natural Language Processing in the Generative AI Era (pp. 595-603). https://doi.org/10.26615/978-954...
- [53] Chilowicz, M., Duris, E., & Roussel, G. (2009, May). Syntax tree fingerprinting for source code similarity detection. In 2009 IEEE 17th International Conference on Program Comprehension (pp. 243-247). IEEE. https://doi.org/10.1109/ICPC.2009.5090050
- [54] Li, Y., Wang, S., & Nguyen, T. (2021, May). Fault localization with code coverage representation learning. In 2021 IEEE/ACM 43rd International Conference on Software Engineering (ICSE) (pp. 661-673). IEEE. https://doi.org/10.1109/ICSE43902.2021.00067
- [55] Verma, A., Udhayanan, P., Shankar, R. M., Kn, N., & Chakrabarti, S. K. (2021, October). Source-code similarity measurement: syntax tree fingerprinting for automated evaluation. In Proceedings of the First International Conference on AI-ML Systems (pp. 1-7). https://doi.org/10.1145/3486001.3486228
- [56] Ellis, K., Morales, L., Sablé-Meyer, M., Solar-Lezama, A., & Tenenbaum, J. (2018). Learning libraries of subroutines for neurally guided Bayesian program induction. In Advances in Neural Information Processing Systems 31 (pp. 7816-7826). https://papers.nips.cc/paper/8006-learning-libraries-of-subroutines-for-neurallyguided-bayesian-program-induction
- [57] Xu, P., Wu, G., Chen, X., Yu, T., Xiao, C., Dernoncourt, F., ... & Swaminathan, V. (2026, March). Skill discovery for software scripting automation via offline simulations with LLMs. In Findings of the Association for Computational Linguistics: EACL 2026 (pp. 743-759). https://doi.org/10.18653/v1/2026.findings-eacl.37
- [58]
- [59]
- [60]
- [61] Jiang, H., Chen, Y., Cao, Y., Lee, H. Y., & Tan, R. T. (2025). CodeJudgeBench: Benchmarking LLM-as-a-judge for coding tasks. arXiv. https://arxiv.org/abs/2507.10535
- [62]
- [63] Jain, S., Ahmed, U. Z., Sahai, S., & Leong, B. (2025). Beyond consensus: Mitigating the agreeableness bias in LLM judge evaluations. arXiv. https://arxiv.org/abs/2510.11822
- [64] Spiess, C., Gros, D., Pai, K. S., Pradel, M., Rabin, M. R. I., Alipour, A., ... & Ahmed, T. (2025, April). Calibration and correctness of language models for code. In 2025 IEEE/ACM 47th International Conference on Software Engineering (ICSE) (pp. 540-552). IEEE. https://doi.org/10.1109/ICSE55347.2025.00040
- [65]
discussion (0)