pith. machine review for the scientific record.

arxiv: 2604.24222 · v1 · submitted 2026-04-27 · 💻 cs.SE · cs.AI · cs.CL

Recognition: unknown

MEMCoder: Multi-dimensional Evolving Memory for Private-Library-Oriented Code Generation

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 03:35 UTC · model grok-4.3

classification 💻 cs.SE · cs.AI · cs.CL
keywords private library code generation · multi-dimensional evolving memory · RAG enhancement · LLM adaptation · execution feedback · API usage guidelines · enterprise code · continual learning

The pith

MEMCoder lets LLMs evolve multi-dimensional memory to close gaps in private-library code generation that static docs leave open.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that standard retrieval-augmented generation with API documentation fails for enterprise private libraries because it supplies only isolated definitions. This leaves two gaps: missing patterns for how APIs coordinate at the task level and incomplete understanding of parameter constraints and boundaries at the API level. MEMCoder addresses both by letting the model autonomously build and refine a multi-dimensional memory of usage guidelines distilled from its own past solution attempts. During generation it retrieves both the original documentation and relevant stored guidelines, then closes the loop by using actual execution results to reflect on outcomes, resolve conflicts, and update the memory. The approach produces large gains on targeted benchmarks and adapts to specific domains more effectively than prior memory-based methods.
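To make that loop concrete, here is a minimal Python sketch of the generate-execute-reflect-update cycle described above. Every interface in it (`llm.generate`, `run_tests`, `memory.retrieve`, the retry budget) is an illustrative stand-in, not MEMCoder's actual API.

```python
# Illustrative sketch of the closed loop described above. All interfaces
# here are hypothetical stand-ins, not MEMCoder's implementation.

def solve_task(task, api_docs, memory, llm, run_tests, max_rounds=3):
    """Generate code for a private-library task, learning from execution."""
    code = ""
    for _ in range(max_rounds):
        # Dual-source retrieval: static documentation plus evolved guidelines.
        docs = api_docs.retrieve(task)
        guidelines = memory.retrieve(task)  # task-level and API-level
        code = llm.generate(task, docs, guidelines)

        # Objective execution feedback from the task's test suite.
        result = run_tests(code, task)
        if result.passed:
            memory.reinforce(guidelines)    # lessons that helped stay
            return code

        # Reflection: distill new guidelines from the failure trace and
        # fold them into memory, resolving conflicts with stored entries.
        lessons = llm.reflect(task, code, result.trace)
        memory.update(lessons)
    return code  # best effort once the retry budget is spent
```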

Core claim

By maintaining a Multi-dimensional Evolving Memory that stores distilled usage guidelines across task-level coordination patterns and API-level constraints, retrieved together with static documentation, and updated automatically from objective execution feedback, MEMCoder enables LLMs to accumulate and apply private-library knowledge without retraining.

What carries the argument

Multi-dimensional Evolving Memory, which captures and refines lessons from problem-solving trajectories across coordination and constraint dimensions and supports dual-source retrieval plus feedback-driven updates.
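A self-contained sketch of what such a memory could look like: guideline entries tagged with one of the two dimensions, plus a dual-source context builder. The keyword-overlap retrieval and the scoring field are our assumptions for illustration; the paper's actual retriever is likely embedding-based.

```python
# Toy two-dimensional guideline store with dual-source retrieval.
# Field names, scoring, and the overlap retriever are assumptions.
from dataclasses import dataclass

@dataclass
class Guideline:
    dimension: str      # "task" (coordination pattern) or "api" (constraint)
    topic: str          # e.g., an API name or a task pattern
    text: str           # the distilled lesson itself
    score: float = 1.0  # reinforced on success, decayed on failure

class EvolvingMemory:
    def __init__(self) -> None:
        self.entries: list[Guideline] = []

    def add(self, g: Guideline) -> None:
        self.entries.append(g)

    def retrieve(self, query: str, k: int = 3) -> list[Guideline]:
        """Rank guidelines by naive keyword overlap with the query."""
        words = set(query.lower().split())
        def relevance(g: Guideline) -> float:
            return g.score * len(words & set(g.text.lower().split()))
        return sorted(self.entries, key=relevance, reverse=True)[:k]

def build_context(query: str, api_docs: dict[str, str],
                  memory: EvolvingMemory) -> str:
    """Dual-source retrieval: static docs plus evolved guidelines."""
    docs = [doc for name, doc in api_docs.items() if name in query]
    guides = [f"[{g.dimension}] {g.text}" for g in memory.retrieve(query)]
    return "\n".join(["# API documentation:", *docs,
                      "# Usage guidelines:", *guides])
```

In this toy version the two dimensions differ only by a tag; the point is that both kinds of entry ride into the prompt alongside the static docs.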

If this is right

  • Existing RAG pipelines gain an average 16.31% absolute pass@1 improvement on the NdonnxEval and NumbaEval benchmarks (the standard pass@1 estimator is sketched in code after this list).
  • Domain-specific adaptation to private libraries surpasses results from prior memory-based continual learning techniques.
  • The model can autonomously resolve knowledge conflicts that arise when new API interactions contradict earlier guidelines.
  • Continuous memory evolution removes the need for manual curation of usage examples for each private library.
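For orientation, pass@1 here is the k=1 case of the standard unbiased pass@k estimator used throughout code-generation benchmarking (Chen et al., 2021); the sample counts below are illustrative, not the paper's.

```python
# Standard unbiased pass@k estimator: the probability that at least one
# of k samples drawn from n generations passes, given c of them passed.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:
        return 1.0  # every size-k draw must contain a passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# With k=1 this reduces to c/n, so a 16.31-point absolute pass@1 gain
# means roughly 16 more tasks solved per hundred on the first attempt.
print(pass_at_k(n=10, c=3, k=1))  # 0.3
```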

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same feedback-driven memory structure could be applied to other private-domain tasks such as internal data pipelines or proprietary tool orchestration.
  • Performance on libraries introduced after initial memory construction would test whether the method generalizes beyond the two reported benchmarks.
  • Combining the evolving memory with lightweight parameter updates might further reduce reliance on full model retraining for enterprise codebases.
  • The dual retrieval of static docs and evolved guidelines offers a template for other retrieval systems that must handle both fixed knowledge and experience-derived rules.

Load-bearing premise

Objective execution feedback can be trusted to distill accurate lessons, resolve conflicts, and update memory without introducing new errors or overfitting to the benchmarks.
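A toy continuation of the earlier memory sketch shows where that trust enters. The gating rule below is our invention, meant only to make the premise's exposure visible: the execution signal is the sole authority over what gets written.

```python
# Feedback-gated update, in toy form (the rule itself is our invention).
# `Guideline` is the illustrative dataclass from the earlier sketch.

def evolve(entries: list, new_lesson, conflicting: list,
           test_passed: bool) -> None:
    """Update memory in place; everything hinges on `test_passed`."""
    if test_passed:
        # Success: the new lesson wins its conflicts outright.
        for old in conflicting:
            entries.remove(old)
        entries.append(new_lesson)
    else:
        # Failure: keep the lesson, but at reduced confidence.
        new_lesson.score *= 0.5
        entries.append(new_lesson)
    # The failure modes the premise glosses over: a flaky test's false
    # negative demotes a correct lesson, and a weak suite's false
    # positive promotes a wrong one. Nothing inside the loop can tell
    # the difference; only external signal quality can.
```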

What would settle it

Running the system on a fresh set of private libraries and observing whether pass@1 gains disappear or memory updates degrade performance when execution signals are noisy or ambiguous.

Figures

Figures reproduced from arXiv: 2604.24222 by Guowei Yang, Jia Li, Mofei Li, Taozhi Chen.

Figure 1: A representative failure case study on NdonnxEval (qualitative analysis).
Figure 2: A reflection case on NdonnxEval with Qwen2.5-Coder-7B-Instruct. From failed code and execution feedback, the model derives task-level and API-level Usage Guidelines.
Figure 3: Overview of the MEMCoder framework. Middle: the Multi-dimensional Evolving Memory (IV-B) stores refined task-level and API-level memories. Left: the Guideline-Driven Code Generation pipeline (IV-C) retrieves these memories along with API docs to guide code generation. Right: the Feedback-Driven Memory Evolution module (IV-D) updates and optimizes the memory based on real-time execution feedback.
Original abstract

Large Language Models (LLMs) excel at general code generation, but their performance drops sharply in enterprise settings that rely on internal private libraries absent from public pre-training corpora. While Retrieval-Augmented Generation (RAG) offers a training-free alternative by providing static API documentation, we find that such documentation typically provides only isolated definitions, leaving a fundamental knowledge gap. Specifically, LLMs struggle with a task-level lack of coordination patterns between APIs and an API-level misunderstanding of parameter constraints and boundary conditions. To address this, we propose MEMCoder, a novel framework that enables LLMs to autonomously accumulate and evolve Usage Guidelines across these two dimensions. MEMCoder introduces a Multi-dimensional Evolving Memory that captures distilled lessons from the model's own problem-solving trajectories. During inference, MEMCoder employs a dual-source retrieval mechanism to inject both static documentation and relevant historical guidelines into the context. The framework operates in an automated closed loop by using objective execution feedback to reflect on successes and failures, resolve knowledge conflicts, and dynamically update memory. Extensive evaluations on the NdonnxEval and NumbaEval benchmarks demonstrate that MEMCoder substantially enhances existing RAG systems, yielding an average absolute pass@1 gain of 16.31%. Furthermore, MEMCoder exhibits vastly superior domain-specific adaptation compared to existing memory-based continual learning methods.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes MEMCoder, a framework for private-library code generation that maintains a Multi-dimensional Evolving Memory capturing distilled usage guidelines on task-level API coordination and API-level parameter constraints/boundaries. It augments RAG via dual-source retrieval of static docs plus evolved guidelines, closes the loop with objective execution feedback for reflection and memory updates, and reports an average 16.31% absolute pass@1 gain on NdonnxEval and NumbaEval plus superior domain adaptation versus memory-based continual learning baselines.

Significance. If the empirical gains prove robust under controlled conditions and the memory updates generalize without overfitting to benchmark test coverage, the approach would provide a practical, training-free mechanism for adapting LLMs to proprietary libraries, directly addressing documented gaps in static documentation for both coordination patterns and constraint knowledge.

major comments (2)
  1. [Evaluation] Evaluation section: the central claim of a 16.31% absolute pass@1 gain and superior adaptation rests on benchmark results, yet the provided description supplies no information on baseline re-implementations, number of runs, statistical significance testing, prompt controls, or test-suite coverage statistics; without these the reported improvement cannot be assessed as load-bearing evidence rather than an artifact of incomplete signals.
  2. [Framework description] Framework and reflection mechanism (described in the abstract and §3): the assertion that execution feedback reliably enables distillation of API-level parameter constraints and boundary conditions assumes the NdonnxEval/NumbaEval test suites exercise sufficient edge cases; if coverage is incomplete, the closed-loop updates can reinforce partial or incorrect guidelines, directly undermining the adaptation superiority claim.
minor comments (2)
  1. [Abstract] Abstract: the phrase 'vastly superior' is qualitative; replace with quantitative deltas or tables comparing adaptation metrics against the continual-learning baselines.
  2. [Introduction] Notation: the term 'Multi-dimensional Evolving Memory' is introduced without an explicit formal definition or update rule in the summary; a compact mathematical or algorithmic sketch would improve clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our work. The comments highlight important aspects of evaluation rigor and the reliability of our closed-loop reflection mechanism. We address each point below, indicating revisions to strengthen the manuscript where the concerns are valid.

Point-by-point responses
  1. Referee: Evaluation section: the central claim of a 16.31% absolute pass@1 gain and superior adaptation rests on benchmark results, yet the provided description supplies no information on baseline re-implementations, number of runs, statistical significance testing, prompt controls, or test-suite coverage statistics; without these the reported improvement cannot be assessed as load-bearing evidence rather than an artifact of incomplete signals.

    Authors: We agree that additional experimental details are necessary for readers to fully evaluate the robustness of the reported gains. In the revised manuscript, we will expand the Evaluation section (§4) to explicitly describe: (1) how each baseline was re-implemented (including any adaptations for fair comparison), (2) the number of independent runs performed along with mean and standard deviation, (3) statistical significance testing (e.g., paired t-tests or Wilcoxon tests with p-values), (4) prompt engineering controls to isolate the contribution of the memory component, and (5) available test-suite coverage statistics for NdonnxEval and NumbaEval. These additions will make the 16.31% pass@1 improvement and domain-adaptation claims more verifiable; a minimal sketch of such a paired test appears after these responses. revision: yes

  2. Referee: Framework and reflection mechanism (described in the abstract and §3): the assertion that execution feedback reliably enables distillation of API-level parameter constraints and boundary conditions assumes the NdonnxEval/NumbaEval test suites exercise sufficient edge cases; if coverage is incomplete, the closed-loop updates can reinforce partial or incorrect guidelines, directly undermining the adaptation superiority claim.

    Authors: This concern is well-taken and points to a potential limitation of any execution-feedback-driven approach. While the benchmarks were designed to include diverse usage patterns and edge cases for the target private libraries, we acknowledge that without exhaustive coverage metrics it is possible for the memory to distill incomplete or context-specific guidelines. In the revision, we will add an explicit Limitations subsection discussing this risk, including how the dual-source retrieval (static docs + evolved guidelines) and conflict-resolution step in the reflection mechanism are intended to mitigate erroneous updates. We will also report any available coverage statistics and clarify that the observed gains are empirical rather than a guarantee of perfect constraint learning. We believe the closed-loop design still provides a practical advantage over static RAG, but we will not overstate its robustness. revision: partial
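To ground the evaluation discussion above, here is a minimal sketch of the paired tests the simulated authors name, with placeholder numbers rather than results from the paper.

```python
# Paired significance tests over per-run pass@1 scores for a RAG
# baseline versus MEMCoder. The arrays below are placeholders, not
# figures from the paper.
import numpy as np
from scipy.stats import ttest_rel, wilcoxon

baseline = np.array([0.42, 0.45, 0.40, 0.44, 0.43])  # per-run pass@1
memcoder = np.array([0.58, 0.61, 0.57, 0.60, 0.59])  # placeholder values

t_stat, t_p = ttest_rel(memcoder, baseline)
w_stat, w_p = wilcoxon(memcoder, baseline)
print(f"paired t-test p={t_p:.4f}, Wilcoxon p={w_p:.4f}")
print(f"mean gain = {(memcoder - baseline).mean():.3f} "
      f"± {(memcoder - baseline).std(ddof=1):.3f}")
```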

Circularity Check

0 steps flagged

No circularity: empirical gains reported from benchmark evaluations, not derived from self-referential definitions or fits

full rationale

The paper describes a framework (MEMCoder) that accumulates usage guidelines via execution feedback and dual-source retrieval, then reports pass@1 improvements on NdonnxEval and NumbaEval as direct experimental outcomes. No equations, parameters fitted to subsets, or self-citations are invoked to derive the central performance claims; the 16.31% gain is presented as a measured result rather than a prediction forced by construction. The derivation chain consists of procedural steps (reflect, resolve conflicts, update memory) whose validity is tested externally rather than assumed tautologically. This is the expected non-finding for an applied systems paper whose load-bearing evidence is benchmark evaluation.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The framework rests on assumptions about LLM reflection capabilities and feedback reliability rather than new mathematical derivations; no free parameters or invented physical entities are specified.

axioms (2)
  • domain assumption · LLMs can autonomously distill accurate usage guidelines from their own problem-solving trajectories using objective execution feedback
    Invoked to enable the closed-loop memory update and conflict resolution described in the abstract.
  • domain assumption · Execution feedback provides reliable signals for distinguishing successes, failures, and knowledge conflicts in code generation tasks
    Central to the automated update mechanism that evolves the memory.
invented entities (1)
  • Multi-dimensional Evolving Memory · no independent evidence
    purpose: Captures distilled lessons on task-level API coordination patterns and API-level parameter constraints from model trajectories
    Core new component introduced to address the knowledge gap left by static documentation in RAG.

pith-pipeline@v0.9.0 · 5539 in / 1342 out tokens · 61585 ms · 2026-05-08T03:35:44.558784+00:00 · methodology


Reference graph

Works this paper leans on

39 extracted references · 16 canonical work pages · 8 internal anchors

  1. [1]

    Structured chain-of-thought prompting for code generation,

    J. Li, G. Li, Y. Li, and Z. Jin, “Structured chain-of-thought prompting for code generation,” ACM Transactions on Software Engineering and Methodology, vol. 34, no. 2, pp. 1–23, 2025

  2. [2]

    aixcoder-7b: A lightweight and effective large language model for code processing,

    S. Jiang, J. Li, H. Zong, H. Liu, H. Zhu, S. Hu, E. Li, J. Ding, Y. Han, W. Ning, et al., “aixcoder-7b: A lightweight and effective large language model for code processing,” in 2025 IEEE/ACM 47th International Conference on Software Engineering: Software Engineering in Practice (ICSE-SEIP), pp. 215–226, IEEE, 2025

  3. [3]

    Beyond autoregression: An empirical study of diffusion large language models for code generation,

    C. Li, Y. Zhang, J. Li, L. Cai, and G. Li, “Beyond autoregression: An empirical study of diffusion large language models for code generation,” arXiv preprint arXiv:2509.11252, 2025

  4. [4]

    AI-driven self-evolving software: A promising path toward software automation,

    L. Cai, Y. Ren, Y. Zhang, and J. Li, “AI-driven self-evolving software: A promising path toward software automation,” arXiv preprint arXiv:2510.00591, 2025

  5. [5]

    Exploracoder: Advancing code generation for multiple unseen apis via planning and chained exploration,

    Y. Wang, Y. Zhang, Z. Qin, C. Zhi, B. Li, F. Huang, Y. Li, and S. Deng, “Exploracoder: Advancing code generation for multiple unseen apis via planning and chained exploration,” in Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 18124–18145, 2025

  6. [6]

    When language model meets private library,

    D. Zan, B. Chen, Z. Lin, B. Guan, W. Yongji, and J.-G. Lou, “When language model meets private library,” in Findings of the Association for Computational Linguistics: EMNLP 2022, pp. 277–288, 2022

  7. [7]

    Docprompting: Generating code by retrieving the docs,

    S. Zhou, U. Alon, F. F. Xu, Z. Jiang, and G. Neubig, “Docprompting: Generating code by retrieving the docs,” in The Eleventh International Conference on Learning Representations, 2022

  8. [8]

    Epigen: An efficient multi-api code generation framework under enterprise scenario,

    S. Li, S. Li, H. Zhang, S. Li, K. Chen, J. Yuan, Y. Cao, and L. Yang, “Epigen: An efficient multi-api code generation framework under enterprise scenario,” in Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pp. 6206–6215, 2024

  9. [9]

    ndonnx (version 0.17.1)

    QuantCo, “ndonnx (version 0.17.1).” https://pypi.org/project/ndonnx/0.17.1/, 2025

  10. [10]

    numba-cuda (version 0.27.0)

    NVIDIA, “numba-cuda (version 0.27.0).” https://pypi.org/project/numba-cuda/0.27.0/, 2026

  11. [11]

    To see is not to master: Teaching llms to use private libraries for code generation,

    Y. Zhang, C. Li, R. Chen, G. Yang, X. Jia, Y. Ren, and J. Li, “To see is not to master: Teaching llms to use private libraries for code generation,” 2026

  12. [13]

    Evo-memory: Benchmarking llm agent test-time learning with self-evolving memory,

    T. Wei, N. Sachdeva, B. Coleman, Z. He, Y. Bei, X. Ning, M. Ai, Y. Li, J. He, E. H. Chi, C. Wang, S. Chen, F. Pereira, W.-C. Kang, and D. Z. Cheng, “Evo-memory: Benchmarking llm agent test-time learning with self-evolving memory,” 2025

  13. [14]

    Chatdev: Communicative agents for software development,

    C. Qian, W. Liu, H. Liu, N. Chen, Y. Dang, J. Li, C. Yang, W. Chen, Y. Su, X. Cong, et al., “Chatdev: Communicative agents for software development,” in Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 15174–15186, 2024

  14. [15]

    What papers don’t tell you: Recovering tacit knowledge for automated paper reproduction,

    L. Li, R. Wang, H. Song, Y. Mao, T. Zhang, Y. Wang, J. Fan, Y. Zhang, J. Ye, C. Zhang, et al., “What papers don’t tell you: Recovering tacit knowledge for automated paper reproduction,” arXiv preprint arXiv:2603.01801, 2026

  15. [16]

    Lookahead-then-verify: Reliable constrained decoding for diffusion llms under context-free grammars,

    Y. Zhang, Y. Li, Y. Liu, J. Li, X. Jia, Z. Li, and G. Li, “Lookahead-then-verify: Reliable constrained decoding for diffusion llms under context-free grammars,” arXiv preprint arXiv:2602.00612, 2026

  16. [17]

    Acecoder: An effective prompting technique specialized in code generation,

    J. Li, Y. Zhao, Y. Li, G. Li, and Z. Jin, “Acecoder: An effective prompting technique specialized in code generation,” ACM Transactions on Software Engineering and Methodology, vol. 33, no. 8, pp. 1–26, 2024

  17. [18]

    Difftester: Accelerating unit test generation for diffusion llms via repetitive pattern,

    L. Yang, Y. Liu, Y. Zhang, and J. Li, “Difftester: Accelerating unit test generation for diffusion llms via repetitive pattern,” arXiv preprint arXiv:2509.24975, 2025

  18. [19]

    OpenAI GPT-5 System Card

    A. Singh, A. Fry, A. Perelman, A. Tart, A. Ganesh, A. El-Kishky, A. McLaughlin, A. Low, A. Ostrow, A. Ananthram, et al., “OpenAI GPT-5 system card,” arXiv preprint arXiv:2601.03267, 2025

  19. [20]

    GPT-4o System Card

    A. Hurst, A. Lerer, A. P. Goucher, A. Perelman, A. Ramesh, A. Clark, A. Ostrow, A. Welihinda, A. Hayes, A. Radford, et al., “GPT-4o system card,” arXiv preprint arXiv:2410.21276, 2024

  20. [21]

    The Llama 3 Herd of Models

    A. Grattafiori, A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Vaughan, et al., “The Llama 3 herd of models,” arXiv preprint arXiv:2407.21783, 2024

  21. [22]

    Code Llama: Open Foundation Models for Code

    B. Roziere, J. Gehring, F. Gloeckle, S. Sootla, I. Gat, X. E. Tan, Y. Adi, J. Liu, R. Sauvestre, T. Remez, et al., “Code Llama: Open foundation models for code,” arXiv preprint arXiv:2308.12950, 2023

  22. [23]

    Qwen3 Technical Report

    A. Yang, A. Li, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Gao, C. Huang, C. Lv, et al., “Qwen3 technical report,” arXiv preprint arXiv:2505.09388, 2025

  23. [24]

    Qwen2.5-Coder Technical Report

    B. Hui, J. Yang, Z. Cui, J. Yang, D. Liu, L. Zhang, T. Liu, J. Zhang, B. Yu, K. Lu, et al., “Qwen2.5-Coder technical report,” arXiv preprint arXiv:2409.12186, 2024

  24. [25]

    DeepSeek-V3 Technical Report

    A. Liu, B. Feng, B. Xue, B. Wang, B. Wu, C. Lu, C. Zhao, C. Deng, C. Zhang, C. Ruan, et al., “DeepSeek-V3 technical report,” arXiv preprint arXiv:2412.19437, 2024

  25. [26]

    DeepSeek-Coder: When the Large Language Model Meets Programming -- The Rise of Code Intelligence

    D. Guo, Q. Zhu, D. Yang, Z. Xie, K. Dong, W. Zhang, G. Chen, X. Bi, Y. Wu, Y. Li, et al., “DeepSeek-Coder: When the large language model meets programming – the rise of code intelligence,” arXiv preprint arXiv:2401.14196, 2024

  26. [27]

    Codesync: Synchronizing large language models with dynamic code evolution at scale,

    C. Wang, Z. Chu, Z. Cheng, X. Yang, K. Qiu, Y. Wan, Z. Zhao, X. Shi, and D. Chen, “Codesync: Synchronizing large language models with dynamic code evolution at scale,” arXiv preprint arXiv:2502.16645, 2025

  27. [28]

    Unseen-codebases-domain data synthesis and training based on code graphs,

    G. Ou, Q. Zhang, S. Chen, A. Li, D. Xu, T. Luo, D. Dai, C. Gao, L. Wang, J. Zhou, M. Liu, and Z. Zheng, “Unseen-codebases-domain data synthesis and training based on code graphs,” 2026

  28. [29]

    Diffcoder: Enhancing large language model on api invocation via analogical code exercises,

    D. Zan, A. Yu, B. Shen, B. Chen, W. Li, Y. Gong, X. Chen, Y. Yao, W. Luo, B. Guan, et al., “Diffcoder: Enhancing large language model on api invocation via analogical code exercises,” Proceedings of the ACM on Software Engineering, vol. 1, no. FSE, pp. 406–426, 2024

  29. [30]

    On the effectiveness of large language models in domain-specific code generation,

    X. Gu, M. Chen, Y. Lin, Y. Hu, H. Zhang, C. Wan, Z. Wei, Y. Xu, and J. Wang, “On the effectiveness of large language models in domain-specific code generation,” ACM Transactions on Software Engineering and Methodology, vol. 34, no. 3, pp. 1–22, 2025

  30. [31]

    Think: Tackling api hallucinations in llms via injecting knowledge,

    J. Liu, Y. Zhang, D. Wang, Y. Li, and W. Dong, “Think: Tackling api hallucinations in llms via injecting knowledge,” in 2025 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER), pp. 229–240, IEEE, 2025

  31. [32]

    Codegen4libs: A two-stage approach for library-oriented code generation,

    M. Liu, T. Yang, Y. Lou, X. Du, Y. Wang, and X. Peng, “Codegen4libs: A two-stage approach for library-oriented code generation,” in 2023 38th IEEE/ACM International Conference on Automated Software Engineering (ASE), pp. 434–445, IEEE, 2023

  32. [33]

    Private-library-oriented code generation with large language models,

    D. Zan, B. Chen, Y. Gong, J. Cao, F. Zhang, B. Wu, B. Guan, Y. Yin, and Y. Wang, “Private-library-oriented code generation with large language models,” Knowledge-Based Systems, vol. 326, p. 113934, 2025

  33. [34]

    Revisiting catastrophic forgetting in large language model tuning,

    H. Li, L. Ding, M. Fang, and D. Tao, “Revisiting catastrophic forgetting in large language model tuning,” 2024

  34. [35]

    An empirical study of catastrophic forgetting in large language models during continual fine-tuning,

    Y. Luo, Z. Yang, F. Meng, Y. Li, J. Zhou, and Y. Zhang, “An empirical study of catastrophic forgetting in large language models during continual fine-tuning,” 2025

  35. [36]

    Retrieval-augmented generation for knowledge-intensive nlp tasks,

    P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, H. Küttler, M. Lewis, W.-t. Yih, T. Rocktäschel, et al., “Retrieval-augmented generation for knowledge-intensive nlp tasks,” Advances in Neural Information Processing Systems, vol. 33, pp. 9459–9474, 2020

  36. [37]

    Compositional api recommendation for library-oriented code generation,

    Z. Ma, S. An, B. Xie, and Z. Lin, “Compositional api recommendation for library-oriented code generation,” in Proceedings of the 32nd IEEE/ACM International Conference on Program Comprehension, pp. 87–98, 2024

  37. [38]

    Dynamic cheatsheet: Test-time learning with adaptive memory,

    M. Suzgun, M. Yuksekgonul, F. Bianchi, D. Jurafsky, and J. Zou, “Dynamic cheatsheet: Test-time learning with adaptive memory,” 2025

  38. [39]

    CodeT: Code generation with generated tests

    B. Chen, F. Zhang, A. Nguyen, D. Zan, Z. Lin, J.-G. Lou, and W. Chen, “Codet: Code generation with generated tests,” arXiv preprint arXiv:2207.10397, 2022

  39. [40]

    Multi-lingual evaluation of code generation models,

    B. Athiwaratkun, S. K. Gouda, Z. Wang, X. Li, Y. Tian, M. Tan, W. U. Ahmad, S. Wang, Q. Sun, M. Shang, et al., “Multi-lingual evaluation of code generation models,” arXiv preprint arXiv:2210.14868, 2022