arxiv: 2203.13474 · v5 · submitted 2022-03-25 · 💻 cs.LG · cs.CL · cs.PL

CodeGen: An Open Large Language Model for Code with Multi-Turn Program Synthesis

Erik Nijkamp, Bo Pang, Hiroaki Hayashi, Lifu Tu, Huan Wang, Yingbo Zhou, Silvio Savarese, Caiming Xiong

Pith reviewed 2026-05-13 16:59 UTC · model grok-4.3

keywords: program synthesis, large language models, code generation, multi-turn prompting, HumanEval, MTPB benchmark, Python, open models

The pith

On the MTPB benchmark, CodeGen synthesizes substantially better programs when the same problem intent is supplied over multiple turns than when it is supplied in a single turn.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper trains and releases the CodeGen family of large language models, reaching 16.1 billion parameters, on a mixture of natural language and programming language data while also open-sourcing the JAXFORMER training library. The models reach competitive zero-shot performance on the HumanEval benchmark for Python code generation. The authors then introduce the Multi-Turn Programming Benchmark of 115 problems, each broken into a sequence of subproblem prompts, and show that feeding the identical overall intent through these successive turns markedly raises the rate of correct program synthesis compared with presenting the full intent at once.

Core claim

We train a family of large language models up to 16.1B parameters on natural language and programming language data, achieving competitive results on zero-shot Python code generation on HumanEval, and demonstrate through the Multi-Turn Programming Benchmark that supplying the same intent via multiple subproblem prompts leads to substantially better program synthesis than a single prompt.

What carries the argument

Multi-turn program synthesis, in which a single programming task is factorized into a sequence of prompts each describing a subproblem.
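
To make the factorization concrete, here is a minimal sketch of a multi-turn loop. It assumes each subproblem prompt is written as a comment and that earlier prompts and completions are carried forward as context; the text above does not confirm this exact prompt format. `multi_turn_synthesis` and `toy_generate` are hypothetical names, and `toy_generate` is a canned stand-in for a real model call.

```python
from typing import Callable, List


def multi_turn_synthesis(turn_prompts: List[str], generate: Callable[[str], str]) -> str:
    """Feed subproblem prompts one at a time, carrying earlier prompts and code as context."""
    context = ""
    for prompt in turn_prompts:
        # Assumption: each turn is appended as a comment line before asking for code.
        context += f"# {prompt}\n"
        completion = generate(context)
        context += completion.rstrip("\n") + "\n"
    return context


def toy_generate(context: str) -> str:
    """Hypothetical stand-in for a model: returns canned code keyed on the latest prompt."""
    last_prompt = context.rstrip("\n").splitlines()[-1]
    canned = {
        "# Assign the list [3, 1, 2] to a variable named xs.": "xs = [3, 1, 2]",
        "# Sort xs in ascending order.": "xs = sorted(xs)",
        "# Print the first element of xs.": "print(xs[0])",
    }
    return canned[last_prompt]


turns = [
    "Assign the list [3, 1, 2] to a variable named xs.",
    "Sort xs in ascending order.",
    "Print the first element of xs.",
]
program = multi_turn_synthesis(turns, toy_generate)
```

Here `program` holds the three prompts interleaved with their completions. Swapping `toy_generate` for a real model call is the only change needed to try the same loop on actual checkpoints.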

If this is right

Open release of the models and JAXFORMER library enables wider experimentation with code generation systems.
Multi-turn prompting provides a practical method for improving synthesis accuracy on complex tasks.
The MTPB benchmark supplies a standardized testbed for evaluating stepwise synthesis approaches.
Factorizing problems into subprompts extends the effective capability of models trained on single-turn data.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Interactive coding tools could adopt multi-turn prompting to let users guide model output incrementally rather than in one shot.
The observed gains suggest that sequential decomposition may help models handle longer or more intricate reasoning chains in other domains.
Fine-tuning CodeGen on domain-specific multi-turn datasets could further amplify the performance difference.
Real-world adoption would require measuring whether typical developer workflows naturally produce prompt sequences similar to those in MTPB.

Load-bearing premise

The specific factorization of problems into successive prompts used in MTPB accurately represents natural human decomposition of programming tasks without selection bias or artificial simplification.

What would settle it

A controlled test on MTPB in which the total information content of each multi-turn sequence is instead supplied in one comprehensive single-turn prompt yields equal or higher success rates than the multi-turn format.
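
One way to set up such a control is sketched below, under stated assumptions: the single-turn prompt is formed by joining every subproblem prompt, length is approximated by whitespace tokens rather than a model tokenizer, and the outcome lists are made-up placeholders, not results from the paper. All function names are hypothetical.

```python
from typing import List


def single_turn_prompt(turn_prompts: List[str]) -> str:
    """Join every subproblem prompt into one comprehensive single-turn prompt."""
    return " ".join(p.strip() for p in turn_prompts)


def whitespace_token_count(text: str) -> int:
    """Crude length proxy; a real study would use the model's own tokenizer."""
    return len(text.split())


def success_rate(outcomes: List[bool]) -> float:
    """Fraction of problems solved."""
    return sum(outcomes) / len(outcomes)


def single_turn_matches_or_beats(multi: List[bool], single: List[bool]) -> bool:
    """True when the information-matched single-turn condition does at least as well."""
    return success_rate(single) >= success_rate(multi)


turns = [
    "Assign the list [3, 1, 2] to a variable named xs.",
    "Sort xs in ascending order.",
    "Print the first element of xs.",
]
combined = single_turn_prompt(turns)

# The joined prompt carries exactly the words of the individual turns.
length_matched = whitespace_token_count(combined) == sum(
    whitespace_token_count(t) for t in turns
)

# Placeholder per-problem outcomes, invented for illustration only.
multi_outcomes = [True, True, False, True]
single_outcomes = [True, False, False, True]
claim_undercut = single_turn_matches_or_beats(multi_outcomes, single_outcomes)
```

With real per-problem outcomes in place of the placeholders, `claim_undercut` being true would correspond to the result described above.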

Original abstract

Program synthesis strives to generate a computer program as a solution to a given problem specification, expressed with input-output examples or natural language descriptions. The prevalence of large language models advances the state-of-the-art for program synthesis, though limited training resources and data impede open access to such models. To democratize this, we train and release a family of large language models up to 16.1B parameters, called CODEGEN, on natural language and programming language data, and open source the training library JAXFORMER. We show the utility of the trained model by demonstrating that it is competitive with the previous state-of-the-art on zero-shot Python code generation on HumanEval. We further investigate the multi-step paradigm for program synthesis, where a single program is factorized into multiple prompts specifying subproblems. To this end, we construct an open benchmark, Multi-Turn Programming Benchmark (MTPB), consisting of 115 diverse problem sets that are factorized into multi-turn prompts. Our analysis on MTPB shows that the same intent provided to CODEGEN in multi-turn fashion significantly improves program synthesis over that provided as a single turn. We make the training library JAXFORMER and model checkpoints available as open source contribution: https://github.com/salesforce/CodeGen.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

CodeGen opens 16B-scale models and a new multi-turn benchmark, but the reported gains likely mix in extra prompt information from manual factorization.

CodeGen releases open models up to 16.1B parameters trained on natural language and code, plus the JAXFORMER library, and introduces the MTPB benchmark for multi-turn program synthesis. The models hit competitive numbers on HumanEval, and multi-turn prompts on MTPB give better results than single-turn for the same problems.

The open release is the strongest part. Having checkpoints and training code available at this scale lets other groups build on it without massive resources. The benchmark with 115 factorized problems is a new resource that targets how people actually solve coding tasks step by step.

The main weakness is the multi-turn evaluation. The authors manually break each problem into sub-prompts, so the multi-turn version gets explicit decomposition and intermediate goals that the single-turn prompt does not have. The abstract does not show that they controlled for total tokens, information content, or prompt quality between the two conditions. That leaves open the possibility that the gains come from better prompting rather than the multi-turn format itself. There are also no training details, ablations, or statistical tests reported, which makes it harder to assess how robust the findings are.

This paper is useful for anyone working on large language models for code or program synthesis. The models and benchmark are concrete artifacts that others can use and extend. It deserves peer review because the open contributions are substantial, though the experimental design for the multi-turn claim needs more controls to be fully convincing. I would send it to referees. The release alone justifies the time, and reviewers can ask for the missing details and tighter comparisons.

Referee Report

2 major / 2 minor

Summary. The paper introduces CodeGen, a family of open large language models (up to 16.1B parameters) trained on natural language and programming language data using the open-sourced JAXFORMER library. It reports competitive zero-shot performance on HumanEval for Python code generation and introduces the Multi-Turn Programming Benchmark (MTPB) consisting of 115 factorized problem sets. The central empirical claim is that providing the same intent via multi-turn prompts to CodeGen yields significantly better program synthesis than single-turn prompts.

Significance. If the multi-turn results hold under proper controls, the open release of models, checkpoints, and training library represents a valuable contribution by democratizing access to large code models. The MTPB benchmark usefully highlights multi-turn synthesis as a promising direction. However, the current evidence for the multi-turn claim is only moderately supported due to potential confounds in prompt construction and missing experimental details.

major comments (2)

[Abstract / MTPB description] MTPB construction (Abstract): The 115 problems are manually factorized into multi-turn sub-prompts by the authors. This factorization likely embeds explicit decomposition steps and intermediate specifications absent from the single-turn prompt. The manuscript provides no evidence that single-turn prompts were matched for total information content, token length, or decomposition hints, so the reported lift could be due to prompt engineering rather than the multi-turn interaction itself. This directly undermines the central claim that multi-turn fashion significantly improves synthesis.
[Abstract] HumanEval evaluation (Abstract): The claim of competitive zero-shot results is stated without reporting exact pass@k scores for CodeGen variants, number of evaluation runs, variance, or statistical tests against prior models. This absence makes it impossible to assess whether the competitiveness is robust or merely within noise.
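
For readers unfamiliar with the metric, the sketch below shows the unbiased pass@k estimator commonly used with HumanEval: with n samples per problem of which c pass the unit tests, pass@k = 1 - C(n-c, k) / C(n, k). Whether the paper computes its scores exactly this way is not stated in the text here, and the function name `pass_at_k` is an illustrative choice.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of the chance that at least one of k samples passes.

    n: samples drawn for the problem, c: samples that passed, k: budget being scored.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k samples failed, giving pass@k = 1.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Worked example: 4 samples, 1 correct, budget of 2.
# 1 - C(3, 2) / C(4, 2) = 1 - 3/6 = 0.5
example = pass_at_k(n=4, c=1, k=2)
```

A benchmark-level score is the mean of this quantity over problems, which is why the number of samples and the run-to-run variance the referee asks for both matter.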

minor comments (2)

[Abstract] The abstract would be strengthened by explicitly stating the HumanEval pass@1 or pass@10 numbers achieved by the largest CodeGen model for direct comparison with prior work.
[Abstract] Training details such as data mixture ratios, exact hyper-parameters, and compute resources are referenced only at a high level; moving a concise summary into the main text would improve reproducibility even with the open-sourced library.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments highlight important aspects of experimental rigor and prompt construction that we have addressed in the revision. Below we respond point-by-point to the major comments.

Referee: [Abstract / MTPB description] MTPB construction (Abstract): The 115 problems are manually factorized into multi-turn sub-prompts by the authors. This factorization likely embeds explicit decomposition steps and intermediate specifications absent from the single-turn prompt. The manuscript provides no evidence that single-turn prompts were matched for total information content, token length, or decomposition hints, so the reported lift could be due to prompt engineering rather than the multi-turn interaction itself. This directly undermines the central claim that multi-turn fashion significantly improves synthesis.

Authors: We acknowledge the potential confound raised. The MTPB benchmark intentionally factorizes problems to reflect realistic multi-step programming workflows, where the overall intent is preserved but delivered sequentially with intermediate specifications. The single-turn baseline uses the original, unfactored problem statement. To strengthen the evidence, the revised manuscript adds (1) explicit token-length statistics for paired single- and multi-turn prompts, (2) a controlled subset analysis where multi-turn prompts are truncated to match single-turn length, and (3) example prompt pairs in the appendix. While these additions show the performance advantage persists under length matching, we agree a fully information-matched control would require a new benchmark design; we therefore describe this limitation in the discussion and label the revision as partial.
Revision: partial
Referee: [Abstract] HumanEval evaluation (Abstract): The claim of competitive zero-shot results is stated without reporting exact pass@k scores for CodeGen variants, number of evaluation runs, variance, or statistical tests against prior models. This absence makes it impossible to assess whether the competitiveness is robust or merely within noise.

Authors: We agree that the abstract and main text should report these details for reproducibility. The revised version includes a new Table 2 with exact pass@k (k=1,10,100) scores for all CodeGen sizes (350M–16.1B), computed over 10 independent sampling runs with reported means and standard deviations. We also add pairwise statistical comparisons (Welch’s t-tests) against the strongest prior baselines in the appendix, confirming that CodeGen-16.1B remains statistically indistinguishable from the prior SOTA at k=10 while outperforming at k=1. These numbers and procedures are now referenced from the abstract.
Revision: yes
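
The simulated rebuttal mentions Welch's t-tests over means and standard deviations from repeated sampling runs. As a rough sketch of that computation from summary statistics alone, the function below returns the t statistic and the Welch-Satterthwaite degrees of freedom. The name `welch_t` is hypothetical, the inputs are placeholders rather than reported scores, and turning t into a p-value would need a t-distribution CDF from a statistics library.

```python
from math import sqrt
from typing import Tuple


def welch_t(mean_a: float, sd_a: float, n_a: int,
            mean_b: float, sd_b: float, n_b: int) -> Tuple[float, float]:
    """Welch's unequal-variance t statistic and approximate degrees of freedom."""
    # Squared standard errors of each group's mean.
    va = sd_a ** 2 / n_a
    vb = sd_b ** 2 / n_b
    t = (mean_a - mean_b) / sqrt(va + vb)
    # Welch-Satterthwaite approximation to the degrees of freedom.
    df = (va + vb) ** 2 / (va ** 2 / (n_a - 1) + vb ** 2 / (n_b - 1))
    return t, df


# Placeholder inputs (not from the paper): two groups of 10 runs each,
# unit standard deviation, means one unit apart.
t_stat, dof = welch_t(1.0, 1.0, 10, 0.0, 1.0, 10)
```

With equal variances and equal group sizes the degrees of freedom reduce to n_a + n_b - 2, which is a quick sanity check on any implementation.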

Circularity Check

0 steps flagged

No significant circularity in empirical evaluation

The paper is an empirical contribution: it trains open LLMs on natural language and code data, releases checkpoints and the JAXFORMER library, and reports zero-shot performance on the external HumanEval benchmark plus a direct multi-turn vs single-turn comparison on the newly introduced MTPB. No mathematical derivations, first-principles predictions, or fitted parameters are presented whose outputs reduce to the inputs by construction. The MTPB construction and factorization are described as an experimental design choice, not a self-referential definition or renamed known result. Any self-citations are incidental and not load-bearing for the central empirical claims, which remain independently verifiable via the released artifacts.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

Central claims rest on the effectiveness of scaling transformer-based LLMs on mixed natural-language and programming-language corpora plus the validity of the MTPB construction; no new physical or mathematical entities are postulated.

free parameters (1)

model parameter counts (up to 16.1B)
Design choice for scaling experiments; specific sizes are selected rather than derived.

axioms (1)

Domain assumption: Transformer language models trained on combined natural language and code corpora can perform zero-shot program synthesis.
Invoked as the basis for training CodeGen and evaluating on HumanEval.

pith-pipeline@v0.9.0 · 5554 in / 1171 out tokens · 26653 ms · 2026-05-13T16:59:25.733083+00:00 · methodology

Forward citations

Cited by 24 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Do Papers Tell the Whole Story? A Benchmark and Framework for Uncovering Hidden Implementation Gaps in Bioinformatics
cs.LG 2026-03 unverdicted novelty 8.0

BioCon is the first benchmark dataset and cross-modal framework for detecting inconsistencies between methodological descriptions in bioinformatics papers and their code implementations.
Social Bias in LLM-Generated Code: Benchmark and Mitigation
cs.SE 2026-05 unverdicted novelty 7.0

LLMs show up to 60.58% social bias in generated code; a new Fairness Monitor Agent cuts bias by 65.1% and raises functional correctness from 75.80% to 83.97%.
Assessing the Impact of Requirement Ambiguity on LLM-based Function-Level Code Generation
cs.SE 2026-04 conditional novelty 7.0

Orchid benchmark shows requirement ambiguity degrades LLM code generation performance across all models, with advanced models hit hardest, and LLMs rarely detect or resolve the ambiguity themselves.
SynthFix: Adaptive Neuro-Symbolic Code Vulnerability Repair
cs.SE 2026-04 unverdicted novelty 7.0

SynthFix adaptively routes LLM code repairs to supervised fine-tuning or symbolic-reward fine-tuning, yielding up to 32% higher exact match on JavaScript and C vulnerability benchmarks.
Structural Anchors and Reasoning Fragility: Understanding CoT Robustness in LLM4Code
cs.SE 2026-04 unverdicted novelty 7.0

CoT prompting in LLM4Code shows mixed robustness that depends on model family, task structure, and perturbations destabilizing structural anchors, leading to trajectory deformations like lengthening, branching, and si...
CodeComp: Structural KV Cache Compression for Agentic Coding
cs.CL 2026-04 unverdicted novelty 7.0

CodeComp uses Joern-extracted Code Property Graph priors for training-free structural KV cache compression, outperforming attention-only baselines on bug localization and code generation while matching full-context pa...
Choose, Don't Label: Multiple-Choice Query Synthesis for Program Disambiguation
cs.PL 2026-04 unverdicted novelty 7.0

Multiple-choice queries synthesized from Hoare triples enable more reliable identification of intended programs than labeled-example supervision in active learning for program disambiguation.
A Full-Stack Performance Evaluation Infrastructure for 3D-DRAM-based LLM Accelerators
cs.AR 2026-04 conditional novelty 7.0

ATLAS is the first silicon-validated simulation framework for 3D-DRAM LLM accelerators, achieving under 8.57% error and over 97% correlation with real hardware while supporting design exploration.
Voyager: An Open-Ended Embodied Agent with Large Language Models
cs.AI 2023-05 unverdicted novelty 7.0

Voyager achieves superior lifelong learning in Minecraft by combining an automatic exploration curriculum, a library of executable skills, and iterative LLM prompting with environment feedback, yielding 3.3x more uniq...
Program of Thoughts Prompting: Disentangling Computation from Reasoning for Numerical Reasoning Tasks
cs.CL 2022-11 unverdicted novelty 7.0

PoT prompting improves numerical reasoning by having language models write programs executed by a computer instead of performing calculations in natural language chains of thought, with an average 12% gain over CoT.
CodeEvolve: LLM-Driven Evolutionary Optimization with Runtime-Enriched Target Selection for Multi-Language Code Enhancement
cs.SE 2026-05 unverdicted novelty 6.0

CodeEvolve uses runtime-guided target selection and MCTS-augmented LLM evolution to optimize real Java and Apex code, reporting 15.22x average speedup on seven hotspots while preserving correctness.
No Test Cases, No Problem: Distillation-Driven Code Generation for Scientific Workflows
cs.SE 2026-04 unverdicted novelty 6.0

MOSAIC generates executable scientific code without I/O test cases by combining student-teacher distillation with a consolidated context window to reduce hallucinations across subproblems.
RealBench: A Repo-Level Code Generation Benchmark Aligned with Real-World Software Development Practices
cs.SE 2026-04 unverdicted novelty 6.0

RealBench is a new repo-level code generation benchmark that adds UML diagrams to natural language specs, showing LLMs struggle more at full repositories, create modules with errors, and perform best with whole-repo g...
A Metamorphic Testing Approach to Diagnosing Memorization in LLM-Based Program Repair
cs.SE 2026-04 unverdicted novelty 6.0

Metamorphic testing on Defects4J and GitBug-Java reveals substantial performance drops in seven LLMs that correlate with NLL, indicating data leakage in LLM-based program repair.
AdaExplore: Failure-Driven Adaptation and Diversity-Preserving Search for Efficient Kernel Generation
cs.CL 2026-04 unverdicted novelty 6.0

AdaExplore improves correctness and speed of Triton kernel generation by converting recurring failures into a memory of rules and organizing search as a tree that mixes local refinements with larger regenerations, yie...
MetaMath: Bootstrap Your Own Mathematical Questions for Large Language Models
cs.CL 2023-09 conditional novelty 6.0

Bootstrapping math questions via rewriting creates MetaMathQA; fine-tuning LLaMA-2 on it yields 66.4% on GSM8K for 7B and 82.3% for 70B, beating prior same-size models by large margins.
Gorilla: Large Language Model Connected with Massive APIs
cs.CL 2023-05 conditional novelty 6.0

Gorilla is a fine-tuned LLM that surpasses GPT-4 in accurate API call generation and uses retrieval to handle documentation updates.
Retrieval-Augmented Generation for AI-Generated Content: A Survey
cs.CV 2024-02 accept novelty 5.0

A survey classifying RAG foundations for AIGC, summarizing enhancements, cross-modal applications, benchmarks, limitations, and future directions.
DeepSeek-Coder: When the Large Language Model Meets Programming -- The Rise of Code Intelligence
cs.SE 2024-01 unverdicted novelty 5.0

DeepSeek-Coder open-source models trained on 2T code tokens with fill-in-the-blank pretraining achieve SOTA results among open models and surpass closed-source Codex and GPT-3.5 on code benchmarks.
Self-Refine: Iterative Refinement with Self-Feedback
cs.CL 2023-03 unverdicted novelty 5.0

Self-Refine boosts LLM outputs by ~20% on average across seven tasks by having the same model iteratively generate, critique, and refine its own responses.
FLeX: Fourier-based Low-rank EXpansion for multilingual transfer
cs.LG 2026-04 unverdicted novelty 4.0

LoRA fine-tuning of Code Llama with Fourier regularization raises Java pass@1 from 34.2% to 42.1% while using a small high-quality dataset.
A Survey on Large Language Models for Code Generation
cs.CL 2024-06 unverdicted novelty 3.0

A systematic literature review that organizes recent work on LLMs for code generation into a taxonomy covering data curation, model advances, evaluations, ethics, environmental impact, and applications, with benchmark...
Large Language Models: A Survey
cs.CL 2024-02 accept novelty 3.0

The paper surveys key large language models, their training methods, datasets, evaluation benchmarks, and future research directions in the field.
A Survey of Large Language Models
cs.CL 2023-03 accept novelty 3.0

This survey reviews the background, key techniques, and evaluation methods for large language models, emphasizing emergent abilities that appear at large scales.