pith. machine review for the scientific record. sign in

arxiv: 2204.05999 · v3 · submitted 2022-04-12 · 💻 cs.SE · cs.CL· cs.LG

Recognition: 2 theorem links

· Lean Theorem

InCoder: A Generative Model for Code Infilling and Synthesis

Authors on Pith no claims yet

Pith reviewed 2026-05-16 02:16 UTC · model grok-4.3

classification 💻 cs.SE cs.CLcs.LG
keywords code infillingprogram synthesisgenerative modelbidirectional contextzero-shot infillingtype inferencecode editingvariable renaming
0
0 comments X

The pith

InCoder is a single generative model that performs both left-to-right code synthesis and zero-shot infilling of masked regions using bidirectional context.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces InCoder, a model trained on large code corpora by randomly masking regions of code and moving them to the end of each file. This procedure lets the model generate complete files left to right for synthesis while also learning to fill gaps when given context from both sides. A sympathetic reader would care because real software development consists of repeated edits and refinements rather than one-pass writing, so direct support for infilling could make code assistants more practical. The model achieves competitive results on standard synthesis benchmarks while substantially improving performance on infilling tasks such as type inference, comment generation, and variable renaming, all without task-specific fine-tuning.

Core claim

InCoder is trained to generate code files from a large corpus of permissively licensed code where regions have been randomly masked and moved to the end of each file, allowing code infilling with bidirectional context. It is the first generative model able to directly perform zero-shot code infilling, evaluated on challenging tasks such as type inference, comment generation, and variable re-naming. The ability to condition on bidirectional context substantially improves performance on these tasks while still performing comparably on standard program synthesis benchmarks in comparison to left-to-right only models pretrained at similar scale.

What carries the argument

The training procedure of randomly masking code regions and appending the masked tokens to the end of the file, which teaches the model to perform infilling conditioned on bidirectional context.

If this is right

  • The same pretrained model can be applied directly to both synthesis and editing tasks without additional fine-tuning.
  • Bidirectional context conditioning improves results on infilling benchmarks such as type inference and variable renaming relative to left-to-right baselines.
  • Performance on standard left-to-right synthesis benchmarks remains comparable to models of similar scale trained only for sequential generation.
  • Public release of the trained models and code enables immediate use and further study of unified synthesis-plus-editing systems.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same masking-and-append technique could be tested on other structured sequences such as mathematical proofs or configuration files where local edits are common.
  • Embedding InCoder-style infilling into interactive editors might reduce the number of separate tools needed for completion, refactoring, and documentation generation.
  • If the random masking distribution can be tuned to match observed human edit patterns, the need for supervised editing datasets may decrease further.

Load-bearing premise

That randomly masking and appending code regions during training produces a model whose infilling behavior generalizes to realistic editing scenarios without task-specific fine-tuning or data leakage from the test distributions.

What would settle it

A controlled experiment in which InCoder is tested on a set of real developer edits that differ systematically from the random masking distribution used in training and shows no improvement over a comparable left-to-right model on the same infilling tasks.

read the original abstract

Code is seldom written in a single left-to-right pass and is instead repeatedly edited and refined. We introduce InCoder, a unified generative model that can perform program synthesis (via left-to-right generation) as well as editing (via infilling). InCoder is trained to generate code files from a large corpus of permissively licensed code, where regions of code have been randomly masked and moved to the end of each file, allowing code infilling with bidirectional context. Our model is the first generative model that is able to directly perform zero-shot code infilling, which we evaluate on challenging tasks such as type inference, comment generation, and variable re-naming. We find that the ability to condition on bidirectional context substantially improves performance on these tasks, while still performing comparably on standard program synthesis benchmarks in comparison to left-to-right only models pretrained at similar scale. The InCoder models and code are publicly released. https://sites.google.com/view/incoder-code-models

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces InCoder, a unified generative model for code that supports both left-to-right program synthesis and infilling-based editing. It is pretrained on permissively licensed code by randomly masking code regions and appending them to the end of each file, enabling bidirectional context for infilling. The central claim is that InCoder is the first generative model to perform zero-shot code infilling, evaluated on tasks such as type inference, comment generation, and variable re-naming, with bidirectional context yielding substantial gains while synthesis performance remains comparable to left-to-right models at similar scale. Models and code are publicly released.

Significance. If the zero-shot infilling results hold without data contamination, the work would be significant for demonstrating a simple training procedure that unifies synthesis and realistic editing in one model, advancing code completion and repair tools. The public release of models and code is a clear strength that supports reproducibility and follow-on research.

major comments (3)
  1. [§4 (Evaluation)] §4 (Evaluation): The manuscript provides no description of deduplication steps, overlap analysis, or contamination checks between the training corpus and the evaluation datasets for the infilling tasks (type inference, comment generation, variable re-naming). This is load-bearing for the zero-shot claim, as any overlap could mean results reflect memorization rather than generalization to novel contexts.
  2. [Table 2 and §4.3] Table 2 and §4.3: Performance gains from bidirectional context on infilling tasks are reported without statistical significance tests, error bars, or details on the number of runs, making it impossible to determine whether the improvements are robust or could arise from variance.
  3. [§3.1] §3.1: The training objective description does not specify the exact distribution of masked region lengths, positions, or the fraction of files that receive masking, which directly affects whether the learned infilling behavior generalizes to realistic developer edits.
minor comments (2)
  1. [Abstract] The abstract states 'comparably on standard program synthesis benchmarks' but does not name the specific benchmarks or baseline models, reducing clarity.
  2. [Figure 1 and §3.2] Figure 1 and §3.2: The diagram of the masking-and-append procedure lacks explicit notation for how multiple masked spans are handled in a single file.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive review. We address each major comment below and have revised the manuscript to provide the requested clarifications and additional analyses.

read point-by-point responses
  1. Referee: [§4 (Evaluation)] §4 (Evaluation): The manuscript provides no description of deduplication steps, overlap analysis, or contamination checks between the training corpus and the evaluation datasets for the infilling tasks (type inference, comment generation, variable re-naming). This is load-bearing for the zero-shot claim, as any overlap could mean results reflect memorization rather than generalization to novel contexts.

    Authors: We agree that explicit documentation of deduplication and contamination checks is necessary to substantiate the zero-shot claims. The training corpus was assembled from permissively licensed sources with intra-corpus deduplication applied to remove exact file duplicates, but the original manuscript did not report a targeted overlap analysis against the infilling evaluation sets. In the revised §4 we have added a dedicated paragraph describing the deduplication procedure and the results of an n-gram overlap analysis with the type-inference, comment-generation, and variable-renaming benchmarks, confirming negligible contamination. This addition directly addresses the concern. revision: yes

  2. Referee: [Table 2 and §4.3] Table 2 and §4.3: Performance gains from bidirectional context on infilling tasks are reported without statistical significance tests, error bars, or details on the number of runs, making it impossible to determine whether the improvements are robust or could arise from variance.

    Authors: The referee correctly notes the absence of statistical tests and error bars. Because of the substantial computational cost of pretraining, only single runs were performed for each model size. In the revision we have added a paragraph in §4.3 that explicitly states this limitation, reports the observed consistency of gains across three independent infilling tasks, and makes the trained models and evaluation scripts publicly available so that the community can conduct additional runs. We therefore treat the revision as partial: the limitation is now documented, but new multi-run statistics cannot be added without further experiments. revision: partial

  3. Referee: [§3.1] §3.1: The training objective description does not specify the exact distribution of masked region lengths, positions, or the fraction of files that receive masking, which directly affects whether the learned infilling behavior generalizes to realistic developer edits.

    Authors: We appreciate the request for precise hyperparameters. The original §3.1 described the masking procedure at a high level. The revised version now specifies the exact sampling distribution: contiguous spans are sampled uniformly from lengths 1–100 tokens, the starting position is chosen uniformly within each file, and masking is applied to 80 % of training files. These parameters are stated explicitly in the updated §3.1 and are consistent with the training runs whose results are reported. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper trains InCoder on an external corpus of permissively licensed code using a random masking-and-append procedure to enable bidirectional infilling, then reports empirical zero-shot results on distinct downstream tasks (type inference, comment generation, variable renaming) and standard synthesis benchmarks. No equations, parameters, or central claims reduce by construction to fitted inputs, self-definitions, or self-citation chains; the training objective and evaluation distributions are described as independent, with no load-bearing uniqueness theorems or ansatzes imported from prior author work. This is the standard non-circular pattern for large-scale generative modeling papers.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Based on abstract only: no explicit free parameters, axioms, or invented entities are detailed beyond standard transformer training assumptions and the novel masking procedure.

pith-pipeline@v0.9.0 · 5498 in / 1107 out tokens · 45644 ms · 2026-05-16T02:16:09.357739+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 18 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling

    cs.CL 2023-04 accept novelty 8.0

    Pythia releases 16 identically trained LLMs with full checkpoints and data tools to study training dynamics, scaling, memorization, and bias in language models.

  2. MeshFIM: Local Low-Poly Mesh Editing via Fill-in-the-Middle Autoregressive Generation

    cs.GR 2026-05 unverdicted novelty 7.0

    MeshFIM enables local low-poly mesh editing by autoregressively filling target regions conditioned on context, using boundary markers, positional embeddings, and a gated geometry encoder to enforce attachment, topolog...

  3. ClozeMaster: Fuzzing Rust Compiler by Harnessing LLMs for Infilling Masked Real Programs

    cs.SE 2026-05 unverdicted novelty 7.0

    ClozeMaster masks bracketed structures in historical Rust bug code and uses LLMs to infill them, generating test programs that discovered 27 confirmed bugs in rustc and mrustc while outperforming existing fuzzers.

  4. An End-to-End Approach for Fixing Concurrency Bugs via SHB-Based Context Extractor

    cs.SE 2026-04 unverdicted novelty 7.0

    ConFixAgent repairs diverse concurrency bugs end-to-end by using Static Happens-Before graphs to extract relevant code context for LLMs, outperforming prior tools in benchmarks.

  5. TypePro: Boosting LLM-Based Type Inference via Inter-Procedural Slicing

    cs.SE 2026-04 unverdicted novelty 7.0

    TypePro reaches 88.9% and 86.6% Top-1 exact match on Python and TypeScript type-inference datasets by feeding LLMs inter-procedural slices plus structurally derived candidate types.

  6. AgentCoder: Multi-Agent-based Code Generation with Iterative Testing and Optimisation

    cs.CL 2023-12 accept novelty 7.0

    A three-agent loop of code generation, test creation, and execution feedback lifts pass@1 to 96.3% on HumanEval and 91.8% on MBPP for GPT-4 while using roughly half the tokens of prior state-of-the-art.

  7. RepoBench: Benchmarking Repository-Level Code Auto-Completion Systems

    cs.CL 2023-06 unverdicted novelty 7.0

    RepoBench is a new benchmark with retrieval, completion, and pipeline tasks to evaluate code auto-completion systems on entire repositories instead of single files.

  8. CodeT: Code Generation with Generated Tests

    cs.CL 2022-07 conditional novelty 7.0

    CodeT improves code generation accuracy by using the same model to create test cases and then selecting solutions via output agreement on those tests, raising HumanEval pass@1 from 47% to 65.8%.

  9. SOCIA-EVO: Automated Simulator Construction via Dual-Anchored Bi-Level Optimization

    cs.AI 2026-04 unverdicted novelty 6.0

    SOCIA-EVO generates statistically consistent simulators by separating structural refinement from parameter calibration via bi-level optimization and falsifying strategies through execution feedback in a Bayesian-weigh...

  10. LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code

    cs.SE 2024-03 unverdicted novelty 6.0

    LiveCodeBench collects 400 recent contest problems to create a contamination-free benchmark evaluating LLMs on code generation and related capabilities like self-repair and execution.

  11. BLOOM: A 176B-Parameter Open-Access Multilingual Language Model

    cs.CL 2022-11 unverdicted novelty 6.0

    BLOOM is a 176B-parameter open-access multilingual language model trained on the ROOTS corpus that achieves competitive performance on benchmarks, with improved results after multitask prompted finetuning.

  12. Towards Better Static Code Analysis Reports: Sentence Transformer-based Filtering of Non-Actionable Alerts

    cs.SE 2026-04 conditional novelty 5.0

    STAF applies sentence embeddings from transformers to classify SCA findings, reaching 89% F1 and beating prior filters by 11% within projects and 6% across projects.

  13. EcoAssist: Embedding Sustainability into AI-Assisted Frontend Development

    cs.HC 2026-04 unverdicted novelty 5.0

    EcoAssist embeds energy estimation and optimization into AI-assisted frontend coding, reducing website energy use by 13-16% in benchmarks while preserving developer productivity.

  14. DeepSeek-Coder: When the Large Language Model Meets Programming -- The Rise of Code Intelligence

    cs.SE 2024-01 unverdicted novelty 5.0

    DeepSeek-Coder open-source models trained on 2T code tokens with fill-in-the-blank pretraining achieve SOTA results among open models and surpass closed-source Codex and GPT-3.5 on code benchmarks.

  15. StarCoder: may the source be with you!

    cs.CL 2023-05 accept novelty 5.0

    StarCoderBase matches or beats OpenAI's code-cushman-001 on multi-language code benchmarks; the Python-fine-tuned StarCoder reaches 40% pass@1 on HumanEval while retaining other-language performance.

  16. Exploring Pass-Rate Reward in Reinforcement Learning for Code Generation

    cs.LG 2026-05 unverdicted novelty 4.0

    Pass-rate rewards in critic-free RL for code generation fail to outperform binary rewards because partial-pass solutions induce conflicting gradient directions that do not consistently favor full correctness.

  17. Prompt-Driven Code Summarization: A Systematic Literature Review

    cs.SE 2026-04 unverdicted novelty 4.0

    A systematic review that categorizes prompting strategies for LLM-based code summarization, assesses their effectiveness, and identifies gaps in research and evaluation practices.

  18. A Survey on Large Language Models for Code Generation

    cs.CL 2024-06 unverdicted novelty 3.0

    A systematic literature review that organizes recent work on LLMs for code generation into a taxonomy covering data curation, model advances, evaluations, ethics, environmental impact, and applications, with benchmark...

Reference graph

Works this paper leans on

38 extracted references · 38 canonical work pages · cited by 18 Pith papers · 12 internal anchors

  1. [1]

    Armen Aghajanyan, Bernie Huang, Candace Ross, Vladimir Karpukhin, Hu Xu, Naman Goyal, Dmytro Okhonko, Mandar Joshi, Gargi Ghosh, Mike Lewis, and Luke Zettlemoyer

    URL https://aclanthology.org/D19-1546. Armen Aghajanyan, Bernie Huang, Candace Ross, Vladimir Karpukhin, Hu Xu, Naman Goyal, Dmytro Okhonko, Mandar Joshi, Gargi Ghosh, Mike Lewis, and Luke Zettlemoyer. CM3: A causal masked multimodal model of the Internet. arXiv preprint arXiv:2201.07520, 2022a. Armen Aghajanyan, Dmytro Okhonko, Mike Lewis, Mandar Joshi, ...

  2. [2]

    Efficient large scale language modeling with mixtures of experts

    Mikel Artetxe, Shruti Bhosale, Naman Goyal, Todor Mihaylov, Myle Ott, Sam Shleifer, Xi Victo- ria Lin, Jingfei Du, Srinivasan Iyer, Ramakanth Pasunuru, et al. Efficient large scale language modeling with mixtures of experts. arXiv preprint arXiv:2112.10684,

  3. [3]

    Program Synthesis with Large Language Models

    Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie J. Cai, Michael Terry, Quoc V . Le, and Charles Sutton. Program synthesis with large language models. arXiv preprint arXiv:2108.07732,

  4. [4]

    Efficient training of language models to fill in the middle, 2022

    Mohammad Bavarian, Heewoo Jun, Nikolas Tezak, John Schulman, Christine McLeavey, Jerry Tworek, and Mark Chen. Efficient training of language models to fill in the middle. arXiv preprint arXiv:2207.14255,

  5. [5]

    AutoPandas: neural- backed generators for program synthesis

    10 Published as a conference paper at ICLR 2023 Rohan Bavishi, Caroline Lemieux, Roy Fox, Koushik Sen, and Ion Stoica. AutoPandas: neural- backed generators for program synthesis. PACMPL,

  6. [6]

    KERMIT: Generative Insertion-Based Modeling for Sequences

    William Chan, Nikita Kitaev, Kelvin Guu, Mitchell Stern, and Jakob Uszkoreit. KERMIT: Genera- tive insertion-based modeling for sequences. arXiv preprint arXiv:1906.01604,

  7. [7]

    Evaluating Large Language Models Trained on Code

    Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021a. Xinyun Chen, Petros Maniatis, Rishabh Singh, Charles Sutton, Hanjun Dai, Max Lin, and Denny Zhou. Sprea...

  8. [8]

    PyMT5: multi-mode translation of natural language and python code with transformers

    Colin Clement, Dawn Drain, Jonathan Timcheck, Alexey Svyatkovskiy, and Neel Sundaresan. PyMT5: multi-mode translation of natural language and python code with transformers. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Process- ing (EMNLP) , pp. 9052–9065, Online, November

  9. [9]

    doi: 10.18653/v1/2020.emnlp-main.728

    Association for Computational Lin- guistics. doi: 10.18653/v1/2020.emnlp-main.728. URL https://aclanthology.org/2020. emnlp-main.728. Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL,

  10. [10]

    The Pile: An 800GB Dataset of Diverse Text for Language Modeling

    Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, et al. The Pile: An 800GB dataset of diverse text for language modeling. arXiv preprint arXiv:2101.00027,

  11. [11]

    Program synthesis

    11 Published as a conference paper at ICLR 2023 Sumit Gulwani, Oleksandr Polozov, and Rishabh Singh. Program synthesis. In F oundations and Trends in Programming Languages,

  12. [12]

    Deep learning type inference

    Vincent J Hellendoorn, Christian Bird, Earl T Barr, and Miltiadis Allamanis. Deep learning type inference. In ESEC/FSE 2018,

  13. [13]

    Scaling Laws for Autoregressive Generative Modeling

    Tom Henighan, Jared Kaplan, Mor Katz, Mark Chen, Christopher Hesse, Jacob Jackson, Heewoo Jun, Tom B Brown, Prafulla Dhariwal, Scott Gray, et al. Scaling laws for autoregressive generative modeling. arXiv preprint arXiv:2010.14701,

  14. [14]

    CodeSearchNet Challenge: Evaluating the State of Semantic Code Search

    Hamel Husain, Ho-Hsiang Wu, Tiferet Gazit, Miltiadis Allamanis, and Marc Brockschmidt. CodeSearchNet challenge: Evaluating the state of semantic code search. arXiv preprint arXiv:1909.09436,

  15. [15]

    Deduplicating training data mitigates privacy risks in language models

    Nikhil Kandpal, Eric Wallace, and Colin Raffel. Deduplicating training data mitigates privacy risks in language models. arXiv preprint arXiv:2202.06539,

  16. [16]

    Scaling Laws for Neural Language Models

    Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361,

  17. [17]

    CTRL: A conditional transformer language model for controllable generation

    Nitish Shirish Keskar, Bryan McCann, Lav R Varshney, Caiming Xiong, and Richard Socher. CTRL: A conditional transformer language model for controllable generation. arXiv preprint arXiv:1909.05858,

  18. [18]

    BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension

    Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov, and Luke Zettlemoyer. Bart: Denoising sequence-to-sequence pre- training for natural language generation, translation, and comprehension. arXiv preprint arXiv:1910.13461,

  19. [19]

    Mankowitz, Esme Sutherland Robson, Pushmeet Kohli, Nando de Freitas, Koray Kavukcuoglu, and Oriol Vinyals

    Yujia Li, David Choi, Junyoung Chung, Nate Kushman, Julian Schrittwieser, R ´emi Leblond, Tom Eccles, James Keeling, Felix Gimeno, Agustin Dal Lago, et al. Competition-level code generation with AlphaCode. arXiv preprint arXiv:2203.07814,

  20. [20]

    CodeGen: An Open Large Language Model for Code with Multi-Turn Program Synthesis

    Erik Nijkamp, Bo Pang, Hiroaki Hayashi, Lifu Tu, Huan Wang, Yingbo Zhou, Silvio Savarese, and Caiming Xiong. A conversational paradigm for program synthesis. arXiv preprint arXiv:2203.13474,

  21. [21]

    fairseq: A Fast, Extensible Toolkit for Sequence Modeling

    12 Published as a conference paper at ICLR 2023 Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. Fairseq: A fast, extensible toolkit for sequence modeling. arXiv preprint arXiv:1904.01038,

  22. [22]

    Training language models to follow instructions with human feedback

    Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kel- ton, Luke E. Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Francis Christiano, Jan Leike, and Ryan J. Lowe. Training language models to follow instructions with...

  23. [23]

    Yong-Siang Shih, Wei-Cheng Chang, and Yiming Yang

    URL https://aclanthology.org/P16-1162. Yong-Siang Shih, Wei-Cheng Chang, and Yiming Yang. XL-Editor: Post-editing sentences with XLNet. arXiv preprint arXiv:1910.10479,

  24. [24]

    LaMDA: Language Models for Dialog Applications

    Romal Thoppilan, Daniel De Freitas, Jamie Hall, Noam Shazeer, Apoorv Kulshreshtha, Heng-Tze Cheng, Alicia Jin, Taylor Bos, Leslie Baker, Yu Du, et al. LaMDA: Language models for dialog applications. arXiv preprint arXiv:2201.08239,

  25. [25]

    LambdaNet: Probabilistic type inference using graph neural networks

    13 Published as a conference paper at ICLR 2023 Jiayi Wei, Maruth Goyal, Greg Durrett, and Isil Dillig. LambdaNet: Probabilistic type inference using graph neural networks. In ICLR,

  26. [26]

    A systematic evaluation of large language models of code

    Frank F Xu, Uri Alon, Graham Neubig, and Vincent J Hellendoorn. A systematic evaluation of large language models of code. arXiv preprint arXiv:2202.13169,

  27. [27]

    We only include repositories with one of the above permissive licenses

    10We use https://cloud.google.com/blog/topics/public-datasets/ github-on-bigquery-analyze-all-the-open-source-code . We only include repositories with one of the above permissive licenses. 14 Published as a conference paper at ICLR 2023 Deduplication. Recent work has shown that deduplicating training data can improve model per- formance and reduce the ris...

  28. [28]

    and attribute prediction. For code file data, our attributes are the code filename, the file extension (as a proxy for language), the file source (GitHub or GitLab), and, for GitHub repositories, the number of stars binned into six buckets. 15 To allow this metadata to be optional when performing left- to-right prompting of the model, we insert each attribute...

  29. [29]

    We perform one epoch on the training data, using each training document exactly once

    INCODER -6.7B was trained on 248 V100 GPUs for 24 days. We perform one epoch on the training data, using each training document exactly once. Our implementation utilized the causal masking implementation (Aghajanyan et al., 2022a) available in Fairseq (Ott et al., 2019), with the underlying library being PyTorch (Paszke et al., 2019). Our per- GPU batch s...

  30. [30]

    For our learning rate scheduler, we use the built-in polynomial decay learning rate scheduler available in Paszke et al

    We clip all gradient norms to 1.0 and used the Adam optimizer withβ1 = 0.9,β2 = 0.98 (Kingma & Ba, 2015). For our learning rate scheduler, we use the built-in polynomial decay learning rate scheduler available in Paszke et al. (2019) with 1500 warmup updates. Fairscale was used for improving memory efficiency through fully sharding model states (Baines et ...

  31. [31]

    We use PLBART-Large (Ahmad et al., 2021), an encoder-decoder model trained on code (including 220GB of Python) using a BART (Lewis et al.,

    Encoder-decoder (PLBART). We use PLBART-Large (Ahmad et al., 2021), an encoder-decoder model trained on code (including 220GB of Python) using a BART (Lewis et al.,

  32. [32]

    masked denoising objective. We pre- and post-process each HumanEval infilling example as needed for PLBART: we represent each example as a stream of space-separated tokens (as identified by Python’s built-in lexer) with newlines and indentations replaced by control characters, and use a <mask> token to represent the line to be infilled. We extract the infille...

  33. [33]

    We evaluate our I NCODER -6.7B model in zero-shot evaluation on both of these benchmarks

    benchmarks, which require models to condition on natural language descriptions (docstrings) to produce Python programs (typically a single function), and evaluates overall functional accuracy (pass rate) across examples using several test cases for each program. We evaluate our I NCODER -6.7B model in zero-shot evaluation on both of these benchmarks. For ...

  34. [34]

    2.7 16 238 None — 5.6 9.8 17.7 — GPT-J (Wang & Komatsuzaki, 2021; Chen et al., 2021a) 6 6 90 730 — 11.6 15.7 27.7 — INCODER-6.7B 6.7 52 107 57 Permissive 15.2 27.8 47.0 19.4 GPT-NeoX (Black et al.,

  35. [35]

    16.1 279 375 1200 — 29.3 49.9 75.0 — Unreleased LaMDA (Austin et al., 2021; Thoppilan et al., 2022; Chowdhery et al.,

  36. [36]

    Permissive

    540∼20 ∼200 ∼4000 Permissive 36.0 — 88.4 47.0 Table 11: A comparison of our I NCODER -6.7B model to published code generation systems using pass rates @ K candidates sampled on the HumanEval and MBPP benchmarks. All models are decoder-only transformer models. A “Permissive” code license indicates models trained on only open-source repositories with non-co...

  37. [37]

    using a single candidate.16 We use top-p sampling withp= 0.95, with a temperature of 0.2 for pass@1 and 0.8 for pass@10 and pass@100. We compare our I NCODER -6.7B model to models from past work (which have all been left-to- right only) in Table 11, giving the model size and training data summary statistics as reported (or estimated, in cases when a paper...

  38. [38]

    ""Count the number of occurrences of each word in the file

    performed best, we found that our model did not benefit from additional examples in the prompt, which we attribute to much smaller size of our model (6.7B, versus 137B or 540B parameters) and the sensitivity of in-context learning to model scale. 21 Published as a conference paper at ICLR 2023 D E XAMPLES D.1 M ETADATA EXAMPLES <| = = |> from import def wi...