CodeXGLUE: A Machine Learning Benchmark Dataset for Code Understanding and Generation
Pith reviewed 2026-05-15 11:35 UTC · model grok-4.3
The pith
CodeXGLUE introduces a benchmark with 10 tasks across 14 datasets for code understanding and generation.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
CodeXGLUE is a benchmark comprising 10 tasks across 14 datasets, a platform for model evaluation and comparison, and three baseline systems: a BERT-style model, a GPT-style model, and an Encoder-Decoder model.
What carries the argument
The CodeXGLUE benchmark, which standardizes 10 tasks over 14 datasets and provides an evaluation platform with baseline models for code-related machine learning.
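To make the benchmark's shape concrete, here is a minimal sketch of loading one CodeXGLUE task (defect detection) through the Hugging Face `datasets` library. The dataset ID and field names follow the community mirror of CodeXGLUE on the Hugging Face Hub and should be treated as assumptions, not details stated in the paper.

```python
# Minimal sketch: loading one CodeXGLUE task via Hugging Face `datasets`.
# The dataset ID below is the community mirror of the defect-detection task;
# treat the exact ID and field names as assumptions.
from datasets import load_dataset

dataset = load_dataset("code_x_glue_cc_defect_detection")

# Standard splits come pre-defined, so every model sees the same data.
print(dataset)                    # train / validation / test
example = dataset["train"][0]
print(example["func"][:200])      # a C function body
print(example["target"])          # True if the function is labeled defective
```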
If this is right
- Researchers can evaluate and compare models on code tasks using a single platform without assembling datasets individually.
- New methods for code understanding can be tested against established baselines such as BERT-style models (a minimal fine-tuning sketch follows this list).
- The benchmark supports progress in both understanding existing code and generating new code through standardized tasks.
- Development of machine learning tools for programming benefits from consistent metrics across multiple datasets.
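As the second point above anticipates, a baseline comparison can be assembled from off-the-shelf tooling. The sketch below fine-tunes CodeBERT, a BERT-style model of the kind the benchmark ships as a baseline, on defect detection; the checkpoint name is real, but the dataset ID (same assumption as above) and the hyperparameters are illustrative, not the paper's official settings.

```python
# Sketch: fine-tune a BERT-style baseline (CodeBERT) on defect detection.
# Hyperparameters are illustrative assumptions, not the paper's settings.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tok = AutoTokenizer.from_pretrained("microsoft/codebert-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "microsoft/codebert-base", num_labels=2)

ds = load_dataset("code_x_glue_cc_defect_detection")  # assumed mirror ID

def encode(batch):
    enc = tok(batch["func"], truncation=True, max_length=512)
    enc["labels"] = [int(t) for t in batch["target"]]  # bool -> 0/1
    return enc

ds = ds.map(encode, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="codebert-defect",
                           per_device_train_batch_size=8,
                           num_train_epochs=1),
    train_dataset=ds["train"],
    eval_dataset=ds["validation"],
    tokenizer=tok,  # enables dynamic padding via the default collator
)
trainer.train()
print(trainer.evaluate())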
Where Pith is reading between the lines
- Adoption of this benchmark could shift focus toward practical applications in automated software engineering.
- Future work might expand the tasks to cover additional programming languages or more complex real-world scenarios.
- Performance improvements on these tasks may translate to better AI-assisted coding tools in practice.
- Comparison across models could highlight which architectures suit specific code problems best.
Load-bearing premise
The chosen 10 tasks and 14 datasets capture the essential challenges in code understanding and generation.
What would settle it
Finding that models which perform similarly within the benchmark perform very differently on a new code task outside it would indicate the task selection is not representative (a sketch of this check follows).
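One way to run that test in practice: collect each model's mean benchmark score and its score on a task outside CodeXGLUE, then check whether the two rankings agree. All scores below are hypothetical placeholders.

```python
# Sketch of the settling test: compare model rankings inside the benchmark
# with rankings on an outside task. All scores are hypothetical placeholders.
from scipy.stats import spearmanr

benchmark_scores = {"model_a": 0.71, "model_b": 0.64, "model_c": 0.58}
external_scores  = {"model_a": 0.40, "model_b": 0.52, "model_c": 0.49}

models = sorted(benchmark_scores)
rho, p = spearmanr([benchmark_scores[m] for m in models],
                   [external_scores[m] for m in models])

# A low or negative rho would suggest the task selection is not representative.
print(f"Spearman rho = {rho:.2f} (p = {p:.2f})")
```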
Original abstract
Benchmark datasets have a significant impact on accelerating research in programming language tasks. In this paper, we introduce CodeXGLUE, a benchmark dataset to foster machine learning research for program understanding and generation. CodeXGLUE includes a collection of 10 tasks across 14 datasets and a platform for model evaluation and comparison. CodeXGLUE also features three baseline systems, including the BERT-style, GPT-style, and Encoder-Decoder models, to make it easy for researchers to use the platform. The availability of such data and baselines can help the development and validation of new methods that can be applied to various program understanding and generation problems.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces CodeXGLUE, a benchmark dataset for machine learning research on program understanding and generation. It consists of 10 tasks spanning 14 datasets, three baseline models (BERT-style, GPT-style, and Encoder-Decoder), and an evaluation platform to support model comparison and development of new methods.
Significance. If the 10 tasks and 14 datasets prove representative, the benchmark and baselines could accelerate research by enabling standardized, reproducible evaluations across code-related tasks; the explicit release of data and baselines is a concrete strength that lowers barriers for follow-on work.
Major comments (2)
- [Abstract and task-listing section] The central claim that the 10 tasks across 14 datasets cover program understanding and generation rests on an unverified assumption of representativeness. No selection criteria, task taxonomy, language-coverage metrics, difficulty distribution, or gap analysis against the broader problem space are supplied.
- [Baseline systems section] Baseline descriptions are given at a high level only; the manuscript supplies neither implementation details sufficient for reproduction nor any quantitative performance numbers, error analysis, or dataset statistics that would allow readers to assess baseline quality or benchmark difficulty.
Minor comments (1)
- [Datasets section] Notation for the 14 datasets is introduced without a consolidated table listing sources, sizes, languages, and splits.
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive feedback. The comments highlight important areas for improving the clarity of our benchmark's scope and the reproducibility of the baselines. We have revised the manuscript accordingly.
Point-by-point responses
Referee: [Abstract and task-listing section] The central claim that the 10 tasks across 14 datasets cover program understanding and generation rests on an unverified assumption of representativeness. No selection criteria, task taxonomy, language-coverage metrics, difficulty distribution, or gap analysis versus the broader problem space are supplied (see abstract and the task-listing section).
Authors: The 10 tasks were selected to represent core challenges in code understanding (defect detection, clone detection, code search, code summarization) and generation (code completion, code translation, code generation from text), drawing from widely studied problems in the software engineering and NLP literature. We acknowledge that the original submission did not explicitly articulate selection criteria or provide a taxonomy. In the revised manuscript we have added a new subsection to the task listing that states the selection rationale, notes primary language coverage (Java, Python, C#), and briefly discusses coverage gaps (e.g., limited low-resource languages and absence of certain verification tasks). A full formal taxonomy and exhaustive gap analysis would require a separate survey and are beyond the scope of this benchmark paper. Revision: yes.
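For readers who want the stated rationale at a glance, the grouping from this response can be written out directly; only the seven tasks named above are listed, and the layout is illustrative rather than the revised manuscript's actual taxonomy.

```python
# The selection rationale from the rebuttal, written out as a small taxonomy.
# Task names come from the response (7 of the 10); the layout is illustrative.
CODEXGLUE_TAXONOMY = {
    "understanding": [
        "defect detection", "clone detection",
        "code search", "code summarization",
    ],
    "generation": [
        "code completion", "code translation",
        "code generation from text",
    ],
}
PRIMARY_LANGUAGES = ["Java", "Python", "C#"]
```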
Referee: [Baseline systems section] Baseline descriptions are given at a high level only; the manuscript supplies neither implementation details sufficient for reproduction nor any quantitative performance numbers, error analysis, or dataset statistics that would allow readers to assess baseline quality or benchmark difficulty.
Authors: The full training and evaluation code for the three baseline families has been released on the CodeXGLUE GitHub repository and evaluation platform to support reproduction. We agree that the paper itself should contain more concrete information. The revised version expands the baseline section with implementation details (model architectures, hyperparameters, training regimes), reports quantitative performance numbers for each baseline across all tasks, includes basic dataset statistics (sizes, token distributions), and adds a short error analysis section that highlights common failure modes observed in the baselines. These additions make it easier for readers to gauge benchmark difficulty. Revision: yes.
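The kind of dataset statistics the revision promises can be reproduced in a few lines. The sketch below reports split sizes and a rough token-length distribution for one task, again assuming the community dataset mirror; whitespace splitting is a crude proxy for the paper's actual tokenization.

```python
# Sketch: split sizes and a rough token-length distribution for one task.
# Whitespace splitting is a crude stand-in for the paper's real tokenizer.
import statistics
from datasets import load_dataset

ds = load_dataset("code_x_glue_cc_defect_detection")  # assumed mirror ID

for split in ds:
    lengths = [len(ex["func"].split()) for ex in ds[split]]
    print(f"{split:>10}: n={len(lengths):>6}  "
          f"median tokens={statistics.median(lengths):.0f}  "
          f"max={max(lengths)}")
```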
Circularity Check
No circularity; benchmark release contains no derivations or predictions
Full rationale
The paper introduces CodeXGLUE as a collection of 10 tasks across 14 datasets plus baselines and an evaluation platform. No equations, fitted parameters, predictions, or first-principles derivations are present anywhere in the manuscript. Task and dataset selection is stated directly without any claimed methodology that reduces to self-definition, self-citation chains, or renaming of prior results. The work is a descriptive data release whose central claim is the existence and availability of the benchmark itself; this claim does not rely on any internal reduction to its own inputs.
Axiom & Free-Parameter Ledger
Forward citations
Cited by 20 Pith papers
- SWE-bench: Can Language Models Resolve Real-World GitHub Issues?
  SWE-bench reveals that even top language models like Claude 2 resolve only 1.96% of 2,294 real-world GitHub issues, highlighting a gap in practical coding capabilities.
- RepoDoc: A Knowledge Graph-Based Framework to Automatic Documentation Generation and Incremental Updates
  RepoDoc uses a repository knowledge graph with module clustering and semantic impact propagation to generate more complete documentation 3x faster with 85% fewer tokens and handle incremental updates 73% faster than p...
- SWE-QA: A Dataset and Benchmark for Complex Code Understanding
  SWE-QA is a new benchmark of 9,072 questions testing multi-hop code comprehension from 12 Python projects, where the best of 15 evaluated models reaches only 74.41% accuracy.
- Beyond Output Correctness: Benchmarking and Evaluating Large Language Model Reasoning in Coding Tasks
  CodeRQ-Bench and VERA enable evaluation of LLM reasoning quality in coding tasks beyond output correctness, with VERA improving AUCROC by up to 0.26 over baselines.
- Can LLMs Deobfuscate Binary Code? A Systematic Analysis of Large Language Models into Pseudocode Deobfuscation
  LLM deobfuscation of binaries to pseudocode depends more on reasoning ability and task-specific fine-tuning than on model size, with reasoning models showing robustness across ISAs and obfuscation levels on the new Bi...
- Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation
  EvalPlus augments HumanEval with 80x more tests via LLM and mutation strategies, exposing up to 28.9% more incorrect LLM-generated code and reversing some model performance rankings.
- gwBenchmarks: Stress-Testing LLM Agents on High-Precision Gravitational Wave Astronomy
  LLM coding agents cannot reach the 10^{-4} relative accuracy required for gravitational wave modeling tasks and show systematic failures including metric misuse and result fabrication.
- VulStyle: A Multi-Modal Pre-Training for Code Stylometry-Augmented Vulnerability Detection
  VulStyle pre-trains on 4.9M functions using code, non-terminal ASTs, and stylometry features, then fine-tunes to achieve SOTA F1 gains of 4-48% on BigVul and VulDeePecker.
- Does Pass Rate Tell the Whole Story? Evaluating Design Constraint Compliance in LLM-based Issue Resolution
  LLM agents resolve fewer than half of issues while satisfying design constraints despite passing tests, as shown by a benchmark of 495 issues and 1787 constraints from six repositories.
- Sustainability Analysis of Prompt Strategies for SLM-based Automated Test Generation
  Prompt strategies for SLM-based automated test generation vary widely in energy consumption and carbon emissions, with simpler strategies delivering competitive coverage at markedly lower environmental cost.
- CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation
  CodeT5 adds identifier-aware pre-training and bimodal dual generation to a T5-style encoder-decoder, yielding better results on defect detection, clone detection, and code-to-text, text-to-code, and code-to-code tasks...
- Not All RAGs Are Created Equal: A Component-Wise Empirical Study for Software Engineering Tasks
  Retriever-side choices, particularly the retrieval algorithm, exert more influence on RAG performance than generator selection across code generation, summarization, and repair tasks.
- The Readability Spectrum: Patterns, Issues, and Prompt Effects in LLM-Generated Code
  LLM-generated code matches human-written code in overall readability but exhibits different issue patterns, and prompt engineering has limited impact on improving it.
- Neural Code Translation of Legacy Code: APL to C#
  Guided LLM strategies with custom datasets and execution-based verification enable functional APL-to-C# translation across a range of program complexities.
- Towards Automated Pentesting with Large Language Models
  RedShell fine-tunes LLMs on enhanced malicious PowerShell data to produce syntactically valid offensive code for pentesting, reporting over 90% validity, strong semantic match to references, and better edit-distance s...
- StarCoder: may the source be with you!
  StarCoderBase matches or beats OpenAI's code-cushman-001 on multi-language code benchmarks; the Python-fine-tuned StarCoder reaches 40% pass@1 on HumanEval while retaining other-language performance.
- Large Language Models for Multilingual Code Intelligence: A Survey
  A survey of methods, benchmarks, and open challenges for large language models in multilingual code generation and translation.
- Evaluation of LLM-Based Software Engineering Tools: Practices, Challenges, and Future Directions
  LLM-based SE tools lack stable ground truth and deterministic outputs, making standard evaluation assumptions invalid and requiring new approaches for reliable assessment.
- RedShell: A Generative AI-Based Approach to Ethical Hacking
  RedShell fine-tunes LLMs on a custom dataset of public code samples to generate syntactically valid PowerShell scripts with semantic similarity to references, reporting under 10% parse errors and over 50%/40% mean sim...
- A Survey on Large Language Models for Code Generation
  A systematic literature review that organizes recent work on LLMs for code generation into a taxonomy covering data curation, model advances, evaluations, ethics, environmental impact, and applications, with benchmark...