CodeXGLUE: A Machine Learning Benchmark Dataset for Code Understanding and Generation
Pith reviewed 2026-05-15 11:35 UTC · model grok-4.3
The pith
CodeXGLUE introduces a benchmark with 10 tasks across 14 datasets for code understanding and generation.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
CodeXGLUE is a benchmark comprising 10 tasks across 14 datasets, a platform for model evaluation and comparison, and three baseline systems: a BERT-style model, a GPT-style model, and an Encoder-Decoder model.
What carries the argument
The CodeXGLUE benchmark, which standardizes 10 tasks over 14 datasets and provides an evaluation platform with baseline models for code-related machine learning.
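To make the benchmark's shape concrete, here is a minimal sketch of loading one CodeXGLUE task (defect detection) through the Hugging Face `datasets` library. The dataset ID and field names follow the community mirror of CodeXGLUE on the Hugging Face Hub and should be treated as assumptions, not details stated in the paper.

```python
# Minimal sketch: loading one CodeXGLUE task via Hugging Face `datasets`.
# The dataset ID below is the community mirror of the defect-detection task;
# treat the exact ID and field names as assumptions.
from datasets import load_dataset

dataset = load_dataset("code_x_glue_cc_defect_detection")

# Standard splits come pre-defined, so every model sees the same data.
print(dataset)                    # train / validation / test
example = dataset["train"][0]
print(example["func"][:200])      # a C function body
print(example["target"])          # True if the function is labeled defective
```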
If this is right
- Researchers can evaluate and compare models on code tasks using a single platform without assembling datasets individually.
- New methods for code understanding can be tested against established baselines such as BERT-style models (a minimal fine-tuning sketch follows this list).
- The benchmark supports progress in both understanding existing code and generating new code through standardized tasks.
- Development of machine learning tools for programming benefits from consistent metrics across multiple datasets.
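As the second point above anticipates, a baseline comparison can be assembled from off-the-shelf tooling. The sketch below fine-tunes CodeBERT, a BERT-style model of the kind the benchmark ships as a baseline, on defect detection; the checkpoint name is real, but the dataset ID (same assumption as above) and the hyperparameters are illustrative, not the paper's official settings.

```python
# Sketch: fine-tune a BERT-style baseline (CodeBERT) on defect detection.
# Hyperparameters are illustrative assumptions, not the paper's settings.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tok = AutoTokenizer.from_pretrained("microsoft/codebert-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "microsoft/codebert-base", num_labels=2)

ds = load_dataset("code_x_glue_cc_defect_detection")  # assumed mirror ID

def encode(batch):
    enc = tok(batch["func"], truncation=True, max_length=512)
    enc["labels"] = [int(t) for t in batch["target"]]  # bool -> 0/1
    return enc

ds = ds.map(encode, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="codebert-defect",
                           per_device_train_batch_size=8,
                           num_train_epochs=1),
    train_dataset=ds["train"],
    eval_dataset=ds["validation"],
    tokenizer=tok,  # enables dynamic padding via the default collator
)
trainer.train()
print(trainer.evaluate())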
Where Pith is reading between the lines
- Adoption of this benchmark could shift focus toward practical applications in automated software engineering.
- Future work might expand the tasks to cover additional programming languages or more complex real-world scenarios.
- Performance improvements on these tasks may translate to better AI-assisted coding tools in practice.
- Comparison across models could highlight which architectures suit specific code problems best.
Load-bearing premise
The chosen 10 tasks and 14 datasets capture the essential challenges in code understanding and generation.
What would settle it
Finding that models which perform similarly within the benchmark perform very differently on a new code task outside it would indicate the task selection is not representative (a sketch of this check follows).
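One way to run that test in practice: collect each model's mean benchmark score and its score on a task outside CodeXGLUE, then check whether the two rankings agree. All scores below are hypothetical placeholders.

```python
# Sketch of the settling test: compare model rankings inside the benchmark
# with rankings on an outside task. All scores are hypothetical placeholders.
from scipy.stats import spearmanr

benchmark_scores = {"model_a": 0.71, "model_b": 0.64, "model_c": 0.58}
external_scores  = {"model_a": 0.40, "model_b": 0.52, "model_c": 0.49}

models = sorted(benchmark_scores)
rho, p = spearmanr([benchmark_scores[m] for m in models],
                   [external_scores[m] for m in models])

# A low or negative rho would suggest the task selection is not representative.
print(f"Spearman rho = {rho:.2f} (p = {p:.2f})")
```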
Original abstract
Benchmark datasets have a significant impact on accelerating research in programming language tasks. In this paper, we introduce CodeXGLUE, a benchmark dataset to foster machine learning research for program understanding and generation. CodeXGLUE includes a collection of 10 tasks across 14 datasets and a platform for model evaluation and comparison. CodeXGLUE also features three baseline systems, including the BERT-style, GPT-style, and Encoder-Decoder models, to make it easy for researchers to use the platform. The availability of such data and baselines can help the development and validation of new methods that can be applied to various program understanding and generation problems.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces CodeXGLUE, a benchmark dataset for machine learning research on program understanding and generation. It consists of 10 tasks spanning 14 datasets, three baseline models (BERT-style, GPT-style, and Encoder-Decoder), and an evaluation platform to support model comparison and development of new methods.
Significance. If the 10 tasks and 14 datasets prove representative, the benchmark and baselines could accelerate research by enabling standardized, reproducible evaluations across code-related tasks; the explicit release of data and baselines is a concrete strength that lowers barriers for follow-on work.
Major comments (2)
- [Abstract and task-listing section] The central claim that the 10 tasks across 14 datasets cover program understanding and generation rests on an unverified assumption of representativeness. No selection criteria, task taxonomy, language-coverage metrics, difficulty distribution, or gap analysis against the broader problem space are supplied.
- [Baseline systems section] Baseline descriptions are given at a high level only; the manuscript supplies neither implementation details sufficient for reproduction nor any quantitative performance numbers, error analysis, or dataset statistics that would allow readers to assess baseline quality or benchmark difficulty.
Minor comments (1)
- [Datasets section] Notation for the 14 datasets is introduced without a consolidated table listing sources, sizes, languages, and splits.
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive feedback. The comments highlight important areas for improving the clarity of our benchmark's scope and the reproducibility of the baselines. We have revised the manuscript accordingly.
Point-by-point responses
Referee: [Abstract and task-listing section] The central claim that the 10 tasks across 14 datasets cover program understanding and generation rests on an unverified assumption of representativeness. No selection criteria, task taxonomy, language-coverage metrics, difficulty distribution, or gap analysis versus the broader problem space are supplied (see abstract and the task-listing section).
Authors: The 10 tasks were selected to represent core challenges in code understanding (defect detection, clone detection, code search, code summarization) and generation (code completion, code translation, code generation from text), drawing from widely studied problems in the software engineering and NLP literature. We acknowledge that the original submission did not explicitly articulate selection criteria or provide a taxonomy. In the revised manuscript we have added a new subsection to the task listing that states the selection rationale, notes primary language coverage (Java, Python, C#), and briefly discusses coverage gaps (e.g., limited low-resource languages and absence of certain verification tasks). A full formal taxonomy and exhaustive gap analysis would require a separate survey and are beyond the scope of this benchmark paper. Revision: yes.
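For readers who want the stated rationale at a glance, the grouping from this response can be written out directly; only the seven tasks named above are listed, and the layout is illustrative rather than the revised manuscript's actual taxonomy.

```python
# The selection rationale from the rebuttal, written out as a small taxonomy.
# Task names come from the response (7 of the 10); the layout is illustrative.
CODEXGLUE_TAXONOMY = {
    "understanding": [
        "defect detection", "clone detection",
        "code search", "code summarization",
    ],
    "generation": [
        "code completion", "code translation",
        "code generation from text",
    ],
}
PRIMARY_LANGUAGES = ["Java", "Python", "C#"]
```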
Referee: [Baseline systems section] Baseline descriptions are given at a high level only; the manuscript supplies neither implementation details sufficient for reproduction nor any quantitative performance numbers, error analysis, or dataset statistics that would allow readers to assess baseline quality or benchmark difficulty.
Authors: The full training and evaluation code for the three baseline families has been released on the CodeXGLUE GitHub repository and evaluation platform to support reproduction. We agree that the paper itself should contain more concrete information. The revised version expands the baseline section with implementation details (model architectures, hyperparameters, training regimes), reports quantitative performance numbers for each baseline across all tasks, includes basic dataset statistics (sizes, token distributions), and adds a short error analysis section that highlights common failure modes observed in the baselines. These additions make it easier for readers to gauge benchmark difficulty. Revision: yes.
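The kind of dataset statistics the revision promises can be reproduced in a few lines. The sketch below reports split sizes and a rough token-length distribution for one task, again assuming the community dataset mirror; whitespace splitting is a crude proxy for the paper's actual tokenization.

```python
# Sketch: split sizes and a rough token-length distribution for one task.
# Whitespace splitting is a crude stand-in for the paper's real tokenizer.
import statistics
from datasets import load_dataset

ds = load_dataset("code_x_glue_cc_defect_detection")  # assumed mirror ID

for split in ds:
    lengths = [len(ex["func"].split()) for ex in ds[split]]
    print(f"{split:>10}: n={len(lengths):>6}  "
          f"median tokens={statistics.median(lengths):.0f}  "
          f"max={max(lengths)}")
```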
Circularity Check
No circularity; benchmark release contains no derivations or predictions
Full rationale
The paper introduces CodeXGLUE as a collection of 10 tasks across 14 datasets plus baselines and an evaluation platform. No equations, fitted parameters, predictions, or first-principles derivations are present anywhere in the manuscript. Task and dataset selection is stated directly without any claimed methodology that reduces to self-definition, self-citation chains, or renaming of prior results. The work is a descriptive data release whose central claim is the existence and availability of the benchmark itself; this claim does not rely on any internal reduction to its own inputs.
Axiom & Free-Parameter Ledger
Forward citations
Cited by 20 Pith papers
- SWE-bench: Can Language Models Resolve Real-World GitHub Issues?
  SWE-bench reveals that even top language models like Claude 2 resolve only 1.96% of 2,294 real-world GitHub issues, highlighting a gap in practical coding capabilities.
- RepoDoc: A Knowledge Graph-Based Framework to Automatic Documentation Generation and Incremental Updates
  RepoDoc uses a repository knowledge graph with module clustering and semantic impact propagation to generate more complete documentation 3x faster with 85% fewer tokens and handle incremental updates 73% faster than p...
- SWE-QA: A Dataset and Benchmark for Complex Code Understanding
  SWE-QA is a new benchmark of 9,072 questions testing multi-hop code comprehension from 12 Python projects, where the best of 15 evaluated models reaches only 74.41% accuracy.
- Beyond Output Correctness: Benchmarking and Evaluating Large Language Model Reasoning in Coding Tasks
  CodeRQ-Bench and VERA enable evaluation of LLM reasoning quality in coding tasks beyond output correctness, with VERA improving AUCROC by up to 0.26 over baselines.
- Can LLMs Deobfuscate Binary Code? A Systematic Analysis of Large Language Models into Pseudocode Deobfuscation
  LLM deobfuscation of binaries to pseudocode depends more on reasoning ability and task-specific fine-tuning than on model size, with reasoning models showing robustness across ISAs and obfuscation levels on the new Bi...
- Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation
  EvalPlus augments HumanEval with 80x more tests via LLM and mutation strategies, exposing up to 28.9% more incorrect LLM-generated code and reversing some model performance rankings.
- gwBenchmarks: Stress-Testing LLM Agents on High-Precision Gravitational Wave Astronomy
  LLM coding agents cannot reach the 10^{-4} relative accuracy required for gravitational wave modeling tasks and show systematic failures including metric misuse and result fabrication.
- VulStyle: A Multi-Modal Pre-Training for Code Stylometry-Augmented Vulnerability Detection
  VulStyle pre-trains on 4.9M functions using code, non-terminal ASTs, and stylometry features, then fine-tunes to achieve SOTA F1 gains of 4-48% on BigVul and VulDeePecker.
- Does Pass Rate Tell the Whole Story? Evaluating Design Constraint Compliance in LLM-based Issue Resolution
  LLM agents resolve fewer than half of issues while satisfying design constraints despite passing tests, as shown by a benchmark of 495 issues and 1787 constraints from six repositories.
- Sustainability Analysis of Prompt Strategies for SLM-based Automated Test Generation
  Prompt strategies for SLM-based automated test generation vary widely in energy consumption and carbon emissions, with simpler strategies delivering competitive coverage at markedly lower environmental cost.
- CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation
  CodeT5 adds identifier-aware pre-training and bimodal dual generation to a T5-style encoder-decoder, yielding better results on defect detection, clone detection, and code-to-text, text-to-code, and code-to-code tasks...
- Not All RAGs Are Created Equal: A Component-Wise Empirical Study for Software Engineering Tasks
  Retriever-side choices, particularly the retrieval algorithm, exert more influence on RAG performance than generator selection across code generation, summarization, and repair tasks.
- The Readability Spectrum: Patterns, Issues, and Prompt Effects in LLM-Generated Code
  LLM-generated code matches human-written code in overall readability but exhibits different issue patterns, and prompt engineering has limited impact on improving it.
- Neural Code Translation of Legacy Code: APL to C#
  Guided LLM strategies with custom datasets and execution-based verification enable functional APL-to-C# translation across a range of program complexities.
- Towards Automated Pentesting with Large Language Models
  RedShell fine-tunes LLMs on enhanced malicious PowerShell data to produce syntactically valid offensive code for pentesting, reporting over 90% validity, strong semantic match to references, and better edit-distance s...
- StarCoder: may the source be with you!
  StarCoderBase matches or beats OpenAI's code-cushman-001 on multi-language code benchmarks; the Python-fine-tuned StarCoder reaches 40% pass@1 on HumanEval while retaining other-language performance.
- Large Language Models for Multilingual Code Intelligence: A Survey
  A survey of methods, benchmarks, and open challenges for large language models in multilingual code generation and translation.
- Evaluation of LLM-Based Software Engineering Tools: Practices, Challenges, and Future Directions
  LLM-based SE tools lack stable ground truth and deterministic outputs, making standard evaluation assumptions invalid and requiring new approaches for reliable assessment.
- RedShell: A Generative AI-Based Approach to Ethical Hacking
  RedShell fine-tunes LLMs on a custom dataset of public code samples to generate syntactically valid PowerShell scripts with semantic similarity to references, reporting under 10% parse errors and over 50%/40% mean sim...
- A Survey on Large Language Models for Code Generation
  A systematic literature review that organizes recent work on LLMs for code generation into a taxonomy covering data curation, model advances, evaluations, ethics, environmental impact, and applications, with benchmark...