arxiv: 2105.09938 · v3 · submitted 2021-05-20 · cs.SE · cs.CL · cs.LG

Recognition: 2 theorem links

Measuring Coding Challenge Competence With APPS

Dan Hendrycks, Steven Basart, Saurav Kadavath, Mantas Mazeika, Akul Arora, Ethan Guo, Collin Burns, Samir Puranik, Horace He, Dawn Song, Jacob Steinhardt

Pith reviewed 2026-05-11 17:05 UTC · model grok-4.3

classification: cs.SE, cs.CL, cs.LG

keywords: code generation, benchmark, machine learning, programming problems, Python, test cases, natural language specification

The pith

The APPS benchmark shows machine learning models are beginning to learn coding by passing roughly 20 percent of test cases on introductory problems.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces APPS, a benchmark of 10,000 coding problems that tests whether models can translate natural language problem descriptions into correct Python code. Models are scored by how well their generated code passes hidden test cases, similar to how companies evaluate developers. The authors fine-tune large language models and observe that syntax errors become rarer as models improve. They report that models like GPT-Neo pass roughly 20 percent of the test cases on introductory problems. This suggests that machine learning is making initial progress on the broad skill of programming.

Core claim

APPS contains 10,000 problems ranging from simple one-line solutions to substantial algorithmic challenges. By evaluating generated code on test cases, the benchmark finds that recent models pass approximately 20% of the test cases on introductory problems. The prevalence of syntax errors decreases exponentially with model improvements after fine-tuning on GitHub and the training set.
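
The syntax-error measurement mentioned above can be approximated with Python's own parser. The following is a minimal sketch, not the authors' tooling: the helper names and the four sample programs are hypothetical, and a sample counts as a syntax error whenever `ast.parse` rejects it.

```python
import ast


def has_syntax_error(source: str) -> bool:
    """Return True if `source` cannot be parsed as Python."""
    try:
        ast.parse(source)
    except SyntaxError:  # also covers IndentationError, a subclass
        return True
    return False


def syntax_error_rate(samples: list[str]) -> float:
    """Fraction of generated programs that fail to parse."""
    if not samples:
        raise ValueError("need at least one sample")
    return sum(has_syntax_error(s) for s in samples) / len(samples)


# Hypothetical model outputs, invented for illustration only.
samples = [
    "print(sum(map(int, input().split())))",
    "for i in range(3) print(i)",        # missing colon
    "def f(x):\n    return x * 2",
    "print('unterminated",               # unterminated string literal
]
rate = syntax_error_rate(samples)
print(f"syntax error rate: {rate:.2f}")
```

Tracking this rate across model sizes or checkpoints is one way to look for the kind of trend the paper reports; the sketch says nothing about whether parseable code is also correct.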

What carries the argument

The APPS benchmark, which evaluates code generation models by executing their Python outputs against hidden test cases that check natural language problem specifications.
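
To make the scoring idea concrete, here is a minimal sketch of test-case grading, assuming each problem supplies (stdin, expected stdout) pairs. The function names, the sample problem, and both candidate programs are hypothetical; the sketch runs code in-process with no sandbox or timeout, which a real harness would need.

```python
import contextlib
import io
import sys
from typing import Optional


def run_program(source: str, stdin_text: str) -> Optional[str]:
    """Run `source` with `stdin_text` as stdin; return stdout, or None on error.

    Illustration only: exec() on untrusted code is unsafe, and there is
    no timeout or isolation here.
    """
    captured = io.StringIO()
    old_stdin = sys.stdin
    sys.stdin = io.StringIO(stdin_text)
    try:
        with contextlib.redirect_stdout(captured):
            exec(compile(source, "<candidate>", "exec"), {"__name__": "__main__"})
    except Exception:  # syntax errors and runtime errors both count as failures
        return None
    finally:
        sys.stdin = old_stdin
    return captured.getvalue()


def fraction_passed(source: str, tests: list[tuple[str, str]]) -> float:
    """Fraction of test cases whose output matches, ignoring outer whitespace."""
    passed = 0
    for stdin_text, expected in tests:
        actual = run_program(source, stdin_text)
        if actual is not None and actual.strip() == expected.strip():
            passed += 1
    return passed / len(tests)


# Hypothetical problem: read two integers and print their sum.
tests = [("1 2\n", "3\n"), ("10 -4\n", "6\n"), ("0 0\n", "0\n"), ("-5 -5\n", "-10\n")]
correct = "a, b = map(int, input().split())\nprint(a + b)"
buggy = "a, b = map(int, input().split())\nprint(abs(a) + abs(b))"

print(fraction_passed(correct, tests))  # all four cases pass
print(fraction_passed(buggy, tests))    # only the non-negative cases pass
```

Averaging `fraction_passed` over problems gives a "percentage of test cases passed" figure of the kind quoted above. A partially wrong program can still earn credit under this measure, which is why the headline number should not be read as the share of problems solved.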

Load-bearing premise

Success on the provided test cases for each problem means the generated code satisfies the original natural language specification.

What would settle it

A model that passes all test cases on a problem yet produces code that fails to match the natural language intent on some untested input, or sustained inability of models to exceed low single-digit percentages even after larger-scale training.
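
The first failure mode can be illustrated with a toy case. In this sketch the specification ("decide whether n is prime"), the test suite, and both functions are invented for illustration and do not come from APPS: the candidate passes every supplied test yet disagrees with the specification on inputs the suite never exercises.

```python
def reference_is_prime(n: int) -> bool:
    """Trial division; stands in for the natural language specification."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True


def candidate_is_prime(n: int) -> bool:
    """Flawed candidate: only rules out multiples of 2 and 3."""
    if n < 2:
        return False
    if n in (2, 3):
        return True
    return n % 2 != 0 and n % 3 != 0


# Hypothetical sparse test suite attached to the problem.
supplied_tests = [(1, False), (2, True), (3, True), (4, False),
                  (7, True), (9, False), (13, True)]

passes_all = all(candidate_is_prime(n) == expected for n, expected in supplied_tests)
counterexamples = [n for n in range(1, 50)
                   if candidate_is_prime(n) != reference_is_prime(n)]

print(passes_all)        # True
print(counterexamples)   # [25, 35, 49]
```

A benchmark score built on the supplied tests alone would count this candidate as fully correct, which is the gap the load-bearing premise has to cover.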

Original abstract

While programming is one of the most broadly applicable skills in modern society, modern machine learning models still cannot code solutions to basic problems. Despite its importance, there has been surprisingly little work on evaluating code generation, and it can be difficult to accurately assess code generation performance rigorously. To meet this challenge, we introduce APPS, a benchmark for code generation. Unlike prior work in more restricted settings, our benchmark measures the ability of models to take an arbitrary natural language specification and generate satisfactory Python code. Similar to how companies assess candidate software developers, we then evaluate models by checking their generated code on test cases. Our benchmark includes 10,000 problems, which range from having simple one-line solutions to being substantial algorithmic challenges. We fine-tune large language models on both GitHub and our training set, and we find that the prevalence of syntax errors is decreasing exponentially as models improve. Recent models such as GPT-Neo can pass approximately 20% of the test cases of introductory problems, so we find that machine learning models are now beginning to learn how to code. As the social significance of automatic code generation increases over the coming years, our benchmark can provide an important measure for tracking advancements.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

APPS is a useful new benchmark for natural-language code generation at scale, but the 20% pass-rate claim overreaches without stronger evidence that the test suites actually verify the specs.

The paper introduces APPS, a collection of 10,000 problems where models must turn arbitrary natural-language descriptions into Python code that passes provided test cases. Problems range from one-liners to full algorithmic tasks. This setup is new in its combination of scale, open-ended specs, and automatic test-based grading. Earlier code-generation evaluations were narrower, often using fixed templates or small hand-checked sets.

The authors also show that fine-tuning on GitHub plus their training data cuts syntax errors exponentially and that GPT-Neo clears roughly 20% of the test cases on the easiest problems. Those are concrete, reproducible numbers that give the field a shared yardstick. The benchmark and evaluation protocol are defined independently of any model, which keeps the circularity burden low. The work is honest about its scope and supplies baselines that others can build on.

The soft spot is the leap from “20% of test cases pass on intro problems” to “models are now beginning to learn how to code.” The abstract gives no audit of test-suite completeness, no count of how often passing code still fails on plausible unseen inputs that obey the spec, and no check for correlation between test cases and common training patterns. If the suites are sparse or biased toward easy cases, the headline number can be reached without general competence. That concern is real and load-bearing for the interpretation, even if the raw benchmark numbers hold up.

The paper is aimed at researchers who need a practical, automatically scored measure of code generation progress. Anyone tracking when these systems might become useful for real software tasks will find the dataset and protocol worth looking at. It is coherent on its own terms and deserves a serious referee who can press on the test-coverage details and ask for more ablations. I would send it to review rather than desk-reject.

Referee Report

2 major / 2 minor

Summary. The paper introduces the APPS benchmark consisting of 10,000 natural language programming problems paired with test cases to evaluate code generation from arbitrary specifications. Models are fine-tuned on GitHub and APPS training data; the authors report an exponential reduction in syntax errors with improving models and that GPT-Neo passes approximately 20% of test cases on introductory problems, concluding that machine learning models are beginning to learn how to code.

Significance. The creation of a large-scale benchmark with problems spanning simple one-line solutions to substantial algorithmic challenges is a valuable contribution for tracking progress in code generation. The empirical observation of exponential syntax-error reduction provides a concrete, falsifiable trend. If the test-case protocol is shown to be robust, the 20% pass-rate baseline on introductory problems offers a useful reference point for future work in automatic code synthesis.

major comments (2)

[Abstract and Evaluation] The headline claim that models are 'now beginning to learn how to code' rests on GPT-Neo passing ~20% of test cases for introductory problems. The manuscript supplies no information on data splits, test-case coverage of the natural-language specifications, or statistical controls (e.g., variance across random seeds or runs), leaving the robustness of the reported figure unclear.
[Benchmark description] The central inference that passing the supplied test cases indicates the generated code satisfies the original specification is load-bearing, yet the paper reports no audit of test-suite completeness, no adversarial augmentation of test cases, and no measurement of cases where code passes the given tests but fails on plausible unseen inputs consistent with the specification.

minor comments (2)

[Results] Tables summarizing pass rates broken down by problem difficulty (introductory/interview/competition) would improve readability and allow readers to assess trends more precisely.
[Experiments] A brief discussion of potential data leakage between the GitHub pre-training corpus and the APPS test set would strengthen the experimental protocol.
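
One common way to probe the leakage concern in the last comment is an n-gram overlap check between test problems and training documents. The sketch below is an assumption about how such a check could look, not something the paper or the referee specifies; the function names, the choice of word 5-grams, and the example texts are all hypothetical.

```python
def word_ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """Set of lowercase word n-grams in `text` (empty if the text is too short)."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def containment(test_text: str, corpus_text: str, n: int = 5) -> float:
    """Share of the test text's n-grams that also occur in the corpus text."""
    test_grams = word_ngrams(test_text, n)
    if not test_grams:
        return 0.0
    return len(test_grams & word_ngrams(corpus_text, n)) / len(test_grams)


# Hypothetical test problem and two hypothetical training documents.
test_problem = "Given two integers a and b print their sum"
leaked_doc = "# Given two integers a and b print their sum\nprint(a + b)"
unrelated_doc = "Sort the array in ascending order and print it"

print(containment(test_problem, leaked_doc))     # 1.0
print(containment(test_problem, unrelated_doc))  # 0.0
```

High containment flags a problem for manual inspection; it does not prove leakage, and paraphrased statements would slip past an exact-match check like this one.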

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments and for recognizing the value of the APPS benchmark. We address each major comment point by point below, indicating where revisions will be made to improve clarity and acknowledge limitations.

Referee: [Abstract and Evaluation] The headline claim that models are 'now beginning to learn how to code' rests on GPT-Neo passing ~20% of test cases for introductory problems. The manuscript supplies no information on data splits, test-case coverage of the natural-language specifications, or statistical controls (e.g., variance across random seeds or runs), leaving the robustness of the reported figure unclear.

Authors: We agree that additional details would improve the robustness of the reported results. The data splits are described in Section 3, which specifies the 5,000/5,000 train/test division of the APPS problems. Test cases originate from the source competitive programming platforms and are intended to cover the natural language specifications, though we will add an explicit statement to this effect. For statistical controls, our primary results are from single runs; we will include a brief analysis of variance across random seeds in the revised Evaluation section. We will also update the abstract to include a short qualifier referencing these details. These changes will be incorporated in the next version. revision: yes
Referee: [Benchmark description] The central inference that passing the supplied test cases indicates the generated code satisfies the original specification is load-bearing, yet the paper reports no audit of test-suite completeness, no adversarial augmentation of test cases, and no measurement of cases where code passes the given tests but fails on plausible unseen inputs consistent with the specification.

Authors: We acknowledge that test-case evaluation has inherent limitations and does not constitute a formal proof of correctness for all inputs. The paper does not include an audit of test-suite completeness or adversarial augmentation, as the focus is on establishing the benchmark and initial baselines rather than exhaustive verification. We will add a dedicated paragraph in the Discussion section noting this limitation, clarifying that passing the provided tests is the standard proxy used in code generation research (analogous to human assessment), and suggesting adversarial testing as an avenue for future work. No new empirical measurements will be performed for this revision. revision: partial
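
One simple form the suggested follow-up testing could take is randomized differential testing: generate extra inputs and compare a candidate against a trusted reference solution. This is a sketch under stated assumptions, not a method from the paper. It assumes a reference solution is available, and the toy task (maximum of a list), the function names, the input generator, and the seed are all hypothetical.

```python
import random
from typing import Callable, Optional


def reference_max(values: list[int]) -> int:
    """Trusted reference for the toy task: largest element of a non-empty list."""
    return max(values)


def candidate_max(values: list[int]) -> int:
    """Flawed candidate: starts from 0, so all-negative lists give 0."""
    best = 0
    for v in values:
        if v > best:
            best = v
    return best


def find_disagreement(candidate: Callable[[list[int]], int],
                      reference: Callable[[list[int]], int],
                      trials: int = 1000,
                      seed: int = 0) -> Optional[list[int]]:
    """Return the first random input where the two functions differ, else None."""
    rng = random.Random(seed)
    for _ in range(trials):
        values = [rng.randint(-10, 10) for _ in range(rng.randint(1, 5))]
        if candidate(values) != reference(values):
            return values
    return None


# Hypothetical fixed suite that the flawed candidate happens to pass.
fixed_tests = [([1, 2, 3], 3), ([5], 5), ([0, -1], 0)]
passes_fixed = all(candidate_max(v) == expected for v, expected in fixed_tests)
counterexample = find_disagreement(candidate_max, reference_max)

print(passes_fixed)
print(counterexample)
```

Random inputs only find bugs that the generator can reach, so a clean run is weak evidence of correctness. Counting how often candidates that pass the fixed suite fail such a search would give a rough estimate of the gap the referee asks about.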

Circularity Check

0 steps flagged

No circularity in derivation chain

The paper introduces the APPS benchmark with 10,000 natural-language problems and associated test cases defined independently of any model. It then evaluates models by generating Python code from the problem statements and measuring pass rates on the fixed test suites, reporting empirical results such as GPT-Neo passing approximately 20% of test cases on introductory problems. This is a direct measurement against external test cases rather than any derivation, fitted parameter, or self-referential equation; the claim that models are beginning to learn to code is an interpretation of these observed pass rates. No load-bearing self-citations, ansatzes, or uniqueness theorems are invoked, and the evaluation protocol does not reduce to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that test-case passing is a sufficient proxy for code correctness; no free parameters or new physical entities are introduced.

axioms (1)

Domain assumption: Test cases supplied with each problem are sufficient to determine whether generated code satisfies the natural language specification.
The entire evaluation pipeline depends on this premise.

pith-pipeline@v0.9.0 · 5542 in / 1173 out tokens · 54330 ms · 2026-05-11T17:05:19.686344+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Foundation.LawOfExistence defect_zero_iff_one unclear

unclear
Relation between the paper passage and the cited Recognition theorem.

our benchmark measures the ability of models to take an arbitrary natural language specification and generate satisfactory Python code... we evaluate models by checking their generated code on test cases

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 33 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

AgentBench: Evaluating LLMs as Agents
cs.AI 2023-08 unverdicted novelty 8.0

AgentBench is a new multi-environment benchmark showing commercial LLMs outperform open-source models up to 70B parameters in agent tasks mainly due to better long-term reasoning and instruction following.
Text-to-CAD Evaluation with CADTests
cs.CV 2026-05 unverdicted novelty 7.0

Introduces CADTestBench as a test-based benchmark for Text-to-CAD and shows that using CADTests to guide generation produces simple baselines outperforming prior methods.
GameGen-Verifier: Parallel Keypoint-Based Verification for LLM-Generated Games via Runtime State Injection
cs.LG 2026-05 unverdicted novelty 7.0

GameGen-Verifier decomposes game specifications into keypoints, injects runtime states for targeted checks, and achieves 92.2% accuracy on 100 games while running up to 16.6x faster than agent-based baselines.
ProgramBench: Can Language Models Rebuild Programs From Scratch?
cs.SE 2026-05 unverdicted novelty 7.0

ProgramBench introduces 200 tasks where models must reconstruct full programs like FFmpeg or SQLite from docs alone; none of 9 evaluated LMs fully solve any task and the best passes 95% tests on only 3% of tasks while...
ARIADNE: Agentic Reward-Informed Adaptive Decision Exploration via Blackboard-Driven MCTS for Competitive Program Generation
cs.SE 2026-05 unverdicted novelty 7.0

ARIADNE combines blackboard architecture with MCTS to coordinate strategy, code, test, evaluation, and repair stages, yielding higher Pass@1 scores than prior LLM baselines on APPS, CodeContests, and related benchmarks.
Assessing the Impact of Requirement Ambiguity on LLM-based Function-Level Code Generation
cs.SE 2026-04 conditional novelty 7.0

Orchid benchmark shows requirement ambiguity degrades LLM code generation performance across all models, with advanced models hit hardest, and LLMs rarely detect or resolve the ambiguity themselves.
PlayCoder: Making LLM-Generated GUI Code Playable
cs.SE 2026-04 conditional novelty 7.0

PlayCoder raises the rate of LLM-generated GUI apps that can be played end-to-end without logic errors from near zero to 20.3% Play@3 by adding repository-aware generation, agent-driven testing, and iterative repair.
Self-Consistency from Only Two Samples: CoT-PoT Ensembling for Efficient LLM Reasoning
cs.CL 2026-04 unverdicted novelty 7.0

CoT-PoT ensembling achieves self-consistency accuracy in LLMs with only two samples for 78.6% of tasks, reducing computation by 9.3x compared to standard methods.
Precise Debugging Benchmark: Is Your Model Debugging or Regenerating?
cs.SE 2026-04 unverdicted novelty 7.0

Frontier LLMs pass unit tests over 76% of the time on debugging tasks but achieve edit precision below 45%, indicating regeneration rather than precise debugging.
Structural Anchors and Reasoning Fragility: Understanding CoT Robustness in LLM4Code
cs.SE 2026-04 unverdicted novelty 7.0

CoT prompting in LLM4Code shows mixed robustness that depends on model family, task structure, and perturbations destabilizing structural anchors, leading to trajectory deformations like lengthening, branching, and si...
AdverMCTS: Combating Pseudo-Correctness in Code Generation via Adversarial Monte Carlo Tree Search
cs.SE 2026-04 unverdicted novelty 7.0

AdverMCTS frames code generation as a minimax game where an attacker evolves tests to expose flaws in solver-generated code, yielding more robust outputs than static-test baselines.
Coding with Eyes: Visual Feedback Unlocks Reliable GUI Code Generating and Debugging
cs.SE 2026-03 unverdicted novelty 7.0

VF-Coder raises GUI code success rate from 21.68% to 28.29% and visual score from 0.4284 to 0.5584 on a new 984-task benchmark by adding direct visual perception and interaction.
Uncertainty Quantification for LLM-based Code Generation
cs.SE 2026-05 unverdicted novelty 6.0

RisCoSet applies multiple hypothesis testing to construct risk-controlling partial-program prediction sets for LLM code generation, achieving up to 24.5% less code removal than prior methods at equivalent risk levels.
DARE: Difficulty-Adaptive Reinforcement Learning with Co-Evolved Difficulty Estimation
cs.LG 2026-05 unverdicted novelty 6.0

DARE co-evolves difficulty estimation and policy in RL for LLMs to improve training efficiency, final performance, and inference speed by using tailored strategies for different difficulty levels.
Using Semantic Distance to Estimate Uncertainty in LLM-Based Code Generation
cs.SE 2026-05 unverdicted novelty 6.0

Semantic distance on program execution behaviors improves uncertainty estimation for LLM code generation and outperforms prior sample-based methods across benchmarks and models.
VeriContest: A Competitive-Programming Benchmark for Verifiable Code Generation
cs.SE 2026-05 unverdicted novelty 6.0

VeriContest supplies 946 problems with specs, code, proofs, and tests to benchmark verifiable code generation in Rust/Verus, showing models reach 92% on code but only 5% end-to-end on full verifiable synthesis.
Beyond Pairs: Your Language Model is Secretly Optimizing a Preference Graph
cs.LG 2026-05 unverdicted novelty 6.0

GraphDPO generalizes pairwise DPO to a graph-structured Plackett-Luce objective over DAGs induced by rollout rankings, enforcing transitivity with linear complexity and recovering DPO as a special case.
PERSA: Reinforcement Learning for Professor-Style Personalized Feedback with LLMs
cs.AI 2026-05 unverdicted novelty 6.0

PERSA combines RLHF with selective parameter-efficient updates to top transformer layers, raising style alignment scores from 35% to 96% on code feedback benchmarks while holding correctness near 100%.
Improving LLM Code Generation via Requirement-Aware Curriculum Reinforcement Learning
cs.SE 2026-05 unverdicted novelty 6.0

REC RL improves LLM code generation by automatically assessing and optimizing requirement difficulty with adaptive curriculum sampling, yielding 1.23-5.62% Pass@1 gains over baselines.
RealBench: A Repo-Level Code Generation Benchmark Aligned with Real-World Software Development Practices
cs.SE 2026-04 unverdicted novelty 6.0

RealBench is a new repo-level code generation benchmark that adds UML diagrams to natural language specs, showing LLMs struggle more at full repositories, create modules with errors, and perform best with whole-repo g...
Generalization in LLM Problem Solving: The Case of the Shortest Path
cs.AI 2026-04 unverdicted novelty 6.0

LLMs show strong spatial generalization to unseen maps in shortest-path tasks but fail length scaling due to recursive instability, with data coverage setting hard limits.
Babbling Suppression: Making LLMs Greener One Token at a Time
cs.SE 2026-04 unverdicted novelty 6.0

Babbling Suppression stops LLM code generation upon test passage to reduce token output and energy consumption by up to 65% across Python and Java benchmarks.
GDPO: Group reward-Decoupled Normalization Policy Optimization for Multi-reward RL Optimization
cs.CL 2026-01 unverdicted novelty 6.0

GDPO decouples per-reward normalization in multi-reward RL to avoid advantage collapse and improve convergence over GRPO on tool-calling, math, and coding tasks.
Process Reinforcement through Implicit Rewards
cs.LG 2025-02 conditional novelty 6.0

PRIME enables online process reward model updates in LLM RL using implicit rewards from rollouts and outcome labels, yielding 15.1% average gains on reasoning benchmarks and surpassing a stronger instruct model with 1...
Not All RAGs Are Created Equal: A Component-Wise Empirical Study for Software Engineering Tasks
cs.SE 2026-05 unverdicted novelty 5.0

Retriever-side choices, particularly the retrieval algorithm, exert more influence on RAG performance than generator selection across code generation, summarization, and repair tasks.
HyperLens: Quantifying Cognitive Effort in LLMs with Fine-grained Confidence Trajectory
cs.AI 2026-05 unverdicted novelty 5.0

HyperLens reveals that deeper transformer layers magnify small confidence changes into fine-grained trajectories, allowing quantification of cognitive effort where complex tasks demand more and standard SFT can reduce it.
Bridging the Gap between User Intent and LLM: A Requirement Alignment Approach for Code Generation
cs.SE 2026-04 unverdicted novelty 5.0

REA-Coder improves LLM code generation by iteratively aligning requirements with model understanding and verifying outputs against the aligned spec.
Humanity's Last Exam
cs.LG 2025-01 unverdicted novelty 5.0

Humanity's Last Exam is a new 2,500-question benchmark at the frontier of human knowledge where state-of-the-art LLMs show low accuracy.
StarCoder: may the source be with you!
cs.CL 2023-05 accept novelty 5.0

StarCoderBase matches or beats OpenAI's code-cushman-001 on multi-language code benchmarks; the Python-fine-tuned StarCoder reaches 40% pass@1 on HumanEval while retaining other-language performance.
Exploring Pass-Rate Reward in Reinforcement Learning for Code Generation
cs.LG 2026-05 unverdicted novelty 4.0

Pass-rate rewards in critic-free RL for code generation fail to outperform binary rewards because partial-pass solutions induce conflicting gradient directions that do not consistently favor full correctness.
How Many Tries Does It Take? Iterative Self-Repair in LLM Code Generation Across Model Scales and Benchmarks
cs.SE 2026-04 unverdicted novelty 4.0

Iterative self-repair improves LLM code pass rates by 4.9-17.1 pp on HumanEval and 16-30 pp on MBPP across seven models, with gains concentrated early and syntax errors easier to fix than logical ones.
FLeX: Fourier-based Low-rank EXpansion for multilingual transfer
cs.LG 2026-04 unverdicted novelty 4.0

LoRA fine-tuning of Code Llama with Fourier regularization raises Java pass@1 from 34.2% to 42.1% while using a small high-quality dataset.
OpenClassGen: A Large-Scale Corpus of Real-World Python Classes for LLM Research
cs.SE 2025-04

Reference graph

Works this paper leans on

18 extracted references · 18 canonical work pages · cited by 33 Pith papers · 6 internal anchors

[1]

Mining source code repositories at massive scale using language modeling

Miltiadis Allamanis and Charles Sutton. Mining source code repositories at massive scale using language modeling. In 2013 10th Working Conference on Mining Software Repositories (MSR) , pages 207–216. IEEE,

work page 2013
[2]

Sygus-comp 2018: Results and analysis

Rajeev Alur, Dana Fisman, Saswat Padhi, Rishabh Singh, and Abhishek Udupa. Sygus-comp 2018: Results and analysis. SYNT,

work page 2018
[3]

Language Models are Few-Shot Learners

URL https://doi.org/10.5281/zenodo.5297715. If you use this software, please cite it using these metadata. T. Brown, B. Mann, Nick Ryder, Melanie Subbiah, J. Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, G. Krüger, T. Henighan, R. Child, Aditya Ramesh, D. Ziegler, Jeffrey...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.5281/zenodo.5297715 2005
[4]

Evaluating Large Language Models Trained on Code

Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde, Jared Kaplan, Harri Edwards, Yura Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374,

work page internal anchor Pith review Pith/arXiv arXiv
[5]

The Pile: An 800GB Dataset of Diverse Text for Language Modeling

Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, et al. The pile: An 800gb dataset of diverse text for language modeling. arXiv preprint arXiv:2101.00027,

work page internal anchor Pith review Pith/arXiv arXiv
[6]

Datasheets for datasets

Timnit Gebru, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman Vaughan, Hanna Wallach, Hal Daumé III, and Kate Crawford. Datasheets for datasets. arXiv preprint arXiv:1803.09010,

work page arXiv
[7]

Measuring Mathematical Problem Solving With the MATH Dataset

Dan Hendrycks, Collin Burns, Steven Basart, Andrew Critch, Jerry Li, Dawn Song, and Jacob Steinhardt. Aligning AI with shared human values. Proceedings of the International Conference on Learning Representations (ICLR), 2021a. Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask...

work page internal anchor Pith review Pith/arXiv arXiv
[8]

Mapping language to code in programmatic context

Srinivasan Iyer, Ioannis Konstas, Alvin Cheung, and Luke Zettlemoyer. Mapping language to code in programmatic context. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, October-November

work page 2018
[9]

Unsupervised translation of programming languages

Marie-Anne Lachaux, Baptiste Roziere, Lowik Chanussot, and Guillaume Lample. Unsupervised translation of programming languages. arXiv preprint arXiv:2006.03511,

work page arXiv 2006
[10]

W. Ling, P. Blunsom, Edward Grefenstette, K. Hermann, Tomás Kociský, Fumin Wang, and A. Senior. Latent predictor networks for code generation. ArXiv, abs/1603.06744,

work page arXiv
[11]

Generative language modeling for automated theorem proving

Stanislas Polu and Ilya Sutskever. Generative language modeling for automated theorem proving. ArXiv, abs/2009.03393,

work page arXiv 2009
[12]

Veselin Raychev, Pavol Bielik, and Martin T. Vechev. Probabilistic model for code with decision trees. Proceedings of the 2016 ACM SIGPLAN International Conference on Object-Oriented Programming, Systems, Languages, and Applications,

work page 2016
[13]

CodeBLEU: a Method for Automatic Evaluation of Code Synthesis

Shuo Ren, Daya Guo, Shuai Lu, L. Zhou, Shujie Liu, Duyu Tang, M. Zhou, A. Blanco, and S. Ma. Codebleu: a method for automatic evaluation of code synthesis. ArXiv, abs/2009.10297,

work page internal anchor Pith review arXiv 2009
[14]

Attention Is All You Need

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, L. Kaiser, and Illia Polosukhin. Attention is all you need. ArXiv, abs/1706.03762,

work page internal anchor Pith review Pith/arXiv arXiv
[15]

Superglue: A stickier benchmark for general-purpose language understanding systems

Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. Superglue: A stickier benchmark for general-purpose language understanding systems. In NeurIPS, 2019a. Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. Superglue...

work page arXiv 1911
[16]

Spider: A Large-Scale Human-Labeled Dataset for Complex and Cross-Domain Semantic Parsing and Text-to-SQL Task

Tao Yu, Rui Zhang, Kai Yang, Michihiro Yasunaga, Dongxu Wang, Zifan Li, James Ma, Irene Li, Qingning Yao, Shanelle Roman, et al. Spider: A large-scale human-labeled dataset for complex and cross-domain semantic parsing and text-to-sql task. arXiv preprint arXiv:1809.08887,

work page Pith review arXiv
[17]

the fair use of a copyrighted work, including such use by ... scholarship, or research, is not an infringement of copyright

                          Hearthstone   Django    NAPS         APPS
Programming Language      Python        Python    UAST         Python
Test Cases
Number of Programs        665           18,805    17,477       232,421
Lines per Program (Avg.)  7.7           1         21.7         18.0
Number of Exercises       665           18,805    2,231        10,000
Text Input                Card Text     Comment   Pseudocode   Problem Descriptions

Table 4: Further comparisons of APPS with previous datasets. Top-5 Test Case...

work page 2018
[18]

fail to pass even a single predefined test case

main paper. We continue the comparisons below. Ling et al. (2016) introduce datasets based on Hearthstone and Magic the Gathering card games for code generation. Oda et al. (2015) provide a language-to-code dataset using simple code comments. Zavershynskyi et al. (2018) introduce the NAPS dataset for converting pseudocode to code, obtained by crowdsourcin...

work page 2016