arxiv: 2304.05128 · v2 · submitted 2023-04-11 · 💻 cs.CL · cs.AI


Teaching Large Language Models to Self-Debug

Xinyun Chen, Maxwell Lin, Nathanael Schärli, Denny Zhou


Pith reviewed 2026-05-12 06:19 UTC · model grok-4.3

classification 💻 cs.CL cs.AI

keywords: self-debugging, large language models, code generation, program repair, few-shot prompting, text-to-SQL, code translation


The pith

Large language models can debug their own code by explaining it in natural language and analyzing execution results without human input.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Self-Debugging, a prompting method that uses few-shot demonstrations to teach LLMs to debug their own code outputs; no model weights are updated. The model generates natural-language explanations of its code and then uses those explanations together with execution results to identify and correct errors. This yields higher accuracy on code generation tasks such as text-to-SQL and code translation, plus better use of multiple sampling attempts. A reader should care because the technique shows how an LLM can turn its own explanation of its code into a repair loop that works even when no external tests or human judgments are supplied.

Core claim

Self-Debugging teaches large language models to debug their predicted programs via few-shot demonstrations. The model performs rubber duck debugging by explaining the generated code in natural language and investigating execution results, all without any human feedback on correctness or error messages. The approach reaches state-of-the-art results on the Spider text-to-SQL benchmark, TransCoder C++-to-Python translation, and MBPP text-to-Python generation. On Spider it raises baseline accuracy by 2-3 percent overall and by 9 percent on the hardest problems; on the other two benchmarks it raises accuracy by as much as 12 percent when unit tests are present. It also improves sample efficiency, matching or outperforming baseline models that generate more than 10 times as many candidate programs.

What carries the argument

Self-Debugging, the few-shot prompting procedure that has the LLM first explain its code in natural language and then revise the code using that explanation plus raw execution output.
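The explain-then-revise procedure described above can be pictured as a short loop. The following is a minimal sketch under stated assumptions, not the paper's implementation: `self_debug`, the prompt wording, the `CORRECT` stop signal, and the `max_turns` limit are all hypothetical, and `llm` stands for any text-in, text-out callable.

```python
from typing import Callable, Optional

# Hypothetical type aliases: any text-in, text-out callables will do.
LLM = Callable[[str], str]
Executor = Callable[[str], str]


def self_debug(task: str, llm: LLM, execute: Optional[Executor] = None,
               max_turns: int = 3) -> str:
    """Generate code, then repeatedly explain it and revise it.

    The prompts and the CORRECT stop signal are illustrative assumptions.
    """
    code = llm(f"Task: {task}\nWrite the code.")
    for _ in range(max_turns):
        # Step 1: the model explains its own code in natural language.
        explanation = llm(f"Explain this code line by line:\n{code}")
        # Step 2: execution output is used only when an executor is supplied.
        result = execute(code) if execute is not None else "(no execution output)"
        # Step 3: the model judges the code against the task and may revise it.
        verdict = llm(
            f"Task: {task}\nCode:\n{code}\nExplanation: {explanation}\n"
            f"Execution output: {result}\n"
            "Reply CORRECT if the code solves the task, "
            "otherwise reply with fixed code."
        )
        if verdict.strip() == "CORRECT":
            break
        code = verdict
    return code
```

Passing `execute=None` gives an explanation-only loop, while supplying an executor adds raw execution output to the feedback prompt. No human judgment enters either path, which is the property the review highlights.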

Load-bearing premise

An LLM can reliably detect its own semantic or logical errors and produce correct fixes solely from its own natural-language code explanations plus raw execution outputs, without external oracles or human judgment.

What would settle it

Give the model its own incorrect code on a held-out set of complex problems, prompt it to explain and debug the code with no additional error information, and measure whether the revised code passes independent verification at a rate clearly above the baseline single-pass rate.
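One way to score such an experiment is to record, per problem, whether the single-pass code and the revised code each pass independent verification, then compare the two rates. The sketch below assumes that bookkeeping; `Outcome` and `summarize` are hypothetical names, and counting regressions alongside fixes is a design choice added here rather than something the paper specifies.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Outcome:
    baseline_pass: bool  # single-pass code passed independent verification
    revised_pass: bool   # code after explain-and-debug passed verification


def summarize(outcomes: List[Outcome]) -> Dict[str, float]:
    """Compare pass rates before and after debugging on the same problems."""
    n = len(outcomes)
    if n == 0:
        raise ValueError("need at least one outcome")
    baseline = sum(o.baseline_pass for o in outcomes) / n
    revised = sum(o.revised_pass for o in outcomes) / n
    # Fixes: wrong before, right after. Regressions: right before, wrong after.
    fixed = sum((not o.baseline_pass) and o.revised_pass for o in outcomes)
    broken = sum(o.baseline_pass and (not o.revised_pass) for o in outcomes)
    return {"baseline": baseline, "revised": revised,
            "fixed": fixed, "broken": broken}
```

Tracking `broken` matters because a debugging loop that rewrites correct programs into incorrect ones can hide behind a flat or slightly positive net delta.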

Original abstract

Large language models (LLMs) have achieved impressive performance on code generation. However, for complex programming tasks, generating the correct solution in one go becomes challenging, thus some prior works have designed program repair approaches to improve code generation performance. In this work, we propose Self-Debugging, which teaches a large language model to debug its predicted program via few-shot demonstrations. In particular, we demonstrate that Self-Debugging can teach the large language model to perform rubber duck debugging; i.e., without any human feedback on the code correctness or error messages, the model is able to identify its mistakes by investigating the execution results and explaining the generated code in natural language. Self-Debugging achieves the state-of-the-art performance on several code generation benchmarks, including the Spider dataset for text-to-SQL generation, TransCoder for C++-to-Python translation, and MBPP for text-to-Python generation. On the Spider benchmark where there are no unit tests to verify the correctness of predictions, Self-Debugging with code explanation consistently improves the baseline by 2-3%, and improves the prediction accuracy on problems of the hardest level by 9%. On TransCoder and MBPP where unit tests are available, Self-Debugging improves the baseline accuracy by up to 12%. Meanwhile, by leveraging feedback messages and reusing failed predictions, Self-Debugging notably improves sample efficiency, and can match or outperform baseline models that generate more than 10x candidate programs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

Self-debugging via few-shot code explanation plus execution feedback delivers usable gains on code benchmarks, but the Spider results without tests look like the weakest part.


The main thing to know is that this paper shows a simple few-shot loop where the LLM explains its own code in natural language and then uses that plus execution results to spot and fix mistakes, leading to better accuracy and sample efficiency on standard code generation tasks. On MBPP and TransCoder, where unit tests are available, it lifts baseline performance by up to 12% and can match models that generate over 10 times more candidates. On Spider, the gains are smaller at 2-3% overall and 9% on the hardest problems, relying only on the explanation step since no tests exist there.

What is actually new is packaging rubber-duck debugging as few-shot demonstrations that combine explanation with result inspection, without needing human feedback or external oracles. The paper does a solid job documenting these concrete improvements on public benchmarks and highlighting the efficiency angle, which matters for practical use.

The soft spots center on the Spider case. The abstract claims the model identifies mistakes by investigating execution results and explaining the code, but with no execution results available, everything hinges on whether the natural-language explanations reliably surface logical errors or just add prompt length that happens to help. If the explanations are post-hoc and the fixes do not causally depend on them, the reported lift could be an artifact rather than genuine self-debugging. The benchmarks with tests are on firmer ground because they provide real feedback.

This paper is for people working on prompt-based LLM coding tools who want a lightweight correction method. A reader focused on self-correction or sample-efficient generation would get value from the numbers and the no-human-feedback framing. It deserves a serious referee because the method is straightforward, the datasets are standard, and the claims are falsifiable even if the Spider mechanism needs tighter ablations.

Referee Report

3 major / 1 minor

Summary. The paper introduces Self-Debugging, a few-shot prompting approach that teaches LLMs to perform rubber-duck debugging on their own code generations: the model explains the generated code in natural language and (where available) inspects execution results to identify and repair mistakes. It reports state-of-the-art results on three code-generation benchmarks, Spider (text-to-SQL), TransCoder (C++-to-Python), and MBPP (text-to-Python), with absolute gains of 2–3% (9% on the hardest Spider problems) when no unit tests are present and up to 12% when unit tests are available, plus improved sample efficiency by reusing failed candidates.

Significance. If the gains prove robust, the work would show that LLMs can achieve meaningful self-correction on complex code tasks without external oracles or human feedback, advancing the design of autonomous programming agents. The Spider results are especially noteworthy because they rely solely on the model’s own explanations rather than execution feedback.

major comments (3)

[§4 (Spider results)] The 2–3% overall and 9% hardest-problem gains rest on the assumption that LLM-generated natural-language explanations faithfully expose semantic mismatches with the query intent. No ablation isolating the explanation step from extra inference steps or prompt length is reported, leaving open the possibility that the improvement is an artifact of additional generation rather than genuine self-debugging.

[§3 (method) and abstract] The procedure is described as identifying mistakes “by investigating the execution results and explaining the generated code,” yet Spider has no execution results. The exact pipeline used on Spider (explanation-only vs. other mechanisms) must be stated explicitly and supported by qualitative examples showing that fixes are causally driven by the explanations.

[§4 (experimental setup)] No details are supplied on prompt templates, number of few-shot demonstrations, exact baseline prompts, temperature settings, or statistical significance tests for the reported deltas. These omissions are load-bearing because the central performance claims cannot be assessed or reproduced without them.

minor comments (1)

[Abstract] The abstract introduces the phrase “Self-Debugging with code explanation” without defining the variant or contrasting it with the base method.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We address each major point below and will revise the manuscript to incorporate clarifications, additional experiments, and details as outlined.


Referee: [§4 (Spider results)] The 2–3% overall and 9% hardest-problem gains rest on the assumption that LLM-generated natural-language explanations faithfully expose semantic mismatches with the query intent. No ablation isolating the explanation step from extra inference steps or prompt length is reported, leaving open the possibility that the improvement is an artifact of additional generation rather than genuine self-debugging.

Authors: We agree that an ablation isolating the contribution of the structured explanation step would strengthen the claims. In the revised manuscript we will add an ablation comparing the full Self-Debugging pipeline against (i) a baseline that performs an equivalent number of additional generation steps without the debugging prompt structure and (ii) a longer-prompt baseline that simply concatenates extra text. While our primary baselines already match the number of few-shot examples and total tokens in the initial generation, we acknowledge that this extra control is necessary to rule out artifacts from inference budget.
revision: yes

Referee: [§3 (method) and abstract] The procedure is described as identifying mistakes “by investigating the execution results and explaining the generated code,” yet Spider has no execution results. The exact pipeline used on Spider (explanation-only vs. other mechanisms) must be stated explicitly and supported by qualitative examples showing that fixes are causally driven by the explanations.

Authors: We thank the referee for catching this ambiguity. The abstract already notes that Spider results use “Self-Debugging with code explanation” because no unit tests are available. We will revise §3 to explicitly describe the two variants: (a) explanation-only for Spider (the model generates a natural-language explanation of the predicted SQL and compares it against the question to detect semantic mismatches) and (b) explanation-plus-execution for TransCoder and MBPP. We will also add qualitative examples (in the main text or appendix) that trace specific fixes back to statements in the generated explanations.
revision: yes

Referee: [§4 (experimental setup)] No details are supplied on prompt templates, number of few-shot demonstrations, exact baseline prompts, temperature settings, or statistical significance tests for the reported deltas. These omissions are load-bearing because the central performance claims cannot be assessed or reproduced without them.

Authors: We apologize for the omission of these reproducibility details. In the revised manuscript we will add a new subsection (or appendix) containing: the complete prompt templates for each dataset, the exact number of few-shot demonstrations used (3 for Spider and MBPP, 2 for TransCoder), the precise baseline prompts, temperature = 0 for all main results (with additional temperature-0.7 runs reported for robustness), and statistical significance tests via paired bootstrap resampling over 5 random seeds to confirm the reported deltas.
revision: yes
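For readers unfamiliar with the paired bootstrap mentioned in this simulated response, here is a minimal sketch of the idea applied to per-problem pass/fail outcomes. It is an illustration under assumptions, not the authors' procedure: `paired_bootstrap`, its parameters, and the choice to resample problems (rather than seeds) are hypothetical.

```python
import random
from typing import List, Tuple


def paired_bootstrap(baseline: List[int], treated: List[int],
                     n_resamples: int = 10_000,
                     seed: int = 0) -> Tuple[float, float]:
    """Return (observed accuracy delta, share of resamples with delta <= 0).

    Inputs are 0/1 outcomes for the same problems, in the same order.
    """
    if len(baseline) != len(treated) or not baseline:
        raise ValueError("inputs must be non-empty and equally long")
    rng = random.Random(seed)
    n = len(baseline)
    # Pairing: keep each problem's two outcomes together as one difference.
    diffs = [t - b for b, t in zip(baseline, treated)]
    observed = sum(diffs) / n
    not_better = 0
    for _ in range(n_resamples):
        # Resample problems with replacement and recompute the delta.
        sample = rng.choices(diffs, k=n)
        if sum(sample) <= 0:
            not_better += 1
    return observed, not_better / n_resamples
```

A small second value suggests the improvement is unlikely to be a sampling accident on this problem set; it says nothing about whether the explanation step caused the improvement, which is the separate ablation question raised by the referee.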

Circularity Check

0 steps flagged

No circularity: empirical results measured against independent external benchmarks


The paper proposes an empirical prompting technique (few-shot self-debugging demonstrations) and reports accuracy improvements on standard code-generation benchmarks. Performance on Spider is measured against ground-truth SQL queries, while TransCoder and MBPP use unit-test oracles; none of these metrics are defined in terms of the method's own outputs or fitted parameters. No equations, self-definitional loops, or load-bearing self-citations appear in the derivation of the reported gains. The central claims therefore remain externally falsifiable and do not reduce to their inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach rests on the domain assumption that current LLMs possess sufficient in-context learning capacity to follow debugging demonstrations and to produce useful natural-language explanations; no new entities are postulated and no numeric parameters are fitted inside the method itself.

axioms (1)

domain assumption: Large language models can follow few-shot demonstrations to perform complex reasoning tasks such as code explanation and error detection.
The entire Self-Debugging pipeline depends on this in-context learning capability being present in the base model.

pith-pipeline@v0.9.0 · 5566 in / 1307 out tokens · 47247 ms · 2026-05-12T06:19:14.738133+00:00 · methodology


Forward citations

Cited by 39 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

RubricRefine: Improving Tool-Use Agent Reliability with Training-Free Pre-Execution Refinement
cs.LG 2026-05 unverdicted novelty 7.0

RubricRefine improves tool-use agent reliability to 0.86 on M3ToolEval by generating rubrics for pre-execution contract checking and iterative repair, outperforming baselines at 2.6X lower latency while showing no gai...
Your Simulation Runs but Solves the Wrong Physics: PDE-Grounded Intent Verification for LLM-Generated Multiphysics Simulation Code
cs.LG 2026-05 unverdicted novelty 7.0

A new Intent Fidelity Score and refinement loop verify that LLM-generated simulation code matches the intended PDEs, improving performance on a 220-case benchmark where execution alone fails to ensure correctness.
LEAF-SQL: Level-wise Exploration with Adaptive Fine-graining for Text-to-SQL Skeleton Prediction
cs.CL 2026-05 unverdicted novelty 7.0

LEAF-SQL uses level-wise exploration with adaptive fine-graining and dual agents to generate diverse SQL skeletons, reaching 71.6% execution accuracy on the BIRD benchmark and outperforming prior search- and skeleton-...
Skill Drift Is Contract Violation: Proactive Maintenance for LLM Agent Skill Libraries
cs.SE 2026-05 conditional novelty 7.0

SkillGuard extracts executable environment contracts from LLM skill documents to detect only relevant drifts, reporting zero false positives on 599 cases, 100% precision in known-drift tests, and raising one-round rep...
Debugging the Debuggers: Failure-Anchored Structured Recovery for Software Engineering Agents
cs.SE 2026-05 unverdicted novelty 7.0

PROBE structures runtime telemetry into diagnoses and evidence-grounded guidance, raising recovery rates by 12.45 points over baselines on 257 unresolved software repair and AIOps cases.
CP-SynC: Multi-Agent Zero-Shot Constraint Modeling in MiniZinc with Synthesized Checkers
cs.AI 2026-05 unverdicted novelty 7.0

CP-SynC uses coordinated LLM agents to generate, validate via synthesized checkers, and select MiniZinc models from natural language, substantially outperforming baselines on a 100-problem benchmark.
Empowering Autonomous Debugging Agents with Efficient Dynamic Analysis
cs.SE 2026-04 unverdicted novelty 7.0

ADI equips AI debugging agents with function-level interaction via a new execution trace structure, raising SWE-bench Verified resolution to 63.8% at $1.28 per task and delivering 6-18% gains when added to existing agents.
PhysCodeBench: Benchmarking Physics-Aware Symbolic Simulation of 3D Scenes via Self-Corrective Multi-Agent Refinement
cs.RO 2026-04 unverdicted novelty 7.0

PhysCodeBench benchmark and SMRF multi-agent framework enable better AI generation of physically accurate 3D simulation code, boosting performance by 31 points over baselines.
PlayCoder: Making LLM-Generated GUI Code Playable
cs.SE 2026-04 conditional novelty 7.0

PlayCoder raises the rate of LLM-generated GUI apps that can be played end-to-end without logic errors from near zero to 20.3% Play@3 by adding repository-aware generation, agent-driven testing, and iterative repair.
Structural Verification for Reliable EDA Code Generation without Tool-in-the-Loop Debugging
cs.SE 2026-04 unverdicted novelty 7.0

Structural dependency graphs and staged pre-execution verification raise LLM-based EDA code pass rates to 82.5% (single-step) and 70-84% (multi-step) while halving tool calls by catching dependency violations before runtime.
Self-Consistency from Only Two Samples: CoT-PoT Ensembling for Efficient LLM Reasoning
cs.CL 2026-04 unverdicted novelty 7.0

CoT-PoT ensembling achieves self-consistency accuracy in LLMs with only two samples for 78.6% of tasks, reducing computation by 9.3x compared to standard methods.
Feedback-Driven Execution for LLM-Based Binary Analysis
cs.CR 2026-04 unverdicted novelty 7.0

FORGE uses a reasoning-action-observation loop and Dynamic Forest of Agents to perform scalable LLM-based binary analysis, finding 1,274 vulnerabilities across 591 of 3,457 real-world firmware binaries at 72.3% precis...
BACE: LLM-based Code Generation through Bayesian Anchored Co-Evolution of Code and Test Populations
cs.NE 2026-03 unverdicted novelty 7.0

BACE reformulates LLM code synthesis as Bayesian co-evolution of code and test populations anchored on minimal public examples, achieving superior performance on LiveCodeBench v6.
Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads
cs.LG 2024-01 conditional novelty 7.0

Medusa augments LLMs with multiple decoding heads and tree-based attention to predict and verify several tokens in parallel, yielding 2.2-3.6x inference speedup via two fine-tuning regimes.
AgentCoder: Multi-Agent-based Code Generation with Iterative Testing and Optimisation
cs.CL 2023-12 accept novelty 7.0

A three-agent loop of code generation, test creation, and execution feedback lifts pass@1 to 96.3% on HumanEval and 91.8% on MBPP for GPT-4 while using roughly half the tokens of prior state-of-the-art.
Large Language Models as Optimizers
cs.LG 2023-09 unverdicted novelty 7.0

Large language models can optimize by being prompted with histories of past solutions and scores to propose better ones, producing prompts that raise accuracy up to 8% on GSM8K and 50% on Big-Bench Hard over human-des...
Reflexion: Language Agents with Verbal Reinforcement Learning
cs.AI 2023-03 conditional novelty 7.0

Reflexion lets LLM agents improve via stored verbal reflections on task feedback, reaching 91% pass@1 on HumanEval and outperforming prior GPT-4 results.
Learning from Failures: Correction-Oriented Policy Optimization with Verifiable Rewards
cs.CL 2026-05 unverdicted novelty 6.0

CIPO jointly optimizes standard RLVR rewards with correction samples derived from the model's own failed attempts, yielding better reasoning and self-correction on math and code benchmarks.
Verification Mirage: Mapping the Reliability Boundary of Self-Verification in Medical VQA
cs.CV 2026-05 unverdicted novelty 6.0

Self-verification in medical VQA creates a verification mirage where verifiers exhibit high error and agreement bias on wrong answers, with reliability strongly conditioned on task type.
RubricRefine: Improving Tool-Use Agent Reliability with Training-Free Pre-Execution Refinement
cs.LG 2026-05 unverdicted novelty 6.0

RubricRefine raises average tool-use reliability to 0.86 on M3ToolEval across seven models by scoring candidate code against generated contract rubrics before execution, beating prior inference-time methods at 2.6X lo...
PaT: Planning-after-Trial for Efficient Test-Time Code Generation
cs.CL 2026-05 unverdicted novelty 6.0

PaT defers planning until after failed trials in LLM code generation, enabling heterogeneous cheap-plus-powerful model setups that match large-model performance at roughly 69% lower cost.
EGREFINE: An Execution-Grounded Optimization Framework for Text-to-SQL Schema Refinement
cs.DB 2026-05 unverdicted novelty 6.0

EGRefine optimizes column renamings via execution-grounded verification and view materialization to recover Text-to-SQL accuracy lost to schema naming issues while guaranteeing query equivalence.
When to Vote, When to Rewrite: Disagreement-Guided Strategy Routing for Test-Time Scaling
cs.AI 2026-04 unverdicted novelty 6.0

A disagreement-guided routing framework dynamically selects among resolution, voting, and rewriting strategies for test-time scaling, delivering 3-7% accuracy gains with lower sampling cost on mathematical benchmarks.
WebGen-R1: Incentivizing Large Language Models to Generate Functional and Aesthetic Websites with Reinforcement Learning
cs.CL 2026-04 unverdicted novelty 6.0

WebGen-R1 uses end-to-end RL with scaffold-driven generation and cascaded rewards for structure, function, and aesthetics to transform a 7B model into a generator of deployable multi-page websites that rivals much lar...
PV-SQL: Synergizing Database Probing and Rule-based Verification for Text-to-SQL Agents
cs.AI 2026-04 unverdicted novelty 6.0

PV-SQL boosts Text-to-SQL execution accuracy by 5% and valid efficiency by 20.8% on BIRD benchmarks via database probing and rule-based SQL verification while using fewer tokens.
Beyond Fixed Tests: Repository-Level Issue Resolution as Coevolution of Code and Behavioral Constraints
cs.SE 2026-04 unverdicted novelty 6.0

Agent-CoEvo is a multi-agent LLM framework that coevolves code patches and test patches to resolve repository-level issues, outperforming fixed-test baselines on SWE-bench Lite and SWT-bench Lite.
Ensemble-Based Uncertainty Estimation for Code Correctness Estimation
cs.SE 2026-03 unverdicted novelty 6.0

Ensemble Semantic Entropy improves correlation with code correctness over single-model methods and powers a cascading scaling system that cuts FLOPs by 64.9% while preserving performance on LiveCodeBench.
LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code
cs.SE 2024-03 unverdicted novelty 6.0

LiveCodeBench collects 400 recent contest problems to create a contamination-free benchmark evaluating LLMs on code generation and related capabilities like self-repair and execution.
Gorilla: Large Language Model Connected with Massive APIs
cs.CL 2023-05 conditional novelty 6.0

Gorilla is a fine-tuned LLM that surpasses GPT-4 in accurate API call generation and uses retrieval to handle documentation updates.
Measure Twice, Click Once: Co-evolving Proposer and Visual Critic via Reinforcement Learning for GUI Grounding
cs.LG 2026-04 unverdicted novelty 5.0

A co-evolving proposer-critic RL framework improves GUI grounding accuracy by letting the model critique its own proposals rendered on screenshots.
Bridging the Gap between User Intent and LLM: A Requirement Alignment Approach for Code Generation
cs.SE 2026-04 unverdicted novelty 5.0

REA-Coder improves LLM code generation by iteratively aligning requirements with model understanding and verifying outputs against the aligned spec.
Lazy or Efficient? Towards Accessible Eye-Tracking Event Detection Using LLMs
cs.HC 2026-04 unverdicted novelty 5.0

An LLM-driven system generates executable eye-tracking detectors from user prompts and achieves accuracy comparable to classical I-VT and I-DT methods while eliminating the need for specialized programming.
Understanding Performance Gap Between Parallel and Sequential Sampling in Large Reasoning Models
cs.CL 2026-04 unverdicted novelty 5.0

Lack of exploration from conditioning on prior answers is the primary reason parallel sampling outperforms sequential sampling in large reasoning models.
Spec Kit Agents: Context-Grounded Agentic Workflows
cs.SE 2026-04 unverdicted novelty 5.0

A multi-agent SDD framework with phase-level context-grounding hooks improves LLM-judged quality by 0.15 points and SWE-bench Lite Pass@1 by 1.7 percent while preserving near-perfect test compatibility.
LLM-Assisted Repository-Level Generation with Structured Spec-Driven Engineering
cs.SE 2026-05 unverdicted novelty 4.0

Structured specifications as LLM inputs can make high-quality repository-level code generation feasible with better verifiability than natural language prompts.
How Many Tries Does It Take? Iterative Self-Repair in LLM Code Generation Across Model Scales and Benchmarks
cs.SE 2026-04 unverdicted novelty 4.0

Iterative self-repair improves LLM code pass rates by 4.9-17.1 pp on HumanEval and 16-30 pp on MBPP across seven models, with gains concentrated early and syntax errors easier to fix than logical ones.
Skills-Coach: A Self-Evolving Skill Optimizer via Training-Free GRPO
cs.CL 2026-04 unverdicted novelty 3.0

Skills-Coach optimizes LLM agent skills via task generation, prompt/code tuning, comparative execution, and traceable evaluation, reporting gains on a 48-skill benchmark called Skill-X.
LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods
cs.CL 2024-12 accept novelty 3.0

A survey that organizes LLMs-as-judges research into functionality, methodology, applications, meta-evaluation, and limitations.
A Survey on Large Language Models for Code Generation
cs.CL 2024-06 unverdicted novelty 3.0

A systematic literature review that organizes recent work on LLMs for code generation into a taxonomy covering data curation, model advances, evaluations, ethics, environmental impact, and applications, with benchmark...

Reference graph

Works this paper leans on

115 extracted references · 115 canonical work pages · cited by 38 Pith papers · 12 internal anchors

[1]

Advances in Neural Information Processing Systems , volume=

Synthesize, execute and debug: Learning to repair for neural program synthesis , author=. Advances in Neural Information Processing Systems , volume=

work page
[2]

Advances in Neural Information Processing Systems , volume=

Coderl: Mastering code generation through pretrained models and deep reinforcement learning , author=. Advances in Neural Information Processing Systems , volume=

work page
[4]

International Conference on Machine Learning , pages=

Graph-based, self-supervised program repair from diagnostic feedback , author=. International Conference on Machine Learning , pages=. 2020 , organization=

work page 2020
[5]

The Eleventh International Conference on Learning Representations , year=

Multi-lingual Evaluation of Code Generation Models , author=. The Eleventh International Conference on Learning Representations , year=

work page
[10]

Advances in Neural Information Processing Systems , editor=

Training language models to follow instructions with human feedback , author=. Advances in Neural Information Processing Systems , editor=. 2022 , url=

work page 2022
[11]

International Conference on Learning Representations , year=

Multitask Prompted Training Enables Zero-Shot Task Generalization , author=. International Conference on Learning Representations , year=

work page
[12]

Advances in neural information processing systems , volume=

Language models are few-shot learners , author=. Advances in neural information processing systems , volume=

work page
[13]

S pider: A Large-Scale Human-Labeled Dataset for Complex and Cross-Domain Semantic Parsing and Text-to- SQL Task

Yu, Tao and Zhang, Rui and Yang, Kai and Yasunaga, Michihiro and Wang, Dongxu and Li, Zifan and Ma, James and Li, Irene and Yao, Qingning and Roman, Shanelle and Zhang, Zilin and Radev, Dragomir. S pider: A Large-Scale Human-Labeled Dataset for Complex and Cross-Domain Semantic Parsing and Text-to- SQL Task. Proceedings of the 2018 Conference on Empirical...

work page 2018
[14]

Natural Language to Code Translation with Execution

Shi, Freda and Fried, Daniel and Ghazvininejad, Marjan and Zettlemoyer, Luke and Wang, Sida I. Natural Language to Code Translation with Execution. Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing. 2022

work page 2022
[15]

International Conference on Machine Learning , pages=

Break-it-fix-it: Unsupervised learning for program repair , author=. International Conference on Machine Learning , pages=. 2021 , organization=

work page 2021
[16]

Proceedings of the aaai conference on artificial intelligence , year=

Deepfix: Fixing common c language errors by deep learning , author=. Proceedings of the aaai conference on artificial intelligence , year=

work page
[19]

RAT-SQL : Relation-Aware Schema Encoding and Linking for Text-to- SQL Parsers

Wang, Bailin and Shin, Richard and Liu, Xiaodong and Polozov, Oleksandr and Richardson, Matthew. RAT-SQL : Relation-Aware Schema Encoding and Linking for Text-to- SQL Parsers. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 2020

work page 2020
[20]

PICARD : Parsing Incrementally for Constrained Auto-Regressive Decoding from Language Models

Scholak, Torsten and Schucher, Nathan and Bahdanau, Dzmitry. PICARD : Parsing Incrementally for Constrained Auto-Regressive Decoding from Language Models. Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. 2021

work page 2021
[22]

Importance of Synthesizing High-quality Data for Text-to-

Yiyun Zhao and Jiarong Jiang and Yiqun Hu and Wuwei Lan and Henghui Zhu and Anuj Chauhan and Alexander Hanbo Li and Lin Pan and Jun Wang and Chung-Wei Hang and Sheng Zhang and Mingwen Dong and Joseph Lilien and Patrick Ng and Zhiguo Wang and Vittorio Castelli and Bing Xiang , booktitle=. Importance of Synthesizing High-quality Data for Text-to-. 2022 , url=

work page 2022
[27]

Advances in Neural Information Processing Systems , volume=

Unsupervised translation of programming languages , author=. Advances in Neural Information Processing Systems , volume=

work page
[28]

International Conference on Learning Representations , year=

Leveraging Automated Unit Tests for Unsupervised Code Translation , author=. International Conference on Learning Representations , year=

work page
[29]

Advances in Neural Information Processing Systems , editor=

Large Language Models are Zero-Shot Reasoners , author=. Advances in Neural Information Processing Systems , editor=. 2022 , url=

work page 2022
[31]

Advances in neural information processing systems , volume=

Tree-to-tree neural networks for program translation , author=. Advances in neural information processing systems , volume=

work page
[33]

2023 , eprint=

CodeGeeX: A Pre-Trained Model for Code Generation with Multilingual Evaluations on HumanEval-X , author=. 2023 , eprint=

work page 2023
[37]

The Eleventh International Conference on Learning Representations , year=

CodeT: Code Generation with Generated Tests , author=. The Eleventh International Conference on Learning Representations , year=

work page
[38]

Competition-level code generation with alphacode

Competition-level code generation with alphacode. Science. 2022

work page 2022
[39]

Measuring Coding Challenge Competence With APPS

Dan Hendrycks and Steven Basart and Saurav Kadavath and Mantas Mazeika and Akul Arora and Ethan Guo and Collin Burns and Samir Puranik and Horace He and Dawn Song and Jacob Steinhardt. Measuring Coding Challenge Competence With APPS. 2021

work page 2021
[40]

Self-Consistency Improves Chain of Thought Reasoning in Language Models

Self-Consistency Improves Chain of Thought Reasoning in Language Models. The Eleventh International Conference on Learning Representations.

work page
[41]

Robustfill: Neural program learning under noisy i/o

Robustfill: Neural program learning under noisy i/o. International conference on machine learning. 2017

work page 2017
[42]

Leveraging Grammar and Reinforcement Learning for Neural Program Synthesis

Leveraging Grammar and Reinforcement Learning for Neural Program Synthesis. International Conference on Learning Representations.

work page
[43]

Execution-guided neural program synthesis

Execution-guided neural program synthesis. International Conference on Learning Representations.

work page
[44]

Latent execution for neural program synthesis beyond domain-specific languages

Latent execution for neural program synthesis beyond domain-specific languages. Advances in Neural Information Processing Systems.

work page
[45]

Reranking for neural semantic parsing

Reranking for neural semantic parsing. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics.

work page
[46]

Generating Sequences by Learning to Self-Correct

Generating Sequences by Learning to Self-Correct. The Eleventh International Conference on Learning Representations.

work page
[49]

CodeGen: An Open Large Language Model for Code with Multi-Turn Program Synthesis

CodeGen: An Open Large Language Model for Code with Multi-Turn Program Synthesis. The Eleventh International Conference on Learning Representations.

work page
[50]

Coda: An End-to-End Neural Program Decompiler

Coda: An End-to-End Neural Program Decompiler. NeurIPS.

work page
[51]

Dynamic Neural Program Embedding for Program Repair

Dynamic Neural Program Embedding for Program Repair. International Conference on Learning Representations.

work page
[56]

Chain of thought prompting elicits reasoning in large language models

Chain of thought prompting elicits reasoning in large language models. NeurIPS.

work page
[57]

Least-to-Most Prompting Enables Complex Reasoning in Large Language Models

Least-to-Most Prompting Enables Complex Reasoning in Large Language Models. The Eleventh International Conference on Learning Representations.

work page
[60]

The pragmatic programmer: from journeyman to master

The pragmatic programmer: from journeyman to master. 2000

work page 2000
[62]

Exploring the limits of transfer learning with a unified text-to-text transformer

Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research. 2020

work page 2020
[63]

Language to Logical Form with Neural Attention

Dong, Li and Lapata, Mirella. Language to Logical Form with Neural Attention. Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2016

work page 2016
[64]

Mapping Language to Code in Programmatic Context

Iyer, Srinivasan and Konstas, Ioannis and Cheung, Alvin and Zettlemoyer, Luke. Mapping Language to Code in Programmatic Context. Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. 2018

work page 2018
[65]

A systematic evaluation of large language models of code

A systematic evaluation of large language models of code. Proceedings of the 6th ACM SIGPLAN International Symposium on Machine Programming.

work page
[67]

Learning a Neural Semantic Parser from User Feedback

Iyer, Srinivasan and Konstas, Ioannis and Cheung, Alvin and Krishnamurthy, Jayant and Zettlemoyer, Luke. Learning a Neural Semantic Parser from User Feedback. Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2017

work page 2017
[68]

Speak to your Parser: Interactive Text-to-SQL with Natural Language Feedback

Elgohary, Ahmed and Hosseini, Saghar and Hassan Awadallah, Ahmed. Speak to your Parser: Interactive Text-to-SQL with Natural Language Feedback. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 2020

work page 2020
[69]

CoSQL: A Conversational Text-to-SQL Challenge Towards Cross-Domain Natural Language Interfaces to Databases

Yu, Tao and Zhang, Rui and Er, Heyang and Li, Suyi and Xue, Eric and Pang, Bo and Lin, Xi Victoria and Tan, Yi Chern and Shi, Tianze and Li, Zihan and Jiang, Youxuan and Yasunaga, Michihiro and Shim, Sungrok and Chen, Tao and Fabbri, Alexander and Li, Zifan and Chen, Luyao and Zhang, Yuwen and Dixit, Shreya and Zhang, Vincent and Xiong, Caiming and Socher...

work page 2019
[70]

Model-based Interactive Semantic Parsing: A Unified Framework and A Text-to-SQL Case Study

Yao, Ziyu and Su, Yu and Sun, Huan and Yih, Wen-tau. Model-based Interactive Semantic Parsing: A Unified Framework and A Text-to-SQL Case Study. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). 2019

work page 2019
[71]

A data-driven approach for learning to control computers

A data-driven approach for learning to control computers. International Conference on Machine Learning. 2022

work page 2022
[74]

Multi-lingual evaluation of code generation models

Ben Athiwaratkun, Sanjay Krishna Gouda, Zijian Wang, Xiaopeng Li, Yuchen Tian, Ming Tan, Wasi Uddin Ahmad, Shiqi Wang, Qing Sun, Mingyue Shang, Sujan Kumar Gonugondla, Hantian Ding, Varun Kumar, Nathan Fulton, Arash Farahani, Siddhartha Jain, Robert Giaquinto, Haifeng Qian, Murali Krishna Ramanathan, Ramesh Nallapati, Baishakhi Ray, Parminder Bhatia, Sudi...

work page 2023
[75]

Program Synthesis with Large Language Models

Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, et al. Program synthesis with large language models. arXiv preprint arXiv:2108.07732, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[76]

Constitutional AI: Harmlessness from AI Feedback

Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, et al. Constitutional ai: Harmlessness from ai feedback. arXiv preprint arXiv:2212.08073, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[77]

Language models are few-shot learners

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877--1901, 2020

work page 2020
[78]

Leveraging grammar and reinforcement learning for neural program synthesis

Rudy Bunel, Matthew Hausknecht, Jacob Devlin, Rishabh Singh, and Pushmeet Kohli. Leveraging grammar and reinforcement learning for neural program synthesis. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=H1Xw62kRZ

work page 2018
[79]

Improving code generation by training with natural language feedback

Angelica Chen, Jérémy Scheurer, Tomasz Korbak, Jon Ander Campos, Jun Shern Chan, Samuel R Bowman, Kyunghyun Cho, and Ethan Perez. Improving code generation by training with natural language feedback. arXiv preprint arXiv:2303.16749, 2023a

work page arXiv 2023
[80]

Codet: Code generation with generated tests

Bei Chen, Fengji Zhang, Anh Nguyen, Daoguang Zan, Zeqi Lin, Jian-Guang Lou, and Weizhu Chen. Codet: Code generation with generated tests. In The Eleventh International Conference on Learning Representations, 2023b. URL https://openreview.net/forum?id=ktrw68Cmu9c

work page 2023
[81]

Evaluating Large Language Models Trained on Code

Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021a

work page internal anchor Pith review Pith/arXiv arXiv 2021
[82]

Program of Thoughts Prompting: Disentangling Computation from Reasoning for Numerical Reasoning Tasks

Wenhu Chen, Xueguang Ma, Xinyi Wang, and William W Cohen. Program of thoughts prompting: Disentangling computation from reasoning for numerical reasoning tasks. arXiv preprint arXiv:2211.12588, 2022

work page internal anchor Pith review arXiv 2022
[83]

Tree-to-tree neural networks for program translation

Xinyun Chen, Chang Liu, and Dawn Song. Tree-to-tree neural networks for program translation. Advances in neural information processing systems, 31, 2018

work page 2018
[84]

Execution-guided neural program synthesis

Xinyun Chen, Chang Liu, and Dawn Song. Execution-guided neural program synthesis. In International Conference on Learning Representations, 2019

work page 2019
[85]

Latent execution for neural program synthesis beyond domain-specific languages

Xinyun Chen, Dawn Song, and Yuandong Tian. Latent execution for neural program synthesis beyond domain-specific languages. Advances in Neural Information Processing Systems, 34:22196--22208, 2021b

work page 2021
[86]

PaLM: Scaling Language Modeling with Pathways

Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. Palm: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[87]

Robustfill: Neural program learning under noisy i/o

Jacob Devlin, Jonathan Uesato, Surya Bhupatiraju, Rishabh Singh, Abdel-rahman Mohamed, and Pushmeet Kohli. Robustfill: Neural program learning under noisy i/o. In International conference on machine learning, pp. 990--998. PMLR, 2017

work page 2017
[88]

Language to logical form with neural attention

Li Dong and Mirella Lapata. Language to logical form with neural attention. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2016

work page 2016
[89]

Speak to your parser: Interactive text-to-SQL with natural language feedback

Ahmed Elgohary, Saghar Hosseini, and Ahmed Hassan Awadallah. Speak to your parser: Interactive text-to-SQL with natural language feedback. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020

work page 2020
[90]

Coda: An end-to-end neural program decompiler

Cheng Fu, Huili Chen, Haolan Liu, Xinyun Chen, Yuandong Tian, Farinaz Koushanfar, and Jishen Zhao. Coda: An end-to-end neural program decompiler. In NeurIPS, 2019

work page 2019
[91]

The capacity for moral self-correction in large language models

Deep Ganguli, Amanda Askell, Nicholas Schiefer, Thomas Liao, Kamilė Lukošiūtė, Anna Chen, Anna Goldie, Azalia Mirhoseini, Catherine Olsson, Danny Hernandez, et al. The capacity for moral self-correction in large language models. arXiv preprint arXiv:2302.07459, 2023

work page arXiv 2023
[92]

PAL: Program-aided Language Models

Luyu Gao, Aman Madaan, Shuyan Zhou, Uri Alon, Pengfei Liu, Yiming Yang, Jamie Callan, and Graham Neubig. Pal: Program-aided language models. arXiv preprint arXiv:2211.10435, 2022

work page Pith review arXiv 2022
[93]

Synthesize, execute and debug: Learning to repair for neural program synthesis

Kavi Gupta, Peter Ebert Christensen, Xinyun Chen, and Dawn Song. Synthesize, execute and debug: Learning to repair for neural program synthesis. Advances in Neural Information Processing Systems, 33:17685--17695, 2020

work page 2020
[94]

Deepfix: Fixing common c language errors by deep learning

Rahul Gupta, Soham Pal, Aditya Kanade, and Shirish Shevade. Deepfix: Fixing common c language errors by deep learning. In Proceedings of the AAAI Conference on Artificial Intelligence, 2017

work page 2017
[95]

Measuring coding challenge competence with APPS

Dan Hendrycks, Steven Basart, Saurav Kadavath, Mantas Mazeika, Akul Arora, Ethan Guo, Collin Burns, Samir Puranik, Horace He, Dawn Song, and Jacob Steinhardt. Measuring coding challenge competence with APPS. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2), 2021. URL https://openreview.net/forum?...

work page 2021
[96]

A data-driven approach for learning to control computers

Peter C Humphreys, David Raposo, Tobias Pohlen, Gregory Thornton, Rachita Chhaparia, Alistair Muldal, Josh Abramson, Petko Georgiev, Adam Santoro, and Timothy Lillicrap. A data-driven approach for learning to control computers. In International Conference on Machine Learning, pp. 9466--9482. PMLR, 2022

work page 2022
[97]

The pragmatic programmer: from journeyman to master, 2000

Andrew Hunt and David Thomas. The pragmatic programmer: from journeyman to master, 2000

work page 2000
[98]

Learning a neural semantic parser from user feedback

Srinivasan Iyer, Ioannis Konstas, Alvin Cheung, Jayant Krishnamurthy, and Luke Zettlemoyer. Learning a neural semantic parser from user feedback. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2017

work page 2017
[99]

Mapping language to code in programmatic context

Srinivasan Iyer, Ioannis Konstas, Alvin Cheung, and Luke Zettlemoyer. Mapping language to code in programmatic context. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, 2018

work page 2018
[100]

Decomposed prompting: A modular approach for solving complex tasks

Tushar Khot, Harsh Trivedi, Matthew Finlayson, Yao Fu, Kyle Richardson, Peter Clark, and Ashish Sabharwal. Decomposed prompting: A modular approach for solving complex tasks. arXiv preprint arXiv:2210.02406, 2022

work page arXiv 2022
[101]

Language models can solve computer tasks

Geunwoo Kim, Pierre Baldi, and Stephen McAleer. Language models can solve computer tasks. arXiv preprint arXiv:2303.17491, 2023

work page arXiv 2023
[102]

Large language models are zero-shot reasoners

Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho (eds.), Advances in Neural Information Processing Systems, 2022. URL https://openreview.net/forum?id=e2TBb5y0yFf

work page 2022
[103]

Pretraining language models with human preferences

Tomasz Korbak, Kejian Shi, Angelica Chen, Rasika Bhalerao, Christopher L Buckley, Jason Phang, Samuel R Bowman, and Ethan Perez. Pretraining language models with human preferences. arXiv preprint arXiv:2302.08582, 2023

work page arXiv 2023
[104]

Coderl: Mastering code generation through pretrained models and deep reinforcement learning

Hung Le, Yue Wang, Akhilesh Deepak Gotmare, Silvio Savarese, and Steven Chu Hong Hoi. Coderl: Mastering code generation through pretrained models and deep reinforcement learning. Advances in Neural Information Processing Systems, 35:21314--21328, 2022

work page 2022
[105]

Graphix-t5: Mixing pre-trained transformers with graph-aware layers for text-to-sql parsing

Jinyang Li, Binyuan Hui, Reynold Cheng, Bowen Qin, Chenhao Ma, Nan Huo, Fei Huang, Wenyu Du, Luo Si, and Yongbin Li. Graphix-t5: Mixing pre-trained transformers with graph-aware layers for text-to-sql parsing. arXiv preprint arXiv:2301.07507, 2023a

work page arXiv 2023
[106]

StarCoder: may the source be with you!

Raymond Li, Loubna Ben Allal, Yangtian Zi, Niklas Muennighoff, Denis Kocetkov, Chenghao Mou, Marc Marone, Christopher Akiki, Jia Li, Jenny Chim, et al. Starcoder: may the source be with you! arXiv preprint arXiv:2305.06161, 2023b

work page internal anchor Pith review Pith/arXiv arXiv 2023
[107]

Competition-level code generation with alphacode

Yujia Li, David Choi, Junyoung Chung, Nate Kushman, Julian Schrittwieser, Rémi Leblond, Tom Eccles, James Keeling, Felix Gimeno, Agustin Dal Lago, et al. Competition-level code generation with alphacode. Science, 378(6624):1092--1097, 2022

work page 2022
[108]

Chain of hindsight aligns language models with feedback

Hao Liu, Carmelo Sferrazza, and Pieter Abbeel. Chain of hindsight aligns language models with feedback. arXiv preprint arXiv:2302.02676, 2023

work page arXiv 2023
[109]

Learning performance-improving code edits

Aman Madaan, Alexander Shypula, Uri Alon, Milad Hashemi, Parthasarathy Ranganathan, Yiming Yang, Graham Neubig, and Amir Yazdanbakhsh. Learning performance-improving code edits. arXiv preprint arXiv:2302.07867, 2023a

work page arXiv 2023
