pith. machine review for the scientific record.

arxiv: 2304.05128 · v2 · submitted 2023-04-11 · 💻 cs.CL · cs.AI

Recognition: 2 theorem links

Teaching Large Language Models to Self-Debug


Pith reviewed 2026-05-12 06:19 UTC · model grok-4.3

keywords self-debugging · large language models · code generation · program repair · few-shot prompting · text-to-SQL · code translation

The pith

Large language models can debug their own code by explaining it in natural language and analyzing execution results without human input.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Self-Debugging, a method that uses few-shot demonstrations to teach LLMs to debug their own code outputs through prompting rather than additional training. The model generates a natural-language explanation of its code and then uses that explanation, together with execution results where they exist, to identify and correct errors. This yields higher accuracy on code generation tasks such as text-to-SQL and code translation, plus better sample efficiency when multiple candidates are drawn. A reader should care because the technique shows how an LLM can turn its own internal reasoning into a repair loop that works even when no external tests or human judgments are supplied.

Core claim

Self-Debugging teaches large language models to debug their predicted programs via few-shot demonstrations. The model performs rubber duck debugging by explaining the generated code in natural language and investigating execution results, all without any human feedback on correctness or error messages. The approach reaches state-of-the-art results on the Spider text-to-SQL benchmark, TransCoder C++-to-Python translation, and MBPP text-to-Python generation. On Spider, where no unit tests are available, it raises baseline accuracy by 2-3 percent overall and by 9 percent on the hardest problems; on the other two benchmarks, where unit tests are present, it raises accuracy by as much as 12 percent. It also improves sample efficiency, matching or outperforming baselines that generate more than 10x as many candidate programs.

What carries the argument

Self-Debugging, the few-shot prompting procedure that has the LLM first explain its code in natural language and then revise the code using that explanation plus raw execution output.
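The explain-then-revise procedure described above can be sketched as a small loop. This is only an illustration under stated assumptions: `self_debug`, `generate`, `execute`, and `max_turns` are hypothetical names, the prompt strings are placeholders rather than the paper's few-shot templates, and stopping when the model returns its code unchanged is an assumed convention.

```python
from typing import Callable, Optional


def self_debug(
    task: str,
    generate: Callable[[str], str],
    execute: Optional[Callable[[str], str]] = None,
    max_turns: int = 3,
) -> str:
    """Hypothetical explain-then-revise loop (not the paper's actual prompts)."""
    # Initial single-pass prediction.
    code = generate(f"Task: {task}\nWrite code.")
    for _ in range(max_turns):
        # Step 1: the model explains its own code in natural language.
        explanation = generate(f"Explain this code line by line:\n{code}")
        # Step 2: collect raw execution output when an executor is supplied.
        observed = execute(code) if execute is not None else "(no execution available)"
        # Step 3: the model revises using the explanation plus execution output.
        revised = generate(
            f"Task: {task}\nCode:\n{code}\nExplanation:\n{explanation}\n"
            f"Execution output:\n{observed}\n"
            "If the code is correct, repeat it unchanged; otherwise return a fixed version."
        )
        # Assumed stopping rule: an unchanged answer means the model accepts its code.
        if revised == code:
            break
        code = revised
    return code
```

Passing `execute=None` gives an explanation-only variant of the kind that would apply when no execution feedback exists; supplying an executor adds the execution output to the revision prompt.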

Load-bearing premise

An LLM can reliably detect its own semantic or logical errors and produce correct fixes solely from its own natural-language code explanations plus raw execution outputs, without external oracles or human judgment.

What would settle it

Give the model its own incorrect code on a held-out set of complex problems, prompt it to explain and debug the code with no additional error information, and measure whether the revised code passes independent verification at a rate clearly above the baseline single-pass rate.
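One way to score such a test is to compare per-problem pass/fail outcomes before and after debugging. The sketch below is an assumption about how that bookkeeping might look; `repair_summary` and its field names are hypothetical, and the boolean lists are presumed to come from an independent verifier that the model never sees.

```python
def repair_summary(before: list[bool], after: list[bool]) -> dict[str, float]:
    """Compare single-pass outcomes with post-debugging outcomes per problem."""
    if len(before) != len(after) or not before:
        raise ValueError("need two non-empty outcome lists of equal length")
    n = len(before)
    # Problems that failed initially and pass after the debugging step.
    fixed = sum((not b) and a for b, a in zip(before, after))
    # Problems that passed initially but were broken by the revision.
    broken = sum(b and (not a) for b, a in zip(before, after))
    return {
        "baseline_acc": sum(before) / n,
        "debugged_acc": sum(after) / n,
        "fixed": fixed,
        "broken": broken,
    }
```

Tracking `broken` alongside `fixed` matters because a net gain can hide revisions that damage code which was already correct.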

Original abstract

Large language models (LLMs) have achieved impressive performance on code generation. However, for complex programming tasks, generating the correct solution in one go becomes challenging, thus some prior works have designed program repair approaches to improve code generation performance. In this work, we propose Self-Debugging, which teaches a large language model to debug its predicted program via few-shot demonstrations. In particular, we demonstrate that Self-Debugging can teach the large language model to perform rubber duck debugging; i.e., without any human feedback on the code correctness or error messages, the model is able to identify its mistakes by investigating the execution results and explaining the generated code in natural language. Self-Debugging achieves the state-of-the-art performance on several code generation benchmarks, including the Spider dataset for text-to-SQL generation, TransCoder for C++-to-Python translation, and MBPP for text-to-Python generation. On the Spider benchmark where there are no unit tests to verify the correctness of predictions, Self-Debugging with code explanation consistently improves the baseline by 2-3%, and improves the prediction accuracy on problems of the hardest level by 9%. On TransCoder and MBPP where unit tests are available, Self-Debugging improves the baseline accuracy by up to 12%. Meanwhile, by leveraging feedback messages and reusing failed predictions, Self-Debugging notably improves sample efficiency, and can match or outperform baseline models that generate more than 10x candidate programs.
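Where unit tests are available, the feedback message can come from simply running them. The following is a minimal sketch, assuming MBPP-style `assert` statements as tests; `run_unit_tests` is a hypothetical helper, the message wording is invented for illustration, and `exec` is used without sandboxing, which would not be safe for untrusted model output.

```python
def run_unit_tests(candidate_src: str, tests: list[str]) -> tuple[bool, str]:
    """Run assert-style tests against candidate code; return (passed, feedback)."""
    namespace: dict = {}
    try:
        # Define the candidate's functions in an isolated namespace.
        exec(candidate_src, namespace)
    except Exception as exc:
        return False, f"Code failed to load: {type(exc).__name__}: {exc}"
    for test in tests:
        try:
            exec(test, namespace)
        except AssertionError:
            # The first failing assertion becomes the feedback message.
            return False, f"Failed test: {test}"
        except Exception as exc:
            return False, f"Error on test {test!r}: {type(exc).__name__}: {exc}"
    return True, "All tests passed."
```

The returned feedback string is the kind of signal that could be appended to a debugging prompt, so a failed prediction is reused rather than discarded.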

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper introduces Self-Debugging, a few-shot prompting approach that teaches LLMs to perform rubber-duck debugging on their own code generations: the model explains the generated code in natural language and (where available) inspects execution results to identify and repair mistakes. It reports state-of-the-art results on three code-generation benchmarks—Spider (text-to-SQL), TransCoder (C++-to-Python), and MBPP (text-to-Python)—with absolute gains of 2–3 % (9 % on the hardest Spider problems) when no unit tests are present and up to 12 % when unit tests are available, plus improved sample efficiency by reusing failed candidates.

Significance. If the gains prove robust, the work would show that LLMs can achieve meaningful self-correction on complex code tasks without external oracles or human feedback, advancing the design of autonomous programming agents. The Spider results are especially noteworthy because they rely solely on the model’s own explanations rather than execution feedback.

major comments (3)
  1. [§4 (Spider results)] the 2–3 % overall and 9 % hardest-problem gains rest on the assumption that LLM-generated natural-language explanations faithfully expose semantic mismatches with the query intent. No ablation isolating the explanation step from extra inference steps or prompt length is reported, leaving open the possibility that the improvement is an artifact of additional generation rather than genuine self-debugging.
  2. [§3 (method) and abstract] the procedure is described as identifying mistakes “by investigating the execution results and explaining the generated code,” yet Spider has no execution results. The exact pipeline used on Spider (explanation-only vs. other mechanisms) must be stated explicitly and supported by qualitative examples showing that fixes are causally driven by the explanations.
  3. [§4 (experimental setup)] no details are supplied on prompt templates, number of few-shot demonstrations, exact baseline prompts, temperature settings, or statistical significance tests for the reported deltas. These omissions are load-bearing because the central performance claims cannot be assessed or reproduced without them.
minor comments (1)
  1. [Abstract] The abstract introduces the phrase “Self-Debugging with code explanation” without defining the variant or contrasting it with the base method.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We address each major point below and will revise the manuscript to incorporate clarifications, additional experiments, and details as outlined.

Point-by-point responses
  1. Referee: [§4 (Spider results)] the 2–3 % overall and 9 % hardest-problem gains rest on the assumption that LLM-generated natural-language explanations faithfully expose semantic mismatches with the query intent. No ablation isolating the explanation step from extra inference steps or prompt length is reported, leaving open the possibility that the improvement is an artifact of additional generation rather than genuine self-debugging.

    Authors: We agree that an ablation isolating the contribution of the structured explanation step would strengthen the claims. In the revised manuscript we will add an ablation comparing the full Self-Debugging pipeline against (i) a baseline that performs an equivalent number of additional generation steps without the debugging prompt structure and (ii) a longer-prompt baseline that simply concatenates extra text. While our primary baselines already match the number of few-shot examples and total tokens in the initial generation, we acknowledge that this extra control is necessary to rule out artifacts from inference budget. revision: yes

  2. Referee: [§3 (method) and abstract] the procedure is described as identifying mistakes “by investigating the execution results and explaining the generated code,” yet Spider has no execution results. The exact pipeline used on Spider (explanation-only vs. other mechanisms) must be stated explicitly and supported by qualitative examples showing that fixes are causally driven by the explanations.

    Authors: We thank the referee for catching this ambiguity. The abstract already notes that Spider results use “Self-Debugging with code explanation” because no unit tests are available. We will revise §3 to explicitly describe the two variants: (a) explanation-only for Spider (the model generates a natural-language explanation of the predicted SQL and compares it against the question to detect semantic mismatches) and (b) explanation-plus-execution for TransCoder and MBPP. We will also add qualitative examples (in the main text or appendix) that trace specific fixes back to statements in the generated explanations. revision: yes

  3. Referee: [§4 (experimental setup)] no details are supplied on prompt templates, number of few-shot demonstrations, exact baseline prompts, temperature settings, or statistical significance tests for the reported deltas. These omissions are load-bearing because the central performance claims cannot be assessed or reproduced without them.

    Authors: We apologize for the omission of these reproducibility details. In the revised manuscript we will add a new subsection (or appendix) containing: the complete prompt templates for each dataset, the exact number of few-shot demonstrations used (3 for Spider and MBPP, 2 for TransCoder), the precise baseline prompts, temperature = 0 for all main results (with additional temperature-0.7 runs reported for robustness), and statistical significance tests via paired bootstrap resampling over 5 random seeds to confirm the reported deltas. revision: yes
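The paired bootstrap test mentioned in the last response can be illustrated with a short sketch. It assumes per-problem 0/1 correctness scores for the baseline and for Self-Debugging on the same problems; `paired_bootstrap`, its parameters, and the one-sided "not better" criterion are illustrative choices, not details taken from the paper or the rebuttal.

```python
import random


def paired_bootstrap(
    baseline: list[int],
    treated: list[int],
    n_resamples: int = 10_000,
    seed: int = 0,
) -> float:
    """Fraction of resamples in which `treated` fails to beat `baseline`."""
    if len(baseline) != len(treated) or not baseline:
        raise ValueError("need two non-empty score lists of equal length")
    rng = random.Random(seed)
    n = len(baseline)
    # Pairing: keep each problem's two scores together as one difference.
    diffs = [t - b for b, t in zip(baseline, treated)]
    not_better = 0
    for _ in range(n_resamples):
        # Resample problems with replacement and sum the paired differences.
        total = sum(diffs[rng.randrange(n)] for _ in range(n))
        if total <= 0:
            not_better += 1
    return not_better / n_resamples
```

A small returned fraction suggests the reported delta is unlikely to be a resampling artifact; resampling paired differences instead of the two systems independently keeps problem difficulty from inflating the variance.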

Circularity Check

0 steps flagged

No circularity: empirical results measured against independent external benchmarks

full rationale

The paper proposes an empirical prompting technique (few-shot self-debugging demonstrations) and reports accuracy improvements on standard code-generation benchmarks. Performance on Spider is measured against ground-truth SQL queries, while TransCoder and MBPP use unit-test oracles; none of these metrics are defined in terms of the method's own outputs or fitted parameters. No equations, self-definitional loops, or load-bearing self-citations appear in the derivation of the reported gains. The central claims therefore remain externally falsifiable and do not reduce to their inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The approach rests on the domain assumption that current LLMs possess sufficient in-context learning capacity to follow debugging demonstrations and to produce useful natural-language explanations; no new entities are postulated and no numeric parameters are fitted inside the method itself.

axioms (1)
  • domain assumption: Large language models can follow few-shot demonstrations to perform complex reasoning tasks such as code explanation and error detection.
    The entire Self-Debugging pipeline depends on this in-context learning capability being present in the base model.

pith-pipeline@v0.9.0 · 5566 in / 1307 out tokens · 47247 ms · 2026-05-12T06:19:14.738133+00:00 · methodology


Forward citations

Cited by 39 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. RubricRefine: Improving Tool-Use Agent Reliability with Training-Free Pre-Execution Refinement

    cs.LG 2026-05 unverdicted novelty 7.0

    RubricRefine improves tool-use agent reliability to 0.86 on M3ToolEval by generating rubrics for pre-execution contract checking and iterative repair, outperforming baselines at 2.6X lower latency while showing no gai...

  2. Your Simulation Runs but Solves the Wrong Physics: PDE-Grounded Intent Verification for LLM-Generated Multiphysics Simulation Code

    cs.LG 2026-05 unverdicted novelty 7.0

    A new Intent Fidelity Score and refinement loop verify that LLM-generated simulation code matches the intended PDEs, improving performance on a 220-case benchmark where execution alone fails to ensure correctness.

  3. LEAF-SQL: Level-wise Exploration with Adaptive Fine-graining for Text-to-SQL Skeleton Prediction

    cs.CL 2026-05 unverdicted novelty 7.0

    LEAF-SQL uses level-wise exploration with adaptive fine-graining and dual agents to generate diverse SQL skeletons, reaching 71.6% execution accuracy on the BIRD benchmark and outperforming prior search- and skeleton-...

  4. Skill Drift Is Contract Violation: Proactive Maintenance for LLM Agent Skill Libraries

    cs.SE 2026-05 conditional novelty 7.0

    SkillGuard extracts executable environment contracts from LLM skill documents to detect only relevant drifts, reporting zero false positives on 599 cases, 100% precision in known-drift tests, and raising one-round rep...

  5. Debugging the Debuggers: Failure-Anchored Structured Recovery for Software Engineering Agents

    cs.SE 2026-05 unverdicted novelty 7.0

    PROBE structures runtime telemetry into diagnoses and evidence-grounded guidance, raising recovery rates by 12.45 points over baselines on 257 unresolved software repair and AIOps cases.

  6. CP-SynC: Multi-Agent Zero-Shot Constraint Modeling in MiniZinc with Synthesized Checkers

    cs.AI 2026-05 unverdicted novelty 7.0

    CP-SynC uses coordinated LLM agents to generate, validate via synthesized checkers, and select MiniZinc models from natural language, substantially outperforming baselines on a 100-problem benchmark.

  7. Empowering Autonomous Debugging Agents with Efficient Dynamic Analysis

    cs.SE 2026-04 unverdicted novelty 7.0

    ADI equips AI debugging agents with function-level interaction via a new execution trace structure, raising SWE-bench Verified resolution to 63.8% at $1.28 per task and delivering 6-18% gains when added to existing agents.

  8. PhysCodeBench: Benchmarking Physics-Aware Symbolic Simulation of 3D Scenes via Self-Corrective Multi-Agent Refinement

    cs.RO 2026-04 unverdicted novelty 7.0

    PhysCodeBench benchmark and SMRF multi-agent framework enable better AI generation of physically accurate 3D simulation code, boosting performance by 31 points over baselines.

  9. PlayCoder: Making LLM-Generated GUI Code Playable

    cs.SE 2026-04 conditional novelty 7.0

    PlayCoder raises the rate of LLM-generated GUI apps that can be played end-to-end without logic errors from near zero to 20.3% Play@3 by adding repository-aware generation, agent-driven testing, and iterative repair.

  10. Structural Verification for Reliable EDA Code Generation without Tool-in-the-Loop Debugging

    cs.SE 2026-04 unverdicted novelty 7.0

    Structural dependency graphs and staged pre-execution verification raise LLM-based EDA code pass rates to 82.5% (single-step) and 70-84% (multi-step) while halving tool calls by catching dependency violations before runtime.

  11. Self-Consistency from Only Two Samples: CoT-PoT Ensembling for Efficient LLM Reasoning

    cs.CL 2026-04 unverdicted novelty 7.0

    CoT-PoT ensembling achieves self-consistency accuracy in LLMs with only two samples for 78.6% of tasks, reducing computation by 9.3x compared to standard methods.

  12. Feedback-Driven Execution for LLM-Based Binary Analysis

    cs.CR 2026-04 unverdicted novelty 7.0

    FORGE uses a reasoning-action-observation loop and Dynamic Forest of Agents to perform scalable LLM-based binary analysis, finding 1,274 vulnerabilities across 591 of 3,457 real-world firmware binaries at 72.3% precis...

  13. BACE: LLM-based Code Generation through Bayesian Anchored Co-Evolution of Code and Test Populations

    cs.NE 2026-03 unverdicted novelty 7.0

    BACE reformulates LLM code synthesis as Bayesian co-evolution of code and test populations anchored on minimal public examples, achieving superior performance on LiveCodeBench v6.

  14. Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads

    cs.LG 2024-01 conditional novelty 7.0

    Medusa augments LLMs with multiple decoding heads and tree-based attention to predict and verify several tokens in parallel, yielding 2.2-3.6x inference speedup via two fine-tuning regimes.

  15. AgentCoder: Multi-Agent-based Code Generation with Iterative Testing and Optimisation

    cs.CL 2023-12 accept novelty 7.0

    A three-agent loop of code generation, test creation, and execution feedback lifts pass@1 to 96.3% on HumanEval and 91.8% on MBPP for GPT-4 while using roughly half the tokens of prior state-of-the-art.

  16. Large Language Models as Optimizers

    cs.LG 2023-09 unverdicted novelty 7.0

    Large language models can optimize by being prompted with histories of past solutions and scores to propose better ones, producing prompts that raise accuracy up to 8% on GSM8K and 50% on Big-Bench Hard over human-des...

  17. Reflexion: Language Agents with Verbal Reinforcement Learning

    cs.AI 2023-03 conditional novelty 7.0

    Reflexion lets LLM agents improve via stored verbal reflections on task feedback, reaching 91% pass@1 on HumanEval and outperforming prior GPT-4 results.

  18. Learning from Failures: Correction-Oriented Policy Optimization with Verifiable Rewards

    cs.CL 2026-05 unverdicted novelty 6.0

    CIPO jointly optimizes standard RLVR rewards with correction samples derived from the model's own failed attempts, yielding better reasoning and self-correction on math and code benchmarks.

  19. Verification Mirage: Mapping the Reliability Boundary of Self-Verification in Medical VQA

    cs.CV 2026-05 unverdicted novelty 6.0

    Self-verification in medical VQA creates a verification mirage where verifiers exhibit high error and agreement bias on wrong answers, with reliability strongly conditioned on task type.

  20. RubricRefine: Improving Tool-Use Agent Reliability with Training-Free Pre-Execution Refinement

    cs.LG 2026-05 unverdicted novelty 6.0

    RubricRefine raises average tool-use reliability to 0.86 on M3ToolEval across seven models by scoring candidate code against generated contract rubrics before execution, beating prior inference-time methods at 2.6X lo...

  21. PaT: Planning-after-Trial for Efficient Test-Time Code Generation

    cs.CL 2026-05 unverdicted novelty 6.0

    PaT defers planning until after failed trials in LLM code generation, enabling heterogeneous cheap-plus-powerful model setups that match large-model performance at roughly 69% lower cost.

  22. EGREFINE: An Execution-Grounded Optimization Framework for Text-to-SQL Schema Refinement

    cs.DB 2026-05 unverdicted novelty 6.0

    EGRefine optimizes column renamings via execution-grounded verification and view materialization to recover Text-to-SQL accuracy lost to schema naming issues while guaranteeing query equivalence.

  23. When to Vote, When to Rewrite: Disagreement-Guided Strategy Routing for Test-Time Scaling

    cs.AI 2026-04 unverdicted novelty 6.0

    A disagreement-guided routing framework dynamically selects among resolution, voting, and rewriting strategies for test-time scaling, delivering 3-7% accuracy gains with lower sampling cost on mathematical benchmarks.

  24. WebGen-R1: Incentivizing Large Language Models to Generate Functional and Aesthetic Websites with Reinforcement Learning

    cs.CL 2026-04 unverdicted novelty 6.0

    WebGen-R1 uses end-to-end RL with scaffold-driven generation and cascaded rewards for structure, function, and aesthetics to transform a 7B model into a generator of deployable multi-page websites that rivals much lar...

  25. PV-SQL: Synergizing Database Probing and Rule-based Verification for Text-to-SQL Agents

    cs.AI 2026-04 unverdicted novelty 6.0

    PV-SQL boosts Text-to-SQL execution accuracy by 5% and valid efficiency by 20.8% on BIRD benchmarks via database probing and rule-based SQL verification while using fewer tokens.

  26. Beyond Fixed Tests: Repository-Level Issue Resolution as Coevolution of Code and Behavioral Constraints

    cs.SE 2026-04 unverdicted novelty 6.0

    Agent-CoEvo is a multi-agent LLM framework that coevolves code patches and test patches to resolve repository-level issues, outperforming fixed-test baselines on SWE-bench Lite and SWT-bench Lite.

  27. Ensemble-Based Uncertainty Estimation for Code Correctness Estimation

    cs.SE 2026-03 unverdicted novelty 6.0

    Ensemble Semantic Entropy improves correlation with code correctness over single-model methods and powers a cascading scaling system that cuts FLOPs by 64.9% while preserving performance on LiveCodeBench.

  28. LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code

    cs.SE 2024-03 unverdicted novelty 6.0

    LiveCodeBench collects 400 recent contest problems to create a contamination-free benchmark evaluating LLMs on code generation and related capabilities like self-repair and execution.

  29. Gorilla: Large Language Model Connected with Massive APIs

    cs.CL 2023-05 conditional novelty 6.0

    Gorilla is a fine-tuned LLM that surpasses GPT-4 in accurate API call generation and uses retrieval to handle documentation updates.

  30. Measure Twice, Click Once: Co-evolving Proposer and Visual Critic via Reinforcement Learning for GUI Grounding

    cs.LG 2026-04 unverdicted novelty 5.0

    A co-evolving proposer-critic RL framework improves GUI grounding accuracy by letting the model critique its own proposals rendered on screenshots.

  31. Bridging the Gap between User Intent and LLM: A Requirement Alignment Approach for Code Generation

    cs.SE 2026-04 unverdicted novelty 5.0

    REA-Coder improves LLM code generation by iteratively aligning requirements with model understanding and verifying outputs against the aligned spec.

  32. Lazy or Efficient? Towards Accessible Eye-Tracking Event Detection Using LLMs

    cs.HC 2026-04 unverdicted novelty 5.0

    An LLM-driven system generates executable eye-tracking detectors from user prompts and achieves accuracy comparable to classical I-VT and I-DT methods while eliminating the need for specialized programming.

  33. Understanding Performance Gap Between Parallel and Sequential Sampling in Large Reasoning Models

    cs.CL 2026-04 unverdicted novelty 5.0

    Lack of exploration from conditioning on prior answers is the primary reason parallel sampling outperforms sequential sampling in large reasoning models.

  34. Spec Kit Agents: Context-Grounded Agentic Workflows

    cs.SE 2026-04 unverdicted novelty 5.0

    A multi-agent SDD framework with phase-level context-grounding hooks improves LLM-judged quality by 0.15 points and SWE-bench Lite Pass@1 by 1.7 percent while preserving near-perfect test compatibility.

  35. LLM-Assisted Repository-Level Generation with Structured Spec-Driven Engineering

    cs.SE 2026-05 unverdicted novelty 4.0

    Structured specifications as LLM inputs can make high-quality repository-level code generation feasible with better verifiability than natural language prompts.

  36. How Many Tries Does It Take? Iterative Self-Repair in LLM Code Generation Across Model Scales and Benchmarks

    cs.SE 2026-04 unverdicted novelty 4.0

    Iterative self-repair improves LLM code pass rates by 4.9-17.1 pp on HumanEval and 16-30 pp on MBPP across seven models, with gains concentrated early and syntax errors easier to fix than logical ones.

  37. Skills-Coach: A Self-Evolving Skill Optimizer via Training-Free GRPO

    cs.CL 2026-04 unverdicted novelty 3.0

    Skills-Coach optimizes LLM agent skills via task generation, prompt/code tuning, comparative execution, and traceable evaluation, reporting gains on a 48-skill benchmark called Skill-X.

  38. LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods

    cs.CL 2024-12 accept novelty 3.0

    A survey that organizes LLMs-as-judges research into functionality, methodology, applications, meta-evaluation, and limitations.

  39. A Survey on Large Language Models for Code Generation

    cs.CL 2024-06 unverdicted novelty 3.0

    A systematic literature review that organizes recent work on LLMs for code generation into a taxonomy covering data curation, model advances, evaluations, ethics, environmental impact, and applications, with benchmark...

Reference graph

Works this paper leans on

115 extracted references · 115 canonical work pages · cited by 38 Pith papers · 12 internal anchors

  1. [1]

    Advances in Neural Information Processing Systems , volume=

    Synthesize, execute and debug: Learning to repair for neural program synthesis , author=. Advances in Neural Information Processing Systems , volume=

  2. [2]

    Advances in Neural Information Processing Systems , volume=

    Coderl: Mastering code generation through pretrained models and deep reinforcement learning , author=. Advances in Neural Information Processing Systems , volume=

  3. [4]

    International Conference on Machine Learning , pages=

    Graph-based, self-supervised program repair from diagnostic feedback , author=. International Conference on Machine Learning , pages=. 2020 , organization=

  4. [5]

    The Eleventh International Conference on Learning Representations , year=

    Multi-lingual Evaluation of Code Generation Models , author=. The Eleventh International Conference on Learning Representations , year=

  5. [10]

    Advances in Neural Information Processing Systems , editor=

    Training language models to follow instructions with human feedback , author=. Advances in Neural Information Processing Systems , editor=. 2022 , url=

  6. [11]

    International Conference on Learning Representations , year=

    Multitask Prompted Training Enables Zero-Shot Task Generalization , author=. International Conference on Learning Representations , year=

  7. [12]

    Advances in neural information processing systems , volume=

    Language models are few-shot learners , author=. Advances in neural information processing systems , volume=

  8. [13]

    S pider: A Large-Scale Human-Labeled Dataset for Complex and Cross-Domain Semantic Parsing and Text-to- SQL Task

    Yu, Tao and Zhang, Rui and Yang, Kai and Yasunaga, Michihiro and Wang, Dongxu and Li, Zifan and Ma, James and Li, Irene and Yao, Qingning and Roman, Shanelle and Zhang, Zilin and Radev, Dragomir. S pider: A Large-Scale Human-Labeled Dataset for Complex and Cross-Domain Semantic Parsing and Text-to- SQL Task. Proceedings of the 2018 Conference on Empirical...

  9. [14]

    Natural Language to Code Translation with Execution

    Shi, Freda and Fried, Daniel and Ghazvininejad, Marjan and Zettlemoyer, Luke and Wang, Sida I. Natural Language to Code Translation with Execution. Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing. 2022

  10. [15]

    International Conference on Machine Learning , pages=

    Break-it-fix-it: Unsupervised learning for program repair , author=. International Conference on Machine Learning , pages=. 2021 , organization=

  11. [16]

    Proceedings of the aaai conference on artificial intelligence , year=

    Deepfix: Fixing common c language errors by deep learning , author=. Proceedings of the aaai conference on artificial intelligence , year=

  12. [19]

    RAT-SQL : Relation-Aware Schema Encoding and Linking for Text-to- SQL Parsers

    Wang, Bailin and Shin, Richard and Liu, Xiaodong and Polozov, Oleksandr and Richardson, Matthew. RAT-SQL : Relation-Aware Schema Encoding and Linking for Text-to- SQL Parsers. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 2020

  13. [20]

    PICARD : Parsing Incrementally for Constrained Auto-Regressive Decoding from Language Models

    Scholak, Torsten and Schucher, Nathan and Bahdanau, Dzmitry. PICARD : Parsing Incrementally for Constrained Auto-Regressive Decoding from Language Models. Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. 2021

  14. [22]

    Importance of Synthesizing High-quality Data for Text-to-

    Yiyun Zhao and Jiarong Jiang and Yiqun Hu and Wuwei Lan and Henghui Zhu and Anuj Chauhan and Alexander Hanbo Li and Lin Pan and Jun Wang and Chung-Wei Hang and Sheng Zhang and Mingwen Dong and Joseph Lilien and Patrick Ng and Zhiguo Wang and Vittorio Castelli and Bing Xiang , booktitle=. Importance of Synthesizing High-quality Data for Text-to-. 2022 , url=

  15. [27]

    Advances in Neural Information Processing Systems , volume=

    Unsupervised translation of programming languages , author=. Advances in Neural Information Processing Systems , volume=

  16. [28]

    International Conference on Learning Representations , year=

    Leveraging Automated Unit Tests for Unsupervised Code Translation , author=. International Conference on Learning Representations , year=

  17. [29]

    Advances in Neural Information Processing Systems , editor=

    Large Language Models are Zero-Shot Reasoners , author=. Advances in Neural Information Processing Systems , editor=. 2022 , url=

  18. [31]

    Advances in neural information processing systems , volume=

    Tree-to-tree neural networks for program translation , author=. Advances in neural information processing systems , volume=

  19. [33]

    2023 , eprint=

    CodeGeeX: A Pre-Trained Model for Code Generation with Multilingual Evaluations on HumanEval-X , author=. 2023 , eprint=

  20. [37]

    The Eleventh International Conference on Learning Representations , year=

    CodeT: Code Generation with Generated Tests , author=. The Eleventh International Conference on Learning Representations , year=

  21. [38]

    Science , volume=

    Competition-level code generation with alphacode , author=. Science , volume=. 2022 , publisher=

  22. [39]

    Measuring Coding Challenge Competence With

    Dan Hendrycks and Steven Basart and Saurav Kadavath and Mantas Mazeika and Akul Arora and Ethan Guo and Collin Burns and Samir Puranik and Horace He and Dawn Song and Jacob Steinhardt , booktitle=. Measuring Coding Challenge Competence With. 2021 , url=

  23. [40]

    The Eleventh International Conference on Learning Representations , year=

    Self-Consistency Improves Chain of Thought Reasoning in Language Models , author=. The Eleventh International Conference on Learning Representations , year=

  24. [41]

    International conference on machine learning , pages=

    Robustfill: Neural program learning under noisy i/o , author=. International conference on machine learning , pages=. 2017 , organization=

  25. [42]

    International Conference on Learning Representations , year=

    Leveraging Grammar and Reinforcement Learning for Neural Program Synthesis , author=. International Conference on Learning Representations , year=

  26. [43]

    International Conference on Learning Representations , year=

    Execution-guided neural program synthesis , author=. International Conference on Learning Representations , year=

  27. [44]

    Latent execution for neural program synthesis beyond domain-specific languages

    Latent execution for neural program synthesis beyond domain-specific languages. Advances in Neural Information Processing Systems

  28. [45]

    Reranking for neural semantic parsing

    Reranking for neural semantic parsing. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

  29. [46]

    Generating Sequences by Learning to Self-Correct

    Generating Sequences by Learning to Self-Correct. The Eleventh International Conference on Learning Representations

  30. [49]

    CodeGen: An Open Large Language Model for Code with Multi-Turn Program Synthesis

    CodeGen: An Open Large Language Model for Code with Multi-Turn Program Synthesis. The Eleventh International Conference on Learning Representations

  31. [50]

    Coda: An End-to-End Neural Program Decompiler

    Coda: An End-to-End Neural Program Decompiler. NeurIPS

  32. [51]

    Dynamic Neural Program Embedding for Program Repair

    Dynamic Neural Program Embedding for Program Repair. International Conference on Learning Representations

  33. [56]

    Chain of thought prompting elicits reasoning in large language models

    Chain of thought prompting elicits reasoning in large language models. NeurIPS

  34. [57]

    Least-to-Most Prompting Enables Complex Reasoning in Large Language Models

    Least-to-Most Prompting Enables Complex Reasoning in Large Language Models. The Eleventh International Conference on Learning Representations

  35. [60]

    The pragmatic programmer: from journeyman to master

    The pragmatic programmer: from journeyman to master. 2000

  36. [62]

    Exploring the limits of transfer learning with a unified text-to-text transformer

    Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research, 2020

  37. [63]

    Language to Logical Form with Neural Attention

    Dong, Li and Lapata, Mirella. Language to Logical Form with Neural Attention. Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2016

  38. [64]

    Mapping Language to Code in Programmatic Context

    Iyer, Srinivasan and Konstas, Ioannis and Cheung, Alvin and Zettlemoyer, Luke. Mapping Language to Code in Programmatic Context. Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. 2018

  39. [65]

    A systematic evaluation of large language models of code

    A systematic evaluation of large language models of code. Proceedings of the 6th ACM SIGPLAN International Symposium on Machine Programming

  40. [67]

    Learning a Neural Semantic Parser from User Feedback

    Iyer, Srinivasan and Konstas, Ioannis and Cheung, Alvin and Krishnamurthy, Jayant and Zettlemoyer, Luke. Learning a Neural Semantic Parser from User Feedback. Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2017

  41. [68]

    Speak to your Parser: Interactive Text-to-SQL with Natural Language Feedback

    Elgohary, Ahmed and Hosseini, Saghar and Hassan Awadallah, Ahmed. Speak to your Parser: Interactive Text-to-SQL with Natural Language Feedback. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 2020

  42. [69]

    CoSQL: A Conversational Text-to-SQL Challenge Towards Cross-Domain Natural Language Interfaces to Databases

    Yu, Tao and Zhang, Rui and Er, Heyang and Li, Suyi and Xue, Eric and Pang, Bo and Lin, Xi Victoria and Tan, Yi Chern and Shi, Tianze and Li, Zihan and Jiang, Youxuan and Yasunaga, Michihiro and Shim, Sungrok and Chen, Tao and Fabbri, Alexander and Li, Zifan and Chen, Luyao and Zhang, Yuwen and Dixit, Shreya and Zhang, Vincent and Xiong, Caiming and Socher...

  43. [70]

    Model-based Interactive Semantic Parsing: A Unified Framework and A Text-to-SQL Case Study

    Yao, Ziyu and Su, Yu and Sun, Huan and Yih, Wen-tau. Model-based Interactive Semantic Parsing: A Unified Framework and A Text-to-SQL Case Study. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). 2019

  44. [71]

    A data-driven approach for learning to control computers

    A data-driven approach for learning to control computers. International Conference on Machine Learning, 2022

  45. [74]

    Multi-lingual evaluation of code generation models

    Ben Athiwaratkun, Sanjay Krishna Gouda, Zijian Wang, Xiaopeng Li, Yuchen Tian, Ming Tan, Wasi Uddin Ahmad, Shiqi Wang, Qing Sun, Mingyue Shang, Sujan Kumar Gonugondla, Hantian Ding, Varun Kumar, Nathan Fulton, Arash Farahani, Siddhartha Jain, Robert Giaquinto, Haifeng Qian, Murali Krishna Ramanathan, Ramesh Nallapati, Baishakhi Ray, Parminder Bhatia, Sudi...

  46. [75]

    Program Synthesis with Large Language Models

    Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, et al. Program synthesis with large language models. arXiv preprint arXiv:2108.07732, 2021

  47. [76]

    Constitutional AI: Harmlessness from AI Feedback

    Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, et al. Constitutional ai: Harmlessness from ai feedback. arXiv preprint arXiv:2212.08073, 2022

  48. [77]

    Language models are few-shot learners

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020

  49. [78]

    Leveraging grammar and reinforcement learning for neural program synthesis

    Rudy Bunel, Matthew Hausknecht, Jacob Devlin, Rishabh Singh, and Pushmeet Kohli. Leveraging grammar and reinforcement learning for neural program synthesis. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=H1Xw62kRZ

  50. [79]

    Improving code generation by training with natural language feedback

    Angelica Chen, Jérémy Scheurer, Tomasz Korbak, Jon Ander Campos, Jun Shern Chan, Samuel R Bowman, Kyunghyun Cho, and Ethan Perez. Improving code generation by training with natural language feedback. arXiv preprint arXiv:2303.16749, 2023a

  51. [80]

    Codet: Code generation with generated tests

    Bei Chen, Fengji Zhang, Anh Nguyen, Daoguang Zan, Zeqi Lin, Jian-Guang Lou, and Weizhu Chen. Codet: Code generation with generated tests. In The Eleventh International Conference on Learning Representations, 2023b. URL https://openreview.net/forum?id=ktrw68Cmu9c

  52. [81]

    Evaluating Large Language Models Trained on Code

    Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021a

  53. [82]

    Program of Thoughts Prompting: Disentangling Computation from Reasoning for Numerical Reasoning Tasks

    Wenhu Chen, Xueguang Ma, Xinyi Wang, and William W Cohen. Program of thoughts prompting: Disentangling computation from reasoning for numerical reasoning tasks. arXiv preprint arXiv:2211.12588, 2022

  54. [83]

    Tree-to-tree neural networks for program translation

    Xinyun Chen, Chang Liu, and Dawn Song. Tree-to-tree neural networks for program translation. Advances in neural information processing systems, 31, 2018

  55. [84]

    Execution-guided neural program synthesis

    Xinyun Chen, Chang Liu, and Dawn Song. Execution-guided neural program synthesis. In International Conference on Learning Representations, 2019

  56. [85]

    Latent execution for neural program synthesis beyond domain-specific languages

    Xinyun Chen, Dawn Song, and Yuandong Tian. Latent execution for neural program synthesis beyond domain-specific languages. Advances in Neural Information Processing Systems, 34:22196–22208, 2021b

  57. [86]

    PaLM: Scaling Language Modeling with Pathways

    Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. Palm: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311, 2022

  58. [87]

    Robustfill: Neural program learning under noisy i/o

    Jacob Devlin, Jonathan Uesato, Surya Bhupatiraju, Rishabh Singh, Abdel-rahman Mohamed, and Pushmeet Kohli. Robustfill: Neural program learning under noisy i/o. In International conference on machine learning, pp. 990–998. PMLR, 2017

  59. [88]

    Language to logical form with neural attention

    Li Dong and Mirella Lapata. Language to logical form with neural attention. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2016

  60. [89]

    Speak to your parser: Interactive text-to-SQL with natural language feedback

    Ahmed Elgohary, Saghar Hosseini, and Ahmed Hassan Awadallah. Speak to your parser: Interactive text-to-SQL with natural language feedback. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020

  61. [90]

    Coda: An end-to-end neural program decompiler

    Cheng Fu, Huili Chen, Haolan Liu, Xinyun Chen, Yuandong Tian, Farinaz Koushanfar, and Jishen Zhao. Coda: An end-to-end neural program decompiler. In NeurIPS, 2019

  62. [91]

    The capacity for moral self-correction in large language models

    Deep Ganguli, Amanda Askell, Nicholas Schiefer, Thomas Liao, Kamilė Lukošiūtė, Anna Chen, Anna Goldie, Azalia Mirhoseini, Catherine Olsson, Danny Hernandez, et al. The capacity for moral self-correction in large language models. arXiv preprint arXiv:2302.07459, 2023

  63. [92]

    PAL: Program-aided Language Models

    Luyu Gao, Aman Madaan, Shuyan Zhou, Uri Alon, Pengfei Liu, Yiming Yang, Jamie Callan, and Graham Neubig. Pal: Program-aided language models. arXiv preprint arXiv:2211.10435, 2022

  64. [93]

    Synthesize, execute and debug: Learning to repair for neural program synthesis

    Kavi Gupta, Peter Ebert Christensen, Xinyun Chen, and Dawn Song. Synthesize, execute and debug: Learning to repair for neural program synthesis. Advances in Neural Information Processing Systems, 33:17685–17695, 2020

  65. [94]

    Deepfix: Fixing common c language errors by deep learning

    Rahul Gupta, Soham Pal, Aditya Kanade, and Shirish Shevade. Deepfix: Fixing common c language errors by deep learning. In Proceedings of the aaai conference on artificial intelligence, 2017

  66. [95]

    Measuring coding challenge competence with APPS

    Dan Hendrycks, Steven Basart, Saurav Kadavath, Mantas Mazeika, Akul Arora, Ethan Guo, Collin Burns, Samir Puranik, Horace He, Dawn Song, and Jacob Steinhardt. Measuring coding challenge competence with APPS. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2), 2021. URL https://openreview.net/forum?...

  67. [96]

    A data-driven approach for learning to control computers

    Peter C Humphreys, David Raposo, Tobias Pohlen, Gregory Thornton, Rachita Chhaparia, Alistair Muldal, Josh Abramson, Petko Georgiev, Adam Santoro, and Timothy Lillicrap. A data-driven approach for learning to control computers. In International Conference on Machine Learning, pp. 9466–9482. PMLR, 2022

  68. [97]

    The pragmatic programmer: from journeyman to master, 2000

    Andrew Hunt and David Thomas. The pragmatic programmer: from journeyman to master, 2000

  69. [98]

    Learning a neural semantic parser from user feedback

    Srinivasan Iyer, Ioannis Konstas, Alvin Cheung, Jayant Krishnamurthy, and Luke Zettlemoyer. Learning a neural semantic parser from user feedback. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2017

  70. [99]

    Mapping language to code in programmatic context

    Srinivasan Iyer, Ioannis Konstas, Alvin Cheung, and Luke Zettlemoyer. Mapping language to code in programmatic context. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, 2018

  71. [100]

    Decomposed prompting: A modular approach for solving complex tasks

    Tushar Khot, Harsh Trivedi, Matthew Finlayson, Yao Fu, Kyle Richardson, Peter Clark, and Ashish Sabharwal. Decomposed prompting: A modular approach for solving complex tasks. arXiv preprint arXiv:2210.02406, 2022

  72. [101]

    Language models can solve computer tasks

    Geunwoo Kim, Pierre Baldi, and Stephen McAleer. Language models can solve computer tasks. arXiv preprint arXiv:2303.17491, 2023

  73. [102]

    Large language models are zero-shot reasoners

    Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho (eds.), Advances in Neural Information Processing Systems, 2022. URL https://openreview.net/forum?id=e2TBb5y0yFf

  74. [103]

    Pretraining language models with human preferences

    Tomasz Korbak, Kejian Shi, Angelica Chen, Rasika Bhalerao, Christopher L Buckley, Jason Phang, Samuel R Bowman, and Ethan Perez. Pretraining language models with human preferences. arXiv preprint arXiv:2302.08582, 2023

  75. [104]

    Coderl: Mastering code generation through pretrained models and deep reinforcement learning

    Hung Le, Yue Wang, Akhilesh Deepak Gotmare, Silvio Savarese, and Steven Chu Hong Hoi. Coderl: Mastering code generation through pretrained models and deep reinforcement learning. Advances in Neural Information Processing Systems, 35:21314–21328, 2022

  76. [105]

    Graphix-t5: Mixing pre-trained transformers with graph-aware layers for text-to-sql parsing

    Jinyang Li, Binyuan Hui, Reynold Cheng, Bowen Qin, Chenhao Ma, Nan Huo, Fei Huang, Wenyu Du, Luo Si, and Yongbin Li. Graphix-t5: Mixing pre-trained transformers with graph-aware layers for text-to-sql parsing. arXiv preprint arXiv:2301.07507, 2023a

  77. [106]

    StarCoder: may the source be with you!

    Raymond Li, Loubna Ben Allal, Yangtian Zi, Niklas Muennighoff, Denis Kocetkov, Chenghao Mou, Marc Marone, Christopher Akiki, Jia Li, Jenny Chim, et al. Starcoder: may the source be with you! arXiv preprint arXiv:2305.06161, 2023b

  78. [107]

    Competition-level code generation with alphacode

    Yujia Li, David Choi, Junyoung Chung, Nate Kushman, Julian Schrittwieser, Rémi Leblond, Tom Eccles, James Keeling, Felix Gimeno, Agustin Dal Lago, et al. Competition-level code generation with alphacode. Science, 378(6624):1092–1097, 2022

  79. [108]

    Chain of hindsight aligns language models with feedback

    Hao Liu, Carmelo Sferrazza, and Pieter Abbeel. Chain of hindsight aligns language models with feedback. arXiv preprint arXiv:2302.02676, 2023

  80. [109]

    Learning performance-improving code edits

    Aman Madaan, Alexander Shypula, Uri Alon, Milad Hashemi, Parthasarathy Ranganathan, Yiming Yang, Graham Neubig, and Amir Yazdanbakhsh. Learning performance-improving code edits. arXiv preprint arXiv:2302.07867, 2023a

Showing first 80 references.