GraphCodeBERT uses data flow graphs in pre-training to capture semantic code structure and reaches state-of-the-art results on code search, clone detection, translation, and refinement.
arXiv preprint arXiv:1704.01696 , year=
4 Pith papers cite this work. Polarity classification is still indexing.
abstract
We consider the problem of parsing natural language descriptions into source code written in a general-purpose programming language like Python. Existing data-driven methods treat this problem as a language generation task without considering the underlying syntax of the target programming language. Informed by previous work in semantic parsing, in this paper we propose a novel neural architecture powered by a grammar model to explicitly capture the target syntax as prior knowledge. Experiments find this an effective way to scale up to generation of complex programs from natural language descriptions, achieving state-of-the-art results that well outperform previous code generation and semantic parsing approaches.
citation-role summary
citation-polarity summary
fields
cs.SE 4roles
background 1polarities
background 1representative citing papers
SynthFix adaptively routes LLM code repairs to supervised fine-tuning or symbolic-reward fine-tuning, yielding up to 32% higher exact match on JavaScript and C vulnerability benchmarks.
CodeXGLUE supplies a standardized collection of 10 code-related tasks, 14 datasets, an evaluation platform, and BERT-, GPT-, and encoder-decoder-style baselines.
Fine-tuning 7B code LLMs on a custom multi-file DSL dataset achieves structural fidelity of 1.00, high exact-match accuracy, and practical utility validated by expert survey and execution checks.
citing papers explorer
-
GraphCodeBERT: Pre-training Code Representations with Data Flow
GraphCodeBERT uses data flow graphs in pre-training to capture semantic code structure and reaches state-of-the-art results on code search, clone detection, translation, and refinement.
-
SynthFix: Adaptive Neuro-Symbolic Code Vulnerability Repair
SynthFix adaptively routes LLM code repairs to supervised fine-tuning or symbolic-reward fine-tuning, yielding up to 32% higher exact match on JavaScript and C vulnerability benchmarks.
-
CodeXGLUE: A Machine Learning Benchmark Dataset for Code Understanding and Generation
CodeXGLUE supplies a standardized collection of 10 code-related tasks, 14 datasets, an evaluation platform, and BERT-, GPT-, and encoder-decoder-style baselines.
-
Leveraging LLMs for Multi-File DSL Code Generation: An Industrial Case Study
Fine-tuning 7B code LLMs on a custom multi-file DSL dataset achieves structural fidelity of 1.00, high exact-match accuracy, and practical utility validated by expert survey and execution checks.