Exploring Code Analysis: Zero-Shot Insights on Syntax and Semantics with LLMs

Wei Ma, Zhihao Lin, Shangqing Liu, Qiang Hu, Ye Liu, Wenhan Wang, Cen Zhang, Liming Nie, Li Li, Yang Liu, Lingxiao Jiang

arxiv: 2305.12138 · v5 · pith:QI6FLKNOnew · submitted 2023-05-20 · 💻 cs.SE · cs.AI

Pith reviewed 2026-05-24 08:08 UTC · model grok-4.3

keywords large language models · code analysis · syntax parsing · static analysis · dynamic reasoning · zero-shot evaluation · software engineering · AST generation

The pith

LLMs reach 90%+ on syntax parsing tasks but stay below 70% on dynamic reasoning with strong project-to-project variability.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether large language models can perform the core steps of code analysis that human developers use: breaking code into syntax trees, inferring static relationships such as data flow and taint, and reasoning about runtime behavior such as flaky tests. Across 21 models and 3,124 samples in four languages, the results show a stable ordering of difficulty—syntax is handled reliably, static inference is workable, and dynamic reasoning is weak and unstable when code moves to new projects. A reader would care because these three layers underpin debugging, security checks, and maintenance; if the ordering is real, it tells practitioners when LLMs can safely replace or assist existing analyzers and when they cannot.

Core claim

The authors claim that LLMs display a consistent capability hierarchy on fundamental code analysis. They generate abstract syntax trees at 90%+ accuracy and match expressions at 84-100%, perform adequately on static-semantics tasks such as control-flow graphs and taint analysis, yet remain below 70% on dynamic-reasoning tasks and exhibit per-project F1 scores that swing from 0 to 1.0. The same ordering appears across model families and sizes, which the authors interpret as evidence of structural rather than transient limits. They further note that LLMs supply cross-language generalization at the cost of nondeterministic outputs, while conventional tools supply deterministic results at the cost of language-specific configuration.

What carries the argument

A three-layer protocol that scores syntax parsing, static-semantics inference, and dynamic reasoning on the same code samples using automated metrics, expert adjudication, and consistency checks.
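To make the "automated metrics" layer concrete for the syntax task, here is a minimal sketch of comparing a model's claimed AST against a deterministic parser. It is illustrative only: the paper compares against Tree-sitter, whereas this sketch substitutes Python's standard `ast` module, and the `node_type_recall` score is a hypothetical stand-in, not the metric the authors use.

```python
import ast
from collections import Counter


def node_types(source: str) -> Counter:
    """Count AST node types produced by Python's own parser (the reference)."""
    return Counter(type(node).__name__ for node in ast.walk(ast.parse(source)))


def node_type_recall(reference: Counter, candidate: Counter) -> float:
    """Hypothetical score: share of reference nodes that the candidate also lists.

    Multiset intersection keeps the smaller count for each node type, so
    extra or invented node types in the candidate earn no credit.
    """
    total = sum(reference.values())
    if total == 0:
        return 1.0
    matched = sum((reference & candidate).values())
    return matched / total


if __name__ == "__main__":
    reference = node_types("x = 1")
    # Assumed LLM answer, written by hand for illustration.
    llm_answer = Counter({"Module": 1, "Assign": 1, "Name": 1})
    print(sorted(reference.items()))
    print(node_type_recall(reference, llm_answer))
```

A count-based score like this ignores tree shape, which is one reason a protocol of this kind would still need the expert-adjudication and consistency layers on top.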

If this is right

LLMs can supply cross-language analysis without per-language configuration that traditional tools require.
LLM outputs on dynamic tasks will need external validation because of high data-shift sensitivity.
Traditional analyzers remain necessary for tasks where deterministic guarantees matter.
A hybrid workflow can route syntax and simple static checks to LLMs and route dynamic checks to conventional tools.
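The hybrid workflow in the last point can be sketched as a small router. This is a minimal illustration under stated assumptions: the task names, the `TASK_LAYER` mapping, and the `needs_guarantee` flag are all hypothetical and do not come from the paper or its benchmark.

```python
from enum import Enum


class Layer(Enum):
    SYNTAX = "syntax"
    STATIC = "static"
    DYNAMIC = "dynamic"


# Hypothetical assignment of task names to the three layers discussed above.
TASK_LAYER = {
    "ast_generation": Layer.SYNTAX,
    "expression_matching": Layer.SYNTAX,
    "cfg_construction": Layer.STATIC,
    "data_dependency": Layer.STATIC,
    "taint_analysis": Layer.STATIC,
    "flaky_test_reasoning": Layer.DYNAMIC,
}


def route(task: str, needs_guarantee: bool = False) -> str:
    """Pick a backend for a task.

    Dynamic-reasoning tasks, and any task where the caller requires a
    deterministic guarantee, go to a conventional tool; the rest go to an LLM.
    """
    layer = TASK_LAYER[task]
    if needs_guarantee or layer is Layer.DYNAMIC:
        return "conventional_tool"
    return "llm"


if __name__ == "__main__":
    for name in TASK_LAYER:
        print(name, "->", route(name))
```

Which static tasks count as "simple" enough for the LLM path is a policy decision; the mapping above treats all of them that way only for brevity.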

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the hierarchy is fundamental, fine-tuning focused only on dynamic tasks may still leave a residual gap unless the training distribution covers project diversity.
The observed sensitivity to project shift suggests that LLMs may be capturing surface patterns rather than portable reasoning rules.
Tool builders could expose an explicit 'dynamic-reasoning' flag so users know when to distrust zero-shot LLM answers.

Load-bearing premise

The nine chosen tasks and 3,124 code samples, scored by the three-layer protocol, give a representative picture of what counts as fundamental code analysis.

What would settle it

A new model that sustains above 80% F1 on the dynamic-reasoning tasks across multiple held-out projects without retraining would contradict the claim of a stable performance hierarchy.
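The test described above can be sketched as a per-project F1 check. This is a minimal sketch assuming binary labels and a flat list of `(project, gold, predicted)` records; the record layout, the function names, and the sample data are invented for illustration and are not taken from the benchmark.

```python
from collections import defaultdict


def f1(tp: int, fp: int, fn: int) -> float:
    """F1 from confusion counts; defined as 0.0 when there are no positives at all."""
    denom = 2 * tp + fp + fn
    return 0.0 if denom == 0 else 2 * tp / denom


def per_project_f1(records):
    """Group (project, gold, predicted) records by project and score each group."""
    counts = defaultdict(lambda: [0, 0, 0])  # tp, fp, fn
    for project, gold, predicted in records:
        c = counts[project]
        if predicted and gold:
            c[0] += 1
        elif predicted and not gold:
            c[1] += 1
        elif gold and not predicted:
            c[2] += 1
    return {project: f1(*c) for project, c in counts.items()}


def sustains(records, threshold: float = 0.8) -> bool:
    """True only if every held-out project scores strictly above the threshold."""
    scores = per_project_f1(records)
    return bool(scores) and all(score > threshold for score in scores.values())


if __name__ == "__main__":
    # Made-up predictions: perfect on project "a", all positives missed on "b".
    sample = [
        ("a", True, True), ("a", False, False),
        ("b", True, False), ("b", True, False),
    ]
    print(per_project_f1(sample))
    print(sustains(sample))
```

Scoring each project separately, rather than pooling all samples, is what exposes the 0 to 1.0 swing: a pooled F1 can look moderate while individual projects sit at either extreme.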

Figures

Figures reproduced from arXiv: 2305.12138 by Cen Zhang, Li Li, Liming Nie, Lingxiao Jiang, Qiang Hu, Shangqing Liu, Wei Ma, Wenhan Wang, Yang Liu, Ye Liu, Zhihao Lin.

**Figure 2.** A semantically equivalent version of the buggy func…

**Figure 3.** The overview of our study. 3.2 Code Syntax Understanding (RQ1) Code syntax refers to the set of rules that define valid combinations of symbols in a given programming language. Abstract Syntax Tree (AST) is a data structure that represents code syntax. In this section, our objective is to investigate whether LLM can understand code syntax using AST. 3.2.1 AST Generation. We begin our evaluation by assessin…

**Figure 4.** An example of the task of Expression Matching.

**Figure 6.** (1) Data Dependency Example (Python), (2) Taint…

**Figure 7.** (1) Equivalent Mutant Example (Python) and…

**Figure 8.** AST Generation Result. …were reasonable and few are minor. However, the open-source models CodeLlama-13-instruct and StarCoder are worse than OpenAI’s models and have more unreasonable AST outputs. But CodeLlama is slightly better than StarCoder. We further investigate the issues of the generated ASTs and Figure 8b displays the number of issues that we identified. A single AST may exhibit multiple issues, …

**Figure 10.** CFG Generation Results. Panels: (a) Reasonable vs. Unreasonable (categories: Good, Minor, No); (b) Issue Distribution (categories: Redundancy, Fabrication, MissCall). Models shown: GPT4, GPT3.5, CodeLlama, StarChat; vertical axis: Number.

**Figure 14.** Predictions of LLMs for Flaky Test Reasoning

Original abstract

Code analysis is fundamental in Software Engineering, supporting debugging, optimization, and security assessment. Human developers approach it through syntax parsing, static semantics inference, and dynamic reasoning. Traditional tools are effective but limited by language specificity and weak cross-language generalization. Large language models (LLMs) are promising for code tasks, yet their capabilities for fundamental code analysis remain underexplored. We structure our study around three aspects aligned with human practices: syntax parsing, static semantics inference, and dynamic reasoning. We evaluate 21 state-of-the-art LLMs across nine tasks in four languages (C, Java, Python, Solidity), including AST generation, CFG construction, data dependency, taint analysis, and flaky test reasoning. We apply a three-layer evaluation protocol (automated metrics, expert adjudication, consistency validation) to 3,124 code samples, achieving high inter-rater reliability (Cohen's kappa = 0.844-0.936) and strong human-machine agreement (Gwet's AC1 = 0.500-0.727, F1 = 0.791-0.882). While the best LLMs excel in syntax parsing (AST 90%+, expression matching 84-100%) and show promise in static analysis, their dynamic reasoning remains limited (<70%) with high data-shift sensitivity (per-project F1 varying 0-1.0). This hierarchy holds across model families and scales, suggesting fundamental rather than transient limitations. These findings show how LLMs complement traditional analyzers: they offer cross-language generalization but non-deterministic outputs needing validation, while traditional tools give deterministic guarantees but need language-specific configuration. We contribute a validated evaluation framework with comparison against traditional analyzers (Tree-sitter, Soot, Joern) and task-specific applicability tiers. Benchmark: https://github.com/mathieu0905/llm_code_analysis.git

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper delivers a practical benchmark and clear performance hierarchy for LLMs on code tasks, but the 'fundamental limitation' claim needs checking against sample details.


The main thing here is an empirical map showing LLMs handle syntax parsing reliably (90%+ on ASTs) but stay below 70% on dynamic reasoning, with big swings across projects. They ran 21 models on nine tasks in four languages and compared them directly to Tree-sitter, Soot, and Joern, then released the dataset. That scale and the public benchmark are the concrete additions over earlier single-task studies. The three-layer protocol with expert adjudication and solid inter-rater numbers (kappa 0.844-0.936) gives the numbers some grounding. The hierarchy across model families is worth noting for tool builders deciding when an LLM might replace or sit beside a traditional analyzer.

The soft spot is the representativeness angle. The abstract does not detail how the 3,124 samples were drawn or whether the dynamic tasks (like flaky test reasoning) use realistic execution contexts or just simplified prompts. If the set skews toward short snippets, the low dynamic scores and 0-1.0 F1 variation could partly reflect benchmark design rather than intrinsic model limits. That undercuts the 'fundamental rather than transient' reading until the sampling and prompt choices are laid out.

This is for software engineering researchers and tool developers who need data on LLM applicability tiers. It is coherent on its own terms and ships a usable artifact, so it deserves a serious referee even if the interpretation gets pushed on during review.

Referee Report

1 major / 1 minor

Summary. The paper evaluates 21 state-of-the-art LLMs on nine tasks spanning syntax parsing (e.g., AST generation), static semantics inference (e.g., CFG construction, data dependency, taint analysis), and dynamic reasoning (e.g., flaky test reasoning) across C, Java, Python, and Solidity. Using 3,124 code samples and a three-layer protocol (automated metrics, expert adjudication, consistency validation) with reported high inter-rater reliability (Cohen's kappa 0.844-0.936) and human-machine agreement, it finds LLMs excel at syntax (AST 90%+, expression matching 84-100%), show promise in static analysis, but remain limited in dynamic reasoning (<70%) with high data-shift sensitivity (per-project F1 varying 0-1.0). This hierarchy holds across model families and scales, suggesting fundamental limitations; the work contributes a validated framework, applicability tiers, and benchmark comparing LLMs to traditional tools like Tree-sitter, Soot, and Joern.

Significance. If the evaluation protocol and sample representativeness hold, the results offer concrete empirical grounding for where LLMs can complement (via cross-language generalization) versus where they fall short of traditional deterministic analyzers in software engineering. The open benchmark and three-layer protocol with quantified agreement metrics are strengths that could support reproducible follow-up work on improving dynamic code reasoning.

major comments (1)

[Abstract] The central claim that the observed performance hierarchy (syntax strong, static promising, dynamic limited with data-shift sensitivity) reflects 'fundamental rather than transient limitations' is load-bearing on the representativeness of the nine tasks and 3,124 samples. The abstract provides no detail on sample selection criteria (e.g., distribution across projects, languages, or complexity levels) or how dynamic tasks isolate state reasoning versus prompt-following, leaving open the possibility that the <70% ceiling and 0-1.0 per-project F1 swings are artifacts of task design rather than intrinsic model properties.

minor comments (1)

[Abstract] The reported human-machine agreement metrics (Gwet's AC1 = 0.500-0.727, F1 = 0.791-0.882) are given as ranges without mapping to specific tasks or evaluation layers, reducing clarity on which aspects of the protocol drive the reliability claims.
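For readers unfamiliar with the two agreement statistics named in the report, the following sketch computes Cohen's kappa and Gwet's AC1 for two raters using their standard textbook definitions. The label lists are invented for illustration, and nothing here reproduces the paper's own computation or data.

```python
from collections import Counter


def observed_agreement(a, b) -> float:
    """Fraction of items on which the two raters give the same label."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohen_kappa(a, b) -> float:
    """Cohen's kappa: chance agreement uses each rater's own label marginals.

    Undefined (division by zero) when chance agreement equals 1.
    """
    n = len(a)
    p_o = observed_agreement(a, b)
    ca, cb = Counter(a), Counter(b)
    p_e = sum(ca[k] * cb[k] for k in set(a) | set(b)) / (n * n)
    return (p_o - p_e) / (1 - p_e)


def gwet_ac1(a, b) -> float:
    """Gwet's AC1: chance agreement uses the average marginal per category.

    Needs at least two distinct categories across both raters.
    """
    n = len(a)
    categories = set(a) | set(b)
    p_o = observed_agreement(a, b)
    ca, cb = Counter(a), Counter(b)
    pi = {k: (ca[k] + cb[k]) / (2 * n) for k in categories}
    p_e = sum(p * (1 - p) for p in pi.values()) / (len(categories) - 1)
    return (p_o - p_e) / (1 - p_e)


if __name__ == "__main__":
    # Made-up labels: the raters disagree on one item out of ten.
    human = [1, 1, 1, 0, 0, 1, 1, 1, 1, 1]
    model = [1, 1, 1, 0, 1, 1, 1, 1, 1, 1]
    print(cohen_kappa(human, model), gwet_ac1(human, model))
```

The two statistics can diverge noticeably on skewed label distributions, because they model chance agreement differently, so a range reported for one says little about the other without the per-task breakdown the referee asks for.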

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback. The concern about the abstract's brevity and its implications for the central claim is well-taken. We address it point-by-point below and will revise the abstract accordingly.


Referee: [Abstract] The central claim that the observed performance hierarchy (syntax strong, static promising, dynamic limited with data-shift sensitivity) reflects 'fundamental rather than transient limitations' is load-bearing on the representativeness of the nine tasks and 3,124 samples. The abstract provides no detail on sample selection criteria (e.g., distribution across projects, languages, or complexity levels) or how dynamic tasks isolate state reasoning versus prompt-following, leaving open the possibility that the <70% ceiling and 0-1.0 per-project F1 swings are artifacts of task design rather than intrinsic model properties.

Authors: We agree the abstract is concise and omits explicit details on sample selection and task isolation. The full manuscript (Sections 3.1–3.2 and 4) specifies that the 3,124 samples were drawn from multiple open-source projects per language, stratified by size and complexity, with dynamic tasks (e.g., flaky-test reasoning) explicitly constructed to require inference over execution history and state changes rather than surface-level prompt compliance. The per-project F1 variance is reported as direct evidence of data-shift sensitivity. To address the referee’s point, we will revise the abstract to include a brief clause on sample distribution and task design. The claim of fundamental limitations is presented as an inference from the consistent hierarchy across 21 models and four scales; we do not claim it is proven, only that the empirical pattern is not explained by transient factors such as model size alone.

Revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical measurements with no derivations or self-referential fits


The paper reports direct experimental results from evaluating 21 LLMs on nine fixed tasks using 3,124 samples and a three-layer protocol (automated metrics, expert adjudication, consistency checks). No equations, fitted parameters, predictions derived from inputs, or load-bearing self-citations appear in the abstract or described methodology. The hierarchy claim (syntax > static > dynamic) is presented as an observed pattern across models, not as a mathematical consequence of prior definitions or fits. This matches the default case of an empirical study whose central claims rest on external data rather than internal construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the domain assumption that zero-shot prompting and the chosen tasks reveal fundamental rather than prompt-dependent capabilities, plus standard assumptions about the validity of automated metrics and expert adjudication.

axioms (2)

Domain assumption: Zero-shot prompting is sufficient to elicit and measure core code analysis capabilities in LLMs.
The study design assumes that the reported performance hierarchy reflects intrinsic model limits rather than artifacts of prompt formulation.

Domain assumption: The selected tasks and samples are representative of fundamental code analysis as practiced by humans.
The three-aspect structure (syntax, static semantics, dynamic reasoning) is taken as aligned with human practice without further justification in the abstract.


Forward citations

Cited by 6 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

(How) Do Large Language Models Understand High-Level Message Sequence Charts?
cs.SE · 2026-05 · conditional · novelty 6.0
LLMs achieve only modest understanding of HMSC formal semantics at 52 percent accuracy, performing strongly on basic constructs but weakly on abstractions and traces.

NeuroFlake: A Neuro-Symbolic LLM Framework for Flaky Test Classification
cs.SE · 2026-05 · unverdicted · novelty 6.0
NeuroFlake integrates discriminative token mining into LLMs to classify flaky tests, raising F1-score to 69.34% on FlakeBench while showing greater robustness to semantic-preserving perturbations than prior methods.

LLM-Powered Detection of Price Manipulation in DeFi
cs.CR · 2025-10 · unverdicted · novelty 6.0
PMDetector is a hybrid static-plus-LLM framework that detects price manipulation in DeFi protocols via taint analysis, defense filtering, attack simulation, and validation, achieving 88% precision and 90% recall on 73...

A Large Language Model Approach to Generating Bypass Rules for Malware Evasion in Analysis Sandbox
cs.CR · 2026-05 · unverdicted · novelty 5.0
ABLE uses LLMs with sanitization and iterative refinement to generate bypass YARA rules from malware traces, achieving 79% success on 334 samples and 47% more family detections.

(How) Do Large Language Models Understand High-Level Message Sequence Charts?
cs.SE · 2026-05 · unverdicted · novelty 5.0
LLMs achieve only 52% overall accuracy on HMSC semantic tasks, performing well on basic concepts but poorly on abstractions, compositions, and trace calculations.

CodePori: Large-Scale System for Autonomous Software Development Using Multi-Agent Technology
cs.SE · 2024-02 · unverdicted · novelty 4.0
CodePori is a multi-agent LLM system for code generation whose participant evaluation identifies practical challenges like memory limits and hallucinations missed by binary benchmarks.

Reference graph

Works this paper leans on

98 extracted references · 98 canonical work pages · cited by 5 Pith papers · 10 internal anchors

[1]

2018. Slither. https://github.com/crytic/slither

work page 2018
[2]

Chatgpt: Optimizing language models for dialogue

2022-11. Chatgpt: Optimizing language models for dialogue . https://chat.openai. com

work page 2022
[3]

Alphacode

2023. Alphacode. https://www.deepmind.com/blog/competitive-programming- with-alphacode

work page 2023
[4]

AST Analyzer

2023. AST Analyzer. https://chat.openai.com/g/g-cAZMow3gy-ast-analyzer

work page 2023
[5]

awesome-chatgpt-prompts

2023. awesome-chatgpt-prompts. https://github.com/f/awesome-chatgpt- prompts

work page 2023
[6]

Capabilities of ChatGPT for Code Analysis: An Empirical Study

2023. Capabilities of ChatGPT for Code Analysis: An Empirical Study . https: //sites.google.com/view/chatgpt4se

work page 2023
[7]

CFG Analyzer

2023. CFG Analyzer. https://chat.openai.com/g/g-rY90G6DgV-cfg-analyst

work page 2023
[8]

CG Analyzer

2023. CG Analyzer. https://chat.openai.com/g/g-P5Qzq5vdB-call-graph-analyzer

work page 2023
[9]

2023. Copilot. https://github.com/features/copilot

work page 2023
[10]

Etherscan

2023. Etherscan. https://etherscan.io/

work page 2023
[11]

2023. Llama2. https://ai.meta.com/llama/

work page 2023
[12]

Openai Playground

2023. Openai Playground. https://platform.openai.com/playground

work page 2023
[13]

QuixBugs

2023. QuixBugs. https://github.com/jkoppel/QuixBugs/

work page 2023
[14]

Tree sitter

2023. Tree sitter. https://tree-sitter.github.io/tree-sitter/

work page 2023
[15]

Wasi Uddin Ahmad, Saikat Chakraborty, Baishakhi Ray, and Kai-Wei Chang

work page
[16]

arXiv preprint arXiv:2103.06333 (2021)

Unified pre-training for program understanding and generation. arXiv preprint arXiv:2103.06333 (2021)

work page arXiv 2021
[17]

Amal Akli, Guillaume Haben, Sarra Habchi, Mike Papadakis, and Yves Le Traon

work page
[18]

arXiv preprint arXiv:2208.14799 (2022)

Predicting Flaky Tests Categories using Few-Shot Learning. arXiv preprint arXiv:2208.14799 (2022)

work page arXiv 2022
[19]

Miltiadis Allamanis, Marc Brockschmidt, and Mahmoud Khademi. 2017. Learning to represent programs with graphs. arXiv preprint arXiv:1711.00740 (2017)

work page internal anchor Pith review Pith/arXiv arXiv 2017
[20]

Frances E. Allen. 1970. Control Flow Analysis. In Proceedings of a Symposium on Compiler Optimization (Urbana-Champaign, Illinois). Association for Computing Machinery, New York, NY, USA, 1–19. https://doi.org/10.1145/800028.808479

work page doi:10.1145/800028.808479 1970
[21]

Baxter, A

I.D. Baxter, A. Yahin, L. Moura, M. Sant’Anna, and L. Bier. 1998. Clone detection using abstract syntax trees. In Proceedings. International Conference on Software Maintenance (Cat. No. 98CB36272) . 368–377. https://doi.org/10.1109/ICSM.1998. 738528

work page doi:10.1109/icsm.1998 1998
[22]

Ira D Baxter, Andrew Yahin, Leonardo Moura, Marcelo Sant’Anna, and Lorraine Bier. 1998. Clone detection using abstract syntax trees. In Proceedings. Interna- tional Conference on Software Maintenance (Cat. No. 98CB36272) . IEEE, 368–377

work page 1998
[23]

Xiao Cheng, Haoyu Wang, Jiayi Hua, Miao Zhang, Guoai Xu, Li Yi, and Yulei Sui. 2019. Static detection of control-flow-related vulnerabilities using graph embedding. In 2019 24th International Conference on Engineering of Complex Computer Systems (ICECCS). IEEE, 41–50

work page 2019
[24]

Paul F Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei. 2017. Deep reinforcement learning from human preferences. Advances in neural information processing systems 30 (2017)

work page 2017
[25]

Kevin Clark, Urvashi Khandelwal, Omer Levy, and Christopher D Manning

work page
[26]

What Does BERT Look At? An Analysis of BERT's Attention

What does bert look at? an analysis of bert’s attention. arXiv preprint arXiv:1906.04341 (2019)

work page internal anchor Pith review Pith/arXiv arXiv 1906
[27]

Pascal Cuoq, Florent Kirchner, Nikolai Kosmatov, Virgile Prevosto, Julien Signoles, and Boris Yakobowski. 2012. Frama-C. In Software Engineering and Formal Methods, George Eleftherakis, Mike Hinchey, and Mike Holcombe (Eds.). Springer Berlin Heidelberg, Berlin, Heidelberg, 233–247

work page 2012
[28]

Arghavan Moradi Dakhel, Vahid Majdinasab, Amin Nikanjam, Foutse Khomh, Michel C Desmarais, and Zhen Ming Jack Jiang. 2023. Github copilot ai pair programmer: Asset or liability? Journal of Systems and Software 203 (2023), 111734

work page 2023
[29]

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding.arXiv preprint arXiv:1810.04805 (2018)

work page internal anchor Pith review Pith/arXiv arXiv 2018
[30]

Angela Fan, Beliz Gokkaya, Mark Harman, Mitya Lyubarskiy, Shubho Sengupta, Shin Yoo, and Jie M Zhang. 2023. Large language models for software engineering: Survey and open problems. arXiv preprint arXiv:2310.03533 (2023)

work page arXiv 2023
[31]

Josselin Feist, Gustavo Greico, and Alex Groce. 2019. Slither: A Static Analysis Framework for Smart Contracts. In Proceedings of the 2nd International Workshop on Emerging Trends in Software Engineering for Blockchain (Montreal, Quebec, Canada) (WETSEB ’19). IEEE Press, 8–15. https://doi.org/10.1109/WETSEB.2019. 00008

work page doi:10.1109/wetseb.2019 2019
[32]

Zhangyin Feng, Daya Guo, Duyu Tang, Nan Duan, Xiaocheng Feng, Ming Gong, Linjun Shou, Bing Qin, Ting Liu, Daxin Jiang, et al. 2020. Codebert: A pre-trained model for programming and natural languages. arXiv preprint arXiv:2002.08155 (2020)

work page internal anchor Pith review Pith/arXiv arXiv 2020
[33]

Jeanne Ferrante, Karl J Ottenstein, and Joe D Warren. 1984. The program depen- dence graph and its use in optimization. In International Symposium on Program- ming. Springer, 125–132

work page 1984
[34]

Ottenstein, and Joe D

Jeanne Ferrante, Karl J. Ottenstein, and Joe D. Warren. 1987. The Program Dependence Graph and Its Use in Optimization. ACM Trans. Program. Lang. Syst. 9, 3 (jul 1987), 319–349. https://doi.org/10.1145/24039.24041

work page doi:10.1145/24039.24041 1987
[35]

Daya Guo, Shuo Ren, Shuai Lu, Zhangyin Feng, Duyu Tang, Shujie Liu, Long Zhou, Nan Duan, Alexey Svyatkovskiy, Shengyu Fu, et al. 2020. Graphcodebert: Pre-training code representations with data flow. arXiv preprint arXiv:2009.08366 (2020)

work page internal anchor Pith review Pith/arXiv arXiv 2020
[36]

José Antonio Hernández López, Martin Weyssow, Jesús Sánchez Cuadrado, and Houari Sahraoui. 2022. AST-Probe: Recovering abstract syntax trees from hid- den representations of pre-trained language models. In Proceedings of the 37th IEEE/ACM International Conference on Automated Software Engineering . 1–11

work page 2022
[37]

John Hewitt and Christopher D. Manning. 2019. A Structural Probe for Find- ing Syntax in Word Representations. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguis- tics: Human Language Technologies, Volume 1 (Long and Short Papers) . As- sociation for Computational Linguistics, Minneapolis, Minn...

work page doi:10.18653/v1/n19-1419 2019
[38]

Michael Hind and Anthony Pioli. 2000. Which Pointer Analysis Should I Use? SIGSOFT Softw. Eng. Notes 25, 5 (aug 2000), 113–123. https://doi.org/10.1145/ 347636.348916

work page arXiv 2000
[39]

Xinyi Hou, Yanjie Zhao, Yue Liu, Zhou Yang, Kailong Wang, Li Li, Xiapu Luo, David Lo, John Grundy, and Haoyu Wang. 2023. Large language models for soft- ware engineering: A systematic literature review. arXiv preprint arXiv:2308.10620 (2023)

work page arXiv 2023
[40]

Xue Jiang, Zhuoran Zheng, Chen Lyu, Liang Li, and Lei Lyu. 2021. TreeBERT: A tree-based pre-trained model for programming language. In Uncertainty in Artificial Intelligence. PMLR, 54–63

work page 2021
[41]

Xin Jin, Kexin Pei, Jun Yeon Won, and Zhiqiang Lin. 2022. SymLM: Predicting Function Names in Stripped Binaries via Context-Sensitive Execution-Aware Code Embeddings. InProceedings of the 2022 ACM SIGSAC Conference on Computer Conference’17, July 2017, Washington, DC, USA Wei Ma, Shangqing Liu, Zhihao Lin, Wenhan Wang, Qiang Hu, Cen Zhang, Ye Liu, Li Li, ...

work page arXiv 2022
[42]

Ken Kennedy. 1979. A survey of data flow analysis techniques . IBM Thomas J. Watson Research Division

work page 1979
[43]

Junhyoung Kim, TaeGuen Kim, and Eul Gyu Im. 2014. Survey of dynamic taint analysis. In 2014 4th IEEE International Conference on Network Infrastructure and Digital Content. IEEE, 269–272

work page 2014
[44]

Rainer Koschke, Raimar Falke, and Pierre Frenzel. 2006. Clone detection using ab- stract syntax suffix trees. In 2006 13th Working Conference on Reverse Engineering . IEEE, 253–262

work page 2006
[45]

Raymond Li, Loubna Ben Allal, Yangtian Zi, Niklas Muennighoff, Denis Kocetkov, Chenghao Mou, Marc Marone, Christopher Akiki, Jia Li, Jenny Chim, et al. 2023. StarCoder: may the source be with you! arXiv preprint arXiv:2305.06161 (2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023
[46]

Pengfei Liu, Weizhe Yuan, Jinlan Fu, Zhengbao Jiang, Hiroaki Hayashi, and Graham Neubig. 2023. Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. Comput. Surveys 55, 9 (2023), 1–35

work page 2023
[47]

Pengfei Liu, Weizhe Yuan, Jinlan Fu, Zhengbao Jiang, Hiroaki Hayashi, and Graham Neubig. 2023. Pre-Train, Prompt, and Predict: A Systematic Survey of Prompting Methods in Natural Language Processing. ACM Comput. Surv. 55, 9, Article 195 (jan 2023), 35 pages. https://doi.org/10.1145/3560815

work page doi:10.1145/3560815 2023
[48]

Shangqing Liu, Yu Chen, Xiaofei Xie, Jingkai Siow, and Yang Liu. 2020. Retrieval- augmented generation for code summarization via hybrid gnn. arXiv preprint arXiv:2006.05405 (2020)

work page arXiv 2020
[49]

Shangqing Liu, Bozhi Wu, Xiaofei Xie, Guozhu Meng, and Yang Liu. 2023. Con- traBERT: Enhancing Code Pre-trained Models via Contrastive Learning. arXiv preprint arXiv:2301.09072 (2023)

work page arXiv 2023
[50]

Shangqing Liu, Xiaofei Xie, Jingkai Siow, Lei Ma, Guozhu Meng, and Yang Liu

work page
[51]

IEEE Transactions on Software Engineering (2023)

GraphSearchNet: Enhancing GNNs via Capturing Global Dependencies for Semantic Code Search. IEEE Transactions on Software Engineering (2023)

work page 2023
[52]

Shuai Lu, Nan Duan, Hojae Han, Daya Guo, Seung-won Hwang, and Alexey Svyatkovskiy. 2022. ReACC: A retrieval-augmented code completion framework. arXiv preprint arXiv:2203.07722 (2022)

work page arXiv 2022
[53]

James H Lubowitz. 2023. ChatGPT, an artificial intelligence chatbot, is impacting medical literature. Arthroscopy 39, 5 (2023), 1121–1122

work page 2023
[54]

Wei Ma, Mengjie Zhao, Ezekiel Soremekun, Qiang Hu, Jie M Zhang, Mike Pa- padakis, Maxime Cordy, Xiaofei Xie, and Yves Le Traon. 2022. GraphCode2Vec: generic code embedding via lexical and program dependence analyses. In Pro- ceedings of the 19th International Conference on Mining Software Repositories . 524–536

work page 2022
[55]

Wei Ma, Mengjie Zhao, Xiaofei Xie, Qiang Hu, Shangqing Liu, Jiexin Zhang, Wenhan Wang, and Yang Liu. 2022. Are Code Pre-trained Models Powerful to Learn Code Syntax and Semantics? https://api.semanticscholar.org/CorpusID: 258556996

work page 2022
[56]

Murphy, David Notkin, William G

Gail C. Murphy, David Notkin, William G. Griswold, and Erica S. Lan. 1998. An Empirical Study of Static Call Graph Extractors. ACM Trans. Softw. Eng. Methodol. 7, 2 (apr 1998), 158–191. https://doi.org/10.1145/279310.279314

work page doi:10.1145/279310.279314 1998
[57]

Anh Nguyen-Duc, Beatriz Cabrero-Daniel, Adam Przybylek, Chetan Arora, Dron Khanna, Tomas Herda, Usman Rafiq, Jorge Melegati, Eduardo Guerra, Kai-Kristian Kemell, et al. 2023. Generative Artificial Intelligence for Software Engineering–A Research Agenda. arXiv preprint arXiv:2310.18648 (2023)

work page arXiv 2023
[58]

Changan Niu, Chuanyi Li, Vincent Ng, Jidong Ge, Liguo Huang, and Bin Luo

work page
[59]

In Proceedings of the 44th International Conference on Software Engineering (Pittsburgh, Pennsylvania) (ICSE ’22)

SPT-Code: Sequence-to-Sequence Pre-Training for Learning Source Code Representations. In Proceedings of the 44th International Conference on Software Engineering (Pittsburgh, Pennsylvania) (ICSE ’22). Association for Computing Machinery, New York, NY, USA, 2006–2018. https://doi.org/10.1145/3510003. 3510096

work page doi:10.1145/3510003 2006
[60]

OpenAI. 2019. ChatGPT Demo. https://www.youtube.com/watch?v= outcGtbnMuQ&ab_channel=OpenAI

work page 2019
[61]

OpenAI. 2023. GPT-3.5 Turbo fine-tuning and API updates . https://openai.com/ blog/gpt-3-5-turbo-fine-tuning-and-api-updates

work page 2023
[62]

OpenAI. 2023. GPT-4 Technical Report. ArXiv abs/2303.08774 (2023). https: //api.semanticscholar.org/CorpusID:257532815

work page internal anchor Pith review Pith/arXiv arXiv 2023
[63]

OpenAI. 2023. GPT-4 Technical Report. arXiv (2023)

work page 2023
[64]

Mike Papadakis, Yue Jia, Mark Harman, and Yves Le Traon. 2015. Trivial compiler equivalence: A large scale empirical study of a simple, fast and effective equivalent mutant detection technique. In 2015 IEEE/ACM 37th IEEE International Conference on Software Engineering, Vol. 1. IEEE, 936–946

work page 2015
[65]

Mike Papadakis, Marinos Kintis, Jie Zhang, Yue Jia, Yves Le Traon, and Mark Harman. 2019. Mutation testing advances: an analysis and survey. In Advances in Computers. Vol. 112. Elsevier, 275–378

work page 2019
[66]

Kexin Pei, Zhou Xuan, Junfeng Yang, Suman Jana, and Baishakhi Ray. 2020. Trex: Learning execution semantics from micro-traces for binary similarity. arXiv preprint arXiv:2012.08680 (2020)

[67]

Keqin Peng, Liang Ding, Qihuang Zhong, Li Shen, Xuebo Liu, Min Zhang, Yuanxin Ouyang, and Dacheng Tao. 2023. Towards making the most of chatgpt for machine translation. arXiv preprint arXiv:2303.13780 (2023)

[68]

Anna Rogers, Olga Kovaleva, and Anna Rumshisky. 2021. A Primer in BERTology: What We Know About How BERT Works. Transactions of the Association for Computational Linguistics 8 (01 2021), 842–866. https://doi.org/10.1162/tacl_a_00349 arXiv:https://direct.mit.edu/tacl/article-pdf/doi/10.1162/tacl_a_00349/1923281/tacl_a_00349.pdf

[69]

Anna Rogers, Olga Kovaleva, and Anna Rumshisky. 2021. A primer in BERTology: What we know about how BERT works. Transactions of the Association for Computational Linguistics 8 (2021), 842–866

[70]

Baptiste Roziere, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Tal Remez, Jérémy Rapin, et al. 2023. Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023)

[71]

Taylor Shin, Yasaman Razeghi, Robert L Logan IV, Eric Wallace, and Sameer Singh. 2020. Autoprompt: Eliciting knowledge from language models with automatically generated prompts. arXiv preprint arXiv:2010.15980 (2020)

[73]

Yannis Smaragdakis, George Balatsouras, et al. 2015. Pointer analysis. Foundations and Trends® in Programming Languages 2, 1 (2015), 1–69

[74]

Dominik Sobania, Martin Briesch, Carol Hanna, and Justyna Petke. 2023. An analysis of the automatic bug fixing performance of chatgpt. arXiv preprint arXiv:2301.08653 (2023)

[75]

Nisan Stiennon, Long Ouyang, Jeffrey Wu, Daniel Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, and Paul F Christiano. 2020. Learning to summarize with human feedback. Advances in Neural Information Processing Systems 33 (2020), 3008–3021

[76]

Yulei Sui and Jingling Xue. 2016. SVF: interprocedural static value-flow analysis in LLVM. In Proceedings of the 25th international conference on compiler construction. ACM, 265–266

[77]

Haoye Tian, Weiqi Lu, Tsz On Li, Xunzhu Tang, Shing-Chi Cheung, Jacques Klein, and Tegawendé F Bissyandé. 2023. Is ChatGPT the Ultimate Programming Assistant–How far is it? arXiv preprint arXiv:2304.11938 (2023)

[78]

Sergey Troshin and Nadezhda Chirkova. 2022. Probing Pretrained Models of Source Codes. In Proceedings of the Fifth BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP. Association for Computational Linguistics, Abu Dhabi, United Arab Emirates (Hybrid), 371–383. https://aclanthology.org/2022.blackboxnlp-1.31

[79]

Lewis Tunstall, Nathan Lambert, Nazneen Rajani, Edward Beeching, Teven Le Scao, Leandro von Werra, Sheon Han, Philipp Schmid, and Alexander Rush. 2023. Creating a Coding Assistant with StarCoder. Hugging Face Blog (2023). https://huggingface.co/blog/starchat
