
arxiv: 2305.12138 · v5 · pith:QI6FLKNOnew · submitted 2023-05-20 · 💻 cs.SE · cs.AI

Exploring Code Analysis: Zero-Shot Insights on Syntax and Semantics with LLMs

Pith reviewed 2026-05-24 08:08 UTC · model grok-4.3

classification 💻 cs.SE cs.AI
keywords large language models · code analysis · syntax parsing · static analysis · dynamic reasoning · zero-shot evaluation · software engineering · AST generation

The pith

LLMs reach 90%+ on syntax parsing tasks but stay below 70% on dynamic reasoning with strong project-to-project variability.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether large language models can perform the core steps of code analysis that human developers use: breaking code into syntax trees, inferring static relationships such as data flow and taint, and reasoning about runtime behavior such as flaky tests. Across 21 models and 3,124 samples in four languages, the results show a stable ordering of difficulty—syntax is handled reliably, static inference is workable, and dynamic reasoning is weak and unstable when code moves to new projects. A reader would care because these three layers underpin debugging, security checks, and maintenance; if the ordering is real, it tells practitioners when LLMs can safely replace or assist existing analyzers and when they cannot.

Core claim

The authors claim that LLMs display a consistent capability hierarchy on fundamental code analysis: they generate abstract syntax trees at 90%+ accuracy and match expressions at 84-100%, perform adequately on static-semantics tasks such as control-flow graphs and taint analysis, yet remain below 70% on dynamic-reasoning tasks and exhibit per-project F1 scores that swing from 0 to 1.0. The same ordering appears across model families and sizes, which the authors interpret as evidence of structural rather than transient limits. They further note that LLMs supply cross-language generalization at the cost of nondeterministic outputs, while conventional tools supply deterministic results at the cost of language-specific configuration.
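To make the syntax-layer task concrete, the sketch below shows the kind of tree a deterministic parser produces for a tiny function, which is the structure an LLM is asked to reproduce in AST generation. It uses Python's standard `ast` module as a stand-in; the paper itself compares against Tree-sitter, and the sample source and the helper `node_types` are hypothetical illustrations, not part of the benchmark.

```python
import ast

# Hypothetical sample; not taken from the paper's benchmark.
SOURCE = "def add(a, b):\n    return a + b\n"


def node_types(source: str) -> list[str]:
    """Return the type name of every AST node, in the order ast.walk yields them.

    The ast documentation does not guarantee a traversal order, so callers
    should treat the result as a bag of node types rather than a sequence.
    """
    tree = ast.parse(source)
    return [type(node).__name__ for node in ast.walk(tree)]


if __name__ == "__main__":
    # Pretty-print the full tree, then the flat list of node types.
    print(ast.dump(ast.parse(SOURCE), indent=2))
    print(node_types(SOURCE))
```

A parser returns the same tree on every run, which is the deterministic behavior the summary contrasts with LLM output.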

What carries the argument

A three-layer protocol that scores syntax parsing, static-semantics inference, and dynamic reasoning on the same code samples using automated metrics, expert adjudication, and consistency checks.

If this is right

  • LLMs can supply cross-language analysis without per-language configuration that traditional tools require.
  • LLM outputs on dynamic tasks will need external validation because of high data-shift sensitivity.
  • Traditional analyzers remain necessary for tasks where deterministic guarantees matter.
  • A hybrid workflow can route syntax and simple static checks to LLMs and route dynamic checks to conventional tools.
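The hybrid workflow in the last bullet could be sketched roughly as follows. This is a minimal illustration under stated assumptions: the task names follow the abstract, but the `TASK_LAYER` mapping, the `route` function, and the backend labels are hypothetical. The summary does not say which static checks count as "simple", so the sketch treats every static task as LLM-eligible unless the caller asks for a deterministic guarantee.

```python
from enum import Enum


class Layer(Enum):
    SYNTAX = "syntax"
    STATIC = "static"
    DYNAMIC = "dynamic"


# Hypothetical task-to-layer mapping, based on the task names in the abstract.
TASK_LAYER = {
    "ast_generation": Layer.SYNTAX,
    "expression_matching": Layer.SYNTAX,
    "cfg_construction": Layer.STATIC,
    "data_dependency": Layer.STATIC,
    "taint_analysis": Layer.STATIC,
    "flaky_test_reasoning": Layer.DYNAMIC,
}


def route(task: str, needs_guarantee: bool = False) -> str:
    """Choose a backend label for a task.

    Dynamic tasks, and any task that needs a deterministic guarantee,
    go to a conventional analyzer; everything else goes to the LLM.
    """
    layer = TASK_LAYER[task]
    if needs_guarantee or layer is Layer.DYNAMIC:
        return "traditional_tool"
    return "llm"


if __name__ == "__main__":
    for name in TASK_LAYER:
        print(name, "->", route(name))
```

In practice the LLM branch would still need output validation, as the second bullet points out.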

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the hierarchy is fundamental, fine-tuning focused only on dynamic tasks may still leave a residual gap unless the training distribution covers project diversity.
  • The observed sensitivity to project shift suggests that LLMs may be capturing surface patterns rather than portable reasoning rules.
  • Tool builders could expose an explicit 'dynamic-reasoning' flag so users know when to distrust zero-shot LLM answers.

Load-bearing premise

The nine chosen tasks and 3,124 code samples, scored by the three-layer protocol, give a representative picture of what counts as fundamental code analysis.

What would settle it

A new model that sustains above 80% F1 on the dynamic-reasoning tasks across multiple held-out projects without retraining would contradict the claim of a stable performance hierarchy.
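One way to operationalize this test is to compute F1 separately for each held-out project and require every project to clear the threshold. The sketch below assumes binary labels (for example, flaky versus not flaky); the record layout and the function names are hypothetical, and F1 is defined as 0.0 when a project has no true or predicted positives, which is a convention chosen here rather than one stated in the paper.

```python
from collections import defaultdict


def f1_score(y_true, y_pred):
    """Binary F1 from parallel lists of booleans; 0.0 if undefined."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def per_project_f1(records):
    """records: iterable of (project, label, prediction) triples."""
    grouped = defaultdict(lambda: ([], []))
    for project, label, pred in records:
        grouped[project][0].append(label)
        grouped[project][1].append(pred)
    return {proj: f1_score(t, p) for proj, (t, p) in grouped.items()}


def sustains_threshold(records, threshold=0.8):
    """True only if every project's F1 is strictly above the threshold."""
    scores = per_project_f1(records)
    return bool(scores) and all(s > threshold for s in scores.values())


if __name__ == "__main__":
    # Hypothetical records: perfect on project "a", all misses on project "b".
    demo = [
        ("a", True, True), ("a", False, False),
        ("b", True, False), ("b", True, False),
    ]
    print(per_project_f1(demo))
    print(sustains_threshold(demo))
```

Reporting the per-project scores rather than a pooled F1 is what exposes the 0 to 1.0 swings the paper describes.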

Figures

Figures reproduced from arXiv: 2305.12138 by Cen Zhang, Li Li, Liming Nie, Lingxiao Jiang, Qiang Hu, Shangqing Liu, Wei Ma, Wenhan Wang, Yang Liu, Ye Liu, Zhihao Lin.

Figure 1: A buggy function of Bucketsort from QuixBugs.
Figure 2: An semantic equivalent version of the buggy func… (caption truncated in source)
Figure 3: The overview of our study.
Figure 4: An example of the task of Expression Matching.
Figure 6: (1) Data Dependency Example (Python), (2) Taint… (caption truncated in source)
Figure 7: (1) Equivalent Mutant Example (Python) and… (caption truncated in source)
Figure 8: AST Generation Result.
Figure 10: CFG Generation Results. Panels: (a) Reasonable vs. Unreasonable (Good, Minor, No); (b) Issue Distribution (Redundancy, Fabrication, MissCall). Models: GPT4, GPT3.5, CodeLlama, StarChat. Vertical axis: Number.
Figure 14: Predictions of LLMs for Flaky Test Reasoning.
Original abstract

Code analysis is fundamental in Software Engineering, supporting debugging, optimization, and security assessment. Human developers approach it through syntax parsing, static semantics inference, and dynamic reasoning. Traditional tools are effective but limited by language specificity and weak cross-language generalization. Large language models (LLMs) are promising for code tasks, yet their capabilities for fundamental code analysis remain underexplored. We structure our study around three aspects aligned with human practices: syntax parsing, static semantics inference, and dynamic reasoning. We evaluate 21 state-of-the-art LLMs across nine tasks in four languages (C, Java, Python, Solidity), including AST generation, CFG construction, data dependency, taint analysis, and flaky test reasoning. We apply a three-layer evaluation protocol (automated metrics, expert adjudication, consistency validation) to 3,124 code samples, achieving high inter-rater reliability (Cohen's kappa = 0.844-0.936) and strong human-machine agreement (Gwet's AC1 = 0.500-0.727, F1 = 0.791-0.882). While the best LLMs excel in syntax parsing (AST 90%+, expression matching 84-100%) and show promise in static analysis, their dynamic reasoning remains limited (<70%) with high data-shift sensitivity (per-project F1 varying 0-1.0). This hierarchy holds across model families and scales, suggesting fundamental rather than transient limitations. These findings show how LLMs complement traditional analyzers: they offer cross-language generalization but non-deterministic outputs needing validation, while traditional tools give deterministic guarantees but need language-specific configuration. We contribute a validated evaluation framework with comparison against traditional analyzers (Tree-sitter, Soot, Joern) and task-specific applicability tiers. Benchmark: https://github.com/mathieu0905/llm_code_analysis.git

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper evaluates 21 state-of-the-art LLMs on nine tasks spanning syntax parsing (e.g., AST generation), static semantics inference (e.g., CFG construction, data dependency, taint analysis), and dynamic reasoning (e.g., flaky test reasoning) across C, Java, Python, and Solidity. Using 3,124 code samples and a three-layer protocol (automated metrics, expert adjudication, consistency validation) with reported high inter-rater reliability (Cohen's kappa 0.844-0.936) and human-machine agreement, it finds LLMs excel at syntax (AST 90%+, expression matching 84-100%), show promise in static analysis, but remain limited in dynamic reasoning (<70%) with high data-shift sensitivity (per-project F1 varying 0-1.0). This hierarchy holds across model families and scales, suggesting fundamental limitations; the work contributes a validated framework, applicability tiers, and benchmark comparing LLMs to traditional tools like Tree-sitter, Soot, and Joern.

Significance. If the evaluation protocol and sample representativeness hold, the results offer concrete empirical grounding for where LLMs can complement (via cross-language generalization) versus where they fall short of traditional deterministic analyzers in software engineering. The open benchmark and three-layer protocol with quantified agreement metrics are strengths that could support reproducible follow-up work on improving dynamic code reasoning.

major comments (1)
  1. [Abstract] Abstract: The central claim that the observed performance hierarchy (syntax strong, static promising, dynamic limited with data-shift sensitivity) reflects 'fundamental rather than transient limitations' is load-bearing on the representativeness of the nine tasks and 3,124 samples. The abstract provides no detail on sample selection criteria (e.g., distribution across projects, languages, or complexity levels) or how dynamic tasks isolate state reasoning versus prompt-following, leaving open the possibility that the <70% ceiling and 0-1.0 per-project F1 swings are artifacts of task design rather than intrinsic model properties.
minor comments (1)
  1. [Abstract] Abstract: The reported human-machine agreement metrics (Gwet's AC1 = 0.500-0.727, F1 = 0.791-0.882) are given as ranges without mapping to specific tasks or evaluation layers, reducing clarity on which aspects of the protocol drive the reliability claims.
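For readers unfamiliar with the two agreement statistics quoted above, the sketch below computes Cohen's kappa and Gwet's AC1 for two raters using their standard textbook definitions. It is an illustration only: the summary does not describe how the authors computed these values, and the function names and the toy ratings are hypothetical.

```python
from collections import Counter


def _observed_agreement(a, b):
    """Fraction of items on which the two raters give the same label."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohens_kappa(a, b):
    """Cohen's kappa: chance agreement from each rater's own marginals."""
    n = len(a)
    po = _observed_agreement(a, b)
    ca, cb = Counter(a), Counter(b)
    pe = sum((ca[c] / n) * (cb[c] / n) for c in set(ca) | set(cb))
    if pe == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (po - pe) / (1 - pe)


def gwets_ac1(a, b, categories):
    """Gwet's AC1: chance agreement from the averaged marginals.

    `categories` is the full rating scale and must have at least two entries.
    """
    k = len(categories)
    if k < 2:
        raise ValueError("AC1 needs at least two categories")
    n = len(a)
    po = _observed_agreement(a, b)
    ca, cb = Counter(a), Counter(b)
    pi = [(ca[c] + cb[c]) / (2 * n) for c in categories]
    pe = sum(p * (1 - p) for p in pi) / (k - 1)
    return (po - pe) / (1 - pe)


if __name__ == "__main__":
    # Hypothetical ratings: 1 = output judged correct, 0 = judged incorrect.
    human = [1, 1, 1, 0]
    metric = [1, 1, 0, 0]
    print(cohens_kappa(human, metric))
    print(gwets_ac1(human, metric, categories=(0, 1)))
```

The two statistics differ only in how they estimate chance agreement, so they can diverge on skewed label distributions; this is one reason a per-task breakdown of the reported ranges would be informative.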

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback. The concern about the abstract's brevity and its implications for the central claim is well-taken. We address it point-by-point below and will revise the abstract accordingly.

  1. Referee: [Abstract] Abstract: The central claim that the observed performance hierarchy (syntax strong, static promising, dynamic limited with data-shift sensitivity) reflects 'fundamental rather than transient limitations' is load-bearing on the representativeness of the nine tasks and 3,124 samples. The abstract provides no detail on sample selection criteria (e.g., distribution across projects, languages, or complexity levels) or how dynamic tasks isolate state reasoning versus prompt-following, leaving open the possibility that the <70% ceiling and 0-1.0 per-project F1 swings are artifacts of task design rather than intrinsic model properties.

    Authors: We agree the abstract is concise and omits explicit details on sample selection and task isolation. The full manuscript (Sections 3.1–3.2 and 4) specifies that the 3,124 samples were drawn from multiple open-source projects per language, stratified by size and complexity, with dynamic tasks (e.g., flaky-test reasoning) explicitly constructed to require inference over execution history and state changes rather than surface-level prompt compliance. The per-project F1 variance is reported as direct evidence of data-shift sensitivity. To address the referee’s point, we will revise the abstract to include a brief clause on sample distribution and task design. The claim of fundamental limitations is presented as an inference from the consistent hierarchy across 21 models and four scales; we do not claim it is proven, only that the empirical pattern is not explained by transient factors such as model size alone. revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical measurements with no derivations or self-referential fits


The paper reports direct experimental results from evaluating 21 LLMs on nine fixed tasks using 3,124 samples and a three-layer protocol (automated metrics, expert adjudication, consistency checks). No equations, fitted parameters, predictions derived from inputs, or load-bearing self-citations appear in the abstract or described methodology. The hierarchy claim (syntax > static > dynamic) is presented as an observed pattern across models, not as a mathematical consequence of prior definitions or fits. This matches the default case of an empirical study whose central claims rest on external data rather than internal construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the domain assumption that zero-shot prompting and the chosen tasks reveal fundamental rather than prompt-dependent capabilities, plus standard assumptions about the validity of automated metrics and expert adjudication.

axioms (2)
  • [domain assumption] Zero-shot prompting is sufficient to elicit and measure core code analysis capabilities in LLMs.
    The study design assumes that the reported performance hierarchy reflects intrinsic model limits rather than artifacts of prompt formulation.
  • [domain assumption] The selected tasks and samples are representative of fundamental code analysis as practiced by humans.
    The three-aspect structure (syntax, static semantics, dynamic reasoning) is taken as aligned with human practice without further justification in the abstract.


Forward citations

Cited by 6 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. (How) Do Large Language Models Understand High-Level Message Sequence Charts?

    cs.SE 2026-05 conditional novelty 6.0

    LLMs achieve only modest understanding of HMSC formal semantics at 52 percent accuracy, performing strongly on basic constructs but weakly on abstractions and traces.

  2. NeuroFlake: A Neuro-Symbolic LLM Framework for Flaky Test Classification

    cs.SE 2026-05 unverdicted novelty 6.0

    NeuroFlake integrates discriminative token mining into LLMs to classify flaky tests, raising F1-score to 69.34% on FlakeBench while showing greater robustness to semantic-preserving perturbations than prior methods.

  3. LLM-Powered Detection of Price Manipulation in DeFi

    cs.CR 2025-10 unverdicted novelty 6.0

    PMDetector is a hybrid static-plus-LLM framework that detects price manipulation in DeFi protocols via taint analysis, defense filtering, attack simulation, and validation, achieving 88% precision and 90% recall on 73...

  4. A Large Language Model Approach to Generating Bypass Rules for Malware Evasion in Analysis Sandbox

    cs.CR 2026-05 unverdicted novelty 5.0

    ABLE uses LLMs with sanitization and iterative refinement to generate bypass YARA rules from malware traces, achieving 79% success on 334 samples and 47% more family detections.

  5. (How) Do Large Language Models Understand High-Level Message Sequence Charts?

    cs.SE 2026-05 unverdicted novelty 5.0

    LLMs achieve only 52% overall accuracy on HMSC semantic tasks, performing well on basic concepts but poorly on abstractions, compositions, and trace calculations.

  6. CodePori: Large-Scale System for Autonomous Software Development Using Multi-Agent Technology

    cs.SE 2024-02 unverdicted novelty 4.0

    CodePori is a multi-agent LLM system for code generation whose participant evaluation identifies practical challenges like memory limits and hallucinations missed by binary benchmarks.
