Tailored Prompts, Targeted Protection: Vulnerability-Specific LLM Analysis for Smart Contracts
Pith reviewed 2026-05-07 15:36 UTC · model grok-4.3
The pith
An LLM framework using AST context and tailored prompts detects 13 smart contract vulnerability types at 0.92 positive recall.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By leveraging precise AST-based context extraction and vulnerability-specific prompt design, customized LLM detectors can be instantiated for 13 prevalent smart contract vulnerability categories, achieving an average positive recall of 0.92 and an average negative recall of 0.85 on a dataset of 31,165 annotated instances from over 3,200 real-world projects.
What carries the argument
Vulnerability-specific prompt design combined with AST-based context extraction, which supplies the LLM with targeted code snippets and instructions for each vulnerability category.
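The paper's text does not include its actual templates or extraction code, but the described pipeline (an AST-derived function snippet fed into a per-vulnerability prompt) can be sketched. Everything below is a hypothetical illustration: the template wording, `extract_context`, and `build_prompt` are invented names, not the authors' implementation.

```python
# Hypothetical sketch of the described pipeline: pull one function's source
# out of a pre-parsed AST, then fill a vulnerability-specific template.
# The templates and function names are illustrative, not from the paper.
PROMPT_TEMPLATES = {
    "reentrancy": (
        "You are a smart contract auditor. The function below makes an "
        "external call. Can state changes after the call be exploited via "
        "reentrancy? Answer VULNERABLE or SAFE.\n\n{snippet}"
    ),
    "integer-overflow": (
        "Check the arithmetic in the function below for unchecked "
        "overflow/underflow. Answer VULNERABLE or SAFE.\n\n{snippet}"
    ),
}

def extract_context(ast_functions, target_name):
    """Return the source snippet of one function.

    `ast_functions` is assumed to map function names to their source text,
    e.g. built by walking python-solidity-parser output.
    """
    return ast_functions[target_name]

def build_prompt(vuln_type, ast_functions, target_name):
    snippet = extract_context(ast_functions, target_name)
    return PROMPT_TEMPLATES[vuln_type].format(snippet=snippet)

# Usage with a toy "AST" (a name-to-source mapping):
funcs = {"withdraw": "function withdraw() public { ... }"}
prompt = build_prompt("reentrancy", funcs, "withdraw")
```

The point of the design is that each vulnerability category gets its own instructions and its own notion of relevant context, rather than one generic "find bugs" prompt over the whole contract.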
Load-bearing premise
The 31,165 professionally annotated instances accurately represent real-world vulnerabilities without labeling errors or bias, and LLM outputs remain reliable on unseen contracts without high rates of missed issues or false alarms.
What would settle it
Evaluating the detectors on a fresh collection of smart contracts containing documented vulnerabilities from recent exploits and measuring whether positive recall stays near 0.92 and negative recall near 0.85.
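Such a replication only settles the question if the metrics are computed the same way. Assuming the standard definitions (positive recall over truly vulnerable instances, negative recall over truly clean ones), they reduce to:

```python
def positive_recall(tp, fn):
    # Fraction of truly vulnerable instances the detector flags.
    return tp / (tp + fn)

def negative_recall(tn, fp):
    # Fraction of truly clean instances the detector passes as safe.
    return tn / (tn + fp)

# Toy confusion counts consistent with the headline averages:
assert positive_recall(tp=92, fn=8) == 0.92
assert negative_recall(tn=85, fp=15) == 0.85
```

Note these are per-class recalls, not precision: a detector could hit 0.92 positive recall while still raising many false alarms if the fresh collection is mostly clean code.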
Figures
Original abstract
Smart contracts on blockchains are prone to diverse security vulnerabilities that can lead to significant financial losses due to their immutable nature. Existing detection approaches often lack flexibility across vulnerability types and rely heavily on manually crafted expert rules. In this paper, we present an LLM-based framework for practical smart contract vulnerability detection. We construct and release a large-scale dataset comprising 31,165 professionally annotated vulnerability instances collected from over 3,200 real-world projects across 15 major blockchain platforms. Our approach leverages precise AST-based context extraction and vulnerability-specific prompt design to instantiate customized detectors for 13 prevalent vulnerability categories. Experimental results demonstrate strong effectiveness, achieving an average positive recall of 0.92 and an average negative recall of 0.85, highlighting the potential of carefully engineered contextual prompting for scalable and high-precision smart contract security analysis.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces an LLM-based framework for smart contract vulnerability detection that uses AST-based context extraction and vulnerability-specific prompt design to create customized detectors for 13 prevalent vulnerability categories. It constructs and releases a dataset of 31,165 professionally annotated instances drawn from over 3,200 real-world projects across 15 blockchain platforms. The central empirical claim is that the approach achieves strong performance, with an average positive recall of 0.92 and average negative recall of 0.85.
Significance. If the reported recalls prove robust under proper held-out evaluation, the work would be significant for offering a flexible, prompt-engineered alternative to rigid rule-based or static-analysis tools in smart-contract security. A clear strength is the construction and public release of a large-scale, multi-platform annotated dataset, which can serve as a reusable benchmark and addresses a common data scarcity issue in the field. The vulnerability-specific prompting strategy also illustrates a practical way to adapt general LLMs to domain-specific detection tasks without full fine-tuning.
major comments (3)
- The Experimental Evaluation section (and abstract) reports average positive recall of 0.92 and negative recall of 0.85 but provides no information on the train/test split of the 31,165 instances, whether prompt engineering and template selection were performed exclusively on training data, or results on contracts from entirely held-out projects/platforms. This is load-bearing for the central effectiveness claim, because without explicit separation the metrics could reflect prompt overfitting or label leakage rather than generalization.
- The dataset construction description lacks any account of the professional annotation protocol, including inter-annotator agreement statistics, expert review process, or steps taken to mitigate labeling errors and bias. Because the recalls are computed against these labels, the absence of validation details directly affects the reliability of the headline numbers and the claim that the instances “accurately represent real-world vulnerabilities.”
- No baseline comparisons (e.g., to established static analyzers such as Slither or Mythril, or to prior LLM-based detectors) are presented alongside the internal recall figures. Without such comparisons it is impossible to determine whether the tailored-prompt approach advances beyond existing methods or simply reproduces known performance on the same data.
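The leakage concern in the first comment can be made concrete: a project-level split assigns every instance from a project to exactly one side, so prompt development never sees code from held-out projects. A minimal stdlib sketch, with hashing as one of several reasonable ways to assign projects deterministically (the function name and data layout are illustrative):

```python
import hashlib

def project_split(instances, test_fraction=0.2):
    """Split (project_id, instance) pairs so no project straddles the split.

    Hashing the project id to a bucket puts all of a project's instances
    on the same side, independent of input order.
    """
    train, test = [], []
    for project_id, instance in instances:
        bucket = int(hashlib.sha256(project_id.encode()).hexdigest(), 16) % 100
        (test if bucket < test_fraction * 100 else train).append(instance)
    return train, test

data = [("proj-a", 1), ("proj-a", 2), ("proj-b", 3)]
train, test = project_split(data)
# Instances 1 and 2 always land on the same side, because they share a project.
```

A random split over the 31,165 instances would not have this property: two near-identical contracts from the same project could land on opposite sides, inflating the reported recalls.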
minor comments (2)
- The abstract and introduction would benefit from a brief statement of the exact 13 vulnerability categories and the precise definitions of positive/negative recall used in the averages.
- The paper should include a limitations paragraph discussing potential LLM-specific issues such as hallucination rates on unseen contract patterns and inference cost at scale.
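The inference-cost point is easy to quantify with a back-of-envelope estimate. The token count and per-token price below are hypothetical assumptions, not figures from the paper:

```python
def scan_cost_usd(n_instances, tokens_per_instance, usd_per_million_tokens):
    # Rough upper bound: one LLM call per instance, prompt + completion
    # tokens lumped together at a single blended rate.
    return n_instances * tokens_per_instance * usd_per_million_tokens / 1_000_000

# 31,165 instances, an assumed ~2,000 tokens each, at an assumed $5/M tokens:
cost = scan_cost_usd(31_165, 2_000, 5.0)
# ≈ $311.65 for one full-dataset pass per vulnerability category
```

Since the framework runs 13 category-specific detectors, a full sweep multiplies this accordingly, which is exactly why a limitations paragraph on inference cost matters.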
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below and outline the revisions we will make to the manuscript.
Point-by-point responses
-
Referee: The Experimental Evaluation section (and abstract) reports average positive recall of 0.92 and negative recall of 0.85 but provides no information on the train/test split of the 31,165 instances, whether prompt engineering and template selection were performed exclusively on training data, or results on contracts from entirely held-out projects/platforms. This is load-bearing for the central effectiveness claim, because without explicit separation the metrics could reflect prompt overfitting or label leakage rather than generalization.
Authors: We agree that the absence of these details is a material shortcoming that weakens the central claim. The current manuscript does not describe the split or confirm that prompt engineering was restricted to training data. We will revise the Experimental Evaluation section to document the partitioning procedure (including any project- or platform-level separation), state that all prompt design occurred on training data only, and add results on contracts drawn from entirely held-out projects and platforms. revision: yes
-
Referee: The dataset construction description lacks any account of the professional annotation protocol, including inter-annotator agreement statistics, expert review process, or steps taken to mitigate labeling errors and bias. Because the recalls are computed against these labels, the absence of validation details directly affects the reliability of the headline numbers and the claim that the instances “accurately represent real-world vulnerabilities.”
Authors: We concur that a full account of the annotation protocol is required to support the reliability of the reported metrics. The manuscript currently provides only a high-level statement that the instances are “professionally annotated.” We will add a dedicated subsection describing the annotation guidelines, the number and qualifications of annotators, inter-annotator agreement statistics, the multi-stage expert review process, and the specific measures used to reduce labeling errors and bias. revision: yes
-
Referee: No baseline comparisons (e.g., to established static analyzers such as Slither or Mythril, or to prior LLM-based detectors) are presented alongside the internal recall figures. Without such comparisons it is impossible to determine whether the tailored-prompt approach advances beyond existing methods or simply reproduces known performance on the same data.
Authors: We agree that direct comparisons to established tools are necessary to situate the contribution. The manuscript presents only the internal recall figures of the proposed framework. We will revise the Experimental Evaluation section to include side-by-side results against Slither, Mythril, and representative prior LLM-based detectors on the same 31,165-instance dataset and the same 13 vulnerability categories. revision: yes
Circularity Check
No circularity; empirical results on annotated dataset with no self-referential derivations
Full rationale
The paper presents an empirical framework: it constructs a dataset of 31,165 annotated instances from real-world projects, applies AST-based context extraction, designs vulnerability-specific prompts for 13 categories, and reports experimental recalls (0.92 positive, 0.85 negative). No equations, uniqueness theorems, fitted parameters renamed as predictions, or self-citations that bear the central load are present in the provided text. The claims rest on dataset construction and prompt engineering evaluated via standard metrics rather than reducing by definition or construction to the inputs themselves. This is a standard applied ML paper structure with no load-bearing circular steps.
Axiom & Free-Parameter Ledger
free parameters (1)
- Vulnerability-specific prompt templates
axioms (1)
- domain assumption: The professionally annotated dataset of 31,165 instances accurately captures real-world smart contract vulnerabilities without significant errors or selection bias.
Reference graph
Works this paper leans on
-
[1]
An empirical analysis of smart contracts: platforms, applications, and design patterns,
M. Bartoletti and L. Pompianu, “An empirical analysis of smart contracts: platforms, applications, and design patterns,” in Financial Cryptography and Data Security: FC 2017 International Workshops, WAHC, BITCOIN, VOTING, WTSC, and TA, Sliema, Malta, April 7, 2017, Revised Selected Papers 21, pp. 494–509, Springer, 2017
2017
-
[2]
A systematic literature review of blockchain and smart contract development: Techniques, tools, and open challenges,
A. Vacca, A. Di Sorbo, C. A. Visaggio, and G. Canfora, “A systematic literature review of blockchain and smart contract development: Techniques, tools, and open challenges,” Journal of Systems and Software, vol. 174, p. 110891, 2021
2021
-
[3]
Smart contracts vulnerabilities: a call for blockchain software engineering?,
G. Destefanis, M. Marchesi, M. Ortu, R. Tonelli, A. Bracciali, and R. Hierons, “Smart contracts vulnerabilities: a call for blockchain software engineering?,” in 2018 International Workshop on Blockchain Oriented Software Engineering (IWBOSE), pp. 19–25, IEEE, 2018
2018
-
[4]
Sok: Decentralized finance (defi) attacks,
L. Zhou, X. Xiong, J. Ernstberger, S. Chaliasos, Z. Wang, Y. Wang, K. Qin, R. Wattenhofer, D. Song, and A. Gervais, “Sok: Decentralized finance (defi) attacks,” in 2023 IEEE Symposium on Security and Privacy (SP), pp. 2444–2461, IEEE, 2023
2023
-
[5]
Blockchain smart contracts formalization: Approaches and challenges to address vulnerabilities,
A. Singh, R. M. Parizi, Q. Zhang, K.-K. R. Choo, and A. Dehghantanha, “Blockchain smart contracts formalization: Approaches and challenges to address vulnerabilities,” Computers & Security, vol. 88, p. 101654, 2020
2020
-
[6]
Demystifying exploitable bugs in smart contracts,
Z. Zhang, B. Zhang, W. Xu, and Z. Lin, “Demystifying exploitable bugs in smart contracts,” in 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE), pp. 615–627, IEEE, 2023
2023
-
[7]
Gpt-4 technical report
J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat, et al., “Gpt-4 technical report,” arXiv preprint arXiv:2303.08774, 2023
2023
-
[8]
A Survey on Large Language Models for Code Generation
J. Jiang, F. Wang, J. Shen, S. Kim, and S. Kim, “A survey on large language models for code generation,” arXiv preprint arXiv:2406.00515, 2024
2024
-
[9]
Language models are unsupervised multitask learners,
A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever, et al., “Language models are unsupervised multitask learners,” OpenAI blog, vol. 1, no. 8, p. 9, 2019
2019
-
[10]
Language models are few-shot learners,
T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al., “Language models are few-shot learners,” Advances in neural information processing systems, vol. 33, pp. 1877–1901, 2020
2020
-
[11]
Chain-of-thought prompting elicits reasoning in large language models,
J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V. Le, D. Zhou, et al., “Chain-of-thought prompting elicits reasoning in large language models,” Advances in neural information processing systems, vol. 35, pp. 24824–24837, 2022
2022
-
[12]
An overview of smart contract: architecture, applications, and future trends,
S. Wang, Y. Yuan, X. Wang, J. Li, R. Qin, and F.-Y. Wang, “An overview of smart contract: architecture, applications, and future trends,” in 2018 IEEE intelligent vehicles symposium (IV), pp. 108–113, IEEE, 2018
2018
-
[13]
Smart contract vulnerability detection technique: A survey,
P. Qian, Z. Liu, Q. He, B. Huang, D. Tian, and X. Wang, “Smart contract vulnerability detection technique: A survey,” arXiv preprint arXiv:2209.05872, 2022
2022
-
[14]
A semantic framework for the security analysis of ethereum smart contracts,
I. Grishchenko, M. Maffei, and C. Schneidewind, “A semantic framework for the security analysis of ethereum smart contracts,” in International conference on principles of security and trust, pp. 243–269, Springer, 2018
2018
-
[15]
Kevm: A complete formal semantics of the ethereum virtual machine,
E. Hildenbrandt, M. Saxena, N. Rodrigues, X. Zhu, P. Daian, D. Guth, B. Moore, D. Park, Y. Zhang, A. Stefanescu, et al., “Kevm: A complete formal semantics of the ethereum virtual machine,” in 2018 IEEE 31st Computer Security Foundations Symposium (CSF), pp. 204–217, IEEE, 2018
2018
-
[16]
Making smart contracts smarter,
L. Luu, D.-H. Chu, H. Olickel, P. Saxena, and A. Hobor, “Making smart contracts smarter,” in Proceedings of the 2016 ACM SIGSAC conference on computer and communications security, pp. 254–269, 2016
2016
-
[17]
A framework for bug hunting on the ethereum blockchain,
B. Mueller, “A framework for bug hunting on the ethereum blockchain,” ConsenSys/mythril, 2017
2017
-
[18]
Contractfuzzer: Fuzzing smart contracts for vulnerability detection,
B. Jiang, Y. Liu, and W. K. Chan, “Contractfuzzer: Fuzzing smart contracts for vulnerability detection,” in Proceedings of the 33rd ACM/IEEE international conference on automated software engineering, pp. 259–269, 2018
2018
-
[19]
Reguard: finding reentrancy bugs in smart contracts,
C. Liu, H. Liu, Z. Cao, Z. Chen, B. Chen, and B. Roscoe, “Reguard: finding reentrancy bugs in smart contracts,” in Proceedings of the 40th international conference on software engineering: companion proceedings, pp. 65–68, 2018
2018
-
[20]
Slither: a static analysis framework for smart contracts,
J. Feist, G. Grieco, and A. Groce, “Slither: a static analysis framework for smart contracts,” in 2019 IEEE/ACM 2nd International Workshop on Emerging Trends in Software Engineering for Blockchain (WETSEB), pp. 8–15, IEEE, 2019
2019
-
[21]
Vandal: A scalable security analysis framework for smart contracts,
L. Brent, A. Jurisevic, M. Kong, E. Liu, F. Gauthier, V. Gramoli, R. Holz, and B. Scholz, “Vandal: A scalable security analysis framework for smart contracts,” arXiv preprint arXiv:1809.03981, 2018
2018
-
[22]
Towards safer smart contracts: A sequence learning approach to detecting security threats,
W. J.-W. Tann, X. J. Han, S. S. Gupta, and Y.-S. Ong, “Towards safer smart contracts: A sequence learning approach to detecting security threats,” arXiv preprint arXiv:1811.06632, 2018
2018
-
[23]
Smart contract vulnerability detection using graph neural networks,
Y. Zhuang, Z. Liu, P. Qian, Q. Liu, X. Wang, and Q. He, “Smart contract vulnerability detection using graph neural networks,” in Proceedings of the twenty-ninth international conference on international joint conferences on artificial intelligence, pp. 3283–3290, 2021
2021
-
[24]
Gptscan: Detecting logic vulnerabilities in smart contracts by combining gpt with program analysis,
Y. Sun, D. Wu, Y. Xue, H. Liu, H. Wang, Z. Xu, X. Xie, and Y. Liu, “Gptscan: Detecting logic vulnerabilities in smart contracts by combining gpt with program analysis,” in Proceedings of the IEEE/ACM 46th International Conference on Software Engineering, pp. 1–13, 2024
2024
-
[25]
Evaluating Large Language Models Trained on Code
M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. d. O. Pinto, J. Kaplan, H. Edwards, Y. Burda, N. Joseph, G. Brockman, et al., “Evaluating large language models trained on code,” arXiv preprint arXiv:2107.03374, 2021
2021
-
[26]
Chatgpt for good? on opportunities and challenges of large language models for education,
E. Kasneci, K. Seßler, S. Küchemann, M. Bannert, D. Dementieva, F. Fischer, U. Gasser, G. Groh, S. Günnemann, E. Hüllermeier, et al., “Chatgpt for good? on opportunities and challenges of large language models for education,” Learning and individual differences, vol. 103, p. 102274, 2023
2023
-
[27]
A Survey of Large Language Models
W. X. Zhao, K. Zhou, J. Li, T. Tang, X. Wang, Y. Hou, Y. Min, B. Zhang, J. Zhang, Z. Dong, et al., “A survey of large language models,” arXiv preprint arXiv:2303.18223, 2023
2023
-
[28]
A Systematic Survey of Prompt Engineering in Large Language Models: Techniques and Applications
P. Sahoo, A. K. Singh, S. Saha, V. Jain, S. Mondal, and A. Chadha, “A systematic survey of prompt engineering in large language models: Techniques and applications,” arXiv preprint arXiv:2402.07927, 2024
2024
-
[29]
Smartguard: An llm-enhanced framework for smart contract vulnerability detection,
H. Ding, Y. Liu, X. Piao, H. Song, and Z. Ji, “Smartguard: An llm-enhanced framework for smart contract vulnerability detection,” Expert Systems with Applications, vol. 269, p. 126479, 2025
2025
-
[30]
Smartllmsentry: A comprehensive llm based smart contract vulnerability detection framework,
O. Zaazaa and H. El Bakkali, “Smartllmsentry: A comprehensive llm based smart contract vulnerability detection framework,” Journal of Metaverse, vol. 4, no. 2, pp. 126–137, 2024
2024
-
[31]
Vulnhunt-gpt: a smart contract vulnerabilities detector based on openai chatgpt,
B. Boi, C. Esposito, and S. Lee, “Vulnhunt-gpt: a smart contract vulnerabilities detector based on openai chatgpt,” in Proceedings of the 39th ACM/SIGAPP Symposium on Applied Computing, pp. 1517–1524, 2024
2024
-
[32]
Automated smart contract vulnerability detection using fine-tuned large language models,
Z. Yang, G. Man, and S. Yue, “Automated smart contract vulnerability detection using fine-tuned large language models,” in Proceedings of the 2023 6th International Conference on Blockchain Technology and Applications, pp. 19–23, 2023
2023
-
[33]
A context-driven approach for co-auditing smart contracts with the support of gpt-4 code interpreter,
M. S. Bouafif, C. Zheng, I. A. Qasse, E. Zulkoski, M. Hamdaqa, and F. Khomh, “A context-driven approach for co-auditing smart contracts with the support of gpt-4 code interpreter,” arXiv preprint arXiv:2406.18075, 2024
2024
-
[34]
Vulnerability detector
Anonymous, “Vulnerability detector.” Available at (URL removed for double-blind review), 2026. Accessed: 2026-01-13
2026
-
[35]
python-solidity-parser
ConsenSys Diligence, “python-solidity-parser.” Available at github.com/ConsenSysDiligence/python-solidity-parser