Tailored Prompts, Targeted Protection: Vulnerability-Specific LLM Analysis for Smart Contracts
Pith reviewed 2026-05-07 15:36 UTC · model grok-4.3
The pith
An LLM framework using AST context and tailored prompts detects 13 smart contract vulnerability types at 0.92 positive recall.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By leveraging precise AST-based context extraction and vulnerability-specific prompt design, customized LLM detectors can be instantiated for 13 prevalent smart contract vulnerability categories, achieving an average positive recall of 0.92 and an average negative recall of 0.85 on a dataset of 31,165 annotated instances from over 3,200 real-world projects.
What carries the argument
Vulnerability-specific prompt design combined with AST-based context extraction, which supplies the LLM with targeted code snippets and instructions for each vulnerability category.
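The paper's text does not include its actual templates or extraction code, but the described pipeline (an AST-derived function snippet fed into a per-vulnerability prompt) can be sketched. Everything below is a hypothetical illustration: the template wording, `extract_context`, and `build_prompt` are invented names, not the authors' implementation.

```python
# Hypothetical sketch of the described pipeline: pull one function's source
# out of a pre-parsed AST, then fill a vulnerability-specific template.
# The templates and function names are illustrative, not from the paper.
PROMPT_TEMPLATES = {
    "reentrancy": (
        "You are a smart contract auditor. The function below makes an "
        "external call. Can state changes after the call be exploited via "
        "reentrancy? Answer VULNERABLE or SAFE.\n\n{snippet}"
    ),
    "integer-overflow": (
        "Check the arithmetic in the function below for unchecked "
        "overflow/underflow. Answer VULNERABLE or SAFE.\n\n{snippet}"
    ),
}

def extract_context(ast_functions, target_name):
    """Return the source snippet of one function.

    `ast_functions` is assumed to map function names to their source text,
    e.g. built by walking python-solidity-parser output.
    """
    return ast_functions[target_name]

def build_prompt(vuln_type, ast_functions, target_name):
    snippet = extract_context(ast_functions, target_name)
    return PROMPT_TEMPLATES[vuln_type].format(snippet=snippet)

# Usage with a toy "AST" (a name-to-source mapping):
funcs = {"withdraw": "function withdraw() public { ... }"}
prompt = build_prompt("reentrancy", funcs, "withdraw")
```

The point of the design is that each vulnerability category gets its own instructions and its own notion of relevant context, rather than one generic "find bugs" prompt over the whole contract.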
Load-bearing premise
The 31,165 professionally annotated instances accurately represent real-world vulnerabilities without labeling errors or bias, and LLM outputs remain reliable on unseen contracts without high rates of missed issues or false alarms.
What would settle it
Evaluating the detectors on a fresh collection of smart contracts containing documented vulnerabilities from recent exploits and measuring whether positive recall stays near 0.92 and negative recall near 0.85.
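Such a replication only settles the question if the metrics are computed the same way. Assuming the standard definitions (positive recall over truly vulnerable instances, negative recall over truly clean ones), they reduce to:

```python
def positive_recall(tp, fn):
    # Fraction of truly vulnerable instances the detector flags.
    return tp / (tp + fn)

def negative_recall(tn, fp):
    # Fraction of truly clean instances the detector passes as safe.
    return tn / (tn + fp)

# Toy confusion counts consistent with the headline averages:
assert positive_recall(tp=92, fn=8) == 0.92
assert negative_recall(tn=85, fp=15) == 0.85
```

Note these are per-class recalls, not precision: a detector could hit 0.92 positive recall while still raising many false alarms if the fresh collection is mostly clean code.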
Figures
Original abstract
Smart contracts on blockchains are prone to diverse security vulnerabilities that can lead to significant financial losses due to their immutable nature. Existing detection approaches often lack flexibility across vulnerability types and rely heavily on manually crafted expert rules. In this paper, we present an LLM-based framework for practical smart contract vulnerability detection. We construct and release a large-scale dataset comprising 31,165 professionally annotated vulnerability instances collected from over 3,200 real-world projects across 15 major blockchain platforms. Our approach leverages precise AST-based context extraction and vulnerability-specific prompt design to instantiate customized detectors for 13 prevalent vulnerability categories. Experimental results demonstrate strong effectiveness, achieving an average positive recall of 0.92 and an average negative recall of 0.85, highlighting the potential of carefully engineered contextual prompting for scalable and high-precision smart contract security analysis.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces an LLM-based framework for smart contract vulnerability detection that uses AST-based context extraction and vulnerability-specific prompt design to create customized detectors for 13 prevalent vulnerability categories. It constructs and releases a dataset of 31,165 professionally annotated instances drawn from over 3,200 real-world projects across 15 blockchain platforms. The central empirical claim is that the approach achieves strong performance, with an average positive recall of 0.92 and average negative recall of 0.85.
Significance. If the reported recalls prove robust under proper held-out evaluation, the work would be significant for offering a flexible, prompt-engineered alternative to rigid rule-based or static-analysis tools in smart-contract security. A clear strength is the construction and public release of a large-scale, multi-platform annotated dataset, which can serve as a reusable benchmark and addresses a common data scarcity issue in the field. The vulnerability-specific prompting strategy also illustrates a practical way to adapt general LLMs to domain-specific detection tasks without full fine-tuning.
major comments (3)
- The Experimental Evaluation section (and abstract) reports average positive recall of 0.92 and negative recall of 0.85 but provides no information on the train/test split of the 31,165 instances, whether prompt engineering and template selection were performed exclusively on training data, or results on contracts from entirely held-out projects/platforms. This is load-bearing for the central effectiveness claim, because without explicit separation the metrics could reflect prompt overfitting or label leakage rather than generalization.
- The dataset construction description lacks any account of the professional annotation protocol, including inter-annotator agreement statistics, expert review process, or steps taken to mitigate labeling errors and bias. Because the recalls are computed against these labels, the absence of validation details directly affects the reliability of the headline numbers and the claim that the instances “accurately represent real-world vulnerabilities.”
- No baseline comparisons (e.g., to established static analyzers such as Slither or Mythril, or to prior LLM-based detectors) are presented alongside the internal recall figures. Without such comparisons it is impossible to determine whether the tailored-prompt approach advances beyond existing methods or simply reproduces known performance on the same data.
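The leakage concern in the first comment can be made concrete: a project-level split assigns every instance from a project to exactly one side, so prompt development never sees code from held-out projects. A minimal stdlib sketch, with hashing as one of several reasonable ways to assign projects deterministically (the function name and data layout are illustrative):

```python
import hashlib

def project_split(instances, test_fraction=0.2):
    """Split (project_id, instance) pairs so no project straddles the split.

    Hashing the project id to a bucket puts all of a project's instances
    on the same side, independent of input order.
    """
    train, test = [], []
    for project_id, instance in instances:
        bucket = int(hashlib.sha256(project_id.encode()).hexdigest(), 16) % 100
        (test if bucket < test_fraction * 100 else train).append(instance)
    return train, test

data = [("proj-a", 1), ("proj-a", 2), ("proj-b", 3)]
train, test = project_split(data)
# Instances 1 and 2 always land on the same side, because they share a project.
```

A random split over the 31,165 instances would not have this property: two near-identical contracts from the same project could land on opposite sides, inflating the reported recalls.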
minor comments (2)
- The abstract and introduction would benefit from a brief statement of the exact 13 vulnerability categories and the precise definitions of positive/negative recall used in the averages.
- The paper should include a limitations paragraph discussing potential LLM-specific issues such as hallucination rates on unseen contract patterns and inference cost at scale.
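The inference-cost point is easy to quantify with a back-of-envelope estimate. The token count and per-token price below are hypothetical assumptions, not figures from the paper:

```python
def scan_cost_usd(n_instances, tokens_per_instance, usd_per_million_tokens):
    # Rough upper bound: one LLM call per instance, prompt + completion
    # tokens lumped together at a single blended rate.
    return n_instances * tokens_per_instance * usd_per_million_tokens / 1_000_000

# 31,165 instances, an assumed ~2,000 tokens each, at an assumed $5/M tokens:
cost = scan_cost_usd(31_165, 2_000, 5.0)
# ≈ $311.65 for one full-dataset pass per vulnerability category
```

Since the framework runs 13 category-specific detectors, a full sweep multiplies this accordingly, which is exactly why a limitations paragraph on inference cost matters.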
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below and outline the revisions we will make to the manuscript.
Point-by-point responses
-
Referee: The Experimental Evaluation section (and abstract) reports average positive recall of 0.92 and negative recall of 0.85 but provides no information on the train/test split of the 31,165 instances, whether prompt engineering and template selection were performed exclusively on training data, or results on contracts from entirely held-out projects/platforms. This is load-bearing for the central effectiveness claim, because without explicit separation the metrics could reflect prompt overfitting or label leakage rather than generalization.
Authors: We agree that the absence of these details is a material shortcoming that weakens the central claim. The current manuscript does not describe the split or confirm that prompt engineering was restricted to training data. We will revise the Experimental Evaluation section to document the partitioning procedure (including any project- or platform-level separation), state that all prompt design occurred on training data only, and add results on contracts drawn from entirely held-out projects and platforms. revision: yes
-
Referee: The dataset construction description lacks any account of the professional annotation protocol, including inter-annotator agreement statistics, expert review process, or steps taken to mitigate labeling errors and bias. Because the recalls are computed against these labels, the absence of validation details directly affects the reliability of the headline numbers and the claim that the instances “accurately represent real-world vulnerabilities.”
Authors: We concur that a full account of the annotation protocol is required to support the reliability of the reported metrics. The manuscript currently provides only a high-level statement that the instances are “professionally annotated.” We will add a dedicated subsection describing the annotation guidelines, the number and qualifications of annotators, inter-annotator agreement statistics, the multi-stage expert review process, and the specific measures used to reduce labeling errors and bias. revision: yes
-
Referee: No baseline comparisons (e.g., to established static analyzers such as Slither or Mythril, or to prior LLM-based detectors) are presented alongside the internal recall figures. Without such comparisons it is impossible to determine whether the tailored-prompt approach advances beyond existing methods or simply reproduces known performance on the same data.
Authors: We agree that direct comparisons to established tools are necessary to situate the contribution. The manuscript presents only the internal recall figures of the proposed framework. We will revise the Experimental Evaluation section to include side-by-side results against Slither, Mythril, and representative prior LLM-based detectors on the same 31,165-instance dataset and the same 13 vulnerability categories. revision: yes
Circularity Check
No circularity; empirical results on annotated dataset with no self-referential derivations
Full rationale
The paper presents an empirical framework: it constructs a dataset of 31,165 annotated instances from real-world projects, applies AST-based context extraction, designs vulnerability-specific prompts for 13 categories, and reports experimental recalls (0.92 positive, 0.85 negative). No equations, uniqueness theorems, fitted parameters renamed as predictions, or self-citations that bear the central load are present in the provided text. The claims rest on dataset construction and prompt engineering evaluated via standard metrics rather than reducing by definition or construction to the inputs themselves. This is a standard applied ML paper structure with no load-bearing circular steps.
Axiom & Free-Parameter Ledger
free parameters (1)
- Vulnerability-specific prompt templates
axioms (1)
- domain assumption: The professionally annotated dataset of 31,165 instances accurately captures real-world smart contract vulnerabilities without significant errors or selection bias.
Reference graph
Works this paper leans on
-
[1]
An empirical analysis of smart contracts: platforms, applications, and design patterns,
M. Bartoletti and L. Pompianu, “An empirical analysis of smart contracts: platforms, applications, and design patterns,” in Financial Cryptography and Data Security: FC 2017 International Workshops, WAHC, BITCOIN, VOTING, WTSC, and TA, Sliema, Malta, April 7, 2017, Revised Selected Papers 21, pp. 494–509, Springer, 2017
2017
-
[2]
A systematic literature review of blockchain and smart contract development: Techniques, tools, and open challenges,
A. Vacca, A. Di Sorbo, C. A. Visaggio, and G. Canfora, “A systematic literature review of blockchain and smart contract development: Techniques, tools, and open challenges,” Journal of Systems and Software, vol. 174, p. 110891, 2021
2021
-
[3]
Smart contracts vulnerabilities: a call for blockchain software engineering?,
G. Destefanis, M. Marchesi, M. Ortu, R. Tonelli, A. Bracciali, and R. Hierons, “Smart contracts vulnerabilities: a call for blockchain software engineering?,” in 2018 International Workshop on Blockchain Oriented Software Engineering (IWBOSE), pp. 19–25, IEEE, 2018
2018
-
[4]
Sok: Decentralized finance (defi) attacks,
L. Zhou, X. Xiong, J. Ernstberger, S. Chaliasos, Z. Wang, Y. Wang, K. Qin, R. Wattenhofer, D. Song, and A. Gervais, “Sok: Decentralized finance (defi) attacks,” in 2023 IEEE Symposium on Security and Privacy (SP), pp. 2444–2461, IEEE, 2023
2023
-
[5]
Blockchain smart contracts formalization: Approaches and challenges to address vulnerabilities,
A. Singh, R. M. Parizi, Q. Zhang, K.-K. R. Choo, and A. Dehghantanha, “Blockchain smart contracts formalization: Approaches and challenges to address vulnerabilities,” Computers & Security, vol. 88, p. 101654, 2020
2020
-
[6]
Demystifying exploitable bugs in smart contracts,
Z. Zhang, B. Zhang, W. Xu, and Z. Lin, “Demystifying exploitable bugs in smart contracts,” in 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE), pp. 615–627, IEEE, 2023
2023
-
[7]
Gpt-4 technical report
J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat, et al., “Gpt-4 technical report,” arXiv preprint arXiv:2303.08774, 2023
2023
-
[8]
A Survey on Large Language Models for Code Generation
J. Jiang, F. Wang, J. Shen, S. Kim, and S. Kim, “A survey on large language models for code generation,” arXiv preprint arXiv:2406.00515, 2024
2024
-
[9]
Language models are unsupervised multitask learners,
A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever, et al., “Language models are unsupervised multitask learners,” OpenAI blog, vol. 1, no. 8, p. 9, 2019
2019
-
[10]
Language models are few-shot learners,
T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al., “Language models are few-shot learners,” Advances in neural information processing systems, vol. 33, pp. 1877–1901, 2020
2020
-
[11]
Chain-of-thought prompting elicits reasoning in large language models,
J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V. Le, D. Zhou, et al., “Chain-of-thought prompting elicits reasoning in large language models,” Advances in neural information processing systems, vol. 35, pp. 24824–24837, 2022
2022
-
[12]
An overview of smart contract: architecture, applications, and future trends,
S. Wang, Y. Yuan, X. Wang, J. Li, R. Qin, and F.-Y. Wang, “An overview of smart contract: architecture, applications, and future trends,” in 2018 IEEE intelligent vehicles symposium (IV), pp. 108–113, IEEE, 2018
2018
-
[13]
Smart contract vulnerability detection technique: A survey,
P. Qian, Z. Liu, Q. He, B. Huang, D. Tian, and X. Wang, “Smart contract vulnerability detection technique: A survey,” arXiv preprint arXiv:2209.05872, 2022
2022
-
[14]
A semantic framework for the security analysis of ethereum smart contracts,
I. Grishchenko, M. Maffei, and C. Schneidewind, “A semantic framework for the security analysis of ethereum smart contracts,” in International conference on principles of security and trust, pp. 243–269, Springer, 2018
2018
-
[15]
Kevm: A complete formal semantics of the ethereum virtual machine,
E. Hildenbrandt, M. Saxena, N. Rodrigues, X. Zhu, P. Daian, D. Guth, B. Moore, D. Park, Y. Zhang, A. Stefanescu, et al., “Kevm: A complete formal semantics of the ethereum virtual machine,” in 2018 IEEE 31st Computer Security Foundations Symposium (CSF), pp. 204–217, IEEE, 2018
2018
-
[16]
Making smart contracts smarter,
L. Luu, D.-H. Chu, H. Olickel, P. Saxena, and A. Hobor, “Making smart contracts smarter,” in Proceedings of the 2016 ACM SIGSAC conference on computer and communications security, pp. 254–269, 2016
2016
-
[17]
A framework for bug hunting on the ethereum blockchain,
B. Mueller, “A framework for bug hunting on the ethereum blockchain,” ConsenSys/mythril, 2017
2017
-
[18]
Contractfuzzer: Fuzzing smart contracts for vulnerability detection,
B. Jiang, Y. Liu, and W. K. Chan, “Contractfuzzer: Fuzzing smart contracts for vulnerability detection,” in Proceedings of the 33rd ACM/IEEE international conference on automated software engineering, pp. 259–269, 2018
2018
-
[19]
Reguard: finding reentrancy bugs in smart contracts,
C. Liu, H. Liu, Z. Cao, Z. Chen, B. Chen, and B. Roscoe, “Reguard: finding reentrancy bugs in smart contracts,” in Proceedings of the 40th international conference on software engineering: companion proceedings, pp. 65–68, 2018
2018
-
[20]
Slither: a static analysis framework for smart contracts,
J. Feist, G. Grieco, and A. Groce, “Slither: a static analysis framework for smart contracts,” in 2019 IEEE/ACM 2nd International Workshop on Emerging Trends in Software Engineering for Blockchain (WETSEB), pp. 8–15, IEEE, 2019
2019
-
[21]
Vandal: A scalable security analysis framework for smart contracts,
L. Brent, A. Jurisevic, M. Kong, E. Liu, F. Gauthier, V. Gramoli, R. Holz, and B. Scholz, “Vandal: A scalable security analysis framework for smart contracts,” arXiv preprint arXiv:1809.03981, 2018
2018
-
[22]
Towards safer smart contracts: A sequence learning approach to detecting security threats,
W. J.-W. Tann, X. J. Han, S. S. Gupta, and Y.-S. Ong, “Towards safer smart contracts: A sequence learning approach to detecting security threats,” arXiv preprint arXiv:1811.06632, 2018
2018
-
[23]
Smart contract vulnerability detection using graph neural networks,
Y. Zhuang, Z. Liu, P. Qian, Q. Liu, X. Wang, and Q. He, “Smart contract vulnerability detection using graph neural networks,” in Proceedings of the twenty-ninth international conference on international joint conferences on artificial intelligence, pp. 3283–3290, 2021
2021
-
[24]
Gptscan: Detecting logic vulnerabilities in smart contracts by combining gpt with program analysis,
Y. Sun, D. Wu, Y. Xue, H. Liu, H. Wang, Z. Xu, X. Xie, and Y. Liu, “Gptscan: Detecting logic vulnerabilities in smart contracts by combining gpt with program analysis,” in Proceedings of the IEEE/ACM 46th International Conference on Software Engineering, pp. 1–13, 2024
2024
-
[25]
Evaluating Large Language Models Trained on Code
M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. d. O. Pinto, J. Kaplan, H. Edwards, Y. Burda, N. Joseph, G. Brockman, et al., “Evaluating large language models trained on code,” arXiv preprint arXiv:2107.03374, 2021
2021
-
[26]
Chatgpt for good? on opportunities and challenges of large language models for education,
E. Kasneci, K. Seßler, S. Küchemann, M. Bannert, D. Dementieva, F. Fischer, U. Gasser, G. Groh, S. Günnemann, E. Hüllermeier, et al., “Chatgpt for good? on opportunities and challenges of large language models for education,” Learning and individual differences, vol. 103, p. 102274, 2023
2023
-
[27]
A Survey of Large Language Models
W. X. Zhao, K. Zhou, J. Li, T. Tang, X. Wang, Y. Hou, Y. Min, B. Zhang, J. Zhang, Z. Dong, et al., “A survey of large language models,” arXiv preprint arXiv:2303.18223, 2023
2023
-
[28]
A Systematic Survey of Prompt Engineering in Large Language Models: Techniques and Applications
P. Sahoo, A. K. Singh, S. Saha, V. Jain, S. Mondal, and A. Chadha, “A systematic survey of prompt engineering in large language models: Techniques and applications,” arXiv preprint arXiv:2402.07927, 2024
2024
-
[29]
Smartguard: An llm-enhanced framework for smart contract vulnerability detection,
H. Ding, Y. Liu, X. Piao, H. Song, and Z. Ji, “Smartguard: An llm-enhanced framework for smart contract vulnerability detection,” Expert Systems with Applications, vol. 269, p. 126479, 2025
2025
-
[30]
Smartllmsentry: A comprehensive llm based smart contract vulnerability detection framework,
O. Zaazaa and H. El Bakkali, “Smartllmsentry: A comprehensive llm based smart contract vulnerability detection framework,” Journal of Metaverse, vol. 4, no. 2, pp. 126–137, 2024
2024
-
[31]
Vulnhunt-gpt: a smart contract vulnerabilities detector based on openai chatgpt,
B. Boi, C. Esposito, and S. Lee, “Vulnhunt-gpt: a smart contract vulnerabilities detector based on openai chatgpt,” in Proceedings of the 39th ACM/SIGAPP Symposium on Applied Computing, pp. 1517–1524, 2024
2024
-
[32]
Automated smart contract vulnerability detection using fine-tuned large language models,
Z. Yang, G. Man, and S. Yue, “Automated smart contract vulnerability detection using fine-tuned large language models,” in Proceedings of the 2023 6th International Conference on Blockchain Technology and Applications, pp. 19–23, 2023
2023
-
[33]
A context-driven approach for co-auditing smart contracts with the support of gpt-4 code interpreter,
M. S. Bouafif, C. Zheng, I. A. Qasse, E. Zulkoski, M. Hamdaqa, and F. Khomh, “A context-driven approach for co-auditing smart contracts with the support of gpt-4 code interpreter,” arXiv preprint arXiv:2406.18075, 2024
2024
-
[34]
Vulnerability detector
Anonymous, “Vulnerability detector.” Available at (URL removed for double-blind review), 2026. Accessed: 2026-01-13
2026
-
[35]
python-solidity-parser
ConsenSys Diligence, “python-solidity-parser.” Available at github.com/ConsenSysDiligence/python-solidity-parser