pith. machine review for the scientific record.

arxiv: 2605.11163 · v1 · submitted 2026-05-11 · 💻 cs.CR · cs.AI

Recognition: 2 theorem links · Lean Theorem

Benchmarking LLM-Based Static Analysis for Secure Smart Contract Development: Reliability, Limitations, and Potential Hybrid Solutions

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 02:14 UTC · model grok-4.3

classification 💻 cs.CR cs.AI
keywords smart contracts · large language models · static analysis · vulnerability detection · false positives · lexical bias · blockchain security · hybrid solutions

The pith

Large language models cannot reliably audit smart contracts on their own because they depend on variable names rather than code semantics.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper tests whether large language models can replace traditional static analysis tools for finding security flaws in smart contracts. It shows that the models suffer from lexical bias, basing many decisions on how identifiers are named instead of on actual code behavior, which produces frequent false positives. Prompting methods create a clear trade-off between catching real issues and avoiding incorrect ones. The evaluation uses a custom automated system that matches human classification of outputs 92 percent of the time. The results matter because blockchain transactions are irreversible, so unreliable detection can allow costly exploits to reach deployment.

Core claim

Large language models are not viable as autonomous security auditors for smart contracts. Their efficacy is limited by inherent lexical bias and insufficient validation of external data, leading to high rates of false positives through reliance on non-semantic heuristics such as identifier naming. Prompting techniques exhibit a precision-recall trade-off. These findings rest on a custom automated framework that classifies model outputs with 92 percent accuracy.

What carries the argument

Custom automated framework that classifies LLM outputs on smart contract vulnerabilities at 92 percent accuracy, used to benchmark models and prompting strategies against test contracts.
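The paper does not publish the framework's internals, but the matching step it must perform can be sketched as follows. The (contract, vulnerability category) schema and the sample findings here are our assumptions for illustration, not the authors' implementation.

```python
# Sketch of the kind of matching an automated output classifier performs:
# label each reported finding as TP/FP against ground truth, collect FNs,
# and measure agreement with human labels (the paper reports 92%).
def classify_outputs(reported, ground_truth):
    """Label each (contract, category) finding as TP or FP; return FNs too."""
    reported, ground_truth = set(reported), set(ground_truth)
    labels = {f: ("TP" if f in ground_truth else "FP") for f in reported}
    false_negatives = ground_truth - reported
    return labels, false_negatives

def framework_accuracy(auto_labels, human_labels):
    """Agreement rate between automated and human classifications."""
    agree = sum(auto_labels[f] == human_labels[f] for f in auto_labels)
    return agree / len(auto_labels)

# Illustrative data only.
reported = [("Vault.sol", "reentrancy"), ("Vault.sol", "tx-origin")]
truth = [("Vault.sol", "reentrancy"), ("Token.sol", "overflow")]
labels, missed = classify_outputs(reported, truth)

assert labels[("Vault.sol", "reentrancy")] == "TP"
assert labels[("Vault.sol", "tx-origin")] == "FP"
assert missed == {("Token.sol", "overflow")}
```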

If this is right

  • LLMs function best as complements to traditional static analysis tools rather than standalone auditors for smart contracts.
  • Reliance on identifier naming as a heuristic generates unreliable results in vulnerability detection.
  • Prompt engineering can shift the balance between precision and recall but does not remove the underlying lexical bias.
  • Hybrid solutions that pair LLM suggestions with semantic checks offer a route to improved security analysis.
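The precision-recall trade-off in the third bullet is easy to make concrete. The counts below are invented to show the mechanism, not figures from the paper: a prompt that casts a wider net converts false negatives into true positives at the cost of new false positives.

```python
# Illustrative only: invented detection counts for two prompting styles.
def precision_recall(tp: int, fp: int, fn: int):
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# A cautious prompt flags less, so precision is high but recall suffers;
# an aggressive prompt flags more, trading precision for recall.
conservative = precision_recall(tp=6, fp=2, fn=6)    # (0.75, 0.5)
aggressive = precision_recall(tp=10, fp=10, fn=2)    # (0.5, ~0.83)

assert conservative[0] > aggressive[0]   # precision drops
assert conservative[1] < aggressive[1]   # recall rises
```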

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Future model training on datasets that prioritize code semantics over surface lexical features could reduce false positives in security tasks.
  • The same classification framework could be reused to measure LLM performance on vulnerability detection in other programming languages or domains.
  • Post-processing LLM outputs with execution simulation or formal verification steps might compensate for the observed limitations.
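The third extension above could take the shape of a confirmation filter: keep an LLM suggestion only when an independent semantic check (a Slither-style detector, execution simulation, or a formal-verification pass) reports the same issue. This is a hypothetical design, not something the paper implements; the data is illustrative.

```python
# Hypothetical hybrid post-filter: an LLM finding survives only if a
# semantic analyzer independently reports the same (contract, category).
def hybrid_filter(llm_findings, semantic_findings):
    """Intersect LLM suggestions with semantically confirmed detections."""
    confirmed = set(semantic_findings)
    return [f for f in llm_findings if f in confirmed]

# Illustrative data: the LLM's lexically driven tx-origin report is
# dropped because no semantic tool corroborates it.
llm = [("Vault.sol", "reentrancy"), ("Vault.sol", "tx-origin")]
slither_like = [("Vault.sol", "reentrancy")]

assert hybrid_filter(llm, slither_like) == [("Vault.sol", "reentrancy")]
```

The obvious cost of intersection is recall: anything only the LLM sees is discarded, so a real system would more likely down-rank unconfirmed findings than delete them.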

Load-bearing premise

The custom automated framework classifies LLM outputs correctly at 92 percent accuracy, and the tested prompts and contracts are representative of real-world smart contract security analysis.

What would settle it

A manual review of LLM vulnerability reports on a fresh collection of deployed smart contracts that produces false-positive rates substantially different from those the framework reports.
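That settling experiment reduces to an agreement measurement: relabel a fresh sample by hand and compare against the framework's labels. Raw accuracy is the paper's reported metric; Cohen's kappa (our addition, not used in the paper) corrects for chance agreement, which matters when TP/FP labels are imbalanced. The labels below are invented.

```python
from collections import Counter

def accuracy(auto, human):
    """Fraction of items where automated and human labels agree."""
    return sum(a == h for a, h in zip(auto, human)) / len(auto)

def cohens_kappa(auto, human):
    """Chance-corrected agreement between two label sequences."""
    n = len(auto)
    p_o = accuracy(auto, human)                      # observed agreement
    ca, ch = Counter(auto), Counter(human)
    p_e = sum(ca[k] * ch[k] for k in set(ca) | set(ch)) / (n * n)
    return (p_o - p_e) / (1 - p_e) if p_e != 1 else 1.0

# Illustrative validation sample: one disagreement out of five.
auto = ["TP", "FP", "FP", "TP", "FP"]
human = ["TP", "FP", "TP", "TP", "FP"]

assert accuracy(auto, human) == 0.8
assert cohens_kappa(auto, human) < accuracy(auto, human)  # chance-corrected
```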

Figures

Figures reproduced from arXiv: 2605.11163 by Andrei Arusoaie, Dorel Lucanu, Stefan-Claudiu Susan.

Figure 1: Balanced accuracy across all experiments.
Figure 2: Slither bias: comparison of matched detections across experiments.
Figure 3: Methodology flowchart for the automated benchmarking framework.
read the original abstract

The irreversible nature of blockchain transactions makes the identification of smart contract vulnerabilities an essential requirement for secure system development. While Large Language Models (LLMs) are increasingly integrated into developer workflows, their reliability as autonomous security auditors remains unproven. We assess whether current generative models are a viable replacement for, or only a complement to, traditional static-analysis tools. Our findings indicate that LLM efficacy is undermined by both inherent lexical bias and a lack of rigorous validation of external data inputs. This reliance on non-semantic heuristics, such as identifier naming, leads to a high frequency of false positives. Furthermore, prompting techniques reveal a trade-off between precision and recall. These results were derived using our custom automated framework, which achieves 92% accuracy in classifying model outputs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper benchmarks LLMs for static analysis of smart contracts. It claims that their efficacy is undermined by lexical bias (reliance on non-semantic heuristics such as identifier naming) producing high false-positive rates, that prompting techniques exhibit precision-recall trade-offs, and that a custom automated framework classifies LLM outputs at 92% accuracy relative to ground truth, supporting recommendations for hybrid LLM-traditional tool solutions.

Significance. If the custom framework's 92% accuracy holds under disclosed validation and the tested contracts/prompts are representative, the results would highlight practically important limitations of LLMs as autonomous security auditors in blockchain contexts, providing empirical grounding for hybrid approaches and cautioning against over-reliance on generative models for vulnerability detection.

major comments (2)
  1. [Abstract] The central claim that the custom automated framework 'achieves 92% accuracy in classifying model outputs' is load-bearing for all quantitative results (false-positive frequency, lexical bias, precision-recall trade-offs), yet the manuscript supplies no details on validation methodology, ground-truth construction, validation-set size, expert annotation process, or inter-rater agreement. Without these, the reported findings cannot be assessed for circularity or error correlation with the lexical patterns under study.
  2. [Methods/Results] No information is given on dataset size, number of contracts, specific LLMs evaluated, precise vulnerability definitions or taxonomies used, or direct baseline comparisons against established static-analysis tools, leaving the generalizability of the 'high frequency of false positives' and 'lexical bias' conclusions unsupported.
minor comments (1)
  1. [Abstract] The phrase 'lack of rigorous validation of external data inputs' is vague; clarify whether it refers to LLM training data, prompt inputs, or contract source code.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We address each major comment point by point below, indicating where revisions will be made to strengthen the manuscript.

read point-by-point responses
  1. Referee: [Abstract] The central claim that the custom automated framework 'achieves 92% accuracy in classifying model outputs' is load-bearing for all quantitative results (false-positive frequency, lexical bias, precision-recall trade-offs), yet the manuscript supplies no details on validation methodology, ground-truth construction, validation-set size, expert annotation process, or inter-rater agreement. Without these, the reported findings cannot be assessed for circularity or error correlation with the lexical patterns under study.

    Authors: We agree that the manuscript currently lacks sufficient methodological details on the validation of the custom automated framework, which is necessary to fully substantiate the 92% accuracy figure and allow assessment of potential circularity or bias. In the revised version, we will add a dedicated subsection to the Methods section that describes the validation methodology in full, including ground-truth construction, validation-set size, the expert annotation process, and inter-rater agreement statistics. We will also explicitly discuss any limitations related to error correlation with the lexical patterns studied. revision: yes

  2. Referee: [Methods/Results] No information is given on dataset size, number of contracts, specific LLMs evaluated, precise vulnerability definitions or taxonomies used, or direct baseline comparisons against established static-analysis tools, leaving the generalizability of the 'high frequency of false positives' and 'lexical bias' conclusions unsupported.

    Authors: We acknowledge that the current manuscript does not provide these key details, which limits the ability to evaluate generalizability. The revised manuscript will expand the Methods and Results sections to include the dataset size and number of contracts analyzed, the specific LLMs evaluated, the precise vulnerability definitions and taxonomies employed, and direct baseline comparisons against established static-analysis tools such as Slither and Mythril. These additions will provide stronger empirical support for the reported findings on false-positive rates and lexical bias. revision: yes

Circularity Check

0 steps flagged

No significant circularity in empirical benchmarking study

full rationale

The paper is an empirical benchmarking study of LLMs for smart contract vulnerability detection. It reports results derived from a custom automated classification framework stated to achieve 92% accuracy, but contains no equations, mathematical derivations, fitted parameters, or self-referential definitions that reduce any claim to its own inputs by construction. The central findings on lexical bias and false positives are presented as outcomes of applying the framework to LLM outputs on external contracts, without any quoted reduction showing the framework's classifications are forced by the same heuristics under critique or by self-citation chains. This is a standard data-driven evaluation whose validity hinges on the (undetailed) framework rather than circular logic.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim depends on the unverified accuracy of the custom classification framework and the assumption that lexical bias is the dominant failure mode across representative smart contract code.

axioms (1)
  • ad hoc to paper — The custom automated framework classifies LLM outputs with 92% accuracy relative to ground truth.
    Stated directly in the abstract as the basis for all reported findings on model performance.

pith-pipeline@v0.9.0 · 5434 in / 1194 out tokens · 74802 ms · 2026-05-13T02:14:07.283081+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches — The paper's claim is directly supported by a theorem in the formal canon.
  • supports — The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends — The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses — The paper appears to rely on the theorem as machinery.
  • contradicts — The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear — Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

50 extracted references · 50 canonical work pages · 8 internal anchors

  1. [1] Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F.L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., et al.: GPT-4 technical report. arXiv preprint arXiv:2303.08774 (2023)
  2. [2] Arusoaie, A., Susan, S.: Towards trusted smart contracts: A comprehensive test suite for vulnerability detection. Empir. Softw. Eng. 29(5), 117 (2024). https://doi.org/10.1007/S10664-024-10509-W
  3. [3] Atzei, N., Bartoletti, M., Cimoli, T.: A survey of attacks on Ethereum smart contracts (SoK). In: Maffei, M., Ryan, M. (eds.) Principles of Security and Trust. pp. 164–186. Springer, Berlin, Heidelberg (2017)
  4. [4] Breidenbach, L., Daian, P., Juels, A., Sirer, E.G.: An in-depth look at the parity multisig bug. https://hackingdistributed.com/2017/07/22/deep-dive-parity-bug/ (July 2017)
  5. [5] Buterin, V.: A next-generation smart contract and decentralized application platform. https://ethereum.org/en/whitepaper/ (Dec 2014)
  6. [6] Chandrasekaran, D., Mago, V.: Evolution of semantic similarity—a survey. ACM Computing Surveys (CSUR) 54(2), 1–37 (2021)
  7. [7] Chen, B., Zhang, Z., Langrené, N., Zhu, S.: Unleashing the potential of prompt engineering in large language models: a comprehensive review (2024). https://arxiv.org/abs/2310.14735
  8. [8] Chen, C., Su, J., Chen, J., Wang, Y., Bi, T., Yu, J., Wang, Y., Lin, X., Chen, T., Zheng, Z.: When ChatGPT meets smart contract vulnerability detection: How far are we? ACM Transactions on Software Engineering and Methodology (2023)
  9. [9] Feist, J., Grieco, G., Groce, A.: Slither: A static analysis framework for smart contracts. In: Proceedings of the 2nd International Workshop on Emerging Trends in Software Engineering for Blockchain (WETSEB '19), pp. 8–15. IEEE Press, Montreal, Quebec, Canada (2019). https://doi.org/10.1109/WETSEB.2019.00008
  10. [10] Ferreira, J., Durieux, T., Maranhao, R.: SmartBugs Wild. https://github.com/smartbugs/smartbugs-wild (2020)
  11. [11] Ferreira, J., Salzer, G.: SmartBugs Curated. https://github.com/smartbugs/smartbugs-curated (2023)
  12. [12] Ferreira, J.F., Cruz, P., Durieux, T., Abreu, R.: SmartBugs: A framework to analyze Solidity smart contracts. In: Proceedings of the 35th IEEE/ACM International Conference on Automated Software Engineering. pp. 1349–1352. ASE '20, ACM, New York, NY, USA (2020). https://doi.org/10.1145/3324884.3415298
  13. [13] Gemini Team: Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context (2024). https://arxiv.org/abs/2403.05530
  14. [14] Ghaleb, A., Pattabiraman, K.: SolidiFI benchmark. https://github.com/DependableSystemsLab/SolidiFI-benchmark (2020)
  15. [15] Grishchenko, I., Maffei, M., Schneidewind, C.: A semantic framework for the security analysis of Ethereum smart contracts. In: Bauer, L., Küsters, R. (eds.) Principles of Security and Trust. pp. 243–269. Springer International Publishing, Cham (2018)
  16. [16] Mense, A., Flatscher, M.: Security vulnerabilities in Ethereum smart contracts. In: Proceedings of the 20th International Conference on Information Integration and Web-Based Applications and Services. pp. 375–380. iiWAS2018, ACM, New York, NY, USA (2018). https://doi.org/10.1145/3282373.3282419
  17. [17] Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word representations in vector space (2013). https://arxiv.org/abs/1301.3781
  18. [18] NCC Group: Decentralized application security project. https://dasp.co/ (2018), accessed 2023-04-04
  19. [19] Pennington, J., Socher, R., Manning, C.D.: GloVe: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). pp. 1532–1543 (2014)
  20. [20] Pinna, A., Ibba, S., Baralla, G., Tonelli, R., Marchesi, M.: A massive analysis of Ethereum smart contracts: empirical study and code metrics. IEEE Access 7, 78194–78213 (2019). https://doi.org/10.1109/ACCESS.2019.2921936
  21. [21] Rameder, H., di Angelo, M., Salzer, G.: Review of automated vulnerability analysis of smart contracts on Ethereum. Frontiers in Blockchain 5 (2022). https://doi.org/10.3389/fbloc.2022.814977
  22. [22] Sharma, N., Sharma, S.: A survey of Mythril, a smart contract security analysis tool for EVM bytecode. International Journal of Advanced Research in Computer Science 13, 51003–51010 (2022)
  23. [23] Shiraishi, S., Mohan, V., Marimuthu, H.: Test suites for benchmarks of static analysis tools. In: 2015 IEEE International Symposium on Software Reliability Engineering Workshops (ISSREW). pp. 12–15 (2015). https://doi.org/10.1109/ISSREW.2015.7392027
  24. [24] Siegel, D.: Understanding the DAO attack. https://www.coindesk.com/learn/2016/06/25/understanding-the-dao-attack/ (July 2016)
  25. [25] Solidity documentation. https://docs.soliditylang.org/en/v0.8.16/, accessed 2022-09-01
  26. [26] SunWeb3Sec: DeFiHackLabs. https://github.com/SunWeb3Sec/DeFiHackLabs/ (2023)
  27. [27] Smart contract weakness classification and test cases. https://swcregistry.io/, accessed 2023-04-04
  28. [28] Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., et al.: Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288 (2023)
  29. [29] White, J., Fu, Q., Hays, S., Sandborn, M., Olea, C., Gilbert, H., Elnashar, A., Spencer-Smith, J., Schmidt, D.C.: A prompt pattern catalog to enhance prompt engineering with ChatGPT. arXiv preprint arXiv:2302.11382 (2023)
  30. [30] Xiao, Z., Wang, Q., Pearce, H., Chen, S.: Logic meets magic: LLMs cracking smart contract vulnerabilities. arXiv preprint arXiv:2501.07058 (2025)
  31. [31] Xu, Z., Ren, M.: Smart-Contract-Benchmark-Suites: A unified dataset. https://github.com/renardbebe/Smart-Contract-Benchmark-Suites (2021)
  32. [32] Young, A., et al.: Yi: Open foundation models by 01.AI (2025). https://arxiv.org/abs/2403.04652
  33. [33] Zhang, L., Ergen, T., Logeswaran, L., Lee, M., Jurgens, D.: SPRIG: Improving large language model performance by system prompt optimization (2024). https://arxiv.org/abs/2410.14826
  34. [34] Zhang, Z., Zhang, B., Xu, W., Lin, Z.: Web3Bugs. https://github.com/ZhangZhuoSJTU/Web3Bugs/tree/main (2022)
  35. [35] Liu, N.F., Lin, K., Hewitt, J., Paranjape, A., Bevilacqua, M., Petroni, F., Liang, P.: Lost in the middle: How language models use long contexts. arXiv preprint arXiv:2307.03172 (2023)
  36. [36] Shrivastava, D., Kocetkov, D., de Vries, H., Bahdanau, D., Scholak, T.: RepoFusion: Training code models to understand your repository. arXiv preprint arXiv:2306.10998 (2023)
  37. [37] Shojaee, P., Mirzadeh, I., Alizadeh, K., Horton, M., Bengio, S., Farajtabar, M.: The illusion of thinking: Understanding the strengths and limitations of reasoning models via the lens of problem complexity. In: Proc. 38th Conf. Neural Inf. Process. Syst. (NeurIPS) (2025)
  38. [38] Kalai, A.T., Nachum, O., Vempala, S.S., Zhang, E.: Why language models hallucinate. arXiv preprint arXiv:2509.04664 (2025)
  39. [39] Bender, E.M., Gebru, T., McMillan-Major, A., Shmitchell, S.: On the dangers of stochastic parrots: Can language models be too big? In: Proc. 2021 ACM Conf. Fairness, Accountability, and Transparency (FAccT). pp. 610–623 (2021)
  40. [40] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., Polosukhin, I.: Attention is all you need. In: Proc. 30th Conf. Neural Inf. Process. Syst. (NIPS) (2017)
  41. [41] Dettmers, T., Lewis, M., Belkada, Y., Zettlemoyer, L.: LLM.int8(): 8-bit matrix multiplication for transformers at scale. Advances in Neural Information Processing Systems 35, 22128–22142 (2022)
  42. [42] Gholami, A., Kim, S., Dong, Z., Yao, Z., Mahoney, M.W., Keutzer, K.: A survey of quantization methods for efficient neural network inference. Low-Power Computer Vision, pp. 291–326 (2021)
  43. [43] Frantar, E., Ashkboos, S., Hoefler, T., Alistarh, D.: GPTQ: Accurate post-training quantization for generative pre-trained transformers. In: Proceedings of the 11th International Conference on Learning Representations (ICLR) (2023)
  44. [44] Dettmers, T., Pagnoni, A., Holtzman, A., Zettlemoyer, L.: QLoRA: Efficient finetuning of quantized LLMs. Advances in Neural Information Processing Systems 36 (2023)
  45. [45] Pope, R., Douglas, S., Chowdhery, A., Devane, C., Bradbury, J., Levskaya, A., Heek, J., Xiao, K., Agrawal, S., Dean, J.: Efficiently scaling transformer inference. In: Proceedings of the 6th MLSys Conference (2023)
  46. [46] Brown, T., et al.: Language models are few-shot learners. In: Proc. 34th Conf. Neural Inf. Process. Syst. (NeurIPS). pp. 1877–1901 (2020)
  47. [47] Wei, J., et al.: Chain-of-thought prompting elicits reasoning in large language models. In: Proc. 36th Conf. Neural Inf. Process. Syst. (NeurIPS). pp. 24824–24837 (2022)
  48. [48] Susan, S.: LLM-Smart-Contract-Analysis-Benchmark: Solidity Benchmark. https://doi.org/10.5281/zenodo.20109866
  49. [49] Invariant Mapping: Identify the critical security invariants of this contract (e.g., "Total deposits must always equal or exceed the sum of individual balances"). 2. Adversarial State Analysis: Systematically analyze every state-changing function. Determine if a sequence of transactions–potentially involving multiple users or flash-loan-funded interaction...
  50. [50] Baseline Triage: I have provided the Slither static analysis output below. Use this to identify immediate "hotspots" in the code. In your internal reasoning, evaluate if these detections are true positives or if the contract's specific business logic renders them non-exploitable. 2. Independent Invariant Mapping: Disregard the Slither output for a moment ...