pith. machine review for the scientific record.

arxiv: 2605.11163 · v1 · submitted 2026-05-11 · 💻 cs.CR · cs.AI

Recognition: 2 theorem links · Lean Theorem

Benchmarking LLM-Based Static Analysis for Secure Smart Contract Development: Reliability, Limitations, and Potential Hybrid Solutions

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 02:14 UTC · model grok-4.3

classification 💻 cs.CR cs.AI
keywords smart contracts · large language models · static analysis · vulnerability detection · false positives · lexical bias · blockchain security · hybrid solutions

The pith

Large language models cannot reliably audit smart contracts on their own because they depend on variable names rather than code semantics.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper tests whether large language models can replace traditional static analysis tools for finding security flaws in smart contracts. It shows that the models suffer from lexical bias, basing many decisions on how identifiers are named instead of on actual code behavior, which produces frequent false positives. Prompting methods create a clear trade-off between catching real issues and avoiding incorrect ones. The evaluation uses a custom automated system that matches human classification of outputs 92 percent of the time. The results matter because blockchain transactions are irreversible, so unreliable detection can allow costly exploits to reach deployment.

Core claim

Large language models are not viable as autonomous security auditors for smart contracts. Their efficacy is limited by inherent lexical bias and insufficient validation of external data, leading to high rates of false positives through reliance on non-semantic heuristics such as identifier naming. Prompting techniques exhibit a precision-recall trade-off. These findings rest on a custom automated framework that classifies model outputs with 92 percent accuracy.

What carries the argument

Custom automated framework that classifies LLM outputs on smart contract vulnerabilities at 92 percent accuracy, used to benchmark models and prompting strategies against test contracts.
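The paper does not publish the framework's internals, but the matching step it must perform can be sketched as follows. The (contract, vulnerability category) schema and the sample findings here are our assumptions for illustration, not the authors' implementation.

```python
# Sketch of the kind of matching an automated output classifier performs:
# label each reported finding as TP/FP against ground truth, collect FNs,
# and measure agreement with human labels (the paper reports 92%).
def classify_outputs(reported, ground_truth):
    """Label each (contract, category) finding as TP or FP; return FNs too."""
    reported, ground_truth = set(reported), set(ground_truth)
    labels = {f: ("TP" if f in ground_truth else "FP") for f in reported}
    false_negatives = ground_truth - reported
    return labels, false_negatives

def framework_accuracy(auto_labels, human_labels):
    """Agreement rate between automated and human classifications."""
    agree = sum(auto_labels[f] == human_labels[f] for f in auto_labels)
    return agree / len(auto_labels)

# Illustrative data only.
reported = [("Vault.sol", "reentrancy"), ("Vault.sol", "tx-origin")]
truth = [("Vault.sol", "reentrancy"), ("Token.sol", "overflow")]
labels, missed = classify_outputs(reported, truth)

assert labels[("Vault.sol", "reentrancy")] == "TP"
assert labels[("Vault.sol", "tx-origin")] == "FP"
assert missed == {("Token.sol", "overflow")}
```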

If this is right

  • LLMs function best as complements to traditional static analysis tools rather than standalone auditors for smart contracts.
  • Reliance on identifier naming as a heuristic generates unreliable results in vulnerability detection.
  • Prompt engineering can shift the balance between precision and recall but does not remove the underlying lexical bias.
  • Hybrid solutions that pair LLM suggestions with semantic checks offer a route to improved security analysis.
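The precision-recall trade-off in the third bullet is easy to make concrete. The counts below are invented to show the mechanism, not figures from the paper: a prompt that casts a wider net converts false negatives into true positives at the cost of new false positives.

```python
# Illustrative only: invented detection counts for two prompting styles.
def precision_recall(tp: int, fp: int, fn: int):
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# A cautious prompt flags less, so precision is high but recall suffers;
# an aggressive prompt flags more, trading precision for recall.
conservative = precision_recall(tp=6, fp=2, fn=6)    # (0.75, 0.5)
aggressive = precision_recall(tp=10, fp=10, fn=2)    # (0.5, ~0.83)

assert conservative[0] > aggressive[0]   # precision drops
assert conservative[1] < aggressive[1]   # recall rises
```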

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Future model training on datasets that prioritize code semantics over surface lexical features could reduce false positives in security tasks.
  • The same classification framework could be reused to measure LLM performance on vulnerability detection in other programming languages or domains.
  • Post-processing LLM outputs with execution simulation or formal verification steps might compensate for the observed limitations.
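The third extension above could take the shape of a confirmation filter: keep an LLM suggestion only when an independent semantic check (a Slither-style detector, execution simulation, or a formal-verification pass) reports the same issue. This is a hypothetical design, not something the paper implements; the data is illustrative.

```python
# Hypothetical hybrid post-filter: an LLM finding survives only if a
# semantic analyzer independently reports the same (contract, category).
def hybrid_filter(llm_findings, semantic_findings):
    """Intersect LLM suggestions with semantically confirmed detections."""
    confirmed = set(semantic_findings)
    return [f for f in llm_findings if f in confirmed]

# Illustrative data: the LLM's lexically driven tx-origin report is
# dropped because no semantic tool corroborates it.
llm = [("Vault.sol", "reentrancy"), ("Vault.sol", "tx-origin")]
slither_like = [("Vault.sol", "reentrancy")]

assert hybrid_filter(llm, slither_like) == [("Vault.sol", "reentrancy")]
```

The obvious cost of intersection is recall: anything only the LLM sees is discarded, so a real system would more likely down-rank unconfirmed findings than delete them.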

Load-bearing premise

The custom automated framework classifies LLM outputs correctly at 92 percent accuracy, and the tested prompts and contracts are representative of real-world smart contract security analysis.

What would settle it

A manual review of LLM vulnerability reports on a fresh collection of deployed smart contracts that produces false-positive rates substantially different from those the framework reports.
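That settling experiment reduces to an agreement measurement: relabel a fresh sample by hand and compare against the framework's labels. Raw accuracy is the paper's reported metric; Cohen's kappa (our addition, not used in the paper) corrects for chance agreement, which matters when TP/FP labels are imbalanced. The labels below are invented.

```python
from collections import Counter

def accuracy(auto, human):
    """Fraction of items where automated and human labels agree."""
    return sum(a == h for a, h in zip(auto, human)) / len(auto)

def cohens_kappa(auto, human):
    """Chance-corrected agreement between two label sequences."""
    n = len(auto)
    p_o = accuracy(auto, human)                      # observed agreement
    ca, ch = Counter(auto), Counter(human)
    p_e = sum(ca[k] * ch[k] for k in set(ca) | set(ch)) / (n * n)
    return (p_o - p_e) / (1 - p_e) if p_e != 1 else 1.0

# Illustrative validation sample: one disagreement out of five.
auto = ["TP", "FP", "FP", "TP", "FP"]
human = ["TP", "FP", "TP", "TP", "FP"]

assert accuracy(auto, human) == 0.8
assert cohens_kappa(auto, human) < accuracy(auto, human)  # chance-corrected
```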

Figures

Figures reproduced from arXiv: 2605.11163 by Andrei Arusoaie, Dorel Lucanu, Stefan-Claudiu Susan.

Figure 1: Balanced accuracy across all experiments.
Figure 2: Slither bias: comparison of matched detections across experiments.
Figure 3: Methodology flowchart for the automated benchmarking framework.
read the original abstract

The irreversible nature of blockchain transactions makes the identification of smart contract vulnerabilities an essential requirement for secure system development. While Large Language Models (LLMs) are increasingly integrated into developer workflows, their reliability as autonomous security auditors remains unproven. We assess whether current generative models are a viable replacement for, or only a complement to, traditional static-analysis tools. Our findings indicate that LLM efficacy is undermined by both inherent lexical bias and a lack of rigorous validation of external data inputs. This reliance on non-semantic heuristics, such as identifier naming, leads to a high frequency of false positives. Furthermore, prompting techniques reveal a trade-off between precision and recall. These results were derived using our custom automated framework, which achieves 92% accuracy in classifying model outputs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper benchmarks LLMs for static analysis of smart contracts. It claims that their efficacy is undermined by lexical bias (reliance on non-semantic heuristics such as identifier naming) producing high false-positive rates, that prompting techniques exhibit precision-recall trade-offs, and that a custom automated framework classifies LLM outputs at 92% accuracy relative to ground truth, supporting recommendations for hybrid LLM-traditional tool solutions.

Significance. If the custom framework's 92% accuracy holds under disclosed validation and the tested contracts/prompts are representative, the results would highlight practically important limitations of LLMs as autonomous security auditors in blockchain contexts, providing empirical grounding for hybrid approaches and cautioning against over-reliance on generative models for vulnerability detection.

major comments (2)
  1. [Abstract] The central claim that the custom automated framework 'achieves 92% accuracy in classifying model outputs' is load-bearing for all quantitative results (false-positive frequency, lexical bias, precision-recall trade-offs), yet the manuscript supplies no details on validation methodology, ground-truth construction, validation-set size, expert annotation process, or inter-rater agreement. Without these, the reported findings cannot be assessed for circularity or error correlation with the lexical patterns under study.
  2. [Methods/Results] No information is given on dataset size, number of contracts, specific LLMs evaluated, precise vulnerability definitions or taxonomies used, or direct baseline comparisons against established static-analysis tools, leaving the generalizability of the 'high frequency of false positives' and 'lexical bias' conclusions unsupported.
minor comments (1)
  1. [Abstract] The phrase 'lack of rigorous validation of external data inputs' is vague; clarify whether it refers to LLM training data, prompt inputs, or contract source code.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We address each major comment point by point below, indicating where revisions will be made to strengthen the manuscript.

read point-by-point responses
  1. Referee: [Abstract] The central claim that the custom automated framework 'achieves 92% accuracy in classifying model outputs' is load-bearing for all quantitative results (false-positive frequency, lexical bias, precision-recall trade-offs), yet the manuscript supplies no details on validation methodology, ground-truth construction, validation-set size, expert annotation process, or inter-rater agreement. Without these, the reported findings cannot be assessed for circularity or error correlation with the lexical patterns under study.

    Authors: We agree that the manuscript currently lacks sufficient methodological details on the validation of the custom automated framework, which is necessary to fully substantiate the 92% accuracy figure and allow assessment of potential circularity or bias. In the revised version, we will add a dedicated subsection to the Methods section that describes the validation methodology in full, including ground-truth construction, validation-set size, the expert annotation process, and inter-rater agreement statistics. We will also explicitly discuss any limitations related to error correlation with the lexical patterns studied. revision: yes

  2. Referee: [Methods/Results] No information is given on dataset size, number of contracts, specific LLMs evaluated, precise vulnerability definitions or taxonomies used, or direct baseline comparisons against established static-analysis tools, leaving the generalizability of the 'high frequency of false positives' and 'lexical bias' conclusions unsupported.

    Authors: We acknowledge that the current manuscript does not provide these key details, which limits the ability to evaluate generalizability. The revised manuscript will expand the Methods and Results sections to include the dataset size and number of contracts analyzed, the specific LLMs evaluated, the precise vulnerability definitions and taxonomies employed, and direct baseline comparisons against established static-analysis tools such as Slither and Mythril. These additions will provide stronger empirical support for the reported findings on false-positive rates and lexical bias. revision: yes

Circularity Check

0 steps flagged

No significant circularity in empirical benchmarking study

full rationale

The paper is an empirical benchmarking study of LLMs for smart contract vulnerability detection. It reports results derived from a custom automated classification framework stated to achieve 92% accuracy, but contains no equations, mathematical derivations, fitted parameters, or self-referential definitions that reduce any claim to its own inputs by construction. The central findings on lexical bias and false positives are presented as outcomes of applying the framework to LLM outputs on external contracts, without any quoted reduction showing the framework's classifications are forced by the same heuristics under critique or by self-citation chains. This is a standard data-driven evaluation whose validity hinges on the (undetailed) framework rather than circular logic.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim depends on the unverified accuracy of the custom classification framework and the assumption that lexical bias is the dominant failure mode across representative smart contract code.

axioms (1)
  • ad hoc to paper — The custom automated framework classifies LLM outputs with 92% accuracy relative to ground truth.
    Stated directly in the abstract as the basis for all reported findings on model performance.

pith-pipeline@v0.9.0 · 5434 in / 1194 out tokens · 74802 ms · 2026-05-13T02:14:07.283081+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches — The paper's claim is directly supported by a theorem in the formal canon.
  • supports — The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends — The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses — The paper appears to rely on the theorem as machinery.
  • contradicts — The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear — Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

50 extracted references · 50 canonical work pages · 8 internal anchors

  1. [1] Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F.L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., et al.: GPT-4 technical report. arXiv preprint arXiv:2303.08774 (2023)
  2. [2] Arusoaie, A., Susan, S.: Towards trusted smart contracts: A comprehensive test suite for vulnerability detection. Empir. Softw. Eng. 29(5), 117 (2024). https://doi.org/10.1007/S10664-024-10509-W
  3. [3] Atzei, N., Bartoletti, M., Cimoli, T.: A survey of attacks on Ethereum smart contracts (SoK). In: Maffei, M., Ryan, M. (eds.) Principles of Security and Trust. pp. 164–186. Springer, Berlin, Heidelberg (2017)
  4. [4] Breidenbach, L., Daian, P., Juels, A., Sirer, E.G.: An in-depth look at the parity multisig bug. https://hackingdistributed.com/2017/07/22/deep-dive-parity-bug/ (July 2017)
  5. [5] Buterin, V.: A next-generation smart contract and decentralized application platform. https://ethereum.org/en/whitepaper/ (Dec 2014)
  6. [6] Chandrasekaran, D., Mago, V.: Evolution of semantic similarity—a survey. ACM Computing Surveys (CSUR) 54(2), 1–37 (2021)
  7. [7] Chen, B., Zhang, Z., Langrené, N., Zhu, S.: Unleashing the potential of prompt engineering in large language models: a comprehensive review (2024). https://arxiv.org/abs/2310.14735
  8. [8] Chen, C., Su, J., Chen, J., Wang, Y., Bi, T., Yu, J., Wang, Y., Lin, X., Chen, T., Zheng, Z.: When ChatGPT meets smart contract vulnerability detection: How far are we? ACM Transactions on Software Engineering and Methodology (2023)
  9. [9] Feist, J., Grieco, G., Groce, A.: Slither: A static analysis framework for smart contracts. In: Proceedings of the 2nd International Workshop on Emerging Trends in Software Engineering for Blockchain (WETSEB '19), pp. 8–15. IEEE Press, Montreal, Quebec, Canada (2019). https://doi.org/10.1109/WETSEB.2019.00008
  10. [10] Ferreira, J., Durieux, T., Maranhao, R.: SmartBugs Wild. https://github.com/smartbugs/smartbugs-wild (2020)
  11. [11] Ferreira, J., Salzer, G.: SmartBugs Curated. https://github.com/smartbugs/smartbugs-curated (2023)
  12. [12] Ferreira, J.F., Cruz, P., Durieux, T., Abreu, R.: SmartBugs: A framework to analyze Solidity smart contracts. In: Proceedings of the 35th IEEE/ACM International Conference on Automated Software Engineering. pp. 1349–1352. ASE '20, ACM, New York, NY, USA (2020). https://doi.org/10.1145/3324884.3415298
  13. [13] Gemini Team: Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context (2024). https://arxiv.org/abs/2403.05530
  14. [14] Ghaleb, A., Pattabiraman, K.: SolidiFI benchmark. https://github.com/DependableSystemsLab/SolidiFI-benchmark (2020)
  15. [15] Grishchenko, I., Maffei, M., Schneidewind, C.: A semantic framework for the security analysis of Ethereum smart contracts. In: Bauer, L., Küsters, R. (eds.) Principles of Security and Trust. pp. 243–269. Springer International Publishing, Cham (2018)
  16. [16] Mense, A., Flatscher, M.: Security vulnerabilities in Ethereum smart contracts. In: Proceedings of the 20th International Conference on Information Integration and Web-Based Applications and Services. pp. 375–380. iiWAS2018, ACM, New York, NY, USA (2018). https://doi.org/10.1145/3282373.3282419
  17. [17] Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word representations in vector space (2013). https://arxiv.org/abs/1301.3781
  18. [18] NCC Group: Decentralized application security project. https://dasp.co/ (2018), accessed 2023-04-04
  19. [19] Pennington, J., Socher, R., Manning, C.D.: GloVe: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). pp. 1532–1543 (2014)
  20. [20] Pinna, A., Ibba, S., Baralla, G., Tonelli, R., Marchesi, M.: A massive analysis of Ethereum smart contracts: empirical study and code metrics. IEEE Access 7, 78194–78213 (2019). https://doi.org/10.1109/ACCESS.2019.2921936
  21. [21] Rameder, H., di Angelo, M., Salzer, G.: Review of automated vulnerability analysis of smart contracts on Ethereum. Frontiers in Blockchain 5 (2022). https://doi.org/10.3389/fbloc.2022.814977
  22. [22] Sharma, N., Sharma, S.: A survey of Mythril, a smart contract security analysis tool for EVM bytecode. International Journal of Advanced Research in Computer Science 13, 51003–51010 (2022)
  23. [23] Shiraishi, S., Mohan, V., Marimuthu, H.: Test suites for benchmarks of static analysis tools. In: 2015 IEEE International Symposium on Software Reliability Engineering Workshops (ISSREW). pp. 12–15 (2015). https://doi.org/10.1109/ISSREW.2015.7392027
  24. [24] Siegel, D.: Understanding the DAO attack. https://www.coindesk.com/learn/2016/06/25/understanding-the-dao-attack/ (July 2016)
  25. [25] Solidity documentation. https://docs.soliditylang.org/en/v0.8.16/, accessed 2022-09-01
  26. [26] SunWeb3Sec: DeFiHackLabs. https://github.com/SunWeb3Sec/DeFiHackLabs/ (2023)
  27. [27] Smart contract weakness classification and test cases. https://swcregistry.io/, accessed 2023-04-04
  28. [28] Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., et al.: Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288 (2023)
  29. [29] White, J., Fu, Q., Hays, S., Sandborn, M., Olea, C., Gilbert, H., Elnashar, A., Spencer-Smith, J., Schmidt, D.C.: A prompt pattern catalog to enhance prompt engineering with ChatGPT. arXiv preprint arXiv:2302.11382 (2023)
  30. [30] Xiao, Z., Wang, Q., Pearce, H., Chen, S.: Logic meets magic: LLMs cracking smart contract vulnerabilities. arXiv preprint arXiv:2501.07058 (2025)
  31. [31] Xu, Z., Ren, M.: Smart-Contract-Benchmark-Suites: A unified dataset. https://github.com/renardbebe/Smart-Contract-Benchmark-Suites (2021)
  32. [32] Young, A., et al.: Yi: Open foundation models by 01.AI (2025). https://arxiv.org/abs/2403.04652
  33. [33] Zhang, L., Ergen, T., Logeswaran, L., Lee, M., Jurgens, D.: SPRIG: Improving large language model performance by system prompt optimization (2024). https://arxiv.org/abs/2410.14826
  34. [34] Zhang, Z., Zhang, B., Xu, W., Lin, Z.: Web3Bugs. https://github.com/ZhangZhuoSJTU/Web3Bugs/tree/main (2022)
  35. [35] Liu, N.F., Lin, K., Hewitt, J., Paranjape, A., Bevilacqua, M., Petroni, F., Liang, P.: Lost in the middle: How language models use long contexts. arXiv preprint arXiv:2307.03172 (2023)
  36. [36] Shrivastava, D., Kocetkov, D., de Vries, H., Bahdanau, D., Scholak, T.: RepoFusion: Training code models to understand your repository. arXiv preprint arXiv:2306.10998 (2023)
  37. [37] Shojaee, P., Mirzadeh, I., Alizadeh, K., Horton, M., Bengio, S., Farajtabar, M.: The illusion of thinking: Understanding the strengths and limitations of reasoning models via the lens of problem complexity. In: Proc. 38th Conf. Neural Inf. Process. Syst. (NeurIPS) (2025)
  38. [38] Kalai, A.T., Nachum, O., Vempala, S.S., Zhang, E.: Why language models hallucinate. arXiv preprint arXiv:2509.04664 (2025)
  39. [39] Bender, E.M., Gebru, T., McMillan-Major, A., Shmitchell, S.: On the dangers of stochastic parrots: Can language models be too big? In: Proc. 2021 ACM Conf. Fairness, Accountability, and Transparency (FAccT). pp. 610–623 (2021)
  40. [40] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., Polosukhin, I.: Attention is all you need. In: Proc. 30th Conf. Neural Inf. Process. Syst. (NIPS) (2017)
  41. [41] Dettmers, T., Lewis, M., Belkada, Y., Zettlemoyer, L.: LLM.int8(): 8-bit matrix multiplication for transformers at scale. Advances in Neural Information Processing Systems 35, 22128–22142 (2022)
  42. [42] Gholami, A., Kim, S., Dong, Z., Yao, Z., Mahoney, M.W., Keutzer, K.: A survey of quantization methods for efficient neural network inference. Low-Power Computer Vision, pp. 291–326 (2021)
  43. [43] Frantar, E., Ashkboos, S., Hoefler, T., Alistarh, D.: GPTQ: Accurate post-training quantization for generative pre-trained transformers. In: Proceedings of the 11th International Conference on Learning Representations (ICLR) (2023)
  44. [44] Dettmers, T., Pagnoni, A., Holtzman, A., Zettlemoyer, L.: QLoRA: Efficient finetuning of quantized LLMs. Advances in Neural Information Processing Systems 36 (2023)
  45. [45] Pope, R., Douglas, S., Chowdhery, A., Devane, C., Bradbury, J., Levskaya, A., Heek, J., Xiao, K., Agrawal, S., Dean, J.: Efficiently scaling transformer inference. In: Proceedings of the 6th MLSys Conference (2023)
  46. [46] Brown, T., et al.: Language models are few-shot learners. In: Proc. 34th Conf. Neural Inf. Process. Syst. (NeurIPS). pp. 1877–1901 (2020)
  47. [47] Wei, J., et al.: Chain-of-thought prompting elicits reasoning in large language models. In: Proc. 36th Conf. Neural Inf. Process. Syst. (NeurIPS). pp. 24824–24837 (2022)
  48. [48] Susan, S.: LLM-Smart-Contract-Analysis-Benchmark: Solidity Benchmark. https://doi.org/10.5281/zenodo.20109866
  49. [49] Invariant Mapping: Identify the critical security invariants of this contract (e.g., "Total deposits must always equal or exceed the sum of individual balances"). 2. Adversarial State Analysis: Systematically analyze every state-changing function. Determine if a sequence of transactions–potentially involving multiple users or flash-loan-funded interaction...
  50. [50] Baseline Triage: I have provided the Slither static analysis output below. Use this to identify immediate "hotspots" in the code. In your internal reasoning, evaluate if these detections are true positives or if the contract's specific business logic renders them non-exploitable. 2. Independent Invariant Mapping: Disregard the Slither output for a moment ...