Bastet: A Fine-Grained Expert-Labeled Dataset for DeFi Smart Contract Vulnerability Detection
Pith reviewed 2026-06-28 09:53 UTC · model grok-4.3
The pith
Bastet provides an expert-labeled dataset of 849 DeFi smart contract vulnerabilities from 2021-2024 audits to fix outdated code, noisy labels, and coarse categories in detection research.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper establishes Bastet as a dataset of 4,402 findings collected from 394 Code4rena reports spanning April 2021 to November 2024, of which 849 are fully annotated by white-hat security researchers via a two-annotator consensus workflow, employing a two-layer taxonomy of 46 Tags and 77 Subtags to reflect the root causes of vulnerabilities in modern DeFi contracts.
What carries the argument
The two-layer taxonomy of 46 Tags and 77 Subtags paired with the discussion-based consensus annotation workflow that ties each label to findings from actual audit reports.
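A minimal sketch of how a finding with two-layer labels and a two-annotator consensus step might be represented. The class names, field names, and example tag strings below are hypothetical illustrations, not the dataset's actual schema or taxonomy. The sketch also assumes a finding may carry more than one label, which the text above does not state.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class Label:
    """One root-cause label: a first-layer Tag plus an optional second-layer Subtag."""
    tag: str
    subtag: Optional[str] = None


@dataclass
class Finding:
    """One audit finding. Field names are illustrative, not the dataset's schema."""
    finding_id: str
    report_id: str
    annotator_a: List[Label] = field(default_factory=list)
    annotator_b: List[Label] = field(default_factory=list)
    consensus: List[Label] = field(default_factory=list)

    def needs_discussion(self) -> bool:
        """True when the two annotators' label sets differ."""
        return set(self.annotator_a) != set(self.annotator_b)


# Hypothetical example: the annotators agree on the Tag but not the Subtag.
f = Finding(
    finding_id="example-001",
    report_id="example-report",
    annotator_a=[Label("access-control", "missing-check")],
    annotator_b=[Label("access-control", "wrong-role")],
)
print(f.needs_discussion())
```

Keeping both annotators' labels alongside the consensus label, rather than storing only the final result, is a design choice of this sketch: it preserves the disagreement record that a later agreement analysis would need.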
If this is right
- Detection models can be trained and tested against labels that reflect current Solidity versions and real audit conditions instead of older codebases.
- The subtag level enables evaluation of how well tools identify specific business-logic flaws rather than only broad vulnerability classes.
- The dataset supplies a benchmark tied directly to findings from real competitive audits, so performance claims can be checked against the original Code4rena reports.
- Researchers can isolate and study patterns within individual tag categories to develop targeted detection techniques.
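The difference between tag-level and subtag-level evaluation can be sketched as follows. This is an illustration under stated assumptions, not the paper's evaluation protocol: it assumes one (tag, subtag) pair per finding, and the function name and example label strings are hypothetical.

```python
from typing import List, Optional, Tuple

# (tag, subtag) pair; subtag may be None when only the first layer applies.
TagPair = Tuple[str, Optional[str]]


def two_level_accuracy(gold: List[TagPair], pred: List[TagPair]) -> Tuple[float, float]:
    """Return (tag-level accuracy, subtag-level accuracy) for aligned label lists."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and of equal length")
    n = len(gold)
    # Tag level: only the first layer has to match.
    tag_hits = sum(g[0] == p[0] for g, p in zip(gold, pred))
    # Subtag level: both layers have to match.
    subtag_hits = sum(g == p for g, p in zip(gold, pred))
    return tag_hits / n, subtag_hits / n


# Hypothetical labels, invented for this example.
gold = [
    ("oracle", "stale-price"),
    ("access-control", "missing-check"),
    ("math", "rounding"),
    ("reentrancy", None),
]
pred = [
    ("oracle", "stale-price"),
    ("access-control", "wrong-role"),
    ("oracle", "stale-price"),
    ("reentrancy", None),
]
tag_acc, subtag_acc = two_level_accuracy(gold, pred)
print(tag_acc, subtag_acc)  # 0.75 0.5
```

In this toy input the detector gets the broad class right for three of four findings but the specific root cause for only two, which is the gap a single-layer benchmark would not show.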
Where Pith is reading between the lines
- The consensus annotation method could be adapted to build similar high-quality datasets for vulnerabilities in other smart-contract domains such as NFTs or cross-chain bridges.
- Models refined on these labels might surface issues earlier during protocol development before deployment and loss occur.
- Periodic updates with new audit reports could track how vulnerability distributions shift as DeFi protocols and Solidity features evolve.
- Combining the dataset with automated pre-filtering tools might allow the labeled portion to grow beyond the current 849 findings without losing expert oversight.
Load-bearing premise
The two-annotator consensus workflow by security researchers produces labels that accurately capture the root causes of vulnerabilities as they appear in the audited contracts.
What would settle it
Independent re-annotation of a sample of the 849 findings by a different set of experts that shows frequent mismatches with the published labels.
Original abstract
Smart contract vulnerabilities in Decentralized Finance (DeFi) protocols resulted in over 1.49 billion USD in confirmed losses in 2024 alone, across 192 incidents [1]. As LLM-based vulnerability detection emerges as a promising approach to address these threats, the quality of evaluation datasets has become a critical bottleneck. Existing datasets suffer from three fundamental problems: they are built on outdated Solidity versions (e.g., v0.4) that no longer reflect modern DeFi contracts [5][6][7]; they rely on automated or LLM-generated annotations that introduce hallucination-driven label noise [9][10]; and they apply coarse single-layer labeling that fails to capture the semantic complexity of real-world business logic vulnerabilities [6][7][11][12]. We present Bastet, an expert-labeled DeFi smart contract vulnerability dataset that addresses all three problems through real-world audit findings (2021-2024), human expert annotation with discussion-based consensus, and a two-layer taxonomy of 46 Tags and 77 Subtags. Bastet comprises 4,402 findings collected from 394 Code4rena competitive audit reports spanning April 2021 to November 2024, of which 849 findings are fully annotated by white-hat security researchers from the DeFiHackLabs community. All annotations are produced through a two-annotator consensus workflow, ensuring label accuracy grounded in real-world vulnerability root causes.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents Bastet, a dataset of 4,402 findings extracted from 394 Code4rena competitive audit reports (April 2021–November 2024), of which 849 are annotated by white-hat researchers from the DeFiHackLabs community. It claims to resolve three problems in prior DeFi vulnerability datasets—outdated Solidity versions, hallucination-driven label noise from automated or LLM-generated annotations, and coarse single-layer labeling—via real-world audit data, a two-annotator discussion-based consensus workflow, and a two-layer taxonomy consisting of 46 Tags and 77 Subtags.
Significance. If the annotation quality holds, Bastet would supply a timely, fine-grained resource for training and evaluating LLM-based detectors on modern DeFi business-logic vulnerabilities. The use of recent competitive audit reports and expert human annotators constitutes a clear methodological advance over synthetic or LLM-labeled corpora.
major comments (1)
- [Abstract] The claim that the two-annotator consensus workflow is 'ensuring label accuracy grounded in real-world vulnerability root causes' is unsupported by any quantitative evidence (inter-annotator agreement rates, Cohen’s kappa, pre-consensus disagreement statistics, or external validation). Because the dataset’s primary value proposition is the elimination of hallucination-driven label noise, the central claim depends on evidence that the workflow actually produces reliable labels, and none is given.
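For reference, Cohen's kappa compares the observed agreement p_o between two annotators with the agreement p_e expected by chance: kappa = (p_o - p_e) / (1 - p_e). A minimal sketch of the computation follows. It assumes one label per finding and that each annotator's independent, pre-consensus labels were recorded. The label strings are hypothetical.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two annotators labeling the same items, one label each."""
    if len(a) != len(b) or not a:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement: product of the annotators' marginal label frequencies.
    count_a, count_b = Counter(a), Counter(b)
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical pre-consensus labels for six findings.
annotator_a = ["oracle", "oracle", "access", "access", "math", "math"]
annotator_b = ["oracle", "oracle", "access", "math", "math", "math"]
kappa = cohens_kappa(annotator_a, annotator_b)
print(round(kappa, 2))  # 0.75
```

Here the annotators agree on five of six findings (p_o = 5/6) and chance agreement is p_e = 1/3, giving kappa = 0.75. With a taxonomy as large as 46 Tags, chance agreement would typically be low, so kappa and raw agreement would tend to be close. Reporting both would still address the referee's request.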
Simulated Author's Rebuttal
We thank the referee for highlighting the need for quantitative support behind the annotation workflow claim in the abstract. We address this point directly below and will revise the manuscript accordingly.
-
Referee: [Abstract] The claim that the two-annotator consensus workflow is 'ensuring label accuracy grounded in real-world vulnerability root causes' is unsupported by any quantitative evidence (inter-annotator agreement rates, Cohen’s kappa, pre-consensus disagreement statistics, or external validation). Because the dataset’s primary value proposition is the elimination of hallucination-driven label noise, the central claim depends on evidence that the workflow actually produces reliable labels, and none is given.
Authors: The referee is correct that the abstract describes the workflow as 'ensuring label accuracy' without accompanying quantitative metrics. The manuscript (Section 3.2) describes a two-annotator process in which each finding is labeled independently and then discussed until consensus is reached, with the goal of grounding labels in the original audit report context. However, we did not record or report pre-consensus disagreement rates, Cohen’s kappa, or external validation, because the workflow is explicitly collaborative rather than a set of independent annotations. The strong phrasing in the abstract therefore lacks direct empirical backing. We will revise the abstract to replace the clause 'ensuring label accuracy grounded in real-world vulnerability root causes' with a factual description of the workflow ('produced through a two-annotator discussion-based consensus process'). We will also add a short limitations paragraph noting the absence of quantitative agreement statistics and the reliance on expert discussion. These changes will be incorporated in the next version.
Revision: yes
Circularity Check
No circularity: dataset curation paper with no derivation chain or self-referential claims
The paper presents a data collection and labeling effort based on external Code4rena audit reports (2021-2024) and expert annotation by DeFiHackLabs researchers. No equations, parameter fitting, predictions, or mathematical derivations exist. The central claim of label accuracy via two-annotator consensus is a methodological assertion grounded in external audit findings rather than any self-definition, fitted input renamed as prediction, or load-bearing self-citation chain. The work is self-contained against external benchmarks with no reduction of outputs to inputs by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Expert annotators from the DeFiHackLabs community can produce accurate vulnerability labels through discussion-based consensus.
Reference graph
Works this paper leans on
-
[1]
Immunefi. Crypto losses in 2024. Immunefi Research Report, 2024. URL https://immunefi.com/research/
-
[2]
Zexin Xiao, Qingyuan Wang, Hammond Pearce, and Sze Yiu Chen. Logic meets magic: LLMs cracking smart contract vulnerabilities, 2025. URL https://arxiv.org/abs/2501.07058
-
[3]
João F. Ferreira, Pedro Cruz, Thomas Durieux, and Rui Abreu. SmartBugs: A framework to analyze solidity smart contracts. In Proceedings of the 35th IEEE/ACM International Conference on Automated Software Engineering (ASE), 2020
-
[4]
Asem Ghaleb and Karthik Pattabiraman. How effective are smart contract analysis tools? Evaluating smart contract static analysis tools using bug injection. In Proceedings of the 29th ACM SIGSOFT International Symposium on Software Testing and Analysis (ISSTA), 2020. doi: 10.1145/3395363.3397385
-
[5]
Yuqing Ding, Yifan Fu, Omar Ibrahim, Chawin Sitawarin, Xinyun Chen, Basel Alomair, David Wagner, Baishakhi Ray, and Yizheng Chen. Vulnerability detection with code language models: How far are we?, 2024. URL https://arxiv.org/abs/2403.18624
-
[6]
Tao Chen et al. FORGE: An LLM-driven framework for large-scale smart contract vulnerability dataset construction, 2025. URL https://arxiv.org/abs/2506.18795
-
[7]
Huashan Zheng et al. DIVE: A multi-label smart contract vulnerability dataset. Scientific Data,
-
[8]
URL https://www.nature.com/articles/s41597-026-07025-5
-
[9]
João Oliveira et al. OpenSCV: An open hierarchical taxonomy for smart contract vulnerabilities. Empirical Software Engineering, 2024. doi: 10.1007/s10664-024-10446-8
-
[10]
Stefanos Chaliasos, Marcos A. Charalambous, Liyi Zhou, Raghavendra Galanopoulou, Arthur Gervais, Dionysis Mitropoulos, and Benjamin Livshits. Smart contract and DeFi security tools: Do they meet the needs of practitioners? In Proceedings of the 46th IEEE/ACM International Conference on Software Engineering (ICSE), 2024. doi: 10.1145/3597503.3623302
-
[11]
S. M. Mehedi Hossain, Ammar Altarawneh, and Jerry Roberts. Leveraging large language models and machine learning for smart contract vulnerability detection. In Proceedings of the 2025 IEEE 15th Annual Computing and Communication Workshop and Conference (CCWC), 2025
- [12]
-
[13]
Zhiyuan Wei, Jing Sun, Yuhang Sun, Yi Liu, Defeng Wu, Zheng Zhang, Xiaofei Zhang, Meng Li, Yang Liu, Chunrong Li, Ming Wan, Jin Song Dong, and Liming Zhu. Advanced smart contract vulnerability detection via LLM-powered multi-agent systems. IEEE Transactions on Software Engineering, 51(10):2830–2846, 2025. doi: 10.1109/TSE.2025.3597319
-
[14]
Zibin Zheng, Neng Chen, Jianzhong Ye, Jiachi Chen, Zhijie Hu, Lianghao Wu, Weizhe Luo, Wenjia Jiang, Zhiying Xie, and Yanlin Wang. Turn the rudder: A beacon of reentrancy detection for smart contracts on Ethereum. In Proceedings of the 45th IEEE/ACM International Conference on Software Engineering (ICSE), 2023
-
[15]
Thomas Durieux et al. SCRUBD: Smart contracts reentrancy and unhandled exceptions vulnerability dataset, 2024. URL https://arxiv.org/abs/2412.09935
-
[16]
Adam Tauman Kalai and Ofir Nachum. Why language models hallucinate, 2025. URL https://arxiv.org/abs/2509.04664
-
[17]
Code4rena. Code4rena: Competitive smart contract audit platform. Online, 2021. URL https://code4rena.com/
-
[18]
SunWeb3Sec. DeFiHackLabs: Web3 security incident database. GitHub Repository, 2022. URL https://github.com/SunWeb3Sec/DeFiHackLabs