arxiv: 2606.03387 · v1 · pith:7M4HQB6Dnew · submitted 2026-06-02 · 💻 cs.CR

Bastet: A Fine-Grained Expert-Labeled Dataset for DeFi Smart Contract Vulnerability Detection

Pith reviewed 2026-06-28 09:53 UTC · model grok-4.3

classification 💻 cs.CR
keywords DeFi smart contracts · vulnerability detection · expert-labeled dataset · audit reports · two-layer taxonomy · consensus annotation · Solidity

The pith

Bastet provides an expert-labeled dataset of 849 DeFi smart contract vulnerabilities from 2021-2024 audits to fix outdated code, noisy labels, and coarse categories in detection research.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Existing datasets for smart contract vulnerability detection use outdated Solidity versions, rely on automated or LLM-generated labels that introduce errors, and apply single broad categories that overlook business logic details. Bastet counters these issues by drawing 4,402 findings from 394 Code4rena competitive audit reports, with 849 of those findings annotated by security experts through a two-annotator consensus process. The dataset applies a two-layer taxonomy of 46 Tags and 77 Subtags to capture the semantic complexity of real-world business logic vulnerabilities. A reader would care because higher-quality ground truth supports more reliable LLM-based tools that could help reduce the financial damage from DeFi incidents.

Core claim

The paper establishes Bastet as a dataset of 4,402 findings collected from 394 Code4rena reports spanning April 2021 to November 2024, of which 849 are fully annotated by white-hat security researchers via a two-annotator consensus workflow, employing a two-layer taxonomy of 46 Tags and 77 Subtags to reflect the root causes of vulnerabilities in modern DeFi contracts.

What carries the argument

The two-layer taxonomy of 46 Tags and 77 Subtags paired with the discussion-based consensus annotation workflow that ties each label to findings from actual audit reports.
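
As a rough illustration of how a two-layer label might attach to an audit finding, here is a minimal Python sketch. The class names, field names, and the example Tag and Subtag strings are assumptions made for this illustration; the paper's actual schema is not shown on this page. The sketch also assumes a finding may carry more than one label.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Label:
    """One consensus label: a first-layer Tag refined by a second-layer Subtag."""
    tag: str
    subtag: str


@dataclass
class Finding:
    """One audit finding; field names are illustrative, not the dataset's schema."""
    report_id: str
    title: str
    labels: list[Label] = field(default_factory=list)

    @property
    def annotated(self) -> bool:
        # A finding counts as annotated once it carries at least one label.
        return bool(self.labels)


def annotated_fraction(findings: list[Finding]) -> float:
    """Share of findings that carry at least one label (0.0 for an empty list)."""
    if not findings:
        return 0.0
    return sum(f.annotated for f in findings) / len(findings)


# Made-up records for illustration only.
findings = [
    Finding("report-001", "Stale price used in liquidation",
            [Label("oracle", "stale-price")]),
    Finding("report-001", "Finding not yet annotated"),
]
print(annotated_fraction(findings))  # 0.5
print(round(849 / 4402, 3))          # 0.193, the annotated share reported for Bastet
```

Keeping the Tag and Subtag together in one immutable record means a Subtag can never be stored without its parent Tag, which matches the layered structure the page describes.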

If this is right

  • Detection models can be trained and tested against labels that reflect current Solidity versions and real audit conditions instead of older codebases.
  • The subtag level enables evaluation of how well tools identify specific business-logic flaws rather than only broad vulnerability classes.
  • The dataset supplies a benchmark tied directly to findings from real competitive audits, so performance claims can be checked against the original audit reports.
  • Researchers can isolate and study patterns within individual tag categories to develop targeted detection techniques.
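
One way to picture the difference between Tag-level and Subtag-level evaluation is the sketch below. It assumes, for simplicity, exactly one (tag, subtag) pair per finding, and all identifiers and label strings are hypothetical. It is not the paper's evaluation protocol.

```python
def accuracy_by_level(gold, predicted):
    """Return (tag_accuracy, subtag_accuracy) over the gold findings.

    gold and predicted map a finding id to a (tag, subtag) pair. A subtag
    only counts as correct when its parent tag is also correct.
    """
    if not gold:
        return 0.0, 0.0
    tag_hits = subtag_hits = 0
    for finding_id, (tag, subtag) in gold.items():
        pred = predicted.get(finding_id)
        if pred is None:
            continue  # a missing prediction is a miss at both levels
        if pred[0] == tag:
            tag_hits += 1
            if pred[1] == subtag:
                subtag_hits += 1
    return tag_hits / len(gold), subtag_hits / len(gold)


# Made-up gold labels and detector output for illustration only.
gold = {
    "f1": ("oracle", "stale-price"),
    "f2": ("oracle", "manipulable-spot-price"),
    "f3": ("access-control", "missing-modifier"),
    "f4": ("reentrancy", "cross-function"),
}
predicted = {
    "f1": ("oracle", "stale-price"),
    "f2": ("oracle", "stale-price"),
    "f3": ("access-control", "missing-modifier"),
    "f4": ("oracle", "stale-price"),
}
tag_acc, subtag_acc = accuracy_by_level(gold, predicted)
print(tag_acc, subtag_acc)  # 0.75 0.5
```

The gap between the two numbers is the information a single-layer labeling scheme cannot provide: a detector can find the right broad class while still missing the specific flaw.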

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The consensus annotation method could be adapted to build similar high-quality datasets for vulnerabilities in other smart-contract domains such as NFTs or cross-chain bridges.
  • Models refined on these labels might surface issues earlier in protocol development, before deployment and before any loss occurs.
  • Periodic updates with new audit reports could track how vulnerability distributions shift as DeFi protocols and Solidity features evolve.
  • Combining the dataset with automated pre-filtering tools might allow the labeled portion to grow beyond the current 849 findings without losing expert oversight.

Load-bearing premise

The two-annotator consensus workflow by security researchers produces labels that accurately capture the root causes of vulnerabilities as they appear in the audited contracts.

What would settle it

Independent re-annotation of a sample of the 849 findings by a different set of experts that shows frequent mismatches with the published labels.
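
A re-annotation check of this kind could be summarized as a mismatch rate with an uncertainty range. The sketch below uses a Wilson score interval, which is a choice made for this illustration and not something the paper or the review specifies. The sample size, label strings, and one-label-per-finding comparison are likewise assumptions.

```python
import math


def mismatch_rate_with_interval(published, reannotated, z=1.96):
    """Return (rate, low, high): mismatch rate and its Wilson score interval.

    published and reannotated are equal-length lists of labels for the same
    sampled findings. z=1.96 corresponds to roughly 95% confidence.
    """
    if len(published) != len(reannotated) or not published:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(published)
    mismatches = sum(p != r for p, r in zip(published, reannotated))
    rate = mismatches / n
    denom = 1 + z * z / n
    center = (rate + z * z / (2 * n)) / denom
    half = z * math.sqrt(rate * (1 - rate) / n + z * z / (4 * n * n)) / denom
    return rate, max(0.0, center - half), min(1.0, center + half)


# Made-up sample: 100 re-annotated findings, 10 of which disagree.
published = ["oracle"] * 100
reannotated = ["oracle"] * 90 + ["access-control"] * 10
rate, low, high = mismatch_rate_with_interval(published, reannotated)
print(f"mismatch rate {rate:.2f}, interval [{low:.3f}, {high:.3f}]")
```

Reporting the interval alongside the point estimate matters here because any re-annotated sample would be small relative to the 849 labeled findings.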

Original abstract

Smart contract vulnerabilities in Decentralized Finance (DeFi) protocols resulted in over 1.49 billion USD in confirmed losses in 2024 alone, across 192 incidents [1]. As LLM-based vulnerability detection emerges as a promising approach to address these threats, the quality of evaluation datasets has become a critical bottleneck. Existing datasets suffer from three fundamental problems: they are built on outdated Solidity versions (e.g., v0.4) that no longer reflect modern DeFi contracts [5][6][7]; they rely on automated or LLM-generated annotations that introduce hallucination-driven label noise [9][10]; and they apply coarse single-layer labeling that fails to capture the semantic complexity of real-world business logic vulnerabilities [6][7][11][12]. We present Bastet, an expert-labeled DeFi smart contract vulnerability dataset that addresses all three problems through real-world audit findings (2021-2024), human expert annotation with discussion-based consensus, and a two-layer taxonomy of 46 Tags and 77 Subtags. Bastet comprises 4,402 findings collected from 394 Code4rena competitive audit reports spanning April 2021 to November 2024, of which 849 findings are fully annotated by white-hat security researchers from the DeFiHackLabs community. All annotations are produced through a two-annotator consensus workflow, ensuring label accuracy grounded in real-world vulnerability root causes.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper presents Bastet, a dataset of 4,402 findings extracted from 394 Code4rena competitive audit reports (April 2021–November 2024), of which 849 are annotated by white-hat researchers from the DeFiHackLabs community. It claims to resolve three problems in prior DeFi vulnerability datasets (outdated Solidity versions, hallucination-driven label noise from automated or LLM-generated annotations, and coarse single-layer labeling) via real-world audit data, a two-annotator discussion-based consensus workflow, and a two-layer taxonomy consisting of 46 Tags and 77 Subtags.

Significance. If the annotation quality holds, Bastet would supply a timely, fine-grained resource for training and evaluating LLM-based detectors on modern DeFi business-logic vulnerabilities. The use of recent competitive audit reports and expert human annotators constitutes a clear methodological advance over synthetic or LLM-labeled corpora.

major comments (1)
  1. [Abstract] The abstract's claim that the two-annotator consensus workflow is 'ensuring label accuracy grounded in real-world vulnerability root causes' is unsupported by any quantitative evidence (inter-annotator agreement rates, Cohen’s kappa, pre-consensus disagreement statistics, or external validation). Because the dataset’s primary value proposition is the elimination of hallucination-driven label noise, verification that the workflow actually produces reliable labels is load-bearing for the central claim, and it is missing.
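
For readers unfamiliar with the statistic named above, the following sketch shows how Cohen’s kappa could be computed from two annotators' independent, pre-consensus labels. It assumes one categorical label per finding per annotator, and the label strings are made up. The paper reports no such figures.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and
    p_e is the agreement expected by chance from each annotator's label
    frequencies.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(labels_a)
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[k] * freq_b[k] for k in freq_a.keys() | freq_b.keys()) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (p_o - p_e) / (1 - p_e)


# Made-up pre-consensus labels for four findings.
annotator_a = ["reentrancy", "oracle", "oracle", "access-control"]
annotator_b = ["reentrancy", "oracle", "access-control", "access-control"]
print(round(cohens_kappa(annotator_a, annotator_b), 3))  # 0.636
```

In this toy case the annotators agree on three of four findings (p_o = 0.75), but chance alone would produce agreement of 5/16, so kappa is 7/11, noticeably lower than the raw agreement rate.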

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for highlighting the need for quantitative support behind the annotation workflow claim in the abstract. We address this point directly below and will revise the manuscript accordingly.

Point-by-point responses
  1. Referee: [Abstract] The abstract's claim that the two-annotator consensus workflow is 'ensuring label accuracy grounded in real-world vulnerability root causes' is unsupported by any quantitative evidence (inter-annotator agreement rates, Cohen’s kappa, pre-consensus disagreement statistics, or external validation). Because the dataset’s primary value proposition is the elimination of hallucination-driven label noise, verification that the workflow actually produces reliable labels is load-bearing for the central claim, and it is missing.

    Authors: The referee is correct that the abstract asserts the workflow 'ensuring label accuracy' without accompanying quantitative metrics. The manuscript (Section 3.2) describes a two-annotator process in which each finding is labeled independently and then discussed until consensus is reached, with the goal of grounding labels in the original audit report context. However, we did not record or report pre-consensus disagreement rates, Cohen’s kappa, or external validation, because the workflow is explicitly collaborative rather than a set of independent annotations. This leaves the strong phrasing in the abstract without direct empirical backing. We will revise the abstract to replace the clause 'ensuring label accuracy grounded in real-world vulnerability root causes' with a factual description of the workflow ('produced through a two-annotator discussion-based consensus process'). We will also add a short limitations paragraph noting the absence of quantitative agreement statistics and the reliance on expert discussion. These changes will be incorporated in the next version.

    revision: yes

Circularity Check

0 steps flagged

No circularity: dataset curation paper with no derivation chain or self-referential claims

Full rationale

The paper presents a data collection and labeling effort based on external Code4rena audit reports (2021-2024) and expert annotation by DeFiHackLabs researchers. No equations, parameter fitting, predictions, or mathematical derivations exist. The central claim of label accuracy via two-annotator consensus is a methodological assertion grounded in external audit findings rather than any self-definition, fitted input renamed as prediction, or load-bearing self-citation chain. The work is self-contained against external benchmarks with no reduction of outputs to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central contribution is a curated dataset whose value rests on the reliability of the expert labeling process rather than on any derivation or fitted parameters.

axioms (1)
  • domain assumption: Expert annotators from the DeFiHackLabs community can produce accurate vulnerability labels through discussion-based consensus.
    Invoked in the description of the annotation workflow for the 849 fully annotated findings.

Reference graph

Works this paper leans on

18 extracted references · 10 canonical work pages · 1 internal anchor

  1. [1]

    Crypto losses in 2024

    Immunefi. Crypto losses in 2024. Immunefi Research Report, 2024. URL https://immunefi.com/research/

  2. [2]

    Logic meets magic: LLMs cracking smart contract vulnerabilities, 2025

    Zexin Xiao, Qingyuan Wang, Hammond Pearce, and Sze Yiu Chen. Logic meets magic: LLMs cracking smart contract vulnerabilities, 2025. URL https://arxiv.org/abs/2501.07058

  3. [3]

    SmartBugs: A framework to analyze Solidity smart contracts

    João F. Ferreira, Pedro Cruz, Thomas Durieux, and Rui Abreu. SmartBugs: A framework to analyze Solidity smart contracts. In Proceedings of the 35th IEEE/ACM International Conference on Automated Software Engineering (ASE), 2020

  4. [4]

    How effective are smart contract analysis tools? Evaluating smart contract static analysis tools using bug injection

    Asem Ghaleb and Karthik Pattabiraman. How effective are smart contract analysis tools? Evaluating smart contract static analysis tools using bug injection. In Proceedings of the 29th ACM SIGSOFT International Symposium on Software Testing and Analysis (ISSTA), 2020. doi: 10.1145/3395363.3397385

  5. [5]

    Vulnerability detection with code language models: How far are we? arXiv preprint arXiv:2403.18624, 2024

    Yuqing Ding, Yifan Fu, Omar Ibrahim, Chawin Sitawarin, Xinyun Chen, Basel Alomair, David Wagner, Baishakhi Ray, and Yizheng Chen. Vulnerability detection with code language models: How far are we?, 2024. URL https://arxiv.org/abs/2403.18624

  6. [6]

    FORGE: An LLM-driven framework for large-scale smart contract vulnerability dataset construction, 2025

    Tao Chen et al. FORGE: An LLM-driven framework for large-scale smart contract vulnerability dataset construction, 2025. URL https://arxiv.org/abs/2506.18795

  7. [7]

    DIVE: A multi-label smart contract vulnerability dataset. Scientific Data,

    Huashan Zheng et al. DIVE: A multi-label smart contract vulnerability dataset. Scientific Data,

  8. [8]

    URL https://www.nature.com/articles/s41597-026-07025-5

  9. [9]

    OpenSCV: An open hierarchical taxonomy for smart contract vulnerabilities. Empirical Software Engineering, 2024

    João Oliveira et al. OpenSCV: An open hierarchical taxonomy for smart contract vulnerabilities. Empirical Software Engineering, 2024. doi: 10.1007/s10664-024-10446-8

  10. [10]

    Smart contract and DeFi security tools: Do they meet the needs of practitioners?

    Stefanos Chaliasos, Marcos A. Charalambous, Liyi Zhou, Raghavendra Galanopoulou, Arthur Gervais, Dionysis Mitropoulos, and Benjamin Livshits. Smart contract and DeFi security tools: Do they meet the needs of practitioners? In Proceedings of the 46th IEEE/ACM International Conference on Software Engineering (ICSE), 2024. doi: 10.1145/3597503.3623302

  11. [11]

    S. M. Mehedi Hossain, Ammar Altarawneh, and Jerry Roberts. Leveraging large language models and machine learning for smart contract vulnerability detection. In Proceedings of the 2025 IEEE 15th Annual Computing and Communication Workshop and Conference (CCWC),

  12. [12]

    URL https://arxiv.org/abs/2501.02229

  13. [13]

    Advanced smart contract vulnerability detection via LLM-powered multi-agent systems. IEEE Transactions on Software Engineering, 51(10):2830–2846, 2025

    Zhiyuan Wei, Jing Sun, Yuhang Sun, Yi Liu, Defeng Wu, Zheng Zhang, Xiaofei Zhang, Meng Li, Yang Liu, Chunrong Li, Ming Wan, Jin Song Dong, and Liming Zhu. Advanced smart contract vulnerability detection via LLM-powered multi-agent systems. IEEE Transactions on Software Engineering, 51(10):2830–2846, 2025. doi: 10.1109/TSE.2025.3597319

  14. [14]

    Turn the rudder: A beacon of reentrancy detection for smart contracts on Ethereum

    Zibin Zheng, Neng Chen, Jianzhong Ye, Jiachi Chen, Zhijie Hu, Lianghao Wu, Weizhe Luo, Wenjia Jiang, Zhiying Xie, and Yanlin Wang. Turn the rudder: A beacon of reentrancy detection for smart contracts on Ethereum. In Proceedings of the 45th IEEE/ACM International Conference on Software Engineering (ICSE), 2023

  15. [15]

    SCRUBD: Smart contracts reentrancy and unhandled exceptions vulnerability dataset, 2024

    Thomas Durieux et al. SCRUBD: Smart contracts reentrancy and unhandled exceptions vulnerability dataset, 2024. URL https://arxiv.org/abs/2412.09935

  16. [16]

    Why Language Models Hallucinate

    Adam Tauman Kalai and Ofir Nachum. Why language models hallucinate, 2025. URL https://arxiv.org/abs/2509.04664

  17. [17]

    Code4rena: Competitive smart contract audit platform

    Code4rena. Code4rena: Competitive smart contract audit platform. Online, 2021. URL https://code4rena.com/

  18. [18]

    DeFiHackLabs: Web3 security incident database

    SunWeb3Sec. DeFiHackLabs: Web3 security incident database. GitHub Repository, 2022. URL https://github.com/SunWeb3Sec/DeFiHackLabs