pith. sign in

arxiv: 2604.26495 · v2 · submitted 2026-04-29 · 💻 cs.CR

Beyond Code Reasoning: Specification-Anchored Auditing of Multi-Implementation Distributed Protocols

Pith reviewed 2026-05-07 13:13 UTC · model grok-4.3

classification 💻 cs.CR
keywords specification-based auditingLLM-driven security analysismulti-implementation distributed protocolsblockchain consensus securitysecurity property derivationcross-implementation comparisoninvariant violation detection
0
0 comments X

The pith

Deriving security properties from natural-language specifications enables consistent auditing of multiple protocol implementations and catches specification-level bugs missed by code-only analysis.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Auditing distributed protocols like blockchain consensus is difficult because correctness depends on adherence to a shared specification rather than the details of any one code implementation. This paper presents a framework that uses large language models to extract explicit, categorized security properties such as invariants, preconditions, and trust assumptions directly from natural-language specifications. These properties then serve as a fixed baseline for examining different independent implementations of the same protocol. The result is the ability to perform cross-implementation comparisons and detect divergences or invariant violations that code reasoning alone cannot identify. Evaluations on a major Ethereum audit contest and a C/C++ benchmark demonstrate recovery of all known vulnerabilities plus discovery of additional confirmed bugs, with traceable false positives.

Core claim

The central claim is that an LLM-driven framework can derive categorized security properties from natural language specifications and reuse them across multiple implementations to ground audits in specification requirements. This allows controlled comparisons between clients, detections based on invariants not present in code patterns, and tracing of errors to specific stages like property generation. In the Sherlock Ethereum Fusaka contest involving 10 targets and 366 submissions, the method recovers all 15 in-scope high/medium/low vulnerabilities with expert augmentation for some and fully automated for others, while also identifying 4 fix-confirmed new bugs including a cryptographic-invar

What carries the argument

SPECA, the specification-anchored auditing framework that extracts and categorizes security properties from specifications for cross-implementation reuse and comparison.

Load-bearing premise

Large language models are able to extract complete, accurate, and properly categorized security properties from natural-language specifications without introducing critical omissions or hallucinations.

What would settle it

Running the framework on a new multi-implementation protocol audit where it either fails to detect a known specification-based vulnerability due to incomplete property derivation or produces detections based on hallucinated properties that do not correspond to actual issues in the code.

Figures

Figures reproduced from arXiv: 2604.26495 by Akiyoshi Sannai, Hirotake Murakami, Masato Kamba.

Figure 1
Figure 1. Figure 1: The SPECA pipeline. Phases 1–3 (Knowledge Structuring) derive security properties from specifications; Phases 4–6 (Systematic Auditing) verify properties against implementations. In multi-implementation settings, the left stage executes once and the right stage executes per target view at source ↗
Figure 2
Figure 2. Figure 2: Phase 5 audit flow for a single property. Each property undergoes three sub-phases: view at source ↗
Figure 3
Figure 3. Figure 3: Phase 5 (Audit) vs Phase 6 (FP Filter). Phase 6 im view at source ↗
Figure 5
Figure 5. Figure 5: Property family to ground truth issue flow. The view at source ↗
Figure 6
Figure 6. Figure 6: Per-repository finding distribution (Phase 5 vs view at source ↗
Figure 7
Figure 7. Figure 7: Phase 6 three-gate FP filter effectiveness (N=30 DIS view at source ↗
Figure 10
Figure 10. Figure 10: Partially controlled comparison across model back view at source ↗
Figure 9
Figure 9. Figure 9: True positives vs false positives across all tools. view at source ↗
Figure 11
Figure 11. Figure 11: Cost vs. detection performance across tools on view at source ↗
Figure 12
Figure 12. Figure 12: SPECA output quality breakdown (N=102). Of 102 view at source ↗
Figure 14
Figure 14. Figure 14: False-positive root-cause taxonomy with pipeline view at source ↗
Figure 13
Figure 13. Figure 13: Issue × property family TP finding counts (post￾review). The property neighborhood effect is visible as hor￾izontal bands where a single issue is detected by multiple complementary properties; shared vertical structure across repositories reflects the cross-implementation comparability the property vocabulary enables view at source ↗
Figure 15
Figure 15. Figure 15: Precision by property type (final output). view at source ↗
Figure 17
Figure 17. Figure 17: Bug type breakdown (NPD / MLK / UAF) across view at source ↗
Figure 16
Figure 16. Figure 16: Precision vs total bugs found across all tools and view at source ↗
Figure 19
Figure 19. Figure 19: Bug detection matrix (40 total bugs: 35 non view at source ↗
read the original abstract

Code-driven auditing fails when correctness depends on what the specification requires rather than how the code is written. Production blockchain networks expose this directly: byzantine consensus runs many independent clients of a shared specification, so a specification-divergence defect in one client can fork the network or halt finality. Existing tools reason one repository at a time, with no shared baseline held constant across implementations. We present SPECA, an LLM-driven audit framework that derives explicit, categorized security properties (invariants, pre/postconditions, trust assumptions) from natural-language specifications and reuses them across implementations. SPECA enables controlled cross-implementation comparison, detections grounded in specification invariants no code pattern encodes, and false positives traceable to a specific pipeline phase rather than opaque model errors. On the Sherlock Ethereum Fusaka Audit Contest (10 targets, 366 submissions), SPECA recovers all 15 in-scope H/M/L vulnerabilities expert-augmented (8/15 automated-only) and surfaces 4 fix-confirmed bugs, including a cryptographic-invariant violation missed by every adjudicated finding. On the RepoAudit C/C++ benchmark, SPECA reaches 88.9% precision at 100% recall (F1=0.94) and surfaces 12 author-validated bugs beyond ground truth, two externally validated. SPECA also flags 5 of RepoAudit's 40 published bugs as defensive-coding fixes with no reachable exploit path. False positives trace to three pipeline-pinned root causes; a multi-model study identifies property-generation quality as the binding constraint. End-to-end cost is ~$30 per H/M/L bug (~42 min wall-clock under parallel execution).

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces SPECA, an LLM-driven auditing framework that extracts explicit, categorized security properties (invariants, pre/postconditions, trust assumptions) from natural-language specifications of distributed protocols and reuses them for cross-implementation comparison. It targets specification-divergence defects in multi-client systems such as blockchain networks. On the Sherlock Ethereum Fusaka contest (10 targets, 366 submissions), SPECA recovers all 15 in-scope H/M/L vulnerabilities (8 fully automated) and identifies 4 additional fix-confirmed bugs, including a cryptographic-invariant violation missed by all adjudicated findings. On the RepoAudit C/C++ benchmark it achieves 100% recall and 88.9% precision (F1=0.94), surfaces 12 author-validated bugs beyond ground truth (2 externally validated), and correctly flags 5 published bugs as non-exploitable defensive coding. False positives are traced to three pipeline stages; a multi-model ablation identifies property-generation quality as the primary limit. End-to-end cost is reported as ~$30 per H/M/L bug.

Significance. If the empirical results hold under replication, the work provides a concrete, reusable method for specification-anchored auditing that directly addresses the gap between single-repository code analysis and the reality of multiple independent implementations of a shared spec. The full recovery of contest vulnerabilities, discovery of new bugs, traceable false-positive analysis, and explicit identification of the property-generation bottleneck constitute measurable progress over purely code-driven approaches. The use of public benchmarks with author- and externally-validated findings, together with cost and wall-clock metrics, makes the contribution immediately usable by auditors and researchers working on blockchain and distributed-system security.

major comments (2)
  1. [§4] §4 (Sherlock Evaluation): the distinction between the 8 fully automated recoveries and the 7 expert-augmented recoveries is load-bearing for the claim of reduced human intervention; the manuscript should provide the precise prompting templates, property-categorization rules, and decision procedure used to classify a detection as automated-only versus expert-augmented so that the 8/15 figure can be reproduced.
  2. [§3] Property-extraction pipeline (described in §3): while the multi-model study correctly identifies property quality as the binding constraint, the central claim that derived properties are “complete” and “accurate” for downstream detection would be strengthened by a quantitative completeness metric (e.g., fraction of spec-stated invariants successfully extracted) or a small-scale human validation of the extracted property sets on a held-out specification; without this, the 100% recall on RepoAudit remains tied to the specific benchmarks rather than shown to generalize.
minor comments (2)
  1. [Abstract] Abstract and §5: the reported ~$30 / ~42 min figures should state the exact LLM models, temperature settings, and parallelization hardware used, as these directly affect reproducibility of the cost claim.
  2. [Evaluation] Table or figure summarizing false-positive root causes: adding a compact table that cross-tabulates the three pipeline-pinned causes across both benchmarks would improve readability and allow readers to assess the relative contribution of each phase.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments and positive recommendation for minor revision. We address each major comment point by point below.

read point-by-point responses
  1. Referee: §4 (Sherlock Evaluation): the distinction between the 8 fully automated recoveries and the 7 expert-augmented recoveries is load-bearing for the claim of reduced human intervention; the manuscript should provide the precise prompting templates, property-categorization rules, and decision procedure used to classify a detection as automated-only versus expert-augmented so that the 8/15 figure can be reproduced.

    Authors: We agree that reproducibility of the 8/15 split is essential to substantiate the reduced-human-intervention claim. In the revised manuscript we will add the exact prompting templates used for property extraction and detection, the categorization rules for invariants/pre/postconditions/trust assumptions, and the explicit decision procedure for labeling a recovery as fully automated versus expert-augmented. These materials will appear in Section 4 together with a new appendix containing the full templates and classification logic. revision: yes

  2. Referee: Property-extraction pipeline (described in §3): while the multi-model study correctly identifies property quality as the binding constraint, the central claim that derived properties are “complete” and “accurate” for downstream detection would be strengthened by a quantitative completeness metric (e.g., fraction of spec-stated invariants successfully extracted) or a small-scale human validation of the extracted property sets on a held-out specification; without this, the 100% recall on RepoAudit remains tied to the specific benchmarks rather than shown to generalize.

    Authors: We recognize that an explicit completeness metric or held-out human validation would further support generalization claims. Because natural-language specifications do not supply an enumerated ground-truth set of invariants, a direct quantitative completeness fraction would require additional formalization and could introduce subjectivity. We will therefore add a small-scale human validation study of extracted property sets on a held-out specification in the revised version and will expand the limitations discussion in Section 3 to clarify the scope of the 100% recall result. revision: partial

Circularity Check

0 steps flagged

No significant circularity identified

full rationale

The paper presents an empirical framework (SPECA) evaluated on external public benchmarks (Sherlock Ethereum Fusaka contest with 366 submissions and RepoAudit C/C++ benchmark). All reported metrics (vulnerability recovery, precision/recall, extra bugs) are grounded in traceable external findings and author/external validation rather than internal fits or self-referential definitions. No load-bearing self-citations, ansatzes, or predictions that reduce to fitted inputs are present; property generation quality is explicitly identified as the limiting factor with multi-model controls.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The framework rests on the unverified capability of LLMs to extract complete security properties from natural language; no explicit free parameters or invented entities are stated, but implicit choices in model selection and property categorization are required for the pipeline to function.

axioms (1)
  • domain assumption Large language models can accurately and completely extract categorized security properties (invariants, pre/postconditions, trust assumptions) from natural-language specifications.
    This is the binding step that enables all downstream cross-implementation detections and is invoked in the property-generation phase.

pith-pipeline@v0.9.0 · 5606 in / 1346 out tokens · 87606 ms · 2026-05-07T13:13:23.583344+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

21 extracted references · 21 canonical work pages

  1. [1]

    Md Sakib Anwar, Carter Yagemann, and Zhiqiang Lin. 2025. GoSonar: Detecting Logical Vulnerabilities in Memory Safe Language Using Inductive Constraint Reasoning . In2025 IEEE Symposium on Security and Privacy (SP). IEEE Computer Society, Los Alamitos, CA, USA, 758–773. doi:10.1109/SP61157.2025.00043

  2. [2]

    Myrto Arapinis, Véronique Cortier, Hubert de Groote, Charlie Jacomme, and Steve Kremer. 2026. Are ideal functionalities really ideal? https://inria.hal.science/hal- 05422941

  3. [3]

    Daniel Arp, Erwin Quiring, Feargus Pendlebury, Alexander Warnecke, Fabio Pierazzi, Christian Wressnegger, Lorenzo Cavallaro, and Konrad Rieck. 2022. Dos and Don’ts of Machine Learning in Computer Security. In31st USENIX Security Symposium (USENIX Security 22). USENIX Association, Boston, MA, 3971–3988. https://www.usenix.org/conference/usenixsecurity22/pr...

  4. [4]

    Kellas, Vasileios P

    Yaniv David, Neophytos Christou, Andreas D. Kellas, Vasileios P. Kemerlis, and Junfeng Yang. 2024. QUACK: Hindering Deserialization Attacks via Static Duck Typing. InProceedings of the 31st Network and Distributed System Security Sym- posium (NDSS)

  5. [5]

    Yangruibo Ding, Yanjun Fu, Omniyyah Ibrahim, Chawin Sitawarin, Xinyun Chen, Basel Alomair, David Wagner, Baishakhi Ray, and Yizheng Chen. 2025. Vulnerabil- ity Detection with Code Language Models: How Far Are We?. InProceedings of the IEEE/ACM 47th International Conference on Software Engineering(Ottawa, Ontario, Canada)(ICSE ’25). IEEE Press, 1729–1741. ...

  6. [6]

    Jinyao Guo, Chengpeng Wang, Xiangzhe Xu, Zian Su, and Xiangyu Zhang. 2025. RepoAudit: An Autonomous LLM-Agent for Repository-Level Code Auditing. In Proceedings of the 42nd International Conference on Machine Learning (Proceedings of Machine Learning Research, Vol. 267), Aarti Singh, Maryam Fazel, Daniel Hsu, Simon Lacoste-Julien, Felix Berkenkamp, Tegan ...

  7. [7]

    Haolin Jin and Huaming Chen. 2025. Uncovering Systematic Failures of LLMs in Verifying Code Against Natural Language Specifications. In2025 40th IEEE/ACM International Conference on Automated Software Engineering (ASE). 3819–3823. doi:10.1109/ASE63991.2025.00323

  8. [8]

    Thanh Le-Cong, Bach Le, and Toby Murray. 2025. Can LLMs Reason About Program Semantics? A Comprehensive Evaluation of LLMs on Formal Specifica- tion Inference. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar (Eds.)...

  9. [9]

    Ahmed Lekssays, Hamza Mouhcine, Khang Tran, Ting Yu, and Issa Khalil. 2025. LLMxCPG: Context-Aware Vulnerability Detection Through Code Property Graph-Guided Large Language Models. InProceedings of the 34th USENIX Security Symposium. https://www.usenix.org/conference/usenixsecurity25/presentation/ lekssays

  10. [10]

    Haonan Li, Yu Hao, Yizhuo Zhai, and Zhiyun Qian. 2024. Enhancing Static Analysis for Practical Bug Detection: An LLM-Integrated Approach.Proc. ACM Program. Lang.8, OOPSLA1, Article 111 (April 2024), 26 pages. doi:10.1145/ 3649828

  11. [12]

    Ziyang Li, Saikat Dutta, and Mayur Naik. 2025. IRIS: LLM-Assisted Static Analysis for Detecting Security Vulnerabilities. InThe Thirteenth International Conference on Learning Representations. https://openreview.net/forum?id=9LdJDU7E91

  12. [13]

    Ye Liu, Yue Xue, Daoyuan Wu, Yuqiang Sun, Yi Li, Miaolei Shi, and Yang Liu

  13. [14]

    InProceedings of the 32nd Network and Distributed System Security Symposium (NDSS)

    PropertyGPT: LLM-driven Formal Verification of Smart Contracts through Retrieval-Augmented Property Generation. InProceedings of the 32nd Network and Distributed System Security Symposium (NDSS). Distinguished Paper Award

  14. [15]

    Wei Ma, Daoyuan Wu, Yuqiang Sun, Tianwen Wang, Shangqing Liu, Jian Zhang, Yue Xue, and Yang Liu. 2025. Combining Fine-Tuning and LLM-Based Agents for Intuitive Smart Contract Auditing with Justifications. InProceedings of the IEEE/ACM 47th International Conference on Software Engineering(Ottawa, Ontario, Canada)(ICSE ’25). IEEE Press, 1742–1754. doi:10.11...

  15. [16]

    Kemerlis

    Marius Momeu, Fabian Kilger, Christopher Roemheld, Simon Schnückel, Sergej Proskurin, Michalis Polychronakis, and Vasileios P. Kemerlis. 2024. ISLAB: Im- mutable Memory Management Metadata for Commodity Operating System Ker- nels. InProceedings of the 19th ACM Asia Conference on Computer and Communica- tions Security(Singapore, Singapore)(ASIA CCS ’24). A...

  16. [17]

    1992.Semantics with Applications: A Formal Introduction

    Hanne Riis Nielson and Flemming Nielson. 1992.Semantics with Applications: A Formal Introduction. John Wiley & Sons, Chichester, England

  17. [18]

    Lianglu Pan, Shaanan Cohney, Toby Murray, and Van-Thuan Pham. 2024. EDE- Fuzz: A Web API Fuzzer for Excessive Data Exposures. InProceedings of the IEEE/ACM 46th International Conference on Software Engineering(Lisbon, Por- tugal)(ICSE ’24). Association for Computing Machinery, New York, NY, USA, Article 45, 12 pages. doi:10.1145/3597503.3608133

  18. [19]

    Yuqiang Sun, Daoyuan Wu, Yue Xue, Han Liu, Haijun Wang, Zhengzi Xu, Xiaofei Xie, and Yang Liu. 2024. GPTScan: Detecting Logic Vulnerabilities in Smart Contracts by Combining GPT with Program Analysis . In2024 IEEE/ACM 46th International Conference on Software Engineering (ICSE). IEEE Computer Society, Los Alamitos, CA, USA, 2048–2060. doi:10.1145/3597503.3639117

  19. [20]

    Saad Ullah, Mingji Han, Saurabh Pujar, Hammond Pearce, Ayse Coskun, and Gianluca Stringhini. 2024. LLMs Cannot Reliably Identify and Reason About Security Vulnerabilities (Yet?): A Comprehensive Evaluation, Framework, and Benchmarks . In2024 IEEE Symposium on Security and Privacy (SP). IEEE Com- puter Society, Los Alamitos, CA, USA, 862–880. doi:10.1109/S...

  20. [21]

    Sarisht Wadhwa, Luca Zanolini, Aditya Asgaonkar, Francesco D’Amato, Chengrui Fang, Fan Zhang, and Kartik Nayak. 2024. Data Independent Order Policy En- forcement: Limitations and Solutions. InProceedings of the 2024 on ACM SIGSAC Conference on Computer and Communications Security(Salt Lake City, UT, USA) (CCS ’24). Association for Computing Machinery, New...

  21. [22]

    Nodes MUST maintain custody of at least CUSTODY_REQUIREMENT columns

    Mingwei Zheng, Chengpeng Wang, Xuwei Liu, Jinyao Guo, Shiwei Feng, and Xiangyu Zhang. 2025. RFCAudit: AI Agent for Auditing Protocol Implementations Against RFC Specifications. In2025 40th IEEE/ACM International Conference on Automated Software Engineering (ASE). 1221–1233. doi:10.1109/ASE63991.2025. 00105 Open Science In accordance with the ACM CCS 2026 ...