Beyond Code Reasoning: Specification-Anchored Auditing of Multi-Implementation Distributed Protocols
Pith reviewed 2026-05-07 13:13 UTC · model grok-4.3
The pith
Deriving security properties from natural-language specifications enables consistent auditing of multiple protocol implementations and catches specification-level bugs missed by code-only analysis.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that an LLM-driven framework can derive categorized security properties from natural language specifications and reuse them across multiple implementations to ground audits in specification requirements. This allows controlled comparisons between clients, detections based on invariants not present in code patterns, and tracing of errors to specific stages like property generation. In the Sherlock Ethereum Fusaka contest involving 10 targets and 366 submissions, the method recovers all 15 in-scope high/medium/low vulnerabilities with expert augmentation for some and fully automated for others, while also identifying 4 fix-confirmed new bugs including a cryptographic-invar
What carries the argument
SPECA, the specification-anchored auditing framework that extracts and categorizes security properties from specifications for cross-implementation reuse and comparison.
Load-bearing premise
Large language models are able to extract complete, accurate, and properly categorized security properties from natural-language specifications without introducing critical omissions or hallucinations.
What would settle it
Running the framework on a new multi-implementation protocol audit where it either fails to detect a known specification-based vulnerability due to incomplete property derivation or produces detections based on hallucinated properties that do not correspond to actual issues in the code.
Figures
read the original abstract
Code-driven auditing fails when correctness depends on what the specification requires rather than how the code is written. Production blockchain networks expose this directly: byzantine consensus runs many independent clients of a shared specification, so a specification-divergence defect in one client can fork the network or halt finality. Existing tools reason one repository at a time, with no shared baseline held constant across implementations. We present SPECA, an LLM-driven audit framework that derives explicit, categorized security properties (invariants, pre/postconditions, trust assumptions) from natural-language specifications and reuses them across implementations. SPECA enables controlled cross-implementation comparison, detections grounded in specification invariants no code pattern encodes, and false positives traceable to a specific pipeline phase rather than opaque model errors. On the Sherlock Ethereum Fusaka Audit Contest (10 targets, 366 submissions), SPECA recovers all 15 in-scope H/M/L vulnerabilities expert-augmented (8/15 automated-only) and surfaces 4 fix-confirmed bugs, including a cryptographic-invariant violation missed by every adjudicated finding. On the RepoAudit C/C++ benchmark, SPECA reaches 88.9% precision at 100% recall (F1=0.94) and surfaces 12 author-validated bugs beyond ground truth, two externally validated. SPECA also flags 5 of RepoAudit's 40 published bugs as defensive-coding fixes with no reachable exploit path. False positives trace to three pipeline-pinned root causes; a multi-model study identifies property-generation quality as the binding constraint. End-to-end cost is ~$30 per H/M/L bug (~42 min wall-clock under parallel execution).
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces SPECA, an LLM-driven auditing framework that extracts explicit, categorized security properties (invariants, pre/postconditions, trust assumptions) from natural-language specifications of distributed protocols and reuses them for cross-implementation comparison. It targets specification-divergence defects in multi-client systems such as blockchain networks. On the Sherlock Ethereum Fusaka contest (10 targets, 366 submissions), SPECA recovers all 15 in-scope H/M/L vulnerabilities (8 fully automated) and identifies 4 additional fix-confirmed bugs, including a cryptographic-invariant violation missed by all adjudicated findings. On the RepoAudit C/C++ benchmark it achieves 100% recall and 88.9% precision (F1=0.94), surfaces 12 author-validated bugs beyond ground truth (2 externally validated), and correctly flags 5 published bugs as non-exploitable defensive coding. False positives are traced to three pipeline stages; a multi-model ablation identifies property-generation quality as the primary limit. End-to-end cost is reported as ~$30 per H/M/L bug.
Significance. If the empirical results hold under replication, the work provides a concrete, reusable method for specification-anchored auditing that directly addresses the gap between single-repository code analysis and the reality of multiple independent implementations of a shared spec. The full recovery of contest vulnerabilities, discovery of new bugs, traceable false-positive analysis, and explicit identification of the property-generation bottleneck constitute measurable progress over purely code-driven approaches. The use of public benchmarks with author- and externally-validated findings, together with cost and wall-clock metrics, makes the contribution immediately usable by auditors and researchers working on blockchain and distributed-system security.
major comments (2)
- [§4] §4 (Sherlock Evaluation): the distinction between the 8 fully automated recoveries and the 7 expert-augmented recoveries is load-bearing for the claim of reduced human intervention; the manuscript should provide the precise prompting templates, property-categorization rules, and decision procedure used to classify a detection as automated-only versus expert-augmented so that the 8/15 figure can be reproduced.
- [§3] Property-extraction pipeline (described in §3): while the multi-model study correctly identifies property quality as the binding constraint, the central claim that derived properties are “complete” and “accurate” for downstream detection would be strengthened by a quantitative completeness metric (e.g., fraction of spec-stated invariants successfully extracted) or a small-scale human validation of the extracted property sets on a held-out specification; without this, the 100% recall on RepoAudit remains tied to the specific benchmarks rather than shown to generalize.
minor comments (2)
- [Abstract] Abstract and §5: the reported ~$30 / ~42 min figures should state the exact LLM models, temperature settings, and parallelization hardware used, as these directly affect reproducibility of the cost claim.
- [Evaluation] Table or figure summarizing false-positive root causes: adding a compact table that cross-tabulates the three pipeline-pinned causes across both benchmarks would improve readability and allow readers to assess the relative contribution of each phase.
Simulated Author's Rebuttal
We thank the referee for the constructive comments and positive recommendation for minor revision. We address each major comment point by point below.
read point-by-point responses
-
Referee: §4 (Sherlock Evaluation): the distinction between the 8 fully automated recoveries and the 7 expert-augmented recoveries is load-bearing for the claim of reduced human intervention; the manuscript should provide the precise prompting templates, property-categorization rules, and decision procedure used to classify a detection as automated-only versus expert-augmented so that the 8/15 figure can be reproduced.
Authors: We agree that reproducibility of the 8/15 split is essential to substantiate the reduced-human-intervention claim. In the revised manuscript we will add the exact prompting templates used for property extraction and detection, the categorization rules for invariants/pre/postconditions/trust assumptions, and the explicit decision procedure for labeling a recovery as fully automated versus expert-augmented. These materials will appear in Section 4 together with a new appendix containing the full templates and classification logic. revision: yes
-
Referee: Property-extraction pipeline (described in §3): while the multi-model study correctly identifies property quality as the binding constraint, the central claim that derived properties are “complete” and “accurate” for downstream detection would be strengthened by a quantitative completeness metric (e.g., fraction of spec-stated invariants successfully extracted) or a small-scale human validation of the extracted property sets on a held-out specification; without this, the 100% recall on RepoAudit remains tied to the specific benchmarks rather than shown to generalize.
Authors: We recognize that an explicit completeness metric or held-out human validation would further support generalization claims. Because natural-language specifications do not supply an enumerated ground-truth set of invariants, a direct quantitative completeness fraction would require additional formalization and could introduce subjectivity. We will therefore add a small-scale human validation study of extracted property sets on a held-out specification in the revised version and will expand the limitations discussion in Section 3 to clarify the scope of the 100% recall result. revision: partial
Circularity Check
No significant circularity identified
full rationale
The paper presents an empirical framework (SPECA) evaluated on external public benchmarks (Sherlock Ethereum Fusaka contest with 366 submissions and RepoAudit C/C++ benchmark). All reported metrics (vulnerability recovery, precision/recall, extra bugs) are grounded in traceable external findings and author/external validation rather than internal fits or self-referential definitions. No load-bearing self-citations, ansatzes, or predictions that reduce to fitted inputs are present; property generation quality is explicitly identified as the limiting factor with multi-model controls.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Large language models can accurately and completely extract categorized security properties (invariants, pre/postconditions, trust assumptions) from natural-language specifications.
Reference graph
Works this paper leans on
-
[1]
Md Sakib Anwar, Carter Yagemann, and Zhiqiang Lin. 2025. GoSonar: Detecting Logical Vulnerabilities in Memory Safe Language Using Inductive Constraint Reasoning . In2025 IEEE Symposium on Security and Privacy (SP). IEEE Computer Society, Los Alamitos, CA, USA, 758–773. doi:10.1109/SP61157.2025.00043
-
[2]
Myrto Arapinis, Véronique Cortier, Hubert de Groote, Charlie Jacomme, and Steve Kremer. 2026. Are ideal functionalities really ideal? https://inria.hal.science/hal- 05422941
work page 2026
-
[3]
Daniel Arp, Erwin Quiring, Feargus Pendlebury, Alexander Warnecke, Fabio Pierazzi, Christian Wressnegger, Lorenzo Cavallaro, and Konrad Rieck. 2022. Dos and Don’ts of Machine Learning in Computer Security. In31st USENIX Security Symposium (USENIX Security 22). USENIX Association, Boston, MA, 3971–3988. https://www.usenix.org/conference/usenixsecurity22/pr...
work page 2022
-
[4]
Yaniv David, Neophytos Christou, Andreas D. Kellas, Vasileios P. Kemerlis, and Junfeng Yang. 2024. QUACK: Hindering Deserialization Attacks via Static Duck Typing. InProceedings of the 31st Network and Distributed System Security Sym- posium (NDSS)
work page 2024
-
[5]
Yangruibo Ding, Yanjun Fu, Omniyyah Ibrahim, Chawin Sitawarin, Xinyun Chen, Basel Alomair, David Wagner, Baishakhi Ray, and Yizheng Chen. 2025. Vulnerabil- ity Detection with Code Language Models: How Far Are We?. InProceedings of the IEEE/ACM 47th International Conference on Software Engineering(Ottawa, Ontario, Canada)(ICSE ’25). IEEE Press, 1729–1741. ...
-
[6]
Jinyao Guo, Chengpeng Wang, Xiangzhe Xu, Zian Su, and Xiangyu Zhang. 2025. RepoAudit: An Autonomous LLM-Agent for Repository-Level Code Auditing. In Proceedings of the 42nd International Conference on Machine Learning (Proceedings of Machine Learning Research, Vol. 267), Aarti Singh, Maryam Fazel, Daniel Hsu, Simon Lacoste-Julien, Felix Berkenkamp, Tegan ...
work page 2025
-
[7]
Haolin Jin and Huaming Chen. 2025. Uncovering Systematic Failures of LLMs in Verifying Code Against Natural Language Specifications. In2025 40th IEEE/ACM International Conference on Automated Software Engineering (ASE). 3819–3823. doi:10.1109/ASE63991.2025.00323
-
[8]
Thanh Le-Cong, Bach Le, and Toby Murray. 2025. Can LLMs Reason About Program Semantics? A Comprehensive Evaluation of LLMs on Formal Specifica- tion Inference. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar (Eds.)...
-
[9]
Ahmed Lekssays, Hamza Mouhcine, Khang Tran, Ting Yu, and Issa Khalil. 2025. LLMxCPG: Context-Aware Vulnerability Detection Through Code Property Graph-Guided Large Language Models. InProceedings of the 34th USENIX Security Symposium. https://www.usenix.org/conference/usenixsecurity25/presentation/ lekssays
work page 2025
-
[10]
Haonan Li, Yu Hao, Yizhuo Zhai, and Zhiyun Qian. 2024. Enhancing Static Analysis for Practical Bug Detection: An LLM-Integrated Approach.Proc. ACM Program. Lang.8, OOPSLA1, Article 111 (April 2024), 26 pages. doi:10.1145/ 3649828
work page 2024
-
[12]
Ziyang Li, Saikat Dutta, and Mayur Naik. 2025. IRIS: LLM-Assisted Static Analysis for Detecting Security Vulnerabilities. InThe Thirteenth International Conference on Learning Representations. https://openreview.net/forum?id=9LdJDU7E91
work page 2025
-
[13]
Ye Liu, Yue Xue, Daoyuan Wu, Yuqiang Sun, Yi Li, Miaolei Shi, and Yang Liu
-
[14]
InProceedings of the 32nd Network and Distributed System Security Symposium (NDSS)
PropertyGPT: LLM-driven Formal Verification of Smart Contracts through Retrieval-Augmented Property Generation. InProceedings of the 32nd Network and Distributed System Security Symposium (NDSS). Distinguished Paper Award
-
[15]
Wei Ma, Daoyuan Wu, Yuqiang Sun, Tianwen Wang, Shangqing Liu, Jian Zhang, Yue Xue, and Yang Liu. 2025. Combining Fine-Tuning and LLM-Based Agents for Intuitive Smart Contract Auditing with Justifications. InProceedings of the IEEE/ACM 47th International Conference on Software Engineering(Ottawa, Ontario, Canada)(ICSE ’25). IEEE Press, 1742–1754. doi:10.11...
-
[16]
Marius Momeu, Fabian Kilger, Christopher Roemheld, Simon Schnückel, Sergej Proskurin, Michalis Polychronakis, and Vasileios P. Kemerlis. 2024. ISLAB: Im- mutable Memory Management Metadata for Commodity Operating System Ker- nels. InProceedings of the 19th ACM Asia Conference on Computer and Communica- tions Security(Singapore, Singapore)(ASIA CCS ’24). A...
-
[17]
1992.Semantics with Applications: A Formal Introduction
Hanne Riis Nielson and Flemming Nielson. 1992.Semantics with Applications: A Formal Introduction. John Wiley & Sons, Chichester, England
work page 1992
-
[18]
Lianglu Pan, Shaanan Cohney, Toby Murray, and Van-Thuan Pham. 2024. EDE- Fuzz: A Web API Fuzzer for Excessive Data Exposures. InProceedings of the IEEE/ACM 46th International Conference on Software Engineering(Lisbon, Por- tugal)(ICSE ’24). Association for Computing Machinery, New York, NY, USA, Article 45, 12 pages. doi:10.1145/3597503.3608133
-
[19]
Yuqiang Sun, Daoyuan Wu, Yue Xue, Han Liu, Haijun Wang, Zhengzi Xu, Xiaofei Xie, and Yang Liu. 2024. GPTScan: Detecting Logic Vulnerabilities in Smart Contracts by Combining GPT with Program Analysis . In2024 IEEE/ACM 46th International Conference on Software Engineering (ICSE). IEEE Computer Society, Los Alamitos, CA, USA, 2048–2060. doi:10.1145/3597503.3639117
-
[20]
Saad Ullah, Mingji Han, Saurabh Pujar, Hammond Pearce, Ayse Coskun, and Gianluca Stringhini. 2024. LLMs Cannot Reliably Identify and Reason About Security Vulnerabilities (Yet?): A Comprehensive Evaluation, Framework, and Benchmarks . In2024 IEEE Symposium on Security and Privacy (SP). IEEE Com- puter Society, Los Alamitos, CA, USA, 862–880. doi:10.1109/S...
-
[21]
Sarisht Wadhwa, Luca Zanolini, Aditya Asgaonkar, Francesco D’Amato, Chengrui Fang, Fan Zhang, and Kartik Nayak. 2024. Data Independent Order Policy En- forcement: Limitations and Solutions. InProceedings of the 2024 on ACM SIGSAC Conference on Computer and Communications Security(Salt Lake City, UT, USA) (CCS ’24). Association for Computing Machinery, New...
-
[22]
Nodes MUST maintain custody of at least CUSTODY_REQUIREMENT columns
Mingwei Zheng, Chengpeng Wang, Xuwei Liu, Jinyao Guo, Shiwei Feng, and Xiangyu Zhang. 2025. RFCAudit: AI Agent for Auditing Protocol Implementations Against RFC Specifications. In2025 40th IEEE/ACM International Conference on Automated Software Engineering (ASE). 1221–1233. doi:10.1109/ASE63991.2025. 00105 Open Science In accordance with the ACM CCS 2026 ...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.