Beyond Code Reasoning: Specification-Anchored Auditing of Multi-Implementation Distributed Protocols

Akiyoshi Sannai; Hirotake Murakami; Masato Kamba

arxiv: 2604.26495 · v2 · submitted 2026-04-29 · 💻 cs.CR

Beyond Code Reasoning: Specification-Anchored Auditing of Multi-Implementation Distributed Protocols

Masato Kamba , Hirotake Murakami , Akiyoshi Sannai This is my paper

Pith reviewed 2026-05-07 13:13 UTC · model grok-4.3

classification 💻 cs.CR

keywords specification-based auditingLLM-driven security analysismulti-implementation distributed protocolsblockchain consensus securitysecurity property derivationcross-implementation comparisoninvariant violation detection

0 comments

The pith

Deriving security properties from natural-language specifications enables consistent auditing of multiple protocol implementations and catches specification-level bugs missed by code-only analysis.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Auditing distributed protocols like blockchain consensus is difficult because correctness depends on adherence to a shared specification rather than the details of any one code implementation. This paper presents a framework that uses large language models to extract explicit, categorized security properties such as invariants, preconditions, and trust assumptions directly from natural-language specifications. These properties then serve as a fixed baseline for examining different independent implementations of the same protocol. The result is the ability to perform cross-implementation comparisons and detect divergences or invariant violations that code reasoning alone cannot identify. Evaluations on a major Ethereum audit contest and a C/C++ benchmark demonstrate recovery of all known vulnerabilities plus discovery of additional confirmed bugs, with traceable false positives.

Core claim

The central claim is that an LLM-driven framework can derive categorized security properties from natural language specifications and reuse them across multiple implementations to ground audits in specification requirements. This allows controlled comparisons between clients, detections based on invariants not present in code patterns, and tracing of errors to specific stages like property generation. In the Sherlock Ethereum Fusaka contest involving 10 targets and 366 submissions, the method recovers all 15 in-scope high/medium/low vulnerabilities with expert augmentation for some and fully automated for others, while also identifying 4 fix-confirmed new bugs including a cryptographic-invar

What carries the argument

SPECA, the specification-anchored auditing framework that extracts and categorizes security properties from specifications for cross-implementation reuse and comparison.

Load-bearing premise

Large language models are able to extract complete, accurate, and properly categorized security properties from natural-language specifications without introducing critical omissions or hallucinations.

What would settle it

Running the framework on a new multi-implementation protocol audit where it either fails to detect a known specification-based vulnerability due to incomplete property derivation or produces detections based on hallucinated properties that do not correspond to actual issues in the code.

Figures

Figures reproduced from arXiv: 2604.26495 by Akiyoshi Sannai, Hirotake Murakami, Masato Kamba.

**Figure 1.** Figure 1: The SPECA pipeline. Phases 1–3 (Knowledge Structuring) derive security properties from specifications; Phases 4–6 (Systematic Auditing) verify properties against implementations. In multi-implementation settings, the left stage executes once and the right stage executes per target view at source ↗

**Figure 2.** Figure 2: Phase 5 audit flow for a single property. Each property undergoes three sub-phases: view at source ↗

**Figure 3.** Figure 3: Phase 5 (Audit) vs Phase 6 (FP Filter). Phase 6 im view at source ↗

**Figure 5.** Figure 5: Property family to ground truth issue flow. The view at source ↗

**Figure 6.** Figure 6: Per-repository finding distribution (Phase 5 vs view at source ↗

**Figure 7.** Figure 7: Phase 6 three-gate FP filter effectiveness (N=30 DIS view at source ↗

**Figure 10.** Figure 10: Partially controlled comparison across model back view at source ↗

**Figure 9.** Figure 9: True positives vs false positives across all tools. view at source ↗

**Figure 11.** Figure 11: Cost vs. detection performance across tools on view at source ↗

**Figure 12.** Figure 12: SPECA output quality breakdown (N=102). Of 102 view at source ↗

**Figure 14.** Figure 14: False-positive root-cause taxonomy with pipeline view at source ↗

**Figure 13.** Figure 13: Issue × property family TP finding counts (postreview). The property neighborhood effect is visible as horizontal bands where a single issue is detected by multiple complementary properties; shared vertical structure across repositories reflects the cross-implementation comparability the property vocabulary enables view at source ↗

**Figure 15.** Figure 15: Precision by property type (final output). view at source ↗

**Figure 17.** Figure 17: Bug type breakdown (NPD / MLK / UAF) across view at source ↗

**Figure 16.** Figure 16: Precision vs total bugs found across all tools and view at source ↗

**Figure 19.** Figure 19: Bug detection matrix (40 total bugs: 35 non view at source ↗

read the original abstract

Code-driven auditing fails when correctness depends on what the specification requires rather than how the code is written. Production blockchain networks expose this directly: byzantine consensus runs many independent clients of a shared specification, so a specification-divergence defect in one client can fork the network or halt finality. Existing tools reason one repository at a time, with no shared baseline held constant across implementations. We present SPECA, an LLM-driven audit framework that derives explicit, categorized security properties (invariants, pre/postconditions, trust assumptions) from natural-language specifications and reuses them across implementations. SPECA enables controlled cross-implementation comparison, detections grounded in specification invariants no code pattern encodes, and false positives traceable to a specific pipeline phase rather than opaque model errors. On the Sherlock Ethereum Fusaka Audit Contest (10 targets, 366 submissions), SPECA recovers all 15 in-scope H/M/L vulnerabilities expert-augmented (8/15 automated-only) and surfaces 4 fix-confirmed bugs, including a cryptographic-invariant violation missed by every adjudicated finding. On the RepoAudit C/C++ benchmark, SPECA reaches 88.9% precision at 100% recall (F1=0.94) and surfaces 12 author-validated bugs beyond ground truth, two externally validated. SPECA also flags 5 of RepoAudit's 40 published bugs as defensive-coding fixes with no reachable exploit path. False positives trace to three pipeline-pinned root causes; a multi-model study identifies property-generation quality as the binding constraint. End-to-end cost is ~$30 per H/M/L bug (~42 min wall-clock under parallel execution).

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

SPECA extracts security properties from specs with LLMs and reuses them to audit multiple protocol implementations, recovering all vulnerabilities on the Sherlock benchmark plus extras on RepoAudit.

read the letter

The paper's main contribution is a framework called SPECA that pulls security properties like invariants and pre/postconditions from a protocol's natural language specification and then uses those to audit multiple different code implementations of it. This is aimed at cases like blockchain clients where the spec is the common reference but code can diverge. What works here is the cross-implementation reuse. Prior tools tend to analyze one repository at a time. By anchoring on the spec, SPECA can compare findings across clients and ground detections in requirements that no single code pattern captures. On the Sherlock Ethereum Fusaka contest with 10 targets and 366 submissions, it recovered all 15 in-scope vulnerabilities, with 8 found automatically without expert augmentation, and it found 4 additional bugs that were fix-confirmed, one being a cryptographic invariant violation missed by others. The RepoAudit results are also concrete: 100% recall, 88.9% precision, F1 of 0.94, plus 12 extra author-validated bugs and two externally validated. It even flagged 5 of the 40 published bugs as non-exploitable defensive fixes. False positives are pinned to three specific pipeline stages, and a multi-model experiment points to property generation as the current limit rather than other steps. The evaluation avoids circularity by using public benchmarks with traceable validations. The authors are clear that property quality from the LLM is the binding constraint, which keeps the claims proportionate. A soft spot is the dependence on LLM extraction of properties. While they test multiple models and trace issues, complete and accurate property sets from specs are not guaranteed, and any omission would affect downstream audit coverage. The paper does not provide detailed prompt engineering or validation procedures beyond the results, so reproducibility of the property step would need checking. This is for auditors and researchers focused on multi-client distributed systems, particularly in security-critical infrastructure like consensus protocols. Readers interested in LLM-assisted auditing that stays tied to specs rather than pure code patterns would get practical value. It deserves a serious referee because the results are on external benchmarks with clear metrics and the method addresses a documented production issue without overclaiming generality. I recommend sending it to peer review.

Referee Report

2 major / 2 minor

Summary. The paper introduces SPECA, an LLM-driven auditing framework that extracts explicit, categorized security properties (invariants, pre/postconditions, trust assumptions) from natural-language specifications of distributed protocols and reuses them for cross-implementation comparison. It targets specification-divergence defects in multi-client systems such as blockchain networks. On the Sherlock Ethereum Fusaka contest (10 targets, 366 submissions), SPECA recovers all 15 in-scope H/M/L vulnerabilities (8 fully automated) and identifies 4 additional fix-confirmed bugs, including a cryptographic-invariant violation missed by all adjudicated findings. On the RepoAudit C/C++ benchmark it achieves 100% recall and 88.9% precision (F1=0.94), surfaces 12 author-validated bugs beyond ground truth (2 externally validated), and correctly flags 5 published bugs as non-exploitable defensive coding. False positives are traced to three pipeline stages; a multi-model ablation identifies property-generation quality as the primary limit. End-to-end cost is reported as ~$30 per H/M/L bug.

Significance. If the empirical results hold under replication, the work provides a concrete, reusable method for specification-anchored auditing that directly addresses the gap between single-repository code analysis and the reality of multiple independent implementations of a shared spec. The full recovery of contest vulnerabilities, discovery of new bugs, traceable false-positive analysis, and explicit identification of the property-generation bottleneck constitute measurable progress over purely code-driven approaches. The use of public benchmarks with author- and externally-validated findings, together with cost and wall-clock metrics, makes the contribution immediately usable by auditors and researchers working on blockchain and distributed-system security.

major comments (2)

[§4] §4 (Sherlock Evaluation): the distinction between the 8 fully automated recoveries and the 7 expert-augmented recoveries is load-bearing for the claim of reduced human intervention; the manuscript should provide the precise prompting templates, property-categorization rules, and decision procedure used to classify a detection as automated-only versus expert-augmented so that the 8/15 figure can be reproduced.
[§3] Property-extraction pipeline (described in §3): while the multi-model study correctly identifies property quality as the binding constraint, the central claim that derived properties are “complete” and “accurate” for downstream detection would be strengthened by a quantitative completeness metric (e.g., fraction of spec-stated invariants successfully extracted) or a small-scale human validation of the extracted property sets on a held-out specification; without this, the 100% recall on RepoAudit remains tied to the specific benchmarks rather than shown to generalize.

minor comments (2)

[Abstract] Abstract and §5: the reported ~$30 / ~42 min figures should state the exact LLM models, temperature settings, and parallelization hardware used, as these directly affect reproducibility of the cost claim.
[Evaluation] Table or figure summarizing false-positive root causes: adding a compact table that cross-tabulates the three pipeline-pinned causes across both benchmarks would improve readability and allow readers to assess the relative contribution of each phase.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments and positive recommendation for minor revision. We address each major comment point by point below.

read point-by-point responses

Referee: §4 (Sherlock Evaluation): the distinction between the 8 fully automated recoveries and the 7 expert-augmented recoveries is load-bearing for the claim of reduced human intervention; the manuscript should provide the precise prompting templates, property-categorization rules, and decision procedure used to classify a detection as automated-only versus expert-augmented so that the 8/15 figure can be reproduced.

Authors: We agree that reproducibility of the 8/15 split is essential to substantiate the reduced-human-intervention claim. In the revised manuscript we will add the exact prompting templates used for property extraction and detection, the categorization rules for invariants/pre/postconditions/trust assumptions, and the explicit decision procedure for labeling a recovery as fully automated versus expert-augmented. These materials will appear in Section 4 together with a new appendix containing the full templates and classification logic. revision: yes
Referee: Property-extraction pipeline (described in §3): while the multi-model study correctly identifies property quality as the binding constraint, the central claim that derived properties are “complete” and “accurate” for downstream detection would be strengthened by a quantitative completeness metric (e.g., fraction of spec-stated invariants successfully extracted) or a small-scale human validation of the extracted property sets on a held-out specification; without this, the 100% recall on RepoAudit remains tied to the specific benchmarks rather than shown to generalize.

Authors: We recognize that an explicit completeness metric or held-out human validation would further support generalization claims. Because natural-language specifications do not supply an enumerated ground-truth set of invariants, a direct quantitative completeness fraction would require additional formalization and could introduce subjectivity. We will therefore add a small-scale human validation study of extracted property sets on a held-out specification in the revised version and will expand the limitations discussion in Section 3 to clarify the scope of the 100% recall result. revision: partial

Circularity Check

0 steps flagged

No significant circularity identified

full rationale

The paper presents an empirical framework (SPECA) evaluated on external public benchmarks (Sherlock Ethereum Fusaka contest with 366 submissions and RepoAudit C/C++ benchmark). All reported metrics (vulnerability recovery, precision/recall, extra bugs) are grounded in traceable external findings and author/external validation rather than internal fits or self-referential definitions. No load-bearing self-citations, ansatzes, or predictions that reduce to fitted inputs are present; property generation quality is explicitly identified as the limiting factor with multi-model controls.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The framework rests on the unverified capability of LLMs to extract complete security properties from natural language; no explicit free parameters or invented entities are stated, but implicit choices in model selection and property categorization are required for the pipeline to function.

axioms (1)

domain assumption Large language models can accurately and completely extract categorized security properties (invariants, pre/postconditions, trust assumptions) from natural-language specifications.
This is the binding step that enables all downstream cross-implementation detections and is invoked in the property-generation phase.

pith-pipeline@v0.9.0 · 5606 in / 1346 out tokens · 87606 ms · 2026-05-07T13:13:23.583344+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

21 extracted references · 21 canonical work pages

[1]

Md Sakib Anwar, Carter Yagemann, and Zhiqiang Lin. 2025. GoSonar: Detecting Logical Vulnerabilities in Memory Safe Language Using Inductive Constraint Reasoning . In2025 IEEE Symposium on Security and Privacy (SP). IEEE Computer Society, Los Alamitos, CA, USA, 758–773. doi:10.1109/SP61157.2025.00043

work page doi:10.1109/sp61157.2025.00043 2025
[2]

Myrto Arapinis, Véronique Cortier, Hubert de Groote, Charlie Jacomme, and Steve Kremer. 2026. Are ideal functionalities really ideal? https://inria.hal.science/hal- 05422941

work page 2026
[3]

Daniel Arp, Erwin Quiring, Feargus Pendlebury, Alexander Warnecke, Fabio Pierazzi, Christian Wressnegger, Lorenzo Cavallaro, and Konrad Rieck. 2022. Dos and Don’ts of Machine Learning in Computer Security. In31st USENIX Security Symposium (USENIX Security 22). USENIX Association, Boston, MA, 3971–3988. https://www.usenix.org/conference/usenixsecurity22/pr...

work page 2022
[4]

Kellas, Vasileios P

Yaniv David, Neophytos Christou, Andreas D. Kellas, Vasileios P. Kemerlis, and Junfeng Yang. 2024. QUACK: Hindering Deserialization Attacks via Static Duck Typing. InProceedings of the 31st Network and Distributed System Security Sym- posium (NDSS)

work page 2024
[5]

Yangruibo Ding, Yanjun Fu, Omniyyah Ibrahim, Chawin Sitawarin, Xinyun Chen, Basel Alomair, David Wagner, Baishakhi Ray, and Yizheng Chen. 2025. Vulnerabil- ity Detection with Code Language Models: How Far Are We?. InProceedings of the IEEE/ACM 47th International Conference on Software Engineering(Ottawa, Ontario, Canada)(ICSE ’25). IEEE Press, 1729–1741. ...

work page doi:10.1109/icse55347.2025.00038 2025
[6]

Jinyao Guo, Chengpeng Wang, Xiangzhe Xu, Zian Su, and Xiangyu Zhang. 2025. RepoAudit: An Autonomous LLM-Agent for Repository-Level Code Auditing. In Proceedings of the 42nd International Conference on Machine Learning (Proceedings of Machine Learning Research, Vol. 267), Aarti Singh, Maryam Fazel, Daniel Hsu, Simon Lacoste-Julien, Felix Berkenkamp, Tegan ...

work page 2025
[7]

Haolin Jin and Huaming Chen. 2025. Uncovering Systematic Failures of LLMs in Verifying Code Against Natural Language Specifications. In2025 40th IEEE/ACM International Conference on Automated Software Engineering (ASE). 3819–3823. doi:10.1109/ASE63991.2025.00323

work page doi:10.1109/ase63991.2025.00323 2025
[8]

Thanh Le-Cong, Bach Le, and Toby Murray. 2025. Can LLMs Reason About Program Semantics? A Comprehensive Evaluation of LLMs on Formal Specifica- tion Inference. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar (Eds.)...

work page doi:10.18653/v1/2025.acl- 2025
[9]

Ahmed Lekssays, Hamza Mouhcine, Khang Tran, Ting Yu, and Issa Khalil. 2025. LLMxCPG: Context-Aware Vulnerability Detection Through Code Property Graph-Guided Large Language Models. InProceedings of the 34th USENIX Security Symposium. https://www.usenix.org/conference/usenixsecurity25/presentation/ lekssays

work page 2025
[10]

Haonan Li, Yu Hao, Yizhuo Zhai, and Zhiyun Qian. 2024. Enhancing Static Analysis for Practical Bug Detection: An LLM-Integrated Approach.Proc. ACM Program. Lang.8, OOPSLA1, Article 111 (April 2024), 26 pages. doi:10.1145/ 3649828

work page 2024
[12]

Ziyang Li, Saikat Dutta, and Mayur Naik. 2025. IRIS: LLM-Assisted Static Analysis for Detecting Security Vulnerabilities. InThe Thirteenth International Conference on Learning Representations. https://openreview.net/forum?id=9LdJDU7E91

work page 2025
[13]

Ye Liu, Yue Xue, Daoyuan Wu, Yuqiang Sun, Yi Li, Miaolei Shi, and Yang Liu

work page
[14]

InProceedings of the 32nd Network and Distributed System Security Symposium (NDSS)

PropertyGPT: LLM-driven Formal Verification of Smart Contracts through Retrieval-Augmented Property Generation. InProceedings of the 32nd Network and Distributed System Security Symposium (NDSS). Distinguished Paper Award

work page
[15]

Wei Ma, Daoyuan Wu, Yuqiang Sun, Tianwen Wang, Shangqing Liu, Jian Zhang, Yue Xue, and Yang Liu. 2025. Combining Fine-Tuning and LLM-Based Agents for Intuitive Smart Contract Auditing with Justifications. InProceedings of the IEEE/ACM 47th International Conference on Software Engineering(Ottawa, Ontario, Canada)(ICSE ’25). IEEE Press, 1742–1754. doi:10.11...

work page doi:10.1109/icse55347.2025.00027 2025
[16]

Kemerlis

Marius Momeu, Fabian Kilger, Christopher Roemheld, Simon Schnückel, Sergej Proskurin, Michalis Polychronakis, and Vasileios P. Kemerlis. 2024. ISLAB: Im- mutable Memory Management Metadata for Commodity Operating System Ker- nels. InProceedings of the 19th ACM Asia Conference on Computer and Communica- tions Security(Singapore, Singapore)(ASIA CCS ’24). A...

work page doi:10.1145/3634737.3644994 2024
[17]

1992.Semantics with Applications: A Formal Introduction

Hanne Riis Nielson and Flemming Nielson. 1992.Semantics with Applications: A Formal Introduction. John Wiley & Sons, Chichester, England

work page 1992
[18]

Lianglu Pan, Shaanan Cohney, Toby Murray, and Van-Thuan Pham. 2024. EDE- Fuzz: A Web API Fuzzer for Excessive Data Exposures. InProceedings of the IEEE/ACM 46th International Conference on Software Engineering(Lisbon, Por- tugal)(ICSE ’24). Association for Computing Machinery, New York, NY, USA, Article 45, 12 pages. doi:10.1145/3597503.3608133

work page doi:10.1145/3597503.3608133 2024
[19]

Yuqiang Sun, Daoyuan Wu, Yue Xue, Han Liu, Haijun Wang, Zhengzi Xu, Xiaofei Xie, and Yang Liu. 2024. GPTScan: Detecting Logic Vulnerabilities in Smart Contracts by Combining GPT with Program Analysis . In2024 IEEE/ACM 46th International Conference on Software Engineering (ICSE). IEEE Computer Society, Los Alamitos, CA, USA, 2048–2060. doi:10.1145/3597503.3639117

work page doi:10.1145/3597503.3639117 2024
[20]

Saad Ullah, Mingji Han, Saurabh Pujar, Hammond Pearce, Ayse Coskun, and Gianluca Stringhini. 2024. LLMs Cannot Reliably Identify and Reason About Security Vulnerabilities (Yet?): A Comprehensive Evaluation, Framework, and Benchmarks . In2024 IEEE Symposium on Security and Privacy (SP). IEEE Com- puter Society, Los Alamitos, CA, USA, 862–880. doi:10.1109/S...

work page doi:10.1109/sp54263.2024.00210 2024
[21]

Sarisht Wadhwa, Luca Zanolini, Aditya Asgaonkar, Francesco D’Amato, Chengrui Fang, Fan Zhang, and Kartik Nayak. 2024. Data Independent Order Policy En- forcement: Limitations and Solutions. InProceedings of the 2024 on ACM SIGSAC Conference on Computer and Communications Security(Salt Lake City, UT, USA) (CCS ’24). Association for Computing Machinery, New...

work page doi:10.1145/3658644.3670367 2024
[22]

Nodes MUST maintain custody of at least CUSTODY_REQUIREMENT columns

Mingwei Zheng, Chengpeng Wang, Xuwei Liu, Jinyao Guo, Shiwei Feng, and Xiangyu Zhang. 2025. RFCAudit: AI Agent for Auditing Protocol Implementations Against RFC Specifications. In2025 40th IEEE/ACM International Conference on Automated Software Engineering (ASE). 1221–1233. doi:10.1109/ASE63991.2025. 00105 Open Science In accordance with the ACM CCS 2026 ...

work page doi:10.1109/ase63991.2025 2025

[1] [1]

Md Sakib Anwar, Carter Yagemann, and Zhiqiang Lin. 2025. GoSonar: Detecting Logical Vulnerabilities in Memory Safe Language Using Inductive Constraint Reasoning . In2025 IEEE Symposium on Security and Privacy (SP). IEEE Computer Society, Los Alamitos, CA, USA, 758–773. doi:10.1109/SP61157.2025.00043

work page doi:10.1109/sp61157.2025.00043 2025

[2] [2]

Myrto Arapinis, Véronique Cortier, Hubert de Groote, Charlie Jacomme, and Steve Kremer. 2026. Are ideal functionalities really ideal? https://inria.hal.science/hal- 05422941

work page 2026

[3] [3]

Daniel Arp, Erwin Quiring, Feargus Pendlebury, Alexander Warnecke, Fabio Pierazzi, Christian Wressnegger, Lorenzo Cavallaro, and Konrad Rieck. 2022. Dos and Don’ts of Machine Learning in Computer Security. In31st USENIX Security Symposium (USENIX Security 22). USENIX Association, Boston, MA, 3971–3988. https://www.usenix.org/conference/usenixsecurity22/pr...

work page 2022

[4] [4]

Kellas, Vasileios P

Yaniv David, Neophytos Christou, Andreas D. Kellas, Vasileios P. Kemerlis, and Junfeng Yang. 2024. QUACK: Hindering Deserialization Attacks via Static Duck Typing. InProceedings of the 31st Network and Distributed System Security Sym- posium (NDSS)

work page 2024

[5] [5]

Yangruibo Ding, Yanjun Fu, Omniyyah Ibrahim, Chawin Sitawarin, Xinyun Chen, Basel Alomair, David Wagner, Baishakhi Ray, and Yizheng Chen. 2025. Vulnerabil- ity Detection with Code Language Models: How Far Are We?. InProceedings of the IEEE/ACM 47th International Conference on Software Engineering(Ottawa, Ontario, Canada)(ICSE ’25). IEEE Press, 1729–1741. ...

work page doi:10.1109/icse55347.2025.00038 2025

[6] [6]

Jinyao Guo, Chengpeng Wang, Xiangzhe Xu, Zian Su, and Xiangyu Zhang. 2025. RepoAudit: An Autonomous LLM-Agent for Repository-Level Code Auditing. In Proceedings of the 42nd International Conference on Machine Learning (Proceedings of Machine Learning Research, Vol. 267), Aarti Singh, Maryam Fazel, Daniel Hsu, Simon Lacoste-Julien, Felix Berkenkamp, Tegan ...

work page 2025

[7] [7]

Haolin Jin and Huaming Chen. 2025. Uncovering Systematic Failures of LLMs in Verifying Code Against Natural Language Specifications. In2025 40th IEEE/ACM International Conference on Automated Software Engineering (ASE). 3819–3823. doi:10.1109/ASE63991.2025.00323

work page doi:10.1109/ase63991.2025.00323 2025

[8] [8]

Thanh Le-Cong, Bach Le, and Toby Murray. 2025. Can LLMs Reason About Program Semantics? A Comprehensive Evaluation of LLMs on Formal Specifica- tion Inference. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar (Eds.)...

work page doi:10.18653/v1/2025.acl- 2025

[9] [9]

Ahmed Lekssays, Hamza Mouhcine, Khang Tran, Ting Yu, and Issa Khalil. 2025. LLMxCPG: Context-Aware Vulnerability Detection Through Code Property Graph-Guided Large Language Models. InProceedings of the 34th USENIX Security Symposium. https://www.usenix.org/conference/usenixsecurity25/presentation/ lekssays

work page 2025

[10] [10]

Haonan Li, Yu Hao, Yizhuo Zhai, and Zhiyun Qian. 2024. Enhancing Static Analysis for Practical Bug Detection: An LLM-Integrated Approach.Proc. ACM Program. Lang.8, OOPSLA1, Article 111 (April 2024), 26 pages. doi:10.1145/ 3649828

work page 2024

[11] [12]

Ziyang Li, Saikat Dutta, and Mayur Naik. 2025. IRIS: LLM-Assisted Static Analysis for Detecting Security Vulnerabilities. InThe Thirteenth International Conference on Learning Representations. https://openreview.net/forum?id=9LdJDU7E91

work page 2025

[12] [13]

Ye Liu, Yue Xue, Daoyuan Wu, Yuqiang Sun, Yi Li, Miaolei Shi, and Yang Liu

work page

[13] [14]

InProceedings of the 32nd Network and Distributed System Security Symposium (NDSS)

PropertyGPT: LLM-driven Formal Verification of Smart Contracts through Retrieval-Augmented Property Generation. InProceedings of the 32nd Network and Distributed System Security Symposium (NDSS). Distinguished Paper Award

work page

[14] [15]

Wei Ma, Daoyuan Wu, Yuqiang Sun, Tianwen Wang, Shangqing Liu, Jian Zhang, Yue Xue, and Yang Liu. 2025. Combining Fine-Tuning and LLM-Based Agents for Intuitive Smart Contract Auditing with Justifications. InProceedings of the IEEE/ACM 47th International Conference on Software Engineering(Ottawa, Ontario, Canada)(ICSE ’25). IEEE Press, 1742–1754. doi:10.11...

work page doi:10.1109/icse55347.2025.00027 2025

[15] [16]

Kemerlis

Marius Momeu, Fabian Kilger, Christopher Roemheld, Simon Schnückel, Sergej Proskurin, Michalis Polychronakis, and Vasileios P. Kemerlis. 2024. ISLAB: Im- mutable Memory Management Metadata for Commodity Operating System Ker- nels. InProceedings of the 19th ACM Asia Conference on Computer and Communica- tions Security(Singapore, Singapore)(ASIA CCS ’24). A...

work page doi:10.1145/3634737.3644994 2024

[16] [17]

1992.Semantics with Applications: A Formal Introduction

Hanne Riis Nielson and Flemming Nielson. 1992.Semantics with Applications: A Formal Introduction. John Wiley & Sons, Chichester, England

work page 1992

[17] [18]

Lianglu Pan, Shaanan Cohney, Toby Murray, and Van-Thuan Pham. 2024. EDE- Fuzz: A Web API Fuzzer for Excessive Data Exposures. InProceedings of the IEEE/ACM 46th International Conference on Software Engineering(Lisbon, Por- tugal)(ICSE ’24). Association for Computing Machinery, New York, NY, USA, Article 45, 12 pages. doi:10.1145/3597503.3608133

work page doi:10.1145/3597503.3608133 2024

[18] [19]

Yuqiang Sun, Daoyuan Wu, Yue Xue, Han Liu, Haijun Wang, Zhengzi Xu, Xiaofei Xie, and Yang Liu. 2024. GPTScan: Detecting Logic Vulnerabilities in Smart Contracts by Combining GPT with Program Analysis . In2024 IEEE/ACM 46th International Conference on Software Engineering (ICSE). IEEE Computer Society, Los Alamitos, CA, USA, 2048–2060. doi:10.1145/3597503.3639117

work page doi:10.1145/3597503.3639117 2024

[19] [20]

Saad Ullah, Mingji Han, Saurabh Pujar, Hammond Pearce, Ayse Coskun, and Gianluca Stringhini. 2024. LLMs Cannot Reliably Identify and Reason About Security Vulnerabilities (Yet?): A Comprehensive Evaluation, Framework, and Benchmarks . In2024 IEEE Symposium on Security and Privacy (SP). IEEE Com- puter Society, Los Alamitos, CA, USA, 862–880. doi:10.1109/S...

work page doi:10.1109/sp54263.2024.00210 2024

[20] [21]

Sarisht Wadhwa, Luca Zanolini, Aditya Asgaonkar, Francesco D’Amato, Chengrui Fang, Fan Zhang, and Kartik Nayak. 2024. Data Independent Order Policy En- forcement: Limitations and Solutions. InProceedings of the 2024 on ACM SIGSAC Conference on Computer and Communications Security(Salt Lake City, UT, USA) (CCS ’24). Association for Computing Machinery, New...

work page doi:10.1145/3658644.3670367 2024

[21] [22]

Nodes MUST maintain custody of at least CUSTODY_REQUIREMENT columns

Mingwei Zheng, Chengpeng Wang, Xuwei Liu, Jinyao Guo, Shiwei Feng, and Xiangyu Zhang. 2025. RFCAudit: AI Agent for Auditing Protocol Implementations Against RFC Specifications. In2025 40th IEEE/ACM International Conference on Automated Software Engineering (ASE). 1221–1233. doi:10.1109/ASE63991.2025. 00105 Open Science In accordance with the ACM CCS 2026 ...

work page doi:10.1109/ase63991.2025 2025