Recognition: no theorem link
s2n-bignum-bench: A practical benchmark for evaluating low-level code reasoning of LLMs
Pith reviewed 2026-05-15 10:55 UTC · model grok-4.3
The pith
s2n-bignum-bench supplies formal specifications from verified AWS cryptographic assembly and requires LLMs to output HOL Light proof scripts that pass machine checking within a fixed timeout.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
s2n-bignum-bench is constructed by extracting the already-proven HOL Light specifications for the s2n-bignum assembly routines and turning each one into a task that asks an LLM to generate a proof script accepted by the HOL Light kernel within a fixed timeout. The library itself was developed and verified by human experts at AWS; the benchmark therefore isolates the second half of the verification process—proof construction—while supplying the first half (the precise specification) as input.
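Concretely, each benchmark item can be pictured as a record pairing one routine's specification with a proof-check budget. The field names below are illustrative only, not the repository's actual schema:

```python
# Hypothetical shape of a single benchmark task (field names are
# illustrative; the real s2n-bignum-bench layout may differ).
task = {
    "routine": "bignum_add",    # an s2n-bignum assembly routine (example name)
    "specification": "...",     # the human-authored HOL Light correctness statement (elided)
    "timeout_s": 600,           # fixed proof-check budget in seconds
}

# The model receives task["specification"] and must return a proof script;
# the harness accepts iff HOL Light checks it within task["timeout_s"].
```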
What carries the argument
The s2n-bignum-bench task harness, which feeds a formal specification to the LLM, collects the generated proof script, and runs it through HOL Light with a timeout to decide success or failure.
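The harness loop described above can be sketched as a single check function: write the specification plus the candidate proof script to a file, hand it to an external checker command, and report pass or fail under a wall-clock budget. This is a minimal sketch of the harness logic, not the repository's actual code; in the real benchmark the checker command would be an invocation of the HOL Light toplevel.

```python
import os
import subprocess
import tempfile

def check_proof(checker_cmd, spec, proof_script, timeout_s=60):
    """Run spec + candidate proof script through an external checker.

    Hypothetical harness sketch: success means the checker exits 0 within
    the timeout. In s2n-bignum-bench, checker_cmd would launch HOL Light.
    """
    with tempfile.NamedTemporaryFile("w", suffix=".ml", delete=False) as f:
        f.write(spec + "\n" + proof_script + "\n")
        path = f.name
    try:
        # Success iff the checker exits 0 before the budget expires.
        proc = subprocess.run(checker_cmd + [path], capture_output=True,
                              text=True, timeout=timeout_s)
        return proc.returncode == 0
    except subprocess.TimeoutExpired:
        # A proof that does not check within the budget counts as a failure.
        return False
    finally:
        os.unlink(path)
```

The timeout is part of the task definition, so a script that would eventually check but exceeds the budget is still scored as a failure.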
If this is right
- LLMs that pass the benchmark can be used to accelerate the construction of machine-checked proofs for other verified cryptographic libraries.
- Failure on the benchmark indicates that current models still lack the ability to connect low-level assembly details with their abstract specifications inside a proof assistant.
- The benchmark can serve as a training signal for fine-tuning models on the specific style of proof scripts required by HOL Light for integer and bit-vector reasoning.
- Success rates on s2n-bignum-bench provide a quantitative measure that can be tracked over time as models improve at code-related formal reasoning.
Where Pith is reading between the lines
- The benchmark could be extended with partial-proof hints or proof-state feedback to study whether interactive assistance improves LLM performance on low-level code.
- Because the routines are already machine-checked, the same test suite could be reused to compare proof scripts produced by LLMs against the original human-written scripts for length, readability, or reuse of lemmas.
- Similar extraction pipelines could be applied to other formally verified open-source libraries to create additional domain-specific benchmarks without requiring new human verification effort.
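The second possibility above, comparing LLM-generated scripts against the original human-written ones, could start from crude surface metrics. The sketch below is hypothetical (the identifier regex is only a rough proxy for HOL Light tactic and lemma names):

```python
import re

def script_stats(script):
    """Crude metrics for a HOL Light proof script: non-blank line count and
    the set of capitalised identifiers (a rough proxy for tactics/lemmas)."""
    lines = [l for l in script.splitlines() if l.strip()]
    idents = set(re.findall(r"\b[A-Z][A-Z0-9_]{2,}\b", script))
    return {"lines": len(lines), "identifiers": idents}

def shared_lemmas(llm_script, human_script):
    """Identifiers appearing in both scripts -- a first-cut reuse measure."""
    return (script_stats(llm_script)["identifiers"]
            & script_stats(human_script)["identifiers"])
```

Length and identifier overlap say nothing about proof quality on their own, but they would make differences in style and lemma reuse visible at benchmark scale.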
Load-bearing premise
That an LLM's ability to produce a HOL Light proof script accepted within the timeout for these routines is a good test of its capacity to construct proofs about real-world low-level implementations.
What would settle it
Run a representative set of current LLMs on the benchmark routines and record whether any model produces a proof script that HOL Light accepts for even the smallest routine; zero success rate on the simplest cases would falsify the claim that the benchmark meaningfully exercises real-world proof synthesis.
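The aggregation needed for that test is simple: per model, the fraction of routines with at least one accepted proof script. A minimal sketch, assuming results arrive as (model, routine, accepted) triples:

```python
from collections import defaultdict

def success_rates(results):
    """results: iterable of (model, routine, accepted: bool) triples.
    Returns, per model, the fraction of attempted routines for which at
    least one proof script was accepted (hypothetical aggregation)."""
    solved = defaultdict(set)
    attempted = defaultdict(set)
    for model, routine, ok in results:
        attempted[model].add(routine)
        if ok:
            solved[model].add(routine)
    return {m: len(solved[m]) / len(attempted[m]) for m in attempted}
```

A rate of exactly 0.0 for every model on the smallest routines would be the falsifying outcome described above.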
read the original abstract
Neurosymbolic approaches leveraging Large Language Models (LLMs) with formal methods have recently achieved strong results on mathematics-oriented theorem-proving benchmarks. However, success on competition-style mathematics does not by itself demonstrate the ability to construct proofs about real-world implementations. We address this gap with a benchmark derived from an industrial cryptographic library whose assembly routines are already verified in HOL Light. s2n-bignum is a library used at AWS for providing fast assembly routines for cryptography, and its correctness is established by formal verification. The task of formally verifying this library has been a significant achievement for the Automated Reasoning Group. It involved two tasks: (1) precisely specifying the correct behavior of a program as a mathematical proposition, and (2) proving that the proposition is correct. In the case of s2n-bignum, both tasks were carried out by human experts. In s2n-bignum-bench, we provide the formal specification and ask the LLM to generate a proof script that is accepted by HOL Light within a fixed proof-check timeout. To our knowledge, s2n-bignum-bench is the first public benchmark focused on machine-checkable proof synthesis for industrial low-level cryptographic assembly routines in HOL Light. This benchmark provides a challenging and practically relevant testbed for evaluating LLM-based theorem proving beyond competition mathematics. The code to set up and use the benchmark is available at https://github.com/kings-crown/s2n-bignum-bench.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces s2n-bignum-bench, a benchmark derived from the already-verified s2n-bignum cryptographic library. It supplies human-authored formal specifications in HOL Light for assembly routines and requires LLMs to generate proof scripts that are accepted by the HOL Light checker within a fixed timeout. The work positions the benchmark as the first public resource focused on machine-checkable proof synthesis for industrial low-level cryptographic code, aiming to evaluate LLM reasoning beyond competition-style mathematics.
Significance. If the evaluation protocol is sound, the benchmark supplies a practically grounded testbed that connects LLM theorem-proving research to real industrial verification artifacts. It reuses existing machine-checked proofs, offers reproducible setup via the linked repository, and targets a domain (cryptographic assembly) where formal methods have already succeeded, thereby providing falsifiable, externally validated problems rather than synthetic ones.
major comments (1)
- [Abstract] The central claim that the benchmark 'addresses the gap' between competition mathematics and 'proofs about real-world implementations' rests on the assumption that supplying a pre-existing formal specification and requiring only a proof script adequately exercises low-level code reasoning. The manuscript does not provide evidence or discussion of the relative difficulty of specification derivation versus proof discharge in the original s2n-bignum verification effort; if specification construction is the dominant industrial cost, the benchmark tests a narrower capability than the title and abstract suggest.
Simulated Author's Rebuttal
We thank the referee for the positive assessment of the benchmark's significance and for the constructive feedback. We address the major comment below.
read point-by-point responses
- Referee: [Abstract] The central claim that the benchmark 'addresses the gap' between competition mathematics and 'proofs about real-world implementations' rests on the assumption that supplying a pre-existing formal specification and requiring only a proof script adequately exercises low-level code reasoning. The manuscript does not provide evidence or discussion of the relative difficulty of specification derivation versus proof discharge in the original s2n-bignum verification effort; if specification construction is the dominant industrial cost, the benchmark tests a narrower capability than the title and abstract suggest.
Authors: We appreciate this observation. The manuscript already notes that both specification and proof were performed by human experts in the original s2n-bignum verification. We agree that we provide no quantitative comparison of their relative difficulties. The benchmark is deliberately scoped to the proof-synthesis task because that is the capability current LLMs have begun to demonstrate on synthetic problems but have not yet been evaluated on for verified industrial artifacts. Generating an accepted HOL Light proof script still requires the model to reason about the concrete assembly semantics, loop invariants, and bit-vector arithmetic that appear in the supplied specification. We will revise the abstract to make the benchmark's focus explicit and will add a short paragraph in the introduction acknowledging that specification construction is a major industrial cost while clarifying that the benchmark isolates the proof-discharge component to enable direct comparison with existing theorem-proving benchmarks. revision: yes
Circularity Check
No circularity: benchmark proposal is self-contained design choice
full rationale
The paper introduces s2n-bignum-bench by selecting an existing industrially verified library, supplying its human-authored HOL Light specifications, and defining the LLM task as emitting a proof script accepted within timeout. No equations, parameter fits, uniqueness theorems, or self-citations are used to derive the benchmark's validity or relevance; the claim that this tests proof synthesis for real-world code is presented explicitly as a design decision rather than a result obtained from prior steps in the paper. The work contains no load-bearing reductions to its own inputs.
Axiom & Free-Parameter Ledger
axioms (1)
- [standard math] The HOL Light theorem prover provides a sound proof checker for the given specifications.
Reference graph
Works this paper leans on
- [1] Achim, T. and V. Tenev (2023). Harmonic: Building mathematical superintelligence.
- [2]
- [3] Chen, L., J. Gu, L. Huang, W. Huang, Z. Jiang, A. Jie, X. Jin, X. Jin, C. Li, K. Ma, C. Ren, J. Shen, W. Shi, T. Sun, H. Sun, J. Wang, S. Wang, Z. Wang, C. Wei, S. Wei, Y. Wu, Y. Wu, Y. Xia, H. Xin, F. Yang, H. Ying, H. Yuan, Z. Yuan, T. Zhan, C. Zhang, Y. Zhang, G. Zhang, T. Zhao, J. Zhao, Y. Zhou, and T. H. Zhu (2025). Seed-prover: Deep and broad r...
- [4] Cobbe, K., V. Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser, M. Plappert, J. Tworek, J. Hilton, R. Nakano, et al. (2021). Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168.
- [5] Gu, A., B. Rozière, H. Leather, A. Solar-Lezama, G. Synnaeve, and S. I. Wang (2024). CRUXEval: A benchmark for code reasoning, understanding and execution. arXiv preprint arXiv:2401.03065.
- [6] Harrison, J. (2009). HOL Light: An overview. In International Conference on Theorem Proving in Higher Order Logics, pp. 60–66. Springer.
- [7] Harrison, J. (2026). Soundness of s2n-bignum formal verification.
- [8] Harrison, J., J. Urban, and F. Wiedijk (2014). History of interactive theorem proving. In Handbook of the History of Logic, Volume 9, pp. 135–214. Elsevier.
- [9] Hu, J., T. Zhu, and S. Welleck (2024). miniCTX: Neural theorem proving with (long-)contexts. arXiv preprint arXiv:2408.03350.
- [10] Letson, A., L. Sarra, A. Poiroux, O. Dressler, P. Lezeau, D. Aranha, F. Pu, A. Hill, M. C. Hidalgo, J. Berman, et al. (2026). SorryDB: Can AI provers complete real-world Lean theorems? arXiv preprint arXiv:2603.02668. url: https://arxiv.org/abs/2603.02668
- [11]
- [12] Mazzucato, D., A. Mohamed, J. Lee, C. Barrett, J. Grundy, J. Harrison, and C. S. Păsăreanu (2025). Relational Hoare logic for realistically modelled machine code. In International Conference on Computer Aided Verification, pp. 389–413. Springer.
- [13]
- [14]
- [15]
- [16] Tsoukalas, G., J. Lee, J. Jennings, J. Xin, M. Ding, M. Jennings, A. Thakur, and S. Chaudhuri (2024). PutnamBench: Evaluating neural theorem-provers on the Putnam mathematical competition.
- [17] Vlad Isenbaev, B. H. (2026). Logical Intelligence's Aleph solves PutnamBench.
- [18] Wiedijk, F. (2012). Pollack-inconsistency. Electronic Notes in Theoretical Computer Science 285, 85–100.
- [19]
- [20] Xu, Q., X. Luan, R. Wang, J. O. J. Leang, P. Wang, H. Li, W. Li, and C. Watt (2026). Neural theorem proving for verification conditions: A real-world benchmark. In The Fourteenth International Conference on Learning Representations.
- [21]
- [22] Zheng, K., J. M. Han, and S. Polu (2022). MiniF2F: a cross-system benchmark for formal Olympiad-level mathematics.
Accepted as a Workshop paper at AIPV 2026.
A guide for challengers (Appendix A): the official repository walks a user through setting up the benchmark, and the experimental protocol overview described there is a good place to begin exploring this...
discussion (0)