pith. machine review for the scientific record.

arxiv: 2603.14628 · v2 · submitted 2026-03-15 · 💻 cs.PL · cs.AI · cs.CR · cs.LO

Recognition: no theorem link

s2n-bignum-bench: A practical benchmark for evaluating low-level code reasoning of LLMs

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 10:55 UTC · model grok-4.3

classification 💻 cs.PL · cs.AI · cs.CR · cs.LO
keywords benchmark · LLM theorem proving · HOL Light · formal verification · cryptographic assembly · proof synthesis · low-level code

The pith

s2n-bignum-bench supplies formal specifications from verified AWS cryptographic assembly and requires LLMs to output HOL Light proof scripts that pass machine checking within a fixed timeout.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a benchmark drawn from the s2n-bignum library, whose assembly routines for fast cryptography have already been formally verified in HOL Light. Instead of asking an LLM to state a mathematical theorem, the benchmark gives the correct formal specification of each routine and asks the model to produce a complete proof script that HOL Light accepts. Success requires the model to bridge precise low-level implementation details with the abstract specification in a way that survives automated checking. A sympathetic reader would care because current LLM theorem-proving successes occur mostly on competition-style mathematics, leaving open whether the same models can handle the concrete proof obligations that arise in industrial code. The benchmark therefore supplies a public, reproducible test that measures progress toward practical, machine-checked verification of real implementations.

Core claim

s2n-bignum-bench is constructed by extracting the already-proven HOL Light specifications for the s2n-bignum assembly routines and turning each one into a task that asks an LLM to generate a proof script accepted by the HOL Light kernel within a fixed timeout. The library itself was developed and verified by human experts at AWS; the benchmark therefore isolates the second half of the verification process—proof construction—while supplying the first half (the precise specification) as input.

What carries the argument

The s2n-bignum-bench task harness, which feeds a formal specification to the LLM, collects the generated proof script, and runs it through HOL Light with a timeout to decide success or failure.
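
To make the pass/fail decision concrete, here is a minimal Python sketch of such a harness. The hol.sh launcher invocation, the stdin interface, the 600-second timeout, and the exit-code/output acceptance test are illustrative assumptions rather than the benchmark's actual implementation, and the toy ARITH_TAC lemma merely stands in for an LLM-generated proof script about an assembly routine.

  # Minimal sketch of a proof-checking harness in the spirit described above.
  # Assumptions (not the benchmark's real interface): HOL Light is launched via
  # a "hol.sh" script that reads OCaml/HOL Light source on stdin, and acceptance
  # is judged from the exit code and the absence of a reported exception.
  import subprocess

  HOL_CMD = ["hol.sh"]       # assumed HOL Light launcher on PATH
  PROOF_TIMEOUT_S = 600      # fixed proof-check timeout; value chosen arbitrarily

  # Toy stand-in for an LLM-generated proof script (real tasks target the
  # supplied assembly-correctness specifications, not this arithmetic lemma).
  candidate_script = 'let TOY_LEMMA = prove(`!m n. m + n = n + m`, ARITH_TAC);;\n'

  def check_proof(spec_and_script: str) -> bool:
      """Run the specification context plus candidate script through HOL Light
      and report whether it is accepted within the timeout."""
      try:
          result = subprocess.run(
              HOL_CMD,
              input=spec_and_script,
              capture_output=True,
              text=True,
              timeout=PROOF_TIMEOUT_S,
          )
      except subprocess.TimeoutExpired:
          return False
      # Crude acceptance test; a real harness would parse HOL Light's output.
      output = result.stdout + result.stderr
      return result.returncode == 0 and "Exception" not in output

  if __name__ == "__main__":
      print("accepted" if check_proof(candidate_script) else "rejected")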

If this is right

  • LLMs that pass the benchmark can be used to accelerate the construction of machine-checked proofs for other verified cryptographic libraries.
  • Failure on the benchmark indicates that current models still lack the ability to connect low-level assembly details with their abstract specifications inside a proof assistant.
  • The benchmark can serve as a training signal for fine-tuning models on the specific style of proof scripts required by HOL Light for integer and bit-vector reasoning.
  • Success rates on s2n-bignum-bench provide a quantitative measure that can be tracked over time as models improve at code-related formal reasoning.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The benchmark could be extended with partial-proof hints or proof-state feedback to study whether interactive assistance improves LLM performance on low-level code.
  • Because the routines are already machine-checked, the same test suite could be reused to compare proof scripts produced by LLMs against the original human-written scripts for length, readability, or reuse of lemmas, as sketched below.
  • Similar extraction pipelines could be applied to other formally verified open-source libraries to create additional domain-specific benchmarks without requiring new human verification effort.
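
A rough Python sketch of that script-comparison idea follows. The ALL_CAPS regex heuristic for spotting referenced lemma names and the nonblank line count are assumptions made for illustration; they are not part of the benchmark.

  # Sketch: compare an LLM-generated HOL Light proof script with the original
  # human-written one on length and on overlap of referenced lemma names.
  # The ALL_CAPS heuristic below is an assumption, not a benchmark feature.
  import re

  LEMMA_NAME = re.compile(r"\b[A-Z][A-Z0-9_]{2,}\b")  # HOL Light lemmas are usually ALL_CAPS

  def script_stats(script: str):
      lines = [ln for ln in script.splitlines() if ln.strip()]
      return len(lines), set(LEMMA_NAME.findall(script))

  def compare_scripts(llm_script: str, human_script: str) -> dict:
      llm_len, llm_refs = script_stats(llm_script)
      human_len, human_refs = script_stats(human_script)
      return {
          "llm_lines": llm_len,
          "human_lines": human_len,
          "shared_lemma_refs": sorted(llm_refs & human_refs),
          "llm_only_refs": sorted(llm_refs - human_refs),
      }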

Load-bearing premise

That an LLM's ability to produce a HOL Light proof script accepted within the timeout for these routines is a good test of its capacity to construct proofs about real-world low-level implementations.

What would settle it

Run a representative set of current LLMs on the benchmark routines and record whether any model produces a proof script that HOL Light accepts for even the smallest routine; zero success rate on the simplest cases would falsify the claim that the benchmark meaningfully exercises real-world proof synthesis.
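
A small Python sketch of that experiment is below, assuming generate_proof is a placeholder for the LLM call and check_proof is a harness like the one sketched earlier; the routine ordering and success-rate metric are illustrative choices, not the paper's protocol.

  # Sketch of the falsification experiment: run each model on every routine,
  # record per-model success rates, and flag which models solve the smallest
  # routine. generate_proof and check_proof are placeholders/assumptions.
  from collections import defaultdict

  def evaluate(models, routines, generate_proof, check_proof):
      # models: iterable of model identifiers.
      # routines: list of (name, spec) pairs, smallest routine first.
      passes = defaultdict(int)
      smallest_name = routines[0][0]
      solved_smallest = set()
      for model in models:
          for name, spec in routines:
              script = generate_proof(model, spec)       # placeholder LLM call
              if check_proof(spec + "\n" + script):      # HOL Light acceptance
                  passes[model] += 1
                  if name == smallest_name:
                      solved_smallest.add(model)
      rates = {m: passes[m] / len(routines) for m in models}
      return rates, solved_smallest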

read the original abstract

Neurosymbolic approaches leveraging Large Language Models (LLMs) with formal methods have recently achieved strong results on mathematics-oriented theorem-proving benchmarks. However, success on competition-style mathematics does not by itself demonstrate the ability to construct proofs about real-world implementations. We address this gap with a benchmark derived from an industrial cryptographic library whose assembly routines are already verified in HOL Light. s2n-bignum is a library used at AWS for providing fast assembly routines for cryptography, and its correctness is established by formal verification. The task of formally verifying this library has been a significant achievement for the Automated Reasoning Group. It involved two tasks: (1) precisely specifying the correct behavior of a program as a mathematical proposition, and (2) proving that the proposition is correct. In the case of s2n-bignum, both tasks were carried out by human experts. In s2n-bignum-bench, we provide the formal specification and ask the LLM to generate a proof script that is accepted by HOL Light within a fixed proof-check timeout. To our knowledge, s2n-bignum-bench is the first public benchmark focused on machine-checkable proof synthesis for industrial low-level cryptographic assembly routines in HOL Light. This benchmark provides a challenging and practically relevant testbed for evaluating LLM-based theorem proving beyond competition mathematics. The code to set up and use the benchmark is available at https://github.com/kings-crown/s2n-bignum-bench.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper introduces s2n-bignum-bench, a benchmark derived from the already-verified s2n-bignum cryptographic library. It supplies human-authored formal specifications in HOL Light for assembly routines and requires LLMs to generate proof scripts that are accepted by the HOL Light checker within a fixed timeout. The work positions the benchmark as the first public resource focused on machine-checkable proof synthesis for industrial low-level cryptographic code, aiming to evaluate LLM reasoning beyond competition-style mathematics.

Significance. If the evaluation protocol is sound, the benchmark supplies a practically grounded testbed that connects LLM theorem-proving research to real industrial verification artifacts. It reuses existing machine-checked proofs, offers reproducible setup via the linked repository, and targets a domain (cryptographic assembly) where formal methods have already succeeded, thereby providing falsifiable, externally validated problems rather than synthetic ones.

major comments (1)
  1. [Abstract] The central claim that the benchmark 'addresses the gap' between competition mathematics and 'proofs about real-world implementations' rests on the assumption that supplying a pre-existing formal specification and requiring only a proof script adequately exercises low-level code reasoning. The manuscript does not provide evidence or discussion of the relative difficulty of specification derivation versus proof discharge in the original s2n-bignum verification effort; if specification construction is the dominant industrial cost, the benchmark tests a narrower capability than the title and abstract suggest.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the positive assessment of the benchmark's significance and for the constructive feedback. We address the major comment below.

read point-by-point responses
  1. Referee: [Abstract] The central claim that the benchmark 'addresses the gap' between competition mathematics and 'proofs about real-world implementations' rests on the assumption that supplying a pre-existing formal specification and requiring only a proof script adequately exercises low-level code reasoning. The manuscript does not provide evidence or discussion of the relative difficulty of specification derivation versus proof discharge in the original s2n-bignum verification effort; if specification construction is the dominant industrial cost, the benchmark tests a narrower capability than the title and abstract suggest.

    Authors: We appreciate this observation. The manuscript already notes that both specification and proof were performed by human experts in the original s2n-bignum verification. We agree that we provide no quantitative comparison of their relative difficulties. The benchmark is deliberately scoped to the proof-synthesis task because that is the capability current LLMs have begun to demonstrate on synthetic problems but have not yet been evaluated on for verified industrial artifacts. Generating an accepted HOL Light proof script still requires the model to reason about the concrete assembly semantics, loop invariants, and bit-vector arithmetic that appear in the supplied specification. We will revise the abstract to make the benchmark's focus explicit and will add a short paragraph in the introduction acknowledging that specification construction is a major industrial cost while clarifying that the benchmark isolates the proof-discharge component to enable direct comparison with existing theorem-proving benchmarks. revision: yes

Circularity Check

0 steps flagged

No circularity: benchmark proposal is self-contained design choice

full rationale

The paper introduces s2n-bignum-bench by selecting an existing industrially verified library, supplying its human-authored HOL Light specifications, and defining the LLM task as emitting a proof script accepted within timeout. No equations, parameter fits, uniqueness theorems, or self-citations are used to derive the benchmark's validity or relevance; the claim that this tests proof synthesis for real-world code is presented explicitly as a design decision rather than a result obtained from prior steps in the paper. The work contains no load-bearing reductions to its own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the existing formal verification of s2n-bignum by human experts and the soundness of HOL Light, with no new free parameters or invented entities introduced.

axioms (1)
  • standard math The HOL Light theorem prover provides a sound proof checker for the given specifications.
    This is a background assumption for any work using HOL Light.

pith-pipeline@v0.9.0 · 5599 in / 1083 out tokens · 66066 ms · 2026-05-15T10:55:35.299277+00:00 · methodology

discussion (0)

