pith. machine review for the scientific record.

arxiv: 2603.14628 · v2 · submitted 2026-03-15 · 💻 cs.PL · cs.AI · cs.CR · cs.LO

Recognition: no theorem link

s2n-bignum-bench: A practical benchmark for evaluating low-level code reasoning of LLMs

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 10:55 UTC · model grok-4.3

classification 💻 cs.PL · cs.AI · cs.CR · cs.LO
keywords benchmark · LLM theorem proving · HOL Light · formal verification · cryptographic assembly · proof synthesis · low-level code

The pith

s2n-bignum-bench supplies formal specifications from verified AWS cryptographic assembly and requires LLMs to output HOL Light proof scripts that pass machine checking within a fixed timeout.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a benchmark drawn from the s2n-bignum library, whose assembly routines for fast cryptography have already been formally verified in HOL Light. Instead of asking an LLM to state a mathematical theorem, the benchmark gives the correct formal specification of each routine and asks the model to produce a complete proof script that HOL Light accepts. Success requires the model to bridge precise low-level implementation details with the abstract specification in a way that survives automated checking. A sympathetic reader would care because current LLM theorem-proving successes occur mostly on competition-style mathematics, leaving open whether the same models can handle the concrete proof obligations that arise in industrial code. The benchmark therefore supplies a public, reproducible test that measures progress toward practical, machine-checked verification of real implementations.

Core claim

s2n-bignum-bench is constructed by extracting the already-proven HOL Light specifications for the s2n-bignum assembly routines and turning each one into a task that asks an LLM to generate a proof script accepted by the HOL Light kernel within a fixed timeout. The library itself was developed and verified by human experts at AWS; the benchmark therefore isolates the second half of the verification process—proof construction—while supplying the first half (the precise specification) as input.

What carries the argument

The s2n-bignum-bench task harness, which feeds a formal specification to the LLM, collects the generated proof script, and runs it through HOL Light with a timeout to decide success or failure.
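
To make the pass/fail decision concrete, here is a minimal Python sketch of such a harness. The hol.sh launcher invocation, the stdin interface, the 600-second timeout, and the exit-code/output acceptance test are illustrative assumptions rather than the benchmark's actual implementation, and the toy ARITH_TAC lemma merely stands in for an LLM-generated proof script about an assembly routine.

  # Minimal sketch of a proof-checking harness in the spirit described above.
  # Assumptions (not the benchmark's real interface): HOL Light is launched via
  # a "hol.sh" script that reads OCaml/HOL Light source on stdin, and acceptance
  # is judged from the exit code and the absence of a reported exception.
  import subprocess

  HOL_CMD = ["hol.sh"]       # assumed HOL Light launcher on PATH
  PROOF_TIMEOUT_S = 600      # fixed proof-check timeout; value chosen arbitrarily

  # Toy stand-in for an LLM-generated proof script (real tasks target the
  # supplied assembly-correctness specifications, not this arithmetic lemma).
  candidate_script = 'let TOY_LEMMA = prove(`!m n. m + n = n + m`, ARITH_TAC);;\n'

  def check_proof(spec_and_script: str) -> bool:
      """Run the specification context plus candidate script through HOL Light
      and report whether it is accepted within the timeout."""
      try:
          result = subprocess.run(
              HOL_CMD,
              input=spec_and_script,
              capture_output=True,
              text=True,
              timeout=PROOF_TIMEOUT_S,
          )
      except subprocess.TimeoutExpired:
          return False
      # Crude acceptance test; a real harness would parse HOL Light's output.
      output = result.stdout + result.stderr
      return result.returncode == 0 and "Exception" not in output

  if __name__ == "__main__":
      print("accepted" if check_proof(candidate_script) else "rejected")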

If this is right

  • LLMs that pass the benchmark can be used to accelerate the construction of machine-checked proofs for other verified cryptographic libraries.
  • Failure on the benchmark indicates that current models still lack the ability to connect low-level assembly details with their abstract specifications inside a proof assistant.
  • The benchmark can serve as a training signal for fine-tuning models on the specific style of proof scripts required by HOL Light for integer and bit-vector reasoning.
  • Success rates on s2n-bignum-bench provide a quantitative measure that can be tracked over time as models improve at code-related formal reasoning.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The benchmark could be extended with partial-proof hints or proof-state feedback to study whether interactive assistance improves LLM performance on low-level code.
  • Because the routines are already machine-checked, the same test suite could be reused to compare proof scripts produced by LLMs against the original human-written scripts for length, readability, or reuse of lemmas, as sketched below.
  • Similar extraction pipelines could be applied to other formally verified open-source libraries to create additional domain-specific benchmarks without requiring new human verification effort.
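
A rough Python sketch of that script-comparison idea follows. The ALL_CAPS regex heuristic for spotting referenced lemma names and the nonblank line count are assumptions made for illustration; they are not part of the benchmark.

  # Sketch: compare an LLM-generated HOL Light proof script with the original
  # human-written one on length and on overlap of referenced lemma names.
  # The ALL_CAPS heuristic below is an assumption, not a benchmark feature.
  import re

  LEMMA_NAME = re.compile(r"\b[A-Z][A-Z0-9_]{2,}\b")  # HOL Light lemmas are usually ALL_CAPS

  def script_stats(script: str):
      lines = [ln for ln in script.splitlines() if ln.strip()]
      return len(lines), set(LEMMA_NAME.findall(script))

  def compare_scripts(llm_script: str, human_script: str) -> dict:
      llm_len, llm_refs = script_stats(llm_script)
      human_len, human_refs = script_stats(human_script)
      return {
          "llm_lines": llm_len,
          "human_lines": human_len,
          "shared_lemma_refs": sorted(llm_refs & human_refs),
          "llm_only_refs": sorted(llm_refs - human_refs),
      }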

Load-bearing premise

That an LLM's ability to produce a HOL Light proof script accepted within the timeout for these routines is a good test of its capacity to construct proofs about real-world low-level implementations.

What would settle it

Run a representative set of current LLMs on the benchmark routines and record whether any model produces a proof script that HOL Light accepts for even the smallest routine; zero success rate on the simplest cases would falsify the claim that the benchmark meaningfully exercises real-world proof synthesis.
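
A small Python sketch of that experiment is below, assuming generate_proof is a placeholder for the LLM call and check_proof is a harness like the one sketched earlier; the routine ordering and success-rate metric are illustrative choices, not the paper's protocol.

  # Sketch of the falsification experiment: run each model on every routine,
  # record per-model success rates, and flag which models solve the smallest
  # routine. generate_proof and check_proof are placeholders/assumptions.
  from collections import defaultdict

  def evaluate(models, routines, generate_proof, check_proof):
      # models: iterable of model identifiers.
      # routines: list of (name, spec) pairs, smallest routine first.
      passes = defaultdict(int)
      smallest_name = routines[0][0]
      solved_smallest = set()
      for model in models:
          for name, spec in routines:
              script = generate_proof(model, spec)       # placeholder LLM call
              if check_proof(spec + "\n" + script):      # HOL Light acceptance
                  passes[model] += 1
                  if name == smallest_name:
                      solved_smallest.add(model)
      rates = {m: passes[m] / len(routines) for m in models}
      return rates, solved_smallest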

read the original abstract

Neurosymbolic approaches leveraging Large Language Models (LLMs) with formal methods have recently achieved strong results on mathematics-oriented theorem-proving benchmarks. However, success on competition-style mathematics does not by itself demonstrate the ability to construct proofs about real-world implementations. We address this gap with a benchmark derived from an industrial cryptographic library whose assembly routines are already verified in HOL Light. s2n-bignum is a library used at AWS for providing fast assembly routines for cryptography, and its correctness is established by formal verification. The task of formally verifying this library has been a significant achievement for the Automated Reasoning Group. It involved two tasks: (1) precisely specifying the correct behavior of a program as a mathematical proposition, and (2) proving that the proposition is correct. In the case of s2n-bignum, both tasks were carried out by human experts. In s2n-bignum-bench, we provide the formal specification and ask the LLM to generate a proof script that is accepted by HOL Light within a fixed proof-check timeout. To our knowledge, s2n-bignum-bench is the first public benchmark focused on machine-checkable proof synthesis for industrial low-level cryptographic assembly routines in HOL Light. This benchmark provides a challenging and practically relevant testbed for evaluating LLM-based theorem proving beyond competition mathematics. The code to set up and use the benchmark is available at https://github.com/kings-crown/s2n-bignum-bench.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper introduces s2n-bignum-bench, a benchmark derived from the already-verified s2n-bignum cryptographic library. It supplies human-authored formal specifications in HOL Light for assembly routines and requires LLMs to generate proof scripts that are accepted by the HOL Light checker within a fixed timeout. The work positions the benchmark as the first public resource focused on machine-checkable proof synthesis for industrial low-level cryptographic code, aiming to evaluate LLM reasoning beyond competition-style mathematics.

Significance. If the evaluation protocol is sound, the benchmark supplies a practically grounded testbed that connects LLM theorem-proving research to real industrial verification artifacts. It reuses existing machine-checked proofs, offers reproducible setup via the linked repository, and targets a domain (cryptographic assembly) where formal methods have already succeeded, thereby providing falsifiable, externally validated problems rather than synthetic ones.

major comments (1)
  1. [Abstract] The central claim that the benchmark 'addresses the gap' between competition mathematics and 'proofs about real-world implementations' rests on the assumption that supplying a pre-existing formal specification and requiring only a proof script adequately exercises low-level code reasoning. The manuscript does not provide evidence or discussion of the relative difficulty of specification derivation versus proof discharge in the original s2n-bignum verification effort; if specification construction is the dominant industrial cost, the benchmark tests a narrower capability than the title and abstract suggest.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the positive assessment of the benchmark's significance and for the constructive feedback. We address the major comment below.

read point-by-point responses
  1. Referee: [Abstract] The central claim that the benchmark 'addresses the gap' between competition mathematics and 'proofs about real-world implementations' rests on the assumption that supplying a pre-existing formal specification and requiring only a proof script adequately exercises low-level code reasoning. The manuscript does not provide evidence or discussion of the relative difficulty of specification derivation versus proof discharge in the original s2n-bignum verification effort; if specification construction is the dominant industrial cost, the benchmark tests a narrower capability than the title and abstract suggest.

    Authors: We appreciate this observation. The manuscript already notes that both specification and proof were performed by human experts in the original s2n-bignum verification. We agree that we provide no quantitative comparison of their relative difficulties. The benchmark is deliberately scoped to the proof-synthesis task because that is the capability current LLMs have begun to demonstrate on synthetic problems but have not yet been evaluated on for verified industrial artifacts. Generating an accepted HOL Light proof script still requires the model to reason about the concrete assembly semantics, loop invariants, and bit-vector arithmetic that appear in the supplied specification. We will revise the abstract to make the benchmark's focus explicit and will add a short paragraph in the introduction acknowledging that specification construction is a major industrial cost while clarifying that the benchmark isolates the proof-discharge component to enable direct comparison with existing theorem-proving benchmarks. revision: yes

Circularity Check

0 steps flagged

No circularity: benchmark proposal is self-contained design choice

full rationale

The paper introduces s2n-bignum-bench by selecting an existing industrially verified library, supplying its human-authored HOL Light specifications, and defining the LLM task as emitting a proof script accepted within timeout. No equations, parameter fits, uniqueness theorems, or self-citations are used to derive the benchmark's validity or relevance; the claim that this tests proof synthesis for real-world code is presented explicitly as a design decision rather than a result obtained from prior steps in the paper. The work contains no load-bearing reductions to its own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the existing formal verification of s2n-bignum by human experts and the soundness of HOL Light, with no new free parameters or invented entities introduced.

axioms (1)
  • standard math The HOL Light theorem prover provides a sound proof checker for the given specifications.
    This is a background assumption for any work using HOL Light.

pith-pipeline@v0.9.0 · 5599 in / 1083 out tokens · 66066 ms · 2026-05-15T10:55:35.299277+00:00 · methodology

discussion (0)

