pith. sign in

arxiv: 2605.26457 · v1 · pith:HKUY4TLOnew · submitted 2026-05-26 · 💻 cs.SE · cs.AI· cs.CL· cs.PL

Verus-SpecGym: An Agentic Environment for Evaluating Specification Autoformalization

Pith reviewed 2026-06-29 16:29 UTC · model grok-4.3

classification 💻 cs.SE cs.AIcs.CLcs.PL
keywords specification autoformalizationLLM agentsformal verificationVerusCodeforces benchmarkagentic environmentssoftware verificationtest-driven evaluation
0
0 comments X

The pith

LLM agents translate informal Codeforces problems into Verus formal specifications at rates from 21 percent for open models to 77.8 percent for the strongest frontier model.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper creates Verus-SpecBench, a set of 581 tasks drawn from Codeforces programming problems, and Verus-SpecGym, an environment in which agents can call Verus, run bash commands, and edit files while building formal specifications. Evaluation works by extending Verus so that specs execute as ordinary Rust code and then running them on the official test suites plus adversarial cases extracted from Codeforces hacks. Frontier models reach 51 to 78 percent success while open-source models stay at 21 to 25 percent. Failures often consist of omitted input constraints, acceptance of wrong outputs, or rejection of valid ones, and an LLM judge misses roughly a quarter of the cases the execution method catches. The work therefore supplies both a concrete test bed and evidence that specification autoformalization is feasible yet still error-prone even when the same models can already write passing code.

Core claim

The authors introduce Verus-SpecBench and Verus-SpecGym to measure whether LLM agents can convert informal problem statements into faithful formal specifications for the Verus verifier. They extend Verus with an exec_spec feature so generated specifications can be compiled and executed directly against Codeforces tests and extracted hacks. On the resulting benchmark the strongest model, Gemini 3.1 Pro, solves 77.8 percent of tasks, other frontier models solve 51.1 to 57.8 percent, and open-source models solve 21.5 to 25.5 percent. Analysis of errors shows that generated specifications commonly omit input assumptions, accept incorrect outputs, or reject valid ones, and that LLM-based judging

What carries the argument

Verus-SpecGym, an agentic loop in which models interact with the Verus verifier, the filesystem, and bash to iteratively produce and test formal specifications that are then checked by execution against official tests plus Codeforces hacks.

If this is right

  • Frontier agents can already produce usable formal specifications for most Codeforces-style problems when given interactive feedback.
  • Execution against tests and hacks supplies a scalable way to judge specification quality without expert-written reference specifications.
  • The dominant failure modes are omission of input constraints and incorrect handling of output validity.
  • LLM judges alone are not reliable for this task because they miss a substantial fraction of execution-detectable errors.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Pairing this benchmark with existing code-generation agents could surface whether correct code plus correct spec together produce verified programs that match user intent.
  • The gap between frontier and open-source performance suggests that scaling data or training specifically on formal-spec writing tasks would be a direct next experiment.
  • Extending the same execution-based evaluator to larger, multi-file problems would test whether the current success rates hold when specifications grow more complex.

Load-bearing premise

That a specification which passes the official tests and the extracted hacks necessarily captures the original informal problem intent.

What would settle it

A generated specification that passes every official test and every extracted hack yet accepts a program whose behavior violates an untested aspect of the original problem statement.

Figures

Figures reproduced from arXiv: 2605.26457 by Anmol Agarwal, Bryan Parno, Cedric Flamant, Jannis Limperg, Kanna Shimizu, Natalie Neamtu, Pranjal Aggarwal, Sean Welleck, Seungone Kim.

Figure 1
Figure 1. Figure 1: Overview of VERUS-SPECGYM. Left: Each task gives the agent an informal Code￾forces problem description, a formal Verus skeleton, and the function signatures for pre_spec and post_spec. The pre_spec function encodes input properties, while post_spec encodes output properties. The task also includes sample test cases, worked examples, Verus documentation, and evaluator source code. Middle: The agent reads th… view at source ↗
Figure 2
Figure 2. Figure 2: Binary search example. The top-left shows the informal problem description sI for a binary-search task. The top-right shows the data structures our pipeline produces to represent its inputs and outputs. The middle row shows example testcases available from Codeforces, one per bucket, alongside their converted counterparts as typed Verus values. The bottom row shows four candidate specifications sF that do … view at source ↗
Figure 4
Figure 4. Figure 4: summarizes how a hack is routed into buckets. If Codeforces rejects the hack input as invalid, we add it to τpre-sound. If Codeforces accepts the input as valid, we add it to τpre-comp. For valid inputs, we also observe the output produced by the program being hacked when run on that input. If the Codeforces checker accepts this output, we add the input-output pair to τpost-comp; otherwise, we add the pair… view at source ↗
Figure 3
Figure 3. Figure 3: Lossless conversion for Codeforces 1027C. For a raw Codeforces testcase t, the parser R produces executable Rust values E = R(t). The printer P then maps those values back to text, yielding Treproduced = P(E) = P(R(t)). The conversion is lossless only if Treproduced == t byte-for-byte; otherwise the benchmark might evaluate the wrong concrete testcase. 9 [PITH_FULL_IMAGE:figures/full_fig_p009_3.png] view at source ↗
Figure 5
Figure 5. Figure 5: Distribution of the number of test cases per problem in each evaluation bucket. Median [PITH_FULL_IMAGE:figures/full_fig_p010_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Decision tree for resolving one testcase. For each testcase, the evaluator checks whether the submitted specification accepts or rejects the concrete input-output example. It first tries to prove acceptance or rejection symbolically in Verus. If both symbolic checks are inconclusive, it compiles the specification to executable Rust via exec_spec and runs it on the concrete testcase. The observation is assi… view at source ↗
Figure 7
Figure 7. Figure 7: Example of formally verified code: binary search for first occurrence. Blue highlights the formal specification sF (the pre_spec and post_spec predicates), written over the data model in yellow (the In1/Out types, declared with exec_spec_unverified!). Purple highlights the executable code, and green highlights the proof annotations that help Verus verify the code against sF , including loop invariants, dec… view at source ↗
Figure 8
Figure 8. Figure 8: gives a compact view of why these four buckets are needed: a candidate specification can disagree with the informal intent either on the valid-input domain or on the input-output relation, and each asymmetric difference corresponds to one completeness or soundness bucket. test case Codeforces hack dom(RsI ) valid inputs dom(RsF ) pre_spec accepts pre-spec completeness pre-spec soundness correct (a) Pre-spe… view at source ↗
Figure 9
Figure 9. Figure 9: Representative invalid test cases for Codeforces 1027C. Syntactically incorrect cases violate [PITH_FULL_IMAGE:figures/full_fig_p028_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Codeforces rating distribution of VERUS-SPECBENCH problems. Ratings range from 800 to 2700, with a median of 1200 and a mean of 1289. The majority of problems fall in the 900–1499 range, with diminishing but non-trivial representation at higher difficulty levels [PITH_FULL_IMAGE:figures/full_fig_p030_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Distribution of Codeforces topic tags across [PITH_FULL_IMAGE:figures/full_fig_p030_11.png] view at source ↗
Figure 12
Figure 12. Figure 12: Informal specification sI , i.e., the natural-language description, for Codeforces 1027C (https://codeforces.com/problemset/problem/1027/C). per problem. All experiments run on a combination of Modal cloud and Docker containers, with 4 CPUs and 8 GB RAM per task. Budgeted agent evaluations also depend on practical API-level details. First, API latency affects how many interaction rounds an agent can compl… view at source ↗
Figure 13
Figure 13. Figure 13: Summary of Pass@1 and testcase-bucket performance across evaluated models. [PITH_FULL_IMAGE:figures/full_fig_p032_13.png] view at source ↗
Figure 14
Figure 14. Figure 14: Pass@1 by Codeforces problem difficulty rating (top) and number of problems per rating [PITH_FULL_IMAGE:figures/full_fig_p033_14.png] view at source ↗
Figure 15
Figure 15. Figure 15: Distribution of testcase resolution outcomes across evaluated runs, broken down by evalu [PITH_FULL_IMAGE:figures/full_fig_p034_15.png] view at source ↗
Figure 16
Figure 16. Figure 16: UpSet plot of solved-problem overlaps among the three closed-source models. Each [PITH_FULL_IMAGE:figures/full_fig_p036_16.png] view at source ↗
Figure 17
Figure 17. Figure 17: Estimated probability of catching at least one specification failure as a function of the [PITH_FULL_IMAGE:figures/full_fig_p038_17.png] view at source ↗
Figure 18
Figure 18. Figure 18: Expected Pass@1 as a function of the maximum test-case budget [PITH_FULL_IMAGE:figures/full_fig_p039_18.png] view at source ↗
read the original abstract

AI coding agents are increasingly used to write real-world software, but ensuring that their outputs are correct remains a fundamental challenge. Formal verification offers a promising path: an agent generates code together with a machine-checked proof, guaranteeing that the code satisfies a formal specification. However, there is no guarantee that the formal spec itself matches the user's intent. In this work, we study specification autoformalization: whether LLM agents can translate informal programming problems into faithful formal specifications. We introduce Verus-SpecBench, a benchmark of 581 spec-writing tasks derived from Codeforces problems targeting Verus, a verifier for Rust, and Verus-SpecGym, an agentic environment in which models interact with Verus, bash, & the filesystem to develop these specs. The central challenge is evaluation: expert-written reference specs are expensive to write, & LLM judges can miss subtle mistakes. We address this by (a) extending Verus's exec_spec mechanism so that generated specs can be executed as Rust code, & (b) testing them against official Codeforces tests & adversarial cases extracted from Codeforces "hacks", which are edge cases written by competitors to break incorrect solutions. On Verus-SpecBench, the strongest model, Gemini 3.1 Pro, solves 77.8% of tasks, other frontier models solve 51.1--57.8% & OSS models reach only 21.5--25.5%. Our analysis of failure modes shows that model-generated specs can omit important input assumptions, accept incorrect outputs, & reject valid ones. We also find that LLM-as-a-judge evaluation misses 26% of the failures our evaluator catches. Overall, our results suggest that spec autoformalization is within reach for frontier agents but remains brittle even on problems where they can already generate correct code. The code, data, & logs can be found at https://github.com/formal-verif-is-cool/verus-spec-gym

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces Verus-SpecBench, a benchmark of 581 specification autoformalization tasks derived from Codeforces problems targeting the Verus verifier for Rust, together with Verus-SpecGym, an agentic environment allowing LLMs to interact with Verus, bash, and the filesystem to develop specs. Correctness is determined by whether generated specs type-check and produce identical input/output behavior on official Codeforces tests plus extracted hacks; the strongest model (Gemini 3.1 Pro) solves 77.8% of tasks, other frontier models 51.1–57.8%, and OSS models 21.5–25.5%. The work also reports that an LLM judge misses 26% of the failures caught by the executable evaluator and catalogs failure modes such as omitted input assumptions.

Significance. If the test-and-hack proxy reliably measures fidelity to informal intent, the results would establish that frontier agents can already autoformalize specifications for a majority of realistic programming problems while underscoring remaining brittleness and the value of executable over LLM-based evaluation. The open release of code, data, and logs is a clear strength for reproducibility.

major comments (2)
  1. [Abstract / Evaluation Methodology] Abstract and evaluation methodology: the headline solve rates (Gemini 3.1 Pro at 77.8 %, frontier models 51–58 %) are defined by the claim that a spec is correct if it type-checks and matches the behavior of the official tests plus extracted hacks. Because the benchmark supplies no gold reference specifications and the abstract itself lists failure modes (omitted assumptions, acceptance of incorrect outputs, rejection of valid ones) that can still pass the provided test distribution, the proxy may systematically overstate correctness on untested cases; this directly undermines the central performance claims.
  2. [Task Construction and Results] Task selection and analysis: the manuscript reports aggregate numbers over 581 tasks but provides no breakdown by problem difficulty, no confidence intervals, and no description of how the Codeforces problems were filtered or sampled, making it impossible to assess whether the reported gaps between model classes are robust or driven by a non-representative subset.
minor comments (1)
  1. [Methodology] The description of the extended exec_spec mechanism would benefit from a short concrete example showing how an exec_spec is turned into executable Rust for testing.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback on our manuscript. We address the major comments point by point below, proposing revisions to strengthen the paper where the concerns are valid.

read point-by-point responses
  1. Referee: [Abstract / Evaluation Methodology] Abstract and evaluation methodology: the headline solve rates (Gemini 3.1 Pro at 77.8 %, frontier models 51–58 %) are defined by the claim that a spec is correct if it type-checks and matches the behavior of the official tests plus extracted hacks. Because the benchmark supplies no gold reference specifications and the abstract itself lists failure modes (omitted assumptions, acceptance of incorrect outputs, rejection of valid ones) that can still pass the provided test distribution, the proxy may systematically overstate correctness on untested cases; this directly undermines the central performance claims.

    Authors: We agree that our evaluation relies on a proxy rather than gold-standard reference specifications, which are costly to produce. The paper explicitly discusses failure modes that could evade the tests and demonstrates that the executable evaluator catches 26% more failures than an LLM judge. However, to better contextualize the results, we will revise the abstract and methodology sections to more prominently state that the reported performance is measured against this test-and-hack proxy and to elaborate on its limitations, including the possibility of overstatement on untested inputs. We will also clarify the rationale for this approach in the absence of reference specs. revision: yes

  2. Referee: [Task Construction and Results] Task selection and analysis: the manuscript reports aggregate numbers over 581 tasks but provides no breakdown by problem difficulty, no confidence intervals, and no description of how the Codeforces problems were filtered or sampled, making it impossible to assess whether the reported gaps between model classes are robust or driven by a non-representative subset.

    Authors: We will incorporate a description of the sampling and filtering process used to select the 581 Codeforces problems for the benchmark. Additionally, we will add breakdowns of solve rates by problem difficulty (based on Codeforces ratings) and include confidence intervals for the key performance metrics to allow better assessment of the robustness of the observed differences between model classes. revision: yes

Circularity Check

0 steps flagged

Empirical benchmark study with no derivation chain or fitted parameters

full rationale

The paper introduces Verus-SpecBench and Verus-SpecGym as an empirical evaluation framework for LLM agents on specification autoformalization tasks derived from Codeforces problems. Performance is measured by executing generated specs against official tests and extracted hacks, with results reported as solve rates (e.g., Gemini 3.1 Pro at 77.8%). No mathematical derivations, equations, fitted parameters, or self-citation chains are present that reduce claims to inputs by construction. The work is self-contained against external benchmarks and test cases, with acknowledged limitations on the proxy metric treated as a correctness concern rather than circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are introduced; the work relies on existing Codeforces problems, the Verus verifier, and standard test/hack data.

pith-pipeline@v0.9.1-grok · 5933 in / 1219 out tokens · 36392 ms · 2026-06-29T16:29:31.401028+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Analyzing the Narration Gap in LLM-Solver Loops

    cs.AI 2026-06 unverdicted novelty 8.0

    The narration step in LLM-solver loops is vulnerable to prompt injection that inverts verified solver conclusions, and hardened prompts reduce but do not eliminate the risk under adaptive attacks.

Reference graph

Works this paper leans on

17 extracted references · cited by 1 Pith paper

  1. [1]

    Hack availability.We remove problems from early contests that predate the Codeforces hack system, since hacks are essential for our soundness test cases

  2. [2]

    Stage 3: Hack collection and categorization.For each remaining problem, we collect official Codeforces tests and all user-submitted hacks

    No floating point.We exclude problems that involve floating-point arithmetic, as Verus does not currently support floating-point reasoning. Stage 3: Hack collection and categorization.For each remaining problem, we collect official Codeforces tests and all user-submitted hacks. As described in §3.1 and summarized in Figure 4, a hack produces a test case w...

  3. [3]

    Valid input, incorrect output(the hacked solution produces output rejected by the Code- forces checker): the input contributes to τpre-comp, and the input-output pair is added to τpost-sound

  4. [4]

    correct” categories mean the resolved accept/reject verdict matches the testcase bucket; the “incorrect

    Valid input, correct output(the hacked solution happens to produce accepted output): the input contributes toτ pre-comp, and the pair is added toτ post-comp. Several cleaning steps follow bucket assignment. Different attackers often propose identical hacks targeting different submissions; we de-duplicate test cases by their raw input-output content. Some ...

  5. [5]

    The exact union contains 13 integer points, yet gemini-3.1pro’s executable postcondition rejects

  6. [6]

    ‘out.len() == n‘

    Similarly, hack_1118906__participant contains five identical radius-2 circles centered at 0, so the union is just one radius-2circle and again contains13points. 49 Listing 6: Snippet from postcondition generated by gemini-3.1pro for Codeforces 2074D that fails post-completeness hacks. exec_spec_unverified! { pub struct YInterval { pub y: i64, pub start: i...

  7. [7]

    **Starting point**: 317 318- It uses the ‘/home/dev/solve.rs‘ file you edited, with the provided fixed 319types and printers. 55 320- It also uses four **derived files** that extract only: 321- Your ‘pre_spec‘ / ‘post_spec‘, 322- Your four ‘*_proof‘ helpers, 323- The relevant ‘check_*‘ wrapper, 324- And any helper functions you introduced. 325

  8. [8]

    **Snippet injection**: 327 328- For each testcase and category, it reads: 329- ‘out.input_defn‘ and possibly ‘out.gt_output_defn‘. 330- It then substitutes these into the appropriate ‘check_*‘ function bodies: 331 332‘‘‘rust 333// __PASTE_out.input_defn__ 334// __PASTE_out.gt_output_defn__ 335‘‘‘ 336 337- The assertions inside the ‘check_*‘ functions are ...

  9. [9]

    352- ‘pre_sound‘ uses ‘check_pre_spec_soundness‘

    **Verus execution**: 348 349- For each testcase, the evaluator runs Verus on the derived check wrapper 350for that testcase’s bucket: 351- ‘pre_complete‘ uses ‘check_pre_spec_completeness‘. 352- ‘pre_sound‘ uses ‘check_pre_spec_soundness‘. 353- ‘post_complete‘ uses ‘check_post_spec_completeness‘. 354- ‘post_sound‘ uses ‘check_post_spec_soundness‘. 355- It...

  10. [10]

    364- The test passes if Verus can prove ‘assert(pre_spec(...))‘

    **Pass/fail logic**: 361 362- For **pre-completeness** tests (‘pre_complete‘): 363- The inputs are known valid. 364- The test passes if Verus can prove ‘assert(pre_spec(...))‘. 365- For **pre-soundness** tests (‘pre_sound‘): 366- The inputs are invalid. 367- The test passes if Verus can prove ‘assert(!pre_spec(...))‘. 368- For **post-completeness** tests ...

  11. [11]

    379- Final scoring uses the full evaluator suite, including hidden tests not 380present under ‘/home/symbolic_tests‘

    **Visible vs hidden tests**: 376 377- Running ‘verus_gym_specgen_check‘ in this environment may give local/debug 378feedback on only the visible sample tests. 379- Final scoring uses the full evaluator suite, including hidden tests not 380present under ‘/home/symbolic_tests‘. 381- Use visible failures as diagnostics, but write specs from the problem 382st...

  12. [12]

    436- Inputs are represented by a single logical input struct ‘In1‘

    **Understand the problem:** 435- Read ‘/home/problem_artifacts/problem_statement.md‘ carefully. 436- Inputs are represented by a single logical input struct ‘In1‘. 437- Outputs are represented by ‘Out‘. 438- Check ‘/home/examples/solve.rs‘ for a worked example of a complete Verus specification file. 439

  13. [13]

    442- Look at a few ‘out.input_defn‘ / ‘out.gt_output_defn‘ files under ‘/home/symbolic_tests/*/*/‘ 443to see concrete examples of how ‘ExecIn1‘ / ‘ExecOut‘ are constructed

    **Inspect the existing types and snippets:** 441- Look at ‘/home/dev/solve.rs‘ and confirm the definitions of ‘In1‘ and ‘Out‘. 442- Look at a few ‘out.input_defn‘ / ‘out.gt_output_defn‘ files under ‘/home/symbolic_tests/*/*/‘ 443to see concrete examples of how ‘ExecIn1‘ / ‘ExecOut‘ are constructed. 444

  14. [14]

    **Sketch ‘pre_spec‘:** 446- Encode all input constraints from the problem statement. 447

  15. [15]

    **Sketch ‘post_spec‘:** 449- Express the output correctness condition from the problem statement. 450

  16. [16]

    453- Fill in the four ‘*_proof‘ functions enough to help Verus verify the 454‘check_*‘ assertions

    **Add or refine helpers and proofs:** 452- Introduce helper spec functions to keep ‘pre_spec‘/‘post_spec‘ readable. 453- Fill in the four ‘*_proof‘ functions enough to help Verus verify the 454‘check_*‘ assertions. 455

  17. [17]

    459- For failing tests, inspect: 460- ‘/home/attempts/<run_id>/snippets/<category>/<test_id>/*.rs‘ 461- Corresponding ‘.stdout‘ / ‘.stderr‘ paths mentioned in the error messages

    **Use the feedback loop:** 457- After running ‘verus_gym_specgen_check‘, read 458‘/home/attempts/<run_id>/natural_language_feedback.txt‘. 459- For failing tests, inspect: 460- ‘/home/attempts/<run_id>/snippets/<category>/<test_id>/*.rs‘ 461- Corresponding ‘.stdout‘ / ‘.stderr‘ paths mentioned in the error messages. 462- Adjust specs and proofs to fix the ...