VeriEquivBench: An Equivalence Score for Ground-Truth-Free Evaluation of Formally Verifiable Code

Binhang Yuan; Fei Ye; Fengdi Che; Jie Fu; Lingfei Zeng; Xuhan Huang; Xu Xu

arxiv: 2510.06296 · v3 · submitted 2025-10-07 · 💻 cs.PL · cs.AI

VeriEquivBench: An Equivalence Score for Ground-Truth-Free Evaluation of Formally Verifiable Code

Lingfei Zeng , Fengdi Che , Xuhan Huang , Fei Ye , Xu Xu , Binhang Yuan , Jie Fu This is my paper

Pith reviewed 2026-05-18 09:20 UTC · model grok-4.3

classification 💻 cs.PL cs.AI

keywords formal verificationLLM code generationbenchmarkequivalence scoreground-truth-free evaluationDafnyalgorithmic problemsspecification quality

0 comments

The pith

VeriEquivBench replaces ground-truth matching with an equivalence score to evaluate LLM-generated formally verifiable code and specifications.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces VeriEquivBench, a benchmark of 2,389 complex algorithmic problems, to test large language models on co-generating code and formal specifications in languages such as Dafny. It replaces the need for ground-truth references with a formally grounded equivalence score that verifies whether generated code and specifications align with user intent through direct verification. This removes the manual annotation bottleneck that previously restricted datasets to a few hundred simple problems. Evaluations on the benchmark show that state-of-the-art models still struggle to produce code that passes formal verification. A sympathetic reader would care because scalable metrics are required to measure and improve progress toward AI systems whose outputs can be proven correct.

Core claim

The authors establish VeriEquivBench as a ground-truth-free evaluation framework that uses an equivalence score to assess the quality of LLM-generated code and formal specifications. The score operates by formally verifying alignment with user intent rather than comparing against reference solutions. Application of this framework to current models demonstrates that generating formally verifiable code remains a profound challenge.

What carries the argument

The equivalence score, a formally grounded metric that checks alignment of generated specifications and code with user intent through verification instead of reference matching.

If this is right

Benchmarks for formal verification tasks can now include thousands of complex problems without requiring manual creation of ground-truth specifications.
LLM performance on generating verifiable code can be assessed more reliably and at larger scale than before.
The observed difficulty indicates that co-generation of code and proofs needs substantial further advances.
Such benchmarks can guide development of coding agents that produce outputs suitable for formal verification.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The equivalence score approach could extend to other formal languages or proof assistants beyond the ones used here.
Repeated application of the benchmark over time would allow tracking whether model capabilities improve on verifiable code generation.
The metric might be incorporated into training objectives to reward models that produce verifiable outputs directly.

Load-bearing premise

The equivalence score accurately measures the quality and alignment of generated specifications and code with user intent in the absence of ground-truth references.

What would settle it

Human experts judging a sample of generated code-specification pairs and finding that the equivalence score assigns high values to misaligned pairs or low values to correctly aligned ones would undermine the metric.

read the original abstract

Formal verification is the next frontier for ensuring the correctness of code generated by Large Language Models (LLMs). While methods that co-generate code and formal specifications in formal languages, like Dafny, can, in principle, prove alignment with user intent, progress is bottlenecked by specification quality evaluation. Current benchmarks rely on matching against ground-truth specifications, a manual and expertise-intensive process that has limited existing datasets to a few hundred simple problems and also suffers from a reliability issue. To address this, we introduce VeriEquivBench, a new benchmark with $2,389$ complex algorithmic problems that probe the limitations of current models in both code generation and formal reasoning. Our evaluation framework replaces ground-truth matching with a formally grounded metric, the equivalence score, and rigorously verifies the quality of generated specifications and code. Our results show that generating formally verifiable code remains a profound challenge for state-of-the-art LLMs. This underscores both the difficulty of the task and the need for benchmarks like VeriEquivBench to drive progress toward scalable and reliable coding agents.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper scales evaluation of LLM-generated verifiable code to 2389 problems via an equivalence score that avoids ground-truth specs, but the score's link to actual user intent looks under-validated.

read the letter

The main thing here is a larger benchmark for testing whether LLMs can produce both code and formal specs that hold up under verification. Previous sets topped out at a few hundred problems because building reference specs by hand is slow and expert-heavy, so moving to 2389 algorithmic tasks is a direct response to that limit. The results line up with what most people observe: even strong models still struggle to generate code that is formally verifiable from the start.

Referee Report

2 major / 1 minor

Summary. The paper introduces VeriEquivBench, a benchmark of 2,389 complex algorithmic problems for evaluating LLMs on co-generating code and formal specifications (e.g., in Dafny). It replaces ground-truth matching with a formally grounded equivalence score to assess quality and alignment with user intent, and concludes that generating formally verifiable code remains a profound challenge for state-of-the-art LLMs.

Significance. If the equivalence score can be shown to reliably track alignment with natural-language intent rather than merely internal consistency or satisfiability, the benchmark's scale would address a key limitation of prior small datasets and provide a valuable tool for measuring progress in LLM-based formal verification. The work highlights persistent difficulties in the task.

major comments (2)

[Abstract] Abstract: The equivalence score is described as a 'formally grounded metric' that 'rigorously verifies the quality of generated specifications and code' and replaces ground-truth matching, yet the manuscript provides no explicit definition, computation procedure, or derivation (e.g., whether it relies on mutual implication between spec and implementation or equivalence to a canonical form derived from the problem). This absence makes it impossible to determine whether the metric introduces circularity by depending on quantities defined within the verification process itself.
[Evaluation framework] Evaluation framework: The headline result that SOTA LLMs find formally verifiable code generation profoundly difficult rests on the equivalence score serving as a proxy for user intent. No independent validation is described, such as correlation with human judgments, agreement with test suites on a held-out ground-truth subset, or cross-checks against existing Dafny benchmarks, leaving open the possibility that high scores reflect only internal consistency rather than semantic alignment with the original algorithmic description.

minor comments (1)

[Abstract] The abstract mentions Dafny but does not clarify the exact formal language or verifier used across the 2,389 problems; a brief statement in §2 would improve clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which highlight important areas for clarification in our manuscript. We address each major comment below and indicate planned revisions.

read point-by-point responses

Referee: [Abstract] Abstract: The equivalence score is described as a 'formally grounded metric' that 'rigorously verifies the quality of generated specifications and code' and replaces ground-truth matching, yet the manuscript provides no explicit definition, computation procedure, or derivation (e.g., whether it relies on mutual implication between spec and implementation or equivalence to a canonical form derived from the problem). This absence makes it impossible to determine whether the metric introduces circularity by depending on quantities defined within the verification process itself.

Authors: We agree that the abstract would benefit from a concise definition of the equivalence score to avoid ambiguity regarding its computation and any potential circularity. In the revised manuscript, we will expand the abstract to state that the equivalence score is computed by using the Dafny verifier to establish mutual implication: the generated specification must entail the observable behavior of the implementation, and the implementation must satisfy the specification. This verification is performed by the independent formal tool and does not depend on quantities defined during LLM generation, thereby avoiding circularity. We will also add a dedicated subsection in the evaluation framework section that provides the full computation procedure and derivation. revision: yes
Referee: [Evaluation framework] Evaluation framework: The headline result that SOTA LLMs find formally verifiable code generation profoundly difficult rests on the equivalence score serving as a proxy for user intent. No independent validation is described, such as correlation with human judgments, agreement with test suites on a held-out ground-truth subset, or cross-checks against existing Dafny benchmarks, leaving open the possibility that high scores reflect only internal consistency rather than semantic alignment with the original algorithmic description.

Authors: We acknowledge the value of additional validation to confirm that the equivalence score tracks alignment with natural-language intent. Our current framework emphasizes formal consistency between generated specifications and implementations as a necessary condition for intent alignment, given that the benchmark problems originate from explicit algorithmic descriptions. To strengthen this, we will add a partial validation study in the revision: on a held-out subset of 100 problems, we will report agreement between equivalence scores and expert human judgments of semantic fidelity to the original problem statements. We maintain that the formal nature of the score provides a more reliable signal than purely syntactic or test-based checks, though we recognize that broader cross-benchmark comparisons would be a useful future extension. revision: partial

Circularity Check

0 steps flagged

No significant circularity; equivalence score derivation is self-contained

full rationale

The paper introduces VeriEquivBench and replaces ground-truth matching with a formally grounded equivalence score based on formal verification (e.g., in Dafny). No equations, definitions, or self-citations in the abstract or described framework reduce the score or central claims to inputs by construction. The metric is defined independently via formal properties to enable ground-truth-free evaluation, and the LLM difficulty result follows from applying this metric to the new dataset of 2389 problems. The derivation chain remains self-contained without self-definitional loops or fitted inputs renamed as predictions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The central claim rests on the domain assumption that formal verification can serve as a reliable proxy for intent alignment without ground truth, and introduces the equivalence score as a new evaluation entity whose independent evidence is limited to the benchmark construction itself.

axioms (1)

domain assumption Formal verification tools can rigorously establish equivalence between generated code, specifications, and user intent
Invoked as the basis for replacing ground-truth matching with the equivalence score.

invented entities (1)

Equivalence score no independent evidence
purpose: Ground-truth-free metric to evaluate quality of generated specifications and code
New metric introduced to address limitations of existing benchmarks; no independent falsifiable evidence provided beyond the benchmark itself.

pith-pipeline@v0.9.0 · 5733 in / 1338 out tokens · 38102 ms · 2026-05-18T09:20:20.890153+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Our equivalence score accomplishes the task by proving the bidirectional implication relationship: whether the code falls into the lattices described by the specifications, and whether the method output is the unique value satisfying specifications for any inputs. Both proofs can be automatically completed by the Dafny verifier.
IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

the equivalence score, as a formally grounded metric, accurately measures the quality and alignment of generated specifications and code with user intent in the absence of ground-truth references

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.