Improving LLM Code Reasoning via Semantic Equivalence Self-Play with Formal Verification

Antonio Valerio Miceli Barone; Poon Tsz Nok

arXiv: 2604.17010 · v2 · submitted 2026-04-18 · cs.CL · cs.AI · cs.LG · cs.PL

Pith reviewed 2026-05-10 06:43 UTC · model grok-4.3

keywords: self-play · semantic equivalence · formal verification · code reasoning · Haskell · LLM evaluator · EquiBench · PySecDB

The pith

Self-play training on formally verified semantic equivalence in Haskell improves LLM accuracy on the EquiBench code-reasoning benchmark by up to 13.3 percentage points.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a training loop in which one model generates Haskell program variants while a second model judges whether the variants compute the same result. Liquid Haskell supplies machine-checked proofs that establish equivalence, and differing execution results supply counterexamples for inequivalent cases. These labeled pairs are presented through a difficulty-ordered curriculum so the evaluator learns to recognize semantic equivalence. When tested on separate benchmarks, the resulting evaluator gains up to 13.3 percentage points of accuracy on EquiBench and improves consistently on PySecDB, a security-related dataset. Ablation runs that keep the counterexamples but remove the proof-based equivalence labels indicate that the formal proofs, not the data volume, drive the reasoning improvement.
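The negative half of this labeling scheme can be illustrated with a small sketch. The Haskell program below is not the paper's pipeline: `revRec`, `revAcc`, `revBroken`, `smallInputs`, and `counterexample` are hypothetical names chosen for illustration, and the exhaustive search over short lists is an assumed stand-in for whatever execution-based testing the authors use. Finding an input on which two programs differ settles inequivalence. Finding none proves nothing, which is the gap the Liquid Haskell proofs are meant to close.

```haskell
module Main where

import Data.List (find)

-- Hypothetical pair: two list reversals that compute the same function.
revRec :: [Int] -> [Int]
revRec []     = []
revRec (x:xs) = revRec xs ++ [x]

revAcc :: [Int] -> [Int]
revAcc = go []
  where
    go acc []     = acc
    go acc (x:xs) = go (x : acc) xs

-- Hypothetical inequivalent variant: it drops the head before reversing.
revBroken :: [Int] -> [Int]
revBroken []     = []
revBroken (_:xs) = revRec xs

-- All lists of length 0..n over the small alphabet {0, 1, 2},
-- enumerated shortest first.
smallInputs :: Int -> [[Int]]
smallInputs n = concatMap lists [0 .. n]
  where
    lists 0 = [[]]
    lists k = [x : xs | x <- [0, 1, 2], xs <- lists (k - 1)]

-- Return the first input on which the two functions disagree, if any.
-- 'Nothing' means "no counterexample found", not "proved equivalent".
counterexample :: Eq b => (a -> b) -> (a -> b) -> [a] -> Maybe a
counterexample f g = find (\x -> f x /= g x)

main :: IO ()
main = do
  print (counterexample revRec revAcc    (smallInputs 4))  -- prints Nothing
  print (counterexample revRec revBroken (smallInputs 4))  -- prints Just [0]
```

In this sketch the second pair can be labeled inequivalent on the strength of the witness `[0]`, while the first pair stays unlabeled until a proof is available.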

Core claim

The central claim is that an adversarial self-play procedure organized around verified semantic equivalence produces an evaluator whose code-reasoning ability exceeds that of models trained on execution counterexamples alone. On the EquiBench benchmark the evaluator gains up to 13.3 percentage points of accuracy, with additional consistent gains on the PySecDB security dataset. The authors further report, based on controlled ablations, that equivalence proofs are uniquely responsible for the acquired reasoning capabilities while inequivalence data mainly supplies volume.

What carries the argument

The semantic-equivalence self-play framework that alternates generator and evaluator roles, using Liquid Haskell proofs for positive equivalence labels and execution counterexamples for negative labels, organized by a difficulty-aware curriculum.

If this is right

The evaluator model can be applied to new code-equivalence questions without task-specific fine-tuning.
Equivalence proofs contribute more to reasoning capability than additional volumes of inequivalence counterexamples.
The difficulty-aware curriculum allows the models to scale from simple to complex program pairs during training.
Both the full training pipeline and the synthetic dataset of roughly 28,000 validated programs support direct reproduction and extension.
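The curriculum point in the list above can be made concrete with a minimal sketch. It assumes each labeled pair already carries a numeric difficulty score; how the paper computes that score is not described here, so the `Example` record, its fields, and the fixed stage size are all hypothetical.

```haskell
module Main where

import Data.List (sortOn)

-- Hypothetical training example: a program pair with its label and an
-- assumed numeric difficulty score (higher means harder).
data Example = Example
  { pairId     :: String
  , equivalent :: Bool
  , difficulty :: Double
  } deriving (Show, Eq)

-- Order examples from easiest to hardest.
curriculum :: [Example] -> [Example]
curriculum = sortOn difficulty

-- Split an ordered list into consecutive stages of at most n items.
-- A non-positive stage size yields no stages.
stages :: Int -> [a] -> [[a]]
stages n xs
  | n <= 0 || null xs = []
  | otherwise         = take n xs : stages n (drop n xs)

-- Illustrative data only; not taken from OpInstruct-HSx.
examples :: [Example]
examples =
  [ Example "p3" True  0.9
  , Example "p1" False 0.1
  , Example "p2" True  0.4
  ]

main :: IO ()
main = mapM_ (print . map pairId) (stages 2 (curriculum examples))
```

With the illustrative data, the first stage holds `p1` and `p2` and the second holds `p3`, so training would see the harder pair last.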

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same generator-evaluator loop could be adapted to other languages that offer comparable formal verification support.
Grounding training signals in symbolic proofs rather than execution traces alone may offer a general path toward stronger reasoning in code-focused language models.
An evaluator that has internalized verified equivalence relations could serve as a reliable component inside automated testing or program repair systems.

Load-bearing premise

Liquid Haskell proofs correctly and exhaustively identify semantically equivalent program pairs without systematic false positives or negatives that the training loop could exploit.

What would settle it

Evaluating the trained model on a new collection of Haskell program pairs whose equivalence status has been confirmed by an independent method such as manual review or a different verification tool, then checking whether its accuracy gain over the base model remains comparable to the reported gain of up to 13.3 percentage points.

Figures

Figures reproduced from arXiv: 2604.17010 by Antonio Valerio Miceli Barone, Poon Tsz Nok.

**Figure 2.** Shows the entire multi-stage filtering mechanism. We have contributed OpInstruct-HSx, a clean and executable Haskell dataset for both SEQ and SINQ games, which consists of approximately 28,000 validated Haskell functions derived from real-world problems. This dataset is made publicly available, serving as a high-quality synthetic Haskell resource for training LLMs in semantic reasoning tasks. The code…

**Figure 3.** Mean and standard deviation for the difficulty…

**Figure 4.** Accuracy comparison between base and fine…

**Figure 5.** F1 score comparison between base and fine…

**Figure 6.** Counts and proportions of validated Alice…

**Figure 7.** Mean and box-whisker plots of the difficulty…

**Figure 8.** Alice’s HumanEval results. Averages over 16…

**Figure 9.** Alice’s MBPP results. Averages over 16 trials.

**Figure 14.** Alice’s MBPP results. Averages over 16 trials. [Plot text from H.3.3, Alice’s SEQ vs SINQ Validated Generation Counts: "Alice SEQ vs SINQ: Counts by Round"; x-axis Round (1 to 7), y-axis Count; series SEQ, SINQ.]

**Figure 15.** SEQ vs SINQ Validated Generation Counts.

**Figure 16.** Alice’s HumanEval results. Averages over…

**Figure 17.** Alice’s MBPP results. Averages over 16 trials. [Plot text from H.4.3, Alice’s SEQ vs SINQ Validated Generation Counts: "Alice SEQ vs SINQ: Counts by Round"; x-axis Round (1 to 7), y-axis Count; series SEQ, SINQ.]

**Figure 18.** SEQ vs SINQ Validated Generation Counts.

**Figure 19.** Validated Generation Counts of both exper…

Original abstract

We introduce a self-play framework for semantic equivalence in Haskell, utilizing formal verification to guide adversarial training between a generator and an evaluator. The framework leverages Liquid Haskell proofs for validating equivalence and execution-based counterexamples for inequivalence, organized via a difficulty-aware curriculum. To facilitate this, we release OpInstruct-HSx, a synthetic dataset of ≈28k validated Haskell programs. Empirical experiments show that our evaluator transfers effectively to downstream tasks, achieving up to 13.3pp accuracy gain on EquiBench and consistent gains on PySecDB. Ablation studies on the SEQ-SINQ regimes indicate that while inequivalence supervision provides data volume, equivalence proofs are uniquely responsible for the model's reasoning capabilities. The entire training pipeline and dataset are publicly released on GitHub and Hugging Face respectively.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

They built a self-play loop using Liquid Haskell proofs for equivalence labels on Haskell code and report benchmark gains, but the ablation claim that proofs uniquely drive reasoning gains looks shaky without proof completeness checks.

The core idea here is a self-play setup where one model generates Haskell program pairs, Liquid Haskell tries to prove they are equivalent, and execution finds counterexamples when they are not. These labels feed a curriculum-trained evaluator that then gets tested on external tasks. They also ship the OpInstruct-HSx dataset of roughly 28k validated programs plus the full training code.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces a self-play framework (SEQ-SINQ) for training LLMs on semantic equivalence and inequivalence of Haskell programs. It combines Liquid Haskell proofs for equivalence validation with execution-based counterexamples for inequivalence, organized via a difficulty-aware curriculum on the released OpInstruct-HSx synthetic dataset (~28k programs). The evaluator is shown to transfer to downstream tasks with up to 13.3pp accuracy gains on EquiBench and consistent gains on PySecDB; ablations on SEQ-SINQ regimes claim that equivalence proofs (rather than inequivalence data volume) are uniquely responsible for the reasoning improvements.

Significance. If the results hold, the work offers a concrete method for using formal verification to generate high-quality, proof-based supervision signals in self-play, which could improve generalization in code reasoning models beyond purely execution-based or synthetic data. The public release of the full training pipeline on GitHub and the OpInstruct-HSx dataset on Hugging Face is a clear strength that enables reproducibility and follow-on work.

major comments (2)

[Ablation studies on the SEQ-SINQ regimes] The central claim that 'equivalence proofs are uniquely responsible for the model's reasoning capabilities' (while inequivalence supervision only provides data volume) is load-bearing for the paper's contribution. This attribution assumes Liquid Haskell supplies accurate and unbiased positive labels. However, Liquid Haskell is sound but incomplete; many semantically equivalent pairs may fail to type-check or to yield proofs because of refinement-type limitations, solver timeouts, or unprovable properties. Without reported proof success rates, false-negative rates on equivalent pairs, or validation that the curriculum does not overfit to provable fragments of OpInstruct-HSx, the ablation's causal separation is not fully supported.
[§4, or the equivalent section describing verification and curriculum] The manuscript does not report verification error rates, proof completion statistics, or an analysis of how many candidate pairs are discarded due to unprovability. This information is necessary to evaluate whether the 13.3pp EquiBench gain reflects genuine semantic reasoning or selection bias toward easier-to-prove programs.
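The statistics requested in these two comments are simple to compute once verifier outcomes are logged. The sketch below shows one way to tally them; the `Outcome` constructors, the `Stats` record, and the sample list are hypothetical and do not reflect Liquid Haskell's actual output or any figure from the paper.

```haskell
module Main where

-- Hypothetical result of asking the verifier to prove one candidate pair
-- equivalent. Constructor names are illustrative, not Liquid Haskell output.
data Outcome = Proved | Rejected | TimedOut
  deriving (Show, Eq)

data Stats = Stats
  { total     :: Int  -- candidate pairs submitted
  , proved    :: Int  -- pairs with a completed proof
  , discarded :: Int  -- pairs dropped because no proof was obtained
  } deriving (Show, Eq)

-- Tally outcomes into the counts a verification report would need.
summarise :: [Outcome] -> Stats
summarise os = Stats { total = n, proved = p, discarded = n - p }
  where
    n = length os
    p = length (filter (== Proved) os)

-- Fraction of submitted pairs that were proved; 0 for an empty log.
proofRate :: Stats -> Double
proofRate s
  | total s == 0 = 0
  | otherwise    = fromIntegral (proved s) / fromIntegral (total s)

-- Made-up log for illustration only.
sampleOutcomes :: [Outcome]
sampleOutcomes = [Proved, Rejected, Proved, TimedOut]

main :: IO ()
main = do
  let s = summarise sampleOutcomes
  print s
  print (proofRate s)  -- prints 0.5 for the made-up log
```

Reporting the discarded count alongside the proof rate is what would let a reader judge how much of the candidate pool never reached training.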

minor comments (2)

The abstract states 'up to 13.3pp accuracy gain' without specifying the exact baseline model, split, or statistical significance; this detail should appear in the main results table or text for clarity.
Notation for the SEQ-SINQ regimes and the difficulty metric used in the curriculum could be defined more explicitly in the methods section to aid replication.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on the ablation studies and verification details. These points help clarify the strength of our claims regarding the role of equivalence proofs. We respond to each major comment below and will incorporate revisions as indicated.

Referee: [Ablation studies on the SEQ-SINQ regimes] The central claim that 'equivalence proofs are uniquely responsible for the model's reasoning capabilities' (while inequivalence supervision only provides data volume) is load-bearing for the paper's contribution. This attribution assumes Liquid Haskell supplies accurate and unbiased positive labels. However, Liquid Haskell is sound but incomplete; many semantically equivalent pairs may fail to type-check or to yield proofs because of refinement-type limitations, solver timeouts, or unprovable properties. Without reported proof success rates, false-negative rates on equivalent pairs, or validation that the curriculum does not overfit to provable fragments of OpInstruct-HSx, the ablation's causal separation is not fully supported.

Authors: We agree that Liquid Haskell's incompleteness requires careful interpretation of the ablation results. The positive labels in our training are guaranteed correct by successful proofs (soundness ensures no false positives), while incompleteness means some equivalent pairs are not labeled positive and thus treated as unknown. This could increase task difficulty rather than bias toward easier fragments. To strengthen the causal claim and address potential selection effects in the difficulty-aware curriculum, we will add to the revised manuscript a dedicated analysis in §4 reporting proof success rates on OpInstruct-HSx, the fraction of equivalent pairs that successfully produce proofs, and statistics on discarded pairs. This will better support the attribution that equivalence proofs, rather than data volume alone, drive the reasoning gains. revision: yes
Referee: [§4, or the equivalent section describing verification and curriculum] The manuscript does not report verification error rates, proof completion statistics, or an analysis of how many candidate pairs are discarded due to unprovability. This information is necessary to evaluate whether the 13.3pp EquiBench gain reflects genuine semantic reasoning or selection bias toward easier-to-prove programs.

Authors: We acknowledge that the current version of the manuscript does not include these verification statistics. In the revision, we will expand §4 to report proof completion rates, verification error rates (noting that Liquid Haskell produces no false positives due to soundness, with errors stemming from incompleteness or timeouts), the number of candidate pairs discarded due to unprovability, and details on how the curriculum filters and orders pairs. This addition will allow direct assessment of selection bias. While the observed transfer gains on EquiBench and PySecDB are consistent with improved semantic reasoning, the new statistics will provide the necessary transparency to evaluate this. revision: yes
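The distinction the simulated responses rely on, soundness without completeness, can be written down as a three-way labeling rule. This is a sketch under stated assumptions: `ProofResult`, `Label`, and `labelPair` are hypothetical names, and the `Inconsistent` case is an added safeguard for a situation that a sound verifier should never produce.

```haskell
module Main where

-- Hypothetical verifier verdict for a candidate pair.
data ProofResult = Proved | NotProved
  deriving (Show, Eq)

-- Training label derived from the proof attempt and the counterexample search.
data Label
  = Equivalent    -- a proof exists, so the positive label is trustworthy
  | Inequivalent  -- an input was found on which the programs differ
  | Unknown       -- neither proof nor counterexample: incompleteness
  | Inconsistent  -- proof and counterexample both present: should not occur
  deriving (Show, Eq)

-- Combine the two signals; the counterexample's type is left abstract.
labelPair :: ProofResult -> Maybe cex -> Label
labelPair Proved    Nothing  = Equivalent
labelPair NotProved (Just _) = Inequivalent
labelPair NotProved Nothing  = Unknown
labelPair Proved    (Just _) = Inconsistent

main :: IO ()
main = mapM_ print
  [ labelPair Proved    (Nothing :: Maybe [Int])
  , labelPair NotProved (Just [0 :: Int])
  , labelPair NotProved (Nothing :: Maybe [Int])
  ]
```

Under this rule an unproved pair with no counterexample is never counted as a negative, so the selection-bias question becomes how large the `Unknown` bucket is and what kinds of programs fall into it.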

Circularity Check

0 steps flagged

No significant circularity; results on external benchmarks are independent of training inputs.

The paper's claims center on empirical accuracy gains (up to 13.3pp on EquiBench) and ablation findings from SEQ-SINQ regimes, evaluated on held-out external benchmarks (EquiBench, PySecDB) and a released synthetic dataset (OpInstruct-HSx). These quantities are not defined from or reduced to the self-play training loop, Liquid Haskell proofs, or curriculum inputs by construction. The framework generates data but measures transfer to independent downstream tasks, with ablations explicitly separating data volume from proof-based signals. No self-definitional equations, fitted-input predictions, or load-bearing self-citations appear in the described pipeline; results remain falsifiable against external data.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the reliability of Liquid Haskell for semantic equivalence validation and on the assumption that self-play with curriculum produces transferable reasoning rather than dataset-specific patterns.

axioms (1)

domain assumption: Liquid Haskell proofs correctly and completely capture semantic equivalence of Haskell programs.
The framework uses these proofs as the positive signal for equivalence during training.
