Improving LLM Code Reasoning via Semantic Equivalence Self-Play with Formal Verification
Pith reviewed 2026-05-10 06:43 UTC · model grok-4.3
The pith
Self-play training on formally verified semantic equivalence in Haskell improves LLM accuracy on code reasoning benchmarks by up to 13 points.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that an adversarial self-play procedure organized around verified semantic equivalence produces an evaluator whose code-reasoning ability exceeds models trained on execution counterexamples alone. On the EquiBench benchmark the evaluator reaches an accuracy gain of 13.3 percentage points, with additional consistent gains on the PySecDB security dataset. The authors further show through controlled ablations that equivalence proofs are uniquely responsible for the acquired reasoning capabilities while inequivalence data mainly supplies volume.
What carries the argument
The semantic-equivalence self-play framework that alternates generator and evaluator roles, using Liquid Haskell proofs for positive equivalence labels and execution counterexamples for negative labels, organized by a difficulty-aware curriculum.
If this is right
- The evaluator model can be applied to new code-equivalence questions without task-specific fine-tuning.
- Equivalence proofs contribute more to reasoning capability than additional volumes of inequivalence counterexamples.
- The difficulty-aware curriculum allows the models to scale from simple to complex program pairs during training.
- Both the full training pipeline and the synthetic dataset of roughly 28,000 validated programs support direct reproduction and extension.
Where Pith is reading between the lines
- The same generator-evaluator loop could be adapted to other languages that offer comparable formal verification support.
- Grounding training signals in symbolic proofs rather than execution traces alone may offer a general path toward stronger reasoning in code-focused language models.
- An evaluator that has internalized verified equivalence relations could serve as a reliable component inside automated testing or program repair systems.
Load-bearing premise
Liquid Haskell proofs correctly and exhaustively identify semantically equivalent program pairs without systematic false positives or negatives that the training loop could exploit.
What would settle it
Evaluating the trained model on a new collection of Haskell program pairs whose equivalence status has been confirmed by an independent method such as manual review or a different verification tool, then checking whether accuracy remains near the reported 13-point gain.
Figures
read the original abstract
We introduce a self-play framework for semantic equivalence in Haskell, utilizing formal verification to guide adversarial training between a generator and an evaluator. The framework leverages Liquid Haskell proofs for validating equivalence and execution-based counterexamples for inequivalence, organized via a difficulty-aware curriculum. To facilitate this, we release \textbf{OpInstruct-HSx}, a synthetic dataset of $\approx$28k validated Haskell programs. Empirical experiments show that our evaluator transfers effectively to downstream tasks, achieving up to 13.3pp accuracy gain on EquiBench and consistent gains on PySecDB. Ablation studies on the SEQ-SINQ regimes indicate that while inequivalence supervision provides data volume, equivalence proofs are uniquely responsible for the model's reasoning capabilities. The entire training pipeline and dataset are publicly released on GitHub and Hugging Face respectively.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces a self-play framework (SEQ-SINQ) for training LLMs on semantic equivalence and inequivalence of Haskell programs. It combines Liquid Haskell proofs for equivalence validation with execution-based counterexamples for inequivalence, organized via a difficulty-aware curriculum on the released OpInstruct-HSx synthetic dataset (~28k programs). The evaluator is shown to transfer to downstream tasks with up to 13.3pp accuracy gains on EquiBench and consistent gains on PySecDB; ablations on SEQ-SINQ regimes claim that equivalence proofs (rather than inequivalence data volume) are uniquely responsible for the reasoning improvements.
Significance. If the results hold, the work offers a concrete method for using formal verification to generate high-quality, proof-based supervision signals in self-play, which could improve generalization in code reasoning models beyond purely execution-based or synthetic data. The public release of the full training pipeline on GitHub and the OpInstruct-HSx dataset on Hugging Face is a clear strength that enables reproducibility and follow-on work.
major comments (2)
- [Ablation studies on the SEQ-SINQ regimes] Ablation studies on the SEQ-SINQ regimes: the central claim that 'equivalence proofs are uniquely responsible for the model's reasoning capabilities' (while inequivalence supervision only provides data volume) is load-bearing for the paper's contribution. This attribution assumes Liquid Haskell supplies accurate and unbiased positive labels. However, Liquid Haskell is sound but incomplete; many semantically equivalent pairs may fail to type-check or yield proofs due to refinement-type limitations, solver timeouts, or unprovable properties. Without reported proof success rates, false-negative rates on equivalent pairs, or validation that the curriculum does not overfit to provable fragments of OpInstruct-HSx, the ablation's causal separation is not fully supported.
- [§4] §4 (or equivalent section describing verification and curriculum): the manuscript does not report verification error rates, proof completion statistics, or an analysis of how many candidate pairs are discarded due to unprovability. This information is necessary to evaluate whether the 13.3pp EquiBench gain reflects genuine semantic reasoning or selection bias toward easier-to-prove programs.
minor comments (2)
- The abstract states 'up to 13.3pp accuracy gain' without specifying the exact baseline model, split, or statistical significance; this detail should appear in the main results table or text for clarity.
- Notation for the SEQ-SINQ regimes and the difficulty metric used in the curriculum could be defined more explicitly in the methods section to aid replication.
Simulated Author's Rebuttal
We thank the referee for the constructive comments on the ablation studies and verification details. These points help clarify the strength of our claims regarding the role of equivalence proofs. We respond to each major comment below and will incorporate revisions as indicated.
read point-by-point responses
-
Referee: [Ablation studies on the SEQ-SINQ regimes] Ablation studies on the SEQ-SINQ regimes: the central claim that 'equivalence proofs are uniquely responsible for the model's reasoning capabilities' (while inequivalence supervision only provides data volume) is load-bearing for the paper's contribution. This attribution assumes Liquid Haskell supplies accurate and unbiased positive labels. However, Liquid Haskell is sound but incomplete; many semantically equivalent pairs may fail to type-check or yield proofs due to refinement-type limitations, solver timeouts, or unprovable properties. Without reported proof success rates, false-negative rates on equivalent pairs, or validation that the curriculum does not overfit to provable fragments of OpInstruct-HSx, the ablation's causal separation is not fully supported.
Authors: We agree that Liquid Haskell's incompleteness requires careful interpretation of the ablation results. The positive labels in our training are guaranteed correct by successful proofs (soundness ensures no false positives), while incompleteness means some equivalent pairs are not labeled positive and thus treated as unknown. This could increase task difficulty rather than bias toward easier fragments. To strengthen the causal claim and address potential selection effects in the difficulty-aware curriculum, we will add to the revised manuscript a dedicated analysis in §4 reporting proof success rates on OpInstruct-HSx, the fraction of equivalent pairs that successfully produce proofs, and statistics on discarded pairs. This will better support the attribution that equivalence proofs, rather than data volume alone, drive the reasoning gains. revision: yes
-
Referee: [§4] §4 (or equivalent section describing verification and curriculum): the manuscript does not report verification error rates, proof completion statistics, or an analysis of how many candidate pairs are discarded due to unprovability. This information is necessary to evaluate whether the 13.3pp EquiBench gain reflects genuine semantic reasoning or selection bias toward easier-to-prove programs.
Authors: We acknowledge that the current version of the manuscript does not include these verification statistics. In the revision, we will expand §4 to report proof completion rates, verification error rates (noting that Liquid Haskell produces no false positives due to soundness, with errors stemming from incompleteness or timeouts), the number of candidate pairs discarded due to unprovability, and details on how the curriculum filters and orders pairs. This addition will allow direct assessment of selection bias. While the observed transfer gains on EquiBench and PySecDB are consistent with improved semantic reasoning, the new statistics will provide the necessary transparency to evaluate this. revision: yes
Circularity Check
No significant circularity; results on external benchmarks are independent of training inputs.
full rationale
The paper's claims center on empirical accuracy gains (up to 13.3pp on EquiBench) and ablation findings from SEQ-SINQ regimes, evaluated on held-out external benchmarks (EquiBench, PySecDB) and a released synthetic dataset (OpInstruct-HSx). These quantities are not defined from or reduced to the self-play training loop, Liquid Haskell proofs, or curriculum inputs by construction. The framework generates data but measures transfer to independent downstream tasks, with ablations explicitly separating data volume from proof-based signals. No self-definitional equations, fitted-input predictions, or load-bearing self-citations appear in the described pipeline; results remain falsifiable against external data.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Liquid Haskell proofs correctly and completely capture semantic equivalence of Haskell programs
Reference graph
Works this paper leans on
-
[1]
Multipl-e: A scalable and extensible approach to benchmarking neural code generation, 2022
Ardiff: scaling program equivalence checking via iterative abstraction and refinement of common code. InProceedings of the 28th ACM joint meeting on European software engineering conference and symposium on the foundations of software engineer- ing, pages 13–24. Federico Cassano, John Gouwar, Daniel Nguyen, Syd- ney Nguyen, Luna Phipps-Costin, Donald Pinc...
-
[2]
Self-play with execution feedback: Improving instruction-following capabilities of large language models. InThe Thirteenth International Conference on Learning Representations. Alessandro Giagnorio, Alberto Martin-Lopez, and Gabriele Bavota. 2025. Enhancing code generation for low-resource languages: No silver bullet.arXiv preprint arXiv:2501.19085. Lucas...
-
[3]
Pure Semantics Every function is a pure mapping from inputs to outputs, there is no hidden state or mutation. This purity ensures that two functions are semantically equivalent if they produce the same outputs for all inputs, regardless of how those outputs are computed. 1-- revRec :: [a] -> [a] 2revRec :: [a] -> [a] 3revRec [] = [] 4revRec (x:xs) = revRe...
-
[4]
Static Typing and GHC Compile Haskell’s static type system provides strong guarantees at compile time. Once a program is accepted by the compiler, most classes of errors such as type mismatches and null de-referencing are eliminated. This makes the type checker an effective pre-filter for program validity in the self-play loop. 1-- add :: Int -> Int -> In...
work page 2025
-
[5]
@-}}`annotation , with the exact naming pattern lemma_<P>_equiv
Use the`{{-@ lemma_... @-}}`annotation , with the exact naming pattern lemma_<P>_equiv
-
[6]
The Haskell type signature
-
[7]
The function definition with`===`steps
-
[8]
Please put your proof between```haskell and ``` No extra text, no additional comments. Your answer must match the example format exactly, without trailing whitespace or newlines outside the code block. D.4 Lemma SEQ Proof User Prompt {error_msg_section} {equiv_code} ------------------------------------------------------------ Your task: Produce the proof ...
work page 2025
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.