Inference-Time Code Selection via Symbolic Equivalence Partitioning

Ananth Grama; David Cho; Fanping Sui; Yifan Wang

arxiv: 2604.06485 · v2 · submitted 2026-04-07 · 💻 cs.LG · cs.AI

Inference-Time Code Selection via Symbolic Equivalence Partitioning

David Cho , Yifan Wang , Fanping Sui , Ananth Grama This is my paper

Pith reviewed 2026-05-13 01:09 UTC · model grok-4.3

classification 💻 cs.LG cs.AI

keywords inference-time selectioncode generationsymbolic executionequivalence classesLLM codeprogram correctnessHumanEval+LiveCodeBench

0 comments

The pith

Symbolic Equivalence Partitioning selects the correct LLM-generated code by grouping programs that produce identical behavior on public test inputs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that correct code solutions generated by large language models tend to converge on the same functional behavior across valid inputs, even when their syntax and algorithms differ. It introduces Symbolic Equivalence Partitioning to first treat public examples as validity signals, then apply symbolic execution to divide the remaining candidates into equivalence classes, and finally pick a program from the largest class. This selection method raises accuracy on code generation benchmarks while avoiding the need for extra test cases, trained verifiers, or additional model runs. Readers should care because multi-candidate sampling is already a standard way to boost LLM code quality, yet reliable choice among those candidates has remained a bottleneck; a lightweight semantic signal based on actual behavior addresses that gap directly.

Core claim

Symbolic Equivalence Partitioning first uses problem-provided public examples as lightweight validity signals, then employs symbolic execution to partition the remaining candidate programs into bounded functional equivalence classes, and selects from the dominant equivalence class. This process identifies correct solutions more reliably than consensus alone because correlated incorrect solutions rarely form the largest class under symbolic comparison. The approach operates entirely at inference time on the initial sample pool.

What carries the argument

Symbolic Equivalence Partitioning (SEP), which partitions candidate programs into functional equivalence classes using symbolic execution on public examples and selects a program from the largest class.

If this is right

At N=10 samples, average accuracy increases from 0.754 to 0.826 on HumanEval+.
At N=10 samples, average accuracy increases from 0.565 to 0.647 on LiveCodeBench.
The gains occur without auxiliary test generation, learned verifiers, or additional LLM calls.
Symbolic agreement on functional behavior supplies a usable inference-time signal for code selection.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The technique could be inserted into existing multi-sample code pipelines with only symbolic execution overhead.
It implies that functional-space correlation among LLM errors is weaker than syntactic or semantic-space correlation, which may guide future sampling methods.
Similar partitioning might apply to other structured generation tasks where symbolic or executable checks can define equivalence.

Load-bearing premise

The dominant equivalence class found by symbolic execution on the public examples will contain the correct functional behavior rather than a cluster of correlated mistakes.

What would settle it

A test set in which symbolic execution on the public examples places more candidates into one incorrect functional class than into the correct class, causing the method to select from the wrong group.

Figures

Figures reproduced from arXiv: 2604.06485 by Ananth Grama, David Cho, Fanping Sui, Yifan Wang.

**Figure 2.** Figure 2: Scaling with candidate pool size. Average accuracy across 8 models as the number of sampled candidates increases from N = 5 to N = 20 on HumanEval+ and LiveCodeBench. We compare Pass@1, four baselines, and our method. 4.2 Main Results We evaluate Symbolic Equivalence Partitioning across eight models (4B to 20B parameters) at N ∈ {5, 10, 20}. Tables 1 and 2 present the primary N = 10 setting, and additional… view at source ↗

**Figure 3.** Figure 3: Aggregate runtime comparison for Ours and CodeT. SEP runtime grows with candidate pool size N, while CodeT incurs a higher fixed total overhead due to auxiliary test generation. Values are total wall-clock seconds over full benchmark evaluations; y-axes use log scale. 4.4 Runtime / Cost Analysis Accuracy alone does not capture the practical tradeoffs of reranking methods, so we next compare post-generation… view at source ↗

read the original abstract

Sampling multiple candidate programs at inference time is an effective way to improve LLM code generation. However, its benefit depends on reliably selecting a correct solution from the generated pool. We observe that this selection problem has a distinctive semantic structure: correct solutions, despite differences in syntax, implementation, or algorithmic strategy, often converge to the same functional behavior over valid inputs. At the same time, consensus alone is not sufficient for correctness, because models can also produce correlated wrong solutions that implement the same mistaken behavior. We propose Symbolic Equivalence Partitioning (SEP), an inference-time selection framework that first uses problem-provided public examples as lightweight validity signals. SEP then uses symbolic execution to partition the remaining candidate programs into bounded functional equivalence classes and selects from the dominant equivalence class. Across HumanEval+ and LiveCodeBench, SEP consistently improves selection accuracy without auxiliary test generation, learned verifiers, or additional LLM inference. At $N=10$, SEP improves average accuracy from 0.754 to 0.826 on HumanEval+ and from 0.565 to 0.647 on LiveCodeBench, showing that symbolic functional agreement is an effective signal for inference-time code selection.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

SEP offers a practical symbolic partitioning method for selecting LLM code candidates with reported benchmark gains, but leaves the dominant-class assumption untested against correlated errors.

read the letter

SEP gives an inference-time selection method called Symbolic Equivalence Partitioning. It first checks candidates against public examples, then applies symbolic execution to group the rest into equivalence classes based on functional behavior, and finally chooses from the largest class. The accuracy gains on the benchmarks are what stand out. This is new in the way it combines public example validation with symbolic partitioning to avoid the need for additional test cases or verifier models. Earlier approaches often required generating more tests or training separate selectors, but SEP stays within the given problem setup and uses symbolic tools for bounded equivalence. The paper does well in presenting a practical technique that improves selection without extra costs. On HumanEval+ it lifts average accuracy from 0.754 to 0.826, and on LiveCodeBench from 0.565 to 0.647 when sampling 10 candidates. These are concrete numbers that show the value of using symbolic agreement as a signal. The soft spots center on the core assumption that the correct behavior will be the dominant equivalence class. If several incorrect programs implement the same mistake and agree on the public tests, they could form a larger group and get selected instead. The description does not include any study of how frequently correlated errors occur or tests for when this assumption breaks, so the stress-test point about wrong clusters is a legitimate concern that remains unaddressed in the provided details. This kind of work is for the community focused on making LLM code generation more reliable through post-sampling selection. Readers looking for lightweight methods that do not require new data or models will find it relevant. It deserves a serious referee because the idea is well-motivated, the results are reported, and the method appears implementable, even though additional experiments on edge cases would make the claims stronger. I recommend sending it to peer review.

Referee Report

2 major / 2 minor

Summary. The paper claims that Symbolic Equivalence Partitioning (SEP) improves inference-time selection of correct code from LLM-generated candidates. It filters the pool using problem-provided public examples, applies symbolic execution to partition remaining candidates into bounded functional equivalence classes, and selects a program from the dominant (largest) class. Empirical results on HumanEval+ and LiveCodeBench show accuracy gains at N=10 (0.754 to 0.826 and 0.565 to 0.647 respectively) without auxiliary test generation, learned verifiers, or extra LLM calls.

Significance. If the results hold under further scrutiny, SEP demonstrates that symbolic functional agreement on public inputs can provide an effective, parameter-free signal for code selection that outperforms naive consensus while remaining lightweight. This strengthens the case for hybrid symbolic-LLM approaches in code generation and could extend to other domains where functional equivalence is checkable.

major comments (2)

[§3] §3 (SEP framework): The selection of the dominant equivalence class after symbolic partitioning on public examples rests on the unexamined assumption that correct solutions reliably form the largest class and that correlated incorrect behaviors do not. No ablation, counterexample analysis, or frequency statistics are provided for cases in which multiple off-by-one or wrong-algorithm variants agree symbolically on the given tests, which would cause systematic selection of incorrect code and undermine the reported gains.
[§4] §4 (Experiments): The accuracy improvements on HumanEval+ and LiveCodeBench are presented as averages without implementation details of the symbolic executor, variance across random seeds, statistical significance tests, or failure-case breakdowns where the dominant class is incorrect. These omissions make it impossible to verify that the central claim is supported by the data rather than by favorable benchmark conditions.

minor comments (2)

The abstract and method description would be clearer if they explicitly stated the bounded nature of the symbolic execution (e.g., input domain restrictions) and any assumptions about the public examples being sufficient validity filters.
Notation for equivalence classes and the dominance criterion could be formalized with a short definition or pseudocode to avoid ambiguity when readers attempt to re-implement the partitioning step.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The major comments identify valid gaps in our analysis and experimental reporting. We address each point below and commit to revisions that strengthen the manuscript without altering its core claims.

read point-by-point responses

Referee: [§3] §3 (SEP framework): The selection of the dominant equivalence class after symbolic partitioning on public examples rests on the unexamined assumption that correct solutions reliably form the largest class and that correlated incorrect behaviors do not. No ablation, counterexample analysis, or frequency statistics are provided for cases in which multiple off-by-one or wrong-algorithm variants agree symbolically on the given tests, which would cause systematic selection of incorrect code and undermine the reported gains.

Authors: We agree that the original manuscript did not sufficiently examine this assumption through dedicated ablations or counterexample analysis. The design of SEP is motivated by the observation that correct programs tend to converge on identical functional behavior over the public tests, while many incorrect programs produce distinct behaviors or smaller clusters; however, we did not quantify the frequency of cases where correlated errors (e.g., off-by-one or alternative algorithms) form the dominant class. In the revised version we will add an ablation subsection that reports (i) the distribution of equivalence-class sizes for correct versus incorrect selections, (ii) the rate at which the dominant class is incorrect, and (iii) concrete counterexamples of off-by-one and wrong-algorithm variants that agree on the public tests. These additions will make the reliability of the dominant-class heuristic explicit. revision: yes
Referee: [§4] §4 (Experiments): The accuracy improvements on HumanEval+ and LiveCodeBench are presented as averages without implementation details of the symbolic executor, variance across random seeds, statistical significance tests, or failure-case breakdowns where the dominant class is incorrect. These omissions make it impossible to verify that the central claim is supported by the data rather than by favorable benchmark conditions.

Authors: We acknowledge that the experimental section lacks several elements required for full reproducibility and statistical verification. The reported numbers are simple averages over the benchmarks, without variance, significance tests, or failure analysis. In the revised manuscript we will expand §4 to include: (1) a precise description of the symbolic executor, including how bounded execution and public-example filtering are implemented; (2) accuracy results with standard deviations across multiple random seeds for candidate generation; (3) statistical significance tests (e.g., McNemar’s test) comparing SEP against the baselines; and (4) a failure-case breakdown that enumerates instances where the dominant class is incorrect, together with qualitative analysis of those cases. These changes will allow readers to assess the robustness of the gains beyond the reported averages. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper defines SEP as applying symbolic execution directly to public examples to partition candidates into bounded equivalence classes and selecting the dominant class. This mechanism relies on an external symbolic executor rather than any internal fitted parameters, self-referential definitions, or load-bearing self-citations. No equations reduce a prediction to its own inputs by construction, and the reported accuracy gains on HumanEval+ and LiveCodeBench are presented as empirical outcomes independent of the method's definition. The central claim remains self-contained with no reduction to renamed inputs or smuggled ansatzes.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on domain assumptions about functional convergence of correct programs and the reliability of symbolic execution for partitioning; no free parameters or new physical entities are introduced.

axioms (2)

domain assumption Correct solutions converge to the same functional behavior over valid inputs despite syntactic or algorithmic differences.
Explicitly stated as the distinctive semantic structure enabling partitioning.
domain assumption Symbolic execution on public examples can partition candidate programs into bounded functional equivalence classes.
Core mechanism of the SEP selection process.

pith-pipeline@v0.9.0 · 5506 in / 1367 out tokens · 54054 ms · 2026-05-13T01:09:04.872670+00:00 · methodology

Review history (2 revisions) →

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean absolute_floor_iff_bare_distinguishability unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

We define two programs pi and pj as functionally equivalent with respect to domain C... partition P into equivalence classes... select a representative p∗ from the largest partition
IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

SMT-Constrained Pruning... injects domain-specific constraints as Satisfiability Modulo Theories (SMT) assumptions

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

5 extracted references · 5 canonical work pages

[1]

Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou

URLhttps://openreview.net/forum?id=4FWAwZtd2n. Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. InThe Eleventh International Conference on Learning Representations, 2023. URLhttps://openreview.net/forum?id=1PL1NIMMrw. Jason...

work page 2023
[2]

Test Case Generation.CodeT generates synthetic test cases by prompting the same language model used for solution generation

and use the official implementation provided athttps://github.com/microsoft/CodeT. Test Case Generation.CodeT generates synthetic test cases by prompting the same language model used for solution generation. Following the original paper, we construct prompts by appending a test-generation instruction p to the provided starter code. Concretely, the prompt ...

work page
[3]

For HumanEval+, we follow the original CodeT setting and use 100 raw test-generation samples per problem, yielding at most 500 generated test cases

Test Generation:For each problem, we sample independent test-generation outputs and retain up to five valid assert statements from each output that invoke the target entry point. For HumanEval+, we follow the original CodeT setting and use 100 raw test-generation samples per problem, yielding at most 500 generated test cases. For LiveCodeBench, the same s...

work page
[4]

Execution:All N candidate solutions are executed against the resulting set of generated tests, producing pass/fail outcomes for each test

work page
[5]

D Runtime Tables for SEP and CodeT D.1 SEP Runtime (Total Wall-Clock Seconds) Table 5: SEP runtime on HumanEval+ aggregated across the full benchmark evaluation

Grouping and Ranking:Solutions are grouped based on identical execution fingerprints, groups are ranked by a dual-agreement score (generated-test pass strength + agreement count), and the final answer is selected from the top-ranked group. D Runtime Tables for SEP and CodeT D.1 SEP Runtime (Total Wall-Clock Seconds) Table 5: SEP runtime on HumanEval+ aggr...

work page arXiv

[1] [1]

Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou

URLhttps://openreview.net/forum?id=4FWAwZtd2n. Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. InThe Eleventh International Conference on Learning Representations, 2023. URLhttps://openreview.net/forum?id=1PL1NIMMrw. Jason...

work page 2023

[2] [2]

Test Case Generation.CodeT generates synthetic test cases by prompting the same language model used for solution generation

and use the official implementation provided athttps://github.com/microsoft/CodeT. Test Case Generation.CodeT generates synthetic test cases by prompting the same language model used for solution generation. Following the original paper, we construct prompts by appending a test-generation instruction p to the provided starter code. Concretely, the prompt ...

work page

[3] [3]

For HumanEval+, we follow the original CodeT setting and use 100 raw test-generation samples per problem, yielding at most 500 generated test cases

Test Generation:For each problem, we sample independent test-generation outputs and retain up to five valid assert statements from each output that invoke the target entry point. For HumanEval+, we follow the original CodeT setting and use 100 raw test-generation samples per problem, yielding at most 500 generated test cases. For LiveCodeBench, the same s...

work page

[4] [4]

Execution:All N candidate solutions are executed against the resulting set of generated tests, producing pass/fail outcomes for each test

work page

[5] [5]

D Runtime Tables for SEP and CodeT D.1 SEP Runtime (Total Wall-Clock Seconds) Table 5: SEP runtime on HumanEval+ aggregated across the full benchmark evaluation

Grouping and Ranking:Solutions are grouped based on identical execution fingerprints, groups are ranked by a dual-agreement score (generated-test pass strength + agreement count), and the final answer is selected from the top-ranked group. D Runtime Tables for SEP and CodeT D.1 SEP Runtime (Total Wall-Clock Seconds) Table 5: SEP runtime on HumanEval+ aggr...

work page arXiv