arxiv: 2604.02450 · v1 · submitted 2026-04-02 · 💻 cs.LG · cs.AI · cs.CL

Do We Need Frontier Models to Verify Mathematical Proofs?

Pith reviewed 2026-05-13 21:24 UTC · model grok-4.3

keywords LLM proof verification · prompt optimization · mathematical reasoning · self-consistency · open-source models · natural language proofs

The pith

Smaller open-source models can verify competition math proofs as reliably as frontier models once prompts target their specific failure modes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper asks whether frontier-scale language models are required to check natural-language proofs of hard competition problems. It shows that, with ordinary judging prompts, smaller models trail mainly in self-consistency (by up to ~25 percent) rather than in accuracy (by up to ~10 percent). An automated search over prompts produces an ensemble of specialized prompts that raises accuracy by up to 9.1 percent and self-consistency by up to 15.9 percent. These gains hold across models and datasets, bringing the open Qwen3.5-35B model to parity with frontier verifiers such as Gemini 3.1 Pro on human-graded proofs.

Core claim

Smaller open-source models already possess the mathematical capabilities needed to verify proofs at the level of frontier models, but general prompts fail to draw those capabilities out reliably. LLM-guided search yields an ensemble of prompts that address the distinct error patterns of smaller models, closing the accuracy and consistency gaps without changing the underlying model weights.

What carries the argument

LLM-guided prompt search that synthesizes an ensemble of specialized prompts targeting the failure modes of smaller models.
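A minimal sketch of how an ensemble of specialized prompts might be combined into one verdict. The source does not say how the paper aggregates per-prompt judgments, so the majority vote, the conservative tie-break, the `Judge` type, and the `stub_judge` stand-in for a model call are all assumptions made for illustration.

```python
from collections import Counter
from typing import Callable, Sequence

# Hypothetical interface: a judge maps (prompt, proof) to "CORRECT" or "INCORRECT".
Judge = Callable[[str, str], str]


def ensemble_verdict(judge: Judge, prompts: Sequence[str], proof: str) -> str:
    """Majority vote over one verdict per prompt (assumed aggregation rule).

    Ties are resolved conservatively as INCORRECT.
    """
    votes = Counter(judge(prompt, proof) for prompt in prompts)
    if votes["CORRECT"] > votes["INCORRECT"]:
        return "CORRECT"
    return "INCORRECT"


def stub_judge(prompt: str, proof: str) -> str:
    """Stand-in for a model call: rejects a proof containing the prompt's trigger phrase."""
    trigger = prompt.split(":", 1)[1].strip()
    return "INCORRECT" if trigger in proof else "CORRECT"


# Illustrative prompts only; each one targets a single hand-waving phrase.
prompts = [
    "flag: it is clear that",
    "flag: by a standard argument",
    "flag: intuitively",
]

clean_proof = "Since x > 0 and y > 0, their product is positive."
vague_proof = "it is clear that f is injective, and intuitively f is onto."

clean_result = ensemble_verdict(stub_judge, prompts, clean_proof)
vague_result = ensemble_verdict(stub_judge, prompts, vague_proof)
```

With a plain majority vote, a flaw caught by only one specialized prompt is outvoted, so a real ensemble may need a different rule (for example, rejecting when any prompt rejects). Which rule the paper uses is not stated here.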

If this is right

  • Proof verification can be performed by open models at far lower inference cost than frontier systems.
  • Prompt ensembles become a standard tool for improving consistency in LLM-based judges.
  • Reliable automated checking of competition-level proofs becomes practical for grading or theorem-proving pipelines.
  • The same search method can be applied to other verification tasks where smaller models underperform on general prompts.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the ensembles generalize, they could be reused across entirely new proof datasets without repeating the search step.
  • Capability differences between model sizes for verification appear more prompt-dependent than fundamental.
  • The approach may extend to improving proof generation, not only verification, once similar failure modes are identified.

Load-bearing premise

The discovered prompt ensembles will continue to work on new problems, new models, and new human-graded proof collections without further search or tuning.

What would settle it

Apply the reported prompt ensemble to a fresh set of human-graded proofs from a different math competition and check whether accuracy and self-consistency remain within a few percentage points of the original results.
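The check described above can be sketched as a simple comparison of metrics in percentage points. The tolerance of 3 points is an assumed reading of "a few percentage points", and the numbers below are hypothetical placeholders, not results from the paper.

```python
from typing import Dict


def within_tolerance(
    original: Dict[str, float],
    fresh: Dict[str, float],
    tol_pp: float = 3.0,  # assumed tolerance, in percentage points
) -> Dict[str, bool]:
    """Report, per metric, whether the fresh-dataset value stays within tol_pp of the original."""
    return {name: abs(fresh[name] - original[name]) <= tol_pp for name in original}


# Hypothetical values for illustration only.
original_metrics = {"balanced_accuracy": 80.0, "self_consistency": 90.0}
fresh_metrics = {"balanced_accuracy": 78.5, "self_consistency": 84.0}

report = within_tolerance(original_metrics, fresh_metrics)
```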

Figures

Figures reproduced from arXiv: 2604.02450 by Aaditya Naik, Guruprerana Shabadi, Mayur Naik, Rajeev Alur.

Figure 1: (Left) Mean balanced accuracy and self-consistency rates of frontier and open-source LLM judges across all datasets and prompts. Low and high-reasoning settings are denoted with ‘•’ and ‘♦’, respectively, while ‘⋆’ denotes performance after prompt ensembling. Smaller open-source models are only ∼10% behind frontier models in balanced accuracy, but are substantially less self-consistent. Prompt ensembling c…

Figure 2: Comparing mean balanced accuracy of all the models on the combined dataset using the three …

Figure 3: False positive rates (left) and false negative rates (right) of few models on the combined dataset, …

Figure 4: Two failure modes of GPT-OSS-120B with IMOBench prompt on IMO-GradingBench and how targeted prompts address them. Each column shows the proof excerpt (top), the IMOBench prompt’s reasoning (middle), and a targeted prompt’s response (bottom). Excerpts are marked by a thick left bar; red highlights mark errors. Yellow highlights mark fixed judgements. Case 1: the IMOBench prompt accepts the proof’s invalid i…

Figure 5: Performance of the prompt ensemble on the three datasets.

Figure 6: Self-consistency rate as a function of the number of independent samples for Qwen3.5-35B and …

Figure 7: Balanced accuracy vs self-consistency on IMO-GradingBench under two grading thresholds: only …

Figure 8: Mean balanced accuracy vs self-consistency across all datasets, including high reasoning effort …

Figure 9: Comparison of ensemble strategies averaged across all datasets: the full diverse prompt ensemble, …

Figure 10: False positive rate (FPR) and false negative rate (FNR) of GPT-OSS 120B for each of the 12 error …

Figure 11: FPR and FNR comparison of open-source models (with and without the prompt ensemble) and …
Original abstract

Advances in training, post-training, and inference-time methods have enabled frontier reasoning models to win gold medals in math competitions and settle challenging open problems. Gaining trust in the responses of these models requires that natural language proofs be checked for errors. LLM judges are increasingly being adopted to meet the growing demand for evaluating such proofs. While verification is considered easier than generation, what model capability does reliable verification actually require? We systematically evaluate four open-source and two frontier LLMs on datasets of human-graded natural language proofs of competition-level problems. We consider two key metrics: verifier accuracy and self-consistency (the rate of agreement across repeated judgments on the same proof). We observe that smaller open-source models are only up to ~10% behind frontier models in accuracy but they are up to ~25% more inconsistent. Furthermore, we see that verifier accuracy is sensitive to prompt choice across all models. We then demonstrate that the smaller models, in fact, do possess the mathematical capabilities to verify proofs at the level of frontier models, but they struggle to reliably elicit these capabilities with general judging prompts. Through an LLM-guided prompt search, we synthesize an ensemble of specialized prompts that overcome the specific failure modes of smaller models, boosting their performance by up to 9.1% in accuracy and 15.9% in self-consistency. These gains are realized across models and datasets, allowing models like Qwen3.5-35B to perform on par with frontier models such as Gemini 3.1 Pro for proof verification.
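A minimal sketch of the metrics named in the abstract and figure captions. Balanced accuracy and the false positive and false negative rates follow their standard definitions, taking "positive" to mean the judge accepts the proof as correct (an assumption about the paper's convention). The abstract defines self-consistency only as the rate of agreement across repeated judgments on the same proof; the specific formula below (share of repeated verdicts matching the per-proof majority, averaged over proofs) is one assumed reading.

```python
from collections import Counter
from typing import Sequence, Tuple


def error_rates(labels: Sequence[bool], preds: Sequence[bool]) -> Tuple[float, float]:
    """Return (FPR, FNR), where True means 'proof is correct'."""
    pos = sum(1 for label in labels if label)
    neg = len(labels) - pos
    if pos == 0 or neg == 0:
        raise ValueError("need at least one correct and one incorrect proof")
    # False positive: an incorrect proof accepted by the judge.
    fp = sum(1 for label, pred in zip(labels, preds) if not label and pred)
    # False negative: a correct proof rejected by the judge.
    fn = sum(1 for label, pred in zip(labels, preds) if label and not pred)
    return fp / neg, fn / pos


def balanced_accuracy(labels: Sequence[bool], preds: Sequence[bool]) -> float:
    """Mean of the true positive rate and the true negative rate."""
    fpr, fnr = error_rates(labels, preds)
    return 1.0 - (fpr + fnr) / 2.0


def self_consistency(runs: Sequence[Sequence[bool]]) -> float:
    """Assumed definition: per proof, the share of repeated verdicts that match
    the most common verdict; then the mean over proofs."""
    rates = []
    for verdicts in runs:
        majority_count = Counter(verdicts).most_common(1)[0][1]
        rates.append(majority_count / len(verdicts))
    return sum(rates) / len(rates)


# Toy data: four proofs with human labels and one judge verdict each.
labels = [True, True, False, False]
preds = [True, False, False, False]
fpr, fnr = error_rates(labels, preds)
bal_acc = balanced_accuracy(labels, preds)

# Toy data: two proofs, four repeated verdicts each.
repeated = [[True, True, True, True], [True, False, True, True]]
consistency = self_consistency(repeated)
```

In the toy data the judge never accepts an incorrect proof but rejects one of the two correct ones, and the second proof receives one dissenting verdict out of four.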

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper evaluates four open-source and two frontier LLMs on human-graded natural language proofs of competition-level math problems, measuring verifier accuracy and self-consistency. It finds smaller models trail frontier models by up to ~10% in accuracy and ~25% in consistency, but claims that an LLM-guided prompt search can synthesize specialized prompt ensembles that overcome smaller-model failure modes, yielding gains of up to 9.1% accuracy and 15.9% self-consistency and allowing models such as Qwen3.5-35B to match Gemini 3.1 Pro across datasets.

Significance. If the prompt ensembles generalize, the result shows that reliable proof verification does not require frontier-scale models and can be elicited from smaller open-source models via targeted prompting. This would lower the computational barrier for scalable verification of LLM-generated proofs and provide a practical alternative to relying exclusively on the largest models.

major comments (2)
  1. [Abstract] Abstract and Methods: The LLM-guided prompt search procedure is described as producing gains 'across models and datasets,' yet no train/test split, cross-validation protocol, or held-out problem set is reported for the search itself. Without this separation, the 9.1% accuracy and 15.9% self-consistency improvements risk being artifacts of optimization on the same human-graded collections used for final evaluation.
  2. [Experiments] Experiments: The claim that the synthesized ensembles overcome 'specific failure modes' of smaller models is load-bearing for the central thesis, but the manuscript provides no quantitative breakdown of which failure modes were targeted or how the ensemble was validated to avoid dataset-specific tuning.
minor comments (2)
  1. [Abstract] The abstract would be clearer if it listed the exact four open-source and two frontier models and the names of the human-graded datasets.
  2. Notation for self-consistency (rate of agreement across repeated judgments) should be defined explicitly on first use rather than left implicit.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which highlight important methodological clarifications needed for the prompt-search procedure and failure-mode analysis. We address each point below and will revise the manuscript accordingly to strengthen transparency without altering the core findings.

Point-by-point responses
  1. Referee: [Abstract] Abstract and Methods: The LLM-guided prompt search procedure is described as producing gains 'across models and datasets,' yet no train/test split, cross-validation protocol, or held-out problem set is reported for the search itself. Without this separation, the 9.1% accuracy and 15.9% self-consistency improvements risk being artifacts of optimization on the same human-graded collections used for final evaluation.

    Authors: We acknowledge the validity of this concern. The current manuscript does not explicitly document a held-out split or cross-validation protocol for the LLM-guided prompt search, which was performed by querying the LLM on a small set of examples drawn from the evaluation collections. In the revision we will expand the Methods section to describe the exact search procedure (including the number of examples used, the LLM employed for search, and the stopping criteria), report results when the search is restricted to one dataset and evaluated on the others, and add a note on the risk of dataset-specific tuning. We maintain that the observed gains across four models and multiple independent datasets provide evidence against pure overfitting, but we agree that explicit separation details are required for full reproducibility. revision: yes

  2. Referee: [Experiments] Experiments: The claim that the synthesized ensembles overcome 'specific failure modes' of smaller models is load-bearing for the central thesis, but the manuscript provides no quantitative breakdown of which failure modes were targeted or how the ensemble was validated to avoid dataset-specific tuning.

    Authors: We agree that a quantitative breakdown is necessary to support the central claim. The manuscript currently states that the ensembles overcome specific failure modes but does not enumerate them or provide ablation results. In the revised version we will add a new subsection under Experiments that (1) categorizes the dominant failure modes observed in smaller models (e.g., inconsistent handling of proof edge cases, misreading of quantifiers, and over-acceptance of incomplete steps), (2) shows per-mode accuracy before and after ensemble application, and (3) reports ablation studies that isolate each prompt in the ensemble on held-out problem subsets. This will also include validation across datasets to address tuning concerns. revision: yes
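The cross-dataset check promised in the first response (search restricted to one dataset, evaluation on the others) can be sketched as a leave-one-dataset-out loop. This is an illustrative outline, not the authors' procedure: `SearchFn` and `ScoreFn` are hypothetical interfaces, the stub implementations stand in for the LLM-guided search and the metric computation, and the dataset names are placeholders.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Example = Tuple[str, bool]  # (proof text, human label)
SearchFn = Callable[[Sequence[Example]], List[str]]        # examples -> prompts
ScoreFn = Callable[[List[str], Sequence[Example]], float]  # prompts, examples -> metric


def leave_one_dataset_out(
    datasets: Dict[str, Sequence[Example]],
    search: SearchFn,
    score: ScoreFn,
) -> Dict[str, Dict[str, float]]:
    """For each source dataset, run the prompt search on it alone and
    score the resulting prompts on every other dataset."""
    results: Dict[str, Dict[str, float]] = {}
    for source, train in datasets.items():
        prompts = search(train)
        results[source] = {
            target: score(prompts, test)
            for target, test in datasets.items()
            if target != source  # never evaluate on the search data
        }
    return results


# Stubs for illustration: a real search would query an LLM, and a real
# score would compute balanced accuracy or self-consistency.
def stub_search(train: Sequence[Example]) -> List[str]:
    return [f"prompt tuned on {len(train)} examples"]


def stub_score(prompts: List[str], test: Sequence[Example]) -> float:
    return float(len(test))


toy_datasets = {
    "A": [("proof a1", True)],
    "B": [("proof b1", False), ("proof b2", True)],
    "C": [],
}
results = leave_one_dataset_out(toy_datasets, stub_search, stub_score)
```

Keeping the search data out of every evaluation cell is the point of the design: any gain that survives in the off-diagonal entries cannot come from tuning on the evaluation set itself.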

Circularity Check

0 steps flagged

No significant circularity: empirical evaluation on external human-graded datasets

full rationale

The paper reports empirical accuracy and self-consistency measurements for LLM proof verification on human-graded competition-level proof collections. The LLM-guided prompt search produces an ensemble of prompts whose performance is then measured on those same external datasets. No equations, definitions, or self-citations are present that reduce the reported 9.1% accuracy or 15.9% self-consistency gains to quantities defined by construction from fitted parameters inside the paper. The central claim rests on observable performance deltas across models and datasets rather than any tautological renaming or internal fit. This is a standard empirical study whose results are independently falsifiable against the cited human-graded collections.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on standard assumptions about LLM evaluation metrics and the representativeness of competition-level proof datasets; no new entities or fitted parameters are introduced.

axioms (2)
  • domain assumption: Accuracy and self-consistency are appropriate metrics for assessing LLM proof verification quality.
    Standard practice in LLM reasoning evaluation literature.
  • domain assumption: Human-graded proof datasets provide reliable ground truth for model judgments.
    Invoked when reporting accuracy numbers.

pith-pipeline@v0.9.0 · 5588 in / 1411 out tokens · 38591 ms · 2026-05-13T21:24:13.659800+00:00 · methodology

