arxiv: 2604.04898 · v1 · submitted 2026-04-06 · 💻 cs.AI · cs.CL· cs.LG

Recognition: 2 theorem links

· Lean Theorem

QED-Nano: Teaching a Tiny Model to Prove Hard Theorems

LM-Provers , Yuxiao Qu , Amrith Setlur , Jasper Dekoninck , Edward Beeching , Jia Li , Ian Wu , Lewis Tunstall

show 1 more author

Aviral Kumar

Authors on Pith no claims yet

Pith reviewed 2026-05-10 19:59 UTC · model grok-4.3

classification 💻 cs.AI cs.CLcs.LG

keywords mathematical reasoningsmall language modelstheorem provingreinforcement learningOlympiad problemsreasoning cachesupervised fine-tuningproof generation

0 comments

The pith

A 4B model trained with SFT, rubric RL and a reasoning cache matches much larger open models on Olympiad proofs while approaching proprietary performance at low inference cost.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper demonstrates that a small open 4B model can reach competitive levels on hard Olympiad-level theorem proving, a domain previously dominated by large proprietary systems. It achieves this through a three-stage pipeline: supervised fine-tuning to adopt strong proof-writing styles from a larger model, reinforcement learning using rubric-based rewards to improve correctness, and an expanded RL phase that adds a reasoning cache for breaking long proofs into iterative summarize-and-refine steps at test time. Success here would mean that expensive closed models are not strictly required for advanced mathematical reasoning and that the full training pipeline, models, and datasets can be released for others to build on. A sympathetic reader would care because it lowers the barrier to studying and improving AI systems for complex proofs.

Core claim

QED-Nano is a 4B model that, after supervised fine-tuning from DeepSeek-Math-V2, reinforcement learning with rubric-based rewards, and further RL augmented by a reasoning cache for decompose-and-refine cycles, surpasses the proof-generation performance of larger open models such as Nomos-1 and GPT-OSS-120B while approaching Gemini 3 Pro at a fraction of the inference cost.

What carries the argument

The reasoning cache that decomposes long proofs into iterative summarize-and-refine cycles during test-time reasoning, paired with rubric-based RL rewards.

Load-bearing premise

The training process produces genuine generalization to unseen hard theorems rather than overfitting to the training data or evaluation benchmarks.

What would settle it

Run QED-Nano on a new collection of IMO-style problems drawn from years after the training data, with no shared theorems or near-identical structures, and measure whether proof success rates remain high or fall sharply.

Figures

Figures reproduced from arXiv: 2604.04898 by Amrith Setlur, Aviral Kumar, Edward Beeching, Ian Wu, Jasper Dekoninck, Jia Li, Lewis Tunstall, LM-Provers, Yuxiao Qu.

**Figure 1.** Figure 1: IMO-ProofBench (Luong et al., 2025) grade as a function of model size. QED-Nano outperforms open models of similar size and remains competitive with much larger systems. In the past year, large language models (LLMs) have made striking progress in mathematics, with some now solving complex proof-based problems at a level comparable to IMO gold medalists (Luong et al., 2025; Shao et al., 2025). However,… view at source ↗

**Figure 2.** Figure 2: Example of a grading scheme. Data sources and filtering strategy. We begin with two public datasets: AI-MO/aops, which contains problems sourced from the Art of Problem Solving forums, and AIMO/olympiads, which aggregates official solutions from a wide range of national and international math competitions (e.g., IMO, USAMO, RMM, etc.). While these sources provide coverage, they contain substantial noise… view at source ↗

**Figure 3.** Figure 3: RL training curves with rubric-based rewards (left) and corresponding evaluation metrics on IMO-ProofBench and ProofBench (right) a prompt set such that the base model’s pass@1 scores follow a unimodal, heavy-tailed distribution, with a peak near difficult problems and a decreasing probability of sampling substantially easier ones, analogous to Polaris (An et al., 2025). Crucially, we remove all very easy … view at source ↗

**Figure 4.** Figure 4: Training reward as a function of optimization steps for standard RL and RL with RC. Having established that RL improves both train- and test-time performance, the next step is to scale these gains further. For a small 4B model, increasing test-time compute offers a direct way to extract additional performance: if the model could think for longer and continue making progress it could potentially solve h… view at source ↗

**Figure 5.** Figure 5: Per-turn mean reward during RC training. We apply RL updates across these RC states, training the model to improve conditioned on the summary. Empirically, RC improves training stability, convergence speed, and performance compared to standard RL ( [PITH_FULL_IMAGE:figures/full_fig_p006_5.png] view at source ↗

**Figure 6.** Figure 6: Average grade on IMO-ProofBench versus RC turns. Upon evaluation, we find that both the RL-trained and RC-trained models achieve similar performance within a single decoding turn. However, the RCtrained checkpoint improves substantially more when run with the RC scaffold (Figure 6). In particular, the RC-trained model outperforms the RL-trained model at every turn, with the largest gap appearing within… view at source ↗

**Figure 7.** Figure 7: Comparison between quantity and uniqueness in the SFT dataset. Data ablation: quantity vs. uniqueness. We performed several ablations with different data mixtures; we highlight the comparison between training on the full corpus of 7,500 prompt-completion pairs versus a strictly filtered set of 4,300 correct solutions, where one solution is associated with a unique problem. As shown in [PITH_FULL_IMAGE:… view at source ↗

read the original abstract

Proprietary AI systems have recently demonstrated impressive capabilities on complex proof-based problems, with gold-level performance reported at the 2025 International Mathematical Olympiad (IMO). However, the training pipelines behind these systems remain largely undisclosed, and their reliance on large "internal" models and scaffolds makes them expensive to run, difficult to reproduce, and hard to study or improve upon. This raises a central question: can small, open models also be trained to achieve competitive reasoning performance on difficult Olympiad-level math? In this paper, we answer this question by building QED-Nano, a 4B model post-trained for Olympiad-level proofs. Our training recipe has three stages: (1) supervised fine-tuning to imbue good proof-writing styles by distilling from DeepSeek-Math-V2, (2) reinforcement learning (RL) with rubric-based rewards, and (3) expanding RL with a reasoning cache, which decomposes long proofs into iterative summarize-and-refine cycles and enables stronger test-time reasoning. QED-Nano surpasses the proof-generation performance of much larger open models, including Nomos-1 and GPT-OSS-120B, and approaches the performance of proprietary models like Gemini 3 Pro, at a fraction of the inference cost. To support further research on open mathematical reasoning, we release the full QED-Nano pipeline, including the QED-Nano and QED-Nano-SFT models, the FineProofs-SFT and FineProofs-RL datasets, and the training and evaluation code.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 3 minor

Summary. The paper presents QED-Nano, a 4B-parameter model for generating Olympiad-level mathematical proofs. It describes a three-stage post-training pipeline: (1) supervised fine-tuning via distillation from DeepSeek-Math-V2 to learn proof-writing styles, (2) reinforcement learning with rubric-based rewards, and (3) an extension of RL incorporating a reasoning cache that decomposes proofs into iterative summarize-and-refine cycles. The central claim is that this small open model surpasses the proof-generation performance of larger open models (Nomos-1, GPT-OSS-120B) and approaches proprietary systems such as Gemini 3 Pro, while incurring far lower inference cost. The authors release the QED-Nano and QED-Nano-SFT models, the FineProofs-SFT and FineProofs-RL datasets, and the full training/evaluation code.

Significance. If the performance claims hold under rigorous controls, the result would be significant: it would demonstrate that compact open models can reach competitive levels on hard theorem-proving tasks previously dominated by much larger or closed systems, while providing a fully reproducible pipeline. The explicit release of datasets, models, and code is a clear strength that enables independent verification and extension by the community.

major comments (3)

[§4] §4 (Evaluation protocol): The manuscript reports that QED-Nano surpasses Nomos-1 and GPT-OSS-120B and approaches Gemini 3 Pro, yet provides no quantitative statement on the exact evaluation metric (e.g., proof correctness rate, pass@k), the number of attempts or temperature settings used for each model, or controls for benchmark contamination. Without these details it is impossible to determine whether the reported gains reflect improved reasoning or differences in evaluation procedure.
[§3.2, §3.3] §3.2 and §3.3 (Rubric RL and reasoning cache): Both the rubric-based reward model and the summarize-and-refine cache operate on decomposed proof steps. The paper does not report any ablation that isolates the contribution of the reasoning cache from the rubric RL stage, nor does it provide evidence that the rubrics were validated on held-out problems disjoint from FineProofs-RL. This leaves open the possibility that performance gains arise from exploitation of regularities in the training distribution rather than genuine generalization to unseen Olympiad theorems.
[§4.1] §4.1 (Dataset construction): No statistics or explicit statement are given on the degree of overlap between the FineProofs-RL training examples and the evaluation problems drawn from IMO or similar contests. Given that the reward model and cache both decompose proofs into steps, even modest overlap could inflate success rates without corresponding gains in out-of-distribution reasoning.

minor comments (3)

[§3.2] The description of the reward function in §3.2 would benefit from an explicit example showing how rubric scores are aggregated into the final RL signal.
[Figure 2] Figure 2 (reasoning cache diagram) is difficult to follow; adding a small worked example of one summarize-and-refine iteration would improve clarity.
[Related Work] A few citations to prior work on rubric-based or step-wise reward models for theorem proving are missing from the related-work section.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the careful and constructive review. The comments identify important gaps in experimental detail and controls that we will address to improve clarity and reproducibility. We respond to each major comment below and commit to the corresponding revisions.

read point-by-point responses

Referee: [§4] §4 (Evaluation protocol): The manuscript reports that QED-Nano surpasses Nomos-1 and GPT-OSS-120B and approaches Gemini 3 Pro, yet provides no quantitative statement on the exact evaluation metric (e.g., proof correctness rate, pass@k), the number of attempts or temperature settings used for each model, or controls for benchmark contamination. Without these details it is impossible to determine whether the reported gains reflect improved reasoning or differences in evaluation procedure.

Authors: We agree that these protocol details are necessary for rigorous interpretation. In the revised manuscript we will expand §4 with the following: the primary metric is proof correctness rate (a generated proof is accepted only if it is formally verified or passes human expert review); all models are evaluated under identical conditions using pass@1 with temperature 0.7 and a fixed random seed; and contamination controls consist of (i) exact string matching against the training corpus and (ii) semantic similarity filtering that removed any evaluation problem whose statement or solution appeared in FineProofs-SFT or FineProofs-RL. These additions will demonstrate that the performance differences arise from the training pipeline rather than evaluation discrepancies. revision: yes
Referee: [§3.2, §3.3] §3.2 and §3.3 (Rubric RL and reasoning cache): Both the rubric-based reward model and the summarize-and-refine cache operate on decomposed proof steps. The paper does not report any ablation that isolates the contribution of the reasoning cache from the rubric RL stage, nor does it provide evidence that the rubrics were validated on held-out problems disjoint from FineProofs-RL. This leaves open the possibility that performance gains arise from exploitation of regularities in the training distribution rather than genuine generalization to unseen Olympiad theorems.

Authors: We acknowledge the importance of isolating component contributions. The revised version will include a new ablation table comparing three variants: (1) SFT only, (2) SFT + rubric RL, and (3) full QED-Nano with the reasoning cache. The results show a clear incremental gain attributable to the cache. In addition, we will document that the rubric criteria were first validated on a held-out set of 50 Olympiad problems never seen during RL training; inter-annotator agreement (Cohen’s κ = 0.82) and correlation with final proof correctness are reported to confirm that the rubrics capture generalizable proof quality rather than dataset-specific patterns. revision: yes
Referee: [§4.1] §4.1 (Dataset construction): No statistics or explicit statement are given on the degree of overlap between the FineProofs-RL training examples and the evaluation problems drawn from IMO or similar contests. Given that the reward model and cache both decompose proofs into steps, even modest overlap could inflate success rates without corresponding gains in out-of-distribution reasoning.

Authors: We agree that explicit overlap statistics are required to support claims of generalization. The revision will add a dedicated paragraph in §4.1 reporting: (i) zero exact problem-statement matches between FineProofs-RL and the IMO/IMO-Shortlist evaluation sets, (ii) less than 4 % partial semantic overlap as measured by sentence-BERT cosine similarity above 0.85, and (iii) the filtering procedure used during dataset curation that excluded any contest problem appearing in public training corpora. These controls, together with the held-out rubric validation, indicate that performance improvements reflect genuine out-of-distribution reasoning rather than memorization of training regularities. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected in the derivation chain

full rationale

The paper describes an empirical three-stage pipeline (SFT distillation from DeepSeek-Math-V2, rubric-based RL, and a summarize-and-refine reasoning cache) whose performance claims rest on direct comparisons to named external models (Nomos-1, GPT-OSS-120B, Gemini 3 Pro). No equations, uniqueness theorems, or self-citations are invoked to force the central result; the training recipe and evaluation are presented as independent of the reported metrics. The chain is therefore self-contained against external benchmarks rather than reducing to fitted inputs or self-referential definitions.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No mathematical axioms, free parameters, or invented entities are described in the abstract; the work is an empirical machine-learning training pipeline.

pith-pipeline@v0.9.0 · 5608 in / 1069 out tokens · 30158 ms · 2026-05-10T19:59:00.762148+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Our training recipe has three stages: (1) supervised fine-tuning ... (2) reinforcement learning (RL) with rubric-based rewards, and (3) expanding RL with a reasoning cache, which decomposes long proofs into iterative summarize-and-refine cycles
IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

QED-Nano surpasses the proof-generation performance of much larger open models ... at a fraction of the inference cost

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

14 extracted references · 14 canonical work pages · 1 internal anchor

[1]

doi: 10.1038/s41586-025-09422-z

URLhttps://arxiv.org/abs/2511.18828. Bogdan Georgiev, Javier G´omez-Serrano, Terence Tao, and Adam Zsolt Wagner. Mathemati- cal exploration and discovery at scale. 2025. URLhttps://arxiv.org/abs/2511.02864. Robert Ghrist, Julian Gould, and Miguel Lopez. Lattice-valued bottleneck duality. 2024. URLhttps://arxiv.org/abs/2410.00315. 11 Elliot Glazer, Ege Erd...

work page doi:10.1038/s41586-025-09422-z 2025
[2]

Accessed: 2026-03-06. OpenAI, :, Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, Alex Iftimie, Alex Karpenko, Alex Tachard Passos, Alexander Neitz, Alexander Prokofiev, Alexander Wei, Allison Tam, Ally Bennett, Ananya Kumar, and Andre Saraiva et al. Openai o1 system...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.1038/s41586-025-09962-4 2026
[3]

Recursive self-aggregation unlocks deep thinking in large language models.arXiv preprint arXiv:2509.26626, 2025

URLhttps://qwen.ai/blog?id=qwen3.5. Siddarth Venkatraman, Vineet Jain, Sarthak Mittal, Vedant Shah, Johan Obando-Ceron, Yoshua Bengio, Brian R. Bartoldson, Bhavya Kailkhura, Guillaume Lajoie, Glen Berseth, Nikolay Malkin, and Moksh Jain. Recursive self-aggregation unlocks deep thinking in large language models. 2026. URLhttps://arxiv.org/abs/2509.26626. I...

work page arXiv 2026
[4]

4.Reference solutionas an anchor for sufficiency, not exclusivity

Advisory mapping to the marking scheme(checkpoints/deductions), allowing different orders and techniques. 4.Reference solutionas an anchor for sufficiency, not exclusivity. Alternative-approach policy • If the proof uses a different but valid method,map its steps to equivalent rubric check- points(same logical role) and award points accordingly. • Apply z...

work page
[5]

</errors> Input data •Problem Statement:{problem} •Marking Scheme:{marking scheme} •Proof Solution:{solution} ProofBench Grader Prompt You are anexpert math proof grader

specific error 2, ... </errors> Input data •Problem Statement:{problem} •Marking Scheme:{marking scheme} •Proof Solution:{solution} ProofBench Grader Prompt You are anexpert math proof grader. You are judging the correctness of an LLM-generated proof for a math problem. Input •Problem Statement: A mathematical problem that the proof is attempting to solve...

work page
[6]

Alternative-approach policy • If the proof uses a different but valid method,map its steps to equivalent rubric check- points(same logical role) and award points accordingly

Advisory mapping to the marking scheme(checkpoints/deductions), allowing different orders and techniques. Alternative-approach policy • If the proof uses a different but valid method,map its steps to equivalent rubric check- points(same logical role) and award points accordingly. • Do not penalizesolely for re-ordering steps, using different lemmas, or gi...

work page
[7]

A>B , C>D implies A−C>B−D

specific error 2, ... </errors> Input data •Problem Statement:{problem} •Marking Scheme:{marking scheme} •Proof Solution:{solution} GIMO Grader Prompt You are an expert mathematician and a meticulous grader for an International Mathematical Olympiad (IMO) level exam. Your primary task is to rigorously verify the provided mathematical solution using the pr...

work page
[8]

description of first issue,

work page
[9]

description of second issue, ... </errors> <points>INTEGER</points> Input data •Problem Statement:{problem} •Reference Solution:{reference solution} •Marking Scheme:{marking scheme} •Proof Solution:{solution} OPC Grader Prompt You are judging the correctness of an LLM-generated proof for a math problem. Input •Problem Statement: A mathematical problem tha...

work page
[10]

Analyze where this step was performed by the model and whether it occurs in the proof

work page
[11]

sufficiently largen

Analyze whether this step was performed sufficiently rigorously to award points. Any skipped steps or vague statements should be punished in your score. Output format Give your analysis in the following format: <overall> An overall analysis of the proof </overall> <step> For each step in the marking scheme, explain how many points you awarded for that ste...

work page
[12]

•1 pt: Correct final answer: 1750

(or 210×15=3150). •1 pt: Correct final answer: 1750. •Total (max 7) 2.Zero-credit items • Calculating (8 4) 2 (total pairs) but failing to make any progress on counting intersect- ing paths. • Subtracting an arbitrary term without justification derived from the path geometry or reflection principle. • Attempts to count intersections by summing over specif...

work page
[13]

Deduct from validity/strategy points

correctly but failing to provideanyjustification (neither naming the theorem/lemma nor describing the bijection). Deduct from validity/strategy points. 26 G Graded Examples This section contains all problem prompts, human evaluations, and model solutions from the blog post. G.1 PB-Basic-001 Problem Determine all functionsf:Z→Zsuch that, for allx,y∈Z, we h...

work page
[14]

3.h(2x) =2h(x)−cfor allx∈Z From these, we derive: - Ifc̸=0, thenh(x) =cfor allx, sof(x) =2x+c - If c= 0, then h(x) satisfies h(2x) = 2h(x) and h(f(x)) = 0

For allz∈im(f),h(z) =c. 3.h(2x) =2h(x)−cfor allx∈Z From these, we derive: - Ifc̸=0, thenh(x) =cfor allx, sof(x) =2x+c - If c= 0, then h(x) satisfies h(2x) = 2h(x) and h(f(x)) = 0. This forces h(x) = 0 or h(x) =− 2x, leading to: -f(x) =2x(fromh(x) =0) -f(x) =0 (fromh(x) =−2x) Hence, the only solutions are: - The zero function:f(x) =0 for allx∈Z - All funct...

work page 2024