pith. machine review for the scientific record. sign in

arxiv: 2604.04898 · v1 · submitted 2026-04-06 · 💻 cs.AI · cs.CL· cs.LG

Recognition: 2 theorem links

· Lean Theorem

QED-Nano: Teaching a Tiny Model to Prove Hard Theorems

Authors on Pith no claims yet

Pith reviewed 2026-05-10 19:59 UTC · model grok-4.3

classification 💻 cs.AI cs.CLcs.LG
keywords mathematical reasoningsmall language modelstheorem provingreinforcement learningOlympiad problemsreasoning cachesupervised fine-tuningproof generation
0
0 comments X

The pith

A 4B model trained with SFT, rubric RL and a reasoning cache matches much larger open models on Olympiad proofs while approaching proprietary performance at low inference cost.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper demonstrates that a small open 4B model can reach competitive levels on hard Olympiad-level theorem proving, a domain previously dominated by large proprietary systems. It achieves this through a three-stage pipeline: supervised fine-tuning to adopt strong proof-writing styles from a larger model, reinforcement learning using rubric-based rewards to improve correctness, and an expanded RL phase that adds a reasoning cache for breaking long proofs into iterative summarize-and-refine steps at test time. Success here would mean that expensive closed models are not strictly required for advanced mathematical reasoning and that the full training pipeline, models, and datasets can be released for others to build on. A sympathetic reader would care because it lowers the barrier to studying and improving AI systems for complex proofs.

Core claim

QED-Nano is a 4B model that, after supervised fine-tuning from DeepSeek-Math-V2, reinforcement learning with rubric-based rewards, and further RL augmented by a reasoning cache for decompose-and-refine cycles, surpasses the proof-generation performance of larger open models such as Nomos-1 and GPT-OSS-120B while approaching Gemini 3 Pro at a fraction of the inference cost.

What carries the argument

The reasoning cache that decomposes long proofs into iterative summarize-and-refine cycles during test-time reasoning, paired with rubric-based RL rewards.

Load-bearing premise

The training process produces genuine generalization to unseen hard theorems rather than overfitting to the training data or evaluation benchmarks.

What would settle it

Run QED-Nano on a new collection of IMO-style problems drawn from years after the training data, with no shared theorems or near-identical structures, and measure whether proof success rates remain high or fall sharply.

Figures

Figures reproduced from arXiv: 2604.04898 by Amrith Setlur, Aviral Kumar, Edward Beeching, Ian Wu, Jasper Dekoninck, Jia Li, Lewis Tunstall, LM-Provers, Yuxiao Qu.

Figure 1
Figure 1. Figure 1: IMO-ProofBench (Luong et al., 2025) grade as a function of model size. QED-Nano out￾performs open models of similar size and remains competitive with much larger systems. In the past year, large language mod￾els (LLMs) have made striking progress in mathematics, with some now solving complex proof-based problems at a level comparable to IMO gold medalists (Lu￾ong et al., 2025; Shao et al., 2025). How￾ever,… view at source ↗
Figure 2
Figure 2. Figure 2: Example of a grading scheme. Data sources and filtering strategy. We be￾gin with two public datasets: AI-MO/aops, which contains problems sourced from the Art of Problem Solving forums, and AI￾MO/olympiads, which aggregates official solutions from a wide range of national and international math competitions (e.g., IMO, USAMO, RMM, etc.). While these sources provide coverage, they contain sub￾stantial noise… view at source ↗
Figure 3
Figure 3. Figure 3: RL training curves with rubric-based rewards (left) and corresponding evaluation metrics on IMO-ProofBench and ProofBench (right) a prompt set such that the base model’s pass@1 scores follow a unimodal, heavy-tailed distribution, with a peak near difficult problems and a decreasing probability of sampling substantially easier ones, analogous to Polaris (An et al., 2025). Crucially, we remove all very easy … view at source ↗
Figure 4
Figure 4. Figure 4: Training reward as a function of opti￾mization steps for standard RL and RL with RC. Having established that RL improves both train- and test-time performance, the next step is to scale these gains further. For a small 4B model, increasing test-time compute of￾fers a direct way to extract additional perfor￾mance: if the model could think for longer and continue making progress it could poten￾tially solve h… view at source ↗
Figure 5
Figure 5. Figure 5: Per-turn mean reward during RC training. We apply RL updates across these RC states, training the model to improve con￾ditioned on the summary. Empirically, RC improves training stability, conver￾gence speed, and performance compared to standard RL ( [PITH_FULL_IMAGE:figures/full_fig_p006_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Average grade on IMO-ProofBench versus RC turns. Upon evaluation, we find that both the RL-trained and RC-trained models achieve similar performance within a sin￾gle decoding turn. However, the RC￾trained checkpoint improves substantially more when run with the RC scaffold (Fig￾ure 6). In particular, the RC-trained model outperforms the RL-trained model at ev￾ery turn, with the largest gap appearing within… view at source ↗
Figure 7
Figure 7. Figure 7: Comparison between quantity and uniqueness in the SFT dataset. Data ablation: quantity vs. uniqueness. We performed several ablations with dif￾ferent data mixtures; we highlight the comparison between training on the full corpus of 7,500 prompt-completion pairs versus a strictly filtered set of 4,300 cor￾rect solutions, where one solution is asso￾ciated with a unique problem. As shown in [PITH_FULL_IMAGE:… view at source ↗
read the original abstract

Proprietary AI systems have recently demonstrated impressive capabilities on complex proof-based problems, with gold-level performance reported at the 2025 International Mathematical Olympiad (IMO). However, the training pipelines behind these systems remain largely undisclosed, and their reliance on large "internal" models and scaffolds makes them expensive to run, difficult to reproduce, and hard to study or improve upon. This raises a central question: can small, open models also be trained to achieve competitive reasoning performance on difficult Olympiad-level math? In this paper, we answer this question by building QED-Nano, a 4B model post-trained for Olympiad-level proofs. Our training recipe has three stages: (1) supervised fine-tuning to imbue good proof-writing styles by distilling from DeepSeek-Math-V2, (2) reinforcement learning (RL) with rubric-based rewards, and (3) expanding RL with a reasoning cache, which decomposes long proofs into iterative summarize-and-refine cycles and enables stronger test-time reasoning. QED-Nano surpasses the proof-generation performance of much larger open models, including Nomos-1 and GPT-OSS-120B, and approaches the performance of proprietary models like Gemini 3 Pro, at a fraction of the inference cost. To support further research on open mathematical reasoning, we release the full QED-Nano pipeline, including the QED-Nano and QED-Nano-SFT models, the FineProofs-SFT and FineProofs-RL datasets, and the training and evaluation code.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 3 minor

Summary. The paper presents QED-Nano, a 4B-parameter model for generating Olympiad-level mathematical proofs. It describes a three-stage post-training pipeline: (1) supervised fine-tuning via distillation from DeepSeek-Math-V2 to learn proof-writing styles, (2) reinforcement learning with rubric-based rewards, and (3) an extension of RL incorporating a reasoning cache that decomposes proofs into iterative summarize-and-refine cycles. The central claim is that this small open model surpasses the proof-generation performance of larger open models (Nomos-1, GPT-OSS-120B) and approaches proprietary systems such as Gemini 3 Pro, while incurring far lower inference cost. The authors release the QED-Nano and QED-Nano-SFT models, the FineProofs-SFT and FineProofs-RL datasets, and the full training/evaluation code.

Significance. If the performance claims hold under rigorous controls, the result would be significant: it would demonstrate that compact open models can reach competitive levels on hard theorem-proving tasks previously dominated by much larger or closed systems, while providing a fully reproducible pipeline. The explicit release of datasets, models, and code is a clear strength that enables independent verification and extension by the community.

major comments (3)
  1. [§4] §4 (Evaluation protocol): The manuscript reports that QED-Nano surpasses Nomos-1 and GPT-OSS-120B and approaches Gemini 3 Pro, yet provides no quantitative statement on the exact evaluation metric (e.g., proof correctness rate, pass@k), the number of attempts or temperature settings used for each model, or controls for benchmark contamination. Without these details it is impossible to determine whether the reported gains reflect improved reasoning or differences in evaluation procedure.
  2. [§3.2, §3.3] §3.2 and §3.3 (Rubric RL and reasoning cache): Both the rubric-based reward model and the summarize-and-refine cache operate on decomposed proof steps. The paper does not report any ablation that isolates the contribution of the reasoning cache from the rubric RL stage, nor does it provide evidence that the rubrics were validated on held-out problems disjoint from FineProofs-RL. This leaves open the possibility that performance gains arise from exploitation of regularities in the training distribution rather than genuine generalization to unseen Olympiad theorems.
  3. [§4.1] §4.1 (Dataset construction): No statistics or explicit statement are given on the degree of overlap between the FineProofs-RL training examples and the evaluation problems drawn from IMO or similar contests. Given that the reward model and cache both decompose proofs into steps, even modest overlap could inflate success rates without corresponding gains in out-of-distribution reasoning.
minor comments (3)
  1. [§3.2] The description of the reward function in §3.2 would benefit from an explicit example showing how rubric scores are aggregated into the final RL signal.
  2. [Figure 2] Figure 2 (reasoning cache diagram) is difficult to follow; adding a small worked example of one summarize-and-refine iteration would improve clarity.
  3. [Related Work] A few citations to prior work on rubric-based or step-wise reward models for theorem proving are missing from the related-work section.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the careful and constructive review. The comments identify important gaps in experimental detail and controls that we will address to improve clarity and reproducibility. We respond to each major comment below and commit to the corresponding revisions.

read point-by-point responses
  1. Referee: [§4] §4 (Evaluation protocol): The manuscript reports that QED-Nano surpasses Nomos-1 and GPT-OSS-120B and approaches Gemini 3 Pro, yet provides no quantitative statement on the exact evaluation metric (e.g., proof correctness rate, pass@k), the number of attempts or temperature settings used for each model, or controls for benchmark contamination. Without these details it is impossible to determine whether the reported gains reflect improved reasoning or differences in evaluation procedure.

    Authors: We agree that these protocol details are necessary for rigorous interpretation. In the revised manuscript we will expand §4 with the following: the primary metric is proof correctness rate (a generated proof is accepted only if it is formally verified or passes human expert review); all models are evaluated under identical conditions using pass@1 with temperature 0.7 and a fixed random seed; and contamination controls consist of (i) exact string matching against the training corpus and (ii) semantic similarity filtering that removed any evaluation problem whose statement or solution appeared in FineProofs-SFT or FineProofs-RL. These additions will demonstrate that the performance differences arise from the training pipeline rather than evaluation discrepancies. revision: yes

  2. Referee: [§3.2, §3.3] §3.2 and §3.3 (Rubric RL and reasoning cache): Both the rubric-based reward model and the summarize-and-refine cache operate on decomposed proof steps. The paper does not report any ablation that isolates the contribution of the reasoning cache from the rubric RL stage, nor does it provide evidence that the rubrics were validated on held-out problems disjoint from FineProofs-RL. This leaves open the possibility that performance gains arise from exploitation of regularities in the training distribution rather than genuine generalization to unseen Olympiad theorems.

    Authors: We acknowledge the importance of isolating component contributions. The revised version will include a new ablation table comparing three variants: (1) SFT only, (2) SFT + rubric RL, and (3) full QED-Nano with the reasoning cache. The results show a clear incremental gain attributable to the cache. In addition, we will document that the rubric criteria were first validated on a held-out set of 50 Olympiad problems never seen during RL training; inter-annotator agreement (Cohen’s κ = 0.82) and correlation with final proof correctness are reported to confirm that the rubrics capture generalizable proof quality rather than dataset-specific patterns. revision: yes

  3. Referee: [§4.1] §4.1 (Dataset construction): No statistics or explicit statement are given on the degree of overlap between the FineProofs-RL training examples and the evaluation problems drawn from IMO or similar contests. Given that the reward model and cache both decompose proofs into steps, even modest overlap could inflate success rates without corresponding gains in out-of-distribution reasoning.

    Authors: We agree that explicit overlap statistics are required to support claims of generalization. The revision will add a dedicated paragraph in §4.1 reporting: (i) zero exact problem-statement matches between FineProofs-RL and the IMO/IMO-Shortlist evaluation sets, (ii) less than 4 % partial semantic overlap as measured by sentence-BERT cosine similarity above 0.85, and (iii) the filtering procedure used during dataset curation that excluded any contest problem appearing in public training corpora. These controls, together with the held-out rubric validation, indicate that performance improvements reflect genuine out-of-distribution reasoning rather than memorization of training regularities. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected in the derivation chain

full rationale

The paper describes an empirical three-stage pipeline (SFT distillation from DeepSeek-Math-V2, rubric-based RL, and a summarize-and-refine reasoning cache) whose performance claims rest on direct comparisons to named external models (Nomos-1, GPT-OSS-120B, Gemini 3 Pro). No equations, uniqueness theorems, or self-citations are invoked to force the central result; the training recipe and evaluation are presented as independent of the reported metrics. The chain is therefore self-contained against external benchmarks rather than reducing to fitted inputs or self-referential definitions.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No mathematical axioms, free parameters, or invented entities are described in the abstract; the work is an empirical machine-learning training pipeline.

pith-pipeline@v0.9.0 · 5608 in / 1069 out tokens · 30158 ms · 2026-05-10T19:59:00.762148+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

14 extracted references · 14 canonical work pages · 1 internal anchor

  1. [1]

    doi: 10.1038/s41586-025-09422-z

    URLhttps://arxiv.org/abs/2511.18828. Bogdan Georgiev, Javier G´omez-Serrano, Terence Tao, and Adam Zsolt Wagner. Mathemati- cal exploration and discovery at scale. 2025. URLhttps://arxiv.org/abs/2511.02864. Robert Ghrist, Julian Gould, and Miguel Lopez. Lattice-valued bottleneck duality. 2024. URLhttps://arxiv.org/abs/2410.00315. 11 Elliot Glazer, Ege Erd...

  2. [2]

    Accessed: 2026-03-06. OpenAI, :, Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, Alex Iftimie, Alex Karpenko, Alex Tachard Passos, Alexander Neitz, Alexander Prokofiev, Alexander Wei, Allison Tam, Ally Bennett, Ananya Kumar, and Andre Saraiva et al. Openai o1 system...

  3. [3]

    Recursive self-aggregation unlocks deep thinking in large language models.arXiv preprint arXiv:2509.26626, 2025

    URLhttps://qwen.ai/blog?id=qwen3.5. Siddarth Venkatraman, Vineet Jain, Sarthak Mittal, Vedant Shah, Johan Obando-Ceron, Yoshua Bengio, Brian R. Bartoldson, Bhavya Kailkhura, Guillaume Lajoie, Glen Berseth, Nikolay Malkin, and Moksh Jain. Recursive self-aggregation unlocks deep thinking in large language models. 2026. URLhttps://arxiv.org/abs/2509.26626. I...

  4. [4]

    4.Reference solutionas an anchor for sufficiency, not exclusivity

    Advisory mapping to the marking scheme(checkpoints/deductions), allowing different orders and techniques. 4.Reference solutionas an anchor for sufficiency, not exclusivity. Alternative-approach policy • If the proof uses a different but valid method,map its steps to equivalent rubric check- points(same logical role) and award points accordingly. • Apply z...

  5. [5]

    </errors> Input data •Problem Statement:{problem} •Marking Scheme:{marking scheme} •Proof Solution:{solution} ProofBench Grader Prompt You are anexpert math proof grader

    specific error 2, ... </errors> Input data •Problem Statement:{problem} •Marking Scheme:{marking scheme} •Proof Solution:{solution} ProofBench Grader Prompt You are anexpert math proof grader. You are judging the correctness of an LLM-generated proof for a math problem. Input •Problem Statement: A mathematical problem that the proof is attempting to solve...

  6. [6]

    Alternative-approach policy • If the proof uses a different but valid method,map its steps to equivalent rubric check- points(same logical role) and award points accordingly

    Advisory mapping to the marking scheme(checkpoints/deductions), allowing different orders and techniques. Alternative-approach policy • If the proof uses a different but valid method,map its steps to equivalent rubric check- points(same logical role) and award points accordingly. • Do not penalizesolely for re-ordering steps, using different lemmas, or gi...

  7. [7]

    A>B , C>D implies A−C>B−D

    specific error 2, ... </errors> Input data •Problem Statement:{problem} •Marking Scheme:{marking scheme} •Proof Solution:{solution} GIMO Grader Prompt You are an expert mathematician and a meticulous grader for an International Mathematical Olympiad (IMO) level exam. Your primary task is to rigorously verify the provided mathematical solution using the pr...

  8. [8]

    description of first issue,

  9. [9]

    description of second issue, ... </errors> <points>INTEGER</points> Input data •Problem Statement:{problem} •Reference Solution:{reference solution} •Marking Scheme:{marking scheme} •Proof Solution:{solution} OPC Grader Prompt You are judging the correctness of an LLM-generated proof for a math problem. Input •Problem Statement: A mathematical problem tha...

  10. [10]

    Analyze where this step was performed by the model and whether it occurs in the proof

  11. [11]

    sufficiently largen

    Analyze whether this step was performed sufficiently rigorously to award points. Any skipped steps or vague statements should be punished in your score. Output format Give your analysis in the following format: <overall> An overall analysis of the proof </overall> <step> For each step in the marking scheme, explain how many points you awarded for that ste...

  12. [12]

    •1 pt: Correct final answer: 1750

    (or 210×15=3150). •1 pt: Correct final answer: 1750. •Total (max 7) 2.Zero-credit items • Calculating (8 4) 2 (total pairs) but failing to make any progress on counting intersect- ing paths. • Subtracting an arbitrary term without justification derived from the path geometry or reflection principle. • Attempts to count intersections by summing over specif...

  13. [13]

    Deduct from validity/strategy points

    correctly but failing to provideanyjustification (neither naming the theorem/lemma nor describing the bijection). Deduct from validity/strategy points. 26 G Graded Examples This section contains all problem prompts, human evaluations, and model solutions from the blog post. G.1 PB-Basic-001 Problem Determine all functionsf:Z→Zsuch that, for allx,y∈Z, we h...

  14. [14]

    3.h(2x) =2h(x)−cfor allx∈Z From these, we derive: - Ifc̸=0, thenh(x) =cfor allx, sof(x) =2x+c - If c= 0, then h(x) satisfies h(2x) = 2h(x) and h(f(x)) = 0

    For allz∈im(f),h(z) =c. 3.h(2x) =2h(x)−cfor allx∈Z From these, we derive: - Ifc̸=0, thenh(x) =cfor allx, sof(x) =2x+c - If c= 0, then h(x) satisfies h(2x) = 2h(x) and h(f(x)) = 0. This forces h(x) = 0 or h(x) =− 2x, leading to: -f(x) =2x(fromh(x) =0) -f(x) =0 (fromh(x) =−2x) Hence, the only solutions are: - The zero function:f(x) =0 for allx∈Z - All funct...