Recognition: 2 theorem links
· Lean TheoremQED-Nano: Teaching a Tiny Model to Prove Hard Theorems
Pith reviewed 2026-05-10 19:59 UTC · model grok-4.3
The pith
A 4B model trained with SFT, rubric RL and a reasoning cache matches much larger open models on Olympiad proofs while approaching proprietary performance at low inference cost.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
QED-Nano is a 4B model that, after supervised fine-tuning from DeepSeek-Math-V2, reinforcement learning with rubric-based rewards, and further RL augmented by a reasoning cache for decompose-and-refine cycles, surpasses the proof-generation performance of larger open models such as Nomos-1 and GPT-OSS-120B while approaching Gemini 3 Pro at a fraction of the inference cost.
What carries the argument
The reasoning cache that decomposes long proofs into iterative summarize-and-refine cycles during test-time reasoning, paired with rubric-based RL rewards.
Load-bearing premise
The training process produces genuine generalization to unseen hard theorems rather than overfitting to the training data or evaluation benchmarks.
What would settle it
Run QED-Nano on a new collection of IMO-style problems drawn from years after the training data, with no shared theorems or near-identical structures, and measure whether proof success rates remain high or fall sharply.
Figures
read the original abstract
Proprietary AI systems have recently demonstrated impressive capabilities on complex proof-based problems, with gold-level performance reported at the 2025 International Mathematical Olympiad (IMO). However, the training pipelines behind these systems remain largely undisclosed, and their reliance on large "internal" models and scaffolds makes them expensive to run, difficult to reproduce, and hard to study or improve upon. This raises a central question: can small, open models also be trained to achieve competitive reasoning performance on difficult Olympiad-level math? In this paper, we answer this question by building QED-Nano, a 4B model post-trained for Olympiad-level proofs. Our training recipe has three stages: (1) supervised fine-tuning to imbue good proof-writing styles by distilling from DeepSeek-Math-V2, (2) reinforcement learning (RL) with rubric-based rewards, and (3) expanding RL with a reasoning cache, which decomposes long proofs into iterative summarize-and-refine cycles and enables stronger test-time reasoning. QED-Nano surpasses the proof-generation performance of much larger open models, including Nomos-1 and GPT-OSS-120B, and approaches the performance of proprietary models like Gemini 3 Pro, at a fraction of the inference cost. To support further research on open mathematical reasoning, we release the full QED-Nano pipeline, including the QED-Nano and QED-Nano-SFT models, the FineProofs-SFT and FineProofs-RL datasets, and the training and evaluation code.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents QED-Nano, a 4B-parameter model for generating Olympiad-level mathematical proofs. It describes a three-stage post-training pipeline: (1) supervised fine-tuning via distillation from DeepSeek-Math-V2 to learn proof-writing styles, (2) reinforcement learning with rubric-based rewards, and (3) an extension of RL incorporating a reasoning cache that decomposes proofs into iterative summarize-and-refine cycles. The central claim is that this small open model surpasses the proof-generation performance of larger open models (Nomos-1, GPT-OSS-120B) and approaches proprietary systems such as Gemini 3 Pro, while incurring far lower inference cost. The authors release the QED-Nano and QED-Nano-SFT models, the FineProofs-SFT and FineProofs-RL datasets, and the full training/evaluation code.
Significance. If the performance claims hold under rigorous controls, the result would be significant: it would demonstrate that compact open models can reach competitive levels on hard theorem-proving tasks previously dominated by much larger or closed systems, while providing a fully reproducible pipeline. The explicit release of datasets, models, and code is a clear strength that enables independent verification and extension by the community.
major comments (3)
- [§4] §4 (Evaluation protocol): The manuscript reports that QED-Nano surpasses Nomos-1 and GPT-OSS-120B and approaches Gemini 3 Pro, yet provides no quantitative statement on the exact evaluation metric (e.g., proof correctness rate, pass@k), the number of attempts or temperature settings used for each model, or controls for benchmark contamination. Without these details it is impossible to determine whether the reported gains reflect improved reasoning or differences in evaluation procedure.
- [§3.2, §3.3] §3.2 and §3.3 (Rubric RL and reasoning cache): Both the rubric-based reward model and the summarize-and-refine cache operate on decomposed proof steps. The paper does not report any ablation that isolates the contribution of the reasoning cache from the rubric RL stage, nor does it provide evidence that the rubrics were validated on held-out problems disjoint from FineProofs-RL. This leaves open the possibility that performance gains arise from exploitation of regularities in the training distribution rather than genuine generalization to unseen Olympiad theorems.
- [§4.1] §4.1 (Dataset construction): No statistics or explicit statement are given on the degree of overlap between the FineProofs-RL training examples and the evaluation problems drawn from IMO or similar contests. Given that the reward model and cache both decompose proofs into steps, even modest overlap could inflate success rates without corresponding gains in out-of-distribution reasoning.
minor comments (3)
- [§3.2] The description of the reward function in §3.2 would benefit from an explicit example showing how rubric scores are aggregated into the final RL signal.
- [Figure 2] Figure 2 (reasoning cache diagram) is difficult to follow; adding a small worked example of one summarize-and-refine iteration would improve clarity.
- [Related Work] A few citations to prior work on rubric-based or step-wise reward models for theorem proving are missing from the related-work section.
Simulated Author's Rebuttal
We thank the referee for the careful and constructive review. The comments identify important gaps in experimental detail and controls that we will address to improve clarity and reproducibility. We respond to each major comment below and commit to the corresponding revisions.
read point-by-point responses
-
Referee: [§4] §4 (Evaluation protocol): The manuscript reports that QED-Nano surpasses Nomos-1 and GPT-OSS-120B and approaches Gemini 3 Pro, yet provides no quantitative statement on the exact evaluation metric (e.g., proof correctness rate, pass@k), the number of attempts or temperature settings used for each model, or controls for benchmark contamination. Without these details it is impossible to determine whether the reported gains reflect improved reasoning or differences in evaluation procedure.
Authors: We agree that these protocol details are necessary for rigorous interpretation. In the revised manuscript we will expand §4 with the following: the primary metric is proof correctness rate (a generated proof is accepted only if it is formally verified or passes human expert review); all models are evaluated under identical conditions using pass@1 with temperature 0.7 and a fixed random seed; and contamination controls consist of (i) exact string matching against the training corpus and (ii) semantic similarity filtering that removed any evaluation problem whose statement or solution appeared in FineProofs-SFT or FineProofs-RL. These additions will demonstrate that the performance differences arise from the training pipeline rather than evaluation discrepancies. revision: yes
-
Referee: [§3.2, §3.3] §3.2 and §3.3 (Rubric RL and reasoning cache): Both the rubric-based reward model and the summarize-and-refine cache operate on decomposed proof steps. The paper does not report any ablation that isolates the contribution of the reasoning cache from the rubric RL stage, nor does it provide evidence that the rubrics were validated on held-out problems disjoint from FineProofs-RL. This leaves open the possibility that performance gains arise from exploitation of regularities in the training distribution rather than genuine generalization to unseen Olympiad theorems.
Authors: We acknowledge the importance of isolating component contributions. The revised version will include a new ablation table comparing three variants: (1) SFT only, (2) SFT + rubric RL, and (3) full QED-Nano with the reasoning cache. The results show a clear incremental gain attributable to the cache. In addition, we will document that the rubric criteria were first validated on a held-out set of 50 Olympiad problems never seen during RL training; inter-annotator agreement (Cohen’s κ = 0.82) and correlation with final proof correctness are reported to confirm that the rubrics capture generalizable proof quality rather than dataset-specific patterns. revision: yes
-
Referee: [§4.1] §4.1 (Dataset construction): No statistics or explicit statement are given on the degree of overlap between the FineProofs-RL training examples and the evaluation problems drawn from IMO or similar contests. Given that the reward model and cache both decompose proofs into steps, even modest overlap could inflate success rates without corresponding gains in out-of-distribution reasoning.
Authors: We agree that explicit overlap statistics are required to support claims of generalization. The revision will add a dedicated paragraph in §4.1 reporting: (i) zero exact problem-statement matches between FineProofs-RL and the IMO/IMO-Shortlist evaluation sets, (ii) less than 4 % partial semantic overlap as measured by sentence-BERT cosine similarity above 0.85, and (iii) the filtering procedure used during dataset curation that excluded any contest problem appearing in public training corpora. These controls, together with the held-out rubric validation, indicate that performance improvements reflect genuine out-of-distribution reasoning rather than memorization of training regularities. revision: yes
Circularity Check
No significant circularity detected in the derivation chain
full rationale
The paper describes an empirical three-stage pipeline (SFT distillation from DeepSeek-Math-V2, rubric-based RL, and a summarize-and-refine reasoning cache) whose performance claims rest on direct comparisons to named external models (Nomos-1, GPT-OSS-120B, Gemini 3 Pro). No equations, uniqueness theorems, or self-citations are invoked to force the central result; the training recipe and evaluation are presented as independent of the reported metrics. The chain is therefore self-contained against external benchmarks rather than reducing to fitted inputs or self-referential definitions.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
Our training recipe has three stages: (1) supervised fine-tuning ... (2) reinforcement learning (RL) with rubric-based rewards, and (3) expanding RL with a reasoning cache, which decomposes long proofs into iterative summarize-and-refine cycles
-
IndisputableMonolith/Foundation/RealityFromDistinction.leanreality_from_one_distinction unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
QED-Nano surpasses the proof-generation performance of much larger open models ... at a fraction of the inference cost
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
doi: 10.1038/s41586-025-09422-z
URLhttps://arxiv.org/abs/2511.18828. Bogdan Georgiev, Javier G´omez-Serrano, Terence Tao, and Adam Zsolt Wagner. Mathemati- cal exploration and discovery at scale. 2025. URLhttps://arxiv.org/abs/2511.02864. Robert Ghrist, Julian Gould, and Miguel Lopez. Lattice-valued bottleneck duality. 2024. URLhttps://arxiv.org/abs/2410.00315. 11 Elliot Glazer, Ege Erd...
-
[2]
Accessed: 2026-03-06. OpenAI, :, Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, Alex Iftimie, Alex Karpenko, Alex Tachard Passos, Alexander Neitz, Alexander Prokofiev, Alexander Wei, Allison Tam, Ally Bennett, Ananya Kumar, and Andre Saraiva et al. Openai o1 system...
work page internal anchor Pith review Pith/arXiv arXiv doi:10.1038/s41586-025-09962-4 2026
-
[3]
URLhttps://qwen.ai/blog?id=qwen3.5. Siddarth Venkatraman, Vineet Jain, Sarthak Mittal, Vedant Shah, Johan Obando-Ceron, Yoshua Bengio, Brian R. Bartoldson, Bhavya Kailkhura, Guillaume Lajoie, Glen Berseth, Nikolay Malkin, and Moksh Jain. Recursive self-aggregation unlocks deep thinking in large language models. 2026. URLhttps://arxiv.org/abs/2509.26626. I...
-
[4]
4.Reference solutionas an anchor for sufficiency, not exclusivity
Advisory mapping to the marking scheme(checkpoints/deductions), allowing different orders and techniques. 4.Reference solutionas an anchor for sufficiency, not exclusivity. Alternative-approach policy • If the proof uses a different but valid method,map its steps to equivalent rubric check- points(same logical role) and award points accordingly. • Apply z...
-
[5]
specific error 2, ... </errors> Input data •Problem Statement:{problem} •Marking Scheme:{marking scheme} •Proof Solution:{solution} ProofBench Grader Prompt You are anexpert math proof grader. You are judging the correctness of an LLM-generated proof for a math problem. Input •Problem Statement: A mathematical problem that the proof is attempting to solve...
-
[6]
Advisory mapping to the marking scheme(checkpoints/deductions), allowing different orders and techniques. Alternative-approach policy • If the proof uses a different but valid method,map its steps to equivalent rubric check- points(same logical role) and award points accordingly. • Do not penalizesolely for re-ordering steps, using different lemmas, or gi...
-
[7]
specific error 2, ... </errors> Input data •Problem Statement:{problem} •Marking Scheme:{marking scheme} •Proof Solution:{solution} GIMO Grader Prompt You are an expert mathematician and a meticulous grader for an International Mathematical Olympiad (IMO) level exam. Your primary task is to rigorously verify the provided mathematical solution using the pr...
-
[8]
description of first issue,
-
[9]
description of second issue, ... </errors> <points>INTEGER</points> Input data •Problem Statement:{problem} •Reference Solution:{reference solution} •Marking Scheme:{marking scheme} •Proof Solution:{solution} OPC Grader Prompt You are judging the correctness of an LLM-generated proof for a math problem. Input •Problem Statement: A mathematical problem tha...
-
[10]
Analyze where this step was performed by the model and whether it occurs in the proof
-
[11]
Analyze whether this step was performed sufficiently rigorously to award points. Any skipped steps or vague statements should be punished in your score. Output format Give your analysis in the following format: <overall> An overall analysis of the proof </overall> <step> For each step in the marking scheme, explain how many points you awarded for that ste...
-
[12]
•1 pt: Correct final answer: 1750
(or 210×15=3150). •1 pt: Correct final answer: 1750. •Total (max 7) 2.Zero-credit items • Calculating (8 4) 2 (total pairs) but failing to make any progress on counting intersect- ing paths. • Subtracting an arbitrary term without justification derived from the path geometry or reflection principle. • Attempts to count intersections by summing over specif...
-
[13]
Deduct from validity/strategy points
correctly but failing to provideanyjustification (neither naming the theorem/lemma nor describing the bijection). Deduct from validity/strategy points. 26 G Graded Examples This section contains all problem prompts, human evaluations, and model solutions from the blog post. G.1 PB-Basic-001 Problem Determine all functionsf:Z→Zsuch that, for allx,y∈Z, we h...
-
[14]
For allz∈im(f),h(z) =c. 3.h(2x) =2h(x)−cfor allx∈Z From these, we derive: - Ifc̸=0, thenh(x) =cfor allx, sof(x) =2x+c - If c= 0, then h(x) satisfies h(2x) = 2h(x) and h(f(x)) = 0. This forces h(x) = 0 or h(x) =− 2x, leading to: -f(x) =2x(fromh(x) =0) -f(x) =0 (fromh(x) =−2x) Hence, the only solutions are: - The zero function:f(x) =0 for allx∈Z - All funct...
work page 2024
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.