VERITAS: Verifier-Guided Proof Search for Zero-Shot Formal Theorem Proving

Kevin Leach; Manish Acharya; Yifan Zhang; Yueke Zhang; Yu Huang; Zhenyu Liao

arxiv: 2606.19399 · v1 · pith:P22VHLU7new · submitted 2026-06-17 · 💻 cs.LG · cs.AI· cs.LO· cs.PL

VERITAS: Verifier-Guided Proof Search for Zero-Shot Formal Theorem Proving

Manish Acharya , Zhenyu Liao , Yueke Zhang , Kevin Leach , Yu Huang , Yifan Zhang This is my paper

Pith reviewed 2026-06-26 21:36 UTC · model grok-4.3

classification 💻 cs.LG cs.AIcs.LOcs.PL

keywords formal theorem provingLLM proversverifier-guided searchMCTSminiF2Fzero-shotproof searchcombinatorics benchmark

0 comments

The pith

VERITAS reaches 40.6% on miniF2F by routing verifier signals into critic-guided MCTS after Best-of-N sampling.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

VERITAS shows that LLM formal provers improve when they treat verifier outputs beyond a binary pass/fail by running Best-of-N sampling first then feeding its failures as negative examples into a second-phase MCTS. The protocol guarantees that every theorem solved in the initial sweep remains solved, so any extra successes come from the feedback-driven exploration. On the standard miniF2F benchmark this yields higher success than independent Best-of-5 or Portfolio baselines. The paper also releases a 55-theorem combinatorics set where unguided sampling drops below Portfolio, indicating that iterative lemma recovery from verifier messages requires guided search.

Core claim

The central claim is that a two-phase zero-shot protocol—Best-of-N sampling followed by a critic that converts Phase-1 verifier failures (syntax errors, type mismatches, partial progress) into explicit negative examples for MCTS—produces additional proofs while preserving all Phase-1 solves, reaching 40.6% on miniF2F and 7.3% on the new VERITAS-CombiBench where Best-of-5 falls to 1.8%.

What carries the argument

The two-phase protocol that ingests every verifier signal as negative examples for critic-guided MCTS after an initial Best-of-N sweep.

If this is right

Theorems solved by the initial Best-of-N sweep are guaranteed to stay solved after the MCTS pass.
Unguided sampling underperforms when the correct lemma names must be recovered iteratively from successive verifier messages.
The released 55-theorem combinatorics benchmark exposes the gap between Best-of-5 and Portfolio methods under feedback-driven conditions.
Rich verifier signals (syntax, types, partial goals) can be used directly as training signals for the critic without additional human annotation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same two-phase routing of verifier signals could be tested on other interactive theorem provers that already expose detailed error messages.
Releasing the combinatorics benchmark creates a controlled setting for measuring how well different search methods recover lemma sequences over multiple turns.
If the critic can be replaced by a lighter rule-based extractor of failure patterns, the method might run with lower overhead while retaining the reported gains.

Load-bearing premise

A critic model can reliably turn Phase-1 verifier failure signals into useful negative examples that improve MCTS exploration without introducing bias or inefficiency that erases the gains.

What would settle it

Ablating the negative examples derived from Phase-1 failures and measuring whether the additional solves on miniF2F and VERITAS-CombiBench disappear while Phase-1 solves remain unchanged.

Figures

Figures reproduced from arXiv: 2606.19399 by Kevin Leach, Manish Acharya, Yifan Zhang, Yueke Zhang, Yu Huang, Zhenyu Liao.

**Figure 1.** Figure 1: VERITAS two-phase protocol. Phase 1: Best-of-N dispatch; failures feed corpus F1. Phase 2: Critic-guided MCTS on U1 with Lean signals σA–σD; final S = S1 ∪ S2 satisfies S ⊇ S1. Portfolio by +14.4pp. Because the Best-of-5 comparison is an independent stochastic run, we treat this gap as descriptive and report paired attribution separately. The two single-component ablations close the 2×2 grid. MCTS without… view at source ↗

**Figure 2.** Figure 2: Main results on miniF2F. (a) Solve rates with 95% Wilson CIs; dashed line: ReProver (26.5%). (b) Per-category Bestof-5→VERITAS deltas (see [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗

**Figure 3.** Figure 3: Phase attribution and difficulty split. (a) Solved theorems on 201 hard problems (47 shared; orange: 11 VERITASonly; blue: 5 Best-of-5-only). (b) Easy (sorry, n=43) vs. hard (non-sorry, n=201) solve rates by method. Difficulty stratification. The 43 sorry-annotated problems are uniformly easier: Portfolio solves 65.1%, Bestof-5 solves 88.4%, VERITAS solves 95.3%. On the 201 non-sorry problems the rates … view at source ↗

**Figure 4.** Figure 4: Compute efficiency and failure modes. (a) Solve rate vs. total Lean calls (log); VERITAS on the inference-only Pareto frontier. (b) Deepest verifier signal at termination; 24 σC nearmisses (10%). 3.7. Failure-mode Analysis Locates the Next Frontier Stratifying the 244 theorems by deepest verifier signal (Figure 4(b)): 99 proofs (σD=1); 24 (10%) reach σC without closing—the most promising near-misses for… view at source ↗

read the original abstract

LLM-based formal provers often collapse rich verifier signals (syntax errors, type mismatches, partial goal progress) into a binary pass/fail bit. We present VERITAS, a zero-shot framework that routes every verifier signal back into proof search through a two-phase protocol: Best-of-N sampling first, then a critic-guided MCTS pass that ingests Phase 1 failures as explicit negative examples. The protocol preserves every theorem solved by its own Phase 1 sweep, so Phase 2's additional solves are attributable to feedback-driven exploration. VERITAS reaches 40.6% on miniF2F (vs. an independently run Best-of-5 at 36.9%, Portfolio 26.2%) and 7.3% on VERITAS-CombiBench, a 55-theorem combinatorics benchmark we release on which Best-of-5 (1.8%) falls below Portfolio (3.6%), exposing that unguided sampling hurts when correct lemma names must be recovered iteratively from verifier feedback. Artifacts are available on GitHub.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

VERITAS adds a critic-guided MCTS phase on verifier negatives after Best-of-N sampling, claims 40.6% on miniF2F and 7.3% on a new 55-theorem combinatorics benchmark, but the gains rest on an untested assumption about the critic.

read the letter

The main thing here is a two-phase protocol that runs Best-of-N first, then feeds verifier failure signals (syntax errors, type mismatches, partial goals) as explicit negatives into a critic that guides MCTS, while keeping every Phase-1 solve. That setup is new in the cited baselines and produces the reported deltas: 40.6% versus 36.9% Best-of-5 on miniF2F and 7.3% versus 1.8%/3.6% on the released VERITAS-CombiBench. The new benchmark is useful on its own because it shows unguided sampling can hurt when lemma names must be recovered iteratively.

The paper does a clean job of preserving Phase-1 solves so any extra solves can be attributed to the second phase. Releasing the benchmark and GitHub artifacts is also straightforward and helpful.

The soft spot is exactly the one the stress-test flags. The whole improvement is credited to the critic turning raw verifier output into useful negative examples without adding bias or wasting search budget. The abstract gives no prompt, no training details, and no ablation that removes the critic while matching compute. Until those are shown, it is hard to know whether the 3.7-point lift on miniF2F or the larger lift on CombiBench actually comes from feedback-driven exploration or from something else in the search setup. The numbers are concrete, but the causal claim is still thin.

This is for groups working on LLM-guided formal search who already run verifiers and want to test whether richer signals help. It is worth a serious referee because the benchmark is new, the protocol is described clearly enough to replicate, and the empirical comparison is direct. The paper should go to review rather than desk reject.

Referee Report

2 major / 2 minor

Summary. The paper presents VERITAS, a zero-shot LLM-based formal theorem proving framework using a two-phase protocol: Best-of-N sampling followed by critic-guided MCTS that ingests Phase-1 verifier failure signals (syntax errors, type mismatches, partial goals) as explicit negative examples. The protocol is designed to preserve all Phase-1 solves so that any additional solves in Phase 2 are attributable to feedback-driven exploration. It reports 40.6% on miniF2F (vs. independently run Best-of-5 at 36.9% and Portfolio at 26.2%) and 7.3% on the released 55-theorem VERITAS-CombiBench (vs. Best-of-5 at 1.8% and Portfolio at 3.6%), with GitHub artifacts provided.

Significance. If the central performance claims hold, the work demonstrates the value of routing rich verifier signals into search rather than collapsing them to binary outcomes, particularly in combinatorics where iterative lemma recovery benefits from guided exploration. The release of VERITAS-CombiBench and the preservation of Phase-1 solves are concrete strengths that enable direct attribution and community reuse.

major comments (2)

[Methods (Phase-2 protocol)] The description of the critic (how verifier outputs are mapped to negative examples for MCTS) lacks an ablation that removes the critic while exactly matching Phase-2 search budget; without this, the attribution of the 3.7-point miniF2F gain and the CombiBench delta specifically to feedback-driven exploration cannot be isolated from additional search effort.
[Experimental setup and critic implementation] No details are provided on the critic's prompt template, whether it is zero-shot or fine-tuned, or how negative-example construction avoids systematic bias across proof states (e.g., over-penalizing viable branches with certain lemma styles); this directly affects the weakest assumption that the critic reliably improves exploration without erasing net gains.

minor comments (2)

[Abstract] The abstract states that artifacts are available on GitHub; the manuscript should include a direct link and a brief description of what is released (models, prompts, benchmark theorems).
[Results] Table or figure reporting per-theorem results on CombiBench would help readers assess whether the 7.3% aggregate reflects broad improvement or concentration on a few theorems.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and for highlighting the strengths of our work, including the preservation of Phase-1 solves and the release of VERITAS-CombiBench. We respond point-by-point to the major comments below and commit to revisions that directly address the concerns raised.

read point-by-point responses

Referee: [Methods (Phase-2 protocol)] The description of the critic (how verifier outputs are mapped to negative examples for MCTS) lacks an ablation that removes the critic while exactly matching Phase-2 search budget; without this, the attribution of the 3.7-point miniF2F gain and the CombiBench delta specifically to feedback-driven exploration cannot be isolated from additional search effort.

Authors: We agree that an ablation isolating the critic's contribution while exactly matching the Phase-2 search budget would strengthen attribution of the observed gains. In the revised manuscript we will add this ablation: Phase-2 will be re-run with the identical number of MCTS simulations and node expansions but without the critic (using uniform selection or a non-feedback baseline), and we will report the resulting solve rates on both miniF2F and VERITAS-CombiBench. This will allow direct comparison and clarify how much of the 3.7-point and CombiBench deltas are due to feedback-driven exploration versus additional search effort. revision: yes
Referee: [Experimental setup and critic implementation] No details are provided on the critic's prompt template, whether it is zero-shot or fine-tuned, or how negative-example construction avoids systematic bias across proof states (e.g., over-penalizing viable branches with certain lemma styles); this directly affects the weakest assumption that the critic reliably improves exploration without erasing net gains.

Authors: The critic is implemented as a zero-shot prompt to the same base LLM used elsewhere in the framework; no fine-tuning occurs. We will include the complete prompt template in the appendix of the revised manuscript. Negative examples are constructed by extracting only the specific failed tactic or lemma name from the verifier output and marking solely that branch as negative for the MCTS value function. The prompt is localized to the current proof state and error type, avoiding global penalties. We will add a brief discussion of potential biases (including lemma-style effects) and the design choices intended to limit them. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical comparisons only

full rationale

The paper reports direct empirical results on miniF2F and a new benchmark using a two-phase protocol versus independently run baselines (Best-of-5, Portfolio). No equations, fitted parameters, or derivations are present that could reduce to inputs by construction. No self-citation chains or uniqueness theorems are invoked as load-bearing. The central claims rest on observable performance deltas, which are externally falsifiable and self-contained against the stated benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no explicit free parameters, axioms, or invented entities; all claims rest on standard LLM and search assumptions not detailed here.

pith-pipeline@v0.9.1-grok · 5742 in / 1202 out tokens · 26894 ms · 2026-06-26T21:36:02.305028+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

72 extracted references · 7 linked inside Pith

[1]

Zheng, Kunhao and Han, Jesse Michael and Polu, Stanislas , booktitle =
[2]

arXiv preprint arXiv:2009.03393 , year =

Generative Language Modeling for Automated Theorem Proving , author =. arXiv preprint arXiv:2009.03393 , year =

Pith/arXiv arXiv 2009
[3]

Yang, Kaiyu and Swope, Aidan M and Gu, Alex and Chalamala, Rahul and Song, Peiyang and Yu, Shixing and Godil, Saad and Prenger, Ryan and Anandkumar, Anima , booktitle =
[4]

Advances in Neural Information Processing Systems (NeurIPS) , year =

Lample, Guillaume and Lachaux, Marie-Anne and Lavril, Thibaut and Martinet, Xavier and Hayat, Amaury and Ebner, Gabriel and Rodriguez, Aur. Advances in Neural Information Processing Systems (NeurIPS) , year =
[5]

and Li, Wenda and Tworkowski, Szymon and Czechowski, Konrad and Odrzyg

Jiang, Albert Q. and Li, Wenda and Tworkowski, Szymon and Czechowski, Konrad and Odrzyg. Advances in Neural Information Processing Systems (NeurIPS) , year =
[6]

Wang, Haiming and Xin, Huajian and Zheng, Chuanyang and Li, Lin and Liu, Zhengying and Cao, Qingxing and Huang, Yinya and Xiong, Jing and Shi, Han and Xie, Enze and Yin, Jian and Li, Zhenguo and Liao, Heng and Liang, Xiaodan , booktitle =
[7]

Wu, Zijian and Huang, Suozhi and Zhou, Zhejian and Ying, Huaiyuan and Wang, Jiayu and Lin, Dahua and Chen, Kai , journal =
[8]

Ren, Z. Z. and Shao, Zhihong and Song, Junxiao and Xin, Huajian and Wang, Haocheng and Zhao, Wanjia and Zhang, Liyue and Fu, Zhe and Zhu, Qihao and Yang, Dejian and Wu, Z. F. and Gou, Zhibin and Ma, Shirong and Tang, Hongxuan and Liu, Yuxuan and Gao, Wenjun and Guo, Daya and Ruan, Chong , journal =
[9]

2024 , howpublished =

2024
[10]

First Conference on Language Modeling (COLM) , year =

An In-Context Learning Agent for Formal Theorem-Proving , author =. First Conference on Language Modeling (COLM) , year =
[11]

Kumarappan, Adarsh and Tiwari, Mo and Song, Peiyang and George, Robert Joseph and Xiao, Chaowei and Anandkumar, Anima , booktitle =
[12]

and Deng, Jia and Biderman, Stella and Welleck, Sean , booktitle =

Azerbayev, Zhangir and Schoelkopf, Hailey and Paster, Keiran and Dos Santos, Marco and McAleer, Stephen and Jiang, Albert Q. and Deng, Jia and Biderman, Stella and Welleck, Sean , booktitle =
[13]

and Radev, Dragomir and Avigad, Jeremy , journal =

Azerbayev, Zhangir and Piotrowski, Bartosz and Schoelkopf, Hailey and Ayers, Edward W. and Radev, Dragomir and Avigad, Jeremy , journal =
[14]

Xin, Huajian and Guo, Daya and Shao, Zhihong and Ren, Z. Z. and Zhu, Qihao and Liu, Bo and Ruan, Chong and Li, Wenda and Liang, Xiaodan , journal =
[15]

Xin, Huajian and Ren, Z. Z. and Song, Junxiao and Shao, Zhihong and Zhao, Wanjia and Wang, Haocheng and Liu, Bo and Zhang, Liyue and Lu, Xuan and Du, Qiushi and Gao, Wenjun and Zhang, Haowei and Zhu, Qihao and Yang, Dejian and Gou, Zhibin and Wu, Z. F. and Luo, Fuli and Ruan, Chong , booktitle =
[16]

Lin, Yong and Tang, Shange and Lyu, Bohan and Wu, Jiayun and Lin, Hongzhou and Yang, Kaiyu and Li, Jia and Xia, Mengzhou and Chen, Danqi and Arora, Sanjeev and Jin, Chi , journal =
[17]

International Conference on Learning Representations (ICLR) , year =

Proof Artifact Co-Training for Theorem Proving with Language Models , author =. International Conference on Learning Representations (ICLR) , year =
[18]

International Conference on Machine Learning (ICML) , year =

Formal Mathematics Statement Curriculum Learning , author =. International Conference on Machine Learning (ICML) , year =
[19]

Advances in Neural Information Processing Systems (NeurIPS) , year =

Autoformalization with Large Language Models , author =. Advances in Neural Information Processing Systems (NeurIPS) , year =
[20]

International Conference on Learning Representations (ICLR) , year =

Draft, Sketch, and Prove: Guiding Formal Theorem Provers with Informal Proofs , author =. International Conference on Learning Representations (ICLR) , year =
[21]

Proceedings of the 12th ACM SIGPLAN International Conference on Certified Programs and Proofs (CPP) , pages =

Limperg, Jannis and From, Asta Halkj. Proceedings of the 12th ACM SIGPLAN International Conference on Certified Programs and Proofs (CPP) , pages =
[22]

de Moura, Leonardo and Ullrich, Sebastian , booktitle =. The. 2021 , publisher =

2021
[23]

Proceedings of the 9th ACM SIGPLAN International Conference on Certified Programs and Proofs (CPP) , pages =

The. Proceedings of the 9th ACM SIGPLAN International Conference on Certified Programs and Proofs (CPP) , pages =
[24]

International Conference on Machine Learning (ICML) , pages =

Learning to Prove Theorems via Interacting with Proof Assistants , author =. International Conference on Machine Learning (ICML) , pages =
[25]

Huang, Daniel and Dhariwal, Prafulla and Song, Dawn and Sutskever, Ilya , booktitle =
[26]

and Rabe, Markus N

Bansal, Kshitij and Loos, Sarah M. and Rabe, Markus N. and Szegedy, Christian and Wilcox, Stewart , booktitle =
[27]

Bandit Based

Kocsis, Levente and Szepesv. Bandit Based. 17th European Conference on Machine Learning (ECML) , pages =
[28]

Machine Learning , volume =

Finite-Time Analysis of the Multiarmed Bandit Problem , author =. Machine Learning , volume =
[29]

A General Reinforcement Learning Algorithm that Masters Chess, Shogi, and

Silver, David and Hubert, Thomas and Schrittwieser, Julian and Antonoglou, Ioannis and Lai, Matthew and Guez, Arthur and Lanctot, Marc and Sifre, Laurent and Kumaran, Dharshan and Graepel, Thore and Lillicrap, Timothy and Simonyan, Karen and Hassabis, Demis , journal =. A General Reinforcement Learning Algorithm that Masters Chess, Shogi, and
[30]

Nature , volume =

Solving Olympiad Geometry Without Human Demonstrations , author =. Nature , volume =
[31]

Chervonyi, Yuri and Trinh, Trieu H. and Ol. Gold-medalist Performance in Solving Olympiad Geometry with. arXiv preprint arXiv:2502.03544 , year =

arXiv
[32]

International Conference on Learning Representations (ICLR) , year =

Let's Verify Step by Step , author =. International Conference on Learning Representations (ICLR) , year =
[33]

arXiv preprint arXiv:2110.14168 , year =

Training Verifiers to Solve Math Word Problems , author =. arXiv preprint arXiv:2110.14168 , year =

Pith/arXiv arXiv
[34]

arXiv preprint arXiv:2211.14275 , year =

Solving Math Word Problems with Process- and Outcome-Based Feedback , author =. arXiv preprint arXiv:2211.14275 , year =

Pith/arXiv arXiv
[35]

Fang, Zihan and Zhang, Yifan and Zhang, Yueke and Leach, Kevin and Huang, Yu , journal =
[36]

Shinn, Noah and Cassano, Federico and Berman, Edward and Gopinath, Ashwin and Narasimhan, Karthik and Yao, Shunyu , booktitle =
[37]

Zhang, Yifan and Li, Jieyu and Pei, Kexin and Huang, Yu and Leach, Kevin , journal =
[38]

Zhang, Yueke and Zhang, Yifan and Leach, Kevin and Huang, Yu , journal =
[39]

International Conference on Learning Representations (ICLR) , year =

Generating Sequences by Learning to Self-Correct , author =. International Conference on Learning Representations (ICLR) , year =
[40]

Advances in Neural Information Processing Systems (NeurIPS) , year =

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models , author =. Advances in Neural Information Processing Systems (NeurIPS) , year =
[41]

International Conference on Learning Representations (ICLR) , year =

Self-Consistency Improves Chain of Thought Reasoning in Language Models , author =. International Conference on Learning Representations (ICLR) , year =
[42]

Advances in Neural Information Processing Systems (NeurIPS) , year =

Tree of Thoughts: Deliberate Problem Solving with Large Language Models , author =. Advances in Neural Information Processing Systems (NeurIPS) , year =
[43]

Yao, Shunyu and Zhao, Jeffrey and Yu, Dian and Du, Nan and Shafran, Izhak and Narasimhan, Karthik and Cao, Yuan , booktitle =
[44]

Advances in Neural Information Processing Systems (NeurIPS) , year =

Large Language Models are Zero-Shot Reasoners , author =. Advances in Neural Information Processing Systems (NeurIPS) , year =
[45]

Measuring Mathematical Problem Solving with the

Hendrycks, Dan and Burns, Collin and Kadavath, Saurav and Arora, Akul and Basart, Steven and Tang, Eric and Song, Dawn and Steinhardt, Jacob , booktitle =. Measuring Mathematical Problem Solving with the
[46]

Advances in Neural Information Processing Systems (NeurIPS) , year =

Solving Quantitative Reasoning Problems with Language Models , author =. Advances in Neural Information Processing Systems (NeurIPS) , year =
[47]

Shao, Zhihong and Wang, Peiyi and Zhu, Qihao and Xu, Runxin and Song, Junxiao and Bi, Xiao and Zhang, Haowei and Zhang, Mingchuan and Li, Y. K. and Wu, Y. and Guo, Daya , journal =
[48]

Guo, Daya and Yang, Dejian and Zhang, Haowei and Song, Junxiao and Zhang, Ruoyu and Xu, Runxin and Zhu, Qihao and Ma, Shirong and Wang, Peiyi and Bi, Xiao and others , journal =
[49]

International Conference on Learning Representations (ICLR) , year =

Hong, Sirui and Zhuge, Mingchen and Chen, Jiaqi and Zheng, Xiawu and Cheng, Yuheng and Zhang, Ceyao and Wang, Jinlin and Wang, Zili and Yau, Steven Ka Shing and Lin, Zijuan and Zhou, Liyang and Ran, Chenyu and Xiao, Lingfeng and Wu, Chenglin and Schmidhuber, J. International Conference on Learning Representations (ICLR) , year =
[50]

and Burger, Doug and Wang, Chi , booktitle =

Wu, Qingyun and Bansal, Gagan and Zhang, Jieyu and Wu, Yiran and Li, Beibin and Zhu, Erkang and Jiang, Li and Zhang, Xiaoyun and Zhang, Shaokun and Liu, Jiale and Awadallah, Ahmed Hassan and White, Ryen W. and Burger, Doug and Wang, Chi , booktitle =
[51]

Zhang, Yifan and Ganapavarapu, Giridhar and Jayaraman, Srideepika and Agrawal, Bhavna and Patel, Dhaval and Fokoue, Achille , booktitle =
[52]

Li, Guohao and Hammoud, Hasan Abed Al Kader and Itani, Hani and Khizbullin, Dmitrii and Ghanem, Bernard , booktitle =
[53]

Conference on Empirical Methods in Natural Language Processing (EMNLP) , year =

Encouraging Divergent Thinking in Large Language Models through Multi-Agent Debate , author =. Conference on Empirical Methods in Natural Language Processing (EMNLP) , year =
[54]

arXiv preprint arXiv:2107.03374 , year =

Evaluating Large Language Models Trained on Code , author =. arXiv preprint arXiv:2107.03374 , year =

Pith/arXiv arXiv
[55]

Competition-Level Code Generation with

Li, Yujia and Choi, David and Chung, Junyoung and Kushman, Nate and Schrittwieser, Julian and Leblond, R. Competition-Level Code Generation with. Science , volume =
[56]

Rozi. Code. arXiv preprint arXiv:2308.12950 , year =

Pith/arXiv arXiv
[57]

Advances in Neural Information Processing Systems (NeurIPS) , year =

Language Models are Few-Shot Learners , author =. Advances in Neural Information Processing Systems (NeurIPS) , year =
[58]

arXiv preprint arXiv:2303.08774 , year =

Pith/arXiv arXiv
[59]

Touvron, Hugo and Martin, Louis and Stone, Kevin and Albert, Peter and Almahairi, Amjad and Babaei, Yasmine and Bashlykov, Nikolay and Batra, Soumya and Bhargava, Prajjwal and Bhosale, Shruti and others , journal =
[60]

Retrieval-Augmented Generation for Knowledge-Intensive

Lewis, Patrick and Perez, Ethan and Piktus, Aleksandra and Petroni, Fabio and Karpukhin, Vladimir and Goyal, Naman and K. Retrieval-Augmented Generation for Knowledge-Intensive. Advances in Neural Information Processing Systems (NeurIPS) , year =
[61]

Conference on Empirical Methods in Natural Language Processing (EMNLP) , year =

Dense Passage Retrieval for Open-Domain Question Answering , author =. Conference on Empirical Methods in Natural Language Processing (EMNLP) , year =
[62]

Irving, Geoffrey and Szegedy, Christian and Alemi, Alexander A. and E. Advances in Neural Information Processing Systems (NeurIPS) , year =
[63]

and Blanchette, Jasmin C

Paulson, Lawrence C. and Blanchette, Jasmin C. , booktitle =. Three Years of Experience with
[64]

Gauthier, Thibault and Kaliszyk, Cezary and Urban, Josef and Kumar, Ramana and Norrish, Michael , journal =
[65]

Advances in Neural Information Processing Systems (NeurIPS) , year =

Reinforcement Learning of Theorem Proving , author =. Advances in Neural Information Processing Systems (NeurIPS) , year =
[66]

Proceedings of the 4th ACM SIGPLAN International Workshop on Machine Learning and Programming Languages (MAPL) , pages =

Generating Correctness Proofs with Neural Networks , author =. Proceedings of the 4th ACM SIGPLAN International Workshop on Machine Learning and Programming Languages (MAPL) , pages =
[67]

and Ringer, Talia and Brun, Yuriy , booktitle =

First, Emily and Rabe, Markus N. and Ringer, Talia and Brun, Yuriy , booktitle =
[68]

Advances in Neural Information Processing Systems (NeurIPS) , year =

Thinking Fast and Slow with Deep Learning and Tree Search , author =. Advances in Neural Information Processing Systems (NeurIPS) , year =
[69]

, booktitle =

Zelikman, Eric and Wu, Yuhuai and Mu, Jesse and Goodman, Noah D. , booktitle =
[70]

Advances in Neural Information Processing Systems (NeurIPS) , year =

Training Language Models to Follow Instructions with Human Feedback , author =. Advances in Neural Information Processing Systems (NeurIPS) , year =
[71]

Advances in Neural Information Processing Systems (NeurIPS) , year =

Direct Preference Optimization: Your Language Model is Secretly a Reward Model , author =. Advances in Neural Information Processing Systems (NeurIPS) , year =
[72]

arXiv preprint arXiv:2601.20802 , year =

Reinforcement Learning via Self-Distillation , author =. arXiv preprint arXiv:2601.20802 , year =

Pith/arXiv arXiv

[1] [1]

Zheng, Kunhao and Han, Jesse Michael and Polu, Stanislas , booktitle =

[2] [2]

arXiv preprint arXiv:2009.03393 , year =

Generative Language Modeling for Automated Theorem Proving , author =. arXiv preprint arXiv:2009.03393 , year =

Pith/arXiv arXiv 2009

[3] [3]

Yang, Kaiyu and Swope, Aidan M and Gu, Alex and Chalamala, Rahul and Song, Peiyang and Yu, Shixing and Godil, Saad and Prenger, Ryan and Anandkumar, Anima , booktitle =

[4] [4]

Advances in Neural Information Processing Systems (NeurIPS) , year =

Lample, Guillaume and Lachaux, Marie-Anne and Lavril, Thibaut and Martinet, Xavier and Hayat, Amaury and Ebner, Gabriel and Rodriguez, Aur. Advances in Neural Information Processing Systems (NeurIPS) , year =

[5] [5]

and Li, Wenda and Tworkowski, Szymon and Czechowski, Konrad and Odrzyg

Jiang, Albert Q. and Li, Wenda and Tworkowski, Szymon and Czechowski, Konrad and Odrzyg. Advances in Neural Information Processing Systems (NeurIPS) , year =

[6] [6]

Wang, Haiming and Xin, Huajian and Zheng, Chuanyang and Li, Lin and Liu, Zhengying and Cao, Qingxing and Huang, Yinya and Xiong, Jing and Shi, Han and Xie, Enze and Yin, Jian and Li, Zhenguo and Liao, Heng and Liang, Xiaodan , booktitle =

[7] [7]

Wu, Zijian and Huang, Suozhi and Zhou, Zhejian and Ying, Huaiyuan and Wang, Jiayu and Lin, Dahua and Chen, Kai , journal =

[8] [8]

Ren, Z. Z. and Shao, Zhihong and Song, Junxiao and Xin, Huajian and Wang, Haocheng and Zhao, Wanjia and Zhang, Liyue and Fu, Zhe and Zhu, Qihao and Yang, Dejian and Wu, Z. F. and Gou, Zhibin and Ma, Shirong and Tang, Hongxuan and Liu, Yuxuan and Gao, Wenjun and Guo, Daya and Ruan, Chong , journal =

[9] [9]

2024 , howpublished =

2024

[10] [10]

First Conference on Language Modeling (COLM) , year =

An In-Context Learning Agent for Formal Theorem-Proving , author =. First Conference on Language Modeling (COLM) , year =

[11] [11]

Kumarappan, Adarsh and Tiwari, Mo and Song, Peiyang and George, Robert Joseph and Xiao, Chaowei and Anandkumar, Anima , booktitle =

[12] [12]

and Deng, Jia and Biderman, Stella and Welleck, Sean , booktitle =

Azerbayev, Zhangir and Schoelkopf, Hailey and Paster, Keiran and Dos Santos, Marco and McAleer, Stephen and Jiang, Albert Q. and Deng, Jia and Biderman, Stella and Welleck, Sean , booktitle =

[13] [13]

and Radev, Dragomir and Avigad, Jeremy , journal =

Azerbayev, Zhangir and Piotrowski, Bartosz and Schoelkopf, Hailey and Ayers, Edward W. and Radev, Dragomir and Avigad, Jeremy , journal =

[14] [14]

Xin, Huajian and Guo, Daya and Shao, Zhihong and Ren, Z. Z. and Zhu, Qihao and Liu, Bo and Ruan, Chong and Li, Wenda and Liang, Xiaodan , journal =

[15] [15]

Xin, Huajian and Ren, Z. Z. and Song, Junxiao and Shao, Zhihong and Zhao, Wanjia and Wang, Haocheng and Liu, Bo and Zhang, Liyue and Lu, Xuan and Du, Qiushi and Gao, Wenjun and Zhang, Haowei and Zhu, Qihao and Yang, Dejian and Gou, Zhibin and Wu, Z. F. and Luo, Fuli and Ruan, Chong , booktitle =

[16] [16]

Lin, Yong and Tang, Shange and Lyu, Bohan and Wu, Jiayun and Lin, Hongzhou and Yang, Kaiyu and Li, Jia and Xia, Mengzhou and Chen, Danqi and Arora, Sanjeev and Jin, Chi , journal =

[17] [17]

International Conference on Learning Representations (ICLR) , year =

Proof Artifact Co-Training for Theorem Proving with Language Models , author =. International Conference on Learning Representations (ICLR) , year =

[18] [18]

International Conference on Machine Learning (ICML) , year =

Formal Mathematics Statement Curriculum Learning , author =. International Conference on Machine Learning (ICML) , year =

[19] [19]

Advances in Neural Information Processing Systems (NeurIPS) , year =

Autoformalization with Large Language Models , author =. Advances in Neural Information Processing Systems (NeurIPS) , year =

[20] [20]

International Conference on Learning Representations (ICLR) , year =

Draft, Sketch, and Prove: Guiding Formal Theorem Provers with Informal Proofs , author =. International Conference on Learning Representations (ICLR) , year =

[21] [21]

Proceedings of the 12th ACM SIGPLAN International Conference on Certified Programs and Proofs (CPP) , pages =

Limperg, Jannis and From, Asta Halkj. Proceedings of the 12th ACM SIGPLAN International Conference on Certified Programs and Proofs (CPP) , pages =

[22] [22]

de Moura, Leonardo and Ullrich, Sebastian , booktitle =. The. 2021 , publisher =

2021

[23] [23]

Proceedings of the 9th ACM SIGPLAN International Conference on Certified Programs and Proofs (CPP) , pages =

The. Proceedings of the 9th ACM SIGPLAN International Conference on Certified Programs and Proofs (CPP) , pages =

[24] [24]

International Conference on Machine Learning (ICML) , pages =

Learning to Prove Theorems via Interacting with Proof Assistants , author =. International Conference on Machine Learning (ICML) , pages =

[25] [25]

Huang, Daniel and Dhariwal, Prafulla and Song, Dawn and Sutskever, Ilya , booktitle =

[26] [26]

and Rabe, Markus N

Bansal, Kshitij and Loos, Sarah M. and Rabe, Markus N. and Szegedy, Christian and Wilcox, Stewart , booktitle =

[27] [27]

Bandit Based

Kocsis, Levente and Szepesv. Bandit Based. 17th European Conference on Machine Learning (ECML) , pages =

[28] [28]

Machine Learning , volume =

Finite-Time Analysis of the Multiarmed Bandit Problem , author =. Machine Learning , volume =

[29] [29]

A General Reinforcement Learning Algorithm that Masters Chess, Shogi, and

Silver, David and Hubert, Thomas and Schrittwieser, Julian and Antonoglou, Ioannis and Lai, Matthew and Guez, Arthur and Lanctot, Marc and Sifre, Laurent and Kumaran, Dharshan and Graepel, Thore and Lillicrap, Timothy and Simonyan, Karen and Hassabis, Demis , journal =. A General Reinforcement Learning Algorithm that Masters Chess, Shogi, and

[30] [30]

Nature , volume =

Solving Olympiad Geometry Without Human Demonstrations , author =. Nature , volume =

[31] [31]

Chervonyi, Yuri and Trinh, Trieu H. and Ol. Gold-medalist Performance in Solving Olympiad Geometry with. arXiv preprint arXiv:2502.03544 , year =

arXiv

[32] [32]

International Conference on Learning Representations (ICLR) , year =

Let's Verify Step by Step , author =. International Conference on Learning Representations (ICLR) , year =

[33] [33]

arXiv preprint arXiv:2110.14168 , year =

Training Verifiers to Solve Math Word Problems , author =. arXiv preprint arXiv:2110.14168 , year =

Pith/arXiv arXiv

[34] [34]

arXiv preprint arXiv:2211.14275 , year =

Solving Math Word Problems with Process- and Outcome-Based Feedback , author =. arXiv preprint arXiv:2211.14275 , year =

Pith/arXiv arXiv

[35] [35]

Fang, Zihan and Zhang, Yifan and Zhang, Yueke and Leach, Kevin and Huang, Yu , journal =

[36] [36]

Shinn, Noah and Cassano, Federico and Berman, Edward and Gopinath, Ashwin and Narasimhan, Karthik and Yao, Shunyu , booktitle =

[37] [37]

Zhang, Yifan and Li, Jieyu and Pei, Kexin and Huang, Yu and Leach, Kevin , journal =

[38] [38]

Zhang, Yueke and Zhang, Yifan and Leach, Kevin and Huang, Yu , journal =

[39] [39]

International Conference on Learning Representations (ICLR) , year =

Generating Sequences by Learning to Self-Correct , author =. International Conference on Learning Representations (ICLR) , year =

[40] [40]

Advances in Neural Information Processing Systems (NeurIPS) , year =

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models , author =. Advances in Neural Information Processing Systems (NeurIPS) , year =

[41] [41]

International Conference on Learning Representations (ICLR) , year =

Self-Consistency Improves Chain of Thought Reasoning in Language Models , author =. International Conference on Learning Representations (ICLR) , year =

[42] [42]

Advances in Neural Information Processing Systems (NeurIPS) , year =

Tree of Thoughts: Deliberate Problem Solving with Large Language Models , author =. Advances in Neural Information Processing Systems (NeurIPS) , year =

[43] [43]

Yao, Shunyu and Zhao, Jeffrey and Yu, Dian and Du, Nan and Shafran, Izhak and Narasimhan, Karthik and Cao, Yuan , booktitle =

[44] [44]

Advances in Neural Information Processing Systems (NeurIPS) , year =

Large Language Models are Zero-Shot Reasoners , author =. Advances in Neural Information Processing Systems (NeurIPS) , year =

[45] [45]

Measuring Mathematical Problem Solving with the

Hendrycks, Dan and Burns, Collin and Kadavath, Saurav and Arora, Akul and Basart, Steven and Tang, Eric and Song, Dawn and Steinhardt, Jacob , booktitle =. Measuring Mathematical Problem Solving with the

[46] [46]

Advances in Neural Information Processing Systems (NeurIPS) , year =

Solving Quantitative Reasoning Problems with Language Models , author =. Advances in Neural Information Processing Systems (NeurIPS) , year =

[47] [47]

Shao, Zhihong and Wang, Peiyi and Zhu, Qihao and Xu, Runxin and Song, Junxiao and Bi, Xiao and Zhang, Haowei and Zhang, Mingchuan and Li, Y. K. and Wu, Y. and Guo, Daya , journal =

[48] [48]

Guo, Daya and Yang, Dejian and Zhang, Haowei and Song, Junxiao and Zhang, Ruoyu and Xu, Runxin and Zhu, Qihao and Ma, Shirong and Wang, Peiyi and Bi, Xiao and others , journal =

[49] [49]

International Conference on Learning Representations (ICLR) , year =

Hong, Sirui and Zhuge, Mingchen and Chen, Jiaqi and Zheng, Xiawu and Cheng, Yuheng and Zhang, Ceyao and Wang, Jinlin and Wang, Zili and Yau, Steven Ka Shing and Lin, Zijuan and Zhou, Liyang and Ran, Chenyu and Xiao, Lingfeng and Wu, Chenglin and Schmidhuber, J. International Conference on Learning Representations (ICLR) , year =

[50] [50]

and Burger, Doug and Wang, Chi , booktitle =

Wu, Qingyun and Bansal, Gagan and Zhang, Jieyu and Wu, Yiran and Li, Beibin and Zhu, Erkang and Jiang, Li and Zhang, Xiaoyun and Zhang, Shaokun and Liu, Jiale and Awadallah, Ahmed Hassan and White, Ryen W. and Burger, Doug and Wang, Chi , booktitle =

[51] [51]

Zhang, Yifan and Ganapavarapu, Giridhar and Jayaraman, Srideepika and Agrawal, Bhavna and Patel, Dhaval and Fokoue, Achille , booktitle =

[52] [52]

Li, Guohao and Hammoud, Hasan Abed Al Kader and Itani, Hani and Khizbullin, Dmitrii and Ghanem, Bernard , booktitle =

[53] [53]

Conference on Empirical Methods in Natural Language Processing (EMNLP) , year =

Encouraging Divergent Thinking in Large Language Models through Multi-Agent Debate , author =. Conference on Empirical Methods in Natural Language Processing (EMNLP) , year =

[54] [54]

arXiv preprint arXiv:2107.03374 , year =

Evaluating Large Language Models Trained on Code , author =. arXiv preprint arXiv:2107.03374 , year =

Pith/arXiv arXiv

[55] [55]

Competition-Level Code Generation with

Li, Yujia and Choi, David and Chung, Junyoung and Kushman, Nate and Schrittwieser, Julian and Leblond, R. Competition-Level Code Generation with. Science , volume =

[56] [56]

Rozi. Code. arXiv preprint arXiv:2308.12950 , year =

Pith/arXiv arXiv

[57] [57]

Advances in Neural Information Processing Systems (NeurIPS) , year =

Language Models are Few-Shot Learners , author =. Advances in Neural Information Processing Systems (NeurIPS) , year =

[58] [58]

arXiv preprint arXiv:2303.08774 , year =

Pith/arXiv arXiv

[59] [59]

Touvron, Hugo and Martin, Louis and Stone, Kevin and Albert, Peter and Almahairi, Amjad and Babaei, Yasmine and Bashlykov, Nikolay and Batra, Soumya and Bhargava, Prajjwal and Bhosale, Shruti and others , journal =

[60] [60]

Retrieval-Augmented Generation for Knowledge-Intensive

Lewis, Patrick and Perez, Ethan and Piktus, Aleksandra and Petroni, Fabio and Karpukhin, Vladimir and Goyal, Naman and K. Retrieval-Augmented Generation for Knowledge-Intensive. Advances in Neural Information Processing Systems (NeurIPS) , year =

[61] [61]

Conference on Empirical Methods in Natural Language Processing (EMNLP) , year =

Dense Passage Retrieval for Open-Domain Question Answering , author =. Conference on Empirical Methods in Natural Language Processing (EMNLP) , year =

[62] [62]

Irving, Geoffrey and Szegedy, Christian and Alemi, Alexander A. and E. Advances in Neural Information Processing Systems (NeurIPS) , year =

[63] [63]

and Blanchette, Jasmin C

Paulson, Lawrence C. and Blanchette, Jasmin C. , booktitle =. Three Years of Experience with

[64] [64]

Gauthier, Thibault and Kaliszyk, Cezary and Urban, Josef and Kumar, Ramana and Norrish, Michael , journal =

[65] [65]

Advances in Neural Information Processing Systems (NeurIPS) , year =

Reinforcement Learning of Theorem Proving , author =. Advances in Neural Information Processing Systems (NeurIPS) , year =

[66] [66]

Proceedings of the 4th ACM SIGPLAN International Workshop on Machine Learning and Programming Languages (MAPL) , pages =

Generating Correctness Proofs with Neural Networks , author =. Proceedings of the 4th ACM SIGPLAN International Workshop on Machine Learning and Programming Languages (MAPL) , pages =

[67] [67]

and Ringer, Talia and Brun, Yuriy , booktitle =

First, Emily and Rabe, Markus N. and Ringer, Talia and Brun, Yuriy , booktitle =

[68] [68]

Advances in Neural Information Processing Systems (NeurIPS) , year =

Thinking Fast and Slow with Deep Learning and Tree Search , author =. Advances in Neural Information Processing Systems (NeurIPS) , year =

[69] [69]

, booktitle =

Zelikman, Eric and Wu, Yuhuai and Mu, Jesse and Goodman, Noah D. , booktitle =

[70] [70]

Advances in Neural Information Processing Systems (NeurIPS) , year =

Training Language Models to Follow Instructions with Human Feedback , author =. Advances in Neural Information Processing Systems (NeurIPS) , year =

[71] [71]

Advances in Neural Information Processing Systems (NeurIPS) , year =

Direct Preference Optimization: Your Language Model is Secretly a Reward Model , author =. Advances in Neural Information Processing Systems (NeurIPS) , year =

[72] [72]

arXiv preprint arXiv:2601.20802 , year =

Reinforcement Learning via Self-Distillation , author =. arXiv preprint arXiv:2601.20802 , year =

Pith/arXiv arXiv