Recognition: 1 theorem link
LeanSearch v2: Global Premise Retrieval for Lean 4 Theorem Proving
Pith reviewed 2026-05-15 02:59 UTC · model grok-4.3
The pith
LeanSearch v2's reasoning mode recovers 46.1% of ground-truth premise groups within 10 retrieved candidates on a 69-query benchmark of research-level Mathlib theorems.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
LeanSearch v2's reasoning mode recovers 46.1% of ground-truth premise groups within 10 retrieved candidates on a 69-query benchmark of research-level Mathlib theorems, outperforming strong reasoning retrieval systems (38.0%) and premise-selection baselines (9.3%). In a controlled downstream evaluation with a fixed prover loop, replacing alternative retrievers with LeanSearch v2 yields the highest proof success (20% vs. 16% for the next-best system and 4% without retrieval), confirming that retrieval quality propagates to proof generation.
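For concreteness, one plausible reading of the premise-group recovery metric (an assumption on our part; the paper defines the exact coverage rule): a query's ground-truth premise group counts as recovered when every premise in the group appears among the top-k retrieved declarations. A minimal sketch:

```python
# Hypothetical reading of "recovers X% of ground-truth premise groups within
# 10 retrieved candidates": a group counts as covered when all of its members
# appear in the top-k list. The lemma names below are illustrative only.

def group_coverage(retrieved_per_query, gold_groups, k=10):
    """Fraction of queries whose full premise group sits in the top-k hits."""
    covered = sum(
        1 for hits, group in zip(retrieved_per_query, gold_groups)
        if set(group) <= set(hits[:k])
    )
    return covered / len(gold_groups)

gold = [["mul_comm", "mul_one"], ["add_assoc"]]
hits = [["mul_one", "zero_add", "mul_comm"], ["mul_left_cancel"]]
print(group_coverage(hits, gold))  # 0.5: first group fully retrieved, second missed
```

Under this reading, the reported 46.1% would mean roughly 32 of the 69 benchmark queries have their full premise group inside the top 10.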
What carries the argument
The reasoning mode's iterative sketch-retrieve-reflect cycles, which operate on top of a standard embedding-reranker pipeline applied to a hierarchy-informalized Mathlib corpus.
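As a toy illustration of that cycle (the word-overlap "retriever", the three-lemma corpus, and the reflection rule are stand-ins we invented for this sketch; the paper's pipeline uses an LLM sketcher on top of an embedding-reranker over informalized Mathlib):

```python
# Minimal, runnable sketch of an iterative sketch-retrieve-reflect loop.
# All components are toy stand-ins, not the authors' implementation.

def score(query_words, doc):
    # toy relevance: word overlap between the query and a lemma description
    return len(query_words & set(doc.split()))

def sketch_retrieve_reflect(theorem, corpus, rounds=3, k=2):
    premises, query = [], set(theorem.split())
    for _ in range(rounds):
        # "retrieve": rank corpus lemmas against the current query
        ranked = sorted(corpus, key=lambda name: -score(query, corpus[name]))
        for name in ranked[:k]:
            if name not in premises:
                premises.append(name)
        # "reflect": which parts of the statement do the premises not cover yet?
        covered = set().union(*(set(corpus[p].split()) for p in premises))
        gaps = set(theorem.split()) - covered
        if not gaps:
            break
        query = gaps  # refine the next round's query toward the missing pieces
    return premises

corpus = {
    "mul_comm": "multiplication is commutative",
    "add_assoc": "addition is associative",
    "mul_one": "multiplication by one is identity",
}
print(sketch_retrieve_reflect("multiplication by one is commutative", corpus))
```

Each round retrieves against the current query, then "reflects" by re-querying on whatever the accumulated premises do not yet cover, mirroring how the reasoning mode builds up a global premise set rather than answering one query once.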
Load-bearing premise
The 69-query benchmark and the fixed prover loop are representative of real-world Lean 4 theorem proving tasks.
What would settle it
A larger or differently distributed theorem set, or the same theorems evaluated inside a modified prover loop, that shows no proof-success gain over the next-best retriever would falsify the claim that the measured retrieval improvement drives the reported proof gains.
Original abstract
Proving theorems in Lean 4 often requires identifying a scattered set of library lemmas whose joint use enables a concise proof -- a task we call global premise retrieval. Existing tools address adjacent problems: semantic search engines find individual declarations matching a query, while premise-selection systems predict useful lemmas one tactic step at a time. Neither recovers the full premise set an entire theorem requires. We present LeanSearch v2, a two-mode retrieval system for this task. Its standard mode applies a hierarchy-informalized Mathlib corpus with an embedding-reranker pipeline, achieving state-of-the-art single-query retrieval without domain-specific fine-tuning (nDCG@10 of 0.62 vs. 0.53 for the next-best system). Its reasoning mode builds on standard mode as its retrieval substrate, targeting global premise retrieval through iterative sketch-retrieve-reflect cycles. On a 69-query benchmark of research-level Mathlib theorems, reasoning mode recovers 46.1% of ground-truth premise groups within 10 retrieved candidates, outperforming strong reasoning retrieval systems (38.0%) and premise-selection baselines (9.3%) on the same benchmark. In a controlled downstream evaluation with a fixed prover loop, replacing alternative retrievers with LeanSearch v2 yields the highest proof success (20% vs. 16% for the next-best system and 4% without retrieval), confirming that retrieval quality propagates to proof generation. We have open-sourced all code, data, and benchmarks. Code and data: https://github.com/frenzymath/LeanSearch-v2 . The standard mode is publicly available with API access at https://leansearch.net/ .
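For readers less familiar with the retrieval metric in the abstract, nDCG@10 is the discounted cumulative gain of the top-10 results normalized by that of the ideal ordering; a standard-definition sketch (not the authors' evaluation code):

```python
# Standard nDCG@k: position-discounted gains of the top-k ranked results,
# normalized by the best achievable ordering of the same relevance labels.
import math

def dcg(rels):
    return sum(r / math.log2(i + 2) for i, r in enumerate(rels))

def ndcg_at_k(ranked_rels, k=10):
    ideal = sorted(ranked_rels, reverse=True)
    denom = dcg(ideal[:k])
    return dcg(ranked_rels[:k]) / denom if denom > 0 else 0.0

# toy example: binary relevance of the top-10 retrieved declarations
print(round(ndcg_at_k([1, 0, 1, 1, 0, 0, 0, 1, 0, 0]), 3))  # 0.877
```

A score of 0.62, as reported for standard mode, means the ranking captures much of the available relevance mass but leaves clear headroom against a perfect ordering.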
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces LeanSearch v2, a two-mode retrieval system for global premise retrieval in Lean 4. Its standard mode uses an embedding-reranker pipeline on a hierarchy-informalized Mathlib corpus to achieve nDCG@10 of 0.62. Its reasoning mode applies iterative sketch-retrieve-reflect cycles on top of the standard mode, recovering 46.1% of ground-truth premise groups within 10 candidates on a 69-query benchmark of research-level theorems (outperforming 38.0% for strong reasoning retrieval systems and 9.3% for premise-selection baselines). In a controlled downstream evaluation with a fixed prover loop, LeanSearch v2 yields 20% proof success versus 16% for the next-best retriever and 4% without retrieval. All code, data, and benchmarks are open-sourced, with a public API available.
Significance. If the empirical results hold, the work makes a clear contribution to automated theorem proving by directly targeting global premise sets rather than isolated lemmas or per-tactic predictions. The reported lift from retrieval quality to proof success (20% vs. 16%) demonstrates practical impact, and the open-sourced code plus public API lower barriers for adoption and extension in the Lean community. The two-mode design offers a reusable substrate for both single-query and iterative reasoning retrieval.
major comments (2)
- The 69-query benchmark evaluation reports 46.1% premise recovery and 20% proof success without variance estimates, statistical significance tests, or sensitivity analysis to query selection or prover-loop parameters. This is load-bearing for the superiority claims over the 38.0% and 16% baselines, as the 8.1-point retrieval margin and 4-point proof-success margin could reflect benchmark construction rather than method robustness.
- The downstream evaluation uses a single fixed prover loop; it is unclear how the 20% success rate changes under alternative loop configurations or theorem distributions, which directly affects the claim that retrieval quality propagates to proof generation.
minor comments (2)
- The abstract introduces the term 'hierarchy-informalized Mathlib corpus' without a brief gloss or pointer to the section that defines the informalization process.
- In the evaluation sections, ensure every baseline system is accompanied by a concise description of its original implementation and any adaptations made for the Lean 4 setting.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and positive assessment of the work's contribution to global premise retrieval. We address each major comment below and commit to revisions that strengthen the empirical evaluation.
Point-by-point responses
-
Referee: The 69-query benchmark evaluation reports 46.1% premise recovery and 20% proof success without variance estimates, statistical significance tests, or sensitivity analysis to query selection or prover-loop parameters. This is load-bearing for the superiority claims over the 38.0% and 16% baselines, as the 8.1-point retrieval margin and 4-point proof-success margin could reflect benchmark construction rather than method robustness.
Authors: We agree that the absence of variance estimates and statistical tests weakens the robustness claims. In the revised manuscript we will add bootstrap-derived 95% confidence intervals for both the 46.1% premise-recovery and 20% proof-success figures, together with paired significance tests against the reported baselines. We will also include a sensitivity analysis that subsamples the 69-query set and varies prover-loop hyperparameters to show that the observed margins are not artifacts of the particular benchmark construction. revision: yes
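A minimal sketch of the committed bootstrap interval (the 0/1 outcome vector below is synthetic, chosen only so its mean approximates the reported 46.1% on 69 queries; the authors' actual per-query outcomes would replace it):

```python
# Percentile bootstrap for a per-query success rate: resample the 69 query
# outcomes with replacement and read off the 2.5th/97.5th percentiles.
import random

def bootstrap_ci(outcomes, n_boot=10_000, seed=0):
    rng = random.Random(seed)
    n = len(outcomes)
    means = sorted(
        sum(rng.choice(outcomes) for _ in range(n)) / n
        for _ in range(n_boot)
    )
    return means[int(0.025 * n_boot)], means[int(0.975 * n_boot)]

outcomes = [1] * 32 + [0] * 37  # synthetic: 32/69 ~ 46.4% groups recovered
lo, hi = bootstrap_ci(outcomes)
print(f"95% CI: [{lo:.2f}, {hi:.2f}]")  # roughly [0.35, 0.58]
```

An interval this wide on 69 queries is exactly why the referee's request matters: an 8.1-point margin can sit well inside the sampling noise of a benchmark this size, so the promised paired tests are the right follow-up.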
-
Referee: The downstream evaluation uses a single fixed prover loop; it is unclear how the 20% success rate changes under alternative loop configurations or theorem distributions, which directly affects the claim that retrieval quality propagates to proof generation.
Authors: The fixed prover loop was selected to isolate the contribution of retrieval quality under identical search and tactic-generation conditions. We acknowledge that broader exploration of loop variants and theorem distributions would further support generalizability. In the revision we will add an explicit discussion of this design choice and, where compute permits, report results on one additional loop configuration to illustrate that the relative ordering among retrievers remains stable. revision: partial
Circularity Check
Empirical retrieval metrics and downstream prover success are direct measurements, not reductions to fitted inputs or self-citations
Full rationale
The paper reports nDCG@10, premise-group recovery rates (46.1%), and proof-success percentages (20%) from a fixed 69-query benchmark and a controlled prover loop. These are experimental outcomes obtained by running the described retrieval pipeline and prover on held-out theorems; they are not obtained by solving equations whose right-hand sides are defined from the left-hand sides or from parameters fitted inside the same derivation. The standard mode uses an embedding-reranker pipeline on a hierarchy-informalized corpus, and the reasoning mode iterates sketch-retrieve-reflect on top of it; neither step is shown to presuppose the reported numbers. Self-citations to prior LeanSearch work are present but not load-bearing for the central empirical claims, which remain independently verifiable via the open-sourced code and benchmarks. No uniqueness theorems, ansatzes, or renamings of known results are invoked to force the outcomes.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/* (AbsoluteFloorClosure, Cost, DimensionForcing, AlexanderDuality): reality_from_one_distinction; washburn_uniqueness_aczel (tag: unclear)
unclear: the relation between the paper passage and the cited Recognition theorem is ambiguous.
Paper passage: "global premise retrieval... embedding-reranker pipeline... iterative sketch-retrieve-reflect cycles"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Mm-bright: A multi-task multimodal benchmark for reasoning-intensive retrieval
Abdelrahman Abdallah, Mohamed Darwish Mounis, Mahmoud Abdalla, Mahmoud SalahEldin Kasem, Mostafa Farouk Senussi, Mohamed Mahmoud, Mohammed Ali, Adam Jatowt, and Hyun-Soo Kang. Mm-bright: A multi-task multimodal benchmark for reasoning-intensive retrieval. arXiv preprint arXiv:2601.09562.
-
[2]
Aristotle: IMO-level Automated Theorem Proving
Tudor Achim, Alex Best, Alberto Bietti, Kevin Der, Mathïs Fédérico, Sergei Gukov, Daniel Halpern-Leistner, Kirsten Henningsgard, Yury Kudryashov, Alexander Meiburg, et al. Aristotle: IMO-level automated theorem proving. arXiv preprint arXiv:2510.01346.
-
[3]
LeanExplore: A search engine for Lean 4 declarations
Justin Asher. LeanExplore: A search engine for Lean 4 declarations. arXiv preprint arXiv:2506.11085.
-
[4]
Jiangjie Chen, Wenxiang Chen, Jiacheng Du, Jinyi Hu, Zhicheng Jiang, Allan Jie, Xiaoran Jin, Xing Jin, Chenggang Li, Wenlei Shi, et al. Seed-Prover 1.5: Mastering undergraduate-level theorem proving via learning from experience. arXiv preprint arXiv:2512.17260.
-
[5]
Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261.
-
[6]
Rader: Reasoning-aware dense retrieval models
Debrup Das, Sam O'Nuallain, and Razieh Rahimi. RaDeR: Reasoning-aware dense retrieval models. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 19981–20008, 2025.
-
[7]
A semantic search engine for mathlib4
Guoxiong Gao, Haocheng Ju, Jiedong Jiang, Zihan Qin, and Bin Dong. A semantic search engine for mathlib4. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 8001–8013, 2024a. Guoxiong Gao, Yutong Wang, Jiedong Jiang, Qi Gao, Zihan Qin, Tianyi Xu, and Bin Dong. Herald: A natural language annotated Lean 4 dataset. arXiv preprint a...
-
[8]
FATE: A formal benchmark series for frontier algebra of multiple difficulty levels
Jiedong Jiang, Wanyi He, Yuefeng Wang, Guoxiong Gao, Yongle Hu, Jingting Wang, Nailin Guan, Peihao Wu, Chunbo Dai, Liang Xiao, et al. FATE: A formal benchmark series for frontier algebra of multiple difficulty levels. arXiv preprint arXiv:2511.02872.
-
[9]
MIRB: Mathematical information retrieval benchmark
Haocheng Ju and Bin Dong. MIRB: Mathematical information retrieval benchmark. arXiv preprint arXiv:2505.15585.
-
[10]
Automated Conjecture Resolution with Formal Verification
Haocheng Ju, Guoxiong Gao, Jiedong Jiang, Bin Wu, Zeming Sun, Leheng Chen, Yutong Wang, Yuefeng Wang, Zichen Wang, Wanyi He, et al. Automated conjecture resolution with formal verification. arXiv preprint arXiv:2604.03789.
-
[11]
Julian Killingback and Hamed Zamani. Benchmarking information retrieval models on complex retrieval tasks. arXiv preprint arXiv:2509.07253.
-
[12]
Junqi Liu, Zihao Zhou, Zekai Zhu, Marco Dos Santos, Weikun He, Jiawei Liu, Ran Wang, Yunzhou Xie, Junqiao Zhao, Qiufeng Wang, et al. Numina-Lean-Agent: An open and general agentic reasoning system for formal mathematics. arXiv preprint arXiv:2601.14027.
-
[13]
GroupRank: A Groupwise Paradigm for Effective and Efficient Passage Reranking with LLMs
Meixiu Long, Duolin Sun, Dan Yang, Yihan Jiao, Lei Liu, Jiahai Wang, BinBin Hu, Yue Shen, Jie Feng, Zhehao Tan, et al. GroupRank: A groupwise paradigm for effective and efficient passage reranking with LLMs. arXiv preprint arXiv:2511.11653, 2025a. Meixiu Long, Duolin Sun, Dan Yang, Junjie Wang, Yecheng Luo, Yue Shen, Jian Wang, Hualei Zhou, Chunxiao Guo, Peng Wei, et al...
-
[14]
MTEB: Massive text embedding benchmark
Niklas Muennighoff, Nouamane Tazi, Loïc Magne, and Nils Reimers. MTEB: Massive text embedding benchmark. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, pages 2014–2037, 2023.
-
[15]
Job Petrovčič, David Eliecer Narvaez Denis, and Ljupčo Todorovski. Combining textual and structural information for premise selection in Lean. arXiv preprint arXiv:2510.23637.
-
[16]
ZZ Ren, Zhihong Shao, Junxiao Song, Huajian Xin, Haocheng Wang, Wanjia Zhao, Liyue Zhang, Zhe Fu, Qihao Zhu, Dejian Yang, et al. DeepSeek-Prover-V2: Advancing formal mathematical reasoning via reinforcement learning for subgoal decomposition. arXiv preprint arXiv:2504.21801.
-
[17]
ReasonIR: Training retrievers for reasoning tasks
Rulin Shao, Rui Qiao, Varsha Kishore, Niklas Muennighoff, Xi Victoria Lin, Daniela Rus, Bryan Kian Hsiang Low, Sewon Min, Wen-tau Yih, Pang Wei Koh, et al. ReasonIR: Training retrievers for reasoning tasks. arXiv preprint arXiv:2504.20595.
-
[18]
REAL-Prover: Retrieval Augmented Lean Prover for Mathematical Reasoning
Ziju Shen, Naohao Huang, Fanyi Yang, Yutong Wang, Guoxiong Gao, Tianyi Xu, Jiedong Jiang, Wanyi He, Pu Yang, Mengzhou Sun, et al. REAL-Prover: Retrieval augmented Lean prover for mathematical reasoning. arXiv preprint arXiv:2505.20613.
-
[19]
Hongjin Su, Howard Yen, Mengzhou Xia, Weijia Shi, Niklas Muennighoff, Han-yu Wang, Haisu Liu, Quan Shi, Zachary S Siegel, Michael Tang, et al. BRIGHT: A realistic and challenging benchmark for reasoning-intensive retrieval. arXiv preprint arXiv:2407.12883.
-
[20]
Yicheng Tao, Haotian Liu, Shanwen Wang, and Hongteng Xu. Learning an effective premise retrieval model for efficient mathematical formalization. arXiv preprint arXiv:2501.13959.
-
[21]
Kimi K2: Open Agentic Intelligence
Kimi Team, Yifan Bai, Yiping Bao, Y Charles, Cheng Chen, Guanduo Chen, Haiting Chen, Huarong Chen, Jiahao Chen, Ningxin Chen, et al. Kimi K2: Open agentic intelligence. arXiv preprint arXiv:2507.20534.
-
[22]
The mathlib Community. The Lean mathematical library. In Proceedings of the 9th ACM SIGPLAN International Conference on Certified Programs and Proofs (CPP 2020), pages 367–381. ACM, 2020. doi: 10.1145/3372885.3373824.
-
[23]
NaturalProofs: Mathematical theorem proving in natural language
Sean Welleck, Jiacheng Liu, Ronan Le Bras, Hannaneh Hajishirzi, Yejin Choi, and Kyunghyun Cho. NaturalProofs: Mathematical theorem proving in natural language. arXiv preprint arXiv:2104.01112.
-
[24]
Rank1: Test-time compute for reranking in information retrieval
Orion Weller, Kathryn Ricci, Eugene Yang, Andrew Yates, Dawn Lawrie, and Benjamin Van Durme. Rank1: Test-time compute for reranking in information retrieval. arXiv preprint arXiv:2502.18418.
-
[25]
An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388.
-
[26]
Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models
Yanzhao Zhang, Mingxin Li, Dingkun Long, Xin Zhang, Huan Lin, Baosong Yang, Pengjun Xie, An Yang, Dayiheng Liu, Junyang Lin, et al. Qwen3 Embedding: Advancing text embedding and reranking through foundation models. arXiv preprint arXiv:2506.05176.
-
[27]
Premise Selection for a Lean Hammer
Thomas Zhu, Joshua Clune, Jeremy Avigad, Albert Qiaochu Jiang, and Sean Welleck. Premise selection for a Lean hammer. arXiv preprint arXiv:2506.07477.
-
[28]
Appendix A, Benchmark Curation: presents the curation methodology, key characteristics, and illustrative examples of the two benchmarks, MathlibQR for classical formal theorem search and MathlibMPR for global premise retrieval (internal anchor; excerpt truncated).
-
[29]
Table 5: Mean rank conditioned on the system's randomized label position. The permutation procedure assigns each system to each of the four label positions approximately equally often (574–640 assignments per cell on 2430 judgments; uniform expectation 607.5), so position bias averages out at the system level and the reported overall mean ranks are debiased (internal anchor).
-
[30]
Table 10: MathlibQR by difficulty (half the queries Easy, half Hard), reporting nDCG@10 and Recall@10 for LeanExplore (LE), LeanFinder (LF), and LeanSearch v2 retriever-only (LSv2(r)) and with reranking (LSv2(rr)) (internal anchor; excerpt truncated).
-
[31]
Coverage table for reasoning retrieval with informal+formal queries, comparing INF-X-Retriever (with and without a query rewriter), ReasonIR, and DIVER at cutoffs 5–50 under "main only" and "main∨alt-all" coverage, best per column in bold (internal anchor; excerpt truncated).
-
[32]
Table 16: Prover-backbone ablation on FATE-H. Each cell counts solved problems out of 100 (percentages in parentheses); the Sonnet column reproduces the L0 / L1 / reasoning-mode rows of Table 3, and the Kimi column re-runs the same loops with Kimi K2 Instruct in the prover role only (internal anchor; excerpt truncated).
discussion (0)