{"paper":{"title":"LeanSearch v2: Global Premise Retrieval for Lean 4 Theorem Proving","license":"http://creativecommons.org/licenses/by/4.0/","headline":"LeanSearch v2 recovers full premise sets for Lean 4 theorems at 46.1 percent accuracy on research benchmarks.","cross_cats":["cs.AI"],"primary_cat":"cs.IR","authors_text":"Bin Dong, Bryan Dai, Guoxiong Gao, Jiedong Jiang, Jingda Xu, Peihao Wu, Yutong Wang, Zeming Sun","submitted_at":"2026-05-13T08:04:57Z","abstract_excerpt":"Proving theorems in Lean 4 often requires identifying a scattered set of library lemmas whose joint use enables a concise proof -- a task we call global premise retrieval. Existing tools address adjacent problems: semantic search engines find individual declarations matching a query, while premise-selection systems predict useful lemmas one tactic step at a time. Neither recovers the full premise set an entire theorem requires. We present LeanSearch v2, a two-mode retrieval system for this task. Its standard mode applies a hierarchy-informalized Mathlib corpus with an embedding-reranker pipeli"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"On a 69-query benchmark of research-level Mathlib theorems, reasoning mode recovers 46.1% of ground-truth premise groups within 10 retrieved candidates, outperforming strong reasoning retrieval systems (38.0%) and premise-selection baselines (9.3%). In a controlled downstream evaluation with a fixed prover loop, replacing alternative retrievers with LeanSearch v2 yields the highest proof success (20% vs. 16% for the next-best system and 4% without retrieval).","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"The 69-query benchmark and the fixed prover loop are assumed to be representative of real-world Lean 4 usage; the paper does not report how performance changes when the prover loop or theorem distribution is altered.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"LeanSearch v2 recovers 46.1% of ground-truth premise groups for research-level Lean 4 theorems within 10 candidates and raises fixed-loop proof success to 20%.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"LeanSearch v2 recovers full premise sets for Lean 4 theorems at 46.1 percent accuracy on research benchmarks.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"7ccbd682e8d25fc634a54ba7ad8902fa6d39de1eaec45654288aa7295e8b7634"},"source":{"id":"2605.13137","kind":"arxiv","version":2},"verdict":{"id":"503ffee2-451f-4ab5-be83-aeeaeee84261","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-15T02:59:50.734388Z","strongest_claim":"On a 69-query benchmark of research-level Mathlib theorems, reasoning mode recovers 46.1% of ground-truth premise groups within 10 retrieved candidates, outperforming strong reasoning retrieval systems (38.0%) and premise-selection baselines (9.3%). In a controlled downstream evaluation with a fixed prover loop, replacing alternative retrievers with LeanSearch v2 yields the highest proof success (20% vs. 16% for the next-best system and 4% without retrieval).","one_line_summary":"LeanSearch v2 recovers 46.1% of ground-truth premise groups for research-level Lean 4 theorems within 10 candidates and raises fixed-loop proof success to 20%.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"The 69-query benchmark and the fixed prover loop are assumed to be representative of real-world Lean 4 usage; the paper does not report how performance changes when the prover loop or theorem distribution is altered.","pith_extraction_headline":"LeanSearch v2 recovers full premise sets for Lean 4 theorems at 46.1 percent accuracy on research benchmarks."},"references":{"count":32,"sample":[{"doi":"","year":null,"title":"Mm-bright: A multi-task multimodal benchmark for reasoning-intensive retrieval","work_id":"ec60218c-2ad5-4557-9963-cd5455c05e8b","ref_index":1,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"Aristotle: IMO-level Automated Theorem Proving","work_id":"6c61af2f-a34a-4647-9111-6ba5a60f6bc2","ref_index":2,"cited_arxiv_id":"2510.01346","is_internal_anchor":true},{"doi":"","year":null,"title":"Leanexplore: A search engine for lean 4 declarations.CoRR, abs/2506.11085","work_id":"f87043d0-5303-46ff-8552-c8881d23811b","ref_index":3,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"Seed-prover 1.5: Mastering undergraduate-level theorem proving via learning from experience","work_id":"06a2ed7e-52e1-4927-ba09-545d2843a7f4","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities","work_id":"008df105-2fdd-45d8-857a-8e35868aecb6","ref_index":5,"cited_arxiv_id":"2507.06261","is_internal_anchor":true}],"resolved_work":32,"snapshot_sha256":"ff0f1440a49abe711c80c42684d41c3b4d06f63c0ae7254f63908f9230e23b89","internal_anchors":8},"formal_canon":{"evidence_count":1,"snapshot_sha256":"34589dd26d001efef791c96258dd64b9bd3978ce8b0289d68f12335269f7f27d"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"}