{"total":26,"items":[{"citing_arxiv_id":"2605.23772","ref_index":19,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Agentic Proving for Program Verification","primary_cat":"cs.AI","submitted_at":"2026-05-22T15:41:27+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Agentic Claude reaches 98.8% valid specs, 87.5% implementation certification, and 98.1% end-to-end success on CLEVER, revealing a mismatch between benchmark difficulty and current prover performance.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.22763","ref_index":40,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Advancing Mathematics Research with AI-Driven Formal Proof Search","primary_cat":"cs.AI","submitted_at":"2026-05-21T17:24:57+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.22885","ref_index":4,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"ImProver 2: Iteratively Self-Improving LMs for Neurosymbolic Proof Optimization","primary_cat":"cs.AI","submitted_at":"2026-05-21T02:20:26+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ImProver 2 combines a data-efficient expert-iteration pipeline with a neurosymbolic scaffold to train a 7B model that outperforms larger models in Lean 4 proof optimization across structural metrics.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.20531","ref_index":18,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Pseudo-Formalization for Automatic Proof 
Verification","primary_cat":"cs.LO","submitted_at":"2026-05-19T22:08:51+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Pseudo-Formalization decomposes natural language proofs into modular blocks for independent LLM verification via Block Verification, outperforming LLM-as-judge baselines on error detection in olympiad and research math benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.18747","ref_index":89,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Code as Agent Harness","primary_cat":"cs.CL","submitted_at":"2026-05-18T17:59:03+00:00","verdict":"ACCEPT","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"A survey that organizes existing work on LLM-based agents around code as the central harness, structured in three layers of interfaces, mechanisms, and multi-agent scaling, with applications across domains and listed open challenges.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"logical foundations, enabling each derivation step to be checked by a verifier. Early LLM-based theorem-proving systems, including ReProver [84], DeepSeek-Prover [85], and TheoremLlama [86], establish practical recipes for combining language models with proof-assistant feedback in mathematical reasoning. More recent systems, such as DeepSeek-Prover-V2 [87], Kimina-Prover [88], MA-LoT [76], and Goedel-Prover-V2 [89], improve this process through deliberative proof search, self-correction, and repeated proof generation and verification. Formal verification interfaces are also expanding beyond theorem proving in mathematics. 
HybridReasoning [90] applies formal provers to support natural-language reasoning; Lean4Physics [91] and PhysLib [92] extend Lean-based verification to physics; and VERINA [93] and Goedel-Code-Prover [94]"},{"citing_arxiv_id":"2605.17778","ref_index":4,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Self-Distillation is Optimal Among Spectral Shrinkage Estimators in Spiked Covariance Models","primary_cat":"math.ST","submitted_at":"2026-05-18T02:56:57+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"s-step self-distillation is optimal among spectral shrinkage estimators for s-spiked covariance matrices and necessary for optimality.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.17283","ref_index":172,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"OProver: A Unified Framework for Agentic Formal Theorem Proving","primary_cat":"cs.CL","submitted_at":"2026-05-17T06:39:05+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"OProver-32B achieves top Pass@32 scores on MiniF2F, ProverBench, and PutnamBench by combining continued pretraining with iterative agentic proving, retrieval, SFT on repairs, and RL on unresolved cases using a 6.86M-proof dataset.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.17255","ref_index":20,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"CAM-Bench: A Benchmark for Computational and Applied Mathematics in Lean","primary_cat":"cs.AI","submitted_at":"2026-05-17T04:53:47+00:00","verdict":"ACCEPT","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"CAM-Bench is a new Lean 4 theorem-proving benchmark of 1,000 problems in computational and applied 
mathematics, built from textbook exercises using a dependency-recovery pipeline to reconstruct local context.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.14061","ref_index":18,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"MathAtlas: A Benchmark for Autoformalization in the Wild","primary_cat":"cs.AI","submitted_at":"2026-05-13T19:35:46+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":8.0,"formal_verification":"none","one_line_summary":"MathAtlas is the first large-scale benchmark for autoformalizing graduate mathematics, where even strong models reach only 9.8% correctness on theorem statements and drop to 2.6% on the hardest dependency-deep subset.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.11905","ref_index":5,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Rethinking Supervision Granularity: Segment-Level Learning for LLM-Based Theorem Proving","primary_cat":"cs.AI","submitted_at":"2026-05-12T10:18:28+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Segment-level supervision extracts coherent proof segments to train policy models that achieve 61-66% success on miniF2F, outperforming step-level and whole-proof methods while also improving existing provers.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.10379","ref_index":33,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Not All Proofs Are Equal: Evaluating LLM Proof Quality Beyond Correctness","primary_cat":"cs.CL","submitted_at":"2026-05-11T11:23:36+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"LLM proofs for hard math problems show large 
differences in quality metrics like conciseness and cognitive simplicity that correctness-only tests miss, along with trade-offs between quality and correctness.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"[32] Yong Lin, Shange Tang, Bohan Lyu, Ziran Yang, Jui-Hui Chung, Haoyu Zhao, Lai Jiang, Yihan Geng, Jiawei Ge, Jingruo Sun, Jiayun Wu, Jiri Gesi, Ximing Lu, David Acuna, Kaiyu Yang, Hongzhou Lin, Yejin Choi, Danqi Chen, Sanjeev Arora, and Chi Jin. Goedel-prover-v2: Scaling formal theorem proving with scaffolded data synthesis and self-correction, 2025. URL https://arxiv.org/abs/2508.03613. [33] George Tsoukalas, Jasper Lee, John Jennings, Jimmy Xin, Michelle Ding, Michael Jennings, Amitayush Thakur, and Swarat Chaudhuri. Putnambench: Evaluating neural theorem-provers on the putnam mathematical competition. In Amir Globersons, Lester Mackey, Danielle Belgrave, Angela Fan, Ulrich Paquet, Jakub M. Tomczak, and Cheng Zhang, editors, Advances in Neural Information Processing Systems 38: Annual Conference on Neu-"},{"citing_arxiv_id":"2605.08678","ref_index":53,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"MLS-Bench: A Holistic and Rigorous Assessment of AI Systems on Building Better AI","primary_cat":"cs.LG","submitted_at":"2026-05-09T04:29:46+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"[Figure 2: MLS-Bench-Lite Performance across 15 models; axis: MLS-Bench-Lite score (%).] 
chitectures [53, 118], training procedures [1, 25, 36], and data and loss design [22, 27]. More recently, LLMs have accelerated automated discovery across a broad spectrum: serving as collaborative scien- tific partners [28, 82], optimizing specific algorithms and computational components [67, 70, 79], and driving fully autonomous research [56, 91, 110]. A growing set of benchmarks evaluates these"},{"citing_arxiv_id":"2605.06651","ref_index":27,"ref_count":2,"confidence":0.9,"is_internal_anchor":true,"paper_title":"AI co-mathematician: Accelerating mathematicians with agentic AI","primary_cat":"cs.AI","submitted_at":"2026-05-07T17:56:32+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"An interactive AI workbench for mathematicians achieves 48% on FrontierMath Tier 4 and helped solve open problems in early tests.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.06110","ref_index":5,"ref_count":2,"confidence":0.9,"is_internal_anchor":true,"paper_title":"On Time, Within Budget: Constraint-Driven Online Resource Allocation for Agentic Workflows","primary_cat":"cs.AI","submitted_at":"2026-05-07T12:24:21+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"MCPP uses Monte Carlo simulations of workflow executions to dynamically allocate resources and replan, raising constrained completion probability over baselines on CodeFlow and ProofFlow.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.00677","ref_index":12,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Evaluating the Architectural Reasoning Capabilities of LLM Provers via the Obfuscated Natural Number 
Game","primary_cat":"cs.LG","submitted_at":"2026-05-01T14:03:05+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"The Obfuscated Natural Number Game shows reasoning LLMs keep proof accuracy without semantic cues while general models degrade, establishing a metric for architectural reasoning in alien math domains.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.23712","ref_index":15,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"OptProver: Bridging Olympiad and Optimization through Continual Training in Formal Theorem Proving","primary_cat":"cs.LG","submitted_at":"2026-04-26T13:54:28+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"OptProver transfers formal theorem proving from Olympiad math to optimization via continual training, achieving SOTA Pass@1 and Pass@32 on a new Lean 4 benchmark while retaining general performance.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.22519","ref_index":11,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Ablation and the Meno: Tools for Empirical Metamathematics","primary_cat":"cs.LO","submitted_at":"2026-04-24T13:00:58+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Meno and tactic ablation on Tao's Analysis I generate proof populations that embed on low one- or two-dimensional submanifolds far from human constructions in Goedel Prover space.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.20209","ref_index":20,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Scaling Self-Play with 
Self-Guidance","primary_cat":"cs.LG","submitted_at":"2026-04-22T05:50:23+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SGS adds self-guidance to LLM self-play for Lean4 theorem proving, surpassing RL baselines and enabling a 7B model to outperform a 671B model after 200 rounds.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.19558","ref_index":13,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"On Reasoning-Centric LLM-based Automated Theorem Proving","primary_cat":"cs.SE","submitted_at":"2026-04-21T15:11:42+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"ReCent-Prover achieves a 22.58% relative improvement over prior state-of-the-art in proved theorems on the CoqStoq benchmark by using reasoning-centric techniques under a fixed LLM invocation budget.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.18050","ref_index":10,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"The Topological Dual of a Dataset: A Logic-to-Topology Encoding for AlphaGeometry-Style Data","primary_cat":"cs.AI","submitted_at":"2026-04-20T10:18:08+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.02909","ref_index":15,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Delay, Plateau, or Collapse: Evaluating the Impact of Systematic Verification Error on RLVR","primary_cat":"cs.LG","submitted_at":"2026-04-06T15:02:52+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Systematic 
false positives in verifiers can cause RLVR training to reach suboptimal plateaus or collapse, with outcomes driven by error patterns rather than overall error rate.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"[Figure 8: Results on length-based FP. The legend indicates the interval of the output completion within which the verifier gives a FP. Panels: (a) OLMo, (b) Qwen; axes: Step, Oracle Reward, Global FPR.] [Figure 9: Results on Qwen2.5-1.5B-Instruct. Panels: (a) FNR, (b) FPR; axes: Step, Oracle Reward, Global FNR, Global FPR.] B More Results B.1 Length-based False Positives Some systematic FPs are inherently more difficult to learn to exploit than others. In such cases, the FPR can remain relatively low throughout training, and the effect of the noise can be more similar to a delaying one than a plateauing one. Length-based FP, where the verifier introduces FPs if the output falls within a certain length interval, is one such example. In Figure 8, we present the results with length-based FP. Most settings show a delayed training dynamic, with the FPR staying relatively low, below 0.3, throughout training. Some settings, however, exhibit a more plateau-like behavior, with the FPR rising to around 0.5. This tends to occur for larger intervals, which naturally makes the hack easier for the model to exploit. Additionally, some lengths are easier to exploit than others. 
An FP interval of around 300 consistently produces higher FPR and, therefore, more severe plateauing than the other intervals. This is likely related to the fact that the average output length under clean-verifier training converges to around 300, making it easier for the model to hack the verifier when the FP i"},{"citing_arxiv_id":"2604.03071","ref_index":12,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Automatic Textbook Formalization","primary_cat":"cs.AI","submitted_at":"2026-04-03T14:51:01+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Multi-agent AI system formalizes entire 500-page graduate algebraic combinatorics textbook into Lean, creating 130K lines of code in one week at human-expert cost.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Multi-agent scaffolds require a solution to the coordination problem: How can a large swarm of agents be organized to make consistent progress on a shared project? Initial explorations in the realm of software engineering have produced remarkably large code bases, but also shown the coherence issues that arise when agents are insufficiently orchestrated (Lin, 2026; Carlini, 2026). In this study, we propose a simple multi-agent scaffold that largely resolves these limitations by relying on battle-tested standard practices inherited from human collaborative software engineering: A) Large-scale parallelization via sub-agents with well-defined task assignments. 
B) Version control using git with trunk-based development on short-lived feature branches."},{"citing_arxiv_id":"2602.24273","ref_index":22,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"A Minimal Agent for Automated Theorem Proving","primary_cat":"cs.AI","submitted_at":"2026-02-27T18:43:47+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A minimal agentic system achieves competitive performance in automated theorem proving with a simpler design and lower cost than state-of-the-art methods.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2601.03715","ref_index":2,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"R$^3$L: Reflect-then-Retry Reinforcement Learning with Language-Guided Exploration, Pivotal Credit, and Positive Amplification","primary_cat":"cs.LG","submitted_at":"2026-01-07T09:04:52+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"R³L combines reflect-then-retry exploration, pivotal credit assignment, and positive amplification in RL for LLMs, reporting 5-52% relative gains on agentic and reasoning tasks with stable training.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2510.12787","ref_index":39,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Ax-Prover: A Deep Reasoning Agentic Framework for Theorem Proving in Mathematics and Quantum Physics","primary_cat":"cs.AI","submitted_at":"2025-10-14T17:57:04+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Ax-Prover is a tool-using multi-agent LLM system that matches state-of-the-art provers on public math benchmarks and outperforms them on new abstract-algebra and quantum-theory 
benchmarks while also assisting an expert with a cryptography proof.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2509.14274","ref_index":8,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Discovering New Theorems via LLMs with In-Context Proof Learning in Lean","primary_cat":"cs.LG","submitted_at":"2025-09-16T06:48:11+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LLMs in a conjecturing-proving loop that conditions on their own prior verified Lean proofs discover more hard-to-prove theorems than baselines that generate statements and proofs together.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}