{"total":15,"items":[{"citing_arxiv_id":"2605.30914","ref_index":56,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Automating Formal Verification with Reinforcement Learning and Recursive Inference","primary_cat":"cs.LG","submitted_at":"2026-05-29T06:59:28+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"RLVR training raises verified Dafny pass rates from 9.7% to 31.1% on a filtered benchmark while a Lean proof scaffold lifts success from 46.2% to 69.2% on a pilot set and solves 7 of 42 prior unsolved tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.23643","ref_index":45,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Less Effort, Shorter Proofs: Reinforcement Learning for Security Protocol Analysis in Tamarin","primary_cat":"cs.CR","submitted_at":"2026-05-22T13:55:11+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"An RL-guided MCTS proof search for Tamarin finds more and shorter proofs than standard search across 16 protocol models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.20244","ref_index":45,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Lean Refactor: Multi-Objective Controllable Proof Optimization via Agentic Strategy Search","primary_cat":"cs.LO","submitted_at":"2026-05-18T04:19:41+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Lean Refactor uses retrieval from a curated multi-objective strategy database to guide frozen LLMs in refactoring Lean proofs, reporting over 70% token compression on benchmarks and improved version transfer.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.17283","ref_index":151,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"OProver: A Unified Framework for Agentic Formal Theorem Proving","primary_cat":"cs.CL","submitted_at":"2026-05-17T06:39:05+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"OProver-32B achieves top Pass@32 scores on MiniF2F, ProverBench, and PutnamBench by combining continued pretraining with iterative agentic proving, retrieval, SFT on repairs, and RL on unresolved cases using a 6.86M-proof dataset.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.17255","ref_index":39,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"CAM-Bench: A Benchmark for Computational and Applied Mathematics in Lean","primary_cat":"cs.AI","submitted_at":"2026-05-17T04:53:47+00:00","verdict":"ACCEPT","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"CAM-Bench is a new Lean 4 theorem-proving benchmark of 1,000 problems in computational and applied mathematics, built from textbook exercises using a dependency-recovery pipeline to reconstruct local context.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.11905","ref_index":8,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Rethinking Supervision Granularity: Segment-Level Learning for LLM-Based Theorem Proving","primary_cat":"cs.AI","submitted_at":"2026-05-12T10:18:28+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Segment-level supervision extracts coherent proof segments to train policy models that achieve 61-66% success on miniF2F, outperforming step-level and whole-proof methods while also improving existing provers.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.09079","ref_index":39,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"CauSim: Scaling Causal Reasoning with Increasingly Complex Causal Simulators","primary_cat":"cs.AI","submitted_at":"2026-05-09T17:39:26+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"CauSim turns scarce causal reasoning labels into scalable supervised data by having LLMs incrementally construct complex executable structural causal models.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"InAdvances in Neural Information Processing Systems, 2022. 13 [38] Zheng Yuan, Hongyi Yuan, Chengpeng Li, Guanting Dong, Keming Lu, Chuanqi Tan, Chang Zhou, and Jingren Zhou. Scaling relationship on learning mathematical reasoning with large language models.arXiv preprint arXiv:2308.01825, 2023. doi: 10.48550/arXiv.2308.01825. URLhttps://arxiv.org/abs/2308.01825. [39] Hanze Dong, Wei Xiong, Deepanshu Goyal, Yihan Zhang, Winnie Chow, Rui Pan, Shizhe Diao, Jipeng Zhang, Kashun Shum, and Tong Zhang. RAFT: Reward ranked finetuning for generative foundation model alignment.Transactions on Machine Learning Research, 2023. doi: 10.48550/arXiv.2304.06767. URLhttps://arxiv.org/abs/2304.06767. [40] Marco Scutari. Learning bayesian networks with the bnlearn r package."},{"citing_arxiv_id":"2605.09012","ref_index":24,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Re$^2$Math: Benchmarking Theorem Retrieval in Research-Level Mathematics","primary_cat":"cs.AI","submitted_at":"2026-05-09T15:52:49+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Re²Math is a new benchmark that evaluates AI models on retrieving and verifying the applicability of theorems from math literature to advance steps in partial proofs, accepting any sufficient theorem while controlling for leakage.","context_count":1,"top_context_role":"background","top_context_polarity":"unclear","context_text":"Planning Citation-side Grounding End-to-end ModelAnchorAcc↑CiteRecall@20↑GroundRate↑ToolAcc↑ Claude Opus 4.5 24.0% (48)10.5%(21) 24.0% (48)7.0%(14) Grok 4 26.5% (53) 4.5% (9)41.5%(83) 3.5% (7) Kimi K2 Thinking43.5%(87) 6.0% (12) 24.0% (48) 3.5% (7) GPT-5.2 5.5% (11) 8.5% (17) 25.0% (50) 3.0% (6) DeepSeek V3.2 8.0% (16) 5.5% (11) 29.0% (58) 2.5% (5) Gemini 3.1 Pro 2.0% (4) 5.5% (11) 12.0% (24) 2.0% (4) Qwen3-235B Thinking 40.5% (81) 6.0% (12) 17.0% (34) 1.0% (2) Shared retrieval artifact.All models use the same release-frozen retrieval artifact, instantiated with Google Scholar. We use Google Scholar as a broad scholarly discovery interface because the needed tool may appear across heterogeneous records-articles, proceedings papers, preprints,"},{"citing_arxiv_id":"2604.05868","ref_index":29,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Understanding Performance Gap Between Parallel and Sequential Sampling in Large Reasoning Models","primary_cat":"cs.CL","submitted_at":"2026-04-07T13:28:09+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Lack of exploration from conditioning on prior answers is the primary reason parallel sampling outperforms sequential sampling in large reasoning models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.03071","ref_index":23,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Automatic Textbook Formalization","primary_cat":"cs.AI","submitted_at":"2026-04-03T14:51:01+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Multi-agent AI system formalizes entire 500-page graduate algebraic combinatorics textbook into Lean, creating 130K lines of code in one week at human-expert cost.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2602.24273","ref_index":40,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"A Minimal Agent for Automated Theorem Proving","primary_cat":"cs.AI","submitted_at":"2026-02-27T18:43:47+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A minimal agentic system achieves competitive performance in automated theorem proving with a simpler design and lower cost than state-of-the-art methods.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2511.12253","ref_index":56,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"The Search for Constrained Random Generators","primary_cat":"cs.PL","submitted_at":"2025-11-15T15:18:51+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"A Lean library called Palamedes uses synthesis rules from generator semantics and catamorphism-anamorphism rewriting to automatically produce correct constrained random generators.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2510.12787","ref_index":69,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Ax-Prover: A Deep Reasoning Agentic Framework for Theorem Proving in Mathematics and Quantum Physics","primary_cat":"cs.AI","submitted_at":"2025-10-14T17:57:04+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Ax-Prover is a tool-using multi-agent LLM system that matches state-of-the-art provers on public math benchmarks and outperforms them on new abstract-algebra and quantum-theory benchmarks while also assisting an expert with a cryptography proof.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2503.04697","ref_index":22,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"L1: Controlling How Long A Reasoning Model Thinks With Reinforcement Learning","primary_cat":"cs.CL","submitted_at":"2025-03-06T18:43:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"LCPO trains L1 reasoning models to adhere to prompt-specified CoT lengths, supporting accuracy-compute trade-offs and yielding short reasoning models that outperform larger baselines at matched lengths.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2502.02871","ref_index":214,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Position: Multimodal Large Language Models Can Significantly Advance Scientific Reasoning","primary_cat":"cs.CL","submitted_at":"2025-02-05T04:05:27+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":2.0,"formal_verification":"none","one_line_summary":"Position paper claims multimodal LLMs can significantly advance scientific reasoning and proposes a four-stage roadmap plus challenges and suggestions.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}