IDS is an agentic LLM system that incrementally synthesizes both implementation and proof for distributed key-value stores, succeeding on all 7 specs where prior agents succeeded on only 2.
Title resolution pending
16 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
roles
background 2polarities
background 2representative citing papers
LLM-based agents in Lean solved 9 of 353 open Erdős problems and proved 44 of 492 OEIS conjectures at a few hundred dollars each.
Fidelity probes from code raise specification fidelity from 0.63 to 0.94 on a 12k-line COBOL benchmark over eight iterations, with convergence predicted by a two-state Markov fixed point from four iterations of rate data.
VERIMED translates natural-language requirements to formal logic via LLMs, detects ambiguity from stochastic formalization differences, and audits for inconsistency and safety violations using SMT queries.
Multi-agent AI system formalizes entire 500-page graduate algebraic combinatorics textbook into Lean, creating 130K lines of code in one week at human-expert cost.
ImProver 2 combines a data-efficient expert-iteration pipeline with a neurosymbolic scaffold to train a 7B model that outperforms larger models in Lean 4 proof optimization across structural metrics.
OProver-32B achieves top Pass@32 scores on MiniF2F, ProverBench, and PutnamBench by combining continued pretraining with iterative agentic proving, retrieval, SFT on repairs, and RL on unresolved cases using a 6.86M-proof dataset.
Viverra generates C code from text descriptions together with assertions that are verified by model checkers, and a user study with over 400 participants shows the verified assertions improve code comprehension.
Segment-level supervision extracts coherent proof segments to train policy models that achieve 61-66% success on miniF2F, outperforming step-level and whole-proof methods while also improving existing provers.
A minimal agentic system achieves competitive performance in automated theorem proving with a simpler design and lower cost than state-of-the-art methods.
VERGE decomposes LLM outputs into atomic claims, autoformalizes them to first-order logic, verifies with SMT solvers and consensus, and refines via minimal correction subsets, yielding 18.7% average uplift on reasoning benchmarks.
Aristotle reaches gold-medal-equivalent performance on 2025 IMO problems via integrated Lean proof search, informal lemma formalization, and a dedicated geometry solver.
DeepSeekMath 7B reaches 51.7% on MATH via continued pretraining on curated web math data and Group Relative Policy Optimization.
ReWOO decouples reasoning from tool observations in augmented language models, delivering 5x token efficiency and 4% higher accuracy on multi-step reasoning benchmarks like HotpotQA.
ReCent-Prover achieves a 22.58% relative improvement over prior state-of-the-art in proved theorems on the CoqStoq benchmark by using reasoning-centric techniques under a fixed LLM invocation budget.
Proposes a three-layer framework using formal AI reasoning for verification, derivation, and discovery in wireless communications theory.
citing papers explorer
-
Inductive Deductive Synthesis: Enabling AI to Generate Formally Verified Systems
IDS is an agentic LLM system that incrementally synthesizes both implementation and proof for distributed key-value stores, succeeding on all 7 specs where prior agents succeeded on only 2.
-
Advancing Mathematics Research with AI-Driven Formal Proof Search
LLM-based agents in Lean solved 9 of 353 open Erdős problems and proved 44 of 492 OEIS conjectures at a few hundred dollars each.
-
Fidelity Probes for Specification--Code Alignment
Fidelity probes from code raise specification fidelity from 0.63 to 0.94 on a 12k-line COBOL benchmark over eight iterations, with convergence predicted by a two-state Markov fixed point from four iterations of rate data.
-
Neurosymbolic Auditing of Natural-Language Software Requirements
VERIMED translates natural-language requirements to formal logic via LLMs, detects ambiguity from stochastic formalization differences, and audits for inconsistency and safety violations using SMT queries.
-
Automatic Textbook Formalization
Multi-agent AI system formalizes entire 500-page graduate algebraic combinatorics textbook into Lean, creating 130K lines of code in one week at human-expert cost.
-
ImProver 2: Iteratively Self-Improving LMs for Neurosymbolic Proof Optimization
ImProver 2 combines a data-efficient expert-iteration pipeline with a neurosymbolic scaffold to train a 7B model that outperforms larger models in Lean 4 proof optimization across structural metrics.
-
OProver: A Unified Framework for Agentic Formal Theorem Proving
OProver-32B achieves top Pass@32 scores on MiniF2F, ProverBench, and PutnamBench by combining continued pretraining with iterative agentic proving, retrieval, SFT on repairs, and RL on unresolved cases using a 6.86M-proof dataset.
-
Viverra: Text-to-Code with Guarantees
Viverra generates C code from text descriptions together with assertions that are verified by model checkers, and a user study with over 400 participants shows the verified assertions improve code comprehension.
-
Rethinking Supervision Granularity: Segment-Level Learning for LLM-Based Theorem Proving
Segment-level supervision extracts coherent proof segments to train policy models that achieve 61-66% success on miniF2F, outperforming step-level and whole-proof methods while also improving existing provers.
-
A Minimal Agent for Automated Theorem Proving
A minimal agentic system achieves competitive performance in automated theorem proving with a simpler design and lower cost than state-of-the-art methods.
-
VERGE: Formal Refinement and Guidance Engine for Verifiable LLM Reasoning
VERGE decomposes LLM outputs into atomic claims, autoformalizes them to first-order logic, verifies with SMT solvers and consensus, and refines via minimal correction subsets, yielding 18.7% average uplift on reasoning benchmarks.
-
Aristotle: IMO-level Automated Theorem Proving
Aristotle reaches gold-medal-equivalent performance on 2025 IMO problems via integrated Lean proof search, informal lemma formalization, and a dedicated geometry solver.
-
DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
DeepSeekMath 7B reaches 51.7% on MATH via continued pretraining on curated web math data and Group Relative Policy Optimization.
-
ReWOO: Decoupling Reasoning from Observations for Efficient Augmented Language Models
ReWOO decouples reasoning from tool observations in augmented language models, delivering 5x token efficiency and 4% higher accuracy on multi-step reasoning benchmarks like HotpotQA.
-
On Reasoning-Centric LLM-based Automated Theorem Proving
ReCent-Prover achieves a 22.58% relative improvement over prior state-of-the-art in proved theorems on the CoqStoq benchmark by using reasoning-centric techniques under a fixed LLM invocation budget.
-
Rethinking Wireless Communications through Formal Mathematical AI Reasoning
Proposes a three-layer framework using formal AI reasoning for verification, derivation, and discovery in wireless communications theory.