QED: An Open-Source Multi-Agent System for Generating Mathematical Proofs on Open Problems

Chenyang An; Jiayaun Zhang; Minghao Pan; Qihao Ye

arxiv: 2604.24021 · v3 · pith:24QDCHBNnew · submitted 2026-04-27 · 💻 cs.AI · math.AP

QED: An Open-Source Multi-Agent System for Generating Mathematical Proofs on Open Problems

Chenyang An , Qihao Ye , Minghao Pan , Jiayaun Zhang This is my paper

Pith reviewed 2026-05-25 06:50 UTC · model grok-4.3

classification 💻 cs.AI math.AP

keywords multi-agent systemsmathematical proof generationautomated theorem provingAI for mathematicsopen problemsalgebraic geometryfluid PDEs

0 comments

The pith

QED is a multi-agent system that converts open research questions into complete, verified mathematical proofs without further human input.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents QED as an open-source system that decomposes proof generation into separate agents for planning, argument construction, and correctness checking. This structure addresses repeated failures of single-query AI attempts on research-level mathematics. On 18 evaluated projects the system produced five original contributions in algebraic geometry, fluid PDEs, probability, and inverse problems. Domain experts judged three of these at the difficulty and scope typical of specialist mathematics journals. The result matters because it shows a concrete route by which AI can generate new, publishable mathematics from unanswered questions.

Core claim

QED shows that a pipeline of decomposition, prover, and verifier agents can take human-formulated research questions and output original, expert-validated mathematical proofs across several domains, yielding five new works three of which match the standard of established specialist venues.

What carries the argument

The multi-agent pipeline that assigns distinct roles to decomposition, proving, and verification agents.

If this is right

Specialist mathematics venues could receive and review AI-generated proofs as regular submissions.
Researchers could use the system to explore multiple proof strategies for a single conjecture in parallel.
The open-source code allows direct testing on additional open problems in the listed domains.
The separation of planning from proving and verification becomes a reusable template for other automated reasoning tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Hybrid workflows may emerge in which humans supply questions and the system supplies initial proof drafts for refinement.
Success on fluid PDEs and algebraic geometry suggests the same pipeline could be tested on number theory or combinatorics open problems.
If verifier accuracy scales with problem size, the approach could reduce the fraction of human time spent on routine proof checking.
Publication of the generated proofs would create a public record that future AI systems can train against.

Load-bearing premise

The verifier agents must reliably catch all errors in the generated proofs so that the five outputs are both original and correct without hidden human fixes.

What would settle it

Discovery of an undetected logical error in one of the five generated proofs by independent expert review, or zero new proofs produced when the same system is run on a fresh set of comparable open problems.

Figures

Figures reproduced from arXiv: 2604.24021 by Chenyang An, Jiayaun Zhang, Minghao Pan, Qihao Ye.

**Figure 1.** Figure 1: QED system architecture. Stage 0 surveys the literature. Stage 1 iterates a proof loop: multiple models generate proofs in parallel; the structural verifier (SV) and detailed verifier (DV) each independently produce n ∗ m verification reports m per proof, where n is the number of provers and m is the number of verification agent for each proof. At present, both n and m are at most three, corresponding to C… view at source ↗

**Figure 2.** Figure 2: Decomposition mode flow. The regulator analyzes verification failures and selects one of three corrective actions: REVISE PROOF make revision to the proof without changing the plan, REVISE PLAN modifies the existing decomposition plan, and REWRITE works on a new decomposition entirely. Decomposer have access to previous plans to avoid repetition/make revision. • Phase 3: Citation verification (addresses Fa… view at source ↗

read the original abstract

We present \textbf{QED}, an open-source multi-agent system that turns human-provided research questions into complete mathematical proofs without further human guidance. Its pipeline is designed to overcome common failures of single-query proof generation by separating planning, proving, and verification: a decomposition agent structures the proof search, prover agents generate candidate arguments, and verifier agents check correctness. In collaboration with domain experts, we evaluated QED on 18 research-level projects of varying difficulty. QED produced five original works across algebraic geometry, fluid PDEs, probability, and inverse problems. Expert assessments regard these works as solid specialized research contributions, with three comparable in difficulty and scope to work commonly published in established specialist mathematics venues. QED is released at https://github.com/proofQED/QED.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

QED produced five expert-reviewed original proofs on open math problems with a three-agent pipeline, but the verifier step has no error-rate data or clear audit of human involvement.

read the letter

QED produced five original mathematical works on open problems that domain experts viewed as solid contributions, with three at the level of typical specialist journal papers. That's the central result to take away. The paper introduces a multi-agent system with separate agents for decomposing the research question, generating proof steps, and verifying the output. This addresses some issues with single large language model queries on complex proofs. They tested it on 18 projects spanning algebraic geometry, fluid PDEs, probability, and inverse problems. The code is open-sourced, which allows others to inspect and build on it. This setup shows some promise for turning research questions into proofs with limited human input after the initial question. The breadth across fields is a strength. The main concern is the verification step. The abstract does not provide quantitative measures of the verifier agents' accuracy on these novel proofs, such as false negative rates or comparisons to expert performance. The claim of operating without further human guidance sits alongside the mention of evaluation in collaboration with domain experts, which could imply some involvement in assessing or refining the outputs. Without an audit trail like agent traces or detailed rubrics, it's difficult to gauge the true level of autonomy. This paper is aimed at people working on AI for mathematical research. A reader in automated reasoning or multi-agent AI would get value from the architecture and the case studies. It deserves serious peer review because it reports concrete applications to open problems and provides the code, even though the evaluation could be more rigorous.

Referee Report

3 major / 1 minor

Summary. The manuscript presents QED, an open-source multi-agent system that decomposes mathematical proof generation into planning, proving, and verification stages to address open research questions without further human guidance. On 18 research-level projects spanning algebraic geometry, fluid PDEs, probability, and inverse problems, the system is reported to have produced five original works; domain experts assessed these as solid specialized contributions, with three comparable in difficulty and scope to papers commonly published in established specialist mathematics venues. The code is released at https://github.com/proofQED/QED.

Significance. If the central claims are substantiated with adequate verification evidence, the work would constitute a meaningful contribution to AI-assisted mathematical research by demonstrating a multi-agent pipeline capable of producing publishable-level outputs on open problems. The explicit open-source release and separation of verification from generation are concrete strengths that enable reproducibility and community scrutiny.

major comments (3)

[Abstract] Abstract: The assertion that the pipeline operates 'without further human guidance' is load-bearing for the headline claim yet is immediately qualified by the statement that evaluation occurred 'in collaboration with domain experts.' No section clarifies the precise scope of this collaboration (e.g., whether experts performed post-generation selection, refinement, or only final assessment), which directly affects whether the five outputs can be attributed solely to the agent system.
[Evaluation section] Evaluation section (and abstract): The correctness of the five produced works rests on verifier agents, yet no quantitative performance data are supplied—such as false-negative rates on error detection for novel research-level proofs, comparison against human experts on held-out examples, or formal verification of the verifiers themselves. Without these metrics the claim that the outputs are mathematically correct cannot be independently assessed.
[Results section] Results section: The manuscript provides no audit trail (raw agent traces, expert review rubrics, or side-by-side comparison with published analogues) that would allow readers to verify originality and correctness of the five works. This absence is material because the central claim is that QED generated publishable research contributions.

minor comments (1)

[Abstract] The abstract and introduction would benefit from a brief statement of the total number of agent interactions or token budgets used across the 18 projects to give readers a sense of computational cost.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below with clarifications and indicate planned revisions to strengthen the manuscript.

read point-by-point responses

Referee: [Abstract] The assertion that the pipeline operates 'without further human guidance' is load-bearing for the headline claim yet is immediately qualified by the statement that evaluation occurred 'in collaboration with domain experts.' No section clarifies the precise scope of this collaboration (e.g., whether experts performed post-generation selection, refinement, or only final assessment), which directly affects whether the five outputs can be attributed solely to the agent system.

Authors: The QED pipeline generates proofs with no human input after the initial research question is supplied; planning, proving, and verification stages run autonomously. Collaboration with domain experts occurred exclusively after generation, solely for independent assessment of quality, originality, and publishability; experts performed no selection, refinement, or iterative guidance. We will revise the abstract for precision and insert a new subsection in the Evaluation section that explicitly bounds the scope of human involvement. revision: yes
Referee: [Evaluation section] The correctness of the five produced works rests on verifier agents, yet no quantitative performance data are supplied—such as false-negative rates on error detection for novel research-level proofs, comparison against human experts on held-out examples, or formal verification of the verifiers themselves. Without these metrics the claim that the outputs are mathematically correct cannot be independently assessed.

Authors: We agree that quantitative metrics on verifier performance for novel research-level proofs are absent. The verifier agents supply an automated first-pass check, while final correctness and significance judgments rest on the domain-expert assessments described in the paper. Generating false-negative rates or formal verifier verification would require an additional benchmark study outside the present scope. We will revise the Evaluation section to clarify the complementary roles of automated verifiers and expert review, add an explicit limitations paragraph on the lack of such metrics, and temper any implication that correctness rests solely on the agents. revision: partial
Referee: [Results section] The manuscript provides no audit trail (raw agent traces, expert review rubrics, or side-by-side comparison with published analogues) that would allow readers to verify originality and correctness of the five works. This absence is material because the central claim is that QED generated publishable research contributions.

Authors: All raw agent traces, conversation logs, and available expert review notes for the five works are already deposited in the public repository (https://github.com/proofQED/QED). We will expand the Results section with explicit hyperlinks to the trace directories for each proof, summarize the expert review criteria employed, and include brief qualitative discussion of how the generated results relate in scope to representative published papers in the respective subfields. revision: yes

Circularity Check

0 steps flagged

No circularity: claims rest on external expert judgment, not self-referential derivation

full rationale

The paper describes an empirical system and its outputs on open problems, with correctness and originality assessed by external domain experts. No equations, fitted parameters, predictions, or derivation chains appear. The abstract explicitly ties evaluation to collaboration with domain experts and external assessments, providing an independent benchmark rather than reducing to internal definitions or self-citations. No load-bearing step reduces by construction to the paper's own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no information on free parameters, axioms, or invented entities; the system is described at the architectural level only.

pith-pipeline@v0.9.0 · 5666 in / 1049 out tokens · 26929 ms · 2026-05-25T06:50:17.023346+00:00 · methodology

Review history (2 revisions) →

discussion (0)

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Return Probability for the Switch--Walk--Switch Lamplighter Walk on a Regular Tree
math.PR 2026-05 unverdicted novelty 6.0

Sharp return probability asymptotic p_{2n}(e,e) = ρ_d^{2n} exp[−(π²(log(d−1))² + o(1)) n / log² n] for the switch-walk-switch lamplighter walk with Z_2 lamps on the infinite d-regular tree, with proofs generated by th...

Reference graph

Works this paper leans on

3 extracted references · 3 canonical work pages · cited by 1 Pith paper

[1]

That is, Cλ(R) and Cl,λ(R) are nonzero forR >0

Verify that Fλ,l(R) is well-defined for λ >0 , l≥2 , and R >0 . That is, Cλ(R) and Cl,λ(R) are nonzero forR >0

work page
[2]

If 0< λ≤1 and l≥2 , l∈Z , then Fλ,l(R) is negative and monotonically increasing forR >0

work page
[3]

Result.QED did not produce a proof for this problem

If λ >1 and l≥2 , l∈Z , then there exists R1 > 0 such that Fλ,l(R) is increasing for R∈(0, R 1), decreasing forR > R 1, andF λ,l(R1)>0. Result.QED did not produce a proof for this problem. Across all 9 rounds, QED’s own verification pipeline identi- fied errors in every attempted proof and rejected them—the system self-verified to reject its own outputs w...

work page

[1] [1]

That is, Cλ(R) and Cl,λ(R) are nonzero forR >0

Verify that Fλ,l(R) is well-defined for λ >0 , l≥2 , and R >0 . That is, Cλ(R) and Cl,λ(R) are nonzero forR >0

work page

[2] [2]

If 0< λ≤1 and l≥2 , l∈Z , then Fλ,l(R) is negative and monotonically increasing forR >0

work page

[3] [3]

Result.QED did not produce a proof for this problem

If λ >1 and l≥2 , l∈Z , then there exists R1 > 0 such that Fλ,l(R) is increasing for R∈(0, R 1), decreasing forR > R 1, andF λ,l(R1)>0. Result.QED did not produce a proof for this problem. Across all 9 rounds, QED’s own verification pipeline identi- fied errors in every attempted proof and rejected them—the system self-verified to reject its own outputs w...

work page