arXiv:2603.02668 [cs.AI].url:https://arxiv.org/abs/2603.02668

Austin Letson et al · 2026 · arXiv 2603.02668

4 Pith papers cite this work. Polarity classification is still indexing.

4 Pith papers citing it

read on arXiv browse 4 citing papers

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

Formal Conjectures: An Open and Evolving Benchmark for Verified Discovery in Mathematics

cs.AI · 2026-05-13 · unverdicted · novelty 7.0

Formal Conjectures is a Lean 4 benchmark containing 2615 formalized problems with 1029 open conjectures, designed to evaluate automated mathematical reasoning and proof discovery.

Not All Proofs Are Equal: Evaluating LLM Proof Quality Beyond Correctness

cs.CL · 2026-05-11 · unverdicted · novelty 7.0

LLM proofs for hard math problems show large differences in quality metrics like conciseness and cognitive simplicity that correctness-only tests miss, along with trade-offs between quality and correctness.

s2n-bignum-bench: A practical benchmark for evaluating low-level code reasoning of LLMs

cs.PL · 2026-03-15 · unverdicted · novelty 7.0

s2n-bignum-bench is a new benchmark requiring LLMs to synthesize HOL Light proofs for real-world low-level cryptographic assembly code.

A Minimal Agent for Automated Theorem Proving

cs.AI · 2026-02-27 · unverdicted · novelty 6.0

A minimal agentic system achieves competitive performance in automated theorem proving with a simpler design and lower cost than state-of-the-art methods.

citing papers explorer

Showing 4 of 4 citing papers.

Formal Conjectures: An Open and Evolving Benchmark for Verified Discovery in Mathematics cs.AI · 2026-05-13 · unverdicted · none · ref 15
Formal Conjectures is a Lean 4 benchmark containing 2615 formalized problems with 1029 open conjectures, designed to evaluate automated mathematical reasoning and proof discovery.
Not All Proofs Are Equal: Evaluating LLM Proof Quality Beyond Correctness cs.CL · 2026-05-11 · unverdicted · none · ref 40
LLM proofs for hard math problems show large differences in quality metrics like conciseness and cognitive simplicity that correctness-only tests miss, along with trade-offs between quality and correctness.
s2n-bignum-bench: A practical benchmark for evaluating low-level code reasoning of LLMs cs.PL · 2026-03-15 · unverdicted · none · ref 10
s2n-bignum-bench is a new benchmark requiring LLMs to synthesize HOL Light proofs for real-world low-level cryptographic assembly code.
A Minimal Agent for Automated Theorem Proving cs.AI · 2026-02-27 · unverdicted · none · ref 21
A minimal agentic system achieves competitive performance in automated theorem proving with a simpler design and lower cost than state-of-the-art methods.

arXiv:2603.02668 [cs.AI].url:https://arxiv.org/abs/2603.02668

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer