Soohak is a 439-problem mathematician-curated benchmark where frontier LLMs reach at most 30.4% on research math challenges and no model exceeds 50% on refusal for ill-posed problems.
Mixed citations
First Proof.arXiv preprint
Mixed citation behavior. Most common role is background (60%).
citation-role summary
citation-polarity summary
years
2026 9representative citing papers
An interactive AI workbench for mathematicians achieves 48% on FrontierMath Tier 4 and helped solve open problems in early tests.
k-server-bench formulates potential-function discovery for the k-server conjecture as a code-based inequality-satisfaction task; current agents fully solve the resolved k=3 case and reduce violations on the open k=4 case.
RMA, a multi-agent system with structured memory and iterative feedback loops, solves 8 out of 10 research-level math problems on the new First Proof benchmark and outperforms GPT-5.2R and Aletheia according to expert evaluation.
Computational discovery via FlowBoost supports conjectures on the singular values of the coupling matrix E_n being 2^{-k/2} independent of n, a sharp p=2 critical exponent for p-Stam inequalities, and bifurcation of extremals for p<2.
An AI framework combining informal reasoning and formal verification resolves an open commutative algebra problem and produces a Lean 4-checked proof with minimal human input.
pAI/MSc is a customizable multi-agent system that reduces human steering by orders of magnitude when turning a hypothesis into a literature-grounded, mathematically established, experimentally supported manuscript draft in ML theory.
Forage V2 enables agent organizations to grow knowledge from 0 to 54 entries over runs and transfer it so weaker models nearly match stronger ones in coverage, cost, and speed on open-world tasks.
AI agents exploring Platonic mathematical structures via proof hypergraphs may reveal the overall architecture of formal mathematics and what makes parts of it human-accessible.
citing papers explorer
-
Soohak: A Mathematician-Curated Benchmark for Evaluating Research-level Math Capabilities of LLMs
Soohak is a 439-problem mathematician-curated benchmark where frontier LLMs reach at most 30.4% on research math challenges and no model exceeds 50% on refusal for ill-posed problems.
-
AI co-mathematician: Accelerating mathematicians with agentic AI
An interactive AI workbench for mathematicians achieves 48% on FrontierMath Tier 4 and helped solve open problems in early tests.
-
$k$-server-bench: Automating Potential Discovery for the $k$-Server Conjecture
k-server-bench formulates potential-function discovery for the k-server conjecture as a code-based inequality-satisfaction task; current agents fully solve the resolved k=3 case and reduce violations on the open k=4 case.
-
RMA: an Agentic System for Research-Level Mathematical Problems
RMA, a multi-agent system with structured memory and iterative feedback loops, solves 8 out of 10 research-level math problems on the new First Proof benchmark and outperforms GPT-5.2R and Aletheia according to expert evaluation.
-
Spectral Structure in Finite Free Information Inequalities and $p$-Stam Phase Transitions
Computational discovery via FlowBoost supports conjectures on the singular values of the coupling matrix E_n being 2^{-k/2} independent of n, a sharp p=2 critical exponent for p-Stam inequalities, and bifurcation of extremals for p<2.
-
Automated Conjecture Resolution with Formal Verification
An AI framework combining informal reasoning and formal verification resolves an open commutative algebra problem and produces a Lean 4-checked proof with minimal human input.
-
pAI/MSc: ML Theory Research with Humans on the Loop
pAI/MSc is a customizable multi-agent system that reduces human steering by orders of magnitude when turning a hypothesis into a literature-grounded, mathematically established, experimentally supported manuscript draft in ML theory.
-
Forage V2: Knowledge Evolution and Transfer in Autonomous Agent Organizations
Forage V2 enables agent organizations to grow knowledge from 0 to 54 entries over runs and transfer it so weaker models nearly match stronger ones in coverage, cost, and speed on open-world tasks.
-
Artificial Intelligence and the Structure of Mathematics
AI agents exploring Platonic mathematical structures via proof hypergraphs may reveal the overall architecture of formal mathematics and what makes parts of it human-accessible.