Hard2verify: A step-level verification benchmark for open-ended frontier math

URL https://arxiv · 2025 · arXiv 2510.13744

4 Pith papers cite this work. Polarity classification is still indexing.

4 Pith papers citing it

read on arXiv browse 4 citing papers

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

The Hidden Signal of Verifier Strictness: Controlling and Improving Step-Wise Verification via Selective Latent Steering

cs.LG · 2026-05-20 · conditional · novelty 7.0

VerifySteer selectively steers hidden states at paragraph boundaries using latent correctness signals to control verifier strictness and outperform baselines on ProcessBench and Hard2Verify with lower compute.

Not All Proofs Are Equal: Evaluating LLM Proof Quality Beyond Correctness

cs.CL · 2026-05-11 · unverdicted · novelty 7.0

LLM proofs for hard math problems show large differences in quality metrics like conciseness and cognitive simplicity that correctness-only tests miss, along with trade-offs between quality and correctness.

AI co-mathematician: Accelerating mathematicians with agentic AI

cs.AI · 2026-05-07 · unverdicted · novelty 7.0 · 2 refs

An interactive AI workbench for mathematicians achieves 48% on FrontierMath Tier 4 and helped solve open problems in early tests.

Pseudo-Formalization for Automatic Proof Verification

cs.LO · 2026-05-19 · unverdicted · novelty 5.0

Pseudo-Formalization decomposes natural language proofs into modular blocks for independent LLM verification via Block Verification, outperforming LLM-as-judge baselines on error detection in olympiad and research math benchmarks.

citing papers explorer

Showing 4 of 4 citing papers.

The Hidden Signal of Verifier Strictness: Controlling and Improving Step-Wise Verification via Selective Latent Steering cs.LG · 2026-05-20 · conditional · none · ref 15
VerifySteer selectively steers hidden states at paragraph boundaries using latent correctness signals to control verifier strictness and outperform baselines on ProcessBench and Hard2Verify with lower compute.
Not All Proofs Are Equal: Evaluating LLM Proof Quality Beyond Correctness cs.CL · 2026-05-11 · unverdicted · none · ref 19
LLM proofs for hard math problems show large differences in quality metrics like conciseness and cognitive simplicity that correctness-only tests miss, along with trade-offs between quality and correctness.
AI co-mathematician: Accelerating mathematicians with agentic AI cs.AI · 2026-05-07 · unverdicted · none · ref 54 · 2 links
An interactive AI workbench for mathematicians achieves 48% on FrontierMath Tier 4 and helped solve open problems in early tests.
Pseudo-Formalization for Automatic Proof Verification cs.LO · 2026-05-19 · unverdicted · none · ref 25
Pseudo-Formalization decomposes natural language proofs into modular blocks for independent LLM verification via Block Verification, outperforming LLM-as-judge baselines on error detection in olympiad and research math benchmarks.

Hard2verify: A step-level verification benchmark for open-ended frontier math

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer