Proof2hybrid: Automatic mathematical benchmark synthesis for proof-centric problems
1 Pith paper cites this work. Polarity classification is still indexing.
Fields: cs.AI (1)
Years: 2026 (1)
Verdicts: UNVERDICTED (1)
Representative citing papers
Automatically Generating Hard Math Problems from Hypothesis-Driven Error Analysis
A hypothesis-driven pipeline generates targeted hard math problems that drop Llama-3.3-70B-Instruct accuracy from 77% on MATH to as low as 45%.