arXiv preprint arXiv:2510.05962

O'Brien, Dayy · 2025 · arXiv 2510.05962

2 Pith papers cite this work. Polarity classification is still being indexed.

representative citing papers

Automatically Generating Hard Math Problems from Hypothesis-Driven Error Analysis

cs.AI · 2026-04-06 · unverdicted · novelty 5.0

A hypothesis-driven pipeline generates targeted hard math problems that drop Llama-3.3-70B-Instruct accuracy from 77% on MATH to as low as 45%.

LLM Benchmark Datasets Should Be Contamination-Resistant

cs.LG · 2026-05-19 · unverdicted · novelty 4.0

Authors call for contamination-resistant LLM benchmarks that exploit Transformer training-inference asymmetry and require new mathematical methods for cross-architecture interoperability.

citing papers explorer

Automatically Generating Hard Math Problems from Hypothesis-Driven Error Analysis
cs.AI · 2026-04-06 · unverdicted · none · ref 8
A hypothesis-driven pipeline generates targeted hard math problems that drop Llama-3.3-70B-Instruct accuracy from 77% on MATH to as low as 45%.

LLM Benchmark Datasets Should Be Contamination-Resistant
cs.LG · 2026-05-19 · unverdicted · none · ref 17
Authors call for contamination-resistant LLM benchmarks that exploit Transformer training-inference asymmetry and require new mathematical methods for cross-architecture interoperability.
