Title resolution pending

A second multimodal verification pass using problem + student response + original answer-sheet image to remove samples whose OCR artifacts may materially affect evaluation · arXiv 1100.3270

1 Pith paper cites this work. Polarity classification is still indexing.

Title metadata for this work has not finished resolving. The hub is built from the citation graph; the title resolver retries DOI and OpenAlex on its next pass.

representative citing papers

RealMath-Eval: Why SOTA Judges Struggle with Real Human Reasoning

cs.AI · 2026-06-08 · unverdicted · novelty 7.0

The RealMath-Eval benchmark shows that LLM judges have an evaluation gap: they perform worse on diverse real human math reasoning than on synthetic solutions, owing to greater error diversity and higher surprisal.

citing papers explorer

Showing 1 of 1 citing paper after filters.

RealMath-Eval: Why SOTA Judges Struggle with Real Human Reasoning

cs.AI · 2026-06-08 · unverdicted · none · ref 35
