arXiv preprint arXiv:2303.17557
1 Pith paper cites this work (field: cs.CL; year: 2026; verdict: UNVERDICTED). Representative citing paper:
Can LLMs Judge Better Than They Generate? Evaluating Task Asymmetry, Mechanistic Interpretability and Transferability for In-Context QA
Generation accuracy exceeds self-evaluation on three of four in-context QA benchmarks, with attention analysis showing evaluators attend 3-5x less to context and barely read the candidate answer.