GPQA is a new graduate-level benchmark where PhD experts score 65% (74% after corrections), skilled non-experts score 34% with web access, and GPT-4 scores 39%, intended to enable realistic tests of human supervision over superhuman AI.
We additionally gave detailed comments on several example questions for each part (not shown here), highlighting positive and negative aspects of the examples
1 Pith paper cite this work. Polarity classification is still indexing.
1
Pith paper citing it
fields
cs.AI 1years
2023 1verdicts
ACCEPT 1representative citing papers
citing papers explorer
-
GPQA: A Graduate-Level Google-Proof Q&A Benchmark
GPQA is a new graduate-level benchmark where PhD experts score 65% (74% after corrections), skilled non-experts score 34% with web access, and GPT-4 scores 39%, intended to enable realistic tests of human supervision over superhuman AI.