pith. the verified trust layer for science.
Pith Number

pith:YHPZV5IX

pith:2024:YHPZV5IXRF3GIHEWDFTPQWKAAB
not attested · not anchored · not stored · refs resolved

Measuring short-form factuality in large language models

Amelia Glaese, Hyung Won Chung, Jason Wei, John Schulman, Nguyen Karina, Spencer Papay, William Fedus, Yunxin Joy Jiao

SimpleQA is a benchmark that measures whether language models know what they know about short factual questions.

arxiv:2411.04368 v1 · 2024-11-07 · cs.CL

Add to your LaTeX paper
\usepackage{pith}
\pithnumber{YHPZV5IXRF3GIHEWDFTPQWKAAB}

Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.

Record completeness

1. Bitcoin timestamp
2. Internet Archive
3. Author claim: open
4. Citations: open
5. Replications: open
Portable graph bundle live · download bundle · merged state
The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.

Claims

C1 · strongest claim

SimpleQA is a simple, targeted evaluation for whether models 'know what they know,' and our hope is that this benchmark will remain relevant for the next few generations of frontier models.

C2 · weakest assumption

Questions can be created such that each has only a single, indisputable answer, and adversarial collection against GPT-4 responses produces questions that remain challenging for future models.

C3 · one-line summary

SimpleQA is a new benchmark of short, single-answer factual questions collected adversarially against GPT-4 to evaluate LLM factuality and confidence calibration.

References

19 extracted · 19 resolved · 4 Pith anchors

[1] org/abs/2305.18248 2024
[2] P. Anthropic. Claude 3 model card, 2024. URL https://www-cdn.anthropic.com/de8ba9b01c9ab7cbabf5c33b80b7bbc618857627/Model_Card_Claude_3.pdf 2024
[3] In Proceedings of the 22nd international conference on Machine learning, pages 89–96 2023
[4] TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension 2017 · arXiv:1705.03551
[5] Language Models (Mostly) Know What They Know 2022 · arXiv:2207.05221

Formal links

3 machine-checked theorem links

Cited by

33 papers in Pith

Receipt and verification
First computed: 2026-05-17T23:38:53.221012Z
Builder: pith-number-builder-2026-05-17-v1
Signature: Pith Ed25519 (pith-v1-2026-05) · public key
Schema: pith-number/v1.0

Canonical hash

c1df9af5178976641c961966f8594000496af92ff9b0066c7067ad8d484a8e51

Aliases

arxiv: 2411.04368 · arxiv_version: 2411.04368v1 · doi: 10.48550/arxiv.2411.04368 · pith_short_12: YHPZV5IXRF3G · pith_short_16: YHPZV5IXRF3GIHEW · pith_short_8: YHPZV5IX
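
The short aliases listed above are prefixes of the full 26-character Pith Number. That number in turn appears to equal the first 26 characters of the RFC 4648 Base32 encoding of the canonical hash. This second relationship is inferred from the values on this page, not taken from a published specification, so the sketch below is an illustration only. The function name `pith_aliases` and the key `pith_full` are hypothetical.

```python
import base64

# Canonical hash as shown on this page (hex-encoded SHA-256).
CANONICAL_HASH = "c1df9af5178976641c961966f8594000496af92ff9b0066c7067ad8d484a8e51"


def pith_aliases(canonical_hash_hex: str) -> dict[str, str]:
    """Derive the identifier and its short aliases from a hex digest.

    Assumption: the identifier is the first 26 characters of the Base32
    encoding of the digest. This matches the values on this page but is
    not confirmed by a specification.
    """
    encoded = base64.b32encode(bytes.fromhex(canonical_hash_hex)).decode("ascii")
    full = encoded[:26]
    return {
        "pith_full": full,            # hypothetical key name
        "pith_short_8": full[:8],
        "pith_short_12": full[:12],
        "pith_short_16": full[:16],
    }


aliases = pith_aliases(CANONICAL_HASH)
for name, value in aliases.items():
    print(f"{name}: {value}")
```

For the hash on this page, the derived values agree with the identifier and the three short aliases listed above.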
Agent API
Verify this Pith Number yourself
curl -sH 'Accept: application/ld+json' https://pith.science/pith/YHPZV5IXRF3GIHEWDFTPQWKAAB \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: c1df9af5178976641c961966f8594000496af92ff9b0066c7067ad8d484a8e51
Canonical record JSON
{
  "metadata": {
    "abstract_canon_sha256": "bc6590af27d293a121a6fce13fd12d46a8a316c7aaf3e54b2a59f21071aca0f6",
    "cross_cats_sorted": [],
    "license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/",
    "primary_cat": "cs.CL",
    "submitted_at": "2024-11-07T01:58:42Z",
    "title_canon_sha256": "fd6601c4ac8b2d44a8b49a1794e90a34cc658b1e7eb5e0afc3ec76cf8436e8e7"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2411.04368",
    "kind": "arxiv",
    "version": 1
  }
}
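
The shell pipeline under "Verify this Pith Number yourself" re-serializes the canonical record with sorted keys and compact separators before hashing it with SHA-256. A minimal offline sketch of that canonicalization step follows. The function name `canonical_sha256` is hypothetical, and the record literal is copied from the JSON above. Whether this offline copy reproduces the published hash depends on the page's JSON matching the served record exactly, so the script reports the comparison instead of assuming the result.

```python
import hashlib
import json

# Published canonical hash, as shown on this page.
EXPECTED = "c1df9af5178976641c961966f8594000496af92ff9b0066c7067ad8d484a8e51"

# Canonical record copied from the page; the live API response is authoritative.
RECORD = {
    "metadata": {
        "abstract_canon_sha256": "bc6590af27d293a121a6fce13fd12d46a8a316c7aaf3e54b2a59f21071aca0f6",
        "cross_cats_sorted": [],
        "license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/",
        "primary_cat": "cs.CL",
        "submitted_at": "2024-11-07T01:58:42Z",
        "title_canon_sha256": "fd6601c4ac8b2d44a8b49a1794e90a34cc658b1e7eb5e0afc3ec76cf8436e8e7",
    },
    "schema_version": "1.0",
    "source": {"id": "2411.04368", "kind": "arxiv", "version": 1},
}


def canonical_sha256(record: dict) -> str:
    """Hash a record using the same JSON settings as the shell pipeline."""
    canonical = json.dumps(
        record,
        sort_keys=True,            # key order must not affect the hash
        separators=(",", ":"),     # no whitespace between tokens
        ensure_ascii=False,        # keep non-ASCII characters as UTF-8
    ).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()


digest = canonical_sha256(RECORD)
print("computed:", digest)
print("matches published hash:", digest == EXPECTED)
```

Sorting keys and fixing the separators means two mirrors that hold the same record compute the same digest regardless of how their JSON was originally laid out.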