Pith Number

pith:YHPZV5IX

pith:2024:YHPZV5IXRF3GIHEWDFTPQWKAAB

not attested · not anchored · not stored · refs resolved

Measuring short-form factuality in large language models

Amelia Glaese, Hyung Won Chung, Jason Wei, John Schulman, Nguyen Karina, Spencer Papay, William Fedus, Yunxin Joy Jiao

The SimpleQA benchmark measures whether language models know what they know on short factual questions.

arxiv:2411.04368 v1 · 2024-11-07 · cs.CL

Add to your LaTeX paper

\usepackage{pith}
\pithnumber{YHPZV5IXRF3GIHEWDFTPQWKAAB}

Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.

Record completeness

1 Bitcoin timestamp

2 Internet Archive

3 Author claim open

4 Citations open

5 Replications open

✓ Portable graph bundle live

The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.

Claims

C1 · strongest claim

SimpleQA is a simple, targeted evaluation for whether models 'know what they know,' and our hope is that this benchmark will remain relevant for the next few generations of frontier models.

C2 · weakest assumption

Questions can be written so that each has a single, indisputable answer, and adversarial collection against GPT-4 responses yields questions that remain challenging for future models.

C3 · one-line summary

SimpleQA is a new benchmark of short, single-answer factual questions collected adversarially against GPT-4 to evaluate LLM factuality and confidence calibration.

References

19 extracted · 19 resolved · 4 Pith anchors

Formal links

3 machine-checked theorem links

Cited by

33 papers in Pith

Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

Task-Dependent Evaluation of LLM Output Homogenization: A Taxonomy-Guided Framework

ReFACT: A Benchmark for Scientific Confabulation Detection with Positional Error Annotations

Comprehensiveness Metrics for Automatic Evaluation of Factual Recall in Text Generation

MemSearcher: Training LLMs to Reason, Search and Manage Memory via End-to-End Reinforcement Learning

Receipt and verification

First computed	2026-05-17T23:38:53.221012Z
Builder	pith-number-builder-2026-05-17-v1
Signature	Pith Ed25519 (`pith-v1-2026-05`) · public key
Schema	pith-number/v1.0

Canonical hash

c1df9af5178976641c961966f8594000496af92ff9b0066c7067ad8d484a8e51

Aliases

arxiv: 2411.04368 · arxiv_version: 2411.04368v1 · doi: 10.48550/arxiv.2411.04368 · pith_short_12: YHPZV5IXRF3G · pith_short_16: YHPZV5IXRF3GIHEW · pith_short_8: YHPZV5IX
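The identifiers on this page appear to be related by simple truncation: each pith_short_N alias is the first N characters of the full Pith Number, and the full number matches the first 26 characters of the RFC 4648 Base32 encoding of the canonical hash. The page does not state this as a rule, so the sketch below treats it as an observation about this record only. The function names `pith_from_hash` and `short_aliases` are hypothetical.

```python
import base64

# Values copied from this page.
CANONICAL_HASH = "c1df9af5178976641c961966f8594000496af92ff9b0066c7067ad8d484a8e51"
PITH_NUMBER = "YHPZV5IXRF3GIHEWDFTPQWKAAB"


def pith_from_hash(hex_digest: str, length: int = 26) -> str:
    """Base32-encode the raw digest bytes and keep the first `length` characters.

    Assumption: this rule is inferred from the values on this page,
    not taken from a published specification.
    """
    encoded = base64.b32encode(bytes.fromhex(hex_digest)).decode("ascii")
    return encoded[:length]


def short_aliases(pith_number: str, lengths=(8, 12, 16)) -> dict:
    """Build the pith_short_N aliases as plain prefixes of the full number."""
    return {f"pith_short_{n}": pith_number[:n] for n in lengths}


derived = pith_from_hash(CANONICAL_HASH)
aliases = short_aliases(derived)

print(derived)
for name, value in aliases.items():
    print(name, value)
```

For this record the derived string equals the Pith Number shown above, and the three prefixes equal the listed short aliases. Other records may follow a different rule, so a check like this is a sanity test rather than a substitute for the resolver.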

Agent API

Resolver JSON · Graph JSON · Events JSON · Schema · Signing key

Verify this Pith Number yourself

curl -sH 'Accept: application/ld+json' https://pith.science/pith/YHPZV5IXRF3GIHEWDFTPQWKAAB \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: c1df9af5178976641c961966f8594000496af92ff9b0066c7067ad8d484a8e51

Canonical record JSON

{
  "metadata": {
    "abstract_canon_sha256": "bc6590af27d293a121a6fce13fd12d46a8a316c7aaf3e54b2a59f21071aca0f6",
    "cross_cats_sorted": [],
    "license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/",
    "primary_cat": "cs.CL",
    "submitted_at": "2024-11-07T01:58:42Z",
    "title_canon_sha256": "fd6601c4ac8b2d44a8b49a1794e90a34cc658b1e7eb5e0afc3ec76cf8436e8e7"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2411.04368",
    "kind": "arxiv",
    "version": 1
  }
}
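
The canonicalization used in the shell one-liner (sorted keys, compact separators, non-ASCII preserved, UTF-8 bytes, SHA-256) can also be run offline against the record above. The following is a minimal sketch, assuming the JSON shown here is a faithful copy of the canonical record. The helper name `canonical_sha256` is hypothetical, and the script does not assert that the digest matches, since that depends on the copy being exact.

```python
import hashlib
import json


def canonical_sha256(record: dict) -> str:
    """Serialize with sorted keys and compact separators, then hash the UTF-8 bytes.

    Mirrors the json.dumps arguments used in the shell one-liner on this page.
    """
    canonical = json.dumps(
        record, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Canonical record as displayed on this page.
record = {
    "metadata": {
        "abstract_canon_sha256": "bc6590af27d293a121a6fce13fd12d46a8a316c7aaf3e54b2a59f21071aca0f6",
        "cross_cats_sorted": [],
        "license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/",
        "primary_cat": "cs.CL",
        "submitted_at": "2024-11-07T01:58:42Z",
        "title_canon_sha256": "fd6601c4ac8b2d44a8b49a1794e90a34cc658b1e7eb5e0afc3ec76cf8436e8e7",
    },
    "schema_version": "1.0",
    "source": {"id": "2411.04368", "kind": "arxiv", "version": 1},
}

digest = canonical_sha256(record)
print(digest)  # compare by eye with the canonical hash listed on this page
```

Because the keys are sorted before hashing, the order in which a mirror stores the fields does not affect the digest; only the values and the serialization settings do.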