Pith Number

pith:TOWXDDJW

pith:2020:TOWXDDJWQSC2EPTIXWHQYSZV5P

not attested · not anchored · not stored · refs resolved

REALM: Retrieval-Augmented Language Model Pre-Training

Kelvin Guu, Kenton Lee, Ming-Wei Chang, Panupong Pasupat, Zora Tung

Language models pre-trained with an integrated retriever over a document corpus outperform prior methods on open-domain question answering by 4 to 16 points of absolute accuracy.

arxiv:2002.08909 v1 · 2020-02-10 · cs.CL · cs.LG


Add to your LaTeX paper

\usepackage{pith}
\pithnumber{TOWXDDJWQSC2EPTIXWHQYSZV5P}

Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.

Record completeness

1 Bitcoin timestamp

2 Internet Archive

3 Author claim open

4 Citations open

5 Replications open

✓ Portable graph bundle live · download bundle · merged state

The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.

Claims

C1 · strongest claim

We demonstrate the effectiveness of Retrieval-Augmented Language Model pre-training (REALM) by fine-tuning on the challenging task of Open-domain Question Answering (Open-QA). We compare against state-of-the-art models for both explicit and implicit knowledge storage on three popular Open-QA benchmarks, and find that we outperform all previous methods by a significant margin (4-16% absolute accuracy).

C2 · weakest assumption

That back-propagation through a retrieval step over millions of documents is numerically stable and provides a useful unsupervised learning signal for the retriever parameters.
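For context, the retrieval step this assumption refers to can be written out as follows. This is a sketch recalled from the REALM paper itself, not something stated in this record: the symbols $x$ (input), $y$ (prediction), $z$ (retrieved document), $\mathcal{Z}$ (corpus), $f$ (relevance score), and $\theta$ (retriever parameters) are notation chosen here for illustration.

```latex
\begin{align}
p(y \mid x) &= \sum_{z \in \mathcal{Z}} p(y \mid z, x)\, p(z \mid x), \\
p(z \mid x) &= \frac{\exp f(x, z)}{\sum_{z' \in \mathcal{Z}} \exp f(x, z')}, \qquad
f(x, z) = \mathrm{Embed}_{\text{input}}(x)^{\top}\, \mathrm{Embed}_{\text{doc}}(z), \\
\nabla_{\theta} \log p(y \mid x) &= \sum_{z \in \mathcal{Z}}
  \left[ \frac{p(y \mid z, x)}{p(y \mid x)} - 1 \right] p(z \mid x)\, \nabla_{\theta} f(x, z).
\end{align}
```

Read this way, the learning signal for the retriever raises the score of a document $z$ when conditioning on it predicts $y$ better than the marginal does, and lowers it otherwise. The assumption flagged above is that this signal stays usable when the sum over $\mathcal{Z}$ is approximated by a small set of top-scoring documents.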

C3 · one line summary

REALM augments language-model pre-training with an unsupervised retriever over Wikipedia documents and reports 4-16% absolute gains on open-domain QA benchmarks over prior implicit and explicit knowledge methods.

References

20 extracted · 20 resolved · 11 Pith anchors

[1] arXiv preprint arXiv:1911.10470 · 2019

[2] Neural Machine Translation by Jointly Learning to Align and Translate · arXiv:1409.0473

[3] Semantic Parsing on Freebase from Question-Answer Pairs · 2013

[4] BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding · arXiv:1810.04805

[5] Neural Turing Machines · arXiv:1410.5401

Formal links

3 machine-checked theorem links

Cited by

29 papers in Pith

Perhaps PTLMs Should Go to School -- A Task to Assess Open Book and Closed Book QA

AI Safety Landscape for Large Language Models: Taxonomy, State-of-the-art, and Future Directions

Evaluation of Chunking Strategies for Effective Text Embedding in Low-Resource Language on Agricultural Documents

HPC-LLM: Practical Domain Adaptation and Retrieval-Augmented Generation for HPC Support

LIMO: Less is More for Reasoning

Receipt and verification

First computed	2026-05-17T23:38:52.829438Z
Builder	pith-number-builder-2026-05-17-v1
Signature	Pith Ed25519 (`pith-v1-2026-05`) · public key
Schema	pith-number/v1.0

Canonical hash

9bad718d368485a23e68bd8f0c4b35ebf7fe612a69c5579e3ba4838f495dc45e

Aliases

arxiv: 2002.08909 · arxiv_version: 2002.08909v1 · doi: 10.48550/arxiv.2002.08909 · pith_short_12: TOWXDDJWQSC2 · pith_short_16: TOWXDDJWQSC2EPTI · pith_short_8: TOWXDDJW
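The identifiers in this record appear to be related in a simple way: the three `pith_short_*` aliases are prefixes of the full 26-character Pith Number, and the Pith Number itself matches the first 26 characters of the RFC 4648 base32 encoding of the canonical hash. That second relationship is an observation about this one record, not a rule stated on the page, so the sketch below should be read as an assumption being checked rather than the defined scheme. The helper names `base32_prefix` and `short_aliases` are hypothetical.

```python
import base64

# Values copied from this record.
CANONICAL_HASH = "9bad718d368485a23e68bd8f0c4b35ebf7fe612a69c5579e3ba4838f495dc45e"
PITH_NUMBER = "TOWXDDJWQSC2EPTIXWHQYSZV5P"


def base32_prefix(hex_digest: str, length: int) -> str:
    """Return the first `length` characters of the base32 encoding of a hex digest."""
    raw = bytes.fromhex(hex_digest)
    return base64.b32encode(raw).decode("ascii")[:length]


def short_aliases(pith_number: str) -> dict[str, str]:
    """Truncate a full Pith Number to its 8-, 12-, and 16-character short forms."""
    return {f"pith_short_{n}": pith_number[:n] for n in (8, 12, 16)}


# Assumption under test: Pith Number == base32(canonical hash), truncated.
derived = base32_prefix(CANONICAL_HASH, len(PITH_NUMBER))
aliases = short_aliases(PITH_NUMBER)

print("derived from hash:", derived)
print("matches record:   ", derived == PITH_NUMBER)
for name, value in aliases.items():
    print(f"{name}: {value}")
```

If the relationship holds more generally, a short alias would identify the leading bits of the canonical hash, which is why longer aliases are the safer choice wherever collisions matter.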

Agent API

Resolver JSON · Graph JSON · Events JSON · Schema · Signing key

Verify this Pith Number yourself

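# Fetch the record, re-serialize canonical_record as compact, key-sorted JSON, and print its SHA-256.
curl -sH 'Accept: application/ld+json' https://pith.science/pith/TOWXDDJWQSC2EPTIXWHQYSZV5P \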
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: 9bad718d368485a23e68bd8f0c4b35ebf7fe612a69c5579e3ba4838f495dc45e

Canonical record JSON

{
  "metadata": {
    "abstract_canon_sha256": "a232dfc972dce5d964abce327735d56ef564c47b6287fe4dc0ed536c127173cb",
    "cross_cats_sorted": [
      "cs.LG"
    ],
    "license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/",
    "primary_cat": "cs.CL",
    "submitted_at": "2020-02-10T18:40:59Z",
    "title_canon_sha256": "4cc1d8fa32eb3843fc491ea769947edd01955ffa807b5f78126a614095724e51"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2002.08909",
    "kind": "arxiv",
    "version": 1
  }
}
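The same check can be sketched offline against the record printed above, without fetching anything. This assumes that the JSON shown here is byte-for-byte the `canonical_record` the resolver returns and that the canonical form is exactly what the shell one-liner produces (sorted keys, compact separators, UTF-8). The function names `canonical_bytes` and `canonical_hash` are hypothetical.

```python
import hashlib
import json

# The canonical record as printed on this page.
RECORD = {
    "metadata": {
        "abstract_canon_sha256": "a232dfc972dce5d964abce327735d56ef564c47b6287fe4dc0ed536c127173cb",
        "cross_cats_sorted": ["cs.LG"],
        "license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/",
        "primary_cat": "cs.CL",
        "submitted_at": "2020-02-10T18:40:59Z",
        "title_canon_sha256": "4cc1d8fa32eb3843fc491ea769947edd01955ffa807b5f78126a614095724e51",
    },
    "schema_version": "1.0",
    "source": {"id": "2002.08909", "kind": "arxiv", "version": 1},
}

# The hash this page says to expect.
EXPECTED = "9bad718d368485a23e68bd8f0c4b35ebf7fe612a69c5579e3ba4838f495dc45e"


def canonical_bytes(record: dict) -> bytes:
    """Serialize with sorted keys and no insignificant whitespace, as UTF-8."""
    text = json.dumps(record, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return text.encode("utf-8")


def canonical_hash(record: dict) -> str:
    """SHA-256 hex digest of the canonical serialization."""
    return hashlib.sha256(canonical_bytes(record)).hexdigest()


digest = canonical_hash(RECORD)
print("computed:", digest)
print("expected:", EXPECTED)
print("match:   ", digest == EXPECTED)
```

Because keys are sorted at every nesting level before hashing, the digest does not depend on the order in which a mirror happens to store or emit the fields.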