Pith Number

pith:5VIJKTYG

pith:2022:5VIJKTYGFJN4FRCBTFGSRG5VDP
not attested · not anchored · not stored · refs resolved

Text and Code Embeddings by Contrastive Pre-Training

Alec Radford, Arvind Neelakantan, Boris Power, Chris Hallacy, David Schnurr, Felipe Petroski Such, Girish Sastry, Gretchen Krueger, Jerry Tworek, Jesse Michael Han, Joanne Jang, Johannes Heidecke, Jong Wook Kim, Kenny Hsu, Lilian Weng, Madeleine Thompson, Nikolas Tezak, Peter Welinder, Pranav Shyam, Qiming Yuan, Raul Puri, Tabarak Khan, Tao Xu, Toki Sherbakov, Tyna Eloundou Nekoul

Contrastive pre-training on unsupervised data at scale produces high-quality embeddings for text and code that excel at classification and semantic search.

arxiv:2201.10005 v1 · 2022-01-24 · cs.CL · cs.LG

Add to your LaTeX paper
\usepackage{pith}
\pithnumber{5VIJKTYGFJN4FRCBTFGSRG5VDP}

Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.

Record completeness

1 Bitcoin timestamp
2 Internet Archive
3 Author claim open
4 Citations open
5 Replications open
Portable graph bundle live · download bundle · merged state
The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.

Claims

C1 strongest claim

Contrastive pre-training on unsupervised data at scale leads to high-quality vector representations of text and code. The same unsupervised text embeddings that achieve new state-of-the-art results in linear-probe classification also display impressive semantic search capabilities and sometimes even perform competitively with fine-tuned models.

C2 weakest assumption

That the contrastive objective applied to unsupervised pairs at scale captures semantic similarity in a way that generalizes beyond the specific benchmarks used and is not primarily driven by model scale or data volume alone.

C3 one line summary

Contrastive pre-training on unsupervised data at scale creates text and code embeddings that set new state-of-the-art results on classification and semantic search benchmarks.

References

28 extracted · 28 resolved · 15 Pith anchors

[1] Evaluating Large Language Models Trained on Code · arXiv:2107.03374
[2] SentEval: An evaluation toolkit for universal sentence representations · arXiv:1803.05449
[3] Cert: Contrastive self-supervised learning for language understanding 2005
[4] doi:10.48550/ARXIV.2109.10086
[5] REALM: Retrieval-Augmented Language Model Pre-Training · arXiv:2002.08909

Formal links

2 machine-checked theorem links

Cited by

30 papers in Pith

Receipt and verification
First computed 2026-05-17T23:38:50.434653Z
Builder pith-number-builder-2026-05-17-v1
Signature Pith Ed25519 (pith-v1-2026-05) · public key
Schema pith-number/v1.0

Canonical hash

ed50954f062a5bc2c441994d289bb51bdf05424bce550fcdaa56415b53886219

Aliases

arxiv: 2201.10005 · arxiv_version: 2201.10005v1 · doi: 10.48550/arxiv.2201.10005 · pith_short_12: 5VIJKTYGFJN4 · pith_short_16: 5VIJKTYGFJN4FRCB · pith_short_8: 5VIJKTYG
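The identifiers on this page appear to be related to one another. For this record, Base32-encoding the raw bytes of the canonical SHA-256 hash (standard RFC 4648 alphabet) and keeping the first 26 characters reproduces the Pith Number, and the short aliases are its 8-, 12-, and 16-character prefixes. The page does not state this rule, so the sketch below is an observation about this one record rather than a documented specification. The function name `pith_number_from_hash` is hypothetical.

```python
import base64

# Values copied from this page.
CANONICAL_HASH = "ed50954f062a5bc2c441994d289bb51bdf05424bce550fcdaa56415b53886219"
PITH_NUMBER = "5VIJKTYGFJN4FRCBTFGSRG5VDP"


def pith_number_from_hash(hex_digest: str, length: int = 26) -> str:
    """Base32-encode the digest bytes and keep the first `length` characters.

    Assumption: this mapping is inferred from the values on this page,
    not taken from any published Pith specification.
    """
    raw = bytes.fromhex(hex_digest)
    return base64.b32encode(raw).decode("ascii")[:length]


derived = pith_number_from_hash(CANONICAL_HASH)

# Short aliases as prefixes of the derived identifier.
short_aliases = {n: derived[:n] for n in (8, 12, 16)}

print(derived == PITH_NUMBER)
print(short_aliases)
```

If the relationship holds more generally, a client could sanity-check that a Pith Number and a canonical hash belong together without a network call. That generalization is untested here.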
Agent API
Verify this Pith Number yourself
curl -sH 'Accept: application/ld+json' https://pith.science/pith/5VIJKTYGFJN4FRCBTFGSRG5VDP \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: ed50954f062a5bc2c441994d289bb51bdf05424bce550fcdaa56415b53886219
Canonical record JSON
{
  "metadata": {
    "abstract_canon_sha256": "761edbd688580583a4dafae8ed9a78bc70310f2b381acc1cab0219956ddc1455",
    "cross_cats_sorted": [
      "cs.LG"
    ],
    "license": "http://creativecommons.org/licenses/by/4.0/",
    "primary_cat": "cs.CL",
    "submitted_at": "2022-01-24T23:36:20Z",
    "title_canon_sha256": "d32b7e303865ce5031b7cfc62037bf2b96e4b588a12af2e81d53405049c55bd9"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2201.10005",
    "kind": "arxiv",
    "version": 1
  }
}
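The shell one-liner under "Verify this Pith Number yourself" canonicalizes the record (sorted keys, compact separators, non-ASCII left unescaped) and hashes the UTF-8 bytes with SHA-256. The same steps can be sketched offline in Python against the JSON shown above. The helper name `canonical_sha256` is hypothetical, and whether the printed digest equals the listed canonical hash depends on the served record being identical to the JSON reproduced on this page.

```python
import hashlib
import json

# The canonical record as displayed on this page.
record = {
    "metadata": {
        "abstract_canon_sha256": "761edbd688580583a4dafae8ed9a78bc70310f2b381acc1cab0219956ddc1455",
        "cross_cats_sorted": ["cs.LG"],
        "license": "http://creativecommons.org/licenses/by/4.0/",
        "primary_cat": "cs.CL",
        "submitted_at": "2022-01-24T23:36:20Z",
        "title_canon_sha256": "d32b7e303865ce5031b7cfc62037bf2b96e4b588a12af2e81d53405049c55bd9",
    },
    "schema_version": "1.0",
    "source": {"id": "2201.10005", "kind": "arxiv", "version": 1},
}


def canonical_sha256(obj) -> str:
    """Serialize with the same json.dumps options as the one-liner, then hash."""
    payload = json.dumps(
        obj,
        sort_keys=True,          # key order must not affect the digest
        separators=(",", ":"),   # no whitespace between tokens
        ensure_ascii=False,      # keep non-ASCII characters as UTF-8
    ).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


digest = canonical_sha256(record)
print(digest)  # compare with the canonical hash listed on this page
```

Because keys are sorted before hashing, two mirrors that hold the same record with different key order should compute the same digest.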