Pith Number

pith:5VIJKTYG

pith:2022:5VIJKTYGFJN4FRCBTFGSRG5VDP

not attested · not anchored · not stored · refs resolved

Text and Code Embeddings by Contrastive Pre-Training

Alec Radford, Arvind Neelakantan, Boris Power, Chris Hallacy, David Schnurr, Felipe Petroski Such, Girish Sastry, Gretchen Krueger, Jerry Tworek, Jesse Michael Han, Joanne Jang, Johannes Heidecke, Jong Wook Kim, Kenny Hsu, Lilian Weng, Madeleine Thompson, Nikolas Tezak, Peter Welinder, Pranav Shyam, Qiming Yuan, Raul Puri, Tabarak Khan, Tao Xu, Toki Sherbakov, Tyna Eloundou Nekoul

Contrastive pre-training on unsupervised data at scale produces high-quality embeddings for text and code that excel at classification and semantic search.

arxiv:2201.10005 v1 · 2022-01-24 · cs.CL · cs.LG

Add to your LaTeX paper

\usepackage{pith}
\pithnumber{5VIJKTYGFJN4FRCBTFGSRG5VDP}

Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.

Record completeness

1 Bitcoin timestamp

2 Internet Archive

3 Author claim open

4 Citations open

5 Replications open

✓ Portable graph bundle live

The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.

Claims

C1 · strongest claim

Contrastive pre-training on unsupervised data at scale leads to high-quality vector representations of text and code. The same unsupervised text embeddings that achieve new state-of-the-art results in linear-probe classification also display impressive semantic search capabilities and sometimes even perform competitively with fine-tuned models.

C2 · weakest assumption

That the contrastive objective applied to unsupervised pairs at scale captures semantic similarity in a way that generalizes beyond the specific benchmarks used and is not primarily driven by model scale or data volume alone.

C3 · one-line summary

Contrastive pre-training on unsupervised data at scale creates text and code embeddings that set new state-of-the-art results on classification and semantic search benchmarks.

References

28 extracted · 28 resolved · 15 Pith anchors

[1] Evaluating Large Language Models Trained on Code · arXiv:2107.03374

[2] SentEval: An evaluation toolkit for universal sentence representations · arXiv:1803.05449

[3] Cert: Contrastive self-supervised learning for language understanding 2005

[4] doi:10.48550/ARXIV.2109.10086

[5] REALM: Retrieval-Augmented Language Model Pre-Training · arXiv:2002.08909

Formal links

2 machine-checked theorem links

Cited by

30 papers in Pith

Data-CUBE: Data Curriculum for Instruction-based Sentence Representation Learning

TouchAI: Exploring human-AI perceptual alignment in touch through language model representations

Chronicle: A Multimodal Foundation Model for Joint Language and Time Series Understanding

Mitigating Label Bias with Interpretable Rubric Embeddings

To MRL or not to MRL: Text Embeddings are Robust to Truncation Without Matryoshka Embeddings, Except In Heavy Truncation Scenarios

Receipt and verification

First computed	2026-05-17T23:38:50.434653Z
Builder	pith-number-builder-2026-05-17-v1
Signature	Pith Ed25519 (`pith-v1-2026-05`) · public key
Schema	pith-number/v1.0

Canonical hash

ed50954f062a5bc2c441994d289bb51bdf05424bce550fcdaa56415b53886219

Aliases

arxiv: 2201.10005 · arxiv_version: 2201.10005v1 · doi: 10.48550/arxiv.2201.10005 · pith_short_12: 5VIJKTYGFJN4 · pith_short_16: 5VIJKTYGFJN4FRCB · pith_short_8: 5VIJKTYG
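The short aliases listed here look like plain prefixes of the full identifier. For this record, the full identifier also matches the first 26 characters of the RFC 4648 Base32 encoding of the canonical hash. The sketch below checks both relationships. It is an observation about this single record, not a documented derivation rule; the helper names `pith_id_from_hash` and `short_aliases` and the fixed length of 26 are assumptions made for illustration.

```python
import base64

# Values copied from this record.
CANONICAL_HASH = "ed50954f062a5bc2c441994d289bb51bdf05424bce550fcdaa56415b53886219"
FULL_ID = "5VIJKTYGFJN4FRCBTFGSRG5VDP"


def pith_id_from_hash(hex_digest: str, length: int = 26) -> str:
    """Hypothetical helper: Base32-encode the digest bytes and keep a prefix.

    The 26-character length is inferred from this record, not from a spec.
    """
    encoded = base64.b32encode(bytes.fromhex(hex_digest)).decode("ascii")
    return encoded[:length]


def short_aliases(full_id: str) -> dict:
    """Hypothetical helper: build the 8/12/16-character aliases by truncation."""
    return {f"pith_short_{n}": full_id[:n] for n in (8, 12, 16)}


derived_id = pith_id_from_hash(CANONICAL_HASH)
aliases = short_aliases(FULL_ID)

print(derived_id)
print(aliases)
```

If the identifier really is derived this way, a client could sanity-check an identifier against a canonical hash without a network call. That remains an assumption until confirmed against the schema.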

Verify this Pith Number yourself

curl -sH 'Accept: application/ld+json' https://pith.science/pith/5VIJKTYGFJN4FRCBTFGSRG5VDP \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: ed50954f062a5bc2c441994d289bb51bdf05424bce550fcdaa56415b53886219

Canonical record JSON

{
  "metadata": {
    "abstract_canon_sha256": "761edbd688580583a4dafae8ed9a78bc70310f2b381acc1cab0219956ddc1455",
    "cross_cats_sorted": [
      "cs.LG"
    ],
    "license": "http://creativecommons.org/licenses/by/4.0/",
    "primary_cat": "cs.CL",
    "submitted_at": "2022-01-24T23:36:20Z",
    "title_canon_sha256": "d32b7e303865ce5031b7cfc62037bf2b96e4b588a12af2e81d53405049c55bd9"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2201.10005",
    "kind": "arxiv",
    "version": 1
  }
}
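The shell pipeline above re-serializes the record with sorted keys and compact separators before hashing it. A minimal offline sketch of the same canonicalization step, applied to the record as displayed here, might look like the following. The function name `canonical_sha256` is hypothetical. Whether the digest of this displayed copy equals the published canonical hash depends on the copy being identical to what the resolver returns, which this sketch does not check.

```python
import hashlib
import json

# The canonical record as displayed on this page.
RECORD_JSON = """
{
  "metadata": {
    "abstract_canon_sha256": "761edbd688580583a4dafae8ed9a78bc70310f2b381acc1cab0219956ddc1455",
    "cross_cats_sorted": ["cs.LG"],
    "license": "http://creativecommons.org/licenses/by/4.0/",
    "primary_cat": "cs.CL",
    "submitted_at": "2022-01-24T23:36:20Z",
    "title_canon_sha256": "d32b7e303865ce5031b7cfc62037bf2b96e4b588a12af2e81d53405049c55bd9"
  },
  "schema_version": "1.0",
  "source": {"id": "2201.10005", "kind": "arxiv", "version": 1}
}
"""


def canonical_sha256(record: dict) -> str:
    """Hypothetical helper mirroring the one-liner: sorted keys, compact
    separators, no ASCII escaping, UTF-8 bytes, then SHA-256."""
    canonical = json.dumps(
        record, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    ).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()


record = json.loads(RECORD_JSON)
digest = canonical_sha256(record)

# Compare this value by eye with the canonical hash listed in the record.
print(digest)
```

Because the keys are sorted before hashing, the digest does not depend on the order in which a mirror happens to store or emit the fields.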