{"record_type":"pith_number_record","schema_url":"https://pith.science/schemas/pith-number/v1.json","pith_number":"pith:2022:5VIJKTYGFJN4FRCBTFGSRG5VDP","short_pith_number":"pith:5VIJKTYG","schema_version":"1.0","canonical_sha256":"ed50954f062a5bc2c441994d289bb51bdf05424bce550fcdaa56415b53886219","source":{"kind":"arxiv","id":"2201.10005","version":1},"attestation_state":"computed","paper":{"title":"Text and Code Embeddings by Contrastive Pre-Training","license":"http://creativecommons.org/licenses/by/4.0/","headline":"Contrastive pre-training on unsupervised data at scale produces high-quality embeddings for text and code that excel at classification and semantic search.","cross_cats":["cs.LG"],"primary_cat":"cs.CL","authors_text":"Alec Radford, Arvind Neelakantan, Boris Power, Chris Hallacy, David Schnurr, Felipe Petroski Such, Girish Sastry, Gretchen Krueger, Jerry Tworek, Jesse Michael Han, Joanne Jang, Johannes Heidecke, Jong Wook Kim, Kenny Hsu, Lilian Weng, Madeleine Thompson, Nikolas Tezak, Peter Welinder, Pranav Shyam, Qiming Yuan, Raul Puri, Tabarak Khan, Tao Xu, Toki Sherbakov, Tyna Eloundou Nekoul","submitted_at":"2022-01-24T23:36:20Z","abstract_excerpt":"Text embeddings are useful features in many applications such as semantic search and computing text similarity. Previous work typically trains models customized for different use cases, varying in dataset choice, training objective and model architecture. In this work, we show that contrastive pre-training on unsupervised data at scale leads to high quality vector representations of text and code. The same unsupervised text embeddings that achieve new state-of-the-art results in linear-probe classification also display impressive semantic search capabilities and sometimes even perform competit"},"verification_status":{"content_addressed":true,"pith_receipt":true,"author_attested":false,"weak_author_claims":0,"strong_author_claims":0,"externally_anchored":false,"storage_verified":false,"citation_signatures":0,"replication_records":0,"graph_snapshot":true,"references_resolved":true,"formal_links_present":true},"canonical_record":{"source":{"id":"2201.10005","kind":"arxiv","version":1},"metadata":{"license":"http://creativecommons.org/licenses/by/4.0/","primary_cat":"cs.CL","submitted_at":"2022-01-24T23:36:20Z","cross_cats_sorted":["cs.LG"],"title_canon_sha256":"d32b7e303865ce5031b7cfc62037bf2b96e4b588a12af2e81d53405049c55bd9","abstract_canon_sha256":"761edbd688580583a4dafae8ed9a78bc70310f2b381acc1cab0219956ddc1455"},"schema_version":"1.0"},"receipt":{"kind":"pith_receipt","key_id":"pith-v1-2026-05","algorithm":"ed25519","signed_at":"2026-05-17T23:38:50.435061Z","signature_b64":"a97A+yKMTgnLv/CCihkGcI1meBhvVCCRaJwI+U2TMXjEXBfG0xPcvHqEiiv7ZCAn1PREhEaLM+RVdqHa6LpUBg==","signed_message":"canonical_sha256_bytes","builder_version":"pith-number-builder-2026-05-17-v1","receipt_version":"0.3","canonical_sha256":"ed50954f062a5bc2c441994d289bb51bdf05424bce550fcdaa56415b53886219","last_reissued_at":"2026-05-17T23:38:50.434653Z","signature_status":"signed_v1","first_computed_at":"2026-05-17T23:38:50.434653Z","public_key_fingerprint":"8d4b5ee74e4693bcd1df2446408b0d54"},"graph_snapshot":{"paper":{"title":"Text and Code Embeddings by Contrastive Pre-Training","license":"http://creativecommons.org/licenses/by/4.0/","headline":"Contrastive pre-training on unsupervised data at scale produces high-quality embeddings for text and code that excel at classification and semantic search.","cross_cats":["cs.LG"],"primary_cat":"cs.CL","authors_text":"Alec Radford, Arvind Neelakantan, Boris Power, Chris Hallacy, David Schnurr, Felipe Petroski Such, Girish Sastry, Gretchen Krueger, Jerry Tworek, Jesse Michael Han, Joanne Jang, Johannes Heidecke, Jong Wook Kim, Kenny Hsu, Lilian Weng, Madeleine Thompson, Nikolas Tezak, Peter Welinder, Pranav Shyam, Qiming Yuan, Raul Puri, Tabarak Khan, Tao Xu, Toki Sherbakov, Tyna Eloundou Nekoul","submitted_at":"2022-01-24T23:36:20Z","abstract_excerpt":"Text embeddings are useful features in many applications such as semantic search and computing text similarity. Previous work typically trains models customized for different use cases, varying in dataset choice, training objective and model architecture. In this work, we show that contrastive pre-training on unsupervised data at scale leads to high quality vector representations of text and code. The same unsupervised text embeddings that achieve new state-of-the-art results in linear-probe classification also display impressive semantic search capabilities and sometimes even perform competit"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"contrastive pre-training on unsupervised data at scale leads to high quality vector representations of text and code. The same unsupervised text embeddings that achieve new state-of-the-art results in linear-probe classification also display impressive semantic search capabilities and sometimes even perform competitively with fine-tuned models.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That the contrastive objective applied to unsupervised pairs at scale captures semantic similarity in a way that generalizes beyond the specific benchmarks used and is not primarily driven by model scale or data volume alone.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"Contrastive pre-training on unsupervised data at scale creates text and code embeddings that set new state-of-the-art results on classification and semantic search benchmarks.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"Contrastive pre-training on unsupervised data at scale produces high-quality embeddings for text and code that excel at classification and semantic search.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"7a45acb93c6806ced9b29b0b9a3c746f6432c69f9b7f1646ace5df0a5b8b306f"},"source":{"id":"2201.10005","kind":"arxiv","version":1},"verdict":{"id":"c4457ccf-07e8-4bfc-8b81-17a68fd2ec52","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-15T19:20:15.627353Z","strongest_claim":"contrastive pre-training on unsupervised data at scale leads to high quality vector representations of text and code. The same unsupervised text embeddings that achieve new state-of-the-art results in linear-probe classification also display impressive semantic search capabilities and sometimes even perform competitively with fine-tuned models.","one_line_summary":"Contrastive pre-training on unsupervised data at scale creates text and code embeddings that set new state-of-the-art results on classification and semantic search benchmarks.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That the contrastive objective applied to unsupervised pairs at scale captures semantic similarity in a way that generalizes beyond the specific benchmarks used and is not primarily driven by model scale or data volume alone.","pith_extraction_headline":"Contrastive pre-training on unsupervised data at scale produces high-quality embeddings for text and code that excel at classification and semantic search."},"references":{"count":28,"sample":[{"doi":"","year":null,"title":"Evaluating Large Language Models Trained on Code","work_id":"042493e9-b26f-4b4e-bbde-382072ca9b08","ref_index":1,"cited_arxiv_id":"2107.03374","is_internal_anchor":true},{"doi":"","year":null,"title":"SentEval: An evaluation toolkit for universal sentence representations","work_id":"9ca81cef-98b4-4e74-ae6c-d2f5977db107","ref_index":2,"cited_arxiv_id":"1803.05449","is_internal_anchor":true},{"doi":"","year":2005,"title":"Cert: Contrastive self-supervised learning for language understanding","work_id":"c8bee491-a7cb-4564-bb24-4531a5283b58","ref_index":3,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"doi:10.48550/ARXIV.2109.10086","work_id":"b92b66c0-5a02-4966-91fe-a2935c54d59b","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2002,"title":"REALM: Retrieval-Augmented Language Model Pre-Training","work_id":"a397ddf8-b0b7-4e32-9d59-fb6ea67ac287","ref_index":5,"cited_arxiv_id":"2002.08909","is_internal_anchor":true}],"resolved_work":28,"snapshot_sha256":"d03ceb661eecf5b6a6f99f0a9a6f210b2b84550cc3bcbf2496bf55ebdcae9aa6","internal_anchors":15},"formal_canon":{"evidence_count":2,"snapshot_sha256":"c9e6d50bd2ee03b93f027ef512c23efb7197b14cea0bae96c746580f8469dbc9"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"},"aliases":[{"alias_kind":"arxiv","alias_value":"2201.10005","created_at":"2026-05-17T23:38:50.434715+00:00"},{"alias_kind":"arxiv_version","alias_value":"2201.10005v1","created_at":"2026-05-17T23:38:50.434715+00:00"},{"alias_kind":"doi","alias_value":"10.48550/arxiv.2201.10005","created_at":"2026-05-17T23:38:50.434715+00:00"},{"alias_kind":"pith_short_12","alias_value":"5VIJKTYGFJN4","created_at":"2026-05-18T12:33:33.725879+00:00"},{"alias_kind":"pith_short_16","alias_value":"5VIJKTYGFJN4FRCB","created_at":"2026-05-18T12:33:33.725879+00:00"},{"alias_kind":"pith_short_8","alias_value":"5VIJKTYG","created_at":"2026-05-18T12:33:33.725879+00:00"}],"events":[],"event_summary":{},"paper_claims":[],"inbound_citations":{"count":30,"internal_anchor_count":30,"sample":[{"citing_arxiv_id":"2401.03563","citing_title":"Data-CUBE: Data Curriculum for Instruction-based Sentence Representation Learning","ref_index":34,"is_internal_anchor":true},{"citing_arxiv_id":"2406.06587","citing_title":"TouchAI: Exploring human-AI perceptual alignment in touch through language model representations","ref_index":49,"is_internal_anchor":true},{"citing_arxiv_id":"2605.20268","citing_title":"Chronicle: A Multimodal Foundation Model for Joint Language and Time Series Understanding","ref_index":20,"is_internal_anchor":true},{"citing_arxiv_id":"2605.21455","citing_title":"Mitigating Label Bias with Interpretable Rubric Embeddings","ref_index":24,"is_internal_anchor":true},{"citing_arxiv_id":"2605.16608","citing_title":"To MRL or not to MRL: Text Embeddings are Robust to Truncation Without Matryoshka Embeddings, Except In Heavy Truncation Scenarios","ref_index":3,"is_internal_anchor":true},{"citing_arxiv_id":"2509.12539","citing_title":"LEAF: Knowledge Distillation of Text Embedding Models with Teacher-Aligned Representations","ref_index":24,"is_internal_anchor":true},{"citing_arxiv_id":"2408.00724","citing_title":"Inference Scaling Laws: An Empirical Analysis of Compute-Optimal Inference for Problem-Solving with Language Models","ref_index":77,"is_internal_anchor":true},{"citing_arxiv_id":"2511.01202","citing_title":"Forget BIT, It is All about TOKEN: Towards Semantic Information Theory for LLMs","ref_index":49,"is_internal_anchor":true},{"citing_arxiv_id":"2504.19793","citing_title":"Prompt Injection Attack to Tool Selection in LLM Agents","ref_index":41,"is_internal_anchor":true},{"citing_arxiv_id":"2604.04936","citing_title":"Web Retrieval-Aware Chunking (W-RAC) for Efficient and Cost-Effective Retrieval-Augmented Generation Systems","ref_index":16,"is_internal_anchor":true},{"citing_arxiv_id":"2312.02724","citing_title":"RankZephyr: Effective and Robust Zero-Shot Listwise Reranking is a Breeze!","ref_index":22,"is_internal_anchor":true},{"citing_arxiv_id":"2304.05376","citing_title":"ChemCrow: Augmenting large-language models with chemistry tools","ref_index":82,"is_internal_anchor":true},{"citing_arxiv_id":"2509.20354","citing_title":"EmbeddingGemma: Powerful and Lightweight Text Representations","ref_index":17,"is_internal_anchor":true},{"citing_arxiv_id":"2405.07987","citing_title":"The Platonic Representation Hypothesis","ref_index":4,"is_internal_anchor":true},{"citing_arxiv_id":"2405.17428","citing_title":"NV-Embed: Improved Techniques for Training LLMs as Generalist Embedding Models","ref_index":105,"is_internal_anchor":true},{"citing_arxiv_id":"2605.13521","citing_title":"Granite Embedding Multilingual R2 Models","ref_index":13,"is_internal_anchor":true},{"citing_arxiv_id":"2309.07597","citing_title":"C-Pack: Packed Resources For General Chinese Embeddings","ref_index":41,"is_internal_anchor":true},{"citing_arxiv_id":"2605.08809","citing_title":"SimReg: Achieving Higher Performance in the Pretraining via Embedding Similarity Regularization","ref_index":8,"is_internal_anchor":true},{"citing_arxiv_id":"2308.03281","citing_title":"Towards General Text Embeddings with Multi-stage Contrastive Learning","ref_index":84,"is_internal_anchor":true},{"citing_arxiv_id":"2604.26142","citing_title":"ImproBR: Bug Report Improver Using LLMs","ref_index":23,"is_internal_anchor":true},{"citing_arxiv_id":"2402.03216","citing_title":"M3-Embedding: Multi-Linguality, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation","ref_index":36,"is_internal_anchor":true},{"citing_arxiv_id":"2604.09087","citing_title":"DIAURec: Dual-Intent Space Representation Optimization for Recommendation","ref_index":32,"is_internal_anchor":true},{"citing_arxiv_id":"2605.07096","citing_title":"Query-efficient model evaluation using cached responses","ref_index":104,"is_internal_anchor":true},{"citing_arxiv_id":"2212.03533","citing_title":"Text Embeddings by Weakly-Supervised Contrastive Pre-training","ref_index":43,"is_internal_anchor":true},{"citing_arxiv_id":"2605.07834","citing_title":"GenAI Powered Dynamic Causal Inference with Unstructured Data","ref_index":6,"is_internal_anchor":true}]},"formal_canon":{"evidence_count":2,"sample":[],"anchors":[]},"links":{"html":"https://pith.science/pith/5VIJKTYGFJN4FRCBTFGSRG5VDP","json":"https://pith.science/pith/5VIJKTYGFJN4FRCBTFGSRG5VDP.json","graph_json":"https://pith.science/api/pith-number/5VIJKTYGFJN4FRCBTFGSRG5VDP/graph.json","events_json":"https://pith.science/api/pith-number/5VIJKTYGFJN4FRCBTFGSRG5VDP/events.json","paper":"https://pith.science/paper/5VIJKTYG"},"agent_actions":{"view_html":"https://pith.science/pith/5VIJKTYGFJN4FRCBTFGSRG5VDP","download_json":"https://pith.science/pith/5VIJKTYGFJN4FRCBTFGSRG5VDP.json","view_paper":"https://pith.science/paper/5VIJKTYG","resolve_alias":"https://pith.science/api/pith-number/resolve?arxiv=2201.10005&json=true","fetch_graph":"https://pith.science/api/pith-number/5VIJKTYGFJN4FRCBTFGSRG5VDP/graph.json","fetch_events":"https://pith.science/api/pith-number/5VIJKTYGFJN4FRCBTFGSRG5VDP/events.json","actions":{"anchor_timestamp":"https://pith.science/pith/5VIJKTYGFJN4FRCBTFGSRG5VDP/action/timestamp_anchor","attest_storage":"https://pith.science/pith/5VIJKTYGFJN4FRCBTFGSRG5VDP/action/storage_attestation","attest_author":"https://pith.science/pith/5VIJKTYGFJN4FRCBTFGSRG5VDP/action/author_attestation","sign_citation":"https://pith.science/pith/5VIJKTYGFJN4FRCBTFGSRG5VDP/action/citation_signature","submit_replication":"https://pith.science/pith/5VIJKTYGFJN4FRCBTFGSRG5VDP/action/replication_record"}},"created_at":"2026-05-17T23:38:50.434715+00:00","updated_at":"2026-05-17T23:38:50.434715+00:00"}