Recognition: 2 theorem links
Text and Code Embeddings by Contrastive Pre-Training
Pith reviewed 2026-05-15 19:20 UTC · model grok-4.3
The pith
Contrastive pre-training on unsupervised data at scale produces high-quality embeddings for text and code that excel at classification and semantic search.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Contrastive pre-training on unsupervised data at scale yields high-quality vector representations of text and code. The same unsupervised text embeddings that set new state-of-the-art results in linear-probe classification also show strong semantic search performance and sometimes compete with fine-tuned models: relative improvements of 4% and 1.8% over the previous best unsupervised and supervised text embedding models on classification, and up to 23.4% over the previous best unsupervised method on MSMARCO semantic search.
What carries the argument
Contrastive pre-training objective applied to large-scale unsupervised text-text and text-code pairs, which pulls matching pairs together and pushes non-matching pairs apart in embedding space.
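A minimal sketch of this kind of in-batch contrastive (InfoNCE) objective in PyTorch, assuming paired encoders whose outputs arrive as batched embeddings and a fixed temperature; this illustrates the general recipe, not the paper's exact implementation:

```python
import torch
import torch.nn.functional as F

def info_nce_loss(query_emb, doc_emb, temperature=0.05):
    """In-batch contrastive loss over a batch of matching pairs.

    query_emb, doc_emb: [batch, dim] embeddings of paired inputs
    (text-text or text-code). Row i of each tensor is a matching pair;
    every other row in the batch serves as an in-batch negative.
    """
    # Cosine similarity via L2 normalization, scaled by temperature.
    q = F.normalize(query_emb, dim=-1)
    d = F.normalize(doc_emb, dim=-1)
    logits = q @ d.T / temperature                      # [batch, batch]
    labels = torch.arange(q.size(0), device=q.device)   # diagonal = positives
    # Symmetric cross-entropy pulls matched pairs together and pushes
    # mismatched pairs apart in both directions.
    return (F.cross_entropy(logits, labels) +
            F.cross_entropy(logits.T, labels)) / 2
```

Larger batches supply more in-batch negatives, which is one reason scale matters for this objective.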
If this is right
- The embeddings support direct use in semantic search without task-specific fine-tuning (see the retrieval sketch after this list).
- Linear-probe classification accuracy exceeds prior best unsupervised and supervised embedding models on average across the evaluated tasks.
- Code search performance improves substantially when the same contrastive method is applied to text-code pairs.
- Unsupervised embeddings can reach competitive levels with fine-tuned models on multiple benchmarks.
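The first point above reduces to nearest-neighbor lookup in embedding space. A minimal retrieval sketch in Python, where `embed` is a hypothetical stand-in for the pre-trained encoder (not an API from the paper):

```python
import numpy as np

def embed(texts):
    """Hypothetical wrapper around the frozen pre-trained encoder:
    maps a list of strings to an [n, dim] array of L2-normalized vectors."""
    raise NotImplementedError

def search(query, corpus, k=5):
    """Rank corpus passages by cosine similarity to the query embedding.
    No task-specific fine-tuning: frozen embeddings are used directly."""
    corpus_vecs = embed(corpus)        # [n, dim], unit-norm rows
    query_vec = embed([query])[0]      # [dim]
    scores = corpus_vecs @ query_vec   # dot product = cosine on unit vectors
    top = np.argsort(-scores)[:k]
    return [(corpus[i], float(scores[i])) for i in top]
```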
Where Pith is reading between the lines
- A single set of embeddings could reduce the need for maintaining separate models for classification and retrieval in production systems.
- The method may extend to other structured data such as tables or scientific abstracts if suitable unsupervised pairs can be constructed.
- Future scaling experiments could isolate whether the gains come primarily from data volume, model size, or the specific contrastive loss.
Load-bearing premise
The contrastive objective on unsupervised pairs learns semantic similarity that transfers to the downstream classification and search tasks beyond the training data distribution.
What would settle it
A model trained on the same data volume and scale but with a non-contrastive objective such as standard language modeling fails to match the reported accuracy on the linear-probe tasks and the MSMARCO, Natural Questions, and code search benchmarks.
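The linear-probe side of such a head-to-head is straightforward to specify: freeze each model's embeddings and fit only a linear classifier on top, so any accuracy gap reflects representation quality rather than task-specific tuning. A minimal sketch using scikit-learn, where `embed` is a hypothetical callable standing in for each frozen encoder:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

def linear_probe_accuracy(embed, train_texts, train_labels,
                          test_texts, test_labels):
    """Fit a linear classifier on frozen embeddings; report test accuracy.
    Running this with a contrastive model vs. a matched language-model
    baseline would isolate the contribution of the training objective."""
    clf = LogisticRegression(max_iter=1000)
    clf.fit(embed(train_texts), train_labels)
    preds = clf.predict(embed(test_texts))
    return accuracy_score(test_labels, preds)
```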
read the original abstract
Text embeddings are useful features in many applications such as semantic search and computing text similarity. Previous work typically trains models customized for different use cases, varying in dataset choice, training objective and model architecture. In this work, we show that contrastive pre-training on unsupervised data at scale leads to high quality vector representations of text and code. The same unsupervised text embeddings that achieve new state-of-the-art results in linear-probe classification also display impressive semantic search capabilities and sometimes even perform competitively with fine-tuned models. On linear-probe classification accuracy averaging over 7 tasks, our best unsupervised model achieves a relative improvement of 4% and 1.8% over previous best unsupervised and supervised text embedding models respectively. The same text embeddings when evaluated on large-scale semantic search attains a relative improvement of 23.4%, 14.7%, and 10.6% over previous best unsupervised methods on MSMARCO, Natural Questions and TriviaQA benchmarks, respectively. Similarly to text embeddings, we train code embedding models on (text, code) pairs, obtaining a 20.8% relative improvement over prior best work on code search.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that contrastive pre-training on large-scale unsupervised data produces high-quality vector embeddings for text and code. The same embeddings achieve new state-of-the-art linear-probe classification accuracy (4% relative gain over prior unsupervised baselines and 1.8% over supervised), strong semantic search results (23.4% relative gain on MSMARCO, 14.7% on Natural Questions, 10.6% on TriviaQA), and a 20.8% relative gain on code search, sometimes matching fine-tuned models.
Significance. If the central claim holds, the work shows that a single scalable unsupervised contrastive procedure can yield versatile embeddings competitive across classification and retrieval without task-specific customization, which would simplify embedding pipelines and improve semantic search performance.
major comments (2)
- [§4 (Experiments)] §4 (Experiments) and abstract: the reported relative improvements are presented without ablations that hold model size, parameter count, training tokens, and data distribution fixed while swapping only the contrastive loss for a standard language-modeling objective; this leaves open whether the gains are driven by the contrastive formulation or by scale and data volume alone, directly undermining attribution of the central claim.
- [Abstract and §3] Abstract and §3 (Method): no details are given on exact model sizes, training procedures, hyper-parameters, or error bars on the reported accuracies and relative gains (4%, 23.4%, etc.), preventing verification that the results are robust or reproducible.
minor comments (2)
- [Table 1] Table 1 and §4.2: clarify whether the linear-probe results use the same embedding dimension across all compared models.
- [§5] §5 (Discussion): add a short paragraph on potential failure modes when the unsupervised pairs contain noisy or non-semantic alignments.
Simulated Author's Rebuttal
Thank you for reviewing our manuscript. We appreciate the feedback highlighting the need for clearer ablations and more detailed reporting. Below we respond to each major comment and indicate the changes we will make in the revised version.
read point-by-point responses
- Referee: §4 (Experiments) and abstract: the reported relative improvements are presented without ablations that hold model size, parameter count, training tokens, and data distribution fixed while swapping only the contrastive loss for a standard language-modeling objective; this leaves open whether the gains are driven by the contrastive formulation or by scale and data volume alone, directly undermining attribution of the central claim.
Authors: We agree that a controlled ablation isolating the contrastive objective while holding model size, token count, and data fixed would strengthen causal attribution. Our submission demonstrates practical gains relative to prior unsupervised methods, but does not include this exact comparison. In revision we will add a discussion of this limitation in §4 and, resources permitting, include a matched-scale experiment contrasting the two objectives on a subset of the data. revision: partial
- Referee: Abstract and §3 (Method): no details are given on exact model sizes, training procedures, hyper-parameters, or error bars on the reported accuracies and relative gains (4%, 23.4%, etc.), preventing verification that the results are robust or reproducible.
Authors: We will expand §3 and add an appendix with exact model sizes, layer counts, hidden dimensions, training token volumes, optimizer settings, batch sizes, learning-rate schedules, and data mixture details. Where multiple random seeds were run we will report standard deviations; otherwise we will note single-run results and the associated variance observed in pilot experiments. revision: yes
Circularity Check
No significant circularity in empirical results
full rationale
The paper reports direct empirical measurements of model performance after contrastive pre-training on unsupervised text and code data. Evaluations use held-out benchmarks (linear-probe classification, semantic search on MSMARCO/Natural Questions/TriviaQA, code search) with no mathematical derivation chain, no fitted parameters renamed as predictions, and no load-bearing self-citations that reduce claims to tautologies. Central results are measured outcomes on external test sets rather than quantities constructed from the training inputs themselves.
Axiom & Free-Parameter Ledger
free parameters (2)
- model architecture and scale
- training data volume and filtering
axioms (1)
- domain assumption: Contrastive loss on (text, text) or (text, code) pairs produces embeddings that reflect semantic similarity.
Lean theorems connected to this paper
- Cost.FunctionalEquation (Jcost uniqueness and RCL family) · tag: echoes
ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
Passage: "contrastive pre-training on unsupervised data at scale leads to high quality vector representations of text and code. The same unsupervised text embeddings that achieve new state-of-the-art results in linear-probe classification also display impressive semantic search capabilities"
- Foundation.LawOfExistence (defect zero iff unity) · tag: echoes
ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
Passage: "we show that contrastive pre-training on unsupervised data at scale leads to high quality vector representations"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 22 Pith papers
- GenAI Powered Dynamic Causal Inference with Unstructured Data · A GenAI-based method extracts representations from unstructured data and uses a neural network to fit marginal structural models that recover causal effects of treatment feature sequences, including their positions.
- Prompt Injection Attack to Tool Selection in LLM Agents · ToolHijacker optimizes malicious tool documents via a two-phase strategy to hijack LLM agents' tool selection in no-box settings.
- M3-Embedding: Multi-Linguality, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation · M3-Embedding is a single model for multi-lingual, multi-functional, and multi-granular text embeddings trained via self-knowledge distillation that achieves new state-of-the-art results on multilingual, cross-lingual,...
- C-Pack: Packed Resources For General Chinese Embeddings · C-Pack releases a new Chinese embedding benchmark, large training dataset, and optimized models that outperform priors by up to 10% on C-MTEB while also delivering English SOTA results.
- Query-efficient model evaluation using cached responses · DKPS-based methods leverage cached model responses to achieve equivalent benchmark prediction accuracy with substantially fewer queries than standard evaluation.
- ImproBR: Bug Report Improver Using LLMs · ImproBR combines a hybrid detector with GPT-4o mini and RAG to raise bug report structural completeness from 7.9% to 96.4% and executable steps from 28.8% to 67.6% on 139 Mojira reports.
- Beyond Text-Dominance: Understanding Modality Preference of Omni-modal Large Language Models · Omni-modal LLMs exhibit visual preference that emerges in mid-to-late layers, enabling hallucination detection without task-specific training.
- LLMs Corrupt Your Documents When You Delegate · LLMs corrupt an average of 25% of document content during long delegated editing workflows across 52 domains, even frontier models, and agentic tools do not mitigate the issue.
- Web Retrieval-Aware Chunking (W-RAC) for Efficient and Cost-Effective Retrieval-Augmented Generation Systems · W-RAC decouples extraction from semantic planning via structured units and LLM grouping to match traditional retrieval performance at roughly 10x lower LLM token cost.
- EmbeddingGemma: Powerful and Lightweight Text Representations · A 300M-parameter open embedding model sets new SOTA on MTEB for its size class and matches models twice as large while staying effective when compressed.
- NV-Embed: Improved Techniques for Training LLMs as Generalist Embedding Models · NV-Embed achieves first place on the MTEB leaderboard across 56 tasks by combining a latent attention layer, causal-mask removal, two-stage contrastive training, and data curation for LLM-based embedding models.
- RankZephyr: Effective and Robust Zero-Shot Listwise Reranking is a Breeze! · RankZephyr is a new open-source LLM that closes the effectiveness gap with GPT-4 for zero-shot listwise reranking while showing robustness to input ordering and document count.
- ChemCrow: Augmenting large-language models with chemistry tools · ChemCrow augments LLMs with 18 expert chemistry tools to autonomously plan and execute syntheses and guide molecular discoveries in organic synthesis, drug discovery, and materials design.
- SimReg: Achieving Higher Performance in the Pretraining via Embedding Similarity Regularization · SimReg regularization accelerates LLM pretraining convergence by over 30% and raises average zero-shot performance by over 1% across benchmarks.
- Rethinking the Necessity of Adaptive Retrieval-Augmented Generation through the Lens of Adaptive Listwise Ranking · AdaRankLLM shows adaptive listwise reranking outperforms fixed-depth retrieval for most LLMs by acting as a noise filter for weak models and an efficiency optimizer for strong ones, with lower context use.
- DIAURec: Dual-Intent Space Representation Optimization for Recommendation · DIAURec unifies intent and language modeling to reconstruct and optimize representations in prototype and distribution spaces, outperforming baselines on three datasets.
- The Platonic Representation Hypothesis · Representations learned by large AI models are converging toward a shared statistical model of reality.
- Towards General Text Embeddings with Multi-stage Contrastive Learning · GTE_base is a compact text embedding model using multi-stage contrastive learning on diverse data that outperforms OpenAI's API and 10x larger models on massive benchmarks and works for code as text.
- Text Embeddings by Weakly-Supervised Contrastive Pre-training · E5 text embeddings trained with weakly-supervised contrastive pre-training on CCPairs outperform BM25 on BEIR zero-shot and achieve top results on MTEB, beating much larger models.
- Granite Embedding Multilingual R2 Models · Granite Embedding Multilingual R2 releases 311M and 97M parameter bi-encoder models that achieve state-of-the-art retrieval performance on multilingual text, code, long-document, and reasoning datasets.
- A Survey of Large Language Models · This survey reviews the background, key techniques, and evaluation methods for large language models, emphasizing emergent abilities that appear at large scales.
- OpenClassGen: A Large-Scale Corpus of Real-World Python Classes for LLM Research
Reference graph
Works this paper leans on
- [1] Chen, M., Tworek, J., Jun, H., Yuan, Q., de Oliveira Pinto, H. P., Kaplan, J., Edwards, H., Burda, Y., Joseph, N., Brockman, G., Ray, A., Puri, R., Krueger, G., Petrov, M., Khlaaf, H., Sastry, G., Mishkin, P., Chan, B., Gray, S., Ryder, N., Pavlov, M., Power, A., Kaiser, L., Bavarian, M., Winter, C., Tillet, P., Such, F. P., Cummings, D., Plappert, M., ... Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021.
- [2] Conneau, A. and Kiela, D. SentEval: An evaluation toolkit for universal sentence representations. arXiv preprint arXiv:1803.05449, 2018.
- [3] Fang, H. and Xie, P. CERT: Contrastive self-supervised learning for language understanding. arXiv preprint arXiv:2005.12766, 2020.
- [4] Formal, T., Lassance, C., Piwowarski, B., and Clinchant, S. SPLADE v2: Sparse lexical and expansion model for information retrieval. arXiv preprint arXiv:2109.10086, 2021.
- [5] Guu, K., Lee, K., Tung, Z., Pasupat, P., and Chang, M. REALM: Retrieval-augmented language model pre-training. arXiv preprint arXiv:2002.08909, 2020.
- [6] Hofstätter, S., Lin, S., Yang, J., Lin, J., and Hanbury, A. Efficiently teaching an effective dense retriever with balanced topic aware sampling. arXiv preprint arXiv:2104.06967, 2021.
- [7] Husain, H., Wu, H.-H., Gazit, T., Allamanis, M., and Brockschmidt, M. CodeSearchNet challenge: Evaluating the state of semantic code search. arXiv preprint arXiv:1909.09436, 2019.
- [8] Izacard, G., Caron, M., Hosseini, L., Riedel, S., Bojanowski, P., Joulin, A., and Grave, E. Towards unsupervised dense information retrieval with contrastive learning. arXiv preprint arXiv:2112.09118, 2021.
- [9] Karpukhin, V., Oguz, B., Min, S., Lewis, P., Wu, L., Edunov, S., Chen, D., and Yih, W.-t. Dense passage retrieval for open-domain question answering. In Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020.
- [10] Kobayashi, S. Contextual augmentation: Data augmentation by words with paradigmatic relations. arXiv preprint arXiv:1805.06201, 2018.
- [11] Lu, J., Batra, D., Parikh, D., and Lee, S. ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. arXiv preprint arXiv:1908.02265, 2019.
- [12] Mikolov, T., Chen, K., Corrado, G. S., and Dean, J. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.
- [13] Nguyen, T., Rosenberg, M., Song, X., Gao, J., Tiwary, S., Majumder, R., and Deng, L. MS MARCO: A human generated machine reading comprehension dataset. arXiv preprint arXiv:1611.09268, 2016.
- [14] Nogueira, R., Lin, J., and Epistemic, A. From doc2query to docTTTTTquery. Online preprint, 2019.
Nogueira, R., Yang, W., Cho, K., and Lin, J. Multi-stage document ranking with BERT. arXiv preprint arXiv:1910.14424, 2019.
Park, J. H., Shin, J., and Fung, P. Reducing gender bias in abusive language detection. In Conference on Empirical Methods in Natural Language Processing (EMNLP), 2018.
- [15] Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., and Sutskever, I. Learning transferable visual models from natural language supervision. arXiv preprint arXiv:2103.00020, 2021.
- [16] Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., and Liu, P. J. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683, 2019.
- [17] Ramesh, A., Pavlov, M., Goh, G., Gray, S., Voss, C., Radford, A., Chen, M., and Sutskever, I. Zero-shot text-to-image generation. arXiv preprint arXiv:2102.12092, 2021.
- [18] Rudinger, R., Naradowsky, J., Leonard, B., and Van Durme, B. Gender bias in coreference resolution. arXiv preprint arXiv:1804.09301, 2018.
- [19] Santhanam, K., Khattab, O., Saad-Falcon, J., Potts, C., and Zaharia, M. ColBERTv2: Effective and efficient retrieval via lightweight late interaction. arXiv preprint arXiv:2112.01488, 2021.
- [20] Shen, D., Zheng, M., Shen, Y., Qu, Y., and Chen, W. A simple but tough-to-beat data augmentation approach for natural language understanding and generation. arXiv preprint arXiv:2009.13818, 2020.
- [21] Solaiman, I. and Dennison, C. Process for adapting language models to society (PALMS) with values-targeted datasets. arXiv preprint arXiv:2106.10328, 2021.
- [22] Su, J., Cao, J., Liu, W., and Ou, Y. Whitening sentence representations for better semantics and faster retrieval. arXiv preprint arXiv:2103.15316, 2021.
- [23] van den Oord, A., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A., and Kavukcuoglu, K. WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499, 2016.
- [24] van den Oord, A., Li, Y., and Vinyals, O. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018.
- [25] Wang, W., Wei, F., Dong, L., Bao, H., Yang, N., and Zhou, M. MiniLM: Deep self-attention distillation for task-agnostic compression of pre-trained transformers. arXiv preprint arXiv:2002.10957, 2020.
- [26]
- [27] Xiong, L., Xiong, C., Li, Y., Tang, K., Liu, J., Bennett, P. N., Ahmed, J., and Overwijk, A. Approximate nearest neighbor negative contrastive learning for dense text retrieval. arXiv preprint arXiv:2007.00808, 2020.
- [28] Zhao, J., Wang, T., Yatskar, M., Ordonez, V., and Chang, K. Gender bias in coreference resolution: Evaluation and debiasing methods. arXiv preprint arXiv:1804.06876, 2018.
discussion (0)