Text and Code Embeddings by Contrastive Pre-Training

Mixed citation behavior. The most common role is background (67% of classified citations).

30 Pith papers citing it

abstract

Text embeddings are useful features in many applications such as semantic search and computing text similarity. Previous work typically trains models customized for different use cases, varying in dataset choice, training objective and model architecture. In this work, we show that contrastive pre-training on unsupervised data at scale leads to high-quality vector representations of text and code. The same unsupervised text embeddings that achieve new state-of-the-art results in linear-probe classification also display impressive semantic search capabilities and sometimes even perform competitively with fine-tuned models. On linear-probe classification accuracy averaged over 7 tasks, our best unsupervised model achieves a relative improvement of 4% and 1.8% over the previous best unsupervised and supervised text embedding models, respectively. The same text embeddings, when evaluated on large-scale semantic search, attain a relative improvement of 23.4%, 14.7%, and 10.6% over the previous best unsupervised methods on the MSMARCO, Natural Questions and TriviaQA benchmarks, respectively. As with text embeddings, we train code embedding models on (text, code) pairs, obtaining a 20.8% relative improvement over the prior best work on code search.
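
The abstract names contrastive pre-training on paired data but does not spell out the objective. As a rough illustration only, the sketch below shows one common form of contrastive loss: a symmetric cross-entropy over cosine similarities, where every other pair in the batch serves as a negative. The function names, the temperature value, and the toy vectors are all hypothetical and are not taken from the abstract.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine_similarity_matrix(xs, ys):
    """Entry [i][j] is the cosine similarity between xs[i] and ys[j]."""
    xs = [l2_normalize(x) for x in xs]
    ys = [l2_normalize(y) for y in ys]
    return [[sum(a * b for a, b in zip(x, y)) for y in ys] for x in xs]

def cross_entropy_diagonal(logits):
    """Mean of -log softmax(row i)[i]: the i-th column is the positive for row i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(v - m) for v in row))
        total += log_z - row[i]
    return total / len(logits)

def in_batch_contrastive_loss(xs, ys, temperature=0.1):
    """Symmetric contrastive loss; temperature=0.1 is an assumed, illustrative value."""
    sims = cosine_similarity_matrix(xs, ys)
    logits = [[s / temperature for s in row] for row in sims]
    transposed = [list(col) for col in zip(*logits)]
    # Average the x-to-y and y-to-x directions.
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(transposed))

# Toy embeddings standing in for encoder outputs of two (text, code) pairs.
text_vecs = [[1.0, 0.0], [0.0, 1.0]]
code_vecs_aligned = [[1.0, 0.0], [0.0, 1.0]]
code_vecs_shuffled = [[0.0, 1.0], [1.0, 0.0]]

loss_aligned = in_batch_contrastive_loss(text_vecs, code_vecs_aligned)
loss_shuffled = in_batch_contrastive_loss(text_vecs, code_vecs_shuffled)
```

In this toy setup the loss is lower when each text vector sits next to its own paired code vector than when the pairs are shuffled, which is the behavior a contrastive objective rewards during training.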

citation-role summary

background 7 · method 2

representative citing papers

GenAI Powered Dynamic Causal Inference with Unstructured Data

stat.ME · 2026-05-08 · unverdicted · novelty 7.0

A GenAI-based method extracts representations from unstructured data and uses a neural network to fit marginal structural models that recover causal effects of treatment feature sequences including their positions.

C-Pack: Packed Resources For General Chinese Embeddings

cs.CL · 2023-09-14 · accept · novelty 7.0

C-Pack releases a new Chinese embedding benchmark, a large training dataset, and optimized models that outperform prior models by up to 10% on C-MTEB while also delivering English SOTA results.

Mitigating Label Bias with Interpretable Rubric Embeddings

cs.LG · 2026-05-20 · unverdicted · novelty 6.0

Rubric embeddings from expert criteria mitigate label bias in models trained on historical evaluations, reducing group disparities while improving cohort quality on a master's program dataset.

ImproBR: Bug Report Improver Using LLMs

cs.SE · 2026-04-28 · unverdicted · novelty 6.0

ImproBR combines a hybrid detector with GPT-4o mini and RAG to raise bug report structural completeness from 7.9% to 96.4% and executable steps from 28.8% to 67.6% on 139 Mojira reports.

LLMs Corrupt Your Documents When You Delegate

cs.CL · 2026-04-17 · unverdicted · novelty 6.0

LLMs, including frontier models, corrupt an average of 25% of document content during long delegated editing workflows across 52 domains, and agentic tools do not mitigate the issue.

Forget BIT, It is All about TOKEN: Towards Semantic Information Theory for LLMs

cs.IT · 2025-11-03 · unverdicted · novelty 5.0

Proposes a semantic information theory for LLMs that substitutes the token for the bit as the atomic carrier of meaning, recasts the Transformer as an energy-based model, and derives directed rate-distortion and rate-reward functions using Massey's directed information.

The Platonic Representation Hypothesis

cs.LG · 2024-05-13 · unverdicted · novelty 5.0

Representations learned by large AI models are converging toward a shared statistical model of reality.
