Dimension d = O(m^{-2} log n) nearly achieves the optimal margin m^rd(+∞, A) for retrieval embeddings, with matching lower bounds showing d = O(k log(n/k)) suffices and is necessary for m = Θ(k^{-1/2}) on k-sparse query matrices.
super hub Mixed citations
Text Embeddings by Weakly-Supervised Contrastive Pre-training
Mixed citation behavior. Most common role is method (39%).
abstract
This paper presents E5, a family of state-of-the-art text embeddings that transfer well to a wide range of tasks. The model is trained in a contrastive manner with weak supervision signals from our curated large-scale text pair dataset (called CCPairs). E5 can be readily used as a general-purpose embedding model for any tasks requiring a single-vector representation of texts such as retrieval, clustering, and classification, achieving strong performance in both zero-shot and fine-tuned settings. We conduct extensive evaluations on 56 datasets from the BEIR and MTEB benchmarks. For zero-shot settings, E5 is the first model that outperforms the strong BM25 baseline on the BEIR retrieval benchmark without using any labeled data. When fine-tuned, E5 obtains the best results on the MTEB benchmark, beating existing embedding models with 40x more parameters.
hub tools
citation-role summary
citation-polarity summary
claims ledger
- abstract This paper presents E5, a family of state-of-the-art text embeddings that transfer well to a wide range of tasks. The model is trained in a contrastive manner with weak supervision signals from our curated large-scale text pair dataset (called CCPairs). E5 can be readily used as a general-purpose embedding model for any tasks requiring a single-vector representation of texts such as retrieval, clustering, and classification, achieving strong performance in both zero-shot and fine-tuned settings. We conduct extensive evaluations on 56 datasets from the BEIR and MTEB benchmarks. For zero-shot se
authors
co-cited works
representative citing papers
A new corpus of 108 mixed string-numeric tables shows that advanced tabular learners with basic string embeddings perform well on most real-world data, while large LLM encoders help on free-text heavy tables.
FollowTable is the first large-scale benchmark for instruction-following table retrieval, paired with an Instruction Responsiveness Score, showing that existing models fail to adapt to fine-grained constraints beyond topical similarity.
Retrieval coverage limits LLM rerankers in cold-start recommendation; a learned hybrid fusion improves pool quality but LLM reranking often degrades end-to-end performance while simpler rankers exploit the pool.
Anisotropy, quantified by dominant-dimension variance fraction, determines the best parameter-free similarity metric for text embeddings, with rank-based metrics gaining ~20% relative where cosine is weakest.
LLM agents often fail to abstain at the right time in uncertain multi-turn tasks, and the CONVOLVE context engineering method raises timely abstention rates on WebShop from 26.7 to 57.4 without parameter updates.
A new sensitivity-labeled test collection is released from Enron emails with crowdsourced queries, relevance judgments, and LLM extensions for evaluating sensitivity-aware search.
Proves Voronoi complexity equals sign-rank for top-1 retrieval, introduces CUS diagnostic predicting retrieval failure at AUC >0.8 without labels, and AT-DW-InfoNCE objective with derived alpha^*=2.0 that improves Recall@100 on synthetic data.
SelSkill applies dual-granularity preference learning to selective skill-or-skip decisions, improving task success by 10.9 points and execution precision by 29.1 points on ALFWorld with Qwen3-8B.
MemPoison enables stealthy memory poisoning in LLM agents via dialogue by using semantic relational bridges, entity masquerading, and joint embedding optimization to bypass selective extraction and rewriting, achieving up to 0.95 attack success rate.
RoBatch is a two-stage framework that formulates and solves the joint Route with Batching Problem via a batch-aware proxy utility model and greedy scheduling, outperforming separate routing or batching baselines on six benchmarks.
HTEB introduces dynamic, multi-axis evaluation of text embedding robustness using LLM transformations, finding decoupled profiles across models and that scaling does not close all robustness gaps.
IdioLink introduces a benchmark dataset and evaluation showing that strong embedding models struggle to retrieve equivalent meanings across idiomatic and literal forms, relying on shallow cues instead.
A single autoregressive model for conversational recommendation that uses semantic item IDs, predicts response intent and target first, then generates the response, reporting up to 29% Recall@1 gains.
A new linked multimodal dataset of Russian domestic and foreign policy speeches with texts, images, captions, harmonized metadata, and expert-refined topic annotations is introduced to support analyses in political communication and LLM applications.
TWN attaches separate reasoning and embedding LoRA adapters to a frozen backbone with gradient detachment and a self-supervised gate that decides per input whether to generate CoT, achieving SOTA on MMEB-V2 with 3-5% added parameters and up to 50% fewer reasoning tokens.
ZipRerank delivers state-of-the-art multimodal listwise reranking accuracy for long documents at up to 10x lower latency via early interaction and single-pass scoring.
GCPO uses team-level credit assignment via determinant volume over reward-weighted semantic embeddings to promote non-redundant correct reasoning paths, improving both accuracy and diversity in LLM training.
MulTaBench is a new collection of 40 image-tabular and text-tabular datasets designed to test target-aware representation tuning in multimodal tabular models.
Malicious agents can deceive LLM-based task routers in Internet of Agents systems by generating fake skill descriptions, achieving up to 98% success rate across nine domains.
LatentRAG performs agentic RAG by generating latent tokens for thoughts and subqueries in one forward pass, matching explicit methods' accuracy on seven benchmarks while reducing latency by ~90%.
SkillRet benchmark shows fine-tuned retrievers improve NDCG@10 by 13+ points over prior models on large-scale skill retrieval for LLM agents.
TabEmbed is the first generalist embedding model for tabular data that unifies classification and retrieval in one space via contrastive learning and outperforms text embedding models on the new TabBench benchmark.
EPIC trains LLMs to treat continuous embeddings as in-context prompts, yielding state-of-the-art text embedding performance on MTEB with or without prompts at inference and lower compute.
citing papers explorer
-
Anisotropy Decides Cosine vs. Rank Metrics for Text Embeddings
Anisotropy, quantified by dominant-dimension variance fraction, determines the best parameter-free similarity metric for text embeddings, with rank-based metrics gaining ~20% relative where cosine is weakest.
-
Skill or Skip? Learning Selective Skill Invocation in Agentic Tasks via Dual-Granularity Preference Learning
SelSkill applies dual-granularity preference learning to selective skill-or-skip decisions, improving task success by 10.9 points and execution precision by 29.1 points on ALFWorld with Qwen3-8B.
-
The Harder Text Embedding Benchmark (HTEB): Beyond One-dimensional Static Robustness
HTEB introduces dynamic, multi-axis evaluation of text embedding robustness using LLM transformations, finding decoupled profiles across models and that scaling does not close all robustness gaps.
-
IdioLink: Retrieving Meaning Beyond Words Across Idiomatic and Literal Expressions
IdioLink introduces a benchmark dataset and evaluation showing that strong embedding models struggle to retrieve equivalent meanings across idiomatic and literal forms, relying on shallow cues instead.
-
Linked Multi-Model Data on Russian Domestic and Foreign Policy Speeches
A new linked multimodal dataset of Russian domestic and foreign policy speeches with texts, images, captions, harmonized metadata, and expert-refined topic annotations is introduced to support analyses in political communication and LLM applications.
-
LatentRAG: Latent Reasoning and Retrieval for Efficient Agentic RAG
LatentRAG performs agentic RAG by generating latent tokens for thoughts and subqueries in one forward pass, matching explicit methods' accuracy on seven benchmarks while reducing latency by ~90%.
-
TabEmbed: Benchmarking and Learning Generalist Embeddings for Tabular Understanding
TabEmbed is the first generalist embedding model for tabular data that unifies classification and retrieval in one space via contrastive learning and outperforms text embedding models on the new TabBench benchmark.
-
Embedding-based In-Context Prompt Training for Enhancing LLMs as Text Encoders
EPIC trains LLMs to treat continuous embeddings as in-context prompts, yielding state-of-the-art text embedding performance on MTEB with or without prompts at inference and lower compute.
-
Learning How and What to Memorize: Cognition-Inspired Two-Stage Optimization for Evolving Memory
MemCoE learns memory organization guidelines via contrastive feedback and then trains a guideline-aligned RL policy for memory updates, yielding consistent gains on personalization benchmarks.
-
Why Mean Pooling Works: Quantifying Second-Order Collapse in Text Embeddings
Modern text encoders resist second-order collapse under mean pooling because token embeddings concentrate tightly within texts, and this resistance correlates with stronger downstream performance.
-
Latent Abstraction for Retrieval-Augmented Generation
LAnR unifies retrieval-augmented generation inside a single LLM by deriving dense retrieval vectors from a [PRED] token's hidden states and using entropy to adaptively stop retrieval, outperforming prior RAG on six QA benchmarks with better efficiency.
-
LMEB: Long-horizon Memory Embedding Benchmark
LMEB benchmark shows that embedding models' performance on traditional retrieval does not transfer to long-horizon memory tasks, larger models do not always perform better, and LMEB measures capabilities orthogonal to MTEB.
-
LSTM-MAS: A Long Short-Term Memory Inspired Multi-Agent System for Long-Context Understanding
LSTM-MAS uses a chained multi-agent architecture modeled on LSTM input, forget, and output gates to improve long-context QA performance and reduce hallucinations compared with prior multi-agent baselines.
-
Retrieval as a Decision: Training-Free Adaptive Gating for Efficient RAG
TARG uses uncertainty scores from a short no-context draft to gate retrieval in RAG, matching Always-RAG accuracy while cutting retrievals by 70-90% on QA benchmarks.
-
HiPRAG: Hierarchical Process Rewards for Efficient Agentic Retrieval Augmented Generation
HiPRAG adds hierarchical process rewards to RL training for agentic RAG, reducing over-search to 2.3% and achieving 65.4-67.2% accuracy on seven QA benchmarks across 3B and 7B models.
-
GME: Improving Universal Multimodal Retrieval by Multimodal LLMs
GME achieves state-of-the-art results in universal multimodal retrieval by training on a balanced synthetic multimodal dataset.
-
M3-Embedding: Multi-Linguality, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation
M3-Embedding is a single model for multi-lingual, multi-functional, and multi-granular text embeddings trained via self-knowledge distillation that achieves new state-of-the-art results on multilingual, cross-lingual, and long-document retrieval benchmarks.
-
MultiHop-RAG: Benchmarking Retrieval-Augmented Generation for Multi-Hop Queries
MultiHop-RAG is a new benchmark dataset demonstrating that existing retrieval-augmented generation systems perform poorly on multi-hop queries requiring retrieval and reasoning over multiple evidence pieces.
-
C-Pack: Packed Resources For General Chinese Embeddings
C-Pack releases a new Chinese embedding benchmark, large training dataset, and optimized models that outperform priors by up to 10% on C-MTEB while also delivering English SOTA results.
-
Field Order Should Not Matter: Permutation-Invariant Embedding Model Fine-Tuning for Structured Metadata Retrieval
Permutation-invariant fine-tuning (PI-FT) randomizes field order and applies dropout during embedding model training to eliminate sensitivity to serialization order, reducing order-change penalty from 7.4 to 0.2 nDCG@10 on a generated multilingual DevDataBench while outperforming zero-shot baselines
-
Consistent and Distinctive: LLM Benchmark Efficiency via Maximum Independent Set Prompt Selection on Similarity Graphs
A graph-based MIS prompt selection method on embedding similarity graphs yields reduced benchmark subsets with highly consistent LLM rankings (Kendall's W ≥ 0.90 in 99.2% of cases) and 25-48% size reduction at higher thresholds.
-
Structure Retention in Embedding Spaces as a Predictor of Benchmark Performance
Embedding model performance on MTEB tasks correlates strongly with nearest-neighbor overlap and ICA magnitude differences in their embedding spaces.
-
When Cases Get Rare: A Retrieval Benchmark for Off-Guideline Clinical Question Answering
OGCaReBench is a new retrieval-focused benchmark for evaluating LLMs on off-guideline clinical questions from real case reports, showing retrieval augmentation raises accuracy from 56% to 82%.
-
A$^2$TGPO: Agentic Turn-Group Policy Optimization with Adaptive Turn-level Clipping
A²TGPO improves RL policy optimization for multi-turn agentic LLMs by normalizing information gain within same-depth turn groups, rescaling cumulative advantages by sqrt of term count, and modulating clipping ranges per turn's normalized IG.
-
Verbal-R3: Verbal Reranker as the Missing Bridge between Retrieval and Reasoning
Verbal-R3 uses a verbal reranker to generate analytic narratives that guide retrieval and reasoning in LLMs, achieving SOTA results on complex QA benchmarks.
-
Is Textual Similarity Invariant under Machine Translation? Evidence Based on the Political Manifesto Corpus
Machine translation preserves embedding similarity structure for ten languages but distorts it for four in the Manifesto Corpus, via a new non-inferiority testing framework.
-
RoTRAG: Rule of Thumb Reasoning for Conversation Harm Detection with Retrieval-Augmented Generation
RoTRAG retrieves Rules of Thumb to ground LLM reasoning for harm detection and severity classification in multi-turn dialogues, reporting roughly 40% relative F1 gains and 8.4% lower distributional error on two safety benchmarks while cutting redundant retrieval.
-
REZE: Representation Regularization for Domain-adaptive Text Embedding Pre-finetuning
REZE controls representation shifts in contrastive pre-finetuning of text embeddings via eigenspace decomposition of anchor-positive pairs and adaptive soft-shrinkage on task-variant directions.
-
Rag Performance Prediction for Question Answering
A novel supervised predictor modeling semantic relationships among question, retrieved passages, and generated answer best forecasts when RAG improves QA performance.
-
Reasoning-Based Refinement of Unsupervised Text Clusters with LLMs
LLM reasoning refines unsupervised text clusters via coherence checks, redundancy removal, and label grounding, yielding better coherence and human-aligned labels on social media data.
-
ReSeek: A Self-Correcting Framework for Search Agents with Instructive Rewards
ReSeek adds self-correction via a JUDGE action and a dense instructive reward (correctness plus utility) to RL training of search agents, yielding higher success and faithfulness on a new contamination-resistant benchmark.
-
Causal2Vec: Improving Decoder-only LLMs as Embedding Models through a Contextual Token
Causal2Vec prepends a BERT-generated contextual token to decoder-only LLMs and pools its hidden state with the EOS token to reach new SOTA on MTEB among public-data-trained embedding models.
-
MEM1: Learning to Synergize Memory and Reasoning for Efficient Long-Horizon Agents
MEM1 uses end-to-end RL to learn constant-memory agents that update a shared state for memory and reasoning, delivering 3.5x better performance and 3.7x lower memory use than larger baselines on long-horizon QA and shopping tasks.
-
NV-Embed: Improved Techniques for Training LLMs as Generalist Embedding Models
NV-Embed achieves first place on the MTEB leaderboard across 56 tasks by combining a latent attention layer, causal-mask removal, two-stage contrastive training, and data curation for LLM-based embedding models.
-
KbSD: Knowledge Boundary aware Self-Distillation for Behavioral Calibration in Agentic Search
KbSD uses a same-size hint-augmented teacher and quadrant-adaptive KL objectives to deliver dense supervision for calibrated behavior across knowledge states in agentic search.
-
When Is 0.1% Enough? Analyzing the Combined Effects of Dimensionality Reduction and Quantization on Text Embedding Compression
Combining dimensionality reduction and quantization compresses text embeddings to 0.1% size with minimal performance loss on MTEB tasks, outperforming either technique alone.
-
ConvMemory: A Lightweight Learned Memory Reranker, a Negative Attribution Result, and a Research-Preview Conflict Editor
ConvMemory delivers competitive recall at far lower latency than larger rerankers for long-term conversational memory while a multi-seed ablation refutes temporal-structure exploitation as the operative mechanism.
-
Large Language Model-Powered Query-Driven Event Timeline Summarization in Industrial Search
QDET deploys a 7B-parameter model fine-tuned with three auxiliary tasks and RL that matches a 671B model's F1 on query-driven timeline summarization while delivering measurable gains in production search metrics.
-
Generating Place-Based Compromises Between Two Points of View
Empathic similarity feedback in prompts generates more acceptable compromises than chain-of-thought, and margin-based training on the resulting data lets smaller models produce them without ongoing empathy estimation.
-
Search-R3: Unifying Reasoning and Embedding in Large Language Models
Search-R3 trains LLMs to output search embeddings as a direct product of step-by-step reasoning via supervised pre-training and a specialized RL environment that avoids full corpus re-encoding.
-
Erase to Improve: Erasable Reinforcement Learning for Search-Augmented LLMs
ERL trains LLMs to erase faulty reasoning steps and regenerate them in place, yielding gains of up to 8.48% EM on multi-hop QA benchmarks like HotpotQA.
-
LTRR: Learning To Rank Retrievers for LLMs
LTRR learns to rank a pool of retrievers by their expected contribution to RAG answer correctness and shows that query-dependent selection beats the best single retriever on QA benchmarks.
-
Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference
ModernBERT is a new bidirectional encoder model achieving SOTA performance on diverse classification and retrieval benchmarks while offering superior speed and memory efficiency for long-context inference.
-
Multilingual E5 Text Embeddings: A Technical Report
Open-source multilingual E5 embedding models are trained via contrastive pre-training on 1 billion text pairs and fine-tuning, with an instruction-tuned model matching English SOTA performance.
-
Data-CUBE: Data Curriculum for Instruction-based Sentence Representation Learning
Data-CUBE applies a two-level curriculum (TSP-based task ordering via simulated annealing plus difficulty-sorted mini-batches) to multi-task instruction tuning and reports gains on MTEB sentence representation tasks.
-
Towards General Text Embeddings with Multi-stage Contrastive Learning
GTE_base is a compact text embedding model using multi-stage contrastive learning on diverse data that outperforms OpenAI's API and 10x larger models on massive benchmarks and works for code as text.
-
Benchmarking Google Embeddings 2 against Open-Source Models for Multilingual Dense Retrieval and RAG Systems
GE2 tops BEIR and Italian RAG benchmarks at nDCG@10 of 0.638 and 0.282 but with 231.6 ms latency; mE5-L is competitive on Italian at 31 ms while LaBSE underperforms all dedicated retrieval models.
-
m3BERT: A Modern, Multi-lingual, Matryoshka Bidirectional Encoder
m3BERT uses a three-stage Matryoshka pretraining approach on a bidirectional encoder to support variable embedding sizes while outperforming prior models on large-scale retrieval tasks.
-
Mira-Embeddings-V1: Domain-Adapted Semantic Reranking for Recruitment via LLM-Synthesized Data
Mira-Embeddings-V1 adapts embeddings for recruitment reranking by synthesizing positive and hard-negative samples with LLMs, then applies JD-JD contrastive and JD-CV triplet training plus a BoundaryHead MLP, lifting Recall@50 from 68.89% to 77.55% and Recall@200 from 0.5969 to 0.7047.
-
Mitigating Hallucination on Hallucination in RAG via Ensemble Voting
VOTE-RAG applies retrieval voting across diverse queries and response voting across independent generations to mitigate hallucination-on-hallucination in RAG, matching or exceeding complex baselines on six benchmarks with a parallelizable design.