super hub Mixed citations

Text Embeddings by Weakly-Supervised Contrastive Pre-training

Binxing Jiao, Daxin Jiang, Liang Wang, Linjun Yang, Nan Yang, Xiaolong Huang · 2022 · cs.CL · arXiv 2212.03533

Mixed citation behavior. Most common role is method (39%).

154 Pith papers citing it

Method 39% of classified citations

open full Pith review browse 154 citing papers more from Binxing Jiao arXiv PDF

abstract

This paper presents E5, a family of state-of-the-art text embeddings that transfer well to a wide range of tasks. The model is trained in a contrastive manner with weak supervision signals from our curated large-scale text pair dataset (called CCPairs). E5 can be readily used as a general-purpose embedding model for any tasks requiring a single-vector representation of texts such as retrieval, clustering, and classification, achieving strong performance in both zero-shot and fine-tuned settings. We conduct extensive evaluations on 56 datasets from the BEIR and MTEB benchmarks. For zero-shot settings, E5 is the first model that outperforms the strong BM25 baseline on the BEIR retrieval benchmark without using any labeled data. When fine-tuned, E5 obtains the best results on the MTEB benchmark, beating existing embedding models with 40x more parameters.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 10 method 9 other 2 baseline 1 dataset 1

citation-polarity summary

use method 9 background 8 support 2 unclear 2 baseline 1 use dataset 1

claims ledger

abstract This paper presents E5, a family of state-of-the-art text embeddings that transfer well to a wide range of tasks. The model is trained in a contrastive manner with weak supervision signals from our curated large-scale text pair dataset (called CCPairs). E5 can be readily used as a general-purpose embedding model for any tasks requiring a single-vector representation of texts such as retrieval, clustering, and classification, achieving strong performance in both zero-shot and fine-tuned settings. We conduct extensive evaluations on 56 datasets from the BEIR and MTEB benchmarks. For zero-shot se

authors

Binxing Jiao Daxin Jiang Liang Wang Linjun Yang Nan Yang Xiaolong Huang

co-cited works

representative citing papers

Is Dimensionality a Barrier for Retrieval Models?

cs.LG · 2026-05-22 · unverdicted · novelty 8.0

Dimension d = O(m^{-2} log n) nearly achieves the optimal margin m^rd(+∞, A) for retrieval embeddings, with matching lower bounds showing d = O(k log(n/k)) suffices and is necessary for m = Θ(k^{-1/2}) on k-sparse query matrices.

STRABLE: Benchmarking Tabular Machine Learning with Strings

cs.LG · 2026-05-12 · unverdicted · novelty 8.0

A new corpus of 108 mixed string-numeric tables shows that advanced tabular learners with basic string embeddings perform well on most real-world data, while large LLM encoders help on free-text heavy tables.

FollowTable: A Benchmark for Instruction-Following Table Retrieval

cs.IR · 2026-05-01 · unverdicted · novelty 8.0

FollowTable is the first large-scale benchmark for instruction-following table retrieval, paired with an Instruction Responsiveness Score, showing that existing models fail to adapt to fine-grained constraints beyond topical similarity.

Diagnosing and Mitigating Retrieval Bottlenecks in LLM-Based Cold-Start Recommendation

cs.IR · 2026-06-29 · conditional · novelty 7.0

Retrieval coverage limits LLM rerankers in cold-start recommendation; a learned hybrid fusion improves pool quality but LLM reranking often degrades end-to-end performance while simpler rankers exploit the pool.

Anisotropy Decides Cosine vs. Rank Metrics for Text Embeddings

cs.CL · 2026-06-28 · conditional · novelty 7.0

Anisotropy, quantified by dominant-dimension variance fraction, determines the best parameter-free similarity metric for text embeddings, with rank-based metrics gaining ~20% relative where cosine is weakest.

Agentic Abstention: Do Agents Know When to Stop Instead of Act?

cs.AI · 2026-06-27 · unverdicted · novelty 7.0

LLM agents often fail to abstain at the right time in uncertain multi-turn tasks, and the CONVOLVE context engineering method raises timely abstention rates on WebShop from 26.7 to 57.4 without parameter updates.

A Sensitivity-Aware Test Collection for Search Among Personal Information

cs.IR · 2026-06-25 · accept · novelty 7.0

A new sensitivity-labeled test collection is released from Enron emails with crowdsourced queries, relevance judgments, and LLM extensions for evaluating sensitivity-aware search.

The Voronoi Bottleneck: Capacity-Aware Dense Retrieval for Product Search

cs.IR · 2026-06-09 · unverdicted · novelty 7.0

Proves Voronoi complexity equals sign-rank for top-1 retrieval, introduces CUS diagnostic predicting retrieval failure at AUC >0.8 without labels, and AT-DW-InfoNCE objective with derived alpha^*=2.0 that improves Recall@100 on synthetic data.

Skill or Skip? Learning Selective Skill Invocation in Agentic Tasks via Dual-Granularity Preference Learning

cs.CL · 2026-05-30 · unverdicted · novelty 7.0

SelSkill applies dual-granularity preference learning to selective skill-or-skip decisions, improving task success by 10.9 points and execution precision by 29.1 points on ALFWorld with Qwen3-8B.

Hijacking Agent Memory: Stealthy Trojan Attacks Through Conversational Interaction

cs.CR · 2026-05-28 · unverdicted · novelty 7.0

MemPoison enables stealthy memory poisoning in LLM agents via dialogue by using semantic relational bridges, entity masquerading, and joint embedding optimization to bypass selective extraction and rewriting, achieving up to 0.95 attack success rate.

Towards Cost-effective LLMs Routing with Batch Prompting

cs.DB · 2026-05-27 · unverdicted · novelty 7.0

RoBatch is a two-stage framework that formulates and solves the joint Route with Batching Problem via a batch-aware proxy utility model and greedy scheduling, outperforming separate routing or batching baselines on six benchmarks.

The Harder Text Embedding Benchmark (HTEB): Beyond One-dimensional Static Robustness

cs.CL · 2026-05-27 · unverdicted · novelty 7.0

HTEB introduces dynamic, multi-axis evaluation of text embedding robustness using LLM transformations, finding decoupled profiles across models and that scaling does not close all robustness gaps.

IdioLink: Retrieving Meaning Beyond Words Across Idiomatic and Literal Expressions

cs.CL · 2026-05-21 · unverdicted · novelty 7.0

IdioLink introduces a benchmark dataset and evaluation showing that strong embedding models struggle to retrieve equivalent meanings across idiomatic and literal forms, relying on shallow cues instead.

Generative Conversational Recommender System

cs.IR · 2026-05-21 · unverdicted · novelty 7.0

A single autoregressive model for conversational recommendation that uses semantic item IDs, predicts response intent and target first, then generates the response, reporting up to 29% Recall@1 gains.

Linked Multi-Model Data on Russian Domestic and Foreign Policy Speeches

cs.CL · 2026-05-15 · unverdicted · novelty 7.0

A new linked multimodal dataset of Russian domestic and foreign policy speeches with texts, images, captions, harmonized metadata, and expert-refined topic annotations is introduced to support analyses in political communication and LLM applications.

Think When Needed: Adaptive Reasoning-Driven Multimodal Embeddings with a Dual-LoRA Architecture

cs.CV · 2026-05-14 · unverdicted · novelty 7.0

TWN attaches separate reasoning and embedding LoRA adapters to a frozen backbone with gradient detachment and a self-supervised gate that decides per input whether to generate CoT, achieving SOTA on MMEB-V2 with 3-5% added parameters and up to 50% fewer reasoning tokens.

Very Efficient Listwise Multimodal Reranking for Long Documents

cs.IR · 2026-05-12 · unverdicted · novelty 7.0

ZipRerank delivers state-of-the-art multimodal listwise reranking accuracy for long documents at up to 10x lower latency via early interaction and single-pass scoring.

Breaking $\textit{Winner-Takes-All}$: Cooperative Policy Optimization Improves Diverse LLM Reasoning

cs.AI · 2026-05-12 · unverdicted · novelty 7.0 · 2 refs

GCPO uses team-level credit assignment via determinant volume over reward-weighted semantic embeddings to promote non-redundant correct reasoning paths, improving both accuracy and diversity in LLM training.

MulTaBench: Benchmarking Multimodal Tabular Learning with Text and Image

cs.LG · 2026-05-11 · unverdicted · novelty 7.0

MulTaBench is a new collection of 40 image-tabular and text-tabular datasets designed to test target-aware representation tuning in multimodal tabular models.

Skill Description Deception Attack against Task Routing in Internet of Agents

cs.MA · 2026-05-11 · conditional · novelty 7.0

Malicious agents can deceive LLM-based task routers in Internet of Agents systems by generating fake skill descriptions, achieving up to 98% success rate across nine domains.

LatentRAG: Latent Reasoning and Retrieval for Efficient Agentic RAG

cs.CL · 2026-05-07 · unverdicted · novelty 7.0

LatentRAG performs agentic RAG by generating latent tokens for thoughts and subqueries in one forward pass, matching explicit methods' accuracy on seven benchmarks while reducing latency by ~90%.

SkillRet: A Large-Scale Benchmark for Skill Retrieval in LLM Agents

cs.AI · 2026-05-07 · unverdicted · novelty 7.0

SkillRet benchmark shows fine-tuned retrievers improve NDCG@10 by 13+ points over prior models on large-scale skill retrieval for LLM agents.

TabEmbed: Benchmarking and Learning Generalist Embeddings for Tabular Understanding

cs.CL · 2026-05-06 · unverdicted · novelty 7.0

TabEmbed is the first generalist embedding model for tabular data that unifies classification and retrieval in one space via contrastive learning and outperforms text embedding models on the new TabBench benchmark.

Embedding-based In-Context Prompt Training for Enhancing LLMs as Text Encoders

cs.CL · 2026-05-02 · unverdicted · novelty 7.0

EPIC trains LLMs to treat continuous embeddings as in-context prompts, yielding state-of-the-art text embedding performance on MTEB with or without prompts at inference and lower compute.

citing papers explorer

Showing 50 of 55 citing papers after filters.

Anisotropy Decides Cosine vs. Rank Metrics for Text Embeddings cs.CL · 2026-06-28 · conditional · none · ref 24 · internal anchor
Anisotropy, quantified by dominant-dimension variance fraction, determines the best parameter-free similarity metric for text embeddings, with rank-based metrics gaining ~20% relative where cosine is weakest.
Skill or Skip? Learning Selective Skill Invocation in Agentic Tasks via Dual-Granularity Preference Learning cs.CL · 2026-05-30 · unverdicted · none · ref 48 · internal anchor
SelSkill applies dual-granularity preference learning to selective skill-or-skip decisions, improving task success by 10.9 points and execution precision by 29.1 points on ALFWorld with Qwen3-8B.
The Harder Text Embedding Benchmark (HTEB): Beyond One-dimensional Static Robustness cs.CL · 2026-05-27 · unverdicted · none · ref 60 · internal anchor
HTEB introduces dynamic, multi-axis evaluation of text embedding robustness using LLM transformations, finding decoupled profiles across models and that scaling does not close all robustness gaps.
IdioLink: Retrieving Meaning Beyond Words Across Idiomatic and Literal Expressions cs.CL · 2026-05-21 · unverdicted · none · ref 59 · internal anchor
IdioLink introduces a benchmark dataset and evaluation showing that strong embedding models struggle to retrieve equivalent meanings across idiomatic and literal forms, relying on shallow cues instead.
Linked Multi-Model Data on Russian Domestic and Foreign Policy Speeches cs.CL · 2026-05-15 · unverdicted · none · ref 58 · internal anchor
A new linked multimodal dataset of Russian domestic and foreign policy speeches with texts, images, captions, harmonized metadata, and expert-refined topic annotations is introduced to support analyses in political communication and LLM applications.
LatentRAG: Latent Reasoning and Retrieval for Efficient Agentic RAG cs.CL · 2026-05-07 · unverdicted · none · ref 34 · internal anchor
LatentRAG performs agentic RAG by generating latent tokens for thoughts and subqueries in one forward pass, matching explicit methods' accuracy on seven benchmarks while reducing latency by ~90%.
TabEmbed: Benchmarking and Learning Generalist Embeddings for Tabular Understanding cs.CL · 2026-05-06 · unverdicted · none · ref 32 · internal anchor
TabEmbed is the first generalist embedding model for tabular data that unifies classification and retrieval in one space via contrastive learning and outperforms text embedding models on the new TabBench benchmark.
Embedding-based In-Context Prompt Training for Enhancing LLMs as Text Encoders cs.CL · 2026-05-02 · unverdicted · none · ref 15 · internal anchor
EPIC trains LLMs to treat continuous embeddings as in-context prompts, yielding state-of-the-art text embedding performance on MTEB with or without prompts at inference and lower compute.
Learning How and What to Memorize: Cognition-Inspired Two-Stage Optimization for Evolving Memory cs.CL · 2026-05-01 · unverdicted · none · ref 18 · internal anchor
MemCoE learns memory organization guidelines via contrastive feedback and then trains a guideline-aligned RL policy for memory updates, yielding consistent gains on personalization benchmarks.
Why Mean Pooling Works: Quantifying Second-Order Collapse in Text Embeddings cs.CL · 2026-04-30 · unverdicted · none · ref 15 · internal anchor
Modern text encoders resist second-order collapse under mean pooling because token embeddings concentrate tightly within texts, and this resistance correlates with stronger downstream performance.
Latent Abstraction for Retrieval-Augmented Generation cs.CL · 2026-04-20 · unverdicted · none · ref 38 · internal anchor
LAnR unifies retrieval-augmented generation inside a single LLM by deriving dense retrieval vectors from a [PRED] token's hidden states and using entropy to adaptively stop retrieval, outperforming prior RAG on six QA benchmarks with better efficiency.
LMEB: Long-horizon Memory Embedding Benchmark cs.CL · 2026-03-13 · unverdicted · none · ref 33 · internal anchor
LMEB benchmark shows that embedding models' performance on traditional retrieval does not transfer to long-horizon memory tasks, larger models do not always perform better, and LMEB measures capabilities orthogonal to MTEB.
LSTM-MAS: A Long Short-Term Memory Inspired Multi-Agent System for Long-Context Understanding cs.CL · 2026-01-17 · unverdicted · none · ref 34 · internal anchor
LSTM-MAS uses a chained multi-agent architecture modeled on LSTM input, forget, and output gates to improve long-context QA performance and reduce hallucinations compared with prior multi-agent baselines.
Retrieval as a Decision: Training-Free Adaptive Gating for Efficient RAG cs.CL · 2025-11-12 · conditional · none · ref 15 · internal anchor
TARG uses uncertainty scores from a short no-context draft to gate retrieval in RAG, matching Always-RAG accuracy while cutting retrievals by 70-90% on QA benchmarks.
HiPRAG: Hierarchical Process Rewards for Efficient Agentic Retrieval Augmented Generation cs.CL · 2025-10-09 · unverdicted · none · ref 23 · internal anchor
HiPRAG adds hierarchical process rewards to RL training for agentic RAG, reducing over-search to 2.3% and achieving 65.4-67.2% accuracy on seven QA benchmarks across 3B and 7B models.
GME: Improving Universal Multimodal Retrieval by Multimodal LLMs cs.CL · 2024-12-22 · unverdicted · none · ref 63 · internal anchor
GME achieves state-of-the-art results in universal multimodal retrieval by training on a balanced synthetic multimodal dataset.
M3-Embedding: Multi-Linguality, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation cs.CL · 2024-02-05 · unverdicted · none · ref 35 · internal anchor
M3-Embedding is a single model for multi-lingual, multi-functional, and multi-granular text embeddings trained via self-knowledge distillation that achieves new state-of-the-art results on multilingual, cross-lingual, and long-document retrieval benchmarks.
MultiHop-RAG: Benchmarking Retrieval-Augmented Generation for Multi-Hop Queries cs.CL · 2024-01-27 · accept · none · ref 42 · internal anchor
MultiHop-RAG is a new benchmark dataset demonstrating that existing retrieval-augmented generation systems perform poorly on multi-hop queries requiring retrieval and reasoning over multiple evidence pieces.
C-Pack: Packed Resources For General Chinese Embeddings cs.CL · 2023-09-14 · accept · none · ref 58 · internal anchor
C-Pack releases a new Chinese embedding benchmark, large training dataset, and optimized models that outperform priors by up to 10% on C-MTEB while also delivering English SOTA results.
Field Order Should Not Matter: Permutation-Invariant Embedding Model Fine-Tuning for Structured Metadata Retrieval cs.CL · 2026-06-29 · unverdicted · none · ref 15 · internal anchor
Permutation-invariant fine-tuning (PI-FT) randomizes field order and applies dropout during embedding model training to eliminate sensitivity to serialization order, reducing order-change penalty from 7.4 to 0.2 nDCG@10 on a generated multilingual DevDataBench while outperforming zero-shot baselines
Consistent and Distinctive: LLM Benchmark Efficiency via Maximum Independent Set Prompt Selection on Similarity Graphs cs.CL · 2026-05-31 · unverdicted · none · ref 28 · internal anchor
A graph-based MIS prompt selection method on embedding similarity graphs yields reduced benchmark subsets with highly consistent LLM rankings (Kendall's W ≥ 0.90 in 99.2% of cases) and 25-48% size reduction at higher thresholds.
Structure Retention in Embedding Spaces as a Predictor of Benchmark Performance cs.CL · 2026-05-21 · unverdicted · none · ref 129 · internal anchor
Embedding model performance on MTEB tasks correlates strongly with nearest-neighbor overlap and ICA magnitude differences in their embedding spaces.
When Cases Get Rare: A Retrieval Benchmark for Off-Guideline Clinical Question Answering cs.CL · 2026-05-20 · unverdicted · none · ref 3 · internal anchor
OGCaReBench is a new retrieval-focused benchmark for evaluating LLMs on off-guideline clinical questions from real case reports, showing retrieval augmentation raises accuracy from 56% to 82%.
A$^2$TGPO: Agentic Turn-Group Policy Optimization with Adaptive Turn-level Clipping cs.CL · 2026-05-07 · unverdicted · none · ref 35 · internal anchor
A²TGPO improves RL policy optimization for multi-turn agentic LLMs by normalizing information gain within same-depth turn groups, rescaling cumulative advantages by sqrt of term count, and modulating clipping ranges per turn's normalized IG.
Verbal-R3: Verbal Reranker as the Missing Bridge between Retrieval and Reasoning cs.CL · 2026-05-02 · unverdicted · none · ref 54 · internal anchor
Verbal-R3 uses a verbal reranker to generate analytic narratives that guide retrieval and reasoning in LLMs, achieving SOTA results on complex QA benchmarks.
Is Textual Similarity Invariant under Machine Translation? Evidence Based on the Political Manifesto Corpus cs.CL · 2026-05-01 · unverdicted · none · ref 74 · internal anchor
Machine translation preserves embedding similarity structure for ten languages but distorts it for four in the Manifesto Corpus, via a new non-inferiority testing framework.
RoTRAG: Rule of Thumb Reasoning for Conversation Harm Detection with Retrieval-Augmented Generation cs.CL · 2026-04-19 · unverdicted · none · ref 40 · internal anchor
RoTRAG retrieves Rules of Thumb to ground LLM reasoning for harm detection and severity classification in multi-turn dialogues, reporting roughly 40% relative F1 gains and 8.4% lower distributional error on two safety benchmarks while cutting redundant retrieval.
REZE: Representation Regularization for Domain-adaptive Text Embedding Pre-finetuning cs.CL · 2026-04-19 · unverdicted · none · ref 64 · internal anchor
REZE controls representation shifts in contrastive pre-finetuning of text embeddings via eigenspace decomposition of anchor-positive pairs and adaptive soft-shrinkage on task-variant directions.
Rag Performance Prediction for Question Answering cs.CL · 2026-04-09 · unverdicted · none · ref 26 · internal anchor
A novel supervised predictor modeling semantic relationships among question, retrieved passages, and generated answer best forecasts when RAG improves QA performance.
Reasoning-Based Refinement of Unsupervised Text Clusters with LLMs cs.CL · 2026-04-08 · unverdicted · none · ref 50 · internal anchor
LLM reasoning refines unsupervised text clusters via coherence checks, redundancy removal, and label grounding, yielding better coherence and human-aligned labels on social media data.
ReSeek: A Self-Correcting Framework for Search Agents with Instructive Rewards cs.CL · 2025-10-01 · unverdicted · none · ref 19 · internal anchor
ReSeek adds self-correction via a JUDGE action and a dense instructive reward (correctness plus utility) to RL training of search agents, yielding higher success and faithfulness on a new contamination-resistant benchmark.
Causal2Vec: Improving Decoder-only LLMs as Embedding Models through a Contextual Token cs.CL · 2025-07-31 · conditional · none · ref 23 · internal anchor
Causal2Vec prepends a BERT-generated contextual token to decoder-only LLMs and pools its hidden state with the EOS token to reach new SOTA on MTEB among public-data-trained embedding models.
MEM1: Learning to Synergize Memory and Reasoning for Efficient Long-Horizon Agents cs.CL · 2025-06-18 · unverdicted · none · ref 54 · internal anchor
MEM1 uses end-to-end RL to learn constant-memory agents that update a shared state for memory and reasoning, delivering 3.5x better performance and 3.7x lower memory use than larger baselines on long-horizon QA and shopping tasks.
NV-Embed: Improved Techniques for Training LLMs as Generalist Embedding Models cs.CL · 2024-05-27 · accept · none · ref 101 · internal anchor
NV-Embed achieves first place on the MTEB leaderboard across 56 tasks by combining a latent attention layer, causal-mask removal, two-stage contrastive training, and data curation for LLM-based embedding models.
KbSD: Knowledge Boundary aware Self-Distillation for Behavioral Calibration in Agentic Search cs.CL · 2026-06-29 · unverdicted · none · ref 20 · internal anchor
KbSD uses a same-size hint-augmented teacher and quadrant-adaptive KL objectives to deliver dense supervision for calibrated behavior across knowledge states in agentic search.
When Is 0.1% Enough? Analyzing the Combined Effects of Dimensionality Reduction and Quantization on Text Embedding Compression cs.CL · 2026-05-31 · unverdicted · none · ref 13 · internal anchor
Combining dimensionality reduction and quantization compresses text embeddings to 0.1% size with minimal performance loss on MTEB tasks, outperforming either technique alone.
ConvMemory: A Lightweight Learned Memory Reranker, a Negative Attribution Result, and a Research-Preview Conflict Editor cs.CL · 2026-05-27 · unverdicted · none · ref 9 · internal anchor
ConvMemory delivers competitive recall at far lower latency than larger rerankers for long-term conversational memory while a multi-seed ablation refutes temporal-structure exploitation as the operative mechanism.
Large Language Model-Powered Query-Driven Event Timeline Summarization in Industrial Search cs.CL · 2026-05-26 · unverdicted · none · ref 34 · internal anchor
QDET deploys a 7B-parameter model fine-tuned with three auxiliary tasks and RL that matches a 671B model's F1 on query-driven timeline summarization while delivering measurable gains in production search metrics.
Generating Place-Based Compromises Between Two Points of View cs.CL · 2026-04-27 · unverdicted · none · ref 71 · internal anchor
Empathic similarity feedback in prompts generates more acceptable compromises than chain-of-thought, and margin-based training on the resulting data lets smaller models produce them without ongoing empathy estimation.
Search-R3: Unifying Reasoning and Embedding in Large Language Models cs.CL · 2025-10-08 · unverdicted · none · ref 75 · internal anchor
Search-R3 trains LLMs to output search embeddings as a direct product of step-by-step reasoning via supervised pre-training and a specialized RL environment that avoids full corpus re-encoding.
Erase to Improve: Erasable Reinforcement Learning for Search-Augmented LLMs cs.CL · 2025-10-01 · unverdicted · none · ref 41 · internal anchor
ERL trains LLMs to erase faulty reasoning steps and regenerate them in place, yielding gains of up to 8.48% EM on multi-hop QA benchmarks like HotpotQA.
LTRR: Learning To Rank Retrievers for LLMs cs.CL · 2025-06-16 · unverdicted · none · ref 42 · internal anchor
LTRR learns to rank a pool of retrievers by their expected contribution to RAG answer correctness and shows that query-dependent selection beats the best single retriever on QA benchmarks.
Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference cs.CL · 2024-12-18 · unverdicted · none · ref 195 · internal anchor
ModernBERT is a new bidirectional encoder model achieving SOTA performance on diverse classification and retrieval benchmarks while offering superior speed and memory efficiency for long-context inference.
Multilingual E5 Text Embeddings: A Technical Report cs.CL · 2024-02-08 · unverdicted · none · ref 20 · internal anchor
Open-source multilingual E5 embedding models are trained via contrastive pre-training on 1 billion text pairs and fine-tuning, with an instruction-tuned model matching English SOTA performance.
Data-CUBE: Data Curriculum for Instruction-based Sentence Representation Learning cs.CL · 2024-01-07 · unverdicted · none · ref 47 · internal anchor
Data-CUBE applies a two-level curriculum (TSP-based task ordering via simulated annealing plus difficulty-sorted mini-batches) to multi-task instruction tuning and reports gains on MTEB sentence representation tasks.
Towards General Text Embeddings with Multi-stage Contrastive Learning cs.CL · 2023-08-07 · unverdicted · none · ref 83 · internal anchor
GTE_base is a compact text embedding model using multi-stage contrastive learning on diverse data that outperforms OpenAI's API and 10x larger models on massive benchmarks and works for code as text.
Benchmarking Google Embeddings 2 against Open-Source Models for Multilingual Dense Retrieval and RAG Systems cs.CL · 2026-05-22 · unverdicted · none · ref 9 · internal anchor
GE2 tops BEIR and Italian RAG benchmarks at nDCG@10 of 0.638 and 0.282 but with 231.6 ms latency; mE5-L is competitive on Italian at 31 ms while LaBSE underperforms all dedicated retrieval models.
m3BERT: A Modern, Multi-lingual, Matryoshka Bidirectional Encoder cs.CL · 2026-05-19 · unverdicted · none · ref 40 · internal anchor
m3BERT uses a three-stage Matryoshka pretraining approach on a bidirectional encoder to support variable embedding sizes while outperforming prior models on large-scale retrieval tasks.
Mira-Embeddings-V1: Domain-Adapted Semantic Reranking for Recruitment via LLM-Synthesized Data cs.CL · 2026-04-20 · conditional · none · ref 28 · internal anchor
Mira-Embeddings-V1 adapts embeddings for recruitment reranking by synthesizing positive and hard-negative samples with LLMs, then applies JD-JD contrastive and JD-CV triplet training plus a BoundaryHead MLP, lifting Recall@50 from 68.89% to 77.55% and Recall@200 from 0.5969 to 0.7047.
Mitigating Hallucination on Hallucination in RAG via Ensemble Voting cs.CL · 2026-03-28 · unverdicted · none · ref 48 · internal anchor
VOTE-RAG applies retrieval voting across diverse queries and response voting across independent generations to mitigate hallucination-on-hallucination in RAG, matching or exceeding complex baselines on six benchmarks with a parallelizable design.

Text Embeddings by Weakly-Supervised Contrastive Pre-training

hub tools

citation-role summary

citation-polarity summary

claims ledger

authors

co-cited works

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer