Dimension d = O(m^{-2} log n) nearly achieves the optimal margin m^rd(+∞, A) for retrieval embeddings, with matching lower bounds showing d = O(k log(n/k)) suffices and is necessary for m = Θ(k^{-1/2}) on k-sparse query matrices.
super hub Mixed citations
Text Embeddings by Weakly-Supervised Contrastive Pre-training
Mixed citation behavior. Most common role is method (39%).
abstract
This paper presents E5, a family of state-of-the-art text embeddings that transfer well to a wide range of tasks. The model is trained in a contrastive manner with weak supervision signals from our curated large-scale text pair dataset (called CCPairs). E5 can be readily used as a general-purpose embedding model for any tasks requiring a single-vector representation of texts such as retrieval, clustering, and classification, achieving strong performance in both zero-shot and fine-tuned settings. We conduct extensive evaluations on 56 datasets from the BEIR and MTEB benchmarks. For zero-shot settings, E5 is the first model that outperforms the strong BM25 baseline on the BEIR retrieval benchmark without using any labeled data. When fine-tuned, E5 obtains the best results on the MTEB benchmark, beating existing embedding models with 40x more parameters.
hub tools
citation-role summary
citation-polarity summary
claims ledger
- abstract This paper presents E5, a family of state-of-the-art text embeddings that transfer well to a wide range of tasks. The model is trained in a contrastive manner with weak supervision signals from our curated large-scale text pair dataset (called CCPairs). E5 can be readily used as a general-purpose embedding model for any tasks requiring a single-vector representation of texts such as retrieval, clustering, and classification, achieving strong performance in both zero-shot and fine-tuned settings. We conduct extensive evaluations on 56 datasets from the BEIR and MTEB benchmarks. For zero-shot se
authors
co-cited works
representative citing papers
A new corpus of 108 mixed string-numeric tables shows that advanced tabular learners with basic string embeddings perform well on most real-world data, while large LLM encoders help on free-text heavy tables.
FollowTable is the first large-scale benchmark for instruction-following table retrieval, paired with an Instruction Responsiveness Score, showing that existing models fail to adapt to fine-grained constraints beyond topical similarity.
ALEE generates AMR-based English minimal pairs with fine-grained semantic shifts, translates them, and evaluates embedding models on 275+ languages to expose cross-lingual gaps linked to training data and tokenization.
Retrieval coverage limits LLM rerankers in cold-start recommendation; a learned hybrid fusion improves pool quality but LLM reranking often degrades end-to-end performance while simpler rankers exploit the pool.
Anisotropy, quantified by dominant-dimension variance fraction, determines the best parameter-free similarity metric for text embeddings, with rank-based metrics gaining ~20% relative where cosine is weakest.
LLM agents often fail to abstain at the right time in uncertain multi-turn tasks, and the CONVOLVE context engineering method raises timely abstention rates on WebShop from 26.7 to 57.4 without parameter updates.
A new sensitivity-labeled test collection is released from Enron emails with crowdsourced queries, relevance judgments, and LLM extensions for evaluating sensitivity-aware search.
Proves Voronoi complexity equals sign-rank for top-1 retrieval, introduces CUS diagnostic predicting retrieval failure at AUC >0.8 without labels, and AT-DW-InfoNCE objective with derived alpha^*=2.0 that improves Recall@100 on synthetic data.
OpAI-Bench provides a new benchmark for evaluating AI-text detectors on progressively human-AI co-edited documents at multiple granularities, revealing non-monotonic detection patterns.
ImageAuditor is the first MIA for IRAG that achieves over 80% AUROC with four queries by using reward-guided policy optimization for cross-modal retrieval and task-specific prompting for signal extraction.
SEA-Embedding is a fully open text embedding pipeline for Southeast Asian languages that achieves state-of-the-art performance on the SEA-BED benchmark by analyzing data composition, training objectives, and base encoder choices.
MaskForge reaches 79.3% average attack success rate on five dLLMs by adaptively searching and accumulating structural attack patterns with a UCB bandit, improving 17.6% over baselines and transferring to 88.2% on AdvBench.
DART adapts a scoring matrix at inference time via gradient updates on pseudo-labels from top/bottom documents to gain +2.1% mean NDCG@10 on six BEIR benchmarks with under 10ms added latency.
SelSkill applies dual-granularity preference learning to selective skill-or-skip decisions, improving task success by 10.9 points and execution precision by 29.1 points on ALFWorld with Qwen3-8B.
MemPoison enables stealthy memory poisoning in LLM agents via dialogue by using semantic relational bridges, entity masquerading, and joint embedding optimization to bypass selective extraction and rewriting, achieving up to 0.95 attack success rate.
RoBatch is a two-stage framework that formulates and solves the joint Route with Batching Problem via a batch-aware proxy utility model and greedy scheduling, outperforming separate routing or batching baselines on six benchmarks.
HTEB introduces dynamic, multi-axis evaluation of text embedding robustness using LLM transformations, finding decoupled profiles across models and that scaling does not close all robustness gaps.
IdioLink introduces a benchmark dataset and evaluation showing that strong embedding models struggle to retrieve equivalent meanings across idiomatic and literal forms, relying on shallow cues instead.
A single autoregressive model for conversational recommendation that uses semantic item IDs, predicts response intent and target first, then generates the response, reporting up to 29% Recall@1 gains.
A new linked multimodal dataset of Russian domestic and foreign policy speeches with texts, images, captions, harmonized metadata, and expert-refined topic annotations is introduced to support analyses in political communication and LLM applications.
TWN attaches separate reasoning and embedding LoRA adapters to a frozen backbone with gradient detachment and a self-supervised gate that decides per input whether to generate CoT, achieving SOTA on MMEB-V2 with 3-5% added parameters and up to 50% fewer reasoning tokens.
ZipRerank delivers state-of-the-art multimodal listwise reranking accuracy for long documents at up to 10x lower latency via early interaction and single-pass scoring.
GCPO uses team-level credit assignment via determinant volume over reward-weighted semantic embeddings to promote non-redundant correct reasoning paths, improving both accuracy and diversity in LLM training.
citing papers explorer
-
MultiHop-RAG: Benchmarking Retrieval-Augmented Generation for Multi-Hop Queries
MultiHop-RAG is a new benchmark dataset demonstrating that existing retrieval-augmented generation systems perform poorly on multi-hop queries requiring retrieval and reasoning over multiple evidence pieces.
-
C-Pack: Packed Resources For General Chinese Embeddings
C-Pack releases a new Chinese embedding benchmark, large training dataset, and optimized models that outperform priors by up to 10% on C-MTEB while also delivering English SOTA results.
-
KidnapRAG: A Black-Box Attack for Hijacking Reasoning in Agentic Retrieval-Augmented Generation Systems
KidnapRAG is a sequential black-box poisoning attack on Agentic RAG systems using Bait, Chain-Link, and Mal-Ins documents to redirect retrieval and reasoning, outperforming prior baselines.
-
The Decomposition Is the Fingerprint: Per-Component Identity for Agent Skills
A per-component SimHash fingerprint supplies structural identity for AI agent skills, recovering family membership under paraphrase and refactoring with AUC 0.974 while localizing changes.
-
Field Order Should Not Matter: Permutation-Invariant Embedding Model Fine-Tuning for Structured Metadata Retrieval
Permutation-invariant fine-tuning (PI-FT) randomizes field order and applies dropout during embedding model training to eliminate sensitivity to serialization order, reducing order-change penalty from 7.4 to 0.2 nDCG@10 on a generated multilingual DevDataBench while outperforming zero-shot baselines
-
Semantic insurance pricing with large language models
LLM embeddings from policy text outperform hand-engineered features in a GLM for French motor insurance claim frequency, with larger gains at small sample sizes and further improvement from insurance-specific fine-tuning.
-
SHARD: cell-keyed residual splitting for alignment-resistant private dense retrieval
SHARD introduces cell-keyed residual splitting that turns dense retrieval embeddings into revocable, renewable, unlinkable templates resistant to alignment attacks while preserving exact utility under CKKS reranking.
-
Toward a Hybrid Digital Twin of Society: Quantifying Cognitive-Spatial Linkages Through Online-Offline Feedback Networks
A Feedback Network model is developed showing online semantic exploration is more concentrated than physical mobility, with stable retail-business linkages and greater COVID disruption to spatial than cognitive routines, as a step toward hybrid digital twins of society.
-
Do Neural Retrievers Prefer Certain Documents? Evidence of Learned Relevance Priors
Supervised bi-encoder retrievers encode document-level relevance priors from annotation biases, producing a findability gap for documents lacking favored features such as comprehensiveness and mainstream topic coverage.
-
Policy and World Modeling Co-Training for Language Agents
PaW co-trains policy and world modeling on standard RL rollouts using action-entropy data selection, noise-tolerant loss, and reward-adaptive balancing, yielding consistent gains on three agent benchmarks.
-
Consistent and Distinctive: LLM Benchmark Efficiency via Maximum Independent Set Prompt Selection on Similarity Graphs
A graph-based MIS prompt selection method on embedding similarity graphs yields reduced benchmark subsets with highly consistent LLM rankings (Kendall's W ≥ 0.90 in 99.2% of cases) and 25-48% size reduction at higher thresholds.
-
On the Robustness of Multilingual Text Embedding Rankings Across Learning Tasks, Languages, and Benchmark Datasets
Meta-study of MTEB rankings introduces dataset-composition and ranking-scheme robustness indicators and finds only a small subset of models stay consistently strong across tasks, languages, and evaluation variations.
-
Xetrieval: Mechanistically Explaining Dense Retrieval
Xetrieval enriches sentence embeddings with a single-pass reasoning internalizer and decomposes the result into sparse interpretable features whose overlaps explain individual dense-retrieval decisions.
-
ReasonOps: Operator Segmentation for LLM Reasoning Traces
Unsupervised clustering on sentence-initial 3-token pivots extracts 7 universal reasoning operators from 44k traces across 12 LLMs that enable model fingerprinting and answer-correctness prediction.
-
Search-E1: Self-Distillation Drives Self-Evolution in Search-Augmented Reasoning
Search-E1 uses GRPO interleaved with on-policy self-distillation to reach 0.440 average EM on seven QA benchmarks with Qwen2.5-3B, outperforming open-source baselines.
-
Structure Retention in Embedding Spaces as a Predictor of Benchmark Performance
Embedding model performance on MTEB tasks correlates strongly with nearest-neighbor overlap and ICA magnitude differences in their embedding spaces.
-
When Cases Get Rare: A Retrieval Benchmark for Off-Guideline Clinical Question Answering
OGCaReBench is a new retrieval-focused benchmark for evaluating LLMs on off-guideline clinical questions from real case reports, showing retrieval augmentation raises accuracy from 56% to 82%.
-
When to Stop Reusing: Dynamic Gradient Gating for Sample-Efficient RLVR
Dynamic Gradient Gating monitors lm_head gradient norms to safely reuse rollout batches in RLVR, achieving up to 2.93x sample efficiency and 2.14x wall-clock speedup across math, ALFWorld, WebShop, and QA tasks.
-
S2Aligner: Pair-Efficient and Transferable Pre-Training for Sparse Text-Attributed Graphs
S2Aligner decouples semantic and structural components in LLM-as-Aligner pre-training for sparse TAGs and uses structure-oriented reconstruction plus domain risk balancing to improve transferability and reduce generalization gaps.
-
SD-Search: On-Policy Hindsight Self-Distillation for Search-Augmented Reasoning
SD-Search derives step-level supervision for search queries in reasoning agents via on-policy hindsight self-distillation using the policy as both student and teacher.
-
Layer-wise Representation Dynamics: An Empirical Investigation Across Embedders and Base LLMs
LRD framework with Frenet, NRS, and GFMI metrics shows layer-wise structure in 31 models provides usable signal for model selection and pruning on MTEB tasks.
-
Test-Time Compute for Frozen Embedding Models through Agentic Program Search
Agentic program search over a frozen encoder API yields retrieval programs that improve nDCG@10 on held-out tasks and unseen encoder families with no per-domain training.
-
UTS at PsyDefDetect: Multi-Agent Councils and Absence-Based Reasoning for Defense Mechanism Classification
A deliberative council of Gemini agents using absence-based clinical rules achieves 0.382 F1 without fine-tuning and second place overall at 0.406 F1 on defense mechanism classification, with minority-class overrides adding 2.4pp.
-
TIDE-Bench: Task-Aware and Diagnostic Evaluation of Tool-Integrated Reasoning
TIDE-Bench is a new benchmark for tool-integrated reasoning that combines diverse tasks, multi-aspect metrics covering answer quality, process reliability, efficiency and cost, plus filtered challenging test sets.
-
PiCA: Pivot-Based Credit Assignment for Search Agentic Reinforcement Learning
PiCA uses pivot-based potential rewards derived from historical sub-queries to supply trajectory-aware step guidance in agentic RL, delivering 15% gains on QA benchmarks for 3B/7B models.
-
SearchSkill: Teaching LLMs to Use Search Tools with Evolving Skill Banks
SearchSkill improves LLM query planning on knowledge QA by using explicit skill selection from an evolving SkillBank and a two-stage SFT process that aligns training with inference-time skill-grounded execution.
-
Do not copy and paste! Rewriting strategies for code retrieval
Full natural-language rewriting of code and queries boosts retrieval on code benchmarks while corpus-only rewriting often hurts, with token entropy difference serving as a cheap predictor of gains.
-
RRCM: Ranking-Driven Retrieval over Collaborative and Meta Memories for LLM Recommendation
RRCM trains an LLM to dynamically retrieve from collaborative and meta memories using group relative policy optimization driven by final top-k recommendation quality.
-
Superintelligent Retrieval Agent: The Next Frontier of Agentic Retrieval
SIRA compresses multi-round exploratory retrieval into one corpus-discriminative BM25 action via LLM document enrichment, query-time term prediction, and corpus-statistic filtering, reporting top average performance on ten BEIR benchmarks and strong results on BrowseComp-Wikipedia without relevance
-
Measuring Black-Box Confidence via Reasoning Trajectories: Geometry, Coverage, and Verbalization
Trajectory geometry in embedding space fused with coverage and verbalization yields better black-box CoT confidence estimation than self-consistency at lower sample counts across six benchmark-reasoner pairs.
-
A$^2$TGPO: Agentic Turn-Group Policy Optimization with Adaptive Turn-level Clipping
A²TGPO improves RL policy optimization for multi-turn agentic LLMs by normalizing information gain within same-depth turn groups, rescaling cumulative advantages by sqrt of term count, and modulating clipping ranges per turn's normalized IG.
-
CASCADE: Case-Based Continual Adaptation for Large Language Models During Deployment
CASCADE enables LLMs to continually adapt at deployment via case-based episodic memory and contextual bandits, improving macro-averaged success by 20.9% over zero-shot on 16 tasks spanning medicine, law, code, and robotics.
-
NH-CROP: Robust Pricing for Governed Language Data Assets under Cost Uncertainty
NH-CROP introduces a robust online pricing method for governed language data with uncertain costs, using a selective verification gate that improves or matches baselines without relying heavily on paid information acquisition.
-
Verbal-R3: Verbal Reranker as the Missing Bridge between Retrieval and Reasoning
Verbal-R3 uses a verbal reranker to generate analytic narratives that guide retrieval and reasoning in LLMs, achieving SOTA results on complex QA benchmarks.
-
Kernel Affine Hull Machines as Compute-Efficient Encoders for Frozen Semantic Spaces
KAHM yields a compute-efficient query encoder that outperforms matched learned adapters in reconstructing a frozen Mixedbread embedding space on an Austrian-law retrieval task while delivering an 8.53x CPU speedup.
-
Is Textual Similarity Invariant under Machine Translation? Evidence Based on the Political Manifesto Corpus
Machine translation preserves embedding similarity structure for ten languages but distorts it for four in the Manifesto Corpus, via a new non-inferiority testing framework.
-
Iterative Definition Refinement for Zero-Shot Classification via LLM-Based Semantic Prototype Optimization
Iterative LLM-based refinement of category definitions improves zero-shot classification performance across 13 embedding models on a new 10-category web URL benchmark.
-
AutoSearch: Adaptive Search Depth for Efficient Agentic RAG via Reinforcement Learning
AutoSearch applies RL with a self-answering reward to adaptively determine minimal sufficient search depth in agentic RAG, reducing over-searching while maintaining answer quality on complex questions.
-
RoTRAG: Rule of Thumb Reasoning for Conversation Harm Detection with Retrieval-Augmented Generation
RoTRAG retrieves Rules of Thumb to ground LLM reasoning for harm detection and severity classification in multi-turn dialogues, reporting roughly 40% relative F1 gains and 8.4% lower distributional error on two safety benchmarks while cutting redundant retrieval.
-
REZE: Representation Regularization for Domain-adaptive Text Embedding Pre-finetuning
REZE controls representation shifts in contrastive pre-finetuning of text embeddings via eigenspace decomposition of anchor-positive pairs and adaptive soft-shrinkage on task-variant directions.
-
Sketching the Readout of Large Language Models for Scalable Data Attribution and Valuation
RISE applies CountSketch to dual lexical and semantic channels derived from output-layer gradient outer products, cutting data attribution storage by up to 112x and enabling retrospective and prospective influence analysis on LLMs up to 32B parameters.
-
ViLL-E: Video LLM Embeddings for Retrieval
ViLL-E introduces a dynamic embedding mechanism and joint contrastive-generative training for VideoLLMs, delivering up to 7% gains in temporal localization and 4% in video retrieval while enabling new zero-shot capabilities.
-
DualView: Adaptive Local-Global Fusion for Multi-Hop Document Reranking
DualView fuses local cross-attention and global context aggregation via adaptive gating to rerank fixed candidate sets for multi-hop QA, reporting 99.4% Top-4 Recall on MuSiQue at 4 ms latency while beating larger cross-encoders.
-
Rag Performance Prediction for Question Answering
A novel supervised predictor modeling semantic relationships among question, retrieved passages, and generated answer best forecasts when RAG improves QA performance.
-
Reasoning-Based Refinement of Unsupervised Text Clusters with LLMs
LLM reasoning refines unsupervised text clusters via coherence checks, redundancy removal, and label grounding, yielding better coherence and human-aligned labels on social media data.
-
AV-SQL: Decomposing Complex Text-to-SQL Queries with Agentic Views
AV-SQL uses a pipeline of LLM agents to generate intermediate CTE views that decompose complex Text-to-SQL queries, reaching 70.38% execution accuracy on Spider 2.0.
-
Data, Not Model: Explaining Bias toward LLM Texts in Neural Retrievers
Bias toward LLM texts in neural retrievers arises from artifact imbalances between positive and negative documents in training data that are absorbed during contrastive learning.
-
JU\'A -- A Benchmark for Information Retrieval in Brazilian Legal Text Collections
JU'A is a new heterogeneous benchmark for Brazilian legal IR that distinguishes retrieval methods and shows domain-adapted models excel on aligned subsets while BM25 stays competitive elsewhere.
-
Are LLM-Based Retrievers Worth Their Cost? An Empirical Study of Efficiency, Robustness, and Reasoning Overhead
Empirical comparison across 14 retrievers on the BRIGHT benchmark shows reasoning-specialized models can match strong accuracy with competitive speed while many large LLM bi-encoders add latency for small gains and confidence scores remain poorly calibrated.
-
Stream2LLM: Overlap Context Streaming and Prefill for Reduced Time-to-First-Token (TTFT)
Stream2LLM introduces adaptive scheduling and preemption for append-mode and update-mode context streaming in disaggregated LLM deployments, delivering up to 11x TTFT improvements on real-world workloads while preserving throughput.