Text Embeddings by Weakly-Supervised Contrastive Pre-training

Binxing Jiao, Daxin Jiang, Liang Wang, Linjun Yang, Nan Yang, Xiaolong Huang · 2022 · cs.CL · arXiv 2212.03533

Mixed citation behavior: the most common role is method, at 39% of classified citations.

177 Pith papers cite it.

abstract

This paper presents E5, a family of state-of-the-art text embeddings that transfer well to a wide range of tasks. The model is trained in a contrastive manner with weak supervision signals from our curated large-scale text pair dataset (called CCPairs). E5 can be readily used as a general-purpose embedding model for any tasks requiring a single-vector representation of texts such as retrieval, clustering, and classification, achieving strong performance in both zero-shot and fine-tuned settings. We conduct extensive evaluations on 56 datasets from the BEIR and MTEB benchmarks. For zero-shot settings, E5 is the first model that outperforms the strong BM25 baseline on the BEIR retrieval benchmark without using any labeled data. When fine-tuned, E5 obtains the best results on the MTEB benchmark, beating existing embedding models with 40x more parameters.
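
The abstract says the model is trained "in a contrastive manner" on text pairs but does not spell out the objective. As a rough illustration only, the sketch below uses an InfoNCE-style loss with in-batch negatives, a common choice for this kind of training; this page does not confirm that it is the exact objective used. The function names, the temperature value, and the toy vectors are all assumptions made for the example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce_loss(queries, passages, temperature=0.05):
    """Mean InfoNCE loss over a batch of text-pair embeddings.

    passages[i] is treated as the positive for queries[i]; every other
    passage in the batch serves as a negative. The temperature is an
    illustrative value, not one taken from the paper.
    """
    total = 0.0
    for i, query in enumerate(queries):
        logits = [cosine(query, p) / temperature for p in passages]
        # log-sum-exp with max subtraction for numerical stability
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_denominator - logits[i]
    return total / len(queries)


# Toy 2-d "embeddings" (hypothetical values, not real model outputs).
queries = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]   # each positive is close to its query
shuffled = [[0.1, 0.9], [0.9, 0.1]]  # positives swapped between queries

loss_aligned = info_nce_loss(queries, aligned)
loss_shuffled = info_nce_loss(queries, shuffled)
```

With these toy inputs, `loss_aligned` comes out smaller than `loss_shuffled`: the loss rewards batches in which each query is most similar to its own paired passage.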

citation-role summary

background 10 · method 9 · other 2 · baseline 1 · dataset 1

citation-polarity summary

use method 9 · background 8 · support 2 · unclear 2 · baseline 1 · use dataset 1

authors

Binxing Jiao, Daxin Jiang, Liang Wang, Linjun Yang, Nan Yang, Xiaolong Huang

representative citing papers

Is Dimensionality a Barrier for Retrieval Models?

cs.LG · 2026-05-22 · unverdicted · novelty 8.0

Dimension d = O(m^{-2} log n) nearly achieves the optimal margin m^rd(+∞, A) for retrieval embeddings, with matching lower bounds showing d = O(k log(n/k)) suffices and is necessary for m = Θ(k^{-1/2}) on k-sparse query matrices.

STRABLE: Benchmarking Tabular Machine Learning with Strings

cs.LG · 2026-05-12 · unverdicted · novelty 8.0

A new corpus of 108 mixed string-numeric tables shows that advanced tabular learners with basic string embeddings perform well on most real-world data, while large LLM encoders help on free-text heavy tables.

FollowTable: A Benchmark for Instruction-Following Table Retrieval

cs.IR · 2026-05-01 · unverdicted · novelty 8.0

FollowTable is the first large-scale benchmark for instruction-following table retrieval, paired with an Instruction Responsiveness Score, showing that existing models fail to adapt to fine-grained constraints beyond topical similarity.

ALEE: Any-Language Evaluation of Embeddings via English-Centric Minimal Pairs

cs.CL · 2026-06-30 · unverdicted · novelty 7.0

ALEE generates AMR-based English minimal pairs with fine-grained semantic shifts, translates them, and evaluates embedding models on 275+ languages to expose cross-lingual gaps linked to training data and tokenization.

Diagnosing and Mitigating Retrieval Bottlenecks in LLM-Based Cold-Start Recommendation

cs.IR · 2026-06-29 · conditional · novelty 7.0

Retrieval coverage limits LLM rerankers in cold-start recommendation; a learned hybrid fusion improves pool quality but LLM reranking often degrades end-to-end performance while simpler rankers exploit the pool.

Anisotropy Decides Cosine vs. Rank Metrics for Text Embeddings

cs.CL · 2026-06-28 · conditional · novelty 7.0

Anisotropy, quantified by dominant-dimension variance fraction, determines the best parameter-free similarity metric for text embeddings, with rank-based metrics gaining ~20% relative where cosine is weakest.

Agentic Abstention: Do Agents Know When to Stop Instead of Act?

cs.AI · 2026-06-27 · unverdicted · novelty 7.0

LLM agents often fail to abstain at the right time in uncertain multi-turn tasks, and the CONVOLVE context engineering method raises timely abstention rates on WebShop from 26.7 to 57.4 without parameter updates.

A Sensitivity-Aware Test Collection for Search Among Personal Information

cs.IR · 2026-06-25 · accept · novelty 7.0

A new sensitivity-labeled test collection is released from Enron emails with crowdsourced queries, relevance judgments, and LLM extensions for evaluating sensitivity-aware search.

The Voronoi Bottleneck: Capacity-Aware Dense Retrieval for Product Search

cs.IR · 2026-06-09 · unverdicted · novelty 7.0

Proves Voronoi complexity equals sign-rank for top-1 retrieval, introduces CUS diagnostic predicting retrieval failure at AUC >0.8 without labels, and AT-DW-InfoNCE objective with derived alpha^*=2.0 that improves Recall@100 on synthetic data.

Co-Evolving Skill Generation and Policy Optimization

cs.CL · 2026-06-07 · unverdicted · novelty 7.0

Framework estimates context-dependent marginal utility of candidate skills via reward gaps in matched base vs. skill-augmented rollouts to filter skills and co-train policy as generator.

Fast LLM-Based Semantic Filtering: From a Unified Framework to an Adaptive Two-Phase Method

cs.DB · 2026-06-06 · unverdicted · novelty 7.0

An adaptive two-phase semantic filter using clustering then a hybrid proxy trained on LLM confidence achieves 1.6-2.0x speedup over prior methods at 90% accuracy on 10K document corpora.

Operation-Guided Progressive Human-to-AI Text Transformation Benchmark for Multi-Granularity AI-Text Detection

cs.CL · 2026-06-04 · unverdicted · novelty 7.0

OpAI-Bench provides a new benchmark for evaluating AI-text detectors on progressively human-AI co-edited documents at multiple granularities, revealing non-monotonic detection patterns.

ImageAuditor: Membership Inference Attack against Image-based Retrieval-Augmented Generation

cs.CR · 2026-06-02 · unverdicted · novelty 7.0

ImageAuditor is the first MIA for IRAG that achieves over 80% AUROC with four queries by using reward-guided policy optimization for cross-modal retrieval and task-specific prompting for signal extraction.

SEA-Embedding: Open and Reproducible Text Embeddings for Southeast Asia

cs.CL · 2026-06-02 · unverdicted · novelty 7.0

SEA-Embedding is a fully open text embedding pipeline for Southeast Asian languages that achieves state-of-the-art performance on the SEA-BED benchmark by analyzing data composition, training objectives, and base encoder choices.

MaskForge: Structure-Aware Adaptive Attacks for Jailbreaking Diffusion Large Language Models

cs.CR · 2026-06-01 · unverdicted · novelty 7.0

MaskForge reaches 79.3% average attack success rate on five dLLMs by adaptively searching and accumulating structural attack patterns with a UCB bandit, improving 17.6% over baselines and transferring to 88.2% on AdvBench.

Test-Time Training for Zero-Resource Dense Retrieval Reranking

cs.IR · 2026-05-31 · unverdicted · novelty 7.0

DART adapts a scoring matrix at inference time via gradient updates on pseudo-labels from top/bottom documents to gain +2.1% mean NDCG@10 on six BEIR benchmarks with under 10ms added latency.

Skill or Skip? Learning Selective Skill Invocation in Agentic Tasks via Dual-Granularity Preference Learning

cs.CL · 2026-05-30 · unverdicted · novelty 7.0

SelSkill applies dual-granularity preference learning to selective skill-or-skip decisions, improving task success by 10.9 points and execution precision by 29.1 points on ALFWorld with Qwen3-8B.

Hijacking Agent Memory: Stealthy Trojan Attacks Through Conversational Interaction

cs.CR · 2026-05-28 · unverdicted · novelty 7.0

MemPoison enables stealthy memory poisoning in LLM agents via dialogue by using semantic relational bridges, entity masquerading, and joint embedding optimization to bypass selective extraction and rewriting, achieving up to 0.95 attack success rate.

Towards Cost-effective LLMs Routing with Batch Prompting

cs.DB · 2026-05-27 · unverdicted · novelty 7.0

RoBatch is a two-stage framework that formulates and solves the joint Route with Batching Problem via a batch-aware proxy utility model and greedy scheduling, outperforming separate routing or batching baselines on six benchmarks.

The Harder Text Embedding Benchmark (HTEB): Beyond One-dimensional Static Robustness

cs.CL · 2026-05-27 · unverdicted · novelty 7.0

HTEB introduces dynamic, multi-axis evaluation of text embedding robustness using LLM transformations, finding decoupled profiles across models and that scaling does not close all robustness gaps.

IdioLink: Retrieving Meaning Beyond Words Across Idiomatic and Literal Expressions

cs.CL · 2026-05-21 · unverdicted · novelty 7.0

IdioLink introduces a benchmark dataset and evaluation showing that strong embedding models struggle to retrieve equivalent meanings across idiomatic and literal forms, relying on shallow cues instead.

Generative Conversational Recommender System

cs.IR · 2026-05-21 · unverdicted · novelty 7.0

A single autoregressive model for conversational recommendation that uses semantic item IDs, predicts response intent and target first, then generates the response, reporting up to 29% Recall@1 gains.

Linked Multi-Model Data on Russian Domestic and Foreign Policy Speeches

cs.CL · 2026-05-15 · unverdicted · novelty 7.0

A new linked multimodal dataset of Russian domestic and foreign policy speeches with texts, images, captions, harmonized metadata, and expert-refined topic annotations is introduced to support analyses in political communication and LLM applications.

Think When Needed: Adaptive Reasoning-Driven Multimodal Embeddings with a Dual-LoRA Architecture

cs.CV · 2026-05-14 · unverdicted · novelty 7.0

TWN attaches separate reasoning and embedding LoRA adapters to a frozen backbone with gradient detachment and a self-supervised gate that decides per input whether to generate CoT, achieving SOTA on MMEB-V2 with 3-5% added parameters and up to 50% fewer reasoning tokens.

citing papers explorer

Showing 50 of 177 citing papers.

Reasoning-Based Refinement of Unsupervised Text Clusters with LLMs cs.CL · 2026-04-08 · unverdicted · none · ref 50 · internal anchor
LLM reasoning refines unsupervised text clusters via coherence checks, redundancy removal, and label grounding, yielding better coherence and human-aligned labels on social media data.
AV-SQL: Decomposing Complex Text-to-SQL Queries with Agentic Views cs.DB · 2026-04-08 · unverdicted · none · ref 44 · internal anchor
AV-SQL uses a pipeline of LLM agents to generate intermediate CTE views that decompose complex Text-to-SQL queries, reaching 70.38% execution accuracy on Spider 2.0.
Data, Not Model: Explaining Bias toward LLM Texts in Neural Retrievers cs.IR · 2026-04-07 · unverdicted · none · ref 22 · internal anchor
Bias toward LLM texts in neural retrievers arises from artifact imbalances between positive and negative documents in training data that are absorbed during contrastive learning.
JUÁ -- A Benchmark for Information Retrieval in Brazilian Legal Text Collections cs.IR · 2026-04-07 · accept · none · ref 19 · internal anchor
JUÁ is a new heterogeneous benchmark for Brazilian legal IR that distinguishes retrieval methods and shows domain-adapted models excel on aligned subsets while BM25 stays competitive elsewhere.
Are LLM-Based Retrievers Worth Their Cost? An Empirical Study of Efficiency, Robustness, and Reasoning Overhead cs.IR · 2026-04-04 · accept · none · ref 52 · internal anchor
Empirical comparison across 14 retrievers on the BRIGHT benchmark shows reasoning-specialized models can match strong accuracy with competitive speed while many large LLM bi-encoders add latency for small gains and confidence scores remain poorly calibrated.
Stream2LLM: Overlap Context Streaming and Prefill for Reduced Time-to-First-Token (TTFT) cs.DB · 2026-03-29 · unverdicted · none · ref 1 · 2 links · internal anchor
Stream2LLM introduces adaptive scheduling and preemption for append-mode and update-mode context streaming in disaggregated LLM deployments, delivering up to 11x TTFT improvements on real-world workloads while preserving throughput.
ASTRA: Mapping Art-Technology Institutions via Conceptual Axes, Text Embeddings, and Unsupervised Clustering cs.DL · 2026-03-28 · accept · none · ref 28 · internal anchor
ASTRA combines an eight-axis conceptual framework with text embeddings and unsupervised clustering to map and group 78 art-technology institutions into coherent thematic clusters.
Learning to Configure Agentic AI Systems cs.AI · 2026-02-12 · unverdicted · none · ref 15 · 2 links · internal anchor
ARC learns per-query agent configurations via a lightweight hierarchical SMDP policy, delivering 31.3% higher reasoning accuracy, 13.95% higher tool-use accuracy, and doubled success on an agent benchmark compared to budget-matched baselines.
ReSeek: A Self-Correcting Framework for Search Agents with Instructive Rewards cs.CL · 2025-10-01 · unverdicted · none · ref 19 · internal anchor
ReSeek adds self-correction via a JUDGE action and a dense instructive reward (correctness plus utility) to RL training of search agents, yielding higher success and faithfulness on a new contamination-resistant benchmark.
Progressive Multimodal Search and Reasoning for Knowledge-Intensive Visual Question Answering cs.CV · 2025-08-31 · unverdicted · none · ref 43 · internal anchor
PMSR progressively constructs structured reasoning trajectories with dual-scope queries and compositional reasoning to improve knowledge acquisition and answer accuracy in knowledge-intensive VQA.
Causal2Vec: Improving Decoder-only LLMs as Embedding Models through a Contextual Token cs.CL · 2025-07-31 · conditional · none · ref 23 · internal anchor
Causal2Vec prepends a BERT-generated contextual token to decoder-only LLMs and pools its hidden state with the EOS token to reach new SOTA on MTEB among public-data-trained embedding models.
MEM1: Learning to Synergize Memory and Reasoning for Efficient Long-Horizon Agents cs.CL · 2025-06-18 · unverdicted · none · ref 54 · internal anchor
MEM1 uses end-to-end RL to learn constant-memory agents that update a shared state for memory and reasoning, delivering 3.5x better performance and 3.7x lower memory use than larger baselines on long-horizon QA and shopping tasks.
ReSearch: Learning to Reason with Search for LLMs via Reinforcement Learning cs.AI · 2025-03-25 · unverdicted · none · ref 36 · internal anchor
ReSearch trains LLMs via RL to integrate search operations into reasoning steps, achieving strong generalization across benchmarks and eliciting reflection and self-correction without supervised reasoning data.
NV-Embed: Improved Techniques for Training LLMs as Generalist Embedding Models cs.CL · 2024-05-27 · accept · none · ref 101 · internal anchor
NV-Embed achieves first place on the MTEB leaderboard across 56 tasks by combining a latent attention layer, causal-mask removal, two-stage contrastive training, and data curation for LLM-based embedding models.
KbSD: Knowledge Boundary aware Self-Distillation for Behavioral Calibration in Agentic Search cs.CL · 2026-06-29 · unverdicted · none · ref 20 · internal anchor
KbSD uses a same-size hint-augmented teacher and quadrant-adaptive KL objectives to deliver dense supervision for calibrated behavior across knowledge states in agentic search.
ARMOR: Adaptive Retriever Optimization for Low-Resource Telecom Question Answering cs.IR · 2026-06-29 · unverdicted · none · ref 38 · internal anchor
ARMOR optimizes retrievers via joint RAG-likelihood and InfoNCE training with regularization toward the base encoder, yielding improved retrieval and QA on telecom benchmarks.
Internalizing the Future: A Unified Agentic Training Paradigm for World Model Planning cs.AI · 2026-06-25 · unverdicted · none · ref 54 · internal anchor
A three-stage training pipeline internalizes world-model simulation and success estimation in LLM agents for improved planning on search and math tasks.
Hybrid privacy-aware semantic search: SVD-truncated document geometry and CKKS-encrypted query reranking under a restricted threat model cs.CR · 2026-06-24 · unverdicted · none · ref 2 · internal anchor
Hybrid privacy method for semantic search truncates and rotates document vectors geometrically while encrypting queries with CKKS, preserving retrieval quality on 1M-document corpora under a restricted threat model.
ReCal: Reward Calibration for RL-based LLM Routing cs.LG · 2026-06-10 · unverdicted · none · ref 33 · internal anchor
ReCal introduces hierarchical reward decomposition and distribution-aware optimization to address ambiguous credit assignment and optimization bias in RL-based LLM routing.
PromptPrint: Behavioral Biometrics Through Natural Language Prompting in LLMs cs.CL · 2026-06-04 · unverdicted · none · ref 22 · internal anchor
Short LLM prompts contain distinctive lexical signals enabling user identification as a behavioral biometric, with lexical features outperforming semantic ones across a dataset of 20k+ prompts from 1k users.
When Is 0.1% Enough? Analyzing the Combined Effects of Dimensionality Reduction and Quantization on Text Embedding Compression cs.CL · 2026-05-31 · unverdicted · none · ref 13 · internal anchor
Combining dimensionality reduction and quantization compresses text embeddings to 0.1% size with minimal performance loss on MTEB tasks, outperforming either technique alone.
SemStruct: Contextualizing Semantic Embeddings with Structural Information for Schema Matching cs.LG · 2026-05-29 · unverdicted · none · ref 35 · internal anchor
SemStruct models tables as heterogeneous graphs with GNNs on frozen PLM embeddings to incorporate row co-occurrences for schema matching and reports SOTA results on Valentine and SOTAB-SM benchmarks.
DocRetriever: A Plug-and-Play Framework for Multimodal Document Retrieval with Comprehensive Benchmark cs.CV · 2026-05-28 · unverdicted · none · ref 63 · internal anchor
DocRetriever introduces a framework using layout-aware sparse embeddings for hybrid encoding without OCR and a generalizable reasoning-augmented reranker for few-shot settings, plus the MultiDocR benchmark for evaluation.
ConvMemory: A Lightweight Learned Memory Reranker, a Negative Attribution Result, and a Research-Preview Conflict Editor cs.CL · 2026-05-27 · unverdicted · none · ref 9 · internal anchor
ConvMemory delivers competitive recall at far lower latency than larger rerankers for long-term conversational memory while a multi-seed ablation refutes temporal-structure exploitation as the operative mechanism.
LRanker: LLM Ranker for Massive Candidates cs.IR · 2026-05-27 · unverdicted · none · ref 23 · internal anchor
LRanker combines K-means candidate aggregation with graph-partitioned ensemble of query embeddings to improve LLM ranking accuracy and scalability on massive candidate pools, reporting 3-30% gains on RBench tasks up to 6.8M candidates.
Large Language Model-Powered Query-Driven Event Timeline Summarization in Industrial Search cs.CL · 2026-05-26 · unverdicted · none · ref 34 · internal anchor
QDET deploys a 7B-parameter model fine-tuned with three auxiliary tasks and RL that matches a 671B model's F1 on query-driven timeline summarization while delivering measurable gains in production search metrics.
Benchmarking Patent Embeddings: A Multi-Task Evaluation of 22 Models Across Retrieval, Classification, and Clustering cs.IR · 2026-05-22 · unverdicted · none · ref 14 · internal anchor
Multi-task evaluation of 22 patent embedding models finds task-specific fine-tuning benefits and significant cross-landscape retrieval degradation that cannot be fixed by hybrid fusion.
Not All RAGs Are Created Equal: A Component-Wise Empirical Study for Software Engineering Tasks cs.SE · 2026-05-14 · unverdicted · none · ref 46 · internal anchor
Retriever-side choices, particularly the retrieval algorithm, exert more influence on RAG performance than generator selection across code generation, summarization, and repair tasks.
Scaling Retrieval-Augmented Reasoning with Parallel Search and Explicit Merging cs.AI · 2026-05-13 · unverdicted · none · ref 34 · internal anchor
MultiSearch uses parallel multi-query retrieval plus explicit merging inside a reinforcement-learning loop to improve retrieval-augmented reasoning, outperforming baselines on seven QA benchmarks.
Skill1: Unified Evolution of Skill-Augmented Agents via Reinforcement Learning cs.AI · 2026-05-07 · unverdicted · none · ref 42 · 3 links · internal anchor
Skill1 trains a single RL policy to co-evolve skill selection, utilization, and distillation in language model agents from one task-outcome reward, using low-frequency trends to credit selection and high-frequency variation to credit distillation, outperforming baselines on ALFWorld and WebShop.
SURGE: SuperBatch Unified Resource-efficient GPU Encoding for Heterogeneous Partitioned Data cs.DC · 2026-05-01 · unverdicted · none · ref 44 · internal anchor
SURGE achieves fixed-batch throughput for GPU embedding generation on 800M texts across 40k partitions using 12.6x less memory, 68x faster time-to-first-output, and fault tolerance via a streaming two-threshold policy with an analytical cost model accurate to 2%.
Generating Place-Based Compromises Between Two Points of View cs.CL · 2026-04-27 · unverdicted · none · ref 71 · internal anchor
Empathic similarity feedback in prompts generates more acceptable compromises than chain-of-thought, and margin-based training on the resulting data lets smaller models produce them without ongoing empathy estimation.
ProMMSearchAgent: A Generalizable Multimodal Search Agent Trained with Process-Oriented Rewards cs.CV · 2026-04-22 · unverdicted · none · ref 28 · internal anchor
A sandbox-trained multimodal search agent with process-oriented rewards transfers zero-shot to real Google Search and outperforms prior methods on FVQA, InfoSeek, and MMSearch.
BioHiCL: Hierarchical Multi-Label Contrastive Learning for Biomedical Retrieval with MeSH Labels cs.IR · 2026-04-17 · unverdicted · none · ref 5 · internal anchor
BioHiCL applies hierarchical multi-label contrastive learning with MeSH annotations to improve biomedical retrieval, sentence similarity, and question answering using small efficient models.
Lit2Vec: A Reproducible Workflow for Building a Legally Screened Chemistry Corpus from S2ORC for Downstream Retrieval and Text Mining cs.DB · 2026-04-14 · unverdicted · none · ref 4 · internal anchor
Lit2Vec delivers a documented, reproducible pipeline that extracts and annotates a large licensed chemistry paper corpus from S2ORC with paragraph embeddings and subfield labels.
RECIPER: A Dual-View Retrieval Pipeline for Procedure-Oriented Materials Question Answering eess.SP · 2026-04-13 · unverdicted · none · ref 16 · internal anchor
RECIPER improves procedure-oriented retrieval from materials papers by combining paragraph-level dense retrieval with LLM-extracted procedural summaries and lightweight reranking, yielding average gains of +3.73 Recall@1 and better downstream QA.
Social Life of Code: Modeling Evolution through Code Embedding and Opinion Dynamics cs.SE · 2026-02-17 · unverdicted · none · ref 22 · internal anchor
Code embeddings combined with the Expressed-Private Opinion model produce trajectories that quantify developer influence and consensus formation across three open-source repositories.
Robustness Risk of Conversational Retrieval: Identifying and Mitigating Noise Sensitivity in Qwen3-Embedding Model cs.IR · 2026-02-03 · unverdicted · none · ref 7 · internal anchor
Qwen3-embedding models show noise sensitivity in conversational retrieval where dialogue artifacts rank highly despite lacking semantic value, a problem reduced by query prompting and more severe than in prior Qwen versions or other baselines.
Legal Retrieval for Public Defenders cs.IR · 2026-01-20 · conditional · none · ref 39 · internal anchor
NJ BriefBank is a domain-adapted legal retrieval tool for public defenders that improves on standard benchmarks by incorporating legal reasoning, domain data, and synthetic examples, with a new released taxonomy and annotated evaluation dataset.
Can Textual Reasoning Improve the Performance of MLLMs on Fine-grained Visual Classification? cs.CV · 2026-01-11 · unverdicted · none · ref 38 · internal anchor
Longer textual reasoning chains degrade MLLM accuracy on fine-grained visual tasks; a new normalization and constrained-reward training framework mitigates the effect and sets new SOTA numbers.
Tokenizing Buildings: A Transformer for Layout Synthesis cs.CV · 2025-12-04 · unverdicted · none · ref 25 · internal anchor
SBM tokenizes building rooms via a sparse attribute-feature matrix and trains a Transformer for high-fidelity embeddings plus autoregressive layout generation, yielding better retrieval and fewer layout errors than baselines.
Attention Grounded Enhancement for Visual Document Retrieval cs.IR · 2025-11-17 · unverdicted · none · ref 53 · internal anchor
AGREE boosts visual document retrieval by adding local relevance signals from MLLM attention maps to global document labels during retriever training.
Search-R3: Unifying Reasoning and Embedding in Large Language Models cs.CL · 2025-10-08 · unverdicted · none · ref 75 · internal anchor
Search-R3 trains LLMs to output search embeddings as a direct product of step-by-step reasoning via supervised pre-training and a specialized RL environment that avoids full corpus re-encoding.
Erase to Improve: Erasable Reinforcement Learning for Search-Augmented LLMs cs.CL · 2025-10-01 · unverdicted · none · ref 41 · internal anchor
ERL trains LLMs to erase faulty reasoning steps and regenerate them in place, yielding gains of up to 8.48% EM on multi-hop QA benchmarks like HotpotQA.
LTRR: Learning To Rank Retrievers for LLMs cs.CL · 2025-06-16 · unverdicted · none · ref 42 · internal anchor
LTRR learns to rank a pool of retrievers by their expected contribution to RAG answer correctness and shows that query-dependent selection beats the best single retriever on QA benchmarks.
Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference cs.CL · 2024-12-18 · unverdicted · none · ref 195 · internal anchor
ModernBERT is a new bidirectional encoder model achieving SOTA performance on diverse classification and retrieval benchmarks while offering superior speed and memory efficiency for long-context inference.
Multilingual E5 Text Embeddings: A Technical Report cs.CL · 2024-02-08 · unverdicted · none · ref 20 · internal anchor
Open-source multilingual E5 embedding models are trained via contrastive pre-training on 1 billion text pairs and fine-tuning, with an instruction-tuned model matching English SOTA performance.
Data-CUBE: Data Curriculum for Instruction-based Sentence Representation Learning cs.CL · 2024-01-07 · unverdicted · none · ref 47 · internal anchor
Data-CUBE applies a two-level curriculum (TSP-based task ordering via simulated annealing plus difficulty-sorted mini-batches) to multi-task instruction tuning and reports gains on MTEB sentence representation tasks.
Towards General Text Embeddings with Multi-stage Contrastive Learning cs.CL · 2023-08-07 · unverdicted · none · ref 83 · internal anchor
GTE_base is a compact text embedding model using multi-stage contrastive learning on diverse data that outperforms OpenAI's API and 10x larger models on massive benchmarks and works for code as text.
R²-Searcher: Calibrating Retrieval and Reasoning Boundaries for Agentic Search cs.IR · 2026-06-26 · unverdicted · none · ref 44 · internal anchor
R²-Searcher introduces fine-grained evidence modeling, retrieval reflection, and R²PO RL to calibrate retrieval-reasoning boundaries and improve multi-hop QA performance.
