Text Embeddings by Weakly-Supervised Contrastive Pre-training

Binxing Jiao, Daxin Jiang, Liang Wang, Linjun Yang, Nan Yang, Xiaolong Huang · 2022 · cs.CL · arXiv 2212.03533

Mixed citation behavior. Most common role is method (39%).

184 Pith papers citing it

Method 39% of classified citations

abstract

This paper presents E5, a family of state-of-the-art text embeddings that transfer well to a wide range of tasks. The model is trained in a contrastive manner with weak supervision signals from our curated large-scale text pair dataset (called CCPairs). E5 can be readily used as a general-purpose embedding model for any tasks requiring a single-vector representation of texts such as retrieval, clustering, and classification, achieving strong performance in both zero-shot and fine-tuned settings. We conduct extensive evaluations on 56 datasets from the BEIR and MTEB benchmarks. For zero-shot settings, E5 is the first model that outperforms the strong BM25 baseline on the BEIR retrieval benchmark without using any labeled data. When fine-tuned, E5 obtains the best results on the MTEB benchmark, beating existing embedding models with 40x more parameters.
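The abstract describes contrastive training on text pairs but does not state the objective. A common choice for this kind of setup is an InfoNCE loss with in-batch negatives. The sketch below illustrates that general technique only and is not taken from the paper's code. The function names, the temperature value, and the toy vectors are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce_loss(queries, passages, temperature=0.05):
    """Mean InfoNCE loss over a batch of paired embeddings.

    passages[i] is treated as the positive for queries[i]; every other
    passage in the batch acts as a negative. The temperature default is
    an illustrative assumption, not a value confirmed by the source.
    """
    total = 0.0
    for i, q in enumerate(queries):
        logits = [cosine(q, p) / temperature for p in passages]
        # Log-sum-exp with the max subtracted for numerical stability.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(x - m) for x in logits))
        # Negative log-probability of the positive passage.
        total += log_denom - logits[i]
    return total / len(queries)


# Toy 2-d embeddings (hypothetical): aligned pairs vs. swapped pairs.
queries = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[1.0, 0.0], [0.0, 1.0]]
swapped = [[0.0, 1.0], [1.0, 0.0]]

loss_aligned = info_nce_loss(queries, aligned)
loss_swapped = info_nce_loss(queries, swapped)
```

In this toy batch the loss is near zero when each query matches its own passage and large when the pairs are swapped, which is the behavior a contrastive objective is meant to reward.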

citation-role summary

background 10 · method 9 · other 2 · baseline 1 · dataset 1

citation-polarity summary

use method 9 · background 8 · support 2 · unclear 2 · baseline 1 · use dataset 1

representative citing papers

Is Dimensionality a Barrier for Retrieval Models?

cs.LG · 2026-05-22 · unverdicted · novelty 8.0

Dimension d = O(m^{-2} log n) nearly achieves the optimal margin m^rd(+∞, A) for retrieval embeddings, with matching lower bounds showing d = O(k log(n/k)) suffices and is necessary for m = Θ(k^{-1/2}) on k-sparse query matrices.

STRABLE: Benchmarking Tabular Machine Learning with Strings

cs.LG · 2026-05-12 · unverdicted · novelty 8.0

A new corpus of 108 mixed string-numeric tables shows that advanced tabular learners with basic string embeddings perform well on most real-world data, while large LLM encoders help on free-text heavy tables.

FollowTable: A Benchmark for Instruction-Following Table Retrieval

cs.IR · 2026-05-01 · unverdicted · novelty 8.0

FollowTable is the first large-scale benchmark for instruction-following table retrieval, paired with an Instruction Responsiveness Score, showing that existing models fail to adapt to fine-grained constraints beyond topical similarity.

Embedding Inference Attack

cs.CR · 2026-07-01 · unverdicted · novelty 7.0

Tailored queries enable identification of the embedding model used by a black-box IR system from the unordered set of retrieved documents, even when a reranker is present.

ALEE: Any-Language Evaluation of Embeddings via English-Centric Minimal Pairs

cs.CL · 2026-06-30 · unverdicted · novelty 7.0

ALEE generates AMR-based English minimal pairs with fine-grained semantic shifts, translates them, and evaluates embedding models on 275+ languages to expose cross-lingual gaps linked to training data and tokenization.

Diagnosing and Mitigating Retrieval Bottlenecks in LLM-Based Cold-Start Recommendation

cs.IR · 2026-06-29 · conditional · novelty 7.0

Retrieval coverage limits LLM rerankers in cold-start recommendation; a learned hybrid fusion improves pool quality but LLM reranking often degrades end-to-end performance while simpler rankers exploit the pool.

Anisotropy Decides Cosine vs. Rank Metrics for Text Embeddings

cs.CL · 2026-06-28 · conditional · novelty 7.0

Anisotropy, quantified by dominant-dimension variance fraction, determines the best parameter-free similarity metric for text embeddings, with rank-based metrics gaining ~20% relative where cosine is weakest.

Agentic Abstention: Do Agents Know When to Stop Instead of Act?

cs.AI · 2026-06-27 · unverdicted · novelty 7.0

LLM agents often fail to abstain at the right time in uncertain multi-turn tasks, and the CONVOLVE context engineering method raises timely abstention rates on WebShop from 26.7 to 57.4 without parameter updates.

A Sensitivity-Aware Test Collection for Search Among Personal Information

cs.IR · 2026-06-25 · accept · novelty 7.0

A new sensitivity-labeled test collection is released from Enron emails with crowdsourced queries, relevance judgments, and LLM extensions for evaluating sensitivity-aware search.

Lost in a Single Vector: Improving Long-Document Retrieval with Chunk Evidence Aggregation

cs.CL · 2026-06-17 · unverdicted · novelty 7.0

DICE aggregates independently encoded document chunks into a single vector to reduce evidence dilution in long-document dense retrieval, reporting gains on LongEmbed especially beyond 4k tokens.

Non-negative Elastic Net Decoding for Information Retrieval

cs.IR · 2026-06-16 · unverdicted · novelty 7.0

NNN decoding selects documents via non-negative elastic net reconstruction of the query embedding, with a theorem showing it strictly dominates dense retrieval on correlated corpora and experiments showing gains over inner-product baselines.

The Voronoi Bottleneck: Capacity-Aware Dense Retrieval for Product Search

cs.IR · 2026-06-09 · unverdicted · novelty 7.0

Proves Voronoi complexity equals sign-rank for top-1 retrieval, introduces CUS diagnostic predicting retrieval failure at AUC >0.8 without labels, and AT-DW-InfoNCE objective with derived alpha^*=2.0 that improves Recall@100 on synthetic data.

Co-Evolving Skill Generation and Policy Optimization

cs.CL · 2026-06-07 · unverdicted · novelty 7.0

Framework estimates context-dependent marginal utility of candidate skills via reward gaps in matched base vs. skill-augmented rollouts to filter skills and co-train policy as generator.

Fast LLM-Based Semantic Filtering: From a Unified Framework to an Adaptive Two-Phase Method

cs.DB · 2026-06-06 · unverdicted · novelty 7.0

An adaptive two-phase semantic filter using clustering then a hybrid proxy trained on LLM confidence achieves 1.6-2.0x speedup over prior methods at 90% accuracy on 10K document corpora.

Operation-Guided Progressive Human-to-AI Text Transformation Benchmark for Multi-Granularity AI-Text Detection

cs.CL · 2026-06-04 · unverdicted · novelty 7.0

OpAI-Bench provides a new benchmark for evaluating AI-text detectors on progressively human-AI co-edited documents at multiple granularities, revealing non-monotonic detection patterns.

ImageAuditor: Membership Inference Attack against Image-based Retrieval-Augmented Generation

cs.CR · 2026-06-02 · unverdicted · novelty 7.0

ImageAuditor is the first MIA for IRAG that achieves over 80% AUROC with four queries by using reward-guided policy optimization for cross-modal retrieval and task-specific prompting for signal extraction.

SEA-Embedding: Open and Reproducible Text Embeddings for Southeast Asia

cs.CL · 2026-06-02 · unverdicted · novelty 7.0

SEA-Embedding is a fully open text embedding pipeline for Southeast Asian languages that achieves state-of-the-art performance on the SEA-BED benchmark by analyzing data composition, training objectives, and base encoder choices.

MaskForge: Structure-Aware Adaptive Attacks for Jailbreaking Diffusion Large Language Models

cs.CR · 2026-06-01 · unverdicted · novelty 7.0

MaskForge reaches 79.3% average attack success rate on five dLLMs by adaptively searching and accumulating structural attack patterns with a UCB bandit, improving 17.6% over baselines and transferring to 88.2% on AdvBench.

Test-Time Training for Zero-Resource Dense Retrieval Reranking

cs.IR · 2026-05-31 · unverdicted · novelty 7.0

DART adapts a scoring matrix at inference time via gradient updates on pseudo-labels from top/bottom documents to gain +2.1% mean NDCG@10 on six BEIR benchmarks with under 10ms added latency.

Skill or Skip? Learning Selective Skill Invocation in Agentic Tasks via Dual-Granularity Preference Learning

cs.CL · 2026-05-30 · unverdicted · novelty 7.0

SelSkill applies dual-granularity preference learning to selective skill-or-skip decisions, improving task success by 10.9 points and execution precision by 29.1 points on ALFWorld with Qwen3-8B.

Hijacking Agent Memory: Stealthy Trojan Attacks Through Conversational Interaction

cs.CR · 2026-05-28 · unverdicted · novelty 7.0

MemPoison enables stealthy memory poisoning in LLM agents via dialogue by using semantic relational bridges, entity masquerading, and joint embedding optimization to bypass selective extraction and rewriting, achieving up to 0.95 attack success rate.

Towards Cost-effective LLMs Routing with Batch Prompting

cs.DB · 2026-05-27 · unverdicted · novelty 7.0

RoBatch is a two-stage framework that formulates and solves the joint Route with Batching Problem via a batch-aware proxy utility model and greedy scheduling, outperforming separate routing or batching baselines on six benchmarks.

The Harder Text Embedding Benchmark (HTEB): Beyond One-dimensional Static Robustness

cs.CL · 2026-05-27 · unverdicted · novelty 7.0

HTEB introduces dynamic, multi-axis evaluation of text embedding robustness using LLM transformations, finding decoupled profiles across models and that scaling does not close all robustness gaps.

IdioLink: Retrieving Meaning Beyond Words Across Idiomatic and Literal Expressions

cs.CL · 2026-05-21 · unverdicted · novelty 7.0

IdioLink introduces a benchmark dataset and evaluation showing that strong embedding models struggle to retrieve equivalent meanings across idiomatic and literal forms, relying on shallow cues instead.

citing papers explorer

Showing 34 of 184 citing papers.

Erase to Improve: Erasable Reinforcement Learning for Search-Augmented LLMs
cs.CL · 2025-10-01 · unverdicted · none · ref 41 · internal anchor
ERL trains LLMs to erase faulty reasoning steps and regenerate them in place, yielding gains of up to 8.48% EM on multi-hop QA benchmarks like HotpotQA.

LTRR: Learning To Rank Retrievers for LLMs
cs.CL · 2025-06-16 · unverdicted · none · ref 42 · internal anchor
LTRR learns to rank a pool of retrievers by their expected contribution to RAG answer correctness and shows that query-dependent selection beats the best single retriever on QA benchmarks.

Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference
cs.CL · 2024-12-18 · unverdicted · none · ref 195 · internal anchor
ModernBERT is a new bidirectional encoder model achieving SOTA performance on diverse classification and retrieval benchmarks while offering superior speed and memory efficiency for long-context inference.

Multilingual E5 Text Embeddings: A Technical Report
cs.CL · 2024-02-08 · unverdicted · none · ref 20 · internal anchor
Open-source multilingual E5 embedding models are trained via contrastive pre-training on 1 billion text pairs and fine-tuning, with an instruction-tuned model matching English SOTA performance.

Data-CUBE: Data Curriculum for Instruction-based Sentence Representation Learning
cs.CL · 2024-01-07 · unverdicted · none · ref 47 · internal anchor
Data-CUBE applies a two-level curriculum (TSP-based task ordering via simulated annealing plus difficulty-sorted mini-batches) to multi-task instruction tuning and reports gains on MTEB sentence representation tasks.

Towards General Text Embeddings with Multi-stage Contrastive Learning
cs.CL · 2023-08-07 · unverdicted · none · ref 83 · internal anchor
GTE_base is a compact text embedding model using multi-stage contrastive learning on diverse data that outperforms OpenAI's API and 10x larger models on massive benchmarks and works for code as text.

R$^2$-Searcher: Calibrating Retrieval and Reasoning Boundaries for Agentic Search
cs.IR · 2026-06-26 · unverdicted · none · ref 44 · internal anchor
R²-Searcher introduces fine-grained evidence modeling, retrieval reflection, and R²PO RL to calibrate retrieval-reasoning boundaries and improve multi-hop QA performance.

Frozen Multimodal Embeddings for AI-Assisted Interview Assessment of Personality and Cognitive Ability
cs.HC · 2026-06-10 · conditional · none · ref 25 · internal anchor
Frozen multimodal embeddings with trait-specific late fusion cut personality prediction MSE by 19% relative to baseline in the 2026 AVI challenge, while cognitive results are attributed to validation shortcuts rather than content-based inference.

UniCA: Bi-directional Cross-Attention with Positive Similarity Loss for Robust Multi-Modal Retrieval
cs.IR · 2026-06-03 · unverdicted · none · ref 9 · internal anchor
UniCA proposes bi-directional cross-attention and positive similarity loss for multi-modal retrieval and reports up to 4.09% Recall@5 gain on WebQA hybrid tasks versus baseline.

DocArena: Turning Raw Documents into Controllable Training Environments for Document Search Agents
cs.CV · 2026-05-27 · unverdicted · none · ref 53 · internal anchor
DocArena automates creation of multimodal document QA training data via MLLM-based structuring and cross-page reasoning pairs, yielding agents with top retrieval and QA performance in unified tests.

Gemini Embedding 2: A Native Multimodal Embedding Model from Gemini
cs.CV · 2026-05-26 · unverdicted · none · ref 14 · internal anchor
A native multimodal embedding model from Gemini achieves reported state-of-the-art results on retrieval benchmarks across modalities via large-scale contrastive learning.

Benchmarking Google Embeddings 2 against Open-Source Models for Multilingual Dense Retrieval and RAG Systems
cs.CL · 2026-05-22 · unverdicted · none · ref 9 · internal anchor
GE2 tops BEIR and Italian RAG benchmarks at nDCG@10 of 0.638 and 0.282 but with 231.6 ms latency; mE5-L is competitive on Italian at 31 ms while LaBSE underperforms all dedicated retrieval models.

m3BERT: A Modern, Multi-lingual, Matryoshka Bidirectional Encoder
cs.CL · 2026-05-19 · unverdicted · none · ref 40 · internal anchor
m3BERT uses a three-stage Matryoshka pretraining approach on a bidirectional encoder to support variable embedding sizes while outperforming prior models on large-scale retrieval tasks.

Accurate, Efficient, and Explainable Deep Learning Approaches for Environmental Science Problems
cs.LG · 2026-05-19 · unverdicted · none · ref 188 · internal anchor
The work introduces WaLeF/FIDLAr for flood forecasting, CoDiCast for probabilistic weather, and Hypercube-RAG for explainable environmental QA, claiming superior accuracy, efficiency, and interpretability over baselines.

H-MAPS: Hierarchical Memory-Augmented Proactive Search Assistant for Scientific Literature
cs.IR · 2026-05-11 · unverdicted · none · ref 17 · internal anchor
H-MAPS uses a three-layered hierarchical memory to infer a reader's background and intent from implicit behaviors, generating profile-specific questions and on-device literature retrieval, as shown when NLP and HCI researchers receive different recommendations for the same paper.

Domain-Adaptive Dense Retrieval for Brazilian Legal Search
cs.IR · 2026-05-05 · unverdicted · none · ref 25 · internal anchor
Mixed training of Qwen3-Embedding-4B on legal data plus SQuAD-pt yields higher average NDCG@10 (0.447), MRR@10 (0.595), and MAP@10 (0.308) across six Portuguese retrieval datasets than legal-only or base models, with largest gains on out-of-domain question-based search.

LLM-Oriented Information Retrieval: A Denoising-First Perspective
cs.IR · 2026-05-01 · unverdicted · none · ref 185 · 2 links · internal anchor
Argues for a denoising-first paradigm in LLM-oriented information retrieval, framing challenges via a four-stage progression and providing a taxonomy of signal-to-noise optimization techniques across the pipeline.

Health System Scale Semantic Search Across Unstructured Clinical Notes
cs.IR · 2026-04-28 · unverdicted · none · ref 15 · internal anchor
A semantic search system was deployed at health-system scale across 166 million clinical notes, delivering sub-second latency, ~$4000 monthly cost, and 24-89% faster chart abstraction with maintained agreement.

Mira-Embeddings-V1: Domain-Adapted Semantic Reranking for Recruitment via LLM-Synthesized Data
cs.CL · 2026-04-20 · conditional · none · ref 28 · internal anchor
Mira-Embeddings-V1 adapts embeddings for recruitment reranking by synthesizing positive and hard-negative samples with LLMs, then applies JD-JD contrastive and JD-CV triplet training plus a BoundaryHead MLP, lifting Recall@50 from 68.89% to 77.55% and Recall@200 from 0.5969 to 0.7047.

Mitigating Hallucination on Hallucination in RAG via Ensemble Voting
cs.CL · 2026-03-28 · unverdicted · none · ref 48 · internal anchor
VOTE-RAG applies retrieval voting across diverse queries and response voting across independent generations to mitigate hallucination-on-hallucination in RAG, matching or exceeding complex baselines on six benchmarks with a parallelizable design.

Qwen3-VL-Embedding and Qwen3-VL-Reranker: A Unified Framework for State-of-the-Art Multimodal Retrieval and Ranking
cs.CL · 2026-01-08 · unverdicted · none · ref 22 · internal anchor
Qwen3-VL-Embedding-8B achieves state-of-the-art performance with a 77.8 overall score on the MMEB-V2 multimodal embedding benchmark.

Sharpness-Guided Group Relative Policy Optimization via Probability Shaping
cs.LG · 2025-10-29 · unverdicted · none · ref 33 · internal anchor
GRPO-SG is a sharpness-guided token-weighted variant of GRPO that downweights high-gradient tokens to stabilize optimization and improve generalization in reinforcement learning with verifiable rewards.

From Ambiguity to Accuracy: The Transformative Effect of Coreference Resolution on Retrieval-Augmented Generation systems
cs.CL · 2025-07-10 · unverdicted · none · ref 19 · internal anchor
Coreference resolution improves retrieval relevance and QA performance in RAG systems, with mean pooling performing best and smaller models benefiting more.

Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models
cs.CL · 2025-06-05 · unverdicted · none · ref 14 · internal anchor
Qwen3 Embedding models in 0.6B-8B sizes achieve state-of-the-art results on MTEB and retrieval tasks including code, cross-lingual, and multilingual retrieval through unsupervised pre-training, supervised fine-tuning, and model merging on Qwen3 backbones.

Evaluating Chunking Strategies for Retrieval-Augmented Generation on Academic Texts
cs.IR · 2026-07-02 · unverdicted · none · ref 62 · internal anchor
Cluster-based semantic chunking does not outperform fixed-size or recursive chunking for RAG on academic theses, and RAGAs faithfulness shows limited reliability in this setup.

CuriosAI Submission to the CASTLE Challenge at EgoVis 2026
cs.CV · 2026-05-27 · unverdicted · none · ref 14 · internal anchor
Reports SVA (0.50) and TMKG (0.35) accuracies on the CASTLE 2026 egocentric video QA challenge using VLM/LLM pipelines with preprocessing.

A Reproducibility Study of Metacognitive Retrieval-Augmented Generation
cs.IR · 2026-04-21 · unverdicted · none · ref 49 · internal anchor
MetaRAG is only partially reproducible with lower absolute scores than originally reported, gains substantially from reranking, and shows greater robustness than SIM-RAG under extended retrieval features.

LLMOrbit: A Circular Taxonomy of Large Language Models -From Scaling Walls to Agentic AI Systems
cs.LG · 2026-01-20 · unverdicted · none · ref 155 · internal anchor
A survey taxonomy of LLMs identifies three scaling crises and six efficiency paradigms while tracing the shift from generation to tool-using agents.

Challenger at MultiPRIDE: Is It Hate Speech or Reclaimed?
cs.CL · 2026-05-31 · unverdicted · none · ref 12 · internal anchor
Applies embeddings, Cleanlab noise filtering, and MLP classification to achieve robust performance on imbalanced MultiPride data for distinguishing hate speech from reclaimed language.

Findings of the Counter Turing Test: AI-Generated Text Detection
cs.CL · 2026-05-20 · unverdicted · none · ref 28 · 2 links · internal anchor
Shared task findings show near-perfect binary detection of AI-generated text but greater difficulty in attributing outputs to particular language models.

A Survey on Retrieval-Augmented Text Generation for Large Language Models
cs.IR · 2024-04-17 · unverdicted · none · ref 141 · internal anchor
A survey that categorizes RAG methods for LLMs into four retrieval-centric stages, reviews their evolution and evaluation, and outlines challenges and future directions.

Omni-Embed-Audio: Leveraging Multimodal LLMs for Robust Audio-Text Retrieval
cs.SD · 2026-04-20 · unreviewed · ref 41 · internal anchor

$\pi$-Play: Multi-Agent Self-Play via Privileged Self-Distillation without External Data
cs.LG · 2026-04-15 · unreviewed · ref 34 · internal anchor

OASES: Outcome-Aligned Search-Evaluation Co-Training for Agentic Search
cs.AI · 2026-04-04 · unreviewed · ref 30 · internal anchor
