Dimension d = O(m^{-2} log n) nearly achieves the optimal margin m^rd(+∞, A) for retrieval embeddings, with matching lower bounds showing d = O(k log(n/k)) suffices and is necessary for m = Θ(k^{-1/2}) on k-sparse query matrices.
super hub Mixed citations
Text Embeddings by Weakly-Supervised Contrastive Pre-training
Mixed citation behavior. Most common role is method (39%).
abstract
This paper presents E5, a family of state-of-the-art text embeddings that transfer well to a wide range of tasks. The model is trained in a contrastive manner with weak supervision signals from our curated large-scale text pair dataset (called CCPairs). E5 can be readily used as a general-purpose embedding model for any tasks requiring a single-vector representation of texts such as retrieval, clustering, and classification, achieving strong performance in both zero-shot and fine-tuned settings. We conduct extensive evaluations on 56 datasets from the BEIR and MTEB benchmarks. For zero-shot settings, E5 is the first model that outperforms the strong BM25 baseline on the BEIR retrieval benchmark without using any labeled data. When fine-tuned, E5 obtains the best results on the MTEB benchmark, beating existing embedding models with 40x more parameters.
hub tools
citation-role summary
citation-polarity summary
claims ledger
- abstract This paper presents E5, a family of state-of-the-art text embeddings that transfer well to a wide range of tasks. The model is trained in a contrastive manner with weak supervision signals from our curated large-scale text pair dataset (called CCPairs). E5 can be readily used as a general-purpose embedding model for any tasks requiring a single-vector representation of texts such as retrieval, clustering, and classification, achieving strong performance in both zero-shot and fine-tuned settings. We conduct extensive evaluations on 56 datasets from the BEIR and MTEB benchmarks. For zero-shot se
authors
co-cited works
representative citing papers
A new corpus of 108 mixed string-numeric tables shows that advanced tabular learners with basic string embeddings perform well on most real-world data, while large LLM encoders help on free-text heavy tables.
FollowTable is the first large-scale benchmark for instruction-following table retrieval, paired with an Instruction Responsiveness Score, showing that existing models fail to adapt to fine-grained constraints beyond topical similarity.
A new sensitivity-labeled test collection is released from Enron emails with crowdsourced queries, relevance judgments, and LLM extensions for evaluating sensitivity-aware search.
SelSkill applies dual-granularity preference learning to selective skill-or-skip decisions, improving task success by 10.9 points and execution precision by 29.1 points on ALFWorld with Qwen3-8B.
MemPoison enables stealthy memory poisoning in LLM agents via dialogue by using semantic relational bridges, entity masquerading, and joint embedding optimization to bypass selective extraction and rewriting, achieving up to 0.95 attack success rate.
RoBatch is a two-stage framework that formulates and solves the joint Route with Batching Problem via a batch-aware proxy utility model and greedy scheduling, outperforming separate routing or batching baselines on six benchmarks.
HTEB introduces dynamic, multi-axis evaluation of text embedding robustness using LLM transformations, finding decoupled profiles across models and that scaling does not close all robustness gaps.
IdioLink introduces a benchmark dataset and evaluation showing that strong embedding models struggle to retrieve equivalent meanings across idiomatic and literal forms, relying on shallow cues instead.
A single autoregressive model for conversational recommendation that uses semantic item IDs, predicts response intent and target first, then generates the response, reporting up to 29% Recall@1 gains.
A new linked multimodal dataset of Russian domestic and foreign policy speeches with texts, images, captions, harmonized metadata, and expert-refined topic annotations is introduced to support analyses in political communication and LLM applications.
TWN attaches separate reasoning and embedding LoRA adapters to a frozen backbone with gradient detachment and a self-supervised gate that decides per input whether to generate CoT, achieving SOTA on MMEB-V2 with 3-5% added parameters and up to 50% fewer reasoning tokens.
ZipRerank delivers state-of-the-art multimodal listwise reranking accuracy for long documents at up to 10x lower latency via early interaction and single-pass scoring.
GCPO uses team-level credit assignment via determinant volume over reward-weighted semantic embeddings to promote non-redundant correct reasoning paths, improving both accuracy and diversity in LLM training.
MulTaBench is a new collection of 40 image-tabular and text-tabular datasets designed to test target-aware representation tuning in multimodal tabular models.
Malicious agents can deceive LLM-based task routers in Internet of Agents systems by generating fake skill descriptions, achieving up to 98% success rate across nine domains.
LatentRAG performs agentic RAG by generating latent tokens for thoughts and subqueries in one forward pass, matching explicit methods' accuracy on seven benchmarks while reducing latency by ~90%.
SkillRet benchmark shows fine-tuned retrievers improve NDCG@10 by 13+ points over prior models on large-scale skill retrieval for LLM agents.
TabEmbed is the first generalist embedding model for tabular data that unifies classification and retrieval in one space via contrastive learning and outperforms text embedding models on the new TabBench benchmark.
EPIC trains LLMs to treat continuous embeddings as in-context prompts, yielding state-of-the-art text embedding performance on MTEB with or without prompts at inference and lower compute.
MemCoE learns memory organization guidelines via contrastive feedback and then trains a guideline-aligned RL policy for memory updates, yielding consistent gains on personalization benchmarks.
AuDisAgent reformulates multimodal controversy detection as a dynamic audience dissemination process using screening, panel discussion, and arbitration agents, plus comment bootstrapping, and reports outperforming prior static methods on a public dataset.
Modern text encoders resist second-order collapse under mean pooling because token embeddings concentrate tightly within texts, and this resistance correlates with stronger downstream performance.
Prism-Reranker models output relevance, contribution statements, and evidence passages to support agentic retrieval beyond scalar scoring.
citing papers explorer
-
Towards Cost-effective LLMs Routing with Batch Prompting
RoBatch is a two-stage framework that formulates and solves the joint Route with Batching Problem via a batch-aware proxy utility model and greedy scheduling, outperforming separate routing or batching baselines on six benchmarks.
-
AV-SQL: Decomposing Complex Text-to-SQL Queries with Agentic Views
AV-SQL uses a pipeline of LLM agents to generate intermediate CTE views that decompose complex Text-to-SQL queries, reaching 70.38% execution accuracy on Spider 2.0.
-
Stream2LLM: Overlap Context Streaming and Prefill for Reduced Time-to-First-Token (TTFT)
Stream2LLM introduces adaptive scheduling and preemption for append-mode and update-mode context streaming in disaggregated LLM deployments, delivering up to 11x TTFT improvements on real-world workloads while preserving throughput.
-
Lit2Vec: A Reproducible Workflow for Building a Legally Screened Chemistry Corpus from S2ORC for Downstream Retrieval and Text Mining
Lit2Vec delivers a documented, reproducible pipeline that extracts and annotates a large licensed chemistry paper corpus from S2ORC with paragraph embeddings and subfield labels.