M3-Embedding: Multi-Linguality, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation
Mixed citation behavior. Most common role is background (39%).
abstract
In this paper, we introduce a new embedding model called M3-Embedding, which is distinguished for its versatility in Multi-Linguality, Multi-Functionality, and Multi-Granularity. It provides a uniform support for the semantic retrieval of more than 100 working languages. It can simultaneously accomplish the three common retrieval functionalities: dense retrieval, multi-vector retrieval, and sparse retrieval. Besides, it is also capable of processing inputs of different granularities, spanning from short sentences to long documents of up to 8,192 tokens. The effective training of M3-Embedding presents a series of technical contributions. Notably, we propose a novel self-knowledge distillation approach, where the relevance scores from different retrieval functionalities can be integrated as the teacher signal to enhance the training quality. We also optimize the batching strategy, which enables a large batch size and high training throughput to improve the discriminativeness of embeddings. M3-Embedding exhibits a superior performance in our experiment, leading to new state-of-the-art results on multilingual, cross-lingual, and long-document retrieval benchmarks.
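The abstract names three retrieval functionalities and a teacher signal built by integrating their relevance scores. The sketch below is one plausible reading of that idea, not the authors' implementation. The function names, the equal integration weights, and the soft-label cross-entropy loss are all assumptions made for illustration.

```python
import math


def _normalize(vec):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def dense_score(q_vec, d_vec):
    """Dense retrieval: inner product of two normalized single-vector embeddings."""
    q, d = _normalize(q_vec), _normalize(d_vec)
    return sum(a * b for a, b in zip(q, d))


def sparse_score(q_weights, d_weights):
    """Sparse retrieval: sum of term-weight products over terms shared by query and document."""
    return sum(w * d_weights[t] for t, w in q_weights.items() if t in d_weights)


def multi_vector_score(q_vecs, d_vecs):
    """Multi-vector retrieval: for each query token vector, take the best-matching
    document token vector, then average over query tokens (late interaction)."""
    best = [max(dense_score(q, d) for d in d_vecs) for q in q_vecs]
    return sum(best) / len(best)


def integrated_score(s_dense, s_sparse, s_multi, weights=(1.0, 1.0, 1.0)):
    """Teacher signal: weighted sum of the three scores (weights are an assumption)."""
    w1, w2, w3 = weights
    return w1 * s_dense + w2 * s_sparse + w3 * s_multi


def _softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def distill_loss(teacher_scores, student_scores):
    """Soft-label cross-entropy between teacher and student score distributions
    over the same list of candidate passages (an assumed form of the loss)."""
    p = _softmax(teacher_scores)
    q = _softmax(student_scores)
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q))
```

In this reading, each candidate passage gets three scores, the integrated score serves as the teacher, and each single functionality is trained as a student against it by calling `distill_loss` with the teacher scores and that functionality's scores.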
representative citing papers
Retrieval coverage limits LLM rerankers in cold-start recommendation; a learned hybrid fusion improves pool quality but LLM reranking often degrades end-to-end performance while simpler rankers exploit the pool.
On heterogeneous document collections, only query expansion and a newly introduced per-source calibrated corrector (SSCC) deliver reliable gains beyond a strong cross-encoder reranker; other common retrieval enhancements do not.
Prism-Reranker models output relevance, contribution statements, and evidence passages to support agentic retrieval beyond scalar scoring.
vstash shows that hybrid retrieval disagreements provide a free training signal to fine-tune 33M-parameter embeddings, yielding NDCG@10 gains up to 19.5% on NFCorpus and matching some larger models on three of five BEIR datasets.
SalesLLM provides an automatic evaluation framework for LLM sales dialogues that correlates 0.98 with human experts and shows top models approaching human performance while weaker ones lag.
citing papers explorer
-
CORTEX: High-Quality Cross-Domain Organization of Web-Scale Corpora through Ontological Corpus Graph
Cortex uses an Ontological Corpus Graph to structure web-scale corpora, creating a refined 24.14B-token corpus and a new benchmark validated on eight LLMs.
-
LEDGER: A Long-Context Benchmark of Corporate Annual Reports for Grounded Financial Retrieval and Extraction
LEDGER provides a corpus of 4,999 annual reports with 31 labeled KPIs and three benchmarks for page-level retrieval, needle-in-haystack lookup, and full KPI extraction from long documents.
-
Towards Cost-effective LLMs Routing with Batch Prompting
RoBatch is a two-stage framework that formulates and solves the joint Route with Batching Problem via a batch-aware proxy utility model and greedy scheduling, outperforming separate routing or batching baselines on six benchmarks.
-
Very Efficient Listwise Multimodal Reranking for Long Documents
ZipRerank delivers state-of-the-art multimodal listwise reranking accuracy for long documents at up to 10x lower latency via early interaction and single-pass scoring.
-
Nautilus Compass: Black-box Persona Drift Detection for Production LLM Agents
Nautilus Compass is a black-box drift detector for production LLM agents that uses weighted cosine similarity on BGE-m3 embeddings of raw text against anchors, achieving 0.83 ROC AUC on real session traces while shipping as plugins and servers with an audit log.
-
QuIVer: Rethinking ANN Graph Topology via Training-Free Binary Quantization
QuIVer performs Vamana-style graph construction entirely inside a 2-bit Sign-Magnitude BQ space, achieving >=88% Recall@10 on contrastive-learning embeddings and 2.5-5.5x higher throughput than DiskANN/HNSW at matched recall with 4.7x less hot memory.
-
Learning How and What to Memorize: Cognition-Inspired Two-Stage Optimization for Evolving Memory
MemCoE learns memory organization guidelines via contrastive feedback and then trains a guideline-aligned RL policy for memory updates, yielding consistent gains on personalization benchmarks.
-
Purifying Multimodal Retrieval: Fragment-Level Evidence Selection for RAG
FES-RAG reframes multimodal RAG as fragment-level selection using Fragment Information Gain to outperform document-level methods with up to 27% relative CIDEr gains on M2RAG while shortening context.
-
Latent Abstraction for Retrieval-Augmented Generation
LAnR unifies retrieval-augmented generation inside a single LLM by deriving dense retrieval vectors from a [PRED] token's hidden states and using entropy to adaptively stop retrieval, outperforming prior RAG on six QA benchmarks with better efficiency.
-
LMEB: Long-horizon Memory Embedding Benchmark
LMEB benchmark shows that embedding models' performance on traditional retrieval does not transfer to long-horizon memory tasks, larger models do not always perform better, and LMEB measures capabilities orthogonal to MTEB.
-
SQuTR: A Robustness Benchmark for Spoken Query to Text Retrieval under Acoustic Noise
SQuTR aggregates 37k queries from six text retrieval datasets, synthesizes speech from 200 speakers, adds 17 noise categories at varying SNR, and shows that even large retrieval models degrade sharply under extreme acoustic noise.
-
Pattern-Calibrated Multimodal Prediction under Blockwise Missingness
MOSAIC learns overlap-aware shared-specific representations, fits a first-stage predictor on overlapping data, and calibrates the gap using target-pattern samples, with non-asymptotic error bounds decomposing overlap size, calibration gap, and representation error.
-
Identifying and Resolving Pitfalls of Knowledge-Based VQA Benchmarks: Auditing, Repairing, and Augmenting
Audit of KB-VQA benchmarks reveals systematic violations of answer derivability, question clarity, and visual disambiguation assumptions, with new repair and multi-entity augmentation protocols producing different model performance trends.
-
SHARD: cell-keyed residual splitting for alignment-resistant private dense retrieval
SHARD introduces cell-keyed residual splitting that turns dense retrieval embeddings into revocable, renewable, unlinkable templates resistant to alignment attacks while preserving exact utility under CKKS reranking.
-
EvoEmbedding: Evolvable Representations for Long-Context Retrieval and Agentic Memory
EvoEmbedding generates evolvable embeddings via a latent memory updated during sequential processing, outperforming larger models on long-context retrieval and generalizing to 10x longer contexts in downstream tasks.
-
End-to-End Voice Intent Recognition for Spontaneous Human-Drone Interaction with Naive Users
An end-to-end SLU architecture with frozen SSL acoustic encoder, LSTM classification head, and cross-modal distillation achieves 93% accuracy on simple commands and 82% on spontaneous speech at 7 ms latency on the new VoiceStick corpus, outperforming cascade baselines.
-
ScholarQuest: A Taxonomy-Guided Benchmark for Agentic Academic Paper Search in Open Literature Environments
ScholarQuest is a taxonomy-guided benchmark for agentic academic paper search that shows agentic methods beat single-shot baselines but reach only 0.314 Recall@100 and 0.355 Recall@All.
-
Measuring Curriculum Alignment across Topical Coverage, Competency, and Cognitive Depth: A Longitudinal Framework Applied to CS2013 and CS2023
A retrieve-then-confirm framework applied to one CS program finds ~50% coverage of both CS2013 and CS2023, ~88% competency articulation, and lower cognitive depth under the newer guideline (76% vs 95%).
-
MonaVec: A Training-Free Embedded Vector Search Kernel for Edge and Offline AI Systems
MonaVec provides a training-free 4-bit vector quantization and deterministic search kernel using Randomized Hadamard Transform and ChaCha20 seeding for embedded and offline use.
-
Morpheus: A Morphology-Aware Neural Tokenizer and Word Embedder for Turkish
Morpheus is a morphology-aware neural tokenizer and embedder for Turkish that achieves lossless reversible segmentation, higher morphological alignment, lower bits-per-character, and competitive root-centric embeddings compared with subword baselines.
-
A Unified Framework for Context-Aware and Relation-Aware Graph Retrieval-Augmented Generation
HyGRAG is a hierarchical graph RAG framework that constructs LLM summaries over hybrid chunk-entity graphs, retrieves via context and relation awareness across levels, and enables dynamic updates, reporting a 9.7% average accuracy gain on multi-hop reasoning tasks.
-
Conflict-Aware Retriever Editing for Knowledge Injection Attacks on LLM-Based RAG Systems
CAREATTACK adapts closed-form parameter editing with graph-based conflict resolution and lightweight anchor repair to promote malicious passages in RAG retrieval while limiting side effects on non-target queries.
-
Uncertainty-Aware Hybrid Retrieval for Long-Document RAG
UMG-RAG improves long-document RAG by uncertainty-aware fusion of multi-granularity retrievals from complementary dense and sparse retrievers, plus a parent-promotion variant.
-
CQC-RAG: Robust Retrieval-Augmented Generation via Cross-Query Consistency
CQC-RAG improves RAG factuality by generating diverse equivalent queries, building query-specific contexts, and selecting answers via cross-query confidence stability, with reported gains of +4.76 pp EM on TriviaQA and +9.12 pp EM on MuSiQue.
-
DIRECT: When and Where Should You Allocate Test-Time Compute in Embodied Planners?
DIRECT is a multimodal-context router that allocates test-time compute across chain-of-thought depth, model size, and memory history for VLM embodied planners, improving the success-cost Pareto frontier and matching stronger models at up to 65% lower latency on benchmarks and a physical Franka arm.
-
REAL: A Reasoning-Enhanced Graph Framework for Long-Term Memory Management of LLMs
REAL represents long-term LLM memory as a temporal confidence-aware directed property graph with non-destructive updates and uses evaluator-guided beam search plus counterfactual inference for retrieval, reporting 22.72% average gains over baselines.
-
A Multi-modal Agentic Co-pilot for Evidence Grounded Computational Pathology
PathPocket constructs a 4.55M-entity pathology hypergraph from 110k graded documents and deploys a multi-agent framework that outperforms prior systems on 200k cases while raising pathologist accuracy in user studies.
-
SkillPager: Query-Adaptive Intra-Skill Navigation via Semantic Node Retrieval
SkillPager retrieves typed semantic nodes from skill documents via MMR to reach 78.89% LLM-judged sufficiency with 47% fewer tokens than full documents on a 395-skill benchmark.
-
On the Robustness of Multilingual Text Embedding Rankings Across Learning Tasks, Languages, and Benchmark Datasets
Meta-study of MTEB rankings introduces dataset-composition and ranking-scheme robustness indicators and finds only a small subset of models stay consistently strong across tasks, languages, and evaluation variations.
-
Beyond Chunk-Local Extraction: Cross-Chunk Graph Augmentation for GraphRAG
CrossAug augments GraphRAG indices with cross-chunk relations via GNN-guided subgraph scoring and selective LLM completion, yielding consistent gains on four QA benchmarks across three frameworks.
-
LATTE: Forecasting Peer Anchored Preference Trajectories for Personalized LLM Generation
LATTE improves personalized LLM generation by forecasting peer-anchored relative preference trajectories and injecting the forecast via a State to Token Bridge, raising ROUGE-L from 0.219-0.245 to 0.259 on Amazon Reviews 2023 over static and compression baselines.
-
An Efficient and Privacy-Preserving Architecture for Cross-Institutional Collaborative RAG
FedRAG uses a Scrambled Distributed Attention protocol with feature scrambling and token permutation to enable high-throughput, privacy-preserving federated RAG without special hardware or retraining.
-
Iterate Until Retrieved: Factual Nugget Optimization for Discoverable Continual Corrections in Agentic RAG
INO is an index-time method that uses the production RAG agent to iteratively create, test with queries and paraphrases, reflect on failures, and revise factual nuggets until they are discoverable and used correctly.
-
Beyond Query Memorization: Large Language Model Routing with Query Decomposition and Historical Matching
DecoR routes LLM queries by decomposing them into capability dimensions and matching to historical examples, yielding higher accuracy and lower inference costs than direct-mapping routers on both in-distribution and OOD data.
-
Structure Retention in Embedding Spaces as a Predictor of Benchmark Performance
Embedding model performance on MTEB tasks correlates strongly with nearest-neighbor overlap and ICA magnitude differences in their embedding spaces.
-
Covariance Structure and Coordinate Heterogeneity Govern Binary Quantization of Contrastive Embeddings
Covariance structure and coordinate heterogeneity in InfoNCE embeddings control binary quantization fidelity, with off-diagonals contributing 30-50% of signal and heterogeneity determining rotation benefit and bit utility under a Gaussian model.
-
Layer-wise Representation Dynamics: An Empirical Investigation Across Embedders and Base LLMs
LRD framework with Frenet, NRS, and GFMI metrics shows layer-wise structure in 31 models provides usable signal for model selection and pruning on MTEB tasks.
-
Reasoning Is Not Free: Robust Adaptive Cost-Efficient Routing for LLM-as-a-Judge
RACER routes between reasoning and non-reasoning LLM judges via constrained distributionally robust optimization to achieve better accuracy-cost trade-offs under distribution shift.
-
Spherical Mixture Integration for Latent Embedding Alignment across Multi-Source Feature Spaces
SMILE models synonymy in multi-EHR codes via spherical mixtures of von Mises-Fisher distributions and develops a composite quasi-likelihood estimator with non-asymptotic error bounds and consistent cluster recovery.
-
MLAIRE: Multilingual Language-Aware Information Retrieval Evaluation Protocal
MLAIRE is a protocol that evaluates multilingual retrievers on both semantic accuracy and query-language preference using parallel passages and new metrics like LPR and Lang-nDCG, showing that standard metrics hide distinct behavioral differences among retrievers.
-
Is Textual Similarity Invariant under Machine Translation? Evidence Based on the Political Manifesto Corpus
Machine translation preserves embedding similarity structure for ten languages but distorts it for four in the Manifesto Corpus, via a new non-inferiority testing framework.
-
QuantClaw: Precision Where It Matters for OpenClaw
QuantClaw dynamically routes precision in agent workflows to cut cost by up to 21.4% and latency by 15.7% while keeping or improving task performance.
-
To Know is to Construct: Schema-Constrained Generation for Agent Memory
SCG-MEM reformulates agent memory access as schema-constrained generation within dynamic cognitive schemas, using assimilation and accommodation for updates plus an associative graph for reasoning, and outperforms retrieval baselines on the LoCoMo benchmark.
-
Xiaomi OneVL: One-Step Latent Reasoning and Planning with Vision-Language Explanation
OneVL achieves superior accuracy to explicit chain-of-thought reasoning at answer-only latency by supervising latent tokens with a visual world model decoder that predicts future frames.
-
MemSearch-o1: Empowering Large Language Models with Reasoning-Aligned Memory Growth in Agentic Search
MemSearch-o1 mitigates memory dilution in agentic LLM search through reasoning-aligned token-level memory growth, retracing with a contribution function, and path reorganization, improving reasoning activation on benchmarks.
-
LFRAG: Layout-oriented Fine-grained Retrieval-Augmented Generation on Multimodal Document Understanding
LFRAG advances multimodal RAG to block-level retrieval with layout segmentation and cross-attention fusion, reporting SOTA retrieval, 7.20% higher answer accuracy, and 73.07% lower token consumption on the new LFDocQA benchmark.
-
BiCon-Gate: Consistency-Gated De-colloquialisation for Dialogue Fact-Checking
BiCon-Gate improves dialogue fact-checking by applying staged de-colloquialisation and gating rewrites based on semantic consistency with context, yielding gains on the DialFact benchmark over baselines including LLM rewrites.
-
DualView: Adaptive Local-Global Fusion for Multi-Hop Document Reranking
DualView fuses local cross-attention and global context aggregation via adaptive gating to rerank fixed candidate sets for multi-hop QA, reporting 99.4% Top-4 Recall on MuSiQue at 4 ms latency while beating larger cross-encoders.
-
WikiSeeker: Rethinking the Role of Vision-Language Models in Knowledge-Based Visual Question Answering
WikiSeeker boosts KB-VQA performance by using VLMs to rewrite image-informed queries for better retrieval and to decide when to route to external LLM or rely on internal VLM knowledge.
-
LiquiLM: Bridging the Semantic Gap in Liquidity Flaw Audit via DCN and LLMs
LiquiLM integrates LLMs and DCN to audit liquidity flaws in blockchain smart contracts, achieving over 90% F1-score and uncovering 238 high-risk contracts plus 10 CVE-certified vulnerabilities in real-world PoL and Ethereum contracts.