Arctic-embed 2.0: Multilingual retrieval without compromise

Puxuan Yu, Luke Merrick, Gaurav Nuti, Daniel Campos · 2024 · arXiv 2412.04506

13 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

baseline: 1 · method: 1

citation-polarity summary

baseline: 1 · use method: 1

representative citing papers

Hijacking Agent Memory: Stealthy Trojan Attacks Through Conversational Interaction

cs.CR · 2026-05-28 · unverdicted · novelty 7.0

MemPoison enables stealthy memory poisoning in LLM agents via dialogue by using semantic relational bridges, entity masquerading, and joint embedding optimization to bypass selective extraction and rewriting, achieving up to 0.95 attack success rate.

Larch: Learned Query Optimization for Semantic Predicates

cs.DB · 2026-06-06 · unverdicted · novelty 6.0

Larch uses a GNN-MDP formulation and a selectivity predictor plus dynamic programming to reorder semantic filter evaluation, cutting token usage 3x-19x versus prior systems on real and synthetic workloads.

Structure Retention in Embedding Spaces as a Predictor of Benchmark Performance

cs.CL · 2026-05-21 · unverdicted · novelty 6.0

Embedding model performance on MTEB tasks correlates strongly with nearest-neighbor overlap and ICA magnitude differences in their embedding spaces.

Layer-wise Representation Dynamics: An Empirical Investigation Across Embedders and Base LLMs

cs.LG · 2026-05-12 · unverdicted · novelty 6.0

LRD framework with Frenet, NRS, and GFMI metrics shows layer-wise structure in 31 models provides usable signal for model selection and pruning on MTEB tasks.

MLAIRE: Multilingual Language-Aware Information Retrieval Evaluation Protocol

cs.IR · 2026-05-08 · unverdicted · novelty 6.0

MLAIRE is a protocol that evaluates multilingual retrievers on both semantic accuracy and query-language preference using parallel passages and new metrics like LPR and Lang-nDCG, showing that standard metrics hide distinct behavioral differences among retrievers.

Identifier-Free Code Embedding Models for Scalable Search

cs.CR · 2026-05-05 · unverdicted · novelty 6.0

A fine-tuned Qwen3-Embedding model with contrastive learning outperforms baselines on bidirectional source-to-decompiled code association and generalizes to constant-algorithm tasks.

LEAF: Knowledge Distillation of Text Embedding Models with Teacher-Aligned Representations

cs.IR · 2025-09-16 · conditional · novelty 6.0

LEAF distills teacher-aligned student embedding models that achieve new SOTA results on BEIR and MTEB for their size class while requiring only modest data and compute.

Grounding Text Embeddings in Stakeholder Associations

cs.CL · 2026-05-26 · unverdicted · novelty 5.0

The Stakeholder Grounding Exercise shows neural text embeddings are 19-26pp less reliable than human experts at capturing semantic distinctions, with misalignment strongly correlated to poorer clustering performance (ρ=0.9), replicated across Danish policy and US AI domains.

jina-embeddings-v5-text: Task-Targeted Embedding Distillation

cs.CL · 2026-02-17 · unverdicted · novelty 5.0

A distillation-plus-task-contrastive training regimen yields compact embedding models that match or exceed state-of-the-art performance for their size while supporting 32k-token contexts and quantization.

Retrofitting Small Multilingual Models for Retrieval: Matching 7B Performance with 300M Parameters

cs.CL · 2025-10-16 · conditional · novelty 5.0

A 300M multilingual embedding model matches or exceeds 7B retrieval performance via optimized data scale, hard negatives, and task diversity over language diversity.

MimirRAG: A Multi-Agent RAG Framework for Financial Data Retrieval with Metadata Integration

cs.LG · 2026-05-24 · unverdicted · novelty 4.0

MimirRAG, a multi-agent RAG framework with metadata integration and table-aware chunking, reaches 89.3% accuracy on FinanceBench and outperforms prior baselines for financial document retrieval.

Comparison of Modern Multilingual Text Embedding Techniques for Hate Speech Detection Task

cs.CL · 2026-04-16 · unverdicted · novelty 4.0

Supervised models using embeddings like jina and e5 reach up to 92% accuracy on multilingual hate speech detection, substantially outperforming anomaly detection, while PCA to 64 dimensions preserves most performance in the supervised case.

Federated Learning for ICD Classification with Lightweight Models and Pretrained Embeddings

cs.IR · 2025-07-03 · unverdicted · novelty 4.0

Lightweight federated learning with frozen embeddings and MLP heads reaches competitive micro and macro F1 scores for ICD-9 and ICD-10 coding on MIMIC-IV, nearly matching centralized training.
