Embedding model performance on MTEB tasks correlates strongly with nearest-neighbor overlap and ICA magnitude differences in their embedding spaces.
Arctic-embed 2.0: Multilingual retrieval without compromise
9 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
representative citing papers
LRD framework with Frenet, NRS, and GFMI metrics shows layer-wise structure in 31 models provides usable signal for model selection and pruning on MTEB tasks.
MLAIRE is a protocol that evaluates multilingual retrievers on both semantic accuracy and query-language preference using parallel passages and new metrics like LPR and Lang-nDCG, showing that standard metrics hide distinct behavioral differences among retrievers.
A fine-tuned Qwen3-Embedding model with contrastive learning outperforms baselines on bidirectional source-to-decompiled code association and generalizes to constant-algorithm tasks.
LEAF distills teacher-aligned student embedding models that achieve new SOTA results on BEIR and MTEB for their size class while requiring only modest data and compute.
A distillation-plus-task-contrastive training regimen yields compact embedding models that match or exceed state-of-the-art performance for their size while supporting 32k-token contexts and quantization.
A 300M multilingual embedding model matches or exceeds 7B retrieval performance via optimized data scale, hard negatives, and task diversity over language diversity.
Supervised models using embeddings like jina and e5 reach up to 92% accuracy on multilingual hate speech detection, substantially outperforming anomaly detection, while PCA to 64 dimensions preserves most performance in the supervised case.
Lightweight federated learning with frozen embeddings and MLP heads reaches competitive micro and macro F1 scores for ICD-9 and ICD-10 coding on MIMIC-IV, nearly matching centralized training.
citing papers explorer
-
Structure Retention in Embedding Spaces as a Predictor of Benchmark Performance
Embedding model performance on MTEB tasks correlates strongly with nearest-neighbor overlap and ICA magnitude differences in their embedding spaces.
-
Layer-wise Representation Dynamics: An Empirical Investigation Across Embedders and Base LLMs
LRD framework with Frenet, NRS, and GFMI metrics shows layer-wise structure in 31 models provides usable signal for model selection and pruning on MTEB tasks.
-
MLAIRE: Multilingual Language-Aware Information Retrieval Evaluation Protocal
MLAIRE is a protocol that evaluates multilingual retrievers on both semantic accuracy and query-language preference using parallel passages and new metrics like LPR and Lang-nDCG, showing that standard metrics hide distinct behavioral differences among retrievers.
-
Identifier-Free Code Embedding Models for Scalable Search
A fine-tuned Qwen3-Embedding model with contrastive learning outperforms baselines on bidirectional source-to-decompiled code association and generalizes to constant-algorithm tasks.
-
LEAF: Knowledge Distillation of Text Embedding Models with Teacher-Aligned Representations
LEAF distills teacher-aligned student embedding models that achieve new SOTA results on BEIR and MTEB for their size class while requiring only modest data and compute.
-
jina-embeddings-v5-text: Task-Targeted Embedding Distillation
A distillation-plus-task-contrastive training regimen yields compact embedding models that match or exceed state-of-the-art performance for their size while supporting 32k-token contexts and quantization.
-
Retrofitting Small Multilingual Models for Retrieval: Matching 7B Performance with 300M Parameters
A 300M multilingual embedding model matches or exceeds 7B retrieval performance via optimized data scale, hard negatives, and task diversity over language diversity.
-
Comparison of Modern Multilingual Text Embedding Techniques for Hate Speech Detection Task
Supervised models using embeddings like jina and e5 reach up to 92% accuracy on multilingual hate speech detection, substantially outperforming anomaly detection, while PCA to 64 dimensions preserves most performance in the supervised case.
-
Federated Learning for ICD Classification with Lightweight Models and Pretrained Embeddings
Lightweight federated learning with frozen embeddings and MLP heads reaches competitive micro and macro F1 scores for ICD-9 and ICD-10 coding on MIMIC-IV, nearly matching centralized training.