M3-Embedding is a single model for multi-lingual, multi-functional, and multi-granular text embeddings trained via self-knowledge distillation that achieves new state-of-the-art results on multilingual, cross-lingual, and long-document retrieval benchmarks.
arXiv e-prints , year = 2016, eid =
4 Pith papers cite this work. Polarity classification is still indexing.
fields
cs.CL 4representative citing papers
Nomic AI produced and open-sourced a reproducible 8192-context English text embedder that exceeds OpenAI Ada-002 and text-embedding-3-small performance on MTEB short-context and LoCo long-context benchmarks.
Recursive paraphrasing attacks substantially lower detection rates for multiple AI text detectors with only minor quality loss, while a theoretical analysis ties best-case AUROC to total variation distance between human and AI distributions.
Jailbreak prompts with adversarial suffixes have high GPT-2 perplexity, and a LightGBM model on perplexity and length detects most attacks.
citing papers explorer
-
M3-Embedding: Multi-Linguality, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation
M3-Embedding is a single model for multi-lingual, multi-functional, and multi-granular text embeddings trained via self-knowledge distillation that achieves new state-of-the-art results on multilingual, cross-lingual, and long-document retrieval benchmarks.
-
Nomic Embed: Training a Reproducible Long Context Text Embedder
Nomic AI produced and open-sourced a reproducible 8192-context English text embedder that exceeds OpenAI Ada-002 and text-embedding-3-small performance on MTEB short-context and LoCo long-context benchmarks.
-
Can AI-Generated Text be Reliably Detected?
Recursive paraphrasing attacks substantially lower detection rates for multiple AI text detectors with only minor quality loss, while a theoretical analysis ties best-case AUROC to total variation distance between human and AI distributions.
-
Detecting Language Model Attacks with Perplexity
Jailbreak prompts with adversarial suffixes have high GPT-2 perplexity, and a LightGBM model on perplexity and length detects most attacks.