Nomic Embed: Training a Reproducible Long Context Text Embedder

Mixed citation behavior. Most common role is baseline (38%).

30 Pith papers citing it
Baseline 38% of classified citations
abstract

This technical report describes the training of nomic-embed-text-v1, the first fully reproducible, open-source, open-weights, open-data, 8192 context length English text embedding model that outperforms both OpenAI Ada-002 and OpenAI text-embedding-3-small on the short-context MTEB benchmark and the long context LoCo benchmark. We release the training code and model weights under an Apache 2.0 license. In contrast with other open-source models, we release the full curated training data and code that allows for full replication of nomic-embed-text-v1. You can find code and data to replicate the model at https://github.com/nomic-ai/contrastors.

citation-role summary

baseline 3 · method 3 · background 1 · other 1

representative citing papers

The Voronoi Bottleneck: Capacity-Aware Dense Retrieval for Product Search

cs.IR · 2026-06-09 · unverdicted · novelty 7.0

Proves that Voronoi complexity equals sign-rank for top-1 retrieval, introduces a CUS diagnostic that predicts retrieval failure at AUC > 0.8 without labels, and proposes an AT-DW-InfoNCE objective with derived alpha^* = 2.0 that improves Recall@100 on synthetic data.

Black-box model classification under the discriminative factorization

cs.LG · 2026-05-08 · unverdicted · novelty 6.0

Discriminative factorization distinguishes high-quality query sets for black-box model classification, with chance-level error decaying exponentially in query budget and parameters predicting empirical decay rates on auditing tasks.

MLAIRE: Multilingual Language-Aware Information Retrieval Evaluation Protocol

cs.IR · 2026-05-08 · unverdicted · novelty 6.0

MLAIRE is a protocol that evaluates multilingual retrievers on both semantic accuracy and query-language preference using parallel passages and new metrics like LPR and Lang-nDCG, showing that standard metrics hide distinct behavioral differences among retrievers.

LLMs Corrupt Your Documents When You Delegate

cs.CL · 2026-04-17 · unverdicted · novelty 6.0

LLMs, including frontier models, corrupt an average of 25% of document content during long delegated editing workflows across 52 domains, and agentic tools do not mitigate the issue.

MINT: Multi-Vector Search Index Tuning

cs.DB · 2025-04-28 · unverdicted · novelty 6.0

MINT defines multi-vector search index tuning and provides algorithms that achieve 2.1X to 8.3X latency speedup over baselines under storage and recall constraints.

Control Charts for Multi-agent Systems

cs.MA · 2026-05-11 · unverdicted · novelty 5.0

Adaptive control charts can monitor learning multi-agent systems but are vulnerable to gradual adversarial defection, revealing a fundamental tradeoff between allowing agents to learn and maintaining security against adversaries.
