hub Mixed citations

Nomic Embed: Training a Reproducible Long Context Text Embedder

Zach Nussbaum, John X. Morris, Brandon Duderstadt, Andriy Mulyar · 2024 · cs.CL · arXiv 2402.01613

Mixed citation behavior. The most common role is baseline (38% of classified citations).

22 Pith papers cite it.


abstract

This technical report describes the training of nomic-embed-text-v1, the first fully reproducible, open-source, open-weights, open-data, 8192 context length English text embedding model that outperforms both OpenAI Ada-002 and OpenAI text-embedding-3-small on the short-context MTEB benchmark and the long context LoCo benchmark. We release the training code and model weights under an Apache 2.0 license. In contrast with other open-source models, we release the full curated training data and code that allows for full replication of nomic-embed-text-v1. You can find code and data to replicate the model at https://github.com/nomic-ai/contrastors.

citation-role summary

baseline 3 · method 3 · background 1 · other 1
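
The 38% baseline share reported above is consistent with these counts: 3 of 8 classified citations is 37.5%. The following is a minimal sketch of that arithmetic, assuming the share is the role count divided by the total of classified citations and rounded half up. The function name `role_shares` is hypothetical, and the rounding rule is an assumption rather than the hub's documented method.

```python
import math


def role_shares(counts):
    """Return each role's share of classified citations as a whole-number percent."""
    total = sum(counts.values())
    if total == 0:
        return {role: 0 for role in counts}
    # Round half up (assumed rule); the built-in round() rounds halves
    # to the nearest even number, which would turn 12.5 into 12.
    return {role: math.floor(100 * n / total + 0.5) for role, n in counts.items()}


# Counts copied from the citation-role summary.
counts = {"baseline": 3, "method": 3, "background": 1, "other": 1}
shares = role_shares(counts)
print(shares)
```

Because each share is rounded independently, the percentages need not sum to exactly 100.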

citation-polarity summary

baseline 3 · use method 3 · background 1 · unclear 1

representative citing papers

OpenIIR: An Open Simulation Platform for Information Retrieval Research

cs.IR · 2026-05-10 · accept · novelty 7.0 · 2 refs

OpenIIR provides a shared core and pluggable interface for running reproducible multi-agent simulations of information retrieval using LLM personas in four defined study archetypes.

XGRAG: A Graph-Native Framework for Explaining KG-based Retrieval-Augmented Generation

cs.AI · 2026-04-27 · unverdicted · novelty 7.0

XGRAG uses graph perturbations to quantify component contributions in GraphRAG and achieves 14.81% better explanation quality than text-based baselines on QA datasets, with correlations to graph centrality.

Participatory provenance as representational auditing for AI-mediated public consultation

cs.AI · 2026-04-22 · unverdicted · novelty 7.0

Participatory provenance auditing of Canada's AI strategy consultation shows official AI summaries exclude 15-17% of participants more than random baselines, with 33-88% exclusion for dissent clusters.

MARVEL: Multimodal Adaptive Reasoning-intensiVe Expand-rerank and retrievaL

cs.IR · 2026-04-08 · unverdicted · novelty 7.0

MARVEL reaches 37.9 nDCG@10 on the MM-BRIGHT benchmark by combining LLM query expansion, a reasoning-enhanced dense retriever, and GPT-4o CoT reranking, beating prior multimodal encoders by 10.3 points.

Coordinate Heterogeneity Governs Binary Quantization: From InfoNCE to Recall

cs.LG · 2026-05-17 · unverdicted · novelty 6.0

Coordinate heterogeneity governs binary quantization performance via closed-form ranking fidelity expressions and a two-parameter scaling law, validated on 13 datasets across 6 embedding families.

Layer-wise Representation Dynamics: An Empirical Investigation Across Embedders and Base LLMs

cs.LG · 2026-05-12 · unverdicted · novelty 6.0

LRD framework with Frenet, NRS, and GFMI metrics shows layer-wise structure in 31 models provides usable signal for model selection and pruning on MTEB tasks.

TrajPrism: A Multi-Task Benchmark for Language-Grounded Urban Trajectory Understanding

cs.AI · 2026-05-11 · unverdicted · novelty 6.0

TrajPrism introduces a multi-task benchmark with 300K real-world urban trajectories and 2.1M language-grounded task instances across three cities, plus proof-of-concept models showing large gaps versus geometry-only baselines.

Black-box model classification under the discriminative factorization

cs.LG · 2026-05-08 · unverdicted · novelty 6.0

Discriminative factorization distinguishes high-quality query sets for black-box model classification, with chance-level error decaying exponentially in query budget and parameters predicting empirical decay rates on auditing tasks.

MLAIRE: Multilingual Language-Aware Information Retrieval Evaluation Protocal

cs.IR · 2026-05-08 · unverdicted · novelty 6.0

MLAIRE is a protocol that evaluates multilingual retrievers on both semantic accuracy and query-language preference using parallel passages and new metrics like LPR and Lang-nDCG, showing that standard metrics hide distinct behavioral differences among retrievers.

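Is Textual Similarity Invariant under Machine Translation? Evidence Based on the Political Manifesto Corpus

cs.CL · 2026-05-01 · unverdicted · novelty 6.0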

Machine translation preserves embedding similarity structure for ten languages but distorts it for four in the Manifesto Corpus, via a new non-inferiority testing framework.

LLMs Corrupt Your Documents When You Delegate

cs.CL · 2026-04-17 · unverdicted · novelty 6.0

LLMs corrupt an average of 25% of document content during long delegated editing workflows across 52 domains, even frontier models, and agentic tools do not mitigate the issue.

ASTRA: Mapping Art-Technology Institutions via Conceptual Axes, Text Embeddings, and Unsupervised Clustering

cs.DL · 2026-03-28 · accept · novelty 6.0

ASTRA combines an eight-axis conceptual framework with text embeddings and unsupervised clustering to map and group 78 art-technology institutions into coherent thematic clusters.

SitEmb-v1.5: Improved Context-Aware Dense Retrieval for Semantic Association and Long Story Comprehension

cs.CL · 2025-08-03 · unverdicted · novelty 6.0

SitEmb-v1.5 uses a new training paradigm to produce context-situated embeddings for short chunks, outperforming larger models by over 10% on a curated book-plot retrieval benchmark.

MINT: Multi-Vector Search Index Tuning

cs.DB · 2025-04-28 · unverdicted · novelty 6.0

MINT defines multi-vector search index tuning and provides algorithms that achieve 2.1X to 8.3X latency speedup over baselines under storage and recall constraints.

ArchRAG: Attributed Community-based Hierarchical Retrieval-Augmented Generation

cs.IR · 2025-02-14 · unverdicted · novelty 6.0

ArchRAG proposes attributed-community hierarchical indexing and LLM clustering to improve accuracy and lower token usage in graph-based retrieval-augmented generation.

GraphRAG on Consumer Hardware: Benchmarking Local LLMs for Healthcare EHR Schema Retrieval

cs.CL · 2026-05-20 · unverdicted · novelty 5.0

GraphRAG with 7-8B local LLMs on 8GB VRAM hardware builds knowledge graphs from EHR docs and answers queries, with Llama 3.1 creating the largest graph, Qwen 2.5 scoring highest on quality, and models below ~7B failing to complete the pipeline.

Control Charts for Multi-agent Systems

cs.MA · 2026-05-11 · unverdicted · novelty 5.0

Adaptive control charts can monitor learning multi-agent systems but are vulnerable to gradual adversarial defection, revealing a fundamental tradeoff between allowing agents to learn and maintaining security against adversaries.

Towards Platonic Representation for Table Reasoning: A Foundation for Permutation-Invariant Retrieval

cs.AI · 2026-04-13 · unverdicted · novelty 5.0

Table representations must be permutation-invariant to preserve semantic structure, and a new header-aligned encoder moves toward this ideal while exposing fragility in existing LLM table embeddings.

BRIDGE: Multimodal-to-Text Retrieval via Reinforcement-Learned Query Alignment

cs.IR · 2026-04-08 · unverdicted · novelty 5.0

BRIDGE reaches 29.7 nDCG@10 on MM-BRIGHT by RL-aligning multimodal queries to text and using a reasoning retriever, beating multimodal encoders and, when combined with Nomic-Vision, exceeding the best text-only retriever at 33.3.

Gaussian mixture models as a proxy for interacting language models

cs.CL · 2025-05-29 · unverdicted · novelty 5.0

Interacting Gaussian mixture models with RAG-style updates are shown to mimic aspects of interacting LLMs and are used to prove lower bounds on polarization probability in the resulting Markov chain.

Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference

cs.CL · 2024-12-18 · unverdicted · novelty 5.0

ModernBERT is a new bidirectional encoder model achieving SOTA performance on diverse classification and retrieval benchmarks while offering superior speed and memory efficiency for long-context inference.

Health System Scale Semantic Search Across Unstructured Clinical Notes

cs.IR · 2026-04-28 · unverdicted · novelty 4.0

A semantic search system was deployed at health-system scale across 166 million clinical notes, delivering sub-second latency, ~$4000 monthly cost, and 24-89% faster chart abstraction with maintained agreement.

citing papers explorer


OpenIIR: An Open Simulation Platform for Information Retrieval Research · cs.IR · 2026-05-10 · accept · none · ref 10 · 2 links · internal anchor
XGRAG: A Graph-Native Framework for Explaining KG-based Retrieval-Augmented Generation · cs.AI · 2026-04-27 · unverdicted · none · ref 21 · internal anchor
Participatory provenance as representational auditing for AI-mediated public consultation · cs.AI · 2026-04-22 · unverdicted · none · ref 17 · internal anchor
MARVEL: Multimodal Adaptive Reasoning-intensiVe Expand-rerank and retrievaL · cs.IR · 2026-04-08 · unverdicted · none · ref 25 · internal anchor
Coordinate Heterogeneity Governs Binary Quantization: From InfoNCE to Recall · cs.LG · 2026-05-17 · unverdicted · none · ref 28 · internal anchor
Layer-wise Representation Dynamics: An Empirical Investigation Across Embedders and Base LLMs · cs.LG · 2026-05-12 · unverdicted · none · ref 50 · internal anchor
TrajPrism: A Multi-Task Benchmark for Language-Grounded Urban Trajectory Understanding · cs.AI · 2026-05-11 · unverdicted · none · ref 19 · internal anchor
Black-box model classification under the discriminative factorization · cs.LG · 2026-05-08 · unverdicted · none · ref 23 · internal anchor
MLAIRE: Multilingual Language-Aware Information Retrieval Evaluation Protocal · cs.IR · 2026-05-08 · unverdicted · none · ref 36 · internal anchor
Is Textual Similarity Invariant under Machine Translation? Evidence Based on the Political Manifesto Corpus · cs.CL · 2026-05-01 · unverdicted · none · ref 52 · internal anchor
LLMs Corrupt Your Documents When You Delegate · cs.CL · 2026-04-17 · unverdicted · none · ref 66 · internal anchor
ASTRA: Mapping Art-Technology Institutions via Conceptual Axes, Text Embeddings, and Unsupervised Clustering · cs.DL · 2026-03-28 · accept · none · ref 33 · internal anchor
SitEmb-v1.5: Improved Context-Aware Dense Retrieval for Semantic Association and Long Story Comprehension · cs.CL · 2025-08-03 · unverdicted · none · ref 10 · internal anchor
MINT: Multi-Vector Search Index Tuning · cs.DB · 2025-04-28 · unverdicted · none · ref 53 · internal anchor
ArchRAG: Attributed Community-based Hierarchical Retrieval-Augmented Generation · cs.IR · 2025-02-14 · unverdicted · none · ref 43 · internal anchor
GraphRAG on Consumer Hardware: Benchmarking Local LLMs for Healthcare EHR Schema Retrieval · cs.CL · 2026-05-20 · unverdicted · none · ref 20 · internal anchor
Control Charts for Multi-agent Systems · cs.MA · 2026-05-11 · unverdicted · none · ref 17 · internal anchor
Towards Platonic Representation for Table Reasoning: A Foundation for Permutation-Invariant Retrieval · cs.AI · 2026-04-13 · unverdicted · none · ref 20 · internal anchor
BRIDGE: Multimodal-to-Text Retrieval via Reinforcement-Learned Query Alignment · cs.IR · 2026-04-08 · unverdicted · none · ref 30 · internal anchor
Gaussian mixture models as a proxy for interacting language models · cs.CL · 2025-05-29 · unverdicted · none · ref 8 · internal anchor
Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference · cs.CL · 2024-12-18 · unverdicted · none · ref 170 · internal anchor
Health System Scale Semantic Search Across Unstructured Clinical Notes · cs.IR · 2026-04-28 · unverdicted · none · ref 14 · internal anchor
