Nomic Embed: Training a Reproducible Long Context Text Embedder

Zach Nussbaum, John X. Morris, Brandon Duderstadt, Andriy Mulyar · 2024 · cs.CL · arXiv 2402.01613

Mixed citation behavior. Most common role is baseline (38%).

30 Pith papers citing it

Baseline 38% of classified citations

abstract

This technical report describes the training of nomic-embed-text-v1, the first fully reproducible, open-source, open-weights, open-data, 8192 context length English text embedding model that outperforms both OpenAI Ada-002 and OpenAI text-embedding-3-small on the short-context MTEB benchmark and the long context LoCo benchmark. We release the training code and model weights under an Apache 2.0 license. In contrast with other open-source models, we release the full curated training data and code that allows for full replication of nomic-embed-text-v1. You can find code and data to replicate the model at https://github.com/nomic-ai/contrastors.

citation-role summary

baseline: 3 · method: 3 · background: 1 · other: 1
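
The 38% baseline share quoted on this page appears to follow from these role counts. The sketch below assumes that only the eight citations counted here are "classified" and that the page rounds half up; both are assumptions, and `role_counts` and `role_share_percent` are illustrative names rather than part of any Pith tool.

```python
import math

# Role counts copied from the citation-role summary on this page.
role_counts = {"baseline": 3, "method": 3, "background": 1, "other": 1}


def role_share_percent(counts, role):
    """Share of `role` among all classified citations, as a whole percent.

    Rounds half up (an assumption about how the page rounds).
    """
    total = sum(counts.values())
    return math.floor(100 * counts[role] / total + 0.5)


# 3 of 8 classified citations is 37.5%, which rounds half up to 38.
baseline_share = role_share_percent(role_counts, "baseline")
print(baseline_share)
```

Under these assumptions, the method role has the same share as baseline, so "most common role" describes a tie.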

citation-polarity summary

baseline: 3 · use method: 3 · background: 1 · unclear: 1

representative citing papers

The Voronoi Bottleneck: Capacity-Aware Dense Retrieval for Product Search

cs.IR · 2026-06-09 · unverdicted · novelty 7.0

Proves Voronoi complexity equals sign-rank for top-1 retrieval, introduces CUS diagnostic predicting retrieval failure at AUC >0.8 without labels, and AT-DW-InfoNCE objective with derived alpha^*=2.0 that improves Recall@100 on synthetic data.

OpenIIR: An Open Simulation Platform for Information Retrieval Research

cs.IR · 2026-05-10 · accept · novelty 7.0 · 2 refs

OpenIIR provides a shared core and pluggable interface for running reproducible multi-agent simulations of information retrieval using LLM personas in four defined study archetypes.

XGRAG: A Graph-Native Framework for Explaining KG-based Retrieval-Augmented Generation

cs.AI · 2026-04-27 · unverdicted · novelty 7.0

XGRAG uses graph perturbations to quantify component contributions in GraphRAG and achieves 14.81% better explanation quality than text-based baselines on QA datasets, with correlations to graph centrality.

Participatory provenance as representational auditing for AI-mediated public consultation

cs.AI · 2026-04-22 · unverdicted · novelty 7.0

Participatory provenance auditing of Canada's AI strategy consultation shows official AI summaries exclude 15-17% of participants more than random baselines, with 33-88% exclusion for dissent clusters.

MARVEL: Multimodal Adaptive Reasoning-intensiVe Expand-rerank and retrievaL

cs.IR · 2026-04-08 · unverdicted · novelty 7.0

MARVEL reaches 37.9 nDCG@10 on the MM-BRIGHT benchmark by combining LLM query expansion, a reasoning-enhanced dense retriever, and GPT-4o CoT reranking, beating prior multimodal encoders by 10.3 points.

Invoice Haystack: Benchmarking Document Retrieval and Visual Question Answering Under Strong Visual Homogeneity

cs.CV · 2026-06-24 · unverdicted · novelty 6.0

Presents Invoice Haystack benchmark for homogeneous document retrieval and VL-RAG hybrid framework achieving 60% Recall@1 and up to 13.5 point gains over prior methods.

HistoRAG: Embedding Historical Methodology in Retrieval-Augmented Generation Through Critical Technical Practice

cs.CL · 2026-06-16 · unverdicted · novelty 6.0

HistoRAG embeds historiographical principles into RAG via temporal windowing, decoupled retrieval, and contestable LLM relevance judgments, evaluated on 102k Der Spiegel articles from 1950-1979.

Covariance Structure and Coordinate Heterogeneity Govern Binary Quantization of Contrastive Embeddings

cs.LG · 2026-05-17 · unverdicted · novelty 6.0 · 2 refs

Covariance structure and coordinate heterogeneity in InfoNCE embeddings control binary quantization fidelity, with off-diagonals contributing 30-50% of signal and heterogeneity determining rotation benefit and bit utility under a Gaussian model.

Layer-wise Representation Dynamics: An Empirical Investigation Across Embedders and Base LLMs

cs.LG · 2026-05-12 · unverdicted · novelty 6.0

LRD framework with Frenet, NRS, and GFMI metrics shows layer-wise structure in 31 models provides usable signal for model selection and pruning on MTEB tasks.

TrajPrism: A Multi-Task Benchmark for Language-Grounded Urban Trajectory Understanding

cs.AI · 2026-05-11 · unverdicted · novelty 6.0

TrajPrism introduces a multi-task benchmark with 300K real-world urban trajectories and 2.1M language-grounded task instances across three cities, plus proof-of-concept models showing large gaps versus geometry-only baselines.

Black-box model classification under the discriminative factorization

cs.LG · 2026-05-08 · unverdicted · novelty 6.0

Discriminative factorization distinguishes high-quality query sets for black-box model classification, with chance-level error decaying exponentially in query budget and parameters predicting empirical decay rates on auditing tasks.

MLAIRE: Multilingual Language-Aware Information Retrieval Evaluation Protocol

cs.IR · 2026-05-08 · unverdicted · novelty 6.0

MLAIRE is a protocol that evaluates multilingual retrievers on both semantic accuracy and query-language preference using parallel passages and new metrics like LPR and Lang-nDCG, showing that standard metrics hide distinct behavioral differences among retrievers.

cs.CL · 2026-05-01 · unverdicted · novelty 6.0

Machine translation preserves embedding similarity structure for ten languages but distorts it for four in the Manifesto Corpus, via a new non-inferiority testing framework.

LLMs Corrupt Your Documents When You Delegate

cs.CL · 2026-04-17 · unverdicted · novelty 6.0

LLMs, including frontier models, corrupt an average of 25% of document content during long delegated editing workflows across 52 domains, and agentic tools do not mitigate the issue.

ASTRA: Mapping Art-Technology Institutions via Conceptual Axes, Text Embeddings, and Unsupervised Clustering

cs.DL · 2026-03-28 · accept · novelty 6.0

ASTRA combines an eight-axis conceptual framework with text embeddings and unsupervised clustering to map and group 78 art-technology institutions into coherent thematic clusters.

SitEmb-v1.5: Improved Context-Aware Dense Retrieval for Semantic Association and Long Story Comprehension

cs.CL · 2025-08-03 · unverdicted · novelty 6.0

SitEmb-v1.5 uses a new training paradigm to produce context-situated embeddings for short chunks, outperforming larger models by over 10% on a curated book-plot retrieval benchmark.

MINT: Multi-Vector Search Index Tuning

cs.DB · 2025-04-28 · unverdicted · novelty 6.0

MINT defines multi-vector search index tuning and provides algorithms that achieve 2.1X to 8.3X latency speedup over baselines under storage and recall constraints.

ArchRAG: Attributed Community-based Hierarchical Retrieval-Augmented Generation

cs.IR · 2025-02-14 · unverdicted · novelty 6.0

ArchRAG proposes attributed-community hierarchical indexing and LLM clustering to improve accuracy and lower token usage in graph-based retrieval-augmented generation.

Low-cost concept-based localized explanations: How far can we get with training-free approaches?

cs.AI · 2026-06-27 · unverdicted · novelty 5.0

Mid-scale MLLMs reach 62-88% object-level exact-match accuracy in zero-shot localized concept naming via closed-set prompting and an embedding-based Open-CoNa strategy across datasets.

S-SPPO: Semantic-Calibrated Self-Play Preference Optimization

cs.AI · 2026-06-01 · unverdicted · novelty 5.0

S-SPPO stabilizes SPPO via semantic calibration in supervision and representation spaces, reporting 52.19% win rate on AlpacaEval 2.0 with Llama-3-8B.

Jailbreak susceptibility prediction and mitigation via the behavioral geometry of models

cs.CR · 2026-05-26 · unverdicted · novelty 5.0

Behavioral geometry of model populations enables high-accuracy jailbreak susceptibility prediction and defense transfer with 98% fewer evaluations.

Benchmarking Patent Embeddings: A Multi-Task Evaluation of 22 Models Across Retrieval, Classification, and Clustering

cs.IR · 2026-05-22 · unverdicted · novelty 5.0

Multi-task evaluation of 22 patent embedding models finds task-specific fine-tuning benefits and significant cross-landscape retrieval degradation that cannot be fixed by hybrid fusion.

GraphRAG on Consumer Hardware: Benchmarking Local LLMs for Healthcare EHR Schema Retrieval

cs.CL · 2026-05-20 · unverdicted · novelty 5.0

GraphRAG with 7-8B local LLMs on 8GB VRAM hardware builds knowledge graphs from EHR docs and answers queries, with Llama 3.1 creating the largest graph, Qwen 2.5 scoring highest on quality, and models below ~7B failing to complete the pipeline.

Control Charts for Multi-agent Systems

cs.MA · 2026-05-11 · unverdicted · novelty 5.0

Adaptive control charts can monitor learning multi-agent systems but are vulnerable to gradual adversarial defection, revealing a fundamental tradeoff between allowing agents to learn and maintaining security against adversaries.
