Proves Voronoi complexity equals sign-rank for top-1 retrieval, introduces CUS diagnostic predicting retrieval failure at AUC >0.8 without labels, and AT-DW-InfoNCE objective with derived alpha^*=2.0 that improves Recall@100 on synthetic data.
hub Mixed citations
Nomic Embed: Training a Reproducible Long Context Text Embedder
Mixed citation behavior. Most common role is baseline (38%).
abstract
This technical report describes the training of nomic-embed-text-v1, the first fully reproducible, open-source, open-weights, open-data, 8192 context length English text embedding model that outperforms both OpenAI Ada-002 and OpenAI text-embedding-3-small on the short-context MTEB benchmark and the long context LoCo benchmark. We release the training code and model weights under an Apache 2.0 license. In contrast with other open-source models, we release the full curated training data and code that allows for full replication of nomic-embed-text-v1. You can find code and data to replicate the model at https://github.com/nomic-ai/contrastors.
hub tools
citation-role summary
citation-polarity summary
representative citing papers
OpenIIR provides a shared core and pluggable interface for running reproducible multi-agent simulations of information retrieval using LLM personas in four defined study archetypes.
XGRAG uses graph perturbations to quantify component contributions in GraphRAG and achieves 14.81% better explanation quality than text-based baselines on QA datasets, with correlations to graph centrality.
Participatory provenance auditing of Canada's AI strategy consultation shows official AI summaries exclude 15-17% of participants more than random baselines, with 33-88% exclusion for dissent clusters.
MARVEL reaches 37.9 nDCG@10 on the MM-BRIGHT benchmark by combining LLM query expansion, a reasoning-enhanced dense retriever, and GPT-4o CoT reranking, beating prior multimodal encoders by 10.3 points.
Presents Invoice Haystack benchmark for homogeneous document retrieval and VL-RAG hybrid framework achieving 60% Recall@1 and up to 13.5 point gains over prior methods.
HistoRAG embeds historiographical principles into RAG via temporal windowing, decoupled retrieval, and contestable LLM relevance judgments, evaluated on 102k Der Spiegel articles from 1950-1979.
Covariance structure and coordinate heterogeneity in InfoNCE embeddings control binary quantization fidelity, with off-diagonals contributing 30-50% of signal and heterogeneity determining rotation benefit and bit utility under a Gaussian model.
LRD framework with Frenet, NRS, and GFMI metrics shows layer-wise structure in 31 models provides usable signal for model selection and pruning on MTEB tasks.
TrajPrism introduces a multi-task benchmark with 300K real-world urban trajectories and 2.1M language-grounded task instances across three cities, plus proof-of-concept models showing large gaps versus geometry-only baselines.
Discriminative factorization distinguishes high-quality query sets for black-box model classification, with chance-level error decaying exponentially in query budget and parameters predicting empirical decay rates on auditing tasks.
MLAIRE is a protocol that evaluates multilingual retrievers on both semantic accuracy and query-language preference using parallel passages and new metrics like LPR and Lang-nDCG, showing that standard metrics hide distinct behavioral differences among retrievers.
Machine translation preserves embedding similarity structure for ten languages but distorts it for four in the Manifesto Corpus, via a new non-inferiority testing framework.
LLMs corrupt an average of 25% of document content during long delegated editing workflows across 52 domains, even frontier models, and agentic tools do not mitigate the issue.
ASTRA combines an eight-axis conceptual framework with text embeddings and unsupervised clustering to map and group 78 art-technology institutions into coherent thematic clusters.
SitEmb-v1.5 uses a new training paradigm to produce context-situated embeddings for short chunks, outperforming larger models by over 10% on a curated book-plot retrieval benchmark.
MINT defines multi-vector search index tuning and provides algorithms that achieve 2.1X to 8.3X latency speedup over baselines under storage and recall constraints.
ArchRAG proposes attributed-community hierarchical indexing and LLM clustering to improve accuracy and lower token usage in graph-based retrieval-augmented generation.
Mid-scale MLLMs reach 62-88% object-level exact-match accuracy in zero-shot localized concept naming via closed-set prompting and an embedding-based Open-CoNa strategy across datasets.
S-SPPO stabilizes SPPO via semantic calibration in supervision and representation spaces, reporting 52.19% win rate on AlpacaEval 2.0 with Llama-3-8B.
Behavioral geometry of model populations enables high-accuracy jailbreak susceptibility prediction and defense transfer with 98% fewer evaluations.
Multi-task evaluation of 22 patent embedding models finds task-specific fine-tuning benefits and significant cross-landscape retrieval degradation that cannot be fixed by hybrid fusion.
GraphRAG with 7-8B local LLMs on 8GB VRAM hardware builds knowledge graphs from EHR docs and answers queries, with Llama 3.1 creating the largest graph, Qwen 2.5 scoring highest on quality, and models below ~7B failing to complete the pipeline.
Adaptive control charts can monitor learning multi-agent systems but are vulnerable to gradual adversarial defection, revealing a fundamental tradeoff between allowing agents to learn and maintaining security against adversaries.
citing papers explorer
-
OpenIIR: An Open Simulation Platform for Information Retrieval Research
OpenIIR provides a shared core and pluggable interface for running reproducible multi-agent simulations of information retrieval using LLM personas in four defined study archetypes.