Dense dual-encoder retrievers outperform BM25 by 9-19% absolute in top-20 passage retrieval accuracy across open-domain QA datasets and enable new state-of-the-art end-to-end QA results.
Billion-scale similarity search with GPUs
Mixed citation behavior. Most common role is method (60%).
abstract
Similarity search finds application in specialized database systems handling complex data such as images or videos, which are typically represented by high-dimensional features and require specific indexing structures. This paper tackles the problem of better utilizing GPUs for this task. While GPUs excel at data-parallel tasks, prior approaches are bottlenecked by algorithms that expose less parallelism, such as k-min selection, or make poor use of the memory hierarchy. We propose a design for k-selection that operates at up to 55% of theoretical peak performance, enabling a nearest neighbor implementation that is 8.5x faster than prior GPU state of the art. We apply it in different similarity search scenarios, by proposing optimized designs for brute-force, approximate and compressed-domain search based on product quantization. In all these setups, we outperform the state of the art by large margins. Our implementation enables the construction of a high accuracy k-NN graph on 95 million images from the YFCC100M dataset in 35 minutes, and of a graph connecting 1 billion vectors in less than 12 hours on 4 Maxwell Titan X GPUs. We have open-sourced our approach for the sake of comparison and reproducibility.
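To make the abstract's terms concrete, here is a minimal CPU sketch of brute-force nearest-neighbor search followed by k-selection (keeping the k smallest distances). It is not the paper's GPU design: the function names `squared_l2` and `knn_brute_force` are hypothetical, and a binary heap from the Python standard library stands in for the paper's k-selection. It only illustrates the operation the paper accelerates.

```python
import heapq
from typing import List, Sequence, Tuple


def squared_l2(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def knn_brute_force(
    query: Sequence[float],
    database: Sequence[Sequence[float]],
    k: int,
) -> List[Tuple[float, int]]:
    """Return up to k (distance, index) pairs, nearest first.

    Hypothetical helper for illustration: every database vector is
    compared to the query (brute force), then heapq.nsmallest performs
    the k-selection step on the resulting distances.
    """
    if k <= 0:
        return []
    distances = (
        (squared_l2(query, vec), idx) for idx, vec in enumerate(database)
    )
    return heapq.nsmallest(k, distances)


if __name__ == "__main__":
    database = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0], [0.5, 0.0]]
    query = [0.0, 0.1]
    # Print the indices of the two nearest database vectors.
    print([idx for _, idx in knn_brute_force(query, database, 2)])
```

The distance computation is trivially data-parallel, while the selection step is the part the abstract identifies as a bottleneck on GPUs; this sketch keeps the two steps separate so that distinction stays visible.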
representative citing papers
BEIR is a heterogeneous zero-shot benchmark showing BM25 as a robust baseline while re-ranking and late-interaction models perform best on average at higher cost, with dense and sparse models lagging in generalization.
Sentence-BERT adapts BERT with siamese and triplet networks to produce sentence embeddings for efficient cosine-similarity comparisons, cutting computation time from hours to seconds on similarity search while matching BERT accuracy.
HERMES provides a reusable hierarchical labeling substrate for pre-training data that reveals granularity-specific effects in data mixing rules during model training.
RWGBench is a citation-centric benchmark for related work generation built from 40k CS papers and a 100-paper test set, with multi-dimensional metrics that better match human expert judgment than standard similarity scores.
A 527-item GDPR-aligned privacy preference item bank was developed by extracting 669 statements from 99 GDPR articles and validating them through multi-round expert consensus and semantic clustering.
MIST is a new simulator for heterogeneous multi-stage LLM inference that combines hardware traces with analytical models to explore configuration trade-offs in hybrid CPU-accelerator systems.
ExaGPT uses span-level similarity retrieval from human and LLM datastores to detect machine-generated text while supplying the matching spans as human-interpretable evidence, achieving up to 37-point accuracy gains over prior interpretable detectors at 1% FPR.
RAG models set new state-of-the-art results on open-domain QA by retrieving Wikipedia passages and conditioning a generative model on them, while also producing more factual text than parametric baselines.
A per-component SimHash fingerprint supplies structural identity for AI agent skills, recovering family membership under paraphrase and refactoring with AUC 0.974 while localizing changes.
Large-scale HPC evaluation of Qdrant, Milvus, and Weaviate reveals that workload patterns limit scaling and extra cores can reduce throughput, exposing a cloud-to-HPC design mismatch.
NTILC replaces in-context tool registry lookup with learned latent retrieval using a signature-aware composite loss, reducing context consumption by over 95% and latency by up to 74%.
CourseBlueprint builds a typed pipeline over a 23-lecture biomedical imaging corpus to generate prerequisite-aware, learner-adaptive videos with auditable engagement contracts and slide grounding.
TGQ-Former uses metadata-guided hybrid queries and dual-gated modulation to improve visual token selection in multimodal e-commerce retrieval, raising average Hit Rate@100 by 6.04% over baselines.
The DCLM-Baseline dataset lets a 7B model reach 64% 5-shot MMLU accuracy after 2.6T tokens, beating prior open-data models by 6.6 points on MMLU with 40% less compute.
RAPTOR introduces a tree-organized retrieval method using recursive abstractive summaries, achieving a 20% absolute accuracy improvement on the QuALITY benchmark when paired with GPT-4.
UAGA aligns two graph embedding spaces via adversarial training in a fully unsupervised setting, with an incremental extension iUAGA that uses discovered pseudo-anchors to refine both embeddings and alignments.
Pyramid is a distributed similarity search framework based on HNSW that partitions datasets into similar-item sub-datasets for efficient query processing and includes failure recovery and straggler mitigation.
QRAFTI is a multi-agent framework using tool-calling and reflection-based planning to emulate quant research tasks like factor replication and signal testing on financial data.
A multi-agent system for explainable fake news detection that decomposes claims, retrieves evidence, verifies with calibrated confidence, and aggregates logic verdicts, showing better interpretability than BERT/RoBERTa on the LIAR benchmark despite lower raw accuracy.
A hybrid graph-text retrieval system for cyber threat intelligence improves multi-hop question answering by up to 35% over vector-based RAG on a 3,300-question benchmark.
CICL scores and compresses context evidence for LLM agents via action-shift and outcome-uplift metrics, lifting hit@1 from 0.58 to 0.78 on 50 SWE-bench retrieval tasks.
ESGLens applies RAG and LLM embeddings to extract GRI-aligned information from ESG reports and achieves 0.48 Pearson correlation when regressing environmental scores on 300 company reports.
Local intrinsic dimensionality enables selection of query sets with varying difficulty for nearest neighbor search benchmarking, and common real-world datasets are not diverse, since performance on one predicts performance on the others well.