Dense dual-encoder retrievers outperform BM25 by 9-19% absolute in top-20 passage retrieval accuracy across open-domain QA datasets and enable new state-of-the-art end-to-end QA results.
hub Mixed citations
Billion-scale similarity search with GPUs
Mixed citation behavior. Most common role is method (60%).
abstract
Similarity search finds application in specialized database systems handling complex data such as images or videos, which are typically represented by high-dimensional features and require specific indexing structures. This paper tackles the problem of better utilizing GPUs for this task. While GPUs excel at data-parallel tasks, prior approaches are bottlenecked by algorithms that expose less parallelism, such as k-min selection, or make poor use of the memory hierarchy. We propose a design for k-selection that operates at up to 55% of theoretical peak performance, enabling a nearest neighbor implementation that is 8.5x faster than prior GPU state of the art. We apply it in different similarity search scenarios, by proposing optimized design for brute-force, approximate and compressed-domain search based on product quantization. In all these setups, we outperform the state of the art by large margins. Our implementation enables the construction of a high accuracy k-NN graph on 95 million images from the Yfcc100M dataset in 35 minutes, and of a graph connecting 1 billion vectors in less than 12 hours on 4 Maxwell Titan X GPUs. We have open-sourced our approach for the sake of comparison and reproducibility.
hub tools
citation-role summary
citation-polarity summary
representative citing papers
BEIR is a heterogeneous zero-shot benchmark showing BM25 as a robust baseline while re-ranking and late-interaction models perform best on average at higher cost, with dense and sparse models lagging in generalization.
Sentence-BERT adapts BERT with siamese and triplet networks to produce sentence embeddings for efficient cosine-similarity comparisons, cutting computation time from hours to seconds on similarity search while matching BERT accuracy.
RWGBench is a citation-centric benchmark for related work generation built from 40k CS papers and a 100-paper test set, with multi-dimensional metrics that better match human expert judgment than standard similarity scores.
A 527-item GDPR-aligned privacy preference item bank was developed by extracting 669 statements from 99 GDPR articles and validating them through multi-round expert consensus and semantic clustering.
MIST is a new simulator for heterogeneous multi-stage LLM inference that combines hardware traces with analytical models to explore configuration trade-offs in hybrid CPU-accelerator systems.
ExaGPT uses span-level similarity retrieval from human and LLM datastores to detect machine-generated text while supplying the matching spans as human-interpretable evidence, achieving up to 37-point accuracy gains over prior interpretable detectors at 1% FPR.
RAG models set new state-of-the-art results on open-domain QA by retrieving Wikipedia passages and conditioning a generative model on them, while also producing more factual text than parametric baselines.
A per-component SimHash fingerprint supplies structural identity for AI agent skills, recovering family membership under paraphrase and refactoring with AUC 0.974 while localizing changes.
Large-scale HPC evaluation of Qdrant, Milvus, and Weaviate reveals that workload patterns limit scaling and extra cores can reduce throughput, exposing a cloud-to-HPC design mismatch.
NTILC replaces in-context tool registry lookup with learned latent retrieval using a signature-aware composite loss, reducing context consumption by over 95% and latency by up to 74%.
CourseBlueprint builds a typed pipeline over a 23-lecture biomedical imaging corpus to generate prerequisite-aware, learner-adaptive videos with auditable engagement contracts and slide grounding.
TGQ-Former uses metadata-guided hybrid queries and dual-gated modulation to improve visual token selection in multimodal e-commerce retrieval, raising average Hit Rate@100 by 6.04% over baselines.
DCLM-Baseline dataset lets a 7B model reach 64% 5-shot MMLU accuracy after 2.6T tokens, beating prior open-data models by 6.6 points on MMLU with 40% less compute.
RAPTOR introduces a tree-organized retrieval method using recursive abstractive summaries, achieving a 20% absolute accuracy improvement on the QuALITY benchmark when paired with GPT-4.
UAGA aligns two graph embedding spaces via adversarial training in a fully unsupervised setting, with an incremental extension iUAGA that uses discovered pseudo-anchors to refine both embeddings and alignments.
Pyramid is a distributed similarity search framework based on HNSW that partitions datasets into similar-item sub-datasets for efficient query processing and includes failure recovery and straggler mitigation.
QRAFTI is a multi-agent framework using tool-calling and reflection-based planning to emulate quant research tasks like factor replication and signal testing on financial data.
A multi-agent system for explainable fake news detection that decomposes claims, retrieves evidence, verifies with calibrated confidence, and aggregates logic verdicts, showing better interpretability than BERT/RoBERTa on the LIAR benchmark despite lower raw accuracy.
A hybrid graph-text retrieval system for cyber threat intelligence improves multi-hop question answering by up to 35% over vector-based RAG on a 3,300-question benchmark.
CICL scores and compresses context evidence for LLM agents via action-shift and outcome-uplift metrics, lifting hit@1 from 0.58 to 0.78 on 50 SWE-bench retrieval tasks.
ESGLens applies RAG and LLM embeddings to extract GRI-aligned information from ESG reports and achieves 0.48 Pearson correlation when regressing environmental scores on 300 company reports.
Local intrinsic dimensionality enables selection of query sets with varying difficulty for nearest neighbor search benchmarking, and common real-world datasets are not diverse as performance on one predicts others well.
Empirical benchmark of FAISS (main memory) versus FENSHSES (secondary memory) on Hamming-space nearest-neighbor search across indexing speed, latency, and RAM.
citing papers explorer
-
Dense Passage Retrieval for Open-Domain Question Answering
Dense dual-encoder retrievers outperform BM25 by 9-19% absolute in top-20 passage retrieval accuracy across open-domain QA datasets and enable new state-of-the-art end-to-end QA results.
-
BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models
BEIR is a heterogeneous zero-shot benchmark showing BM25 as a robust baseline while re-ranking and late-interaction models perform best on average at higher cost, with dense and sparse models lagging in generalization.
-
Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks
Sentence-BERT adapts BERT with siamese and triplet networks to produce sentence embeddings for efficient cosine-similarity comparisons, cutting computation time from hours to seconds on similarity search while matching BERT accuracy.
-
RWGBench: Evaluating Scholarly Positioning in Related Work Generation
RWGBench is a citation-centric benchmark for related work generation built from 40k CS papers and a 100-paper test set, with multi-dimensional metrics that better match human expert judgment than standard similarity scores.
-
Modernizing User Privacy Preference Measurement through GPPI: A GDPR-aligned Privacy Preference Item Bank
A 527-item GDPR-aligned privacy preference item bank was developed by extracting 669 statements from 99 GDPR articles and validating them through multi-round expert consensus and semantic clustering.
-
MIST: A Co-Design Framework for Heterogeneous, Multi-Stage LLM Inference
MIST is a new simulator for heterogeneous multi-stage LLM inference that combines hardware traces with analytical models to explore configuration trade-offs in hybrid CPU-accelerator systems.
-
ExaGPT: Example-Based Machine-Generated Text Detection for Human Interpretability
ExaGPT uses span-level similarity retrieval from human and LLM datastores to detect machine-generated text while supplying the matching spans as human-interpretable evidence, achieving up to 37-point accuracy gains over prior interpretable detectors at 1% FPR.
-
Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks
RAG models set new state-of-the-art results on open-domain QA by retrieving Wikipedia passages and conditioning a generative model on them, while also producing more factual text than parametric baselines.
-
The Decomposition Is the Fingerprint: Per-Component Identity for Agent Skills
A per-component SimHash fingerprint supplies structural identity for AI agent skills, recovering family membership under paraphrase and refactoring with AUC 0.974 while localizing changes.
-
When More Cores Hurts: The Vector Database Scaling Paradox in HPC
Large-scale HPC evaluation of Qdrant, Milvus, and Weaviate reveals that workload patterns limit scaling and extra cores can reduce throughput, exposing a cloud-to-HPC design mismatch.
-
NTILC: Neural Tool Invocation via Learned Compression
NTILC replaces in-context tool registry lookup with learned latent retrieval using a signature-aware composite loss, reducing context consumption by over 95% and latency by up to 74%.
-
CourseBlueprint: A Structured Pipeline for Adaptive Pedagogical Video Generation Grounded in Course Corpora
CourseBlueprint builds a typed pipeline over a 23-lecture biomedical imaging corpus to generate prerequisite-aware, learner-adaptive videos with auditable engagement contracts and slide grounding.
-
Text-Guided Visual Representation Learning for Robust Multimodal E-Commerce Recommendation
TGQ-Former uses metadata-guided hybrid queries and dual-gated modulation to improve visual token selection in multimodal e-commerce retrieval, raising average Hit Rate@100 by 6.04% over baselines.
-
DataComp-LM: In search of the next generation of training sets for language models
DCLM-Baseline dataset lets a 7B model reach 64% 5-shot MMLU accuracy after 2.6T tokens, beating prior open-data models by 6.6 points on MMLU with 40% less compute.
-
RAPTOR: Recursive Abstractive Processing for Tree-Organized Retrieval
RAPTOR introduces a tree-organized retrieval method using recursive abstractive summaries, achieving a 20% absolute accuracy improvement on the QuALITY benchmark when paired with GPT-4.
-
Unsupervised Adversarial Graph Alignment with Graph Embedding
UAGA aligns two graph embedding spaces via adversarial training in a fully unsupervised setting, with an incremental extension iUAGA that uses discovered pseudo-anchors to refine both embeddings and alignments.
-
Pyramid: A General Framework for Distributed Similarity Search
Pyramid is a distributed similarity search framework based on HNSW that partitions datasets into similar-item sub-datasets for efficient query processing and includes failure recovery and straggler mitigation.
-
QRAFTI: An Agentic Framework for Empirical Research in Quantitative Finance
QRAFTI is a multi-agent framework using tool-calling and reflection-based planning to emulate quant research tasks like factor replication and signal testing on financial data.
-
TRUST Agents: A Collaborative Multi-Agent Framework for Fake News Detection, Explainable Verification, and Logic-Aware Claim Reasoning
A multi-agent system for explainable fake news detection that decomposes claims, retrieves evidence, verifies with calibrated confidence, and aggregates logic verdicts, showing better interpretability than BERT/RoBERTa on the LIAR benchmark despite lower raw accuracy.
-
Beyond RAG for Cyber Threat Intelligence: A Systematic Evaluation of Graph-Based and Agentic Retrieval
A hybrid graph-text retrieval system for cyber threat intelligence improves multi-hop question answering by up to 35% over vector-based RAG on a 3,300-question benchmark.
-
Decision-Aware Memory Cards: Counterfactual-Inspired Context Selection and Compression for Tool-Using LLM Agents
CICL scores and compresses context evidence for LLM agents via action-shift and outcome-uplift metrics, lifting hit@1 from 0.58 to 0.78 on 50 SWE-bench retrieval tasks.
-
ESGLens: An LLM-Based RAG Framework for Interactive ESG Report Analysis and Score Prediction
ESGLens applies RAG and LLM embeddings to extract GRI-aligned information from ESG reports and achieves 0.48 Pearson correlation when regressing environmental scores on 300 company reports.
-
The Role of Local Intrinsic Dimensionality in Benchmarking Nearest Neighbor Search
Local intrinsic dimensionality enables selection of query sets with varying difficulty for nearest neighbor search benchmarking, and common real-world datasets are not diverse as performance on one predicts others well.
-
An Empirical Comparison of FAISS and FENSHSES for Nearest Neighbor Search in Hamming Space
Empirical benchmark of FAISS (main memory) versus FENSHSES (secondary memory) on Hamming-space nearest-neighbor search across indexing speed, latency, and RAM.
-
Securing the Agent: Vendor-Neutral, Multitenant Enterprise Retrieval and Tool Use
A server-side architecture with policy-aware ingestion and ABAC-based retrieval gating prevents cross-tenant data leakage in multitenant enterprise RAG and agent systems.
-
When Agents Handle Secrets: A Survey of Confidential Computing for Agentic AI
A survey providing a taxonomy of TEE platforms, an agent-centric threat model, and open challenges for applying confidential computing to secure agentic AI systems.
-
Using predefined vector systems to speed up neural network multimillion class classification
Predefined vector systems structure neural network latent spaces to allow O(1) label prediction via index searches on embedding vectors, delivering up to 11.6x speedup on multimillion-class tasks while preserving accuracy and enabling new-class detection.
-
Product Quantization for Surface Soil Similarity
A pipeline using product quantization and systematic parameter evaluation creates data-driven soil taxonomies with higher specificity than human-derived classifications.
-
Evaluation of Chunking Strategies for Effective Text Embedding in Low-Resource Language on Agricultural Documents
Recursive character-based chunking at 300 characters outperforms Sentence-Based, Khmer-Aware, and LLM-Based methods on L2 distance, answer relevance, and Khmer IoU in a 5-fold evaluation on 18 Khmer agricultural QA pairs.
-
From Knowledge to Action: Outcomes of the 2025 Large Language Model (LLM) Hackathon for Applications in Materials Science and Chemistry
Hackathon submissions indicate LLMs are moving from general assistants toward composable multi-agent systems for structuring scientific knowledge and automating tasks in materials science and chemistry.
- MedCase-Structured: A Text-to-FHIR Dataset for Benchmarking Diagnostic Reasoning in Clinically Realistic EHR Settings