MetaSyn benchmark shows LLM pipelines recover at most 52.7% of ground-truth included studies due to screening failures on PI/ECO eligibility, despite 90.9% retrieval recall at K=200.
Passage Re-ranking with BERT
Canonical reference. 88% of citing Pith papers cite this work as background.
abstract
Recently, neural models pretrained on a language modeling task, such as ELMo (Peters et al., 2017), OpenAI GPT (Radford et al., 2018), and BERT (Devlin et al., 2018), have achieved impressive results on various natural language processing tasks such as question-answering and natural language inference. In this paper, we describe a simple re-implementation of BERT for query-based passage re-ranking. Our system is the state of the art on the TREC-CAR dataset and the top entry in the leaderboard of the MS MARCO passage retrieval task, outperforming the previous state of the art by 27% (relative) in MRR@10. The code to reproduce our results is available at https://github.com/nyu-dl/dl4marco-bert
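The abstract reports its headline result as a relative gain in MRR@10. As a rough illustration of how that metric is usually computed (mean reciprocal rank of the first relevant passage within the top 10 results per query), here is a minimal sketch. The function names `mrr_at_k` and `relative_gain`, the dictionary layout, and the toy query and passage IDs are all assumptions made for this example; they are not taken from the paper's code or from the official MS MARCO evaluation script.

```python
from typing import Dict, List, Set


def mrr_at_k(
    rankings: Dict[str, List[str]],
    relevant: Dict[str, Set[str]],
    k: int = 10,
) -> float:
    """Mean reciprocal rank of the first relevant passage within the top k.

    rankings: query id -> passage ids ordered best-first (hypothetical layout).
    relevant: query id -> set of passage ids judged relevant.
    Queries with no relevant passage in the top k contribute 0.
    """
    if not rankings:
        return 0.0
    total = 0.0
    for query_id, ranked_ids in rankings.items():
        gold = relevant.get(query_id, set())
        for rank, passage_id in enumerate(ranked_ids[:k], start=1):
            if passage_id in gold:
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(rankings)


def relative_gain(new_score: float, old_score: float) -> float:
    """Relative improvement of new_score over old_score, as a fraction."""
    return (new_score - old_score) / old_score


# Toy data, invented purely for illustration.
toy_rankings = {"q1": ["p3", "p1", "p7"], "q2": ["p9", "p2"]}
toy_relevant = {"q1": {"p1"}, "q2": {"p5"}}
toy_mrr = mrr_at_k(toy_rankings, toy_relevant, k=10)  # q1 gives 1/2, q2 gives 0
```

A "27% (relative)" improvement in this sense means `relative_gain(new, old)` is about 0.27, which is different from a 27-point absolute increase in the metric.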
representative citing papers
A benchmark and ontology-driven framework links 434 cardiovascular devices to patents at 91.6% recall, producing 6.8M high-confidence links for regulatory-IP integration.
The first SoK on LLM-based AutoPT frameworks provides a six-dimension taxonomy of agent designs and a unified empirical benchmark evaluating 15 frameworks via over 10 billion tokens and 1,500 manually reviewed logs.
A self-supervised transformer learns to unscramble Feynman integrals for online IBP reduction, delivering bounded memory use on complex two-loop topologies while matching Kira's speed on the hardest cases tested.
BEIR is a heterogeneous zero-shot benchmark showing BM25 as a robust baseline while re-ranking and late-interaction models perform best on average at higher cost, with dense and sparse models lagging in generalization.
Dense dual-encoder retrievers outperform BM25 by 9-19% absolute in top-20 passage retrieval accuracy across open-domain QA datasets and enable new state-of-the-art end-to-end QA results.
ContextNest formalizes context governance for AI agents using hash-chained documents and deterministic selectors, with experiments showing higher answer quality and perfect determinism versus standard retrieval.
An adaptive two-phase semantic filter using clustering then a hybrid proxy trained on LLM confidence achieves 1.6-2.0x speedup over prior methods at 90% accuracy on 10K document corpora.
Re-ranking retrieval candidates via a cross-encoder trained on continuous perturbation-based attribution scores improves citation faithfulness and gold-answer alignment in legal QA over semantic similarity.
DART adapts a scoring matrix at inference time via gradient updates on pseudo-labels from top/bottom documents to gain +2.1% mean NDCG@10 on six BEIR benchmarks with under 10ms added latency.
SilentRetrieval is a data poisoning attack achieving 84.6% HR@10 and 57.5% ASR-LLM on Natural Questions via coordinated beam search and trigger fusion while preserving document fluency.
Layer-wise Token Compression applies adaptive token pooling at middle transformer layers for cross-encoder rerankers, preserving MS MARCO ranking quality while raising QPS up to 25% on passages and 116% on documents, with added gains on listwise LLM rerankers and a regularizer effect for long inputs.
ZipRerank delivers state-of-the-art multimodal listwise reranking accuracy for long documents at up to 10x lower latency via early interaction and single-pass scoring.
HDRI is a six-principle eight-stage framework for hypothesis-organized LLM research featuring gap-driven iteration, traceable fact reasoning, and subject locking, realized in INFOMINER with reported gains in fact density and completeness.
Prism-Reranker models output relevance, contribution statements, and evidence passages to support agentic retrieval beyond scalar scoring.
BAGEL is a Bayesian active learning framework that uses Gaussian Processes to propagate LLM relevance signals across embedding space and guide global exploration, outperforming standard LLM reranking under identical budgets on four retrieval benchmarks.
KIRA is a unified architecture for visual RAG that reports 0.97 retrieval precision, 1.0 grounding, and 0.707 domain correctness across medical, circuit, satellite, and histopathology domains via hierarchical chunking, dual-path retrieval, and evidence-conditioned generation.
Cross-encoder reranker performance scales predictably via power laws with model size and training exposure, allowing accurate forecasts for 400M and 1B models and data-heavy compute allocation.
SPIRE presents a tree-structured retrieval method using subdocuments, paths, and dual contextualization that produces higher-quality and more diverse citations than passage-based baselines on HTML QA benchmarks.
RAG models set new state-of-the-art results on open-domain QA by retrieving Wikipedia passages and conditioning a generative model on them, while also producing more factual text than parametric baselines.
Warrant adds a query-item permission gate g_ij to attention value terms, improving primary metrics in 27 of 32 comparisons across CTDG, MTPP, RAG, STPP, and TKG tasks.
AB-RAG adaptively budgets retrieval in RAG by combining three confidence signals to decide when to stop or fetch more evidence, separating correct from incorrect answers at 57.6% vs 0% exact match on a factoid dataset.
Presents a WildChat-derived benchmark for multi-agent routing as set-valued prediction and reports that supervised methods outperform nearest-neighbor and zero-shot LLM baselines in both unconstrained accuracy and constrained cost settings.
HistoRAG embeds historiographical principles into RAG via temporal windowing, decoupled retrieval, and contestable LLM relevance judgments, evaluated on 102k Der Spiegel articles from 1950-1979.
citing papers explorer
-
From Regulatory Approvals to Patents: Cross-Domain Linking for Cardiovascular Device Traceability
A benchmark and ontology-driven framework links 434 cardiovascular devices to patents at 91.6% recall, producing 6.8M high-confidence links for regulatory-IP integration.
-
Test-Time Training for Zero-Resource Dense Retrieval Reranking
DART adapts a scoring matrix at inference time via gradient updates on pseudo-labels from top/bottom documents to gain +2.1% mean NDCG@10 on six BEIR benchmarks with under 10ms added latency.
-
Layer-wise Token Compression for Efficient Document Reranking
Layer-wise Token Compression applies adaptive token pooling at middle transformer layers for cross-encoder rerankers, preserving MS MARCO ranking quality while raising QPS up to 25% on passages and 116% on documents, with added gains on listwise LLM rerankers and a regularizer effect for long inputs.
-
Very Efficient Listwise Multimodal Reranking for Long Documents
ZipRerank delivers state-of-the-art multimodal listwise reranking accuracy for long documents at up to 10x lower latency via early interaction and single-pass scoring.
-
Bayesian Active Learning with Gaussian Processes Guided by LLM Relevance Scoring for Dense Passage Retrieval
BAGEL is a Bayesian active learning framework that uses Gaussian Processes to propagate LLM relevance signals across embedding space and guide global exploration, outperforming standard LLM reranking under identical budgets on four retrieval benchmarks.
-
Scaling Laws for Cross-Encoder Reranking
Cross-encoder reranker performance scales predictably via power laws with model size and training exposure, allowing accurate forecasts for 400M and 1B models and data-heavy compute allocation.
-
SPIRE: Structure-Preserving Interpretable Retrieval of Evidence
SPIRE presents a tree-structured retrieval method using subdocuments, paths, and dual contextualization that produces higher-quality and more diverse citations than passage-based baselines on HTML QA benchmarks.
-
CoDeR: Local Constraint-Compatible Retrieval Beyond Semantic Similarity
CoDeR augments standard topical dense retrieval with a bi-encoder compatibility scorer trained via contrastive lexical-polarity supervision to reduce early exposure to constraint-violating documents.
-
STORM: Stepwise Token Optimization with Reward-Guided Beam Search
STORM trains lexical query rewriters via reward-guided beam search that converts retrieval metrics into stepwise token signals, enabling 0.6B-8B models to rival dense retrievers on TREC, BEIR and MIRACL without index changes.
-
Interactive Multi-Turn Retrieval for Health Videos
DATR combines coarse CLIP-based retrieval with multi-turn query fusion and cross-encoder re-ranking to improve health video retrieval, supported by the new MHVRC corpus.
-
Beyond Single-Score Ranking: Facet-Aware Reranking for Controllable Diversity in Paper Recommendation
SciFACE improves facet-specific paper ranking NDCG scores by training separate cross-encoders for Background and Method similarity on 5,891 GPT-4o-mini labeled pairs, outperforming SPECTER by up to 31 points.
-
Web Retrieval-Aware Chunking (W-RAC) for Efficient and Cost-Effective Retrieval-Augmented Generation Systems
W-RAC decouples extraction from semantic planning via structured units and LLM grouping to match traditional retrieval performance at roughly 10x lower LLM token cost.
-
ProRank: Prompt Warmup via Reinforcement Learning for Small Language Models Reranking
ProRank uses RL-based prompt warmup and fine-grained scoring to train small language models that surpass LLM rerankers on BEIR.
-
Unsupervised Dense Information Retrieval with Contrastive Learning
Contrastive learning trains unsupervised dense retrievers that beat BM25 on most BEIR datasets and support cross-lingual retrieval across scripts.
-
The Crowded Embedding Space: A Mean-Field Mechanism for Emergent Marginalization in Retrieval-Augmented Agents
A mean-field analysis of embedding-space crowding shows a phase transition and Fokker-Planck dynamics that drive retrieval-augmented agents to self-organize toward exclusive service of majority interests.
-
LRanker: LLM Ranker for Massive Candidates
LRanker combines K-means candidate aggregation with graph-partitioned ensemble of query embeddings to improve LLM ranking accuracy and scalability on massive candidate pools, reporting 3-30% gains on RBench tasks up to 6.8M candidates.
-
Lost in the Evidence? Reproducing Document Position and Context Size Effects in RAG
Reproducibility study shows position and context size effects in RAG depend on topic sampling and retrieval quality, proposes calibration for stable trends, and releases code after finding discrepancies with prior industry work.
-
CALMem: Application-Layer Dual Memory for Conversational AI
CALMem delivers virtually unbounded effective context for LLM conversations via an application-layer dual memory architecture with intra-session retrieval and token-adaptive injection.
-
KG-First, LLM-Fallback: A Hybrid Microservice for Grounded Skill Search and Explanation
SkillGraph-Service builds a provenance-preserving knowledge graph from multiple competency frameworks and achieves nDCG@5 above 0.94 with sub-200 ms latency via KG-first hybrid retrieval and constrained LLM explanations.
-
Efficient Listwise Reranking with Compressed Document Representations
RRK compresses documents to multi-token embeddings for efficient listwise reranking, enabling an 8B model to achieve 3x-18x speedups over smaller models with comparable or better effectiveness.
-
Beyond Hard Negatives: The Importance of Score Distribution in Knowledge Distillation for Dense Retrieval
Stratified sampling preserving teacher score distribution outperforms hard-negative mining as a robust baseline for knowledge distillation in dense retrieval.
-
The Role of Vocabularies in Learning Sparse Representations for Ranking
Larger 100K vocabularies in SPLADE models, especially those initialized with ESPLADE pretraining, improve retrieval effectiveness after pruning compared to 32K baselines while keeping similar efficiency.
-
Query Expansion in the Age of Pre-trained and Large Language Models: A Comprehensive Survey
A comprehensive survey that organizes query expansion methods in the PLM/LLM era along four design dimensions, synthesizes application patterns, and outlines future directions.
-
Don't Retrieve, Generate: Prompting LLMs for Synthetic Training Data in Dense Retrieval
LLM-generated synthetic hard negatives for training dense retrievers consistently underperform corpus-mined negatives from BM25 and cross-encoders across 10 BEIR datasets, with non-monotonic gains from scaling the generator from 4B to 30B parameters.
-
An Iterative Utility Judgment Framework Inspired by Philosophical Relevance via LLMs
ITEM is a new iterative utility judgment loop for RAG that maps Schutz's three levels of relevance to retrieval, utility scoring, and generation, yielding measured gains on TREC DL, WebAP, GTI-NQ, and NQ.
-
DSIRM: Learning Query-Bridged Discrete Semantic Identifiers for E-commerce Relevance Modeling
DSIRM uses query-bridged contrastive quantization and generative LLMs to create relevance-aware discrete semantic identifiers, reporting +1.54% offline AUC and online lifts on Tmall production data.
-
RAG-Match: Retrieval-Augmented Knowledge Injection and Hierarchical Reasoning for Calibrated Semantic Relevance
RAG-Match is a three-stage framework for semantic relevance modeling that integrates knowledge-augmented pretraining, hierarchical reasoning alignment, and preference-based decision calibration, outperforming LLM baselines on a search benchmark.
-
SkillSelect-Serve: Budget-Controllable and QoS-Aware Skill Service Recommendation and Composition for Small LLM Agents
SkillSelect-Serve improves same-budget bundle recall and mean utility for LLM agent skill selection over fixed top-k retrieval by using structured Skill Services, a Micro-Agent Requirement Planner, and dual-granularity utility modeling on 35,353 skills and 586 queries.
-
LLM-Oriented Information Retrieval: A Denoising-First Perspective
Argues for a denoising-first paradigm in LLM-oriented information retrieval, framing challenges via a four-stage progression and providing a taxonomy of signal-to-noise optimization techniques across the pipeline.
-
FRAGATA: Semantic Retrieval of HPC Support Tickets via Hybrid RAG over 20 Years of Request Tracker History
Fragata applies hybrid RAG to enable semantic retrieval of HPC support tickets across 20 years of history, handling language differences, typos, and varied wording better than traditional keyword search.
-
RAGe: A Retrieval-Augmented Generation Evaluation Framework
RAGe is a modular evaluation framework that correlates retrieval and generation quality with hardware constraints to recommend optimal RAG components for specific datasets.
-
A Case-Driven Multi-Agent Framework for E-Commerce Search Relevance
A case-driven multi-agent system automates the full pipeline of bad-case detection, annotation, and resolution for e-commerce search relevance using Annotator, Optimizer, and User agents plus supporting components.
-
Let's measure run time! Extending the IR replicability infrastructure to include performance aspects
Position paper proposing to extend the OSIRRC replicability infrastructure with two performance benchmark scenarios, backed by a case study on neural re-ranking model runtimes.