MS MARCO: A Human Generated MAchine Reading COmprehension Dataset

Alina Stoica; Andrew McNamara; Bhaskar Mitra; Daniel Campos; Jianfeng Gao; Li Deng; Mir Rosenberg; Nick Craswell; Payal Bajaj; Rangan Majumder

arXiv: 1611.09268 · v3 · submitted 2016-11-28 · cs.CL · cs.IR

Payal Bajaj, Daniel Campos, Nick Craswell, Li Deng, Jianfeng Gao, XiaoDong Liu, Rangan Majumder, Andrew McNamara, Bhaskar Mitra, Tri Nguyen, Mir Rosenberg, Xia Song, Alina Stoica, Saurabh Tiwary, Tong Wang

Pith reviewed 2026-05-12 05:49 UTC · model grok-4.3

keywords: machine reading comprehension, question answering, dataset, search query logs, human generated answers, passage ranking

The pith

MS MARCO supplies over a million real search questions with human answers to train and test reading comprehension systems.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces MS MARCO, a dataset of 1,010,916 questions sampled from Bing search logs, each paired with a human-generated answer and passages from web documents. Unlike earlier collections that relied on synthetic or curated questions, this one draws directly from actual user queries to create a more realistic test bed. The authors define three tasks of varying difficulty: deciding whether the passages can answer a question and, if so, extracting and synthesizing the answer; generating a well-formed answer that can be understood given the question and passage context; and ranking the retrieved passages for a question. The scale and origin in live search traffic allow models to be trained and measured on the kinds of information needs people express every day. If the dataset holds up, progress on it should translate more directly to practical question-answering tools.

Core claim

MS MARCO consists of 1,010,916 anonymized questions taken from Bing search logs, each supplied with at least one human-generated answer and a set of passages extracted from retrieved web documents. Questions may admit multiple answers or none at all. The dataset is accompanied by three tasks: (1) predict answerability from the passages and extract or synthesize the answer, (2) produce a well-formed answer understandable from the question and passages alone, and (3) rank the passages by relevance to the question.
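To make the shape of one dataset entry concrete, here is a minimal Python sketch of a record and the targets each of the three tasks would draw from it. The class names, field names (`is_selected`, `well_formed_answers`), and the `task_targets` helper are hypothetical illustrations, not the released file schema, and the example query is invented.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Passage:
    text: str
    # Hypothetical flag: True if an annotator relied on this passage.
    is_selected: bool = False


@dataclass
class Example:
    query: str
    passages: List[Passage]
    # An empty list models a question with no answer; several entries
    # model a question that admits multiple answers.
    answers: List[str] = field(default_factory=list)
    # Rewritten, stand-alone answers (present only for some questions).
    well_formed_answers: List[str] = field(default_factory=list)


def task_targets(ex: Example) -> Dict[str, object]:
    """Derive one supervision target per task from a single record."""
    return {
        # Task 1: answerability, plus the answer text when one exists.
        "answerable": bool(ex.answers),
        "answers": list(ex.answers),
        # Task 2: well-formed answer generation.
        "well_formed": list(ex.well_formed_answers),
        # Task 3: passage ranking, expressed as indices of relevant passages.
        "relevant_passages": [
            i for i, p in enumerate(ex.passages) if p.is_selected
        ],
    }


example = Example(
    query="how long do tomato seeds take to sprout",  # invented query
    passages=[
        Passage("Tomato seeds usually sprout in 5 to 10 days.", True),
        Passage("Tomatoes are native to South America."),
    ],
    answers=["5 to 10 days"],
)
targets = task_targets(example)
```

A record with an empty `answers` list would yield `answerable == False`, which is how this sketch represents the "no answers at all" case.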

What carries the argument

The MS MARCO dataset of real-user questions paired with human answers and retrieved passages, which supplies both training data and evaluation targets for the three defined reading-comprehension tasks.

If this is right

Question-answering models can be trained and scored on whether their outputs match human responses to everyday search queries rather than artificial test items.
The three tasks allow separate measurement of answerability detection, answer synthesis, and passage ranking.
Systems must learn to handle queries that have no answer or admit several valid answers.
Training at this scale supports development of models whose behavior on live search traffic can be measured directly.
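As an illustration of measuring the ranking task on its own, the sketch below computes mean reciprocal rank with a cutoff. This is an assumption about how one might score the task: MRR@10 is the metric quoted by the BERT re-ranking entry in the forward citations, but nothing here confirms that the paper itself prescribes it. The function names and passage ids are hypothetical.

```python
from typing import Iterable, List, Set, Tuple


def reciprocal_rank(ranked_ids: List[str], relevant_ids: Set[str], k: int = 10) -> float:
    """Return 1/rank of the first relevant passage in the top k, else 0."""
    for rank, pid in enumerate(ranked_ids[:k], start=1):
        if pid in relevant_ids:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs: Iterable[Tuple[List[str], Set[str]]], k: int = 10) -> float:
    """Average the reciprocal rank over (ranking, relevant set) pairs."""
    runs = list(runs)
    if not runs:
        return 0.0
    return sum(reciprocal_rank(r, rel, k) for r, rel in runs) / len(runs)


runs = [
    (["p3", "p1", "p2"], {"p1"}),  # first relevant passage at rank 2
    (["p9", "p8"], {"p7"}),        # no relevant passage retrieved
]
score = mean_reciprocal_rank(runs)
```

With these two toy queries the first contributes 1/2 and the second contributes 0, so the mean is 0.25.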

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same log-sampling method could be repeated on other search engines to produce comparable datasets in additional languages or vertical domains.
The presence of both original and rewritten human answers offers a way to quantify acceptable variation in response quality.
Models that improve on MS MARCO could be tested for transfer by running them on fresh, unlabeled search logs.

Load-bearing premise

Human annotators produce answers that are accurate, complete, and representative of how ordinary people would respond to the sampled questions.

What would settle it

Independent human raters judge a random sample of the dataset answers as incomplete or incorrect on a substantial fraction of questions.
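The audit described above amounts to estimating a proportion from a random sample. One hedged way to report it is a Wilson score interval around the observed fraction of flawed answers; the choice of interval, the function name, and the counts below are illustrative assumptions, not something the paper or this review specifies.

```python
import math
from typing import Tuple


def wilson_interval(flawed: int, sample_size: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for the fraction of flawed answers.

    z = 1.96 corresponds to roughly 95% confidence.
    """
    if sample_size <= 0:
        raise ValueError("sample_size must be positive")
    if not 0 <= flawed <= sample_size:
        raise ValueError("flawed must be between 0 and sample_size")
    p = flawed / sample_size
    z2 = z * z
    denom = 1.0 + z2 / sample_size
    center = (p + z2 / (2 * sample_size)) / denom
    half = z * math.sqrt(
        p * (1 - p) / sample_size + z2 / (4 * sample_size ** 2)
    ) / denom
    return max(0.0, center - half), min(1.0, center + half)


# Hypothetical audit: raters flag 50 of 100 sampled answers.
low, high = wilson_interval(50, 100)
```

What counts as a "substantial fraction" is left open in the text; an interval like this only makes the uncertainty of the audit explicit.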

Original abstract

We introduce a large scale MAchine Reading COmprehension dataset, which we name MS MARCO. The dataset comprises of 1,010,916 anonymized questions---sampled from Bing's search query logs---each with a human generated answer and 182,669 completely human rewritten generated answers. In addition, the dataset contains 8,841,823 passages---extracted from 3,563,535 web documents retrieved by Bing---that provide the information necessary for curating the natural language answers. A question in the MS MARCO dataset may have multiple answers or no answers at all. Using this dataset, we propose three different tasks with varying levels of difficulty: (i) predict if a question is answerable given a set of context passages, and extract and synthesize the answer as a human would (ii) generate a well-formed answer (if possible) based on the context passages that can be understood with the question and passage context, and finally (iii) rank a set of retrieved passages given a question. The size of the dataset and the fact that the questions are derived from real user search queries distinguishes MS MARCO from other well-known publicly available datasets for machine reading comprehension and question-answering. We believe that the scale and the real-world nature of this dataset makes it attractive for benchmarking machine reading comprehension and question-answering models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

MS MARCO gives a million real Bing queries with human answers and passages at a scale that was new then, but the paper leaves the data construction process too thin to fully back the realism claim.

The core contribution is a dataset of over a million anonymized search questions drawn from Bing logs, each with a human-written answer and a set of retrieved passages. The authors also supply 182,669 rewritten answers and define three tasks: deciding if a question is answerable from the passages and extracting or synthesizing the answer, generating a well-formed answer, and ranking passages. That combination of real queries, human answers, allowance for multiple or zero answers, and the overall size was not available in the smaller or synthetic collections that existed at the time. Releasing it publicly was the practical step that let people start training and comparing models on something closer to actual user questions than the prior options.

Referee Report

3 major / 2 minor

Summary. The paper introduces MS MARCO, a large-scale machine reading comprehension dataset comprising 1,010,916 questions sampled from Bing search query logs, each with a human-generated answer and associated passages drawn from a pool of 8,841,823 passages extracted from 3,563,535 web documents. It defines three tasks: (i) predicting answerability and synthesizing an answer from context passages, (ii) generating a well-formed answer from passages, and (iii) ranking retrieved passages given a question. The central claim is that the dataset's scale and derivation from real user queries distinguish it from prior MRC and QA resources, making it suitable for benchmarking.

Significance. If the human annotations are shown to be high-quality and reliably grounded in the passages, the dataset would provide a valuable large-scale resource for training and evaluating models on realistic, open-domain questions that may be unanswerable or admit multiple responses, advancing MRC research beyond smaller or synthetic datasets.

major comments (3)

[Dataset description] The manuscript provides no details on the sampling procedure, anonymization steps, or filtering criteria applied to the Bing query logs when selecting the 1,010,916 questions. This information is required to evaluate whether the questions retain a natural distribution of real user intent (Abstract and dataset description section).
[Annotation and quality control] No annotation guidelines, quality control procedures, inter-annotator agreement statistics, or statistics on passage relevance/answer grounding are reported for the human-generated answers. These are load-bearing for the claim that the answers are accurate, complete, and derivable from the provided passages (Abstract).
[Abstract] The paper states that questions 'may have multiple answers or no answers at all' but supplies no empirical breakdown of answerable vs. unanswerable cases or passage sufficiency rates, leaving the asserted realism advantage over prior datasets unsubstantiated.

minor comments (2)

[Title] The title acronym expansion contains inconsistent capitalization ('MAchine Reading COmprehension').
[Abstract] The abstract uses the nonstandard phrasing 'comprises of'; standard usage is 'comprises'.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive review. The comments highlight important areas where the manuscript can be strengthened for clarity and completeness. We address each major comment below and will revise the manuscript accordingly.

Referee: [Dataset description] The manuscript provides no details on the sampling procedure, anonymization steps, or filtering criteria applied to the Bing query logs when selecting the 1,010,916 questions. This information is required to evaluate whether the questions retain a natural distribution of real user intent (Abstract and dataset description section).

Authors: We agree that the current description is insufficient. In the revised manuscript, we will add a dedicated subsection on data collection that details the sampling procedure from Bing search query logs, the anonymization steps taken to protect user privacy, and the filtering criteria applied to arrive at the final set of 1,010,916 questions. This will allow readers to assess how well the questions reflect natural user intent.
Revision: yes

Referee: [Annotation and quality control] No annotation guidelines, quality control procedures, inter-annotator agreement statistics, or statistics on passage relevance/answer grounding are reported for the human-generated answers. These are load-bearing for the claim that the answers are accurate, complete, and derivable from the provided passages (Abstract).

Authors: We acknowledge the omission. The revised version will include the annotation guidelines given to workers, the quality control procedures (including review and validation steps), and statistics on passage relevance and answer grounding. We note that formal inter-annotator agreement was not computed during the original annotation process; we will instead describe the single-annotator-per-question workflow with post-hoc quality checks and discuss this as a limitation.
Revision: partial

Referee: [Abstract] The paper states that questions 'may have multiple answers or no answers at all' but supplies no empirical breakdown of answerable vs. unanswerable cases or passage sufficiency rates, leaving the asserted realism advantage over prior datasets unsubstantiated.

Authors: We agree that empirical statistics are needed to support this claim. We will add to the abstract and dataset section the observed proportions of answerable questions, questions with multiple valid answers, unanswerable questions, and cases where the provided passages are insufficient. These figures are derivable from the existing annotations and will be reported to substantiate the realism advantage.
Revision: yes
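The breakdown requested in the last exchange could be tallied along the following lines. This is a sketch under the assumption that each question's annotations are available as a plain list of answer strings; the function name and the three category labels are hypothetical, and passage sufficiency would need an extra signal that is not modelled here.

```python
from collections import Counter
from typing import Dict, List


def answer_breakdown(answer_lists: List[List[str]]) -> Dict[str, float]:
    """Return the fraction of questions with zero, one, or several answers."""
    if not answer_lists:
        return {"unanswerable": 0.0, "single": 0.0, "multiple": 0.0}
    counts = Counter()
    for answers in answer_lists:
        if len(answers) == 0:
            counts["unanswerable"] += 1
        elif len(answers) == 1:
            counts["single"] += 1
        else:
            counts["multiple"] += 1
    total = len(answer_lists)
    return {
        key: counts[key] / total
        for key in ("unanswerable", "single", "multiple")
    }


# Toy annotations for four questions (invented for illustration).
breakdown = answer_breakdown([[], ["a"], ["a", "b"], ["c"]])
```

On the toy input, one of four questions has no answer, two have one answer, and one has several, giving fractions of 0.25, 0.5, and 0.25.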

Circularity Check

0 steps flagged

No circularity: dataset construction paper with no derivations

The paper introduces the MS MARCO dataset by describing its construction from Bing query logs, human-generated answers, and retrieved passages. It contains no equations, predictions, fitted parameters, or first-principles derivations that could reduce to inputs by construction. The central claim (distinguishing scale and real-world queries) is a descriptive statement about the data resource itself, not a result derived from prior fitted quantities or self-citations. No load-bearing steps match any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is a dataset-construction paper with no mathematical derivations, fitted parameters, or postulated entities. No free parameters, axioms, or invented entities are required to support the central claim.

pith-pipeline@v0.9.0 · 5592 in / 1097 out tokens · 59967 ms · 2026-05-12T05:49:30.065418+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith.Foundation.LawOfExistence · defect_zero_iff_one · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Paper passage: "A question in the MS MARCO dataset may have multiple answers or no answers at all."

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Passage Re-ranking with BERT
cs.IR 2019-01 unverdicted novelty 8.0

Fine-tuning BERT for query-passage relevance classification achieves state-of-the-art results on TREC-CAR and MS MARCO, with a 27% relative gain in MRR@10 over prior methods.
TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension
cs.CL 2017-05 accept novelty 8.0

TriviaQA is a new large-scale dataset for reading comprehension that features complex compositional questions, high lexical variability, and cross-sentence reasoning requirements, where current baselines reach only 40...
Layer-wise Token Compression for Efficient Document Reranking
cs.IR 2026-05 unverdicted novelty 7.0

Layer-wise Token Compression applies adaptive token pooling at middle transformer layers for cross-encoder rerankers, preserving MS MARCO ranking quality while raising QPS up to 25% on passages and 116% on documents, ...
Layer-wise Token Compression for Efficient Document Reranking
cs.IR 2026-05 conditional novelty 7.0

Layer-wise Token Compression applies adaptive pooling at middle transformer layers to increase QPS by up to 116% on document ranking with little or no loss in quality.
The Extrapolation Cliff in On-Policy Distillation of Near-Deterministic Structured Outputs
cs.LG 2026-05 unverdicted novelty 7.0

On-policy distillation has an extrapolation cliff at closed-form lambda*(p,b,c) set by teacher modal probability, warm-start mass, and clip strength, past which training shifts from format-preserving to format-collapsing.
DiffRetriever: Parallel Representative Tokens for Retrieval with Diffusion Language Models
cs.IR 2026-05 unverdicted novelty 7.0

DiffRetriever generates multiple representative tokens in parallel using diffusion language models, yielding consistent retrieval gains over single-token baselines and autoregressive multi-token variants on BEIR benchmarks.
EnterpriseRAG-Bench: A RAG Benchmark for Company Internal Knowledge
cs.IR 2026-05 conditional novelty 7.0

EnterpriseRAG-Bench supplies a synthetic corpus of 500,000 documents across Slack, Gmail, GitHub and other tools plus 500 questions that probe lookup, multi-document reasoning, conflict resolution and absence detection.
Rethinking Reasoning-Intensive Retrieval: Evaluating and Advancing Retrievers in Agentic Search Systems
cs.CL 2026-05 unverdicted novelty 7.0

BRIGHT-Pro and RTriever-Synth advance reasoning-intensive retrieval by adding multi-aspect evidence evaluation and aspect-decomposed synthetic training, with the fine-tuned RTriever-4B showing gains over its base model.
Why Mean Pooling Works: Quantifying Second-Order Collapse in Text Embeddings
cs.CL 2026-04 unverdicted novelty 7.0

Modern text encoders resist second-order collapse under mean pooling because token embeddings concentrate tightly within texts, and this resistance correlates with stronger downstream performance.
UnIte: Uncertainty-based Iterative Document Sampling for Domain Adaptation in Information Retrieval
cs.IR 2026-04 unverdicted novelty 7.0

UnIte selects target-domain documents for pseudo-query generation by filtering high aleatoric uncertainty and prioritizing high epistemic uncertainty, yielding +2.45 to +3.49 nDCG@10 gains on BEIR with ~4k samples.
A Parametric Memory Head for Continual Generative Retrieval
cs.IR 2026-04 unverdicted novelty 7.0

A product-key parametric memory head with selective sparse updates mitigates catastrophic forgetting in generative retrieval models during sequential addition of new documents.
On the Robustness of LLM-Based Dense Retrievers: A Systematic Analysis of Generalizability and Stability
cs.IR 2026-04 unverdicted novelty 7.0

LLM-based dense retrievers generalize better when instruction-tuned but pay a specialization tax when optimized for reasoning; they resist typos and corpus poisoning better than encoder-only baselines yet remain vulne...
AdversarialCoT: Single-Document Retrieval Poisoning for LLM Reasoning
cs.IR 2026-04 unverdicted novelty 7.0

A single query-specific poisoned document, built by extracting and iteratively refining an adversarial chain-of-thought, can substantially degrade reasoning accuracy in retrieval-augmented LLM systems.
Can You Trust the Vectors in Your Vector Database? Black-Hole Attack from Embedding Space Defects
cs.CR 2026-04 unverdicted novelty 7.0

Injecting a few malicious vectors near the centroid exploits centrality-driven hubness in high-dimensional embeddings, causing them to dominate top-k retrievals in up to 99.85% of cases.
Spectral Tempering for Embedding Compression in Dense Passage Retrieval
cs.IR 2026-03 unverdicted novelty 7.0

Spectral Tempering derives an adaptive scaling factor γ(k) from the embedding eigenspectrum via local SNR analysis and knee-point normalization to achieve near-optimal compression without training or validation.
ScrapeGraphAI-100k: Dataset for Schema-Constrained LLM Generation
cs.IR 2026-02 unverdicted novelty 7.0

ScrapeGraphAI-100k releases 93,695 real telemetry examples pairing web page content with prompts, schemas, and LLM responses to support training and benchmarking of schema-constrained generation.
SQuTR: A Robustness Benchmark for Spoken Query to Text Retrieval under Acoustic Noise
cs.IR 2026-02 unverdicted novelty 7.0

SQuTR aggregates 37k queries from six text retrieval datasets, synthesizes speech from 200 speakers, adds 17 noise categories at varying SNR, and shows that even large retrieval models degrade sharply under extreme ac...
From Standalone LLMs to Integrated Intelligence: A Survey of Compound AI Systems
cs.MA 2025-06 accept novelty 7.0

A survey that defines Compound AI Systems, proposes a multi-dimensional taxonomy based on component roles and orchestration strategies, reviews four foundational paradigms, and identifies key challenges for future research.
GAIA: a benchmark for General AI Assistants
cs.CL 2023-11 unverdicted novelty 7.0

GAIA benchmark shows humans at 92% accuracy on simple real-world questions far outperform current AI systems at 15%, proposing this gap as a key milestone for general AI.
Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks
cs.CL 2020-05 accept novelty 7.0

RAG models set new state-of-the-art results on open-domain QA by retrieving Wikipedia passages and conditioning a generative model on them, while also producing more factual text than parametric baselines.
BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions
cs.CL 2019-05 accept novelty 7.0

BoolQ introduces naturally occurring yes/no questions as a challenging benchmark where BERT fine-tuned on MultiNLI reaches 80.4% accuracy against 90% human performance.
TubiFM: Unified Item, Carousel, and Search Ranking for Streaming Discovery
cs.IR 2026-05 unverdicted novelty 6.0

A Llama-based model trained on serialized user stories unifies item, carousel, and search ranking and outperforms specialist baselines offline while improving some online metrics and reducing latency.
NasZip: Software and Hardware Co-Design to Accelerate Approximate Nearest Neighbor Search with DIMM-Based Near-Data Processing
cs.AR 2026-05 conditional novelty 6.0

NasZip delivers up to 8.4x speedup over CPU baselines and 1.69x over prior NDP accelerators for ANNS by combining near-data processing with statistics-based PCA early exiting, dynamic-float encoding, and data-aware ne...
A Multi-Agent Framework for Feature-Constrained Difficulty Control in Reading Comprehension Item Generation
cs.CL 2026-05 unverdicted novelty 6.0

MAFIG is a multi-agent framework that uses LLM agents and evaluators to generate reading comprehension items with significantly higher adherence to specified feature constraints than single-agent baselines.
Dual Hierarchical Dialogue Policy Learning for Legal Inquisitive Conversational Agents
cs.CL 2026-05 unverdicted novelty 6.0

A dual hierarchical RL framework lets agents learn when and how to ask probing questions in U.S. Supreme Court arguments, outperforming baselines on a court dataset.
Dual Hierarchical Dialogue Policy Learning for Legal Inquisitive Conversational Agents
cs.CL 2026-05 unverdicted novelty 6.0

A dual hierarchical RL framework with two agents coordinates high-level dialogue strategy and low-level question generation to emulate judicial questioning and extract key information from Supreme Court arguments, out...
EnterpriseRAG-Bench: A RAG Benchmark for Company Internal Knowledge
cs.IR 2026-05 unverdicted novelty 6.0

EnterpriseRAG-Bench supplies a synthetic corpus of 500k documents across Slack, Gmail, Linear, Google Drive, HubSpot, Fireflies, GitHub, Jira and Confluence together with 500 questions spanning single-document lookup ...
Reproducing Complex Set-Compositional Information Retrieval
cs.CL 2026-05 unverdicted novelty 6.0

Neural retrievers that double BM25 performance on QUEST collapse below 0.02 Recall@100 on the new LIMIT+ benchmark while lexical methods reach 0.96, with all methods degrading as compositional depth increases.
NuggetIndex: Governed Atomic Retrieval for Maintainable RAG
cs.IR 2026-04 unverdicted novelty 6.0

NuggetIndex manages atomic nuggets with temporal validity and lifecycle metadata to filter outdated information before ranking, yielding 42% higher nugget recall, 9pp better temporal correctness, and 55% fewer conflic...
RAQG-QPP: Query Performance Prediction with Retrieved Query Variants and Retrieval Augmented Query Generation
cs.IR 2026-04 unverdicted novelty 6.0

Retrieved query variants from logs combined with LLM-augmented generation improve unsupervised QPP accuracy by up to 30% for neural rankers on TREC DL'19 and DL'20.
JigsawRL: Assembling RL Pipelines for Efficient LLM Post-Training
cs.LG 2026-04 unverdicted novelty 6.0

JigsawRL achieves up to 1.85x higher throughput in LLM RL pipelines via pipeline multiplexing, sub-stage graphs, and look-ahead scheduling compared to prior systems.
From Tokens to Concepts: Leveraging SAE for SPLADE
cs.IR 2026-04 unverdicted novelty 6.0

SAE-SPLADE substitutes SPLADE's backbone vocabulary with SAE-derived semantic concepts and matches standard SPLADE performance with better efficiency on in- and out-of-domain tasks.
ORPHEAS: A Cross-Lingual Greek-English Embedding Model for Retrieval-Augmented Generation
cs.CL 2026-04 unverdicted novelty 6.0

ORPHEAS, a Greek-English embedding model created with knowledge graph fine-tuning, outperforms state-of-the-art multilingual models on monolingual and cross-lingual retrieval benchmarks.
Stochastic KV Routing: Enabling Adaptive Depth-Wise Cache Sharing
cs.LG 2026-04 unverdicted novelty 6.0

Stochastic training with random cross-layer KV attention enables depth-wise cache sharing in transformers, cutting memory footprint while preserving or improving performance.
A Voronoi Cell Formulation for Principled Token Pruning in Late-Interaction Retrieval Models
cs.IR 2026-03 unverdicted novelty 6.0

A Voronoi cell estimation framework in embedding space enables principled token pruning for late-interaction models, reducing index size while retaining retrieval quality.
MSA: Memory Sparse Attention for Efficient End-to-End Memory Model Scaling to 100M Tokens
cs.CL 2026-03 unverdicted novelty 6.0

MSA is an end-to-end trainable memory model using sparse attention and document-wise RoPE that scales to 100M tokens with linear complexity and less than 9% degradation.
Towards Efficient and Generalizable Retrieval: Adaptive Semantic Quantization and Residual Knowledge Transfer
cs.IR 2026-02 unverdicted novelty 6.0

SA²CRQ uses sequential adaptive residual quantization based on path entropy plus anchored curriculum regularization from head items to improve both efficiency and cold-start performance in generative retrieval.
When LLM Judges Inflate Scores: Exploring Overrating in Relevance Assessment
cs.IR 2026-02 unverdicted novelty 6.0

LLMs consistently overrate relevance of inadequate passages in IR evaluations due to biases toward length and lexical features rather than true content match.
LLM Prompt Duel Optimizer: Efficient Label-Free Prompt Optimization
cs.CL 2025-10 unverdicted novelty 6.0

Prompt Duel Optimizer uses dueling bandits and LLM-as-judge pairwise feedback with Double Thompson Sampling and top-performer mutation to find stronger prompts than label-free baselines on BBH and MS MARCO under limit...
LEAF: Knowledge Distillation of Text Embedding Models with Teacher-Aligned Representations
cs.IR 2025-09 conditional novelty 6.0

LEAF distills teacher-aligned student embedding models that achieve new SOTA results on BEIR and MTEB for their size class while requiring only modest data and compute.
Learning from Natural Language Feedback for Personalized Question Answering
cs.CL 2025-08 unverdicted novelty 6.0

VAC replaces scalar rewards with natural language feedback in an alternating training loop between a feedback model and a policy model, yielding better personalized QA on the LaMP-QA benchmark.
Should We Still Pretrain Encoders with Masked Language Modeling?
cs.CL 2025-07 accept novelty 6.0

Controlled ablations of 38 models find MLM superior to CLM on representation benchmarks while CLM offers better data efficiency and stability; a biphasic CLM-then-MLM schedule is optimal under fixed compute and improv...
RankFlow: A Multi-Role Collaborative Reranking Workflow Utilizing Large Language Models
cs.IR 2025-02 unverdicted novelty 6.0

RankFlow deploys four LLM roles in sequence to rewrite queries, generate pseudo-answers, summarize passages, and rerank candidates, outperforming prior methods on TREC-DL, BEIR, and NovelEval.
NV-Embed: Improved Techniques for Training LLMs as Generalist Embedding Models
cs.CL 2024-05 accept novelty 6.0

NV-Embed achieves first place on the MTEB leaderboard across 56 tasks by combining a latent attention layer, causal-mask removal, two-stage contrastive training, and data curation for LLM-based embedding models.
RankZephyr: Effective and Robust Zero-Shot Listwise Reranking is a Breeze!
cs.IR 2023-12 conditional novelty 6.0

RankZephyr is a new open-source LLM that closes the effectiveness gap with GPT-4 for zero-shot listwise reranking while showing robustness to input ordering and document count.
REPLUG: Retrieval-Augmented Black-Box Language Models
cs.CL 2023-01 conditional novelty 6.0

REPLUG improves frozen black-box LMs by prepending LM-supervised retrieved documents, delivering 6.3% better language modeling on GPT-3 and 5.1% better five-shot MMLU on Codex.
Atlas: Few-shot Learning with Retrieval Augmented Language Models
cs.CL 2022-08 unverdicted novelty 6.0

Atlas reaches over 42% accuracy on Natural Questions with only 64 examples, outperforming a 540B-parameter model by 3% with 50x fewer parameters.
Text and Code Embeddings by Contrastive Pre-Training
cs.CL 2022-01 unverdicted novelty 6.0

Contrastive pre-training on unsupervised data at scale creates text and code embeddings that set new state-of-the-art results on classification and semantic search benchmarks.
Unsupervised Dense Information Retrieval with Contrastive Learning
cs.IR 2021-12 unverdicted novelty 6.0

Contrastive learning trains unsupervised dense retrievers that beat BM25 on most BEIR datasets and support cross-lingual retrieval across scripts.
NAVIS: Concurrent Search and Update with Low Position-Seeking Overhead in On-SSD Graph-Based Vector Search
cs.DC 2026-05 unverdicted novelty 5.0

NAVIS improves concurrent search and update throughput in on-SSD graph vector search by up to 2.74x for insertions and 1.37x for searches through reduced position-seeking overhead.
Superintelligent Retrieval Agent: The Next Frontier of Information Retrieval
cs.IR 2026-05 unverdicted novelty 5.0

SIRA compresses multi-round exploratory retrieval into one LLM-guided, corpus-statistic-validated weighted BM25 query and reports superior results over dense retrievers and agentic baselines on BEIR benchmarks.
Gyan: An Explainable Neuro-Symbolic Language Model
cs.CL 2026-05 unverdicted novelty 5.0

Gyan is a novel explainable neuro-symbolic language model that decouples language modeling from knowledge representation using rhetorical and semantic theories and reports superior performance on multiple datasets.
Efficient Listwise Reranking with Compressed Document Representations
cs.IR 2026-04 unverdicted novelty 5.0

RRK compresses documents to multi-token embeddings for efficient listwise reranking, enabling an 8B model to achieve 3x-18x speedups over smaller models with comparable or better effectiveness.
RefineRAG: Word-Level Poisoning Attacks via Retriever-Guided Text Refinement
cs.CR 2026-04 unverdicted novelty 5.0

RefineRAG achieves 90% attack success on NQ by generating toxic seeds then optimizing them via retriever-in-the-loop word refinement, outperforming prior methods on effectiveness and naturalness.
GaiaFlow: Semantic-Guided Diffusion Tuning for Carbon-Frugal Search
cs.IR 2026-02 unverdicted novelty 5.0

GaiaFlow combines semantic-guided diffusion tuning with early-exit and quantization methods to lower carbon emissions in neural information retrieval while maintaining competitive effectiveness.
Do Activation Verbalization Methods Convey Privileged Information?
cs.CL 2025-09 unverdicted novelty 5.0

Activation verbalization methods for LLMs largely reflect the verbalizer model's parametric knowledge rather than privileged information from the target model's activations.
Don't Retrieve, Generate: Prompting LLMs for Synthetic Training Data in Dense Retrieval
cs.IR 2025-04 unverdicted novelty 5.0

LLM-generated synthetic hard negatives for training dense retrievers consistently underperform corpus-mined negatives from BM25 and cross-encoders across 10 BEIR datasets, with non-monotonic gains from scaling the gen...
Humanity's Last Exam
cs.LG 2025-01 unverdicted novelty 5.0

Humanity's Last Exam is a new 2,500-question benchmark at the frontier of human knowledge where state-of-the-art LLMs show low accuracy.
Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference
cs.CL 2024-12 unverdicted novelty 5.0

ModernBERT is a new bidirectional encoder model achieving SOTA performance on diverse classification and retrieval benchmarks while offering superior speed and memory efficiency for long-context inference.
Multilingual E5 Text Embeddings: A Technical Report
cs.CL 2024-02 unverdicted novelty 5.0

Open-source multilingual E5 embedding models are trained via contrastive pre-training on 1 billion text pairs and fine-tuning, with an instruction-tuned model matching English SOTA performance.

Reference graph

Works this paper leans on

16 extracted references · 16 canonical work pages · cited by 65 Pith papers · 3 internal anchors

[1]

Neural Machine Translation by Jointly Learning to Align and Translate

D. Bahdanau, K. Cho, and Y. Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473,

[3]

URL http://arxiv.org/abs/1601.06733. C. Clark and M. Gardner. Simple and effective multi-paragraph reading comprehension. CoRR, abs/1710.10723,

[4]

URL http://arxiv.org/abs/1710.10723. P. Clark, I. Cowhey, O. Etzioni, T. Khot, A. Sabharwal, C. Schoenick, and O. Tafjord. Think you have solved question answering? try arc, the ai2 reasoning challenge

[5]

M. Dunn, L. Sagun, M. Higgins, V. U. Güney, V. Cirik, and K. Cho. Searchqa: A new q&a dataset augmented with context from a search engine. CoRR, abs/1704.05179,

[6]

B. H. Frank. Google brain chief: Deep learning takes at least 100,000 examples. https://venturebeat.com/2017/10/23/google-brain-chief-says-100000-examples-is-enough-data-for-deep-learning/,

work page 2017
[7]

J. Gao, M. Galley, and L. Li. Neural approaches to conversational ai. arXiv preprint arXiv:1809.08267,

[8]

URL https://arxiv.org/abs/1512.03385. W. He, K. Liu, Y. Lyu, S. Zhao, X. Xiao, Y. Liu, Y. Wang, H. Wu, Q. She, X. Liu, T. Wu, and H. Wang. Dureader: a chinese machine reading comprehension dataset from real-world applications. CoRR, abs/1711.05073,

[9]

K. M. Hermann, T. Kociský, E. Grefenstette, L. Espeholt, W. Kay, M. Suleyman, and P. Blunsom. Teaching machines to read and comprehend. 2015a. URL https://arxiv.org/abs/1506.03340. K. M. Hermann, T. Kocisky, E. Grefenstette, L. Espeholt, W. Kay, M. Suleyman, and P. Blunsom. Teaching machines to read and comprehend. In Advances in Neural Information Proces...

[10]

R. Kadlec, M. Schmid, O. Bajgar, and J. Kleindienst. Text understanding with the attention sum reader network. arXiv preprint arXiv:1603.01547,

[11]

The NarrativeQA Reading Comprehension Challenge

T. Kociský, J. Schwarz, P. Blunsom, C. Dyer, K. M. Hermann, G. Melis, and E. Grefenstette. The narrativeqa reading comprehension challenge. CoRR, abs/1712.07040,

[12]

URL https://arxiv.org/abs/1606.05250. P. Rajpurkar, R. Jia, and P. Liang. Know what you don’t know: Unanswerable questions for squad. arXiv preprint arXiv:1806.03822,

[13]

M. J. Seo, A. Kembhavi, A. Farhadi, and H. Hajishirzi. Bidirectional attention flow for machine comprehension. CoRR, abs/1611.01603,

[14]

Y. Shen, P.-S. Huang, J. Gao, and W. Chen. Reasonet: Learning to stop reading in machine comprehension. arXiv preprint arXiv:1609.05284,

[16]

URL http://arxiv.org/abs/1409.3215. A. Trischler, T. Wang, X. Yuan, J. Harris, A. Sordoni, P. Bachman, and K. Suleman. Newsqa: A machine comprehension dataset. In Rep4NLP@ACL,

[17]

URL https://arxiv.org/abs/1502.05698. A. Wissner-Gross. Datasets over algorithms. Edge.com. Retrieved, 8,

[18]

ReCoRD: Bridging the Gap between Human and Machine Commonsense Reading Comprehension

S. Zhang, X. Liu, J. Liu, J. Gao, K. Duh, and B. Van Durme. Record: Bridging the gap between human and machine commonsense reading comprehension. arXiv preprint arXiv:1810.12885,

