HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering

Christopher D. Manning; Peng Qi; Ruslan Salakhutdinov; Saizheng Zhang; William W. Cohen; Yoshua Bengio; Zhilin Yang

arxiv: 1809.09600 · v1 · submitted 2018-09-25 · 💻 cs.CL

Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William W. Cohen, Ruslan Salakhutdinov, Christopher D. Manning

Pith reviewed 2026-05-12 04:59 UTC · model grok-4.3

classification 💻 cs.CL

keywords HotpotQA · multi-hop question answering · explainable QA · supporting facts · Wikipedia dataset · comparison questions · QA benchmark

The pith

HotpotQA introduces 113k Wikipedia questions that require multi-hop reasoning across documents along with sentence-level supporting facts for explanations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Existing QA datasets do not train systems for complex reasoning or to explain their answers. This paper presents HotpotQA, a large collection of questions based on Wikipedia articles that necessitate retrieving and reasoning over multiple documents. The dataset includes annotations for the specific sentences used in reasoning, enabling supervised training for explainability. It also features comparison questions that test fact extraction and comparison abilities. If successful, this would allow QA systems to handle more realistic, complex queries with transparent reasoning processes.

Core claim

HotpotQA provides 113k question-answer pairs from Wikipedia that demand finding and reasoning over multiple documents, include diverse questions not tied to schemas, supply sentence-level supporting facts, and introduce factoid comparison questions to test fact extraction and comparison. The supporting facts enable models to improve performance and make explainable predictions.

What carries the argument

The HotpotQA dataset itself: its sentence-level supporting-fact annotations provide strong supervision for multi-hop reasoning and explainability.

Load-bearing premise

That the questions genuinely require multi-hop reasoning over multiple documents rather than being answerable from single documents or surface patterns, and that the sentence-level supporting fact annotations are accurate and complete.

What would settle it

A demonstration that current QA models can answer most HotpotQA questions correctly by processing only a single document or without using the supporting facts annotations.
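The test described above can be sketched as a small audit: for each question, check whether some single document alone yields the gold answer. This is a minimal sketch, not a procedure from the paper. The `documents` field, the `answer_fn` reader interface, and the toy reader are hypothetical stand-ins for a real QA model and data loader.

```python
def normalize(text):
    """Lowercase and collapse whitespace for a lenient string comparison."""
    return " ".join(text.lower().split())


def single_document_rate(examples, answer_fn):
    """Fraction of examples answered correctly from at least one single document.

    `examples` is a list of dicts with hypothetical keys "question",
    "answer", and "documents" (a list of document strings).
    `answer_fn(question, document)` is any reader that returns a string.
    """
    if not examples:
        return 0.0
    hits = 0
    for ex in examples:
        gold = normalize(ex["answer"])
        # A question counts as single-hop leakage if any one document suffices.
        if any(normalize(answer_fn(ex["question"], doc)) == gold
               for doc in ex["documents"]):
            hits += 1
    return hits / len(examples)


def toy_reader(question, document):
    """Toy stand-in for a QA model: returns the last word of the document."""
    return document.split()[-1].strip(".")


# Invented examples, for illustration only.
examples = [
    {
        "question": "Where was the author of Example Novel born?",
        "answer": "Springfield",
        "documents": [
            "Example Novel was written by Jane Doe.",
            "Jane Doe was born in Springfield.",
        ],
    },
    {
        "question": "Is River A longer than River B?",
        "answer": "yes",
        "documents": ["River A is 100 km long.", "River B is 80 km long."],
    },
]

rate = single_document_rate(examples, toy_reader)
```

A high rate under a strong reader would count against the multi-hop premise. A low rate is only weak evidence for it, since a model may still exploit surface patterns across documents.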

Original abstract

Existing question answering (QA) datasets fail to train QA systems to perform complex reasoning and provide explanations for answers. We introduce HotpotQA, a new dataset with 113k Wikipedia-based question-answer pairs with four key features: (1) the questions require finding and reasoning over multiple supporting documents to answer; (2) the questions are diverse and not constrained to any pre-existing knowledge bases or knowledge schemas; (3) we provide sentence-level supporting facts required for reasoning, allowing QA systems to reason with strong supervision and explain the predictions; (4) we offer a new type of factoid comparison questions to test QA systems' ability to extract relevant facts and perform necessary comparison. We show that HotpotQA is challenging for the latest QA systems, and the supporting facts enable models to improve performance and make explainable predictions.
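To make feature (3) concrete, the sketch below shows one way a question with sentence-level supporting facts could be represented and resolved to the sentences it points at. The field names (`context`, `supporting_facts`, `type`) are an assumption about the released JSON layout, and the record content is invented for illustration.

```python
# Hypothetical record: field names are assumed, content is invented.
record = {
    "question": "Where was the author of Example Novel born?",
    "answer": "Springfield",
    "type": "bridge",
    # Each context entry pairs a page title with its list of sentences.
    "context": [
        ["Example Novel", ["Example Novel is a 1990 book.",
                           "It was written by Jane Doe."]],
        ["Jane Doe", ["Jane Doe is a writer.",
                      "She was born in Springfield."]],
        ["Unrelated Page", ["This paragraph is a distractor."]],
    ],
    # Each supporting fact is a (page title, sentence index) pair.
    "supporting_facts": [["Example Novel", 1], ["Jane Doe", 1]],
}


def supporting_sentences(rec):
    """Resolve (title, sentence index) pairs to the sentences they reference."""
    by_title = {title: sents for title, sents in rec["context"]}
    resolved = []
    for title, idx in rec["supporting_facts"]:
        sents = by_title.get(title, [])
        if 0 <= idx < len(sents):  # skip pointers that fall outside the page
            resolved.append((title, sents[idx]))
    return resolved


def num_supporting_documents(rec):
    """Count distinct pages that contribute at least one supporting fact."""
    return len({title for title, _ in rec["supporting_facts"]})


facts = supporting_sentences(record)
hops = num_supporting_documents(record)
```

In this toy record the two supporting sentences come from two different pages, which is the property the multi-hop claim depends on.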

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

HotpotQA gives the field a useful new benchmark for multi-hop QA over Wikipedia with sentence-level facts and comparison questions, though its value rests on how strictly the construction enforced genuine multi-hop cases.

The main takeaway is that this paper releases HotpotQA, a 113k-question dataset built on Wikipedia that targets multi-hop reasoning, supplies sentence-level supporting facts, stays free of KB constraints, and adds comparison questions. It also runs baselines showing that recent models struggle and that the facts help both accuracy and explainability. That combination is new relative to SQuAD-style single-hop sets and earlier multi-hop efforts tied to KBs.

The authors did solid work scaling the collection with crowdsourcing, defining clear question types, and releasing the data with enough structure for others to use right away. The reported gains from adding the supporting-fact supervision are concrete and worth having on record.

The soft spots sit mainly in the validation of the core claims. The paper needs to show stronger evidence that most questions cannot be answered from a single document or via shallow patterns, and that the supporting-fact annotations are both complete and free of systematic bias. Details on filtering steps, agreement rates, and any post-hoc checks for single-hop leakage would help. The results section could also include more targeted ablations on whether models are truly using the facts for reasoning or just as extra signals. These are standard concerns for crowdsourced QA data rather than fatal problems.

Readers working on QA architectures, explainability, or evaluation benchmarks will get immediate value from the dataset and the baseline numbers. The work is coherent on its own terms and engages the existing literature directly, so it deserves a serious referee. I would send it out for review; the community can use the resource and the discussion will likely tighten the construction details.

Referee Report

2 major / 2 minor

Summary. The paper introduces HotpotQA, a dataset of 113k Wikipedia-based QA pairs designed to require multi-hop reasoning over multiple documents. It features sentence-level supporting fact annotations for explainability, diverse questions unconstrained by KBs, and a new category of comparison questions. The authors claim current QA systems find it challenging and that access to supporting facts improves performance while enabling explainable predictions.

Significance. If the construction process robustly enforces genuine multi-hop requirements and produces accurate, complete supporting-fact labels, the dataset would be a significant contribution by providing strong supervision for reasoning and explainability in QA, addressing gaps in prior single-hop or schema-constrained datasets.

major comments (2)

[§3] §3 (Data Collection): The crowdsourcing pipeline for bridge and comparison questions is described at a high level, but no quantitative validation (e.g., percentage of questions answerable from a single paragraph or document) is provided to confirm that the multi-hop requirement is enforced and that surface-pattern shortcuts are filtered; this is load-bearing for the central claim that questions require reasoning over multiple supporting documents.
[§4.3] §4.3 (Experiments with Supporting Facts): Performance gains are reported when models use the provided sentence-level facts, yet there is no analysis of annotation completeness (e.g., whether all necessary sentences are labeled or if relevant ones are missed) or inter-annotator agreement; without this, the reliability of the 'strong supervision' and the source of the observed improvements remain unclear.
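The completeness concern in the second major comment could be probed by re-annotating a small audit sample and comparing it with the released labels as sets of (title, sentence index) pairs. The following is a minimal sketch under that assumption. The function name, the pair representation, and the example labels are illustrative and do not come from the paper.

```python
def supporting_fact_scores(released, audit):
    """Set-overlap scores between released labels and an audit re-annotation.

    Both arguments are iterables of (title, sentence_index) pairs.
    Recall below 1.0 means the released labels miss sentences the audit
    marked as necessary; precision below 1.0 means they include extras.
    """
    released_set = {tuple(p) for p in released}
    audit_set = {tuple(p) for p in audit}
    overlap = len(released_set & audit_set)
    precision = overlap / len(released_set) if released_set else 0.0
    recall = overlap / len(audit_set) if audit_set else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    exact = 1.0 if released_set == audit_set else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "exact": exact}


# Invented labels for one question, for illustration only.
released = [("Jane Doe", 1), ("Jane Doe", 0)]
audit = [("Jane Doe", 1), ("Example Novel", 1)]

scores = supporting_fact_scores(released, audit)
```

Averaging these scores over an audit sample would give the kind of completeness figure the referee asks for, with the caveat that the audit annotator is treated as the reference here.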

minor comments (2)

[Abstract] The abstract states the four key features but omits any quantitative results (e.g., model accuracies or dataset statistics beyond the total size), which would help readers immediately assess the claims.
[Table 1] Table 1 or dataset statistics section: Clarify the exact split between bridge and comparison questions and report any filtering rates from the validation stage to improve reproducibility.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for their constructive feedback on our manuscript introducing HotpotQA. We address each major comment below, indicating where we will revise the paper to strengthen the presentation of our data collection and annotation processes.

Referee: [§3] §3 (Data Collection): The crowdsourcing pipeline for bridge and comparison questions is described at a high level, but no quantitative validation (e.g., percentage of questions answerable from a single paragraph or document) is provided to confirm that the multi-hop requirement is enforced and that surface-pattern shortcuts are filtered; this is load-bearing for the central claim that questions require reasoning over multiple supporting documents.

Authors: We agree that explicit quantitative validation would better substantiate the multi-hop nature of the questions. The manuscript describes the crowdsourcing pipeline, including the use of adversarial filtering to remove questions answerable from a single document or via surface patterns, but does not report specific percentages or validation statistics from that process. In the revision, we will add a new table and accompanying text with the number of questions at each filtering stage, along with results from a manual audit of a sample of final questions confirming that they require information from multiple documents.
revision: yes

Referee: [§4.3] §4.3 (Experiments with Supporting Facts): Performance gains are reported when models use the provided sentence-level facts, yet there is no analysis of annotation completeness (e.g., whether all necessary sentences are labeled or if relevant ones are missed) or inter-annotator agreement; without this, the reliability of the 'strong supervision' and the source of the observed improvements remain unclear.

Authors: We acknowledge the absence of completeness analysis and inter-annotator agreement (IAA) metrics for the supporting-fact annotations, which limits the ability to fully assess their reliability. The manuscript provides details on how supporting facts were collected but does not include these quantitative checks. We will revise §4.3 and the data collection section to include additional discussion of the annotation guidelines and any post-hoc manual checks performed. However, because each question received supporting-fact annotations from only a single worker, we do not have the data to compute IAA; we will explicitly note this as a limitation of the current release.
revision: partial

standing simulated objections not resolved

Inter-annotator agreement for supporting-fact annotations, as multiple independent annotations were not collected during the original crowdsourcing process.
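If a second, independent annotation pass were collected on a sample, agreement could be computed over per-sentence binary labels (1 if the sentence is marked as a supporting fact, 0 otherwise). The sketch below uses Cohen's kappa for two annotators. The label vectors are invented, and nothing here reflects data from the paper.

```python
def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators over the same binary items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: share of items where both annotators match.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's marginal rate of positives.
    rate_a = sum(labels_a) / n
    rate_b = sum(labels_b) / n
    expected = rate_a * rate_b + (1 - rate_a) * (1 - rate_b)
    if expected == 1.0:
        # Both annotators used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1 - expected)


# One entry per candidate sentence; 1 means "marked as supporting fact".
annotator_a = [1, 1, 0, 0, 0, 0, 1, 0]
annotator_b = [1, 0, 0, 0, 0, 0, 1, 0]

kappa = cohens_kappa(annotator_a, annotator_b)
```

Because most sentences in a context are not supporting facts, raw agreement will be inflated by the many shared zeros; kappa corrects for that chance agreement, which is why it is used in this sketch rather than plain accuracy.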

Circularity Check

0 steps flagged

No circularity: empirical dataset construction with direct benchmarking

The paper introduces HotpotQA via crowdsourcing pipeline for multi-hop questions and supporting-fact annotations, then reports direct model evaluations on the resulting dataset. No equations, fitted parameters, or predictions are presented; there is no derivation chain that reduces to self-definition, self-citation load-bearing, or renaming of inputs. Central claims rest on the described construction process and external model benchmarks, which are independent of any internal fit or prior self-result. This is a standard empirical dataset paper with no load-bearing circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is a dataset introduction paper with no free parameters, axioms, or invented entities in a mathematical or theoretical sense; the contribution is the curated dataset and its properties.

pith-pipeline@v0.9.0 · 5462 in / 1231 out tokens · 85048 ms · 2026-05-12T04:59:23.206350+00:00 · methodology

Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Online Learning-to-Defer with Varying Experts
stat.ML 2026-05 unverdicted novelty 8.0

Presents the first online learning-to-defer algorithm with regret bounds O((n + n_e) T^{2/3}) generally and O((n + n_e) sqrt(T)) under low noise for multiclass classification with varying experts.
DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines
cs.CL 2023-10 conditional novelty 8.0

DSPy compiles short declarative programs into LM pipelines that self-optimize and outperform both standard few-shot prompting and expert-written chains on math, retrieval, and QA tasks.
GS-QA: A Benchmark for Geospatial Question Answering
cs.DB 2026-05 unverdicted novelty 7.0

GS-QA is a new benchmark of 2,800 QA pairs on 28 templates using OSM and Wikipedia data to evaluate LLMs on spatial predicates, multi-source reasoning, and diverse answer types including distances and counts.
Online Learning-to-Defer with Varying Experts
stat.ML 2026-05 unverdicted novelty 7.0

Presents the first online Learning-to-Defer algorithm achieving regret O((n + n_e) T^{2/3}) generally and O((n + n_e) sqrt(T)) under low noise for multiclass classification with varying experts.
Logic-Regularized Verifier Elicits Reasoning from LLMs
cs.CL 2026-05 unverdicted novelty 7.0

LOVER creates an unsupervised logic-regularized verifier that reaches 95% of supervised verifier performance on reasoning tasks across 10 datasets.
Answer Only as Precisely as Justified: Calibrated Claim-Level Specificity Control for Agentic Systems
cs.CL 2026-04 unverdicted novelty 7.0

Compositional selective specificity (CSS) decomposes generated answers into claims and emits each at the most specific level supported by evidence, raising overcommitment-aware utility from 0.846 to 0.913 on LongFact ...
HeadRank: Decoding-Free Passage Reranking via Preference-Aligned Attention Heads
cs.IR 2026-04 unverdicted novelty 7.0

HeadRank improves decoding-free passage reranking by preference-aligning attention heads to increase discriminability in middle-context documents, outperforming baselines on 14 benchmarks with only 211 training queries.
HeadRank: Decoding-Free Passage Reranking via Preference-Aligned Attention Heads
cs.IR 2026-04 unverdicted novelty 7.0

HeadRank lifts preference optimization into attention space via entropy-regularized head selection and distribution regularizers to sharpen discriminability for efficient listwise reranking.
Weak-Link Optimization for Multi-Agent Reasoning and Collaboration
cs.AI 2026-04 unverdicted novelty 7.0

WORC improves multi-agent LLM reasoning to 82.2% average accuracy by predicting and compensating for the weakest agent via targeted extra sampling rather than uniform reinforcement.
Do We Still Need GraphRAG? Benchmarking RAG and GraphRAG for Agentic Search Systems
cs.IR 2026-04 unverdicted novelty 7.0

Agentic search narrows the gap between dense RAG and GraphRAG but does not remove GraphRAG's advantage on complex multi-hop reasoning.
MemSearcher: Training LLMs to Reason, Search and Manage Memory via End-to-End Reinforcement Learning
cs.CL 2025-11 unverdicted novelty 7.0

MemSearcher trains LLMs to manage compact memory in multi-turn searches via multi-context GRPO for end-to-end RL, outperforming ReAct-style baselines with stable token counts.
From Standalone LLMs to Integrated Intelligence: A Survey of Compound AI Systems
cs.MA 2025-06 accept novelty 7.0

A survey that defines Compound AI Systems, proposes a multi-dimensional taxonomy based on component roles and orchestration strategies, reviews four foundational paradigms, and identifies key challenges for future research.
Group-in-Group Policy Optimization for LLM Agent Training
cs.LG 2025-05 unverdicted novelty 7.0

GiGPO adds a hierarchical grouping mechanism to group-based RL so that LLM agents receive both global trajectory and local step-level credit signals, yielding >12% gains on ALFWorld and >9% on WebShop over GRPO while ...
BrowseComp-ZH: Benchmarking Web Browsing Ability of Large Language Models in Chinese
cs.CL 2025-04 conditional novelty 7.0

BrowseComp-ZH is a new benchmark of 289 Chinese web questions where even the strongest LLM agents reach only 42.9% accuracy.
M3-Embedding: Multi-Linguality, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation
cs.CL 2024-02 unverdicted novelty 7.0

M3-Embedding is a single model for multi-lingual, multi-functional, and multi-granular text embeddings trained via self-knowledge distillation that achieves new state-of-the-art results on multilingual, cross-lingual,...
PubMedQA: A Dataset for Biomedical Research Question Answering
cs.CL 2019-09 unverdicted novelty 7.0

PubMedQA supplies 273k+ biomedical QA instances that require reasoning over research abstracts to produce yes/no/maybe answers.
ZEBRA: Zero-shot Budgeted Resource Allocation for LLM Orchestration
cs.LG 2026-05 unverdicted novelty 6.0

ZEBRA reduces multi-phase budget allocation for LLM orchestration to a nonlinear knapsack problem solved via LLM-estimated utility curves and water-filling Lagrange search, recovering 94.4% of unconstrained quality at...
SPHERICAL KV: Angle-Domain Attention and Rate-Distortion Retention for Efficient Long-Context Inference
cs.LG 2026-05 unverdicted novelty 6.0

Spherical KV introduces angle-domain attention with spherical key parameterization and rate-distortion retention to cut KV cache residency while preserving efficient paged decoding.
ASTRA-QA: A Benchmark for Abstract Question Answering over Documents
cs.CL 2026-05 unverdicted novelty 6.0

ASTRA-QA is a benchmark for abstract document question answering that uses explicit topic sets, unsupported content annotations, and evidence alignments to enable direct scoring of coverage and hallucination.
KV-RM: Regularizing KV-Cache Movement for Static-Graph LLM Serving
cs.AR 2026-05 unverdicted novelty 6.0

KV-RM regularizes KV-cache movement in static-graph LLM serving via block paging and merge-staged transport to improve throughput, tail latency, and memory use for variable-length decoding.
Reformulating KV Cache Eviction Problem for Long-Context LLM Inference
cs.CL 2026-05 unverdicted novelty 6.0

LaProx reformulates KV cache eviction as an output-aware matrix approximation, enabling a unified global token selection strategy that preserves LLM performance at 5% cache size across long-context benchmarks.
$S^3$-R1: Learning to Retrieve and Answer Step-by-Step with Synthetic Data
cs.LG 2026-05 unverdicted novelty 6.0

S^3-R1 generates synthetic intermediate-difficulty multi-hop questions and applies dense rewards for search quality plus answer correctness, yielding up to 10% better out-of-domain generalization than baselines.
CleanBase: Detecting Malicious Documents in RAG Knowledge Databases
cs.CR 2026-05 unverdicted novelty 6.0

CleanBase identifies malicious documents in RAG databases by detecting cliques in a semantic similarity graph constructed using embedding models and a statistical threshold.
SEARCH-R: Structured Entity-Aware Retrieval with Chain-of-Reasoning Navigator for Multi-hop Question Answering
cs.CL 2026-04 unverdicted novelty 6.0

SEARCH-R improves multi-hop question answering by training a fine-tuned Llama navigator for sub-question decomposition and using dependency-tree retrieval to quantify informational contribution of documents.
SparKV: Overhead-Aware KV Cache Loading for Efficient On-Device LLM Inference
cs.NI 2026-04 unverdicted novelty 6.0

SparKV reduces time-to-first-token by 1.3x-5.1x and energy use by 1.5x-3.3x for on-device LLM inference by adaptively choosing between cloud KV streaming and local computation while overlapping execution and adjusting...
How Far Are Video Models from True Multimodal Reasoning?
cs.CV 2026-04 unverdicted novelty 6.0

Current video models succeed on basic understanding but achieve under 25% success on logically grounded generation and near 0% on interactive generation, exposing gaps in multimodal reasoning.
Evaluating Multi-Hop Reasoning in RAG Systems: A Comparison of LLM-Based Retriever Evaluation Strategies
cs.IR 2026-04 unverdicted novelty 6.0

CARE, a context-aware LLM judge, outperforms standard methods when evaluating multi-hop retrieval quality in RAG systems.
Answer Only as Precisely as Justified: Calibrated Claim-Level Specificity Control for Agentic Systems
cs.CL 2026-04 unverdicted novelty 6.0

Compositional selective specificity (CSS) improves overcommitment-aware utility from 0.846 to 0.913 on LongFact while retaining 0.938 specificity by calibrating claim-level backoffs in agentic AI responses.
MemSearch-o1: Empowering Large Language Models with Reasoning-Aligned Memory Growth in Agentic Search
cs.IR 2026-04 unverdicted novelty 6.0

MemSearch-o1 uses reasoning-aligned memory growth from seed tokens, retracing via contribution functions, and path reorganization to mitigate memory dilution in LLM agentic search.
MemSearch-o1: Empowering Large Language Models with Reasoning-Aligned Memory Growth in Agentic Search
cs.IR 2026-04 unverdicted novelty 6.0

MemSearch-o1 mitigates memory dilution in agentic LLM search through reasoning-aligned token-level memory growth, retracing with a contribution function, and path reorganization, improving reasoning activation on benchmarks.
Complementing Self-Consistency with Cross-Model Disagreement for Uncertainty Quantification
cs.AI 2026-04 unverdicted novelty 6.0

Cross-model semantic disagreement adds an epistemic uncertainty term that improves total uncertainty estimation over self-consistency alone, helping flag confident errors in LLMs.
Transforming External Knowledge into Triplets for Enhanced Retrieval in RAG of LLMs
cs.CL 2026-04 unverdicted novelty 6.0

Tri-RAG turns external knowledge into Condition-Proof-Conclusion triplets and retrieves via the Condition anchor to improve efficiency and quality in LLM RAG.
Data, Not Model: Explaining Bias toward LLM Texts in Neural Retrievers
cs.IR 2026-04 unverdicted novelty 6.0

Bias toward LLM texts in neural retrievers arises from artifact imbalances between positive and negative documents in training data that are absorbed during contrastive learning.
LLMs Should Express Uncertainty Explicitly
cs.LG 2026-04 unverdicted novelty 6.0

Training LLMs to verbalize uncertainty explicitly at the end or during reasoning reduces overconfident errors and improves answer quality on factual tasks while enabling RAG triggers.
LLMs Should Express Uncertainty Explicitly
cs.LG 2026-04 unverdicted novelty 6.0

Training LLMs to express uncertainty explicitly via global confidence or local markers enhances calibration and intervention triggers compared to post-hoc estimation.
OASES: Outcome-Aligned Search-Evaluation Co-Training for Agentic Search
cs.AI 2026-04 unverdicted novelty 6.0

OASES co-trains search policies and evaluators to generate outcome-aligned process rewards, outperforming standard RL baselines on five multi-hop QA benchmarks.
TDA-RC: Task-Driven Alignment for Knowledge-Based Reasoning Chains in Large Language Models
cs.CL 2026-03 unverdicted novelty 6.0

TDA-RC embeds topological patterns from multi-round reasoning into CoT via persistent homology and a repair agent, yielding better accuracy-efficiency trade-offs than ToT or GoT on tested datasets.
MSA: Memory Sparse Attention for Efficient End-to-End Memory Model Scaling to 100M Tokens
cs.CL 2026-03 unverdicted novelty 6.0

MSA is an end-to-end trainable memory model using sparse attention and document-wise RoPE that scales to 100M tokens with linear complexity and less than 9% degradation.
EvolveR: Self-Evolving LLM Agents through an Experience-Driven Lifecycle
cs.CL 2025-10 unverdicted novelty 6.0

EvolveR enables LLM agents to self-evolve via a closed loop of distilling interaction trajectories into strategic principles offline and retrieving them to guide online decisions with policy reinforcement, yielding be...
Question-Adaptive Graph Learning for Multi-hop Retrieval Augmented Generation
cs.LG 2025-10 unverdicted novelty 6.0

A Multi-L KG and Quest-GNN with question-adaptive intra/inter-level message passing and synthesized pre-training data improves multi-hop RAG performance up to 33.8% on high-hop questions.
WebSailor: Navigating Super-human Reasoning for Web Agent
cs.CL 2025-07 conditional novelty 6.0

WebSailor trains open-source web agents to match proprietary performance on complex information-seeking tasks by generating high-uncertainty scenarios and using a new RL method called DUPO.
MemAgent: Reshaping Long-Context LLM with Multi-Conv RL-based Memory Agent
cs.CL 2025-07 unverdicted novelty 6.0

MemAgent uses multi-conversation RL to train a memory agent that reads text in segments and overwrites memory, extrapolating from 8K training to 3.5M token QA with under 5% loss and 95%+ on 512K RULER.
Context Attribution with Multi-Armed Bandit Optimization
cs.AI 2025-06 unverdicted novelty 6.0

Formulates context attribution as a combinatorial multi-armed bandit problem solved via Linear Thompson Sampling to reduce LLM queries by up to 30% on QA benchmarks while matching existing attribution quality.
ZeroSearch: Incentivize the Search Capability of LLMs without Searching
cs.CL 2025-05 conditional novelty 6.0

ZeroSearch simulates search engine interactions via supervised fine-tuning of a retrieval module and curriculum-based RL degradation of document quality, achieving comparable or superior performance to real search eng...
ZeroSearch: Incentivize the Search Capability of LLMs without Searching
cs.CL 2025-05 unverdicted novelty 6.0

ZeroSearch uses supervised fine-tuning to create a simulated retrieval module and curriculum-based RL rollouts that degrade document quality to train LLMs on search capabilities without real search API calls.
In-depth Analysis of Graph-based RAG in a Unified Framework
cs.IR 2025-03 unverdicted novelty 6.0

A unified framework and large-scale comparison of graph-based RAG methods on QA tasks yields new high-performing variants obtained by recombining existing components.
ArchRAG: Attributed Community-based Hierarchical Retrieval-Augmented Generation
cs.IR 2025-02 unverdicted novelty 6.0

ArchRAG proposes attributed-community hierarchical indexing and LLM clustering to improve accuracy and lower token usage in graph-based retrieval-augmented generation.
Ada-KV: Optimizing KV Cache Eviction by Adaptive Budget Allocation for Efficient LLM Inference
cs.CL 2024-07 accept novelty 6.0

Ada-KV is the first head-wise adaptive KV cache budget allocator for LLMs, using a theoretical loss upper bound to allocate eviction differently per attention head and yielding higher quality than uniform methods on l...
NV-Embed: Improved Techniques for Training LLMs as Generalist Embedding Models
cs.CL 2024-05 accept novelty 6.0

NV-Embed achieves first place on the MTEB leaderboard across 56 tasks by combining a latent attention layer, causal-mask removal, two-stage contrastive training, and data curation for LLM-based embedding models.
SnapKV: LLM Knows What You are Looking for Before Generation
cs.CL 2024-04 conditional novelty 6.0

SnapKV selects clustered important KV positions per attention head from an observation window at the prompt end, yielding 3.6x faster generation and 8.2x better memory efficiency on 16K-token inputs with comparable pe...
Identifying the Achilles' Heel: An Iterative Method for Dynamically Uncovering Factual Errors in Large Language Models
cs.SE 2024-01 unverdicted novelty 6.0

HalluHunter is a knowledge-graph and rule-based NLP framework that iteratively generates single- and multi-hop questions to uncover factual errors in LLMs, triggering errors in up to 55% of cases on nine models while ...
ConfusionPrompt: Practical Private Inference for Online Large Language Models
cs.CR 2023-12 unverdicted novelty 6.0

ConfusionPrompt enables private black-box LLM inference via prompt decomposition and pseudo-prompt mixing, claiming better privacy-utility trade-off than perturbation methods and lower memory use than open-source loca...
ReWOO: Decoupling Reasoning from Observations for Efficient Augmented Language Models
cs.CL 2023-05 conditional novelty 6.0

ReWOO decouples reasoning from tool observations in augmented language models, delivering 5x token efficiency and 4% higher accuracy on multi-step reasoning benchmarks like HotpotQA.
Solving math word problems with process- and outcome-based feedback
cs.LG 2022-11 unverdicted novelty 6.0

On GSM8K, outcome-based supervision achieves similar final-answer error rates to process-based with less labeling, but process-based or learned reward models are needed to reach 3.4% reasoning error among correct solutions.
Atlas: Few-shot Learning with Retrieval Augmented Language Models
cs.CL 2022-08 unverdicted novelty 6.0

Atlas reaches over 42% accuracy on Natural Questions with only 64 examples, outperforming a 540B-parameter model by 3% with 50x fewer parameters.
CTRL: A Conditional Transformer Language Model for Controllable Generation
cs.CL 2019-09 unverdicted novelty 6.0

CTRL is a large conditional transformer language model that uses naturally occurring control codes to steer text generation style and content.
MaxShapley: Towards Incentive-compatible Generative Search with Fair Context Attribution
cs.LG 2025-12 unverdicted novelty 5.0

MaxShapley computes fair document attributions in generative QA by reducing Shapley value calculation to polynomial time via a max-sum utility, matching exact Shapley quality on HotPotQA, MuSiQUE, and MS MARCO while u...
Towards Understanding, Analyzing, and Optimizing Agentic AI Execution: A CPU-Centric Perspective
cs.AI 2025-11 conditional novelty 5.0

The paper analyzes CPU bottlenecks in agentic AI serving, selects representative workloads, and demonstrates that CPU-aware scheduling optimizations COMB and MAS can reduce P50 latency by up to 1.7x and total latency ...
RELOOP: Recursive Retrieval with Multi-Hop Reasoner and Planners for Heterogeneous QA
cs.CL 2025-10 unverdicted novelty 5.0

RELOOP unifies retrieval across text, tables, and KGs via hierarchical sequences and dual-agent guided iteration, reporting EM/F1 gains over baselines on HotpotQA, HybridQA/TAT-QA, and MetaQA.
EvolveR: Self-Evolving LLM Agents through an Experience-Driven Lifecycle
cs.CL 2025-10 unverdicted novelty 5.0

EvolveR proposes a closed-loop self-evolution system for LLM agents that distills experiences into principles offline and applies reinforcement during online task interactions to achieve better performance on multi-ho...

Reference graph

Works this paper leans on

19 extracted references · 19 canonical work pages · cited by 70 Pith papers

[1]

Danqi Chen, Adam Fisch, Jason Weston, and Antoine Bordes. 2017. Reading Wikipedia to answer open-domain questions. In Association for Computational Linguistics (ACL)

work page 2017
[2]

Christopher Clark and Matt Gardner. 2017. Simple and effective multi-paragraph reading comprehension. In Proceedings of the 55th Annual Meeting of the Association of Computational Linguistics

work page 2017
[3]

Matthew Dunn, Levent Sagun, Mike Higgins, Ugur Guney, Volkan Cirik, and Kyunghyun Cho. 2017. SearchQA : A new Q&A dataset augmented with context from a search engine. arXiv preprint arXiv:1704.05179

work page Pith review arXiv 2017
[4]

Weld, and Luke Zettlemoyer

Mandar Joshi, Eunsol Choi, Daniel S. Weld, and Luke Zettlemoyer. 2017. TriviaQA : A large scale distantly supervised challenge dataset for reading comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics

work page 2017
[5]

Xiaodong Liu, Yelong Shen, Kevin Duh, and Jianfeng Gao. 2018. Stochastic answer networks for machine reading comprehension. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics

work page 2018
[6]

Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven J

Christopher D. Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven J. Bethard, and David McClosky. 2014. The Stanford CoreNLP natural language processing toolkit. In Association for Computational Linguistics (ACL) System Demonstrations, pages 55--60

work page 2014
[7]

Alexander H Miller, Will Feng, Adam Fisch, Jiasen Lu, Dhruv Batra, Antoine Bordes, Devi Parikh, and Jason Weston. 2017. ParlAI: A dialog research software platform. arXiv preprint arXiv:1705.06476

work page Pith review arXiv 2017
[8]

Tri Nguyen, Mir Rosenberg, Xia Song, Jianfeng Gao, Saurabh Tiwary, Rangan Majumder, and Li Deng. 2016. MS MARCO: A human generated machine reading comprehension dataset. In Proceedings of the 30th Annual Conference on Neural Information Processing Systems (NIPS)

work page 2016
[9]

Jekaterina Novikova, Ondřej Dušek, Amanda Cercas Curry, and Verena Rieser. 2017. Why we need new evaluation metrics for NLG. In Proceedings of the Conference on Empirical Methods in Natural Language Processing

work page 2017
[10]

Boyuan Pan, Hao Li, Zhou Zhao, Bin Cao, Deng Cai, and Xiaofei He. 2017. Memen: Multi-layer embedding with memory networks for machine comprehension. arXiv preprint arXiv:1707.09098

work page Pith review arXiv 2017
[11]

Pranav Rajpurkar, Robin Jia, and Percy Liang. 2018. Know what you don't know: Unanswerable questions for SQuAD. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics

work page 2018
[12]

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing (EMNLP)

work page 2016
[13]

Shimi Salant and Jonathan Berant. 2018. Contextualized word representations for reading comprehension. In Proceedings of the 16th Annual Conference of the North American Chapter of the Association for Computational Linguistics

work page 2018
[14]

Minjoon Seo, Aniruddha Kembhavi, Ali Farhadi, and Hannaneh Hajishirzi. 2017. Bidirectional attention flow for machine comprehension. In Proceedings of the International Conference on Learning Representations

work page 2017
[15]

Alon Talmor and Jonathan Berant. 2018. The web as a knowledge-base for answering complex questions. In Proceedings of the 16th Annual Conference of the North American Chapter of the Association for Computational Linguistics

work page 2018
[16]

Wenhui Wang, Nan Yang, Furu Wei, Baobao Chang, and Ming Zhou. 2017. Gated self-matching networks for reading comprehension and question answering. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 189–198

work page 2017
[17]

Johannes Welbl, Pontus Stenetorp, and Sebastian Riedel. 2018. Constructing datasets for multi-hop reading comprehension across documents. Transactions of the Association for Computational Linguistics

work page 2018
[18]

Caiming Xiong, Victor Zhong, and Richard Socher. 2018. DCN+: Mixed objective and deep residual coattention for question answering. In Proceedings of the International Conference on Learning Representations

work page 2018
[19]

Zhilin Yang, Saizheng Zhang, Jack Urbanek, Will Feng, Alexander H Miller, Arthur Szlam, Douwe Kiela, and Jason Weston. 2018. Mastering the dungeon: Grounded language learning by mechanical turker descent. In Proceedings of the International Conference on Learning Representations

work page 2018
