CoQA: A Conversational Question Answering Challenge

Christopher D. Manning; Danqi Chen; Siva Reddy

arxiv: 1808.07042 · v2 · pith:DDPH67ZLnew · submitted 2018-08-21 · 💻 cs.CL · cs.AI · cs.LG


Siva Reddy, Danqi Chen, Christopher D. Manning

classification 💻 cs.CL cs.AI cs.LG

keywords: conversational, coqa, questions, answers, answering, challenge, comprehension, conversations



Humans gather information by engaging in conversations involving a series of interconnected questions and answers. For machines to assist in information gathering, it is therefore essential to enable them to answer conversational questions. We introduce CoQA, a novel dataset for building Conversational Question Answering systems. Our dataset contains 127k questions with answers, obtained from 8k conversations about text passages from seven diverse domains. The questions are conversational, and the answers are free-form text with their corresponding evidence highlighted in the passage. We analyze CoQA in depth and show that conversational questions have challenging phenomena not present in existing reading comprehension datasets, e.g., coreference and pragmatic reasoning. We evaluate strong conversational and reading comprehension models on CoQA. The best system obtains an F1 score of 65.4%, which is 23.4 points behind human performance (88.8%), indicating there is ample room for improvement. We launch CoQA as a challenge to the community at http://stanfordnlp.github.io/coqa/
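The abstract reports results as F1 scores without defining the metric. As a rough illustration only, the sketch below computes a token-overlap F1 between a predicted and a reference answer, the usual choice for free-form reading comprehension answers. The function name `token_f1` and the whitespace and lowercase tokenization are assumptions made for this example; the official CoQA evaluation may normalize answers differently and may aggregate over several reference answers.

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between a predicted and a reference answer.

    Hypothetical helper for illustration; not the official CoQA scorer.
    """
    # Assumed normalization: lowercase and split on whitespace.
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()

    # If either answer is empty, score 1.0 only when both are empty.
    if not pred_tokens or not ref_tokens:
        return float(pred_tokens == ref_tokens)

    # Multiset intersection counts each shared token at most as often
    # as it appears in both answers.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0

    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    print(token_f1("in the garden", "the garden"))
```

Under this definition a prediction that contains the reference plus extra words keeps full recall but loses precision, which is one reason free-form answers are scored with F1 rather than exact match.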



Forward citations

Cited by 10 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

PRIMETIME: Limits of LLMs in Temporal Primitives
cs.NE 2025-04 unverdicted novelty 7.0

The PRIMETIME generator reveals that LLM datetime parsing and arithmetic primitives are individually unreliable but fully learnable via fine-tuning, enabling frontier-level accuracy on event planning with small LoRA models.
Language Models as Knowledge Bases?
cs.CL 2019-09 accept novelty 7.0

BERT stores relational knowledge extractable via cloze queries without fine-tuning and matches supervised baselines on open-domain QA tasks.
Asking Clarifying Questions in Open-Domain Information-Seeking Conversations
cs.CL 2019-07 unverdicted novelty 7.0

The authors introduce the task of asking clarifying questions for open-domain information-seeking conversations, collect the Qulac dataset from TREC topics, and propose a retrieval framework that outperforms baselines...
Dual Hierarchical Dialogue Policy Learning for Legal Inquisitive Conversational Agents
cs.CL 2026-05 unverdicted novelty 6.0

A dual hierarchical RL framework with two agents coordinates high-level dialogue strategy and low-level question generation to emulate judicial questioning and extract key information from Supreme Court arguments, out...
Dual Hierarchical Dialogue Policy Learning for Legal Inquisitive Conversational Agents
cs.CL 2026-05 unverdicted novelty 6.0

A dual hierarchical RL framework lets agents learn when and how to ask probing questions in U.S. Supreme Court arguments, outperforming baselines on a court dataset.
Hallucination as an Anomaly: Dynamic Intervention via Probabilistic Circuits
cs.CL 2026-05 unverdicted novelty 6.0

Probabilistic circuits detect LLM hallucinations as residual-stream anomalies with up to 99% AUROC and enable dynamic correction that raises truthfulness scores while cutting unnecessary output corruption.
Parcae: Scaling Laws For Stable Looped Language Models
cs.LG 2026-04 unverdicted novelty 6.0

Parcae stabilizes looped LLMs via spectral norm constraints on injection parameters, enabling power-law scaling for training FLOPs and saturating exponential scaling at test time that improves quality over fixed-depth...
PaLM: Scaling Language Modeling with Pathways
cs.CL 2022-04 accept novelty 6.0

PaLM 540B demonstrates continued scaling benefits by setting new few-shot SOTA results on hundreds of benchmarks and outperforming humans on BIG-bench.
To Tune or Not To Tune? How About the Best of Both Worlds?
cs.CL 2019-07 unverdicted novelty 3.0

A sequential fine-tuning strategy for pre-trained language models reports modest accuracy gains of 4.7%, 0.99%, and 0.72% on semantic similarity, sequence labeling, and text classification tasks.
Machine Reading Comprehension: a Literature Review
cs.CL 2019-06 unverdicted novelty 1.0

A 2019 survey of machine reading comprehension corpora and methods.