QuAC : Question Answering in Context

2018 · cs.CL · arXiv 1808.07036

9 Pith papers cite this work. Polarity classification is still indexing.

abstract

We present QuAC, a dataset for Question Answering in Context that contains 14K information-seeking QA dialogs (100K questions in total). The dialogs involve two crowd workers: (1) a student who poses a sequence of freeform questions to learn as much as possible about a hidden Wikipedia text, and (2) a teacher who answers the questions by providing short excerpts from the text. QuAC introduces challenges not found in existing machine comprehension datasets: its questions are often more open-ended, unanswerable, or only meaningful within the dialog context, as we show in a detailed qualitative evaluation. We also report results for a number of reference models, including a recently state-of-the-art reading comprehension architecture extended to model dialog context. Our best model underperforms humans by 20 F1, suggesting that there is significant room for future work on this data. Dataset, baseline, and leaderboard available at http://quac.ai.

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

GS-QA: A Benchmark for Geospatial Question Answering

cs.DB · 2026-05-21 · unverdicted · novelty 7.0

GS-QA is a new benchmark of 2,800 QA pairs on 28 templates using OSM and Wikipedia data to evaluate LLMs on spatial predicates, multi-source reasoning, and diverse answer types including distances and counts.

PRIMETIME : Limits of LLMs in Temporal Primitives

cs.NE · 2025-04-22 · unverdicted · novelty 7.0

PRIMETIME generator reveals that LLM datetime parsing and arithmetic primitives are individually unreliable but fully learnable via fine-tuning, enabling frontier-level accuracy on event planning with small LoRA models.

Dual Hierarchical Dialogue Policy Learning for Legal Inquisitive Conversational Agents

cs.CL · 2026-05-13 · unverdicted · novelty 6.0 · 2 refs

A dual hierarchical RL framework with two agents coordinates high-level dialogue strategy and low-level question generation to emulate judicial questioning and extract key information from Supreme Court arguments, outperforming baselines.

LaMI: Augmenting Large Language Models via Late Multi-Image Fusion

cs.CL · 2024-06-19 · unverdicted · novelty 6.0

LaMI augments LLMs with visual commonsense via late fusion of predictions from multiple text-generated images, outperforming prior augmented LLMs on visual tasks while matching VLMs and preserving or improving NLP performance.

Talking to a Know-It-All GPT or a Second-Guesser Claude? How Repair reveals unreliable Multi-Turn Behavior in LLMs

cs.CL · 2026-04-21 · unverdicted · novelty 6.0

Each tested LLM shows its own characteristic unreliability when engaging in repair during extended math-question dialogues.

LLMs Get Lost In Multi-Turn Conversation

cs.CL · 2025-05-09 · unverdicted · novelty 6.0

LLMs drop 39% in performance during multi-turn conversations due to premature assumptions and inability to recover from early errors.

PaLM: Scaling Language Modeling with Pathways

cs.CL · 2022-04-05 · accept · novelty 6.0

PaLM 540B demonstrates continued scaling benefits by setting new few-shot SOTA results on hundreds of benchmarks and outperforming humans on BIG-bench.

Mixtral of Experts

cs.LG · 2024-01-08 · unverdicted · novelty 5.0

Mixtral 8x7B is a sparse MoE LLM activating 2 of 8 experts per layer that matches or exceeds Llama 2 70B and GPT-3.5 on benchmarks while using only 13B active parameters.

Mistral 7B

cs.CL · 2023-10-10 · accept · novelty 5.0

Mistral 7B is a 7B-parameter LLM that outperforms Llama 2 13B across benchmarks via grouped-query attention and sliding-window attention while remaining efficient.

citing papers explorer

Showing 9 of 9 citing papers.

GS-QA: A Benchmark for Geospatial Question Answering · cs.DB · 2026-05-21 · unverdicted · none · ref 13 · internal anchor
PRIMETIME : Limits of LLMs in Temporal Primitives · cs.NE · 2025-04-22 · unverdicted · none · ref 66 · internal anchor
Dual Hierarchical Dialogue Policy Learning for Legal Inquisitive Conversational Agents · cs.CL · 2026-05-13 · unverdicted · none · ref 45 · 2 links · internal anchor
LaMI: Augmenting Large Language Models via Late Multi-Image Fusion · cs.CL · 2024-06-19 · unverdicted · none · ref 16 · internal anchor
Talking to a Know-It-All GPT or a Second-Guesser Claude? How Repair reveals unreliable Multi-Turn Behavior in LLMs · cs.CL · 2026-04-21 · unverdicted · none · ref 67
LLMs Get Lost In Multi-Turn Conversation · cs.CL · 2025-05-09 · unverdicted · none · ref 13
PaLM: Scaling Language Modeling with Pathways · cs.CL · 2022-04-05 · accept · none · ref 28
Mixtral of Experts · cs.LG · 2024-01-08 · unverdicted · none · ref 5 · internal anchor
Mistral 7B · cs.CL · 2023-10-10 · accept · none · ref 7 · internal anchor
