Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering
Pith reviewed 2026-05-13 08:10 UTC · model grok-4.3
The pith
Many state-of-the-art pre-trained QA methods perform worse than simple neural baselines on questions that combine science facts with common knowledge.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
OpenBookQA requires a model to select the right fact from a small open book and combine it with external common knowledge to answer questions about novel situations. Human solvers achieve close to 92 percent accuracy, but many state-of-the-art pre-trained QA methods perform surprisingly poorly and fall below several simple neural baselines developed in the paper. Oracle experiments that remove the retrieval step demonstrate the value of both the open-book facts and the additional common-knowledge facts.
What carries the argument
The OpenBookQA dataset, which supplies a compact set of science facts and forces models to retrieve one fact and integrate it with unstated common knowledge.
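To make the retrieval half of this concrete, here is a minimal sketch of the fact-selection step; the word-overlap scorer and the three example facts are illustrative assumptions, not the paper's retriever or its released open book.

```python
# A minimal sketch of the retrieval step, assuming a tiny illustrative fact
# list rather than the released 1,329-fact open book; word overlap stands in
# for whatever retriever a real system would use.

def tokenize(text):
    """Lowercase the text and return its set of punctuation-stripped tokens."""
    return {w.strip(".,?!;:\"'()") for w in text.lower().split()} - {""}

def retrieve_fact(question, facts):
    """Return the fact sharing the most tokens with the question."""
    q_tokens = tokenize(question)
    return max(facts, key=lambda fact: len(tokenize(fact) & q_tokens))

# Illustrative stand-ins for the open book and a dataset question.
facts = [
    "metals conduct electricity",
    "plants require sunlight to grow",
    "water expands when it freezes",
]
question = "Can a suit of armor conduct electricity?"

print(retrieve_fact(question, facts))  # -> "metals conduct electricity"
# The second hop (that a suit of armor is made of metal) is the unstated
# common knowledge the model must still supply on its own.
```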
If this is right
- Pre-trained QA systems have a measurable deficit when forced to integrate retrieved facts with external common knowledge.
- Simple neural baselines remain competitive and sometimes superior on this style of question.
- Supplying the correct fact in an oracle setting lifts performance, confirming that both the fact and the additional knowledge are load-bearing.
- Solving multi-hop retrieval over a small knowledge base plus outside facts is the main remaining obstacle to human-level results.
Where Pith is reading between the lines
- The gap may persist even with larger pre-training unless models gain explicit mechanisms for pulling in unstated facts.
- The same open-book-plus-common-knowledge format could be applied to other subjects to test whether current methods generalize beyond pattern matching.
- Small, curated fact sets paired with targeted questions may expose reasoning limits that large unstructured corpora obscure.
Load-bearing premise
The questions cannot be solved by linguistic patterns or surface cues alone and genuinely require combining the stated fact with outside common knowledge.
What would settle it
A model that reaches near-human accuracy while denied access to the open-book facts, or while relying only on question wording, would show that the dataset does not test the intended integration.
read the original abstract
We present a new kind of question answering dataset, OpenBookQA, modeled after open book exams for assessing human understanding of a subject. The open book that comes with our questions is a set of 1329 elementary level science facts. Roughly 6000 questions probe an understanding of these facts and their application to novel situations. This requires combining an open book fact (e.g., metals conduct electricity) with broad common knowledge (e.g., a suit of armor is made of metal) obtained from other sources. While existing QA datasets over documents or knowledge bases, being generally self-contained, focus on linguistic understanding, OpenBookQA probes a deeper understanding of both the topic---in the context of common knowledge---and the language it is expressed in. Human performance on OpenBookQA is close to 92%, but many state-of-the-art pre-trained QA methods perform surprisingly poorly, worse than several simple neural baselines we develop. Our oracle experiments designed to circumvent the knowledge retrieval bottleneck demonstrate the value of both the open book and additional facts. We leave it as a challenge to solve the retrieval problem in this multi-hop setting and to close the large gap to human performance.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces OpenBookQA, a new QA dataset modeled on open-book exams, consisting of 1329 elementary science facts and approximately 6000 multiple-choice questions. The questions are designed to require combining a provided fact with external common knowledge. The paper reports that state-of-the-art pre-trained QA models perform poorly on this dataset, underperforming several simple neural baselines developed by the authors, while humans reach ~92% accuracy. Oracle experiments that supply the relevant facts demonstrate their value and highlight the retrieval challenge.
Significance. If the questions genuinely require multi-hop integration of the open-book facts with common knowledge, this dataset provides a valuable benchmark for advancing QA systems beyond pattern matching toward deeper reasoning. The release of the facts, questions, and baselines is a concrete contribution that can be used immediately by the community.
major comments (1)
- [Experiments] The central interpretation that SOTA models fail due to inability to combine facts with common knowledge rests on the assumption that questions cannot be solved via linguistic cues alone. The oracle experiments (described in the results section) show gains when facts are supplied, but the manuscript does not report an explicit cue-only baseline (model performance on question + choices with no facts provided). This control is needed to quantify how much of the reported gap is attributable to knowledge integration versus annotation artifacts or surface patterns.
minor comments (2)
- [Dataset] In the dataset construction section, the process for ensuring that each question requires the specific open-book fact (rather than being answerable from the question text alone) could be described more explicitly, including any filtering steps applied after crowdsourcing.
- [Introduction] Figure 1 (example question) would benefit from an additional row showing the model predictions of the simple baselines versus the SOTA systems to illustrate the performance gap visually.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. The suggestion to include an explicit cue-only baseline is valuable for strengthening the interpretation of our results, and we will revise the manuscript to address this.
read point-by-point responses
Referee: [Experiments] The central interpretation that SOTA models fail due to inability to combine facts with common knowledge rests on the assumption that questions cannot be solved via linguistic cues alone. The oracle experiments (described in the results section) show gains when facts are supplied, but the manuscript does not report an explicit cue-only baseline (model performance on question + choices with no facts provided). This control is needed to quantify how much of the reported gap is attributable to knowledge integration versus annotation artifacts or surface patterns.
Authors: We agree that this control experiment is important for isolating the contribution of knowledge integration. In the revised manuscript, we will add results for all models (including the SOTA pre-trained QA systems and our simple neural baselines) when trained and evaluated on question text plus answer choices only, with no facts from the open book provided. This will allow us to quantify the performance attributable to surface patterns or annotation artifacts versus the need to combine the open-book facts with common knowledge. We will also update the discussion and oracle analysis sections to reference these new numbers.
revision: yes
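To illustrate what such a cue-only control might look like, here is a rough sketch that scores answer choices against the question wording alone, with no open-book fact supplied; the overlap heuristic, the toy item, and the accuracy helper are assumptions made for illustration, not the authors' baselines or data.

```python
# A rough sketch of the cue-only control: pick an answer choice using only
# the question text and the choices themselves, with no open-book fact.
# The overlap heuristic and the single made-up item are illustrative.

def tokenize(text):
    """Lowercase the text and return its set of punctuation-stripped tokens."""
    return {w.strip(".,?!;:\"'()") for w in text.lower().split()} - {""}

def cue_only_predict(question, choices):
    """Pick the choice overlapping most with the question; ties go to the first."""
    q_tokens = tokenize(question)
    scores = [len(tokenize(choice) & q_tokens) for choice in choices]
    return scores.index(max(scores))

def accuracy(examples):
    """examples: list of (question, choices, gold_index) triples."""
    correct = sum(cue_only_predict(q, ch) == gold for q, ch, gold in examples)
    return correct / len(examples)

# One made-up item, only to show the interface; a real evaluation would
# iterate over the released OpenBookQA questions.
toy_examples = [
    ("Which of these would let current flow through a circuit?",
     ["a wooden spoon", "a rubber band", "a steel fork", "a glass cup"],
     2),
]
print(f"cue-only accuracy on the toy item: {accuracy(toy_examples):.2f}")
```

If such question-plus-choices-only scores stay well below the full models, the gap can more safely be attributed to knowledge integration rather than surface cues.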
Circularity Check
No circularity: empirical dataset introduction and benchmarking
full rationale
The paper introduces the OpenBookQA dataset and reports direct empirical evaluations of QA methods against it, human performance, and simple baselines. No equations, parameter fittings, derivations, or self-citations form any load-bearing chain that reduces results to inputs by construction. Claims rest on new data collection and standard accuracy measurements, which are externally verifiable and independent of the paper's own outputs.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: the provided elementary science facts are accurate and sufficient when combined with common knowledge.
Forward citations
Cited by 32 Pith papers
- Language Models are Few-Shot Learners
  GPT-3 shows that scaling an autoregressive language model to 175 billion parameters enables strong few-shot performance across diverse NLP tasks via in-context prompting without fine-tuning.
- Boosting Omni-Modal Language Models: Staged Post-Training with Visually Debiased Evaluation
  Visual debiasing of omni-modal benchmarks combined with staged post-training lets a 3B model match or exceed a 30B model without a stronger teacher.
- LoopUS: Recasting Pretrained LLMs into Looped Latent Refinement Models
  LoopUS converts pretrained LLMs into looped latent refinement models via block decomposition, selective gating, random deep supervision, and confidence-based early exiting to improve reasoning performance.
- HybridGen: Efficient LLM Generative Inference via CPU-GPU Hybrid Computing
  HybridGen achieves 1.41x-3.2x average speedups over six prior KV cache methods for LLM inference by using attention logit parallelism, a feedback-driven scheduler, and semantic-aware KV cache mapping.
- Winner-Take-All Spiking Transformer for Language Modeling
  Winner-take-all spiking self-attention replaces softmax in spiking transformers to support language modeling on 16 datasets with spike-driven, energy-efficient architectures.
- A Switch-Centric In-Network Architecture for Accelerating LLM Inference in Shared-Memory Network
  SCIN uses an in-switch accelerator for direct memory access and 8-bit in-network quantization during All-Reduce, delivering up to 8.7x faster small-message reduction and 1.74x TTFT speedup on LLaMA-2 models.
- Path-Constrained Mixture-of-Experts
  PathMoE constrains expert paths in MoE models by sharing router parameters across layer blocks, yielding more concentrated paths, better performance on perplexity and tasks, and no need for auxiliary losses.
- Scaling Latent Reasoning via Looped Language Models
  Looped language models with latent iterative computation and entropy-regularized depth allocation achieve performance matching up to 12B standard LLMs through superior knowledge manipulation.
- Moshi: a speech-text foundation model for real-time dialogue
  Moshi is the first real-time full-duplex spoken large language model that casts dialogue as speech-to-speech generation using parallel audio streams and an inner monologue of time-aligned text tokens.
- Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality
  Transformers and SSMs are unified through structured state space duality, producing a 2-8X faster Mamba-2 model that remains competitive with Transformers.
- Self-Rewarding Language Models
  Iterative self-rewarding via LLM-as-Judge in DPO training on Llama 2 70B improves instruction following and self-evaluation, outperforming GPT-4 on AlpacaEval 2.0.
- OPT: Open Pre-trained Transformer Language Models
  OPT releases open decoder-only transformers up to 175B parameters that match GPT-3 performance at one-seventh the carbon cost, along with code and training logs.
- Boosting Omni-Modal Language Models: Staged Post-Training with Visually Debiased Evaluation
  Staged post-training with self-distillation lets a 3B omni-modal model match or slightly exceed a 30B model on a visually debiased benchmark.
- Fitting Is Not Enough: Smoothness in Extremely Quantized LLMs
  Extremely quantized LLMs degrade in smoothness, sparsifying the decoding tree and hurting generation quality; a smoothness-preserving principle delivers gains beyond numerical fitting.
- Revealing Modular Gradient Noise Imbalance in LLMs: Calibrating Adam via Signal-to-Noise Ratio
  MoLS scales Adam updates using module-level SNR estimates to correct gradient noise imbalance and improve LLM training convergence and generalization.
- Long-Context Aware Upcycling: A New Frontier for Hybrid LLM Scaling
  HyLo upcycles Transformer LLMs into hybrids with MLA and Mamba2/Gated DeltaNet blocks via staged training and distillation, extending context to 2M tokens and outperforming prior upcycled hybrids on long-context benchmarks.
- GRASPrune: Global Gating for Budgeted Structured Pruning of Large Language Models
  GRASPrune removes 50% of parameters from LLaMA-2-7B via global gating and projected straight-through estimation, reaching 12.18 WikiText-2 perplexity and competitive zero-shot accuracy after four epochs on 512 calibra...
- SAMoRA: Semantic-Aware Mixture of LoRA Experts for Task-Adaptive Learning
  SAMoRA is a parameter-efficient fine-tuning framework that uses semantic-aware routing and task-adaptive scaling within a Mixture of LoRA Experts to improve multi-task performance and generalization over prior methods.
- TLoRA: Task-aware Low Rank Adaptation of Large Language Models
  TLoRA jointly optimizes LoRA initialization via task-data SVD and sensitivity-driven rank allocation, delivering stronger results than standard LoRA across NLU, reasoning, math, code, and chat tasks while using fewer ...
- Representation-Guided Parameter-Efficient LLM Unlearning
  REGLU guides LoRA-based unlearning via representation subspaces and orthogonal regularization to outperform prior methods on forget-retain trade-off in LLM benchmarks.
- Robust Ultra Low-Bit Post-Training Quantization via Stable Diagonal Curvature Estimate
  DASH-Q uses a stable diagonal curvature estimate and weighted least squares to achieve robust ultra-low-bit post-training quantization of LLMs, improving zero-shot accuracy by 7% on average over baselines.
- Parcae: Scaling Laws For Stable Looped Language Models
  Parcae stabilizes looped LLMs via spectral norm constraints on injection parameters, enabling power-law scaling for training FLOPs and saturating exponential scaling at test time that improves quality over fixed-depth...
- BiSpikCLM: A Spiking Language Model integrating Softmax-Free Spiking Attention and Spike-Aware Alignment Distillation
  BiSpikCLM is the first fully binary spiking MatMul-free causal language model that matches ANN performance on generation tasks using only 4-6 percent of the compute via softmax-free spiking attention and spike-aware d...
- Gated Linear Attention Transformers with Hardware-Efficient Training
  Gated linear attention Transformers achieve competitive language modeling results with linear-time inference, superior length generalization, and higher training throughput than Mamba.
- SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks
  SmoothLLM mitigates jailbreaking attacks on LLMs by randomly perturbing multiple copies of a prompt at the character level and aggregating the outputs to detect adversarial inputs.
- Textbooks Are All You Need II: phi-1.5 technical report
  phi-1.5 is a 1.3B parameter model trained on synthetic textbook data that matches the reasoning performance of models five times larger on natural language, math, and basic coding tasks.
- PaLM: Scaling Language Modeling with Pathways
  PaLM 540B demonstrates continued scaling benefits by setting new few-shot SOTA results on hundreds of benchmarks and outperforming humans on BIG-bench.
- MDN: Parallelizing Stepwise Momentum for Delta Linear Attention
  MDN parallelizes stepwise momentum for delta linear attention using geometric reordering and dynamical systems analysis, yielding performance gains over Mamba2 and GDN on 400M and 1.3B models.
- HyperLens: Quantifying Cognitive Effort in LLMs with Fine-grained Confidence Trajectory
  HyperLens reveals that deeper transformer layers magnify small confidence changes into fine-grained trajectories, allowing quantification of cognitive effort where complex tasks demand more and standard SFT can reduce it.
- The Cognitive Circuit Breaker: A Systems Engineering Framework for Intrinsic AI Reliability
  The Cognitive Circuit Breaker detects LLM hallucinations by computing the Cognitive Dissonance Delta between semantic confidence and latent certainty from hidden states, adding negligible overhead.
- Adaptive Spiking Neurons for Vision and Language Modeling
  ASN uses trainable parameters for adaptive membrane dynamics and firing in SNNs, with NASN adding normalization, and reports effectiveness across 19 vision and language datasets.
- Large Language Models: A Survey
  The paper surveys key large language models, their training methods, datasets, evaluation benchmarks, and future research directions in the field.
Reference graph
Works this paper leans on
- [1]
- [2] D. Chen, J. Bolton, and C. D. Manning. 2016. A thorough examination of the CNN/Daily Mail reading comprehension task. In ACL, pages 2358-2367.
- [3] D. Chen, A. Fisch, J. Weston, and A. Bordes. 2017a. Reading Wikipedia to answer open-domain questions. In ACL.
- [4] Q. Chen, X. Zhu, Z.-H. Ling, S. Wei, H. Jiang, and D. Inkpen. 2017b. Enhanced LSTM for natural language inference. In ACL, pages 1657-1668.
- [5] P. Clark, I. Cowhey, O. Etzioni, T. Khot, A. Sabharwal, C. Schoenick, and O. Tafjord. 2018. Think you have solved question answering? Try ARC, the AI2 Reasoning Challenge. CoRR, abs/1803.05457.
- [6]
- [7] A. Conneau, D. Kiela, H. Schwenk, L. Barrault, and A. Bordes. 2017. Supervised learning of universal sentence representations from natural language inference data. In EMNLP, pages 670-680.
- [8] M. Gardner, J. Grus, M. Neumann, O. Tafjord, P. Dasigi, N. F. Liu, M. Peters, M. Schmitz, and L. S. Zettlemoyer. 2017. AllenNLP: A deep semantic natural language processing platform. CoRR, abs/1803.07640.
- [9] S. Gururangan, S. Swayamdipta, O. Levy, R. Schwartz, S. R. Bowman, and N. A. Smith. 2018. Annotation artifacts in natural language inference data. In NAACL.
- [10] K. M. Hermann, T. Kocisky, E. Grefenstette, L. Espeholt, W. Kay, M. Suleyman, and P. Blunsom. 2015. Teaching machines to read and comprehend. In NIPS, pages 1693-1701.
- [11] F. Hill, A. Bordes, S. Chopra, and J. Weston. 2016. The Goldilocks principle: Reading children's books with explicit memory representations. In ICLR.
- [12]
- [13]
- [14] P. A. Jansen, E. Wainwright, S. Marmorstein, and C. T. Morrison. 2018. WorldTree: A corpus of explanation graphs for elementary science questions supporting multi-hop inference. In LREC.
- [15] T. Jenkins. 1995. Open book assessment in computing degree programmes. Technical Report 95.28, University of Leeds.
- [16]
- [17] A. Kembhavi, M. J. Seo, D. Schwenk, J. Choi, A. Farhadi, and H. Hajishirzi. 2017. Are you smarter than a sixth grader? Textbook question answering for multimodal machine comprehension. In CVPR, pages 5376-5384.
- [18] D. Khashabi, S. Chaturvedi, M. Roth, S. Upadhyay, and D. Roth. 2018. Looking beyond the surface: A challenge set for reading comprehension over multiple sentences. In NAACL.
- [19] D. Khashabi, T. Khot, A. Sabharwal, P. Clark, O. Etzioni, and D. Roth. 2016. Question answering via integer programming over semi-structured knowledge. In IJCAI.
- [20] T. Khot, A. Sabharwal, and P. Clark. 2017. Answering complex questions using open information extraction. In ACL.
- [21] T. Khot, A. Sabharwal, and P. Clark. 2018. SciTail: A textual entailment dataset from science question answering. In AAAI.
- [22] D. P. Kingma and J. L. Ba. 2015. Adam: A method for stochastic optimization. In ICLR, pages 1-15.
- [23] T. Kocisky, J. Schwarz, P. Blunsom, C. Dyer, K. M. Hermann, G. Melis, and E. Grefenstette. 2017. The NarrativeQA reading comprehension challenge. CoRR, abs/1712.07040.
- [24] J. Landsberger. 1996. Study guides and strategies. http://www.studygs.net/tsttak7.htm.
- [25] T. Mihaylov and A. Frank. 2016. Discourse relation sense classification using cross-argument semantic similarity based on word embeddings. In CoNLL-16 shared task, pages 100-107.
- [26] T. Mihaylov and A. Frank. 2017. Story cloze ending selection baselines and data examination. In LSDSem Shared Task.
- [27] T. Mihaylov and A. Frank. 2018. Knowledgeable Reader: Enhancing cloze-style reading comprehension with external commonsense knowledge. In ACL, pages 821-832.
- [28] T. Mihaylov and P. Nakov. 2016. SemanticZ at SemEval-2016 Task 3: Ranking relevant answers in community question answering using semantic similarity based on fine-tuned word embeddings. In SemEval '16.
- [29] G. A. Miller. 1995. WordNet: A lexical database for English. Communications of the ACM, 38(11):39-41.
- [30] G. A. Miller, R. Beckwith, C. Fellbaum, D. Gross, and K. J. Miller. 1990. Introduction to WordNet: An on-line lexical database. International Journal of Lexicography, 3(4):235-244.
- [31] B. D. Mishra, L. Huang, N. Tandon, W.-t. Yih, and P. Clark. 2018. Tracking state changes in procedural text: A challenge dataset and models for process paragraph comprehension. In NAACL.
- [32] N. Mostafazadeh, N. Chambers, X. He, D. Parikh, D. Batra, L. Vanderwende, P. Kohli, and J. Allen. 2016. A corpus and evaluation framework for deeper understanding of commonsense stories. In NAACL.
- [33]
- [34]
- [35]
- [36] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825-2830.
- [37] J. Pennington, R. Socher, and C. Manning. 2014. GloVe: Global vectors for word representation. In EMNLP, pages 1532-1543.
- [38] M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer. 2018. Deep contextualized word representations. In NAACL.
- [39] P. Rajpurkar, J. Zhang, K. Lopyrev, and P. Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In EMNLP, pages 2383-2392.
- [40] M. Richardson, C. J. Burges, and E. Renshaw. 2013. MCTest: A challenge dataset for the open-domain machine comprehension of text. In EMNLP, pages 193-203.
- [41]
- [42]
- [43] K. Stasaski and M. A. Hearst. 2017. Multiple choice question generation utilizing an ontology. In BEA@EMNLP, 12th Workshop on Innovative Use of NLP for Building Educational Applications.
- [44] S. Sugawara, H. Yokono, and A. Aizawa. 2017. Prerequisite skills for reading comprehension: Multi-perspective analysis of MCTest datasets and systems. In AAAI, pages 3089-3096.
- [45] A. Trischler, T. Wang, X. Yuan, J. Harris, A. Sordoni, P. Bachman, and K. Suleman. 2017. NewsQA: A machine comprehension dataset. In Proceedings of the 2nd Workshop on Representation Learning for NLP, pages 191-200.
- [46]
- [47] D. Weissenborn, G. Wiese, and L. Seiffe. 2017. Making neural QA as simple as possible but not simpler. In CoNLL, pages 271-280.
- [48]
- [49]