Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering
Pith reviewed 2026-05-13 08:10 UTC · model grok-4.3
The pith
Many state-of-the-art pre-trained QA methods perform worse than simple neural baselines on questions that combine science facts with common knowledge.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
OpenBookQA requires a model to select the right fact from a small open book and combine it with external common knowledge to answer questions about novel situations. Human solvers achieve close to 92 percent accuracy, but many state-of-the-art pre-trained QA methods perform surprisingly poorly and fall below several simple neural baselines developed in the paper. Oracle experiments that remove the retrieval step demonstrate the value of both the open-book facts and the additional common-knowledge facts.
What carries the argument
The OpenBookQA dataset, which supplies a compact set of science facts and forces models to retrieve one fact and integrate it with unstated common knowledge.
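To make the retrieval half of this concrete, here is a minimal sketch of the fact-selection step; the word-overlap scorer and the three example facts are illustrative assumptions, not the paper's retriever or its released open book.

```python
# A minimal sketch of the retrieval step, assuming a tiny illustrative fact
# list rather than the released 1,329-fact open book; word overlap stands in
# for whatever retriever a real system would use.

def tokenize(text):
    """Lowercase the text and return its set of punctuation-stripped tokens."""
    return {w.strip(".,?!;:\"'()") for w in text.lower().split()} - {""}

def retrieve_fact(question, facts):
    """Return the fact sharing the most tokens with the question."""
    q_tokens = tokenize(question)
    return max(facts, key=lambda fact: len(tokenize(fact) & q_tokens))

# Illustrative stand-ins for the open book and a dataset question.
facts = [
    "metals conduct electricity",
    "plants require sunlight to grow",
    "water expands when it freezes",
]
question = "Can a suit of armor conduct electricity?"

print(retrieve_fact(question, facts))  # -> "metals conduct electricity"
# The second hop (that a suit of armor is made of metal) is the unstated
# common knowledge the model must still supply on its own.
```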
If this is right
- Pre-trained QA systems have a measurable deficit when forced to integrate retrieved facts with external common knowledge.
- Simple neural baselines remain competitive and sometimes superior on this style of question.
- Supplying the correct fact in an oracle setting lifts performance, confirming that both the fact and the additional knowledge are load-bearing.
- Solving multi-hop retrieval over a small knowledge base plus outside facts is the main remaining obstacle to human-level results.
Where Pith is reading between the lines
- The gap may persist even with larger pre-training unless models gain explicit mechanisms for pulling in unstated facts.
- The same open-book-plus-common-knowledge format could be applied to other subjects to test whether current methods generalize beyond pattern matching.
- Small, curated fact sets paired with targeted questions may expose reasoning limits that large unstructured corpora obscure.
Load-bearing premise
The questions cannot be solved by linguistic patterns or surface cues alone and genuinely require combining the stated fact with outside common knowledge.
What would settle it
A model that reaches near-human accuracy while denied access to the open-book facts, or while relying only on question wording, would show that the dataset does not test the intended integration.
read the original abstract
We present a new kind of question answering dataset, OpenBookQA, modeled after open book exams for assessing human understanding of a subject. The open book that comes with our questions is a set of 1329 elementary level science facts. Roughly 6000 questions probe an understanding of these facts and their application to novel situations. This requires combining an open book fact (e.g., metals conduct electricity) with broad common knowledge (e.g., a suit of armor is made of metal) obtained from other sources. While existing QA datasets over documents or knowledge bases, being generally self-contained, focus on linguistic understanding, OpenBookQA probes a deeper understanding of both the topic---in the context of common knowledge---and the language it is expressed in. Human performance on OpenBookQA is close to 92%, but many state-of-the-art pre-trained QA methods perform surprisingly poorly, worse than several simple neural baselines we develop. Our oracle experiments designed to circumvent the knowledge retrieval bottleneck demonstrate the value of both the open book and additional facts. We leave it as a challenge to solve the retrieval problem in this multi-hop setting and to close the large gap to human performance.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces OpenBookQA, a new QA dataset modeled on open-book exams, consisting of 1329 elementary science facts and approximately 6000 multiple-choice questions. The questions are designed to require combining a provided fact with external common knowledge. The paper reports that state-of-the-art pre-trained QA models perform poorly on this dataset, underperforming several simple neural baselines developed by the authors, while humans reach ~92% accuracy. Oracle experiments that supply the relevant facts demonstrate their value and highlight the retrieval challenge.
Significance. If the questions genuinely require multi-hop integration of the open-book facts with common knowledge, this dataset provides a valuable benchmark for advancing QA systems beyond pattern matching toward deeper reasoning. The release of the facts, questions, and baselines is a concrete contribution that can be used immediately by the community.
major comments (1)
- [Experiments] The central interpretation that SOTA models fail due to inability to combine facts with common knowledge rests on the assumption that questions cannot be solved via linguistic cues alone. The oracle experiments (described in the results section) show gains when facts are supplied, but the manuscript does not report an explicit cue-only baseline (model performance on question + choices with no facts provided). This control is needed to quantify how much of the reported gap is attributable to knowledge integration versus annotation artifacts or surface patterns.
minor comments (2)
- [Dataset] In the dataset construction section, the process for ensuring that each question requires the specific open-book fact (rather than being answerable from the question text alone) could be described more explicitly, including any filtering steps applied after crowdsourcing.
- [Introduction] Figure 1 (example question) would benefit from an additional row showing the model predictions of the simple baselines versus the SOTA systems to illustrate the performance gap visually.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. The suggestion to include an explicit cue-only baseline is valuable for strengthening the interpretation of our results, and we will revise the manuscript to address this.
read point-by-point responses
Referee: [Experiments] The central interpretation that SOTA models fail due to inability to combine facts with common knowledge rests on the assumption that questions cannot be solved via linguistic cues alone. The oracle experiments (described in the results section) show gains when facts are supplied, but the manuscript does not report an explicit cue-only baseline (model performance on question + choices with no facts provided). This control is needed to quantify how much of the reported gap is attributable to knowledge integration versus annotation artifacts or surface patterns.
Authors: We agree that this control experiment is important for isolating the contribution of knowledge integration. In the revised manuscript, we will add results for all models (including the SOTA pre-trained QA systems and our simple neural baselines) when trained and evaluated on question text plus answer choices only, with no facts from the open book provided. This will allow us to quantify the performance attributable to surface patterns or annotation artifacts versus the need to combine the open-book facts with common knowledge. We will also update the discussion and oracle analysis sections to reference these new numbers.
revision: yes
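To illustrate what such a cue-only control might look like, here is a rough sketch that scores answer choices against the question wording alone, with no open-book fact supplied; the overlap heuristic, the toy item, and the accuracy helper are assumptions made for illustration, not the authors' baselines or data.

```python
# A rough sketch of the cue-only control: pick an answer choice using only
# the question text and the choices themselves, with no open-book fact.
# The overlap heuristic and the single made-up item are illustrative.

def tokenize(text):
    """Lowercase the text and return its set of punctuation-stripped tokens."""
    return {w.strip(".,?!;:\"'()") for w in text.lower().split()} - {""}

def cue_only_predict(question, choices):
    """Pick the choice overlapping most with the question; ties go to the first."""
    q_tokens = tokenize(question)
    scores = [len(tokenize(choice) & q_tokens) for choice in choices]
    return scores.index(max(scores))

def accuracy(examples):
    """examples: list of (question, choices, gold_index) triples."""
    correct = sum(cue_only_predict(q, ch) == gold for q, ch, gold in examples)
    return correct / len(examples)

# One made-up item, only to show the interface; a real evaluation would
# iterate over the released OpenBookQA questions.
toy_examples = [
    ("Which of these would let current flow through a circuit?",
     ["a wooden spoon", "a rubber band", "a steel fork", "a glass cup"],
     2),
]
print(f"cue-only accuracy on the toy item: {accuracy(toy_examples):.2f}")
```

If such question-plus-choices-only scores stay well below the full models, the gap can more safely be attributed to knowledge integration rather than surface cues.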
Circularity Check
No circularity: empirical dataset introduction and benchmarking
full rationale
The paper introduces the OpenBookQA dataset and reports direct empirical evaluations of QA methods against it, human performance, and simple baselines. No equations, parameter fittings, derivations, or self-citations form any load-bearing chain that reduces results to inputs by construction. Claims rest on new data collection and standard accuracy measurements, which are externally verifiable and independent of the paper's own outputs.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: the provided elementary science facts are accurate and sufficient when combined with common knowledge.
Forward citations
Cited by 32 Pith papers
- Language Models are Few-Shot Learners
  GPT-3 shows that scaling an autoregressive language model to 175 billion parameters enables strong few-shot performance across diverse NLP tasks via in-context prompting without fine-tuning.
- Boosting Omni-Modal Language Models: Staged Post-Training with Visually Debiased Evaluation
  Visual debiasing of omni-modal benchmarks combined with staged post-training lets a 3B model match or exceed a 30B model without a stronger teacher.
- LoopUS: Recasting Pretrained LLMs into Looped Latent Refinement Models
  LoopUS converts pretrained LLMs into looped latent refinement models via block decomposition, selective gating, random deep supervision, and confidence-based early exiting to improve reasoning performance.
- HybridGen: Efficient LLM Generative Inference via CPU-GPU Hybrid Computing
  HybridGen achieves 1.41x-3.2x average speedups over six prior KV cache methods for LLM inference by using attention logit parallelism, a feedback-driven scheduler, and semantic-aware KV cache mapping.
- Winner-Take-All Spiking Transformer for Language Modeling
  Winner-take-all spiking self-attention replaces softmax in spiking transformers to support language modeling on 16 datasets with spike-driven, energy-efficient architectures.
- A Switch-Centric In-Network Architecture for Accelerating LLM Inference in Shared-Memory Network
  SCIN uses an in-switch accelerator for direct memory access and 8-bit in-network quantization during All-Reduce, delivering up to 8.7x faster small-message reduction and 1.74x TTFT speedup on LLaMA-2 models.
- Path-Constrained Mixture-of-Experts
  PathMoE constrains expert paths in MoE models by sharing router parameters across layer blocks, yielding more concentrated paths, better performance on perplexity and tasks, and no need for auxiliary losses.
- Scaling Latent Reasoning via Looped Language Models
  Looped language models with latent iterative computation and entropy-regularized depth allocation achieve performance matching up to 12B standard LLMs through superior knowledge manipulation.
- Moshi: a speech-text foundation model for real-time dialogue
  Moshi is the first real-time full-duplex spoken large language model that casts dialogue as speech-to-speech generation using parallel audio streams and an inner monologue of time-aligned text tokens.
- Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality
  Transformers and SSMs are unified through structured state space duality, producing a 2-8X faster Mamba-2 model that remains competitive with Transformers.
- Self-Rewarding Language Models
  Iterative self-rewarding via LLM-as-Judge in DPO training on Llama 2 70B improves instruction following and self-evaluation, outperforming GPT-4 on AlpacaEval 2.0.
- OPT: Open Pre-trained Transformer Language Models
  OPT releases open decoder-only transformers up to 175B parameters that match GPT-3 performance at one-seventh the carbon cost, along with code and training logs.
- Boosting Omni-Modal Language Models: Staged Post-Training with Visually Debiased Evaluation
  Staged post-training with self-distillation lets a 3B omni-modal model match or slightly exceed a 30B model on a visually debiased benchmark.
- Fitting Is Not Enough: Smoothness in Extremely Quantized LLMs
  Extremely quantized LLMs degrade in smoothness, sparsifying the decoding tree and hurting generation quality; a smoothness-preserving principle delivers gains beyond numerical fitting.
- Revealing Modular Gradient Noise Imbalance in LLMs: Calibrating Adam via Signal-to-Noise Ratio
  MoLS scales Adam updates using module-level SNR estimates to correct gradient noise imbalance and improve LLM training convergence and generalization.
- Long-Context Aware Upcycling: A New Frontier for Hybrid LLM Scaling
  HyLo upcycles Transformer LLMs into hybrids with MLA and Mamba2/Gated DeltaNet blocks via staged training and distillation, extending context to 2M tokens and outperforming prior upcycled hybrids on long-context benchmarks.
- GRASPrune: Global Gating for Budgeted Structured Pruning of Large Language Models
  GRASPrune removes 50% of parameters from LLaMA-2-7B via global gating and projected straight-through estimation, reaching 12.18 WikiText-2 perplexity and competitive zero-shot accuracy after four epochs on 512 calibra...
- SAMoRA: Semantic-Aware Mixture of LoRA Experts for Task-Adaptive Learning
  SAMoRA is a parameter-efficient fine-tuning framework that uses semantic-aware routing and task-adaptive scaling within a Mixture of LoRA Experts to improve multi-task performance and generalization over prior methods.
- TLoRA: Task-aware Low Rank Adaptation of Large Language Models
  TLoRA jointly optimizes LoRA initialization via task-data SVD and sensitivity-driven rank allocation, delivering stronger results than standard LoRA across NLU, reasoning, math, code, and chat tasks while using fewer ...
- Representation-Guided Parameter-Efficient LLM Unlearning
  REGLU guides LoRA-based unlearning via representation subspaces and orthogonal regularization to outperform prior methods on forget-retain trade-off in LLM benchmarks.
- Robust Ultra Low-Bit Post-Training Quantization via Stable Diagonal Curvature Estimate
  DASH-Q uses a stable diagonal curvature estimate and weighted least squares to achieve robust ultra-low-bit post-training quantization of LLMs, improving zero-shot accuracy by 7% on average over baselines.
- Parcae: Scaling Laws For Stable Looped Language Models
  Parcae stabilizes looped LLMs via spectral norm constraints on injection parameters, enabling power-law scaling for training FLOPs and saturating exponential scaling at test time that improves quality over fixed-depth...
- BiSpikCLM: A Spiking Language Model integrating Softmax-Free Spiking Attention and Spike-Aware Alignment Distillation
  BiSpikCLM is the first fully binary spiking MatMul-free causal language model that matches ANN performance on generation tasks using only 4-6 percent of the compute via softmax-free spiking attention and spike-aware d...
- Gated Linear Attention Transformers with Hardware-Efficient Training
  Gated linear attention Transformers achieve competitive language modeling results with linear-time inference, superior length generalization, and higher training throughput than Mamba.
- SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks
  SmoothLLM mitigates jailbreaking attacks on LLMs by randomly perturbing multiple copies of a prompt at the character level and aggregating the outputs to detect adversarial inputs.
- Textbooks Are All You Need II: phi-1.5 technical report
  phi-1.5 is a 1.3B parameter model trained on synthetic textbook data that matches the reasoning performance of models five times larger on natural language, math, and basic coding tasks.
- PaLM: Scaling Language Modeling with Pathways
  PaLM 540B demonstrates continued scaling benefits by setting new few-shot SOTA results on hundreds of benchmarks and outperforming humans on BIG-bench.
- MDN: Parallelizing Stepwise Momentum for Delta Linear Attention
  MDN parallelizes stepwise momentum for delta linear attention using geometric reordering and dynamical systems analysis, yielding performance gains over Mamba2 and GDN on 400M and 1.3B models.
- HyperLens: Quantifying Cognitive Effort in LLMs with Fine-grained Confidence Trajectory
  HyperLens reveals that deeper transformer layers magnify small confidence changes into fine-grained trajectories, allowing quantification of cognitive effort where complex tasks demand more and standard SFT can reduce it.
- The Cognitive Circuit Breaker: A Systems Engineering Framework for Intrinsic AI Reliability
  The Cognitive Circuit Breaker detects LLM hallucinations by computing the Cognitive Dissonance Delta between semantic confidence and latent certainty from hidden states, adding negligible overhead.
- Adaptive Spiking Neurons for Vision and Language Modeling
  ASN uses trainable parameters for adaptive membrane dynamics and firing in SNNs, with NASN adding normalization, and reports effectiveness across 19 vision and language datasets.
- Large Language Models: A Survey
  The paper surveys key large language models, their training methods, datasets, evaluation benchmarks, and future research directions in the field.
Reference graph
Works this paper leans on
- [1]
- [2] D. Chen, J. Bolton, and C. D. Manning. 2016. A thorough examination of the CNN/Daily Mail reading comprehension task. In ACL, pages 2358-2367.
- [3] D. Chen, A. Fisch, J. Weston, and A. Bordes. 2017a. Reading Wikipedia to answer open-domain questions. In ACL.
- [4] Q. Chen, X. Zhu, Z.-H. Ling, S. Wei, H. Jiang, and D. Inkpen. 2017b. Enhanced LSTM for natural language inference. In ACL, pages 1657-1668.
- [5] P. Clark, I. Cowhey, O. Etzioni, T. Khot, A. Sabharwal, C. Schoenick, and O. Tafjord. 2018. Think you have solved question answering? Try ARC, the AI2 Reasoning Challenge. CoRR, abs/1803.05457.
- [6]
- [7] A. Conneau, D. Kiela, H. Schwenk, L. Barrault, and A. Bordes. 2017. Supervised learning of universal sentence representations from natural language inference data. In EMNLP, pages 670-680.
- [8] M. Gardner, J. Grus, M. Neumann, O. Tafjord, P. Dasigi, N. F. Liu, M. Peters, M. Schmitz, and L. S. Zettlemoyer. 2017. AllenNLP: A deep semantic natural language processing platform. CoRR, abs/1803.07640.
- [9] S. Gururangan, S. Swayamdipta, O. Levy, R. Schwartz, S. R. Bowman, and N. A. Smith. 2018. Annotation artifacts in natural language inference data. In NAACL.
- [10] K. M. Hermann, T. Kocisky, E. Grefenstette, L. Espeholt, W. Kay, M. Suleyman, and P. Blunsom. 2015. Teaching machines to read and comprehend. In NIPS, pages 1693-1701.
- [11] F. Hill, A. Bordes, S. Chopra, and J. Weston. 2016. The Goldilocks principle: Reading children's books with explicit memory representations. In ICLR.
- [12]
- [13]
- [14] P. A. Jansen, E. Wainwright, S. Marmorstein, and C. T. Morrison. 2018. WorldTree: A corpus of explanation graphs for elementary science questions supporting multi-hop inference. In LREC.
- [15] T. Jenkins. 1995. Open book assessment in computing degree programmes. Technical Report 95.28, University of Leeds.
- [16]
- [17] A. Kembhavi, M. J. Seo, D. Schwenk, J. Choi, A. Farhadi, and H. Hajishirzi. 2017. Are you smarter than a sixth grader? Textbook question answering for multimodal machine comprehension. In CVPR, pages 5376-5384.
- [18] D. Khashabi, S. Chaturvedi, M. Roth, S. Upadhyay, and D. Roth. 2018. Looking beyond the surface: A challenge set for reading comprehension over multiple sentences. In NAACL.
- [19] D. Khashabi, T. Khot, A. Sabharwal, P. Clark, O. Etzioni, and D. Roth. 2016. Question answering via integer programming over semi-structured knowledge. In IJCAI.
- [20] T. Khot, A. Sabharwal, and P. Clark. 2017. Answering complex questions using open information extraction. In ACL.
- [21] T. Khot, A. Sabharwal, and P. Clark. 2018. SciTail: A textual entailment dataset from science question answering. In AAAI.
- [22] D. P. Kingma and J. L. Ba. 2015. Adam: A method for stochastic optimization. In ICLR, pages 1-15.
- [23] T. Kocisky, J. Schwarz, P. Blunsom, C. Dyer, K. M. Hermann, G. Melis, and E. Grefenstette. 2017. The NarrativeQA reading comprehension challenge. CoRR, abs/1712.07040.
- [24] J. Landsberger. 1996. Study guides and strategies. http://www.studygs.net/tsttak7.htm.
- [25] T. Mihaylov and A. Frank. 2016. Discourse relation sense classification using cross-argument semantic similarity based on word embeddings. In CoNLL-16 shared task, pages 100-107.
- [26] T. Mihaylov and A. Frank. 2017. Story cloze ending selection baselines and data examination. In LSDSem Shared Task.
- [27] T. Mihaylov and A. Frank. 2018. Knowledgeable Reader: Enhancing cloze-style reading comprehension with external commonsense knowledge. In ACL, pages 821-832.
- [28] T. Mihaylov and P. Nakov. 2016. SemanticZ at SemEval-2016 Task 3: Ranking relevant answers in community question answering using semantic similarity based on fine-tuned word embeddings. In SemEval '16.
- [29] G. A. Miller. 1995. WordNet: A lexical database for English. Communications of the ACM, 38(11):39-41.
- [30] G. A. Miller, R. Beckwith, C. Fellbaum, D. Gross, and K. J. Miller. 1990. Introduction to WordNet: An on-line lexical database. International Journal of Lexicography, 3(4):235-244.
- [31] B. D. Mishra, L. Huang, N. Tandon, W.-t. Yih, and P. Clark. 2018. Tracking state changes in procedural text: A challenge dataset and models for process paragraph comprehension. In NAACL.
- [32] N. Mostafazadeh, N. Chambers, X. He, D. Parikh, D. Batra, L. Vanderwende, P. Kohli, and J. Allen. 2016. A corpus and evaluation framework for deeper understanding of commonsense stories. In NAACL.
- [33]
- [34]
- [35]
- [36] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825-2830.
- [37] J. Pennington, R. Socher, and C. Manning. 2014. GloVe: Global vectors for word representation. In EMNLP, pages 1532-1543.
- [38] M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer. 2018. Deep contextualized word representations. In NAACL.
- [39] P. Rajpurkar, J. Zhang, K. Lopyrev, and P. Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In EMNLP, pages 2383-2392.
- [40] M. Richardson, C. J. Burges, and E. Renshaw. 2013. MCTest: A challenge dataset for the open-domain machine comprehension of text. In EMNLP, pages 193-203.
- [41]
- [42]
- [43] K. Stasaski and M. A. Hearst. 2017. Multiple choice question generation utilizing an ontology. In BEA@EMNLP, 12th Workshop on Innovative Use of NLP for Building Educational Applications.
- [44] S. Sugawara, H. Yokono, and A. Aizawa. 2017. Prerequisite skills for reading comprehension: Multi-perspective analysis of MCTest datasets and systems. In AAAI, pages 3089-3096.
- [45] A. Trischler, T. Wang, X. Yuan, J. Harris, A. Sordoni, P. Bachman, and K. Suleman. 2017. NewsQA: A machine comprehension dataset. In Proceedings of the 2nd Workshop on Representation Learning for NLP, pages 191-200.
- [46]
- [47] D. Weissenborn, G. Wiese, and L. Seiffe. 2017. Making neural QA as simple as possible but not simpler. In CoNLL, pages 271-280.
- [48]
- [49]