Perhaps PTLMs Should Go to School -- A Task to Assess Open Book and Closed Book QA
Pith reviewed 2026-05-25 08:23 UTC · model grok-4.3
The pith
Pre-trained language models (PTLMs) gain only about six points (from ~50% to 56%) on true/false questions drawn from a textbook after that textbook is added to their pre-training.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Taking the exam closed book, but having read the textbook (i.e., adding the textbook to T5's pre-training), yields at best minor improvement (56%), suggesting that the PTLM may not have 'understood' the textbook (or perhaps misunderstood the questions). Performance is better (~60%) when the exam is taken open-book (i.e., allowing the machine to automatically retrieve a paragraph and use it to answer the question).
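The open-book condition described above (retrieve one paragraph, then answer with it) can be pictured with a small sketch. The abstract does not say which retriever was used, so the TF-IDF-style overlap scoring here is a stand-in assumption, and the function names `tokenize` and `retrieve_paragraph` as well as the example paragraphs are hypothetical.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def retrieve_paragraph(statement, paragraphs):
    """Return the index of the paragraph that best matches the statement.

    Assumption: a simple sparse TF-IDF overlap score stands in for
    whatever retriever the paper actually used.
    """
    docs = [Counter(tokenize(p)) for p in paragraphs]
    n_docs = len(docs)
    # Document frequency: in how many paragraphs each token appears.
    df = Counter(tok for doc in docs for tok in doc)
    query = set(tokenize(statement))
    scores = []
    for doc in docs:
        # Term frequency times a smoothed inverse document frequency.
        score = sum(
            doc[tok] * math.log((n_docs + 1) / (df[tok] + 1))
            for tok in query
            if tok in doc
        )
        scores.append(score)
    # Ties resolve to the earliest paragraph.
    return max(range(n_docs), key=scores.__getitem__)

# Hypothetical paragraphs, written for illustration only.
example_paragraphs = [
    "The Constitution divides power among three branches of government.",
    "The Senate has one hundred members, two from each state.",
    "Federalism splits authority between national and state governments.",
]
best = retrieve_paragraph("Each state elects two members to the Senate.", example_paragraphs)
print(example_paragraphs[best])
```

In a full pipeline the retrieved paragraph would be passed to the model together with the statement; that step is omitted here because the paper's input format is not given in this summary.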
What carries the argument
A benchmark with true/false statements drawn from textbook review questions across two introductory college texts, using separate validation and blind test splits to compare closed-book and open-book QA performance.
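The chapter-based partition can be sketched as follows. This is a minimal illustration, assuming each item records the chapter it came from; the field names (`chapter`, `statement`, `label`), the function `split_by_chapter`, and the placeholder items are hypothetical and not taken from the released benchmark.

```python
def split_by_chapter(items, dev_max_chapter=8):
    """Partition items into a development set and a blind test set.

    Items from chapters 1..dev_max_chapter go to development; items from
    later chapters go to the blind test. The default of 8 follows the
    abstract's "first eight chapters".
    """
    dev = [it for it in items if it["chapter"] <= dev_max_chapter]
    test = [it for it in items if it["chapter"] > dev_max_chapter]
    return dev, test

# Placeholder items for illustration only (not real benchmark data).
example_items = [
    {"chapter": 1, "statement": "placeholder statement A", "label": True},
    {"chapter": 8, "statement": "placeholder statement B", "label": False},
    {"chapter": 9, "statement": "placeholder statement C", "label": True},
]
dev_items, test_items = split_by_chapter(example_items)
print(len(dev_items), len(test_items))
```

Splitting on chapter rather than at random keeps every blind-test item tied to textbook material that contributed no development questions.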
If this is right
- PTLMs do not effectively internalize factual content from textbooks during pre-training or general fine-tuning.
- Retrieval of relevant paragraphs outperforms pure closed-book recall on this material.
- Fine-tuning on existing QA datasets like BoolQ does not transfer to textbook-derived questions.
- New approaches may be needed to make models retain and apply knowledge from long instructional documents.
Where Pith is reading between the lines
- The task could diagnose whether future models retain document-level facts across domains beyond the two textbooks tested.
- Extending the setup to open-ended questions or other subjects would test the generality of the observed limitation.
- Models might benefit from training objectives that explicitly require reconstructing or reasoning over full documents rather than next-token prediction alone.
Load-bearing premise
That performance on true/false statements derived from the textbook authors' review questions is a valid measure of whether the PTLM has understood the textbook content rather than surface patterns or question phrasing.
What would settle it
A model achieving substantially higher accuracy than 60 percent on the blind test chapters in a closed-book setting after pre-training on the full textbooks would challenge the claim.
Original abstract
Our goal is to deliver a new task and leaderboard to stimulate research on question answering and pre-trained language models (PTLMs) to understand a significant instructional document, e.g., an introductory college textbook or a manual. PTLMs have shown great success in many question-answering tasks, given significant supervised training, but much less so in zero-shot settings. We propose a new task that includes two college-level introductory texts in the social sciences (American Government 2e) and humanities (U.S. History), hundreds of true/false statements based on review questions written by the textbook authors, validation/development tests based on the first eight chapters of the textbooks, blind tests based on the remaining textbook chapters, and baseline results given state-of-the-art PTLMs. Since the questions are balanced, random performance should be ~50%. T5, fine-tuned with BoolQ achieves the same performance, suggesting that the textbook's content is not pre-represented in the PTLM. Taking the exam closed book, but having read the textbook (i.e., adding the textbook to T5's pre-training), yields at best minor improvement (56%), suggesting that the PTLM may not have "understood" the textbook (or perhaps misunderstood the questions). Performance is better (~60%) when the exam is taken open-book (i.e., allowing the machine to automatically retrieve a paragraph and use it to answer the question).
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes a new task and leaderboard for assessing whether PTLMs can understand significant instructional documents such as college textbooks. It constructs hundreds of balanced true/false statements from author-written review questions in two textbooks (American Government 2e and U.S. History), with development tests on the first eight chapters and blind tests on the remainder. Baselines show T5 fine-tuned on BoolQ at ~50% (random), T5 with the textbook added to pre-training at 56% closed-book, and ~60% when allowing automatic paragraph retrieval for open-book answering; the authors interpret the modest closed-book gain as evidence that the PTLM may not have understood the textbook.
Significance. If the evaluation items validly proxy comprehension, the task would usefully highlight limitations of current PTLMs in integrating new document-level knowledge and stimulate work on open-book QA and pre-training methods. The creation of an author-derived, balanced, chapter-split benchmark from real textbooks is a concrete positive contribution that could support reproducible follow-up experiments.
major comments (3)
- [Abstract] Abstract: the central interpretive claim that 56% closed-book accuracy after adding the textbook to pre-training shows the PTLM 'may not have understood' the textbook rests on the untested assumption that these T/F items measure integrated content understanding rather than surface patterns, lexical overlap, or question phrasing; no ablations (paraphrased statements, negation controls, or adversarial rewordings) are described to isolate this risk, directly undermining the suggested conclusion.
- [Abstract] Abstract and experimental reporting: headline results (50%, 56%, 60%) are given without error bars, statistical tests, details on how the textbook was incorporated into T5 pre-training, the exact retrieval method for open-book, or confirmation of question balance per split, rendering it impossible to assess whether the minor lift is reliable or meaningful.
- [Task Description] Task construction: because the T/F statements are derived directly from the textbook authors' review questions (typically short, declarative, and stylistically uniform), performance may reflect sensitivity to author phrasing rather than textbook content; this is load-bearing for any claim about 'understanding' and requires explicit controls or justification.
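The point about missing error bars can be made concrete with a confidence interval on a single accuracy figure. The sketch below uses the Wilson score interval for a binomial proportion. The abstract does not give item counts, so the total of 300 and the 168 correct answers (56%) are purely hypothetical inputs, and `wilson_interval` is an illustrative name.

```python
import math

def wilson_interval(correct, total, z=1.96):
    """Wilson score interval for a binomial proportion.

    z=1.96 corresponds to an approximate 95% interval.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    p = correct / total
    denom = 1 + z * z / total
    centre = (p + z * z / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return centre - half_width, centre + half_width

# Hypothetical counts: the real number of test items is not stated in the abstract.
low, high = wilson_interval(correct=168, total=300)
print(f"{low:.3f} to {high:.3f}")
```

With only a few hundred items, an interval of this kind spans several percentage points on each side, which is why per-split item counts matter when reading a six-point lift.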
minor comments (2)
- Providing one or two concrete examples of the true/false statements in the main text would improve clarity on item style and difficulty.
- The description of the blind test split would benefit from explicit chapter counts and total question numbers per partition to allow readers to verify the development/test separation.
Simulated Author's Rebuttal
Thank you for the opportunity to respond to the referee's report. We find the comments constructive and will revise the manuscript to address the concerns raised regarding the abstract's claims, experimental reporting, and task construction.
- Referee: [Abstract] Abstract: the central interpretive claim that 56% closed-book accuracy after adding the textbook to pre-training shows the PTLM 'may not have understood' the textbook rests on the untested assumption that these T/F items measure integrated content understanding rather than surface patterns, lexical overlap, or question phrasing; no ablations (paraphrased statements, negation controls, or adversarial rewordings) are described to isolate this risk, directly undermining the suggested conclusion.
Authors: We agree that the interpretation would be strengthened by ablations to rule out surface-level cues. The modest improvement from 50% to 56% is presented cautiously as suggesting possible lack of understanding, but we acknowledge the need for controls. In revision, we will add a discussion of potential confounds and plan to include paraphrased test items in an updated version of the benchmark. We will also revise the abstract to present the result more as an observation rather than a strong claim about understanding. revision: partial
- Referee: [Abstract] Abstract and experimental reporting: headline results (50%, 56%, 60%) are given without error bars, statistical tests, details on how the textbook was incorporated into T5 pre-training, the exact retrieval method for open-book, or confirmation of question balance per split, rendering it impossible to assess whether the minor lift is reliable or meaningful.
Authors: We agree that these details are essential for reproducibility and assessment. We will update the abstract and add a dedicated experimental details section providing error bars (e.g., from multiple runs), statistical significance tests, specifics on continued pre-training (number of epochs, learning rate), the retrieval method (e.g., dense or sparse retrieval over paragraphs), and confirmation that each split maintains balance (approximately 50% true/false). revision: yes
- Referee: [Task Description] Task construction: because the T/F statements are derived directly from the textbook authors' review questions (typically short, declarative, and stylistically uniform), performance may reflect sensitivity to author phrasing rather than textbook content; this is load-bearing for any claim about 'understanding' and requires explicit controls or justification.
Authors: The review questions are written by the textbook authors specifically to test comprehension of the material, making them a natural choice for this benchmark. However, we recognize the validity of the concern about stylistic uniformity. We will expand the task description to provide justification for this approach and include it as a noted limitation. We will also consider adding a small set of paraphrased statements for comparison in future iterations. revision: partial
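Two of the checks promised in the responses above, label balance per split and error bars from multiple runs, are simple to compute. The sketch below is a minimal illustration; the function names `true_fraction` and `summarize_runs` are hypothetical, and the labels and per-run accuracies are placeholders, not results from the paper.

```python
from statistics import mean, stdev

def true_fraction(labels):
    """Fraction of items labelled True in one split (0.5 means balanced)."""
    if not labels:
        raise ValueError("labels must not be empty")
    return sum(1 for y in labels if y) / len(labels)

def summarize_runs(accuracies):
    """Mean and sample standard deviation of accuracy across runs.

    Requires at least two runs, since the sample standard deviation
    is undefined for a single value.
    """
    return mean(accuracies), stdev(accuracies)

# Placeholder inputs for illustration only.
placeholder_labels = [True, False, True, False]
placeholder_accuracies = [0.55, 0.57, 0.56]

print(true_fraction(placeholder_labels))
print(summarize_runs(placeholder_accuracies))
```

Reporting the true-label fraction separately for the development and blind test splits would confirm that the ~50% chance baseline applies to each of them.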
Circularity Check
Purely empirical task definition and baseline reporting; no derivations or self-referential reductions
The paper introduces a new QA task using true/false statements derived from textbook review questions and reports direct empirical accuracies for PTLMs under closed-book, textbook-augmented pre-training, and open-book conditions. No equations, fitted parameters renamed as predictions, uniqueness theorems, or self-citations are used to derive the central claims. All results are measurements against external benchmarks (random 50%, BoolQ fine-tuning, retrieval baselines), making the work self-contained with no load-bearing steps that reduce to the inputs by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: True/false questions derived from textbook review questions serve as a valid proxy for measuring comprehension of the textbook content.