Perhaps PTLMs Should Go to School -- A Task to Assess Open Book and Closed Book QA

Alex Hedges; Dong-Ho Lee; Joe Cecil; Manuel R. Ciosici; Marjorie Freedman; Ralph Weischedel

arxiv: 2110.01552 · v1 · pith:PKPA7B3Knew · submitted 2021-10-04 · 💻 cs.CL · cs.AI · cs.LG

Manuel R. Ciosici, Joe Cecil, Alex Hedges, Dong-Ho Lee, Marjorie Freedman, Ralph Weischedel

Pith reviewed 2026-05-25 08:23 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.LG

keywords question answering · pre-trained language models · closed book QA · open book QA · textbook understanding · true/false evaluation

The pith

PTLMs show only minor gains when pre-trained on textbook content before answering true/false questions from it.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper creates a task and leaderboard to measure whether pre-trained language models can understand the content of full college textbooks. It derives hundreds of true/false statements from the textbooks' own review questions, splits them into development and blind test sets, and runs baselines on state-of-the-art models. Standard fine-tuning produces random-level results, adding the textbook to pre-training yields only a small lift, and allowing paragraph retrieval improves scores further. The results indicate that current PTLMs do not reliably encode or apply textbook knowledge even after direct exposure.

Core claim

Taking the exam closed book, but having read the textbook (i.e., adding the textbook to T5's pre-training), yields at best minor improvement (56%), suggesting that the PTLM may not have 'understood' the textbook (or perhaps misunderstood the questions). Performance is better (~60%) when the exam is taken open-book (i.e., allowing the machine to automatically retrieve a paragraph and use it to answer the question).
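The open-book condition quoted above has the machine retrieve a paragraph and answer from it. The abstract does not say which retriever was used, so the following is only a minimal sketch that assumes a plain BM25 ranking over paragraphs. The function names, the `k1` and `b` defaults, and the toy paragraphs are all illustrative assumptions, not the authors' pipeline.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it on non-alphanumeric characters."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_rank(query, paragraphs, k1=1.5, b=0.75):
    """Return paragraph indices ordered from highest to lowest BM25 score.

    Assumption: BM25 stands in for whatever retriever the paper used.
    """
    docs = [tokenize(p) for p in paragraphs]
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of paragraphs that contain each term.
    df = Counter(term for d in docs for term in set(d))
    query_terms = set(tokenize(query))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return sorted(range(n_docs), key=lambda i: scores[i], reverse=True)


# Hypothetical toy paragraphs, not taken from either textbook.
paragraphs = [
    "The Senate has one hundred members, two from each state.",
    "The cotton gin changed agriculture in the southern states.",
    "Federal judges are appointed by the president and confirmed by the Senate.",
]
statement = "Each state elects two members to the Senate."
best = bm25_rank(statement, paragraphs)[0]
print(paragraphs[best])
```

In an open-book run, the top-ranked paragraph would be passed to the model together with the statement; in a closed-book run, the model would see the statement alone.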

What carries the argument

A benchmark with true/false statements drawn from textbook review questions across two introductory college texts, using separate validation and blind test splits to compare closed-book and open-book QA performance.
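The chapter-based split can be sketched in a few lines. This is a hedged illustration only: the `Statement` record, its field names, the assumption that chapters are numbered from 1, and the toy items are hypothetical, while the cut-off of eight development chapters comes from the abstract.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Statement:
    """One true/false item (hypothetical record layout)."""
    book: str
    chapter: int
    text: str
    label: bool


def split_by_chapter(items, dev_chapters=8):
    """Put chapters 1..dev_chapters in dev and all later chapters in the blind test."""
    dev = [s for s in items if s.chapter <= dev_chapters]
    blind = [s for s in items if s.chapter > dev_chapters]
    return dev, blind


def true_fraction(items):
    """Share of items labeled True; close to 0.5 for a balanced split."""
    return sum(s.label for s in items) / len(items)


# Toy items for illustration only, not real benchmark statements.
items = [
    Statement("American Government 2e", 1, "Toy true statement.", True),
    Statement("American Government 2e", 8, "Toy false statement.", False),
    Statement("U.S. History", 9, "Toy true statement.", True),
    Statement("U.S. History", 12, "Toy false statement.", False),
]
dev, blind = split_by_chapter(items)
print(len(dev), true_fraction(dev))
print(len(blind), true_fraction(blind))
```

Reporting `true_fraction` per split is one simple way to confirm that the 50% chance baseline holds for both the development and the blind partitions.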

If this is right

PTLMs do not effectively internalize factual content from textbooks during pre-training or general fine-tuning.
Retrieval of relevant paragraphs outperforms pure closed-book recall on this material.
Fine-tuning on existing QA datasets like BoolQ does not transfer to textbook-derived questions.
New approaches may be needed to make models retain and apply knowledge from long instructional documents.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The task could diagnose whether future models retain document-level facts across domains beyond the two textbooks tested.
Extending the setup to open-ended questions or other subjects would test the generality of the observed limitation.
Models might benefit from training objectives that explicitly require reconstructing or reasoning over full documents rather than next-token prediction alone.

Load-bearing premise

That performance on true/false statements derived from the textbook authors' review questions is a valid measure of whether the PTLM has understood the textbook content rather than surface patterns or question phrasing.

What would settle it

A model achieving substantially higher accuracy than 60 percent on the blind test chapters in a closed-book setting after pre-training on the full textbooks would challenge the claim.

Original abstract

Our goal is to deliver a new task and leaderboard to stimulate research on question answering and pre-trained language models (PTLMs) to understand a significant instructional document, e.g., an introductory college textbook or a manual. PTLMs have shown great success in many question-answering tasks, given significant supervised training, but much less so in zero-shot settings. We propose a new task that includes two college-level introductory texts in the social sciences (American Government 2e) and humanities (U.S. History), hundreds of true/false statements based on review questions written by the textbook authors, validation/development tests based on the first eight chapters of the textbooks, blind tests based on the remaining textbook chapters, and baseline results given state-of-the-art PTLMs. Since the questions are balanced, random performance should be ~50%. T5, fine-tuned with BoolQ achieves the same performance, suggesting that the textbook's content is not pre-represented in the PTLM. Taking the exam closed book, but having read the textbook (i.e., adding the textbook to T5's pre-training), yields at best minor improvement (56%), suggesting that the PTLM may not have "understood" the textbook (or perhaps misunderstood the questions). Performance is better (~60%) when the exam is taken open-book (i.e., allowing the machine to automatically retrieve a paragraph and use it to answer the question).

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper introduces a textbook-derived T/F QA task with closed-book pre-training baselines, but the modest gains and lack of ablations leave the 'no understanding' interpretation open to question.

The main takeaway is a concrete task that uses author-written review questions from two college textbooks to compare closed-book performance, after adding the text to T5 pre-training, against open-book retrieval. Closed book reaches at best 56% and open book about 60%, against a 50% random baseline that a BoolQ-fine-tuned T5 also matches. The setup is new in its choice of sources and in the direct contrast between pre-training and retrieval. It gives a usable starting point for anyone measuring whether PTLMs absorb instructional documents rather than just surface patterns from web data. The construction is straightforward and the splits (first eight chapters for dev, the rest for the blind test) are sensible.

The soft spot is the evaluation itself. True/false items pulled straight from review questions are often short and declarative, which raises the chance that models exploit lexical cues or negation instead of integrated content. No ablations on paraphrased or adversarial versions are described, so the move from 50% to 56% is hard to read as clear evidence against comprehension. The abstract also omits how the textbook was inserted into pre-training and which retrieval method was used, which makes the numbers harder to interpret without the full methods.

This work is aimed at groups building or evaluating knowledge-intensive models and retrieval-augmented systems. It is worth sending to peer review because the task definition is specific and falsifiable, even if the current interpretation needs tightening.

Referee Report

3 major / 2 minor

Summary. The paper proposes a new task and leaderboard for assessing whether PTLMs can understand significant instructional documents such as college textbooks. It constructs hundreds of balanced true/false statements from author-written review questions in two textbooks (American Government 2e and U.S. History), with development tests on the first eight chapters and blind tests on the remainder. Baselines show T5 fine-tuned on BoolQ at ~50% (random), T5 with the textbook added to pre-training at 56% closed-book, and ~60% when allowing automatic paragraph retrieval for open-book answering; the authors interpret the modest closed-book gain as evidence that the PTLM may not have understood the textbook.

Significance. If the evaluation items validly proxy comprehension, the task would usefully highlight limitations of current PTLMs in integrating new document-level knowledge and stimulate work on open-book QA and pre-training methods. The creation of an author-derived, balanced, chapter-split benchmark from real textbooks is a concrete positive contribution that could support reproducible follow-up experiments.

major comments (3)

[Abstract] Abstract: the central interpretive claim that 56% closed-book accuracy after adding the textbook to pre-training shows the PTLM 'may not have understood' the textbook rests on the untested assumption that these T/F items measure integrated content understanding rather than surface patterns, lexical overlap, or question phrasing; no ablations (paraphrased statements, negation controls, or adversarial rewordings) are described to isolate this risk, directly undermining the suggested conclusion.
[Abstract] Abstract and experimental reporting: headline results (50%, 56%, 60%) are given without error bars, statistical tests, details on how the textbook was incorporated into T5 pre-training, the exact retrieval method for open-book, or confirmation of question balance per split, rendering it impossible to assess whether the minor lift is reliable or meaningful.
[Task Description] Task construction: because the T/F statements are derived directly from the textbook authors' review questions (typically short, declarative, and stylistically uniform), performance may reflect sensitivity to author phrasing rather than textbook content; this is load-bearing for any claim about 'understanding' and requires explicit controls or justification.
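The second major comment turns on whether 56% is distinguishable from the 50% chance level, which depends on how many test statements there are. The abstract says only "hundreds", so the sketch below uses two hypothetical test-set sizes (200 and 1000) to show how a Wilson score interval and a one-sided exact binomial test against chance could be computed. The sizes and function names are assumptions for illustration, not figures from the paper.

```python
import math


def wilson_interval(correct, total, z=1.96):
    """Wilson score interval for an accuracy (z=1.96 gives roughly 95%)."""
    p = correct / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return center - half, center + half


def p_value_vs_chance(correct, total):
    """One-sided exact p-value: P(X >= correct) for X ~ Binomial(total, 0.5)."""
    tail = sum(math.comb(total, k) for k in range(correct, total + 1))
    return tail / 2**total


# Hypothetical test-set sizes; both cases correspond to 56% accuracy.
for correct, total in [(112, 200), (560, 1000)]:
    low, high = wilson_interval(correct, total)
    p = p_value_vs_chance(correct, total)
    print(f"n={total}: 95% interval [{low:.3f}, {high:.3f}], p={p:.4f}")
```

Under these assumed sizes, the interval for 200 items still contains 0.5, while the interval for 1000 items lies entirely above it, which is why the report asks for question counts and error bars alongside the headline accuracies.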

minor comments (2)

Providing one or two concrete examples of the true/false statements in the main text would improve clarity on item style and difficulty.
The description of the blind test split would benefit from explicit chapter counts and total question numbers per partition to allow readers to verify the development/test separation.

Simulated Author's Rebuttal

3 responses · 0 unresolved

Thank you for the opportunity to respond to the referee's report. We find the comments constructive and will revise the manuscript to address the concerns raised regarding the abstract's claims, experimental reporting, and task construction.

Referee: [Abstract] Abstract: the central interpretive claim that 56% closed-book accuracy after adding the textbook to pre-training shows the PTLM 'may not have understood' the textbook rests on the untested assumption that these T/F items measure integrated content understanding rather than surface patterns, lexical overlap, or question phrasing; no ablations (paraphrased statements, negation controls, or adversarial rewordings) are described to isolate this risk, directly undermining the suggested conclusion.

Authors: We agree that the interpretation would be strengthened by ablations to rule out surface-level cues. The modest improvement from 50% to 56% is presented cautiously as suggesting possible lack of understanding, but we acknowledge the need for controls. In revision, we will add a discussion of potential confounds and plan to include paraphrased test items in an updated version of the benchmark. We will also revise the abstract to present the result as an observation rather than a strong claim about understanding.
revision: partial

Referee: [Abstract] Abstract and experimental reporting: headline results (50%, 56%, 60%) are given without error bars, statistical tests, details on how the textbook was incorporated into T5 pre-training, the exact retrieval method for open-book, or confirmation of question balance per split, rendering it impossible to assess whether the minor lift is reliable or meaningful.

Authors: We agree that these details are essential for reproducibility and assessment. We will update the abstract and add a dedicated experimental details section providing error bars (e.g., from multiple runs), statistical significance tests, specifics on continued pre-training (number of epochs, learning rate), the retrieval method (e.g., dense or sparse retrieval over paragraphs), and confirmation that each split maintains balance (approximately 50% true/false).
revision: yes

Referee: [Task Description] Task construction: because the T/F statements are derived directly from the textbook authors' review questions (typically short, declarative, and stylistically uniform), performance may reflect sensitivity to author phrasing rather than textbook content; this is load-bearing for any claim about 'understanding' and requires explicit controls or justification.

Authors: The review questions are written by the textbook authors specifically to test comprehension of the material, making them a natural choice for this benchmark. However, we recognize the validity of the concern about stylistic uniformity. We will expand the task description to provide justification for this approach and include it as a noted limitation. We will also consider adding a small set of paraphrased statements for comparison in future iterations.
revision: partial

Circularity Check

0 steps flagged

Purely empirical task definition and baseline reporting; no derivations or self-referential reductions

The paper introduces a new QA task using true/false statements derived from textbook review questions and reports direct empirical accuracies for PTLMs under closed-book, textbook-augmented pre-training, and open-book conditions. No equations, fitted parameters renamed as predictions, uniqueness theorems, or self-citations are used to derive the central claims. All results are measurements against external benchmarks (random 50%, BoolQ fine-tuning, retrieval baselines), making the work self-contained with no load-bearing steps that reduce to the inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The paper is an empirical task proposal that rests on standard NLP evaluation assumptions rather than new free parameters, axioms, or invented entities.

axioms (1)

domain assumption: True/false questions derived from textbook review questions serve as a valid proxy for measuring comprehension of the textbook content.
This assumption is required to interpret the reported accuracies as evidence about understanding rather than about question format.

[1] [1]

URL: " 'urlintro :=

ENTRY address author booktitle chapter edition editor howpublished institution journal key month note number organization pages publisher school series title type volume year eprint doi pubmed url lastchecked label extra.label sort.label short.list INTEGERS output.state before.all mid.sentence after.sentence after.block STRINGS urlintro eprinturl eprintpr...

work page

[2] [2]

write newline

" write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION word.in bbl.in capitalize " " * FUNCT...

work page

[3] [3]

Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss , Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwi...

work page internal anchor Pith review Pith/arXiv arXiv 2020

[4] [4]

Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. 2019. https://doi.org/10.18653/v1/N19-1300 BoolQ : Exploring the Surprising Difficulty of Natural Yes / No Questions . In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics : Human Language...

work page doi:10.18653/v1/n19-1300 2019

[5] [5]

Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. 2018. https://arxiv.org/abs/1803.05457 Think you have Solved Question Answering ? Try ARC , the AI2 Reasoning Challenge . arXiv preprint arXiv:1803.05457

work page internal anchor Pith review Pith/arXiv arXiv 2018

[6] [6]

P Corbett. 2014. https://openstax.org/details/books/us-history U.S. History . OpenStax College, Houston, Texas

work page 2014

[7] [7]

Bhuwan Dhingra, Kathryn Mazaitis, and William W Cohen. 2017. https://arxiv.org/abs/1707.03904 Quasar: Datasets for question answering by search and reading . arXiv preprint arXiv:1707.03904

work page internal anchor Pith review Pith/arXiv arXiv 2017

[8] [8]

Matthew Dunn, Levent Sagun, Mike Higgins, V Ugur Guney, Volkan Cirik, and Kyunghyun Cho. 2017. https://arxiv.org/abs/1704.05179 SearchQA: a new Q&A dataset augmented with context from a search engine . arXiv preprint arXiv:1704.05179

work page internal anchor Pith review Pith/arXiv arXiv 2017

[9] [9]

William Falcon and The PyTorch Lightning team . 2019. https://doi.org/10.5281/zenodo.3828935 PyTorch Lightning . Available at https://github.com/PyTorchLightning/pytorch-lightning

work page doi:10.5281/zenodo.3828935 2019

[10] [10]

Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, Shawn Presser, and Connor Leahy. 2020. https://arxiv.org/abs/2101.00027 The Pile : An 800GB Dataset of Diverse Text for Language Modeling . arXiv preprint arXiv:2101.00027

work page internal anchor Pith review Pith/arXiv arXiv 2020

[11] [11]

Jose Manuel Gomez-Perez and Ra \'u l Ortega. 2020. https://doi.org/10.18653/v1/2020.emnlp-main.441 ISAAQ - mastering textbook questions with pre-trained transformers and bottom-up and top-down attention . In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 5469--5479, Online. Association for Computation...

work page doi:10.18653/v1/2020.emnlp-main.441 2020

[12] [12]

Zhaochen Guo and Denilson Barbosa. 2018. https://doi.org/10.3233/SW-170273 Robust named entity disambiguation with random walks . Semantic Web, 9(4):459--479

work page doi:10.3233/sw-170273 2018

[13] [13]

Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, and Ming-Wei Chang. 2020. https://arxiv.org/abs/2002.08909 REALM: retrieval-augmented language model pre-training . arXiv preprint arXiv:2002.08909

work page internal anchor Pith review Pith/arXiv arXiv 2020

[14] [14]

Karl Moritz Hermann, Tomas Kocisky, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. 2015. https://proceedings.neurips.cc/paper/2015/file/afdec7005cc9f14302cd0474fd0f3c96-Paper.pdf Teaching machines to read and comprehend . In Advances in neural information processing systems, volume 28. Curran Associates, Inc

work page 2015

[15] [15]

Johannes Hoffart, Mohamed Amir Yosef, Ilaria Bordino, Hagen F \"u rstenau, Manfred Pinkal, Marc Spaniol, Bilyana Taneva, Stefan Thater, and Gerhard Weikum. 2011. https://www.aclweb.org/anthology/D11-1072 Robust disambiguation of named entities in text . In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pages 782--7...

work page 2011

[16] [16]

Mandar Joshi, Eunsol Choi, Daniel Weld, and Luke Zettlemoyer. 2017. https://doi.org/10.18653/v1/P17-1147 TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1601--1611, Vancouver, Canada. Assoc...

[17]

Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. 2020. https://doi.org/10.18653/v1/2020.emnlp-main.550 Dense passage retrieval for open-domain question answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 6769--6781, Online. Ass...

[18]

Aniruddha Kembhavi, Minjoon Seo, Dustin Schwenk, Jonghyun Choi, Ali Farhadi, and Hannaneh Hajishirzi. 2017. https://openaccess.thecvf.com/content_cvpr_2017/html/Kembhavi_Are_You_Smarter_CVPR_2017_paper.html Are you smarter than a sixth grader? textbook question answering for multimodal machine comprehension. In Conference on Computer Vision and Pattern R...

[19]

Daniel Khashabi, Sewon Min, Tushar Khot, Ashish Sabharwal, Oyvind Tafjord, Peter Clark, and Hannaneh Hajishirzi. 2020. https://doi.org/10.18653/v1/2020.findings-emnlp.171 UNIFIEDQA: Crossing format boundaries with a single QA system. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 1896--1907, Online. Association for Compu...

[20]

Daesik Kim, Seonhoon Kim, and Nojun Kwak. 2019. https://doi.org/10.18653/v1/P19-1347 Textbook question answering with multi-modal context graph understanding and self-supervised open-set comprehension. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3568--3584, Florence, Italy. Association for Computation...

[21]

Glen Krutz. 2019. https://openstax.org/details/books/american-government-2e American government 2e. OpenStax, Rice University, Ann Arbor, MI

[22]

Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, Kristina Toutanova, Llion Jones, Matthew Kelcey, Ming-Wei Chang, Andrew M. Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov. 2019. https://doi.org/10.1162/tacl_a_00276 Natural questions: A benchma...

[23]

Tri Nguyen, Mir Rosenberg, Xia Song, Jianfeng Gao, Saurabh Tiwary, Rangan Majumder, and Li Deng. 2016. http://ceur-ws.org/Vol-1773/CoCoNIPS_2016_paper9.pdf MS MARCO: a human generated machine reading comprehension dataset. In CoCo@NIPS

[24]

David Patterson, Joseph Gonzalez, Quoc Le, Chen Liang, Lluis-Miquel Munguia, Daniel Rothchild, David So, Maud Texier, and Jeff Dean. 2021. https://arxiv.org/abs/2104.10350 Carbon Emissions and Large Neural Network Training. arXiv preprint arXiv:2104.10350

[25]

Fabio Petroni, Aleksandra Piktus, Angela Fan, Patrick Lewis, Majid Yazdani, Nicola De Cao, James Thorne, Yacine Jernite, Vladimir Karpukhin, Jean Maillard, Vassilis Plachouras, Tim Rocktäschel, and Sebastian Riedel. 2021. https://doi.org/10.18653/v1/2021.naacl-main.200 KILT: a benchmark for knowledge intensive language tasks. In Proceedings of the 2...

[26]

Fabio Petroni, Tim Rocktäschel, Sebastian Riedel, Patrick Lewis, Anton Bakhtin, Yuxiang Wu, and Alexander Miller. 2019. https://doi.org/10.18653/v1/D19-1250 Language models as knowledge bases? In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Process...

[27]

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. https://jmlr.org/papers/v21/20-074.html Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. Journal of Machine Learning Research, 21(140):1--67

[28]

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. https://doi.org/10.18653/v1/D16-1264 SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2383--2392, Austin, Texas. Association for Computational Linguistics

[29]

Nils Reimers and Iryna Gurevych. 2019. https://doi.org/10.18653/v1/D19-1410 Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3982--3992, Hong Kong...

[30]

Adam Roberts, Colin Raffel, and Noam Shazeer. 2020. https://doi.org/10.18653/v1/2020.emnlp-main.437 How much knowledge can you pack into the parameters of a language model? In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 5418--5426, Online. Association for Computational Linguistics

[31]

Stephen Robertson and Hugo Zaragoza. 2009. https://doi.org/10.1561/1500000019 The probabilistic relevance framework: BM25 and beyond. Now Publishers Inc

[32]

Emma Strubell, Ananya Ganesh, and Andrew McCallum. 2019. https://doi.org/10.18653/v1/P19-1355 Energy and Policy Considerations for Deep Learning in NLP. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3645--3650. Association for Computational Linguistics

[33]

James Thorne, Andreas Vlachos, Christos Christodoulopoulos, and Arpit Mittal. 2018. https://doi.org/10.18653/v1/N18-1074 FEVER: a large-scale dataset for fact extraction and VERification. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Pap...

[34]

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. 2020. https://www.aclweb.org/a...

[35]

Yi Yang, Wen-tau Yih, and Christopher Meek. 2015. https://doi.org/10.18653/v1/D15-1237 WikiQA: A challenge dataset for open-domain question answering. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 2013--2018, Lisbon, Portugal. Association for Computational Linguistics