pith. machine review for the scientific record.

arXiv: 2002.08910 · v4 · submitted 2020-02-10 · 💻 cs.CL · cs.LG · stat.ML

How Much Knowledge Can You Pack Into the Parameters of a Language Model?

Pith reviewed 2026-05-15 01:56 UTC · model grok-4.3

classification 💻 cs.CL · cs.LG · stat.ML
keywords language models · question answering · closed-book QA · knowledge storage · fine-tuning · T5 · parameters

The pith

Fine-tuned language models answer questions using only knowledge stored in their parameters.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether pre-trained language models can internalize enough factual knowledge during training to answer questions without any external context or retrieval. It fine-tunes these models on question-answer pairs and measures how well they perform on closed-book QA tasks. Performance improves steadily as model size grows and reaches levels competitive with open-domain systems that explicitly search an external knowledge base. A reader would care because this approach could simplify QA systems by removing the need for separate retrieval modules if the knowledge is already packed inside the model.

Core claim

By fine-tuning pre-trained models on QA pairs alone, the resulting systems can answer questions using only the knowledge stored in their parameters. This closed-book approach scales with model size and achieves competitive results against open-domain QA systems that retrieve answers from an external source.

What carries the argument

Fine-tuning a pre-trained language model on closed-book question-answer pairs so that factual knowledge is stored and retrieved implicitly through the model's parameters.
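
As a rough illustration of what "closed-book" means for the training data, the sketch below formats a QA pair as a text-to-text example with no supporting passage. The `question:` and `context:` prefixes, the `QAExample` container, and the `make_example` helper are hypothetical choices made for this illustration; this page does not state the exact input format the authors used.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class QAExample:
    """One text-to-text training example (hypothetical container)."""
    inputs: str
    targets: str


def make_example(question: str, answer: str, context: Optional[str] = None) -> QAExample:
    """Format a QA pair; the example is closed-book when no context is given."""
    # The "question:" / "context:" prefixes are assumptions for illustration.
    inputs = f"question: {question.strip()}"
    if context is not None:
        # Open-book variant: the model could read the answer from the passage.
        inputs += f" context: {context.strip()}"
    return QAExample(inputs=inputs, targets=answer.strip())


# Closed-book: the model sees only the question and must produce the answer
# from whatever it stored in its parameters.
closed = make_example("Who wrote Hamlet?", "William Shakespeare")
print(closed.inputs)   # question: Who wrote Hamlet?
print(closed.targets)  # William Shakespeare
```

The only difference between the two settings in this sketch is whether a passage is appended to the input; the target is the answer string in both cases.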

If this is right

  • Larger models store and retrieve more factual knowledge effectively.
  • Closed-book QA can match retrieval-based systems on many questions without external search.
  • Knowledge from unstructured text pre-training can be surfaced via simple fine-tuning on QA pairs.
  • Releasing trained models and code enables direct testing of how much knowledge is retained in parameters.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This suggests retrieval may become optional for some QA tasks once models exceed a size threshold.
  • The same parameter-storage approach could be tested on other knowledge-intensive tasks like multi-hop reasoning.
  • If knowledge is packed in parameters, updates to facts would require re-fine-tuning rather than database edits.

Load-bearing premise

The knowledge needed to answer the questions is already present in the pre-training data and can be effectively stored and accessed through fine-tuning on QA examples.

What would settle it

A dataset of questions whose correct answers require facts absent from the original pre-training corpus, where the fine-tuned model's accuracy remains near random regardless of model size.
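
A test like this needs a scoring rule that can be applied identically at every model size. The sketch below uses exact match after SQuAD-style answer normalization, which is a common convention for open-domain QA; treating it as the metric here is an assumption, since this page does not name the metric. The function names and the sample predictions are hypothetical.

```python
import re
import string
from typing import List, Sequence

_PUNCTUATION = set(string.punctuation)


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in _PUNCTUATION)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold_answers: Sequence[str]) -> bool:
    """True if the prediction equals any gold answer after normalization."""
    return normalize(prediction) in {normalize(g) for g in gold_answers}


def accuracy(predictions: List[str], references: List[List[str]]) -> float:
    """Fraction of questions answered with an exact match."""
    hits = sum(exact_match(p, refs) for p, refs in zip(predictions, references))
    return hits / len(predictions)


# Made-up predictions and gold answers, for illustration only.
predictions = ["The Eiffel Tower", "paris", "1999"]
references = [["Eiffel Tower"], ["Paris", "Paris, France"], ["2001"]]
print(round(accuracy(predictions, references), 3))  # 0.667
```

Running the same `accuracy` function over each model size on the held-out fact set would give the curve the test calls for: flat and low if the premise holds, rising if the model is getting the facts from somewhere other than pre-training.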

Original abstract

It has recently been observed that neural language models trained on unstructured text can implicitly store and retrieve knowledge using natural language queries. In this short paper, we measure the practical utility of this approach by fine-tuning pre-trained models to answer questions without access to any external context or knowledge. We show that this approach scales with model size and performs competitively with open-domain systems that explicitly retrieve answers from an external knowledge source when answering questions. To facilitate reproducibility and future work, we release our code and trained models at https://goo.gle/t5-cbqa.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 2 minor

Summary. The paper measures how much factual knowledge can be stored in the parameters of pre-trained language models by fine-tuning T5 variants on QA datasets in a closed-book setting (no external context or retrieval at inference). It reports that closed-book accuracy scales with model size and approaches the performance of retrieval-augmented open-domain QA baselines on standard benchmarks, with code and models released for reproducibility.

Significance. If the empirical results hold, the work provides concrete evidence that scaling model capacity allows substantial implicit knowledge storage and retrieval via natural-language queries, offering a viable alternative to explicit retrieval pipelines for some QA tasks. The scaling curves and head-to-head comparisons with retrieval systems constitute a clear, falsifiable contribution; the public release of code and checkpoints further strengthens the result.

minor comments (2)
  1. §3 (Experimental Setup): The description of the closed-book fine-tuning objective could be expanded with the exact loss formulation and any differences from the original T5 pre-training objective, to aid exact replication.
  2. Table 2: The reported numbers for the largest T5 model on Natural Questions are competitive, but they would benefit from an explicit statement of the number of runs or a variance estimate, given the known sensitivity of QA fine-tuning to random seeds.
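
The variance estimate requested in the Table 2 comment could be reported along these lines: a minimal sketch that summarizes accuracy across fine-tuning runs with different random seeds. The `summarize_runs` helper is hypothetical, and the scores are placeholder values, not results from the paper.

```python
from statistics import mean, stdev
from typing import Sequence, Tuple


def summarize_runs(scores: Sequence[float]) -> Tuple[float, float]:
    """Return (mean, sample standard deviation) of per-seed accuracies."""
    if len(scores) < 2:
        # A sample standard deviation is undefined for a single run.
        raise ValueError("need at least two runs to estimate variance")
    return mean(scores), stdev(scores)


# Placeholder accuracies from three hypothetical seeds (not from the paper).
scores = [0.30, 0.32, 0.34]
avg, spread = summarize_runs(scores)
print(f"{avg:.3f} +/- {spread:.3f} over {len(scores)} runs")
```

Reporting the number of runs next to the spread matters because a standard deviation from two or three seeds is itself a noisy estimate.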

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive review and recommendation to accept the paper. We are glad that the empirical demonstration of knowledge storage in language model parameters and the comparison to retrieval-based systems were viewed as a clear contribution.

Circularity Check

0 steps flagged

No significant circularity; empirical scaling results are self-contained

The paper reports direct experimental outcomes from fine-tuning T5 models on closed-book QA tasks and measuring accuracy on standard held-out benchmarks (e.g., Natural Questions, WebQuestions). No mathematical derivation, uniqueness theorem, or ansatz is invoked. The performance curves and the comparisons to retrieval baselines are independent observations, not quantities that follow by construction from fitted parameters. Self-citations to the T5 paper supply the base model but do not carry the load-bearing claim about knowledge storage.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The work relies on standard assumptions in NLP about what LMs learn during pretraining, with no new entities postulated and minimal free parameters beyond architectural choices.

free parameters (1)
  • model size / number of parameters
    Different model sizes are tested to show scaling, but these are architectural choices rather than values fitted to the QA task.
axioms (1)
  • domain assumption: Pre-trained language models encode factual knowledge from their training corpus in their parameters.
    This is the core premise tested by the fine-tuning experiments.

Forward citations

Cited by 20 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Language Models are Few-Shot Learners

    cs.CL 2020-05 accept novelty 8.0

    GPT-3 shows that scaling an autoregressive language model to 175 billion parameters enables strong few-shot performance across diverse NLP tasks via in-context prompting without fine-tuning.

  2. Privacy Without Losing Place: A Paradigm for Private Retrieval in Spatial RAGs

    cs.CR 2026-05 unverdicted novelty 7.0

    PAS encodes locations via relative anchors and bins to deliver roughly 370-400m adversarial error in spatial RAG while retaining over half the baseline retrieval performance and keeping generation quality robust.

  3. Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity

    cs.LG 2021-01 accept novelty 7.0

    Switch Transformers use top-1 expert routing in a Mixture of Experts setup to scale to trillion-parameter language models with constant compute and up to 4x speedup over T5-XXL.

  4. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks

    cs.CL 2020-05 accept novelty 7.0

    RAG models set new state-of-the-art results on open-domain QA by retrieving Wikipedia passages and conditioning a generative model on them, while also producing more factual text than parametric baselines.

  5. Towards Understanding Continual Factual Knowledge Acquisition of Language Models: From Theory to Algorithm

    cs.CL 2026-05 unverdicted novelty 6.0

    Theoretical analysis of continual factual knowledge acquisition shows data replay stabilizes pretrained knowledge by shifting convergence dynamics while regularization only slows forgetting, leading to the STOC method...

  6. RAG over Thinking Traces Can Improve Reasoning Tasks

    cs.IR 2026-05 unverdicted novelty 6.0

    RAG over structured thinking traces boosts LLM reasoning on AIME, LiveCodeBench, and GPQA, with relative gains up to 56% and little added cost.

  7. TLoRA: Task-aware Low Rank Adaptation of Large Language Models

    cs.CL 2026-04 unverdicted novelty 6.0

    TLoRA jointly optimizes LoRA initialization via task-data SVD and sensitivity-driven rank allocation, delivering stronger results than standard LoRA across NLU, reasoning, math, code, and chat tasks while using fewer ...

  8. Transforming External Knowledge into Triplets for Enhanced Retrieval in RAG of LLMs

    cs.CL 2026-04 unverdicted novelty 6.0

    Tri-RAG turns external knowledge into Condition-Proof-Conclusion triplets and retrieves via the Condition anchor to improve efficiency and quality in LLM RAG.

  9. Cram Less to Fit More: Training Data Pruning Improves Memorization of Facts

    cs.CL 2026-04 conditional novelty 6.0

    Loss-based pruning of training data to limit facts and flatten their frequency distribution enables a 110M-parameter GPT-2 model to memorize 1.3 times more entity facts than standard training, matching a 1.3B-paramete...

  10. Inner Monologue: Embodied Reasoning through Planning with Language Models

    cs.RO 2022-07 unverdicted novelty 6.0

    LLMs form an inner monologue from closed-loop language feedback to improve high-level instruction completion in simulated and real robotic rearrangement and kitchen manipulation tasks.

  11. ST-MoE: Designing Stable and Transferable Sparse Expert Models

    cs.CL 2022-02 unverdicted novelty 6.0

    ST-MoE introduces stability techniques for sparse expert models, allowing a 269B-parameter model to achieve state-of-the-art transfer learning results across reasoning, summarization, and QA tasks at the compute cost ...

  12. Unsupervised Dense Information Retrieval with Contrastive Learning

    cs.IR 2021-12 unverdicted novelty 6.0

    Contrastive learning trains unsupervised dense retrievers that beat BM25 on most BEIR datasets and support cross-lingual retrieval across scripts.

  13. Model Capacity Determines Grokking through Competing Memorisation and Generalisation Speeds

    cs.LG 2026-05 unverdicted novelty 5.0

    Grokking emerges near the model size where memorization timescale T_mem(P) intersects generalization timescale T_gen(P) on modular arithmetic.

  14. TIDE: Every Layer Knows the Token Beneath the Context

    cs.CL 2026-05 unverdicted novelty 5.0

    TIDE augments standard transformers with per-layer token embedding injection via an ensemble of memory blocks and a depth-conditioned router to mitigate rare-token undertraining and contextual collapse.

  15. Trust, but Verify: Peeling Low-Bit Transformer Networks for Training Monitoring

    cs.LG 2026-05 unverdicted novelty 5.0

    A layer-wise peeling framework creates reference bounds to diagnose under-optimized layers in trained decoder-only transformers, including low-bit and quantized versions.

  16. Budget-Constrained Online Retrieval-Augmented Generation: The Chunk-as-a-Service Model

    cs.IR 2026-04 unverdicted novelty 5.0

    Chunk-as-a-Service with the UCOSA online algorithm enables budget-constrained selection of prompts for chunk enrichment in RAG, outperforming random selection by 52% on a combined performance metric and delivering hig...

  17. Calibrating Model-Based Evaluation Metrics for Summarization

    cs.CL 2026-04 unverdicted novelty 5.0

    A reference-free proxy scoring framework combined with GIRB calibration produces better-aligned evaluation metrics for summarization and outperforms baselines across seven datasets.

  18. Tug-of-War within A Decade: Conflict Resolution in Vulnerability Analysis via Teacher-Guided Retrieval-Augmented Generations

    cs.CL 2026-03 unverdicted novelty 5.0

    CRVA-TGRAG combines parent-document segmentation, ensemble retrieval, and teacher-guided fine-tuning to mitigate knowledge conflicts and improve accuracy in LLM-based CVE vulnerability analysis.

  19. SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model

    cs.CL 2025-02 unverdicted novelty 5.0

    SmolLM2 is a 1.7B-parameter language model that outperforms Qwen2.5-1.5B and Llama3.2-1B after overtraining on 11 trillion tokens using custom FineMath, Stack-Edu, and SmolTalk datasets in a multi-stage pipeline.

  20. Yale-DM-Lab at ArchEHR-QA 2026: Deterministic Grounding and Multi-Pass Evidence Alignment for EHR Question Answering

    cs.CL 2026-04 unverdicted novelty 2.0

    Ensemble voting across multiple LLMs improves results on EHR question answering subtasks, with best dev scores of 88.81 micro F1 on evidence-answer alignment.
