arXiv: 2002.08910 · v4 · submitted 2020-02-10 · cs.CL · cs.LG · stat.ML

How Much Knowledge Can You Pack Into the Parameters of a Language Model?

Adam Roberts, Colin Raffel, Noam Shazeer

Pith reviewed 2026-05-15 01:56 UTC · model grok-4.3

classification: cs.CL · cs.LG · stat.ML

keywords: language models, question answering, closed-book QA, knowledge storage, fine-tuning, T5, parameters

The pith

Fine-tuned language models answer questions using only knowledge stored in their parameters.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether pre-trained language models can internalize enough factual knowledge during training to answer questions without any external context or retrieval. It fine-tunes these models on question-answer pairs and measures how well they perform on closed-book QA tasks. Performance improves steadily as model size grows and reaches levels competitive with open-domain systems that explicitly search an external knowledge base. A reader would care because this approach could simplify QA systems by removing the need for separate retrieval modules if the knowledge is already packed inside the model.

Core claim

By fine-tuning pre-trained models on QA pairs alone, the resulting systems can answer questions using only the knowledge stored in their parameters. This closed-book approach scales with model size and achieves competitive results against open-domain QA systems that retrieve answers from an external source.

What carries the argument

Fine-tuning a pre-trained language model on closed-book question-answer pairs so that factual knowledge is stored and retrieved implicitly through the model's parameters.
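As a rough illustration of the closed-book setup, the sketch below casts a QA pair into a text-to-text example whose input contains only the question. The `to_closed_book_example` helper and the `question: ` prefix are assumptions made for illustration, not details taken from the paper or its released code.

```python
from typing import Dict


def to_closed_book_example(question: str, answer: str,
                           prefix: str = "question: ") -> Dict[str, str]:
    """Build one text-to-text training example with no supporting context.

    The prefix is a hypothetical choice; the key point is that the input
    holds the question alone, so the answer must come from the parameters.
    """
    return {
        "inputs": prefix + question.strip(),
        "targets": answer.strip(),
    }


# Illustrative pair only; not drawn from any benchmark discussed here.
example = to_closed_book_example(
    "  Who wrote the novel Frankenstein? ", "Mary Shelley")
print(example)
```

An open-book variant would append a retrieved passage to `inputs`; leaving it out is what makes the task closed-book.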

If this is right

Larger models store and retrieve more factual knowledge effectively.
Closed-book QA can match retrieval-based systems on many questions without external search.
Knowledge from unstructured text pre-training can be surfaced via simple fine-tuning on QA pairs.
Releasing trained models and code enables direct testing of how much knowledge is retained in parameters.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This suggests retrieval may become optional for some QA tasks once models exceed a size threshold.
The same parameter-storage approach could be tested on other knowledge-intensive tasks like multi-hop reasoning.
If knowledge is packed in parameters, updates to facts would require re-fine-tuning rather than database edits.

Load-bearing premise

The knowledge needed to answer the questions is already present in the pre-training data and can be effectively stored and accessed through fine-tuning on QA examples.

What would settle it

A dataset of questions whose correct answers require facts absent from the original pre-training corpus, where the fine-tuned model's accuracy remains near random regardless of model size.
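One way to run such a test is to score each model size on the held-out-fact questions with a normalized exact-match metric and check whether accuracy stays flat. The sketch below assumes SQuAD-style answer normalization (lowercasing, stripping punctuation and English articles). The function names are hypothetical and the metric choice is an assumption, not something this page specifies.

```python
import re
import string
from typing import Iterable, List


def normalize_answer(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse spaces."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match_accuracy(predictions: List[str],
                         references: List[Iterable[str]]) -> float:
    """Fraction of predictions matching any acceptable reference answer."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must align")
    if not predictions:
        return 0.0
    hits = 0
    for pred, refs in zip(predictions, references):
        norm_refs = {normalize_answer(r) for r in refs}
        # A bool adds as 0 or 1, so this counts the matching predictions.
        hits += normalize_answer(pred) in norm_refs
    return hits / len(predictions)
```

Under this sketch, accuracy that stays near zero at every model size on the absent-fact set, while rising with size on standard benchmarks, would be consistent with the load-bearing premise.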

Original abstract

It has recently been observed that neural language models trained on unstructured text can implicitly store and retrieve knowledge using natural language queries. In this short paper, we measure the practical utility of this approach by fine-tuning pre-trained models to answer questions without access to any external context or knowledge. We show that this approach scales with model size and performs competitively with open-domain systems that explicitly retrieve answers from an external knowledge source when answering questions. To facilitate reproducibility and future work, we release our code and trained models at https://goo.gle/t5-cbqa.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

T5 closed-book QA scales with size and gets close to retrieval baselines on standard benchmarks, with code released.

The main takeaway is that fine-tuning T5 models to answer questions with no external context produces results that improve with scale and come within striking distance of retrieval-based open-domain systems on Natural Questions and TriviaQA. The paper runs straightforward scaling curves across T5 sizes and shows the closed-book version catching up as parameters increase. They also release the code and models, which is helpful for anyone who wants to replicate or extend the numbers. That part is solid and directly useful.

The experiments are empirical and the comparisons use held-out data against independent baselines, so there is no obvious circularity in the setup. What the paper does well is give clean head-to-head numbers that let you weigh parameter storage against an external index. The scaling behavior is consistent and the release lowers the barrier for follow-up work.

The soft spots are modest but real. Success still depends on the facts already being in the pre-training data, and the paper does not test how well this holds for knowledge that is newer or rarer than the training corpus. There is also limited error analysis, so it is not clear how much of the performance is rote recall versus any deeper retrieval from parameters. The datasets are the usual ones, which keeps the evaluation straightforward but does not stress the method on harder cases.

This is worth reading for anyone working on scaling laws or deciding between bigger models and retrieval pipelines. The results add concrete data points to that trade-off without overclaiming. It deserves a serious referee because the experiments are reproducible and the claims rest on released artifacts rather than hidden assumptions.

Referee Report

0 major / 2 minor

Summary. The paper measures how much factual knowledge can be stored in the parameters of pre-trained language models by fine-tuning T5 variants on QA datasets in a closed-book setting (no external context or retrieval at inference). It reports that closed-book accuracy scales with model size and approaches the performance of retrieval-augmented open-domain QA baselines on standard benchmarks, with code and models released for reproducibility.

Significance. If the empirical results hold, the work provides concrete evidence that scaling model capacity allows substantial implicit knowledge storage and retrieval via natural-language queries, offering a viable alternative to explicit retrieval pipelines for some QA tasks. The scaling curves and head-to-head comparisons with retrieval systems constitute a clear, falsifiable contribution; the public release of code and checkpoints further strengthens the result.

minor comments (2)

[§3] Experimental Setup: the description of the closed-book fine-tuning objective could be expanded with the exact loss formulation and any differences from the original T5 pre-training objective to aid exact replication.
[Table 2] The reported numbers for the largest T5 model on Natural Questions are competitive but would benefit from an explicit statement of the number of runs or a variance estimate, given the known sensitivity of QA fine-tuning to random seeds.
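
The variance estimate requested in the second comment could be reported along the lines of the sketch below, which summarizes accuracy over repeated fine-tuning runs with different seeds. The `summarize_runs` helper is hypothetical, and the scores are placeholders for illustration only, not results from the paper.

```python
import statistics
from typing import Dict, Sequence


def summarize_runs(scores: Sequence[float]) -> Dict[str, float]:
    """Return the mean and sample standard deviation over repeated runs."""
    if len(scores) < 2:
        raise ValueError("need at least two runs to estimate variance")
    return {
        "n_runs": len(scores),
        "mean": statistics.mean(scores),
        "stdev": statistics.stdev(scores),
    }


# Placeholder accuracies for illustration only, not results from the paper.
placeholder_scores = [30.0, 32.0, 34.0]
summary = summarize_runs(placeholder_scores)
print(f"{summary['mean']:.1f} ± {summary['stdev']:.1f} "
      f"over {summary['n_runs']} runs")
```

Reporting the run count next to the spread lets a reader judge whether a gap between two systems exceeds seed-to-seed noise.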

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive review and recommendation to accept the paper. We are glad that the empirical demonstration of knowledge storage in language model parameters and the comparison to retrieval-based systems were viewed as a clear contribution.

Circularity Check

0 steps flagged

No significant circularity; empirical scaling results are self-contained

The paper reports direct experimental outcomes from fine-tuning T5 models on closed-book QA tasks and measuring accuracy on standard held-out benchmarks (e.g., Natural Questions, WebQuestions). No mathematical derivation, uniqueness theorem, or ansatz is invoked; performance curves and comparisons to retrieval baselines are independent observations, not reductions of fitted parameters by construction. Self-citations to the T5 paper supply the base model but do not carry the load-bearing claim about knowledge storage.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The work relies on standard assumptions in NLP about what LMs learn during pretraining, with no new entities postulated and minimal free parameters beyond architectural choices.

free parameters (1)

model size / number of parameters
Different model sizes are tested to show scaling, but these are architectural choices rather than fitted to the QA task specifically.

axioms (1)

domain assumption: Pre-trained language models encode factual knowledge from their training corpus in their parameters.
This is the core premise tested by the fine-tuning experiments.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Foundation.RealityFromDistinction reality_from_one_distinction unclear

neural language models trained on unstructured text can implicitly store and retrieve knowledge using natural language queries

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 20 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Language Models are Few-Shot Learners
cs.CL 2020-05 accept novelty 8.0

GPT-3 shows that scaling an autoregressive language model to 175 billion parameters enables strong few-shot performance across diverse NLP tasks via in-context prompting without fine-tuning.
Privacy Without Losing Place: A Paradigm for Private Retrieval in Spatial RAGs
cs.CR 2026-05 unverdicted novelty 7.0

PAS encodes locations via relative anchors and bins to deliver roughly 370-400m adversarial error in spatial RAG while retaining over half the baseline retrieval performance and keeping generation quality robust.
Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity
cs.LG 2021-01 accept novelty 7.0

Switch Transformers use top-1 expert routing in a Mixture of Experts setup to scale to trillion-parameter language models with constant compute and up to 4x speedup over T5-XXL.
Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks
cs.CL 2020-05 accept novelty 7.0

RAG models set new state-of-the-art results on open-domain QA by retrieving Wikipedia passages and conditioning a generative model on them, while also producing more factual text than parametric baselines.
Towards Understanding Continual Factual Knowledge Acquisition of Language Models: From Theory to Algorithm
cs.CL 2026-05 unverdicted novelty 6.0

Theoretical analysis of continual factual knowledge acquisition shows data replay stabilizes pretrained knowledge by shifting convergence dynamics while regularization only slows forgetting, leading to the STOC method...
RAG over Thinking Traces Can Improve Reasoning Tasks
cs.IR 2026-05 unverdicted novelty 6.0

RAG over structured thinking traces boosts LLM reasoning on AIME, LiveCodeBench, and GPQA, with relative gains up to 56% and little added cost.
TLoRA: Task-aware Low Rank Adaptation of Large Language Models
cs.CL 2026-04 unverdicted novelty 6.0

TLoRA jointly optimizes LoRA initialization via task-data SVD and sensitivity-driven rank allocation, delivering stronger results than standard LoRA across NLU, reasoning, math, code, and chat tasks while using fewer ...
Transforming External Knowledge into Triplets for Enhanced Retrieval in RAG of LLMs
cs.CL 2026-04 unverdicted novelty 6.0

Tri-RAG turns external knowledge into Condition-Proof-Conclusion triplets and retrieves via the Condition anchor to improve efficiency and quality in LLM RAG.
Cram Less to Fit More: Training Data Pruning Improves Memorization of Facts
cs.CL 2026-04 conditional novelty 6.0

Loss-based pruning of training data to limit facts and flatten their frequency distribution enables a 110M-parameter GPT-2 model to memorize 1.3 times more entity facts than standard training, matching a 1.3B-paramete...
Inner Monologue: Embodied Reasoning through Planning with Language Models
cs.RO 2022-07 unverdicted novelty 6.0

LLMs form an inner monologue from closed-loop language feedback to improve high-level instruction completion in simulated and real robotic rearrangement and kitchen manipulation tasks.
ST-MoE: Designing Stable and Transferable Sparse Expert Models
cs.CL 2022-02 unverdicted novelty 6.0

ST-MoE introduces stability techniques for sparse expert models, allowing a 269B-parameter model to achieve state-of-the-art transfer learning results across reasoning, summarization, and QA tasks at the compute cost ...
Unsupervised Dense Information Retrieval with Contrastive Learning
cs.IR 2021-12 unverdicted novelty 6.0

Contrastive learning trains unsupervised dense retrievers that beat BM25 on most BEIR datasets and support cross-lingual retrieval across scripts.
Model Capacity Determines Grokking through Competing Memorisation and Generalisation Speeds
cs.LG 2026-05 unverdicted novelty 5.0

Grokking emerges near the model size where memorization timescale T_mem(P) intersects generalization timescale T_gen(P) on modular arithmetic.
TIDE: Every Layer Knows the Token Beneath the Context
cs.CL 2026-05 unverdicted novelty 5.0

TIDE augments standard transformers with per-layer token embedding injection via an ensemble of memory blocks and a depth-conditioned router to mitigate rare-token undertraining and contextual collapse.
Trust, but Verify: Peeling Low-Bit Transformer Networks for Training Monitoring
cs.LG 2026-05 unverdicted novelty 5.0

A layer-wise peeling framework creates reference bounds to diagnose under-optimized layers in trained decoder-only transformers, including low-bit and quantized versions.
Budget-Constrained Online Retrieval-Augmented Generation: The Chunk-as-a-Service Model
cs.IR 2026-04 unverdicted novelty 5.0

Chunk-as-a-Service with the UCOSA online algorithm enables budget-constrained selection of prompts for chunk enrichment in RAG, outperforming random selection by 52% on a combined performance metric and delivering hig...
Calibrating Model-Based Evaluation Metrics for Summarization
cs.CL 2026-04 unverdicted novelty 5.0

A reference-free proxy scoring framework combined with GIRB calibration produces better-aligned evaluation metrics for summarization and outperforms baselines across seven datasets.
Tug-of-War within A Decade: Conflict Resolution in Vulnerability Analysis via Teacher-Guided Retrieval-Augmented Generations
cs.CL 2026-03 unverdicted novelty 5.0

CRVA-TGRAG combines parent-document segmentation, ensemble retrieval, and teacher-guided fine-tuning to mitigate knowledge conflicts and improve accuracy in LLM-based CVE vulnerability analysis.
SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model
cs.CL 2025-02 unverdicted novelty 5.0

SmolLM2 is a 1.7B-parameter language model that outperforms Qwen2.5-1.5B and Llama3.2-1B after overtraining on 11 trillion tokens using custom FineMath, Stack-Edu, and SmolTalk datasets in a multi-stage pipeline.
Yale-DM-Lab at ArchEHR-QA 2026: Deterministic Grounding and Multi-Pass Evidence Alignment for EHR Question Answering
cs.CL 2026-04 unverdicted novelty 2.0

Ensemble voting across multiple LLMs improves results on EHR question answering subtasks, with best dev scores of 88.81 micro F1 on evidence-answer alignment.
