arxiv: 2112.04426 · v3 · pith:VVRFZRCWnew · submitted 2021-12-08 · 💻 cs.CL · cs.LG

Improving language models by retrieving from trillions of tokens

Sebastian Borgeaud, Arthur Mensch, Jordan Hoffmann, Trevor Cai, Eliza Rutherford, Katie Millican, George van den Driessche, Jean-Baptiste Lespiau, Bogdan Damoc, Aidan Clark, Diego de Las Casas, Aurelia Guy, Jacob Menick, Roman Ring, Tom Hennigan, Saffron Huang, Loren Maggiore, Chris Jones, Albin Cassirer, Andy Brock, Michela Paganini, Geoffrey Irving, Oriol Vinyals, Simon Osindero, Karen Simonyan, Jack W. Rae, Erich Elsen, Laurent Sifre

Pith reviewed 2026-05-17 12:50 UTC · model grok-4.3

keywords: retrieval, language models, transformers, RETRO, trillion tokens, efficient training, question answering, Pile benchmark

The pith

Retrieval from a 2 trillion token database lets language models match GPT-3 performance with 25 times fewer parameters.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that enhancing auto-regressive language models with document chunks retrieved from a huge corpus, selected by local similarity to the preceding tokens, leads to strong performance. RETRO achieves results comparable to much larger models such as GPT-3 on the Pile benchmark while using far fewer parameters. The gains carry over to downstream tasks such as question answering after fine-tuning, and retrieval can also be added to already pre-trained models. Through explicit retrieval, the model predicts tokens from an order of magnitude more data than is typically consumed during training.

Core claim

RETRO is a Retrieval-Enhanced Transformer that conditions on document chunks retrieved from a 2 trillion token database using a frozen BERT retriever and chunked cross-attention, achieving performance on par with GPT-3 and Jurassic-1 on the Pile despite 25 times fewer parameters.

What carries the argument

Chunked cross-attention mechanism that integrates information from retrieved chunks into the transformer's predictions, combined with a frozen retriever and differentiable encoder.
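One way to picture chunked cross-attention is as a mapping from each token position to the set of retrieved neighbors that position is allowed to read. The sketch below is illustrative only: the chunk size `m`, the function names, and the one-token shift used to keep prediction causal are assumptions that this page does not spell out.

```python
from typing import List, Optional


def neighbor_chunk_for_position(pos: int, m: int) -> Optional[int]:
    """Return the 0-based index of the chunk whose retrieved neighbors the
    token at 0-based position `pos` may attend to, or None if there is none.

    Assumption: neighbors are looked up with chunk c only once all of chunk c
    has been read, so they can inform the last token of chunk c (which
    predicts the first token of chunk c + 1) and the first m - 1 tokens of
    chunk c + 1.
    """
    if m <= 0:
        raise ValueError("chunk size must be positive")
    if pos < m - 1:
        return None  # no complete chunk has been read yet
    return (pos - (m - 1)) // m


def attention_plan(seq_len: int, m: int) -> List[Optional[int]]:
    """List, for every position, which chunk's neighbors it may read."""
    return [neighbor_chunk_for_position(p, m) for p in range(seq_len)]


# With chunks of 4 tokens, positions 0-2 see no neighbors, positions 3-6
# read the neighbors of chunk 0, and positions 7-9 those of chunk 1.
print(attention_plan(10, 4))
```

The shift matters because neighbors retrieved from a chunk would otherwise leak information about tokens the model has not yet predicted.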

If this is right

RETRO can be trained from scratch or used to retrofit existing pre-trained transformers.
After fine-tuning, it improves on downstream knowledge-intensive tasks such as question answering.
The method scales by accessing more data explicitly rather than increasing model size.
New avenues open for language models that use large-scale explicit memory.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Retrieval databases could become a standard complement to model parameters for efficiency.
Similar retrieval techniques might extend to other sequence modeling tasks beyond language.
Further scaling the retrieval database size could yield additional performance gains without proportional parameter increases.

Load-bearing premise

Nearest-neighbor retrieval based on local similarity with preceding tokens supplies sufficiently relevant and non-redundant information to improve next-token prediction at scale.
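The premise above can be made concrete with a toy nearest-neighbor lookup. This is a minimal sketch under stated assumptions: a normalised bag-of-words vector stands in for the frozen BERT embeddings, a linear scan stands in for a large-scale index, and every name here (`split_into_chunks`, `embed`, `retrieve`) is hypothetical.

```python
import math
from collections import Counter


def split_into_chunks(tokens, m):
    """Split a token list into consecutive chunks of m tokens (last may be shorter)."""
    return [tokens[i:i + m] for i in range(0, len(tokens), m)]


def embed(chunk):
    """Toy stand-in for a frozen encoder: an L2-normalised bag-of-words vector."""
    counts = Counter(chunk)
    norm = math.sqrt(sum(v * v for v in counts.values()))
    if norm == 0:
        return {}
    return {tok: v / norm for tok, v in counts.items()}


def cosine(a, b):
    """Dot product of two normalised sparse vectors, i.e. cosine similarity."""
    return sum(w * b.get(tok, 0.0) for tok, w in a.items())


def retrieve(query_chunk, database, k=2):
    """Return the k database chunks most similar to the query chunk."""
    q = embed(query_chunk)
    ranked = sorted(database, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]


text = "the cat sat on the mat the dog ran in the park a cat chased the dog"
database = split_into_chunks(text.split(), 4)
neighbors = retrieve("the cat sat".split(), database, k=1)
print(neighbors)
```

Whether neighbors found this way are relevant and non-redundant at the scale of trillions of tokens is exactly what the premise asserts; the toy only shows the mechanics of the lookup.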

What would settle it

If RETRO with the 2 trillion token database fails to match or exceed the performance of GPT-3 on the Pile despite using fewer parameters, the central claim would be falsified.

Original abstract

We enhance auto-regressive language models by conditioning on document chunks retrieved from a large corpus, based on local similarity with preceding tokens. With a $2$ trillion token database, our Retrieval-Enhanced Transformer (RETRO) obtains comparable performance to GPT-3 and Jurassic-1 on the Pile, despite using 25$\times$ fewer parameters. After fine-tuning, RETRO performance translates to downstream knowledge-intensive tasks such as question answering. RETRO combines a frozen BERT retriever, a differentiable encoder and a chunked cross-attention mechanism to predict tokens based on an order of magnitude more data than what is typically consumed during training. We typically train RETRO from scratch, yet can also rapidly RETROfit pre-trained transformers with retrieval and still achieve good performance. Our work opens up new avenues for improving language models through explicit memory at unprecedented scale.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper introduces the Retrieval-Enhanced Transformer (RETRO), which conditions autoregressive language models on document chunks retrieved from a 2 trillion token database based on local similarity with preceding tokens. Using a frozen BERT retriever, a differentiable encoder, and chunked cross-attention, RETRO achieves performance comparable to GPT-3 and Jurassic-1 on the Pile benchmark despite using 25× fewer parameters. It also demonstrates improvements on downstream tasks after fine-tuning and the ability to retrofit pre-trained models.

Significance. If the results hold, the work is significant because it shows that retrieval from trillions of tokens can allow much smaller models to match the performance of state-of-the-art large language models on language modeling tasks. This suggests a promising direction for scaling language models via explicit memory augmentation at unprecedented scale, rather than solely through increasing model parameters. The practical aspects, such as retrofitting existing models, add to its potential impact.

major comments (1)

[Experiments section] The central claim that retrieval from 2T tokens enables comparable performance to much larger models is load-bearing, but the manuscript does not report an ablation study that trains the model without the retrieval component (i.e., no chunked cross-attention to retrieved neighbors) while keeping the rest of the architecture, training data, and procedure identical. Without this, it is difficult to rule out that gains arise from the differentiable encoder or cross-attention mechanism rather than the retrieval itself.

minor comments (2)

Clarify the exact parameter count of the RETRO model used in the main comparison to make the '25× fewer parameters' claim precise.
[Abstract] The phrase 'an order of magnitude more data' could be quantified with specific numbers for training tokens in RETRO versus standard models.
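The "25× fewer parameters" figure raised in the first minor comment can be sanity-checked with simple arithmetic. The sketch below assumes GPT-3's publicly reported size of 175 billion parameters, which this page does not state; the RETRO parameter count is not given here either, so the result is only the size implied by the ratio.

```python
GPT3_PARAMS = 175e9  # assumption: publicly reported GPT-3 size, not from this page
REDUCTION = 25       # the "25x fewer parameters" factor quoted in the abstract


def implied_params(baseline_params: float, reduction: float) -> float:
    """Parameter count implied by a given reduction factor."""
    return baseline_params / reduction


implied = implied_params(GPT3_PARAMS, REDUCTION)
print(f"{implied / 1e9:.1f}B")  # prints 7.0B
```

Reporting the exact count alongside the ratio would let readers check how much rounding the "25×" figure involves.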

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the positive assessment of our work's significance and for the constructive comment on the experiments. We address the point below.

Referee: [Experiments section] The central claim that retrieval from 2T tokens enables comparable performance to much larger models is load-bearing, but the manuscript does not report an ablation study that trains the model without the retrieval component (i.e., no chunked cross-attention to retrieved neighbors) while keeping the rest of the architecture, training data, and procedure identical. Without this, it is difficult to rule out that gains arise from the differentiable encoder or cross-attention mechanism rather than the retrieval itself.

Authors: We agree that an ablation keeping the full RETRO architecture and training procedure fixed while disabling retrieval would more cleanly isolate the contribution of the retrieved neighbors. The current manuscript compares RETRO to standard autoregressive transformers (including much larger models such as GPT-3) trained on the same data distribution, but these baselines necessarily differ in architecture because they lack the chunked cross-attention and encoder modules. To directly address the concern, we will add results for a controlled ablation in which a RETRO model is trained with identical architecture, data, and optimization but with retrieved chunks replaced by padding tokens (so that chunked cross-attention receives no useful neighbors). We expect this ablated model to perform similarly to a standard transformer of the same size; the revised manuscript will report the numbers and training curves. revision: yes
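The controlled ablation proposed in the response above could be set up roughly as follows. This is a minimal sketch, not the authors' code: the nested-list layout, the `PAD_ID` value, and the function name `ablate_neighbors` are hypothetical.

```python
PAD_ID = 0  # hypothetical padding token id


def ablate_neighbors(neighbors, pad_id=PAD_ID):
    """Replace every retrieved token with padding while preserving the nested
    shape [num_chunks][k_neighbors][neighbor_length], so the rest of the
    model runs unchanged but receives no useful retrieved content."""
    return [
        [[pad_id for _ in neighbor] for neighbor in per_chunk]
        for per_chunk in neighbors
    ]


# Two chunks, each with two retrieved neighbors of three tokens.
retrieved = [
    [[11, 12, 13], [21, 22, 23]],
    [[31, 32, 33], [41, 42, 43]],
]
ablated = ablate_neighbors(retrieved)
print(ablated)
```

Keeping the shapes identical means the ablated and full models share architecture and compute, so any performance gap can be attributed to the retrieved content.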

Circularity Check

0 steps flagged

No circularity: empirical comparison to external models on public benchmark

The paper's central claim is an empirical performance result: RETRO trained with retrieval from a 2T-token database matches GPT-3 and Jurassic-1 on the Pile despite 25x fewer parameters. This is demonstrated via training runs and direct comparison to independently published external models rather than any derivation, fitted parameter, or self-citation that reduces to the paper's own inputs by construction. The architecture (frozen BERT retriever plus chunked cross-attention) is described explicitly without self-definitional loops, and no load-bearing uniqueness theorem or ansatz is smuggled in. The result is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The performance claim rests on the assumption that the 2-trillion-token corpus is fixed and representative, that BERT embeddings provide useful similarity for retrieval, and that the chunked cross-attention mechanism can be trained stably; no new physical entities or ad-hoc constants are introduced beyond standard transformer hyperparameters.

axioms (1)

domain assumption: Nearest-neighbor retrieval on local context yields useful conditioning information for next-token prediction.
Invoked in the description of how the model conditions on retrieved chunks.

pith-pipeline@v0.9.0 · 5542 in / 1293 out tokens · 55589 ms · 2026-05-17T12:50:10.584771+00:00 · methodology

Forward citations

Cited by 18 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

A Unified Model and Document Representation for On-Device Retrieval-Augmented Generation
cs.IR 2026-04 unverdicted novelty 7.0

A single model unifies retrieval and context compression for on-device RAG via shared representations, matching traditional RAG performance at 1/10 context size with no extra storage.

A Generalist Agent
cs.AI 2022-05 accept novelty 7.0

Gato is a multi-modal, multi-task, multi-embodiment generalist policy using one transformer network to handle text, vision, games, and robotics tasks.

OPT: Open Pre-trained Transformer Language Models
cs.CL 2022-05 unverdicted novelty 7.0

OPT releases open decoder-only transformers up to 175B parameters that match GPT-3 performance at one-seventh the carbon cost, along with code and training logs.

When AI reviews science: Can we trust the referee?
cs.AI 2026-04 unverdicted novelty 6.0

AI peer review systems are vulnerable to prompt injections, prestige biases, assertion strength effects, and contextual poisoning, as demonstrated by a new attack taxonomy and causal experiments on real conference sub...

RAG-GNN: Integrating Retrieved Knowledge with Graph Neural Networks for Precision Medicine
q-bio.MN 2026-01 unverdicted novelty 6.0

RAG-GNN augments GNNs with retrieved literature knowledge via gated fusion to improve functional clustering of 379 proteins in cancer signaling networks, raising silhouette score by 0.093.

RAPTOR: Recursive Abstractive Processing for Tree-Organized Retrieval
cs.CL 2024-01 unverdicted novelty 6.0

RAPTOR introduces a tree-organized retrieval method using recursive abstractive summaries, achieving a 20% absolute accuracy improvement on the QuALITY benchmark when paired with GPT-4.

REPLUG: Retrieval-Augmented Black-Box Language Models
cs.CL 2023-01 conditional novelty 6.0

REPLUG improves frozen black-box LMs by prepending LM-supervised retrieved documents, delivering 6.3% better language modeling on GPT-3 and 5.1% better five-shot MMLU on Codex.

Atlas: Few-shot Learning with Retrieval Augmented Language Models
cs.CL 2022-08 unverdicted novelty 6.0

Atlas reaches over 42% accuracy on Natural Questions with only 64 examples, outperforming a 540B-parameter model by 3% with 50x fewer parameters.

Language Models (Mostly) Know What They Know
cs.CL 2022-07 unverdicted novelty 6.0

Language models show good calibration when asked to estimate the probability that their own answers are correct, with performance improving as models get larger.

Emergent Abilities of Large Language Models
cs.CL 2022-06 unverdicted novelty 6.0

Emergent abilities are capabilities present in large language models but absent in smaller ones and cannot be predicted by extrapolating smaller model performance.

Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback
cs.CL 2022-04 unverdicted novelty 6.0

RLHF alignment training on language models boosts NLP performance, supports skill specialization, enables weekly online updates with fresh human data, and shows a linear relation between RL reward and sqrt(KL divergen...

PaLM: Scaling Language Modeling with Pathways
cs.CL 2022-04 accept novelty 6.0

PaLM 540B demonstrates continued scaling benefits by setting new few-shot SOTA results on hundreds of benchmarks and outperforming humans on BIG-bench.

LaMDA: Language Models for Dialog Applications
cs.CL 2022-01 unverdicted novelty 6.0

LaMDA shows that fine-tuning on human-value annotations and consulting external knowledge sources significantly improves safety and factual grounding in large dialog models beyond what scaling alone achieves.

Small Language Models are the Future of Agentic AI
cs.AI 2025-06 unverdicted novelty 5.0

Small language models are sufficiently capable, more suitable, and far more economical than large models for the repetitive tasks that dominate agentic AI systems.

Galactica: A Large Language Model for Science
cs.CL 2022-11 unverdicted novelty 5.0

Galactica, a science-specialized LLM, reports higher scores than GPT-3, Chinchilla, and PaLM on LaTeX knowledge, mathematical reasoning, and medical QA benchmarks while outperforming general models on BIG-bench.

Reducing Redundancy in Retrieval-Augmented Generation through Chunk Filtering
cs.CL 2026-04 unverdicted novelty 4.0

Entity-based chunk filtering reduces RAG vector index size by 25-36% with retrieval quality near baseline levels.

KnowPilot: Your Knowledge-Driven Copilot for Domain Tasks
cs.SE 2026-04 unverdicted novelty 4.0

KnowPilot integrates knowledge retrieval and memory systems into generative agents to achieve better results on domain-specific tasks such as text generation.

Less LLM, More Documents: Searching for Improved RAG
cs.IR 2025-10 unverdicted novelty 4.0

Corpus scaling in RAG frequently matches the accuracy gains from larger LLMs on open-domain QA tasks, with mid-sized models benefiting most due to better passage coverage.
