pith. machine review for the scientific record.

arxiv: 2112.04426 · v3 · pith:VVRFZRCWnew · submitted 2021-12-08 · 💻 cs.CL · cs.LG

Improving language models by retrieving from trillions of tokens

Pith reviewed 2026-05-17 12:50 UTC · model grok-4.3

classification 💻 cs.CL cs.LG
keywords retrieval · language models · transformers · RETRO · trillion tokens · efficient training · question answering · Pile benchmark

The pith

Retrieval from a 2 trillion token database lets language models match GPT-3 performance with 25 times fewer parameters.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that conditioning auto-regressive language models on document chunks retrieved from a very large corpus, selected by local similarity to the preceding tokens, yields strong performance. RETRO achieves results on the Pile benchmark comparable to those of much larger models such as GPT-3 while using far fewer parameters. The gains carry over, after fine-tuning, to tasks such as question answering, and retrieval can also be added to already pre-trained models. Through explicit retrieval, the model predicts tokens based on an order of magnitude more data than is typically consumed during training.

Core claim

RETRO is a Retrieval-Enhanced Transformer that conditions on document chunks retrieved from a 2 trillion token database using a frozen BERT retriever and chunked cross-attention, achieving performance on par with GPT-3 and Jurassic-1 on the Pile despite 25 times fewer parameters.

What carries the argument

Chunked cross-attention mechanism that integrates information from retrieved chunks into the transformer's predictions, combined with a frozen retriever and differentiable encoder.
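
As a rough illustration of what chunked cross-attention does, the sketch below splits a sequence of hidden vectors into fixed-size chunks and lets the tokens of each chunk attend only to neighbors retrieved for the preceding chunk, so that retrieval never depends on tokens the model has yet to predict. This is a simplified, single-head, pure-Python sketch and not the paper's implementation: the function names, the missing learned projections, the whole-chunk offset, and the residual addition are all assumptions made for illustration.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys_values):
    """Single-head dot-product attention with keys == values (no learned projections)."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, kv)) / scale for kv in keys_values]
    weights = softmax(scores)
    return [sum(w * kv[d] for w, kv in zip(weights, keys_values))
            for d in range(len(query))]

def chunked_cross_attention(hidden, neighbors, chunk_size):
    """hidden: list of token vectors; neighbors[u]: encoded vectors retrieved for chunk u.

    Tokens in chunk u attend to neighbors[u - 1]. Chunk 0 has no retrieved
    context and passes through unchanged. The attention output is added
    to the token vector as a residual (an assumption of this sketch).
    """
    out = []
    for i, vec in enumerate(hidden):
        u = i // chunk_size
        if u == 0 or not neighbors[u - 1]:
            out.append(list(vec))
            continue
        ctx = attend(vec, neighbors[u - 1])
        out.append([h + c for h, c in zip(vec, ctx)])
    return out

# Toy usage: four 2-dimensional tokens, chunks of two tokens, one neighbor per chunk.
hidden = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
neighbors = [[[1.0, 0.0]], [[0.0, 1.0]]]
mixed = chunked_cross_attention(hidden, neighbors, chunk_size=2)
```

In this toy run the first chunk is returned unchanged, while the second chunk's tokens are shifted by the single neighbor vector retrieved for the first chunk.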

If this is right

  • RETRO can be trained from scratch or used to retrofit existing pre-trained transformers.
  • After fine-tuning, it improves on downstream knowledge-intensive tasks such as question answering.
  • The method scales by accessing more data explicitly rather than increasing model size.
  • New avenues open for language models that use large-scale explicit memory.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Retrieval databases could become a standard complement to model parameters for efficiency.
  • Similar retrieval techniques might extend to other sequence modeling tasks beyond language.
  • Further scaling the retrieval database size could yield additional performance gains without proportional parameter increases.

Load-bearing premise

Nearest-neighbor retrieval based on local similarity with preceding tokens supplies sufficiently relevant and non-redundant information to improve next-token prediction at scale.
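
To make the premise concrete, the following sketch retrieves the stored chunks most similar to a query chunk and returns each together with the text that followed it in the corpus. It is a toy under stated assumptions: the bag-of-words `embed` function stands in for the frozen BERT retriever, exact cosine ranking stands in for whatever nearest-neighbor search is used at 2 trillion tokens, and the names `build_database` and `retrieve` are hypothetical.

```python
import math
from collections import Counter

def embed(tokens):
    """Toy stand-in for a frozen encoder: a bag-of-words count vector."""
    return Counter(tokens)

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def build_database(corpus_tokens, chunk_size):
    """Split a corpus into consecutive chunks; store each chunk with its continuation."""
    chunks = [corpus_tokens[i:i + chunk_size]
              for i in range(0, len(corpus_tokens), chunk_size)]
    db = []
    for idx, chunk in enumerate(chunks):
        continuation = chunks[idx + 1] if idx + 1 < len(chunks) else []
        db.append((embed(chunk), chunk, continuation))
    return db

def retrieve(query_chunk, db, k):
    """Return the k (chunk, continuation) pairs whose keys best match the query chunk."""
    q = embed(query_chunk)
    ranked = sorted(db, key=lambda entry: cosine(q, entry[0]), reverse=True)
    return [(chunk, continuation) for _, chunk, continuation in ranked[:k]]

# Toy usage with a 12-token corpus split into chunks of four tokens.
corpus = "the cat sat on the mat the dog ran in the park".split()
db = build_database(corpus, chunk_size=4)
top = retrieve(["a", "cat", "sat", "down"], db, k=1)
```

Returning the continuation alongside the matched chunk reflects the intuition behind the premise: what followed a similar context elsewhere is a plausible hint about what comes next here.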

What would settle it

If RETRO with the 2 trillion token database fails to reach performance comparable to GPT-3 on the Pile while using 25 times fewer parameters, the central claim would be falsified.

Original abstract

We enhance auto-regressive language models by conditioning on document chunks retrieved from a large corpus, based on local similarity with preceding tokens. With a $2$ trillion token database, our Retrieval-Enhanced Transformer (RETRO) obtains comparable performance to GPT-3 and Jurassic-1 on the Pile, despite using 25$\times$ fewer parameters. After fine-tuning, RETRO performance translates to downstream knowledge-intensive tasks such as question answering. RETRO combines a frozen Bert retriever, a differentiable encoder and a chunked cross-attention mechanism to predict tokens based on an order of magnitude more data than what is typically consumed during training. We typically train RETRO from scratch, yet can also rapidly RETROfit pre-trained transformers with retrieval and still achieve good performance. Our work opens up new avenues for improving language models through explicit memory at unprecedented scale.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper introduces the Retrieval-Enhanced Transformer (RETRO), which conditions autoregressive language models on document chunks retrieved from a 2 trillion token database based on local similarity with preceding tokens. Using a frozen BERT retriever, a differentiable encoder, and chunked cross-attention, RETRO achieves performance comparable to GPT-3 and Jurassic-1 on the Pile benchmark despite using 25× fewer parameters. It also demonstrates improvements on downstream tasks after fine-tuning and the ability to retrofit pre-trained models.

Significance. If the results hold, the work is significant because it shows that retrieval from trillions of tokens can allow much smaller models to match the performance of state-of-the-art large language models on language modeling tasks. This suggests a promising direction for scaling language models via explicit memory augmentation at unprecedented scale, rather than solely through increasing model parameters. The practical aspects, such as retrofitting existing models, add to its potential impact.

major comments (1)
  1. [Experiments section] The central claim that retrieval from 2T tokens enables comparable performance to much larger models is load-bearing, but the manuscript does not report an ablation study that trains the model without the retrieval component (i.e., no chunked cross-attention to retrieved neighbors) while keeping the rest of the architecture, training data, and procedure identical. Without this, it is difficult to rule out that gains arise from the differentiable encoder or cross-attention mechanism rather than the retrieval itself.
minor comments (2)
  1. Clarify the exact parameter count of the RETRO model used in the main comparison to make the '25× fewer parameters' claim precise.
  2. [Abstract] The phrase 'an order of magnitude more data' could be quantified with specific numbers for training tokens in RETRO versus standard models.
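
For orientation on the first minor comment, the short calculation below shows what a "25× fewer parameters" claim implies if the baseline is taken to be the largest GPT-3 model at 175 billion parameters. The baseline choice and the helper name `implied_parameter_count` are assumptions of this sketch. The page does not state which RETRO model the comparison uses, so the result is an implied figure, not a reported one.

```python
def implied_parameter_count(baseline_params, reduction_factor):
    """Parameter count implied by a 'reduction_factor times fewer' claim."""
    return baseline_params / reduction_factor

GPT3_PARAMS = 175e9  # largest GPT-3 model, as publicly reported (assumed baseline)

implied = implied_parameter_count(GPT3_PARAMS, 25)
print(f"Implied RETRO size: {implied / 1e9:.1f}B parameters")
```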

Simulated Authors' Rebuttal

1 response · 0 unresolved

We thank the referee for the positive assessment of our work's significance and for the constructive comment on the experiments. We address the point below.

point-by-point responses
  1. Referee: [Experiments section] The central claim that retrieval from 2T tokens enables comparable performance to much larger models is load-bearing, but the manuscript does not report an ablation study that trains the model without the retrieval component (i.e., no chunked cross-attention to retrieved neighbors) while keeping the rest of the architecture, training data, and procedure identical. Without this, it is difficult to rule out that gains arise from the differentiable encoder or cross-attention mechanism rather than the retrieval itself.

    Authors: We agree that an ablation keeping the full RETRO architecture and training procedure fixed while disabling retrieval would more cleanly isolate the contribution of the retrieved neighbors. The current manuscript compares RETRO to standard autoregressive transformers (including much larger models such as GPT-3) trained on the same data distribution, but these baselines necessarily differ in architecture because they lack the chunked cross-attention and encoder modules. To directly address the concern, we will add results for a controlled ablation in which a RETRO model is trained with identical architecture, data, and optimization but with retrieved chunks replaced by padding tokens (so that chunked cross-attention receives no useful neighbors). We expect this ablated model to perform similarly to a standard transformer of the same size; the revised manuscript will report the numbers and training curves.

    Revision: yes
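
The controlled ablation described in this response could be set up along the following lines: the retrieved neighbor token ids are overwritten with a padding id, so the input shapes and the architecture stay unchanged while the retrieved content carries no information. The names `PAD_ID` and `mask_retrieval` are hypothetical, and the nested-list layout (chunks, then neighbors, then tokens) is an assumption of this sketch, not a description of the authors' code.

```python
PAD_ID = 0  # hypothetical padding token id

def mask_retrieval(neighbors, pad_id=PAD_ID):
    """Replace every retrieved token id with pad_id, preserving the nested shape.

    neighbors[u][j] is the token-id list of the j-th neighbor retrieved for chunk u.
    A new structure is returned; the input is left untouched.
    """
    return [[[pad_id for _ in neighbor] for neighbor in chunk_neighbors]
            for chunk_neighbors in neighbors]

# Toy usage: two chunks, two neighbors each, three tokens per neighbor.
retrieved = [[[5, 6, 7], [8, 9, 10]], [[11, 12, 13], [14, 15, 16]]]
ablated = mask_retrieval(retrieved)
```

Keeping the shapes identical means the ablated run exercises the same modules and compute as the full model, which is the property the referee's comment asks for.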

Circularity Check

0 steps flagged

No circularity: empirical comparison to external models on public benchmark

full rationale

The paper's central claim is an empirical performance result: RETRO trained with retrieval from a 2T-token database matches GPT-3 and Jurassic-1 on the Pile despite 25× fewer parameters. This is demonstrated via training runs and direct comparison to independently published external models rather than any derivation, fitted parameter, or self-citation that reduces to the paper's own inputs by construction. The architecture (frozen BERT retriever plus chunked cross-attention) is described explicitly without self-definitional loops, and no load-bearing uniqueness theorem or ansatz is smuggled in. The result is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The performance claim rests on the assumption that the 2-trillion-token corpus is fixed and representative, that BERT embeddings provide useful similarity for retrieval, and that the chunked cross-attention mechanism can be trained stably; no new physical entities or ad-hoc constants are introduced beyond standard transformer hyperparameters.

axioms (1)
  • domain assumption: Nearest-neighbor retrieval on local context yields useful conditioning information for next-token prediction
    Invoked in the description of how the model conditions on retrieved chunks

pith-pipeline@v0.9.0 · 5542 in / 1293 out tokens · 55589 ms · 2026-05-17T12:50:10.584771+00:00 · methodology


Forward citations

Cited by 18 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. A Unified Model and Document Representation for On-Device Retrieval-Augmented Generation

    cs.IR 2026-04 unverdicted novelty 7.0

    A single model unifies retrieval and context compression for on-device RAG via shared representations, matching traditional RAG performance at 1/10 context size with no extra storage.

  2. A Generalist Agent

    cs.AI 2022-05 accept novelty 7.0

    Gato is a multi-modal, multi-task, multi-embodiment generalist policy using one transformer network to handle text, vision, games, and robotics tasks.

  3. OPT: Open Pre-trained Transformer Language Models

    cs.CL 2022-05 unverdicted novelty 7.0

    OPT releases open decoder-only transformers up to 175B parameters that match GPT-3 performance at one-seventh the carbon cost, along with code and training logs.

  4. When AI reviews science: Can we trust the referee?

    cs.AI 2026-04 unverdicted novelty 6.0

    AI peer review systems are vulnerable to prompt injections, prestige biases, assertion strength effects, and contextual poisoning, as demonstrated by a new attack taxonomy and causal experiments on real conference sub...

  5. RAG-GNN: Integrating Retrieved Knowledge with Graph Neural Networks for Precision Medicine

    q-bio.MN 2026-01 unverdicted novelty 6.0

    RAG-GNN augments GNNs with retrieved literature knowledge via gated fusion to improve functional clustering of 379 proteins in cancer signaling networks, raising silhouette score by 0.093.

  6. RAPTOR: Recursive Abstractive Processing for Tree-Organized Retrieval

    cs.CL 2024-01 unverdicted novelty 6.0

    RAPTOR introduces a tree-organized retrieval method using recursive abstractive summaries, achieving a 20% absolute accuracy improvement on the QuALITY benchmark when paired with GPT-4.

  7. REPLUG: Retrieval-Augmented Black-Box Language Models

    cs.CL 2023-01 conditional novelty 6.0

    REPLUG improves frozen black-box LMs by prepending LM-supervised retrieved documents, delivering 6.3% better language modeling on GPT-3 and 5.1% better five-shot MMLU on Codex.

  8. Atlas: Few-shot Learning with Retrieval Augmented Language Models

    cs.CL 2022-08 unverdicted novelty 6.0

    Atlas reaches over 42% accuracy on Natural Questions with only 64 examples, outperforming a 540B-parameter model by 3% with 50x fewer parameters.

  9. Language Models (Mostly) Know What They Know

    cs.CL 2022-07 unverdicted novelty 6.0

    Language models show good calibration when asked to estimate the probability that their own answers are correct, with performance improving as models get larger.

  10. Emergent Abilities of Large Language Models

    cs.CL 2022-06 unverdicted novelty 6.0

    Emergent abilities are capabilities present in large language models but absent in smaller ones and cannot be predicted by extrapolating smaller model performance.

  11. Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback

    cs.CL 2022-04 unverdicted novelty 6.0

    RLHF alignment training on language models boosts NLP performance, supports skill specialization, enables weekly online updates with fresh human data, and shows a linear relation between RL reward and sqrt(KL divergen...

  12. PaLM: Scaling Language Modeling with Pathways

    cs.CL 2022-04 accept novelty 6.0

    PaLM 540B demonstrates continued scaling benefits by setting new few-shot SOTA results on hundreds of benchmarks and outperforming humans on BIG-bench.

  13. LaMDA: Language Models for Dialog Applications

    cs.CL 2022-01 unverdicted novelty 6.0

    LaMDA shows that fine-tuning on human-value annotations and consulting external knowledge sources significantly improves safety and factual grounding in large dialog models beyond what scaling alone achieves.

  14. Small Language Models are the Future of Agentic AI

    cs.AI 2025-06 unverdicted novelty 5.0

    Small language models are sufficiently capable, more suitable, and far more economical than large models for the repetitive tasks that dominate agentic AI systems.

  15. Galactica: A Large Language Model for Science

    cs.CL 2022-11 unverdicted novelty 5.0

    Galactica, a science-specialized LLM, reports higher scores than GPT-3, Chinchilla, and PaLM on LaTeX knowledge, mathematical reasoning, and medical QA benchmarks while outperforming general models on BIG-bench.

  16. Reducing Redundancy in Retrieval-Augmented Generation through Chunk Filtering

    cs.CL 2026-04 unverdicted novelty 4.0

    Entity-based chunk filtering reduces RAG vector index size by 25-36% with retrieval quality near baseline levels.

  17. KnowPilot: Your Knowledge-Driven Copilot for Domain Tasks

    cs.SE 2026-04 unverdicted novelty 4.0

    KnowPilot integrates knowledge retrieval and memory systems into generative agents to achieve better results on domain-specific tasks such as text generation.

  18. Less LLM, More Documents: Searching for Improved RAG

    cs.IR 2025-10 unverdicted novelty 4.0

    Corpus scaling in RAG frequently matches the accuracy gains from larger LLMs on open-domain QA tasks, with mid-sized models benefiting most due to better passage coverage.
