{"paper":{"title":"Improving language models by retrieving from trillions of tokens","license":"http://creativecommons.org/licenses/by/4.0/","headline":"Retrieval from a 2 trillion token database lets language models match GPT-3 performance with 25 times fewer parameters.","cross_cats":["cs.LG"],"primary_cat":"cs.CL","authors_text":"Aidan Clark, Albin Cassirer, Andy Brock, Arthur Mensch, Aurelia Guy, Bogdan Damoc, Chris Jones, Diego de las Casas, Eliza Rutherford, Erich Elsen, Geoffrey Irving, George van den Driessche, Jack W. Rae, Jacob Menick, Jean-Baptiste Lespiau, Jordan Hoffmann, Karen Simonyan, Katie Millican, Laurent Sifre, Loren Maggiore, Michela Paganini, Oriol Vinyals, Roman Ring, Saffron Huang, Sebastian Borgeaud, Simon Osindero, Tom Hennigan, Trevor Cai","submitted_at":"2021-12-08T17:32:34Z","abstract_excerpt":"We enhance auto-regressive language models by conditioning on document chunks retrieved from a large corpus, based on local similarity with preceding tokens. With a $2$ trillion token database, our Retrieval-Enhanced Transformer (RETRO) obtains comparable performance to GPT-3 and Jurassic-1 on the Pile, despite using 25$\\times$ fewer parameters. After fine-tuning, RETRO performance translates to downstream knowledge-intensive tasks such as question answering. 
RETRO combines a frozen Bert retriever, a differentiable encoder and a chunked cross-attention mechanism to predict tokens based on an o"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"With a 2 trillion token database, our Retrieval-Enhanced Transformer (RETRO) obtains comparable performance to GPT-3 and Jurassic-1 on the Pile, despite using 25× fewer parameters.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That nearest-neighbor retrieval based on local similarity with preceding tokens supplies sufficiently relevant and non-redundant information to improve next-token prediction at scale.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"RETRO matches GPT-3 and Jurassic-1 performance on the Pile benchmark using 25 times fewer parameters by conditioning on retrieved chunks from a 2-trillion-token database.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"Retrieval from a 2 trillion token database lets language models match GPT-3 performance with 25 times fewer parameters.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"9ce38e4199f9da8e8672722e3267f603abfb28767d58094123200af4a60bc16b"},"source":{"id":"2112.04426","kind":"arxiv","version":3},"verdict":{"id":"c2b40f6c-01fa-43a8-8dc8-c5829ccd16a7","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-17T12:50:10.584771Z","strongest_claim":"With a 2 trillion token database, our Retrieval-Enhanced Transformer (RETRO) obtains comparable performance to GPT-3 and Jurassic-1 on the Pile, despite using 25× fewer parameters.","one_line_summary":"RETRO matches GPT-3 and Jurassic-1 performance on the Pile benchmark using 25 times fewer parameters 
by conditioning on retrieved chunks from a 2-trillion-token database.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That nearest-neighbor retrieval based on local similarity with preceding tokens supplies sufficiently relevant and non-redundant information to improve next-token prediction at scale.","pith_extraction_headline":"Retrieval from a 2 trillion token database lets language models match GPT-3 performance with 25 times fewer parameters."},"references":{"count":115,"sample":[{"doi":"","year":2016,"title":"M. Abadi, A. Chu, I. Goodfellow, H. B. McMahan, I. Mironov, K. Talwar, and L. Zhang. Deep learning with differential privacy. In ACM SIGSAC Conference on Computer and Communications Security, 2016","work_id":"58cb949a-d4d0-4d36-8e64-cbaf72a383bc","ref_index":1,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2019,"title":"A. Baevski and M. Auli. Adaptive input representations for neural language modeling. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=ByxZX20qFQ","work_id":"0eba6dc3-5447-4995-bde9-4e9d4b16d8c9","ref_index":3,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2021,"title":"E. M. Bender, T. Gebru, A. McMillan-Major, and S. Shmitchell. On the dangers of stochastic parrots: Can language models be too big? In ACM Conference on Fairness, Accountability, and Transparency, 202","work_id":"06746417-8384-498d-a63b-83b9dd5cd00f","ref_index":5,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2003,"title":"D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet Allocation. Journal of Machine Learning Research, 3(Jan):993--1022, 2003. URL https://jmlr.csail.mit.edu/papers/v3/blei03a.html","work_id":"4a09d8db-3a2b-4bde-bac9-434a09d5e807","ref_index":6,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2018,"title":"J. Bradbury, R. Frostig, P. Hawkins, M. J. Johnson, C. Leary, D. Maclaurin, G. Necula, A. Paszke, 
J. VanderPlas, S. Wanderman-Milne, and Q. Zhang. JAX: composable transformations of Python+Num","work_id":"a93701d4-1bcc-419f-9557-2f43fff982f9","ref_index":7,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":115,"snapshot_sha256":"f9f1ce20b866c542128dfe2338a6588cb76fca73be7f4c37657a5c69bc562389","internal_anchors":8},"formal_canon":{"evidence_count":2,"snapshot_sha256":"54c2ae5bef4b69ed7234539379aa6620b02e179153438447c02c01c348f23770"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"}