Improving language models by retrieving from trillions of tokens
Pith reviewed 2026-05-17 12:50 UTC · model grok-4.3
The pith
Retrieval from a 2 trillion token database lets language models match GPT-3 performance with 25 times fewer parameters.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
RETRO is a Retrieval-Enhanced Transformer that conditions on document chunks retrieved from a 2 trillion token database using a frozen BERT retriever and chunked cross-attention, achieving performance on par with GPT-3 and Jurassic-1 on the Pile despite 25 times fewer parameters.
What carries the argument
Chunked cross-attention mechanism that integrates information from retrieved chunks into the transformer's predictions, combined with a frozen retriever and differentiable encoder.
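To make the mechanism concrete, here is a minimal sketch of chunk-wise cross-attention in plain Python. It is not the paper's implementation. The function names, the single attention head, the missing learned projections, and the alignment rule (tokens in chunk u attend to neighbors retrieved for chunk u-1, so no token sees retrieval based on text that has not yet been produced) are all simplifying assumptions made for illustration.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross_attend(query, memory):
    """Attend from one token vector to encoded neighbor vectors.
    Assumption: keys and values are the same vectors (no learned projections)."""
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, m) / scale for m in memory])
    return [sum(w * m[d] for w, m in zip(weights, memory))
            for d in range(len(query))]

def chunked_cross_attention(hidden, neighbors, chunk_size):
    """hidden: list of token vectors for one sequence.
    neighbors[u]: encoded neighbor vectors retrieved for chunk u (hypothetical layout).
    Tokens in the first chunk, or with no neighbors available, pass through unchanged."""
    out = []
    for i, h in enumerate(hidden):
        u = i // chunk_size
        if u == 0 or not neighbors[u - 1]:
            out.append(list(h))
            continue
        attended = cross_attend(h, neighbors[u - 1])
        out.append([a + b for a, b in zip(h, attended)])  # residual connection
    return out
```

The point of the sketch is the chunk-level structure: each token attends only to the neighbors associated with one earlier chunk, not to the whole retrieval database, so the attention cost per token depends on the number and length of retrieved neighbors and not on the database size.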
If this is right
- RETRO can be trained from scratch or used to retrofit existing pre-trained transformers.
- After fine-tuning, its gains carry over to downstream knowledge-intensive tasks such as question answering.
- The method scales by accessing more data explicitly rather than increasing model size.
- New avenues open for language models that use large-scale explicit memory.
Where Pith is reading between the lines
- Retrieval databases could become a standard complement to model parameters for efficiency.
- Similar retrieval techniques might extend to other sequence modeling tasks beyond language.
- Further scaling the retrieval database size could yield additional performance gains without proportional parameter increases.
Load-bearing premise
Nearest-neighbor retrieval based on local similarity with preceding tokens supplies sufficiently relevant and non-redundant information to improve next-token prediction at scale.
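The premise can be illustrated with a toy nearest-neighbor lookup over chunks. This is a sketch under stated assumptions, not the paper's retriever: the bag-of-words `embed` function stands in for the frozen BERT encoder, the chunk size of 4 is arbitrary, and all function names are hypothetical.

```python
import math
from collections import Counter

CHUNK_SIZE = 4  # assumption: arbitrary toy value

def split_into_chunks(tokens, size=CHUNK_SIZE):
    """Cut a token sequence into consecutive fixed-size chunks."""
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]

def embed(chunk):
    """Stand-in for a frozen encoder: bag-of-words counts."""
    return Counter(chunk)

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query_chunk, database, k=2):
    """Return the k database chunks most similar to the preceding-token chunk."""
    q = embed(query_chunk)
    ranked = sorted(database, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]

database = split_into_chunks(
    "the cat sat on the mat while the dog ran in the park".split()
)
query = ["the", "cat", "sat", "down"]
neighbors = retrieve(query, database, k=2)
```

A brute-force sort like this is only workable for a toy database; at the scale of trillions of tokens an approximate nearest-neighbor index would be needed. The premise under test is whether neighbors ranked by this kind of local similarity carry information that helps predict the next tokens.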
What would settle it
If a RETRO model with 25 times fewer parameters than GPT-3, retrieving from the 2 trillion token database, failed to reach comparable performance on the Pile, the central claim would be falsified.
Original abstract
We enhance auto-regressive language models by conditioning on document chunks retrieved from a large corpus, based on local similarity with preceding tokens. With a 2 trillion token database, our Retrieval-Enhanced Transformer (RETRO) obtains comparable performance to GPT-3 and Jurassic-1 on the Pile, despite using 25× fewer parameters. After fine-tuning, RETRO performance translates to downstream knowledge-intensive tasks such as question answering. RETRO combines a frozen BERT retriever, a differentiable encoder and a chunked cross-attention mechanism to predict tokens based on an order of magnitude more data than what is typically consumed during training. We typically train RETRO from scratch, yet can also rapidly RETROfit pre-trained transformers with retrieval and still achieve good performance. Our work opens up new avenues for improving language models through explicit memory at unprecedented scale.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces the Retrieval-Enhanced Transformer (RETRO), which conditions autoregressive language models on document chunks retrieved from a 2 trillion token database based on local similarity with preceding tokens. Using a frozen BERT retriever, a differentiable encoder, and chunked cross-attention, RETRO achieves performance comparable to GPT-3 and Jurassic-1 on the Pile benchmark despite using 25× fewer parameters. It also demonstrates improvements on downstream tasks after fine-tuning and the ability to retrofit pre-trained models.
Significance. If the results hold, the work is significant because it shows that retrieval from trillions of tokens can allow much smaller models to match the performance of state-of-the-art large language models on language modeling tasks. This suggests a promising direction for scaling language models via explicit memory augmentation at unprecedented scale, rather than solely through increasing model parameters. The practical aspects, such as retrofitting existing models, add to its potential impact.
major comments (1)
- [Experiments section] The central claim that retrieval from 2T tokens enables comparable performance to much larger models is load-bearing, but the manuscript does not report an ablation study that trains the model without the retrieval component (i.e., no chunked cross-attention to retrieved neighbors) while keeping the rest of the architecture, training data, and procedure identical. Without this, it is difficult to rule out that gains arise from the differentiable encoder or cross-attention mechanism rather than the retrieval itself.
minor comments (2)
- Clarify the exact parameter count of the RETRO model used in the main comparison to make the '25× fewer parameters' claim precise.
- [Abstract] The phrase 'an order of magnitude more data' could be quantified with specific numbers for training tokens in RETRO versus standard models.
Simulated Author's Rebuttal
We thank the referee for the positive assessment of our work's significance and for the constructive comment on the experiments. We address the point below.
Point-by-point responses
- Referee: [Experiments section] The central claim that retrieval from 2T tokens enables comparable performance to much larger models is load-bearing, but the manuscript does not report an ablation study that trains the model without the retrieval component (i.e., no chunked cross-attention to retrieved neighbors) while keeping the rest of the architecture, training data, and procedure identical. Without this, it is difficult to rule out that gains arise from the differentiable encoder or cross-attention mechanism rather than the retrieval itself.
Authors: We agree that an ablation keeping the full RETRO architecture and training procedure fixed while disabling retrieval would more cleanly isolate the contribution of the retrieved neighbors. The current manuscript compares RETRO to standard autoregressive transformers (including much larger models such as GPT-3) trained on the same data distribution, but these baselines necessarily differ in architecture because they lack the chunked cross-attention and encoder modules. To directly address the concern, we will add results for a controlled ablation in which a RETRO model is trained with identical architecture, data, and optimization but with retrieved chunks replaced by padding tokens (so that chunked cross-attention receives no useful neighbors). We expect this ablated model to perform similarly to a standard transformer of the same size; the revised manuscript will report the numbers and training curves.
Revision: yes
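The proposed control can be sketched as a small data-pipeline switch. This is an illustrative assumption about how such an ablation might be wired, not code from the paper: `PAD_ID`, the function name, and the nested-list layout of the retrieved neighbors are all hypothetical.

```python
PAD_ID = 0  # assumption: id of the padding token in the vocabulary

def neighbors_for_model(retrieved, ablate=False, pad_id=PAD_ID):
    """retrieved[u][j] is the token-id list of the j-th neighbor of chunk u.

    With ablate=True, every neighbor is replaced by padding of the same
    length, so the cross-attention path keeps its shapes and parameter
    count but receives no information from the database.
    """
    if not ablate:
        return retrieved
    return [
        [[pad_id] * len(neighbor) for neighbor in chunk_neighbors]
        for chunk_neighbors in retrieved
    ]

retrieved = [[[11, 12, 13], [21, 22, 23]], [[31, 32, 33], [41, 42, 43]]]
control = neighbors_for_model(retrieved, ablate=True)
```

Keeping the tensor shapes identical is what makes the comparison controlled: any remaining gap between the two runs can then be attributed to the content of the retrieved chunks and not to the extra encoder or cross-attention parameters.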
Circularity Check
No circularity: empirical comparison to external models on public benchmark
full rationale
The paper's central claim is an empirical performance result: RETRO trained with retrieval from a 2T-token database matches GPT-3 and Jurassic-1 on the Pile despite 25x fewer parameters. This is demonstrated via training runs and direct comparison to independently published external models rather than any derivation, fitted parameter, or self-citation that reduces to the paper's own inputs by construction. The architecture (frozen BERT retriever plus chunked cross-attention) is described explicitly without self-definitional loops, and no load-bearing uniqueness theorem or ansatz is smuggled in. The result is self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Nearest-neighbor retrieval on local context yields useful conditioning information for next-token prediction
Forward citations
Cited by 18 Pith papers
- A Unified Model and Document Representation for On-Device Retrieval-Augmented Generation
  A single model unifies retrieval and context compression for on-device RAG via shared representations, matching traditional RAG performance at 1/10 context size with no extra storage.
- A Generalist Agent
  Gato is a multi-modal, multi-task, multi-embodiment generalist policy using one transformer network to handle text, vision, games, and robotics tasks.
- OPT: Open Pre-trained Transformer Language Models
  OPT releases open decoder-only transformers up to 175B parameters that match GPT-3 performance at one-seventh the carbon cost, along with code and training logs.
- When AI reviews science: Can we trust the referee?
  AI peer review systems are vulnerable to prompt injections, prestige biases, assertion strength effects, and contextual poisoning, as demonstrated by a new attack taxonomy and causal experiments on real conference sub...
- RAG-GNN: Integrating Retrieved Knowledge with Graph Neural Networks for Precision Medicine
  RAG-GNN augments GNNs with retrieved literature knowledge via gated fusion to improve functional clustering of 379 proteins in cancer signaling networks, raising silhouette score by 0.093.
- RAPTOR: Recursive Abstractive Processing for Tree-Organized Retrieval
  RAPTOR introduces a tree-organized retrieval method using recursive abstractive summaries, achieving a 20% absolute accuracy improvement on the QuALITY benchmark when paired with GPT-4.
- REPLUG: Retrieval-Augmented Black-Box Language Models
  REPLUG improves frozen black-box LMs by prepending LM-supervised retrieved documents, delivering 6.3% better language modeling on GPT-3 and 5.1% better five-shot MMLU on Codex.
- Atlas: Few-shot Learning with Retrieval Augmented Language Models
  Atlas reaches over 42% accuracy on Natural Questions with only 64 examples, outperforming a 540B-parameter model by 3% with 50x fewer parameters.
- Language Models (Mostly) Know What They Know
  Language models show good calibration when asked to estimate the probability that their own answers are correct, with performance improving as models get larger.
- Emergent Abilities of Large Language Models
  Emergent abilities are capabilities present in large language models but absent in smaller ones and cannot be predicted by extrapolating smaller model performance.
- Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback
  RLHF alignment training on language models boosts NLP performance, supports skill specialization, enables weekly online updates with fresh human data, and shows a linear relation between RL reward and sqrt(KL divergen...
- PaLM: Scaling Language Modeling with Pathways
  PaLM 540B demonstrates continued scaling benefits by setting new few-shot SOTA results on hundreds of benchmarks and outperforming humans on BIG-bench.
- LaMDA: Language Models for Dialog Applications
  LaMDA shows that fine-tuning on human-value annotations and consulting external knowledge sources significantly improves safety and factual grounding in large dialog models beyond what scaling alone achieves.
- Small Language Models are the Future of Agentic AI
  Small language models are sufficiently capable, more suitable, and far more economical than large models for the repetitive tasks that dominate agentic AI systems.
- Galactica: A Large Language Model for Science
  Galactica, a science-specialized LLM, reports higher scores than GPT-3, Chinchilla, and PaLM on LaTeX knowledge, mathematical reasoning, and medical QA benchmarks while outperforming general models on BIG-bench.
- Reducing Redundancy in Retrieval-Augmented Generation through Chunk Filtering
  Entity-based chunk filtering reduces RAG vector index size by 25-36% with retrieval quality near baseline levels.
- KnowPilot: Your Knowledge-Driven Copilot for Domain Tasks
  KnowPilot integrates knowledge retrieval and memory systems into generative agents to achieve better results on domain-specific tasks such as text generation.
- Less LLM, More Documents: Searching for Improved RAG
  Corpus scaling in RAG frequently matches the accuracy gains from larger LLMs on open-domain QA tasks, with mid-sized models benefiting most due to better passage coverage.