GPT-NeoX-20B: An Open-Source Autoregressive Language Model
Pith reviewed 2026-05-24 12:30 UTC · model grok-4.3
The pith
GPT-NeoX-20B is a 20 billion parameter open autoregressive model that gains more from five-shot evaluation than similarly sized GPT-3 and FairSeq models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
GPT-NeoX-20B is a 20 billion parameter autoregressive language model trained on the Pile whose weights are released publicly. By the authors' account, it is the largest dense autoregressive model with public weights at the time of submission. The model proves to be a particularly powerful few-shot reasoner and records larger performance gains under five-shot evaluation than similarly sized GPT-3 and FairSeq models.
What carries the argument
The GPT-NeoX-20B transformer architecture trained on the Pile, whose scaling and data mixture produce the observed five-shot reasoning gains.
If this is right
- Public release of weights allows independent researchers to run and extend the same few-shot experiments.
- Open training code enables direct replication of the 20 billion parameter scale on the Pile.
- Five-shot performance advantages can be tested on additional reasoning benchmarks using the released model.
- The model supplies a public baseline for measuring future gains in in-context learning.
Where Pith is reading between the lines
- Wider availability of large open models may shift research focus toward reproducible few-shot protocols.
- If the five-shot advantage holds, it suggests that data mixture or architectural choices can amplify in-context learning more than raw parameter count alone.
- Community access to both weights and training code could accelerate work on cost-effective scaling for reasoning tasks.
Load-bearing premise
The five-shot evaluation protocol, prompt formatting, and task selection are identical and unbiased across GPT-NeoX-20B, GPT-3, and FairSeq so that performance differences can be attributed to the models themselves.
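To make this premise concrete, here is a minimal sketch of a deterministic k-shot prompt builder. The template string, the `question`/`answer` field names, and the helper `build_k_shot_prompt` are illustrative assumptions, not the format used in the paper or by its baselines. The sketch only shows that the template, the separator, and the demonstration order are all fixed inputs, so any two models evaluated through the same function receive byte-identical prompts.

```python
from typing import Dict, List

# Hypothetical template and separator; the paper's actual prompt format is not given here.
TEMPLATE = "Question: {question}\nAnswer: {answer}"
SEPARATOR = "\n\n"


def build_k_shot_prompt(demos: List[Dict[str, str]], query: str, k: int = 5) -> str:
    """Build a k-shot prompt from the first k demonstrations, in the given order."""
    if len(demos) < k:
        raise ValueError(f"need at least {k} demonstrations, got {len(demos)}")
    shots = [TEMPLATE.format(**d) for d in demos[:k]]
    # The query reuses the same template with an empty answer slot.
    shots.append(TEMPLATE.format(question=query, answer="").rstrip())
    return SEPARATOR.join(shots)
```

Calling the function with `k=0` yields the zero-shot prompt from the same template, which keeps the zero-shot and five-shot conditions comparable.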
What would settle it
Re-running the five-shot evaluations on the same tasks with identical prompts and formatting shows GPT-NeoX-20B no longer records larger gains than the comparison models.
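The check described above amounts to comparing each model's average zero-to-five-shot gain on a shared task set. The following is a rough sketch under that reading; the function names and all accuracy values are placeholders invented for illustration and are not results from the paper.

```python
from statistics import mean
from typing import Dict, Tuple

# task name -> (zero_shot_accuracy, five_shot_accuracy)
Scores = Dict[str, Tuple[float, float]]


def mean_shot_gain(scores: Scores) -> float:
    """Average of (five-shot minus zero-shot) accuracy over all tasks."""
    return mean(five - zero for zero, five in scores.values())


def claim_survives(candidate: Scores, baselines: Dict[str, Scores]) -> bool:
    """True if the candidate's mean gain exceeds every baseline's mean gain.

    All models must be scored on the same task set; otherwise the comparison is unmatched.
    """
    for name, scores in baselines.items():
        if set(scores) != set(candidate):
            raise ValueError(f"task mismatch for baseline {name!r}")
    gain = mean_shot_gain(candidate)
    return all(gain > mean_shot_gain(s) for s in baselines.values())


# Placeholder numbers for illustration only; these are NOT results from the paper.
neox = {"task_a": (0.50, 0.60), "task_b": (0.40, 0.48)}
others = {
    "baseline_1": {"task_a": (0.52, 0.55), "task_b": (0.41, 0.43)},
    "baseline_2": {"task_a": (0.49, 0.53), "task_b": (0.39, 0.42)},
}
print(claim_survives(neox, others))
```

A stricter version of this test would compare per-task gains or add confidence intervals rather than relying on a single mean.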
Original abstract
We introduce GPT-NeoX-20B, a 20 billion parameter autoregressive language model trained on the Pile, whose weights will be made freely and openly available to the public through a permissive license. It is, to the best of our knowledge, the largest dense autoregressive model that has publicly available weights at the time of submission. In this work, we describe GPT-NeoX-20B's architecture and training and evaluate its performance on a range of language-understanding, mathematics, and knowledge-based tasks. We find that GPT-NeoX-20B is a particularly powerful few-shot reasoner and gains far more in performance when evaluated five-shot than similarly sized GPT-3 and FairSeq models. We open-source the training and evaluation code, as well as the model weights, at https://github.com/EleutherAI/gpt-neox.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces GPT-NeoX-20B, a 20-billion-parameter dense autoregressive language model trained on The Pile. It describes the architecture and training procedure, evaluates performance on language-understanding, mathematics, and knowledge tasks, and claims that the model is a particularly strong few-shot reasoner whose performance improves substantially more from zero-shot to five-shot settings than similarly sized GPT-3 and FairSeq models. The training code, evaluation code, and model weights are released under a permissive license.
Significance. If the reported five-shot gains are reproducible under matched evaluation conditions, the work supplies a large, openly available dense model that can serve as a baseline for future research and lowers barriers to studying scaling behavior. The explicit release of weights, training code, and evaluation code strengthens reproducibility.
major comments (1)
- [Evaluation section] Evaluation section (around the five-shot results): the abstract and results claim that GPT-NeoX-20B exhibits larger zero-to-five-shot deltas than GPT-3 and FairSeq models of comparable size. This differential is load-bearing for the central claim, yet the manuscript does not explicitly state that the identical task list, prompt templates, example ordering, and formatting conventions from the GPT-3 and FairSeq papers were reproduced without deviation. A table or appendix listing the exact prompts and subtasks used for each baseline would be required to attribute the gap to the model rather than protocol differences.
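One lightweight way to document the matching requested here is a per-model protocol record that can be diffed field by field. The sketch below is an assumption about how such a record might look; the class `EvalProtocol`, its field names, and `protocol_diff` are hypothetical and do not come from the paper or its released code.

```python
from dataclasses import asdict, dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class EvalProtocol:
    """Hypothetical record of the settings the referee lists; field names are assumptions."""
    tasks: Tuple[str, ...]
    prompt_template: str
    example_ordering: str  # free-text description, e.g. "fixed"
    num_shots: int


def protocol_diff(a: EvalProtocol, b: EvalProtocol) -> Dict[str, Tuple[object, object]]:
    """Return the fields on which two protocols disagree, mapped to both values."""
    da, db = asdict(a), asdict(b)
    return {key: (da[key], db[key]) for key in da if da[key] != db[key]}
```

An empty result from `protocol_diff` for every model pair would support attributing a score gap to the models; any non-empty result names the protocol field that needs to be reconciled first.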
minor comments (2)
- [Abstract] The abstract states the model is 'to the best of our knowledge, the largest dense autoregressive model that has publicly available weights at the time of submission.' This phrasing should be updated to a precise date or removed, as it is time-sensitive.
- [Training section] Training hyper-parameters (learning rate schedule, batch size, etc.) are described at a high level; a supplementary table with exact values and any deviations from the original GPT-3 recipe would improve clarity.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and the recommendation for minor revision. We address the single major comment below.
-
Referee: [Evaluation section] Evaluation section (around the five-shot results): the abstract and results claim that GPT-NeoX-20B exhibits larger zero-to-five-shot deltas than GPT-3 and FairSeq models of comparable size. This differential is load-bearing for the central claim, yet the manuscript does not explicitly state that the identical task list, prompt templates, example ordering, and formatting conventions from the GPT-3 and FairSeq papers were reproduced without deviation. A table or appendix listing the exact prompts and subtasks used for each baseline would be required to attribute the gap to the model rather than protocol differences.
Authors: We agree that the manuscript does not contain an explicit statement confirming exact reproduction of the evaluation protocols. The zero- and five-shot results were obtained by following the task lists, prompt templates, example orderings, and formatting conventions reported in Brown et al. (2020) and the FairSeq paper as closely as possible. We will revise the evaluation section to add an explicit statement to this effect and will cite the original papers for the specific prompts and subtasks. We will also note that the open-sourced evaluation code implements these protocols exactly. A full appendix table of every prompt is not feasible within page limits, but the combination of the added statement, citations, and released code allows direct verification and attributes performance differences to the model.
revision: yes
Circularity Check
No circularity: empirical evaluation against external benchmarks
The paper introduces GPT-NeoX-20B, describes its architecture and training on the Pile, and reports empirical performance on language, math, and knowledge tasks. The central claim of strong few-shot reasoning is supported by direct comparisons to GPT-3 and FairSeq models on external benchmarks. No mathematical derivations, predictions, or first-principles results are presented that could reduce to fitted parameters or self-citations by construction. All load-bearing claims rest on reproducible evaluations outside the paper's internal definitions.
Axiom & Free-Parameter Ledger
free parameters (2)
- model parameter count
- training dataset mixture
axioms (1)
- domain assumption: Standard transformer decoder architecture scales to 20B parameters without fundamental instability when using established optimizers and regularization.
Forward citations
Cited by 36 Pith papers
-
Learning to (Learn at Test Time): RNNs with Expressive Hidden States
TTT layers treat the hidden state as a trainable model updated at test time, allowing linear-complexity sequence models to scale perplexity reduction with context length unlike Mamba.
-
Mamba: Linear-Time Sequence Modeling with Selective State Spaces
Mamba is a linear-time sequence model using input-dependent selective SSMs that achieves SOTA results across modalities and matches twice-larger Transformers on language modeling with 5x higher inference throughput.
-
Selective Rotary Position Embedding
Selective RoPE adds input-dependent rotations to generalize RoPE, showing implicit positional structure in softmax attention and improving performance on language modeling, copying, state tracking, and retrieval when ...
-
DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads
DuoAttention identifies retrieval heads requiring full KV cache and streaming heads using constant-length cache to reduce memory and latency in long-context LLM inference.
-
Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality
Transformers and SSMs are unified through structured state space duality, producing a 2-8X faster Mamba-2 model that remains competitive with Transformers.
-
Detecting Pretraining Data from Large Language Models
Min-K% Prob detects pretraining data in LLMs by flagging outlier low-probability words in text, achieving 7.4% better performance than prior methods on the new WIKIMIA benchmark.
-
Eliciting Latent Predictions from Transformers with the Tuned Lens
Training per-layer affine probes on frozen transformers yields more reliable latent predictions than the logit lens and enables detection of malicious inputs from prediction trajectories.
-
OPT: Open Pre-trained Transformer Language Models
OPT releases open decoder-only transformers up to 175B parameters that match GPT-3 performance at one-seventh the carbon cost, along with code and training logs.
-
BROS: Bias-Corrected Randomized Subspaces for Memory-Efficient Single-Loop Bilevel Optimization
BROS achieves memory-efficient single-loop stochastic bilevel optimization with O(ε^{-2}) sample complexity by performing updates in randomized subspaces and using Rademacher bi-probe correction for unbiased estimation.
-
BROS: Bias-Corrected Randomized Subspaces for Memory-Efficient Single-Loop Bilevel Optimization
BROS achieves the same O(ε^{-2}) sample complexity as exact single-loop SBO methods while cutting peak memory by up to 44.9% through randomized subspaces and bias-corrected Hessian estimation.
-
Probe-Geometry Alignment: Erasing the Cross-Sequence Memorization Signature Below Chance
Probe-geometry alignment erases cross-sequence memorization signatures in LLMs below chance using per-depth rank-one activation interventions with negligible impact on zero-shot capabilities.
-
ReSS: Learning Reasoning Models for Tabular Data Prediction via Symbolic Scaffold
ReSS uses decision-tree scaffolds to fine-tune LLMs for faithful tabular reasoning, reporting up to 10% gains over baselines on medical and financial data.
-
ReSS: Learning Reasoning Models for Tabular Data Prediction via Symbolic Scaffold
ReSS extracts decision paths from trees as scaffolds to guide LLM reasoning generation, fine-tunes the LLM on the resulting dataset with scaffold-invariant augmentation, and reports up to 10% gains on medical and fina...
-
Beyond Memorization: Extending Reasoning Depth with Recurrence, Memory and Test-Time Compute Scaling
In a cellular automata rule-inference task designed to block memorization, neural models achieve high next-step accuracy but accuracy falls sharply with longer reasoning chains; depth, recurrence, memory, and test-tim...
-
LLMs on the Line: Data Determines Loss-to-Loss Scaling Laws
Pretraining data determines loss-to-loss scaling laws in LLMs, while model size, optimization, tokenizer, and architecture have limited impact.
-
MiniMax-01: Scaling Foundation Models with Lightning Attention
MiniMax-01 models match GPT-4o and Claude-3.5-Sonnet performance while providing 20-32 times longer context windows through lightning attention and MoE scaling.
-
Deep Optimizer States: Towards Scalable Training of Transformer Models Using Interleaved Offloading
Deep Optimizer States splits LLMs into subgroups and uses a performance model to schedule optimizer updates on CPU or GPU, achieving 2.5x faster iterations than prior offloading methods when integrated with DeepSpeed.
-
Lessons from the Trenches on Reproducible Evaluation of Language Models
The paper compiles practical lessons on reproducible LM evaluation and introduces the lm-eval library to mitigate common methodological problems in NLP.
-
LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code
LiveCodeBench collects 400 recent contest problems to create a contamination-free benchmark evaluating LLMs on code generation and related capabilities like self-repair and execution.
-
Vision-Language Foundation Models as Effective Robot Imitators
RoboFlamingo adapts open-source vision-language models for robot manipulation tasks via single-step comprehension plus an explicit policy head, outperforming prior methods on benchmarks with only light fine-tuning.
-
Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection
Self-RAG trains LLMs to adaptively retrieve passages on demand and self-critique using reflection tokens, outperforming ChatGPT and retrieval-augmented Llama2 on QA, reasoning, and fact verification.
-
YaRN: Efficient Context Window Extension of Large Language Models
YaRN extends the context window of RoPE-based LLMs like LLaMA more efficiently than prior methods, using 10x fewer tokens and 2.5x fewer steps while surpassing state-of-the-art performance and enabling extrapolation b...
-
AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration
AWQ quantizes LLM weights to low bits by scaling salient channels based on activation statistics, outperforming prior methods on language, coding, math, and multi-modal benchmarks.
-
Scaling Data-Constrained Language Models
Repeating training data up to 4 epochs yields negligible loss increase versus unique data for fixed compute, and a new scaling law accounts for the decaying value of repeated tokens and excess parameters.
-
CodeT5+: Open Code Large Language Models for Code Understanding and Generation
CodeT5+ is a flexible encoder-decoder LLM family for code pretrained with diverse objectives on multilingual corpora and initialized from existing LLMs, achieving state-of-the-art results on code generation, completio...
-
Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes
Distilling step-by-step uses LLM-generated rationales as additional supervision in a multi-task framework so that 770M-parameter models outperform 540B-parameter models on NLP benchmarks with only 80% of the data.
-
BLOOM: A 176B-Parameter Open-Access Multilingual Language Model
BLOOM is a 176B-parameter open-access multilingual language model trained on the ROOTS corpus that achieves competitive performance on benchmarks, with improved results after multitask prompted finetuning.
-
A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions
The paper surveys hallucination in LLMs with an innovative taxonomy, factors, detection methods, benchmarks, mitigation strategies, and open research directions.
-
StarCoder: may the source be with you!
StarCoderBase matches or beats OpenAI's code-cushman-001 on multi-language code benchmarks; the Python-fine-tuned StarCoder reaches 40% pass@1 on HumanEval while retaining other-language performance.
-
Galactica: A Large Language Model for Science
Galactica, a science-specialized LLM, reports higher scores than GPT-3, Chinchilla, and PaLM on LaTeX knowledge, mathematical reasoning, and medical QA benchmarks while outperforming general models on BIG-bench.
-
On the Privacy of LLMs: An Ablation Study
Privacy attacks on LLMs show strong signals for membership inference and backdoors but weaker performance for attribute inference and data extraction, with risks highly dependent on system configuration.
-
CodePori: Large-Scale System for Autonomous Software Development Using Multi-Agent Technology
CodePori is a multi-agent LLM system for code generation whose participant evaluation identifies practical challenges like memory limits and hallucinations missed by binary benchmarks.
-
A Survey on Large Language Models for Code Generation
A systematic literature review that organizes recent work on LLMs for code generation into a taxonomy covering data curation, model advances, evaluations, ethics, environmental impact, and applications, with benchmark...
-
A Survey of Large Language Models
This survey reviews the background, key techniques, and evaluation methods for large language models, emphasizing emergent abilities that appear at large scales.
-
A Survey on Retrieval-Augmented Text Generation for Large Language Models
A survey that categorizes RAG methods for LLMs into four retrieval-centric stages, reviews their evolution and evaluation, and outlines challenges and future directions.
-
A Comprehensive Overview of Large Language Models
A survey paper providing an overview of Large Language Models, their background, and recent advances in the field.
Reference graph
Works this paper leans on
-
[3]
Stuart Armstrong and Sören Mindermann. 2018. https://proceedings.neurips.cc/paper/2018/hash/d89a66c7c80a29b1bdbab0f2a1a94af8-Abstract.html Occam's razor is insufficient to infer the preferences of irrational agents. In Advances in Neural Information Processing Systems, volume 31, pages 5598--5609. Curran Associates, Inc
-
[4]
Stuart Armstrong, Anders Sandberg, and Nick Bostrom. 2012. https://doi.org/10.1007/s11023-012-9282-2 Thinking inside the box: Controlling and using an oracle AI . Minds and Machines, 22(4):299--324
-
[5]
Stéphane Aroca-Ouellette, Cory Paik, Alessandro Roncone, and Katharina Kann. 2021. https://doi.org/10.18653/v1/2021.findings-acl.404 PROST: Physical reasoning about objects through space and time. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 4597--4608, Online. Association for Computational Linguistics
-
[6]
Mikel Artetxe, Shruti Bhosale, Naman Goyal, Todor Mihaylov, Myle Ott, Sam Shleifer, Xi Victoria Lin, Jingfei Du, Srinivasan Iyer, Ramakanth Pasunuru, Giri Anantharaman, Xian Li, Shuohui Chen, Halil Akin, Mandeep Baines, Louis Martin, Xing Zhou, Punit Singh Koura, Brian O'Horo, Jeff Wang, Luke Zettlemoyer, Mona Diab, Zornitsa Kozareva, and Ves Stoyanov. 20...
-
[7]
Amanda Askell, Yuntao Bai, Anna Chen, Dawn Drain, Deep Ganguli, Tom Henighan, Andy Jones, Nicholas Joseph, Ben Mann, Nova DasSarma, Nelson Elhage, Zac Hatfield-Dodds, Danny Hernandez, Jackson Kernion, Kamal Ndousse, Catherine Olsson, Dario Amodei, Tom Brown, Jack Clark, Sam McCandlish, Chris Olah, and Jared Kaplan. 2021. http://arxiv.org/abs/2112.00861v3 ...
-
[8]
Emily M. Bender, Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell. 2021. https://doi.org/10.1145/3442188.3445922 On the dangers of stochastic parrots: Can language models be too big? In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, FAccT '21, pages 610--623, New York, NY, USA. Association for Computi...
- [9]
- [10]
-
[11]
Stella Biderman and Walter J. Scheirer. 2020. https://proceedings.mlr.press/v137/biderman20a.html Pitfalls in machine learning research: Reexamining the development cycle . In Proceedings on "I Can't Believe It's Not Better!" at NeurIPS Workshops, volume 137 of Proceedings of Machine Learning Research, pages 106--117. PMLR
- [12]
-
[13]
Yonatan Bisk, Rowan Zellers, Ronan Le bras, Jianfeng Gao, and Yejin Choi. 2020. https://doi.org/10.1609/aaai.v34i05.6239 PIQA : Reasoning about physical commonsense in natural language . In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 7432--7439
-
[14]
Sid Black, Leo Gao, Phil Wang, Connor Leahy, and Stella Biderman. 2021. https://doi.org/10.5281/zenodo.5297715 GPT-Neo : Large scale autoregressive language modeling with Mesh-Tensorflow
-
[15]
Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gr...
-
[16]
Nick Cammarata, Shan Carter, Gabriel Goh, Chris Olah, Michael Petrov, Ludwig Schubert, Chelsea Voss, Ben Egan, and Swee Kiat Lim. 2020. https://doi.org/10.23915/distill.00024 Thread: Circuits . Distill
-
[17]
Nicholas Carlini, Daphne Ippolito, Matthew Jagielski, Katherine Lee, Florian Tramer, and Chiyuan Zhang. 2022. http://arxiv.org/abs/2202.07646v2 Quantifying memorization across neural language models . Computing Research Repository, arXiv:2202.07646. Version 2
-
[18]
Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian...
-
[19]
Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. 2019. http://arxiv.org/abs/1904.10509v1 Generating long sequences with sparse transformers . Computing Research Repository, arXiv:1904.10509. Version 1
-
[20]
Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, Parker Schuh, Kensen Shi, Sasha Tsvyashchenko, Joshua Maynez, Abhishek Rao, Parker Barnes, Yi Tay, Noam Shazeer, Vinodkumar Prabhakaran, Emily Reif, Nan Du, Ben Hutchinson, Reiner Pope, James Bradb...
-
[21]
Paul Christiano, Ajeya Cotra, and Mark Xu. 2021. https://docs.google.com/document/d/1WwsnJQstPq91_Yh-Ch2XRL8H_EpsnjrC1dwZXR37PC8 Eliciting latent knowledge: How to tell if your eyes deceive you
-
[22]
Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. 2018. http://arxiv.org/abs/1803.05457v1 Think you have solved question answering? try ARC , the AI2 Reasoning Challenge . Computing Research Repository, arXiv:1803.05457. Version 1
- [23]
-
[24]
Abram Demski. 2019. https://www.alignmentforum.org/posts/SwcyMEgLyd4C3Dern/the-parable-of-predict-o-matic The parable of Predict-O-Matic . AI Alignment Forum
-
[25]
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. http://arxiv.org/abs/1810.04805v2 BERT : Pre-training of deep bidirectional transformers for language understanding . Computing Research Repository, arXiv:1810.04805. Version 2
-
[26]
Jesse Dodge, Maarten Sap, Ana Marasović, William Agnew, Gabriel Ilharco, Dirk Groeneveld, Margaret Mitchell, and Matt Gardner. 2021. https://doi.org/10.18653/v1/2021.emnlp-main.98 Documenting large webtext corpora: A case study on the Colossal Clean Crawled Corpus. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Process...
-
[27]
Nelson Elhage, Neel Nanda, Catherine Olsson, Tom Henighan, Nicholas Joseph, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, Tom Conerly, Nova DasSarma, Dawn Drain, Deep Ganguli, Zac Hatfield-Dodds, Danny Hernandez, Andy Jones, Jackson Kernion, Liane Lovitt, Kamal Ndousse, Dario Amodei, Tom Brown, Jack Clark, Jared Kaplan, Sam McCandlish, and Chris Olah. 2...
-
[28]
William Fedus, Barret Zoph, and Noam Shazeer. 2021. http://arxiv.org/abs/2101.03961v1 Switch Transformers : Scaling to trillion parameter models with simple and efficient sparsity . Computing Research Repository, arXiv:2101.03961. Version 1
-
[29]
Leo Gao. 2021 a . https://www.alignmentforum.org/posts/BgoKdAzogxmgkuuAt/behavior-cloning-is-miscalibrated Behavior cloning is miscalibrated . AI Alignment Forum
-
[30]
Leo Gao. 2021 b . https://blog.eleuther.ai/gpt3-model-sizes/ On the sizes of openai api models
-
[31]
Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, Shawn Presser, and Connor Leahy. 2020. http://arxiv.org/abs/2101.00027v1 The Pile : An 800GB dataset of diverse text for language modeling . Computing Research Repository, arXiv:2101.00027. Version 1
-
[32]
Leo Gao, Kyle McDonell, Laria Reynolds, and Stella Biderman. 2021 a . https://blog.eleuther.ai/factored-cognition/ A preliminary exploration into factored cognition with language models . EleutherAI Blog
-
[33]
Leo Gao, Jonathan Tow, Stella Biderman, Sid Black, Anthony DiPofi, Charles Foster, Laurence Golding, Jeffrey Hsu, Kyle McDonell, Niklas Muennighoff, Jason Phang, Laria Reynolds, Eric Tang, Anish Thite, Ben Wang, Kevin Wang, and Andy Zou. 2021 b . https://doi.org/10.5281/zenodo.5371628 A framework for few-shot language model evaluation
-
[34]
Aaron Harlap, Deepak Narayanan, Amar Phanishayee, Vivek Seshadri, Nikhil Devanur, Greg Ganger, and Phil Gibbons. 2018. http://arxiv.org/abs/1806.03377v1 PipeDream : Fast and efficient pipeline parallel DNN training . Computing Research Repository, arXiv:1806.03377. Version 1
-
[35]
Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. 2021 a . http://arxiv.org/abs/2009.03300v3 Measuring massive multitask language understanding . Computing Research Repository, arXiv:2009.03300. Version 3
-
[36]
Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. 2021 b . http://arxiv.org/abs/2103.03874v2 Measuring mathematical problem solving with the MATH dataset . Computing Research Repository, arXiv:2103.03874. Version 2
-
[37]
Scaling Laws for Autoregressive Generative Modeling
Tom Henighan, Jared Kaplan, Mor Katz, Mark Chen, Christopher Hesse, Jacob Jackson, Heewoo Jun, Tom B. Brown, Prafulla Dhariwal, Scott Gray, Chris Hallacy, Benjamin Mann, Alec Radford, Aditya Ramesh, Nick Ryder, Daniel M. Ziegler, John Schulman, Dario Amodei, and Sam McCandlish. 2020. http://arxiv.org/abs/2010.14701v2 Scaling laws for autoregressive genera...
-
[38]
Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. 2022. http://arxiv.org/abs/2203.15556v1 Training compute-optimal large language models . Computing Research Repository, arXiv:2203.15556. Version 1
- [39]
-
[40]
Evan Hubinger, Chris van Merwijk, Vladimir Mikulik, Joar Skalse, and Scott Garrabrant. 2021. http://arxiv.org/abs/1906.01820v3 Risks from learned optimization in advanced machine learning systems . Computing Research Repository, arXiv:1906.01820. Version 3
-
[41]
Mandar Joshi, Eunsol Choi, Daniel Weld, and Luke Zettlemoyer. 2017. https://doi.org/10.18653/v1/P17-1147 TriviaQA : A large scale distantly supervised challenge dataset for reading comprehension . In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1601--1611, Vancouver, Canada. Associa...
- [42]
-
[43]
Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. 2020. http://arxiv.org/abs/2001.08361v1 Scaling laws for neural language models . Computing Research Repository, arXiv:2001.08361. Version 1
-
[44]
Boseop Kim, HyoungSeok Kim, Sang-Woo Lee, Gichang Lee, Donghyun Kwak, Jeon Dong Hyeon, Sunghyun Park, Sungju Kim, Seonhoon Kim, Dongpil Seo, Heungsub Lee, Minyoung Jeong, Sungjae Lee, Minsub Kim, Suk Hyun Ko, Seokhun Kim, Taeyong Park, Jinuk Kim, Soyoung Kang, Na-Hyeon Ryu, Kang Min Yoo, Minsuk Chang, Soobin Suh, Sookyo In, Jinseong Park, Kyungduk Kim, Hi...
-
[45]
Bryan Klimt and Yiming Yang. 2004. https://doi.org/10.1007/978-3-540-30115-8_22 The Enron corpus: A new dataset for email classification research . In Proceedings of the 15th European Conference on Machine Learning, ECML'04, page 217–226, Berlin, Heidelberg. Springer-Verlag
- [46]
-
[47]
Philipp Koehn. 2005. https://aclanthology.org/2005.mtsummit-papers.11 Europarl : A parallel corpus for statistical machine translation . In Proceedings of Machine Translation Summit X: Papers, pages 79--86, Phuket, Thailand
-
[48]
Aran Komatsuzaki. 2019. http://arxiv.org/abs/1906.06669v1 One epoch is all you need . Computing Research Repository, arXiv:1906.06669. Version 1
-
[49]
Vanessa Kosoy. 2016. https://www.alignmentforum.org/posts/5bd75cc58225bf0670375209/irl-is-hard IRL is hard . AI Alignment Forum
-
[50]
Julia Kreutzer, Isaac Caswell, Lisa Wang, Ahsan Wahab, Daan van Esch, Nasanbayar Ulzii-Orshikh, Allahsera Tapo, Nishant Subramani, Artem Sokolov, Claytone Sikasote, Monang Setyawan, Supheakmungkol Sarin, Sokhar Samb, Benoît Sagot, Clara Rivera, Annette Rios, Isabel Papadimitriou, Salomey Osei, Pedro Ortiz Suarez, Iroro Orife, Kelechi Ogueji, Andre Niyonga...
-
[51]
Alexandre Lacoste, Alexandra Luccioni, Victor Schmidt, and Thomas Dandres. 2019. http://arxiv.org/abs/1910.09700v2 Quantifying the carbon emissions of machine learning . Computing Research Repository, arXiv:1910.09700. Version 2
-
[52]
Connor Leahy. 2021. https://blog.eleuther.ai/why-release-a-large-language-model/ Why Release a Large Language Model? EleutherAI Blog
-
[53]
Connor Leahy and Stella Biderman. 2021. https://montrealethics.ai/volume4/ The hard problem of aligning AI to human values . In The State of AI Ethics Report, volume 4, pages 180--183. The Montreal AI Ethics Institute
-
[54]
Katherine Lee, Daphne Ippolito, Andrew Nystrom, Chiyuan Zhang, Douglas Eck, Chris Callison-Burch, and Nicholas Carlini. 2021. http://arxiv.org/abs/2107.06499v1 Deduplicating training data makes language models better . Computing Research Repository, arXiv:2107.06499. Version 1
-
[55]
Opher Lieber, Or Sharir, Barak Lenz, and Yoav Shoham. 2021. https://uploads-ssl.webflow.com/60fd4503684b466578c0d307/61138924626a6981ee09caf6_jurassic_tech_paper.pdf Jurassic-1: Technical details and evaluation. Technical report, AI21 Labs
-
[56]
Stephanie Lin, Jacob Hilton, and Owain Evans. 2021. http://arxiv.org/abs/2109.07958v1 TruthfulQA: Measuring how models mimic human falsehoods. Computing Research Repository, arXiv:2109.07958. Version 1
-
[57]
Pierre Lison and Jörg Tiedemann. 2016. https://aclanthology.org/L16-1147 OpenSubtitles2016: Extracting large parallel corpora from movie and TV subtitles. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16), pages 923--929, Portorož, Slovenia. European Language Resources Association (ELRA)
-
[58]
Jian Liu, Leyang Cui, Hanmeng Liu, Dandan Huang, Yile Wang, and Yue Zhang. 2020. https://doi.org/10.24963/ijcai.2020/501 LogiQA: A challenge dataset for machine reading comprehension with logical reasoning. In Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, IJCAI-20, pages 3622--3628. International Joint Confe...
-
[59]
Ilya Loshchilov and Frank Hutter. 2019. http://arxiv.org/abs/1711.05101v3 Decoupled weight decay regularization. Computing Research Repository, arXiv:1711.05101. Version 3
-
[60]
J. Nathan Matias. 2020. https://citizensandtech.org/2020/01/industry-independent-research/ Why we need industry-independent research on tech & society. Citizens and Technology Lab
- [61]
-
[62]
Kevin Meng, David Bau, Alex Andonian, and Yonatan Belinkov. 2022. http://arxiv.org/abs/2202.05262v1 Locating and editing factual knowledge in GPT. Computing Research Repository, arXiv:2202.05262. Version 1
-
[63]
Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. 2018. https://doi.org/10.18653/v1/D18-1260 Can a suit of armor conduct electricity? A new dataset for open book question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2381--2391, Brussels, Belgium. Association for Computational Li...
-
[64]
Toan Q. Nguyen and Julian Salazar. 2019. http://arxiv.org/abs/1910.05895v2 Transformers without tears: Improving the normalization of self-attention. Computing Research Repository, arXiv:1910.05895. Version 2
-
[65]
Yixin Nie, Adina Williams, Emily Dinan, Mohit Bansal, Jason Weston, and Douwe Kiela. 2020. https://doi.org/10.18653/v1/2020.acl-main.441 Adversarial NLI: A new benchmark for natural language understanding. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4885--4901, Online. Association for Computational L...
-
[66]
nostalgebraist. 2020. https://www.lesswrong.com/posts/AcKRB8wDpdaN6v6ru/interpreting-gpt-the-logit-lens interpreting GPT: the logit lens. LessWrong
-
[67]
Maxwell Nye, Anders Johan Andreassen, Guy Gur-Ari, Henryk Michalewski, Jacob Austin, David Bieber, David Dohan, Aitor Lewkowycz, Maarten Bosma, David Luan, Charles Sutton, and Augustus Odena. 2021. http://arxiv.org/abs/2112.00114v1 Show your work: Scratchpads for intermediate computation with language models. Computing Research Repository, arXiv:2112.001...
-
[68]
Pedro A. Ortega, Markus Kunesch, Grégoire Delétang, Tim Genewein, Jordi Grau-Moya, Joel Veness, Jonas Buchli, Jonas Degrave, Bilal Piot, Julien Perolat, Tom Everitt, Corentin Tallec, Emilio Parisotto, Tom Erez, Yutian Chen, Scott Reed, Marcus Hutter, Nando de Freitas, and Shane Legg. 2021. http://arxiv.org/abs/2110.10819v1 Shaking the foundations: delusio...
-
[69]
Denis Paperno, Germán Kruszewski, Angeliki Lazaridou, Ngoc Quan Pham, Raffaella Bernardi, Sandro Pezzelle, Marco Baroni, Gemma Boleda, and Raquel Fernández. 2016. https://doi.org/10.18653/v1/P16-1144 The LAMBADA dataset: Word prediction requiring a broad discourse context. In Proceedings of the 54th Annual Meeting of the Association for Computati...
-
[70]
Anselmo Peñas, Eduard Hovy, Pamela Forner, Álvaro Rodrigo, Richard Sutcliffe, and Roser Morante. 2013. https://doi.org/10.1007/978-3-642-40802-1_29 QA4MRE 2011-2013: Overview of question answering for machine reading evaluation. In Information Access Evaluation. Multilinguality, Multimodality, and Visualization, pages 303--320, Berlin, Heidelberg....
-
[71]
Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf Improving language understanding by generative pre-training. Technical report, OpenAI
-
[72]
Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf Language models are unsupervised multitask learners. Technical report, OpenAI
-
[73]
Scaling Language Models: Methods, Analysis & Insights from Training Gopher
Jack W. Rae, Sebastian Borgeaud, Trevor Cai, Katie Millican, Jordan Hoffmann, H. Francis Song, John Aslanides, Sarah Henderson, Roman Ring, Susannah Young, Eliza Rutherford, Tom Hennigan, Jacob Menick, Albin Cassirer, Richard Powell, George van den Driessche, Lisa Anne Hendricks, Maribeth Rauh, Po-Sen Huang, Amelia Glaese, Johannes Welbl, Sumanth Dathat...
-
[74]
Jack W Rae, Anna Potapenko, Siddhant M Jayakumar, Chloe Hillier, and Timothy P Lillicrap. 2019. http://arxiv.org/abs/1911.05507v1 Compressive transformers for long-range sequence modelling. Computing Research Repository, arXiv:1911.05507. Version 1
-
[75]
Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. 2020. http://jmlr.org/papers/v21/20-074.html Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21:1--67
-
[76]
Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He. 2020. https://doi.org/10.5555/3433701.3433727 ZeRO: Memory optimizations toward training trillion parameter models. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC '20. IEEE Press
-
[77]
Jeff Rasley, Samyam Rajbhandari, Olatunji Ruwase, and Yuxiong He. 2020. https://doi.org/10.1145/3394486.3406703 DeepSpeed: System optimizations enable training deep learning models with over 100 billion parameters. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 3505--3506, New York, NY, USA. As...
- [78]
-
[79]
Adam Roberts, Hyung Won Chung, Anselm Levskaya, Gaurav Mishra, James Bradbury, Daniel Andor, Sharan Narang, Brian Lester, Colin Gaffney, Afroz Mohiuddin, Curtis Hawthorne, Aitor Lewkowycz, Alex Salcianu, Marc van Zee, Jacob Austin, Sebastian Goodman, Livio Baldini Soares, Haitang Hu, Sasha Tsvyashchenko, Aakanksha Chowdhery, Jasmijn Bastings, Jannis Bulia...
-
[80]
Jathan Sadowski, Salomé Viljoen, and Meredith Whittaker. 2021. https://doi.org/10.1038/d41586-021-01812-3 Everyone should decide how their digital data are used — not just tech companies. Nature, 595(7866):169--171