GPT-NeoX-20B: An Open-Source Autoregressive Language Model

Ben Wang; Connor Leahy; Eric Hallahan; Horace He; Jason Phang; Jonathan Tow; Kyle McDonell; Laria Reynolds; Laurence Golding; Leo Gao

arxiv: 2204.06745 · v1 · pith:KNM7SW3Mnew · submitted 2022-04-14 · 💻 cs.CL

Sid Black, Stella Biderman, Eric Hallahan, Quentin Anthony, Leo Gao, Laurence Golding, Horace He, Connor Leahy, Kyle McDonell, Jason Phang, Michael Pieler, USVSN Sai Prashanth, Shivanshu Purohit, Laria Reynolds, Jonathan Tow, Ben Wang, Samuel Weinbach

Pith reviewed 2026-05-24 12:30 UTC · model grok-4.3

classification 💻 cs.CL

keywords GPT-NeoX-20B · autoregressive language model · few-shot reasoning · open-source model · The Pile · language model evaluation · in-context learning

The pith

GPT-NeoX-20B is a 20 billion parameter open autoregressive model that gains more from five-shot evaluation than similarly sized GPT-3 and FairSeq models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces GPT-NeoX-20B, a 20 billion parameter autoregressive language model trained on the Pile dataset, and releases its weights and code openly under a permissive license. At the time of submission it was the largest dense autoregressive model with publicly available weights. Evaluations across language-understanding, mathematics, and knowledge tasks show the model is a particularly strong few-shot reasoner. It records substantially larger performance lifts when moving to five-shot prompting than GPT-3 and FairSeq models of comparable size. The work therefore supplies both a new public model and evidence that open training can produce competitive few-shot capabilities.

Core claim

GPT-NeoX-20B is a 20 billion parameter autoregressive language model trained on the Pile whose weights are released publicly. It is the largest dense autoregressive model with public weights at submission. The model proves a particularly powerful few-shot reasoner and records larger performance gains under five-shot evaluation than similarly sized GPT-3 and FairSeq models.

What carries the argument

The GPT-NeoX-20B transformer architecture trained on the Pile, whose scaling and data mixture produce the observed five-shot reasoning gains.

If this is right

Public release of weights allows independent researchers to run and extend the same few-shot experiments.
Open training code enables direct replication of the 20 billion parameter scale on the Pile.
Five-shot performance advantages can be tested on additional reasoning benchmarks using the released model.
The model supplies a public baseline for measuring future gains in in-context learning.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Wider availability of large open models may shift research focus toward reproducible few-shot protocols.
If the five-shot advantage holds, it suggests that data mixture or architectural choices can amplify in-context learning more than raw parameter count alone.
Community access to both weights and training code could accelerate work on cost-effective scaling for reasoning tasks.

Load-bearing premise

The five-shot evaluation protocol, prompt formatting, and task selection are identical and unbiased across GPT-NeoX-20B, GPT-3, and FairSeq so that performance differences can be attributed to the models themselves.
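
One way to make this premise checkable is to build every few-shot prompt deterministically, so that each model under comparison receives byte-identical input. The sketch below is illustrative only: the `Question:`/`Answer:` template, the function name `build_few_shot_prompt`, and the seed handling are assumptions made for this example, not the format used in the paper or in its released evaluation code.

```python
import random


def build_few_shot_prompt(train_examples, query, k=5, seed=0):
    """Build a k-shot prompt with a reproducible choice and order of examples.

    train_examples: list of (question, answer) pairs (hypothetical format).
    The same seed always yields the same examples in the same order.
    """
    rng = random.Random(seed)  # local RNG, so global random state is untouched
    shots = rng.sample(train_examples, k)
    blocks = [f"Question: {q}\nAnswer: {a}" for q, a in shots]
    blocks.append(f"Question: {query}\nAnswer:")  # the item the model must complete
    return "\n\n".join(blocks)


# Hypothetical example pool; real benchmarks supply their own training split.
pool = [(f"q{i}", f"a{i}") for i in range(8)]
prompt_for_model_a = build_few_shot_prompt(pool, "final?", k=5, seed=0)
prompt_for_model_b = build_few_shot_prompt(pool, "final?", k=5, seed=0)
```

Because both prompts come from the same pool and seed, any score difference between the two models cannot be explained by example selection or ordering.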

What would settle it

Re-running the five-shot evaluations on the same tasks with identical prompts and formatting shows GPT-NeoX-20B no longer records larger gains than the comparison models.
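
The comparison at stake is a difference of differences: each model's five-shot score minus its zero-shot score, averaged over the tasks both settings share. A minimal sketch of that arithmetic follows; the model names and accuracies are placeholders invented for illustration and are not results from the paper.

```python
from statistics import mean


def few_shot_gain(zero_shot, five_shot):
    """Mean per-task accuracy change from zero-shot to five-shot.

    Both arguments map task name -> accuracy; only shared tasks are compared.
    """
    shared = sorted(set(zero_shot) & set(five_shot))
    if not shared:
        raise ValueError("no tasks in common between the two settings")
    return mean(five_shot[t] - zero_shot[t] for t in shared)


# Placeholder accuracies (hypothetical, NOT taken from the paper).
scores = {
    "model_a": ({"task_1": 0.50, "task_2": 0.30}, {"task_1": 0.60, "task_2": 0.40}),
    "model_b": ({"task_1": 0.52, "task_2": 0.31}, {"task_1": 0.55, "task_2": 0.33}),
}
gains = {name: few_shot_gain(zero, five) for name, (zero, five) in scores.items()}
largest_gain = max(gains, key=gains.get)
```

With these placeholder numbers `largest_gain` is `"model_a"`. The claim would be undermined if, under matched prompts, the model of interest no longer had the largest gain.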

Figures

Figures reproduced from arXiv: 2204.06745 by Ben Wang, Connor Leahy, Eric Hallahan, Horace He, Jason Phang, Jonathan Tow, Kyle McDonell, Laria Reynolds, Laurence Golding, Leo Gao, Michael Pieler, Quentin Anthony, Samuel Weinbach, Shivanshu Purohit, Sid Black, Stella Biderman, USVSN Sai Prashanth.

Figure 2: Architecture diagram of a single training node.

Figure 3: GPT-2 tokenization vs. GPT-NeoX-20B tokenization. GPT-NeoX-20B tokenization handles whitespace better, which is particularly useful for text such as source code. For more examples, see Appendix F.

Figure 4: Training and validation loss for GPT-NeoX

Figure 5: Zero-shot performance of GPT-NeoX-20B compared to GPT-J-6B and FairSeq and OpenAI models on …

Figure 6: Zero-shot performance of GPT-NeoX-20B compared to FairSeq and OpenAI models on arithmetic …

Figure 7: Five-shot performance of GPT-NeoX-20B compared to GPT-J-6B and FairSeq and OpenAI models on …

Figure 8: Pile (arXiv) Tokenization Example

Figure 9: Pile (BookCorpus2) Tokenization Example

Figure 10: Pile (DM Mathematics) Tokenization Example

Figure 11: Pile (GitHub) Tokenization Example

Figure 12: Pile (OpenWebText2) Tokenization Example

Figure 13: Pile (PubMed Abstracts) Tokenization Example

Original abstract

We introduce GPT-NeoX-20B, a 20 billion parameter autoregressive language model trained on the Pile, whose weights will be made freely and openly available to the public through a permissive license. It is, to the best of our knowledge, the largest dense autoregressive model that has publicly available weights at the time of submission. In this work, we describe GPT-NeoX-20B's architecture and training and evaluate its performance on a range of language-understanding, mathematics, and knowledge-based tasks. We find that GPT-NeoX-20B is a particularly powerful few-shot reasoner and gains far more in performance when evaluated five-shot than similarly sized GPT-3 and FairSeq models. We open-source the training and evaluation code, as well as the model weights, at https://github.com/EleutherAI/gpt-neox.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The main thing here is the public release of a 20B model with weights and code; the five-shot performance edge needs protocol details to hold up.

This paper's real contribution is releasing the weights and full training code for GPT-NeoX-20B, a 20B-parameter model trained on the Pile. At submission it was the largest dense autoregressive model with open weights, which lets independent groups run scaling or interpretability work that would otherwise need big proprietary resources. The architecture is standard GPT-style, so the novelty is the scale plus the permissive open release rather than any new technique. They document the training setup and report results across language, math, and knowledge benchmarks, including the claim that the model gains more from five-shot evaluation than similarly sized GPT-3 and FairSeq models. That open release and the accompanying code are the parts that actually enable follow-on research.

The soft spot is the five-shot comparison. The abstract states a larger zero-to-five-shot delta than the baselines, but the stress-test concern is fair: without explicit confirmation that prompt templates, example selection, task subsets, and formatting exactly matched the GPT-3 and FairSeq protocols, the gap could reflect setup differences instead of model quality. The paper should show the reproduction details if it wants the claim to land cleanly.

Training details and dataset mixture are described at a level that looks standard and reproducible. This work is for groups that need a large open model to experiment with rather than for readers seeking architectural breakthroughs. It deserves peer review because the release itself is useful and the evaluation claims are concrete enough to check.

Referee Report

1 major / 2 minor

Summary. The paper introduces GPT-NeoX-20B, a 20-billion-parameter dense autoregressive language model trained on The Pile. It describes the architecture and training procedure, evaluates performance on language-understanding, mathematics, and knowledge tasks, and claims that the model is a particularly strong few-shot reasoner whose performance improves substantially more from zero-shot to five-shot settings than similarly sized GPT-3 and FairSeq models. The training code, evaluation code, and model weights are released under a permissive license.

Significance. If the reported five-shot gains are reproducible under matched evaluation conditions, the work supplies a large, openly available dense model that can serve as a baseline for future research and lowers barriers to studying scaling behavior. The explicit release of weights, training code, and evaluation code strengthens reproducibility.

major comments (1)

[Evaluation section] Evaluation section (around the five-shot results): the abstract and results claim that GPT-NeoX-20B exhibits larger zero-to-five-shot deltas than GPT-3 and FairSeq models of comparable size. This differential is load-bearing for the central claim, yet the manuscript does not explicitly state that the identical task list, prompt templates, example ordering, and formatting conventions from the GPT-3 and FairSeq papers were reproduced without deviation. A table or appendix listing the exact prompts and subtasks used for each baseline would be required to attribute the gap to the model rather than protocol differences.

minor comments (2)

[Abstract] The abstract states the model is 'to the best of our knowledge, the largest dense autoregressive model that has publicly available weights at the time of submission.' This phrasing should be updated to a precise date or removed, as it is time-sensitive.
[Training section] Training hyper-parameters (learning rate schedule, batch size, etc.) are described at a high level; a supplementary table with exact values and any deviations from the original GPT-3 recipe would improve clarity.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback and the recommendation for minor revision. We address the single major comment below.

Referee: [Evaluation section] Evaluation section (around the five-shot results): the abstract and results claim that GPT-NeoX-20B exhibits larger zero-to-five-shot deltas than GPT-3 and FairSeq models of comparable size. This differential is load-bearing for the central claim, yet the manuscript does not explicitly state that the identical task list, prompt templates, example ordering, and formatting conventions from the GPT-3 and FairSeq papers were reproduced without deviation. A table or appendix listing the exact prompts and subtasks used for each baseline would be required to attribute the gap to the model rather than protocol differences.

Authors: We agree that the manuscript does not contain an explicit statement confirming exact reproduction of the evaluation protocols. The zero- and five-shot results were obtained by following the task lists, prompt templates, example orderings, and formatting conventions reported in Brown et al. (2020) and the FairSeq paper as closely as possible. We will revise the evaluation section to add an explicit statement to this effect and will cite the original papers for the specific prompts and subtasks. We will also note that the open-sourced evaluation code implements these protocols exactly. A full appendix table of every prompt is not feasible within page limits, but the combination of the added statement, citations, and released code allows direct verification and attributes performance differences to the model.

Revision: yes

Circularity Check

0 steps flagged

No circularity: empirical evaluation against external benchmarks

The paper introduces GPT-NeoX-20B, describes its architecture and training on the Pile, and reports empirical performance on language, math, and knowledge tasks. The central claim of strong few-shot reasoning is supported by direct comparisons to GPT-3 and FairSeq models on external benchmarks. No mathematical derivations, predictions, or first-principles results are presented that could reduce to fitted parameters or self-citations by construction. All load-bearing claims rest on reproducible evaluations outside the paper's internal definitions.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

This is an empirical scaling and release paper; the central claim rests on the outcome of one training run and its benchmark scores rather than on new mathematical derivations. Hyperparameters such as learning rate schedule, batch size, and data mixture weights are free parameters chosen during training.

free parameters (2)

model parameter count: target scale of 20 billion parameters chosen by the authors.
training dataset mixture: specific composition and weighting of the Pile dataset used for training.
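
To see how a decoder configuration maps onto a 20-billion-parameter target, a common back-of-the-envelope estimate counts about 12·d² weights per transformer block (4·d² for attention, 8·d² for a 4×-wide MLP) plus the embedding matrices. The sketch below applies it with 44 layers, hidden size 6144, a padded vocabulary of 50432, and untied input/output embeddings. Those values are assumptions taken from the model's published configuration and are not stated on this page; the estimate also ignores biases and layer-norm weights.

```python
def approx_decoder_params(n_layers, d_model, vocab_size, tied_embeddings=False):
    """Rough parameter count for a GPT-style decoder (biases and norms ignored)."""
    attention = 4 * d_model * d_model  # Q, K, V and output projections
    mlp = 8 * d_model * d_model        # d -> 4d -> d feed-forward
    blocks = n_layers * (attention + mlp)
    embeddings = vocab_size * d_model * (1 if tied_embeddings else 2)
    return blocks + embeddings


# Assumed configuration (not given on this page).
total = approx_decoder_params(n_layers=44, d_model=6144, vocab_size=50432)
print(f"{total / 1e9:.2f}B parameters")  # prints "20.55B parameters"
```

The estimate lands close to the 20 billion target, which illustrates that the parameter count is fixed once depth, width, and vocabulary size are chosen.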

axioms (1)

domain assumption: Standard transformer decoder architecture scales to 20B parameters without fundamental instability when using established optimizers and regularization.
Invoked by following the GPT-3 style architecture described in the abstract.

Forward citations

Cited by 36 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Learning to (Learn at Test Time): RNNs with Expressive Hidden States
cs.LG 2024-07 conditional novelty 8.0

TTT layers treat the hidden state as a trainable model updated at test time, allowing linear-complexity sequence models to scale perplexity reduction with context length unlike Mamba.
Mamba: Linear-Time Sequence Modeling with Selective State Spaces
cs.LG 2023-12 unverdicted novelty 8.0

Mamba is a linear-time sequence model using input-dependent selective SSMs that achieves SOTA results across modalities and matches twice-larger Transformers on language modeling with 5x higher inference throughput.
Selective Rotary Position Embedding
cs.CL 2025-11 unverdicted novelty 7.0

Selective RoPE adds input-dependent rotations to generalize RoPE, showing implicit positional structure in softmax attention and improving performance on language modeling, copying, state tracking, and retrieval when ...
DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads
cs.CL 2024-10 conditional novelty 7.0

DuoAttention identifies retrieval heads requiring full KV cache and streaming heads using constant-length cache to reduce memory and latency in long-context LLM inference.
Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality
cs.LG 2024-05 unverdicted novelty 7.0

Transformers and SSMs are unified through structured state space duality, producing a 2-8X faster Mamba-2 model that remains competitive with Transformers.
Detecting Pretraining Data from Large Language Models
cs.CL 2023-10 conditional novelty 7.0

Min-K% Prob detects pretraining data in LLMs by flagging outlier low-probability words in text, achieving 7.4% better performance than prior methods on the new WIKIMIA benchmark.
Eliciting Latent Predictions from Transformers with the Tuned Lens
cs.LG 2023-03 accept novelty 7.0

Training per-layer affine probes on frozen transformers yields more reliable latent predictions than the logit lens and enables detection of malicious inputs from prediction trajectories.
OPT: Open Pre-trained Transformer Language Models
cs.CL 2022-05 unverdicted novelty 7.0

OPT releases open decoder-only transformers up to 175B parameters that match GPT-3 performance at one-seventh the carbon cost, along with code and training logs.
BROS: Bias-Corrected Randomized Subspaces for Memory-Efficient Single-Loop Bilevel Optimization
cs.LG 2026-05 unverdicted novelty 6.0

BROS achieves memory-efficient single-loop stochastic bilevel optimization with O(ε^{-2}) sample complexity by performing updates in randomized subspaces and using Rademacher bi-probe correction for unbiased estimation.
BROS: Bias-Corrected Randomized Subspaces for Memory-Efficient Single-Loop Bilevel Optimization
cs.LG 2026-05 unverdicted novelty 6.0

BROS achieves the same O(ε^{-2}) sample complexity as exact single-loop SBO methods while cutting peak memory by up to 44.9% through randomized subspaces and bias-corrected Hessian estimation.
Probe-Geometry Alignment: Erasing the Cross-Sequence Memorization Signature Below Chance
cs.LG 2026-05 unverdicted novelty 6.0

Probe-geometry alignment erases cross-sequence memorization signatures in LLMs below chance using per-depth rank-one activation interventions with negligible impact on zero-shot capabilities.
ReSS: Learning Reasoning Models for Tabular Data Prediction via Symbolic Scaffold
cs.AI 2026-04 unverdicted novelty 6.0

ReSS uses decision-tree scaffolds to fine-tune LLMs for faithful tabular reasoning, reporting up to 10% gains over baselines on medical and financial data.
ReSS: Learning Reasoning Models for Tabular Data Prediction via Symbolic Scaffold
cs.AI 2026-04 unverdicted novelty 6.0

ReSS extracts decision paths from trees as scaffolds to guide LLM reasoning generation, fine-tunes the LLM on the resulting dataset with scaffold-invariant augmentation, and reports up to 10% gains on medical and fina...
Beyond Memorization: Extending Reasoning Depth with Recurrence, Memory and Test-Time Compute Scaling
cs.LG 2025-08 unverdicted novelty 6.0

In a cellular automata rule-inference task designed to block memorization, neural models achieve high next-step accuracy but accuracy falls sharply with longer reasoning chains; depth, recurrence, memory, and test-tim...
LLMs on the Line: Data Determines Loss-to-Loss Scaling Laws
cs.LG 2025-02 unverdicted novelty 6.0

Pretraining data determines loss-to-loss scaling laws in LLMs, while model size, optimization, tokenizer, and architecture have limited impact.
MiniMax-01: Scaling Foundation Models with Lightning Attention
cs.CL 2025-01 unverdicted novelty 6.0

MiniMax-01 models match GPT-4o and Claude-3.5-Sonnet performance while providing 20-32 times longer context windows through lightning attention and MoE scaling.
Deep Optimizer States: Towards Scalable Training of Transformer Models Using Interleaved Offloading
cs.LG 2024-10 unverdicted novelty 6.0

Deep Optimizer States splits LLMs into subgroups and uses a performance model to schedule optimizer updates on CPU or GPU, achieving 2.5x faster iterations than prior offloading methods when integrated with DeepSpeed.
Lessons from the Trenches on Reproducible Evaluation of Language Models
cs.CL 2024-05 accept novelty 6.0

The paper compiles practical lessons on reproducible LM evaluation and introduces the lm-eval library to mitigate common methodological problems in NLP.
LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code
cs.SE 2024-03 unverdicted novelty 6.0

LiveCodeBench collects 400 recent contest problems to create a contamination-free benchmark evaluating LLMs on code generation and related capabilities like self-repair and execution.
Vision-Language Foundation Models as Effective Robot Imitators
cs.RO 2023-11 conditional novelty 6.0

RoboFlamingo adapts open-source vision-language models for robot manipulation tasks via single-step comprehension plus an explicit policy head, outperforming prior methods on benchmarks with only light fine-tuning.
Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection
cs.CL 2023-10 unverdicted novelty 6.0

Self-RAG trains LLMs to adaptively retrieve passages on demand and self-critique using reflection tokens, outperforming ChatGPT and retrieval-augmented Llama2 on QA, reasoning, and fact verification.
YaRN: Efficient Context Window Extension of Large Language Models
cs.CL 2023-08 unverdicted novelty 6.0

YaRN extends the context window of RoPE-based LLMs like LLaMA more efficiently than prior methods, using 10x fewer tokens and 2.5x fewer steps while surpassing state-of-the-art performance and enabling extrapolation b...
AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration
cs.CL 2023-06 conditional novelty 6.0

AWQ quantizes LLM weights to low bits by scaling salient channels based on activation statistics, outperforming prior methods on language, coding, math, and multi-modal benchmarks.
Scaling Data-Constrained Language Models
cs.CL 2023-05 conditional novelty 6.0

Repeating training data up to 4 epochs yields negligible loss increase versus unique data for fixed compute, and a new scaling law accounts for the decaying value of repeated tokens and excess parameters.
CodeT5+: Open Code Large Language Models for Code Understanding and Generation
cs.CL 2023-05 conditional novelty 6.0

CodeT5+ is a flexible encoder-decoder LLM family for code pretrained with diverse objectives on multilingual corpora and initialized from existing LLMs, achieving state-of-the-art results on code generation, completio...
Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes
cs.CL 2023-05 conditional novelty 6.0

Distilling step-by-step uses LLM-generated rationales as additional supervision in a multi-task framework so that 770M-parameter models outperform 540B-parameter models on NLP benchmarks with only 80% of the data.
BLOOM: A 176B-Parameter Open-Access Multilingual Language Model
cs.CL 2022-11 unverdicted novelty 6.0

BLOOM is a 176B-parameter open-access multilingual language model trained on the ROOTS corpus that achieves competitive performance on benchmarks, with improved results after multitask prompted finetuning.
A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions
cs.CL 2023-11 unverdicted novelty 5.0

The paper surveys hallucination in LLMs with an innovative taxonomy, factors, detection methods, benchmarks, mitigation strategies, and open research directions.
StarCoder: may the source be with you!
cs.CL 2023-05 accept novelty 5.0

StarCoderBase matches or beats OpenAI's code-cushman-001 on multi-language code benchmarks; the Python-fine-tuned StarCoder reaches 40% pass@1 on HumanEval while retaining other-language performance.
Galactica: A Large Language Model for Science
cs.CL 2022-11 unverdicted novelty 5.0

Galactica, a science-specialized LLM, reports higher scores than GPT-3, Chinchilla, and PaLM on LaTeX knowledge, mathematical reasoning, and medical QA benchmarks while outperforming general models on BIG-bench.
On the Privacy of LLMs: An Ablation Study
cs.CR 2026-05 unverdicted novelty 4.0

Privacy attacks on LLMs show strong signals for membership inference and backdoors but weaker performance for attribute inference and data extraction, with risks highly dependent on system configuration.
CodePori: Large-Scale System for Autonomous Software Development Using Multi-Agent Technology
cs.SE 2024-02 unverdicted novelty 4.0

CodePori is a multi-agent LLM system for code generation whose participant evaluation identifies practical challenges like memory limits and hallucinations missed by binary benchmarks.
A Survey on Large Language Models for Code Generation
cs.CL 2024-06 unverdicted novelty 3.0

A systematic literature review that organizes recent work on LLMs for code generation into a taxonomy covering data curation, model advances, evaluations, ethics, environmental impact, and applications, with benchmark...
A Survey of Large Language Models
cs.CL 2023-03 accept novelty 3.0

This survey reviews the background, key techniques, and evaluation methods for large language models, emphasizing emergent abilities that appear at large scales.
A Survey on Retrieval-Augmented Text Generation for Large Language Models
cs.IR 2024-04 unverdicted novelty 2.0

A survey that categorizes RAG methods for LLMs into four retrieval-centric stages, reviews their evolution and evaluation, and outlines challenges and future directions.
A Comprehensive Overview of Large Language Models
cs.CL 2023-07 unverdicted novelty 2.0

A survey paper providing an overview of Large Language Models, their background, and recent advances in the field.

Reference graph

Works this paper leans on

109 extracted references · 109 canonical work pages · cited by 34 Pith papers · 33 internal anchors

[3]

Stuart Armstrong and S\" o ren Mindermann. 2018. https://proceedings.neurips.cc/paper/2018/hash/d89a66c7c80a29b1bdbab0f2a1a94af8-Abstract.html Occam's razor is insufficient to infer the preferences of irrational agents . In Advances in Neural Information Processing Systems, volume 31, pages 5598--5609. Curran Associates, Inc

work page 2018
[4]

Stuart Armstrong, Anders Sandberg, and Nick Bostrom. 2012. https://doi.org/10.1007/s11023-012-9282-2 Thinking inside the box: Controlling and using an oracle AI . Minds and Machines, 22(4):299--324

work page doi:10.1007/s11023-012-9282-2 2012
[5]

St \'e phane Aroca-Ouellette, Cory Paik, Alessandro Roncone, and Katharina Kann. 2021. https://doi.org/10.18653/v1/2021.findings-acl.404 PROST : P hysical reasoning about objects through space and time . In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 4597--4608, Online. Association for Computational Linguistics

work page doi:10.18653/v1/2021.findings-acl.404 2021
[6]

Mikel Artetxe, Shruti Bhosale, Naman Goyal, Todor Mihaylov, Myle Ott, Sam Shleifer, Xi Victoria Lin, Jingfei Du, Srinivasan Iyer, Ramakanth Pasunuru, Giri Anantharaman, Xian Li, Shuohui Chen, Halil Akin, Mandeep Baines, Louis Martin, Xing Zhou, Punit Singh Koura, Brian O'Horo, Jeff Wang, Luke Zettlemoyer, Mona Diab, Zornitsa Kozareva, and Ves Stoyanov. 20...

work page arXiv 2021
[7]

Amanda Askell, Yuntao Bai, Anna Chen, Dawn Drain, Deep Ganguli, Tom Henighan, Andy Jones, Nicholas Joseph, Ben Mann, Nova DasSarma, Nelson Elhage, Zac Hatfield-Dodds, Danny Hernandez, Jackson Kernion, Kamal Ndousse, Catherine Olsson, Dario Amodei, Tom Brown, Jack Clark, Sam McCandlish, Chris Olah, and Jared Kaplan. 2021. http://arxiv.org/abs/2112.00861v3 ...

work page internal anchor Pith review Pith/arXiv arXiv 2021
[8]

Bender, Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell

Emily M. Bender, Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell. 2021. https://doi.org/10.1145/3442188.3445922 On the dangers of stochastic parrots: Can language models be too big? In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, FAccT '21, pages 610--623, New York, NY, USA. Association for Computi...

work page doi:10.1145/3442188.3445922 2021
[9]

Stella Biderman, Kieran Bicheno, and Leo Gao. 2022. http://arxiv.org/abs/2201.07311v1 Datasheet for the Pile . Computing Research Repository, arXiv:2201.07311. Version 1

work page arXiv 2022
[10]

Stella Biderman and Edward Raff. 2022. http://arxiv.org/abs/2201.07406v1 Neural language models are effective plagiarists . Computing Research Repository, arXiv:2201.07406. Version 1

work page arXiv 2022
[11]

I Can't Believe It's Not Better!

Stella Biderman and Walter J. Scheirer. 2020. https://proceedings.mlr.press/v137/biderman20a.html Pitfalls in machine learning research: Reexamining the development cycle . In Proceedings on "I Can't Believe It's Not Better!" at NeurIPS Workshops, volume 137 of Proceedings of Machine Learning Research, pages 106--117. PMLR

work page 2020
[12]

Abeba Birhane, Vinay Uday Prabhu, and Emmanuel Kahembwe. 2021. http://arxiv.org/abs/2110.01963v1 Multimodal datasets: misogyny, pornography, and malignant stereotypes . Computing Research Repository, arXiv:2110.01963. Version 1

work page arXiv 2021
[13]

Yonatan Bisk, Rowan Zellers, Ronan Le bras, Jianfeng Gao, and Yejin Choi. 2020. https://doi.org/10.1609/aaai.v34i05.6239 PIQA : Reasoning about physical commonsense in natural language . In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 7432--7439

work page doi:10.1609/aaai.v34i05.6239 2020
[14]

Sid Black, Leo Gao, Phil Wang, Connor Leahy, and Stella Biderman. 2021. https://doi.org/10.5281/zenodo.5297715 GPT-Neo : Large scale autoregressive language modeling with Mesh-Tensorflow

work page doi:10.5281/zenodo.5297715 2021
[15]

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gr...

work page 2020
[16]

Nick Cammarata, Shan Carter, Gabriel Goh, Chris Olah, Michael Petrov, Ludwig Schubert, Chelsea Voss, Ben Egan, and Swee Kiat Lim. 2020. https://doi.org/10.23915/distill.00024 Thread: Circuits . Distill

work page doi:10.23915/distill.00024 2020
[17]

Nicholas Carlini, Daphne Ippolito, Matthew Jagielski, Katherine Lee, Florian Tramer, and Chiyuan Zhang. 2022. http://arxiv.org/abs/2202.07646v2 Quantifying memorization across neural language models . Computing Research Repository, arXiv:2202.07646. Version 2

work page internal anchor Pith review Pith/arXiv arXiv 2022
[18]

Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian...

work page internal anchor Pith review Pith/arXiv arXiv 2021
[19]

Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. 2019. http://arxiv.org/abs/1904.10509v1 Generating long sequences with sparse transformers . Computing Research Repository, arXiv:1904.10509. Version 1

work page internal anchor Pith review Pith/arXiv arXiv 2019
[20]

Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, Parker Schuh, Kensen Shi, Sasha Tsvyashchenko, Joshua Maynez, Abhishek Rao, Parker Barnes, Yi Tay, Noam Shazeer, Vinodkumar Prabhakaran, Emily Reif, Nan Du, Ben Hutchinson, Reiner Pope, James Bradb...

[21]

Paul Christiano, Ajeya Cotra, and Mark Xu. 2021. https://docs.google.com/document/d/1WwsnJQstPq91_Yh-Ch2XRL8H_EpsnjrC1dwZXR37PC8 Eliciting latent knowledge: How to tell if your eyes deceive you

[22]

Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. 2018. http://arxiv.org/abs/1803.05457v1 Think you have solved question answering? Try ARC, the AI2 Reasoning Challenge. Computing Research Repository, arXiv:1803.05457. Version 1

[23]

Damai Dai, Li Dong, Yaru Hao, Zhifang Sui, and Furu Wei. 2021. http://arxiv.org/abs/2104.08696v1 Knowledge neurons in pretrained transformers. Computing Research Repository, arXiv:2104.08696. Version 1

[24]

Abram Demski. 2019. https://www.alignmentforum.org/posts/SwcyMEgLyd4C3Dern/the-parable-of-predict-o-matic The parable of Predict-O-Matic. AI Alignment Forum

[25]

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. http://arxiv.org/abs/1810.04805v2 BERT: Pre-training of deep bidirectional transformers for language understanding. Computing Research Repository, arXiv:1810.04805. Version 2

[26]

Jesse Dodge, Maarten Sap, Ana Marasović, William Agnew, Gabriel Ilharco, Dirk Groeneveld, Margaret Mitchell, and Matt Gardner. 2021. https://doi.org/10.18653/v1/2021.emnlp-main.98 Documenting large webtext corpora: A case study on the Colossal Clean Crawled Corpus. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Process...

[27]

Nelson Elhage, Neel Nanda, Catherine Olsson, Tom Henighan, Nicholas Joseph, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, Tom Conerly, Nova DasSarma, Dawn Drain, Deep Ganguli, Zac Hatfield-Dodds, Danny Hernandez, Andy Jones, Jackson Kernion, Liane Lovitt, Kamal Ndousse, Dario Amodei, Tom Brown, Jack Clark, Jared Kaplan, Sam McCandlish, and Chris Olah. 2...

[28]

William Fedus, Barret Zoph, and Noam Shazeer. 2021. http://arxiv.org/abs/2101.03961v1 Switch Transformers: Scaling to trillion parameter models with simple and efficient sparsity. Computing Research Repository, arXiv:2101.03961. Version 1

[29]

Leo Gao. 2021a. https://www.alignmentforum.org/posts/BgoKdAzogxmgkuuAt/behavior-cloning-is-miscalibrated Behavior cloning is miscalibrated. AI Alignment Forum

[30]

Leo Gao. 2021b. https://blog.eleuther.ai/gpt3-model-sizes/ On the sizes of OpenAI API models

[31]

Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, Shawn Presser, and Connor Leahy. 2020. http://arxiv.org/abs/2101.00027v1 The Pile: An 800GB dataset of diverse text for language modeling. Computing Research Repository, arXiv:2101.00027. Version 1

[32]

Leo Gao, Kyle McDonell, Laria Reynolds, and Stella Biderman. 2021a. https://blog.eleuther.ai/factored-cognition/ A preliminary exploration into factored cognition with language models. EleutherAI Blog

[33]

Leo Gao, Jonathan Tow, Stella Biderman, Sid Black, Anthony DiPofi, Charles Foster, Laurence Golding, Jeffrey Hsu, Kyle McDonell, Niklas Muennighoff, Jason Phang, Laria Reynolds, Eric Tang, Anish Thite, Ben Wang, Kevin Wang, and Andy Zou. 2021b. https://doi.org/10.5281/zenodo.5371628 A framework for few-shot language model evaluation

[34]

Aaron Harlap, Deepak Narayanan, Amar Phanishayee, Vivek Seshadri, Nikhil Devanur, Greg Ganger, and Phil Gibbons. 2018. http://arxiv.org/abs/1806.03377v1 PipeDream: Fast and efficient pipeline parallel DNN training. Computing Research Repository, arXiv:1806.03377. Version 1

[35]

Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. 2021a. http://arxiv.org/abs/2009.03300v3 Measuring massive multitask language understanding. Computing Research Repository, arXiv:2009.03300. Version 3

[36]

Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. 2021b. http://arxiv.org/abs/2103.03874v2 Measuring mathematical problem solving with the MATH dataset. Computing Research Repository, arXiv:2103.03874. Version 2

[37]

Scaling Laws for Autoregressive Generative Modeling

Tom Henighan, Jared Kaplan, Mor Katz, Mark Chen, Christopher Hesse, Jacob Jackson, Heewoo Jun, Tom B. Brown, Prafulla Dhariwal, Scott Gray, Chris Hallacy, Benjamin Mann, Alec Radford, Aditya Ramesh, Nick Ryder, Daniel M. Ziegler, John Schulman, Dario Amodei, and Sam McCandlish. 2020. http://arxiv.org/abs/2010.14701v2 Scaling laws for autoregressive genera...

[38]

Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. 2022. http://arxiv.org/abs/2203.15556v1 Training compute-optimal large language models. Computing Research Repository, arXiv:2203.15556. Version 1

[39]

Wenlong Huang, Pieter Abbeel, Deepak Pathak, and Igor Mordatch. 2022. http://arxiv.org/abs/2201.07207v1 Language models as zero-shot planners: Extracting actionable knowledge for embodied agents. Computing Research Repository, arXiv:2201.07207. Version 1

[40]

Evan Hubinger, Chris van Merwijk, Vladimir Mikulik, Joar Skalse, and Scott Garrabrant. 2021. http://arxiv.org/abs/1906.01820v3 Risks from learned optimization in advanced machine learning systems. Computing Research Repository, arXiv:1906.01820. Version 3

[41]

Mandar Joshi, Eunsol Choi, Daniel Weld, and Luke Zettlemoyer. 2017. https://doi.org/10.18653/v1/P17-1147 TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1601--1611, Vancouver, Canada. Associa...

[42]

Nikhil Kandpal, Eric Wallace, and Colin Raffel. 2022. http://arxiv.org/abs/2202.06539v2 Deduplicating training data mitigates privacy risks in language models. Computing Research Repository, arXiv:2202.06539. Version 2

[43]

Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. 2020. http://arxiv.org/abs/2001.08361v1 Scaling laws for neural language models. Computing Research Repository, arXiv:2001.08361. Version 1

[44]

Boseop Kim, HyoungSeok Kim, Sang-Woo Lee, Gichang Lee, Donghyun Kwak, Jeon Dong Hyeon, Sunghyun Park, Sungju Kim, Seonhoon Kim, Dongpil Seo, Heungsub Lee, Minyoung Jeong, Sungjae Lee, Minsub Kim, Suk Hyun Ko, Seokhun Kim, Taeyong Park, Jinuk Kim, Soyoung Kang, Na-Hyeon Ryu, Kang Min Yoo, Minsuk Chang, Soobin Suh, Sookyo In, Jinseong Park, Kyungduk Kim, Hi...

[45]

Bryan Klimt and Yiming Yang. 2004. https://doi.org/10.1007/978-3-540-30115-8_22 The Enron corpus: A new dataset for email classification research. In Proceedings of the 15th European Conference on Machine Learning, ECML'04, pages 217--226, Berlin, Heidelberg. Springer-Verlag

[46]

Jack Koch, Lauro Langosco, Jacob Pfau, James Le, and Lee Sharkey. 2021. http://arxiv.org/abs/2105.14111v2 Objective robustness in deep reinforcement learning. Computing Research Repository, arXiv:2105.14111. Version 2

[47]

Philipp Koehn. 2005. https://aclanthology.org/2005.mtsummit-papers.11 Europarl: A parallel corpus for statistical machine translation. In Proceedings of Machine Translation Summit X: Papers, pages 79--86, Phuket, Thailand

[48]

Aran Komatsuzaki. 2019. http://arxiv.org/abs/1906.06669v1 One epoch is all you need. Computing Research Repository, arXiv:1906.06669. Version 1

[49]

Vanessa Kosoy. 2016. https://www.alignmentforum.org/posts/5bd75cc58225bf0670375209/irl-is-hard IRL is hard. AI Alignment Forum

[50]

Julia Kreutzer, Isaac Caswell, Lisa Wang, Ahsan Wahab, Daan van Esch, Nasanbayar Ulzii-Orshikh, Allahsera Tapo, Nishant Subramani, Artem Sokolov, Claytone Sikasote, Monang Setyawan, Supheakmungkol Sarin, Sokhar Samb, Benoît Sagot, Clara Rivera, Annette Rios, Isabel Papadimitriou, Salomey Osei, Pedro Ortiz Suarez, Iroro Orife, Kelechi Ogueji, Andre Niyonga...

[51]

Alexandre Lacoste, Alexandra Luccioni, Victor Schmidt, and Thomas Dandres. 2019. http://arxiv.org/abs/1910.09700v2 Quantifying the carbon emissions of machine learning. Computing Research Repository, arXiv:1910.09700. Version 2

[52]

Connor Leahy. 2021. https://blog.eleuther.ai/why-release-a-large-language-model/ Why Release a Large Language Model? EleutherAI Blog

[53]

Connor Leahy and Stella Biderman. 2021. https://montrealethics.ai/volume4/ The hard problem of aligning AI to human values. In The State of AI Ethics Report, volume 4, pages 180--183. The Montreal AI Ethics Institute

[54]

Katherine Lee, Daphne Ippolito, Andrew Nystrom, Chiyuan Zhang, Douglas Eck, Chris Callison-Burch, and Nicholas Carlini. 2021. http://arxiv.org/abs/2107.06499v1 Deduplicating training data makes language models better. Computing Research Repository, arXiv:2107.06499. Version 1

[55]

Opher Lieber, Or Sharir, Barak Lenz, and Yoav Shoham. 2021. https://uploads-ssl.webflow.com/60fd4503684b466578c0d307/61138924626a6981ee09caf6_jurassic_tech_paper.pdf Jurassic-1: Technical details and evaluation. Technical report, AI21 Labs

[56]

Stephanie Lin, Jacob Hilton, and Owain Evans. 2021. http://arxiv.org/abs/2109.07958v1 TruthfulQA: Measuring how models mimic human falsehoods. Computing Research Repository, arXiv:2109.07958. Version 1

[57]

Pierre Lison and Jörg Tiedemann. 2016. https://aclanthology.org/L16-1147 OpenSubtitles2016: Extracting large parallel corpora from movie and TV subtitles. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC '16), pages 923--929, Portorož, Slovenia. European Language Resources Association (ELRA)

[58]

Jian Liu, Leyang Cui, Hanmeng Liu, Dandan Huang, Yile Wang, and Yue Zhang. 2020. https://doi.org/10.24963/ijcai.2020/501 LogiQA: A challenge dataset for machine reading comprehension with logical reasoning. In Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, IJCAI-20, pages 3622--3628. International Joint Confe...

[59]

Ilya Loshchilov and Frank Hutter. 2019. http://arxiv.org/abs/1711.05101v3 Decoupled weight decay regularization. Computing Research Repository, arXiv:1711.05101. Version 3

[60]

J. Nathan Matias. 2020. https://citizensandtech.org/2020/01/industry-independent-research/ Why we need industry-independent research on tech & society. Citizens and Technology Lab

[61]

Joshua Maynez, Shashi Narayan, Bernd Bohnet, and Ryan McDonald. 2020. http://arxiv.org/abs/2005.00661v1 On faithfulness and factuality in abstractive summarization. Computing Research Repository, arXiv:2005.00661. Version 1

[62]

Kevin Meng, David Bau, Alex Andonian, and Yonatan Belinkov. 2022. http://arxiv.org/abs/2202.05262v1 Locating and editing factual knowledge in GPT. Computing Research Repository, arXiv:2202.05262. Version 1

[63]

Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. 2018. https://doi.org/10.18653/v1/D18-1260 Can a suit of armor conduct electricity? A new dataset for open book question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2381--2391, Brussels, Belgium. Association for Computational Li...

[64]

Toan Q. Nguyen and Julian Salazar. 2019. http://arxiv.org/abs/1910.05895v2 Transformers without tears: Improving the normalization of self-attention. Computing Research Repository, arXiv:1910.05895. Version 2

[65]

Yixin Nie, Adina Williams, Emily Dinan, Mohit Bansal, Jason Weston, and Douwe Kiela. 2020. https://doi.org/10.18653/v1/2020.acl-main.441 Adversarial NLI: A new benchmark for natural language understanding. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4885--4901, Online. Association for Computational L...

[66]

nostalgebraist. 2020. https://www.lesswrong.com/posts/AcKRB8wDpdaN6v6ru/interpreting-gpt-the-logit-lens interpreting GPT: the logit lens. LessWrong

[67]

Maxwell Nye, Anders Johan Andreassen, Guy Gur-Ari, Henryk Michalewski, Jacob Austin, David Bieber, David Dohan, Aitor Lewkowycz, Maarten Bosma, David Luan, Charles Sutton, and Augustus Odena. 2021. http://arxiv.org/abs/2112.00114v1 Show your work: Scratchpads for intermediate computation with language models. Computing Research Repository, arXiv:2112.001...

[68]

Pedro A. Ortega, Markus Kunesch, Grégoire Delétang, Tim Genewein, Jordi Grau-Moya, Joel Veness, Jonas Buchli, Jonas Degrave, Bilal Piot, Julien Perolat, Tom Everitt, Corentin Tallec, Emilio Parisotto, Tom Erez, Yutian Chen, Scott Reed, Marcus Hutter, Nando de Freitas, and Shane Legg. 2021. http://arxiv.org/abs/2110.10819v1 Shaking the foundations: delusio...

[69]

Denis Paperno, Germán Kruszewski, Angeliki Lazaridou, Ngoc Quan Pham, Raffaella Bernardi, Sandro Pezzelle, Marco Baroni, Gemma Boleda, and Raquel Fernández. 2016. https://doi.org/10.18653/v1/P16-1144 The LAMBADA dataset: Word prediction requiring a broad discourse context. In Proceedings of the 54th Annual Meeting of the Association for Computati...

[70]

Anselmo Peñas, Eduard Hovy, Pamela Forner, Álvaro Rodrigo, Richard Sutcliffe, and Roser Morante. 2013. https://doi.org/10.1007/978-3-642-40802-1_29 QA4MRE 2011-2013: Overview of question answering for machine reading evaluation. In Information Access Evaluation. Multilinguality, Multimodality, and Visualization, pages 303--320, Berlin, Heidelberg....

[71]

Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf Improving language understanding by generative pre-training. Technical report, OpenAI

[72]

Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf Language models are unsupervised multitask learners. Technical report, OpenAI

[73]

Scaling Language Models: Methods, Analysis & Insights from Training Gopher

Jack W. Rae, Sebastian Borgeaud, Trevor Cai, Katie Millican, Jordan Hoffmann, H. Francis Song, John Aslanides, Sarah Henderson, Roman Ring, Susannah Young, Eliza Rutherford, Tom Hennigan, Jacob Menick, Albin Cassirer, Richard Powell, George van den Driessche, Lisa Anne Hendricks, Maribeth Rauh, Po-Sen Huang, Amelia Glaese, Johannes Welbl, Sumanth Dathat...

[74]

Jack W Rae, Anna Potapenko, Siddhant M Jayakumar, Chloe Hillier, and Timothy P Lillicrap. 2019. http://arxiv.org/abs/1911.05507v1 Compressive transformers for long-range sequence modelling. Computing Research Repository, arXiv:1911.05507. Version 1

[75]

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. 2020. http://jmlr.org/papers/v21/20-074.html Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21:1--67

[76]

Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He. 2020. https://doi.org/10.5555/3433701.3433727 ZeRO: Memory optimizations toward training trillion parameter models. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC '20. IEEE Press

[77]

Jeff Rasley, Samyam Rajbhandari, Olatunji Ruwase, and Yuxiong He. 2020. https://doi.org/10.1145/3394486.3406703 DeepSpeed: System optimizations enable training deep learning models with over 100 billion parameters. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 3505--3506, New York, NY, USA. As...

[78]

Yasaman Razeghi, Robert L Logan IV, Matt Gardner, and Sameer Singh. 2022. http://arxiv.org/abs/2202.07206v1 Impact of pretraining term frequencies on few-shot reasoning. Computing Research Repository, arXiv:2202.07206. Version 1

[79]

Adam Roberts, Hyung Won Chung, Anselm Levskaya, Gaurav Mishra, James Bradbury, Daniel Andor, Sharan Narang, Brian Lester, Colin Gaffney, Afroz Mohiuddin, Curtis Hawthorne, Aitor Lewkowycz, Alex Salcianu, Marc van Zee, Jacob Austin, Sebastian Goodman, Livio Baldini Soares, Haitang Hu, Sasha Tsvyashchenko, Aakanksha Chowdhery, Jasmijn Bastings, Jannis Bulia...

[80]

Jathan Sadowski, Salomé Viljoen, and Meredith Whittaker. 2021. https://doi.org/10.1038/d41586-021-01812-3 Everyone should decide how their digital data are used — not just tech companies. Nature, 595(7866):169--171

Showing first 80 references.

[15] [15]

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gr...

work page 2020

[16] [16]

Nick Cammarata, Shan Carter, Gabriel Goh, Chris Olah, Michael Petrov, Ludwig Schubert, Chelsea Voss, Ben Egan, and Swee Kiat Lim. 2020. https://doi.org/10.23915/distill.00024 Thread: Circuits . Distill

work page doi:10.23915/distill.00024 2020

[17] [17]

Nicholas Carlini, Daphne Ippolito, Matthew Jagielski, Katherine Lee, Florian Tramer, and Chiyuan Zhang. 2022. http://arxiv.org/abs/2202.07646v2 Quantifying memorization across neural language models . Computing Research Repository, arXiv:2202.07646. Version 2

work page internal anchor Pith review Pith/arXiv arXiv 2022

[18] [18]

Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian...

work page internal anchor Pith review Pith/arXiv arXiv 2021

[19] [19]

Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. 2019. http://arxiv.org/abs/1904.10509v1 Generating long sequences with sparse transformers . Computing Research Repository, arXiv:1904.10509. Version 1

work page internal anchor Pith review Pith/arXiv arXiv 2019

[20] [20]

Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, Parker Schuh, Kensen Shi, Sasha Tsvyashchenko, Joshua Maynez, Abhishek Rao, Parker Barnes, Yi Tay, Noam Shazeer, Vinodkumar Prabhakaran, Emily Reif, Nan Du, Ben Hutchinson, Reiner Pope, James Bradb...

work page internal anchor Pith review Pith/arXiv arXiv 2022

[21] [21]

Paul Christiano, Ajeya Cotra, and Mark Xu. 2021. https://docs.google.com/document/d/1WwsnJQstPq91_Yh-Ch2XRL8H_EpsnjrC1dwZXR37PC8 Eliciting latent knowledge: How to tell if your eyes deceive you

work page 2021

[22] [22]

Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. 2018. http://arxiv.org/abs/1803.05457v1 Think you have solved question answering? try ARC , the AI2 Reasoning Challenge . Computing Research Repository, arXiv:1803.05457. Version 1

work page internal anchor Pith review Pith/arXiv arXiv 2018

[23] [23]

Damai Dai, Li Dong, Yaru Hao, Zhifang Sui, and Furu Wei. 2021. http://arxiv.org/abs/2104.08696v1 Knowledge neurons in pretrained transformers . Computing Research Repository, arXiv:2104.08696. Version 1

work page arXiv 2021

[24] [24]

Abram Demski. 2019. https://www.alignmentforum.org/posts/SwcyMEgLyd4C3Dern/the-parable-of-predict-o-matic The parable of Predict-O-Matic . AI Alignment Forum

work page 2019

[25] [25]

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. http://arxiv.org/abs/1810.04805v2 BERT : Pre-training of deep bidirectional transformers for language understanding . Computing Research Repository, arXiv:1810.04805. Version 2

work page internal anchor Pith review Pith/arXiv arXiv 2019

[26] [26]

Jesse Dodge, Maarten Sap, Ana Marasovi \'c , William Agnew, Gabriel Ilharco, Dirk Groeneveld, Margaret Mitchell, and Matt Gardner. 2021. https://doi.org/10.18653/v1/2021.emnlp-main.98 Documenting large webtext corpora: A case study on the Colossal Clean Crawled Corpus . In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Process...

work page doi:10.18653/v1/2021.emnlp-main.98 2021

[27] [27]

Nelson Elhage, Neel Nanda, Catherine Olsson, Tom Henighan, Nicholas Joseph, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, Tom Conerly, Nova DasSarma, Dawn Drain, Deep Ganguli, Zac Hatfield-Dodds, Danny Hernandez, Andy Jones, Jackson Kernion, Liane Lovitt, Kamal Ndousse, Dario Amodei, Tom Brown, Jack Clark, Jared Kaplan, Sam McCandlish, and Chris Olah. 2...

work page 2021

[28] [28]

William Fedus, Barret Zoph, and Noam Shazeer. 2021. http://arxiv.org/abs/2101.03961v1 Switch Transformers : Scaling to trillion parameter models with simple and efficient sparsity . Computing Research Repository, arXiv:2101.03961. Version 1

work page internal anchor Pith review Pith/arXiv arXiv 2021

[29] [29]

Leo Gao. 2021 a . https://www.alignmentforum.org/posts/BgoKdAzogxmgkuuAt/behavior-cloning-is-miscalibrated Behavior cloning is miscalibrated . AI Alignment Forum

work page 2021

[30] [30]

Leo Gao. 2021 b . https://blog.eleuther.ai/gpt3-model-sizes/ On the sizes of openai api models

work page 2021

[31] [31]

Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, Shawn Presser, and Connor Leahy. 2020. http://arxiv.org/abs/2101.00027v1 The Pile : An 800GB dataset of diverse text for language modeling . Computing Research Repository, arXiv:2101.00027. Version 1

work page internal anchor Pith review Pith/arXiv arXiv 2020

[32] [32]

Leo Gao, Kyle McDonell, Laria Reynolds, and Stella Biderman. 2021 a . https://blog.eleuther.ai/factored-cognition/ A preliminary exploration into factored cognition with language models . EleutherAI Blog

work page 2021

[33] [33]

Leo Gao, Jonathan Tow, Stella Biderman, Sid Black, Anthony DiPofi, Charles Foster, Laurence Golding, Jeffrey Hsu, Kyle McDonell, Niklas Muennighoff, Jason Phang, Laria Reynolds, Eric Tang, Anish Thite, Ben Wang, Kevin Wang, and Andy Zou. 2021 b . https://doi.org/10.5281/zenodo.5371628 A framework for few-shot language model evaluation

work page doi:10.5281/zenodo.5371628 2021

[34] [34]

Aaron Harlap, Deepak Narayanan, Amar Phanishayee, Vivek Seshadri, Nikhil Devanur, Greg Ganger, and Phil Gibbons. 2018. http://arxiv.org/abs/1806.03377v1 PipeDream : Fast and efficient pipeline parallel DNN training . Computing Research Repository, arXiv:1806.03377. Version 1

work page internal anchor Pith review Pith/arXiv arXiv 2018

[35] [35]

Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. 2021 a . http://arxiv.org/abs/2009.03300v3 Measuring massive multitask language understanding . Computing Research Repository, arXiv:2009.03300. Version 3

work page internal anchor Pith review Pith/arXiv arXiv 2021

[36] [36]

Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. 2021 b . http://arxiv.org/abs/2103.03874v2 Measuring mathematical problem solving with the MATH dataset . Computing Research Repository, arXiv:2103.03874. Version 2

work page internal anchor Pith review Pith/arXiv arXiv 2021

[37] [37]

Scaling Laws for Autoregressive Generative Modeling

Tom Henighan, Jared Kaplan, Mor Katz, Mark Chen, Christopher Hesse, Jacob Jackson, Heewoo Jun, Tom B. Brown, Prafulla Dhariwal, Scott Gray, Chris Hallacy, Benjamin Mann, Alec Radford, Aditya Ramesh, Nick Ryder, Daniel M. Ziegler, John Schulman, Dario Amodei, and Sam McCandlish. 2020. http://arxiv.org/abs/2010.14701v2 Scaling laws for autoregressive genera...

work page internal anchor Pith review Pith/arXiv arXiv 2020

[38] [38]

Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. 2022. http://arxiv.org/abs/2203.15556v1 Training compute-optimal large language models . Computing Research Repository, arXiv:2203.15556. Version 1

work page internal anchor Pith review Pith/arXiv arXiv 2022

[39] [39]

Wenlong Huang, Pieter Abbeel, Deepak Pathak, and Igor Mordatch. 2022. http://arxiv.org/abs/2201.07207v1 Language models as zero-shot planners: Extracting actionable knowledge for embodied agents . Computing Research Repository, arXiv:2201.07207. Version 1

work page arXiv 2022

[40] [40]

Evan Hubinger, Chris van Merwijk, Vladimir Mikulik, Joar Skalse, and Scott Garrabrant. 2021. http://arxiv.org/abs/1906.01820v3 Risks from learned optimization in advanced machine learning systems . Computing Research Repository, arXiv:1906.01820. Version 3

[41]

Mandar Joshi, Eunsol Choi, Daniel Weld, and Luke Zettlemoyer. 2017. https://doi.org/10.18653/v1/P17-1147 TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1601–1611, Vancouver, Canada. Associa...

[42]

Nikhil Kandpal, Eric Wallace, and Colin Raffel. 2022. http://arxiv.org/abs/2202.06539v2 Deduplicating training data mitigates privacy risks in language models. Computing Research Repository, arXiv:2202.06539. Version 2

[43]

Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. 2020. http://arxiv.org/abs/2001.08361v1 Scaling laws for neural language models. Computing Research Repository, arXiv:2001.08361. Version 1

[44]

Boseop Kim, HyoungSeok Kim, Sang-Woo Lee, Gichang Lee, Donghyun Kwak, Jeon Dong Hyeon, Sunghyun Park, Sungju Kim, Seonhoon Kim, Dongpil Seo, Heungsub Lee, Minyoung Jeong, Sungjae Lee, Minsub Kim, Suk Hyun Ko, Seokhun Kim, Taeyong Park, Jinuk Kim, Soyoung Kang, Na-Hyeon Ryu, Kang Min Yoo, Minsuk Chang, Soobin Suh, Sookyo In, Jinseong Park, Kyungduk Kim, Hi...

[45]

Bryan Klimt and Yiming Yang. 2004. https://doi.org/10.1007/978-3-540-30115-8_22 The Enron corpus: A new dataset for email classification research. In Proceedings of the 15th European Conference on Machine Learning, ECML'04, pages 217–226, Berlin, Heidelberg. Springer-Verlag

[46]

Jack Koch, Lauro Langosco, Jacob Pfau, James Le, and Lee Sharkey. 2021. http://arxiv.org/abs/2105.14111v2 Objective robustness in deep reinforcement learning. Computing Research Repository, arXiv:2105.14111. Version 2

[47]

Philipp Koehn. 2005. https://aclanthology.org/2005.mtsummit-papers.11 Europarl: A parallel corpus for statistical machine translation. In Proceedings of Machine Translation Summit X: Papers, pages 79–86, Phuket, Thailand

[48]

Aran Komatsuzaki. 2019. http://arxiv.org/abs/1906.06669v1 One epoch is all you need. Computing Research Repository, arXiv:1906.06669. Version 1

[49]

Vanessa Kosoy. 2016. https://www.alignmentforum.org/posts/5bd75cc58225bf0670375209/irl-is-hard IRL is hard. AI Alignment Forum

[50]

Julia Kreutzer, Isaac Caswell, Lisa Wang, Ahsan Wahab, Daan van Esch, Nasanbayar Ulzii-Orshikh, Allahsera Tapo, Nishant Subramani, Artem Sokolov, Claytone Sikasote, Monang Setyawan, Supheakmungkol Sarin, Sokhar Samb, Benoît Sagot, Clara Rivera, Annette Rios, Isabel Papadimitriou, Salomey Osei, Pedro Ortiz Suarez, Iroro Orife, Kelechi Ogueji, Andre Niyonga...

[51]

Alexandre Lacoste, Alexandra Luccioni, Victor Schmidt, and Thomas Dandres. 2019. http://arxiv.org/abs/1910.09700v2 Quantifying the carbon emissions of machine learning. Computing Research Repository, arXiv:1910.09700. Version 2

[52]

Connor Leahy. 2021. https://blog.eleuther.ai/why-release-a-large-language-model/ Why Release a Large Language Model? EleutherAI Blog

[53]

Connor Leahy and Stella Biderman. 2021. https://montrealethics.ai/volume4/ The hard problem of aligning AI to human values. In The State of AI Ethics Report, volume 4, pages 180–183. The Montreal AI Ethics Institute

[54]

Katherine Lee, Daphne Ippolito, Andrew Nystrom, Chiyuan Zhang, Douglas Eck, Chris Callison-Burch, and Nicholas Carlini. 2021. http://arxiv.org/abs/2107.06499v1 Deduplicating training data makes language models better. Computing Research Repository, arXiv:2107.06499. Version 1

[55]

Opher Lieber, Or Sharir, Barak Lenz, and Yoav Shoham. 2021. https://uploads-ssl.webflow.com/60fd4503684b466578c0d307/61138924626a6981ee09caf6_jurassic_tech_paper.pdf Jurassic-1: Technical details and evaluation. Technical report, AI21 Labs

[56]

Stephanie Lin, Jacob Hilton, and Owain Evans. 2021. http://arxiv.org/abs/2109.07958v1 TruthfulQA: Measuring how models mimic human falsehoods. Computing Research Repository, arXiv:2109.07958. Version 1

[57]

Pierre Lison and Jörg Tiedemann. 2016. https://aclanthology.org/L16-1147 OpenSubtitles2016: Extracting large parallel corpora from movie and TV subtitles. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16), pages 923–929, Portorož, Slovenia. European Language Resources Association (ELRA)

[58]

Jian Liu, Leyang Cui, Hanmeng Liu, Dandan Huang, Yile Wang, and Yue Zhang. 2020. https://doi.org/10.24963/ijcai.2020/501 LogiQA: A challenge dataset for machine reading comprehension with logical reasoning. In Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, IJCAI-20, pages 3622–3628. International Joint Confe...

[59]

Ilya Loshchilov and Frank Hutter. 2019. http://arxiv.org/abs/1711.05101v3 Decoupled weight decay regularization. Computing Research Repository, arXiv:1711.05101. Version 3

[60]

J. Nathan Matias. 2020. https://citizensandtech.org/2020/01/industry-independent-research/ Why we need industry-independent research on tech & society. Citizens and Technology Lab

[61]

Joshua Maynez, Shashi Narayan, Bernd Bohnet, and Ryan McDonald. 2020. http://arxiv.org/abs/2005.00661v1 On faithfulness and factuality in abstractive summarization. Computing Research Repository, arXiv:2005.00661. Version 1

[62]

Kevin Meng, David Bau, Alex Andonian, and Yonatan Belinkov. 2022. http://arxiv.org/abs/2202.05262v1 Locating and editing factual knowledge in GPT. Computing Research Repository, arXiv:2202.05262. Version 1

[63]

Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. 2018. https://doi.org/10.18653/v1/D18-1260 Can a suit of armor conduct electricity? A new dataset for open book question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2381–2391, Brussels, Belgium. Association for Computational Li...

[64]

Toan Q. Nguyen and Julian Salazar. 2019. http://arxiv.org/abs/1910.05895v2 Transformers without tears: Improving the normalization of self-attention. Computing Research Repository, arXiv:1910.05895. Version 2

[65]

Yixin Nie, Adina Williams, Emily Dinan, Mohit Bansal, Jason Weston, and Douwe Kiela. 2020. https://doi.org/10.18653/v1/2020.acl-main.441 Adversarial NLI: A new benchmark for natural language understanding. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4885–4901, Online. Association for Computational L...

[66]

nostalgebraist. 2020. https://www.lesswrong.com/posts/AcKRB8wDpdaN6v6ru/interpreting-gpt-the-logit-lens interpreting GPT: the logit lens. LessWrong

[67]

Maxwell Nye, Anders Johan Andreassen, Guy Gur-Ari, Henryk Michalewski, Jacob Austin, David Bieber, David Dohan, Aitor Lewkowycz, Maarten Bosma, David Luan, Charles Sutton, and Augustus Odena. 2021. http://arxiv.org/abs/2112.00114v1 Show your work: Scratchpads for intermediate computation with language models. Computing Research Repository, arXiv:2112.001...

[68]

Pedro A. Ortega, Markus Kunesch, Grégoire Delétang, Tim Genewein, Jordi Grau-Moya, Joel Veness, Jonas Buchli, Jonas Degrave, Bilal Piot, Julien Perolat, Tom Everitt, Corentin Tallec, Emilio Parisotto, Tom Erez, Yutian Chen, Scott Reed, Marcus Hutter, Nando de Freitas, and Shane Legg. 2021. http://arxiv.org/abs/2110.10819v1 Shaking the foundations: delusio...

[69]

Denis Paperno, Germán Kruszewski, Angeliki Lazaridou, Ngoc Quan Pham, Raffaella Bernardi, Sandro Pezzelle, Marco Baroni, Gemma Boleda, and Raquel Fernández. 2016. https://doi.org/10.18653/v1/P16-1144 The LAMBADA dataset: Word prediction requiring a broad discourse context. In Proceedings of the 54th Annual Meeting of the Association for Computati...

[70]

Anselmo Peñas, Eduard Hovy, Pamela Forner, Álvaro Rodrigo, Richard Sutcliffe, and Roser Morante. 2013. https://doi.org/10.1007/978-3-642-40802-1_29 QA4MRE 2011-2013: Overview of question answering for machine reading evaluation. In Information Access Evaluation. Multilinguality, Multimodality, and Visualization, pages 303–320, Berlin, Heidelberg....

[71]

Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf Improving language understanding by generative pre-training. Technical report, OpenAI

[72]

Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf Language models are unsupervised multitask learners. Technical report, OpenAI

[73]

Scaling Language Models: Methods, Analysis & Insights from Training Gopher

Jack W. Rae, Sebastian Borgeaud, Trevor Cai, Katie Millican, Jordan Hoffmann, H. Francis Song, John Aslanides, Sarah Henderson, Roman Ring, Susannah Young, Eliza Rutherford, Tom Hennigan, Jacob Menick, Albin Cassirer, Richard Powell, George van den Driessche, Lisa Anne Hendricks, Maribeth Rauh, Po-Sen Huang, Amelia Glaese, Johannes Welbl, Sumanth Dathat...

[74]

Jack W Rae, Anna Potapenko, Siddhant M Jayakumar, Chloe Hillier, and Timothy P Lillicrap. 2019. http://arxiv.org/abs/1911.05507v1 Compressive transformers for long-range sequence modelling. Computing Research Repository, arXiv:1911.05507. Version 1

[75]

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. 2020. http://jmlr.org/papers/v21/20-074.html Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21:1–67

[76]

Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He. 2020. https://doi.org/10.5555/3433701.3433727 ZeRO: Memory optimizations toward training trillion parameter models. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC '20. IEEE Press

[77]

Jeff Rasley, Samyam Rajbhandari, Olatunji Ruwase, and Yuxiong He. 2020. https://doi.org/10.1145/3394486.3406703 DeepSpeed: System optimizations enable training deep learning models with over 100 billion parameters. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 3505–3506, New York, NY, USA. As...

[78]

Yasaman Razeghi, Robert L Logan IV, Matt Gardner, and Sameer Singh. 2022. http://arxiv.org/abs/2202.07206v1 Impact of pretraining term frequencies on few-shot reasoning. Computing Research Repository, arXiv:2202.07206. Version 1

[79]

Adam Roberts, Hyung Won Chung, Anselm Levskaya, Gaurav Mishra, James Bradbury, Daniel Andor, Sharan Narang, Brian Lester, Colin Gaffney, Afroz Mohiuddin, Curtis Hawthorne, Aitor Lewkowycz, Alex Salcianu, Marc van Zee, Jacob Austin, Sebastian Goodman, Livio Baldini Soares, Haitang Hu, Sasha Tsvyashchenko, Aakanksha Chowdhery, Jasmijn Bastings, Jannis Bulia...

[80]

Jathan Sadowski, Salomé Viljoen, and Meredith Whittaker. 2021. https://doi.org/10.1038/d41586-021-01812-3 Everyone should decide how their digital data are used — not just tech companies. Nature, 595(7866):169–171
