Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, A Large-Scale Generative Language Model

Brandon Norick; Bryan Catanzaro; Elton Zhang; George Zerveas; Jared Casper; Julie Bernauer; Michael Houston; Mohammad Shoeybi; Mostofa Patwary; Patrick LeGresley

arxiv: 2201.11990 · v3 · pith:OVM55B4Onew · submitted 2022-01-28 · 💻 cs.CL


Shaden Smith, Mostofa Patwary, Brandon Norick, Patrick LeGresley, Samyam Rajbhandari, Jared Casper, Zhun Liu, Shrimai Prabhumoye, George Zerveas, Vijay Korthikanti, Elton Zhang, Rewon Child, Reza Yazdani Aminabadi, Julie Bernauer, Xia Song, Mohammad Shoeybi, Yuxiong He, Michael Houston, Saurabh Tiwary, Bryan Catanzaro


Pith reviewed 2026-05-24 12:09 UTC · model grok-4.3

classification 💻 cs.CL

keywords: large language models, transformer, DeepSpeed, Megatron, 3D parallelism, zero-shot learning, few-shot learning, natural language generation


The pith

A 530 billion parameter transformer model trained via DeepSpeed and Megatron sets new state-of-the-art results on zero-, one-, and few-shot NLP benchmarks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper describes the end-to-end training of Megatron-Turing NLG 530B, the largest monolithic transformer language model reported at the time. It explains the 3D parallelism approach that combines data, model, and pipeline parallelism to fit the model on available hardware, along with the construction and curation of the training corpus. The resulting model delivers higher accuracies than prior systems in zero-shot, one-shot, and few-shot settings across multiple standard NLP tasks. A reader would care because the work shows how hardware-software co-design and data choices together make extreme scale practical and effective for general-purpose language generation.

Core claim

We present details on the training of the largest monolithic transformer based language model, Megatron-Turing NLG 530B (MT-NLG), with 530 billion parameters. Using DeepSpeed and Megatron, we employ a 3D parallelism methodology to enable training at this scale. The design of the training corpus and data curation techniques, which we believe is a key ingredient to the success of the model, allow MT-NLG to achieve superior zero-, one-, and few-shot learning accuracies on several NLP benchmarks and establish new state-of-the-art results.

What carries the argument

3D parallelism (data, model, and pipeline) implemented in DeepSpeed and Megatron that distributes the 530 billion parameter transformer across hardware while maintaining training stability.
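To make the three axes concrete, here is a minimal sketch, not the DeepSpeed or Megatron implementation, of how a flat GPU rank can be decomposed into data, pipeline, and tensor (model) coordinates. The function names, the toy degrees (2 x 3 x 4), and the choice to let the tensor index vary fastest are all illustrative assumptions; this page does not state the degrees or the rank ordering actually used for MT-NLG.

```python
from typing import List, NamedTuple


class Coord(NamedTuple):
    """Position of one GPU in a hypothetical 3D-parallel grid."""
    data: int      # which model replica (data parallelism)
    pipeline: int  # which pipeline stage, i.e. which slice of layers
    tensor: int    # which shard of each layer (model/tensor parallelism)


def rank_to_coord(rank: int, dp: int, pp: int, tp: int) -> Coord:
    """Map a flat rank to grid coordinates.

    Assumed ordering: tensor index varies fastest, then pipeline, then data.
    """
    world_size = dp * pp * tp
    if not 0 <= rank < world_size:
        raise ValueError(f"rank {rank} outside world of size {world_size}")
    tensor = rank % tp
    pipeline = (rank // tp) % pp
    data = rank // (tp * pp)
    return Coord(data, pipeline, tensor)


def tensor_group(rank: int, dp: int, pp: int, tp: int) -> List[int]:
    """Ranks that hold the other shards of the same layers as `rank`."""
    c = rank_to_coord(rank, dp, pp, tp)
    base = (c.data * pp + c.pipeline) * tp
    return list(range(base, base + tp))


# Toy degrees chosen for readability only: 2 replicas, 3 stages, 4 shards.
DP, PP, TP = 2, 3, 4
example = rank_to_coord(17, DP, PP, TP)
peers = tensor_group(17, DP, PP, TP)
```

With these toy degrees, rank 17 sits at data replica 1, pipeline stage 1, tensor shard 1, and its tensor-parallel peers are ranks 16 to 19. Letting the tensor index vary fastest keeps each tensor-parallel group on adjacent ranks, which is one plausible way to keep the most communication-heavy axis on nearby devices.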

If this is right

The infrastructure details enable training of monolithic models at hundreds of billions of parameters.
Data curation techniques directly contribute to the observed generalization in zero- and few-shot regimes.
MT-NLG exhibits new properties in natural language generation that prior smaller models did not display.
The same training stack can be reused to push model size further while retaining benchmark gains.
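A rough back-of-the-envelope calculation shows why a model of this size cannot be trained without partitioning. The sketch below assumes a common mixed-precision Adam accounting of 16 bytes per parameter (2 for half-precision weights, 2 for half-precision gradients, 12 for full-precision master weights and the two Adam moments) and an 80 GB accelerator. Both figures are assumptions for illustration rather than values taken from this page, and setups that also keep a full-precision gradient copy would use a larger per-parameter figure such as 20.

```python
def model_state_bytes(n_params: int, bytes_per_param: int = 16) -> int:
    """Memory for weights, gradients, and optimizer state only.

    The default of 16 bytes per parameter is an assumed mixed-precision
    Adam accounting; activations and buffers are deliberately excluded.
    """
    return n_params * bytes_per_param


def min_devices(n_params: int, device_bytes: int, bytes_per_param: int = 16) -> int:
    """Lower bound on devices needed just to hold the model state."""
    total = model_state_bytes(n_params, bytes_per_param)
    return -(-total // device_bytes)  # ceiling division on integers


N_PARAMS = 530 * 10**9      # parameter count stated for MT-NLG
DEVICE_BYTES = 80 * 10**9   # assumed 80 GB device, decimal gigabytes

total_bytes = model_state_bytes(N_PARAMS)        # 8.48e12 bytes, about 8.5 TB
lower_bound = min_devices(N_PARAMS, DEVICE_BYTES)  # 106 devices
```

Under these assumptions the model state alone is about 8.5 TB, so at least 106 such devices are needed before any activation memory is counted. The real requirement is higher, which is why the state has to be split across devices rather than replicated on each one.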

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar parallelism and curation patterns could be applied to multimodal models that combine text with images or code.
The reported scaling behavior suggests that further increases in parameter count may continue to improve few-shot performance without architectural changes.
Open release of the exact corpus composition would allow independent verification of the data-curation hypothesis.

Load-bearing premise

The design of the training corpus and the data curation techniques are a key ingredient to the success of the model.

What would settle it

A controlled replication that trains an identical 530B model on the same hardware and code but with a standard public corpus lacking the described curation steps, then measures whether zero- and few-shot benchmark scores fall below the reported levels.

Figures

Figures reproduced from arXiv: 2201.11990 by Brandon Norick, Bryan Catanzaro, Elton Zhang, George Zerveas, Jared Casper, Julie Bernauer, Michael Houston, Mohammad Shoeybi, Mostofa Patwary, Patrick LeGresley, Rewon Child, Reza Yazdani Aminabadi, Samyam Rajbhandari, Saurabh Tiwary, Shaden Smith, Shrimai Prabhumoye, Vijay Korthikanti, Xia Song, Yuxiong He, Zhun Liu.

**Figure 2.** Validation loss of MT-NLG.

**Figure 3.** The 100 most common words associated with male and female templates, ordered from most ...

**Figure 4.** Positive and Negative sentiment scores for each ethnicity

**Figure 5.** Natural Language Inference accuracy on the HANS dataset, as a function of the number of shots

Original abstract

Pretrained general-purpose language models can achieve state-of-the-art accuracies in various natural language processing domains by adapting to downstream tasks via zero-shot, few-shot and fine-tuning techniques. Because of their success, the size of these models has increased rapidly, requiring high-performance hardware, software, and algorithmic techniques to enable training such large models. As the result of a joint effort between Microsoft and NVIDIA, we present details on the training of the largest monolithic transformer based language model, Megatron-Turing NLG 530B (MT-NLG), with 530 billion parameters. In this paper, we first focus on the infrastructure as well as the 3D parallelism methodology used to train this model using DeepSpeed and Megatron. Next, we detail the training process, the design of our training corpus, and our data curation techniques, which we believe is a key ingredient to the success of the model. Finally, we discuss various evaluation results, as well as other interesting observations and new properties exhibited by MT-NLG. We demonstrate that MT-NLG achieves superior zero-, one-, and few-shot learning accuracies on several NLP benchmarks and establishes new state-of-the-art results. We believe that our contributions will help further the development of large-scale training infrastructures, large-scale language models, and natural language generations.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper gives a usable engineering account of training MT-NLG at 530B with DeepSpeed plus Megatron 3D parallelism and data curation, but the SOTA claims rest on evaluation details that need checking for prompt and protocol match.


The main point is that this is a report on training the 530B MT-NLG model using DeepSpeed and Megatron together. They describe the 3D parallelism setup that made the run possible and highlight their training corpus design and filtering steps as a key factor. That combination at this exact scale is new enough to be worth noting for anyone doing similar work.

The infrastructure and training process sections are the strongest part. They give concrete details on how the parallelism was configured and how the data was handled, which supplies a practical reference that others can use when scaling up. The paper is clear on the engineering choices and does not overclaim on the methods side.

The softer spot is the evaluation. The abstract states new state-of-the-art zero-, one-, and few-shot results, yet the stress-test concern about prompt templates and decoding settings is reasonable. If the full paper does not list the exact prompts or confirm that the same harness and shot counts were used as the baselines, the margins are difficult to interpret as model improvements rather than protocol differences. That does not make the whole paper invalid, but it does limit how much weight the downstream claims can carry without further verification.

This work is aimed at groups building large-scale training systems rather than readers looking for new modeling ideas. The citation pattern is standard and the report is an empirical training log with no circular math. It shows straightforward engagement with the practical problems. I would bring it to a systems-focused reading group. I would cite the parallelism and data sections if I were writing about scaling infrastructure. It deserves peer review because the scale and the documented implementation choices are worth referee scrutiny even if the results section needs tightening on the evaluation side.

Referee Report

2 major / 1 minor

Summary. The paper describes the joint Microsoft-NVIDIA effort to train Megatron-Turing NLG 530B (MT-NLG), a 530-billion-parameter monolithic transformer language model. It details the infrastructure and 3D parallelism techniques implemented with DeepSpeed and Megatron, the training process, the design and curation of the training corpus, and reports that MT-NLG achieves superior zero-, one-, and few-shot accuracies on multiple NLP benchmarks, establishing new state-of-the-art results.

Significance. If the performance claims hold under comparable evaluation conditions, the work provides a valuable engineering record of scaling transformer training to 530B parameters. The explicit treatment of 3D parallelism, data curation practices, and infrastructure choices offers concrete guidance for future large-scale training efforts. The paper also surfaces observations about emergent properties of the model.

major comments (2)

[Evaluation] Evaluation section: the manuscript asserts new state-of-the-art zero-/one-/few-shot results on several NLP benchmarks, yet supplies no benchmark numbers, baselines, or statistical details in the abstract and does not reference a fixed evaluation harness. This absence prevents verification that reported margins are protocol-independent rather than arising from prompt or decoding differences.
[Evaluation] Evaluation section: the paper does not provide the exact prompt templates, number of shots, or decoding settings used for each benchmark where new SOTA is claimed. Without these, it is impossible to confirm that the superiority is attributable to model scale or data rather than evaluation-protocol variations relative to prior work.

minor comments (1)

[Abstract] The abstract would be strengthened by including at least one or two concrete benchmark numbers and the corresponding prior SOTA values.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed review and constructive comments on the evaluation section. We agree that greater transparency in benchmark reporting, baselines, and protocol details will improve verifiability. We will revise the manuscript to address these points while preserving the paper's focus on training infrastructure and data curation.


Referee: [Evaluation] Evaluation section: the manuscript asserts new state-of-the-art zero-/one-/few-shot results on several NLP benchmarks, yet supplies no benchmark numbers, baselines, or statistical details in the abstract and does not reference a fixed evaluation harness. This absence prevents verification that reported margins are protocol-independent rather than arising from prompt or decoding differences.

Authors: We acknowledge that the abstract contains only a high-level claim without numerical results or harness details, which is typical for length-constrained abstracts but can reduce immediate verifiability. The main Evaluation section does include comparative tables against prior models; however, to strengthen the paper we will (1) add a concise summary of key benchmark scores and baselines to the abstract where space permits, (2) explicitly name the evaluation harness and any custom adaptations in the Evaluation section, and (3) include error bars or statistical notes where multiple runs were performed. These changes will be made in the revised manuscript. revision: yes
Referee: [Evaluation] Evaluation section: the paper does not provide the exact prompt templates, number of shots, or decoding settings used for each benchmark where new SOTA is claimed. Without these, it is impossible to confirm that the superiority is attributable to model scale or data rather than evaluation-protocol variations relative to prior work.

Authors: We agree that reproducibility of the reported SOTA claims requires the precise prompts, shot counts, and decoding parameters. The current manuscript references standard few-shot setups for the cited benchmarks but does not reproduce the templates. In the revision we will add a dedicated appendix (or subsection) that lists, for every benchmark where a new SOTA is claimed: the exact prompt template, number of shots, decoding strategy (e.g., greedy, nucleus sampling parameters), and any post-processing steps. This will allow direct comparison with prior work and confirm that gains are not protocol artifacts. revision: yes
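The protocol details discussed in this exchange (prompt template, shot count, decoding settings) can be captured in a small record so that two evaluations are comparable only when the records match. The following is a minimal sketch under stated assumptions: the `EvalProtocol` class, the `build_prompt` helper, the toy benchmark name, and the question/answer template are all hypothetical and are not taken from the paper or from any evaluation harness.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class EvalProtocol:
    """Hypothetical record of everything needed to reproduce one score."""
    benchmark: str
    template: str          # must contain {question} and {answer}
    num_shots: int         # 0 for zero-shot, 1 for one-shot, and so on
    decoding: str          # free-text description, e.g. "greedy"
    separator: str = "\n\n"


def build_prompt(protocol: EvalProtocol,
                 demos: List[Tuple[str, str]],
                 query: str) -> str:
    """Render num_shots solved demonstrations followed by the open query."""
    if len(demos) < protocol.num_shots:
        raise ValueError("not enough demonstrations for the requested shots")
    shots = [
        protocol.template.format(question=q, answer=a)
        for q, a in demos[: protocol.num_shots]
    ]
    # Leave the answer slot empty and trim the trailing space after it.
    final = protocol.template.format(question=query, answer="").rstrip()
    return protocol.separator.join(shots + [final])


# Illustrative values only; none of these come from the paper.
one_shot = EvalProtocol(
    benchmark="toy-arithmetic",
    template="Question: {question}\nAnswer: {answer}",
    num_shots=1,
    decoding="greedy",
)
prompt = build_prompt(one_shot, [("2+2?", "4")], "3+3?")
```

Here `prompt` is the one solved demonstration, a blank line, and then the open query ending in `Answer:`. Making the record frozen means a protocol cannot be altered after a score has been attributed to it, which is the property the referee is asking for.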

Circularity Check

0 steps flagged

Empirical training report with no derivation chain


The manuscript is an engineering report detailing hardware/software infrastructure (DeepSpeed + Megatron 3D parallelism), training corpus construction, data curation, and measured benchmark accuracies for the 530B model. No equations, fitted parameters, or predictions are presented that reduce by construction to the paper's own inputs. Evaluation results are reported as direct empirical outcomes rather than derived quantities. No self-citation load-bearing steps, ansatz smuggling, or uniqueness theorems appear in the derivation chain. The paper is self-contained as a factual account of a training run.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The work rests on standard assumptions of the transformer architecture and the effectiveness of existing parallelism libraries; no new entities or free parameters are introduced in the abstract.

axioms (2)

domain assumption: Transformer-based language models scale effectively with parameter count and data quality.
Implicit in the decision to train at 530B parameters and to emphasize data curation.
domain assumption: 3D parallelism from DeepSpeed and Megatron can be applied without fundamental bottlenecks at this scale.
Stated as the methodology used to enable training.



Forward citations

Cited by 42 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling
cs.CL · 2023-04 · accept · novelty 8.0
Pythia releases 16 identically trained LLMs with full checkpoints and data tools to study training dynamics, scaling, memorization, and bias in language models.

Analyzing Reverse Address Translation Overheads in Multi-GPU Scale-Up Pods
cs.DC · 2026-04 · unverdicted · novelty 7.0
Simulation study shows cold TLB misses in reverse address translation dominate latency for small collectives in multi-GPU pods, causing up to 1.4x degradation, while larger ones see diminishing returns.

Do NOT Think That Much for 2+3=? On the Overthinking of o1-Like LLMs
cs.CL · 2024-12 · unverdicted · novelty 7.0
o1-like models overthink easy tasks; self-training reduces compute use without accuracy loss on GSM8K, MATH500, GPQA, and AIME.

DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model
cs.CL · 2024-05 · unverdicted · novelty 7.0
DeepSeek-V2 delivers top-tier open-source LLM performance using only 21B active parameters by compressing the KV cache 93.3% and cutting training costs 42.5% via MLA and DeepSeekMoE.

Program of Thoughts Prompting: Disentangling Computation from Reasoning for Numerical Reasoning Tasks
cs.CL · 2022-11 · unverdicted · novelty 7.0
PoT prompting improves numerical reasoning by having language models write programs executed by a computer instead of performing calculations in natural language chains of thought, with an average 12% gain over CoT.

Large Language Models are Zero-Shot Reasoners
cs.CL · 2022-05 · accept · novelty 7.0
Adding the fixed prompt 'Let's think step by step' enables large language models to achieve substantial zero-shot gains on arithmetic, symbolic, and logical reasoning benchmarks without any task-specific examples.

OPT: Open Pre-trained Transformer Language Models
cs.CL · 2022-05 · unverdicted · novelty 7.0
OPT releases open decoder-only transformers up to 175B parameters that match GPT-3 performance at one-seventh the carbon cost, along with code and training logs.

M$^2$RNN: Non-Linear RNNs with Matrix-Valued States for Scalable Language Modeling
cs.LG · 2026-03 · unverdicted · novelty 6.0
M²RNN achieves perfect state tracking at unseen lengths and outperforms Gated DeltaNet hybrids by 0.4-0.5 perplexity on 7B models with 3x smaller recurrent states.

veScale-FSDP: Flexible and High-Performance FSDP at Scale
cs.DC · 2026-02 · unverdicted · novelty 6.0
veScale-FSDP uses RaggedShard and structure-aware planning to support block-wise quantization and non-element-wise optimizers while delivering 5-66% higher throughput and 16-30% lower memory than prior FSDP systems at...

Chameleon: Adaptive Fault Tolerance for Distributed Training via Real-time Policy Selection
cs.DC · 2025-08 · unverdicted · novelty 6.0
Chameleon provides adaptive fault tolerance for distributed training by real-time selection of optimal recovery policies via a unified performance model, demonstrated with low overhead on a 32-card cluster.

Does Math Reasoning Improve General LLM Capabilities? Understanding Transferability of LLM Reasoning
cs.AI · 2025-07 · conditional · novelty 6.0
Math reasoning gains in LLMs rarely transfer to general domains; RL tuning generalizes while SFT causes forgetting and representation drift.

MiniMax-01: Scaling Foundation Models with Lightning Attention
cs.CL · 2025-01 · unverdicted · novelty 6.0
MiniMax-01 models match GPT-4o and Claude-3.5-Sonnet performance while providing 20-32 times longer context windows through lightning attention and MoE scaling.

The Falcon Series of Open Language Models
cs.CL · 2023-11 · conditional · novelty 6.0
Falcon-180B is a 180B-parameter open decoder-only model trained on 3.5 trillion tokens that approaches PaLM-2-Large performance at lower cost and is released with dataset extracts.

DeepSpeed Ulysses: System Optimizations for Enabling Training of Extreme Long Sequence Transformer Models
cs.LG · 2023-09 · accept · novelty 6.0
DeepSpeed-Ulysses keeps communication volume constant for sequence-parallel attention when sequence length and device count scale together, delivering 2.5x faster training on 4x longer sequences than prior SOTA.

Scaling Data-Constrained Language Models
cs.CL · 2023-05 · conditional · novelty 6.0
Repeating training data up to 4 epochs yields negligible loss increase versus unique data for fixed compute, and a new scaling law accounts for the decaying value of repeated tokens and excess parameters.

Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes
cs.CL · 2023-05 · conditional · novelty 6.0
Distilling step-by-step uses LLM-generated rationales as additional supervision in a multi-task framework so that 770M-parameter models outperform 540B-parameter models on NLP benchmarks with only 80% of the data.

MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models
cs.CV · 2023-04 · conditional · novelty 6.0
MiniGPT-4 shows that aligning a frozen vision encoder to Vicuna via one projection layer plus a second-stage detailed-description fine-tune produces GPT-4-like vision-language abilities including detailed captions, cr...

BloombergGPT: A Large Language Model for Finance
cs.LG · 2023-03 · conditional · novelty 6.0
BloombergGPT is a 50B parameter LLM trained on a 708B token mixed financial and general dataset that outperforms prior models on financial benchmarks while preserving general LLM performance.

Language Models can Solve Computer Tasks
cs.CL · 2023-03 · accept · novelty 6.0
Pre-trained LLMs using recursive criticism and improvement prompting achieve state-of-the-art results on the MiniWoB++ computer task benchmark with only a handful of demonstrations and no task-specific reward function.

FP8 Formats for Deep Learning
cs.LG · 2022-09 · unverdicted · novelty 6.0
FP8 formats E4M3 and E5M2 match 16-bit training accuracy on CNNs, RNNs, and Transformers up to 175B parameters without hyperparameter changes.

Atlas: Few-shot Learning with Retrieval Augmented Language Models
cs.CL · 2022-08 · unverdicted · novelty 6.0
Atlas reaches over 42% accuracy on Natural Questions with only 64 examples, outperforming a 540B-parameter model by 3% with 50x fewer parameters.

Efficient Training of Language Models to Fill in the Middle
cs.CL · 2022-07 · unverdicted · novelty 6.0
Autoregressive language models trained on data with middle spans relocated to the end learn infilling without degrading left-to-right perplexity or sampling quality.

MRKL Systems: A modular, neuro-symbolic architecture that combines large language models, external knowledge sources and discrete reasoning
cs.CL · 2022-05 · unverdicted · novelty 6.0
MRKL is a modular neuro-symbolic architecture that integrates LLMs with external knowledge and discrete reasoning to overcome limitations of pure neural language models.

GPT-NeoX-20B: An Open-Source Autoregressive Language Model
cs.CL · 2022-04 · accept · novelty 6.0
GPT-NeoX-20B is a publicly released 20B parameter autoregressive language model trained on the Pile that shows strong gains in five-shot reasoning over similarly sized prior models.

PaLM: Scaling Language Modeling with Pathways
cs.CL · 2022-04 · accept · novelty 6.0
PaLM 540B demonstrates continued scaling benefits by setting new few-shot SOTA results on hundreds of benchmarks and outperforming humans on BIG-bench.

Charon: A Unified and Fine-Grained Simulator for Large-Scale LLM Training and Inference
cs.DC · 2026-05 · unverdicted · novelty 5.0
Charon is a unified modular simulator that predicts LLM training and inference performance with under 5.35% error and identifies throughput improvements over baselines in a real deployment case.

Charon: A Unified and Fine-Grained Simulator for Large-Scale LLM Training and Inference
cs.DC · 2026-05 · unverdicted · novelty 5.0
Charon is a unified fine-grained simulator that predicts LLM performance with under 5.35% error overall and under 3.74% for large-scale training, and it found a better inference configuration than an engineering baseline.

Transforming the Use of Earth Observation Data: Exascale Training of a Generative Compression Model with Historical Priors for up to 10,000x Data Reduction
cs.DC · 2026-05 · unverdicted · novelty 5.0
A generative compression model using historical priors for Earth observation data achieves up to 10,000x reduction after exascale training on an Armv9 supercomputer.

TACO: Efficient Communication Compression of Intermediate Tensors for Scalable Tensor-Parallel LLM Training
cs.DC · 2026-04 · unverdicted · novelty 5.0
TACO compresses tensor-parallel intermediate tensors with an adaptive FP8 scheme and fused kernels, yielding up to 1.87X throughput gains on GPT and Qwen models with near-lossless accuracy.

SparseBalance: Load-Balanced Long Context Training with Dynamic Sparse Attention
cs.LG · 2026-04 · unverdicted · novelty 5.0
SparseBalance dynamically adjusts sparsity and batches workloads to load-balance sparse attention training, delivering up to 1.33x speedup and 0.46% better long-context performance on LongBench.

SEDD: Scalable and Efficient Dataset Deduplication with GPUs
cs.CL · 2025-01 · unverdicted · novelty 5.0
SEDD delivers a distributed GPU deduplication system that reports up to 158x speedup over CPU baselines and 7.8x over NeMo Curator on 30M documents while preserving MinHash fidelity above 0.95 Jaccard.

MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning
cs.CV · 2023-10 · unverdicted · novelty 5.0
MiniGPT-v2 adds unique task identifiers to a large language model so one system can perform image description, visual question answering, and visual grounding after three-stage training.

StarCoder: may the source be with you!
cs.CL · 2023-05 · accept · novelty 5.0
StarCoderBase matches or beats OpenAI's code-cushman-001 on multi-language code benchmarks; the Python-fine-tuned StarCoder reaches 40% pass@1 on HumanEval while retaining other-language performance.

RAFT: Reward rAnked FineTuning for Generative Foundation Model Alignment
cs.LG · 2023-04 · unverdicted · novelty 5.0
RAFT aligns generative models by ranking samples with a reward model and fine-tuning only on the top-ranked outputs, reporting gains on reward scores and automated metrics for LLMs and diffusion models.

Cross-Layer Energy Analysis of Multimodal Training on Grace Hopper Superchips
cs.DC · 2026-05 · unverdicted · novelty 4.0
On Grace Hopper superchips, energy efficiency during multimodal training is governed by data movement and overlap rather than compute utilization, and runtime-optimal configurations are not always energy-optimal.

DeepSeek LLM: Scaling Open-Source Language Models with Longtermism
cs.CL · 2024-01 · unverdicted · novelty 4.0
DeepSeek LLM 67B exceeds LLaMA-2 70B on code, mathematics and reasoning benchmarks after pre-training on 2 trillion tokens and alignment via SFT and DPO.

Phoenix-VL 1.5 Medium Technical Report
cs.CL · 2026-05 · unverdicted · novelty 3.0
Phoenix-VL 1.5 Medium is a 123B-parameter natively multimodal model that reaches state-of-the-art results on Singapore multimodal, legal, and policy benchmarks after localized training on 1T+ tokens while staying comp...

A Scalable Recipe on SuperMUC-NG Phase 2: Efficient Large-Scale Training of Language Models
cs.DC · 2026-05 · unverdicted · novelty 3.0
A combined parallelism recipe on SuperMUC-NG Phase 2 delivers 10% of theoretical peak throughput for 175B models plus 93% weak and 82% strong scaling efficiency on 128 nodes using unmodified public software.

Towards EnergyGPT: A Large Language Model Specialized for the Energy Sector
cs.CL · 2025-09 · unverdicted · novelty 3.0
Fine-tuned LLaMA 3.1-8B variants for the energy sector outperform the base model on domain QA benchmarks, with LoRA delivering similar gains at lower training cost.

Large Language Models: A Survey
cs.CL · 2024-02 · accept · novelty 3.0
The paper surveys key large language models, their training methods, datasets, evaluation benchmarks, and future research directions in the field.

A Survey of Large Language Models
cs.CL · 2023-03 · accept · novelty 3.0
This survey reviews the background, key techniques, and evaluation methods for large language models, emphasizing emergent abilities that appear at large scales.

A Comprehensive Overview of Large Language Models
cs.CL · 2023-07 · unverdicted · novelty 2.0
A survey paper providing an overview of Large Language Models, their background, and recent advances in the field.

Reference graph

Works this paper leans on

78 extracted references · 78 canonical work pages · cited by 41 Pith papers · 17 internal anchors

[1]

https://www.nvidia.com/en-us/data-center/a100/

NVIDIA A100 Tensor Core GPU. https://www.nvidia.com/en-us/data-center/a100/

work page
[2]

https://www.top500.org/system/179842/

NVIDIA Selene Supercomputer. https://www.top500.org/system/179842/

work page
[3]

https://www.nvidia.com/en-us/data-center/nvlink/

NVLink and NVSwitch. https://www.nvidia.com/en-us/data-center/nvlink/

work page
[4]

https://www.microsoft.com/en-us/research/blog/ turing-nlg-a-17-billion-parameter-language-model-by-microsoft/

Turing-NLG: A 17-billion-parameter language model by Mi- crosoft. https://www.microsoft.com/en-us/research/blog/ turing-nlg-a-17-billion-parameter-language-model-by-microsoft/

work page
[5]

https://wudaoai.cn/home

Wu Dao 2.0 Large-scale Pretrained Model. https://wudaoai.cn/home

work page
[6]

Piqa: Reasoning about physical commonsense in natural language

Yonatan Bisk, Rowan Zellers, Ronan Le Bras, Jianfeng Gao, and Yejin Choi. Piqa: Reasoning about physical commonsense in natural language. In AAAI, 2020

work page 2020
[7]

Zou, Venkatesh Saligrama, and Adam Tauman Kalai

Tolga Bolukbasi, Kai-Wei Chang, James Y . Zou, Venkatesh Saligrama, and Adam Tauman Kalai. Man is to computer programmer as woman is to homemaker? debiasing word embeddings. In NIPS, 2016

work page 2016
[8]

On the Opportunities and Risks of Foundation Models

Rishi Bommasani, Drew A Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, et al. On the opportuni- ties and risks of foundation models. arXiv preprint arXiv:2108.07258, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[9]

Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert- V oss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litw...

work page 2020
[10]

BoolQ: Exploring the surprising difﬁculty of natural yes/no questions

Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. BoolQ: Exploring the surprising difﬁculty of natural yes/no questions. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), ...

work page 2019
[11]

langdetect, 2021

Michal Danilk. langdetect, 2021. Version 1.0.9

work page 2021
[12]

Bert: Pre-training of deep bidirectional transformers for language understanding

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. In NAACL, 2019

work page 2019
[13]

Language and Gender

Penelope Eckert and Sally McConnell-Ginet. Language and Gender . Cambridge University Press, 2003

work page 2003
[14]

Improving gender fairness of pre-trained language models without catastrophic forgetting

Zahra Fatemi, Chen Xing, Wenhao Liu, and Caiming Xiong. Improving gender fairness of pre-trained language models without catastrophic forgetting. arXiv preprint arXiv:2110.05367, 2021

work page arXiv 2021
[15]

William Fedus, Barret Zoph, and Noam M. Shazeer. Switch transformers: Scaling to trillion parameter models with simple and efﬁcient sparsity.ArXiv, abs/2101.03961, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[16]

Lyn Frazier and Janet D. Fodor. The sausage machine: A new two-stage parsing model. Cognition, 6(4):291–325, 1978. Place: Netherlands Publisher: Elsevier Science

work page 1978
[17]

The Pile: An 800GB Dataset of Diverse Text for Language Modeling

Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, Shawn Presser, and Connor Leahy. The Pile: An 800gb dataset of diverse text for language modeling. arXiv preprint arXiv:2101.00027, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2020
[18]

Leo Gao, Jonathan Tow, Stella Biderman, Sid Black, Anthony DiPofi, Charles Foster, Laurence Golding, Jeffrey Hsu, Kyle McDonell, Niklas Muennighoff, Jason Phang, Laria Reynolds, Eric Tang, Anish Thite, Ben Wang, Kevin Wang, and Andy Zou. A framework for few-shot language model evaluation, September 2021
[19]

Samuel Gehman, Suchin Gururangan, Maarten Sap, Yejin Choi, and Noah A Smith. RealToxicityPrompts: Evaluating neural toxic degeneration in language models. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 3356–3369, 2020
[20]

Suchin Gururangan, Ana Marasović, Swabha Swayamdipta, Kyle Lo, Iz Beltagy, Doug Downey, and Noah A. Smith. Don’t stop pretraining: Adapt language models to domains and tasks. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8342–8360, Online, July 2020. Association for Computational Linguistics
[21]

Suchin Gururangan, Swabha Swayamdipta, Omer Levy, Roy Schwartz, Samuel R. Bowman, and Noah A. Smith. Annotation artifacts in natural language inference data. In Marilyn A. Walker, Heng Ji, and Amanda Stent, editors, Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologie...
[22]

Dan Hendrycks, Xiaoyuan Liu, Eric Wallace, Adam Dziedzic, Rishabh Krishnan, and Dawn Song. Pretrained transformers improve out-of-distribution robustness. In Dan Jurafsky, Joyce Chai, Natalie Schluter, and Joel R. Tetreault, editors, Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5-10, 2020,...
[23]

Yanping Huang, Youlong Cheng, Ankur Bapna, Orhan Firat, Dehao Chen, Mia Chen, HyoukJoong Lee, Jiquan Ngiam, Quoc V Le, Yonghui Wu, et al. Gpipe: Efficient training of giant neural networks using pipeline parallelism. Advances in neural information processing systems, 32:103–112, 2019
[24]

Yufan Jiang, Shuangzhi Wu, Jing Gong, Yahui Cheng, Peng Meng, Weiliang Lin, Zhibo Chen, and Mu Li. Improving machine reading comprehension with single-choice decision and transfer learning. ArXiv, abs/2011.03292, 2020
[25]

Mandar Joshi, Eunsol Choi, Daniel S. Weld, and Luke Zettlemoyer. Triviaqa: A large scale distantly supervised challenge dataset for reading comprehension. In ACL, 2017
[26]

Rafal Józefowicz, Oriol Vinyals, Mike Schuster, Noam M. Shazeer, and Yonghui Wu. Exploring the limits of language modeling. ArXiv, abs/1602.02410, 2016
[27]

Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014
[28]

Ben Krause, Akhilesh Deepak Gotmare, Bryan McCann, Nitish Shirish Keskar, Shafiq Joty, Richard Socher, and Nazneen Rajani. Gedi: Generative discriminator guided sequence generation, 2021
[29]

Ranjay Krishna, Kenji Hata, Frederic Ren, Li Fei-Fei, and Juan Carlos Niebles. Dense-captioning events in videos. 2017 IEEE International Conference on Computer Vision (ICCV), pages 706–715, 2017
[30]

Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, Kristina Toutanova, Llion Jones, Matthew Kelcey, Ming-Wei Chang, Andrew M. Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov. Natural Questions: A Benchmark for Question Answering Research. Tran...
[31]

Guokun Lai, Qizhe Xie, Hanxiao Liu, Yiming Yang, and Eduard Hovy. RACE: Large-scale ReAding comprehension dataset from examinations. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 785–794, Copenhagen, Denmark, September 2017. Association for Computational Linguistics
[32]

Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yan-Ping Huang, Maxim Krikun, Noam M. Shazeer, and Z. Chen. Gshard: Scaling giant models with conditional computation and automatic sharding. ArXiv, abs/2006.16668, 2021
[33]

Brian Lester, Rami Al-Rfou, and Noah Constant. The power of scale for parameter-efficient prompt tuning. arXiv preprint arXiv:2104.08691, 2021
[34]

Opher Lieber, Or Sharir, Barak Lenz, and Yoav Shoham. Jurassic-1: Technical details and evaluation.
[35]

Junyang Lin, Rui Men, An Yang, Chan Zhou, Ming Ding, Yichang Zhang, Peng Wang, Ang Wang, Le Jiang, Xianyan Jia, J. Zhang, Jianwei Zhang, Xu Zou, Zhikang Li, Xiao Qing Deng, Jie Liu, Jinbao Xue, Huiling Zhou, Jianxin Ma, Jin Yu, Yonghong Li, Wei Lin, Jingren Zhou, Jie Tang, and Hongxia Yang. M6: A chinese multimodal pretrainer. ArXiv, abs/2103.00823, 2021
[36]

Junyang Lin, An Yang, Jinze Bai, Chang Zhou, Le Jiang, Xianyan Jia, Ang Wang, Jie Zhang, Yong Li, Wei Lin, Jingren Zhou, and Hongxia Yang. M6-10t: A sharing-delinking paradigm for efficient multi-trillion parameter pretraining. 2021
[37]

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692, 2019
[38]

Nicholas Lourie, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. Unicorn on rainbow: A universal commonsense reasoning model on a new multitask benchmark. In AAAI, 2021
[39]

Thomas Manzini, Lim Yao Chong, Alan W Black, and Yulia Tsvetkov. Black is to criminal as caucasian is to police: Detecting and removing multiclass bias in word embeddings. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pa...
[40]

Tom McCoy, Ellie Pavlick, and Tal Linzen. Right for the wrong reasons: Diagnosing syntactic heuristics in natural language inference. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3428–3448, Florence, Italy, July 2019. Association for Computational Linguistics
[41]

Paulius Micikevicius, Sharan Narang, Jonah Alben, Gregory Diamos, Erich Elsen, David Garcia, Boris Ginsburg, Michael Houston, Oleksii Kuchaiev, Ganesh Venkatesh, et al. Mixed precision training. arXiv preprint arXiv:1710.03740, 2017
[42]

Deepak Narayanan, Aaron Harlap, Amar Phanishayee, Vivek Seshadri, Nikhil R Devanur, Gregory R Ganger, Phillip B Gibbons, and Matei Zaharia. Pipedream: generalized pipeline parallelism for dnn training. In Proceedings of the 27th ACM Symposium on Operating Systems Principles, pages 1–15, 2019
[43]

Deepak Narayanan, Mohammad Shoeybi, Jared Casper, Patrick LeGresley, Mostofa Patwary, Vijay Anand Korthikanti, Dmitri Vainbrand, Prethvi Kashinkunti, Julie Bernauer, Bryan Catanzaro, Amar Phanishayee, and Matei Zaharia. Efficient large-scale language model training on gpu clusters using megatron-lm. ArXiv, abs/2104.04473, 2021
[44]

Helen Ngo, Cooper Raterink, João GM Araújo, Ivan Zhang, Carol Chen, Adrien Morisot, and Nicholas Frosst. Mitigating harm in language models with conditional-likelihood filtration. arXiv preprint arXiv:2108.07790, 2021
[45]

Toan Q. Nguyen and Julian Salazar. Transformers without tears: Improving the normalization of self-attention. CoRR, abs/1910.05895, 2019
[46]

Yixin Nie, Adina Williams, Emily Dinan, Mohit Bansal, Jason Weston, and Douwe Kiela. Adversarial nli: A new benchmark for natural language understanding. ArXiv, abs/1910.14599, 2020
[47]

Yixin Nie, Adina Williams, Emily Dinan, Mohit Bansal, Jason Weston, and Douwe Kiela. Adversarial NLI: A new benchmark for natural language understanding. In Dan Jurafsky, Joyce Chai, Natalie Schluter, and Joel R. Tetreault, editors, Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5-10, 2020, ...
[48]

Pedro Javier Ortiz Suárez, Benoît Sagot, and Laurent Romary. Asynchronous Pipeline for Processing Huge Corpora on Medium to Low Resource Infrastructures. In Piotr Bański, Adrien Barbaresi, Hanno Biber, Evelyn Breiteneder, Simon Clematide, Marc Kupietz, Harald Lüngen, and Caroline Iliadi, editors, 7th Workshop on the Challenges in the Management of...
[49]

Denis Paperno, Germán Kruszewski, Angeliki Lazaridou, Ngoc Quan Pham, Raffaella Bernardi, Sandro Pezzelle, Marco Baroni, Gemma Boleda, and Raquel Fernández. The LAMBADA dataset: Word prediction requiring a broad discourse context. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pa...
[50]

Mohammad Taher Pilehvar and José Camacho-Collados. Wic: the word-in-context dataset for evaluating context-sensitive meaning representations. In NAACL, 2019
[51]

Christopher Potts. Sentiwordnet, 2021
[52]

Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. 2019
[53]

Jack W. Rae, Sebastian Borgeaud, Trevor Cai, Katie Millican, Jordan Hoffmann, Francis Song, John Aslanides, Sarah Henderson, Roman Ring, Susannah Young, Eliza Rutherford, Tom Hennigan, Jacob Menick, Albin Cassirer, Richard Powell, George van den Driessche, Lisa Anne Hendricks, Maribeth Rauh, Po-Sen Huang, Amelia Glaese, Johannes Welbl, Sumanth Dathathri, ...

[54]

Colin Raffel, Noam Shazeer, et al. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. ArXiv, abs/1910.10683, 2019
[55]

Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He. Zero: Memory optimizations toward training trillion parameter models. In SC20: International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1–16. IEEE, 2020
[56]

Samyam Rajbhandari, Olatunji Ruwase, Jeff Rasley, Shaden Smith, and Yuxiong He. Zero-infinity: Breaking the gpu memory wall for extreme scale deep learning. arXiv preprint arXiv:2104.07857, 2021
[57]

Jeff Rasley, Samyam Rajbhandari, Olatunji Ruwase, and Yuxiong He. Deepspeed: System optimizations enable training deep learning models with over 100 billion parameters. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 3505–3506, 2020
[58]

Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. Winogrande: An adversarial winograd schema challenge at scale. In AAAI, 2020
[59]

Victor Sanh, Albert Webson, Colin Raffel, Stephen H. Bach, Lintang A. Sutawika, Zaid Alyafeai, Antoine Chaffin, Arnaud Stiegler, Teven Le Scao, Arun Raja, Manan Dey, M SAIFUL BARI, Canwen Xu, Urmish Thakker, Shanya Sharma Sharma, Eliza Szczechla, Taewoon Kim, Gunjan Chhablani, Nihal V. Nayak, Debajyoti Datta, Jonathan Chang, Mike Tian-Jian Jiang, Han Wang...
[60]

Timo Schick, Sahana Udupa, and Hinrich Schütze. Self-diagnosis and self-debiasing: A proposal for reducing corpus-based bias in nlp. arXiv preprint arXiv:2103.00453, 2021
[61]

Noam M. Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc V. Le, Geoffrey E. Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. ArXiv, abs/1701.06538, 2017
[62]

Emily Sheng, Kai-Wei Chang, Premkumar Natarajan, and Nanyun Peng. The woman worked as a babysitter: On biases in language generation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3407–3412, Hong Kong, China, Novemb...
[63]

Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. Megatron-lm: Training multi-billion parameter language models using model parallelism. CoRR, abs/1909.08053, 2019
[64]

Robyn Speer. ftfy. Zenodo, 2019. Version 5.5

[65]

DeepSpeed Team. DeepSpeed. https://github.com/microsoft/DeepSpeed, 2021
[66]

Trieu H. Trinh and Quoc V. Le. A simple method for commonsense reasoning. CoRR, abs/1806.02847, 2018
[67]

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. CoRR, abs/1706.03762, 2017
[68]

Boxin Wang, Shuohang Wang, Yu Cheng, Zhe Gan, R. Jia, Bo Li, and Jingjing Liu. Infobert: Improving robustness of language models from an information theoretic perspective. ArXiv, abs/2010.02329, 2021
[69]

Zirui Wang, Adams Wei Yu, Orhan Firat, and Yuan Cao. Towards zero-label language learning. ArXiv, abs/2109.09193, 2021
[70]

Jason Wei, Maarten Bosma, Vincent Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M. Dai, and Quoc V. Le. Finetuned language models are zero-shot learners. ArXiv, abs/2109.01652, 2021
[71]

Laura Weidinger, John Mellor, Maribeth Rauh, Conor Griffin, Jonathan Uesato, Po-Sen Huang, Myra Cheng, Mia Glaese, Borja Balle, Atoosa Kasirzadeh, Zac Kenton, Sasha Brown, Will Hawkins, Tom Stepleton, Courtney Biles, Abeba Birhane, Julia Haas, Laura Rimell, Lisa Anne Hendricks, William Isaac, Sean Legassick, Geoffrey Irving, and Iason Gabriel. Ethical and ...
[72]

Johannes Welbl, Amelia Glaese, Jonathan Uesato, Sumanth Dathathri, John Mellor, Lisa Anne Hendricks, Kirsty Anderson, Pushmeet Kohli, Ben Coppin, and Po-Sen Huang. Challenges in detoxifying language models. arXiv preprint arXiv:2109.07445, 2021
[73]

Adina Williams, Nikita Nangia, and Samuel Bowman. A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1112–1122. Association for Computational Linguis...
[74]

Shaohua Wu, Xudong Zhao, Tong Yu, Rongguo Zhang, Chong Shen, Hongli Liu, Feng Li, Hong Zhu, Jiangang Luo, Liang Xu, and Xuanwei Zhang. Yuan 1.0: Large-scale pre-trained language model in zero-shot and few-shot learning. ArXiv, abs/2110.04725, 2021
[75]

Dani Yogatama, Cyprien de Masson d’Autume, Jerome T. Connor, Tomás Kociský, Mike Chrzanowski, Lingpeng Kong, Angeliki Lazaridou, Wang Ling, Lei Yu, Chris Dyer, and Phil Blunsom. Learning and evaluating general linguistic intelligence. CoRR, abs/1901.11373, 2019
[76]

Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. Hellaswag: Can a machine really finish your sentence? In ACL, 2019
[77]

Rowan Zellers, Ari Holtzman, Hannah Rashkin, Yonatan Bisk, Ali Farhadi, Franziska Roesner, and Yejin Choi. Defending against neural fake news. CoRR, abs/1905.12616, 2019
[78]

Wei Zeng, Xiaozhe Ren, Teng Su, Hui Wang, Yi Liao, Zhiwei Wang, Xin Jiang, ZhenZhang Yang, Kaisheng Wang, Xiaoda Zhang, Chen Li, Ziyan Gong, Yifan Yao, Xinjing Huang, Jun Wang, Jianfeng Yu, Qilong Guo, Yue Yu, Yan Zhang, Jin Wang, Heng Tao, Dasen Yan, Zexuan Yi, Fang Peng, Fan Jiang, Han Zhang, Lingfeng Deng, Yehong Zhang, Zhengping Lin, Chao Zhang, Sha...

[11] [11]

langdetect, 2021

Michal Danilk. langdetect, 2021. Version 1.0.9

work page 2021

[12] [12]

Bert: Pre-training of deep bidirectional transformers for language understanding

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. In NAACL, 2019

work page 2019

[13] [13]

Language and Gender

Penelope Eckert and Sally McConnell-Ginet. Language and Gender . Cambridge University Press, 2003

work page 2003

[14] [14]

Improving gender fairness of pre-trained language models without catastrophic forgetting

Zahra Fatemi, Chen Xing, Wenhao Liu, and Caiming Xiong. Improving gender fairness of pre-trained language models without catastrophic forgetting. arXiv preprint arXiv:2110.05367, 2021

work page arXiv 2021

[15] [15]

William Fedus, Barret Zoph, and Noam M. Shazeer. Switch transformers: Scaling to trillion parameter models with simple and efﬁcient sparsity.ArXiv, abs/2101.03961, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021

[16] [16]

Lyn Frazier and Janet D. Fodor. The sausage machine: A new two-stage parsing model. Cognition, 6(4):291–325, 1978. Place: Netherlands Publisher: Elsevier Science

work page 1978

[17] [17]

The Pile: An 800GB Dataset of Diverse Text for Language Modeling

Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, Shawn Presser, and Connor Leahy. The Pile: An 800gb dataset of diverse text for language modeling. arXiv preprint arXiv:2101.00027, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2020

[18] [18]

A framework for few-shot language model evaluation, September 2021

Leo Gao, Jonathan Tow, Stella Biderman, Sid Black, Anthony DiPoﬁ, Charles Foster, Laurence Gold- ing, Jeffrey Hsu, Kyle McDonell, Niklas Muennighoff, Jason Phang, Laria Reynolds, Eric Tang, Anish Thite, Ben Wang, Kevin Wang, and Andy Zou. A framework for few-shot language model evaluation, September 2021

work page 2021

[19] [19]

Realtoxici- typrompts: Evaluating neural toxic degeneration in language models

Samuel Gehman, Suchin Gururangan, Maarten Sap, Yejin Choi, and Noah A Smith. Realtoxici- typrompts: Evaluating neural toxic degeneration in language models. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 3356–3369, 2020

work page 2020

[20] [20]

Suchin Gururangan, Ana Marasovi ´c, Swabha Swayamdipta, Kyle Lo, Iz Beltagy, Doug Downey, and Noah A. Smith. Don’t stop pretraining: Adapt language models to domains and tasks. In Proceed- ings of the 58th Annual Meeting of the Association for Computational Linguistics , pages 8342–8360, Online, July 2020. Association for Computational Linguistics

work page 2020

[21] [21]

Bowman, and Noah A

Suchin Gururangan, Swabha Swayamdipta, Omer Levy, Roy Schwartz, Samuel R. Bowman, and Noah A. Smith. Annotation artifacts in natural language inference data. In Marilyn A. Walker, Heng Ji, and Amanda Stent, editors, Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologie...

work page 2018

[22] [22]

Pretrained transformers improve out-of-distribution robustness

Dan Hendrycks, Xiaoyuan Liu, Eric Wallace, Adam Dziedzic, Rishabh Krishnan, and Dawn Song. Pretrained transformers improve out-of-distribution robustness. In Dan Jurafsky, Joyce Chai, Natalie Schluter, and Joel R. Tetreault, editors, Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5-10, 2020,...

work page 2020

[23] [23]

Gpipe: Efﬁcient training of giant neural networks using pipeline parallelism

Yanping Huang, Youlong Cheng, Ankur Bapna, Orhan Firat, Dehao Chen, Mia Chen, HyoukJoong Lee, Jiquan Ngiam, Quoc V Le, Yonghui Wu, et al. Gpipe: Efﬁcient training of giant neural networks using pipeline parallelism. Advances in neural information processing systems, 32:103–112, 2019

work page 2019

[24] [24]

Improving machine reading comprehension with single-choice decision and transfer learning

Yufan Jiang, Shuangzhi Wu, Jing Gong, Yahui Cheng, Peng Meng, Weiliang Lin, Zhibo Chen, and Mu Li. Improving machine reading comprehension with single-choice decision and transfer learning. ArXiv, abs/2011.03292, 2020

work page arXiv 2011

[25] [25]

Weld, and Luke Zettlemoyer

Mandar Joshi, Eunsol Choi, Daniel S. Weld, and Luke Zettlemoyer. Triviaqa: A large scale distantly supervised challenge dataset for reading comprehension. In ACL, 2017

work page 2017

[26] [26]

Exploring the Limits of Language Modeling

Rafal J ´ozefowicz, Oriol Vinyals, Mike Schuster, Noam M. Shazeer, and Yonghui Wu. Exploring the limits of language modeling. ArXiv, abs/1602.02410, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016

[27] [27]

Adam: A Method for Stochastic Optimization

Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014

work page internal anchor Pith review Pith/arXiv arXiv 2014

[28] [28]

Gedi: Generative discriminator guided sequence generation, 2021

Ben Krause, Akhilesh Deepak Gotmare, Bryan McCann, Nitish Shirish Keskar, Shaﬁq Joty, richard socher, and Nazneen Rajani. Gedi: Generative discriminator guided sequence generation, 2021

work page 2021

[29] [29]

Dense-captioning events in videos

Ranjay Krishna, Kenji Hata, Frederic Ren, Li Fei-Fei, and Juan Carlos Niebles. Dense-captioning events in videos. 2017 IEEE International Conference on Computer Vision (ICCV) , pages 706–715, 2017

work page 2017

[30] [30]

Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov

Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redﬁeld, Michael Collins, Ankur Parikh, Chris Al- berti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, Kristina Toutanova, Llion Jones, Matthew Kelcey, Ming-Wei Chang, Andrew M. Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov. Nat- ural Questions: A Benchmark for Question Answering Research. Tran...

work page 2019

[31] [31]

RACE: Large-scale ReAd- ing comprehension dataset from examinations

Guokun Lai, Qizhe Xie, Hanxiao Liu, Yiming Yang, and Eduard Hovy. RACE: Large-scale ReAd- ing comprehension dataset from examinations. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 785–794, Copenhagen, Denmark, September 2017. Association for Computational Linguistics

work page 2017

[32] [32]

GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding

Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yan-Ping Huang, Maxim Krikun, Noam M. Shazeer, and Z. Chen. Gshard: Scaling giant models with conditional computation and automatic sharding. ArXiv, abs/2006.16668, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2006

[33] [33]

The Power of Scale for Parameter-Efficient Prompt Tuning

Brian Lester, Rami Al-Rfou, and Noah Constant. The power of scale for parameter-efﬁcient prompt tuning. arXiv preprint arXiv:2104.08691, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021

[34] [34]

Jurassic-1: Technical details and evaluation

Opher Lieber, Or Sharir, Barak Lenz, and Yoav Shoham. Jurassic-1: Technical details and evaluation. 30

work page

[35] [35]

Zhang, Jianwei Zhang, Xu Zou, Zhikang Li, Xiao Qing Deng, Jie Liu, Jinbao Xue, Huiling Zhou, Jianxin Ma, Jin Yu, Yonghong Li, Wei Lin, Jingren Zhou, J ie Tang, and Hongxia Yang

Junyang Lin, Rui Men, An Yang, Chan Zhou, Ming Ding, Yichang Zhang, Peng Wang, Ang Wang, Le Jiang, Xianyan Jia, J. Zhang, Jianwei Zhang, Xu Zou, Zhikang Li, Xiao Qing Deng, Jie Liu, Jinbao Xue, Huiling Zhou, Jianxin Ma, Jin Yu, Yonghong Li, Wei Lin, Jingren Zhou, J ie Tang, and Hongxia Yang. M6: A chinese multimodal pretrainer. ArXiv, abs/2103.00823, 2021

work page arXiv 2021

[36] [36]

M6-10t: A sharing-delinking paradigm for efﬁcient multi-trillion parameter pretraining

Junyang Lin, An Yang, Jinze Bai, Chang Zhou, Le Jiang, Xianyan Jia, Ang Wang, Jie Zhang, Yong Li, Wei Lin, Jingren Zhou, and Hongxia Yang. M6-10t: A sharing-delinking paradigm for efﬁcient multi-trillion parameter pretraining. 2021

work page 2021

[37] [37]

RoBERTa: A Robustly Optimized BERT Pretraining Approach

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1907

[38] [38]

Unicorn on rainbow: A universal commonsense reasoning model on a new multitask benchmark

Nicholas Lourie, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. Unicorn on rainbow: A universal commonsense reasoning model on a new multitask benchmark. In AAAI, 2021

work page 2021

[39] [39]

Black is to criminal as cau- casian is to police: Detecting and removing multiclass bias in word embeddings

Thomas Manzini, Lim Yao Chong, Alan W Black, and Yulia Tsvetkov. Black is to criminal as cau- casian is to police: Detecting and removing multiclass bias in word embeddings. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pa...

work page 2019

[40] [40]

Right for the wrong reasons: Diagnosing syntactic heuris- tics in natural language inference

Tom McCoy, Ellie Pavlick, and Tal Linzen. Right for the wrong reasons: Diagnosing syntactic heuris- tics in natural language inference. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3428–3448, Florence, Italy, July 2019. Association for Computa- tional Linguistics

[41]

Mixed Precision Training

Paulius Micikevicius, Sharan Narang, Jonah Alben, Gregory Diamos, Erich Elsen, David Garcia, Boris Ginsburg, Michael Houston, Oleksii Kuchaiev, Ganesh Venkatesh, et al. Mixed precision training. arXiv preprint arXiv:1710.03740, 2017

[42]

Pipedream: generalized pipeline parallelism for dnn training

Deepak Narayanan, Aaron Harlap, Amar Phanishayee, Vivek Seshadri, Nikhil R Devanur, Gregory R Ganger, Phillip B Gibbons, and Matei Zaharia. Pipedream: generalized pipeline parallelism for dnn training. In Proceedings of the 27th ACM Symposium on Operating Systems Principles, pages 1–15, 2019

[43]

Efficient large-scale language model training on gpu clusters using megatron-lm

Deepak Narayanan, Mohammad Shoeybi, Jared Casper, Patrick LeGresley, Mostofa Patwary, Vijay Anand Korthikanti, Dmitri Vainbrand, Prethvi Kashinkunti, Julie Bernauer, Bryan Catanzaro, Amar Phanishayee, and Matei Zaharia. Efficient large-scale language model training on gpu clusters using megatron-lm. ArXiv, abs/2104.04473, 2021

[44]

Mitigating harm in language models with conditional-likelihood filtration

Helen Ngo, Cooper Raterink, João GM Araújo, Ivan Zhang, Carol Chen, Adrien Morisot, and Nicholas Frosst. Mitigating harm in language models with conditional-likelihood filtration. arXiv preprint arXiv:2108.07790, 2021

[45]

Transformers without tears: Improving the normalization of self-attention

Toan Q. Nguyen and Julian Salazar. Transformers without tears: Improving the normalization of self-attention. CoRR, abs/1910.05895, 2019

[46]

Adversarial nli: A new benchmark for natural language understanding

Yixin Nie, Adina Williams, Emily Dinan, Mohit Bansal, Jason Weston, and Douwe Kiela. Adversarial nli: A new benchmark for natural language understanding. ArXiv, abs/1910.14599, 2020

[47]

Adversarial NLI: A new benchmark for natural language understanding

Yixin Nie, Adina Williams, Emily Dinan, Mohit Bansal, Jason Weston, and Douwe Kiela. Adversarial NLI: A new benchmark for natural language understanding. In Dan Jurafsky, Joyce Chai, Natalie Schluter, and Joel R. Tetreault, editors, Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5-10, 2020, ...

[48]

Asynchronous Pipeline for Processing Huge Corpora on Medium to Low Resource Infrastructures

Pedro Javier Ortiz Suárez, Benoît Sagot, and Laurent Romary. Asynchronous Pipeline for Processing Huge Corpora on Medium to Low Resource Infrastructures. In Piotr Bański, Adrien Barbaresi, Hanno Biber, Evelyn Breiteneder, Simon Clematide, Marc Kupietz, Harald Lüngen, and Caroline Iliadi, editors, 7th Workshop on the Challenges in the Management of...

[49]

The LAMBADA dataset: Word prediction requiring a broad discourse context

Denis Paperno, Germán Kruszewski, Angeliki Lazaridou, Ngoc Quan Pham, Raffaella Bernardi, Sandro Pezzelle, Marco Baroni, Gemma Boleda, and Raquel Fernández. The LAMBADA dataset: Word prediction requiring a broad discourse context. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pa...

[50]

Wic: the word-in-context dataset for evaluating context-sensitive meaning representations

Mohammad Taher Pilehvar and José Camacho-Collados. Wic: the word-in-context dataset for evaluating context-sensitive meaning representations. In NAACL, 2019

[51]

Sentiwordnet

Christopher Potts. Sentiwordnet, 2021

[52]

Language models are unsupervised multitask learners

Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. 2019

[53]

Jack W. Rae, Sebastian Borgeaud, Trevor Cai, Katie Millican, Jordan Hoffmann, Francis Song, John Aslanides, Sarah Henderson, Roman Ring, Susannah Young, Eliza Rutherford, Tom Hennigan, Jacob Menick, Albin Cassirer, Richard Powell, George van den Driessche, Lisa Anne Hendricks, Maribeth Rauh, Po-Sen Huang, Amelia Glaese, Johannes Welbl, Sumanth Dathathri, ...

[54]

Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

Colin Raffel, Noam Shazeer, et al. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. ArXiv, abs/1910.10683, 2019

[55]

Zero: Memory optimizations toward training trillion parameter models

Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He. Zero: Memory optimizations toward training trillion parameter models. In SC20: International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1–16. IEEE, 2020

[56]

Zero-infinity: Breaking the gpu memory wall for extreme scale deep learning

Samyam Rajbhandari, Olatunji Ruwase, Jeff Rasley, Shaden Smith, and Yuxiong He. Zero-infinity: Breaking the gpu memory wall for extreme scale deep learning. arXiv preprint arXiv:2104.07857, 2021

[57]

Deepspeed: System optimizations enable training deep learning models with over 100 billion parameters

Jeff Rasley, Samyam Rajbhandari, Olatunji Ruwase, and Yuxiong He. Deepspeed: System optimizations enable training deep learning models with over 100 billion parameters. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 3505–3506, 2020

[58]

Winogrande: An adversarial winograd schema challenge at scale

Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. Winogrande: An adversarial winograd schema challenge at scale. In AAAI, 2020

[59]

Multitask Prompted Training Enables Zero-Shot Task Generalization

Victor Sanh, Albert Webson, Colin Raffel, Stephen H. Bach, Lintang A. Sutawika, Zaid Alyafeai, Antoine Chaffin, Arnaud Stiegler, Teven Le Scao, Arun Raja, Manan Dey, M Saiful Bari, Canwen Xu, Urmish Thakker, Shanya Sharma Sharma, Eliza Szczechla, Taewoon Kim, Gunjan Chhablani, Nihal V. Nayak, Debajyoti Datta, Jonathan Chang, Mike Tian-Jian Jiang, Han Wang...

[60]

Self-diagnosis and self-debiasing: A proposal for reducing corpus-based bias in nlp

Timo Schick, Sahana Udupa, and Hinrich Schütze. Self-diagnosis and self-debiasing: A proposal for reducing corpus-based bias in nlp. arXiv preprint arXiv:2103.00453, 2021

[61]

Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer

Noam M. Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc V. Le, Geoffrey E. Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. ArXiv, abs/1701.06538, 2017

[62]

The woman worked as a babysitter: On biases in language generation

Emily Sheng, Kai-Wei Chang, Premkumar Natarajan, and Nanyun Peng. The woman worked as a babysitter: On biases in language generation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3407–3412, Hong Kong, China, Novemb...

[63]

Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism

Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. Megatron-lm: Training multi-billion parameter language models using model parallelism. CoRR, abs/1909.08053, 2019

[64]

Robyn Speer. ftfy. Zenodo, 2019. Version 5.5

[65]

DeepSpeed

DeepSpeed Team. DeepSpeed. https://github.com/microsoft/DeepSpeed, 2021

[66]

A simple method for commonsense reasoning

Trieu H. Trinh and Quoc V. Le. A simple method for commonsense reasoning. CoRR, abs/1806.02847, 2018

[67]

Attention Is All You Need

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. CoRR, abs/1706.03762, 2017

[68]

Infobert: Improving robustness of language models from an information theoretic perspective

Boxin Wang, Shuohang Wang, Yu Cheng, Zhe Gan, R. Jia, Bo Li, and Jingjing Liu. Infobert: Improving robustness of language models from an information theoretic perspective. ArXiv, abs/2010.02329, 2021

[69]

Towards zero-label language learning

Zirui Wang, Adams Wei Yu, Orhan Firat, and Yuan Cao. Towards zero-label language learning. ArXiv, abs/2109.09193, 2021

[70]

Finetuned Language Models Are Zero-Shot Learners

Jason Wei, Maarten Bosma, Vincent Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M. Dai, and Quoc V. Le. Finetuned language models are zero-shot learners. ArXiv, abs/2109.01652, 2021

[71]

Ethical and social risks of harm from Language Models

Laura Weidinger, John Mellor, Maribeth Rauh, Conor Griffin, Jonathan Uesato, Po-Sen Huang, Myra Cheng, Mia Glaese, Borja Balle, Atoosa Kasirzadeh, Zac Kenton, Sasha Brown, Will Hawkins, Tom Stepleton, Courtney Biles, Abeba Birhane, Julia Haas, Laura Rimell, Lisa Anne Hendricks, William Isaac, Sean Legassick, Geoffrey Irving, and Iason Gabriel. Ethical and ...

[72]

Challenges in detoxifying language models

Johannes Welbl, Amelia Glaese, Jonathan Uesato, Sumanth Dathathri, John Mellor, Lisa Anne Hendricks, Kirsty Anderson, Pushmeet Kohli, Ben Coppin, and Po-Sen Huang. Challenges in detoxifying language models. arXiv preprint arXiv:2109.07445, 2021

[73]

A broad-coverage challenge corpus for sentence understanding through inference

Adina Williams, Nikita Nangia, and Samuel Bowman. A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1112–1122. Association for Computational Linguis...

[74]

Yuan 1.0: Large-scale pre-trained language model in zero-shot and few-shot learning

Shaohua Wu, Xudong Zhao, Tong Yu, Rongguo Zhang, Chong Shen, Hongli Liu, Feng Li, Hong Zhu, Jiangang Luo, Liang Xu, and Xuanwei Zhang. Yuan 1.0: Large-scale pre-trained language model in zero-shot and few-shot learning. ArXiv, abs/2110.04725, 2021

[75]

Learning and Evaluating General Linguistic Intelligence

Dani Yogatama, Cyprien de Masson d’Autume, Jerome T. Connor, Tomás Kociský, Mike Chrzanowski, Lingpeng Kong, Angeliki Lazaridou, Wang Ling, Lei Yu, Chris Dyer, and Phil Blunsom. Learning and evaluating general linguistic intelligence. CoRR, abs/1901.11373, 2019

[76]

Hellaswag: Can a machine really finish your sentence?

Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. Hellaswag: Can a machine really finish your sentence? In ACL, 2019

[77]

Defending against neural fake news

Rowan Zellers, Ari Holtzman, Hannah Rashkin, Yonatan Bisk, Ali Farhadi, Franziska Roesner, and Yejin Choi. Defending against neural fake news. CoRR, abs/1905.12616, 2019

[78]

Wei Zeng, Xiaozhe Ren, Teng Su, Hui Wang, Yi Liao, Zhiwei Wang, Xin Jiang, ZhenZhang Yang, Kaisheng Wang, Xiaoda Zhang, Chen Li, Ziyan Gong, Yifan Yao, Xinjing Huang, Jun Wang, Jianfeng Yu, Qilong Guo, Yue Yu, Yan Zhang, Jin Wang, Heng Tao, Dasen Yan, Zexuan Yi, Fang Peng, Fan Jiang, Han Zhang, Lingfeng Deng, Yehong Zhang, Zhengping Lin, Chao Zhang, Sha...
